Sunday, November 6, 2011

How to deal with application-induced system instability?

Most WebSphere instability is caused by defects in application code. However, as WebSphere system engineers, we still need to help stabilise the WebSphere system. There are three areas where we can help.

  1. Take appropriate measures to achieve relative stability and protect the customer experience
  2. Work with the application team to isolate and fix the defects
  3. Use engineering processes to prevent system instability

Relative stability is achievable if you take the right approach. For example, for a high-traffic system under heavy load, adding JVM instances is quite often the shortest path to relative stability and can significantly improve the customer experience. For a slow memory leak, scheduled recycling of the servers is very effective in achieving a level of stability.

Before you create more JVM instances, consider the following.

  1. Does the application support vertical clustering?
  2. Does the application support horizontal clustering?
  3. Does the application have only limited clustering support?

Some applications do not support vertical clustering. Some do not support horizontal clustering. Some do not support clustering at all. Most interestingly, some applications have only limited clustering support because of their design strategy: for example, every JVM instance holds a cache that depends on frequent interaction with the backend to function. These frequent interactions with the database may cause database contention, especially under heavy traffic.
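
The contention risk can be estimated with some back-of-the-envelope arithmetic. The sketch below assumes that every JVM instance refreshes its own cache independently and on the same schedule, so backend load from cache refreshes grows linearly with the instance count. The function name and all the numbers are illustrative assumptions, not measurements from a real system.

```python
def backend_queries_per_second(jvm_instances, refreshes_per_minute, queries_per_refresh):
    """Estimate database load generated purely by cache refreshes.

    Simplifying assumption: each instance refreshes independently,
    at the same rate, and issues the same number of queries per refresh.
    """
    return jvm_instances * refreshes_per_minute * queries_per_refresh / 60.0


if __name__ == "__main__":
    # Hypothetical figures: 6 refreshes per minute, 50 queries per refresh.
    for instances in (4, 8, 16):
        load = backend_queries_per_second(
            instances, refreshes_per_minute=6, queries_per_refresh=50
        )
        print(f"{instances} instances: {load:.0f} queries/s from cache refreshes")
```

If the estimate for the planned instance count is already close to what the database can absorb, adding JVMs is likely to move the bottleneck to the backend rather than remove it.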

For a slow memory leak, increasing the heap size, switching to a 64-bit system, and recycling more frequently can help maintain a level of stability and buy time to fix the bugs causing the leak.
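
One way to choose a recycle interval is to estimate the leak rate from heap usage observed after garbage collection. The sketch below is a minimal illustration, assuming you have already collected (hours elapsed, heap used in MB) samples from verbose GC logs or a monitoring tool. The function names, the 85% threshold, and the sample values are all hypothetical, and a straight-line fit is only a rough model of a real leak.

```python
def leak_rate_mb_per_hour(samples):
    """Least-squares slope of (hours, heap_used_after_gc_mb) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_h = sum(h for _, h in samples) / n
    num = sum((t - mean_t) * (h - mean_h) for t, h in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must cover more than one point in time")
    return num / den


def hours_until_threshold(samples, max_heap_mb, threshold=0.85):
    """Hours from the last sample until usage reaches threshold * max heap.

    Returns None when the samples show no upward trend.
    """
    rate = leak_rate_mb_per_hour(samples)
    if rate <= 0:
        return None
    remaining_mb = threshold * max_heap_mb - samples[-1][1]
    return max(remaining_mb, 0.0) / rate


if __name__ == "__main__":
    # Hypothetical samples: heap after GC grows by 60 MB every 6 hours.
    samples = [(0, 400), (6, 460), (12, 520), (18, 580)]
    for max_heap in (1024, 2048):
        hours = hours_until_threshold(samples, max_heap_mb=max_heap)
        print(f"max heap {max_heap} MB: about {hours:.0f} hours of headroom")
```

Scheduling the recycle comfortably inside the estimated headroom gives a safety margin, and rerunning the estimate with a larger maximum heap shows how much time a heap increase would buy.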

Work closely with the application team, stay away from finger-pointing, build a good working relationship with all engineering and application teams, and proactively produce dumps and share logs. Teamwork and collaboration will help you isolate the defects and fix them.

Be very careful when making changes. Design and implement audit and peer review processes. Diligent and careful system engineering processes help prevent system instability from occurring or recurring.
