Top 10 Causes of Java EE Enterprise Performance Problems

Performance problems are one of the biggest challenges to expect when designing and implementing Java EE related technologies. These common problems can surface in both lightweight and large IT environments, which typically include several distributed systems, from Web portals and ordering applications to enterprise service bus (ESB), data warehouse, and legacy Mainframe storage systems.

It is very important for IT architects and Java EE developers to understand their client environments and ensure that the proposed solutions will not only meet their growing business needs but also deliver a long-term, scalable, and reliable production IT environment, at the lowest cost possible. Performance problems can disrupt your client's business, resulting in short- and long-term loss of revenue.

This article will consolidate and share the top 10 causes of Java EE performance problems I have encountered working with IT & Telecom clients over the last 10 years, along with high-level recommendations.

Please note that this article is in-depth, but I'm confident that this substantial read will be worth your time.

#1 - Lack of proper capacity planning

I'm confident that many of you can identify episodes of performance problems following Java EE project deployments. Some of these performance problems could have a very specific and technical explanation but are often symptoms of gaps in the current capacity planning of the production environment.

Capacity planning can be defined as a comprehensive and evolving process of measuring and predicting the current and future required IT environment capacity. A properly implemented capacity planning process will not only keep track of current IT production capacity and stability but also ensure that new projects can be deployed with minimal risk in the existing production environment. Such an exercise can also conclude that extra capacity (hardware, middleware, JVM, tuning, etc.) is required prior to project deployment.

In my experience, this is often the most common "process" problem that can lead to short- and long-term performance problems. The following are some examples.

Problem observed: A newly deployed application triggers an overload of the current Java Heap or Native Heap space (e.g., java.lang.OutOfMemoryError is observed).

Possible capacity planning gaps:
- Lack of understanding of the current JVM Java Heap (YoungGen and OldGen spaces) utilization
- Lack of static and / or dynamic memory footprint calculation for the newly deployed application
- Lack of performance and load testing, preventing detection of problems such as a Java Heap memory leak

Problem observed: A newly deployed application triggers a significant increase in CPU utilization and performance degradation of the Java EE middleware JVM processes.

Possible capacity planning gaps:
- Lack of understanding of the current CPU utilization (e.g., no established baseline)
- Lack of understanding of the current JVM garbage collection health (a new application / extra load can trigger increased GC and CPU)
- Lack of load and performance testing, failing to predict the impact on existing CPU utilization

Problem observed: A new Java EE middleware system is deployed to production but is unable to handle the anticipated volume.

Possible capacity planning gaps:
- Missing or inadequate performance and load testing
- Data and test cases used in performance and load testing not reflecting real-world traffic and business processes
- Not enough bandwidth (or pages much bigger than capacity planning anticipated)

One key aspect of capacity planning is load and performance testing, which everybody should be familiar with. This involves generating load against a production-like environment, or the production environment itself, in order to validate that the platform can sustain the anticipated volumes and to expose problems such as memory leaks before they reach your users.

There are several technologies out there that allow you to achieve these goals. Some load-testing products allow you to generate load from a test lab inside your network, while other emerging technologies allow you to generate load from the "Cloud".

I'm currently exploring the free version of Load Tester, a load testing tool that allows you to record test cases and generate load from inside your network or from the Cloud.

Regardless of the load and performance testing tool that you decide to use, this exercise should be done on a regular basis for any dynamic Java EE environments and as part of a comprehensive and adaptive capacity planning process. When done properly, capacity planning will help increase the service availability of your client IT environment.

#2 - Inadequate Java EE middleware environment specifications

The second most common cause of performance problems I have observed for Java EE enterprise systems is an inadequate Java EE middleware environment and / or infrastructure. Not making the proper decisions at the beginning of a new platform can result in major stability problems and increased costs for your client in the long term. For that reason, it is important to spend enough time brainstorming on the required Java EE middleware specifications. This exercise should be combined with an initial capacity planning iteration, since the business processes, expected traffic, and application(s) footprint will ultimately dictate the initial IT environment capacity requirements.

Now, find below a typical example of a problem I have observed in my past experience:

Trying to leverage a single middleware and / or JVM for many large Java EE applications can be quite attractive from a cost perspective. However, this can result in an operational nightmare and severe performance problems, such as excessive JVM garbage collection and many domino effect scenarios (e.g., Stuck Threads) with high business impact (e.g., App A causing App B, App C, and App D to go down, because a full JVM restart is often required to resolve such problems).

Recommendations

#3 - Excessive Java VM garbage collections

Now let's jump to pure technical problems starting with excessive JVM garbage collection. Most of you are familiar with this famous (or infamous) Java error: java.lang.OutOfMemoryError. This is the result of JVM memory space depletion (Java Heap, Native Heap, etc.).

I'm sure middleware vendors such as Oracle and IBM handle dozens and dozens of support cases involving JVM OutOfMemoryError problems on a regular basis, so it is no surprise that this problem takes the #3 spot on our list.

Keep in mind that a garbage collection problem will not necessarily manifest itself as an OOM condition. Excessive garbage collection can be defined as an excessive number of minor and / or major collections performed by the JVM GC Threads (collectors) in a short amount of time, leading to high JVM pause time and performance degradation. There are many possible causes, ranging from an undersized Java Heap to application memory leaks and excessive data caching.

Before pointing a finger at the JVM, keep in mind that the actual "root" cause can be related to our #1 & #2 causes. An overloaded middleware environment will generate many symptoms, including excessive JVM garbage collection.

Proper analysis of your JVM related data (memory spaces, GC frequency, CPU correlation, etc.) will allow you to determine whether you are facing a problem or not. A deeper level of analysis to understand your application memory footprint will require you to analyze JVM heap dumps and / or profile your application with a profiler tool of your choice (such as JProfiler).
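To get a quick first read on GC frequency and pause-time totals before diving into full GC logs or heap dumps, you can query the JVM's own management beans at runtime. The class name and output format below are illustrative, not from any specific product:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class GcSnapshot {
    // Returns a formatted view of each collector's cumulative activity,
    // a useful first data point when investigating excessive GC.
    public static String snapshot() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(": collections=").append(gc.getCollectionCount())
              .append(", totalTimeMs=").append(gc.getCollectionTime())
              .append('\n');
        }
        // Current Java Heap utilization, to correlate with GC activity
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        sb.append("heapUsed=").append(heap.getUsed())
          .append(", heapMax=").append(heap.getMax());
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}
```

Sampling this output periodically (e.g., every minute) gives you the collection frequency baseline that capacity planning and troubleshooting both rely on.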

Recommendation

#4 - Too many or poor integrations with external systems

The next common cause of bad Java EE performance applies mainly to highly distributed systems, typical of Telecom IT environments. In such environments, a middleware domain (e.g., Service Bus) will rarely do all the work but rather "delegate" some of the business processes, such as product qualification, customer profile, and order management, to other Java EE middleware platforms or legacy systems such as Mainframe, via various payload types and communication protocols.

Such external system calls mean that the client Java EE application will trigger the creation or reuse of Socket Connections to write and read data to/from external systems across a private network. Some of these calls can be configured as synchronous or asynchronous depending on the implementation and business process nature. It is important to note that the response time can change over time depending on the health of the external systems, so it is very important to shield your Java EE application and middleware via proper use of timeouts.


Major problems and performance slowdowns can be observed in several scenarios, most commonly when a slowdown of an external system is combined with a lack of proper timeout protection on the Java EE application side.

Finally, I also recommend that you spend adequate time performing negative testing. This means that problem conditions should be "artificially" introduced to the external systems in order to test how your application and middleware environment handle failures of those external systems. This exercise should also be performed under a high-volume situation, allowing you to fine-tune the different timeout values between your applications and external systems.
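Negative testing of this kind can be automated. The sketch below (class name and values are mine, not from any specific client environment) uses a local server that accepts a connection but never responds, simulating a hung external system, and verifies that a read timeout actually fires instead of leaving the calling Thread stuck:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class TimeoutDemo {
    // Connects with an explicit connection timeout, then sets a read (SO_TIMEOUT)
    // transaction timeout so a silent peer cannot hang the calling Thread forever.
    public static boolean readTimedOut(int port) throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress("127.0.0.1", port), 2000); // connection timeout
            socket.setSoTimeout(200); // read (transaction) timeout in ms
            try {
                socket.getInputStream().read(); // peer never writes, so this blocks
                return false;
            } catch (SocketTimeoutException expected) {
                return true; // the timeout protected us from an indefinite wait
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // A local server that accepts the connection but never sends data,
        // "artificially" introducing the problem condition for negative testing.
        try (ServerSocket server = new ServerSocket(0)) {
            System.out.println(readTimedOut(server.getLocalPort())); // prints true
        }
    }
}
```

The same exercise under high volume is what lets you fine-tune realistic timeout values between your applications and external systems.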

#5 - Lack of proper database SQL tuning & capacity planning

The next common performance problem should not be a surprise to anybody: database issues. Most Java EE enterprise systems rely on relational databases for various business processes, from portal content management to order provisioning systems. A solid database environment and foundation will ensure that your IT environment scales properly to support your client's growing business.

In my production support experience, database-related performance problems are very common. Since most database transactions are typically executed via JDBC Datasources (including relational persistence APIs such as Hibernate), performance problems will initially manifest as Stuck Threads in your Java EE container Thread manager. Over the last 10 years I have seen many recurring database-related problems, most rooted in a lack of SQL tuning and database capacity planning.

*Note that the Oracle database is used as an example since it is a common product among my IT clients.*

Recommendations

#6 - Application specific performance problems

To recap, so far we have seen the importance of proper capacity planning, load and performance testing, middleware environment specifications, JVM health, external systems integration, and the relational database environment. But what about the Java EE application itself? After all, your IT environment could have the fastest hardware on the market with hundreds of CPU cores, a large amount of RAM, and dozens of 64-bit JVM processes; but performance can still be terrible if the application implementation is deficient. This section will focus on the most severe Java EE application problems I have been exposed to across various Java EE environments.

My primary recommendation is to ensure that code reviews are part of your regular development cycle and release management process. This will allow you to pinpoint major implementation problems, such as those below, prior to major testing and implementation phases.

Thread safe code problems

Proper care is required when using Java synchronization and non-final static variables / objects. In a Java EE environment, any static variable or object must be Thread safe to ensure data integrity and predictable results. Wrongly declaring a Java class member variable as static can lead to unpredictable results under load, since such variables / objects are shared between Java EE container Threads (e.g., Thread B can modify a static variable relied on by Thread A, causing unexpected and wrong behavior). A class member variable should be declared non-static so that it remains in the scope of the current class instance and each Thread has its own copy.

Java synchronization is also quite important when dealing with non-Thread safe data structure such as a java.util.HashMap. Failure to do so can trigger HashMap corruption and infinite looping. Be careful when dealing with Java synchronization since excessive usage can also lead to stuck Threads and poor performance.
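As a minimal illustration of the above (class and key names are mine), keeping shared static state in a ConcurrentHashMap of AtomicInteger counters gives each operation atomic semantics, so concurrent container Threads never corrupt the structure or lose updates:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadSafeCounter {
    // A shared static structure must itself be Thread safe: ConcurrentHashMap
    // instead of HashMap, AtomicInteger instead of a bare static int.
    private static final Map<String, AtomicInteger> COUNTERS = new ConcurrentHashMap<>();

    public static int increment(String key) {
        // computeIfAbsent is atomic, so two Threads never race to create the entry
        return COUNTERS.computeIfAbsent(key, k -> new AtomicInteger()).incrementAndGet();
    }

    public static int get(String key) {
        AtomicInteger c = COUNTERS.get(key);
        return c == null ? 0 : c.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // 10 Threads x 1000 increments each: the total is exact, no lost updates
        Thread[] workers = new Thread[10];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) increment("requests");
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        System.out.println(get("requests")); // prints 10000
    }
}
```

With a plain static HashMap and a static int, the same test could lose updates or corrupt the map under load, which is exactly the class of bug code reviews should catch.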

Lack of communication API timeouts

It is very important to implement and test transaction timeouts (Socket read() and write() operations) and connection timeouts (Socket connect() operation) for every communication API. Lack of proper HTTP / HTTPS / TCP timeouts between the Java EE application and external system(s) can lead to severe performance degradation and outages due to stuck Threads. Proper timeout implementation will prevent Threads from waiting too long in the event of a major slowdown of your downstream systems.

Below are some examples for some older and current APIs (Apache & Weblogic):

Communication API: commons-httpclient 3.0.1
Vendor: Apache
Protocol: HTTP/HTTPS
Timeout code snippet:

HttpConnectionManagerParams params = httpClient.getHttpConnectionManager().getParams();
params.setSoTimeout(txTimeout);           // Transaction timeout
params.setConnectionTimeout(connTimeout); // Connection timeout

Communication API: axis.jar (v1.4 1855)
Vendor: Apache
Protocol: WS via HTTP/HTTPS
Timeout code snippet:

((org.apache.axis.client.Stub) port).setTimeout(timeoutMilliseconds); // Transaction & connection timeout

*Please note that version 1.x of AXIS is exposed to a known problem with SSL Socket creation which ignores the specified timeout value. The solution is to override the client-config.wsdd and set up the HTTPS transport as <transport name="https" pivot="java:org.apache.axis.transport.http.CommonsHTTPSender"/>*

Communication API: WLS103 (old JAX-RPC)
Vendor: Oracle
Protocol: WS via HTTP/HTTPS
Timeout code snippet:

((Stub) servicePort)._setProperty("weblogic.webservice.rpc.timeoutsecs", timeoutSecs); // Transaction & connection timeout

Communication API: WLS103 (JAX-RPC 1.1)
Vendor: Oracle
Protocol: WS via HTTP/HTTPS
Timeout code snippet:

((Stub) servicePort)._setProperty("weblogic.wsee.transport.read.timeout", timeoutMills);       // Transaction timeout
((Stub) servicePort)._setProperty("weblogic.wsee.transport.connection.timeout", timeoutMills); // Connection timeout

I/O, JDBC, and relational persistence API resource management problems

Proper coding best practices are important when implementing a raw DAO layer or using relational persistence APIs such as Hibernate. The goal is to ensure proper Session / Connection resource closure. Such JDBC related resources must be closed in a finally {} block to properly handle any failure scenario. Failure to do so can lead to a JDBC Connection Pool leak and, eventually, stuck Threads and a full outage scenario.

The same rule applies to I/O resources such as an InputStream. When no longer used, proper closure is required; otherwise, it can lead to a Socket / File Descriptor leak and a full JVM hang.
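A minimal sketch of the pattern, using try-with-resources, the modern equivalent of closure in a finally {} block (class and file names are illustrative; the same structure applies to JDBC Connection / Statement / ResultSet objects):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ResourceDemo {
    // try-with-resources guarantees close() on every exit path, whether the
    // method returns normally or throws, so no file descriptor can leak here.
    public static String readFirstLine(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            return reader.readLine();
        } // reader (and its underlying file descriptor) is closed here, always
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        try (BufferedWriter writer = Files.newBufferedWriter(tmp)) {
            writer.write("no descriptor leak");
        }
        System.out.println(readFirstLine(tmp)); // prints "no descriptor leak"
        Files.delete(tmp);
    }
}
```

Code written before Java 7 must achieve the same guarantee manually, by closing each resource in a finally {} block with its own null check.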

Lack of proper data caching

Performance problems can be the result of repetitive and excessive computing tasks, such as I/O / disk access and repeated retrieval of content or customer-related data from a relational database. Static data with a reasonable memory footprint should be cached properly, either in the Java Heap memory or via a data cache system.

Static files such as property files should also be cached to prevent excessive disk access. Simple caching strategies can have a very positive impact on your Java EE application performance.
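As a minimal illustration of such a strategy (class and key names are mine), a ConcurrentHashMap-based cache guarantees that the expensive load, be it disk access or a database query, runs at most once per key:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class SimpleCache<K, V> {
    private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;
    final AtomicInteger loads = new AtomicInteger(); // exposed only to demonstrate cache hits

    public SimpleCache(Function<K, V> loader) {
        this.loader = loader;
    }

    // The expensive load (disk, database, ...) runs at most once per key;
    // subsequent lookups are served from the Java Heap.
    public V get(K key) {
        return cache.computeIfAbsent(key, k -> {
            loads.incrementAndGet();
            return loader.apply(k);
        });
    }

    public static void main(String[] args) {
        SimpleCache<String, String> props =
                new SimpleCache<>(k -> "value-of-" + k); // stands in for a property file read
        props.get("db.url");
        props.get("db.url");
        props.get("db.url");
        System.out.println(props.loads.get()); // prints 1: loaded once, served twice from cache
    }
}
```

Keep the cached footprint modest and measured via your capacity planning process; as discussed below, over-caching creates its own garbage collection problems.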

Data caching is also important when dealing with Web Services and XML-related APIs. Such APIs can generate excessive dynamic Class loading and I/O / disk access. Make sure that you follow such API best practices and use proper caching strategies (Singleton, etc.) when applicable. I suggest you read JAXB Case Study on that subject.

Excessive data caching

Ironically, while data caching is crucial for proper performance, it can also be responsible for major performance problems. Why? Well, if you attempt to cache too much data on the Java Heap, then you will be struggling with excessive garbage collections and OutOfMemoryError conditions. The goal is to find a proper balance (via your capacity planning process) between data caching, Java Heap size, and available hardware capacity.

Here is one example of a problem case from one of my IT clients:

The important point to remember from this story is that when too much data caching is required to achieve proper performance level, it is time to review the overall solution and design.

Excessive logging

Last but not least: excessive logging. It is good practice to ensure proper logging within your Java EE application implementation. However, be careful with the logging level that you enable in your production environment. Excessive logging will trigger high I/O on your server and increase CPU utilization. This can be especially problematic for environments using older hardware or dealing with very heavy concurrent volumes. I also recommend implementing a "reloadable" logging level facility so extra logging can be turned ON / OFF when required during day-to-day production support.
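One simple way to sketch such a reloadable facility with the standard java.util.logging API is shown below (the logger name and the idea of wiring setDebug to an admin hook are illustrative assumptions, not a specific product feature):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class ReloadableLogging {
    private static final Logger LOG = Logger.getLogger("app");

    // Guarding verbose log statements avoids paying the message-building
    // and I/O cost when the level is disabled in production.
    public static boolean wouldLogDebug() {
        return LOG.isLoggable(Level.FINE);
    }

    // Called from an admin hook (JMX operation, admin servlet, ...) to turn
    // extra logging ON / OFF at runtime, without restarting the JVM.
    public static void setDebug(boolean on) {
        LOG.setLevel(on ? Level.FINE : Level.INFO);
    }

    public static void main(String[] args) {
        setDebug(false);
        System.out.println(wouldLogDebug()); // prints false
        setDebug(true);
        System.out.println(wouldLogDebug()); // prints true
    }
}
```

Most logging frameworks (Log4j, Logback, etc.) offer equivalent runtime level changes; the point is to expose the switch operationally so production support can use it on demand.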

#7 - Java EE middleware tuning problems

It is important to realize that your Java EE middleware specifications may be adequate but may lack proper tuning. Most Java EE containers available today provide you with multiple tuning opportunities depending on your application and business process needs.

Failure to implement proper tuning and best practices can put your Java EE container in a non-optimal state. I highly recommend that you review and implement proper Java EE middleware vendor recommendations when applicable.

Find below a high-level view and sample checklist of what to look for.


#8 - Insufficient proactive monitoring

Lack of monitoring does not actually "cause" performance problems, but it can prevent you from understanding the Java EE platform's capacity and health. Eventually, the environment can reach a break point, exposing several gaps and problems (JVM memory leaks, etc.). From my experience, it is much harder to stabilize an environment after months or years of operation than it is to have proper monitoring, tools, and processes implemented from day one.

That being said, it is never too late to improve an existing environment. Monitoring can be implemented fairly easily. My recommendations follow.

#9 - Saturated hardware on common infrastructure

Another common source of performance problems is hardware saturation. This problem is often observed when too many Java EE middleware environments, along with their JVM processes, are deployed on existing hardware. Too many JVM processes relative to the number of available physical CPU cores can be a real problem, killing your application performance. Again, your capacity planning process should also take care of hardware capacity as your client's business grows.

My primary recommendation is to look at hardware virtualization. Such an approach is quite common these days and has quite a few benefits, such as fewer physical servers, a smaller data center, dedicated physical resources per virtual host, faster implementation, and reduced costs for your client. Dedicated physical resources per virtual host are quite important, since the last thing you want is one Java EE container bringing down all the others due to excessive CPU utilization.


#10 - Network latency problems

Our last source of performance problems is the network. Major network problems can happen from time to time, such as router, switch, and DNS server failures. However, the more common problems observed are typically due to regular or intermittent latency in a highly distributed IT environment. A typical example is a Weblogic cluster spread across two geographic regions while communicating with an Oracle database server located in one geographic region only: the remote region pays a network latency penalty on every database round trip.


Intermittent or regular latency problems can definitely trigger some major performance problems and affect your Java EE application in different ways.

Tuning strategies such as JDBC row data "prefetch", XML data compression, and data caching can help mitigate network latency. But such latency problems should be reviewed closely when first designing the network topology of a new IT environment.

I hope this article has helped you understand some of the common performance problems and pressure points you can face when developing and supporting Java EE production systems. Since each IT environment is unique, I do not expect that everybody will face the exact same problems. As such, I invite you to post your comments and share your views on the subject.
