Figure 1 is a screenshot of the Refractor control panel showing the
response time for a short load test that was run against a WebSphere
application server. The top chart plots response time by WebSphere
component for each URL processed, as measured from the HTTP server.
It is displayed as a series of bars, one per unit of time, each showing
the average time spent in each tier during that time slice. In this
picture the bars are one second wide. The Web Server response time is
shown in green, the Java or WebSphere Application Server (WAS) response
time in blue, and the database (DB2) response time in red. Response time
alone isn't a complete characterization of the behavior of the
application, so the bottom half of the Refractor panel plots the
throughput in requests per second at the WAS.
On the left-hand side of the panel we see a detailed listing of the
processes in each tier. This is the workload that was automatically
identified by the Refractor, so the analyst doesn't have to know which
PIDs were associated with which function.
Figure 1
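The per-slice averaging behind the bars in the top chart can be sketched as follows. This is only an illustration of the idea, not Refractor's implementation: the record layout, the tier names and the function name average_by_slice are all assumptions, and the sample values are made up.

```python
from collections import defaultdict

def average_by_slice(records, slice_seconds=1.0):
    """Average per-tier response time within each time slice.

    records: iterable of (timestamp_seconds, {tier_name: milliseconds}).
    Returns {slice_index: {tier_name: average_milliseconds}}.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for timestamp, tier_times in records:
        bucket = int(timestamp // slice_seconds)  # which one-second bar
        counts[bucket] += 1
        for tier, ms in tier_times.items():
            sums[bucket][tier] += ms
    return {
        bucket: {tier: total / counts[bucket] for tier, total in tiers.items()}
        for bucket, tiers in sorted(sums.items())
    }

# Hypothetical request records, not data from the load test.
records = [
    (0.2, {"web": 2, "was": 10, "db2": 4}),
    (0.7, {"web": 4, "was": 14, "db2": 6}),
    (1.1, {"web": 3, "was": 12, "db2": 5}),
]
print(average_by_slice(records))
```

Each inner dictionary corresponds to one stacked bar: the tiers are its colored segments and the values are their heights.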
To better understand the behavior of a tier in the application,
we need to look at how its response time is broken down. We also want to
identify whether a given tier is scaling and, if not, where the lack of
scalability is occurring. In Figure 2 we have plotted data specific to Tier 2.
Figure 2
The left-hand side of the chart panel shows two charts that address the
application scalability as a function of throughput. The upper left chart
is the standard scatter plot of throughput vs. response time. In a load
test, the response time will remain fairly constant until some queuing
begins to occur in the application. Once queuing starts, response time
rises sharply while throughput increases only marginally. This is seen
in the WAS at about 200 requests per second. Any attempt to increase
throughput above 200 requests per second results in substantially higher
response times.
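One way to locate that knee from the scatter data is sketched below. It assumes the samples are available in time order as the load ramps up, takes the early samples as the low-load baseline, and reports the throughput at which response time first exceeds a chosen multiple of that baseline. The function name find_knee, the factor of 2 and the sample values are illustrative assumptions, not measurements from the test.

```python
from statistics import median

def find_knee(samples, factor=2.0):
    """Return the throughput at which response time first exceeds
    factor x the low-load baseline, or None if it never does.

    samples: list of (throughput_rps, response_time_ms) in time order.
    """
    # Baseline: median response time over the first quarter of the ramp.
    quarter = max(1, len(samples) // 4)
    baseline = median(rt for _, rt in samples[:quarter])
    for rps, rt in samples:
        if rt > factor * baseline:
            return rps
    return None

# Illustrative ramp-up samples only.
samples = [(20, 10), (50, 10), (100, 11), (150, 12),
           (190, 14), (200, 25), (195, 60), (180, 120)]
print(find_knee(samples))  # 200
```

Walking the samples in time order rather than sorting by throughput keeps saturated points, where throughput has fallen back, from being mistaken for low-load points.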
The charts on the right-hand side of the chart panel are plotted against
time and show throughput vs. time and the components of the WAS response
time vs. time. In the bottom right we see that throughput increases as the
load is ramped up for the first 100s of the load test. The main components
of WAS response time for the first minute are CPU usage and some client
latency, which is just the time it takes to access DB2 for supporting
functions in the application.
After the first minute, the DB2 response time increases substantially. At
about 150s a new component of WAS response time appears: Server
Serialization, or queuing. This increases dramatically between
150s and 200s. During this time frame we can also see that the throughput
of the WAS actually drops. That is typical of saturation in a server and
can be seen in the fold-back pattern of the upper left scatter plot.
The black line shown is an annotation that we have made to highlight the
progression of throughput vs. response time. Response time more than
doubles, from 10ms to 25ms, as the throughput increases from 0 to 200
requests per second. This is also clear in the charts of throughput and
response time vs. time from 0 to 150s. Above 200 requests per second,
response time climbs steeply and throughput actually drops. This leads to
the foldback curve annotated on the chart.
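The foldback pattern can also be checked numerically. The sketch below assumes time-ordered (throughput, response time) samples, finds the sample with peak throughput, and returns the later samples that have both lower throughput and higher response time than that peak. The function name foldback_points and the sample values are hypothetical.

```python
def foldback_points(samples):
    """Return samples taken after peak throughput that show both lower
    throughput and higher response time than the peak sample.

    samples: list of (throughput_rps, response_time_ms) in time order.
    """
    peak_idx = max(range(len(samples)), key=lambda i: samples[i][0])
    peak_rps, peak_rt = samples[peak_idx]
    return [(rps, rt) for rps, rt in samples[peak_idx + 1:]
            if rps < peak_rps and rt > peak_rt]

# Illustrative values only, not measurements from the load test.
samples = [(50, 10), (120, 12), (200, 25), (190, 80), (170, 150)]
print(foldback_points(samples))  # [(190, 80), (170, 150)]
```

An empty result means the server never folded back over the sampled interval. A non-empty one lists the points that would lie on the upper, returning arm of the curve.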
The second indicator of scalability is the Throughput vs. CPU chart shown
for the WAS. This is a scatter plot that shows the percent of a CPU used
vs. throughput by the WAS. To understand this, consider that if a single
request costs 0.5s of CPU time then 2 requests per second should consume
1 CPU, 4 requests per second should consume 2 CPUs, and so on. Therefore,
if the cost per request is consistent across the range of throughputs,
this plot should be very close to linear, which it is. A chart that tapers
down would indicate that there are inefficiencies at higher load. In this
plot, there are two outlier points, which are associated with a garbage
collection interval that can be seen in the throughput chart at around 120s.
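The arithmetic behind the linearity argument can be written out as a short sketch. The first function reproduces the example above (0.5s of CPU per request). The second estimates the cost per request from scatter data as a least-squares slope through the origin, which is one reasonable choice among several. The function names and sample points are assumptions made for illustration.

```python
def cpus_needed(requests_per_second, cpu_seconds_per_request):
    """CPUs consumed at a given throughput and CPU cost per request."""
    return requests_per_second * cpu_seconds_per_request

def cost_per_request(samples):
    """Least-squares slope through the origin of CPUs used vs. throughput.

    samples: iterable of (throughput_rps, cpus_used).
    """
    samples = list(samples)
    numerator = sum(rps * cpus for rps, cpus in samples)
    denominator = sum(rps * rps for rps, _ in samples)
    return numerator / denominator

print(cpus_needed(2, 0.5))  # 1.0
print(cpus_needed(4, 0.5))  # 2.0

# Illustrative points lying exactly on a line, not measured data.
print(cost_per_request([(2, 1.0), (4, 2.0), (6, 3.0)]))  # 0.5
```

If the cost per request is constant, every sample falls close to the line with this slope. Points well off the line, such as those from a garbage collection interval, stand out as outliers.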
The source of the increased response time is both internal queuing and
increased response time from the database server.