Capacity Planners Benefit From Extending Testing Onto The Live Systems

Date: 15th June 2012
Author: Deri Jones

A number of interesting things are happening at the moment where ecommerce capacity planning crosses over with web monitoring and testing, as capacity models increasingly need more realistic data behind them.

This month we heard from a friendly team of Capacity Planners in the middle of some major load tests, who had a couple of very interesting tips and tricks to get the most out of their project.

Some Capacity Planners rely on internal load testing teams, perhaps from a separate IT department where the (expensive?!) test software contracts are held. Others outsource, and some run load tests from within their own team.

In this case it was one of our teams performing the load tests, but the noteworthy gain came from what went on in parallel.

The project had required load testing of a more realistic nature than previously, after a less than excellent performance through the retailer’s Christmas season.

Alongside the load tests, some of the realistic dynamic journeys that had been constructed were also set up to run in 24/7 mode as monitoring User Journeys. These were initially intended to run just during the busy load test nights, to provide an extra metric on how the user experience was being affected by load. Some of the business team had expressed specific interest in this, because the business impact of performance was an issue. That was partly due to the opportunity cost of lost sales when performance was poor, and partly because there had been questions about the impact of certain 3rd-party suppliers that were providing key parts of certain pages.

As it turned out, the metrics from these 24/7 dynamic journeys provided data of unique value during the project. They came to the rescue at a tricky time, when it looked as though the system just could not deliver what was needed and there were heated discussions about changing launch dates.

Heavy Load

The load testing was planned in two phases. The initial round was to be done as soon as the new code base and system were in a fit state, to provide quick feedback on overall system capacity. The second phase, to be the main proof, was to run on the finished system just prior to it going live (with a little slack planned for small fixes).

All testing was on the fully sized hardware and infrastructure, which avoided the inaccuracies that occur when testing a scaled-down system and trying to multiply up to give ‘equivalent figures’ for the real infrastructure.

Phase I kicked off well enough and the initial pre-test rounds went smoothly. These pre-tests run all the test scripts at lower volume. Firstly, they prove that the scripts themselves don’t have any corner cases that throw errors, for example where the website has quite different content for different product categories or sub-categories. They also prove that the site under test doesn’t throw any corner-case errors of its own. It can happen that the site build has inadvertently missed a whole category of products that didn’t get copied into the database, or that code key to certain pages, perhaps delivered by AJAX or a web service, has been built with missing or buggy libraries.

But the next day problems struck. As the simulated traffic levels were ramped up, a capacity ceiling was hit early on that was much lower than expected: only a third of the expected level, in fact.

Initial suspicions lay with one of the settings on an intrusion prevention security component in the infrastructure, as that vendor’s kit had proven to be a problem some months back. 24 hours were spent getting the right people in to take a look and make changes. Sadly, that made no difference.

Attention shifted to some code base issues, and their impact on how database queries were being called and cached, and whether some of the clever caching code was actually introducing more waiting time through subtle locking issues.

So a couple of A/B test rounds were done using code bases with small changes in that area. One test was against a version in which a library had been rewritten overnight to disable a certain part of the caching, to see if that could identify a smoking gun.

Time on the project was ticking by. Phase I time had run out, the final code base had been made available and Phase II was in progress when the dynamic user journey monitoring provided the input that resolved the growing concerns.

The 24/7 monitoring journeys had shown the impact of the various changes made throughout the week, clearly showing some of the increases in journey delivery time that occurred as various code base features were disabled.

But it was a surprise when the Journeys showed a step-change one afternoon between load test runs when none of the infrastructure was supposed to be changing.

The monitoring showed a clear increase in step delivery time for one particular step, along with a high error rate. Alerting had been configured on these monitoring journeys, so SMS messages went out and triggered attention from the team.
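A step change of this kind can be picked out of a series of step delivery times quite simply. The following is a minimal sketch only, not the monitoring product's actual method: the function name `find_step_change`, its parameters and the sample timings are all hypothetical. It takes the first few samples as a baseline and flags the first point at which several consecutive samples sit well above it, so that a single slow sample does not trigger an alert.

```python
from statistics import mean, stdev


def find_step_change(samples, baseline_size=5, window=3, sigmas=3.0):
    """Return the index where a sustained upward shift starts, or None.

    samples: step delivery times in seconds, oldest first.
    A shift is flagged when `window` consecutive samples all exceed the
    baseline mean by more than `sigmas` baseline standard deviations.
    """
    if len(samples) < baseline_size + window:
        return None
    baseline = samples[:baseline_size]
    threshold = mean(baseline) + sigmas * stdev(baseline)
    for start in range(baseline_size, len(samples) - window + 1):
        if all(t > threshold for t in samples[start:start + window]):
            return start
    return None


# Hypothetical delivery times for one journey step, in seconds.
times = [1.1, 1.0, 1.2, 1.1, 1.0, 1.1, 4.8, 5.1, 4.9, 5.0]
shift_at = find_step_change(times)
if shift_at is not None:
    print(f"Step change detected at sample {shift_at}")
```

Requiring several consecutive slow samples trades a little detection speed for fewer false alarms, which matters when each alert sends an SMS to the team.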

On checking with the various teams as to who had changed what on the systems, it transpired that a sys-admin had pulled one of the app servers from the farm. He had not expected it to have any effect: the load balancing should have ensured that everything continued happily, spread across 9 servers rather than 10, and as the system was not under load, 9 would be fine.

That was a sensible assumption, but as is often the case in IT, assumptions based on previous experience and on how systems ought to behave can be dangerous!

When the 10th server was plugged back in, the monitoring spotted it and sent out ‘problem resolved’ SMS messages, as performance had returned to normal.

A couple of hours’ investigation on the system found that an unexpected piece of code was looking for data that was only being served by that 10th server. With the server in place, all servers worked, but all were dependent on calls to No 10. Also, the web service on No 10 was coded to protect itself under load: above a preset request level it pushed requests into a queue, and it prevented the queue processing from taking too much resource by adding big wait times.
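The effect of that hidden dependency on overall capacity can be shown with some simple arithmetic. This is an illustrative sketch only: the function `farm_capacity` and every figure in it are hypothetical, chosen to show how a single throttled service can cap a whole farm, and are not measurements from the project.

```python
def farm_capacity(servers, per_server_rps, shared_service_rps=None,
                  calls_per_request=1):
    """Requests per second the farm can sustain.

    Without a shared dependency, capacity scales with the number of servers.
    With one, the farm can go no faster than the shared service allows.
    """
    scaled = servers * per_server_rps
    if shared_service_rps is None:
        return scaled
    return min(scaled, shared_service_rps / calls_per_request)


# Hypothetical figures, for illustration only.
as_designed = farm_capacity(10, 300)  # each server uses its own local service
as_built = farm_capacity(10, 300, shared_service_rps=1000)  # all call one server

print(as_designed, as_built)
```

In this model, adding more app servers does nothing for the as-built figure, because the ceiling is set entirely by the one shared service.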

The code on all the servers was amended so that the calls were made to the web service on the same hardware rather than to the one on server 10, as the design specification had intended all along. The error had slipped through because a testing configuration option had been added to the code when it was running on just 1 server in development, and nobody had observed that the code behaved differently across the web farm.
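One way to stop a development-only option of this sort from leaking into a production farm is to make the code refuse it outside development. The sketch below is an assumption about how such a guard might look, not the retailer's actual fix: `ServiceConfig`, `service_host` and the host names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ServiceConfig:
    local_host: str                      # the server this code is running on
    override_host: Optional[str] = None  # development-only testing option
    environment: str = "production"


def service_host(cfg: ServiceConfig) -> str:
    """Pick the host for the internal web service call."""
    if cfg.override_host is not None:
        # Fail loudly rather than silently routing every server to one box.
        if cfg.environment != "development":
            raise ValueError("override_host is only allowed in development")
        return cfg.override_host
    return cfg.local_host


# Hypothetical host names.
print(service_host(ServiceConfig(local_host="app03")))
print(service_host(ServiceConfig("dev01", override_host="dev-svc",
                                 environment="development")))
```

Failing at start-up with a clear error is a deliberate choice here: a misconfiguration then shows up on the first deployment, rather than as a capacity ceiling under load.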

The final Load testing in Phase II could therefore complete – the desired traffic levels were reached, and the retail site went live as planned.

The team chose to continue with the 24/7 User Journey monitoring, so that they’d be able to pick up immediately any user experience impact during the planned programme of tweaks and new features, in case any small changes under the bonnet had a bigger impact than expected.

The success highlights the benefit of performance measurement that is “technology agnostic”: measurement that is done completely outside the system itself and does not depend on any code or logging produced by it. For complex systems integrations, the dynamic User Journey approach shows the final user experience delivered by the end-to-end system: from the UI, through internal systems, logistics and warehouse management, to delivery.

In line with methods such as Rapid Problem Resolution (RPR), the approach is evidence-based, and being focused on user experience it provides a common language and understanding that can be used across departments. This means all teams are focused on the things that affect performance as experienced by users, not just on the performance metrics of their own team or department.

Of particular note to the Capacity Planning team, having handed the system over to Ops, was the time-saving benefit of the drill-down technical details provided by the journey monitoring portal. This meant less time spent firefighting to find evidence of what was actually happening, and less time spent in meetings speculating and setting tasks for teams to go away and come back with more logs ‘just in case they’ll help’. It also did much to streamline the prioritization of infrastructure activity and the making of development investment decisions and forecasts.

In other news, it’s been an incredibly busy time for load testing these past few weeks, working with some of the busiest eCommerce sites in the UK, such as Cineworld, Clarks, Transport for London (as part of their Olympics 2012 preparations) and the UK’s biggest supermarket with a rolling programme of testing.