Date: 13th September 2010
As part of our web performance testing services, we’ve been web load testing for a long time now. We’ve tested most sectors, from government departments to law firms, online retailers, utility companies and banks.
We’ve tested big-name sites that turned out to have pretty modest web capacity, right through to smaller companies whose web sites are surprisingly efficient at handling high load levels, and who have followed good practice like ITIL Capacity Planning.
But the team have just come off some weekend testing that may well have been the biggest website load test we’ve done so far – for a high street retailer. It was a case of Christmas load testing: finding out how their site would handle the expected levels of traffic around that shopping period.
During the website load test plan development stages, the specification started off pretty unfocused: the marketing guys were talking about a big number of unique visitors, and the load balancer guy wanted instead a big number of pages per minute. Then the web analytics guy wanted a set of defined pages to be hit in the right percentages: X order pages, versus Y order-amendment pages and so on. The Capacity Planning guy had some desired numbers in mind too, from his capacity modelling of the eCommerce engine.
And one of the apps guys framed his spec in concurrent users, and he meant users on the core e-commerce transaction engine, not just users getting regular web pages (many of which are served from cache).
The big number was going to be 10,000; it started as pages per hour, then per minute, then concurrent users, then orders, then ‘average users’.
Anyway, to cut a long story short, we ended up with a spec where the target was 20,000 orders being placed concurrently, with over 1.2M pages for items being added to baskets: across a combination of five different User Journeys running in parallel. And those User Journeys were whoppers.
A normal load test User Journey is 5, 8 or 10 pages long. A typical retail one, for example, works through a site, finding a product by navigating through menus or making searches, and choosing at random from the products offered.
But this time, the Journeys were all over 60 pages long. Sixty.
And the client wanted a bunch of think time to be emulated, so some of the test runs had journeys with up to 35 minutes of think time.
We knew this website performance consultancy was going to push our load test infrastructure further than ever before. Such long think times meant that we had to keep big numbers of virtual user threads open for much longer than a mainstream web load test would, and that would mean keeping page objects in memory longer. And with so many pages per Journey, that too meant more objects to be kept in memory for longer. So the memory hit on our infrastructure was something we looked into prior to testing.
We also added to our hardware, bringing in the best part of 20 dedicated servers, spread around the internet at nearly a dozen different data-centres, to ensure that we could pull down enough bandwidth to saturate the client’s 620Mb/s network capacity without impacting too much on any one data-centre we were using.
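As a rough back-of-envelope for why long journeys and long think times inflate memory use, Little's Law says the average number of journeys in flight equals the rate at which journeys start multiplied by how long each one lasts. The sketch below is only an illustration: the start rate, per-page think and response times, and memory per virtual user are all hypothetical figures, not measurements from this test.

```python
def journey_minutes(pages, think_s_per_page, response_s_per_page):
    """Total wall-clock length of one journey, in minutes."""
    return pages * (think_s_per_page + response_s_per_page) / 60.0


def average_users_in_flight(journeys_started_per_min, minutes_per_journey):
    """Little's Law: average number in the system = arrival rate x time in system."""
    return journeys_started_per_min * minutes_per_journey


def memory_gb(users_in_flight, mb_per_user):
    """Memory needed to hold every in-flight virtual user at once."""
    return users_in_flight * mb_per_user / 1024.0


# Hypothetical inputs, chosen only to show the shape of the effect.
START_RATE = 100      # journeys started per minute (assumed)
MB_PER_USER = 2       # memory held per virtual user (assumed)

short_journey = journey_minutes(pages=8, think_s_per_page=5, response_s_per_page=2)
long_journey = journey_minutes(pages=60, think_s_per_page=30, response_s_per_page=2)

for label, minutes in (("8-page journey", short_journey),
                       ("60-page journey", long_journey)):
    users = average_users_in_flight(START_RATE, minutes)
    print(f"{label}: {minutes:.1f} min, ~{users:.0f} users in flight, "
          f"~{memory_gb(users, MB_PER_USER):.2f} GB")
```

With these made-up figures, the 60-page journey keeps roughly 34 times as many virtual users in memory as the 8-page one at the same start rate, which is the effect described above.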
Everything was under control, with the overnight testing due to start, and with about 10 of us on the conference call – guys from the client’s various tech teams, our tech guys plus a few 3rd party suppliers to the client.
So far so good. Until, with less than an hour to go, the client’s Project Manager decided to tear up the project plan and take a wholly new approach, despite the planning we’d done and despite the ITIL Capacity Management bases they’d been trying to fit in.
Instead of testing the various parts of the site in isolation first, Journey by Journey, he wanted to test the whole thing at once: effectively jumping to what we had planned for the last couple of nights of stress testing.
And worse: instead of ramping up to find where their site started to smoke, he wanted to kick off in overload. He wanted to ensure his site really would be overloaded right away, and then ramp down. The thinking was that it would give error messages straight away that they could look into and start to analyse for engineering fixes, and that they could ramp down to see at what load the errors stopped.
The downside of this overload approach, as discussed earlier, when metrics have not first been obtained for the individual blocks of the site, is that some bottlenecks can be hidden amongst the mass of error messages from other blocks. Other code blocks that are on the edge of becoming a bottleneck may never do so, because capacity limits elsewhere prevent enough traffic from reaching them. That is, anything that happens in pages towards the end of a Journey will be hit less hard once the site gets into overload and throws errors at the earlier pages, because a percentage of Journeys will simply fail and stop before they reach the later pages.
On the night, we managed to rein things back in a little, as far as first testing the Login Journey in isolation. There was no way we could test more widely without knowing the capacity limits for logging in, as that was a key step in several of the journeys, so that happened first.
That measurement told us that, sure enough, the site couldn’t handle as many logins per minute as were needed to hit the magic target numbers. So a quick bit of re-scripting of the Journeys was needed, to adjust think time for the login steps and to adjust ramp-up times, so that we’d not exceed the logins-per-minute bottleneck.
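The masking effect can be illustrated with a small sketch. It assumes, purely for illustration, that each step of a Journey succeeds independently with some probability while the site is in overload, and that a Journey stops at its first failed step. The success rates are hypothetical.

```python
def fraction_reaching_each_step(step_success_rates):
    """For each step, return the fraction of started journeys that arrive at it.

    A journey arrives at step k only if every earlier step succeeded,
    so the fraction is the running product of the earlier success rates.
    """
    reach = [1.0]  # every started journey arrives at the first step
    for success_rate in step_success_rates[:-1]:
        reach.append(reach[-1] * success_rate)
    return reach


# Hypothetical: a 60-step journey where each step fails 5% of the time in overload.
rates = [0.95] * 60
reach = fraction_reaching_each_step(rates)

print(f"journeys arriving at step 1:  {reach[0]:.1%}")
print(f"journeys arriving at step 30: {reach[29]:.1%}")
print(f"journeys arriving at step 60: {reach[59]:.1%}")
```

Under these assumed rates, fewer than one in twenty Journeys ever arrive at the final step, so a weakness in a late page sees only a small fraction of the intended load and can stay hidden.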
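The ramp-up adjustment comes down to simple pacing arithmetic: if every virtual user logs in once as it starts, the users cannot be started faster than the login capacity allows. A minimal sketch, with hypothetical numbers for the user count and the measured login limit:

```python
def min_ramp_up_minutes(virtual_users, max_logins_per_min):
    """Shortest ramp-up that keeps the login rate at or below the measured limit."""
    return virtual_users / max_logins_per_min


def start_offsets_seconds(virtual_users, max_logins_per_min):
    """Evenly spaced start times, one virtual user per login 'slot'."""
    interval = 60.0 / max_logins_per_min
    return [i * interval for i in range(virtual_users)]


# Hypothetical figures, not the client's measured capacity.
USERS = 10_000
LOGIN_LIMIT = 500  # logins per minute the site was assumed to sustain

print(f"ramp-up must take at least {min_ramp_up_minutes(USERS, LOGIN_LIMIT):.0f} minutes")
print("first few start offsets (s):", start_offsets_seconds(5, LOGIN_LIMIT))
```

This only covers the initial ramp. If Journeys loop and log in again, the steady-state login rate (users divided by Journey length) would also need to stay under the same limit, which is where adjusting think time comes in.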
Throughout the load test, we also had the same Journeys running on our website monitoring portal 24/7, so we could see how those graphs changed from the usual daytime levels. Our network monitoring service was also talking to their routers, so we had plots of bandwidth on the key pipes.
With that done, and the client insistent, we lit the blue touch paper and retired (quoting a phrase familiar to anyone lighting fireworks a few years back at a Guy Fawkes night bonfire).
Well, it was a tough one – there was metaphorical smoke coming from various different parts of the client’s infrastructure. And our memory monitoring across servers was showing some red too.
That was more than we had expected, based on the offline internal testing we had done in preparation earlier in the week, where we had got to 30,000 with no memory issues at all.
Based on that, we tried out some of the additional configuration options we had prepared during the week, to further reduce the memory used per virtual user in our test engine.
We were able to continue to deliver enough load to give the client what they wanted, over a period of test nights.
And out of it came some useful learning for us, to take our website load testing technology to another level.
Firstly, no matter what testing you do on a ‘trial’ basis, it’s only when you test the real thing that you find out how things perform. We had done dry runs within SciVisum, using Journeys designed to be as long as the client ones, with steps doing similar things in the same mix, and hitting our own dummy web sites. But only with the client’s actual journeys hitting their actual sites did we see the increased memory load.
Secondly, we managed to go back over our test engine code and improve it quite substantially. It was only this client’s unusual combination of think time, many steps and testing from overload backwards that helped us make these gains. We would not in a month of Sundays have spec’ed that lot up as a sensible place to benchmark our platform!
Thirdly, we were able to think about improvements to our testing tools in terms of real-time support for the testing team. With 3 of our guys working together through the night, it was quite a challenge to juggle and share out the tasks: looking at code configuration settings, tracking memory across a load of boxes, watching the output of a number of different log files, handling the client conversations on the phone, looking into errors being thrown by the client site so that we gave them good quick feedback, and so on. And, on top of that, thinking ahead to plan the next test run once the current one is done.
Talking to a new colleague, Gomez, at 6am after testing all night, he said:
“Phew, I’m confident we identified some engineering issues to work through there. From the engineering conversations, there are a number of places where the coders now have to decide when and where they rework their site.”
“I’m 100% confident that they’ve never had so much load, in such a complex mix of dynamic User Journeys, on their site; confident that, in comparison to other unit testing done earlier, we’ve uncovered some really useful issues for improving their site; and confident that our own load test infrastructure has taken a big capacity step up too.”