Date: 8th February 2013
It’s a regular theme at many meetings I have: everyone working together, trying to motivate teams from development, testing, ops and marketing, and get them all focused on a vital issue…
We don’t want customers to be the first to tell us about performance problems on our site!
Very occasionally, a client will be worried about their problems making the local, or even national, TV news.
But the BBC website was very open about a problem on 6th February.
This is what their site was showing:
It’s not the first time they’ve been equally honest:
- June 2012: ‘we suffered a major failure of BBC Online last night.’
- Mar 2011: ‘the whole of BBC Online was down last night for an hour from 22:40’
The lesson to be learnt is:
Not that major outages can affect any site unexpectedly, and that we should design in resilience to every possible technical root-cause failure so that the site can stay live despite those problems (which is true).
It is rather:
That all our sites suffer minor problems at frequent intervals, affecting a few visitors or just some of the online functionality.
And that we all need:
- a set of metrics that measure 24/7 user experience and hence surface these non-catastrophic problems
- an agreement that both business and tech teams understand and can act on the metrics
- less time spent in after-the-event analysis and finger-pointing, and more on finding technical root causes and solutions
The Impact Of Sporadic Errors
These problems may be so brief that normal tech alerting often does not have time to fire.
At other times a problem may only affect a percentage of users, say 5%, and so will never trigger a tech-based monitoring solution.
There may also be problems that are simply not visible to the tech systems: e.g. the ‘Buy Now’ button for certain holiday dates and packages does not work on the user’s side, so no page request is sent to the servers at all.
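To illustrate why such blips slip past conventional alerting, here is a minimal sketch (a hypothetical example, not any particular monitoring product’s logic): an alert that only fires when the error rate stays above a threshold for several consecutive checks will miss both a two-minute outage and a sustained 5% failure rate. All figures in it are assumed for illustration.

```python
# Hypothetical threshold alert: fire only if the error rate exceeds
# `threshold` for `consecutive` successive samples (one sample per minute).
def should_alert(error_rates, threshold=0.10, consecutive=3):
    streak = 0
    for rate in error_rates:
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive:
            return True
    return False

# A 2-minute outage (50% errors), then an hour at a steady 5% error rate.
samples = [0.0, 0.5, 0.5, 0.0] + [0.05] * 60
print(should_alert(samples))  # False: too brief, then below the threshold
```

The outage is real and the 5% failure rate costs real customers, yet neither condition ever trips the alert: one is too short, the other too small.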
Hence sporadic-error reporting becomes vital to website teams that have already got their site to a point of stability where outages long enough to show a warning page like the BBC one above just never occur.
We have been working with several clients in a number of industries over the past few months to investigate the real incremental cost, and the impact on KPIs, of sporadic errors: little “blips” in performance that do not last long enough to trigger a warning, slow-down or cut-off alert, and are not immediately replicable.
In some cases these “blips” are due to the natural ebb and flow of network traffic, browser peculiarities, local conditions and the like, and the thinking has typically been that such things cost more to investigate and control than resolving every last anomaly is worth to the business. In many cases this assessment is correct; what we have been working on is how to identify the ones that actually would be worth the time and effort to analyse further.
Short-term errors, while often insignificant in themselves, can build up to a substantial total over time. Their very nature makes them easy to overlook, and makes it easy to miss underlying patterns and probable causes that would be more readily apparent in issues that last longer and are seen by more of the operations and content teams.
With traffic volumes so high on many sites now, and content provision often so complex, a few minutes of problems every day can quickly add up over the course of a week or a month.
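A quick back-of-the-envelope calculation shows how fast this accumulates. The traffic and failure figures below are assumed purely for illustration, not measured client data:

```python
# Illustrative arithmetic with assumed figures: small daily "blips"
# accumulate into a large monthly total of failed requests.
requests_per_min = 2000      # assumed traffic on a busy site
blip_minutes_per_day = 3     # assumed total daily duration of blips
error_fraction = 0.05        # assumed share of requests failing during a blip

failed_per_day = requests_per_min * blip_minutes_per_day * error_fraction
failed_per_month = failed_per_day * 30
print(failed_per_day, failed_per_month)  # 300.0 failed requests/day, 9000.0/month
```

Nine thousand failed customer interactions a month, from blips too brief for anyone to notice on the day.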
Last year SciVisum released a Sporadic Error Report.
This is one way to report on sporadic errors: it is based on the vital customer user journeys, measuring the routes that business and tech teams have agreed are fair gauges of genuine user experience.