As organisations move along the eCommerce maturity curve and look to put distance between themselves and the competition, an increasing need is emerging for more sophisticated and responsive IT support that protects business strategy, brand and, most importantly, the bottom line.
IT departments especially, and business in general, are looking for tools that can provide “intelligence” as well as just “data”. Online performance is not only about technology; that just powers the engine. In an increasingly crowded, demanding and competitive marketplace it is about the user satisfaction that is needed to achieve strategic goals.
Isolated firefighting and ad hoc bug hunting are being driven out by four factors:
Relentless pressure to reduce IT operational spend
Increasingly sophisticated needs, interconnection and integration of complex systems
The drive for increased business efficiency, consistent application performance and stability
The increased complexity of technical support including outsourced and cloud elements
In order to pre-empt as many problems as possible, and prevent them from recurring, a structured approach to problem discovery, diagnosis and documentation is required, and that process needs to be based on robust tools that offer realism in their data.
When tracking down technical errors it can help to have a structure to work to. There are many methodologies for tracking errors, each with its own pros and cons, but we have found Rapid Problem Resolution to be a very useful model when investigating and analysing the errors, warnings and slowdowns shown by Dynamic User Journey monitoring.
In most cases an IT error can be classified, at base, as one of four basic types:
Design: The system doesn’t work the way it should
Functionality: The system should work but doesn’t, resulting in an error or incorrect output
Stability: The system does not work consistently and sometimes fails resulting in an error or incorrect output
Performance: The system works but not at the intended speed
For each of these four types of error there are three error states:
One off: usually identified as the result of a major incident
On-going: that users are experiencing right now (continuous)
Recurring Problems: causing repeated incidents with similar symptoms (intermittent)
(Of course there are non-technical errors which can affect system performance, and these are dealt with later in the article as they are “softer” issues and may be beyond the IT department’s remit.)
Difficulty of investigation
Errors do not occur in a vacuum, and when investigating a problem, especially under pressure, it is not easy to isolate what is causing it. Problems fall into four categories, depending on whether they recur and whether the causing technology has been identified. Three of these may be time consuming but are easier to deal with:
Single incident/Technology Identified: typically 80-90% of problems are of this type and are easily tracked down to a causing technology or a change related cause such as hardware failure, software failure, misconfiguration or operations error
Recurring Problem / Technology Identified: this type includes such issues as intermittent hardware failure, known error or intermittent software failure.
Single incident/Technology Unidentified: every so often a one off problem occurs and the cause may never be found whether the root cause is hardware, software or, more likely, end user error or operations error. Often these are written off when the issue cannot be replicated and is not reported as having occurred again. In complex systems, especially ones undergoing constant change, these things sometimes just happen.
However, the fourth category, Recurring Problem / Technology Unidentified, can have a disproportionately high adverse impact on business efficiency, IT service levels, workload, morale and KPIs. The ownership of these issues is often unclear. They are referred to as “grey problems” in some resolution methodologies and can be the result of such things as poor application logic, intermittent application error, transient overload, intermittent infrastructure error or incorrect failover operations.
These types of issue are often assessed as low or medium priority, mostly because no one knows where the ownership of an unidentified issue should lie. However, if not resolved quickly they can be placed on the “too difficult to deal with” pile, where they stack up. Not only that, but such niggles can be at the very least an early warning of bigger errors to come, and it can be dangerous and wasteful in the long run to build further on a system that is known to have such problems.
Not dealing with grey problems is likely to result in:
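As a rough illustration of the four categories above, the sketch below tags a problem record by whether it recurs and whether the causing technology is known, and flags the grey ones. The `Problem` record, its field names and the two helper functions are assumptions made for this example; they are not taken from any particular tool or methodology.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Problem:
    # Hypothetical record; field names are illustrative only.
    summary: str
    occurrences: int            # how many times the symptom has been seen
    technology: Optional[str]   # causing technology, or None if not yet identified


def categorise(problem: Problem) -> str:
    """Return one of the four categories described in the text."""
    frequency = "Recurring Problem" if problem.occurrences > 1 else "Single incident"
    cause = "Technology Identified" if problem.technology else "Technology Unidentified"
    return f"{frequency} / {cause}"


def is_grey(problem: Problem) -> bool:
    """Grey problems recur and have no identified causing technology."""
    return problem.occurrences > 1 and problem.technology is None
```

Flagging grey problems explicitly in this way is one possible means of keeping them visible in a backlog rather than letting them drift to low priority.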
The problem growing with business load until a tipping point is reached and it causes a Major Incident
On-going/recurring problems creating a fog that obscures other problems
On-going/recurring low level problems constraining other IT options, e.g. “I’d rather we didn’t expand the use of XYZ because it’s already got issues”
Users being ground down by constant low level problems leading to out of proportion levels of dissatisfaction.
A business will often accept big one off incidents more readily than on-going unresolved issues. Everyone knows that some failure is inevitable but constant low level issues impact efficiency, frustrate users and distort workflow on the inside, and cost money, loyalty and brand equity on the outside.
Without a clear route to diagnosis, progress towards resolution is very slow. This is where tools that provide intelligence are needed: tools that let those trying to solve the problem “see what the customer sees”, help them explain it to those responsible for causing it, and combine user experience realism with in-depth data analysis capabilities.
Issues with Investigation in Practice
When it comes to investigation of an issue there are often non-technical, organisational issues to consider as well. Technical depts often split into specialist teams of Platform Support and Application Support.
While there are obvious benefits to such specialisation the following issues can arise:
Lack of communication between teams
Specialised silos with narrow concerns and no idea of place in bigger picture
Certain functions and knowledge are outsourced with no in-house knowledge
Team KPIs only based on performance of own technology
ITIL suggests a cross team “problem solving group”. While this is a good idea, in practice very few problem management functions are sufficiently mature, or work together often enough, for it to work effectively. Often the team does not work as such and instead everyone reverts to focussing on attempting to prove that their team’s technology is not to blame.
Of course looking for verification is very different from true impartial investigation, and investigation is a lot easier when everyone is speaking a common language and all understand how their contribution fits into the success of the business plan, not just hitting their own departmental KPIs.
Putting Things Together to Understand Problems with SciVisum Monitoring and Testing Tools
When looking at problem resolution as part of a wider TQM (Total Quality Management) approach, rather than just as ad hoc incidents, it is important to build a process that will lead to continuing improvement. To do this you need several key features as a base:
Time stamped data is very important in diagnosis
Ability to replay samples showing what actually occurred
Correlation across different tools and systems to show the entire picture before isolating the problem, beginning from the User’s Experience
With these things in place, IT support can take a much more holistic view of what is happening, and so suggest improvements that go beyond the specific technical problem occurring at that moment. Business people can look at what happened and understand how better to work with IT / Development to prevent it happening again and to achieve goals more easily. Finance, supplier management and capacity planners can better understand resource needs. Here is an example of how SciVisum tools do just that:
- Users are experiencing slow down on site, shown by sample time over set acceptable threshold (defined by client)
- Support team are alerted by email/SMS
- Observation that web server is slow (using Time To First Byte tools)
- Investigation shows certain hops are slower than average at this time (MTR tools)
- Server monitoring shows 85% capacity used
- Web analytics show site well under max capacity found in most recent Load Test but higher than normal
- Higher current load is caused by Summer sale campaign (comparison in SRM marketing tool)
- Looking at components shows more video added to site as part of campaign, so the site configuration is no longer the same as the one that was load tested.
- Video is being supplied by a third party who have not complied with file size guidelines for content
- Annotation can now be added against the slowdown
- Process amended for notification of campaigns, testing and SRM and SLA with 3rd party provider
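The first steps of this walkthrough, spotting samples over the client-defined threshold and lining them up against other time-stamped events, can be sketched as follows. This is a minimal sketch under stated assumptions: the `Sample` and `Event` records, the 15-minute window, the function names and the sample data are all invented for illustration and do not describe the actual interfaces of the SciVisum tools.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Sample:
    # Hypothetical journey sample: when it ran and how long it took.
    taken_at: datetime
    seconds: float


@dataclass
class Event:
    # Hypothetical time-stamped event from another source (campaign, deploy, etc.).
    at: datetime
    source: str
    note: str


def slow_samples(samples, threshold_seconds):
    """Return samples whose time exceeds the client-defined threshold."""
    return [s for s in samples if s.seconds > threshold_seconds]


def events_near(sample, events, window=timedelta(minutes=15)):
    """Return events within `window` either side of the sample's timestamp."""
    return [e for e in events if abs(e.at - sample.taken_at) <= window]


# Made-up data purely to show the shape of the correlation.
samples = [
    Sample(datetime(2024, 7, 1, 10, 0), 3.2),
    Sample(datetime(2024, 7, 1, 10, 5), 9.8),
]
events = [
    Event(datetime(2024, 7, 1, 9, 55), "marketing", "Summer sale campaign live"),
    Event(datetime(2024, 7, 1, 6, 0), "ops", "Routine backup"),
]

for s in slow_samples(samples, threshold_seconds=8.0):
    for e in events_near(s, events):
        print(s.taken_at, s.seconds, e.source, e.note)
```

Because every record carries a timestamp, the same comparison could in principle be extended to server metrics, network traces or annotations, which is what makes time stamped data so useful in diagnosis.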
This approach can also help to distinguish real from perceived symptoms as reported by customers and internal departments (especially as these are likely to use technical terms loosely or incorrectly). The SV common language can help overcome this problem, while the realism of the data and the ability to drill down to a very granular level enable detailed understanding. Having both a top level experiential picture and a detailed performance view means that nothing is seen in isolation from either the system or the business processes that created it and that it must support.
Nothing beats a first hand account of a problem when it comes to diagnosis. Ideally IT support need to talk to users when diagnosing and resolving problems. Of course this is rarely possible when supporting customer facing websites, but the DUJs “doing what the customer does” give as close a view as possible of what customers did at the time of the error.
Being able to go back to a specific time stamped sample and investigate all aspects of what happened with it, replay it, follow the same path, compare it, annotate it removes the vagueness which plagues so much of IT support and wastes so much time.
Communication Leads To Improvements Beyond Technology
Where this leads into improving communications, processes or procedures, the impact on the business will be far greater than that achieved by just fixing an isolated technical problem. It may turn out that the problem has just emerged as a symptom of a bigger business process/strategy related cause.
In turn this helps with prioritisation discussions for the business. If IT can return to a strategic meeting with a clear list of causes (and even an opportunity cost for each) then it suddenly becomes a business strategy issue, not a vague “IT Problem” that the business just sees as an annoyance. In this way the IT dept is also protected against dealing with the “wrong” thing first. Such clear information allows the business to set priorities (or, ideally, agree priorities with IT) with a clear understanding of what those decisions will mean in practical terms. A decision making process can also be developed and agreed so that IT is not disempowered and going back to the business all the time for “permission”. For example, it may be determined that problems should be fixed in order of potential revenue loss, or that problems with X area of the website are more important than Y. It may also cover what to do if 3rd party suppliers are not performing to SLA, or when to order a new load test or request new hardware.
Understanding business purposes in this way will also help IT to understand how they fit into the overall strategic picture, and enable them to make informed proposals about the best way to support it technically, what hardware is needed, what problems they might be able to solve with new applications, etc.
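An agreed rule such as “fix in order of potential revenue loss, then by area of the website” can be written down explicitly so that IT can apply it without going back for permission each time. The sketch below is one possible way to do this; the `OpenProblem` record, the loss figures, the area names and the `prioritise` function are all hypothetical and would need to be agreed with the business.

```python
from dataclasses import dataclass


@dataclass
class OpenProblem:
    # Hypothetical record; the loss figure would be agreed with the business.
    name: str
    area: str
    estimated_daily_loss: float


def prioritise(problems, area_order):
    """Sort by estimated loss (largest first), then by the agreed area order."""
    def key(p):
        # Areas not in the agreed list are ranked after all listed areas.
        area_rank = area_order.index(p.area) if p.area in area_order else len(area_order)
        return (-p.estimated_daily_loss, area_rank)
    return sorted(problems, key=key)


# Made-up backlog purely for illustration.
backlog = [
    OpenProblem("Slow search results", "search", 400.0),
    OpenProblem("Intermittent basket error", "checkout", 1500.0),
    OpenProblem("Timeouts on payment page", "checkout", 1500.0),
    OpenProblem("Broken image on help page", "help", 0.0),
]

for p in prioritise(backlog, area_order=["checkout", "search", "help"]):
    print(p.name)
```

Keeping the rule in one small function means a change in business priorities only has to be made in one place.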
Non Technical Problems
Of course problems can also have non technical root causes, for example:
Lack of training
Incorrect or weak process
Lack of understanding of technical consequences by non-technical users
If a problem occurs for any of these reasons, an awful lot of IT time can be wasted trying to track down non-existent technical faults, and confidence in the system and in the IT dept can suffer, with wider ramifications for business development strategy.
An issue could involve a number of technical and non-technical components that need to be resolved and addressed to prevent a recurrence, for example:
Misunderstanding of new system requirements leads to a bug in the software development
A patch is made but is applied to the server incorrectly
The patch was applied incorrectly because the engineer who applied it had not been trained, the staff member who usually does this task being away on holiday
The known problem is down to be “fixed at a later date” as it only occurs under unusually heavy load
This problem is not communicated to the marketing dept, who, in turn, have not let the IT dept know that they have a major seasonal sale about to go live.
No load testing has been done ahead of the sale because IT did not know it was happening, and marketing did not know it was possible or important.
Once the bug has been fixed the problem of procedures and training also needs to be addressed or the risk of such things happening again in the future remains high. Educating other departments in the use of monitoring tools, and familiarising them with the concepts, data and intelligence it is possible to get from them will lead to wider improvements far beyond the traditional narrow focus on “is it up today”.
If the use of tools can be demonstrated to have delivered this level of value to the business, and is taken into account as part of the whole ROI calculation, suddenly the question of the “need” for monitoring and measurement can be seen in its true light in terms of business impact.
SV Monitoring Suite
All products in the Monitoring Suite have been designed with different user needs in mind, but all are delivered through the intuitive Customer Portal and enjoy the one-on-one managed service support that our clients value so highly.
To help support all teams, and provide a “single point of truth”, all products in the SV Monitor Suite are designed to ensure that everyone can understand and be proficient in using the wide ranging metrics to deliver ongoing improvements.