Getting to the Root of the Problem - Automating IT Fault Diagnosis
Date:
By: Steve Barrie
Classification: Research Report
The key to automating the management of the IT infrastructure is to be able to accurately diagnose the root cause of a problem. Accurate diagnosis can be achieved only if the solution is based on a model that truly represents the hardware and software components as well as the relationships between them.
Thankfully, this idea fits closely with the requirements of service management through SLAs, and the two processes complement each other. Whilst there are many tools that will measure the quality of service being delivered, it is much more difficult to find out what is causing a service degradation.
The solutions available for root cause analysis (RCA) vary widely and offer different levels of automation and effectiveness in complex IT environments. This lack of established best practice is an indication of immaturity, which makes the subject an excellent target for developing competitive advantage.
Getting to the Root of the Problem examines the different RCA approaches that are available and derives some ideas for best practices that enable a practical solution for accurate diagnosis in a service-driven environment.
The topics covered include:
- The Nature of Problems – The behaviour of complex IT environments and the dependencies that exist between hardware and software components within the context of services.
- Autodiscovery – Finding hardware and software components and understanding the relationships between them. How do you recognise changes within the environment and translate all of this information into an accurate model?
- Monitoring the Infrastructure – What can be monitored and the tools that carry out the tasks.
- Rationalising Events – Making sense of all of the information coming from the monitored environment. How do you keep the ‘noise’ to a minimum and conserve IT resources at the same time?
- Offline Analysis – Building behaviour patterns and analysing them in the security of a test environment. What can we find by mining event information, and is it useful enough to warrant the investment?
- Service Culture – Taking the business-oriented view whilst also creating the most efficient solution.
- Best Practices – Deriving some best practice concepts based upon a mixture of bottom-up monitoring and top-down service measurement. What is practical for most businesses and what works for large-scale organisations? How do we get accurate RCA, and hence service level improvements, whilst also getting more from existing investments in management tools?
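As a rough illustration of why diagnosis depends on an accurate model of components and their relationships, the sketch below shows one simple way a dependency model could be used to separate root causes from symptoms. It is not taken from the report or from any of the reviewed products; the component names, the `DEPENDS_ON` map and the `root_causes` function are all hypothetical, and real tools use considerably richer correlation techniques.

```python
from typing import Dict, Set

# Hypothetical model: each component maps to the components it depends on.
DEPENDS_ON: Dict[str, Set[str]] = {
    "order-service": {"web-server", "database"},
    "web-server": {"host-a"},
    "database": {"host-b"},
    "host-a": {"switch-1"},
    "host-b": {"switch-1"},
    "switch-1": set(),
}


def root_causes(alarms: Set[str], depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Return the alarming components that have no alarming dependency.

    An alarm is treated as a symptom if anything the component depends on,
    directly or indirectly, is also raising an alarm.
    """

    def has_alarming_dependency(component: str, seen: Set[str]) -> bool:
        for dep in depends_on.get(component, set()):
            if dep in seen:
                continue  # already visited; also guards against cycles
            seen.add(dep)
            if dep in alarms or has_alarming_dependency(dep, seen):
                return True
        return False

    return {c for c in alarms if not has_alarming_dependency(c, set())}


if __name__ == "__main__":
    # A failed switch raises alarms on everything above it in the model.
    storm = {"order-service", "web-server", "database",
             "host-a", "host-b", "switch-1"}
    print(root_causes(storm, DEPENDS_ON))
```

In this sketch six alarms collapse to a single probable cause, which is the kind of noise reduction the event rationalisation topic refers to. The result is only as good as the model: a missing or stale dependency would leave a symptom looking like a root cause.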
The report includes reviews of the following products:
- Aprisma SPECTRUM
- BMC PATROL
- Computer Associates Unicenter
- Hewlett-Packard OpenView
- IBM Tivoli
- Mercury Interactive Topaz
- Micromuse Netcool
- SMARTS InCharge