B.1 Troubleshooting in Complex IT Environments - IBM z13s Technical Manual

B.1 Troubleshooting in complex IT environments

In a 24x7 operating environment, a system problem or incident can drive up operations costs
and disrupt service to clients for hours or even days. Current IT environments cannot afford
recurring problems or outages that take too long to repair. These outages can result in
damage to a company's reputation and limit the company's ability to remain competitive in the
marketplace.
However, as systems become more complex, errors can occur anywhere. Some problems
begin with symptoms that can go undetected for long periods of time. Systems often
experience "soft failures" (sick but not dead) that are much more difficult to detect. Moreover,
problems can grow, cascade, and get out of control.
The following everyday activities can introduce system anomalies and trigger either hard or
soft failures in complex, integrated data centers:
• Increased volume of business activity
• Application modifications to comply with changing regulatory requirements
• IT efficiency efforts, such as consolidating images
• Standard operational changes:
  – Adding or upgrading hardware
  – Adding or upgrading software, such as operating systems, middleware, and
    independent software vendor products
  – Modifying network configurations
  – Moving workloads (provisioning, balancing, deploying, disaster recovery (DR) testing,
    and so on)
Using a combination of existing system management tools helps to diagnose problems.
However, these tools cannot quickly identify the messages that precede system problems,
and they cannot detect every possible combination of change and failure.
When using these tools, you might need to look through message logs to understand the
underlying issue. However, the sheer number of messages makes this process challenging,
skills-intensive, and error-prone.
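To illustrate why manual log review scales poorly, the following minimal sketch tallies message IDs in a handful of log lines and lists the IDs that occur only once. It is an illustration, not a tool described in this guide: the regular expression assumes a simplified ID shape (a few uppercase letters, some digits, and an optional one-letter type code), and the sample lines, the IDs in them, and the function name `count_message_ids` are all hypothetical.

```python
import re
from collections import Counter

# Assumed, simplified shape of a message ID: 3-5 uppercase letters,
# 3-5 digits, and an optional one-letter type code.
# Real logs would need a more careful pattern.
MESSAGE_ID = re.compile(r"\b[A-Z]{3,5}\d{3,5}[A-Z]?\b")


def count_message_ids(lines):
    """Count how often each message ID appears (first ID found per line)."""
    counts = Counter()
    for line in lines:
        match = MESSAGE_ID.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts


# Hypothetical sample lines, invented for this illustration.
sample_log = [
    "10:00:01 ABC123I subsystem started",
    "10:00:02 ABC123I subsystem started",
    "10:00:05 XYZ4567E request failed",
    "10:00:09 line without an identifier",
]

counts = count_message_ids(sample_log)

# IDs seen exactly once are candidates for a closer look.
rare = [message_id for message_id, n in counts.items() if n == 1]
```

Even this trivial tally shows the limits of the approach: a count alone does not say whether a rare message is harmless or the first symptom of a problem.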
To meet IT service challenges and to sustain high levels of availability, you need a proven
way to identify, isolate, and resolve system problems quickly. Information and insight are
vital to understanding baseline system behavior and any deviations from it. Having this
knowledge reduces the time that is needed to diagnose problems and helps you address
them quickly and accurately.
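The idea of comparing current behavior against a baseline can be sketched with a simple statistical check. The example below is only one possible approach, assumed here for illustration and not the method of any particular IBM product: it treats the per-interval count of one message ID as the baseline and flags a new count that lies more than a chosen number of standard deviations from the mean. The function name `is_anomalous`, the threshold of 3.0, and the sample counts are hypothetical.

```python
from statistics import mean, pstdev


def is_anomalous(baseline_counts, current_count, threshold=3.0):
    """Flag a count that deviates strongly from the baseline.

    Uses a plain z-score: distance from the baseline mean, measured in
    population standard deviations. The threshold is an assumed value.
    """
    mu = mean(baseline_counts)
    sigma = pstdev(baseline_counts)
    if sigma == 0:
        # A perfectly constant baseline: any change counts as a deviation.
        return current_count != mu
    return abs(current_count - mu) / sigma > threshold


# Hypothetical counts of one message ID per interval during normal operation.
baseline = [10, 12, 11, 9, 10, 11, 12, 10]

normal_interval = is_anomalous(baseline, 11)   # close to the baseline
unusual_interval = is_anomalous(baseline, 40)  # far above the baseline
```

A check like this catches only changes in volume. Detecting messages that appear out of context or in an unusual order would require a richer model of baseline behavior.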
The current complex, integrated data centers require a team of experts to monitor systems
and perform the real-time diagnosis of events. However, it is not always possible to afford this
level of skill for these reasons:
• A z/OS sysplex might produce more than 40 GB of message traffic per day for its images
  and components alone. Application messages can significantly increase that number.
• There are more than 40,000 unique message IDs defined in z/OS and the IBM software
  that runs on z/OS. Independent software vendor (ISV) or client messages can increase
  that number.
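To put the daily volume in perspective, the following short calculation converts 40 GB per day into a sustained rate. It assumes decimal gigabytes (10^9 bytes) and, purely for illustration, an average message size of 100 bytes. That size is an assumption and does not come from this guide. Because the text says "more than 40 GB", the results are lower bounds.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# Assumption for illustration only: average size of one message in bytes.
ASSUMED_BYTES_PER_MESSAGE = 100


def bytes_per_second(gb_per_day):
    """Convert a daily volume in decimal gigabytes to bytes per second."""
    return gb_per_day * 1_000_000_000 / SECONDS_PER_DAY


rate = bytes_per_second(40)                          # roughly 463 KB per second
messages_per_second = rate / ASSUMED_BYTES_PER_MESSAGE
```

Under these assumptions, the sysplex emits several thousand messages every second, around the clock, which is far more than a team can read in real time.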
IBM z13s Technical Guide
