Root Cause Corrective Action Reports
The Perils of the IT Profession
One of the common challenges that all technicians face, no matter what area of IT they work in, is the absolute attention to detail our profession demands. Switch a couple of characters in a script, forget to set your SID, set the wrong flag at the wrong time and the end result usually isn’t very pretty. Many commands we issue on a regular basis are destructive by their very nature.
If you never make mistakes, send me a resume. I’m always looking for a “Patron Saint of IT” here at Remote DBA Experts. It will also save us on travel costs because I’m sure you’ll be able to spread your wings and fly here on your own.
Then there are the software glitches. The problems that pop up out of the blue and make you go:
“WHAT THE? – How did THAT happen? I’ve done this 317 times in a row and it worked every time.”
For you math majors, here’s my calculation for one of the Foot Principles of IT Support:
THE CLOSER YOU ARE TO PRODUCTION TURNOVER
+ THE GREATER THE VISIBILITY OF THE PROJECT
= THE MORE LIKELY A PREVIOUSLY UNKNOWN SOFTWARE GLITCH WILL OCCUR
I don’t care what software you are using, you will run into the “only occurs on this release, on this version of the operating system, using this particular feature on the third Tuesday of the sixth month when it’s cloudy outside” BUG. Be sure to expect management to stop by and ask “well, why didn’t you test this on the third Tuesday of the sixth month when it was cloudy outside?”
Finally, there are the hardware problems that we have all become so accustomed to and fond of. It’s that triple-redundant, super fault-tolerant, titanium-based, twice-the-price component that you just got management to sign off on a few months ago. The one that your hardware vendor assured you would solve all of your performance and availability problems. The component they described as “state-of-the-art and self-healing”. Fixes itself, they said.
The one that forced you to create that 15-page justification document with all the pretty charts and graphs stating that this piece of hardware will still be running in the year 2040. The component that just decided to flake out and take down your entire online, 10-thousand-dollar-a-minute web-based ordering system. You then call the vendor and they say “yeah, we’ve heard about that happening occasionally – we thought we got that fixed.”
Root Cause Corrective Action Reports
A customer that is affected by an application outage or slowdown needs to have a firm understanding of what caused the problem, the activities performed to correct the problem and the action items that will be undertaken to mitigate the problem or prevent it from occurring again.
The Root Cause Corrective Action Document provides information on the underlying causal factors that generated the problem and a timeline of events that occurred during the problem event. This ensures that all problems are properly analyzed and that all steps are taken to prevent future occurrences. This is a key component of our problem resolution strategy in addition to obtaining customer feedback on the quality of our problem resolution capabilities.
I can’t stress enough the importance of using some form of problem notification document. A customer that is unsure of “what happened” is going to be an unhappy customer. Giving your customer a clear picture of the problem event and the steps you will take to prevent the problem’s reoccurrence shows them that the quality of their environments is IMPORTANT TO YOU and you do NOT TAKE PROBLEMS LIGHTLY. The time you spend crafting the Root Cause Corrective Action Document will pay big dividends in customer happiness.
The Root Cause Corrective Action Document’s components are fairly simple. Here’s a brief description of each of the sections:
The heading section contains the customer name, document date, numeric document identifier, the date the problem occurred and the person preparing the document.
Next is a clear, concise definition of the exact problem. You need to remember that not everyone reading your document will have a technical background. Leave the technical mumbo-jumbo out of it. You are trying to inform your customer of the event, NOT confuse them.
What was the impact on their business? Don’t sugar coat it. Tell them that the failure caused a 14-hour outage on their production ERP system. The business impact of the example shown below would be “A 37-minute delay occurred in replication between server ORAPGH and DB2DEL. During this time DB2DEL reports did not provide current data.”
This is a chronological timeline of the events that led to the problem (if you know what they are), and the steps that were taken to correct the issue. Include every step that occurred up to and including verifying that the fixed system was indeed operational. Here’s an example:
Wednesday May 5, 2010
18:08 Remote DBA Experts’ log monitor for the replication engine determined that replication was not successfully occurring between the two production platforms (ORAPGH and DB2DEL).
18:10 Remote DBA Experts notified Delaware business units that data replication had stopped and that reports being generated would not be current.
18:15 As a recommended action previously provided by the software vendor, Remote DBA Experts stopped and started the replication engine on both platforms.
18:30 Remote DBA Experts verified that the replication engine was running and restarted replication processes.
18:45 Remote DBA Experts verified that replication was successfully occurring between ORAPGH and DB2DEL.
18:47 Remote DBA Experts began monitoring the delay to determine the length of time it would take the replication engine to resynchronize the data between ORAPGH and DB2DEL. The delay was estimated at 15 minutes.
19:00 Monitors showed that both environments were synchronized.
19:05 Remote DBA Experts notified Delaware business units that replication was occurring and all data was current.
19:10 Logs and trace files were collected and a Severity One problem was initiated with the software vendor.
Thursday May 6, 2010
07:00 Software vendor contacted Remote DBA Experts support personnel. Stated that the problem was caused by a previously unidentified software bug. Recommended upgrading the product to the newest release (we’ve never heard that one before).
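As a quick check on the example, the 37-minute delay quoted in the business impact statement matches the gap between the 18:08 detection and the 18:45 verification in the timeline. The Python sketch below is only an illustration, not part of our actual tooling: the `TIMELINE` list, the helper names and the shortened event summaries are assumptions made for this example.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M"

# Entries taken from the example timeline above; the summaries are shortened.
TIMELINE = [
    ("2010-05-05 18:08", "Log monitor detects that replication has stopped"),
    ("2010-05-05 18:10", "Delaware business units notified"),
    ("2010-05-05 18:15", "Replication engine stopped and started"),
    ("2010-05-05 18:30", "Engine verified running; replication restarted"),
    ("2010-05-05 18:45", "Replication verified between ORAPGH and DB2DEL"),
    ("2010-05-05 19:00", "Monitors show both environments synchronized"),
]


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two timestamps."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return int(delta.total_seconds() // 60)


def is_chronological(entries) -> bool:
    """Check that the entries are listed in time order."""
    stamps = [datetime.strptime(stamp, TIME_FORMAT) for stamp, _ in entries]
    return stamps == sorted(stamps)


detected = TIMELINE[0][0]
verified = TIMELINE[4][0]
synchronized = TIMELINE[5][0]

print("Timeline in order:", is_chronological(TIMELINE))
print("Minutes until replication was verified:", minutes_between(detected, verified))
print("Minutes until data was synchronized:", minutes_between(detected, synchronized))
```

Deriving the numbers from the timeline itself, rather than typing them separately into the business impact section, is one way to keep the two sections from contradicting each other.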
Problem Root Cause
This section states the underlying causal factor that created the problem event. In the case above, the root cause was a software code issue that caused replication to terminate abnormally.
There are times when the problem is exacerbated by contributing factors. In our example, if a long running job prevented us from successfully stopping the replication engine (leading to a longer outage), we would include a description of that issue in this section.
This section contains the actual steps that were taken to correct the problem. It does not restate the steps in chronological order. It is a brief description of the activities taken to correct the issue.
The Future Prevention Section is the most important component of the Root Cause Corrective Action document. This section provides the steps that you will take as a service provider to ensure that the problem does not reoccur. It contains a list of action items, the person responsible for completing each action item and the date by which it will be completed.
The document is signed by the technicians involved with the problem and a member of the service provider management team.
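If you would rather track these documents as structured records than as free-form text, the sections described above could be sketched roughly as follows. This is a hypothetical outline only: the class names, field names, customer name, preparer and document date are all assumptions made for illustration, not the template we actually use.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class ActionItem:
    """One future-prevention step with its owner and completion date."""
    description: str
    owner: str
    due: date


@dataclass
class RootCauseReport:
    # Heading section
    customer: str
    document_id: int
    document_date: date
    problem_date: date
    prepared_by: str
    # Body sections, in the order described above
    problem_description: str = ""
    business_impact: str = ""
    timeline: List[str] = field(default_factory=list)
    root_cause: str = ""
    contributing_factors: str = ""  # optional: only when factors made things worse
    corrective_action: str = ""
    future_prevention: List[ActionItem] = field(default_factory=list)
    signed_by: List[str] = field(default_factory=list)

    # Sections that must be filled in before the document goes to the customer
    # (an assumed rule for this sketch).
    REQUIRED = (
        "problem_description", "business_impact", "timeline", "root_cause",
        "corrective_action", "future_prevention", "signed_by",
    )

    def missing_sections(self) -> List[str]:
        """Return the names of required sections that are still empty."""
        return [name for name in self.REQUIRED if not getattr(self, name)]


# Hypothetical draft based on the replication example.
report = RootCauseReport(
    customer="Example Customer",      # placeholder name
    document_id=1,                    # placeholder identifier
    document_date=date(2010, 5, 7),   # assumed preparation date
    problem_date=date(2010, 5, 5),
    prepared_by="Example Technician", # placeholder name
    problem_description="Replication between ORAPGH and DB2DEL stopped.",
    business_impact="A 37-minute replication delay; DB2DEL reports were not current.",
    root_cause="A software code issue caused replication to terminate abnormally.",
)

print("Sections still to complete:", report.missing_sections())
```

A check like `missing_sections()` is a simple guard against sending out a document whose future prevention list or signatures are still blank.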
The Importance of Following Up
If you have been reading my previous blogs, you know that we feel so strongly about customer feedback at Remote DBA Experts that we have created a customer feedback strategy called “The Customer Feedback Engine.” We have established multiple communication flows to ensure that we receive feedback from all of the personnel that we support including management, DBAs, developers and end-users.
One of the key strategies is the role our Service Assurance Manager plays. Remote DBA Experts’ Service Assurance Manager contacts all customers when a problem occurs. Whether or not we caused the problem is immaterial. We feel that it is our responsibility as a service provider to let our customers know that their applications’ performance and availability are important to us.
As my old boss Dan Pizzica used to tell me (when I was a VERY junior technician) “It really doesn’t make a difference who or what broke YOUR database (strong emphasis on the word YOUR). You are the technician who is ultimately responsible for fixing it. The buck stops with you. If you can’t protect your environments, you aren’t doing your job.” We all know he’s absolutely correct.
It was not the intent of this blog to coerce readers into using the Root Cause Corrective Action information we provide to customers here at Remote DBA Experts verbatim. It was to promote the benefits that a structured, well thought out problem notification document provides to all of us who are responsible for keeping our customers’ environments highly available and high performing.
Thanks for Reading,
Director Of Service Delivery