Sunday, April 1, 2012

Post mortem of an epic disaster that almost happened

I seldom blog about work, but I can't resist writing an entry on one of the most miserable failures I have seen in my career.

All names have been changed to protect the guilty and the innocent alike.

Ahem. Here I go.

People who know me find me pretty confrontational at times, because I insist on drilling down to the most painful spot to address a problem at its root, while most people prefer to sweep the dirt under the carpet to maintain the status quo. I have learned from history that it is a timeless truth: ignoring a problem, or pretending that it doesn't exist, will never solve it. Instead, when left unchecked, the problem will snowball into a Hollywood-class disaster that will make Fukushima seem like paradise.

Although some short-term pain may be unpleasant and unavoidable, it is my prerogative to steamroll over a problem as long as it is in my interest to do so. At work, I more or less run a team of contractors and vendors as part of my job in the IT infrastructure department. Having learned from a couple of negative examples, I know that it is important to keep morale high, and the way I do so is to let my team know that their contribution is important and recognised, never taken for granted. The team also needs to be properly protected from abuse and misuse by external parties. This gives me a reliable team that is willing to go the extra mile whenever needed.

This is somewhat lacking within my organisation. We engage many contractors and vendors to support our operations, but they are often treated unfairly as outsiders, and the bias is obvious throughout. However, that is not the topic of this blog entry. The spotlight is on what happens when you blindly treat them as resources and not as your own arms and legs.

The star of this story is one of the developers in the application support team, who is in charge of several key systems that are crucial to the operation of the organisation. Let's just call him Developer X. He had been supplied by a vendor for close to a year at the time of writing.

Over the course of the year, there were a number of incidents on one of the key systems. As I dug deeper, the investigation showed that the application was generating multiple errors while the IT infrastructure systems were working perfectly. Developer X took a quick peep at the errors and did not investigate further, saying that the errors had not caused the failures so far. Troubled by the lack of evidence that the troubleshooting had been done properly and, more importantly, sick of wasting my valuable resources on handling application failures, I finally ran out of patience. Instead of treating the failures as isolated incidents, I grouped them as a chain of incidents and brought them to the attention of Developer X's manager, the business owner and the VP. Even so, progress was glacial at best, with the case running into triple digits in downtime. I don't understand why the business owner didn't tighten the screws on Developer X when they were paying for the application.

Unfortunately, the fun had just begun. Due to a company restructuring exercise, changes were required for two of the key systems. The deadline was to be two months before the deployment to the production environment. In the last few weeks before the deadline, it became apparent that Developer X did not understand the requirements of the exercise. There were so many missing and incomplete deliverables that it would have been a complete disaster had the restructuring deadline not been extended by another month for other reasons.

Getting really worried, and convinced that we were going to have a couple of major deployment failures and serious data integrity issues in the production environment, I decided that hopium wasn't in my diet and the stakes were too great. Although it is not my job to meddle with application development, I had to divert attention to this exercise to prevent a perfectly preventable catastrophe on multiple key systems.

It turned out to be a case of a bad apple, with the manager unaware of the situation and a can of worms to be sorted out. Things got more and more amazing as we went along. The developer didn't understand how the applications functioned and didn't know what was needed to complete the project. It got even better when Developer X wanted to deploy the solution into the UAT environment although it had been approved by neither his manager nor the business owner. Furthermore, the deployment process was not followed, and there was no proper documentation or testing.

It didn't help that we had trouble understanding Developer X because of communication problems, and that Developer X's comprehension skills were blindingly poor.

This was a major red flag and a recipe for disaster. At this point, it was obvious that we had a severe competency issue, and the problem had to be escalated all the way to the VP. I could see that there was a high risk of data corruption due to the design of the solution, and of deployment failure due to poor documentation.

The vendor has to send in

The deployment plan turned out to be problematic as well, with important activities being overlooked.

Things wa
