Wednesday, May 13, 2009

Starting With Root Cause Analysis (and mitigation)

Root cause analysis doesn't have its roots in software. It was originally applied to manufacturing processes and has since been adapted to software defects.

One approach is to list known causes (defect in requirements, defect in design, defect in coding, defect in testing, defect in environment, etc.) and associate each defect with one of those causes. That is not what I'll be doing here.

Instead, I'll walk through a thought experiment: for some subset of failures, determine the likely series of root causes, then manage those findings over the long term so that your product gets better as your team learns from its mistakes.

First, you likely only have the bandwidth to do this for relatively few items, so it's important to pick those items well. One excellent candidate is user-reported bugs that are deemed 'critical' (hopefully, you have some reasonable way to decide this). In all cases, it helps if the bug being analyzed has already been fixed and verified as fixed.

Then, for each defect, record the following (this is my list; adapt it to your needs):
  • ID  / title - This comes from your bug tracking system and is used to clearly identify what issue is being discussed
  • Function -  This is the actual system capability that failed and it may be part of a formal list of system capabilities or may be some general statement such as 'Detailed Data Display'
  • Effect -  This is the impact to the user or some discoverable impact to the system.  It may or may not be the same as your title, depending on your defect report standards. A good example would be 'User unable to log in after changing to a long password'
  • Failure -  This is a description of the behavior or design where what was implemented differs from the expectations. It may or may not be the same as your title, depending on your defect report standards. Often, it is more detailed than the Effect and may require some code or environment analysis to clarify.  A good example of this is 'Users are able to create new passwords that are more than 45 characters long, but any password longer than that will not be validated successfully'
  • Notes - This is where you can discuss any historical / contextual information that doesn't fit elsewhere, but that would be helpful if reviewed in the future. An example could be "This appears to be an issue that has existed since the product's first release, before we did formalized testing"
  • Cause(s) -  This is where you apply one of several techniques. Rather than go into them here, you can read about Ishikawa Diagrams and make a Pareto Chart to fill out the next steps. There are other related topics you can apply to fill out this section if you find they suit your needs better.
  • Recommended Actions - If you created a Pareto chart, the highest contributors are at the top of your list of causes. Work your way down the list, addressing items insofar as doing so is helpful. It's likely that some of the causes will need to be left as continued exposure to risk if the cost of addressing them is not acceptable.
  • Responsibility - Not only do you identify actions, but you need to assign them to someone. If you have other methods of assigning work, you can simply refer to that here. For now, we'll assume that this spreadsheet is used to track these items.
  • Target Completion Date - The person who is responsible for completing this action should come up with an acceptable date to complete it. Record that date here.
  • Action Taken - In the end, it's possible that the action taken isn't exactly what was recommended. If it differs, record what was actually done here.
  • Date Action Taken - Record the date that the action was completed so that you can focus only on the items that are not completed.
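
As a rough illustration of the Pareto step mentioned under Cause(s) and Recommended Actions, here is a minimal Python sketch that tallies the causes recorded across your analyzed defects and ranks them with a running cumulative percentage. The function name and the sample cause labels are hypothetical, made up for this example; use whatever cause wording your own analysis produces.

```python
from collections import Counter


def pareto(causes):
    """Rank causes from most to least frequent.

    Returns a list of (cause, count, cumulative_percent) tuples.
    `causes` is a flat list with one entry per cause recorded on a defect.
    """
    counts = Counter(causes)
    total = sum(counts.values())
    rows = []
    running = 0
    # most_common() yields (cause, count) pairs, highest count first.
    for cause, count in counts.most_common():
        running += count
        rows.append((cause, count, round(100 * running / total, 1)))
    return rows


# Hypothetical cause labels, for illustration only.
recorded_causes = [
    "no boundary test for password length",
    "no boundary test for password length",
    "field limits differ between screens",
    "no boundary test for password length",
    "requirement did not state a maximum",
]

for cause, count, cumulative in pareto(recorded_causes):
    print(f"{count:>3}  {cumulative:>5}%  {cause}")
```

The causes at the top of the output are the first candidates for Recommended Actions; how far down the list you go is a cost judgment, not something the tally decides for you.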
There is an implied workflow here. Unless you are doing this as a one-time exercise, you will need to update this list on some schedule. Once you have assigned actions, you will need to follow up with the people responsible to ensure that the actions are completed, or that any needed changes are made, so that the mitigation is complete.
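
If you keep the list somewhere you can script against instead of a spreadsheet, the follow-up pass could look something like the sketch below. It is only one possible shape: the `Action` class, its field names, and the sample bug IDs are assumptions loosely modeled on the columns above, not a prescribed format.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Action:
    """One tracked action; fields are hypothetical names for the columns above."""
    defect_id: str
    recommended_action: str
    responsibility: str
    target_date: date
    action_taken: Optional[str] = None
    date_action_taken: Optional[date] = None


def open_actions(actions: List[Action]) -> List[Action]:
    """Actions with no completion date, earliest target date first."""
    pending = [a for a in actions if a.date_action_taken is None]
    return sorted(pending, key=lambda a: a.target_date)


def overdue(actions: List[Action], today: date) -> List[Action]:
    """Open actions whose target date has already passed."""
    return [a for a in open_actions(actions) if a.target_date < today]


# Hypothetical sample rows, for illustration only.
tracked = [
    Action("BUG-101", "Add boundary tests for password length", "Alice", date(2009, 6, 1)),
    Action("BUG-102", "Align field limits across screens", "Bob", date(2009, 5, 20)),
    Action("BUG-103", "Document maximum lengths", "Carol", date(2009, 5, 1),
           action_taken="Added to requirements", date_action_taken=date(2009, 5, 10)),
]

for item in overdue(tracked, today=date(2009, 5, 25)):
    print(f"{item.defect_id}: {item.recommended_action} ({item.responsibility})")
```

Filtering on Date Action Taken is what lets you focus only on the items that are not completed; the overdue list is simply the subset to follow up on first.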

In addition to online resources regarding root cause analysis, you can also look for classroom training.
