Someone asked me recently why I insist on calling some things incidents and others problems. From this person's perspective it was hard to see the difference. The bigger question he didn't ask was why it is so important to separate the two. So today I am going to try to explain the difference and why I think it matters to you on your path to becoming a DevOps Master.
An incident as defined by ITIL is as follows: "An unplanned interruption to an IT Service or a reduction in the Quality of an IT Service. Failure of a Configuration Item that has not yet impacted Service is also an Incident. For example Failure of one disk from a mirror set." A configuration item is just about anything in ITIL terms. Everything from a NIC or hard drive to a web service can be declared a configuration item. It's just something to tie incidents and problems to.
A problem as defined by ITIL is as follows: "A cause of one or more Incidents. The cause is not usually known at the time a Problem Record is created, and the Problem Management Process is responsible for further investigation."
Here are some other quick-hit differences:
Problems should be traced down to a root-cause level, whereas incidents should be resolved as quickly as possible to restore the service to operation.
Incidents, like a known issue with a memory leak in a Java application, need to become a Problem when they occur repeatedly.
Incidents can generally be resolved by the person doing the work.
Problems must be closed by a manager assigned in the problem management process.
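To make the "repeated incidents become a problem" rule concrete, here is a minimal sketch in Python. Nothing in it comes from ITIL itself: the Incident record, the find_problem_candidates function, and the threshold of three occurrences are all assumptions made up for illustration, and you would pick your own fields and threshold.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record tied to a configuration item."""
    config_item: str  # e.g. "billing-app"
    symptom: str      # e.g. "out of memory"


def find_problem_candidates(incidents, threshold=3):
    """Return (config_item, symptom) pairs seen at least `threshold` times.

    The default of 3 is an arbitrary assumption, not an ITIL rule.
    """
    counts = Counter((i.config_item, i.symptom) for i in incidents)
    return sorted(key for key, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    history = [
        Incident("billing-app", "out of memory"),
        Incident("billing-app", "out of memory"),
        Incident("edge-firewall", "cable unplugged"),
        Incident("billing-app", "out of memory"),
    ]
    for config_item, symptom in find_problem_candidates(history):
        print(f"Open a problem record: {config_item} / {symptom}")
```

With this sample history only the recurring memory issue is flagged. The one-off unplugged cable stays an incident.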
So why does it matter to a DevOps Master?
DevOps Masters avoid solving incidents with scripts when possible. Their time is better spent solving problems or, even better, working to prevent incidents with repeatable and auditable processes.
A large percentage of incidents are caused by human error, whether that is tripping over a power cable, missing a step in a documented procedure, or doing something manually that should have been automated. You can't prevent people from making mistakes; you can only help them avoid them. Spending time investigating why someone caused an incident by skipping a procedural step is wasteful. Instead, a DevOps Master should be writing a script that steps through the complete process. Scripts aren't foolproof, but they are consistent and repeatable.
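As a rough illustration of scripting a documented procedure so no step can be skipped, here is a small Python sketch. The run_procedure function and the step names in the example are hypothetical, and the real actions would call whatever tooling your environment uses. The point is only that every step runs in the same order every time, each one is logged for audit purposes, and the run stops at the first failure.

```python
def run_procedure(steps, log=print):
    """Run (name, action) steps in order, logging each one.

    Stops at the first failing step and re-raises its exception,
    so later steps never run against a half-finished change.
    Returns the names of the steps that completed.
    """
    completed = []
    for name, action in steps:
        log(f"START {name}")
        try:
            action()
        except Exception as exc:
            log(f"FAILED {name}: {exc}")
            raise
        log(f"DONE {name}")
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder actions; real ones would do the actual work.
    def take_out_of_rotation():
        pass

    def apply_patch():
        pass

    def restart_service():
        pass

    run_procedure([
        ("take server out of rotation", take_out_of_rotation),
        ("apply patch", apply_patch),
        ("restart service", restart_service),
    ])
```

The log output doubles as a simple audit trail, since it records which steps ran and where a run stopped.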
Problem Management processes almost always require Root Cause Analysis (RCA). Root Cause Analysis done properly requires data. The more complex the system and the problem, the more data you must collect. When an incident occurs for the first time you generally aren't expecting it, so you may not have the right monitors in place, or you may need to adjust them. An RCA for anything beyond the obvious, like tripping over a cable or changing a system-level password during a maintenance window, may take weeks to complete. Assuming you do an RCA for a problem, you should create a mitigation plan and a strategy to permanently correct the problem.
Your goal as a DevOps Master should always be to prevent incidents from affecting the environment, whether that means deploying more servers, adding more proactive monitoring, or making sure all the cables are safely under the floor where no one can trip over them. You will accept that incidents will happen. You will plan for and test your reactions to whatever you can think of in advance. You won't waste time trying to figure out what the new kid John was thinking when he pulled the cable from your telecommunications provider out of the firewall. It's just one occurrence and not worth investigating.