Monitoring DevOps tools

If you think there are a lot of tools to configure your systems, you haven't looked at the tools available to monitor them. The set is so large that it is easy to get overwhelmed. So again in this article I am going to give you the list of questions I use to narrow the field. Then I am going to give you a list of my favorites.

  • Is it agent, agentless, or hybrid? As with most configuration management tools, this question cuts both ways. The best agent based tools in this class have well documented deployment paths that use various configuration management tools. For instance, they will have Chef or Puppet packages that cut down your time to deploy them tremendously. The choice here comes down to how much time you have to deploy the tool and how fast a response you need from it. Agent based tools are faster in most cases. Agentless tools rely on some form of remote execution tool like ssh or remote PowerShell and an SNMP (Simple Network Management Protocol) agent. Because the server in an agentless system has to do all the work of polling, these tools tend to be more complex to scale. They can also carry more risk because you have to allow more ports through your firewall. Hybrids allow you to deploy the tool in different ways depending on the security requirements, so for medium and large companies they tend to be a better choice.
  • How does reporting work? Reporting is what you need the tool to do, so paying attention to it is critical. The tools vary widely in the number and type of standard reports they have. They also vary in how easy it is to build custom reports. In more and more cases monitoring tools are paired with a reporting system to handle this issue. Writing custom reports can be as simple as using a GUI, or as difficult as learning a DSL (Domain Specific Language) or a traditional programming language. If your business needs reports from your systems, be sure to confirm that you can create the reports it needs. For instance, can you easily get a report to tell you how many people failed to sign up for your mailing list? Can you tell where people are stopping or failing to complete an action?
  • Does it do alerting, and if so, how can you be alerted? Alerting sounds like a no-brainer, but not all tools do it. Some tools are just built to display a set of stats for people to analyze, which means they are normally easier to deploy and configure. At the other end of the scale are tools that will try to predict failures and alert you before the problem happens. This sounds great, and is cool, but you need to know a lot about your environment so that you can set the boundaries around good and bad events. That means the tool will require a lot of time to tune properly to remove false positives. Also, can an alert trigger an action? If it can, then you can automate simple things and free a human to sleep or do something more productive. If the tool can tell you that the disk is filling up, can it also do the steps your team would do to free up space? This will help a lot with work-life balance.
  • What dashboards are available out of the box? Can you customize them simply? Most of the tools come with a set of what are called canned dashboards. When you are starting off with monitoring tools, it's best to choose a tool with more of them rather than fewer, as long as they tell you something. If the tool has a great dashboard for monitoring Java applications but your company writes its apps in Ruby, then what good is it to you? All of them will let you customize these dashboards. You can roll up the stats so you can show the entire set of a stat in an environment (Development, Test, Production) in one chart. Be careful, though: as techs we love our data, and chart customization can get out of hand. Over time you will want to add custom dashboards to make it easier to troubleshoot your devices.
  • How resource intensive is it on the server, the network, and the clients being monitored? This is another one of those "it depends" discussions. If you are only monitoring a small number of things (computers, network equipment, etc.) then this is less of a concern. You still need to keep it in mind, though, because your first instinct is going to be to monitor everything. And we can, both from a device and a data point perspective. I have seen, and caused, situations where we have monitored ourselves to death. For instance, at one company where we had limited bandwidth to our remote sites, we overwhelmed the network with just monitoring traffic. There was no bandwidth left for little things like file transfers and pulling up the company intranet. The problem is that you often may not be able to answer this question until you do a proof of concept with the tool. As a general rule, agent based tools can help a lot because they only need to send changes and not everything. In all cases, though, you need to be sure that you are going to get something from the data you collect. Collecting too much also makes it harder to filter when the time comes to create dashboards and reports.
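To make the "can an alert trigger an action?" question from the alerting item more concrete, here is a minimal Python sketch of a disk check that calls a cleanup routine when usage crosses a threshold. It is not how any particular monitoring tool works. The names check_disk and free_space, and the 90% threshold, are assumptions made up for this illustration. The only real library call is shutil.disk_usage from the Python standard library.

```python
import shutil
from typing import Callable, Optional

# Hypothetical threshold; tune it for your own environment.
DISK_ALERT_THRESHOLD = 0.90


def percent_used(total: int, used: int) -> float:
    """Return the used fraction of a disk, between 0.0 and 1.0."""
    return used / total if total else 0.0


def check_disk(path: str,
               threshold: float = DISK_ALERT_THRESHOLD,
               on_alert: Optional[Callable[[str, float], None]] = None) -> bool:
    """Return True, and run on_alert if given, when usage is at or above threshold."""
    usage = shutil.disk_usage(path)
    fraction = percent_used(usage.total, usage.used)
    if fraction >= threshold:
        if on_alert is not None:
            on_alert(path, fraction)
        return True
    return False


def free_space(path: str, fraction: float) -> None:
    """Placeholder for the steps your team would normally do by hand."""
    print(f"{path} is {fraction:.0%} full; running cleanup steps")


if __name__ == "__main__":
    check_disk("/", on_alert=free_space)
```

In a real deployment the body of free_space would hold whatever your team does today when the disk fills up, such as rotating logs or clearing a cache directory, and the threshold is exactly the kind of boundary you will spend time tuning to avoid false positives.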

The best way to handle most of these issues is to define a set of things you know you need to monitor, things you think you want to monitor, and things you know you don't need to monitor. Then apply that list to the questions above. It's a simple flow: can I get the data, can I report on it, can I make a dashboard for it, and finally, will we have enough resources for all of it?
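The last question in that flow, whether you will have enough resources, is hard to answer for certain without a proof of concept, but a back-of-the-envelope estimate can at least tell you whether you are in the right ballpark. The Python sketch below is only that. The function name monitoring_kbps and every number in it (device count, metrics per device, bytes per sample, polling interval) are assumptions chosen for illustration, not measurements from any tool.

```python
def monitoring_kbps(devices: int, metrics_per_device: int,
                    bytes_per_metric: int, interval_seconds: int) -> float:
    """Average monitoring traffic in kilobits per second for one site."""
    bytes_per_second = (devices * metrics_per_device * bytes_per_metric
                        / interval_seconds)
    return bytes_per_second * 8 / 1000


# Hypothetical remote site: 50 devices, 200-byte samples, polled every 60 seconds.
need_only = monitoring_kbps(50, 20, 200, 60)       # just the "need" list
need_and_want = monitoring_kbps(50, 120, 200, 60)  # "need" plus "want" lists

print(f"need list only:    {need_only:.1f} kbit/s")
print(f"need + want lists: {need_and_want:.1f} kbit/s")
```

Comparing the two numbers against the link speed to a remote site shows how quickly the "want" list can eat into bandwidth. Real traffic also depends on protocol overhead and on whether the tool sends every sample or only changes, so treat the result as a lower bound to check during the proof of concept.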

The problem I have with this set of tools is that they all have a high level of complexity during the implementation phase. Even the simplest of them can take several person-weeks to get set up correctly and start returning on the investment. Once you do have it set up, you will be amazed at how much it will help you become more efficient.

You should also be open to deploying multiple tools in this class. Monitoring a multi-tiered application completely may seem easy at first, but it is difficult to do accurately. Keep in mind that monitoring anything is a complex process. It is not uncommon for companies to deploy two or three tools to meet all of their monitoring needs at a complexity level that makes sense for them. You may have one tool for monitoring basic information like disk space and CPU utilization, another to monitor application health, and a third to monitor user behavior.

Ok, so what are my recommendations? Here are a few of my favorites.

Nagios (http://www.nagios.org/) - This is where a lot of the following tools started. It is a great tool, but as the number of your systems starts to increase, configuration can be difficult to manage effectively. That is the problem the next two set out to solve.

Zenoss (http://www.zenoss.com/documents/datasheet_core_commercial_compare.pdf) - This is a freemium application with a community version that does the basics well. The commercial version adds more analysis and optimization information. Check the linked PDF for a more detailed explanation.

Groundwork (http://www.gwos.com/pricing/enterprise/) - This tool takes a lot of the hard edges off of Nagios. They continue to add features and let you monitor your first 50 hosts for free with their enterprise tool.

In the coming months we will be adding a link on the site to a list of as many tools as we can find, along with our opinion of each. The tools above are a great starting point, but before you make a decision, look at others and make sure they don't better serve your monitoring needs.

Our sister site has reviews of Zenoss and Groundwork that are worth a look, even if they are somewhat old at this point.