Sunday, September 30, 2012

What I want in a monitoring tool

I started a new job a few weeks ago, and I'm now at a point where I'm investigating monitoring options. At past jobs I used Nagios, which I know will work, but I would like to look into other more modern tools. I am aware that #monitoringsucks, and I am pretty sure people have hashed these topics before, but here are some of the things I want from a modern monitoring tool:
  • Ideally open source, or if not, affordable per-host, per-month pricing (we already signed up as a paying customer of Boundary, for example)
  • Installation and configuration should be easily scriptable
    • server installation, as well as addition/modification of clients, should be easy to automate so it can be done with Puppet/Chef
    • API would be ideal
  • Robust notifications/alerting rules
    • escalations
    • service dependencies
    • event handler scripts
    • alerts based on subsets of hosts/services
      • for example, alert me only when 2+ servers of the same type are down (see the sketch after this list)
  • Out-of-the-box plugins
    • database-specific checks for example
  • Scalability
    • the monitoring server shouldn't become a bottleneck as more clients are added
      • Nagios is OK with 100-200 clients (with passive checks)
    • hierarchy of servers should be supported
    • agent-based clients
  • Reporting/dashboards
    • Hosts/services status dashboards
    • Downtime/outages dashboards
    • Latency (for HTTP checks)
  • Resource graphing would be great
    • but in my experience very few tools do both alerting and resource/metrics graphing well
    • in the past I used Nagios for alerting and Ganglia/Graphite for graphing
  • Integration with other tools
    • Send events to graphing tools (Graphite), alerting tools (PagerDuty), notification mechanisms (irc, Campfire), logging tools (Graylog2)
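To make the "alert on subsets of hosts/services" idea above more concrete, here is a rough sketch of what such an aggregate check could look like as a Nagios-style plugin (exit 0/1/2). The get_host_states() helper is hypothetical -- in real life it would query your monitoring system's API or status file:

```python
#!/usr/bin/env python
# Aggregate check sketch: only go critical when 2+ servers of the same
# role are down. get_host_states() is a hypothetical helper -- in
# practice it would query your monitoring system's API or status file.
import sys
from collections import defaultdict

OK, WARNING, CRITICAL = 0, 1, 2  # Nagios plugin exit codes

def get_host_states():
    """Hypothetical: return (hostname, role, is_up) tuples."""
    return [
        ('web01', 'web', True),
        ('web02', 'web', False),
        ('db01', 'db', True),
    ]

def main(threshold=2):
    down_by_role = defaultdict(list)
    for host, role, is_up in get_host_states():
        if not is_up:
            down_by_role[role].append(host)

    critical = dict((r, h) for r, h in down_by_role.items() if len(h) >= threshold)
    if critical:
        details = '; '.join('%d %s servers down (%s)' % (len(h), r, ', '.join(h))
                            for r, h in critical.items())
        print('CRITICAL: %s' % details)
        return CRITICAL
    if down_by_role:
        print('WARNING: single hosts down: %s' % dict(down_by_role))
        return WARNING
    print('OK: all monitored hosts are up')
    return OK

if __name__ == '__main__':
    sys.exit(main())
```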
I also asked a question on Twitter about what monitoring tool people would recommend, and here are some of the tools mentioned in the replies:
  • Sensu
  • OpenNMS
  • Icinga
  • Zenoss
  • Riemann
  • Ganglia
  • Datadog
Several people told me to look into Sensu, and a quick browsing of the home page tells me it would be worth giving it a whirl. So I think I'll do that next. Stay tuned, and also please leave a comment if you know of other tools that might fit the profile I am looking for.

Update: more tools mentioned in comments or on Twitter after I posted a link to this blog post:
  • New Relic (which I am actually in the process of evaluating, having paid for 1 host)
  • Circonus
  • Zabbix
  • Server Density
  • Librato
  • Comostas
  • OpsView
  • Shinken
  • PRTG
  • NetXMS
  • Tracelytics

Sunday, September 23, 2012

3 things to know when starting out with cloud computing

In the same vein as my previous post, I want to mention some of the basic but important things that someone starting out with cloud computing needs to know. Many times people see 'the cloud' as something magical, as the silver bullet that will solve all their scalability and performance problems. These people are in for a rude awakening if they don't pay attention to the following points.

Expect failure at any time

There are no guarantees in the cloud. Failures can and will happen, suddenly and mercilessly. Their frequency will increase as you increase the number of instances that you run in the cloud. It's a sickening feeling to realize that one of your database instances is gone, and there's not much you can do to bring it back. At that point, you need to rely on your disaster recovery plan (and you have a DR plan, don't you?) and either launch a new instance from scratch, or, in the case of a MySQL master server for example, promote a slave to a master. The silver lining in all this is that you need a disaster recovery plan anyway, even if you host your own servers in your own data center space. The cloud just makes failures happen more often, so it will test your DR plan more thoroughly. In a few months, you will be an expert at recovering from such failures, which is not a bad thing.

In fact, knowing that anything can fail in the cloud at any time forces you to architect your infrastructure such that you can recover from failures quickly. This is a worthy goal to have no matter where your infrastructure is hosted.

There's more to expecting failures at any time. You need a solid monitoring system to alert you that failures are happening. You need a solid configuration management system to spin up new instances of a given role quickly and with as little human intervention as possible. All these are great features to have in any infrastructure.

Automation is key

If you only have a handful of servers to manage, 'for' loops with ssh commands in a shell script can still seem like a good enough way of running tasks across the servers. This method breaks down at scale. You need to use more industrial-grade configuration management tools such as Puppet, Chef or CFEngine. The learning curve can be pretty steep for most of these tools, but it's worth investing your time in them.

Deploying automatically is not enough though. You also need ways to ensure that things have been deployed as expected, and that you haven't inadvertently introduced new bugs. Automated testing of your infrastructure is key. You can achieve this either by writing scripts that check to see if certain conditions have been met, or, even better, by adding those checks to your monitoring system. This way, your monitoring checks are your deployment unit tests.
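As a rough illustration of monitoring checks doubling as deployment unit tests, here is a minimal Nagios-style plugin that verifies a deployed version string; the URL and expected version are placeholders for whatever your deployment process actually publishes:

```python
#!/usr/bin/env python
# Deployment check sketch, written as a Nagios-style plugin:
# exit 0 = OK, 2 = CRITICAL. The URL and expected version are
# placeholders for whatever your deploy process publishes.
import sys
import urllib2  # Python 2 standard library

EXPECTED_VERSION = '1.4.2'                  # hypothetical expected build
VERSION_URL = 'http://localhost/version'    # hypothetical health endpoint

def main():
    try:
        deployed = urllib2.urlopen(VERSION_URL, timeout=5).read().strip()
    except Exception as e:
        print('CRITICAL: could not fetch %s (%s)' % (VERSION_URL, e))
        return 2
    if deployed != EXPECTED_VERSION:
        print('CRITICAL: deployed version %s, expected %s' % (deployed, EXPECTED_VERSION))
        return 2
    print('OK: version %s is deployed' % deployed)
    return 0

if __name__ == '__main__':
    sys.exit(main())
```

Run this check from your monitoring system right after a deploy, and a broken rollout shows up the same way any other outage does.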

If the cloud is a kingdom, its APIs are the crown jewels. You need to master those APIs in order to automate the launching and termination of cloud instances, as well as the creation and management of other resources (storage, load balancers, firewall rules). Libraries such as jclouds and libcloud help, but in many situations you need to fall back to the raw cloud provider API. Having a good handle on a scripting language (any decent scripting language will do) is very important.
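For example, here is a minimal libcloud sketch of listing, launching and terminating EC2 instances. The credentials, region, AMI id and instance size are placeholders, and the exact provider constants and keyword arguments vary a bit between libcloud versions:

```python
# Minimal libcloud sketch: list, launch and terminate EC2 instances.
# Credentials, region, AMI id and size below are placeholders.
from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver

cls = get_driver(Provider.EC2)
driver = cls('MY_ACCESS_KEY', 'MY_SECRET_KEY', region='us-east-1')

# Inspect what is already running
for node in driver.list_nodes():
    print('%s %s %s' % (node.id, node.name, node.state))

# Launch a new instance from a given AMI and instance size
size = [s for s in driver.list_sizes() if s.id == 'm1.small'][0]
image = driver.list_images(ex_image_ids=['ami-xxxxxxxx'])[0]
node = driver.create_node(name='web-test', image=image, size=size)

# ...and terminate it when you are done with it
driver.destroy_node(node)
```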

Dashboards are essential

One of the reasons, and maybe the main reason, that people use the cloud is that they want to scale their infrastructures horizontally. They want to be able to add new instances at every layer (web app, database, caching) in order to handle anything their users throw at their web site. Simply monitoring those instances for failures is not enough. You also need to graph the resources that are consumed and the metrics that are generated at all layers: hardware, operating system, application, business logic. Tools such as Graphite, Ganglia, Cacti, etc. can be of great assistance. Even homegrown dashboards based on tools such as the Google Visualization API can be extremely useful (I wrote about this topic here).
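Pushing a metric into Graphite, for example, can be as simple as writing one line to its plaintext listener on TCP port 2003; the host name and metric path below are just placeholders:

```python
# Sketch of pushing a metric to Graphite via its plaintext protocol:
# one "metric_path value timestamp" line per metric, TCP port 2003.
# The Graphite host and metric path are placeholders.
import socket
import time

GRAPHITE_HOST = 'graphite.example.com'  # placeholder
GRAPHITE_PORT = 2003                    # default plaintext listener port

def send_metric(path, value, timestamp=None):
    timestamp = int(timestamp or time.time())
    line = '%s %s %d\n' % (path, value, timestamp)
    sock = socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=5)
    try:
        sock.sendall(line.encode('ascii'))
    finally:
        sock.close()

# e.g. record the number of busy Apache workers on web01
send_metric('servers.web01.apache.busy_workers', 42)
```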

One aspect of having these graphs and dashboards is that they are essential for capacity planning, because they help you establish correlations between the traffic that hits your web site and the resources that are consumed and can become a bottleneck (such as CPU, memory, disk I/O, network bandwidth, etc). If you run your database servers in the cloud, for example, you will very quickly realize that disk I/O is your main bottleneck. You will be able to correlate high CPU wait numbers with slowness in your database, which translates into slowness in your overall application. You will then know that if you reach a certain CPU wait threshold, it's time to either scale your database server farm horizontally (easier said than done, especially if you do manual sharding) or vertically (which may not even be possible in the cloud, because you have pretty restrictive choices of server hardware).
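As a rough example of what "high CPU wait numbers" means in practice, here is a small sketch that samples /proc/stat on Linux and computes the iowait percentage; the 30% threshold is an arbitrary example, not a magic number:

```python
# Sketch of measuring the CPU iowait percentage on Linux by sampling
# /proc/stat twice. The alert threshold is an arbitrary example.
import time

def cpu_times():
    with open('/proc/stat') as f:
        fields = f.readline().split()[1:]  # first line: aggregate 'cpu' counters
    return [int(x) for x in fields]

def iowait_percent(interval=5):
    t1 = cpu_times()
    time.sleep(interval)
    t2 = cpu_times()
    deltas = [b - a for a, b in zip(t1, t2)]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0  # 5th counter is iowait

pct = iowait_percent()
print('CPU iowait: %.1f%%' % pct)
if pct > 30:  # example threshold only
    print('WARNING: disk I/O is likely the bottleneck')
```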

As an aside: do not run your relational database servers in the cloud. Disk I/O is so bad, and relational databases are so hard to scale horizontally, that you will soon regret it. Distributed NoSQL databases such as Riak, Voldemort or HBase are a different matter, since they are designed from the ground up to scale horizontally. But for MySQL or PostgreSQL, go for bare metal servers hosted at a good data center.

One last thing about scaling horizontally is the myth of autoscaling. I've very rarely seen companies that are able to scale their infrastructure up and down based on traffic patterns. Many say they do it, but when you get down to it, it's mostly a manual process. It's somewhat easier to scale up; scaling down is a totally different matter. At that point you need to make sure that the resources you are taking away from your infrastructure are properly accounted for. Let's say you terminate a web server instance behind a load balancer -- is the load balancer aware that it shouldn't send traffic to that instance any more?
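As a sketch of what "properly accounted for" means in this case, here is the order of operations using the boto 2 APIs of the day; the load balancer name, instance id and region are placeholders:

```python
# Sketch of scaling down cleanly with the boto 2 EC2/ELB APIs: take the
# instance out of the load balancer first, then terminate it.
# The load balancer name, instance id and region are placeholders;
# boto reads AWS credentials from the environment or ~/.boto.
import boto.ec2
import boto.ec2.elb

REGION = 'us-east-1'
LB_NAME = 'my-load-balancer'   # placeholder
INSTANCE_ID = 'i-12345678'     # placeholder

elb_conn = boto.ec2.elb.connect_to_region(REGION)
ec2_conn = boto.ec2.connect_to_region(REGION)

# 1. Stop sending traffic to the instance
elb_conn.deregister_instances(LB_NAME, [INSTANCE_ID])

# 2. Only then terminate it
ec2_conn.terminate_instances(instance_ids=[INSTANCE_ID])
```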

Another aspect of having graphs and dashboards is to see how business metrics (such as sales per day, the number of people who abandoned their shopping carts, payment transactions, etc.) are affected by bottlenecks in your infrastructure. This assumes that you do graph business metrics, which you should. Without dashboards, you are flying blind.


I could go on with more caveats about the cloud, but I think that these 3 topics are a good starting point. The interesting thing about all 3 topics though is that they apply to any infrastructure, not only to cloud-based ones. But the scale of the cloud puts these things into their proper perspective and forces you to think them through.


