Thursday, November 29, 2012

Code performance vs system performance

Just a quick thought: as non-volatile storage becomes faster and more affordable, I/O will cease to be the bottleneck it currently is, especially for database servers. Granted, some applications and web sites will always have to shard their database layer because they deal with a volume of writes well above what a single DB server can handle (I'm talking about mammoth social media sites such as Facebook, Twitter and Tumblr). By database in this context I mean relational databases. NoSQL-style databases worth their salt are distributed from the get-go, so I am not referring to them in this discussion.

For people who are hoping not to have to shard their RDBMS, memcached for reads and super-fast storage such as FusionIO for writes give them a chance to scale up a single database server for a much longer period of time. By a single database server I mostly mean the server where the writes go, since reads can be scaled more easily by sending them to slaves of the master server (in the MySQL world, for example).

In this new world, the bottleneck at the database server layer becomes not the I/O subsystem, but the CPU. Hence the need to squeeze every ounce of performance out of your code and out of your SQL queries. Good DBAs will become more important, and good developers writing efficient code will be at a premium. Performance testing will gain a greater place in the overall testing strategy as developers and DBAs will need to test their code and their SQL queries against in-memory databases to make sure there are no inefficiencies in the code.

I am using the future tense here, but the future is upon us already, and it's exciting!

Friday, November 09, 2012

Quick troubleshooting of Sensu 'no keepalive from client' issue

As I mentioned in a previous post, we started using Sensu as our internal monitoring tool. We also integrated it with PagerDuty. Today we terminated an EC2 instance that had been registered as a client with Sensu. Soon after, I started to get paged with messages of the type:

 keepalive : No keep-alive sent from client in over 180 seconds

Even after removing the client from the Sensu dashboard, the messages kept coming. My next step was of course to get on the #sensu IRC channel. I immediately got help from robotwitharose and portertech.  They had me try the following:

1) Try to remove the client via the Sensu API.

I used curl and ran:

curl -X DELETE http://sensu.server.ip.address:4567/client/myclient

2) Try to retrieve the client via the Sensu API and make sure I get a 404

curl -v http://sensu.server.ip.address:4567/client/myclient

This indeed returned a 404.
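
Steps 1 and 2 can be rolled into one small shell helper. This is only a sketch: the function names client_url and remove_client are made up for illustration, and it assumes the API listens on port 4567 and serves clients under /client/NAME, exactly as in the curl commands above.

```shell
#!/bin/sh
# Hypothetical helpers; host, port 4567 and the /client/NAME path are
# taken from the curl commands above, everything else is illustrative.

# Build the API URL for a given Sensu host and client name.
client_url() {
    printf 'http://%s:4567/client/%s\n' "$1" "$2"
}

# Delete the client, then fetch it again.
# Succeeds only if the follow-up GET answers with a 404.
remove_client() {
    url=$(client_url "$1" "$2")
    curl -s -o /dev/null -X DELETE "$url"
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    [ "$code" = "404" ]
}
```

Calling `remove_client sensu.server.ip.address myclient && echo "client is gone"` would then perform the delete and the 404 check in one go. Note that a 404 only tells you what sensu-api sees, which turned out to matter in this case.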

3) Check that there is a single redis process running

BINGO -- when I ran 'ps -def | grep redis', the command returned TWO redis-server processes! I am not sure how both of them came to be running, but this solved the mystery: sensu-server was talking to one redis-server process, and sensu-api was talking to the other. So even after the client was removed via sensu-api, the Sensu server was still seeing events for that client, such as this one from /var/log/sensu/sensu-server.log:


{"timestamp":"2012-11-10T01:41:14.154418+0000","message":"handling event","event":{"client":{"subscriptions":["all"],"name":"myclient","address":"10.2.3.4","timestamp":1352502348},"check":{"name":"keepalive","issued":1352511674,"output":"No keep-alive sent from client in over 180 seconds","status":2,"history":["2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2"],"flapping":false},"occurrences":305,"action":"create"},"handler":{"type":"pipe","command":"/etc/sensu/handlers/ops_pagerduty.rb","api_key":"myapikey","name":"ops_pagerduty"},"level":"info"}
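
Both checks from step 3 (how many redis-server processes are running, and whether sensu-server is still handling events for the deleted client) can be scripted. The sketch below is a rough take, not an official Sensu tool: the function names are hypothetical, it assumes a Linux ps that accepts '-e -o comm=', and it assumes the log format shown above.

```shell
#!/bin/sh
# Hypothetical helpers for the two checks described above.

# Count the lines on stdin that are exactly the given process name.
count_named() {
    grep -c -x -- "$1"
}

# Succeed only when exactly one redis-server process is running.
# Assumes a Linux 'ps' that prints bare command names with '-o comm='.
redis_is_single() {
    [ "$(ps -e -o comm= | count_named redis-server)" -eq 1 ]
}

# Count the log lines that mention the given client name.
# $1 = client name, $2 = path to the sensu-server log.
events_for_client() {
    grep -c -F -- "\"name\":\"$1\"" "$2"
}
```

For example, `redis_is_single || echo "expected exactly one redis-server"` flags the situation described here, and `events_for_client myclient /var/log/sensu/sensu-server.log` shows whether the server is still processing events for a client that the API claims is gone.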

To actually solve this, I killed the two redis-server processes (since 'service redis-server stop' didn't seem to do it), then stopped sensu-server and sensu-api, then started redis-server, and finally started sensu-server and sensu-api again.

At this point, the Sensu dashboard showed the 'myclient' client again. I removed it one more time from the dashboard (I could have done it via the API too) and it finally went away for good.
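
For reference, the recovery sequence described above can be written down as a script. Treat it as a sketch under assumptions: I am assuming the init scripts are named sensu-server and sensu-api (only 'service redis-server' appears verbatim above), I use pkill in place of killing the processes by hand, and the run wrapper and DRY_RUN variable are invented here so that the script only prints the commands unless told otherwise.

```shell
#!/bin/sh
# Hypothetical restart helper. By default it only prints the commands;
# set DRY_RUN=0 (and run as root) to actually execute them.

run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

restart_sensu_stack() {
    # 'service redis-server stop' did not stop both processes, so kill them.
    run pkill -x redis-server
    # Assumed service names for the Sensu daemons.
    run service sensu-server stop
    run service sensu-api stop
    # Bring Redis back first so both daemons connect to the same instance.
    run service redis-server start
    run service sensu-server start
    run service sensu-api start
}
```

Running `restart_sensu_stack` as-is prints the six commands in order, which is a cheap way to review the sequence before running it for real with `DRY_RUN=0`.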

This was quite an obscure issue. I wouldn't have been able to solve it were it not for the awesomeness of the #sensu IRC channel (and kudos to the aforementioned robotwitharose and portertech!).

I hope Google searches for 'sensu no keepalive from client' will lead to this blog post and help somebody out there! :-)

Sensu rocks BTW.
