Friday, October 18, 2013

Avoiding keepalive storms in Sensu

Sensu is a great new monitoring tool, but it is also a bit rough around the edges. We've been willing to live with that because of its benefits, in particular ease of automation and increased scalability due to its use of a queuing system, namely RabbitMQ. We haven't had performance or stability issues with the rabbit itself, but we have been running into a pretty severe issue with the way Sensu and RabbitMQ interact with each other.

We have systems deployed across several cloud providers and data centers, with site-to-site VPN links between locations. What started to happen fairly often for us was what we call a "keepalive storm": all of a sudden, the Sensu server would see every Sensu client as unavailable, because no keepalives from the clients had reached RabbitMQ. The thresholds for the keepalive timers in Sensu are hardcoded (at least in the Sensu version we are using, which is 0.10.2) and are defined in /opt/sensu/embedded/lib/ruby/gems/2.0.0/gems/sensu-0.10.2/lib/sensu/server.rb as 120 seconds for warnings and 180 seconds for critical alerts:


    thresholds = {
      :warning => 120,
      :critical => 180
    }
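
To make the role of those two numbers concrete, here is a minimal Ruby sketch of how a keepalive age could be classified against them. It is an illustration only, not the actual Sensu source: the method name keepalive_status, its arguments, and the inclusive (>=) comparison are all assumptions.

```ruby
# Illustrative sketch only; this is NOT the code from Sensu's server.rb.
# The default threshold values are the ones quoted above, in seconds.
# last_keepalive and now are Unix timestamps (hypothetical parameter names).
def keepalive_status(last_keepalive, now, thresholds = { :warning => 120, :critical => 180 })
  age = now - last_keepalive
  # Assumption: a keepalive exactly at a threshold already counts as exceeded.
  if age >= thresholds[:critical]
    :critical
  elsif age >= thresholds[:warning]
    :warning
  else
    :ok
  end
end

now = Time.now.to_i
puts keepalive_status(now - 30, now)   # keepalive seen 30 seconds ago
puts keepalive_status(now - 150, now)  # older than the warning threshold
puts keepalive_status(now - 600, now)  # older than the critical threshold
```

The point of the sketch is that any interruption in keepalive delivery longer than two minutes is enough to flag a client, so a connection that stays silently broken for several minutes will flag all of them at once.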

What we think was happening is that the connections between the Sensu clients and RabbitMQ (which in our case is running on the same box as the Sensu server) were reset, either because of a temporary glitch in the site-to-site VPN connection, or because of some other undetermined but probably network-related cause. In any case, this issue was becoming severe and was causing the engineer on pager duty to not get a lot of sleep at night.

After lots of hair-pulling, we found a workaround by specifying a non-default value for the heartbeat parameter in the RabbitMQ configuration file rabbitmq.config. Here's what the documentation says about the heartbeat parameter:

Value representing the heartbeat delay, in seconds, that the server sends in the connection.tune frame. If set to 0, heartbeats are disabled. Clients might not follow the server suggestion, see the AMQP reference for more details. Disabling heartbeats might improve performance in situations with a great number of connections, but might lead to connections dropping in the presence of network devices that close inactive connections.
Default: 600


Note that the default value is 600 seconds, much larger than the 120 and 180 second keepalive thresholds defined in Sensu. With the default, a connection that dies silently can presumably go unnoticed for far longer than it takes Sensu to flag every client. So what we did was set a heartbeat value of less than 120 seconds. We chose 60 seconds and it has worked fine. We still have keepalive storms, but they are now clearly due to real but temporary issues in site-to-site VPN connectivity, and they usually resolve themselves immediately.

One more thing: we install Sensu via its Chef community cookbook. The Sensu cookbook uses the RabbitMQ community cookbook, which doesn't define the heartbeat parameter as an attribute. We had to add that attribute, as well as use it in the rabbitmq.config.erb template file.

Just for reference, we modified cookbooks/rabbitmq/attributes/default.rb and added:


#avoid sensu keepalive storms!
default['rabbitmq']['heartbeat'] = 60

We also modified cookbooks/rabbitmq/templates/default/rabbitmq.config.erb and added:

{heartbeat, <%= node['rabbitmq']['heartbeat'] %>}
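
For context, the heartbeat tuple belongs in the rabbit application section of rabbitmq.config. A minimal sketch of what the rendered file might look like with the attribute above, assuming no other settings (the cookbook's real template contains more entries than shown here):

```erlang
%% Sketch of a rendered rabbitmq.config; other cookbook-managed settings omitted.
[
  {rabbit, [
    {heartbeat, 60}
  ]}
].
```

If the tuple is not the last entry in the rabbit list, it needs a trailing comma in the template.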


Disabling public key authentication in sftp

I just had an issue trying to sftp into a third-party vendor's server using a user name and password. It worked fine with FileZilla, but from the command line I got:

Received disconnect from A.B.C.D: 11:
Couldn't read packet: Connection reset by peer

(A.B.C.D denotes the IP address of the sftp server)

I then ran sftp in verbose mode (-v) and got:

debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,password
debug1: Next authentication method: publickey
debug1: Offering RSA public key: /home/mylocaluser/.ssh/id_rsa
Received disconnect from A.B.C.D: 11:
Couldn't read packet: Connection reset by peer

This made me realize that, although the sftp server advertises both publickey and password, it drops the connection as soon as a public key is offered, so in practice it accepts password authentication only. I inspected the man page for sftp and googled around a bit to figure out how to disable public key authentication, and I found a way that works:

sftp -oPubkeyAuthentication=no remoteuser@sftpserver
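
If you connect to the same server regularly, the same option can probably live in your SSH client configuration instead, since sftp runs on top of ssh and reads the same config file. A sketch, assuming the host and user names from the command above:

```
# ~/.ssh/config
# "sftpserver" and "remoteuser" are the placeholder names used above.
Host sftpserver
    User remoteuser
    PubkeyAuthentication no
```

With that entry in place, a plain "sftp sftpserver" should skip the public key offer and go straight to the password prompt.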

Modifying EC2 security groups via AWS Lambda functions

One task that comes up again and again is adding, removing or updating source CIDR blocks in various security groups in an EC2 infrastructur...