
Showing posts from 2013

Ops Design Pattern: local haproxy talking to service layer

Modern Web application architectures are often composed of a front-end application layer (app servers running Java/Python/Ruby aided by a generous helping of JavaScript) talking to one or more layers of services (RESTful or not) which in turn may talk to a distributed database such as Riak.

Typically we see a pair of load balancers in HA mode deployed up front and handling user traffic, or more commonly serving as the origin for CDN traffic. To avoid deploying many pairs of load balancers between the front-end app server layer and the various service layers, or between one service layer and another, one design pattern I've successfully used is an haproxy instance running locally (on 127.0.0.1) on each node that needs to talk to N other nodes running some type of service. This approach has several advantages:

No need to use up nodes, be they bare-metal servers, cloud instances or VMs, for the sole purpose of running yet another haproxy instance (and you actually need 2 n…
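
To make the pattern concrete, here is a minimal sketch of what such a local haproxy configuration might look like on each app server; the service name, port and backend IP addresses are made up for illustration:

# local haproxy fronting a hypothetical Riak-style HTTP service on port 8098
listen riak-http
    bind 127.0.0.1:8098
    mode http
    balance roundrobin
    option httpchk GET /ping
    server riak1 10.0.1.11:8098 check
    server riak2 10.0.1.12:8098 check
    server riak3 10.0.1.13:8098 check

The application then simply talks to 127.0.0.1:8098, and the local haproxy takes care of health checking and spreading requests across the N service nodes.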

Setting HTTP request headers in haproxy and interpolating variables

I had the need to set a custom HTTP request header in haproxy. For versions up to 1.4.x, the way to do this is:

reqadd X-Custom-Header:\ some_string

However, some_string is just a static string, and I could see no way of interpolating a variable into it. Googling around, I found that this is possible in haproxy 1.5.x with this method:

http-request set-header X-Custom-Header %[dst_port]

where dst_port is the variable we want to interpolate and %[variable] is the syntax for interpolation.

Other examples of variables available for you in haproxy.cfg are in Section 7.3 "Fetching samples" in the haproxy 1.5 configuration manual.
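
For context, here is a minimal (hypothetical) frontend section showing the 1.5 directive in place:

# haproxy 1.5: add a header carrying the destination port of the incoming connection
frontend www
    bind *:80
    http-request set-header X-Custom-Header %[dst_port]
    default_backend app-servers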

Creating sensu alerts based on graphite data

We had the need to create Sensu alerts based on some of the metrics we send to Graphite. Googling around, I found this nice post by @ulfmansson talking about the way they did it at Recorded Future. Ulf recommends using Sean Porter's check-data.rb Sensu plugin for alerting based on Graphite data. It wasn't clear how to call the plugin, so we experimented a bit and came up with something along these lines (note that check-data.rb requires the sensu-plugin gem):

$ ruby check-data.rb -s graphite.example.com -t "movingAverage(stats.lb1.prod.assets-backend.session_current,10)" -w 100 -c 200

This runs the check-data.rb script against the server graphite.example.com (-s option), requesting the value of the target metric movingAverage(stats.lb1.prod.assets-backend.session_current,10) (-t option), with a warning threshold of 100 for this value (-w option) and a critical threshold of 200 (-c option). The target can be any function supported by Graphite. In this example, it…
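
Once the command line behaves the way you want, you can wrap it in a Sensu check definition so the Sensu server schedules it periodically; the check name, subscribers and interval below are placeholders, not values from our actual setup:

{
  "checks": {
    "graphite_assets_backend_sessions": {
      "command": "check-data.rb -s graphite.example.com -t 'movingAverage(stats.lb1.prod.assets-backend.session_current,10)' -w 100 -c 200",
      "subscribers": ["loadbalancers"],
      "interval": 60
    }
  }
}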

Avoiding keepalive storms in sensu

Sensu is a great new monitoring tool, but also a bit rough around the edges. We've been willing to live with that because of its benefits, in particular ease of automation and increased scalability due to its use of a queuing system. Speaking of queuing systems, Sensu uses RabbitMQ for that purpose. We haven't had performance or stability issues with the rabbit, but we have been encountering a pretty severe issue with the way Sensu and RabbitMQ interact with each other.

We have systems deployed across several cloud providers and data centers, with site-to-site VPN links between locations. What started to happen fairly often for us was what we call a "keepalive storm", where all of a sudden all Sensu clients were seen by the Sensu server as unavailable, since no keepalive had been sent by the clients to RabbitMQ.  The thresholds for the keepalive timers in Sensu are hardcoded (at least in the Sensu version we are using, which is 0.10.2) and are defined in /opt/sensu…

Disabling public key authentication in sftp

I just had an issue trying to sftp into a 3rd party vendor server using a username and password. It worked fine with FileZilla, but from the command line I got:

Received disconnect from A.B.C.D: 11: Couldn't read packet: Connection reset by peer
(A.B.C.D denotes the IP address of the sftp server)
I then ran sftp in verbose mode (-v) and got:
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,password
debug1: Next authentication method: publickey
debug1: Offering RSA public key: /home/mylocaluser/.ssh/id_rsa
Received disconnect from A.B.C.D: 11: Couldn't read packet: Connection reset by peer
This made me realize that the sftp server is configured to accept password authentication only. I inspected the man page for sftp and googled around a bit to figure out how to disable public key authentication and I found a way that works:
sftp -oPubkeyAuthentication=no remoteuser@sftpserver
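
If you need to hit that server regularly, the same thing can be made permanent with an entry in ~/.ssh/config (the host alias and names below are placeholders):

# force password authentication for this particular sftp server
Host vendorsftp
    HostName sftpserver
    User remoteuser
    PubkeyAuthentication no
    PreferredAuthentications password

After that, a plain 'sftp vendorsftp' works without the extra -o option.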

Keepalived, iproute2 and HAProxy (part 2)

In part 1 of this 2-part series, I explained how we initially set up keepalived and iproute2 on 2 HAProxy load balancers with the goal of achieving high availability at the load balancer layer. Each of the load balancers had 3 interfaces, and we wanted to be able to ssh into any IP address on those interfaces -- hence the need for iproute2 rules. However, adding keepalived into the mix complicated things.

To test failover at the HAProxy layer, we simulated a system failure by rebooting the primary load balancer. As expected, keepalived transferred the floating IP address to the secondary load balancer, and everything kept working. However, things started going south when the primary load balancer came back online. We had a chicken-and-egg problem: the iproute2 rules related to the floating IP address didn't kick in when rc.local was run, because the floating IP wasn't there yet. Then keepalived correctly identified the primary system as being up and transferred the float…
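
For illustration, the kind of rc.local entries involved looked roughly like this (the addresses, interface and table number here are invented, not our actual values), and they only do something useful once the floating IP is actually present on the interface:

# source-based routing for the keepalived floating IP
ip rule add from 172.30.10.10 table 100
ip route add default via 172.30.10.1 dev bond0 src 172.30.10.10 table 100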

The mystery of stale haproxy processes

We had a situation with our haproxy-based load balancers where monitoring alerts were triggered because several haproxy processes were running, when in fact only one was supposed to be. Looking into it, we determined that each time the Chef client ran (by default every 30 minutes), a new haproxy process was launched. The logic in the haproxy cookbook applied to that node was to do a 'service haproxy reload' every time the haproxy configuration file changed. Since our haproxy configuration file is based on a Chef template populated via a Chef search, the reload was happening on every Chef client run.

If you look in /etc/init.d/haproxy, you'll see that the reload launches a new haproxy process, while the existing process is supposed to finish serving existing connections, then exit. However, the symptom we were seeing was that the existing haproxy process never closed all the outstanding connections, so it never exited.…
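
For reference, the reload in the Ubuntu init script boils down to something like the following (flags and paths may differ slightly across versions); -sf tells the new process to ask the old one to stop accepting connections, finish serving the existing ones and then exit:

# start a new haproxy and soft-stop the old one identified by the pid file
haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D -sf $(cat /var/run/haproxy.pid)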

Some gotchas around keepalived and iproute2 (part 1)

I should have written this blog post a while ago, while these things were still fresh on my mind. Still, better late than never.

Scenario: 2 bare-metal servers with 6 network ports each, to serve as our HAProxy load balancers in an active/failover configuration based on keepalived (I described how we integrated this with Chef in my previous post).

The architecture we have for the load balancers is as follows:

1 network interface (virtual and bonded, see below) is on a 'front-end' VLAN which gets the incoming traffic hitting HAProxy
1 network interface is on a 'back-end' VLAN where the actual servers behind HAProxy live
1 network interface is on an 'ops' VLAN which we want to use for accessing the HAProxy server for monitoring purposes
We (and by the way, when I say we, I mean mostly my colleagues Jeff Roberts and Zmer Andranigian) used Open vSwitch to create a virtual bridge interface and bond 2 physical interfaces on this bridge for each of the 'front-end'…
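
The Open vSwitch commands involved are roughly of this form for each VLAN (bridge, bond and NIC names here are hypothetical):

# create a bridge and bond two physical NICs onto it
ovs-vsctl add-br br-frontend
ovs-vsctl add-bond br-frontend bond-frontend eth0 eth1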

Setting up keepalived with Chef on Ubuntu 12.04

We have 2 servers running HAProxy on Ubuntu 12.04. We want to set them up in an HA configuration, and for that we chose keepalived.

The first thing we did was look for an existing Chef cookbook for keepalived -- luckily, @jtimberman already wrote it. It's a pretty involved cookbook, probably one of the most complex I've seen. The usage instructions are pretty good though. In any case, we ended up writing our own wrapper cookbook on top of keepalived -- let's call it frontend-keepalived.

The usage documentation for the Opscode keepalived cookbook contains a role-based example and a recipe-based example. We took inspiration from both. In our frontend-keepalived/recipes/default.rb file we have:


include_recipe 'keepalived'

node[:keepalived][:check_scripts][:chk_haproxy] = {
  :script => 'killall -0 haproxy',
  :interval => 2,
  :weight => 2
}
node[:keepalived][:instances][:vi_1] = {
  :ip_addresses => '172.30.10.10',
  :interface => 'frontend_…

Using wrapper cookbooks in Chef

Not sure if this is considered a Chef best practice or not -- I would like to get some feedback, hopefully via constructive comments on this blog post. But I've started to see this pattern when creating application-specific Chef cookbooks: take a community cookbook, include it in your own, and customize it for your specific application.

A case in point is the haproxy community cookbook. We have an application that needs to talk to a Riak cluster. After doing some research (read 'googling around') and asking people on Twitter (because Tweeps are always right), it looks like the preferred way of putting a load balancer in front of Riak is to run haproxy on each application server that needs to talk to Riak, and have haproxy listen on 127.0.0.1 on some port number, then load balance those requests to the Riak backend. Here is an example of such an haproxy.cfg file.

So what I did was to create a small cookbook called haproxy-riak, add a default.rb recipe file that just calls

in…
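
As a generic sketch of the wrapper pattern (the attribute names below are just illustrative, not necessarily the real haproxy cookbook attributes), the wrapper declares its dependency in metadata.rb and overrides attributes before including the community recipe:

# metadata.rb of the wrapper cookbook
depends 'haproxy'

# recipes/default.rb -- customize, then delegate to the community cookbook
node.default['haproxy']['incoming_address'] = '127.0.0.1'
node.default['haproxy']['incoming_port'] = 8098
include_recipe 'haproxy'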

Installing ruby 2.0 on Ubuntu 12.04

I wanted to find a way to install ruby 2.0 on Ubuntu 12.04 via Chef, without going the rvm route, which is harder to automate. After several tries, I finally found a list of .deb packages that are necessary and sufficient as prerequisites. Writing it down here for my future reference, and maybe it will prove useful to others out there.

I downloaded most of these packages from packages.ubuntu.com (if you google their names you'll find them), then I installed them via 'dpkg -i'. Here they are:


libffi5_3.0.9-1_amd64.deb
libssl0.9.8_0.9.8o-7ubuntu3.1_amd64.deb
zlib1g-dev_1.2.3.4.dfsg-3ubuntu4_amd64.deb

The actual ruby .deb package was built with fpm by my colleague Jeff Roberts courtesy of this blog post.
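
Putting it together, the installation boils down to something like this (the file name of the fpm-built ruby package is hypothetical):

# install the prerequisites, then the custom-built ruby 2.0 package
sudo dpkg -i libffi5_3.0.9-1_amd64.deb libssl0.9.8_0.9.8o-7ubuntu3.1_amd64.deb zlib1g-dev_1.2.3.4.dfsg-3ubuntu4_amd64.deb
sudo dpkg -i ruby_2.0.0-1_amd64.deb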




No snowflakes allowed

"Snowflake" is a term I learned from my colleague Jeff Roberts. It is used in the Chef community (maybe in the configuration management community at large as well) to designate a server/node that is 'unique', i.e. not in configuration management control. In a Chef environment, it means that the node in question was never added to Chef and never had chef-client run on it.

We've all been in situations where it seems overkill to go through the effort of automating the setup of a server. Maybe the server has a unique purpose within our infrastructure. Maybe we didn't feel like spending the time to create Chef recipes for that server. Whatever the reasoning, it seemed low-risk at the time.

Well, I am here to tell you there is danger in this way of thinking. Example: we deployed a server in EC2 manually. We installed the Sensu client on it manually and pointed it at our Sensu server. Everything seemed fine. Then one day we updated our Sensu configuration (via Chef)…

Video and slides for 'Five years of EC2 distilled' talk

On February 19th I gave a talk at the Silicon Valley Cloud Computing Meetup about my experiences and lessons learned while using EC2 for the past 5 years or so. I posted the slides on Slideshare and there's also a video recording of my presentation. I think the talk went pretty well, judging by the many questions I got at the end. Hopefully it will be useful to some people out there who are wondering if EC2 or The Cloud in general is a good fit for their infrastructure (short answer: it very much depends).

I want to thank Sebastian Stadil from Scalr for inviting me to give the talk (he first contacted me about giving a talk like this in -- believe it or not -- early 2008!). I also want to thank Adobe for hosting the meeting, and CDNetworks for sponsoring the meeting.

A workflow of managing Chef with knife

My colleague Jeff Roberts has been trying hard to teach me how to properly manage Chef cookbooks, roles, and nodes using the knife utility. Here is a workflow for doing this in a way that aims to be as close as possible to 'best practices'. This assumes that you already have a Chef server set up.

1) Install chef on your personal machine

In my case, my personal laptop is running OS X (gasp! I used to be a big-time Ubuntu-everywhere user, but I changed my mind after being handed a MacBook Air).

I won't go into the gory details on installing chef locally, but here are a few notes:

I installed the XCode command line tools, then I installed Homebrew. Note that plain XCode wasn't sufficient, I had to download the dmg package for the command line tools and install that.
I installed git via brew.
I installed chef via
sudo true && curl -L http://opscode.com/chef/install.sh | sudo bash
I installed the EC2 plugin for knife via
cd /opt/chef/embedded/bin/; sudo ./gem install k…
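
Once knife is installed and configured, the day-to-day workflow revolves around a handful of knife commands run from the local chef-repo; a few representative examples (cookbook, role and node names are made up):

# upload a cookbook, load a role from a file, add a role to a node's run list
knife cookbook upload my-cookbook
knife role from file roles/webserver.rb
knife node run_list add web1.example.com 'role[webserver]'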

Some gotchas when installing Octopress on Ubuntu

Here are some quick notes I took while trying to install the Octopress blog engine on a box running Ubuntu 12.04. I tried following the official instructions and I chose the RVM method. The first gotcha is that you have to have the development tools (compilers, linkers, etc.) installed already. So you need to run:

# apt-get install build-essential


Then I ran the recommended commands in order to install rvm and rubygems:

# curl -L https://get.rvm.io | bash -s stable --ruby
# source /usr/local/rvm/scripts/rvm
# rvm install 1.9.3
# rvm use 1.9.3
# rvm rubygems latest


I then installed git and grabbed the octopress source code:

# apt-get install git
# git clone git://github.com/imathis/octopress.git octopress
# cd octopress/

When trying to install bundler, I got this error:

# gem install bundler

ERROR:  Loading command: install (LoadError)
   cannot load such file -- zlib
ERROR:  While executing gem ... (NameError)
   uninitialized constant Gem::Commands::InstallCommand

Googling around, I found this answer

IT stories from the trenches #2

The year was 1996. I was a graduate student in CS at USC. It was my first contact with Unix (I had started my career as a programmer in C++ under DOS and later Windows 3.1; yes, I am dating myself big time here!). My first task as a research assistant was to recompile the X server so it could use shared memory extensions. This was part of an experiment studying multimedia applications over a high-speed proprietary network.

I started optimistically. I got acquainted with the compilers and linkers on the HP-UX platform we were using, I learned all kinds of neat Unix command line utilities, but there was only one glitch -- the X binary wouldn't compile. I tried everything in my power. I even searched on Altavista (no Google in 1996). Nothing worked. Finally, I posted a plea for help on comp.windows.x. A gentle soul by the name of Kaleb Keithley replied first on that thread, then on separate email threads (I was using pine as my email client at the time) and nudged me towards the solu…

IT stories from the trenches #1

I thought it might be interesting to tell some stories/vignettes that capture various lessons I learned throughout my career in IT. I call them 'stories from the trenches' because in general the lessons were acquired the hard way (but maybe that's the best way to acquire lessons...).

Here's the first one.

It was my second day on the job as a Unix system architect.

We weren't using LDAP or NIS to centralize user management, so we were copying user entries in /etc/passwd and /etc/shadow from one server and pasting them onto the other servers where new users needed to be created.

On one of these (production) servers I typed 'ci /etc/passwd' instead of 'vi /etc/passwd'. This had the unfortunate effect of invoking the RCS check-in command line utility ci, which then moved '/etc/passwd' to a file named '/etc/passwd,v'. Instead of trying to get back the passwd file, I panicked and exited the ssh shell. Of course, at this point there was no passwd …