
Showing posts from 2009

Deploying Tornado in production

We've been using Tornado at Evite very successfully for a while, for a subset of the functionality on our web site. While the instructions in the official documentation make it easy to get started with Tornado and get an application up and running, they don't go very far in explaining the issues you face when deploying to a production environment -- things such as running the Tornado processes as daemons, logging, and automated deployments. Here are some ways we found to solve these issues.

Running Tornado processes as Unix daemons

When we started experimenting with Tornado, we ran the Tornado processes via nohup. It did the job, but it was neither solid nor elegant. I ended up using the grizzled Python library, specifically its os module, which encapsulates best practices in Unix system programming. This module offers a function called daemonize, which converts the calling process into a Unix daemon. The function's docstring says it all:

Convert the calling process into a…
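To show how this slots into a Tornado app, here is a minimal sketch -- the handler and port are placeholders rather than our production setup, and daemonize() is called with its defaults:

import tornado.httpserver
import tornado.ioloop
import tornado.web
from grizzled.os import daemonize

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("Hello from a daemonized Tornado")

application = tornado.web.Application([(r"/", MainHandler)])

if __name__ == "__main__":
    daemonize()  # fork, detach from the terminal, become a daemon
    http_server = tornado.httpserver.HTTPServer(application)
    http_server.listen(8888)
    tornado.ioloop.IOLoop.instance().start()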

NetApp SNMP monitoring with Nagios

Here are some tips regarding the monitoring of NetApp filers with Nagios. First off, the Nagios Exchange includes many NetApp-specific monitoring scripts, all based on SNMP. I ended up using check_netapp3.pl, but I hit some roadblocks when it came to checking disk space on NetApp volumes (the SNMP queries were timing out in that case).

The check_netapp3.pl script works fine for things such as CPU load. For example, I created a new command called check_netapp_cpu in /usr/local/nagios/etc/objects/commands.cfg on my Nagios server:

define command {
command_name check_netapp_cpu
command_line $USER1$/check_netapp3.pl -H $HOSTADDRESS$ -C mycommunity -v CPULOAD -w 50 -c 80
}
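To hook the command up to a filer, a standard service definition does the trick (the host name and service template below are placeholders for whatever your Nagios setup uses):

define service {
use generic-service
host_name netapp-filer1
service_description NetApp CPU Load
check_command check_netapp_cpu
}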

However, for things such as percent of disk used for a given NetApp volume, I had to use good old SNMP checks directly against the NetApp. Any time you use SNMP, you need to know which OIDs to hit. In this case, the task is a bit easier because you can look inside the check_netapp3.pl script to see some examples of N…
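As a sketch of what such a direct check can look like, here is a command built on the stock check_snmp plugin. The OID shown is, to the best of my knowledge, NetApp's dfPerCentKBytesCapacity column under the 1.3.6.1.4.1.789 enterprise tree, with the volume index passed as an argument -- treat it as an assumption and verify against your filer's MIB:

define command {
command_name check_netapp_disk
command_line $USER1$/check_snmp -H $HOSTADDRESS$ -C mycommunity -o .1.3.6.1.4.1.789.1.5.4.1.6.$ARG1$ -w 85 -c 95
}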

Compiling Python 2.6 with sqlite3 support

Quick note to self, hopefully useful to others too:

If you compile Python 2.6 (or 2.5) from source, and you want to enable sqlite3 support (which is included in the stdlib for 2.5 and above), then you need to pass a special USE flag to the configuration command line, like this:

./configure USE="sqlite"

(note "sqlite" and not "sqlite3")
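A quick smoke test to verify that the resulting build actually picked up the module:

import sqlite3

# if the build was configured correctly this runs; otherwise the import
# fails with "ImportError: No module named _sqlite3"
conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer)")
conn.execute("insert into t values (1)")
print conn.execute("select * from t").fetchall()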

5 years of blogging

Today marks the 5th anniversary of my blog. It's been a fun and rewarding experience, and I hope to never run out of interesting topics to post about ;-)

As a sort of retrospective, I was curious to see which of my blog posts have been getting the most traffic. Here's the top 10 over the last 9 months, according to Google Analytics:

1. Performance vs. load vs. stress testing (as an aside, I think this has been wildly popular because I inadvertently hit on a lot of keywords in the title)
2. Experiences deploying a large-scale infrastructure in Amazon EC2
3. Ajax testing with Selenium using waitForCondition
4. Useful tools for writing Selenium tests
5. Load balancing in EC2 with HAProxy
6. Python unit testing part 1: the unittest module
7. HTTP performance testing with httperf, autobench and openload
8. Running a Python script as a Windows service
9. Apache virtual hosting with Tomcat and mod_jk
10. Configuring Apache 2 and Tomcat 5.5 with mod_jk

It's interesting that 2 of t…

Monitoring multiple MySQL instances with Munin

I've been using Munin for its resource graphing capabilities. I especially like the fact that you can group servers together and watch a common metric (let's say system load) across all servers in a group -- something that is hard to achieve with other similar tools such as Cacti and Ganglia.

I did have the need to monitor multiple MySQL instances running on the same server. I am using mysql-sandbox to launch and manage these instances. I haven't found any pointers on how to use Munin to monitor several MySQL instances, so I rolled my own solution.

First of all, here is my scenario:
server running Ubuntu 9.04 64-bit
N+1 MySQL instances installed as sandboxes rooted in /mysql/m0, /mysql/m1, ..., /mysql/mN
munin-node package version 1.2.6-8ubuntu3 (installed via 'apt-get install munin-node')
Step 1

Locate mysql_* plugins already installed by the munin-node package in /usr/share/munin/plugins. I have 5 such plugins: mysql_bytes, mysql_isam_space_, mysql_queries, mysql_slo…
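To give a flavor of where this kind of setup typically ends up (a generic sketch, not necessarily the exact rolled-my-own solution -- paths, ports and socket locations below are hypothetical): each sandboxed instance gets its own symlink to a plugin, plus a munin-node config section pointing that copy at the instance's socket:

# one symlink per plugin per sandboxed instance, e.g. for m0:
ln -s /usr/share/munin/plugins/mysql_queries /etc/munin/plugins/mysql_queries_m0

# /etc/munin/plugin-conf.d/munin-node -- point the plugin at m0's socket
[mysql_queries_m0]
env.mysqlopts -S /mysql/m0/data/mysql.sock -u root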

Behaviour Driven Infrastructure

I just read a post by Matthew Flanagan on Behaviour Driven Infrastructure, or BDI, a concept that apparently originates with Martin Englund's post on this topic. The idea is that you describe what you need your system to do in natural language, using for example a tool such as Cucumber. What's more, you can then use the cucumber-nagios plugin to express the desired behaviour of the new system as a series of Nagios checks. The checks will initially fail (just like in a TDD or BDD development cycle), but you will make them pass by deploying the appropriate packages and applications to the system.

I also expressed the need for automated testing of production deployments in one of my blog posts. However, BDI goes one step further, by describing a test plan for production deployments in natural language. Pretty cool, and again I can only wish that the Python testing tools kept up with Ruby-based tools such as Cucumber and friends...

Great series of posts on Tokyo Tyrant

Matt Yonkovit has started a series of posts on Tokyo Tyrant at Percona's MySQL Performance Blog. Great in-depth analysis of the reliability and performance of TT.

Part 1: Tokyo Tyrant -- is it durable?
Part 2: Tokyo Tyrant -- the performance wall
Part 3: Tokyo Tyrant -- write bottleneck

(parts 4 and 5, about replication and scaling, are hopefully coming soon)

Google using buildbot for Chromium continuous integration

Via Ben Bangert, this gem of a page showing the continuous integration status for the Chromium project at Google. It's cool to see that they're using buildbot. But just like Ben says -- I wish they'd open source the look and feel of that buildbot status page ;-)

NFS troubleshooting with iostat and lsof

Scenario: you mount a volume exported from a NetApp on several Linux clients via NFS

Problem: you see constant high CPU usage on the NetApp, and some of the Linux clients become sluggish, primarily in terms of I/O

Troubleshooting steps:

1) If iostat is not already installed on the clients, install the sysstat utilities.

2) On each client mounting from the filer, or on a representative sample of the clients, run iostat with -n so that it shows NFS-related statistics. The following command will run iostat every 5 seconds and show NFS stats in a nicely tabulated output:

# iostat -nh 5

3) Notice which client exhibits the most NFS operations per second, and correlate it with the NFS volume on that client which shows the most NFS reads and/or writes per second.

At this point you have found the most likely culprit in terms of sending NFS traffic to the filer (there could be several client machines in this position, for example if they are part of a cluster).

5) If not already installed, download and install
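The other tool from the title, lsof, helps map that traffic back to processes: pointed at the NFS mount, it shows which processes hold files open there. A sketch, with placeholder mount point and pid:

# list processes with files open under the NFS mount point
lsof /mnt/netapp_vol

# or check what a specific (suspected) process has open on the mount
lsof -p 12345 | grep /mnt/netapp_vol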

Automated deployments with Puppet and Fabric

I've been looking into various configuration management/automated deployment tools lately. At OpenX we used slack, but I wanted something with a bit more functionality than that (although I'm not badmouthing slack by any means -- it can definitely be bent to your will to do pretty much whatever you need in terms of automating your deployments).

From what I see, there are 2 types of configuration management tools:

The first type I call 'pull', which means that the servers pull their configurations, and their marching orders in terms of applying those configurations, from a centralized location -- both slack and Puppet are in this category. I think this is great for the initial configuration of a server. As I described in another post, you can have a server bootstrap itself by installing Puppet (or slack) and then 'call home' to the central Puppet master (or slack repository) and get all the information it needs to configure itself.

The second type I call 'push', …
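To make the 'push' idea concrete, here is a minimal fabfile sketch -- hypothetical host names and commands, using the Fabric 0.9-style API:

# fabfile.py
from fabric.api import env, run, sudo

env.hosts = ['web1.example.com', 'web2.example.com']

def deploy():
    # the central machine pushes commands out to each host over ssh
    run('svn update /var/www/myapp')
    sudo('/etc/init.d/apache2 reload')

Running 'fab deploy' then executes these steps on every host in env.hosts.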

Thierry Carrez on running your own Ubuntu Enterprise Cloud

Thierry Carrez, who works on the Ubuntu Server team, has a great series of blog posts on how to run your own Ubuntu Enterprise Cloud. I haven't had a chance to try this yet, but it's high on my TODO list. Thierry uses the Ubuntu Enterprise Cloud product (which has been part of Ubuntu Server starting with 9.04) together with Eucalyptus. Here are the links to Thierry's posts:

Part 1, Part 2, Part 3

Compiling, installing and test-running Scribe

I went to the Hadoop World conference last week and one thing I took away was how Facebook and other companies handle the problem of scalable logging within their infrastructure. The solution found by Facebook was to write their own logging server software called Scribe (more details on the FB blog).

Scribe is mentioned in one of the best presentations I attended at the conference -- 'Hadoop and Hive Development at Facebook' by Dhruba Borthakur and Zheng Shao. If you look at page 4, you'll see the scale of what they're facing: 4 TB of compressed data (mostly logs) handled every day, and 135 TB of compressed data scanned every day. All this goes through Scribe, so that gives me a warm fuzzy feeling that it's indeed scalable and robust. For more details on Scribe, see the wiki page of the project. It's my intention here to detail the steps needed for compiling and installing it, since I found that to be a non-trivial process, to say the least. I'm g…

Brandon Burton on 'Automation is the cloud'

Great post from Brandon Burton, my ex-colleague at RIS/Reliam, on why automation is the foundation of cloud computing. Brandon discusses automation at various levels, starting with virtualization and networking, then moving up the layers and covering OS, configuration management and application deployment. Highly recommended.

Pybots success stories and a call for help

Any report of the death of the Pybots project is an exaggeration. But not by much. First, some history.

Some history

The idea behind the Pybots project is to allow people to run automated tests for their Python projects, while using Python binaries built from the very latest source code from the Python subversion repository.

The idea originated from Glyph, of Twisted fame. He sent out a message to the python-dev mailing list in which he said:
"I would like to propose, although I certainly don't have time to implement, a program by which Python-using projects could contribute buildslaves which would run their projects' tests with the latest Python trunk. This would provide two useful incentives: Python code would gain a reputation as generally well-tested (since there is a direct incentive to write tests for your project: get notified when core python changes might break it), and the core developers would have instant feedback when a "small" change breaks more code…

Jeff Roberts on a scalable DNS scheme for EC2

My ex-colleague from OpenX, Jeff Roberts, has another great blog post on 'A Scalable DNS Scheme for Amazon's EC2 Cloud'. If you need to deploy an internal DNS infrastructure in EC2, you have to read this post. It's based on battle-tested experience.

A/B testing and online experimentation at Microsoft

Via Greg Linden, I found a great presentation from Ronny Kohavi on "Online experimentation at Microsoft". All kinds of juicy nuggets of information on how to conduct meaningful A/B testing and other types of controlled online experiments.

One of my favorite slides is 'Key Lessons', from which I quote:

"Avoid the temptation to try and build optimal features through extensive planning without early testing of ideas"
"Experiment often"
"Try radical ideas. You may be surprised"
The entire presentation is highly recommended. You can tell that this wisdom was earned in the school of hard knocks, which is the best school there is in my experience, at least for software engineering.

Bootstrapping EC2 images as Puppet clients

I've been looking at Puppet lately as an alternative to slack for automated deployment and configuration management. I can't say I love it, but I think it's good enough that it warrants banging your head against the wall repeatedly until you learn how to use it. I do wish it were written in Python, but hey, you do what you need to do. I did look at Fabric, and I might still use it for 'push'-type deployments, but it has nowhere near the features that Puppet has (and its development and maintenance just changed hands, which makes it too cutting edge for me at this point).

But this is not a post about Puppet -- although I promise I'll blog about that too. This is a post on how to get to the point of using Puppet in an EC2 environment, by automatically configuring EC2 instances as Puppet clients once they're launched.

While the mechanism I'll describe can be achieved by other means, I chose to use the Ubuntu EC2 AMIs provided by alestic. As a parenthesis, if …
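The general shape of the mechanism: the alestic Ubuntu AMIs execute a user-data script at first boot if it starts with #!, so a freshly launched instance can install Puppet and 'call home' on its own. A minimal sketch, with a hypothetical master host name (not the exact script from this post):

#!/bin/bash
# hypothetical EC2 user-data script: install puppet, then call home
apt-get update
apt-get -y install puppet
# contact the master and wait for the certificate to be signed
puppetd --server puppetmaster.example.com --waitforcert 60 --test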

Is your hosting provider Reliam?

Some exciting news from RIS Technology, the hosting company I used to work for. They changed their name to Reliam, which stands for Reliable Internet Application Management. And I think it's an appropriate name, because RIS has always been much more involved in the application stack than your typical hosting provider. When I was there, we rolled out Django applications, Tomcat instances, MySQL, PostgreSQL and Oracle installations, and we maintained them 24x7, which required a deep understanding of the applications. We also provided the glue that tied all the various layers together, from deployment to monitoring.

Since I left, RIS/Reliam has invested heavily in a virtual infrastructure that can be combined where it makes sense with physical dedicated servers. The DB layer is usually dedicated, since the closer you are to bare metal, the better off you are in terms of database access. But the application layer can easily be virtualized and scaled on demand. So you can the scalin…

New presentation and new job

I gave a presentation last night on 'Agile and Automated Testing Techniques and Tools' to the Pasadena Java User Group. It was a version of the talk I gave earlier this year to the XP/Agile SoCal User Group. This time I posted the slides to Slideshare.

If you look attentively at the first slide, you'll notice I have a new job at Evite, as a Sr. Systems Architect. I started a couple of weeks ago, and it's been great. Expect more blog posts on automated deployments to the cloud (using Ubuntu images from Alestic), on continuous integration/build/release management processes, on Hadoop, and on any interesting stuff that comes my way.

noSQL databases? map-reduce? Erlang? it's all in this cartoon

Hilarious cartoon (not sure why it's titled 'Fault Tolerance' though) seen on the High Scalability blog. Captures very well the spirit and hype of our times in the IT world.

Python well represented in NASA's Nebula cloud

I found out today from the cloud-computing mailing list about NASA's Nebula project. Here's what the 'About' page of the project's web site says:

"NEBULA is a Cloud Computing environment developed at NASA Ames Research Center, integrating a set of open-source components into a seamless, self-service platform. It provides high-capacity computing, storage and network connectivity, and uses a virtualized, scalable approach to achieve cost and energy efficiencies."

The Services page has some nice architectural diagrams. I wasn't surprised to see that their VM environment is managed via Eucalyptus. I also shouldn't have been surprised by the large number of Python modules and applications they're using, especially on the client side. Pretty much all the frontend applications are Python bindings for the various backend technologies they're using (such as LUSTRE, RabbitMQ, Subversion). Of course Trac is there too.

But the most interesting thing for P…

How to roll your own Amazon EC2 image

Jeff Roberts, the vim-fu guru, does it again with a great post on "Bundling versioned AMIs rapidly in Amazon's EC2". It's a step-by-step guide on how to roll your own AMI, bundle it and upload it to S3, while keeping it versioned at the same time. Highly recommended.

Automated testing of production deployments

When you work as a systems engineer at a company that has a large scale system infrastructure, sooner or later you realize that you need to automate pretty much everything you do. You can't afford not to, if you want to keep up with the ever-present demands of scaling up and down the infrastructure.

The main promise of cloud computing -- infinite elastic scaling based on demand -- is real, but you can only achieve it if you automate your deployments. It's fairly safe to say that most teams that are involved in such infrastructures have achieved high levels of automation. Some fearless teams practice continuous deployment, others do frequent dark launches. All these practices are great, but my thesis is that in order to achieve fearlessness you need automated tests of your production deployments.

Note the word 'production' -- I believe it is necessary to go one step beyond running automated tests in an isolated staging environment (although that is a very good thing to …

Managing multiple MySQL instances with MySQL Sandbox

MySQL doesn't support multi-master replication, i.e. you can't have one MySQL instance acting as a replication slave to more than one master. There are times when you need this functionality, for example for disaster recovery purposes, where you have a machine with tons of CPU, RAM and disk running several MySQL instances, each being a replication slave to a different MySQL master.

One tool I've used for easy management of multiple MySQL instances on the same box is MySQL Sandbox. It's nothing fancy -- a Perl module which offers a collection of scripts -- but it does make your life much easier.

To install MySQL Sandbox, download it from its Launchpad page, then run 'perl Makefile.PL; make; make install'. You also need to download a MySQL binary tarball which will serve as a common base used by your MySQL instances.
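For a single instance, the basic invocation looks something like this -- a sketch with option names taken from the MySQL Sandbox docs of that era, so treat the exact flags and paths as assumptions:

# create one sandboxed instance under /var/mysql_slaves from a binary tarball
make_sandbox /path/to/mysql-5.1.30-linux-x86_64-glibc23.tar.gz \
  --upper_directory=/var/mysql_slaves \
  --sandbox_directory=msb1 \
  --sandbox_port=5131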

Here's an example of a script I wrote which creates a new MySQL Sandbox instance under a common directory (/var/mysql_slaves in my case). The sc…

Kent Langley's '10 rules for launching a web site'

The advice in this blog post by Kent Langley resonates with my experiences launching Web infrastructures of all types, large and small. Deploy early and often, automate your deployments, use version control, create checklists, have a rollback plan -- these are all very sensible things to do.

I would add one more very important thing that seems to be missing from the list: have an extensive suite of automated tests to check that your deployment steps did the right thing. Many people just stop at the automation step, and don't go beyond that to the testing step. It will come back to haunt them in the long run. But this is fodder for another blog post, which will be coming real soon now ;-)

Greatest invention since sliced bread: vimdiff

If you work at the ssh command prompt all day long (like I do), and if you need to compare text files and merge differences (like I do), then make sure you check out vimdiff (thanks to Jeff Roberts for bringing it to my attention).

If you run 'vimdiff file1 file2', the tool will split your screen vertically, with file1 displayed in a vim session on the left and file2 on the right. The differences between the 2 files will be highlighted. To jump from difference to difference, use ]c (forward) and [c (backward). When the cursor is on a difference block, use :diffget or do to merge the difference from the other file into the file where the cursor is; use :diffput or dp to merge the other way. To jump from one file's window to the other, use Ctrl-w-w. Google vimdiff for other tips and tricks. Definitely a good tool to have in your arsenal.

If you have the luxury of a graphical environment, I also recommend meld (thanks to Chris Nutting for the tip).

Recommended blog: Elastician

Elastician is the blog of Mitch Garnaat, the author of the amazingly useful boto Python library -- a collection of modules for managing AWS resources (EC2, S3, SQS, SimpleDB and more recently CloudWatch).

Mitch has a great picture on what he calls the 'Cloud Computing Hierarchy of Needs' (in a reference to Maslow's self-actualization hierarchy). Very insightful.
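If you haven't tried boto, this is the flavor of it -- a minimal sketch, assuming your AWS credentials live in the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables:

import boto

# connect to EC2 and list the instances in the account
ec2 = boto.connect_ec2()
for reservation in ec2.get_all_instances():
    for instance in reservation.instances:
        print instance.id, instance.state, instance.public_dns_name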

Python mock testing techniques and tools

This is an article I wrote for Python Magazine as part of the 'Pragmatic Testers' column. Titus and I have taken turns writing the column, although we haven't produced as many articles as we would have liked.

Here is the content of my article, which appeared in the February 2009 issue of PyMag:

Mock testing is a controversial topic in the area of unit testing. Some people swear by it, others swear at it. As always, the truth is somewhere in the middle.

Let's get some terminology clarified: when people say they use mock objects in their testing, in most cases they actually mean stubs, not mocks. The difference is expanded upon, with his usual brilliance, by Martin Fowler in his article "Mocks aren't stubs".

In his revised version of the article, Fowler uses the terminology from Gerard Meszaros's 'xUnit Test Patterns' book. In this nomenclature, both stubs and mocks are special cases of 'test doubles', which are 'pretend' objects…
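To make the stub/mock distinction concrete, here is a minimal illustrative sketch of my own (not code from the article), using Michael Foord's mock library; in Python 3.3+ the same Mock class lives in unittest.mock:

from mock import Mock

def greeting(db):
    return "Hello, %s" % db.get_user(1)["name"]

def signup(mailer, username):
    mailer.send_welcome(username)

# Stub style: give the double a canned return value, then verify state.
db = Mock()
db.get_user.return_value = {"name": "alice"}
assert greeting(db) == "Hello, alice"

# Mock style: verify the interaction itself (behavior verification).
mailer = Mock()
signup(mailer, "alice")
mailer.send_welcome.assert_called_with("alice")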