Showing posts from July, 2011

Processing mail logs with Elastic MapReduce and Pig

These are some notes I took while trying out Elastic MapReduce (EMR), and more specifically its Pig functionality, by processing sendmail mail logs. A big help was Eric Lubow's blog post on EMR and Pig. Before I go into details, here's my general processing flow:

N mail servers (running sendmail) send their mail logs to a central server running syslog-ng.

A process running on the central logging server tails the aggregated mail log (at 5 minute intervals), parses the lines it finds, extracts relevant information from each line, and saves the output in JSON format to a local file (actually there are 2 types of files generated, one for sender information and one for recipient information, corresponding to the 'from' and 'to' lines in the mail log -- see below).

Another process compresses the generated files in bzip2 format and uploads them to S3. I have 2 sets of files, one set with names similar to "from-2011-07-12-20-58" and containing JSON records of th…
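The parsing step in the flow above might look something like the following Python sketch. It is not the parser used in the post: the regular expression assumes the usual sendmail syslog layout (timestamp, host, `sendmail[pid]`, queue ID, then comma-separated `key=value` pairs), and the function names and JSON field names (`timestamp`, `host`, `queue_id`) are made up for illustration.

```python
import json
import re

# Assumed shape of a sendmail log line:
#   <timestamp> <host> sendmail[<pid>]: <queue_id>: key=value, key=value, ...
# Only lines whose first key is 'from' or 'to' are of interest here.
LINE_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s[\d:]{8})\s"
    r"(?P<host>\S+)\s"
    r"sendmail\[\d+\]:\s"
    r"(?P<queue_id>\w+):\s"
    r"(?P<fields>(?:from|to)=.*)$"
)


def parse_line(line):
    """Return (kind, record) for a 'from' or 'to' line, or None otherwise."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.group("fields")
    # The regex guarantees the first key is either 'from' or 'to'.
    kind = fields.split("=", 1)[0]
    record = {
        "timestamp": match.group("timestamp"),
        "host": match.group("host"),
        "queue_id": match.group("queue_id"),
    }
    # Naive split on ", "; values that contain ", " themselves need more care.
    for pair in fields.split(", "):
        key, sep, value = pair.partition("=")
        if sep:
            record[key] = value.strip("<>")
    return kind, record


def to_json(record):
    """Serialize one record as a single JSON line."""
    return json.dumps(record, sort_keys=True)
```

A caller would append `to_json(record)` to either the sender file or the recipient file depending on `kind`, and skip any line for which `parse_line` returns `None`.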

Results of a survey of the SoCal Piggies group

My colleague Warren Runk had the idea of putting together a survey to be sent to the mailing list of the SoCal Python Interest Group (aka SoCal Piggies), with the purpose of finding out which topics or activities would be most interesting to the members of the group in terms of future meetings. We had 10 topics in the survey, and people responded by choosing their top 5. We also had free-form response fields for 2 questions: "What do you like most about the meetings?" and "What meeting improvements are most important to you?".

We had 26 responses. Here are the voting results for the 10 topics we proposed:

#1 (18 votes): "Good practice, pitfall avoidance, and module introductions for beginners"

#2 (17 votes): "5 minute lightning talks"

#3 - #4 (15 votes): "Excellent code examples from established Python projects" and "New and upcoming Python open source projects"

#5 (14 votes): "30 minute presentations"

#6 (13 votes): "…

Accessing the data center from the cloud with OpenVPN

This post was inspired by a recent exercise I went through at the prompting of my colleague Dan Mesh. The goal was to have Amazon EC2 instances connect securely to servers at a data center using OpenVPN.

In this scenario, we have a server within the data center running OpenVPN in server mode. The server has a publicly accessible IP (via a firewall NAT) with port 1194 exposed via UDP. Cloud instances running OpenVPN in client mode connect to the server, have a route to an internal network within the data center pushed to them, and are then able to access servers on that internal network over a VPN tunnel.

Here are some concrete details about the network topology that I'm going to discuss.

Server A at the data center has an internal IP address of and is part of the internal network There is a NAT on the firewall mapping external IP X.Y.Z.W to the internal IP of server A. There is also a rule that allows UDP traffic on port 1194 to X.Y.Z.W.

I have a…