Tuesday, January 29, 2013

IT stories from the trenches #1


I thought it might be interesting to tell some stories/vignettes that capture various lessons I learned throughout my career in IT. I call them 'stories from the trenches' because in general the lessons were acquired the hard way (but maybe that's the best way to acquire lessons...).

Here's the first one.

It was my second day on the job as a Unix system architect.

We weren't using LDAP or NIS to centralize user management, so we were copying user entries from /etc/passwd and /etc/shadow on one server and pasting them into the same files on the other servers where we needed to create new users.

On one of these (production) servers I typed 'ci /etc/passwd' instead of 'vi /etc/passwd'. This had the unfortunate effect of invoking ci, the RCS check-in command-line utility, which then moved '/etc/passwd' to a file named '/etc/passwd,v'. Instead of trying to get the passwd file back, I panicked and exited the SSH session. Of course, at this point there was no passwd file, so nobody could log in anymore. Ouch. I had to go to my boss and admit my screw-up. Together we took the server down (it was a Solaris server), booted it in single-user mode off an installation CD, mounted /etc and moved passwd,v back to passwd. I mentioned this was a production server, right? DATABASE production server. MAIN database production server. Miraculously, I kept my job.

Anyway, lesson learned? Well, several lessons in fact:

1) use system utilities for creating users and groups, and use SSH keys instead of passwords; of course, these days all these menial tasks should be automated via configuration management tools
2) DON'T PANIC. As long as you are logged in as root on a remote system, there's ample opportunity for fixing things that you may have broken.
3) know how to fix things by taking a machine offline and booting it in single-user mode; it will come in handy one day
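
To make lesson 1 a bit more concrete, here is a minimal sketch of what "use system utilities" could look like instead of hand-editing /etc/passwd. It is not what we ran back then. The function name build_useradd_command and the sample user 'jdoe' are made up for illustration, and the flags (-m, -s, -u, -g, -c) are the ones I'd expect on common Linux and Solaris versions of useradd, so check the man page on your system. The script only builds and prints the command; it does not run it.

```python
import shlex
from typing import List, Optional


def build_useradd_command(
    username: str,
    uid: Optional[int] = None,
    group: Optional[str] = None,
    shell: str = "/bin/sh",
    comment: str = "",
) -> List[str]:
    """Return the argv list for creating one user with useradd.

    Hypothetical helper: it only assembles the command, so nothing
    on the system is touched until you decide to run it.
    """
    # Reject names that would corrupt a colon-separated passwd entry.
    if not username or ":" in username or any(c.isspace() for c in username):
        raise ValueError(f"invalid username: {username!r}")

    cmd = ["useradd", "-m", "-s", shell]  # -m: create the home directory
    if uid is not None:
        cmd += ["-u", str(uid)]  # keep the same UID on every server
    if group is not None:
        cmd += ["-g", group]  # primary group
    if comment:
        cmd += ["-c", comment]  # GECOS / full name field
    cmd.append(username)
    return cmd


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    print(shlex.join(build_useradd_command("jdoe", uid=2001, group="staff")))
```

Passing the same uid on every server gets you the consistency we were after with copy and paste, without ever opening /etc/passwd in an editor (or in ci).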