Having a family of my own is an emotional asset, especially in tumultuous times.
Today my fellow IT guy was having trouble doing something with openssl on our fileserver, so he removed it with the intention of reinstalling it, hoping this would resolve the issue. He said he's done this before with no problem, but something went differently this time, and the uninstall ended up removing several screens' worth of packages. My introduction to this result was losing my internet connection, followed shortly by people complaining that they couldn't print.
Fortunately I had an ssh session open to the server, so between the two of us we had two active connections to the machine, but we couldn't establish any new ones. My first reaction was to try reinstalling openssl, but yum was gone. Some further testing revealed that several other rather important programs were gone too. Whatever had been removed, the shared drives still worked, and even though we couldn't connect with the client, vmware was still running in the background. So we were effectively coasting, since file sharing is most of what this machine does.
This is normally the time that you switch over to the backup machine while you try to fix things. The only problem is that we haven't received approval for a backup machine yet. So we had to go to the backup tapes, which is where it gets interesting. Since we had no backup machine, we never had a chance to test the backups in an environment where we could play around and break things. So we'd never actually done a proper restore, and had to figure it out on the fly, which is a disaster recovery no-no.
After some searching around online and a few panicked queries in IRC, we started pulling our data from the previous day's backup off the tape and into a temporary directory, and hit a stumbling block. When I tried to move our backup of /usr into /, it wouldn't let me, claiming that the active /usr wasn't empty. This is probably where we should have calmed down and figured out how to do things properly. Instead, we decided to move /usr to /usrbak, then move the backed-up /usr to /, and it worked! This was soon followed by /etc, /lib, and /var, all of which were missing things, and each time we moved one of those folders over, something else would start working again. Then we broke it.
We tried doing the move with /sbin (might have been /bin, the details escape me). First I moved /sbin to /sbinbak, then went to move the backed-up /sbin to / and got an error message. After a brief panic, I realized I could use /sbinbak/mv to move the backed-up /sbin to /, and it worked! Then I did the same thing to /lib64, which was stupid, because after that I couldn't do anything, presumably because nearly every program on the system, mv included, needs the shared libraries in /lib64 to run. It was then that I got that cold, numb feeling of fear.
After some flailing, we gave up and let everyone know that we were taking down the server in an hour, so they could finish up what they were doing (because Linux, beautifully, was still sharing files like nothing was wrong), and that we'd be up all night reinstalling everything.
After some bitching, I started downloading a liveCD in the hope that I could use that to move the backed-up /lib64 to /. While that was downloading I called my wife and complained for a while, which did a lot to calm me down. I went back to the office, and eventually my officemate remembered that the Fedora install CD had some kind of rescue mode on it. So we went ahead and brought down the server, went into rescue mode, and were delighted to see it had created a virtual (functional!) filesystem and mounted our broken filesystem in /mnt/sysimage. From there we were able to copy over /lib64, and anything else we thought might be useful, then rebooted.
Everything looked good as it booted up, but I was pessimistic: while we had been restoring from the tape we noticed that many files hadn't been copied, and I was certain something important had been missed. But everything was listing as successful, and eventually we got a login screen. We were able to log in successfully and began testing everything, and everything worked! We spent about an hour trying absolutely everything, and eventually decided we had dodged a bullet. No working until 2am again after all!
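In hindsight, a quick check like the one below would have told us exactly which files were missing instead of leaving me to worry about it. This is only a minimal sketch in Python, not something we actually ran that day. The function name missing_files is made up for illustration, and it only checks that each file from the backup exists in the live tree, not that the contents match.

```python
import os


def missing_files(backup_root, live_root):
    """Return sorted relative paths that exist under backup_root
    but have no counterpart under live_root (existence check only)."""
    missing = []
    # os.walk does not follow symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(backup_root):
        rel_dir = os.path.relpath(dirpath, backup_root)
        for name in filenames:
            rel = os.path.normpath(os.path.join(rel_dir, name))
            # lexists also counts broken symlinks as present.
            if not os.path.lexists(os.path.join(live_root, rel)):
                missing.append(rel)
    return sorted(missing)
```

Something like missing_files("/tmp/restore/lib64", "/mnt/sysimage/lib64") would then list what still needed copying. Both paths here are hypothetical stand-ins for wherever the tape data and the broken filesystem happen to live.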
During this crisis, we were given permission to go ahead and order that backup system. So we decided that until we have a functional system to fall back on, we're not going to mess with anything, just back up religiously and let things work until we have that equipment.
Even though everything appears to have worked out (thankfully on the day before I'm leaving for the HOPE conference, not the day of), I was still all stressed out and nauseous. I picked up my daughter from trian's work so she could finish her workday sans child, and went home. Deanna was remarkable in calming me down and getting me back to normal just by being her usual goofball self. In a previous life this would still have me frazzled and irritable. So I'm really thankful to have my ladies in my life.