This is an article I originally wrote for Red Hat’s Enable Sysadmin.
Nate
Late on a Tuesday afternoon, I had somewhere to be after work, which made driving all the way home and then back again a waste of time. So I was in my office late, killing some time and getting a little work done before I had to go visit a friend. I went to log in to my Red Hat Enterprise Virtualization 3.6 manager to do something or other. My login was rejected, which was odd, but I’d seen this before. See, my RHEV manager was a VM running on a stand-alone KVM host, separate from the cluster it managed. I had been running RHEV since 3.0, before hosted engine was a thing, and I hadn’t gone through the effort of migrating. I was already in the process of building a new set of clusters with a new manager, but this manager was still controlling most of our production VMs. The manager had filled its disk again, and the underlying database had stopped itself to avoid corruption. For whatever reason, we had never set up disk space monitoring on this system. It’s not like it was an important box, right?
So I logged into the KVM host that ran the VM and started the well-known procedure of creating a new empty disk file and attaching it via virsh. The procedure goes something like this: become root, use dd to write a stream of zeros to a new file of the proper size in the proper location, then use virsh to attach the new disk to the already running VM. Then, of course, log in to the VM and do your disk expansion.
So I log in, sudo -i, and start my work… cd /var/lib/libvirt/images, ls -l to find the existing disk images, start carefully crafting my dd command.. dd… bs=1k count=40000000 if=/dev/zero.. of=./vmname-disk… which was the next disk again? <tab> of=vmname-disk2.img <back arrow, back arrow, back arrow, back arrow, backspace> don’t want to dd over the existing disk.. That’d be bad. Let’s change that 2 to a 2, and enter. OH CRAP, I CHANGED THE 2 TO A 2 NOT A 3!!! CTRL-CCCCCCCCCCCCCCCCCCCC.
I still get sick thinking about this. I’d done the stupidest thing I possibly could have done: I started dd, as root, over top of an EXISTING DISK ON A RUNNING VM. What kind of idiot does that?! The kind that’s at work late, trying to get this one little thing done before he heads off to see his friend.. The kind that thinks he knows better, and thought he was careful enough to not make such a newbie mistake. Gah.
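For illustration, here is a minimal sketch of a disk-creation step that cannot clobber an existing image. It is not the procedure used in this story; it assumes Python is available on the KVM host, and the function name `create_blank_image` and the file names are hypothetical. The one idea it shows is exclusive creation: the open fails if the target file already exists. Note that it creates a sparse file that reads back as zeros rather than physically writing zeros the way dd does.

```python
import os


def create_blank_image(path, size_bytes):
    """Create a new zero-filled (sparse) file, refusing to touch an existing one.

    Mode "x" is exclusive creation: open() raises FileExistsError if the
    path already exists, so a typo in the file name cannot overwrite a disk.
    """
    with open(path, "xb") as image:
        # Extend the empty file to the requested size; the gap reads as zeros.
        image.truncate(size_bytes)
    return os.path.getsize(path)


if __name__ == "__main__":
    # Hypothetical file name and size, for illustration only.
    try:
        create_blank_image("vmname-disk3.img", 40 * 1000 * 1000 * 1024)
    except FileExistsError:
        print("Refusing to overwrite an existing disk image")
```

GNU dd offers a similar guard: adding conv=excl makes it fail if the output file already exists.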
So.. How fast does dd start writing zeros? Faster than I can move my fingers from the <enter> key to the Ctrl-C keys. I tried a number of things to recover the running disk from memory, but really all I did was make things worse, I think. The system was still up, but still broken like it was before I touched it, so it was useless. Since my VMs were still running, and I’d already done enough damage for one night, I stopped touching things and went home.
The next day I owned up to the boss and my co-workers pretty much the moment I walked in the door. We started to take an inventory of what we had, and what was lost. I had taken the precaution of setting up backups ages ago, so we should have had that to fall back on. I opened up a ticket with Red Hat support and filled them in on how dumb I’d been. I can only imagine the reaction of the support person when they read my ticket. I worked helpdesk for years, so I know how this usually goes. They probably gathered their closest coworkers to mourn for my loss, or to get some entertainment out of the guy who’d been so foolish. I say this in jest. Red Hat’s support was awesome through this whole ordeal, and I’ll tell you how soon.
So I figured the next thing I was going to need from my broken server, which was still running, was the backups I’d been diligently collecting. They were on the VM, but on a separate virtual disk, so I figured they were safe. The disk I’d overwritten was the last disk I’d made to expand the volume the database was on, so that logical volume was toast, but I’ve always set up my servers such that the main mounts, /, /var, /home, /tmp, and /root, were all separate logical volumes. In this case, /backup was an entirely separate virtual disk. So I scp -r’d the entire /backup mount to my laptop. It copied, and I felt a little sigh of relief. All of my production systems were still running, and I had my backup. My hope was that it would be a relatively simple recovery: build a new VM, install RHEV-M, and restore my backup. Simple, right?
By now my boss had involved the rest of the directors and let them know that we were looking down the barrel of a possibly bad time. We started to organize a team meeting to discuss how we were going to get through this. I got back to my desk and started to look through the backups I had copied off of the broken server. All the files were there, but they were tiny.. Like a couple hundred KB each, instead of the hundreds of meg or even gig they should have been.. Happy feeling.. Gone.
Turns out, my backups were running, but at some point after a RHEV upgrade, the db backup utility had changed. Remember how I said this system had existed since 3.0? Well, 3.0 didn’t have an engine-backup utility, so in my RHEV training, we’d learned how to make our own. Mine broke when the tools changed, and for who knows how long, it had been producing an incomplete backup. Just some files from /etc, no database. Ohhhh.. Fudge. (I didn’t say fudge).
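Tiny backup files like these are the kind of thing a cheap automated check can flag. The sketch below is only an illustration, not what was running on this system: the function name `undersized_backups`, the /backup path in the usage example, and the 100 MB threshold are all assumptions. It lists the files in a backup directory that are smaller than a minimum size you would expect.

```python
from pathlib import Path


def undersized_backups(backup_dir, min_bytes):
    """Return the sorted names of regular files smaller than min_bytes."""
    return sorted(
        entry.name
        for entry in Path(backup_dir).iterdir()
        if entry.is_file() and entry.stat().st_size < min_bytes
    )


if __name__ == "__main__":
    # Assumed location and threshold; pick values that fit your own backups.
    suspicious = undersized_backups("/backup", 100 * 1024 * 1024)
    if suspicious:
        print("Backups smaller than expected:", ", ".join(suspicious))
```

A size check is only a smoke test. It would have caught a database dump that had shrunk to a few hundred KB, but the only real proof that a backup works is a test restore.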
I updated my support case with the bad news, and started thinking about what it’d take to break through one of these 4th-floor windows right next to my desk. OK, not really.
Well.. At this point we had, basically, three RHEV clusters with no manager. One of those was for dev work, but the other two were allll production. We started using these team meetings to discuss how we were going to recover from this mess. I don’t know what the rest of my team was thinking about me, but I can say that everyone was surprisingly supportive and un-accusatory. I mean, with one typo I’d thrown off the entire department. Projects were put on hold, workflows disrupted. But at least we had time. We couldn’t reboot machines, we couldn’t change configurations, we couldn’t get to VM consoles, but at least everything was still booted up and operating. Red Hat support had escalated my SNAFU to a RHEV engineer, a guy I’d worked with in the past. I don’t know if he remembered me, but I remembered him, and he came through, again.
About a week in, for some unknown reason (we never figured out why), our Windows VMs started dropping offline. They were still running as far as we could tell, but they dropped off the network! Just boom, offline. In the course of a workday, we lost about a dozen Windows systems. All of our RHEL machines were working fine. It was just some Windows machines, and not even every Windows machine, only about a dozen of them.
Well great, how could this get worse? Oh right, add a ticking time-bomb. Why were the Windows servers dropping off? Would they all eventually drop off? Would the RHEL systems eventually drop off? I made a panicked call back to support, I emailed my account rep, I called in every favour I’d ever collected from contacts I had within Red Hat to get help as quickly as possible. I ended up on a conference call with two support engineers, and we got to work. After about 30 minutes on the phone, we’d worked out the most insane recovery method.
We had the newer RHEV manager I mentioned earlier, which was up and running and had two new clusters attached to it. Our recovery goal was to get all of our workloads moved from the broken clusters to these two new clusters. Want to know how we ended up doing it? Well, as our Windows VMs were dropping like flies, the engineers and I came up with this plan. My clusters used fibre channel SAN storage as their storage domains. We took a machine that was not in use, but had a fibre channel HBA in it, and attached the LUNs for both the old clusters’ storage domains AND the new clusters’ storage domains to it. The plan was to make a new VM on the new clusters, attach blank disks of the proper size to the new VM, and then use dd (the irony is not lost on me) to copy, block for block, the old broken VM’s disk over to the newly created VM’s empty disk.
I don’t know if you’ve ever delved deeply into a RHEV storage domain, but under the covers it’s all LVM. The problem is, the LVs aren’t human readable. They’re just UUIDs that the RHEV manager’s database links from VM to disk. The VMs were running, but we didn’t have the database to reference. So how do you get this data? virsh..
Luckily, I’d managed KVM and Xen clusters long before RHEV was a thing that was viable. I was no stranger to libvirt’s virsh utility. With the proper authentication (which the engineers gave to me), I was able to virsh dumpxml on a source VM while it was running, get all the info I needed about its memory, disk, CPUs, even MAC address, and create an empty clone of it on the new clusters. Then, once I had everything perfect, I’d shut down the VM on the broken cluster, either with virsh shutdown or by logging into the VM and shutting it down. The catch here is, if I missed something and shut down that VM, there was no way I’d be able to power it back on. Once it was no longer in memory, the config was completely lost. It’s all in the database, and I’d hosed that. Once I had everything, I’d log in to my migration host, the one that was connected to both storage domains, and use dd to copy, bit for bit, the source storage domain disk over to the destination storage domain disk. Talk about nerve-wracking, but it worked! We picked one of the broken Windows VMs and did this, and within about half an hour we’d completed all the steps and brought it back online. WOW!
We did hit one snag. See, we’d used snapshots here and there. RHEV snapshots are LVM snapshots. Consolidating them without the RHEV manager was a bit of a chore, and took even more legwork and research before we could dd the disks. I had to mimic the snapshot tree by creating symbolic links in the right places, and then start the dd process. Wow. I worked that one out late that evening after the engineers were off, probably enjoying time with their families. They asked me to write the process up in detail later. I suspect that it turned into some internal Red Hat documentation, never to be given to a customer because of the chance of royally hosing your storage domain.
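To give a feel for the virsh dumpxml step, here is a minimal sketch that pulls the name, memory, vCPU count, disk sources and MAC addresses out of a libvirt domain XML document. It is an illustration rather than the tooling actually used during the recovery. The function name `summarize_domain` is hypothetical, and it assumes the usual libvirt layout, with disks and interfaces under `<devices>` and a disk source given as either a `dev` or a `file` attribute.

```python
import xml.etree.ElementTree as ET


def summarize_domain(xml_text):
    """Extract the settings needed to rebuild a VM from libvirt domain XML."""
    root = ET.fromstring(xml_text)

    memory = root.find("memory")
    vcpu = root.findtext("vcpu")

    disks = []
    for disk in root.findall("./devices/disk"):
        source = disk.find("source")
        target = disk.find("target")
        disks.append({
            # Block-backed disks use dev=, file-backed disks use file=.
            "source": None if source is None
            else (source.get("dev") or source.get("file")),
            "target": None if target is None else target.get("dev"),
        })

    macs = [mac.get("address")
            for mac in root.findall("./devices/interface/mac")]

    return {
        "name": root.findtext("name"),
        "memory": None if memory is None else int(memory.text),
        # libvirt treats a missing unit attribute as KiB.
        "memory_unit": None if memory is None else memory.get("unit", "KiB"),
        "vcpus": int(vcpu) if vcpu else None,
        "disks": disks,
        "macs": macs,
    }
```

Feeding it the saved output of virsh dumpxml for each running VM gives a record you can check the new, empty VM against before shutting the old one down, which matters when the old one can never be started again.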
Somehow, over the course of 3 months, and probably a dozen scheduled maintenance windows, I managed to migrate every single VM (of about 100 VMs) from the old zombie clusters over to the working clusters. This included our Zimbra collaboration system (10 VMs in itself), our file servers (another dozen VMs), our ERP, even Oracle databases. We didn’t lose a single VM, and we had no more unplanned outages. The RHEL systems, and even some Windows systems, never fell to the mysterious drop-off that those dozen or so Windows servers did early on. During this ordeal I had trouble sleeping, I was stressed out, I felt so guilty for creating all this work for my co-workers, and I even had trouble eating. No exaggeration, I lost 10 lbs.
So, don’t be like Nate. Monitor your important systems, check your backups, and for all that’s holy, double-check your dd output file!
Happy SysAdmin Appreciation Day!