Recently our Exchange 2013 server, running on Windows Server 2008 (not R2), started having backup failures. We use FalconStor DiskSafe for backup, which, like most (all?) Exchange-aware backup applications, uses VSS to snapshot the server.
During the investigation with FalconStor support, we discovered that VSS was timing out while verifying the snapshot, and running “vssadmin list writers” showed a number of VSS writers in a failed state. No updates had recently been applied to the server, nor had any configuration changes been made, so there was no obvious cause for the problem.
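If you want to check for failed writers from a script rather than eyeballing the output, here is a minimal Python sketch. It assumes the usual `vssadmin list writers` layout (a `Writer name: '...'` line followed by `State: [n] ...` and `Last error: ...` lines) and treats a writer as failed if its state mentions “Failed” or its last error is anything other than “No error”. The function names are my own, not part of any Microsoft tool, and vssadmin needs an elevated prompt.

```python
import re
import subprocess
import sys

# Patterns for the lines we care about in `vssadmin list writers` output.
# The layout is assumed from typical output, not from a formal spec.
_NAME_RE = re.compile(r"^\s*Writer name:\s*'(?P<name>[^']*)'")
_STATE_RE = re.compile(r"^\s*State:\s*\[(?P<code>\d+)\]\s*(?P<text>.+?)\s*$")
_ERROR_RE = re.compile(r"^\s*Last error:\s*(?P<error>.+?)\s*$")


def parse_writers(output: str) -> list[dict]:
    """Split vssadmin output into one dict (name, state, error) per writer."""
    writers = []
    current = None
    for line in output.splitlines():
        if m := _NAME_RE.match(line):
            current = {"name": m["name"], "state": None, "error": None}
            writers.append(current)
        elif current is None:
            continue  # skip the header before the first writer
        elif m := _STATE_RE.match(line):
            current["state"] = m["text"]
        elif m := _ERROR_RE.match(line):
            current["error"] = m["error"]
    return writers


def failed_writers(writers: list[dict]) -> list[dict]:
    """Writers whose state mentions 'Failed' or whose last error isn't 'No error'."""
    return [
        w for w in writers
        if (w["state"] and "failed" in w["state"].lower())
        or (w["error"] and w["error"].lower() != "no error")
    ]


# Only query vssadmin when actually running on Windows (elevated prompt required).
if __name__ == "__main__" and sys.platform == "win32":
    out = subprocess.run(["vssadmin", "list", "writers"],
                         capture_output=True, text=True, check=True).stdout
    for w in failed_writers(parse_writers(out)):
        print(f"{w['name']}: state={w['state']!r}, last error={w['error']!r}")
```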
FalconStor support was able to point us to this link at IBM. While not an exact description of our problem, the article linked to a Microsoft utility, DevNodeClean. The description for that utility is:
“On a computer that is running Windows Server 2003 or a later version, a storage device that is connected by using a fiber channel or by using the iSCSI protocol may be connected for only a short time. When a storage device is connected, Windows creates registry information for the device. Over time, the registry may contain many entries for devices that will never be used again. This utility can be used to remove this information from the registry.”
Interesting. Searching a bit more, I came across a third-party compiled version of DevNodeClean, which comes with more information about the problem it resolves.
I can’t find the link now, but I also came across a page that suggested checking the size of the registry itself. The actual registry files are located in C:\Windows\System32\config. In our case the SYSTEM file, which holds HKLM\SYSTEM, was over 1.5GB! Checking a few other servers of similar age showed SYSTEM files of 50MB or less in most cases.
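To make this check repeatable across servers, a small Python sketch like the one below could flag a bloated SYSTEM hive. It assumes Windows is installed on C:, and the 200MB threshold is an arbitrary value I picked between the ~50MB healthy servers and the 1.5GB problem server; tune it for your environment. Python's `os.path.getsize` should be able to read the size even though the hive is locked by the OS.

```python
import os
from pathlib import Path

# Default SYSTEM hive location, assuming Windows lives on C:.
DEFAULT_HIVE = Path(r"C:\Windows\System32\config\SYSTEM")

# Hypothetical threshold: healthy servers here were ~50MB, the broken one 1.5GB.
THRESHOLD_MB = 200


def hive_size_mb(path: Path) -> float:
    """Size of the hive file in megabytes (1MB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)


def is_bloated(path: Path, threshold_mb: float = THRESHOLD_MB) -> bool:
    """True if the hive file is larger than the threshold."""
    return hive_size_mb(path) > threshold_mb


# Only report when the hive actually exists (i.e. we are on a Windows box).
if __name__ == "__main__" and DEFAULT_HIVE.exists():
    size = hive_size_mb(DEFAULT_HIVE)
    status = "BLOATED" if is_bloated(DEFAULT_HIVE) else "ok"
    print(f"{DEFAULT_HIVE}: {size:.1f} MB ({status})")
```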
By now, I was pretty confident this was our issue. I ran DevNodeClean (the Microsoft-provided one), which took, no joke, six days to complete. As it continued to delete unused devices from the registry, our backups began running again intermittently. After about five days of running the utility, all backups were working properly.
It seems what happens is that any backup product that calls VSS (FalconStor, Veeam, Microsoft DPM, etc.) causes a registry entry to be created for every single VSS snapshot. I believe the same happens when using Hyper-V snapshots. These registry entries are never cleaned up, and over time they bloat the registry to the point where it can essentially break VSS. In our case, we were snapping two LUNs every hour, so over a few years of the server’s life there were thousands of orphaned registry entries.
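For a sense of scale, here is the back-of-the-envelope arithmetic as a tiny Python sketch. It assumes one snapshot per LUN per hourly run and, per my theory above, one orphaned entry per snapshot; I haven't verified that the ratio is exactly one-to-one.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def estimated_snapshots(luns: int, snaps_per_hour: float, years: float) -> int:
    """Snapshots taken, assuming one VSS snapshot per LUN per scheduled run."""
    return int(luns * snaps_per_hour * HOURS_PER_DAY * DAYS_PER_YEAR * years)


# Two LUNs snapped hourly, as in our setup.
for years in (1, 2, 3):
    print(years, estimated_snapshots(luns=2, snaps_per_hour=1, years=years))
```

That works out to 17,520 snapshots per year, or 52,560 after three years, so if each one really leaves an entry behind, the count climbs quickly into the tens of thousands.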
There is supposedly a hotfix for 2008 R2, but it’s not clear whether it fixes the problem, and in any event it isn’t available for 2008.
It seems the best way to deal with this is to run DevNodeClean as a scheduled task on a regular basis.
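One way to set that up is with the built-in schtasks.exe. The Python sketch below just builds and runs a `schtasks /Create` command for a weekly task running as SYSTEM. The install path `C:\Tools\DevNodeClean.exe`, the task name, and the Sunday 02:00 schedule are hypothetical placeholders, and I'm assuming DevNodeClean runs unattended when launched with no arguments; check its documentation before relying on that.

```python
import subprocess
import sys


def build_schtasks_args(task_name: str, exe_path: str,
                        day: str = "SUN", start_time: str = "02:00") -> list[str]:
    """Arguments for schtasks.exe to run exe_path weekly as SYSTEM."""
    return [
        "schtasks", "/Create",
        "/TN", task_name,    # task name
        "/TR", exe_path,     # program to run (hypothetical path, no spaces)
        "/SC", "WEEKLY",     # schedule type
        "/D", day,           # day of week for a WEEKLY schedule
        "/ST", start_time,   # start time, HH:mm
        "/RU", "SYSTEM",     # run as LocalSystem
        "/F",                # overwrite an existing task with the same name
    ]


# Create the task only when run on Windows from an elevated prompt.
if __name__ == "__main__" and sys.platform == "win32":
    args = build_schtasks_args("DevNodeClean weekly", r"C:\Tools\DevNodeClean.exe")
    subprocess.run(args, check=True)
```

Running it regularly should keep each run short. Note that the first cleanup on a badly bloated server may take days, as ours did, and Task Scheduler's default execution time limit may stop a task that runs longer than three days, so it's probably best to do that initial run by hand.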
Unfortunately, the registry cannot be compacted online, so we will need to schedule a maintenance window to actually shrink the SYSTEM hive.