Restored Linux VM's with Veeam 4 Corrupt

Post by **tsightler** » Nov 14, 2009 5:36 am

[Updated by Gostev on Nov 22nd]

It has come to my attention that the competition is trying to spread FUD and make a big deal out of this issue, when in fact it is not a big deal.
1. Backups are NOT corrupted.
2. You can only run into this issue with a NON-DEFAULT restore mode, which is 1 of the 3 existing restore modes.
3. Despite what the competition may be claiming, there is no actual user data loss or corruption - the VM will still boot and work.

The only real issue is the OS and file system check tools complaining about unexpected content in unused disk blocks. The Linux ext3 file system and disk test tools merely suspect a problem when they see non-zeroed unused blocks, and warn about it. This is specific to certain file systems only; Windows NTFS, for example, considers this situation absolutely normal.

[Updated by Gostev on Nov 18th]

Issue summary:

What IS NOT affected
1. Actual backups are not corrupted.
2. Guest OS file level restore is not affected.
3. VM file level restore (VMX, VMDK) is not affected.
4. Entire VM restore with registration for Windows VMs is not affected.
5. Entire VM restore for Linux VMs in the default (agentless) restore mode is not affected.

What IS affected
Entire VM restore for Linux VMs in agent-based mode (used for ESX hosts for which you have purposely enabled service console agent-based operations in the host settings) is affected in the following way:
• All VMDK blocks containing actual data are restored properly (there is no data corruption/loss).
• All VMDK blocks without data are not zeroed and will contain data that was previously stored in the corresponding VMFS blocks. Thus, the issue is not reproducible when restoring to a "clean" VMFS datastore.
• VM boots up and runs fine, but OS may complain about file system integrity issues.
• Disk check tools like fsck may complain about file system integrity issues if forced to check the whole disk.

Cause
Unlike Windows NTFS, some Linux file systems and disk test tools expect unused disk blocks to be zeroed, and treat and report non-zeroed blocks as a potential disk data corruption issue.

Fix for Veeam Backup 4.0 is available through support. The fix is included in Veeam Backup 4.1 (scheduled for release in Dec 09).

Original post:

OK, I've got a MAJOR issue. Last week we had to restore a couple of RHEL5 VM's. The process seemed to go OK, the systems restored and booted without a problem and the machines seemed perfectly fine. Today, an administrator of one of the systems started getting filesystem errors and reported them to me. At first it looked like some minor corruption so I rebooted to a rescue CD to run an 'fsck' on the filesystems. This was disastrous. Each and every filesystem reported tons and tons of corruption, so bad that it was uncorrectable.

Because this was a development system on which we were doing testing with Oracle cluster services I didn't think too much of it. The systems had been through panic reboots and were running Oracle modules that weren't supported by Redhat. That being said, it did cause me some concern so I decided to perform a restore of one of our smaller, mostly inactive Linux systems and run an 'fsck' on the restored volume. Guess what? It showed the massive corruption as well. It appears that something that is part of the Veeam restore process is causing subtle corruption of the restored VMDK.

This is a critical issue since, as cool as Veeam is, its most critical function is to correctly restore data. I have not tested Windows systems yet. I'm planning to perform some additional testing on a small test system and to open a support case but I'm putting this out there to see if anyone else has experienced any problems with completely restored VM's.

Post by **tsightler** » Nov 14, 2009 6:47 pm

OK, I just tested this in a way where it's 100% clear that Veeam is causing a problem. Here are the steps I took:

1. Took a small, running RHEL 5.4 VM and cloned it using vCenter to a new virtual machine.
2. Booted the cloned VM with a Redhat rescue CD and ran 'fsck' to verify the status of the disks. Since it was a clone of a running system this replayed the ext3 journal and found no other errors.
3. Shut the clone VM down so that the disks were completely clean.
4. Created a Veeam job to backup the powered off RHEL 5.4 VM which went fine (backup to Linux host).
5. Restored the VM with Veeam.
6. Booted the restored VM with the rescue CD and ran 'fsck', which found hundreds of errors on every ext3 filesystem. This should be 100% clean since the VM was backed up in a clean, powered off state.

Obviously Veeam is not properly restoring the VM to its original state. Anyone using Veeam to backup Linux VM's should be very cautious. I'm going to test backing up to the Windows local disk on the Veeam server rather than the Linux host that's currently our backup target. Perhaps it's a problem with the Veeam agent.

Post by **tsightler** » Nov 14, 2009 7:31 pm

This problem occurs with a Veeam replicated VM as well. I can clone a running VM with vCenter and get a clean copy, but using the replication feature gives me a VM that appears to be OK at first, but will fail any attempt to run 'fsck' and will eventually generate errors with enough disk operations.

Post by **Gostev** » Nov 14, 2009 9:19 pm

Hi Tom, I will ask our QC to attempt to replicate this issue. Can you please compare CRC/hash of source (original) VMDK and target (replicated or restored) VMDK, and see if they are the same? Also, any results from testing with local storage?

We do have entire VM backup/restore covered by autotest which compares source and target VMDK hashes to make sure that they are absolutely identical bit-wise. But of course it is always possible that an issue may depend on additional factors not covered by autotests.
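The hash comparison requested above can be sketched in a few lines of Python. This is only an illustration, not Veeam's autotest: the function names are made up for this example, and MD5 is used because it is the checksum mentioned later in the thread.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so a large VMDK never sits fully in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def images_match(source_path, restored_path, algorithm="md5"):
    """Return True when both files hash to the same value (hypothetical helper)."""
    return file_digest(source_path, algorithm) == file_digest(restored_path, algorithm)
```

A call such as `images_match("source-flat.vmdk", "restored-flat.vmdk")` (placeholder file names) would be run against the flat extent files of a powered-off VM, since a running VM's disks change between reads.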

Post by **tsightler** » Nov 14, 2009 10:15 pm

Hi Gostev,

I've opened a support case. So far this issue is 100% reproducible both with direct backup to Linux and backup to local Windows storage. When I restore the VM, boot the rescue CD, and run file system check, I get tons of errors. I'm trying a restore with "agentless" mode in case the problem is on the restore side.

I'll check the md5 sum on the restored disks. I certainly hope the backup itself is OK because I have two systems I will have to rebuild from scratch if not.

Post by **tsightler** » Nov 14, 2009 10:50 pm

OK, a restore performed from Windows local disk to an ESX server in "agentless" mode seemed to generate a working VMDK. I'm going to try restoring the two systems we originally noticed this issue on using agentless mode and see what happens. Just to make sure we're on the same page, here's where we're at with our testing:

1. Restore from remote Linux host to ESX server with Veeam agent -- CORRUPT
2. Restore from local Windows disk to ESX server with Veeam agent -- CORRUPT
3. Restore from local Windows disk to ESX server with "agentless" mode -- GOOD

Post by **tsightler** » Nov 15, 2009 1:00 am

Update -- I used "agentless" mode to restore both of the VM's that previously restored with corruption and, while the restore is MUCH slower, everything works perfectly using this mode. There appears to be something wrong with the agent on the ESX side of a restore/replication that causes some corruption. So here's a list of our testing:

1. Restore from remote Linux host to ESX server with Veeam agent -- CORRUPT
2. Restore from local Windows disk to ESX server with Veeam agent -- CORRUPT
3. Restore from local Windows disk to ESX server with "agentless" mode -- GOOD
4. Restore from remote Linux host to ESX server with "agentless" mode -- GOOD

The MD5 sums of the VMDK's do NOT match the original when restored with the Veeam agent enabled. For now I've temporarily forced "agentless" mode on all of my ESX hosts. This only affects restores and replicas since our backups use vStorage API SAN mode. At least I know that the data in the vbk/vbr files is correct, it's only a glitch with restores. I do not know if this impacts the restore of Windows systems but I don't really see how it couldn't. It obviously does NOT impact file level restores.

Post by **Gostev** » Nov 15, 2009 10:10 am

I have an idea that this is specific to ext3 and how it treats the content of unused blocks on the file system - so this may not affect Windows systems (Windows does not care about the content of unused blocks). The difference between agent and agentless mode could be in how disk space is pre-allocated for the restored VMDK, so in one case ext3 can see and report those problems as errors. Also there could be some differences in ESX4 compared to ESX3.5 around this. Will investigate.

Post by **tsightler** » Nov 15, 2009 2:31 pm

I'll be interested to hear. I'm not aware of ext3 caring about unused blocks. If it did, it would have to initialize them with some data during format and then, as far as VMware/Veeam is concerned, I think they wouldn't be unused blocks anymore. I'm actually somewhat concerned that the Veeam product is not restoring a bit-for-bit copy of the VMDK, no matter the underlying filesystem. We could potentially use Veeam to back up all types of filesystems, and I didn't really think it would matter since it's making an image of the VMDK (well, with the possible issue of not being able to perform an FLR). I'll wait to hear though. I'm happy to have my two systems back up and running and to know that the data is good.

Post by **Gostev** » Nov 15, 2009 6:39 pm

The Veeam agent preallocates the VMDK and then restores bit for bit, but only the blocks with actual data (not the empty blocks). I suspect that this has to do with those unused blocks, and that some ESX4 change affected how disk space preallocation works. Autotests run on Windows and there is no such problem there (the restored VMDK checksum matches the source VMDK checksum, so the files are bit-identical).

Anyway, just my theory... I will post as soon as I hear from devs/QC, right now they also only have theories.
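One way to test this theory against a real source/restored pair is to compare the two images block by block and check whether every differing block was empty in the source. The sketch below is an assumption-laden illustration: the function name and the 1 MB block size are arbitrary choices for this example, and it expects both files to be the same size.

```python
def classify_differences(source_path, restored_path, block_size=1024 * 1024):
    """Compare two image files block by block.

    Returns (in_empty, in_data): the number of differing blocks whose source
    block was all zeros, and the number whose source block held data.
    Assumes both files have the same length.
    """
    zero_block = bytes(block_size)
    in_empty = in_data = 0
    with open(source_path, "rb") as source, open(restored_path, "rb") as restored:
        while True:
            original = source.read(block_size)
            copy = restored.read(block_size)
            if not original and not copy:
                break  # both files exhausted
            if original != copy:
                # Slice the zero block so a short final block compares correctly.
                if original == zero_block[:len(original)]:
                    in_empty += 1
                else:
                    in_data += 1
    return in_empty, in_data
```

If the theory holds, `in_data` stays at 0 and only `in_empty` is non-zero: every difference sits in space the source disk never used.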

Post by **cicero** » Nov 16, 2009 1:31 pm

Just one guess, without reading all your posts.
I had the same issue because of user error:
we simply started the replicated guests with the VI Client instead of the Veeam client.
So the filesystems in the incrementally backed-up guests got messed up.

It took us (support and me) about 4 weeks of investigation ... silly me

cicero

Post by **tsightler** » Nov 16, 2009 2:07 pm

This is not just on replicated VM's, but VM's that are restored from a backup. Once a VM is restored and registered it should be ready to be powered on. It's 100% reproducible in our case that a restore using the console agent fails, but a restore using agentless mode is good, at least that's the case for our Redhat Enterprise 5.x systems.

Post by **Gostev** » Nov 16, 2009 2:30 pm

cicero wrote:Just one guess, without reading all your posts.
I had the same issue because of user error:
we simply started the replicated guests with the VI Client instead of the Veeam client.
So the filesystems in the incrementally backed-up guests got messed up.

It took us (support and me) about 4 weeks of investigation ... silly me

No, this is a different issue, which is even documented in the Known Issues section of the Release Notes:

• If a replicated VM is started using means other than Veeam Backup user interface, all existing restore points will become invalid and will remain orphaned for the consequent job runs.

Post by **Gostev** » Nov 16, 2009 9:31 pm

Gostev wrote:I have an idea that this is specific to ext3 and how it treats the content of unused blocks on the file system - so this may not affect Windows systems (Windows does not care about the content of unused blocks). The difference between agent and agentless mode could be in how disk space is pre-allocated for the restored VMDK, so in one case ext3 can see and report those problems as errors. Also there could be some differences in ESX4 compared to ESX3.5 around this. Will investigate.

We have reproduced the issue and it looks like the guess above was correct. As far as I understand, ESX4 has a few different types of flat disks, and the ones our agent creates are not eagerly zeroed. So unused blocks contain rubbish (some old raw data from the storage device) and ext3/fsck do not like this (NTFS should not really care). One other workaround we have found (besides agentless mode) is restoring disks as thin (there is an option in the Restore wizard).

Will update on fix timelines as soon as I have them.
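The mechanism described above can be modelled with a toy example. This is a simplified sketch, not Veeam's or ESX's actual code: the 4-byte block size, the `restore` function, and the sample data are all invented for illustration. It shows why writing only the data blocks into a region that was not zeroed leaves stale bytes behind, and why the same restore onto a clean datastore looks fine.

```python
BLOCK_SIZE = 4  # bytes per block; tiny on purpose so the example is easy to read


def restore(datastore, backup_blocks, zero_first):
    """Write backed-up blocks into a preallocated region (hypothetical model).

    datastore     -- bytearray standing in for the preallocated VMDK region
    backup_blocks -- dict of block index -> bytes; empty source blocks are absent
    zero_first    -- True mimics an eagerly zeroed disk, False one that is not zeroed
    """
    if zero_first:
        datastore[:] = bytes(len(datastore))
    for index, data in backup_blocks.items():
        start = index * BLOCK_SIZE
        datastore[start:start + BLOCK_SIZE] = data
    return bytes(datastore)


source = b"DATA" + bytes(4) + b"MORE" + bytes(4)  # blocks 1 and 3 are unused
backup = {0: b"DATA", 2: b"MORE"}                  # only non-empty blocks are stored

stale = b"oldAoldBoldColdD"                        # leftovers from a deleted VM

not_zeroed = restore(bytearray(stale), backup, zero_first=False)
zeroed = restore(bytearray(stale), backup, zero_first=True)
on_clean = restore(bytearray(16), backup, zero_first=False)

print(not_zeroed == source)  # False: blocks 1 and 3 still hold "oldB" and "oldD"
print(zeroed == source)      # True
print(on_clean == source)    # True: a never-used datastore has nothing stale to leak
```

The data blocks are identical in all three results; only the content of the unused blocks differs, which matches the summary at the top of the thread.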

Post by **donikatz** » Nov 17, 2009 9:24 pm

Wow, this is huge, thanks for this thread! We have quite a few Veeam 4 / ESX 4 backup jobs for production RHEL4 servers and have not noticed anything with test restores. I'm now re-testing our restores more in depth. Yikes!

Thankfully we use vStorage API and VCB for all backup jobs, so only restore jobs are affected. Still, this is a major, major problem. Restores are slow enough as it is over ethernet, let alone agentless. Please make this Veeam's highest bug fix priority.

Thanks

Post by **Gostev** » Nov 18, 2009 1:16 pm

Tom and Doni, I've just heard that the fix is already available and currently being tested - should be signed off soon - please contact/ping support later today to get it. This will make VMDK zeroed eagerly when restoring disks as thick, so ext3 will no longer go nuts about unexpected empty blocks' content.

Post by **donikatz** » Nov 18, 2009 2:44 pm

Great news, thanks for the quick turnaround!

Post by **donikatz** » Nov 18, 2009 4:57 pm

Interesting: I've been testing this with an RHEL4 VM with LVM2 (ext3 partitioned) and haven't been able to reproduce it. Booting from a Live CD, fsck -y is clean no matter which restore method. Does using LVM2 change the way Veeam handles restore and unused blocks?

Post by **tsightler** » Nov 18, 2009 6:04 pm

If you're only running "fsck -y" then that's probably why it's clean. The ext3 journal appears to believe everything's fine, so running "fsck -y" will simply replay the journal and report everything is in good shape. Try forcing a check with "fsck -f" which basically says "even if you think this volume is clean, check out everything just to be sure" and see how it looks. I no longer have any RHEL4 systems so it's possible that RHEL4 is somehow different, although the underlying filesystem is the same, but it appears that I can reproduce this at will with my RHEL5 systems. I've tried 4 systems altogether, three with ext3/LVM2, and another that was simply straight partitions.

Also, based on the problem description from Anton, it might not happen with ESX 3.5, which I think does eager thick by default, and it also might not happen if the underlying disk that you're restoring to is already empty, since the problem seems to be mostly caused by the lack of zeros in the free space.

Post by **donikatz** » Nov 18, 2009 6:12 pm

Ah, makes sense, I'll try the -f switch. Of course, please let us know your results when you get the hotfix and try it out. Thanks!

Post by **tsightler** » Nov 18, 2009 6:42 pm

tsightler wrote:If you're only running "fsck -y" then that's probably why it's clean.

Just to clarify, this sentence should probably read:

"If you're running 'fsck -y' then that's probably why the system reports the filesystem is clean."

Basically, running fsck without -f will only actually do anything if the system believes that the filesystem is dirty even after a journal replay. In this case the corruption is occurring silently to blocks on the disk that the filesystem wasn't even writing to, so it has no reason to believe that a journal replay wouldn't be enough to get everything ship-shape. Since the corruption is actually happening to the block device outside of the filesystem's awareness, you have to force fsck to check everything out.
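For anyone scripting this check, the difference between the two invocations can be sketched as a small Python helper. This assumes the ext3 checker `e2fsck` (which `fsck` dispatches to for ext2/ext3) and its standard `-f`, `-n`, and `-y` options; the helper names and the device path are placeholders for this example, and the volume must be unmounted, for instance when booted from a rescue CD.

```python
import subprocess


def build_fsck_command(device, force=True, read_only=True):
    """Build an e2fsck argument list for an unmounted ext2/ext3 device."""
    command = ["e2fsck"]
    if force:
        command.append("-f")  # check even if the file system is marked clean
    # -n opens the file system read-only and answers "no" to every prompt;
    # -y answers "yes" and lets e2fsck attempt repairs. They cannot be combined.
    command.append("-n" if read_only else "-y")
    command.append(device)
    return command


def run_fsck(device, force=True, read_only=True):
    """Run the check and return e2fsck's exit status (0 means no errors found)."""
    return subprocess.run(build_fsck_command(device, force, read_only)).returncode


# The device path below is a placeholder, not one taken from this thread.
print(build_fsck_command("/dev/VolGroup00/LogVol00"))
```

Using `-n` for the first pass reports what a forced check finds without modifying the restored disk, which keeps the evidence intact for a support case.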

Post by **donikatz** » Nov 18, 2009 9:29 pm

Ok, I was able to reproduce the problem. Thanks for the tip, Tom.

Post by **tsightler** » Nov 24, 2009 3:47 pm

OK, so as of yesterday I have a hotfix for this problem from Veeam. It appears to have a negative impact on performance of the restore, but it's better to have a slightly slower restore than a corrupt filesystem. This seems to take care of the issue for me. If you're restoring Linux VM's, I highly suggest either forcing agentless mode, or contacting Veeam to get the hotfix.

Post by **donikatz** » Nov 25, 2009 12:04 am

Haven't had time to get it myself yet. Is it faster than agentless? Thanks

Post by **tsightler** » Nov 25, 2009 1:02 am

Yeah, it's faster than agentless. I don't have exact numbers, but we were restoring a 50GB RHEL5.4 VM used for testing an Oracle RAC setup. I'm almost 100% sure that prior to the "fix" the restore was 70+MB/sec (the system has lots of empty space). When I tested the "fix" today I only got 40MB/sec. That's still much faster than agentless, which was more like 20-25MB/sec. This is probably to be expected since it's having to zero out all the sectors, but that has other advantages as well, such as a higher level of security.
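To put those rates in terms of wall-clock time, here is a rough calculation. It assumes the quoted rate applies to the full 50GB disk size and holds steady for the whole restore, which is a simplification, and it uses 22.5MB/sec as the midpoint of the quoted 20-25MB/sec range for agentless mode.

```python
def restore_minutes(size_gb, rate_mb_per_s):
    """Minutes needed to move size_gb at a steady rate (1 GB taken as 1024 MB)."""
    return size_gb * 1024 / rate_mb_per_s / 60


# Rates are the approximate figures quoted above; 22.5 is an assumed midpoint.
for label, rate in [("agent, before fix", 70), ("agent, with fix", 40), ("agentless", 22.5)]:
    print(f"{label}: {restore_minutes(50, rate):.1f} min")
```

Under those assumptions the same VM takes roughly 12 minutes before the fix, 21 minutes with it, and 38 minutes in agentless mode.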

Post by **Gostev** » Dec 18, 2009 1:35 pm

Fixed in version 4.1
