Restored Linux VM's with Veeam 4 Corrupt

Postby tsightler » Sat Nov 14, 2009 5:36 am

[Updated by Gostev on Nov 22nd]

It has come to my attention that the competition is trying to spread FUD and make a big deal of this issue, when in fact it is not a big deal.
1. Backups are NOT corrupted.
2. You can only run into this issue with a NON-DEFAULT restore mode, in 1 of the 3 existing restore modes.
3. Despite what the competition may be claiming, there is no actual user data loss or corruption - the VM will still boot and work.

The only real issue is that the OS and file system check tools complain about unexpected content in unused disk blocks. The Linux ext3 file system and its disk check tools merely suspect a problem when they see non-zeroed unused blocks, and warn about it. This is specific to certain file systems only; Windows NTFS, for example, considers this situation absolutely normal.


[Updated by Gostev on Nov 18th]

Issue summary:

What IS NOT affected
1. Actual backups are not corrupted.
2. Guest OS file level restore is not affected.
3. VM file level restore (VMX, VMDK) is not affected.
4. Entire VM restore with registration for Windows VMs is not affected.
5. Entire VM restore for Linux VMs in the default (agentless) restore mode is not affected.

What IS affected
Entire VM restore for Linux VMs in agent-based mode (used for ESX hosts for which you have purposely enabled service console agent-based operations in the host settings) is affected in the following way:
• All VMDK blocks containing actual data are restored properly (there is no data corruption/loss).
• All VMDK blocks without data are not zeroed and will contain data that was previously stored in the corresponding VMFS blocks. Thus, the issue is not reproducible when restoring to a "clean" VMFS datastore.
• VM boots up and runs fine, but OS may complain about file system integrity issues.
• Disk check tools like fsck may complain about file system integrity issues if forced to check the whole disk.

Cause
Unlike Windows NTFS, some Linux file systems and disk test tools expect unused disk blocks to be zeroed, and treat and report non-zeroed disk blocks as a potential disk data corruption issue.

Fix for Veeam Backup 4.0 is available through support. The fix is included in Veeam Backup 4.1 (scheduled for release in Dec 09).

Original post:


OK, I've got a MAJOR issue. Last week we had to restore a couple of RHEL5 VMs. The process seemed to go OK: the systems restored and booted without a problem, and the machines seemed perfectly fine. Today, an administrator of one of the systems started getting filesystem errors and reported them to me. At first it looked like some minor corruption, so I rebooted to a rescue CD to run an 'fsck' on the filesystems. This was disastrous. Each and every filesystem reported tons and tons of corruption, so bad that it was uncorrectable.

Because this was a development system on which we were doing testing with Oracle cluster services, I didn't think too much of it. The systems had been through panic reboots and were running Oracle modules that weren't supported by Red Hat. That being said, it did cause me some concern, so I decided to perform a restore of one of our smaller, mostly inactive Linux systems and run an 'fsck' on the restored volume. Guess what? It showed the massive corruption as well. It appears that something in the Veeam restore process is causing subtle corruption of the restored VMDK.

This is a critical issue since, as cool as Veeam is, its most critical function is to correctly restore data. I have not tested Windows systems yet. I'm planning to perform some additional testing on a small test system and to open a support case, but I'm putting this out there to see if anyone else has experienced any problems with completely restored VMs.
tsightler
Veeam MVP
 
Posts: 430
Joined: Fri Jun 05, 2009 12:57 pm
Full Name: Tom Sightler

Re: Restored Linux VM's with Veeam 4 Corrupt

Postby tsightler » Sat Nov 14, 2009 6:47 pm

OK, I just tested this in a way where it's 100% clear that Veeam is causing a problem. Here are the steps I took:

1. Took a small, running RHEL 5.4 VM and cloned it using vCenter to a new virtual machine.
2. Booted the cloned VM with a Red Hat rescue CD and ran 'fsck' to verify the status of the disks. Since it was a clone of a running system, this replayed the ext3 journal and found no other errors.
3. Shut the cloned VM down so that the disks were completely clean.
4. Created a Veeam job to back up the powered-off RHEL 5.4 VM, which went fine (backup to Linux host).
5. Restored the VM with Veeam.
6. Booted the restored VM with the rescue CD and ran 'fsck', which found hundreds of errors on every ext3 filesystem. This should be 100% clean since the VM was backed up in a clean, powered-off state.

Obviously Veeam is not properly restoring the VM to its original state. Anyone using Veeam to back up Linux VMs should be very cautious. I'm going to test backing up to the Windows local disk on the Veeam server rather than the Linux host that's currently our backup target. Perhaps it's a problem with the Veeam agent.

Re: Restored Linux VM's with Veeam 4 Corrupt

Postby tsightler » Sat Nov 14, 2009 7:31 pm

This problem occurs with a Veeam replicated VM as well. I can clone a running VM with vCenter and get a clean copy, but using the replication feature gives me a VM that appears to be OK at first, but fails any attempt to run 'fsck' and eventually generates errors after enough disk operations.

Re: Restored Linux VM's with Veeam 4 Corrupt

Postby Gostev » Sat Nov 14, 2009 9:19 pm

Hi Tom, I will ask our QC to attempt to replicate this issue. Can you please compare CRC/hash of source (original) VMDK and target (replicated or restored) VMDK, and see if they are the same? Also, any results from testing with local storage?

We do have entire VM backup/restore covered by autotest which compares source and target VMDK hashes to make sure that they are absolutely identical bit-wise. But of course it is always possible that an issue may depend on additional factors not covered by autotests.
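
For anyone who wants to run the same comparison by hand, here is a minimal sketch in Python. It assumes both disk files are reachable from the machine running the script. The function names and the file names in the usage note are hypothetical, and on ESX the disk data lives in the `-flat.vmdk` extent rather than in the small text descriptor. This is not Veeam's autotest code, only an illustration of the hash comparison described above.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so a large disk image never has to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(source_path, restored_path, algorithm="md5"):
    """Return True when both files produce the same digest."""
    return file_digest(source_path, algorithm) == file_digest(restored_path, algorithm)
```

A call such as `same_content("original-flat.vmdk", "restored-flat.vmdk")` (hypothetical paths) returns `True` only when the two files are byte-for-byte identical, barring a hash collision.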
Gostev
Product Manager
 
Posts: 4885
Joined: Sun Jan 01, 2006 1:01 am
Full Name: Anton Gostev

Re: Restored Linux VM's with Veeam 4 Corrupt

Postby tsightler » Sat Nov 14, 2009 10:15 pm

Hi Gostev,

I've opened a support case. So far this issue is 100% reproducible both with direct backup to Linux and backup to local Windows storage. When I restore the VM, boot the rescue CD, and run file system check, I get tons of errors. I'm trying a restore with "agentless" mode in case the problem is on the restore side.

I'll check the md5 sum on the restored disks. I certainly hope the backup itself is OK because I have two systems I will have to rebuild from scratch if not.

Re: Restored Linux VM's with Veeam 4 Corrupt

Postby tsightler » Sat Nov 14, 2009 10:50 pm

OK, a restore performed from Windows local disk to an ESX server in "agentless" mode seemed to generate a working VMDK. I'm going to try restoring the two systems we originally noticed this issue on using agentless mode and see what happens. Just to make sure we're on the same page, here's where we're at with our testing:

1. Restore from remote Linux host to ESX server with Veeam agent -- CORRUPT
2. Restore from local Windows disk to ESX server with Veeam agent -- CORRUPT
3. Restore from local Windows disk to ESX server with "agentless" mode -- GOOD

Re: Restored Linux VM's with Veeam 4 Corrupt

Postby tsightler » Sun Nov 15, 2009 1:00 am

Update -- I used "agentless" mode to restore both of the VM's that previously restored with corruption and, while the restore is MUCH slower, everything works perfectly using this mode. There appears to be something wrong with the agent on the ESX side of a restore/replication that causes some corruption. So here's a list of our testing:

1. Restore from remote Linux host to ESX server with Veeam agent -- CORRUPT
2. Restore from local Windows disk to ESX server with Veeam agent -- CORRUPT
3. Restore from local Windows disk to ESX server with "agentless" mode -- GOOD
4. Restore from remote Linux host to ESX server with "agentless" mode -- GOOD

The MD5 sums of the VMDK's do NOT match the original when restored with the Veeam agent enabled. For now I've temporarily forced "agentless" mode on all of my ESX hosts. This only affects restores and replicas since our backups use vStorage API SAN mode. At least I know that the data in the vbk/vbr files is correct, it's only a glitch with restores. I do not know if this impacts the restore of Windows systems but I don't really see how it couldn't. It obviously does NOT impact file level restores.
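
When the checksums differ, it can help to see where they differ. The sketch below (Python, with a hypothetical function name, not part of any Veeam tool) walks both images block by block and counts the mismatching blocks, separating those where the source block is entirely zero from those where the source block holds data. The 1 MB block size is an arbitrary assumption: a block that mixes data and zeros is counted as data, so a smaller block size gives a finer picture.

```python
def classify_mismatches(source_path, restored_path, block_size=1024 * 1024):
    """Compare two images block by block.

    Returns (zero_source_mismatches, data_mismatches): the number of differing
    blocks where the source block is all zeros, and the number of differing
    blocks where the source block holds data (or the file lengths disagree).
    """
    zero_block = bytes(block_size)
    zero_source_mismatches = 0
    data_mismatches = 0
    with open(source_path, "rb") as src, open(restored_path, "rb") as dst:
        while True:
            a = src.read(block_size)
            b = dst.read(block_size)
            if not a and not b:
                break  # both files are exhausted
            if a == b:
                continue
            # A non-empty, all-zero source block that differs in the restore.
            if a and a == zero_block[:len(a)]:
                zero_source_mismatches += 1
            else:
                data_mismatches += 1
    return zero_source_mismatches, data_mismatches
```

If every mismatch lands in the first count, the blocks that hold data were restored intact and only the content of empty source blocks differs.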

Re: Restored Linux VM's with Veeam 4 Corrupt

Postby Gostev » Sun Nov 15, 2009 10:10 am

I have an idea that this is specific to ext3 and how it treats the content of unused blocks on the file system, so this may not affect Windows systems (Windows does not care about the content of unused blocks). The difference between agent and agentless mode could be in how disk space is pre-allocated for the restored VMDK, so that in one case ext3 sees those blocks and reports errors. There could also be some differences in ESX4 compared to ESX3.5 around this. Will investigate.

Re: Restored Linux VM's with Veeam 4 Corrupt

Postby tsightler » Sun Nov 15, 2009 2:31 pm

I'll be interested to hear. I'm not aware of ext3 caring about unused blocks. If it did, it would have to initialize them with some data during format, and then, as far as VMware/Veeam is concerned, I think they wouldn't be unused blocks anymore. I'm actually somewhat concerned that the Veeam product is not restoring a bit-for-bit copy of the VMDK, no matter the underlying filesystem. We could potentially use Veeam to back up all types of filesystems, and I didn't really think it would matter since it's making an image of the VMDK (well, with the possible issue of not being able to perform an FLR). I'll wait to hear though. I'm happy to have my two systems back up and running and to know that the data is good.

Re: Restored Linux VM's with Veeam 4 Corrupt

Postby Gostev » Sun Nov 15, 2009 6:39 pm

The Veeam agent preallocates the VMDK and then restores bit for bit, but only the blocks with actual data (not the empty blocks). I suspect that this has to do with those unused blocks, and that some ESX4 change affected how disk space preallocation works. Autotests run on Windows and show no such problem (the restored VMDK checksum matches the source VMDK checksum, so the files are bit-identical).

Anyway, just my theory... I will post as soon as I hear from devs/QC, right now they also only have theories :)

Re: Restored Linux VM's with Veeam 4 Corrupt

Postby cicero » Mon Nov 16, 2009 1:31 pm

Just one guess, without reading all your posts.
I had the same issue because of user error:
we simply started the replicated guests with the VI Client instead of the Veeam client.
So the filesystems in incrementally backed-up guests got messed up.

Took us (support and me) about 4 weeks of investigation ... silly me :oops:

cicero
cicero
Enthusiast
 
Posts: 32
Joined: Wed Mar 18, 2009 9:48 am
Full Name: David Winkler

Re: Restored Linux VM's with Veeam 4 Corrupt

Postby tsightler » Mon Nov 16, 2009 2:07 pm

This is not just on replicated VM's, but VM's that are restored from a backup. Once a VM is restored and registered it should be ready to be powered on. It's 100% reproducible in our case that a restore using the console agent fails, but a restore using agentless mode is good, at least that's the case for our Redhat Enterprise 5.x systems.

Re: Restored Linux VM's with Veeam 4 Corrupt

Postby Gostev » Mon Nov 16, 2009 2:30 pm

cicero wrote: Just one guess, without reading all your posts.
I had the same issue because of user error:
we simply started the replicated guests with the VI Client instead of the Veeam client.
So the filesystems in incrementally backed-up guests got messed up.

Took us (support and me) about 4 weeks of investigation ... silly me :oops:

No, this is a different issue, which is even documented in the Known Issues section of the Release Notes:

• If a replicated VM is started using means other than Veeam Backup user interface, all existing restore points will become invalid and will remain orphaned for the consequent job runs.

Re: Restored Linux VM's with Veeam 4 Corrupt

Postby Gostev » Mon Nov 16, 2009 9:31 pm

Gostev wrote: I have an idea that this is specific to ext3 and how it treats the content of unused blocks on the file system, so this may not affect Windows systems (Windows does not care about the content of unused blocks). The difference between agent and agentless mode could be in how disk space is pre-allocated for the restored VMDK, so that in one case ext3 sees those blocks and reports errors. There could also be some differences in ESX4 compared to ESX3.5 around this. Will investigate.

We have reproduced the issue, and it looks like the guess above was correct. As far as I understand, ESX4 has a few different types of flat disks, and the ones our agent creates are not eagerly zeroed. So unused blocks contain rubbish (old raw data from the storage device), and ext3/fsck do not like this (NTFS should not really care). One other workaround we have found (besides agentless mode) is restoring disks as thin (there is an option in the Restore wizard).
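
The mechanism can be shown with a toy model in Python. This is only a sketch under simplifying assumptions, not the agent's actual code: disks are byte arrays, a block is 4 bytes, "empty" means all zeros, and the function name is hypothetical. Only non-empty source blocks are written to the target, so whatever the target held before survives in the skipped blocks.

```python
import hashlib

BLOCK = 4  # bytes per block; tiny on purpose so the example is easy to follow


def restore_data_blocks_only(source, target):
    """Copy only the non-zero blocks of `source` into `target`, in place.

    `target` stands for a preallocated disk that was NOT zeroed first, so the
    blocks skipped here keep whatever the datastore held before.
    """
    zero = bytes(BLOCK)
    for offset in range(0, len(source), BLOCK):
        block = source[offset:offset + BLOCK]
        if block != zero:
            target[offset:offset + BLOCK] = block
    return target


# Source disk: two data blocks, each followed by an empty block.
source = b"DATA" + bytes(BLOCK) + b"MORE" + bytes(BLOCK)

dirty = bytearray(b"old!" * 4)  # datastore region with leftover data
clean = bytearray(len(source))  # datastore region that was already zeroed

restored_dirty = bytes(restore_data_blocks_only(source, dirty))
restored_clean = bytes(restore_data_blocks_only(source, clean))

source_md5 = hashlib.md5(source).hexdigest()
print("clean target matches source:", hashlib.md5(restored_clean).hexdigest() == source_md5)
print("dirty target matches source:", hashlib.md5(restored_dirty).hexdigest() == source_md5)
print("dirty target content:", restored_dirty)
```

In this model the data blocks are identical in both restores. Only the restore onto the dirty region fails the checksum comparison, because its empty blocks still hold the old `old!` bytes, which is also why the issue does not show up on a clean datastore.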

Will update on fix timelines as soon as I have them.

Re: Restored Linux VM's with Veeam 4 Corrupt

Postby donikatz » Tue Nov 17, 2009 9:24 pm

Wow, this is huge, thanks for this thread! We have quite a few Veeam 4 / ESX 4 backup jobs for production RHEL4 servers and have not noticed anything with test restores. I'm now re-testing our restores more in depth. Yikes!

Thankfully we use vStorage API and VCB for all backup jobs, so only restore jobs are affected. Still, this is a major, major problem. Restores are slow enough as it is over ethernet, let alone agentless. Please make this Veeam's highest bug fix priority.

Thanks
donikatz
Expert
 
Posts: 109
Joined: Sun Jan 01, 2006 1:01 am
