VMware’s vSphere has been the predominant hypervisor in datacenters for the better part of two decades. With its increased adoption over time, we’ve seen staggering growth in the number of workloads that are housed within datacenters. This growth can be attributed to many factors, including:
- Ease of deployment via automated workflows that are fast and consistent
- Greater VM-to-physical-server density, as rising physical CPU core counts and speeds allow more virtual machines (VMs) to run per host
- Easier management, maintenance and portability between datacenters and clouds, since VMs abstract away the physical hardware layer
Because of this, the IT industry has more responsibility than ever when it comes to safeguarding its data. Despite increased vigilance and user training, relentless cyberthreats continue to rise in frequency and severity, complicating your zero-trust journey. Additionally, your data is still at risk from classic causes such as user error, hardware failure and natural disasters.
Fortunately, the vSphere ecosystem has reduced the complexity of protecting that data, but not all protection is created equal. The goal of this guide is to provide insights and best practices for effectively protecting your VMware vSphere environment.
Apply the basics
More than ever, organizations face increasing demands when it comes to application and data availability. Being able to reliably, consistently and quickly back up and restore data from VMs is front and center when it comes to risk management for IT departments.
Ultimately, there is no one-size-fits-all data protection strategy that will work for every enterprise, or even for every workload in a single company. There are, however, several best practices that can be followed to help fortify your readiness when disaster strikes. Although this guide is focused on VMware vSphere specifically, there are a few non-specific rules that are good to keep in mind.
The 3-2-1 Rule
The 3-2-1 Rule has long been a staple of the backup industry, and it’s just as relevant today as it was many years ago. In short, the rule states that you should keep three copies of your data on two different types of media, with one copy offsite in a different location. A simple example would be backing up your production data to an on-site disk repository and to a tape-based backup medium that is moved offsite. This gives you three copies of your data (i.e., production, the disk repository and tape) on two different media (i.e., disk and tape), one of which is offsite (i.e., the tape).
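The rule lends itself to a simple programmatic check. The sketch below is illustrative only: the `BackupCopy` structure and its field names are hypothetical, not part of any backup product’s API.

```python
from dataclasses import dataclass

@dataclass
class BackupCopy:
    name: str
    media: str      # e.g., "disk", "tape", "object-storage"
    offsite: bool

def satisfies_3_2_1(copies):
    """Three copies, on two distinct media types, at least one offsite."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )

# The example from the text: production + on-site disk repo + offsite tape
copies = [
    BackupCopy("production", media="disk", offsite=False),
    BackupCopy("on-site repository", media="disk", offsite=False),
    BackupCopy("tape vault", media="tape", offsite=True),
]
print(satisfies_3_2_1(copies))  # True
```

Dropping the tape copy fails the check twice over: only two copies remain, and none is offsite.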
In recent years, we’ve seen a sharp increase in ransomware attacks. Attackers encrypt your data and make it inaccessible until you agree to pay a ransom. However, Veeam’s global industry research shows that paying a ransom does not guarantee recovery. When dealing with these attacks, your best defense is a reliable backup. To ensure that your data is protected, it is strongly recommended that at least one copy of your backup remains offline. This could be a copy stored on tape that is physically sitting in a safe, away from attackers. Before bringing the offline copy online, just make sure you’re in a safe environment and that your backup cannot be compromised.
Physical location and security
A significant amount of time is spent securing networks with antivirus software, intrusion detection/prevention systems and firewalls. In addition, production server access typically has some level of physical security around it. For example, maybe you’re required to have a badge as you go through a security checkpoint to enter a datacenter.
All too often, backups don’t stand up to the same scrutiny. When planning out a strong business resilience strategy, it is imperative to keep in mind where your data is stored and how it’s kept safe from attack. Security is one aspect of this, as is regulatory compliance. It can be all too easy to send a backup to a cloud service, only to later realize that you’re violating data locality restrictions. To summarize, since backups are now an attack vector for many cyberthreats, you must treat your backup data with at least the same scrutiny as your production data, if not more so.
Start with a healthy environment
vCenter has an array of alerts built right into it. In addition to the out-of-the-box configurations, thresholds can be adjusted and custom alerts can be created. An easy but important first step toward ensuring successful backups, and ultimately reliable recovery, is to make sure that your source environment is in a healthy state. If your organization doesn’t have any third-party monitoring tools for your environment, then it’s even more imperative to leverage vCenter’s monitoring capabilities.
Within the vSphere environment, you will also want to ensure that your VMs are in a healthy state. “Healthy” can be a relative term when it comes to specific workloads or applications, but some key indicators include:
- Ensuring that VMware Tools are up to date
- Installing all applicable OS patches or hotfixes
- Checking for any impending required reboots
VMware Tools is an important component of successfully and reliably backing up VMs. In most cases, vendors rely on VMware Tools to be present and functioning in order to communicate with the guest OS. Common interactions include tasks like obtaining guest OS IP addresses or hostnames or acting as a broker to use the VIX API to perform actions within the guest OS. In many cases, the VIX API is used to run pre- and post-backup scripts, particularly for many Unix or Linux-based guest OSes. Microsoft Volume Shadow Copy Service (VSS) can be used on a Windows-based OS, but needs VMware Tools to be present.
Additionally, you should validate that the guest OS is in a healthy state. This means new VMs should be deployed from a hardened template. If any key patches are required for the environment, (e.g., for recent vulnerabilities, issue-specific hotfixes, security-mandated builds, etc.) then it would be ideal to have these deployed as part of the template and therefore present before the first backup. From a risk management and compliance standpoint, it is ideal that you restore VMs into a compliant state as opposed to one that still requires additional steps post-restore.
Similarly, be sure to check for pending reboots. In situations where core OS files have been changed, a restored VM may not boot properly if the installation process was never completed (e.g., the OS was not rebooted after a patch or hotfix that required it).
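The three health indicators above can be combined into a simple pre-backup gate. This is a hypothetical helper; the inputs would come from whatever monitoring or patch-management tooling your environment uses.

```python
def backup_ready_issues(tools_current, pending_patches, reboot_pending):
    """Return a list of issues that should be resolved before backing up."""
    issues = []
    if not tools_current:
        issues.append("VMware Tools out of date")
    if pending_patches > 0:
        issues.append(f"{pending_patches} OS patch(es) not yet installed")
    if reboot_pending:
        issues.append("reboot pending; a restored VM may not boot cleanly")
    return issues

# A healthy VM reports no issues
print(backup_ready_issues(True, 0, False))  # []
```

An empty list means the VM meets the basic health criteria; anything else is worth remediating before the first backup runs.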
Lastly, be sure to check the other major components of your infrastructure, including key areas like network, storage, and compute. Although vSphere is resilient when it comes to overcoming warnings and errors, there can still be underlying issues that could lead to backup failure. Network issues may prevent the VIX API from connecting to guest OSes and cause inconsistent backups, for example.
Likewise, degraded storage arrays may not be able to handle the increased I/O that occurs during a backup job.
Keep your environment up to date
Maintaining an active support agreement is always recommended, especially for enterprises that aim to avoid downtime and risk. Although multiple versions of vSphere may be supported at the same time (e.g., 6.5, 6.7, and 7.0 all had support agreements available at the same time), it’s important to understand that each of these versions had multiple releases (e.g., Update 1, Update 2, etc.), with multiple builds per branch.
Understanding your current support state is critical when it comes to the overall health of your environment. VMware’s lifecycle information, which includes end-of-support dates, is readily available at https://lifecycle.vmware.com. It’s also important to note that support dates are defined by major versions and not the various updates that come out. For example, ESXi 7.0 support, whether for Update 1, Update 2 or beyond, all end on the same date. It is generally recommended that you run the latest major release (i.e., the latest update) along with a newer build for the branch. Newer releases are likely to include any necessary patches, hotfixes, optimizations and security remediations.
When it comes to keeping your vSphere environment up to date, an important consideration is third-party integrations. Backup software tends to be one of the first areas where many customers experience incompatibility, largely because it runs 24/7 in at least some capacity for many enterprises. Although performing upgrades within the same branch (i.e., staying within the “Update 3” branch) typically will not cause issues, it is still recommended that you check with vendors who provide third-party integrations into vSphere. Other common platforms that could be impacted by upgrades include network integrations, storage integrations and out-of-band management software.
In addition to the above, regular checks should be performed to review drivers and firmware that may apply to the environment. Drivers and firmware can have a significant impact on performance. Operations such as sending network traffic can also be significantly impacted by firmware and drivers, both positively and negatively.
VMware has a published Hardware Compatibility List (HCL) (https://www.vmware.com/resources/compatibility/search.php), which can be used to determine which versions of drivers and firmware have already been tested with specific hardware components. It is worth noting that there may be other releases that do not appear on the list. This does not mean that these versions will not work, but they may not be a supported configuration and could introduce additional problems when you engage with VMware Global Support. Lastly, it is worthwhile to review vendor documentation and release notes for these drivers and firmware. On occasion, critical bugs are fixed in newer releases. That said, the older releases may still be on the HCL.
Selecting an optimal data transport method
When it comes to backing up and restoring vSphere workloads, there are typically three types of modes used to access and read the VM data:
- SAN Transport: Data is read directly from the underlying storage connected to the ESXi host. This is accomplished through a direct connection to the array (e.g., Fibre Channel)
- HotAdd: VMDKs are mounted to a backup proxy/helper VM that reads the data
- Network Block Device (NBD): Data is copied over the network
Each of the above has its pros and cons, and in many instances, you can mix and match any or all of the transport modes within the same environment. However, it is important that you understand which mode will provide the best performance in your environment and design a backup infrastructure accordingly. Understanding where your data physically sits, as well as its rate of change, will help in determining the best method to use.
In most instances, backup proxies are deployed to read the data that’s being backed up and then transfer it to a backup repository. How these proxies perform this task varies based on the mode selected from the list above. It is also important to consider the capacity requirements for these proxies to ensure that they do not inhibit performance on existing clusters.
SAN Transport: When using the SAN Transport mode, physical proxies or helper appliances are deployed, which directly access the underlying storage system. If your storage array uses Fibre Channel or iSCSI, then this may be the fastest backup method since it bypasses overhead activity that may be introduced by the hypervisor. For optimal results, the proxy should be physically close to storage to minimize potential bottlenecks like latency. SAN Transport mode typically provides the best backup experience since the host will directly access the storage and not have to traverse any VMkernel interfaces. However, SAN Transport isn’t always ideal for restores. Thick disks will typically have better restore performance over thin disks, largely due to the multiple disk management layers that are required to maintain thin provisioned disks. Generally speaking, network mode restores (see below) will offer better restore performance for thin provisioned disks. Plus, SAN Transport cannot protect VMs that are running on VVOL or vSAN datastores.
HotAdd: In other instances, where direct storage access may not be available, HotAdd mode may be beneficial. This is typically accomplished by deploying a virtual appliance onto one or more hosts. Once a backup job kicks off, the appliance will mount the VMDK(s) from the VM that’s being backed up and read the data. This offers flexibility, since proxies can be deployed virtually in many clusters or sites, without needing to deploy physical hosts. When using this method, you’ll want to review sizing recommendations to ensure that your clusters can accommodate the increased resource demands.
Network Block Device (NBD): As the name implies, NBD copies blocks of data over the network with vSphere’s Network File Copy (NFC) protocol. This mode is typically used from a physical host that does not have access to the underlying storage, or from a virtual appliance. Testing this mode is important, since results can vary widely based on factors like network bandwidth and latency, and the volume of data that needs to be backed up. Ideally, the interface used should be at least 10 Gbit in order to move data quickly. It is also important to consider whether the encryption of backup traffic is required. If it is, encryption will add overhead, which will likely impact overall performance. Lastly, starting with vSphere 7, there’s now an option to explicitly assign the “vSphere Backup NFC” service to a VMkernel adapter, thus creating a dedicated interface for network-based backups.
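The selection logic described above can be summarized as a rough decision heuristic. The sketch below is a simplification for illustration only — real transport selection involves vendor-specific detection, and actual performance should always be benchmarked in your own environment.

```python
def pick_transport_mode(direct_san_access, proxy_is_vm, datastore_type):
    """Rough heuristic mirroring the trade-offs discussed above."""
    # SAN Transport requires direct array access and cannot protect
    # VMs on vSAN or VVOL datastores
    if direct_san_access and datastore_type not in {"vSAN", "VVOL"}:
        return "SAN"
    # A virtual proxy can mount the source VMDKs directly (HotAdd)
    if proxy_is_vm:
        return "HotAdd"
    # Otherwise, fall back to copying blocks over the network via NFC
    return "NBD"

# A virtual proxy without direct array access lands on HotAdd
print(pick_transport_mode(False, True, "VMFS"))  # HotAdd
```

Note how a vSAN datastore forces the decision away from SAN Transport even when direct array access exists, consistent with the limitation mentioned earlier.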
Leverage CBT for efficient backups
Image-based backups are the industry standard when it comes to backing up vSphere workloads. That is in part due to the ease and consistency of creating them. VMware offers vSphere Storage APIs – Data Protection (formerly known as VMware vStorage APIs for Data Protection or VADP), which allows for a quick and efficient way to perform incremental backups by using Change Block Tracking (CBT).
The concept behind CBT is that the hypervisor detects and tracks only the blocks that have changed within the VM since the previous backup. This operation is transparent to the guest OS but still yields all the benefits of quickly identifying the data that needs to be backed up. As a result, there is no longer any need to read, compare and back up all the data each time a backup job runs; instead, the data generated by the CBT driver quickly points to the changed blocks.
To use CBT, the VM in question must meet the following criteria:
- The host must be running ESXi 4.0 or newer and the VM must have a virtual hardware version of 7 or newer
- The VM must be on a datastore that’s backed by NFS, VMFS or an RDM in virtual compatibility mode
- The VM’s VMDK must not be an independent disk
CBT is disabled on VMs by default, though backup applications typically enable it automatically. If CBT needs to be enabled or disabled on a VM, it is important to note that the VM in question must have zero snapshots at the time of the change.
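The eligibility criteria above can be expressed as a simple check. This is an illustrative sketch only; the version tuples and datastore labels are plain strings for readability, not values returned by any vSphere API.

```python
def cbt_eligible(esxi_version, hw_version, datastore_type, independent_disk):
    """Mirror the CBT requirements: host/HW version, datastore type, disk mode."""
    return (
        esxi_version >= (4, 0)                # ESXi 4.0 or newer
        and hw_version >= 7                   # virtual hardware version 7+
        and datastore_type in {"NFS", "VMFS", "RDM (virtual mode)"}
        and not independent_disk              # independent disks are excluded
    )

# A modern VM on VMFS with a normal disk qualifies
print(cbt_eligible((7, 0), 19, "VMFS", False))  # True
```

An independent disk or an RDM in physical compatibility mode fails the check, which matches the criteria listed above.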
For additional details about CBT, please refer to https://kb.vmware.com/s/article/1020128
When to use application-aware backups
The most common approach to taking backups within a vSphere environment is to use an image-level approach. Image-level backups typically work by taking a snapshot of a running VM and backing up the snapshot. This effectively creates what is called a “crash-consistent” backup. Most modern OSes can recover from a crash-consistent backup, but many applications cannot. Although this is a drastic improvement over legacy physical backup solutions, it can still leave a bit of a gap.
Database applications are one of the most common applications that may not be reliably backed up via a crash-consistent backup. The reason behind this is that any operations that are in memory (e.g., table inserts, updates, deletes, etc.) have not been committed to the database yet, and therefore may be lost.
Application-aware backups (sometimes referred to as application-consistent or something similar) interact with the guest OS to ensure any data in-memory is written to disk. If the guest OS runs a Windows OS, then the Volume Shadow Copy Service (VSS) is leveraged. If the guest OS runs a Linux distribution, then vendors may offer scripts to allow for the “freezing” and “thawing” of the guest OS to ensure data consistency.
VMware Tools must be installed within the guest OS, since application-aware processing interacts with the operating system. Whether the VM being backed up runs Windows or Linux, the workflow is typically similar or the same:
- The backup server deploys a guest agent to communicate with the Guest OS via the VIX API
- The application is prepared for a VSS snapshot
- The memory is quiesced and outstanding I/O is written to disk
- The agent requests a VSS snapshot
- A vSphere snapshot is triggered, and contains a quiesced copy of the VM and application
- The guest OS resumes as normal
- The vSphere snapshot is backed up
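The sequence above can be sketched as an ordered workflow. Every step here is a placeholder for product-specific VIX/VSS interactions, so treat this purely as a structural illustration, not a working backup routine.

```python
# Illustrative step sequence for an application-aware backup
APP_AWARE_STEPS = (
    "deploy guest agent via VIX API",
    "prepare application for VSS snapshot",
    "quiesce memory and flush outstanding I/O to disk",
    "request VSS snapshot",
    "trigger vSphere snapshot of the quiesced VM",
    "resume normal guest OS operation",
    "back up the vSphere snapshot",
)

def run_application_aware_backup(execute):
    """Run each step in order; `execute` stands in for real product calls."""
    for step in APP_AWARE_STEPS:
        execute(step)
    return len(APP_AWARE_STEPS)

steps_run = run_application_aware_backup(execute=print)
```

The key property the code makes explicit is ordering: quiescing and the VSS snapshot must complete before the vSphere snapshot is taken, or the backup is merely crash-consistent.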
Backups vs. replication
Whether it’s from VMware directly or through their partners, there are many options available when it comes to replicating your workloads. By its very nature, replication incurs overhead: You’ll need at least twice as much storage available for the VM you wish to protect. If you plan on storing multiple restore points, then you’ll need to allocate storage for those as well. Network throughput will also be a concern, depending on the frequency and size of the data that’s being replicated. Lastly, you’ll need ESXi hosts to register these replicas.
When designing replication architecture, it is important that you understand what you’re aiming to achieve. Similar to a RAID 1 disk array, replication is more of a safeguard against infrastructure failure as opposed to a traditional backup. If there is a failure on the VM, whether it was due to data loss or an OS issue, the same actions will be replicated and thus end with the same result for both data copies. This can be offset by keeping multiple restore points. The ideal use case for replication is to allow workloads to failover quickly while achieving a low recovery point objective (RPO) and recovery time objective (RTO).
vSphere Replication is included in multiple vSphere editions, but it does have its limits. Most notably, it supports a minimum RPO of five minutes and up to 24 restore points per replicated VM. Third-party solutions may be able to improve upon this by leveraging the vSphere APIs for I/O Filtering (VAIO).
First introduced in vSphere 6.0 Update 1, VAIO allows partners to insert filters into the I/O path between the VM and the storage. Partners who leverage these filters can intercept and alter data to be used for caching and replication. This, in essence, provides a gateway to capture data with little to no performance impact to the VMs in question.
By using VAIO, third-party vendors can offer replication with lower RPOs as well as an increased number of restore points. Applying these increased thresholds to push toward a lower RPO is typically referred to as Continuous Data Protection (CDP). Architectures vary based on requirements, but additional proxies or appliances often need to be deployed to accommodate the increased resources that are required due to the more frequent transmission of data.
Just like VMware’s Hardware Compatibility List, a list of partners with certified VAIO implementations can be found at https://www.vmware.com/resources/compatibility/search.php?deviceCategory=vaio.
Trust (but verify) your backups
A backup is only valuable if you can restore from it. Regularly testing your backups, as well as your disaster recovery (DR) plan, is of utmost importance, since there are many different ways for a backup to fail, including:
- The backup file itself is corrupt
- The data that’s backed up is corrupt (e.g., OS or application failures)
- The backup may not contain the expected data
A thorough test of your backups should be performed regularly and consist of more than just a file check. Testing should include a full restore to a non-production environment. Furthermore, the guest OSes should be verified to boot cleanly and be free of errors, and any applications that run within the VM should be validated by the application owners. Testing can be a time-consuming and error-prone process when performed manually, so where possible, tests should be automated and ideally run after each backup completes. This allows you to detect and identify errors immediately and avoid failures when a restore is required.
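An automated verification pass can be structured around the three failure modes listed above. The harness below is a hypothetical sketch: `restore_fn`, `boot_check` and `app_check` are placeholders for product- and application-specific implementations (e.g., a restore API call, a VMware Tools heartbeat wait, and an owner-defined application test).

```python
def verify_backup(restore_fn, boot_check, app_check):
    """Run a layered verification: restore, boot the guest, test the app."""
    vm = restore_fn()                       # full restore into an isolated lab
    return {
        "restored": vm is not None,         # catches corrupt backup files
        "guest_os_boots": boot_check(vm),   # catches corrupt OS data
        "application_ok": app_check(vm),    # catches missing/unexpected data
    }

# Stubbed example run: every layer passes
result = verify_backup(lambda: "restored-vm-01",
                       lambda vm: True,
                       lambda vm: True)
print(all(result.values()))  # True
```

Because each layer maps to one of the failure modes, a failed run immediately tells you whether the problem is the backup file, the guest OS or the application data.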
When testing backups, be sure to consider the impact that this can have on your vSphere environment. Some questions you should be considering as part of this process are:
- Is there ample compute/memory/storage capacity and throughput available to run these VMs?
- Will there be an impact on network traffic?
- Is there a risk to production VMs if the backup VMs are powered on?
- Will these VMs be compliant with security or regulatory rules if they’re being restored to a different location?
When testing backups, it is recommended that you restore to a network that does not have access to the production network. This is to ensure that production data does not accidentally get routed to the test copy. Fortunately, with vSphere networking, this can be as easy as creating a new standard switch that does not have an uplink to the rest of the network. Depending on the complexity of the application(s) in question, a restore can also be performed without enabling network access. In this scenario, tests can be run via the VM’s console on vCenter or the ESXi host UI.
If restoring to a new location, such as a cloud service, it is imperative that you have the proper documentation, access permissions and configurations available ahead of time. This should include tasks like:
- Network connectivity to and from the backups and new restore location
- Required permissions to add or alter resources if needed
- Deployment of any proxies or helper appliances that may be needed for a restore
As workloads continue to grow at an exponential rate, so does the volume of data that IT organizations are tasked with protecting. VMware vSphere is the de facto hypervisor in most enterprises, largely due to its stability and versatility. Despite that, it is still prudent to safeguard these workloads against threats, whether they be mechanical, natural or malicious in nature.