Performing Migrations
HDM Migrations
HDM is used to migrate VMs from on-premises to cloud environments. HDM supports the following migration types:
- Agile Rapid Migration (ARM)
- Try Before Commit (TBC)
- Cold Migration
Migrate a VM using vCenter
Note: The migration operation can be performed through the PrimaryIO GUI in vCenter, or through the SQS interface library. The following sections describe the steps for executing the migration using vCenter.
Note: Currently, HDM supports migrating VMs from only one cluster at a time per vCenter. If the VMs to be migrated span multiple clusters, the process must be repeated for each cluster.
Pre-requisites
- HDM deployment must be complete in the on-premises and cloud environments.
- The HDM SPBM policy must be configured.
Steps
- Prepare the VM to migrate
- Migrate the VM
Prepare to Migrate
The VMs to be migrated will first need to be prepared:
- Perform various checks to ensure that the VM's OS type and configuration are supported for migration.
- If required, make the necessary configuration updates.
Pre-requisites
- The VM must be powered on.
- The VM's OS type must be supported for migration.
- The latest version of VMware tools must be installed on the VM, and the tools service must be running and functional.
- Administrator/root credentials of the VM must be available.
- The OS must be present on the first VMDK of the VM
- All OS related partitions must be present on the same disk/device. For example:
- The 'System Reserved' and 'System' partitions must be on the same disk
- /boot or /home must be present on the same disk
- LVM must be created from partitions on the same disk
- The E1000E and VMXNet3 network adapters must be available in the on-premises and cloud vCenters
- The VM must have access to the Internet.
- A minimum of 50 MB of free space must be available in the system partition
- The OS or repository must be configured to download the required install packages from the Internet.
- For Ubuntu 16.04, LVM is not supported.
- For all versions of Ubuntu, ensure that either the static IP is configured for internal network, or the DHCP lease is set to 30 days or greater.
- If the OS is Linux, the sudo user’s home directory must be /home.
- Wait for the boot process to complete. For example, on Linux, use the command systemctl is-system-running to confirm that the system is fully operational (see the readiness-check sketch after this list).
- The VM cannot use UEFI firmware. Only legacy (IBM PC) BIOS is supported.
- For a Windows VM, firewalls must be disabled for the duration of the prepare to migrate operation.
- For the Windows Domain user, the local security policy must be modified for the duration of the prepare to migrate operation:
- Select Local Security Policy, followed by Local Policies, then Security Options
- For the policy User Account Control: Behavior of the elevation prompt for administrators in Admin Approval Mode, choose the Elevate without prompting option
- Disable the policy User Account Control: Turn on Admin Approval Mode
- Reboot the VM
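For reference, the following is a minimal, hypothetical Python sketch (not part of HDM) of how a Linux guest could be checked against a few of the prerequisites above before running Prepare to Migrate. It assumes it is run inside the guest with root privileges on a systemd-based distribution with open-vm-tools installed; the command names and thresholds simply mirror the list above.

```python
#!/usr/bin/env python3
"""Illustrative readiness checks for a Linux guest (sketch only, not HDM code)."""
import shutil
import subprocess


def boot_complete() -> bool:
    # systemctl is-system-running reports "running" once boot has finished.
    out = subprocess.run(["systemctl", "is-system-running"],
                         capture_output=True, text=True)
    return out.stdout.strip() == "running"


def vmware_tools_running() -> bool:
    # open-vm-tools runs as the vmtoolsd daemon.
    return subprocess.run(["pgrep", "-x", "vmtoolsd"],
                          capture_output=True).returncode == 0


def system_partition_has_space(min_free_mb: int = 50) -> bool:
    # The prerequisites call for at least 50 MB free in the system partition.
    free_mb = shutil.disk_usage("/").free // (1024 * 1024)
    return free_mb >= min_free_mb


if __name__ == "__main__":
    checks = {
        "boot complete": boot_complete(),
        "VMware tools running": vmware_tools_running(),
        ">= 50 MB free in system partition": system_partition_has_space(),
    }
    for name, ok in checks.items():
        print(f"{name}: {'OK' if ok else 'FAIL'}")
```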
Steps
- In the on-premises vCenter, right click on the VM to be migrated.
- Select the HDM -> Prepare to Migrate option.
- Specify the administrator/root user credentials in the pop-up wizard.
Note: A full clone or linked clone of a VM must go through the “prepare to migrate” operation, even if its base VM has already been prepared. In the HDM Migrate wizard in vCenter, clones are not shown in the list of available VMs for migration unless they have been explicitly prepared for migration.
Apply SPBM Policy
The migrate operation requires the VM to have the HDM SPBM policy applied to its disks. This policy may have already been attached when Enable I/O Monitoring was performed (see the Enable I/O Monitoring section for more details). However, if the attempt to attach the policy failed at that time, or the VM was created at a later point, the following steps can be used to verify - and, if required, apply - the SPBM policy:
Pre-requisites
- Enable I/O Monitoring has already been executed.
Steps
- In the on-premises vCenter, right click on the VM to be migrated.
- Select VM Policies, followed by Edit VM Storage Policies
- In the popup window, there is no need to do anything if the VM storage policy has already been set to the HDM Analyzer Profile. However, if it is set to Datastore Default, change it to HDM Analyzer Profile and select Apply to all.
Migrate the VM
Pre-requisites
- Prepare to Migrate has been successful on the VM (For warm migration and TBC only).
- The HDM SPBM policy has been applied to all disks within the VM (For warm migration and TBC only).
- VM migration is supported for a maximum of 10 CD/DVD devices. If you attempt to migrate a VM with more than 10 CD/DVD devices, an error will be reported. Remove the extra devices and retry the migration.
- Ensure the OS type configured in vCenter is the same as the actual OS running within the virtual machine. If there is a mismatch, the migration is very likely to fail during the commit phase.
- Migration of VMs whose names contain “..” (two dots) as a substring is not supported. Rename such VMs before migration to ensure a successful migration.
- Verify that the VM's MAC address is not already in use by any VM in the cloud. A conflict can occur if a VM is migrated twice, or if an existing cloud VM uses the same MAC address. (A hedged pre-check sketch is provided after this list.)
- Migrating a Windows VM with an Evaluation License will result in the migrated VM failing the guest OS's license check. The operating system enforces this behavior, and the VM will power off after 45 minutes. This is Microsoft's license enforcement, not an HDM product issue.
- Check the table in the appendix for operating system support for the desired migration mode.
- Ensure that the guest OS type within the application VM and in the source vCenter are the same. Any inconsistency may lead to migration failures.
- Migration of VMs with multiple NICs and a mix of IP allocation modes (DHCP and static) is not supported.
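As an illustration only, the hedged Python sketch below (using pyVmomi, which is not part of HDM) flags a few of the conditions above: VM names containing “..”, more than 10 CD/DVD devices, and a mismatch between the configured and running guest OS type. The vCenter host name and credentials are placeholders.

```python
"""Hypothetical pre-migration checks against the source vCenter (sketch only)."""
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

VC_HOST = "vcenter.example.com"          # placeholder
VC_USER = "administrator@vsphere.local"  # placeholder
VC_PASS = "secret"                       # placeholder


def check_vm(vm):
    issues = []
    cfg = vm.config
    if cfg is None:
        return issues
    # VM names containing ".." are not supported for migration.
    if ".." in vm.name:
        issues.append("name contains '..'")
    # A maximum of 10 CD/DVD devices is supported.
    cdroms = [d for d in cfg.hardware.device
              if isinstance(d, vim.vm.device.VirtualCdrom)]
    if len(cdroms) > 10:
        issues.append(f"{len(cdroms)} CD/DVD devices (maximum is 10)")
    # The configured guest OS type should match the OS reported by VMware tools.
    if vm.guest and vm.guest.guestId and vm.guest.guestId != cfg.guestId:
        issues.append(f"guest OS mismatch: configured={cfg.guestId}, "
                      f"running={vm.guest.guestId}")
    return issues


if __name__ == "__main__":
    ctx = ssl._create_unverified_context()  # lab use only
    si = SmartConnect(host=VC_HOST, user=VC_USER, pwd=VC_PASS, sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        for vm in view.view:
            for issue in check_vm(vm):
                print(f"{vm.name}: {issue}")
        view.Destroy()
    finally:
        Disconnect(si)
```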
Steps
- In the on-premises vCenter, right click on the VM to be migrated.
- Select the HDM -> Migrate option.
- The Migrate wizard will open. Select the migration type to be used to migrate the VM to the cloud.
- On the Select Cloud page, review the cloud where the VM is to be migrated and ensure that adequate resources are available.
- Choose the list of VMs to migrate:
- Keep the option “Application Dependency” checked.
- Review the cache size, CPU, and memory required in the cloud.
- If the Warm or Cold Migration type has been chosen, select the target resources in the cloud where the virtual machine will be migrated.
- If the Warm or Cold Migration type has been selected, map the network for the VM.
- Confirm all selections and select MIGRATE.
- The migration status page will display the status of the migration as it progresses.
- The migration status can also be tracked in vCenter tasks.
- The migrated VM can be seen in a new resource pool _HDM_MIGRATEPOOL in the on-premises vCenter in a powered off state, while the same VM will be in a powered on state in the cloud vCenter.
- All migrations can be monitored by selecting Cluster, followed by Monitor, HDM, Migration, then the In Progress tab.
- For virtual machines that have been migrated using Warm Migration, the following steps are required to complete the migration workflow:
- START TRANSFER: This is an optional step where the virtual machine data can be transferred using HDM Bulktransfer. Select the virtual machine, then select Start Transfer.
- CONFIGURE & SYNC: Once the virtual machine data has been moved to the cloud, select the newly-moved virtual machine to sync the latest changes. If the error "Failed to post message for sync. Please retry after sometime" is seen, retry the operation after a few minutes. (Ref CP-5862)
- COMMIT: Once the data has been synced, commit all changes to the migrated virtual machine on the cloud and clean up the HDM configuration.
- VMs that have been migrated to the cloud will be shown in Cluster, followed by Monitor, HDM, Migration, then Summary
Migrate Time Snapshot
As part of the VM migration, HDM creates a “migrate time snapshot” for the VM. This snapshot is useful to restore data in the event of certain failures.
Note: Restoring the VM from the migrate time snapshot will result in loss of data for the time the VM was in the Cloud.
To view the snapshot:
- In the on-premises vCenter, right click on the VM in the resource pool _HDM_MIGRATEPOOL
- Select Snapshots, followed by Manage Snapshots
The Manage Snapshots popup should display a snapshot named _hdmxx. For example, _hdm1 shown below:
Cache Size For Migrated VMs
HDM allocates a cache quota in the cloud for all migrated VMs to ensure optimal performance of the applications running in the cloud. The cache size allocation follows these rules:
- If the on-premises VM has been monitored for I/Os for a sufficient amount of time, the working set is derived by the I/O analysis and the cache size is based on the working set size. Note that this is only valid for Standard and Performance modes of deployment.
- If the VM is not monitored, the cache size is based on the size of the VMDKs:
- 15% for Standard and Performance modes of deployment
- 25% for the Lite mode of deployment
- There is also a minimum cache size per deployment mode:
- 5 GB for Lite and Performance modes of deployment
- 3 GB for the Standard mode of deployment
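The following is a hedged Python sketch of the sizing arithmetic described above; it is illustrative only and not HDM code.

```python
def estimate_cache_size_gb(vmdk_total_gb, mode, working_set_gb=None):
    """Approximate the cache quota per the rules above (illustrative sketch).

    mode: "standard", "performance", or "lite".
    working_set_gb: working-set size from I/O monitoring, if the VM was
    monitored (only meaningful for Standard and Performance deployments).
    """
    mode = mode.lower()
    minimum_gb = {"lite": 5, "performance": 5, "standard": 3}[mode]
    if working_set_gb is not None and mode in ("standard", "performance"):
        # Monitored VM: cache size is based on the observed working set.
        size = working_set_gb
    else:
        # Unmonitored VM: cache size is a fraction of the total VMDK size.
        fraction = 0.25 if mode == "lite" else 0.15
        size = vmdk_total_gb * fraction
    return max(size, minimum_gb)


# Example: an unmonitored VM with 200 GB of VMDKs in Lite mode
# gets max(200 * 0.25, 5) = 50 GB of cache.
print(estimate_cache_size_gb(200, "lite"))
```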
Migrate a VM using the SQS Interface
HDM migration for the ARM use case is supported through the SQS interface. The prerequisites for using the SQS interface for migration are:
- The HDM appliance must have access to the Internet.
- The SQS queues for command and response must be created in the Amazon SQS service.
- The HDM appliance must be configured for the SQS message bus with the correct message bus token.
Some important command messages exchanged between the client and HDM are:
- Heartbeat: This message is sent from HDM to the client to communicate the state of HDM. Clients usually look for a ‘Ready’ state before sending the next migration request to HDM.
- SourceInventoryRequest: This message provides the list of on-premises VMs and their details. Clients select what to migrate from this list.
- SourceCloneRequest: This is essentially the migration request to HDM. It has parameters to specify the migration type and associated details.
- BulkMigrationDoneRequest: This message is important for migrations initiated using the offline bulk transfer option, because it tells HDM whether or not the offline bulk transfer is complete.
Details for how these are used for warm and cold migration are provided below.
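As a rough illustration of the message exchange, the sketch below uses boto3 to poll the response queue for a ‘Ready’ heartbeat and then send a SourceInventoryRequest. The queue names and JSON field names are placeholders, not the documented HDM schema; consult the SQS interface library for the exact message formats.

```python
"""Hypothetical SQS client sketch (placeholder queue names and message fields)."""
import json
import boto3

sqs = boto3.resource("sqs", region_name="us-east-1")
command_q = sqs.get_queue_by_name(QueueName="hdm-command")    # placeholder
response_q = sqs.get_queue_by_name(QueueName="hdm-response")  # placeholder


def wait_for_ready():
    # Poll heartbeat messages until HDM reports the 'Ready' state.
    while True:
        for msg in response_q.receive_messages(WaitTimeSeconds=20,
                                               MaxNumberOfMessages=10):
            body = json.loads(msg.body)
            msg.delete()
            if body.get("status") == "Ready":
                return body


def request_inventory():
    # Ask HDM for the list of on-premises VMs that can be migrated.
    command_q.send_message(
        MessageBody=json.dumps({"type": "SourceInventoryRequest"}))


if __name__ == "__main__":
    wait_for_ready()
    request_inventory()
```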
Cold Migration
Follow these steps to perform a cold migration using the SQS interface:
- Wait until system status is ‘Ready’. This is indicated by the heartbeat between HDM and the SQS client.
- Send a ‘SourceInventoryRequest’ message to get the list of VMs that can be migrated using HDM.
- Choose a VM from the list in the response message.
- Set the mode of migration to ‘cold’.
- Submit a request for migration using ‘SourceCloneRequest’.
- After submission, periodic responses will be sent with the status of the submitted request.
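Continuing the SQS sketch above, the hedged snippet below shows how a cold migration request might be submitted; the field names are illustrative placeholders rather than the documented HDM message schema.

```python
import json


def submit_cold_migration(command_q, vm_name):
    """Send a hypothetical 'SourceCloneRequest' with the mode set to cold.

    command_q: a boto3 SQS Queue object for the HDM command queue.
    vm_name: a VM chosen from the SourceInventoryRequest response.
    """
    command_q.send_message(MessageBody=json.dumps({
        "type": "SourceCloneRequest",   # placeholder field names
        "vm_name": vm_name,
        "migration_mode": "cold",
    }))
```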
Cold migration of a VM will entail the following steps:
- A bulk transfer of the VM will be initiated.
- Progress and updates will be sent via message bus responses.
- vCenter Tasks will show the progress of the OVF export and import on the source and target vCenters, respectively.
- This operation may get queued, depending on the resource profile, the number of migrations in progress, and total number of VMs migrated. A maximum of eight migrations can run concurrently.
- Necessary network changes will be performed on the migrated VM.
If the bulk transfer fails, it will be retried automatically a few times. A failure is reported in the response message only after the final retry fails. In vCenter, however, the individual retries and their statuses can be seen.
Warm Migration
Pre-requisites
- The HDM deployment mode cannot be ‘Appliance Only’. Any other mode is acceptable.
- The VM must be in a powered-on state and must have the latest VMware tools installed.
Steps
- Wait until system status is ‘Ready’. This is indicated by the heartbeat between HDM and the SQS client.
- Send a ‘SourceInventoryRequest’ to get the list of VMs that can be migrated using HDM.
- Choose a VM from the list returned in the response message.
- Set the mode of migration to ‘warm’ and the bulk transfer mode to 'online' or 'offline'.
- The root user credentials of the VM for the “prepare to migrate” step will be required.
- Submit the request for migration using ‘SourceCloneRequest’.
- After submission, periodic responses will be sent with the status of the submitted request.
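As with the cold migration example, the hedged snippet below sketches a warm migration request; the field names, including how the guest credentials and bulk transfer mode are passed, are illustrative placeholders.

```python
import json


def submit_warm_migration(command_q, vm_name, bulk_transfer_mode="online",
                          guest_user="root", guest_password="secret"):
    """Send a hypothetical 'SourceCloneRequest' with the mode set to warm.

    bulk_transfer_mode: 'online' or 'offline'.
    guest_user/guest_password: credentials used for the "prepare to migrate"
    step (placeholders here).
    """
    command_q.send_message(MessageBody=json.dumps({
        "type": "SourceCloneRequest",   # placeholder field names
        "vm_name": vm_name,
        "migration_mode": "warm",
        "bulk_transfer_mode": bulk_transfer_mode,
        "guest_credentials": {"user": guest_user, "password": guest_password},
    }))
```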
Warm migration of a VM will entail the following steps:
- Prepare to migrate the VM.
- Apply the HDM SPBM policy to the VM.
- Take a snapshot of the VM and create a linked clone from it.
- Migrate the compute VM, where the cloud cache is created and the data path to the on-premises environment is also maintained.
- Bulk transfer the previously created VM snapshot to the cloud
- Reconcile the new data in the cloud cache with the VM image transferred to the cloud
- Power off the running VM and reboot from the reconciled VM image.
If any of the above steps fail, HDM will retry the step before declaring that the entire migration has failed.
Migrate Back a VM
Note: The flow described here is useful for the TBC use case.
Prerequisites
- The VM must be in a migrated state. It should be listed in _HDM_MIGRATEPOOL in the on-premises vCenter.
Steps
- In the on-premises vCenter, right click on the VM to be migrated back. Select HDM, followed by Migrate Back.
- Select the VMs to be migrated back. The dependent VMs will be migrated back together.
- Review the selection and select MIGRATE BACK.
- The status of the migration back can be seen in the wizard.
- The migrate back task can also be tracked in vCenter.
- Once the migration back is successful, the VM will be deleted from the cloud vCenter. It will then be moved from the _HDM_MIGRATEPOOL to the original resource pool where it resided prior to the migration. At this point, the VM will have to be explicitly powered on.
Steps to Migrate Application-Dependent VMs
- Create a tag for the application-dependent VMs under the HDM-APPLICATION-DEPENDENCY category, as shown below:
- Assign this tag to all the VMs that are application dependent:
- In the migration wizard, make sure the Application Dependency check box is selected, as shown in the following screen:
- The application-dependent VMs with the same tag will be listed in the wizard and differentiated with color legends, as shown in the following screen:
- The selection of application-dependent VMs can be modified by deselecting the Application Dependency check box:
Generating Statistical Data for Migrated VMs
HDM provides a facility to download detailed statistical data in .csv format for migrated VMs. It provides details such as migration status, start date and time, end date and time, network data transfer throughput, read IOPS, etc.
To download the statistical report for migrated VMs:
- In the on-premises vCenter, select the cluster
- On the right-hand panel, select Monitor, followed by HDM, then Migration, then Summary
- As shown in the figure below (highlighted in red), you can download the statistics in .csv format
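Once downloaded, the CSV can be processed with standard tooling. The short Python sketch below groups VMs by migration status; the file name and column headers used here are assumptions and should be adjusted to match the exported report.

```python
import csv

# Placeholder file name and column headers; adjust to match the exported CSV.
with open("hdm_migration_stats.csv", newline="") as f:
    rows = list(csv.DictReader(f))

by_status = {}
for row in rows:
    status = row.get("Migration Status", "Unknown")
    by_status.setdefault(status, []).append(row.get("VM Name", "<unknown>"))

for status, vms in sorted(by_status.items()):
    print(f"{status}: {len(vms)} VM(s) - {', '.join(vms)}")
```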
HDM Monitoring
HDM monitors the VMs in a cluster for I/O and resource usage activity. The following data will be provided:
- Active data set identification for VMs
- Recommendation of what VMs to migrate
- Cache size required on the cloud to meet the VM’s workload requirements
- CPU resource utilization of the VMs
- I/O performance statistics for the VMs
- Network and cache usage statistics for the migrated VMs
Note: In Lite mode (Standalone and Cluster), the monitoring of VMs is limited to applying the HDM SPBM policy. The detailed monitoring of VMs is only present in the Standard and Performance modes.
Monitoring VMs On-Premise
HDM monitors the I/O activity on all on-premises VMs on the cluster where it is installed. To view monitored data for these VMs:
- In the on-premises vCenter, select the cluster
- On the right-hand panel, select Monitor, followed by HDM, then Profiling
Note: In Lite mode (both standalone and cluster), this view is not present. Instead, the following message will be displayed:
- You should see doughnut charts for:
- Storage monitored: The amount of storage monitored within vCenter. Monitoring is only active on the selected cluster within vCenter.
- Active dataset: The size of the active dataset, compared with the total. I/O activity is recorded periodically.
- Active VM storage health: The health of the storage is determined by its latency and throughput.
- Top storage utilized VMs: The VMs that access the storage most frequently.
- A table summarizing the activity of the VMs is presented below the doughnuts. Some columns of particular importance are:
- Storage: The storage capacity of the VMs.
- Cache size: The estimated size of the VM's working set. If the VM is migrated to the cloud, this represents the minimum amount of cache that needs to be provisioned to maintain optimal performance.
- Read IOPs: The observed rate of reads happening on the VM.
- Write IOPs: The observed rate of writes happening on the VM.
- Health: The health of the VM storage, based on the observed read/write IOPs.
Monitoring VMs in the Cloud
Migration Status
HDM keeps track of the number of VMs migrated and migrated back, as well as their statuses and other essential information. This data is accessible through the following steps:
- In the on-premises vCenter, select the cluster
- On the right hand panel, select Monitor, followed by HDM, Migration, then In Progress
I/O And Resource Usage
The VMs that have been migrated to the cloud are also monitored for resource utilization and I/O activity. To view monitored data for these VMs:
- In the on-premises vCenter, select the cluster
- On the right hand panel, select Monitor, followed by HDM, then Monitoring
Note: In Lite mode (both Standalone and Cluster), this view is not present. Instead, the following message will be displayed:
- You can view graphs for:
- Utilization for compute, memory, and cache resources
- Read and write data transferred over the WAN
- Detailed statistics for each migrated VM are displayed in tabular form
Dashboard
A summary of the migration statistics and cloud resource utilization can be found in a single dashboard. To view this dashboard data:
- In the on-premises vCenter, select the cluster.
- On the right hand panel, select Monitor, followed by HDM, Cloud Burst, then Dashboard.
- A detailed log of the migrate and migrate back activities is also displayed in tabular form.
HDM Policies
Recovery Time Objective/Recovery Point Objective (RTO/RPO)
HDM maintains an optimal cache in the cloud for migrated VMs to provide superior I/O performance. The cache maintains the working set of the VM so that read requests can be served without traversing the WAN for every I/O. The cache also absorbs writes, which are flushed to the on-premises environment at regular intervals.
The frequency of the write flush is based on the Recovery Time Objective/Recovery Point Objective (RTO/RPO) requirements. By default it is set to flush to the on-premises environment every 20 minutes. Therefore, in the event of a failure, the application can only lose up to 20 minutes worth of data.
Guidelines to Configure RTO/RPO policies
Configuring RTO/RPO should be based on the application need. The trade-offs are:
- If the time is reduced, the write data flush will be triggered more often. This can cause additional WAN traffic, especially for applications that perform frequent overwrites.
- If the time is increased, the write data flush will be triggered less frequently. In the event of a failure that results in VMs having to migrate back to the on-premises environment, more data will reside on the VMs since the last RTO/RPO flush, which can result in higher data loss.
The setting is maintained at the cluster level, so it will be inherited by all VMs within the cluster.
Note: RTO/RPO is set to the default value of 20 minutes, which is acceptable for most applications. Care should be taken prior to reconfiguring it, keeping in mind the recovery trade-offs for the application.
Steps to Configure
To configure the RTO/RPO policy:
- In the on-premises vCenter, select the cluster
- On the right hand panel, select Configure, followed by HDM, then Administration
- Modify the default value of the Recovery Time Objective (RTO) according to the needs of the application
HDM System Health
HDM uses periodic messages (heartbeat) to monitor the health of its components and determine the overall health of the system. If the heartbeat from a component is missed for two minutes, the component will be marked as failed. Additional probes will be conducted to understand the nature of the failure. Once the reason for the failure is understood, the recovery process will be initiated.
HDM in a Healthy State
When there are no failures, all HDM components will show the state as ‘Healthy’ and their color will be seen as blue in the appliance control panel. The overall state of the HDM is good if nothing is colored red or yellow. This can be seen in the appliance’s control panel, or on the HDM plugin within vCenter. To view this data, select Menu, followed by HDM, Administration, HDM Health, then Component Health.
In the event of a failure, the affected components will be shown here.
HDM in a Degraded State
When the system is in a degraded state due to a failure, it can be seen in the following locations:
- The vCenter dashboard
- The appliance control panel
- The vCenter event log
- The state in the SQS heartbeat
vCenter Dashboard
- Select vCenter HDM, followed by Dashboard to view a notification mentioning Services not ready or down...
Appliance Control Panel
In the event of a failure, some components may be affected. The state of those components will be listed in the appliance control panel as Poor and the overall state of HDM will be set to Not Ready. The component color will change from blue to red.
- Simply hover over the faulted component to view details regarding the error.
vCenter Event Log
If failure events impact HDM operations, they will be recorded in the vCenter events log, as well as the HDM events logs. The HDM events logs can be accessed by selecting Menu, followed by HDM, Administration, then Event Logs.
The screenshot below illustrates the types of failure and repair events that will appear in the events logs. These include component failure events and their successful recovery. The failure listed at the top of the log is an unrecoverable failure that will require an HDM reset.
These failure messages can also be seen in the vCenter events log, which can be accessed by selecting vCenter, followed by Monitor, then Events:
- To narrow down the events generated by HDM, apply a filter on the event type ID. In the far-right column, apply the filter “com.hdm”
- After applying the filter, the view will be limited to the events generated by HDM. The selected event illustrated below corresponds to a failure of an HDM service:
Health State in the SQS Heartbeat
The changes in system health are also reported through the SQS heartbeats. The typical SQS heartbeat messages (which can be retrieved from the sqs-python client) correspond to the various events listed below.
- A fully deployed system that is functional and free of failed components will have a heartbeat similar to the one listed below:
'status': 'Ready',
'status_details': 'All the components are deployed and up.',
'appliance_id': 'fa142afc-3c64-4086-b442-4ffcdc1580b2',
- When a component failure is detected, the heartbeat will contain details of the failed component, similar to the one below:
'status': 'Not Ready',
'status_details': 'Services not ready or down On Prem IO Manager, On Prem Message Gateway',
'appliance_id': 'fa142afc-3c64-4086-b442-4ffcdc1580b2',
- When a reboot of a component VM host is detected, the heartbeat will display the details of the failed VM host, like the one illustrated below:
'status': 'Not Ready',
'status_details': "HDM infrastructure VM rebooted or faulted ['HDM_OnPrem_ESXi_Manager-0']",
'appliance_id': 'fa142afc-3c64-4086-b442-4ffcdc1580b2',
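A client consuming these heartbeats typically gates new migration requests on the ‘Ready’ state and surfaces the details string when the system is degraded. The following hedged Python sketch shows that logic; it is illustrative only and not the official SQS client.

```python
def handle_heartbeat(heartbeat: dict) -> bool:
    """Return True when it is safe to submit the next migration request.

    heartbeat: a dict parsed from the SQS heartbeat message body, with the
    'status', 'status_details', and 'appliance_id' fields shown above.
    """
    if heartbeat.get("status") == "Ready":
        return True
    # 'Not Ready' covers both failed components and rebooted/faulted hosts;
    # the details string names the affected component or VM host.
    print(f"HDM not ready ({heartbeat.get('appliance_id')}): "
          f"{heartbeat.get('status_details', '')}")
    return False
```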
Failure Handling in HDM
HDM strives to meet the following requirements for handling failures:
- In the event of specific HDM component failures during cold VM migration, the migration will be resumed automatically after repair.
- In the event of pathological scenarios where migration can’t be resumed after a failure and subsequent recovery, VM availability will be ensured.
- HDM system state will be recovered to enable new migrations to be served, even if ongoing migrations had to be cancelled and migrated VMs were migrated back.
Ensuring VM Availability
As part of failure recovery, HDM will resume the transfers of VMs that were in the process of being cold migrated. Under pathological conditions, or in the event of warm migration, HDM may determine that some VMs that were already migrated, or some ongoing migrations, can no longer continue to run in the cloud. This is typically due to the failure of a component that the VM was using to connect to the on-premises environment. To ensure application availability, these VMs will be migrated back to the on-premises environment.
Data Consistency and Data Loss
VMs being cold migrated can never experience data loss. Conversely, VMs utilizing “Try Before Commit” that are migrated back as part of failure recovery do not get the opportunity to synchronize the on-premises environment with the latest cloud data. Since the on-premises environment is synchronized using an RTO/RPO interval, these VMs will hold data since the last RTO/RPO flush.
Since the RTO/RPO flushes occur through point-in-time snapshots of the cloud data, this data is expected to be crash consistent. Modern applications and file systems are designed to withstand crashes. Therefore, they should be able to use this data on-premises.
In an extreme case, if the OS or the application is incapable of utilizing the data, the data can be restored from the migrate time snapshot, with the caveat that it causes the loss of all data that was written while the VM was in the cloud.
HDM Failure Recovery
Nature of Failures
While there can be a wide range of failures, HDM mainly deals with the following:
- HDM Component Failures:
- Appliance restart
- HDM component VM restart
- HDM individual component failure due to software issues
- Network failures
- Transient network disconnect
- Permanent network disconnect
- System failures
- Storage failures
- Memory errors
Single Component Failure
A failure may affect a single HDM component, from which HDM is designed to recover automatically. The following scenarios are possible:
Failure When There is No Activity
Even when the system is idle following a successful deployment, a component failure may cause the HDM system health to go from “Healthy” to “Degraded”. The degraded state can be viewed in vCenter, as well as in the appliance (see the System Health section for details).
- vCenter Events will list a new event for a component failure
HDM will attempt to recover from the failure and bring the system back to a “Healthy” state. There are three important stages of the recovery process:
- Failure detection and moving to degraded state
- HDM system repair
- Healthy state
Following recovery, the following message is logged into vCenter Events:
Failure during Migrations
If a failure occurs during the migration operation, HDM will enter a degraded state. HDM will repair itself and will return to a healthy state. The ongoing migration operation may fail and those VMs can be migrated back as part of this recovery. The recovery process will look like this:
For redundant components like HDM message gateway, recovery is complete only when the required level of redundancy is restored. If a migration operation is attempted before the recovery is complete, it will fail.
Failure when there are Migrated VMs
Some VMs already migrated to the cloud may be affected by a component failure. This will result in those VMs getting migrated back from the cloud to the on-premises environment. The VMs will hold data since the last RTO/RPO flush.
The recovery process will look like this:
Note:
- As part of the failure recovery, if the migrated back VMs can be booted successfully, they will appear in the _HDM_RECOVERYSUCCESS pool. Otherwise, they will be placed in the _HDM_RECOVERYFAILED pool.
- There are cases where HDM is not able to perform the auto repair of its failed components. This could be because of a software issue, or because the error condition is permanent (e.g., a permanent network or storage disconnect). In these cases, an HDM reset can be issued to recover from the error and restart the entire process. See the Troubleshooting section for more details.
Recovery Resource Pools
VMs migrated back as part of failure recovery are kept in recovery resource pools. There are two types:
HDM_RECOVERY_SUCCESS
This resource pool hosts the VMs that have been migrated back as part of failure handling, and are likely to be successfully booted in the on-premises vCenter. They may have some data loss equivalent to the last RTO/RPO flush cycle (default 20 minutes).
HDM_RECOVERY_FAILED
This resource pool hosts the VMs that have been migrated back as part of failure handling, but are unlikely to have consistent data. These VMs will be required to restore their data from the migration time snapshot.
Note: Restoring data from the migrate time snapshot will cause the loss of all data written while the VM was in the cloud.
Recovering VMs from the HDM Recovery Pools
The following steps should be followed to recover VMs from recovery resource pools:
- Power on the VM and verify the sanity of the data.
- If the power on and data sanity checks pass:
- Delete the HDM migrate time snapshot.
- Move the VM to the resource pool where it originally resided prior to the migration.
- If the power on or the data sanity failed:
- Restore the data from the migrate time snapshot.
- Delete the HDM migrate time snapshot.
- Move the VM to the resource pool where it originally resided prior to the migration.
- Power on the VM
Note: Failure to move the VMs to their original resource pool will cause their subsequent migration and migration back to only occur from the HDM recovery pool.
Multiple Component Failure
If a second component fails while the system is recovering from the failure of a single component, HDM will detect the failure and send a notification through vCenter events. This will also be the case if multiple HDM component hosts restart simultaneously. HDM components may not fully recover, and migrated VMs may not migrate back.
Multiple component failures may require an HDM reset to restore the system.
HDM Appliance Restart
Restarting the appliance does not impact migrated VMs, nor does it impact future migration operations. Operations in progress during the appliance restart will be affected as follows:
- ARM cold migration of a VM: The ongoing bulk transfer will fail and the operation will be retried from the beginning.
- VM migration to the cloud will resume after restart and complete successfully
- ARM warm migration:
- If the appliance reboot occurs while migrating the compute VM, the migration and the bulk transfer will both fail.
- If the appliance reboot occurs after the successful migration of the compute VM, only the bulk transfer will be retried.
- VMs running in the cloud will continue to run. Some operations such as RTO/RPO periodic flush will resume after the reboot completes.
- SQS heartbeats will be missing throughout the duration of the reboot, which should last approximately two minutes.
- The vCenter plugin will display a message throughout the duration of the reboot indicating that it cannot connect to the appliance.
HDM Component VM Restart
An HDM deployment consists of a set of microservices running as containers in VMs deployed on-premises as well as in the cloud. Depending on the deployment type, some or all of the following VMs will be included:
- HDM_Cloud_Manager
- HDM_Cloud_Cache
- HDM_OnPrem_Manager
- HDM_OnPrem_ESXi_Manager
Rebooting any of these VMs triggers the following repair actions:
- All affected VMs are migrated back to the last RTO/RPO state.
- A vCenter event is logged, communicating that a “Docker reboot” has occurred on the identified VM.
- All components within that VM are repaired.
- All future operations involving the repaired components should work correctly.
WAN Network Disconnect
A WAN network disconnect may result in HDM components losing network connectivity with the central heartbeat monitoring entity.
Transient Network Failure
HDM can recover from short network outages (those lasting less than 5 minutes) by retrying ongoing operations.
Permanent Network Failure
If the network outage lasts for an extended period of time (greater than 5 minutes), the HDM recovery may not succeed and an HDM reset may be required.
ESXi Host Restart
If an on-premises ESXi host is restarted or the PRAAPA iofilter daemon service is restarted, the ongoing migrations will fail, VMs already migrated to the cloud will be migrated back, and new VM migrations will fail. An HDM reset and re-deployment of HDM on-premises and cloud components will be required prior to retrying the migration operation.
System Failures
Failures such as storage or memory errors may result in some HDM component failures, or their impact may be limited to a few operations or I/Os. If I/Os or some operations fail, they will be retried.
Boot Failure During Migrate
A guest VM boot may fail if VMware tools are not available early enough to detect the successful boot. HDM will retry (or, in this case, reboot) the operation a few times.
Note: Multiple retries can delay the boot. In this case, wait 30 minutes for the migration operation to complete.
Bulk Transfer Failure During ARM Migration
If the bulk transfer fails during cold migration, the operation will be retried a few times. Errors such as transient network issues can be dealt with using this mechanism.
Note: All retries will be attempted a fixed number of times. Once the number of retries has been exhausted, the operation will be marked as failed.
If a VM snapshot has been bulk transferred to the cloud and a failure occurs while the cloud cache syncs with it, the portion that has already transferred to the cloud must be explicitly deleted. HDM failure handling does not automatically delete the bulk transferred VM.
Unresolved Issues
Refer to the Troubleshooting section if failure issues are not resolved. The failure might have been caused by a known product issue.
PrimaryIO support may be required.