NOTE: In HDM 2.1.3, TBC is only supported for demo and test purposes.
This section describes the steps for executing the migration operation through vCenter. As noted above, only Cold Migration is supported in HDM 2.1.3.
Prerequisites
Steps
Figure 2: On-premises vCenter
Figure 3: Migration Wizard - Step 2
Figure 4: Migration Wizard - Step 3
Select which VMs you wish to migrate, then click NEXT.
NOTES:
Figure 5: Migration Wizard - Step 4
Select the vApp target for the migration and the storage profile to be applied to the VM once the migration is complete, then click NEXT.
NOTE: The target vApp should have HDM_INTERNAL_NETWORK connected
Figure 6: Migration Wizard - Step 5
Figure 7: Migration Wizard - Step 6
Figure 8: Migration Wizard - Step 7
The progress of the migration can be seen either on the migration status page (figure 9) or on the vCenter task page (figure 10). The migrated VM appears in a new resource pool, HDM_MIGRATE_POOL, in the on-premises vCenter and will be in a powered-off state (figure 11). All migrations can be monitored in vCenter by selecting Cluster → Monitor → HDM, then selecting the Migration tab, followed by the In Progress tab (figure 12). Virtual machines that have been migrated to the cloud can be seen in vCenter by selecting Cluster → Monitor → HDM, then selecting the Migration tab, followed by the Summary tab (figure 13).
Figure 9: Migration Wizard - VM Migration Status Page
Figure 10: vCenter Task Page
Figure 11: vCenter Resource Pool Page
Figure 12: vCenter HDM Migration Progress Tab
Figure 13: vCenter HDM Migration Summary Tab
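The migrated VMs and their power state can also be listed with a short script against the vCenter API. The following is a minimal sketch using pyVmomi and is not part of HDM; the vCenter hostname and credentials are placeholders, and it simply prints the VMs found in the HDM_MIGRATE_POOL resource pool together with their power state.

# Minimal pyVmomi sketch: list VMs in the HDM_MIGRATE_POOL resource pool.
# Hostname and credentials are placeholders; this is not part of HDM.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()           # lab use only; validate certificates in production
si = SmartConnect(host="vcenter.example.com",    # placeholder on-premises vCenter
                  user="administrator@vsphere.local",
                  pwd="password",
                  sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ResourcePool], True)
    for pool in view.view:
        if pool.name == "HDM_MIGRATE_POOL":
            for vm in pool.vm:
                # Migrated VMs are expected to be in a powered-off state.
                print(vm.name, vm.runtime.powerState)
    view.DestroyView()
finally:
    Disconnect(si)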
There are known limitations with virtual machine disk controller configurations when migrating to VMware Cloud Director. HDM does not support the following migrations:
HDM keeps track of the health of its constituent components using periodic messages (heartbeats), which also enables it to determine the overall health of the system. If the heartbeat from a component is missed for two minutes, the component is marked as failed. Additional probes are then made to understand the nature of the failure. Once the reason for the failure is understood, the recovery process is initiated.
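The sketch below is illustrative only and is not HDM source code; it shows, for a simple in-memory monitor, how a heartbeat check of the kind described above might mark a component as failed after two minutes without a heartbeat and then probe and recover it. The probe and recover steps are hypothetical placeholders.

# Illustrative sketch of a heartbeat monitor with a two-minute failure threshold.
import time

HEARTBEAT_TIMEOUT = 120   # seconds without a heartbeat before a component is marked failed

class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}                        # component name -> timestamp of last heartbeat

    def record_heartbeat(self, component):
        self.last_seen[component] = time.time()

    def check(self):
        now = time.time()
        for component, seen in self.last_seen.items():
            if now - seen > HEARTBEAT_TIMEOUT:
                # Heartbeat missed for two minutes: mark failed, probe to
                # understand the failure, then start the recovery process.
                failure = self.probe(component)    # hypothetical diagnostic step
                self.recover(component, failure)   # hypothetical recovery step

    def probe(self, component):
        return "unreachable"                       # placeholder diagnosis

    def recover(self, component, failure):
        print(f"recovering {component} after failure: {failure}")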
Determine system and component health by selecting Menu followed by HDM in the appliance control panel, then selecting the Administration tab, followed by the HDM Health tab, and then the Component Health tab.
When there are no failures, the status of every HDM component is listed as Good and appears with a green checkmark icon in the Appliance control panel (figure 14). Red or yellow icons, with corresponding negative messages in the status column, indicate problems (figure 15). This information is also available in the HDM plug-in in vCenter (figure 16).
Figure 14: Healthy Components
Figure 15: Component Troubles
Figure 16: HDM Plug-in in vCenter
When system health is degraded, it can be viewed using any of the following tools:
Access the vCenter dashboard by selecting HDM, followed by the Dashboard tab. It will contain a notification such as "Services not ready or down ..." (figure 17).
Figure 17: vCenter Dashboard
The appliance control panel will update the status of failed components to "Poor", and the overall HDM state will be set to "Not Ready". The component color will also change from blue to red (figure 18). To get details about the error, hover your cursor over the failed component (figure 19).
Figure 18: Appliance Control Panel: Determining Failed Components
Figure 19: Appliance Control Panel: Details on Failed Components
Access the vCenter event log by selecting HDM from the menu, followed by the Administration, HDM Health, and Event Logs tabs (figure 20). All failure events that impact the operation of HDM are recorded here (NOTE: repair and recovery events are also shown). In the example in figure 20, the most recent failure is unrecoverable and therefore requires an HDM reset. You can obtain additional details on any failure message by selecting vCenter from the menu, followed by the Monitor and Events tabs (figure 21).
Figure 20: vCenter Event Log
Figure 21: Event Log Details
To isolate the events generated by HDM, a filter can be applied to the "Event Type ID" field in the rightmost column. Figure 22 illustrates this by applying the filter "com.hdm", which limits the view to the events generated by HDM.
Figure 22: Applying Filters
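As a scripted alternative to the UI filter, the following pyVmomi sketch (assuming the session object si from the earlier example) pulls recent vCenter events and keeps only those whose eventTypeId begins with "com.hdm". The exact event type IDs emitted by HDM are product-defined, so treat this as an illustration rather than a definitive query.

# Illustrative sketch: client-side filtering of vCenter events on the "com.hdm" prefix.
from pyVmomi import vim

def hdm_events(si, max_events=1000):
    event_manager = si.RetrieveContent().eventManager
    spec = vim.event.EventFilterSpec(maxCount=max_events)
    for event in event_manager.QueryEvents(spec):
        type_id = getattr(event, "eventTypeId", "") or ""
        if type_id.startswith("com.hdm"):
            print(event.createdTime, type_id, event.fullFormattedMessage)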
HDM handles failures in the following manner:
As part of failure recovery, HDM will resume the transfer of any VM that is in the process of a cold migration. In the case of a TBC migration, HDM may identify that some VMs that were already migrated, or some ongoing migrations, can no longer continue to run in the cloud. This is typically due to a failure in the component the VM was using to connect to the on-premises environment. To ensure application availability, these VMs are migrated back to the on-premises environment.
HDM mainly deals with the following types of errors:
A failure may result in a single HDM component failing. HDM is designed to recover from these failures automatically. The following scenarios are possible:
Even when the system is idle after a successful deployment, a component failure may cause it to go from “Healthy” to “Degraded”.
The HDM system health in the degraded state can be viewed in vCenter and in the Appliance (see the System Health section for details). The vCenter events log will also list the component failure as a new event.
HDM will attempt to recover from the failure and bring the system back to a "Healthy" state. The recovery process includes three important stages:
After recovery, a message is logged in the vCenter events log:
Figure 23: Post-Recovery Message in vCenter Events Log
If a failure occurs during the migration operation, HDM will move to a degraded state. HDM will repair itself and return to a healthy state. The ongoing migration operation will be paused once the failure is detected and resumed once the repair operation is completed and the system has returned to a healthy state.
For redundant components like the HDM message gateway, recovery can only be considered complete after the required level of redundancy has been restored. Any migration operation attempted before the recovery is complete will result in a failure.
(Applicable for TBC)
Some VMs that have already been migrated to the cloud may be affected by a component failure, causing them to be migrated back to the on-premises environment. These VMs will contain data only up to the last RTO/RPO flush.
NOTES:
(Applicable for TBC)
VMs that are migrated back as part of a failure recovery are kept in one of two types of recovery resource pools:
This resource pool hosts the VMs that have been migrated back as part of failure handling and are likely to be successfully booted using the on-premises vCenter. However, they may have some data loss equivalent to the last RTO/RPO flush cycle (default 20 minutes).
This resource pool hosts the VMs that have been migrated back as part of failure handling but are unlikely to have consistent data. These VMs will be required to restore their data from the migration-time snapshot.
NOTE: Restoring data from the migration-time snapshot means that all data written while the VM was in the cloud environment will be lost.
Follow these steps to recover VMs from the recovery resource pools:
NOTE: If the VMs are not moved back to their original resource pool, subsequent HDM migrations (and migrations back) of those VMs will be limited to the recovery pool.
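As an illustration of the step described in the note above, the following pyVmomi sketch (again assuming the session object si from the earlier example) moves a recovered, powered-off VM out of a recovery pool and back into its original resource pool. The VM and pool names are placeholders.

# Illustrative sketch: move a recovered VM back into its original resource pool.
from pyVmomi import vim

def find_by_name(si, vimtype, name):
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    try:
        return next(obj for obj in view.view if obj.name == name)
    finally:
        view.DestroyView()

def move_back(si, vm_name, original_pool_name):
    vm = find_by_name(si, vim.VirtualMachine, vm_name)
    pool = find_by_name(si, vim.ResourcePool, original_pool_name)
    # Relocate the powered-off VM into its original resource pool so that later
    # HDM migrations are not limited to the recovery pool.
    return vm.MigrateVM_Task(pool=pool,
                             priority=vim.VirtualMachine.MovePriority.defaultPriority)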
If a second component fails while the system is still recovering from the failure of the first, HDM will detect the failure and notify the user through vCenter events. The same applies if multiple HDM component hosts restart simultaneously. HDM components may not fully recover, and ongoing migrations will not resume. Multiple component failures may require an HDM reset to restore the system.
Restarting the Appliance does not impact migrated VMs, nor does it impact future migration operations.
Operations that are in progress during the appliance restart will be affected in the following way:
The HDM deployment consists of a set of microservices that are packaged as Docker containers inside VMs deployed on-premises as well as in the cloud. Depending on the deployment type, one or both of the following VMs may be present:
When either of these is rebooted, the following repair actions are triggered:
A WAN disconnect may result in HDM components losing network connectivity with the central heartbeat monitoring entity.
HDM can recover from short network outages (those lasting less than 2 minutes) by retrying ongoing operations.
If the network outage lasts for an extended period of time (greater than 2 minutes), the HDM recovery may not succeed and an HDM reset may be required.
If the on-premises ESXi host is restarted, the ongoing migrations will be paused and will resume once the host is back up.
System failures, such as storage or memory failures, may result in some HDM component failures, or their impact may be limited to a few operations or IOs. In these cases, the impacted process will be retried.
A guest VM boot may fail for reasons such as VMware Tools not starting early enough for a successful boot to be detected. HDM will retry the operation (in this case, reboot the VM) a few times.
NOTE: Multiple retries can delay the boot. In these cases, the user may have to wait up to 30 minutes for the migration operation to complete.
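The following sketch is illustrative only and is not HDM source code; it shows how a bounded boot retry of the kind described above could look, treating a running VMware Tools service as the signal for a successful boot and rebooting the VM between attempts. It assumes a pyVmomi VirtualMachine object, and the timeout and retry values are placeholders chosen to match the roughly 30-minute upper bound mentioned in the note.

# Illustrative sketch: retry a guest boot a few times, using VMware Tools as the boot signal.
import time

def wait_for_boot(vm, timeout=600, max_retries=3):
    for attempt in range(max_retries):
        deadline = time.time() + timeout
        while time.time() < deadline:
            if vm.guest.toolsRunningStatus == "guestToolsRunning":
                return True                     # boot detected successfully
            time.sleep(15)
        # VMware Tools did not come up in time: reboot the VM and try again.
        # Repeated retries are why the operation can take up to ~30 minutes.
        vm.ResetVM_Task()
    return False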
Refer to the HDM 2.1.3 Troubleshooting Guide if the failure issues are not resolved. The failure may have been caused by a known product issue.
If PrimaryIO technical support is required, refer to the Install Guide for the details.