This section covers troubleshooting for known issues in HDM 2.2. In the event there are issues that are not covered in this document or are not easily resolved, contact PrimaryIO support for assistance.
Do not delete the component VMs or the content library deployed as part of the HDM product deployment. Doing so will cause the product to operate incorrectly, and will require an “HDM Reset” followed by a redeployment.
CP-4217: When the PrimaryIO Manager service within the PrimaryIO appliance gets restarted, functionality such as "Enable/Disable Monitoring" and the rendering of screens on the vCenter HDM plugin UI can be adversely affected. The service is usually back up in a few seconds, so the user should retry the functionality or refresh their screen.
CP-4867: In clusters that do not have DRS or HA enabled, when the ESXi host running the PrimaryIO Appliance fails, the vApp option properties on the virtual machine are reset. This is normal VMware behavior that can also be seen when virtual machines are migrated across vCenter instances. When the PrimaryIO Appliance is rebooted, services will not be deployed, causing all operations to fail.
CP-4906: During flex mode of migration, if HDM_ON_CLOUD_NETWORK and the virtual machine application network are the same, not all Linux virtual machine IPs will be accessible post migration. To fix this, run the following command within the migrated Linux virtual machine:
# Disable reverse-path filtering for every interface (run as root)
for f in $(find /proc/sys/net -name rp_filter); do
    echo 0 > "$f"
done
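Before and after running the loop above, it can help to see which entries still have reverse-path filtering enabled. The following is a minimal sketch and is not part of HDM; the function name `report_rp_filter` and its optional directory argument are assumptions made for illustration. Values written under /proc/sys do not persist across a reboot, so the check may be worth repeating after the VM restarts.

```shell
#!/bin/sh
# Print every rp_filter entry under the given directory whose value is not 0.
# report_rp_filter is a hypothetical helper; the directory defaults to /proc/sys/net.
report_rp_filter() {
    dir=${1:-/proc/sys/net}
    find "$dir" -name rp_filter 2>/dev/null | sort | while IFS= read -r f; do
        value=$(cat "$f" 2>/dev/null)
        if [ "$value" != "0" ]; then
            echo "$f=$value"
        fi
    done
}
```

Running `report_rp_filter` with no arguments inside the migrated VM prints nothing once every entry has been set to 0.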
CP-4917: The application dependency feature is not supported for cold migration. Each virtual machine migrated using cold migration will be migrated separately.
CP-5035: Whether TBC, warm, or cold migration is employed, upon booting in the cloud, any VM configured for DHCP to Static IP address conversion will retain its previously configured DHCP IP address, in addition to its new Static IP address. This is true for all Linux distributions. The OS’s old DHCP lease files are not deleted as part of DHCP to Static IP address conversion. Therefore, once the VM becomes available in the cloud, OS “network startup” scripts renew the old DHCP IP address, in addition to adding the new Static IP address. There is no workaround for this issue. The DHCP IP address will not disturb the newly configured Static IP address.
CP-5076: In TBC and warm migration use cases, virtual machines migrated back will retain the post-migration cloud IP address. To resolve this issue, the virtual machine's network will need to be reconfigured.
CP-5104: Even in cases where the license has not yet expired, re-installation of HDM is not supported if the license-enforced migration limit has been reached.
CP-5107: On HDM_Cloud_Cache reboot, the cache service does not come up. An HDM Reset will be required to recover migrated virtual machines and clean the HDM deployment.
CP-5112: If the on-premises VM remains powered on following a cold or warm migration, any virtual machine configured to sync data via a static IP address will cause an IP address collision upon power-on. As a result, the migrated VM will be unable to employ the newly-configured static IP address. This is mainly seen in the SLES Linux distribution where IP address collisions are detected as part of the “network startup” scripts, and IP addresses fail to come online. However, regardless of the specific Linux or Windows distribution, any time two VMs are assigned the same IP address, only one will be reachable through that IP address. To avoid this issue, only keep one VM (either the migrated or the original on-premises) powered on at any given time.
Inconsistent data in HDM plugin UI:
CP-5003: In the migration pop-up, the amount of data transferred and the compression ratio are shown as zero, even though both are running in the background and progress can be seen in the HDM migration task.
CP-5064: Historical IO analysis data is shown for powered-off virtual machines. The viewed timelines are not consistent with the historical timelines.
VMC does not always honor the "Disable DRS" settings on the component VMs deployed in the cloud. This can result in resources such as Cloud_Cache becoming separated from the migrated VMs running in the cloud. This is generally not a cause for concern, but can result in the VM becoming unresponsive if the ESXi host were to fail. The system will correctly roll back the VM to on-premises and start it from the last RTO/RPO checkpoint. (Ref: CP-5612)
Failures during HDM deployment on-premises or in the cloud require the state to be cleaned using HDM Reset before the deployment can be retried. In the current release, HDM does not attempt to automatically recover from deployment failures. (Ref: CP-4686)
During deployment, synchronization with the NTP server is required. This operation may take time. During this period, the deployment may appear to be stuck or to be taking longer than expected. No action is required. After this expected delay, the deployment will continue as usual. (Ref: CP-4419)
This failure can occur due to either incorrect credentials or an incorrect Cloud Director Organization name. Providing the correct credentials or organization name should resolve this issue. (Ref: CP-5395)
Insufficient CPU or memory resources on-premises or in the cloud may cause the HDM deployment to fail and log the following event in vCenter: “Insufficient resources to satisfy configured failover level for vSphere HA.”
This can be avoided by choosing the HDM deployment type based on the availability of on-premises and cloud resources. (Ref: CP-4243)
If the network selected during deployment of the PrimaryIO appliance is incorrect and vCenter has not been added yet, the following procedure can be used to change the network:
If vCenter has already been added, it is no longer possible to change the network. The appliance will need to be redeployed.
HDM does not support an IPv6 configured VMware environment.
After changing the default appliance password, the new password has been forgotten.
Resolution: Contact PrimaryIO support to reset the password.
Resolution: The user must be assigned administrator privileges for the specified vCenter. Follow the steps below to do this:
This user will now be able to add the specified vCenter to the appliance.
A vCenter entry in the PrimaryIO appliance can show an ‘ERROR’ state if its credentials have been changed externally in vCenter after adding it to the appliance. This issue can also occur if vCenter is no longer reachable.
Resolution:
The HDM plugin fails to appear in the vCenter UI, despite having been registered with vCenter.
Resolution:
Unregister the HDM plugin from vCenter.
Search for and delete the ‘PrimaryIO’ or ‘praapa’ files from vCenter. Execute the following command to locate and delete the files, then restart the vsphere-ui service:
find / -name '*piohyc*' -exec rm -rf {} +; service-control --stop vsphere-ui; service-control --start vsphere-ui
Restart vCenter.
Re-register the HDM plugin with the same vCenter it was deleted from.
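Because the cleanup command in the steps above removes every match with rm -rf, it may be worth previewing the matches first. The following is a minimal sketch that only lists matches and deletes nothing. The function name `list_plugin_files` and its optional search-root argument are assumptions for illustration; it uses the wildcard pattern `*piohyc*`, which also matches the exact name `piohyc`.

```shell
#!/bin/sh
# List files and directories whose names contain "piohyc" without deleting them.
# list_plugin_files is a hypothetical helper; the search root defaults to /.
list_plugin_files() {
    root=${1:-/}
    find "$root" -name '*piohyc*' -print 2>/dev/null | sort
}
```

Running `list_plugin_files` from the vCenter shell and reviewing the output before the cleanup helps confirm that only plugin files will be removed.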
The HDM filter deployment on a cluster is unsuccessful and the ‘Install Filter Task’ shows vCenter in an invalid state (‘vim.fault.InvalidState’).
This can happen if the previous version of HDM has not been cleanly uninstalled, or the PrimaryIO appliance IP that was used during installation is not reachable. In this case, check the EAM logs within vCenter and look for the error message, "https://<IP>/bundle/primaryio_6.x.zip is not reachable". Check to ensure the IP of the appliance is the same one that is shown in the error message, and that the URL is reachable.
Resolution: Log into “https://<vcenter_ip>/eam/mob” using vCenter administrator credentials to access the ESXi Agent Manager. Next to agency you will see ArrayofManagedObjectReference.
Select the associated action to obtain the list of agents. Then, check AgencyConfigInfo to see if it belongs to PrimaryIO.
Select Uninstall if the agent was not cleanly uninstalled earlier. Wait for the uninstallation task to succeed in vCenter, then select Destroy to clean the agent.
Refresh “https://<vcenter_ip>/eam/mob” to ensure the agency has been successfully removed. This will remove vCenter from the invalid state. Use HDM Reset to redeploy the product on-premises.
If vCenter and the cloud ESXi are configured on different networks and resolved through different DNS settings, the cloud vCenter will be added if the correct DNS is set during Add Cloud. However, ESXi resolution will fail, resulting in the subsequent failures of cold and warm migrations.
Similarly, Add Cloud may fail if the cloud vCenter has been configured with FQDN, but the DNS used to resolve vCenter is incorrect or is not provided during the operation.
Resolution: Add rules in the network to forward the resolution of the specified vCenter or ESXi FQDNs to the correct DNS.
After creating the HDM SPBM policy, the VM storage policies should list praapa as a new caching component. However, the creation of a new VM storage policy, and its subsequent display in vCenter, may take time. As a result, praapa may not be displayed under Caching Component following the creation of the HDM SPBM Policy and I/O monitoring may not succeed for the VMs.
Resolution: In vCenter, select Administration, followed by System Configuration, Services, then Restart VMware vSphere Profile-Driven Storage Service. This will update the vCenter state and praapa will be visible.
During Enable Monitoring the HDM SPBM policy creation can fail.
Workaround: Manually create the SPBM policy with the name HDM Analyzer Profile. If the praapa policy is not shown, delete and rescan the praapa storage provider by selecting vCenter, followed by Configure.
VMs with IDE controllers do not allow the HDM SPBM policy to be attached in the powered on state.
Resolution: Power the VMs off, then attach the policy. (Ref: CP-4297)
This problem is specific to an Enhanced Linked setup. Consider configuring an Enhanced Linked setup using two vCenter servers, each containing ESXi hosts:
Assume that HDM is installed using vCenter vcenterserver01.dopio.com. The HDM SPBM policy HDM Analyzer Profile will be visible in vcenterserver02.dopio.com, but the policy attributes will not be available. Attempting to apply this policy from vcenterserver02.dopio.com to any VM may fail.
Workaround: Use the vCenter vcenterserver01.dopio.com to apply an SPBM policy to any VM.
Configuration of the SQS message bus can fail if incorrect tokens or non-existent queues are provided during configuration. Ensure that the queues provided are already created in the Amazon SQS service.
Re-deployment on-premises can fail during configuration of the HDM ESXi Manager. Services within the ESXi Manager must be configured to communicate with the I/O filter service within ESXi to service I/O requests. If the I/O filter service praapa is stopped or is in the process of restarting within ESXi, configuration will fail. Check the I/O filter status on the ESXi Configure page in vCenter and restart the service to retry the on-premises deployment.
Certain unsupported OSes with a UEFI boot loader configured drop into the EFI shell during boot. This is because the BIOS is unable to locate the bootloader. To continue the boot process, the bootloader's path must be specified manually in the EFI shell. The paths for some commonly used OSes and their versions are listed below. A similar solution can be adopted for EFI-configured OSes not in this list. (Ref: CP-5644)
centos fs0:\EFI\centos\shimx64.efi
ubuntu EFI fs0:\EFI\ubuntu\grubx64.efi
You might also find the following articles useful.
https://kb.vmware.com/s/article/2061784
This issue can occur when invalid CA certificate details are provided for the key manager. Providing valid CA certificates for the key manager will resolve the issue. (Ref: CP-5408)
During the Add Cloud operation, users should specify the correct default application network in the cloud or map the on-premises network to the cloud network. Otherwise, the VM migration may succeed but applications on the migrated VMs may fail. In SQS-based migration, the network mapping can be specified at migration time. (Ref: CP-4433)
Certain virtual machine parameters may not be retained post migration. These will need to be set manually. (Ref: DP-2859)
If the name of the VM being migrated contains the substring "hdm" (case-insensitive), the VM will not appear in the migration list. For example, VMs named "test-hdm-vm" or "video-hdmi" will not be listed and cannot be migrated.
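The naming rule above can be checked before a migration is planned. The following is a minimal sketch of the case-insensitive substring test as described in this section; `name_hidden_from_migration` is a hypothetical helper, not an HDM command, and how HDM itself performs the match is not documented here.

```shell
#!/bin/sh
# Return success (0) if the VM name contains "hdm" in any letter case.
# name_hidden_from_migration is a hypothetical helper, not an HDM command.
name_hidden_from_migration() {
    case $(printf '%s' "$1" | tr '[:upper:]' '[:lower:]') in
        *hdm*) return 0 ;;
        *)     return 1 ;;
    esac
}
```

For example, `name_hidden_from_migration "video-hdmi" && echo "rename before migrating"` flags a VM that would be missing from the migration list.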
After the deployment completes, users may not find any action available to migrate a VM in vCenter, or vCenter may just show ‘loading...’.
Resolution: Log out of vCenter, log back in, and try the migration again. If no migration option appears, restart the vsphere-ui service (service-control --stop vsphere-ui; service-control --start vsphere-ui) from the vCenter SSH console.
For each virtual machine migration, a vCenter task is created. The progress and status of the migration are then updated in the vCenter task object. The same status is also reflected on the Migration In-Progress tab of the HDM vCenter plugin for the global and cluster views. In cases where the task has been completed, a vSphere server may delete the reference from the vSphere database. Because the task object has been removed, the UI cannot show the correct migration status and displays the migration state as ERROR. Users can check the task status in the vCenter task list for the virtual machine to get the correct status.
In an HDM on-cloud deployment, if the Internal network in the cloud is configured for DHCP, the DHCP-assigned IP addresses of migrated VMs (warm or TBC) can experience lease timeouts. This is because some Linux distributions (mainly all Ubuntu and SLES versions) do not renew the DHCP lease for the iBFT-enabled NIC that is connected to the Internal network for TFTP/iSCSI booting.
Windows and RHEL/CentOS distributions are not affected by this issue in any version.
If the Internal network in the cloud is configured with a static IP address pool, this issue will not arise.
Workaround:
(Ref: DP-2777)
This issue can occur when the key manager details are not updated correctly. When editing the key manager, ensure that the operation is successful. If the operation fails, the key manager will be marked as unreachable. This will lead to the failure of try in cloud and warm migration. To ensure successful integration, be sure to use correct credentials and certificates to edit the key manager. Ensure that the key manager is reachable from the appliance at all times. (Ref: CP-5444)
HDM does not currently support UEFI BIOS. As a result, the prepare to migrate operation will fail for these VMs. Only use VMs with IBM PC BIOS for HDM migration.
The Cloud vCenter enables a few VMDK operations such as add/grow disks on the migrated VMs. HDM does not support these operations on the migrated VMs. While the operation may succeed, these VMs can’t be migrated back or re-migrated to the cloud. (Ref: CP-2595)
To warm migrate linked clones, attach the SPBM policy HDM Analyzer Profile to the base VM of the linked clone.
In an FQDN-based deployment, HDM may not be able to resolve the cloud vCenter or ESXi, which can cause the warm migration to fail. During HDM installation, the DNS entry should have been configured to resolve the FQDN. If this is missing, manually add the DNS nameserver to /etc/resolv.conf on the HDM cloud cache component to resolve the FQDN. (Ref: CP-4330)
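As an illustration of the resolv.conf change, the following is a minimal sketch that appends a nameserver entry only if it is not already present. It assumes shell access to the component whose /etc/resolv.conf needs the entry; the function name `add_nameserver` and the address in the usage note are hypothetical, and the exact HDM procedure for reaching that file is not shown here.

```shell
#!/bin/sh
# Append "nameserver <ip>" to a resolv.conf-style file unless it is already listed.
# add_nameserver is a hypothetical helper; the file defaults to /etc/resolv.conf.
add_nameserver() {
    ip=$1
    file=${2:-/etc/resolv.conf}
    # awk exits 0 only when an exact "nameserver <ip>" line exists
    if awk -v ip="$ip" '$1 == "nameserver" && $2 == ip { found = 1 } END { exit !found }' "$file" 2>/dev/null; then
        return 0    # already present, nothing to do
    fi
    echo "nameserver $ip" >> "$file"
}
```

A call such as `add_nameserver 10.0.0.2` (a placeholder address) can be repeated safely because the entry is written at most once.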
After migrating back, moving the VM to its original resource pool can sometimes fail in vCenter 6.7. If this happens, use vCenter to move the VM from the HDM_MIGRATE_POOL to its original resource pool. (Ref: CP-4652)
Once the deployment has completed, vCenter may not show an option to migrate the VM. Alternatively, it may just display loading....
Resolution: Log out of vCenter, log back in, then attempt the migration again. If the issue persists, restart the vsphere-ui service (service-control --stop vsphere-ui; service-control --start vsphere-ui) from the vCenter SSH console.
VMs that are migrated back are first powered off in the cloud. For Ubuntu 16.04 and RHEL 7.4, this operation can require an extended period of time. I/O errors can also be seen in the guest VMs during this process.
This is due to an issue in a specific version of the Linux kernel, where the iSCSI connection is closed too early during the shutdown process. This creates a backlog of ongoing I/Os, which stalls the shutdown of the guest VM. HDM will perform multiple retries and will ultimately hard power off these VMs during migration back. This issue is discussed further in a public forum:
https://bugzilla.redhat.com/show_bug.cgi?id=1164756
Migration requests initiated through SQS can fail if the appliance lacks Internet access. Ensure that the appliance has Internet access while using SQS for migration.
During warm migration, VMs in the cloud that are managed by HDM are identified by the postfix ‘ARM’. This postfix is later removed once the data has been transferred and changes have been synced to the cloud VM. The renaming of a VM can fail if an existing VM in the cloud has the same name.
This may happen if the wizard is still open when the operation completes. Sometimes the status on the wizard is not updated in a timely manner. For the accurate status of the operation, refer to vCenter tasks. (Ref: CP-5622)
Warm migration or TBC can fail if the static IP address specified during the migration is not part of the subnet on the cloud network. In this case, an error is flagged in the events in VCD/vCenter with the message “Invalid network parameter: Specified address is not in the subnet range.” The same failure occurs if the specified IP address is already in use in the cloud. In this case, the error is “The following IP/MAC addresses have already been used by running virtual machines: MAC addresses: IP addresses: .... Use the Fence vApp option to use same MAC/IP. Fencing allows identical virtual machines in different vApps to be powered on without conflict, by isolating the MAC and IP addresses of the virtual machines.” (Ref: DP-2887)
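The subnet condition above can be verified before the static IP address is entered in the migration wizard. The following is a minimal sketch of an IPv4 membership test in CIDR notation; `ip_to_int` and `ip_in_subnet` are hypothetical helpers rather than HDM or vCenter commands, and the sketch assumes octets are written without leading zeros. It does not detect addresses that are already in use.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Return success (0) if the address falls inside the CIDR block, e.g. 192.168.10.0/24.
# ip_in_subnet is a hypothetical helper, not an HDM or vCenter command.
ip_in_subnet() {
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    # Build the netmask from the prefix length and keep only the low 32 bits
    mask=$(( (4294967295 << (32 - $bits)) & 4294967295 ))
    [ $(( $addr & $mask )) -eq $(( $net & $mask )) ]
}
```

For example, `ip_in_subnet 192.168.10.25 192.168.10.0/24` succeeds, while the same check with 192.168.11.25 fails.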
Migrating Windows VMs with an Evaluation License will result in the migrated VM failing the guest OS's license check. The operating system enforces this behavior, and the VM will power off after 45 minutes. This is not an HDM product bug but the license enforcement of Microsoft. (Ref: DP-2879)
Pre-migration checks are performed before any migration is initiated. For warm or TBC migrations, a few of the many checks are:
If a CD-ROM drive is not present, a warning is displayed stating that a CD-ROM drive will be added. However, if “prepare-to-migrate” has not been run, this warning overrides the prepare-to-migrate check and the migration is allowed to proceed. Because prepare-to-migrate has not been run, the migration will eventually fail. (Ref: CP-5713)
Failures for migrate, migrate back, on-premises deployment, and add cloud are shown in vCenter under Tasks and Events. Any failure attributed to the HDM component going down will also be captured within vCenter events with a description that begins with com.primaryio.hdm. While the failures can be seen in the HTML and Flash views of vCenter, the details of the failures are not always available in both views (e.g., they are only available in the Flash view in vCenter version 6.5).
If an HDM cloud component VM remains shut down for an extended period of time, the health of the failed HDM components may not be reflected in the PrimaryIO appliance. However, the failed component will be detected and alerts can be seen by selecting Home, followed by HDM, then Dashboard. Once the component VM is successfully rebooted, the appliance will correctly reflect the health of all components. (Ref: CP-4647)
Sometimes HDM operations fail and vCenter delivers the error message Request exceeded the limit set.
Resolution: The VAPI endpoint service may need to be restarted. To do this, log into vCenter using Flash and select Home, followed by Administration, Deployment, System Configuration, then Services. Right click on the VAPI endpoint service and select Restart. (Ref: DP-2700)
This can happen when HDM is in the process of repairing a failure, or while HDM is engaged in a recovery attempt. It can happen when HDM is deployed but there is no activity on the system, or when there is migration activity on the system.
Wait for the system repair to finish. If it appears that HDM recovery has stalled, reset HDM and proceed with the redeployment with a clean state.
If a failure causes migrations to not complete, the HDM recovery process will migrate back the affected VMs that were previously migrated, or are in the process of migrating. These VMs can be found in the on-premises HDM recovery pool.
In the event of an HDM component failure, migrated VMs can be migrated back. As part of HDM recovery, the failed component will be repaired and only the affected VMs will be migrated back and appear in _RECOVERY_POOLSUCCESS. However, HDM cloud cache is a critical component and its failure would affect all migrated VMs. As a result, all VMs will be migrated back. (Ref: DP-2739)
The HDM appliance can hang if the datastore where it resides runs out of space. If this occurs, power off the HDM appliance VM, create enough space in the datastore, then power the appliance back on. Please note that a full datastore can cause other issues and this resolution may not always work. In this case, an HDM reset may be necessary, or HDM can be re-deployed with adequate storage.
In the event of a network failure, HDM can migrate back previously migrated VMs. In some cases, these VMs can be found with additional NICs beyond what they were originally configured with. HDM adds these NICs during the migration operation and typically cleans them up automatically. However, in some failure scenarios when automatic cleanup has failed, these NICs will need to be manually cleaned up. (Ref: CP-4841)
Sometimes VMs are migrated back to the on-premises environment following a temporary migration to the cloud (e.g., the TBC use case), or because of a failure during the migration. Network availability is critical during the entire migration operation, as well as for the continuity of HDM component services. The disruption in network connectivity is tolerated for a limited time (typically up to 4-5 minutes), after which HDM will enter into failure and recovery mode. A recovery operation can only succeed when the network is available. If an HDM reset is required, it can only succeed when network connectivity is available for cleaning up state in the cloud.
HDM operations are designed for retries and resilience for somewhat jittery networks. However, if connectivity is lost for more than 4-5 minutes continuously, failures will be triggered. Then, when the network becomes available, the process of recovery and repair will be initiated.
Because uninstallation happens at the cluster level, it will fail if any host in the cluster becomes unavailable.
Resolution: Resolve the connection issue with the unreachable ESXi host, then retry the uninstallation process.
Uninstallation happens at the cluster level, so it will fail if a VIB or HDM component removal cannot be performed on any host in the cluster.
Resolution: Resolve the connectivity issue with the unreachable ESXi host, then retry the uninstallation process.
During HDM undeployment, the ‘praapa’ VIB can fail to uninstall on some ESXi instances in the cluster.
Resolution: Restart the iofilter daemon on ESXi (log into the ESXi host over SSH and run the command /etc/init.d/iofilterd-praapa restart), then uninstall HDM. (Ref: DP-2578)
The tags are created to set affinity rules on HDM component hosts managed by the cloud HDM vCenter. These rules are only used by PrimaryIO HDM and do not affect other VMs in the cloud vCenter. Currently, these are not deleted as part of the HDM uninstallation process.
Resolution: Follow these steps to clean HDMDRS* tags from the cloud vCenter:
Uninstallation of the HDM filter requires the cluster hosts to be in maintenance mode, one host at a time. If a host fails to enter maintenance mode for any reason, the HDM filter uninstall will fail.
Resolution: Manually put the host into maintenance mode and retry the uninstall operation. This may require a vMotion on the active VMs to another host in the cluster. If the cluster only has one host, the active VMs may be required to be powered off. Please note that the PrimaryIO appliance host should not be powered off prior to initiating the uninstall.
A task to uninstall the HDM filter from the appliance executes successfully, but the HDM icon on vCenter still shows the HDM version, and the ‘praapa’ service remains on the ESXi host.
Resolution: This issue is seen when the uninstall is attempted on a cluster where the ESXi hosts cannot be put into maintenance mode, either because DRS has not been configured or there is an insufficient number of ESXi instances for vMotion. This causes the uninstall request to be registered, but not executed. Follow these steps to complete the uninstallation:
An entry for the HDM plugin continues to be displayed in the Administration / Client Plug-Ins listing, or in the Menu, even after it has been unregistered from the appliance.
This is a vCenter listing issue and does not affect vCenter functionality.
Issues: The new plugin may not load properly if the same or a lower version of the plugin is registered. However, a higher version of the plugin will always work.
Resolution: Complete the following steps:
find / -name '*piohyc*' -exec rm -rf {} +; service-control --stop vsphere-ui; service-control --start vsphere-ui
During uninstall, disabling monitoring on some VMs can fail in certain situations.
Resolution: The workaround for this is to power the VMs off, then disable monitoring. (Ref: CP-4431)
Do not perform cloud undeployment while migrations are in progress. While the UI does not prevent this action, the undeployment can fail and cleanup of the migrating VMs may not happen. (Ref: CP-4373)
The HDM on-premises uninstall can fail if there are any VMs that have snapshots that were created by HDM (named pio_*). This usually applies to VMs that were migrated back as part of failure recovery, and reside in HDM_RECOVERY_SUCCESS.
Resolution: The migrate time snapshot (pio_*) should be explicitly deleted prior to retrying the undeploy. (Ref: CP-3749)
If the I/O filter service ‘praapa’ is stopped or is restarting within an ESXi, detaching the SPBM policy on the VMs can fail, because the service is unavailable and policy state cannot be cleaned. Check service status on the ESXi configure page and retry the detach operation.
If there is a network disconnect between the appliance and the cloud, HDM cloud components and migrated VMs will not be deleted. Even if the HDM reset task on vCenter lists the operation as completed, HDM components and migrated VMs may still reside in the cloud, leaving the system in an unclean state.
Resolution: Manually delete the HDM components and migrated VMs to complete the cleanup process and make it ready for a new deployment.
As part of the HDM reset, the PrimaryIO appliance must be restarted. In some cases, this restart may not succeed, or the appliance may not be assigned an IP address.
Resolution: This is probably a transient error, so restart the appliance to enable the HDM reset to successfully complete. (Ref: CP-4610)
During HDM reset, if any of the ESXi hosts in the on-premises cluster are rebooted, the cleanup for that host may not happen. While the HDM reset may still succeed, any future attempts for an on-premises deployment may fail.
Resolution: Retry the HDM reset, then attempt to redeploy. (Ref: CP-4648)
After the appliance has been rebooted, the HDM reset task remains queued. This may happen if there were transient issues while configuring network adapters within the appliance following the reboot.
Resolution: Reboot the appliance again. (Ref: CP-4836)
In clusters that do not have DRS or HA enabled, when the ESXi host running the appliance fails, the vApp option properties are reset on the VM.
Resolution: Perform the following steps:
(Ref: CP-4848)
Download Debug Logs fails on Chrome versions between 84.0.4147.135 and 88.0.4324.104. Use Firefox or a different version of Chrome. (Ref: CP-5763)
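To check whether a given browser build falls in the affected range, the version numbers can be compared field by field. The following is a minimal sketch that assumes the range is inclusive at both ends; `version_le` and `chrome_affected` are hypothetical helpers, and the version string itself can be read from the chrome://version page.

```shell
#!/bin/sh
# Return success (0) if dotted version $1 is less than or equal to $2.
version_le() {
    first=$(printf '%s\n%s\n' "$1" "$2" |
        sort -t. -k1,1n -k2,2n -k3,3n -k4,4n | head -n 1)
    [ "$first" = "$1" ]
}

# Return success (0) if the Chrome version lies in the affected range.
# The bounds are the versions quoted in this section, treated as inclusive.
chrome_affected() {
    version_le 84.0.4147.135 "$1" && version_le "$1" 88.0.4324.104
}
```

For example, `chrome_affected 86.0.4240.75 && echo "use another browser for Download Debug Logs"` prints the reminder for a build inside the range.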