This section covers troubleshooting for known issues in HDM 2.1.3. In the event there are issues that are not covered in this document or are not easily resolved, contact PrimaryIO support for assistance.
Failures during HDM deployment on-premises or in the cloud require the state to be cleaned using HDM reset before the deployment can be retried. In the current release, HDM does not attempt to automatically recover from deployment failures. (Ref: CP-4686)
During deployment, synchronization with the NTP server is required. This operation may take time, and during this period the deployment may appear to be stuck or to be taking longer than expected. No action is required. After this expected delay, the deployment will continue as usual. (Ref: CP-4419)
Insufficient CPU or memory resources on-premises or in the cloud may cause the HDM deployment to fail and log the following event in vCenter: “Insufficient resources to satisfy configured failover level for vSphere HA.”
This can be avoided by choosing the HDM deployment type based on the availability of on-premises and cloud resources. (Ref: CP-4243)
If the network selected during deployment of the PrimaryIO appliance is incorrect and vCenter has not been added yet, the following procedure can be used to change the network:
If vCenter has already been added, it is no longer possible to change the network. The appliance will need to be redeployed.
HDM does not support an IPv6 configured VMware environment.
After changing the default appliance password, the new password has been forgotten.
Resolution: Contact PrimaryIO support to reset the password.
Resolution: The user must be assigned administrator privileges for the specified vCenter. Follow the steps below to do this:
This user will now be able to add the specified vCenter to the appliance.
A vCenter entry in the PrimaryIO appliance can show an ‘ERROR’ state if its credentials have been changed externally in vCenter after adding it to the appliance. This issue can also occur if vCenter is no longer reachable.
Resolution:
The HDM plugin fails to appear in the vCenter UI, despite having been registered with vCenter.
Resolution:
Unregister the HDM plugin from vCenter.
Search for and delete the ‘PrimaryIO’ or ‘praapa’ files from vCenter. Execute the following command to locate and delete the files, then restart the vsphere-ui service:
find / -name '*piohyc*' -exec rm -rf {} +; service-control --stop vsphere-ui; service-control --start vsphere-ui
Restart vCenter.
Re-register the HDM plugin with the same vCenter it was deleted from.
The HDM filter deployment on a cluster is unsuccessful and the ‘Install Filter Task’ shows vCenter in an invalid state (‘vim.fault.InvalidState’).
This can happen if the previous version of HDM has not been cleanly uninstalled, or if the PrimaryIO appliance IP that was used during installation is not reachable. In this case, check the EAM logs within vCenter and look for the error message "https://<IP>/bundle/primaryio_6.x.zip is not reachable". Ensure that the IP of the appliance is the same one that is shown in the error message, and that the URL is reachable.
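The reachability check described above can be scripted from any host that has curl installed. The following is a minimal sketch, not an HDM tool: the function names are hypothetical, the bundle file name is copied from the error message and should be replaced with the exact name shown in the EAM log, and -k is used on the assumption that the appliance presents a self-signed certificate.

```shell
# Hypothetical helpers; the names are illustrative and not part of HDM.

# Build the bundle URL from the appliance IP. The file name mirrors the
# EAM error message; replace it with the exact name shown in your log.
bundle_url() {
    printf 'https://%s/bundle/primaryio_6.x.zip\n' "$1"
}

# Return 0 only if the URL answers with HTTP 200 within 10 seconds.
# -k skips certificate verification (assumes a self-signed certificate).
# The response body is downloaded and discarded.
bundle_reachable() {
    code=$(curl -k -s -o /dev/null --max-time 10 -w '%{http_code}' "$(bundle_url "$1")") || code=000
    [ "$code" = "200" ]
}

# Usage (192.0.2.10 is a placeholder for the IP in the error message):
#   bundle_reachable 192.0.2.10 && echo "bundle reachable" || echo "bundle NOT reachable"
```

Running the check from a host on the same network as vCenter gives the most representative result, because it is vCenter that needs to fetch the bundle.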
Resolution: Log into “https://<vcenter_ip>/eam/mob” using vCenter administrator credentials to access the ESXi Agent Manager. Next to agency you will see ArrayofManagedObjectReference.
Select the associated action to obtain the list of agents. Then, check AgencyConfigInfo to see if it belongs to PrimaryIO.
Select Uninstall if the agent was not cleanly uninstalled earlier. Wait for the uninstallation task to succeed in vCenter, then select Destroy to clean the agent.
Refresh “https://<vcenter_ip>/eam/mob” to ensure the agency has been successfully removed. This will remove vCenter from the invalid state. Use HDM Reset to redeploy the product on-premises.
If vCenter and the cloud ESXi are configured on different networks and resolved through different DNS settings, the cloud vCenter will be added if the correct DNS is set during Add Cloud. However, ESXi resolution will fail, resulting in the subsequent failures of cold and warm migrations.
Similarly, Add Cloud may fail if the cloud vCenter has been configured with FQDN, but the DNS used to resolve vCenter is incorrect or is not provided during the operation.
Resolution: Add rules in the network to forward the resolution of the specified vCenter or ESXi FQDNs to the correct DNS.
After creating the HDM SPBM policy, the VM storage policies should list praapa as a new caching component. However, the creation of a new VM storage policy, and its subsequent display in vCenter, may take time. As a result, praapa may not be displayed under Caching Component following the creation of the HDM SPBM Policy and I/O monitoring may not succeed for the VMs.
Resolution: In vCenter, select Administration, followed by System Configuration, Services, then Restart VMware vSphere Profile-Driven Storage Service. This will update the vCenter state and praapa will be visible.
During Enable Monitoring the HDM SPBM policy creation can fail.
Workaround: Manually create the SPBM policy with the name HDM Analyzer Profile. If the praapa policy is not shown, delete and rescan the praapa storage provider by selecting vCenter, followed by Configure.
VMs with IDE controllers do not allow the HDM SPBM policy to be attached in the powered on state.
Resolution: Power the VMs off, then attach the policy. (Ref: CP-4297)
This problem is specific to an Enhanced Linked setup. Consider configuring an Enhanced Linked setup using two vCenter servers, each containing ESXi hosts:
Assume that HDM is installed using vCenter vcenterserver01.dopio.com. The HDM SPBM policy HDM Analyzer Profile will be visible in vcenterserver02.dopio.com, but the policy attributes will not be available. Attempting to apply this policy from vcenterserver02.dopio.com to any VM may fail.
Workaround: Use the vCenter vcenterserver01.dopio.com to apply an SPBM policy to any VM.
Configuration of the SQS message bus can fail if incorrect tokens or non-existent queues are provided during configuration. Ensure that the queues provided are already created in the Amazon SQS service.
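Queue existence can be confirmed before configuring the message bus. The following is a minimal sketch that assumes the AWS CLI is installed and configured with the same credentials and region that will be supplied to HDM. The function name and the queue names in the usage comment are hypothetical.

```shell
# Hypothetical helper: print each queue name that cannot be looked up.
# "aws sqs get-queue-url" exits non-zero when the queue does not exist,
# and also when the credentials or region are wrong, so a reported name
# means "queue missing or tokens incorrect".
missing_queues() {
    for q in "$@"; do
        if ! aws sqs get-queue-url --queue-name "$q" >/dev/null 2>&1; then
            printf '%s\n' "$q"
        fi
    done
}

# Usage (queue names are placeholders):
#   missing_queues hdm-requests hdm-responses
```

No output means that every queue was found with the configured credentials.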
Re-deployment on-premises can fail during configuration of the HDM ESXi Manager. Services within the ESXi Manager must be configured to communicate with the I/O filter service within ESXi to service I/O requests. If the I/O filter service praapa is stopped or is in the process of restarting within ESXi, configuration will fail. Check the I/O filter status on the ESXi Configure page in vCenter and restart the service to retry the on-premises deployment.
This issue can occur when CA certificate details are provided for the key manager. Providing valid CA certificates for the key manager will resolve the issue. (Ref: CP-5408)
This issue can occur when the key manager details are not updated correctly. When editing the key manager, ensure that the operation is successful. If the operation fails, the key manager will be marked as unreachable. This will lead to the failure of try in cloud and warm migration. To ensure successful integration, be sure to use correct credentials and certificates to edit the key manager. Ensure that the key manager is reachable from the appliance at all times. (Ref: CP-5444)
HDM does not currently support UEFI BIOS. As a result, the prepare to migrate operation will fail for these VMs. Only use VMs with IBM PC BIOS for HDM migration.
The Cloud vCenter enables a few VMDK operations such as add/grow disks on the migrated VMs. HDM does not support these operations on the migrated VMs. While the operation may succeed, these VMs can’t be migrated back or re-migrated to the cloud. (Ref: CP-2595)
If the vCenter task for cold migration is cancelled by the user, the existing task gets cancelled. However, HDM will retry the cold migration until all retry attempts have been exhausted. To truly cancel the operation, cancel all retries. (Ref: CP-4365)
Prior to warm migration, HDM will check for free space on the target. However, because the actual storage required for thin disks can vary, the available space on the target cannot always be pre-validated. As a result, failure due to insufficient storage space during migration is still possible. To avoid this failure, ensure that the target has enough free space prior to attempting the migration. (Ref: CP-4301)
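Because thin disks can grow up to their provisioned size, one conservative way to size the target is to budget for the full provisioned size of every disk. The sketch below illustrates that arithmetic only. It is a rule of thumb, not the check that HDM performs, and the function name is hypothetical.

```shell
# Hypothetical helper: succeed only if the target's free space covers the
# full provisioned size of every disk (the worst case for thin disks).
# Arguments: free space in GB, then each disk's provisioned size in GB.
fits_on_target() {
    free_gb=$1
    shift
    total_gb=0
    for size_gb in "$@"; do
        total_gb=$((total_gb + size_gb))
    done
    [ "$total_gb" -le "$free_gb" ]
}

# Usage: 500 GB free, disks provisioned at 100 GB and 250 GB.
#   fits_on_target 500 100 250 && echo "enough space" || echo "free up space first"
```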
To warm migrate linked clones, attach the SPBM policy HDM Analyzer Profile to the base VM of the linked clone.
In an FQDN based deployment, HDM may not be able to resolve the cloud vCenter or ESXi, which can cause the warm migration to fail. During HDM installation, the DNS entry should have been configured to resolve the FQDN. If this is missing, manually add the DNS nameserver to the HDM cloud cache component, using the following procedure:
Add the DNS nameserver to /etc/resolv.conf to resolve the FQDN. (Ref: CP-4330)
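The nameserver change described above can be made idempotent so that repeating it does not duplicate the entry. The following is a minimal sketch with a hypothetical function name and placeholder addresses. It assumes a standard resolv.conf layout, and that nothing on the component (such as a DHCP client) regenerates the file afterwards.

```shell
# Hypothetical helper: append "nameserver <ip>" to a resolv.conf-style
# file unless that exact line is already present.
add_nameserver() {
    file=$1
    ip=$2
    if ! grep -qxF "nameserver $ip" "$file" 2>/dev/null; then
        printf 'nameserver %s\n' "$ip" >> "$file"
    fi
}

# Usage on the HDM cloud cache component (values are placeholders):
#   add_nameserver /etc/resolv.conf 192.0.2.53
#   getent hosts vcenter.example.com   # confirm that the FQDN now resolves
```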
During the add cloud operation, specify the correct default application network for the cloud, or map the on-premises network to the cloud network. If this is not done, the VM migration may succeed but applications on the migrated VMs may still fail. In an SQS based migration, network mapping can be specified during migration. (Ref: CP-4433)
Certain VM parameters will not be retained post migration and will need to be enabled manually. (Ref: CP-2859)
After migrating back, moving the VM to its original resource pool can sometimes fail in vCenter 6.7. If this happens, use vCenter to move the VM from the HDM_MIGRATE_POOL to its original resource pool. (Ref: CP-4652)
Once the deployment has completed, vCenter may not show an option to migrate the VM. Alternatively, it may just display loading....
Resolution: Log out of vCenter and log back in, then attempt the migration again. If the issue persists, restart the vsphere-ui service from the vCenter SSH console (service-control --stop vsphere-ui; service-control --start vsphere-ui).
VMs that are migrated back are first powered off in the cloud. For Ubuntu 16.04 and RHEL 7.4, this operation can require an extended period of time. I/O errors can also be seen in the guest VMs during this process.
This is due to an issue in a specific version of the Linux kernel, where the iSCSI connection is closed too early during the shutdown process. This creates a backlog of ongoing I/Os, which stalls the shutdown of the guest VM. HDM will perform multiple retries and will ultimately hard power off these VMs during migration back. This issue is discussed further in a public forum:
https://bugzilla.redhat.com/show_bug.cgi?id=1164756
Migration requests initiated through SQS can fail if the appliance lacks Internet access. Ensure that the appliance has Internet access while using SQS for migration.
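Outbound access to SQS can be checked from the appliance, or from a host on the same network, before starting a migration. The following is a minimal sketch: the function names are hypothetical, the region is a placeholder, and the check only confirms that an HTTPS connection to the regional SQS endpoint can be completed. It does not validate credentials or queues.

```shell
# Hypothetical helpers; the names are illustrative and not part of HDM.

# Regional SQS endpoints follow the pattern sqs.<region>.amazonaws.com.
sqs_endpoint() {
    printf 'https://sqs.%s.amazonaws.com\n' "$1"
}

# Succeed if an HTTPS exchange with the endpoint completes within
# 10 seconds. Any HTTP status counts, because only reachability is tested.
can_reach_sqs() {
    curl -s -o /dev/null --max-time 10 "$(sqs_endpoint "$1")"
}

# Usage (the region is a placeholder):
#   can_reach_sqs us-east-1 && echo "SQS reachable" || echo "no route to SQS"
```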
During warm migration, VMs in the cloud that are managed by HDM are identified by the postfix ‘ARM’. This postfix is later removed once the data has been transferred and changes have been synced to the cloud VM. The renaming of a VM can fail if an existing VM in the cloud has the same name.
This may happen if the wizard is still open when the operation completes. Sometimes the status on the wizard is not updated in a timely manner. For accurate status of the operation, refer to vCenter tasks. (Ref: CP-5622)
A vCenter task is created for each VM migration. The progress and status of the migration is then updated in the vCenter task object, as well as in the Migrations In-Progress tab on the HDM vCenter plugin. Once the task is complete, a vSphere server will delete the reference from the vSphere database. As a result, rather than displaying the correct migration status in the user interface, it will be listed as ERROR. The correct status can be seen in the vCenter task list for the VM.
If the Internal cloud network is configured for DHCP for HDM cloud deployments, IP addresses assigned to migrated VMs (warm or TBC) may experience lease timeouts. This is because some Linux distributions (mainly Ubuntu and SLES) do not renew DHCP leases for the IBFT enabled NIC that is connected to the Internal network for TFTP/iSCSI booting.
No versions of Windows or RHEL/CentOS distributions are affected by this issue.
This issue will also not be seen if the internal cloud network is configured for Static IP addresses.
Workaround:
(Ref: DP-2777)
Failures for migrate, migrate back, on-premises deployment, and add cloud are shown in vCenter under Tasks and Events. Any failure attributed to the HDM component going down will also be captured within vCenter events with a description that begins with com.primaryio.hdm. While the failures can be seen in the HTML and Flash views of vCenter, the failure details are not always available in both views (e.g., in vCenter version 6.5 they are only available in the Flash view).
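If vCenter events have been exported to a text file, the HDM entries can be isolated by their identifier. The following is a minimal sketch that assumes a plain-text export with one event per line. The export format and the function name are assumptions.

```shell
# Hypothetical helper: print only the lines of an exported events file
# that mention the HDM event identifier.
hdm_events() {
    grep -F 'com.primaryio.hdm' "$1"
}

# Usage (the file name is a placeholder):
#   hdm_events exported-events.txt
```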
If an HDM cloud component VM remains shutdown for an extended period of time, the health of the failed HDM components may not be reflected in the PrimaryIO appliance. However, the failed component will be detected and alerts can be seen by selecting Home, followed by HDM, then Dashboard. Once the component VM is successfully rebooted, the appliance will correctly reflect the health of all components. (Ref: CP-4647)
Sometimes HDM operations fail and vCenter delivers the error message Request exceeded the limit set.
Resolution: The VAPI endpoint service may need to be restarted. To do this, log into vCenter using Flash and select Home, followed by Administration, Deployment, System Configuration, then Services. Right click on the VAPI endpoint service and select Restart. (Ref: DP-2700)
This can happen when HDM is in the process of repairing a failure, or while HDM is engaged in a recovery attempt. It can happen when HDM is deployed but there is no activity on the system, or when there is migration activity on the system.
Wait for the system repair to finish. If it appears that HDM recovery has stalled, reset HDM and proceed with the redeployment with a clean state.
If a failure causes migrations to not complete, the HDM recovery process will migrate back the affected VMs that were previously migrated, or are in the process of migrating. These VMs can be found in the on-premises HDM recovery pool.
In the event of an HDM component failure, migrated VMs can be migrated back. As part of HDM recovery, the failed component will be repaired and only the affected VMs will be migrated back and appear in _RECOVERY_POOLSUCCESS. However, HDM cloud cache is a critical component and its failure would affect all migrated VMs. As a result, all VMs will be migrated back. (Ref: DP-2739)
The HDM appliance can hang if the datastore where it resides runs out of space. If this occurs, power off the HDM appliance VM, free up enough space in the datastore, then power the appliance back on. Please note that a full datastore can cause other issues and this resolution may not always work. In this case, an HDM reset may be necessary, or HDM can be re-deployed with adequate storage.
In the event of a network failure, HDM can migrate back previously migrated VMs. In some cases, these VMs can be found with additional NICs beyond what they were originally configured with. HDM adds these NICs during the migration operation and typically cleans them up automatically. However, in some failure scenarios when automatic cleanup has failed, these NICs will need to be manually cleaned up. (Ref: CP-4841)
Sometimes VMs are migrated back to the on-premises environment following a temporary migration to the cloud (e.g., the TBC use case), or because of a failure during the migration. Network availability is critical during the entire migration operation, as well as for the continuity of HDM component services. The disruption in network connectivity is tolerated for a limited time (typically up to 4-5 minutes), after which HDM will enter into failure and recovery mode. A recovery operation can only succeed when the network is available. If an HDM reset is required, it can only succeed when network connectivity is available for cleaning up state in the cloud.
HDM operations are designed for retries and resilience for somewhat jittery networks. However, if connectivity is lost for more than 4-5 minutes continuously, failures will be triggered. Then, when the network becomes available, the process of recovery and repair will be initiated.
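To see whether an outage is approaching the tolerated window, connectivity can be watched from the on-premises side. The following is a minimal sketch: the function name, host name, and polling interval are hypothetical, and 240 seconds is used as the lower bound of the 4-5 minute window mentioned above. It only reports outage duration and does not interact with HDM.

```shell
# Hypothetical helper: given the epoch second at which connectivity was
# last confirmed, the current epoch second, and a limit in seconds,
# succeed if the outage has lasted longer than the limit.
outage_exceeds() {
    [ $(( $2 - $1 )) -gt "$3" ]
}

# Sketch of a watch loop (host and interval are placeholders):
#   last_ok=$(date +%s)
#   while sleep 15; do
#       if ping -c 1 cloud-gateway.example.com >/dev/null 2>&1; then
#           last_ok=$(date +%s)
#       elif outage_exceeds "$last_ok" "$(date +%s)" 240; then
#           echo "connectivity lost for over 4 minutes; HDM recovery is likely"
#       fi
#   done
```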
Because uninstallation happens at the cluster level, it will fail if any host in the cluster becomes unavailable.
Resolution: Resolve the connection issue with the unreachable ESXi host, then retry the uninstallation process.
Uninstallation happens at the cluster level, so it will fail if a VIB or HDM component removal cannot be performed on any host in the cluster.
Resolution: Resolve the connectivity issue with the unreachable ESXi host, then retry the uninstallation process.
During HDM undeployment, the ‘praapa’ VIB can fail to uninstall on some ESXi instances in the cluster.
Resolution: Restart the iofilter daemon on ESXi (log into the ESXi host over SSH and run the command /etc/init.d/iofilterd-praapa restart), then uninstall HDM. (Ref: DP-2578)
The tags are created to set affinity rules on HDM component hosts managed by the cloud HDM vCenter. These rules are only used by PrimaryIO HDM and do not affect other VMs in the cloud vCenter. Currently, these are not deleted as part of the HDM uninstallation process.
Resolution: Follow these steps to clean HDMDRS* tags from the cloud vCenter:
Uninstallation of the HDM filter requires the cluster hosts to be in maintenance mode, one host at a time. If a host fails to enter maintenance mode for any reason, the HDM filter uninstall will fail.
Resolution: Manually put the host into maintenance mode and retry the uninstall operation. This may require a vMotion on the active VMs to another host in the cluster. If the cluster only has one host, the active VMs may be required to be powered off. Please note that the PrimaryIO appliance host should not be powered off prior to initiating the uninstall.
A task to uninstall the HDM filter from the appliance executes successfully, but the HDM icon on vCenter still shows the HDM version, and the ‘praapa’ service remains on the ESXi host.
Resolution: This issue is seen when the uninstall is attempted on a cluster where the ESXi hosts cannot be put into maintenance mode, either because DRS has not been configured or there is an insufficient number of ESXi instances for vMotion. This causes the uninstall request to be registered, but not executed. Follow these steps to complete the uninstallation:
An entry for the HDM plugin continues to be displayed in the Administration / Client Plug-Ins listing, or in the Menu, even after it has been unregistered from the appliance.
This is a vCenter listing issue and does not affect vCenter functionality.
Issues: The new plugin may not load properly if the same or a lower version of the plugin is registered. However, a higher version of the plugin will always work.
Resolution: Complete the following steps:
find / -name '*piohyc*' -exec rm -rf {} +; service-control --stop vsphere-ui; service-control --start vsphere-ui
During uninstall, disable monitoring on some VMs can fail in certain situations.
Resolution: The workaround for this is to power the VMs off, then disable monitoring. (Ref: CP-4431)
Do not perform cloud undeployment while migrations are in progress. While the UI does not prevent this action, the undeployment can fail and cleanup of the migrating VMs may not happen. (Ref: CP-4373)
The HDM on-premises uninstall can fail if there are any VMs that have snapshots that were created by HDM (named pio_*). This usually applies to VMs that were migrated back as part of failure recovery, and reside in HDM_RECOVERY_SUCCESS.
Resolution: The migrate time snapshot (pio_*) should be explicitly deleted prior to retrying the undeploy. (Ref: CP-3749)
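When a VM has many snapshots, the HDM-created ones can be picked out by their pio_ prefix. The following is a minimal sketch that assumes the snapshot names have already been listed one per line by whatever tool is in use. The function name is hypothetical, and the sketch only filters names. Deletion is still performed in vCenter.

```shell
# Hypothetical helper: read snapshot names on standard input, one per
# line, and print only those that start with the pio_ prefix.
hdm_snapshots() {
    grep '^pio_'
}

# Usage (snapshot-names.txt is a placeholder for a list of names):
#   hdm_snapshots < snapshot-names.txt
```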
If the I/O filter service ‘praapa’ is stopped or is restarting within an ESXi, detaching the SPBM policy on the VMs can fail, because the service is unavailable and policy state cannot be cleaned. Check service status on the ESXi configure page and retry the detach operation.
If there is a network disconnect between the appliance and the cloud, HDM cloud components and migrated VMs will not be deleted. Even if the HDM reset task on vCenter lists the operation as completed, HDM components and migrated VMs may still reside in the cloud, leaving the system in an unclean state.
Resolution: Manually delete the HDM components and migrated VMs to complete the cleanup process and make it ready for a new deployment.
As part of the HDM reset, the PrimaryIO appliance must be restarted. In some cases, this restart may not succeed, or the appliance may not be assigned an IP address.
Resolution: This is probably a transient error, so restart the appliance to enable the HDM reset to successfully complete. (Ref: CP-4610)
During HDM reset, if any of the ESXi hosts in the on-premises cluster are rebooted, the cleanup for that host may not happen. While the HDM reset may still succeed, any future attempts for an on-premises deployment may fail.
Resolution: Retry the HDM reset, then attempt to redeploy. (Ref: CP-4648)
After the appliance has been rebooted, the HDM reset task remains queued. This may happen if there were transient issues while configuring network adapters within the appliance following the reboot.
Resolution: Reboot the appliance again. (Ref: CP-4836)
In clusters that do not have DRS or HA enabled, when the appliance fails on an ESXi host, the vApp option properties are reset on the VM.
Resolution: Perform the following steps:
(Ref: CP-4848)