This section covers troubleshooting for known issues in the HDM product. If an issue is not covered in this document or cannot be resolved easily, users should contact PrimaryIO support for further help.
Failures during HDM deployment, on-premises or on-cloud, require the user to first clean up the state using HDM reset and then retry the deployment. In the current release, HDM does not attempt to automatically recover from deployment failures. (Ref: CP-4686)
During deployment, synchronization with the NTP server is required. This operation may take time, and during this period the deployment may appear stuck or to be taking longer than expected. No user action is required; after the delay, the deployment will continue as usual.
(Ref: CP-4419)
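If it is unclear whether a delay is due to NTP synchronization, the time sync status can be checked from a shell on the appliance. This is a minimal check, assuming a systemd-based Linux appliance:
`timedatectl status`
The output shows whether the system clock is synchronized; once it is, the deployment should proceed.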
This failure can occur due to either incorrect credentials or an incorrect Cloud Director Organization name. Providing the correct credentials or organization name should resolve the issue. (Ref: CP-5395)
If there are insufficient CPU or memory resources on-premises or on-cloud, the HDM deployment can fail with the vCenter event “Insufficient resources to satisfy configured failover level for vSphere HA”.
To avoid this, choose an HDM deployment type that matches the resource availability on-premises and on-cloud, or change the Deployment Type. (Ref: CP-4243)
If the network selected during PIO Appliance deployment is incorrect and the vCenter has not been added yet, the user can change the network using the following procedure:
If the add vCenter operation has already been performed, the network cannot be changed on the deployed Appliance; the user needs to redeploy the PIO Appliance.
HDM does not support IPv6-configured VMware environments.
The PIO Appliance password was changed and the user has forgotten the new password.
Resolution: Contact PrimaryIO Support to change the password.
Resolution: The user will have to be assigned administrator privileges for the vCenter that needs to be added. Follow the steps described below to assign administrator privileges to the user:
This user will now be able to add the vCenter to the PIO Appliance.
A vCenter entry in the PIO Appliance can show an ‘ERROR’ state if its credentials have been changed externally in the vCenter after adding it to the PIO Appliance. This issue can also occur if the vCenter is no longer reachable.
Resolution:
HDM plugin fails to appear in vCenter’s UI even though it has been registered with the vCenter.
Resolution:
Unregister the HDM plugin from vCenter.
Search for and delete the ‘PrimaryIO’ or ‘praapa’ files from vCenter. To find and delete the files and restart the vsphere-ui service, execute the following command:
`find / -name piohyc -exec rm -rf {} +; service-control --stop vsphere-ui; service-control --start vsphere-ui`
Restart vCenter.
Register the HDM plugin again with the same vCenter from which it was deleted.
The on-cloud Cloud Director may be configured on different networks and resolved through different DNS settings. In this case, the on-cloud Cloud Director will get added if the correct DNS is set during Add Cloud, but ESXi name resolution will fail, resulting in cold and warm migration failures.
Similarly, “Add Cloud” may fail if the on-cloud Cloud Director has been configured with an FQDN and the DNS to resolve the vCenter is incorrect or not provided during the operation.
Resolution: Customers should add rules in their network to forward resolution of these FQDNs to the correct DNS, whether for the vCenter or the ESXi hosts.
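To verify that an FQDN resolves correctly through the intended DNS server, a quick check can be run from the PIO Appliance or any host on the same network; the FQDN and DNS server address below are placeholders:
`nslookup vcd.example.com 10.0.0.53`
If the lookup fails or returns an unexpected address, the forwarding rules need to be corrected before retrying Add Cloud or the migration.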
This can happen if the Cloud Director cloud endpoint is not reachable. The current operation being performed (deployment, migration) will fail. Connectivity to the cloud endpoint needs to be re-established before continuing. (Ref: CP-5596)
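Reachability of the Cloud Director endpoint can be checked with a simple HTTPS request before retrying the operation; this is a minimal sketch, and the endpoint URL is a placeholder:
`curl -k -s -o /dev/null -w "%{http_code}\n" https://vcd.example.com/api/versions`
A 200 response indicates the endpoint is reachable; a timeout or connection error means connectivity still needs to be restored.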
This issue can occur when invalid CA certificate details are provided for the Key Manager. Providing valid CA certificates for the Key Manager will resolve the issue. (Ref: CP-5408)
The following are known limitations with virtual machine disk controller configurations for migration to VMware Cloud Director. HDM does not support migration of:
This can happen if the CPU resources are exhausted for the Organization VDC into which the migration happened. Update the CPU resources and try powering on the VM again. (Ref: CP-5469)
This is a known issue with these operating systems. IP addresses will need to be allocated manually. (Ref: CP-5626)
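As an illustrative example only, a static address could be assigned on a NetworkManager-based Linux guest as follows; the connection name, address, and gateway are placeholders, and the exact method depends on the guest operating system:
`nmcli con mod ens192 ipv4.addresses 192.168.1.50/24 ipv4.gateway 192.168.1.1 ipv4.method manual && nmcli con up ens192`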
If the vCenter task for cold migration is cancelled by the user, the existing task gets cancelled. However, HDM re-attempts the cold migration until all the retry attempts are exhausted. The user should cancel each of the re-attempts in order to truly cancel the operation. (Ref: CP-4365)
Even though HDM checks free space on the target of a warm migration, the available space on the target cannot always be pre-validated because the actual storage required for thin-provisioned disks can vary. This may lead to failure during migration because of insufficient storage space.
The user should explicitly make sure that the target has enough free space before attempting to migrate.
(Ref: CP-4301)
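One way to check free space on the target datastore before migrating is with the open-source govc CLI, if it is available; this is a minimal sketch, and the vCenter URL, credentials, and datastore name are placeholders:
`GOVC_URL='vcenter.example.com' GOVC_USERNAME='administrator@vsphere.local' GOVC_PASSWORD='<password>' GOVC_INSECURE=true govc datastore.info target-datastore`
The output includes capacity and free space; the free space should comfortably exceed the provisioned size of the disks being migrated.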
During the Add Cloud operation, users should specify the correct default application network on-cloud or map the on-premises network to the on-cloud network. If this is not done, the VM migration may succeed but applications on the migrated VMs may fail. In SQS-based migration, the network mapping can be specified at migration time. (Ref: CP-4433)
Certain virtual machine parameters may not be retained post migration. These will need to be set manually. (Ref: DP-2859)
After the deployment is complete, users may not find any action available to migrate a VM in vCenter, or the UI may just show ‘loading...’.
Resolution: Log out of vCenter, log back in, and retry the migration. If the migration option is still not available, restart the vsphere-ui service from the vCenter SSH console: `service-control --stop vsphere-ui; service-control --start vsphere-ui`
For each virtual machine migration, a vCenter task is created. The progress and status of the migration are then updated in the vCenter task object. The same status is also reflected on the Migration In-Progress tab of the HDM vCenter plugin in the global and cluster views. Once the task has been completed, the vSphere server may delete the reference from the vSphere database. In this case, because the task object has been removed, the correct migration status is not reflected in the UI and the migration state shows as ERROR. Users can check the task status under the vCenter task list for the virtual machine to get the correct status.
In an HDM on-cloud deployment, if the internal network on-cloud is configured for DHCP IP addresses, the IP addresses assigned via DHCP to migrated VMs (WARM or TBC) can experience lease timeout. This is because, on some Linux distributions (mainly all Ubuntu and SLES distributions), the DHCP lease is not renewed for the iBFT-enabled NIC that is connected to the internal network for TFTP/iSCSI booting.
Windows and RHEL/CentOS distributions (all versions) are not affected by this issue.
If the internal network on-cloud is configured with a static IP address pool, this issue will not arise.
Workaround:
(Ref: DP-2777)
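The documented workaround steps are not reproduced here. As an illustrative example only, and not necessarily the supported workaround, the lease on the affected interface of a Linux guest could be renewed manually; the interface name is a placeholder:
`sudo dhclient -r ens192 && sudo dhclient ens192`
Whether manual renewal is appropriate depends on the guest distribution and on how the iBFT NIC is managed.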
Failures for migration/migration back, on-premises deployment, and add on-cloud are shown in vCenter under Tasks and Events. Any failure caused by an HDM component going down is also captured in vCenter events, with the description starting with ‘com.primaryio.hdm.’.
This is currently shown in both the HTML and Flash views of vCenter. However, the details of the failures are not always available in both views; for example, for vCenter version 6.5 they are available in the Adobe Flash view. Such information is also available in the Health tab of the HDM plugin.
If an HDM on-cloud component VM remains shut down for an extended period, the failed HDM component's health may not get reflected in the PIO Appliance.
However, the failed component will be detected, and alerts can be seen under Home -> HDM -> Administration -> Health -> Component Health. Once the component VM is rebooted successfully, the PIO Appliance will correctly show the health of all components. (Ref: CP-4647)
This can sometimes happen when there is a failure that HDM is repairing, for example a component failure, and the repair is in progress but not yet complete. Users might notice a hung system while HDM is still attempting and retrying recovery. It can happen when HDM is deployed but there is no activity on the system, or when there is migration activity on the system.
Users can wait for some time to allow the system repair to complete. If HDM recovery appears stalled for some reason, it is recommended that the user execute HDM reset and proceed with redeployment from a clean state.
A user starts a few migrations, or has migrations already in progress, and all or some of them fail to complete due to a failure. A failure in Cloud_Msg_GW could cause this. An HDM reset and cancellation of the tasks in vCenter are required to resolve this.
The HDM Appliance can hang if the datastore on which it resides has no space left.
Users should power off the HDM Appliance VM, create enough space in the datastore, and power on the Appliance again. Note that a full datastore can cause other issues, and this resolution may not always work; the user may need to do an HDM reset or redeploy HDM with appropriate storage.
HDM is deployed both on-premises and on-cloud. VMs migrate from on-premises to on-cloud, and on occasion VMs are migrated back to on-premises after a temporary migration to the cloud (for example, for the TBC use case) or because of a failure during the migration. Network availability is critical during the entire migration operation and for continuity of HDM component services.
Disruption in network connectivity is therefore tolerated only for a limited time, typically up to 4-5 minutes, beyond which HDM goes into failure and recovery mode. The recovery operation will also succeed only when the network is available. If the system gets into a state where HDM reset needs to be executed, the reset can succeed only when network connectivity is available for cleaning up the state on-cloud.
HDM operations are designed with retries and resilience for somewhat jittery networks. More than 4-5 minutes of continuous disconnect triggers failures and, once the network becomes available again, the eventual process of recovery and repair.
Uninstallation happens at the cluster level and it will fail if any host in the cluster is unavailable.
Resolution: System administrators will have to first resolve the connection issue with the unreachable ESXi host and then retry the uninstallation process.
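Before retrying the uninstallation, basic reachability of the affected ESXi host can be verified from the vCenter or PIO Appliance shell; the host name below is a placeholder:
`ping -c 3 esxi01.example.com`
The host should also show as Connected in the vCenter inventory before the uninstallation is retried.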
An entry for the HDM plugin continues to be displayed in the Administration -> Client Plug-Ins listing or in the Menu even after it has been unregistered from the appliance.
This is a vCenter listing issue and does not affect the vCenter functionality.
Issue: The new plugin may not load properly if the same or a lower plugin version is registered; a higher plugin version will always work.
Resolution: Perform the following steps:
The following command will find and delete all stale entries for HDM and restart the vsphere-ui service:
`find / -name '*piohyc*' -exec rm -rf {} +; service-control --stop vsphere-ui; service-control --start vsphere-ui`
Do not perform on-cloud undeployment while migrations are in progress. The UI does not prevent this action; however, the undeployment can fail and cleanup of the migrating VMs may not happen.
(Ref: CP-4373)
If there is a network disconnect between the appliance and the cloud, HDM components on-cloud and migrated VMs will not get deleted. Even though the HDM reset task in vCenter shows as completed, HDM components as well as migrated VMs might still be present on-cloud, leaving the system in an unclean state.
Resolution: Manually delete the HDM components and migrated VMs to complete the cleanup and bring the system to a state ready for a new deployment.
As part of HDM reset, a restart of the PIO Appliance VM is required. In some cases, this restart may not succeed or the PIO Appliance may not get an IP address.
Resolution: This could be a transient error; the user should restart the PIO Appliance again so that the HDM reset can complete successfully. (Ref: CP-4610)
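Whether the appliance received an IP address can be confirmed from its VM console, assuming a Linux-based appliance shell is available:
`ip addr show`
If no address is listed on the expected adapter after the reboot, retry the restart as described above.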
During HDM reset, if any of the ESXi hosts in the on-premises cluster is rebooted for some reason, the cleanup for that host may not happen. The HDM reset may still succeed; however, a future attempt at on-premises deployment may fail.
Resolution: In such a situation, users should do HDM reset again and perform redeployment.
(Ref: CP-4648)
After the appliance reboot, the HDM reset task may stay in the queued state. This can happen if there were transient issues while configuring the network adapters within the PIO Appliance after the reboot.
Resolution: In such a situation, users should reboot the appliance again. (Ref: CP-4836)
In clusters without DRS or HA enabled, when the ESXi host on which the PIO Appliance resides fails, the vApp option properties set on the virtual machine get reset.
Resolution: Perform the following steps:
(Ref: CP-4848)