service vmware-aam restart
service vmware-aam stop
service vmware-aam start
There are three types of admission control policy available.
Isolation of ESXi hosts is validated on the basis of heartbeats. The timeline for declaring isolation differs between slave and master ESXi hosts. Here we discuss the isolation of a slave ESXi host.
HA triggers a master election process before it declares a slave ESXi host isolated. In this timeline, “s” refers to seconds:
When an ESXi host is isolated, the value in the “power on” file is raised to 1. HA reads this file and confirms that the ESXi host has been isolated. There is one “power on” file per ESXi host, and it lists all the VMs that are currently powered on on that host.
By default, the HA slot size is determined by the highest CPU reservation and the highest memory reservation among the virtual machines. If no reservation is specified at the VM level, a default slot size of 256 MHz for CPU and 0 MB plus memory overhead for RAM is used. We can control the HA slot size manually by using the following values.
There are 4 options related to slot size that we can configure in the HA advanced options:
HA normally monitors ESX hosts and, in case of host failure or isolation, restarts the virtual machines from the failed host on other hosts in the cluster. If HA should also monitor for virtual machine failures, use the VM Monitoring feature, which is part of the HA settings. VM Monitoring restarts a virtual machine if its VMware Tools heartbeat is not received within the time specified by the monitoring sensitivity.
When an ESXi host fails, the VMs that were running on it are restarted on the remaining nodes in the cluster. But how does HA know which VMs were running on the host before it failed? The answer is:
HA relies on 2 files, namely “power on” and “Protected list”. The “power on” file is maintained by each ESXi host individually and contains entries for the VMs that are currently running on that host. The “Protected list” file is maintained at the datastore level and tells HA which VMs were protected before the failure. On the basis of the contents of these 2 files, HA decides which VMs to restart.
When a VM is powered off manually, its entry is removed from the “Protected list” file so that HA does not accidentally restart that VM as well.
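The decision described above can be pictured as a set intersection. This is only a minimal sketch: the real files are not plain Python lists, and the function name and VM names below are made up for illustration.

```python
def vms_to_restart(failed_host_poweron, protected_list):
    """Return the VMs HA would restart: those recorded as powered on
    on the failed host that are also still in the protected list."""
    return sorted(set(failed_host_poweron) & set(protected_list))


# Hypothetical contents of the two files at the time of the failure.
poweron_on_failed_host = ["vm-app01", "vm-db01", "vm-test01"]
# vm-test01 is not protected (for example, it was powered off manually).
protected = ["vm-app01", "vm-db01", "vm-web01"]

restart = vms_to_restart(poweron_on_failed_host, protected)
print(restart)
```

vm-web01 is protected but was running on a different host, so it is not in the result either.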
I have written a post about how the HA slots are calculated.
No, an election will not happen even if the newly introduced ESXi host has visibility to more datastores than the master ESXi host. But if you reconfigure HA on the cluster, the newly added ESXi host will become master because it is connected to a greater number of datastores.
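A small model of this behaviour, as a sketch only. The `Cluster` class, the host names, and the tie-break on host name are assumptions made for illustration; they are not how the HA agent is implemented.

```python
def elect_master(hosts):
    """Pick the host connected to the most datastores.
    Ties are broken by host name (an assumption for this sketch)."""
    return max(hosts, key=lambda name: (hosts[name], name))


class Cluster:
    def __init__(self, hosts):
        self.hosts = dict(hosts)          # host name -> datastore count
        self.master = elect_master(self.hosts)

    def add_host(self, name, datastores):
        # Adding a host does not trigger an election.
        self.hosts[name] = datastores

    def reconfigure_ha(self):
        # Reconfiguring HA runs a fresh election.
        self.master = elect_master(self.hosts)


cluster = Cluster({"esx01": 4, "esx02": 3, "esx03": 3})
cluster.add_host("esx04", 6)
master_after_add = cluster.master
cluster.reconfigure_ha()
master_after_reconfigure = cluster.master
```

After `add_host` the master is unchanged; only `reconfigure_ha` lets the better-connected host win.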
VMware HA has a mechanism to detect that a host is isolated from the rest of the hosts in the cluster. When an ESX host loses its ability to exchange heartbeats with the other hosts in the HA cluster over the management network, that ESX host is considered isolated.
The host will not receive heartbeats, and its ping to the isolation address will also fail. The host will therefore consider itself isolated, and HA will initiate the restart of the virtual machines from that host on other hosts in the cluster. You do not want this situation during a scheduled maintenance window.
To avoid the above situation when performing a scheduled activity that may cause an ESX host to become isolated, clear the “Enable Host Monitoring” check box until you are done with the network maintenance activity.
If a virtual machine needs to be restarted by HA while it is being Storage vMotioned and the virtual machine fails, the restart process does not start until vCenter informs the master that the Storage vMotion task has completed or has been rolled back.
As per “VMware Availability Guide”,
vCenter Server uses admission control to ensure that sufficient resources are available in a cluster to provide failover protection and to ensure that virtual machine resource reservations are respected.
Let’s take an example: you are performing network maintenance on the switches that connect one of the ESX hosts in your HA cluster.
As per VMware’s Definition,
“A slot is a logical representation of the memory and CPU resources that satisfy the requirements for any powered-on virtual machine in the cluster.”
If you have configured reservations at the VM level, they influence the HA slot calculation. The highest memory reservation and the highest CPU reservation among the VMs in your cluster determine the slot size for the cluster.
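The slot size rule can be sketched as follows. The 256 MHz CPU default and the "0 MB plus memory overhead" memory default are the values stated in this post; the function name, the tuple layout, and the sample reservations are made up for illustration.

```python
DEFAULT_CPU_MHZ = 256  # default CPU slot size when no reservation is set


def slot_size(vms):
    """vms: list of (cpu_reservation_mhz, mem_reservation_mb, mem_overhead_mb).
    Returns (cpu_slot_mhz, mem_slot_mb)."""
    # A VM without a CPU reservation counts as the default value.
    cpu_slot = max(cpu if cpu > 0 else DEFAULT_CPU_MHZ for cpu, _, _ in vms)
    # Memory reservation (0 MB if unset) plus the VM's memory overhead.
    mem_slot = max(mem + overhead for _, mem, overhead in vms)
    return cpu_slot, mem_slot


with_reservations = slot_size([(0, 0, 90), (500, 1024, 120), (1000, 512, 100)])
without_reservations = slot_size([(0, 0, 90), (0, 0, 110)])
```

In the first call a single VM with a 1000 MHz reservation and another with 1024 MB reserved set the slot size for the whole cluster, which is why one large reservation can sharply reduce the number of available slots.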
A master is elected by a set of HA agents whenever the agents are not in network contact with a master. A master election thus occurs when HA is first enabled on a cluster and when the host on which the master is running:
Note: Removing a slave ESXi host from a cluster has no effect on the election process, i.e. if a slave ESXi host is removed, shut down, or put into maintenance mode, no election happens.
Maximum number of hosts in the HA cluster is 32.
By default, VMware HA pings the default gateway as the isolation address when it stops receiving heartbeats. We can add additional values if we are using redundant service consoles that belong to different subnets. For example, we can add the default gateway of SC1 as the first value and the gateway of SC2 as the additional one using the below value.
Power off – All VMs on the host are powered off when HA detects network isolation.
Shut down – All VMs running on the host are shut down with the help of VMware Tools when HA detects network isolation. If the shutdown via VMware Tools does not complete within 5 minutes, a power off operation is executed. This behavior can be changed with the help of HA advanced options. Please refer to my post on HA advanced configuration.
Leave powered on – The VMs remain powered on, i.e. their state is left unchanged, when HA detects network isolation.
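The three isolation responses can be summarized in a short sketch. The enum, the function name, and the action strings are invented for illustration; the 300-second limit corresponds to the 5 minutes mentioned above.

```python
from enum import Enum


class IsolationResponse(Enum):
    POWER_OFF = "power off"
    SHUT_DOWN = "shut down"
    LEAVE_POWERED_ON = "leave powered on"


SHUTDOWN_TIMEOUT_S = 300  # the 5-minute limit described above


def isolation_actions(response, guest_shutdown_s=None):
    """Return the ordered actions taken on a VM once isolation is detected.
    guest_shutdown_s: seconds the guest needs to shut down,
    or None if the guest shutdown never completes."""
    if response is IsolationResponse.LEAVE_POWERED_ON:
        return []
    if response is IsolationResponse.POWER_OFF:
        return ["power off"]
    actions = ["guest shutdown via VMware Tools"]
    if guest_shutdown_s is None or guest_shutdown_s > SHUTDOWN_TIMEOUT_S:
        # Graceful shutdown did not finish in time, fall back to power off.
        actions.append("power off")
    return actions
```

For example, `isolation_actions(IsolationResponse.SHUT_DOWN, guest_shutdown_s=400)` returns both the guest shutdown and the fallback power off.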
No, removing a slave ESXi host from the cluster has no impact on the master. No election happens in this case.
A master must be present in the cluster for VMs to be restarted. When an election is happening in a cluster, it takes 15 seconds to complete. If a slave ESXi host also fails during that time, the restart of its VMs has to wait until the election process is completed.
The newly elected master first reads the “Protected List” file to find the VMs whose power state has changed. After reading that file, it determines which VMs failed during the election and then restarts those VMs.
Select the maximum number of host failures for which you want to guarantee failover. Prior to vSphere 4.1, the minimum is 1 and the maximum is 4.
In the “Host failures cluster tolerates” admission control policy, we define the number of hosts that can fail in the cluster, and HA ensures that sufficient resources remain to fail over all the virtual machines from those failed hosts to the other hosts in the cluster. VMware High Availability (HA) uses a mechanism called slots to calculate both the available and the required resources in the cluster for failing over virtual machines from a failed host to other hosts in the cluster.
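A rough sketch of the slot-based check, assuming the slot size is already known and that the worst case is the failure of the hosts holding the most slots. The function names and the sample host sizes are illustrative only.

```python
def slots_per_host(host_cpu_mhz, host_mem_mb, slot_cpu_mhz, slot_mem_mb):
    """A host holds as many slots as its scarcer resource allows."""
    return min(host_cpu_mhz // slot_cpu_mhz, host_mem_mb // slot_mem_mb)


def can_tolerate(hosts, slot, powered_on_vms, host_failures):
    """hosts: list of (cpu_mhz, mem_mb); slot: (cpu_mhz, mem_mb).
    True if the slots left after losing the largest hosts still
    cover every powered-on VM (one slot per VM)."""
    per_host = sorted(slots_per_host(cpu, mem, *slot) for cpu, mem in hosts)
    surviving = per_host[:len(per_host) - host_failures]
    return sum(surviving) >= powered_on_vms


hosts = [(9000, 9216), (9000, 6144), (6000, 6144)]   # hypothetical hosts
slot = (2000, 2048)                                  # 2 GHz / 2 GB slot

tolerates_one = can_tolerate(hosts, slot, powered_on_vms=5, host_failures=1)
tolerates_two = can_tolerate(hosts, slot, powered_on_vms=5, host_failures=2)
```

Here the hosts hold 4, 3, and 3 slots. Losing the 4-slot host leaves 6 slots for 5 VMs, so one failure is tolerated; losing two hosts leaves only 3 slots, so two failures are not.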
HA will respond when the state of a host has changed, or when the state of one or more virtual machines has changed. There are multiple scenarios in which HA will attempt to restart a virtual machine of which we have listed the most common below:
Prior to vSphere 5, the actual number of restart attempts was 6, as it excluded the initial attempt. With vSphere 5.0 the default is 5. There are specific times associated with each of these attempts. The following bullet list will clarify this concept. The ‘m’ stands for “minutes” in this list.
In case of a host failure, HA tries to restart the virtual machine on another host in the affected cluster. If the restart is unsuccessful on that host, the restart count is increased by 1.
Let’s say the first restart attempt is made at T0 minutes, when the host failure has occurred. (In practice the restart is not performed as soon as the host fails, because HA takes some time before declaring a host failure; see the 2 scenarios mentioned above.)
If the first restart attempt fails, the restart counter is increased by one and the next restart is attempted after 2 minutes (T2). In the same fashion, HA keeps trying to restart the VM until an issued power-on attempt is reported as “completed”.
A successful restart might never occur if the maximum restart count is reached and all five restart attempts were unsuccessful.
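The retry timing can be sketched in a few lines. The first attempt at T0, the second at T2, and the limit of five attempts come from the text above; the doubling of the wait between later attempts is an assumption used here for illustration.

```python
def restart_schedule(max_attempts=5, first_delay_min=2):
    """Return the minute offsets (from T0) at which restarts are attempted.
    Assumes the wait doubles after every failed attempt."""
    times = []
    t, delay = 0, first_delay_min
    for _ in range(max_attempts):
        times.append(t)
        t += delay
        delay *= 2
    return times


schedule = restart_schedule()
print(schedule)
```

Under this assumption the five attempts fall at T0, T2, T6, T14, and T30 minutes, after which HA gives up on the VM.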
In the case of the isolation of a master, this timeline is a bit less complicated because there is no need to go through an election process. In this timeline, “s” refers to seconds.
AAM is Legato’s Automated Availability Manager. Up to and including vSphere 4.1, VMware HA was built on Legato’s Automated Availability Manager (AAM) software, re-engineered to work with VMs. VMware’s vCenter agent (vpxa) interfaces with the VMware HA agent, which acts as an intermediary to the AAM software. From vSphere 5.0, HA uses an agent called “FDM” (Fault Domain Manager).
In an HA cluster, ESX hosts use heartbeats to communicate with the other hosts in the cluster. By default, a heartbeat is sent every 1 second.
If an ESX host in the cluster does not receive a heartbeat from any other host in the cluster for 13 seconds, it suspects that it is isolated and pings the configured isolation address (the default gateway by default). If the ping fails, VMware HA executes the host isolation response.
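The two-step check can be written as a small function. This is a sketch of the logic only: the function and parameter names are made up, and a real host would measure the heartbeat gap and run the ping itself.

```python
HEARTBEAT_INTERVAL_S = 1    # heartbeats are sent every second
HEARTBEAT_TIMEOUT_S = 13    # silence after which the host suspects isolation


def host_is_isolated(seconds_since_last_heartbeat, isolation_address_pingable):
    """Step 1: no heartbeat for 13 seconds.
    Step 2: the isolation address (default gateway) does not answer a ping."""
    if seconds_since_last_heartbeat < HEARTBEAT_TIMEOUT_S:
        return False
    return not isolation_address_pingable
```

A host that misses heartbeats but can still ping the gateway is not treated as isolated, so the isolation response is not triggered in that case.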
Yes, the admission control policy depends on vCenter Server, although it is part of HA and HA itself works independently of vCenter Server. Admission control policies do not work if vCenter Server is unavailable at the time an ESXi host fails. This does not mean that the VMs that were running on the failed host will not be restarted, only that the policy you have chosen will not be enforced.
For example, you have chosen the “Specify failover host” policy and dedicated one ESXi host to handle failover. In the normal scenario, if a host failure occurs, HA fails over the affected VMs only to this dedicated host and not to any other host in the cluster. But if the failure happens while vCenter is unavailable, HA might also restart your VMs on other hosts if there are not sufficient resources available on the specified failover host.
The maximum number of primary HA hosts is 5. A VMware HA cluster chooses the first 5 hosts that join the cluster as primary nodes, and all other hosts are automatically selected as secondary nodes.
As per VMware Definition:
VMware® High Availability (HA) provides easy to use, cost effective high availability for applications running in virtual machines. In the event of server failure, affected virtual machines are automatically restarted on other production servers with spare capacity.
Enable: Do not power on VMs that violate availability constraints.
Disable: Power on VMs that violate availability constraints.
You can configure a parameter called “das.isolationshutdowntimeout”. Its value is specified in seconds (the default is 300), and it is the time HA waits for a VM to shut down gracefully before powering it off when the isolation response is set to “Shut down VM” and is triggered.
The steps below are taken from my blog post on troubleshooting HA:
- Incoming port: TCP/UDP 8042-8045
- Outgoing port: TCP/UDP 2050-2250
- service vmware-aam restart
- service vmware-aam stop
- service vmware-aam start
There is a slight difference between ESXi host isolation and a network partition. When multiple slave ESXi hosts are cut off from the master together but can still ping each other, the condition is known as a network partition.
For example, if the subnet mask of 5 ESXi hosts is changed, they will be unable to talk to the master (being on a different subnet), but they can still communicate with each other (being on the same subnet).
When a network partition happens in a cluster, an election takes place among the partitioned slave ESXi hosts and a new master is elected among them. In this case there will be 2 masters in the cluster.
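The partition example can be modelled by grouping hosts by subnet. This is a sketch only: the host names and subnets are made up, and picking the first host name in sorted order stands in for the real election.

```python
from collections import defaultdict


def masters_per_partition(host_subnets, current_master):
    """host_subnets: host name -> subnet it can talk on.
    Returns subnet -> master for every partition."""
    partitions = defaultdict(list)
    for host, subnet in host_subnets.items():
        partitions[subnet].append(host)

    masters = {}
    for subnet, members in partitions.items():
        if current_master in members:
            masters[subnet] = current_master      # existing master stays
        else:
            masters[subnet] = sorted(members)[0]  # placeholder election
    return masters


# 8 hosts; the subnet of esx04..esx08 was changed by mistake.
hosts = {f"esx0{i}": "10.0.0.0/24" for i in range(1, 4)}
hosts.update({f"esx0{i}": "10.0.1.0/24" for i in range(4, 9)})

masters = masters_per_partition(hosts, current_master="esx01")
```

The result contains one master per partition, which matches the 2 masters described above.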