Nutanix Kubernetes Engine (NKE): Day-Two Operations. Scaling Up and Shrinking Kubernetes Clusters.

Previously, I wrote about deploying Kubernetes clusters using the Nutanix Kubernetes Engine.

The next few articles will cover the day-two operations that make managing Kubernetes clusters easier.

Today we will look at how to expand or shrink a Kubernetes cluster using NKE.

We will begin with the Kubernetes cluster deployed in the previous article, which consists of two control plane nodes, three etcd nodes, and three worker nodes (the etcd nodes run as separate VMs and are not listed by kubectl):

# kubectl get nodes -o wide
NAME                              STATUS   ROLES                  AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                 CONTAINER-RUNTIME
nke-vmik-lab-01-2c9716-master-0   Ready    control-plane,master   15d   v1.25.6   192.168.22.240   <none>        CentOS Linux 7 (Core)   3.10.0-1160.108.1.el7.x86_64   containerd://1.6.16
nke-vmik-lab-01-2c9716-master-1   Ready    control-plane,master   15d   v1.25.6   192.168.22.231   <none>        CentOS Linux 7 (Core)   3.10.0-1160.108.1.el7.x86_64   containerd://1.6.16
nke-vmik-lab-01-2c9716-worker-0   Ready    node                   15d   v1.25.6   192.168.22.236   <none>        CentOS Linux 7 (Core)   3.10.0-1160.108.1.el7.x86_64   containerd://1.6.16
nke-vmik-lab-01-2c9716-worker-1   Ready    node                   15d   v1.25.6   192.168.22.235   <none>        CentOS Linux 7 (Core)   3.10.0-1160.108.1.el7.x86_64   containerd://1.6.16
nke-vmik-lab-01-2c9716-worker-2   Ready    node                   15d   v1.25.6   192.168.22.234   <none>        CentOS Linux 7 (Core)   3.10.0-1160.108.1.el7.x86_64   containerd://1.6.16

In this cluster, I have a deployment with a set of Nginx containers:

# kubectl get pods -n vmik-test  -o wide
NAME                               READY   STATUS    RESTARTS   AGE     IP               NODE                              NOMINATED NODE   READINESS GATES
nginx-deployment-cd55c47f5-5d8ql   1/1     Running   0          3h19m   172.20.249.207   nke-vmik-lab-01-2c9716-worker-2   <none>           <none>
nginx-deployment-cd55c47f5-8dg8f   1/1     Running   0          3h19m   172.20.87.15     nke-vmik-lab-01-2c9716-worker-0   <none>           <none>
nginx-deployment-cd55c47f5-bnvnx   1/1     Running   0          3h19m   172.20.87.14     nke-vmik-lab-01-2c9716-worker-0   <none>           <none>
nginx-deployment-cd55c47f5-cnmnj   1/1     Running   0          3h19m   172.20.87.12     nke-vmik-lab-01-2c9716-worker-0   <none>           <none>
nginx-deployment-cd55c47f5-jmr99   1/1     Running   0          3h19m   172.20.219.78    nke-vmik-lab-01-2c9716-worker-1   <none>           <none>
nginx-deployment-cd55c47f5-m5blv   1/1     Running   0          3h19m   172.20.249.206   nke-vmik-lab-01-2c9716-worker-2   <none>           <none>
nginx-deployment-cd55c47f5-sxs5d   1/1     Running   0          3h19m   172.20.219.79    nke-vmik-lab-01-2c9716-worker-1   <none>           <none>
nginx-deployment-cd55c47f5-tf6gh   1/1     Running   0          3h19m   172.20.249.208   nke-vmik-lab-01-2c9716-worker-2   <none>           <none>
nginx-deployment-cd55c47f5-w6kx6   1/1     Running   0          3h19m   172.20.87.13     nke-vmik-lab-01-2c9716-worker-0   <none>           <none>
nginx-deployment-cd55c47f5-wc5f5   1/1     Running   0          3h19m   172.20.219.80    nke-vmik-lab-01-2c9716-worker-1   <none>           <none>

In this article we will talk about scaling up and shrinking our cluster, so let’s start with what you can and can’t do:

  1. You can add more workers to the worker pool;
  2. You can delete workers from the pool;
  3. You can create a new worker pool;
  4. You can’t resize existing workers – change CPU/RAM/HDD;
  5. You can’t increase the number of control plane or etcd nodes.

Now let’s look at this in action. Click on Node Pools:

This menu provides access to:

  1. Control Plane – you can’t change anything here; you can only view the nodes’ parameters and IP addresses;
  2. etcd – same as the Control Plane;
  3. Worker – add or remove workers from an existing pool, create a new pool, add labels, etc.

Click on Worker:

In this window, we see the existing pool, which includes three worker nodes, along with their resources.

Clicking the Actions button, we will see three options:

We can’t delete the default pool because many system containers are running there, so this option is grayed out in this example.

Let’s click Update:

As I wrote, we can’t change the resources of existing nodes. This menu only shows the current resources and lets us add Kubernetes labels to the nodes.

# kubectl describe nodes nke-vmik-lab-01-2c9716-worker-0
Name:               nke-vmik-lab-01-2c9716-worker-0
Roles:              node
Labels: 
…
                    nke-default=true
                    first-pool=yes
…

In the output above, you can see that NKE has added a label to the worker node.
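
The labels added here can be used for regular Kubernetes scheduling. As a small sketch (using the first-pool=yes label from the output above and standard kubectl label-selector and patch commands), you could list the labeled nodes and pin the test deployment to them with a nodeSelector:

# kubectl get nodes -l first-pool=yes
# kubectl patch deployment nginx-deployment -n vmik-test -p '{"spec":{"template":{"spec":{"nodeSelector":{"first-pool":"yes"}}}}}'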

Now, let’s look at Resize and how it works:

In this dialog, you can see the current total number of nodes in the pool, and you can increase or decrease it.

Let’s add one more worker by increasing the total number of nodes to four and clicking Resize:

We will see a new task and status:

At this moment, NKE is deploying a new VM and joining it to the Kubernetes cluster.
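
If you want to follow this from the Kubernetes side, you can simply watch the node list until the new worker appears:

# kubectl get nodes -w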

The procedure shouldn’t take long. After it finishes, we can see a new worker in the pool:

# kubectl get nodes -o wide
NAME                              STATUS   ROLES                  AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                 CONTAINER-RUNTIME
nke-vmik-lab-01-2c9716-master-0   Ready    control-plane,master   15d   v1.25.6   192.168.22.240   <none>        CentOS Linux 7 (Core)   3.10.0-1160.108.1.el7.x86_64   containerd://1.6.16
nke-vmik-lab-01-2c9716-master-1   Ready    control-plane,master   15d   v1.25.6   192.168.22.231   <none>        CentOS Linux 7 (Core)   3.10.0-1160.108.1.el7.x86_64   containerd://1.6.16
nke-vmik-lab-01-2c9716-worker-0   Ready    node                   15d   v1.25.6   192.168.22.236   <none>        CentOS Linux 7 (Core)   3.10.0-1160.108.1.el7.x86_64   containerd://1.6.16
nke-vmik-lab-01-2c9716-worker-1   Ready    node                   15d   v1.25.6   192.168.22.235   <none>        CentOS Linux 7 (Core)   3.10.0-1160.108.1.el7.x86_64   containerd://1.6.16
nke-vmik-lab-01-2c9716-worker-2   Ready    node                   15d   v1.25.6   192.168.22.234   <none>        CentOS Linux 7 (Core)   3.10.0-1160.108.1.el7.x86_64   containerd://1.6.16
nke-vmik-lab-01-2c9716-worker-3   Ready    node                   11m   v1.25.6   192.168.22.237   <none>        CentOS Linux 7 (Core)   3.10.0-1160.108.1.el7.x86_64   containerd://1.6.16

Let’s scale an existing deployment to run a few containers on the new worker:

# kubectl scale deployment -n vmik-test nginx-deployment --replicas 15
deployment.apps/nginx-deployment scaled

Four containers were started on this node:

# kubectl get pods -n vmik-test -o wide | grep worker-3
nginx-deployment-cd55c47f5-mx264   1/1     Running   0          84s     172.20.165.131   nke-vmik-lab-01-2c9716-worker-3   <none>           <none>
nginx-deployment-cd55c47f5-ptswq   1/1     Running   0          84s     172.20.165.132   nke-vmik-lab-01-2c9716-worker-3   <none>           <none>
nginx-deployment-cd55c47f5-r5btn   1/1     Running   0          84s     172.20.165.129   nke-vmik-lab-01-2c9716-worker-3   <none>           <none>
nginx-deployment-cd55c47f5-x47lk   1/1     Running   0          84s     172.20.165.130   nke-vmik-lab-01-2c9716-worker-3   <none>           <none>
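
As an alternative to grep, you can filter pods by node with a standard kubectl field selector:

# kubectl get pods -n vmik-test -o wide --field-selector spec.nodeName=nke-vmik-lab-01-2c9716-worker-3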

Now let’s delete Worker-3 by clicking the Delete button in the NKE interface:

We can see that the four containers were redeployed on the other nodes, and none of them run on worker-3 anymore:

# kubectl get pods -n vmik-test  -o wide
NAME                               READY   STATUS    RESTARTS   AGE     IP               NODE                              NOMINATED NODE   READINESS GATES
nginx-deployment-cd55c47f5-5d8ql   1/1     Running   0          4h15m   172.20.249.207   nke-vmik-lab-01-2c9716-worker-2   <none>           <none>
nginx-deployment-cd55c47f5-8dg8f   1/1     Running   0          4h15m   172.20.87.15     nke-vmik-lab-01-2c9716-worker-0   <none>           <none>
nginx-deployment-cd55c47f5-b92p6   1/1     Running   0          18s     172.20.219.81    nke-vmik-lab-01-2c9716-worker-1   <none>           <none>
nginx-deployment-cd55c47f5-bnvnx   1/1     Running   0          4h15m   172.20.87.14     nke-vmik-lab-01-2c9716-worker-0   <none>           <none>
nginx-deployment-cd55c47f5-cnmnj   1/1     Running   0          4h15m   172.20.87.12     nke-vmik-lab-01-2c9716-worker-0   <none>           <none>
nginx-deployment-cd55c47f5-cv687   1/1     Running   0          18s     172.20.219.82    nke-vmik-lab-01-2c9716-worker-1   <none>           <none>
nginx-deployment-cd55c47f5-dlncf   1/1     Running   0          18s     172.20.249.210   nke-vmik-lab-01-2c9716-worker-2   <none>           <none>
nginx-deployment-cd55c47f5-jmr99   1/1     Running   0          4h15m   172.20.219.78    nke-vmik-lab-01-2c9716-worker-1   <none>           <none>
nginx-deployment-cd55c47f5-m5blv   1/1     Running   0          4h15m   172.20.249.206   nke-vmik-lab-01-2c9716-worker-2   <none>           <none>
nginx-deployment-cd55c47f5-mc6qm   1/1     Running   0          3m34s   172.20.249.209   nke-vmik-lab-01-2c9716-worker-2   <none>           <none>
nginx-deployment-cd55c47f5-sxs5d   1/1     Running   0          4h15m   172.20.219.79    nke-vmik-lab-01-2c9716-worker-1   <none>           <none>
nginx-deployment-cd55c47f5-tf6gh   1/1     Running   0          4h15m   172.20.249.208   nke-vmik-lab-01-2c9716-worker-2   <none>           <none>
nginx-deployment-cd55c47f5-w6kx6   1/1     Running   0          4h15m   172.20.87.13     nke-vmik-lab-01-2c9716-worker-0   <none>           <none>
nginx-deployment-cd55c47f5-wc5f5   1/1     Running   0          4h15m   172.20.219.80    nke-vmik-lab-01-2c9716-worker-1   <none>           <none>
nginx-deployment-cd55c47f5-z94ch   1/1     Running   0          18s     172.20.87.16     nke-vmik-lab-01-2c9716-worker-0   <none>           <none>

And we have three workers again:

# kubectl get nodes
NAME                              STATUS   ROLES                  AGE   VERSION
nke-vmik-lab-01-2c9716-master-0   Ready    control-plane,master   15d   v1.25.6
nke-vmik-lab-01-2c9716-master-1   Ready    control-plane,master   15d   v1.25.6
nke-vmik-lab-01-2c9716-worker-0   Ready    node                   15d   v1.25.6
nke-vmik-lab-01-2c9716-worker-1   Ready    node                   15d   v1.25.6
nke-vmik-lab-01-2c9716-worker-2   Ready    node                   15d   v1.25.6

Now, let’s shrink our pool to one node using the resize button as before:

We decided to shrink the default pool from three workers to one, so NKE needs to delete two worker nodes.

This is a step-by-step procedure: the workers are deleted one by one. Before deleting a node, NKE drains it (I believe it uses something like kubectl drain), and the pods are restarted on the other nodes in the cluster.
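
For reference, a manual drain with plain kubectl would look roughly like this (just an illustration of the concept, not what NKE runs internally):

# kubectl drain nke-vmik-lab-01-2c9716-worker-2 --ignore-daemonsets --delete-emptydir-data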

At each step, NKE will check the Kubernetes cluster’s health and ensure it’s OK before moving on to the next step.
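
I don’t know exactly which checks NKE performs, but you can verify the basics yourself from the Kubernetes side, for example:

# kubectl get nodes
# kubectl get pods -n kube-system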

We can see that the pods from worker-2 were restarted on workers 0 and 1 while worker-2 was being deleted:

# kubectl get pods -n vmik-test -o wide
NAME                               READY   STATUS    RESTARTS   AGE     IP              NODE                              NOMINATED NODE   READINESS GATES
nginx-deployment-cd55c47f5-8dg8f   1/1     Running   0          4h24m   172.20.87.15    nke-vmik-lab-01-2c9716-worker-0   <none>           <none>
nginx-deployment-cd55c47f5-b92p6   1/1     Running   0          9m24s   172.20.219.81   nke-vmik-lab-01-2c9716-worker-1   <none>           <none>
nginx-deployment-cd55c47f5-bnvnx   1/1     Running   0          4h24m   172.20.87.14    nke-vmik-lab-01-2c9716-worker-0   <none>           <none>
nginx-deployment-cd55c47f5-cnmnj   1/1     Running   0          4h24m   172.20.87.12    nke-vmik-lab-01-2c9716-worker-0   <none>           <none>
nginx-deployment-cd55c47f5-cv687   1/1     Running   0          9m24s   172.20.219.82   nke-vmik-lab-01-2c9716-worker-1   <none>           <none>
nginx-deployment-cd55c47f5-g2v69   1/1     Running   0          87s     172.20.219.84   nke-vmik-lab-01-2c9716-worker-1   <none>           <none>
nginx-deployment-cd55c47f5-gj2zf   1/1     Running   0          87s     172.20.219.85   nke-vmik-lab-01-2c9716-worker-1   <none>           <none>
nginx-deployment-cd55c47f5-jmr99   1/1     Running   0          4h24m   172.20.219.78   nke-vmik-lab-01-2c9716-worker-1   <none>           <none>
nginx-deployment-cd55c47f5-jnzls   1/1     Running   0          87s     172.20.87.22    nke-vmik-lab-01-2c9716-worker-0   <none>           <none>
nginx-deployment-cd55c47f5-lqd2q   1/1     Running   0          87s     172.20.219.86   nke-vmik-lab-01-2c9716-worker-1   <none>           <none>
nginx-deployment-cd55c47f5-mj98x   1/1     Running   0          87s     172.20.87.18    nke-vmik-lab-01-2c9716-worker-0   <none>           <none>
nginx-deployment-cd55c47f5-sxs5d   1/1     Running   0          4h24m   172.20.219.79   nke-vmik-lab-01-2c9716-worker-1   <none>           <none>
nginx-deployment-cd55c47f5-w6kx6   1/1     Running   0          4h24m   172.20.87.13    nke-vmik-lab-01-2c9716-worker-0   <none>           <none>
nginx-deployment-cd55c47f5-wc5f5   1/1     Running   0          4h24m   172.20.219.80   nke-vmik-lab-01-2c9716-worker-1   <none>           <none>
nginx-deployment-cd55c47f5-z94ch   1/1     Running   0          9m24s   172.20.87.16    nke-vmik-lab-01-2c9716-worker-0   <none>           <none>

And, just to demonstrate: an error! After the first worker was deleted, the cluster’s health was far from normal; NKE noticed that and stopped before the next step:

Pro tip: check the /home/nutanix/data/logs/karbon_core.out log file.

This log contains all the information you need to troubleshoot NKE issues. This time we can see a problem with the calico-typha deployment:

[DEBUG] Verifying calico-typha deployment in kube-system namespace
[DEBUG] expecting 3 available replicas of calico-typha deployment in kube-system namespace. Currently running: 2
[DEBUG] expecting 3 available replicas of calico-typha deployment in kube-system namespace. Currently running: 2
[DEBUG] expecting 3 available replicas of calico-typha deployment in kube-system namespace. Currently running: 2
[DEBUG] expecting 3 available replicas of calico-typha deployment in kube-system namespace. Currently running: 2
[ERROR] Failed to verify calico addon
[INFO] Cluster 2c97168e-e34d-4a11-65bd-9ca93d39e714 is unhealthy with error :Operation timed out
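
When something like this happens, it helps to follow the log live and cross-check the add-on it complains about from the Kubernetes side. For example (the log lives on the Prism Central VM, where the karbon_core service runs):

# tail -f /home/nutanix/data/logs/karbon_core.out
# kubectl get deployment calico-typha -n kube-system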

To fix this issue, resize the worker pool again, but with a minimum of three nodes, as required for a production deployment:

In my case, I resized the pool to five nodes. Sadly, the node indexes didn’t start from the beginning, so now I have workers 4, 5, and 6.

Now the cluster should be “green” again:

From this state, we can easily shrink the cluster back to three nodes using Resize, and nothing bad should happen. Don’t forget that when shrinking, the minimum number of worker nodes for a production cluster is three.

We have looked at how to resize the worker pool, add nodes, shrink the pool, and manually delete a specific node. We also now know that the minimum number of worker nodes for a production cluster is three.

The last thing I want to show is creating a new worker pool. It may be useful when you want to deploy workers with different settings (CPU/RAM/HDD).

Click on Add Node Pool:

There, we need to specify some settings:

  1. Name of the pool and number of desired workers;
  2. CPU, Memory, and Storage properties;
  3. Node Pool Network – don’t forget that it must be a network with IPAM enabled and free IPs available in the pool;
  4. Metadata, if needed.

After specifying the settings, click Add, and the procedure should begin. We can see that the new pool is deploying:

After a while, the pool will be deployed:

# kubectl get nodes
NAME                                             STATUS   ROLES                  AGE     VERSION
nke-vmik-lab-01-2c9716-master-0                  Ready    control-plane,master   15d     v1.25.6
nke-vmik-lab-01-2c9716-master-1                  Ready    control-plane,master   15d     v1.25.6
nke-vmik-lab-01-2c9716-small-cpu-pool-worker-0   Ready    node                   3m46s   v1.25.6
nke-vmik-lab-01-2c9716-small-cpu-pool-worker-1   Ready    node                   3m46s   v1.25.6
nke-vmik-lab-01-2c9716-worker-0                  Ready    node                   15d     v1.25.6
nke-vmik-lab-01-2c9716-worker-1                  Ready    node                   15d     v1.25.6
nke-vmik-lab-01-2c9716-worker-4                  Ready    node                   34m     v1.25.6

Also, nothing prevents containers from running on the new nodes:

# kubectl get pods -n vmik-test  -o wide | grep small
nginx-deployment-cd55c47f5-2hlsj   1/1     Running   0          74s   172.20.193.194   nke-vmik-lab-01-2c9716-small-cpu-pool-worker-0   <none>           <none>
nginx-deployment-cd55c47f5-65s4q   1/1     Running   0          73s   172.20.76.195    nke-vmik-lab-01-2c9716-small-cpu-pool-worker-1   <none>           <none>
nginx-deployment-cd55c47f5-7zl9l   1/1     Running   0          73s   172.20.193.195   nke-vmik-lab-01-2c9716-small-cpu-pool-worker-0   <none>           <none>
nginx-deployment-cd55c47f5-ks9mr   1/1     Running   0          74s   172.20.76.194    nke-vmik-lab-01-2c9716-small-cpu-pool-worker-1   <none>           <none>

If you want to delete a newly created pool, you may notice that the Delete button is inactive. To delete a non-default pool, we first need to remove all the workers in it or resize it to zero:
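
Before emptying a pool, it’s worth checking what is still running on its nodes, for example:

# kubectl get pods -A -o wide | grep small-cpu-pool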

After that, we can delete the pool:

# kubectl get nodes
NAME                              STATUS   ROLES                  AGE   VERSION
nke-vmik-lab-01-2c9716-master-0   Ready    control-plane,master   16d   v1.25.6
nke-vmik-lab-01-2c9716-master-1   Ready    control-plane,master   16d   v1.25.6
nke-vmik-lab-01-2c9716-worker-0   Ready    node                   16d   v1.25.6
nke-vmik-lab-01-2c9716-worker-1   Ready    node                   16d   v1.25.6
nke-vmik-lab-01-2c9716-worker-4   Ready    node                   18h   v1.25.6

By the way, remember I said that the minimum number of workers for a production cluster is three? That’s still true, but the cluster will work even in this configuration:

I have one worker node in the default pool and two nodes in the new one. You may remember that earlier we couldn’t shrink the original pool to one node, but now we can.

In this situation, if you try to shrink the small pool, you will hit the same error as before, because the total number of worker nodes would drop below three.

Note that we can’t delete our original pool; it must contain at least one node.

That’s all for now. I hope you now have a better idea of how to expand worker pools, add new ones, and shrink them. Don’t forget about the minimum worker count.
