Overview

Recently someone raised a question because they were seeing their traffic source NAT to the node IP when using Azure CNI and Calico. I covered this a bit when I dug into the Azure CNI and its impact on iptables in my AKS Networking: Iptables in AKS post. The short version is that the ip-masq-agent running in the cluster has a matching configmap that tells it which ranges it should exclude from outbound NAT. By default that range is set to the cluster’s Vnet CIDR; however, in that post I was only looking at Azure CNI without any Kubernetes Network Policy applied. When you introduce Calico into the mix, some interesting things happen. Most notably, the ip-masq-agent config I shared in that article gets hijacked by Calico. Let’s have a quick look at how this works.
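
If you want to peek at that config yourself, you can pull it out of kube-system. The configmap name below is the one I see on my cluster, so treat it as an assumption; if yours differs, just list the kube-system configmaps and look for the ip-masq-agent entry.

# Look for the ip-masq-agent configmap
kubectl get configmaps -n kube-system

# On my cluster the config lives in 'azure-ip-masq-agent-config'
kubectl get configmap azure-ip-masq-agent-config -n kube-system -o yaml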

Setup

I’m not going to go through the full network setup here, since if you’re reading this post you’ve probably already run into this issue, but let me share my high-level setup. In my lab I have two Vnets peered with each other. In Vnet A I have an AKS cluster and an Ubuntu VM that I can use to test traffic between the cluster and a machine in the same Vnet. In Vnet B I just have an Ubuntu VM to test traffic leaving the AKS Vnet across the Vnet peering. On both Ubuntu VMs I’ve installed docker and started up an nginx container using “docker run -d -p 80:80 nginx”, and once it was running I ran “docker logs <containername> -f” to watch the nginx logs. The nginx logs show the source IP for every request.
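
For reference, the commands on each Ubuntu VM looked roughly like this (I’ve added a --name here, which wasn’t in my original command, just so the logs command has a predictable container name to point at):

# Run nginx and tail its access logs to watch incoming source IPs
docker run -d -p 80:80 --name nginx nginx
docker logs nginx -f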

As for the AKS cluster, I created that in Vnet A using the following command to make sure I’d enabled Azure CNI and Calico.

# Create Cluster
az aks create -g <ResourceGroupName> \
-n azurecnicalico \
--network-plugin azure \
--network-policy calico \
--vnet-subnet-id <subnet resource id>

So now we have two Ubuntu VMs running nginx that we can hit, one in the AKS Vnet and one in a separate peered Vnet. In our cluster we’ll fire up a quick Ubuntu test pod that we can use to curl the two servers.

# Create Ubuntu Pod
cat <<EOF |kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: ubuntu
  name: ubuntu
spec:
  containers:
  - image: ubuntu
    name: ubuntu
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;" ]
  restartPolicy: Never
EOF

Now you can exec into the pod with kubectl exec -it ubuntu -- bash, run an apt update and then apt install curl. If you curl either of your servers you should see the source IP in the nginx logs. On the server in Vnet A (same Vnet as the cluster) you’ll see the pod IP, and on the server in Vnet B (outside of the AKS Vnet) you’ll see the node IP.
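
For reference, getting curl into the pod looks like this:

# Exec into the test pod, then run the apt commands inside the pod's shell
kubectl exec -it ubuntu -- bash
apt update && apt install -y curl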

Let’s check that out:

# Check out the pod IP
kubectl get pods -o wide
NAME                     READY   STATUS    RESTARTS   AGE     IP            NODE                                NOMINATED NODE   READINESS GATES
ubuntu                   1/1     Running   0          10m     10.240.0.46   aks-nodepool1-30745869-vmss000001   <none>           <none>

# Get the IP of the node the pod is running on
kubectl get node aks-nodepool1-30745869-vmss000001 -o wide
NAME                                STATUS   ROLES   AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
aks-nodepool1-30745869-vmss000001   Ready    agent   3d22h   v1.19.3   10.240.0.35   <none>        Ubuntu 18.04.5 LTS   5.4.0-1031-azure   containerd://1.4.1+azure

After running the above I can see the following:

  • Pod IP: 10.240.0.46
  • AKS Node IP: 10.240.0.35
  • VNet A Server IP (Same Vnet as AKS): 10.240.0.97
  • Vnet B Server IP (Different Vnet from AKS): 172.17.0.4

# Curl Vnet A Server (AKS Vnet) from our Ubuntu Pod
curl 10.240.0.97

# Docker Logs on the Vnet A Server
10.240.0.46 - - [11/Dec/2020:17:20:04 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.68.0" "-"

As you can see above, within the same Vnet, the server we’re calling sees the pod IP (10.240.0.46) as the source IP.

# Curl Vnet B Server (Peered Vnet) from our Ubuntu Pod
curl 172.17.0.4

# Docker Logs on the Vnet B Server
10.240.0.35 - - [11/Dec/2020:17:21:45 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.68.0" "-"

Now we can see that when traffic leaves the Vnet the server outside sees the source IP as the node IP (10.240.0.35).

iptables

At this point we have a valid test scenario and have been able to show the SNAT that is taking place... but where is this happening in an Azure CNI + Calico cluster? Let’s have a look through iptables. You’ll need to get a shell on a node. For this, as always, we’ll use ssh-jump, but there are various other options, including using privileged containers (see the sketch below). If you do ssh to a node, you’ll need to set up ssh access first.
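
If you’d rather not set up SSH access, here’s a rough sketch of the privileged-container approach. This is my own lab shortcut rather than anything AKS-specific: the pod runs privileged with hostPID so we can nsenter into the host’s namespaces, the pod name is arbitrary, and the node name is the one from my cluster, so substitute your own.

# Create a privileged pod pinned to the node we want to inspect
cat <<EOF |kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: node-shell
spec:
  nodeName: aks-nodepool1-30745869-vmss000000
  hostPID: true
  containers:
  - image: ubuntu
    name: node-shell
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;" ]
    securityContext:
      privileged: true
  restartPolicy: Never
EOF

# Jump into the host's namespaces (PID 1 is the host init thanks to hostPID)
kubectl exec -it node-shell -- nsenter -t 1 -m -u -i -n -p -- bash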

I’m going to jump right to the POSTROUTING chain here to save us some time, but obviously you could explore the full set of chains more extensively if you prefer.

# First I'll jump into a node
kubectl ssh-jump aks-nodepool1-30745869-vmss000000

# Now lets check out the POSTROUTING chain
sudo iptables -t nat -L POSTROUTING
Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
cali-POSTROUTING  all  --  anywhere             anywhere             /* cali:O3lYWMrLQYEMJtB5 */
KUBE-POSTROUTING  all  --  anywhere             anywhere             /* kubernetes postrouting rules */

As we can see above, the POSTROUTING chain passes everything along to ‘cali-POSTROUTING’, so let’s check that one out.

sudo iptables -t nat -L cali-POSTROUTING
Chain cali-POSTROUTING (1 references)
target     prot opt source               destination
cali-fip-snat  all  --  anywhere             anywhere             /* cali:Z-c7XtVd2Bq7s_hA */
cali-nat-outgoing  all  --  anywhere             anywhere             /* cali:nYKhEzDlr11Jccal */

# So cali-POSTROUTING passes to cali-fip-snat and cali-nat-outgoing
# Let's check those out
sudo iptables -t nat -L cali-fip-snat
Chain cali-fip-snat (1 references)
target     prot opt source               destination

sudo iptables -t nat -L cali-nat-outgoing
Chain cali-nat-outgoing (1 references)
target     prot opt source               destination
MASQUERADE  all  --  anywhere             anywhere             /* cali:flqWnvo8yq4ULQLa */ match-set cali40masq-ipam-pools src ! match-set cali40all-ipam-pools dst

Not much is happening in cali-fip-snat, but in cali-nat-outgoing we do see a rule that jumps to the MASQUERADE target, and MASQUERADE is where the SNAT happens. This rule has a few parameters on it. I won’t dig deep into these, but the key one to point out is the ‘match-set’ flag. match-set comes from the iptables ipset extension. IPSet allows you to maintain a table of addresses/ranges that iptables can query. We can actually see these tables using the ipset command. Let’s check that out on our host.

sudo ipset list cali40masq-ipam-pools
Name: cali40masq-ipam-pools
Type: hash:net
Revision: 6
Header: family inet hashsize 1024 maxelem 1048576
Size in memory: 512
References: 1
Number of entries: 1
Members:
10.240.0.0/16
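
As a quick aside, ipset also has a test subcommand, so we can confirm that our pod’s IP would actually match the ‘src’ side of that MASQUERADE rule. This is just a sanity check; the pod IP below is the one from my cluster, so substitute your own.

# Check whether the pod IP falls inside the masquerade pool ipset
sudo ipset test cali40masq-ipam-pools 10.240.0.46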

So there it is: the AKS cluster’s Vnet CIDR is listed in the ‘Members’ block of this ipset. But how did it get there? We have a tip in the name of the ipset. If we search online for ‘Calico IP Pools’ we get to the Calico IP pool documentation, where it turns out there’s an IPPool CRD! Let’s check that out!

kubectl get ippools
NAME                  AGE
default-ipv4-ippool   3d23h

# Let's see what's in that ippool
kubectl get ippool default-ipv4-ippool -o yaml
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"crd.projectcalico.org/v1","kind":"IPPool","metadata":{"annotations":{},"labels":{"addonmanager.kubernetes.io/mode":"Reconcile"},"name":"default-ipv4-ippool"},"spec":{"blockSize":26,"cidr":"10.240.0.0/16","ipipMode":"Never","natOutgoing":true}}
  creationTimestamp: "2020-12-07T18:15:30Z"
  generation: 3
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
  managedFields:
  - apiVersion: crd.projectcalico.org/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:labels:
          .: {}
          f:addonmanager.kubernetes.io/mode: {}
      f:spec:
        .: {}
        f:blockSize: {}
        f:cidr: {}
        f:ipipMode: {}
        f:natOutgoing: {}
    manager: kubectl
    operation: Update
    time: "2020-12-11T15:49:39Z"
  name: default-ipv4-ippool
  resourceVersion: "802963"
  selfLink: /apis/crd.projectcalico.org/v1/ippools/default-ipv4-ippool
  uid: 8d1132dd-358a-4422-b188-f938fc2edc57
spec:
  blockSize: 26
  cidr: 10.240.0.0/16
  ipipMode: Never
  natOutgoing: true

Right at the bottom of that manifest we can see the spec, including a CIDR block that matches the one we found in the ipset. Now that we know where this Vnet CIDR comes from, can we add additional CIDR blocks, like the one for Vnet B? I’m going to use the IPPool example from the Calico docs to create an IP pool with Vnet B’s CIDR. I’m setting the ‘disabled’ flag to true to be sure nothing tries to assign pod IPs from this range, while Calico is still aware of it for NAT.

Note: Before you mess around with ippools you should make sure you understand all of the options and the impact on your cluster.


cat <<EOF |kubectl apply -f -
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  name: othervnet
spec:
  cidr: 172.17.0.0/16
  natOutgoing: true
  disabled: true
  nodeSelector: all()
EOF

# Check out the ippools list
kubectl get ippools
NAME                  AGE
default-ipv4-ippool   4d
othervnet             61m

# Let's take another look at that ipset and see if the new CIDR block was added
sudo ipset list cali40masq-ipam-pools
Name: cali40masq-ipam-pools
Type: hash:net
Revision: 6
Header: family inet hashsize 1024 maxelem 1048576
Size in memory: 576
References: 1
Number of entries: 2
Members:
10.240.0.0/16
172.17.0.0/16

It worked! Now let’s check out the traffic to Vnet A and Vnet B.

# Curl Vnet A Server (AKS Vnet) from our Ubuntu Pod
curl 10.240.0.97

# Docker Logs on the Vnet A Server
10.240.0.46 - - [11/Dec/2020:17:20:04 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.68.0" "-"

As you can see above, within the same Vnet the server we’re calling still sees the pod IP (10.240.0.46) as the source IP, so we didn’t break that.

What about the server in Vnet B?

# Curl Vnet B Server (Peered Vnet) from our Ubuntu Pod
curl 172.17.0.4

# Docker Logs on the Vnet B Server
10.240.0.46 - - [11/Dec/2020:17:55:50 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.68.0" "-"

Success! The server in Vnet B can now see the pod IP!

Summary

As we saw in this post, Azure CNI uses the ip-masq-agent to SNAT any traffic leaving the Vnet, but when you enable Calico network policy that control is taken over by Calico itself. Calico uses the IPPool CRD to let you manage IP pools, which are implemented with ipsets via the ipset iptables extension. You can add an IPPool to your cluster to extend the range of IPs that will be excluded from SNAT.

WARNING: My understanding is that some network appliances may not like seeing pod traffic that hasn’t been NAT’d to a real host IP address, and may drop that traffic. I need to dig into this topic further in a future post, but for now you should proceed with caution when updating IPPools in your AKS clusters. Assume that AKS SNATs traffic leaving the Vnet for good reason, and do your own extensive testing before making any such change.