This lab takes place on a DigitalOcean cluster.
The goal is to understand routing and NAT with Cilium in a specific setup. More details at http://arthurchiao.art/blog/cilium-life-of-a-packet-pod-to-service/
Connect to a Pod.
Run a ping to an external address (for example 1.1, which is shorthand for 1.0.0.1).
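For instance, something along these lines (my-debug-pod is just a placeholder name; busybox is convenient here because it also provides route, arp and ip for the next steps):
kubectl run my-debug-pod --image=busybox -ti --restart=Never -- sh
# ping -c 3 1.1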
Look at the Pod's routing table:
# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.244.0.201    0.0.0.0         UG    0      0        0 eth0
10.244.0.201    0.0.0.0         255.255.255.255 UH    0      0        0 eth0
# arp -an
? (10.244.0.201) at f6:93:7d:11:4d:37 [ether] on eth0
We can see that the Pod's gateway is 10.244.0.201, whose MAC address is f6:93:7d:11:4d:37.
# ip a
24: eth0@if25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether a6:c6:93:71:26:19 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.244.0.213/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::a4c6:93ff:fe71:2619/64 scope link
valid_lft forever preferred_lft forever
Now connect to the Node (with kubectl node-shell), then run:
ip a | grep -A1 -B1 f6:93:7d:11:4d:37
25: lxcf4cd39e34601@if24: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether f6:93:7d:11:4d:37 brd ff:ff:ff:ff:ff:ff link-netns cni-17945243-a306-236e-f504-5197df7b88d9
inet6 fe80::f493:7dff:fe11:4d37/64 scope link
We can see that the MAC address of the Pod's gateway belongs to the interface lxcf4cd39e34601@if24, which is therefore the host-side end of the Pod's veth pair.
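To double-check the veth pairing, you can compare the interface indexes on both sides (a quick sanity check, reusing the interface names seen above): on the Node, iflink of the lxc interface should be 24, i.e. the index of the Pod's eth0, while its own ifindex is 25, which eth0@if25 points back to.
# cat /sys/class/net/lxcf4cd39e34601/ifindex
# cat /sys/class/net/lxcf4cd39e34601/iflink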
The gateway IP itself is carried by the cilium_host interface:
# ifconfig cilium_host
cilium_host: flags=4291<UP,BROADCAST,RUNNING,NOARP,MULTICAST> mtu 1500
inet 10.244.0.201 netmask 255.255.255.255 broadcast 0.0.0.0
inet6 fe80::a0a3:2ff:fe88:734 prefixlen 64 scopeid 0x20<link>
ether a2:a3:02:88:07:34 txqueuelen 1000 (Ethernet)
RX packets 286 bytes 20100 (19.6 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 4642 bytes 302532 (295.4 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
A BPF program is attached to the Pod's lxc interface:
tc filter show dev lxcf4cd39e34601 ingress
We get:
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 bpf_lxc.o:[from-container] direct-action not_in_hw id 13245 tag 7febdcdfe7cdd86f jited
The Node then routes the packet according to its own routing table:
route -n
We get:
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         165.22.16.1     0.0.0.0         UG    0      0        0 eth0
10.19.0.0       0.0.0.0         255.255.0.0     U     0      0        0 eth0
10.135.0.0      0.0.0.0         255.255.0.0     U     0      0        0 eth1
10.244.0.0      10.135.152.71   255.255.255.128 UG    0      0        0 eth1
10.244.0.128    10.244.0.148    255.255.255.128 UG    0      0        0 cilium_host
10.244.0.148    0.0.0.0         255.255.255.255 UH    0      0        0 cilium_host
165.22.16.0     0.0.0.0         255.255.240.0   U     0      0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
Following the Node's default route (0.0.0.0/0 via 165.22.16.1), the packet to 1.0.0.1 leaves through eth0. Let's check whether a BPF program is also attached there:
tc filter show dev eth0 egress
Note that there are also many iptables rules on the Node:
# iptables-save | grep -c KUBE
138
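To get a feel for what these rules look like, you can print a few of them, for example:
iptables-save | grep KUBE | head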
Connect to the Cilium pod running on that Node:
kubectl exec -ti cilium-r7ffp -nkube-system -- bash
If we run a wget http://1.1 in the original Pod, we can then display the eBPF connection-tracking table:
# cilium bpf ct list global | grep 1.0.0.1
TCP OUT 10.244.0.213:46486 -> 1.0.0.1:80 expires=17755044 RxPackets=4 RxBytes=614 RxFlagsSeen=0x1b LastRxReport=17755035 TxPackets=6 TxBytes=415 TxFlagsSeen=0x1b LastTxReport=17755035 Flags=0x0013 [ RxClosing TxClosing SeenNonSyn ] RevNAT=0 SourceSecurityID=42837 IfIndex=0
ICMP OUT 10.244.0.213:51724 -> 1.0.0.1:0 expires=17753923 RxPackets=2 RxBytes=196 RxFlagsSeen=0x00 LastRxReport=17753864 TxPackets=2 TxBytes=196 TxFlagsSeen=0x00 LastTxReport=17753864 Flags=0x0000 [ ] RevNAT=0 SourceSecurityID=42837 IfIndex=0
ICMP OUT 10.244.0.213:0 -> 1.0.0.1:0 related expires=17755093 RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=1 TxBytes=74 TxFlagsSeen=0x02 LastTxReport=17755035 Flags=0x0010 [ SeenNonSyn ] RevNAT=0 SourceSecurityID=42837 IfIndex=0
ICMP OUT 10.244.0.213:41227 -> 1.0.0.1:0 expires=17754874 RxPackets=104 RxBytes=10192 RxFlagsSeen=0x00 LastRxReport=17754815 TxPackets=104 TxBytes=10192 TxFlagsSeen=0x00 LastTxReport=17754815 Flags=0x0000 [ ] RevNAT=0 SourceSecurityID=42837 IfIndex=0
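The SourceSecurityID shown here (42837) is the Cilium security identity assigned to the Pod. Still from inside the cilium agent pod, you can map the Pod IP back to its endpoint and look up the labels behind that identity, for instance (endpoint and identity numbers will differ on your cluster):
cilium endpoint list | grep 10.244.0.213
cilium identity get 42837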
We can also watch the corresponding traffic events with cilium monitor:
# cilium monitor | grep 1.0.0.1
level=info msg="Initializing dissection cache..." subsys=monitor
-> stack flow 0x8958dbda identity 42837->world state new ifindex 0 orig-ip 0.0.0.0: 10.244.0.213:59242 -> 1.0.0.1:80 tcp SYN
-> endpoint 762 flow 0x0 identity world->42837 state reply ifindex lxcf4cd39e34601 orig-ip 1.0.0.1: 1.0.0.1:80 -> 10.244.0.213:59242 tcp SYN, ACK
-> stack flow 0x8958dbda identity 42837->world state established ifindex 0 orig-ip 0.0.0.0: 10.244.0.213:59242 -> 1.0.0.1:80 tcp ACK
-> stack flow 0x8958dbda identity 42837->world state established ifindex 0 orig-ip 0.0.0.0: 10.244.0.213:59242 -> 1.0.0.1:80 tcp ACK
-> endpoint 762 flow 0x0 identity world->42837 state reply ifindex lxcf4cd39e34601 orig-ip 1.0.0.1: 1.0.0.1:80 -> 10.244.0.213:59242 tcp ACK
-> stack flow 0x8958dbda identity 42837->world state established ifindex 0 orig-ip 0.0.0.0: 10.244.0.213:59242 -> 1.0.0.1:80 tcp ACK, FIN
-> endpoint 762 flow 0x0 identity world->42837 state reply ifindex lxcf4cd39e34601 orig-ip 1.0.0.1: 1.0.0.1:80 -> 10.244.0.213:59242 tcp ACK, FIN
-> stack flow 0x8958dbda identity 42837->world state established ifindex 0 orig-ip 0.0.0.0: 10.244.0.213:59242 -> 1.0.0.1:80 tcp ACK
In this cluster's setup, NAT is not handled by Cilium itself:
# cilium bpf nat list
Unable to open /sys/fs/bpf/tc/globals/cilium_snat_v4_external: Unable to get object /sys/fs/bpf/tc/globals/cilium_snat_v4_external: no such file or directory. Skipping.
Unable to open /sys/fs/bpf/tc/globals/cilium_snat_v6_external: Unable to get object /sys/fs/bpf/tc/globals/cilium_snat_v6_external: no such file or directory. Skipping.
And indeed, from your PC or from Gipod:
% cilium config view | grep -i masq
enable-bpf-masquerade false
masquerade true
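Since masquerading is enabled but not done in BPF, it is delegated to iptables. You can look for the corresponding rules in the nat table on the Node, for example (the exact chain names may vary with the Cilium version):
iptables-save -t nat | grep -i masquerade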
We can further confirm this by looking at the kernel conntrack table on the Node: the reply direction of the entry shows the Pod's source address rewritten to the Node's own address (164.92.138.123 here), i.e. the SNAT is performed by iptables/netfilter:
root@node-7o7xd:/# grep 1.0.0.1 /proc/net/nf_conntrack
ipv4 2 tcp 6 117 TIME_WAIT src=10.244.0.213 dst=1.0.0.1 sport=41708 dport=80 src=1.0.0.1 dst=164.92.138.123 sport=80 dport=41708 [ASSURED] mark=0 zone=0 use=2
We can inspect:
- the Cilium pod's arguments and environment variables:
kubectl describe pod/cilium-zpvw5 -n kube-system
[..]
    Command:
      cilium-agent
    Args:
      --config-dir=/tmp/cilium/config-map
      --k8s-api-server=https://2e37bf88-3149-4fb7-8141-670f20da6700.k8s.ondigitalocean.com
      --ipv4-native-routing-cidr=10.244.0.0/16
    State:          Running
      Started:      Wed, 11 Sep 2024 09:05:20 +0200
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     300m
      memory:  300Mi
    Liveness:   http-get http://127.0.0.1:9879/healthz delay=120s timeout=5s period=30s #success=1 #failure=10
    Readiness:  http-get http://127.0.0.1:9879/healthz delay=0s timeout=5s period=30s #success=1 #failure=3
    Startup:    http-get http://127.0.0.1:9879/healthz delay=0s timeout=1s period=2s #success=1 #failure=105
    Environment:
      K8S_NODE_NAME:               (v1:spec.nodeName)
      CILIUM_K8S_NAMESPACE:        kube-system (v1:metadata.namespace)
      CILIUM_CLUSTERMESH_CONFIG:   /var/lib/cilium/clustermesh/
      KUBERNETES_SERVICE_HOST:     2e37bf88-3149-4fb7-8141-670f20da6700.k8s.ondigitalocean.com
      KUBERNETES_SERVICE_PORT:     443
[..]
- the ConfigMap used by Cilium:
kubectl get cm/cilium-config -o yaml -nkube-system
apiVersion: v1
data:
  agent-not-ready-taint-key: node.cilium.io/agent-not-ready
  annotate-k8s-node: "true"
  arping-refresh-period: 30s
  auto-direct-node-routes: "true"
  bpf-lb-external-clusterip: "false"
  bpf-lb-map-max: "65536"
  bpf-lb-proto-diff: "true"
  bpf-lb-sock: "false"
  bpf-map-dynamic-size-ratio: "0.0025"
  bpf-policy-map-max: "16384"
  bpf-root: /sys/fs/bpf
  cgroup-root: /run/cilium/cgroupv2
  cilium-endpoint-gc-interval: 5m0s
  cluster-id: "0"
  cluster-name: default
  cluster-pool-ipv4-mask-size: "25"
  cni-chaining-mode: portmap
  cni-exclusive: "false"
  cni-log-file: /var/run/cilium/cilium-cni.log
  custom-cni-conf: "false"
  debug: "false"
  debug-verbose: ""
  disable-cnp-status-updates: "true"
  egress-gateway-reconciliation-trigger-interval: 1s
  enable-auto-protect-node-port-range: "true"
  enable-bgp-control-plane: "false"
  enable-bpf-clock-probe: "false"
  enable-endpoint-health-checking: "true"
  enable-external-ips: "false"
  enable-health-check-nodeport: "false"
  enable-health-checking: "true"
  enable-host-legacy-routing: "true"
  enable-host-port: "false"
  enable-hubble: "true"
  enable-ipv4: "true"
  enable-ipv4-big-tcp: "false"
  enable-ipv4-masquerade: "true"
  enable-ipv6: "false"
  enable-ipv6-big-tcp: "false"
  enable-ipv6-masquerade: "true"
  enable-k8s-networkpolicy: "true"
  enable-k8s-terminating-endpoint: "true"
  enable-l2-neigh-discovery: "true"
  enable-l7-proxy: "true"
  enable-local-redirect-policy: "false"
  enable-node-port: "false"
  enable-policy: default
  enable-remote-node-identity: "true"
  enable-sctp: "false"
  enable-session-affinity: "true"
  enable-svc-source-range-check: "true"
  enable-vtep: "false"
  enable-well-known-identities: "true"
  enable-xt-socket-fallback: "true"
  external-envoy-proxy: "false"
  hubble-disable-tls: "false"
  hubble-listen-address: :4244
  hubble-socket-path: /var/run/cilium/hubble.sock
  hubble-tls-cert-file: /var/lib/cilium/tls/hubble/server.crt
  hubble-tls-client-ca-files: /var/lib/cilium/tls/hubble/client-ca.crt
  hubble-tls-key-file: /var/lib/cilium/tls/hubble/server.key
  identity-allocation-mode: crd
  identity-gc-interval: 5m
  identity-heartbeat-timeout: 15m
  install-no-conntrack-iptables-rules: "false"
  ipam: cluster-pool
  ipam-cilium-node-update-rate: 15s
  k8s-client-burst: "10"
  k8s-client-qps: "5"
  kube-proxy-replacement: partial
  kube-proxy-replacement-healthz-bind-address: ""
  mesh-auth-enabled: "true"
  mesh-auth-gc-interval: 5m0s
  mesh-auth-queue-size: "1024"
  mesh-auth-rotated-identities-queue-size: "1024"
  metrics: +cilium_bpf_map_pressure
  monitor-aggregation: medium
  monitor-aggregation-flags: all
  monitor-aggregation-interval: 5s
  node-port-bind-protection: "true"
  nodes-gc-interval: "0"
  operator-api-serve-addr: 127.0.0.1:9234
  preallocate-bpf-maps: "false"
  procfs: /host/proc
  prometheus-serve-addr: :9090
  proxy-connect-timeout: "2"
  proxy-max-connection-duration-seconds: "0"
  proxy-max-requests-per-connection: "0"
  remove-cilium-node-taints: "true"
  routing-mode: native
  set-cilium-is-up-condition: "true"
  set-cilium-node-taints: "true"
  sidecar-istio-proxy-image: cilium/istio_proxy
  skip-cnp-status-startup-clean: "false"
  synchronize-k8s-nodes: "true"
  tofqdns-dns-reject-response-code: refused
  tofqdns-enable-dns-compression: "true"
  tofqdns-endpoint-max-ip-per-hostname: "50"
  tofqdns-idle-connection-grace-period: 0s
  tofqdns-max-deferred-connection-deletes: "10000"
  tofqdns-proxy-response-max-delay: 100ms
  unmanaged-pod-watcher-interval: "15"
  vtep-cidr: ""
  vtep-endpoint: ""
  vtep-mac: ""
  vtep-mask: ""
  write-cni-conf-when-ready: /host/etc/cni/net.d/05-cilium.conflist
kind: ConfigMap
metadata:
  creationTimestamp: "2024-09-11T07:02:56Z"
  labels:
    c3.doks.digitalocean.com/component: cilium
    c3.doks.digitalocean.com/plane: data
    c3.doks.digitalocean.com/variant: legacy
    doks.digitalocean.com/managed: "true"
  name: cilium-config
  namespace: kube-system
  resourceVersion: "553"
  uid: efecf122-f5ac-4c0f-a66a-62abe89e9386
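Note in particular routing-mode: native, enable-ipv4-masquerade: "true" with no enable-bpf-masquerade key (hence the iptables-based masquerading observed above), and kube-proxy-replacement: partial: these settings are consistent with the routing and NAT behaviour we just traced.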