DNS resolution on Azure Compute hosts running Ubuntu OS stops working once calico-vpp-node pods get up and running #688
Comments
Hi @ivansharamok, could you share the vpp-manager logs: `kubectl logs -n calico-vpp-dataplane calico-vpp-node-XYZ -c vpp`. Also, any specific reason for using v3.26 instead of the latest v3.27? If possible, could you switch to v3.27?
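For reference, a minimal sketch of the log collection being asked for; `calico-vpp-node-XYZ` is a placeholder for the actual pod on the affected node:

```sh
# List the calico-vpp-node pods and note which one runs on the affected node
kubectl -n calico-vpp-dataplane get pods -o wide

# Fetch the vpp-manager logs from that pod's "vpp" container
kubectl -n calico-vpp-dataplane logs calico-vpp-node-XYZ -c vpp
```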
Are the nodes using NetworkManager or systemd-networkd? Could you please share the appropriate logs (NetworkManager or systemd-networkd) when this issue happens?
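A quick way to see which of the two is managing the node's networking and to grab the matching journal, sketched below (assuming a systemd-based host; the time window is a placeholder):

```sh
# See which network manager is active on the node
systemctl is-active NetworkManager systemd-networkd systemd-resolved

# Collect the relevant journal around the time host DNS breaks
journalctl -u systemd-networkd --since "1 hour ago"
journalctl -u NetworkManager --since "1 hour ago"
```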
I tried v3.27.0, but the …
Installed Calico VPP v3.27.0 and hit the same issue. Below is the info collected from the cluster using Calico VPP v3.27.0. It looks like Ubuntu 22.04 by default uses systemd-networkd.
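For context, a sketch of the usual way to confirm that systemd-networkd/systemd-resolved own `/etc/resolv.conf` on Ubuntu 22.04 (these commands are not taken from the original report):

```sh
# On stock Ubuntu 22.04, /etc/resolv.conf is a symlink to the systemd-resolved stub
ls -l /etc/resolv.conf

# Show which service manages each link and which DNS servers are in use
networkctl status
resolvectl status
```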
Here's the log for …
Logs for one of …
I just tried switching from Ubuntu 22.04 to CentOS 8, and I didn't run into the DNS resolution issue on the host when using CentOS hosts. I noticed that CentOS uses NetworkManager by default. At this point, I'm not sure what the exact root cause of the issue is, but it might be related to networking managed by systemd-networkd.
Thanks for the details, and sorry about the missing …

What happens is that when … We have faced this issue in the past, and usually a restart of … NetworkManager has a config option, …

After the Azure instances are up and running, modify netplan to make the network config static instead of DHCP, and then start the kubeadm steps to install the cluster. Try the …
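A minimal sketch of a static netplan definition along the lines suggested above; the file name, address, gateway, and DNS values are placeholders and must match what DHCP currently assigns to the uplink:

```yaml
# Hypothetical /etc/netplan/50-static-uplink.yaml - values are placeholders
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: false
      addresses: [10.0.0.4/24]
      routes:
        - to: default
          via: 10.0.0.1
      nameservers:
        addresses: [168.63.129.16]   # Azure-provided DNS; adjust as needed
```

Apply it with `sudo netplan apply` before running the kubeadm steps.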
Environment
Linux master 5.15.0-1042-azure #49-Ubuntu SMP Tue Jul 11 17:28:46 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Issue description
The `calico-vpp-node` pods somehow break DNS resolution on the hosts once those pods get fully initialized and running. The `/etc/resolv.conf` file on the hosts gets edited when the `calico-vpp-node` pod is running. DNS resolution from within the `calico-vpp-node` pods works fine; it is the host's DNS resolution that gets affected, which doesn't allow all Calico VPP components to get configured correctly, as some pods get stuck in the `ImagePullBackOff` state.

To Reproduce
Steps to reproduce the behavior:
- Provision the Azure Compute cluster nodes using `Standard_D4s_v3` size instances.
- Install Calico VPP with `interfaceName: eth0` instead of the default `eth1`; `installation-default.yaml` was edited accordingly, as sketched below.
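For illustration, a minimal sketch of where `interfaceName` typically lives in the Calico VPP manifests, assuming the uplink is declared via the `CALICOVPP_INTERFACES` setting of the `calico-vpp-config` ConfigMap; the exact YAML used in this setup may differ:

```yaml
# Sketch of the ConfigMap fragment selecting eth0 as the VPP uplink interface
apiVersion: v1
kind: ConfigMap
metadata:
  name: calico-vpp-config
  namespace: calico-vpp-dataplane
data:
  CALICOVPP_INTERFACES: |-
    {
      "uplinkInterfaces": [
        { "interfaceName": "eth0" }
      ]
    }
```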
Expected behavior
Installation of Calico VPP should not disrupt the host's DNS resolution.
Additional context
- While the `calico-vpp-node` pods are getting initialized, DNS resolution on the host works as expected. However, once the `calico-vpp-dataplane/calico-vpp-node` pods get to the `Running` state, DNS resolution stops working on the host and the `/etc/resolv.conf` file gets modified.
- I compared `/etc/resolv.conf` on the host before Calico VPP is installed, `/etc/resolv.conf` on the host after the `calico-vpp-node` pod reaches the `Running` state, and `/etc/resolv.conf` inside the `calico-vpp-node` pods.
- `curl google.com` succeeds from within the `calico-vpp-node` pod, but the same query fails on the host with the message `curl: (6) Could not resolve host: google.com`.
- Host DNS resolution only works before `calico-vpp-node` is up, or right after when you manually kill the pod and before it's back up; it stops working again whenever the `calico-vpp-node` pod is up.
- The workaround I found for the pods stuck in `ImagePullBackOff` once the `calico-vpp-node` pods get up and running is to manually kill the `calico-vpp-node` pods and force restart the pods that are failing to pull the images. Since it takes the `calico-vpp-node` pods a few moments to get to the `Running` state, the other cycled workload pods usually get a chance to start pulling the image before DNS resolution is broken again (see the command sketch after this list).
- Another workaround is to edit the `/etc/resolv.conf` file on the host and make it look like the one I fetch from within the `calico-vpp-node` pods. DNS starts working until the `calico-vpp-node` pod gets restarted, as the restart of that pod seems to overwrite the `/etc/resolv.conf` file once again.

I would like to understand what breaks DNS resolution on the hosts when the Calico VPP dataplane gets installed on the cluster.
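A rough sketch of the manual workaround described above; the pod name and the namespace of the stuck pod are placeholders for whatever is failing in the cluster:

```sh
# Kill the calico-vpp-node pod on the affected node; host DNS recovers briefly
# while the DaemonSet recreates it
kubectl -n calico-vpp-dataplane delete pod calico-vpp-node-XYZ

# Immediately restart a workload pod stuck in ImagePullBackOff so it can pull
# its image before DNS breaks again
kubectl -n calico-system delete pod <stuck-pod-name>
```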