Possible issue with T.38 in a NAT'd environment (GCP) #1609

Open
jmordica opened this issue Feb 8, 2023 · 19 comments

@jmordica

jmordica commented Feb 8, 2023

We use rtpengine in GCP, where the Asterisk (using SpanDSP) instances don't have public IPs assigned, but rtpengine.conf is set up like so:

interface=internal/10.206.0.11;external/10.206.0.11!33.177.188.61

Kamailio sets the proper direction flags (internal/external or external/internal) based on the direction of the call, and everything works fine except large-content faxes.

The GCP firewall rule includes all of rtpengine udp port range as well.

By "large content faxes" I mean a ~10 page fax where each page has a lot of content.

I've narrowed the issue down to either rtpengine or the GCP port/firewall mapping, because I can successfully send the exact same fax to an otherwise identical environment where rtpengine uses a single interface and no NAT is involved.

I have full debug logs and pcaps of this occurring but would rather not just dump everything here if it is not needed. I'm just wondering if you have experienced something along these lines before and/or have some experience you can pass along here as to where to look in GCP/network/kamailio.

Thanks!

@carstenbock

Hi,

  1. Please use the Mailing-List (https://rtpengine.com/mailing-list) for such questions.
  2. Fax is usually rather sensitive to timing, so you will need a precise timer on the device. If the device is a Patton device, for example, you need to look for a model with a "High precision timer" for faxes to work properly for anything beyond 2-3 pages. AVM FritzBox devices (consumer DSL modems, popular in Germany), on the other hand, occasionally have issues when it comes to "large content" faxes.

Luckily I do mobile networks today, so I don't have to deal with Fax and T.38 anymore ;-)

Thanks,
Carsten

@rfuchs
Member

rfuchs commented Feb 9, 2023

In addition to what @carstenbock has said, AFAIK T.38 has the property of sending packets in one direction only during the transmission phase, without any feedback packets being sent back. You can verify this with Wireshark/tcpdump. This in turn may lead to the NAT mapping of the UDP port being closed by the firewall after a certain time has passed, during which the firewall has not seen any packets in one (usually the "outgoing") direction. This is purely speculation though and you'd have to confirm that this behaviour is actually what your firewall/NAT gateway does.

If that's what happens, the only solution would be to have SpanDSP generate periodic keepalive packets that can be sent out to keep the NAT mapping alive. I'm not sure if such an option exists in SpanDSP.

@jmordica
Author

I did more testing. We are currently on mr11.1.1.3.

I was able to confirm by using perf that the outbound udp port/mapping is not closing due to some timeout or firewall. I can send incoming packets to the NAT'd rtpengine instance over the WAN without any responses for ~4 minutes and then successfully send a UDP packet from rtpengine back to the remote endpoint without an issue.
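A check along these lines can be sketched in Python. This is only an illustration under assumptions: both ends run in one process on loopback (in a real test the two sockets would live on the remote host and on the NAT'd instance), and the function names `loopback_pair` and `one_way_then_reply` are made up for this example. It sends datagrams in one direction only, then sends a single reply to the source address the receiver actually observed and reports whether that reply arrives.

```python
import socket
import time


def loopback_pair():
    """Two UDP sockets bound to ephemeral ports on 127.0.0.1 (demo only)."""
    a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    a.bind(("127.0.0.1", 0))
    b.bind(("127.0.0.1", 0))
    return a, b


def one_way_then_reply(sender, receiver, packets=5, gap_s=0.05, timeout_s=1.0):
    """Send `packets` datagrams sender -> receiver with nothing sent back,
    then send one datagram in the reverse direction and return True if
    the sender receives it before the timeout."""
    sender.settimeout(timeout_s)
    receiver.settimeout(timeout_s)
    dest = receiver.getsockname()
    seen_from = None
    try:
        for i in range(packets):
            sender.sendto(b"one-way %d" % i, dest)
            # Remember the source address as seen by the receiver; behind
            # a NAT this is the mapped address, not the private one.
            _, seen_from = receiver.recvfrom(2048)
            time.sleep(gap_s)
        # Only now is anything sent in the reverse direction.
        receiver.sendto(b"reply", seen_from)
        data, _ = sender.recvfrom(2048)
    except socket.timeout:
        return False
    return data == b"reply"
```

For a realistic run, raise `packets` and `gap_s` so that the one-way phase lasts as long as the fax page in question (several minutes in the test described above).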

It seems there is an issue/bug in rtpengine when it receives a large incoming fax page while running in a NAT'd environment. Here are a couple of screenshots. The green-colored screenshot is taken from an rtpengine instance using only a single interface in a non-NAT'd environment:
[Screenshot: "good"]

The second screenshot, colored red, is taken from an rtpengine instance in a NAT'd environment where separate internal/external interfaces are used, while receiving a large incoming fax in passthrough to Asterisk/SpanDSP:
[Screenshot: "bad"]

You can see from the screenshots above the port remapping issue, where rtpengine either forgets the ports or crosses them up later in the communication.

Let me know if I need to try another version here or if you have any further advice.

Thanks!

@rfuchs
Member

rfuchs commented Feb 17, 2023

There's no reason for rtpengine to suddenly change ports or interfaces unless there was some kind of signalling event in between, or it has received inbound packets that would indicate a change in ports. Your screenshot doesn't show where or when ports would have changed (and all packets before the ones highlighted by you are flowing the same way), so I can't tell when or why that would have happened. You need to look for the event that would have triggered the change in ports or interfaces. With debug logging enabled you would also see a log message emitted if ports or interfaces or destinations were ever changed.

There was one recent commit related to local interface switching, but without more details I can't tell if that has anything to do with what you're seeing. But you can try a newer version from the 11.1 or 11.2 branches which have that commit included, for example 11.1.1.7.

@themrrobert

themrrobert commented Feb 18, 2023

I already looked at the pcap for signalling or other packets coming in from that port to that interface that would explain the observed behavior, and found none, so I ruled those out.

I am double checking now, and for your peace of mind, the following filter shows no results: udp.dstport==4119 && udp.srcport != 12102
So that rules out incoming traffic changing the port. (The only traffic to port 4119 is from 10.206.0.49:12102 => 10.52.2.14:4119)

There is no SIP or rtpProxyNG signalling between 35s (when it's working fine) and 250s (when it's using the wrong port).
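The same two checks (no unexpected senders to a port, and the moment a flow's source port changes) can also be scripted against a CSV export of the capture. The sketch below is an assumption-laden illustration: it expects rows of `time,src_ip,src_port,dst_ip,dst_port`, such as might be produced with `tshark -r capture.pcap -T fields -e frame.time_relative -e ip.src -e udp.srcport -e ip.dst -e udp.dstport -E separator=,` (the file name is hypothetical), and the addresses and ports in `SAMPLE` are invented, not taken from the real pcap.

```python
import csv
import io

# Made-up sample rows: time, src_ip, src_port, dst_ip, dst_port
SAMPLE = """\
35.0,192.0.2.10,3000,192.0.2.20,4000
35.02,192.0.2.10,3000,192.0.2.20,4000
250.0,192.0.2.10,3010,192.0.2.20,4000
"""


def parse_rows(text):
    """Parse the CSV export into a list of dicts with typed fields."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        t, src, sport, dst, dport = fields
        rows.append({"time": float(t), "src": src, "src_port": int(sport),
                     "dst": dst, "dst_port": int(dport)})
    return rows


def unexpected_senders(rows, dst_port, expected_src_port):
    """Rows sent to dst_port from any source port other than the expected
    one (same idea as udp.dstport==X && udp.srcport != Y)."""
    return [r for r in rows
            if r["dst_port"] == dst_port and r["src_port"] != expected_src_port]


def source_port_changes(rows, dst, dst_port):
    """List (time, old_port, new_port) whenever the source port of packets
    sent to dst:dst_port differs from the previous packet's source port."""
    changes = []
    last = None
    for r in rows:
        if r["dst"] != dst or r["dst_port"] != dst_port:
            continue
        if last is not None and r["src_port"] != last:
            changes.append((r["time"], last, r["src_port"]))
        last = r["src_port"]
    return changes
```

With the invented sample, `source_port_changes(parse_rows(SAMPLE), "192.0.2.20", 4000)` reports a single change at 250.0 s, which gives a timestamp to line up against the debug log.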

We'll try those branches + debugging enabled

@themrrobert

Do you know which commit you're referring to? I'd like to take a look at it, thanks! :)

@rfuchs
Member

rfuchs commented Feb 21, 2023

That would be 5c65690

@jmordica
Author

I have a recent example with the latest 11.2.1.4. Where can I send the pcap and debug logs to?

You will be able to see the port changing after the first page is received.

@rfuchs
Member

rfuchs commented Feb 27, 2023

You can attach them here, or send to my email.

@jmordica
Author

Just sent via email.

Thanks!

@jmordica
Author

jmordica commented Mar 6, 2023

I have confirmed that the issue is somewhere in the kernel module: if I disable the kernel module by setting the --table=-1 flag, large faxes work fine. What can I do to help identify the cause? I can produce debug logs and pcaps.

Thanks.

@rfuchs
Member

rfuchs commented Mar 7, 2023

First thing you should do is inspect the contents of /proc/rtpengine/0/list to see what the kernel's notion of the forwarding chain is. All ports involved are listed there. In the example from your screenshot above, you should see an entry corresponding to local 10.206.0.49:12102 and underneath that you would see the output IP/ports.

@jmordica
Author

I sent you a log of this command:

watch -n 1 -p "date >> /tmp/kmlog.txt; cat /proc/rtpengine/0/list >> /tmp/kmlog.txt"

along with the pcap to see if you can derive anything from that log output.
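A variant of this logging that records a timestamped snapshot only when the contents change makes it easier to spot the moment the kernel's forwarding entry is altered. The following is a minimal sketch, not part of rtpengine: the path comes from the comment above (table 0 assumed), and `read_list` and `log_changes` are hypothetical helper names.

```python
import time
from datetime import datetime

# Path as given in the thread; assumes kernel table 0.
PROC_LIST = "/proc/rtpengine/0/list"


def read_list(path=PROC_LIST):
    """Return the current contents of the kernel forwarding list."""
    with open(path) as f:
        return f.read()


def log_changes(read_fn, write_fn, interval_s=1.0, iterations=None):
    """Poll read_fn and pass a timestamped snapshot to write_fn whenever
    the contents differ from the previous poll.
    iterations=None polls until interrupted."""
    previous = None
    count = 0
    while iterations is None or count < iterations:
        current = read_fn()
        if current != previous:
            stamp = datetime.now().isoformat()
            write_fn("=== %s ===\n%s\n" % (stamp, current))
            previous = current
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval_s)
```

Typical use would be something like `with open("/tmp/kmlog.txt", "a") as log: log_changes(read_list, log.write)`, stopped with Ctrl-C once the fax has finished.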

@jmordica
Author

Hey, are there any updates here? Or should we just run without the kernel module?

Thanks.

@rfuchs
Member

rfuchs commented Mar 14, 2023

Right now this is not actionable as there's no indication that there's anything wrong with the code.

@jmordica
Author

How can I help debug further in order to identify the issue?

@rfuchs
Member

rfuchs commented Mar 15, 2023

The only things I can think of are 1) recreate the exact same scenario outside of GCP (with NAT and all) and see if it does the same thing, and 2) add a debug output log line to the kernel module to confirm that the source port is set correctly, for example in send_proxy_packet4(), where *uh is set, something like printk(KERN_INFO "source port %u\n", src->port); and then check dmesg for the output.

@jmordica
Author

I assume there isn't any way to conditionally use the kernel module when setting up a new offer, is there?

It would be nice to just not use the kernel module for certain calls.

@rfuchs
Member

rfuchs commented Mar 15, 2023

No, we don't have such a feature.
