Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Repeated peer confirmation and kernelisation messages #1633

Open
abalashov opened this issue Mar 22, 2023 · 31 comments
Open

Repeated peer confirmation and kernelisation messages #1633

abalashov opened this issue Mar 22, 2023 · 31 comments

Comments

@abalashov
Copy link
Contributor

abalashov commented Mar 22, 2023

In mr11.2.1.5 on kernel 4.18.0-425.10.1.el8_7.x86_64, on some calls, we are seeing hundreds of messages like this from rtpengine logs, essentially instantaneously:

Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 54928]: [core] Confirmed peer address as z.z.z.z:52108
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 43562]: [core] Confirmed peer address as y.y.y.y:39412
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 43562]: [core] Kernelizing media stream: y.y.y.y:39412
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 43562]: [core] Confirmed peer address as x.x.x.x:28202
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 54928]: [core] Confirmed peer address as z.z.z.z:52108
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 43562]: [core] Confirmed peer address as y.y.y.y:39412
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 43562]: [core] Kernelizing media stream: y.y.y.y:39412
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 43562]: [core] Confirmed peer address as x.x.x.x:28202
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 54928]: [core] Confirmed peer address as z.z.z.z:52108
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 43562]: [core] Confirmed peer address as y.y.y.y:39412
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 43562]: [core] Kernelizing media stream: y.y.y.y:39412
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 43562]: [core] Confirmed peer address as x.x.x.x:28202
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 54928]: [core] Confirmed peer address as z.z.z.z:52108
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 43562]: [core] Confirmed peer address as y.y.y.y:39412
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 43562]: [core] Kernelizing media stream: y.y.y.y:39412
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 43562]: [core] Confirmed peer address as x.x.x.x:28202
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 54928]: [core] Confirmed peer address as z.z.z.z:52108
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 43562]: [core] Confirmed peer address as y.y.y.y:39412
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 43562]: [core] Kernelizing media stream: y.y.y.y:39412
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 43562]: [core] Confirmed peer address as x.x.x.x:28202
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 54928]: [core] Confirmed peer address as z.z.z.z:52108
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 43562]: [core] Confirmed peer address as y.y.y.y:39412
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 43562]: [core] Kernelizing media stream: y.y.y.y:39412
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 43562]: [core] Confirmed peer address as x.x.x.x:28202
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 54928]: [core] Confirmed peer address as z.z.z.z:52108

On this call, it was 400+.

I confirmed via wire protocol trace that there was only a single offer, answer and delete command, so no re-invites. There's nothing unusual about the SDP, it's as banal as can be.

Offer:

v=0
o=Sonus_UAC 20105 5717 IN IP4 y.y.y.y
s=SIP Media Capabilities
c=IN IP4 y.y.y.y
t=0 0
m=audio 39412 RTP/AVP 0 8 18 101
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:18 G729/8000
a=fmtp:18 annexb=no
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-15
a=sendrecv
a=maxptime:20

Answer:

v=0
o=xxx 1679377785 1679377786 IN IP4 z.z.z.z
s=xxx
c=IN IP4 10.100.153.236
t=0 0
m=audio 52108 RTP/AVP 0 101
a=rtpmap:0 PCMU/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16
a=ptime:20
a=rtcp:52109 IN IP4 z.z.z.z

I cannot find any correlation between any SDP attributes and this phenomenon occurring.

I also checked to make sure there were no duplicate -j RTPENGINE rules in the INPUT chain; there are not.

Would appreciate any insight into the cause of this!

@rfuchs
Copy link
Member

rfuchs commented Mar 22, 2023

Looks like one of these ports was receiving media from two sources:

Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 43562]: [core] Confirmed peer address as x.x.x.x:28202
...
Mar 21 20:18:18 rtp-relay rtpengine[1237]: INFO: [SDgmmb201-ed3883e83407be71a657612939715d49-v300g00 port 43562]: [core] Confirmed peer address as y.y.y.y:39412

If you have something like media handover enabled then that would cause flapping like this.

@abalashov
Copy link
Contributor Author

Hi,

Thank you for catching that! I was wondering if that was the case but wasn't sure how you determined that from the log.

media handover is not enabled (unless it's enabled by default), but neither is strict-source. Any theories on why this could be happening?

@abalashov
Copy link
Contributor Author

abalashov commented Mar 22, 2023

Could such an outcome arise as a result of RTPEngine being shared among multiple, unrelated Kamailio instances? This is the case here. We have traced this port, :43562, to two relatively consecutive calls which were set off by Kamailio A and then Kamailio B. The flapping endpoints x.x.x.x and y.y.y.y were among parties such that only Kamailio A handles calls for x.x.x.x and only Kamailio B for y.y.y.y.

Since RTPEngine itself allocates its local ports, this was not expected to be a problem. Surely RTPEngine would not allocate a local port to two targets at once, even if they are the result of two different offer/answer commands? Or is this type of race condition possible if two Kamailios are connected to two different LWPs handling these commands, for example?

I suppose the fundamental question is: is it safe to share RTPEngine among multiple Kamailio instances? I assumed it was. My reasoning was: since every Kamailio worker process makes dedicated sockets to communicate with RTPEngine, there would have to be some sort of mutex across multiple LWPs/threads for port allocations anyway, because not all commands are going to land on the same process (if there are multiple ones) even from a single Kamailio instance.

@rfuchs
Copy link
Member

rfuchs commented Mar 22, 2023

Perhaps you're running into the same thing as #1634 ?

@abalashov
Copy link
Contributor Author

I think that may be the case! I had no idea; this wasn't mentioned in the rtpengine Kamailio control module docs, or anywhere else obviously in the docs.

Out of curiosity, why would the server ID matter to properly isolating the port allocations? As far as I understand it, RTPEngine just receives forwarding allocation requests (in the form of offer / answer commands) and responds to them. As far as I know, it does not and has no need to comprehend the nature of the endpoint that is steering it. Is there a larger philosophy behind this?

@abalashov
Copy link
Contributor Author

Also, does this mean that I should set different server_id values for Kamailio A and Kamailio B, or the same server_id? Which would lead to proper mutexing of the port allocations?

@rfuchs
Copy link
Member

rfuchs commented Mar 22, 2023

this wasn't mentioned in the rtpengine Kamailio control module docs, or anywhere else obviously in the docs.

We discovered this ourselves only recently 😊

Out of curiosity, why would the server ID matter to properly isolating the port allocations?

It doesn't as such, but with colliding server IDs and colliding process IDs and colliding sequence numbers, rtpengine ends up detecting messages as duplicates and will return the same SDP as for a different call.

Also, does this mean that I should set different server_id values for Kamailio A and Kamailio B, or the same server_id?

Unique server ID values for each Kamailio instance.

@abalashov
Copy link
Contributor Author

The word duplicate (case-insensitive) is not present in any of my RTPEngine logs going back many days, and all INFO-level messages are logged. Does this suggest that the server_id is not the problem?

@rfuchs
Copy link
Member

rfuchs commented Mar 22, 2023

The word duplicate (case-insensitive) is not present in any of my RTPEngine logs going back many days, and all INFO-level messages are logged. Does this suggest that the server_id is not the problem?

Depends on how reliable your logging is I guess. You can double check further if you see whether there's any overlap in process IDs between the Kamailio instances.

@abalashov
Copy link
Contributor Author

The logging is very reliable, and there are no overlapping Kamailio PIDs on the two systems. Kamailio A PIDs are in the 31,000 range, while Kamailio B PIDs are in the 1,500 range.

@rfuchs
Copy link
Member

rfuchs commented Mar 22, 2023

Definitely not the same situation then. The only other possible explanation is that the local port was re-used for a new call very quickly after being closed from a previous call, and a misbehaving remote RTP client that kept sending media to the old port after the call was closed.

@abalashov
Copy link
Contributor Author

abalashov commented Mar 22, 2023

I've got a call on Kamailio A that used this local port which began at 20:18:13, while the delete command for the old call, which was managed by Kamailio B, did not show up until 20:18:34. That call began around 20:10:00. So, they were just running concurrently. This means the same port was recycled for a new call while the old one using it was still running.

@rfuchs
Copy link
Member

rfuchs commented Mar 22, 2023

Do you have the SDPs to show this? Because it shouldn't be possible to open the same port twice.

@abalashov
Copy link
Contributor Author

abalashov commented Mar 22, 2023

Well, the SDP offer & answer for the more recent call is included in the initial post for this issue.

The SDP offer for the older, original call on Kamailio B is:

o=SomeUA 1679369936 1679369937 IN IP4 xxx.xxx.xxx.xxx
s=SomeUA
c=IN IP4 xxx.xxx.xxx.xxx
t=0 0
m=audio 59502 RTP/AVP 0 8 3 101
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:3 GSM/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16
a=rtcp:59503 IN IP4 xxx.xxx.xxx.xxx
a=ptime:20

The SDP answer fed into RTPEngine is a JsSIP answer:

v=0
o=- 7401020083321265013 2 IN IP4 127.0.0.1
s=-
t=0 0
a=msid-semantic: WMS qqD4FgwrVF4grhK7mW3au5gQD3wqI53V1HbE
m=audio 9 UDP/TLS/RTP/SAVP 0 8 101
c=IN IP4 0.0.0.0
a=rtcp:9 IN IP4 0.0.0.0
a=ice-ufrag:YkNm
a=ice-pwd:nbvhjLEAorR8eiUNZm01aZ/O
a=ice-options:trickle
a=fingerprint:sha-256 0D:42:29:3E:D3:96:C1:46:80:5A:81:1D:8B:33:DA:93:05:C7:84:FA:CD:55:36:D0:DA:D7:
2E:34:09:70:BF:BF
a=setup:active
a=mid:0
a=sendrecv
a=msid:qqD4FgwrVF4grhK7mW3au5gQD3wqI53V1HbE dabddf2a-e15f-42f2-928e-9b384c04b466
a=rtcp-mux
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:101 telephone-event/8000
a=ssrc:2621079166 cname:WyG+B/sSwZxPn0M5
a=ssrc:2621079166 msid:qqD4FgwrVF4grhK7mW3au5gQD3wqI53V1HbE dabddf2a-e15f-42f2-928e-9b384c04b466
a=ssrc:2621079166 mslabel:qqD4FgwrVF4grhK7mW3au5gQD3wqI53V1HbE
a=ssrc:2621079166 label:dabddf2a-e15f-42f2-928e-9b384c04b466

@rfuchs
Copy link
Member

rfuchs commented Mar 22, 2023

The output SDPs would be of interest, the ones that would show port 43562 being used.

@abalashov
Copy link
Contributor Author

Oh, okay.

For the older call, here is the output offer, which contains port 43562. This is coming from a FreeSWITCH host, being WebRTC-ified, and going out to a JsSIP endpoint.

v=0
o=SomeUA 1679369936 1679369937 IN IP4 4.4.4.4
s=SomeUA
c=IN IP4 4.4.4.4
t=0 0
m=audio 43562 UDP/TLS/RTP/SAVP 0 8 3 101
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:3 GSM/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16
a=sendrecv
a=rtcp:43563
a=rtcp-mux
a=setup:actpass
a=fingerprint:sha-256 E0:E6:F8:4C:F6:15:D7:9F:6E:77:E2:C5:27:5E:19:F2:99:D2:E7:8E:BF:3F:1B:65:2F:D2:38:A8:45:1D:FC:8B
a=ptime:20
a=ice-ufrag:e6Y7pmm9
a=ice-pwd:Rlt6cQ3aTdl8NHJMNrsj6QnVCe
a=candidate:GLfzdMdLvukjM9Qa 1 UDP 2130706431 4.4.4.4 43562 typ host
a=candidate:GLfzdMdLvukjM9Qa 2 UDP 2130706430 4.4.4.4 43563 typ host

And here's the output SDP answer for the more recent call, which is plain-UDP coming from some commercial SBC to FreeSWITCH, with RTPEngine in between the two. We are answering it and the answer offers port 43562:

v=0
o=SomeUA 1679377785 1679377786 IN IP4 xxx.xxx.xxx.xxx
s=SomeUA
c=IN IP4 xxx.xxx.xxx.xxx
t=0 0
m=audio 43562 RTP/AVP 0 101
a=rtpmap:0 PCMU/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16
a=sendrecv
a=rtcp:43563
a=ptime:20

@rfuchs
Copy link
Member

rfuchs commented Mar 22, 2023

and 4.4.4.4 is the same as xxx.xxx.xxx.xxx ?

@abalashov
Copy link
Contributor Author

abalashov commented Mar 22, 2023

No, 4.4.4.4 is the public Internet side of RTPEngine ("external_eip" interface), while xxx.xxx.xxx.xxx is a private interface that has the designation "external" (not to be confused with yet another interface called "internal"). It had not occurred to me, but the full invocation could be relevant:

rtpengi+    1237  0.8  0.7 1040792 28732 ?       Sl   Mar15  92:25 /usr/local/sbin/rtpengine -t 0 -i 
external_eip/xxx.xxx.xxx.xxx7!4.4.4.4-i external/xxx.xxx.xxx.xxx -i internal/yet.another.rfc1918.ip -n 
yet.another.rfc1918.ip:5050 --pidfile=/run/rtpengine.pid -m 35000 -M 65000 --no-fallback

@rfuchs
Copy link
Member

rfuchs commented Mar 22, 2023

Just to confirm, in -i external_eip/xxx.xxx.xxx.xxx!4.4.4.4 -i external/xxx.xxx.xxx.xxx both xxx.xxx.xxx.xxx are the same? Because I just tried this, and both interfaces end up using the same port pool and so shouldn't be allowed to open the same port at the same time.

@abalashov
Copy link
Contributor Author

abalashov commented Mar 22, 2023

That's right, xxx.xxx.xxx.xxx is the same underlying private IP (AWS EC2) in both cases.

And, I hear you, but I think they are indeed opening the same port at the same time. At least, that's where all the info points. Would you not agree?

@rfuchs
Copy link
Member

rfuchs commented Mar 22, 2023

That can't be right. The kernel module doesn't allow this either. It would report an error if the same local IP:port was added a second time.

@abalashov
Copy link
Contributor Author

Well, I no longer have the RTP captures -- they were rotated out -- so I can't prove it. But the output SDP came straight from the captures. The offer toward the WebRTC endpoint was at ~20:10:00, while the offer toward the regular carrier endpoint was at ~20:18:13, and the delete command for the WebRTC call didn't come until 20:18:34.

@abalashov
Copy link
Contributor Author

The immediate resolution to this problem has been to separate the RTPEngines used for Kamailio A and Kamailio B. We are watching for the same signs of port collisions, and will follow up and let you know. It's certainly possible that collisions occur for reasons other than offer/answer requests from different controlling Kamailio instances.

@rfuchs
Copy link
Member

rfuchs commented Mar 23, 2023

There's no distinction between requests coming from different Kamailio instances so I'm not sure how that would make a difference.

@abalashov
Copy link
Contributor Author

abalashov commented Mar 23, 2023 via email

@abalashov
Copy link
Contributor Author

abalashov commented Mar 23, 2023

You were certainly correct that merely eliminating multiple Kamailio instances controlling the same RTPEngine instance did not solve the problem.

But, so far, the examples we have seen of this problem all involve overlapping port assignment on calls where one call goes out the public external_eip egress to a WebRTC agent, and the other goes out/comes into the external ingress (private). Both are bound to the same interface address, but one is !aliased with a public address.

I know you said that port collisions between these are impossible, but I'm not sure that's the case, given what we're seeing. We'll continue digging.

@rfuchs
Copy link
Member

rfuchs commented Mar 23, 2023

Somehow this smells like a misconfiguration. Maybe double check which address the actual sockets are bound to (netstat/ss -anp) and maybe check if there's a stray config file in /etc.

@abalashov
Copy link
Contributor Author

I explored the misconfiguration idea as best as I could, but can't see anything. There's no stray config files.

@abalashov
Copy link
Contributor Author

Does it not strike you as odd that external_ip and external have separate port ranges despite being bound to the "same" IP?

[root@rtp-relay ~]# telnet 10.255.138.166 8080
Trying 10.255.138.166...
Connected to 10.255.138.166.
Escape character is '^]'.
list interfaces
Interface 'external_eip' address '10.255.97.64' (IPv4)
 Port range: 35000 - 65000
 Ports used:     2 / 30001 (  0.0%)
 Last port used: 35014
Interface 'external' address '10.255.97.64' (IPv4)
 Port range: 35000 - 65000
 Ports used:     2 / 30001 (  0.0%)
 Last port used: 35014
Interface 'internal' address '10.255.138.166' (IPv4)
 Port range: 35000 - 65000
 Ports used:     2 / 30001 (  0.0%)
 Last port used: 35020
Connection closed by foreign host.

@abalashov
Copy link
Contributor Author

We tried something new: we aliased the external AWS EIP to a different interface, and suddenly, no more port collisions:

Trying 10.255.138.166...
Connected to 10.255.138.166.
Escape character is '^]'.
list interfaces
Interface 'external_eip' address '10.255.97.64' (IPv4)
 Port range: 35000 - 65000
 Ports used:     4 / 30001 (  0.0%)
 Last port used: 35022
Interface 'external' address '10.255.99.170' (IPv4)
 Port range: 35000 - 65000
 Ports used:     0 / 30001 (  0.0%)
 Last port used:     0
Interface 'internal' address '10.255.138.166' (IPv4)
 Port range: 35000 - 65000
 Ports used:     4 / 30001 (  0.0%)
 Last port used: 35020
Connection closed by foreign host.

I'm sorry, but I think you are mistaken about the separation of ports between the aliased external_eip and external interface (same physical interface).

@rfuchs
Copy link
Member

rfuchs commented Mar 24, 2023

Does it not strike you as odd that external_ip and external have separate port ranges despite being bound to the "same" IP?

[root@rtp-relay ~]# telnet 10.255.138.166 8080
Trying 10.255.138.166...
Connected to 10.255.138.166.
Escape character is '^]'.
list interfaces
Interface 'external_eip' address '10.255.97.64' (IPv4)
 Port range: 35000 - 65000
 Ports used:     2 / 30001 (  0.0%)
 Last port used: 35014
Interface 'external' address '10.255.97.64' (IPv4)
 Port range: 35000 - 65000
 Ports used:     2 / 30001 (  0.0%)
 Last port used: 35014
Interface 'internal' address '10.255.138.166' (IPv4)
 Port range: 35000 - 65000
 Ports used:     2 / 30001 (  0.0%)
 Last port used: 35020
Connection closed by foreign host.

No, that's just how the output is formatted (one output per logical interface). The port pools are actually the same. You can see that the number of "ports used" and the "last port used" is always the same.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants