Add Connection Shutdown to TSG (microsoft#1473)
nibanks authored Apr 13, 2021
1 parent 2018103 commit 27ceb5d
Showing 1 changed file with 38 additions and 12 deletions.
50 changes: 38 additions & 12 deletions docs/TSG.md
@@ -14,23 +14,30 @@ This document is meant to be a step-by-step guide for trouble shooting any issue
# Trouble Shooting a Functional Issue

1. [The handshake is failing for some or all of my connections.](#why-is-the-handshake-failing)
2. [The connection is unexpectedly disconnecting.](#why-is-the-connection-disconnecting)
1. [I am getting an error code I don't understand.](#understanding-error-codes)
2. [The connection is unexpectedly shutting down.](#why-is-the-connection-shutting-down)
3. [No application (stream) data seems to be flowing.](#why-isnt-application-data-flowing)

## Why is the handshake failing?
## Understanding Error Codes

1. [The handshake failed with an error code I don't understand.](#mapping-error-codes-for-handshake-failures)
2. [Does it happen on Linux, only with large number of connections?](#linux-file-handle-limit-too-small)
3.
Some error codes are MsQuic-specific (`QUIC_STATUS_*`), and some are simply passed through from the platform. You can find the MsQuic-specific error codes in the platform-specific header ([msquic_posix.h](../src/inc/msquic_posix.h), [msquic_winkernel.h](../src/inc/msquic_winkernel.h), or [msquic_winuser.h](../src/inc/msquic_winuser.h)).

### Mapping Error Codes for Handshake Failures
From [msquic_winuser.h](../src/inc/msquic_winuser.h):
```C
#ifndef ERROR_QUIC_HANDSHAKE_FAILURE
#define ERROR_QUIC_HANDSHAKE_FAILURE _HRESULT_TYPEDEF_(0x80410000L)
#endif

> TODO
#ifndef ERROR_QUIC_VER_NEG_FAILURE
#define ERROR_QUIC_VER_NEG_FAILURE _HRESULT_TYPEDEF_(0x80410001L)
#endif

...
```

### Linux File Handle Limit Too Small

In many Linux setups, the default per-process file handle limit is relatively small (~1024). In scenarios where lots of (usually client) connections are opened, a large number of sockets (a type of file handle) are created. Eventually the handle limit is reached and connections start failing because new sockets cannot be created. To fix this, you will need to increase the handle limit.
In many Linux setups, the default per-process file handle limit is relatively small (~1024). In scenarios where many (usually client) connections are opened, a large number of sockets (a type of file handle) are created. Eventually the handle limit is reached and connections start failing (error codes `0x16` or `0xbebc202`) because new sockets cannot be created. To fix this, you will need to increase the handle limit.

To query the maximum limit you may set:
```
@@ -42,9 +49,28 @@ To set a new limit (up to the max):
ulimit -n newValue
```
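As an alternative to `ulimit`, a process can raise its own soft limit at startup. The following is a minimal sketch, assuming a Linux/POSIX environment; it is not part of MsQuic, and the function name `raise_fd_limit_to_max` is hypothetical. It uses the standard `getrlimit`/`setrlimit` calls to lift the soft `RLIMIT_NOFILE` limit up to the hard limit.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper (not an MsQuic API): raise the soft open-file limit
 * of the current process to its hard limit.
 * Returns 0 on success, -1 on failure (errno is set by the failing call). */
int raise_fd_limit_to_max(void)
{
    struct rlimit lim;

    /* Query the current soft (rlim_cur) and hard (rlim_max) limits. */
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) {
        return -1;
    }
    printf("open-file limit: soft=%llu hard=%llu\n",
           (unsigned long long)lim.rlim_cur,
           (unsigned long long)lim.rlim_max);

    /* An unprivileged process may raise its soft limit up to the hard limit. */
    lim.rlim_cur = lim.rlim_max;
    return setrlimit(RLIMIT_NOFILE, &lim);
}
```

Calling this once before opening connections would have a similar effect to `ulimit -n` with the maximum value, but only for the calling process and its children.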

## Why is the connection disconnecting?
## Why is the connection shutting down?

> TODO
1. [What does this QUIC_CONNECTION_EVENT_SHUTDOWN_INITIATED_BY_TRANSPORT event mean?](#understanding-shutdown-by-transport)
2. [What does this QUIC_CONNECTION_EVENT_SHUTDOWN_INITIATED_BY_APP event mean?](#understanding-shutdown-by-app)

### Understanding shutdown by Transport.

There are two ways for a connection to be shut down: by the application layer or by the transport layer (i.e. the QUIC layer). The `QUIC_CONNECTION_EVENT_SHUTDOWN_INITIATED_BY_TRANSPORT` event occurs when the transport shuts the connection down. Generally, the transport shuts down the connection either when there is some kind of error or when the negotiated idle period has elapsed.

```
[2]6F30.34B0::2021/04/13-09:22:48.297449100 [Microsoft-Quic][conn][0x1CF25AC46B0] Transport Shutdown: 18446744071566327813 (Remote=0) (QS=1)
```

Above is an example event collected during an attempt to connect to a non-existent server. Eventually the connection failed and the transport indicated the event with the appropriate error code. The error code (`18446744071566327813`) is `0xFFFFFFFF80410005` in hexadecimal. `QS=1` indicates that the value is a `QUIC_STATUS`, and its low 32 bits, `0x80410005`, correspond to `ERROR_QUIC_CONNECTION_IDLE`. For more details on understanding error codes, see [here](#understanding-error-codes).
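The conversion described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the helper names `trace_code_to_status` and `status_name` are hypothetical, the lookup covers only the three values quoted in this document from `msquic_winuser.h`, and it applies only when the trace shows `QS=1`.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse the decimal error code printed in the trace
 * and keep only the low 32 bits, which hold the QUIC_STATUS value. */
uint32_t trace_code_to_status(const char *decimal)
{
    unsigned long long raw = strtoull(decimal, NULL, 10);
    return (uint32_t)(raw & 0xFFFFFFFFu);
}

/* Hypothetical helper: name the few Windows user-mode status values quoted
 * in this document. Anything else must be looked up in the platform header. */
const char *status_name(uint32_t status)
{
    switch (status) {
    case 0x80410000u: return "ERROR_QUIC_HANDSHAKE_FAILURE";
    case 0x80410001u: return "ERROR_QUIC_VER_NEG_FAILURE";
    case 0x80410005u: return "ERROR_QUIC_CONNECTION_IDLE";
    default:          return "unknown (check the platform header)";
    }
}
```

For example, `status_name(trace_code_to_status("18446744071566327813"))` yields the name for `0x80410005`.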

### Understanding shutdown by App.

As indicated in [Understanding shutdown by Transport](#understanding-shutdown-by-transport), there are two ways for a connection to be shut down. The `QUIC_CONNECTION_EVENT_SHUTDOWN_INITIATED_BY_APP` event occurs when the peer application has explicitly shut down the connection. In MsQuic API terms, this means the app called [ConnectionShutdown](./api/connectionshutdown.md).

> TODO - Add an example event
The error code indicated in this event is completely application defined (type of `QUIC_UINT62`). The transport has no understanding of the meaning of this value and never generates these error codes itself. Mapping these values to a meaning therefore requires the application protocol's documentation.
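Because the meaning lives entirely in the application protocol, an application typically keeps its own table of codes. The sketch below assumes a made-up protocol: every code value and name in it is hypothetical and exists only to show the shape of such a mapping, including a check that the value fits in 62 bits.

```c
#include <stdint.h>

/* Largest value representable in 62 bits. */
#define APP_ERROR_MAX ((UINT64_C(1) << 62) - 1)

/* Hypothetical application protocol: these codes and names are invented
 * for illustration and are not defined by MsQuic or by QUIC. */
const char *describe_app_error(uint64_t code)
{
    if (code > APP_ERROR_MAX) {
        return "invalid (does not fit in 62 bits)";
    }
    switch (code) {
    case 0:  return "MYAPP_NO_ERROR";
    case 1:  return "MYAPP_PROTOCOL_ERROR";
    case 2:  return "MYAPP_SHUTTING_DOWN";
    default: return "unknown (see the application protocol documentation)";
    }
}
```

An application could call `describe_app_error` with the code it receives in the shutdown event to log a readable reason.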

## Why isn't application data flowing?

@@ -61,7 +87,7 @@ ulimit -n newValue
## Why is Performance bad across all my Connections?

1. [The work load isn't spreading evenly across cores.](diagnosing-rss-issues)
1. [The work load isn't spreading evenly across cores.](#diagnosing-rss-issues)
2.

### Diagnosing RSS Issues