INSTALL.DPDK: Replace tabs with spaces
Signed-off-by: Ciara Loftus <[email protected]>
Signed-off-by: Ben Pfaff <[email protected]>
cloftus authored and blp committed Jun 3, 2016
1 parent 3bd4ae2 commit 7058ec5

Performance Tuning:
-------------------
1. PMD affinitization

A poll mode driver (pmd) thread handles the I/O of all DPDK
interfaces assigned to it. A pmd thread will busy loop through
the assigned port/rxq's polling for packets, switch the packets
and send to a tx port if required. Typically, it is found that
a pmd thread is CPU bound, meaning that the greater the CPU
occupancy the pmd thread can get, the better the performance. To
that end, it is good practice to ensure that a pmd thread has as
many cycles on a core available to it as possible. This can be
achieved by affinitizing the pmd thread with a core that has no
other workload. See section 7 below for a description of how to
isolate cores for this purpose also.

The following command can be used to specify the affinity of the
pmd thread(s).

`ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=<hex string>`

By setting a bit in the mask, a pmd thread is created and pinned
to the corresponding CPU core, e.g. to run a pmd thread on core 1:

`ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=2`

For more information, please refer to the Open_vSwitch TABLE section in

`man ovs-vswitchd.conf.db`

Note that a pmd thread on a NUMA node is only created if there is
at least one DPDK interface from that NUMA node added to OVS.

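As a short illustration (the core numbers here are arbitrary), the mask is
built by setting one bit per core that should run a pmd thread:

```
# Hypothetical example: run pmd threads on cores 2 and 4.
# Bit 2 (0x4) + bit 4 (0x10) gives mask 0x14, written as "14" in the same
# style as the other mask examples in this document.
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=14
```
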
2. Multiple poll mode driver threads

With pmd multi-threading support, OVS creates one pmd thread
for each NUMA node by default. However, it can be seen that in cases
where there are multiple ports/rxq's producing traffic, performance
can be improved by creating multiple pmd threads running on separate
cores. These pmd threads can then share the workload by each being
responsible for different ports/rxq's. Assignment of ports/rxq's to
pmd threads is done automatically.

The following command can be used to specify the affinity of the
pmd threads.

`ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=<hex string>`

A set bit in the mask means a pmd thread is created and pinned
to the corresponding CPU core, e.g. to run pmd threads on cores 1 and 2:

`ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=6`

For more information, please refer to the Open_vSwitch TABLE section in

`man ovs-vswitchd.conf.db`

For example, when using dpdk and dpdkvhostuser ports in a bi-directional
VM loopback as shown below, spreading the workload over 2 or 4 pmd
threads shows significant improvements as there will be more total CPU
occupancy available.

NIC port0 <-> OVS <-> VM <-> OVS <-> NIC port 1

The following command can be used to confirm that the port/rxq assignment
to pmd threads is as required:

`ovs-appctl dpif-netdev/pmd-rxq-show`

This can also be checked with:

```
top -H
taskset -p <pid_of_pmd>
```
To understand where most of the pmd thread time is spent and whether the
caches are being utilized, these commands can be used:
```
# Clear previous stats
ovs-appctl dpif-netdev/pmd-stats-clear
# Check current stats
ovs-appctl dpif-netdev/pmd-stats-show
```
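For instance, a simple before/after reading can be taken as follows (the
measurement interval is arbitrary):

```
# Reset the pmd counters, let traffic run for a while, then read them back.
ovs-appctl dpif-netdev/pmd-stats-clear
sleep 60
ovs-appctl dpif-netdev/pmd-stats-show
```
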
3. DPDK port Rx Queues

`ovs-vsctl set Interface <DPDK interface> options:n_rxq=<integer>`

The command above sets the number of rx queues for the DPDK interface.
The rx queues are assigned to pmd threads on the same NUMA node in a
round-robin fashion. For more information, please refer to the
Open_vSwitch TABLE section in

`man ovs-vswitchd.conf.db`

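For example (the port name and queue count are only illustrative), to give a
port four rx queues and then check how they were spread across pmd threads:

```
# Assume a DPDK port named dpdk0; request 4 rx queues for it.
ovs-vsctl set Interface dpdk0 options:n_rxq=4
# Confirm which pmd thread polls each rxq.
ovs-appctl dpif-netdev/pmd-rxq-show
```
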
4. Exact Match Cache

Each pmd thread contains one EMC. After initial flow setup in the
datapath, the EMC contains a single table and provides the lowest level
(fastest) switching for DPDK ports. If there is a miss in the EMC then
the next level where switching will occur is the datapath classifier.
Missing in the EMC and looking up in the datapath classifier incurs a
significant performance penalty. If lookup misses occur in the EMC
because it is too small to handle the number of flows, its size can
be increased. The EMC size can be modified by editing the define
EM_FLOW_HASH_SHIFT in lib/dpif-netdev.c.

As mentioned above, an EMC is per pmd thread. So an alternative way of
increasing the aggregate amount of possible flow entries in EMC and
avoiding datapath classifier lookups is to have multiple pmd threads
running. This can be done as described in section 2.

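A rough sketch of the rebuild step described above (the exact value reported
by grep depends on the OVS version, and any change requires recompiling):

```
# Locate the EMC size define in the OVS source tree.
grep -n "define EM_FLOW_HASH_SHIFT" lib/dpif-netdev.c
# Increase the shift (each +1 doubles the number of EMC entries per pmd
# thread), then rebuild and reinstall.
make && make install
```
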
5. Compiler options

The default compiler optimization level is '-O2'. Changing this to
more aggressive compiler optimizations such as '-O3' or
'-Ofast -march=native' with gcc can produce performance gains.

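One possible way to pass such flags when building OVS from source (a sketch
only; other configure options, e.g. any DPDK-related ones, are omitted):

```
# Configure and build OVS with more aggressive gcc optimizations.
./configure CFLAGS="-Ofast -march=native"
make
```
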
6. Simultaneous Multithreading (SMT)

With SMT enabled, one physical core appears as two logical cores
which can improve performance.

SMT can be utilized to add additional pmd threads without consuming
additional physical cores. Additional pmd threads may be added in the
same manner as described in section 2. If trying to minimize the use
of physical cores for pmd threads, care must be taken to set the
correct bits in the pmd-cpu-mask to ensure that the pmd threads are
pinned to SMT siblings.

For example, when using 2x 10 core processors in a dual socket system
with HT enabled, /proc/cpuinfo will report 40 logical cores. To use
two logical cores which share the same physical core for pmd threads,
the following command can be used to identify a pair of logical cores.

`cat /sys/devices/system/cpu/cpuN/topology/thread_siblings_list`

where N is the logical core number. In this example, it would show that
cores 1 and 21 share the same physical core. The pmd-cpu-mask to enable
two pmd threads running on these two logical cores (one physical core)
is:

`ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=100002`

Note that SMT is enabled by the Hyper-Threading section in the
BIOS, and as such will apply to the whole system. So the impact of
enabling/disabling it for the whole system should be considered,
e.g. if workloads on the system can scale across multiple cores,
SMT may be very beneficial. However, if they do not and perform best
on a single physical core, SMT may not be beneficial.

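A small helper to enumerate the sibling pairs for every logical core (purely
illustrative; it only reads the standard sysfs topology files):

```
# Print each logical core together with its SMT sibling(s).
for c in /sys/devices/system/cpu/cpu[0-9]*; do
    echo "$(basename "$c"): $(cat "$c/topology/thread_siblings_list")"
done
```
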
7. The isolcpus kernel boot parameter

isolcpus can be used on the kernel bootline to isolate cores from the
kernel scheduler and hence dedicate them to OVS or other packet
forwarding related workloads. For example a Linux kernel boot-line
could be:

```
GRUB_CMDLINE_LINUX_DEFAULT="quiet hugepagesz=1G hugepages=4
default_hugepagesz=1G 'intel_iommu=off' isolcpus=1-19"
```

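Tying this back to section 1, the pmd threads can then be pinned onto a
subset of the isolated cores, for example cores 1 and 2 (a usage sketch
based on the boot line above):

```
# Cores 1-19 are isolated from the kernel scheduler by the isolcpus range
# above; place two pmd threads on cores 1 and 2 (mask 0x6).
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=6
```
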
8. NUMA/Cluster On Die

Ideally inter NUMA datapaths should be avoided where possible as packets
will go across QPI and there may be a slight performance penalty when
compared with intra NUMA datapaths. On Intel Xeon Processor E5 v3,
Cluster On Die is introduced on models that have 10 cores or more.
This makes it possible to logically split a socket into two NUMA regions
and again it is preferred where possible to keep critical datapaths
within the one cluster.

It is good practice to ensure that threads that are in the datapath are
pinned to cores in the same NUMA area, e.g. pmd threads and QEMU vCPUs
responsible for forwarding.

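A minimal sketch for checking locality (the PCI address is a placeholder;
the sysfs attribute and lscpu output are standard on Linux):

```
# NUMA node of a candidate DPDK NIC.
cat /sys/bus/pci/devices/0000:05:00.0/numa_node
# Which cores belong to which NUMA node.
lscpu | grep "NUMA node"
```
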
9. Rx Mergeable buffers

Rx Mergeable buffers is a virtio feature that allows chaining of multiple
virtio descriptors to handle large packet sizes. As such, large packets
are handled by reserving and chaining multiple free descriptors
together. Mergeable buffer support is negotiated between the virtio
driver and virtio device and is supported by the DPDK vhost library.
This behavior is typically supported and enabled by default, however
in the case where the user knows that rx mergeable buffers are not needed
i.e. jumbo frames are not needed, it can be forced off by adding
mrg_rxbuf=off to the QEMU command line options. By not reserving multiple
chains of descriptors it will make more individual virtio descriptors
available for rx to the guest using dpdkvhost ports and this can improve
performance.

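For illustration, mergeable rx buffers can be disabled on a vhost-user
interface with QEMU options along the following lines (the chardev id,
socket path and netdev id are placeholders; the rest of the QEMU command
line is omitted):

```
# Attach a vhost-user netdev and disable mergeable rx buffers on the
# corresponding virtio-net device.
-chardev socket,id=char0,path=/usr/local/var/run/openvswitch/vhost-user-1 \
-netdev type=vhost-user,id=net0,chardev=char0,vhostforce \
-device virtio-net-pci,netdev=net0,mrg_rxbuf=off
```
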
10. Packet processing in the guest

Whether simply forwarding packets from one interface to another or doing
more complex packet processing in the guest, it is good practice to ensure
that the thread performing this work has as much CPU occupancy as possible.
For example, when the DPDK sample application `testpmd` is used to forward
packets in the guest, multiple QEMU vCPU threads can be created. Taskset
can then be used to affinitize the vCPU thread responsible for forwarding
to a dedicated core not used for other general processing on the host
system.

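A hedged sketch of that pinning step on the host (the thread id and core
number are placeholders; the vCPU thread id can be found with `top -H`):

```
# Pin the QEMU vCPU thread that does the forwarding work to host core 4.
taskset -pc 4 <vcpu_thread_id>
```
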
11. DPDK virtio pmd in the guest

dpdkvhostcuse or dpdkvhostuser ports can be used to accelerate the path
to the guest using the DPDK vhost library. This library is compatible with
virtio-net drivers in the guest but significantly better performance can
be observed when using the DPDK virtio pmd driver in the guest. The DPDK
`testpmd` application can be used in the guest as an example application
that forwards packets from one DPDK vhost port to another. An example of
running `testpmd` in the guest can be seen here.

```
./testpmd -c 0x3 -n 4 --socket-mem 512 -- --burst=64 -i --txqflags=0xf00
--disable-hw-vlan --forward-mode=io --auto-start
```

See below for information on dpdkvhostcuse and dpdkvhostuser ports.
See [DPDK Docs] for more information on `testpmd`.

DPDK Rings :
------------