Skip to content

Commit

Permalink
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Browse files Browse the repository at this point in the history
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-07-03

The following pull-request contains BPF updates for your *net-next* tree.

There is a minor merge conflict in mlx5 due to 8960b38 ("linux/dim:
Rename externally used net_dim members") which has been pulled into your
tree in the meantime, but resolution seems not that bad ... getting current
bpf-next out now before there's coming more on mlx5. ;) I'm Cc'ing Saeed
just so he's aware of the resolution below:

** First conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:

  <<<<<<< HEAD
  static int mlx5e_open_cq(struct mlx5e_channel *c,
                           struct dim_cq_moder moder,
                           struct mlx5e_cq_param *param,
                           struct mlx5e_cq *cq)
  =======
  int mlx5e_open_cq(struct mlx5e_channel *c, struct net_dim_cq_moder moder,
                    struct mlx5e_cq_param *param, struct mlx5e_cq *cq)
  >>>>>>> e5a3e25

Resolution is to take the second chunk and rename net_dim_cq_moder into
dim_cq_moder. Also the signature for mlx5e_open_cq() in ...

  drivers/net/ethernet/mellanox/mlx5/core/en.h +977

... and in mlx5e_open_xsk() ...

  drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c +64

... needs the same rename from net_dim_cq_moder into dim_cq_moder.

** Second conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:

  <<<<<<< HEAD
          int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
          struct dim_cq_moder icocq_moder = {0, 0};
          struct net_device *netdev = priv->netdev;
          struct mlx5e_channel *c;
          unsigned int irq;
  =======
          struct net_dim_cq_moder icocq_moder = {0, 0};
  >>>>>>> e5a3e25

Take the second chunk and rename net_dim_cq_moder into dim_cq_moder
as well.

Let me know if you run into any issues. Anyway, the main changes are:

1) Long-awaited AF_XDP support for mlx5e driver, from Maxim.

2) Addition of two new per-cgroup BPF hooks for getsockopt and
   setsockopt along with a new sockopt program type which allows more
   fine-grained pass/reject settings for containers. Also add a sock_ops
   callback that can be selectively enabled on a per-socket basis and is
   executed for every RTT to help tracking TCP statistics, both features
   from Stanislav.

3) Follow-up fix from loops in precision tracking which was not propagating
   precision marks and as a result verifier assumed that some branches were
   not taken and therefore wrongly removed as dead code, from Alexei.

4) Fix BPF cgroup release synchronization race which could lead to a
   double-free if a leaf's cgroup_bpf object is released and a new BPF
   program is attached to the one of ancestor cgroups in parallel, from Roman.

5) Support for bulking XDP_TX on veth devices which improves performance
   in some cases by around 9%, from Toshiaki.

6) Allow for lookups into BPF devmap and improve feedback when calling into
   bpf_redirect_map() as lookup is now performed right away in the helper
   itself, from Toke.

7) Add support for fq's Earliest Departure Time to the Host Bandwidth
   Manager (HBM) sample BPF program, from Lawrence.

8) Various cleanups and minor fixes all over the place from many others.
====================

Signed-off-by: David S. Miller <[email protected]>
  • Loading branch information
davem330 committed Jul 4, 2019
2 parents e2c7469 + e5a3e25 commit c4cde58
Show file tree
Hide file tree
Showing 98 changed files with 6,197 additions and 841 deletions.
1 change: 1 addition & 0 deletions Documentation/bpf/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,7 @@ Program types
.. toctree::
:maxdepth: 1

prog_cgroup_sockopt
prog_cgroup_sysctl
prog_flow_dissector

Expand Down
93 changes: 93 additions & 0 deletions Documentation/bpf/prog_cgroup_sockopt.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,93 @@
.. SPDX-License-Identifier: GPL-2.0
============================
BPF_PROG_TYPE_CGROUP_SOCKOPT
============================

``BPF_PROG_TYPE_CGROUP_SOCKOPT`` program type can be attached to two
cgroup hooks:

* ``BPF_CGROUP_GETSOCKOPT`` - called every time process executes ``getsockopt``
system call.
* ``BPF_CGROUP_SETSOCKOPT`` - called every time process executes ``setsockopt``
system call.

The context (``struct bpf_sockopt``) has associated socket (``sk``) and
all input arguments: ``level``, ``optname``, ``optval`` and ``optlen``.

BPF_CGROUP_SETSOCKOPT
=====================

``BPF_CGROUP_SETSOCKOPT`` is triggered *before* the kernel handling of
sockopt and it has writable context: it can modify the supplied arguments
before passing them down to the kernel. This hook has access to the cgroup
and socket local storage.

If BPF program sets ``optlen`` to -1, the control will be returned
back to the userspace after all other BPF programs in the cgroup
chain finish (i.e. kernel ``setsockopt`` handling will *not* be executed).

Note, that ``optlen`` can not be increased beyond the user-supplied
value. It can only be decreased or set to -1. Any other value will
trigger ``EFAULT``.

Return Type
-----------

* ``0`` - reject the syscall, ``EPERM`` will be returned to the userspace.
* ``1`` - success, continue with next BPF program in the cgroup chain.

BPF_CGROUP_GETSOCKOPT
=====================

``BPF_CGROUP_GETSOCKOPT`` is triggered *after* the kernel handing of
sockopt. The BPF hook can observe ``optval``, ``optlen`` and ``retval``
if it's interested in whatever kernel has returned. BPF hook can override
the values above, adjust ``optlen`` and reset ``retval`` to 0. If ``optlen``
has been increased above initial ``getsockopt`` value (i.e. userspace
buffer is too small), ``EFAULT`` is returned.

This hook has access to the cgroup and socket local storage.

Note, that the only acceptable value to set to ``retval`` is 0 and the
original value that the kernel returned. Any other value will trigger
``EFAULT``.

Return Type
-----------

* ``0`` - reject the syscall, ``EPERM`` will be returned to the userspace.
* ``1`` - success: copy ``optval`` and ``optlen`` to userspace, return
``retval`` from the syscall (note that this can be overwritten by
the BPF program from the parent cgroup).

Cgroup Inheritance
==================

Suppose, there is the following cgroup hierarchy where each cgroup
has ``BPF_CGROUP_GETSOCKOPT`` attached at each level with
``BPF_F_ALLOW_MULTI`` flag::

A (root, parent)
\
B (child)

When the application calls ``getsockopt`` syscall from the cgroup B,
the programs are executed from the bottom up: B, A. First program
(B) sees the result of kernel's ``getsockopt``. It can optionally
adjust ``optval``, ``optlen`` and reset ``retval`` to 0. After that
control will be passed to the second (A) program which will see the
same context as B including any potential modifications.

Same for ``BPF_CGROUP_SETSOCKOPT``: if the program is attached to
A and B, the trigger order is B, then A. If B does any changes
to the input arguments (``level``, ``optname``, ``optval``, ``optlen``),
then the next program in the chain (A) will see those changes,
*not* the original input ``setsockopt`` arguments. The potentially
modified values will be then passed down to the kernel.

Example
=======

See ``tools/testing/selftests/bpf/progs/sockopt_sk.c`` for an example
of BPF program that handles socket options.
16 changes: 15 additions & 1 deletion Documentation/networking/af_xdp.rst
Original file line number Diff line number Diff line change
Expand Up @@ -220,7 +220,21 @@ Usage
In order to use AF_XDP sockets there are two parts needed. The
user-space application and the XDP program. For a complete setup and
usage example, please refer to the sample application. The user-space
side is xdpsock_user.c and the XDP side xdpsock_kern.c.
side is xdpsock_user.c and the XDP side is part of libbpf.

The XDP code sample included in tools/lib/bpf/xsk.c is the following::

SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
{
int index = ctx->rx_queue_index;

// A set entry here means that the correspnding queue_id
// has an active AF_XDP socket bound to it.
if (bpf_map_lookup_elem(&xsks_map, &index))
return bpf_redirect_map(&xsks_map, index, 0);

return XDP_PASS;
}

Naive ring dequeue and enqueue could look like this::

Expand Down
12 changes: 7 additions & 5 deletions drivers/net/ethernet/intel/i40e/i40e_xsk.c
Original file line number Diff line number Diff line change
Expand Up @@ -641,8 +641,8 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
struct i40e_tx_desc *tx_desc = NULL;
struct i40e_tx_buffer *tx_bi;
bool work_done = true;
struct xdp_desc desc;
dma_addr_t dma;
u32 len;

while (budget-- > 0) {
if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
Expand All @@ -651,21 +651,23 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
break;
}

if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len))
if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &desc))
break;

dma_sync_single_for_device(xdp_ring->dev, dma, len,
dma = xdp_umem_get_dma(xdp_ring->xsk_umem, desc.addr);

dma_sync_single_for_device(xdp_ring->dev, dma, desc.len,
DMA_BIDIRECTIONAL);

tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use];
tx_bi->bytecount = len;
tx_bi->bytecount = desc.len;

tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
tx_desc->buffer_addr = cpu_to_le64(dma);
tx_desc->cmd_type_offset_bsz =
build_ctob(I40E_TX_DESC_CMD_ICRC
| I40E_TX_DESC_CMD_EOP,
0, len, 0);
0, desc.len, 0);

xdp_ring->next_to_use++;
if (xdp_ring->next_to_use == xdp_ring->count)
Expand Down
15 changes: 9 additions & 6 deletions drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
Original file line number Diff line number Diff line change
Expand Up @@ -571,8 +571,9 @@ static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget)
union ixgbe_adv_tx_desc *tx_desc = NULL;
struct ixgbe_tx_buffer *tx_bi;
bool work_done = true;
u32 len, cmd_type;
struct xdp_desc desc;
dma_addr_t dma;
u32 cmd_type;

while (budget-- > 0) {
if (unlikely(!ixgbe_desc_unused(xdp_ring)) ||
Expand All @@ -581,14 +582,16 @@ static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget)
break;
}

if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len))
if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &desc))
break;

dma_sync_single_for_device(xdp_ring->dev, dma, len,
dma = xdp_umem_get_dma(xdp_ring->xsk_umem, desc.addr);

dma_sync_single_for_device(xdp_ring->dev, dma, desc.len,
DMA_BIDIRECTIONAL);

tx_bi = &xdp_ring->tx_buffer_info[xdp_ring->next_to_use];
tx_bi->bytecount = len;
tx_bi->bytecount = desc.len;
tx_bi->xdpf = NULL;
tx_bi->gso_segs = 1;

Expand All @@ -599,10 +602,10 @@ static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget)
cmd_type = IXGBE_ADVTXD_DTYP_DATA |
IXGBE_ADVTXD_DCMD_DEXT |
IXGBE_ADVTXD_DCMD_IFCS;
cmd_type |= len | IXGBE_TXD_CMD;
cmd_type |= desc.len | IXGBE_TXD_CMD;
tx_desc->read.cmd_type_len = cpu_to_le32(cmd_type);
tx_desc->read.olinfo_status =
cpu_to_le32(len << IXGBE_ADVTXD_PAYLEN_SHIFT);
cpu_to_le32(desc.len << IXGBE_ADVTXD_PAYLEN_SHIFT);

xdp_ring->next_to_use++;
if (xdp_ring->next_to_use == xdp_ring->count)
Expand Down
2 changes: 1 addition & 1 deletion drivers/net/ethernet/mellanox/mlx5/core/Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ mlx5_core-y := main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \
mlx5_core-$(CONFIG_MLX5_CORE_EN) += en_main.o en_common.o en_fs.o en_ethtool.o \
en_tx.o en_rx.o en_dim.o en_txrx.o en/xdp.o en_stats.o \
en_selftest.o en/port.o en/monitor_stats.o en/reporter_tx.o \
en/params.o
en/params.o en/xsk/umem.o en/xsk/setup.o en/xsk/rx.o en/xsk/tx.o

#
# Netdev extra
Expand Down
Loading

0 comments on commit c4cde58

Please sign in to comment.