OpenCSD

OpenCSD is an improved version of ZCSD that achieves log-structured filesystem (LFS) integration on Zoned Namespace (ZNS) SSDs functioning as Computational Storage Devices (CSDs). Below is a diagram of the overall architecture as presented to the end user. The actual implementation differs, however, because it is emulated using technologies such as QEMU, uBPF and SPDK.

Progress Report

(provisional)

  • Week 1 -> Goal: get fuse-lfs working with libfuse
    • Add libfuse, fuse-lfs and rocksdb as dependencies
    • Create custom libfuse fork to support non-privileged installation
    • Configure CMake to install libfuse
    • Configure environment script to setup pkg-config path
    • Use Docker in Docker (dind) to build docker image for Gitlab CI pipeline
    • Investigate and document how to debug fuse filesystems
    • Determine and document RocksDB required syscalls
    • Setup persistent memory that can be shared across processes
      • Split into daemon and client modes
  • Week 2 -> Goal: get a working LFS filesystem
    • Create solid digital logbook to track discussions
  • Week 3 -> Investigate FUSE I/O calls and fadvise
    • Get a working LFS filesystem using FUSE
      • What are the requirements for these filesystems?
      • Create FUSE LFS path-to-inode function.
        • Test the path-to-inode function using unit tests.
    • Setup research questions in thesis.
    • Run filesystem benchmarks with strace
      • RocksDB DBBench
      • Filebench
    • Use fsetxattr for 'process' attributes in FUSE
      • Document how this can enable CSD functionality in regular filesystems
  • Week 4 -> FUSE LFS filesystem
    • Get a working LFS filesystem using FUSE
      • What are the requirements for these filesystems? (research question)
        • Snapshots
        • GC
    • Test the path-to-inode function using unit tests.
  • Week 5 -> FUSE LFS filesystem
  • Week 6 -> FUSE LFS filesystem
  • Week 7 -> FUSE LFS filesystem
  • Week 8 -> FUSE LFS filesystem
  • Run filesystem benchmarks with strace
    • RocksDB DBBench
    • Filebench

Logbook

Serves as a place to quickly store digital information until it can be refined and processed into the thesis.

Discussion Notes

  • To analyze the exact calls RocksDB makes during its benchmarks, tools like strace can be used.
  • Several methods exist to prototype filesystem integration for CSDs. Among these is using LD_PRELOAD to override system calls such as read(), write() and open(). In this design we choose FUSE instead, as it simplifies some of the management and opens the possibility of parallelism, while the interface between FUSE and the filesystem calls remains thin enough that the two can be correlated.
  • The filesystem can use a snapshot concurrency model with reference counts.
  • Each file can maintain a special table that associates system calls with CSD kernels. To isolate this behavior (to specific users) we can use filehandles and process IDs; these should be available for most FUSE API calls anyway. See the sketch after this list.
  • The design should reuse existing operating system interfaces as much as possible. Any new API or call should be well motivated with solid arguments. As an initial idea we can investigate reusing POSIX fadvise.
  • As requirements, our FUSE LFS needs garbage collection and snapshots; parallelism would be nice to have.
  • Crossing the kernel and userspace boundary can be achieved using ioctl should the need arise.
  • As an experiment for evaluation, we should try to run RocksDB benchmarks on top of the FUSE LFS filesystem while offloading bloom-filter computations for SST tables.
  • Run the Filebench benchmark to identify filesystem calls, as well as db_bench from RocksDB; run both under strace.
  • Filesystem design: why FUSE, and why build from scratch?
  • Is FUSE enough? Do its filesystem calls and API support what we need? (research question)
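
To make the call-to-kernel table idea concrete, here is a minimal C++ sketch of a table keyed by (PID, filehandle). All names (FsCall, CsdKernelId, KernelTable) are hypothetical illustrations, not the actual qemu-csd API.

#include <cstdint>
#include <map>
#include <optional>
#include <utility>
#include <sys/types.h>

enum class FsCall { Read, Write };   // calls that could be offloaded
using CsdKernelId = uint64_t;        // handle to a registered eBPF kernel (hypothetical)

class KernelTable {
    // Key on (calling PID, filehandle) so behavior stays isolated per process.
    std::map<std::pair<pid_t, uint64_t>, std::map<FsCall, CsdKernelId>> table;

public:
    void associate(pid_t pid, uint64_t fh, FsCall call, CsdKernelId kernel) {
        table[{pid, fh}][call] = kernel;
    }

    // Which kernel, if any, should run on the CSD for this call?
    std::optional<CsdKernelId> lookup(pid_t pid, uint64_t fh, FsCall call) const {
        auto it = table.find({pid, fh});
        if (it == table.end()) return std::nullopt;
        auto jt = it->second.find(call);
        if (jt == it->second.end()) return std::nullopt;
        return jt->second;
    }

    // Deregister everything for one (pid, fh), e.g. on release or re-open.
    void release(pid_t pid, uint64_t fh) { table.erase({pid, fh}); }
};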

Correlation POSIX and FUSE

For convenience and reasoning's sake, a mapping between common POSIX I/O calls and the FUSE API calls that back them is needed. The sketch after the two lists below illustrates the correspondence.

POSIX

  • close
  • (p/w)read
  • (p/w)write
  • lseek
  • open
  • fcntl
  • readdir
  • posix_fadvise

FUSE

  • getattr
  • readdir
  • open
  • create
  • read
  • write
  • unlink
  • statfs
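
As a rough illustration of the correspondence, below is a minimal sketch against the libfuse 3 high-level API showing where the POSIX calls above surface as FUSE hooks. The stubs do nothing useful. Note that read/write receive explicit offsets (covering pread/pwrite and most lseek usage), and that posix_fadvise has no corresponding hook in this API, which matters for the fadvise idea above.

#define FUSE_USE_VERSION 35
#include <fuse3/fuse.h>
#include <cerrno>
#include <cstring>

// Backs POSIX open(); the open flags arrive via fi->flags.
static int fs_open(const char *path, struct fuse_file_info *fi) {
    return 0;
}

// Backs POSIX read()/pread(); the offset is explicit, so no lseek hook is needed.
static int fs_read(const char *path, char *buf, size_t size, off_t off,
                   struct fuse_file_info *fi) {
    return -ENOSYS;
}

// Backs POSIX write()/pwrite().
static int fs_write(const char *path, const char *buf, size_t size, off_t off,
                    struct fuse_file_info *fi) {
    return -ENOSYS;
}

// Backs POSIX readdir().
static int fs_readdir(const char *path, void *buf, fuse_fill_dir_t fill,
                      off_t off, struct fuse_file_info *fi,
                      enum fuse_readdir_flags flags) {
    return -ENOSYS;
}

int main(int argc, char *argv[]) {
    struct fuse_operations ops;
    std::memset(&ops, 0, sizeof(ops));
    ops.open = fs_open;
    ops.read = fs_read;
    ops.write = fs_write;
    ops.readdir = fs_readdir;
    // getattr, create, unlink and statfs hook up the same way.
    return fuse_main(argc, argv, &ops, nullptr);
}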

Non-persistent Conditional Extended Attributes in FUSE

Extended filesystem attributes support various namespaces with different behavior and responsibilities. Since the underlying filesystem is still tasked with storing these attributes persistently regardless of namespace, a FUSE filesystem is effectively in full control of how to proceed.

Given the already existing standard of using namespaces for permissions, roles and behavior, an additional namespace is an easy and clean extension: the process namespace. These are non-persistent extended file attributes that are only visible to the process that created them; effectively an in-memory map that lives inside the filesystem instead of in the calling process. A sketch follows the requirements below.

Requirements:

  • The calling PID must be (made) available to either the high-level or low-level FUSE API hooks (by observing the -d FUSE output, the PID is already available in some contexts, just not to the API calls).
  • A clean method to deregister all hooks is needed; this either has to happen when the file is released or when the file is reopened using a previously used PID. Using the release / releasedir calls is difficult as the calling PID is not available in this context.
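
Below is a minimal sketch of what such a process namespace could look like, assuming the high-level libfuse 3 API where fuse_get_context() can expose the caller's PID (per the first requirement, this availability has to be verified per hook). The map layout and the "process." prefix handling are illustrative assumptions, not the fuse_lfs implementation; note the sketch never deregisters entries, which is exactly the second requirement above.

#define FUSE_USE_VERSION 35
#include <fuse3/fuse.h>
#include <cerrno>
#include <cstring>
#include <map>
#include <string>
#include <tuple>

// In-memory store: (PID, path, attribute name) -> value. Never persisted.
static std::map<std::tuple<pid_t, std::string, std::string>, std::string> proc_xattrs;

static int fs_setxattr(const char *path, const char *name, const char *value,
                       size_t size, int flags) {
    if (std::strncmp(name, "process.", 8) != 0)
        return -ENOTSUP;  // other namespaces would be persisted as usual
    pid_t pid = fuse_get_context()->pid;  // PID of the calling process
    proc_xattrs[{pid, path, name}] = std::string(value, size);
    return 0;
}

static int fs_getxattr(const char *path, const char *name, char *value,
                       size_t size) {
    if (std::strncmp(name, "process.", 8) != 0)
        return -ENOTSUP;
    pid_t pid = fuse_get_context()->pid;
    auto it = proc_xattrs.find({pid, path, name});
    if (it == proc_xattrs.end())
        return -ENODATA;  // the attribute is invisible to any other PID
    if (size == 0)
        return (int)it->second.size();  // size-probe call
    if (size < it->second.size())
        return -ERANGE;
    std::memcpy(value, it->second.data(), it->second.size());
    return (int)it->second.size();
}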

RocksDB Integration

Required syscalls, determined by analysis of https://github.com/facebook/rocksdb/blob/7743f033b17bf3e0ea338bc6751b28adcc8dc559/env/io_posix.cc; a compilable sketch after the list shows the header requirements for the GNU-specific entries:

  • clearerr (stdio.h)
  • close (unistd.h)
  • fclose (stdio.h)
  • feof (stdio.h)
  • ferror (stdio.h)
  • fread_unlocked (stdio.h)
  • fseek (stdio.h)
  • fstat (sys/stat.h)
  • fstatfs (sys/statfs.h / sys/vfs.h)
  • ioctl (sys/ioctl.h)
  • major (sys/sysmacros.h)
  • open (fcntl.h)
  • posix_fadvise (fcntl.h)
  • pread (unistd.h)
  • pwrite (unistd.h)
  • readahead (fcntl.h + _GNU_SOURCE)
  • realpath (stdlib.h)
  • sync_file_range (fcntl.h + _GNU_SOURCE)
  • write (unistd.h)
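
The feature-test requirements flagged above are easy to get wrong, so here is a minimal, compilable sketch (the file name is a placeholder) showing the includes and the _GNU_SOURCE guard needed for the hint-style calls:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // must precede any include to expose readahead/sync_file_range
#endif
#include <fcntl.h>   // open, posix_fadvise, readahead, sync_file_range
#include <stdio.h>   // perror
#include <unistd.h>  // close

int main(void) {
    int fd = open("testfile", O_RDONLY);  // placeholder file name
    if (fd < 0) { perror("open"); return 1; }
    // Sequential-access hint over the whole file, the kind of hint RocksDB
    // issues through its fadvise wrappers in io_posix.cc.
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    // Ask the kernel to populate the page cache ahead of upcoming reads.
    readahead(fd, 0, 1 << 20);
    close(fd);
    return 0;
}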

Potential issues:

  • Use of ioctl
  • Use of io_uring

ZCSD

ZCSD is a full-stack prototype to execute eBPF programs as if they were running on a ZNS SSD CSD. The entire prototype can be run from userspace by utilizing existing technologies such as SPDK and uBPF. Since consumer ZNS SSDs are still unavailable, QEMU can be used to create a virtual ZNS SSD. The programming and interactive steps of the individual components are shown below, followed by a sketch of the uBPF half.
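
For a feel of the uBPF side, here is a minimal sketch of loading and executing an eBPF ELF object in userspace. The file name is a placeholder, and the ubpf_exec signature shown matches older uBPF revisions such as the one pinned by this project; newer revisions return the result through an out-parameter instead.

#include <cstdint>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>
extern "C" {
#include "ubpf.h"
}

int main() {
    // Read a compiled eBPF ELF object into memory (placeholder path).
    std::ifstream in("bpf_program.o", std::ios::binary);
    std::vector<char> elf((std::istreambuf_iterator<char>(in)),
                          std::istreambuf_iterator<char>());

    struct ubpf_vm *vm = ubpf_create();
    char *errmsg = nullptr;  // allocated by uBPF on failure; freeing omitted for brevity
    if (ubpf_load_elf(vm, elf.data(), elf.size(), &errmsg) < 0) {
        std::fprintf(stderr, "load failed: %s\n", errmsg);
        return 1;
    }
    // The memory region handed to the program; on a real CSD this would be
    // data read from flash.
    uint8_t mem[64] = {0};
    uint64_t result = ubpf_exec(vm, mem, sizeof(mem));
    std::printf("BPF program returned %llu\n", (unsigned long long)result);
    ubpf_destroy(vm);
    return 0;
}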

Getting Started

To get started using OpenCSD perform the steps described in the Setup section, followed by the steps in Usage Examples.

Directory structure

  • qemu-csd - project source files
  • cmake - small cmake snippets to enable various features
  • dependencies - project dependencies
  • docs - doxygen generated source code documentation
  • documentation - project report written in LaTeX
  • playground - small toy examples or other experiments
  • presentation - midterm presentation written in LaTeX
  • python - python scripts to aid in visualization or measurements
  • scripts - Shell scripts primarily used by CMake to install project dependencies
  • tests - unit tests and possibly integration tests
  • .vscode - Launch targets and settings to debug programs running inside QEMU over SSH

Modules

Module        Task
arguments     Parse command-line arguments for the relevant components
bpf_helpers   Headers defining the functions available from within BPF
bpf_programs  BPF programs ready to run on a CSD using bpf_helpers
fuse_lfs      Log-structured filesystem in FUSE
nvme_csd      Emulates additional NVMe commands to enable BPF CSDs
nvme_zns      Interface to handle zoned I/O using abstracted backends
spdk_init     Provides SPDK initialization and handles for nvme_csd

Dependencies

This project requires quite a few dependencies; the majority are compiled by the project itself and installed into the build directory. Anything that is not automatically compiled and linked is shown below. Note, however, that these dependencies are already installed on the image used with QEMU.

Warning: Meson must be below version 0.60 due to a bug in DPDK

  • General
    • Linux 5.5 or higher
    • compiler with c++17 support
    • clang 10 or higher
    • cmake 3.18 or higher
    • python 3.x
    • mesonbuild < 0.60 (pip3 install meson==0.59)
    • pyelftools (pip3 install pyelftools)
    • ninja
    • cunit
  • Documentation
    • doxygen
    • LaTeX
  • Code Coverage
    • ctest
    • lcov
    • gcov
    • gcovr
  • Continuous Integration
    • valgrind
  • Python scripts
    • virtualenv

The following dependencies are automatically compiled. Dependencies are preferably linked statically due to the nature of this project; however, for several dependencies this is not possible for various reasons. For Boost, it is because the unit-test framework cannot (easily) be statically linked:

Dependency        System   Version
backward          ZCSD     1.6
boost             ZCSD     1.74.0
bpftool           ZCSD     5.14
bpf_load          ZCSD     5.10
dpdk              ZCSD     20.11.0
generic-ebpf      ZCSD     c9cee73
fuse-lfs          OpenCSD  526454b
libbpf            ZCSD     0.5
libfuse           OpenCSD  3.10.5
libbpf-bootstrap  ZCSD     67a29e5
linux             ZCSD     5.14
spdk              ZCSD     21.07
isa-l             ZCSD     spdk-v2.30.0
rocksdb           OpenCSD  6.25.3
qemu              ZCSD     6.1.0
uBPF              ZCSD     9eb26b4

Setup

Building tools and dependencies is done by executing the following commands from the root directory. For a more complete list of CMake options see the Configuration section. The environment file needs to be sourced with source builddir/qemu-csd/activate in every new shell; it configures essential include and binary paths required to run all the dependencies.

This first section of commands generates targets for host development. Among these is compiling QEMU and downloading an image for it. Many parts of this project can be developed on the host, but some require being developed on the guest; see the next section for on-guest development.

Navigate to the root directory of the project before executing the following instructions. These instructions compile the dependencies, including a version of QEMU, on the host.

git submodule update --init
mkdir build
cd build
cmake ..
cmake --build .
# Do not use make -j $(nproc); CMake is not able to resolve the concurrent dependency chain
cmake .. # this prevents re-compiling dependencies on every next make command
source qemu-csd/activate
# run commands and tools as you please for host based development
deactivate

From the root directory, execute the following commands for the one-time deployment into the QEMU guest. These commands assume the previous section of commands has been executed successfully. The QEMU guest automatically starts an SSH server reachable on port 7777. Both the arch and root users can be used to log in; in both cases the password is arch as well. By default the QEMU script only binds the guest ports on localhost, to reduce the security concerns of these basic passwords.

git bundle create deploy.git HEAD
cd build/qemu-csd
source activate
qemu-img create -f raw znsssd.img 16777216
# By default qemu will use 4 CPU cores and 8GB of memory
./qemu-start.sh
# Wait for QEMU VM to fully boot... (might take some time)
rsync -avz -e "ssh -p 7777" ../../deploy.git arch@localhost:~/
# Type password (arch)
ssh arch@localhost -p 7777
# Type password (arch)
git clone deploy.git qemu-csd
rm deploy.git
cd qemu-csd
git -c submodule."dependencies/qemu".update=none submodule update --init
mkdir build
cd build
cmake -DENABLE_DOCUMENTATION=off -DIS_DEPLOYED=on ..
# Do not use make -j $(nproc); CMake is not able to resolve the concurrent dependency chain
cmake --build .

Optionally, if the intent is to develop on the guest and commit code, the git remote can be updated. In that case it is also best to generate an SSH keypair; be sure to start an ssh-agent as well, as this needs to be done manually on Arch. The ssh-agent is only valid for the terminal session that started it. Optionally, it can be included in .bashrc.

git remote set-url origin [email protected]:Dantali0n/qemu-csd.git
ssh-keygen -t rsa -b 4096
eval $(ssh-agent) # must be done after each login
ssh-add ~/.ssh/NAME_OF_KEY

Additionally, any Python-based tools and graphs are generated by executing these additional commands from the root directory. Ensure the previous environment has been deactivated.

virtualenv -p python3 python
cd python
source bin/activate
pip install -r requirements.txt

Running & Debugging

Running and debugging programs is an essential part of development. Often, a high barrier to entry and clumsy development procedures severely hinder productivity. Qemu-csd comes with a variety of preconfigured scripts to reduce this initial barrier and enable quick development iterations.

Environment:

Within the build folder is a qemu-csd/activate script. It can be sourced from any shell using source qemu-csd/activate. The script configures environment variables such as LD_LIBRARY_PATH while also exposing an essential sudo alias: ld-sudo.

The environment variables ensure any linked libraries can be found for targets compiled by CMake. Additionally, ld-sudo provides a mechanism to start targets with sudo privileges while retaining these environment variables. The environment can be deactivated at any time by executing deactivate.

Usage Examples:

TODO: Generate integer data file, describe qemucsd and spdk-native applications, usage parameters, relevant code segments to write your own BPF program, relevant code segments to extend the prototype.

Debugging on host:

Several mechanisms are in place to simplify the debugging process. Firstly, vscode launch files are provided to debug applications even though they require environment configuration. Any application can be launched using the following set of commands:

source qemu-csd/activate
# For when the target does not require sudo
gdbserver localhost:2222 playground/play-boost-locale
# For when the target requires sudo privileges
ld-sudo gdbserver localhost:2222 playground/play-spdk

Note that while QEMU is running, port 2222 is used by QEMU instead. The launch targets in .vscode/launch.json can easily be modified or extended.

When gdbserver is running, simply open vscode, select the root folder of qemu-csd, navigate to the source files of interest, set breakpoints and select the launch target from the dropdown (top left). The debugging panel in vscode can be accessed quickly by pressing ctrl+shift+d.

Alternative debugging methods such as using gdb TUI or gdbgui should work but will require more manual setup.

Debugging on QEMU:

Debugging on QEMU is similar but uses different launch targets in vscode. These targets automatically log in using SSH and forward the gdbserver connection.

More native debugging sessions are also supported. Simply login to QEMU and start the gdbserver manually. On the host connect to this gdbserver and set up substitute-path.

On QEMU:

# from the root of the project folder.
cd build
source qemu-csd/activate
ld-sudo gdbserver localhost:2000 playground/play-spdk

On host:

gdb
target remote localhost:2222
set substitute-path /home/arch/qemu-csd/ /path/to/root/of/project

More detailed information about development & debugging for this project can be found in the report.

Debugging FUSE:

Debugging FUSE filesystem operations can be done by running the compiled filesystem binaries with the -f argument, which keeps the FUSE filesystem process in the foreground.

gdb ./filesystem
b ...
run -f mountpoint

CMake Configuration

This section documents all configuration parameters that the CMake project exposes and how they influence the project. For more information about the CMake project see the report generated from the documentation folder. Below, all parameters are listed along with their default value and a brief description.

Parameter             Default  Use case
ENABLE_TESTS          ON       Enables unit tests and adds tests target
ENABLE_CODECOV        OFF      Produce code coverage report with unit tests
ENABLE_DOCUMENTATION  ON       Produce code documentation using doxygen & LaTeX
ENABLE_PLAYGROUND     OFF      Enables playground targets
ENABLE_LEAK_TESTS     OFF      Add compile parameter for address sanitizer
IS_DEPLOYED           OFF      Indicate that the CMake project is deployed in QEMU

Several parameters need a more in-depth explanation, primarily IS_DEPLOYED. The CMake project is used both to compile and configure QEMU and to compile binaries that run inside QEMU. As a result, it needs to be able to identify whether it is being executed outside of QEMU or not; this is what IS_DEPLOYED facilitates. In particular, IS_DEPLOYED prevents the compilation of QEMU from source.

Licensing

This project is available under the MIT license; several limitations apply, including:

  • Source files with an alternative author or license statement other than Dantali0n and MIT respectively.
  • Images subject to copyright or usage terms, such as the VU and UvA logos.
  • CERN beamer template files by Jerome Belleman.
  • Configuration files that can't be subject to licensing, such as doxygen.cnf or .vscode/launch.json.

References

Snippets

  • SPDK -> now supports ZNS zone append
  • uNVME
  • OCSSD
  • RDMA
  • libbpf (standalone)
  • libbpf-tools (BCC)
  • Linux Kernel:
    • p2pdma
    • ioctl

Configuration and parameters for QEMU ZNS SSDs:

Usage:
      -device nvme-subsys,id=subsys0
      -device nvme,serial=foo,id=nvme0,subsys=subsys0
      -device nvme,serial=bar,id=nvme1,subsys=subsys0
      -device nvme,serial=baz,id=nvme2,subsys=subsys0
      -device nvme-ns,id=ns1,drive=<drv>,nsid=1,subsys=subsys0  # Shared
      -device nvme-ns,id=ns2,drive=<drv>,nsid=2,bus=nvme2

nvme options:
  addr=<int32>           - Slot and optional function number, example: 06.0 or 06 (default: -1)
  aer_max_queued=<uint32> -  (default: 64)
  aerl=<uint8>           -  (default: 3)
  cmb_size_mb=<uint32>   -  (default: 0)
  discard_granularity=<size> -  (default: 4294967295)
  drive=<str>            - Node name or ID of a block device to use as a backend
  failover_pair_id=<str>
  logical_block_size=<size> - A power of two between 512 B and 2 MiB (default: 0)
  max_ioqpairs=<uint32>  -  (default: 64)
  mdts=<uint8>           -  (default: 7)
  min_io_size=<size>     -  (default: 0)
  msix_qsize=<uint16>    -  (default: 65)
  multifunction=<bool>   - on/off (default: false)
  num_queues=<uint32>    -  (default: 0)
  opt_io_size=<size>     -  (default: 0)
  physical_block_size=<size> - A power of two between 512 B and 2 MiB (default: 0)
  pmrdev=<link<memory-backend>>
  rombar=<uint32>        -  (default: 1)
  romfile=<str>
  serial=<str>
  share-rw=<bool>        -  (default: false)
  smart_critical_warning=<uint8>
  subsys=<link<nvme-subsys>>
  use-intel-id=<bool>    -  (default: false)
  write-cache=<OnOffAuto> - on/off/auto (default: "auto")
  x-pcie-extcap-init=<bool> - on/off (default: true)
  x-pcie-lnksta-dllla=<bool> - on/off (default: true)
  zoned.append_size_limit=<size> -  (default: 131072)

nvme-ns options:
  bootindex=<int32>
  discard_granularity=<size> -  (default: 4294967295)
  drive=<str>            - Node name or ID of a block device to use as a backend
  logical_block_size=<size> - A power of two between 512 B and 2 MiB (default: 0)
  min_io_size=<size>     -  (default: 0)
  nsid=<uint32>          -  (default: 0)
  opt_io_size=<size>     -  (default: 0)
  physical_block_size=<size> - A power of two between 512 B and 2 MiB (default: 0)
  share-rw=<bool>        -  (default: false)
  subsys=<link<nvme-subsys>>
  uuid=<str>             - UUID (aka GUID) or "auto" for random value (default) (default: "auto")
  write-cache=<OnOffAuto> - on/off/auto (default: "auto")
  zoned.cross_read=<bool> -  (default: false)
  zoned.descr_ext_size=<uint32> -  (default: 0)
  zoned.max_active=<uint32> -  (default: 0)
  zoned.max_open=<uint32> -  (default: 0)
  zoned.zone_capacity=<size> -  (default: 0)
  zoned.zone_size=<size> -  (default: 134217728)
  zoned=<bool>           -  (default: false)

Create required images and launch QEMU with ZNS SSD:

qemu-img create -f raw znsssd.img 16777216
qemu-system-x86_64 -name qemucsd -m 4G -cpu Haswell -smp 2 -hda ./arch-qemucsd.qcow2 \
-net user,hostfwd=tcp::7777-:22,hostfwd=tcp::2222-:2000 -net nic \
-drive file=./znsssd.img,id=mynvme,format=raw,if=none \
-device nvme,serial=baz,id=nvme2,zoned.append_size_limit=131072 \
-device nvme-ns,id=ns2,drive=mynvme,nsid=2,logical_block_size=4096,\
physical_block_size=4096,zoned=true,zoned.zone_size=131072,zoned.zone_capacity=131072,\
zoned.max_open=0,zoned.max_active=0,bus=nvme2

Week 1 Friday demo scripts:

cat /sys/block/nvme0n1/queue/zoned
cat /sys/block/nvme0n1/queue/chunk_sectors
cat /sys/block/nvme0n1/queue/nr_zones
sudo blkzone report /dev/nvme0n1
sudo nvme zns id-ns /dev/nvme0n1
sudo nvme zns report-zones /dev/nvme0n1
sudo nvme zns open-zone /dev/nvme0n1 -s 0xe40
sudo nvme zns finish-zone /dev/nvme0n1 -s 0xe40
sudo nvme zns report-zones /dev/nvme0n1
sudo nvme zns reset-zone /dev/nvme0n1 -s 0xe40
