Here we will explain the overall structure of the repository and how you can contribute to it.
For each big feature (in this case, big means that it takes more than 3 commits to finish and debug), create a new branch and then submit a pull request for approval.
The repository is structured as follows:
./bindings
This folder contains all the bindings. Currently there are only the Python bindings, but in the future we might expand to other languages. In general this folder contains a directory for each language.
./bindings/python
Home of the Python bindings; for more details see its README.md.
./code_analysis
Our static Rust code analyzer. It is used both to automatically generate the bindings and to enforce extra rules on the code, such as naming conventions and proper documentation of methods.
./drawio
Old (outdated) images we use to explain concepts of the library.
./fuzzing
Where all the Rust fuzzing-related things are.
./fuzzing/graph_harness
The general harnesses we use with the different fuzzers (libFuzzer, Honggfuzz).
./fuzzing/graph_harness/fuzz
Contains the libFuzzer targets, which can be run from the parent directory.
./fuzzing/honggfuzz
Folder set up to fuzz ensmallen with Honggfuzz; for more details see the README.md there.
./fuzzing/stupid_fuzzer
This is the most basic fuzzer: it takes random values and feeds them to the harness without any sort of coverage. It can also be used to reproduce bugs found by other tests, and thus serves mostly to better inspect bugs found by other fuzzers.
./fuzzing/unit_tests
Our harnesses catch panics and some signals and dump information about the errors here, in a folder with a random name. When possible we provide the panic / signal information, backtrace, original data, graph report, and any other information we can find.
./graph
Core of the library; this is where most algorithms and data structures are.
./graph/tags
Procedural macros we use to "tag" methods to instruct our bindings generator or harness generator (see the sketch after this list).
./notebooks_and_scripts
This folder contains Jupyter notebooks and scripts that we use mainly to update the automatic graph-retrieval pipeline.
./perf
Small Rust executables we use to profile the library and find the bottlenecks.
./setup
Dockerfiles and bash scripts that we use to build ensmallen on different systems.
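As a rough idea of how such a tag works, here is a minimal sketch of a no-op attribute macro; the tag name no_binding is hypothetical, and the real tags are defined in ./graph/tags:

// Minimal sketch of a "tag" attribute macro (hypothetical name: no_binding).
// At compile time the attribute does nothing: it returns the item unchanged.
// Its only purpose is to be visible in the source, where the static analyzer
// in ./code_analysis can read it and adjust bindings / harness generation.
use proc_macro::TokenStream;

#[proc_macro_attribute]
pub fn no_binding(_attr: TokenStream, item: TokenStream) -> TokenStream {
    item
}

A method in the core library would then be annotated with #[no_binding] (or whatever the real tag is called) to tell the generators how to treat it.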
In general we try to have a README.md file in every folder, containing information and details, and a Makefile that contains recipes for the most common operations.
The core of the library is inside the ./graph folder. Most of the code consists of methods of the Graph struct. To keep things usable, we try to divide the methods into different files, and to keep the self-contained helper functions and structs in sub-modules.
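For example, here is a minimal sketch (with illustrative names, not the real ensmallen sources) of how the same struct can be extended with impl blocks spread over several files:

// src/graph.rs -- the struct definition lives in one file...
pub struct Graph {
    edges: Vec<(u32, u32)>,
}

// src/getters.rs -- ...while each topic gets its own file with its own impl block.
impl Graph {
    pub fn get_number_of_edges(&self) -> usize {
        self.edges.len()
    }
}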
To test it you can use the following commands:
# Quick check that the code compiles
$ cargo check
# Check that there are no code-smells (THIS IS STILL A WORK IN PROGRESS)
$ cargo clippy -- -D clippy::pedantic
# Run the tests (--release is suggested because they take a while)
$ cargo test --release
# Run our custom set of rules and validations based on our static code analyzer
$ (cd ../code_analysis/; cargo run --release --bin check)
To gather coverage over the tests, run:
# Set up flags to gather better coverage (THIS WILL SIGNIFICANTLY SLOW DOWN THE TESTS)
$ export CARGO_INCREMENTAL=0
$ export RUSTFLAGS="-Zprofile -Ccodegen-units=1 -Copt-level=0 -Clink-dead-code -Coverflow-checks=off -Zpanic_abort_tests -Cpanic=abort"
$ export RUSTDOCFLAGS="-Cpanic=abort"
# Remove the target folder so that we force re-compilation
$ rm -rfd target;
# Update the libraries (mostly the ones from git need this)
$ cargo update;
# Run the now instrumented test (due to -Zprofile)
$ cargo test;
# Create a folder where we will save the coverage
$ mkdir ./target/debug/coverage;
# Convert the coverage gathered during the tests to a human-readable static website.
$ grcov ./target/debug/ -s . -t html --llvm --branch --ignore-not-existing -o ./target/debug/coverage/;
The resulting coverage can be found at ./target/debug/coverage/src/index.html.
Most of the test suite is contained in ./src/test_utilities.rs; this file contains tests that check the general consistency of the graph. These are in the main sources because we use them both in the tests and during fuzzing.
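The pattern looks roughly like this (a sketch with made-up names and a made-up invariant; the real suite is in ./src/test_utilities.rs):

// A consistency check compiled into the main sources (not under #[cfg(test)]),
// so that both `cargo test` and the fuzzing harnesses can call it.
// The getter names below are placeholders, not the real ensmallen API.
pub fn check_graph_consistency(graph: &Graph) -> Result<(), String> {
    // Illustrative invariant: a graph with zero nodes cannot have edges.
    if graph.get_number_of_nodes() == 0 && graph.get_number_of_edges() != 0 {
        return Err("a graph with zero nodes cannot have edges".to_string());
    }
    Ok(())
}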
To help check if there are bugs, we heavily fuzz the code. We have two main harnesses. A harness is a piece of code that takes a random vector of bytes and uses it as input for the code you want to test. The simple harness (found at ./fuzzing/graph_harness/src/from_strings.rs) creates a random graph using the from_strings constructor and then runs the default test-suite on it.
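Schematically, a libFuzzer-style harness has this shape (a sketch: build_graph_from_text is a made-up stand-in for the real from_strings entry point):

#![no_main]
use libfuzzer_sys::fuzz_target;

// libFuzzer calls this closure over and over with mutated byte vectors.
fuzz_target!(|data: &[u8]| {
    // Interpret the raw bytes as text and hand them to the code under test;
    // inputs that are not valid UTF-8 are simply skipped.
    if let Ok(text) = std::str::from_utf8(data) {
        let _ = build_graph_from_text(text);
    }
});

// Stand-in for the real constructor: parse a "src dst" per-line edge list.
fn build_graph_from_text(text: &str) -> Vec<(String, String)> {
    text.lines()
        .filter_map(|line| {
            let mut parts = line.split_whitespace();
            Some((parts.next()?.to_string(), parts.next()?.to_string()))
        })
        .collect()
}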
More information can be found in the README.md inside the ./fuzzing folder, but to start fuzzing you can just run:
$ cd fuzzing
$ make hfuzz_from_strings
If it finds bugs, a report about them, useful for debugging, will be generated inside the ./fuzzing/unit_tests folder.
Our hand-written recursive descent Rust parser can be found in ./code_analysis/rust_parser.
We built several utilities:
bindgen
It analyzes all the methods and functions in the core library and automatically creates the Python bindings, generating the file auto_generated_bindings.rs. It also creates the data we use for the TF-IDF recommender system behind the __getattr__ magic method; in particular, it pre-processes and stores, for each struct, a list of all the methods and the pre-computed TF-IDF weights of each term.
metatest
It analyzes all the methods of the Graph struct and creates a fuzzing harness that builds a random graph and calls 10 random methods with random arguments.
check
Our static analysis code quality checker. It checks that every public method has documentation formatted in our standard way, that the file's naming convention (if it has one) is respected, and other rules.
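As a hedged illustration of the kind of rule check enforces, assume the required documentation layout looks something like this (the exact format is defined by the analyzer itself, so treat the sections below as an example, not the standard):

/// Returns the number of words in the given sentence.
///
/// # Arguments
/// * `sentence`: &str - The sentence whose words will be counted.
pub fn get_number_of_words(sentence: &str) -> usize {
    sentence.split_whitespace().count()
}

A checker of this kind can then flag, for instance, a public method with no doc comment, or one whose documented arguments do not match its signature.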
For Linux this currently works only with maturin versions up to 0.12.4, because 0.12.5 broke our auditwheel step.
Most of the Python bindings are automatically generated through our code_analysis tools, so it's good practice to re-generate the bindings each time you compile them:
$ cd ./code_analysis
$ cargo run --release --bin bindgen
The bindings use the PyO3 library and maturin for the compilation and packaging steps.
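For reference, a hand-written PyO3 binding has roughly this shape (a minimal sketch with toy names; the real, generated bindings live in auto_generated_bindings.rs and are far larger):

use pyo3::prelude::*;
use pyo3::wrap_pyfunction;

/// A toy function exposed to Python.
#[pyfunction]
fn add(a: usize, b: usize) -> usize {
    a + b
}

/// The module definition that maturin compiles into an importable extension.
#[pymodule]
fn toy_module(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(add, m)?)?;
    Ok(())
}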
To be able to optimize the compilation for different target CPUs, we have a custom build script, ./bindings/python/build.py, which automatically builds the bindings for multiple versions of Python and creates wheels that automatically dispatch the best version for the current CPU.
Most settings for the builder can be edited in ./bindings/python/builder_settings.json, which is in a JSON format that allows comments (each comment must start with // and be on its own line).
The build script is a CLI tool:
$ python ./build.py -h
usage: build.py [-h] [-p PYTHON_VERSIONS] [-s SETTINGS_PATH] [-v {debug,info,error}]
Build Ensmallen with automatic dispatching between performance and compatibility. This
utility makes two build folders, one for the avx version, one for the non_avx version.
The script will compile the wheels in both, and then merge them, making a new wheel
file that can dispatch at startup which library to use. Finally, to be able to
guarantee proper compatibility, on linux, we follow the manylinux2010 standard which
requires us to patch the created wheel using `auditwheel`. This is needed because we
have `reqwest` as a dependency which requires libcrypto (openssl) which is not in the
manylinux2010 allowed libraries. To fix this, auditwheel will ship the needed libraries
in the wheel and patch the relocation sections of the two versions of ensmallen to
import these. This builder uses a json file for setting the targets. An example of a
settings file is:
{
    "wheels_folder": "wheels",
    "shared_rustflags": "-C inline-threshold=1000",
    "targets": {
        "haswell": {
            "build_dir": "build_haswell",
            "rustflags": "-C target-cpu=haswell"
        },
        "core2": {
            "build_dir": "build_core2",
            "rustflags": "-C target-cpu=core2"
        }
    }
}
optional arguments:
-h, --help show this help message and exit
-s SETTINGS_PATH, --settings-path SETTINGS_PATH
The json file with the builder settings (default: builder_settings.json)
-v {debug,info,error}, --verbosity {debug,info,error}
Verbosity of the logger (default: info)
If you want to publish the wheel for Linux (you should really follow the section below), you have to install auditwheel with:
pip install auditwheel
otherwise you can skip the repair step by adding the --skip-repair flag to the build script. This is fine for local development, but the wheel won't be accepted by PyPI.
Then you can run the build with:
cd ./bindings/python
python build.py
The resulting wheels will be in the folder specified in the settings (by default ./bindings/python/wheels).
1- (the first time) Build the compilation environment
sudo docker build -t manylinux2010 -f ./setup/DockerFileManylinux2010 ./setup
2- Open a shell
sudo docker run --rm -it -v "${PWD}:/io" manylinux2010 bash
3- Inside the shell:
# navigate to the python bindings
cd /io/bindings/python
# run the compilation
python3 ./build.py
# upload the result
twine upload ./wheels/*.whl
Since the Docker container runs as root, it will create many root-owned files; at the end, run in the repository root:
sudo chown -R $USER:$USER *
If twine upload hangs with no output, it's usually a permission problem: you are probably trying to upload the wheels from outside the Docker container, which requires you to chown the wheels first.
Using Google's shiny new toy, Atheris, we can fuzz the bindings. By pairing the Rust and Python fuzzers this way, we can test all the code in the library.
A little example of a harness is:
#!/usr/bin/python
import sys
import atheris
import ensmallen_graph

def fuzz_me(input_bytes: bytes):
    fdp = atheris.FuzzedDataProvider(input_bytes)
    try:
        ensmallen_graph.preprocessing.okapi_bm25_tfidf_int(
            [
                [
                    fdp.ConsumeUInt(4)
                    for _ in range(fdp.ConsumeUInt(1))
                ]
                for _ in range(fdp.ConsumeUInt(1))
            ],
            fdp.ConsumeUInt(4),
            fdp.ConsumeUInt(4),
            False,
        )
    except ValueError:
        pass

atheris.instrument_all()
atheris.Setup(sys.argv, fuzz_me)
atheris.Fuzz()
To have coverage over the Rust code we must instrument it, so we need to compile the bindings as follows:
$ cd ./bindings/python
$ maturin develop --release --rustc-extra-args='-Ctarget-cpu=native -Zinstrument-coverage -Cpasses=sancov -Cllvm-args=-sanitizer-coverage-level=4 -Cllvm-args=-sanitizer-coverage-trace-compares -Cllvm-args=-sanitizer-coverage-inline-8bit-counters -Cllvm-args=-sanitizer-coverage-pc-table -Cllvm-args=-sanitizer-coverage-stack-depth --verbose -Zsanitizer=address'
Finally you can run the fuzzer as:
LD_PRELOAD="$(python -c "import atheris; print(atheris.path())")/asan_with_fuzzer.so" python my_atheris_script.py
This does not yet support catching panics / signals, so you must implement it yourself in the Python harness.
To profile our code, just create a new binary target (e.g. my_bin) in the ./perf folder, and write a program that loads a graph and calls the method you want to profile. It's best if the program takes from 10 seconds to 1 minute to run.
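A profiling target can be as simple as this sketch (the workload is purely illustrative; in practice you would load a real graph and call the method under study):

// Binary target (e.g. ./perf/src/bin/my_bin.rs) that gives `perf record`
// a steady workload to sample.
fn main() {
    // Synthetic stand-in for a loaded graph: one million edges.
    let edges: Vec<(u64, u64)> = (0..1_000_000)
        .map(|i| (i, (i * 7) % 1_000_000))
        .collect();
    // Repeat the hot code enough times that the run lasts 10s - 1min.
    let mut acc = 0u64;
    for _ in 0..100 {
        acc += edges.iter().map(|(a, b)| a ^ b).sum::<u64>();
    }
    // Use the result so the compiler cannot optimize the loop away.
    println!("{}", acc);
}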
Then you should compile the code:
cargo build --release
And run the profiling:
$ perf record --call-graph=dwarf ./target/release/my_bin
This will create a file called perf.data, which can be analyzed using hotspot, or directly with perf report and perf annotate.
For more info, see the perf documentation.