Developer Notes

Packaging and Dependencies

This project uses Conda to manage Python packaging and dependencies.

A coding standard is enforced using Black, isort and Flake8. Python 3 type hinting is validated using MyPy.

Unit tests are written using Pytest, and documentation is written as Google-style Python docstrings. Pydocstyle is used as a static analysis tool for checking compliance with Python docstring conventions.
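As a rough illustration of the conventions above, the sketch below shows a type-hinted function with a Google-style docstring. The function name `normalize` and its behaviour are hypothetical and not part of this project; the snippet only shows the shape that MyPy and Pydocstyle are meant to check.

```python
def normalize(values: list[float]) -> list[float]:
    """Scale values so that they sum to one.

    Args:
        values: Numbers whose sum is not zero.

    Returns:
        A new list with each value divided by the total.

    Raises:
        ValueError: If the values sum to zero.
    """
    total = sum(values)
    if total == 0:
        raise ValueError("values must not sum to zero")
    return [value / total for value in values]


print(normalize([1.0, 3.0]))
```

The `Args`, `Returns`, and `Raises` sections are the parts of the Google style that docstring checkers typically look for.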

Additional code security standards are enforced by Safety and Bandit. Git-secrets helps ensure that you're not pushing any passwords or sensitive information into your Bitbucket repository. Commits are rejected if the tool matches any of the configured regular expression patterns that indicate that sensitive information has been stored improperly.
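To make the pattern-matching idea concrete, here is a minimal Python sketch of scanning text against prohibited regular expressions. It is not how git-secrets is implemented, and the two patterns are assumptions chosen for illustration; the real rules are whatever is configured for the tool.

```python
import re

# Hypothetical patterns for illustration only; the real rules are configured in git-secrets.
PROHIBITED_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key ID
    re.compile(r"(?i)password\s*=\s*\S+"),  # hard-coded password assignment
]


def find_violations(text: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs that match a prohibited pattern."""
    violations = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in PROHIBITED_PATTERNS):
            violations.append((number, line))
    return violations


staged = "user = 'alice'\npassword = hunter2\n"
print(find_violations(staged))
```

A commit hook built on this idea would reject the commit whenever the returned list is non-empty.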

We use Sphinx or MkDocs to build the documentation. Run make build_docs from the project root; the docs will be built under docs/_build/html. Detailed information about the documentation can be found here.

Git Hooks

We rely on pre-commit hooks to ensure that the code is properly formatted, clean, and type-safe when it's checked in. The install step described below installs the project's pre-commit hooks into your repository. These hooks are configured in .pre-commit-config.yaml. After cloning the package and installing the development requirements, run

pre-commit install

from the project root to install the hooks locally. Now, before every git commit ... these hooks will run to verify that linting and type checking pass. If there are errors, the commit will fail, and you will see the changes that need to be made. Alternatively, you can run the hooks manually on all files:

pre-commit run --all-files

If necessary, you can temporarily disable the hooks using Git's --no-verify switch. However, keep in mind that the CI build enforces the same checks, so the build will still fail if they do not pass.

You can also write your own pre-commit scripts. Put them in the scripts folder. To make a shell script executable, use the following command:

git update-index --chmod=+x scripts/name_of_script.sh

Don’t forget to commit and push your changes after running it!

Warning: You need to run git commit with your conda environment activated, because by default the packages used by pre-commit are installed into your project's conda environment. (Note: pre-commit install --install-hooks installs the pre-commit hooks in the currently active environment.)

Markdown

Local links can be written as normal, but external links should be referenced at the bottom of the Markdown file for clarity. For example:

Use a local link to reference the [`README.md`](../README.md) file, but an external link for [Fraunhofer AICOS][fhp-aicos].

[fhp-aicos]: https://www.fraunhofer.pt/

We also try to wrap Markdown to a line length of 88 characters. This is not strictly enforced in all cases, for example with long hyperlinks.
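If you want to check the wrapping guideline locally, a small helper along these lines may be enough. It is a sketch rather than a project tool: the function name is hypothetical, and treating any line containing `://` as a long hyperlink is a simplifying assumption.

```python
MAX_LINE_LENGTH = 88


def overlong_lines(markdown: str, limit: int = MAX_LINE_LENGTH) -> list[int]:
    """Return 1-based numbers of lines over the limit, ignoring lines with URLs."""
    return [
        number
        for number, line in enumerate(markdown.splitlines(), start=1)
        if len(line) > limit and "://" not in line
    ]
```

For example, `overlong_lines(Path("DEVELOPER.md").read_text())` (with `Path` imported from `pathlib`) would list the lines worth rewrapping.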

Testing

[Tests are written using the pytest framework][pytest], with its configuration in the pyproject.toml file. Note that only tests in the finetune_sd/tests folder are run. To run the tests, enter the following command in your terminal:

pytest -vvv
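For orientation, a test module that pytest would pick up might look like the sketch below. The file name and the `slugify` function are hypothetical; in a real test the function would be imported from the package rather than defined inline. It assumes pytest's default discovery rules (files named `test_*.py`, functions prefixed with `test_`), which the project's configuration may adjust.

```python
# Hypothetical file: finetune_sd/tests/test_text.py


def slugify(title: str) -> str:
    """Hypothetical function under test; normally imported from the package."""
    return "-".join(title.lower().split())


def test_slugify_joins_words_with_hyphens() -> None:
    assert slugify("Hello World") == "hello-world"


def test_slugify_collapses_whitespace() -> None:
    assert slugify("  many   spaces ") == "many-spaces"
```

Plain `assert` statements are enough here; pytest rewrites them to show the compared values when a test fails.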

Code coverage

[Code coverage of Python scripts is measured using the coverage Python package][coverage]; its configuration can be found in pyproject.toml. To run code coverage, and view it as an HTML report, enter the following command in your terminal:

coverage run -m pytest
coverage html

or use the make command:

make coverage_html

The HTML report can be accessed at htmlcov/index.html.

Set private environment variables in .envrc file

System-specific variables (e.g. absolute paths to datasets) should not be under version control; otherwise they will cause conflicts between different users. Your private keys also shouldn't be versioned, since you don't want them to be leaked.

The project provides an example environment file. Create a new file called .env (this name is excluded from version control in .gitignore). You should use it for storing environment variables like this:

MY_VAR=/home/user/my_system_path

All variables from .env are loaded in config.py automatically.
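The loading step can be pictured roughly as follows. This is a standard-library sketch and not the code in config.py, which may well rely on a dedicated library instead; the function name `load_env_file` is hypothetical, and quoting or variable expansion are deliberately not handled.

```python
import os
from pathlib import Path


def load_env_file(path: Path) -> dict[str, str]:
    """Parse KEY=VALUE lines and export them without overriding existing variables."""
    loaded = {}
    for raw_line in path.read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines, comments, and malformed lines
        key, _, value = line.partition("=")
        loaded[key.strip()] = value.strip()
        os.environ.setdefault(key.strip(), value.strip())
    return loaded


env_path = Path(".env")
if env_path.exists():
    load_env_file(env_path)
```

Using `setdefault` means a variable already set in your shell takes precedence over the value in the file.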

Version control your data and models with DVC

Use DVC to version control big files, like your data or trained ML models. To initialize the dvc repository:

dvc init

To start tracking a file or directory, use dvc add (e.g. pictures):

dvc add data/raw/*.jpg

DVC stores information about each added file (or directory) in a special .dvc file, here matching data/raw/*.jpg.dvc, which is a small text file in a human-readable format. These files can be versioned with Git like source code, as placeholders for the original data:

git add data/raw/*.jpg.dvc
git commit -m "Add raw data"
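The placeholder idea rests on content hashing: the small .dvc file records, among other fields, an MD5 hash of the data it stands for. The sketch below only shows how such a hash can be computed in Python; it is not DVC's implementation, and the function name is hypothetical.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the hash changes whenever the file content changes, committing the placeholder is enough for Git to record which version of the data belongs to which commit.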

We recommend creating a tag each time you modify the files inside the data folder:

git commit -m "Add more images. Model trained with 2000 images."
git tag -a "v2.0" -m "model v2.0, 2000 images"
git push --tags
dvc push  # Upload dataset to S3 Bucket on Minio Server

The regular workflow is to use git checkout first to switch branches or to check out a commit, tag, or revision of a .dvc file, and then run dvc checkout to sync the data. For example, to switch to a previous version of the code and data (e.g. the one tagged v1.0), run the commands below. Note that dvc checkout will remove the newer files from the workspace.

git checkout v1.0
dvc checkout

Hydra

Hydra is an open-source Python framework that simplifies the development of research and other complex applications. The key feature is the ability to dynamically create a hierarchical configuration by composition and override it through config files and the command line. The name Hydra comes from its ability to run multiple similar jobs - much like a Hydra with multiple heads.

We recommend going through at least the Basic Tutorial, and the docs about Instantiating objects with Hydra.
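To give a feel for what instantiating objects from configuration means, the following standard-library sketch mimics the core idea: a config entry names a class under the `_target_` key, and the remaining keys become constructor arguments. It is a simplified illustration, not Hydra's implementation; the function name `instantiate_sketch` is hypothetical, and the real `hydra.utils.instantiate` supports much more (nested configs, partial instantiation, and so on).

```python
import importlib
from typing import Any


def instantiate_sketch(config: dict[str, Any]) -> Any:
    """Build an object from a config dict that has a ``_target_`` key."""
    params = dict(config)  # copy so the caller's config is not modified
    module_path, _, attr_name = params.pop("_target_").rpartition(".")
    target = getattr(importlib.import_module(module_path), attr_name)
    return target(**params)


# A standard-library class is used as the target to keep the sketch self-contained.
config = {"_target_": "datetime.timedelta", "hours": 1, "minutes": 30}
delta = instantiate_sketch(config)
print(delta.total_seconds())
```

The benefit of this pattern is that swapping one class for another becomes a configuration change rather than a code change.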

CI

All PRs trigger a CI job that runs linting, type checking, and tests, and builds the docs. The CI script is located here and should be considered the source of truth for running the various development commands.

Line Endings

The .gitattributes file controls line endings for the files in this repository.

Prerequisites

Nearly all prerequisites are managed by Conda. All you need to do is make sure that you have a working Python 3 environment and install Miniconda itself. Conda manages virtual environments as well. Typically, on a project that uses virtualenv directly, you would activate the virtualenv to get all the binaries that you install with pip onto the path. Conda works in a similar way but with different commands.

Use Miniconda for your Python environments (installing the full Anaconda distribution is usually unnecessary; Miniconda should be enough). It makes it easier to install some dependencies, like cudatoolkit for GPU support. It also allows you to access your environments globally.

Example installation:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

IDE Setup

There are a few useful plugins that are probably available for most IDEs. If you use PyCharm, you'll want to install the Black plugin.

blackconnect can be configured to auto-format files on save. Run make blackd from a shell to start the server, and the plugin will do the rest. You need to configure it to format on save, as this is off by default.

Development Details

You can run make help for a full list of targets that you can run. These are the ones that you'll need most often.

# For running tests locally
make test

# For formatting and linting
make lint
make format
make format-fix

# Remove all generated artifacts
make clean

Reproducible environment

The first step in reproducing an analysis is always reproducing the computational environment it was run in. You need the same tools, the same libraries, and the same versions to make everything play nicely together.

By listing all of your requirements in the repository you can easily track the packages needed to recreate the analysis, but what tool should we use to do that?

Whilst popular for scientific computing and data science, conda poses problems for collaboration and packaging:

It is hard to reproduce a conda environment across operating systems.
It is hard to make your environment "pip-installable" if your environment is fully specified by conda.

Files

Due to these difficulties, we recommend only using conda to create a virtual environment and list dependencies not available through pip install.

environment.yaml - Defines the base conda environment and any dependencies not "pip-installable".
requirements/requirements.txt - Defines the dependencies required to run the code. If you need to add a dependency, chances are it goes here!
requirements/requirements-dev.txt - Defines development dependencies. These are for dependencies that are needed during development but not needed to run the core code. For example, packages to run tests.
