Introduction

This repository demonstrates how to create a small Docker image serving large language models (LLMs).

The resources in this repo were used in the following Medium articles:

https://medium.com/towards-data-science/reducing-the-size-of-docker-images-serving-llm-models-b70ee66e5a76

https://czuk.medium.com/reducing-the-size-of-docker-images-serving-large-language-models-part-2-b7226a0b6514

TL;DR

It is possible to reduce the size of a Docker image serving an LLM from gigabytes to megabytes. In our case, the image shrank from 7 GB to 575 MB. Such a reduction is useful when we are constrained by network transfers (pushing and pulling the image over a network), by the image registry's limits, or by the production server's memory.
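To put the quoted sizes in perspective, the short calculation below works out the reduction factor and the share of space saved. It is only a back-of-the-envelope sketch: it assumes 1 GB = 1000 MB, and the variable names are illustrative.

```python
# Sizes quoted in the text above; 1 GB is treated as 1000 MB (an assumption).
original_mb = 7 * 1000
reduced_mb = 575

# How many times smaller the final image is, and the percentage saved.
factor = original_mb / reduced_mb
saved_pct = (1 - reduced_mb / original_mb) * 100

print(f"{factor:.1f}x smaller, {saved_pct:.1f}% saved")
# prints "12.2x smaller, 91.8% saved"
```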

Size reduction was possible thanks to:

using onnxruntime instead of torch,
converting the model to ONNX format and quantizing it,
compressing the model,
using the tokenizers package instead of transformers.
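The model compression step in the list above can be illustrated with a minimal sketch. The text does not say which compression method the repository uses, so LZMA from the Python standard library is an assumption here, and the function names and file names (`model.onnx`, `model.onnx.xz`) are hypothetical. The idea is to store only the compressed file in the image and unpack it when the container starts.

```python
import lzma
import shutil
from pathlib import Path


def compress_model(src: Path, dst: Path) -> None:
    """Compress a model file with LZMA (run once, when building the image)."""
    with open(src, "rb") as f_in, lzma.open(dst, "wb", preset=9) as f_out:
        shutil.copyfileobj(f_in, f_out)


def decompress_model(src: Path, dst: Path) -> None:
    """Restore the original model file (run when the container starts)."""
    with lzma.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)


# Hypothetical usage; the file names are placeholders, not the repo's own:
#   compress_model(Path("model.onnx"), Path("model.onnx.xz"))    # build time
#   decompress_model(Path("model.onnx.xz"), Path("model.onnx"))  # start-up
```

The trade-off in this sketch is a slower container start in exchange for a smaller image to push and pull. The decompressed file still occupies its full size on disk and in memory at run time.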

