Ollama on macOS and Windows will automatically download updates. Click on the taskbar or menubar item and then click "Restart to update" to apply the update. Updates can also be installed by downloading the latest version manually.
On Linux, re-run the install script:
curl -fsSL https://ollama.com/install.sh | sh
Review the Troubleshooting docs for more about using logs.
Please refer to the GPU docs.
By default, Ollama uses a context window size of 2048 tokens.
To change this when using ollama run, use /set parameter:
/set parameter num_ctx 4096
When using the API, specify the num_ctx parameter:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "options": {
    "num_ctx": 4096
  }
}'
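As a rough Python equivalent of the curl call above, the sketch below builds the same request with the standard library but does not send it. The helper name build_generate_request and the OLLAMA_URL constant are illustrative assumptions, not part of Ollama.

```python
import json
import urllib.request

# Default address from this document; adjust it if OLLAMA_HOST was changed.
OLLAMA_URL = "http://localhost:11434"


def build_generate_request(model, prompt, num_ctx):
    """Build (but do not send) a POST request for /api/generate."""
    payload = {
        "model": model,
        "prompt": prompt,
        # Per-request context window, the same field as in the curl example.
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("llama3", "Why is the sky blue?", 4096)
print(request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would send it, which requires a running Ollama server.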
Use the ollama ps command to see what models are currently loaded into memory.
ollama ps
NAME          ID              SIZE     PROCESSOR    UNTIL
llama3:70b    bcfb190ca3a7    42 GB    100% GPU     4 minutes from now
The Processor column will show which memory the model was loaded into:
- 100% GPU means the model was loaded entirely into the GPU
- 100% CPU means the model was loaded entirely in system memory
- 48%/52% CPU/GPU means the model was loaded partially onto both the GPU and into system memory
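If the Processor value needs to be read by a script, it can be split into CPU and GPU shares. The following is a minimal sketch that assumes the cell always has one of the three shapes listed above; parse_processor is a hypothetical helper, not an Ollama function.

```python
def parse_processor(value):
    """Split a PROCESSOR cell into {"CPU": percent, "GPU": percent}."""
    # "48%/52% CPU/GPU" -> percents "48%/52%", labels "CPU/GPU"
    percents, labels = value.split()
    shares = {"CPU": 0, "GPU": 0}
    for percent, label in zip(percents.split("/"), labels.split("/")):
        shares[label] = int(percent.rstrip("%"))
    return shares


for cell in ("100% GPU", "100% CPU", "48%/52% CPU/GPU"):
    print(cell, "->", parse_processor(cell))
```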
Ollama server can be configured with environment variables.
If Ollama is run as a macOS application, environment variables should be set using launchctl:
- For each environment variable, call launchctl setenv.
  launchctl setenv OLLAMA_HOST "0.0.0.0"
- Restart the Ollama application.
If Ollama is run as a systemd service, environment variables should be set using systemctl:
- Edit the systemd service by calling systemctl edit ollama.service. This will open an editor.
- For each environment variable, add a line Environment under section [Service]:
  [Service]
  Environment="OLLAMA_HOST=0.0.0.0"
- Save and exit.
- Reload systemd and restart Ollama:
  systemctl daemon-reload
  systemctl restart ollama
On Windows, Ollama inherits your user and system environment variables.
- First, quit Ollama by clicking on it in the taskbar.
- Start the Settings (Windows 11) or Control Panel (Windows 10) application and search for environment variables.
- Click on Edit environment variables for your account.
- Edit or create a new variable for your user account for OLLAMA_HOST, OLLAMA_MODELS, etc.
- Click OK/Apply to save.
- Start the Ollama application from the Windows Start menu.
Ollama pulls models from the Internet and may require a proxy server to access the models. Use HTTPS_PROXY to redirect outbound requests through the proxy. Ensure the proxy certificate is installed as a system certificate. Refer to the section above for how to use environment variables on your platform.
Note: Avoid setting HTTP_PROXY. Ollama does not use HTTP for model pulls, only HTTPS. Setting HTTP_PROXY may interrupt client connections to the server.
The Ollama Docker container image can be configured to use a proxy by passing -e HTTPS_PROXY=https://proxy.example.com when starting the container.
Alternatively, the Docker daemon can be configured to use a proxy. Instructions are available for Docker Desktop on macOS, Windows, and Linux, and Docker daemon with systemd.
Ensure the certificate is installed as a system certificate when using HTTPS. This may require a new Docker image when using a self-signed certificate.
FROM ollama/ollama
COPY my-ca.pem /usr/local/share/ca-certificates/my-ca.crt
RUN update-ca-certificates
Build and run this image:
docker build -t ollama-with-ca .
docker run -d -e HTTPS_PROXY=https://my.proxy.example.com -p 11434:11434 ollama-with-ca
Ollama runs locally, and conversation data does not leave your machine.
Ollama binds to 127.0.0.1 on port 11434 by default. Change the bind address with the OLLAMA_HOST environment variable.
Refer to the section above for how to set environment variables on your platform.
Ollama runs an HTTP server and can be exposed using a proxy server such as Nginx. To do so, configure the proxy to forward requests and optionally set required headers (if not exposing Ollama on the network). For example, with Nginx:
server {
    listen 80;
    server_name example.com;  # Replace with your domain or IP
    location / {
        proxy_pass http://localhost:11434;
        proxy_set_header Host localhost:11434;
    }
}
Ollama can be accessed using a range of tunneling tools. For example, with Ngrok:
ngrok http 11434 --host-header="localhost:11434"
To use Ollama with Cloudflare Tunnel, use the --url and --http-host-header flags:
cloudflared tunnel --url http://localhost:11434 --http-host-header="localhost:11434"
Ollama allows cross-origin requests from 127.0.0.1 and 0.0.0.0 by default. Additional origins can be configured with OLLAMA_ORIGINS.
Refer to the section above for how to set environment variables on your platform.
- macOS: ~/.ollama/models
- Linux: /usr/share/ollama/.ollama/models
- Windows: C:\Users\%username%\.ollama\models
If a different directory needs to be used, set the environment variable OLLAMA_MODELS to the chosen directory.
Note: on Linux using the standard installer, the ollama user needs read and write access to the specified directory. To assign the directory to the ollama user, run sudo chown -R ollama:ollama <directory>.
Refer to the section above for how to set environment variables on your platform.
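For scripts that need to locate the model files, the lookup described above can be mirrored in a few lines. This is only a sketch: models_dir is a hypothetical helper, it reads OLLAMA_MODELS from its own environment (which may differ from the environment the Ollama server was started with), and the Linux default assumes the standard installer.

```python
import os
import platform
from pathlib import Path


def models_dir(system=None, env=None):
    """Return the directory where Ollama is expected to store models."""
    system = system or platform.system()  # "Darwin", "Linux", or "Windows"
    env = os.environ if env is None else env
    # An explicit OLLAMA_MODELS setting replaces the platform default.
    if env.get("OLLAMA_MODELS"):
        return Path(env["OLLAMA_MODELS"])
    if system == "Linux":
        # Default listed above for Linux (standard installer).
        return Path("/usr/share/ollama/.ollama/models")
    # The macOS and Windows defaults both sit under the user's home directory.
    return Path.home() / ".ollama" / "models"


print(models_dir())
```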
There is already a large collection of plugins available for VSCode as well as other editors that leverage Ollama. See the list of extensions & plugins at the bottom of the main repository readme.
The Ollama Docker container can be configured with GPU acceleration in Linux or Windows (with WSL2). This requires the nvidia-container-toolkit. See ollama/ollama for more details.
GPU acceleration is not available for Docker Desktop in macOS due to the lack of GPU passthrough and emulation.
This can impact both installing Ollama and downloading models.
Open Control Panel > Network and Internet > View network status and tasks and click on Change adapter settings on the left panel. Find the vEthernet (WSL) adapter, right click and select Properties.
Click on Configure and open the Advanced tab. Search through each of the properties until you find Large Send Offload Version 2 (IPv4) and Large Send Offload Version 2 (IPv6). Disable both of these properties.
If you are using the API, you can preload a model by sending the Ollama server an empty request. This works with both the /api/generate and /api/chat API endpoints.
To preload the mistral model using the generate endpoint, use:
curl http://localhost:11434/api/generate -d '{"model": "mistral"}'
To use the chat completions endpoint, use:
curl http://localhost:11434/api/chat -d '{"model": "mistral"}'
To preload a model using the CLI, use the command:
ollama run llama3.1 ""
By default, models are kept in memory for 5 minutes before being unloaded. This allows for quicker response times if you are making numerous requests to the LLM. You may, however, want to free up the memory before the 5 minutes have elapsed or keep the model loaded indefinitely. Use the keep_alive parameter with either the /api/generate or /api/chat API endpoint to control how long the model is left in memory.
The keep_alive parameter can be set to:
- a duration string (such as "10m" or "24h")
- a number in seconds (such as 3600)
- any negative number, which will keep the model loaded in memory (e.g. -1 or "-1m")
- 0, which will unload the model immediately after generating a response
For example, to preload a model and leave it in memory use:
curl http://localhost:11434/api/generate -d '{"model": "llama3", "keep_alive": -1}'
To unload the model and free up memory use:
curl http://localhost:11434/api/generate -d '{"model": "llama3", "keep_alive": 0}'
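The accepted keep_alive forms can be seen side by side by generating the request bodies in Python. This is a minimal sketch; keep_alive_payload is a hypothetical helper that only builds the JSON body and sends nothing to the server.

```python
import json


def keep_alive_payload(model, keep_alive):
    """Return the JSON body for a prompt-less request that sets keep_alive."""
    # keep_alive may be a duration string, a number of seconds,
    # a negative value (stay loaded), or 0 (unload right away).
    return json.dumps({"model": model, "keep_alive": keep_alive})


print(keep_alive_payload("llama3", "10m"))  # keep loaded for ten minutes
print(keep_alive_payload("llama3", 3600))   # keep loaded for one hour
print(keep_alive_payload("llama3", -1))     # keep loaded indefinitely
print(keep_alive_payload("llama3", 0))      # unload immediately
```

Each printed body can be passed to curl with -d, as in the examples above.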
Alternatively, you can change the amount of time all models are loaded into memory by setting the OLLAMA_KEEP_ALIVE environment variable when starting the Ollama server. The OLLAMA_KEEP_ALIVE variable accepts the same value types as the keep_alive parameter described above. Refer to the section explaining how to configure the Ollama server to correctly set the environment variable.
If you wish to override the OLLAMA_KEEP_ALIVE setting, use the keep_alive API parameter with the /api/generate or /api/chat API endpoints.
If too many requests are sent to the server, it will respond with a 503 error indicating the server is overloaded. You can adjust how many requests may be queued by setting OLLAMA_MAX_QUEUE.
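On the client side, one common way to cope with a 503 is to retry after a short pause. The sketch below is an assumption about how a caller might do this and is not something Ollama provides: call_with_retry, its parameters, and the backoff schedule are all illustrative.

```python
import time
import urllib.error


def call_with_retry(send, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send() and retry with exponential backoff while it raises HTTP 503."""
    for attempt in range(attempts):
        try:
            return send()
        except urllib.error.HTTPError as error:
            # Only the overload response is retried; other errors propagate,
            # as does a 503 on the final attempt.
            if error.code != 503 or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Here send is any zero-argument callable that performs the request, for example a lambda wrapping urllib.request.urlopen, which raises HTTPError on a 503 response.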
Ollama supports two levels of concurrent processing. If your system has sufficient available memory (system memory when using CPU inference, or VRAM for GPU inference) then multiple models can be loaded at the same time. For a given model, if there is sufficient available memory when the model is loaded, it is configured to allow parallel request processing.
If there is insufficient available memory to load a new model request while one or more models are already loaded, all new requests will be queued until the new model can be loaded. As prior models become idle, one or more will be unloaded to make room for the new model. Queued requests will be processed in order. When using GPU inference, new models must be able to completely fit in VRAM to allow concurrent model loads.
Parallel request processing for a given model results in increasing the context size by the number of parallel requests. For example, a 2K context with 4 parallel requests will result in an 8K context and additional memory allocation.
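The arithmetic in that example can be written out directly. The function name effective_context is hypothetical and the calculation covers only the context size in tokens, not the resulting memory use.

```python
def effective_context(num_ctx, num_parallel):
    """Total context allocated when a model serves num_parallel requests."""
    return num_ctx * num_parallel


print(effective_context(2048, 4))  # 8192 tokens, the 8K context from the example
```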
The following server settings may be used to adjust how Ollama handles concurrent requests on most platforms:
- OLLAMA_MAX_LOADED_MODELS - The maximum number of models that can be loaded concurrently provided they fit in available memory. The default is 3 * the number of GPUs or 3 for CPU inference.
- OLLAMA_NUM_PARALLEL - The maximum number of parallel requests each model will process at the same time. The default will auto-select either 4 or 1 based on available memory.
- OLLAMA_MAX_QUEUE - The maximum number of requests Ollama will queue when busy before rejecting additional requests. The default is 512.
Note: Windows with Radeon GPUs currently defaults to 1 model maximum due to limitations in ROCm v5.7 for available VRAM reporting. Once ROCm v6.2 is available, Windows Radeon will follow the defaults above. You may enable concurrent model loads on Radeon on Windows, but ensure you don't load more models than will fit into your GPU's VRAM.
Installing multiple GPUs of the same brand can be a great way to increase your available VRAM to load larger models. When you load a new model, Ollama evaluates the required VRAM for the model against what is currently available. If the model will entirely fit on any single GPU, Ollama will load the model on that GPU. This typically provides the best performance as it reduces the amount of data transferring across the PCI bus during inference. If the model does not fit entirely on one GPU, then it will be spread across all the available GPUs.
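The placement rule described above can be illustrated with a simplified sketch. It is not Ollama's scheduler: place_model is a hypothetical function, choosing the first GPU that fits is an assumption, and anything else the real scheduler may take into account is ignored.

```python
def place_model(required_vram, free_vram_per_gpu):
    """Return the GPU indices a model would be placed on under the rule above."""
    for index, free in enumerate(free_vram_per_gpu):
        if required_vram <= free:
            # Fits on a single GPU. Picking the first match is an assumption.
            return [index]
    # Does not fit on any single GPU: spread across all available GPUs.
    return list(range(len(free_vram_per_gpu)))


print(place_model(20, [24, 24]))  # fits on one GPU
print(place_model(42, [24, 24]))  # spread across both GPUs
```

Values are in the same unit on both sides (for example GB of VRAM).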