Commit
README: Add notes about device specification for AOTI inference (pyto…
Jack-Khuu authored Jul 29, 2024
1 parent 900b6d4 commit fe7e5b2
Showing 1 changed file with 4 additions and 2 deletions.
6 changes: 4 additions & 2 deletions README.md
@@ -256,7 +256,7 @@ python3 torchchat.py export llama3 --output-dso-path exportedModels/llama3.so

> [!NOTE]
> If your machine has CUDA, add this flag for performance
`--quantize config/data/cuda.json` when exporting. You'll also need to tell generate to use `--device cuda` and the runner to use `-d CUDA`
`--quantize config/data/cuda.json` when exporting.
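The note above can be sketched as a small shell snippet that picks the quantize config automatically. The flag and config path come from the README; probing for CUDA via `nvidia-smi` is an assumption for illustration, and the final command is echoed rather than run:

```bash
# Sketch: choose the quantize config based on whether a CUDA GPU is visible.
# Using nvidia-smi as the CUDA probe is an assumption, not part of torchchat.
if command -v nvidia-smi >/dev/null 2>&1; then
  QUANT_ARGS="--quantize config/data/cuda.json"
else
  QUANT_ARGS=""
fi
# Echoed for illustration; drop the echo to actually export.
echo python3 torchchat.py export llama3 $QUANT_ARGS --output-dso-path exportedModels/llama3.so
```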


### Run in a Python Environment
@@ -266,6 +266,7 @@ To run in a Python environment, use the generate subcommand like before, but incl
```
python3 torchchat.py generate llama3 --dso-path exportedModels/llama3.so --prompt "Hello my name is"
```
**Note:** Depending on which accelerator is used to generate the .dso file, the command may need the device specified: `--device (cuda | cpu)`.
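Following the note above, the device flag can be selected to match the accelerator the `.dso` was exported for. A minimal sketch, assuming `nvidia-smi` indicates a CUDA machine (the command itself is echoed rather than run):

```bash
# Sketch: match --device to the accelerator the .dso was exported for.
# Detecting CUDA via nvidia-smi is an assumption for illustration.
if command -v nvidia-smi >/dev/null 2>&1; then
  DEVICE=cuda
else
  DEVICE=cpu
fi
# Echoed for illustration; drop the echo to actually generate.
echo python3 torchchat.py generate llama3 --dso-path exportedModels/llama3.so --device "$DEVICE" --prompt "Hello my name is"
```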


### Run using our C++ Runner
@@ -275,10 +276,11 @@ To run in a C++ environment, we need to build the runner binary.
scripts/build_native.sh aoti
```

Then run the compiled executable with the exported DSO from earlier:
Then run the compiled executable with the exported DSO from earlier.
```bash
cmake-out/aoti_run exportedModels/llama3.so -z `python3 torchchat.py where llama3`/tokenizer.model -l 3 -i "Once upon a time"
```
**Note:** Depending on which accelerator is used to generate the .dso file, the runner may need the device specified: `-d (CUDA | CPU)`.
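The runner's `-d` flag from the note above can be chosen the same way, with the device name in upper case. A hedged sketch (the `nvidia-smi` probe is an assumption; the command is echoed, not executed, so the backticked `where` lookup is left unevaluated):

```bash
# Sketch: the C++ runner takes the device in upper case via -d.
# nvidia-smi as the CUDA probe is an assumption for illustration.
if command -v nvidia-smi >/dev/null 2>&1; then
  RUNNER_DEVICE=CUDA
else
  RUNNER_DEVICE=CPU
fi
# Echoed for illustration; drop the echo (and unescape the substitution) to run.
echo "cmake-out/aoti_run exportedModels/llama3.so -d $RUNNER_DEVICE -z \$(python3 torchchat.py where llama3)/tokenizer.model -l 3 -i \"Once upon a time\""
```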

## Mobile Execution

