Merge pull request #102 from Azure-Samples/installable
More generic, installable
pamelafox authored Oct 25, 2024
2 parents b0697bf + 66b317e commit 7441f26
Showing 46 changed files with 494 additions and 318 deletions.
2 changes: 1 addition & 1 deletion .devcontainer/devcontainer.json
@@ -19,5 +19,5 @@
"hostRequirements": {
"memory": "8gb"
},
"postCreateCommand": "pip install -r requirements-dev.txt"
"postCreateCommand": "pip install -e .\"[dev]\""
}
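In the devcontainer JSON the inner quotes are escaped; on a plain shell the same install is usually written with the extras specifier quoted, since some shells (zsh in particular) treat unquoted square brackets as glob patterns. A small sketch of equivalent forms:

```shell
# Equivalent ways to install the package with its dev extras (sketch):
pip install -e ".[dev]"
pip install -e '.[dev]'
```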
2 changes: 1 addition & 1 deletion .github/CONTRIBUTING.md
@@ -61,7 +61,7 @@ Before you submit your Pull Request (PR) consider the following guidelines:
* Install the development tools and pre-commit hooks:

```shell
python3 -m pip install -r requirements-dev.txt
python3 -m pip install -e ".[dev]"
pre-commit install
```

4 changes: 2 additions & 2 deletions .github/workflows/azure-dev.yaml
@@ -92,12 +92,12 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements.txt
pip install -e .[dev]
- name: Run evaluation
run: |
azd env get-values > .env
source .env
python -m scripts evaluate --config=example_config.json --numquestions=2 --targeturl=${{ env.TARGET_URL }}
python -m evaltools evaluate --config=example_config.json --numquestions=2 --targeturl=${{ env.TARGET_URL }}
env:
TARGET_URL: ${{ secrets.TARGET_URL }}
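The same evaluation step can be reproduced locally. A sketch mirroring the workflow above; `TARGET_URL` is a placeholder for your deployed chat endpoint and is not defined anywhere in this diff:

```shell
# Export the azd environment and run a 2-question evaluation against a deployed endpoint (sketch)
azd env get-values > .env
source .env
python -m evaltools evaluate --config=example_config.json --numquestions=2 --targeturl="$TARGET_URL"
```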
8 changes: 4 additions & 4 deletions .github/workflows/python.yaml
@@ -39,10 +39,10 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements-dev.txt
pip install -e .[dev]
- name: Lint with ruff
run: ruff check .
- name: Check formatting with black
run: black . --check --verbose
- name: Check formatting with ruff
run: ruff format . --check
- name: Run Pytest tests
run: python3 -m pytest
run: python -m pytest
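The same checks can be run locally before pushing; a sketch mirroring the workflow steps above, assuming the dev extras are installed:

```shell
pip install -e ".[dev]"   # dev extras bring in ruff and pytest
ruff check .              # lint
ruff format . --check     # report formatting issues; drop --check to rewrite files
python -m pytest          # run the test suite
```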
12 changes: 6 additions & 6 deletions .pre-commit-config.yaml
@@ -6,10 +6,10 @@ repos:
- id: end-of-file-fixer
- id: trailing-whitespace
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.2.0
rev: v0.6.2
hooks:
- id: ruff
- repo: https://github.com/psf/black
rev: 24.1.1
hooks:
- id: black
# Run the linter.
- id: ruff
args: [ --fix ]
# Run the formatter.
- id: ruff-format
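To exercise these hooks outside of a commit, the standard pre-commit invocations apply (a sketch, assuming the hooks were installed with `pre-commit install`):

```shell
# Run every configured hook against the whole repository
pre-commit run --all-files

# Or run only the ruff hooks defined above
pre-commit run ruff --all-files
pre-commit run ruff-format --all-files
```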
4 changes: 1 addition & 3 deletions .vscode/settings.json
@@ -1,7 +1,5 @@
{
"python.testing.pytestArgs": [
"scripts"
],
"python.testing.pytestArgs": ["tests"],
"python.testing.unittestEnabled": false,
"python.testing.pytestEnabled": true,
"files.exclude": {
34 changes: 20 additions & 14 deletions README.md
@@ -38,10 +38,10 @@ If not, then follow these steps:

1. Install Python 3.10 or higher
2. Create a Python [virtual environment](https://learn.microsoft.com/azure/developer/python/get-started?tabs=cmd#configure-python-virtual-environment).
3. Inside that virtual environment, install the requirements:
3. Inside that virtual environment, install the project:

```shell
python -m pip install -r requirements.txt
python -m pip install -e .
```
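A quick way to confirm the editable install worked is to ask the CLI for help. This is a sketch: it assumes the `evaltools` Typer app exposes the usual `--help`, which should list subcommands such as `generate`, `evaluate`, `summary`, and `diff`:

```shell
python -m pip install -e .
python -m evaltools --help   # expected to print the available subcommands
```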

## Deploying a GPT-4 model
@@ -57,11 +57,17 @@ We've made that easy to deploy with the `azd` CLI tool.
1. Install the [Azure Developer CLI](https://aka.ms/azure-dev/install)
2. Run `azd auth login` to log in to your Azure account
3. Run `azd up` to deploy a new GPT-4 instance
4. Create a `.env` file based on the provisioned resources by copying `.env.sample` and filling in the required values.
You can run this command to see the deployed values:
4. Create a `.env` file based on `.env.sample`:
```shell
azd env get-values
cp .env.sample .env
```
5. Run these commands to get the required values for `AZURE_OPENAI_EVAL_DEPLOYMENT` and `AZURE_OPENAI_SERVICE` from your deployed resource group, and paste those values into the `.env` file:
```shell
azd env get-value AZURE_OPENAI_EVAL_DEPLOYMENT
azd env get-value AZURE_OPENAI_SERVICE
```
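If you prefer not to paste the values by hand, the same two settings can be appended to `.env` directly. A sketch, assuming the `azd` environment created by `azd up` is still selected:

```shell
echo "AZURE_OPENAI_EVAL_DEPLOYMENT=$(azd env get-value AZURE_OPENAI_EVAL_DEPLOYMENT)" >> .env
echo "AZURE_OPENAI_SERVICE=$(azd env get-value AZURE_OPENAI_SERVICE)" >> .env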
### Using an existing Azure OpenAI instance
@@ -133,7 +139,7 @@ This repo includes a script for generating questions and answers from documents
3. Run the generator script:

```shell
python -m scripts generate --output=example_input/qa.jsonl --numquestions=200 --persource=5
python -m evaltools generate --output=example_input/qa.jsonl --persource=5 --numquestions=200
```

That script will generate 200 questions and answers, and store them in `example_input/qa.jsonl`. We've already provided an example based off the sample documents for this app.
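Because the output is JSON Lines, a quick spot check of the generated file only needs standard shell tools (sketch):

```shell
wc -l example_input/qa.jsonl      # one generated question/answer pair per line
head -n 1 example_input/qa.jsonl  # inspect the first record
```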
@@ -145,15 +151,15 @@ This repo includes a script for generating questions and answers from documents
By default, this script assumes your index citation field is named `sourcepage`. If your search index uses a different citation field name, use the `citationfieldname` option to specify the correct name:
```shell
python -m scripts generate --output=example_input/qa.jsonl --numquestions=200 --persource=5 --citationfieldname=filepath
python -m evaltools generate --output=example_input/qa.jsonl --persource=5 --numquestions=200 --citationfieldname=filepath
```
## Running an evaluation
We provide a script that loads in the current `azd` environment's variables, installs the requirements for the evaluation, and runs the evaluation against the local app. Run it like this:

```shell
python -m scripts evaluate --config=example_config.json
python -m evaltools evaluate --config=example_config.json
```

The config.json should contain these fields as a minimum:
@@ -184,7 +190,7 @@ To run against a deployed endpoint, change the `target_url` to the chat endpoint
It's common to run the evaluation on a subset of the questions, to get a quick sense of how the changes are affecting the answers. To do this, use the `--numquestions` parameter:

```shell
python -m scripts evaluate --config=example_config.json --numquestions=2
python -m evaltools evaluate --config=example_config.json --numquestions=2
```

### Specifying the evaluation metrics
@@ -280,7 +286,7 @@ located inside the `review-tools` folder.
To view a summary across all the runs, use the `summary` command with the path to the results folder:

```bash
python -m review_tools summary example_results
python -m evaltools summary example_results
```

This will display an interactive table with the results for each run, like this:
@@ -295,7 +301,7 @@ A modal will appear with the parameters, including any prompt override.
To compare the answers generated for each question across 2 runs, use the `compare` command with 2 paths:

```bash
python -m review_tools diff example_results/baseline_1 example_results/baseline_2
python -m evaltools diff example_results/baseline_1 example_results/baseline_2
```

This will display each question, one at a time, with the two generated answers in scrollable panes,
@@ -308,7 +314,7 @@ Use the buttons at the bottom to navigate to the next question or quit the tool.
You can also filter to only show questions where the value changed for a particular metric, like this:

```bash
python -m review_tools diff example_results/baseline_1 example_results/baseline_2 --changed=has_citation
python -m evaltools diff example_results/baseline_1 example_results/baseline_2 --changed=has_citation
```

## Measuring app's ability to say "I don't know"
@@ -329,7 +335,7 @@ You can write these questions manually, but it’s also possible to generate them
assuming you already have ground truth data with answerable questions.

```shell
python -m scripts generate-dontknows --input=example_input/qa.jsonl --output=example_input/qa_dontknows.jsonl --numquestions=45
python -m evaltools generate-dontknows --input=example_input/qa.jsonl --output=example_input/qa_dontknows.jsonl --numquestions=45
```

That script sends the current questions to the configured GPT-4 model along with prompts to generate questions of each kind.
@@ -360,7 +366,7 @@ We recommend a separate output folder, as you'll likely want to make multiple runs
Run the evaluation like this:

```shell
python -m scripts evaluate --config=dontknows.config.json
python -m evaltools evaluate --config=dontknows.config.json
```

The results will be stored in the `results_dir` folder, and can be reviewed using the [review tools](#viewing-the-results).
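The review tools shown earlier work on this output as well; for example, a summary over the don't-know runs might look like this (a sketch, assuming results land under `example_results_dontknows/` as configured in `dontknows.config.json` below):

```shell
python -m evaltools summary example_results_dontknows
```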
24 changes: 20 additions & 4 deletions dontknows.config.json
@@ -1,11 +1,27 @@
{
"testdata_path": "example_input/qa_dontknows.jsonl",
"results_dir": "blog_results_dontknows/gpt35_prompt2",
"results_dir": "example_results_dontknows/baseline",
"requested_metrics": ["dontknowness", "answer_length", "latency", "has_citation"],
"target_url": "http://host.docker.internal:50505/chat",
"target_url": "http://localhost:50505/chat",
"target_parameters": {
"overrides": {
"prompt_template": "<READFILE>example_input/prompt_refined.txt"
"top": 3,
"temperature": 0.3,
"minimum_reranker_score": 0,
"minimum_search_score": 0,
"retrieval_mode": "hybrid",
"semantic_ranker": true,
"semantic_captions": false,
"suggest_followup_questions": false,
"use_oid_security_filter": false,
"use_groups_security_filter": false,
"vector_fields": [
"embedding"
],
"use_gpt4v": false,
"gpt4v_input": "textAndImages"
}
}
},
"target_response_answer_jmespath": "message.content",
"target_response_context_jmespath": "context.data_points.text"
}
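After editing overrides like these, a quick validity check of the config file is cheap. A sketch using only the Python standard library's JSON pretty-printer:

```shell
python -m json.tool dontknows.config.json > /dev/null && echo "valid JSON"
```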
30 changes: 30 additions & 0 deletions pyproject.toml
@@ -1,6 +1,7 @@
[tool.ruff]
line-length = 120
target-version = "py39"
lint.isort.known-first-party = ["evaltools"]
lint.select = ["E", "F", "I", "UP"]
lint.ignore = ["D203"]

@@ -10,3 +11,32 @@ target-version = ["py39"]

[tool.pytest.ini_options]
addopts = "-ra"

[project]
name = "evaltools"
version = "0.1.1"
description = "Evaluate chat applications using Azure OpenAI evaluators"
dependencies = [
"requests",
"python-dotenv",
"azure-ai-generative[evaluate]==1.0.0b8",
"azure-ai-evaluation==1.0.0b3",
"azure-search-documents",
"typer",
"openai>=1.0.0",
"pandas",
"rich",
"jmespath",
"textual"
]

[project.optional-dependencies]
dev = [
"pre-commit",
"ruff",
"black",
"pytest"
]

[tool.setuptools.package-data]
evaltools = ["review/*.tcss"]
5 changes: 0 additions & 5 deletions requirements-dev.txt

This file was deleted.

2 changes: 0 additions & 2 deletions requirements.txt

This file was deleted.

6 changes: 0 additions & 6 deletions review_tools/__main__.py

This file was deleted.

31 changes: 0 additions & 31 deletions review_tools/cli.py

This file was deleted.
