
DS_251_RAG: Exploring the use of RAG in NHS England

pdm-managed licence: MIT

❗ Warning: this repository may contain references internal to NHS England that cannot be accessed publicly

This project aims to explore the potential uses and efficacy of "Retrieval Augmented Generation" (RAG) within NHS England.

This will principally involve:

  • WP1: Make a simple RAG pipeline -> here
  • WP2: Make current code reusable
  • WP3: Methodology exploration - how to take RAG further
  • WP4: Evaluation
  • WP5: Explainer

Progress towards these is tracked here: https://github.com/orgs/nhsengland/projects/29/

The project "milestones" are described here: https://github.com/nhsengland/ds_251_RAG/milestones

Contact

This repository is maintained by the NHS England Data Science Team.

To contact us, raise an issue on GitHub or email us.

See our (and our colleagues') other work here: NHS England Analytical Services

Description

This project will look at how RAG (Retrieval Augmented Generation) can be used within NHS England. The primary aims are to evaluate how well it performs on an example use case, to produce resources advising on how the technique can be taken forward or enhanced, and to help senior decision makers understand the opportunities and benefits posed by RAG.

WP1: Make a simple RAG pipeline

  • We need to make a simple RAG pipeline to use to explore this technique
  • It doesn't need to be that complex, but we do need to be able to turn the "RAG" component on or off, so we can test what effect it has.

The basic structure is described below:

    graph
        A[Receive Query] --> L{RAG?}
        L -- Yes --> B[Pass to database]
        L -- No --> J
        B --> H[(Vectorstore)]
        H --> I[Retrieve documents]
        I --> F[Prompt: Inject Metadata]
        F --> G[Prompt: Stuff Documents]
        G --> J[Submit prompt to LLM]
        J --> K[Get response]

The code for the RAG pipeline is found here: src/models.py

You can run the RAG pipeline here: dev.ipynb
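
As a hedged illustration of that on/off toggle, the sketch below mirrors the diagram's flow. Every name in it (retrieve, build_prompt, call_llm, the sample documents) is a hypothetical stand-in, not the actual API in src/models.py:

    # Hypothetical sketch of the WP1 toggle; not the real src/models.py API.
    DOCUMENTS = [
        {"text": "Chickenpox is a common childhood infection.",
         "source": "nhs.uk/conditions/chickenpox"},
        {"text": "Asthma is a lung condition causing breathing difficulty.",
         "source": "nhs.uk/conditions/asthma"},
    ]

    def retrieve(query: str, k: int = 2) -> list[dict]:
        """Stand-in for a vectorstore similarity search (keyword overlap here)."""
        words = query.lower().split()
        return sorted(
            DOCUMENTS,
            key=lambda d: -sum(w in d["text"].lower() for w in words),
        )[:k]

    def build_prompt(query: str, docs: list[dict]) -> str:
        """Inject source metadata and 'stuff' the retrieved documents into the prompt."""
        context = "\n".join(f"[{d['source']}] {d['text']}" for d in docs)
        return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

    def call_llm(prompt: str) -> str:
        """Placeholder for the actual LLM call."""
        return f"<LLM response to a {len(prompt)}-character prompt>"

    def answer(query: str, use_rag: bool = True) -> str:
        """Route the query through retrieval, or send it straight to the LLM."""
        prompt = build_prompt(query, retrieve(query)) if use_rag else query
        return call_llm(prompt)

    print(answer("What is chickenpox?", use_rag=True))   # RAG on
    print(answer("What is chickenpox?", use_rag=False))  # RAG off

The only branch is the use_rag flag, which is what lets the evaluation in WP4 compare like for like.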

WP2: Make current code reusable

  • We want to meet the silver RAP standard, so the code needs better documentation, testing, and breaking up into reusable chunks.

WP3: Methodology exploration

WP4: Evaluation

  • We need to understand how good these tools are at answering questions and summarising documents.
  • A relatively simple starting approach is to use the NHS Conditions website (https://www.nhs.uk/conditions/) and check whether the pipeline can answer questions about that content correctly, and whether turning RAG on or off makes any difference.
  • We could then measure the similarity of the "correct" answer to the generated answer using "LLM-as-a-judge", and compare the scores for RAG and non-RAG (see the sketch below).
  • There are a number of benchmarks, such as SQuAD; however, it is not clear that good performance on these would translate to the tasks relevant to our business.
  • More discussion about evaluating RAG can be found in these papers:
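
As a hedged sketch of the LLM-as-a-judge comparison above, assuming the openai Python client and an assumed gpt-4o judge model; the prompt wording and 1-5 scale are illustrative, not settled project choices:

    # Hypothetical LLM-as-a-judge scorer; the model name, prompt wording and
    # 1-5 scale are assumptions, not settled project choices.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def judge_similarity(question: str, reference: str, candidate: str) -> int:
        """Ask a judge LLM to rate the candidate answer against the reference."""
        prompt = (
            "On a scale of 1-5, rate how well the candidate answer matches "
            "the reference answer. Reply with a single digit.\n"
            f"Question: {question}\n"
            f"Reference: {reference}\n"
            f"Candidate: {candidate}"
        )
        response = client.chat.completions.create(
            model="gpt-4o",  # assumed judge model
            messages=[{"role": "user", "content": prompt}],
        )
        return int(response.choices[0].message.content.strip())

    # For each question, compare the judge's score with RAG on vs RAG off.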

WP5: Explainer

  • Senior decision makers will need to make decisions regarding this technology. We need to produce resources that explain what it is, its benefits and risks (particularly when compared with fine-tuning, or doing nothing), and any relevant findings from this project.

Getting Started

  1. Clone the repository. To learn about what this means, and how to use Git, see the Git guide.

    git clone https://github.com/nhsengland/ds_251_RAG.git
    
  2. Set up your environment using pdm

    • Ensure you've got PDM installed
      • Linux:
        curl -sSL https://pdm-project.org/install-pdm.py | python3 -
        
      • Windows (powershell):
        (Invoke-WebRequest -Uri https://pdm-project.org/install-pdm.py -UseBasicParsing).Content | python -
        
    • Make sure you've got the right version of Python installed (see pyproject.toml)
    • Run pdm sync (on Windows, you might need to prefix this with "python -m ")
    • This will set up the venv and install the correct dependencies.
  3. Create your own .secrets file with the relevant keys (see .secrets_example for the format).

  4. Open rag_demo.ipynb and run it

Project structure

The repository structure is outlined below. It follows a template designed primarily for publications teams at NHS England; projects with different requirements (e.g. more complex documentation and modelling) should look to DrivenData's cookiecutter project structure, as well as our Community of Practice, for guidance.

|   .gitignore                        <- Files (& file types) automatically removed from version control for security purposes
|   config.toml                       <- Configuration file with parameters we want to be able to change (e.g. date)
|   environment.yml                   <- Conda equivalent of requirements file
|   requirements.txt                  <- Requirements for reproducing the analysis environment 
|   pyproject.toml                    <- Configuration file containing package build information
|   LICENCE                           <- License info for public distribution
|   README.md                         <- Quick start guide / explanation of your project
|
|   create_publication.py             <- Runs the overall pipeline to produce the publication     
|
+---src                               <- Scripts with functions for use in 'create_publication.py'. Contains project's codebase.
|   |       __init__.py               <- Makes the functions folder an importable Python module
|   |
|   +---utils                     <- Scripts relating to configuration and handling data connections e.g. importing data, writing to a database etc.
|   |       __init__.py               <- Makes the functions folder an importable Python module
|   |       file_paths.py             <- Configures file paths for the package
|   |       logging_config.py         <- Configures logging
|   |       data_connections.py       <- Handles data connections i.e. reading/writing dataframes from SQL Server
|   | 
|   +---processing                    <- Scripts with modules containing functions to process data i.e. clean and derive new fields
|   |       __init__.py               <- Makes the functions folder an importable Python module
|   |       clean.py                  <- Perform cleaning and wrangling processes 
|   |       derive_fields.py          <- Create new field definitions, columns, derivations.
|   | 
|   +---data_ingestion                <- Scripts with modules containing functions to preprocess read data i.e. perform validation/data quality checks, other preprocessing etc.
|   |       __init__.py               <- Makes the functions folder an importable Python module
|   |       preprocessing.py          <- Perform preprocessing, for example preparing your data for metadata or data quality checks.
|   |       validation_checks.py      <- Perform validation checks e.g. a field has acceptable values.
|   |
|   +---data_exports
|   |       __init__.py               <- Makes the functions folder an importable Python module
|   |       write_excel.py            <- Populates an Excel .xlsx template with values from your CSV output.
|   |
+---sql                               <- SQL scripts for importing data  
|       example.sql
|
+---templates                         <- Templates for output files
|       publication_template.xlsx
|
+---tests
|   |       __init__.py               <- Makes the functions folder an importable Python module
|   |
|   +---backtests                     <- Comparison tests for the old and new pipelines' outputs
|   |       backtesting_params.py
|   |       test_compare_outputs.py
|   |       __init__.py               <- Makes the functions folder an importable Python module
|   |
|   +---unittests                     <- Tests for the functional outputs of Python code
|   |       test_data_connections.py
|   |       test_processing.py
|   |       __init__.py               <- Makes the functions folder an importable Python module

root

In the highest level of this repository (known as the 'root'), there is one Python file: create_publication.py. This top-level file should be the main place where users interact with the code, and where the steps to create your publication are stored.

This file currently runs a set of example steps using example data.

src

This directory contains the meaty parts of the code. By organising the code into logical sections, we make it easier to understand, maintain and test. Moreover, tucking the complex code out of the way means that users don't need to understand everything about the code all at once.

  • data_connections.py handles reading data in and writing data back out.
  • processing folder contains the core business logic.
  • utils folder contains useful reusable functions (e.g. setting up logging and importing configuration settings from config.toml; see the sketch after this list)
  • write_excel.py contains functions relating to the final part of the pipeline; any exporting or templating happens here. This is a simplistic example of writing output to an Excel spreadsheet template (.xlsx). A good example of this application is: NHS sickness absence rates publication. For a more in-depth Excel templating application, we highly recommend using Automated Excel Production.
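
As a hedged sketch of that config pattern, assuming Python 3.11+ (for the standard-library tomllib) and hypothetical key names:

    # Hypothetical sketch of reading config.toml; the key names are illustrative.
    import tomllib  # standard library from Python 3.11
    from pathlib import Path

    def load_config(path: str = "config.toml") -> dict:
        """Read the project's TOML configuration into a plain dict."""
        with Path(path).open("rb") as f:
            return tomllib.load(f)

    # Usage, assuming a hypothetical [dates] table in config.toml:
    # config = load_config()
    # publication_date = config["dates"]["publication_date"]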

Licence

The LICENCE file will need to be updated with the correct year and owner.

Unless stated otherwise, the codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation.

Any HTML or Markdown documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.

Acknowledgements