Welcome to the Capstone project! Everything you've learned over the past days will now be integrated in a realistic data pipeline. The training wheels are off, but we're still at the sideline, cheering you on and supporting you when needed.
In a nutshell, here's what you will do:
Read, transform and load stackoverflow data from S3 with PySpark and cleaning it such that it can be used by a llm model. We want to find out if we can improve an existing foundational model by providing it relevant data on specific topics in the form of stackoverflow questions and answers. Since these foundational models don't always contain the most recent changes, they might provide outdated results. Your task is thus to improve the llm by feeding it the right data.
You will start by building the code for the ingestion pipeline locally and scheduling it using Airflow. If this is going well, we will run it on a cloud platform, called Conveyor.
To get started We've set up a Gitpod environment containing all the tools required to complete this exercise (awscli, python, vscode, ...). You can access this environment by clicking the button below:
NOTE: When you fork the code repo to your own remote make sure to change the Gitpod URL to reflect your account in this README!
This is an ubuntu-based environment pre-installed with:
- VSCode
- A Python3 virtual environment: we recommend you always work inside this environment.
- The AWS CLI
IMPORTANT: Create a new branch and periodically push your work to the remote. After 30min of inactivity this environment shuts down and you will likely lose unsaved progress. As stated before, change the Gitpod URL to reflect your remote.
In this repository we prepared the scaffolding for a basic python project. Best is to use the scaffolding as is, as that will make sure you quickly get something up and running.
root/
|-- dags/
|-- src/
| |-- project/
| |-- |-- common/
| |-- |-- |-- spark.py
| |-- |-- tasks/
|-- tests/
| |-- common/
| |-- | -- spark.py
| Dockerfile
| setup.py
| .gitpod.yml
| .gitpod.dockerfile
| docker-compose.yaml
| requirements.txt
| requirements.in
| setup.py
Our team already ingested questions and answers from StackOverflow for you to use. We used the stackoverflow API. We ingested different tags, pick one of them with your table (3-4 peopla) as a starting point for cleaning your data. This helps to distribute who is working on which data.
The input data is stored in the following s3 bucket: dataminded-academy-capstone-llm-data-us
under path input/{tag}/
The S3 bucket resides in us-east-1 region.
Investigate the data, you can download and inspect the json files. Download them as follows:
aws s3 ls s3://dataminded-academy-capstone-llm-data-us/input/
aws s3 cp s3://dataminded-academy-capstone-llm-data-us/input/dbt/questions.json ./
Start by writing your cleaning transformation by reading/writing local files and only afterwards interact directly with s3.
Given the input data for 1 tag, the goal is to create 1 json document per question containing the title, question body and the response body.
So your goal is to extract the relevant fields from both the questions and answers and join them together using the question_id
field.
Write the json documents per question again to s3 under path cleaned/<user>/{tag}
If you are confident in your code, the next step is scheduling it using Airflow
The following commands are assumed to run in the capstone_llm
subdirectory.
- create a virtualenv:
python3 -m venv venv
- using dependencies:
- add the dependencies to the requirements.in
- generate correct versions + transitive dependencies using:
pip-compile requirements.in
- install the dependencies in your virtual environment using
pip install -r requirements.txt
- note: the provided dockerfile already packages pyspark 3.5.1, so you only need to specify this in your dev-requirements.in. This way you have spark available when running locally, but it is not packaged twice when running in Docker.
- 2 places to write your transformation logic:
- clean.py: your pyspark cleaning code
- ingest.py: see task 3bis (only if you have time left)
- run the tasks
- install the project in your venv directory as follows:
pip install -e .
- run a task:
python3 -m capstonellm.tasks.clean
orpython3 -m capstonellm.tasks.ingest
- install the project in your venv directory as follows:
As you know have working python code. We now want to make sure this is triggered using Airflow.
We start with a local installation of Airflow, you can use the docker-compose.yml
file, similar to the setup used in the Airflow session.
- package the python code in a Dockerfile. If you used the provided scaffolding, this should be easy. Take a look at the Dockerfile and make sure you understand everything
- create an Airflow dag with one task (clean) that will run your clean job using the DockerOperator. In order to access s3, you will have to pass your credentials to the docker container.
The next step is to start using your prepared data in the llm. Before feeding the data, let's test the performance of the current model by asking it some questions:
curl --location 'https://kfkjc4ha7crhz6s3uat54wtj6m0bgtdf.lambda-url.eu-west-1.on.aws' \
--header 'Content-Type: application/json' \
--data '{
"query":"What is the equivalent of `DataFrame.drop_duplicates()` from pandas in polars?"
}'
Go to the knowledge base and reprocess the input data again. See if now the model returns more accurate results. Try for example the exact question that fits in your input data?
NOTE: This is an optional task, if you still have time.
The goal here is to create the input data yourself instead of relying on the data that we have provided. In order to do this you will have to investigate the Stackoverflow API. You should call the API and fetch the questions and answers separately, which can be done as follows:
- Query the questions given 1 or more specified tags
- Using the question ids from the previous step, look for the relevant answers
As a best practice this raw data is not preprocessed/cleaned but dumped as is in the s3 bucket under path /input/{user}/{tag}
.
The processing and cleaning, you already did in the cleaning step. This way if you made a mistake while cleaning, you can start again from the raw data without calling the API again.
Now that we have verified all our code locally, it is now time to deploy it to production environment. In our case this will be Conveyor.
- login to Conveyor:
conveyor auth login
- create a Conveyor project with the following name:
capstone-llm-{user}
from the root directory. - Tweak teh
conveyor_example.py
to run your job using the ConveyorContainerOperatorV2. - instead of using the AWS credentials directly, as we did locally, you now attach the
capstone_conveyor_llm
role. - to test things out you can run:
conveyor run
- build and deploy your project:
conveyor build && conveyor deploy --env test
Setup virtual environment:
pyenv local
to use a correct python versionpython -m venv venv
to create a virtual environmentsource ./venv/bin/activate
to activate the virtual environmentpip install pip-tools
to install pip tools
Tasks:
pip install -r requirements.txt
to install dependenciespip install -r dev-requirements.txt
to install development dependenciespip install -e .
to install the project in editable modepython -m pytest --cov=src tests
runs all the tests and check coveragepython -m black dags src tests --check
checks PEP8 compliance issuespython -m black dags src tests
fixes PEP8 compliance issuespip-compile requirements.in
if you add new requirements this regenerates a new requirements.txtpip-compile dev-requirements.in
if you add new requirements this regenerates a new dev-requirements.txt, you should also do this when have updated your requirements.in