As in any ML problem, the data used to train a model is indispensable to its performance and value. The same principle applies to Generative AI tasks, and more precisely to fine-tuning Large Language Models, one of the critical components of this course.
Thus, let's go over a few bullet points on why you should consider analyzing your dataset before fine-tuning any LLM.
Here are the key steps of our EDA (Exploratory Data Analysis) on the prompts of the Q&A dataset:
- Helps in selecting the optimal context window size for your dataset
- Ensures that your model can capture the whole contextual meaning of your prompt
- Helps in accurately selecting a parameter size for your model (2B, 7B, 13B)
- Helps in establishing max_sequence_len, which speeds up fine-tuning thanks to smaller attention-mask matmuls (see the token-length sketch after this list)
- Gives a visual representation of the most representative words
- Helps in preventing bias towards a specific theme (overfitting)
- Helps in spotting and removing irrelevant frequent words (see the word-frequency sketch after this list)
- Can serve as a first quality check of your overall dataset vocabulary
- Helps in balancing the lengths of your dataset prompts
- Makes it easy to check whether prompt lengths follow a Gaussian (bell-shaped) distribution
- Helps identify samples that could be enhanced or summarised (see the length-outlier sketch after this list)
- Helps limit the input prompt to a specific size, so the completion can be longer
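To make the length-related points concrete, here is a minimal sketch of a prompt token-length analysis. It is not the notebook's exact code: it assumes the prompts sit in a plain Python list called `prompts` and uses the GPT-2 tokenizer as a placeholder for the tokenizer of the model you actually plan to fine-tune.

```python
# Minimal sketch: prompt token-length analysis for choosing max_sequence_len.
# Assumptions (not from the notebook): prompts are strings in a list named
# `prompts`; the tokenizer is a placeholder for your target model's tokenizer.
import numpy as np
import matplotlib.pyplot as plt
from transformers import AutoTokenizer

prompts = [
    "What is attention masking in transformers?",
    "Summarise the main risks of training on unbalanced data.",
    # ... the rest of your Q&A prompts
]

# Swap in the tokenizer of the model you intend to fine-tune.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

token_lengths = np.array([len(tokenizer.encode(p)) for p in prompts])

print(f"min/mean/max tokens: {token_lengths.min()} / "
      f"{token_lengths.mean():.1f} / {token_lengths.max()}")

# A common heuristic: cover ~95% of prompts and pad/truncate the rest,
# which keeps the attention-mask matmuls small during fine-tuning.
max_sequence_len = int(np.percentile(token_lengths, 95))
print(f"suggested max_sequence_len: {max_sequence_len}")

# Histogram to eyeball whether lengths are roughly bell-shaped or skewed.
plt.hist(token_lengths, bins=30)
plt.xlabel("tokens per prompt")
plt.ylabel("count")
plt.title("Prompt token-length distribution")
plt.show()
```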
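For the word-frequency points, a small sketch along these lines can surface irrelevant high-frequency words and theme bias. It reuses the hypothetical `prompts` list from the previous sketch and assumes the optional `wordcloud` package is installed.

```python
# Minimal sketch: frequent-word inspection (word-cloud style) to spot
# boilerplate words and theme bias. Assumes the same `prompts` list as above.
import re
from collections import Counter

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

text = " ".join(prompts).lower()
words = re.findall(r"[a-z']+", text)
word_counts = Counter(w for w in words if w not in STOPWORDS)

# Inspect the top words: domain terms are fine, while template artifacts or
# filler words ("please", "write", ...) are candidates for cleaning.
print(word_counts.most_common(20))

# Word cloud built from the already-filtered frequencies.
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(word_counts)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```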
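Finally, for balancing and pruning, a short continuation of the token-length sketch can flag prompts worth enhancing or summarising. The thresholds below are illustrative assumptions, not values from the notebook.

```python
# Minimal sketch: flag length outliers, continuing from the `token_lengths`
# and `max_sequence_len` variables computed in the token-length sketch above.
import numpy as np

short_threshold = 8  # assumed cutoff: prompts this short probably lack context
short_idx = np.where(token_lengths < short_threshold)[0]
long_idx = np.where(token_lengths > max_sequence_len)[0]

print(f"{len(short_idx)} prompts look too short and could be enhanced")
print(f"{len(long_idx)} prompts exceed max_sequence_len and could be summarised")
```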
→ Check out the EDA notebook at dataset_analysis/prompts_eda.ipynb to see the complete analysis.