From 320a61c7dc3212569c6b9ec9557708a1cd45478b Mon Sep 17 00:00:00 2001
From: Ben Browning
Date: Wed, 27 Nov 2024 20:09:18 -0500
Subject: [PATCH] Add a CHANGELOG.md and fill it in for latest 2 releases

This doesn't fill in our entire history, but starts us down the path of
providing a proper CHANGELOG.md for future releases.

Signed-off-by: Ben Browning
---
 .spellcheck-en-custom.txt |  9 +++++++++
 CHANGELOG.md              | 17 +++++++++++++++++
 2 files changed, 26 insertions(+)
 create mode 100644 CHANGELOG.md

diff --git a/.spellcheck-en-custom.txt b/.spellcheck-en-custom.txt
index 0ed144c1..c06f114b 100644
--- a/.spellcheck-en-custom.txt
+++ b/.spellcheck-en-custom.txt
@@ -9,6 +9,8 @@ Dataset
 dataset
 datasets
 distractor
+Docling
+docling
 Eval
 eval
 FIXME
@@ -23,18 +25,25 @@ MCQ
 Merlinite
 Mixtral
 MMLU
+multiphase
 Ouput
 Pre
 pre
+precomputed
 Pregenerated
 qna
 quantized
 repo
 sdg
 Splitter
+subdirectory
 subfolder
 Tatsu
+Tesseract
+tokenizer
+tokenizers
 unchunked
+upsampled
 UUID
 vLLM
 yaml

diff --git a/CHANGELOG.md b/CHANGELOG.md
new file mode 100644
index 00000000..8f56e2c6
--- /dev/null
+++ b/CHANGELOG.md
@@ -0,0 +1,17 @@
+## v0.6.1
+
+### Fixes
+
+* Fixed a bug where generating data from a taxonomy with 2 or more changed knowledge leaf nodes would fail with a message about a destination path `already exists and is not an empty directory`
+
+## v0.6.0
+
+### Features
+
+* Small knowledge datasets will automatically get upsampled during final data mixing, based on the length of any precomputed skills datasets used during data mixing. This avoids issues where very large precomputed skills datasets were swamping the comparatively minor number of knowledge samples, resulting in lower than optimal knowledge retention during multiphase training. If a large precomputed dataset isn't in use during mixing (which is how things operate by default), this change is a no-op.
+* When chunking PDF documents, we'll now look for the docling models on disk in `$XDG_DATA_HOME/instructlab/sdg/models` (as well as in each entry of `$XDG_DATA_DIRS` under the same `instructlab/sdg/models` subdirectory). If they are not found on disk, they'll automatically be downloaded from HuggingFace.
+* When chunking PDF documents with Docling, we'll automatically configure Docling to use `tesserocr` instead of `easyocr` if a working `tesserocr` installation is found. We fall back to `easyocr` if Tesseract is not properly configured for use by `tesserocr`.
+
+### Breaking Changes
+
+* Teacher model tokenizers are now loaded from the local teacher model on disk rather than downloaded automatically from HuggingFace. The typical workflows in use so far expect the teacher model to exist on disk, and this change enforces that at least its tokenizer exists.
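A note on the knowledge upsampling entry in the v0.6.0 features above: the idea can be illustrated roughly as follows. This is a hypothetical sketch, not the actual instructlab/sdg mixing code; the helper name `upsample_knowledge` and the `target_ratio` parameter are illustrative assumptions.

```python
# Hypothetical illustration of the knowledge upsampling idea; the real
# mixing logic in instructlab/sdg is more involved. If a large
# precomputed skills dataset is mixed in, repeat the small knowledge
# dataset so it is not swamped; with no precomputed data this is a no-op.
def upsample_knowledge(knowledge, precomputed_skills_len, target_ratio=0.1):
    if precomputed_skills_len == 0 or not knowledge:
        return list(knowledge)  # no precomputed skills data: no-op
    target = int(precomputed_skills_len * target_ratio)
    if len(knowledge) >= target:
        return list(knowledge)  # already large enough: no-op
    reps = -(-target // len(knowledge))  # ceiling division
    return (list(knowledge) * reps)[:target]
```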
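The on-disk model lookup described in the v0.6.0 features might look roughly like this, following the XDG Base Directory conventions the changelog mentions. This is a sketch, not the actual implementation; the helper name `find_docling_models_dir` is a hypothetical.

```python
import os
from pathlib import Path

def find_docling_models_dir():
    # Hypothetical sketch of the lookup described in the changelog.
    # Check $XDG_DATA_HOME (defaulting to ~/.local/share per the XDG
    # spec), then each entry of $XDG_DATA_DIRS, for the
    # instructlab/sdg/models subdirectory.
    data_home = os.environ.get("XDG_DATA_HOME") or str(Path.home() / ".local" / "share")
    candidates = [data_home]
    data_dirs = os.environ.get("XDG_DATA_DIRS", "/usr/local/share:/usr/share")
    candidates.extend(d for d in data_dirs.split(":") if d)
    for base in candidates:
        models_dir = Path(base) / "instructlab" / "sdg" / "models"
        if models_dir.is_dir():
            return models_dir
    return None  # caller would fall back to downloading from HuggingFace
```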
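The `tesserocr`-with-`easyocr`-fallback behavior from the v0.6.0 features could be sketched like this. This is an illustrative assumption about how such a check might work, not the actual instructlab/sdg code; `choose_ocr_engine` is a hypothetical helper.

```python
def choose_ocr_engine():
    # Hypothetical sketch: prefer tesserocr when it both imports and can
    # talk to a working Tesseract install, otherwise fall back to easyocr.
    try:
        import tesserocr
        # tesserocr.tesseract_version() fails if Tesseract itself is
        # missing or misconfigured.
        tesserocr.tesseract_version()
        return "tesserocr"
    except Exception:
        return "easyocr"
```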