Merge pull request #418 from bbrowning/add-changelog
Add a CHANGELOG.md and fill it in for the latest 2 releases
mergify[bot] authored Dec 6, 2024
2 parents d31f4e7 + 320a61c commit f7c35d7
Showing 2 changed files with 26 additions and 0 deletions.
9 changes: 9 additions & 0 deletions .spellcheck-en-custom.txt
@@ -9,6 +9,8 @@ Dataset
dataset
datasets
distractor
Docling
docling
Eval
eval
FIXME
@@ -23,18 +25,25 @@ MCQ
Merlinite
Mixtral
MMLU
multiphase
Ouput
Pre
pre
precomputed
Pregenerated
qna
quantized
repo
sdg
Splitter
subdirectory
subfolder
Tatsu
Tesseract
tokenizer
tokenizers
unchunked
upsampled
UUID
vLLM
yaml
17 changes: 17 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,17 @@
## v0.6.1

### Fixes

* Fixed a bug where generating data from a taxonomy with two or more changed knowledge leaf nodes would fail with an error stating that a destination path `already exists and is not an empty directory`.

## v0.6.0

### Features

* Small knowledge datasets are now automatically upsampled during final data mixing, based on the length of any precomputed skills datasets used during data mixing. This avoids issues where very large precomputed skills datasets swamped the comparatively small number of knowledge samples, resulting in lower-than-optimal knowledge retention during multiphase training. If no large precomputed dataset is in use during mixing (the default), this change is a no-op. A rough sketch of the upsampling idea follows this list.
* When chunking PDF documents, we'll now look for the docling models on disk in `$XDG_DATA_HOME/instructlab/sdg/models` (as well as in each entry of `$XDG_DATA_DIRS`, under the same `instructlab/sdg/models` subdirectory). If they are not found on disk, they'll automatically be downloaded from HuggingFace. The lookup order is sketched after this list.
* When chunking PDF documents with Docling, we'll automatically configure Docling to use `tesserocr` instead of `easyocr` when a working Tesseract installation is found. We fall back to `easyocr` if Tesseract is not properly configured for use by `tesserocr` (see the selection sketch after this list).
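
The heuristic below is a minimal sketch of the knowledge upsampling idea, not the library's actual mixing recipe: the function name, the `target_ratio` parameter, and the repeat-then-top-up strategy are all illustrative assumptions.

```python
import math
import random

def upsample_knowledge(knowledge_samples, skills_dataset_size, target_ratio=0.1):
    """Repeat knowledge samples so they make up roughly `target_ratio` of the
    mix when a large precomputed skills dataset is present (illustrative only)."""
    if skills_dataset_size == 0 or not knowledge_samples:
        # No precomputed skills data in the mix (the default case): no-op.
        return list(knowledge_samples)
    desired = math.ceil(skills_dataset_size * target_ratio)
    if len(knowledge_samples) >= desired:
        return list(knowledge_samples)
    # Repeat the whole knowledge set, then top up with a random sample.
    repeats, remainder = divmod(desired, len(knowledge_samples))
    upsampled = list(knowledge_samples) * repeats
    upsampled += random.sample(knowledge_samples, remainder)
    return upsampled
```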
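
A minimal sketch of the on-disk model lookup described above, assuming the standard XDG defaults; the HuggingFace repo id in the fallback is a placeholder assumption, not necessarily the one the library downloads from.

```python
import os
from pathlib import Path

def candidate_model_dirs(subdir="instructlab/sdg/models"):
    """Yield the locations searched for the docling models, per the XDG
    base-directory convention."""
    data_home = os.environ.get("XDG_DATA_HOME") or str(Path.home() / ".local/share")
    yield Path(data_home) / subdir
    for d in os.environ.get("XDG_DATA_DIRS", "/usr/local/share:/usr/share").split(":"):
        if d:
            yield Path(d) / subdir

def resolve_docling_models():
    """Return the first existing model directory, else download as a fallback."""
    for path in candidate_model_dirs():
        if path.is_dir():
            return path
    from huggingface_hub import snapshot_download
    return Path(snapshot_download(repo_id="ds4sd/docling-models"))  # placeholder repo id
```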
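
And a sketch of the OCR engine selection: probing Tesseract via `tesserocr.get_languages()` is an assumption about what counts as a working setup, and the returned string would still need to be mapped to the corresponding Docling OCR options.

```python
def choose_ocr_engine() -> str:
    """Prefer tesserocr when Tesseract is usable, otherwise fall back to easyocr."""
    try:
        import tesserocr

        # Raises if Tesseract's data files are missing or misconfigured.
        tesserocr.get_languages()
        return "tesserocr"
    except Exception:
        return "easyocr"
```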

### Breaking Changes

* Teacher model tokenizers are loaded from the local teacher model on disk and are not downloaded automatically from HuggingFace. The typical workflows in use so far expect the teacher model to exist on disk, and this enforces that at least its tokenizer does. A sketch of local-only loading is shown below.
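
A sketch of local-only tokenizer loading, assuming a Hugging Face-style model directory; the file check and error message are illustrative, not the library's exact behaviour.

```python
from pathlib import Path
from transformers import AutoTokenizer

def load_teacher_tokenizer(model_dir):
    """Load the tokenizer shipped with the on-disk teacher model; never download."""
    model_path = Path(model_dir)
    if not (model_path / "tokenizer_config.json").is_file():
        raise FileNotFoundError(
            f"No tokenizer found in {model_path}; the teacher model (or at least "
            "its tokenizer files) must exist on disk."
        )
    return AutoTokenizer.from_pretrained(str(model_path), local_files_only=True)
```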
