Issues: NVIDIA/NeMo-Curator
[BUG] download process has memory leak during extraction to jsonl
bug
Something isn't working
#38
opened Apr 20, 2024 by
zahramahani
Fuzzy dedup error if partition wise indices do not start from 0
bug
Something isn't working
#48
opened May 2, 2024 by
ayushdg
Better mimic DocumentDataset's read_* functions to Dask's read_* functions
enhancement
New feature or request
#50
opened May 3, 2024 by
sarahyurick
[BUG] Jaccard Shuffle error if shuffled_docs.parquet data already exists and has been written.
bug
Something isn't working
#51
opened May 3, 2024 by
ayushdg
[FEA] Add batched files reading to separate_by_metadata.py
enhancement
New feature or request
#53
opened May 6, 2024 by
miguelusque
[BUG] importing spacy before cluster creation leads to only 1 GPU being used.
bug
Something isn't working
#64
opened May 13, 2024 by
ayushdg
[FEA] Add examples showing how to use both CPU & GPU modules together
documentation
Improvements or additions to documentation
enhancement
New feature or request
#65
opened May 15, 2024 by
ayushdg
[FEA] Update read_json to work with s3 paths.
enhancement
New feature or request
#66
opened May 15, 2024 by
ayushdg
[BUG] Better error/checks around input types being CPU/GPU
enhancement
New feature or request
#79
opened May 23, 2024 by
ayushdg
[FEA] Raise a warning when creating FuzzyDuplicatesConfig with non-empty cache_dir
enhancement
New feature or request
#84
opened May 27, 2024 by
randerzander
find_pii_and_deidentify example fails
bug
Something isn't working
#85
opened May 28, 2024 by
randerzander
Update download documentation to include client creation
bug
Something isn't working
documentation
Improvements or additions to documentation
#100
opened Jun 6, 2024 by
moutasemalakkad
Remove Numpy<2.0 pin
meta
General NeMo-Curator maintenance/packaging
#120
opened Jun 18, 2024 by
ayushdg
Remove text field requirement from Download and Extract
enhancement
New feature or request
#158
opened Jul 22, 2024 by
ryantwolf
[META] Update python version to include python 3.11
meta
General NeMo-Curator maintenance/packaging
#188
opened Aug 6, 2024 by
VibhuJawa
Pandas and cuDF DataFrames in DocumentDataset
bug
Something isn't working
#195
opened Aug 8, 2024 by
sarahyurick
[FEA] Add license detector for code repositories
enhancement
New feature or request
#208
opened Aug 15, 2024 by
miguelusque
[FEA] Create intermediate representation (IR) from code source
enhancement
New feature or request
#209
opened Aug 15, 2024 by
miguelusque
[FEA] Align the character pruning vs sequence length based pruning for our models.
enhancement
New feature or request
#213
opened Aug 20, 2024 by
VibhuJawa
Explore dask-jobqueue's Slurm runner for multi-node Slurm setups.
enhancement
New feature or request
#215
opened Aug 23, 2024 by
ayushdg
Grammar and punctuation nits in Jupyter Notebooks
documentation
Improvements or additions to documentation
good first issue
Good for newcomers
#228
opened Sep 4, 2024 by
sarahyurick
16 tasks