A tool for filtering segments that are too long or too short after being encoded with a SentencePiece model.

erip/spm_filter

spm-filter

A Unix-style command-line utility that filters raw sentences by their length after SentencePiece encoding.

Installation

git clone git@github.com:erip/spm_filter.git
cd spm_filter
pip install -e .

Usage

# Read from stdin by default
cat sents.txt | spm-filter -m /path/to/sentencepiece.model --max-len 256 > filtered.txt
# Or read from a file
spm-filter -i sents.txt -m /path/to/sentencepiece.model --max-len 256 > filtered2.txt
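
The filtering idea behind the commands above can be sketched in Python. This is a minimal illustration, not the code from this repository: the names `filter_by_encoded_length` and `load_sentencepiece_encoder` are hypothetical, the `min_len` parameter and the inclusive bounds are assumptions, and the encoder is passed in as a callable so the logic can be tried without a trained model.

```python
from typing import Callable, Iterable, Iterator, List, Optional


def filter_by_encoded_length(
    lines: Iterable[str],
    encode: Callable[[str], List],
    min_len: Optional[int] = None,
    max_len: Optional[int] = None,
) -> Iterator[str]:
    """Yield raw sentences whose encoded length is within the given bounds.

    Bounds are treated as inclusive here (an assumption); a bound of None
    disables that check.
    """
    for line in lines:
        sentence = line.rstrip("\n")
        n_pieces = len(encode(sentence))
        if min_len is not None and n_pieces < min_len:
            continue  # too short after encoding
        if max_len is not None and n_pieces > max_len:
            continue  # too long after encoding
        # Emit the raw sentence, not the encoded pieces.
        yield sentence


def load_sentencepiece_encoder(model_path: str) -> Callable[[str], List]:
    """Return an encode function backed by a SentencePiece model.

    The import is kept local so the filter above works without the
    sentencepiece package installed.
    """
    import sentencepiece as spm

    processor = spm.SentencePieceProcessor(model_file=model_path)
    return processor.encode
```

With a real model, `encode = load_sentencepiece_encoder("/path/to/sentencepiece.model")` followed by `filter_by_encoded_length(sys.stdin, encode, max_len=256)` would roughly mirror the first command above. For a quick check without a model, `str.split` can stand in as the encoder.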

Authors
