Tags: huggingface/dataset-viewer
Simplify cache by dropping two collections (#202)
* docs: ✏️ add backup/restore to migration instructions
* feat: 🎸 pass the max number of rows to the worker
* feat: 🎸 delete the 'rows' and 'columns' collections
  Instead of keeping a large collection of rows and columns, then computing the response on every endpoint call (possibly truncating it), we now pre-compute the response and store it in the cache. We lose the ability to get the original data, but we don't need it. It fixes #197. See #197 (comment).
  BREAKING CHANGE: 🧨 the cache database structure has been modified. Run 20220408_cache_remove_dbrow_dbcolumn.py to migrate the database.
* style: 💄 fix types and style
* docs: ✏️ add parameter to avoid error in mongodump
* docs: ✏️ mark ROWS_MAX_BYTES and ROWS_MIN_NUMBER as worker vars
remove "gated datasets unlock" logic (#189)
* refactor: 💡 move gated datasets "unlock" code to models/
  also: add two tests to ensure the gated datasets can be accessed
* test: 💍 adapt to new version of dummy_gated dataset
  I changed (https://huggingface.co/datasets/severo/dummy_gated/commit/99194748bed3625a941aaf785740df02ca5762c9) severo/dummy_gated to a simpler dataset, without a python script, to avoid having non-related errors. Also in the commit: load the HF_TOKEN from a secret in https://github.com/huggingface/datasets-preview-backend/settings/secrets/actions to be able to run the unit tests
* test: 💍 fix wrong hardcoded value
* chore: 🤖 ignore safety warning on ujson package
  it's a dependency of lm-dataformat, and last version still depends on a vulnerable ujson version
* feat: 🎸 remove the "ask_access" logic for gated datasets
  the new "app" tokens on moonlanding can read the gated datasets without having to accept the conditions first, as it occurs for users.
  BREAKING CHANGE: 🧨 HF_TOKEN must be an app token
feat: 🎸 truncate cell contents instead of removing rows (#178)
Add a ROWS_MIN_NUMBER environment variable, which defines the minimum number of rows to return. If the size of these rows is greater than the ROWS_MAX_BYTES limit, then the cells themselves are truncated (transformed to strings, then truncated to 100 bytes, which is a hardcoded limit). In that case, the new field "truncated_cells" contains the list of cells (column names) that are truncated.
BREAKING CHANGE: 🧨 The /rows response format has changed
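
The truncation rule described in #178 can be sketched roughly as follows. This is an illustration only, not the project's code: the function names, the JSON-based size estimate, the shape of each response item, and the choice to truncate only the cells whose string form exceeds 100 bytes are all assumptions. Handling of rows beyond the ROWS_MIN_NUMBER minimum is left out.

```python
import json

CELL_MAX_BYTES = 100  # per-cell limit quoted in the commit message


def truncate_cell(value, max_bytes=CELL_MAX_BYTES):
    """Turn a cell into a string and cut its UTF-8 encoding to max_bytes."""
    text = value if isinstance(value, str) else json.dumps(value, default=str)
    # errors="ignore" drops a multi-byte character split by the cut
    return text.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")


def cell_size(value):
    """Size in bytes of the string form of a cell (assumed measure)."""
    text = value if isinstance(value, str) else json.dumps(value, default=str)
    return len(text.encode("utf-8"))


def rows_size(rows):
    """Approximate size in bytes of the rows once serialized as JSON."""
    return sum(len(json.dumps(row, default=str).encode("utf-8")) for row in rows)


def build_rows_response(rows, rows_max_bytes, rows_min_number):
    """Hypothetical helper: keep the minimum rows, truncating cells if needed."""
    kept = rows[:rows_min_number]
    if rows_size(kept) <= rows_max_bytes:
        # The minimum rows fit: return them untouched
        return [{"row": dict(row), "truncated_cells": []} for row in kept]

    response = []
    for row in kept:
        new_row, truncated = {}, []
        for column, value in row.items():
            if cell_size(value) > CELL_MAX_BYTES:
                new_row[column] = truncate_cell(value)
                truncated.append(column)  # record the column name
            else:
                new_row[column] = value
        response.append({"row": new_row, "truncated_cells": truncated})
    return response
```

With two rows where one `text` cell holds 500 characters and a `rows_max_bytes` of 200, the sketch keeps both rows, shortens only that cell, and lists `"text"` in the first row's `truncated_cells`.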