New dataset search features blog post (huggingface#2176)
* Create hf-reinvents-dataset-search.md

* add main content

* add images

* update links

* add thumbnail

* rename datasets-filters

* Apply suggestions from code review

Co-authored-by: Omar Sanseviero <[email protected]>

* Update datasets-filters.md

Co-authored-by: Omar Sanseviero <[email protected]>

* upload datasets to hf repo

* update date

---------

Co-authored-by: Omar Sanseviero <[email protected]>
lhoestq and osanseviero authored Jul 8, 2024
1 parent 062a301 commit 24eb4df
Showing 3 changed files with 98 additions and 0 deletions.
8 changes: 8 additions & 0 deletions _blog.yml
Expand Up @@ -4291,3 +4291,11 @@
- partnerships
- intel
- llm

- local: datasets-filters
title: "Announcing New Dataset Search Features"
author: lhoestq
thumbnail: /blog/assets/datasets-filters/thumbnail.jpg
date: Jul 8, 2024
tags:
- datasets
Binary file added assets/datasets-filters/thumbnail.png
90 changes: 90 additions & 0 deletions datasets-filters.md
@@ -0,0 +1,90 @@
---
title: "Announcing New Dataset Search Features"
thumbnail: /blog/assets/datasets-filters/thumbnail.jpg
authors:
- user: lhoestq
- user: severo
---

# Announcing New Dataset Search Features

The AI and ML community has shared more than 180,000 public datasets on the [Hugging Face Dataset Hub](https://huggingface.co/datasets).
Researchers and engineers are using these datasets for various tasks, from training LLMs to chat with users to evaluating automatic speech recognition or computer vision systems.
Dataset discoverability and visualization are key challenges to letting AI builders find, explore, and transform datasets to fit their use cases.

At Hugging Face, we are building the Dataset Hub as the place for the community to collaborate on open datasets.
So we built tools like Dataset Search and the Dataset Viewer, as well as a rich open source ecosystem of tools.
Today we are announcing four new features that will take Dataset Search on the Hub to the next level.

## Search by Modality

The modality of a dataset corresponds to the type of data inside the dataset. For example, the most common types of data on Hugging Face are text, image, audio, and tabular data.

We released a set of filters that let you find datasets containing one or more of the following modalities:

- Text
- Image
- Audio
- Tabular
- Time-Series
- 3D
- Video
- Geospatial

For example, it is possible to look for [datasets that contain both text and image data](https://huggingface.co/datasets?modality=modality:image&modality=modality:text&sort=trending):

![search by modality example](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/datasets-filters/image_and_text.png)

The modalities of each dataset are automatically detected based on file contents and extensions.
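
As a rough illustration, a search URL with several modality filters can be assembled programmatically. This is a minimal sketch: the helper name `modality_search_url` is hypothetical, and it assumes that repeating the `modality` query parameter combines the filters, following the `modality=modality:<name>` pattern visible in the Hub's search URLs.

```python
from urllib.parse import urlencode

HUB_DATASETS_URL = "https://huggingface.co/datasets"


def modality_search_url(modalities, sort="trending"):
    """Build a Dataset Hub search URL with one filter per modality.

    Hypothetical helper: it assumes that repeating the "modality" key
    combines the filters.
    """
    params = [("modality", f"modality:{name}") for name in modalities]
    params.append(("sort", sort))
    # Keep ":" unescaped so the result matches the URLs shown in the browser.
    return f"{HUB_DATASETS_URL}?{urlencode(params, safe=':')}"


url = modality_search_url(["image", "text"])
print(url)
```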

## Search by Size

We recently released a new feature in the interface to show the number of rows of each dataset:

![number of rows of each dataset](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/datasets-filters/datasets_sizes_in_overview.png)

Following this, it is now possible to search datasets by number of rows by specifying a minimum and a maximum number of rows.
This lets you look for anything from small datasets to the biggest ones that exist (for example, the ones used to pretrain LLMs).

The information about the number of rows is available for all the datasets in [supported formats](https://huggingface.co/docs/hub/datasets-adding#file-formats).
Even for the biggest datasets for which the number of rows is not included in the metadata, the total number of rows is estimated accurately based on the content of the first 5GB.
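
This post does not describe how the estimate is computed. One plausible approach, shown here purely as an illustration and not as the Hub's actual implementation, is a linear extrapolation from the rows counted in the sampled prefix. The function name and the numbers in the example are assumptions.

```python
def estimate_total_rows(rows_in_sample, sample_bytes, total_bytes):
    """Extrapolate a row count from a sampled prefix of a dataset.

    Illustrative assumption: rows have roughly uniform size, so the row
    count scales linearly with the number of bytes.
    """
    if sample_bytes <= 0:
        raise ValueError("sample_bytes must be positive")
    if total_bytes <= sample_bytes:
        # The sample covered the whole dataset, so the count is exact.
        return rows_in_sample
    return round(rows_in_sample * total_bytes / sample_bytes)


# Made-up numbers: 12M rows found in the first 5GB of a 200GB dataset.
estimated = estimate_total_rows(12_000_000, 5 * 10**9, 200 * 10**9)
print(estimated)
```

An estimate like this is only as good as the uniform-row-size assumption, which is why it is a sketch rather than a description of the real feature.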

For example, if you are looking for the datasets with the highest number of rows on Hugging Face, you can search for [datasets with more than 10B (10<sup>10</sup>) rows](https://huggingface.co/datasets?size_categories=or:%28size_categories:10B%3Cn%3C100B,size_categories:100B%3Cn%3C1T,size_categories:n%3E1T%29&sort=trending):

![biggest datasets](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/datasets-filters/biggest_datasets.png)
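
The size filter in the link above is percent-encoded, which makes it hard to read. The following sketch decodes it with Python's standard library to show the underlying expression; the helper name `decode_filters` is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

SEARCH_URL = (
    "https://huggingface.co/datasets?size_categories=or:%28"
    "size_categories:10B%3Cn%3C100B,size_categories:100B%3Cn%3C1T,"
    "size_categories:n%3E1T%29&sort=trending"
)


def decode_filters(search_url):
    """Return the query parameters of a search URL in decoded form."""
    query = urlsplit(search_url).query
    # parse_qs returns a list per key; each key appears once in this URL.
    return {key: values[0] for key, values in parse_qs(query).items()}


filters = decode_filters(SEARCH_URL)
print(filters["size_categories"])
```

Once decoded, the filter reads as an `or:` over three size buckets: `10B<n<100B`, `100B<n<1T`, and `n>1T`.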

## Search by Format

The same dataset can be stored in many different formats.
For example, text datasets are often stored as Parquet or JSON Lines, but they can also be plain text files.
Image datasets are often a single directory of images, but they can also be in [WebDataset format](https://huggingface.co/docs/hub/datasets-webdataset) (a format based on TAR archives).

Each format has its pros and cons.
For example, unlike CSV, Parquet supports nested data, and it offers efficient filtering/analytics and a good compression ratio, but accessing one specific row requires decoding a full row group.
Another example is WebDataset, which offers the highest data streaming speed but lacks some metadata, such as the number of rows per file, which is often needed to efficiently distribute data in multi-node training setups.

The dataset format, therefore, indicates which use cases are favoured and whether you will need to reformat the data to fit your needs.
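
To make the WebDataset trade-off concrete, here is a minimal standard-library sketch of a TAR shard in which files sharing a basename form one sample. The sample keys, field extensions, and helper names are made up for illustration, and the code does not use the `webdataset` library itself.

```python
import io
import tarfile
from collections import defaultdict


def build_shard(samples):
    """Write samples into an in-memory TAR shard, one member per field."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for key, fields in samples.items():
            for extension, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{extension}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def read_samples(shard):
    """Group TAR members that share a basename into one sample."""
    grouped = defaultdict(dict)
    with tarfile.open(fileobj=shard, mode="r") as tar:
        for member in tar:  # members are read sequentially
            key, _, extension = member.name.partition(".")
            grouped[key][extension] = tar.extractfile(member).read()
    return dict(grouped)


shard = build_shard({
    "sample0000": {"txt": b"a red car", "cls": b"0"},
    "sample0001": {"txt": b"a blue car", "cls": b"1"},
})
samples = read_samples(shard)
# The number of samples is only known after scanning the whole archive.
print(len(samples), sorted(samples["sample0000"]))
```

Sequential reads are what make this layout friendly to streaming, while the need to scan the archive to count samples illustrates the missing row-count metadata.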

Here you can see the [datasets in WebDataset format](https://huggingface.co/datasets?format=format:webdataset&sort=trending):

![webdatasets](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/datasets-filters/webdatasets.png)

## Search by Library

There are many good libraries and tools to load datasets and prepare them for training, like Pandas, Dask, or the 🤗 Datasets library.
The Hub allows you to use your favorite tools and filter datasets compatible with any library. For example, you can look for [datasets compatible with Pandas](https://huggingface.co/datasets?library=library:pandas&sort=trending):

![pandas compatible datasets](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/datasets-filters/pandas_datasets.png)

The dataset compatibility is based on the dataset format and size (e.g., Dask can load big JSON Lines datasets, unlike Pandas, which requires loading the full dataset in memory).
In addition to this, we also provide the code snippet to load any dataset in your favorite tool:

![load fineweb-edu in dask](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/datasets-filters/dask_fineweb_edu.png)
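
The memory distinction mentioned above can be illustrated without either library. This standard-library sketch contrasts loading every JSON Lines row at once with streaming rows one at a time; it is an analogy for eager versus lazy loading, not a description of how Pandas or Dask work internally, and the sample rows are made up.

```python
import io
import json


def load_all(lines):
    """Eager loading: every row is parsed and held in memory at once."""
    return [json.loads(line) for line in lines if line.strip()]


def stream_rows(lines):
    """Lazy loading: rows are parsed one at a time, as they are requested."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


# A tiny in-memory JSON Lines "file" standing in for a real dataset.
jsonl = '{"text": "hello", "label": 0}\n{"text": "world", "label": 1}\n'

rows = load_all(io.StringIO(jsonl))
total = sum(row["label"] for row in stream_rows(io.StringIO(jsonl)))
print(len(rows), total)
```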

If you would like your library to appear in the list of supported libraries, feel free to open a discussion on [huggingface.js](https://github.com/huggingface/huggingface.js/issues)!

## Combine filters

These four new Dataset Search tools can be used together and with the other existing filters, like Language, Tasks, and Licenses.
By combining these filters with the text search bar, you can find the specific dataset you need:

![search for a webdataset of images of pdf](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/datasets-filters/dataset_cars.png)
