Commit 63d0d8b

Add new audio and vision docs in Datasets blog post (huggingface#409)
* 📝 add draft of blog post
* 🖍 apply review feedback
* ✨ add new guide overview image
* 🖍 apply nate review
* 🖍 apply omar review
* ✨ use a faster space
* 📝 update date
1 parent 76fc1b3 commit 63d0d8b

6 files changed, +127 -0 lines changed

_blog.yml

+11
@@ -1042,3 +1042,14 @@
1042 1042
tags:
1043 1043
- nlp
1044 1044
- guide
1045+
1046+
- local: datasets-docs-update
1047+
title: "Introducing new audio and vision documentation in 🤗 Datasets"
1048+
author: stevhliu
1049+
thumbnail: assets/87_datasets-docs-update/thumbnail.gif
1050+
date: July 28, 2022
1051+
tags:
1052+
- audio
1053+
- cv
1054+
- community
1055+
- announcement
Binary files (previews not shown): 80.4 KB, 319 KB, 188 KB, 211 KB

datasets-docs-update.md

+116
@@ -0,0 +1,116 @@
1+
---
2+
title: "Introducing new audio and vision documentation in 🤗 Datasets"
3+
thumbnail: /blog/assets/87_datasets-docs-update/thumbnail.gif
4+
---
5+
6+
<h1>
7+
Introducing new audio and vision documentation in 🤗 Datasets
8+
</h1>
9+
10+
<div class="blog-metadata">
11+
<small>Published July 28, 2022.</small>
12+
<a target="_blank" class="btn no-underline text-sm mb-5 font-sans" href="https://github.com/huggingface/blog/blob/main/datasets-docs-update.md">
13+
Update on GitHub
14+
</a>
15+
</div>
16+
17+
<div class="author-card">
18+
<a href="/stevhliu">
19+
<img class="avatar avatar-user" src="https://aeiljuispo.cloudimg.io/v7/https://s3.amazonaws.com/moonup/production/uploads/1599079986463-noauth.jpeg?w=200&h=200&f=face" title="Gravatar">
20+
<div class="bfc">
21+
<code>stevhliu</code>
22+
<span class="fullname">Steven Liu</span>
23+
</div>
24+
</a>
25+
</div>
26+
27+
Open and reproducible datasets are essential for advancing good machine learning. At the same time, datasets have grown tremendously in size as rocket fuel for large language models. In 2020, Hugging Face launched 🤗 Datasets, a library dedicated to:
28+
29+
1. Providing access to standardized datasets with a single line of code.
30+
2. Providing tools for rapidly and efficiently processing large-scale datasets.
31+
32+
Thanks to the community, we added hundreds of NLP datasets in many languages and dialects during the [Datasets Sprint](https://discuss.huggingface.co/t/open-to-the-community-one-week-team-effort-to-reach-v2-0-of-hf-datasets-library/2176)! 🤗 ❤️
33+
34+
But text datasets are just the beginning. Data also comes in richer formats like 🎵 audio and 📸 images, and in combinations such as audio and text or image and text. Models trained on these datasets enable awesome applications like describing what is in an image or answering questions about it.
35+
36+
<div class="hidden xl:block">
37+
<div style="display: flex; flex-direction: column; align-items: center;">
38+
<iframe src="https://hf.space/embed/Salesforce/BLIP/+
39+
" frameBorder="0" width="1400" height="690" title="Gradio app" class="p-0 flex-grow space-iframe" allow="accelerometer; ambient-light-sensor; autoplay; battery; camera; document-domain; encrypted-media; fullscreen; geolocation; gyroscope; layout-animations; legacy-image-formats; magnetometer; microphone; midi; oversized-images; payment; picture-in-picture; publickey-credentials-get; sync-xhr; usb; vr ; wake-lock; xr-spatial-tracking" sandbox="allow-forms allow-modals allow-popups allow-popups-to-escape-sandbox allow-same-origin allow-scripts allow-downloads"></iframe>
40+
</div>
41+
</div>
42+
43+
The 🤗 Datasets team has been building tools and features to make working with these dataset types as simple as possible for the best developer experience. We added new documentation along the way to help you learn more about loading and processing audio and image datasets.
44+
45+
## Quickstart
46+
47+
The [Quickstart](https://huggingface.co/docs/datasets/quickstart) is one of the first places new users visit for a TLDR about a library’s features. That’s why we updated the Quickstart to include how you can use 🤗 Datasets to work with audio and image datasets. Choose a dataset modality you want to work with and see an end-to-end example of how to load and process the dataset to get it ready for training with either PyTorch or TensorFlow.
48+
49+
Also new in the Quickstart is the `to_tf_dataset` function which takes care of converting a dataset into a `tf.data.Dataset` like a mama bear taking care of her cubs. This means you don’t have to write any code to shuffle and load batches from your dataset to get it to play nicely with TensorFlow. Once you’ve converted your dataset into a `tf.data.Dataset`, you can train your model with the usual TensorFlow or Keras methods.
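To make that concrete, here is a minimal sketch of the conversion. It assumes `datasets` and `tensorflow` are installed. The toy table and the helper names `make_toy_columns` and `to_tf` are made up for illustration and are not taken from the Quickstart. The `columns`, `label_cols`, `batch_size`, and `shuffle` arguments pick the inputs and labels and leave batching and shuffling to 🤗 Datasets.

```python
def make_toy_columns(num_rows: int = 8) -> dict:
    """Build a tiny in-memory table (made-up data, for illustration only)."""
    return {
        "features": [[float(i), float(i) + 0.5] for i in range(num_rows)],
        "label": [i % 2 for i in range(num_rows)],
    }


def to_tf(columns: dict, batch_size: int = 4):
    """Convert the toy table into a batched, shuffled tf.data.Dataset."""
    # Imported lazily because it needs `pip install datasets tensorflow`.
    from datasets import Dataset

    dataset = Dataset.from_dict(columns)
    return dataset.to_tf_dataset(
        columns=["features"],   # model inputs
        label_cols=["label"],   # targets
        batch_size=batch_size,
        shuffle=True,
    )


columns = make_toy_columns()

try:
    tf_dataset = to_tf(columns)
except ImportError:
    print("Install `datasets` and `tensorflow` to run the conversion.")
else:
    # Each element is one (features, labels) batch.
    for features, labels in tf_dataset.take(1):
        print(features.shape, labels.shape)
```

The resulting `tf.data.Dataset` can then be passed to a Keras model's `fit` method like any other TensorFlow dataset.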
50+
51+
Check out the [Quickstart](https://huggingface.co/docs/datasets/quickstart) today to learn how to work with different dataset modalities and try out the new `to_tf_dataset` function!
52+
53+
<figure class="image table text-center m-0 w-full">
54+
<img style="border:none;" alt="Cards with links to end-to-end examples for how to process audio, vision, and NLP datasets" src="assets/87_datasets-docs-update/quickstart.png" />
55+
<figcaption>Choose your dataset adventure!</figcaption>
56+
</figure>
57+
58+
## Dedicated guides
59+
60+
Each dataset modality has its own nuances when it comes to loading and processing. For example, when you load an audio dataset, the audio signal is automatically decoded and resampled on the fly by the `Audio` feature. This is quite different from loading a text dataset!
61+
62+
To make all of the modality-specific documentation more discoverable, there are new dedicated sections with guides focused on showing you how to load and process each modality. If you’re looking for specific information about working with a dataset modality, take a look at these dedicated sections first. Meanwhile, functions that are non-specific and can be used broadly are documented in the General Usage section. Reorganizing the documentation in this way will allow us to better scale to other dataset types we plan to support in the future.
63+
64+
<figure class="image table text-center m-0 w-full">
65+
<img style="border:none;" alt="An overview of the how-to guides page that displays five new sections of the guides: general usage, audio, vision, text, and dataset repository." src="assets/87_datasets-docs-update/overview.png" />
66+
<figcaption>The guides are organized into sections that cover the most essential aspects of 🤗 Datasets.</figcaption>
67+
</figure>
68+
69+
Check out the [dedicated guides](https://huggingface.co/docs/datasets/how_to) to learn more about loading and processing datasets for different modalities.
70+
71+
## ImageFolder
72+
73+
Typically, 🤗 Datasets users [write a dataset loading script](https://huggingface.co/docs/datasets/dataset_script) to download and generate a dataset with the appropriate `train` and `test` splits. With the `ImageFolder` dataset builder, you don’t need to write any code to download and generate an image dataset. Loading an image dataset for image classification is as simple as ensuring your dataset is organized in a folder like:
74+
75+
```
76+
folder/train/dog/golden_retriever.png
77+
folder/train/dog/german_shepherd.png
78+
folder/train/dog/chihuahua.png
79+
80+
folder/train/cat/maine_coon.png
81+
folder/train/cat/bengal.png
82+
folder/train/cat/birman.png
83+
```
84+
85+
<figure class="image table text-center m-0 w-full">
86+
<img style="border:none;" alt="A table of images of dogs and their associated label." src="assets/87_datasets-docs-update/good_boi_pics.png" />
87+
<figcaption>Your 🐶 dataset should look something like this once you've uploaded it to the Hub and previewed it.</figcaption>
88+
</figure>
89+
90+
Image labels are generated in a `label` column based on the directory name. `ImageFolder` allows you to get started instantly with an image dataset, eliminating the time and effort required to write a dataset loading script.
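As a rough illustration of that convention, the sketch below recreates the example folder with empty placeholder files and derives each label from the parent directory name using plain `pathlib`. It only mimics the idea and is not how `ImageFolder` is implemented. The helper names `build_example_folder` and `infer_labels` are hypothetical.

```python
import tempfile
from pathlib import Path

# Example layout from this post: folder/<split>/<label>/<image file>
EXAMPLE_FILES = [
    "train/dog/golden_retriever.png",
    "train/dog/german_shepherd.png",
    "train/dog/chihuahua.png",
    "train/cat/maine_coon.png",
    "train/cat/bengal.png",
    "train/cat/birman.png",
]


def build_example_folder(root: Path) -> None:
    """Create empty placeholder files in the expected layout (hypothetical helper)."""
    for relative in EXAMPLE_FILES:
        path = root / relative
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()  # placeholders only; real images are needed for actual loading


def infer_labels(root: Path) -> dict[str, str]:
    """Map each image path (relative to root) to its parent directory name."""
    return {
        path.relative_to(root).as_posix(): path.parent.name
        for path in sorted(root.rglob("*.png"))
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "folder"
    build_example_folder(root)
    labels = infer_labels(root)

for file_name, label in labels.items():
    print(f"{label}\t{file_name}")
```

With real images in place of the placeholders, pointing `load_dataset("imagefolder", data_dir=...)` at the same folder should produce the `label` column described above.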
91+
92+
But wait, it gets even better! If you have a file containing some metadata about your image dataset, `ImageFolder` can be used for other image tasks like image captioning and object detection. For example, object detection datasets commonly have *bounding boxes*, coordinates in an image that identify where an object is. `ImageFolder` can use this file to link the metadata about the bounding box and category for each image to the corresponding images in the folder:
93+
94+
```py
95+
{"file_name": "0001.png", "objects": {"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}}
96+
{"file_name": "0002.png", "objects": {"bbox": [[810.0, 100.0, 57.0, 28.0]], "categories": [1]}}
97+
{"file_name": "0003.png", "objects": {"bbox": [[160.0, 31.0, 248.0, 616.0], [741.0, 68.0, 202.0, 401.0]], "categories": [2, 2]}}
98+
99+
dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
100+
dataset[0]["objects"]
101+
{"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}
102+
```
103+
104+
You can use `ImageFolder` to load an image dataset for nearly any type of image task if you have a metadata file with the required information. Check out the [ImageFolder](https://huggingface.co/docs/datasets/image_load) guide to learn more.
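If you need to produce such a metadata file yourself, the standard library is enough. The sketch below writes the three records from the example above as JSON Lines and reads them back, indexed by `file_name`. The file name `metadata.jsonl` is an assumption based on the ImageFolder guide, since this post does not name the file, and the helpers `write_metadata` and `read_metadata` are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Records copied from the example above: one JSON object per image.
RECORDS = [
    {"file_name": "0001.png", "objects": {"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}},
    {"file_name": "0002.png", "objects": {"bbox": [[810.0, 100.0, 57.0, 28.0]], "categories": [1]}},
    {"file_name": "0003.png", "objects": {"bbox": [[160.0, 31.0, 248.0, 616.0], [741.0, 68.0, 202.0, 401.0]], "categories": [2, 2]}},
]


def write_metadata(folder: Path, records: list) -> Path:
    """Write one JSON object per line next to the images (hypothetical helper)."""
    path = folder / "metadata.jsonl"  # assumed file name, see the ImageFolder guide
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path


def read_metadata(path: Path) -> dict:
    """Index the `objects` entry of every record by its file name."""
    with path.open(encoding="utf-8") as f:
        return {record["file_name"]: record["objects"] for record in map(json.loads, f)}


with tempfile.TemporaryDirectory() as tmp:
    metadata_path = write_metadata(Path(tmp), RECORDS)
    objects_by_file = read_metadata(metadata_path)

print(objects_by_file["0001.png"])
```

Keeping one record per line means each image's metadata stays self-contained, so rows can be appended or removed without touching the rest of the file.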
105+
106+
## What’s next?
107+
108+
Similar to how the first iteration of the 🤗 Datasets library standardized text datasets and made them super easy to download and process, we are very excited to bring this same level of user-friendliness to audio and image datasets. In doing so, we hope it’ll be easier for users to train, build, and evaluate models and applications across all different modalities.
109+
110+
In the coming months, we’ll continue to add new features and tools to support working with audio and image datasets. Word on the 🤗 Hugging Face street is that there’ll be something called `AudioFolder` coming soon! 🤫 While you wait, feel free to take a look at the [audio processing guide](https://huggingface.co/docs/datasets/audio_process) and then get hands-on with an audio dataset like [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech).
111+
112+
---
113+
114+
Join the [forum](https://discuss.huggingface.co/) to ask questions and share feedback about working with audio and image datasets. If you discover any bugs, please open a [GitHub Issue](https://github.com/huggingface/datasets/issues/new/choose) so we can take care of them.
115+
116+
Feeling a little more adventurous? Contribute to the growing community-driven collection of audio and image datasets on the [Hub](https://huggingface.co/datasets)! [Create a dataset repository](https://huggingface.co/docs/datasets/upload_dataset) on the Hub and upload your dataset. If you need a hand, open a discussion on your repository’s **Community tab** and ping one of the 🤗 Datasets team members to help you cross the finish line!
