This work is based on our short paper, VLA-3D: A Dataset for 3D Semantic Scene Understanding and Navigation, first presented at the Workshop on Semantic Reasoning and Goal Understanding in Robotics (SemRob), RSS 2024.
The VLA-3D dataset is a 3D object referential dataset designed for vision-language grounding, which can be used for the higher-level task of vision-language navigation (VLN). The dataset contains over 9M synthetically generated language statements for 7635 3D scenes with a total of 11619 regions, drawn from a combination of 3D scan datasets, and is released as part of the CMU Vision-Language-Autonomy (VLA) Challenge. A sample scene from each data source can be found under sample_data. Please refer to the Download section for instructions on downloading the dataset.
- [2024-07] We release the first version of our dataset. Refer to the Download section for download instructions.
First install the required Python packages:

```bash
pip install minio tqdm
```

Then run the `download_dataset.py` script. The script can be run with the following arguments:
- `--download_path`: path to the output folder where the dataset is downloaded. Defaults to `VLA-3D_dataset`.
- `--subset`: name of a single dataset source to download only a subset of the data. One of Matterport/Scannet/HM3D/Unity/ARKitScenes/3RScan. If this argument is not given, the full dataset is downloaded.
The data will be downloaded as zip files in the output directory, each corresponding to a 3D dataset source.
```bash
python download_dataset.py --download_path full_dataset
```
Following a template-based synthetic language generation method similar to ReferIt3D [1], referential natural-language statements were generated that refer to objects in scenes from six 3D scan datasets:
- Matterport3D [2]: 90 scenes - 2195 regions
- ScanNet [3]: 1513 scenes
- Habitat-Matterport 3D (HM3D) [4]: 140 scenes - 1991 regions
- Unity [5]: 15 scenes + 3 scenes omitted for the challenge - 46 regions
- ARKitScenes [6]: 4494 scenes
- 3RScan [7]: 1381 scenes
All of the datasets are real-world scans, with the exception of Unity, whose scenes are generated synthetically. Statements are generated per room/region for datasets that have multiple rooms or regions in one scene. The number of objects in each scene ranges from 4 to 2264. A sample region from the dataset is visualized in the figure below with a) its scene graph and b) a corresponding referential statement.
The overall file structure for the dataset is:
```
<dataset_folder>/
-- <scene_name>/
   -- <scene_name>_pc_result.ply                # Processed point cloud of the entire scene
   -- <scene_name>_object_split.npy             # Object IDs and split indices for use with the .ply file
   -- <scene_name>_region_split.npy             # Region IDs and split indices for use with the .ply file
   -- <scene_name>_object_result.csv            # Object information: object ID, class labels, bounding box, and dominant colors of every object in the scene
   -- <scene_name>_region_result.csv            # Region information: region ID, region name, and bounding box
   -- <scene_name>_scene_graph.json             # Scene graph containing object relations within each region/room in the scene
   -- <scene_name>_referential_statements.json  # Generated language statements
```
The generated scene graphs are stored in JSON format: for each scene, all objects, together with their attributes and inter-object relations, are stored per region within the file. Object attributes include semantic class labels, bounding box, size, dominant colors, and heading direction (if any). Details can be found in Dataset Generation below.
The generated language statements are also stored in JSON format, where each JSON object consists of the language statement, information on the referred target object and anchor object(s), the relation type, and the distractor objects. The referred objects are stored with their object_id, class label, center point, size (volume of the bounding box), and top-three dominant colors.
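As a quick illustration, the per-scene statement file can be inspected with a few lines of Python; this is only a sketch, and the exact top-level structure of the JSON (list vs. per-region dictionary) should be checked against the released files.

```python
import json

# Minimal sketch: load one scene's generated statements and print a few
# entries to inspect their structure (placeholder file name).
with open("<scene_name>_referential_statements.json") as f:
    statements = json.load(f)

# Works whether the top level is a list of statements or a dictionary
# keyed by region or statement ID (in which case the keys are printed).
for entry in list(statements)[:3]:
    print(entry)
```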
The dataset generation pipeline consists of three main steps: 1) 3D Scan Processing, 2) Scene Graph Generation, and 3) Language Generation. The overall pipeline is shown in the figure below.
3D scenes from each data source are stored in individual subfolders, one per scene. Each scene is first preprocessed into five files:
- a point cloud .ply file.
- a .npy file containing the IDs and ending indices for the region points.
- a .npy file containing the IDs and ending indices for the object points.
- a CSV file containing region information.
- a CSV file containing object information.
Point Cloud Generation
Full colored scene point clouds are generated as follows and stored in `<scene_name>_pc_result.ply`:
- Matterport-3D, ScanNet, and ARKitScenes store scene meshes as `.ply` files with colors pre-baked into the vertices, so scene-level point clouds for these datasets are obtained directly from the raw `.ply` files.
- HM3D, 3RScan, and Unity store scenes as `.glb` (HM3D, 3RScan) and `.fbx` (Unity) meshes respectively, and use UV mapping for textures. Point clouds are therefore sampled uniformly from these scenes, and colors are sampled from the original textures and baked into the sampled points. The number of sampled points per scene is proportional to the number of objects in the scene for Unity, and to the total surface area of the mesh for 3RScan and HM3D, obtained by dividing the calculated surface area of the mesh triangles by $2 \times 10^{-4}$.
Each `.ply` file stores point coordinates and RGB values, but does not contain object or region IDs. Instead, points are first sorted by region ID, then points within each region are sorted by object ID. Two `.npy` files, `<scene_name>_object_split.npy` and `<scene_name>_region_split.npy`, are provided, containing `n_objects x 2` and `n_regions x 2` arrays respectively. The first column of each array contains the object or region IDs, and may start with -1 if some points belong to unlabeled objects or regions. The second column contains the ending index of the points belonging to each ID. These files can be used with `numpy.split`, after reading the `.ply` file as a numpy array, to split the points by object or region. The visualizer contains an example of splitting the regions using these files.
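For reference, below is a minimal sketch of splitting a scene point cloud into its regions using these files; it assumes the ending indices in the second column are exclusive end positions and uses Open3D only to read the `.ply` file.

```python
import numpy as np
import open3d as o3d

scene = "VLA-3D_dataset/<scene_name>/<scene_name>"  # placeholder path

# Read the full scene point cloud as an (N, 3) array of coordinates.
pcd = o3d.io.read_point_cloud(f"{scene}_pc_result.ply")
points = np.asarray(pcd.points)

# Each split file is an (n, 2) array: column 0 holds the IDs,
# column 1 the ending index of the points belonging to each ID.
region_split = np.load(f"{scene}_region_split.npy")
region_ids = region_split[:, 0]

# Split at every ending index except the last one (which equals N),
# yielding exactly one chunk of points per region.
region_points = np.split(points, region_split[:-1, 1])
regions = dict(zip(region_ids.tolist(), region_points))
```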
Region-level Information
The region CSV files contain the following information per region:
- `region_id`: unique ID of the region within the scene
- `region_label`: name of the region (either from the source data or labeled based on heuristics)
- `region_bbox_c[xyz]`, `region_bbox_[xyz]length`, `region_bbox_heading`: center point, size, and heading angle of the region bounding box (the heading angle is currently 0 for all datasets, and region bounding boxes are axis-aligned)
ScanNet, ARKitScenes, and 3RScan each contain a single region per scene, so their region bounding boxes are the bounding boxes of the full point cloud. Region bounding boxes for Matterport-3D are obtained from the original dataset, while for HM3D and Unity axis-aligned bounding boxes are computed from the points segmented by region.
Object-level Information
The object CSV files contain the following information per object:
- `object_id`: unique ID of the object within the scene
- `region_id`: ID of the region that the object belongs to
- `raw_label`: name of the object given in the original dataset
- `nyu_id`, `nyu40_id`, `nyu_label`, `nyu40_label`: class index and name based on the NYUv2 and NYU40 schemas
- `object_bbox_c[xyz]`, `object_bbox_[xyz]length`, `object_bbox_heading`: center point, length, width, height, and heading of the oriented object bounding box
- `object_front_heading`: front heading direction for objects with a canonical front direction
- `object_color_[rgb][123]`: RGB values of at most the top-three dominant colors
- `object_color_scheme[123]`: names of the top-three dominant colors based on the color mapping used
- `object_color_scheme_percentage[123]`: percentage of points belonging to each of the top-three dominant colors
- `object_color_scheme_average_dist[123]`: average distance in LAB space between the points classified as each of the top-three colors and the reference values of those colors
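As a quick example, the object CSV can be loaded with pandas; the expanded column names used below (e.g. `object_bbox_cx`) follow the `[xyz]` patterns listed above and should be verified against the actual files.

```python
import pandas as pd

# Minimal sketch: load one scene's object information file (placeholder path).
objects = pd.read_csv("<scene_name>_object_result.csv")

# Objects in region 0 with their NYU40 labels and bounding-box centers.
cols = ["object_id", "nyu40_label",
        "object_bbox_cx", "object_bbox_cy", "object_bbox_cz"]
print(objects.loc[objects["region_id"] == 0, cols])
```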
The category mappings provided by the original authors were used to map ScanNet and Matterport-3D labels to NYUv2 and NYU40 labels. We manually created new category mapping files for the Unity, HM3D, ARKitScenes, and 3RScan datasets, found in unity, hm3d_full_mappings.csv, arkit_cat_mapping.csv, and 3rscan_full_mapping.csv.
Dominant Colors
To augment object referential statements with colors, the dominant colors of each object are classified into a set of 15 basic colors. For each segmented object, point-level colors are mapped from RGB into LAB space, then clustered using the CSS3 colors as anchors. The CSS3 color labels are then mapped to the set of 15 basic colors using heuristics found in 3d_data_preprocess/utils/dominant_colors_new_lab.py. If more than 10% of an object's points are assigned to a particular color, that color is considered dominant, and up to the top three dominant colors are saved.
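The sketch below illustrates the underlying idea of LAB-space nearest-anchor classification; it is not the repository's exact implementation (which anchors on the full CSS3 palette before mapping to the 15 basic colors), and the small anchor set here is purely illustrative.

```python
import numpy as np
from skimage.color import rgb2lab

# Illustrative anchor set (RGB in [0, 1]); the real pipeline uses CSS3 colors.
ANCHORS = {
    "red":   (1.0, 0.0, 0.0),
    "green": (0.0, 1.0, 0.0),
    "blue":  (0.0, 0.0, 1.0),
    "white": (1.0, 1.0, 1.0),
    "black": (0.0, 0.0, 0.0),
}
anchor_names = list(ANCHORS)
anchor_lab = rgb2lab(np.array([ANCHORS[n] for n in anchor_names]))

def dominant_colors(point_rgb, min_fraction=0.1, top_k=3):
    """Classify each point to its nearest anchor in LAB space and return the
    up-to-top_k colors that cover at least min_fraction of the points."""
    lab = rgb2lab(point_rgb)                                   # (N, 3)
    dists = np.linalg.norm(lab[:, None] - anchor_lab[None], axis=-1)
    nearest = dists.argmin(axis=1)                             # (N,)
    fractions = np.bincount(nearest, minlength=len(anchor_names)) / len(nearest)
    order = fractions.argsort()[::-1]
    return [(anchor_names[i], fractions[i])
            for i in order[:top_k] if fractions[i] >= min_fraction]
```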
The scripts to process scenes from each raw data source used are found under 3d_data_preprocess/.
Free-Space Generation
To provide extra navigation targets, each scan was also processed to generate the horizontally traversable free space. Separate traversable regions in a room are chunked into sub-regions, for which spatial relations with other objects in the scene are generated to create unambiguous references to these spaces (e.g. "the space near the table").
Spatial relations are calculated per region from the preprocessed files, using a heuristic for each relation. All relations used are view-independent, i.e. they do not depend on the perspective from which the scene is viewed. Additionally, relations are filtered out if the target and anchor bounding boxes overlap significantly or if one is enclosed in the other. The relations used are listed in the table below; a sketch of one such heuristic follows the table.
| Relation | Definition | Synonyms | Additional Properties |
|---|---|---|---|
| Above | Target is above the anchor | Over | |
| Below | Target is below the anchor | Under, Beneath, Underneath | |
| Closest | Target is the closest object of a given class to the anchor | Nearest | Inter-class |
| Farthest | Target is the farthest object of a given class from the anchor | Most distant from | Inter-class |
| Between | Target is between two anchors | In the middle of, In-between | Ternary |
| Near | Target is within a threshold distance of the anchor | Next to, Close to, Adjacent to, Beside | Symmetric |
| In | Target is inside the anchor | Inside, Within | |
| On | Target is above and in contact with the anchor along the Z-axis | On top of | |
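The sketch below shows the flavor of these bounding-box heuristics, using "Above" as an example; it is an illustrative approximation (axis-aligned, ignoring the heading angle) rather than the exact rule implemented in scene_graph/.

```python
import numpy as np

def is_above(target_bbox, anchor_bbox, xy_overlap_thresh=0.5):
    """Illustrative 'Above' heuristic: the target's bottom is higher than the
    anchor's top, and the two boxes overlap sufficiently in the XY plane.

    Boxes are (cx, cy, cz, xlength, ylength, zlength), matching the CSV
    columns above but treated as axis-aligned for simplicity.
    """
    t_c, t_l = np.array(target_bbox[:3]), np.array(target_bbox[3:6])
    a_c, a_l = np.array(anchor_bbox[:3]), np.array(anchor_bbox[3:6])

    # Vertical condition: target bottom above anchor top.
    if t_c[2] - t_l[2] / 2 < a_c[2] + a_l[2] / 2:
        return False

    # Horizontal condition: XY overlap area relative to the target footprint.
    overlap_xy = np.maximum(
        0.0,
        np.minimum(t_c[:2] + t_l[:2] / 2, a_c[:2] + a_l[:2] / 2)
        - np.maximum(t_c[:2] - t_l[:2] / 2, a_c[:2] - a_l[:2] / 2),
    )
    return np.prod(overlap_xy) / np.prod(t_l[:2]) >= xy_overlap_thresh
```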
The scripts to generate the scene graphs are found under the scene_graph/ folder.
Language statements are synthetically generated from the calculated spatial relations using a template-based generation method similar to Sr3D [1]. The synonyms for each relation in the table above are used to add variety to the statements. Language statements are generated to ensure that they are:
- View-independent: the relation predicate for the target object does not depend on the perspective from which the scene is viewed
- Unique: only one candidate for the referred target object exists in the region
- Minimal: following natural human language, statements use the fewest descriptors needed to disambiguate the target object
Each component of the language-generation pipeline is further detailed below.
Language Configs
The language config files contain all of the parameters and templates required for generating human-readable sentences. They contain templates for each relationship type, such as "[target object] that is [relation] to the [anchor object]", as well as the synonyms to use for each relation. These structural blueprints provide varied expressions for the same spatial relationships, ensuring natural and diverse phrasing.
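As a rough illustration of how templates and synonyms combine, consider the sketch below; the template strings, synonym list, and `render` function are hypothetical and do not reproduce the actual config format.

```python
import random

# Hypothetical templates and synonyms, only to illustrate the mechanism;
# the real values live in the language config files.
TEMPLATES = {
    "near": ["the {target} that is {relation} the {anchor}",
             "the {target} {relation} the {anchor}"],
}
SYNONYMS = {"near": ["near", "next to", "close to", "adjacent to", "beside"]}

def render(relation, target, anchor):
    """Fill a randomly chosen template with a randomly chosen relation synonym."""
    template = random.choice(TEMPLATES[relation])
    return template.format(target=target,
                           relation=random.choice(SYNONYMS[relation]),
                           anchor=anchor)

print(render("near", "lamp", "table"))  # e.g. "the lamp next to the table"
```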
Generators
Generators are used to exhaustively generate all possible unambiguous referential statements with the relations provided. There are three types of generators, differentiated by the properties of the spatial relationship handled.
- Binary Generator: Handles Above, Below, Near, On, In relations
- Ternary Generator: Handles Between relations
- Ordered Generator: Handles Closest, Farthest relations
Some basic pruning is done by the generators, such as excluding certain objects or relations and limiting the number of redundant statements produced.
Object Filter
An object filter is used to ensure that referential language statements are unique and distinguish the target object unambiguously. Since language is first generated using only spatial relations, the object filter adds object attributes such as color and size when they are needed to distinguish the target from "distractors" of the same class.
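A minimal sketch of this disambiguation idea is shown below; the data structures, attribute order, and function are hypothetical and only illustrate appending attributes until the target is unique among its same-class distractors.

```python
# Hypothetical sketch: starting from a relation-only description, append
# attributes (color, then size) until no distractor matches the target on
# all chosen attributes.
def disambiguate(target, distractors):
    """target/distractors are dicts with 'color' and 'size' keys (illustrative)."""
    descriptors = []
    for attribute in ("color", "size"):
        remaining = [d for d in distractors
                     if all(d[a] == target[a] for a in descriptors)]
        if not remaining:            # already unique with current descriptors
            break
        descriptors.append(attribute)
    return descriptors

# Example: a large red chair among a large blue chair and a small red chair
# needs both color and size to be referred to unambiguously.
target = {"color": "red", "size": "large"}
distractors = [{"color": "blue", "size": "large"},
               {"color": "red", "size": "small"}]
print(disambiguate(target, distractors))  # ['color', 'size']
```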
The table below shows the number of statements of each type, with relation synonyms included. Note that the statement types are not necessarily mutually exclusive.
Statement Type | Total Statements |
---|---|
Above | 47,208 |
Below | 86,632 |
Closest | 3,060,074 |
Farthest | 4,590,111 |
Between | 249,615 |
Near | 1,655,185 |
In | 11,157 |
On | 25,915 |
Mentions color | 3,485,373 |
Mentions size | 2,114,500 |
Total | 9,696,079 |
The scene with the most statements is from the HM3D dataset, and the scene with the fewest statements is from the ScanNet dataset. In total, 477 unique object classes are referred to in the language statements, and 9.6M+ unique statements (without relation synonyms) exist in the dataset.
The scripts to generate the language data are found under language_generator/.
An Open3D-based visualization tool is provided to visualize the language statements along with the scene. Details on installing and using the visualizer can be found in visualizer/README.md.
[1] Achlioptas, P., et al, "Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes," in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, 2020, pp. 422–440.
[2] Chang, A., et al. "Matterport3d: Learning from rgb-d data in indoor environments," in arXiv preprint arXiv:1709.06158, 2017.
[3] Dai, A., et al, "Scannet: Richly-annotated 3d reconstructions of indoor scenes," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5828–5839.
[4] Ramakrishnan, S., et al. "Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai," in arXiv preprint arXiv:2109.08238, 2021.
[5] Haas, J., "A history of the unity game engine," Diss. Worcester Polytechnic Institute, vol. 483, no. 2014, p. 484, 2014.
[6] Baruch, G., et al. "Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data," in arXiv preprint arXiv:2111.08897, 2021.
[7] Wald, J., et al, "Learning 3d semantic scene graphs from 3d indoor reconstructions," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3961–3970.