MS MARCO Web Search is a large-scale, information-rich web dataset featuring millions of real clicked query-document labels. The dataset closely mimics real-world web document and query distributions and provides rich information for a variety of downstream tasks. It uses the largest open web document dataset, ClueWeb22, as its document corpus. ClueWeb22 includes about 10 billion high-quality web pages, sufficiently large to serve as representative web-scale data, and contains rich per-page information such as the visual representation rendered by web browsers, the raw HTML structure, clean text, semantic annotations, and language and topic tags labeled by industry document understanding systems. MS MARCO Web Search further contains 10 million unique queries in 93 languages, with millions of relevant labeled query-document pairs collected from the search log of the Microsoft Bing search engine, to serve as the query set.
It offers a retrieval benchmark on a 100-million-document set with three web retrieval challenge tasks that demand innovations in both machine learning and information retrieval systems research: the embedding model, embedding retrieval, and end-to-end retrieval challenges. The main goal of the leaderboard is to study which retrieval methods work best, and which are cost-efficient, when a large amount of data is available.
Moreover, MS MARCO Web Search also offers a set of real click labels roughly five times larger, covering the whole 10-billion-document collection. Researchers can use this dataset to verify whether methods that work on small data also work at large scale.
If you use the MS MARCO Web Search dataset, or any dataset derived from it, please cite the paper:
@article{XXX,
title={MS MARCO Web Search: A Large-scale Information-rich Web Dataset with Millions of Real Click Labels},
author={Qi Chen and Xiubo Geng and Corby Rosset and Carolyn Buractaon and Jingwen Lu and Tao Shen and Kun Zhou and Chenyan Xiong and Yeyun Gong and Paul Bennett and Nick Craswell and Xing Xie and Fan Yang and Bryan Tower and Nikhil Rao and Anlei Dong and Wenqi Jiang and Zheng Liu and
Mingqin Li and Chuanjie Liu and Jason Li and Rangan Majumder and Jennifer Neville and Andy Oakley and Knut Magne Risvik and Harsha Vardhan Simhadri and Manik Varma and Yujing Wang and Linjun Yang and Mao Yang and Ce Zhang},
journal={arXiv preprint arXiv:XXX},
year={2024}
}
There are three challenge tasks: embedding model ranking, embedding retrieval, and end-to-end retrieval.
The first task focuses on embedding model ranking and measures embedding model quality. The large-scale web data volume requires large embedding models to guarantee sufficient knowledge coverage, which means balancing two goals: good model generalization and efficient training/inference speed. Given a query, you are expected to rank documents from the full collection by their relevance to the query; you can submit up to 100 documents for this task. The metrics we evaluate include:
- Mean Reciprocal Rank (MRR): the average over queries of the multiplicative inverse of the rank of the first correct result; widely used for evaluating model quality.
- Recall: the average percentage of ground-truth items (test query-document labels) recalled during the search.
- Throughput (QPS): all queries are provided at once, and we measure the wall-clock time between the ingestion of the vectors and when all results are output, using all the threads of a machine. Throughput is then reported as queries processed per second (QPS).
- Latency: we measure the 50th, 90th, and 99th percentile query latency at a given QPS. A minimal sketch of computing these quality metrics follows the baseline table below.
Baselines | MRR@10 | recall@1 | recall@5 | recall@10 | recall@20 | recall@100 | QPS | P50 latency | P90 latency | P99 latency |
---|---|---|---|---|---|---|---|---|---|---|
DPR | 0.542 | 45.12% | 66.04% | 72.10% | 76.80% | 87.54% | 698 | 9.896 ms | 10.018 ms | 11.430 ms |
ANCE | 0.633 | 54.18% | 75.53% | 80.53% | 84.17% | 91.17% | 698 | 9.896 ms | 10.018 ms | 11.430 ms |
SimANS | 0.649 | 55.86% | 76.84% | 81.78% | 85.23% | 91.98% | 698 | 9.896 ms | 10.018 ms | 11.430 ms |
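The baselines above are reported with MRR@10, recall@k, throughput, and percentile latency. As a rough reference only, the following Python sketch shows one way the quality metrics could be computed from a ranked run and the TREC-format qrels; the run representation (a dict from query id to a ranked list of document ids) and the helper names are illustrative and are not part of any official evaluation tooling.

```python
from collections import defaultdict
import numpy as np

def load_positive_qrels(path):
    """Parse TREC-format qrels (qid <iteration> docid relevance), keeping docs with relevance > 0."""
    qrels = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, _, docid, rel = line.split()
            if int(rel) > 0:
                qrels[qid].add(docid)
    return qrels

def evaluate(run, qrels, cutoffs=(1, 5, 10, 20, 100)):
    """run: dict mapping qid -> list of docids sorted by decreasing score."""
    mrr, recalls = 0.0, {k: 0.0 for k in cutoffs}
    for qid, relevant in qrels.items():
        ranked = run.get(qid, [])
        # MRR@10: reciprocal rank of the first relevant document within the top 10.
        for rank, docid in enumerate(ranked[:10], start=1):
            if docid in relevant:
                mrr += 1.0 / rank
                break
        # recall@k: fraction of the relevant documents found in the top k.
        for k in cutoffs:
            recalls[k] += len(relevant & set(ranked[:k])) / len(relevant)
    n = len(qrels)
    return mrr / n, {k: v / n for k, v in recalls.items()}

def latency_percentiles(per_query_latencies_ms):
    """P50/P90/P99 latency from a list of per-query latencies in milliseconds."""
    return np.percentile(per_query_latencies_ms, [50, 90, 99])
```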
Embedding models need to work together with an embedding retrieval system to serve a web-scale dataset. The second task therefore focuses on the performance and accuracy of the embedding retrieval (ANN) algorithm and system. We take the embedding vectors generated by one of the baseline models as the ANN vector set. The goal of this challenge is to call for ANN algorithm innovations that minimize the accuracy gap between approximate search and brute-force search while preserving good system performance. In this task we evaluate only ANN recall (taking the brute-force vector search results as the ground truth), throughput, and latency; a sketch of the ANN recall computation follows the baseline table below.
Baselines | ANN recall@1 | ANN recall@10 | ANN recall@100 | QPS | P50 latency | P90 latency | P99 latency |
---|---|---|---|---|---|---|---|
SPANN | 87.97% | 80.55% | 69.84% | 625 | 10.411 ms | 10.873 ms | 11.334 ms |
DiskANN | 91.46% | 87.07% | 69.73% | 2691 | 21.968 ms | 37.841 ms | 69.462 ms |
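For this task the ground truth is the brute-force (exact) nearest-neighbor result rather than the click labels. As a hedged sketch, assuming the common definition of ANN recall@k as the overlap between the approximate and exact top-k lists divided by k, the computation could look like the following; the array shapes and function names are illustrative, not part of the official tooling.

```python
import numpy as np

def ann_recall_at_k(approx_ids, exact_ids, k):
    """approx_ids, exact_ids: (num_queries, >=k) arrays of document ids
    returned by the ANN index and by brute-force search, respectively."""
    overlaps = [
        len(set(a[:k]) & set(e[:k])) / k
        for a, e in zip(approx_ids, exact_ids)
    ]
    return float(np.mean(overlaps))

def brute_force_topk(doc_vecs, query_vecs, k):
    """Exact inner-product search used to produce the ground truth.
    At this scale it is only feasible in batches or on a subset."""
    scores = query_vecs @ doc_vecs.T                   # (num_queries, num_docs)
    topk = np.argpartition(-scores, k, axis=1)[:, :k]  # unordered top-k candidates
    order = np.argsort(-np.take_along_axis(scores, topk, axis=1), axis=1)
    return np.take_along_axis(topk, order, axis=1)     # sorted by descending score
```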
In the web scenario, the result quality and system performance of the end-to-end retrieval system are the most important metrics for comparing different solutions. This challenge task welcomes any kind of solution, including an embedding model plus an ANN system, an inverted-index solution, a hybrid solution, a neural indexer, or a large language model.
Baselines | MRR@10 | recall@1 | recall@5 | recall@10 | recall@20 | recall@100 | QPS | P50 latency | P90 latency | P99 latency |
---|---|---|---|---|---|---|---|---|---|---|
Elasticsearch BM25 | 0.296 | 22.30% | 39.04% | 46.00% | 52.42% | 63.87% | 149 | 312.025 ms | 1065.141 ms | 3745.546 ms |
DPR + SPANN | 0.467 | 39.21% | 56.66% | 61.27% | 64.69% | 70.28% | 625 | 21.924 ms | 23.017 ms | 34.217 ms |
ANCE + SPANN | 0.580 | 49.87% | 68.59% | 72.94% | 75.86% | 80.18% | 625 | 21.924 ms | 23.017 ms | 34.217 ms |
SimANS + SPANN | 0.585 | 50.63% | 68.79% | 73.14% | 75.85% | 79.82% | 625 | 21.924 ms | 23.017 ms | 34.217 ms |
The dataset download links are listed below; by downloading them you confirm that you accept the terms and licenses described later in this document.
Type | Filename | File size | Num Records | Format |
---|---|---|---|---|
ClueWeb22 Collection | https://lemurproject.org/clueweb22.php/ | --- | 10B | --- |
Document ID in ClueWeb22 | doc_hash_mapping.tsv | 8.34 GB | 210,894,832 | tsv: docid in ClueWeb22, docid |
Train | queries_train.tsv | 678.36 MB | 9,206,475 | tsv: qid, query, languages |
Train | qrels_train.tsv | 194.93 MB | 9,346,695 | TREC qrels format |
Dev | queries_dev.tsv | 675.2 KB | 9,253 | tsv: qid, query, languages |
Dev | qrels_dev.tsv | 173.19 KB | 9,402 | TREC qrels format |
Test | queries_test.tsv | 734.33 KB | 9,374 | tsv: qid, query, languages |
Test | qrels_test.tsv | 180.32 KB | 9,374 | TREC qrels format |
Document Embedding Vectors | vectors.bin, metaidx.bin, meta.bin | 289.16 GB | 100,924,960 | Binary Format |
Query Embedding Vectors | vectors.bin, metaidx.bin, meta.bin | 27.47 MB | 9,374 | Binary Format |
Embedding Retrieval Truth | truth.txt | 7.97 MB | 9,374 | Truth Format |
Description | Filename | File size | Num Records | Format |
---|---|---|---|---|
ClueWeb22 Collection | https://lemurproject.org/clueweb22.php/ | --- | 10B | --- |
Train | queries_train.tsv | 678.36 MB | 9,206,475 | tsv: qid, query, languages |
Train | qrels_train.tsv | 2.43 GB | 62,302,553 | TREC qrels format |
Dev | queries_dev.tsv | 675.2 KB | 9,253 | tsv: qid, query, languages |
Dev | qrels_dev.tsv | 2.35 MB | 63,314 | TREC qrels format |
Test | queries_test.tsv | 734.33 KB | 9,374 | tsv: qid, query, languages |
Test | qrels_test.tsv | 2.65 MB | 40,511 | TREC qrels format |
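As a rough illustration of working with the files above, the sketch below loads a queries_*.tsv file and a qrels_*.tsv file. It also includes a reader for vectors.bin that assumes a widely used ANN-benchmark binary layout (an int32 vector count, an int32 dimension, then row-major vector data); that layout is an assumption here, so verify it against the official dataset documentation before relying on it.

```python
import struct
import numpy as np

def load_queries(path):
    """queries_*.tsv: qid \t query \t languages."""
    queries = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, query, languages = line.rstrip("\n").split("\t")
            queries[qid] = (query, languages)
    return queries

def load_qrels(path):
    """qrels_*.tsv in TREC qrels format: qid <iteration> docid relevance."""
    qrels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

def load_vectors(path, dtype=np.float32):
    """ASSUMED layout for vectors.bin: int32 num_vectors, int32 dimension,
    followed by num_vectors * dimension values of `dtype` in row-major order.
    Check the dataset documentation for the authoritative binary format."""
    with open(path, "rb") as f:
        num, dim = struct.unpack("<ii", f.read(8))
        data = np.fromfile(f, dtype=dtype, count=num * dim)
    return data.reshape(num, dim)
```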
IMPORTANT NOTE: You are allowed to use external information while developing your runs. However, it is prohibited to use any datasets in your submission except those listed above. The original MS MARCO Web Search dataset reveals minor details of how the dataset was constructed that would not be available in a real-world search engine, and these should therefore be avoided.
The MS MARCO Web Search datasets are intended for non-commercial research purposes only, to promote advancement in the field of artificial intelligence and related areas, and are made available free of charge without extending any license or other intellectual property rights. The datasets are provided "as is" without warranty, and usage of the data carries risks since we may not own the underlying rights in the documents. We are not liable for any damages related to use of the dataset. Feedback is voluntarily given and can be used as we see fit. By using any of the datasets you automatically agree to abide by these terms and conditions. Upon violation of any of these terms, your rights to use the dataset will end automatically.
Please contact us at [email protected] if you own any of the documents made available but do not want them in this dataset. We will remove the data accordingly. If you have questions about use of the dataset or any research outputs in your products or services, we encourage you to undertake your own independent legal review. For other questions, please feel free to contact us.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
Microsoft and any contributors grant you a license to the Microsoft documentation and other content in this repository under the Creative Commons Attribution 4.0 International Public License, see the LICENSE-CCA file, and grant you a license to any code in the repository under the MIT License, see the LICENSE file.
Microsoft licenses the MS MARCO Web Search Mark "as-is" and makes no express or implied representations or warranties regarding non-infringement. You must remove all uses of the Mark immediately upon request from Microsoft.
Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries. The licenses for this project do not grant you rights to use any Microsoft names, logos, or trademarks. Microsoft's general trademark guidelines can be found at http://go.microsoft.com/fwlink/?LinkID=254653.
Privacy information can be found at https://privacy.microsoft.com/en-us/.
Microsoft and any contributors reserve all other rights, whether under their respective copyrights, patents, or trademarks, whether by implication, estoppel or otherwise.