The Microsoft Q&A (MSQA) dataset is a question-answering dataset collected from the Microsoft Q&A forum. It covers a wide range of Microsoft technologies and products, including Azure, Office 365, and Windows. The dataset contains 32k QA pairs, in which each answer is human-written and was selected as the accepted answer.
- Our paper is now available on arXiv: https://arxiv.org/abs/2305.11541
Recent advancements in large language models (LLMs) have demonstrated impressive performance across various natural language processing (NLP) tasks. However, when it comes to domain-specific problems, LLMs exhibit limited performance due to insufficient pretraining on domain knowledge. Fine-tuning and maintaining LLMs to incorporate domain-specific knowledge can be expensive for researchers. In addition, domain-specific data is often restricted and confidential, which introduces risks of data leakage when it is used to fine-tune LLMs.
Existing works leverage retrieval-based methods or external modules to extract domain-specific knowledge. However, these approaches suffer from limitations, such as failing to retrieve all the necessary context for complex questions. We introduce a novel model interaction paradigm that bridges domain-general and domain-specific knowledge. Our approach involves fine-tuning a small language model, e.g., LLaMA-7B, on domain documentation to align it with domain-specific knowledge, and then instruction-tuning it on the MSQA data. This paradigm replaces traditional retrieval modules with the generation of domain-specific knowledge, enabling easy maintenance and privacy protection within the specific domain.
We release the MSQA data and believe that this benchmarking dataset will help the research community evaluate model interaction strategies in domain-specific scenarios.
The directory structure of this repository is as follows:
- `data/`: Stores the MSQA dataset.
- `process/`: Stores the code for data cleaning.
- `viz/`: Stores the code and files related to visualization.
| Statistic | Value |
|---|---|
| Start Date | 2019-10-29 |
| End Date | 2023-05-25 |
| Date range | 1304 days |
| # data | 32252 |
| # tags | 377 |
| Avg. # questions per tag | 109.74 |
| Avg. # tags per question | 1.28 |
| Avg. # tokens per question | 276.43 |
| Avg. # tokens per answer | 279.41 |
| Avg. # upvotes per question | 0.05 |
| Avg. # upvotes per answer | 0.28 |
| Avg. # upvotes per sample | 0.33 |
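For reference, below is a minimal sketch of loading the dataset and recomputing a few of the statistics above. The file name `data/msqa.jsonl` and the JSON Lines format are assumptions (check the `data/` directory for the actual layout); the field names `Tags` and `QuestionText` follow the example record shown later in this README, and whitespace splitting stands in for the actual tokenizer.

```python
# Minimal sketch: load MSQA and recompute a few of the table's statistics.
# Assumptions: JSON Lines file at data/msqa.jsonl; fields "Tags" and
# "QuestionText" as in the example record below.
import json

def load_msqa(path="data/msqa.jsonl"):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

samples = load_msqa()
print("# data:", len(samples))

avg_tags = sum(len(s["Tags"]) for s in samples) / len(samples)
print(f"Avg. # tags per question: {avg_tags:.2f}")

# Whitespace tokens stand in for the tokenizer used in the paper.
avg_q_tokens = sum(len(s["QuestionText"].split()) for s in samples) / len(samples)
print(f"Avg. # tokens per question: {avg_q_tokens:.2f}")
```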
We first filtered the collected raw data by applying the following criteria (a code sketch follows the list):
- Discarded samples containing attachments in the question or answer. Attachments are usually images, e.g., a screenshot a user includes with a question; since this work focuses on text-based Q&A, such samples were discarded.
- Discarded samples without an "Accepted Answer". Some questions were never answered, or had answers that were not accepted.
- Discarded samples with multi-turn discussions. Some questions contained multi-turn discussions, which were not within the scope of this work.
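As a sketch, the three criteria above amount to a simple predicate over raw samples. The field names `HasAttachment`, `HasAcceptedAnswer`, and `DiscussionTurns` are hypothetical placeholders; the code in `process/` is authoritative.

```python
# Sketch of the discard criteria; all field names are hypothetical.
def keep_sample(sample: dict) -> bool:
    """Return True if a raw sample survives all three filters."""
    if sample.get("HasAttachment"):           # 1. attachment (e.g., screenshot) present
        return False
    if not sample.get("HasAcceptedAnswer"):   # 2. no accepted answer
        return False
    if sample.get("DiscussionTurns", 1) > 1:  # 3. multi-turn discussion
        return False
    return True

raw = [
    {"HasAttachment": False, "HasAcceptedAnswer": True, "DiscussionTurns": 1},
    {"HasAttachment": True,  "HasAcceptedAnswer": True, "DiscussionTurns": 1},
]
print([keep_sample(s) for s in raw])  # -> [True, False]
```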
Because the data was collected from an online Q&A forum, the content is complex and includes a large number of decorative symbols and platform-generated content, which is hard to use directly for research. To address this, we deeply sampled the collected data, summarized the existing problems, identified recurring patterns, and designed the following data filtering pipeline (a condensed sketch follows the list):
- Remove user-id.
- Standardize all links using the Markdown link reference syntax to organize them into a unified format.
- Remove platform-generated content, such as messages asking for upvotes or email notifications.
- Remove irregular decorative symbols added by users, such as asterisks for separation.
- Match various line breaks and replace consecutive multiple line breaks with a single one.
- Detect the length of questions and specifically label samples with questions exceeding 8192 tokens, as these may require special handling or truncation for current models.
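A condensed sketch of these steps, built on regular expressions, is shown below. The concrete patterns (the user-id format, the boilerplate phrasings, the link rewriting) are illustrative assumptions; the released code in `process/` implements the actual pipeline.

```python
# Condensed cleaning sketch; all patterns are illustrative assumptions.
import re

# Platform-generated phrases to strip (assumed wording).
PLATFORM_BOILERPLATE = [
    r"please (?:remember to )?accept.*?as (?:the )?answer.*",
    r"please .*upvote.*",
]

def clean_text(text: str) -> str:
    # 1. Remove user ids (assumed @mention-style format).
    text = re.sub(r"@\w[\w.-]*", "", text)
    # 2. Standardize bare links into Markdown link syntax.
    text = re.sub(r"(?<!\]\()\bhttps?://\S+", lambda m: f"[link]({m.group(0)})", text)
    # 3. Remove platform-generated content.
    for pattern in PLATFORM_BOILERPLATE:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # 4. Remove decorative symbol runs, e.g., rows of asterisks.
    text = re.sub(r"[*\-_]{3,}", "", text)
    # 5. Collapse consecutive line breaks of any flavor into one.
    text = re.sub(r"(?:\r\n|\r|\n)+", "\n", text)
    return text.strip()

def exceeds_token_limit(text: str, limit: int = 8192) -> bool:
    # Whitespace tokens as a stand-in for a model tokenizer.
    return len(text.split()) > limit
```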
Below is an actual example from the MSQA dataset:
| Attribute | Value |
|---|---|
| QuestionId | 879 |
| AnswerId | 868 |
| CreationDate | 2019-11-06T02:18:47.397Z |
| Score | 1 |
| QuestionScore | 0 |
| AnswerScore | 1 |
| Tags | ["Universal Windows Platform (UWP)"] |
| IsAzure | False |
| IsM365 | False |
| IsOther | True |
| QuestionText | Can we control these three buttons:"The system Back, Close, Minimize, and Maximize buttons" Based on the doc: https://learn.microsoft.com/en-us/windows/uwp/design/shell/title-bar#how-much-to-customize-the-title-bar "When you opt for full customization, you are responsible for putting content in the title bar area, and you can define your own draggable region. The system Back, Close, Minimize, and Maximize buttons are still available and handled by the system, but elements like the app title are not. You will need to create those elements yourself as needed by your app." Does it mean that we cannot control these three buttons? |
| AnswerText | Hello, Welcome to our Microsoft Q&A platform! *>>The system Back, Close, Minimize, and Maximize buttons are still available and **handled by the system* Yes, as the document saied, you can put content in the title bar area, and define your own draggable region. But the Back, Close, Minimize, and Maximize buttons are still controlled by the system. Thanks. |
| Url | https://learn.microsoft.com/en-us/answers/questions/879/can-we-control-these-three-buttons-the-system-back.html |
| ProcessedAnswerText | *>>The system Back, Close, Minimize, and Maximize buttons are still available and **handled by the system* Yes, as the document saied, you can put content in the title bar area, and define your own draggable region. But the Back, Close, Minimize, and Maximize buttons are still controlled by the system. |
To facilitate manual inspection of the effectiveness of data filtering, we developed a script, `viz.py`, that visualizes the differences before and after data filtering. Below is an example of the visualization.
The source files for this demo can be found at viz_demo.html.
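The core idea can be reproduced with the standard library alone; the following is a minimal sketch (the actual `viz.py` may differ).

```python
# Minimal diff-visualization sketch using only the standard library.
import difflib

def render_diff(before: str, after: str, out_path: str = "diff.html") -> None:
    """Write an HTML side-by-side diff of a text before/after filtering."""
    html = difflib.HtmlDiff(wrapcolumn=80).make_file(
        before.splitlines(),
        after.splitlines(),
        fromdesc="Before filtering",
        todesc="After filtering",
    )
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(html)
```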
This dataset is released under an open data license, CDLA-Permissive-2.0 (https://cdla.dev/permissive-2-0/).
If you believe the release of this dataset might infringe your copyright, please inform us via email at [email protected] so that we can take down the dataset.
https://arxiv.org/abs/2305.11541
@misc{wang2023empower,
title={Empower Large Language Model to Perform Better on Industrial Domain-Specific Question Answering},
author={Zezhong Wang and Fangkai Yang and Pu Zhao and Lu Wang and Jue Zhang and Mohit Garg and Qingwei Lin and Dongmei Zhang},
year={2023},
eprint={2305.11541},
archivePrefix={arXiv},
primaryClass={cs.CL}
}