Merge pull request microsoft#437 from microsoft/staging
Staging to Master
saidbleik authored Oct 4, 2019
2 parents b912ce1 + 7ba2e10 commit a2ac143
Showing 27 changed files with 358 additions and 296 deletions.
5 changes: 4 additions & 1 deletion .gitignore
Expand Up @@ -127,13 +127,16 @@ nlp_*.yaml
nohup.out
temp/
tmp/
logs/
score.py

# Data
data/
squad/
bidaf-question-answering/
*/question_answering/bidaf.tar.gz
*/question_answering/bidafenv.yml
*/question_answering/config.json
*/question_answering/score.py
*/question_answering/vocabulary/
*/question_answering/weights.th

Expand Down
Binary file added NLP-Logo.png
5 changes: 4 additions & 1 deletion README.md
@@ -1,4 +1,7 @@
# NLP Best Practices
<img src="NLP-Logo.png" align="right" alt="" width="300"/>


# NLP Best Practices

In recent years, natural language processing (NLP) has seen rapid growth in quality and usability, and this has helped to drive business adoption of artificial intelligence (AI) solutions. Researchers have been applying newer deep learning methods to NLP, and data scientists have started moving from traditional methods to state-of-the-art (SOTA) deep neural network (DNN) algorithms that use language models pretrained on large text corpora.

Expand Down
22 changes: 15 additions & 7 deletions SETUP.md
Expand Up @@ -10,19 +10,25 @@ For training at scale, operationalization or hyperparameter tuning, it is recomm
## Table of Contents

* [Compute environments](#compute-environments)
* [Setup guide for Local or DSVM](#setup-guide-for-local-or-dsvm-machines)
* [Create a cloud-based workstation (Optional)](#Create-a-cloud-based-workstation-optional)
* [Setup guide for Local or Virtual Machines](#setup-guide-for-local-or-virtual-machines)
* [Requirements](#requirements)
* [Dependencies setup](#dependencies-setup)
* [Register the conda environment in the DSVM JupyterHub](#register-conda-environment-in-dsvm-jupyterhub)
* [Installing the Repo's Utils via PIP](#installing-the-repo's-utils-via-pip)
* [Installing the Repo's Utils via PIP](#installing-the-repos-utils-via-pip)


## Compute Environments

Depending on the type of NLP system and the notebook that needs to be run, there are different computational requirements. Currently, this repository supports **Python CPU** and **Python GPU**. A conda environment YAML file can be generated for either CPU or GPU environments as shown below in the *Dependencies Setup* section.

## Create a cloud-based workstation (Optional)

## Setup Guide for Local or DSVM Machines
[Azure Machine Learning service](https://azure.microsoft.com/en-us/services/machine-learning-service/)’s Notebook Virtual Machine (VM) is a cloud-based workstation created specifically for data scientists. Notebook VM-based authoring is directly integrated into Azure Machine Learning service, providing a code-first experience for Python developers to conveniently build and deploy models in the workspace. Developers and data scientists can perform every operation supported by the Azure Machine Learning Python SDK using a familiar Jupyter notebook in a secure, enterprise-ready environment. Notebook VM is secure and easy to use, preconfigured for machine learning, and fully customizable.

You can learn how to create a Notebook VM [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/tutorial-1st-experiment-sdk-setup#azure) and then follow the same setup as in the [Setup guide for Local or Virtual Machines](#setup-guide-for-local-or-virtual-machines) directly using the terminal in the Notebook VM.

## Setup Guide for Local or Virtual Machines

### Requirements

Expand Down Expand Up @@ -96,13 +102,15 @@ If you are using the DSVM, you can [connect to JupyterHub](https://docs.microsof
<p>
A setup.py file is provided to simplify the installation of the utilities in this repo from the main directory.

To install, please run the command below
To install the package, please run the command below (from directory root)

pip install -e .

python setup.py install
Running the command tells pip to install the `utils_nlp` package from source in [development mode](https://setuptools.readthedocs.io/en/latest/setuptools.html#development-mode). This means that any update to the `utils_nlp` source directory is immediately reflected in the installed package without reinstalling, which is useful for a package that changes frequently.
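
One way to confirm the development-mode behavior described above is to check where Python resolves the package from. The following is a minimal sketch using only the standard library; the helper name `package_location` is hypothetical (it is not part of this repo), and it assumes the install command was run inside the active conda environment:

```python
import importlib.util
from typing import Optional


def package_location(name: str) -> Optional[str]:
    """Return the path Python would import top-level package `name` from, or None if it is not installed."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Regular packages report their __init__.py in `origin`; namespace
    # packages have no single file, so fall back to their search path.
    if spec.origin not in (None, "namespace"):
        return spec.origin
    locations = list(spec.submodule_search_locations or [])
    return locations[0] if locations else None


if __name__ == "__main__":
    # For an editable install, this path is expected to point into the
    # cloned repository rather than into site-packages.
    print(package_location("utils_nlp"))
```

If the printed path lies inside the cloned repository, edits to the source should take effect on the next import without reinstalling.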

It is also possible to install directly from GitHub, which is the best way to utilize the `utils_nlp` package in external projects.
> It is also possible to install directly from GitHub, which is the best way to utilize the `utils_nlp` package in external projects (while still reflecting updates to the source as it's installed as an editable `'-e'` package).
pip install -e git+git@github.com:microsoft/nlp.git@master#egg=utils_nlp
> `pip install -e git+git@github.com:microsoft/nlp.git@master#egg=utils_nlp`
Either command above makes `utils_nlp` available in your conda virtual environment. You can verify it was properly installed by running:

Expand Down
3 changes: 2 additions & 1 deletion VERSIONING.md
@@ -1,9 +1,10 @@
# Semantic Versioning
> NOTE: Support for `setuptools_scm` is currently removed due to a known [issue](https://github.com/pypa/setuptools_scm/issues/357) with the way pip installations restrict access to certain SCM metadata during package installation. Support will be restored when `setuptools_scm` and `pip` developers fix this with a patch.
This library is configured to use
[setuptools_scm](https://github.com/pypa/setuptools_scm/) to automatically get package version from git commit histories.

> NOTE: **There shouldn't be any references to manually coded versions**.
**There shouldn't be any references to manually coded versions**.
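
One way to avoid hard-coded version strings in calling code is to read the version from the installed package metadata at runtime. The following is a minimal sketch, assuming Python 3.8+ (for `importlib.metadata`); the helper name `installed_version`, the fallback string, and the distribution name `utils_nlp` are assumptions for illustration, not something this repo defines:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str, default: str = "0.0.0.dev0") -> str:
    """Look up a distribution's version from installed metadata instead of hard-coding it."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in this environment.
        return default


if __name__ == "__main__":
    # "utils_nlp" is assumed to be the installed distribution name.
    print(installed_version("utils_nlp"))
```

The fallback keeps scripts usable in environments where the package has not been installed yet.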

Verify what git tag to use by running:

Expand Down
5 changes: 3 additions & 2 deletions examples/entailment/entailment_xnli_bert_azureml.ipynb
Expand Up @@ -15,7 +15,7 @@
"**Note: To learn how to do pre-training on your own, please reference the [AzureML-BERT repo](https://github.com/microsoft/AzureML-BERT) created by Microsoft.**"
]
},
{
{
"cell_type": "markdown",
"metadata": {},
"source": [
Expand Down Expand Up @@ -46,6 +46,7 @@
"from azureml.core import Experiment\n",
"from azureml.widgets import RunDetails\n",
"from azureml.core.compute import ComputeTarget\n",
"from azureml.exceptions import ComputeTargetException\n",
"from utils_nlp.azureml.azureml_utils import get_or_create_workspace, get_output_files"
]
},
Expand Down Expand Up @@ -537,7 +538,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
"version": "3.7.3"
}
},
"nbformat": 4,
Expand Down
14 changes: 8 additions & 6 deletions examples/question_answering/bidaf_aml_deep_dive.ipynb
Expand Up @@ -16,7 +16,7 @@
"# BiDAF Model Deep Dive on AzureML"
]
},
{
{
"cell_type": "markdown",
"metadata": {},
"source": [
Expand Down Expand Up @@ -181,14 +181,15 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"System version: 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]\n",
"System version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) \n",
"[GCC 7.3.0]\n",
"Azure ML SDK Version: 1.0.48\n"
]
}
Expand All @@ -214,6 +215,7 @@
"from azureml.train.dnn import PyTorch\n",
"from azureml.widgets import RunDetails\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"from azureml.exceptions import ComputeTargetException\n",
"from allennlp.predictors import Predictor\n",
"\n",
"print(\"System version: {}\".format(sys.version))\n",
Expand Down Expand Up @@ -986,9 +988,9 @@
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python (nlp_gpu)",
"language": "python",
"name": "python3"
"name": "nlp_gpu"
},
"language_info": {
"codemirror_mode": {
Expand All @@ -1000,7 +1002,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.6.8"
}
},
"nbformat": 4,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -187,7 +187,6 @@
"outputs": [],
"source": [
"# Model configuration\n",
"AZUREML_CONFIG_PATH = \"./.azureml\"\n",
"DATA_FOLDER = './squad'\n",
"PROJECT_FOLDER = './pytorch-transformers'\n",
"EXPERIMENT_NAME = 'NLP-QA-BERT-deepdive'\n",
Expand All @@ -202,6 +201,13 @@
"MAX_CONCURRENT_RUNS = 4\n",
"BERT_UTIL_PATH = '../../utils_nlp/azureml/azureml_bert_util.py'\n",
"EVALUATE_SQAD_PATH = '../../utils_nlp/eval/evaluate_squad.py'\n",
"\n",
"# Azure resources\n",
"subscription_id = \"YOUR_SUBSCRIPTION_ID\"\n",
"resource_group = \"YOUR_RESOURCE_GROUP_NAME\" \n",
"workspace_name = \"YOUR_WORKSPACE_NAME\" \n",
"workspace_region = \"YOUR_WORKSPACE_REGION\" #Possible values eastus, eastus2 and so on.\n",
"AZUREML_CONFIG_PATH = \"./.azureml\"\n",
"AZUREML_VERBOSE = False"
]
},
Expand Down Expand Up @@ -241,11 +247,10 @@
" ws = azureml_utils.get_or_create_workspace(config_path=AZUREML_CONFIG_PATH)\n",
"else:\n",
" ws = azureml_utils.get_or_create_workspace(\n",
" config_path=AZUREML_CONFIG_PATH,\n",
" subscription_id=\"<SUBSCRIPTION_ID>\",\n",
" resource_group=\"<RESOURCE_GROUP>\",\n",
" workspace_name=\"<WORKSPACE_NAME>\",\n",
" workspace_region=\"<WORKSPACE_REGION>\",\n",
" subscription_id=subscription_id,\n",
" resource_group=resource_group,\n",
" workspace_name=workspace_name,\n",
" workspace_region=workspace_region,\n",
" )\n",
"\n",
"if AZUREML_VERBOSE:\n",
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -694,9 +694,9 @@
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python (nlp_gpu)",
"language": "python",
"name": "python3"
"name": "nlp_gpu"
},
"language_info": {
"codemirror_mode": {
Expand Down
85 changes: 59 additions & 26 deletions examples/sentence_similarity/bert_senteval.ipynb
Expand Up @@ -7,7 +7,7 @@
"# Parallel Experimentation with BERT on AzureML"
]
},
{
{
"cell_type": "markdown",
"metadata": {},
"source": [
Expand All @@ -34,9 +34,19 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 4,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"System version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) \n",
"[GCC 7.3.0]\n",
"AzureML version: 1.0.57\n"
]
}
],
"source": [
"import os\n",
"import sys\n",
Expand All @@ -48,7 +58,9 @@
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"import scrapbook as sb\n",
"\n",
"import azureml\n",
"from azureml.core import Experiment\n",
"from azureml.data.data_reference import DataReference\n",
"from azureml.train.dnn import PyTorch\n",
Expand All @@ -58,7 +70,11 @@
"from utils_nlp.azureml.azureml_utils import get_or_create_workspace, get_or_create_amlcompute\n",
"from utils_nlp.models.bert.common import Language, Tokenizer\n",
"from utils_nlp.models.bert.sequence_encoding import BERTSentenceEncoder, PoolingStrategy\n",
"from utils_nlp.eval.senteval import SentEvalConfig"
"from utils_nlp.eval.senteval import SentEvalConfig\n",
"\n",
"%matplotlib inline\n",
"print(\"System version: {}\".format(sys.version))\n",
"print(\"AzureML version: {}\".format(azureml.core.VERSION))"
]
},
{
Expand Down Expand Up @@ -627,6 +643,29 @@
"Here we aggregate the outputs from each SentEval experiment to plot the distribution of Pearson correlations reported across the different encodings. We can see that for the STS Benchmark downstream task, the first layer achieves the highest Pearson correlation on the test dataset. As suggested in [bert-as-a-service](https://github.com/hanxiao/bert-as-service), this can be interpreted as a representation that is closer to the original word embedding."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"results = [\n",
" pickle.load(open(f, \"rb\"))\n",
" for f in sorted(glob.glob(os.path.join(CACHE_DIR, \"outputs\", \"*.pkl\")))\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# For testing\n",
"sb.glue(\"pearson\", results[0][\"STSBenchmark\"][\"pearson\"])\n",
"sb.glue(\"mse\", results[0][\"STSBenchmark\"][\"mse\"])\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
Expand Down Expand Up @@ -656,34 +695,28 @@
}
],
"source": [
"%matplotlib inline\n",
"\n",
"results = [\n",
" pickle.load(open(f, \"rb\"))\n",
" for f in sorted(glob.glob(os.path.join(CACHE_DIR, \"outputs\", \"*.pkl\")))\n",
"]\n",
"\n",
"df = pd.DataFrame(\n",
" np.reshape(\n",
" [r[\"STSBenchmark\"][\"pearson\"] for r in results],\n",
" (len(EXP_PARAMS[\"layer_index\"]), len(EXP_PARAMS[\"pooling_strategy\"])),\n",
" ).T,\n",
" index=[s.value for s in EXP_PARAMS[\"pooling_strategy\"]],\n",
" columns=EXP_PARAMS[\"layer_index\"],\n",
")\n",
"fig, ax = plt.subplots(figsize=(10, 2))\n",
"if len(results) == 24:\n",
" df = pd.DataFrame(\n",
" np.reshape(\n",
" [r[\"STSBenchmark\"][\"pearson\"] for r in results],\n",
" (len(EXP_PARAMS[\"layer_index\"]), len(EXP_PARAMS[\"pooling_strategy\"])),\n",
" ).T,\n",
" index=[s.value for s in EXP_PARAMS[\"pooling_strategy\"]],\n",
" columns=EXP_PARAMS[\"layer_index\"],\n",
" )\n",
" fig, ax = plt.subplots(figsize=(10, 2))\n",
"\n",
"sns.heatmap(df, annot=True, fmt=\".2g\", ax=ax).set_title(\n",
" \"Pearson correlations of BERT sequence encodings on STS Benchmark\"\n",
")"
" sns.heatmap(df, annot=True, fmt=\".2g\", ax=ax).set_title(\n",
" \"Pearson correlations of BERT sequence encodings on STS Benchmark\"\n",
" )"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python (nlp_cpu)",
"display_name": "Python (nlp_gpu)",
"language": "python",
"name": "nlp_cpu"
"name": "nlp_gpu"
},
"language_info": {
"codemirror_mode": {
Expand All @@ -695,7 +728,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
"version": "3.6.8"
}
},
"nbformat": 4,
Expand Down
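
The guarded plotting cell in `bert_senteval.ipynb` reshapes a flat list of per-run Pearson scores into a grid (rows are pooling strategies, columns are layer indices) only when the expected number of results is present. The following is a minimal pure-Python sketch of that reshape-and-transpose step, assuming the results are ordered layer by layer as the `np.reshape(...).T` call implies; the function name `to_grid` and the toy scores are illustrative and not taken from the notebook:

```python
from typing import List, Optional, Sequence


def to_grid(
    scores: Sequence[float], n_layers: int, n_strategies: int
) -> Optional[List[List[float]]]:
    """Arrange flat scores as [strategy][layer]; return None if runs are missing."""
    if len(scores) != n_layers * n_strategies:
        # Mirror the notebook's guard: skip the plot when results are incomplete.
        return None
    # Row-major reshape to (n_layers, n_strategies) ...
    by_layer = [
        list(scores[i * n_strategies:(i + 1) * n_strategies])
        for i in range(n_layers)
    ]
    # ... then transpose to (n_strategies, n_layers).
    return [list(column) for column in zip(*by_layer)]


if __name__ == "__main__":
    flat = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # toy data: 3 layers x 2 strategies
    print(to_grid(flat, n_layers=3, n_strategies=2))
```

Deriving the expected count from the parameter grid, rather than comparing against a literal such as 24, keeps the guard correct if the set of layers or pooling strategies changes.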