Merge pull request microsoft#437 from microsoft/staging

Staging to Master
ravi-code-ranjan · Oct 4, 2019 · a2ac143
2 parents b912ce1 + 7ba2e10

Showing 27 changed files with 358 additions and 296 deletions.
diff --git a/.gitignore b/.gitignore
@@ -127,13 +127,16 @@ nlp_*.yaml
 nohup.out
 temp/
 tmp/
+logs/
+score.py
 
 # Data
 data/
+squad/
+bidaf-question-answering/
 */question_answering/bidaf.tar.gz
 */question_answering/bidafenv.yml
 */question_answering/config.json
-*/question_answering/score.py
 */question_answering/vocabulary/
 */question_answering/weights.th
 

diff --git a/NLP-Logo.png b/NLP-Logo.png
diff --git a/README.md b/README.md
@@ -1,4 +1,7 @@
-# NLP Best Practices
+<img src="NLP-Logo.png" align="right" alt="" width="300"/>
+
+
+# NLP Best Practices 
 
 In recent years, natural language processing (NLP) has seen quick growth in quality and usability, and this has helped to drive business adoption of artificial intelligence (AI) solutions. In the last few years, researchers have been applying newer deep learning methods to NLP. Data scientists started moving from traditional methods to state-of-the-art (SOTA) deep neural network (DNN) algorithms which use language models pretrained on large text corpora.
 

diff --git a/SETUP.md b/SETUP.md
@@ -10,19 +10,25 @@ For training at scale, operationalization or hyperparameter tuning, it is recomm
 ## Table of Contents
 
 * [Compute environments](#compute-environments)
-* [Setup guide for Local or DSVM](#setup-guide-for-local-or-dsvm-machines)
+* [Create a cloud-based workstation (Optional)](#Create-a-cloud-based-workstation-optional)
+* [Setup guide for Local or Virtual Machines](#setup-guide-for-local-or-virtual-machines)
   * [Requirements](#requirements)
   * [Dependencies setup](#dependencies-setup)
   * [Register the conda environment in the DSVM JupyterHub](#register-conda-environment-in-dsvm-jupyterhub)
-  * [Installing the Repo's Utils via PIP](#installing-the-repo's-utils-via-pip)
+  * [Installing the Repo's Utils via PIP](#installing-the-repos-utils-via-pip)
 
 
 ## Compute Environments
 
 Depending on the type of NLP system and the notebook that needs to be run, there are different computational requirements. Currently, this repository supports **Python CPU** and **Python GPU**. A conda environment YAML file can be generated for either CPU or GPU environments as shown below in the *Dependencies Setup* section.
 
+## Create a cloud-based workstation (Optional)
 
-## Setup Guide for Local or DSVM Machines
+The Notebook Virtual Machine (VM) in [Azure Machine Learning service](https://azure.microsoft.com/en-us/services/machine-learning-service/) is a cloud-based workstation created specifically for data scientists. Notebook VM-based authoring is directly integrated into Azure Machine Learning service, providing a code-first experience for Python developers to conveniently build and deploy models in the workspace. Developers and data scientists can perform every operation supported by the Azure Machine Learning Python SDK using a familiar Jupyter notebook in a secure, enterprise-ready environment. Notebook VM is secure and easy to use, preconfigured for machine learning, and fully customizable. 
+
+You can learn how to create a Notebook VM [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/tutorial-1st-experiment-sdk-setup#azure) and then follow the same setup as in the [Setup guide for Local or Virtual Machines](#setup-guide-for-local-or-virtual-machines) directly using the terminal in the Notebook VM.
+
+## Setup Guide for Local or Virtual Machines
 
 ### Requirements
 
@@ -96,13 +102,15 @@ If you are using the DSVM, you can [connect to JupyterHub](https://docs.microsof
     <p>  
 A setup.py file is provided in order to simplify the installation of the utilities in this repo from the main directory.  
 
-To install, please run the command below
+To install the package, please run the command below (from directory root)
+
+    pip install -e . 
 
-    python setup.py install 
+Running the command tells pip to install the `utils_nlp` package from source in [development mode](https://setuptools.readthedocs.io/en/latest/setuptools.html#development-mode). This means that any updates to the `utils_nlp` source directory are immediately reflected in the installed package without reinstalling, which is useful for a package that changes often.   
 
-It is also possible to install directly from Github, which is the best way to utilize the `utils_nlp` package in external projects. 
+> It is also possible to install directly from GitHub, which is the best way to utilize the `utils_nlp` package in external projects (while still reflecting updates to the source as it's installed as an editable `'-e'` package). 
 
-    pip install -e  git+git@github.com:microsoft/nlp.git@master#egg=utils_nlp  
+>   `pip install -e  git+git@github.com:microsoft/nlp.git@master#egg=utils_nlp`  
 
 Either command, from above, makes `utils_nlp` available in your conda virtual environment. You can verify it was properly installed by running:  
 

diff --git a/VERSIONING.md b/VERSIONING.md
@@ -1,9 +1,10 @@
 # Semantic Versioning
+> NOTE: Support for `setuptools_scm` is currently removed due to a known [issue](https://github.com/pypa/setuptools_scm/issues/357) with the way pip installations restrict access to certain SCM metadata during package installation. Support will be restored when `setuptools_scm` and `pip` developers fix this with a patch.
 
 This library is configured to use
 [setuptools_scm](https://github.com/pypa/setuptools_scm/) to automatically get package version from git commit histories.
 
-> NOTE: **There shouldn't be any references to manually coded versions**.
+**There shouldn't be any references to manually coded versions**.
 
 Verify what git tag to use by running:
 

diff --git a/examples/entailment/entailment_xnli_bert_azureml.ipynb b/examples/entailment/entailment_xnli_bert_azureml.ipynb
@@ -15,7 +15,7 @@
     "**Note: To learn how to do pre-training on your own, please reference the [AzureML-BERT repo](https://github.com/microsoft/AzureML-BERT) created by Microsoft.**"
    ]
   },
-    {
+  {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -46,6 +46,7 @@
     "from azureml.core import Experiment\n",
     "from azureml.widgets import RunDetails\n",
     "from azureml.core.compute import ComputeTarget\n",
+    "from azureml.exceptions import ComputeTargetException\n",
     "from utils_nlp.azureml.azureml_utils import get_or_create_workspace, get_output_files"
    ]
   },
@@ -537,7 +538,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.4"
+   "version": "3.7.3"
   }
  },
  "nbformat": 4,

diff --git a/examples/question_answering/bidaf_aml_deep_dive.ipynb b/examples/question_answering/bidaf_aml_deep_dive.ipynb
@@ -16,7 +16,7 @@
     "# BiDAF Model Deep Dive on AzureML"
    ]
   },
-    {
+  {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -181,14 +181,15 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": 2,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "System version: 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]\n",
+      "System version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) \n",
+      "[GCC 7.3.0]\n",
       "Azure ML SDK Version: 1.0.48\n"
      ]
     }
@@ -214,6 +215,7 @@
     "from azureml.train.dnn import PyTorch\n",
     "from azureml.widgets import RunDetails\n",
     "from azureml.core.conda_dependencies import CondaDependencies\n",
+    "from azureml.exceptions import ComputeTargetException\n",
     "from allennlp.predictors import Predictor\n",
     "\n",
     "print(\"System version: {}\".format(sys.version))\n",
@@ -986,9 +988,9 @@
  "metadata": {
   "celltoolbar": "Tags",
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "Python (nlp_gpu)",
    "language": "python",
-   "name": "python3"
+   "name": "nlp_gpu"
   },
   "language_info": {
    "codemirror_mode": {
@@ -1000,7 +1002,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.5"
+   "version": "3.6.8"
   }
  },
  "nbformat": 4,

diff --git a/examples/question_answering/pretrained-BERT-SQuAD-deep-dive-aml.ipynb b/examples/question_answering/pretrained-BERT-SQuAD-deep-dive-aml.ipynb
@@ -187,7 +187,6 @@
    "outputs": [],
    "source": [
     "# Model configuration\n",
-    "AZUREML_CONFIG_PATH = \"./.azureml\"\n",
     "DATA_FOLDER = './squad'\n",
     "PROJECT_FOLDER = './pytorch-transformers'\n",
     "EXPERIMENT_NAME = 'NLP-QA-BERT-deepdive'\n",
@@ -202,6 +201,13 @@
     "MAX_CONCURRENT_RUNS = 4\n",
     "BERT_UTIL_PATH = '../../utils_nlp/azureml/azureml_bert_util.py'\n",
     "EVALUATE_SQAD_PATH = '../../utils_nlp/eval/evaluate_squad.py'\n",
+    "\n",
+    "# Azure resources\n",
+    "subscription_id = \"YOUR_SUBSCRIPTION_ID\"\n",
+    "resource_group = \"YOUR_RESOURCE_GROUP_NAME\"  \n",
+    "workspace_name = \"YOUR_WORKSPACE_NAME\"  \n",
+    "workspace_region = \"YOUR_WORKSPACE_REGION\" #Possible values eastus, eastus2 and so on.\n",
+    "AZUREML_CONFIG_PATH = \"./.azureml\"\n",
     "AZUREML_VERBOSE = False"
    ]
   },
@@ -241,11 +247,10 @@
     "    ws = azureml_utils.get_or_create_workspace(config_path=AZUREML_CONFIG_PATH)\n",
     "else:\n",
     "    ws = azureml_utils.get_or_create_workspace(\n",
-    "        config_path=AZUREML_CONFIG_PATH,\n",
-    "        subscription_id=\"<SUBSCRIPTION_ID>\",\n",
-    "        resource_group=\"<RESOURCE_GROUP>\",\n",
-    "        workspace_name=\"<WORKSPACE_NAME>\",\n",
-    "        workspace_region=\"<WORKSPACE_REGION>\",\n",
+    "        subscription_id=subscription_id,\n",
+    "        resource_group=resource_group,\n",
+    "        workspace_name=workspace_name,\n",
+    "        workspace_region=workspace_region,\n",
     "    )\n",
     "\n",
     "if AZUREML_VERBOSE:\n",
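
Reviewer note (not part of the commit): the hunk above moves the Azure resource values into named variables and passes them only in the `else` branch. The sketch below illustrates that "config file first, explicit values second" pattern in plain Python. It assumes the configuration lives in a `config.json` file under the config directory; `resolve_workspace_settings` is a hypothetical helper, not a function from `utils_nlp` or the Azure ML SDK.

```python
import json
import os


def resolve_workspace_settings(config_path, **explicit):
    """Prefer <config_path>/config.json; otherwise fall back to explicit values."""
    config_file = os.path.join(config_path, "config.json")
    if os.path.isfile(config_file):
        with open(config_file) as f:
            return json.load(f)
    # Reject values still left at the notebook's "YOUR_..." placeholders.
    unset = sorted(
        key for key, value in explicit.items()
        if not value or value.startswith("YOUR_")
    )
    if unset:
        raise ValueError("set these before running: " + ", ".join(unset))
    return dict(explicit)


# Hypothetical values; the directory is assumed not to exist, so the
# explicit arguments are used.
settings = resolve_workspace_settings(
    "./.azureml-example-missing",
    subscription_id="0000",
    resource_group="rg",
    workspace_name="ws",
    workspace_region="eastus",
)
print(settings["workspace_region"])
```

Failing early on unset placeholders may give a clearer error than letting a literal `YOUR_SUBSCRIPTION_ID` reach the service.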

diff --git a/examples/question_answering/question_answering_system_bidaf_quickstart.ipynb b/examples/question_answering/question_answering_system_bidaf_quickstart.ipynb
@@ -694,9 +694,9 @@
  "metadata": {
   "celltoolbar": "Tags",
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "Python (nlp_gpu)",
    "language": "python",
-   "name": "python3"
+   "name": "nlp_gpu"
   },
   "language_info": {
    "codemirror_mode": {

diff --git a/examples/sentence_similarity/bert_senteval.ipynb b/examples/sentence_similarity/bert_senteval.ipynb
@@ -7,7 +7,7 @@
     "# Parallel Experimentation with BERT on AzureML"
    ]
   },
-    {
+  {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -34,9 +34,19 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": 4,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "System version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) \n",
+      "[GCC 7.3.0]\n",
+      "AzureML version: 1.0.57\n"
+     ]
+    }
+   ],
    "source": [
     "import os\n",
     "import sys\n",
@@ -48,7 +58,9 @@
     "import pandas as pd\n",
     "import seaborn as sns\n",
     "import matplotlib.pyplot as plt\n",
+    "import scrapbook as sb\n",
     "\n",
+    "import azureml\n",
     "from azureml.core import Experiment\n",
     "from azureml.data.data_reference import DataReference\n",
     "from azureml.train.dnn import PyTorch\n",
@@ -58,7 +70,11 @@
     "from utils_nlp.azureml.azureml_utils import get_or_create_workspace, get_or_create_amlcompute\n",
     "from utils_nlp.models.bert.common import Language, Tokenizer\n",
     "from utils_nlp.models.bert.sequence_encoding import BERTSentenceEncoder, PoolingStrategy\n",
-    "from utils_nlp.eval.senteval import SentEvalConfig"
+    "from utils_nlp.eval.senteval import SentEvalConfig\n",
+    "\n",
+    "%matplotlib inline\n",
+    "print(\"System version: {}\".format(sys.version))\n",
+    "print(\"AzureML version: {}\".format(azureml.core.VERSION))"
    ]
   },
   {
@@ -627,6 +643,29 @@
     "Here we aggregate the outputs from each SentEval experiment to plot the distribution of Pearson correlations reported across the different encodings. We can see that for the STS Benchmark downstream task, the first layer achieves the highest Pearson correlation on the test dataset. As suggested in [bert-as-a-service](https://github.com/hanxiao/bert-as-service), this can be interpreted as a representation that is closer to the original word embedding."
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "results = [\n",
+    "    pickle.load(open(f, \"rb\"))\n",
+    "    for f in sorted(glob.glob(os.path.join(CACHE_DIR, \"outputs\", \"*.pkl\")))\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# For testing\n",
+    "sb.glue(\"pearson\", results[0][\"STSBenchmark\"][\"pearson\"])\n",
+    "sb.glue(\"mse\", results[0][\"STSBenchmark\"][\"mse\"])\n"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 18,
@@ -656,34 +695,28 @@
     }
    ],
    "source": [
-    "%matplotlib inline\n",
-    "\n",
-    "results = [\n",
-    "    pickle.load(open(f, \"rb\"))\n",
-    "    for f in sorted(glob.glob(os.path.join(CACHE_DIR, \"outputs\", \"*.pkl\")))\n",
-    "]\n",
-    "\n",
-    "df = pd.DataFrame(\n",
-    "    np.reshape(\n",
-    "        [r[\"STSBenchmark\"][\"pearson\"] for r in results],\n",
-    "        (len(EXP_PARAMS[\"layer_index\"]), len(EXP_PARAMS[\"pooling_strategy\"])),\n",
-    "    ).T,\n",
-    "    index=[s.value for s in EXP_PARAMS[\"pooling_strategy\"]],\n",
-    "    columns=EXP_PARAMS[\"layer_index\"],\n",
-    ")\n",
-    "fig, ax = plt.subplots(figsize=(10, 2))\n",
+    "if len(results) == 24:\n",
+    "    df = pd.DataFrame(\n",
+    "        np.reshape(\n",
+    "            [r[\"STSBenchmark\"][\"pearson\"] for r in results],\n",
+    "            (len(EXP_PARAMS[\"layer_index\"]), len(EXP_PARAMS[\"pooling_strategy\"])),\n",
+    "        ).T,\n",
+    "        index=[s.value for s in EXP_PARAMS[\"pooling_strategy\"]],\n",
+    "        columns=EXP_PARAMS[\"layer_index\"],\n",
+    "    )\n",
+    "    fig, ax = plt.subplots(figsize=(10, 2))\n",
     "\n",
-    "sns.heatmap(df, annot=True, fmt=\".2g\", ax=ax).set_title(\n",
-    "    \"Pearson correlations of BERT sequence encodings on STS Benchmark\"\n",
-    ")"
+    "    sns.heatmap(df, annot=True, fmt=\".2g\", ax=ax).set_title(\n",
+    "        \"Pearson correlations of BERT sequence encodings on STS Benchmark\"\n",
+    "    )"
    ]
   }
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python (nlp_cpu)",
+   "display_name": "Python (nlp_gpu)",
    "language": "python",
-   "name": "nlp_cpu"
+   "name": "nlp_gpu"
   },
   "language_info": {
    "codemirror_mode": {
@@ -695,7 +728,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.4"
+   "version": "3.6.8"
   }
  },
  "nbformat": 4,
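
Reviewer note (not part of the commit): the new guard `len(results) == 24` in `bert_senteval.ipynb` hard-codes the size of the experiment grid. Since the cell reshapes the flat result list to `(layers, pooling strategies)` and then transposes it, the expected count could instead be derived from the grid dimensions. A minimal sketch of that idea follows, using plain lists in place of NumPy and pandas; the function names and sample scores are hypothetical.

```python
def reshape_to_grid(values, n_layers, n_pooling):
    """Split a flat, layer-major list into n_layers rows of n_pooling values."""
    expected = n_layers * n_pooling
    if len(values) != expected:
        raise ValueError(
            "expected {} results, got {}".format(expected, len(values))
        )
    return [values[i * n_pooling:(i + 1) * n_pooling] for i in range(n_layers)]


def transpose(grid):
    """Swap rows and columns, like the `.T` in the notebook cell."""
    return [list(column) for column in zip(*grid)]


# Hypothetical scores for 3 layers x 2 pooling strategies.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
by_pooling = transpose(reshape_to_grid(scores, n_layers=3, n_pooling=2))
print(by_pooling)  # one row per pooling strategy, one column per layer
```

With this shape check, a partial run fails with an explicit message instead of silently skipping the plot, which may or may not be the behavior wanted for the test harness.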