title | description | services | documentationcenter | author | manager | editor | ms.assetid | ms.service | ms.workload | ms.tgt_pltfrm | ms.devlang | ms.topic | ms.date | ms.author |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Provision the Linux Data Science Virtual Machine \| Microsoft Docs | Configure and create a Linux Data Science Virtual Machine on Azure to do analytics and machine learning. | machine-learning | | bradsev | jhubbard | cgronlun | 3bab0ab9-3ea5-41a6-a62a-8c44fdbae43b | machine-learning | data-services | na | na | article | 12/09/2016 | bradsev |
The Linux Data Science Virtual Machine is an Azure virtual machine that comes with a collection of pre-installed tools. These tools are commonly used for doing data analytics and machine learning. The key software components included are:
- Microsoft R Server Developer Edition
- Anaconda Python distribution (versions 2.7 and 3.5), including popular data analysis libraries
- JupyterHub - a multiuser Jupyter notebook server supporting R, Python, Julia kernels
- Azure Storage Explorer
- Azure command-line interface (CLI) for managing Azure resources
- PostgreSQL Database
- Machine learning tools
  - Computational Network Toolkit (CNTK): A deep learning software toolkit from Microsoft Research.
  - Vowpal Wabbit: A fast machine learning system supporting techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.
  - XGBoost: A tool providing fast and accurate boosted tree implementation.
  - Rattle (the R Analytical Tool To Learn Easily): A tool that makes getting started with data analytics and machine learning in R easy, with GUI-based data exploration, and modeling with automatic R code generation.
- Azure SDK in Java, Python, node.js, Ruby, PHP
- Libraries in R and Python for use in Azure Machine Learning and other Azure services
- Development tools and editors (Eclipse, Emacs, gedit, vi)
Doing data science involves iterating on a sequence of tasks:
- Finding, loading, and pre-processing data
- Building and testing models
- Deploying the models for consumption in intelligent applications
Data scientists use various tools to complete these tasks. It can be quite time consuming to find the appropriate versions of the software, and then to download, compile, and install these versions.
The Linux Data Science Virtual Machine can ease this burden substantially. Use it to jump-start your analytics project. It enables you to work on tasks in various languages, including R, Python, SQL, Java, and C++. Eclipse provides an easy-to-use IDE for developing and testing your code. The Azure SDK included in the VM allows you to build your applications by using various services on Linux for the Microsoft cloud platform. In addition, you have access to other pre-installed languages like Ruby, Perl, PHP, and node.js.
There are no software charges for this data science VM image. You pay only the Azure hardware usage fees that are assessed based on the size of the virtual machine that you provision with the VM image. More details on the compute fees can be found on the VM listing page on the Azure Marketplace.
Before you can create a Linux Data Science Virtual Machine, you must have the following:
- An Azure subscription: To obtain one, see Get Azure free trial.
- An Azure storage account: To create one, see Create an Azure storage account. Alternatively, the storage account can be created as part of the process of creating the VM, if you do not want to use an existing account.
Here are the steps to create an instance of the Linux Data Science Virtual Machine:
- Navigate to the virtual machine listing on the Azure portal.
- The following sections list the inputs needed to configure each step of the wizard (enumerated on the right of the preceding figure) used to create the Microsoft Data Science Virtual Machine:
a. Basics:
- Name: Name of your data science server you are creating.
- User Name: First account sign-in ID.
- Password: First account password (you can use SSH public key instead of password).
- Subscription: If you have more than one subscription, select the one on which the machine is to be created and billed. You must have resource creation privileges for this subscription.
- Resource Group: You can create a new one or use an existing group.
- Location: Select the data center that is most appropriate. Usually it is the data center that has most of your data, or is closest to your physical location for fastest network access.
b. Size:
- Select one of the server types that meets your functional requirement and cost constraints. Select View All to see more choices of VM sizes.
c. Settings:
- Disk Type: Choose Premium if you prefer a solid state drive (SSD). Otherwise, choose Standard.
- Storage Account: You can create a new Azure storage account in your subscription, or use an existing one in the same location that was chosen on the Basics step of the wizard.
- Other parameters: In most cases, you just use the default values. To consider non-default values, hover over the informational link for help on the specific fields.
d. Summary:
- Verify that all information you entered is correct.
e. Buy:
- To start the provisioning, click Buy. A link is provided to the terms of the transaction. The VM does not have any additional charges beyond the compute for the server size you chose in the Size step.
The provisioning should take about 10-20 minutes. The status of the provisioning is displayed on the Azure portal.
After the VM is created, you can sign in to it by using SSH. For the text shell interface, use the account credentials that you created in the Basics step of the wizard. On Windows, you can download an SSH client tool like PuTTY. If you prefer a graphical desktop (X Window System), you can use X11 forwarding on PuTTY or install the X2Go client.
Note
The X2Go client performed significantly better than X11 forwarding in testing. We recommend using the X2Go client for a graphical desktop interface.
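For the plain SSH sign-in described above, the following is a minimal Python 3 sketch that assembles an OpenSSH client command line, assuming an OpenSSH client on a Linux or macOS machine. The helper name `build_ssh_command`, the user name, and the host name are hypothetical and only for illustration. The `-p` (port) and `-X` (X11 forwarding) options are standard OpenSSH client options.

```python
import shlex


def build_ssh_command(user, host, port=22, x11_forwarding=False):
    """Return the argument list for an OpenSSH client invocation."""
    args = ["ssh", "-p", str(port)]
    if x11_forwarding:
        # -X asks the OpenSSH client to forward X11 so graphical tools display locally.
        args.append("-X")
    args.append("{}@{}".format(user, host))
    return args


# Hypothetical account and host name; replace them with your own values.
command = build_ssh_command("dsvmuser", "mydsvm.example.com", x11_forwarding=True)
print(" ".join(shlex.quote(arg) for arg in command))
```

Returning a list rather than a single string means the result can be passed to `subprocess.run` without shell quoting concerns.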
The Linux VM is already provisioned with X2Go server and ready to accept client connections. To connect to the Linux VM graphical desktop, do the following on your client:
- Download and install the X2Go client for your client platform from X2Go.
- Run the X2Go client, and select New Session. It opens a configuration window with multiple tabs. Enter the following configuration parameters:
  - Session tab:
    - Host: The host name or IP address of your Linux Data Science VM.
    - Login: User name on the Linux VM.
    - SSH Port: Leave it at 22, the default value.
    - Session Type: Change the value to XFCE. Currently the Linux VM only supports XFCE desktop.
  - Media tab: You can turn off sound support and client printing if you don't need to use them.
  - Shared folders: If you want directories from your client machines mounted on the Linux VM, add the client machine directories that you want to share with the VM on this tab.
After you sign in to the VM by using either the SSH client or XFCE graphical desktop through the X2Go client, you are ready to start using the tools that are installed and configured on the VM. On XFCE, you can see applications menu shortcuts and desktop icons for many of the tools.
R is one of the most popular languages for data analysis and machine learning. If you want to use R for your analytics, the VM has Microsoft R Open (MRO) with the Math Kernel Library (MKL). The MKL optimizes math operations common in analytical algorithms. MRO is 100 percent compatible with CRAN-R, and any of the R libraries published in CRAN can be installed on the MRO. You can edit your R programs in one of the default editors, like vi, Emacs, or gedit. You can also download and use other IDEs, such as RStudio. For your convenience, a simple script (installRStudio.sh) is provided in the /dsvm/tools directory that installs RStudio. If you are using the Emacs editor, note that the Emacs package ESS (Emacs Speaks Statistics), which simplifies working with R files within the Emacs editor, has been pre-installed.
To launch R, you just type R in the shell. This takes you to an interactive environment. To develop your R program, you typically use an editor like Emacs or vi or gedit, and then run the scripts within R. If you install RStudio, you have a full graphical IDE environment to develop your R program.
There is also an R script that installs the Top 20 R packages. Run this script from the R interactive interface, which you enter (as mentioned) by typing R in the shell.
For development using Python, Anaconda Python distribution 2.7 and 3.5 has been installed. This distribution contains the base Python along with about 300 of the most popular math, engineering, and data analytics packages. You can use the default text editors. In addition, you can use Spyder, a Python IDE that is bundled with Anaconda Python distributions. Spyder needs a graphical desktop or X11 forwarding. A shortcut to Spyder is provided in the graphical desktop.
Because both Python 2.7 and 3.5 are installed, you need to activate the Python version you want to work with in the current session. The activation process sets the PATH variable to the desired version of Python.
To activate Python 2.7, run the following from the shell:
source /anaconda/bin/activate root
Python 2.7 is installed at /anaconda/bin.
To activate Python 3.5, run the following from the shell:
source /anaconda/bin/activate py35
Python 3.5 is installed at /anaconda/envs/py35/bin.
To invoke a Python interactive session, just type python in the shell. If you are on a graphical interface or have X11 forwarding set up, you can type spyder to launch the Python IDE.
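After activating an environment, you may want to confirm which interpreter the session actually picked up. The following is a small sketch, written to run under both Python 2.7 and Python 3.5, that reports the interpreter path and version. The function name `describe_interpreter` is hypothetical.

```python
from __future__ import print_function

import sys


def describe_interpreter():
    """Return the path and major.minor version of the running interpreter."""
    return {
        "executable": sys.executable,
        "version": "{}.{}".format(sys.version_info[0], sys.version_info[1]),
    }


info = describe_interpreter()
print("Interpreter:", info["executable"])
print("Version:", info["version"])
```

If the activation worked as described above, the reported path should sit under /anaconda/bin for Python 2.7 or /anaconda/envs/py35/bin for Python 3.5.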
The Anaconda distribution also comes with a Jupyter notebook, an environment to share code and analysis. The Jupyter notebook is accessed through JupyterHub. You sign in using your local Linux user name and password.
The Jupyter notebook server has been pre-configured with Python 2, Python 3, and R kernels. There is a desktop icon named "Jupyter Notebook" to launch the browser to access the notebook server. If you are on the VM via SSH or X2Go client, you can also visit https://localhost:8000/ to access the Jupyter notebook server.
Note
Continue if you get any certificate warnings.
You can access the Jupyter notebook server from any host by entering https://<VM DNS name or IP Address>:8000/ in a browser.
Note
Port 8000 is opened in the firewall by default when the VM is provisioned.
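As a rough illustration of the address format above, the following Python 3 sketch builds the notebook server URL and, optionally, checks that the server answers. The function names are hypothetical. The check skips certificate verification on the assumption that the certificate warnings mentioned earlier come from a self-signed certificate on a VM you control. Do not skip verification for hosts you do not trust.

```python
import ssl
import urllib.request


def jupyterhub_url(host, port=8000):
    """Build the HTTPS address of the notebook server."""
    return "https://{}:{}/".format(host, port)


def check_jupyterhub(host, port=8000, timeout=5):
    """Return the HTTP status code of the notebook server landing page."""
    # Assumption: the server presents a self-signed certificate, so
    # hostname and certificate checks are turned off for this request.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    url = jupyterhub_url(host, port)
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.getcode()


print(jupyterhub_url("localhost"))
```

Calling `check_jupyterhub("localhost")` from a session on the VM performs the actual request. It is left out of the example run because it needs a live server.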
We have packaged two sample notebooks: one in Python and one in R. You can see the link to the samples on the notebook home page after you authenticate to the Jupyter notebook by using your local Linux user name and password. You can create a new notebook by selecting New, and then the appropriate language kernel. If you don't see the New button, click the Jupyter icon on the top left to go to the home page of the notebook server.
You have a choice of several code editors, including vi/Vim, Emacs, gedit, and Eclipse. gedit and Eclipse are graphical editors, so you need to be signed in to a graphical desktop to use them. These editors have desktop and application menu shortcuts to launch them.
Vim and Emacs are text-based editors. On Emacs, we have installed an add-on package called Emacs Speaks Statistics (ESS) that makes working with R easier within the Emacs editor. More information can be found at ESS.
Eclipse is an open source, extensible IDE that supports multiple languages. The Java developers edition is the instance installed on the VM. There are plugins available for several popular languages that can be installed to extend the Eclipse environment. We also have a plugin installed in Eclipse called Azure Toolkit for Eclipse. It allows you to create, develop, test, and deploy Azure applications using the Eclipse development environment that supports languages like Java. There is also an Azure SDK for Java that allows access to different Azure services from within a Java environment. More information on Azure toolkit for Eclipse can be found at Azure Toolkit for Eclipse.
LaTeX is installed through the texlive package along with an Emacs add-on package, AUCTeX, which simplifies authoring your LaTeX documents within Emacs.
The open source database Postgres is available on the VM, with the services running and initdb already completed. You still need to create databases and users. For more information, see the Postgres documentation.
SQuirreL SQL, a graphical SQL client, has been provided to connect to different databases (such as Microsoft SQL Server, Postgres, and MySQL) and to run SQL queries. You can run this from a graphical desktop session (using the X2Go client, for example). To invoke SQuirreL SQL, you can either launch it from the icon on the desktop or run the following command in the shell.
/usr/local/squirrel-sql-3.7/squirrel-sql.sh
Before the first use, set up your drivers and database aliases. The JDBC drivers are located at:
/usr/share/java/jdbcdrivers
For more information, see SQuirreL SQL.
The ODBC driver package for SQL Server also comes with two command-line tools:
bcp: The bcp utility bulk copies data between an instance of Microsoft SQL Server and a data file in a user-specified format. The bcp utility can be used to import large numbers of new rows into SQL Server tables, or to export data out of tables into data files. To import data into a table, you must either use a format file created for that table, or understand the structure of the table and the types of data that are valid for its columns.
For more information, see Connecting with bcp.
sqlcmd: With the sqlcmd utility, you can enter Transact-SQL statements, system procedures, and script files at the command prompt. This utility uses ODBC to execute Transact-SQL batches.
For more information, see Connecting with sqlcmd.
Note
There are some differences in this utility between Linux and Windows platforms. See the documentation for details.
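To make the sqlcmd description more concrete, here is a hedged Python 3 sketch that assembles a sqlcmd invocation for a single query. The helper name `build_sqlcmd_args` and all server, database, and account values are hypothetical. The options used (`-S` server, `-d` database, `-U` user, `-P` password, `-Q` run a query and exit) are taken from the sqlcmd documentation. Check them against the Linux version on your VM, given the platform differences noted above.

```python
def build_sqlcmd_args(server, database, user, password, query):
    """Return the argument list for a one-shot sqlcmd query."""
    return [
        "sqlcmd",
        "-S", server,    # server name or address
        "-d", database,  # database to use
        "-U", user,      # SQL login name
        "-P", password,  # SQL login password
        "-Q", query,     # run the query, then exit
    ]


# Hypothetical connection details, for illustration only.
args = build_sqlcmd_args(
    "myserver.database.windows.net", "mydb", "dbuser", "dbpassword", "SELECT 1"
)
print(args)

# To execute the command on the VM (requires a reachable server):
# import subprocess
# subprocess.run(args, check=True)
```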
There are libraries available in R and Python to access databases.
- In R, the RODBC package or dplyr package allows you to query or execute SQL statements on the database server.
- In Python, the pyodbc library provides database access with ODBC as the underlying layer.
To access Postgres:
- From R: Use the package RPostgreSQL.
- From Python: Use the psycopg2 library.
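For the pyodbc route mentioned above, a connection usually starts from an ODBC connection string. The following sketch builds one. The helper name is hypothetical, and the driver name "ODBC Driver 13 for SQL Server" is an assumption about what is registered on the VM. Listing the installed drivers (for example with `odbcinst -q -d`) shows the exact name to use.

```python
def build_connection_string(server, database, user, password,
                            driver="ODBC Driver 13 for SQL Server"):
    """Return an ODBC connection string for SQL Server style databases."""
    # The default driver name is an assumption; adjust it to match your VM.
    parts = [
        ("DRIVER", "{" + driver + "}"),
        ("SERVER", server),
        ("DATABASE", database),
        ("UID", user),
        ("PWD", password),
    ]
    return ";".join("{}={}".format(key, value) for key, value in parts)


# Hypothetical connection details, for illustration only.
connection_string = build_connection_string(
    "myserver.database.windows.net", "mydb", "dbuser", "dbpassword"
)
print(connection_string)

# With pyodbc installed and a reachable server, a query could then look like:
# import pyodbc
# connection = pyodbc.connect(connection_string)
# cursor = connection.cursor()
# cursor.execute("SELECT 1")
# print(cursor.fetchone())
```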
The following Azure tools are installed on the VM:
- Azure command-line interface: The Azure CLI allows you to create and manage Azure resources through shell commands. To invoke the Azure tools, just type azure help. For more information, see the Azure CLI documentation page.
- Microsoft Azure Storage Explorer: Microsoft Azure Storage Explorer is a graphical tool that is used to browse through the objects that you have stored in your Azure storage account, and to upload and download data to and from Azure blobs. You can access Storage Explorer from the desktop shortcut icon. You can also invoke it from a shell prompt by typing StorageExplorer. Either way, you need to be signed in from an X2Go client, or have X11 forwarding set up.
- Azure Libraries: The following are some of the pre-installed libraries.
  - Python: The Azure-related libraries in Python that are installed are azure, azureml, pydocumentdb, and pyodbc. With the first three libraries, you can access Azure storage services, Azure Machine Learning, and Azure DocumentDB (a NoSQL database on Azure). The fourth library, pyodbc (along with the Microsoft ODBC driver for SQL Server), enables access to SQL Server, Azure SQL Database, and Azure SQL Data Warehouse from Python by using an ODBC interface. Enter pip list to see all the installed libraries. Be sure to run this command in both the Python 2.7 and 3.5 environments.
  - R: The Azure-related libraries in R that are installed are AzureML and RODBC.
  - Java: The list of Azure Java libraries can be found in the directory /dsvm/sdk/AzureSDKJava on the VM. The key libraries are Azure storage and management APIs, DocumentDB, and JDBC drivers for SQL Server.
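As a complement to running pip list, the following Python 3 sketch checks from inside an interpreter whether the Azure-related Python modules named above can be located in the active environment. The function name `find_available` is hypothetical, and the check only looks for a top-level module of each name without importing it.

```python
import importlib.util


def find_available(module_names):
    """Map each top-level module name to True if it can be located."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


status = find_available(["azure", "azureml", "pydocumentdb", "pyodbc"])
for name in sorted(status):
    print("{:<14}{}".format(name, "found" if status[name] else "missing"))
```

Because the result depends on the active environment, run it once after activating each environment that you plan to use.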
You can access the Azure portal from the pre-installed Firefox browser. On the Azure portal, you can create, manage, and monitor Azure resources.
Azure Machine Learning is a fully managed cloud service that enables you to build, deploy, and share predictive analytics solutions. You build your experiments and models from Azure Machine Learning Studio. It can be accessed from a web browser on the data science virtual machine by visiting Microsoft Azure Machine Learning.
After you sign in to Azure Machine Learning Studio, you have access to an experimentation canvas where you can build a logical flow for the machine learning algorithms. You also have access to a Jupyter notebook hosted on Azure Machine Learning and can work seamlessly with the experiments in Machine Learning Studio. Operationalize the machine learning models that you have built by wrapping them in a web service interface. This enables clients written in any language to invoke predictions from the machine learning models. For more information, see the Machine Learning documentation.
You can also build your models in R or Python on the VM, and then deploy them in production on Azure Machine Learning. We have installed libraries in R (AzureML) and Python (azureml) to enable this functionality.
For information on how to deploy models in R and Python into Azure Machine Learning, see Ten things you can do on the Data science Virtual Machine (in particular, the section "Build models using R or Python and Operationalize them using Azure Machine Learning").
Note
These instructions were written for the Windows version of the Data Science VM, but the information provided there on deploying models to Azure Machine Learning also applies to the Linux VM.
The VM comes with a few machine learning tools and algorithms that have been pre-compiled and pre-installed locally. These include:
- CNTK (Computational Network Toolkit from Microsoft Research): A deep learning toolkit.
- Vowpal Wabbit: A fast online learning algorithm.
- xgboost: A tool that provides optimized, boosted tree algorithms.
- Python: Anaconda Python comes bundled with machine learning algorithms in libraries like Scikit-learn. You can install other libraries by using the pip install command.
- R: A rich library of machine learning functions is available for R. Some of the pre-installed functions and libraries are lm, glm, randomForest, and rpart. Other libraries can be installed by running:
install.packages(<lib name>)
Here is some additional information about the first three machine learning tools in the list.
CNTK is an open source deep learning toolkit. It is a command-line tool (cntk), and is already in the PATH.
To run a basic sample, execute the following commands in the shell:
# Copy samples to your home directory and execute cntk
cp -r /dsvm/tools/CNTK-2016-02-08-Linux-64bit-CPU-Only/Examples/Other/Simple2d cntkdemo
cd cntkdemo/Data
cntk configFile=../Config/Simple.cntk
The model output is in ~/cntkdemo/Output/Models.
For more information, see the CNTK section of GitHub, and the CNTK wiki.
Vowpal Wabbit is a machine learning system that uses techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.
To run the tool on a very basic example, do the following:
cp -r /dsvm/tools/VowpalWabbit/demo vwdemo
cd vwdemo
vw house_dataset
There are other, larger demos in that directory. For more information on VW, see this section of GitHub, and the Vowpal Wabbit wiki.
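If you want to try Vowpal Wabbit on your own data, the examples must be in its plain-text input format, where each line holds a label, a `|` separator, and `name:value` feature pairs. The following Python 3 sketch formats one example that way. The helper name `to_vw_line` and the feature names are hypothetical, and it covers only the simplest form of the format (no namespaces, importance weights, or tags).

```python
def to_vw_line(label, features):
    """Format one example as 'label | name:value ...' with sorted feature names."""
    for name in features:
        # Spaces, colons, and pipes are separators in the input format.
        if any(char in name for char in " :|"):
            raise ValueError("invalid feature name: {!r}".format(name))
    pairs = " ".join(
        "{}:{}".format(name, features[name]) for name in sorted(features)
    )
    return "{} | {}".format(label, pairs)


# Hypothetical features, for illustration only.
print(to_vw_line(1, {"price": 0.23, "sqft": 0.25}))
# prints: 1 | price:0.23 sqft:0.25
```

Writing one such line per example to a text file gives a data set that can be passed to vw in the same way as house_dataset above.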
xgboost is a library that is designed and optimized for boosted (tree) algorithms. The objective of this library is to push the computation limits of machines to the extremes needed to provide large-scale tree boosting that is scalable, portable, and accurate.
It is provided as a command-line tool as well as an R library.
To use this library in R, you can start an interactive R session (just by typing R in the shell), and load the library.
Here is a simple example you can run in R prompt:
library(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
               eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
pred <- predict(bst, test$data)
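With the binary:logistic objective, the predictions are probabilities between 0 and 1, so a common next step is to threshold them and compare the result with the true labels. The following Python 3 sketch illustrates that arithmetic on made-up numbers. It is not part of the R session above, and the function name `classification_error` and the 0.5 threshold are assumptions chosen for illustration.

```python
def classification_error(probabilities, labels, threshold=0.5):
    """Return the fraction of examples whose thresholded prediction is wrong."""
    if len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("at least one example is required")
    wrong = sum(
        1
        for probability, label in zip(probabilities, labels)
        if int(probability > threshold) != label
    )
    return wrong / len(labels)


# Made-up predictions and labels, for illustration only.
print(classification_error([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1]))
# prints: 0.5
```

In the example, the thresholded predictions are 1, 0, 1, 0, so the last two of the four examples are misclassified.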
To run the xgboost command line, here are the commands to execute in the shell:
cp -r /dsvm/tools/xgboost/demo/binary_classification/ xgboostdemo
cd xgboostdemo
xgboost mushroom.conf
A .model file is written to the directory specified. Information about this demo example can be found on GitHub.
For more information about xgboost, see the xgboost documentation page, and its GitHub repository.
Rattle (the R Analytical Tool To Learn Easily) uses GUI-based data exploration and modeling. It presents statistical and visual summaries of data, transforms data that can be readily modeled, builds both unsupervised and supervised models from the data, presents the performance of models graphically, and scores new data sets. It also generates R code, replicating the operations in the UI that can be run directly in R or used as a starting point for further analysis.
To run Rattle, you need to be in a graphical desktop sign-in session. On the terminal, type R to enter the R environment. At the R prompt, enter the following commands:
library(rattle)
rattle()
Now a graphical interface opens up with a set of tabs. Here are the quick start steps in Rattle needed to use a sample weather data set and build a model. In some of the steps below, you are prompted to automatically install and load some required R packages that are not already on the system.
Note
If you don't have access to install the package in the system directory (the default), you may see a prompt on your R console window to install packages to your personal library. Answer y if you see these prompts.
- Click Execute.
- A dialog pops up, asking whether you would like to use the example weather data set. Click Yes to load the example.
- Click the Model tab.
- Click Execute to build a decision tree.
- Click Draw to display the decision tree.
- Click the Forest radio button, and click Execute to build a random forest.
- Click the Evaluate tab.
- Click the Risk radio button, and click Execute to display two Risk (Cumulative) performance plots.
- Click the Log tab to show the generated R code for the preceding operations. (Due to a bug in the current release of Rattle, you need to insert a # character in front of Export this log ... in the text of the log.)
- Click the Export button to save the R script file named weather_script.R to the home folder.
You can exit Rattle and R. You can now modify the generated R script, or run it as is at any time to repeat everything that was done within the Rattle UI. Especially for beginners in R, this is an easy way to quickly do analysis and machine learning in a simple graphical interface, while automatically generating R code to modify or learn from.
Here's how you can continue your learning and exploration:
- The Data science on the Linux Data Science Virtual Machine walkthrough shows you how to perform several common data science tasks with the Linux Data Science VM provisioned here.
- Explore the various data science tools on the data science VM by trying out the tools described in this article. You can also run dsvm-more-info in the shell within the virtual machine for a basic introduction and pointers to more information about the tools installed on the VM.
- Learn how to build end-to-end analytical solutions systematically by using the Team Data Science Process.
- Visit the Cortana Analytics Gallery for machine learning and data analytics samples that use the Cortana Analytics Suite.