
Build big Data processing and Machine Learning platform with MLSQL


gcstar/streamingpro

 
 


What's StreamingPro and MLSQL?

StreamingPro is designed mainly to run on Apache Spark, but it also supports Apache Flink as a runtime. It can therefore be regarded as a cross-engine, distributed platform that combines a Big Data platform with an AI platform, on which you can run both Big Data processing and Machine Learning scripts.

MLSQL is a DSL akin to SQL, but more powerful, built on the StreamingPro platform. StreamingPro already integrates many ML frameworks, including Spark MLlib, DL4J, and Python ML frameworks such as Sklearn and TensorFlow (with cluster mode support), so you can use MLSQL to drive all of these popular Machine Learning frameworks.

Why MLSQL

MLSQL gives you the power to use a single SQL-like language for your entire Machine Learning pipeline. It also provides many modules and functions that reduce the complexity of building Machine Learning applications.

  1. MLSQL is the only language you need to learn.
  2. Data preprocessing created in the training phase can be reused directly in streaming, batch, and API services without extra coding.
  3. Server mode frees you from environment setup trouble.

Quick Tutorial

Step 1:

Download the following jars from the release page:

  1. streamingpro-mlsql-1.x.x.jar
  2. ansj_seg-5.1.6.jar
  3. nlp-lang-1.7.8.jar

Step 2:

Visit the Spark downloads page, download Apache Spark 2.2.0, and unarchive it.

Step 3:

cd spark-2.2.0-bin-hadoop2.7/

./bin/spark-submit   --class streaming.core.StreamingApp \
--master local[*] \
--name sql-interactive \
--jars ansj_seg-5.1.6.jar,nlp-lang-1.7.8.jar \
streamingpro-mlsql-1.1.2.jar    \
-streaming.name sql-interactive    \
-streaming.job.file.path file:///tmp/query.json \
-streaming.platform spark   \
-streaming.rest true   \
-streaming.driver.port 9003   \
-streaming.spark.service true \
-streaming.thrift false \
-streaming.enableHiveSupport true

query.json is a JSON file whose only content is "{}".
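As a minimal sketch, the empty job file can be created with a few lines of Python before running the spark-submit command. The helper name `write_empty_job_file` is hypothetical; the path `/tmp/query.json` is the one passed to `-streaming.job.file.path` in Step 3.

```python
import json
from pathlib import Path


def write_empty_job_file(path):
    """Write an empty JSON object ("{}") to `path` and return the Path."""
    job_file = Path(path)
    job_file.write_text(json.dumps({}) + "\n", encoding="utf-8")
    return job_file


if __name__ == "__main__":
    # Same location as -streaming.job.file.path file:///tmp/query.json
    print(write_empty_job_file("/tmp/query.json"))
```

Any other way of producing a file that contains just `{}` works equally well.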

Step 4:

Open your Chrome browser and go to the following URL:

http://127.0.0.1:9003

Enjoy.


Run the first Machine Learning Script in MLSQL.

-- load data from spark distribution 
load libsvm.`/spark-2.2.0-bin-hadoop2.7/data/mllib/sample_libsvm_data.txt` as data;

-- train a NaiveBayes model and save it in /tmp/bayes_model.
-- The algorithm used here is based on Spark MLlib.
train data as NaiveBayes.`/tmp/bayes_model`;

-- register your model
register NaiveBayes.`/tmp/bayes_model` as bayes_predict;

-- predict all data 
select bayes_predict(features) as predict_label, label  from data as result;

-- save predicted result in /tmp/result with json format
save overwrite result as json.`/tmp/result`;

-- show predict label in web table.
select * from result as res;

Please make sure the path /spark-2.2.0-bin-hadoop2.7/data/mllib/sample_libsvm_data.txt is correct.
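The sample file is in LIBSVM text format: each line holds a label followed by `index:value` pairs. If you want to sanity-check your own data file before loading it, a parser for one line might look like the sketch below. The function name `parse_libsvm_line` is hypothetical and is not part of StreamingPro.

```python
def parse_libsvm_line(line):
    """Parse one LIBSVM-format line into (label, {feature_index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        # Each feature is written as "index:value".
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return label, features


if __name__ == "__main__":
    print(parse_libsvm_line("0 128:51 129:159"))
```

A line that does not follow this shape raises a `ValueError`, which is usually enough to spot a malformed file early.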

Copy and paste the script into the web page and click 运行 (Run); you will then see the label and predict_label columns.

Congratulations, you have completed the first Machine Learning script!


Run the first ETL Script In MLSQL.

select "a" as a,"b" as b
as abc;

-- here we just copy all from table abc and then create a new table newabc.

select * from abc
as newabc;

-- save the newabc table to mysql.
save overwrite newabc
as jdbc.`db.abc`
options truncate="true"
and driver="com.mysql.jdbc.Driver"
and url="jdbc:mysql://127.0.0.1:3306/...."
and user="..."
and password="....";

Congratulations, you have completed the first ETL script!


Run as Application or Server

  1. Application mode: run StreamingPro as an application that executes a JSON file.
  2. Server mode: run StreamingPro as a server that you interact with over HTTP.

We strongly recommend deploying StreamingPro in Server mode, which is under active development.
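In Server mode, a script can in principle be submitted over HTTP instead of through the web page. The sketch below only builds such a request. The endpoint path `/run/script` and the form field `sql` are assumptions, not taken from this README, so check the server's REST documentation before relying on them. The helper name `build_script_request` is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_script_request(sql, base_url="http://127.0.0.1:9003", path="/run/script"):
    """Build a form-encoded POST request carrying an MLSQL script.

    Assumption: the server accepts the script in a form field named `sql`
    at `path`. Both names are illustrative and may differ in your version.
    """
    body = urlencode({"sql": sql}).encode("utf-8")
    return Request(
        base_url + path,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_script_request("select 'a' as a as table1;")
    print(req.get_method(), req.full_url)
    # To send it against a running server, uncomment:
    # with urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the URL and body first.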

To avoid compilation problems, use a release version directly.

If you really want to use Application mode, StreamingPro supports the batch.mlsql keyword in the JSON file, so you can still use MLSQL grammar. (This feature is available from v1.1.2.)

{
  "mlsql": {
    "desc": "test",
    "strategy": "spark",
    "algorithm": [],
    "ref": [],
    "compositor": [
      {
        "name": "batch.mlsql",
        "params": [
          {
            "sql": [
              "select 'a' as a as table1;",
              "save overwrite table1 as parquet.`/tmp/kk`;"
            ]
          }
        ]
      }
    ],
    "configParams": {
    }
  }
}
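If you have many MLSQL statements, writing this JSON by hand gets tedious. As a rough sketch, the same structure can be generated from a list of statements. The helper name `build_batch_mlsql_job` and its parameters are hypothetical; the keys mirror the example above.

```python
import json


def build_batch_mlsql_job(statements, job_name="mlsql", desc="test"):
    """Wrap MLSQL statements in an Application-mode job definition."""
    return {
        job_name: {
            "desc": desc,
            "strategy": "spark",
            "algorithm": [],
            "ref": [],
            "compositor": [
                {
                    "name": "batch.mlsql",
                    # Each statement becomes one entry of the "sql" array.
                    "params": [{"sql": list(statements)}],
                }
            ],
            "configParams": {},
        }
    }


if __name__ == "__main__":
    job = build_batch_mlsql_job([
        "select 'a' as a as table1;",
        "save overwrite table1 as parquet.`/tmp/kk`;",
    ])
    print(json.dumps(job, indent=2))
```

The printed JSON can be saved to a file and passed to `-streaming.job.file.path`.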

Learning MLSQL

Compiling

Advanced Programming

Machine Learning

Model deploy

MLSQL

Tools

  1. StreamingPro Manager
  2. StreamingPro json editor

Experimental

  1. Flink support

Other documents

