
Commit: update doc
allwefantasy committed Apr 26, 2019
1 parent 07ec9ca commit 223011c
Showing 8 changed files with 602 additions and 1 deletion.
10 changes: 9 additions & 1 deletion docs/gitbook/en/SUMMARY.md
Expand Up @@ -33,7 +33,15 @@
* [Python UDF](guide/udf/python-udf.md)
* [Scala UDF](guide/udf/scala-udf.md)
* [Scala UDAF](guide/udf/scala-udaf.md)
* [JAVA UDF](guide/udf/java-udf.md)
* [JAVA UDF](guide/udf/java-udf.md)

* [Built-in UDF](guide/system_udf/README.md)
* [Http UDF](guide/system_udf/http-udf.md)
* [Common UDF](guide/system_udf/http-udf.md)

* [Python project](guide/python/README.md)
* [Project Standard](guide/python/project.md)
* [Python environment](guide/python/env.md)

* [API Docs](test2/a.md)
* [More]()
Expand Down
4 changes: 4 additions & 0 deletions docs/gitbook/en/guide/python/README.md
@@ -0,0 +1,4 @@
# Python Project Support

MLSQL supports not only Python UDFs but also Python projects. We use conda to resolve the Python
environment, and this is transparent to users.
38 changes: 38 additions & 0 deletions docs/gitbook/en/guide/python/env.md
@@ -0,0 +1,38 @@
# Python Environment

Before you can run your Python project, you should create the environment on which your project
depends.

It looks like this:

```sql
set dependencies='''
name: tutorial4
dependencies:
- python=3.6
- pip
- pip:
- --index-url https://mirrors.aliyun.com/pypi/simple/
- numpy==1.14.3
- kafka==1.3.5
- pyspark==2.3.2
- pandas==0.22.0
''';

load script.`dependencies` as dependencies;

run command as PythonEnvExt.`/tmp/jack` where condaFile="dependencies" and command="create";
```

If you want to remove this environment, set `command` to `remove`. Note that you should make sure all your machines have conda installed
and a working internet connection.
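For instance, removing the environment created above could look like this (a sketch that reuses the `dependencies` script and the `/tmp/jack` path from the example above):

```sql
-- drop the conda environment that was created above
load script.`dependencies` as dependencies;

run command as PythonEnvExt.`/tmp/jack` where condaFile="dependencies" and command="remove";
```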

You can also specify `condaYamlFilePath`, which is the location of your conda.yaml file.

If running a Python project fails with errors like `Could not find Conda executable at conda`, you can add this configuration to
PythonAlg/PythonParallelExt:

```sql
-- anaconda3 local path
and systemParam.envs='''{"MLFLOW_CONDA_HOME":"/anaconda3"}''';
```
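Note that the fragment above is not a standalone statement; it is appended to a statement's `where` clause. The statement below is a hypothetical sketch (the table name `data` and the path `/tmp/model` are illustrative, not part of the original example):

```sql
-- hypothetical train statement; only the systemParam.envs clause is from the snippet above
train data as PythonAlg.`/tmp/model`
where systemParam.envs='''{"MLFLOW_CONDA_HOME":"/anaconda3"}''';
```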
31 changes: 31 additions & 0 deletions docs/gitbook/en/guide/python/process.md
@@ -0,0 +1,31 @@
# Distributed Execution of Python Projects in MLSQL

## Prerequisites

If you run in YARN mode, please make sure you start the MLSQL Engine with the following configuration:

```
-streaming.ps.cluster.enable should be enabled.

Please make sure you have the uber-jar of MLSQL placed in:

1. --jars
2. --conf "spark.executor.extraClassPath=[your jar name in --jars]"

For example:

--jars ./streamingpro-mlsql-spark_2.x-x.x.x-SNAPSHOT.jar
--conf "spark.executor.extraClassPath=streamingpro-mlsql-spark_2.x-x.x.x-SNAPSHOT.jar"

Otherwise the executors will fail to start and the whole application will fail.
```

If you run in Standalone mode, please copy the MLSQL jar to every node and then configure:

```
--conf "spark.executor.extraClassPath=[MLSQL jar path]"
```

## How to use

52 changes: 52 additions & 0 deletions docs/gitbook/en/guide/python/project.md
@@ -0,0 +1,52 @@
# Python project standard

MLSQL is minimally invasive to your Python project: add two description files and the
project becomes MLSQL-compatible.

Here is the structure of project:

```
examples/sklearn_elasticnet_wine/
├── MLproject
├── batchPredict.py
├── conda.yaml
├── predict.py
├── train.py
```

MLproject describes how to execute the project, and conda.yaml describes how to build the Python
environment.

MLproject contains:

```yaml
name: tutorial

conda_env: conda.yaml

entry_points:
main:
train:
command: "python train.py"
batch_predict:
command: "python batchPredict.py"
api_predict:
command: "python predict.py"
```
conda.yaml:

```yaml
name: tutorial
dependencies:
- python=3.6
- pip
- pip:
- --index-url https://mirrors.aliyun.com/pypi/simple/
- numpy==1.14.3
- kafka==1.3.5
- pyspark==2.3.2
- pandas==0.22.0
```
3 changes: 3 additions & 0 deletions docs/gitbook/en/guide/system_udf/README.md
@@ -0,0 +1,3 @@
# Built-in UDF List

MLSQL ships with many built-in UDFs.
40 changes: 40 additions & 0 deletions docs/gitbook/en/guide/system_udf/http-udf.md
@@ -0,0 +1,40 @@
# HTTP UDFs

Before you can use these functions, please add this line to your startup script:

```
-streaming.udf.clzznames streaming.crawler.udf.Functions
```

HTTP UDFs make MLSQL more powerful: you can invoke any external or internal API to help you achieve your goal.

For example:

```sql
select crawler_http("http://www.csdn.net","GET",map("k1","v1","k2","v2")) as c as output;
```

The second parameter supports:

* GET
* POST

MLSQL also supports downloading images:

```sql
select crawler_request_image("http://www.csdn.net","GET",map("k1","v1","k2","v2")) as c as output;
```

The resulting column `c` is of type array[byte].

We also provide UDFs that you can use to extract the title and body from HTML:

* crawler_auto_extract_body
* crawler_auto_extract_title
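For example, these two UDFs can be combined with `crawler_http` (a sketch reusing the URL from the examples above):

```sql
select crawler_auto_extract_title(crawler_http("http://www.csdn.net","GET",map("k1","v1"))) as title,
       crawler_auto_extract_body(crawler_http("http://www.csdn.net","GET",map("k1","v1"))) as body
as output;
```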

Or you can use XPath to extract whatever you need:

```sql
crawler_extract_xpath(html, xpath)
```