![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-logo2.png)
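PiFlow is an easy-to-use, powerful big data pipeline system.
## Table of Contents
- [Features](#features)
- [Architecture](#architecture)
- [Requirements](#requirements)
- [Getting Started](#getting-started)
## Features
- Easy to use
  - Configure pipelines visually
  - Monitor pipelines
  - View pipeline logs
  - Checkpoints
- Highly extensible:
  - Supports custom-developed data processing components
- High performance:
  - Built on the distributed computing engine Spark
- Powerful:
  - Provides 100+ data processing components
  - Covers Hadoop, Spark, MLlib, Hive, Solr, Redis, MemCache, ElasticSearch, JDBC, MongoDB, HTTP, FTP, XML, CSV, JSON, and more
  - Integrates algorithms from the microbiology domain
## Architecture
![](https://gitee.com/opensci/piflow/blob/master/doc/architecture.png)
## Requirements
* JDK 1.8 or later
* Apache Maven 3.1.0 or later
* Git client
* Spark 2.1.0 or later
* Hadoop 2.6.0 or later
## Getting Started
### How to build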
`mvn clean package -Dmaven.test.skip=true`

    [INFO] Replacing original artifact with shaded artifact.
    [INFO] Replacing /opt/project/piflow/piflow-server/target/piflow-server-0.9.jar with /opt/project/piflow/piflow-server/target/piflow-server-0.9-shaded.jar
    [INFO] ------------------------------------------------------------------------
    [INFO] Reactor Summary:
    [INFO]
    [INFO] piflow-project ..................................... SUCCESS [ 4.602 s]
    [INFO] piflow-core ........................................ SUCCESS [ 56.533 s]
    [INFO] piflow-bundle ...................................... SUCCESS [02:15 min]
    [INFO] piflow-server ...................................... SUCCESS [03:01 min]
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 06:18 min
    [INFO] Finished at: 2018-12-24T16:54:16+08:00
    [INFO] Final Memory: 41M/812M
    [INFO] ------------------------------------------------------------------------

### How to run PiFlow Server
- `Using IntelliJ IDEA`:
  - Edit the config.properties file
  - Build the piflow project to generate piflow-server.jar
  - Run cn.piflow.api.Main
  - Remember to set SPARK_HOME
- `Run the release version directly`:
  - Download a release from https://github.com/cas-bigdatalab/piflow/releases
  - Copy the piflow-server.jar you built into the piflow_release folder (Git cannot upload files larger than 1 GB, so piflow-server.jar has to be built locally)
  - Edit the config.properties file
  - Run start.sh, or run it in the background: `nohup ./start.sh > piflow.log 2>&1 &`
- `How to configure config.properties`:

      #server ip and port
      server.ip=10.0.86.191
      server.port=8002
      h2.port=50002
      #spark and yarn config
      spark.master=yarn
      spark.deploy.mode=cluster
      yarn.resourcemanager.hostname=10.0.86.191
      yarn.resourcemanager.address=10.0.86.191:8032
      yarn.access.namenode=hdfs://10.0.86.191:9000
      yarn.stagingDir=hdfs://10.0.86.191:9000/tmp/
      yarn.jars=hdfs://10.0.86.191:9000/user/spark/share/lib/*.jar
      yarn.url=http://10.0.86.191:8088/ws/v1/cluster/apps/
      #hive config
      hive.metastore.uris=thrift://10.0.86.191:9083
      #piflow-server.jar path
      piflow.bundle=/opt/piflowServer/piflow-server-0.9.jar
      #checkpoint hdfs path
      checkpoint.path=hdfs://10.0.86.89:9000/piflow/checkpoints/
      #debug path
      debug.path=hdfs://10.0.88.191:9000/piflow/debug/
      #the count of data shown in log
      data.show=10

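A config.properties file like the one above is easy to get subtly wrong, for example by defining a key twice or by pointing HDFS paths at different hosts without meaning to. The following is a minimal Python sketch of a sanity check, not part of PiFlow itself: the helper names (`parse_properties`, `duplicate_keys`, `hdfs_hosts`) are hypothetical, the parser only understands plain `key=value` lines and `#` comments, and `sample` is an illustrative excerpt with a deliberately repeated key.

```python
from collections import Counter
from urllib.parse import urlparse


def parse_properties(text):
    """Return (key, value) pairs from simple key=value properties text."""
    pairs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep:
            pairs.append((key.strip(), value.strip()))
    return pairs


def duplicate_keys(pairs):
    """Keys that are defined more than once, in first-seen order."""
    counts = Counter(key for key, _ in pairs)
    return [key for key in counts if counts[key] > 1]


def hdfs_hosts(pairs):
    """Map each hdfs:// host to the keys whose value points at it."""
    hosts = {}
    for key, value in pairs:
        if value.startswith("hdfs://"):
            hosts.setdefault(urlparse(value).hostname, []).append(key)
    return hosts


# Illustrative excerpt (not a complete config); h2.port is repeated on purpose.
sample = """\
#server ip and port
server.ip=10.0.86.191
h2.port=50002
yarn.stagingDir=hdfs://10.0.86.191:9000/tmp/
checkpoint.path=hdfs://10.0.86.89:9000/piflow/checkpoints/
h2.port=50002
"""

pairs = parse_properties(sample)
print("duplicate keys:", duplicate_keys(pairs))
for host, keys in sorted(hdfs_hosts(pairs).items()):
    print(host, "<-", ", ".join(keys))
```

More than one HDFS host in the output is not necessarily an error, but it is worth confirming that each address is intended.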
### How to run PiFlow Web
- https://github.com/cas-bigdatalab/piflow-web
### How to use
- Command line
  - Sample pipeline configuration:

        {
          "flow":{
            "name":"test",
            "uuid":"1234",
            "checkpoint":"Merge",
            "stops":[
              {
                "uuid":"1111",
                "name":"XmlParser",
                "bundle":"cn.piflow.bundle.xml.XmlParser",
                "properties":{
                  "xmlpath":"hdfs://10.0.86.89:9000/xjzhu/dblp.mini.xml",
                  "rowTag":"phdthesis"
                }
              },
              {
                "uuid":"2222",
                "name":"SelectField",
                "bundle":"cn.piflow.bundle.common.SelectField",
                "properties":{
                  "schema":"title,author,pages"
                }
              },
              {
                "uuid":"3333",
                "name":"PutHiveStreaming",
                "bundle":"cn.piflow.bundle.hive.PutHiveStreaming",
                "properties":{
                  "database":"sparktest",
                  "table":"dblp_phdthesis"
                }
              },
              {
                "uuid":"4444",
                "name":"CsvParser",
                "bundle":"cn.piflow.bundle.csv.CsvParser",
                "properties":{
                  "csvPath":"hdfs://10.0.86.89:9000/xjzhu/phdthesis.csv",
                  "header":"false",
                  "delimiter":",",
                  "schema":"title,author,pages"
                }
              },
              {
                "uuid":"555",
                "name":"Merge",
                "bundle":"cn.piflow.bundle.common.Merge",
                "properties":{
                  "inports":"data1,data2"
                }
              },
              {
                "uuid":"666",
                "name":"Fork",
                "bundle":"cn.piflow.bundle.common.Fork",
                "properties":{
                  "outports":"out1,out2,out3"
                }
              },
              {
                "uuid":"777",
                "name":"JsonSave",
                "bundle":"cn.piflow.bundle.json.JsonSave",
                "properties":{
                  "jsonSavePath":"hdfs://10.0.86.89:9000/xjzhu/phdthesis.json"
                }
              },
              {
                "uuid":"888",
                "name":"CsvSave",
                "bundle":"cn.piflow.bundle.csv.CsvSave",
                "properties":{
                  "csvSavePath":"hdfs://10.0.86.89:9000/xjzhu/phdthesis_result.csv",
                  "header":"true",
                  "delimiter":","
                }
              }
            ],
            "paths":[
              {
                "from":"XmlParser",
                "outport":"",
                "inport":"",
                "to":"SelectField"
              },
              {
                "from":"SelectField",
                "outport":"",
                "inport":"data1",
                "to":"Merge"
              },
              {
                "from":"CsvParser",
                "outport":"",
                "inport":"data2",
                "to":"Merge"
              },
              {
                "from":"Merge",
                "outport":"",
                "inport":"",
                "to":"Fork"
              },
              {
                "from":"Fork",
                "outport":"out1",
                "inport":"",
                "to":"PutHiveStreaming"
              },
              {
                "from":"Fork",
                "outport":"out2",
                "inport":"",
                "to":"JsonSave"
              },
              {
                "from":"Fork",
                "outport":"out3",
                "inport":"",
                "to":"CsvSave"
              }
            ]
          }
        }

  - Run command
    - `curl -0 -X POST http://serverIP:serverPort/flow/start -H "Content-type: application/json" -d 'your pipeline JSON configuration'`
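Before posting a flow with the curl command above, it can help to check that every path refers to a stop that is actually declared. The following is a minimal Python sketch under stated assumptions, not PiFlow's own validation: `check_flow` and `build_start_request` are hypothetical helper names, the `checkpoint` field is assumed to hold a comma-separated list of stop names, and `127.0.0.1:8002` is a placeholder for `serverIP:serverPort`. The request is only built, not sent, and unlike `curl -0` it would not force HTTP/1.0.

```python
import json
import urllib.request


def check_flow(doc):
    """Return a list of problems found in a flow document (empty if none)."""
    flow = doc["flow"]
    names = {stop["name"] for stop in flow["stops"]}
    problems = []
    # Every path endpoint should name a declared stop.
    for path in flow["paths"]:
        for end in ("from", "to"):
            if path[end] not in names:
                problems.append(f"path {end} '{path[end]}' is not a declared stop")
    # Assumption: "checkpoint" is a comma-separated list of stop names.
    for name in flow.get("checkpoint", "").split(","):
        name = name.strip()
        if name and name not in names:
            problems.append(f"checkpoint '{name}' is not a declared stop")
    return problems


def build_start_request(server, doc):
    """Build (but do not send) the POST request to /flow/start."""
    return urllib.request.Request(
        f"http://{server}/flow/start",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="POST",
    )


# A shortened flow for illustration; the second path points at "Merge",
# which is deliberately missing from "stops".
flow_doc = {
    "flow": {
        "name": "test",
        "uuid": "1234",
        "checkpoint": "SelectField",
        "stops": [
            {"uuid": "1111", "name": "XmlParser",
             "bundle": "cn.piflow.bundle.xml.XmlParser",
             "properties": {"rowTag": "phdthesis"}},
            {"uuid": "2222", "name": "SelectField",
             "bundle": "cn.piflow.bundle.common.SelectField",
             "properties": {"schema": "title,author,pages"}},
        ],
        "paths": [
            {"from": "XmlParser", "outport": "", "inport": "", "to": "SelectField"},
            {"from": "SelectField", "outport": "", "inport": "data1", "to": "Merge"},
        ],
    }
}

problems = check_flow(flow_doc)
for problem in problems:
    print("problem:", problem)

request = build_start_request("127.0.0.1:8002", flow_doc)  # placeholder address
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request)  # uncomment to submit to a running server
```

Building the request separately from sending it keeps the check usable offline; submission only happens once the commented `urlopen` line is enabled against a running PiFlow Server.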
- Access PiFlow Web: trial address "http://piflow.ml/piflow-web", user/password: admin/admin
  - Login
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-login.png)
  - Pipeline list
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-flowlist.png)
  - Pipeline configuration
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-congfigflow.png)
  - Pipeline resource configuration
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-resource.png)
  - Run a pipeline
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-run.png)
  - Delete a pipeline
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-delete.png)
  - Save a pipeline as a template
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-savetemplate.png)
  - Create a pipeline: after clicking the Create button, enter the pipeline name and description; the resources the pipeline needs can be set at the same time.
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-createflows.png)
  - Configure a pipeline: pipelines are configured by drag and drop, similar to Visio.
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-flowconfig.png)
  - Search pipeline components: the left sidebar of the canvas lists component groups and components, which can be searched by keyword. Once a component is chosen, drag it onto the canvas.
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-module.png)
  - Pipeline basic information: the right side of the canvas shows the pipeline's basic information, including its name and description.
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-message.png)
  - Pipeline configuration: select any data processing component on the canvas, and the right side shows its basic information, including name, description, and author. Select the AttributeInfo tab to see the component's properties, which can be configured as needed.
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-randomconfig.png)
  - Run a pipeline: once the pipeline is configured, click the Run button to run it.
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-runpiflow.png)
  - Pipeline monitoring: the monitoring page shows the execution of the whole pipeline, including run state, progress, and execution time. Click a specific data processing component to see its own run state and execution time.
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-control.png)
  - View pipeline logs
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-log.png)
  - Process list: pipelines that have been run appear in the Process List with their start time, end time, progress, and state. From there they can be viewed, rerun, stopped, or deleted.
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-processlist.png)
  - Run a pipeline from a checkpoint
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-runcheckpoint.png)
  - Create and save a template
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-savetemplates.png)
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-savetemplatess.png)
  - Template list
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-templatelist.png)
  - Download a template: the template is saved locally as an XML file.
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-download.png)
  - Upload a template
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-upload.png)
  - Load a template
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-load1.png)
    ![](https://gitee.com/opensci/piflow/blob/master/doc/piflow-load2.png)