piflow/README.md

230 lines
6.9 KiB
Markdown
Raw Normal View History

2018-12-24 15:59:18 +08:00
![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/piflow.png)
2018-12-24 17:38:57 +08:00
is an easy to use, powerful big data pipeline system.
2018-05-03 18:15:05 +08:00
2018-12-24 15:59:18 +08:00
## Table of Contents
- [Features](#features)
2018-12-24 17:45:55 +08:00
- [Architecture](#architecture)
2018-12-24 15:59:18 +08:00
- [Requirements](#requirements)
- [Getting Started](#getting-started)
- [Getting Help](#getting-help)
- [Documentation](#documentation)
## Features
- Easy to use
- provide a WYSIWYG web interface to configure data flow
- monitor big data flow status
- check big data flow logs
- provide checkpoint
- Strong Scalability:
- Support for custom development data processing components
- Superior performance
- based on distributed computing engine Spark
- Powerful
- 100+ data processing components available
- include spark、mllib、hadoop、hive、hbase、solr、redis、memcache、elasticSearch、jdbc、mongodb、http、ftp、xml、csv、jsonetc.
2018-12-24 17:45:55 +08:00
## Architecture
![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/architecture.png)
2018-12-24 15:59:18 +08:00
## Requirements
* JDK 1.8 or newer
* Apache Maven 3.1.0 or newer
* Git Client (used during build process by 'bower' plugin)
2018-12-24 17:50:01 +08:00
* spark-2.1.0
* hadoop-2.6.0
2018-12-24 15:59:18 +08:00
## Getting Started
2018-12-24 17:37:20 +08:00
To Build:
`mvn clean package -Dmaven.test.skip=true`
[INFO] Replacing original artifact with shaded artifact.
[INFO] Replacing /opt/project/piflow/piflow-server/target/piflow-server-0.9.jar with /opt/project/piflow/piflow-server/target/piflow-server-0.9-shaded.jar
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] piflow-project ..................................... SUCCESS [ 4.602 s]
[INFO] piflow-core ........................................ SUCCESS [ 56.533 s]
[INFO] piflow-bundle ...................................... SUCCESS [02:15 min]
[INFO] piflow-server ...................................... SUCCESS [03:01 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 06:18 min
[INFO] Finished at: 2018-12-24T16:54:16+08:00
[INFO] Final Memory: 41M/812M
[INFO] ------------------------------------------------------------------------
To Run Piflow Server
- configure config.properties
#server ip and port
server.ip=10.0.86.191
server.port=8002
h2.port=50002
#spark and yarn config
spark.master=yarn
spark.deploy.mode=cluster
yarn.resourcemanager.hostname=10.0.86.191
yarn.resourcemanager.address=10.0.86.191:8032
yarn.access.namenode=hdfs://10.0.86.191:9000
yarn.stagingDir=hdfs://10.0.86.191:9000/tmp/
yarn.jars=hdfs://10.0.86.191:9000/user/spark/share/lib/*.jar
yarn.url=http://10.0.86.191:8088/ws/v1/cluster/apps/
#hive config
hive.metastore.uris=thrift://10.0.86.191:9083
#piflow jar path
piflow.bundle=/opt/piflowServer/piflow-server-0.9.jar
#checkpoint hdfs path
checkpoint.path=hdfs://10.0.86.89:9000/piflow/checkpoints/
- you can run piflow server on intellij
- main class is cn.piflow.api.Main
- remember to set SPARK_HOME
- you can run piflow server as follows:
- download piflowServer:***
- edit config.properties
- run start.sh
To Run Piflow Web
- todo
To Use
- command line
- flow config example
{
"flow":{
"name":"test",
"uuid":"1234",
"checkpoint":"Merge",
"stops":[
{
"uuid":"1111",
"name":"XmlParser",
"bundle":"cn.piflow.bundle.xml.XmlParser",
"properties":{
"xmlpath":"hdfs://10.0.86.89:9000/xjzhu/dblp.mini.xml",
"rowTag":"phdthesis"
}
},
{
"uuid":"2222",
"name":"SelectField",
"bundle":"cn.piflow.bundle.common.SelectField",
"properties":{
"schema":"title,author,pages"
}
},
{
"uuid":"3333",
"name":"PutHiveStreaming",
"bundle":"cn.piflow.bundle.hive.PutHiveStreaming",
"properties":{
"database":"sparktest",
"table":"dblp_phdthesis"
}
},
{
"uuid":"4444",
"name":"CsvParser",
"bundle":"cn.piflow.bundle.csv.CsvParser",
"properties":{
"csvPath":"hdfs://10.0.86.89:9000/xjzhu/phdthesis.csv",
"header":"false",
"delimiter":",",
"schema":"title,author,pages"
}
},
{
"uuid":"555",
"name":"Merge",
"bundle":"cn.piflow.bundle.common.Merge",
"properties":{
"inports":"data1,data2"
}
},
{
"uuid":"666",
"name":"Fork",
"bundle":"cn.piflow.bundle.common.Fork",
"properties":{
"outports":"out1,out2,out3"
}
},
{
"uuid":"777",
"name":"JsonSave",
"bundle":"cn.piflow.bundle.json.JsonSave",
"properties":{
"jsonSavePath":"hdfs://10.0.86.89:9000/xjzhu/phdthesis.json"
}
},
{
"uuid":"888",
"name":"CsvSave",
"bundle":"cn.piflow.bundle.csv.CsvSave",
"properties":{
"csvSavePath":"hdfs://10.0.86.89:9000/xjzhu/phdthesis_result.csv",
"header":"true",
"delimiter":","
}
}
],
"paths":[
{
"from":"XmlParser",
"outport":"",
"inport":"",
"to":"SelectField"
},
{
"from":"SelectField",
"outport":"",
"inport":"data1",
"to":"Merge"
},
{
"from":"CsvParser",
"outport":"",
"inport":"data2",
"to":"Merge"
},
{
"from":"Merge",
"outport":"",
"inport":"",
"to":"Fork"
},
{
"from":"Fork",
"outport":"out1",
"inport":"",
"to":"PutHiveStreaming"
},
{
"from":"Fork",
"outport":"out2",
"inport":"",
"to":"JsonSave"
},
{
"from":"Fork",
"outport":"out3",
"inport":"",
"to":"CsvSave"
}
]
}
}
- curl -X POST http://10.0.86.191:8002/flow/start -H "Content-type: application/json" -d 'this is your flow json'
- piflow web
![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/piflow_web.png)