![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/piflow-logo2.png)
[![GitHub releases](https://img.shields.io/github/release/cas-bigdatalab/piflow.svg)](https://github.com/cas-bigdatalab/piflow/releases)
[![GitHub stars](https://img.shields.io/github/stars/cas-bigdatalab/piflow.svg)](https://github.com/cas-bigdatalab/piflow/stargazers)
[![GitHub forks](https://img.shields.io/github/forks/cas-bigdatalab/piflow.svg)](https://github.com/cas-bigdatalab/piflow/network)
[![GitHub downloads](https://img.shields.io/github/downloads/cas-bigdatalab/piflow/total.svg)](https://github.com/cas-bigdatalab/piflow/releases)
[![GitHub issues](https://img.shields.io/github/issues/cas-bigdatalab/piflow.svg)](https://github.com/cas-bigdatalab/piflow/issues)
[![GitHub license](https://img.shields.io/github/license/cas-bigdatalab/piflow.svg)](https://github.com/cas-bigdatalab/piflow/blob/master/LICENSE)

πFlow is an easy-to-use, powerful big data pipeline system.

Try PiFlow v0.6 at: http://piflow.cstcloud.cn/piflow-web/

## Table of Contents
- [Features](#features)
- [Architecture](#architecture)
- [Requirements](#requirements)
- [Getting Started](#getting-started)
- [PiFlow Docker](#docker-started)
- [Use Interface](#use-interface)

## Features
- Easy to use
  - provides a WYSIWYG web interface for configuring data flows
  - monitors data flow status
  - shows the logs of data flows
  - provides checkpoints
- Strong scalability
  - supports custom development of data processing components
- Superior performance
  - based on the distributed computing engine Spark
- Powerful
  - 100+ data processing components available
  - includes Spark, MLlib, Hadoop, Hive, HBase, Solr, Redis, Memcache, Elasticsearch, JDBC, MongoDB, HTTP, FTP, XML, CSV, JSON, etc.

## Architecture
![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/architecture.png)
## Requirements
* JDK 1.8
* Scala 2.11.8
* Apache Maven 3.1.0 or newer
* Spark 2.1.0, 2.2.0, or 2.3.0
* Hadoop 2.6.0

## Getting Started
### To Build:
- `install external packages`

      mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/spark-xml_2.11-0.4.2.jar -DgroupId=com.databricks -DartifactId=spark-xml_2.11 -Dversion=0.4.2 -Dpackaging=jar
      mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/java_memcached-release_2.6.6.jar -DgroupId=com.memcached -DartifactId=java_memcached-release -Dversion=2.6.6 -Dpackaging=jar
      mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/ojdbc6-11.2.0.3.jar -DgroupId=oracle -DartifactId=ojdbc6 -Dversion=11.2.0.3 -Dpackaging=jar
      mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/edtftpj.jar -DgroupId=ftpClient -DartifactId=edtftp -Dversion=1.0.0 -Dpackaging=jar

- `mvn clean package -Dmaven.test.skip=true`

      [INFO] Replacing original artifact with shaded artifact.
      [INFO] Reactor Summary:
      [INFO]
      [INFO] piflow-project ..................................... SUCCESS [ 4.369 s]
      [INFO] piflow-core ........................................ SUCCESS [01:23 min]
      [INFO] piflow-configure ................................... SUCCESS [ 12.418 s]
      [INFO] piflow-bundle ...................................... SUCCESS [02:15 min]
      [INFO] piflow-server ...................................... SUCCESS [02:05 min]
      [INFO] ------------------------------------------------------------------------
      [INFO] BUILD SUCCESS
      [INFO] ------------------------------------------------------------------------
      [INFO] Total time: 06:01 min
      [INFO] Finished at: 2020-05-21T15:22:58+08:00
      [INFO] Final Memory: 118M/691M
      [INFO] ------------------------------------------------------------------------

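
The four `mvn install:install-file` commands above differ only in the jar file and its Maven coordinates, so they can be generated from a small table. The following is a minimal sketch, not part of PiFlow: the names `install_commands` and `install_all` and the `lib_dir` argument are hypothetical, and running the commands for real assumes `mvn` is on your `PATH`.

```python
import subprocess

# (jar file, groupId, artifactId, version), copied from the commands above.
EXTERNAL_JARS = [
    ("spark-xml_2.11-0.4.2.jar", "com.databricks", "spark-xml_2.11", "0.4.2"),
    ("java_memcached-release_2.6.6.jar", "com.memcached", "java_memcached-release", "2.6.6"),
    ("ojdbc6-11.2.0.3.jar", "oracle", "ojdbc6", "11.2.0.3"),
    ("edtftpj.jar", "ftpClient", "edtftp", "1.0.0"),
]


def install_commands(lib_dir):
    """Return one `mvn install:install-file` argument list per external jar."""
    commands = []
    for jar, group_id, artifact_id, version in EXTERNAL_JARS:
        commands.append([
            "mvn", "install:install-file",
            f"-Dfile={lib_dir}/{jar}",
            f"-DgroupId={group_id}",
            f"-DartifactId={artifact_id}",
            f"-Dversion={version}",
            "-Dpackaging=jar",
        ])
    return commands


def install_all(lib_dir, dry_run=True):
    """Print each command; execute it with Maven only when dry_run is False."""
    for cmd in install_commands(lib_dir):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # The lib_dir value is an assumption: point it at your own checkout.
    install_all("piflow/piflow-bundle/lib", dry_run=True)
```

With `dry_run=True` the script only prints the commands, which makes it easy to compare them with the list above before letting Maven run.
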
### Run PiFlow Server

- `run piflow server on IntelliJ`:
  - download piflow: `git clone https://github.com/cas-bigdatalab/piflow.git`
  - import piflow into IntelliJ
  - edit the config.properties file
  - build piflow to generate the piflow jar:
    - Edit Configurations --> Add New Configuration --> Maven
    - Name: package
    - Command line: clean package -Dmaven.test.skip=true -X
    - run 'package' (the piflow jar file will be built at ../piflow/piflow-server/target/piflow-server-0.9.jar)
  - run HttpService:
    - Edit Configurations --> Add New Configuration --> Application
    - Name: HttpService
    - Main class: cn.piflow.api.Main
    - Environment Variable: SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.6 (change the path to your Spark home)
    - run 'HttpService'
  - test HttpService:
    - run /../piflow/piflow-server/src/main/scala/cn/piflow/api/HTTPClientStartMockDataFlow.scala
    - change the piflow server IP and port to match your configuration

- `run piflow server by release version`:
  - download piflow.tar.gz:

        https://github.com/cas-bigdatalab/piflow/releases/download/v0.5/piflow.tar.gz
        https://github.com/cas-bigdatalab/piflow/releases/download/v0.6/piflow-server-v0.6.tar.gz
        https://github.com/cas-bigdatalab/piflow/releases/download/v0.7/piflow-server-v0.7.tar.gz

  - extract piflow.tar.gz:

        tar -zxvf piflow.tar.gz

  - edit config.properties
  - run start.sh, stop.sh, restart.sh, or status.sh
  - test piflow server
    - set PIFLOW_HOME
      - vim /etc/profile

            export PIFLOW_HOME=/yourPiflowPath/bin
            export PATH=$PATH:$PIFLOW_HOME/bin

    - command

          piflow flow start example/mockDataFlow.json
          piflow flow stop appID
          piflow flow info appID
          piflow flow log appID
          piflow flowGroup start example/mockDataGroup.json
          piflow flowGroup stop groupId
          piflow flowGroup info groupId

- `how to configure config.properties`

      #spark and yarn config
      spark.master=yarn
      spark.deploy.mode=cluster

      #hdfs default file system
      fs.defaultFS=hdfs://10.0.86.191:9000
      #yarn resourcemanager.hostname
      yarn.resourcemanager.hostname=10.0.86.191

      #if you want to use hive, set hive metastore uris
      #hive.metastore.uris=thrift://10.0.88.71:9083

      #show data in log, set 0 if you do not want to show data in logs
      data.show=10

      #server port
      server.port=8002

      #h2db port
      h2.port=50002

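
Before starting the server it can help to check that config.properties contains the keys shown above. The sketch below is one hypothetical way to do that in Python. It handles only the `key=value` and `#` comment syntax used in this README, not the full Java properties format, and the `REQUIRED_KEYS` list is an assumption for illustration; PiFlow itself may apply defaults for some of them.

```python
# Assumption: these keys are treated as mandatory only for this sketch.
REQUIRED_KEYS = [
    "spark.master",
    "fs.defaultFS",
    "yarn.resourcemanager.hostname",
    "server.port",
]


def parse_properties(text):
    """Parse simple key=value lines; blank lines and '#' comments are skipped."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '='
            props[key.strip()] = value.strip()
    return props


def missing_keys(props):
    """Return the required keys that are absent, in REQUIRED_KEYS order."""
    return [key for key in REQUIRED_KEYS if key not in props]


SAMPLE = """\
#spark and yarn config
spark.master=yarn
spark.deploy.mode=cluster
fs.defaultFS=hdfs://10.0.86.191:9000
yarn.resourcemanager.hostname=10.0.86.191
#hive.metastore.uris=thrift://10.0.88.71:9083
data.show=10
server.port=8002
h2.port=50002
"""

if __name__ == "__main__":
    props = parse_properties(SAMPLE)
    print("server port:", props["server.port"])
    print("missing keys:", missing_keys(props))
```

To check a real file, read it with `open(path, encoding="utf-8").read()` and pass the text to `parse_properties`.
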
### Run Piflow Web
- https://github.com/cas-bigdatalab/piflow-web
### RESTful API

- flow json
<details>
<summary>flow example</summary>
<pre>
<code>{
  "flow": {
    "name": "Example",
    "executorMemory": "1g",
    "executorNumber": "1",
    "uuid": "8a80d63f720cdd2301723a4e679e2457",
    "paths": [
      {
        "inport": "",
        "from": "XmlParser",
        "to": "SelectField",
        "outport": ""
      },
      {
        "inport": "",
        "from": "Fork",
        "to": "CsvSave",
        "outport": "out1"
      },
      {
        "inport": "data2",
        "from": "SelectField",
        "to": "Merge",
        "outport": ""
      },
      {
        "inport": "",
        "from": "Merge",
        "to": "Fork",
        "outport": ""
      },
      {
        "inport": "data1",
        "from": "CsvParser",
        "to": "Merge",
        "outport": ""
      },
      {
        "inport": "",
        "from": "Fork",
        "to": "JsonSave",
        "outport": "out3"
      },
      {
        "inport": "",
        "from": "Fork",
        "to": "PutHiveMode",
        "outport": "out2"
      }
    ],
    "executorCores": "1",
    "driverMemory": "1g",
    "stops": [
      {
        "name": "CsvSave",
        "bundle": "cn.piflow.bundle.csv.CsvSave",
        "uuid": "8a80d63f720cdd2301723a4e67a52467",
        "properties": {
          "csvSavePath": "hdfs://master:9000/xjzhu/phdthesis_result.csv",
          "partition": "",
          "header": "false",
          "saveMode": "append",
          "delimiter": ","
        },
        "customizedProperties": {}
      },
      {
        "name": "PutHiveMode",
        "bundle": "cn.piflow.bundle.hive.PutHiveMode",
        "uuid": "8a80d63f720cdd2301723a4e67a22461",
        "properties": {
          "database": "sparktest",
          "saveMode": "append",
          "table": "dblp_phdthesis"
        },
        "customizedProperties": {}
      },
      {
        "name": "CsvParser",
        "bundle": "cn.piflow.bundle.csv.CsvParser",
        "uuid": "8a80d63f720cdd2301723a4e67a82470",
        "properties": {
          "schema": "title,author,pages",
          "csvPath": "hdfs://master:9000/xjzhu/phdthesis.csv",
          "delimiter": ",",
          "header": "false"
        },
        "customizedProperties": {}
      },
      {
        "name": "JsonSave",
        "bundle": "cn.piflow.bundle.json.JsonSave",
        "uuid": "8a80d63f720cdd2301723a4e67a1245f",
        "properties": {
          "jsonSavePath": "hdfs://10.0.86.191:9000/xjzhu/phdthesis.json"
        },
        "customizedProperties": {}
      },
      {
        "name": "XmlParser",
        "bundle": "cn.piflow.bundle.xml.XmlParser",
        "uuid": "8a80d63f720cdd2301723a4e67a7246d",
        "properties": {
          "rowTag": "phdthesis",
          "xmlpath": "hdfs://master:9000/xjzhu/dblp.mini.xml"
        },
        "customizedProperties": {}
      },
      {
        "name": "SelectField",
        "bundle": "cn.piflow.bundle.common.SelectField",
        "uuid": "8a80d63f720cdd2301723a4e67aa2477",
        "properties": {
          "columnNames": "title,author,pages"
        },
        "customizedProperties": {}
      },
      {
        "name": "Merge",
        "bundle": "cn.piflow.bundle.common.Merge",
        "uuid": "8a80d63f720cdd2301723a4e67a92475",
        "properties": {
          "inports": "data1,data2"
        },
        "customizedProperties": {}
      },
      {
        "name": "Fork",
        "bundle": "cn.piflow.bundle.common.Fork",
        "uuid": "8a80d63f720cdd2301723a4e67a42465",
        "properties": {
          "outports": "out1,out3,out2"
        },
        "customizedProperties": {}
      }
    ]
  }
}
</code>
</pre>
</details>
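
In the flow example above, every entry in `paths` refers to stops by `name`, and the `Fork` and `Merge` stops list their port names in `outports` and `inports`. A flow JSON written by hand can easily get these out of sync, so a quick consistency check before submitting may save a failed run. The sketch below is a hypothetical helper, not part of PiFlow; the rule that a stop declaring `outports`/`inports` only accepts those port names is an assumption inferred from the example, and the tiny flow used for the demonstration is trimmed (no `bundle` or `uuid` fields).

```python
def _split(csv_value):
    """Split a comma-separated property such as "out1,out2" into a set."""
    return {item.strip() for item in (csv_value or "").split(",") if item.strip()}


def check_flow(flow_json):
    """Return a list of problems found in a flow definition (empty if none)."""
    flow = flow_json["flow"]
    stops = {stop["name"]: stop.get("properties", {}) for stop in flow["stops"]}
    problems = []
    for path in flow["paths"]:
        src, dst = path["from"], path["to"]
        label = f"{src} -> {dst}"
        # Both ends of a path must name an existing stop.
        for name in (src, dst):
            if name not in stops:
                problems.append(f"{label}: unknown stop '{name}'")
        # Assumption: declared port names restrict what paths may use.
        declared_out = _split(stops.get(src, {}).get("outports"))
        if declared_out and path.get("outport") not in declared_out:
            problems.append(f"{label}: outport '{path.get('outport')}' not declared by {src}")
        declared_in = _split(stops.get(dst, {}).get("inports"))
        if declared_in and path.get("inport") not in declared_in:
            problems.append(f"{label}: inport '{path.get('inport')}' not declared by {dst}")
    return problems


# Trimmed, hypothetical flow with two deliberate mistakes in the last path.
EXAMPLE = {
    "flow": {
        "name": "Tiny",
        "paths": [
            {"from": "CsvParser", "outport": "", "inport": "", "to": "Fork"},
            {"from": "Fork", "outport": "out1", "inport": "", "to": "CsvSave"},
            {"from": "Fork", "outport": "out9", "inport": "", "to": "JsonSave"},
        ],
        "stops": [
            {"name": "CsvParser", "properties": {}},
            {"name": "Fork", "properties": {"outports": "out1,out2"}},
            {"name": "CsvSave", "properties": {}},
        ],
    }
}

if __name__ == "__main__":
    for problem in check_flow(EXAMPLE):
        print(problem)
```

For a real flow, load the file with `json.load` and pass the resulting dictionary to `check_flow`.
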
- CURL POST
  - `curl -0 -X POST http://10.0.86.191:8002/flow/start -H "Content-type: application/json" -d 'this is your flow json'`

- Command line
  - set PIFLOW_HOME

        vim /etc/profile
        export PIFLOW_HOME=/yourPiflowPath/piflow-bin
        export PATH=$PATH:$PIFLOW_HOME/bin

  - command example

        piflow flow start yourFlow.json
        piflow flow stop appID
        piflow flow info appID
        piflow flow log appID

        piflow flowGroup start yourFlowGroup.json
        piflow flowGroup stop groupId
        piflow flowGroup info groupId

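
The `curl` request above can also be sent from a script, which avoids pasting a large flow JSON onto the command line. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, the server address is the one from the `curl` example, and because the response format is not described here, the sketch simply returns the raw response text.

```python
import json
import urllib.request


def build_start_request(server, flow):
    """Build (but do not send) the POST request that starts a flow."""
    body = json.dumps(flow).encode("utf-8")
    return urllib.request.Request(
        url=f"{server}/flow/start",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_flow(server, flow, timeout=30):
    """Send the request and return the raw response text."""
    request = build_start_request(server, flow)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage, assuming yourFlow.json holds a flow definition like the example:
#
#     with open("yourFlow.json", encoding="utf-8") as handle:
#         flow = json.load(handle)
#     print(start_flow("http://10.0.86.191:8002", flow))
```

Keeping request construction separate from sending makes it possible to inspect the URL, headers, and body without a running server.
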
## docker-started
- pull the piflow image

      docker pull registry.cn-hangzhou.aliyuncs.com/cnic_piflow/piflow:v0.6.1

- show docker images

      docker images

- run a container from the piflow image ID; all services start automatically

      docker run --name piflow-v0.6 -it [imageID]

- visit "containerip:6001/piflow-web"; it may take a while for the services to come up
- if something goes wrong, all the applications are in the /opt folder

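
Since the web interface may take a while to become available after the container starts, a small polling script can tell you when it is ready. This is a sketch under assumptions: the `http://` scheme, the retry counts, and the function names are illustrative, and the container IP has to be supplied by you (for example, from `docker inspect`).

```python
import time
import urllib.request


def web_url(container_ip, port=6001):
    """Build the piflow-web address from the container IP (http is assumed)."""
    return f"http://{container_ip}:{port}/piflow-web"


def wait_until_up(url, attempts=30, delay=10.0):
    """Poll url until it answers successfully; return False after giving up."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True
        except OSError:
            # Covers refused connections, timeouts, and HTTP error statuses.
            time.sleep(delay)
    return False


# Usage, with a placeholder IP address:
#
#     url = web_url("172.17.0.2")
#     print(url, "is up" if wait_until_up(url) else "did not respond")
```
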
## use-interface
- `Login`:
![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/piflow-login.png)
- `Flow list`:
![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/piflow-flowlist.png)
- `Create flow`:
![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/piflow-createflow.png)
- `Configure flow`:
![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/piflow-flowconfig.png)
- `Load flow`:
![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/piflow-loadflow.png)
- `Monitor flow`:
![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/piflow-monitor.png)
- `Flow logs`:
![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/piflow-log.png)
- `Group list`:
![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/piflow-group-list.png)
- `Configure group`:
![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/piflow-configure-group.png)
- `Monitor group`:
![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/piflow-monitor-group.png)
- `Process List`:
![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/piflow-processlist.png)
- `Template List`:
![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/piflow-templatelist.png)
- `Save Template`:
![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/piflow-savetemplate.png)

Welcome to the PiFlow User Group! Contact us:
- Name: Teacher Wu (吴老师)
- Mobile Phone: 18910263390
- WeChat: 18910263390
- Email: wzs@cnic.cn
- QQ Group: 1003489545

![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/PiFlowUserGroup_QQ.jpeg)