![](https://gitee.com/opensci/piflow/raw/master/doc/piflow-logo2.png)

PiFlow is an easy-to-use, powerful big data pipeline system.

## Contents

- [Features](#features)
- [Architecture](#architecture)
- [Requirements](#requirements)
- [Getting Started](#getting-started)
- [Docker Image](#docker-image)
- [Screenshots](#screenshots)
- [Contact Us](#contact-us)

## Features

- Easy to use
  - Configure pipelines visually
  - Monitor pipelines
  - View pipeline logs
  - Checkpoint support

- Highly extensible
  - Supports custom-developed data processing components

- High performance
  - Built on the distributed computing engine Spark

- Powerful
  - Provides 100+ data processing components
  - Covers Hadoop, Spark, MLlib, Hive, Solr, Redis, MemCache, ElasticSearch, JDBC, MongoDB, HTTP, FTP, XML, CSV, JSON, and more
  - Integrates algorithms from the microbiology domain

## Architecture

![](https://gitee.com/opensci/piflow/raw/master/doc/architecture.png)

## Requirements

* JDK 1.8
* Scala-2.11.8
* Apache Maven 3.1.0
* Spark-2.1.0 or later
* Hadoop-2.6.0

## Getting Started

### Build PiFlow:

- `install external package`

      mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/spark-xml_2.11-0.4.2.jar -DgroupId=com.databricks -DartifactId=spark-xml_2.11 -Dversion=0.4.2 -Dpackaging=jar
      mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/java_memcached-release_2.6.6.jar -DgroupId=com.memcached -DartifactId=java_memcached-release -Dversion=2.6.6 -Dpackaging=jar
      mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/ojdbc6-11.2.0.3.jar -DgroupId=oracle -DartifactId=ojdbc6 -Dversion=11.2.0.3 -Dpackaging=jar
      mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/edtftpj.jar -DgroupId=ftpClient -DartifactId=edtftp -Dversion=1.0.0 -Dpackaging=jar

- `mvn clean package -Dmaven.test.skip=true`

      [INFO] Replacing original artifact with shaded artifact.
      [INFO] Reactor Summary:
      [INFO]
      [INFO] piflow-project ..................................... SUCCESS [ 4.369 s]
      [INFO] piflow-core ........................................ SUCCESS [01:23 min]
      [INFO] piflow-configure ................................... SUCCESS [ 12.418 s]
      [INFO] piflow-bundle ...................................... SUCCESS [02:15 min]
      [INFO] piflow-server ...................................... SUCCESS [02:05 min]
      [INFO] ------------------------------------------------------------------------
      [INFO] BUILD SUCCESS
      [INFO] ------------------------------------------------------------------------
      [INFO] Total time: 06:01 min
      [INFO] Finished at: 2020-05-21T15:22:58+08:00
      [INFO] Final Memory: 118M/691M
      [INFO] ------------------------------------------------------------------------

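
The four `mvn install:install-file` commands in the build steps differ only in the jar file and its Maven coordinates. As a rough sketch (not part of PiFlow itself), they could be generated from a small table. The `LocalJar` type, the `install_command` helper, and the `lib_dir` argument are hypothetical names introduced here for illustration; the coordinates are copied from the commands in this README.

```python
import shlex
from typing import List, NamedTuple


class LocalJar(NamedTuple):
    """One jar from piflow-bundle/lib together with its Maven coordinates."""
    file_name: str
    group_id: str
    artifact_id: str
    version: str


# Coordinates copied from the four install commands in this README.
EXTERNAL_JARS = [
    LocalJar("spark-xml_2.11-0.4.2.jar", "com.databricks", "spark-xml_2.11", "0.4.2"),
    LocalJar("java_memcached-release_2.6.6.jar", "com.memcached", "java_memcached-release", "2.6.6"),
    LocalJar("ojdbc6-11.2.0.3.jar", "oracle", "ojdbc6", "11.2.0.3"),
    LocalJar("edtftpj.jar", "ftpClient", "edtftp", "1.0.0"),
]


def install_command(jar: LocalJar, lib_dir: str) -> List[str]:
    """Build the argument list for `mvn install:install-file` for one jar."""
    return [
        "mvn",
        "install:install-file",
        f"-Dfile={lib_dir}/{jar.file_name}",
        f"-DgroupId={jar.group_id}",
        f"-DartifactId={jar.artifact_id}",
        f"-Dversion={jar.version}",
        "-Dpackaging=jar",
    ]


if __name__ == "__main__":
    # The lib_dir value is the placeholder path used in this README;
    # replace it with the lib directory of your own checkout.
    for jar in EXTERNAL_JARS:
        print(shlex.join(install_command(jar, "/../piflow/piflow-bundle/lib")))
```

Running the script only prints the commands; passing each list to `subprocess.run` would execute them, assuming `mvn` is on the `PATH`.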
### Run PiFlow Server:

- `Run PiFlow Server in IntelliJ`:
  - Download PiFlow: git clone https://github.com/cas-bigdatalab/piflow.git
  - Import PiFlow into IntelliJ
  - Edit the configuration file config.properties
  - Build the PiFlow jar:
    - Run --> Edit Configurations --> Add New Configuration --> Maven
    - Name: package
    - Command line: clean package -Dmaven.test.skip=true -X
    - run 'package' (the PiFlow jar file will be built in ../piflow/piflow-server/target/piflow-server-0.9.jar)

  - Run HttpService:
    - Edit Configurations --> Add New Configuration --> Application
    - Name: HttpService
    - Main class: cn.piflow.api.Main
    - Environment Variable: SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.6 (change the path to your Spark home)
    - run 'HttpService'

  - Test HttpService:
    - run /../piflow/piflow-server/src/main/scala/cn/piflow/api/HTTPClientStartMockDataFlow.scala
    - change the PiFlow server IP and port to match your configuration

- `Run PiFlow from a release`:
  - Download the PiFlow release you need (the latest version is recommended):

        https://github.com/cas-bigdatalab/piflow/releases/download/v0.5/piflow.tar.gz
        https://github.com/cas-bigdatalab/piflow/releases/download/v0.6/piflow-server-v0.6.tar.gz
        https://github.com/cas-bigdatalab/piflow/releases/download/v0.7/piflow-server-v0.7.tar.gz

  - Extract piflow-server-v0.7.tar.gz:

        tar -zxvf piflow-server-v0.7.tar.gz

  - Edit the configuration file config.properties

  - Start, stop, restart, or check the status of PiFlow Server:

        start.sh, stop.sh, restart.sh, status.sh

  - Test PiFlow Server
    - Set the PIFLOW_HOME environment variable
      - vim /etc/profile

            export PIFLOW_HOME=/yourPiflowPath/bin
            export PATH=$PATH:$PIFLOW_HOME/bin

    - Run the following commands

          piflow flow start example/mockDataFlow.json
          piflow flow stop appID
          piflow flow info appID
          piflow flow log appID

          piflow flowGroup start example/mockDataGroup.json
          piflow flowGroup stop groupId
          piflow flowGroup info groupId

- `How to configure config.properties`

      #spark and yarn config
      spark.master=yarn
      spark.deploy.mode=cluster

      #hdfs default file system
      fs.defaultFS=hdfs://10.0.86.191:9000

      #yarn resourcemanager.hostname
      yarn.resourcemanager.hostname=10.0.86.191

      #if you want to use hive, set hive metastore uris
      #hive.metastore.uris=thrift://10.0.88.71:9083

      #show data in log, set 0 if you do not want to show data in logs
      data.show=10

      #server port
      server.port=8002

      #h2db port
      h2.port=50002

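
Before starting the server it can help to sanity-check an edited `config.properties`. The sketch below is one possible pre-flight check, not PiFlow's own loader: it parses only plain `key=value` lines and skips `#` comments, which covers the sample in this README but not the full Java properties syntax. The `parse_properties` and `missing_keys` helpers are hypothetical, and treating the keys in `EXPECTED_KEYS` as mandatory is an assumption based on the uncommented entries in the sample.

```python
from typing import Dict, Iterable, List


def parse_properties(text: str) -> Dict[str, str]:
    """Parse plain key=value lines; blank lines and # comments are skipped."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {raw!r}")
        props[key.strip()] = value.strip()
    return props


def missing_keys(props: Dict[str, str], expected: Iterable[str]) -> List[str]:
    """Return the expected keys that are absent from the parsed properties."""
    return [key for key in expected if key not in props]


# Assumption: the uncommented keys of the README sample are the ones to check.
EXPECTED_KEYS = (
    "spark.master",
    "spark.deploy.mode",
    "fs.defaultFS",
    "yarn.resourcemanager.hostname",
    "data.show",
    "server.port",
    "h2.port",
)

# Sample values copied from this README; the addresses are the README's examples.
SAMPLE = """\
#spark and yarn config
spark.master=yarn
spark.deploy.mode=cluster

#hdfs default file system
fs.defaultFS=hdfs://10.0.86.191:9000

#yarn resourcemanager.hostname
yarn.resourcemanager.hostname=10.0.86.191

#if you want to use hive, set hive metastore uris
#hive.metastore.uris=thrift://10.0.88.71:9083

#show data in log, set 0 if you do not want to show data in logs
data.show=10

#server port
server.port=8002

#h2db port
h2.port=50002
"""

props = parse_properties(SAMPLE)
print(missing_keys(props, EXPECTED_KEYS))
```

To check a real file, read it with `open(path, encoding="utf-8").read()` and pass the text to `parse_properties`.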
### To run PiFlow Web, see the link below; the PiFlow Server and PiFlow Web versions must match:

- https://github.com/cas-bigdatalab/piflow-web

### RESTful API:

- flow json (see the pipeline examples in the piflow-bin/example folder)

  <details>
    <summary>flow example</summary>
    <pre>
      <code>
        {
          "flow": {
            "name": "MockData",
            "executorMemory": "1g",
            "executorNumber": "1",
            "uuid": "8a80d63f720cdd2301723b7461d92600",
            "paths": [
              {
                "inport": "",
                "from": "MockData",
                "to": "ShowData",
                "outport": ""
              }
            ],
            "executorCores": "1",
            "driverMemory": "1g",
            "stops": [
              {
                "name": "MockData",
                "bundle": "cn.piflow.bundle.common.MockData",
                "uuid": "8a80d63f720cdd2301723b7461d92604",
                "properties": {
                  "schema": "title:String, author:String, age:Int",
                  "count": "10"
                },
                "customizedProperties": {}
              },
              {
                "name": "ShowData",
                "bundle": "cn.piflow.bundle.external.ShowData",
                "uuid": "8a80d63f720cdd2301723b7461d92602",
                "properties": {
                  "showNumber": "5"
                },
                "customizedProperties": {}
              }
            ]
          }
        }</code>
    </pre>
  </details>

- Using curl:
  - curl -0 -X POST http://10.0.86.191:8002/flow/start -H "Content-type: application/json" -d 'this is your flow json'

- Using the command line:
  - set PIFLOW_HOME by adding the two export lines below to /etc/profile

        vim /etc/profile
        export PIFLOW_HOME=/yourPiflowPath/piflow-bin
        export PATH=$PATH:$PIFLOW_HOME/bin

  - command example

        piflow flow start yourFlow.json
        piflow flow stop appID
        piflow flow info appID
        piflow flow log appID

        piflow flowGroup start yourFlowGroup.json
        piflow flowGroup stop groupId
        piflow flowGroup info groupId

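
The flow JSON and the curl call from this section can also be combined in a short script. The following is a minimal sketch using only the Python standard library. It assumes, based on the MockData example, that each path's `from` and `to` fields name an entry in `stops`; the `check_flow` and `build_start_request` helpers are hypothetical, and the server address is the example address from the curl command. The request is built but not sent, and unlike `curl -0` it would not force HTTP/1.0.

```python
import json
import urllib.request
from typing import Any, Dict, List


def check_flow(flow_doc: Dict[str, Any]) -> List[str]:
    """List path endpoints that do not match any stop name (assumed convention)."""
    flow = flow_doc["flow"]
    stop_names = {stop["name"] for stop in flow["stops"]}
    problems = []
    for path in flow["paths"]:
        for end in ("from", "to"):
            if path[end] not in stop_names:
                problems.append(f"path {end}={path[end]!r} has no matching stop")
    return problems


def build_start_request(server: str, flow_doc: Dict[str, Any]) -> urllib.request.Request:
    """Build, without sending, the POST to /flow/start shown in the curl example."""
    return urllib.request.Request(
        url=f"http://{server}/flow/start",
        data=json.dumps(flow_doc).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="POST",
    )


# The MockData example from this README, written as a Python dict.
EXAMPLE_FLOW = {
    "flow": {
        "name": "MockData",
        "executorMemory": "1g",
        "executorNumber": "1",
        "uuid": "8a80d63f720cdd2301723b7461d92600",
        "paths": [
            {"inport": "", "from": "MockData", "to": "ShowData", "outport": ""}
        ],
        "executorCores": "1",
        "driverMemory": "1g",
        "stops": [
            {
                "name": "MockData",
                "bundle": "cn.piflow.bundle.common.MockData",
                "uuid": "8a80d63f720cdd2301723b7461d92604",
                "properties": {
                    "schema": "title:String, author:String, age:Int",
                    "count": "10",
                },
                "customizedProperties": {},
            },
            {
                "name": "ShowData",
                "bundle": "cn.piflow.bundle.external.ShowData",
                "uuid": "8a80d63f720cdd2301723b7461d92602",
                "properties": {"showNumber": "5"},
                "customizedProperties": {},
            },
        ],
    }
}

problems = check_flow(EXAMPLE_FLOW)
# 10.0.86.191:8002 is the example address from the curl command; use your own.
request = build_start_request("10.0.86.191:8002", EXAMPLE_FLOW)
print(problems, request.get_method(), request.full_url)
```

Sending the request would be a call to `urllib.request.urlopen(request)`. The response format is not described in this README, so the sketch stops before that step.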
## Docker Image

- Pull the PiFlow Docker image for the version you need

      docker pull registry.cn-hangzhou.aliyuncs.com/cnic_piflow/piflow:v0.7.1
      docker pull registry.cn-hangzhou.aliyuncs.com/cnic_piflow/piflow:v0.6.1

- Show the Docker image information

      docker images

- Run a container from the image ID; all PiFlow services start automatically

      docker run --name piflow-v0.6 -it [imageID]

- Visit "containerip:6001/piflow-web"; startup can be slow, so you may need to wait a few minutes

- If an error occurs, all applications are installed under /opt, so you can start each service individually

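
Because the web UI may take a few minutes to come up after the container starts, polling can replace manual page refreshes. The sketch below shows one way to do that with the standard library. The `http_probe` and `wait_until_ready` helpers are hypothetical, and the default of 30 attempts spaced 10 seconds apart is only an assumed reading of "a few minutes".

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers without an HTTP or connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and 4xx/5xx responses all count as not ready.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe up to `attempts` times, sleeping between failed tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False


# Demonstration with a fake probe that succeeds on the third call,
# so that no network access is needed to try the helper.
fake_results = iter([False, False, True])
print(wait_until_ready(lambda: next(fake_results), attempts=5, delay_seconds=0.0))
```

For a real container the call would look like `wait_until_ready(lambda: http_probe("http://containerip:6001/piflow-web"))`, with `containerip` replaced by the address of the running container.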
## Screenshots

- `Login`:

  ![](https://gitee.com/opensci/piflow/raw/master/doc/piflow-login.png)

- `Pipeline list`:

  ![](https://gitee.com/opensci/piflow/raw/master/doc/piflow-flowlist.png)

- `Create a pipeline`:

  ![](https://gitee.com/opensci/piflow/raw/master/doc/piflow-createflow.png)

- `Configure a pipeline`:

  ![](https://gitee.com/opensci/piflow/raw/master/doc/piflow-flowconfig.png)

- `Run a pipeline`:

  ![](https://gitee.com/opensci/piflow/raw/master/doc/piflow-loadflow.png)

- `Monitor a pipeline`:

  ![](https://gitee.com/opensci/piflow/raw/master/doc/piflow-monitor.png)

- `Pipeline logs`:

  ![](https://gitee.com/opensci/piflow/raw/master/doc/piflow-log.png)

- `Pipeline group list`:

  ![](https://gitee.com/opensci/piflow/raw/master/doc/piflow-group-list.png)

- `Configure a pipeline group`:

  ![](https://gitee.com/opensci/piflow/raw/master/doc/piflow-configure-group.png)

- `Monitor a pipeline group`:

  ![](https://gitee.com/opensci/piflow/raw/master/doc/piflow-monitor-group.png)

- `Running pipeline list`:

  ![](https://gitee.com/opensci/piflow/raw/master/doc/piflow-processlist.png)

- `Pipeline template list`:

  ![](https://gitee.com/opensci/piflow/raw/master/doc/piflow-templatelist.png)

## Contact Us

- Name: Wu (吴老师)
- Mobile Phone: 18910263390
- WeChat: 18910263390
- Email: wzs@cnic.cn
- QQ Group: 1003489545

  ![](https://gitee.com/opensci/piflow/raw/master/doc/PiFlowUserGroup_QQ.jpeg)

- WeChat Group (valid for 7 days)

  ![](https://gitee.com/opensci/piflow/raw/master/doc/PiFlowUserGroup.png)