piflow/README.md

![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/piflow.png) 
is an easy to use, powerful big data pipeline system.

## Table of Contents

- [Features](#features)
- [Architecture](#architecture)
- [Requirements](#requirements)
- [Getting Started](#getting-started)
- [Getting Help](#getting-help)
- [Documentation](#documentation)

## Features

- Easy to use
  - provide a WYSIWYG web interface to configure data flow
  - monitor big data flow status
  - check big data flow logs
  - provide checkpoint
- Strong Scalability:
  - Support for custom development data processing components
- Superior performance
  - based on distributed computing engine Spark 
- Powerful
  - 100+ data processing components available
  - include spark、mllib、hadoop、hive、hbase、solr、redis、memcache、elasticSearch、jdbc、mongodb、http、ftp、xml、csv、json，etc.

## Architecture
![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/architecture.png) 
## Requirements
* JDK 1.8 or newer
* Apache Maven 3.1.0 or newer
* Git Client (used during build process by 'bower' plugin)
* spark-2.1.0
* hadoop-2.6.0

## Getting Started

To Build: 
`mvn clean package -Dmaven.test.skip=true`

          [INFO] Replacing original artifact with shaded artifact.
          [INFO] Replacing /opt/project/piflow/piflow-server/target/piflow-server-0.9.jar with /opt/project/piflow/piflow-server/target/piflow-server-0.9-shaded.jar
          [INFO] ------------------------------------------------------------------------
          [INFO] Reactor Summary:
          [INFO] 
          [INFO] piflow-project ..................................... SUCCESS [  4.602 s]
          [INFO] piflow-core ........................................ SUCCESS [ 56.533 s]
          [INFO] piflow-bundle ...................................... SUCCESS [02:15 min]
          [INFO] piflow-server ...................................... SUCCESS [03:01 min]
          [INFO] ------------------------------------------------------------------------
          [INFO] BUILD SUCCESS
          [INFO] ------------------------------------------------------------------------
          [INFO] Total time: 06:18 min
          [INFO] Finished at: 2018-12-24T16:54:16+08:00
          [INFO] Final Memory: 41M/812M
          [INFO] ------------------------------------------------------------------------

To Run Piflow Server：
- configure config.properties

      #server ip and port
      server.ip=10.0.86.191
      server.port=8002
      h2.port=50002
      
      #spark and yarn config
      spark.master=yarn
      spark.deploy.mode=cluster
      yarn.resourcemanager.hostname=10.0.86.191
      yarn.resourcemanager.address=10.0.86.191:8032
      yarn.access.namenode=hdfs://10.0.86.191:9000
      yarn.stagingDir=hdfs://10.0.86.191:9000/tmp/
      yarn.jars=hdfs://10.0.86.191:9000/user/spark/share/lib/*.jar
      yarn.url=http://10.0.86.191:8088/ws/v1/cluster/apps/

      #hive config
      hive.metastore.uris=thrift://10.0.86.191:9083

      #piflow jar path
      piflow.bundle=/opt/piflowServer/piflow-server-0.9.jar

      #checkpoint hdfs path
      checkpoint.path=hdfs://10.0.86.89:9000/piflow/checkpoints/
- you can run piflow server on intellij 
  - main class is cn.piflow.api.Main
  - remember to set SPARK_HOME
- you can run piflow server as follows:
  - download piflowServer:***
  - edit config.properties
  - run start.sh
  
To Run Piflow Web：
  - todo
  
To Use：

- command line
  - flow config example
  

        {
          "flow":{
          "name":"test",
          "uuid":"1234",
          "checkpoint":"Merge",
          "stops":[
          {
            "uuid":"1111",
            "name":"XmlParser",
            "bundle":"cn.piflow.bundle.xml.XmlParser",
            "properties":{
                "xmlpath":"hdfs://10.0.86.89:9000/xjzhu/dblp.mini.xml",
                "rowTag":"phdthesis"
            }
          },
          {
            "uuid":"2222",
            "name":"SelectField",
            "bundle":"cn.piflow.bundle.common.SelectField",
            "properties":{
                "schema":"title,author,pages"
            }

          },
          {
            "uuid":"3333",
            "name":"PutHiveStreaming",
            "bundle":"cn.piflow.bundle.hive.PutHiveStreaming",
            "properties":{
                "database":"sparktest",
                "table":"dblp_phdthesis"
            }
          },
          {
            "uuid":"4444",
            "name":"CsvParser",
            "bundle":"cn.piflow.bundle.csv.CsvParser",
            "properties":{
                "csvPath":"hdfs://10.0.86.89:9000/xjzhu/phdthesis.csv",
                "header":"false",
                "delimiter":",",
                "schema":"title,author,pages"
            }
          },
          {
            "uuid":"555",
            "name":"Merge",
            "bundle":"cn.piflow.bundle.common.Merge",
            "properties":{
              "inports":"data1,data2"
            }
          },
          {
            "uuid":"666",
            "name":"Fork",
            "bundle":"cn.piflow.bundle.common.Fork",
            "properties":{
              "outports":"out1,out2,out3"
            }
          },
          {
            "uuid":"777",
            "name":"JsonSave",
            "bundle":"cn.piflow.bundle.json.JsonSave",
            "properties":{
              "jsonSavePath":"hdfs://10.0.86.89:9000/xjzhu/phdthesis.json"
            }
          },
          {
            "uuid":"888",
            "name":"CsvSave",
            "bundle":"cn.piflow.bundle.csv.CsvSave",
            "properties":{
              "csvSavePath":"hdfs://10.0.86.89:9000/xjzhu/phdthesis_result.csv",
              "header":"true",
              "delimiter":","
            }
          }
        ],
        "paths":[
          {
            "from":"XmlParser",
            "outport":"",
            "inport":"",
            "to":"SelectField"
          },
          {
            "from":"SelectField",
            "outport":"",
            "inport":"data1",
            "to":"Merge"
          },
          {
            "from":"CsvParser",
            "outport":"",
            "inport":"data2",
            "to":"Merge"
          },
          {
            "from":"Merge",
            "outport":"",
            "inport":"",
            "to":"Fork"
          },
          {
            "from":"Fork",
            "outport":"out1",
            "inport":"",
            "to":"PutHiveStreaming"
          },
          {
            "from":"Fork",
            "outport":"out2",
            "inport":"",
            "to":"JsonSave"
          },
          {
            "from":"Fork",
            "outport":"out3",
            "inport":"",
            "to":"CsvSave"
          }
        ]
      }
    }
  - curl -X POST http://10.0.86.191:8002/flow/start -H "Content-type: application/json" -d 'this is your flow json'
- piflow web
  ![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/piflow_web.png)
-												Update README.md
											
										
										
											2018-12-24 15:59:18 +08:00
+								![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/piflow.png)
-												Update README.md
											
										
										
											2018-12-24 17:38:57 +08:00
+								is an easy to use, powerful big data pipeline system.
-												init

											
										
										
											2018-05-03 18:15:05 +08:00
-												Update README.md
											
										
										
											2018-12-24 15:59:18 +08:00
+								## Table of Contents
 								- [Features](#features)
-												Update README.md
											
										
										
											2018-12-24 17:45:55 +08:00
+								- [Architecture](#architecture)
-												Update README.md
											
										
										
											2018-12-24 15:59:18 +08:00
+								- [Requirements](#requirements)
 								- [Getting Started](#getting-started)
 								- [Getting Help](#getting-help)
 								- [Documentation](#documentation)
 								## Features
 								- Easy to use
 								  - provide a WYSIWYG web interface to configure data flow
 								  - monitor big data flow status
 								  - check big data flow logs
 								  - provide checkpoint
 								- Strong Scalability:
 								  - Support for custom development data processing components
 								- Superior performance
 								  - based on distributed computing engine Spark
 								- Powerful
 								  - 100+ data processing components available
 								  - include spark、mllib、hadoop、hive、hbase、solr、redis、memcache、elasticSearch、jdbc、mongodb、http、ftp、xml、csv、json，etc.
-												Update README.md
											
										
										
											2018-12-24 17:45:55 +08:00
 								## Architecture
 								![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/architecture.png)
-												Update README.md
											
										
										
											2018-12-24 15:59:18 +08:00
+								## Requirements
 								* JDK 1.8 or newer
 								* Apache Maven 3.1.0 or newer
 								* Git Client (used during build process by 'bower' plugin)
-												Update README.md
											
										
										
											2018-12-24 17:50:01 +08:00
+								* spark-2.1.0
 								* hadoop-2.6.0
-												Update README.md
											
										
										
											2018-12-24 15:59:18 +08:00
 								## Getting Started
-												Update README.md
											
										
										
											2018-12-24 17:37:20 +08:00
 								To Build:
 								`mvn clean package -Dmaven.test.skip=true`
 								          [INFO] Replacing original artifact with shaded artifact.
 								          [INFO] Replacing /opt/project/piflow/piflow-server/target/piflow-server-0.9.jar with /opt/project/piflow/piflow-server/target/piflow-server-0.9-shaded.jar
 								          [INFO] ------------------------------------------------------------------------
 								          [INFO] Reactor Summary:
 								          [INFO]
 								          [INFO] piflow-project ..................................... SUCCESS [  4.602 s]
 								          [INFO] piflow-core ........................................ SUCCESS [ 56.533 s]
 								          [INFO] piflow-bundle ...................................... SUCCESS [02:15 min]
 								          [INFO] piflow-server ...................................... SUCCESS [03:01 min]
 								          [INFO] ------------------------------------------------------------------------
 								          [INFO] BUILD SUCCESS
 								          [INFO] ------------------------------------------------------------------------
 								          [INFO] Total time: 06:18 min
 								          [INFO] Finished at: 2018-12-24T16:54:16+08:00
 								          [INFO] Final Memory: 41M/812M
 								          [INFO] ------------------------------------------------------------------------
 								To Run Piflow Server：
 								- configure config.properties
 								      #server ip and port
 								      server.ip=10.0.86.191
 								      server.port=8002
 								      h2.port=50002
 								      #spark and yarn config
 								      spark.master=yarn
 								      spark.deploy.mode=cluster
 								      yarn.resourcemanager.hostname=10.0.86.191
 								      yarn.resourcemanager.address=10.0.86.191:8032
 								      yarn.access.namenode=hdfs://10.0.86.191:9000
 								      yarn.stagingDir=hdfs://10.0.86.191:9000/tmp/
 								      yarn.jars=hdfs://10.0.86.191:9000/user/spark/share/lib/*.jar
 								      yarn.url=http://10.0.86.191:8088/ws/v1/cluster/apps/
 								      #hive config
 								      hive.metastore.uris=thrift://10.0.86.191:9083
 								      #piflow jar path
 								      piflow.bundle=/opt/piflowServer/piflow-server-0.9.jar
 								      #checkpoint hdfs path
 								      checkpoint.path=hdfs://10.0.86.89:9000/piflow/checkpoints/
 								- you can run piflow server on intellij
 								  - main class is cn.piflow.api.Main
 								  - remember to set SPARK_HOME
 								- you can run piflow server as follows:
 								  - download piflowServer:***
 								  - edit config.properties
 								  - run start.sh
 								To Run Piflow Web：
 								  - todo
 								To Use：
 								- command line
 								  - flow config example
 								        {
 								          "flow":{
 								          "name":"test",
 								          "uuid":"1234",
 								          "checkpoint":"Merge",
 								          "stops":[
 								          {
 								            "uuid":"1111",
 								            "name":"XmlParser",
 								            "bundle":"cn.piflow.bundle.xml.XmlParser",
 								            "properties":{
 								                "xmlpath":"hdfs://10.0.86.89:9000/xjzhu/dblp.mini.xml",
 								                "rowTag":"phdthesis"
 								            }
 								          },
 								          {
 								            "uuid":"2222",
 								            "name":"SelectField",
 								            "bundle":"cn.piflow.bundle.common.SelectField",
 								            "properties":{
 								                "schema":"title,author,pages"
 								            }
 								          },
 								          {
 								            "uuid":"3333",
 								            "name":"PutHiveStreaming",
 								            "bundle":"cn.piflow.bundle.hive.PutHiveStreaming",
 								            "properties":{
 								                "database":"sparktest",
 								                "table":"dblp_phdthesis"
 								            }
 								          },
 								          {
 								            "uuid":"4444",
 								            "name":"CsvParser",
 								            "bundle":"cn.piflow.bundle.csv.CsvParser",
 								            "properties":{
 								                "csvPath":"hdfs://10.0.86.89:9000/xjzhu/phdthesis.csv",
 								                "header":"false",
 								                "delimiter":",",
 								                "schema":"title,author,pages"
 								            }
 								          },
 								          {
 								            "uuid":"555",
 								            "name":"Merge",
 								            "bundle":"cn.piflow.bundle.common.Merge",
 								            "properties":{
 								              "inports":"data1,data2"
 								            }
 								          },
 								          {
 								            "uuid":"666",
 								            "name":"Fork",
 								            "bundle":"cn.piflow.bundle.common.Fork",
 								            "properties":{
 								              "outports":"out1,out2,out3"
 								            }
 								          },
 								          {
 								            "uuid":"777",
 								            "name":"JsonSave",
 								            "bundle":"cn.piflow.bundle.json.JsonSave",
 								            "properties":{
 								              "jsonSavePath":"hdfs://10.0.86.89:9000/xjzhu/phdthesis.json"
 								            }
 								          },
 								          {
 								            "uuid":"888",
 								            "name":"CsvSave",
 								            "bundle":"cn.piflow.bundle.csv.CsvSave",
 								            "properties":{
 								              "csvSavePath":"hdfs://10.0.86.89:9000/xjzhu/phdthesis_result.csv",
 								              "header":"true",
 								              "delimiter":","
 								            }
 								          }
 								        ],
 								        "paths":[
 								          {
 								            "from":"XmlParser",
 								            "outport":"",
 								            "inport":"",
 								            "to":"SelectField"
 								          },
 								          {
 								            "from":"SelectField",
 								            "outport":"",
 								            "inport":"data1",
 								            "to":"Merge"
 								          },
 								          {
 								            "from":"CsvParser",
 								            "outport":"",
 								            "inport":"data2",
 								            "to":"Merge"
 								          },
 								          {
 								            "from":"Merge",
 								            "outport":"",
 								            "inport":"",
 								            "to":"Fork"
 								          },
 								          {
 								            "from":"Fork",
 								            "outport":"out1",
 								            "inport":"",
 								            "to":"PutHiveStreaming"
 								          },
 								          {
 								            "from":"Fork",
 								            "outport":"out2",
 								            "inport":"",
 								            "to":"JsonSave"
 								          },
 								          {
 								            "from":"Fork",
 								            "outport":"out3",
 								            "inport":"",
 								            "to":"CsvSave"
 								          }
 								        ]
 								      }
 								    }
 								  - curl -X POST http://10.0.86.191:8002/flow/start -H "Content-type: application/json" -d 'this is your flow json'
 								- piflow web
 								  ![](https://github.com/cas-bigdatalab/piflow/blob/master/doc/piflow_web.png)