piflow

πFlow is an easy-to-use, powerful big data pipeline system. Try it online: http://piflow.cstcloud.cn/piflow-web/

Features
Architecture
Requirements
Getting Started
PiFlow Docker
Use Interface

Features

Easy to use
- Provides a WYSIWYG web interface for configuring data flows
- Monitors data flow status
- Shows the logs of data flows
- Provides checkpoints
Strong scalability
- Supports custom development of data processing components
Superior performance
- Based on the distributed computing engine Spark
Powerful
- 100+ data processing components available
- Includes Spark, MLlib, Hadoop, Hive, HBase, Solr, Redis, Memcache, Elasticsearch, JDBC, MongoDB, HTTP, FTP, XML, CSV, JSON, etc.

Architecture

Requirements

JDK 1.8 or newer
Apache Maven 3.1.0 or newer
Git Client (used during build process by 'bower' plugin)
Spark-2.1.0
Hadoop-2.6.0
Hive-1.2.1

Getting Started

To Build: mvn clean package -Dmaven.test.skip=true

      [INFO] Replacing original artifact with shaded artifact.
      [INFO] Replacing /opt/project/piflow/piflow-server/target/piflow-server-0.9.jar with /opt/project/piflow/piflow-server/target/piflow-server-0.9-shaded.jar
      [INFO] ------------------------------------------------------------------------
      [INFO] Reactor Summary:
      [INFO] 
      [INFO] piflow-project ..................................... SUCCESS [  4.602 s]
      [INFO] piflow-core ........................................ SUCCESS [ 56.533 s]
      [INFO] piflow-bundle ...................................... SUCCESS [02:15 min]
      [INFO] piflow-server ...................................... SUCCESS [03:01 min]
      [INFO] ------------------------------------------------------------------------
      [INFO] BUILD SUCCESS
      [INFO] ------------------------------------------------------------------------
      [INFO] Total time: 06:18 min
      [INFO] Finished at: 2018-12-24T16:54:16+08:00
      [INFO] Final Memory: 41M/812M
      [INFO] ------------------------------------------------------------------------

To Run PiFlow Server:

Run the PiFlow server from IntelliJ:
- edit config.properties
- build piflow to generate piflow-server.jar
- run the main class cn.piflow.api.Main (remember to set SPARK_HOME)
Run the PiFlow server from a release:
- download piflow.tar.gz: https://github.com/cas-bigdatalab/piflow/releases/download/v0.5/piflow.tar.gz
- extract piflow.tar.gz: tar -zxvf piflow.tar.gz
- edit config.properties
- run start.sh

How to configure config.properties:

#server ip and port
server.ip=10.0.86.191
server.port=8002
h2.port=50002

#spark and yarn config
spark.master=yarn
spark.deploy.mode=cluster
yarn.resourcemanager.hostname=10.0.86.191
yarn.resourcemanager.address=10.0.86.191:8032
yarn.access.namenode=hdfs://10.0.86.191:9000
yarn.stagingDir=hdfs://10.0.86.191:9000/tmp/
yarn.jars=hdfs://10.0.86.191:9000/user/spark/share/lib/*.jar
yarn.url=http://10.0.86.191:8088/ws/v1/cluster/apps/

#hive config
hive.metastore.uris=thrift://10.0.86.191:9083

#piflow-server.jar path
piflow.bundle=/opt/piflowServer/piflow-server-0.9.jar

#checkpoint hdfs path
checkpoint.path=hdfs://10.0.86.89:9000/piflow/checkpoints/

#debug path
debug.path=hdfs://10.0.88.191:9000/piflow/debug/

#the count of data shown in log
data.show=10

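
A properties file like the one above is a flat list of key=value lines with `#` comments. Loaders such as `java.util.Properties` keep only the last value when a key is repeated, so it can be worth checking a file for repeated keys before starting the server. The following is a minimal Python sketch of such a check, not part of PiFlow: the helper name `parse_properties` and the shortened sample text are assumptions made for illustration, and whether PiFlow itself loads the file this way is not stated here.

```python
def parse_properties(text):
    """Parse key=value lines; return (settings, duplicate_keys)."""
    settings = {}
    duplicates = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and '#' comments.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        key = key.strip()
        if key in settings and key not in duplicates:
            duplicates.append(key)
        settings[key] = value.strip()  # the last value wins
    return settings, duplicates


# Shortened, illustrative excerpt of a config.properties file.
sample = """\
#server ip and port
server.ip=10.0.86.191
server.port=8002
h2.port=50002
#h2 db port
h2.port=50002
"""

settings, duplicates = parse_properties(sample)
base_url = "http://{}:{}".format(settings["server.ip"], settings["server.port"])
print("server base URL:", base_url)
print("repeated keys:", duplicates)
```

The derived base URL is the address that flow requests are sent to.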

To Run PiFlow Web:

See https://github.com/cas-bigdatalab/piflow-web

To Use:

Command line

Flow config example:

{
  "flow":{
  "name":"test",
  "uuid":"1234",
  "checkpoint":"Merge",
  "stops":[
  {
    "uuid":"1111",
    "name":"XmlParser",
    "bundle":"cn.piflow.bundle.xml.XmlParser",
    "properties":{
        "xmlpath":"hdfs://10.0.86.89:9000/xjzhu/dblp.mini.xml",
        "rowTag":"phdthesis"
    }
  },
  {
    "uuid":"2222",
    "name":"SelectField",
    "bundle":"cn.piflow.bundle.common.SelectField",
    "properties":{
        "schema":"title,author,pages"
    }
  },
  {
    "uuid":"3333",
    "name":"PutHiveStreaming",
    "bundle":"cn.piflow.bundle.hive.PutHiveStreaming",
    "properties":{
        "database":"sparktest",
        "table":"dblp_phdthesis"
    }
  },
  {
    "uuid":"4444",
    "name":"CsvParser",
    "bundle":"cn.piflow.bundle.csv.CsvParser",
    "properties":{
        "csvPath":"hdfs://10.0.86.89:9000/xjzhu/phdthesis.csv",
        "header":"false",
        "delimiter":",",
        "schema":"title,author,pages"
    }
  },
  {
    "uuid":"555",
    "name":"Merge",
    "bundle":"cn.piflow.bundle.common.Merge",
    "properties":{
      "inports":"data1,data2"
    }
  },
  {
    "uuid":"666",
    "name":"Fork",
    "bundle":"cn.piflow.bundle.common.Fork",
    "properties":{
      "outports":"out1,out2,out3"
    }
  },
  {
    "uuid":"777",
    "name":"JsonSave",
    "bundle":"cn.piflow.bundle.json.JsonSave",
    "properties":{
      "jsonSavePath":"hdfs://10.0.86.89:9000/xjzhu/phdthesis.json"
    }
  },
  {
    "uuid":"888",
    "name":"CsvSave",
    "bundle":"cn.piflow.bundle.csv.CsvSave",
    "properties":{
      "csvSavePath":"hdfs://10.0.86.89:9000/xjzhu/phdthesis_result.csv",
      "header":"true",
      "delimiter":","
    }
  }
],
"paths":[
  {
    "from":"XmlParser",
    "outport":"",
    "inport":"",
    "to":"SelectField"
  },
  {
    "from":"SelectField",
    "outport":"",
    "inport":"data1",
    "to":"Merge"
  },
  {
    "from":"CsvParser",
    "outport":"",
    "inport":"data2",
    "to":"Merge"
  },
  {
    "from":"Merge",
    "outport":"",
    "inport":"",
    "to":"Fork"
  },
  {
    "from":"Fork",
    "outport":"out1",
    "inport":"",
    "to":"PutHiveStreaming"
  },
  {
    "from":"Fork",
    "outport":"out2",
    "inport":"",
    "to":"JsonSave"
  },
  {
    "from":"Fork",
    "outport":"out3",
    "inport":"",
    "to":"CsvSave"
  }
]
}
}
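
The flow config above wires stops together by name: every entry in `paths` refers to stops through `from` and `to`, and `checkpoint` names a stop as well. A small consistency check can catch a misspelled name before the flow is submitted. The sketch below is a hypothetical helper (`check_flow` is not a PiFlow API) that assumes only the JSON layout shown above; treating `checkpoint` as a single name or a comma-separated list of names is an assumption.

```python
import json


def check_flow(config):
    """Return a list of problems found in a flow config dict."""
    flow = config["flow"]
    names = {stop["name"] for stop in flow["stops"]}
    problems = []
    # Every path must start and end at a declared stop.
    for path in flow["paths"]:
        for end in ("from", "to"):
            if path[end] not in names:
                problems.append("path {} -> {}: unknown stop '{}'".format(
                    path["from"], path["to"], path[end]))
    # Assumption: "checkpoint" holds one stop name or a comma-separated list.
    for name in filter(None, flow.get("checkpoint", "").split(",")):
        if name.strip() not in names:
            problems.append("checkpoint: unknown stop '{}'".format(name.strip()))
    return problems


# Trimmed sample: the CsvParser stop is deliberately missing.
text = """
{"flow": {"name": "test", "uuid": "1234", "checkpoint": "Merge",
  "stops": [
    {"uuid": "1111", "name": "XmlParser",
     "bundle": "cn.piflow.bundle.xml.XmlParser", "properties": {}},
    {"uuid": "555", "name": "Merge",
     "bundle": "cn.piflow.bundle.common.Merge",
     "properties": {"inports": "data1,data2"}}
  ],
  "paths": [
    {"from": "XmlParser", "outport": "", "inport": "data1", "to": "Merge"},
    {"from": "CsvParser", "outport": "", "inport": "data2", "to": "Merge"}
  ]}}
"""

problems = check_flow(json.loads(text))
for problem in problems:
    print(problem)
```

An empty list means every path endpoint and checkpoint name matches a declared stop; it says nothing about whether the ports or properties are valid.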

Start the flow by posting the config to the server (replace the -d argument with your flow JSON):

curl -0 -X POST http://10.0.86.191:8002/flow/start -H "Content-type: application/json" -d 'this is your flow json'
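
The same request can be issued from a script. Here is a minimal Python sketch using only the standard library; the `/flow/start` path, host, port, and header come from the curl command above, while the helper name `build_start_request` and the placeholder flow are assumptions for illustration. Unlike `curl -0`, `urllib` speaks HTTP/1.1; whether the server depends on HTTP/1.0 is not stated here.

```python
import json
import urllib.request


def build_start_request(flow_config, host, port):
    """Build (but do not send) the POST request for /flow/start."""
    body = json.dumps(flow_config).encode("utf-8")
    return urllib.request.Request(
        "http://{}:{}/flow/start".format(host, port),
        data=body,
        headers={"Content-type": "application/json"},
        method="POST",
    )


# Placeholder flow; replace it with a full flow config.
flow_config = {"flow": {"name": "test", "uuid": "1234", "stops": [], "paths": []}}
request = build_start_request(flow_config, "10.0.86.191", 8002)
print(request.get_method(), request.full_url)

# Sending needs a reachable PiFlow server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Keeping the request construction separate from the send makes it possible to inspect the URL and body without a running server.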

docker-started

Pull the PiFlow image:
docker pull registry.cn-hangzhou.aliyuncs.com/cnic_piflow/piflow:v0.6.1
List the Docker images:
docker images
Run a container from the PiFlow image ID; all services start automatically:
docker run --name piflow-v0.6 -it [imageID]
Then visit "containerip:6001/piflow-web"; it may take a while to become available.
If something goes wrong, all the applications are in the /opt folder.
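
Since the web interface may take a while to come up, a script can poll the address until it answers. The following is a rough Python sketch under stated assumptions: the helper names `http_probe` and `wait_until_ready`, the retry count, and the delay are all illustrative choices, and `containerip` stands for the container's actual address. The demonstration uses a fake probe so that it runs without a container.

```python
import http.client
import time
import urllib.request


def http_probe(url):
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status < 400
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False


def wait_until_ready(url, probe=http_probe, attempts=30, delay=10.0):
    """Call probe(url) until it succeeds; return the attempt number or None."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            time.sleep(delay)
    return None


# Demonstration with a fake probe that succeeds on the third call.
calls = []


def fake_probe(url):
    calls.append(url)
    return len(calls) >= 3


result = wait_until_ready("http://containerip:6001/piflow-web",
                          probe=fake_probe, delay=0)
print("ready after attempt:", result)
```

For real use, call `wait_until_ready` with the default probe and the container's address in place of `containerip`.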