piflow


πFlow is an easy-to-use, powerful big data pipeline system. Try PiFlow v0.6 at: http://piflow.cstcloud.cn/piflow-web/

Features
Architecture
Requirements
Getting Started
PiFlow Docker
Use Interface

Features

Easy to use
- provides a WYSIWYG web interface to configure data flows
- monitors data flow status
- shows the logs of data flows
- provides checkpoints
Strong scalability
- supports custom development of data processing components
Superior performance
- based on the distributed computing engine Spark
Powerful
- 100+ data processing components available
- including Spark, MLlib, Hadoop, Hive, HBase, Solr, Redis, Memcache, Elasticsearch, JDBC, MongoDB, HTTP, FTP, XML, CSV, JSON, etc.

Architecture

Requirements

JDK 1.8
Scala-2.11.8
Apache Maven 3.1.0 or newer
Git Client (used during build process by 'bower' plugin)
Spark-2.1.0, Spark-2.2.0, or Spark-2.3.0
Hadoop-2.6.0

Getting Started

To Build:

Install the external packages into the local Maven repository:

    mvn install:install-file -Dfile=/.../piflow/piflow-bundle/lib/spark-xml_2.11-0.4.2.jar -DgroupId=com.databricks -DartifactId=spark-xml_2.11 -Dversion=0.4.2 -Dpackaging=jar
    mvn install:install-file -Dfile=/.../piflow/piflow-bundle/lib/java_memcached-release_2.6.6.jar -DgroupId=com.memcached -DartifactId=java_memcached-release -Dversion=2.6.6 -Dpackaging=jar
    mvn install:install-file -Dfile=/.../piflow/piflow-bundle/lib/ojdbc6-11.2.0.3.jar -DgroupId=oracle -DartifactId=ojdbc6 -Dversion=11.2.0.3 -Dpackaging=jar
    mvn install:install-file -Dfile=/.../piflow/piflow-bundle/lib/edtftpj.jar -DgroupId=ftpClient -DartifactId=edtftp -Dversion=1.0.0 -Dpackaging=jar
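
The four commands above differ only in the jar name and its Maven coordinates, so they can be generated from a small table. The following is a minimal sketch, not part of PiFlow itself: `install_commands` and its `piflow_home` parameter are hypothetical names, and `piflow_home` stands for wherever your checkout lives (the elided `/.../piflow` prefix above).

```python
from pathlib import PurePosixPath

# (jar file name, groupId, artifactId, version), copied from the commands above.
EXTERNAL_JARS = [
    ("spark-xml_2.11-0.4.2.jar", "com.databricks", "spark-xml_2.11", "0.4.2"),
    ("java_memcached-release_2.6.6.jar", "com.memcached", "java_memcached-release", "2.6.6"),
    ("ojdbc6-11.2.0.3.jar", "oracle", "ojdbc6", "11.2.0.3"),
    ("edtftpj.jar", "ftpClient", "edtftp", "1.0.0"),
]


def install_commands(piflow_home):
    """Return one `mvn install:install-file` argument list per external jar.

    `piflow_home` is a hypothetical parameter: the directory of your PiFlow checkout.
    """
    lib_dir = PurePosixPath(piflow_home) / "piflow-bundle" / "lib"
    return [
        [
            "mvn", "install:install-file",
            f"-Dfile={lib_dir / jar}",
            f"-DgroupId={group}",
            f"-DartifactId={artifact}",
            f"-Dversion={version}",
            "-Dpackaging=jar",
        ]
        for jar, group, artifact, version in EXTERNAL_JARS
    ]


if __name__ == "__main__":
    # Print the commands for review; "/path/to/piflow" is a placeholder.
    for cmd in install_commands("/path/to/piflow"):
        print(" ".join(cmd))
```

To actually run them, each argument list could be passed to `subprocess.run(cmd, check=True)`.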

Build the project:

    mvn clean package -Dmaven.test.skip=true

    [INFO] Replacing original artifact with shaded artifact.
    [INFO] Reactor Summary:
    [INFO]
    [INFO] piflow-project ..................................... SUCCESS [  4.369 s]
    [INFO] piflow-core ........................................ SUCCESS [01:23 min]
    [INFO] piflow-configure ................................... SUCCESS [ 12.418 s]
    [INFO] piflow-bundle ...................................... SUCCESS [02:15 min]
    [INFO] piflow-server ...................................... SUCCESS [02:05 min]
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 06:01 min
    [INFO] Finished at: 2020-05-21T15:22:58+08:00
    [INFO] Final Memory: 118M/691M
    [INFO] ------------------------------------------------------------------------

Run PiFlow Server:

Run PiFlow server in IntelliJ:
- edit config.properties
- build PiFlow to generate piflow-server-0.9.jar
- the main class is cn.piflow.api.Main (remember to set SPARK_HOME)
Run PiFlow server from a release version:
- download piflow.tar.gz:
  https://github.com/cas-bigdatalab/piflow/releases/download/v0.5/piflow.tar.gz
  https://github.com/cas-bigdatalab/piflow/releases/download/v0.6/piflow-server-v0.6.tar.gz
  https://github.com/cas-bigdatalab/piflow/releases/download/v0.7/piflow-server-v0.7.tar.gz
- extract piflow.tar.gz:
  tar -zxvf piflow.tar.gz
- edit config.properties
- run start.sh, stop.sh, restart.sh, or status.sh as needed

How to configure config.properties:

    # Spark and YARN config
    spark.master=yarn
    spark.deploy.mode=cluster

    # HDFS default file system
    fs.defaultFS=hdfs://10.0.86.191:9000

    # YARN resourcemanager.hostname
    yarn.resourcemanager.hostname=10.0.86.191

    # if you want to use Hive, set the Hive metastore URIs
    #hive.metastore.uris=thrift://10.0.88.71:9083

    # number of rows to show in the logs; set 0 if you do not want to show data in logs
    data.show=10

    # server port
    server.port=8002

    # h2db port
    h2.port=50002
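
Before starting the server it can help to sanity-check the edited file. The sketch below parses simple `key=value` lines and reports missing keys or malformed ports. It rests on an assumption: every key left uncommented in the sample above is treated as expected, which PiFlow itself may not require. `parse_properties`, `check_config`, and `EXPECTED_KEYS` are hypothetical helper names.

```python
# Assumption: the keys left uncommented in the sample are the ones to expect.
EXPECTED_KEYS = (
    "spark.master",
    "spark.deploy.mode",
    "fs.defaultFS",
    "yarn.resourcemanager.hostname",
    "data.show",
    "server.port",
    "h2.port",
)


def parse_properties(text):
    """Parse simple `key=value` lines, skipping blank lines and `#` comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {raw!r}")
        props[key.strip()] = value.strip()
    return props


def check_config(props):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = [f"missing key: {key}" for key in EXPECTED_KEYS if key not in props]
    for port_key in ("server.port", "h2.port"):
        value = props.get(port_key)
        if value is not None and not (value.isdigit() and 0 < int(value) < 65536):
            problems.append(f"{port_key} is not a valid TCP port: {value!r}")
    return problems


SAMPLE = """\
#spark and yarn config
spark.master=yarn
spark.deploy.mode=cluster
fs.defaultFS=hdfs://10.0.86.191:9000
yarn.resourcemanager.hostname=10.0.86.191
#hive.metastore.uris=thrift://10.0.88.71:9083
data.show=10
server.port=8002
h2.port=50002
"""

if __name__ == "__main__":
    print(check_config(parse_properties(SAMPLE)))
```

In practice the text would come from reading config.properties rather than from the inline `SAMPLE` string.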

Run PiFlow Web:

See https://github.com/cas-bigdatalab/piflow-web

Use with the command line:

Flow config example:

    {
      "flow": {
        "name": "test",
        "uuid": "1234",
        "checkpoint": "Merge",
        "stops": [
          {
            "uuid": "1111",
            "name": "XmlParser",
            "bundle": "cn.piflow.bundle.xml.XmlParser",
            "properties": {
              "xmlpath": "hdfs://10.0.86.89:9000/xjzhu/dblp.mini.xml",
              "rowTag": "phdthesis"
            }
          },
          {
            "uuid": "2222",
            "name": "SelectField",
            "bundle": "cn.piflow.bundle.common.SelectField",
            "properties": {
              "schema": "title,author,pages"
            }
          },
          {
            "uuid": "3333",
            "name": "PutHiveStreaming",
            "bundle": "cn.piflow.bundle.hive.PutHiveStreaming",
            "properties": {
              "database": "sparktest",
              "table": "dblp_phdthesis"
            }
          },
          {
            "uuid": "4444",
            "name": "CsvParser",
            "bundle": "cn.piflow.bundle.csv.CsvParser",
            "properties": {
              "csvPath": "hdfs://10.0.86.89:9000/xjzhu/phdthesis.csv",
              "header": "false",
              "delimiter": ",",
              "schema": "title,author,pages"
            }
          },
          {
            "uuid": "555",
            "name": "Merge",
            "bundle": "cn.piflow.bundle.common.Merge",
            "properties": {
              "inports": "data1,data2"
            }
          },
          {
            "uuid": "666",
            "name": "Fork",
            "bundle": "cn.piflow.bundle.common.Fork",
            "properties": {
              "outports": "out1,out2,out3"
            }
          },
          {
            "uuid": "777",
            "name": "JsonSave",
            "bundle": "cn.piflow.bundle.json.JsonSave",
            "properties": {
              "jsonSavePath": "hdfs://10.0.86.89:9000/xjzhu/phdthesis.json"
            }
          },
          {
            "uuid": "888",
            "name": "CsvSave",
            "bundle": "cn.piflow.bundle.csv.CsvSave",
            "properties": {
              "csvSavePath": "hdfs://10.0.86.89:9000/xjzhu/phdthesis_result.csv",
              "header": "true",
              "delimiter": ","
            }
          }
        ],
        "paths": [
          { "from": "XmlParser", "outport": "", "inport": "", "to": "SelectField" },
          { "from": "SelectField", "outport": "", "inport": "data1", "to": "Merge" },
          { "from": "CsvParser", "outport": "", "inport": "data2", "to": "Merge" },
          { "from": "Merge", "outport": "", "inport": "", "to": "Fork" },
          { "from": "Fork", "outport": "out1", "inport": "", "to": "PutHiveStreaming" },
          { "from": "Fork", "outport": "out2", "inport": "", "to": "JsonSave" },
          { "from": "Fork", "outport": "out3", "inport": "", "to": "CsvSave" }
        ]
      }
    }
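
In the example, each entry in "paths" connects two stops by name, and a non-empty "inport" or "outport" matches a name listed in the target stop's "inports" or source stop's "outports" property. A flow config can be checked for that consistency before it is submitted. This is a minimal sketch based only on what the example shows; the rules it enforces are assumptions inferred from the example, not a statement of what the PiFlow server validates, and `check_flow` and `declared_ports` are hypothetical helpers.

```python
import json


def declared_ports(stop, key):
    """Ports a stop declares through a comma-separated `inports`/`outports` property."""
    value = stop.get("properties", {}).get(key, "")
    return {port.strip() for port in value.split(",") if port.strip()}


def check_flow(config):
    """Return a list of consistency problems; an empty list means none were found."""
    flow = config["flow"]
    stops = {stop["name"]: stop for stop in flow["stops"]}
    problems = []
    for path in flow["paths"]:
        for end in ("from", "to"):
            if path[end] not in stops:
                problems.append(f"path refers to unknown stop: {path[end]}")
        src = stops.get(path["from"])
        dst = stops.get(path["to"])
        # An empty port string means the default port, so only named ports are checked.
        if src and path["outport"] and path["outport"] not in declared_ports(src, "outports"):
            problems.append(f"{path['from']} does not declare outport {path['outport']}")
        if dst and path["inport"] and path["inport"] not in declared_ports(dst, "inports"):
            problems.append(f"{path['to']} does not declare inport {path['inport']}")
    return problems


# A trimmed-down version of the example flow, kept short for illustration.
EXAMPLE = json.loads("""
{"flow": {"name": "test", "uuid": "1234",
  "stops": [
    {"uuid": "2222", "name": "SelectField",
     "bundle": "cn.piflow.bundle.common.SelectField",
     "properties": {"schema": "title,author,pages"}},
    {"uuid": "555", "name": "Merge",
     "bundle": "cn.piflow.bundle.common.Merge",
     "properties": {"inports": "data1,data2"}}
  ],
  "paths": [
    {"from": "SelectField", "outport": "", "inport": "data1", "to": "Merge"}
  ]}}
""")

if __name__ == "__main__":
    print(check_flow(EXAMPLE))
```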

Command line (replace the -d payload with your flow JSON):

    curl -0 -X POST http://10.0.86.191:8002/flow/start -H "Content-type: application/json" -d 'this is your flow json'
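
The same request can be sent from a script, which avoids quoting a large JSON document on the command line. The sketch below uses only the Python standard library and the `/flow/start` endpoint shown in the curl command. `build_start_request` and `start_flow` are hypothetical helper names, and the response is returned as raw text because its format is not described here. Note that `urllib` speaks HTTP/1.1, whereas `curl -0` forces HTTP/1.0; whether that difference matters to the server is not something this README states.

```python
import json
import urllib.request


def build_start_request(server, flow_config):
    """Build (but do not send) the POST request for the /flow/start endpoint."""
    body = json.dumps(flow_config).encode("utf-8")
    return urllib.request.Request(
        url=f"{server}/flow/start",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_flow(server, flow_config, timeout=30):
    """Send the flow config to the server and return the raw response text."""
    request = build_start_request(server, flow_config)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only builds the request here; call start_flow(...) against a running server.
    req = build_start_request("http://10.0.86.191:8002", {"flow": {"name": "test"}})
    print(req.get_method(), req.full_url)
```

With a flow saved to a file, a call could look like `start_flow("http://10.0.86.191:8002", json.load(open("flow.json")))`, where `flow.json` is a placeholder file name.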

PiFlow Docker

Pull the PiFlow image:

    docker pull registry.cn-hangzhou.aliyuncs.com/cnic_piflow/piflow:v0.6.1

Show the Docker images:

    docker images

Run a container from the PiFlow image ID; all services start automatically:

    docker run --name piflow-v0.6 -it [imageID]

Then visit "containerip:6001/piflow-web"; it may take a while to become available.

If something goes wrong, all the applications are in the /opt folder.
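
Because the web interface may take a while to come up after the container starts, a small polling loop can stand in for refreshing the browser by hand. This is a minimal sketch using only the standard library; `is_up` and `wait_until_up` are hypothetical helpers, the retry count and delay are arbitrary defaults, and `containerip` in the usage line remains a placeholder for your container's address.

```python
import time
import urllib.error
import urllib.request


def is_up(url, timeout=5):
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (for example 404).
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: not up yet.
        return False


def wait_until_up(url, attempts=30, delay=10.0, probe=is_up, sleep=time.sleep):
    """Probe the URL up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A call such as `wait_until_up("http://containerip:6001/piflow-web")` returns True once the page responds and False if it never does within the allotted attempts.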