
πFlow is an easy-to-use, powerful big data pipeline system. Try PiFlow v0.6 at http://piflow.cstcloud.cn/piflow-web/

Features

  • Easy to use
    • provides a WYSIWYG web interface for configuring data flows
    • monitors data flow status
    • checks data flow logs
    • provides checkpoints
  • Strong scalability
    • supports customized development of data processing components
  • Superior performance
    • based on the distributed computing engine Spark
  • Powerful
    • 100+ data processing components available
    • includes Spark, MLlib, Hadoop, Hive, HBase, Solr, Redis, Memcache, Elasticsearch, JDBC, MongoDB, HTTP, FTP, XML, CSV, JSON, etc.

Architecture

Requirements

  • JDK 1.8
  • Scala-2.11.8
  • Apache Maven 3.1.0 or newer
  • Spark-2.1.0, Spark-2.2.0, or Spark-2.3.0
  • Hadoop-2.6.0

Getting Started

To Build:

  • install the external packages into your local Maven repository

        mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/spark-xml_2.11-0.4.2.jar -DgroupId=com.databricks -DartifactId=spark-xml_2.11 -Dversion=0.4.2 -Dpackaging=jar
        mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/java_memcached-release_2.6.6.jar -DgroupId=com.memcached -DartifactId=java_memcached-release -Dversion=2.6.6 -Dpackaging=jar
        mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/ojdbc6-11.2.0.3.jar -DgroupId=oracle -DartifactId=ojdbc6 -Dversion=11.2.0.3 -Dpackaging=jar
        mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/edtftpj.jar -DgroupId=ftpClient -DartifactId=edtftp -Dversion=1.0.0 -Dpackaging=jar
    
  • build the project, skipping tests:

        mvn clean package -Dmaven.test.skip=true

        [INFO] Replacing original artifact with shaded artifact.
        [INFO] Reactor Summary:
        [INFO]
        [INFO] piflow-project ..................................... SUCCESS [  4.369 s]
        [INFO] piflow-core ........................................ SUCCESS [01:23 min]
        [INFO] piflow-configure ................................... SUCCESS [ 12.418 s]
        [INFO] piflow-bundle ...................................... SUCCESS [02:15 min]
        [INFO] piflow-server ...................................... SUCCESS [02:05 min]
        [INFO] ------------------------------------------------------------------------
        [INFO] BUILD SUCCESS
        [INFO] ------------------------------------------------------------------------
        [INFO] Total time: 06:01 min
        [INFO] Finished at: 2020-05-21T15:22:58+08:00
        [INFO] Final Memory: 118M/691M
        [INFO] ------------------------------------------------------------------------
    

Run Piflow Server

  • run piflow server on Intellij:

    • download piflow: git clone https://github.com/cas-bigdatalab/piflow.git

    • import piflow into Intellij

    • edit config.properties file

    • build piflow to generate piflow jar:

      • Edit Configurations --> Add New Configuration --> Maven
      • Name: package
      • Command line: clean package -Dmaven.test.skip=true -X
      • run 'package' (piflow jar file will be built in ../piflow/piflow-server/target/piflow-server-0.9.jar)
    • run HttpService:

      • Edit Configurations --> Add New Configuration --> Application
      • Name: HttpService
      • Main class : cn.piflow.api.Main
      • Environment Variable: SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.6 (change the path to your Spark home)
      • run 'HttpService'
    • test HttpService:

      • run /../piflow/piflow-server/src/main/scala/cn/piflow/api/HTTPClientStartMockDataFlow.scala
      • change the piflow server IP and port to match your configuration
  • run piflow server from a release package:

  • how to configure config.properties

    #spark and yarn config
    spark.master=yarn
    spark.deploy.mode=cluster
    
    #hdfs default file system
    fs.defaultFS=hdfs://10.0.86.191:9000
    
    #yarn resourcemanager.hostname
    yarn.resourcemanager.hostname=10.0.86.191
    
    #if you want to use hive, set hive metastore uris
    #hive.metastore.uris=thrift://10.0.88.71:9083
    
    #show data in log, set 0 if you do not want to show data in logs
    data.show=10
    
    #server port
    server.port=8002
    
    #h2db port
    h2.port=50002
    

Run Piflow Web

Restful API

  • flow json

    flow example

        {
          "flow": {
            "name": "MockData",
            "executorMemory": "1g",
            "executorNumber": "1",
            "uuid": "8a80d63f720cdd2301723b7461d92600",
            "paths": [
              {
                "inport": "",
                "from": "MockData",
                "to": "ShowData",
                "outport": ""
              }
            ],
            "executorCores": "1",
            "driverMemory": "1g",
            "stops": [
              {
                "name": "MockData",
                "bundle": "cn.piflow.bundle.common.MockData",
                "uuid": "8a80d63f720cdd2301723b7461d92604",
                "properties": {
                  "schema": "title:String, author:String, age:Int",
                  "count": "10"
                },
                "customizedProperties": {}
              },
              {
                "name": "ShowData",
                "bundle": "cn.piflow.bundle.external.ShowData",
                "uuid": "8a80d63f720cdd2301723b7461d92602",
                "properties": {
                  "showNumber": "5"
                },
                "customizedProperties": {}
              }
            ]
          }
        }

  • CURL POST
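
    A minimal curl sketch, assuming the server listens on the server.port from config.properties and exposes a /flow/start endpoint (host, port, and endpoint path are assumptions; adjust them to your deployment):

```shell
# Hypothetical example: POST the flow JSON above to a running piflow server.
# PIFLOW_SERVER and the /flow/start path are assumptions -- match them to
# your config.properties and your piflow server's HTTP API.
PIFLOW_SERVER="http://127.0.0.1:8002"

# Write the flow definition to a file (abridged version of the example above).
cat > /tmp/mockdata-flow.json <<'EOF'
{
  "flow": {
    "name": "MockData",
    "executorMemory": "1g",
    "executorNumber": "1",
    "executorCores": "1",
    "driverMemory": "1g",
    "paths": [
      { "inport": "", "from": "MockData", "to": "ShowData", "outport": "" }
    ],
    "stops": [
      {
        "name": "MockData",
        "bundle": "cn.piflow.bundle.common.MockData",
        "properties": { "schema": "title:String, author:String, age:Int", "count": "10" }
      },
      {
        "name": "ShowData",
        "bundle": "cn.piflow.bundle.external.ShowData",
        "properties": { "showNumber": "5" }
      }
    ]
  }
}
EOF

# Start the flow; fall back to a message if the server is unreachable.
curl -s -X POST -H "Content-Type: application/json" \
     --data @/tmp/mockdata-flow.json \
     "$PIFLOW_SERVER/flow/start" \
  || echo "request failed: is the piflow server running at $PIFLOW_SERVER?"
```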

  • Command line

    • set PIFLOW_HOME
      vim /etc/profile
      export PIFLOW_HOME=/yourPiflowPath/piflow-bin
      export PATH=$PATH:$PIFLOW_HOME/bin

    • command example
      piflow flow start yourFlow.json
      piflow flow stop appID
      piflow flow info appID
      piflow flow log appID

      piflow flowGroup start yourFlowGroup.json
      piflow flowGroup stop groupId
      piflow flowGroup info groupId
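
A sketch of chaining the commands above in a script, assuming `piflow flow start` prints the application ID to stdout (verify the actual output format of your release before relying on this):

```shell
#!/bin/sh
# Hypothetical session: start a flow, then inspect it via the captured appID.
# Assumption: `piflow flow start` writes the application ID to stdout.
FLOW_JSON="yourFlow.json"

if command -v piflow >/dev/null 2>&1; then
    appID=$(piflow flow start "$FLOW_JSON")
    piflow flow info "$appID"
    piflow flow log  "$appID"
else
    echo "piflow not found on PATH; export PIFLOW_HOME and PATH as shown above"
fi
```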

Docker Quick Start

  • pull the piflow image
    docker pull registry.cn-hangzhou.aliyuncs.com/cnic_piflow/piflow:v0.6.1

  • show docker images
    docker images

  • run a container with the piflow imageID; all services start automatically
    docker run --name piflow-v0.6 -it [imageID]

  • visit "containerip:6001/piflow-web"; it may take a while for the services to come up

  • if something goes wrong, all the applications are in the /opt folder
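
The container IP mentioned above can be looked up with `docker inspect`; a sketch assuming the container name from the run command:

```shell
# Build the piflow-web URL for the container started above.
# The container name piflow-v0.6 is taken from the docker run command.
get_piflow_url() {
    if command -v docker >/dev/null 2>&1; then
        # Extract the container IP from its network settings (Go template syntax).
        ip=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' piflow-v0.6 2>/dev/null)
    fi
    # Fall back to a placeholder when docker or the container is unavailable.
    echo "http://${ip:-<container-ip>}:6001/piflow-web"
}

get_piflow_url
```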

User Interface

  • Login:

  • Flow list:

  • Create flow:

  • Configure flow:

  • Load flow:

  • Monitor flow:

  • Flow logs:

  • Group list:

  • Configure group:

  • Monitor group:

  • Process List:

  • Template List:

  • DataSource List:

Contact Us

  • Name: 吴老师 (Mr. Wu)
  • Mobile Phone: 18910263390
  • WeChat: 18910263390
  • Email: wzs@cnic.cn
  • QQ Group: 1003489545
  • WeChat group is valid for 7 days