Flink的DataStream API
参考: https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/datastream_api.html
Data Sources
Sources 是程序读取其输入的位置,可以使用fsEnv.addSource(sourceFunction)
将Source附加到程序中。Flink内置了许多预先实现的SourceFunction,可以通过实现SourceFunction(non-parallel sources)来编写自己的自定义Source,或通过实现ParallelSourceFunction接口或继承RichParallelSourceFunction来实现并行Source.
File-based
readTextFile(path):逐行读取文本文件,底层使用TextInputFormat规范读取文件,并将其作为字符串返回。
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val lines:DataStream[String]=fsEnv.readTextFile("file:///E:\\demo\\words")
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(t=>t._1)
.sum(1)
.print()
fsEnv.execute("wordcount")
readFile(fileInputFormat, path) :根据指定的文件输入格式读取文件(仅仅读取一次,类似批处理)
//1.创建流处理的环境 - 远程发布|本地执行
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val inputFormat = new TextInputFormat(null)
//2.读取外围系统数据 - 细化
val lines:DataStream[String]=fsEnv.readFile(inputFormat,"file:///D:/demo/words")
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(t=>t._1)
.sum(1)
.print()
// print(fsEnv.getExecutionPlan)
//3.执行流计算
fsEnv.execute("wordcount")
readFile(fileInputFormat, path, watchType, interval, pathFilter) :这是前两个内部调用的方法。它根据给定的FileInputFormat读取路径中的文件。可以根据watchType定期的检测路径下的文件,其中watchType可选值FileProcessingMode.PROCESS_CONTINUOUSLY或者FileProcessingMode.PROCESS_ONCE检查的周期由interval参数指定。用户可以使用pathFilter参数排除路劲下需要排除的文件。如果处理是PROCESS_CONTINUOUSLY,一旦文件内容发生改变,整个文件内容会被重复处理。
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val inputFormat=new TextInputFormat(null)
val lines:DataStream[String]=fsEnv.readFile(
inputFormat,"file:///E:\\demo\\words",
FileProcessingMode.PROCESS_CONTINUOUSLY,
5000,new FilePathFilter {
override def filterPath(filePath: Path): Boolean = {
filePath.getPath.endsWith(".txt")
}
})
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(t=>t._1)
.sum(1)
.print()
fsEnv.execute("wordcount")
Socket-based
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val lines:DataStream[String]=fsEnv.socketTextStream("CentOS",9999)
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(t=>t._1)
.sum(1)
.print()
fsEnv.execute("wordcount")
Collection-based(测试)
//1.创建流处理的环境 - 远程发布|本地执行
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val lines:DataStream[String] = fsEnv.fromCollection(List("this is a demo","where are you from"))
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(t=>t._1)
.sum(1)
.print()
// print(fsEnv.getExecutionPlan)
//3.执行流计算
fsEnv.execute("wordcount")
Custom Source
自定义数据源
package com.hw.demo02
import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction, SourceFunction}
import scala.util.Random
/**
* @aurhor:fql
* @date 2019/10/15 19:23
* @type:
*/
class CustomSourceFunction extends ParallelSourceFunction[String]{
@volatile
var isRunning:Boolean=true
val lines:Array[String] = Array("this is a demo","hello word","are you ok")
override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
while (isRunning){
Thread.sleep(1000)
ctx.collect(lines(new Random().nextInt(lines.length))) //将数据输出给下游
}
}
override def cancel(): Unit = {
isRunning=false
}
}
//1.创建流处理的环境 - 远程发布|本地执行
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val lines:DataStream[String] = fsEnv.addSource[String](new CustomSourceFunction)
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(t=>t._1)
.sum(1)
.print()
// print(fsEnv.getExecutionPlan)
//3.执行流计算
fsEnv.execute("wordcount")
√FlinkKafkaConsumer
- 引入相关依赖
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka_${scala.version}</artifactId>
<version>${flink.version}</version>
</dependency>
//1.创建流处理的环境 - 远程发布|本地执行
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val props = new Properties()
props.setProperty("bootstrap.servers", "CentOS:9092")
props.setProperty("group.id", "g1")
val lines=fsEnv.addSource(new FlinkKafkaConsumer("topic01",new SimpleStringSchema(),props))
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(t=>t._1)
.sum(1)
.print()
// print(fsEnv.getExecutionPlan)
//3.执行流计算
fsEnv.execute("wordcount")
如果Kafka存储的都是json字符串数据,用户可以使用系统自带一些json支持的Schema。推荐使用
- JsonNodeDeserializationSchema:要求value必须是json字符串
- JSONKeyValueDeserializationSchema(meta):要求key,value都必须是josn格式,同时可以携带元数据(分区、 owset等)
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val props = new Properties()
props.setProperty("bootstrap.servers", "CentOS:9092")
props.setProperty("group.id", "g1")
val jsonData:DataStream[ObjectNode]=fsEnv.addSource(new FlinkKafkaConsumer("topic01",new JSONKeyValueDeserializationSchema(true),props))
jsonData.map(on=> (on.get("value").get("id").asInt(),on.get("value").get("name")))
.print()
fsEnv.execute("wordcount")
Data Sinks
Data Sinks接收DataStream数据,并将其转发到文件,socket,外部系统或者print它们。Flink预定义一些输出Sink。
file-based
write*
:writeAsText/writeAsCsv(…)/writeUsingOutputFormat请注意,DataStream上的write *()方法主要用于调试目的。
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val props = new Properties()
props.setProperty("bootstrap.servers", "CentOS:9092")
props.setProperty("group.id", "g1")
fsEnv.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(),props))
.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.writeAsText("file:///E:/results/text",WriteMode.OVERWRITE)
以上的写法只能保证at_least_once
语义的保证,如果是在生产环境下推荐使用flink-connector-filesystem
将数据写到外围系统,可以保证exactly-once
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val props = new Properties()
props.setProperty("bootstrap.servers", "CentOS:9092")
props.setProperty("group.id", "g1")
val bucketingSink = new BucketingSink[(String,Int)]("hdfs://CentOS:9000/BucketingSink")
bucketingSink.setBucketer(new DateTimeBucketer("yyyyMMddHH"))//文件目录
bucketingSink.setBatchSize(1024)
fsEnv.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(),props))
.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.addSink(bucketingSink)
.setParallelism(6)
fsEnv.execute("wordcount")
print()/printErr()
//1.创建流处理的环境 - 远程发布|本地执行
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
//2.读取外围系统数据 - 细化
val porps = new Properties()
porps.setProperty("bootstrap.servers","CentOS:9092")
porps.setProperty("group.id","g2")
val lines = fsEnv.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(),porps))
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(t=>t._1)
.sum(1)
.print("测试") //输出前缀,可以区分当前有多个输出到控制台的流 可以添加 前缀
.setParallelism(2)
// print(fsEnv.getExecutionPlan)
//3.执行流计算
fsEnv.execute("wordcount")
Custom Sink
class CustomSinkFunction extends RichSinkFunction[(String,Int)]{
override def open(parameters: Configuration): Unit = {
println("初始化连接")
}
override def invoke(value: (String, Int), context: SinkFunction.Context[_]): Unit = {
println(value)
}
override def close(): Unit = {
println("关闭连接")
}
}
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val props = new Properties()
props.setProperty("bootstrap.servers", "CentOS:9092")
props.setProperty("group.id", "g1")
fsEnv.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(),props))
.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.addSink(new CustomSinkFunction)
fsEnv.execute("wordcount")
√ RedisSink
- 添加依赖
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>flink-connector-redis_2.11</artifactId>
<version>1.0</version>
</dependency>
class UserRedisMapper extends RedisMapper[(String,Int)]{
// 设置数据类型
override def getCommandDescription: RedisCommandDescription = {
new RedisCommandDescription(RedisCommand.HSET,"wordcount")
}
override def getKeyFromData(data: (String, Int)): String = {
data._1
}
override def getValueFromData(data: (String, Int)): String = {
data._2.toString
}
}
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val props = new Properties()
props.setProperty("bootstrap.servers", "CentOS:9092")
props.setProperty("group.id", "g1")
val jedisConfig=new FlinkJedisPoolConfig.Builder()
.setHost("CentOS")
.setPort(6379)
.build()
fsEnv.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(),props))
.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.addSink(new RedisSink[(String, Int)](jedisConfig,new UserRedisMapper))
fsEnv.execute("wordcount")
√FlinkkafkaProducer
class UserKeyedSerializationSchema extends KeyedSerializationSchema[(String,Int)]{
Int
override def serializeKey(element: (String, Int)): Array[Byte] = {
element._1.getBytes()
}
override def serializeValue(element: (String, Int)): Array[Byte] = {
element._2.toString.getBytes()
}
//可以覆盖 默认topic ,如果返回null 则将数据写入到默认topic中
override def getTargetTopic(element: (String, Int)): String = {
null
}
}
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val props1 = new Properties()
props1.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092")
props1.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "g1")
val props2 = new Properties()
props2.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092")
props2.setProperty(ProducerConfig.BATCH_SIZE_CONFIG,"100")
props2.setProperty(ProducerConfig.LINGER_MS_CONFIG,"500")
props2.setProperty(ProducerConfig.ACKS_CONFIG,"all")
props2.setProperty(ProducerConfig.RETRIES_CONFIG,"2")
fsEnv.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(),props1))
.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.addSink(new FlinkKafkaProducer[(String, Int)]("topic02",new UserKeyedSerializationSchema,props2))
fsEnv.execute("wordcount")
DataStream Transformations
Map
Takes one element and produces one element.
dataStream.map { x => x * 2 }
FlatMap
Takes one element and produces zero, one, or more elements.
dataStream.flatMap { str => str.split(" ") }
Filter
Evaluates a boolean function for each element and retains those for which the function returns true.
dataStream.filter { _ != 0 }
Union
Union of two or more data streams creating a new stream containing all the elements from all the streams.
dataStream.union(otherStream1, otherStream2, ...)
Connect
“Connects” two data streams retaining their types, allowing for shared state between the two streams.
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val stream1 = fsEnv.socketTextStream("CentOS",9999)
val stream2 = fsEnv.socketTextStream("CentOS",8888)
stream1.connect(stream2).flatMap(line=>line.split("\\s+"),line=>line.split("\\s+"))
.map(Word(_,1))
.keyBy("word")
.sum("count")
.print()
fsEnv.execute("wordcount")
Split
Split the stream into two or more streams according to some criterion.
val split = someDataStream.split(
(num: Int) =>
(num % 2) match {
case 0 => List("even")
case 1 => List("odd")
}
)
Select
Select one or more streams from a split stream.
val even = split select "even"
val odd = split select "odd"
val all = split.select("even","odd")
val lines = fsEnv.socketTextStream("CentOS",9999)
val splitStream: SplitStream[String] = lines.split(line => {
if (line.contains("error")) {
List("error") //分支名称
} else {
List("info") //分支名称
}
})
splitStream.select("error").print("error")
splitStream.select("info").print("info")
Side Out
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val lines = fsEnv.socketTextStream("CentOS",9999)
val outTag = new OutputTag[String]("error")
val reslut = lines.process(new ProcessFunction[String, String] {
override def processElement(value: String, ctx: ProcessFunction[String, String]#Context, out: Collector[String]): Unit = {
if (value.contains("error")) {
ctx.output(outTag, value)
} else {
out.collect(value)
}
}
})
reslut.print("正常结果")
//获取边输出
reslut.getSideOutput(outTag).print("错误结果")
fsEnv.execute("wordcount")
KeyBy
Logically partitions a stream into disjoint partitions, each partition containing elements of the same key. Internally, this is implemented with hash partitioning.
dataStream.keyBy("someKey") // Key by field "someKey"
dataStream.keyBy(0) // Key by the first element of a Tuple
Reduce
A “rolling” reduce on a keyed data stream. Combines the current element with the last reduced value and emits the new value.
fsEnv.socketTextStream("CentOS",9999)
.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.reduce((t1,t2)=>(t1._1,t1._2+t2._2))
.print()
Fold
A “rolling” fold on a keyed data stream with an initial value. Combines the current element with the last folded value and emits the new value.
fsEnv.socketTextStream("CentOS",9999)
.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.fold(("",0))((t1,t2)=>(t2._1,t1._2+t2._2))
.print()
Aggregations
Rolling aggregations on a keyed data stream. The difference between min and minBy is that min returns the minimum value, whereas minBy returns the element that has the minimum value in this field (same for max and maxBy).
zhangsan 001 1000
wangw 001 1500
zhaol 001 800
fsEnv.socketTextStream("CentOS",9999)
.map(_.split("\\s+"))
.map(ts=>(ts(0),ts(1),ts(2).toDouble))
.keyBy(1)
.minBy(2)//输出含有最小值的记录
.print()
1> (zhangsan,001,1000.0)
1> (zhangsan,001,1000.0)
1> (zhaol,001,800.0)
fsEnv.socketTextStream("CentOS",9999)
.map(_.split("\\s+"))
.map(ts=>(ts(0),ts(1),ts(2).toDouble))
.keyBy(1)
.min(2)
.print()
1> (zhangsan,001,1000.0)
1> (zhangsan,001,1000.0)
1> (zhangsan,001,800.0)