[Flink] How to integrate Flink with Kafka, using Kafka as a Flink source and sink
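Using Kafka as a Flink Source
Overview
Source categories
Flink sources fall into four categories:
- Collection-based: bounded data sets, generally used for local testing
- File-based: suited to watching a file for changes and reading its contents; mostly used for testing and rarely in production
- Socket-based: connects to the given host and port and reads data from the socket
- Custom addSource: most real-world data is unbounded; consuming a Kafka topic, for example, requires a custom source added with addSource, and this is the kind used most often in production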
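As a rough illustration of the four categories, the sketch below builds one stream from a collection and one from a custom source. The file and socket variants are left as comments because they need an existing file or an open socket. `SourceKinds`, `CountingSource`, the file path, and the host/port are hypothetical names chosen for this example.

```scala
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala._

object SourceKinds {
  // Hypothetical custom source: emits 1 to 5, then finishes.
  class CountingSource extends SourceFunction[Long] {
    @volatile private var running = true

    override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
      var i = 1L
      while (running && i <= 5) {
        ctx.collect(i)
        i += 1
      }
    }

    override def cancel(): Unit = running = false
  }

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // 1. Collection-based: bounded, handy for local tests.
    val fromCollection: DataStream[String] = env.fromCollection(List("a", "b", "c"))

    // 2. File-based (placeholder path):
    // val fromFile: DataStream[String] = env.readTextFile("/tmp/input.txt")

    // 3. Socket-based (placeholder host and port):
    // val fromSocket: DataStream[String] = env.socketTextStream("localhost", 9999)

    // 4. Custom source via addSource; a Kafka consumer is plugged in the same way.
    val fromCustom: DataStream[Long] = env.addSource(new CountingSource)

    fromCollection.print()
    fromCustom.print()
    env.execute("SourceKinds")
  }
}
```

The Kafka consumer used later in this article is itself a SourceFunction, which is why it is attached with addSource.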
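About addSource
Official documentation for integrating Flink with Kafka: https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/connectors/kafka.html#kafka-100-connector
First, the official documentation lists several Maven dependencies, and you pick the connector that matches your Flink and Kafka versions. The Flink Kafka consumer integrates with Flink's checkpointing mechanism to provide exactly-once processing semantics. To achieve this, Flink does not rely solely on Kafka's consumer-group offset tracking; it also tracks and checkpoints the offsets internally.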
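Because the exactly-once guarantee depends on checkpointing, a job normally has to enable it explicitly. A minimal sketch, assuming a 5-second interval (an arbitrary value picked for illustration) and a hypothetical helper named `enableExactlyOnce`:

```scala
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object CheckpointSetup {
  // Hypothetical helper: turns on periodic checkpoints in exactly-once mode.
  def enableExactlyOnce(env: StreamExecutionEnvironment, intervalMs: Long): StreamExecutionEnvironment = {
    env.enableCheckpointing(intervalMs)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    env
  }

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    enableExactlyOnce(env, 5000L)
    println(s"checkpoint interval: ${env.getCheckpointConfig.getCheckpointInterval} ms")
  }
}
```

With checkpointing enabled, the Kafka consumer stores its offsets in Flink's checkpointed state and restores them on recovery; offsets committed back to Kafka then serve only as a progress indicator.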
Example
Steps
Create a Maven project and add the dependencies
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>flinkbase29</artifactId>
<groupId>cn.itcast</groupId>
<version>1.0-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>day01</artifactId>
<properties>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<encoding>UTF-8</encoding>
<scala.version>2.11.2</scala.version>
<scala.compat.version>2.11</scala.compat.version>
<hadoop.version>2.6.0</hadoop.version>
<flink.version>1.7.2</flink.version>
<scala.binary.version>2.11</scala.binary.version>
<iheart.version>1.4.3</iheart.version>
<fastjson.version>1.2.7</fastjson.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka_2.11</artifactId>
<version>1.7.2</version>
</dependency>
<!-- Scala library dependency -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<!-- Flink streaming API for Scala -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_2.11</artifactId>
<version>${flink.version}</version>
</dependency>
<!-- Flink core API for Scala -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-scala_2.11</artifactId>
<version>${flink.version}</version>
</dependency>
<!-- Flink client API -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients_2.11</artifactId>
<version>${flink.version}</version>
</dependency>
<!-- Flink Table API -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table_2.11</artifactId>
<version>${flink.version}</version>
</dependency>
<!-- Hadoop client API -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
<!-- When writing to HDFS, xml-apis must be excluded because it conflicts with dom4j -->
<exclusions>
<exclusion>
<groupId>xml-apis</groupId>
<artifactId>xml-apis</artifactId>
</exclusion>
</exclusions>
</dependency>
<!-- MySQL connector -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.38</version>
</dependency>
<!-- fastjson -->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.60</version>
</dependency>
<dependency>
<groupId>com.jayway.jsonpath</groupId>
<artifactId>json-path</artifactId>
<version>2.3.0</version>
</dependency>
<!-- Flink Kafka 0.11 connector (provides FlinkKafkaConsumer011 and FlinkKafkaProducer011) -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka-0.11_2.11</artifactId>
<version>${flink.version}</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>2.5.1</version>
<configuration>
<source>${maven.compiler.source}</source>
<target>${maven.compiler.target}</target>
<!--<encoding>${project.build.sourceEncoding}</encoding>-->
</configuration>
</plugin>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.0</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
<configuration>
<args>
<!--<arg>-make:transitive</arg>-->
<arg>-dependencyfile</arg>
<arg>${project.build.directory}/.scala_dependencies</arg>
</args>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.18.1</version>
<configuration>
<useFile>false</useFile>
<disableXmlReport>true</disableXmlReport>
<includes>
<include>**/*Test.*</include>
<include>**/*Suite.*</include>
</includes>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<!--
zip -d learn_spark.jar META-INF/*.RSA META-INF/*.DSA META-INF/*.SF
-->
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>cn.itcast.batch.BatchWordCount</mainClass>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
Write the code
package cn.itcast.streaming

import java.util
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._ // also brings in the implicit TypeInformation conversions
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition

object Kafka2Flink {
  /**
   * 1. Get the execution environment
   * 2. Configure the Kafka consumer properties (including dynamic partition discovery)
   * 3. Create the Flink Kafka consumer
   * 4. Build the offsets in a java.util.HashMap[KafkaTopicPartition, java.lang.Long]
   * 5. Choose the position to start consuming from
   * 6. Add the source
   * 7. Print the result
   * 8. Execute the job
   */
  def main(args: Array[String]): Unit = {
    // 1. Get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    // 2. Configure the Kafka consumer properties
    val props = new Properties()
    // Kafka broker addresses used for the initial connection
    props.setProperty("bootstrap.servers", "node01:9092,node02:9092,node03:9092")
    // Consumer group id (used by option 5.5)
    props.setProperty("group.id", "test01050801")
    // Note: the Flink consumer overrides both deserializers with ByteArrayDeserializer
    // and decodes records with the DeserializationSchema passed to its constructor
    props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    // Discover newly added topic partitions, checking every five seconds
    props.setProperty("flink.partition-discovery.interval-millis", "5000")

    // 3. Create the consumer that Flink uses to read from Kafka
    val myConsumer = new FlinkKafkaConsumer011[String]("test", new SimpleStringSchema(), props)

    // 4. Build the start offsets (used by option 5.3).
    // KafkaTopicPartition takes the topic name and the partition number;
    // the map value is the offset to start from in that partition.
    val offsets = new util.HashMap[KafkaTopicPartition, java.lang.Long]()
    offsets.put(new KafkaTopicPartition("test", 0), 100L)
    offsets.put(new KafkaTopicPartition("test", 1), 100L)
    offsets.put(new KafkaTopicPartition("test", 2), 100L)

    // 5. Choose where to start consuming. Keep only one of the following calls:
    //    each call overrides the previous one, so as written the last one wins.
    // 5.1 From the earliest record in the topic
    myConsumer.setStartFromEarliest()
    // 5.2 From a given timestamp (epoch milliseconds, not seconds)
    myConsumer.setStartFromTimestamp(1588867200000L)
    // 5.3 From the specified offsets
    myConsumer.setStartFromSpecificOffsets(offsets)
    // 5.4 From the latest record in the topic
    myConsumer.setStartFromLatest()
    // 5.5 From the offsets last committed by the consumer group
    myConsumer.setStartFromGroupOffsets()

    // 6. Add the source
    val text: DataStream[String] = env.addSource(myConsumer)

    // 7. Print the result
    text.print()

    // 8. Execute the job
    env.execute("Kafka2Flink")
  }
}
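Concepts used in the code
Deserialization schema types
Overview
Why is a deserialization schema needed?
First, the FlinkKafkaConsumer011 constructor takes three arguments, and the second must be a deserialization schema. Second, the schema converts the raw bytes delivered by the source (Kafka, for example) into Java/Scala objects that Flink can process.
Categories
SimpleStringSchema:
Deserializes messages into strings. When a message is received but cannot be deserialized, there are two ways to handle it:
1. Throw an exception from the deserialize(...) method. This fails the job, which is then restarted. Note that because of the consumer's fault tolerance, the failed message is read again after the restart and fails again, so the job can end up in a continuous fail-and-restart loop.
2. Return null from the deserialize(...) method, which makes the Flink Kafka consumer silently skip the message.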
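The second option can be sketched as a small custom schema. `SkipBadLongSchema` is a hypothetical class written for this illustration, not part of Flink; it assumes each record is a UTF-8 string that should contain a number.

```scala
import java.nio.charset.StandardCharsets
import org.apache.flink.api.common.serialization.AbstractDeserializationSchema

// Hypothetical schema: parses each record as a Long and skips records that are not numbers.
class SkipBadLongSchema extends AbstractDeserializationSchema[java.lang.Long] {
  override def deserialize(message: Array[Byte]): java.lang.Long =
    if (message == null) {
      // Records without a value are skipped as well.
      null
    } else {
      try {
        java.lang.Long.valueOf(new String(message, StandardCharsets.UTF_8).trim)
      } catch {
        // Returning null tells the Flink Kafka consumer to drop this record.
        case _: NumberFormatException => null
      }
    }
}
```

It would be passed to the consumer in place of SimpleStringSchema, for example `new FlinkKafkaConsumer011[java.lang.Long]("test", new SkipBadLongSchema, props)`. Skipping is silent, so logging or counting the dropped records is worth considering in a real job.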
JSONDeserializationSchema / JSONKeyValueDeserializationSchema:
Deserialize serialized JSON into an ObjectNode. Fields can then be read with objectNode.get("field").as(Int/String/...)().
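The accessor pattern above corresponds to Jackson methods such as asInt() and asText(). The sketch below uses plain Jackson (the unshaded com.fasterxml.jackson dependency added in the sink section) rather than the Flink schema itself, so treat it only as an illustration of the ObjectNode API; `ObjectNodeAccess` and the sample JSON are made up for this example.

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.databind.node.ObjectNode

object ObjectNodeAccess {
  private val mapper = new ObjectMapper()

  // Parses a JSON object string into an ObjectNode.
  def parse(json: String): ObjectNode =
    mapper.readTree(json).asInstanceOf[ObjectNode]

  def main(args: Array[String]): Unit = {
    val node = parse("""{"id": 1, "name": "ZhangSan"}""")
    val id: Int = node.get("id").asInt()          // numeric field
    val name: String = node.get("name").asText()  // string field
    println(s"id=$id, name=$name")
  }
}
```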
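TypeInformationSerializationSchema / TypeInformationKeyValueSerializationSchema: (for cases where Flink both writes and reads the data)
These create a schema based on Flink's TypeInformation, which is useful for data that is written by Flink and later read back by Flink. This Flink-specific schema can perform better than generic serialization approaches.
Configuring where the Kafka consumer starts reading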
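The available start positions are the ones listed as options 5.1 to 5.5 in the consumer code above. For the timestamp variant, `setStartFromTimestamp` takes epoch milliseconds rather than seconds. A small sketch of computing that value, assuming the intended start is midnight on 2020-05-08 in the Asia/Shanghai time zone (a guess based on the value used in the example); `StartTimestamp` and `epochMillis` are hypothetical names.

```scala
import java.time.{LocalDateTime, ZoneId}

object StartTimestamp {
  // Converts a local date-time in the given zone to epoch milliseconds.
  def epochMillis(dateTime: LocalDateTime, zone: ZoneId): Long =
    dateTime.atZone(zone).toInstant.toEpochMilli

  def main(args: Array[String]): Unit = {
    val millis = epochMillis(LocalDateTime.of(2020, 5, 8, 0, 0), ZoneId.of("Asia/Shanghai"))
    println(millis) // prints 1588867200000
  }
}
```

The result can be passed straight to the consumer, for example `myConsumer.setStartFromTimestamp(millis)`.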
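Dynamic Kafka partition discovery in Flink
Flink differs from Spark on this point: Spark discovers newly added Kafka partitions without extra configuration, whereas Flink requires the "flink.partition-discovery.interval-millis" property to be set, for example properties.setProperty("flink.partition-discovery.interval-millis", "5000");
The second argument is the discovery interval in milliseconds. Partition discovery is disabled by default and is enabled by setting this property to a non-negative value.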
Using Kafka as a Flink Sink
Steps
First add three dependencies for converting between JSON/XML and objects
<!-- Dependencies for converting between JSON/XML and objects -->
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
<version>2.9.9</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.9.9.3</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.module</groupId>
<artifactId>jackson-module-scala_2.11</artifactId>
<version>2.9.9</version>
</dependency>
Write the code
import java.util.Properties

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._ // also brings in the implicit TypeInformation conversions
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper

object Flink2Kafka {
  // Topic name kept in one place so it is easy to change
  val sinkTopic = "test"

  // Case class that models the test records
  case class Student(id: Int, name: String, addr: String, sex: String)

  // ObjectMapper with the Scala module registered once, so case classes serialize correctly
  val mapper: ObjectMapper = new ObjectMapper().registerModule(DefaultScalaModule)

  // Converts an object to its JSON string representation
  def toJsonString(value: AnyRef): String =
    mapper.writeValueAsString(value)

  def main(args: Array[String]): Unit = {
    /**
     * 1. Get the stream execution environment
     * 2. Create the data source and generate the data
     * 3. Convert the data to strings
     * 4. Configure the Kafka properties
     * 5. Sink the data to Kafka with FlinkKafkaProducer011
     */
    // 1. Get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    // 2. Generate the data
    val stuDataStream: DataStream[Student] = env.fromElements(
      Student(1, "ZhangSan", "Beijing", "Male"),
      Student(2, "LiSi", "Shanghai", "Female")
    )

    // 3. Convert each Student to a JSON string
    val finalDataStream: DataStream[String] = stuDataStream.map(student => toJsonString(student))

    // 4. Configure the Kafka properties
    val props = new Properties()
    props.setProperty("bootstrap.servers", "node01:9092,node02:9092,node03:9092")

    // 5. Create the FlinkKafkaProducer011
    val myProducer = new FlinkKafkaProducer011[String](
      sinkTopic,
      new KeyedSerializationSchemaWrapper[String](new SimpleStringSchema()),
      props
    )

    // 6. Add the sink
    finalDataStream.addSink(myProducer)

    // 7. Print the data
    finalDataStream.print()

    // 8. Execute the job
    env.execute("Flink2Kafka")
  }
}
The messages are consumed successfully