
Kafka Series (5): Kafka Architecture (English Edition)



Kafka: a high-throughput distributed publish-subscribe messaging system

  • fast
  • scalable
  • durable
  • real-time
Kafka Architecture

[Figure: Kafka architecture diagram]

  • Kafka maintains feeds of messages in categories called Topics
  • Producers publish messages to a Kafka topic
  • Consumers subscribe to topics and process the feed of published messages
  • Servers in a Kafka cluster are called brokers
Kafka Topic
  • Topics: named categories of messages to which records are published
  • Partitions: each topic is split into one or more partitions, the unit of parallelism and ordering
  • Logs: each partition is stored on disk as an ordered, append-only log of messages
  • Retention Period: messages are kept for a configurable time (or size), whether or not they have been consumed
  • Consumers maintain and track their locations/offsets in each log
# Edit the shell startup script
vi .bashrc


# .bashrc

# User specific aliases and functions

alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'

# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi
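
# Sandbox-specific additions: default to Spark 2, stop .bash_logout from clearing
# the shell history, and put the Kafka CLI tools on the PATH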
export SPARK_MAJOR_VERSION=2
sed -i s/'history -cw'//g .bash_logout
export PATH=/usr/hdp/current/kafka-broker/bin:$PATH
# Start the Kafka broker
kafka-server-start.sh /usr/hdp/2.6.4.0-91/kafka/config/server.properties
# Create a topic with 3 partitions and a replication factor of 1
kafka-topics.sh --zookeeper sandbox-hdp.hortonworks.com:2181 --create --topic test --partitions 3 --replication-factor 1


# List topics
kafka-topics.sh --zookeeper sandbox-hdp.hortonworks.com:2181  --list

# Describe the topic's partitions
kafka-topics.sh --zookeeper sandbox-hdp.hortonworks.com:2181 --describe --topic test

# Change the topic's data retention time (10 seconds here)
kafka-topics.sh --zookeeper sandbox-hdp.hortonworks.com:2181 --alter --topic test --config retention.ms=10000

# Remove the retention override and fall back to the broker default
kafka-topics.sh --zookeeper sandbox-hdp.hortonworks.com:2181 --alter --topic test --delete-config retention.ms


Kafka Message Flow

[Figure: Kafka message flow]

The producer can control which partition a message is sent to (by default, key-less messages are spread across partitions round-robin).

A consumer can also read from a specific partition instead of the whole topic, as shown in the sketch below.
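As a rough illustration of both points, the Java sketch below (using the standard Kafka clients; the broker address sandbox-hdp.hortonworks.com:6667 is an assumption based on the HDP sandbox's usual listener port, and it reuses the test topic created above) writes a record to an explicit partition and then reads only that partition via assign() instead of subscribing to the whole topic.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class PartitionDemo {
    static final String BROKERS = "sandbox-hdp.hortonworks.com:6667"; // assumed HDP listener

    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", BROKERS);
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Producer side: name partition 0 explicitly instead of letting the partitioner decide
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("test", 0, "k1", "sent straight to partition 0"));
        }

        Properties c = new Properties();
        c.put("bootstrap.servers", BROKERS);
        c.put("group.id", "partition-demo");      // only needed if offsets are committed
        c.put("auto.offset.reset", "earliest");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Consumer side: assign() pins the consumer to partition 0 only, bypassing group rebalancing
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.assign(Collections.singletonList(new TopicPartition("test", 0)));
            ConsumerRecords<String, String> records = consumer.poll(2000); // poll(long) on 0.10/1.0-era clients
            records.forEach(r -> System.out.printf("p%d@%d %s=%s%n", r.partition(), r.offset(), r.key(), r.value()));
        }
    }
}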

Kafka High-Throughput & Low-Latency
  • Batching of individual messages (see the configuration sketch below)
  • Zero copy I/O using sendfile
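Batching is mostly driven by producer configuration. A minimal, hedged fragment of the relevant settings (the values here are arbitrary illustrations, not recommendations) that can be dropped into the producer sketches later in this article:

Properties props = new Properties();
props.put("bootstrap.servers", "sandbox-hdp.hortonworks.com:6667"); // assumed HDP listener
props.put("batch.size", 65536);          // collect up to 64 KB per partition before sending
props.put("linger.ms", 10);              // wait up to 10 ms for more records to fill a batch
props.put("compression.type", "snappy"); // compress whole batches on the wire
props.put("buffer.memory", 33554432);    // total memory the producer may use for buffering
// Zero-copy (sendfile) happens on the broker when serving consumers; no client setting is needed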
Kafka Broker
  • Each partition has one server acting as a leader, and zero or more servers acting as followers / ISRs

    Ideally the broker count is noticeably larger than the partition count, and each partition's replicas are placed on brokers other than the one hosting the leader

  • Each server acts as a leader for some partitions and a follower for others, so load is well balanced

    Ideally each broker is the leader for only a small share of the partitions, so leadership (and client traffic) is spread across the cluster; the sketch below prints each partition's leader and ISR
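To see how leaders, replicas, and ISRs are spread across brokers, one option is the Java AdminClient (available since Kafka 0.11; the broker address is again the assumed sandbox listener). A minimal sketch:

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class DescribeTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "sandbox-hdp.hortonworks.com:6667"); // assumed HDP listener
        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, TopicDescription> topics =
                    admin.describeTopics(Collections.singleton("test")).all().get();
            topics.get("test").partitions().forEach(info ->
                    System.out.printf("partition %d leader=%s replicas=%s isr=%s%n",
                            info.partition(), info.leader(), info.replicas(), info.isr()));
        }
    }
}

The kafka-topics.sh --describe command shown earlier reports the same information from the command line.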

Kafka Producer
  • Producers publish data to the topics of their choice

    The partition assignment strategy (partitioner) controls which partition each message is sent to

  • Async Publishing (less durable)

    Setting acks to 0 gives the weakest data-reliability guarantee

  • All nodes can answer metadata requests about which servers are alive and where the partition leaders are

    Any broker can give a client the location information it needs to route messages (see the producer sketch below)
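A minimal asynchronous-publishing sketch in Java, assuming the same sandbox broker address as above: send() returns immediately, and the callback reports the partition/offset (or the error) once the broker acknowledges the batch.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AsyncProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "sandbox-hdp.hortonworks.com:6667"); // assumed HDP listener
        props.put("acks", "1");  // leader-only ack; "0" is fire-and-forget, the least durable option
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                ProducerRecord<String, String> record = new ProducerRecord<>("test", "key-" + i, "value-" + i);
                // Asynchronous send: the call returns immediately, the callback fires on completion
                producer.send(record, (metadata, exception) -> {
                    if (exception != null) {
                        exception.printStackTrace();
                    } else {
                        System.out.printf("wrote to partition %d at offset %d%n",
                                metadata.partition(), metadata.offset());
                    }
                });
            }
            producer.flush(); // push out anything still sitting in the buffer
        }
    }
}

Because any broker can answer metadata requests, bootstrap.servers only needs to list one or a few brokers; the client discovers the rest of the cluster from them.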

Kafka Consumer

Consumption granularity is flexible: a consumer can subscribe to whole topics or read from specific partitions.

  • Consumers consume messages through subscriptions

  • Multiple Consumers can read from the same topics

  • Consumers are organized into Consumer Groups

    Within a consumer group, each partition is consumed by at most one consumer,

    so the number of partitions should be at least the number of consumers

  • Kafka offers messages to Consumer Groups, not to individual Consumer instances directly

    Offsets are tracked at the consumer-group level: each consumer in the group knows how far it has read in its partitions, and if a consumer fails, the consumer that takes over continues from the last committed offset

  • Messages remain on Kafka; they are not removed after they are consumed

  • Messaging models

    Queue: a message goes to one of the consumers.

      ⚫ All consumers are in the same Consumer Group

    Publish-Subscribe: a message goes to all consumers.

      ⚫ All consumers are assigned to different Consumer Groups

    In the publish-subscribe model, one consumer in each group processes the message, so the different groups consume the same data in parallel (see the consumer-group sketch below).
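A rough consumer-group sketch in Java (same assumed sandbox broker): every instance started with the group.id below shares the partitions of test queue-style, while instances started with a different group.id each receive the full stream, i.e. publish-subscribe.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "sandbox-hdp.hortonworks.com:6667"); // assumed HDP listener
        props.put("group.id", "demo-group");        // consumers sharing this id split the partitions
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000); // poll(long) on 0.10/1.0-era clients
                records.forEach(r ->
                        System.out.printf("group=demo-group partition=%d offset=%d value=%s%n",
                                r.partition(), r.offset(), r.value()));
            }
        }
    }
}

Run two copies with the same group.id and each gets a subset of the three partitions (queue semantics); change the group.id on one copy and both receive every message (publish-subscribe).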

Kafka ZooKeeper
  • ZooKeeper stores the cluster's broker and topic metadata; starting from 0.10, Kafka keeps consumer offsets in its own internal topic (__consumer_offsets) rather than in ZooKeeper (see the offset-commit fragment below)
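To control exactly when offsets land in that internal topic, turn off auto-commit and commit after processing. A hedged fragment that drops into the poll loop of the consumer sketch above (process(...) is a hypothetical placeholder for your own handling logic):

props.put("enable.auto.commit", "false");   // do not commit offsets on a timer

// inside the poll loop, after the records have been fully processed:
ConsumerRecords<String, String> records = consumer.poll(1000);
records.forEach(r -> process(r));           // process(...) stands in for application logic
consumer.commitSync();                      // writes the group's offsets to __consumer_offsets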
Kafka API

The Producer API allows an application to publish a stream of records to one or more Kafka topics.

The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.

The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams (see the sketch after these descriptions).

The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems.
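For the Streams API, a minimal topology sketch, assuming the kafka-streams dependency and a client new enough to have StreamsBuilder (roughly 1.0+); the output topic name is illustrative. It reads test, upper-cases each value, and writes the result to another topic.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "sandbox-hdp.hortonworks.com:6667"); // assumed
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("test");         // input topic
        source.mapValues(value -> value.toUpperCase()).to("test-upper"); // transformed output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}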

Message Ordering

To ensure global ordering for a topic:

◼ If all messages must be ordered within one topic, use one partition

For global ordering, set the number of partitions to one.

◼ If messages can be ordered by certain properties

    ⚫ Group messages in a partition by Key (derived from message properties in the producer); see the keyed-producer sketch below

    ⚫ Configure exactly one consumer instance per partition within a consumer group

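A small illustration of per-key ordering, reusing the producer setup from the earlier sketches: records that share a key hash to the same partition, so they are consumed in the order they were produced. The keys and topic name here are purely illustrative.

// All events for the same key (e.g. one order ID) go to the same partition,
// so Kafka preserves their relative order.
producer.send(new ProducerRecord<>("test", "order-42", "created"));
producer.send(new ProducerRecord<>("test", "order-42", "paid"));
producer.send(new ProducerRecord<>("test", "order-42", "shipped"));  // same key, same partition, in order
producer.send(new ProducerRecord<>("test", "order-7", "created"));   // a different key may land elsewhere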
Message Replication

◼ 0 – the producer never waits for an ack

◼ 1 – the producer gets an ack after the leader replica has received the data

◼ -1 / all – the producer gets an ack only after all ISRs (in-sync replicas) have received the data (see the configuration fragment below)
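These are the values of the producer acks setting. A hedged configuration fragment for the producer properties used in the earlier sketches; for the all / -1 case, durability also depends on the topic's min.insync.replicas, which can be set with the kafka-topics.sh --alter --config command shown earlier.

// Pick one, trading latency for durability:
props.put("acks", "0");    // fire-and-forget: no broker acknowledgement at all
props.put("acks", "1");    // acknowledged once the leader replica has the data
props.put("acks", "all");  // acknowledged only after all in-sync replicas have the data ("-1" is equivalent)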

Data Loss at the Producer

◆ Kafka Producer API

    ◼ Messages are accumulated in a buffer and sent in batches

    ◼ Messages are batched by partition and retried at the batch level

    ◼ Batches that expire after their retries are exhausted are dropped

◆ Data Loss at the Producer

    ◼ Failing to close / flush the producer on termination, so the last buffered messages are never sent

    ◼ Batches dropped due to communication or other errors when acks = 0, or after retry exhaustion (the sketch below shows the usual producer-side defenses)
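A hedged sketch of those defenses, reusing props, producer, and record from the async producer sketch above: give batches room to be retried, surface send errors through the callback, and always flush/close the producer on shutdown so the last buffered batch is not silently dropped.

props.put("acks", "all");                               // do not run with acks=0 if loss matters
props.put("retries", Integer.MAX_VALUE);                // keep retrying failed batches
props.put("max.in.flight.requests.per.connection", 1);  // avoid reordering across retried batches

producer.send(record, (metadata, exception) -> {
    if (exception != null) {
        // the batch failed fatally or was dropped after exhausting retries; log or divert it
        exception.printStackTrace();
    }
});

// On termination: flush pending batches, then close
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    producer.flush();
    producer.close();
}));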

Data Delivered but Loss in the Cluster

[Figure: data delivered but lost within the cluster]

Data Duplication

Data duplication can happen on both the producer side (e.g. retried sends) and the consumer side (e.g. reprocessing after a crash before offsets are committed); see the fragment below.

[Figure: data duplication scenarios]
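One common producer-side mitigation (available since Kafka 0.11) is to make retries idempotent, so a retried batch is not written twice; the fragment below adds it to the producer properties from the earlier sketches. On the consumer side, committing offsets only after processing (as in the commitSync fragment earlier) trades duplicates-on-failure for no loss, so downstream processing should itself be idempotent.

props.put("enable.idempotence", "true"); // the broker de-duplicates retried batches (Kafka 0.11+)
props.put("acks", "all");                // idempotence requires acks=all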

Kafka Use Cases

◆ Real-Time Stream Processing (combined with Spark Streaming)

◆ General purpose Message Bus

◆ Collecting User Activity Data


◆ Collecting Operational Metrics from Applications, Servers or Devices

◆ Log Aggregation (combined with ELK)

◆ Change Data Capture

◆ Commit Log for distributed systems