
Kafka Series (5): Kafka Architecture (English Edition)



Kafka: a high-throughput distributed publish-subscribe messaging system

  • fast
  • scalable
  • durable
  • real-time
Kafka Architecture

[Figure: Kafka architecture diagram]

  • Kafka maintains feeds of messages in categories called Topics
  • Producers publish messages to a Kafka topic
  • Consumers subscribe to topics and process the feed of published messages
  • Servers in a Kafka cluster are called brokers
Kafka Topic
  • Topics: named categories of messages to which records are published
  • Partitions: each topic is split into one or more partitions, the unit of parallelism and ordering
  • Logs: each partition is stored on disk as an ordered, append-only log of messages
  • Retention Period: messages are kept for a configurable time (or size), whether or not they have been consumed
  • Consumers maintain and track their locations/offsets in each log
# Edit the shell startup script
vi .bashrc


# .bashrc

# User specific aliases and functions

alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'

# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi
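
# Sandbox-specific additions: default to Spark 2, stop .bash_logout from clearing
# the shell history, and put the Kafka CLI tools on the PATH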
export SPARK_MAJOR_VERSION=2
sed -i s/'history -cw'//g .bash_logout
export PATH=/usr/hdp/current/kafka-broker/bin:$PATH
# Start the Kafka broker
kafka-server-start.sh /usr/hdp/2.6.4.0-91/kafka/config/server.properties
# Create a topic with 3 partitions and a replication factor of 1
kafka-topics.sh --zookeeper sandbox-hdp.hortonworks.com:2181 --create --topic test --partitions 3 --replication-factor 1


# List topics
kafka-topics.sh --zookeeper sandbox-hdp.hortonworks.com:2181  --list

# Describe the topic's partitions
kafka-topics.sh --zookeeper sandbox-hdp.hortonworks.com:2181 --describe --topic test

# Change the topic's data retention time (10 seconds here)
kafka-topics.sh --zookeeper sandbox-hdp.hortonworks.com:2181 --alter --topic test --config retention.ms=10000

# Remove the retention override and fall back to the broker default
kafka-topics.sh --zookeeper sandbox-hdp.hortonworks.com:2181 --alter --topic test --delete-config retention.ms


Kafka Message Flow

[Figure: Kafka message flow]

The producer can control which partition a message is sent to (by default, key-less messages are spread across partitions round-robin).

A consumer can also read from a specific partition instead of the whole topic, as shown in the sketch below.
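As a rough illustration of both points, the Java sketch below (using the standard Kafka clients; the broker address sandbox-hdp.hortonworks.com:6667 is an assumption based on the HDP sandbox's usual listener port, and it reuses the test topic created above) writes a record to an explicit partition and then reads only that partition via assign() instead of subscribing to the whole topic.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class PartitionDemo {
    static final String BROKERS = "sandbox-hdp.hortonworks.com:6667"; // assumed HDP listener

    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", BROKERS);
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Producer side: name partition 0 explicitly instead of letting the partitioner decide
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("test", 0, "k1", "sent straight to partition 0"));
        }

        Properties c = new Properties();
        c.put("bootstrap.servers", BROKERS);
        c.put("group.id", "partition-demo");      // only needed if offsets are committed
        c.put("auto.offset.reset", "earliest");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Consumer side: assign() pins the consumer to partition 0 only, bypassing group rebalancing
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.assign(Collections.singletonList(new TopicPartition("test", 0)));
            ConsumerRecords<String, String> records = consumer.poll(2000); // poll(long) on 0.10/1.0-era clients
            records.forEach(r -> System.out.printf("p%d@%d %s=%s%n", r.partition(), r.offset(), r.key(), r.value()));
        }
    }
}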

Kafka High-Throughput & Low-Latency
  • Batching of individual messages (see the configuration sketch below)
  • Zero copy I/O using sendfile
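Batching is mostly driven by producer configuration. A minimal, hedged fragment of the relevant settings (the values here are arbitrary illustrations, not recommendations) that can be dropped into the producer sketches later in this article:

Properties props = new Properties();
props.put("bootstrap.servers", "sandbox-hdp.hortonworks.com:6667"); // assumed HDP listener
props.put("batch.size", 65536);          // collect up to 64 KB per partition before sending
props.put("linger.ms", 10);              // wait up to 10 ms for more records to fill a batch
props.put("compression.type", "snappy"); // compress whole batches on the wire
props.put("buffer.memory", 33554432);    // total memory the producer may use for buffering
// Zero-copy (sendfile) happens on the broker when serving consumers; no client setting is needed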
Kafka Broker
  • Each partition has one server acting as a leader, and zero or more servers acting as followers / ISRs

    Ideally the broker count is noticeably larger than the partition count, and each partition's replicas are placed on brokers other than the one hosting the leader

  • Each server acts as a leader for some partitions and a follower for others, so load is well balanced

    Ideally each broker is the leader for only a small share of the partitions, so leadership (and client traffic) is spread across the cluster; the sketch below prints each partition's leader and ISR
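To see how leaders, replicas, and ISRs are spread across brokers, one option is the Java AdminClient (available since Kafka 0.11; the broker address is again the assumed sandbox listener). A minimal sketch:

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class DescribeTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "sandbox-hdp.hortonworks.com:6667"); // assumed HDP listener
        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, TopicDescription> topics =
                    admin.describeTopics(Collections.singleton("test")).all().get();
            topics.get("test").partitions().forEach(info ->
                    System.out.printf("partition %d leader=%s replicas=%s isr=%s%n",
                            info.partition(), info.leader(), info.replicas(), info.isr()));
        }
    }
}

The kafka-topics.sh --describe command shown earlier reports the same information from the command line.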

Kafka Producer
  • Producers publish data to the topics of their choice

    The partition assignment strategy (partitioner) controls which partition each message is sent to

  • Async Publishing (less durable)

    Setting acks to 0 gives the weakest data-reliability guarantee

  • All nodes can answer metadata requests about which servers are alive and where the partition leaders are

    Any broker can give a client the location information it needs to route messages (see the producer sketch below)
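A minimal asynchronous-publishing sketch in Java, assuming the same sandbox broker address as above: send() returns immediately, and the callback reports the partition/offset (or the error) once the broker acknowledges the batch.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AsyncProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "sandbox-hdp.hortonworks.com:6667"); // assumed HDP listener
        props.put("acks", "1");  // leader-only ack; "0" is fire-and-forget, the least durable option
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                ProducerRecord<String, String> record = new ProducerRecord<>("test", "key-" + i, "value-" + i);
                // Asynchronous send: the call returns immediately, the callback fires on completion
                producer.send(record, (metadata, exception) -> {
                    if (exception != null) {
                        exception.printStackTrace();
                    } else {
                        System.out.printf("wrote to partition %d at offset %d%n",
                                metadata.partition(), metadata.offset());
                    }
                });
            }
            producer.flush(); // push out anything still sitting in the buffer
        }
    }
}

Because any broker can answer metadata requests, bootstrap.servers only needs to list one or a few brokers; the client discovers the rest of the cluster from them.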

Kafka Consumer

Consumption granularity is flexible: a consumer can subscribe to whole topics or read from specific partitions.

  • Consumers consume messages through subscriptions

  • Multiple Consumers can read from the same topics

  • Consumers are organized into Consumer Groups

    Within a consumer group, each partition is consumed by at most one consumer,

    so the number of partitions should be at least the number of consumers

  • Kafka offers messages to Consumer Groups, not to individual Consumer instances directly

    Offsets are tracked at the consumer-group level: each consumer in the group knows how far it has read in its partitions, and if a consumer fails, the consumer that takes over continues from the last committed offset

  • Messages remain on Kafka; they are not removed after they are consumed

  • Messaging models

    Queue: a message goes to one of the consumers.

      ⚫ All consumers are in the same Consumer Group

    Publish-Subscribe: a message goes to all consumers.

      ⚫ All consumers are assigned to different Consumer Groups

    In the publish-subscribe model, one consumer in each group processes the message, so the different groups consume the same data in parallel (see the consumer-group sketch below).
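A rough consumer-group sketch in Java (same assumed sandbox broker): every instance started with the group.id below shares the partitions of test queue-style, while instances started with a different group.id each receive the full stream, i.e. publish-subscribe.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "sandbox-hdp.hortonworks.com:6667"); // assumed HDP listener
        props.put("group.id", "demo-group");        // consumers sharing this id split the partitions
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000); // poll(long) on 0.10/1.0-era clients
                records.forEach(r ->
                        System.out.printf("group=demo-group partition=%d offset=%d value=%s%n",
                                r.partition(), r.offset(), r.value()));
            }
        }
    }
}

Run two copies with the same group.id and each gets a subset of the three partitions (queue semantics); change the group.id on one copy and both receive every message (publish-subscribe).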

Kafka ZooKeeper
  • ZooKeeper stores the cluster's broker and topic metadata; starting from 0.10, Kafka keeps consumer offsets in its own internal topic (__consumer_offsets) rather than in ZooKeeper (see the offset-commit fragment below)
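To control exactly when offsets land in that internal topic, turn off auto-commit and commit after processing. A hedged fragment that drops into the poll loop of the consumer sketch above (process(...) is a hypothetical placeholder for your own handling logic):

props.put("enable.auto.commit", "false");   // do not commit offsets on a timer

// inside the poll loop, after the records have been fully processed:
ConsumerRecords<String, String> records = consumer.poll(1000);
records.forEach(r -> process(r));           // process(...) stands in for application logic
consumer.commitSync();                      // writes the group's offsets to __consumer_offsets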
Kafka API

The Producer API allows an application to publish a stream of records to one or more Kafka topics.

The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.

The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams (see the sketch after these descriptions).

The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems.
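For the Streams API, a minimal topology sketch, assuming the kafka-streams dependency and a client new enough to have StreamsBuilder (roughly 1.0+); the output topic name is illustrative. It reads test, upper-cases each value, and writes the result to another topic.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "sandbox-hdp.hortonworks.com:6667"); // assumed
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("test");         // input topic
        source.mapValues(value -> value.toUpperCase()).to("test-upper"); // transformed output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}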

Message Ordering

To ensure global ordering for a topic:

◼ If all messages must be ordered within one topic, use one partition

For global ordering, set the number of partitions to one.

◼ If messages can be ordered by certain properties

    ⚫ Group messages in a partition by Key (derived from message properties in the producer); see the keyed-producer sketch below

    ⚫ Configure exactly one consumer instance per partition within a consumer group

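A small illustration of per-key ordering, reusing the producer setup from the earlier sketches: records that share a key hash to the same partition, so they are consumed in the order they were produced. The keys and topic name here are purely illustrative.

// All events for the same key (e.g. one order ID) go to the same partition,
// so Kafka preserves their relative order.
producer.send(new ProducerRecord<>("test", "order-42", "created"));
producer.send(new ProducerRecord<>("test", "order-42", "paid"));
producer.send(new ProducerRecord<>("test", "order-42", "shipped"));  // same key, same partition, in order
producer.send(new ProducerRecord<>("test", "order-7", "created"));   // a different key may land elsewhere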
Message Replication

◼ 0 – the producer never waits for an ack

◼ 1 – the producer gets an ack after the leader replica has received the data

◼ -1 / all – the producer gets an ack only after all ISRs (in-sync replicas) have received the data (see the configuration fragment below)
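These are the values of the producer acks setting. A hedged configuration fragment for the producer properties used in the earlier sketches; for the all / -1 case, durability also depends on the topic's min.insync.replicas, which can be set with the kafka-topics.sh --alter --config command shown earlier.

// Pick one, trading latency for durability:
props.put("acks", "0");    // fire-and-forget: no broker acknowledgement at all
props.put("acks", "1");    // acknowledged once the leader replica has the data
props.put("acks", "all");  // acknowledged only after all in-sync replicas have the data ("-1" is equivalent)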

Data Loss at the Producer

◆ Kafka Producer API

    ◼ Messages are accumulated in a buffer and sent in batches

    ◼ Messages are batched by partition and retried at the batch level

    ◼ Batches that expire after their retries are exhausted are dropped

◆ Data Loss at the Producer

    ◼ Failing to close / flush the producer on termination, so the last buffered messages are never sent

    ◼ Batches dropped due to communication or other errors when acks = 0, or after retry exhaustion (the sketch below shows the usual producer-side defenses)
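A hedged sketch of those defenses, reusing props, producer, and record from the async producer sketch above: give batches room to be retried, surface send errors through the callback, and always flush/close the producer on shutdown so the last buffered batch is not silently dropped.

props.put("acks", "all");                               // do not run with acks=0 if loss matters
props.put("retries", Integer.MAX_VALUE);                // keep retrying failed batches
props.put("max.in.flight.requests.per.connection", 1);  // avoid reordering across retried batches

producer.send(record, (metadata, exception) -> {
    if (exception != null) {
        // the batch failed fatally or was dropped after exhausting retries; log or divert it
        exception.printStackTrace();
    }
});

// On termination: flush pending batches, then close
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    producer.flush();
    producer.close();
}));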

Data Delivered but Loss in the Cluster

[Figure: data delivered but lost within the cluster]

Data Duplication

Data duplication can happen on both the producer side (e.g. retried sends) and the consumer side (e.g. reprocessing after a crash before offsets are committed); see the fragment below.

[Figure: data duplication scenarios]
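One common producer-side mitigation (available since Kafka 0.11) is to make retries idempotent, so a retried batch is not written twice; the fragment below adds it to the producer properties from the earlier sketches. On the consumer side, committing offsets only after processing (as in the commitSync fragment earlier) trades duplicates-on-failure for no loss, so downstream processing should itself be idempotent.

props.put("enable.idempotence", "true"); // the broker de-duplicates retried batches (Kafka 0.11+)
props.put("acks", "all");                // idempotence requires acks=all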

Kafka Use Cases

◆ Real-Time Stream Processing (combined with Spark Streaming)

◆ General purpose Message Bus

◆ Collecting User Activity Data


◆ Collecting Operational Metrics from Applications, Servers or Devices

◆ Log Aggregation (combined with ELK)

◆ Change Data Capture

◆ Commit Log for distributed systems