Introduction to Kafka

Kafka was originally developed at LinkedIn as a real-time messaging system.

LinkedIn and other companies use Apache Kafka to manage the streams of data coming from their applications: it processes each piece of data and sends it where it needs to go next. Kafka has improved a lot since its early days and is now an open-source Apache project known as Apache Kafka.

Features of Apache Kafka

A distributed system is a group of machines working together so that they appear as a single computer to the end user. Kafka is such a distributed system: it sends, receives, and stores messages across a cluster made up of different nodes (brokers).
Kafka is horizontally scalable. Vertical scaling means increasing the resources (CPU, RAM, SSD) of a single server, which is a burden for companies because each machine has a hardware limit and adding resources usually requires server downtime. Horizontal scaling solves the problem by adding more machines to the cluster instead.

Kafka achieves fault tolerance by replicating each partition across a number of servers.
A commit log (also called a write-ahead log or transaction log) is an append-only, totally ordered sequence of records, ordered by time.

Apache Kafka commit log

Once a record is written, it cannot be modified or deleted, and reads proceed from left to right. Each entry is assigned a unique sequential log entry number, which guarantees the ordering of items. For distributed systems, this property turns out to be essential.
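
To make the append-only idea concrete, here is a minimal in-memory sketch. The `CommitLog` class and its method names are hypothetical and only illustrate the concept; this is not how Kafka stores data on disk.

```python
class CommitLog:
    """A minimal in-memory append-only log (illustrative only)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record to the end and return its sequential entry number."""
        offset = len(self._records)
        self._records.append(record)
        return offset

    def read(self, start_offset=0):
        """Return (offset, record) pairs from start_offset to the end, in order."""
        return [
            (offset, self._records[offset])
            for offset in range(start_offset, len(self._records))
        ]


log = CommitLog()
first = log.append("user signed up")
second = log.append("user logged in")
third = log.append("user logged out")

# Reading always moves forward from a chosen entry number.
entries_from_second = log.read(start_offset=second)
```

The class deliberately exposes no update or delete method: the only way to change the log is to append to it, which is what keeps the entry numbers and the ordering stable.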

Because of these features, Kafka works as a message bus with very high throughput (millions of messages per second) and supports real-time stream processing on the data that passes through it.

Kafka architecture:

Apache Kafka Architecture

A Kafka cluster consists of multiple distributed servers or nodes, referred to as brokers.

Applications called producers send messages (records) to a Kafka node (broker), and those messages are consumed by other applications called consumers. Messages are stored in a topic on the Kafka node, so producers publish to topics and consumers subscribe to those topics to receive new messages. Multiple producers and consumers can publish and retrieve messages simultaneously.

Topic architecture:

Each topic in Kafka is divided into multiple partitions to improve the performance and scalability of the Kafka cluster.

If the replication factor of a topic is set to 3, Kafka creates 3 identical replicas of each partition and places them on different brokers in the cluster so that the data stays available. To balance load across the cluster, each broker stores one or more of those partitions. Every partition has one server acting as its leader; the other nodes holding a replica of that partition are its followers.

If the leader fails, one of the followers automatically takes over as the new leader.
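
The replica placement and failover described above can be sketched as follows. This is a simplified model under stated assumptions: the function names, the broker names, and the round-robin placement are all hypothetical, and the real election logic in Kafka is more involved (it only considers replicas that are in sync with the leader).

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Place replicas round-robin; the first broker in each list acts as leader."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for partition in range(num_partitions):
        assignment[partition] = [
            brokers[(partition + i) % len(brokers)]
            for i in range(replication_factor)
        ]
    return assignment


def elect_leader(replicas, failed_brokers):
    """Return the first replica that is still alive, or None if all have failed."""
    for broker in replicas:
        if broker not in failed_brokers:
            return broker
    return None


brokers = ["broker-1", "broker-2", "broker-3", "broker-4"]
assignment = assign_replicas(num_partitions=4, brokers=brokers, replication_factor=3)

# Partition 0 lives on broker-1 (leader), broker-2 and broker-3 (followers).
leader_before = elect_leader(assignment[0], failed_brokers=set())
# If broker-1 goes down, a follower takes over.
leader_after = elect_leader(assignment[0], failed_brokers={"broker-1"})
```

Because each partition starts its replica list on a different broker, leadership is spread over the cluster rather than concentrated on one machine.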

Kafka Topic Architecture

Kafka guarantees the order of messages within a partition: they are stored in the sequence in which they arrived. Each message is identified by its offset, a sequence number that is incremented for each new message in the partition.
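
A small sketch can show how per-partition offsets behave. The `Topic` class below is hypothetical, and routing a message by hashing its key with CRC32 is an assumption made for illustration; Kafka clients use their own partitioning logic.

```python
import zlib


class Topic:
    """A toy topic: a list of partitions, each an ordered list of messages."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stand-in hash for illustration; the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def send(self, key, value):
        """Append the message to its partition and return (partition, offset)."""
        partition = self.partition_for(key)
        offset = len(self.partitions[partition])
        self.partitions[partition].append((key, value))
        return partition, offset


topic = Topic("orders", num_partitions=3)
results = [topic.send("customer-42", f"event-{i}") for i in range(3)]
```

All three messages share a key, so they land in the same partition with offsets 0, 1 and 2, in the order they were sent. Messages in different partitions have independent offsets, so ordering is only guaranteed inside one partition, not across the whole topic.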

Message ordering in Kafka topic

Consumer Architecture:

Consumers are generally organized into consumer groups, each consisting of one or more consumers.

A consumer group has a unique group ID. Each consumer group acts as a subscriber to one or more Kafka topics. If you need multiple subscribers, you use multiple consumer groups. Each consumer group maintains its own offset per topic partition, so a record from a given topic partition is delivered to only one consumer in the group.

Every consumer in a consumer group processes records, but each record is delivered to only one consumer in that group.

Ideally, the number of partitions equals the number of consumers in a consumer group. If there are more consumers than partitions, the excess consumers sit idle, wasting client resources.
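
The idea that each group keeps its own position can be sketched like this. The `GroupOffsets` class, the group names and the topic name are hypothetical, and a plain Python list stands in for a partition.

```python
class GroupOffsets:
    """Tracks, per consumer group, the next offset to read from each topic partition."""

    def __init__(self):
        self._next = {}  # (group_id, topic, partition) -> next offset to read

    def poll(self, group_id, topic, partition, log, max_records=10):
        """Return the next batch for this group and advance only this group's offset."""
        key = (group_id, topic, partition)
        start = self._next.get(key, 0)
        records = log[start:start + max_records]
        self._next[key] = start + len(records)
        return records


partition_log = ["r0", "r1", "r2", "r3"]
offsets = GroupOffsets()

billing_first = offsets.poll("billing", "orders", 0, partition_log, max_records=3)
billing_second = offsets.poll("billing", "orders", 0, partition_log, max_records=3)
# A different group starts from its own offset and sees the same records again.
audit_first = offsets.poll("audit", "orders", 0, partition_log, max_records=3)
```

Reading by the "billing" group does not move the "audit" group's position, which is why adding a new subscriber only requires a new group ID.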
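
The rule about idle consumers follows directly from how partitions are shared out inside a group. Below is a minimal sketch that assumes a simple round-robin assignment; the function name is hypothetical, and Kafka supports several assignment strategies.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer in the group, round-robin."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        consumer = consumers[index % len(consumers)]
        assignment[consumer].append(partition)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]

# Fewer consumers than partitions: each consumer handles several partitions.
two_consumers = assign_partitions(partitions, ["C0", "C1"])

# As many consumers as partitions: one partition each.
four_consumers = assign_partitions(partitions, ["C0", "C1", "C2", "C3"])

# More consumers than partitions: the extra consumer gets nothing.
five_consumers = assign_partitions(partitions, ["C0", "C1", "C2", "C3", "C4"])
idle = [consumer for consumer, owned in five_consumers.items() if not owned]
```

With five consumers and four partitions, `idle` contains only `C4`, matching the case where an extra consumer has no partition left to read.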

Kafka Consumer Architecture

Notice that Broker 1 holds topic partitions P0 and P1, and Broker 2 holds partitions P2 and P3. Consumer C0 from Consumer Group 1 processes records from P0 and P2. Within a consumer group, no two consumers share a partition. Consumers C1 and C3 from Consumer Group 2 process records from P0, P1, P2 and P3, so no partition is left for C4, which sits idle.

Consumers C5, C6, C7 and C8 from Consumer Group 3 process records from P0, P1, P3 and P2 respectively. Here each consumer is assigned exactly one topic partition, so none sits idle.

Conclusion:

Today, Confluent provides a platform that includes the Apache Kafka core along with add-on services, APIs and clients that aim to reduce latency and ease development and operations.

Happy learning!!

Debashree

ITechShree-Data-Analytics-Technologies

Posted by: D Gorai
