
Introduction to Kafka

Introduction

Apache Kafka® is a distributed streaming platform.

A streaming platform has three key capabilities:

· Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.

· Store streams of records in a fault-tolerant durable way.

· Process streams of records as they occur.

Kafka is generally used for two broad classes of applications:

· Building real-time streaming data pipelines that reliably get data between systems or applications

· Building real-time streaming applications that transform or react to the streams of data

To understand how Kafka does these things, let’s dive in and explore Kafka’s capabilities from the bottom up.

First, a few concepts:

· Kafka is run as a cluster on one or more servers that can span multiple datacenters.

· The Kafka cluster stores streams of records in categories called topics.

· Each record consists of a key, a value, and a timestamp.

Kafka includes five core APIs (a producer sketch in Java follows this list):

1. The Producer API allows applications to send streams of data to topics in the Kafka cluster.

2. The Consumer API allows applications to read streams of data from topics in the Kafka cluster.

3. The Streams API allows transforming streams of data from input topics to output topics.

4. The Connect API allows implementing connectors that continually pull from some source system or application into Kafka or push from Kafka into some sink system or application.

5. The Admin API allows managing and inspecting topics, brokers, and other Kafka objects.
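To make the Producer API concrete, here is a minimal sketch in Java. The broker address (localhost:9092), topic name (events), and record contents are placeholders for illustration, not details of any particular cluster.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; point this at your own cluster
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // try-with-resources flushes and closes the producer on exit
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each record carries a key and a value; Kafka assigns
            // a timestamp at send time if you don't set one
            producer.send(new ProducerRecord<>("events", "user-42", "page_view"));
        }
    }
}
```

Records with the same key land on the same partition, which is what preserves per-key ordering.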

Use Cases

Kafka’s main use is streaming data in real time into other systems. It was designed to handle all the real-time data feeds a large company might have.

Kafka can connect to external systems for data import/export via Kafka Connect, and can process that data with Kafka Streams, a Java stream processing library.
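As a sketch of what a Kafka Streams application looks like, the hypothetical Java program below reads records from one topic, upper-cases each value, and writes the result to another topic. The application id, broker address, and topic names are placeholder assumptions.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");     // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read from an input topic, transform each value, write to an output topic
        KStream<String, String> source = builder.stream("input-topic");
        source.mapValues(value -> value.toUpperCase()).to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        // Close the application cleanly on shutdown
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```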

With Kafka, users can subscribe and publish data to any number of systems or real-time applications. Examples include matching passengers and drivers at Uber, providing real-time analytics and predictive maintenance for British Gas’ smart home, and powering numerous real-time services across LinkedIn.

While Kafka is mostly used for real-time data analytics and stream processing, it is also used for messaging, log aggregation, website activity tracking, click-stream tracking, metrics collection and monitoring, audit trails, complex event processing (CEP), ingesting data into Spark and Hadoop, CQRS, message replay, error recovery, and more.

Kafka as a Messaging System

Kafka is a distributed messaging system that lets applications publish and subscribe to messages in a data pipeline. It is fast and highly scalable, and it is most commonly used as a central messaging backbone that decouples communication between large, disparate data systems.
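To show the subscribe side of this pattern, here is a hedged Consumer API sketch in Java; the broker address, group id, and topic name are placeholders. Consumers that share a group.id split a topic's partitions between them, while separate groups each receive every message.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "example-group");           // placeholder consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // placeholder topic
            while (true) {
                // Fetch whatever arrived since the last poll, waiting up to 500 ms
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```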

Fault-Tolerant

Kafka Streams builds on fault-tolerance capabilities integrated natively within Kafka. Kafka partitions are highly available and replicated, so once stream data is persisted to Kafka it remains available even if the application fails and needs to re-process it. Tasks in Kafka Streams leverage the fault-tolerance capability of the Kafka consumer client to handle failures: if a task runs on a machine that fails, Kafka Streams automatically restarts the task in one of the remaining running instances of the application.
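The replication that underpins this fault tolerance is set per topic. As an illustration (not from the article), the Admin API can create a topic whose partitions are each copied to three brokers; the topic name and broker address are placeholders.

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 3: each partition lives on 3 brokers,
            // so the data survives the loss of up to 2 of them
            NewTopic topic = new NewTopic("payments", 6, (short) 3); // placeholder name
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```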

Highly Scalable

One of the major advantages of Kafka is that it is highly scalable. When a node fails, Kafka recovers quickly and automatically. In a world that now deals with high volumes of real-time data, this makes Kafka a hands-down choice for data communication and integration.

High Performance

Kafka is highly efficient at handling real-time, complex data feeds with high throughput and low latency. Stored messages can run into terabytes, yet Kafka delivers high performance, making it an excellent companion for any enterprise Hadoop infrastructure.
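Much of that throughput comes from batching and compression on the producer side. The following producer.properties sketch shows the knobs involved; the values are illustrative starting points, not tuning advice from this article.

```
# producer.properties (illustrative values)

# Compress record batches on the wire
compression.type=lz4

# Wait up to 10 ms so batches can fill before being sent
linger.ms=10

# Allow batches of up to 64 KB per partition
batch.size=65536
```

Larger batches and compression raise throughput at the cost of a few milliseconds of added latency, which is the usual trade-off to tune here.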

Security

Encryption of data in flight using SSL/TLS: this encrypts data between your producers and Kafka, and between your consumers and Kafka. It is the same pattern everyone uses on the web. That’s the “S” of HTTPS (that beautiful green lock you see everywhere on the web).

Authentication using SSL or SASL: this allows your producers and consumers to authenticate to your Kafka cluster, which verifies their identity. It is also a secure way for your clients to assert an identity. Why would you want that? Well, for authorization!

Authorization using ACLs: once your clients are authenticated, your Kafka brokers can check them against access control lists (ACLs) to determine whether a particular client is authorized to write to or read from a given topic.
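Putting the three together, a client configuration for an encrypted, SASL-authenticated cluster might look like the following sketch; every hostname, path, and credential below is a placeholder.

```
# client.properties: sketch of a SASL_SSL client configuration

# Encrypt traffic with TLS and authenticate with SASL/SCRAM
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
    username="alice" \
    password="alice-secret";

# Trust store used to verify the brokers' TLS certificates
ssl.truststore.location=/etc/kafka/secrets/client.truststore.jks
ssl.truststore.password=changeit
```

The ACL rules themselves are managed on the broker side, typically with the kafka-acls command-line tool.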

Recoverable

You are responsible for configuring the relevant parameters and designing an architecture that can handle failovers, in order to ensure that you never lose any data.
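As an illustration of the kind of configuration involved, the settings below trade some throughput for durability; the values are examples, not prescriptions.

```
# Producer settings (illustrative)

# Wait for all in-sync replicas to acknowledge each write
acks=all

# Retry safely without introducing duplicates
enable.idempotence=true

# Topic/broker setting: with a replication factor of 3,
# writes still succeed with one replica down, but never
# land on fewer than 2 copies
min.insync.replicas=2
```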
