
Principles of building streaming analytics systems

by admin

Designing streaming analytics and stream processing systems has its own nuances, its own challenges, and its own technology stack. We talked about all of this in an open lesson held on the eve of the launch of the "Data Engineer" course.
The webinar discussed:

  • when stream processing is needed;
  • what elements a streaming data processing system consists of, and which tools we can use to implement them;
  • how to build your own clickstream analysis system.

The instructor is Yegor Mateshuk, Senior Data Engineer at MaximaTelecom.

When do you need streaming? Stream vs Batch

First of all, we should figure out when we need streaming and when we need batch processing. Let’s explain the strengths and weaknesses of these approaches.
So, the disadvantages of batch processing:

  • data is delivered with a delay. Since computations run over a fixed period, we always lag behind real time by that period, and the longer the period, the further behind we are. In some cases this delay is critical;
  • it creates a peak load on the hardware. If we do a lot of computation in batch mode, we get a peak load at the end of the period (day, week, month), because a lot of things have to be computed at once. What does that lead to? First, we start hitting resource limits, which, as you know, are not infinite. As a result, the system periodically runs at full capacity, which often ends in failure. Second, because all these jobs start at the same time, they compete with each other and run quite slowly, so you cannot count on fast results.

But batch processing also has advantages:

  • high efficiency. We won't go deep here – efficiency comes from compression, columnar formats, framework optimizations and so on – but measured as the number of records processed per unit of time, batch processing is more efficient;
  • ease of development and support. You can process part of the data, then test and recalculate as needed.

The benefits of streaming:

  • results in real time. We do not wait for the end of any period: as soon as data arrives (even a very small amount), we can process it immediately and pass it on. By definition, the result tends toward real time;
  • uniform load on the hardware. There are daily cycles and so on, of course, but the load is still spread throughout the day, making it more even and predictable.

The main disadvantage of stream processing:

  • complexity of development and support. First, it is somewhat harder to test, manage, and extract data than with batch. The second difficulty (and in fact the most important one) concerns rollbacks: if a job fails and a crash occurs, it is very difficult to pinpoint exactly where things broke, and fixing the problem takes more effort and resources than with batch processing.

So, if you are wondering whether you need streaming, answer the following questions for yourself:

  1. Is real-time really necessary?
  2. Are there many streaming sources?
  3. Is the loss of one record critical?

Consider two examples:
Example 1. Inventory analytics for retail:

  • the product display on the shelves does not change in real time;
  • data is most often delivered in batch mode;
  • information loss is critical.

In this example, it makes more sense to use batch.
Example 2. Analytics for a web portal:

  • the speed of analytics determines the reaction time to a problem;
  • the data arrives in real time;
  • the loss of a small amount of user activity information is acceptable.

Imagine that analytics reflects how comfortable visitors are using your product. For example, you have rolled out a new release and need to understand within 10-30 minutes whether everything is okay and whether any user-facing functionality is broken. Say the label has disappeared from the "Order" button – analytics will let you react quickly to a sharp drop in the number of orders, and you will immediately realize that you need to roll back.
So, in the second example, it is better to use streaming.

Elements of a streaming data processing system

Data engineers capture, move, deliver, transform, and store data (yes, storing data is an active process too!).
Consequently, to build a streaming data processing system, we will need the following elements:

  1. a data loader (a means of delivering data to the storage);
  2. a data bus (you don't always need one, but in a streaming system you can't do without it, because you need a system through which to exchange data in real time);
  3. data storage (we can hardly do without that);
  4. an ETL engine (needed for filtering, sorting, and other transformations);
  5. BI (to present the results);
  6. an orchestrator (ties the whole process together by organizing multi-step data processing).

In our case, we will consider the simplest situation and focus only on the first three elements.

Tools for processing data streams

For the role of data loader we have several "candidates":

  • Apache Flume
  • Apache NiFi
  • StreamSets

Apache Flume

The first one we will talk about is Apache Flume – a tool for transporting data between different sources and storage systems.
Pros:

  • available almost everywhere
  • has been around and in use for a long time
  • quite flexible and extensible

Cons:

  • cumbersome configuration
  • hard to monitor

As for its configuration, it looks something like this:
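A minimal sketch of such a configuration, assuming the stock flume-ng netcat source and logger sink (the agent and component names a1, r1, c1, k1 are arbitrary placeholders):

# Sketch only: a minimal Flume agent that listens on a TCP port and logs incoming events
cat > /tmp/netcat-logger.conf <<'EOF'
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
EOF

# Run the agent (assumes flume-ng is on the PATH)
flume-ng agent --name a1 --conf-file /tmp/netcat-logger.conf -Dflume.root.logger=INFO,console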
Above we define one simple flow that "sits" on a port, takes data from it, and simply logs it. For describing a single process this is fine, but when you have dozens of such processes, the configuration file turns into a living hell. Some people bolt on external visual configurators, but why bother when there are tools that provide this out of the box? For example, the NiFi and StreamSets discussed below.

Apache NiFi

Essentially performs the same role as Flume, but with a visual interface, which is a big plus, especially when there are a lot of processes.
A few facts about NiFi

  • Originally developed at the NSA;
  • now maintained and developed by Hortonworks;
  • is part of Hortonworks’ HDF;
  • has a special version of MiNiFi for collecting data from devices.

The system gives you a canvas – a field for creativity – onto which you drop data processing steps and wire them together. There are connectors for just about every system out there.

StreamSets

This is also a data flow management system with a visual interface. It was developed by people who came out of Cloudera, installs easily as a parcel on CDH, and has a special version, SDC Edge, for collecting data from devices.
It consists of two components:

  • SDC (StreamSets Data Collector) – the system that does the actual data processing (free);
  • StreamSets Control Hub – a control center for multiple SDCs with additional pipeline development capabilities (paid).

The pipeline editor is, again, a visual canvas onto which you drop and connect processing stages.
The annoying thing is that StreamSets has both free and paid parts.

Data bus

Now let's figure out where we are going to pour all this data. The candidates:

  • Apache Kafka
  • RabbitMQ
  • NATS

Apache Kafka is the best option, but if you already have RabbitMQ or NATS in your company and you only need to add a little bit of analytics, then deploying Kafka from scratch may not be worth it.
In all other cases, Kafka is an excellent choice. It is essentially a horizontally scalable message broker with enormous throughput. It integrates perfectly with the whole ecosystem of data tools, handles heavy loads, has the most universal interface, and is the lifeblood of our data processing.
Internally, Kafka is divided into topics. A topic is a separate stream of messages with the same schema or, at least, the same purpose.
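As a quick illustration – a sketch assuming the standard CLI tools that ship with the Kafka broker (older brokers take --zookeeper instead of --bootstrap-server), and using the broker address and topic name that appear later in the case study:

# Create a topic for clickstream events (broker address and topic name are
# taken from the case study below; adjust to your setup)
kafka-topics.sh --bootstrap-server kafka:9092 --create \
  --topic clickstream --partitions 3 --replication-factor 1

# List existing topics
kafka-topics.sh --bootstrap-server kafka:9092 --list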
To discuss the next nuance, we have to remember that data sources can differ slightly, and the format of the data matters a great deal.
The Apache Avro data serialization format deserves special mention. Avro uses JSON to define the data structure (schema), while the data itself is serialized into a compact binary format. As a result, we save a huge amount of space, and serialization/deserialization is cheaper.
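For illustration, here is a minimal, hypothetical Avro schema for a click event (the field names roughly mirror the ones used in the case study below); the schema itself is plain JSON:

# Sketch: write a hypothetical Avro schema to a file
cat > click_event.avsc <<'EOF'
{
  "type": "record",
  "name": "ClickEvent",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "location",  "type": "string"},
    {"name": "sessionId", "type": "string"},
    {"name": "eventType", "type": "string"}
  ]
}
EOF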
This seems fine, but keeping the schema in separate files creates a problem: we need to exchange those files between different systems. That sounds simple, but when different departments are involved, the guys on the other end may change something and move on, while everything breaks on your side.
To avoid passing all these files around on flash drives and floppy disks, there is a special service – Schema Registry. It is a service for synchronizing Avro schemas between the services that write to and read from Kafka.
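One widely used implementation is Confluent's Schema Registry, which exposes a REST API. A sketch of registering and listing schemas, assuming the registry runs at the hypothetical address schema-registry:8081:

# Register a (trimmed-down) schema under the "clickstream-value" subject
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"schema": "{\"type\":\"record\",\"name\":\"ClickEvent\",\"fields\":[{\"name\":\"eventType\",\"type\":\"string\"}]}"}' \
  http://schema-registry:8081/subjects/clickstream-value/versions

# List the subjects known to the registry
curl http://schema-registry:8081/subjects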
In Kafka’s terms, the producer is the one who writes, the consumer is the one who consumes (reads) the data.

Data Warehouse

Contenders (there are actually many more options, but we’ll take just a few):

  • HDFS + Hive
  • Kudu + Impala
  • ClickHouse

Before choosing a storage system, let's recall what idempotency is. Wikipedia says that idempotency (from Latin idem – "the same" + potens – "capable") is the property of an operation to give the same result when applied to an object repeatedly as when applied the first time. In our case, stream processing should be built so that re-processing the same input data still yields a correct result.
How do we achieve this in a streaming system?

  • identify a unique id (it can be composite);
  • use this id to deduplicate the data (see the small sketch right after this list).
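A tiny sketch of the first bullet – building a deterministic composite id from event fields (the field values here are made up). Reprocessing the same event always produces the same id, which is what lets the storage layer drop duplicates:

# Hypothetical event fields
sessionId="s-123"; pageViewId="pv-456"; ts="1554000000000"

# Deterministic composite id: the same input fields give the same id on every reprocessing
printf '%s|%s|%s' "$sessionId" "$pageViewId" "$ts" | sha1sum | cut -d' ' -f1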

HDFS + Hive does not give us idempotency for streaming writes out of the box, so we are left with:

  • Kudu + Impala
  • ClickHouse

Kudu is a storage engine suited to analytical queries that also has a primary key, which we can use for deduplication. Impala is a SQL interface to this storage engine (and to several others).
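A sketch of what that looks like on the Impala side (the table layout and partitioning are purely illustrative): a Kudu-backed table declares a primary key, and UPSERT makes re-delivered records idempotent.

# Create a Kudu-backed table with a primary key (illustrative schema)
impala-shell -q "CREATE TABLE clickstream_kudu (pageViewId STRING, ts BIGINT, eventType STRING, PRIMARY KEY (pageViewId)) PARTITION BY HASH (pageViewId) PARTITIONS 4 STORED AS KUDU"

# Re-delivering the same record is harmless: the row is simply overwritten
impala-shell -q "UPSERT INTO clickstream_kudu VALUES ('pv-1', 1554000000000, 'pageView')"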
As for ClickHouse, it is an analytics database from Yandex. Its main purpose is analytics on a table filled by a large stream of raw data. On the plus side, there is a ReplacingMergeTree engine for deduplication by key (note that this deduplication is designed to save space, happens during background merges, and can leave duplicates in some cases, so there are nuances to keep in mind).
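To make that nuance concrete, here is a small sketch (assuming a clickhouse-client is available; in the case study below it is run via Docker). Duplicates by the ORDER BY key are collapsed only when parts are merged in the background, so FINAL is needed to see the deduplicated view immediately:

# Demo table: two inserts with the same key land in different parts
clickhouse-client --multiquery --query "
CREATE TABLE IF NOT EXISTS demo_dedup (id String, value UInt64) ENGINE = ReplacingMergeTree() ORDER BY id;
INSERT INTO demo_dedup VALUES ('a', 1);
INSERT INTO demo_dedup VALUES ('a', 2);
SELECT count() FROM demo_dedup;        -- may still return 2 until a background merge runs
SELECT count() FROM demo_dedup FINAL;  -- collapses duplicates by the ORDER BY key: returns 1
"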
It remains to say a few words about Divolte. If you remember, we talked about capturing data: if you need to put together analytics for a portal quickly and cheaply, Divolte is a great service for capturing user events on a web page via JavaScript.
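A quick sanity check that the collector is up and serving its tracking tag (8290 is Divolte's default HTTP port – treat both the port and the path as assumptions and adjust to your setup):

# The page tag loads this script; if the collector is up, this returns JavaScript
curl -s http://localhost:8290/divolte.js | head -n 3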

Case study

What are we going to do? Let's try to build a pipeline to collect clickstream data in real time. A clickstream is the virtual footprint a user leaves while on your site. We'll capture the data with Divolte and write it to Kafka.
The resulting pipeline: Divolte → Kafka → StreamSets → Kafka → ClickHouse.
You will need Docker, and you will also need to clone the accompanying repository. Everything that happens will run in containers. To run multiple containers in a coordinated way we will use docker-compose.yml. In addition, there is a Dockerfile that builds our StreamSets image with certain dependencies.
There are also three folders:

  1. clickhouse-data is where the ClickHouse data will be written;
  2. an identical folder (sdc-data) exists for StreamSets, where the system stores its configuration;
  3. the third folder (examples) contains the query file and the pipeline configuration file for StreamSets.

To start it, enter the following command:

docker-compose up

And enjoy how slowly but surely the containers start up. Once everything is up, we can open Divolte in the browser and immediately give it a poke by clicking around to generate a few events.
So, we have Divolte, which has already received some events and written them to Kafka. Let's try to process them using StreamSets: http://localhost:18630/ (login/password – admin/admin).
To save yourself some suffering, it is better to import a ready-made pipeline: create a pipeline named, for example, clickstream_pipeline, and import clickstream.json from the examples folder. If everything is OK, the imported pipeline appears on the canvas.
So, the pipeline creates a connection to Kafka, specifying which cluster and which topic we are interested in, then selects the fields we need, and then writes back into Kafka, again specifying the cluster and the topic. The difference is that on the input side the data format is Avro, and on the output side it is plain JSON.
Moving on, we can, for example, run a preview, which captures a few records from Kafka in real time so we can inspect them. After that, we run the pipeline for real.
After running it, we see a stream of events flying into Kafka, and it is happening in real time.
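If you want to double-check from the command line, here is a sketch (it assumes the Kafka service in docker-compose.yml is called kafka and that its image ships the standard console tools); the output topic carries plain JSON, so the events are readable:

# Peek at a few events in the output topic (service name and CLI availability are assumptions)
docker-compose exec kafka kafka-console-consumer \
  --bootstrap-server kafka:9092 --topic clickstream --from-beginning --max-messages 5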
We can now create storage for this data in ClickHouse. To work with ClickHouse, you can use the simple native client by running the following command:

docker run -it --rm --network divolte-ss-ch_default yandex/clickhouse-client --host clickhouse

Note that this command specifies the Docker network to connect to, and depending on what your repository folder is called, the network name may differ. In the general case, the command is:

docker run -it --rm --network {your_network_name} yandex/clickhouse-client --host clickhouse

The list of networks can be viewed with the command:

docker network ls

Well, there's not much left:
1. First, let's "subscribe" our ClickHouse to Kafka, "explaining" to it what data format to expect in the topic:

CREATE TABLE IF NOT EXISTS clickstream_topic (
    firstInSession UInt8,
    timestamp UInt64,
    location String,
    partyId String,
    sessionId String,
    pageViewId String,
    eventType String,
    userAgentString String
) ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'clickstream',
         kafka_group_name = 'clickhouse',
         kafka_format = 'JSONEachRow';

2. Now let's create the real table where the final data will end up:

CREATE TABLE clickstream (
    firstInSession UInt8,
    timestamp UInt64,
    location String,
    partyId String,
    sessionId String,
    pageViewId String,
    eventType String,
    userAgentString String
) ENGINE = ReplacingMergeTree()
ORDER BY (timestamp, pageViewId);

3. And then we link these two tables together:

CREATE MATERIALIZED VIEW clickstream_consumer TO clickstream
AS SELECT * FROM clickstream_topic;

4. Now let's query the data:

SELECT * FROM clickstream;

Finally, selecting from the target table will give us the result we want.
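For something more useful than SELECT *, here is a sketch of an aggregate query run through the same dockerized client (it assumes Divolte's timestamps are epoch milliseconds, and that the network name is the one used above):

# Events per minute by type – a first approximation of "is everything okay after the release?"
docker run --rm --network divolte-ss-ch_default yandex/clickhouse-client --host clickhouse \
  --query "SELECT toStartOfMinute(toDateTime(toUInt32(intDiv(timestamp, 1000)))) AS minute, eventType, count() AS events FROM clickstream GROUP BY minute, eventType ORDER BY minute, eventType"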
That's it – this is the simplest clickstream pipeline you can build. If you want to go through the steps yourself, watch the full video of the webinar.
