
Let’s talk about Hadoop

by admin


Introduction

As someone with a not particularly stable psyche, one look at a picture like this is enough to trigger a panic attack. But I have decided that I should be the only one to suffer. The purpose of this article is to make Hadoop look less scary.

What will be in this article:

  • Let's break down what the framework consists of and why it is needed;
  • Let's look at how to deploy a cluster painlessly;
  • Let's walk through a concrete example;
  • Let's touch briefly on the new features of Hadoop 2 (NameNode Federation, MapReduce v2).

What will not be in this article:

  • This is an overview article, so nothing too complicated;
  • We won't get into the intricacies of the ecosystem;
  • We won't dig deep into the maze of APIs;
  • We won't go through every development-adjacent problem.

What is Hadoop and why do we need it

Hadoop is not that complicated: the core consists of the HDFS file system and the MapReduce framework for processing data stored in that file system.
If you look at the question "why do we need Hadoop?" in terms of use in a large-scale enterprise, there are quite a few answers, ranging from "strongly for" to "strongly against." I recommend the article by ThoughtWorks.
If we look at the same question from a technical point of view – for which tasks does it make sense to use Hadoop – things are less simple. The manuals invariably walk through two main examples: word count and log analysis. Well, what if your task is neither word count nor log analysis?
It would also be nice to have a simple rule of thumb for the answer. For example, SQL should be used if you have a lot of structured data and you really want to talk to the data – to ask as many questions as possible whose nature and format are unknown in advance.
The long answer is to look at a number of existing solutions and implicitly absorb the conditions under which Hadoop is needed. You can poke around in blogs; I can also suggest reading Mahmoud Parsian's book "Data Algorithms: Recipes for Scaling up with Hadoop and Spark".
I'll try to answer briefly. Hadoop should be used if:

  • The computation must be decomposable, in other words, you must be able to run it on subsets of the data and then merge the partial results.
  • You plan to process a large volume of unstructured data – more than fits on a single machine (more than a few terabytes). A bonus in Hadoop's case is the ability to build the cluster out of commodity hardware.

Hadoop should not be used:

  • For problems that cannot be decomposed this way – for example, inherently sequential (recurrent) computations where each step depends on the previous one.
  • If the whole dataset fits on one machine – you will save a great deal of time and resources by keeping it there.
  • Hadoop as a whole is a batch-processing system and is not suitable for real-time analysis (this is where Storm comes in).

HDFS architecture and a typical Hadoop cluster

HDFS is similar to other traditional file systems: files are stored as blocks, there is a mapping between blocks and file names, a tree structure is supported, a permission-based file access model is supported, and so on.
HDFS Distinctions:

  • Designed to store a large number of huge (> 10 GB) files. One consequence is a large block size compared to other file systems (> 64 MB).
  • Optimized for high-throughput streaming reads, so random-read performance starts to suffer.
  • Oriented toward a large number of low-cost servers. In particular, the servers use a JBOD (Just a Bunch Of Disks) layout instead of RAID – mirroring and replication are done at the cluster level rather than at the level of an individual machine.
  • Many of the traditional problems of distributed systems are baked into the design – by default, the failure of an individual node is a perfectly normal and natural event, not something out of the ordinary.

A Hadoop cluster consists of three types of nodes: NameNode, Secondary NameNode, and DataNode.
NameNode – the brains of the system. Usually there is one per cluster (more in the case of NameNode Federation, but we leave that case aside). It stores all of the system's metadata – the mapping between files and blocks. Being a single node, it is also a single point of failure. This problem is addressed in the second version of Hadoop with NameNode Federation.
Secondary NameNode – one node per cluster. It is often said that "Secondary NameNode" is one of the most unfortunate names in the history of software. Indeed, the Secondary NameNode is not a replica of the NameNode. The state of the file system is stored directly in the fsimage file and in the edits log file, which contains the latest changes to the file system (similar to the transaction log in the RDBMS world). The Secondary NameNode's job is to periodically merge fsimage and edits – the Secondary NameNode keeps the size of edits within reasonable limits. It is needed for quick manual recovery of the NameNode in case the NameNode fails.
In a real cluster, the NameNode and Secondary NameNode are separate, dedicated servers with high memory and disk requirements. The advertised "commodity hardware" applies only to the DataNodes.
DataNode – there are many such nodes in the cluster, and they store the file blocks themselves. A DataNode regularly sends the NameNode a heartbeat (to show that it is still alive) and an hourly block report with information about all the blocks stored on that node. This is needed to maintain the required replication level.
Let’s see how data is written to HDFS:
[diagram: writing data to HDFS]
1. The client slices the file into block-sized chunks.
2. The client connects to the NameNode and requests a write operation, sending the number of blocks and the required replication level.
3. The NameNode responds with a chain of DataNodes.
4. The client connects to the first node in the chain (if that fails, to the second, and so on; if the whole chain fails, the write is rolled back). The client writes the first block to the first node, the first node forwards it to the second, and so on.
5. When the write completes, acknowledgments of the successful write are sent back in reverse order (4 -> 3, 3 -> 2, 2 -> 1, 1 -> to the client).
6. As soon as the client receives confirmation that the block was written successfully, it notifies the NameNode, receives a DataNode chain for the second block, and so on.
The client keeps writing blocks as long as it manages to write each block successfully to at least one node; replication then works on the well-known "eventual" principle, with the NameNode taking it upon itself to compensate and eventually reach the desired replication level.
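From the client's point of view, all of this pipelining hides behind a single output stream. Below is a minimal sketch using the standard HDFS Java API; the NameNode address and file path are illustrative, not taken from the article:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Address of the NameNode; illustrative value, adjust for your cluster.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);
        // Block splitting, the DataNode pipeline and replication all happen
        // behind this single stream, exactly as described in the steps above.
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/john.txt"))) {
            out.writeBytes("hello hdfs\n");
        }
        fs.close();
    }
}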
To wrap up our overview of HDFS and the cluster, let's note another great feature of Hadoop – rack awareness. The cluster can be configured so that the NameNode knows which nodes sit in which rack, thereby providing better protection against failures.

MapReduce

The unit of work is a job – a set of map (parallel data processing) and reduce (merging the map outputs) tasks. Map tasks are executed by mappers, reduce tasks by reducers. A job consists of at least one mapper; reducers are optional. The question of how to decompose a problem into maps and reduces is dealt with here. If the words "map" and "reduce" mean nothing to you at all, see the classic article on the subject.

MapReduce model

[diagram: the MapReduce model]

  • Data comes in and goes out as (key, value) pairs.
  • Two functions are used: map: (K1, V1) -> ((K2, V2), (K3, V3), …), which maps a key-value pair to some set of intermediate key-value pairs, and reduce: (K1, (V1, V2, …, Vn)) -> (K1, V), which maps a set of values sharing a common key to a smaller set of values.
  • Shuffle and sort is needed to group the reducer's input by key; in other words, it makes no sense to send the pairs (K1, V1) and (K1, V2) to two different reducers – they must be processed together. A small trace of this flow for word counting is shown below.
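To make the model concrete, here is an illustrative trace of how word counting flows through map, shuffle-and-sort, and reduce (the input lines are made up for the example):

map("line1", "deer bear river")  ->  (deer, 1), (bear, 1), (river, 1)
map("line2", "car car river")    ->  (car, 1), (car, 1), (river, 1)
shuffle & sort                   ->  (bear, [1]), (car, [1, 1]), (deer, [1]), (river, [1, 1])
reduce                           ->  (bear, 1), (car, 2), (deer, 1), (river, 2)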

Let's look at the architecture of MapReduce 1. First, let's extend our view of the Hadoop cluster with two new elements – the JobTracker and the TaskTrackers. The JobTracker accepts job requests from clients directly and manages map/reduce tasks on the TaskTrackers. The JobTracker and the NameNode are placed on different machines, whereas a DataNode and a TaskTracker live on the same machine.
The interaction between the client and the cluster is as follows:
[diagram: client-to-cluster interaction]
1. The client sends a job to the JobTracker. The job is packaged as a jar file.
2. The JobTracker looks for TaskTrackers taking data locality into account, i.e. preferring those that already hold the relevant data in HDFS. The JobTracker assigns map and reduce tasks to the TaskTrackers.
3. The TaskTrackers report job progress back to the JobTracker.
Task failure is expected behavior; failed tasks are automatically restarted on other machines.
Map/Reduce 2 (Apache YARN) no longer uses the "JobTracker/TaskTracker" terminology. The JobTracker is split into a ResourceManager, which manages resources, and an ApplicationMaster, which manages applications (of which MapReduce itself is just one). MapReduce v2 also uses a new API.
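For contrast with the old-API WordCount listed later in this article, the same mapper written against the new org.apache.hadoop.mapreduce API would look roughly like this (a minimal sketch; the class name is made up for illustration):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// New-API mapper: a single Context object replaces OutputCollector and Reporter.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one); // emit (word, 1)
        }
    }
}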

Setting up the environment

There are several different Hadoop distributions on the market: Cloudera, HortonWorks, MapR – in order of popularity. However, we will not focus on choosing a particular distribution. A detailed analysis of the distributions can be found here.
There are two ways to try Hadoop painlessly and with minimal effort:
1. An Amazon cluster – a complete cluster, but this option costs money.
2. Download a virtual machine (manual #1 or manual #2). The disadvantage here is that all the servers of the cluster run on a single machine.
Let's move on to the painful ways. Hadoop version 1 on Windows requires installing Cygwin. The upside here is excellent integration with development environments (IntelliJ IDEA and Eclipse). Read more in this great manual.
Since version 2, Hadoop also supports Windows Server editions. However, I would advise against mixing Hadoop and Windows not only in production but anywhere outside a developer's machine, although special distributions do exist. Windows 7 and 8 are not currently supported by the vendors, but people who like a challenge can try to get it working by hand.
I should also mention that for Spring fans there is the Spring for Apache Hadoop framework.
We'll take the easy way and install Hadoop in a virtual machine. We'll start by downloading the CDH 5.1 distribution for a virtual machine (VMware or VirtualBox). The distribution is about 3.5 GB. Download it, unzip it, load it into the VM, and that's it. We are all set. It's time to write everyone's favorite WordCount!

Specific example

We will need some sample data. I suggest downloading any password-bruteforcing dictionary. My file will be called john.txt.
Now open Eclipse – a training project has already been created for us, and it already contains all the libraries we need for development. Let's throw out all the code the folks from Cloudera put in and paste in the following:

package com.hadoop.wordcount;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

We get roughly this result:
[screenshot: the WordCount code in the Eclipse editor]
Add john.txt to the root of the training project via the File -> New File menu. The result:
[screenshot: john.txt in the project root]
Press Run -> Edit Configurations and enter john.txt and output as the Program Arguments, respectively.
[screenshot: the run configuration with program arguments]
Click Apply and then Run. The job will run successfully:
[screenshot: successful job run in the console]
Where are the results? To see them, refresh the project in Eclipse (F5):
[screenshot: the refreshed project tree with the output folder]
In the output folder you can see two files: _SUCCESS, which indicates that the job completed successfully, and part-00000 with the actual results.
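By the way, on a real cluster the same class would normally be packaged into a jar and submitted with the standard launcher, along the lines of hadoop jar wordcount.jar com.hadoop.wordcount.WordCount <input path> <output path>, where both paths point into HDFS. The jar name and paths here are placeholders; in the training VM the Eclipse run above is enough.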
This code can of course be debugged, and so on. Let's finish with a note on unit tests. Essentially the only framework for writing unit tests for Hadoop is MRUnit (https://mrunit.apache.org/), and it lags behind Hadoop: at the moment versions up to 2.3.0 are supported, even though the latest stable Hadoop release is 2.5.0.
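For the old-API mapper above, such a test would look roughly like this (a minimal sketch, assuming the MRUnit dependency is on the classpath; the test class and method names are made up):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.MapDriver;
import org.junit.Test;

public class WordCountMapTest {
    @Test
    public void mapEmitsOneForEachToken() throws Exception {
        // MapDriver from the org.apache.hadoop.mrunit package targets the old mapred API.
        MapDriver<LongWritable, Text, Text, IntWritable> driver =
                MapDriver.newMapDriver(new WordCount.Map());
        driver.withInput(new LongWritable(0), new Text("hadoop hadoop"))
              .withOutput(new Text("hadoop"), new IntWritable(1))
              .withOutput(new Text("hadoop"), new IntWritable(1))
              .runTest();
    }
}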

Ecosystem Blitz: Hive, Pig, Oozie, Sqoop, Flume

In a nutshell and about everything.
Hive and Pig. In most cases, writing Map/Reduce jobs in pure Java is too time-consuming and painful, and usually makes sense only when you need to squeeze out every last bit of performance. Hive and Pig are two tools for the rest of the cases. Hive is loved by Facebook, Pig by Yahoo. Hive has SQL-like syntax (similarities and differences with SQL-92). Many people with business-analysis and DBA backgrounds have moved into the Big Data camp – for them, Hive is often the tool of choice. Pig focuses on ETL.
Oozie – A workflow engine for jobs. Allows you to link jobs on different platforms: Java, Hive, Pig, etc.
Finally, the frameworks that feed data into the system, very briefly: Sqoop handles integration with structured data sources (RDBMS), Flume with unstructured ones.

Literature and video course review

There is not much literature on Hadoop yet. As for the second version, I have come across only one book that concentrates specifically on it – Hadoop 2 Essentials: An End-to-End Approach. Unfortunately, I could not get the book in electronic format, so I was unable to read it.
I’m not considering literature on individual components of the ecosystem – Hive, Pig, Sqoop – because it is somewhat outdated, and more importantly, such books are unlikely to be read from cover to cover, rather they will be used as a reference guide. And you can always get by with documentation.
Hadoop: The Definitive Guide – the book sits at the top of Amazon and has many positive reviews. The material is dated: it is from 2012 and describes Hadoop 1. On the plus side, there are the many positive reviews and fairly extensive coverage of the entire ecosystem.
Lublinskiy B. Professional Hadoop Solutions – the book from which I took a lot of material for this article. It is a bit complicated, but it has plenty of real-life examples and pays a lot of attention to the specific nuances of building solutions. Much more enjoyable than just reading descriptions of product features.
Sammer E. Hadoop Operations – about half of the book is devoted to describing Hadoop configuration. Considering that the book is from 2012, it will become obsolete very soon. It is intended primarily for DevOps, of course. But I am of the opinion that it is impossible to understand and feel a system if you only develop it and never operate it. I found the book useful because it takes apart the standard problems of cluster backup, monitoring, and benchmarking.
Parsian M. "Data Algorithms: Recipes for Scaling up with Hadoop and Spark" – The emphasis is on the design of Map-Reduce applications. Strong bias toward the scientific side. Useful for a comprehensive and in-depth understanding and application of MapReduce.
Owens J. Hadoop Real World Solutions Cookbook – like many other Packt books with "Cookbook" in the title, it is technical documentation sliced into questions and answers. It is not exactly easy going either – try it and see for yourself. Worth reading for a broad overview, and to keep around as a reference book.
Two video courses from O’Reilly are also worth checking out.
Learning Hadoop – 8 hours. It seemed too superficial to me. But the supplementary materials turned out to have some value: I want to play with Hadoop and needed some live data – and here it was, a great source of data.
Building Hadoop Clusters – 2.5 hours. As the title makes clear, the emphasis here is on building Amazon clusters. I really liked the course – short and clear.
I hope my humble contribution will help those who are just starting to learn Hadoop.
