Apache Hadoop is a java based open source software. Basically, it’s a framework which is used to execute batch processing jobs on huge clusters. It is designed so that it can be scaled from a single server to hundreds of thousands of nodes in the cluster, with a high extent of fault-tolerance. Rather than relying on state-of-the-art hardware, the reliability of these hardware clusters is born from the software’s capability to detect and effectively handle any kind of failure on their own.
So what makes Hadoop so flexible, agile and robust? Let’s take a look. There are five main components inside this environment.
- Cluster: it is a set of nodes (computers, servers) which may be partitioned in a rack. This is the hardware part of the environment.
- YARN Infrastructure: Yet Another Resource Negotiator is a framework which is responsible for providing the required resources for the application executions. It’s two important components are:
- Node Manager: The node manager is allocated many per a cluster. It is the slave of the infrastructure. When it is started, it communicates with the Resource Manager, periodically sending heartbeats to it. Every node manager offers some resources to the cluster. The resource capacity for a node manager is the amount of memory and a number of vcores. At the runtime, the scheduler will decide how to utilize the capacity: a container which is just a fraction of the node manager capacity which is then used by the client running the program.
- Resource Manager: It is allocated one per cluster and it is considered the master. It knows where the nodes are located and how much resources do they hold. It runs a lot of services, most important one being the Resource Scheduler, which is in charge of assigning the resources.
- MapR Framework: this part of the environment is responsible for the data processing jobs.
- HDFS: this framework is responsible for handling the storage for Hadoop. It is typically used to store the inputs as well outputs, without the hassle of the intermediates.
- Storage Solutions: this part of the environment pertains to the storage part and any third party solution can be used. For example, Amazon makes use of the S3 with Hadoop.
The HDFS and YARN infrastructure are completely independent and decoupled from anything else. While provides with resources for running applications the HDFS Federation provides with storage. MapR is one of the many possible frameworks which operates on top of YARN, yet it is the only one implemented.
Let’s take a look at what YARN is made up of.
In YARN there are 3 main components
- The Master (Resource Manager)
- The Client (Job Submitter)
- The Slave (Node Manager)
The workflow process is as follows:
- The job submitter submits a job to the resource manager.
- The resource manager then allocates a container for it.
- The relevant node manager is then contacted by the resource manager.
- Node manager then launches the container.
- Then the container is set as a base to launch the application master.
The Application Master is the software responsible for the execution of single jobs. It asks the Resource Manager for the containers and then executes the specific programs on the containers it obtains. It knows the application logic of the platform, therefore it is very much framework specific. The MapR framework comes with its own implementation of the Application Master.
Now, the resource manager is the single point of failure in Yet Another Resource Negotiator. Employing the Application Masters, YARN spreads the metadata (data related to running the applications) over the cluster. It greatly reduces the load on the resource manager, making it quickly recoverable.