Top 24 Apache Hadoop YARN Interview Questions You Must Prepare

  • HDFS is a write-once file system, so a user cannot update a file once it has been written; it can only be read (or appended to). However, under certain scenarios in an enterprise environment, like file uploading, file downloading, file browsing or data streaming, it is not possible to achieve all of this using standard HDFS alone. This is where the distributed file system protocol Network File System (NFS) is used. NFS allows access to files on remote machines in much the same way that applications access the local file system.
  • The NameNode is the heart of the HDFS file system; it maintains the metadata and tracks where file data is kept across the Hadoop cluster.
  • The Standby and Active NameNodes communicate with a group of lightweight daemons to keep their state synchronized. These are known as JournalNodes.

The Resource Manager is responsible for scheduling applications and tracking resources in a cluster. Prior to Hadoop 2.4, the Resource Manager could not be set up for HA and was a single point of failure in a YARN cluster.

Since Hadoop 2.4, the YARN Resource Manager can be set up for high availability. High availability of the Resource Manager is enabled by the use of an Active/Standby architecture. At any point of time, one Resource Manager is active and one or more Resource Managers are in standby mode. If the active Resource Manager fails, one of the standby Resource Managers transitions to active mode.
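A minimal yarn-site.xml sketch for enabling Resource Manager HA might look like the following; the cluster id, RM ids, hostnames and ZooKeeper quorum shown are placeholder values, not taken from the original answer:

<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>cluster1</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>master1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>master2.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>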

HDFS federation separates the namespace layer from the storage layer: multiple independent NameNodes, each managing its own namespace, share a common pool of DataNodes, which improves scalability and isolation.
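As a rough sketch, federation is configured in hdfs-site.xml by listing multiple nameservices, each backed by its own NameNode; the nameservice names ns1 and ns2 and the hostnames below are illustrative assumptions:

<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>nn2.example.com:8020</value>
</property>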

In MapReduce 1, Hadoop centralized all responsibilities in the JobTracker, which both allocated resources and scheduled jobs across the cluster. YARN decentralizes this to ease the pressure on the JobTracker: the Resource Manager allocates resources to particular nodes, Node Managers launch containers on those nodes, and a per-job Application Master manages and executes the job. This approach removes many JobTracker problems, improves the ability to scale up and optimizes job performance. Additionally, YARN allows multiple types of applications to run and scale out on the same distributed environment.

Effective utilization of resources, as multiple applications can run in YARN, all sharing a common pool of resources. In Hadoop MapReduce there are separate slots for Map and Reduce tasks, whereas in YARN there are no fixed slots; the same container can be used for Map and Reduce tasks, leading to better utilization.

  • YARN is backward compatible, so all existing MapReduce jobs run on it without modification.
  • Using YARN, one can even run applications that are not based on the MapReduce model.

YARN is a powerful and efficient feature rolled out as a part of Hadoop 2.0. YARN is a large-scale distributed system for running big data applications.

YARN is not a replacement for Hadoop; it is a more powerful and efficient technology that supports MapReduce and is also referred to as Hadoop 2.0 or MapReduce 2.

The basic idea of YARN is to split the functionality of resource management and job scheduling/monitoring into separate daemons.

YARN consists of the following different components:

Resource Manager - The Resource Manager is a global component or daemon, one per cluster, which manages resource requests from applications and the resources across the nodes of the cluster.

Node Manager - The Node Manager runs on each node of the cluster and is responsible for launching and monitoring containers and reporting their status back to the Resource Manager.

Application Master - The Application Master is a per-application component that is responsible for negotiating resource requirements with the Resource Manager and working with the Node Managers to execute and monitor the tasks.

Container - A container in the YARN framework is a UNIX process running on a node that executes an application-specific process with a constrained set of resources (memory, CPU, etc.).

There are many changes; the main ones are removing the single point of failure and decentralizing the JobTracker's responsibilities to the nodes of the cluster. The entire JobTracker architecture has changed.

Some of the main differences between Hadoop 1.x and 2.x are given below:

  • Single point of failure – rectified.
  • Node limitation (4,000 nodes to effectively unlimited) – rectified.
  • JobTracker bottleneck – rectified.
  • MapReduce slots changed from static to dynamic.
  • High availability – available.
  • Supports interactive and iterative graph algorithms (not supported in 1.x).
  • Allows other applications to integrate with HDFS as well.

You can configure the Resource Manager to use Capacity Scheduler by setting the value of property 'yarn.resourcemanager.scheduler.class' to 'org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler' in the file 'conf/yarn-site.xml'.
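For example, the corresponding entry in conf/yarn-site.xml is:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>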

HDFS first renames the file and places it in the /trash directory for a configurable amount of time. While the file is in trash, its blocks are not freed; only after this interval does the NameNode delete the file from the HDFS namespace and free its blocks. The interval is configured as fs.trash.interval in core-site.xml and is specified in minutes; a value of 0 disables trash, so files are deleted immediately without being stored in trash.
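For example, to keep deleted files in trash for one day (1440 minutes, an illustrative value), core-site.xml could contain:

<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>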

To upgrade from 1.x to 2.x, do not upgrade in place directly. First download the 2.x distribution locally, then remove the old 1.x files. The upgrade can take a long time.

  • The share folder is important: share/hadoop/mapreduce/lib contains the MapReduce libraries.
  • Stop all processes.
  • Delete the old metadata from work/hadoop2data.
  • Copy and rename the 1.x data into work/hadoop2.x first.
  • Do not format the NameNode during the upgrade.
  • Run hadoop namenode -upgrade (it will take a lot of time).
  • Do not close the previous terminal; open a new terminal.
  • To revert, run hadoop namenode -rollback.

  • In Hadoop 1.x, MapReduce is responsible for both processing and cluster management whereas in Hadoop 2.x processing is taken care of by other processing models and YARN is responsible for cluster management.
  • Hadoop 2.x scales better when compared to Hadoop 1.x with close to 10000 nodes per cluster.
  • Hadoop 1.x has a single point of failure problem: whenever the NameNode fails it has to be recovered manually. In Hadoop 2.x the Standby NameNode overcomes the SPOF problem; when the active NameNode fails, failover can be configured to happen automatically.
  • Hadoop 1.x works on the concept of slots whereas Hadoop 2.x works on the concept of containers and can also run generic tasks.

Hadoop admins write a script, called a topology script, to determine the rack location of nodes. It is invoked to work out the distance between nodes when replicating data. Configure this script in core-site.xml:

<property>
  <name>topology.script.file.name</name>
  <value>core/rack-awareness.sh</value>
</property>

In rack-awareness.sh you write the script that maps each node to the rack where it is located.

You can configure the Resource Manager to use FairScheduler by setting the value of property 'yarn.resourcemanager.scheduler.class' to 'org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler' in the file 'conf/yarn-site.xml'.
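The corresponding entry in conf/yarn-site.xml is:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>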

The YARN scheduler is responsible for allocating resources to user applications based on a defined scheduling policy. YARN provides three scheduling options - the FIFO scheduler, the Capacity scheduler and the Fair scheduler.

FIFO Scheduler - FIFO scheduler puts application requests in queue and runs them in the order of submission.

Capacity Scheduler - Capacity scheduler has a separate dedicated queue for smaller jobs and starts them as soon as they are submitted.

Fair Scheduler - Fair scheduler dynamically balances and allocates resources between all the running jobs.
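For the Capacity Scheduler, a dedicated queue for small jobs could be defined in capacity-scheduler.xml roughly as follows; this is a minimal sketch, and the queue names 'default' and 'small' and the capacity percentages are illustrative assumptions:

<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,small</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.small.capacity</name>
  <value>30</value>
</property>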

Measuring bandwidth is difficult in Hadoop, so the network is represented as a tree. The distance between two nodes in the tree plays a vital role when forming a Hadoop cluster and is defined by the network topology and the Java interface DNSToSwitchMapping. The distance is equal to the sum of the distances from each node to their closest common ancestor. The method getDistance(Node node1, Node node2) is used to calculate the distance between two nodes, with the assumption that the distance from a node to its parent is always 1; for example, two nodes on the same rack are at distance 2, while nodes on different racks are at distance 4.

YARN is a generic concept; it supports MapReduce, but it is not a replacement for MapReduce. You can develop many kinds of applications with the help of YARN - Spark, Drill and many more applications run on top of YARN.

Hadoop 2.x provides an upgrade to Hadoop 1.x in terms of resource management, scheduling and the manner in which execution occurs. In Hadoop 2.x the cluster resource management capabilities work in isolation from the MapReduce-specific programming logic. This helps Hadoop share resources dynamically between multiple parallel processing frameworks such as Impala and the core MapReduce component. Hadoop 2.x allows workable and fine-grained resource configuration, leading to better cluster utilization so that applications can scale to process a larger number of jobs.

Apache YARN, which stands for 'Yet Another Resource Negotiator', is Hadoop's cluster resource management system.

YARN provides APIs for requesting and working with Hadoop's cluster resources. These APIs are typically used by the distributed processing frameworks such as MapReduce, Spark and Tez that are built on top of YARN. User applications usually do not use the YARN APIs directly; instead, they use the higher-level APIs provided by the framework (MapReduce, Spark, etc.), which hide the resource management details from the user.

There are two ways to include native libraries in YARN jobs:

  1. By setting -Djava.library.path on the command line; however, in this case the native libraries might not be loaded correctly and errors are possible.
  2. The better option is to set LD_LIBRARY_PATH in the .bashrc file.

Resource Manager: equivalent to the JobTracker.

Node Manager: equivalent to the TaskTracker.

Application Master: equivalent to a job. Everything in YARN is an application; when a client submits a job (application), an Application Master is launched to manage it.

Containers: equivalent to slots.

YarnChild: when an application is submitted, the Application Master dynamically launches YarnChild processes to run the Map and Reduce tasks.

If the Application Master fails, it is not a problem: the Resource Manager automatically restarts the application by launching a new Application Master attempt.
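The number of Application Master attempts is controlled by a yarn-site.xml property; as a sketch (the value shown is illustrative, and 2 is the usual shipped default):

<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>2</value>
</property>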

Hadoop 2.0 contains four important modules, of which three are inherited from Hadoop 1.0 and one, YARN, is newly added.

Hadoop Common – This module consists of all the basic utilities and libraries required by the other modules.

HDFS- Hadoop Distributed file system that stores huge volumes of data on commodity machines across the cluster.

MapReduce- Java based programming model for data processing.

YARN- This is a new module introduced in Hadoop 2.0 for cluster resource management and job scheduling.

The YARN Resource Manager is a global component or daemon, one per cluster, which manages resource requests from applications and the resources across the nodes of the cluster.

The Resource Manager has two main components - Scheduler and Applications Manager.

Scheduler - The Scheduler is responsible for allocating resources to the running applications, based on the abstract notion of a resource container with a constrained set of resources (memory, CPU, etc.); it performs no monitoring or tracking of application status.

Application Manager - The Applications Manager is responsible for accepting job-submissions, negotiating the first container for executing the application specific Application Master and provides the service for restarting the Application Master container on failure.