YARN or Yet Another Resource Negotiator is one of the resource management tools of the Hadoop ecosystem. YARN is responsible for managing the resources and scheduling jobs to get the most out of your Hadoop cluster. Let’s dive right in and start looking at some of the basics of YARN.
In a Hadoop cluster that YARN is the resource management tool of, there are a bunch of nodes. These are the same nodes that make up HDFS. But running on these nodes can be two different types of YARN processes: ResourceManager and NodeManager.
A ResourceManager is the daemon that acts as the master to communicate with client processes, schedule tasks to NodeManagers, and keep track of the resources on the cluster.
The NodeManager is the worker daemon which kicks off and keeps track of processes that are on nodes.
The Resource Manager receives request for resources for a particular MapReduce, Spark, or other YARN applications job. The Resource Manager determines where there are resources available to execute that job and then schedules a Node Manager with the appropriate information to kick off that job.
Even though YARN is rather simplistic, it has one main configuration that helps define things for the instance. The YARN configuration file is an XML file that determines how the YARN cluster should look. It defines the correct locations for the Resource Managers and the Node Managers.
Because YARN has to keep track of all of the resources on the cluster, it has to look at a couple of things to ensure that it knows what is and what isn’t available for resources. YARN looks at two main resources: vcores and memory.
A vcore or virtual core is simply the usage share of a CPU core. So while NodeManager is tracking usage of its local resources, it sends those stats back to the ResourceManager so that it knows how many vcores and how much memory is available to help schedule jobs in the future. When an application wants to use some resources, it requests a container — this is essentially a way to put a hold on resources. Once that hold has been approved, the NodeManager creates a task inside of that container which will initiate its part of the application that it was assigned. This is a very important concept of YARN — it revolves around resource availability and the resource requirements of an application.
As stated above, an application could be many things such as a Spark job, MapReduce, etc. Even though these have different execution patterns, they communicate with YARN in a similar way.
The process kicks off when an application begins talking to the ResourceManager of the cluster. The ResourceManger makes a single container for that application which houses the ApplicationMaster. The ApplicationMaster is responsible for coordinating tasks on the YARN cluster. Once the ApplicationMaster is spun up by the ResourceManager, the ApplicationMaster starts to request more containers from the ResourceManager to be allocated for running tasks for the application. Once all of the tasks that the ApplicationMaster has created are finished, the containers are given back to the cluster so that ResourceManager can reallocate them to other applications.
But what happens when there are more applications that need resources than the cluster actually has?
This is where YARN implements the concept of Fair Scheduler. Fair Scheduler enforces that each user that uses the Hadoop cluster has the rights to their fair share of resources.
Let’s say for example, we have 2 users and there are a total of 100 vcores and 100 GB of memory. Each user would have rights to 50 vcores and 50 GB of memory. If that number become 3 users, each would have rights to ~33.3 vcores and ~33.3 GB of memory.
It's great that they have rights to it but they might not be using their entire allocation all the time, whereas someone else might need more than their fair share during this time. In this case, YARN will give some of a user’s fair share to help out another user that needs more resources than they have the rights to.
This process work great until that user needs their resources back to run their own job. At that time, YARN will preempt (kill) the containers that are over the users fair share so that those resources can go back to their original owner. These terminated containers will get rescheduled and wait for resources to become available. This is a great concept to have when dealing with multi tenant clusters (clusters that hold many different end users and use cases).
Chances are you have already used YARN when submitting Spark jobs. Especially if you have done some of the other articles. YARN is a very popular resource manager in the Hadoop ecosystem and most of the top ecosystem tools have integrations with YARN so that it is seamless to the user. However, it is still good to understand how it works on the backend to plan for cluster sizes and to tweak your jobs performance by assigning more or less containers.
Apache Tez is a framework that is built on top of YARN to help developers increase the capabilities of the MapReduce framework. MapReduce is well known for its ability to run batch processing jobs. Apache Tez tries to bridge the gap by creating the flexibility to create more streaming jobs that are required in today’s world.
Tez is based on data flow graphs so that MapReduce can perform faster and quicker.
Data flow graphs are graphs that represent dependencies of data between many operations. This is especially useful in parallel computing — Machine Learning and Tensor Flow algorithms also use data flow graphs to calculate the most efficient path based on the way the data flows.
Many Hadoop ecosystems including Hive and Pig have seen improvement in response time when using Tez. It is good to note that Hive has functionality to use Spark as its backend engine rather than MapReduce so the use of Tez to help optimize the Map Reduce engine isn't needed.
Some people say that Mesos and YARN are two of the same breed and that they can be interchanged without having to worry about anything. That is not entirely true.
YARN was built specifically for Hadoop to help manage resources. Mesos was built to manage resources for your entire data center. It has a lot of other use cases other than Hadoop, however, it is compatible with it. For this discussion, we will be focusing on the Hadoop use case.
When it comes to submitting applications to a Mesos managed cluster, you won’t see that many differences when compared to YARN. The major difference is that when a job comes from the client, it will be sent to the Mesos master who determines what the resources of the cluster looks like and then makes an offer back to the client. These offers can then be accepted or denied based on what the client needs. This is an interesting switch from YARN who will always accept your request for resources, but just might not do anything with it if there are no resources available. The Mesos way may seem a little strange, but it offers the ability to have insight to what your cluster is actually doing and determine the need to scale to more nodes if consistent rejections start to happen.
When working in a distributed world like Hadoop, it is very easy for different nodes to get out of sync. They need something coordinating them and keeping them in line. If you think about the name Zookeeper, you think of a person that makes sure that a zoo filled with a bunch of animals keeps running smoothly. The same can be thought of this Zookeeper.
Apache Zookeeper is a service used by a cluster of machines that holds data that needs to be kept in sync and ensures that that data is available for nodes. It actually runs in a distributed world as well, thus requiring the best synchronization techniques to ensure that the its servers are always up to date. It is very important to have an odd number of Zookeeper instances so that if there are any conflicts they can have a tie breaker.
Some of the common services that Zookeeper offers to a cluster are the following:
- Naming Service — Provides the ability to identify nodes by name
- Configuration Management — Ability to provide the latest configuration information of the system
- Cluster Management — Provide the real time node statuses and managing the joining and leaving of a node
- Leader Election — Electing a leader out of the nodes to provide coordination
All of these services are critical in a distributed system. It gives the ability to keep all nodes on the same page. Without a service like Zookeeper maintaining all these services, managing a distributed systems and trying to keep it in sync would be almost impossible.
Hue gives the end user a UI that provides a unique view into Hadoop. In this section we will walk through a quick tour of Hue.
To get started, if you are using Digital Ocean, go ahead and start up your instance and change the memory to 16 GB like we showed in the Introduction to Hadoop. After you have ssh’d into that machine, we will all be on the same page. Both Digital Ocean and regular Docker users can run this command:
docker run --hostname=quickstart.cloudera --privileged=true -t -i -p 8888:8888 cloudera/quickstart:latest /usr/bin/docker-quickstart bash
To get to Hue, regular Docker users can type localhost:8888 in their browser and should be able to go there. Digital Ocean users go to your homepage and look at your instance and find the IP address column like so:
Then you can use that ip address and put it into your browser like so 000.00.000.000:8888 except use your IP address. We should now all see a login screen. The username and password is cloudera and cloudera. Now we should be at this window:
From top to left let’s just go through what each of these buttons give you:
- Hue will take you to the homepage of Hue.
- The Home button will take you to your documents where you can create a Hive query, Impala query, Pig script, and Oozie components.
- Query Editors will give you a drop down for some of the editors that come with Hue including Hive, Impala, DB Query, Pig, and a Job Designer.
- Data Browsers will give you options like Metastore Tables, Hbase, and Sqoop imports.
- Workflows will allow you to look at and create Oozie jobs.
- Search is a UI for Solr Searching if you have that setup.
- Security will give a look at Sentry tables and HDFS File ACLs (Access Control List or the way to lock down directories in HDFS).
- File Browser will give you UI on top of HDFS with an interactive look.
- Job Browser will let you look at all of the jobs that are currently and have previously ran.
There are a ton of features in Hue. The only issue is that Hue sometimes can be unreliable depending on your installation of it but overall is a great tool for actually “seeing” some of the concepts of Hadoop. This has been available for all of the previous courses but it is important to learn to do everything the hard way. I encourage you to just explore Hue and see what all you can do.
Let’s load some data like we did in a previous article and put a Hive table over it.
Go ahead and click File Browser and you should see this.
Click the backslash in front of user to get you to the / and then press New in the top right and click directory and name it moviedata. Now you should see this:
Now let’s hit the Upload -> Files and pick the movies data that we have been working with. Once it is uploaded, grab and drop the file into the moviedata directory. Now let’s click on Query Editors and then Hive. This should bring up a screen like this:
Now inside of the query box, you can put this query and the magic should happen:
CREATE EXTERNAL TABLE movies ( movieid INT, title STRING, genres ARRAY<STRING>) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY "|" STORED AS TEXTFILE LOCATION '/moviedata/';
Now test it by doing a:
select * from movies;
And it should show up nice and neat. If you’re having some trouble, take a look back at the Introduction to Hive because these are the same files and similar commands, we just did them in a lot less time. Enjoy Hue! It does make a Hadoop a little more friendly!