HDFS Tutorial

December 20, 2017

Introduction to Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is the one of the two components that makes up the backbone of Hadoop. As the name suggests, HDFS was implemented based on the distributed file system architecture with a few extra features.

A Distributed File System (DFS) is a file system that uses network protocols to store and manage files on a server or set of servers. The goal of a DFS is to allow clients to access data as if it were on their local machine. A DFS also allows data to be stored and shared in a secure and convenient way for multiple users. The servers that store the data have full control over the data and give access control to the clients of the DFS. This is why access to servers that are part of a distributed file system are limited — the data must be retrieved through an API.

HDFS was designed based on the distributed file system architecture because it was intended to be used for huge amounts of data. Because of this need, the data is stored across many servers in what is referred to as a cluster. Since HDFS is built on commodity hardware, there is also a need for redundancy across the cluster as well to ensure no data is lost. By default, data is replicated three times inside the file system. As mentioned in a previous article, Hadoop is different because it brings the processing to the data. This is done by enabling processing to occur on the same servers that store the data.

HDFS Architecture

There are three main components that make up the Hadoop Distributed File System. They are the name node, the data nodes, and blocks.

Name Node

The Name Node is a server that is referred to as the master node because it helps direct traffic for the slave nodes (data nodes). There can be multiple name nodes for fault tolerance; however, only one is running at any given point. The name node is responsible for managing the metadata for the file system — it is responsible for knowing where every piece of data is physically located throughout the file system’s servers. It is also responsible for enforcing the security policies that have been set on the directories and files throughout the file system. Finally, the name node is responsible for executing common file system responsibilities such as opening files and directories and renaming files.

Data Nodes

Data nodes are the nodes that handle the data storage and the processing inside the Hadoop/HDFS ecosystem. Their primary responsibility is to store data inside blocks. With that comes operations to create blocks, delete blocks, and replicate blocks based on instructions received from the name node. Clients also interact with data nodes when doing read-write operations with the file system.

Blocks

Blocks are locations where segments of files are stored. HDFS is designed to store very large files. So that the files are stored efficiently for performance and replication reasons, files are divided into one or more blocks. Because of this a file can be distributed across many servers. Since HDFS does not read files like a human, segmenting of files does not bother HDFS. When the file is needed, the name node will inform the file system where each block is located and will return each block in the correct order to the client.

HDFS Architecture Diagram

In the above image, the client speaks to the name node to manage the metadata of where to write files into what blocks. The name node will look in the data nodes for space and tell the client where to write the blocks of data. The client then directly writes the blocks in to the data node that the name node specified. Once that write is complete, the replication process begins. The ideal situation is that the replication will take place over separate server racks just in case the whole rack loses power. Once each block is written, the block will report back to the name node as part of the block operations to confirm address and that the data was written.

Once the client is ready to read, it talks to the name node to know where to read from and then reaches out to the data node to read the block(s) in the parallel and then puts the file together.

Loading Data into HDFS (Hands-on Example)

First we need to make sure that you have the Cloudera Quickstart running. Please review the installation article in case you need a refresher.

Download the Data

Head over to https://grouplens.org/datasets/movielens/100k/. Download the .zip of all the data and save it somewhere on your computer that you can get to it. Make sure you unzip this file once it gets there. Once data is downloaded and unzipped, we are going to work with the u.data file.

Uploading Using Command Line

First thing to do is to tell the difference between the local linux file system and HDFS. So let’s start with the local linux file system:

ls -l /
dr-xr-xr-x   2 root root  4096 Apr  6  2016 bin
drwxr-xr-x   3 root root  4096 Apr  6  2016 boot
drwxr-xr-x  13 root root  3440 Dec 17 22:56 dev
drwxr-xr-x  85 root root  4096 Dec 17 22:56 etc
drwxrwxr-x   4 root root  4096 Dec 17 22:56 home
dr-xr-xr-x   9 root root  4096 Apr  6  2016 lib
dr-xr-xr-x   7 root root 12288 Apr  6  2016 lib64
drwx------   2 root root  4096 Mar  4  2015 lost+found
drwxr-xr-x   2 root root  4096 Sep 23  2011 media
drwxr-xr-x   2 root root  4096 Sep 23  2011 mnt
drwxr-xr-x   5 root root  4096 Apr  6  2016 opt
drwxr-xr-x   2 root root  4096 Apr  6  2016 packer-files
dr-xr-xr-x 169 root root     0 Dec 17 22:56 proc
dr-xr-x---   3 root root  4096 Apr  6  2016 root
dr-xr-xr-x   2 root root  4096 Apr  6  2016 sbin
drwxr-xr-x   3 root root  4096 Mar  4  2015 selinux
drwxr-xr-x   2 root root  4096 Sep 23  2011 srv
dr-xr-xr-x  13 root root     0 Dec 17 22:56 sys
drwxrwxrwt  36 root root  4096 Dec 17 23:03 tmp
drwxrwxr-x  15 root root  4096 Apr  6  2016 usr
drwxr-xr-x  24 root root  4096 Dec 17 22:58 var

Now let’s try querying the root of hdfs. To interact with hdfs, you need to use hdfs dfs commands.

hdfs dfs -ls /Found 5 items
drwxrwxrwx   - hdfs  supergroup          0 2016-04-06 02:26 /benchmarks
drwxr-xr-x   - hbase supergroup          0 2017-12-17 22:58 /hbase
drwxrwxrwt   - hdfs  supergroup          0 2017-12-17 22:58 /tmp
drwxr-xr-x   - hdfs  supergroup          0 2016-04-06 02:27 /user
drwxr-xr-x   - hdfs  supergroup          0 2016-04-06 02:27 /var

Notice that there are huge differences between the local linux file system and HDFS. The good news is that you can interact with HDFS using similar commands that you do with the local linux file system by just prepending hdfs dfs.

Now let’s load the data. If you’re working on a docker machine on Digital Ocean, copy the data to the docker-machine by using:

docker-machine scp -r ./HadoopCourse root@docker-sandbox:/home/course

If you’re working on a local installation of docker, copy the data to the docker container using this command instead (replace the local path to the u.data file):

docker cp path\to\ml-100k\u.data youthful_meitner:/home/course/ml-100k/u.data

Here, youthful_meitner is the name of the docker container that we setup on our local machine in the previous tutorial.

Now let’s make a new directory in HDFS named /hadoopTutorial using this command:

hdfs dfs -mkdir /hadoopTutorial

Now query HDFS again to ensure that it was created successfully:

hdfs dfs -ls /Found 6 items
drwxrwxrwx   - hdfs  supergroup          0 2016-04-06 02:26 /benchmarks
drwxr-xr-x   - root  supergroup          0 2017-12-17 23:09 /hadoopTutorial
drwxr-xr-x   - hbase supergroup          0 2017-12-17 22:58 /hbase
drwxrwxrwt   - hdfs  supergroup          0 2017-12-17 22:58 /tmp
drwxr-xr-x   - hdfs  supergroup          0 2016-04-06 02:27 /user
drwxr-xr-x   - hdfs  supergroup          0 2016-04-06 02:27 /var

Since we have verified that it is there we can now copy the data into a subdirectory under hadoopTutorial called data.

hdfs dfs -mkdir /hadoopTutorial/data
hdfs dfs -copyFromLocal home/course/ml-100k/u.data /hadoopTutorial/data/

Now let’s query HDFS to confirm that the data is there

hdfs dfs -ls /hadoopTutorial/data/
Found 1 items-rw-r--r--   1 root supergroup    1979173 2017-12-17 23:13 /hadoopTutorial/data/u.data

Congrats! You have now interacted with HDFS and put your first file onto the Hadoop Distributed File System.