December 20, 2017
Introduction to Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is one of the two components that make up the backbone of Hadoop. As the name suggests, HDFS is built on the distributed file system architecture, with a few extra features.
A Distributed File System (DFS) is a file system that uses network protocols to store and manage files on a server or set of servers. The goal of a DFS is to allow clients to access data as if it were on their local machine. A DFS also allows data to be stored and shared in a secure and convenient way for multiple users. The servers that store the data have full control over it and decide what access the clients of the DFS are given. This is why direct access to the servers in a distributed file system is limited: the data must be retrieved through an API.
HDFS follows the distributed file system architecture because it was intended to hold huge amounts of data. To meet that need, the data is stored across many servers in what is referred to as a cluster. Since HDFS is built to run on commodity hardware, the cluster also needs redundancy to ensure no data is lost. By default, three copies of each piece of data are kept inside the file system. As mentioned in a previous article, Hadoop is different because it brings the processing to the data. This is done by enabling processing to occur on the same servers that store the data.
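To make the cost of that redundancy concrete, here is a minimal shell sketch that estimates how much raw cluster storage a file consumes. The function name raw_storage_bytes is made up for illustration and is not part of Hadoop, and the calculation ignores metadata and compression. The 1979173-byte figure is the size of the u.data file used later in this article.

```shell
# Hypothetical helper (not part of Hadoop): estimate the raw bytes consumed
# on the cluster when a file is stored `replication` times.
raw_storage_bytes() {
  local file_bytes=$1
  local replication=${2:-3}   # 3 is the default replication described above
  echo $(( file_bytes * replication ))
}

# Example: the 1979173-byte u.data file used later in this article.
raw_storage_bytes 1979173      # prints 5937519
raw_storage_bytes 1979173 2    # prints 3958346
```

In other words, with the default setting, every byte you load takes up roughly three bytes of disk across the cluster.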
There are three main components that make up the Hadoop Distributed File System. They are the name node, the data nodes, and blocks.
The name node is a server that is referred to as the master node because it directs traffic for the slave nodes (the data nodes). There can be multiple name nodes for fault tolerance; however, only one is active at any given time. The name node is responsible for managing the metadata for the file system: it knows where every piece of data is physically located across the file system’s servers. It is also responsible for enforcing the security policies that have been set on the directories and files throughout the file system. Finally, the name node carries out common file system operations such as opening files and directories and renaming files.
Data nodes are the nodes that handle data storage and processing inside the Hadoop/HDFS ecosystem. Their primary responsibility is to store data inside blocks. With that come operations to create, delete, and replicate blocks based on instructions received from the name node. Clients also interact with data nodes directly when reading from or writing to the file system.
Blocks are locations where segments of files are stored. HDFS is designed to store very large files. To store them efficiently for performance and replication reasons, each file is divided into one or more blocks. Because of this, a single file can be distributed across many servers. HDFS does not need to read a file from start to finish the way a human would, so segmenting files causes it no trouble. When the file is needed, the name node tells the client where each block is located, and the blocks are read back from the data nodes in the correct order.
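As a rough illustration of how a file maps onto blocks, the sketch below counts the blocks a file of a given size would occupy. This article does not state the block size, so the 128 MiB default used here (134217728 bytes, the default dfs.blocksize in Hadoop 2.x and later) is an assumption; your cluster may be configured differently. The function name block_count is hypothetical.

```shell
# Hypothetical helper: how many blocks does a file of a given size occupy?
# Assumes a 128 MiB block size unless another value is passed in.
block_count() {
  local file_bytes=$1
  local block_bytes=${2:-134217728}
  # Ceiling division: a partially filled last block still counts as a block.
  echo $(( (file_bytes + block_bytes - 1) / block_bytes ))
}

block_count 1979173        # u.data fits in a single block: prints 1
block_count 1073741824     # a 1 GiB file: prints 8
block_count 1073741825     # one more byte spills into a ninth block: prints 9
```

Note that a block that is not full only occupies as much disk space as the data it actually holds.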
HDFS Architecture Diagram
In the above image, the client speaks to the name node, which manages the metadata describing which files are written into which blocks. The name node picks data nodes that have space and tells the client where to write the blocks of data. The client then writes the blocks directly to the data nodes that the name node specified. Once that write is complete, the replication process begins. Ideally, the replicas are placed on separate server racks in case a whole rack loses power. Once each block is written, the data node reports back to the name node as part of the block operations to confirm the block’s location and that the data was written.
When the client is ready to read, it asks the name node where the blocks are located, then reaches out to the data nodes to read the block(s) in parallel and puts the file back together.
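The idea of cutting a file into pieces and putting it back together in order can be tried on any Linux machine. The sketch below is only a local analogy built from the standard split, cat, and cmp tools; it does not talk to HDFS, and the function name split_and_reassemble is made up for illustration.

```shell
# Local analogy only: this uses standard Unix tools, not HDFS.
# It cuts a non-empty file into fixed-size pieces and reassembles them in order.
split_and_reassemble() {
  local source_file=$1
  local piece_bytes=$2
  local work_dir
  work_dir=$(mktemp -d) || return 1

  # split names the pieces block_aa, block_ab, ... so their sorted order
  # matches the original byte order.
  split -b "$piece_bytes" "$source_file" "$work_dir/block_"

  # The shell expands the glob in sorted order, which plays the role of
  # reading the blocks back in the order recorded in the metadata.
  cat "$work_dir"/block_* > "$work_dir/reassembled"

  if cmp -s "$source_file" "$work_dir/reassembled"; then
    echo "match"
  else
    echo "mismatch"
  fi
  rm -rf "$work_dir"
}

# Small demonstration with a throwaway 19-byte file cut into 5-byte pieces.
sample=$(mktemp)
printf 'one\ntwo\nthree\nfour\n' > "$sample"
split_and_reassemble "$sample" 5   # prints match
rm -f "$sample"
```

In real HDFS the pieces live on different servers and the order comes from the name node’s metadata rather than from file names, but the principle of reassembling ordered segments is the same.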
Loading Data into HDFS (Hands-on Example)
First we need to make sure that you have the Cloudera Quickstart running. Please review the installation article in case you need a refresher.
Download the Data
Head over to https://grouplens.org/datasets/movielens/100k/. Download the .zip of all the data and save it somewhere on your computer where you can easily find it. Make sure you unzip the file once the download finishes. From the unzipped data, we are going to work with the u.data file.
Uploading Using Command Line
The first thing to do is to see the difference between the local Linux file system and HDFS. So let’s start with the local Linux file system:
ls -l /
dr-xr-xr-x 2 root root 4096 Apr 6 2016 bin
drwxr-xr-x 3 root root 4096 Apr 6 2016 boot
drwxr-xr-x 13 root root 3440 Dec 17 22:56 dev
drwxr-xr-x 85 root root 4096 Dec 17 22:56 etc
drwxrwxr-x 4 root root 4096 Dec 17 22:56 home
dr-xr-xr-x 9 root root 4096 Apr 6 2016 lib
dr-xr-xr-x 7 root root 12288 Apr 6 2016 lib64
drwx------ 2 root root 4096 Mar 4 2015 lost+found
drwxr-xr-x 2 root root 4096 Sep 23 2011 media
drwxr-xr-x 2 root root 4096 Sep 23 2011 mnt
drwxr-xr-x 5 root root 4096 Apr 6 2016 opt
drwxr-xr-x 2 root root 4096 Apr 6 2016 packer-files
dr-xr-xr-x 169 root root 0 Dec 17 22:56 proc
dr-xr-x--- 3 root root 4096 Apr 6 2016 root
dr-xr-xr-x 2 root root 4096 Apr 6 2016 sbin
drwxr-xr-x 3 root root 4096 Mar 4 2015 selinux
drwxr-xr-x 2 root root 4096 Sep 23 2011 srv
dr-xr-xr-x 13 root root 0 Dec 17 22:56 sys
drwxrwxrwt 36 root root 4096 Dec 17 23:03 tmp
drwxrwxr-x 15 root root 4096 Apr 6 2016 usr
drwxr-xr-x 24 root root 4096 Dec 17 22:58 var
Now let’s try querying the root of HDFS. To interact with HDFS, you need to use hdfs dfs commands.
hdfs dfs -ls /
Found 5 items
drwxrwxrwx - hdfs supergroup 0 2016-04-06 02:26 /benchmarks
drwxr-xr-x - hbase supergroup 0 2017-12-17 22:58 /hbase
drwxrwxrwt - hdfs supergroup 0 2017-12-17 22:58 /tmp
drwxr-xr-x - hdfs supergroup 0 2016-04-06 02:27 /user
drwxr-xr-x - hdfs supergroup 0 2016-04-06 02:27 /var
Notice how different the contents of the local Linux file system and HDFS are. The good news is that you can interact with HDFS using commands similar to the ones you use on the local Linux file system: prefix the command with hdfs dfs and a dash, as in hdfs dfs -ls.
Now let’s load the data. If you’re working on a Docker machine on DigitalOcean, copy the data to the machine by using:
docker-machine scp -r ./HadoopCourse root@docker-sandbox:/home/course
If you’re working on a local installation of Docker, copy the data to the Docker container using this command instead (replace the first path with the local path to your u.data file):
docker cp path\to\ml-100k\u.data youthful_meitner:/home/course/ml-100k/u.data
Here, youthful_meitner is the name of the Docker container that we set up on our local machine in the previous tutorial.
Now let’s make a new directory in HDFS named /hadoopTutorial using this command:
hdfs dfs -mkdir /hadoopTutorial
Now query HDFS again to ensure that it was created successfully:
hdfs dfs -ls /
Found 6 items
drwxrwxrwx - hdfs supergroup 0 2016-04-06 02:26 /benchmarks
drwxr-xr-x - root supergroup 0 2017-12-17 23:09 /hadoopTutorial
drwxr-xr-x - hbase supergroup 0 2017-12-17 22:58 /hbase
drwxrwxrwt - hdfs supergroup 0 2017-12-17 22:58 /tmp
drwxr-xr-x - hdfs supergroup 0 2016-04-06 02:27 /user
drwxr-xr-x - hdfs supergroup 0 2016-04-06 02:27 /var
Since we have verified that the directory is there, we can now copy the data into a subdirectory under /hadoopTutorial called data.
hdfs dfs -mkdir /hadoopTutorial/data
hdfs dfs -copyFromLocal /home/course/ml-100k/u.data /hadoopTutorial/data/
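If you expect to repeat this step, a small wrapper can make the upload safe to re-run. The sketch below is one possible approach, not the only one: upload_to_hdfs is a hypothetical function name, while the -p flag of hdfs dfs -mkdir and the -f flag of hdfs dfs -put are standard options of the HDFS shell. Here -put is used in place of -copyFromLocal; the two behave alike for local source files.

```shell
# Hypothetical wrapper around the real `hdfs dfs` subcommands.
upload_to_hdfs() {
  local local_file=$1
  local hdfs_dir=$2

  # Fail early with a clear message if the local file is missing.
  if [ ! -f "$local_file" ]; then
    echo "local file not found: $local_file" >&2
    return 1
  fi

  # -p creates missing parent directories and succeeds if the directory exists.
  hdfs dfs -mkdir -p "$hdfs_dir" || return 1
  # -f overwrites the destination file if it is already present.
  hdfs dfs -put -f "$local_file" "$hdfs_dir/"
}

# Usage inside the container, with the paths from this tutorial:
# upload_to_hdfs /home/course/ml-100k/u.data /hadoopTutorial/data
```

Because of -p and -f, running the function a second time does not fail on the existing directory or file; it simply replaces the file.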
Now let’s query HDFS to confirm that the data is there:
hdfs dfs -ls /hadoopTutorial/data/
Found 1 items
-rw-r--r-- 1 root supergroup 1979173 2017-12-17 23:13 /hadoopTutorial/data/u.data
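Beyond checking that the file is listed, you may want to confirm that the upload is complete by comparing sizes. In the listing above, the fifth column is the file size in bytes. The sketch below pulls that column out with awk; hdfs_ls_size is a hypothetical helper name, and it assumes paths without spaces.

```shell
# Hypothetical helper: read `hdfs dfs -ls` output on stdin and print the
# size column (field 5) for the entry whose path matches the argument.
hdfs_ls_size() {
  awk -v target="$1" '$NF == target { print $5 }'
}

# Demonstration using the listing line shown above, so it runs without a cluster.
listing='-rw-r--r--   1 root supergroup    1979173 2017-12-17 23:13 /hadoopTutorial/data/u.data'
echo "$listing" | hdfs_ls_size /hadoopTutorial/data/u.data   # prints 1979173

# Against a live cluster, the comparison could look like this:
# hdfs_bytes=$(hdfs dfs -ls /hadoopTutorial/data/ | hdfs_ls_size /hadoopTutorial/data/u.data)
# local_bytes=$(wc -c < /home/course/ml-100k/u.data)
# [ "$hdfs_bytes" -eq "$local_bytes" ] && echo "sizes match"
```

If the two numbers agree, the file in HDFS has the same length as your local copy.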
Congrats! You have now interacted with HDFS and put your first file onto the Hadoop Distributed File System.