Designing Real-World Systems

March 02, 2018

The following are a couple of other Hadoop ecosystem tools that we didn’t cover earlier but are important to know the basics about.

Other Hadoop Ecosystem Tools

Impala

Impala is very similar to Hive. They both use the same Hive Metastore to store information about the data and schema. Impala was developed by Cloudera to help improve the performance of queries. They achieved this by creating an execution engine that interacts directly with the data rather than utilizing MapReduce. It has also been proven to show faster query performance on small, quick queries.

Samza

Apache Samza is another streaming process tool. It combines Apache Kafka and Apache Hadoop YARN to create a fault tolerant streaming solution. Apache Kafka is used to ensure that the messages going through Samza are fault tolerant to help ensure no data loss. Samza uses YARN to ensure that processing of those messages are fault tolerant and secure. Any streaming use case is a good one for Samza especially if you are familiar with Kafka and YARN.

Solr

Solr is an open source search platform that is used by many enterprises. It’s written in Java and provides the ability to do full-text search on large amounts of data. It uses real-time indexing to allow searchability of large amounts of data. Moreover, Solr is fault tolerant, scalable, and is a proven tool for enterprise scale analytic use cases and searching.

If you don’t like Java, that’s not a problem. Client libraries for Solr are also available in other languages like C#, Ruby, PHP, and Python. It doesn’t have to be used in conjunction with Hadoop, however it is very popular in many Hadoop distributions such as Cloudera and Hortonworks.

Kudu

Apache Kudu is one of the new projects in the Hadoop ecosystem. It is a storage solution designed with fast analytics in mind. It features a columnar data store that utilizes in memory querying technique to return data fast. Kudu has also introduced with the ability to store data fast. It’s model is “fast analytics for fast data”.

It is still new to the Hadoop ecosystem with it’s first release coming in September of 2016. It is most similar to HBase because it allows storing and key-indexed record lookup in real time. However, it is also different in that Kudu takes a more relational approach to the data rather than schema-less approach HBase does. Kudu can be used when fast analytics are required over large data sets which could include logs, and metadata.

Hadoop on Amazon Web Services (AWS)

Amazon Web Services is one of the leading cloud providers. There are two ways that you can interact with Hadoop on AWS. One is actually spinning up servers and installing Hadoop yourself and working out all of the networking and configuration that goes with setting up a Hadoop cluster. Or you can go the second route and use Amazon EMR (Elastic MapReduce).

EMR is Amazon’s managed Hadoop framework that allows you to process lots of data across scalable EC2 instances. An EC2 (Elastic Compute Cloud) instance is just a web service to get compute capacity in the cloud. EMR is pretty cool because it allows you to simply pick and choose whatever you need. For example, you can get an EMR instance that supports Apache Flink or Presto or Spark in seconds without ever having to install or manage nitty gritty details. You can also hook these frameworks up to run processing on Amazon Datastores such as S3 or DynamoDB.

We’ve introduced a lot of different names and acronyms if you aren’t familiar with AWS and that’s ok. The purpose here isn’t to make you an expert in AWS infrastructure but rather to provide you a head start on where to start looking if you wanted to create a Hadoop like infrastructure on AWS. AWS has a lot of products and can be a little overwhelming to learn everything at first, but if you want to start with Hadoop use cases, EMR is a great tool to start with.

Putting It All Together

Throughout these articles, we have talked about a lot of different applications and how they fit together into a Big Data solution.

We started with our storage level of HDFS. HDFS is the backbone of Hadoop allowing for processing where the data is.
Next, we looked at different data processing frameworks for batch jobs such as MapReduce, Pig and Spark.
We then moved to data warehousing tools to show our data in a relational context such as Hive. We also showed how Sqoop makes it easy to import relational data into Hive and HDFS.
From there we looked at the usage of NoSQL databases such as HBase in the context of storing data.
We then moved on to different ways to interactively query the data using tools like Zeppelin and Presto.
We went back to the basics and looked at how a cluster is actually managed with different tools like YARN and Mesos.
We looked at how to actually schedule workflows and manage them using the tool Oozie.
From there, we went and looked at how to feed data into a Big Data environment using tools like Kafka and Flume.
Finally, we looked at how to analyze streams of data using tools like Spark Streaming, Flink, and Storm.

A typical Big Data solution will not include all of these technologies. There might be some but very rarely will we use most of these in a single solution. Let’s look at some common components:

You need a storage layer. That could be HDFS with HBase on top of it.
You will need a batch processing framework. That could be Spark.
You might want to look at your data with a relational view on top of it and run SQL queries. That would be Hive.
You need a data ingestion strategy. That could include both Flume and Kafka.
You’ll need a cluster management tool which will most likely be YARN.
You could need a streaming processing tool and that could be Apache Flink.
Most importantly you need some type of workflow management and that will most likely be Oozie.

We just described a very common Hadoop Tech Stack that could be used, but it can change depending on your use case. It is important to understand what is available to you so you can make the correct decision on the tech stack of your Big Data solution.

Hadoop is not always the Answer

In the very first article, we mentioned that Hadoop is not always the right answer for all use cases. It is important to reiterate that. Hadoop is really good at handling large amounts of data that should be processed together to gather analytics on that data. It is also good for machine learning where a plethora of data is avaialble.

Hadoop is not good for transactional use cases — those are still good to keep in relational databases. Hadoop isn’t the replacement for the relational database. It is just another tool that you can use to get the most out of your data.

Hadoop is really great at storing unstructured data in the hopes of running analytics across the data set. For example, imagine a large amount of social media data and applying a sentiment analysis algorithm across it to get an idea of what the customer is saying about your business or product. That is a very common use case for Hadoop.

Relational database solutions can feed data into Hadoop to help gain some insights to your data. Or a relational database can stream transactions to Hadoop in real time to predict if a fraudulent transactions have taken place. There are a lot of options that Hadoop can be used for as well as a lot it shouldn’t be used for. Use your best judgement when trying to determine which technology stack you will be using.

Conclusion

Hopefully you have enjoyed this introduction to the ecosystem of Hadoop. Remember Hadoop is a living and breathing thing. Developers are making commits to the open source project every day. New Hadoop ecosystem tools are being created to help make Hadoop the platform of choice for all products. What can you do to help the Hadoop ecosystem grow?