Apache Cassandra is an open-source, distributed, decentralized database that is ideal for handling huge amounts of structured data across multiple sites in multiple regions.
It provides high availability without a single point of failure, so it stays continuously online, which makes it ideal for business applications that need high degrees of uptime.
Cassandra is also easy to scale by nature. Just like Hadoop, when you scale Cassandra out because you need more space, you gain more compute resources as well, so query response times can stay low even as the data grows.
Cassandra does not support full ACID transactions the way relational databases do. (Remember, ACID stands for Atomicity, Consistency, Isolation, and Durability.) Writes are atomic and durable at the row level, but consistency is tunable rather than guaranteed.
In particular, Cassandra has a concept of eventual consistency, meaning that after a write, the data you wrote might not be immediately available for reading on another node. It takes time to replicate the data across the cluster.
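To get a feel for what eventual consistency means in practice, here is a toy in-memory sketch in plain Python. This is not the Cassandra API; the class names and the single-replica acknowledgment are illustrative assumptions that mimic a write at a low consistency level followed by background replication.

```python
class Replica:
    def __init__(self):
        self.data = {}

class ToyCluster:
    # A toy stand-in for a replicated cluster (NOT real Cassandra).
    def __init__(self, n=3):
        self.replicas = [Replica() for _ in range(n)]

    def write(self, key, value):
        # The write is acknowledged once a single replica accepts it,
        # similar in spirit to a low consistency level.
        self.replicas[0].data[key] = value

    def read(self, key, replica=1):
        # Reading from a different replica may return stale (missing) data.
        return self.replicas[replica].data.get(key)

    def replicate(self):
        # Background replication eventually brings replicas in sync.
        for r in self.replicas[1:]:
            r.data.update(self.replicas[0].data)

cluster = ToyCluster()
cluster.write("user:1", "alice")
print(cluster.read("user:1"))   # None: replica 1 has not caught up yet
cluster.replicate()
print(cluster.read("user:1"))   # alice: the write has now been replicated
```

The gap between the two reads is exactly the window in which a real cluster is "eventually" consistent.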
MongoDB is another type of NoSQL database that boasts high performance, high availability, and easy scalability. It is based on the idea of collections and documents.
A database in the MongoDB world is a physical, organized assembly of collections. A collection is a group of documents, analogous to a table in a relational database. A document is a set of key-value pairs.
Documents have a dynamic schema, meaning that documents in the same collection do not need to share the same structure. Even matching fields can hold different types of data. This concept makes MongoDB very flexible. This dynamic schema is sometimes known as schema-less, meaning that a document can contain pretty much anything.
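To make the dynamic schema concrete, here is a sketch using plain Python dicts to stand in for documents in a single hypothetical "users" collection (the field names are made up for illustration):

```python
# Three documents in the same collection, with differing structures.
users = [
    {"name": "Ada", "age": 36},                   # has an age field
    {"name": "Grace", "languages": ["COBOL"]},    # no age, an extra field
    {"name": "Alan", "age": "forty-one"},         # same field, different type
]

# All three are valid members of the same collection; code just has to
# tolerate missing fields.
for doc in users:
    print(doc.get("name"), doc.get("age"))
```

A relational table would reject the second and third rows; a document store accepts all three.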
Let’s talk about a couple of niceties that MongoDB offers. One is the flexibility you get from the schema-less architecture we touched on above. Another is that a single object’s structure is very clear: you can tell exactly what it is. And don’t worry about complex joins in MongoDB; we will get to why in a second. MongoDB c...
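The reason joins are rarely needed is that related data can be embedded directly inside one document. Here is a sketch, again using a plain Python dict to stand in for a hypothetical order document (the field names and values are invented for illustration):

```python
# In a relational schema, orders and line items would live in separate
# tables and require a join. In the document model, items are embedded.
order = {
    "order_id": 1001,
    "customer": "Ada",
    "items": [
        {"sku": "A-1", "qty": 2, "price": 9.99},
        {"sku": "B-7", "qty": 1, "price": 24.50},
    ],
}

# A single read retrieves the whole order; no join is required.
total = sum(item["qty"] * item["price"] for item in order["items"])
print(total)   # about 44.48
```

One document holds everything the application needs to render the order, which is why complex joins largely disappear.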
Sqoop is a tool that uses the MapReduce framework to move data between relational databases and HDFS in a parallel fashion. That’s all Sqoop does; it is a simple but effective tool. Let’s create a MySQL database and load it into HDFS using Sqoop.
The first thing we need to do is create a database in MySQL.
mysql -u root -p
The password is cloudera.
create database people;
use people;
create table friends(id int, name varchar(20), nickname varchar(20));
desc friends;

mysql> desc friends;
+----------+-------------+------+-----+---------+-------+
| Field    | Type        | Null | Key | Default | Extra |
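With the table in place, a Sqoop import could then pull it into HDFS. As a sketch, the command is built here as a Python argument list so each flag is easy to annotate; the connection string, target directory, and mapper count are illustrative assumptions for a local Cloudera-style setup, not values from the original.

```python
# Hypothetical Sqoop import of the `friends` table into HDFS.
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://localhost/people",  # JDBC URL for the people DB
    "--username", "root",
    "--password", "cloudera",
    "--table", "friends",                          # source table to import
    "--target-dir", "/user/cloudera/friends",      # HDFS destination directory
    "-m", "1",                                     # number of parallel mappers
]
print(" ".join(sqoop_import))
```

The `-m` flag is where the parallelism mentioned above comes in: Sqoop splits the table across that many map tasks.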
MapReduce is a Java-based framework used for processing large amounts of data. The MapReduce framework is made up of two major components: Map and Reduce.
The Map part of the algorithm takes data and transforms it into key/value pairs (<k1,v1>) or tuples.
The Reduce part of the algorithm takes the output of the Map step and combines, or reduces, it into a smaller set of tuples.
The advantage of this simple framework is the ability to scale across multiple nodes based on a configuration change.
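The map and reduce steps above can be sketched in pure Python. This is a toy word count, the canonical MapReduce example; the real framework would distribute the map and reduce functions across many nodes, but the logic is the same.

```python
from collections import defaultdict

def map_phase(text):
    # Map: emit a (word, 1) pair for every word (the <k1, v1> tuples).
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into one smaller result per key.
    return {key: sum(values) for key, values in groups.items()}

pairs = map_phase("the quick fox saw the slow fox")
counts = reduce_phase(shuffle(pairs))
print(counts)   # {'the': 2, 'quick': 1, 'fox': 2, 'saw': 1, 'slow': 1}
```

Because each map call and each reduce call is independent, scaling out is a matter of running more of them in parallel, which is exactly the configuration-driven scaling described above.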
We'll go through a hands-on example at the bottom of this article.
MapReduce is the main processing algorithm and processing framework in Hadoop. Whether you write your own MapReduce code or use applications such as Pig that compile down to MapReduce, MapRed...
The following are a couple of other Hadoop ecosystem tools that we didn’t cover earlier but are important to know the basics about.
Impala is very similar to Hive; both use the same Hive Metastore to store information about the data and schema. Impala was developed by Cloudera to improve query performance, which it achieves with an execution engine that interacts directly with the data rather than going through MapReduce. It has been shown to deliver faster performance on short, interactive queries.
Apache Samza is another streaming process tool. It c...
Nowadays, companies need an arsenal of tools to combat data problems. Traditionally, batch jobs have been able to give companies the insights they need to perform at the right level. However, with the emergence of web-based applications, where data is created at high velocity and customers want to see results in near real time, new technology is necessary to combat this problem.
This is where Streaming Tools come into play. They are data processing tools that can carry out various forms of processing in near real time in the hopes of providing customers with accurate information instantaneously.
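As a sketch of the idea, assume a simple feed of transaction amounts. A streaming processor keeps running state and reacts to each event as it arrives, instead of waiting for a nightly batch job; the threshold-based fraud flag below is an invented toy rule, not a real detection method.

```python
def stream_alerts(transactions, threshold=1000.0):
    # Process each event as it arrives, keeping a running total and
    # flagging any single transaction over the threshold.
    total = 0.0
    alerts = []
    for amount in transactions:
        total += amount
        if amount > threshold:
            alerts.append(amount)   # e.g., a possibly fraudulent charge
    return total, alerts

total, alerts = stream_alerts([25.0, 1500.0, 80.0])
print(total, alerts)   # 1605.0 [1500.0]
```

The key property is that the alert for the 1500.0 charge is available the moment that event is seen, not hours later when a batch completes.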
Streaming tools can also process data to give the organization accurate metrics of how the business is performing, or find out in real time if a credit card transaction was fraudulent. There are a ton of use cases for streaming tools, but ...
In this article, we are going to look at some of the different types of data that flow through a system. We will also look at some of the components of a Big Data system at a high level.
There are three main types of data that you will see. These are structured, semi-structured, and unstructured data. Let’s take a look at them and see some examples of them.
If you have any experience with relational databases, you know exactly what structured data is. Structured data is data that has clear organization and can be queried using basic algorithms. So...
With the advancements in technology, humans and machines now create, by some estimates, more data in two days than the world created from the beginning of time up until 2003.
Think about that for a second. More data since the beginning of time until 2003. That’s a lot of data.
Big Data is the notion of very large amounts of data that a person or organization uses to gather some information or insight. When talking about Big Data and its definition, the 5 V’s are very important: Volume, Velocity, Variety, Veracity, and Value.
Oozie is a workflow management system that is designed to schedule and run Hadoop jobs in a distributed environment. Oozie has the ability to schedule multiple complex jobs in a sequential order or to run in parallel. It integrates well with Hadoop jobs such as MapReduce, Hive, Pig, and others, and allows ssh and shell access.
Oozie is a Java web application released under the Apache 2.0 license. Oozie kicks off jobs with a unique callback HTTP URL so that the task can notify that URL when it completes. If by chance the task doesn’t call back to that URL, Oozie polls it to make sure the task did complete.
There are three major types of configurable items inside of Oozie: workflow jobs (directed acyclic graphs of actions), coordinator jobs (workflows triggered by time and data availability), and bundle jobs (collections of coordinators managed as a single unit).
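A workflow job is defined in an XML file describing the DAG of actions. As a sketch, a minimal workflow with a single shell action might look like the following; the application name, action name, and properties are illustrative assumptions, not from the original.

```xml
<!-- Minimal hypothetical workflow: one shell action, then end or fail. -->
<workflow-app name="friends-demo" xmlns="uri:oozie:workflow:0.5">
    <start to="list-dir"/>
    <action name="list-dir">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>ls</exec>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Each action names the node to transition to on success (`ok`) and on failure (`error`), which is how Oozie strings complex jobs together sequentially or in parallel.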
A NoSQL database is a type of database that stores and retrieves data differently than a traditional relational database. NoSQL databases are also known as non-relational databases, or Not-only-SQL databases, because they can offer a SQL-like language for querying data.
NoSQL databases grew in popularity with the emergence of companies such as Google and Amazon because of the requirements of real-time web applications combined with massive amounts of data (Big Data). Their simple design allows easy horizontal scaling across multiple nodes, resulting in high availability. They also offer a lot of options for how data is stored and retrieved: key-value pairs, wide columns, documents, and graphs, providing a great deal of flexibility.
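The simplest of these models, the key-value store, can be sketched in a few lines of Python. This dict-backed toy is only an illustration of the model, not a real database:

```python
class ToyKVStore:
    # A minimal key-value store: opaque values addressed only by key.
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        # No joins and no query planner: lookups are by key alone.
        return self._data.get(key, default)

store = ToyKVStore()
store.put("session:42", {"user": "ada", "cart": ["book"]})
print(store.get("session:42"))
```

Because every operation is a single-key lookup, this model shards trivially across nodes (hash the key, pick a node), which is the root of the easy horizontal scaling described above.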
Hadoop has some limitations when it comes to process...