In this article, we are going to look at the different types of data that flow through a system. We will also look at some of the components of a Big Data system at a high level.
There are three main types of data that you will see. These are structured, semi-structured, and unstructured data. Let’s take a look at them and see some examples of them.
If you have any experience with relational databases, you know exactly what structured data is. Structured data is data that has a clear organization and can be queried using basic algorithms. Some examples of structured data include grocery store transactions, Excel spreadsheets, and anything else that has a clear organizational structure. Most data inside a relational database is structured. If you look at grocery store transactions, every transaction will always have an item number, an amount of money charged, and a quantity. This organization allows a basic algorithm (addition) to compute the total of all the transactions very easily.
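To make the grocery store example concrete, here is a minimal sketch in Python. The `Transaction` class, its field names, and the sample values are made up for illustration; the point is only that when every record has the same fields, a total is a one-line sum.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Transaction:
    """One structured record: every transaction has the same three fields."""
    item_number: int
    amount_charged: Decimal
    quantity: int


# Hypothetical sample data, not taken from a real system.
transactions = [
    Transaction(item_number=1001, amount_charged=Decimal("3.49"), quantity=2),
    Transaction(item_number=1002, amount_charged=Decimal("12.00"), quantity=1),
    Transaction(item_number=1003, amount_charged=Decimal("0.99"), quantity=5),
]

# Because the structure is fixed, the "query" is plain addition.
total_charged = sum((t.amount_charged for t in transactions), Decimal("0"))
total_items = sum(t.quantity for t in transactions)

print(total_charged, total_items)
```

`Decimal` is used here instead of `float` so that money amounts add up without rounding surprises.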
Unstructured data has little to no organization and has to be queried using very complex algorithms. The perfect example of this is human language. People don’t always speak using the same structure. There are different languages, slang, and just plain improper grammar. Data like this is very hard to organize and query inside of a relational database.
Semi-structured data has some organization but doesn’t necessarily fit into the traditional constructs of structured data, and it takes more complex queries to glean information from it.
Semi-structured data has really become popular with the Internet. Think of a REST API. It typically communicates using semi-structured data. JSON and XML are perfect examples of semi-structured data. They have a clear structure, but we couldn’t put them straight into a database without some processing.
Think of a Twitter post that has a distinct structure: the message and the time it was tweeted. Inside of that message there can be words, pictures, videos, and emojis. That makes it semi-structured by nature. We can easily pick the message out of the payload, but processing the message takes more unstructured-type algorithms to get the necessary information out of it.
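Here is a small sketch of that idea in Python. The payload and its field names (`created_at`, `message`, `media`) are invented for illustration and are not Twitter's actual API format. Pulling the message out of the payload is a simple lookup, while getting meaning out of the message needs extra work. Here that work is only a naive hashtag scan.

```python
import json

# Hypothetical payload; the field names are assumptions for this example.
payload = """
{
  "created_at": "2020-01-15T09:30:00Z",
  "message": "Loving the new #bigdata course!",
  "media": [{"type": "photo"}]
}
"""

post = json.loads(payload)

# The structured part: known keys, easy to pick out.
created_at = post["created_at"]
message = post["message"]
media_types = [item["type"] for item in post.get("media", [])]

# The unstructured part: free text needs its own processing.
# This is a deliberately naive scan, not real language processing.
hashtags = [word.strip("!.,?") for word in message.split() if word.startswith("#")]

print(created_at, media_types, hashtags)
```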
When we talk about Big Data and how to architect a system to manage the entire pipeline, most people want to skip to the end and talk about the outcome of it. That’s an important part, but there are many steps in between that need to happen to create an effective Big Data system. In this section, we are going to look at some of the major concepts of each piece of a system.
Data ingestion is an incredibly important part of the system. Graphs and dashboards are cool, but they require data, and data doesn’t just magically appear inside of your system. There has to be some type of data ingestion to deliver data into the system. It has to handle the volume and velocity of the data while staying available all the time. To stay available, it has to be fault tolerant and able to redirect data when things go wrong, because if anything is certain, it is that things can and will go wrong.
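One way to picture "redirection when things go wrong" is an ingestion loop that sends each record to a primary destination and falls back to a second one on failure. The sketch below is only an illustration under simple assumptions: the `ingest` function, the sink callables, and the simulated failure are all hypothetical, and a real system would add retries, batching, and monitoring.

```python
from typing import Callable, Dict, Iterable

Sink = Callable[[dict], None]


def ingest(records: Iterable[dict], primary: Sink, fallback: Sink) -> Dict[str, int]:
    """Deliver each record to the primary sink, redirecting to the fallback on failure."""
    counts = {"primary": 0, "fallback": 0}
    for record in records:
        try:
            primary(record)
            counts["primary"] += 1
        except ConnectionError:
            # Redirect instead of dropping the record.
            fallback(record)
            counts["fallback"] += 1
    return counts


stored, buffered = [], []


def flaky_primary(record: dict) -> None:
    """Simulated primary sink that fails for even ids."""
    if record["id"] % 2 == 0:
        raise ConnectionError("primary sink unavailable")
    stored.append(record)


result = ingest(({"id": i} for i in range(1, 6)), flaky_primary, buffered.append)
print(result)
```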
Data storage is another big part of the system. Data ingestion gets the data into the system, and data storage is where you actually put the data. Since we are dealing with BIG data, we need a data storage system that has the potential to become very, very big. In other words, it needs to be scalable. The ability to add more and more storage to your solution is an important concept so that you never run out of room. While you can keep adding storage, it is also important to establish a data retention strategy so you’re not keeping old data around for no reason, because that is wasted money.
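A retention strategy can be as simple as "drop anything older than N days." The sketch below assumes each record carries a `stored_at` timestamp; the `apply_retention` helper, the field name, and the 90-day window are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def apply_retention(records, max_age, now):
    """Keep only the records stored within max_age of now."""
    cutoff = now - max_age
    return [record for record in records if record["stored_at"] >= cutoff]


# A fixed "now" keeps the example repeatable.
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
records = [
    {"id": 1, "stored_at": datetime(2023, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "stored_at": datetime(2024, 1, 15, tzinfo=timezone.utc)},
]

kept = apply_retention(records, max_age=timedelta(days=90), now=now)
print([record["id"] for record in kept])
```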
It has been said that the information we get from our data is only as good as the quality of that data. Basically, if there is a bunch of inaccurate data, will you get the correct information out of it? Absolutely not. As data flows through the system, there are many opportunities to make mistakes. Perhaps a transformation that gets data into a certain format for processing contains one minor mistake that changes an integer by one. That could be huge across a dataset of trillions and trillions of records. It is important to make sure the system follows good data quality practices. There should be regular audits to make sure the data is in the right format and within the expected range of values. Without them, the information that you get out of the system might not be accurate.
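An audit like the one described above boils down to two checks per record: is the value in the right format, and is it within the expected range? Here is a minimal sketch. The `audit` function, the `quantity` field, and the range limits are assumptions chosen for the example.

```python
def audit(records, min_quantity=1, max_quantity=1000):
    """Return (index, problem) pairs for records that fail a format or range check."""
    problems = []
    for index, record in enumerate(records):
        quantity = record.get("quantity")
        # Format check: must be an integer (bool is excluded on purpose).
        if not isinstance(quantity, int) or isinstance(quantity, bool):
            problems.append((index, "quantity is not an integer"))
        # Range check: must fall inside the expected bounds.
        elif not min_quantity <= quantity <= max_quantity:
            problems.append((index, "quantity out of expected range"))
    return problems


sample = [{"quantity": 3}, {"quantity": "3"}, {"quantity": -1}]
problems = audit(sample)
print(problems)
```

In a real pipeline, a non-empty result would be the trigger for a notification rather than a `print`.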
Finally, the fun stuff. Data operations are the part where you turn data into information that ultimately leads to important decisions.
There are several kinds of data operations inside of a system. Some operations prep the data for batch processing, some schedule the batch processing, and others are ad hoc operations that query the data.
All of these operations touch the data, which opens the door to side effects. What if your batch process was written incorrectly and deletes the data at the end of every run? That’s very bad, but it can be prevented with an immutable data storage approach. What if your ad hoc operation changes every value inside of your data to 0? Your data quality procedures should be able to detect this and send out a notification that something needs to be fixed.
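To show what an immutable storage approach means in miniature, here is a sketch of an append-only store. The `AppendOnlyStore` class is hypothetical. It offers no update or delete method, and it hands out read-only views of each record, so a buggy operation cannot overwrite values through them. Python cannot fully enforce this (the private list is still reachable), so treat it as an illustration of the idea rather than a real safeguard.

```python
from types import MappingProxyType


class AppendOnlyStore:
    """Toy store that only supports appending and reading."""

    def __init__(self):
        self._records = []

    def append(self, record: dict) -> None:
        # Copy the record and wrap it in a read-only view.
        self._records.append(MappingProxyType(dict(record)))

    def snapshot(self) -> tuple:
        # A tuple cannot be appended to or cleared by the caller.
        return tuple(self._records)


store = AppendOnlyStore()
store.append({"id": 1, "value": 42})

view = store.snapshot()
try:
    view[0]["value"] = 0  # a careless ad hoc operation
    mutated = True
except TypeError:
    mutated = False

print(mutated, store.snapshot()[0]["value"])
```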
Data operations are the heart of the system because that is where the executives or people paying for the system actually receive the value of the system. It is imperative to have the other concepts in place to ensure that your system is well put together and produces the information that is needed accurately.
There are a lot of moving pieces inside of a Big Data system, and they can be vulnerable to attacks both externally and internally. If you look at some of the big companies that have gotten hacked, it is clear that there is always a threat.
That’s why it is important to have security in mind when building Big Data systems. Perhaps only certain people need access to the data that you are housing in your data storage layer. There need to be controls to ensure that they are the only ones who have that access. There also needs to be access control on the data ingestion part to make sure that no rogue code goes in and redirects the data to another person’s data system.
Security is so important in Big Data systems because some or most of the data that is collected can be people’s personal data. As professionals, we need to ensure that we are protecting that at all costs.
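As a rough sketch of access control on the storage layer, the example below checks a caller's role against an allow list before returning any data. The role names, the dataset name, and the `read_dataset` helper are all hypothetical. Real systems would rely on the access control features of their storage platform rather than a hand-rolled check.

```python
# Hypothetical allow list: which roles may read which dataset.
ALLOWED_READERS = {
    "transactions": {"analyst", "auditor"},
}


def can_read(role: str, dataset: str) -> bool:
    """Deny by default: unknown datasets and unknown roles get no access."""
    return role in ALLOWED_READERS.get(dataset, set())


def read_dataset(role: str, dataset: str, storage: dict):
    """Return the dataset only if the role is allowed to read it."""
    if not can_read(role, dataset):
        raise PermissionError(f"role {role!r} may not read {dataset!r}")
    return storage[dataset]


storage = {"transactions": [{"item_number": 1001, "quantity": 2}]}

rows = read_dataset("analyst", "transactions", storage)

try:
    read_dataset("intern", "transactions", storage)
    denied = False
except PermissionError:
    denied = True

print(len(rows), denied)
```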
Two popular ways of building a Big Data system revolve around these two concepts: data streams and data lakes.
Imagine you have a stream of fish, moving in the current one after the other. You’re standing over the stream, ready to count the fish as they pass you. That’s exactly what a data stream is. There is a stream of data, and some type of processing is done on that data as it passes through.
A data lake is the opposite. A data lake is where you put all of the data inside of a data storage system so that you can run huge computational algorithms over it. In our example above, a data lake would be having all of the fish in a lake while you sit in a boat counting all of them, rather than counting them as they pass by. As you can imagine, counting all the fish in the lake takes a lot more processing, since you are processing them in one large batch rather than one by one.
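The fish analogy translates almost directly into code. In this sketch, `count_stream` handles one fish at a time and always has a running answer, while `count_lake` needs the whole collection up front but can answer a richer question (how many of each kind). Both function names and the sample fish are made up for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def count_stream(fish: Iterable[str]) -> Iterator[int]:
    """Stream style: update a running count as each fish passes by."""
    seen = 0
    for _ in fish:
        seen += 1
        yield seen


def count_lake(lake: List[str]) -> Counter:
    """Lake style: everything is already stored, so count it in one batch."""
    return Counter(lake)


fish = ["trout", "salmon", "trout", "bass"]

running_totals = list(count_stream(iter(fish)))
lake_summary = count_lake(fish)

print(running_totals)
print(lake_summary)
```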
Data lake advantages:
- Tons of data located in a data lake = unlimited analytics
- Allows many users to access the data
- Allows data to be conformed to a standard schema for easy querying
- Allows for SQL-like solutions like Hive
Data lake disadvantages:
- Infrastructure must be in place before ingestion
- High startup price
- Need more security systems in place because of data living for long periods of time
- Hard to pivot to a new solution because of investment in infrastructure
Data stream advantages:
- Fast calculations and fast results
- Storage is optional but suggested
Data stream disadvantages:
- Requires streaming ingestion
- Have to deal with the data with little schema conformity
- Needs to have fast processing
- Only get a quick view of the data rather than a comprehensive view
While there are advantages and disadvantages to data lakes and data streams, the truth is you can combine the two. You can have a data stream that conforms the data or provides streaming analytics, and that subsequently saves the data off to a data lake for batch analytics.
We looked at the different high-level types of data that are involved with Big Data systems. We also looked at some of the major concepts of a Big Data system and explained why each is needed to have a successful system. Too much or too little focus on any one of them could end up destroying your system.
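Here is a small sketch of that combination. The `process_stream` function is hypothetical: it conforms each incoming event to one schema, yields a running total as a streaming view, and saves the conformed record into a plain list that stands in for the data lake. A batch computation then runs over the lake afterwards.

```python
def process_stream(events, lake):
    """Conform each event, save it to the lake, and yield a running total."""
    running_total = 0.0
    for event in events:
        # Conform to a standard schema: lowercase item name, float amount.
        conformed = {
            "item": str(event["item"]).lower(),
            "amount": float(event["amount"]),
        }
        running_total += conformed["amount"]
        lake.append(conformed)  # save off for later batch analytics
        yield running_total


lake = []  # stand-in for real data lake storage
events = [
    {"item": "Apple", "amount": "1.5"},
    {"item": "PEAR", "amount": 2},
    {"item": "apple", "amount": 0.5},
]

# Streaming analytics: a quick view that updates with every event.
streaming_view = list(process_stream(events, lake))

# Batch analytics: a comprehensive view over everything in the lake.
batch_totals = {}
for row in lake:
    batch_totals[row["item"]] = batch_totals.get(row["item"], 0.0) + row["amount"]

print(streaming_view)
print(batch_totals)
```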