Apache Zeppelin is a web-based notebook for interactive data analytics. A notebook, in this context, is a space where business users or data engineers can develop, organize, execute, and share code that produces visual results, without having to drop to a command line or wrestle with the intricacies of a Hadoop cluster. Zeppelin runs in the browser and can be interacted with in real time. Apache Zeppelin is still an Apache Incubator project, which means it is under active development; even so, many enterprises are starting to adopt tools like Zeppelin to shield business users from the complexity of interacting with Hadoop clusters.
Zeppelin supports many programming languages through plug-ins called Zeppelin language interpreters, or simply interpreters. Most interpreters are packaged with Zeppelin and maintained by the development team, but a developer can also write a custom interpreter to make a new language runnable inside Zeppelin.
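In a notebook, the interpreter for each paragraph is selected with a `%` directive. As a rough sketch (this assumes the Markdown, Spark, and SQL interpreters are configured in your Zeppelin instance, and the file path is hypothetical), a note might mix paragraphs like:

```
%md
## Daily signups

%spark
// the Spark interpreter exposes a Spark session in recent Zeppelin versions
val signups = spark.read.json("/data/signups.json")   // hypothetical path
signups.createOrReplaceTempView("signups")

%sql
SELECT date, count(*) AS n FROM signups GROUP BY date
```

Each paragraph is routed to its interpreter, and the results (tables, charts, rendered markdown) appear inline in the notebook.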
Zeppelin supports activities such as data discovery, exploration, reporting, and visualization through out-of-the-box integrations with Spark and Hive. A notebook can also be shared, or turned into a report that can be printed or distributed by other means.
Apache Drill is a schema-free query engine that offers low latency querying for Big Data. That's a pretty loaded statement, so let's unpack it.
Schema-free means that data does not have to conform to a predefined schema for Drill to query it; the schema can even differ from one query to the next. Low latency means that Drill can process large amounts of data quickly because of its ability to scale.
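To make "schema-free" concrete, here is a small Python sketch (the sample records are hypothetical) of the idea: records in one dataset need not share any predeclared schema, and a query simply projects whatever fields it asks for, with missing fields coming back empty rather than raising a schema error.

```python
import json

# Three records with no shared, predeclared schema -- each record's "schema"
# is just whatever fields it happens to carry (hypothetical sample data).
raw = """
{"name": "alice", "age": 34}
{"name": "bob", "city": "Austin", "tags": ["admin"]}
{"name": "carol", "age": 29, "city": "Denver"}
"""

records = [json.loads(line) for line in raw.strip().splitlines()]

# A Drill-style projection: ask for fields per query; fields a record lacks
# come back as None instead of causing a schema error.
def project(rows, *fields):
    return [tuple(row.get(f) for f in fields) for row in rows]

print(project(records, "name", "age"))   # age is absent for bob
print(project(records, "name", "city"))  # city is absent for alice
```

Drill applies the same idea at scale: the schema is discovered per query, not declared up front.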
Drill represents data internally as JSON-like documents, which allows it to query data of any structure. It doesn't have to be hooked up to Hadoop and can interact with many non-relational data stores, including HBase, MongoDB, and Amazon S3, as well as Hadoop itself. It allows business users and data engineers to query data using SQL without having to worry about schema.
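As a sketch of what that looks like from a client, the snippet below builds a request for Drill's REST API using only the Python standard library. It assumes a Drill instance at `localhost:8047`; the file path and query are hypothetical, and we stop short of actually sending the request so the sketch stands alone without a running cluster.

```python
import json
import urllib.request

# ANSI SQL against a raw JSON file -- no schema declared anywhere.
# The dfs path below is a hypothetical example.
sql = "SELECT t.name, t.city FROM dfs.`/data/users.json` t WHERE t.age > 30"

payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:8047/query.json",   # assumed Drill REST endpoint
    data=payload,
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would return the result rows as JSON;
# here we only inspect the request we built.
print(req.get_method(), req.full_url)
```

The point is that the client speaks plain SQL over HTTP; Drill figures out the structure of the data on the fly.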
Let’s take a look at some key features and benefits of Drill:
We mentioned above that Drill has an extensible architecture that allows it to query many different data stores. Let's take a look at that architecture to understand what goes into creating such an extensible product.
Let’s look at the path a query takes when submitted to Drill.
Since Drill can spread its processing across many nodes, it can query large amounts of data quickly.
Apache Phoenix is an open source relational database layer that supports OLTP (Online Transaction Processing) for the Hadoop ecosystem. It uses HBase as its backing store and allows business users to query large datasets through a SQL interface in real time. It offers developers and business users full ACID transaction capabilities, accessed via standard SQL and JDBC APIs.
Phoenix also allows users to apply a schema at read time to existing data inside HBase. It can access data that has been processed and stored by popular Hadoop ecosystem applications such as Pig, MapReduce, Flume, Spark, and more.
Remember, HBase has the power to do random reads and writes on data stored in Hadoop, but its native API is awkward for actually getting data into and out of HBase. Apache Phoenix abstracts away that complexity by enabling end users and developers to query the data using SQL via a JDBC driver.
Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. ~ Source
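A toy model in Python can illustrate the idea in that quote. An HBase table is conceptually a sorted map from row key to columns, so a SQL predicate on the key can compile down to a bounded range scan instead of a full-table pass. The table contents and key format here are hypothetical; real Phoenix performs this inside HBase, not in client code.

```python
import bisect

# An HBase table, conceptually: a lexicographically sorted map of
# rowkey -> columns (hypothetical data).
table = sorted({
    "user#100": {"name": "alice"},
    "user#250": {"name": "bob"},
    "user#300": {"name": "carol"},
    "user#410": {"name": "dave"},
}.items())

def range_scan(rows, start_key, stop_key):
    """Return rows with start_key <= rowkey < stop_key, like an HBase Scan."""
    keys = [k for k, _ in rows]
    lo = bisect.bisect_left(keys, start_key)
    hi = bisect.bisect_left(keys, stop_key)
    return rows[lo:hi]

# A query like  SELECT * FROM users WHERE id >= 200 AND id < 400
# compiles to a single scan over the key range ["user#200", "user#400"):
result = range_scan(table, "user#200", "user#400")
print([k for k, _ in result])
```

Because the rows are sorted by key, the scan touches only the slice it needs, which is the core of how Phoenix keeps SQL over HBase fast.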
Because HBase is the backend of Phoenix, Phoenix can offer real-time ACID transactions while being distributed across many nodes.
Phoenix also borrows another relational database concept: indexing. It can add indexes to help speed up queries. And since Phoenix runs on Hadoop, it can execute queries in parallel on the nodes that actually house the data, so the query isn't slowed down by moving data across the network.
Phoenix is a great option if you want the speed of HBase but in a familiar relational database setting.
Presto is similar to Drill in the sense that it is a SQL engine that excels in providing interactive analytic queries across multiple data sources, including huge datasets. Presto was designed and implemented to help query huge data warehouses.
Presto runs queries on the data where the data lives, whether that is Cassandra, MongoDB, Hive, a relational database, or another data store. A particularly useful feature of Presto is that a single query can combine data from multiple sources, so you can query across any data store your company runs. It does this through internal connectors that reach out to the different databases.
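The cross-source join can be sketched in pure Python. Here the two "connectors" are just in-memory structures standing in for, say, a Hive table and a MySQL table (all names and data are hypothetical); the comment shows the single federated SQL statement this simulates.

```python
# Stand-in for hive.web.pageviews (a hypothetical Hive table).
hive_pageviews = [
    {"user_id": 1, "url": "/home"},
    {"user_id": 2, "url": "/pricing"},
    {"user_id": 1, "url": "/docs"},
]

# Stand-in for mysql.crm.users (a hypothetical MySQL table), keyed by id.
mysql_users = {
    1: "alice",
    2: "bob",
}

# Simulates one federated query spanning both sources:
#   SELECT u.name, p.url
#   FROM hive.web.pageviews p
#   JOIN mysql.crm.users u ON p.user_id = u.id
joined = [(mysql_users[p["user_id"]], p["url"]) for p in hive_pageviews]
print(joined)
```

Presto's value is doing exactly this join inside the engine, streaming rows from each connector, so the client sees one result set regardless of where the data lives.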
Facebook is a heavy user of Presto; according to the project's site, prestodb.io, Facebook uses Presto for interactive queries across several internal data stores, including its 300 PB data warehouse.
Presto has a simple architecture. It works best when run in a distributed fashion on a cluster of machines. There are two main components: the Presto coordinator and the Presto workers. A client submits a query to the coordinator, which parses, analyzes, and plans its execution. The coordinator then passes the plan to the workers, which carry out the query.
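The coordinator/worker split described above can be sketched in a few lines of Python. This is a minimal model, not Presto's implementation: the "plan" just partitions a scan into one task per worker, each worker computes a partial aggregate, and the coordinator merges the partials (all names and data are hypothetical).

```python
from concurrent.futures import ThreadPoolExecutor

DATA = list(range(1, 101))  # pretend this data is spread across the cluster

def plan(data, n_workers):
    """Coordinator: split the scan into one partition per worker."""
    return [data[i::n_workers] for i in range(n_workers)]

def worker(partition):
    """Worker: compute a partial aggregate, e.g. for SELECT sum(x)."""
    return sum(partition)

def coordinator(data, n_workers=4):
    tasks = plan(data, n_workers)          # parse/analyze/plan step
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker, tasks))   # workers run in parallel
    return sum(partials)                   # coordinator merges partials

print(coordinator(DATA))  # total of 1..100
```

The same shape, planning centrally and executing in parallel close to the data, is what lets Presto answer interactive queries over very large datasets.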
Interactive queries are still gaining momentum and support. Zeppelin, Drill, and Presto are some of the leading technologies that can run different types of queries across a wide range of data stores and datasets. We had a hands-on tutorial to show exactly how Zeppelin works. Zeppelin is a little different from the rest because its wide range of interpreters lets it use tools like Spark and Hive to look at data interactively. Drill and Presto are more aligned with SQL solutions. Phoenix gives the user the ability to query Hadoop data in a relational way, which brings a sense of familiarity akin to using relational databases.