Data Querying Tools Tutorial — Zeppelin, Drill, Phoenix, and Presto
February 21, 2018
Apache Zeppelin is a web-based notebook for interactive data analytics. A notebook in this context is a space where business users or data engineers can develop, organize, execute, and share code that produces visual results, without having to drop to a command line or wrestle with the intricacies of a Hadoop cluster. Zeppelin runs in the browser and can be interacted with in real time. Zeppelin graduated from the Apache Incubator in 2016 and is still under active development, and many enterprises are starting to adopt tools like it to hide the complexities of interacting with Hadoop clusters from business users.
Zeppelin supports many programming languages via plugins called Zeppelin language interpreters, or simply interpreters. Most interpreters ship with Zeppelin and are created by the development team, but a developer can also write their own interpreter to make a particular language runnable inside of Zeppelin.
Zeppelin supports activities such as data discovery, exploration, reporting, and visualization through out-of-the-box integrations with Spark and Hive. Notebooks can also be shared or turned into reports that can be printed or distributed by other means.
Apache Drill is a schema-free query engine that offers low latency querying for Big Data. That’s a pretty loaded statement, so let’s unpack it.
Schema-free means that there is no fixed schema that data must conform to before Drill can query it; the schema can even differ from query to query. Low latency means that Drill can process large amounts of data quickly because of its ability to scale out.
Drill uses an internal JSON-like document model, which allows it to query data of any structure. It doesn't have to be hooked up to Hadoop, and can interact with many non-relational stores, including HBase, MongoDB, and Amazon S3, as well as Hadoop itself. This lets business users and data engineers query data using SQL without having to worry about schema.
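Drill also exposes its SQL interface over REST on a Drillbit's web port (8047 by default), which makes it easy to script against. The sketch below builds such a request with only the standard library; the host, port, and the `/data/people.json` path are illustrative assumptions, so the actual send is left commented out.

```python
import json
from urllib import request

# Drill accepts SQL over REST: POST the query as JSON to a Drillbit.
# Assumes a Drillbit listening on localhost:8047 (Drill's default web port);
# the file path below is a made-up example.
payload = json.dumps({
    "queryType": "SQL",
    # Query a raw JSON file directly -- no schema registration needed.
    "query": "SELECT name, age FROM dfs.`/data/people.json` WHERE age > 30",
}).encode("utf-8")

req = request.Request(
    "http://localhost:8047/query.json",
    data=payload,
    headers={"Content-Type": "application/json"},
)

# Uncomment against a live cluster; the response body contains the rows.
# with request.urlopen(req) as resp:
#     rows = json.loads(resp.read())["rows"]
```

Notice there is no `CREATE TABLE` anywhere: the query points straight at a file, which is exactly the schema-free behavior described above.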
Let’s take a look at some key features and benefits of Drill:
- Drill has an extensible architecture, permitting connections to multiple databases.
- Drill can scale to thousands of nodes and thus can query petabytes of data.
- Drill has a flexible data model that complements its columnar execution model, so SQL processing on complex data doesn't require flattening it first. With columnar execution, Drill can boost performance by inspecting the query and skipping any unused columns.
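The column-skipping idea in the last bullet can be shown with a toy in-memory columnar table. This is a simplification, not Drill's implementation: the point is only that when each column is stored contiguously, a query that names two columns never has to touch the others.

```python
# Toy columnar table: each column is stored contiguously, so a query
# that references only `name` and `salary` never touches other columns.
table = {
    "name":   ["ana", "bo", "cy"],
    "age":    [34, 28, 41],
    "salary": [90, 75, 120],
    "notes":  ["...", "...", "..."],
}

def project(columns, table):
    """Read only the columns the query actually uses (column pruning)."""
    return {col: table[col] for col in columns}

pruned = project(["name", "salary"], table)
# Only 2 of the 4 columns are scanned; in a row store,
# every row would have to be read whole.
```

A row-oriented store would have to read each full record and discard the unused fields, which is why columnar layouts pay off for wide tables with selective queries.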
We mentioned above that Drill has an extensible architecture that allows it to query a bunch of different databases. Let’s take a look at the architecture to understand what goes into it to create such an extensible product.
- Drillbit is the daemon service of Drill, responsible for accepting client requests, processing queries, and returning results to the client.
- SQL Parser is the part of the architecture that is responsible for parsing the incoming queries using a framework called Calcite.
- Logical Plan is the part of the architecture that defines how the data flow should work at a high level. Once the logical plan is defined based on the parsed query, the optimizer calculates the most efficient plan.
- Optimizer is responsible for taking the logical plan and applying rule-based and cost-based heuristics to determine the most efficient physical path for the query. The optimizer also lets the user add rules of their own to customize optimization to their needs. If the user knows that all of her queries hit similar data, she can add a rule that handles that pattern specially to improve performance.
- Physical Plan is the output of the optimizer that defines the actual execution plan. This is actually a DAG (directed acyclic graph) that provides a plan so that the query can be executed across multiple nodes in the most efficient way possible.
- Storage Engine Interface is the way that Drill interacts with data stores. The plugins for the Storage Engine Interface are extendable so that users can create their own interfaces with their own data store that might not come prepackaged with Drill.
Let’s look at the path a query takes when submitted to Drill.
- The user submits a query using the Drill client which is accepted by any Drillbit inside of the cluster.
- The Drillbit that picks up the query then parses it, optimizes it, and generates the physical plan so that the query can be distributed efficiently.
- The Drillbit becomes the Foreman of the job and queries Zookeeper for the available nodes in the Drill cluster.
- The Foreman schedules the execution of the query in fragments across the nodes that are available inside of the Drill cluster.
- The nodes execute their responsibility of the query and return the data to the Foreman.
- The Foreman then returns the collection of results to the client.
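The scatter-gather pattern in those last three steps can be sketched in a few lines. Everything here is a stand-in: the node names, the partitions, and the use of `MAX` as the query are all made up to show the shape of the flow, not Drill's actual code.

```python
# Toy scatter-gather: the Foreman splits a query into per-node fragments,
# each node computes a partial result over its own data, and the Foreman
# merges the partials into the final answer. All data is illustrative.
partitions = {
    "node-1": [3, 9, 4],
    "node-2": [12, 1],
    "node-3": [7, 8],
}

def run_fragment(rows):
    """Each node evaluates its fragment locally, e.g. a partial MAX."""
    return max(rows)

partials = {node: run_fragment(rows) for node, rows in partitions.items()}
result = max(partials.values())  # the Foreman merges the partials -> 12
```

Only the small partial results cross the network, not the raw rows, which is a big part of why this scales.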
Since Drill can spread its processing across many nodes, it can query large amounts of data quickly.
Apache Phoenix is an open source relational database that supports OLTP (Online Transaction Processing) for the Hadoop ecosystem. It uses HBase as its backing store and allows business users to query large datasets through a SQL interface in real time. It offers developers and business users full ACID transaction capabilities accessed via standard SQL and JDBC APIs.
Phoenix also allows users to overlay a schema on existing data inside HBase as it is read. It can access data that has been processed and stored by popular Hadoop ecosystem applications such as Pig, MapReduce, Flume, Spark, and more.
Remember, HBase has the power of doing random reads and writes on data stored in Hadoop, but its native API is awkward for actually getting that data into and out of HBase. Apache Phoenix abstracts away that complexity by letting end users and developers query the data with SQL via a JDBC driver.
"Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets." (Apache Phoenix documentation)
Since HBase is the backend of Phoenix, it can offer real-time ACID transactions while being distributed across many nodes.
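The "SQL compiled into scans" idea can be illustrated with a toy version. The dictionary standing in for an HBase table, the row keys, and the query shape are all made up; real Phoenix query plans are far more involved, but the core trick is the same: a predicate on the primary key becomes a bounded row-key range scan.

```python
# Stand-in for an HBase table: rows kept sorted by row key.
store = {
    "row-a": {"val": 1},
    "row-c": {"val": 2},
    "row-f": {"val": 3},
    "row-k": {"val": 4},
}

def compile_to_scan(start_key, stop_key):
    """WHERE pk >= start AND pk < stop becomes a bounded range scan,
    so only the matching slice of the table is ever touched."""
    return lambda table: {
        k: v for k, v in table.items() if start_key <= k < stop_key
    }

# Toy equivalent of: SELECT * FROM t WHERE pk >= 'row-c' AND pk < 'row-k'
scan = compile_to_scan("row-c", "row-k")
hits = scan(store)  # only row-c and row-f fall inside the range
```

Because HBase keeps rows ordered by key, such a range scan skips everything outside the bounds instead of reading the whole table.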
Phoenix also comes with another relational database concept: indexing. It can add indexes to help speed up queries. And since Phoenix runs on Hadoop, it can execute queries in parallel on the nodes that actually house the data, so the network doesn't slow the query down by moving data around.
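A secondary index, Phoenix-style, is essentially a second table mapping an indexed column's value back to the row key. The sketch below is a simplification with invented data; it just shows why an indexed lookup beats scanning every row.

```python
# Primary table keyed by user id (illustrative data).
table = {
    "u1": {"email": "a@x.io", "name": "Ana"},
    "u2": {"email": "b@x.io", "name": "Bo"},
    "u3": {"email": "c@x.io", "name": "Cy"},
}

# Secondary index maintained alongside the table: email -> row key.
email_index = {row["email"]: key for key, row in table.items()}

def find_by_email(email):
    """Index lookup: one point read instead of scanning every row."""
    key = email_index.get(email)
    return table[key] if key else None

user = find_by_email("b@x.io")  # -> {'email': 'b@x.io', 'name': 'Bo'}
```

The trade-off is the usual one for indexes: every write must now update both the table and the index, which Phoenix handles for you.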
Phoenix is a great option if you want the speed of HBase but in a familiar relational database setting.
Presto is similar to Drill in the sense that it is a SQL engine that excels in providing interactive analytic queries across multiple data sources, including huge datasets. Presto was designed and implemented to help query huge data warehouses.
Presto runs queries where the data lives, whether that is Cassandra, MongoDB, Hive, a relational database, or another data store. A pretty cool feature of Presto is that a single query can combine data from multiple sources, so you can query across every data store your company runs. It does this with internal connectors that reach out to the different databases.
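What a cross-source query amounts to can be sketched in plain Python. The two lists below are hypothetical stand-ins for rows fetched from two different connectors (say, a document store and a relational database); the join-and-aggregate loop mimics what a federated `JOIN ... GROUP BY` does once both sides are in hand.

```python
# Pretend these rows came back from two different connectors.
mongo_users = [        # e.g. a document store
    {"id": 1, "name": "Ana"},
    {"id": 2, "name": "Bo"},
]
mysql_orders = [       # e.g. a relational database
    {"user_id": 1, "total": 30},
    {"user_id": 1, "total": 12},
    {"user_id": 2, "total": 99},
]

# Toy equivalent of:
#   SELECT u.name, SUM(o.total)
#   FROM users u JOIN orders o ON o.user_id = u.id
#   GROUP BY u.name
totals = {}
for u in mongo_users:
    spent = sum(o["total"] for o in mysql_orders if o["user_id"] == u["id"])
    totals[u["name"]] = spent
# totals == {'Ana': 42, 'Bo': 99}
```

In Presto itself, of course, the fetching, joining, and aggregating are planned and distributed across workers rather than done in one loop.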
Facebook is a heavy user of Presto; according to the project's site, prestodb.io, Facebook uses Presto for interactive queries across several internal data stores, including its 300 PB data warehouse.
Presto has a simple architecture and works best when run in a distributed fashion on a cluster of machines. There are two main components: the Presto coordinator and the Presto workers. A client submits a query to the coordinator, which parses, analyzes, and plans its execution, then hands the plan to the workers to complete the query.
Interactive queries are still gaining momentum and support. Zeppelin, Drill, and Presto are some of the leading technologies that can run different types of queries across a wide range of data stores and data sets. We walked through a hands-on tutorial showing exactly how Zeppelin works. Zeppelin is a little different from the rest because its wide range of interpreters lets it use tools like Spark and Hive to look at data interactively, while Drill and Presto are more purely SQL solutions. Phoenix gives the user the ability to query Hadoop data relationally, which brings a sense of familiarity akin to using relational databases.