Open source Java projects: Apache Phoenix
This installment of Open source Java projects introduces Java developers to Apache Phoenix. Since Phoenix runs on top of HBase, we'll start with an overview of HBase and how it differs from relational databases. You'll learn how Phoenix bridges the gap between SQL and NoSQL, and how it's optimized to efficiently interact with HBase. With those basics out of the way, we'll spend the remainder of the article learning how to work with Phoenix. You'll set up and integrate HBase and Phoenix, create a Java application that connects to HBase through Phoenix, and you'll write your first table, insert data, and run a few queries on it.
Apache HBase is a NoSQL database that runs on top of Hadoop as a distributed and scalable big data store. HBase is a column-oriented database that leverages the distributed processing capabilities of the Hadoop Distributed File System (HDFS) and Hadoop's MapReduce programming paradigm. It was designed to host large tables with billions of rows and potentially millions of columns, all running across a cluster of commodity hardware.
Apache HBase combines the power and scalability of Hadoop with the ability to query for individual records and execute MapReduce processes.
In addition to capabilities inherited from Hadoop, HBase is a powerful database in its own right: it combines real-time queries with the speed of a key/value store, a robust table-scanning strategy for quickly locating records, and it supports batch processing using MapReduce. As such, Apache HBase combines the power and scalability of Hadoop with the ability to query for individual records and execute MapReduce processes.
HBase organizes data differently from traditional relational databases, supporting a four-dimensional data model in which each "cell" is represented by four coordinates:
Figure 1 shows how these four dimensional coordinates are related.
The model in Figure 1 shows that a row is comprised of a row key and an arbitrary number of column families. Each row key is associated to a collection of "rows in tables," each of which has its own columns. While each table must exist, the columns in tables may be different across rows. Each column family has a set of columns, and each column has a set of versions that map to the actual data in the row.
If we were modeling a person, the row key might be the person's social security number (to uniquely identify them), and we might have column families like address, employment, education, and so forth. Inside the address column family we might have street, city, state, and zip code columns, and each version might correspond to where the person lived at any given time. The latest version might list the city "Los Angeles," while the previous version might list "New York." You can see this example model in Figure 2.
In sum, HBase is a column-oriented database that represents data in a four dimensional model. It is built on top of the Hadoop Distributed File System (HDFS), which partitions data across potentially thousands of commodity machines. Developers using HBase can access data directly by accessing a row key, by scanning across a range of row keys, or by using batch processing via MapReduce.
Apache Phoenix is a top-level Apache project that provides an SQL interface to HBase, mapping HBase models to a relational database world. Of course, HBase provides its own API and shell for performing functions like scan, get, put, list, and so forth, but more developers are familiar with SQL than NoSQL. The goal of Phoenix is to provide a commonly understood interface for HBase.
In terms of features, Phoenix does the following:
In addition to supporting a vast set of SQL operations, Phoenix is also very high performing. It analyzes SQL queries, breaks them down into multiple HBase scans, and runs them in parallel, using the native API instead of MapReduce processes.
Phoenix uses two strategies--co-processors and custom filters--to bring computations closer to the data:
In sum, Phoenix leverages direct access to HBase APIs, co-processors, and custom filters to give you millisecond-level performance for small datasets and second-level performance for humongous ones. Above all, Phoenix exposes these capabilities to developers via a familiar JDBC and SQL interface.
In order to use Phoenix, you need to download and install both HBase and Phoenix. You can find the Phoenix download page (and HBase compatibility notes) here.
At the time of this writing, the latest version of Phoenix is 4.6.0 and the download page reads that 4.x is compatible with HBase version 0.98.1+. For my example, I downloaded the latest version of Phoenix that is configured to work with HBase 1.1. You can find it in the folder: phoenix-4.6.0-HBase-1.1/.
Here's the setup:
The process to integrate Phoenix into HBase is simple:
sqlline.py is a Python script that starts a console that connects to HBase's Zookeeper address; localhost in this case. You can walk through an example that I am going to summarize in this section here.
First, let's view all of the tables in HBase by executing !table:
Because this is a new instance of HBase the only tables that exist are system tables. You can create a table by executing a create table command:
This command creates a table named test, with an integer primary key named mykey and a varchar column named mycolumn. Now insert a couple rows by using the upsert command:
UPSERT is an SQL command for inserting a record if it does not exist or updating a record if it does. In this case, we inserted (1,'Hello') and (2,'World!'). You can find the complete Phoenix command reference here. Finally, query your table to see the values that you upserted by executing select * from test:
As expected, you'll see the values that you just inserted. If you want to clean up the table, execute a drop table test command.
Connecting to and interacting with HBase through Phoenix is as simple as connecting to any database using a JDBC driver:
I have uploaded the source code for this example to GitHub. I first set up a new Maven project and configured my POM file as shown in Listing 1.
This POM file imports the Phoenix Core Maven dependency, which provides access to the Phoenix JDBC driver:
The POM file does some housekeeping work next: it sets the source compilation to Java 6, specifies that dependencies should be copied to the target/lib folder during the build, and makes the resulting JAR file executable for the main class, com.geekcap.javaworld.phoenixexample.PhoenixExample.
Listing 2 shows the source code for the PhoenixExample class.
Listing 2 first creates a database connection by passing jdbc:phoenix:localhost as the JDBC URL to the DriverManager class, as shown here:
Just like in the shell console, localhost refers to the server running Zookeeper. If you were connecting to a production HBase instance, you would want to use the Zookeeper server name or IP address for that production instance. With a javax.sql.Connection, the rest of the example is simple JDBC code. The steps are as follows:
You can build the project as follows: mvn clean install.
Then execute it with the following command from the target directory:
You should see output like the following (note that I excluded the Log4j warning messages):
You can also verify this from the Phoenix console. First execute a !tables command to view the tables and observe that JAVATEST is there:
Finally, query the JAVATEST table to see your data:
Note that if you want to run this example multiple times you will want to drop the table using either the console or by adding the following to the end of Listing 2:
As you can see, using Phoenix is a simple matter of creating JDBC connection and using the JDBC APIs. With this knowledge you should be able to start using Phoenix with more advanced tools like Spring's JdbcTemplate or any of your other favorite JDBC abstraction libraries!
Apache Phoenix provides an SQL layer on top of Apache HBase that allows you to interact with HBase in a familiar manner. You can leverage the scalability that HBase derives from running on top of HDFS, along with the multi-dimensional data model that HBase provides, and you can do it using familiar SQL syntax. Phoenix also supports high performance by leveraging native HBase APIs rather than MapReduce processes; implementing co-processors to reduce client/server data transfer; and providing custom filters that improve the execution, navigation, and speed of data querying.
Using Phoenix is as simple as adding a JAR file to HBase, adding Phoenix's JDBC driver to your CLASSPATH, and creating a standard JDBC connection to Phoenix using its JDBC URL. Once you have a JDBC connection, you can use HBase just as you would any other database.
This Open source Java projects tutorial has provided an overview of both HBase and Phoenix, including the specific motivation for developing each of these technologies. You've set up and integrated Phoenix and HBase in your local working environment, and learned how to interact with Phoenix using the Phoenix console and through a Java application. With this foundation you should be well prepared to start building applications on top of HBase, using standard SQL.