How different SQL-on-Hadoop engines satisfy BI workloads

24.02.2016

According to a new benchmark, the three leading SQL-on-Hadoop engines (Impala 2.3, Apache Spark 1.6 and Apache Hive 1.2) all have unique strengths and weaknesses that make them well-suited to some Business Intelligence (BI) use cases and less suited to others.

"The conclusions really are that one engine does not meet all requirements," says Dave Mariani, CEO and founder of AtScale, a startup specializing in enabling BI on Hadoop. "What we have done in our deployments, for our customers, is plug in multiple engines."

For the Business Intelligence on Hadoop benchmark, AtScale set out to help technology evaluators select the best SQL-on-Hadoop technology for their BI use cases. AtScale's testing team used the Star Schema Benchmark (SSB) data set, based on widely used TPC-H data, modified to more accurately represent a typical BI-oriented data layout. The data set allowed the team to run queries across large tables: the lineorder table contains close to 6 billion rows and the large customer table contains over a billion rows.

Mariani explains that AtScale looked at three key requirements to evaluate the SQL-on-Hadoop engines and their fitness to satisfy BI workloads:

Mariani, who led the effort to build what may have been the world's largest OLAP cube for BI at Yahoo!, says he believes these three criteria are representative of the primary requirements the average enterprise doing BI on Hadoop will have to meet. The criteria were drawn from the test team's experience working with a large number of companies in financial services, healthcare, retail, telecommunications and other industries.

"We used real-world enterprise experience to produce a document that every technical evaluator can use as part of their evaluation process," adds Josh Klahr, vice president of Product Management at AtScale.

The test team found that all three engines passed the tests and are stable enough to support BI workloads, but one engine does not fit all needs. Each has its own "sweet spot," and enterprises are likely to find that blended usage of all engines might fit their goals best.

While Hive is generally considered the default for SQL-on-Hadoop, it was far and away the slowest of the engines in the benchmark, making it poorly suited to interactive queries.

"If you want to use Hive Tez as your interactive query engine exclusively, the best you're going to do is 2.4 seconds," Mariani says.

But while it may be slow, Hive is also the most stable of the three engines, with the best consistency across multiple query types.

"Hive Tez is the tortoise," Mariani adds. "It will always finish the race, but not in a spectacular, speedy fashion. It's the most reliable."

Impala and Spark, on the other hand, were at their best on smaller data sets. Impala topped Spark across a gamut of workloads, but Mariani notes that Spark 1.6 was a vast performance improvement over Spark 1.5, and he expects that trend to continue because Spark has drawn a large open source community focused on its development. Cloudera recently proposed donating Impala to the Apache Software Foundation, which could also lend additional momentum to its development.

For now, Impala is the king for use cases that require large numbers of users.

"Impala kicks butt when it comes to concurrency," Mariani says. "If you're going to have a whole bunch of users running small, fast queries, Impala is a much better choice than Spark would be."

"If speed is not a priority, but stability and reliability is, I would choose to use Hive Tez as my data pipeline engine," he adds. "For those big batch workloads I would choose Hive Tez. If I wanted my BI users to get access to my warehouse, I would choose to use Spark or Impala."
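
Mariani's rule of thumb can be read as a simple routing decision. The Python sketch below illustrates that reading only and is not AtScale's implementation. The `Workload` type, the `pick_engine` function and the `high_concurrency` cutoff of 25 users are hypothetical names and values chosen for the example; none of them comes from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a query workload."""
    interactive: bool      # True if BI users are waiting on the result
    concurrent_users: int  # expected number of simultaneous users


def pick_engine(workload: Workload, high_concurrency: int = 25) -> str:
    """Suggest an engine following the rule of thumb quoted in the article.

    high_concurrency is an assumed cutoff, not a benchmark figure.
    """
    if not workload.interactive:
        # Big batch / data pipeline work: stability matters more than speed.
        return "Hive Tez"
    if workload.concurrent_users >= high_concurrency:
        # Many users running small, fast queries.
        return "Impala"
    # Interactive BI access with modest concurrency: either engine is a candidate.
    return "Impala or Spark"


if __name__ == "__main__":
    print(pick_engine(Workload(interactive=False, concurrent_users=1)))
    print(pick_engine(Workload(interactive=True, concurrent_users=200)))
    print(pick_engine(Workload(interactive=True, concurrent_users=5)))
```

In practice the cutoff would be tuned per cluster and per release, since the relative standing of the engines shifts from version to version.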

Mariani notes that while the team didn't benchmark other engines like Apache Drill or Presto, they will next time.

"You never know between release and release who's going to be the better horse to bet on," he says.

(www.cio.com)

Thor Olavsrud