Apache Arrow aims to accelerate analytical workloads
While most new Apache efforts spend several years gaining steam in the Apache Incubator before they become fully fledged projects, the new Apache Arrow is hitting the ground running, debuting as a top-level project from day one.
"It's because of the people involved," says Jacques Nadeau, CTO of startup Dremio (still in stealth), vice president of the Apache Drill project and now vice president of Apache Arrow. "Because of the support behind it and the people involved with it, I look at it as an opportunity to establish the next phase of heterogeneous data infrastructure."
"We expect that within a few years, the majority of all the world's day will move through the Arrow representation," he adds.
Initially seeded by code from Apache Drill, a schema-free SQL query engine for large-scale datasets, Arrow is a high-performance cross-system data layer for columnar in-memory analytics. Nadeau says it will speed up both big data processing systems and big data storage systems by 10x to 100x by providing a common internal representation of data.
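To make the columnar idea concrete, here is a minimal sketch using pyarrow, the package the Python implementation ships as; the column names and values here are invented for illustration.

```python
import pyarrow as pa

# In Arrow's columnar model, each column is one contiguous, typed array
# rather than a sequence of heterogeneous row records.
table = pa.table({
    "sensor_id": pa.array([101, 102, 103], type=pa.int32()),
    "reading":   pa.array([20.5, 21.0, 19.8], type=pa.float64()),
})

print(table.schema)             # sensor_id: int32, reading: double
print(table.column("reading"))  # the whole column as one typed array
```

Because every system that adopts the format lays the bytes out the same way, a table built by one engine can be handed to another without translation.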
In many workloads, 70 percent to 80 percent of CPU cycles are spent serializing and deserializing data as it moves between systems and processes that each have their own custom data representations. With Arrow as the common representation, data can be shared between systems and processes with no serialization, deserialization or memory copies.
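A hedged sketch of what "no serialization" means in practice, again using pyarrow: Arrow's inter-process communication (IPC) stream format writes record batches in the same layout they occupy in memory, so a reader maps them back rather than parsing them field by field. The names and data are illustrative.

```python
import pyarrow as pa

batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
    names=["id", "label"],
)

# Writing: the batch goes out in its in-memory layout, not a bespoke
# wire format that would need to be re-parsed on the other side.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
buf = sink.getvalue()

# Reading: batches are reconstituted from the buffer without a
# per-value deserialization pass.
with pa.ipc.open_stream(buf) as reader:
    for received in reader:
        print(received.num_rows, received.schema)
```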
"An industry-standard columnar in-memory data layer enables users to combine multiple systems, applications and programming languages in a single workload without the usual overhead," says Ted Dunning, vice present of the Apache Incubator and Apache Arrow PMC.
This gets at why Arrow is receiving such wide support from the very beginning: not just from some of the most well-known Apache committers and PMC members, with developers drawn from projects such as Calcite, Cassandra, Drill, Hadoop, HBase, Impala, Kudu, Parquet, Phoenix, Spark and Storm, as well as Pandas and Ibis, but also from vendors including Cloudera, Databricks, Datastax, Dremio, Hortonworks, MapR, Salesforce and Twitter. Nadeau says that, as a shared foundation for SQL execution engines, data analysis systems, streaming and queueing systems, and storage systems, Arrow will give projects in all of those areas much faster performance and better interoperability.
"A columnar in-memory data layer enables systems and applications to process data at full hardware speeds," says Todd Lipcon, original Apache Kudu creator and Apache Arrow PMC. "Modern CPUs are designed to exploit data-level parallelism via vectorized operations and SIMD instructions. Arrow facilitates such processing."
In addition to traditional relational data, Arrow supports complex data and dynamic schemas. It can handle the JSON data common in Internet of Things (IoT) workloads, modern applications and log files, and implementations are already available (or underway) for programming languages including Java, C++ and Python. Nadeau says implementations for R and JavaScript should follow by the end of the year, and that Drill, Ibis, Impala, Kudu, Parquet and Spark will all adopt Arrow in the same timeframe, with additional projects expected to follow.
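A short sketch of that schema flexibility, again with pyarrow and invented records: JSON-like nested data is inferred into Arrow's struct and list types while remaining columnar underneath.

```python
import pyarrow as pa

# JSON-like records: lists nested inside structs, no flattening required.
records = [
    {"device": "thermostat", "readings": [20.5, 21.0]},
    {"device": "meter",      "readings": [3.2]},
]

arr = pa.array(records)  # the type is inferred from the data
print(arr.type)  # struct<device: string, readings: list<item: double>>
```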
"Real-world use cases often include complex combinations of structured and rapidly growing complex-data," says Parth Chandra, Apache Drill PMC and Apache Arrow PMC. "Already tested with Apache Drill, the efficient in-memory columnar representation and processing in Arrow will enable users to enjoy the performance of columnar processing with the flexibility of JSON."
Nadeau expects the first formal release of Arrow to come within a few months.