Apache Impala

Cloudera Impala is an open source project created by Cloudera that provides SQL query execution to enable fast interactive ad-hoc queries, on data stored in Hadoop.

Cloudera Impala is an open source Massively Parallel Processing (MPP) query engine that runs natively on Apache Hadoop. The Apache-licensed Impala project brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation. Impala is integrated from the ground up as part of the Hadoop ecosystem and leverages the same flexible file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other components of the Hadoop stack.

With Impala, analysts and data scientists now have the ability to perform real-time, “speed of thought” analytics on data stored in Hadoop via SQL or through Business Intelligence (BI) tools. The result is that large-scale data processing (via MapReduce) and interactive queries can be done on the same system using the same data and metadata – removing the need to migrate data sets into specialized systems and/or proprietary formats simply to perform analysis.

Features

Native MPP query engine can execute SQL queries in seconds
Integration with BI tools
Supports HDFS and Apache HBase storage
Reads widely used Hadoop formats, including text, LZO, SequenceFile, Avro, RCFile, and Parquet^[1]
Supports Hadoop security (Kerberos authentication)
Fine-grained, role-based authorization with Sentry^[2]
Leverages metadata, ODBC driver, and SQL syntax from Apache Hive

References

^ Parquet
^ Sentry

External links

[1] Parquet

[2] Sentry

[1]

[2]