Apache Arrow


Developer(s): Apache Software Foundation
Initial release: October 10, 2016
Stable release: v0.11.1[1] / October 19, 2018
Repository: https://github.com/apache/arrow
Written in: C++, Java, Python
Type: Data analytics, machine learning algorithms
License: Apache License 2.0
Website: arrow.apache.org

Apache Arrow is an open source software library for columnar in-memory data structures and processing.[2][3][4]

Arrow is sponsored by the nonprofit Apache Software Foundation[5] and was announced by Cloudera in 2016.[6] Arrow is a component rather than a standalone piece of software, and as such it is included in many popular projects, including Apache Spark and pandas.[7]

It defines a language-independent physical memory layout that enables zero-copy, zero-deserialization interchange of flat and nested columnar data among a variety of systems, such as Python, R, Apache Spark, ODBC protocols, and proprietary systems that use the open source components.[8][9] Apache Arrow is a complement to on-disk columnar data formats such as Apache Parquet and Apache ORC in that it organizes data for efficient in-memory processing by CPUs and GPUs.
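
A minimal sketch of this in-memory interchange using the PyArrow bindings (the example data are illustrative, and exact function availability varies by PyArrow version):

import pyarrow as pa

# Build columnar Arrow arrays and assemble them into an in-memory table.
cities = pa.array(["Oslo", "Lagos", "Lima"])
populations = pa.array([709000, 14862000, 9752000])
table = pa.Table.from_arrays([cities, populations], names=["city", "population"])

# Convert to a pandas DataFrame; for many column types the conversion
# can reuse the underlying Arrow buffers rather than copying the data.
df = table.to_pandas()

# And back: pandas -> Arrow, yielding the same columnar representation.
table2 = pa.Table.from_pandas(df)
print(table2.schema)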

Arrow has been proposed as a format for in-memory analytics,[10] genomics,[11] and computation in the cloud.[12]

Comparisons

Apache Parquet and Apache ORC are popular examples of on-disk columnar data formats. Arrow is designed as a complement to these formats for processing data in memory.[13] The hardware resource engineering trade-offs for in-memory processing differ from those associated with on-disk storage.[14] The Arrow and Parquet projects include libraries that allow reading and writing data between the two formats.[15]
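
A minimal sketch of moving data between the two formats with PyArrow's Parquet module (file name and example data are illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

# An in-memory Arrow table.
table = pa.Table.from_arrays(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
    names=["id", "label"],
)

# Write the in-memory columnar data to the on-disk Parquet format...
pq.write_table(table, "example.parquet")

# ...and read it back into Arrow's in-memory representation.
table_from_disk = pq.read_table("example.parquet")
print(table_from_disk.num_rows)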

Daniel Abadi, Darnell-Kanal Professor of Computer Science at the University of Maryland[16] and a pioneer of column-oriented databases,[17] reviewed Apache Arrow in March 2018.[18] "The time is right for database systems architects to agree on and adhere to a main memory data representation standard," he concluded. "[If your] workloads are typically scanning through a few attributes of many entities, I do not see any reason not to embrace the Arrow standard."

References

  1. ^ "Github releases".
  2. ^ "Apache Arrow aims to speed access to big data: Apache's new project leverages columnar storage to speed data access not only for Hadoop but potentially for every language and project with big data needs".
  3. ^ "The first release of Apache Arrow".
  4. ^ "Julien Le Dem on the Future of Column-Oriented Data Processing with Apache Arrow".
  5. ^ "Apache Foundation rushes out Apache Arrow as top-level project".
  6. ^ "Introducing Apache Arrow".
  7. ^ "Apache Arrow unifies in-memory Big Data systems: Leaders from 13 existing open source projects band together to solve a common problem: how to represent Big Data in memory for maximum performance and interoperability".
  8. ^ "Big data gets a new open-source project, Apache Arrow: It offers performance improvements of more than 100x on analytical workloads, the foundation says".
  9. ^ "Apache Foundation rushes out Arrow as 'Top-Level Project'".
  10. ^ Dinsmore T.W. (2016). "In-Memory Analytics". In: Disruptive Analytics. Apress, Berkeley, CA. doi:10.1007/978-1-4842-1311-7_5. ISBN 978-1-4842-1312-4.
  11. ^ Versaci F, Pireddu L, Zanetti G (2016). "Scalable genomics: from raw data to aligned reads on Apache YARN" (PDF). IEEE International Conference on Big Data. pp. 1232–1241.
  12. ^ Maas M, Asanović K, Kubiatowicz J (2017). "Return of the runtimes: rethinking the language runtime system for the cloud 3.0 era" (PDF). Proceedings of the 16th Workshop on Hot Topics in Operating Systems. pp. 138–143.
  13. ^ "Apache Arrow and Apache Parquet: Why We Needed Different Projects for Columnar Data, On Disk and In-Memory".
  14. ^ "Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?".
  15. ^ "PyArrow:Reading and Writing the Apache Parquet Format".
  16. ^ "Daniel Abadi".
  17. ^ "Prof. Abadi Wins VLDB 10-Year Best Paper Award".
  18. ^ "An analysis of the strengths and weaknesses of Apache Arrow".