Apache Arrow

Developer(s)	Apache Software Foundation
Initial release	October 10, 2016
Stable release	v0.15.1... / November 1, 2019
Repository	https://github.com/apache/arrow
Written in	C++ (reference implementation)
Type	Data format, algorithms
License	Apache License 2.0
Website	arrow.apache.org

Apache Arrow is a language-agnostic software framework for developing applications that efficiently load and consume in-memory columnar data in a standardized manner. It also specifies a standard memory format that represents flat and hierarchical data in an optimised columnar manner for efficient analytic operations on modern CPU and GPU hardware.^[2]^[3]^[4]^[5]^[6] This reduces or eliminates factors that limit the feasibility of working with large sets of data, such as the cost, volatility, or physical constraints of dynamic random-access memory.^[7]
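The difference between a row-oriented and a columnar in-memory layout can be sketched with Python's standard library alone. The snippet below is an illustration under stated assumptions, not Arrow's actual format or API: the record fields (`id`, `score`) and their values are hypothetical, and `array.array` merely stands in for a contiguous typed buffer.

```python
from array import array

# Row-oriented form: each record keeps all of its attributes together.
# (Illustrative data, not taken from any source.)
rows = [
    {"id": 1, "score": 0.5},
    {"id": 2, "score": 1.5},
    {"id": 3, "score": 2.5},
]

# Columnar form: one contiguous, typed buffer per attribute.
columns = {
    "id": array("q", (r["id"] for r in rows)),        # 64-bit signed integers
    "score": array("d", (r["score"] for r in rows)),  # 64-bit floats
}

# An analytic scan over one attribute reads a single buffer
# and never touches the other column.
total_score = sum(columns["score"])
print(total_score)  # 4.5
```

Keeping the values of one attribute adjacent in memory is what allows a scan over that attribute to make good use of CPU caches and vector instructions.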

Interoperability

Arrow can be used with Apache Parquet, Apache Spark, NumPy, PySpark, pandas and other data processing libraries. The project provides an open source software library written in C++ with bindings for many other programming languages, e.g. Python and Java. Arrow allows for zero-copy reads and fast data access and interchange without serialisation overhead between these languages and systems.^[2]
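The idea of a zero-copy read can be illustrated, again as a hedged sketch rather than Arrow's own mechanism, with Python's built-in `memoryview`. The names `column`, `view` and `window` are hypothetical; the point is only that a consumer can read a producer's buffer without serialising or duplicating it.

```python
from array import array

# Producer side: a contiguous column of 64-bit integers.
column = array("q", [10, 20, 30, 40])

# Consumer side: a memoryview exposes the same bytes without copying them.
view = memoryview(column)
window = view[1:3]  # slicing a memoryview is also copy-free

# A write through the original buffer is visible through the view,
# which shows that both names refer to the same memory.
column[1] = 99
print(window.tolist())  # [99, 30]
```

Sharing a buffer in this way works only when both sides agree on how the bytes are laid out, which is the role a standard memory format plays across languages and systems.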

Applications

Arrow has been used in diverse domains, including analytics,^[8] genomics,^[9]^[7] and cloud computing.^[10]

Comparison to Apache Parquet and ORC

Apache Parquet and Apache ORC are popular examples of on-disk columnar data formats. Arrow is designed as a complement to these formats for processing data in-memory.^[11] The hardware resource engineering trade-offs for in-memory processing differ from those associated with on-disk storage.^[12] The Arrow and Parquet projects include libraries that allow for reading and writing data between the two formats.^[13]

Reception

Daniel Abadi, Darnell-Kanal Professor of Computer Science at the University of Maryland^[14] and a pioneer of column-oriented databases,^[15] reviewed Apache Arrow in March 2018.^[16] "The time is right for database systems architects to agree on and adhere to a main memory data representation standard," he concluded. "[If your] workloads are typically scanning through a few attributes of many entities, I do not see any reason not to embrace the Arrow standard."

Governance

Arrow was announced by Cloudera^[17] and donated to the Apache Software Foundation^[18] in 2016, where it has been maintained and extended since.^[18]^[19]^[6]^[20]^[21] In October 2019, the Apache Arrow team announced that it plans to split the Arrow format and library versioning starting with the planned v1.0 release.^[22]

References

[1] "Github releases".
[2] "Apache Arrow and Distributed Compute with Kubernetes". 13 Dec 2018.
[3] Baer, Tony (17 February 2016). "Apache Arrow: Lining Up The Ducks In A Row... Or Column". Seeking Alpha.
[4] Baer, Tony (25 February 2019). "Apache Arrow: The little data accelerator that could". ZDNet.
[5] Hall, Susan (23 February 2016). "Apache Arrow's Columnar Layouts of Data Could Accelerate Hadoop, Spark". The New Stack.
[6] Yegulalp, Serdar (27 February 2016). "Apache Arrow aims to speed access to big data". InfoWorld.
[7] Tanveer Ahmad (2019). "ArrowSAM: In-Memory Genomics Data Processing through Apache Arrow Framework". bioRxiv.
[8] Dinsmore T.W. (2016). "In-Memory Analytics". In: Disruptive Analytics. Apress, Berkeley, CA. pp. 97–116. doi:10.1007/978-1-4842-1311-7_5. ISBN 978-1-4842-1312-4.
[9] Versaci F, Pireddu L, Zanetti G (2016). "Scalable genomics: from raw data to aligned reads on Apache YARN" (PDF). IEEE International Conference on Big Data: 1232–1241.
[10] Maas M, Asanović K, Kubiatowicz J (2017). "Return of the runtimes: rethinking the language runtime system for the cloud 3.0 era" (PDF). Proceedings of the 16th Workshop on Hot Topics in Operating Systems (ACM): 138–143. doi:10.1145/3110000/3103003/p138-Maas (inactive 2019-08-19).
[11] Le Dem, Julien. "Apache Arrow and Apache Parquet: Why We Needed Different Projects for Columnar Data, On Disk and In-Memory". KDnuggets.
[12] "Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?".
[13] "PyArrow: Reading and Writing the Apache Parquet Format".
[14] "Daniel Abadi". Department of Computer Science, University of Maryland.
[15] "Prof. Abadi Wins VLDB 10-Year Best Paper Award".
[16] "An analysis of the strengths and weaknesses of Apache Arrow".
[17] "Introducing Apache Arrow".
[18] Martin, Alexander J. (17 February 2016). "Apache Foundation rushes out Apache Arrow as top-level project". The Register.
[19] "Big data gets a new open-source project, Apache Arrow: It offers performance improvements of more than 100x on analytical workloads, the foundation says".
[20] Le Dem, Julien (28 November 2016). "The first release of Apache Arrow". SD Times.
[21] "Julien Le Dem on the Future of Column-Oriented Data Processing with Apache Arrow".
[22] pmc (2019-10-06). "Apache Arrow 0.15.0 Release". Apache Arrow. Retrieved 2019-12-18.

External links

Apache Arrow project web site
Apache Arrow GitHub project source code
