Apache Arrow

From Wikipedia, the free encyclopedia
Apache Arrow
Developer(s): Apache Software Foundation
Initial release: October 10, 2016
Stable release: v0.15.1[1] / November 1, 2019
Repository: https://github.com/apache/arrow
Written in: C++ (reference implementation)
Type: Data format, algorithms
License: Apache License 2.0
Website: arrow.apache.org

Apache Arrow is a cross-language development platform for in-memory data. It defines a standard memory format for representing columnar tabular and hierarchical data, optimised for efficient analytic operations on modern hardware such as CPUs and GPUs. The project provides an open source reference library written in C++, as well as libraries for many other programming languages, e.g. Python and Java. Many other libraries and systems have added support for Arrow (e.g. Apache Parquet, Apache Spark and pandas). Because these languages and systems share the same memory layout, Arrow allows for zero-copy reads and fast data access and interchange without serialisation overhead. Arrow has been used in diverse domains, e.g. analytics,[2] genomics,[3] and cloud computing.[4]
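
The columnar in-memory model can be illustrated with the project's Python bindings (pyarrow). The following is a minimal sketch rather than an excerpt from the Arrow documentation; the column names and values are illustrative, and it assumes pyarrow is installed (e.g. via "pip install pyarrow").

    import pyarrow as pa

    # Build an in-memory Arrow table; each column is stored contiguously
    # in the standard Arrow columnar format.
    table = pa.table({
        "id": pa.array([1, 2, 3], type=pa.int64()),
        "name": pa.array(["a", "b", "c"], type=pa.string()),
    })

    print(table.schema)    # column names and types
    print(table.num_rows)  # 3

    # Arrow-aware libraries such as pandas can consume the same columnar
    # buffers, often with little or no copying or serialisation.
    df = table.to_pandas()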

Arrow was announced by Cloudera[5] and donated to the Apache Software Foundation[6] in 2016, where it has been maintained and extended since.[6][7][8][9][10] In October 2019 the Apache Arrow team announced plans to version the Arrow format and the libraries separately, starting with the planned 1.0 release.[11]

Comparison to Apache Parquet and ORC

Apache Parquet and Apache ORC are popular examples of on-disk columnar data formats. Arrow is designed as a complement to these formats for processing data in-memory.[12] The hardware resource engineering trade-offs for in-memory processing differ from those associated with on-disk storage.[13] The Arrow and Parquet projects include libraries that allow reading and writing data between the two formats,[14] as sketched below.
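
As an illustration of this interoperability, the following sketch uses the pyarrow Python bindings to write an Arrow table to a Parquet file and read it back; the file name is illustrative.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})

    # Write the in-memory Arrow table to the on-disk Parquet format...
    pq.write_table(table, "example.parquet")

    # ...and read it back into Arrow's in-memory columnar representation.
    table2 = pq.read_table("example.parquet")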

Daniel Abadi, Darnell-Kanal Professor of Computer Science at the University of Maryland[15] and a pioneer of column-oriented databases,[16] reviewed Apache Arrow in March 2018.[17] "The time is right for database systems architects to agree on and adhere to a main memory data representation standard," he concluded. "[If your] workloads are typically scanning through a few attributes of many entities, I do not see any reason not to embrace the Arrow standard."

References

  1. ^ "Github releases".
  2. ^ Dinsmore T.W. (2016). "In-Memory Analytics". In: Disruptive Analytics. Apress, Berkeley, CA. pp. 97–116. doi:10.1007/978-1-4842-1311-7_5. ISBN 978-1-4842-1312-4.
  3. ^ Versaci F, Pireddu L, Zanetti G (2016). "Scalable genomics: from raw data to aligned reads on Apache YARN" (PDF). IEEE International Conference on Big Data: 1232–1241.
  4. ^ Maas M, Asanović K, Kubiatowicz J (2017). "Return of the runtimes: rethinking the language runtime system for the cloud 3.0 era" (PDF). Proceedings of the 16th Workshop on Hot Topics in Operating Systems (ACM): 138–143. doi:10.1145/3110000/3103003/p138-Maas (inactive 2019-08-19).
  5. ^ "Introducing Apache Arrow".
  6. ^ a b Martin, Alexander J. (17 February 2016). "Apache Foundation rushes out Apache Arrow as top-level project". The Register.
  7. ^ "Big data gets a new open-source project, Apache Arrow: It offers performance improvements of more than 100x on analytical workloads, the foundation says".
  8. ^ Yegulalp, Serdar (27 February 2016). "Apache Arrow aims to speed access to big data". InfoWorld.
  9. ^ Le Dem, Julien (28 November 2016). "The first release of Apache Arrow". SD Times.
  10. ^ "Julien Le Dem on the Future of Column-Oriented Data Processing with Apache Arrow".
  11. ^ pmc (2019-10-06). "Apache Arrow 0.15.0 Release". Apache Arrow. Retrieved 2019-12-18.
  12. ^ Le Dem, Julien. "Apache Arrow and Apache Parquet: Why We Needed Different Projects for Columnar Data, On Disk and In-Memory". KDnuggets.
  13. ^ "Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?".
  14. ^ "PyArrow:Reading and Writing the Apache Parquet Format".
  15. ^ "Daniel Abadi". Department of Computer Science, University of Maryland.
  16. ^ "Prof. Abadi Wins VLDB 10-Year Best Paper Award".
  17. ^ "An analysis of the strengths and weaknesses of Apache Arrow".