Apache Arrow

Apache Arrow
Developer(s): Apache Software Foundation
Initial release: October 10, 2016
Stable release: v0.11.1...[1] / October 19, 2018
Repository: https://github.com/apache/arrow
Written in: C++, Java, Python
Type: Data analytics, machine learning algorithms
License: Apache License 2.0
Website: arrow.apache.org

Apache Arrow is an open source software library for columnar in-memory data structures and processing.[2][3][4]

Arrow is sponsored by the nonprofit Apache Software Foundation[5] and was announced by Cloudera in 2016.[6] Arrow is a component, rather than a standalone piece of software, and as such is included in many popular projects, including Apache Spark and pandas.[7]

It defines a language-independent physical memory layout, enabling zero-copy, zero-deserialization interchange of flat and nested columnar data among systems and interfaces such as Python, R, Apache Spark, ODBC, and proprietary systems that use the open source components.[5][8] Apache Arrow complements on-disk columnar data formats such as Apache Parquet and Apache ORC by organizing data for efficient in-memory processing by CPUs and GPUs.
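
The in-memory columnar layout can be illustrated with a minimal sketch in Python using the pyarrow library (the column names and data are illustrative examples, not taken from the article's sources, and a reasonably recent pyarrow version is assumed):

    import pyarrow as pa

    # Build an in-memory Arrow table; each column is stored as a
    # contiguous, typed buffer rather than as rows of Python objects.
    table = pa.table({
        "id": [1, 2, 3],
        "score": [0.5, 1.5, 2.5],
    })

    # Other Arrow-aware systems can consume these same buffers
    # without copying or re-serializing the data.
    print(table.schema)
    print(table.column("score"))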

Arrow has been proposed as a format for in-memory analytics,[9] genomics,[10] and computation in the cloud.[11]

Comparisons

Apache Parquet and Apache ORC are popular examples of on-disk columnar data formats. Arrow is designed as a complement to these formats for processing data in memory.[12] The hardware resource engineering trade-offs for in-memory processing differ from those associated with on-disk storage.[13] The Arrow and Parquet projects include libraries that allow reading and writing data between the two formats.[14]
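
A brief sketch of this interchange in Python with the pyarrow library (the file name and columns are illustrative, not drawn from the cited documentation):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Write an in-memory Arrow table to the on-disk Parquet format.
    table = pa.table({"id": [1, 2, 3], "score": [0.5, 1.5, 2.5]})
    pq.write_table(table, "example.parquet")

    # Read the Parquet file back into Arrow's in-memory representation.
    round_tripped = pq.read_table("example.parquet")
    print(round_tripped.equals(table))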

Daniel Abadi, Darnell-Kanal Professor of Computer Science at the University of Maryland[15] and a pioneer of column-oriented databases,[16] reviewed Apache Arrow in March 2018.[17] "The time is right for database systems architects to agree on and adhere to a main memory data representation standard," he concluded. "[If your] workloads are typically scanning through a few attributes of many entities, I do not see any reason not to embrace the Arrow standard."

References

  1. ^ "GitHub releases".
  2. ^ Yegulalp, Serdar (27 February 2016). "Apache Arrow aims to speed access to big data". InfoWorld.
  3. ^ LeDem, Julien (28 November 2016). "The first release of Apache Arrow". SD Times.
  4. ^ "Julien Le Dem on the Future of Column-Oriented Data Processing with Apache Arrow".
  5. ^ a b Martin, Alexander J. (17 February 2016). "Apache Foundation rushes out Apache Arrow as top-level project". The Register.
  6. ^ "Introducing Apache Arrow".
  7. ^ "Apache Arrow unifies in-memory Big Data systems: Leaders from 13 existing open source projects band together to solve a common problem: how to represent Big Data in memory for maximum performance and interoperability".
  8. ^ "Big data gets a new open-source project, Apache Arrow: It offers performance improvements of more than 100x on analytical workloads, the foundation says".
  9. ^ Dinsmore T.W. (2016). "In-Memory Analytics". In: Disruptive Analytics. Apress, Berkeley, CA. pp. 97–116. doi:10.1007/978-1-4842-1311-7_5. ISBN 978-1-4842-1312-4.
  10. ^ Versaci F, Pireddu L, Zanetti G (2016). "Scalable genomics: from raw data to aligned reads on Apache YARN" (PDF). IEEE International Conference on Big Data: 1232–1241.
  11. ^ Maas M, Asanović K, Kubiatowicz J (2017). "Return of the runtimes: rethinking the language runtime system for the cloud 3.0 era" (PDF). Proceedings of the 16th Workshop on Hot Topics in Operating Systems (ACM): 138–143.
  12. ^ LeDem, Julien. "Apache Arrow and Apache Parquet: Why We Needed Different Projects for Columnar Data, On Disk and In-Memory". KDnuggets.
  13. ^ "Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?".
  14. ^ "PyArrow: Reading and Writing the Apache Parquet Format".
  15. ^ "Daniel Abadi". Department of Computer Science, University of Maryland.
  16. ^ "Prof. Abadi Wins VLDB 10-Year Best Paper Award".
  17. ^ "An analysis of the strengths and weaknesses of Apache Arrow".