Apache Arrow

From Wikipedia, the free encyclopedia
This is an old revision of this page, last edited on April 12, 2020 at 19:17 by en>Wesmckinn (The statements made in the "Governance" section were factually inaccurate. While employees from Cloudera were involved in the founding of the project, the project was announced and created directly by the Apache Software Foundation by forking IP out of the Apache Drill. No organization made a "donation". The article also incorrectly stated that the C++ is the main implementation. There are 6 native implementations and 5 language bindings. I added citations supporting the Governance changes). It may differ significantly from the current version.

Apache Arrow

Basic data

Developer(s) Wes McKinney, Antoine Pitrou, Sutou Kouhei, Matt Topol[1]
Initial release February 17, 2016[2]
Latest version 22.0.0[3] (October 24, 2025)
License Apache License, Version 2.0
Website arrow.apache.org

Apache Arrow is a language-agnostic software framework for developing data analytics applications that process columnar data. It defines a standardized column-oriented memory format that can represent flat and hierarchical data for efficient analytic operations on modern CPU and GPU hardware.[4][5][6][7][8] This reduces or eliminates factors that limit the feasibility of working with large sets of data, such as the cost, volatility, or physical constraints of dynamic random-access memory.[9]
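
To make the idea of a column-oriented memory format more concrete, the following is a minimal sketch in plain Python of how a single nullable 32-bit integer column can be laid out as two contiguous buffers: a validity bitmap (one bit per value, least-significant bit first) and a fixed-width values buffer. The function names `build_int32_column` and `get` are hypothetical and chosen for illustration only. This is not the Arrow library API, and it omits details of the real format such as buffer padding and alignment.

```python
import struct

def build_int32_column(values):
    """Pack a list of ints (None = null) into two Arrow-style buffers.

    Returns (validity_bitmap, values_buffer, null_count).
    Illustrative only: buffer padding and alignment are omitted.
    """
    n = len(values)
    bitmap = bytearray((n + 7) // 8)   # one bit per slot, least-significant bit first
    data = bytearray(4 * n)            # fixed-width 4-byte little-endian slots
    null_count = 0
    for i, v in enumerate(values):
        if v is None:
            null_count += 1            # validity bit stays 0; the slot is left zeroed
        else:
            bitmap[i // 8] |= 1 << (i % 8)
            struct.pack_into("<i", data, 4 * i, v)
    return bytes(bitmap), bytes(data), null_count

def get(bitmap, data, i):
    """Constant-time random access: test the validity bit, then read slot i."""
    if not (bitmap[i // 8] >> (i % 8)) & 1:
        return None
    return struct.unpack_from("<i", data, 4 * i)[0]

bitmap, data, nulls = build_int32_column([1, None, 3, 4])
print(bitmap, len(data), nulls)
print([get(bitmap, data, i) for i in range(4)])
```

Because every value in the column has the same width and sits next to its neighbors in memory, a scan over the column touches one contiguous region, which is the property that makes this kind of layout suitable for vectorized processing.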

Interoperability

Arrow can be used with Apache Parquet, Apache Spark, NumPy, PySpark, pandas and other data processing libraries. The project includes native software libraries written in C++, C# (.NET), Go, Java, JavaScript, and Rust, with bindings for other programming languages such as Python, R, and Ruby. Because these implementations share one memory format, Arrow allows zero-copy reads, fast data access, and data interchange between these languages and systems without serialization overhead.[4]
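
The difference between zero-copy sharing and a serialization-style hand-off can be sketched with the Python standard library alone. The example below is an analogy rather than Arrow code: it uses `array` and `memoryview` to show two consumers of one buffer, one sharing the producer's memory and one working from a copied byte string. The variable names are illustrative assumptions.

```python
from array import array

# One contiguous buffer of signed 64-bit integers, owned by the producer.
column = array("q", [10, 20, 30, 40, 50])

view = memoryview(column)      # second consumer: shares the buffer, no copy
window = view[1:4]             # slicing a memoryview is also copy-free
snapshot = column.tobytes()    # serialization-style hand-off: a full copy

column[2] = 99                 # the producer updates the buffer in place

shared = window.tolist()                     # reflects the update
copied = array("q", snapshot)[1:4].tolist()  # still holds the old values

print(shared)
print(copied)
```

The shared view observes the in-place update because it refers to the same memory, while the copied snapshot does not. Avoiding that copy, and the encoding and decoding work around it, is the kind of saving that zero-copy reads aim for.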

Applications

Arrow has been used in diverse domains, including analytics,[10] genomics,[11][9] and cloud computing.[12]

Comparison to Apache Parquet and ORC

Apache Parquet and Apache ORC are popular examples of on-disk columnar data formats. Arrow is designed as a complement to these formats for processing data in-memory.[13] The hardware resource engineering trade-offs for in-memory processing differ from those associated with on-disk storage.[14] The Arrow and Parquet projects include libraries that allow for reading and writing data between the two formats.[15]

Governance

Apache Arrow was announced by The Apache Software Foundation on February 17, 2016,[16] with development led by a coalition of developers from other open source data analytics projects.[17][18][8][19][20] The initial codebase and Java library were seeded by code from Apache Drill.[16]

References

  1. github.com.
  2. Origin and History of Apache Arrow (retrieved November 16, 2025).
  3. Release 22.0.0. October 24, 2025 (retrieved November 11, 2025).
  4. Apache Arrow and Distributed Compute with Kubernetes. December 13, 2018.
  5. Tony Baer: Apache Arrow: Lining Up The Ducks In A Row... Or Column. In: Seeking Alpha. February 17, 2016.
  6. Tony Baer: Apache Arrow: The little data accelerator that could. In: ZDNet. February 25, 2019.
  7. Susan Hall: Apache Arrow's Columnar Layouts of Data Could Accelerate Hadoop, Spark. In: The New Stack. February 23, 2016.
  8. Serdar Yegulalp: Apache Arrow aims to speed access to big data. In: InfoWorld. February 27, 2016.
  9. Tanveer Ahmad: ArrowSAM: In-Memory Genomics Data Processing through Apache Arrow Framework. In: bioRxiv. 2019, p. 741843, doi:10.1101/741843 (biorxiv.org).
  10. Dinsmore T.W.: In-Memory Analytics. In: Disruptive Analytics. Apress, Berkeley, CA, 2016, ISBN 978-1-4842-1312-4, pp. 97–116, doi:10.1007/978-1-4842-1311-7_5.
  11. Versaci F, Pireddu L, Zanetti G: Scalable genomics: from raw data to aligned reads on Apache YARN. In: IEEE International Conference on Big Data. 2016, pp. 1232–1241 (biorxiv.org [PDF]).
  12. Maas M, Asanović K, Kubiatowicz J: Return of the runtimes: rethinking the language runtime system for the cloud 3.0 era. In: Proceedings of the 16th Workshop on Hot Topics in Operating Systems (ACM). 2017, pp. 138–143, doi:10.1145/3102980.3103003 (acm.org [PDF]).
  13. Julien Le Dem: Apache Arrow and Apache Parquet: Why We Needed Different Projects for Columnar Data, On Disk and In-Memory. In: KDnuggets.
  14. Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation? October 31, 2017.
  15. PyArrow: Reading and Writing the Apache Parquet Format.
  16. The Apache® Software Foundation Announces Apache Arrow™ as a Top-Level Project. In: The Apache Software Foundation Blog.
  17. Alexander J. Martin: Apache Foundation rushes out Apache Arrow as top-level project. In: The Register. February 17, 2016.
  18. Big data gets a new open-source project, Apache Arrow: It offers performance improvements of more than 100x on analytical workloads, the foundation says. February 17, 2016.
  19. Julien Le Dem: The first release of Apache Arrow. In: SD Times. November 28, 2016.
  20. Julien Le Dem on the Future of Column-Oriented Data Processing with Apache Arrow.