
Apache Arrow

From Wikipedia, the free encyclopedia
This is an old version of this page, last edited on 5 June 2025 at 08:59 by en>Kku. It may differ significantly from the current version.



Apache Arrow is a language-agnostic software framework for developing data analytics applications that process columnar data. It contains a standardized column-oriented memory format that can represent flat and hierarchical data for efficient analytic operations on modern CPU and GPU hardware.[1][2][3][4][5] This reduces or eliminates factors that limit the feasibility of working with large sets of data, such as the cost, volatility, or physical constraints of dynamic random-access memory.[6]

Interoperability

Arrow can be used with Apache Parquet, Apache Spark, NumPy, PySpark, pandas and other data processing libraries. The project includes native software libraries written in C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python (PyArrow[7]), R, Ruby, and Rust. Arrow allows for zero-copy reads and fast data access and interchange without serialization overhead between these languages and systems.[1]

Applications

Arrow has been used in diverse domains, including analytics,[8] genomics,[9][6] and cloud computing.[10]

Comparison to Apache Parquet and ORC

Apache Parquet and Apache ORC are popular examples of on-disk columnar data formats. Arrow is designed as a complement to these formats for processing data in-memory.[11] The hardware resource engineering trade-offs for in-memory processing differ from those associated with on-disk storage.[12] The Arrow and Parquet projects include libraries that allow for reading and writing data between the two formats.[13]

Governance

Apache Arrow was announced by The Apache Software Foundation on February 17, 2016,[14] with development led by a coalition of developers from other open source data analytics projects.[15][16][5][17][18] The initial codebase and Java library were seeded by code from Apache Drill.[14]

References


  1. a b Apache Arrow and Distributed Compute with Kubernetes. 13 December 2018.
  2. Tony Baer: Apache Arrow: Lining Up The Ducks In A Row... Or Column. In: Seeking Alpha. 17 February 2016.
  3. Tony Baer: Apache Arrow: The little data accelerator that could. In: ZDNet. 25 February 2019.
  4. Susan Hall: Apache Arrow's Columnar Layouts of Data Could Accelerate Hadoop, Spark. In: The New Stack. 23 February 2016.
  5. a b Serdar Yegulalp: Apache Arrow aims to speed access to big data. In: InfoWorld. 27 February 2016.
  6. a b Tanveer Ahmad: ArrowSAM: In-Memory Genomics Data Processing through Apache Arrow Framework. In: bioRxiv. 2019, p. 741843, doi:10.1101/741843 (biorxiv.org).
  7. https://arrow.apache.org/docs/python/
  8. Dinsmore T.W.: Disruptive Analytics. Apress, Berkeley, CA, 2016, ISBN 978-1-4842-1312-4, In-Memory Analytics: Satisfying the Need for Speed, pp. 97–116, doi:10.1007/978-1-4842-1311-7_5.
  9. Versaci F, Pireddu L, Zanetti G: Scalable genomics: from raw data to aligned reads on Apache YARN. In: IEEE International Conference on Big Data. 2016, pp. 1232–1241 (biorxiv.org [PDF]).
  10. Maas M, Asanović K, Kubiatowicz J: Return of the runtimes: rethinking the language runtime system for the cloud 3.0 era. In: Proceedings of the 16th Workshop on Hot Topics in Operating Systems (ACM). 2017, pp. 138–143, doi:10.1145/3102980.3103003.
  11. Julien Le Dem: Apache Arrow and Apache Parquet: Why We Needed Different Projects for Columnar Data, On Disk and In-Memory. In: KDnuggets.
  12. Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation? 31 October 2017.
  13. PyArrow: Reading and Writing the Apache Parquet Format.
  14. a b The Apache® Software Foundation Announces Apache Arrow™ as a Top-Level Project. In: The Apache Software Foundation Blog. 17 February 2016.
  15. Alexander J. Martin: Apache Foundation rushes out Apache Arrow as top-level project. In: The Register. 17 February 2016.
  16. Big data gets a new open-source project, Apache Arrow: It offers performance improvements of more than 100x on analytical workloads, the foundation says. 17 February 2016, archived from the original on 27 July 2016; retrieved 31 January 2018.
  17. Julien Le Dem: The first release of Apache Arrow. In: SD Times. 28 November 2016.
  18. Julien Le Dem on the Future of Column-Oriented Data Processing with Apache Arrow.