Apache Arrow is a language-agnostic software framework for developing data analytics applications that process columnar data. It defines a standardized column-oriented memory format that can represent both flat and hierarchical data for efficient analytic operations on modern CPU and GPU hardware.[2][3][4][5][6] This reduces or eliminates factors that limit the feasibility of working with large data sets, such as the cost, volatility, or physical constraints of dynamic random-access memory.[7]
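As an illustration of the format's scope, the minimal sketch below uses the PyArrow library (introduced later in this article) to build one flat column and one hierarchical column; the column names and values are illustrative assumptions, not part of any Arrow specification.

```python
import pyarrow as pa

# A flat column: values are laid out contiguously in a single buffer.
flat = pa.array([1, 2, 3, 4], type=pa.int64())

# A hierarchical column: a struct containing a string and a nested list.
nested = pa.array(
    [{"name": "a", "scores": [1, 2]}, {"name": "b", "scores": [3]}],
    type=pa.struct([("name", pa.string()), ("scores", pa.list_(pa.int64()))]),
)

# Both column kinds live in the same columnar table representation.
table = pa.table({"flat": flat, "nested": nested})
print(table.schema)
```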
Arrow can be used with Apache Parquet, Apache Spark, NumPy, PySpark, pandas and other data processing libraries.
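A brief sketch of this interoperability, again using PyArrow; the DataFrame contents are made up for illustration.

```python
import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"id": np.arange(3), "label": ["a", "b", "c"]})

# Convert a pandas DataFrame (with NumPy-backed columns) to an Arrow
# table, and back again; for many numeric types the conversion can
# reuse the underlying buffers rather than copying them.
table = pa.Table.from_pandas(df)
round_tripped = table.to_pandas()
```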
The project includes native software libraries written in C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python (PyArrow[8]), R, Ruby, and Rust. Arrow allows for zero-copy reads and fast data access and interchange without serialization overhead between these languages and systems.[2]
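One way to observe the zero-copy behavior is Arrow's IPC file format combined with a memory map, sketched below with PyArrow; the file name is an assumption for illustration.

```python
import pyarrow as pa

table = pa.table({"x": [1.0, 2.5, 3.0]})

# Write the table once using the Arrow IPC file format.
with pa.OSFile("example.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Re-open it through a memory map: the loaded record batches reference
# the mapped buffers directly, with no deserialization step.
with pa.memory_map("example.arrow", "r") as source:
    loaded = pa.ipc.open_file(source).read_all()
    print(loaded.num_rows)
```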
Apache Parquet and Apache ORC are popular examples of on-disk columnar data formats. Arrow is designed as a complement to these formats for processing data in memory.[12] The hardware resource engineering trade-offs for in-memory processing differ from those associated with on-disk storage.[13] The Arrow and Parquet projects include libraries that allow reading and writing data between the two formats.[14]
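A short sketch of this bridge using the pyarrow.parquet module follows; the file and column names are illustrative.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"city": ["Oslo", "Lima"], "pop": [709000, 9752000]})

# Persist the in-memory Arrow table as on-disk columnar Parquet ...
pq.write_table(table, "example.parquet")

# ... and read it back into Arrow's in-memory format.
restored = pq.read_table("example.parquet")
```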
Apache Arrow was announced by The Apache Software Foundation on February 17, 2016,[15] with development led by a coalition of developers from other open-source data analytics projects.[16][17][6][18][19] The initial codebase and Java library were seeded by code from Apache Drill.[15]
^ Maas M, Asanović K, Kubiatowicz J (2017). "Return of the Runtimes: Rethinking the Language Runtime System for the Cloud 3.0 Era". Proceedings of the 16th Workshop on Hot Topics in Operating Systems. pp. 138–143. doi:10.1145/3102980.3103003. ISBN 978-1-4503-5068-6.