RCFile
RCfile (Record Columnar File) is a data placement structure that is designed and implemented for big data analytics under the MapReduce programming environment. RCFile aims to meet the following four basic requirements of big data analytics in large scale distributed systems: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) a strong adaptivity to dynamic data access patterns.
RCFile is implemented as a part of Hive, a data warehouse infrastructure, on top of Hadoop. It is also adopted in HCatalog project (formerly known as Howl), a table and storage management service for Hadoop. RCFile has been widely used in big data analytics community where Hadoop is supported as the basic system environment. The largest user of RCFile is the Facebook, which has one of the busiest data processing system in the world.
RCFile is also a result of basic research from Facebook data Infrastructure team, ICT and The Ohio State University. A research paper entitled “RCFile: a Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse systems” [1] was published and presented in ICDE’ 11.
Description of basic structure
In a relational database, data is organized as two-dimensional tables. For example, a table in a database consistis from columns (c1 to c4):
c1 | c2 | c3 | c4 |
---|---|---|---|
11 | 12 | 13 | 14 |
21 | 22 | 23 | 24 |
31 | 32 | 33 | 34 |
41 | 42 | 43 | 44 |
51 | 52 | 53 | 54 |
To serialize the table, RCFile partitions this table first horizontally and then vertically, instead of only partitioning the table horizontally like the row-oriented DBMS (row-store) or only partitioning the table vertically like the column-oriented DBMS (column-store). The horizontal partitioning will first partition the table into multiple row groups based on the row-group size, which is a user-specified value determining the size of each row group. For example, the table mentioned above can be partitioned to two row groups.
c1 | c2 | c3 | c4 |
---|---|---|---|
11 | 12 | 13 | 14 |
21 | 22 | 23 | 24 |
31 | 32 | 33 | 34 |
c1 | c2 | c3 | c4 |
---|---|---|---|
41 | 42 | 43 | 44 |
41 | 42 | 43 | 44 |
Then, in every row group, RCFile partitions the data vertically like column-store. Thus, the table will be serialized as:
Row Group 1 Row Group 2 11, 21, 31; 41, 51; 12, 22, 32; 42, 52; 13, 23, 33; 43, 53; 14, 24, 34; 44, 54;
Performance Benefits: minimizing I/O costs and network bandwidths
In data warehousing systems, column-store is more efficient when a query only projects a subset of columns, because column-store only read necessary columns from disks but row-store will read a entire row. In MapReduce-based data warehousing systems, data is normally stored on a distributed system, such as Hadoop Distributed File System (HDFS), and different data blocks might be stored in different machines. Thus, for column-store on MapReduce, different groups of columns might be stored on different machines, which introduces extra network costs when a query projects columns placed on different machines. For MapReduce-based data warehousing systems, the merit of row-store is that there is no extra network costs to construct a row in query processing, and the merit of column-store is that there is no unnecessary local I/O costs when read data from disks. RCFile combines merits of row-store and column-store via horizontal-vertical partitioning. With horizontal partitioning, RCFile places all columns of a row in a single machine and thus can eliminate the extra network costs when constructing a row. With vertical partitioning, for a query, RCFile will only read necessary columns from disks and thus can eliminate the unnecessary local I/O costs. Moreover, in every row group, data compression can be done by using compression algorithms used in column-store.
Row-group Size: a critical optimization factor
The row-group size determines the amount of data in a row group and it is a user-specified value. Row-group size has a significant impact on data compression ratios, query processing performance and system fault-tolerance. The row-group size should be bounded by the HDFS block size, which gives a large space of optimization.
Column data compression
Within each row group, columns are compressed to reduce storage space usage. Since data of a column are stored adjacently, the pattern of a column can be detected and thus the suitable compression algorithm can be selected for a high compression ratio.
See also
References
- ^ Yongqiang He, Rubao Lee, Yin Huai, Zheng Shao, Namit Jain, Xiaodong Zhang, Zhiwei Xu, "RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems", Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2011. http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/abs11-4.html