Programming with Big Data in R

pbdR
Paradigm	Single Program, Multiple Data (SPMD)
Designed by	Wei-Chen Chen, George Ostrouchov, Pragneshkumar Patel, and Drew Schmidt
Developer	pbdR Core Team
First appeared	Sep. 2012
Preview release	Through GitHub at RBigData
Typing discipline	Dynamic
OS	Cross-platform
License	General Public License and Mozilla Public License
Website	r-pbd.org
Influenced by
	R

Programming with Big Data in R (pbdR)^[1] is a free software programming language and software environment for statistical computing with big data using high-performance statistical computation.^[2] The pbdR language is derived from R,^[3] which is widely used among statisticians and data miners for developing statistical software. pbdR provides high-level distributed data parallelism in R, so that R programs can use large HPC platforms with thousands of cores. The project aims to bring the statistical programming language R to leadership-class computing platforms. Its main goal is to empower data scientists by bringing flexibility and a large analytics toolbox to big data challenges, with an emphasis on productivity, portability, and performance. It does this in part by mapping high-level programming syntax to portable, high-performance, scalable, parallel libraries. In short, pbdR makes R scalable.

Programming Features

Programming with pbdR requires the use of several packages developed by the pbdR core team. These packages are listed below.

General	I/O	Computation	Application
pbdDEMO	pbdNCDF4	pbdDMAT	pmclust
pbdMPI		pbdBASE
		pbdSLAP

pbdMPI --- an efficient interface to MPI with a focus on Single Program/Multiple Data (SPMD) parallel programming style
pbdSLAP --- bundles scalable dense linear algebra libraries in double precision for R, based on ScaLAPACK version 2.0.2^[4]
pbdNCDF4 --- interface to parallel Unidata NetCDF4 format data files^[5]
pbdBASE --- low-level ScaLAPACK codes and wrappers
pbdDMAT --- distributed matrix classes and computational methods, with a focus on linear algebra and statistics
pbdDEMO --- a set of package demonstrations and examples, and a unifying vignette^[6]

Among these packages, pbdDEMO consists of two main parts. The first is a collection of more than 20 package demos, which show example uses of the various pbdR packages. The second is a vignette that explains the demos in detail and sometimes provides mathematical or statistical insight.

Examples

Example 1

This example illustrates the basic syntax of pbdR. Because pbdR is designed for SPMD, R scripts are stored in files and executed from the command line via mpiexec, mpirun, or a similar launcher. Save the following code in a file called ``demo.r``:

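# Load pbdMPI and start MPI
library(pbdMPI, quietly = TRUE)
init()
.comm.size <- comm.size()   # total number of MPI ranks
.comm.rank <- comm.rank()   # id of this rank, starting at 0

# Each rank builds its own block of N consecutive integers
N <- 5
x <- (1:N) + N * .comm.rank

# Element-wise sum across all ranks, first as integers, then as doubles
y <- allreduce(as.integer(x), op = "sum")
comm.print(y)
y <- allreduce(as.double(x), op = "sum")
comm.print(y)

# Shut down MPI
finalize()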

and use the command

mpiexec -np 4 Rscript demo.r

to execute the code, where Rscript is R's command-line program for running scripts.
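
To see what the four-process run computes without an MPI installation, the reduction can be imitated in serial base R. This is only an illustrative sketch: the names n_ranks, local_x and simulated are hypothetical, and the lapply over rank ids stands in for the separate processes that mpiexec would start. It does not call pbdMPI.

```r
# Serial imitation of Example 1 (assumption: 4 ranks, as in `mpiexec -np 4`).
n_ranks <- 4
N <- 5

# Each simulated rank r builds the same local vector as demo.r: (1:N) + N * r.
local_x <- lapply(0:(n_ranks - 1), function(r) (1:N) + N * r)

# A sum-allreduce adds the local vectors element by element and gives the
# result to every rank; Reduce() performs the same element-wise sum here.
simulated <- Reduce(`+`, local_x)
print(simulated)
```

Under these assumptions the element-wise sum is 34 38 42 46 50, which is what both comm.print calls in demo.r would be expected to report, once from integer input and once from double input.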

Example 2

This example illustrates basic ddmatrix (distributed dense matrix) computation in pbdR. Save the following code in a file called ``demo.r``:

# Initialize process grid
library(pbdDMAT, quietly = TRUE)

if(comm.size() != 2)
  comm.stop("Exactly 2 processors are required for this demo.")

init.grid()

# Setup for the remainder
comm.set.seed(diff=TRUE)
M <- N <- 16
BL <- 2 # blocking --- passing single value BL assumes BLxBL blocking

dA <- ddmatrix("rnorm", nrow=M, ncol=N, mean=100, sd=10)
A <- as.matrix(dA)

# LA SVD
svd1 <- La.svd(A)
svd2 <- La.svd(dA)
svd2 <- lapply(svd2, as.matrix)
comm.print(sum(svd1$d - svd2$d))
comm.print(sum(svd1$u - svd2$u))
comm.print(sum(svd1$vt - svd2$vt))

# Finish
finalize()

and use the command

mpiexec -np 2 Rscript demo.r

to execute the code, where Rscript is R's command-line program for running scripts.
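
Example 2 compares the serial and distributed factorizations by subtracting their factors. Singular vectors are determined only up to sign, so a complementary check that does not depend on sign is to rebuild the matrix from its factors. The sketch below does this in serial base R only; it does not use pbdDMAT or MPI, and the seed and the names s, A_rebuilt and recon_err are hypothetical choices made for illustration.

```r
# Serial sketch in base R only; no pbdDMAT or MPI is involved.
set.seed(1)                      # hypothetical seed, for repeatability
M <- N <- 16
A <- matrix(rnorm(M * N, mean = 100, sd = 10), nrow = M, ncol = N)

s <- La.svd(A)                   # list with d, u and vt (the transposed V)

# Rebuild A from its factors: A = U diag(d) V^T.
A_rebuilt <- s$u %*% diag(s$d) %*% s$vt

# Largest absolute entry-wise difference between A and its reconstruction.
recon_err <- max(abs(A - A_rebuilt))
print(recon_err)
```

The reconstruction error should be tiny compared with the entries of A, which are around 100. The same check could presumably be applied to the distributed result after converting its factors with as.matrix, though that is not shown here.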

External links

Official website of the pbdR project
Technical website of the pbdR packages
Source code of the development version of the pbdR packages
Discussion group for pbdR-related topics
Tutorial Website for beginners

Milestones

Version 0.1-0: Migrate from Rmpi to pbdMPI.
Version 0.1-1: Add pbdSLAP.
Version 0.1-2: Add pbdBASE and pbdDMAT.
Version 1.0-0: Add pbdNCDF4.
Version 1.0-1: Add pmclust.

References

[1] Ostrouchov, G., Chen, W.-C., Schmidt, D., Patel, P. (2012). "Programming with Big Data in R".
[2] Chen, W.-C. and Ostrouchov, G. (2011). "HPSC -- High Performance Statistical Computing for Data Intensive Research".
[3] R Core Team (2012). R: A Language and Environment for Statistical Computing. ISBN 3-900051-07-0.
[4] Blackford, L.S.; et al. (1997). ScaLAPACK Users' Guide.
[5] NetCDF Group (2008). "Network Common Data Form".
[6] Schmidt, D., Chen, W.-C., Patel, P., Ostrouchov, G. (2013). "Speaking Serial R with a Parallel Accent".
