Programming with Big Data in R
Paradigm | Single Program, Multiple Data (SPMD) |
---|---|
Designed by | Wei-Chen Chen, George Ostrouchov, Pragneshkumar Patel, and Drew Schmidt |
Developer | pbdR Core Team |
First appeared | Sep. 2012 |
Preview release | |
Typing discipline | Dynamic |
OS | Cross-platform |
License | General Public License and Mozilla Public License |
Website | r-pbd.org |
Influenced by | R |
Programming with Big Data in R (pbdR)[1] is a free software programming language and software environment for statistical computing with big data using high-performance statistical computation.[2] The pbdR language is derived from R,[3] a programming language widely used among statisticians and data miners for developing statistical software. pbdR enables high-level distributed data parallelism in R, so that programs can run on large high-performance computing (HPC) platforms with thousands of cores. The project aims to bring the R language to leadership-class computing platforms. Its main goal is to empower data scientists by bringing flexibility and a large analytic toolbox to big data challenges, with an emphasis on productivity, portability, and performance. This is achieved in part by mapping high-level programming syntax to portable, high-performance, scalable parallel libraries. In short, pbdR makes R scalable.
Programming Features
Programming with pbdR requires the use of several packages developed by the pbdR core team. These packages are listed below.
General | I/O | Computation | Application |
---|---|---|---|
pbdDEMO | pbdNCDF4 | pbdDMAT | pmclust |
pbdMPI | | pbdBASE | |
| | pbdSLAP | |
- pbdMPI --- an efficient interface to MPI with a focus on Single Program/Multiple Data (SPMD) parallel programming style
- pbdSLAP --- bundles scalable dense linear algebra libraries in double precision for R, based on ScaLAPACK version 2.0.2[4]
- pbdNCDF4 --- interface to parallel Unidata NetCDF4 format data files[5]
- pbdBASE --- low-level ScaLAPACK codes and wrappers
- pbdDMAT --- distributed matrix classes and computational methods, with a focus on linear algebra and statistics
- pbdDEMO --- a set of package demonstrations and examples, together with a unifying vignette[6]
Among these packages, pbdDEMO consists of two main parts. The first is a collection of more than 20 package demos, which offer example uses of the various pbdR packages. The second is the vignette, which offers detailed explanations of the demos and occasionally provides mathematical or statistical insight.
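The distributed matrix classes in pbdDMAT build on ScaLAPACK, which stores a matrix in a two-dimensional block-cyclic layout across a grid of processes. The following serial base-R sketch shows which process would own each entry of a small matrix under such a layout. The function name `owner_grid`, the 1 x 2 process grid, and the row-major rank labels are assumptions made for illustration; they are not part of any pbdR package.

```r
# Serial sketch of a 2-D block-cyclic layout (no MPI or pbdR required).
# owner_grid() is a hypothetical helper written for this illustration.
owner_grid <- function(M, N, bl, nprow, npcol) {
  # 0-based process row / column owning each global row / column index
  prow <- ((seq_len(M) - 1) %/% bl) %% nprow
  pcol <- ((seq_len(N) - 1) %/% bl) %% npcol
  # Label each (process row, process column) pair with one rank, row-major
  outer(prow, pcol, function(r, c) r * npcol + c)
}

# An 8 x 8 matrix, 2 x 2 blocks, on an assumed 1 x 2 process grid
owners <- owner_grid(M = 8, N = 8, bl = 2, nprow = 1, npcol = 2)
print(owners)         # entry [i, j] is the rank that owns A[i, j]
print(table(owners))  # number of entries held by each rank
```

Blocks of columns alternate between the two processes, so each process holds an equal share of the matrix. A smaller block size spreads the work more evenly, while a larger one keeps more contiguous data on each process.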
Examples
Example 1
The following example illustrates the basic syntax of pbdR. Since pbdR is designed for SPMD, R scripts are stored in files and executed from the command line via mpiexec, mpirun, or a similar launcher. Save the following code in a file called ``demo.r``
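library(pbdMPI, quietly = TRUE)
init()
.comm.size <- comm.size()
.comm.rank <- comm.rank()
# Each rank builds its own length-N vector: rank r holds (r*N + 1):(r*N + N)
N <- 5
x <- (1:N) + N * .comm.rank
# Sum the vectors across all ranks, first as integers, then as doubles
y <- allreduce(as.integer(x), op = "sum")
comm.print(y)
y <- allreduce(as.double(x), op = "sum")
comm.print(y)
finalize()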
and use the command
mpiexec -np 4 Rscript demo.r
to execute the code, where Rscript is one of R's command-line executables.
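To see what the reduction in this example computes without an MPI installation, the serial base-R sketch below emulates the four ranks in a single process. It assumes that allreduce with op = "sum" follows the usual MPI all-reduce semantics, that is, an element-wise sum whose result is delivered to every rank. The names `n_ranks` and `local_x` are introduced here for illustration only.

```r
# Serial emulation of Example 1 (no MPI or pbdMPI required).
n_ranks <- 4   # stands in for comm.size() under `mpiexec -np 4`
N <- 5

# local_x[[r + 1]] is the vector x that rank r would build
local_x <- lapply(0:(n_ranks - 1), function(rank) (1:N) + N * rank)

# Element-wise sum over all ranks; in the parallel run every rank
# would end up holding this same vector.
y <- Reduce(`+`, local_x)
print(y)
```

Element i of the result is i + (i + 5) + (i + 10) + (i + 15) = 4i + 30, so under the stated assumption both calls to comm.print in the parallel script would show the values 34, 38, 42, 46, and 50.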
Example 2
The following example illustrates basic ddmatrix computation in pbdR. Save the following code in a file called ``demo.r``
# Initialize process grid
library(pbdDMAT, quietly = TRUE)
if (comm.size() != 2)
  comm.stop("Exactly 2 processors are required for this demo.")
init.grid()
# Setup for the remainder
comm.set.seed(diff=TRUE)
M <- N <- 16
BL <- 2 # blocking --- passing single value BL assumes BLxBL blocking
dA <- ddmatrix("rnorm", nrow=M, ncol=N, mean=100, sd=10, bldim=BL)
A <- as.matrix(dA)
# LA SVD
svd1 <- La.svd(A)
svd2 <- La.svd(dA)
svd2 <- lapply(svd2, as.matrix)
comm.print(sum(svd1$d - svd2$d))
comm.print(sum(svd1$u - svd2$u))
comm.print(sum(svd1$vt - svd2$vt))
# Finish
finalize()
and use the command
mpiexec -np 2 Rscript demo.r
to execute the code, where Rscript is one of R's command-line executables.
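The example checks the distributed result by summing signed differences against the serial result. Such sums can cancel, and singular vectors are determined only up to sign, so a complementary check is to rebuild the matrix from its factors and look at the largest absolute error. The sketch below does this in serial base R only; the seed and the tolerance of 1e-8 are arbitrary choices made for illustration.

```r
# Serial sketch of an SVD sanity check (base R only, no pbdDMAT required).
set.seed(1)    # arbitrary seed, used only for repeatability
M <- N <- 16
A <- matrix(rnorm(M * N, mean = 100, sd = 10), nrow = M, ncol = N)

s <- La.svd(A)   # list with components d, u and vt

# Singular values come back in non-increasing order
sorted_ok <- all(diff(s$d) <= 0)

# Rebuild A from its factors: A = U diag(d) V^T
A_rebuilt <- s$u %*% diag(s$d) %*% s$vt
max_err <- max(abs(A - A_rebuilt))

print(sorted_ok)
print(max_err < 1e-8)
```

The same reconstruction test could in principle be applied to the factors gathered from the distributed computation, since it does not depend on the sign convention chosen for the singular vectors.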
External links
- Official website of the pbdR project
- Technical website of the pbdR packages
- Source code of the development version of the pbdR packages
- Discussion group for pbdR-related topics
- Tutorial Website for beginners
Milestones
- Version 0.1-0: Migrate from Rmpi to pbdMPI.
- Version 0.1-1: Add pbdSLAP.
- Version 0.1-2: Add pbdBASE and pbdDMAT.
- Version 1.0-0: Add pbdNCDF4.
- Version 1.0-1: Add pmclust.
References
- ^ Ostrouchov, G., Chen, W.-C., Schmidt, D., Patel, P. (2012). "Programming with Big Data in R".
- ^ Chen, W.-C. and Ostrouchov, G. (2011). "HPSC -- High Performance Statistical Computing for Data Intensive Research".
- ^ R Core Team (2012). R: A Language and Environment for Statistical Computing. ISBN 3-900051-07-0.
- ^ Blackford, L.S.; et al. (1997). ScaLAPACK Users' Guide.
- ^ NetCDF Group (2008). "Network Common Data Form".
- ^ Schmidt, D., Chen, W.-C., Patel, P., Ostrouchov, G. (2013). "Speaking Serial R with a Parallel Accent".