Programming with Big Data in R

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Wccsnow at 17:13, 25 June 2013. It may differ significantly from the current revision.
pbdR
Paradigm: Single Program, Multiple Data (SPMD)
Designed by: Wei-Chen Chen, George Ostrouchov, Pragneshkumar Patel, and Drew Schmidt
Developer: pbdR Core Team
First appeared: Sep. 2012
Preview release: Through GitHub at RBigData
Typing discipline: Dynamic
OS: Cross-platform
License: General Public License and Mozilla Public License
Website: r-pbd.org
Influenced by: R

Programming with Big Data in R (pbdR)[1] is a free software programming language and software environment for statistical computing with big data that relies on high-performance statistical computation.[2] The pbdR language is derived from R,[3] which is widely used among statisticians and data miners for developing statistical software. pbdR enables high-level distributed data parallelism in R, so that R programs can run on large HPC platforms with thousands of cores. The project aims to bring the R statistical programming language to leadership-class computing platforms. Its main goal is to empower data scientists by bringing flexibility and a large analytic toolbox to big data challenges, with an emphasis on productivity, portability, and performance. This is achieved in part by mapping high-level programming syntax to portable, high-performance, scalable, parallel libraries. In short, pbdR makes R scalable.

Programming Features

Programming with pbdR requires several packages developed by the pbdR core team. These packages are listed below by category.

General: pbdDEMO, pbdMPI
I/O: pbdNCDF4
Computation: pbdDMAT, pbdBASE, pbdSLAP
Application: pmclust
  • pbdMPI --- an efficient interface to MPI with a focus on the Single Program/Multiple Data (SPMD) parallel programming style
  • pbdSLAP --- bundles scalable dense linear algebra libraries in double precision for R, based on ScaLAPACK version 2.0.2[4]
  • pbdNCDF4 --- an interface to parallel Unidata NetCDF4 format data files[5]
  • pbdBASE --- low-level ScaLAPACK codes and wrappers
  • pbdDMAT --- distributed matrix classes and computational methods, with a focus on linear algebra and statistics
  • pbdDEMO --- a set of package demonstrations and examples, together with a unifying vignette[6]

Among these packages, pbdDEMO consists of two main parts. The first is a collection of more than 20 package demos that show example uses of the various pbdR packages. The second is the vignette, which offers detailed explanations of the demos and, in places, some mathematical or statistical insight.

Examples

Example 1

The following example illustrates the basic syntax of pbdR. Because pbdR is designed for SPMD, R scripts are stored in files and executed from the command line via mpiexec, mpirun, or a similar launcher. Save the following code in a file called "demo.r":

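# Load pbdMPI and initialize the MPI communicator
library(pbdMPI, quietly = TRUE)
init()
.comm.size <- comm.size()  # total number of processes
.comm.rank <- comm.rank()  # rank of this process, starting at 0

# Each rank builds its own vector of N values
N <- 5
x <- (1:N) + N * .comm.rank

# Sum x element-wise across all ranks, first as integers, then as doubles
y <- allreduce(as.integer(x), op = "sum")
comm.print(y)
y <- allreduce(as.double(x), op = "sum")
comm.print(y)

# Shut down MPI
finalize()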

and use the command

mpiexec -np 4 Rscript demo.r

to execute the code, where Rscript is a command-line executable distributed with R and -np 4 starts four processes.
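
To make the result of Example 1 easier to anticipate, the reduction can be emulated in plain R on a single process. This is only a sketch under stated assumptions: it does not use pbdMPI or MPI at all, and the names n_ranks and local_x are hypothetical stand-ins introduced here for the four processes started by mpiexec.

```r
# Serial emulation of Example 1; no MPI or pbdMPI is needed.
# n_ranks is a hypothetical stand-in for the "-np 4" argument to mpiexec.
n_ranks <- 4
N <- 5

# Each list element holds the x that one rank would compute locally
# (ranks are numbered from 0, as reported by comm.rank()).
local_x <- lapply(seq_len(n_ranks) - 1, function(rank) (1:N) + N * rank)

# A sum reduction adds the vectors element by element;
# Reduce() mimics that summation on a single process.
y <- Reduce(`+`, local_x)
print(y)
# [1] 34 38 42 46 50
```

In the real SPMD run every rank ends up holding this same vector, and comm.print() prints it from a single rank so the output is not repeated four times.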

Example 2

The following example illustrates basic ddmatrix (distributed dense matrix) computation in pbdR. Save the following code in a file called "demo.r":

library(pbdDMAT, quietly = TRUE)

# This demo is written for exactly 2 processes
if (comm.size() != 2)
  comm.stop("Exactly 2 processors are required for this demo.")

# Initialize process grid
init.grid()

# Setup for the remainder
comm.set.seed(diff=TRUE)
M <- N <- 16
BL <- 2 # blocking --- passing single value BL assumes BLxBL blocking

dA <- ddmatrix("rnorm", nrow=M, ncol=N, mean=100, sd=10)
A <- as.matrix(dA)

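# Compare the serial SVD of A with the distributed SVD of dA
svd1 <- La.svd(A)
svd2 <- La.svd(dA)
svd2 <- lapply(svd2, as.matrix)  # convert results to ordinary matrices
# Each sum should be close to zero if the two results agree
comm.print(sum(svd1$d - svd2$d))
comm.print(sum(svd1$u - svd2$u))
comm.print(sum(svd1$vt - svd2$vt))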

# Finish
finalize()

and use the command

mpiexec -np 2 Rscript demo.r

to execute the code, where Rscript is a command-line executable distributed with R and -np 2 starts two processes.
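
Example 2 defines a blocking factor BL. ScaLAPACK-style libraries generally spread a dense matrix over a process grid in a two-dimensional block-cyclic layout, and the sketch below shows how such a layout assigns matrix columns to processes. It runs in plain R without pbdDMAT. The helper owner_of() and the 1 x 2 grid shape are assumptions made for illustration; the grid that init.grid() actually builds and the blocking that ddmatrix() actually uses may differ.

```r
# Sketch of a 2-D block-cyclic layout; runs in plain R without pbdDMAT.
# owner_of() and the 1 x 2 grid shape are hypothetical, for illustration only.
owner_of <- function(i, j, bl, nprow, npcol) {
  # Global indices i and j are 1-based; grid coordinates are 0-based.
  c(row = ((i - 1) %/% bl) %% nprow,
    col = ((j - 1) %/% bl) %% npcol)
}

M <- N <- 16
BL <- 2     # BL x BL blocks, as in Example 2
nprow <- 1  # assumed grid shape for 2 processes
npcol <- 2

# Grid column that owns each global matrix column.
col_owner <- sapply(seq_len(N),
                    function(j) owner_of(1, j, BL, nprow, npcol)[["col"]])
print(col_owner)
# [1] 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1

# Number of matrix columns stored by each grid column.
print(table(col_owner))
```

Under these assumptions the blocks of two columns alternate between the two processes, so each one stores 8 of the 16 columns and the work is balanced.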

Milestones

References

  1. Ostrouchov, G.; Chen, W.-C.; Schmidt, D.; Patel, P. (2012). "Programming with Big Data in R".
  2. Chen, W.-C.; Ostrouchov, G. (2011). "HPSC -- High Performance Statistical Computing for Data Intensive Research".
  3. R Core Team (2012). R: A Language and Environment for Statistical Computing. ISBN 3-900051-07-0.
  4. Blackford, L.S.; et al. (1997). ScaLAPACK Users' Guide.
  5. NetCDF Group (2008). "Network Common Data Form".
  6. Schmidt, D.; Chen, W.-C.; Patel, P.; Ostrouchov, G. (2013). "Speaking Serial R with a Parallel Accent".