Jump to content

Data-intensive computing

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Jratke (talk | contribs) at 07:30, 7 March 2011 (Description of Data Intensive Computing). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.
(diff) ← Previous revision | Latest revision (diff) | Newer revision → (diff)

Template:New unreviewed article

Data-Intensive Computing is a class of parallel computing applications

which use a data-parallel approach to processing large volumes of data typically

terabytes or petabytes in size and typically referred to as Big Data.

Computing applications which devote most of their execution time to computational

requirements are deemed compute-intensive and typically require small volumes of

data, whereas computing applications which require large volumes of data and devote

most of their processing time to I/O and manipulation of data are deemed data-

intensive [1].

Introduction

The rapid growth of the Internet and World Wide Web has led to vast amounts

of information available online. In addition, business and government organizations

create large amounts of both structured and unstructured information which needs to

be processed, analyzed, and linked. Vinton Cerf of Google has described this as

an “Information Avalanche” and has stated “we must harness the Internet’s energy

before the information it has unleashed buries us.” [2]. An IDC

white paper sponsored by EMC estimated the amount of information currently

stored in a digital form in 2007 at 281 exabytes and the overall compound growth

rate at 57% with information in organizations growing at even a faster rate [3]. In another study of the so-called information

explosion it was estimated that 95% of all current information exists in

unstructured form with increased data processing requirements compared to structured

information [4]. The storing, managing,

accessing, and processing of this vast amount of data represents a fundamental need

and an immense challenge in order to satisfy needs to search, analyze, mine, and

visualize this data as information [5]. Data-intensive

computing is intended to address this need.

Parallel processing approaches can be generally classified as either

compute-intensive, or data-intensive, [6]. [7]. [8] Compute-intensive is used to describe application

programs that are compute bound. Such applications devote most of their execution

time to computational requirements as opposed to I/O, and typically require small

volumes of data. Parallel processing of compute-intensive applications

typically involves parallelizing individual algorithms within an application

process, and decomposing the overall application process into separate tasks, which

can then be executed in parallel on an appropriate computing platform to achieve

overall higher performance than serial processing. In compute-intensive

applications, multiple operations are performed simultaneously, with each operation

addressing a particular part of the problem. This is often referred to as task

parallelism.

Data-intensive is used to describe applications that are I/O bound or with a need to

process large volumes of data [9]. Such applications devote most of their processing time to I/O and

movement and manipulation of data. Parallel processing of data-intensive

applications typically involves partitioning or subdividing the data into multiple

segments which can be processed independently using the same executable application

program in parallel on an appropriate computing platform, then reassembling the

results to produce the completed output data [10] The greater the aggregate distribution of the data, the

more benefit there is in parallel processing of the data. Data-intensive processing

requirements normally scale linearly according to the size of the data and are very

amenable to straightforward parallelization. The fundamental challenges for data-

intensive computing are managing and processing exponentially growing data volumes,

significantly reducing associated data analysis cycles to support practical, timely

applications, and developing new algorithms which can scale to search and process

massive amounts of data.

Data-Parallelism

Computer system architectures which can support data-parallel applications are a

potential solution to the terabyte and petabyte scale data processing requirements

of data-intensive computing [11].

Data-parallelism can be defined as a computation applied independently to each 

data item of a set of data which allows the degree of parallelism to be scaled with

the volume of data. The most important reason for developing data-parallel

applications is the potential for scalable performance, and may result in several

orders of magnitude performance improvement. The key issues with developing

applications using data-parallelism are the choice of the algorithm, the strategy

for data decomposition, load balancing on processing nodes, message passing

communications between nodes, and the overall accuracy of the results [12]. The development of a data-parallel application can involve

substantial programming complexity to define the problem in the context of available

programming tools, and to address limitations of the target architecture.

Information extraction from and indexing of Web documents is typical of data-

intensive computing which can derive significant performance benefits from [[data-

parallel]] implementations since Web and other types of document collections can

typically then be processed in parallel [13].

Characteristics of Data-Intensive Computing Systems

The National Science Foundation believes that data-intensive computing requires a

“fundamentally different set of principles” than current computing approaches [14] Through a funding program

within the Computer and Information Science and Engineering area, the NSF is seeking

to “increase understanding of the capabilities and limitations of data-intensive

computing.” The key areas of focus are:

  • Approaches to parallel programming to address the [[parallel processing]] of data on data-intensive systems
  • Programming abstractions including models, languages, and algorithms which allow a natural expression of parallel processing of data
  • Design of data-intensive computing platforms to provide high levels of reliability, efficiency, availability, and scalability.
  • Identifying applications that can exploit this computing paradigm and determining how it should evolve to support emerging data-intensive applications

Pacific Northwest National Labs has defined data-intensive computing as “capturing,

managing, analyzing, and understanding data at volumes and rates that push the

frontiers of current technologies.” [15]. [16]. They believe that to address the

rapidly growing data volumes and complexity requires “epochal advances in software,

hardware, and algorithm development” which can scale readily with size of the data

and provide effective and timely analysis and processing results.

Processing Approach

Current data-intensive computing platforms typically use a parallel computing

approach combining multiple processors and disks in large commodity [[computing

clusters]] connected using high-speed communications switches and networks which

allows the data to be partitioned among the available computing resources and

processed independently to achieve performance and scalability based on the amount

of data. A cluster can be defined as a type of parallel and distributed system,

which consists of a collection of inter-connected stand-alone computers working

together as a single integrated computing resource [17]. This approach to parallel processing is often

referred to as a “shared nothing” approach since each node consisting of processor,

local memory, and disk resources shares nothing with other nodes in the cluster. In

parallel computing this approach is considered suitable for data-intensive

computing and problems which are “embarrassingly parallel” , i.e. where it is

relatively easy to separate the problem into a number of parallel tasks and there is

no dependency or communication required between the tasks other than overall

management of the tasks. These types of data processing problems are inherently

adaptable to various forms of distributed computing including clusters and data

grids and cloud computing.

Common Characteristics

There are several important common characteristics of data-intensive computing

systems that distinguish them from other forms of computing:

(1) The principle of collocation of the data and programs or algorithms is used to

perform the computation. To achieve high performance in data-intensive computing,

it is important to minimize the movement of data [18]. This characteristic allows processing algorithms to execute on the nodes

where the data resides reducing system overhead and increasing performance [7].

Newer technologies such as InfiniBand allow data to be stored in a separate

repository and provide performance comparable to collocated data.

(2) The programming model utilized. Data-intensive computing systems utilize a

machine-independent approach in which applications are expressed in terms of high-

level operations on data, and the runtime system transparently controls the

scheduling, execution, load balancing, communications, and movement of programs and

data across the distributed computing cluster

[19].

The programming abstraction and language tools allow the processing to be expressed 

in terms of data flows and transformations incorporating new dataflow [[programming

languages]] and shared libraries of common data manipulation algorithms such as

sorting.

(3) A focus on reliability and availability. Large-scale systems with hundreds or

thousands of processing nodes are inherently more susceptible to hardware failures,

communications errors, and software bugs. Data-intensive computing systems are

designed to be fault resilient. This typically includes redundant copies of all

data files on disk, storage of intermediate processing results on disk, automatic

detection of node or processing failures, and selective re-computation of results.

(4) The inherent scalability of the underlying hardware and [[software

architecture]]. Data-intensive computing systems can typically be scaled in a

linear fashion to accommodate virtually any amount of data, or to meet time-critical

performance requirements by simply adding additional processing nodes. The number

of nodes and processing tasks assigned for a specific application can be variable or

fixed depending on the hardware, software, communications, and [[distributed file

system architecture]].

System Architectures

A variety of system architectures have been implemented for data-intensive

computing and large-scale data analysis applications including parallel and

distributed relational database management systems which have been available to

run on shared nothing clusters of processing nodes for more than two decades [20]. . However most data growth is with data in unstructured form and new processing

paradigms with more flexible data models were needed. Several solutions have emerged

including the MapReduce architecture pioneered by Google and now available in an

open-source implementation called Hadoop used by Yahoo, Facebook, and

others. [LexisNexis Risk Solutions]] also developed and implemented a scalable

platform for data-intensive computing which is used by LexisNexis.

=MapReduce

The MapReduce architecture and programming model pioneered by Google is an

example of a modern systems architecture designed for data-intensive computing

[21]. The

MapReduce architecture allows programmers to use a functional programming style to

create a map function that processes a key-value pair associated with the input

data to generate a set of intermediate key-value pairs, and a reduce function

that merges all intermediate values associated with the same intermediate key.

Since the system automatically takes care of details like partitioning the input

data, scheduling and executing tasks across a processing cluster, and managing the

communications between nodes, programmers with no experience in parallel

programming can easily use a large distributed processing environment.

The programming model for MapReduce architecture is a simple abstraction where

the computation takes a set of input key-value pairs associated with the input data

and produces a set of output key-value pairs. In the Map phase, the input data is

partitioned into input splits and assigned to Map tasks associated with processing

nodes in the cluster. The Map task typically executes on the same node containing

its assigned partition of data in the cluster. These Map tasks perform user-

specified computations on each input key-value pair from the partition of input data

assigned to the task, and generates a set of intermediate results for each key. The

shuffle and sort phase then takes the intermediate data generated by each Map task,

sorts this data with intermediate data from other nodes, divides this data into

regions to be processed by the reduce tasks, and distributes this data as needed to

nodes where the Reduce tasks will execute. The Reduce tasks perform additional

user-specified operations on the intermediate data possibly merging values

associated with a key to a smaller set of values to produce the output data. For

more complex data processing procedures, multiple MapReduce calls may be linked

together in sequence.

Hadoop

[Hadoop] is an open source software project sponsored by The [[Apache Software

Foundation]] which implements the MapReduce architecture. Hadoop now encompasses

multiple subprojects in addition to the base core, MapReduce, and HDFS distributed

filesystem. These additional subprojects provide enhanced application processing

capabilities to the base Hadoop implementation and currently include Avro, Pig,

HBase, ZooKeeper, Hive, and Chukwa. The Hadoop MapReduce architecture is

functionally similar to the Google implementation except that the base programming

language for Hadoop is Java instead of C++. The implementation is intended

to execute on clusters of commodity processors.

Hadoop implements a distributed data processing scheduling and execution environment

and framework for MapReduce jobs. Hadoop includes a distributed file system called

HDFS which is analogous to GFS in the Google MapReduce implementation. The

Hadoop execution environment supports additional distributed data processing

capabilities which are designed to run using the Hadoop MapReduce architecture.

These include HBase, a distributed column-oriented database which provides

random access read/write capabilities; Hive which is a data warehouse system

built on top of Hadoop that provides SQL-like query capabilities for data

summarization, ad-hoc queries, and analysis of large datasets; and Pig – a high-

level data-flow programming language and execution framework for data-intensive

computing.

Pig was developed at Yahoo! to provide a specific language notation for data

analysis applications and to improve programmer productivity and reduce development

cycles when using the Hadoop MapReduce environment. Pig programs are automatically

translated into sequences of MapReduce programs if needed in the execution

environment. Pig provides capabilities in the language for loading, storing,

filtering, grouping, de-duplication, ordering, sorting, aggregation, and joining

operations on the data [22].

HPCC

LexisNexis Risk Solutions, independently developed and implemented a solution

for data-intensive computing called the HPCC (High-Performance Computing

Cluster). The development of this computing platform began in 1999 and applications

were in production by late 2000. The LexisNexis approach also utilizes

commodity clusters of hardware running the Linux operating system as shown in

Figure 1. Custom system software and middleware components were developed and

layered on the base Linux operating system to provide the execution environment and

distributed filesystem support required for data-intensive computing. LexisNexis

also implemented a new high-level language for data-intensive computing called ECL.

The ECL programming language is the primary distinguishing factor between HPCC

and other data-intensive computing solutions. It is a high-level, declarative,

data-centric, implicitly parallel language that allows the programmer to define

what the data processing result should be and the dataflows and transformations that

are necessary to achieve the result. The ECL language includes extensive

capabilities for data definition, filtering, data management, and data

transformation, and provides an extensive set of built-in functions to operate on

records in datasets which can include user-defined transformation functions.

ECL programs are compiled into optimized C++ source code, which is

subsequently compiled into executable code and distributed to the nodes of a

processing cluster.

To address both batch and online aspects data-intensive computing applications,

HPCC includes two distinct cluster environments, each of which can be optimized

independently for its parallel data processing purpose. The Thor platform is a

cluster whose purpose is to be a data refinery for processing of massive volumes of

raw data for applications such as data cleansing and hygiene, ETL (extract,

transform load), record linking and entity resolution, large-scale ad-hoc analysis

of data, and creation of keyed data and indexes to support high-performance

structured queries and data warehouse applications. A Thor system is similar in its

hardware configuration, function, execution environment, filesystem, and

capabilities to the Hadoop MapReduce platform, but provides higher performance in

equivalent configurations. The Roxie platform provides an online high-performance

structured query and analysis system or data warehouse delivering the parallel data

access processing requirements of online applications through Web services

interfaces supporting thousands of simultaneous queries and users with sub-second

response times. A Roxie system is similar in its function and capabilities to

Hadoop with HBase and Hive capabilities added, but provides an optimized

execution environment and filesystem for high-performance online processing. Both

Thor and Roxie systems utilize the same ECL programming language for implementing

applications, increasing programmer productivity.

See Also

  • [[List of important publications in concurrent, parallel, and distributed

computing]]

References

  1. ^ [http://www.cse.fau.edu/~borko/HandbookofCloudComputing.html/ Handbook of Cloud Computing], "Data-Intensive Technologies for Cloud Computing," by A.M. Middleton. Handbook of Cloud Computing. Springer, 2010.
  2. ^ An Information Avalanche, by Vinton Cerf, IEEE Computer, Vol. 40, No. 1, 2007, pp. 104-105.
  3. ^ [http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf The Expanding Digital Universe], by J.F. Gantz, D. Reinsel, C. Chute, W. Schlichting, J. McArthur, S. Minton, J. Xheneti, A. Toncheva, and A. Manfrediz, IDC, White Paper, 2007.
  4. ^ [http://www2.sims.berkeley.edu/research/projects/how-much-info- 2003/ How Much Information? 2003], by P. Lyman, and H.R. Varian, University of California at Berkeley, Research Report, 2003.
  5. ^ [http://www.sdsc.edu/about/director/pubs/communications200812-DataDeluge.pdf Got Data? A Guide to Data Preservation in the Information Age], by F. Berman, Communications of the ACM, Vol. 51, No. 12, 2008, pp. 50-56.
  6. ^ [http://portal.acm.org/citation.cfm?id=280278 Models and languages for parallel computation], by D.B. Skillicorn, and D. Talia, ACM Computing Surveys, Vol. 30, No. 2, 1998, pp. 123-169.
  7. ^ [http://www.pnl.gov/science/images/highlights/computing/dic_special.pdfData- Intensive Computing in the 21st Century], by I. Gorton, P. Greenfield, A. Szalay, and R. Williams, IEEE Computer, Vol. 41, No. 4, 2008, pp. 30-32.
  8. ^ [http://www.computer.org/portal/web/csdl/doi/10.1109/MC.2008.122 High-Speed, Wide Area, Data Intensive Computing: A Ten Year Retrospective], by W.E. Johnston, IEEE Computer Society, 1998.
  9. ^ [https://computation.llnl.gov/casc/dcca- pub/dcca/Papers_files/data-intensive-ieee-computer-0408.pdf IEEE: Hardware Technologies for High-Performance Data-Intensive Computing], by M. Gokhale, J. Cohen, A. Yoo, and W.M. Miller, IEEE Computer, Vol. 41, No. 4, 2008, pp. 60- 68.
  10. ^ [http://www.agoldberg.org/Publications/DesignMethForDP.pdf IEEE: A Design Methodology for Data-Parallel Applications], by L.S. Nyland, J.F. Prins, A. Goldberg, and P.H. Mills, IEEE Transactions on Software Engineering, Vol. 26, No. 4, 2000, pp. 293-314.
  11. ^ [http://www.patrickpantel.com/download/papers/2004/kdd-msw04-1.pdf The terascale challenge] by D. Ravichandran, P. Pantel, and E. Hovy. "The terascale challenge," Proceedings of the KDD Workshop on Mining for and from the Semantic Web, 2004
  12. ^ [http://www.cs.rochester.edu/u/umit/papers/ppopp01.ps Dynamic adaptation to available resources for parallel computing in an autonomous network of workstations] by U. Rencuzogullari, and S. Dwarkadas. "Dynamic adaptation to available resources for parallel computing in an autonomous network of workstations," Proceedings of the Eighth ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming, 2001
  13. ^ [http://www.mathcs.emory.edu/~eugene/publications.html Information Extraction to Large Document Collections] by E. Agichtein, "Scaling Information Extraction to Large Document Collections," Microsoft Research, 2004
  14. ^ [http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503324&org=IIS Data-Intensive Computing] by NSF. "Data-Intensive Computing," 2009.
  15. ^ [http://www.cs.cmu.edu/~bryant/presentations/DISC-concept.ppt Data Intensive Computing] by PNNL. "Data Intensive Computing," 2008
  16. ^ [http://www.computer.org/portal/web/csdl/doi/10.1109/MC.2009.26 The Changing Paradigm of Data-Intensive Computing] by R.T. Kouzes, G.A. Anderson, S.T. Elbert, I. Gorton, and D.K. Gracio, "The Changing Paradigm of Data-Intensive Computing," Computer, Vol. 42, No. 1, 2009, pp. 26-3
  17. ^ [http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V06-4V47C7R- 1&_user=10&_coverDate=06%2F30%2F2009&_rdoc=1&_fmt=high&_orig=gateway&_origin=gateway &_sort=d&_docanchor=&view=c&_rerunOrigin=google&_acct=C000050221&_version=1&_urlVers ion=0&_userid=10&md5=824e4c2635a53c6fe068f3f2d11df096&searchtype=a Cloud computing and emerging IT platforms] by R. Buyya, C.S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, "Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility," Future Generation Computer Systems, Vol. 25, No. 6, 2009, pp. 599-616
  18. ^ Distributed Computing Economics by J. Gray, "Distributed Computing Economics," ACM Queue, Vol. 6, No. 3, 2008, pp. 63- 68.
  19. ^ http://www.cs.cmu.edu/~bryant/presentations/DISC-concept.ppt Data Intensive Scalable Computing] by R.E. Bryant. "Data Intensive Scalable Computing," 2008
  20. ^ [http://www.cse.nd.edu/~dthain/courses/cse40771/spring2010/benchmarks-sigmod09.pdf A Comparison of Approaches to Large-Scale Data Analysis] by A. Pavlo, E. Paulson, A. Rasin, D.J. Abadi, D.J. Dewitt, S. Madden, and M. Stonebraker. Proceedings of the 35th SIGMOD International conference on Management of Data, 2009.
  21. ^ [http://labs.google.com/papers/mapreduce-osdi04.pdf MapReduce: Simplified Data Processing on Large Clusters] by J. Dean, and S. Ghemawat. Proceedings of the Sixth Symposium on Operating System Design and Implementation (OSDI), 2004.
  22. ^ [http://i.stanford.edu/~usriv/talks/sigmod08-pig- latin.ppt#283,18,User-Code as a First-Class Citizen Pig Latin: A Not-So-Foreign Language for Data Processing] by C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. (Presentation at SIGMOD 2008)," 2008.