Cuneiform (programming language)

Cuneiform
Paradigm	functional, scientific workflow
Designed by	Jörgen Brandt, Marc Bux, and Ulf Leser
Developer	Humboldt University of Berlin
First appeared	2013
Stable release	2.2.0 / April 13, 2016
Implementation language	Erlang
OS	Linux, Mac OS
License	Apache License 2.0
Filename extensions	.cf
Website	cuneiform-lang.org
Influenced by
	Taverna, Galaxy, Lisp, MATLAB

Cuneiform is an open-source workflow language for large-scale scientific data analysis.^[1]^[2] It is a workflow DSL in the form of a functional programming language that promotes parallelizable algorithmic skeletons. External tools and libraries written in, for example, R or Python can be integrated through a foreign function interface. Cuneiform's data-driven evaluation model and its integration of external software originate in scientific workflow languages such as Taverna, KNIME, and Galaxy, while its algorithmic skeletons (second-order functions) for parallel execution originate in data-parallel programming models such as MapReduce and Pig Latin. Cuneiform scripts can be executed on top of Hadoop.^[3]^[4]^[5]^[6]

External Software Integration

External tools and libraries are integrated into a Cuneiform script through its foreign function interface. Defining a task in a foreign language makes the API of an external tool or library available to the workflow. Tools can thus be integrated directly, without writing a wrapper or reimplementing the tool.

Supported foreign programming languages include Bash, Python, and R.

Parallel Execution

The task applications in a Cuneiform script form a data dependency graph. This dependency graph constrains the order in which tasks can be evaluated. Apart from these data dependencies, tasks can be evaluated in any order, assuming that tasks are always side-effect-free and deterministic. That is, tasks that do not depend on each other's data can be evaluated in parallel. In addition, Cuneiform promotes algorithmic skeletons, many of which allow parallel evaluation.
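The effect of the dependency graph on scheduling can be sketched in Python. This is only an illustration, not how the Cuneiform interpreter is implemented: the task names and the `parallel_batches` helper are hypothetical, and the standard-library `graphlib` module (Python 3.9 or later) stands in for the scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks whose output it consumes.
dependencies = {
    "index": set(),
    "align_a": {"index"},
    "align_b": {"index"},
    "merge": {"align_a", "align_b"},
}

def parallel_batches(deps):
    """Group tasks into batches; tasks in one batch do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose inputs are available
        batches.append(ready)
        sorter.done(*ready)                 # mark them finished, unlocking successors
    return batches

batches = parallel_batches(dependencies)
print(batches)  # [['index'], ['align_a', 'align_b'], ['merge']]
```

The two alignment tasks land in the same batch because neither consumes the other's output, so a scheduler would be free to run them at the same time.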

Map: Applies a task to each element in a list. Each task application can run in parallel.
Cross product: Takes the Cartesian product of several lists and applies a task to each combination. Each task application can run in parallel.
Dot product: Given a pair of lists of equal sizes, each element of the first list is combined with its corresponding element in the second list. A task is applied to each combination. Each task application can run in parallel.
Aggregate: Applies a task to the list as a whole without decomposing it. Since the task is applied only once for the whole list, this skeleton leaves the parallelism potential unchanged.
Conditional: Evaluates a program branch, depending on a condition computed at runtime. This skeleton leaves the parallelism potential unchanged.
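The five skeletons above can be approximated in Python as follows. This is a minimal sketch of their semantics, not Cuneiform's implementation: all function names are hypothetical, and a thread pool merely stands in for whatever parallel backend executes the independent task applications.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def run_parallel(task, argument_tuples):
    """Apply task to every argument tuple; the applications are independent."""
    with ThreadPoolExecutor() as pool:
        # pool.map returns results in input order, whatever the completion order
        return list(pool.map(lambda args: task(*args), argument_tuples))

def map_skeleton(task, xs):
    """Map: one task application per list element."""
    return run_parallel(task, [(x,) for x in xs])

def cross_product(task, *lists):
    """Cross product: one task application per combination of elements."""
    return run_parallel(task, list(product(*lists)))

def dot_product(task, xs, ys):
    """Dot product: pair up corresponding elements of two equally sized lists."""
    if len(xs) != len(ys):
        raise ValueError("dot product requires lists of equal size")
    return run_parallel(task, list(zip(xs, ys)))

def aggregate(task, xs):
    """Aggregate: a single task application that consumes the whole list."""
    return task(xs)

def conditional(condition, then_branch, else_branch):
    """Conditional: evaluate only the branch selected at runtime."""
    return then_branch() if condition else else_branch()

concat = lambda x, y: x + y
print(cross_product(concat, ["a", "b"], ["1", "2"]))  # ['a1', 'a2', 'b1', 'b2']
print(dot_product(concat, ["a", "b"], ["1", "2"]))    # ['a1', 'b2']
```

Map, cross product, and dot product each produce one independent task application per element or combination, which is where the parallelism comes from; aggregate and conditional produce a single application and therefore add none.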

By partitioning input data and processing the partitions with parallelizable skeletons, the interpreter can exploit data parallelism even if the integrated tools are single-threaded. Workflows can also be executed in distributed compute environments.
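The partition-then-map pattern can be illustrated with a short Python sketch. The `count_gc` function is a hypothetical stand-in for a single-threaded external tool, and the helper names and chunk size are assumptions made for the example. Threads are used only to show the structure; a workflow engine would typically run each tool invocation as a separate process.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(items, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def count_gc(sequence_chunk):
    """Stand-in for a single-threaded tool: count G and C bases in one chunk."""
    return sum(seq.count("G") + seq.count("C") for seq in sequence_chunk)

def parallel_gc_count(sequences, chunk_size):
    chunks = partition(sequences, chunk_size)
    with ThreadPoolExecutor() as pool:
        partial_counts = list(pool.map(count_gc, chunks))  # map over partitions
    return sum(partial_counts)                             # aggregate the partial results

sequences = ["GATTACA", "CCGG", "ATAT", "GC"]
print(parallel_gc_count(sequences, chunk_size=2))  # 8
```

The tool itself never needs to know about parallelism: each invocation sees one partition, and the combination step runs once over all partial results.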

Examples

A hello-world script:

deftask greet( out : person )in python *{
  out = "Hello "+person
}*

greet( person: "Peter" "Robert" );

This script defines a task greet in Python, which prepends the string "Hello " to its argument person. The task has one output variable, out. Applying greet with the argument person bound to the two-element list "Peter" "Robert" implicitly maps the task over each element of the input list. The workflow result is the two-element list "Hello Peter" "Hello Robert".
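For comparison, the implicit map performed by the script corresponds roughly to the following explicit loop in plain Python. This is a sketch of the semantics only; the function and variable names mirror the script but are otherwise illustrative.

```python
def greet(person):
    # Same body as the foreign task: prepend "Hello " to the argument
    return "Hello " + person

people = ["Peter", "Robert"]

# Cuneiform applies the task once per list element; in plain Python the map is explicit
result = [greet(person) for person in people]
print(result)  # ['Hello Peter', 'Hello Robert']
```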

Command line tools can be integrated by defining a task in Bash:

deftask samtools-view( bam( File ) : sam( File ) )in bash *{
  samtools view -bS $sam > $bam
}*

In this example a task samtools-view is defined. It calls the tool SAMtools, consuming an input file in SAM format and producing an output file in BAM format. If this task is applied, binding the argument sam to a list of SAM files, the task is mapped to each element of that list.
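The mapping over a list of SAM files can be pictured as one command line per input file. The Python sketch below only builds those command lines and does not run SAMtools. The helper name, the example file names, and the rule of deriving the output name by swapping the extension are assumptions made for illustration; in Cuneiform the output file is bound by the interpreter.

```python
import shlex
from pathlib import PurePosixPath

def samtools_view_command(sam_path):
    """Build the shell command from the task body for one SAM file."""
    # Assumption: the BAM file is named after the SAM file with a new extension
    bam_path = str(PurePosixPath(sam_path).with_suffix(".bam"))
    return f"samtools view -bS {shlex.quote(sam_path)} > {shlex.quote(bam_path)}"

sam_files = ["sample1.sam", "sample2.sam"]

# One independent command per list element, mirroring the implicit map
commands = [samtools_view_command(path) for path in sam_files]
for command in commands:
    print(command)
```

Since the commands share no files, they could be dispatched to separate workers without coordination.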

References

[1] https://github.com/joergen7/cuneiform
[2] Brandt, Jörgen; Bux, Marc N.; Leser, Ulf (2015). "Cuneiform: A functional language for large scale scientific data analysis" (PDF). Proceedings of the Workshops of the EDBT/ICDT. 1330: 17–26.
[3] https://github.com/marcbux/Hi-WAY
[4] http://www.saasfee.io
[5] Bux, Marc; Brandt, Jörgen; Lipka, Carsten; Hakimzadeh, Kamal; Dowling, Jim; Leser, Ulf (2015). "SAASFEE: scalable scientific workflow execution engine" (PDF). Proceedings of the VLDB Endowment. 8 (12): 1892–1895.
[6] Bessani, Alysson; Brandt, Jörgen; Bux, Marc; Cogo, Vinicius; Dimitrova, Lora; Dowling, Jim; Gholami, Ali; Hakimzadeh, Kamal; Hummel, Michael; Ismail, Mahmoud; Laure, Erwin; Leser, Ulf; Litton, Jan-Eric; Martinez, Roxanna; Niazi, Salman; Reichel, Jane; Zimmermann, Karin (2015). "Biobankcloud: a platform for the secure storage, sharing, and processing of large biomedical data sets" (PDF). The First International Workshop on Data Management and Analytics for Medicine and Healthcare (DMAH 2015).