User:Bmpvieira/State of the Art in Reproducibility and Open Science

From Wikipedia, the free encyclopedia

State of the Art in Reproducibility and Open Science

A common complaint is that there have been too many approaches and not enough focus; what does this landscape look like now? Are there projects or ideas already in existence that would be most valuable for thought leaders (i.e., us) to put our efforts toward? How successful have these individual efforts been?

This product will serve as a resource for those working on reproducible research (RR) and open access (OA). The idea is to create a single access point to as much of the philosophy and implementation of RR and OA as possible, for people to use as a reference.

The deliverable would be either a wiki summarizing the results, a review paper, or both.

TODOs:
    Need more on metadata, provenance
    Some views from people who are anti-open access
    Empirical research on OA/RR
    Build/organize wiki
    moar everything from people who know stuff who aren't me

Longer Term TODOs: ++summarization of overarching philosophies

Contributors: (add yourself here)

Camille Scott (@camille_codon)
Titus Brown (@ctitusbrown)
Adrian Alexa
Naomi Attar
Michael Markie (@mmmarksman)
Bruno Vieira (@bmpvieira)
Sam Nicholls (@samstudio8)

What is reproducibility?

Blog post (C. Titus Brown): http://ivory.idyll.org/blog/2014-our-paper-process.html
    Basic reproducible paper HOWTO
10 Simple Rules for RR: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003285
Open Access: A researcher's perspective (Karl Broman): https://www.biostat.wisc.edu/~kbroman/presentations/openaccess_withnotes.pdf
    "OA is about money"
Replicability is not Reproducibility: http://cogprints.org/7691/7/icmlws09.pdf
Lior Pachter: 'Reproducibility vs. usability': http://liorpachter.wordpress.com/2014/03/18/reproducibility-vs-usability/
David Stern's 'prescription' for reproducibility in science: http://blogs.biomedcentral.com/bmcblog/2014/06/26/can-you-show-us-that-again-please/
Protocol-based approach: https://khmer-protocols.readthedocs.org/en/latest/mrnaseq/
    Blog post: http://ivory.idyll.org/blog/announcing-khmer-protocols.html
    Trinity RNAseq protocols (Broad Inst.): http://www.nature.com/nprot/journal/v8/n8/full/nprot.2013.084.html
    Idea: provide general protocols (software to use, suggested parameters, expected outputs) rather than pipelines
    Less prone to bitrot, compatibility issues, and platform issues; different programs can be substituted when one becomes obsolete
Victoria Stodden: Computational Sci. Best Practices: http://scholar.google.com/citations?view_op=view_citation&hl=en&user=LWw60SgAAAAJ&sortby=pubdate&citation_for_view=LWw60SgAAAAJ:dfsIfKJdRG4C
Measuring RR in Computer Systems Research: http://reproducibility.cs.arizona.edu/tr.pdf
    Summary: most papers not reproducible
Open science ecosystem (one product of TGAC AllBio RR workshop): http://s28.postimg.org/c3amcfikd/Ecosystem.png
Shining Light Into Black Boxes:
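The protocol idea listed above (software to use, suggested parameters, expected outputs, with programs swapped out when they become obsolete) could be sketched roughly as follows. This is only an illustration, not how khmer-protocols or the Trinity protocols are actually written; the `Step` class, the `substitute_tool` helper, and every tool name and parameter are hypothetical placeholders.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Step:
    """One protocol step: what it is for, which tool to run, and what to expect."""
    purpose: str
    tool: str
    suggested_params: dict = field(default_factory=dict)
    expected_outputs: tuple = ()


def substitute_tool(protocol, purpose, new_tool, new_params=None):
    """Return a new protocol in which the step serving `purpose` uses another tool.

    The expected outputs are kept unchanged, so later steps stay valid.
    """
    updated = []
    for step in protocol:
        if step.purpose == purpose:
            step = replace(step, tool=new_tool,
                           suggested_params=dict(new_params or {}))
        updated.append(step)
    return updated


# Hypothetical two-step protocol; names and values are placeholders only.
protocol = [
    Step("quality trimming", "trimmer-a", {"min_quality": 20}, ("trimmed_reads.fq",)),
    Step("assembly", "assembler-a", {"kmer_size": 25}, ("transcripts.fa",)),
]

# Swap the (imagined) obsolete assembler without touching the rest.
revised = substitute_tool(protocol, "assembly", "assembler-b", {"kmer_size": 31})

for step in revised:
    print(f"{step.purpose}: run {step.tool} with {step.suggested_params}, "
          f"expect {step.expected_outputs}")
```

Because each step is described by its purpose and expected outputs rather than by a hard-wired command, replacing a tool leaves the original protocol and the other steps untouched.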

Training

Publishing on the web: http://software-carpentry.org/blog/2014/01/publishing-on-the-web.html
    How does one approach publishing on the web?
Software Carpentry: http://software-carpentry.org/bootcamps/index.html
    Introducing scientists to basic software engineering principles
NGS wikibook: http://en.m.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)
Open methods biostars: https://www.biostars.org/
    Q&A forum for bioinformatics methods
Data carpentry: http://software-carpentry.org/blog/2014/05/our-first-data-carpentry-workshop.html
    SWC, for data

Advocacy

Open Knowledge Foundation: https://okfn.org/about/
SSI: http://software.ac.uk/
TGAC AllBio RR workshop: http://www.tgac.ac.uk/allbio-open-science-reproducibility-best-practice-workshop/
Us!
Funder Agenda: Responsible Research & Innovation, and EC's Digital Agenda (how OS is central to those): http://ec.europa.eu/research/science-society/document_library/pdf_06/responsible-research-and-innovation-leaflet_en.pdf
Funder Agenda: Knowledge-based economic growth (& how Open Science supports)
    Houghton, J., Swan, A., Brown, S., 2011. Access to research and technical information in Denmark [WWW Document]. URL http://www.deff.dk/uploads/media/Access_to_Research_and_Technical_Information_in_Denmark.pdf
Funder Agenda: EC Mandate on Access to Knowledge for any public funds beneficiary (specific to Open Data and Open Access, but opens the door to implementing across the research lifecycle)
    EC Digital Agenda & Access to Knowledge: http://ec.europa.eu/digital-agenda/en/open-access-scientific-knowledge-0
BioMed Central blog open data posts: http://blogs.biomedcentral.com/bmcblog/tag/open-data/ (new dedicated blog site coming later this year)
In general, Victoria Stodden has an enormous body of work: http://scholar.google.com/citations?hl=en&user=LWw60SgAAAAJ&view_op=list_works&sortby=pubdate
Also Cameron Neylon: http://cameronneylon.net/
And Ian Gent's Recomputation Manifesto: http://www.recomputation.org
    Example of a recomputable paper generated as part of the EMCSR14 workshop: http://arxiv.org/abs/1408.2123

Full Ecosystems

Bioconductor: http://www.bioconductor.org/about/
    Collection of libraries and standards for bioinformatics work in R
Bionode: http://bionode.io
Biogems: http://biogems.info
BioJS: http://biojs.net
BioPython: http://biopython.org/wiki/Main_Page
Bio-Linux: http://environmentalomics.org/bio-linux/
    Biology-targeted Linux distro
Docker: https://www.docker.com/whatisdocker/
    Software executable environment and delivery: software versioning, provenance, packaging, IO metadata
Galaxy: http://galaxyproject.org/
    Fully integrated pipelining, data hosting, compute resources on many different HPC platforms (includes Cloud platform)
Cloud based ecosystems
    Illumina BaseSpace: https://basespace.illumina.com
    DNANexus: https://dnanexus.com
    Synapse: https://www.synapse.org
    GeneStack: http://genestack.org
batlab: https://www.batlab.org/
    Cross-platform automated software testing
Cytoscape: http://www.cytoscape.org/

Notable Projects

samtools/htslib: http://www.htslib.org/
GATK: https://www.broadinstitute.org/gatk/ (Mixed "closed-open source" model)
    It could be interesting to find out why it is not fully open source.

Data Hosting

"geometry of needs and challenges in publishing data": https://twitter.com/billdoesphysics/status/488447056759894016
Question of access to online databases (who can view, who can update), FTP resources, etc.

Distributed systems
    Dat: http://dat-data.com
    Bittorrent for academics: http://academictorrents.com/about.php
    Mygene.info: http://mygene.info/
        Query service that aggregates NCBI, EMBL, etc. and provides API + libraries for many languages

Centralized
    figshare: http://figshare.com
        Centralized OA for data and manuscripts (w/ or w/o peer review)
    Dryad: http://datadryad.org
    Amazon EC2: http://aws.amazon.com/ec2/
        S3?
        Cloud hosting for compute or data access
        It is better (and cheaper) to store your large data files in S3 than EC2
    Dataverse: http://thedata.org/
    GigaDB: http://gigadb.org/
    Your lab's computer
    XSEDE: https://www.xsede.org/
        HPC resources for scientists (apply for compute time)
    Services like DataONE: https://www.dataone.org/
    Dropbox-like online filestores (GDrive, etc.)

Provenance

Data and software; metadata
Ethan White: Nine simple ways to make it easier to (re)use your data: http://library.queensu.ca/ojs/index.php/IEE/article/view/4608
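As one possible illustration of recording provenance for data and software together, the sketch below writes a small metadata record for a data file: a checksum, the command said to have produced it, and the software environment. It is a minimal example under assumptions, not a standard or any listed project's format; the field names, the `provenance_record` helper, and the `example-analysis` command are all made up for illustration.

```python
import hashlib
import json
import platform
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(data_path, command):
    """Describe a data file: its identity, how it was made, and the environment."""
    path = Path(data_path)
    return {
        "file": path.name,
        "sha256": sha256_of(path),
        "size_bytes": path.stat().st_size,
        "command": command,  # supplied by the caller, not verified
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Demo on a throwaway file; the command string is a hypothetical placeholder.
with tempfile.TemporaryDirectory() as workdir:
    data_file = Path(workdir) / "counts.csv"
    data_file.write_bytes(b"gene,count\nabc1,42\n")
    record = provenance_record(data_file, "example-analysis --input counts.csv")

print(json.dumps(record, indent=2))
```

Storing such a record next to the data lets a later reader check that the file is unchanged (via the checksum) and see roughly which environment produced it.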

Code Hosting

VC: git
    http://github.com (see https://guides.github.com/activities/citable-code/)
    http://zenodo.org
++git: http://bitbucket.org
Code and executables: http://sourceforge.net
++git: https://gitlab.com
https://gitorious.org/
Bioconductor (see "Full Ecosystems")
Google Code
SciForge: http://www.sciforge-project.org/

Open Journals and Reviewing

The reviewers oath: http://biomickwatson.wordpress.com/2013/02/11/the-reviewers-oath/
    Oath/manifesto for ethical peer review
Peer review on top of arxiv.org (open source project from GitHub team): http://theoj.org
arXiv: http://arxiv.org/
    Blog post: Submit to arxiv http://phylogenomics.blogspot.co.uk/2012/03/calling-all-computational-biologists-do.html
OpenReview: http://openreview.net/about
    Network + advocacy for open peer review
F1000: http://f1000research.com/
    Publish all the things: publish first, review later
Biology Direct, PeerJ, BMC Series medical journals, eLife
GigaScience: http://www.gigasciencejournal.com/
    Data, research, and software publishing, all open access
Insight Journal: http://www.insight-journal.org/
    Luis Ibanez: https://opensource.com/users/luis-ibanez
PLOS ONE: http://www.plosone.org/
    Victoria Stodden: empirical analysis of journal data and code policy: http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0067111#pone-0067111-g003
bioRxiv: http://biorxiv.org/
Journal of Open Research Software (SSI): http://openresearchsoftware.metajnl.com/
    Review from a user: https://thewinnower.com/papers/an-author-based-review-of-the-journal-of-open-research-software

Representing Ideas

IPython notebook: http://ipython.org/notebook.html
    Browser-based Python with matplotlib integration, markdown
Literate programming in R: http://yihui.name/knitr/
    Unified code, writing, maths
ShareLaTeX: https://www.sharelatex.com/
    "Google docs for latex"
writeLaTeX: https://www.writelatex.com
    Another collaborative LaTeX service

Software Philosophies

VMs are Bad for RR (C. T. Brown): http://ivory.idyll.org/blog/vms-considered-harmful.html
    VMs don't allow remixing
Greg Wilson, "VMs are PDF's for software": https://twitter.com/gvwilson/status/508402669825060864
Konrad Hinsen: more on VMs being bad: http://khinsen.wordpress.com/2013/08/14/platforms-for-reproducible-research/
Bill Howe: "...the point is that publishing a VM is trivial, while making your code portable and reusable for others is not. I think if everyone published a VM associated with their paper, which incurs essentially zero extra effort, we'd be in a better state than we are today. You're holding out for the utopia of everyone becoming linux hackers."
Software Sustainability Institute: http://www.software.ac.uk/policy/manifesto
Kitware (open source, training, etc.): http://www.kitware.com/company/about.html
Small tools manifesto for bioinformatics: https://github.com/pjotrp/bioinformatics
Requirements for Bionode modules: https://github.com/bionode/bionode-template#bionode-template
Docker
    https://medium.com/@gawbul/devops-and-reproducible-science-628ffc839de3
    http://melissagymrek.com/science/2014/08/29/docker-reproducible-research.html
    http://www.bioinformaticszen.com/post/reproducible-assembler-benchmarks/
Dynamic figures
    http://juretriglav.si/how-scientific-figures-should-work-in-2014/
Others
    https://medium.com/@bmpvieira/my-views-about-science-35045625176f
    http://juretriglav.si/thoughts-on-reproducibility-of-open-scientific-software/
    http://juretriglav.si/discovery-of-scientific-software/
Stodden: runmycode.org: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6404455&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6404455
    Dissemination of platform for executing published code, generating results
Modularity: https://github.com/bionode/bionode/issues/9#issuecomment-50770720