Talk:Flow cytometry bioinformatics



A fact from Flow cytometry bioinformatics appeared on Wikipedia's Main Page in the Did you know column on 24 December 2013 (check views). The text of the entry was as follows:

Did you know... that flow cytometry bioinformaticians use methods from computational statistics and machine learning to analyse single cell data gathered by flow cytometry for cancer and HIV/AIDS research?

A record of the entry may be seen at Wikipedia:Recent additions/2013/December. The nomination discussion and review may be seen at Template:Did you know nominations/Flow cytometry bioinformatics.


Reviews

This article was created on the PLoS Computational Biology Wiki as a review paper to be co-published on Wikipedia. As such it went through a formal academic peer review process, which was open. The comments from the reviewers, and their responses, are noted below. These reviews, and the revision history of the article prior to being transferred to Wikipedia, are archived at that page.

Reviewer 1: Holden Maecker

I find this to be a good and wide-ranging summary of topics associated with flow cytometry analysis and bioinformatics. It spans the territory from basic flow cytometry concepts and gating, to newer bioinformatics approaches like SPADE and PCA, and routines for data processing such as those in Bioconductor. Few people's expertise spans all of these areas, but this page provides a good synthesis for folks who work in one or more of these areas, and want to learn more.

I would suggest expanding the section on Gating, to make some basic but missing or merely implied points, e.g.:

- Gating is hierarchical, usually focusing in on specific subsets by sequential selection of populations, usually in two dimensions at a time (e.g., Lymphocytes->T cells->CD4+ T cells->naive CD4+ T cells).
- This approach suffers from the inability to visualize all other relevant dimensions when gating on only two dimensions at a time; it may even make it difficult to distinguish closely spaced populations that could be better separated in >2-dimensional space. And it suffers from "tunnel vision", in that an overview of the entire dataset is virtually impossible.
- Boolean gates can be created (to some extent, automatically in software such as FlowJo) that divide a population of cells into all logical combinations of markers. This is a complementary approach to automated gating algorithms that find "where the cell clusters are"; in a Boolean approach, one asks "what are all the possible cell phenotypes" and then monitors those compartments to see which ones are populated and to what extent. It is, however, a deterministic approach, assuming that cells are either positive or negative for a given marker, and the user decides the positive/negative boundary. The number of compartments can also become staggering with increasing dimensions. Clustering algorithms are, by contrast, unsupervised, in that they do not require any user input about what is positive or negative; they simply find regions of cell density, inflection points, etc.

-Holden Maecker
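As an illustration of the Boolean gating point above, here is a minimal sketch in Python. It assumes each marker has a single user-chosen positive/negative threshold; the marker names, intensities and thresholds are hypothetical and are not taken from the article or from FlowJo.

```python
from itertools import product


def boolean_gates(markers):
    """Return the label of every positive/negative combination of the markers."""
    return [
        "".join(f"{m}{'+' if pos else '-'}" for m, pos in zip(markers, signs))
        for signs in product([True, False], repeat=len(markers))
    ]


def count_compartments(events, thresholds):
    """Count how many events fall into each Boolean compartment.

    events: list of dicts mapping marker name -> measured intensity
    thresholds: dict mapping marker name -> user-chosen positive/negative boundary
    """
    markers = sorted(thresholds)
    counts = {gate: 0 for gate in boolean_gates(markers)}
    for event in events:
        label = "".join(
            f"{m}{'+' if event[m] >= thresholds[m] else '-'}" for m in markers
        )
        counts[label] += 1
    return counts


# Hypothetical events and thresholds, for illustration only.
events = [
    {"CD4": 900.0, "CD8": 50.0},
    {"CD4": 30.0, "CD8": 800.0},
    {"CD4": 20.0, "CD8": 10.0},
    {"CD4": 950.0, "CD8": 40.0},
]
thresholds = {"CD4": 100.0, "CD8": 100.0}
counts = count_compartments(events, thresholds)

for gate, n in counts.items():
    print(gate, n)
```

With n markers there are 2**n compartments, which is the "staggering" growth mentioned in the review; empty compartments are kept in the result so that unpopulated phenotypes remain visible.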

These are some excellent suggestions. As there is some overlap between these comments on gating and the comments of reviewer 3, we have addressed both reviewers' comments in our response there. -Kierano (talk) 11:48, 27 June 2013 (PDT)

Reviewer 2: Nolwenn Le Meur

This topic page gives a good review of the field of flow cytometry bioinformatics. It covers the fundamentals of data handling and analysis for flow cytometry. It also highlights new approaches and ongoing developments, notably for cell population identification where room for improvements remains.

My main comment is on the lead paragraph. The sentence “Flow cytometry bioinformatics is the application of bioinformatics, computational statistics and machine learning to analyze flow cytometry data” is confusing. As mentioned in the Wikipedia page for Bioinformatics, this interdisciplinary field uses many areas of computer science, mathematics and engineering and therefore already includes the concept of data analysis, notably with machine learning techniques. I would rather say: “Flow cytometry bioinformatics is the application of bioinformatics to flow cytometry, which involves storing, retrieving, organizing and analyzing flow cytometry data using extensive computational resources and tools.” Maybe it could be added that flow cytometry bioinformatics requires and contributes to the development of computational statistics and machine learning methods. In addition, the introduction could be developed with examples of application fields. Indeed, flow cytometry is used in a wide range of domains, from medicine and environment for human health to the analysis of the microbiome in seawater (e.g. Wang, Y et al. (2010). Past, present and future applications of flow cytometry in aquatic microbiology. Trends in Biotechnology, 28(8), 416–424. doi:10.1016/j.tibtech.2010.04.006.)

A minor comment is on the description of the different steps in computational flow cytometry analysis. This description is well done, although the concept of workflow could be emphasized. Some software allows storing analysis workflows, which are notably useful for qualitative and reproducible research. For instance, for gating, which is a hierarchical process, it is especially important to keep track of the process used for population selection. It is also essential, when flow cytometry is used as a diagnostic tool, to automate population selection. Finally, workflows saved in a standard file format such as XML can be played by different software, which can be useful in terms of reproducible research.

Nolwenn Le Meur

The comments on the lead section were extremely helpful, and have been taken into account in the expansion of that section.

We have added a listing of some of the applications of flow cytometry to the introduction section.

We have added a paragraph to the section overviewing the steps in flow cytometry analysis to emphasise the importance of workflows and their interchange for reproducibility. -Kierano (talk) 11:48, 27 June 2013 (PDT)

Reviewer 3: Jorge Pardo

This page provides an informative overview of the type of multidimensional data generated by flow cytometry and the role of bioinformatics in analyzing increasingly complex data sets.

The introductory section on the basics of fluorescence-based flow cytometry is missing a description of spectral cross-over compensation. This would seem an oversight, as compensation and compensation matrices are mentioned in other sections of the page.

Manual gating in the analysis of flow data should be described earlier in the page, certainly before describing Gating-ML, and with a bit more detail. The authors describe the process as "error prone" and "non-reproducible". Given the same data set, two investigators may use different hierarchical manual gating strategies to define a cell population, but this does not imply intrinsic non-reproducibility in the process. Indeed, clinical flow cytometry laboratories are certified based on their ability to reproduce results while testing a defined sample, and this testing involves manual gating. As for "error prone", the inference is that there is a correct way to gate flow data, and that when this process is done manually, it is likely to be done incorrectly. This statement is then ignored in the discussion of combinatorial gating approaches, like flowType/RchyOptimyx, that use manual gating. On the other hand, the discussion of automated gating using clustering algorithms fails to mention that repeated analysis of data sets with a large number of clusters may report different cluster partitions (http://www.biomedcentral.com/1471-2105/14/S1/S8). I would invite the authors to present a balanced characterization of manual gating that recognizes its limitations in the analysis of increasingly complex flow data; it is a time-consuming hierarchical approach that is limited to two-dimensional analysis at each step.

Re. manual gating:

We have made several changes to address this comment:

We have re-organized the content to discuss manual gating earlier as requested. We have also clarified that manual analysis can indeed be reproducible, especially in controlled clinical settings, and have better described the cases in which it can cause inaccuracies. We have also clarified that despite the recent advances in computational analysis, manual gating is still the main solution for identification of specific rare cell populations (e.g., for gating rare populations for the combinatorial gating algorithms). Finally, we have explained that the computational gating algorithms we have discussed here can automatically select the number of cell populations using different methods and that this choice can affect the sensitivity and specificity of the results. -Nnimaa (talk) 15:21, 28 June 2013 (PDT)

Lastly, I'd emphasize the need for informative representation of cell populations identified through automated gating of complex multidimensional flow data. It is not informative to show all cell populations defined through multidimensional analysis on two dimensional dot plots. The SPADE software does a great job as it organizes defined cell populations in hierarchies of related phenotypes and it also allows for the comparison of individual markers across all the cell populations. This facilitates the identification of cell lineages, identification of rare cell types and comparison of different samples. -Jorge Pardo

Re. visualization:

First, we would like to clarify that the SPADE algorithm is not always suitable for identification of lineages (as spanning trees do not necessarily represent lineages) or rare cell populations (due to the downsampling). Several approaches are being considered for addressing these limitations. That being said, we agree with the reviewer that SPADE is a fantastic algorithm for visualization of an entire sample to identify major cell populations, and have discussed it in the "gating guided by dimension reduction" section. -Nnimaa (talk) 15:16, 28 June 2013 (PDT)

Re. compensation:

Initially we had thought to exclude compensation, as the methods for performing it, while computational and automated, are standard and have not advanced significantly since the development of multicolour flow. However, on re-reading, it did indeed feel missing, and we have consequently included a section discussing the computational aspects of compensation. -Kierano (talk) 11:48, 27 June 2013 (PDT)
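For readers of this thread unfamiliar with the computation being discussed, the following is a minimal two-detector sketch in Python. It assumes the usual linear model in which the observed signal is the true signal multiplied by a spillover matrix, so that compensation multiplies by the inverse of that matrix. The spillover values and intensities are hypothetical, and the helper names are made up for this example rather than taken from any flow cytometry package.

```python
def invert_2x2(m):
    """Invert a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("spillover matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def compensate(event, spillover):
    """Compensate one event (row vector) by multiplying with the inverse spillover matrix."""
    inv = invert_2x2(spillover)
    return [
        event[0] * inv[0][0] + event[1] * inv[1][0],
        event[0] * inv[0][1] + event[1] * inv[1][1],
    ]


# Hypothetical spillover matrix: row i gives the fraction of fluorochrome i
# that is detected in each of the two detectors.
spillover = [[1.0, 0.2], [0.1, 1.0]]

# Simulate what the detectors would record for a known true signal.
true_signal = [1000.0, 500.0]
observed = [
    true_signal[0] * spillover[0][0] + true_signal[1] * spillover[1][0],
    true_signal[0] * spillover[0][1] + true_signal[1] * spillover[1][1],
]

recovered = compensate(observed, spillover)
print(observed, recovered)
```

The same idea extends to more colours with a general matrix inverse; the 2x2 case is written out by hand here only to keep the sketch free of dependencies.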

Response to reviewers

We have added our responses to each reviewer below their review. All of the comments were extremely helpful and we feel have strengthened the paper; we thank all of the reviewers for their input.

Full details of the changes can be seen at this diff. -Kierano (talk) 11:53, 27 June 2013 (PDT)

Identifying cell populations

I have some comments on the section Flow cytometry bioinformatics#Identifying cell populations. I could try to resolve some of these issues myself by editing the article, but I thought it might be best to bring it up here first so that people can see where I am coming from.

1. I think it would help if we described the data we are working with. I take it that we have lots of variables that could be used to categorise the cells but no standard of what those categories are. The aim is to create categories, based on the measured data, right? This section sounds a bit like it is describing decision tree learning, but that is different in that there we have a training set with the categories already known, and the aim is to work out how best to infer the category from the explanatory variables so that in future we can calculate a probability of an item falling into a category based on measurements of the explanatory variables. If the aim of the identification process was made clearer this would help.
2. (Related to the above point.) Where has fluorescence intensity come from? It's mentioned in the second sentence of Flow cytometry bioinformatics#Gating, as if it is really important, but not earlier in the article. Are all the variables different types of fluorescence intensity?
3. I take it that the different subsections describe different methods for achieving the same aim. It would be good to make that clearer. It could look like you do one subsection and then the next.
4. It took me a while to get my head around the sentence "The data generated by flow-cytometers can be plotted in one or two dimensions to produce a histogram or scatter plot." - even though there was an example of the scatter plots just there. I was thinking "which two dimensions"? But now I realise that the point is that it could be any two. The data has multiple dimensions of explanatory variables and the point is that you take pairs of these variables. With a plot like the one shown, you can show every pair you want to, all next to each other in a logical way. Not sure what the best way to make this clearer is. Perhaps resolving points 1 and 2 will help.
5. The article says "As the number of markers measured by flow cytometry increases, the number of scatter plots that need to be investigated increase exponentially." If my understanding in point 4 is correct then it actually increases quadratically. Perhaps best to write this as "The number of scatter plots that need to be investigated increases with the square of the number of markers measured by flow cytometry."
6. The text implies that probability binning is univariate and the multivariate equivalent is called "frequency difference gating". Is this correct? The image caption uses the term "probability binning", even though the image is multivariate. Might be best to just stick with one term? We could tweak the text, including removing the phrase "on a univariate basis".

Yaris678 (talk) 10:25, 24 December 2013 (UTC)
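The growth rate raised in the comment above can be checked with a short Python sketch. It assumes that one scatter plot is needed per unordered pair of markers, which is the reading given in that comment; the function name and marker names are hypothetical.

```python
from itertools import combinations


def pairwise_plots(n_markers):
    """Number of distinct two-marker scatter plots: n choose 2 = n(n-1)/2."""
    return n_markers * (n_markers - 1) // 2


# Listing the pairs for a small hypothetical panel gives the same count.
markers = ["CD3", "CD4", "CD8", "CD45RA"]
pairs = list(combinations(markers, 2))
print(len(pairs), pairwise_plots(len(markers)))

# Compare quadratic growth in plots with exponential growth (2**n).
for n in (4, 8, 16, 32):
    print(n, pairwise_plots(n), 2 ** n)
```

Under this assumption the count grows as n(n-1)/2, that is, with the square of the number of markers rather than exponentially.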