Parker Institute for Cancer Immunotherapy

After the Pipette: Using Clustering to Unearth Insights in Single-Cell Data Video Tutorial

How can biologists use clustering to identify cell populations that are associated with an outcome of interest?

Pier Federico Gherardini is director of informatics at the Parker Institute for Cancer Immunotherapy and specializes in computational biology. He received his post-doctoral training from Stanford University. He is PICI’s favorite Italian.

Access tools and tutorials on clustering at

This is the first installment in a series of videos that highlight statistical methods after the pipette. Tell us what you want to see in our next installment in the comments on YouTube.

Video Transcript

Today we’re going to talk about statistical methods that you can use to identify interesting features in single-cell data. Say, for example, you’ve collected a flow cytometry dataset and you want to see whether any specific cell population is associated with a patient being a responder versus a non-responder to therapy.

One obvious way of doing this would be manual gating: you manually identify different cell populations of interest in your dataset (e.g. CD4 T cells, CD8 T cells, etc.), and then investigate whether any of these are upregulated or downregulated in responders versus non-responders. However, there is a problem with doing things this way: what happens if the biological population that is important for your process is not one that you knew about before? And also, how much time do you want to spend enumerating all the possible populations that you can think of? One way around this problem is to use clustering. Clustering is the process of using a computer algorithm to identify groups of cells that are similar in your data and therefore form a cohesive unit.

If you use clustering, then you can build a matrix like this one, where every row represents a separate cluster and every column represents a different sample. The values in this matrix represent the abundance of each cluster in each individual sample. And remember, since you also know which samples came from responders and which from non-responders, you can use a variety of statistical methods to identify whether any of these clusters is associated with being a responder or a non-responder to therapy.
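As a minimal sketch of the matrix described above, assume every cell already carries a cluster label and a sample ID. The function name `abundance_matrix`, the toy values, and the choice to express abundance as the fraction of a sample's cells are illustrative assumptions, not the tools from the video.

```python
from collections import Counter

def abundance_matrix(cluster_labels, sample_ids):
    """Return (clusters, samples, matrix), where matrix[i][j] is the fraction
    of the cells in samples[j] that were assigned to clusters[i]."""
    if len(cluster_labels) != len(sample_ids):
        raise ValueError("need one cluster label and one sample ID per cell")
    clusters = sorted(set(cluster_labels))
    samples = sorted(set(sample_ids))
    pair_counts = Counter(zip(cluster_labels, sample_ids))  # cells per (cluster, sample)
    cells_per_sample = Counter(sample_ids)                  # total cells per sample
    matrix = [
        [pair_counts[(c, s)] / cells_per_sample[s] for s in samples]
        for c in clusters
    ]
    return clusters, samples, matrix

# Toy data, one entry per cell (values are made up for illustration).
cluster_labels = [0, 0, 1, 1, 1, 0, 1, 1]
sample_ids = ["P1", "P1", "P1", "P1", "P2", "P2", "P2", "P2"]

clusters, samples, matrix = abundance_matrix(cluster_labels, sample_ids)
for cluster, row in zip(clusters, matrix):
    print(cluster, dict(zip(samples, row)))
```

Using fractions rather than raw counts keeps samples with different numbers of acquired cells comparable.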

Now the thing that’s really important here is that, if you want to do things this way, you have to pool all your data and cluster it in a single run. When you do that, each cluster contains similar cells from different samples, and you can calculate the abundance of each cluster in each individual sample, because the clusters have been defined consistently across the entire dataset.

If you do not do things this way and instead cluster each sample individually, then you are not going to be able to construct the matrix that I was showing before. You will not be able to match a cluster in one sample with a cluster in another sample, because the clusters have been defined independently for each sample and there is no correspondence between them.
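The pooling step can be sketched as follows. The transcript does not name a clustering algorithm, so the small k-means below is only a stand-in, and the function `kmeans`, the two-marker toy cells, and the sample names are assumptions made for illustration. The point is that cells from all samples are concatenated first and clustered once, while a parallel list remembers which sample each cell came from.

```python
import random

def kmeans(points, k, n_iter=50, seed=0):
    """Minimal Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen cells
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each cell to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its cells; leave it alone if empty.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels

# Toy data: each cell is a tuple of two marker intensities (made-up values).
cells_by_sample = {
    "P1": [(0.1, 0.2), (0.2, 0.1), (5.0, 5.1), (5.2, 4.9)],
    "P2": [(0.0, 0.3), (4.9, 5.0), (5.1, 5.2), (5.0, 4.8)],
}

# Pool all samples, keeping track of where each cell came from.
pooled_cells, cell_sample = [], []
for sample_id, cells in cells_by_sample.items():
    pooled_cells.extend(cells)
    cell_sample.extend([sample_id] * len(cells))

# One clustering run over the pooled data: cluster IDs mean the same thing
# in every sample, so per-sample abundances are directly comparable.
labels = kmeans(pooled_cells, k=2)
for sample_id, label in zip(cell_sample, labels):
    print(sample_id, label)
```

Running `kmeans` separately on `cells_by_sample["P1"]` and `cells_by_sample["P2"]` would instead give two unrelated sets of labels, with no guarantee that cluster 0 in one sample corresponds to cluster 0 in the other.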

If instead you pool all the data together, you will be able to construct the matrix that we were talking about before. Once the data is in that form, there is a whole variety of statistical methods you can use to identify whether any cluster is specifically upregulated or downregulated in responders versus non-responders.
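As one example of such a method, here is a sketch of an exact permutation test applied to each row of the cluster-by-sample matrix. The talk does not prescribe a particular test, so this choice is an assumption, as are the function name `permutation_p_value`, the cluster names, and the toy abundances.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(values, is_responder):
    """Exact two-sided permutation test on the difference in group means."""
    n_resp = sum(is_responder)
    observed = abs(
        mean(v for v, r in zip(values, is_responder) if r)
        - mean(v for v, r in zip(values, is_responder) if not r)
    )
    indices = range(len(values))
    n_extreme = n_total = 0
    # Try every way of labeling n_resp of the samples as "responder".
    for resp_idx in combinations(indices, n_resp):
        resp_set = set(resp_idx)
        resp = [values[i] for i in indices if i in resp_set]
        non = [values[i] for i in indices if i not in resp_set]
        n_total += 1
        # Small tolerance so floating-point ties count as "at least as extreme".
        if abs(mean(resp) - mean(non)) >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_total

# Toy matrix: one row per cluster, one value per sample (made-up abundances).
is_responder = [True, True, True, True, False, False, False, False]
abundance = {
    "cluster_A": [0.30, 0.28, 0.35, 0.32, 0.10, 0.12, 0.08, 0.11],
    "cluster_B": [0.20, 0.25, 0.22, 0.18, 0.21, 0.19, 0.24, 0.23],
}

p_values = {name: permutation_p_value(row, is_responder)
            for name, row in abundance.items()}
for name, p in p_values.items():
    print(name, round(p, 4))
```

Enumerating every relabeling is only practical for small cohorts, and when many clusters are tested the resulting p-values would typically also need a multiple-testing correction.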

Also, the features that you use for your clusters don’t have to be limited to the abundance of each cluster in a specific sample. For instance, say that you’ve also included functional markers such as Ki67 or pSTAT5 in your panel. You can then use these to describe the properties of your clusters. So instead of looking only at the frequency of a cluster in each individual sample, you can also look at the activation state of that cluster in that sample, using markers such as Ki67 or pSTAT5.
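A sketch of this second kind of feature, under the same assumptions as before (per-cell cluster labels and sample IDs from a pooled clustering run): the matrix has the same shape, but each entry summarizes a functional marker instead of a frequency. The function name `marker_matrix`, the use of the median as the summary, and the toy Ki67 values are illustrative choices.

```python
from collections import defaultdict
from statistics import median

def marker_matrix(cluster_labels, sample_ids, marker_values):
    """Return (clusters, samples, matrix), where matrix[i][j] is the median
    marker value of the cells of clusters[i] found in samples[j], or None
    if that cluster has no cells in that sample."""
    grouped = defaultdict(list)
    for c, s, v in zip(cluster_labels, sample_ids, marker_values):
        grouped[(c, s)].append(v)
    clusters = sorted(set(cluster_labels))
    samples = sorted(set(sample_ids))
    matrix = [
        [median(grouped[(c, s)]) if (c, s) in grouped else None for s in samples]
        for c in clusters
    ]
    return clusters, samples, matrix

# Toy data, one entry per cell (made-up Ki67 intensities).
cell_clusters = [0, 0, 0, 1, 0, 0, 1, 1]
cell_samples = ["P1", "P1", "P1", "P1", "P2", "P2", "P2", "P2"]
ki67 = [0.2, 0.4, 0.6, 2.0, 1.0, 3.0, 2.5, 3.5]

ki67_clusters, ki67_samples, ki67_matrix = marker_matrix(
    cell_clusters, cell_samples, ki67
)
for cluster, row in zip(ki67_clusters, ki67_matrix):
    print(cluster, dict(zip(ki67_samples, row)))
```

The resulting matrix can be compared between responders and non-responders in the same way as the abundance matrix.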

If you’re interested in this kind of analysis, please go to our GitHub page, where you can find tools and tutorials that will help you apply this concept to your own data. And also, please let us know what kinds of data analysis topics you would like us to address in future installments of this series.