Yale University
Yale University
EliScholar – A Digital Platform for Scholarly Publishing at Yale
EliScholar – A Digital Platform for Scholarly Publishing at Yale
Yale Medicine Thesis Digital Library
School of Medicine
January 2020
Uncovering Intratumoral And Intertumoral Heterogeneity Among
Uncovering Intratumoral And Intertumoral Heterogeneity Among
Single-Cell Cancer Specimens
Single-Cell Cancer Specimens
William Shelton Chen
Follow this and additional works at: https://elischolar.library.yale.edu/ymtdl
Recommended Citation
Recommended Citation
Chen, William Shelton, “Uncovering Intratumoral And Intertumoral Heterogeneity Among Single-Cell
Cancer Specimens” (2020). Yale Medicine Thesis Digital Library. 3890.
https://elischolar.library.yale.edu/ymtdl/3890
This Open Access Thesis is brought to you for free and open access by the School of Medicine at EliScholar – A
Digital Platform for Scholarly Publishing at Yale. It has been accepted for inclusion in Yale Medicine Thesis Digital
Library by an authorized administrator of EliScholar – A Digital Platform for Scholarly Publishing at Yale. For more
information, please contact elischolar@yale.edu.
Uncovering Intratumoral and Intertumoral Heterogeneity Among Single-
Cell Cancer Specimens
A Thesis Submitted to the Yale University School of Medicine
in Partial Fulfillment of the Requirements for the
Degree of Doctor of Medicine
by
William S. Chen
2020
UNCOVERING INTRATUMORAL AND INTERTUMORAL HETEROGENEITY
AMONG SINGLE-CELL CANCER SPECIMENS
William S Chen, Nevena Zivanovic, David van Dijk, Guy Wolf, Bernd Bodenmiller, and
Smita Krishnaswamy. Department of Genetics, Yale University, School of Medicine,
New Haven, CT.
ABSTRACT
While several tools have been developed to map axes of variation among individual cells,
no analogous approaches exist for identifying axes of variation among multicellular
biospecimens profiled at single-cell resolution. Developing such an approach is of great
translational relevance and interest, as single-cell expression data are now often collected
across numerous experimental conditions (e.g., representing different drug perturbation
conditions, CRISPR knockdowns, or patients undergoing clinical trials) that need to be
compared. In this work, “Phenotypic Earth Mover’s Distance” (PhEMD) is presented as a
solution to this problem. PhEMD is a general method for embedding a “manifold of
manifolds,” in which each datapoint in the higher-level manifold (of biospecimens)
represents a collection of points that span a lower-level manifold (of cells).
PhEMD is applied to a newly-generated, 300-biospecimen mass cytometry drug
screen experiment to map small-molecule inhibitors based on their differing effects on
breast cancer cells undergoing epithelial–mesenchymal transition (EMT). These
experiments highlight EGFR and MEK1/2 inhibitors as strongly halting EMT at an early
stage and PI3K/mTOR/Akt inhibitors as enriching for a drug-resistant mesenchymal cell
subtype characterized by high expression of phospho-S6. More generally, these
experiments reveal that the final mapping of perturbation conditions has low intrinsic
dimension and that the network of drugs demonstrates manifold structure, providing
insight into how these single-cell experiments should be computational modeled and
visualized. In the presented drug-screen experiment, the full spectrum of perturbation
effects could be learned by profiling just a small fraction (11%) of drugs. Moreover,
PhEMD could be integrated with complementary datasets to infer the phenotypes of
biospecimens not directly profiled with single-cell profiling. Together, these findings
have major implications for conducting future drug-screen experiments, as they suggest
that large-scale drug screens can be conducted by measuring only a small fraction of the
drugs using the most expensive high-throughput single-cell technologies—the effects of
other drugs may be inferred by mapping and extending the perturbation space.
PhEMD is also applied to patient tumor biopsies to assess intertumoral
heterogeneity. Applied to a melanoma dataset and a clear-cell renal cell carcinoma
dataset (ccRCC), PhEMD maps tumors similarly to how it maps perturbation conditions
as above in order to learn key axes along which tumors vary with respect to their tumor-
infiltrating immune cells. In both of these datasets, PhEMD highlights a subset of tumors
demonstrating a marked enrichment of exhausted CD8+ T-cells. The wide variability in
tumor-infiltrating immune cell abundance and particularly prominent exhausted CD8+ T-
cell subpopulation highlights the importance of careful patient stratification when
assessing clinical response to T cell-directed immunotherapies.
Altogether, this work highlights PhEMD’s potential to facilitate drug discovery
and patient stratification efforts by uncovering the network geometry of a large collection
of single-cell biospecimens. Our varied experiments demonstrate that PhEMD is highly
scalable, compatible with leading batch effect correction techniques, and generalizable to
multiple experimental designs, with clear applicability to modern precision oncology
efforts.
Published in part:
Chen WS*, Zivanovic N*, van Dijk D, Wolf G, Bodenmiller B, Krishnaswamy S.
Uncovering axes of variation among single-cell cancer specimens. Nature Methods,
2020.
Presented in part:
Chen WS, Zivanovic N, Pe’er D, Bodenmiller B, Krishnaswamy S. Phenotypic analysis
of single-cell breast cancer inhibition data reveals insights into EMT. AACR Annual
Meeting, Washington, DC, Apr 2017.
ACKNOWLEDGEMENTS:
I would like to acknowledge the Krishnaswamy and Bodenmiller laboratories for
thought-provoking and productive discussions. I am especially appreciative of Prof.
Smita Krishnaswamy for her incredible mentorship and support. I am also indebted to
Nevena Zivanovic and Bernd Bodenmiller for their help generating and interpreting
much of the single-cell data presented in this work.
This work was supported in part by the Chan–Zuckerberg Initiative Seed
Networks for the Human Cell Atlas (S.K.), a Swiss National Science Foundation (SNSF)
R’Equip grant (B.B), a SNSF Assistant Professorship grant PP00P3-144874 (B.B.), the
SystemsX Transfer Project “Friends and Foes” (B.B.), the SystemX grants Metastasix
and PhosphoNEtX (B.B.), the European Research Council (ERC) under the European
Union’s Seventh Framework Program (FP/2007-2013)/ERC Grant Agreement 336921
(B.B.), the CRUK IMAXT Grand Challenge (B.B.), and the following National Institutes
of Health (NIH) grants: R01GM135929 (S.K, G.W.), UC4 DK108132 (B.B.), NIH–
NIDDK T35DK104689 (W.C.).
Table of Contents
INTRODUCTION ……………………………………………………………………………………………
1
Bulk vs. single-cell profiling ……………………………………………………………………
1
Approaches to characterizing axes of variation among a collection of cells
….
5
Principal Component Analysis (PCA)
……………………………………………….
6
t-Distributed Stochastic Neighbor Embedding (t-SNE) ………………………..
7
Uniform Manifold Approximation and Projection (UMAP) ………………….
8
Tree-based approaches ………………………………………………………………….
9
Diffusion maps ……………………………………………………………………………
10
PHATE
………………………………………………………………………………………
11
Characterizing axes of variation among a collection of multicellular cancer
specimens ……………………………………………………………………………………………
11
Hypothesis …………………………………………………………………………………………..
15
Specific Aims
……………………………………………………………………………………….
15
Aim 1: Develop a robust tool for uncovering axes of variation among
single-cell biospecimens
……………………………………………………………….
15
Aim 2: Characterize the differing effects of 233 small-molecule inhibitors
on breast cancer epithelial–mesenchymal transition (EMT) ……………….
15
Aim 3: Characterize the immune cell subpopulational variation among
melanomas and among clear-cell renal cell carcinomas (ccRCCs)
………
15
MATERIALS AND METHODS ……………………………………………………………………..
16
The PhEMD analytical approach
…………………………………………………………..
16
Data collection and processing ………………………………………………………………
22
Py2T cell culture and stimulation …………………………………………………..
22
Small-molecule inhibitors
……………………………………………………………..
23
Chronic kinase inhibition screen ……………………………………………………
23
Cell collection …………………………………………………………………………….
24
Metal-labeled antibodies ………………………………………………………………
24
Mass-tag cellular barcoding and antibody staining …………………………..
25
Mass cytometry data processing …………………………………………………….
25
In-depth analysis of breast cancer EMT cell-state space and drug-inhibitor
manifold from a single mass cytometry run ……………………………………………
26
Integrating batch-effect correction to compare 300 EMT inhibition and
control conditions measured in five experimental runs ……………………………
27
Intrinsic dimensionality analysis of the EMT perturbation state space ……..
28
Imputing the effects of inhibitions based on a small measured dictionary
….
29
Incorporating drug-target binding specificity data to extend the PhEMD
embedding and predict the effects of unmeasured inhibitors on TGFβ-
induced breast cancer EMT ………………………………………………………………….
30
Predicting drug-target binding specificities based on PhEMD results from
EMT perturbation experiment
………………………………………………………………
32
Generation and analysis of dataset with known ground-truth branching
structure ……………………………………………………………………………………………..
34
Analysis of melanoma single-cell RNA-sequencing dataset ………………………
35
Analysis of clear cell renal cell carcinoma dataset……………………………………
35
Statistical methods ……………………………………………………………………………….
36
Data availability …………………………………………………………………………………..
36
Code availability ………………………………………………………………………………….
36
Author contributions ……………………………………………………………………………
37
RESULTS ……………………………………………………………………………………………………..
38
Overview of PhEMD
…………………………………………………………………………….
38
Comparing specimens pairwise using Earth Mover’s Distance (EMD) ……..
39
Evaluating accuracy of PhEMD in mapping multi-specimen, single-cell
dataset with known ground-truth structure ……………………………………………
41
Assessing the differing effects of selected drug perturbations on EMT in
breast cancer ……………………………………………………………………………………….
43
Batch effect correction in multi-run EMT experiment
………………………..
44
Cell-subtype definition via manifold clustering ………………………………..
47
Constructing and clustering the EMD-based drug-inhibitor manifold……
50
Analyzing EMT perturbations measured in a single CyTOF run ……………..
52
Cell subtype definition via manifold clustering
…………………………………
53
Constructing and clustering the EMD-based drug-inhibitor manifold……
55
Imputing the effects of inhibitors based on a small measured dictionary …..
58
Validating the PhEMD embedding using external information on similarities
between small-molecule inhibitors …………………………………………………………
60
Predicting the effects of three selected inhibitors on breast cancer EMT
relatively to the effects of measured inhibitors based on known drug-target
binding specificities……………………………………………………………………..
60
Imputing the single-cell phenotypes of three unmeasured inhibitors based
on drug-target similarity to measured inhibitors
………………………………..
62
Predicting drug-target binding specificities based on PhEMD results from
EMT perturbation experiment ……………………………………………………….
63
PhEMD highlights manifold structure of tumor specimens measured using
CyTOF and single-cell RNA-sequencing ………………………………………………..
64
DISCUSSION ………………………………………………………………………………………………..
69
REFERENCES
………………………………………………………………………………………………
73
SUPPLEMENTARY TABLES ………………………………………………………………………..
80
1
INTRODUCTION
Bulk vs. single-cell profiling
Next-generation sequencing (NGS) has revolutionized the way in which diseases can be
studied. Bulk DNA sequencing (DNA-seq) of germline biospecimens can be leveraged to
discover disease-specific polymorphisms and to investigate disease heritability at an
unprecedented scope and level of detail (1–3). In the setting of cancer, bulk DNA-seq of
liquid- or solid-tumor biopsies has been used to identify somatic alterations (e.g.,
mutations, copy number alterations, and structural variants) that can serve as biomarkers
prognostic of clinical outcomes and predictive of response to therapies (4–9).
Complementarily, bulk RNA-sequencing (RNA-seq) has been used to quantitate gene
expression of protein-coding genes and long non-coding RNAs at the exon level of
resolution. Paired with proteomic assays, NGS approaches have facilitated our
understanding of cellular biology and genomic drivers of disease at all steps of the central
dogma, from DNA to RNA to protein.
While instrumental in building our foundational understanding of cancer
genomics, bulk tumor profiling faces the notable limitation of being unable to resolve
intratumoral heterogeneity. By nature of the sample preparation procedure for bulk NGS,
DNA or RNA fragments are isolated from all cells of a biospecimen in aggregate, and
per-cell read counts cannot be determined. Thus, genomic variants identified via bulk
DNA-seq can only be interpreted as being present in some fraction of profiled cells.
Moreover, it is impossible to determine which of the variants co-occur in a given cancer
cell. The readout of bulk RNA-seq is similarly coarse in that the reported expression of a
given gene represents the average expression across all cells in the biospecimen without
2
any consideration of cell-to-cell variation. In practice, when comparing expression values
across biospecimens measured using bulk profiling or when performing association
studies between specific DNA variants and clinical phenotypes, a simplifying assumption
is often made that all (or at least a substantial-enough proportion of) cells in each
biospecimen harbor the genomic variant or gene expression signature of interest. In
reality, this assumption may not always be valid, and bulk measurements may fail to
accurately reflect the expression profiles of individual cells. Bulk profiling may also fail
to detect true biological differences between experimental conditions. The following
example demonstrates these concepts more concretely and highlights the utility of single-
cell analytical approaches for accurately characterizing and distinguishing between
multicellular biospecimens.
Consider a multi-specimen dataset consisting of immune cells with collectively
variable expression of CD4 and CD8. Each specimen is comprised of a cell population
that fits one of four distribution patterns, as shown below (Figure 1A). Each Group A
specimen consists of a homogeneous immune cell population characterized by
intermediate expression of both CD4 and CD8. Each Group B specimen consists of two
similarly-abundant immune cell subpopulations: one CD4+ and one CD8+ subpopulation.
Group C specimens consist of a mixture of CD4+, CD8+, and CD4/CD8 double-positive
(DP) immune cells. Group D specimens consist of one CD4+ and one CD8+
subpopulation of roughly equal abundance and one additional rare subpopulation of
CD4/CD8 double-negative (DN) immune cells. Note that these immune cell subtypes
(CD4+, CD8+, DP, and DN) have been reported to exist in normal thymus as well as
various disease states (e.g., breast and hematologic malignancies (10, 11)). The simulated
3
experiment consists of 32 specimens in total (eight of each of Groups A-D). By design,
the bulk (average) expression of CD4 and CD8 for each biospecimen is roughly the same
for all biospecimens, regardless of differences in cell subpopulation characteristics.
Figure 1. a) Single-cell profiles of each multicellular biospecimen in a computationally-generated immune
cell dataset. Each point represents a single cell. Groups A-D each have 8 biospecimens that fit the single-
cell profile (i.e., are comprised of some combination of the cell subpopulations depicted) for a total of 32
biospecimens. By design, all biospecimens have roughly the same bulk expression (mean across all cells)
of CD4 and CD8. b) Diffusion map embedding generated by embedding a specimen-to-specimen distance
matrix, where pairwise distances between specimens were computed by taking the Euclidean distance
between specimens represented as bulk expression of CD4 and CD8. Bulk expression profiles do not
adequately reflect the biological differences between specimens in this dataset and cannot be used to
distinguish specimens in a biologically meaningful way. c) Diffusion map embedding generated by
embedding a PhEMD distance matrix, which accounts for the single-cell characteristics of each specimen
(see “Overview of PhEMD” in Results section). PhEMD successfully distinguishes specimens with
different single-cell profiles from one another.
Next, consider the aim of relating the 32 specimens to one another in a
biologically meaningful way. This could be done by generating a low-dimensional
embedding that could be visualized to view the similarity of any two specimens relative
to the rest and to identify groups of similar biospecimens. First, consider an approach
4
using bulk expression measurements. A biospecimen–biospecimen distance matrix can be
generated by computing pairwise (Euclidean) distances between each pair of
biospecimens, with each biospecimen represented as the average expression of each gene
(i.e., CD4 and CD8) across all cells in the biospecimen. This distance matrix can then be
embedded and visualized in two dimensions using the diffusion map nonlinear
dimensionality reduction approach. The result is an embedding that fails to differentiate
specimens based on biologically important differences. Specifically, specimens of the
same known, ground-truth subtype (i.e., Group A-D) failed to map to similar parts of the
resulting embedding (Figure 1B).
A better approach to comparing these specimens is to compare the presence and
abundance of all single-cell subpopulations in each specimen. I aim to formalize such an
approach in this thesis and demonstrate that it can be used to effectively distinguish
single-cell specimens from one another that cannot be distinguished based on bulk or
average expression patterns. In the above example, the approach yields a final low-
dimensional map that vastly outperforms a bulk approach (Figure 1B) and successfully
differentiates specimens based on biologically important differences in cell subpopulation
characteristics and proportions (Figure 1C).
The exploration of cell-to-cell variation within a given biospecimen has been
facilitated by the recent development of single-cell expression profiling (measurement of
gene expression on a per-cell rather than average-across-all-cells level). Early studies
leveraging these technologies have uncovered important insights not previously identified
by bulk profiling. Several studies have highlighted the compositional heterogeneity of
tumors as a mixture of specific cancer and non-malignant (e.g., immune and stromal) cell
5
types and have revealed profound cellular heterogeneity among melanoma (12), clear-cell
renal cell carcinoma (ccRCC) (13), and breast cancer cells (14), even within a single
tumor biopsy. Additional studies have used single-cell profiling to better elucidate cell
signaling, differentiation, and reprogramming in the context of cancer (15), aging (16),
and other physiologic and disease processes (17–19). Among else, single-cell profiling is
particularly useful for studying cancer, as cancer is understood to arise from the genomic
mis-programming of a single cell and the downstream sequelae. While the analytical
approaches presented in this work are generalizable to studying many biological
phenomena at a single-cell level, the focus of this thesis will be on leveraging single-cell
technologies to better understand cancer progression, cellular response to
chemotherapies, and the tumor microenvironment.
Approaches to characterizing axes of variation among a collection of cells
As the readout of single-cell expression profiling is highly complex, new computational
tools have been developed in parallel with single-cell profiling techniques in order to
facilitate the extraction of biological insights. A particularly challenging property of
single-cell expression data is its high dimensionality: each biospecimen is comprised of
many cells, each of which is represented by tens to thousands of gene or protein
measurements. The analysis of high-dimensional data, especially in unsupervised or
exploratory settings, often introduces various challenges that are collectively referred to
as the “curse of dimensionality” (20, 21). While the ambient dimension of single-cell data
is often high (equal to the number of genes or proteins measured), the intrinsic
dimension, or minimum number of variables needed to represent the data adequately, is
often much lower. Mapping single-cell data from its ambient dimension to this lower-
6
dimensional space is termed “dimensionality reduction,” which is often a critical first
step for learning and visualizing the ways in which cells vary. It is also instrumental in
identifying distinct, biologically meaningful cell subpopulations (e.g., by clustering cells
in the lower-dimensional space). The following subsections provide an overview of
several of the leading dimensionality reduction techniques for learning and visualizing
axes of cell-to-cell variation among a set of cells measured using single-cell expression
profiling. In the below subsections, each “dataset” refers to a heterogeneous cell
population and each “point” represents a single cell, characterized by multiple measured
features (i.e., gene expression values).
Principal Component Analysis (PCA)
Principal component analysis (PCA) is a dimensionality reduction technique that aims to
find new, uncorrelated variables (“principal components”) that successively maximize
variance while minimizing information loss from the original dataset (22). It does so by
performing an orthogonal transformation of the original dataset such that the new
variables are linear combinations of features in the original ambient-dimensional space.
This transformation can be computed as the solution to an eigenvalue/eigenvector
problem, in which principal components are defined as eigenvectors of the covariance
matrix and corresponding eigenvalues represent the proportion of the data variance
explained by the eigenvectors.
PCA is useful in many settings for learning a low-dimensional representation of
the data, although it does make several key assumptions. Firstly, it assumes that the
principal components are appropriately modeled as linear combinations of the original
dimensions. Secondly, PCA assumes that principal components are orthogonal to one
7
another. Thirdly, PCA assumes that the input data are scaled and normalized
appropriately prior to application, as the approach is not scale invariant. In the event that
any of these assumptions are violated, PCA may fail to recover optimal axes of variation
in the data. Additionally, by design, PCA prioritizes preserving global structure (i.e.,
distances between faraway points) over local structure (i.e., distances between points
within the same “neighborhood”) when mapping from high- to low-dimensional space.
Thus, the approach is especially sensitive to outliers and measurement noise.
t-Distributed Stochastic Neighbor Embedding (t-SNE)
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a popular dimensionality
reduction approach that preserves local relationships between points when mapping them
from an ambient-dimensional to low-dimensional space (23). Put another way, points that
are close to one another in the original representation of the data are mapped to be close
to one another in the final low-dimensional t-SNE space. Consider two points i and j
denoted by xi and xj respectively in the ambient-dimensional space and yi and yj
respectively in the low-dimensional space. t-SNE first models each point in the ambient-
dimensional space as a Gaussian probability distribution centered on the actual
coordinates of the point (with a data-dependent variance proportional to a user-specified
“perplexity” value), then computes pairwise similarity between points xi and xj as the
conditional probability Pj|i that xi would select xj as its neighbor if neighbors were
selected in proportion to their probability density under a Gaussian centered at xi (24). An
analogous conditional probability Qj|i is computed between points yi and yj in the low-
dimensional t-SNE space, wherein points yi and yj are modeled as Student t-distribution
with one degree of freedom (rather than as a Gaussian distribution). The final low-
8
dimensional embedding is learned through gradient descent by minimizing the Kullback-
Leibler (KL) divergence between P and Q. In the perfect case, the difference between Pj|i
and Qj|i is zero for all i and j, i.e., the pairwise relationships between points are perfectly
preserved in the ambient and t-SNE dimensions.
Strengths of t-SNE include non-linearity, which renders it superior to linear
approaches such as PCA when applied to curved manifolds, and preservation of local
structure, which reveals subtle differences between similar yet distinct cell
subpopulations. A limitation of t-SNE is its high computational resource demands. In its
exact form, t-SNE has a quadratic time and space complexity, making applications to
datasets larger than 10,000 points often computationally intractable. To mitigate this
issue, various approximations and optimizations have been developed (25–27). Another
limitation is the loss of global structure preservation in the final embedding. t-SNE
effectively identifies neighborhoods of points but generally yields disjoint “clouds” of
points. Thus, continuous (e.g. cellular differentiation) processes and trajectories are often
fragmented in the t-SNE embedding, and relative distances between faraway points or
clusters in the embedding are not preserved (28).
Uniform Manifold Approximation and Projection (UMAP)
Uniform Manifold Approximation and Projection (UMAP) is a dimensionality reduction
approach that has been recently popularized due to its purported advantages over t-SNE
in terms of improved scalability and preservation of both local and global data structure
(29). Similarly to t-SNE, UMAP models points as probability distributions and performs
gradient descent to iteratively “move” points more similar to one another in the ambient-
dimensional space to be closer to one another in the low-dimensional embedding.
9
However, among other minor differences, UMAP omits the normalization of probabilities
used in t-SNE (thus improving runtime) and uses binary cross-entropy instead of KL
divergence as the cost function when comparing relationships between points in the
ambient-dimensional and low-dimensional spaces. UMAP also employs a graph
Laplacian approach to assigning the initial coordinates of the points in low-dimensional
space (prior to the first iteration of gradient descent), in contrast to the random
initialization employed by t-SNE. Early studies have claimed advantages of UMAP over
t-SNE in terms of faster computational runtime, greater preservation of global data
structure, and increased reproducibility of results across different iterations. However,
there is ongoing debate as to whether global structure is truly better preserved using
UMAP than t-SNE and if so, why exactly this may be (30).
Tree-based approaches
Several graph-based approaches have been developed to explicitly model single-cell
expression datasets as an interconnected “web” or “tree” of cells. Particularly aimed at
organizing and visualizing data with intrinsic trajectory structure (e.g., bifurcating
differentiation processes), these approaches typically represent cells as nodes and
relationships between similar cells as edges between nodes. Distances between cells can
then be defined as the shortest path between representative nodes (i.e., minimum number
of edges separating one cell from the other or, in the event of weighted edges, minimum
sum of edge lengths in a path from one cell to the other). Several particular single-cell
tools that employ such an approach include SPADE (31), Wishbone (32), and Monocle2
(33). These approaches are particularly useful for modeling continuous manifolds and for
resolving local neighborhood structure. A key limitation of most tree-based approaches is
10
poor scalability. In practice, when working with large datasets, these approaches often
require cell subsampling or prior identification of “landmark points” which may then
collectively comprise a relatively small number of graph nodes. Additionally, the number
of branches recovered in the final tree can vary greatly and is often dependent on user-
defined parameters, which may be challenging to tune if the expected number of branches
is not known a priori.
Diffusion maps
Diffusion maps are another nonlinear dimensionality reduction technique based on the
idea that a collection of points (e.g., cells) may be modeled such that a given point (e.g.,
cell) may “transition” to another point (e.g., similar cell state) with a probability
proportional to the known similarity of the two points (34). Diffusion maps first model
points as an interconnected graph, with connectivity between points generally based on
their distance in the ambient-dimensional space (e.g., Gaussian kernel, which prioritizes
preservation of local neighborhood structure). The point-to-point connectivity metric is
then used to represent the probability of “transitioning” from one cell to another in one
step of a random walk. A diffusion process is then performed over a diffusion time t (i.e.,
t-step random walk), wherein the local connectivity of the data is used to reveal the
global geometric structure of the data. The end result is a set of t-step transition
probabilities, which can be used to embed a low-dimensional map that captures both local
and global structure in the data. Diffusion maps are particularly well-suited for modeling
single-cell datasets with known trajectory structure and are modeled on underlying
principles that reflect our intuitive understanding of cellular differentiation processes
(e.g., “transition” from one cell state to the next). Diffusion maps are also attractive for
11
their nonlinearity and inherent denoising properties. Limitations of diffusion maps
include high computational runtime and sensitivity to scale parameter σ, which
determines the scale at which the data are visualized (35). In the traditional
implementation of diffusion maps, a fixed σ is used for all points in the dataset, often
imposing a tradeoff between preserving global and local structure with a bias toward
preserving global structure (34, 35). However, subsequent adaptations to the original
implementation proposed by Coifman and Lafon have been developed to better preserve
local structure (36).
PHATE
Potential of Heat-diffusion for Affinity-based Transition Embedding (PHATE) is a
nonlinear dimensionality reduction technique that aims to preserve both local and global
structure when mapping from high- to low-dimensional space (37). Similarly to diffusion
maps, PHATE models cell-to-cell connectivity as one-step transition probabilities in a
random walk model and then performs the diffusion process over diffusion time t to
determine t-step transition probabilities. However, PHATE employs a distance metric
(“potential distance”) between points in the diffusion space distinct from the “diffusion
distance” metric used in diffusion maps. In so doing, PHATE better preserves both local
and global data geometry and yields a more stable embedding than diffusion maps (37).
Characterizing axes of variation among a collection of multicellular cancer
specimens
The multitude of dimensionality reduction techniques described above have been adopted
and adapted to elucidate clusters, patterns, and progressions from high-dimensional
12
single-cell data. These techniques all rely on the ability to create a geometry from
datapoints by comparing them on the basis of their features. More specifically, these
techniques often compute a distance between the datapoints (i.e., cells) in order to
organize cells into lower-dimensional embeddings, such as diffusion maps or t-SNE
embeddings, in order to extract biologically meaningful clusters (i.e., cell subtypes) or
trajectories (e.g., cell differentiation pathways) from the data.
However, single-cell experimental designs are becoming increasingly complex,
with data now often collected across numerous experimental conditions to characterize
libraries of drugs, pools of CRISPR knockdowns, or groups of patients undergoing
clinical trials (12, 13, 38–42). The challenge in these experiments is to characterize the
ways in which not only individual cells but also multicellular experimental conditions
vary. Comparing single-cell experimental conditions (e.g., distinct perturbation
conditions or patient samples) is challenging, as each condition is itself high-dimensional,
comprised of a heterogeneous population of cells with each cell characterized by many
gene measurements. Each “datapoint” in such settings should then be a patient or an
experimental condition, which has a collection of measurements associated with it instead
of a single measurement. In this setting, a datapoint is no longer a single set of features
(i.e., a vector) but a collection of observations, each containing its own features (i.e., a
two-dimensional matrix). Thus, existing techniques can no longer be directly applied for
analysis in a straightforward manner.
While two prior studies presented approaches to comparing two single-cell
biospecimens (43, 44), no existing methods address the problem of simultaneously
relating many biospecimens and identifying biologically meaningful ways in which they
13
differ. In this work, Phenotypic Earth Mover’s Distance “PhEMD” is presented as a novel
“manifold-of-manifolds” approach to mapping the key axes of variation among a large
set of experimental conditions. PhEMD first leverages the observation that the structure
of a single-cell experimental condition (i.e., multicellular biospecimen) can be well
represented as a low-dimensional manifold (i.e., cell-state embedding) using techniques
such as PHATE or diffusion maps. In this first-level manifold, individual datapoints
represent cells, and distances between cells represent cell-to-cell dissimilarity. PhEMD
models the cellular state space of each experimental condition as a “low-level” manifold
and then models the experimental condition state space as a “higher-level” manifold. The
ultimate goal of PhEMD is to generate this higher-level manifold, in which each
datapoint represents a distinct experimental condition and distances between points
represent biospecimen-to-biospecimen dissimilarity. In this work, the properties and
potentially applications of this final higher-level manifold are explored in depth. Namely,
the manifold can be visualized and clustered to reveal the key axes of variation among a
large set of experimental conditions. Such embeddings can also be extended with
additional data sources to impute experimental conditions not directly measured with
single-cell technologies.
The accuracy of PhEMD is first validated on a synthetic dataset with known
underlying data geometry. PhEMD is then applied to a newly-generated single-cell
dataset to reveal insights into cancer progression and cancer drug-perturbation effects.
Specifically, the dataset represents a large perturbation screen performed on breast cancer
cells undergoing TGFb1-induced epithelial-to-mesenchymal transition (EMT), measured
at single-cell resolution with mass cytometry. EMT is a process that is thought to play a
14
role in cancer metastasis, whereby polarized epithelial cells within a local tumor undergo
specific biochemical changes that result in cells with increased migratory capacity,
invasiveness, and other characteristics consistent with the mesenchymal phenotype (45).
In the drug-screen experiment, each perturbation condition consists of cells from the
Py2T breast cancer cell line stimulated simultaneously with TGF-b1 (to undergo EMT)
and a unique kinase inhibitor, with the ultimate goal being to compare the effects of
different inhibitors on our model EMT system. PhEMD is used to embed the space of the
kinase inhibitors to reveal the main axes of variation among all inhibitors with respect to
their effects on breast cancer cells undergoing EMT. Reproducibility of results is assessed
through three biological replicates. Additionally, the drug-effect findings are further
validated by showing that they are consistent with the drug-effect findings of a previously
published study that profiled the drug-target binding specificities of several of the same
drugs as those used in the present drug-screen experiment.
PhEMD is also applied to two distinct single-cell datasets to reveal insights into
variation in the immune cell infiltrate of solid tumors profiled with single-cell resolution.
The first dataset consists of a collection of 17 melanoma samples profiled using single-
cell RNA-sequencing (scRNA-seq) and the second is comprised of 75 clear-cell renal cell
carcinoma (ccRCC) samples profiled using mass cytometry. These experiments yield a
low-dimensional map of patient tumors that highlights profound inter-tumoral
heterogeneity with respect to tumor-infiltrating immune cells, demonstrating the potential
utility of PhEMD for disease subtyping and patient stratification (e.g., for immunotherapy
trials or clinical outcomes studies). Collectively, the analyses present a new generalizable
analytical framework for organizing single-cell data and reveal new potential strategies
15
for identifying effective cancer therapies.
Hypothesis
A “manifold-of-manifolds” approach to modeling multi-specimen single-cell data can
accurately identify axes of variation among biospecimens and simultaneously reveal
insights into both intra- and inter-specimen heterogeneity.
Specific Aims
Aim 1: Develop a robust tool for uncovering axes of variation among single-cell
biospecimens
Aim 2: Characterize the differing effects of 233 small-molecule inhibitors on breast
cancer epithelial–mesenchymal transition (EMT)
Aim 3: Characterize the immune cell subpopulational variation among melanomas and
among clear-cell renal cell carcinomas (ccRCCs)
16
MATERIALS AND METHODS
The PhEMD analytical approach
In single-cell data, each cell is characterized by a set of features, such as protein or
transcript expression levels of genes. The purpose of measuring these expression-based
features for each cell (e.g., via single-cell RNA-seq or mass cytometry) is to answer
biological questions especially related to the cell subpopulations present in a
biospecimen. In particular, the features may be used for defining phenotypes of cells (38,
46), resolving cellular dynamics using transition-process modeling (32, 33, 47), and
studying signaling networks (18, 48). In sum, the features are shared, quantitative
characteristics of cells that may be used to organize a set of cells into a data geometry. An
analogy can be made when attempting to compare single-cell specimens rather than
individual cells. A biospecimen is a collection of cells. In order to compare single-cell
biospecimens for the purpose of organizing a set of cell collections (e.g., different patient
specimens or perturbation conditions), one must first determine useful features for a cell
collection. Previous studies have shown that cell subtypes are highly useful features that
are shared across all specimens and can be quantitatively measured (46, 49). Moreover,
they can be used to represent single-cell specimens efficiently for downstream analyses.
Just as transcript counts can be measured for selected genes in a single cell, so can cell
counts be measured for selected cell subtypes in a cell collection.
In the present work, PHATE is used for the task of defining cell subtypes (37).
PHATE is a diffusion-based single-cell dimensionality reduction technique that both
identifies unique cell subpopulations and relates them to one another on a low-
dimensional manifold that can be visualized. Of note, PHATE preserves an information