Learn how to detect tumor cells, infer chromosomal copy number changes, and uncover subclonal structure directly from single-cell RNA-seq data — no matched DNA sequencing required
Introduction: Reading the Cancer Genome Through Gene Expression
What Is Copy Number Variation and Why Does It Matter in Cancer?
If you have followed this tutorial series, you have built a rich picture of your single-cell data: you have processed raw reads into count matrices (Part 1), filtered low-quality cells (Part 2), integrated samples and identified clusters (Part 3), annotated cell types (Part 4), and explored differential expression and cell-cell communication (Parts 5-10). You know who the cells are, what they express, and how they signal to each other.
But in cancer biology, there is a deeper layer of information hiding in your data: the genome itself.
In a healthy diploid cell, nearly every autosomal gene is present in exactly two copies, one inherited from each parent. In cancer cells, chromosomal instability causes large genomic blocks to be amplified (gaining extra copies) or deleted (losing copies). These changes are called copy number variations (CNVs). For example:
- A gain on chromosome 7 means tumor cells may carry 3-5 copies of EGFR, fueling growth signaling.
- A deletion on chromosome 17p removes a copy of TP53, disabling a key tumor suppressor.
- Whole-arm gains or losses are the hallmark of aneuploidy — the defining genomic feature of most solid tumors.
CopyKAT (Copy number Karyotyping of Tumors) detects these large-scale genomic alterations directly from your scRNA-seq count matrix, without any matched DNA sequencing.
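To make the link between copy number and expression concrete, here is a small worked calculation. It assumes, as an idealization, that a gene's expression scales linearly with its copy number, so a region with n copies shifts expression by log2(n/2) relative to a diploid cell. Real cells deviate from this because of gene regulation, so treat the numbers as a rough guide only.

```r
# Idealized assumption: expression is proportional to copy number
copies <- c(loss = 1, diploid = 2, gain_3 = 3, gain_4 = 4, gain_5 = 5)

# Expected log2 fold change relative to the diploid state (2 copies)
expected_log2fc <- log2(copies / 2)
print(round(expected_log2fc, 2))
```

Under this assumption a single-copy loss gives a shift of -1, while a single-copy gain gives only about +0.58. Shifts this small are hard to see in a single noisy gene, which is why inference relies on many neighboring genes at once.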
Why CopyKAT? The Case for a Modern Tool
If you have read older scRNA-seq papers, you may have seen CNV analysis performed with inferCNV from the Broad Institute. We use CopyKAT in this tutorial for a concrete practical reason: inferCNV is deprecated. The developers have archived the repository and explicitly recommend seeking alternative tools for scRNA-seq CNV calling. CopyKAT is the leading actively maintained replacement, with several key advantages over the now-archived inferCNV:
| Feature | CopyKAT | inferCNV (deprecated) |
|---|---|---|
| Maintenance status | Actively maintained | Archived; no longer supported |
| Reference cells | Automatically detected via GMM | Must be manually specified |
| Subclone detection | Built in | Requires separate post-hoc steps |
| Speed | Faster (minutes to hours) | Slower (hours to days) |
| Output | Binary aneuploid/diploid label + CNV matrix | CNV scores only |
A note of caution: CopyKAT predictions are probabilistic inferences from expression data, not direct DNA measurements. Concordance with whole-genome sequencing is approximately 80%. Always visually validate predictions against the CNV heatmap, and always cross-reference with your cell type annotation to distinguish true positives from false positives. We cover this in detail in the interpretation section.
How Does CopyKAT Work?
The key insight behind CopyKAT is elegant: the expression levels of physically adjacent genes are correlated with the underlying genomic copy number. If a large chromosomal region is amplified, all the genes in that region will, on average, be expressed at higher levels. If a region is deleted, expression drops. By smoothing expression across genomic windows and comparing each cell’s profile to a diploid baseline, CopyKAT infers copy number states at approximately 5 Mb resolution.
Here is the logic in three steps:
- Order genes by chromosomal position — genes are arranged along each chromosome in genomic order.
- Average expression within sliding windows — each ~220 kb bin is summarized to reduce the noise from individual gene dropouts.
- Compare each cell to a diploid reference — cells that deviate from the diploid baseline across many chromosomal regions are classified as aneuploid (tumor); cells with flat profiles are classified as diploid (normal).
CopyKAT identifies the diploid reference cells automatically using a Gaussian Mixture Model (GMM). It then clusters cells hierarchically by their inferred CNV profiles to reveal tumor subclones.
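The smoothing idea in the three steps above can be sketched on simulated data. The example below is not CopyKAT's own smoothing procedure. It swaps in a plain moving average over 25 position-ordered genes, and every value in it (gene count, noise level, location of the gained block) is a hypothetical choice made for illustration.

```r
set.seed(1)
n_genes <- 300

# Hypothetical toy data: log-ratio of one cell versus a diploid baseline,
# with genes already sorted by genomic position
log_ratio <- rnorm(n_genes, mean = 0, sd = 0.8)

# Simulate a one-copy gain (3 copies instead of 2) across genes 101-200
gain_block <- 101:200
log_ratio[gain_block] <- log_ratio[gain_block] + log2(3 / 2)

# Stand-in smoother: centered moving average over 25 consecutive genes
win <- 25
smoothed <- stats::filter(log_ratio, rep(1 / win, win), sides = 2)

# Compare the smoothed signal inside and outside the gained block
mean_in  <- mean(smoothed[gain_block], na.rm = TRUE)
mean_out <- mean(smoothed[-gain_block], na.rm = TRUE)
print(c(inside = mean_in, outside = mean_out))
```

Per-gene noise here is larger than the simulated gain, yet the windowed average separates the gained block from the rest of the chromosome. That is the reason large alterations are detectable from expression while focal ones usually are not.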
Where Does CopyKAT Fit in the Single-Cell Pipeline?
Raw FASTQ reads
|
v
Cell Ranger (count matrix) [Part 1]
|
v
Quality control & filtering [Part 2]
|
v
Integration & clustering [Part 3]
|
v
Cell type annotation [Part 4]
|
v
+-----------------------------------------------+
| CopyKAT (this tutorial) |
| Input: raw UMI count matrix (one sample) |
| Output: tumor/diploid labels per cell |
| CNV matrix (genomic bins x cells) |
| hierarchical clustering (subclones) |
+-----------------------------------------------+
|
v
Refined annotation + downstream analysis
CopyKAT runs on raw UMI counts — before normalization, scaling, or batch correction. Its labels feed back into your Seurat object as new metadata.
Example Dataset: GSE131907 (Human Lung Adenocarcinoma)
Study Design and Biological Questions
This tutorial uses GSE131907, a landmark publicly available single-cell RNA-seq dataset accompanying the study “Single cell RNA sequencing demonstrates the molecular and cellular reprogramming of metastatic lung adenocarcinoma” (Kim et al., Nature Communications, 2020).
Background: This dataset profiles 208,506 cells from 44 patients across 58 samples, covering primary lung tumors, adjacent normal lung, lymph node and brain metastases, and pleural effusions. All primary tumor and normal lung samples were obtained from patients undergoing surgery without prior treatment, making the genomic profiles of tumor cells unconfounded by therapy-induced changes.
Organism: Homo sapiens (human). All downstream analysis uses the hg20 (hg38) human reference genome.
Pre-annotated cell types:
| Cell Type | Total cells |
|---|---|
| T lymphocytes | 79,676 |
| Myeloid cells | 42,245 |
| Epithelial cells | 36,467 |
| B lymphocytes | 27,657 |
| NK cells | 11,551 |
| Fibroblasts | 4,172 |
| MAST cells | 3,396 |
| Endothelial cells | 1,996 |
Two-Sample Design: Tumor vs. Normal
This tutorial runs CopyKAT on two samples side by side: a primary tumor sample to detect aneuploid cancer cells, and a matched normal lung sample as a negative control. Running both with identical parameters is critical — it teaches you what genuine CNV signal looks like versus background noise.
| Sample | Origin | Cells | Epithelial cells | Expected CopyKAT result |
|---|---|---|---|---|
| LUNG_T06 | Primary lung tumor | 3,426 | 145 (4.2%) | Mixed: aneuploid (epithelial) + diploid (immune/stromal) |
| LUNG_N06 | Normal lung tissue | 2,839 | 178 (6.3%) | Predominantly diploid across all cell types |
Cell type composition:
LUNG_T06 (primary tumor):
| Cell Type | Count |
|---|---|
| T lymphocytes | 1,499 |
| Myeloid cells | 588 |
| B lymphocytes | 521 |
| Fibroblasts | 485 |
| Epithelial cells | 145 |
| MAST cells | 94 |
| Endothelial cells | 54 |
| NK cells | 40 |
LUNG_N06 (normal lung):
| Cell Type | Count |
|---|---|
| Myeloid cells | 1,310 |
| T lymphocytes | 1,107 |
| Epithelial cells | 178 |
| NK cells | 166 |
| MAST cells | 30 |
| B lymphocytes | 29 |
| Fibroblasts | 10 |
| Endothelial cells | 9 |
Note on tumor purity: LUNG_T06 contains only 145 epithelial cells out of 3,426 total (~4.2%). This is a low-purity sample in terms of malignant cell content, which is common in real clinical biopsies where immune infiltration can dominate. CopyKAT handles this well precisely because the abundant immune and stromal cells provide strong diploid reference signal.
Software and Package Versions
- R: 4.4.x
- Seurat: 5.x (layer-based architecture; LayerData() required)
- data.table: 1.15.x
- CopyKAT: 1.1.0 (from GitHub)
- Genome reference: hg20 (hg38, human)
Required Input Files
Download both files from GEO accession GSE131907:
| File | Description |
|---|---|
| GSE131907_Lung_Cancer_cell_annotation.txt | Cell-level metadata for all 208,506 cells |
| GSE131907_Lung_Cancer_raw_UMI_matrix.rds | Full raw count matrix as a sparse R matrix (several GB) |
Part 1: Preparing Input Data
Step 1.1 — Install and Load Libraries
library(Seurat)
library(data.table)
library(copykat) # installed from GitHub; see "Installing CopyKAT" in Part 2
library(ggplot2)
library(patchwork)
# Create all output directories upfront
dir.create("data/", showWarnings = FALSE, recursive = TRUE)
dir.create("results/", showWarnings = FALSE, recursive = TRUE)
dir.create("plots/", showWarnings = FALSE, recursive = TRUE)
results_dir <- "results/"
plots_dir <- "plots/"
Step 1.2 — Load Cell Metadata
# Load the full metadata for all 208,506 cells
meta <- fread("GSE131907_Lung_Cancer_cell_annotation.txt", data.table = FALSE)
# The first column ("Index") contains unique cell identifiers
rownames(meta) <- meta[[1]]
The metadata contains these key columns:
| Column | Description |
|---|---|
| Index | Unique cell identifier (barcode + sample ID) |
| Barcode | Raw 10x Genomics barcode |
| Sample | Sample ID (e.g., LUNG_T06, LUNG_N06) |
| Sample_Origin | Tissue of origin (tLung, nLung, etc.) |
| Cell_type | Broad cell type label (8 categories) |
| Cell_type.refined | Finer cell type annotation |
| Cell_subtype | Finest-level annotation |
Step 1.3 — Load the Raw Count Matrix
# Load the full raw count matrix
exprmx <- readRDS("GSE131907_Lung_Cancer_raw_UMI_matrix.rds")
Step 1.4 — Align Metadata and Count Matrix
# Reorder metadata rows to match the column order of the expression matrix
meta <- meta[colnames(exprmx), ]
# This MUST return TRUE before proceeding.
identical(rownames(meta), colnames(exprmx))
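Indexing a data frame by row names that do not exist returns rows filled with NA rather than an error, so it is worth checking for missing cells before reordering. A minimal sketch of that guard on toy objects follows; exprmx_toy and meta_toy are hypothetical stand-ins for the real matrix and metadata.

```r
# Hypothetical toy objects standing in for the real matrix and metadata
exprmx_toy <- matrix(
  0, nrow = 2, ncol = 3,
  dimnames = list(c("GeneA", "GeneB"), c("c1", "c2", "c3"))
)
meta_toy <- data.frame(
  Sample = c("S2", "S1", "S1"),
  row.names = c("c2", "c1", "c3")
)

# Guard 1: every matrix column must have a metadata row
missing_cells <- setdiff(colnames(exprmx_toy), rownames(meta_toy))
stopifnot(length(missing_cells) == 0)

# Reorder metadata to the matrix column order
meta_toy <- meta_toy[colnames(exprmx_toy), , drop = FALSE]

# Guard 2: stop the script instead of only printing TRUE or FALSE
stopifnot(identical(rownames(meta_toy), colnames(exprmx_toy)))
```

Wrapping the check in stopifnot() halts a non-interactive script at the point of misalignment, which is easier to debug than a mislabeled cell discovered later.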
Step 1.5 — Create the Seurat Object and Subset
We create one Seurat object from the full dataset, then subset to the two samples. This approach loads the large RDS file only once.
# Create the full Seurat object
seu_obj <- CreateSeuratObject(
counts = exprmx,
meta.data = meta,
project = "GSE131907"
)
# Subset to the primary tumor and normal lung samples
seu_obj_lnt06 <- subset(seu_obj, subset = Sample == "LUNG_T06")
seu_obj_lnn06 <- subset(seu_obj, subset = Sample == "LUNG_N06")
# Free the full object from memory -- it is no longer needed
rm(seu_obj, exprmx)
gc()
Step 1.6 — Extract Raw Counts
CopyKAT requires raw UMI counts — never normalized, scaled, or batch-corrected data. In Seurat 5, raw counts are accessed with LayerData():
# Extract raw counts for each sample
raw_lnt06 <- LayerData(seu_obj_lnt06, assay = "RNA", layer = "counts")
raw_lnn06 <- LayerData(seu_obj_lnn06, assay = "RNA", layer = "counts")
dim(raw_lnt06) # [1] 29634 3426
dim(raw_lnn06) # [1] 29634 2839
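These matrices are sparse, but Step 2.1 converts them to dense form with as.matrix(). A quick back-of-the-envelope estimate of the dense size helps confirm the conversion will fit in memory. The sketch assumes 8 bytes per value (double storage) and uses the LUNG_T06 dimensions shown above.

```r
# Dimensions of the LUNG_T06 count matrix reported by dim()
n_genes <- 29634
n_cells <- 3426

# Assumption: dense storage as doubles, 8 bytes per entry
bytes_dense <- n_genes * n_cells * 8
gib_dense <- bytes_dense / 1024^3
print(round(gib_dense, 2)) # about 0.76 GiB for this sample
```

The same arithmetic applied to the full 208,506-cell matrix gives a far larger figure, which is one reason to subset to a single sample before converting.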
Part 2: Running CopyKAT
Installing CopyKAT
if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
devtools::install_github("navinlabcode/copykat")
Understanding the Key Parameters
| Parameter | Default | What it controls |
|---|---|---|
| rawmat | (none) | Raw count matrix (required) |
| id.type | "S" | Gene ID type: "S" = gene symbol, "E" = Ensembl ID |
| genome | "hg20" | Genome build: "hg20" (human, hg38) or "mm10" (mouse) |
| ngene.chr | 5 | Min genes per chromosome for a cell to be analyzed |
| win.size | 25 | Genes per genomic smoothing window (range: 15-150) |
| KS.cut | 0.1 | Segmentation sensitivity, between 0 and 1: higher values give fewer breakpoints (typical range: 0.05-0.15) |
| LOW.DR | 0.05 | Min fraction of cells a gene must be detected in to be used for smoothing |
| UP.DR | 0.1 | Min fraction of cells a gene must be detected in to be used for segmentation; must be larger than LOW.DR |
| distance | "euclidean" | Clustering distance: "euclidean", "pearson", or "spearman" |
| norm.cell.names | "" | Optional: known diploid cell barcodes as a character vector |
| cell.line | "no" | Set "yes" for pure cell line data (no normal cells present) |
| n.cores | 1 | Number of CPU cores for parallel computation |
| sam.name | "" | Output file prefix |
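Both LOW.DR and UP.DR are thresholds on a gene's detection rate, meaning the fraction of cells in which the gene has a nonzero count. The sketch below computes that rate on a simulated count matrix and counts how many genes clear each default threshold. The matrix is hypothetical toy data, and the code only illustrates the quantity being thresholded, not CopyKAT's internal filtering.

```r
set.seed(42)
n_genes <- 200
n_cells <- 100

# Hypothetical toy counts: four groups of 50 genes with increasing abundance.
# The rate vector has one entry per gene and recycles down each column.
gene_rate <- rep(c(0.01, 0.08, 0.5, 2), each = 50)
toy_counts <- matrix(
  rpois(n_genes * n_cells, lambda = gene_rate),
  nrow = n_genes, ncol = n_cells
)

# Detection rate: fraction of cells with at least one count for each gene
detection_rate <- rowMeans(toy_counts > 0)

# Number of genes clearing each default threshold
n_pass_low <- sum(detection_rate > 0.05) # LOW.DR default
n_pass_up  <- sum(detection_rate >= 0.1) # UP.DR default
print(c(above_LOW.DR = n_pass_low, above_UP.DR = n_pass_up))
```

Running the same two lines on your own count matrix shows how many genes a given threshold would retain, which is useful before lowering the values for a shallowly sequenced sample.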
Step 2.1 — Run CopyKAT on Both Samples
# -------------------------------------------------------------------
# Run 1: Primary tumor (LUNG_T06)
# -------------------------------------------------------------------
ck_tumor <- copykat(
rawmat = as.matrix(raw_lnt06), # convert sparse to dense matrix
id.type = "S", # HGNC gene symbols
genome = "hg20", # human hg38 annotation
ngene.chr = 5, # min 5 genes per chromosome per cell
win.size = 25, # genes per smoothing window
KS.cut = 0.1, # segmentation sensitivity
LOW.DR = 0.05, # min detection fraction for genes used in smoothing
UP.DR = 0.1, # min detection fraction for genes used in segmentation
distance = "euclidean", # hierarchical clustering distance
cell.line = "no", # tissue sample with mixed cell types
n.cores = 4,
sam.name = file.path(results_dir, "LUNG_T06")
)
# -------------------------------------------------------------------
# Run 2: Normal lung (LUNG_N06) -- negative control
# -------------------------------------------------------------------
ck_normal <- copykat(
rawmat = as.matrix(raw_lnn06),
id.type = "S",
genome = "hg20",
ngene.chr = 5,
win.size = 25,
KS.cut = 0.1,
LOW.DR = 0.05,
UP.DR = 0.1,
distance = "euclidean",
cell.line = "no",
n.cores = 4,
sam.name = file.path(results_dir, "LUNG_N06")
)
Note on "WARNING! NOT CONVERGENT!": You will likely see many of these warnings. Each one means the Gaussian Mixture Model’s EM algorithm reached its maximum number of iterations before fully converging for some cells. CopyKAT still produces results, but classification confidence is reduced for those cells, and they are more likely to end up as not.defined. The warning is common in samples with complex cell type mixtures and is not a sign that something went wrong.
Step 2.2 — Output Files and Object Structure
After each run, the following files are written to results/:
results/LUNG_T06_copykat_prediction.txt # per-cell labels
results/LUNG_T06_copykat_CNA_results.txt # CNV values per 220kb bin
results/LUNG_T06_copykat_CNA_raw_results_gene_by_cell.txt # raw gene-level CNVs
results/LUNG_T06_copykat_heatmap.jpeg # auto-generated heatmap
results/LUNG_T06_copykat_with_genes_heatmap.pdf # heatmap with gene labels
results/LUNG_T06_copykat_clustering_results.rds # hierarchical clustering object
results/LUNG_N06_copykat_prediction.txt
results/LUNG_N06_copykat_CNA_results.txt
results/LUNG_N06_copykat_CNA_raw_results_gene_by_cell.txt
results/LUNG_N06_copykat_heatmap.jpeg
results/LUNG_N06_copykat_with_genes_heatmap.pdf
results/LUNG_N06_copykat_clustering_results.rds
The R objects returned by copykat() contain three key components:
| Slot | Content | How you use it |
|---|---|---|
| $prediction | Data frame: cell barcode + copykat.pred label | Add to Seurat metadata for visualization |
| $CNAmat | Matrix: leading columns = genomic coordinates (chrom, chrompos, abspos); remaining columns = per-cell CNV values | Heatmap and subclone analysis |
| $hclustering | Hierarchical clustering of CNV profiles | Dendrogram cutting for subclone definition |
The three possible labels in $prediction$copykat.pred:
- "aneuploid": predicted malignant cell; high CNV burden.
- "diploid": predicted normal cell; flat CNV profile.
- "not.defined": CopyKAT could not make a confident call. Treat as ambiguous.
Part 3: Interpreting CopyKAT Results
Step 3.1 — Add CopyKAT Labels to Each Seurat Object
# --- LUNG_T06 ---
pred_tumor <- ck_tumor$prediction
colnames(pred_tumor)[colnames(pred_tumor) == "copykat.pred"] <- "copykat_label"
seu_obj_lnt06$copykat_label <- pred_tumor$copykat_label[
match(colnames(seu_obj_lnt06), pred_tumor$cell.names)
]
seu_obj_lnt06$copykat_label[is.na(seu_obj_lnt06$copykat_label)] <- "not.defined"
# --- LUNG_N06 ---
pred_normal <- ck_normal$prediction
colnames(pred_normal)[colnames(pred_normal) == "copykat.pred"] <- "copykat_label"
seu_obj_lnn06$copykat_label <- pred_normal$copykat_label[
match(colnames(seu_obj_lnn06), pred_normal$cell.names)
]
seu_obj_lnn06$copykat_label[is.na(seu_obj_lnn06$copykat_label)] <- "not.defined"
# Check label distributions
table(seu_obj_lnt06$copykat_label)
table(seu_obj_lnn06$copykat_label)
Expected output:
LUNG_T06:
aneuploid diploid not.defined
1063 1893 470
LUNG_N06:
aneuploid diploid not.defined
1097 1502 240
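Because the two samples contain different numbers of cells, raw counts are hard to compare directly. A short calculation converts the label counts printed above into percentages; the numbers are copied from that output.

```r
# Label counts copied from the table() output above
labels_t06 <- c(aneuploid = 1063, diploid = 1893, not.defined = 470)
labels_n06 <- c(aneuploid = 1097, diploid = 1502, not.defined = 240)

# Percentage of cells per label within each sample
pct_t06 <- round(100 * labels_t06 / sum(labels_t06), 1)
pct_n06 <- round(100 * labels_n06 / sum(labels_n06), 1)

print(rbind(LUNG_T06 = pct_t06, LUNG_N06 = pct_n06))
```

On a live object, prop.table(table(seu_obj_lnt06$copykat_label)) gives the same proportions without retyping the counts.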
Step 3.2 — Generate UMAP Embeddings
process_for_umap <- function(seu) {
seu <- NormalizeData(seu, verbose = FALSE)
seu <- FindVariableFeatures(seu, verbose = FALSE)
seu <- ScaleData(seu, verbose = FALSE)
seu <- RunPCA(seu, verbose = FALSE)
seu <- RunUMAP(seu, dims = 1:20, verbose = FALSE)
seu
}
seu_obj_lnt06 <- process_for_umap(seu_obj_lnt06)
seu_obj_lnn06 <- process_for_umap(seu_obj_lnn06)
Step 3.3 — Visualize CopyKAT Labels on UMAP
copykat_colors <- c(
"aneuploid" = "#D62728",
"diploid" = "#1F77B4",
"not.defined" = "#AAAAAA"
)
p_t06_celltype <- DimPlot(seu_obj_lnt06, group.by = "Cell_type",
label = TRUE, repel = TRUE, pt.size = 0.4) +
labs(title = "LUNG_T06: Cell Types") +
theme_minimal(base_size = 12) + theme(legend.position = "none")
p_t06_copykat <- DimPlot(seu_obj_lnt06, group.by = "copykat_label",
cols = copykat_colors, pt.size = 0.4, alpha = 0.6) +
labs(title = "LUNG_T06: CopyKAT Labels") +
theme_minimal(base_size = 12)
p_n06_celltype <- DimPlot(seu_obj_lnn06, group.by = "Cell_type",
label = TRUE, repel = TRUE, pt.size = 0.4) +
labs(title = "LUNG_N06: Cell Types") +
theme_minimal(base_size = 12) + theme(legend.position = "none")
p_n06_copykat <- DimPlot(seu_obj_lnn06, group.by = "copykat_label",
cols = copykat_colors, pt.size = 0.4, alpha = 0.6) +
labs(title = "LUNG_N06: CopyKAT Labels") +
theme_minimal(base_size = 12)
png(file.path(plots_dir, "01_umap_copykat_comparison.png"),
width = 2400, height = 1600, res = 150)
print((p_t06_celltype | p_t06_copykat) / (p_n06_celltype | p_n06_copykat)) # explicit print() so the plot is drawn under source() too
dev.off()

How to interpret the 2×2 panel:
In the LUNG_T06 row (top), you will see aneuploid labels (red) concentrated in the fibroblast cluster and the myeloid cluster, as well as in part of the epithelial cluster. T lymphocytes and NK cells remain predominantly diploid (blue). This mixed picture is expected and is fully explained in Step 3.5.
You may also notice that the epithelial cells form more than one cluster on the UMAP (there are two to three small separated groups in the upper-right area), but only a subset of them is labeled red. This is biologically meaningful. The multiple epithelial sub-clusters likely represent different transcriptional states — for example, one cluster may contain more proliferating tumor cells with strong CNV signal while another contains more quiescent or differentiated epithelial cells whose expression profile does not deviate enough from the diploid baseline for CopyKAT to classify them as aneuploid. CopyKAT classifies cells based on genome-wide expression variance relative to the baseline, not based on UMAP proximity, so neighboring clusters can receive different labels.
Note on reproducibility: CopyKAT uses a Gaussian Mixture Model with random initialization. Re-running the same data may produce modestly different aneuploid/diploid assignments, particularly for cells near the classification boundary. The overall pattern should be consistent across runs, but exact cell counts may vary.
In the LUNG_N06 row (bottom), the myeloid cluster appears predominantly red just as in the tumor sample, while T cells, NK cells, and epithelial cells are largely blue. This confirms the myeloid aneuploid signal is a systematic artifact rather than a tumor-specific finding. Step 3.5 explains why.
Step 3.4 — Validate Against the Pre-Existing Cell Type Annotation
This is the most important step. Because GSE131907 comes with a Cell_type annotation, we can directly measure how well CopyKAT’s labels align with the expected biology.
# Aneuploid fraction per cell type -- LUNG_T06
aneuploid_by_celltype_tumor <- sort(
tapply(seu_obj_lnt06$copykat_label == "aneuploid",
seu_obj_lnt06$Cell_type, mean, na.rm = TRUE),
decreasing = TRUE
)
print(round(aneuploid_by_celltype_tumor, 3))
# Aneuploid fraction per cell type -- LUNG_N06
aneuploid_by_celltype_normal <- sort(
tapply(seu_obj_lnn06$copykat_label == "aneuploid",
seu_obj_lnn06$Cell_type, mean, na.rm = TRUE),
decreasing = TRUE
)
print(round(aneuploid_by_celltype_normal, 3))
Actual output:
LUNG_T06:
Fibroblasts Myeloid cells Epithelial cells Endothelial cells
0.893 0.871 0.559 0.278
MAST cells T lymphocytes B lymphocytes NK cells
0.021 0.010 0.010 0.000
LUNG_N06:
Myeloid cells Epithelial cells B lymphocytes Endothelial cells
0.836 0.011 0.000 0.000
Fibroblasts MAST cells NK cells T lymphocytes
0.000 0.000 0.000 0.000
Step 3.5 — Understanding the Results: True Signals vs. False Positives
The results above reveal something important: CopyKAT produces both true signals and systematic false positives in certain cell types. This is not a sign that the tool has failed — it is a known behavior that every practitioner needs to understand and account for. Let us work through the results cell type by cell type.
Myeloid cells — a systematic false positive in both samples:
The most revealing finding is that myeloid cells show 87.1% aneuploid in LUNG_T06 and 83.6% aneuploid in LUNG_N06 (normal tissue). Normal myeloid cells in healthy lung tissue cannot be 84% aneuploid — this is biologically impossible. This is a false positive driven by cell-type-specific expression patterns. Tissue-resident myeloid cells, particularly alveolar macrophages, have a highly distinctive transcriptional profile compared to circulating monocytes or other immune cells. Their unique gene expression program, spread across the genome, creates systematic expression deviations that CopyKAT’s algorithm interprets as copy number changes. Because this pattern appears in both samples equally, it is definitively an artifact, not a biological CNV signal.
Fibroblasts in LUNG_T06 — likely cancer-associated fibroblasts:
Fibroblasts are 89.3% aneuploid in LUNG_T06 but 0% in LUNG_N06. Note that LUNG_N06 contains only 10 fibroblasts, so the normal-tissue estimate rests on very few cells. This is a more nuanced situation than the myeloid case. The fibroblasts in a tumor microenvironment are largely cancer-associated fibroblasts (CAFs), cells that have been transcriptionally reprogrammed by the tumor to support its growth. CAFs have dramatically altered gene expression compared to normal fibroblasts, and these widespread expression changes can create spurious CNV-like patterns in CopyKAT. Whether a small fraction are genuinely aneuploid tumor cells that express fibroblast-like markers is impossible to determine from CopyKAT alone and would require orthogonal validation.
Epithelial cells — the credible signal:
Epithelial cells show 55.9% aneuploid in LUNG_T06 and only 1.1% in LUNG_N06. This is a biologically credible result: approximately half of the tumor epithelial cells carry detectable CNV burden, while normal lung epithelial cells are largely diploid. This differential between tumor and normal (55.9% vs. 1.1%) is the signal we can trust.
Why does LUNG_N06 appear to have more aneuploid cells overall than LUNG_T06?
LUNG_N06 has 1,310 myeloid cells (vs. 588 in LUNG_T06). Since ~84% of myeloid cells are false-positive aneuploid in both samples, this larger myeloid compartment in LUNG_N06 produces more total aneuploid calls (roughly 1,095 cells from myeloid alone), making LUNG_N06 appear to have more aneuploidy than the tumor sample. This is an artifact of sample composition, not biology.
The key lesson: Raw CopyKAT output must always be cross-referenced with your cell type annotation. The aneuploid labels for myeloid cells and, in this case, fibroblasts should not be interpreted as evidence of malignancy.
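The comparison behind this lesson can be laid out as a small table. The sketch below re-enters the aneuploid fractions reported in Step 3.4, computes the tumor-minus-normal difference per cell type, and flags cell types that look aneuploid even in normal tissue. The 0.5 cutoff for the flag is a hypothetical choice, not a CopyKAT setting.

```r
# Aneuploid fractions copied from the Step 3.4 output
fractions <- data.frame(
  cell_type = c("Fibroblasts", "Myeloid cells", "Epithelial cells",
                "Endothelial cells", "MAST cells", "T lymphocytes",
                "B lymphocytes", "NK cells"),
  tumor  = c(0.893, 0.871, 0.559, 0.278, 0.021, 0.010, 0.010, 0.000),
  normal = c(0.000, 0.836, 0.011, 0.000, 0.000, 0.000, 0.000, 0.000)
)

# Tumor-specific signal: difference between tumor and normal fractions
fractions$delta <- fractions$tumor - fractions$normal

# Hypothetical cutoff: flag cell types called aneuploid in normal tissue
fractions$artifact_in_normal <- fractions$normal > 0.5

# Expected myeloid aneuploid calls in LUNG_N06 (1,310 myeloid cells)
myeloid_calls_n06 <- round(0.836 * 1310)

print(fractions[order(fractions$delta, decreasing = TRUE), ])
```

The flag catches myeloid cells but not fibroblasts, because LUNG_N06 holds too few fibroblasts to expose the artifact. A negative control can only vouch for cell types it actually contains in reasonable numbers.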
Step 3.6 — Reading the CNV Heatmaps

CopyKAT automatically saves a heatmap for each sample. Open both side by side:
- results/LUNG_T06_copykat_heatmap.jpeg
- results/LUNG_N06_copykat_heatmap.jpeg
How to read the CopyKAT CNV heatmap (human hg20 format):
Axes:
Rows = individual cells (one row per cell), ordered by hierarchical
clustering of CNV profiles
Columns = ~220 kb genomic bins, ordered from chr1 to chrX
Side bar (orange) = pred.aneuploid cells
Side bar (green) = pred.diploid cells
Color scale:
RED = chromosomal GAIN vs. diploid baseline
BLUE = chromosomal LOSS vs. diploid baseline
WHITE = no significant deviation from diploid
What you will observe in LUNG_T06:
The aneuploid rows (orange bar, top portion) show scattered red/blue
patterns across chromosomes, but the patterns are relatively noisy
rather than showing the large coherent arm-level blocks you would see
in a highly aneuploid cancer cell line. This reflects the heterogeneous
nature of the aneuploid cells -- a mix of true tumor epithelial cells
(genuine CNVs) and false-positive myeloid/fibroblast cells (noise).
What you will observe in LUNG_N06:
The aneuploid rows also show scattered patterns similar to LUNG_T06.
The diploid rows look relatively flat. Critically, the overall
heatmap does NOT look dramatically different from LUNG_T06 because
both samples' aneuploid fractions are dominated by the same myeloid
false positive signal.
What "data quality is ok" means:
CopyKAT prints this message when it determines that the input data
passed internal quality thresholds and the analysis proceeded normally.
It does not indicate that all predictions are correct.
Step 3.7 — Quality Checks Before Proceeding
- Epithelial cells in LUNG_T06 show a substantially higher aneuploid fraction than in LUNG_N06 — this is the genuine signal.
- Myeloid cells show high aneuploid fractions in both samples — this is a known false positive artifact.
- T cells, NK cells, and B cells have near-zero aneuploid fractions in both samples — this is the expected behavior for lymphocytes.
- The not.defined fraction is 13.7% in LUNG_T06 (related to the GMM convergence warnings) and 8.5% in LUNG_N06.
Step 3.8 — Identifying Tumor Subclones
CopyKAT’s hierarchical clustering of CNV profiles reveals groups of cells that share the same pattern of chromosomal aberrations — these are subclones. Here we run the subclone analysis on all aneuploid cells, exactly as CopyKAT’s built-in workflow does, and then interpret what the resulting subclones represent by cross-referencing with the cell type annotation.
# Get barcodes of all aneuploid cells from the tumor sample
aneuploid_cells <- ck_tumor$prediction$cell.names[
ck_tumor$prediction$copykat.pred == "aneuploid"
]
# CNAmat starts with genomic coordinate columns (chrom, chrompos, abspos in
# CopyKAT 1.1.0), followed by one column per cell. Selecting columns by cell
# name avoids hard-coding that offset.
cna_cols <- intersect(aneuploid_cells, colnames(ck_tumor$CNAmat))
cna_tumor <- t(as.matrix(ck_tumor$CNAmat[, cna_cols])) # cells x bins
# Hierarchical clustering of all aneuploid cells by CNV profile
hc_tumor <- hclust(dist(cna_tumor, method = "euclidean"), method = "ward.D2")
# Cut the dendrogram into k subclones
# Start with k=2; inspect the heatmap for the number of visually distinct
# row blocks to guide your choice
k <- 2
subclones <- cutree(hc_tumor, k = k)
subclone_df <- data.frame(
cell.names = names(subclones),
subclone = paste0("Subclone_", subclones),
stringsAsFactors = FALSE
)
# Add subclone labels to Seurat object; non-aneuploid cells get "Non-tumor"
seu_obj_lnt06$subclone <- subclone_df$subclone[
match(colnames(seu_obj_lnt06), subclone_df$cell.names)
]
seu_obj_lnt06$subclone[is.na(seu_obj_lnt06$subclone)] <- "Non-tumor"
Important: Running subclone analysis on all aneuploid cells will produce subclones that reflect the dominant false-positive signal — in this dataset, the two “subclones” will correspond largely to the myeloid cell cluster vs. the fibroblast cluster, not to distinct tumor cell lineages. This is a feature of the raw output, not a bug, and the differential expression step below will make this explicit. For a publication-ready subclone analysis of true tumor subclones, you would filter to aneuploid epithelial cells only before clustering — we demonstrate this interpretation in Step 4.2.
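A quick way to see what a set of subclones represents is to cross-tabulate the cluster labels against cell type. The sketch below runs the same hclust() and cutree() calls on simulated CNV-like profiles for two hypothetical groups of cells. All data and names ending in _toy are invented for illustration.

```r
set.seed(7)
n_bins <- 50

# Two hypothetical mean profiles that differ across the genome
profile_a <- c(rep(0.3, 25), rep(0, 25))
profile_b <- c(rep(0, 25), rep(-0.3, 25))

# 20 noisy cells per profile; rows are cells, columns are genomic bins
cna_toy <- rbind(
  t(replicate(20, profile_a + rnorm(n_bins, sd = 0.05))),
  t(replicate(20, profile_b + rnorm(n_bins, sd = 0.05)))
)
rownames(cna_toy) <- paste0("cell", 1:40)
cell_type_toy <- rep(c("Fibroblasts", "Myeloid cells"), each = 20)

# Same clustering recipe as in the tutorial code
hc_toy <- hclust(dist(cna_toy, method = "euclidean"), method = "ward.D2")
subclone_toy <- cutree(hc_toy, k = 2)

# Cross-tabulate cluster membership against cell type
confusion <- table(subclone = subclone_toy, cell_type = cell_type_toy)
print(confusion)
```

When each cluster maps almost entirely onto one cell type, as in this toy case, the "subclones" are tracking cell identity rather than tumor lineage. On the real object, table(seu_obj_lnt06$subclone, seu_obj_lnt06$Cell_type) performs the same check.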
Step 3.9 — Visualize Subclones on UMAP
subclone_colors <- c(
"Subclone_1" = "#E41A1C",
"Subclone_2" = "#FF7F00",
"Non-tumor" = "#BBBBBB"
)
p_subclone <- DimPlot(seu_obj_lnt06, group.by = "subclone",
cols = subclone_colors, pt.size = 0.5, alpha = 0.7) +
labs(title = "LUNG_T06: CopyKAT Subclones",
color = "Subclone") +
theme_minimal(base_size = 13)
png(file.path(plots_dir, "02_umap_subclones_LUNG_T06.png"),
width = 900, height = 750, res = 150)
print(p_subclone) # explicit print() so the plot is drawn under source() too
dev.off()

Part 4: Integrating CopyKAT with Downstream Analysis
Step 4.1 — Refine Cell Type Annotation
Combine the pre-existing Cell_type annotation with CopyKAT’s labels to create a refined_annotation column. We prefix all aneuploid-labeled cells with "Malignant_" regardless of cell type. This preserves the raw CopyKAT output in the annotation and makes it easy to subset any cell type by malignancy status for downstream analysis.
# Prefix all aneuploid-labeled cells with "Malignant_"
# This includes myeloid and fibroblast false positives -- they are labeled
# honestly here so you can filter them out in downstream steps using Cell_type
seu_obj_lnt06$refined_annotation <- ifelse(
seu_obj_lnt06$copykat_label == "aneuploid",
paste0("Malignant_", seu_obj_lnt06$Cell_type),
seu_obj_lnt06$Cell_type
)
# In LUNG_N06: annotation stands as-is
seu_obj_lnn06$refined_annotation <- seu_obj_lnn06$Cell_type
table(seu_obj_lnt06$refined_annotation)
Expected output:
B lymphocytes Endothelial cells Epithelial cells
516 39 64
Fibroblasts Malignant_B lymphocytes Malignant_Endothelial cells
52 5 15
Malignant_Epithelial cells Malignant_Fibroblasts Malignant_MAST cells
81 433 2
Malignant_Myeloid cells Malignant_T lymphocytes MAST cells
512 15 92
Myeloid cells NK cells T lymphocytes
76 40 1484
The Malignant_Fibroblasts (433 cells) and Malignant_Myeloid cells (512 cells) entries reflect the false positive patterns discussed in Step 3.5. When using refined_annotation for downstream analyses such as cell-cell communication or differential expression, filter to Malignant_Epithelial cells as the biologically credible malignant population.
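The labeling and filtering logic can be tried on a small table before applying it to a full object. The sketch below uses a hypothetical four-cell data frame in place of the Seurat metadata. It also converts Cell_type with as.character(), a defensive step for the case where the column is stored as a factor, since ifelse() would otherwise return the factor's integer codes.

```r
# Hypothetical per-cell table standing in for the Seurat metadata
cells <- data.frame(
  Cell_type = factor(c("Epithelial cells", "Epithelial cells",
                       "Myeloid cells", "T lymphocytes")),
  copykat_label = c("aneuploid", "diploid", "aneuploid", "diploid")
)

# Guard against factor columns before calling ifelse()
cell_type_chr <- as.character(cells$Cell_type)

cells$refined_annotation <- ifelse(
  cells$copykat_label == "aneuploid",
  paste0("Malignant_", cell_type_chr),
  cell_type_chr
)

# Keep only the biologically credible malignant population
credible <- cells[cells$refined_annotation == "Malignant_Epithelial cells", ]
print(credible)
```

The myeloid cell still carries a Malignant_ prefix in the full table, but the final filter excludes it, which mirrors how the false positives are handled in this tutorial.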
Step 4.2 — Differential Expression Between Subclones Reveals Their Identity
Running DE between the two subclones is the most direct way to understand what they actually represent. We run it first on all aneuploid cells (the raw CopyKAT output), then interpret the results.
# Subset to all aneuploid cells
seu_tumor_only <- subset(seu_obj_lnt06, subset = copykat_label == "aneuploid")
Idents(seu_tumor_only) <- "subclone"
subclone_markers <- FindAllMarkers(
seu_tumor_only,
assay = "RNA",
only.pos = TRUE,
min.pct = 0.25,
logfc.threshold = 0.25
)
write.csv(subclone_markers,
file.path(results_dir, "LUNG_T06_subclone_markers.csv"),
row.names = FALSE)
# Top 10 markers per subclone
top10 <- do.call(rbind, lapply(
split(subclone_markers, subclone_markers$cluster),
function(df) head(df[order(df$avg_log2FC, decreasing = TRUE), ], 10)
))
print(top10[, c("cluster", "gene", "avg_log2FC", "p_val_adj")])

What the markers tell us:
The top markers for each subclone immediately reveal that neither represents a genuine tumor subclone:
Subclone_1 (EGFL6, MEG3, SGIP1, ECM2, ANGPTL2, SFRP4, TBX2, ASPN, FMOD, FAP) is a textbook cancer-associated fibroblast (CAF) signature:
- FAP (fibroblast activation protein) and ASPN (asporin) are canonical CAF markers routinely used to identify activated stromal fibroblasts in tumor tissue.
- FMOD (fibromodulin) and ECM2 are extracellular matrix proteins produced by fibroblasts.
- SFRP4 is a WNT signaling antagonist highly expressed in CAFs.
- ANGPTL2 and EGFL6 are secreted factors associated with stromal cell activity.
This “subclone” is not a tumor clone — it is the CAF population whose widespread transcriptional reprogramming was falsely classified as aneuploid by CopyKAT.
Subclone_2 (NLRP3, P2RY13, PILRA, CSF1R, C1QC, CD33, LILRB2, C5AR1, CLEC12A) is a textbook macrophage/myeloid signature:
- CSF1R (colony-stimulating factor 1 receptor) and CD33 are defining markers of the myeloid lineage.
- C1QC is a complement component characteristic of tumor-associated macrophages (TAMs).
- LILRB2, PILRA, and CLEC12A are myeloid inhibitory receptors expressed on macrophages and monocytes.
- C5AR1 (complement C5a receptor) and NLRP3 (inflammasome component) further confirm macrophage identity.
This “subclone” is the myeloid cell population whose tissue-specific expression program was systematically misclassified as aneuploid, as predicted by the LUNG_N06 negative control result.
Neither subclone represents a genuine tumor cell lineage. This outcome demonstrates exactly why Step 3.5 matters: CopyKAT’s raw subclone output can be cell-type contamination, not tumor evolution. For a biologically meaningful subclone analysis, filter to aneuploid cells within the annotated Epithelial cells population:
# Publication-ready subclone analysis: aneuploid epithelial cells only
seu_epithelial_aneuploid <- subset(
  seu_obj_lnt06,
  subset = copykat_label == "aneuploid" & Cell_type == "Epithelial cells"
)
Idents(seu_epithelial_aneuploid) <- "subclone"
epithelial_subclone_markers <- FindAllMarkers(
  seu_epithelial_aneuploid,
  assay = "RNA",
  only.pos = TRUE,
  min.pct = 0.25,
  logfc.threshold = 0.25
)
Publication note: FindAllMarkers is appropriate for exploratory analysis. For publication-quality comparisons across multiple patients, use a pseudobulk DESeq2 approach as described in Part 5.
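As a rough sketch of what the aggregation step of a pseudobulk workflow involves, the snippet below sums raw counts over cells that share a patient-by-subclone label using base R's rowsum(). The toy matrix, the group labels, and the helper name make_pseudobulk are illustrative assumptions, not objects from this tutorial; the aggregated matrix is the kind of input that would then go into DESeq2.

```r
# Hypothetical helper: sum raw counts across cells that share a group label.
make_pseudobulk <- function(counts, groups) {
  stopifnot(ncol(counts) == length(groups))
  # rowsum() sums rows by group, so put cells in rows first, then flip back
  t(rowsum(t(as.matrix(counts)), group = groups))
}

# Toy data (made up): 3 genes x 6 cells from two hypothetical patients
toy_counts <- matrix(
  c(1, 0, 2, 3, 1, 0,
    0, 4, 1, 0, 2, 2,
    5, 1, 0, 1, 0, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("cell", 1:6))
)
toy_groups <- c("P1_sub1", "P1_sub1", "P1_sub2", "P2_sub1", "P2_sub2", "P2_sub2")

# One column per patient-by-subclone group, one row per gene
pseudobulk <- make_pseudobulk(toy_counts, toy_groups)
print(pseudobulk)
```

Aggregating before testing treats each patient, rather than each cell, as the unit of replication, which is the main reason pseudobulk comparisons are preferred for multi-patient studies.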
Step 4.3 — Formal Tumor vs. Normal Comparison Table
cell_types <- union(unique(seu_obj_lnt06$Cell_type), unique(seu_obj_lnn06$Cell_type))
summary_df <- data.frame(
  cell_type = cell_types,
  aneuploid_LUNG_T06 = sapply(cell_types, function(ct) {
    cells <- seu_obj_lnt06$Cell_type == ct
    if (sum(cells) == 0) return(NA)
    mean(seu_obj_lnt06$copykat_label[cells] == "aneuploid", na.rm = TRUE)
  }),
  aneuploid_LUNG_N06 = sapply(cell_types, function(ct) {
    cells <- seu_obj_lnn06$Cell_type == ct
    if (sum(cells) == 0) return(NA)
    mean(seu_obj_lnn06$copykat_label[cells] == "aneuploid", na.rm = TRUE)
  })
)
summary_df <- summary_df[order(summary_df$aneuploid_LUNG_T06, decreasing = TRUE), ]
summary_df[, 2:3] <- round(summary_df[, 2:3], 3)
print(summary_df)
write.csv(summary_df,
          file.path(results_dir, "aneuploid_fraction_summary.csv"),
          row.names = FALSE)

Part 5: Saving Your Results
# Save updated Seurat objects (requires the "data/" directory created in Step 1.1)
saveRDS(seu_obj_lnt06, "data/seu_obj_LUNG_T06_copykat.rds")
saveRDS(seu_obj_lnn06, "data/seu_obj_LUNG_N06_copykat.rds")
# Export per-cell summary tables
write.csv(
  data.frame(
    cell_id = colnames(seu_obj_lnt06),
    sample = seu_obj_lnt06$Sample,
    cell_type = seu_obj_lnt06$Cell_type,
    copykat_label = seu_obj_lnt06$copykat_label,
    refined_annotation = seu_obj_lnt06$refined_annotation,
    subclone = seu_obj_lnt06$subclone
  ),
  file.path(results_dir, "LUNG_T06_cell_summary.csv"),
  row.names = FALSE
)
write.csv(
  data.frame(
    cell_id = colnames(seu_obj_lnn06),
    sample = seu_obj_lnn06$Sample,
    cell_type = seu_obj_lnn06$Cell_type,
    copykat_label = seu_obj_lnn06$copykat_label,
    refined_annotation = seu_obj_lnn06$refined_annotation
  ),
  file.path(results_dir, "LUNG_N06_cell_summary.csv"),
  row.names = FALSE
)
Part 6: Practical Tips, Caveats, and Best Practices
When CopyKAT Works Best
- The dataset contains a genuine mixture of tumor and non-malignant cells. The automatic diploid detection depends on finding a “flat” population to serve as the baseline.
- Read depth is adequate. Aim for at least 2,000 UMIs per cell. Lower depth produces noisy profiles where expression variation mimics CNVs.
- The tumor has real chromosomal instability. Lung adenocarcinoma is known for substantial genomic instability, making it a good target.
- Your dataset is human or mouse. CopyKAT supports only hg20 (hg38) and mm10.
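The read-depth criterion in the list above is easy to check before a run. The following is a minimal sketch, assuming a genes-by-cells raw count matrix; the helper name check_umi_depth and the toy values are hypothetical, and the 2,000 UMI cutoff simply mirrors the guideline stated above.

```r
# Hypothetical helper: summarize per-cell UMI depth from a genes x cells count matrix.
check_umi_depth <- function(counts, min_umis = 2000) {
  umis_per_cell <- colSums(counts)
  list(
    median_umis    = median(umis_per_cell),
    n_below        = sum(umis_per_cell < min_umis),
    fraction_below = mean(umis_per_cell < min_umis)
  )
}

# Toy matrix (made up): 2 genes x 4 cells
toy_counts <- matrix(
  c(1500, 900, 2500, 100,
    1000, 800,  500, 300),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), paste0("cell", 1:4))
)

depth <- check_umi_depth(toy_counts)
print(depth)
```

For a Seurat object, the nCount_RNA metadata column usually holds the same per-cell totals, so the summary can be computed from the metadata without extracting the matrix.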
Known False Positive Patterns
CopyKAT performs best on epithelial tumors (carcinomas) with clear aneuploid signal. Certain cell types are systematically prone to false positives and their aneuploid calls should be treated with skepticism:
| Cell Type | False Positive Risk | Biological Reason |
|---|---|---|
| Myeloid cells / Macrophages | High | Tissue-specific activation states create genome-wide expression patterns that mimic CNVs; seen in both tumor AND normal samples |
| Cancer-associated fibroblasts | Moderate-High | Tumor-induced transcriptional reprogramming creates widespread expression changes |
| Endothelial cells in tumors | Moderate | Tumor-induced angiogenic states alter expression broadly |
| B cells / Plasma cells | Low-Moderate | Immunoglobulin gene expression clusters on specific chromosomes can create false gains at those loci |
| T cells / NK cells | Low | TCR gene expression rarely generates strong false positives |
The practical rule: Always calculate aneuploid fractions per cell type and compare between tumor and matched normal. Any cell type with high aneuploid fractions in both tumor and normal is almost certainly a false positive in both.
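One way to apply this rule programmatically is sketched below. It assumes a summary table with one row per cell type and one aneuploid-fraction column per sample, similar in shape to the table built in Step 4.3. The helper name flag_shared_aneuploid, the column names, the toy fractions, and the 0.3 cutoff are all illustrative assumptions rather than CopyKAT recommendations.

```r
# Hypothetical helper: flag cell types whose aneuploid fraction is high in BOTH samples.
# The default cutoff of 0.3 is an arbitrary illustrative choice.
flag_shared_aneuploid <- function(df, tumor_col, normal_col, cutoff = 0.3) {
  # Treat missing fractions (cell type absent from a sample) as "not high"
  tumor_high  <- !is.na(df[[tumor_col]])  & df[[tumor_col]]  >= cutoff
  normal_high <- !is.na(df[[normal_col]]) & df[[normal_col]] >= cutoff
  df$likely_false_positive <- tumor_high & normal_high
  df
}

# Toy fractions (made up for illustration; not the tutorial's results)
toy_summary <- data.frame(
  cell_type        = c("Epithelial cells", "Myeloid cells", "T cells"),
  aneuploid_tumor  = c(0.60, 0.70, 0.02),
  aneuploid_normal = c(0.01, 0.65, NA)
)

flagged <- flag_shared_aneuploid(toy_summary, "aneuploid_tumor", "aneuploid_normal")
print(flagged)
```

In this toy table only the myeloid row is flagged, because it is the only cell type that exceeds the cutoff in both the tumor and the normal column.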
Common Pitfalls and How to Avoid Them
Pitfall 1: Accepting all aneuploid calls without cell type cross-referencing
Never use raw aneuploid labels without checking which cell types they fall in. In this tutorial, the aneuploid_fraction_summary.csv table is essential for distinguishing signal from noise.
Pitfall 2: Using normalized or batch-corrected counts as input
# CORRECT
raw_mat <- LayerData(seu_obj_lnt06, assay = "RNA", layer = "counts")
# WRONG
wrong1 <- LayerData(seu_obj_lnt06, assay = "RNA", layer = "data")
wrong2 <- LayerData(seu_obj_lnt06, assay = "integrated", layer = "data")
Pitfall 3: Mixing samples in a single run
Always run each sample separately. Inter-patient expression differences confound the diploid reference detection.
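If several samples are stored in one combined count matrix, they can be separated before any CopyKAT call. The sketch below uses base R only; the helper name split_counts_by_sample and the toy data are assumptions for illustration, and each resulting matrix would be passed to its own CopyKAT run.

```r
# Hypothetical helper: split a genes x cells count matrix into one matrix per sample.
split_counts_by_sample <- function(counts, sample_labels) {
  stopifnot(ncol(counts) == length(sample_labels))
  lapply(
    split(seq_len(ncol(counts)), sample_labels),
    function(idx) counts[, idx, drop = FALSE]
  )
}

# Toy data (made up): 2 genes x 6 cells drawn from two samples
toy_counts <- matrix(
  1:12, nrow = 2,
  dimnames = list(c("geneA", "geneB"), paste0("cell", 1:6))
)
toy_samples <- c("LUNG_T06", "LUNG_T06", "LUNG_N06", "LUNG_T06", "LUNG_N06", "LUNG_N06")

per_sample <- split_counts_by_sample(toy_counts, toy_samples)
# Number of cells per sample; each list element is one CopyKAT input
print(sapply(per_sample, ncol))
```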
Pitfall 4: Running subclone analysis on all aneuploid cells
As demonstrated in this tutorial, running subclone analysis on all aneuploid cells produces “subclones” that reflect cell type differences rather than tumor evolution. Restrict to the cell type where you have credible aneuploid signal (e.g., epithelial cells in a carcinoma).
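To illustrate the restricted approach, the sketch below clusters CNV profiles for a single cell type only. The matrix is simulated and merely stands in for CopyKAT's per-cell CNV output after filtering to aneuploid epithelial cells; the two simulated clones, the Ward linkage, and k = 2 are assumptions chosen for the example, not settings taken from this tutorial.

```r
set.seed(42)
n_bins <- 100

# Simulated CNV matrix (genomic bins x cells): two clones with opposite shifts in bins 1-50
clone_a <- matrix(rnorm(n_bins * 10, sd = 0.05), nrow = n_bins)
clone_b <- matrix(rnorm(n_bins * 10, sd = 0.05), nrow = n_bins)
clone_a[1:50, ] <- clone_a[1:50, ] + 0.5   # simulated gain
clone_b[1:50, ] <- clone_b[1:50, ] - 0.5   # simulated loss
cnv_mat <- cbind(clone_a, clone_b)
colnames(cnv_mat) <- paste0("cell", 1:20)

# Cluster cells (columns) on their CNV profiles and cut the tree into two groups
hc <- hclust(dist(t(cnv_mat)), method = "ward.D2")
subclone <- cutree(hc, k = 2)

# Compare the recovered clusters with the simulated clone labels
print(table(subclone, simulated_clone = rep(c("A", "B"), each = 10)))
```

Because every cell in the input belongs to one cell type, any grouping found here reflects differences in the CNV profiles and not differences in cell identity.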
Conclusion
In this tutorial, you learned how to:
- Run CopyKAT on a tumor and matched normal sample.
- Interpret results honestly: CopyKAT’s aneuploid labels must always be cross-referenced with cell type annotation. Myeloid cells are systematically prone to false positives; the credible signal in LUNG_T06 is concentrated in Epithelial cells (55.9% aneuploid vs. 1.1% in normal).
- Run subclone analysis correctly: restrict to the annotated cell type with credible aneuploid signal to avoid “subclones” that actually represent cell type differences.
- Diagnose unexpected results: the apparently higher aneuploid fraction in LUNG_N06 than in LUNG_T06 is explained by LUNG_N06’s larger myeloid compartment (1,310 cells), which produces more false-positive aneuploid calls.
What comes next?
With confirmed malignant epithelial labels and confident diploid labels for the immune compartment, you can:
- Run cell-cell communication analysis (Parts 9-10) on the tumor microenvironment.
- Apply weighted gene co-expression network analysis (WGCNA) to find co-expression modules specific to each tumor subclone.
References
- Gao R, Bai S, Henderson YC, et al. Delineating copy number and clonal substructure in human tumors from single-cell transcriptomes. Nature Biotechnology. 2021;39(5):599-608. doi:10.1038/s41587-020-00795-2
- Kim N, Kim HK, Lee K, et al. Single-cell RNA sequencing demonstrates the molecular and cellular reprogramming of metastatic lung adenocarcinoma. Nature Communications. 2020;11(1):2285. doi:10.1038/s41467-020-16164-1
- Sikkema L, Ramirez-Suastegui C, Strobl DC, et al. Benchmarking scRNA-seq copy number variation callers. Nature Communications. 2025. doi:10.1038/s41467-025-62359-9
- Hao Y, Stuart T, Kowalski MH, et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nature Biotechnology. 2024;42(2):293-304. doi:10.1038/s41587-023-01767-y
- CopyKAT GitHub repository: https://github.com/navinlabcode/copykat
This tutorial is part of the comprehensive NGS101.com single-cell RNA-seq analysis series for beginners.




