Introduction: Why Integration Matters in Multi-Sample scRNA-seq Analysis
In Part 1 and Part 2 of this tutorial series, we processed PBMC samples from the GSE174609 dataset through the complete pipeline: from raw FASTQ files to quality-controlled count matrices. Now we face a critical question: How do we analyze multiple samples together to identify cell types and compare biological conditions?
For this tutorial, we’ll work with 8 samples (4 Healthy + 4 Post-Treatment) from the full dataset. This subset keeps the analysis manageable for learning while still demonstrating all integration concepts. The same workflow scales to any number of samples.
The intuitive approach might be to simply merge all samples into one large dataset and proceed with clustering. However, this “naive merge” often produces misleading results where cells cluster by sample identity or processing batch rather than by biological cell type. This technical variation obscures the true biological signal we’re trying to detect.
The Batch Effect Problem in Single-Cell Data
Batch effects are systematic technical differences between samples that arise from:
- Different processing days: Samples prepared on different days experience varying environmental conditions (temperature, humidity, reagent age)
- Different operators: Even with standardized protocols, human technique introduces variability
- 10x chip variation: Each Chromium chip has slight manufacturing differences
- Sequencing runs: Different flow cells or sequencing dates create systematic biases
- Sample preparation timing: Hours between sample collection and processing affect cell stress responses
These technical variations can be larger than biological differences between cell types or treatment conditions, making it impossible to distinguish true biology from technical artifacts.
What is Data Integration?
Data integration (also called batch correction or data harmonization) is the computational process of removing technical variation between samples while preserving biological variation. Think of it as aligning samples so that cells of the same type cluster together regardless of which sample they came from.
Key principle: Integration corrects for who prepared the sample, not what the sample represents biologically.
The Critical Misconception: Will Integration Erase My Treatment Effects?
This is the most common concern when analyzing multi-sample experiments with different conditions, and it manifests in a question we see repeatedly from beginners:
“After integration, I don’t see my healthy and diseased samples separating on the UMAP anymore. They’re all mixed together. Doesn’t this mean integration overcorrected and removed all the biological differences between my conditions?”
Short answer: No. The mixing you see is exactly what successful integration should look like.
This confusion stems from a fundamental misunderstanding about what UMAP visualization represents in single-cell analysis versus what PCA plots represent in bulk RNA-seq. Let’s break down why this intuition from bulk RNA-seq doesn’t translate to single-cell data.
The Bulk RNA-seq vs Single-Cell RNA-seq Visualization Paradigm Shift
In Bulk RNA-seq PCA plots (what most people learned first):
When you perform bulk RNA-seq and create a PCA plot, you expect and want to see your treatment groups separate into distinct clusters. If your healthy samples cluster on the left and your diseased samples cluster on the right, you celebrate—this visual separation confirms that your treatment has a strong transcriptional effect. The distance between sample clusters on the PCA plot directly reflects the magnitude of treatment-induced changes.
This makes intuitive sense: each sample is one data point representing millions of pooled cells, and the PCA shows you whether samples from different conditions have different overall gene expression profiles.
In scRNA-seq UMAP plots after integration (the paradigm shift):
When you perform scRNA-seq integration and create a UMAP plot, you expect and want to see your samples NOT separating by condition. If cells from healthy donors intermingle completely with cells from diseased patients, you celebrate—this visual mixing confirms that integration successfully removed technical batch effects while preserving biological cell type structure.
This seems counterintuitive at first, but here’s why it makes sense: each data point is now an individual cell, not a sample. You have thousands of cells from each sample, representing multiple different cell types (T cells, B cells, monocytes, etc.). The UMAP is designed to show you cell type relationships, not condition differences.
The key insight that resolves the confusion:
- Bulk RNA-seq PCA: Shows sample-level differences (treatment effects)
- scRNA-seq UMAP after integration: Shows cell-level similarities (cell type structure)
These are fundamentally different questions being answered by fundamentally different visualizations.
What You’re Actually Seeing When Samples “Mix” on the UMAP
Let’s use a concrete example from your periodontitis study to understand what’s happening:
Before Integration (the problem):
Your UMAP shows clear separation, but for the wrong reasons:
- All cells from Healthy Donor 1 cluster together (T cells + B cells + monocytes from this donor)
- All cells from Healthy Donor 2 cluster in a different region (T cells + B cells + monocytes from this donor)
- All cells from Post-Treatment Patient 1 cluster in yet another region
- All cells from Post-Treatment Patient 2 cluster separately
This separation is technical artifact, not biology. Why? Because T cells from Healthy Donor 1 should be more similar to T cells from Post-Treatment Patient 1 (same cell type) than to B cells from Healthy Donor 1 (same sample but different cell type). But they’re not clustering that way—they’re clustering by which sample they came from, which indicates batch effects.
After Integration (the solution):
Your UMAP now shows mixing:
- All T cells cluster together (from healthy donors + post-treatment patients intermixed)
- All B cells cluster together (from healthy donors + post-treatment patients intermixed)
- All monocytes cluster together (from healthy donors + post-treatment patients intermixed)
This mixing is biological truth: cells of the same type ARE more similar to each other than cells of different types, regardless of which donor they came from or which condition that donor represents.
The critical realization: The treatment differences haven’t disappeared—they’re just not visible in the UMAP because UMAP shows cell type structure, not treatment effects.
Where Treatment Effects Are Actually Preserved (And Why You Can’t See Them on the UMAP)
Here’s the resolution to the paradox: Treatment effects exist in gene expression space, not in UMAP visualization space.
What Integration Actually Does:
Integration algorithms are sophisticated enough to distinguish between two types of variation in your data:
Type 1: Sample-level technical variation (what integration REMOVES)
- Healthy Donor 1 was processed on Monday morning by Technician A
- Healthy Donor 2 was processed on Tuesday afternoon by Technician B
- The technical differences in sample processing create systematic biases
- These biases affect all cell types equally within each sample
- Integration identifies and removes these sample-specific technical effects
Type 2: Condition-level biological variation (what integration PRESERVES)
- Post-treatment monocytes express different genes than healthy monocytes
- Post-treatment samples have different proportions of cell types than healthy samples
- Post-treatment CD4+ T cells show different activation states than healthy CD4+ T cells
- These biological differences are consistent across all post-treatment patients
- Integration recognizes these as real biology and preserves them
Why can’t you see treatment effects on the UMAP?
The UMAP is a 2-dimensional projection of a 20,000-dimensional gene expression space. It’s specifically optimized to show the largest sources of variation in your data. In single-cell data, the largest source of variation is cell type identity—the differences between T cells, B cells, and monocytes dwarf the differences between healthy and diseased states within any given cell type.
Think of it this way: the difference between a car and a bicycle (cell types) is much larger than the difference between a red car and a blue car (treatment conditions within a cell type). The UMAP emphasizes the car vs bicycle distinction, but the red vs blue information is still preserved in the full dataset—you just need different analyses to see it.
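This can be demonstrated with a small simulation. The toy example below (pure base R, entirely synthetic numbers) gives two cell types a large expression shift and two conditions a small one, then checks which signal dominates the first principal component:

```r
# Toy simulation (synthetic data, illustrative only): give cell type a
# large expression shift and condition a small one, then see which
# signal the first principal component captures.
set.seed(1)
n <- 100; g <- 50                                  # 100 cells, 50 genes
cell_type <- rep(c("T", "B"), each = n / 2)
condition <- rep(c("Healthy", "Post"), times = n / 2)
expr <- matrix(rnorm(n * g), nrow = n)
expr[cell_type == "T", 1:25] <- expr[cell_type == "T", 1:25] + 5             # strong cell-type effect
expr[condition == "Post", 26:30] <- expr[condition == "Post", 26:30] + 0.5   # weak condition effect
pc1 <- prcomp(expr, scale. = TRUE)$x[, 1]
sep_type <- abs(mean(pc1[cell_type == "T"]) - mean(pc1[cell_type == "B"]))
sep_cond <- abs(mean(pc1[condition == "Healthy"]) - mean(pc1[condition == "Post"]))
cat("PC1 separation by cell type:", round(sep_type, 2), "\n")
cat("PC1 separation by condition:", round(sep_cond, 2), "\n")
# Cell-type separation dwarfs condition separation on PC1, even though
# the condition effect is still present in genes 26-30.
```

The cell-type separation comes out far larger than the condition separation, which is exactly why embedding coordinates emphasize cell type while the condition signal stays in gene expression space.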
How We Actually Detect and Measure Treatment Effects
This is where Part 5 of the tutorial series becomes critical. Treatment effects are detected through three complementary approaches, none of which rely on visual UMAP separation:
Approach 1: Within-Cell-Type Differential Expression
After identifying cell types, you compare gene expression between conditions within each specific cell type. For example:
- Compare monocytes from healthy donors vs monocytes from post-treatment patients
- This reveals: Post-treatment monocytes express 2-fold higher levels of anti-inflammatory cytokines
- The treatment effect is clear and quantified, even though healthy and diseased monocytes mixed on the UMAP
Approach 2: Cell Type Composition Changes
Calculate the proportion of each cell type in healthy vs diseased samples:
- Healthy samples: 60% T cells, 15% B cells, 20% monocytes, 5% other
- Post-treatment samples: 50% T cells, 15% B cells, 30% monocytes, 5% other
- Treatment effect: 50% increase in monocyte proportion
- Again, this analysis doesn’t require UMAP separation—it’s a simple counting exercise
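As a concrete sketch, the counting exercise can be done with a few lines of base R. The metadata table below is hypothetical (built to match the made-up proportions above); in a real analysis these columns come from your annotated Seurat object's meta.data:

```r
# Hypothetical per-cell metadata (condition and cell_type columns are
# illustrative assumptions, not real data).
meta <- data.frame(
  condition = rep(c("Healthy", "Post_Treatment"), each = 100),
  cell_type = c(rep(c("T", "B", "Mono", "Other"), times = c(60, 15, 20, 5)),
                rep(c("T", "B", "Mono", "Other"), times = c(50, 15, 30, 5)))
)
# Percent of each cell type per condition -- a simple counting exercise
comp <- prop.table(table(meta$condition, meta$cell_type), margin = 1) * 100
print(round(comp, 1))
# Relative change in the monocyte fraction between conditions
mono_change <- (comp["Post_Treatment", "Mono"] - comp["Healthy", "Mono"]) /
  comp["Healthy", "Mono"] * 100
cat("Monocyte proportion change:", mono_change, "%\n")  # 50% increase
```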
Approach 3: Cell State Distribution Shifts
Within a cell type, examine how cells distribute across different activation states:
- Healthy CD4+ T cells: 70% resting, 20% activated, 10% memory
- Post-treatment CD4+ T cells: 50% resting, 30% activated, 20% memory
- Treatment effect: Shift toward activated and memory states
- These shifts are detected through gene expression signatures, not UMAP positions
The key insight: All three approaches reveal treatment effects that are completely invisible on the UMAP but perfectly preserved in the underlying data.
Quantitative Proof That Biology Isn’t Lost
Remember when we calculated integration quality metrics? We measured two independent scores:
Score 1: Sample Mixing (Technical Correction)
- Measures how well cells from different samples intermingle within neighborhoods
- High score = batch effects successfully removed
- This is what makes the UMAP look “mixed”
Score 2: Biological Preservation (Signal Retention)
- Measures whether conditions remain distinguishable using appropriate metrics
- High score = biological differences retained despite visual mixing
- This proves the treatment signal wasn’t destroyed
Your integration method was chosen because it had BOTH high mixing AND high biological preservation. If integration had overcorrected, you would see:
- High mixing score (samples mix) ✓
- Low biological preservation score (conditions indistinguishable) ✗
- But you don’t see this—your biological preservation score is also high
This quantitative evidence confirms that the visual mixing on the UMAP does not indicate lost biology.
What Actual Overcorrection Looks Like (And How to Recognize It)
True overcorrection has specific warning signs that go beyond just visual UMAP mixing:
Warning Sign 1: Known Marker Genes Lose Specificity
- CD3D (T cell marker) becomes equally expressed in B cells and monocytes
- CD14 (monocyte marker) shows up strongly in T cells
- Cell-type-defining genes become ambiguous across all clusters
Warning Sign 2: Biological Preservation Score Drops
- Quantitative metrics show conditions are now indistinguishable
- Treatment groups have become statistically inseparable
- Known disease signatures disappear from the data
Warning Sign 3: Cell Type Proportions Become Identical
- Healthy samples: 60% T cells, 15% B cells, 25% monocytes
- Diseased samples: 60% T cells, 15% B cells, 25% monocytes
- All biological differences in cell composition artificially equalized
Warning Sign 4: Within-Cell-Type Differential Expression Fails
- Testing for treatment effects within monocytes returns zero differentially expressed genes
- Known disease markers (validated by qPCR or flow cytometry) no longer show differences
- Literature-validated biology cannot be recapitulated
Warning Sign 5: Results Contradict Orthogonal Validation
- Flow cytometry shows 50% more activated T cells in disease
- Your integrated data shows no activation difference
- Integration has removed real biology
The critical distinction: Visual UMAP mixing alone is NOT evidence of overcorrection. You need to see multiple warning signs, particularly loss of known biology and failed differential expression, before concluding that integration was too aggressive.
How Integration Works: The Conceptual Framework
Modern integration methods use anchoring strategies to align datasets:
- Identify anchor cells: Find cells from different samples that are clearly the same cell type or cluster (e.g., T cells in Sample 1 match T cells in Sample 2)
- Learn technical differences: Calculate how gene expression differs between anchored cells from different samples
- Apply corrections: Adjust all cells to remove sample-specific technical variation
- Preserve biology: Retain variation that’s consistent across samples (true biological differences)
Different integration methods use variations of this strategy with different assumptions and strengths.
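To make the anchoring idea concrete, here is a minimal base-R sketch of mutual nearest neighbors in one dimension. The values are synthetic, with a simulated batch shift of +2; real methods such as FastMNN and Seurat's anchor-based workflows operate in a reduced-dimensional space with many more safeguards:

```r
# Minimal sketch of the mutual-nearest-neighbor idea behind anchoring.
# Synthetic 1-D "expression" values: two cell types per batch, and a
# simulated batch shift of +2 added to batch 2.
set.seed(2)
batch1 <- c(rnorm(5, 0, 0.3), rnorm(5, 10, 0.3))   # T-like and B-like cells
batch2 <- c(rnorm(5, 2, 0.3), rnorm(5, 12, 0.3))   # same types, shifted by +2
nn_of <- function(x, ref) apply(abs(outer(x, ref, "-")), 1, which.min)
nn12 <- nn_of(batch1, batch2)   # nearest batch-2 cell for each batch-1 cell
nn21 <- nn_of(batch2, batch1)   # nearest batch-1 cell for each batch-2 cell
# Anchors: cross-batch pairs that pick each other as nearest neighbor
anchors <- which(nn21[nn12] == seq_along(batch1))
# Estimate the technical shift from anchor pairs, then remove it
shift <- mean(batch2[nn12[anchors]] - batch1[anchors])
batch2_corrected <- batch2 - shift
cat("Estimated batch shift:", round(shift, 2), "\n")  # roughly the simulated +2
```

Note how the shift is estimated only from cells that match across batches; variation that is consistent within each batch (the T-like vs B-like difference) is untouched by the correction.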
Evaluating Integration Quality: How to Know It Worked
Good integration should produce specific visual and quantitative outcomes:
Visual indicators:
- UMAP mixing: Cells from different samples intermingle within each cluster
- Preserved cell types: Distinct clusters corresponding to known biological cell types remain separated
- Condition-specific populations: Rare or condition-specific cell types are still detectable
- No technical clusters: No clusters driven purely by sample identity
Quantitative metrics:
- Mixing metric: Measures how well samples mix within neighborhoods
- Silhouette score: Assesses cluster separation
- kBET (k-nearest neighbor Batch Effect Test): Tests for batch effects in local neighborhoods
- Biological conservation: Marker genes remain differentially expressed
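As an illustration of the mixing idea, the sketch below computes a simple per-cell mixing score in base R on synthetic, well-mixed embeddings. Dedicated tools such as kBET do this far more rigorously; this is only meant to show what the metric measures:

```r
# Base-R sketch of a per-cell mixing score: for each cell, what fraction
# of its k nearest neighbors come from a different sample?
set.seed(3)
emb <- rbind(matrix(rnorm(40), ncol = 2),   # sample A: 20 cells
             matrix(rnorm(40), ncol = 2))   # sample B: 20 cells, same distribution
sample_id <- rep(c("A", "B"), each = 20)
k <- 5
d <- as.matrix(dist(emb))
diag(d) <- Inf                              # never count a cell as its own neighbor
mixing <- sapply(seq_len(nrow(emb)), function(i) {
  nn <- order(d[i, ])[1:k]
  mean(sample_id[nn] != sample_id[i])
})
cat("Mean mixing score:", round(mean(mixing), 2), "\n")
# Scores near 0.5 (for two equal-size samples) indicate good mixing;
# scores near 0 indicate batch-separated neighborhoods.
```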
Red flags indicating over-integration:
- All samples perfectly mixed even in rare cell populations
- Loss of known marker gene expression
- Disappearance of expected biological variation
- Creation of artificial intermediate cell types
Red flags indicating under-integration:
- Samples form distinct clusters instead of mixing
- Same cell type splits across multiple clusters by sample
- Technical variation dominates biological signal
Understanding Clustering: Finding Cell Populations
After integration removes technical variation, clustering groups cells based on transcriptional similarity. This allows us to:
- Identify cell types: T cells, B cells, monocytes, etc.
- Discover cell states: Activated vs resting, proliferating vs quiescent
- Find rare populations: Dendritic cells, stem cells, transitional states
- Enable downstream analysis: Differential expression requires knowing which cells to compare
Clustering workflow:
- Dimensionality reduction: Compress 20,000 genes into ~30 principal components capturing major variation
- Nearest neighbor graph: Connect each cell to its most similar neighbors
- Community detection: Find groups of densely connected cells (clusters)
- Visualization: Project onto 2D UMAP for interpretation
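The graph-based steps above can be sketched in base R on synthetic data. Here connected components stand in for Louvain/Leiden community detection, which the real workflow performs on a shared-nearest-neighbor graph:

```r
# Base-R sketch of steps 2-3 on synthetic 2-D embeddings: build a
# k-nearest-neighbor graph, then label its connected components as a
# stand-in for community detection.
set.seed(4)
emb <- rbind(matrix(rnorm(30, mean = 0), ncol = 2),   # group 1: 15 cells
             matrix(rnorm(30, mean = 8), ncol = 2))   # group 2: 15 cells, far away
k <- 5
d <- as.matrix(dist(emb)); diag(d) <- Inf
adj <- t(apply(d, 1, function(row) seq_along(row) %in% order(row)[1:k]))
adj <- adj | t(adj)                                   # symmetrize the graph
comp <- integer(nrow(emb)); current <- 0L
for (i in seq_len(nrow(emb))) {
  if (comp[i] == 0L) {                                # start a new component
    current <- current + 1L
    frontier <- i
    while (length(frontier) > 0) {                    # breadth-first expansion
      comp[frontier] <- current
      frontier <- which(apply(adj[frontier, , drop = FALSE], 2, any) & comp == 0L)
    }
  }
}
cat("Clusters found:", max(comp), "\n")
```

Because the two groups are far apart, no k-nearest-neighbor edge crosses between them and each group becomes its own component, mirroring how well-separated cell types form distinct clusters.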
The resolution parameter controls clustering granularity:
- Low resolution (0.2-0.5): Broad cell types (T cells, B cells, monocytes)
- Medium resolution (0.6-1.0): Cell subtypes (CD4+ T cells, CD8+ T cells, naive vs memory)
- High resolution (1.2-2.0): Fine-grained states (Th1 vs Th2 vs Th17 CD4+ T cells)
The Complete Multi-Sample Analysis Workflow
8 QC'd Samples (from Part 2: 4 Healthy + 4 Post-Treatment)
↓
Attempt 1: Naive Merge (no integration)
↓ → Batch effects dominate
↓ → Cells cluster by sample, not biology
↓
Attempt 2: Four Integration Methods (R-based)
↓
├─→ CCA Integration (canonical correlation)
├─→ RPCA Integration (reciprocal PCA)
├─→ Harmony (probabilistic, fast)
└─→ FastMNN (mutual nearest neighbors)
↓
Compare Integration Results
↓ (mixing scores, biological preservation)
↓
Select Best Method
↓
Clustering at Multiple Resolutions
↓
Evaluate Cluster Quality
↓
Cell Type Annotation
↓
Differential Expression
What You’ll Learn in This Tutorial
By the end of Part 3, you’ll be able to:
✓ Understand why naive merging fails and integration is necessary
✓ Apply four different integration methods to your data
✓ Compare integration results visually and quantitatively
✓ Select the optimal method for your specific dataset
✓ Perform clustering at appropriate resolutions
✓ Evaluate whether clusters represent true biology or artifacts
✓ Prepare data for cell type annotation and downstream analysis
Let’s begin by setting up our environment and loading the quality-controlled data from Part 2.
Setting Up Your R Environment
We’ll build on the environment established in Part 2, adding packages specifically for integration methods.
Required Packages for Integration and Clustering
#-----------------------------------------------
# Installation: Integration-specific packages
#-----------------------------------------------
# Set CRAN mirror
options(repos = c(CRAN = "https://cloud.r-project.org"))
# Core packages (already installed in Part 2)
# Seurat, ggplot2, dplyr, patchwork
# Install Bioconductor packages for advanced integration
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(c(
  "batchelor",  # FastMNN integration (used by Seurat's IntegrateLayers)
  "scran"       # Additional normalization methods
))
# Install Harmony for integration
install.packages("harmony")
# Install additional visualization and utility packages
install.packages(c(
  "remotes",       # Install R packages hosted on GitHub
  "ggrepel",       # Non-overlapping text labels
  "RColorBrewer",  # Color palettes
  "viridis"        # Perceptually uniform color scales
))
# Wrapper functions for integration methods
remotes::install_github('satijalab/seurat-wrappers')
# Install future for parallel processing (used by RPCA integration)
install.packages("future")
# Install FNN for k-nearest neighbor calculations (used in quality metrics)
install.packages("FNN")
# Install cluster for clustering quality metrics (silhouette scores)
install.packages("cluster")
# Install reshape2 for data reshaping (used in cluster quality assessment)
install.packages("reshape2")
Loading Libraries and Configuration
#-----------------------------------------------
# STEP 1: Load required libraries
#-----------------------------------------------
# Core single-cell analysis
library(Seurat)
library(SeuratObject)
# Integration methods
library(harmony)
library(batchelor)
library(SeuratWrappers)
# Visualization and data manipulation
library(ggplot2)
library(ggrepel)
library(dplyr)
library(patchwork)
library(RColorBrewer)
library(viridis)
# Quality metrics and utilities
library(FNN) # K-nearest neighbor calculations for mixing metrics
library(cluster) # Silhouette scores for clustering quality
library(reshape2) # Data reshaping for visualization
# Parallel processing
library(future)
options(future.globals.maxSize = 20 * 1024^3) # Increase to 20GB for large datasets
# Set working directory (adjust to your path)
setwd("~/GSE174609_scRNA/integration_analysis")
# Create output directories
dir.create("plots", showWarnings = FALSE)
dir.create("plots/integration_comparison", showWarnings = FALSE)
dir.create("plots/clustering", showWarnings = FALSE)
dir.create("integrated_data", showWarnings = FALSE)
dir.create("metadata", showWarnings = FALSE)
# Set random seed for reproducibility
set.seed(42)
# Configure plotting defaults
theme_set(theme_classic(base_size = 12))
cat("Environment setup complete\n")
cat("Seurat version:", as.character(packageVersion("Seurat")), "\n")
# Seurat version: 5.3.1
Loading Quality-Controlled Data from Part 2
In Part 2, we performed comprehensive quality control on samples from GSE174609. Each sample was independently filtered for empty droplets, doublets, low-quality cells, and uninformative genes. Now we’ll load these clean datasets for integration.
For this tutorial, we use 8 samples (4 Healthy + 4 Post-Treatment) to keep the analysis manageable while demonstrating all key concepts.
Understanding “batches” in this dataset
In scRNA-seq, each individual sample typically represents its own technical batch due to separate processing. Even though samples may be processed on the same day, slight variations in handling time, reagent temperature, cell dissociation efficiency, and other technical factors create sample-specific effects. Therefore:
- Technical batch = individual sample (8 batches in our analysis)
- Biological replicate = patient/donor (different individuals within each condition)
Integration corrects for the technical differences between these 8 samples while preserving the biological differences between the 2 conditions (Healthy vs Post-treatment).
Expected outcomes after integration:
- Samples should mix within each cell type
- Treatment effects should remain visible in:
- Cell type proportions
- Gene expression within cell types
- Cell state distributions
Loading All QC’d Samples
#-----------------------------------------------
# STEP 2: Load QC-filtered samples from Part 2
#-----------------------------------------------
# Define sample metadata
sample_metadata <- data.frame(
  sample_id = c(
    "Healthy_1", "Healthy_2", "Healthy_3", "Healthy_4",
    "Post_Patient_1", "Post_Patient_2", "Post_Patient_3", "Post_Patient_4"
  ),
  condition = c(
    rep("Healthy", 4),
    rep("Post_Treatment", 4)
  ),
  patient_id = c(
    "Donor_1", "Donor_2", "Donor_3", "Donor_4",
    "Patient_1", "Patient_2", "Patient_3", "Patient_4"
  ),
  stringsAsFactors = FALSE
)
# Load QC-filtered Seurat objects
qc_data_path <- "~/GSE174609_scRNA/QC_analysis/filtered_data"
seurat_list <- lapply(sample_metadata$sample_id, function(sample_id) {
  obj <- readRDS(file.path(qc_data_path, paste0(sample_id, "_qc_filtered.rds")))
  # Ensure metadata is consistent
  obj$sample_id <- sample_id
  obj$condition <- sample_metadata$condition[sample_metadata$sample_id == sample_id]
  obj$patient_id <- sample_metadata$patient_id[sample_metadata$sample_id == sample_id]
  return(obj)
})
names(seurat_list) <- sample_metadata$sample_id
# Ensure all samples have the same genes (critical for FastMNN integration)
all_genes <- lapply(seurat_list, rownames)
common_genes <- Reduce(intersect, all_genes)
# Subset all samples to common genes
seurat_list <- lapply(seurat_list, function(obj) {
  obj[common_genes, ]
})
# Report dimensions
cat("\nLoaded", length(seurat_list), "QC-filtered samples:\n")
for (sample_id in names(seurat_list)) {
  cat(sprintf(" %s: %d cells × %d genes\n",
              sample_id,
              ncol(seurat_list[[sample_id]]),
              nrow(seurat_list[[sample_id]])))
}
cat("\nTotal cells across all samples:",
    sum(sapply(seurat_list, ncol)), "\n")
# Healthy_1: 8018 cells × 18861 genes
# Healthy_2: 8249 cells × 18861 genes
# Healthy_3: 9057 cells × 18861 genes
# Healthy_4: 10175 cells × 18861 genes
# Post_Patient_1: 8132 cells × 18861 genes
# Post_Patient_2: 8465 cells × 18861 genes
# Post_Patient_3: 9339 cells × 18861 genes
# Post_Patient_4: 11214 cells × 18861 genes
# Total cells across all samples: 72649
Important: These Seurat objects were already normalized in Part 2 using NormalizeData() and have variable features identified with FindVariableFeatures(). We’ll use this normalized data for integration.
Quick Quality Check
Before proceeding, let’s verify our data quality across all samples. This is a good checkpoint before starting integration.
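A minimal version of such a checkpoint is sketched below. To keep the example self-contained it builds a synthetic metadata table whose column names mirror Seurat defaults (nFeature_RNA, percent.mt); in the real workflow you would row-bind the meta.data slots of the objects in seurat_list instead:

```r
# Sketch of a pre-integration QC checkpoint on synthetic metadata
# (illustrative values, not real data).
set.seed(5)
qc_meta <- data.frame(
  sample_id    = rep(c("Healthy_1", "Post_Patient_1"), each = 500),
  nFeature_RNA = round(rnorm(1000, mean = 2000, sd = 400)),
  percent.mt   = pmax(rnorm(1000, mean = 5, sd = 2), 0)
)
# Median QC metrics per sample: outlier samples stand out here
qc_summary <- aggregate(cbind(nFeature_RNA, percent.mt) ~ sample_id,
                        data = qc_meta, FUN = median)
print(qc_summary)
# Simple sanity thresholds (tune for your tissue and chemistry)
stopifnot(all(qc_summary$nFeature_RNA > 500),
          all(qc_summary$percent.mt < 15))
cat("All samples pass the QC checkpoint\n")
```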
The Naive Approach: Direct Merge Without Integration
Before exploring integration methods, let’s see what happens when we simply merge samples without correction. This demonstrates why integration is necessary.
Merging Samples Directly
#-----------------------------------------------
# STEP 3: Naive merge without integration
#-----------------------------------------------
# Merge all samples
merged_naive <- merge(
  x = seurat_list[[1]],
  y = seurat_list[-1],
  add.cell.ids = names(seurat_list),
  project = "GSE174609_Naive_Merge"
)
cat("Merged dataset:", ncol(merged_naive), "cells ×", nrow(merged_naive), "genes\n")
# Merged dataset: 72649 cells × 18861 genes
# Standard Seurat workflow
merged_naive <- NormalizeData(merged_naive, verbose = FALSE)
merged_naive <- FindVariableFeatures(merged_naive, nfeatures = 2000, verbose = FALSE)
merged_naive <- ScaleData(merged_naive, verbose = FALSE)
merged_naive <- RunPCA(merged_naive, npcs = 50, verbose = FALSE)
merged_naive <- RunUMAP(merged_naive, dims = 1:30, reduction = "pca", verbose = FALSE)
merged_naive <- FindNeighbors(merged_naive, dims = 1:30, verbose = FALSE)
merged_naive <- FindClusters(merged_naive, resolution = 0.6, verbose = FALSE)
cat("Clustering complete:", length(unique(merged_naive$seurat_clusters)), "clusters identified\n")
# Clustering complete: 19 clusters identified
Visualizing Batch Effects
#-----------------------------------------------
# STEP 4: Visualize batch effects in naive merge
#-----------------------------------------------
# Define color palettes for visualization
# 8 samples need 8 distinct, visible colors
sample_colors <- c(
  "#E41A1C", "#377EB8", "#4DAF4A", "#984EA3",  # Healthy 1-4: Red, Blue, Green, Purple
  "#FF7F00", "#A65628", "#F781BF", "#999999"   # Post_Patient 1-4: Orange, Brown, Pink, Gray
)
names(sample_colors) <- sample_metadata$sample_id
# 2 conditions need 2 distinct colors
condition_colors <- c(
  "Healthy" = "#2E86AB",        # Blue
  "Post_Treatment" = "#F18F01"  # Orange
)
# UMAP colored by sample - shows batch effects
p1_naive <- DimPlot(merged_naive, reduction = "umap", group.by = "sample_id",
                    pt.size = 0.05, cols = sample_colors) +
  ggtitle("Naive Merge: Colored by Sample") +
  theme(legend.position = "right", legend.text = element_text(size = 8))
# UMAP colored by condition
p2_naive <- DimPlot(merged_naive, reduction = "umap", group.by = "condition",
                    pt.size = 0.05, cols = condition_colors) +
  ggtitle("Naive Merge: Colored by Condition")
# UMAP colored by clusters
p3_naive <- DimPlot(merged_naive, reduction = "umap", group.by = "seurat_clusters",
                    pt.size = 0.05, label = TRUE, label.size = 5) +
  ggtitle("Naive Merge: Clusters") +
  NoLegend()
# Split by sample to see separation
p4_naive <- DimPlot(merged_naive, reduction = "umap", group.by = "condition",
                    split.by = "sample_id", pt.size = 0.05, ncol = 4,
                    cols = condition_colors) +
  ggtitle("Naive Merge: Split by Sample") +
  theme(strip.text = element_text(size = 9, face = "bold"))
# Combine plots
combined_naive <- (p1_naive | p2_naive) / (p3_naive | p4_naive)
ggsave("plots/01_naive_merge_batch_effects.png", combined_naive,
       width = 16, height = 12, dpi = 300)
Understanding Batch Effects in Large Datasets: With 72,649 cells in our example, the UMAP plot becomes very dense with overlapping points, making it difficult to visually assess batch effects by color alone. While smaller datasets (<10,000 cells) show clear visual separation when samples cluster by batch rather than biology, large datasets require quantitative metrics (mixing scores, silhouette scores) to properly evaluate batch effects. The split view (bottom right panel) and the quantification section below provide clearer evidence of batch-driven clustering.
What Batch Effects Look Like: In a naive merge, cells often cluster by sample identity rather than cell type. This means Healthy_1 and Healthy_2 may form distinct clusters even though they contain the same cell types. Integration corrects this by aligning samples based on shared biological variation while removing technical variation.

Here is another dataset with far fewer cells. The UMAPs from the naive merge clearly reveal the batch effects discussed above.

Quantifying Batch Effects
#-----------------------------------------------
# STEP 5: Quantify batch effects with mixing metrics
#-----------------------------------------------
# Calculate per-cluster sample composition
cluster_composition <- table(merged_naive$seurat_clusters, merged_naive$sample_id)
cluster_composition_pct <- prop.table(cluster_composition, margin = 1) * 100
# Find clusters dominated by single samples (>50% from one sample = batch-driven)
dominant_sample_clusters <- apply(cluster_composition_pct, 1, max) > 50
n_batch_clusters <- sum(dominant_sample_clusters)
cat("\nBatch effect assessment:\n")
cat(" Clusters dominated by single sample (>50%):", n_batch_clusters,
    "/", nrow(cluster_composition), "\n")
# Clusters dominated by single sample (>50%): 1 / 19
if (n_batch_clusters > 0) {
  cat(" ⚠ Naive merge shows significant batch effects\n")
  cat(" → Integration is necessary\n")
}
# → Integration is necessary
Data Integration: Five Methods Compared
Now we’ll apply four R-based integration methods to remove batch effects while preserving biological variation; a fifth method, scVI, is reviewed for completeness. Each method has different strengths and is suited to different scenarios.
Overview of Integration Methods
1. CCA Integration (Canonical Correlation Analysis)
- Algorithm: Identifies shared correlation structures across datasets
- Strengths:
- Excellent for diverse cell types
- Handles datasets with different cell type compositions
- Robust to varying cell numbers per sample
- Best for: Experiments where samples may have different cell type proportions
- Computational cost: Medium
- When to use: Default choice for most experiments
2. RPCA Integration (Reciprocal PCA)
- Algorithm: Uses reciprocal PCA to identify shared variation
- Strengths:
- Faster than CCA
- Works well with large datasets (>100,000 cells)
- Less memory intensive
- Best for: Large-scale studies, similar cell type compositions across samples
- Computational cost: Low-Medium
- When to use: When computational efficiency is critical or datasets are very large
3. Harmony
- Algorithm: Iteratively adjusts cell embeddings to remove batch effects
- Strengths:
- Very fast
- Preserves global structure well
- Simple single-step process
- Scales to very large datasets
- Best for: Large datasets, soft batch correction, preserving broad structure
- Computational cost: Low
- When to use: First-pass integration, large datasets, when speed matters
4. FastMNN (Fast Mutual Nearest Neighbors)
- Algorithm: Finds mutual nearest neighbors across batches and corrects in low-dimensional space
- Strengths:
- Preserves biological variation well
- Handles complex batch structures
- Less aggressive correction (conservative)
- Best for: Datasets with subtle biological differences, hierarchical batch structures
- Computational cost: Medium
- When to use: When biological signal is subtle and you want conservative correction
5. scVI (Single-Cell Variational Inference)
- Algorithm: Deep learning-based method using variational autoencoders
- Strengths:
- State-of-the-art performance on complex datasets
- Can integrate millions of cells
- Handles very large batch effects
- Learns data-specific corrections
- Best for: Very large datasets, complex batch structures, when computational resources allow
- Computational cost: High (GPU recommended)
- When to use: When R-based methods show insufficient correction and computational resources are available
Why This Tutorial Focuses on Four R-Based Methods
The four R-based methods (CCA, RPCA, Harmony, FastMNN) we cover are:
- ✅ Easy to install through CRAN/Bioconductor
- ✅ Work reliably across different systems
- ✅ Often perform comparably to scVI for most datasets
- ✅ Sufficient for high-quality integration
- ✅ No GPU required
These methods will meet the needs of >95% of scRNA-seq integration tasks. We provide detailed scVI information later for advanced users who need it.
Preparing Data for Integration
All integration methods in Seurat 5 use a unified workflow through IntegrateLayers(). First, we need to prepare our merged object:
#-----------------------------------------------
# STEP 6: Prepare data for integration
#-----------------------------------------------
# Merge all samples with split layers (required for Seurat 5 integration)
merged_seurat <- merge(
  x = seurat_list[[1]],
  y = seurat_list[-1],
  add.cell.ids = names(seurat_list),
  project = "GSE174609_Integration"
)
# Add all metadata
# Extract sample_id from cell names (added via add.cell.ids)
# Cell names format: "SampleID_BARCODE-1"
cell_names <- colnames(merged_seurat)
merged_seurat$sample_id <- gsub("_[ACGT].*$", "", cell_names) # Remove barcode, keep sample_id
# Add other metadata by matching sample_id
merged_seurat$condition <- sample_metadata$condition[match(merged_seurat$sample_id, sample_metadata$sample_id)]
merged_seurat$patient_id <- sample_metadata$patient_id[match(merged_seurat$sample_id, sample_metadata$sample_id)]
# Check if layers are already split
current_layers <- Layers(merged_seurat[["RNA"]])
if (length(current_layers) > 1) {
  cat("Layers already split by merge operation (", length(current_layers), " layers)\n", sep = "")
} else {
  # Split layers by sample if not already split
  cat("Splitting layers by sample\n")
  merged_seurat[["RNA"]] <- split(merged_seurat[["RNA"]], f = merged_seurat$sample_id)
}
# Normalize and find variable features on split layers
merged_seurat <- NormalizeData(merged_seurat, verbose = FALSE)
merged_seurat <- FindVariableFeatures(merged_seurat, nfeatures = 2000, verbose = FALSE)
# Scale data and run PCA on split layers
merged_seurat <- ScaleData(merged_seurat, verbose = FALSE)
merged_seurat <- RunPCA(merged_seurat, npcs = 50, verbose = FALSE)
cat("Data prepared for integration\n")
cat("Layers:", length(Layers(merged_seurat, search = "data")), "samples\n")
# Layers: 8 samples
Method 1: CCA Integration
#-----------------------------------------------
# STEP 7: CCA Integration
#-----------------------------------------------
# Integrate using CCA
integrated_cca <- IntegrateLayers(
  object = merged_seurat,
  method = CCAIntegration,
  orig.reduction = "pca",
  new.reduction = "integrated.cca",
  dims = 1:30,
  verbose = FALSE
)
# Standard downstream workflow
integrated_cca <- FindNeighbors(integrated_cca, reduction = "integrated.cca", dims = 1:30, verbose = FALSE)
integrated_cca <- FindClusters(integrated_cca, resolution = 0.6, verbose = FALSE)
integrated_cca <- RunUMAP(integrated_cca, reduction = "integrated.cca", dims = 1:30,
                          reduction.name = "umap.cca", verbose = FALSE)
cat("CCA integration complete:", length(unique(integrated_cca$seurat_clusters)), "clusters\n")
# CCA integration complete: 23 clusters
Method 2: RPCA Integration
RPCA is computationally efficient and works well with large datasets.
Note on Parallel Processing: RPCA integration uses the future package for parallel processing. We configured future.globals.maxSize = 20 GB in Step 1 to handle large object transfers. If you still encounter memory errors, increase this value or reduce the dataset size.
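As a reminder, that setting is a single option call (a minimal sketch of the Step 1 configuration; note the value is specified in bytes, so 20 GB is written as 20 * 1024^3):

```r
# Raise the future package's limit on objects exported to parallel workers.
# The value is in bytes; 20 * 1024^3 corresponds to the 20 GB used in Step 1.
options(future.globals.maxSize = 20 * 1024^3)

# Confirm the setting took effect
getOption("future.globals.maxSize")
```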
#-----------------------------------------------
# STEP 8: RPCA Integration
#-----------------------------------------------
# Integrate using RPCA
integrated_rpca <- IntegrateLayers(
  object = merged_seurat,
  method = RPCAIntegration,
  orig.reduction = "pca",
  new.reduction = "integrated.rpca",
  dims = 1:30,
  verbose = FALSE
)
# Downstream workflow
integrated_rpca <- FindNeighbors(integrated_rpca, reduction = "integrated.rpca", dims = 1:30, verbose = FALSE)
integrated_rpca <- FindClusters(integrated_rpca, resolution = 0.6, verbose = FALSE)
integrated_rpca <- RunUMAP(integrated_rpca, reduction = "integrated.rpca", dims = 1:30,
                           reduction.name = "umap.rpca", verbose = FALSE)
cat("RPCA integration complete:", length(unique(integrated_rpca$seurat_clusters)), "clusters\n")
# RPCA integration complete: 22 clusters
Method 3: Harmony Integration
Harmony is fast and effective, working directly on the PCA embedding rather than requiring layer-based operations.
#-----------------------------------------------
# STEP 9: Harmony Integration
#-----------------------------------------------
# Harmony works directly on PCA embedding
# First, join layers and run standard workflow
merged_harmony <- JoinLayers(merged_seurat)
merged_harmony <- RunPCA(merged_harmony, npcs = 50, verbose = FALSE)
# Run Harmony (corrects batch effects on PCA embedding)
integrated_harmony <- RunHarmony(merged_harmony, "sample_id")
# Downstream workflow
integrated_harmony <- FindNeighbors(integrated_harmony, reduction = "harmony", dims = 1:30, verbose = FALSE)
integrated_harmony <- FindClusters(integrated_harmony, resolution = 0.6, verbose = FALSE)
integrated_harmony <- RunUMAP(integrated_harmony, reduction = "harmony", dims = 1:30,
                              reduction.name = "umap.harmony", verbose = FALSE)
cat("Harmony integration complete:", length(unique(integrated_harmony$seurat_clusters)), "clusters\n")
# Harmony integration complete: 21 clusters
Method 4: FastMNN Integration
FastMNN uses mutual nearest neighbors to align datasets. It works directly on the split layers in the merged Seurat object using IntegrateLayers().
#-----------------------------------------------
# STEP 10: FastMNN Integration
#-----------------------------------------------
# Integrate using FastMNN (works on split layers)
integrated_fastmnn <- IntegrateLayers(
  object = merged_seurat,
  method = FastMNNIntegration,
  new.reduction = "integrated.mnn",
  verbose = FALSE
)
cat("FastMNN integration complete\n")
# Downstream workflow
integrated_fastmnn <- FindNeighbors(integrated_fastmnn, reduction = "integrated.mnn", dims = 1:30, verbose = FALSE)
integrated_fastmnn <- FindClusters(integrated_fastmnn, resolution = 0.6, verbose = FALSE)
integrated_fastmnn <- RunUMAP(integrated_fastmnn, reduction = "integrated.mnn", dims = 1:30,
                              reduction.name = "umap.mnn", verbose = FALSE)
cat("Clustering complete:", length(unique(integrated_fastmnn$seurat_clusters)), "clusters\n")
# Clustering complete: 20 clusters
Method 5: scVI Integration (Advanced – Not Covered)
scVI (single-cell Variational Inference) is a powerful deep learning-based integration method that uses variational autoencoders to learn a low-dimensional representation of scRNA-seq data. While it often achieves excellent integration quality, we skip this method in this tutorial due to significant setup challenges.
Why scVI is Powerful:
- Uses deep learning to model complex batch effects
- Can integrate very large datasets (millions of cells)
- Often produces superior integration compared to linear methods
- Can model both technical and biological variation simultaneously
Why We’re Skipping It:
- Python dependency: Requires Python 3.8+ and PyTorch installation
- Complex environment: Setting up scvi-tools and its dependencies is error-prone
- Version conflicts: Frequent incompatibilities between PyTorch, CUDA, and scvi-tools versions
- Resource intensive: Requires significant computational resources (GPU recommended)
- Debugging difficulty: Python-R interface issues through reticulate are hard to troubleshoot
- Time investment: Can take hours to set up correctly, even for experienced users
Our Recommendation:
The four R-based methods we covered (CCA, RPCA, Harmony, FastMNN) are:
- ✅ Easier to install – pure R packages through CRAN/Bioconductor
- ✅ Faster to run – no deep learning training required
- ✅ Equally effective – often perform comparably to scVI for most datasets
- ✅ More reproducible – fewer version dependency issues
If You Want to Try scVI Anyway:
For those comfortable with Python environments, see:
- scVI documentation: https://docs.scvi-tools.org/
- Seurat-scVI integration: https://satijalab.org/seurat/articles/seurat5_integration_bridge
For this tutorial, we proceed with our 4 reliable R-based methods, which are more than sufficient for high-quality integration.
Comparing Integration Methods
Now that we’ve applied four different integration methods, let’s systematically compare their results to determine which works best for our dataset.
Methods we’re comparing:
- Naive Merge (no integration – baseline)
- CCA Integration (Canonical Correlation Analysis)
- RPCA Integration (Reciprocal PCA)
- Harmony Integration (fast, iterative soft-clustering correction)
- FastMNN Integration (Mutual nearest neighbors)
Visual Comparison: UMAPs Side by Side
#-----------------------------------------------
# STEP 11: Generate comparison UMAPs
#-----------------------------------------------
cat("\n=== Generating Comparison Plots ===\n")
# Create a function to make standardized UMAP plots
make_comparison_plot <- function(seurat_obj, reduction_name, title, group_by = "sample_id", colors = NULL) {
  p <- DimPlot(seurat_obj, reduction = reduction_name, group.by = group_by, pt.size = 0.05) +
    ggtitle(title) +
    theme(plot.title = element_text(face = "bold", size = 12),
          legend.text = element_text(size = 7),
          legend.key.size = unit(0.3, "cm"))
  # Apply custom colors if provided
  if (!is.null(colors)) {
    p <- p + scale_color_manual(values = colors)
  }
  return(p)
}
# UMAPs colored by sample (assesses mixing)
p_sample_naive <- make_comparison_plot(merged_naive, "umap", "Naive Merge",
                                       colors = sample_colors)
p_sample_cca <- make_comparison_plot(integrated_cca, "umap.cca", "CCA",
                                     colors = sample_colors)
p_sample_rpca <- make_comparison_plot(integrated_rpca, "umap.rpca", "RPCA",
                                      colors = sample_colors)
p_sample_harmony <- make_comparison_plot(integrated_harmony, "umap.harmony", "Harmony",
                                         colors = sample_colors)
p_sample_mnn <- make_comparison_plot(integrated_fastmnn, "umap.mnn", "FastMNN",
                                     colors = sample_colors)
# Combine sample-colored UMAPs
combined_samples <- (p_sample_naive | p_sample_cca | p_sample_rpca) /
  (p_sample_harmony | p_sample_mnn | plot_spacer())
ggsave("plots/integration_comparison/02_integration_by_sample.png",
       combined_samples, width = 16, height = 12, dpi = 300)
# UMAPs colored by condition (assesses biological preservation)
p_cond_naive <- make_comparison_plot(merged_naive, "umap", "Naive Merge", "condition",
                                     colors = condition_colors)
p_cond_cca <- make_comparison_plot(integrated_cca, "umap.cca", "CCA", "condition",
                                   colors = condition_colors)
p_cond_rpca <- make_comparison_plot(integrated_rpca, "umap.rpca", "RPCA", "condition",
                                    colors = condition_colors)
p_cond_harmony <- make_comparison_plot(integrated_harmony, "umap.harmony", "Harmony", "condition",
                                       colors = condition_colors)
p_cond_mnn <- make_comparison_plot(integrated_fastmnn, "umap.mnn", "FastMNN", "condition",
                                   colors = condition_colors)
# Combine condition-colored UMAPs
combined_conditions <- (p_cond_naive | p_cond_cca | p_cond_rpca) /
  (p_cond_harmony | p_cond_mnn | plot_spacer())
ggsave("plots/integration_comparison/03_integration_by_condition.png",
       combined_conditions, width = 16, height = 12, dpi = 300)

For visual contrast, here is the same comparison on a separate dataset with far fewer cells, where the effect of integration is easier to see. Unlike the naive merge, the integrated embeddings show cells from different samples mixing well across clusters, with no cluster dominated by cells from a single sample. This indicates successful removal of batch effects associated with sample identity.

Quantitative Comparison: Mixing and Separation Metrics
To objectively compare integration methods, we calculate mixing scores that quantify how well cells from different samples mix within biological clusters. Higher scores indicate better integration.
How the mixing metric works: For each cell, we find its k=50 nearest neighbors in the UMAP space and calculate the diversity of samples among those neighbors using the inverse Simpson’s Index. If all neighbors are from different samples (perfect mixing), the score is high. If all neighbors are from the same sample (poor integration), the score is low.
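To see what the index returns at the extremes, here is a toy computation on a single invented neighborhood (k = 8 neighbors, 4 samples; labels are made up for illustration):

```r
# Toy computation of the inverse Simpson's Index for one cell's neighborhood.
neighbors_mixed <- c("S1", "S2", "S3", "S4", "S1", "S2", "S3", "S4")  # even mix
neighbors_pure  <- rep("S1", 8)                                       # one sample only

inv_simpson <- function(groups) {
  props <- table(groups) / length(groups)  # sample proportions in the neighborhood
  1 / sum(props^2)                         # higher = more diverse neighborhood
}

inv_simpson(neighbors_mixed)  # 4: score equals the number of evenly mixed samples
inv_simpson(neighbors_pure)   # 1: minimum possible score (single sample)
```

With perfect mixing the score approaches the number of samples; with no mixing it collapses to 1, which is exactly the behavior the metric exploits below.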
#-----------------------------------------------
# STEP 12: Calculate integration quality metrics
#-----------------------------------------------
# Function to calculate mixing metric (local inverse Simpson's Index)
# This measures sample diversity in each cell's k-nearest neighborhood
calculate_mixing_metric <- function(seurat_obj, group_by = "sample_id", reduction = "umap", k = 50) {
  # Get embedding
  embedding <- Embeddings(seurat_obj, reduction = reduction)
  # Calculate k-nearest neighbors using FNN package
  nn_result <- FNN::get.knn(embedding[, 1:2], k = k)
  # For each cell, calculate diversity of samples in its neighborhood
  group_labels <- seurat_obj@meta.data[[group_by]]
  mixing_scores <- sapply(1:nrow(nn_result$nn.index), function(i) {
    neighbors <- nn_result$nn.index[i, ]
    neighbor_groups <- group_labels[neighbors]
    props <- table(neighbor_groups) / length(neighbor_groups)
    # Inverse Simpson's Index (higher = more mixed)
    1 / sum(props^2)
  })
  return(mean(mixing_scores))
}
# Calculate for all methods
methods_list <- list(
  "Naive" = list(obj = merged_naive, reduction = "umap"),
  "CCA" = list(obj = integrated_cca, reduction = "umap.cca"),
  "RPCA" = list(obj = integrated_rpca, reduction = "umap.rpca"),
  "Harmony" = list(obj = integrated_harmony, reduction = "umap.harmony"),
  "FastMNN" = list(obj = integrated_fastmnn, reduction = "umap.mnn")
)
# Calculate mixing metrics
mixing_results <- data.frame(
  method = names(methods_list),
  mixing_score = sapply(methods_list, function(x) {
    calculate_mixing_metric(x$obj, group_by = "sample_id", reduction = x$reduction)
  }),
  n_clusters = sapply(methods_list, function(x) {
    length(unique(x$obj$seurat_clusters))
  })
)
# Display results
cat("\nIntegration Quality Metrics:\n")
cat("(Higher mixing score = better sample mixing)\n\n")
print(mixing_results)
# Save metrics
write.csv(mixing_results, "metadata/integration_comparison_metrics.csv", row.names = FALSE)
# Visualize mixing scores
p_mixing <- ggplot(mixing_results, aes(x = reorder(method, -mixing_score), y = mixing_score, fill = method)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(mixing_score, 2)), vjust = -0.5) +
  labs(title = "Integration Method Comparison: Sample Mixing",
       subtitle = "Higher score = better integration (samples mix within cell types)",
       x = "Integration Method",
       y = "Mixing Score (Local Inverse Simpson's Index)") +
  theme(legend.position = "none") +
  scale_fill_brewer(palette = "Set2")
ggsave("plots/integration_comparison/04_mixing_scores.png", p_mixing,
       width = 10, height = 6, dpi = 300)

Assessing Biological Signal Preservation
A good integration method should remove technical variation (samples mix) while preserving biological variation (conditions remain distinguishable).
How This Metric Works (For Beginners):
Integration is a balancing act. We want to remove batch effects (technical variation) but keep biological differences (biological variation). This metric helps us check if we’ve struck the right balance.
The Measurement Process:
- Find condition centers: Calculate the average UMAP position for all Healthy cells and all Post_Treatment cells
- Measure separation: Calculate the distance between these two centers
- Interpret the distance:
- High distance = Conditions are well-separated (biological signal preserved) ✓
- Low distance = Conditions overlap (might have over-integrated, losing biology) ⚠️
What We’re Looking For:
- High mixing score (from previous section) → Samples within each condition mix well
- Moderate-to-high separation (this section) → Conditions remain distinguishable
Examples:
- Good integration: Healthy cells mix together (mixing score high), BUT Healthy cells and Post_Treatment cells remain in distinct regions (separation high)
- Over-integration: Everything mixes together, losing biological differences (mixing high, separation low) ⚠️
- Under-integration: Samples stay separated, batch effects remain (mixing low, separation high) ⚠️
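In miniature, the measurement is just a centroid distance. A toy example with invented 2-D embedding coordinates for two small groups of cells:

```r
# Toy version of the separation metric: Euclidean distance between
# condition centroids in a 2-D embedding (coordinates invented).
healthy <- matrix(c(0, 1, 0.5, 0, 1, 0.5), ncol = 2)  # 3 "Healthy" cells
treated <- matrix(c(3, 4, 3.5, 4, 5, 4.5), ncol = 2)  # 3 "Post_Treatment" cells

center_h <- colMeans(healthy)  # centroid (0.5, 0.5)
center_t <- colMeans(treated)  # centroid (3.5, 4.5)
sqrt(sum((center_h - center_t)^2))  # 5: the separation score for this toy
```

The function used below does the same thing, but via aggregate() so it generalizes to any number of conditions.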
#-----------------------------------------------
# STEP 13: Assess biological signal preservation
#-----------------------------------------------
# Function to test if conditions are still distinguishable after integration
# Uses UMAP distances between conditions
assess_condition_separation <- function(seurat_obj, reduction) {
  umap_coords <- seurat_obj[[reduction]]@cell.embeddings[, 1:2]
  conditions <- seurat_obj$condition
  # Calculate mean UMAP position for each condition
  condition_centers <- aggregate(umap_coords, by = list(conditions), FUN = mean)
  # Calculate pairwise distances between condition centers
  dist_matrix <- dist(condition_centers[, -1])
  mean_dist <- mean(dist_matrix)
  return(mean_dist)
}
# Calculate for all methods
separation_results <- data.frame(
  method = names(methods_list),
  condition_separation = sapply(methods_list, function(x) {
    assess_condition_separation(x$obj, x$reduction)
  })
)
mixing_results$condition_separation <- separation_results$condition_separation
cat("\nBiological Signal Preservation:\n")
cat("(Higher separation = conditions remain distinguishable)\n\n")
print(mixing_results[, c("method", "mixing_score", "condition_separation", "n_clusters")])
# Ideal integration: High mixing + Moderate separation
# Over-integration: High mixing + Low separation (conditions lost)
# Under-integration: Low mixing + High separation (batches not corrected)
# Visualize the trade-off
p_tradeoff <- ggplot(mixing_results, aes(x = mixing_score, y = condition_separation, label = method)) +
  geom_point(size = 4, aes(color = method)) +
  geom_text_repel(size = 4, box.padding = 0.5) +
  labs(title = "Integration Quality: Mixing vs Biological Preservation",
       subtitle = "Ideal: Upper right (high mixing + preserved biology)",
       x = "Sample Mixing Score (higher = better integration)",
       y = "Condition Separation (higher = preserved biology)") +
  theme(legend.position = "none") +
  scale_color_brewer(palette = "Set2")
ggsave("plots/integration_comparison/05_mixing_vs_preservation.png", p_tradeoff,
       width = 10, height = 7, dpi = 300)

Selecting the Best Integration Method
#-----------------------------------------------
# STEP 14: Select optimal integration method
#-----------------------------------------------
# Rank methods by mixing score (primary criterion)
mixing_results <- mixing_results[order(-mixing_results$mixing_score), ]
cat("\nMethod ranking (by sample mixing):\n")
for (i in 1:nrow(mixing_results)) {
  cat(sprintf("  %d. %s (mixing: %.2f, separation: %.2f)\n",
              i,
              mixing_results$method[i],
              mixing_results$mixing_score[i],
              mixing_results$condition_separation[i]))
}
# Select top method
best_method <- mixing_results$method[1]
cat("\nRecommended method:", best_method, "\n")
# For subsequent analysis, we'll use the best method
# In most cases, this will be Harmony or CCA
if (best_method == "Harmony") {
  integrated_final <- integrated_harmony
  reduction_final <- "umap.harmony"
} else if (best_method == "CCA") {
  integrated_final <- integrated_cca
  reduction_final <- "umap.cca"
} else if (best_method == "RPCA") {
  integrated_final <- integrated_rpca
  reduction_final <- "umap.rpca"
} else if (best_method == "FastMNN") {
  integrated_final <- integrated_fastmnn
  reduction_final <- "umap.mnn"
} else {
  # Default to CCA if method not found
  integrated_final <- integrated_cca
  reduction_final <- "umap.cca"
  cat("Warning: Method not recognized, defaulting to CCA\n")
}
cat("\nUsing", best_method, "for downstream clustering analysis\n")
# Using CCA for downstream clustering analysis
Interpretation Guide:
- High mixing score (>3.0): Good integration, samples mix well within cell types
- Moderate separation: Biological conditions remain distinguishable
- Balanced scores: Ideal integration removes technical variation while preserving biology
Warning signs:
- Very high mixing + very low separation: Over-integration (lost biological signal)
- Very low mixing + very high separation: Under-integration (batches not corrected)
Understanding Clustering: From Integrated Data to Cell Populations
With successfully integrated data, we can now cluster cells to identify distinct populations. Clustering groups cells with similar gene expression profiles, allowing us to discover cell types, cell states, and rare populations.
The Clustering Algorithm: Graph-Based Community Detection
Seurat uses the Louvain algorithm (or its faster variant, Leiden) for clustering:
Step 1: Build k-nearest neighbor (KNN) graph
- Each cell connects to its k most similar neighbors
- Similarity measured in integrated PCA space (not raw gene expression)
- Default: k=20 neighbors
Step 2: Calculate shared nearest neighbors (SNN)
- Weight edges based on shared neighbors between cells
- Cells with many common neighbors = stronger connection
- Creates more robust clusters
Step 3: Community detection
- Apply Louvain/Leiden algorithm to find communities
- Communities = groups of densely connected cells
- Resolution parameter controls granularity
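The shared-neighbor weighting in Step 2 can be sketched as a Jaccard-style overlap of two cells' neighbor sets (a simplification: Seurat's FindNeighbors() additionally prunes weak edges via its prune.SNN threshold; the neighbor indices here are invented):

```r
# Toy SNN edge weight between two cells: overlap of their k-nearest-neighbor sets.
nn_a <- c(1, 2, 3, 4)  # hypothetical neighbor indices of cell A
nn_b <- c(3, 4, 5, 6)  # hypothetical neighbor indices of cell B

snn_weight <- function(a, b) {
  # Jaccard index: shared neighbors / total distinct neighbors
  length(intersect(a, b)) / length(union(a, b))
}

snn_weight(nn_a, nn_b)  # 2 shared of 6 distinct neighbors = 1/3
```

Cells with many common neighbors get weights near 1 and end up in the same community; cells sharing no neighbors get weight 0 and their edge is dropped.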
The Resolution Parameter: Controlling Cluster Granularity
The resolution parameter controls how many clusters are identified:
Low resolution (0.2-0.5):
- Fewer, larger clusters
- Broad cell types (T cells, B cells, monocytes)
- Good for initial exploration
- Use when: You want major cell type categories
Medium resolution (0.6-1.0):
- Moderate number of clusters
- Cell subtypes (CD4+ T cells, CD8+ T cells, classical monocytes)
- Standard choice for most analyses
- Use when: Balanced between detail and interpretability
High resolution (1.2-2.0):
- Many small clusters
- Fine-grained cell states (Th1, Th2, Th17 CD4+ T cells)
- May split biological populations unnecessarily
- Use when: You need maximum detail or are focusing on specific cell types
Rules of thumb:
- ~3,000 cells: resolution 0.4-0.6
- ~10,000 cells: resolution 0.6-0.8
- ~50,000 cells: resolution 0.8-1.2
- >100,000 cells: resolution 1.0-1.5
For our dataset (~72,000 cells), we’ll test resolutions from 0.4 to 1.2.
Common Misconception: Cluster Number Should Match Expected Cell Types
A frequent misunderstanding among beginners is that the number of clusters should exactly equal the number of cell types you expect to find. For example, if you expect to find 8 major PBMC cell types (T cells, B cells, monocytes, NK cells, dendritic cells, etc.), you might think you should adjust the resolution until you get exactly 8 clusters.
This is incorrect. Here’s why:
1. Cell types exist at multiple hierarchical levels
- At low resolution: “T cells” = 1 cluster
- At medium resolution: “CD4+ T cells” + “CD8+ T cells” = 2 clusters
- At high resolution: “Naive CD4+ T” + “Memory CD4+ T” + “Naive CD8+ T” + “Effector CD8+ T” = 4+ clusters
2. Cell states add complexity beyond cell types
- Activated vs resting cells within the same type
- Proliferating vs quiescent cells
- Stressed or transitional states
- Cell cycle phases
3. The “right” number of clusters depends on your question
- Broad comparison? Use fewer clusters (major types)
- Detailed analysis? Use more clusters (subtypes and states)
- There is no single “correct” answer
4. Biological reality is continuous, not discrete
- Cells exist on spectrums and gradients
- Clustering imposes artificial boundaries
- Two similar cell types may legitimately appear as one cluster
- One heterogeneous cell type may split into multiple clusters
The proper approach:
- Test multiple resolutions systematically
- Evaluate biological interpretability at each resolution
- Use marker genes to validate cluster identities
- Choose resolution based on your analysis goals, not preconceived cluster counts
- Remember: 12 clusters representing 8 cell types (with subtypes) is perfectly valid
For PBMCs, you might expect 8-10 major cell types, but 12-15 clusters is common and appropriate when accounting for T cell subtypes, monocyte subsets, and cell states. Don’t force your data into an arbitrary number of clusters.
Clustering at Multiple Resolutions
#-----------------------------------------------
# STEP 15: Perform clustering at multiple resolutions
#-----------------------------------------------
# Test multiple resolutions
resolutions <- c(0.4, 0.6, 0.8, 1.0, 1.2)
for (res in resolutions) {
  integrated_final <- FindClusters(
    integrated_final,
    resolution = res,
    verbose = FALSE
  )
  n_clusters <- length(unique(integrated_final@meta.data[[paste0("RNA_snn_res.", res)]]))
  cat(sprintf("Resolution %.1f: %d clusters\n", res, n_clusters))
}
# Rename cluster columns for clarity
colnames(integrated_final@meta.data) <- gsub("RNA_snn_res\\.", "clusters_res_",
                                             colnames(integrated_final@meta.data))
Visualizing Multi-Resolution Clustering
#-----------------------------------------------
# STEP 16: Visualize clustering results across resolutions
#-----------------------------------------------
# Create UMAP plots for each resolution
plot_list <- lapply(resolutions, function(res) {
  cluster_col <- paste0("clusters_res_", res)
  n_clusters <- length(unique(integrated_final@meta.data[[cluster_col]]))
  DimPlot(integrated_final,
          reduction = reduction_final,
          group.by = cluster_col,
          label = TRUE,
          label.size = 4,
          pt.size = 0.3) +
    ggtitle(paste0("Resolution ", res, " (", n_clusters, " clusters)")) +
    NoLegend()
})
# Combine plots
combined_resolutions <- wrap_plots(plot_list, ncol = 2)
ggsave("plots/clustering/06_multi_resolution_clustering.png",
       combined_resolutions, width = 14, height = 14, dpi = 300)

Determining Optimal Resolution
The Silhouette Score Metric:
We use silhouette scores to objectively assess cluster quality at each resolution:
- What it measures: How well each cell fits into its assigned cluster
- For each cell: Calculate average distance to cells in its own cluster vs other clusters
- Score ranges from -1 to +1
- Interpretation:
- High score (~0.5-0.7): Clusters are well-separated, cells clearly belong to their clusters ✓
- Moderate score (~0.3-0.5): Reasonable clusters, some overlap between groups
- Low score (<0.3): Clusters poorly defined, may be over-clustering or under-clustering ⚠️
- The trade-off:
- Too few clusters (low resolution) → Biology is oversimplified, different cell types lumped together
- Too many clusters (high resolution) → Over-splitting of natural cell populations
Our Strategy:
- Test multiple resolutions (0.4, 0.6, 0.8, 1.0, 1.2)
- Calculate silhouette score for each (using 5,000-cell subsample)
- Look for resolution with high silhouette score + interpretable cluster number
- Visually validate the chosen resolution makes biological sense
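Before running the packaged computation, it helps to apply the silhouette formula s = (b - a) / max(a, b) by hand. A minimal base-R sketch on an invented 1-D dataset (two tight groups far apart, so the mean score should sit near +1):

```r
# Hand-computed silhouette on a tiny 1-D example.
# For each point: a = mean distance to its own cluster,
#                 b = mean distance to the other cluster,
#                 s = (b - a) / max(a, b).
x  <- c(0, 0.2, 10, 10.2)   # two tight, well-separated groups
cl <- c(1, 1, 2, 2)

sil_one <- function(i) {
  same  <- which(cl == cl[i] & seq_along(x) != i)
  other <- which(cl != cl[i])
  a <- mean(abs(x[same] - x[i]))   # mean intra-cluster distance
  b <- mean(abs(x[other] - x[i]))  # mean distance to the other cluster
  (b - a) / max(a, b)
}

round(mean(sapply(seq_along(x), sil_one)), 3)  # 0.98: strong separation
```

The cluster::silhouette() call used below applies the same formula, taking the nearest other cluster for b when there are more than two clusters.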
#-----------------------------------------------
# STEP 17: Evaluate optimal clustering resolution
#-----------------------------------------------
# We'll use silhouette scores to assess cluster quality at each resolution
# Silhouette score measures how similar cells are to their own cluster vs other clusters
# Values range from -1 to 1 (higher is better)
library(cluster)
# Map UMAP reduction name to corresponding integrated reduction name
# We use integrated space (not UMAP) for silhouette calculation because:
# - Integrated space preserves more information (30+ dimensions vs UMAP's 2)
# - Silhouette scores are more meaningful in the space where clustering was performed
integrated_reduction <- switch(reduction_final,
  "umap.cca" = "integrated.cca",
  "umap.rpca" = "integrated.rpca",
  "umap.harmony" = "harmony",
  "umap.mnn" = "integrated.mnn",
  "umap" = "pca"  # Fallback for naive merge
)
cat("\nEvaluating clustering resolutions using", integrated_reduction, "space\n")
# For large datasets (>10,000 cells), subsample for silhouette calculation (use 5000 cells in our case)
# Computing distance matrices for all cells is computationally prohibitive
n_cells <- ncol(integrated_final)
max_cells_for_silhouette <- 5000
if (n_cells > max_cells_for_silhouette) {
  cat("Dataset has", n_cells, "cells - subsampling", max_cells_for_silhouette, "cells for silhouette calculation\n")
  set.seed(42)
  subsample_idx <- sample(1:n_cells, max_cells_for_silhouette)
} else {
  subsample_idx <- 1:n_cells
}
silhouette_scores <- sapply(resolutions, function(res) {
  cluster_col <- paste0("clusters_res_", res)
  # Get clusters for subsampled cells
  clusters <- as.numeric(integrated_final@meta.data[[cluster_col]][subsample_idx])
  # Get integrated coordinates for subsampled cells (use first 30 dimensions)
  coords <- Embeddings(integrated_final, reduction = integrated_reduction)[subsample_idx, ]
  coords <- coords[, 1:min(30, ncol(coords))]
  # Calculate silhouette (only if we have at least 2 clusters)
  if (length(unique(clusters)) > 1) {
    dist_matrix <- dist(coords)
    sil <- silhouette(clusters, dist_matrix)
    mean(sil[, 3])
  } else {
    NA  # Return NA if only 1 cluster
  }
})
# Create resolution comparison table
resolution_comparison <- data.frame(
  resolution = resolutions,
  n_clusters = sapply(resolutions, function(res) {
    cluster_col <- paste0("clusters_res_", res)
    length(unique(integrated_final@meta.data[[cluster_col]]))
  }),
  silhouette_score = silhouette_scores
)
cat("\nClustering resolution comparison:\n")
print(resolution_comparison)
## resolution n_clusters silhouette_score
## 1 0.4 17 0.2040787
## 2 0.6 23 0.2013915
## 3 0.8 24 0.1951003
## 4 1.0 26 0.1962814
## 5 1.2 30 0.1718389
Note on Subsampling for Large Datasets: When datasets have >5,000 cells, computing full distance matrices for silhouette scores becomes computationally prohibitive (would require >25 GB memory for this dataset). Subsampling 5,000 cells is standard practice and provides reliable estimates of cluster quality. The silhouette scores from the subsample accurately reflect the separation quality of the full clustering.
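As a rough sketch of the arithmetic behind that warning: the distance triangle alone for ~72,000 cells is about 19 GiB, and the working copies R makes during dist() and silhouette() push the true footprint well beyond that.

```r
# Back-of-envelope memory for a full distance matrix:
# n cells require n * (n - 1) / 2 pairwise distances, stored as 8-byte doubles.
n <- 72000                    # approximate cell count in this tutorial
bytes <- n * (n - 1) / 2 * 8
round(bytes / 1024^3, 1)      # 19.3 GiB for the distance triangle alone
```

At 5,000 cells the same triangle is under 100 MiB, which is why subsampling makes the calculation tractable.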
# Recommend optimal resolution
# Choose resolution with good silhouette score and interpretable cluster number
optimal_idx <- which.max(resolution_comparison$silhouette_score)
optimal_resolution <- resolution_comparison$resolution[optimal_idx]
cat("\nRecommended resolution:", optimal_resolution,
    "(", resolution_comparison$n_clusters[optimal_idx], "clusters )\n")
# Recommended resolution: 0.4 ( 17 clusters )
# Set the optimal clustering as the default
integrated_final$seurat_clusters <- integrated_final@meta.data[[paste0("clusters_res_", optimal_resolution)]]
Idents(integrated_final) <- "seurat_clusters"
Silhouette Score Interpretation:
- >0.5: Strong cluster separation (good)
- 0.3-0.5: Moderate separation (acceptable)
- <0.3: Weak separation (may be over-clustering)
Selection strategy: Choose the resolution with the highest silhouette score while maintaining biologically interpretable cluster numbers. For PBMCs, 8-15 clusters is typical (major cell types + subtypes).
Cluster Quality Assessment
#-----------------------------------------------
# STEP 18: Assess cluster quality and stability
#-----------------------------------------------
# Check cluster sizes
cluster_sizes <- table(integrated_final$seurat_clusters)
cat("\nCluster sizes:\n")
print(cluster_sizes)
# Identify very small clusters (<1% of cells)
min_cluster_size <- 0.01 * ncol(integrated_final)
small_clusters <- which(cluster_sizes < min_cluster_size)
if (length(small_clusters) > 0) {
  cat("\n⚠ Warning: Small clusters detected (<1% of cells):\n")
  print(cluster_sizes[small_clusters])
  cat("  These may represent rare cell types or over-clustering\n")
}
# Check sample distribution across clusters
sample_cluster_table <- table(integrated_final$seurat_clusters, integrated_final$sample_id)
sample_cluster_pct <- prop.table(sample_cluster_table, margin = 1) * 100
# Identify sample-specific clusters (>70% from one sample)
sample_specific <- apply(sample_cluster_pct, 1, max) > 70
if (any(sample_specific)) {
  cat("\n⚠ Warning: Sample-dominated clusters detected (>70% from single sample):\n")
  print(names(which(sample_specific)))
  cat("  These may indicate incomplete batch correction\n")
}
# Visualize sample distribution in clusters
sample_dist_data <- as.data.frame.matrix(sample_cluster_pct)
sample_dist_data$cluster <- rownames(sample_dist_data)
sample_dist_long <- reshape2::melt(sample_dist_data, id.vars = "cluster",
                                   variable.name = "sample", value.name = "percentage")
p_sample_dist <- ggplot(sample_dist_long, aes(x = cluster, y = percentage, fill = sample)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Sample Distribution Across Clusters",
       subtitle = "Check for sample-dominated clusters (poor integration)",
       x = "Cluster", y = "Percentage of Cells") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "right")
ggsave("plots/clustering/08_sample_distribution_clusters.png", p_sample_dist,
       width = 12, height = 6, dpi = 300)
# Check condition distribution
condition_dist <- table(integrated_final$seurat_clusters, integrated_final$condition)
condition_dist_pct <- prop.table(condition_dist, margin = 1) * 100
cat("\nCondition distribution across clusters:\n")
print(round(condition_dist_pct, 1))
# Visualize condition distribution
condition_dist_data <- as.data.frame.matrix(condition_dist_pct)
condition_dist_data$cluster <- rownames(condition_dist_data)
condition_dist_long <- reshape2::melt(condition_dist_data, id.vars = "cluster",
                                      variable.name = "condition", value.name = "percentage")
p_condition_dist <- ggplot(condition_dist_long, aes(x = cluster, y = percentage, fill = condition)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Condition Distribution Across Clusters",
       subtitle = "Check if biological conditions are preserved",
       x = "Cluster", y = "Percentage of Cells") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_manual(values = c("Healthy" = "#2E86AB",
                               "Post_Treatment" = "#F18F01"))
ggsave("plots/clustering/09_condition_distribution_clusters.png", p_condition_dist,
       width = 12, height = 6, dpi = 300)


Final Integrated and Clustered Visualization
#-----------------------------------------------
# STEP 19: Create comprehensive visualization of final results
#-----------------------------------------------

# Main UMAP with clusters
p_final_clusters <- DimPlot(integrated_final,
                            reduction = reduction_final,
                            group.by = "seurat_clusters",
                            label = TRUE,
                            label.size = 5,
                            pt.size = 0.05) +
  ggtitle(paste0("Final Clustering (", best_method, " Integration)")) +
  theme(plot.title = element_text(face = "bold", size = 14))

# Split by condition to show preservation
p_by_condition <- DimPlot(integrated_final,
                          reduction = reduction_final,
                          group.by = "seurat_clusters",
                          split.by = "condition",
                          label = TRUE,
                          label.size = 4,
                          pt.size = 0.05,
                          ncol = 2) +
  ggtitle("Clusters by Treatment Condition") +
  theme(strip.text = element_text(face = "bold"))

# Colored by sample (integration quality check)
p_by_sample <- DimPlot(integrated_final,
                       reduction = reduction_final,
                       group.by = "sample_id",
                       pt.size = 0.05,
                       cols = sample_colors) +
  ggtitle("Sample Mixing (Integration Quality)") +
  theme(legend.text = element_text(size = 7))

# Colored by condition
p_by_condition_single <- DimPlot(integrated_final,
                                 reduction = reduction_final,
                                 group.by = "condition",
                                 pt.size = 0.05,
                                 cols = condition_colors) +
  ggtitle("Treatment Conditions")

# Combine final visualizations
combined_final <- (p_final_clusters | p_by_condition_single) /
  (p_by_sample | plot_spacer())
ggsave("plots/clustering/10_final_integrated_clustered.png", combined_final,
       width = 16, height = 12, dpi = 300)

# Split by condition detailed view
ggsave("plots/clustering/11_clusters_by_condition_split.png", p_by_condition,
       width = 18, height = 6, dpi = 300)


Saving Integrated and Clustered Data
#-----------------------------------------------
# STEP 20: Save integrated and clustered data
#-----------------------------------------------
cat("\n=== Saving Final Results ===\n")

# Join layers for downstream analysis
integrated_final <- JoinLayers(integrated_final)

# Save the final integrated object
saveRDS(integrated_final, "integrated_data/integrated_clustered_seurat.rds")

# Save metadata
write.csv(integrated_final@meta.data,
          "metadata/integrated_cell_metadata.csv",
          row.names = TRUE)

# Create comprehensive summary
integration_summary <- data.frame(
  dataset = "GSE174609",
  n_samples = length(unique(integrated_final$sample_id)),
  n_cells_total = ncol(integrated_final),
  n_genes = nrow(integrated_final),
  integration_method = best_method,
  optimal_resolution = optimal_resolution,
  n_clusters = length(unique(integrated_final$seurat_clusters)),
  mixing_score = mixing_results$mixing_score[mixing_results$method == best_method],
  condition_separation = mixing_results$condition_separation[mixing_results$method == best_method]
)
write.csv(integration_summary, "metadata/integration_summary.csv", row.names = FALSE)

Understanding Your Integration Results: Interpreting the Plots
What Good Integration Looks Like
After successful integration and clustering, you should observe:
1. Sample Mixing (plots/clustering/10_final_integrated_clustered.png, bottom left)
- ✓ All 8 samples intermingle within each cluster
- ✓ No obvious sample-specific sub-clusters
- ✓ Uniform color distribution across the UMAP
Red flags:
- ✗ Samples form separate islands
- ✗ One sample dominates a cluster (>70%)
- ✗ Clear batch structure still visible
2. Biological Signal Preservation (plots/clustering/10_final_integrated_clustered.png, top right)
- ✓ Healthy and Post-treatment conditions are detectable
- ✓ Some clusters show condition-specific enrichment (this is biology!)
- ✓ Treatment effects are visible but don’t drive overall structure
Red flags:
- ✗ All conditions perfectly mixed in every cluster (over-integration)
- ✗ No condition-specific patterns at all
- ✗ Loss of expected biological differences
3. Cluster Quality (plots/clustering/06_multi_resolution_clustering.png)
- ✓ Clear cluster boundaries with minimal bridging
- ✓ Clusters remain stable across similar resolutions
- ✓ Clusters have reasonable sizes (not 95% in one cluster, 5% scattered)
Red flags:
- ✗ Diffuse clouds with no clear boundaries
- ✗ Many tiny clusters (<50 cells each)
- ✗ Clusters disappear or merge dramatically at small resolution changes
Interpreting Multi-Resolution Clustering
Low resolution (0.4-0.5): Major cell types emerge
- Expect: 6-10 clusters for PBMCs
- Typical: “T cells”, “B cells”, “Myeloid cells”, “NK cells”
Medium resolution (0.6-0.8): Cell subtypes resolve
- Expect: 10-15 clusters
- Typical: “CD4+ T cells”, “CD8+ T cells”, “Classical monocytes”, “Non-classical monocytes”
High resolution (1.0-1.2): Fine-grained states appear
- Expect: 15-20+ clusters
- Typical: “Naive CD4+ T”, “Memory CD4+ T”, “Cytotoxic CD8+ T”, “CD14+ monocytes”, “CD16+ monocytes”
Choosing resolution:
- Start with medium (0.6-0.8) for standard analysis
- Use high resolution if focusing on specific cell types
- Use low resolution for broad overview or preliminary analysis
Common Integration Issues and Diagnoses
Issue 1: Under-Integration
- Symptoms: Samples still separate, mixing score <2.5
- Diagnosis: Check plots/integration_comparison/02_integration_by_sample.png – samples form distinct clusters
- Solutions:
- Try a more aggressive method (e.g., CCA instead of RPCA; Harmony rather than FastMNN)
- Increase integration dimensions
- Check if samples are genuinely too different (different tissues, different protocols)
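To make the mixing-score idea concrete, here is an illustrative cluster-level metric in base R: the normalized Shannon entropy of each cluster's sample composition. Note this is a simplified sketch for intuition only; the tutorial's actual `mixing_results` score is computed differently (on kNN neighborhoods earlier in the workflow), and the toy labels below are invented.

```r
# Sketch: normalized Shannon entropy of sample composition per cluster
# 1 = perfectly mixed across samples, 0 = cluster from a single sample
cluster_mixing_entropy <- function(clusters, samples) {
  tab <- table(clusters, samples)
  props <- prop.table(tab, margin = 1)
  apply(props, 1, function(p) {
    p <- p[p > 0]
    -sum(p * log(p)) / log(ncol(tab))  # divide by max entropy to normalize
  })
}

# Toy example: cluster "A" evenly mixed across 4 samples,
# cluster "B" dominated by sample S1 (74 of 80 cells)
clusters <- c(rep("A", 80), rep("B", 80))
samples  <- c(rep(paste0("S", 1:4), 20),
              c(rep("S1", 74), rep(paste0("S", 2:4), 2)))
ent <- cluster_mixing_entropy(clusters, samples)
round(ent, 2)
# Cluster A approaches 1 (well mixed); cluster B scores far lower,
# flagging it just like the >70% single-sample check
```

An entropy near 0 for a cluster is the same red flag as the sample-dominated check above, but on a continuous scale, which makes it easier to compare integration runs.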
Issue 2: Over-Integration
- Symptoms: Perfect mixing but lost biological signal, separation score <1.5
- Diagnosis: Check plots/integration_comparison/03_integration_by_condition.png – conditions indistinguishable
- Solutions:
- Use a less aggressive method (e.g., FastMNN instead of Harmony)
- Reduce integration strength (fewer dimensions)
- Check if biological differences are truly subtle
Issue 3: Sample-Specific Clusters
- Symptoms: One cluster is >70% from single sample
- Diagnosis: Check plots/clustering/08_sample_distribution_clusters.png
- Solutions:
- Re-run integration with stronger correction
- Check if that sample has genuine unique biology (e.g., rare disease sample)
- Verify QC wasn’t too lenient for that sample
Issue 4: Over-Clustering
- Symptoms: Many tiny clusters, unstable across resolutions
- Diagnosis: Check plots/clustering/06_multi_resolution_clustering.png
- Solutions:
- Lower resolution parameter
- Check for over-splitting of continuous populations
- May indicate heterogeneous cell states rather than discrete types
Best Practices for Integration and Clustering
General Principles
1. Always Compare Multiple Integration Methods
- Don’t assume one method is universally best
- Different methods excel with different data characteristics
- Visual comparison is more informative than any single metric
- Save results from multiple methods for reviewer questions
2. Validate Integration Quality Systematically
- Visual inspection (UMAPs colored by sample and condition)
- Quantitative metrics (mixing scores, silhouette scores)
- Biological validation (known marker genes still differential)
- Sample distribution across clusters
3. Choose Resolution Based on Biological Question
- Broad comparison: Low resolution (major cell types)
- Cell type focus: Medium resolution (subtypes)
- Cell state analysis: High resolution (fine-grained states)
- Can always re-cluster at different resolution later
4. Document All Decisions
- Record which integration method and why
- Note resolution choice and rationale
- Save all comparison plots
- Keep integration quality metrics
Method Selection Guidelines
Use CCA when:
- ✓ Standard dataset (10,000-100,000 cells)
- ✓ Different cell type compositions between samples
- ✓ You want robust, well-tested method
- ✓ Default choice for most experiments
Use RPCA when:
- ✓ Large datasets (>100,000 cells)
- ✓ Similar cell type composition across samples
- ✓ Computational resources are limited
- ✓ Speed is important
Use Harmony when:
- ✓ Very large datasets (>500,000 cells)
- ✓ Need fast integration for exploration
- ✓ Soft correction is preferred
- ✓ Multiple batch variables to correct
Use FastMNN when:
- ✓ Subtle biological differences need preservation
- ✓ Conservative correction preferred
- ✓ Complex batch structure
- ✓ Reference-based integration
Note: scVI is another powerful integration method but requires complex Python/PyTorch setup. The four R-based methods above are sufficient for most analyses and much easier to configure.
Resolution Selection Strategy
Step 1: Survey multiple resolutions
# Test range of resolutions
resolutions <- seq(0.2, 1.5, by = 0.2)
Step 2: Evaluate stability
- Clusters should be similar at adjacent resolutions
- Major cell types stable across wide range
- Subclusters emerge gradually, not suddenly
Step 3: Consider cluster sizes
- Avoid <50 cell clusters unless biologically justified
- Aim for relatively balanced cluster sizes
- Very large clusters (>30%) may need higher resolution
Step 4: Biological interpretation
- Can you assign biological identity to most clusters?
- Do cluster numbers match expected cell type diversity?
- For PBMCs: 8-15 clusters typical
Step 5: Downstream analysis needs
- Differential expression: Medium resolution (statistical power)
- Trajectory analysis: Lower resolution (avoid fragmenting trajectories)
- Cell-cell communication: Medium-high resolution (detail needed)
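The stability check in Step 2 can be quantified with the adjusted Rand index (ARI) between cluster labels at adjacent resolutions: values near 1 mean the partitions barely changed. A minimal base-R implementation, using invented toy labels in place of the `seurat_clusters` columns you would pull from metadata:

```r
# Sketch: adjusted Rand index between two clusterings (base R)
adjusted_rand_index <- function(a, b) {
  tab <- table(a, b)
  n <- sum(tab)
  sum_comb <- function(x) sum(choose(x, 2))
  idx     <- sum_comb(tab)                       # agreeing pairs
  exp_idx <- sum_comb(rowSums(tab)) * sum_comb(colSums(tab)) / choose(n, 2)
  max_idx <- (sum_comb(rowSums(tab)) + sum_comb(colSums(tab))) / 2
  (idx - exp_idx) / (max_idx - exp_idx)
}

# Toy labels mimicking adjacent resolutions: the higher "resolution"
# splits one of three clusters in two, everything else unchanged
res_low  <- rep(c(1, 2, 3), each = 100)
res_high <- c(rep(1, 100), rep(2, 100), rep(3, 50), rep(4, 50))
adjusted_rand_index(res_low, res_high)
# A high ARI (here well above 0.8) indicates a stable regime; a sharp
# drop between adjacent resolutions signals unstable clustering
```

On real data, compute ARI between each pair of adjacent resolution columns (e.g., `integrated_final$integrated_snn_res.0.6` vs `...res.0.8`, names depending on your assay) and favor resolutions within a stable plateau.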
Common Pitfalls to Avoid
Pitfall 1: Using Single Integration Method Without Comparison
- Problem: May not be optimal for your data
- Solution: Always test 2-3 methods, compare results
Pitfall 2: Ignoring Biological Signal Loss
- Problem: Over-aggressive integration removes true differences
- Solution: Check condition separation, validate marker genes
Pitfall 3: Arbitrary Resolution Choice
- Problem: Too high (over-clustering) or too low (under-clustering)
- Solution: Test multiple resolutions, use metrics and biological knowledge
Pitfall 4: Not Validating Integration Quality
- Problem: Proceed with poorly integrated data
- Solution: Check mixing scores, sample distribution, visual inspection
Pitfall 5: Treating Clusters as Ground Truth
- Problem: Clusters are computational constructs, not biological absolutes
- Solution: Validate with marker genes, biological knowledge, orthogonal methods
Pitfall 6: Ignoring Small Clusters
- Problem: Miss rare cell types or fail to detect over-clustering
- Solution: Investigate small clusters (<1%) – rare cell type or artifact?
When Integration May Not Be Necessary
Skip integration if:
- Single sample experiment
- No batch effects observed (rare)
- Samples genuinely represent different tissues/conditions you want to analyze separately
- Preliminary QC/exploration phase
Use minimal integration if:
- Mild batch effects
- Want to preserve maximum biological variation
- Samples very similar in composition
Issue-Specific Solutions
Issue: Samples Still Separate After Integration
Diagnosis:
# Check mixing score
mixing_results # Should be >3.0
# Visual check
DimPlot(integrated_obj, group.by = "sample_id")
Solutions:
# Try a more aggressive method
# (object = your merged Seurat v5 object with per-sample layers)
integrated <- IntegrateLayers(
  object = seurat_obj,
  method = CCAIntegration,        # instead of RPCAIntegration
  orig.reduction = "pca",
  new.reduction = "integrated.cca",
  dims = 1:40,                    # increase dimensions
  k.anchor = 20                   # increase anchor strength
)
Issue: Over-Integration (Lost Biological Signal)
Diagnosis:
# Check condition separation
assess_condition_separation(integrated_obj) # Should be >2.0
# Check marker genes
FeaturePlot(integrated_obj, features = c("CD3D", "CD14"))
Solutions:
# Use a conservative method
integrated <- IntegrateLayers(
  object = seurat_obj,
  method = FastMNNIntegration,    # conservative correction
  orig.reduction = "pca",
  new.reduction = "integrated.mnn",
  dims = 1:20                     # fewer dimensions
)
# Or reduce Harmony's correction strength
integrated <- RunHarmony(
  seurat_obj,
  group.by.vars = "sample_id",
  theta = 1                       # lower theta = less correction (default is 2)
)
Issue: Very Unbalanced Cluster Sizes
Diagnosis:
table(integrated_obj$seurat_clusters)
# One cluster with 80% of cells?
Solutions:
# Increase resolution
integrated_obj <- FindClusters(integrated_obj, resolution = 1.2)
# Check if PCA captures enough variation
ElbowPlot(integrated_obj, ndims = 50)
# Try more PCs
integrated_obj <- RunPCA(integrated_obj, npcs = 50)
integrated_obj <- FindNeighbors(integrated_obj, dims = 1:40)
Issue: Too Many Tiny Clusters
Diagnosis:
# Many clusters with <50 cells
cluster_sizes <- table(integrated_obj$seurat_clusters)
sum(cluster_sizes < 50)
Solutions:
# Decrease resolution
integrated_obj <- FindClusters(integrated_obj, resolution = 0.4)
# Increase k.param (more neighbors)
integrated_obj <- FindNeighbors(integrated_obj, k.param = 30)
Issue: Sample-Specific Clusters
Diagnosis:
# Check sample distribution
table(integrated_obj$seurat_clusters, integrated_obj$sample_id)
Solutions:
# 1. Re-run integration with stronger correction
# 2. Check if that sample failed QC
# 3. Verify if truly unique biology (e.g., disease-specific cells)
# If technical, remove outlier sample:
integrated_obj <- subset(integrated_obj, subset = sample_id != "problematic_sample")
Conclusion: From Raw Data to Biological Insights
Congratulations! You’ve completed the integration and clustering phase of single-cell RNA-seq analysis.
Critical Insights
1. Integration is Not Optional
For multi-sample experiments, proper integration is essential. Technical variation from sample preparation, sequencing, and processing will overwhelm biological signal without correction.
2. Method Choice Matters
Different integration methods have different strengths. Always compare multiple approaches for your specific dataset rather than blindly applying one method.
3. Biology Should Guide Analysis
While algorithms provide suggestions, biological knowledge should inform decisions about:
- Integration aggressiveness
- Clustering resolution
- Cluster merging or splitting
- Outlier handling
4. Validation is Continuous
Integration and clustering aren’t one-time steps. As you proceed to cell type annotation and differential expression, you may need to revisit and refine these analyses.
What Makes Good Integration
Remember these principles:
- Samples mix within cell types (technical variation removed)
- Conditions remain distinguishable (biological variation preserved)
- Known biology is recapitulated (expected cell types present)
- Rare populations are retained (not over-corrected away)
- Marker genes remain differential (biological signal intact)
Resources for Continued Learning
Official Documentation:
- Seurat v5 Integration vignettes: https://satijalab.org/seurat/articles/seurat5_integration
- Harmony documentation: https://github.com/immunogenomics/harmony
- FastMNN documentation: https://bioconductor.org/packages/batchelor/
Recommended Reading:
- Luecken & Theis (2019) – Current best practices in single-cell RNA-seq
- Tran et al. (2020) – Benchmark of batch-effect correction methods
- Korsunsky et al. (2019) – Fast, sensitive and accurate integration (Harmony)
References
- Hao Y, Stuart T, Kowalski MH, et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat Biotechnol. 2024;42(2):293-304. doi:10.1038/s41587-023-01767-y
- Korsunsky I, Millard N, Fan J, et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods. 2019;16(12):1289-1296. doi:10.1038/s41592-019-0619-0
- Haghverdi L, Lun ATL, Morgan MD, Marioni JC. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol. 2018;36(5):421-427. doi:10.1038/nbt.4091
- Stuart T, Butler A, Hoffman P, et al. Comprehensive Integration of Single-Cell Data. Cell. 2019;177(7):1888-1902.e21. doi:10.1016/j.cell.2019.05.031
- Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for single-cell transcriptomics. Nat Methods. 2018;15(12):1053-1058. doi:10.1038/s41592-018-0229-2
- Luecken MD, Theis FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol. 2019;15(6):e8746. doi:10.15252/msb.20188746
- Tran HTN, Ang KS, Chevrier M, et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 2020;21(1):12. doi:10.1186/s13059-019-1850-9
- Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech. 2008;2008(10):P10008. doi:10.1088/1742-5468/2008/10/P10008
- Traag VA, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep. 2019;9(1):5233. doi:10.1038/s41598-019-41695-z
- Büttner M, Miao Z, Wolf FA, Teichmann SA, Theis FJ. A test metric for assessing single-cell RNA-seq batch correction. Nat Methods. 2019;16(1):43-49. doi:10.1038/s41592-018-0254-1
- Lee H, Joo JY, Sohn DH, et al. Single-cell RNA sequencing reveals rebalancing of immunological response in patients with periodontitis after non-surgical periodontal therapy. J Transl Med. 2022;20(1):504. doi:10.1186/s12967-022-03686-8
- Amezquita RA, Lun ATL, Becht E, et al. Orchestrating single-cell analysis with Bioconductor. Nat Methods. 2020;17(2):137-145. doi:10.1038/s41592-019-0654-x
- Luecken MD, et al. Benchmarking atlas-level data integration in single-cell genomics. Nat Methods. 2022;19(1):41-50.
- Gayoso A, et al. A Python library for probabilistic analysis of single-cell omics data. Nat Biotechnol. 2022;40(2):163-166.
This tutorial is part of the NGS101.com series on single cell sequencing analysis. If this tutorial helped advance your research, please comment and share your experience to help other researchers! Subscribe to stay updated with our latest bioinformatics tutorials and resources.




