How to Analyze Single-Cell RNA-seq Data – Complete Beginner’s Guide Part 5: Cell Type-Specific Differential Expression, Proportion Testing, and Functional Pathway Analysis

Table of Contents

Introduction: From Cell Types to Biological Mechanisms

In Parts 1-4 of this tutorial series, we’ve taken scRNA-seq data from raw sequencing reads through quality control, integration, clustering, and cell type annotation. We now have a beautifully annotated dataset where every cell has a biological identity (CD4+ T cells, monocytes, etc.) and metadata linking it to experimental conditions (Healthy vs Post-Treatment).

But here’s the critical question: What biological changes does periodontal treatment induce in different immune cell populations?

This is where single-cell analysis truly shines compared to bulk RNA-seq. Instead of averaging across all cells and losing cell-type-specific information, scRNA-seq allows us to ask precise, mechanistic questions:

Do treatment effects differ across cell types? (Maybe monocytes respond strongly while B cells don’t)
Which immune populations expand or contract with treatment? (Cell proportion changes)
What molecular programs activate in specific cell types? (Functional state analysis)
Are treatment responses consistent across patients? (Individual variation)

Why Single-Cell RNA-seq Transforms Differential Analysis

The fundamental limitation of bulk RNA-seq:

Imagine you have a bag of mixed fruit and you blend it all together. Bulk RNA-seq measures the average properties of this smoothie. If apples decrease but oranges increase, you might see no change at all—the signal cancels out.

How scRNA-seq solves this:

With scRNA-seq, you can separate the fruit first, then measure each type individually. You discover that apples actually doubled while oranges tripled—critical biological insights that bulk analysis would miss entirely.

Concrete example from our periodontitis study:

Bulk RNA-seq might show: “Inflammatory genes are slightly upregulated after treatment” (p=0.08, not significant)

scRNA-seq reveals:

CD14+ monocytes: Massively downregulate inflammatory genes (p<0.001)
CD16+ monocytes: Dramatically upregulate inflammatory genes (p<0.001)
T cells: No change in inflammatory genes (p=0.95)

The bulk result was averaging opposite effects, obscuring the true biology! Only scRNA-seq reveals that treatment induces an anti-inflammatory shift in classical monocytes while activating non-classical monocytes.

Additional advantages of scRNA-seq analysis:

Cell composition changes: Discover if treatment works by eliminating certain cell types or recruiting new ones
Cell-type-specific responses: Understand why some cells respond while others don’t
Rare population discovery: Identify small but critical cell populations driving disease
State transitions: Track cells moving between functional states
Patient heterogeneity: Understand individual variation in treatment response

The Three Complementary Analyses in This Tutorial

To fully understand treatment effects, we perform three interconnected analyses:

Analysis 1: Cell Type Proportion Analysis

Question: Does treatment change the abundance of different immune cell types?

What we measure: The percentage of each cell type in each sample

Example findings:

“CD14+ monocytes increase from 15% to 25% post-treatment” (recruitment)
“B cells decrease from 10% to 5%” (depletion or migration)
“NK cells remain constant at 8%” (unaffected by treatment)

Why it matters: Proportion changes can indicate:

Which immune cells are recruited to fight infection
Whether treatment depletes pathogenic populations
If compensatory mechanisms activate (one type replaces another)

Analysis 2: Cell-Type-Specific Differential Expression

Question: Within each cell type, which genes change expression between conditions?

What we measure: Gene expression differences between Healthy and Post-Treatment within each cell type separately

Example findings:

“In CD14+ monocytes: IL1B, TNF, and S100A8 are downregulated post-treatment” (reduced inflammation)
“In CD4+ T cells: IFNG and IL2 are upregulated post-treatment” (T cell activation)
“In B cells: No significant changes” (not responsive to treatment)

Two statistical approaches we’ll use:

Approach 1: Cell-Level Analysis (FindMarkers)

Compares individual cells between conditions
Fast and straightforward
Good for exploratory analysis
Limitation: Doesn’t account for patient/sample structure (pseudoreplication)

Approach 2: Pseudobulk Analysis (DESeq2, limma)

Aggregates cells by sample, then compares samples
Accounts for biological replicates properly
More statistically rigorous
Recommended for publication

Why we need both:

FindMarkers is fast for exploration and biomarker discovery
Pseudobulk is rigorous for final statistical claims
Agreement between methods strengthens conclusions

Critical difference from bulk RNA-seq:

Aspect	Bulk RNA-seq	scRNA-seq
What’s measured	Average across all cells	Individual cells
DE testing	Sample vs sample	Cell vs cell OR pseudobulk
Cell type info	Lost (averaged away)	Preserved (analyzed separately)
Statistical unit	Biological sample	Cell (cell-level) OR sample (pseudobulk)
Main advantage	Deep sequencing depth	Cell-type resolution

Analysis 3: Functional State Scoring

Question: Are specific biological processes (inflammation, cell cycle, stress response) activated or suppressed in certain cell types?

What we measure: Enrichment of predefined gene sets (pathways) in each cell

Example findings:

“CD14+ monocytes show high ‘Inflammatory Response’ scores in pre-treatment but low scores post-treatment”
“NK cells show elevated ‘Interferon Response’ regardless of treatment”
“B cells have low scores for all immune activation pathways” (quiescent)

How it works:

Define gene sets: Download curated pathways from MSigDB (e.g., “Hallmark Inflammatory Response”)
Score cells: Calculate how strongly each cell expresses that gene set
Compare conditions: Test if scores differ between Healthy vs Post-Treatment
Interpret biology: High scores = pathway active, low scores = pathway suppressed

When to use functional state scoring:

Use when:

You want to test specific biological hypotheses (e.g., “Does treatment reduce inflammation?”)
You’re interested in biological processes rather than individual genes
You want interpretable, pathway-level insights
Multiple genes in a pathway show subtle coordinated changes

Don’t use when:

You’re doing unbiased discovery (use DE analysis instead)
Gene sets for your process don’t exist
Individual gene changes are more important than pathways
You need to identify novel biology (gene sets are predefined)

Examples of biological questions answered by functional scoring:

Is oxidative stress higher in diseased vs healthy cells?
Do specific T cell subsets show cell cycle activity?
Are interferon responses activated in viral infection?
Does treatment suppress inflammatory signaling?

Understanding Differential Expression: Cell-Level vs Pseudobulk

This is a critical concept that confuses many beginners. Let’s explain both approaches clearly.

Approach 1: Cell-Level Differential Expression (Seurat’s FindMarkers)

What it does: Directly compares gene expression in individual cells between conditions

Example:

Group A: 5,000 CD14+ monocytes from Healthy donors
Group B: 5,000 CD14+ monocytes from Post-Treatment patients
Test: Is gene X higher in Group A vs Group B?

Statistical test: Wilcoxon rank-sum test (default) or other non-parametric tests

Advantages:

Fast and simple
No aggregation required
Good for biomarker discovery
Captures cell-to-cell variability

Disadvantages:

Pseudoreplication: Treats 1,000 cells from one patient as independent samples (they’re not!)
Inflated false positives (many “significant” genes that aren’t truly different)
Doesn’t account for patient/batch effects
Not ideal for publication claims

When to use:

Exploratory analysis
Quick screening for candidate genes
When you don’t have biological replicates
Visualization and interpretation (still useful even if not final statistics)

Approach 2: Pseudobulk Differential Expression (DESeq2, limma)

What it does: Aggregates cells by sample, then compares samples (like bulk RNA-seq)

Statistical test: DESeq2 (negative binomial model accounting for count overdispersion)

Advantages:

Proper biological replicates (samples, not cells)
Accounts for patient effects
More conservative (fewer false positives)
Publication-quality statistics
Handles count data properly (overdispersion, mean-variance relationship)

Disadvantages:

Loses single-cell resolution
Requires multiple samples per condition (need biological replicates)
Slower computationally
May miss very rare cell-type-specific effects

When to use:

Final statistical claims for publication
When you have >=3 biological replicates per condition
Testing treatment effects with patient structure
Comparing across multiple conditions with complex designs

Why pseudobulk is necessary: The pseudoreplication problem

Imagine you have 4 healthy donors and 4 post-treatment patients. Each donor contributes ~1,000 CD14+ monocytes.

Cell-level analysis says: “I have 4,000 vs 4,000 cells = 8,000 independent samples!”

Reality: You have 4 vs 4 patients = 8 independent samples. The 1,000 cells per patient are not independent—they come from the same person and share patient-specific effects (genetics, environment, technical batch).

The consequence: Cell-level analysis massively inflates your sample size, leading to:

p-values that are far too small (10^-50 instead of realistic 0.01)
Hundreds of “significant” genes (most are false positives)
Claims that don’t replicate in new experiments

Pseudobulk fixes this by treating each patient as one sample, giving you the correct statistical power and realistic p-values.

Analogy:

Cell-level: Asking 1,000 people from New York their opinion and claiming you surveyed 1,000 independent cities
Pseudobulk: Correctly recognizing you surveyed 1 city (New York) with 1,000 residents

Choosing Between Cell-Level and Pseudobulk Analysis

Our recommendation: Always do both!

Start with FindMarkers (cell-level) for:

Fast exploration
Identifying candidate genes
Creating visualizations
Understanding biological patterns

Validate with Pseudobulk (DESeq2) for:

Final statistical claims
Publication figures
Identifying robust, reproducible changes
Accounting for patient effects

Report genes that pass both:

Significant in FindMarkers (biomarker candidates)
Significant in pseudobulk (statistically robust)
These are your high-confidence hits

Understanding Functional State Scoring

Functional state scoring answers the question: “Is a biological process active in this cell?”

The concept:

Instead of looking at one gene at a time, we look at sets of genes that work together in biological pathways. If many genes in the “Inflammatory Response” pathway are highly expressed, that cell is in an inflammatory state.

Example gene sets from MSigDB Hallmarks:

Inflammatory Response: IL1B, TNF, IL6, CXCL8, CCL2, etc. (50 genes)
Interferon Gamma Response: STAT1, IRF1, GBP1, CXCL9, etc. (200 genes)
Oxidative Phosphorylation: COX5A, ATP5A1, NDUFB5, etc. (200 genes)
Cell Cycle: CDK1, CCNB1, MKI67, TOP2A, etc. (200 genes)

How scoring works (simplified):

For each cell:

Look at expression of all genes in the gene set
Calculate average expression (with controls for background)
Assign a score (high score = many genes expressed strongly)
Result: Each cell gets a score for each pathway

What functional scores tell you:

High score: Pathway is active (cell is in that functional state)
Low score: Pathway is inactive (cell is not in that state)
Comparing conditions: Do Healthy vs Post-Treatment cells differ in pathway activity?

When functional scoring is most powerful:

Best use cases:

Testing specific hypotheses (“Does treatment reduce inflammation?”)
Identifying cell states (inflammatory vs resting monocytes)
Comparing activation levels across conditions
Understanding coordinated gene programs

Less useful when:

Doing unbiased discovery (DE analysis is better)
Pathways of interest aren’t well-defined
Individual genes matter more than pathways

Why we use pseudobulk for statistical comparison:

Just like with differential expression, comparing functional scores between individual cells has the pseudoreplication problem. We need to:

Calculate scores for each cell (captures cell-to-cell variation)
Aggregate scores by sample (get mean score per patient)
Compare samples statistically (proper biological replicates)

This ensures our conclusions about pathway differences are statistically sound.

What You’ll Learn in This Tutorial

By completing Part 5, you’ll be able to:

Quantify cell type proportions and test for composition differences
Perform cell-level DE analysis with FindMarkers for exploration
Conduct pseudobulk DE analysis with DESeq2 for publication
Compare results between cell-level and pseudobulk approaches
Score functional states using MSigDB gene sets
Interpret biological meaning of differential expression and proportions
Create publication-quality visualizations (volcano plots, heatmaps, violin plots)
Identify robust, reproducible changes across methods

Dataset Reminder: GSE174609 PBMC Study

We continue with our 8-sample dataset:

4 Healthy donors: Baseline immune profiles
4 Post-treatment patients: After non-surgical periodontal therapy

Expected biological findings:

Proportion changes: Treatment may alter immune cell recruitment
Gene expression changes: Anti-inflammatory effects in monocytes/T cells
Functional states: Reduced inflammatory signatures, altered activation

Let’s begin by setting up our R environment and loading the annotated data from Part 4.

Setting Up Your R Environment

We’ll build on the packages from Parts 1-4 and add tools specifically for differential expression and functional analysis.

Installing Required Packages

#-----------------------------------------------
# Install required packages (run once)
#-----------------------------------------------

# Install BiocManager if needed
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install CRAN packages
install.packages(c("Seurat", "msigdbr", "ggplot2", "pheatmap", 
                   "ggridges", "ggrepel", "patchwork", "dplyr", "tidyr"))

# Install Bioconductor packages
BiocManager::install(c("DESeq2", "edgeR", "limma", "speckle", "fgsea", "EnhancedVolcano"))

# Faster FindMarkers
# install.packages("devtools")
devtools::install_github("immunogenomics/presto")

Loading Libraries and Configuration

#-----------------------------------------------
# STEP 1: Load required libraries
#-----------------------------------------------

# Core single-cell analysis
library(Seurat)
library(SeuratObject)

# Differential expression
library(DESeq2)
library(edgeR)
library(limma)

# Proportion analysis
library(speckle)

# Functional analysis
library(fgsea)
library(msigdbr)

# Visualization
library(ggplot2)
library(pheatmap)
library(ggridges)
library(ggrepel)
library(EnhancedVolcano)
library(patchwork)
library(dplyr)
library(tidyr)

# Set working directory
setwd("~/GSE174609_scRNA/differential_analysis")

# Create output directories
dir.create("plots", showWarnings = FALSE)
dir.create("plots/proportions", showWarnings = FALSE)
dir.create("plots/DE_celllevel", showWarnings = FALSE)
dir.create("plots/DE_pseudobulk", showWarnings = FALSE)
dir.create("plots/functional_scoring", showWarnings = FALSE)
dir.create("results", showWarnings = FALSE)

# Set random seed for reproducibility
set.seed(42)

# Configure plotting defaults
theme_set(theme_classic(base_size = 12))

Loading Annotated Data from Part 4

#-----------------------------------------------
# STEP 2: Load annotated Seurat object from Part 4
#-----------------------------------------------

# Load the annotated object
seurat_obj <- readRDS("~/GSE174609_scRNA/cell_type_annotation/annotations/integrated_annotated_seurat.rds")

Note: We’re using final_annotation from Part 4, which represents the consensus cell type labels built from multiple annotation methods (Manual, SingleR, scType, scCATCH).

Part 1: Cell Type Proportion Analysis

Cell type proportions reveal whether treatment affects the composition of the immune system—recruiting specific cell types or depleting others.

Understanding Proportion Analysis

What we’re measuring: The percentage of each cell type in each sample

Why it matters:

Treatment may recruit specific immune cells (proportions increase)
Disease may deplete certain populations (proportions decrease)
Compositional changes reveal immune system reorganization

The compositional challenge:

Cell type proportions are compositional data—they sum to 100% within each sample. This creates statistical challenges:

If one cell type increases, others must decrease (not independent)
Standard t-tests are inappropriate for proportion data
We need methods designed for compositional data

Our approach: Use the propeller function from the speckle package, which:

Accounts for compositional nature of proportions
Tests for significant changes between conditions
Provides FDR-corrected p-values

Calculating Cell Type Proportions

#-----------------------------------------------
# STEP 3: Calculate cell type proportions per sample
#-----------------------------------------------

# Calculate proportions for each sample
proportion_data <- seurat_obj@meta.data %>%
  group_by(sample_id, condition, final_annotation) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(sample_id) %>%
  mutate(
    total_cells = sum(count),
    proportion = count / total_cells,
    percentage = proportion * 100
  ) %>%
  ungroup()

# Save proportion data
write.csv(proportion_data, "results/cell_type_proportions_per_sample.csv", row.names = FALSE)

# Display summary
proportion_summary <- proportion_data %>%
  group_by(condition, final_annotation) %>%
  summarise(
    mean_percentage = mean(percentage),
    sd_percentage = sd(percentage),
    .groups = "drop"
  )

Visualizing Proportions

#-----------------------------------------------
# STEP 4: Visualize cell type proportions
#-----------------------------------------------

# Stacked barplot showing composition of each sample
p_stacked <- ggplot(proportion_data, 
                    aes(x = sample_id, y = percentage, fill = final_annotation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ condition, scales = "free_x") +
  theme_classic(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "right") +
  labs(title = "Cell Type Composition by Sample",
       x = "Sample", y = "Percentage (%)",
       fill = "Cell Type") +
  scale_fill_brewer(palette = "Set3")

ggsave("plots/proportions/01_stacked_composition.png", p_stacked,
       width = 12, height = 6, dpi = 300)

# Grouped barplot with error bars
p_grouped <- ggplot(proportion_summary, 
                    aes(x = final_annotation, y = mean, fill = condition)) +
  geom_bar(stat = "identity", position = position_dodge(0.9)) +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd),
                position = position_dodge(0.9), width = 0.2) +
  theme_classic(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Mean Cell Type Proportions by Condition",
       x = "Cell Type", y = "Mean Percentage (%)") +
  scale_fill_manual(values = c("Healthy" = "#2E86AB", "Post_Treatment" = "#F18F01"))

ggsave("plots/proportions/02_grouped_proportions.png", p_grouped,
       width = 12, height = 6, dpi = 300)

Reading the plots:

Stacked bars: Show overall composition (all bars = 100%)

Grouped bars: Compare mean proportions with error bars (SD)

Statistical Testing with Propeller

The propeller function performs statistical testing specifically designed for compositional data. It uses arcsin-square-root transformation followed by a linear model to test for differences in cell type proportions between conditions.

How propeller works:

Transformation: Cell type proportions are transformed using arcsin(sqrt(proportion)) to stabilize variance and make the data more suitable for linear modeling
Linear Model: Fits a linear model with the transformed proportions as the response variable and experimental conditions as predictors
Moderated t-statistics: Uses empirical Bayes methods (similar to limma) to borrow information across cell types, improving statistical power
Multiple testing correction: Applies FDR correction across all cell types tested

This approach properly accounts for the compositional nature of the data while providing robust statistical inference.

#-----------------------------------------------
# STEP 5: Statistical testing of proportion differences
#-----------------------------------------------

# Prepare data for propeller
cluster_ids <- seurat_obj$final_annotation
sample_ids <- seurat_obj$sample_id

# Create group vector (condition per sample)
sample_to_condition <- unique(seurat_obj@meta.data[, c("sample_id", "condition")])
rownames(sample_to_condition) <- sample_to_condition$sample_id
group <- sample_to_condition[sample_ids, "condition"]

# Run propeller test
propeller_results <- propeller(
  clusters = cluster_ids,
  sample = sample_ids,
  group = group
)

# Save results
write.csv(propeller_results, "results/propeller_proportion_test.csv", row.names = FALSE)

Understanding Propeller Results:

PropMean.Healthy/Post_Treatment: Mean proportion in each condition

PropRatio: Ratio of proportions (Post/Healthy). >1 = increase, <1 = decrease

P.Value: Raw p-value from statistical test

FDR: False discovery rate-adjusted p-value (use this for significance)

Part 2: Cell-Level Differential Expression with FindMarkers

Now we identify differentially expressed genes within each cell type using Seurat’s FindMarkers function. This provides fast exploratory analysis to identify candidate genes.

Running FindMarkers for All Cell Types

#-----------------------------------------------
# STEP 6: Cell-level DE analysis with FindMarkers
#-----------------------------------------------

# Get unique cell types
cell_types <- unique(seurat_obj$final_annotation)

# Set default assay and identify for comparison
DefaultAssay(seurat_obj) <- "RNA"
Idents(seurat_obj) <- "condition"

# Function to run FindMarkers for one cell type
run_findmarkers <- function(celltype, seurat_obj) {
  cat("  Processing:", celltype, "\n")

  # Subset to this cell type
  seurat_obj_sub <- subset(seurat_obj, subset = final_annotation == celltype)

  # Run FindMarkers comparing conditions
  markers <- FindMarkers(
    seurat_obj_sub,
    ident.1 = "Post_Treatment",
    ident.2 = "Healthy",
    test.use = "wilcox",
    logfc.threshold = 0,
    min.pct = 0.1,
    verbose = FALSE
  )

  # Add gene names and cell type
  markers$gene <- rownames(markers)
  markers$cell_type <- celltype

  return(markers)
}

# Run FindMarkers for all cell types
findmarkers_list <- lapply(cell_types, run_findmarkers, seurat_obj = seurat_obj)
names(findmarkers_list) <- cell_types

# Combine results
findmarkers_all <- do.call(rbind, findmarkers_list)
rownames(findmarkers_all) <- NULL

# Add significance flag
findmarkers_all$significant <- findmarkers_all$p_val_adj < 0.05 & 
                               abs(findmarkers_all$avg_log2FC) > 0.25

# Save results
write.csv(findmarkers_all, "results/findmarkers_celllevel_results.csv", row.names = FALSE)

# Summary of significant genes per cell type
sig_summary <- findmarkers_all %>%
  filter(significant) %>%
  group_by(cell_type) %>%
  summarise(
    total_genes_tested = n(),
    sig_genes = sum(significant),
    sig_up = sum(avg_log2FC > 0.25 & p_val_adj < 0.05),
    sig_down = sum(avg_log2FC < -0.25 & p_val_adj < 0.05),
    .groups = "drop"
  )

FindMarkers Parameters:

ident.1: Post-Treatment (numerator in fold-change)

ident.2: Healthy (denominator)

test.use: Statistical test (Wilcoxon rank-sum)

logfc.threshold: Minimum |log2FC| to test (0.25 = 1.19-fold)

min.pct: Gene must be detected in >=10% of cells in at least one group

Visualizing Cell-Level DE Results

#-----------------------------------------------
# STEP 7: Visualize FindMarkers results
#-----------------------------------------------

# Volcano plots for each cell type
for (ct in cell_types) {
  # Get data for this cell type
  ct_data <- findmarkers_all %>% filter(cell_type == ct)

  # Identify top genes to label
  top_genes <- ct_data %>%
    filter(significant) %>%
    arrange(p_val_adj) %>%
    head(10)

  # Create volcano plot
  p_volcano <- ggplot(ct_data, aes(x = avg_log2FC, y = -log10(p_val_adj))) +
    geom_point(aes(color = significant), alpha = 0.5, size = 1) +
    scale_color_manual(values = c("FALSE" = "gray", "TRUE" = "red")) +
    theme_classic(base_size = 12) +
    labs(title = paste("Volcano Plot:", ct),
         x = "log2 Fold Change (Post-Treatment vs Healthy)",
         y = "-log10(FDR)",
         color = "Significant") +
    geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "blue") +
    geom_vline(xintercept = c(-0.25, 0.25), linetype = "dashed", color = "blue")

  # Add labels for top genes
  if (nrow(top_genes) > 0) {
    p_volcano <- p_volcano +
      geom_text_repel(data = top_genes, aes(label = gene), size = 3, max.overlaps = 10)
  }

  ggsave(paste0("plots/DE_celllevel/volcano_", gsub(" ", "_", ct), ".png"),
         p_volcano, width = 8, height = 7, dpi = 300)
}

Part 3: Pseudobulk Differential Expression Analysis

Now we perform the statistically rigorous differential expression analysis by aggregating cells into pseudobulk samples and using DESeq2.

Understanding the Pseudobulk Approach

As explained in the introduction, cell-level analysis treats cells from the same patient as independent samples (pseudoreplication). Pseudobulk corrects this by:

Aggregating cells by sample (sum counts from all cells of a type in each sample)
Creating pseudobulk samples (one per patient, like bulk RNA-seq)
Comparing samples with proper biological replicates (DESeq2)
Accounting for patient effects and count overdispersion

The aggregation process:

Before aggregation (cell-level):
  Healthy_1: 1,200 CD14+ monocytes (each with individual counts)
  Healthy_2: 1,150 CD14+ monocytes
  Healthy_3: 1,100 CD14+ monocytes
  Healthy_4: 1,050 CD14+ monocytes
  Post_1: 1,100 CD14+ monocytes
  Post_2: 1,050 CD14+ monocytes
  Post_3: 1,000 CD14+ monocytes
  Post_4: 950 CD14+ monocytes

After aggregation (pseudobulk):
  Healthy_1: Sum of 1,200 cells -> 1 pseudobulk sample
  Healthy_2: Sum of 1,150 cells -> 1 pseudobulk sample
  ...
  Post_4: Sum of 950 cells -> 1 pseudobulk sample

Result: 4 vs 4 samples (proper biological replicates!)

Creating Pseudobulk Count Matrices with Seurat 5

Seurat 5 provides a dedicated function PseudobulkExpression() specifically designed for creating pseudobulk count matrices.

#-----------------------------------------------
# STEP 8: Create pseudobulk count matrices using PseudobulkExpression
#-----------------------------------------------

# Function to create pseudobulk for one cell type
create_pseudobulk_seurat5 <- function(celltype, seurat_obj) {
  # Subset to this cell type
  seurat_subset <- subset(seurat_obj, subset = final_annotation == celltype)

  # Use PseudobulkExpression to aggregate counts by sample
  pseudobulk_result <- PseudobulkExpression(
    seurat_subset,
    assays = "RNA",
    group.by = "sample_id",     # Aggregate by sample
    layer = "counts",            # Use raw counts
    method = "aggregate"         # Sum counts
  )

  # Extract the RNA counts matrix
  pseudobulk_matrix <- pseudobulk_result$RNA

  # Convert column names back to match original sample_id format
  colnames(pseudobulk_matrix) <- gsub("-", "_", colnames(pseudobulk_matrix))

  # Return as regular matrix (required for DESeq2)
  return(as.matrix(pseudobulk_matrix))
}

# Create pseudobulk for all cell types
pseudobulk_list <- lapply(cell_types, function(ct) {
  cat("  Processing:", ct, "\n")
  create_pseudobulk_seurat5(ct, seurat_obj)
})

names(pseudobulk_list) <- cell_types

CD4+ T cells Pseudobulk Count Matrix:

Running DESeq2 for Each Cell Type

#-----------------------------------------------
# STEP 9: Run DESeq2 pseudobulk analysis
#-----------------------------------------------

# Create sample metadata
sample_metadata <- unique(seurat_obj@meta.data[, c("sample_id", "condition", "patient_id")])

sample_metadata <- as.data.frame(sample_metadata)
rownames(sample_metadata) <- sample_metadata$sample_id

# Function to run DESeq2 for one cell type
run_deseq2_celltype <- function(celltype, pseudobulk_counts, sample_metadata) {

  # Get sample names from count matrix
  samples_in_counts <- colnames(pseudobulk_counts)

  # Verify that sample_metadata rownames match count matrix colnames
  if (!all(samples_in_counts %in% rownames(sample_metadata))) {
    stop("ERROR: Sample names in count matrix don't match sample_metadata rownames!")
  }

  # Order metadata to match count matrix columns
  sample_metadata_ordered <- sample_metadata[samples_in_counts, , drop = FALSE]

  # Create DESeq2 dataset
  dds <- DESeqDataSetFromMatrix(
    countData = pseudobulk_counts,
    colData = sample_metadata_ordered,
    design = ~ condition
  )

  # Filter low-count genes (improve power and speed)
  keep <- rowSums(counts(dds) >= 10) >= 3
  dds <- dds[keep, ]

  # Run DESeq2 analysis
  dds <- DESeq(dds, quiet = TRUE)

  # Extract results
  res <- results(
    dds,
    contrast = c("condition", "Post_Treatment", "Healthy"),
    alpha = 0.05
  )

  # Convert to data frame
  res_df <- as.data.frame(res)
  res_df$gene <- rownames(res_df)
  res_df$cell_type <- celltype

  # Add significance flag
  res_df$significant <- !is.na(res_df$padj) & 
                        res_df$padj < 0.05 & 
                        abs(res_df$log2FoldChange) > 0.25

  return(res_df)
}

# Run DESeq2 for all cell types
deseq2_list <- lapply(cell_types, function(ct) {
  run_deseq2_celltype(ct, pseudobulk_list[[ct]], sample_metadata)
})

names(deseq2_list) <- cell_types

# Combine results
deseq2_all <- do.call(rbind, deseq2_list)
rownames(deseq2_all) <- NULL

# Save results
write.csv(deseq2_all, "results/deseq2_pseudobulk_results.csv", row.names = FALSE)

# Summary of significant genes per cell type
deseq2_summary <- deseq2_all %>%
  group_by(cell_type) %>%
  summarise(
    total_genes_tested = n(),
    sig_genes = sum(significant, na.rm = TRUE),
    sig_up = sum(significant & log2FoldChange > 0.25, na.rm = TRUE),
    sig_down = sum(significant & log2FoldChange < -0.25, na.rm = TRUE),
    .groups = "drop"
  )

Validating Results with limma-voom

To ensure our differential expression findings are robust, we’ll use limma-voom as an independent validation method. Limma uses a different statistical framework than DESeq2, so genes significant in both methods are particularly trustworthy.

Running limma-voom Analysis

#-----------------------------------------------
# STEP 10: limma-voom pseudobulk analysis
#-----------------------------------------------

# Function to run limma-voom for one cell type
limma_paired_analysis <- function(
    counts,
    meta,
    design_formula = ~0 + condition,
    contrast_str = "Post_Treatment - Healthy",
    min_count = 10,
    voom_with_quality_weights = TRUE
) {

  # Order metadata to match count matrix columns
  meta <- meta[match(colnames(counts), meta$sample_id), ]
  rownames(meta) <- meta$sample_id

  # Convert factors
  meta$condition <- factor(meta$condition, levels = c("Healthy", "Post_Treatment"))

  # Filter low expressed genes
  keep <- rowSums(counts) >= min_count
  counts <- counts[keep, ]

  # Create design matrix
  design <- model.matrix(design_formula, data = meta)
  colnames(design) <- gsub("condition", "", colnames(design))

  # Voom transformation with quality weights
  if (voom_with_quality_weights) {
    v <- voomWithQualityWeights(counts, design, plot = FALSE)
  } else {
    v <- voom(counts, design, plot = FALSE)
  }

  # Fit linear model
  fit <- lmFit(v, design)

  # Define contrast
  contrast.matrix <- makeContrasts(
    contrasts = contrast_str,
    levels = colnames(design)
  )

  # Fit contrasts
  fit2 <- contrasts.fit(fit, contrast.matrix)
  fit2 <- eBayes(fit2, trend = TRUE, robust = TRUE)

  # Get results
  res <- topTable(fit2, number = Inf, sort.by = "none", adjust.method = "BH")

  return(res)
}

# Run limma for all cell types
limma_list <- lapply(cell_types, function(ct) {
  cat("\nProcessing:", ct, "\n")
  limma_paired_analysis(pseudobulk_list[[ct]], sample_metadata)
})

names(limma_list) <- cell_types

# Process limma results
limma_all <- lapply(names(limma_list), function(ct) {
  res <- limma_list[[ct]]
  res$gene <- rownames(res)
  res$cell_type <- ct
  res$significant <- res$adj.P.Val < 0.05 & abs(res$logFC) > 0.25
  return(res)
}) %>% bind_rows()

# Save results
write.csv(limma_all, "results/limma_pseudobulk_results.csv", row.names = FALSE)

# Summary of significant genes per cell type
limma_summary <- limma_all %>%
  group_by(cell_type) %>%
  summarise(
    total_genes_tested = n(),
    sig_genes = sum(significant, na.rm = TRUE),
    sig_up = sum(significant & logFC > 0.25, na.rm = TRUE),
    sig_down = sum(significant & logFC < -0.25, na.rm = TRUE),
    .groups = "drop"
  )

What to Expect from Each Method

FindMarkers (Cell-Level Analysis):

FindMarkers typically identifies many more significant genes because it:

Treats each cell as an independent observation (thousands of “samples”)
Tests gene expression differences across all individual cells
Has high sensitivity for detecting cell-to-cell variation
Does not account for sample/patient structure
Is prone to pseudoreplication (cells from the same patient are not independent)

Expected outcome: Hundreds to thousands of significant genes per cell type, with very small p-values (often p < 10^-50)

DESeq2 (Pseudobulk Analysis):

DESeq2 typically identifies fewer but more robust genes because it:

Treats each patient/sample as the statistical unit (proper biological replicates)
Aggregates cells by sample before testing
Accounts for between-patient variability
Is more conservative (fewer false positives)
Has lower power with small sample sizes (we only have 4 vs 4 samples)

Expected outcome: Fewer significant genes (sometimes zero for certain cell types), with realistic p-values reflecting true statistical power

Understanding the Discrepancy in Our Results

Looking at our specific results:

FindMarkers Results:

DESeq2 Results:

What This Tells Us:

FindMarkers is overly optimistic: The hundreds of “significant” genes per cell type are likely inflated by pseudoreplication. When we properly account for patient-level variation, most of these differences disappear.
DESeq2 reveals the statistical reality: With only 4 samples per condition, we have limited statistical power to detect differences. The few genes that remain significant in DESeq2 are likely to be:

Genes with very large effect sizes
Genes with consistent changes across all patients
The most robust, reproducible changes

The dramatic difference indicates:

High between-patient variability in gene expression
Treatment effects may be subtle or patient-specific
Most “significant” genes in FindMarkers are false positives due to pseudoreplication
We would need more samples (>=8 per condition) for adequate power

Biological Interpretation

Why are most cell types showing no significant genes in DESeq2?

This does NOT mean there are no biological effects. It means:

Small sample size: With only 4 vs 4 samples, DESeq2 has limited power to detect differences, especially if:

Effect sizes are moderate (log2FC < 1)
Patient-to-patient variability is high
Changes are not consistent across all patients

Real biology is complex: Treatment responses may be:

Patient-specific (some respond strongly, others don’t)
Cell-state-dependent (only certain subpopulations respond)
Transient (timing of sampling matters)

Conservative approach is correct: DESeq2’s stringent requirements ensure that genes we DO call significant are likely to be reproducible in independent experiments.

Practical Recommendations

For Publication:

Report DESeq2 results as your primary statistical findings
Use FindMarkers results for hypothesis generation only
Clearly state sample size limitations
Consider the few DESeq2-significant genes as high-confidence hits

For Follow-up Studies:

Validate DESeq2-significant genes with qPCR or protein assays
Increase sample size (aim for >=8 per condition) for better power
Consider genes significant in BOTH methods as especially robust
Look for genes with large effect sizes in FindMarkers that approach significance in DESeq2

For Interpretation:

The discrepancy highlights why pseudobulk analysis is essential
Cell-level analysis alone would lead to many false positive claims
Even with “no significant genes,” biological effects may exist but require larger studies to detect

This comparison underscores a fundamental principle: statistical significance depends on both effect size AND sample size. Our dataset reveals real biological complexity that would require larger studies to characterize fully.

Part 4: Functional State Scoring Analysis

Rather than looking at individual genes, functional state scoring evaluates whether entire biological pathways are activated or suppressed in different conditions.

Understanding Pathway Scoring

The concept: Many genes work together in coordinated biological programs. By scoring cells for expression of gene sets (pathways), we can identify functional states.

Advantages over single-gene analysis:

More interpretable (pathways have biological meaning)
More robust (averages over many genes)
Can detect subtle coordinated changes
Answers higher-level questions about cell state

Downloading Gene Sets from MSigDB

#-----------------------------------------------
# STEP 11: Download gene sets from MSigDB
#-----------------------------------------------

# Get Hallmark gene sets for humans
hallmark_gene_sets <- msigdbr(
  species = "Homo sapiens",
  category = "H"  # Hallmark gene sets
)

# Create named list format for scoring
# We'll use HALLMARK_INFLAMMATORY_RESPONSE as our example
hallmark_list <- hallmark_gene_sets %>%
  filter(gs_name == "HALLMARK_INFLAMMATORY_RESPONSE") %>%
  pull(gene_symbol)

MSigDB Hallmark Gene Sets: Curated gene sets representing well-defined biological states or processes, derived from multiple sources and refined by experts.

Scoring Cells for Inflammatory Response

#-----------------------------------------------
# STEP 12: Score cells for inflammatory response pathway
#-----------------------------------------------

# Use Seurat's AddModuleScore to calculate pathway scores
seurat_obj <- AddModuleScore(
  seurat_obj,
  features = list(hallmark_list),
  name = "Inflammatory_Response",
  nbin = 24,
  ctrl = 100
)

# The score is added as a new metadata column
# Rename for clarity (Seurat adds "1" to the name)
colnames(seurat_obj@meta.data)[colnames(seurat_obj@meta.data) == "Inflammatory_Response1"] <- "Inflammatory_Response_Score"

AddModuleScore Parameters:

features: List of gene sets to score

name: Prefix for score column names

nbin: Number of expression bins for control gene selection

ctrl: Number of control genes per bin

Visualizing Pathway Scores by Cell Type and Condition

#-----------------------------------------------
# STEP 13: Visualize inflammatory response scores
#-----------------------------------------------

# Ridge plot showing score distributions by cell type and condition
p_ridge <- ggplot(seurat_obj@meta.data, 
                  aes(x = Inflammatory_Response_Score, 
                      y = final_annotation, 
                      fill = condition)) +
  geom_density_ridges(alpha = 0.7) +
  theme_classic(base_size = 12) +
  labs(title = "Inflammatory Response Pathway Scores",
       x = "Pathway Score",
       y = "Cell Type",
       fill = "Condition") +
  scale_fill_manual(values = c("Healthy" = "#2E86AB", "Post_Treatment" = "#F18F01"))

ggsave("plots/functional_scoring/ridge_inflammatory_response.png", p_ridge,
       width = 10, height = 8, dpi = 300)

Statistical Testing of Pathway Scores (Pseudobulk Approach)

Just like with differential expression, we need to account for patient-level variation when comparing pathway scores statistically.

#-----------------------------------------------
# STEP 14: Pseudobulk statistical testing of pathway scores
#-----------------------------------------------

# Aggregate scores by sample and cell type
score_aggregated <- seurat_obj@meta.data %>%
  group_by(sample_id, condition, final_annotation) %>%
  summarise(
    mean_score = mean(Inflammatory_Response_Score),
    .groups = "drop"
  )

# Function to test one cell type
test_pathway_celltype <- function(celltype, score_data) {
  # Filter to this cell type
  ct_data <- score_data %>% filter(final_annotation == celltype)

  if (nrow(ct_data) < 6) {
    return(data.frame(
      cell_type = celltype,
      mean_healthy = NA,
      mean_post = NA,
      diff = NA,
      p_value = NA,
      significant = FALSE
    ))
  }

  # Separate by condition
  healthy_scores <- ct_data %>% filter(condition == "Healthy") %>% pull(mean_score)
  post_scores <- ct_data %>% filter(condition == "Post_Treatment") %>% pull(mean_score)

  # Statistical test (t-test)
  test_result <- t.test(post_scores, healthy_scores)

  return(data.frame(
    cell_type = celltype,
    mean_healthy = mean(healthy_scores),
    mean_post = mean(post_scores),
    diff = mean(post_scores) - mean(healthy_scores),
    p_value = test_result$p.value
  ))
}

# Test all cell types
pathway_tests <- lapply(cell_types, test_pathway_celltype, 
                       score_data = score_aggregated)
pathway_results <- do.call(rbind, pathway_tests)

# Add FDR correction
pathway_results$FDR <- p.adjust(pathway_results$p_value, method = "BH")

# Save results
write.csv(pathway_results, "results/inflammatory_response_pathway_test.csv", 
          row.names = FALSE)

Interpretation:

Positive diff: Pathway score higher in Post-Treatment (pathway activated)

Negative diff: Pathway score lower in Post-Treatment (pathway suppressed)

FDR < 0.05: Statistically significant change in pathway activity

Best Practices for Single-Cell Differential Analysis

General Principles

1. Always Use Multiple Approaches

Don’t rely on a single method (cell-level OR pseudobulk)
Use complementary analyses (proportions + DE + pathways)
Genes/pathways significant in multiple analyses are most robust

2. Account for Biological Structure

Use pseudobulk for final statistical claims (avoids pseudoreplication)
Include patient/batch as covariates in complex designs
Consider paired designs if pre/post samples from same patients

3. Interpret Within Biological Context

Fold-changes must be biologically meaningful (not just statistically significant)
Small changes in key regulators can be more important than large changes in abundant genes
Cell type-specific responses often differ (don’t expect all types to respond similarly)

4. Validate Key Findings

Independent validation cohort (if available)
Orthogonal technologies (qPCR, flow cytometry, immunofluorescence)
Functional experiments (knockdown, overexpression)

Differential Expression Best Practices

When to Use Cell-Level (FindMarkers):

Initial exploration and biomarker discovery
Visualizing gene expression patterns
Single-sample experiments (no biological replicates)
Identifying candidate genes for hypothesis generation

When to Use Pseudobulk (DESeq2):

Final statistical claims for publication
Experiments with >=3 biological replicates per condition
Comparing across conditions with patient/batch structure
Testing treatment effects while accounting for individual variation

Statistical Thresholds:

FDR < 0.05: Standard threshold for significance
|log2FC| > 0.5: Minimum meaningful fold-change (1.4-fold)
For publication: Consider more stringent (FDR < 0.01, |log2FC| > 1)

Gene Filtering:

Remove genes with very low expression (unreliable)
DESeq2: >=10 reads in >=3 samples
FindMarkers: >=10% cells expressing in at least one group

Proportion Analysis Best Practices

Sample Size Requirements:

Minimum: >=3 biological replicates per condition
Recommended: >=5 replicates for robust statistical power
More samples needed if expecting subtle changes

Interpretation Guidelines:

Proportion ratio >1.5 or <0.67 typically indicates meaningful change
Consider biological context (is change plausible?)
Validate with orthogonal methods (flow cytometry, IHC)

Functional Scoring Best Practices

Choosing Gene Sets:

Use well-curated databases (MSigDB Hallmarks, GO, KEGG)
Gene sets should have 15-200 genes (too small = noisy, too large = diluted)
Match gene set species to your data

Statistical Testing:

Always use pseudobulk approach (aggregate scores by sample)
Use appropriate tests (t-test, paired t-test, linear models)
Apply multiple testing correction across pathways

Interpretation:

Score differences should be interpreted relative to baseline variation
Consider both statistical significance and effect size
Validate pathway-level findings with key genes

Common Pitfalls to Avoid

Pitfall 1: Using Cell-Level Statistics for Publication

Problem: Pseudoreplication inflates significance
Solution: Always use pseudobulk for final claims

Pitfall 2: Ignoring Multiple Testing Correction

Problem: Testing thousands of genes without FDR control
Solution: Always report FDR-adjusted p-values

Pitfall 3: Over-Interpreting Small Fold-Changes

Problem: p<0.05 but log2FC=0.1 (7% change) may not be meaningful
Solution: Set fold-change thresholds (|log2FC| > 0.25 or 0.5)

Pitfall 4: Insufficient Sample Size

Problem: <3 replicates per condition = underpowered
Solution: Design experiments with >=5 biological replicates

Pitfall 5: Comparing Across Cell Types Instead of Within

Problem: “Gene X is higher in monocytes vs T cells in treatment” confounds cell identity with condition
Solution: Compare within cell types across conditions

Pitfall 6: Ignoring Batch Effects in Pseudobulk

Problem: Samples processed on different days may have technical variation
Solution: Include batch as covariate in DESeq2 design
Example: design = ~ batch + condition

Pitfall 7: Not Reporting Method Details

Problem: Reviewers/readers can’t evaluate or reproduce analysis
Solution: Report all parameters clearly:
Statistical test used
Thresholds (FDR, fold-change)
Number of biological replicates
Software versions

Conclusion: From Differential Analysis to Biological Discovery

Congratulations! You’ve completed comprehensive differential analysis of your scRNA-seq data, integrating three complementary approaches:

Differential proportions – Which immune cells change in abundance?
Differential expression – Which genes change within each cell type?
Functional state scoring – Which biological pathways are activated/suppressed?

Resources for Continued Learning

Statistical Methods:

DESeq2 vignette: https://bioconductor.org/packages/DESeq2/
limma user guide: https://bioconductor.org/packages/limma/
Propeller documentation: https://www.bioconductor.org/packages/speckle/
Pseudobulk best practices: Squair et al., Nat Commun 2021

Gene Set Analysis:

MSigDB database: https://www.gsea-msigdb.org/
GSEA methodology: Subramanian et al., PNAS 2005
AUCell for single-cell: Aibar et al., Nat Methods 2017

Seurat 5 Documentation:

PseudobulkExpression function: https://satijalab.org/seurat/reference/pseudobulkexpression
Seurat v5 vignettes: https://satijalab.org/seurat/articles/

Visualization:

EnhancedVolcano: https://bioconductor.org/packages/EnhancedVolcano/
ComplexHeatmap: https://jokergoo.github.io/ComplexHeatmap-reference/book/

Best Practices:

Luecken & Theis (2019) – Current best practices in scRNA-seq
Hao et al. (2024) – Seurat v5 integration and analysis
Crowell et al. (2020) – Muscat: Multi-sample multi-group scRNA-seq

References

Hao Y, Stuart T, Kowalski MH, et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat Biotechnol. 2024;42(2):293-304. doi:10.1038/s41587-023-01767-y [Seurat 5]
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. doi:10.1186/s13059-014-0550-8 [DESeq2]
Ritchie ME, Phipson B, Wu D, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47. doi:10.1093/nar/gkv007 [limma]
Law CW, Chen Y, Shi W, Smyth GK. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014;15(2):R29. doi:10.1186/gb-2014-15-2-r29 [voom]
Squair JW, Gautier M, Kathe C, et al. Confronting false discoveries in single-cell differential expression. Nat Commun. 2021;12(1):5692. doi:10.1038/s41467-021-25960-2 [Pseudobulk best practices]
Crowell HL, Soneson C, Germain PL, et al. muscat detects subpopulation-specific state transitions from multi-sample multi-condition single-cell transcriptomics data. Nat Commun. 2020;11(1):6077. doi:10.1038/s41467-020-19894-4 [Muscat – pseudobulk DE]
Phipson B, Maksimovic J, Oshlack A. propeller: testing for differences in cell type proportions in single cell data. Bioinformatics. 2022;38(20):4720-4726. doi:10.1093/bioinformatics/btac582 [Propeller]
Liberzon A, Birger C, Thorvaldsdóttir H, et al. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 2015;1(6):417-425. doi:10.1016/j.cels.2015.12.004 [MSigDB Hallmarks]
Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005;102(43):15545-15550. doi:10.1073/pnas.0506580102 [GSEA]
Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics. 2013;14:7. doi:10.1186/1471-2105-14-7 [GSVA]
Aibar S, González-Blas CB, Moerman T, et al. SCENIC: single-cell regulatory network inference and clustering. Nat Methods. 2017;14(11):1083-1086. doi:10.1038/nmeth.4463 [AUCell scoring]
Luecken MD, Theis FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol. 2019;15(6):e8746. doi:10.15252/msb.20188746
Zimmerman KD, Espeland MA, Langefeld CD. A practical solution to pseudoreplication bias in single-cell studies. Nat Commun. 2021;12(1):738. doi:10.1038/s41467-021-21038-1
Soneson C, Robinson MD. Bias, robustness and scalability in single-cell differential expression analysis. Nat Methods. 2018;15(4):255-261. doi:10.1038/nmeth.4612
Finak G, McDavid A, Yajima M, et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 2015;16:278. doi:10.1186/s13059-015-0844-5 [MAST]
Van den Berge K, Perraudeau F, Soneson C, et al. Observation weights unlock bulk RNA-seq tools for zero inflation and single-cell applications. Genome Biol. 2018;19(1):24. doi:10.1186/s13059-018-1406-4 [zinbwave]
Lee H, Joo JY, Sohn DH, et al. Single-cell RNA sequencing reveals rebalancing of immunological response in patients with periodontitis after non-surgical periodontal therapy. J Transl Med. 2022;20(1):504. doi:10.1186/s12967-022-03686-8 [Dataset source]

This tutorial is part of the NGS101.com series on single cell sequencing analysis. If this tutorial helped advance your research, please comment and share your experience to help other researchers! Subscribe to stay updated with our latest bioinformatics tutorials and resources.

Comments

2 responses to “How to Analyze Single-Cell RNA-seq Data – Complete Beginner’s Guide Part 5: Cell Type-Specific Differential Expression, Proportion Testing, and Functional Pathway Analysis”

Abdulmajid Musa

January 12, 2026

I really appreciate this tutorials series. I am getting better at RNAseq analysis and I am understanding it clearly.

1. Lei
  
  January 12, 2026
  
  Hi Abdulmajid,
  
  I’m really glad to hear that you found the tutorial helpful! Keep up the great work with your learning journey.

How to Analyze Single-Cell RNA-seq Data – Complete Beginner’s Guide Part 5: Cell Type-Specific Differential Expression, Proportion Testing, and Functional Pathway Analysis

Introduction: From Cell Types to Biological Mechanisms

Why Single-Cell RNA-seq Transforms Differential Analysis

The Three Complementary Analyses in This Tutorial

Analysis 1: Cell Type Proportion Analysis

Analysis 2: Cell-Type-Specific Differential Expression

Analysis 3: Functional State Scoring

Understanding Differential Expression: Cell-Level vs Pseudobulk

Approach 1: Cell-Level Differential Expression (Seurat’s FindMarkers)

Approach 2: Pseudobulk Differential Expression (DESeq2, limma)

Choosing Between Cell-Level and Pseudobulk Analysis

Understanding Functional State Scoring

What You’ll Learn in This Tutorial

Dataset Reminder: GSE174609 PBMC Study

Setting Up Your R Environment

Installing Required Packages

Loading Libraries and Configuration

Loading Annotated Data from Part 4

Part 1: Cell Type Proportion Analysis

Understanding Proportion Analysis

Calculating Cell Type Proportions

Visualizing Proportions

Statistical Testing with Propeller

Part 2: Cell-Level Differential Expression with FindMarkers

Running FindMarkers for All Cell Types

Visualizing Cell-Level DE Results

Part 3: Pseudobulk Differential Expression Analysis

Understanding the Pseudobulk Approach

Creating Pseudobulk Count Matrices with Seurat 5

Running DESeq2 for Each Cell Type

Validating Results with limma-voom

Running limma-voom Analysis

What to Expect from Each Method

Understanding the Discrepancy in Our Results

Biological Interpretation

Practical Recommendations

Part 4: Functional State Scoring Analysis

Understanding Pathway Scoring

Downloading Gene Sets from MSigDB

Scoring Cells for Inflammatory Response

Visualizing Pathway Scores by Cell Type and Condition

Statistical Testing of Pathway Scores (Pseudobulk Approach)

Best Practices for Single-Cell Differential Analysis

General Principles

Differential Expression Best Practices

Proportion Analysis Best Practices

Functional Scoring Best Practices

Common Pitfalls to Avoid

Conclusion: From Differential Analysis to Biological Discovery

Resources for Continued Learning

References

Share this:

Like this:

Comments

2 responses to “How to Analyze Single-Cell RNA-seq Data – Complete Beginner’s Guide Part 5: Cell Type-Specific Differential Expression, Proportion Testing, and Functional Pathway Analysis”

Leave a Reply Cancel reply

Search

Categories

Recent Posts

Tags