Introduction
After completing the data preparation, statistical testing, and visualization steps, we’re finally ready to explore the biological significance of our RNA sequencing data. As biologists, this is the moment we’ve been waiting for – but how do we make sense of the hundreds or thousands of differentially expressed genes (DEGs) we’ve identified?
Living organisms are complex systems where thousands of genes work together in harmony. These genes form functional groups, each responsible for different processes like cell division, protein binding, metabolism, or disease responses. To understand what our DEGs are telling us, we need to identify which functional groups they belong to and what biological processes they influence. This is where functional enrichment analysis, also known as pathway analysis, becomes crucial.
Pathway analysis helps us:
- Connect expression changes to known biological mechanisms
- Provide biological context for our experimental findings
- Generate hypotheses for further investigation
Pathway Databases: Your Guide to Gene Function
Pathway databases are specialized resources containing curated information about biological pathways and their associated gene sets. During pathway analysis, we compare our DEGs against these databases to identify significantly overlapping gene sets and their corresponding pathways. Here are the most widely used databases for RNA-seq analysis:
Gene Ontology (GO)
Gene Ontology (GO) is a comprehensive bioinformatics resource that provides a structured vocabulary (ontologies) describing gene functions, processes, and cellular locations. It’s organized into three main categories:
- Molecular Function (MF)
- Describes specific activities at the molecular level
- Examples: ATP binding, protein kinase activity
- Biological Process (BP)
- Defines broader biological goals involving multiple molecular functions
- Examples: cell cycle regulation, immune response, apoptosis
- Cellular Component (CC)
- Indicates where gene products are active within the cell
- Examples: nucleus, cytoplasm, mitochondrial membrane
KEGG (Kyoto Encyclopedia of Genes and Genomes)
KEGG is a comprehensive database integrating genomic, chemical, and functional information. It includes:
- Pathways: Detailed maps of biological processes
- KEGG Orthology (KO): Groups of genes with shared functions across species
- KEGG Disease: Disease pathways and gene-disease associations
- KEGG Drug and Enzyme: Drug interactions and metabolic enzyme information
- KEGG Compound: Chemical compound structures and roles
Molecular Signatures Database (MSigDB)
MSigDB is the most comprehensive pathway database, incorporating both GO and KEGG data plus much more. It’s your one-stop shop for pathway analysis, featuring:
- Hallmark Gene Sets: Core biological processes and pathways
- Curated Gene Sets (C2): Pathways from KEGG, Reactome, and published studies
- Motif Gene Sets (C3): Gene regulatory motifs
- Computational Gene Sets (C4): Co-expression networks
- GO Gene Sets (C5): Gene Ontology groupings
- Oncogenic Signatures (C6): Cancer-related pathways
- Immunologic Signatures (C7): Immune system patterns
- Cell Type Signatures (C8): Cell-specific expression patterns
Statistical Methods for Pathway Analysis
1. Over-Representation Analysis (ORA)
ORA uses statistical tests (typically Fisher’s exact test or hypergeometric test) to evaluate gene set enrichment.
Advantages:
- Simple to understand and interpret
- Straightforward implementation
Limitations:
- Requires predefined cutoffs for DEG selection
- Treats all genes as equally important
- Doesn’t consider expression change magnitude
2. Gene Set Enrichment Analysis (GSEA)
GSEA uses a running-sum statistic to analyze gene distribution patterns within ranked lists.
Advantages:
- No arbitrary cutoffs needed
- Considers expression change magnitude
- Detects subtle expression patterns
Limitations:
- More computationally intensive
- Can be more complex to interpret
Practical Implementation
Prerequisites
First, install the required R packages:
# Install required packages if needed
install.packages(c("BiocManager"))
BiocManager::install(c("clusterProfiler", "org.Hs.eg.db", "org.Mm.eg.db", "org.Rn.eg.db", "pathview", "enrichplot", "DOSE"))
1. Data Preparation
1.1 Loading Required Libraries
library(data.table)
library(clusterProfiler)
# library(org.Hs.eg.db)
library(org.Mm.eg.db)
# library(org.Rn.eg.db)
library(ggplot2)
library(DOSE)
library(enrichplot)
1.2 Input Data Format
First, let’s import the DEG table we created in the previous tutorial. For ORA analysis, we need to establish thresholds for both P-value and fold change to filter our DEGs. Note that these filtering thresholds are specific to ORA – the GSEA method analyzes the entire gene set without preliminary filtering.
deg <- fread("DEG_P1_lfc0.tsv")
p_threshold <- 0.05
fc_threshold <- 2
2. Gene Set Enrichment Analysis (GSEA)
2.1 GO Enrichment Analysis
# rank the DEGs by the fold change
deg_order_fc <- deg[order(-logFC)] # rank the genes by logFC in descending order
logfc <- deg_order_fc$logFC # get logFC
names(logfc) <- deg_order_fc$V1 # make a named vector of logFC
# GSEA
# Mouse: OrgDb = org.Mm.eg.db; Rat: OrgDb = org.Rn.eg.db; Human: OrgDb = org.Hs.eg.db;
enrich_go_gsea <- gseGO(geneList = logfc, OrgDb = org.Mm.eg.db, ont = "ALL", pvalueCutoff = 0.05, keyType ="SYMBOL", verbose= FALSE)
# save the enrichment table
enrich_go_gsea_df <- enrich_go_gsea@result
fwrite(enrich_go_gsea_df, "enrich_go_gsea_df.tsv", sep = "\t")
2.2 Visualize The Top Enriched GO terms
The clusterProfiler package offers numerous visualization options for enriched GO terms. We’ll start with the dot plot visualization, which is the most widely used and intuitive representation. For additional visualization methods, including advanced graphics options, refer to the clusterProfiler documentation.
dotplot_enrich_go_gsea <- dotplot(enrich_go_gsea, showCategory = 10, orderBy="GeneRatio")
ggsave("dotplot_enrich_go_gsea.png", dotplot_enrich_go_gsea, device = "png", units = "cm", width = 16, height = 18)
GSEA results can be visualized at the individual pathway level to examine specific enrichment patterns.
# specify the GO term you want to plot (check your "enrich_go_gsea_df" above)
GOID <- "GO:0010634"
gsea_plot <- gseaplot2(enrich_go_gsea, geneSetID = GOID, title = subset(enrich_go_gsea_df, ID == GOID)$Description)
ggsave(paste0("gsea_plot_", gsub(":", "", GOID), ".png"), gsea_plot, device = "png", units = "cm", width = 18, height = 16)
2.3 KEGG Enrichment Analysis
The process follows the same workflow as above, but uses the ‘gseKEGG’ function specifically for KEGG pathway analysis. Note the “organism” option (“hsa”: human, “mmu”: mouse and “rna”: rat).
enrich_kegg_gsea <- gseKEGG(geneList = logfc, organism = "mmu")
3. Over-Representation Analysis (ORA)
3.1 GO Enrichment Analysis
The ORA method requires only a list of differentially expressed genes as input. In this example, we’ll focus on analyzing up-regulated genes, but you can apply the same process to down-regulated genes as well.
# make a gene list
deg_up <- deg[logFC > log2(fc_threshold) & adj.P.Val < p_threshold]$V1
# deg_dn <- deg[logFC < -log2(fc_threshold) & adj.P.Val < p_threshold]$V1
enrich_go_fet_up <- enrichGO(gene = deg_up, OrgDb=org.Mm.eg.db, keyType="SYMBOL", ont="ALL", pvalueCutoff=0.05, pAdjustMethod="BH", qvalueCutoff=0.05)
enrich_go_fet_up_df <- enrich_go_fet_up@result
fwrite(enrich_go_fet_up_df, "enrich_go_fet_up_df.tsv", sep = "\t")
3.2 Visualize The Top Enriched GO terms
# show the top 10 terms
dotplot_enrich_go_fet_up <- dotplot(enrich_go_fet_up, showCategory = 10, orderBy="GeneRatio")
ggsave("dotplot_enrich_go_fet_up.png", dotplot_enrich_go_fet_up, device = "png", units = "cm", width = 16, height = 18)
3.3 KEGG Enrichment Analysis
The methodology mirrors the GO enrichment analysis, with the key difference being the use of the ‘enrichKEGG’ function for KEGG pathway enrichment.
enrich_kegg_fet_up <- enrichKEGG(enrich_go_fet_up, organism = "mmu", keyType = "kegg")
4. MSigDB Enrichment
Now for the most comprehensive analysis option: MSigDB analysis. This powerful database incorporates GO, KEGG, Hallmark, Reactome, and many other pathway collections, allowing you to explore your DEGs across multiple annotation systems simultaneously.
Before beginning the MSigDB analysis, download the appropriate gene set file from the MSigDB website. Select the correct species-specific file (mouse or human) and either place it in your R working directory or note its full file path for later use.
The visualization methods for MSigDB enrichment follow the same principles we used for GO and KEGG analyses.
4.1 The GSEA Method
msigdb <- read.gmt("msigdb.v2024.1.Mm.symbols.gmt")
enrich_msigdb_gsea <- GSEA(logfc, TERM2GENE = msigdb)
enrich_msigdb_gsea_df <- enrich_msigdb_gsea@result
fwrite(enrich_msigdb_gsea_df, "enrich_msigdb_gsea_df.tsv", sep = "\t")
4.2 The ORA Method
msigdb <- read.gmt("msigdb.v2024.1.Mm.symbols.gmt")
enrich_msigdb_fet_up <- enricher(deg_up, TERM2GENE = msigdb)
enrich_msigdb_fet_up_df <- enrich_msigdb_fet_up@result
fwrite(enrich_msigdb_fet_up_df, "enrich_msigdb_fet_up_df.tsv", sep = "\t")
Results Interpretation
When analyzing your pathway analysis results, consider these key factors:
- Statistical Significance
- Examine adjusted p-values (FDR)
- Consider gene set sizes
- Look for consistent patterns across methods
- Biological Relevance
- Focus on pathways related to your experimental context
- Consider pathway relationships and hierarchies
- Look for unexpected but interesting connections
- Technical Considerations
- Account for gene set size effects
- Verify multiple testing correction methods
- Ensure appropriate background gene selection
Best Practices
- Use appropriate statistical cutoffs consistently
- Consider your experimental context when interpreting results
- Validate key findings using alternative methods
- Document all analysis parameters and software versions
- Cross-reference findings with current literature
- Consider pathway redundancy and overlap
References
- Yu G, Wang LG, Han Y, He QY (2012). clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology
- Love MI, Huber W, Anders S (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology
Leave a Reply