How to Analyze RNAseq Data for Absolute Beginners Part 5: From DEGs to Pathways - Best Practices

Video Tutorial

Introduction

After completing the data preparation, statistical testing, and visualization steps, we’re finally ready to explore the biological significance of our RNA sequencing data. As biologists, this is the moment we’ve been waiting for – but how do we make sense of the hundreds or thousands of differentially expressed genes (DEGs) we’ve identified?

Living organisms are complex systems where thousands of genes work together in harmony. These genes form functional groups, each responsible for different processes like cell division, protein binding, metabolism, or disease responses. To understand what our DEGs are telling us, we need to identify which functional groups they belong to and what biological processes they influence. This is where functional enrichment analysis, also known as pathway analysis, becomes crucial.

Pathway analysis helps us:

Connect expression changes to known biological mechanisms
Provide biological context for our experimental findings
Generate hypotheses for further investigation

Pathway Databases: Your Guide to Gene Function

Pathway databases are specialized resources containing curated information about biological pathways and their associated gene sets. During pathway analysis, we compare our DEGs against these databases to identify significantly overlapping gene sets and their corresponding pathways. Here are the most widely used databases for RNA-seq analysis:

Gene Ontology (GO)

Gene Ontology (GO) is a comprehensive bioinformatics resource that provides a structured vocabulary (ontologies) describing gene functions, processes, and cellular locations. It’s organized into three main categories:

Molecular Function (MF)

Describes specific activities at the molecular level
Examples: ATP binding, protein kinase activity

Biological Process (BP)

Defines broader biological goals involving multiple molecular functions
Examples: cell cycle regulation, immune response, apoptosis

Cellular Component (CC)

Indicates where gene products are active within the cell
Examples: nucleus, cytoplasm, mitochondrial membrane

KEGG (Kyoto Encyclopedia of Genes and Genomes)

KEGG is a comprehensive database integrating genomic, chemical, and functional information. It includes:

Pathways: Detailed maps of biological processes
KEGG Orthology (KO): Groups of genes with shared functions across species
KEGG Disease: Disease pathways and gene-disease associations
KEGG Drug and Enzyme: Drug interactions and metabolic enzyme information
KEGG Compound: Chemical compound structures and roles

Molecular Signatures Database (MSigDB)

MSigDB is the most comprehensive pathway database, incorporating both GO and KEGG data plus much more. It’s your one-stop shop for pathway analysis, featuring:

Hallmark Gene Sets: Core biological processes and pathways
Curated Gene Sets (C2): Pathways from KEGG, Reactome, and published studies
Motif Gene Sets (C3): Gene regulatory motifs
Computational Gene Sets (C4): Co-expression networks
GO Gene Sets (C5): Gene Ontology groupings
Oncogenic Signatures (C6): Cancer-related pathways
Immunologic Signatures (C7): Immune system patterns
Cell Type Signatures (C8): Cell-specific expression patterns

Statistical Methods for Pathway Analysis

1. Over-Representation Analysis (ORA)

ORA uses statistical tests (typically Fisher’s exact test or hypergeometric test) to evaluate gene set enrichment.

Advantages:

Simple to understand and interpret
Straightforward implementation

Limitations:

Requires predefined cutoffs for DEG selection
Treats all genes as equally important
Doesn’t consider expression change magnitude

2. Gene Set Enrichment Analysis (GSEA)

GSEA uses a running-sum statistic to analyze gene distribution patterns within ranked lists.

Advantages:

No arbitrary cutoffs needed
Considers expression change magnitude
Detects subtle expression patterns

Limitations:

More computationally intensive
Can be more complex to interpret

Practical Implementation

Prerequisites

First, install the required R packages:

# Install required packages if needed
install.packages(c("BiocManager"))
BiocManager::install(c("clusterProfiler", "org.Hs.eg.db", "org.Mm.eg.db", "org.Rn.eg.db", "pathview", "enrichplot", "DOSE"))

install.packages(c("data.table", "ggplot2"))

1. Data Preparation

1.1 Loading Required Libraries

library(data.table)
library(clusterProfiler)
# library(org.Hs.eg.db)
library(org.Mm.eg.db)
# library(org.Rn.eg.db)
library(ggplot2)
library(DOSE)
library(enrichplot)

1.2 Input Data Format

First, let’s import the DEG table we created in the previous tutorial. For ORA analysis, we need to establish thresholds for both P-value and fold change to filter our DEGs. Note that these filtering thresholds are specific to ORA – the GSEA method analyzes the entire gene set without preliminary filtering.

deg <- fread("DEG_P1_lfc0.tsv")

p_threshold <- 0.05
fc_threshold <- 2

2. Gene Set Enrichment Analysis (GSEA)

2.1 GO Enrichment Analysis

# rank the DEGs by the fold change
deg_order_fc <- deg[order(-logFC)] # rank the genes by logFC in descending order
logfc <- deg_order_fc$logFC # get logFC
names(logfc) <- deg_order_fc$V1 # make a named vector of logFC

# GSEA
# Mouse: OrgDb = org.Mm.eg.db; Rat: OrgDb = org.Rn.eg.db; Human: OrgDb = org.Hs.eg.db;
enrich_go_gsea <- gseGO(geneList = logfc, OrgDb = org.Mm.eg.db, ont = "ALL", pvalueCutoff = 0.05, keyType ="SYMBOL", verbose= FALSE)

# save the enrichment table
enrich_go_gsea_df <- enrich_go_gsea@result

fwrite(enrich_go_gsea_df, "enrich_go_gsea_df.tsv", sep = "\t")

2.2 Visualize The Top Enriched GO terms

The clusterProfiler package offers numerous visualization options for enriched GO terms. We’ll start with the dot plot visualization, which is the most widely used and intuitive representation. For additional visualization methods, including advanced graphics options, refer to the clusterProfiler documentation.

dotplot_enrich_go_gsea <- dotplot(enrich_go_gsea, showCategory = 10, orderBy="GeneRatio")

ggsave("dotplot_enrich_go_gsea.png", dotplot_enrich_go_gsea, device = "png", units = "cm", width = 16, height = 18)

GSEA results can be visualized at the individual pathway level to examine specific enrichment patterns.

# specify the GO term you want to plot (check your "enrich_go_gsea_df" above)
GOID <- "GO:0010634"

gsea_plot <- gseaplot2(enrich_go_gsea, geneSetID = GOID, title = subset(enrich_go_gsea_df, ID == GOID)$Description)

ggsave(paste0("gsea_plot_", gsub(":", "", GOID), ".png"), gsea_plot, device = "png", units = "cm", width = 18, height = 16)

2.3 KEGG Enrichment Analysis

The process follows the same workflow as above, but uses the ‘gseKEGG’ function specifically for KEGG pathway analysis. Note the “organism” option (“hsa”: human, “mmu”: mouse and “rna”: rat).

enrich_kegg_gsea <- gseKEGG(geneList = logfc, organism = "mmu")

3. Over-Representation Analysis (ORA)

3.1 GO Enrichment Analysis

The ORA method requires only a list of differentially expressed genes as input. In this example, we’ll focus on analyzing up-regulated genes, but you can apply the same process to down-regulated genes as well.

# make a gene list
deg_up <- deg[logFC > log2(fc_threshold) & adj.P.Val < p_threshold]$V1
# deg_dn <- deg[logFC < -log2(fc_threshold) & adj.P.Val < p_threshold]$V1

enrich_go_fet_up <- enrichGO(gene = deg_up, OrgDb=org.Mm.eg.db, keyType="SYMBOL", ont="ALL", pvalueCutoff=0.05, pAdjustMethod="BH", qvalueCutoff=0.05)

enrich_go_fet_up_df <- enrich_go_fet_up@result

fwrite(enrich_go_fet_up_df, "enrich_go_fet_up_df.tsv", sep = "\t")

3.2 Visualize The Top Enriched GO terms

# show the top 10 terms
dotplot_enrich_go_fet_up <- dotplot(enrich_go_fet_up, showCategory = 10, orderBy="GeneRatio")

ggsave("dotplot_enrich_go_fet_up.png", dotplot_enrich_go_fet_up, device = "png", units = "cm", width = 16, height = 18)

3.3 KEGG Enrichment Analysis

The methodology mirrors the GO enrichment analysis, with the key difference being the use of the ‘enrichKEGG’ function for KEGG pathway enrichment.

enrich_kegg_fet_up <- enrichKEGG(enrich_go_fet_up, organism = "mmu", keyType = "kegg")

4. MSigDB Enrichment

Now for the most comprehensive analysis option: MSigDB analysis. This powerful database incorporates GO, KEGG, Hallmark, Reactome, and many other pathway collections, allowing you to explore your DEGs across multiple annotation systems simultaneously.

Before beginning the MSigDB analysis, download the appropriate gene set file from the MSigDB website. Select the correct species-specific file (mouse or human) and either place it in your R working directory or note its full file path for later use.

The visualization methods for MSigDB enrichment follow the same principles we used for GO and KEGG analyses.

4.1 The GSEA Method

msigdb <- read.gmt("msigdb.v2024.1.Mm.symbols.gmt")

enrich_msigdb_gsea <- GSEA(logfc, TERM2GENE = msigdb)

enrich_msigdb_gsea_df <- enrich_msigdb_gsea@result

fwrite(enrich_msigdb_gsea_df, "enrich_msigdb_gsea_df.tsv", sep = "\t")

4.2 The ORA Method

msigdb <- read.gmt("msigdb.v2024.1.Mm.symbols.gmt")

enrich_msigdb_fet_up <- enricher(deg_up, TERM2GENE = msigdb)

enrich_msigdb_fet_up_df <- enrich_msigdb_fet_up@result

fwrite(enrich_msigdb_fet_up_df, "enrich_msigdb_fet_up_df.tsv", sep = "\t")

Results Interpretation

When analyzing your pathway analysis results, consider these key factors:

Statistical Significance

Examine adjusted p-values (FDR)
Consider gene set sizes
Look for consistent patterns across methods

Biological Relevance

Focus on pathways related to your experimental context
Consider pathway relationships and hierarchies
Look for unexpected but interesting connections

Technical Considerations

Account for gene set size effects
Verify multiple testing correction methods
Ensure appropriate background gene selection

Best Practices

Use appropriate statistical cutoffs consistently
Consider your experimental context when interpreting results
Validate key findings using alternative methods
Document all analysis parameters and software versions
Cross-reference findings with current literature
Consider pathway redundancy and overlap

References

Yu G, Wang LG, Han Y, He QY (2012). clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology
Love MI, Huber W, Anders S (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology

NGS Learning Hub

How to Analyze RNAseq Data for Absolute Beginners Part 5: From DEGs to Pathways – Best Practices

Video Tutorial

Introduction

Pathway Databases: Your Guide to Gene Function

Gene Ontology (GO)

KEGG (Kyoto Encyclopedia of Genes and Genomes)

Molecular Signatures Database (MSigDB)

Statistical Methods for Pathway Analysis

1. Over-Representation Analysis (ORA)

2. Gene Set Enrichment Analysis (GSEA)

Practical Implementation

Prerequisites

1. Data Preparation

1.1 Loading Required Libraries

1.2 Input Data Format

2. Gene Set Enrichment Analysis (GSEA)

2.1 GO Enrichment Analysis

2.2 Visualize The Top Enriched GO terms

2.3 KEGG Enrichment Analysis

3. Over-Representation Analysis (ORA)

3.1 GO Enrichment Analysis

3.2 Visualize The Top Enriched GO terms

3.3 KEGG Enrichment Analysis

4. MSigDB Enrichment

4.1 The GSEA Method

4.2 The ORA Method

Results Interpretation

Best Practices

References

Additional Resources

Like this:

Comments

Leave a Reply Cancel reply

Search

Subscribe

Categories

Recent Posts

Tags

How to Analyze RNAseq Data for Absolute Beginners Part 5: From DEGs to Pathways – Best Practices

Video Tutorial

Introduction

Pathway Databases: Your Guide to Gene Function

Gene Ontology (GO)

KEGG (Kyoto Encyclopedia of Genes and Genomes)

Molecular Signatures Database (MSigDB)

Statistical Methods for Pathway Analysis

1. Over-Representation Analysis (ORA)

2. Gene Set Enrichment Analysis (GSEA)

Practical Implementation

Prerequisites

1. Data Preparation

1.1 Loading Required Libraries

1.2 Input Data Format

2. Gene Set Enrichment Analysis (GSEA)

2.1 GO Enrichment Analysis

2.2 Visualize The Top Enriched GO terms

2.3 KEGG Enrichment Analysis

3. Over-Representation Analysis (ORA)

3.1 GO Enrichment Analysis

3.2 Visualize The Top Enriched GO terms

3.3 KEGG Enrichment Analysis

4. MSigDB Enrichment

4.1 The GSEA Method

4.2 The ORA Method

Results Interpretation

Best Practices

References

Additional Resources

Share this:

Like this:

Comments

Leave a Reply Cancel reply

Search

Subscribe

Categories

Recent Posts

Tags