How to Perform Master Regulator Analysis on RNA-seq Data Using RegEnrich and RTN – A Complete Beginner’s Guide

How to Perform Master Regulator Analysis on RNA-seq Data Using RegEnrich and RTN – A Complete Beginner’s Guide

By

Lei

Discover the transcription factors controlling gene expression changes in your RNA-seq experiments

Table of Contents

Introduction: Understanding Master Regulator Analysis

In the intricate symphony of gene regulation, not all transcription factors play equal roles. Some act as “master regulators” – key transcription factors that orchestrate broad changes in gene expression programs, controlling entire networks of downstream genes. Identifying these master regulators is crucial for understanding the molecular mechanisms driving biological processes, from development to disease.

What is Master Regulator Analysis?

Master regulator analysis (MRA) is a computational approach designed to identify transcription factors (TFs) that are likely responsible for observed changes in gene expression between biological conditions. Rather than simply finding which genes are differentially expressed, MRA asks a more mechanistic question: “Which transcription factors are driving these expression changes?”

The fundamental principle behind MRA is that if a transcription factor is a key regulator of a biological process, many of its known target genes should show coordinated expression changes. By integrating:

  • Differential gene expression data from your RNA-seq experiment
  • Prior knowledge of transcription factor-target gene relationships
  • Statistical enrichment methods to identify over-represented regulatory relationships

MRA can pinpoint the transcription factors most likely orchestrating the observed transcriptional program.

Beginner’s Tip: Think of master regulator analysis as detective work. You observe which genes changed in your experiment (the evidence), then use known regulatory relationships (the database of suspects) to identify which transcription factors are most likely responsible (the culprit).

The Biological Mechanism Behind Master Regulator Analysis

To understand how MRA works, let’s examine the biological relationships it leverages:

Transcription Factor Binding and Gene Regulation:

  1. Transcription factors bind to specific DNA sequences (regulatory elements) near target genes
  2. Upon binding, TFs can activate or repress transcription of their target genes
  3. A single TF can regulate dozens to hundreds of target genes
  4. Target genes of the same TF often show coordinated expression changes when that TF’s activity changes

The Regulatory Network:
Master regulators sit at the top of regulatory hierarchies. When a master regulator becomes active or inactive:

  • Its direct target genes change expression
  • Some target genes may themselves be transcription factors, creating cascades
  • The combined effect produces coordinated changes across entire gene programs

Computational Inference:
MRA methods leverage databases of known or predicted TF-target relationships (derived from ChIP-seq, motif analysis, or literature curation). By testing whether a TF’s known targets are enriched among differentially expressed genes, these methods can infer which TFs are likely active regulators in your experimental condition.

Biological Insights from Master Regulator Analysis

Performing MRA on your RNA-seq data can reveal several types of biological insights:

  1. Mechanistic Understanding: Beyond knowing which genes changed, you learn why they changed – which regulatory programs were activated
  2. Therapeutic Targets: Master regulators often make better drug targets than downstream effectors, as they control multiple pathways
  3. Regulatory Hierarchies: Identify primary drivers versus secondary responders in your biological system
  4. Network Rewiring: Discover how regulatory networks are reorganized in disease or during development
  5. Biomarker Discovery: Master regulators can serve as molecular signatures for disease subtypes or treatment response

Introducing RegEnrich and RTN: Two Complementary Approaches

This tutorial focuses on two powerful Bioconductor packages for master regulator analysis, each with distinct strengths:

RegEnrich (Regulatory Enrichment Analysis):

  • Approach: Integrates differential expression analysis with regulatory network enrichment
  • Strengths:
  • User-friendly workflow from count matrix to results
  • Built-in database of TF-target relationships
  • Multiple enrichment methods for robust results
  • Excellent visualization capabilities
  • Best for: Researchers wanting a streamlined, integrated workflow with minimal data preprocessing

RTN (Reconstruction of Transcriptional regulatory Networks):

  • Approach: Reconstructs regulatory networks from expression data using mutual information, then tests for master regulators
  • Strengths:
  • Infers regulatory relationships directly from your data
  • Can discover novel regulatory relationships not in databases
  • Multiple MRA methods (Fisher’s exact test and GSEA)
  • Sophisticated network analysis tools
  • Best for: Researchers interested in data-driven network reconstruction or working with less-studied organisms

Both methods are complementary: RegEnrich leverages curated knowledge, while RTN discovers relationships de novo from your data. Using both approaches can provide a more complete picture of regulatory mechanisms.

Setting Up Your Analysis Environment

Before beginning the analysis, we’ll create a dedicated computational environment with all necessary tools. This ensures reproducibility and prevents conflicts with other software.

Creating an R Environment for Master Regulator Analysis

Open your RStudio, install the required packages:

# Install Bioconductor if not already installed
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install core analysis packages
BiocManager::install(c(
    "RegEnrich",
    "RTN",
    "DESeq2",
    "limma",
    "edgeR",
    "org.Hs.eg.db",
    "RedeR",
    "igraph",
    "clusterProfiler"
))

# Install additional utilities
install.packages(c(
    "pheatmap",
    "ggplot2",
    "dplyr",
    "tidyr",
    "data.table",
    "VennDiagram"
))

Downloading and Preparing the Example Dataset

For this tutorial, we’ll use RNA-seq data from GEO dataset GSE53239, which examines gene expression differences in a controlled experimental setting. This dataset provides an good example for demonstrating master regulator analysis.

Understanding the GSE53239 Dataset

The dataset consists of:

  • Treatment samples: HGB_11 through HGB_16 (6 samples)
  • Control samples: HGB_17 through HGB_22 (6 samples)
  • Gene expression data: Raw count data from RNA-seq with gene symbols

Downloading the Expression Matrix

# Create project directory structure
mkdir -p ~/MRA_Analysis
cd ~/MRA_Analysis

# Download the gene expression matrix
wget "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE53239&format=file&file=GSE53239%5FHGB11%2D22%5FgeneLevel%5FHTSeqCountData%2Etxt%2Egz" \
    -O GSE53239_expression_matrix.txt.gz

# Decompress the file
gunzip GSE53239_expression_matrix.txt.gz

Preparing Data for Master Regulator Analysis

Now we’ll load all required libraries and prepare the data for analysis.

Loading Required Libraries

# Load all required libraries
library(tidyr)
library(dplyr)
library(DESeq2)
library(data.table)
library(edgeR)
library(RegEnrich)
library(RTN)
library(VennDiagram)

# Set working directory
setwd("~/MRA_Analysis")

Creating Sample Information and Loading Data

# Create sample metadata
sample_info <- data.frame(
    sample_id = paste0("HGB_", 11:22),
    condition = c(rep("Treatment", 6), rep("Control", 6)),
    stringsAsFactors = FALSE
)

# Save the sample information
write.table(sample_info, 
            file = "sample_info.txt", 
            sep = "\t", 
            row.names = FALSE, 
            quote = FALSE)

# Read the raw count matrix
counts_raw <- read.table("GSE53239_expression_matrix.txt", 
                         header = TRUE)

# Remove duplicate gene symbols (keep first occurrence)
counts_raw <- counts_raw[!duplicated(counts_raw[[1]]), ]

# Set gene symbols as row names
rownames(counts_raw) <- counts_raw[[1]]

# Remove the gene symbol column
counts_raw <- counts_raw[, -1]

# Read sample information
sample_info <- read.table("sample_info.txt", 
                          header = TRUE,
                          stringsAsFactors = FALSE)

# Ensure sample order matches between count matrix and metadata
sample_info <- sample_info[match(colnames(counts_raw), sample_info$sample_id), ]

# Verify the match
all(colnames(counts_raw) == sample_info$sample_id)

Filtering Low-Expressed Genes

# Create DESeq2 dataset object
dds <- DESeqDataSetFromMatrix(
    countData = counts_raw,
    colData = sample_info,
    design = ~ condition
)

# Filter genes: keep genes expressed in at least 80% of samples
perc_keep <- 0.8
gene_keep <- rowSums(counts_raw > 0) >= ceiling(perc_keep * ncol(counts_raw))
dds <- dds[gene_keep, ]

# Also create filtered count table for other analyses
count_tbl_filtered <- counts_raw[gene_keep, ]

Filtering Strategy: We keep genes that are expressed (count > 0) in at least 80% of samples. This removes genes with very sparse expression that are unlikely to be reliable for downstream analysis while retaining genes with condition-specific expression patterns.

Normalizing Data for RTN Analysis

# Perform variance stabilizing transformation for RTN
vsd <- vst(dds, blind = FALSE)
normalized_counts <- assay(vsd)

Why VST? The variance stabilizing transformation from DESeq2 removes the dependence of variance on the mean, making the data suitable for correlation-based methods used by RTN.

Master Regulator Analysis with RegEnrich

RegEnrich provides an integrated workflow that combines limma-voom for differential expression with regulatory network enrichment analysis.

Understanding the RegEnrich Workflow

RegEnrich analysis proceeds through these steps:

  1. Differential expression analysis using limma-voom
  2. Network inference to identify TF-target relationships
  3. Enrichment testing using GSEA
  4. Integration and scoring of results

Preparing Data for RegEnrich

# Load human transcription factors from RegEnrich or a list of TFs specific to your dataset
data(TFs)
human_tfs <- intersect(TFs$TF_name, rownames(dge_v$E))

# Create DGEList object and normalize with TMM
dge <- DGEList(counts = count_tbl_filtered, samples = sample_info)
dge <- calcNormFactors(dge, method = "TMM")

# Apply voom transformation
dge_v <- voom(dge, plot = TRUE)

# Create design matrix
design_mtx <- model.matrix(~ 0 + dge_v$targets$condition)
colnames(design_mtx) <- gsub("dge_v\\$targets\\$condition", "", colnames(design_mtx))

# Create contrast matrix
ctrst_mtx <- makeContrasts(contrasts = "Treatment-Control", levels = design_mtx)

Why Voom? The voom transformation combines the speed of limma with appropriate modeling of RNA-seq count data by estimating mean-variance relationships and computing precision weights for each observation.

Running RegEnrich Analysis

# Initialize RegEnrich object
regen_obj <- RegenrichSet(
    expr = dge_v$E,                           # voom-transformed expression
    colData = dge_v$targets,                  # sample information
    reg = human_tfs,  # TFs present in data
    method = "limma",                         # DE method
    design = design_mtx,                      # design matrix
    contrast = ctrst_mtx,                     # contrast matrix
    networkConstruction = "COEN",             # network inference method
    enrichTest = "GSEA"                       # enrichment method
)

# Run the complete analysis pipeline
regen_obj <- regenrich_diffExpr(regen_obj) %>% 
    regenrich_network(RsquaredCut = 0.8) %>%  # Lower for smaller sample sizes
    regenrich_enrich() %>% 
    regenrich_rankScore()

# Extract results
res_regenrich <- na.omit(as.data.frame(results_score(regen_obj)))

# Save results
fwrite(res_regenrich, "res_mra_regenrich.csv")

Understanding RegEnrich Parameters

Key parameters to note:

  • networkConstruction = “COEN”: Uses correlation-based network inference from your expression data
  • enrichTest = “GSEA”: Uses Gene Set Enrichment Analysis for robust enrichment testing
  • RsquaredCut = 0.8: Correlation threshold for defining TF-target relationships. Use lower values (0.6-0.7) for smaller sample sizes

Interpreting RegEnrich Results

The results table contains these key columns:

  • reg: Transcription factor name
  • negLogPDEA: -log10(p-value) from differential expression of the TF itself
  • negLogPEnrich: -log10(p-value) from enrichment analysis of TF targets
  • logFC: Log2 fold change of the TF
  • score: Integrated score combining multiple evidence types

Master regulators are identified by:

  • High negLogPEnrich (>1.3, corresponding to p < 0.05): Significant enrichment of TF targets
  • High score: Strong integrated evidence
  • Biological relevance: Consider logFC direction and TF expression level

Master Regulator Analysis with RTN

RTN takes a data-driven approach by first reconstructing regulatory networks from your expression data, then applying two different methods to identify master regulators.

Understanding the RTN Workflow

RTN’s analysis follows these steps:

  1. Construct a transcriptional network using mutual information
  2. Perform permutation analysis to assess network significance
  3. Bootstrap analysis for network stability
  4. Apply DPI filter to remove indirect interactions
  5. Test for master regulators using two methods:
  • Fisher’s Exact Test (FET): Tests for over-representation of differentially expressed genes among TF targets
  • GSEA: Tests for enrichment of TF targets in ranked gene lists

Network Reconstruction

# Create RTN object
rtni <- tni.constructor(
    expData = normalized_counts,
    regulatoryElements = human_tfs
)

# Compute mutual information between TFs and all genes
rtni <- tni.permutation(rtni, nPermutations = 100)

# Bootstrap analysis for network stability
rtni <- tni.bootstrap(rtni)

# Apply DPI filter to remove indirect interactions
rtni <- tni.dpi.filter(rtni)

# Check how many regulons were successfully constructed
regulon_summary <- tni.regulon.summary(rtni)

Computational Note: Network reconstruction can take 30-60 minutes. The nPermutations = 100 is used here for demonstration; use nPermutations > 1000 for real analysis to ensure robust statistical assessment.

Performing Differential Expression Analysis

# Set reference level for comparison
dds$condition <- relevel(dds$condition, ref = "Control")

# Run DESeq2 differential expression analysis
dds_de <- DESeq(dds)
res_deseq2 <- results(dds_de)

# Sort by adjusted p-value
res_deseq2 <- res_deseq2[order(res_deseq2$padj), ]
res_deseq2_df <- as.data.frame(res_deseq2)

# Create ranked gene list by log2 fold change
logfc_ranked <- res_deseq2_df$log2FoldChange
names(logfc_ranked) <- rownames(res_deseq2_df)
logfc_ranked <- sort(logfc_ranked, decreasing = TRUE)

# Get top differentially expressed genes
# Note: Adjust the number or set thresholds based on your data quality
sig_genes <- rownames(res_deseq2_df)[1:1000]

Note on DE Genes: In this example, we take the top 1000 genes ranked by p-value. In practice, you should use appropriate FDR and FC cutoffs (e.g., padj < 0.05) if the statistical power is sufficient. The number of significant genes will depend on your experimental design and biological effect size.

Method 1: Master Regulator Analysis with Fisher’s Exact Test

# Prepare RTN object for MRA
rtna <- tni2tna.preprocess(
    object = rtni,
    phenotype = logfc_ranked,
    hits = sig_genes
)

# Run MRA using Fisher's exact test
rtna_mra <- tna.mra(rtna)

# Get results
results_mra_fet <- tna.get(
    rtna_mra,
    what = "mra",
    ntop = -1
)

# Save results
fwrite(results_mra_fet, "results_rtn_mra_fet.csv")

Interpreting Fisher’s Exact Test Results

The Fisher’s exact test results contain:

  • Regulon: Transcription factor name
  • Regulon.Size: Total number of target genes in the regulon
  • Observed.Hits: Number of significant DE genes among targets
  • Expected.Hits: Expected number by chance
  • Pvalue: Statistical significance of over-representation
  • Adjusted.Pvalue: FDR-adjusted p-value

Key Interpretations:

  • Observed > Expected: TF targets are over-represented among DE genes
  • Low Adjusted.Pvalue: Strong statistical evidence for master regulator
  • Large Size: More reliable results typically come from larger regulons (15-500 targets)

Method 2: Master Regulator Analysis with GSEA

# Run MRA using GSEA approach
rtna_gsea <- tna.gsea2(
    rtna,
    pValueCutoff = 0.05,
    pAdjustMethod = "BH",
    minRegulonSize = 15
)

# Get results
results_mra_gsea <- tna.get(
    rtna_gsea,
    what = "gsea2",
    ntop = -1
)

# Save results
fwrite(results_mra_gsea$differential, "results_rtn_mra_gsea.csv")

Interpreting GSEA Results

The GSEA results (in the differential data frame) contain:

  • Regulon: Transcription factor name
  • Regulon.Size: Number of target genes
  • Observed.Score: Enrichment score indicating direction and magnitude
  • Pvalue: Statistical significance
  • Adjusted.Pvalue: FDR-adjusted p-value

FET vs GSEA: Fisher’s exact test focuses on over-representation (are DE genes enriched?), while GSEA considers both enrichment and direction of change. GSEA is generally more powerful for detecting coordinated regulatory changes and provides information about activation vs. repression.

Visualizing and Interpreting Master Regulator Results

Comparing results across multiple methods strengthens confidence in identified master regulators.

Comparing Results Between Methods

# Create list of master regulators from each method
mra_list <- list(
    RegEnrich = res_regenrich[res_regenrich$negLogPEnrich > 1.3, ]$reg,
    MRA_FET = results_mra_fet[results_mra_fet$Adjusted.Pvalue < 0.05, ]$Regulon,
    GSEA = results_mra_gsea$differential[results_mra_gsea$differential$Adjusted.Pvalue < 0.05, ]$Regulon
)

# Create Venn diagram
venn.diagram(
    x = mra_list,
    filename = 'Venn_Diagram_Regulators.png',
    output = TRUE,
    lwd = 2,
    lty = 'blank',
    fill = c("#440154ff", "#21908dff", "#fde725ff"),
    cex = 1.5,
    fontface = "bold",
    fontfamily = "sans",
    cat.cex = 1.8,
    cat.fontface = "bold",
    cat.default.pos = "outer",
    cat.pos = c(-27, 27, 135),
    cat.dist = c(0.055, 0.055, 0.085),
    cat.fontfamily = "sans",
    rotation = 1
)

Interpreting the Comparison

The Venn diagram reveals:

  1. Common regulators (center overlap): High-confidence master regulators identified by all methods
  2. Method-specific regulators: May reflect different sensitivities or assumptions
  • RegEnrich unique: May capture regulators based on curated databases
  • RTN MRA-FET unique: Focuses on over-representation
  • RTN GSEA unique: Sensitive to coordinated directional changes

Focus on regulators appearing in multiple methods for highest confidence, but don’t ignore method-specific findings – they may represent biologically relevant regulators with different characteristics.

Biological Interpretation and Next Steps

Interpreting Your Master Regulator Results

When examining your master regulator analysis results, consider these key questions:

1. Are the identified regulators biologically relevant?

  • Do they match known biology of your system?
  • Are they expressed in your tissue/cell type?
  • Have they been implicated in similar conditions in the literature?

2. What regulatory programs are activated?

  • Look at Gene Ontology enrichment of target genes
  • Identify common pathways controlled by top regulators
  • Consider whether regulators work together or independently

3. What is the direction of regulation?

  • For GSEA results: positive score indicates activation, negative indicates repression
  • Check the TF’s own expression change (logFC in RegEnrich results)
  • Consider post-translational modifications that may affect activity

Validating Master Regulator Predictions

Computational predictions should be validated experimentally:

Experimental Validation Approaches:

  1. Perturbation experiments: Knock down or overexpress the predicted master regulator
  2. ChIP-seq validation: Confirm direct binding to target genes
  3. Reporter assays: Test if the TF activates predicted target promoters
  4. Rescue experiments: Show that modulating the TF reverses the phenotype

Computational Validation:

  1. Cross-dataset validation: Check if the same regulators appear in similar conditions
  2. Literature mining: Systematic review of published studies
  3. Integration with other data: Combine with ATAC-seq, ChIP-seq, or proteomics

Best Practices for Master Regulator Analysis

Ensuring Robust Results

Quality Control Checkpoints:

  1. Input data quality: Ensure adequate sequencing depth and biological replicates (minimum 3 per group, preferably 6)
  2. Filtering: Remove lowly expressed genes that add noise
  3. Differential expression: Verify reasonable number of DE genes (typically hundreds to thousands)
  4. Regulon size: Most robust results come from regulons with 15-500 target genes
  5. Multiple testing correction: Always use FDR-adjusted p-values

Methodological Considerations:

  1. Use multiple methods: Compare results from RegEnrich and both RTN approaches
  2. Check consistency: Top regulators should show up across different methods
  3. Consider background: Account for tissue-specific TF expression and activity
  4. Validate computationally: Use independent datasets when available

Common Pitfalls and How to Avoid Them

Pitfall 1: Over-interpreting weak signals

  • Problem: Calling transcription factors significant without strong statistical support
  • Solution: Use stringent FDR thresholds (< 0.05) and require high effect sizes

Pitfall 2: Ignoring tissue context

  • Problem: Identifying TFs that aren’t expressed in your system
  • Solution: Filter for TFs with adequate expression in your samples before analysis

Pitfall 3: Confusing correlation with causation

  • Problem: Assuming all enriched TFs are causally driving the changes
  • Solution: Validate key findings experimentally; consider upstream regulators vs downstream responders

Pitfall 4: Insufficient sample size

  • Problem: Low statistical power leading to unreliable results
  • Solution: Use at least 3 biological replicates per condition, preferably 5-6

Pitfall 5: Not considering TF activity vs expression

  • Problem: Missing master regulators whose activity changes without expression changes
  • Solution: Remember that TFs can be regulated post-translationally; GSEA-based approaches are more sensitive to activity changes

Conclusion

Master regulator analysis transforms RNA-seq data from lists of differentially expressed genes into mechanistic hypotheses about the transcription factors driving those changes. By integrating gene expression with regulatory network information, methods like RegEnrich and RTN help identify the key molecular drivers of biological processes.

Master regulator analysis represents a powerful approach to move beyond descriptive gene lists toward mechanistic understanding of gene regulation. As you apply these methods to your own research, remember that the most impactful discoveries come from combining computational predictions with biological insight and experimental validation.

References

  1. Castro MA, et al. “Regulators of genetic risk of breast cancer identified by integrative network analysis.” Nature Genetics (2016). doi: 10.1038/ng.3458
  2. Lefebvre C, et al. “A human B-cell interactome identifies MYB and FOXM1 as master regulators of proliferation in germinal centers.” Molecular Systems Biology (2010). doi: 10.1038/msb.2010.31
  3. Alvarez MJ, et al. “Functional characterization of somatic mutations in cancer using network-based inference of protein activity.” Nature Genetics (2016). doi: 10.1038/ng.3593
  4. Fletcher MNC, et al. “Master regulators of FGFR2 signalling and breast cancer risk.” Nature Communications (2013). doi: 10.1038/ncomms3464
  5. Margolin AA, et al. “ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context.” BMC Bioinformatics (2006). doi: 10.1186/1471-2105-7-S1-S7
  6. Law CW, et al. “voom: Precision weights unlock linear model analysis tools for RNA-seq read counts.” Genome Biology (2014). doi: 10.1186/gb-2014-15-2-r29
  7. Sivakumar S, de Santiago I, Chlon L, Markowetz F (2017) Master Regulators of Oncogenic KRAS Response in Pancreatic Cancer: An Integrative Network Biology Analysis. PLoS Med 14(1): e1002223. https://doi.org/10.1371/journal.pmed.1002223

This tutorial is part of the NGS101.com series on whole genome sequencing analysis. If this tutorial helped advance your research, please comment and share your experience to help other researchers! Subscribe to stay updated with our latest bioinformatics tutorials and resources.

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *