How to Perform Master Regulator Analysis on RNA-seq Data Using RegEnrich and RTN - A Complete Beginner's Guide

Discover the transcription factors controlling gene expression changes in your RNA-seq experiments

Table of Contents

Introduction: Understanding Master Regulator Analysis

In the intricate symphony of gene regulation, not all transcription factors play equal roles. Some act as “master regulators” – key transcription factors that orchestrate broad changes in gene expression programs, controlling entire networks of downstream genes. Identifying these master regulators is crucial for understanding the molecular mechanisms driving biological processes, from development to disease.

What is Master Regulator Analysis?

Master regulator analysis (MRA) is a computational approach designed to identify transcription factors (TFs) that are likely responsible for observed changes in gene expression between biological conditions. Rather than simply finding which genes are differentially expressed, MRA asks a more mechanistic question: “Which transcription factors are driving these expression changes?”

The fundamental principle behind MRA is that if a transcription factor is a key regulator of a biological process, many of its known target genes should show coordinated expression changes. By integrating:

Differential gene expression data from your RNA-seq experiment
Prior knowledge of transcription factor-target gene relationships
Statistical enrichment methods to identify over-represented regulatory relationships

MRA can pinpoint the transcription factors most likely orchestrating the observed transcriptional program.

Beginner’s Tip: Think of master regulator analysis as detective work. You observe which genes changed in your experiment (the evidence), then use known regulatory relationships (the database of suspects) to identify which transcription factors are most likely responsible (the culprit).

The Biological Mechanism Behind Master Regulator Analysis

To understand how MRA works, let’s examine the biological relationships it leverages:

Transcription Factor Binding and Gene Regulation:

Transcription factors bind to specific DNA sequences (regulatory elements) near target genes
Upon binding, TFs can activate or repress transcription of their target genes
A single TF can regulate dozens to hundreds of target genes
Target genes of the same TF often show coordinated expression changes when that TF’s activity changes

The Regulatory Network:
Master regulators sit at the top of regulatory hierarchies. When a master regulator becomes active or inactive:

Its direct target genes change expression
Some target genes may themselves be transcription factors, creating cascades
The combined effect produces coordinated changes across entire gene programs

Computational Inference:
MRA methods leverage databases of known or predicted TF-target relationships (derived from ChIP-seq, motif analysis, or literature curation). By testing whether a TF’s known targets are enriched among differentially expressed genes, these methods can infer which TFs are likely active regulators in your experimental condition.

Biological Insights from Master Regulator Analysis

Performing MRA on your RNA-seq data can reveal several types of biological insights:

Mechanistic Understanding: Beyond knowing which genes changed, you learn why they changed – which regulatory programs were activated
Therapeutic Targets: Master regulators often make better drug targets than downstream effectors, as they control multiple pathways
Regulatory Hierarchies: Identify primary drivers versus secondary responders in your biological system
Network Rewiring: Discover how regulatory networks are reorganized in disease or during development
Biomarker Discovery: Master regulators can serve as molecular signatures for disease subtypes or treatment response

Introducing RegEnrich and RTN: Two Complementary Approaches

This tutorial focuses on two powerful Bioconductor packages for master regulator analysis, each with distinct strengths:

RegEnrich (Regulatory Enrichment Analysis):

Approach: Integrates differential expression analysis with regulatory network enrichment
Strengths:
User-friendly workflow from count matrix to results
Built-in database of TF-target relationships
Multiple enrichment methods for robust results
Excellent visualization capabilities
Best for: Researchers wanting a streamlined, integrated workflow with minimal data preprocessing

RTN (Reconstruction of Transcriptional regulatory Networks):

Approach: Reconstructs regulatory networks from expression data using mutual information, then tests for master regulators
Strengths:
Infers regulatory relationships directly from your data
Can discover novel regulatory relationships not in databases
Multiple MRA methods (Fisher’s exact test and GSEA)
Sophisticated network analysis tools
Best for: Researchers interested in data-driven network reconstruction or working with less-studied organisms

Both methods are complementary: RegEnrich leverages curated knowledge, while RTN discovers relationships de novo from your data. Using both approaches can provide a more complete picture of regulatory mechanisms.

Setting Up Your Analysis Environment

Before beginning the analysis, we’ll create a dedicated computational environment with all necessary tools. This ensures reproducibility and prevents conflicts with other software.

Creating an R Environment for Master Regulator Analysis

Open your RStudio, install the required packages:

# Install Bioconductor if not already installed
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install core analysis packages
BiocManager::install(c(
    "RegEnrich",
    "RTN",
    "DESeq2",
    "limma",
    "edgeR",
    "org.Hs.eg.db",
    "RedeR",
    "igraph",
    "clusterProfiler"
))

# Install additional utilities
install.packages(c(
    "pheatmap",
    "ggplot2",
    "dplyr",
    "tidyr",
    "data.table",
    "VennDiagram"
))

Downloading and Preparing the Example Dataset

For this tutorial, we’ll use RNA-seq data from GEO dataset GSE53239, which examines gene expression differences in a controlled experimental setting. This dataset provides an good example for demonstrating master regulator analysis.

Understanding the GSE53239 Dataset

The dataset consists of:

Treatment samples: HGB_11 through HGB_16 (6 samples)
Control samples: HGB_17 through HGB_22 (6 samples)
Gene expression data: Raw count data from RNA-seq with gene symbols

Downloading the Expression Matrix

# Create project directory structure
mkdir -p ~/MRA_Analysis
cd ~/MRA_Analysis

# Download the gene expression matrix
wget "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE53239&format=file&file=GSE53239%5FHGB11%2D22%5FgeneLevel%5FHTSeqCountData%2Etxt%2Egz" \
    -O GSE53239_expression_matrix.txt.gz

# Decompress the file
gunzip GSE53239_expression_matrix.txt.gz

Preparing Data for Master Regulator Analysis

Now we’ll load all required libraries and prepare the data for analysis.

Loading Required Libraries

# Load all required libraries
library(tidyr)
library(dplyr)
library(DESeq2)
library(data.table)
library(edgeR)
library(RegEnrich)
library(RTN)
library(VennDiagram)

# Set working directory
setwd("~/MRA_Analysis")

Creating Sample Information and Loading Data

# Create sample metadata
sample_info <- data.frame(
    sample_id = paste0("HGB_", 11:22),
    condition = c(rep("Treatment", 6), rep("Control", 6)),
    stringsAsFactors = FALSE
)

# Save the sample information
write.table(sample_info, 
            file = "sample_info.txt", 
            sep = "\t", 
            row.names = FALSE, 
            quote = FALSE)

# Read the raw count matrix
counts_raw <- read.table("GSE53239_expression_matrix.txt", 
                         header = TRUE)

# Remove duplicate gene symbols (keep first occurrence)
counts_raw <- counts_raw[!duplicated(counts_raw[[1]]), ]

# Set gene symbols as row names
rownames(counts_raw) <- counts_raw[[1]]

# Remove the gene symbol column
counts_raw <- counts_raw[, -1]

# Read sample information
sample_info <- read.table("sample_info.txt", 
                          header = TRUE,
                          stringsAsFactors = FALSE)

# Ensure sample order matches between count matrix and metadata
sample_info <- sample_info[match(colnames(counts_raw), sample_info$sample_id), ]

# Verify the match
all(colnames(counts_raw) == sample_info$sample_id)

Filtering Low-Expressed Genes

# Create DESeq2 dataset object
dds <- DESeqDataSetFromMatrix(
    countData = counts_raw,
    colData = sample_info,
    design = ~ condition
)

# Filter genes: keep genes expressed in at least 80% of samples
perc_keep <- 0.8
gene_keep <- rowSums(counts_raw > 0) >= ceiling(perc_keep * ncol(counts_raw))
dds <- dds[gene_keep, ]

# Also create filtered count table for other analyses
count_tbl_filtered <- counts_raw[gene_keep, ]

Filtering Strategy: We keep genes that are expressed (count > 0) in at least 80% of samples. This removes genes with very sparse expression that are unlikely to be reliable for downstream analysis while retaining genes with condition-specific expression patterns.

Normalizing Data for RTN Analysis

# Perform variance stabilizing transformation for RTN
vsd <- vst(dds, blind = FALSE)
normalized_counts <- assay(vsd)

Why VST? The variance stabilizing transformation from DESeq2 removes the dependence of variance on the mean, making the data suitable for correlation-based methods used by RTN.

Master Regulator Analysis with RegEnrich

RegEnrich provides an integrated workflow that combines limma-voom for differential expression with regulatory network enrichment analysis.

Understanding the RegEnrich Workflow

RegEnrich analysis proceeds through these steps:

Differential expression analysis using limma-voom
Network inference to identify TF-target relationships
Enrichment testing using GSEA
Integration and scoring of results

Preparing Data for RegEnrich

# Load human transcription factors from RegEnrich or a list of TFs specific to your dataset
data(TFs)
human_tfs <- intersect(TFs$TF_name, rownames(dge_v$E))

# Create DGEList object and normalize with TMM
dge <- DGEList(counts = count_tbl_filtered, samples = sample_info)
dge <- calcNormFactors(dge, method = "TMM")

# Apply voom transformation
dge_v <- voom(dge, plot = TRUE)

# Create design matrix
design_mtx <- model.matrix(~ 0 + dge_v$targets$condition)
colnames(design_mtx) <- gsub("dge_v\\$targets\\$condition", "", colnames(design_mtx))

# Create contrast matrix
ctrst_mtx <- makeContrasts(contrasts = "Treatment-Control", levels = design_mtx)

Why Voom? The voom transformation combines the speed of limma with appropriate modeling of RNA-seq count data by estimating mean-variance relationships and computing precision weights for each observation.

Running RegEnrich Analysis

# Initialize RegEnrich object
regen_obj <- RegenrichSet(
    expr = dge_v$E,                           # voom-transformed expression
    colData = dge_v$targets,                  # sample information
    reg = human_tfs,  # TFs present in data
    method = "limma",                         # DE method
    design = design_mtx,                      # design matrix
    contrast = ctrst_mtx,                     # contrast matrix
    networkConstruction = "COEN",             # network inference method
    enrichTest = "GSEA"                       # enrichment method
)

# Run the complete analysis pipeline
regen_obj <- regenrich_diffExpr(regen_obj) %>% 
    regenrich_network(RsquaredCut = 0.8) %>%  # Lower for smaller sample sizes
    regenrich_enrich() %>% 
    regenrich_rankScore()

# Extract results
res_regenrich <- na.omit(as.data.frame(results_score(regen_obj)))

# Save results
fwrite(res_regenrich, "res_mra_regenrich.csv")

Understanding RegEnrich Parameters

Key parameters to note:

networkConstruction = “COEN”: Uses correlation-based network inference from your expression data
enrichTest = “GSEA”: Uses Gene Set Enrichment Analysis for robust enrichment testing
RsquaredCut = 0.8: Correlation threshold for defining TF-target relationships. Use lower values (0.6-0.7) for smaller sample sizes

Interpreting RegEnrich Results

The results table contains these key columns:

reg: Transcription factor name
negLogPDEA: -log10(p-value) from differential expression of the TF itself
negLogPEnrich: -log10(p-value) from enrichment analysis of TF targets
logFC: Log2 fold change of the TF
score: Integrated score combining multiple evidence types

Master regulators are identified by:

High negLogPEnrich (>1.3, corresponding to p < 0.05): Significant enrichment of TF targets
High score: Strong integrated evidence
Biological relevance: Consider logFC direction and TF expression level

Master Regulator Analysis with RTN

RTN takes a data-driven approach by first reconstructing regulatory networks from your expression data, then applying two different methods to identify master regulators.

Understanding the RTN Workflow

RTN’s analysis follows these steps:

Construct a transcriptional network using mutual information
Perform permutation analysis to assess network significance
Bootstrap analysis for network stability
Apply DPI filter to remove indirect interactions
Test for master regulators using two methods:

Fisher’s Exact Test (FET): Tests for over-representation of differentially expressed genes among TF targets
GSEA: Tests for enrichment of TF targets in ranked gene lists

Network Reconstruction

# Create RTN object
rtni <- tni.constructor(
    expData = normalized_counts,
    regulatoryElements = human_tfs
)

# Compute mutual information between TFs and all genes
rtni <- tni.permutation(rtni, nPermutations = 100)

# Bootstrap analysis for network stability
rtni <- tni.bootstrap(rtni)

# Apply DPI filter to remove indirect interactions
rtni <- tni.dpi.filter(rtni)

# Check how many regulons were successfully constructed
regulon_summary <- tni.regulon.summary(rtni)

Computational Note: Network reconstruction can take 30-60 minutes. The nPermutations = 100 is used here for demonstration; use nPermutations > 1000 for real analysis to ensure robust statistical assessment.

Performing Differential Expression Analysis

# Set reference level for comparison
dds$condition <- relevel(dds$condition, ref = "Control")

# Run DESeq2 differential expression analysis
dds_de <- DESeq(dds)
res_deseq2 <- results(dds_de)

# Sort by adjusted p-value
res_deseq2 <- res_deseq2[order(res_deseq2$padj), ]
res_deseq2_df <- as.data.frame(res_deseq2)

# Create ranked gene list by log2 fold change
logfc_ranked <- res_deseq2_df$log2FoldChange
names(logfc_ranked) <- rownames(res_deseq2_df)
logfc_ranked <- sort(logfc_ranked, decreasing = TRUE)

# Get top differentially expressed genes
# Note: Adjust the number or set thresholds based on your data quality
sig_genes <- rownames(res_deseq2_df)[1:1000]

Note on DE Genes: In this example, we take the top 1000 genes ranked by p-value. In practice, you should use appropriate FDR and FC cutoffs (e.g., padj < 0.05) if the statistical power is sufficient. The number of significant genes will depend on your experimental design and biological effect size.

Method 1: Master Regulator Analysis with Fisher’s Exact Test

# Prepare RTN object for MRA
rtna <- tni2tna.preprocess(
    object = rtni,
    phenotype = logfc_ranked,
    hits = sig_genes
)

# Run MRA using Fisher's exact test
rtna_mra <- tna.mra(rtna)

# Get results
results_mra_fet <- tna.get(
    rtna_mra,
    what = "mra",
    ntop = -1
)

# Save results
fwrite(results_mra_fet, "results_rtn_mra_fet.csv")

Interpreting Fisher’s Exact Test Results

The Fisher’s exact test results contain:

Regulon: Transcription factor name
Regulon.Size: Total number of target genes in the regulon
Observed.Hits: Number of significant DE genes among targets
Expected.Hits: Expected number by chance
Pvalue: Statistical significance of over-representation
Adjusted.Pvalue: FDR-adjusted p-value

Key Interpretations:

Observed > Expected: TF targets are over-represented among DE genes
Low Adjusted.Pvalue: Strong statistical evidence for master regulator
Large Size: More reliable results typically come from larger regulons (15-500 targets)

Method 2: Master Regulator Analysis with GSEA

# Run MRA using GSEA approach
rtna_gsea <- tna.gsea2(
    rtna,
    pValueCutoff = 0.05,
    pAdjustMethod = "BH",
    minRegulonSize = 15
)

# Get results
results_mra_gsea <- tna.get(
    rtna_gsea,
    what = "gsea2",
    ntop = -1
)

# Save results
fwrite(results_mra_gsea$differential, "results_rtn_mra_gsea.csv")

Interpreting GSEA Results

The GSEA results (in the differential data frame) contain:

Regulon: Transcription factor name
Regulon.Size: Number of target genes
Observed.Score: Enrichment score indicating direction and magnitude
Pvalue: Statistical significance
Adjusted.Pvalue: FDR-adjusted p-value

FET vs GSEA: Fisher’s exact test focuses on over-representation (are DE genes enriched?), while GSEA considers both enrichment and direction of change. GSEA is generally more powerful for detecting coordinated regulatory changes and provides information about activation vs. repression.

Visualizing and Interpreting Master Regulator Results

Comparing results across multiple methods strengthens confidence in identified master regulators.

Comparing Results Between Methods

# Create list of master regulators from each method
mra_list <- list(
    RegEnrich = res_regenrich[res_regenrich$negLogPEnrich > 1.3, ]$reg,
    MRA_FET = results_mra_fet[results_mra_fet$Adjusted.Pvalue < 0.05, ]$Regulon,
    GSEA = results_mra_gsea$differential[results_mra_gsea$differential$Adjusted.Pvalue < 0.05, ]$Regulon
)

# Create Venn diagram
venn.diagram(
    x = mra_list,
    filename = 'Venn_Diagram_Regulators.png',
    output = TRUE,
    lwd = 2,
    lty = 'blank',
    fill = c("#440154ff", "#21908dff", "#fde725ff"),
    cex = 1.5,
    fontface = "bold",
    fontfamily = "sans",
    cat.cex = 1.8,
    cat.fontface = "bold",
    cat.default.pos = "outer",
    cat.pos = c(-27, 27, 135),
    cat.dist = c(0.055, 0.055, 0.085),
    cat.fontfamily = "sans",
    rotation = 1
)

Interpreting the Comparison

The Venn diagram reveals:

Common regulators (center overlap): High-confidence master regulators identified by all methods
Method-specific regulators: May reflect different sensitivities or assumptions

RegEnrich unique: May capture regulators based on curated databases
RTN MRA-FET unique: Focuses on over-representation
RTN GSEA unique: Sensitive to coordinated directional changes

Focus on regulators appearing in multiple methods for highest confidence, but don’t ignore method-specific findings – they may represent biologically relevant regulators with different characteristics.

Biological Interpretation and Next Steps

Interpreting Your Master Regulator Results

When examining your master regulator analysis results, consider these key questions:

1. Are the identified regulators biologically relevant?

Do they match known biology of your system?
Are they expressed in your tissue/cell type?
Have they been implicated in similar conditions in the literature?

2. What regulatory programs are activated?

Look at Gene Ontology enrichment of target genes
Identify common pathways controlled by top regulators
Consider whether regulators work together or independently

3. What is the direction of regulation?

For GSEA results: positive score indicates activation, negative indicates repression
Check the TF’s own expression change (logFC in RegEnrich results)
Consider post-translational modifications that may affect activity

Validating Master Regulator Predictions

Computational predictions should be validated experimentally:

Experimental Validation Approaches:

Perturbation experiments: Knock down or overexpress the predicted master regulator
ChIP-seq validation: Confirm direct binding to target genes
Reporter assays: Test if the TF activates predicted target promoters
Rescue experiments: Show that modulating the TF reverses the phenotype

Computational Validation:

Cross-dataset validation: Check if the same regulators appear in similar conditions
Literature mining: Systematic review of published studies
Integration with other data: Combine with ATAC-seq, ChIP-seq, or proteomics

Best Practices for Master Regulator Analysis

Ensuring Robust Results

Quality Control Checkpoints:

Input data quality: Ensure adequate sequencing depth and biological replicates (minimum 3 per group, preferably 6)
Filtering: Remove lowly expressed genes that add noise
Differential expression: Verify reasonable number of DE genes (typically hundreds to thousands)
Regulon size: Most robust results come from regulons with 15-500 target genes
Multiple testing correction: Always use FDR-adjusted p-values

Methodological Considerations:

Use multiple methods: Compare results from RegEnrich and both RTN approaches
Check consistency: Top regulators should show up across different methods
Consider background: Account for tissue-specific TF expression and activity
Validate computationally: Use independent datasets when available

Common Pitfalls and How to Avoid Them

Pitfall 1: Over-interpreting weak signals

Problem: Calling transcription factors significant without strong statistical support
Solution: Use stringent FDR thresholds (< 0.05) and require high effect sizes

Pitfall 2: Ignoring tissue context

Problem: Identifying TFs that aren’t expressed in your system
Solution: Filter for TFs with adequate expression in your samples before analysis

Pitfall 3: Confusing correlation with causation

Problem: Assuming all enriched TFs are causally driving the changes
Solution: Validate key findings experimentally; consider upstream regulators vs downstream responders

Pitfall 4: Insufficient sample size

Problem: Low statistical power leading to unreliable results
Solution: Use at least 3 biological replicates per condition, preferably 5-6

Pitfall 5: Not considering TF activity vs expression

Problem: Missing master regulators whose activity changes without expression changes
Solution: Remember that TFs can be regulated post-translationally; GSEA-based approaches are more sensitive to activity changes

Conclusion

Master regulator analysis transforms RNA-seq data from lists of differentially expressed genes into mechanistic hypotheses about the transcription factors driving those changes. By integrating gene expression with regulatory network information, methods like RegEnrich and RTN help identify the key molecular drivers of biological processes.

Master regulator analysis represents a powerful approach to move beyond descriptive gene lists toward mechanistic understanding of gene regulation. As you apply these methods to your own research, remember that the most impactful discoveries come from combining computational predictions with biological insight and experimental validation.

References

Castro MA, et al. “Regulators of genetic risk of breast cancer identified by integrative network analysis.” Nature Genetics (2016). doi: 10.1038/ng.3458
Lefebvre C, et al. “A human B-cell interactome identifies MYB and FOXM1 as master regulators of proliferation in germinal centers.” Molecular Systems Biology (2010). doi: 10.1038/msb.2010.31
Alvarez MJ, et al. “Functional characterization of somatic mutations in cancer using network-based inference of protein activity.” Nature Genetics (2016). doi: 10.1038/ng.3593
Fletcher MNC, et al. “Master regulators of FGFR2 signalling and breast cancer risk.” Nature Communications (2013). doi: 10.1038/ncomms3464
Margolin AA, et al. “ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context.” BMC Bioinformatics (2006). doi: 10.1186/1471-2105-7-S1-S7
Law CW, et al. “voom: Precision weights unlock linear model analysis tools for RNA-seq read counts.” Genome Biology (2014). doi: 10.1186/gb-2014-15-2-r29
Sivakumar S, de Santiago I, Chlon L, Markowetz F (2017) Master Regulators of Oncogenic KRAS Response in Pancreatic Cancer: An Integrative Network Biology Analysis. PLoS Med 14(1): e1002223. https://doi.org/10.1371/journal.pmed.1002223

This tutorial is part of the NGS101.com series on whole genome sequencing analysis. If this tutorial helped advance your research, please comment and share your experience to help other researchers! Subscribe to stay updated with our latest bioinformatics tutorials and resources.

How to Perform Master Regulator Analysis on RNA-seq Data Using RegEnrich and RTN – A Complete Beginner’s Guide