How To Analyze ChIP-seq Data For Absolute Beginners Part 3: Differential Binding Analysis and Motif Discovery

How To Analyze ChIP-seq Data For Absolute Beginners Part 3: Differential Binding Analysis and Motif Discovery

Introduction: Moving Beyond Basic Peak Calling

In our previous tutorials, we walked through the fundamental process of ChIP-seq analysis, from raw sequencing reads to identifying where a protein of interest binds to DNA. While finding these binding sites (peaks) is a crucial first step, it’s often just the beginning of extracting meaningful biological insights from ChIP-seq data.

This tutorial takes you to the next level by exploring two powerful analytical approaches that help answer deeper biological questions:

  • Differential binding analysis: Identifying how protein-DNA interactions change between conditions
  • Advanced motif analysis: Uncovering the specific DNA sequences that drive binding

Beginner’s Tip: Think of differential binding analysis like comparing two maps to see how traffic patterns change during different times of day. Instead of just knowing where roads exist, you learn how their usage changes under different conditions.

The Power of Differential Binding Analysis

Differential binding analysis transforms ChIP-seq from a descriptive technique into a powerful tool for understanding the dynamic nature of gene regulation. Rather than simply knowing where a protein binds, differential binding tells us how binding patterns change in response to:

  • Different cell types or tissues
  • Treatment with drugs or stimuli
  • Disease states versus healthy controls
  • Different developmental stages
  • Genetic mutations or knockdowns

For example, differential binding analysis can reveal how:

  • A transcription factor relocates to different genomic sites after hormone treatment
  • Histone modifications change across promoters during cell differentiation
  • DNA repair proteins are recruited to different regions after DNA damage
  • Chromatin remodelers redistribute across the genome in cancer cells
  • Tissue-specific transcription factors bind to different enhancers in different cell types

The statistical frameworks used in differential binding analysis are conceptually similar to those employed in RNA-seq differential expression analysis, but with important modifications for peak-based data.

Unlocking Insights Through Differential Motif Analysis

While ChIP-seq identifies where proteins bind to DNA, motif analysis reveals the specific DNA sequences they recognize. Taking this a step further, differential motif analysis allows us to understand how DNA sequence preferences vary between experimental conditions, providing mechanistic insights into regulatory changes.

When analyzing motifs across different conditions (such as treatment vs. control or disease vs. healthy), several key questions emerge:

  • How do primary motif enrichments differ between regions with increased versus decreased binding?
  • Are unique co-factor motifs enriched in condition-specific binding sites, suggesting context-dependent partnerships?
  • Do changes in motif composition correlate with changes in binding intensity or functional outcomes?
  • Are certain motif variants preferentially associated with specific conditions, suggesting altered binding preferences?
  • Can motif differences explain functional changes observed in gene expression or cellular phenotypes?

By comparing motif patterns between differentially bound regions, we can move beyond simply cataloging binding sites to understanding the sequence-based mechanisms that drive dynamic gene regulation across different biological contexts.

Popular Tools and Methods

Several specialized tools have been developed for these analyses:

For Differential Binding Analysis:

  • HOMER getDifferentialPeaks – Built into the HOMER suite you used previously
  • limma/edgeR/DESeq2 – RNA-seq tools adapted for ChIP-seq count data
  • DiffBind – An R package specifically designed for ChIP-seq differential binding analysis
  • MAnorm – Specifically designed to normalize and compare two ChIP-seq samples
  • ChIPComp – A tool that can handle complex experimental designs
  • Csaw – A BioConductor package for analyzing high-throughput data with sliding windows

For Advanced Motif Analysis:

  • HOMER findMotifsGenome.pl – The tool you’ve already used, with advanced options
  • MEME Suite – A comprehensive set of tools for motif discovery and analysis
  • RSAT – Regulatory Sequence Analysis Tools
  • JASPAR – A database of transcription factor binding profiles
  • CentriMo – For identifying centrally enriched motifs
  • DREME – For discovering short motifs in large datasets
  • DiffLogo – For visualizing differences between motifs

Prerequisites for This Tutorial

Before performing differential binding and advanced motif analysis, you need several prerequisites:

  1. Processed ChIP-seq data: You’ll need aligned reads (BAM files) and/or called peaks from multiple conditions as generated in Part 1
  2. Properly matched samples: Ensure your experimental and control samples are appropriately paired
  3. Normalized data: To make valid comparisons, your samples should be normalized for sequencing depth and other technical factors
  4. Quality-controlled peaks: Peaks should be filtered for quality and reproducibility
  5. Sample metadata: You’ll need a clear understanding of your experimental design and how samples relate to each other

We’ll build directly on the results from the previous tutorial, focusing on creating count matrices from multiple samples, normalizing for technical differences, defining consensus peak sets, and selecting appropriate statistical models based on your experimental design.

Preparing Data for Differential Binding Analysis

Differential binding analysis requires multiple samples representing different conditions. Unlike RNA-seq, where all samples contain the same set of genes, ChIP-seq typically identifies a unique peak set in each sample. Therefore, we must first create a common peak set containing the peaks that exist across samples.

STEP 1: Create a Merged Peak Set

To create a consensus peak set, we merge overlapping peaks from all samples using HOMER’s mergePeaks function:

# STEP 1: Create a merged peak set across samples
# ---------------------------------------------

# Activate the Conda environment we set up in the previous tutorial
conda activate ~/Env_Homer

# Create a directory for the merged peaks
mkdir -p ~/ChIPseq_USF2/Homer/Merged_Peaks

# Merge the USF2 ChIP-seq peaks from all 4 samples
# -d given: use the given peak boundaries for merging
echo "Merging peaks from all samples..."
mergePeaks -d given \
~/ChIPseq_USF2/Homer/Sample1/Peaks_USF2_Sample1.txt \
~/ChIPseq_USF2/Homer/Sample2/Peaks_USF2_Sample2.txt \
~/ChIPseq_USF2/Homer/Sample3/Peaks_USF2_Sample3.txt \
~/ChIPseq_USF2/Homer/Sample4/Peaks_USF2_Sample4.txt \
> ~/ChIPseq_USF2/Homer/Merged_Peaks/USF2_Merged_Peaks.txt

# Check how many peaks were identified in the merged set
echo "Number of peaks in the merged set:"
wc -l ~/ChIPseq_USF2/Homer/Merged_Peaks/USF2_Merged_Peaks.txt

What’s Happening Here: The mergePeaks command combines peaks from all samples into a unified set. If a peak from one sample overlaps with a peak from another sample, they are merged into a single peak in the output. The -d given option preserves the original peak boundaries.

STEP 2: Create a Count Matrix for All Peaks

Next, we need to quantify the binding intensity of each peak across all samples. HOMER’s annotatePeaks.pl command can accomplish this:

# STEP 2: Create a count matrix for all peaks
# ---------------------------------------------

# Annotate the merged peaks and quantify binding in each sample
# The -d flag specifies the tag directories from each sample
# -noadj prevents normalization for this step (we'll handle normalization in R)
echo "Creating a count matrix across all samples..."
annotatePeaks.pl ~/ChIPseq_USF2/Homer/Merged_Peaks/USF2_Merged_Peaks.txt hg38 -d \
~/ChIPseq_USF2/Homer/Sample1/ \
~/ChIPseq_USF2/Homer/Sample2/ \
~/ChIPseq_USF2/Homer/Sample3/ \
~/ChIPseq_USF2/Homer/Sample4/ \
-noadj \
> ~/ChIPseq_USF2/Homer/Merged_Peaks/USF2_Merged_Peaks_annotated.txt

# Examine the first few lines of the output to verify
echo "Previewing the count matrix:"
head -n 3 ~/ChIPseq_USF2/Homer/Merged_Peaks/USF2_Merged_Peaks_annotated.txt

What This Creates: The output file contains one row per peak, with columns for genomic coordinates, nearby genes, and read counts from each sample. The last four columns will contain the counts for each peak in each sample, which we’ll use for differential analysis.

STEP 3: Performing Differential Binding Analysis in R

With our count matrix prepared, we can now perform differential binding analysis. We’ll use the limma package in R, which offers flexible statistical modeling and visualization capabilities.

3.1: Setting Up the R Environment

First, let’s prepare our R environment and load the necessary packages:

# STEP 3.1: Set up the R environment
# ---------------------------------------------

# Install Bioconductor if not already installed
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

# Install required packages if not already installed
required_packages <- c("data.table", "edgeR", "limma", "ggplot2", 
                      "ggrepel", "ggfortify", "sva")
for (pkg in required_packages) {
  if (!require(pkg, character.only = TRUE)) {
    BiocManager::install(pkg)
    library(pkg, character.only = TRUE)
  }
}

# Set working directory - adjust this to your project directory
setwd("~/ChIPseq_USF2/")

# Print confirmation message
cat("R environment successfully set up for differential binding analysis!\n")

3.2: Loading and Formatting the Peak Matrix

Now let’s load our count data and prepare it for analysis:

# STEP 3.2: Load and format the peak matrix
# ---------------------------------------------

# Load required packages
library(data.table)
library(edgeR)
library(limma)

# Load the annotated peak file
# Adjust the path to match your file location
peak_usf2 <- fread("Homer/Merged_Peaks/USF2_Merged_Peaks_annotated.txt", data.table = FALSE)

# Display the dimensions of the data
cat("Dimensions of the peak data:", dim(peak_usf2), "\n")

# Display the column names to understand the structure
cat("Column names in the peak data:\n")
print(colnames(peak_usf2))

# Extract peak IDs and set as row names
rownames(peak_usf2) <- peak_usf2[[1]]

# Create peak annotation table with peak IDs and associated genes
peak_annot <- data.frame(PeakID = peak_usf2[[1]], Genes = peak_usf2$`Gene Name`)
rownames(peak_annot) <- peak_usf2[[1]]

# Extract and format the count matrix
# Note: Column indices may need adjustment based on your actual data structure
# The count columns are typically at the end of the file
peak_usf2_mx <- peak_usf2[, 20:23]  # Assuming count columns are 20-23
colnames(peak_usf2_mx) <- c("Sample1", "Sample2", "Sample3", "Sample4")

# Create a sample metadata table
# Adjust based on your experimental design
meta_peak_usf2_mx <- data.frame(
  SampleID = colnames(peak_usf2_mx), 
  Treatment = c("Ctrl", "Ctrl", "KD", "KD"),  # Example: Control vs Knockdown
  stringsAsFactors = FALSE
)
rownames(meta_peak_usf2_mx) <- meta_peak_usf2_mx$SampleID

# Save the formatted data for future use
save(peak_usf2_mx, meta_peak_usf2_mx, peak_annot, file = "peak_data_formatted.RData")

cat("Peak data loaded and formatted successfully!\n")

What’s Happening Here: We’re loading the annotated peak file, extracting the count data for each peak across samples, and creating a metadata table that defines our experimental groups (Control vs. Knockdown in this example).

STEP 4: Creating a DGEList and Normalizing the Data

Next, we’ll filter low-abundance peaks and normalize the data:

# STEP 4: Filter peaks and normalize data
# ---------------------------------------------

# Load required packages
library(data.table)
library(edgeR)
library(limma)

# Load the formatted data if not already in memory
if (!exists("peak_usf2_mx")) {
  load("peak_data_formatted.RData")
}

# Filter peaks with consistently low counts
# Keep peaks that have non-zero counts in at least 80% of samples
perc_keep <- 0.8
peak_keep <- rowSums(peak_usf2_mx > 0) >= ceiling(perc_keep * ncol(peak_usf2_mx))
peak_usf2_mx_filtered <- peak_usf2_mx[peak_keep, ]

# Report the filtering results
cat("Original number of peaks:", nrow(peak_usf2_mx), "\n")
cat("Number of peaks after filtering:", nrow(peak_usf2_mx_filtered), "\n")

# Update peak annotation
peak_annot_filtered <- peak_annot[rownames(peak_usf2_mx_filtered), ]

# Create a DGEList object
dge_peak <- DGEList(
  counts = peak_usf2_mx_filtered, 
  samples = meta_peak_usf2_mx, 
  genes = peak_annot_filtered
)

# Normalize for library size using TMM method
dge_peak <- calcNormFactors(dge_peak, method = "TMM")

# Convert to log-CPM values using voom transformation
# This also accounts for mean-variance relationship
pdf("voom_mean_variance_plot.pdf")
dge_peak_v <- voom(dge_peak, plot = TRUE)
dev.off()

# Save the normalized data for differential analysis
saveRDS(dge_peak_v, "dge_peak_voom_normalized.rds")

cat("Data filtering and normalization complete!\n")

What This Does:

  • Filters out peaks with low counts across samples
  • Creates a DGEList object, the standard format for differential analysis in limma/edgeR
  • Normalizes for library size using the TMM (Trimmed Mean of M-values) method
  • Applies voom transformation to convert counts to log-CPM values and model the mean-variance relationship
  • Generates a diagnostic plot showing how variance changes with mean expression

STEP 5: Creating a PCA Plot to Visualize Sample Relationships

Before running differential analysis, it’s important to visualize how your samples relate to each other:

# STEP 5: Create a PCA plot to visualize sample relationships
# ---------------------------------------------

# Load required packages
library(ggplot2)
library(ggrepel)
library(ggfortify)

# Load the normalized data if not already in memory
if (!exists("dge_peak_v")) {
  dge_peak_v <- readRDS("dge_peak_voom_normalized.rds")
}

# Set visualization parameters
shape_column <- "Treatment"
color_column <- "Treatment"
label <- TRUE
label_size <- 4
plot_save_name <- "PCA_ChIPseq_Samples.pdf"

# Extract sample metadata and normalized expression values
meta_table <- dge_peak_v$targets
count_table_t <- as.data.frame(t(dge_peak_v$E))

# Perform PCA
pca_result <- prcomp(count_table_t, scale. = TRUE)

# Calculate variance explained by each PC
var_explained <- round(100 * pca_result$sdev^2 / sum(pca_result$sdev^2), 1)

# Create PCA plot
pca_plot <- autoplot(pca_result, label, shape = shape_column, data = meta_table, colour = color_column) +
  geom_text_repel(aes(label = rownames(meta_table)), size = label_size) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"),
        panel.grid.minor.y=element_blank(),
        panel.grid.major.y=element_blank())

# Display the plot
print(pca_plot)

# Save the plot
ggsave(plot_save_name, plot = pca_plot, device = "pdf", units = "cm", width = 20, height = 16)

cat("PCA plot created and saved as", plot_save_name, "\n")

Why This Matters:

  • PCA reduces the dimensionality of complex data and allows you to visualize relationships between samples
  • Samples should ideally cluster by experimental condition (Treatment vs. Control)
  • Outlier samples or unexpected clustering patterns may indicate technical issues or batch effects

STEP 6: Performing Differential Binding Analysis

Now we can perform the differential binding analysis:

# STEP 6: Perform differential binding analysis
# ---------------------------------------------

# Load required packages
library(limma)
library(data.table)

# Load the normalized data if not already in memory
if (!exists("dge_peak_v")) {
  dge_peak_v <- readRDS("dge_peak_voom_normalized.rds")
}

# Define the comparison
comparison <- "KD-Ctrl"  # Knockdown vs Control

# Create design matrix
# The "~ 0 +" formula creates a design with a coefficient for each group
design <- model.matrix(~ 0 + dge_peak_v$targets[["Treatment"]])
colnames(design) <- gsub(".*]]", "", colnames(design))

# Display the design matrix
cat("Design matrix:\n")
print(design)

# Create contrast matrix for our comparison of interest
contrast_matrix <- makeContrasts(contrasts = comparison, levels = design)

cat("Contrast matrix:\n")
print(contrast_matrix)

# Fit the linear model
fit <- lmFit(dge_peak_v, design)

# Apply the contrasts
fit_2 <- contrasts.fit(fit, contrast_matrix)

# Compute moderated t-statistics
fit_2 <- eBayes(fit_2)

# Get the results table
# n=Inf returns all peaks
# We're not filtering by p-value or log-fold-change at this step
depeak_tbl <- topTable(fit_2, coef = colnames(contrast_matrix), 
                      n = Inf, p.value = 1, lfc = 0, sort.by = "p")

# Add additional information
depeak_tbl$Gene <- dge_peak_v$genes$Genes[match(rownames(depeak_tbl), rownames(dge_peak_v$genes))]

# Add a significant column (for easy filtering)
depeak_tbl$significant <- ifelse(depeak_tbl$adj.P.Val < 0.05, "Yes", "No")

# Write the full results table
fwrite(depeak_tbl, "DE_Peak_Results_Full.tsv", sep = "\t", row.names = TRUE)

cat("Differential binding analysis complete!\n")

What’s Happening Here:

  • We create a design matrix that defines our experimental groups
  • We create a contrast matrix that specifies the comparison we want to make (KD vs. Control)
  • We fit a linear model to our data using limma
  • We calculate moderated t-statistics using eBayes, which improves statistical power
  • We extract the results and summarize how many peaks show differential binding

STEP 7: Creating a Volcano Plot for Visualization

Let’s create a volcano plot to better visualize our differential binding results:

# STEP 7: Create a volcano plot of differential binding results
# ---------------------------------------------

# Load required packages
library(ggplot2)
library(data.table)

# Load differential binding results if not already in memory
if (!exists("depeak_tbl")) {
  depeak_tbl <- fread("DE_Peak_Results_Full.tsv", data.table = FALSE)
  rownames(depeak_tbl) <- depeak_tbl$V1
  depeak_tbl$V1 <- NULL
}

# Set thresholds for highlighting
padj_cutoff <- 0.05
lfc_cutoff <- 1

# Add significance category
depeak_tbl$diffexpressed <- "Not Significant"
depeak_tbl$diffexpressed[depeak_tbl$logFC > lfc_cutoff & depeak_tbl$adj.P.Val < padj_cutoff] <- "Up-regulated"
depeak_tbl$diffexpressed[depeak_tbl$logFC < -lfc_cutoff & depeak_tbl$adj.P.Val < padj_cutoff] <- "Down-regulated"

# Add label column
depeak_tbl$label <- NA
top_genes_up <- rownames(depeak_tbl[depeak_tbl$diffexpressed == "Up-regulated", ])[1:10]
top_genes_down <- rownames(depeak_tbl[depeak_tbl$diffexpressed == "Down-regulated", ])[1:10]
depeak_tbl$label[rownames(depeak_tbl) %in% c(top_genes_up, top_genes_down)] <- depeak_tbl$Gene[rownames(depeak_tbl) %in% c(top_genes_up, top_genes_down)]

# Make sure ggrepel is loaded
if (!require("ggrepel")) {
  install.packages("ggrepel")
  library(ggrepel)
}

# Create volcano plot with ggrepel to prevent label overlap
volcano_plot <- ggplot(depeak_tbl, aes(x = logFC, y = -log10(adj.P.Val), color = diffexpressed)) +
  geom_point(size = 1, alpha = 0.7) +
  scale_color_manual(values = c("Not Significant" = "grey", 
                               "Up-regulated" = "red", 
                               "Down-regulated" = "blue")) +
  geom_text_repel(aes(label = label), size = 3, 
                  max.overlaps = 15,
                  box.padding = 0.5,
                  segment.color = "grey50",
                  min.segment.length = 0.1,
                  na.rm = TRUE) +
  geom_hline(yintercept = -log10(padj_cutoff), linetype = "dashed") +
  geom_vline(xintercept = c(-lfc_cutoff, lfc_cutoff), linetype = "dashed") +
  labs(title = "Differential Binding Analysis: Knockdown vs Control",
       subtitle = paste0("Red: Up-regulated (", sum(depeak_tbl$diffexpressed == "Up-regulated"), 
                       "), Blue: Down-regulated (", sum(depeak_tbl$diffexpressed == "Down-regulated"), ")"),
       x = "Log2 Fold Change",
       y = "-Log10 Adjusted P-value",
       color = "Regulation") +
  theme_bw() +
  theme(panel.grid.minor = element_blank(),
        plot.title = element_text(size = 14, face = "bold"),
        plot.subtitle = element_text(size = 12),
        legend.position = "right")

# Display the volcano plot
print(volcano_plot)

# Save the volcano plot
ggsave("Volcano_Plot_DiffBind.pdf", volcano_plot, width = 10, height = 8)

cat("Volcano plot created and saved as Volcano_Plot_DiffBind.pdf\n")

Why This Visualization Matters:

  • The volcano plot offers a global view of binding changes, highlighting the most significant peaks
  • Points in the upper right represent significantly increased binding in the knockdown condition
  • Points in the upper left represent significantly decreased binding in the knockdown condition
  • The plot helps you identify the most interesting binding changes for further investigation

STEP 8: Creating BED Files for Downstream Analyses

Now let’s create BED files for the significantly differential peaks for use in subsequent analyses:

# STEP 8: Create BED files for downstream analyses
# ---------------------------------------------

# Load required packages
library(data.table)

# Load the results if not already in memory
if (!exists("depeak_tbl")) {
  depeak_tbl <- fread("DE_Peak_Results_Full.tsv", data.table = FALSE)
  rownames(depeak_tbl) <- depeak_tbl$V1
  depeak_tbl$V1 <- NULL
}

# Load the original peak file to get coordinates
# Adjust path as necessary
peak_usf2 <- fread("Homer/Merged_Peaks/USF2_Merged_Peaks_annotated.txt", data.table = FALSE)
rownames(peak_usf2) <- peak_usf2[[1]]

# Extract up-regulated peaks (logFC > 1, FDR < 0.05)
up_peak_ids <- rownames(depeak_tbl[depeak_tbl$logFC > 1 & depeak_tbl$adj.P.Val < 0.05, ])
depeak_tbl_up <- peak_usf2[up_peak_ids, 2:4]  # Columns 2-4 contain chr, start, end

# Extract down-regulated peaks (logFC < -1, FDR < 0.05)
down_peak_ids <- rownames(depeak_tbl[depeak_tbl$logFC < -1 & depeak_tbl$adj.P.Val < 0.05, ])
depeak_tbl_dn <- peak_usf2[down_peak_ids, 2:4]  # Columns 2-4 contain chr, start, end

# Write BED files
fwrite(depeak_tbl_up, "diffbind_up_peaks.bed", sep = "\t", col.names = FALSE)
fwrite(depeak_tbl_dn, "diffbind_down_peaks.bed", sep = "\t", col.names = FALSE)

What’s Happening Here:

  • We’re creating BED format files containing the genomic coordinates of significantly up-regulated and down-regulated peaks
  • These files will be used for motif analysis and can also be loaded into genome browsers

STEP 9: Performing Functional Enrichment Analysis with GREAT

The Genomic Regions Enrichment of Annotations Tool (GREAT) analyzes the functional significance of cis-regulatory regions identified by genomic location (like ChIP-seq peaks):

# STEP 9: Perform GREAT analysis for differential peaks
# ---------------------------------------------

# Load/install required packages
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

if (!require("rGREAT", quietly = TRUE))
  BiocManager::install("rGREAT")

library(rGREAT)
library(data.table)

# Load the BED files for up and down regulated peaks if not already in memory
if (!exists("depeak_tbl_up")) {
  up_peaks <- fread("diffbind_up_peaks.bed", header = FALSE)
  down_peaks <- fread("diffbind_down_peaks.bed", header = FALSE)
} else {
  up_peaks <- depeak_tbl_up
  down_peaks <- depeak_tbl_dn
}

# Run GREAT analysis
# up-regulated peaks
great_job_up <- submitGreatJob(up_peaks)
great_enrich_up <- getEnrichmentTables(great_job_up, download_by = 'tsv')
great_enrich_up_df <- rbindlist(great_enrich_up)
great_enrich_up_df$PeakSet <- rep("Up", nrow(great_enrich_up_df))

# down-regulated peaks
great_job_dn <- submitGreatJob(down_peaks)
great_enrich_dn <- getEnrichmentTables(great_job_dn, download_by = 'tsv')
great_enrich_dn_df <- rbindlist(great_enrich_dn)
great_enrich_dn_df$PeakSet <- rep("Down", nrow(great_enrich_dn_df))

# Combine the results
great_enrich_de_peak_df <- rbind(great_enrich_up_df, great_enrich_dn_df)
fwrite(great_enrich_de_peak_df, "great_enrich_de_peak_df.tsv", sep = "\t")

cat("GREAT analysis complete! Results saved to great_enrich_de_peak_df.tsv\n")

What GREAT Analysis Provides:

  • Assesses the functional significance of regulatory regions by comparing them to known genomic annotations
  • Accounts for the uneven distribution of genes across the genome
  • Connects binding sites to potential biological processes, molecular functions, and developmental pathways
  • The output includes enriched terms, p-values, and gene associations

STEP 10: Performing Motif Analysis on Differentially Bound Regions

Now that we’ve identified differentially bound regions, we can analyze the DNA sequence motifs within these regions to understand the specifics of binding. For this step, we’ll need to return to the command line to use HOMER’s motif discovery tools:

# STEP 10: Motif analysis for differentially bound regions
# ---------------------------------------------

# Create directories for motif results
mkdir -p ~/ChIPseq_USF2/Motifs/Up_Peaks
mkdir -p ~/ChIPseq_USF2/Motifs/Down_Peaks

# Perform motif analysis for up-regulated peaks
echo "Performing motif analysis for up-regulated peaks..."
findMotifsGenome.pl ~/ChIPseq_USF2/diffbind_up_peaks.bed hg38 \
  ~/ChIPseq_USF2/Motifs/Up_Peaks/ \
  -size 200 \      # Search 200bp around peak center
  -len 8,10,12 \   # Look for motifs of lengths 8, 10, and 12 bp
  -S 10 \          # Number of motifs to find
  -mask \          # Mask repeat regions
  -p 4             # Use 4 CPU cores for parallel processing

# Perform motif analysis for down-regulated peaks
echo "Performing motif analysis for down-regulated peaks..."
findMotifsGenome.pl ~/ChIPseq_USF2/diffbind_down_peaks.bed hg38 \
  ~/ChIPseq_USF2/Motifs/Down_Peaks/ \
  -size 200 \
  -len 8,10,12 \
  -S 10 \
  -mask \
  -p 4

echo "Motif analysis complete!"

What Motif Analysis Reveals:

  • The specific DNA sequences recognized by the protein in different conditions
  • Potential differences in binding preferences between up and down-regulated sites
  • Enrichment of co-factor binding sites that might explain differential binding
  • The output includes both de novo discovered motifs and matches to known motif databases

STEP 11: Comparing Motifs Between Up and Down Regulated Regions

To directly compare the motif enrichment between up and down-regulated regions:

# STEP 11: Compare motifs between up and down regulated regions
# ---------------------------------------------

# Create a directory for the comparative analysis
mkdir -p ~/ChIPseq_USF2/Motifs/Comparison

# Use HOMER's compareMotifs.pl to compare motifs from up and down regions
echo "Comparing motifs between up and down regulated regions..."
compareMotifs.pl \
  ~/ChIPseq_USF2/Motifs/Up_Peaks/homerMotifs.all.motifs \
  ~/ChIPseq_USF2/Motifs/Down_Peaks/homerMotifs.all.motifs \
  -outdir ~/ChIPseq_USF2/Motifs/Comparison \
  -reduceThresh .75 \
  -matchThresh .6 \
  -pvalue 1e-10 \
  -info 1.5

echo "Motif comparison complete!"

What This Comparison Shows:

  • Similarities and differences in motif enrichment between regions with increased vs. decreased binding
  • Potential insight into why binding changes occur in different genomic contexts
  • Motifs that are uniquely enriched in one condition but not the other

STEP 12: Performing Differential Binding Analysis with HOMER’s Built-in Functions

While R offers flexible and powerful analysis options, HOMER also provides built-in functions for differential binding analysis. Here’s how to use them:

# STEP 12: Alternative approach - Differential binding with HOMER
# ---------------------------------------------

# Create a directory for HOMER differential binding results
mkdir -p ~/ChIPseq_USF2/Homer_DiffBind

# 11.1: Differential analysis without replicates
echo "Performing differential binding analysis (non-replicate) using HOMER..."
getDifferentialPeaks.pl \
  ~/ChIPseq_USF2/Homer/Merged_Peaks/USF2_Merged_Peaks.txt \
  ~/ChIPseq_USF2/Homer/Sample3/ \        # Treatment sample
  ~/ChIPseq_USF2/Homer/Sample1/ \        # Control sample
  -F 2 \                                  # Fold change cutoff
  -P 0.01 \                               # P-value cutoff
  > ~/ChIPseq_USF2/Homer_DiffBind/diff_peaks_USF2_single_KD-Ctrl.txt

# 11.2: Differential analysis with replicates
echo "Performing differential binding analysis (with replicates) using HOMER..."
getDifferentialPeaksReplicates.pl \
  -p ~/ChIPseq_USF2/Homer/Merged_Peaks/USF2_Merged_Peaks.txt \
  -t ~/ChIPseq_USF2/Homer/Sample3/ ~/ChIPseq_USF2/Homer/Sample4/ \  # Treatment replicates
  -b ~/ChIPseq_USF2/Homer/Sample1/ ~/ChIPseq_USF2/Homer/Sample2/ \  # Control replicates
  -genome hg38 \
  > ~/ChIPseq_USF2/Homer_DiffBind/diff_peaks_USF2_rep_KD-Ctrl.txt

echo "HOMER differential binding analysis complete!"

Pros and Cons of HOMER’s Built-in Functions:

Pros:

  • Simpler workflow with fewer steps
  • Integrated with HOMER’s peak calling and annotation
  • Less coding required

Cons:

  • Less flexible statistical modeling
  • Fewer visualization options
  • Limited ability to handle complex experimental designs

The R-based approach is generally recommended for more complex analyses, while HOMER’s built-in functions provide a quick alternative for simpler comparisons.

STEP 13: Integrating Differential Binding with Gene Expression Data

A powerful approach to understanding the functional impact of differential binding is to integrate it with gene expression data. If you have matching RNA-seq data, here’s how to correlate binding changes with expression changes:

# STEP 13: Integrate differential binding with gene expression data
# ---------------------------------------------

# Load required packages
library(data.table)
library(ggplot2)

# Load differential binding results if not already in memory
if (!exists("depeak_tbl")) {
  depeak_tbl <- fread("DE_Peak_Results_Full.tsv", data.table = FALSE)
  rownames(depeak_tbl) <- depeak_tbl$V1
  depeak_tbl$V1 <- NULL
}

# Load differential expression results (assuming you have matching RNA-seq data)
# Change the path to match your actual RNA-seq results
rnaseq_path <- "~/RNAseq_USF2/DE_Gene_Results.tsv"

if (file.exists(rnaseq_path)) {
  degene_tbl <- fread(rnaseq_path, data.table = FALSE)

  # Create a mapping from peaks to genes
  peak_to_gene <- data.frame(
    peak = rownames(depeak_tbl),
    gene = depeak_tbl$Gene,
    peak_logFC = depeak_tbl$logFC,
    peak_pval = depeak_tbl$adj.P.Val
  )

  # Remove peaks without gene annotation
  peak_to_gene <- peak_to_gene[!is.na(peak_to_gene$gene) & peak_to_gene$gene != "", ]

  # For genes with multiple peaks, select the most significant one
  peak_to_gene <- peak_to_gene[order(peak_to_gene$peak_pval), ]
  peak_to_gene <- peak_to_gene[!duplicated(peak_to_gene$gene), ]

  # Match with gene expression data
  integrated_data <- merge(
    peak_to_gene,
    degene_tbl,
    by.x = "gene",
    by.y = "gene_name",
    suffixes = c("_peak", "_gene")
  )

  # Calculate correlation
  correlation <- cor.test(integrated_data$peak_logFC, integrated_data$log2FoldChange)

  # Create scatter plot
  p <- ggplot(integrated_data, aes(x = peak_logFC, y = log2FoldChange)) +
    geom_point(aes(color = ifelse(peak_pval < 0.05 & padj < 0.05, "Both Significant", 
                                 ifelse(peak_pval < 0.05, "Peak Only", 
                                       ifelse(padj < 0.05, "Gene Only", "Not Significant")))),
               alpha = 0.7) +
    geom_smooth(method = "lm", se = TRUE, color = "black") +
    labs(title = "Correlation between Differential Binding and Gene Expression",
         subtitle = paste("Pearson r =", round(correlation$estimate, 3), 
                         ", p-value =", format(correlation$p.value, scientific = TRUE)),
         x = "ChIP-seq Log2 Fold Change",
         y = "RNA-seq Log2 Fold Change",
         color = "Significance") +
    theme_bw() +
    scale_color_manual(values = c("Both Significant" = "red", 
                                  "Peak Only" = "blue", 
                                  "Gene Only" = "green",
                                  "Not Significant" = "grey"))

  # Display the plot
  print(p)

  # Save the plot
  ggsave("Correlation_ChIPseq_RNAseq.pdf", p, width = 10, height = 8)

  # Write the integrated data
  fwrite(integrated_data, "Integrated_ChIPseq_RNAseq.tsv", sep = "\t", row.names = FALSE)

Why Integration is Valuable:

  • Reveals which binding changes lead to expression changes
  • Helps identify direct vs. indirect regulation
  • Provides functional context for binding changes
  • Strengthens the biological interpretation of your findings

Conclusion and Best Practices

Congratulations! You’ve successfully performed differential binding analysis and advanced motif discovery on your ChIP-seq data. These analyses have transformed your data from simple binding locations to insights about how protein-DNA interactions change between conditions and the specific sequence preferences that drive these changes.

Summary of Key Findings

From this analysis, you should now have:

  1. Differentially bound regions between your conditions, with statistical significance
  2. Functional annotations of these regions via HOMER analysis
  3. Enriched DNA motifs in up- and down-regulated regions
  4. Visualization files for exploring binding changes at specific loci
  5. Integration with gene expression (if available) to link binding to functional outcomes

Best Practices for Differential ChIP-seq Analysis

To ensure high-quality results from your differential binding analysis, keep these best practices in mind:

Experimental Design and Replication

  • Include biological replicates: At least 2-3 replicates per condition are recommended
  • Control technical factors: Process all samples together to minimize batch effects
  • Validate key findings: Consider independent experimental validation of important differential binding sites

Data Processing

  • Use consistent parameters: Apply the same quality control and processing steps to all samples
  • Normalize appropriately: Account for differences in sequencing depth and IP efficiency
  • Filter low-quality peaks: Remove peaks with low counts or poor reproducibility

Statistical Analysis

  • Choose appropriate statistical models: Consider the nature of your data and experimental design
  • Correct for multiple testing: Use adjusted p-values to control false discovery rate
  • Set reasonable thresholds: Balance stringency with biological meaningfulness

Interpretation and Integration

  • Connect to genes: Annotate peaks with nearby genes and regulatory elements
  • Integrate with expression data: Correlate binding changes with expression changes
  • Consider biological context: Interpret results in light of the cellular processes and pathways relevant to your system

Common Pitfalls and How to Avoid Them

Interpreting Causality

Pitfall: Assuming that all binding changes directly cause expression changes.

Solution:

  • Remember that correlation doesn’t imply causation
  • Look for motif evidence to support direct regulation
  • Consider the timing of binding vs. expression changes
  • Use perturbation experiments to test causal relationships

Overgeneralizing from One Factor

Pitfall: Attributing all regulatory changes to the factor you’re studying.

Solution:

  • Consider co-factors and the broader regulatory context
  • Look for cooperative binding patterns in motif analysis
  • Be cautious about claiming master regulator status for your factor

Neglecting Biological Context

Pitfall: Focusing solely on statistical significance without biological interpretation.

Solution:

  • Connect differential binding to relevant biological pathways
  • Consider cell type-specific regulatory features
  • Relate your findings to the broader literature in your field

Troubleshooting Common Issues

Few or No Differential Peaks

Problem: Your analysis identifies very few differential peaks despite clear experimental differences.

Solutions:

  • Try less stringent statistical thresholds (e.g., FDR < 0.1 instead of 0.05)
  • Check for batch effects that might be masking true differences
  • Consider different normalization methods
  • Examine global binding patterns with PCA to confirm sample separation

Poor Correlation with Expression

Problem: Binding changes don’t correlate well with expression changes.

Solutions:

  • Remember that binding doesn’t always lead to immediate expression changes
  • Consider indirect effects and regulatory networks
  • Look at different time points if available
  • Focus on subsets of genes with more direct regulatory relationships

Conflicting Motif Results

Problem: Motif analysis gives different results between tools or parameter settings.

Solutions:

  • Try multiple motif discovery tools (HOMER, MEME, etc.)
  • Vary parameters (motif length, background models)
  • Focus on the most consistently identified motifs
  • Consider biological knowledge about known binding preferences

Further Resources

For those interested in deepening their understanding of differential ChIP-seq analysis:

References

  • Bailey T, et al. (2013). Practical guidelines for the comprehensive analysis of ChIP-seq data. PLoS Computational Biology, 9(11), e1003326.
  • Ross-Innes CS, et al. (2012). Differential oestrogen receptor binding is associated with clinical outcome in breast cancer. Nature, 481(7381), 389-393.
  • Stark R, Brown G (2011). DiffBind: differential binding analysis of ChIP-seq peak data.
  • McLean CY, et al. (2010). GREAT improves functional interpretation of cis-regulatory regions. Nature Biotechnology, 28(5), 495-501.
  • Heinz S, et al. (2010). Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Molecular Cell, 38(4), 576-589.

This tutorial is part of the NGS101.com beginner’s guide to next-generation sequencing analysis. If you have questions or suggestions, please leave a comment below.

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *