How To Analyze ATAC-seq Data For Absolute Beginners Part 2: Differential Binding Analysis Using DiffBind

Introduction: The Power of Comparative ATAC-seq Analysis

ATAC-seq (Assay for Transposase-Accessible Chromatin with high-throughput sequencing) has revolutionized chromatin accessibility mapping by requiring minimal input material and offering streamlined workflows. While identifying accessible regions in a single condition provides valuable insights, the most compelling biological discoveries often come from comparing accessibility between different experimental conditions – such as disease versus healthy states, drug-treated versus control samples, or different cell types and developmental stages.

This tutorial builds upon our previous ATAC-seq data analysis guide and focuses on Differential Binding Analysis (DBA) or Differential Accessibility Analysis (DAA) – the computational approach that identifies statistically significant differences in chromatin accessibility between conditions.

Beginner’s Tip: Think of differential analysis as finding the “doors” to the genome that open or close in response to a specific treatment or disease state. These changing accessibility patterns often reveal the key regulatory mechanisms driving biological differences between conditions.

What is Differential Binding Analysis for ATAC-seq?

Differential Binding Analysis identifies genomic regions showing statistically significant differences in signal intensity (read counts) between experimental conditions. For ATAC-seq specifically, these differences represent changes in chromatin accessibility that may indicate altered gene regulation.

The process typically involves five key steps:

Peak Identification: Detecting accessible chromatin regions (peaks) across all samples
Read Counting: Quantifying the number of sequencing reads within each peak for each sample
Normalization: Adjusting read counts to account for technical biases and differences in sequencing depth
Statistical Testing: Applying statistical models to identify significant differences in accessibility
Interpretation: Connecting differential accessibility to biological function through annotation and downstream analyses

Why Choose DiffBind for ATAC-seq Analysis?

DiffBind is a specialized Bioconductor package initially developed for ChIP-seq differential analysis but has become equally valuable for ATAC-seq data. As a purpose-built tool for chromatin-based experiments, it offers several key advantages:

Key Features of DiffBind

Unified Workflow: Integrates multiple analytical steps (consensus peak creation, counting, normalization, statistical testing) in a coherent framework
Statistical Flexibility: Supports both DESeq2 and edgeR as underlying statistical engines
Visualization Suite: Provides publication-quality plots specifically designed for chromatin accessibility data
Occupancy Analysis: Enables examination of peak overlaps between samples and conditions
Comprehensive Documentation: Offers detailed vignettes and examples specifically for ATAC-seq analysis

DiffBind vs. Direct Use of DESeq2/edgeR/limma

In my ChIP-seq tutorial on differential analysis, I demonstrated direct use of limma for differential analysis. While that approach remains valid, DiffBind offers significant advantages:

Advantages of DiffBind:

Specialized for Chromatin Data: Unlike general-purpose RNA-seq tools, DiffBind is designed specifically for peak-based analyses with features tailored to chromatin experiments.
Integrated Pipeline: DiffBind streamlines the entire workflow that would otherwise require custom scripting, handling everything from consensus peak creation to final visualization.
Statistical Flexibility: Users can easily switch between DESeq2 and edgeR statistical frameworks without changing their overall workflow.
Built-in Visualization: The package includes specialized plotting functions for exploring chromatin data, including PCA plots, heatmaps, and MA plots optimized for peak-based data.
Control Sample Integration: DiffBind properly incorporates control samples (input DNA) into the analysis, which can be complicated to implement manually.

Limitations to Consider:

Black-Box Nature: Some preprocessing steps are abstracted away, potentially limiting transparency for users who want full control over every analytical decision.
Peak-Based Approach: DiffBind relies on discrete pre-defined peaks, which may miss subtle changes in broader regions not called as peaks in individual samples.
Memory Requirements: Processing large datasets with DiffBind can be memory-intensive, particularly with many samples or broad peaks.
Learning Curve: DiffBind introduces its own terminology and data structures that users must understand, in addition to the underlying statistical concepts.

DESeq2 vs. edgeR Within DiffBind: Making the Right Choice

DiffBind allows you to use either DESeq2 or edgeR as the statistical engine for your differential analysis. While both are valid options, they have different characteristics:

DESeq2 generally provides more conservative estimates with better control of false positives, making it the preferred choice for most standard analyses. It’s the default method in recent DiffBind versions.
edgeR may identify more differential sites in some datasets but potentially introduces more false positives. It’s often preferred when detection power is prioritized over stringency.
Normalization approaches differ between the packages: DESeq2 uses a median of ratios method (also called RLE – Relative Log Expression), while edgeR uses TMM (Trimmed Mean of M-values). Both aim to handle compositional biases but may perform differently depending on dataset characteristics.

Best Practice: For high-confidence results, consider running both methods and focusing on the intersection of significant peaks identified by both approaches. This conservative strategy minimizes method-specific biases.

Special Considerations for ATAC-seq Differential Analysis

When analyzing ATAC-seq data with DiffBind, several factors deserve special attention:

Global Accessibility Changes: Unlike many ChIP-seq experiments, ATAC-seq studies may show global shifts in accessibility between conditions (e.g., globally increased accessibility in cancer cells). Standard normalization methods that assume most regions remain unchanged may produce misleading results in such cases.
Peak Calling Strategy: The choice of how to define consensus peaks (using all samples or condition-specific sets) can significantly impact differential results, especially when conditions show dramatic differences in accessibility profiles.
Library Size Definition: Whether to normalize based on total sequencing depth or only reads in peaks can affect the analysis outcome, particularly when conditions show large differences in overall accessibility.
Fragment Size Distribution: ATAC-seq produces a characteristic fragment size distribution reflecting nucleosome positioning. Substantial differences in this distribution between samples may indicate technical issues requiring attention before differential analysis.
Blacklisted Regions: Filtering out known problematic genomic regions is especially important for ATAC-seq, as the open nature of the assay can capture various artifacts.

Now, let’s dive into a complete DiffBind workflow for analyzing ATAC-seq data, highlighting best practices and potential pitfalls at each step.

Setting Up Your Analysis Environment

Required Software and Packages

Before beginning the analysis, you’ll need to install several R packages through Bioconductor. These tools provide the foundation for our differential analysis workflow:

# Install BiocManager if not already installed
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install required packages
BiocManager::install("DiffBind")         # Main package for differential analysis
BiocManager::install("ChIPseeker")       # For genomic annotation of peaks
BiocManager::install("TxDb.Mmusculus.UCSC.mm10.knownGene")  # Mouse genome annotations
BiocManager::install("org.Mm.eg.db")     # Mouse gene annotations

# Install additional utility package
install.packages("data.table")           # For efficient data handling

# Load all required libraries
library(data.table)
library(DiffBind)
library(ChIPseeker)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(org.Mm.eg.db)

Note: If you’re working with human data rather than mouse, replace the mouse-specific annotation packages with their human equivalents: TxDb.Hsapiens.UCSC.hg38.knownGene and org.Hs.eg.db.

Preparing Your Sample Sheet

DiffBind requires a sample sheet that specifies the location of your peak files and BAM files, along with metadata about your experimental design. This structured approach ensures all samples are properly organized before analysis begins.

Here we’ll prepare a sample sheet for a simple experimental design comparing tumor samples to normal samples:

# Define file paths on your computer
# These should point to your processed ATAC-seq data
peak_files <- c(
  "~/ATAC_Dataset/MACS2/Tumor1/Tumor1_peaks.narrowPeak", 
  "~/ATAC_Dataset/MACS2/Tumor2/Tumor2_peaks.narrowPeak", 
  "~/ATAC_Dataset/MACS2/Normal1/Normal1_peaks.narrowPeak", 
  "~/ATAC_Dataset/MACS2/Normal2/Normal2_peaks.narrowPeak"
)

bam_files <- c(
  "~/ATAC_Dataset/Alignment/Tumor1/Tumor1_final.bam", 
  "~/ATAC_Dataset/Alignment/Tumor2/Tumor2_final.bam", 
  "~/ATAC_Dataset/Alignment/Normal1/Normal1_final.bam", 
  "~/ATAC_Dataset/Alignment/Normal2/Normal2_final.bam"
)

# Define sample metadata
sample_names <- c("Tumor1", "Tumor2", "Normal1", "Normal2")
treatment <- c("Tumor", "Tumor", "Normal", "Normal")

# Create the sample sheet data frame
sample_sheet <- data.frame(
  SampleID = sample_names,       # Unique identifier for each sample
  Condition = treatment,         # Experimental condition/group
  bamReads = bam_files,          # Path to aligned reads (BAM)
  Peaks = peak_files,            # Path to called peaks
  PeakCaller = "narrow"          # Type of peak files (narrow or broad)
)

# Optional: Save the sample sheet for future reference
write.csv(sample_sheet, "atac_seq_sample_sheet.csv", row.names = FALSE)

Best Practice: Always include at least three biological replicates per condition for robust statistical analysis. Two replicates is the absolute minimum, and single replicates cannot be used for statistical testing. The example above uses two replicates per condition for simplicity, but for real-world studies, aim for three or more.

Creating and Processing the DiffBind Object

Loading Data and Creating a DBA Object

The first step is to create a DBA (DiffBind Analysis) object that will contain all your sample information and analysis results:

# Create the DBA object from the sample sheet
atac_db <- dba(sampleSheet = sample_sheet)

# Examine the DBA object
dba.show(atac_db)

The dba.show() function displays a summary of your DBA object, including the number of samples, their conditions, and the number of peaks in each sample. This initial check helps verify that your data was loaded correctly.

Counting Reads in Peaks

Next, we need to count the number of reads that fall within peaks for each sample. This step also creates a consensus peakset across all samples:

# Count reads overlapping peaks
atac_db <- dba.count(atac_db,
                     summits = 250,       # Center peaks on summits, extend +/- 250bp
                     minOverlap = 2,      # Require at least 2 samples to share a peak
                     bUseSummarizeOverlaps = TRUE)  # Use GenomicRanges::summarizeOverlaps()

# View updated DBA object with count information
dba.show(atac_db)

# Extract the raw count matrix (optional)
counts <- dba.peakset(atac_db, bRetrieve = TRUE)
head(counts)

Parameter Explanation:

summits = 250: This centers each peak on its summit (highest point of signal) and extends 250bp in each direction. This standardization improves comparability between samples.
minOverlap = 2: A peak must be present in at least 2 samples to be included in the consensus peakset. This filters out sample-specific noise.
bUseSummarizeOverlaps = TRUE: Uses the summarizeOverlaps function from GenomicRanges for counting, which provides greater accuracy but requires more memory.

Troubleshooting Tip: If you encounter memory issues with large datasets, set bUseSummarizeOverlaps = FALSE to use a more memory-efficient counting method, or increase your R memory limit with memory.limit(size=XX) where XX is in GB.

Exploratory Data Analysis

Before proceeding to differential analysis, it’s valuable to explore your data to check for quality issues or batch effects:

# Generate correlation heatmap
dba.plotHeatmap(atac_db)

# Create PCA plot to visualize sample relationships
dba.plotPCA(atac_db, DBA_CONDITION, label=DBA_ID)

These plots provide crucial quality control information:

Correlation Heatmap: Shows the similarity between samples. Replicates should cluster together, and you should see separation between conditions.
PCA Plot: Visualizes the major sources of variation in your data. Ideally, PC1 should separate your experimental conditions, while replicates cluster together.

If you observe unexpected patterns (e.g., replicates not clustering together or no separation between conditions), it may indicate technical issues that should be addressed before proceeding.

Performing Differential Binding Analysis

Defining Contrasts

Now we’ll define which conditions to compare in our differential analysis:

# Define the contrast by condition
atac_db <- dba.contrast(atac_db, 
                        categories = DBA_CONDITION,  # Use the Condition column from sample sheet
                        minMembers = 2)              # Minimum samples per group

# Display the contrast information
dba.show(atac_db, bContrasts = TRUE)

For more complex experimental designs with multiple factors, you can define several contrasts to examine different comparisons of interest.

Running the Differential Analysis

With our contrasts defined, we can now perform the statistical analysis to identify differentially accessible regions:

# Run differential analysis (default: DESeq2)
atac_db <- dba.analyze(atac_db)

# To use edgeR instead, uncomment this line:
# atac_db <- dba.analyze(atac_db, method=DBA_EDGER)

# Display analysis results summary
dba.show(atac_db, bContrasts = TRUE)

The summary will show the number of significantly differential sites at different thresholds. By default, DiffBind uses DESeq2 for the statistical analysis, but you can switch to edgeR by specifying method=DBA_EDGER.

Retrieving Differential Peaks

Now let’s extract the significant differential peaks for further analysis:

# Default: get significant results (FDR < 0.05)
diff_peaks <- dba.report(atac_db, th = 0.05)

# For more stringent filtering (FDR < 0.01)
# diff_peaks <- dba.report(atac_db, th = 0.01)

# To include log fold change cutoff (e.g., |log2FC| > 1)
# diff_peaks <- dba.report(atac_db, th = 0.05, fold = 1)

# View the number of significant peaks
length(diff_peaks)

# View the first few peaks
head(diff_peaks)

# Convert to data frame and write to CSV
diff_df <- as.data.frame(diff_peaks)
fwrite(diff_df, "differential_peaks.csv")

The resulting diff_peaks object is a GRanges object containing the differentially accessible regions, along with statistics such as p-values, FDR values, and fold changes.

Interpretation Tip: Positive fold changes indicate increased accessibility in the first condition (numerator) compared to the second condition (denominator). The contrast direction is important to keep in mind when interpreting your results.

Annotating and Visualizing Results

Genomic Annotation of Differential Peaks

Connecting differential peaks to genomic features provides biological context for your results:

# Annotate the differential peaks with nearby genes
# Use TxDb.Hsapiens.UCSC.hg38.knownGene and org.Hs.eg.db for human samples
diff_peaks_anno <- annotatePeak(diff_peaks,
                          TxDb = TxDb.Mmusculus.UCSC.mm10.knownGene,  # Mouse genome
                          annoDb = "org.Mm.eg.db",                     # Mouse gene annotations
                          tssRegion = c(-3000, 3000),                  # Define promoter region
                          verbose = FALSE)

# View annotation summary
diff_peaks_anno

# View detailed annotation table
annotated_df <- as.data.frame(diff_peaks_anno)
head(annotated_df)

# Save annotated results
fwrite(annotated_df, "differential_peaks_annotated.csv")

This annotation process connects each differential peak to nearby genes and classifies them according to genomic features (promoters, introns, exons, etc.).

Visualization of Results

Let’s create publication-quality visualizations to explore and present our findings:

# Create output directory for plots
dir.create("plots", showWarnings = FALSE)

# 1. Sample correlation heatmap
png("plots/sample_correlation_heatmap.png", width = 800, height = 800, res = 120)
dba.plotHeatmap(atac_db)
dev.off()

# 2. PCA plot of samples
png("plots/sample_pca.png", width = 800, height = 800, res = 120)
dba.plotPCA(atac_db, DBA_CONDITION, label=DBA_ID)
dev.off()

# 3. MA plot of differential analysis results
png("plots/ma_plot.png", width = 800, height = 800, res = 120)
dba.plotMA(atac_db)
dev.off()

# 4. Volcano plot of differential results
png("plots/volcano_plot.png", width = 800, height = 800, res = 120)
dba.plotVolcano(atac_db)
dev.off()

# 5. Heatmap of differential peaks across samples
png("plots/diff_peaks_heatmap.png", width = 800, height = 1000, res = 120)
dba.plotHeatmap(atac_db, contrast=1, correlations=FALSE)
dev.off()

# 6. Visualize the annotation distribution as pie chart
png("plots/annotation_pie.png", width = 800, height = 800, res = 120)
plotAnnoPie(diff_peaks_anno)
dev.off()

# 7. Visualize the distance of peaks to nearest TSS
png("plots/dist_to_tss.png", width = 800, height = 600, res = 120)
plotDistToTSS(diff_peaks_anno)
dev.off()

These visualizations provide multiple perspectives on your data:

MA Plot: Shows the relationship between mean signal intensity (M) and log fold change (A) for each peak, highlighting significant changes
Volcano Plot: Displays statistical significance versus fold change, ideal for identifying the most biologically relevant changes
Differential Peak Heatmap: Shows the accessibility pattern of significant peaks across all samples
Annotation Pie Chart: Reveals the distribution of differential peaks across genomic features
TSS Distance Plot: Shows how differential peaks are distributed relative to transcription start sites

Advanced Analysis and Interpretation

Gene Ontology Enrichment Analysis

To gain functional insights from differential accessibility regions, Gene Ontology (GO) enrichment analysis offers a powerful approach for identifying biological pathways, processes, and functions associated with your results. There are two primary strategies for conducting this analysis:

Gene-centric approach: Analyze genes proximal to differential peaks, which we’ll demonstrate below
Region-centric approach: Directly test genomic regions for functional enrichment, as covered in my previous ChIP-seq differential binding tutorial

The gene-centric method connects differential peaks to nearby genes and tests whether these genes are enriched in specific functional categories. This approach leverages extensive gene annotation resources while making assumptions about regulatory relationships based on genomic proximity.

# Extract genes associated with differential peaks 
# (focusing on those within 5kb of a TSS)
diff_genes <- unique(annotated_df$geneId[annotated_df$distanceToTSS <= 5000])

# If the number of genes is large, you may want to focus on the top differentially accessible regions
if(length(diff_genes) > 500) {
  # Focus on genes near the most significant peaks
  top_df <- annotated_df[order(annotated_df$FDR),]
  diff_genes <- unique(top_df$geneId[top_df$distanceToTSS <= 5000][1:500])
}

# Convert Entrez IDs to gene symbols
gene_symbols <- mapIds(org.Mm.eg.db, 
                      keys = diff_genes, 
                      column = "SYMBOL", 
                      keytype = "ENTREZID", 
                      multiVals = "first")

# Write gene list to file
writeLines(as.character(na.omit(gene_symbols)), "differential_peak_associated_genes.txt")

# Note: You can use this gene list with online tools like DAVID, Enrichr, or g:Profiler
# Or continue with R packages like clusterProfiler for enrichment analysis

Next Steps: Take the gene list to online enrichment tools or continue with R packages like clusterProfiler to identify enriched biological pathways, transcription factor networks, or disease associations.

Visualizing Specific Regions in a Genome Browser

For comprehensive instructions on visualizing these tracks in genome browsers like UCSC, IGV, or WashU Epigenome Browser, including advanced customization options and comparative visualizations, please refer to my detailed ChIP-seq visualization tutorial. The visualization techniques described there apply directly to ATAC-seq data, allowing you to create informative displays of differential accessibility in genomic context.

Best Practice: When visualizing differential regions, include tracks for all experimental conditions to directly compare accessibility patterns, and consider adding relevant gene annotation and DNA motif tracks to connect accessibility changes with potential regulatory mechanisms.

Integration with Transcription Factor Motif Analysis

Differential accessibility often indicates changes in transcription factor binding. Identifying enriched DNA sequence motifs within differentially accessible regions can reveal the transcription factors likely driving the observed chromatin changes. Please refer to my detailed ChIP-seq motif detection tutorial.

Best Practice: Perform motif analysis separately on regions with increased versus decreased accessibility, as different transcription factors may be responsible for opening versus closing chromatin in your biological system.

Conclusion and Next Steps

Differential accessibility analysis with DiffBind provides a powerful framework for identifying regulatory regions that change between experimental conditions. By connecting these accessibility changes to genes and pathways, you can gain mechanistic insights into the biological processes involved in your system of interest.

The differential binding analysis using DiffBind extends beyond ATAC-seq, providing a unified analytical framework for all chromatin accessibility experiments. Importantly, the analytical approaches covered in this tutorial—differential binding analysis, genomic annotation—and those detailed in my previous tutorials—peak visualization, motif discovery, peak calling, irreproducible discovery rate (IDR) assessment—work seamlessly across ChIP-seq, ATAC-seq, Cut&Run, and CUT&Tag methodologies. This cross-compatibility enables researchers to apply consistent analytical principles across various epigenomic techniques while focusing on the biological questions rather than having to master entirely different computational workflows for each assay type.

The methodology outlined in this tutorial can be extended to various experimental designs and integrated with other genomic data types for more comprehensive analyses. For example:

Integration with RNA-seq: Connect accessibility changes to differential gene expression
Transcription Factor Analysis: Identify key regulators by motif enrichment in differential regions
Time Course Analysis: Study dynamic accessibility changes across developmental or treatment time points
Multi-Factor Designs: Examine interactions between multiple experimental variables

Remember that ATAC-seq differential analysis is one component of a broader regulatory genomics toolkit. The most compelling biological stories often emerge from integrating multiple data types to build a comprehensive picture of gene regulation mechanisms.

Troubleshooting Common Issues

Low Number of Differential Peaks

Problem: Your analysis identified very few or no differential peaks.

Possible Solutions:

Check Sample Correlation: If samples don’t cluster by condition in the PCA or heatmap, technical variation may be overwhelming biological signal
Adjust Statistical Thresholds: Try a less stringent FDR threshold (e.g., 0.1 instead of 0.05)
Examine Your Consensus Approach: If using minOverlap > 1, you might miss condition-specific peaks
Consider Alternative Normalization: Try different normalization methods if global changes are expected
Increase Replicate Number: More replicates provide better statistical power

Memory Issues

Problem: R crashes due to insufficient memory.

Solutions:

Use Efficient Parameters: Set bUseSummarizeOverlaps = FALSE in dba.count()
Increase R Memory: Run memory.limit(size=32) to increase available memory (Windows)
Use a Filtered Peakset: Pre-filter peaks to reduce the consensus peak set size
Process on High-Memory Server: Consider moving analysis to a computational server

Unexpected Sample Clustering

Problem: Samples don’t cluster by experimental condition in PCA or heatmap.