How To Analyze ATAC-seq Data For Absolute Beginners Part 4: ATAC-seq and RNA-seq Integration

Introduction: The Power of Multi-omics Integration

Modern genomics research has evolved beyond studying isolated molecular mechanisms. To truly understand complex biological processes and disease mechanisms, researchers need to examine multiple layers of genomic information simultaneously. This tutorial introduces beginners to the powerful approach of integrating ATAC-seq and RNA-seq data using R, enabling a more comprehensive view of gene regulation.

What is ATAC-seq and RNA-seq Integration?

ATAC-seq (Assay for Transposase-Accessible Chromatin with sequencing) provides information about which regions of the genome are accessible to transcription factors and other regulatory proteins. RNA-seq (RNA sequencing) measures gene expression levels, telling us which genes are actively being transcribed. Integrating these two data types allows us to connect chromatin accessibility with gene expression, creating a more complete picture of the gene regulatory landscape.

The integration process typically involves:

Analyzing each dataset independently (peak calling for ATAC-seq, differential expression analysis for RNA-seq)
Associating ATAC-seq peaks with their potential target genes
Correlating changes in chromatin accessibility with changes in gene expression
Visualizing and interpreting these relationships

Beginner’s Tip: Think of ATAC-seq and RNA-seq integration like studying both a city’s road systems (ATAC-seq) and traffic patterns (RNA-seq). The roads must be open for traffic to flow, just as chromatin must be accessible for genes to be expressed.

Why Integrate ATAC-seq and RNA-seq Data?

Single-omics approaches, while valuable, provide only a partial view of cellular processes:

ATAC-seq alone shows where chromatin is accessible but doesn’t tell us if the accessible regions are actually being used for transcription
RNA-seq alone reveals which genes are expressed but doesn’t explain the underlying regulatory mechanisms controlling that expression

By integrating these datasets, we can:

Identify functional regulatory elements: Not all accessible chromatin regions are actively involved in regulating transcription. Integration helps identify which accessible regions actually impact gene expression.
Understand transcriptional regulation mechanisms: When a gene’s expression changes, we can look for associated changes in chromatin accessibility to understand how that change might be regulated.
Discover candidate causative elements in disease: In studies comparing healthy and diseased states, integration can help pinpoint chromatin changes that may drive disease-associated gene expression alterations.
Reduce false positives: Integration provides multiple lines of evidence, increasing confidence in findings and reducing spurious associations.

The Workflow for ATAC-seq and RNA-seq Integration

Our workflow will follow these key steps:

Setting up the analysis environment: Installing necessary R packages
Data preparation: Loading and processing ATAC-seq peak files and RNA-seq differential expression results
Peak annotation: Associating ATAC-seq peaks with nearby genes
Data integration: Correlating chromatin accessibility with gene expression
Visualization: Creating informative plots to explore the integrated data
Interpretation: Drawing biological insights from the integrated analysis

This tutorial builds on my previous tutorials and assumes you have already performed the initial analyses:

ATAC-seq data processing and peak calling (using MACS2 or HOMER)
RNA-seq data processing and differential expression analysis (using limma, DESeq2 or similar tools)

We’ll focus specifically on the integration steps, using example datasets to illustrate the process.

Setting Up Your Analysis Environment

Before diving into the integration analysis, we need to set up our R environment with the necessary packages. We’ll use several Bioconductor packages specifically designed for genomic data analysis.

#-----------------------------------------------
# STEP 0: Install required R packages
#-----------------------------------------------

# If you haven't installed Bioconductor yet
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install required packages
BiocManager::install(c(
  "ChIPseeker",      # For annotating peaks
  "TxDb.Hsapiens.UCSC.hg38.knownGene",  # Human genome annotations
  "org.Hs.eg.db",    # Gene ID mapping for human
  "clusterProfiler", # For functional enrichment analysis
  "GenomicRanges",   # For genomic interval operations
  "rtracklayer",     # For importing/exporting genomic tracks
  "EnrichedHeatmap", # For specialized genomic heatmaps
  "ComplexHeatmap",  # For complex heatmap visualizations
  "DESeq2"           # For normalization
))

# Install CRAN packages
install.packages(c(
  "data.table",      # For fast and efficient data handling
  "ggplot2",         # For plotting
  "pheatmap",        # For heatmaps
  "RColorBrewer",    # For color palettes
  "ggrepel",         # For non-overlapping text labels
  "ggpubr"           # For combining plots
))

# Load the required packages
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(org.Hs.eg.db)
library(clusterProfiler)
library(ggplot2)
library(pheatmap)
library(GenomicRanges)
library(rtracklayer)
library(EnrichedHeatmap)
library(ComplexHeatmap)
library(RColorBrewer)
library(data.table)
library(ggrepel)  # For non-overlapping text labels

# Set up the TxDb object for annotations
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

Package Explanation:

ChIPseeker – Originally designed for ChIP-seq but works excellently for ATAC-seq peak annotation

TxDb.Hsapiens.UCSC.hg38.knownGene – Contains gene models for the human genome (hg38)

GenomicRanges – Provides data structures for representing genomic intervals

EnrichedHeatmap – Creates specialized heatmaps for genomic data visualization

data.table – Provides fast data manipulation capabilities, essential for large genomic datasets

Data Preparation

For this tutorial, we’ll work with example datasets that mimic real ATAC-seq and RNA-seq data. In your actual analysis, you would import your own processed data files.

#-----------------------------------------------
# STEP 1: Load and prepare data for integration
#-----------------------------------------------

#=============================================
# 1.1: Load RNA-seq differential expression results
#=============================================
rna_data <- fread("RNA_seq_differential_expression.tsv")

# Filter for significant genes (|log2FC| > 1 and padj < 0.05)
sig_genes <- rna_data[abs(log2FoldChange) > 1 & padj < 0.05]

#=============================================
# 1.2: Load ATAC-seq peaks and convert to GRanges object
#=============================================
# Load differential accessibility information
atac_diff <- fread("ATAC_seq_differential_peaks.tsv")

# Filter for significant differentially accessible regions
sig_atac <- atac_diff[abs(log2FoldChange) > 1 & padj < 0.05]

# Create a GRanges object with differential accessibility information
sig_atac_gr <- makeGRangesFromDataFrame(sig_atac, 
                                      keep.extra.columns = TRUE,
                                      seqnames.field = "chr",
                                      start.field = "start",
                                      end.field = "end")

Practical Tips:

When loading your actual data, ensure column names match your file format

Setting appropriate thresholds for significance filtering depends on your experimental design and biological question

For human data, check that chromosome naming is consistent (e.g., “chr1” vs “1”)

Peak Annotation

Next, we’ll annotate the ATAC-seq peaks to associate them with nearby genes, which is a crucial step for integration.

#-----------------------------------------------
# STEP 2: Annotate ATAC-seq peaks with genomic features
#-----------------------------------------------

#=============================================
# 2.1: Annotate specifically the differential peaks
#=============================================
# Annotate differentially accessible peaks
diff_peak_anno <- annotatePeak(sig_atac_gr, 
                              TxDb = txdb,
                              tssRegion = c(-3000, 3000),
                              annoDb = "org.Hs.eg.db")

# Get detailed annotation for differentially accessible peaks as data.table
diff_peak_anno_df <- as.data.table(as.data.frame(diff_peak_anno))

# Add gene symbols using data.table syntax
diff_peak_anno_df[, gene_symbol := mapIds(org.Hs.eg.db, 
                                         keys = geneId, 
                                         column = "SYMBOL", 
                                         keytype = "ENTREZID",
                                         multiVals = "first")]

#=============================================
# 2.2: Analyze peak distribution across genomic features
#=============================================
# Plot genomic feature distribution
pdf("ATAC_diffpeak_genomic_distribution.pdf", width = 10, height = 8)
plotAnnoPie(diff_peak_anno)
dev.off()

# Plot distribution of peaks relative to TSS
pdf("ATAC_diffpeak_TSS_distribution.pdf", width = 10, height = 8)
plotDistToTSS(diff_peak_anno)
dev.off()

Parameter Explanation:

tssRegion = c(-3000, 3000) defines the promoter region as 3kb upstream and downstream of the transcription start site (TSS)

annoDb = "org.Hs.eg.db" specifies that we’re working with human genes

The annotation process assigns each peak to various genomic features (promoters, exons, introns, etc.)

For mouse data, use TxDb.Mmusculus.UCSC.mm10.knownGene and org.Mm.eg.db instead

Data Integration

Now comes the key step of integrating the ATAC-seq and RNA-seq datasets to identify relationships between chromatin accessibility and gene expression.

#-----------------------------------------------
# STEP 3: Integrate ATAC-seq and RNA-seq data
#-----------------------------------------------

#=============================================
# 3.1: Prepare data for integration
#=============================================
# Create a data.table for differentially accessible regions with gene info
atac_with_genes <- diff_peak_anno_df[, .(seqnames, start, end, strand, 
                                       annotation, gene_symbol, geneId, distanceToTSS)]

# Add log2FC and padj from the original ATAC-seq data
# Match by genomic coordinates
atac_with_genes[, peak_key := paste(seqnames, start, end, sep = "_")]
sig_atac[, peak_key := paste(chr, start, end, sep = "_")]

atac_with_genes <- merge(atac_with_genes, sig_atac[, .(peak_key, log2FoldChange, padj)], by = "peak_key")
setnames(atac_with_genes, old = c("log2FoldChange", "padj"), new = c("atac_log2FC", "atac_padj"))

# Create a data.table for differentially expressed genes
rna_sig <- sig_genes[, .(gene_id, gene_name, log2FoldChange, padj)]
setnames(rna_sig, old = c("log2FoldChange", "padj"), new = c("rna_log2FC", "rna_padj"))

#=============================================
# 3.2: Merge datasets
#=============================================
# Ensure gene_symbol is character to avoid merge issues
atac_with_genes[, gene_symbol := as.character(gene_symbol)]
rna_sig[, gene_name := as.character(gene_name)]

# Merge the datasets based on gene symbols
integrated_data <- merge(atac_with_genes, rna_sig, by.x = "gene_symbol", by.y = "gene_name", allow.cartesian = TRUE)

# Print integration summary
cat("Integration summary:\n")
cat("Number of peaks associated with genes that have both differential expression and accessibility:", 
    nrow(integrated_data), "\n")

#=============================================
# 3.3: Calculate correlation between accessibility and expression
#=============================================
# Calculate Pearson correlation
correlation <- cor.test(integrated_data$atac_log2FC, integrated_data$rna_log2FC)

cat("Correlation between chromatin accessibility and gene expression:\n")
cat("Pearson correlation coefficient:", correlation$estimate, "\n")
cat("p-value:", correlation$p.value, "\n")

#=============================================
# 3.4: Categorize integrated results
#=============================================
# Categorize the results based on concordance
integrated_data[, concordance := fifelse(
  atac_log2FC > 0 & rna_log2FC > 0, "Concordant up",
  fifelse(atac_log2FC < 0 & rna_log2FC < 0, "Concordant down",
    fifelse(atac_log2FC > 0 & rna_log2FC < 0, "Discordant (open/down)",
      fifelse(atac_log2FC < 0 & rna_log2FC > 0, "Discordant (closed/up)", 
        "Other"))))]

# Summarize the categories
concordance_summary <- integrated_data[, .N, by = concordance]
cat("\nConcordance summary:\n")
print(concordance_summary)

Integration Approaches Explained:

Direct overlap connects each peak with its nearest gene annotation

Distance-based integration finds all genes within a certain distance (e.g., 100kb) of each peak

Concordance analysis examines whether chromatin accessibility and gene expression change in the same direction

The choice of distance threshold (100kb here) can be adjusted based on your biological context

Visualization

Visualizing the integrated data helps us understand the relationships between chromatin accessibility and gene expression more intuitively.

#-----------------------------------------------
# STEP 4: Visualize the integrated results
#-----------------------------------------------

#=============================================
# 4.1: Calculate integrated scores for better ranking
#=============================================
# Calculate various scores to evaluate the integration
integrated_data[, `:=` (
  # Standard product of fold changes
  combined_score = atac_log2FC * rna_log2FC,

  # Z-score normalized product (gives equal weight to both data types)
  z_combined_score = scale(atac_log2FC)[,1] * scale(rna_log2FC)[,1],

  # Statistical significance combined score (-log10 of p-value product)
  sig_score = -log10(atac_padj) - log10(rna_padj),

  # Weighted score that considers both magnitude and significance
  weighted_score = (atac_log2FC * rna_log2FC) * (-log10(atac_padj * rna_padj)),

  # Distance-weighted score (higher weight for closer peaks)
  distance_weight = exp(-pmin(abs(distanceToTSS), 100000)/10000),
  distance_weighted_score = (atac_log2FC * rna_log2FC) * exp(-pmin(abs(distanceToTSS), 100000)/10000)
)]

#=============================================
# 4.2: Create a scatterplot of ATAC-seq vs RNA-seq changes
#=============================================
# Scatterplot showing the relationship between accessibility and expression changes
scatter_plot <- ggplot(integrated_data, 
                     aes(x = atac_log2FC, 
                         y = rna_log2FC, 
                         color = concordance,
                         size = -log10(atac_padj * rna_padj))) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "black", linetype = "dashed") +
  scale_color_manual(values = c("Concordant up" = "#E63946", 
                               "Concordant down" = "#1D3557",
                               "Discordant (open/down)" = "#F4A261", 
                               "Discordant (closed/up)" = "#7B2CBF",
                               "Other" = "gray70")) +
  scale_size_continuous(range = c(1, 8), name = "Combined Significance") +
  labs(x = "ATAC-seq log2 Fold Change", 
       y = "RNA-seq log2 Fold Change",
       title = "Integration of Chromatin Accessibility and Gene Expression Changes",
       subtitle = paste("Pearson correlation:", round(correlation$estimate, 3), 
                       "(p-value:", format(correlation$p.value, digits = 3), ")")) +
  theme_minimal() +
  theme(legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5)) +
  geom_hline(yintercept = 0, linetype = "dotted") +
  geom_vline(xintercept = 0, linetype = "dotted") +
  # Add text labels for top genes
  geom_text_repel(data = head(integrated_data[order(-abs(weighted_score))], 15),
                 aes(label = gene_symbol),
                 size = 3.5,
                 max.overlaps = 20)

# Save the plot
ggsave("ATAC_RNA_integration_scatter.pdf", scatter_plot, width = 10, height = 8)

#=============================================
# 4.3: Create a volcano plot of integration results
#=============================================
# Add a column for integration significance (combined p-value)
integrated_data[, combined_pvalue := atac_padj * rna_padj]
integrated_data[, neg_log10_pval := -log10(combined_pvalue)]

# Create a volcano plot
volcano_plot <- ggplot(integrated_data, 
                     aes(x = combined_score, 
                         y = neg_log10_pval, 
                         color = concordance)) +
  geom_point(alpha = 0.7, size = 2) +
  scale_color_manual(values = c("Concordant up" = "#E63946", 
                               "Concordant down" = "#1D3557",
                               "Discordant (open/down)" = "#F4A261", 
                               "Discordant (closed/up)" = "#7B2CBF",
                               "Other" = "gray70")) +
  theme_minimal() +
  labs(x = "Combined Score (ATAC-seq × RNA-seq log2FC)", 
       y = "-log10(Combined p-value)",
       title = "Volcano Plot of Integrated ATAC-seq and RNA-seq Results") +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  # Add text labels for top genes
  geom_text_repel(data = head(integrated_data[order(-abs(weighted_score))], 20),
                 aes(label = gene_symbol),
                 size = 3.5,
                 max.overlaps = 20)

# Save the plot
ggsave("Integration_volcano_plot.pdf", volcano_plot, width = 10, height = 8)

#=============================================
# 4.4: Create a faceted histogram of peak-gene distances by concordance
#=============================================
# Calculate distance in kb for the integrated data
integrated_data[, distance_kb := abs(distanceToTSS) / 1000]  # Convert to kb for easier reading

# Create histograms of distances for different concordance types
distance_hist <- ggplot(integrated_data, aes(x = distance_kb, fill = concordance)) +
  geom_histogram(bins = 30, alpha = 0.7) +
  facet_wrap(~ concordance, scales = "free_y") +
  scale_fill_manual(values = c("Concordant up" = "#E63946", 
                              "Concordant down" = "#1D3557",
                              "Discordant (open/down)" = "#F4A261", 
                              "Discordant (closed/up)" = "#7B2CBF",
                              "Other" = "gray70")) +
  theme_minimal() +
  labs(x = "Distance from ATAC-seq Peak to Gene (kb)", 
       y = "Count",
       title = "Distribution of Peak-Gene Distances by Concordance Category") +
  theme(legend.position = "none")

# Save the plot
ggsave("Distance_histograms_by_concordance.pdf", distance_hist, width = 12, height = 8)

Visualization Insights:

The scatterplot reveals the overall relationship between chromatin accessibility and gene expression

The volcano plot highlights genes with statistically significant concordant changes

The distance histogram shows whether certain types of regulatory relationships are more common at specific genomic distances

Results Interpretation and Biological Insights

Understanding how to interpret the results of ATAC-seq and RNA-seq integration is crucial for extracting meaningful biological insights.

Key Patterns to Look For

When analyzing the integration of ATAC-seq and RNA-seq data, consider these key patterns:

Concordant changes (accessibility and expression changing in the same direction) often represent the most direct and interpretable regulatory relationships:
- Concordant increases (open/up): Typically represent activation of enhancers or promoters leading to increased gene expression
- Concordant decreases (closed/down): May indicate repression of regulatory elements leading to decreased gene expression

Discordant changes are more complex and may indicate:
- Open chromatin but decreased expression: Possibly due to binding of repressors at newly accessible sites, or compensatory mechanisms
- Closed chromatin but increased expression: May result from loss of repressor binding, or regulation through distal elements not captured in the analysis

Distance relationships are important to consider:
- Regulatory elements closer to transcription start sites (especially promoters) typically show stronger correlation with gene expression
- Distal elements (enhancers) may show weaker correlation but still play critical functional roles

Genomic context provides valuable interpretative clues:
- Promoter-associated changes may directly affect transcription initiation
- Intronic or distal intergenic changes might represent enhancers or other regulatory elements
- UTR-associated changes could affect mRNA stability or translation efficiency

Common Pitfalls and Troubleshooting

When performing ATAC-seq and RNA-seq integration, be aware of these common challenges:

Gene assignment issues:
- Problem: Peaks may be incorrectly assigned to genes based solely on proximity
- Solution: Consider using chromatin conformation data (Hi-C, ChIA-PET) when available to confirm interactions

Temporal disconnects:
- Problem: Chromatin accessibility changes may precede expression changes
- Solution: Consider time-course experiments when possible

Cell type heterogeneity:
- Problem: Bulk sequencing may mask cell type-specific relationships
- Solution: Consider single-cell approaches if cellular heterogeneity is a concern

Threshold sensitivity:
- Problem: Results may be sensitive to significance thresholds
- Solution: Perform sensitivity analyses with different thresholds

Best Practices for ATAC-seq and RNA-seq Integration

To ensure robust and reproducible analyses, follow these best practices:

Quality control at every step:
- Use established QC metrics for both ATAC-seq and RNA-seq data
- Check for batch effects and correct them if necessary

Biological replication:
- Always include biological replicates for both assays
- Use consistent sampling and processing for matched samples

Data normalization:
- Apply appropriate normalization methods for each data type
- Consider batch correction if samples were processed separately

Multiple testing correction:
- Apply rigorous multiple testing correction to avoid false positives
- Consider the number of tests performed in the integrated analysis

Validation strategies:
- Consider alternative approaches to validate key findings:
  - qPCR for gene expression changes
  - ChIP-qPCR for specific regulatory regions
  - CRISPR validation for functional testing

Conclusion and Next Steps

Integrating ATAC-seq and RNA-seq data provides a powerful approach to understand the relationship between chromatin accessibility and gene expression. This multi-omics integration enables researchers to:

Identify functional regulatory elements by linking changes in chromatin accessibility to gene expression outcomes
Uncover regulatory mechanisms driving changes in gene expression in response to experimental conditions
Prioritize candidate regulatory regions for further functional validation through methods like CRISPR editing or reporter assays
Generate hypotheses about transcription factor activity by examining motif enrichment in peaks associated with expression changes

Advanced Analyses to Consider

After completing the basic integration workflow, consider these advanced analyses:

Transcription factor motif analysis:
- Identify enriched transcription factor binding motifs in concordant peaks
- Correlate with expression of the corresponding transcription factors

Pathway and network analysis:
- Perform pathway enrichment on genes in different concordance categories
- Build gene regulatory networks based on integrated data

Integration with additional data types:
- Incorporate histone modification data (ChIP-seq)
- Add DNA methylation data for a more complete epigenetic picture
- Include transcription factor binding data (ChIP-seq)

Functional validation:
- Design CRISPR experiments to validate key regulatory elements
- Use reporter assays to confirm enhancer/promoter activity
- Apply chromosome conformation capture methods to validate physical interactions

By leveraging the complementary information provided by ATAC-seq and RNA-seq, researchers can gain deeper insights into the mechanisms of gene regulation in their biological system of interest, potentially uncovering new therapeutic targets or biomarkers for disease.

References

Heshusius, S., Grech, L., Gillemans, N. et al. Epigenomic analysis of KLF1 haploinsufficiency in primary human erythroblasts. Sci Rep 12, 336 (2022). https://doi.org/10.1038/s41598-021-04126-6

This tutorial is part of the NGS101.com beginner’s guide to next-generation sequencing analysis. If you have questions or suggestions, please leave a comment below.

NGS Learning Hub

How To Analyze ATAC-seq Data For Absolute Beginners Part 4: ATAC-seq and RNA-seq Integration

Introduction: The Power of Multi-omics Integration

What is ATAC-seq and RNA-seq Integration?

Why Integrate ATAC-seq and RNA-seq Data?

The Workflow for ATAC-seq and RNA-seq Integration

Setting Up Your Analysis Environment

Data Preparation

Peak Annotation

Data Integration

Visualization

Results Interpretation and Biological Insights

Key Patterns to Look For

Common Pitfalls and Troubleshooting

Best Practices for ATAC-seq and RNA-seq Integration

Conclusion and Next Steps

Advanced Analyses to Consider

References

Like this:

Comments

Leave a Reply Cancel reply

Search

Subscribe

Categories

Recent Posts

Tags

How To Analyze ATAC-seq Data For Absolute Beginners Part 4: ATAC-seq and RNA-seq Integration

Introduction: The Power of Multi-omics Integration

What is ATAC-seq and RNA-seq Integration?

Why Integrate ATAC-seq and RNA-seq Data?

The Workflow for ATAC-seq and RNA-seq Integration

Setting Up Your Analysis Environment

Data Preparation

Peak Annotation

Data Integration

Visualization

Results Interpretation and Biological Insights

Key Patterns to Look For

Common Pitfalls and Troubleshooting

Best Practices for ATAC-seq and RNA-seq Integration

Conclusion and Next Steps

Advanced Analyses to Consider

References

Share this:

Like this:

Comments

Leave a Reply Cancel reply

Search

Subscribe

Categories

Recent Posts

Tags