Introduction: Understanding ATAC-seq
At the heart of molecular biology lies a fundamental question: how do cells regulate which genes are expressed and when? One powerful technique to explore this question is Assay for Transposase-Accessible Chromatin sequencing, commonly known as ATAC-seq. This tutorial introduces beginners to the fascinating world of ATAC-seq data analysis using HOMER and MACS2, versatile software suites that have become mainstays in genomic research.
What is ATAC-seq?
ATAC-seq represents a revolutionary method for mapping chromatin accessibility across the genome. The technique allows researchers to identify open regions of chromatin where DNA is not tightly bound to nucleosomes and is therefore accessible to regulatory proteins such as transcription factors. These accessible regions often correspond to active regulatory elements like promoters, enhancers, and other functional regions crucial for gene expression.
The experimental process begins with living cells, where a hyperactive Tn5 transposase enzyme is used to simultaneously cut open chromatin and insert sequencing adapters into these regions. The enzyme preferentially targets accessible DNA, effectively serving as a probe for open chromatin. The resulting DNA fragments are then purified, amplified by PCR, and sequenced. When mapped back to a reference genome, these sequences create a genome-wide profile of chromatin accessibility.
Beginner’s Tip: Think of ATAC-seq as taking a snapshot of all the “open doors” in the genome where regulatory proteins can enter and influence gene expression. While ChIP-seq tracks specific proteins at work, ATAC-seq reveals all potentially active regulatory regions regardless of which specific proteins might be binding there.
ATAC-seq vs. ChIP-seq: Key Differences and Complementary Roles
While both ATAC-seq and ChIP-seq (Chromatin Immunoprecipitation sequencing) are powerful techniques for studying gene regulation, they answer different questions and have distinct advantages:
Fundamental Differences:
- Target of Investigation:
- ATAC-seq: Directly measures chromatin accessibility, revealing all potentially active regulatory regions
- ChIP-seq: Targets specific protein-DNA interactions, showing exactly where a particular protein binds
- Experimental Process:
- ATAC-seq: Uses Tn5 transposase to preferentially insert adapters into open chromatin; requires no antibodies
- ChIP-seq: Uses antibodies to immunoprecipitate specific proteins along with their bound DNA fragments
- Required Input:
- ATAC-seq: Can work with as few as 500-50,000 cells (even single cells with specialized protocols)
- ChIP-seq: Typically requires millions of cells, especially for factors with few binding sites
Advantages of ATAC-seq:
- Simplicity: Streamlined “two-step” protocol (transposition and PCR) compared to ChIP-seq’s multi-day procedure
- Low Input Requirements: Ideal for rare cell populations or limited clinical samples
- Unbiased Discovery: Not limited to studying known proteins or requiring specific antibodies
- Global Regulatory Landscape: Captures all accessible regions in a single experiment
- Higher Signal-to-Noise Ratio: Often provides cleaner, more distinct peaks
Advantages of ChIP-seq:
- Protein Specificity: Directly identifies where specific proteins bind
- Functional Insight: Connects specific proteins to their genomic targets
- Validated Approach: Longer history of use with established analysis pipelines
- Histone Modifications: Uniquely able to map specific histone marks across the genome
When to Use Each Technique:
- Choose ATAC-seq when:
- Exploring the overall regulatory landscape without a specific protein in mind
- Working with limited biological material (rare cell types, clinical samples)
- Identifying all potential regulatory elements
- Comparing global accessibility changes between conditions
- Needing a faster, more straightforward protocol
- Choose ChIP-seq when:
- Studying a specific protein’s binding sites or histone modification
- Investigating how a particular factor responds to treatments or conditions
- Exploring direct protein-DNA interactions
- Cell material is abundant
- Antibody quality is high for your protein of interest
The two techniques are highly complementary. Many researchers use ATAC-seq to identify accessible regions genome-wide and then follow up with ChIP-seq to determine which specific factors bind within those regions.
Differences in Data Analysis Between ATAC-seq and ChIP-seq
While ATAC-seq and ChIP-seq data analysis share similar workflows, there are important distinctions to consider:
- Control Samples:
- ChIP-seq: Requires input control or IgG control samples
- ATAC-seq: Typically analyzed without control samples, using genomic background instead
- Fragment Size Distribution:
- ATAC-seq: Shows a characteristic nucleosome-free pattern (<100bp) and nucleosome-bound fragments (multiples of ~200bp)
- ChIP-seq: Fragment sizes depend on sonication conditions, typically 200-500bp
- Peak Calling Considerations:
- ATAC-seq: Peaks represent open chromatin regions; callers need to account for transposase insertion bias
- ChIP-seq: Peaks represent protein binding sites; specific to the protein being studied
- Advanced Analysis:
- ATAC-seq: Enables nucleosome positioning and transcription factor footprinting analysis
- ChIP-seq: Focuses more on motif discovery for the specific factor being studied
- Peak Shapes:
- ATAC-seq: Generally produces sharp, well-defined peaks at nucleosome-free regions
- ChIP-seq: Peak shapes vary depending on the factor (sharp for transcription factors, broad for some histone marks)
Despite these differences, many analysis tools developed for ChIP-seq have been adapted for ATAC-seq, with some adjustments to parameters and preprocessing steps.
Why ATAC-seq Matters in Biological Research
The applications of ATAC-seq extend to numerous biological fields:
- Gene Regulation: By identifying accessible chromatin regions, researchers can uncover the regulatory networks controlling gene expression during development, in different tissues, or in response to stimuli.
- Epigenetics: ATAC-seq reveals the epigenetic landscape that influences gene accessibility, complementing histone modification and DNA methylation studies.
- Disease Research: Changes in chromatin accessibility can indicate dysregulated gene expression in conditions like cancer, autoimmune diseases, and neurological disorders.
For example, cancer researchers use ATAC-seq to understand how chromatin accessibility is altered in tumor cells. Developmental biologists apply it to track how chromatin accessibility changes during cell differentiation. Immunologists employ ATAC-seq to reveal how immune cells respond to infections through changes in chromatin organization. The insights gained from these studies advance our fundamental understanding of biology and inform the development of new therapeutic strategies.
The ATAC-seq Data Analysis Workflow
For newcomers to computational biology, analyzing ATAC-seq data can be broken down into a logical sequence of steps:
- Quality Assessment: Ensuring the sequencing reads are reliable before proceeding
- Adapter Trimming: Removing Tn5 transposase adapter sequences
- Alignment: Mapping millions of short DNA sequences to their correct locations in the reference genome
- Post-alignment Processing: Removing duplicates, filtering for high-quality alignments, and adjusting for Tn5 transposition
- Peak Calling: Identifying regions of significant enrichment that represent accessible chromatin
- Annotation: Connecting accessible regions to nearby genomic features such as genes, enhancers, or promoters
- Visualization: Transforming complex datasets into intuitive graphical representations
- Motif Discovery: Identifying common DNA sequence patterns within accessible regions, revealing potential transcription factor binding sites
Throughout this analytical journey, computational tools serve as the researcher’s compass and map, guiding the way from raw data to biological meaning. In this tutorial, we’ll specifically focus on using HOMER for preprocessing and annotation, and MACS2 for peak calling, as these tools offer a good balance of accessibility and analytical power for beginners.
Setting Up Your Analysis Environment
Now that we understand what ATAC-seq is and how it differs from ChIP-seq, we’ll proceed with setting up our computational environment for ATAC-seq data analysis. Tools used for ChIP-seq analysis also work for ATAC-seq data. Here we use the same Conda environment and the tools installed within the environment from our previous ChIP-seq data analysis tutorial. You can keep using BWA for alignment as we did in the ChIP-seq data analysis tutorial. Here I demonstrate how to use Bowtie2, another popular aligner, for reads alignment.
Required Software Installation
#-----------------------------------------------
# STEP 0: Setup conda environment for ATAC-seq analysis
#-----------------------------------------------
# Activate the HOMER environment created for ChIP-seq data analysis
conda activate ~/Env_Homer
# Install Bowtie2 for alignment
conda install -c bioconda bowtie2
# Install MACS2 for peak calling
conda install -c bioconda macs2
# Check installed versions
bowtie2 --version
macs2 --version
# Ensure we have all necessary packages
conda install -y \
sambamba \ # Faster BAM file manipulation than samtools
deeptools \ # For visualization and quality control
multiqc \ # For aggregating QC reports
bioconductor-diffbind # For differential binding analysis
Troubleshooting Tip: If you encounter errors during installation, make sure you have sufficient disk space and permissions to install software. For permission errors, you might need to use
sudo
or contact your system administrator.
Reference File Preparation
The reference files used for ATAC-seq data analysis are the same as those for ChIP-seq data. Please refer to my previous ChIP-seq data analysis tutorial for reference file preparation. The prebuilt Bowtie2 index can be downloaded from refgenie:
#-----------------------------------------------
# STEP 1: Download reference genome index files
#-----------------------------------------------
# Create a dedicated directory for Bowtie2 Index files
mkdir -p ~/Genome_Index/Bowtie2_hg38
cd ~/Genome_Index/Bowtie2_hg38
# Define the base URL for all index files
BASE_URL="http://awspds.refgenie.databio.org/refgenomes.databio.org/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/bowtie2_index__default"
FILE_PREFIX="2230c535660fb4774114bfa966a62f823fdb6d21acf138d4"
# Download all Bowtie2 index files
echo "Downloading Bowtie2 index files..."
wget ${BASE_URL}/${FILE_PREFIX}.1.bt2
wget ${BASE_URL}/${FILE_PREFIX}.2.bt2
wget ${BASE_URL}/${FILE_PREFIX}.3.bt2
wget ${BASE_URL}/${FILE_PREFIX}.4.bt2
wget ${BASE_URL}/${FILE_PREFIX}.rev.1.bt2
wget ${BASE_URL}/${FILE_PREFIX}.rev.2.bt2
echo "All reference genome index files downloaded successfully!"
# Download ENCODE blacklisted regions for later filtering
mkdir -p ~/Genome_Index/ENCODE_BlackList
cd ~/Genome_Index/ENCODE_BlackList
wget https://github.com/Boyle-Lab/Blacklist/raw/master/lists/hg38-blacklist.v2.bed.gz
gunzip hg38-blacklist.v2.bed.gz
Best Practice: Store reference files in a centralized location on your system to avoid duplicating large files for different projects.
Download Example Data
For this tutorial, we’ll use a paired-end ATAC-seq dataset from mouse embryonic stem cells. Let’s create a directory structure and download some sample data:
#-----------------------------------------------
# STEP 2: Download example ATAC-seq dataset
#-----------------------------------------------
# Create a directory structure for the project
mkdir -p ~/ATAC_Dataset/{raw,trimmed,Alignment,Homer,MACS2}
# For demonstration, we will assume you have:
# ~/ATAC_Dataset/raw/Sample1_R1_001.fastq.gz (Read 1)
# ~/ATAC_Dataset/raw/Sample1_R2_001.fastq.gz (Read 2)
Note: For real analyses, you would typically use your own data or download data from repositories like GEO or SRA using tools like
prefetch
andfasterq-dump
from the SRA toolkit, as shown in our ChIP-seq tutorial.
The ATAC-seq Analysis Pipeline
Now that we have our environment set up, let’s walk through the analysis pipeline step by step. Unlike ChIP-seq, ATAC-seq data typically don’t require input control samples for peak calling, though they can be included if available.
Step 1: Quality Control and Adapter Trimming
First, we need to assess data quality and trim adapter sequences. ATAC-seq libraries often contain Tn5 transposase adapter sequences, which must be removed:
#-----------------------------------------------
# STEP 3: Quality control and adapter trimming
#-----------------------------------------------
# Create directories for the trimmed outputs
mkdir -p ~/ATAC_Dataset/trimmed/Sample1
# Trim the adapters
echo "Trimming adapters from paired-end ATAC-seq data..."
trim_galore \
--fastqc \ # Run FastQC after trimming for QC report
--paired \ # Data is paired-end
--cores 12 \ # Use 12 CPU cores for faster processing
--quality 20 \ # Trim low-quality bases (Q<20)
--stringency 3 \ # Require at least 3bp overlap with adapter
--length 20 \ # Discard reads shorter than 20bp after trimming
--output_dir ~/ATAC_Dataset/trimmed/Sample1 \
~/ATAC_Dataset/raw/Sample1_R1_001.fastq.gz \
~/ATAC_Dataset/raw/Sample1_R2_001.fastq.gz
echo "Quality control and adapter trimming complete!"
What This Does:
- FastQC generates quality reports before and after trimming
- Trim Galore removes adapter sequences and low-quality bases
- The paired-end parameter ensures reads remain properly paired after trimming
- Output files will be named with ‘_val_1.fq.gz’ and ‘_val_2.fq.gz’ suffixes
Step 2: Align Reads to the Reference Genome
Next, we map our trimmed reads to the reference genome using Bowtie2. For ATAC-seq, we use parameters optimized for the characteristics of Tn5 transposition:
#-----------------------------------------------
# STEP 4: Align reads to the reference genome
#-----------------------------------------------
# Create directory for alignment files
mkdir -p ~/ATAC_Dataset/Alignment/Sample1
# Set variables for readability
INDEX="~/Genome_Index/Bowtie2_hg38/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4"
THREADS=12
echo "Aligning ATAC-seq reads to reference genome..."
bowtie2 \
-q \ # Input is in FASTQ format
-x ${INDEX} \ # Bowtie2 index prefix
-p ${THREADS} \ # Number of threads
-1 ~/ATAC_Dataset/trimmed/Sample1/Sample1_R1_001_val_1.fq.gz \ # Read 1
-2 ~/ATAC_Dataset/trimmed/Sample1/Sample1_R2_001_val_2.fq.gz \ # Read 2
--sensitive \ # Alignment preset (balanced speed and sensitivity)
--no-mixed \ # Only report paired alignments
--no-discordant \ # Only report concordant alignments
--no-unal \ # Suppress unaligned reads
-S ~/ATAC_Dataset/Alignment/Sample1/Sample1.sam # Output SAM file
echo "Alignment complete! SAM file created."
ATAC-seq Alignment Tip:
- For ATAC-seq, we use
--no-mixed
and--no-discordant
to ensure only properly paired reads are reported- This helps avoid artificial peaks caused by misaligned read pairs
- The
--sensitive
preset offers a good balance of speed and sensitivity for most applications
Step 3: Process and Quality Control Alignments
After alignment, we need to convert, sort, and filter our alignment files to prepare them for peak calling. This step is crucial for ATAC-seq as it helps remove technical artifacts:
#-----------------------------------------------
# STEP 5: Process and filter alignment files
#-----------------------------------------------
# Set variables for readability
THREADS=12
SAMPLE_PREFIX="Sample1"
ALIGN_DIR="~/ATAC_Dataset/Alignment/Sample1"
#=============================================
# 5.1: SAM to BAM conversion
#=============================================
echo "Converting SAM to BAM format (compressed binary format)..."
samtools view \
-h \ # Include header in output
-S \ # Input is SAM format
-b \ # Output BAM format
-@ ${THREADS} \ # Use multiple threads
-o ${ALIGN_DIR}/${SAMPLE_PREFIX}.bam \ # Output file
${ALIGN_DIR}/${SAMPLE_PREFIX}.sam # Input file
# Remove large SAM file to save space
rm ${ALIGN_DIR}/${SAMPLE_PREFIX}.sam
#=============================================
# 5.2: Sort BAM files by genomic coordinates
#=============================================
echo "Sorting BAM files by chromosome position..."
# Using sambamba for faster sorting
sambamba sort \
-t ${THREADS} \ # Use multiple threads
-o ${ALIGN_DIR}/${SAMPLE_PREFIX}_sorted.bam \ # Output sorted BAM
${ALIGN_DIR}/${SAMPLE_PREFIX}.bam # Input BAM
# Remove unsorted BAM file to save space
rm ${ALIGN_DIR}/${SAMPLE_PREFIX}.bam
#=============================================
# 5.3: Mark and remove PCR duplicates
#=============================================
echo "Marking and removing PCR duplicates..."
# ATAC-seq libraries can have high duplication rates due to PCR amplification
# Use Picard to mark and remove duplicates
picard MarkDuplicates \
I=${ALIGN_DIR}/${SAMPLE_PREFIX}_sorted.bam \ # Input BAM
O=${ALIGN_DIR}/${SAMPLE_PREFIX}_sorted_dedup.bam \ # Output BAM
M=${ALIGN_DIR}/${SAMPLE_PREFIX}_sorted_dedup_metrics.txt \ # Metrics file
REMOVE_DUPLICATES=true \ # Remove duplicates
VALIDATION_STRINGENCY=LENIENT # Less strict validation
# Create index for the deduplicated BAM file
samtools index ${ALIGN_DIR}/${SAMPLE_PREFIX}_sorted_dedup.bam
#=============================================
# 5.4: Filter for high-quality alignments
#=============================================
echo "Filtering for high-quality alignments..."
# Filter out:
# 1. Multi-mapped reads (XS tag exists)
# 2. Unmapped reads
# 3. Duplicates (already removed, but to be safe)
# 4. Low mapping quality reads
sambamba view \
-h \ # Include header
-f bam \ # Output in BAM format
-F "[XS] == null and not unmapped and not duplicate and mapping_quality >= 30" \ # Filter expression
-t ${THREADS} \ # Number of threads
${ALIGN_DIR}/${SAMPLE_PREFIX}_sorted_dedup.bam \ # Input BAM
-o ${ALIGN_DIR}/${SAMPLE_PREFIX}_sorted_filtered.bam # Output filtered BAM
#=============================================
# 5.5: Filter out blacklisted regions
#=============================================
echo "Removing blacklisted regions from alignments..."
# Blacklisted regions are known problematic regions
# that tend to show artifactual signal in many assays
bedtools intersect \
-v \ # Only keep reads that DO NOT overlap blacklist
-abam ${ALIGN_DIR}/${SAMPLE_PREFIX}_sorted_filtered.bam \ # Input BAM
-b ~/Genome_Index/ENCODE_BlackList/hg38-blacklist.v2.bed \ # Blacklist BED
> ${ALIGN_DIR}/${SAMPLE_PREFIX}_filtered_blacklist_removed.bam # Output filtered BAM
#=============================================
# 5.6: Final sorting of filtered BAM
#=============================================
echo "Final sorting of filtered BAM file..."
sambamba sort \
-t ${THREADS} \ # Use multiple threads
-o ${ALIGN_DIR}/${SAMPLE_PREFIX}_final.bam \ # Output final BAM
${ALIGN_DIR}/${SAMPLE_PREFIX}_filtered_blacklist_removed.bam # Input BAM
# Index the final BAM file
samtools index ${ALIGN_DIR}/${SAMPLE_PREFIX}_final.bam
#=============================================
# 5.7: Generate alignment statistics
#=============================================
echo "Generating alignment statistics..."
# Count total reads in original BAM
TOTAL=$(samtools view -c ${ALIGN_DIR}/${SAMPLE_PREFIX}_sorted.bam)
# Count reads after all filtering
FINAL=$(samtools view -c ${ALIGN_DIR}/${SAMPLE_PREFIX}_final.bam)
# Calculate percentage retained
PERCENT=$(echo "scale=2; 100*${FINAL}/${TOTAL}" | bc)
echo "Original read count: ${TOTAL}"
echo "Final read count: ${FINAL}"
echo "Percentage retained: ${PERCENT}%"
echo "Alignment processing and filtering complete!"
Why These Filtering Steps Matter for ATAC-seq:
- Removing duplicates is crucial as PCR amplification can artificially inflate signal
- Multi-mapped reads (with XS tag) can create false positives
- Blacklisted regions often show artifactual enrichment across many assays
- High mapping quality ensures reliable alignment locations
Step 4: Peak Calling and Annotation using HOMER
Now let’s identify the accessible chromatin regions using HOMER, which offers good integration with downstream annotation and motif analysis:
#-----------------------------------------------
# STEP 6: Peak calling and annotation with HOMER
#-----------------------------------------------
# Create directory for HOMER analysis
mkdir -p ~/ATAC_Dataset/Homer/Sample1
#=============================================
# 6.1: Create HOMER tag directory
#=============================================
echo "Creating HOMER tag directory..."
# Variables for readability
SAMPLE_PREFIX="Sample1"
ALIGN_DIR="~/ATAC_Dataset/Alignment/Sample1"
HOMER_DIR="~/ATAC_Dataset/Homer"
# Create tag directory for ATAC-seq sample
# Tag directories contain processed alignment data optimized for HOMER analysis
echo "Creating tag directory for ATAC-seq sample..."
makeTagDirectory \
${HOMER_DIR}/${SAMPLE_PREFIX} \ # Output directory
${ALIGN_DIR}/${SAMPLE_PREFIX}_final.bam \ # Input BAM
-genome hg38 \ # Reference genome
-checkGC \ # Calculate GC content distribution
-tbp 1 \ # Tags per bp (normalization factor)
-format sam # Input format (BAM is a binary SAM)
#=============================================
# 6.2: Call peaks to identify accessible regions
#=============================================
echo "Identifying accessible chromatin regions (peaks)..."
# Find peaks for ATAC-seq
findPeaks \
${HOMER_DIR}/${SAMPLE_PREFIX} \ # Tag directory
-style dnase \ # For ATAC-seq/DNase-style peaks
-o ${HOMER_DIR}/${SAMPLE_PREFIX}/peaks_${SAMPLE_PREFIX}_fdr1e-05.txt \ # Output file
-fdr 1e-05 \ # False discovery rate threshold (strict)
-region \ # Report concise peak boundaries
-minDist 150 \ # Minimum distance between peaks
-size 150 # Size of region for ATAC-seq peaks
# Peak style explanation:
# -style dnase: optimized for ATAC-seq/DNase-seq data (sharp, open chromatin peaks)
#=============================================
# 6.3: Annotate peaks with genomic features
#=============================================
echo "Annotating peaks with genomic features..."
# Annotate the peaks with nearby genes and genomic features
annotatePeaks.pl \
${HOMER_DIR}/${SAMPLE_PREFIX}/peaks_${SAMPLE_PREFIX}_fdr1e-05.txt \ # Input peak file
hg38 \ # Reference genome
-d ${HOMER_DIR}/${SAMPLE_PREFIX} \ # Tag directory for visualization
-go ${HOMER_DIR}/${SAMPLE_PREFIX}/go \ # Output directory for GO analysis
-genomeOntology ${HOMER_DIR}/${SAMPLE_PREFIX}/genomeOntology \ # Genome feature enrichment
> ${HOMER_DIR}/${SAMPLE_PREFIX}/peaks_${SAMPLE_PREFIX}_fdr1e-05_annotated.txt # Output file
echo "HOMER peak calling and annotation complete!"
Parameter Explanation:
-style dnase
tells HOMER to look for sharp peaks typical of open chromatin regions-fdr 1e-05
sets a stringent false discovery rate threshold of 0.001%-region
focuses on the precise peak boundaries rather than fixed-width windows-minDist 150
prevents calling multiple peaks within 150bp of each other- The annotation step connects each peak to its nearest gene and identifies whether it falls in a promoter, enhancer, or other genomic feature
- See my earlier ChIP-seq tutorial for a detailed explanation of HOMER outputs.
Step 5: Peak Calling and Annotation using MACS2
MACS2 is another popular peak caller widely used for ATAC-seq analysis. The peak calling and annotation process is very similar to the one demonstrated in my previous ChIP-seq tutorial, requiring only a few adjustments specific to ATAC-seq data:
#-----------------------------------------------
# STEP 7: Peak calling with MACS2
#-----------------------------------------------
# Create directory for MACS2 analysis
mkdir -p ~/ATAC_Dataset/MACS2/Sample1
#=============================================
# 7.1: Call peaks using MACS2
#=============================================
echo "Calling peaks with MACS2..."
# Variables for readability
SAMPLE_PREFIX="Sample1"
ALIGN_DIR="~/ATAC_Dataset/Alignment/Sample1"
MACS2_DIR="~/ATAC_Dataset/MACS2"
# Call peaks with MACS2
# ATAC-seq specific parameters: --nomodel --shift -100 --extsize 200
macs2 callpeak \
-t ${ALIGN_DIR}/${SAMPLE_PREFIX}_final.bam \ # Input BAM file
-f BAMPE \ # Format: paired-end BAM
-g hs \ # Genome size: human (hs)
-n ${SAMPLE_PREFIX} \ # Output name prefix
--outdir ${MACS2_DIR}/${SAMPLE_PREFIX} \ # Output directory
-q 0.01 \ # q-value (FDR) cutoff
--nomodel \ # Skip model building
--keep-dup all \ # Keep all duplicates (we already removed them)
--call-summits # Call peak summits
echo "MACS2 peak calling complete!"
# MACS2 parameter explanation:
# -f BAMPE: Format is paired-end BAM, uses actual fragment size
# --nomodel: Skip the model building step, important for ATAC-seq
# -q 0.01: FDR threshold of 1%
# --call-summits: Find the precise summit positions within peaks
MACS2 vs. HOMER for ATAC-seq:
- MACS2 is often preferred for ATAC-seq due to its handling of paired-end data and shift/extension parameters
- HOMER offers more integrated downstream analysis (annotation, motif discovery)
- Both tools are widely accepted in the field, and comparing results can increase confidence
- For details on MACS2 output, refer to my previous ChIP-seq tutorial.
Step 6: Peak Visualization
The visualization of ATAC-seq peaks follows the same principles and techniques outlined in my ChIP-seq data visualization tutorial.
#-----------------------------------------------
# STEP 8: Creating browser tracks and visualizations
#-----------------------------------------------
#=============================================
# 8.1: Create bigWig tracks for visualization using deepTools
#=============================================
echo "Creating bigWig tracks for genome browser visualization..."
# Get chromosome sizes for bigWig conversion
fetchChromSizes hg38 > ~/ATAC_Dataset/hg38.chrom.sizes
# Variables for readability
SAMPLE_PREFIX="Sample1"
ALIGN_DIR="~/ATAC_Dataset/Alignment/Sample1"
VISUAL_DIR="~/ATAC_Dataset/Visualization"
mkdir -p ${VISUAL_DIR}
# Method 1: Make bigWig files using HOMER
makeUCSCfile ~/ATAC_Dataset/Homer/Sample1 \
-o auto \
-bigWig ChromSizes_hg38.txt \
-style dnase
# Method 2: Make bigWig files using deepTools
bamCoverage \
--bam ${ALIGN_DIR}/${SAMPLE_PREFIX}_final.bam \ # Input BAM
--outFileName ${VISUAL_DIR}/${SAMPLE_PREFIX}.bw \ # Output bigWig
--outFileFormat bigwig \
--binSize 10 \ # 10bp resolution
--normalizeUsing RPKM \ # Normalize by reads per kilobase per million
--extendReads \ # Extend reads to fragment size
--ignoreDuplicates \ # Skip duplicate reads
--numberOfProcessors 10 # Use 10 threads
#=============================================
# 8.2: Create a BED file of peaks for sharing
#=============================================
echo "Creating BED files of peaks for sharing..."
# Convert peak files to standard BED format
pos2bed.pl ${HOMER_DIR}/${SAMPLE_PREFIX}/peaks_${SAMPLE_PREFIX}_fdr1e-05.txt \
> ${VISUAL_DIR}/${SAMPLE_PREFIX}_homer_peaks.bed
# Copy MACS2 narrowPeak file (already in BED-like format)
cp ${MACS2_DIR}/${SAMPLE_PREFIX}/${SAMPLE_PREFIX}_peaks.narrowPeak \
${VISUAL_DIR}/${SAMPLE_PREFIX}_macs2_peaks.bed
echo "Visualization files created successfully!"
Step 7: Motif Analysis and Differential Binding Analysis
The methodology for motif analysis in ATAC-seq data mirrors that of ChIP-seq analysis, as detailed in my previous tutorial.
Conclusion and Next Steps
Congratulations! You’ve successfully completed a basic ATAC-seq analysis pipeline, from raw sequencing data to annotated accessible chromatin regions. This analysis has identified regions of the genome where chromatin is accessible, potentially revealing active regulatory elements such as promoters and enhancers.
Summary of What We’ve Accomplished
- Quality Control and Preprocessing: We ensured high-quality data by trimming adapters and removing low-quality reads
- Alignment: We mapped reads to the reference genome using Bowtie2 with parameters optimized for ATAC-seq
- Filtering: We filtered alignments to remove duplicates, multi-mapped reads, and blacklisted regions
- Peak Calling: We identified accessible chromatin regions using both HOMER and MACS2
- Annotation: We annotated peaks with genomic features such as promoters, enhancers, and nearby genes
- Visualization: We created tracks for genome browsers to visualize and share our results
- Motif Analysis: We discovered DNA sequence motifs enriched in accessible regions, revealing potential transcription factor binding sites
Best Practices for ATAC-seq Analysis
To ensure high-quality results from your ATAC-seq analysis, keep these best practices in mind:
Quality Control at Every Step
- Before Analysis: Check sequencing quality with FastQC
- After Alignment: Verify mapping rates (>70% is ideal for mammalian genomes)
- After Peak Calling: Assess peak characteristics and reproducibility between replicates
Technical Considerations
- Fragment Size Distribution: ATAC-seq data should show characteristic nucleosome-free (~50-100bp) and nucleosome-bound (~200bp, ~400bp) peaks
- Library Complexity: Ensure sufficient complexity (low duplication rates) to capture the true signal
- Mitochondrial Contamination: Mitochondrial DNA is highly accessible and can dominate ATAC-seq libraries; filtering these reads is often necessary
- Tn5 Insertion Bias: Tn5 has slight sequence preferences that can be corrected in advanced analyses
Data Interpretation
- Focus on High-confidence Regions: Use stringent thresholds for peak calling
- Consider Peak Location: Promoter-proximal peaks often indicate active genes, while distal peaks suggest enhancers
- Integrate with Expression Data: Correlate accessibility with gene expression to identify functional elements
- Validate Key Findings: Consider experimental validation of important regulatory regions
Troubleshooting Common Issues
Low Peak Counts
Problem: You detected very few peaks compared to expectations.
Solutions:
- Check library quality and complexity
- Decrease the stringency of your peak calling parameters
- Ensure you have sufficient sequencing depth (>50 million reads recommended)
- Verify cell viability in the original sample (dead cells show global accessibility changes)
High Mitochondrial Content
Problem: A large percentage of reads map to mitochondrial DNA.
Solutions:
- Filter out mitochondrial reads before analysis
- Optimize cell lysis conditions in future experiments to reduce mitochondrial DNA release
- Consider deeper sequencing to compensate for lost reads
Inconsistent Fragment Size Distribution
Problem: Missing the characteristic nucleosome-free and mono-nucleosome peaks in fragment size distribution.
Solutions:
- Check Tn5 transposase activity and concentration
- Ensure proper cell permeabilization without over-digestion
- Optimize size selection during library preparation
Predominant Signal from Promoters
Problem: Most accessible regions are at promoters, with few distal elements.
Solutions:
- Increase sequencing depth to detect weaker distal signals
- Optimize cell lysis and tagmentation conditions
- Consider cell type specificity (some cell types have fewer active enhancers)
Further Resources
For those interested in deepening their understanding of ATAC-seq analysis:
- ENCODE ATAC-seq Guidelines: Best practices from the ENCODE consortium
- Galaxy ATAC-seq Tutorials: GUI-based alternatives to command-line analysis
- Schep et al. (2017) – chromVAR: Methods for analyzing variation in chromatin accessibility
- Yan et al. (2020) – TOBIAS: Advanced transcription factor footprinting analysis
References
- Buenrostro, J.D., Giresi, P.G., Zaba, L.C. et al. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature Methods 10, 1213–1218 (2013).
- Corces, M.R., Trevino, A.E., Hamilton, E.G. et al. An improved ATAC-seq protocol reduces background and enables interrogation of frozen tissues. Nature Methods 14, 959–962 (2017).
- Grandi, F.C., Modi, H., Kampman, L. et al. Chromatin accessibility profiling by ATAC-seq. Nat Protoc 17, 1518–1552 (2022). https://doi.org/10.1038/s41596-022-00692-9
This tutorial is part of the NGS101.com beginner’s guide to next-generation sequencing analysis. If you have questions or suggestions, please leave a comment below.
Leave a Reply