Master the essential file formats in next-generation sequencing analysis
Introduction: Understanding the NGS Data Ecosystem
Next-generation sequencing (NGS) has revolutionized biological research by enabling us to read DNA, RNA, and epigenetic modifications at an unprecedented scale. However, with this power comes complexity – NGS workflows generate dozens of different file formats, each serving specific purposes in the analysis pipeline. Understanding these formats is crucial for any researcher working with genomic data.
What Makes NGS Data Formats Unique?
NGS data formats have evolved to address several key challenges:
- Scale: NGS experiments generate massive datasets, often containing millions to billions of sequencing reads
- Compression: Raw sequencing data can occupy terabytes of storage, requiring efficient compression methods
- Indexing: Random access to specific genomic regions requires sophisticated indexing schemes
- Standardization: Interoperability between different analysis tools demands standardized formats
- Metadata: Complex experimental designs require rich annotation and sample information
The NGS Analysis Journey: From Molecules to Insights
The path from biological sample to scientific insight involves multiple data transformations, each producing specific file types:
- Sequencing Instruments generate raw signals (optical or electrical, depending on the platform) and base calls
- Quality Control produces filtered and trimmed sequence reads
- Alignment maps reads to reference genomes, creating coordinate-sorted data
- Quantification summarizes read counts into expression matrices
- Variant Calling identifies genetic differences from reference sequences
- Annotation connects genomic features to biological knowledge
Each step requires specialized file formats optimized for different computational tasks, storage requirements, and analysis workflows.
Key Properties of NGS Data Formats
Understanding the characteristics of each format helps in choosing the right tools and approaches:
Size Considerations:
- Raw FASTQ files can range from gigabytes to terabytes
- Compressed alignment files (BAM) are typically 60-80% smaller than the equivalent uncompressed SAM files, depending on the data
- Index files, while small, are essential for efficient random access
Format Types:
- Text-based formats (FASTQ, SAM, VCF) are human-readable but larger
- Binary formats (BAM, BCF) offer better compression and faster processing
- Indexed formats enable rapid access to specific genomic regions
Critical Handling Considerations:
- Always verify file integrity using checksums after transfers
- Maintain consistent coordinate systems (0-based vs 1-based indexing)
- Preserve metadata and sample information throughout the analysis pipeline
- Use appropriate compression levels balancing file size and access speed
Raw Sequencing Data: The Foundation of NGS Analysis
Raw sequencing data represents the direct output from sequencing instruments before any computational processing. Understanding these formats is essential for quality assessment and troubleshooting.
Platform-Specific Raw Data Formats
Different sequencing technologies produce distinct raw data formats, each reflecting their underlying detection mechanisms:
Illumina Sequencing:
- BCL files: Binary base call files containing per-cycle base calls and quality scores
- FASTQ files: Text-based format with sequences and per-base quality scores
- InterOp files: Binary files containing run metrics and quality statistics
Oxford Nanopore:
- FAST5 files: HDF5-based format storing raw electrical current measurements
- POD5 files: Newer, more efficient format replacing FAST5
- FASTQ files: Basecalled sequences with quality scores
Pacific Biosciences (PacBio):
- H5 files: HDF5 format for older RSII systems
- BAM files: PacBio’s primary format for Sequel systems
- FASTA/FASTQ: Extracted consensus sequences
Comparative Analysis of Raw Data Formats
Platform | Primary Format | File Size | Read Length | Error Profile | Use Cases |
---|---|---|---|---|---|
Illumina | FASTQ | 1-50 GB | 50-300bp | Low substitution | Genome sequencing, RNA-seq, ChIP-seq |
Nanopore | FAST5/POD5 | 10-500 GB | 1kb-2Mb | Indels, homopolymer | Long-read assembly, structural variants |
PacBio | BAM/FASTQ | 5-200 GB | 1kb-100kb | Random errors | High-quality assembly, isoform analysis |
Sequence Data Formats: The Building Blocks
Sequence data formats store the fundamental genetic information extracted from NGS experiments. These formats serve as input for most downstream analyses.
FASTQ: The Universal Sequence Format
FASTQ format dominates NGS workflows due to its simplicity and comprehensive information content.
Structure and Components:
@M00967:43:000000000-A3JHG:1:1101:18327:1699 1:N:0:1
CCTACGGGNGGCWGCAG
+
A1>1>11#-1>11<-<1
Detailed Breakdown:
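- Header: Starts with @ and contains the instrument ID, run number, flowcell ID, lane, tile, and x/y cluster coordinates, followed by the read number, filter flag, control bits, and index
- Sequence: Raw nucleotide sequence (A, T, G, C, N for ambiguous)
- Separator: A line starting with +, optionally repeating the header
- Quality: One ASCII-encoded Phred score per base, indicating base call confidence; this line must be exactly as long as the sequence line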
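The quality line can be decoded character by character. The following is a minimal sketch, assuming the Phred+33 ASCII offset used by current Illumina output; the function names are illustrative and not part of any library.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Convert an ASCII quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


scores = phred_scores("A1>#")
print(scores)  # [32, 16, 29, 2]
print([round(error_probability(q), 4) for q in scores])
```

A score of 20 corresponds to a 1% chance that the base call is wrong, and a score of 30 to 0.1%.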
Common FASTQ Variants:
- Paired-end: Two FASTQ files (R1 and R2) with matching read IDs
- Compressed: .fastq.gz files for storage efficiency
- Multiplexed: Multiple samples in single file with barcode identification
Best Practices for FASTQ Handling:
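# Summarize the FASTQ file (read count, total bases, read lengths)
seqkit stats sample.fastq.gz
# Count total reads (4 lines per record; assumes unwrapped records)
echo $(( $(gzip -dc sample.fastq.gz | wc -l) / 4 ))
# Extract first 1000 reads
gzip -dc sample.fastq.gz | head -n 4000 > sample_subset.fastq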
FASTA: Simple Sequence Storage
FASTA format provides a streamlined approach for storing sequences without quality information.
Basic Structure:
>sequence_identifier optional_description
ATGCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCGA
>another_sequence
GCGATCGATCGATCGATCGATCGATCGATCGAT
When to Use FASTA:
- Reference genome sequences
- Protein sequences
- Assembled contigs or scaffolds
- Consensus sequences from multiple sequence alignments
- Primer and probe sequences
FASTA vs FASTQ Decision Matrix:
Use Case | Format Choice | Reasoning |
---|---|---|
Raw sequencing reads | FASTQ | Need quality scores for filtering |
Reference genomes | FASTA | No quality information needed |
Assembly output | FASTA | Consensus sequences |
Database searches | FASTA | Standard for BLAST databases |
Alignment Data Formats: Mapping Reads to Genomes
Once sequencing reads are generated, they must be aligned to reference genomes. Alignment formats store this crucial mapping information with varying levels of compression and accessibility.
SAM: The Human-Readable Alignment Standard
The Sequence Alignment/Map (SAM) format provides a comprehensive, text-based representation of alignments.
SAM File Structure:
@HD VN:1.6 SO:coordinate
@SQ SN:chr1 LN:248956422
@PG ID:bwa PN:bwa VN:0.7.17-r1188
@RG ID:sample1 SM:patient_001 PL:ILLUMINA
M00967:43:000000000-A3JHG:1:1101:18327:1699 99 chr1 1000 60 150M = 1150 300 CCTACGGGNGGCWGCAG... A1>1>11#-1>11<-<11... AS:i:145 XS:i:20
Header Section (@-lines):
- @HD: File format version and sort order
- @SQ: Reference sequence information
- @RG: Read group information (sample, library, platform)
- @PG: Program information used for alignment
Alignment Records (11 mandatory fields):
- QNAME: Read identifier
- FLAG: Bitwise flag indicating alignment properties
- RNAME: Reference sequence name (chromosome)
- POS: 1-based leftmost alignment position
- MAPQ: Mapping quality score
- CIGAR: Concise alignment representation
- RNEXT: Reference name of mate/next read
- PNEXT: Position of mate/next read
- TLEN: Template length
- SEQ: Read sequence
- QUAL: ASCII-encoded read quality
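The FLAG field packs several yes/no properties into one integer. A small sketch of decoding it is shown here; the bit values follow the SAM specification, while the short names and the `decode_flag` helper are illustrative choices.

```python
# Bit values as defined in the SAM specification; names are shortened labels.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> list[str]:
    """Return the names of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


print(decode_flag(99))
# ['paired', 'proper_pair', 'mate_reverse_strand', 'first_in_pair']
```

The value 99 from the example record is 1 + 2 + 32 + 64, which describes the first read of a properly paired fragment whose mate maps to the reverse strand.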
CIGAR String Interpretation:
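- M: Alignment match (the base may match or mismatch the reference)
- I: Insertion relative to the reference (bases present only in the read)
- D: Deletion relative to the reference (reference bases absent from the read)
- S: Soft clipping (clipped bases remain in SEQ)
- H: Hard clipping (clipped bases are removed from SEQ)
- N: Skipped reference region (e.g., an intron in spliced alignments)
Example: 50M2I25M
= 50 aligned bases, a 2-base insertion, then 25 aligned bases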
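To make the CIGAR arithmetic concrete, here is a hedged sketch that parses a CIGAR string and reports how many read bases and reference bases it covers. Which operations consume the read or the reference follows the SAM specification; the helper names are made up for this example.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")      # operations that advance along the read
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Rebuild the string to reject anything the pattern skipped over.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar: str) -> tuple[int, int]:
    """Return (read bases consumed, reference bases consumed)."""
    ops = parse_cigar(cigar)
    query = sum(n for n, op in ops if op in CONSUMES_QUERY)
    reference = sum(n for n, op in ops if op in CONSUMES_REFERENCE)
    return query, reference


print(cigar_lengths("50M2I25M"))  # (77, 75)
```

The read in the example is 77 bases long but spans only 75 reference positions, because the two inserted bases have no counterpart in the reference.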
BAM: Compressed Binary Alignments
BAM format provides the same information as SAM but in a compressed, binary format optimized for computational efficiency.
Key Advantages:
- File Size: 60-80% smaller than equivalent SAM files
- Processing Speed: Faster parsing and processing
- Random Access: Efficient retrieval of specific genomic regions
- Compression: Built-in BGZF (blocked gzip) compression
BAM Usage Examples:
# Convert SAM to BAM (input format is auto-detected; -S is no longer needed)
samtools view -b -o alignment.bam alignment.sam
# Sort BAM file by coordinates
samtools sort -o alignment_sorted.bam alignment.bam
# Index BAM for random access
samtools index alignment_sorted.bam
# Extract reads from specific region
samtools view alignment_sorted.bam chr1:1000000-2000000
BAI: BAM Index Files
BAI files enable rapid random access to specific genomic regions within BAM files.
Index Structure:
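- Binning Index: Hierarchical bins (from 512 Mb down to 16 kb) that list the file chunks holding the alignments assigned to each bin
- Linear Index: File offsets for consecutive 16 kb windows, used to skip chunks that end before the query region
- Metadata: Optional per-reference counts of mapped and unmapped reads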
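The bin that an alignment falls into can be computed directly from its coordinates. The sketch below is a Python port of the `reg2bin` routine given in the SAM specification; it assumes 0-based, half-open coordinates and the standard BAI layout with 16 kb leaf bins.

```python
def reg2bin(beg: int, end: int) -> int:
    """Return the smallest BAI bin that fully contains [beg, end).

    Coordinates are 0-based and half-open. The offsets 4681, 585, 73, 9 and 1
    are the first bin numbers of the 16 kb, 128 kb, 1 Mb, 8 Mb and 64 Mb levels.
    """
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0


print(reg2bin(0, 100))        # 4681: fits in the first 16 kb leaf bin
print(reg2bin(16000, 17000))  # 585: crosses a 16 kb boundary, so one level up
```

Short reads usually land in a leaf bin, while alignments that straddle a window boundary are stored one or more levels higher, which is why a region query has to inspect bins at every level.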
Critical Considerations:
- BAI files must be regenerated after any BAM file modification
- Index files should be stored alongside BAM files
- Coordinate-sorted BAM files are required for indexing
CRAM: Ultra-Compressed Alignments
CRAM format offers superior compression by using reference-based compression algorithms.
Compression Benefits:
- Size Reduction: 30-60% smaller than BAM files
- Lossless: Maintains all alignment information by default (lossy options exist but must be enabled explicitly)
- Reference-Based: Stores only differences from reference genome
CRAM Usage Scenarios:
- Long-term data archiving
- Large-scale population genomics projects
- Cloud storage optimization
- Bandwidth-limited data transfers
CRAM Workflow Example:
# Convert BAM to CRAM
samtools view -C -T reference.fa alignment.bam > alignment.cram
# Index CRAM file
samtools index alignment.cram
# Convert CRAM back to BAM
samtools view -b -T reference.fa alignment.cram > alignment_restored.bam
Quantification and Expression Data Formats
Gene expression analysis requires specialized formats to store quantified measurements across samples and conditions. These formats balance human readability with computational efficiency.
Count Matrices: The Foundation of Expression Analysis
Count matrices represent the core data structure for RNA-seq and single-cell analyses.
Tab-Separated Values (TSV) Format:
Gene_ID Sample_1 Sample_2 Sample_3 Sample_4
ENSG00000000003 743 891 1205 567
ENSG00000000005 0 2 1 0
ENSG00000000419 1891 2103 2456 1678
ENSG00000000457 567 634 723 445
ENSG00000000460 89 123 156 67
Comma-Separated Values (CSV) Format:
Gene_ID,Sample_1,Sample_2,Sample_3,Sample_4
ENSG00000000003,743,891,1205,567
ENSG00000000005,0,2,1,0
ENSG00000000419,1891,2103,2456,1678
Best Practices for Count Matrices:
- Use gene IDs (Ensembl, RefSeq) rather than gene names for consistency
- Include metadata files describing samples and experimental conditions
- Validate that row and column totals match expected values
- Store raw counts separately from normalized values
Normalized Expression Tables
Normalization addresses technical biases and enables meaningful comparisons between samples.
TPM (Transcripts Per Million) Table:
Gene_ID Gene_Length Sample_1_TPM Sample_2_TPM Sample_3_TPM
ENSG00000000003 2100 354.1 424.5 573.8
ENSG00000000005 1500 0.0 1.3 0.7
ENSG00000000419 3200 591.6 657.2 768.1
FPKM/RPKM Comparison:
- TPM: Transcripts Per Million – values sum to 1 million per sample
- FPKM: Fragments Per Kilobase of transcript per Million mapped fragments – for paired-end RNA-seq
- RPKM: Reads Per Kilobase of transcript per Million mapped reads – for single-end RNA-seq
When to Use Each Format:
- TPM: Cross-sample comparisons and meta-analyses
- FPKM/RPKM: Within-sample gene length normalization
- Raw Counts: Differential expression analysis with DESeq2/edgeR
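The difference between the two normalizations is easiest to see in code. This is a simplified sketch that works from raw counts and gene lengths in base pairs; real tools use effective lengths and handle multi-mapping reads, which is ignored here, and the function names are illustrative.

```python
def tpm(counts: list[float], lengths_bp: list[float]) -> list[float]:
    """Length-normalize first, then scale so the sample sums to one million."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    return [rate / total * 1e6 for rate in rates]


def fpkm(counts: list[float], lengths_bp: list[float]) -> list[float]:
    """Fragments per kilobase of transcript per million counted fragments."""
    millions = sum(counts) / 1e6
    return [c / (length / 1e3) / millions for c, length in zip(counts, lengths_bp)]


counts = [743, 0, 1891]          # toy counts for three genes in one sample
lengths = [2100, 1500, 3200]     # gene lengths in base pairs
print([round(v, 1) for v in tpm(counts, lengths)])
print([round(v, 1) for v in fpkm(counts, lengths)])
```

TPM values always sum to one million within a sample, whereas the FPKM total varies from sample to sample, which is one reason TPM is preferred when comparing across samples.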
Single-Cell Specific Formats
Single-cell RNA-seq generates sparse, high-dimensional datasets requiring specialized storage formats.
Matrix Market (MTX) Format:
%%MatrixMarket matrix coordinate integer general
32738 5000 8934756
1 1 4
1 3 1
2 1 2
2 2 8
Format Structure:
- Line 1: Header with format information
- Line 2: Matrix dimensions (genes, cells, non-zero entries)
- Subsequent lines: Row, column, value triplets
Associated Files:
- features.tsv: Gene IDs and symbols
- barcodes.tsv: Cell barcode sequences
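The coordinate layout is simple enough to read without extra libraries. The following sketch parses Matrix Market coordinate text into a dictionary keyed by 0-based (row, column) pairs; it assumes integer values and uses a tiny made-up matrix rather than a real dataset. In practice `scipy.io.mmread` is the usual choice.

```python
def read_mtx(lines):
    """Parse Matrix Market coordinate text.

    Returns (n_rows, n_cols, entries), where entries maps 0-based
    (row, column) pairs to integer values.
    """
    stripped = (line.strip() for line in lines)
    body = [line for line in stripped if line and not line.startswith("%")]
    n_rows, n_cols, n_entries = map(int, body[0].split())
    entries = {}
    for line in body[1:]:
        row, col, value = line.split()
        # Matrix Market indices are 1-based; shift to 0-based.
        entries[(int(row) - 1, int(col) - 1)] = int(value)
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return n_rows, n_cols, entries


example = """%%MatrixMarket matrix coordinate integer general
3 3 4
1 1 4
1 3 1
2 1 2
2 2 8
""".splitlines()

print(read_mtx(example))
# (3, 3, {(0, 0): 4, (0, 2): 1, (1, 0): 2, (1, 1): 8})
```

Only non-zero entries are stored, so a matrix with mostly zero counts takes far less space than the equivalent dense table.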
HDF5 and AnnData Formats
Modern single-cell analysis increasingly relies on hierarchical data formats.
HDF5 (.h5) Structure:
/
├── matrix/
│ ├── data
│ ├── indices
│ └── indptr
├── features/
│ ├── id
│ ├── name
│ └── feature_type
└── barcodes
AnnData (.h5ad) Components:
- X: Primary data matrix (cells × genes, i.e. observations × variables)
- obs: Cell metadata (cell type, cluster, etc.)
- var: Gene metadata (gene symbols, biotype)
- obsm: Multi-dimensional cell annotations
- varm: Multi-dimensional gene annotations
- uns: Unstructured metadata
Usage Example:
import scanpy as sc
import pandas as pd
# Load AnnData object
adata = sc.read_h5ad('single_cell_data.h5ad')
# Access expression matrix
expression = adata.X
# Access metadata
cell_metadata = adata.obs
gene_metadata = adata.var
Loom Format: Comprehensive Single-Cell Storage
Loom format provides a self-contained solution for single-cell genomics data.
Loom File Structure:
/
├── matrix (main data matrix)
├── row_attrs/
│ ├── Gene (gene symbols)
│ └── Accession (gene IDs)
├── col_attrs/
│ ├── CellID (cell barcodes)
│ └── CellType (annotations)
└── row_graphs/ (gene-gene relationships)
Advantages:
- Cross-platform compatibility
- Efficient sparse matrix storage
- Built-in metadata management
- Support for graphs and hierarchical relationships
R Data Formats
R-based analysis workflows commonly use native R storage formats.
RDS Format (.rds):
# Save single R object
expression_matrix <- read.csv("counts.csv", row.names=1)
saveRDS(expression_matrix, "expression_data.rds")
# Load RDS object
loaded_data <- readRDS("expression_data.rds")
RData Format (.rda/.RData):
# Save multiple R objects
sample_metadata <- read.csv("metadata.csv")
gene_annotations <- read.csv("genes.csv")
save(expression_matrix, sample_metadata, gene_annotations,
file="complete_dataset.RData")
# Load all objects
load("complete_dataset.RData")
Variant Data Formats: Capturing Genetic Diversity
Variant calling identifies genetic differences between sequenced samples and reference genomes. These formats must efficiently store diverse types of genetic variation while maintaining compatibility with analysis tools.
VCF: The Variant Call Format Standard
VCF format serves as the gold standard for storing genetic variants, from single nucleotide polymorphisms to complex structural variations.
VCF File Structure:
##fileformat=VCFv4.2
##reference=hg38
##contig=<ID=chr1,length=248956422>
##contig=<ID=chr2,length=242193529>
##FILTER=<ID=LowQual,Description="Low quality call">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2
chr1 1000 rs123456 A G 99.0 PASS DP=50 GT:AD:DP 0/1:25,25:50 1/1:0,48:48
chr1 2000 . T C 45.2 LowQual DP=15 GT:AD:DP 0/0:15,0:15 0/1:8,7:15
chr2 3000 . GTC G 87.5 PASS DP=32 GT:AD:DP 0/1:16,16:32 0/1:18,14:32
Header Section (##-lines):
- fileformat: VCF version specification
- reference: Reference genome used
- contig: Chromosome/contig information
- INFO: Variant-level annotation descriptions
- FORMAT: Sample-level field descriptions
Variant Records (8 mandatory + sample columns):
- CHROM: Chromosome identifier
- POS: 1-based position
- ID: Variant identifier (e.g., dbSNP ID)
- REF: Reference allele
- ALT: Alternative allele(s)
- QUAL: Quality score
- FILTER: Filter status
- INFO: Variant annotations
- FORMAT: Sample data format (required only when sample columns are present)
- Sample columns (column 10 onward): Genotype and related data, one column per sample
Genotype Encoding:
- 0/0: Homozygous reference
- 0/1: Heterozygous
- 1/1: Homozygous alternative
- ./.: Missing genotype
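A sample column only makes sense when read together with the FORMAT column. The sketch below pairs the two and classifies the genotype; it handles diploid calls with either the unphased `/` or the phased `|` separator, and the function names are illustrative.

```python
def parse_sample(format_field: str, sample_field: str) -> dict[str, str]:
    """Pair FORMAT keys with the values of one sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def genotype_class(gt: str) -> str:
    """Classify a GT string such as 0/1, 1|1 or ./."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous_reference" if alleles[0] == "0" else "homozygous_alternate"


sample = parse_sample("GT:AD:DP", "0/1:25,25:50")
print(sample)                        # {'GT': '0/1', 'AD': '25,25', 'DP': '50'}
print(genotype_class(sample["GT"]))  # heterozygous
```

A genotype such as 1/2 at a multi-allelic site is also reported as heterozygous, since the two alleles differ even though neither is the reference.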
Complex Variant Examples:
# Single nucleotide variant (SNV)
chr1 1000 . A G 99.0 PASS . GT 0/1
# Insertion
chr1 2000 . T TAGA 87.5 PASS . GT 0/1
# Deletion
chr1 3000 . ATCG A 92.3 PASS . GT 1/1
# Multi-allelic site
chr1 4000 . G A,T 78.4 PASS . GT 1/2
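The variant type in these records is implied by the REF and ALT strings rather than stored explicitly. As a rough sketch, a length comparison is enough to separate the cases shown above; this deliberately simplified rule does not handle symbolic alleles, breakends, or complex substitutions, and the labels are this example's own.

```python
def classify_allele(ref: str, alt: str) -> str:
    """Classify one REF/ALT pair by comparing allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) == len(alt):
        return "MNV"  # multi-nucleotide substitution
    return "insertion" if len(alt) > len(ref) else "deletion"


def classify_record(ref: str, alt_field: str) -> list[str]:
    """Classify every comma-separated ALT allele of a record."""
    return [classify_allele(ref, alt) for alt in alt_field.split(",")]


print(classify_record("A", "G"))     # ['SNV']
print(classify_record("T", "TAGA"))  # ['insertion']
print(classify_record("ATCG", "A"))  # ['deletion']
print(classify_record("G", "A,T"))   # ['SNV', 'SNV']
```

Note that indels carry the preceding reference base in both REF and ALT, so the insertion at position 2000 adds AGA after the T rather than replacing it.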
BCF: Binary Variant Call Format
BCF provides a compressed, binary representation of VCF data optimized for computational processing.
Key Advantages:
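- Performance: Considerably faster to parse than text VCF, because fields are stored as typed binary values
- Size: 50-70% smaller file sizes than uncompressed VCF
- Indexing: Efficient random access through a CSI index created with bcftools index
- Precision: Maintains full numerical precision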
BCF Workflow:
# Convert VCF to BCF
bcftools view -Ob variants.vcf > variants.bcf
# Index BCF file
bcftools index variants.bcf
# Query specific region
bcftools view variants.bcf chr1:1000000-2000000
# Convert back to VCF
bcftools view variants.bcf > variants_restored.vcf
MAF: Mutation Annotation Format
MAF format specializes in storing somatic mutations with rich clinical and functional annotations.
MAF File Example:
Hugo_Symbol Variant_Classification Tumor_Sample_Barcode HGVSp HGVSc Chromosome Start_Position End_Position Reference_Allele Tumor_Seq_Allele2
TP53 Missense_Mutation TCGA-AA-A00A-01 p.R175H c.524G>A chr17 7675088 7675088 C T
KRAS Missense_Mutation TCGA-AA-A00A-01 p.G12D c.35G>A chr12 25245350 25245350 C T
PIK3CA Missense_Mutation TCGA-BB-B00B-01 p.E545K c.1633G>A chr3 179218303 179218303 G A
Critical MAF Fields:
- Hugo_Symbol: Gene symbol
- Variant_Classification: Functional impact (Missense, Nonsense, etc.)
- Tumor_Sample_Barcode: Sample identifier
- HGVSp/HGVSc: Protein and coding sequence notation
- Reference_Allele/Tumor_Seq_Allele2: Variant alleles
MAF Use Cases:
- Cancer genomics analysis
- Mutation burden calculations
- Pathway enrichment analysis
- Clinical data integration
- Survival analysis correlation
BEDPE: Paired-End Breakpoint Format
BEDPE format stores structural variants and breakpoint information from paired-end sequencing.
BEDPE Structure:
chr1 1000 2000 chr1 5000 6000 variant_1 100 + - deletion
chr2 3000 3500 chr3 7000 7500 variant_2 150 + + translocation
chrX 8000 8200 chrX 9000 9200 variant_3 80 - - inversion
BEDPE Fields:
- chrom1, start1, end1: First breakpoint
- chrom2, start2, end2: Second breakpoint
- name: Variant identifier
- score: Confidence score
- strand1, strand2: Breakpoint orientations
- type: Structural variant type (a user-defined extra column beyond the ten standard BEDPE fields)
Structural Variant Types:
- Deletion: Loss of genomic sequence
- Duplication: Copy number increase
- Inversion: Sequence orientation reversal
- Translocation: Inter-chromosomal rearrangement
- Insertion: Novel sequence addition
Copy Number Variation Formats
CNVkit generates specialized formats for copy number analysis.
CNR (Copy Number Ratio) Format:
chromosome start end gene log2 depth weight
chr1 1000000 1001000 GENE1 -0.15 125.4 0.95
chr1 1001000 1002000 GENE1 0.23 138.2 0.98
chr1 1002000 1003000 GENE2 1.45 156.7 0.92
CNS (Copy Number Segment) Format:
chromosome start end gene log2 cn depth p_ttest probes weight
chr1 1000000 1500000 GENE1,GENE2 0.12 2 142.3 0.001 500 0.94
chr1 1500000 2000000 GENE3 1.00 4 165.8 0.000 250 0.96
Interpretation:
- log2 ratio: Copy number relative to diploid (log2(cn/2))
- cn: Absolute copy number
- p_ttest: Statistical significance of segment
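The relationship between the log2 ratio and the absolute copy number can be written out directly. This is a minimal sketch assuming a pure diploid sample; tumor purity and subclonality, which real callers correct for, are ignored, and the function names are illustrative.

```python
import math


def log2_ratio(copy_number: float, ploidy: int = 2) -> float:
    """log2 ratio expected for a given absolute copy number."""
    return math.log2(copy_number / ploidy)


def copy_number(log2: float, ploidy: int = 2) -> float:
    """Absolute copy number implied by a log2 ratio."""
    return ploidy * 2 ** log2


print(log2_ratio(4))               # 1.0  (two extra copies)
print(log2_ratio(1))               # -1.0 (single-copy loss)
print(round(copy_number(0.12)))    # 2    (small deviations round to neutral)
```

A log2 ratio of 0 therefore means two copies, +1 means four copies, and -1 means one copy.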
Annotation and Feature Data Formats
Genomic annotations connect sequence data to biological knowledge, providing essential context for interpreting analysis results. These formats must efficiently represent diverse genomic features while maintaining compatibility with analysis tools.
GTF/GFF: Gene Transfer Format and General Feature Format
GTF and GFF formats store comprehensive gene structure and functional annotations.
GTF Format Example:
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript";
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1"; exon_id "ENSE00002234944";
chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "2"; exon_id "ENSE00003582793";
GTF Field Descriptions:
- seqname: Chromosome/scaffold identifier
- source: Annotation source (ENSEMBL, RefSeq, etc.)
- feature: Feature type (gene, transcript, exon, CDS)
- start/end: 1-based genomic coordinates
- score: Confidence score (optional)
- strand: + (forward) or - (reverse)
- frame: Reading frame offset (0, 1, or 2) for CDS features; "." for other features
- attributes: Semicolon-separated key-value pairs
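The nine columns and the attribute string can be parsed with a few lines of code. The sketch below assumes tab-separated columns and quoted attribute values as in the example above; unquoted values and repeated keys (such as multiple `tag` entries) would need extra handling, and `parse_gtf_line` is an illustrative name.

```python
import re

# Matches one `key "value";` pair in the GTF attributes column.
ATTRIBUTE_RE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_line(line: str) -> dict:
    """Split one GTF record into its nine columns and an attribute dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": dict(ATTRIBUTE_RE.findall(attributes)),
    }


record = parse_gtf_line(
    'chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\t'
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1";'
)
print(record["attributes"]["gene_id"])      # ENSG00000223972
print(record["end"] - record["start"] + 1)  # 359 bases, since both ends are inclusive
```

Because both coordinates are inclusive, feature length is end minus start plus one, which differs from the BED convention.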
Common Feature Types:
- gene: Complete gene locus
- transcript: Individual transcript isoform
- exon: Transcribed regions
- CDS: Protein-coding sequences
- UTR: Untranslated regions
- start_codon/stop_codon: Translation boundaries
GFF3 Enhanced Features:
chr1 RefSeq gene 11874 14409 . + . ID=gene1;Name=DDX11L1;Dbxref=GeneID:100287102
chr1 RefSeq mRNA 11874 14409 . + . ID=rna1;Parent=gene1;Name=NR_046018.2
chr1 RefSeq exon 11874 12227 . + . ID=exon1;Parent=rna1
chr1 RefSeq exon 12613 12721 . + . ID=exon2;Parent=rna1
GFF3 vs GTF Comparison:
- GFF3: Hierarchical relationships with ID/Parent structure
- GTF: Flat structure with shared gene_id/transcript_id
- GFF3: More flexible attribute system
- GTF: Simpler parsing for RNA-seq workflows
BED: Browser Extensible Data Format
BED format provides a simple, flexible way to represent genomic intervals and annotations.
BED Format Variants:
BED3 (Minimal):
chr1 1000 2000
chr1 5000 6000
chr2 3000 4000
BED6 (Standard):
chr1 1000 2000 feature1 100 +
chr1 5000 6000 feature2 200 -
chr2 3000 4000 feature3 150 +
BED12 (Full):
chr1 1000 5000 gene1 1000 + 1200 4800 255,0,0 2 800,600 0,3400
BED Field Descriptions:
- chrom: Chromosome name
- chromStart: 0-based start position (inclusive)
- chromEnd: End position, exclusive (half-open interval; numerically the same as the 1-based inclusive end)
- name: Feature identifier
- score: Display score (0-1000)
- strand: Orientation
- thickStart/thickEnd: Coding region boundaries
- itemRgb: RGB color values
- blockCount: Number of sub-features
- blockSizes: Comma-separated block sizes
- blockStarts: Relative block start positions
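The block columns of a BED12 record can be expanded into absolute intervals. The following sketch does this for whitespace-separated BED12 text and also shows the conversion to 1-based inclusive coordinates; the function names are illustrative.

```python
def bed12_blocks(line: str) -> list[tuple[str, int, int]]:
    """Expand a BED12 record into absolute 0-based, half-open block intervals."""
    fields = line.split()
    chrom, chrom_start = fields[0], int(fields[1])
    sizes = [int(x) for x in fields[10].strip(",").split(",")]
    starts = [int(x) for x in fields[11].strip(",").split(",")]
    if not (int(fields[9]) == len(sizes) == len(starts)):
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    return [
        (chrom, chrom_start + offset, chrom_start + offset + size)
        for offset, size in zip(starts, sizes)
    ]


def to_one_based(start0: int, end0: int) -> tuple[int, int]:
    """Convert a 0-based half-open interval to 1-based inclusive coordinates."""
    return start0 + 1, end0


blocks = bed12_blocks("chr1 1000 5000 gene1 1000 + 1200 4800 255,0,0 2 800,600 0,3400")
print(blocks)                       # [('chr1', 1000, 1800), ('chr1', 4400, 5000)]
print(to_one_based(*blocks[0][1:])) # (1001, 1800)
```

With half-open intervals the length of a block is simply end minus start, and the last block must end exactly at chromEnd.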
BED Use Cases:
- ChIP-seq peak regions
- Gene promoter definitions
- Regulatory element annotations
- Copy number variation segments
- Structural variant breakpoints
Peak Calling Formats
ChIP-seq and ATAC-seq analyses generate specialized peak formats.
narrowPeak Format:
chr1 1000 2000 peak1 100 . 5.2 10.1 8.3 500
chr1 5000 6000 peak2 200 . 7.8 15.2 12.1 400
chr2 3000 4000 peak3 150 . 6.1 12.5 9.8 300
narrowPeak Fields:
- chrom, chromStart, chromEnd: Standard BED3 fields (columns 1-3)
- name: Peak identifier
- score: Integer score (0-1000)
- strand: Orientation (usually ‘.’)
- signalValue: Signal enrichment
- pValue: -log10(p-value)
- qValue: -log10(q-value)
- peak: 0-based offset of the peak summit from chromStart (-1 if no summit was called)
broadPeak Format:
chr1 1000 5000 region1 100 . 5.2 10.1 8.3
chr1 10000 15000 region2 200 . 7.8 15.2 12.1
broadPeak Differences:
- No peak summit column (column 10)
- Represents broader enrichment regions
- Suitable for histone modifications (H3K27me3, H3K36me3)
BigBED: Indexed Binary BED
BigBED format provides efficient storage and random access for large BED datasets.
BigBED Advantages:
- Compression: 70-90% size reduction
- Indexing: Rapid region-based queries
- Scalability: Handles millions of features efficiently
- Browser Integration: Direct UCSC Genome Browser loading
BigBED Creation:
# Sort BED file by chromosome and position
sort -k1,1 -k2,2n input.bed > sorted.bed
# Get chromosome sizes
fetchChromSizes hg38 > hg38.chrom.sizes
# Convert BED to BigBED
bedToBigBed sorted.bed hg38.chrom.sizes output.bb
# Query specific region from BigBED
bigBedToBed output.bb -chrom=chr1 -start=1000000 -end=2000000 stdout
WIG and BigWig: Continuous Signal Data
Wiggle (WIG) and BigWig formats store continuous numerical data across genomic coordinates, essential for visualizing signal tracks.
WIG Format Types:
Variable Step WIG:
track type=wiggle_0 name="Sample1_Coverage" description="Coverage track"
variableStep chrom=chr1
1001 5.2
1002 5.8
1003 6.1
1005 4.9
1010 7.3
Fixed Step WIG:
track type=wiggle_0 name="Sample1_Coverage"
fixedStep chrom=chr1 start=1001 step=1
5.2
5.8
6.1
0.0
4.9
BedGraph Format (WIG alternative):
track type=bedGraph name="Sample1_Coverage"
chr1 1000 1001 5.2
chr1 1001 1002 5.8
chr1 1002 1003 6.1
chr1 1004 1005 4.9
chr1 1009 1010 7.3
Format Comparison:
- Variable Step: Sparse data with irregular intervals
- Fixed Step: Dense data with regular intervals
- BedGraph: Most flexible, handles any interval structure
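WIG positions are 1-based while bedGraph intervals are 0-based and half-open, so converting between them involves a coordinate shift. Here is a small sketch that converts variableStep data to bedGraph tuples; it supports the optional `span` parameter but not fixedStep sections, and the function name is illustrative.

```python
def variable_step_to_bedgraph(lines):
    """Convert variableStep WIG lines to (chrom, start, end, value) tuples.

    WIG positions are 1-based; the output uses 0-based, half-open intervals.
    """
    chrom, span = None, 1
    intervals = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        if line.startswith("fixedStep"):
            raise ValueError("fixedStep sections are not handled in this sketch")
        if line.startswith("variableStep"):
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            continue
        if chrom is None:
            raise ValueError("data line found before a variableStep declaration")
        position, value = line.split()
        start0 = int(position) - 1
        intervals.append((chrom, start0, start0 + span, float(value)))
    return intervals


wig = """track type=wiggle_0 name="Sample1_Coverage"
variableStep chrom=chr1
1001 5.2
1005 4.9
1010 7.3
""".splitlines()

for interval in variable_step_to_bedgraph(wig):
    print(*interval, sep="\t")
```

The WIG position 1001 becomes the bedGraph interval 1000 to 1001, matching the bedGraph example above.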
BigWig Advantages:
- Performance: Indexed random access, typically far faster than scanning a text WIG file
- Compression: Significant size reduction
- Multi-resolution: Automatic data summarization at different zoom levels
- Streaming: Efficient data transfer over networks
BigWig Creation and Usage:
# Convert bedGraph to BigWig
bedGraphToBigWig coverage.bedGraph hg38.chrom.sizes coverage.bw
# Extract signal from specific region
bigWigToBedGraph coverage.bw -chrom=chr1 -start=1000000 -end=2000000 stdout
# Calculate summary statistics
bigWigSummary coverage.bw chr1 1000000 2000000 100
Common BigWig Applications:
- ChIP-seq signal tracks
- RNA-seq coverage visualization
- ATAC-seq accessibility profiles
- Hi-C interaction frequencies
- Methylation percentage tracks
Specialized NGS Data Formats
Advanced NGS applications have developed specialized formats to handle unique data types and optimize storage for specific use cases.
Compressed and Indexed Formats
VCF.gz + TBI: Tabix-Indexed Variants
Tabix indexing enables efficient random access to compressed VCF files.
# Compress and index VCF
bgzip variants.vcf
tabix -p vcf variants.vcf.gz
# Query specific region
tabix variants.vcf.gz chr1:1000000-2000000
# Multiple region query
tabix variants.vcf.gz chr1:100000-200000 chr2:300000-400000
Index File Structure:
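- Binning index: Hierarchical bins (the smallest spanning 16 kb) that list the compressed file chunks overlapping each bin
- Linear index: File offsets for consecutive 16 kb windows, used to skip chunks that end before the query region
- Metadata: Sequence names, column configuration, and file offsets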
Tabix Advantages:
- Works with any coordinate-sorted, tab-delimited format
- Minimal memory footprint for large files
- Supports multiple simultaneous queries
- Network streaming compatibility
Sequencing Archive Formats
SRA: Sequence Read Archive
SRA format serves as the primary archive format for public sequencing data repositories.
SRA File Structure:
sample.sra
├── Metadata (experiment design, sample info)
├── Read data (sequences and qualities)
├── Alignment data (optional)
└── Analysis data (optional)
SRA Toolkit Usage:
# Download SRA file
prefetch SRR1234567
# Convert to FASTQ
fasterq-dump SRR1234567
# Split paired-end reads
fasterq-dump --split-files SRR1234567
# Dump alignments from a region (only for runs submitted with alignment data)
sam-dump --aligned-region chr1:1000000-2000000 SRR1234567
SRA Advantages:
- Comprehensive metadata storage
- Efficient compression algorithms
- Quality score optimization
- International standard for data sharing
Long-Read Specific Formats
FAST5: Nanopore Raw Signal Data
FAST5 format stores raw electrical current measurements from Oxford Nanopore sequencing.
FAST5 HDF5 Structure:
/
├── UniqueGlobalKey/
│   ├── channel_id/
│   ├── context_tags/
│   ├── tracking_id/
│   └── sampling_rate
├── Raw/
│   └── Reads/
│       └── Read_[number]/
│           ├── Signal (raw current values)
│           └── Signal_metadata
└── Analyses/
    ├── Basecall_1D_[version]/
    │   ├── BaseCalled_template/
    │   │   ├── Fastq
    │   │   └── Events
    │   └── Summary/
    └── EventDetection_[version]/
FAST5 Data Components:
- Raw Signal: Current measurements at a fixed sampling rate (4000 Hz on many DNA runs; the actual value is recorded in sampling_rate)
- Event Data: Segmented signal regions
- Basecalls: Sequence calls with quality scores
- Metadata: Pore information, chemistry, temperature
FAST5 Analysis Tools:
# Convert multi-read FAST5 files to single-read FAST5 files (ont_fast5_api)
multi_to_single_fast5 --input_path multi_reads/ --save_path single_reads/ --recursive
# Basecall with Guppy
guppy_basecaller --input_path fast5_dir/ --save_path output_dir/ --config dna_r9.4.1_450bps_hac.cfg
# Extract signal data
h5dump -d /read_12345/Raw/Signal sample.fast5
M5/PBI: PacBio Data Formats
PacBio generates specialized formats for long-read sequencing data.
PacBio BAM Structure:
- Standard BAM alignment records
- Extended tags for PacBio-specific information
- Pulse-level data (optional)
PBI Index Fields:
# PacBio BAM Index (.pbi)
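- Read group ID, hole number (ZMW), and query start/end for each record
- Read quality scores
- Reference ID and aligned positions (for mapped BAM files)
- Barcode data (if multiplexed)
- File offsets for direct access to individual records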
PacBio Analysis Example:
# Filter records by a BAM tag (here qs, the query start within the ZMW read)
bamtools filter -in pacbio.bam -out filtered.bam -tag "qs:>750"
# Generate consensus sequences (the pbccs package installs the ccs executable)
ccs --min-passes 3 --min-rq 0.99 pacbio.bam consensus.bam
# Polish assembly with long reads (variantCaller needs a sorted, indexed BAM)
pbmm2 align --sort reference.fa pacbio.bam aligned.bam
variantCaller --algorithm=arrow aligned.bam -r reference.fa -o polished.fa
Configuration and Metadata Formats
JSON: JavaScript Object Notation
JSON format provides flexible metadata storage for NGS workflows.
Sample Metadata JSON:
{
  "experiment_id": "EXP001",
  "samples": [
    {
      "sample_id": "SAMPLE_001",
      "condition": "control",
      "replicate": 1,
      "library_prep": "TruSeq",
      "sequencing_depth": 30000000,
      "quality_metrics": {
        "mean_quality": 35.2,
        "percent_duplicates": 12.3,
        "mapping_rate": 94.5
      }
    },
    {
      "sample_id": "SAMPLE_002",
      "condition": "treatment",
      "replicate": 1,
      "library_prep": "TruSeq",
      "sequencing_depth": 28500000,
      "quality_metrics": {
        "mean_quality": 34.8,
        "percent_duplicates": 15.1,
        "mapping_rate": 93.2
      }
    }
  ],
  "analysis_parameters": {
    "aligner": "bwa-mem",
    "peak_caller": "macs2",
    "fdr_threshold": 0.05
  }
}
YAML: Human-Readable Configuration
YAML provides an alternative to JSON with improved readability.
Pipeline Configuration YAML:
# NGS Analysis Pipeline Configuration
pipeline:
  name: "ChIP-seq Analysis"
  version: "1.2.0"
reference:
  genome: "hg38"
  index_path: "/data/genomes/hg38/bwa_index"
  annotation: "/data/annotations/gencode.v38.gtf"
quality_control:
  adapter_trimming: true
  quality_threshold: 20
  minimum_length: 30
alignment:
  tool: "bwa"
  parameters:
    - "-M"
    - "-t 16"
peak_calling:
  tool: "macs2"
  parameters:
    fdr: 0.05
    fold_change: 2.0
samples:
  - name: "ChIP_sample1"
    files:
      - "sample1_R1.fastq.gz"
      - "sample1_R2.fastq.gz"
    condition: "treatment"
  - name: "Input_control1"
    files:
      - "input1_R1.fastq.gz"
      - "input1_R2.fastq.gz"
    condition: "control"
Configuration Format Benefits:
- Reproducibility: Document analysis parameters
- Automation: Drive pipeline execution
- Version Control: Track parameter changes
- Collaboration: Share analysis protocols
Best Practices for NGS Data Management
File Organization and Naming Conventions
Hierarchical Directory Structure:
project_root/
├── raw_data/
│   ├── sample_001_R1.fastq.gz
│   ├── sample_001_R2.fastq.gz
│   └── checksums.md5
├── processed/
│   ├── trimmed/
│   ├── aligned/
│   └── quantified/
├── analysis/
│   ├── differential_expression/
│   ├── pathway_analysis/
│   └── figures/
├── metadata/
│   ├── sample_sheet.csv
│   └── experimental_design.yaml
└── scripts/
    ├── preprocessing.sh
    └── analysis.R
Naming Convention Examples:
# Good naming practices
ChIPseq_USF2_HepG2_rep1_treat_001.fastq.gz
RNAseq_WT_brain_12h_rep2_control_R1.fastq.gz
WGS_patient_001_tumor_primary_001.bam
# Include key information:
- Assay type (ChIPseq, RNAseq, WGS)
- Target/condition (USF2, WT, patient_001)
- Sample type (HepG2, brain, tumor)
- Time point (12h)
- Replicate (rep1, rep2)
- Condition (treat, control)
- Read pair (R1, R2)
Data Integrity and Quality Control
Checksum Verification:
# Generate checksums during data transfer
md5sum *.fastq.gz > checksums.md5
# Verify file integrity
md5sum -c checksums.md5
# For stronger guarantees, use SHA-256 (slower than MD5, but collision-resistant)
sha256sum large_file.bam > large_file.sha256
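The same check can be scripted so that it becomes part of a pipeline. This is a hedged sketch using Python's standard `hashlib`; it reads the file in chunks so that multi-gigabyte BAM or FASTQ files never have to fit in memory. The function names and the 1 MiB chunk size are arbitrary choices.

```python
import hashlib


def file_digest(path: str, algorithm: str = "md5", chunk_size: int = 1 << 20) -> str:
    """Compute a hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str, algorithm: str = "md5") -> bool:
    """Return True when the file's digest matches the recorded checksum."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

A typical call would be `verify("sample_001_R1.fastq.gz", recorded_md5)`, where `recorded_md5` is the value read from the checksum file that accompanied the transfer.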
File Format Validation:
# Validate FASTQ format
seqkit stats -T sample.fastq.gz
# Check BAM file integrity
samtools quickcheck aligned.bam
# Parse the entire VCF; malformed headers or records produce errors or warnings
bcftools view variants.vcf > /dev/null
Storage and Compression Strategies
Compression Guidelines:
- FASTQ files: Always compress with gzip (.gz)
- BAM files: Use built-in compression (already compressed)
- VCF files: Compress with bgzip for tabix compatibility
- Text files: Use gzip for significant space savings
Archive Strategy:
# Bundle a project for long-term storage (already-compressed files such as BAM or .gz shrink very little)
tar -czf project_archive.tar.gz project_directory/
# Create separate archives for different data types
tar -czf raw_data.tar.gz raw_data/
tar -czf analysis_results.tar.gz analysis/
Data Backup and Version Control
Backup Strategy:
- Primary storage: Active analysis workspace
- Secondary backup: Network storage or cloud
- Archive storage: Long-term compressed storage
- Metadata backup: Critical sample information
Version Control for Analysis:
# Initialize a git repository in the project root
cd project_root
git init
# Track analysis scripts and metadata (keep large data files out of git)
git add scripts/ metadata/ README.md
git commit -m "Initial analysis setup"
# Create a branch for each analysis, both starting from the initial commit
git branch differential_expression
git branch pathway_analysis
Common Pitfalls and Troubleshooting
Format Compatibility Issues
Coordinate System Mismatches:
- 0-based vs 1-based: BED (0-based) vs VCF/GTF (1-based)
- Half-open intervals: BED uses [start, end) intervals
- Always verify: Use tools like bedtools for coordinate conversions
# Convert BED starts to 1-based coordinates (the end stays the same; the result is no longer valid BED)
awk '{print $1, $2+1, $3}' OFS='\t' input.bed > output_1based.tsv
# Convert VCF positions to BED format
bcftools query -f '%CHROM\t%POS0\t%END\n' variants.vcf > positions.bed
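A worked interval makes the off-by-one rule concrete. In the sketch below, the helper names bed_to_1based and onebased_to_bed are illustrative; both read three-column input (chromosome, start, end) from files or standard input.

```shell
# bed_to_1based: BED (0-based, half-open) -> 1-based, closed coordinates.
# Only the start shifts by one; the end is numerically unchanged.
bed_to_1based() {
    awk 'BEGIN { OFS = "\t" } { print $1, $2 + 1, $3 }' "$@"
}

# onebased_to_bed: the inverse conversion.
onebased_to_bed() {
    awk 'BEGIN { OFS = "\t" } { print $1, $2 - 1, $3 }' "$@"
}

# The BED interval chr1 [100, 200) holds 200 - 100 = 100 bases.
# Converted, it becomes chr1 101..200, which holds 200 - 101 + 1 = 100 bases.
printf 'chr1\t100\t200\n' | bed_to_1based
```

The interval length is a handy sanity check: end minus start in BED must equal end minus start plus one in the 1-based form.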
Character Encoding Problems:
# Check file encoding
file -i sample.txt
# Convert encoding if necessary
iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt
# Remove carriage returns (Windows line endings)
tr -d '\r' < windows_file.txt > unix_file.txt
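Stripping carriage returns is only needed when they are present, so a quick test can guard the conversion. This is a small sketch; has_crlf is a hypothetical helper name, and the sample sheet is a throwaway stand-in.

```shell
# has_crlf FILE: succeed if any line in FILE ends with a carriage return.
has_crlf() {
    cr=$(printf '\r')
    grep -q "${cr}\$" "$1"
}

# Demo: a sample sheet saved with Windows line endings
sheet=$(mktemp)
printf 'sample,condition\r\ns1,control\r\n' > "$sheet"
if has_crlf "$sheet"; then
    tr -d '\r' < "$sheet" > "$sheet.unix"
    echo "converted to Unix line endings"
fi
rm -f "$sheet" "$sheet.unix"
```

Hidden carriage returns in sample sheets are a common reason why sample names fail to match between metadata and file names.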
Performance Optimization
Index Management:
# Always index coordinate-sorted files
samtools index sorted.bam
tabix -p vcf compressed.vcf.gz
samtools faidx reference.fa
# Verify index compatibility
samtools idxstats sorted.bam
Memory and Storage Optimization:
# Use streaming for large files
samtools view large.bam chr1:1000000-2000000 | process_reads.py
# Parallel processing with GNU parallel
ls *.fastq.gz | parallel -j 8 'process_sample.sh {}'
# Monitor resource usage
htop # Interactive process monitor
iostat -x 1 # I/O statistics
Data Corruption and Recovery
Common Corruption Signs:
- Unexpected file sizes (too small or truncated)
- Parsing errors from standard tools
- Missing headers or incomplete records
- Checksum verification failures
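Two of these signs, empty files and truncated compressed streams, can be screened for with standard tools alone. The following sketch uses gzip -t to test the compressed stream; check_gz is a hypothetical name, and the status strings are arbitrary labels.

```shell
# check_gz FILE: print a one-word status for a gzip-compressed file.
check_gz() {
    if [ ! -s "$1" ]; then
        echo "EMPTY_OR_MISSING"      # zero bytes or not there at all
    elif ! gzip -t "$1" 2>/dev/null; then
        echo "CORRUPT"               # stream is truncated or damaged
    else
        echo "OK"
    fi
}

# Example: screen every compressed FASTQ in the current directory
for f in *.fastq.gz; do
    [ -e "$f" ] || continue          # skip when the glob matches nothing
    printf '%s\t%s\n' "$(check_gz "$f")" "$f"
done
```

A passing gzip test only shows that the compression layer is intact. It says nothing about whether the FASTQ records inside are well formed.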
Recovery Strategies:
# Attempt to salvage the readable records from a truncated BAM
samtools view -h corrupted.bam | samtools view -b -o recovered.bam -
# Extract the first 1,000,000 records (4 lines each) from a corrupted FASTQ
head -n 4000000 corrupted.fastq > partial_recovery.fastq
# Re-pair reads whose mates have fallen out of sync (BBTools repair.sh; it does not repair damaged records)
repair.sh in=corrupted.fastq out=repaired.fastq
Format-Specific Troubleshooting
FASTQ Issues:
# Check for malformed FASTQ records
awk 'NR%4==1{if($0!~/^@/)print "Line "NR": "$0}' sample.fastq
# Summarize base quality (the Q20/Q30 columns require -a)
seqkit stats -a -T sample.fastq
# Fix truncated FASTQ files by keeping only complete 4-line records
head -n $(( $(wc -l < sample.fastq) / 4 * 4 )) sample.fastq > fixed.fastq
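The header check above can be extended to cover the whole record structure. The sketch below assumes uncompressed FASTQ with exactly four lines per record (no wrapped sequences); validate_fastq is a hypothetical name, and it stops at the first problem it finds.

```shell
# validate_fastq FILE: print the first structural problem and fail, else succeed.
# Assumes 4-line records (sequence and quality not wrapped across lines).
validate_fastq() {
    awk '
        NR % 4 == 1 && $0 !~ /^@/ {
            print "line " NR ": header does not start with @"; bad = 1; exit
        }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && $0 !~ /^\+/ {
            print "line " NR ": separator does not start with +"; bad = 1; exit
        }
        NR % 4 == 0 && length($0) != seqlen {
            print "line " NR ": quality length differs from sequence length"; bad = 1; exit
        }
        END {
            if (!bad && NR % 4 != 0) {
                print "file ends mid-record (" NR " lines)"; bad = 1
            }
            exit (bad ? 1 : 0)
        }
    ' "$1"
}

# Example calls (file names are placeholders):
# validate_fastq sample.fastq && echo "structure looks fine"
# gzip -dc sample.fastq.gz | validate_fastq -
```

Checking line positions rather than searching for lines that begin with @ matters because @ is also a valid quality character and can start a quality line.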
BAM/SAM Problems:
# Check BAM header consistency
samtools view -H sample.bam | grep "@RG"
# Check the declared sort order (SO:coordinate in the @HD line means coordinate-sorted)
samtools view -H sample.bam | grep "^@HD"
# Fix header issues
samtools reheader new_header.sam sample.bam > fixed.bam
VCF Validation:
# Comprehensive VCF validation
bcftools view variants.vcf | vcf-validator
# Check for sorting issues
bcftools view variants.vcf | bcftools query -f '%CHROM\t%POS\n' | sort -k1,1 -k2,2n -c
# Normalize indel representation (left-align against the reference)
bcftools norm -f reference.fa -O z -o normalized.vcf.gz variants.vcf
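The sort -c test only passes when chromosomes appear in lexical order, which many reference builds do not follow (chr2 before chr10, for example). As an alternative sketch, the check below accepts any chromosome order as long as each chromosome forms one contiguous block with non-decreasing positions. It does not compare against the contig order in the header, and check_sorted is a hypothetical name.

```shell
# check_sorted: read "chrom<TAB>pos" lines on stdin and fail at the first
# record that is out of order. Chromosomes may come in any order, but each
# must form a single block with non-decreasing positions.
check_sorted() {
    awk -F '\t' '
        $1 != chrom {
            if ($1 in seen) {
                print "line " NR ": " $1 " appears in two separate blocks"
                bad = 1; exit
            }
            seen[$1] = 1; chrom = $1; last = $2 + 0; next
        }
        $2 + 0 < last {
            print "line " NR ": " $1 ":" $2 " comes after position " last
            bad = 1; exit
        }
        { last = $2 + 0 }
        END { exit (bad ? 1 : 0) }
    '
}

# Example call (file name is a placeholder):
# bcftools query -f '%CHROM\t%POS\n' variants.vcf | check_sorted && echo "sorted"
```

The same function works for BAM input if the chromosome and position columns are extracted first, for instance with samtools view and cut -f3,4.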
Future Trends in NGS Data Formats
Emerging Technologies and Formats
Real-Time Sequencing Data:
- Streaming formats: Handle continuous data flow from real-time sequencers
- Adaptive compression: Dynamic compression based on data characteristics
- Event-driven processing: Process data as it’s generated
Multi-Modal Data Integration:
- Multi-omics formats: Combine genomics, transcriptomics, and epigenomics
- Spatial data formats: Integrate location information with expression data
- Time-series formats: Handle temporal genomics experiments
Cloud-Native Formats:
- Object storage optimization: Formats designed for cloud storage systems
- Serverless compatibility: Support for function-as-a-service architectures
- API-first design: RESTful interfaces for data access
Standards Evolution
International Coordination:
- GA4GH standards: Global Alliance for Genomics and Health format specifications
- FAIR principles: Findable, Accessible, Interoperable, Reusable data
- Metadata standardization: Consistent sample and experimental annotations
Performance Improvements:
- Columnar storage: Apache Parquet and similar formats for analytics
- GPU acceleration: Formats optimized for parallel processing
- Quantum-ready: Preparation for quantum computing applications
Conclusion: Mastering NGS Data Formats
Understanding NGS data formats is fundamental to successful genomics analysis. Each format has evolved to address specific challenges in storing, processing, and analyzing biological data. From the simple elegance of FASTA to the sophisticated compression of CRAM, these formats enable researchers to extract meaningful insights from vast genomic datasets.
Key Takeaways
Format Selection Strategy:
- Choose formats based on your analysis requirements and computational resources
- Consider long-term storage and accessibility needs
- Balance compression efficiency with processing speed
- Ensure compatibility with your analysis tools and collaborators
Quality Assurance:
- Implement robust data validation procedures
- Maintain comprehensive metadata throughout your analysis pipeline
- Use checksums and version control to ensure data integrity
- Document format conversions and processing steps
Future-Proofing:
- Stay informed about emerging format standards
- Design analysis pipelines with format flexibility
- Participate in community discussions about best practices
- Consider cloud compatibility in format selection
Practical Next Steps
- Audit Your Current Data: Review existing datasets and identify format optimization opportunities
- Standardize Workflows: Implement consistent naming conventions and directory structures
- Automate Validation: Create scripts to verify data integrity and format compliance
- Collaborate Effectively: Establish format standards within your research group
- Stay Updated: Follow developments in format standards and analysis tools
The landscape of NGS data formats continues to evolve with advancing sequencing technologies and computational capabilities. By mastering these fundamental formats and understanding their appropriate applications, researchers can build robust, efficient analysis pipelines that scale from individual experiments to large-scale genomics consortia.
Whether you’re analyzing your first ChIP-seq dataset or managing petabyte-scale population genomics data, a solid understanding of NGS data formats provides the foundation for reproducible, high-quality genomics research. The investment in learning these formats pays dividends in analysis efficiency, data management, and collaborative success.
This tutorial is part of the NGS101.com comprehensive guide to next-generation sequencing analysis. For more tutorials on specific analysis workflows and advanced techniques, explore our complete tutorial collection.