The Complete Guide to NGS Data Types and Formats: From Raw Reads to Analysis-Ready Files

Master the essential file formats in next-generation sequencing analysis

Introduction: Understanding the NGS Data Ecosystem

Next-generation sequencing (NGS) has revolutionized biological research by enabling us to read DNA, RNA, and epigenetic modifications at an unprecedented scale. However, with this power comes complexity – NGS workflows generate dozens of different file formats, each serving specific purposes in the analysis pipeline. Understanding these formats is crucial for any researcher working with genomic data.

What Makes NGS Data Formats Unique?

NGS data formats have evolved to address several key challenges:

  • Scale: NGS experiments generate massive datasets, often containing millions to billions of sequencing reads
  • Compression: Raw sequencing data can occupy terabytes of storage, requiring efficient compression methods
  • Indexing: Random access to specific genomic regions requires sophisticated indexing schemes
  • Standardization: Interoperability between different analysis tools demands standardized formats
  • Metadata: Complex experimental designs require rich annotation and sample information

The NGS Analysis Journey: From Molecules to Insights

The path from biological sample to scientific insight involves multiple data transformations, each producing specific file types:

  1. Sequencing Instruments generate raw electrical signals and base calls
  2. Quality Control produces filtered and trimmed sequence reads
  3. Alignment maps reads to reference genomes; the output is then usually coordinate-sorted and indexed
  4. Quantification summarizes read counts into expression matrices
  5. Variant Calling identifies genetic differences from reference sequences
  6. Annotation connects genomic features to biological knowledge

Each step requires specialized file formats optimized for different computational tasks, storage requirements, and analysis workflows.

Key Properties of NGS Data Formats

Understanding the characteristics of each format helps in choosing the right tools and approaches:

Size Considerations:

  • Raw FASTQ files can range from gigabytes to terabytes
  • Compressed alignment files (BAM) are substantially smaller than their uncompressed SAM equivalents, often by more than half
  • Index files, while small, are essential for efficient random access

Format Types:

  • Text-based formats (FASTQ, SAM, VCF) are human-readable but larger
  • Binary formats (BAM, BCF) offer better compression and faster processing
  • Indexed formats enable rapid access to specific genomic regions

Critical Handling Considerations:

  • Always verify file integrity using checksums after transfers
  • Maintain consistent coordinate systems (0-based vs 1-based indexing)
  • Preserve metadata and sample information throughout the analysis pipeline
  • Use appropriate compression levels balancing file size and access speed

Raw Sequencing Data: The Foundation of NGS Analysis

Raw sequencing data represents the direct output from sequencing instruments before any computational processing. Understanding these formats is essential for quality assessment and troubleshooting.

Platform-Specific Raw Data Formats

Different sequencing technologies produce distinct raw data formats, each reflecting their underlying detection mechanisms:

Illumina Sequencing:

  • BCL files: Binary base call files containing per-cycle base calls and quality scores (raw intensities are stored separately)
  • FASTQ files: Text-based format with sequences and per-base quality scores
  • InterOp files: Binary files containing run metrics and quality statistics

Oxford Nanopore:

  • FAST5 files: HDF5-based format storing raw electrical current measurements
  • POD5 files: Newer, more efficient format replacing FAST5
  • FASTQ files: Basecalled sequences with quality scores

Pacific Biosciences (PacBio):

  • H5 files: HDF5-based bas.h5/bax.h5 files from older RS II systems
  • BAM files: PacBio’s primary format for Sequel systems
  • FASTA/FASTQ: Extracted consensus sequences

Comparative Analysis of Raw Data Formats

Platform | Primary Format | File Size | Read Length | Error Profile | Use Cases
Illumina | FASTQ | 1-50 GB | 50-300 bp | Low substitution | Genome sequencing, RNA-seq, ChIP-seq
Nanopore | FAST5/POD5 | 10-500 GB | 1 kb-2 Mb | Indels, homopolymer | Long-read assembly, structural variants
PacBio | BAM/FASTQ | 5-200 GB | 1 kb-100 kb | Random errors | High-quality assembly, isoform analysis

Sequence Data Formats: The Building Blocks

Sequence data formats store the fundamental genetic information extracted from NGS experiments. These formats serve as input for most downstream analyses.

FASTQ: The Universal Sequence Format

FASTQ format dominates NGS workflows due to its simplicity and comprehensive information content.

Structure and Components:

@M00967:43:000000000-A3JHG:1:1101:18327:1699 1:N:0:1
CCTACGGGNGGCAGCAG
+
A1>1>111#1>11<-<1

Detailed Breakdown:

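  • Header: Starts with @ and contains instrument ID, run number, flowcell ID, lane, tile, and x/y cluster coordinates, followed by read number, filter flag, control bits, and sample index
  • Sequence: Raw nucleotide sequence (A, T, G, C, with N for an uncalled base)
  • Separator: A line starting with +, optionally repeating the header
  • Quality: One ASCII character per base encoding the Phred score (Phred+33 on current Illumina instruments); it must be the same length as the sequence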
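
The record layout above can be sketched in a few lines of Python. This is a minimal illustration that assumes the Phred+33 encoding; the function name `parse_fastq_record` and the sample record are made up for the example and do not come from any library.

```python
def parse_fastq_record(lines):
    """Parse one 4-line FASTQ record into (read_id, sequence, phred_scores)."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Assumed encoding: Phred+33, so score = ASCII code - 33
    scores = [ord(char) - 33 for char in quality]
    read_id = header[1:].split()[0]
    return read_id, sequence, scores


# Hypothetical record used only for illustration
record = ["@read_1 1:N:0:1\n", "ACGTN\n", "+\n", "II5#!\n"]
read_id, sequence, scores = parse_fastq_record(record)
print(read_id, sequence, scores)
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so the score of 2 decoded from `#` marks a base call that is more likely wrong than right.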

Common FASTQ Variants:

  • Paired-end: Two FASTQ files (R1 and R2) with matching read IDs
  • Compressed: .fastq.gz files for storage efficiency
  • Multiplexed: Multiple samples in single file with barcode identification

Best Practices for FASTQ Handling:

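# Check that the gzip stream is intact
gzip -t sample.fastq.gz

# Summarize FASTQ contents (read count, length distribution)
seqkit stats sample.fastq.gz

# Count total reads (each record spans 4 lines)
echo $(( $(gzip -cd sample.fastq.gz | wc -l) / 4 ))

# Extract first 1000 reads
gzip -cd sample.fastq.gz | head -n 4000 > sample_subset.fastq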

FASTA: Simple Sequence Storage

FASTA format provides a streamlined approach for storing sequences without quality information.

Basic Structure:

>sequence_identifier optional_description
ATGCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCGA
>another_sequence
GCGATCGATCGATCGATCGATCGATCGATCGAT

When to Use FASTA:

  • Reference genome sequences
  • Protein sequences
  • Assembled contigs or scaffolds
  • Consensus sequences from multiple alignment
  • Primer and probe sequences

FASTA vs FASTQ Decision Matrix:

Use Case | Format Choice | Reasoning
Raw sequencing reads | FASTQ | Need quality scores for filtering
Reference genomes | FASTA | No quality information needed
Assembly output | FASTA | Consensus sequences
Database searches | FASTA | Standard for BLAST databases

Alignment Data Formats: Mapping Reads to Genomes

Once sequencing reads are generated, they must be aligned to reference genomes. Alignment formats store this crucial mapping information with varying levels of compression and accessibility.

SAM: The Human-Readable Alignment Standard

The Sequence Alignment/Map (SAM) format provides a comprehensive, text-based representation of alignments.

SAM File Structure:

@HD    VN:1.6  SO:coordinate
@SQ    SN:chr1 LN:248956422
@PG    ID:bwa  PN:bwa  VN:0.7.17-r1188
@RG    ID:sample1  SM:patient_001  PL:ILLUMINA
M00967:43:000000000-A3JHG:1:1101:18327:1699    99  chr1    1000    60  150M    =   1150    300 CCTACGGGNGGCAGCAG...    A1>1>111#1>11<-<1...   AS:i:145    XS:i:20

Header Section (@-lines):

  • @HD: File format version and sort order
  • @SQ: Reference sequence information
  • @RG: Read group information (sample, library, platform)
  • @PG: Program information used for alignment

Alignment Records (11 mandatory fields):

  1. QNAME: Read identifier
  2. FLAG: Bitwise flag indicating alignment properties
  3. RNAME: Reference sequence name (chromosome)
  4. POS: 1-based leftmost alignment position
  5. MAPQ: Mapping quality score
  6. CIGAR: Concise alignment representation
  7. RNEXT: Reference name of mate/next read
  8. PNEXT: Position of mate/next read
  9. TLEN: Template length
  10. SEQ: Read sequence
  11. QUAL: ASCII-encoded read quality
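
The FLAG field packs several yes/no properties into one integer. As a minimal sketch, the snippet below decodes a FLAG such as 99 (the value in the example record above) into named properties. The bit values follow the SAM specification; the short names and the `decode_flag` helper are illustrative choices, not part of any tool.

```python
# Bit values from the SAM specification; the labels are informal names
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


print(decode_flag(99))
```

Since 99 = 1 + 2 + 32 + 64, the read is paired, properly paired, the first read of its pair, and its mate maps to the reverse strand.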

CIGAR String Interpretation:

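  • M: Alignment match (the base may match or mismatch the reference)
  • I: Insertion relative to the reference (bases present in the read only)
  • D: Deletion relative to the reference (bases present in the reference only)
  • S: Soft clipping (clipped bases remain in SEQ)
  • H: Hard clipping (clipped bases are removed from SEQ)
  • N: Skipped reference region (e.g., an intron in spliced alignments)

Example: 50M2I25M = 50 aligned bases, 2 inserted bases, 25 aligned bases (a 77-base read spanning 75 reference bases)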
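
To make the bookkeeping concrete, here is a small sketch that computes how many read bases and how many reference bases a CIGAR string accounts for. The operator sets follow the SAM specification (M, I, S, =, X consume the read; M, D, N, =, X consume the reference); the function name `cigar_lengths` is an illustrative choice.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")      # operations that consume read bases
REFERENCE_OPS = set("MDN=X")  # operations that consume reference bases


def cigar_lengths(cigar):
    """Return (read_length, reference_span) implied by a CIGAR string."""
    operations = CIGAR_OP.findall(cigar)
    if "".join(count + op for count, op in operations) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    read_length = sum(int(count) for count, op in operations if op in QUERY_OPS)
    reference_span = sum(int(count) for count, op in operations if op in REFERENCE_OPS)
    return read_length, reference_span


print(cigar_lengths("50M2I25M"))     # read bases, reference bases
print(cigar_lengths("10S40M5D50M"))
```

The reference span is what determines where an alignment ends: an alignment starting at POS covers reference positions POS through POS + reference_span - 1.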

BAM: Compressed Binary Alignments

BAM format provides the same information as SAM but in a compressed, binary format optimized for computational efficiency.

Key Advantages:

  • File Size: Often around 60-80% smaller than the equivalent SAM file, depending on the data
  • Processing Speed: Faster parsing and processing
  • Random Access: Efficient retrieval of specific genomic regions (once sorted and indexed)
  • Compression: Built-in BGZF (blocked gzip) compression

BAM Usage Examples:

# Convert SAM to BAM (the old -S flag is no longer needed; input format is auto-detected)
samtools view -b -o alignment.bam alignment.sam

# Sort BAM file by coordinates
samtools sort -o alignment_sorted.bam alignment.bam

# Index BAM for random access
samtools index alignment_sorted.bam

# Extract reads from specific region
samtools view alignment_sorted.bam chr1:1000000-2000000

BAI: BAM Index Files

BAI files enable rapid random access to specific genomic regions within BAM files.

Index Structure:

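  • Binning Index: Hierarchical bins (from 512 Mb down to 16 kb) that map genomic intervals to compressed file chunks
  • Linear Index: File offsets for consecutive 16 kb windows, used to skip chunks that cannot overlap the query
  • Metadata: Optional per-reference counts of mapped and unmapped reads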
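
The bin that an alignment falls into can be computed directly from its coordinates. The sketch below is a Python transcription of the `reg2bin` calculation given in the SAM specification, which assigns each interval to the smallest bin that fully contains it; coordinates are assumed to be 0-based and half-open.

```python
def reg2bin(beg, end):
    """Return the BAI bin number for the 0-based, half-open interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)  # 16 kb bins start at 4681
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)  # 128 kb bins start at 585
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)   # 1 Mb bins start at 73
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)   # 8 Mb bins start at 9
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)   # 64 Mb bins start at 1
    return 0                                       # single 512 Mb bin


# A 150 bp alignment starting at 1-based position 1000
print(reg2bin(999, 1149))
# An interval that straddles a 16 kb boundary moves up one level
print(reg2bin(16383, 16385))
```

A region query then only needs to inspect the handful of bins that can overlap the requested interval, which is why indexed lookups stay fast even on very large files.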

Critical Considerations:

  • BAI files must be regenerated after any BAM file modification
  • Index files should be stored alongside BAM files
  • Coordinate-sorted BAM files are required for indexing

CRAM: Ultra-Compressed Alignments

CRAM format offers superior compression by using reference-based compression algorithms.

Compression Benefits:

  • Size Reduction: 30-60% smaller than BAM files
  • Lossless by Default: Maintains all alignment information unless lossy options (such as quality binning) are explicitly enabled
  • Reference-Based: Stores only differences from reference genome

CRAM Usage Scenarios:

  • Long-term data archiving
  • Large-scale population genomics projects
  • Cloud storage optimization
  • Bandwidth-limited data transfers

CRAM Workflow Example:

# Convert BAM to CRAM
samtools view -C -T reference.fa alignment.bam > alignment.cram

# Index CRAM file
samtools index alignment.cram

# Convert CRAM back to BAM
samtools view -b -T reference.fa alignment.cram > alignment_restored.bam

Quantification and Expression Data Formats

Gene expression analysis requires specialized formats to store quantified measurements across samples and conditions. These formats balance human readability with computational efficiency.

Count Matrices: The Foundation of Expression Analysis

Count matrices represent the core data structure for RNA-seq and single-cell analyses.

Tab-Separated Values (TSV) Format:

Gene_ID    Sample_1    Sample_2    Sample_3    Sample_4
ENSG00000000003    743 891 1205    567
ENSG00000000005    0   2   1   0
ENSG00000000419    1891    2103    2456    1678
ENSG00000000457    567 634 723 445
ENSG00000000460    89  123 156 67

Comma-Separated Values (CSV) Format:

Gene_ID,Sample_1,Sample_2,Sample_3,Sample_4
ENSG00000000003,743,891,1205,567
ENSG00000000005,0,2,1,0
ENSG00000000419,1891,2103,2456,1678

Best Practices for Count Matrices:

  • Use gene IDs (Ensembl, RefSeq) rather than gene names for consistency
  • Include metadata files describing samples and experimental conditions
  • Validate that row and column totals match expected values
  • Store raw counts separately from normalized values

Normalized Expression Tables

Normalization addresses technical biases and enables meaningful comparisons between samples.

TPM (Transcripts Per Million) Table:

Gene_ID    Gene_Length Sample_1_TPM    Sample_2_TPM    Sample_3_TPM
ENSG00000000003    2100    354.1   424.5   573.8
ENSG00000000005    1500    0.0 1.3 0.7
ENSG00000000419    3200    591.6   657.2   768.1

FPKM/RPKM Comparison:

  • TPM: Transcripts Per Million; values sum to 1 million within each sample
  • FPKM: Fragments Per Kilobase of transcript per Million mapped fragments; used for paired-end RNA-seq
  • RPKM: Reads Per Kilobase of transcript per Million mapped reads; used for single-end RNA-seq

When to Use Each Format:

  • TPM: Cross-sample comparisons and meta-analyses
  • FPKM/RPKM: Within-sample gene length normalization
  • Raw Counts: Differential expression analysis with DESeq2/edgeR
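
The difference between these units is easiest to see in code. The following is a minimal sketch of the standard TPM and FPKM formulas, using plain Python lists and a made-up two-gene example. It uses annotated gene length where quantification tools generally use an effective length, so real tool output will differ somewhat.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts)
    return [count * 1e9 / (total * length) for count, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [count / length for count, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [rate / scale * 1e6 for rate in rates]


# Hypothetical example: gene B has twice the reads but is twice as long
counts = [10, 20]
lengths_bp = [1000, 2000]
print(tpm(counts, lengths_bp))
print(fpkm(counts, lengths_bp))
```

Both genes end up with the same value in either unit, because the longer gene collects proportionally more reads. The practical difference is that TPM values always sum to one million within a sample, while the FPKM total varies from sample to sample.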

Single-Cell Specific Formats

Single-cell RNA-seq generates sparse, high-dimensional datasets requiring specialized storage formats.

Matrix Market (MTX) Format:

%%MatrixMarket matrix coordinate integer general
32738 5000 8934756
1 1 4
1 3 1
2 1 2
2 2 8

Format Structure:

  • Line 1: Header with format information
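  • Line 2 (the first non-comment line): Matrix dimensions as number of rows (genes), number of columns (cells), and number of non-zero entries
  • Subsequent lines: Row, column, value triplets, with 1-based row and column indices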
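
As a minimal sketch of this layout, the pure-Python reader below turns coordinate-format text into a dictionary keyed by 0-based (row, column) pairs. It assumes integer values and is meant only to illustrate the format; in practice a library reader such as `scipy.io.mmread` is the usual choice. The function name `read_mtx` and the tiny matrix are invented for the example.

```python
def read_mtx(lines):
    """Parse Matrix Market coordinate text into (n_rows, n_cols, entries)."""
    entries = {}
    shape = None
    for line_number, line in enumerate(lines):
        line = line.strip()
        if line_number == 0:
            if not line.startswith("%%MatrixMarket matrix coordinate"):
                raise ValueError("expected a coordinate-format Matrix Market header")
            continue
        if not line or line.startswith("%"):
            continue  # skip blank lines and comment lines
        if shape is None:
            shape = tuple(int(value) for value in line.split())
            continue
        row, col, value = line.split()
        # Convert 1-based file indices to 0-based keys
        entries[(int(row) - 1, int(col) - 1)] = int(value)
    if shape is None or len(entries) != shape[2]:
        raise ValueError("entry count does not match the size line")
    return shape[0], shape[1], entries


text = """%%MatrixMarket matrix coordinate integer general
3 3 4
1 1 4
1 3 1
2 1 2
2 2 8
"""
n_genes, n_cells, entries = read_mtx(text.splitlines())
print(n_genes, n_cells, entries)
```

Only non-zero values are stored, which is what keeps the format practical for single-cell matrices where most entries are zero.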

Associated Files:

  • features.tsv: Gene IDs and symbols
  • barcodes.tsv: Cell barcode sequences

HDF5 and AnnData Formats

Modern single-cell analysis increasingly relies on hierarchical data formats.

HDF5 (.h5) Structure:

/
├── matrix/
│   ├── data
│   ├── indices
│   └── indptr
├── features/
│   ├── id
│   ├── name
│   └── feature_type
└── barcodes
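
The `data`, `indices`, and `indptr` arrays under `matrix/` are the usual three arrays of a compressed sparse matrix. The sketch below shows how they expand into a dense table, assuming compressed sparse column (CSC) storage with one column per barcode. That orientation is an assumption made for the example and should be checked against the documentation of the tool that wrote the file. The function name and the toy arrays are illustrative.

```python
def csc_to_dense(data, indices, indptr, n_rows):
    """Expand CSC arrays into a dense list-of-lists matrix (rows x columns)."""
    n_cols = len(indptr) - 1
    dense = [[0] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        # Entries for this column live in data[indptr[col]:indptr[col + 1]]
        for k in range(indptr[col], indptr[col + 1]):
            dense[indices[k]][col] = data[k]
    return dense


# Toy matrix: 3 genes (rows) x 2 cells (columns)
data = [4, 2, 8, 1]     # non-zero values, column by column
indices = [0, 1, 1, 2]  # row index of each value
indptr = [0, 2, 4]      # where each column starts in data/indices
dense = csc_to_dense(data, indices, indptr, n_rows=3)
for row in dense:
    print(row)
```

In real analyses the arrays would be passed to a sparse-matrix library rather than expanded, since a dense copy of a full single-cell matrix can be very large.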

AnnData (.h5ad) Components:

  • X: Primary data matrix (cells × genes, i.e., observations × variables)
  • obs: Cell metadata (cell type, cluster, etc.)
  • var: Gene metadata (gene symbols, biotype)
  • obsm: Multi-dimensional cell annotations
  • varm: Multi-dimensional gene annotations
  • uns: Unstructured metadata

Usage Example:

import scanpy as sc

# Load AnnData object
adata = sc.read_h5ad('single_cell_data.h5ad')

# Access expression matrix (cells x genes; often a SciPy sparse matrix)
expression = adata.X

# Access metadata (pandas DataFrames, one row per cell and per gene)
cell_metadata = adata.obs
gene_metadata = adata.var

Loom Format: Comprehensive Single-Cell Storage

Loom format provides a self-contained solution for single-cell genomics data.

Loom File Structure:

/
├── matrix (main data matrix)
├── row_attrs/
│   ├── Gene (gene symbols)
│   └── Accession (gene IDs)
├── col_attrs/
│   ├── CellID (cell barcodes)
│   └── CellType (annotations)
└── row_graphs/ (gene-gene relationships)

Advantages:

  • Cross-platform compatibility
  • Efficient sparse matrix storage
  • Built-in metadata management
  • Support for graphs and hierarchical relationships

R Data Formats

R-based analysis workflows commonly use native R storage formats.

RDS Format (.rds):

# Save single R object
expression_matrix <- read.csv("counts.csv", row.names=1)
saveRDS(expression_matrix, "expression_data.rds")

# Load RDS object
loaded_data <- readRDS("expression_data.rds")

RData Format (.rda/.RData):

# Save multiple R objects
sample_metadata <- read.csv("metadata.csv")
gene_annotations <- read.csv("genes.csv")
save(expression_matrix, sample_metadata, gene_annotations, 
     file="complete_dataset.RData")

# Load all objects
load("complete_dataset.RData")

Variant Data Formats: Capturing Genetic Diversity

Variant calling identifies genetic differences between sequenced samples and reference genomes. These formats must efficiently store diverse types of genetic variation while maintaining compatibility with analysis tools.

VCF: The Variant Call Format Standard

VCF format serves as the gold standard for storing genetic variants, from single nucleotide polymorphisms to complex structural variations.

VCF File Structure:

##fileformat=VCFv4.2
##reference=hg38
##contig=<ID=chr1,length=248956422>
##FILTER=<ID=LowQual,Description="Low quality">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
#CHROM    POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample1 Sample2
chr1    1000    rs123456    A   G   99.0    PASS    DP=50   GT:AD:DP    0/1:25,25:50    1/1:0,48:48
chr1    2000    .   T   C   45.2    LowQual DP=15   GT:AD:DP    0/0:15,0:15 0/1:8,7:15
chr2    3000    .   GTC G   87.5    PASS    DP=32   GT:AD:DP    0/1:16,16:32    0/1:18,14:32

Header Section (##-lines):

  • fileformat: VCF version specification
  • reference: Reference genome used
  • contig: Chromosome/contig information
  • INFO: Variant-level annotation descriptions
  • FORMAT: Sample-level field descriptions

Variant Records (8 mandatory columns, plus FORMAT and sample columns when genotypes are present):

  1. CHROM: Chromosome identifier
  2. POS: 1-based position
  3. ID: Variant identifier (e.g., dbSNP ID)
  4. REF: Reference allele
  5. ALT: Alternative allele(s)
  6. QUAL: Quality score
  7. FILTER: Filter status
  8. INFO: Variant annotations
  9. FORMAT: Colon-separated keys describing the sample columns
  10. Sample columns (column 10 onward): One per sample, holding genotype and related data

Genotype Encoding:

  • 0/0: Homozygous reference
  • 0/1: Heterozygous
  • 1/1: Homozygous alternative
  • ./.: Missing genotype
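
To show how FORMAT and sample columns pair up, here is a minimal sketch that splits a sample field by its FORMAT keys, classifies the genotype, and computes the fraction of reads supporting the first alternate allele. The helper names are illustrative, and the code assumes a diploid call with GT and AD present.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with one sample's colon-separated values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def genotype_class(gt):
    """Classify a genotype string such as 0/1, 1|1, or ./."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) == 1:
        return "hom_ref" if alleles[0] == "0" else "hom_alt"
    return "het"


def alt_allele_fraction(sample):
    """Fraction of reads supporting the first ALT allele, from the AD field."""
    depths = [int(value) for value in sample["AD"].split(",")]
    total = sum(depths)
    return depths[1] / total if total else None


sample = parse_sample("GT:AD:DP", "0/1:25,25:50")
print(sample["GT"], genotype_class(sample["GT"]), alt_allele_fraction(sample))
```

A heterozygous call with an alternate-allele fraction far from 0.5 is a common signal for closer inspection, although the expected value depends on ploidy, tumor purity, and similar factors.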

Complex Variant Examples:

# Single nucleotide variant (SNV)
chr1    1000    .   A   G   99.0    PASS    .   GT  0/1

# Insertion
chr1    2000    .   T   TAGA    87.5    PASS    .   GT  0/1

# Deletion
chr1    3000    .   ATCG    A   92.3    PASS    .   GT  1/1

# Multi-allelic site
chr1    4000    .   G   A,T 78.4    PASS    .   GT  1/2
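
These records differ only in the lengths of REF and ALT, which suggests a simple length-based classifier. The sketch below is deliberately simplified: it ignores symbolic alleles such as `<DEL>`, the `*` allele, and breakend notation. The labels and the function name `classify_alleles` are illustrative.

```python
def classify_alleles(ref, alt_field):
    """Classify each comma-separated ALT allele relative to REF by length."""
    classes = []
    for alt in alt_field.split(","):
        if len(ref) == 1 and len(alt) == 1:
            classes.append("SNV")
        elif len(alt) > len(ref):
            classes.append("insertion")
        elif len(alt) < len(ref):
            classes.append("deletion")
        else:
            classes.append("MNV")  # same length, more than one base
    return classes


print(classify_alleles("A", "G"))
print(classify_alleles("T", "TAGA"))
print(classify_alleles("ATCG", "A"))
print(classify_alleles("G", "A,T"))
```

Note that insertions and deletions keep one shared anchor base at the start of REF and ALT, so the event itself begins one position after POS.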

BCF: Binary Variant Call Format

BCF provides a compressed, binary representation of VCF data optimized for computational processing.

Key Advantages:

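  • Performance: Considerably faster parsing than text VCF, because fields are already stored as typed binary values
  • Size: Smaller than uncompressed VCF; the exact saving depends on the data
  • Indexing: Efficient random access through a CSI index created with bcftools index
  • Typing: Values are stored as typed binary fields (floats as 32-bit), so no text re-parsing is needed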

BCF Workflow:

# Convert VCF to BCF
bcftools view -Ob variants.vcf > variants.bcf

# Index BCF file
bcftools index variants.bcf

# Query specific region
bcftools view variants.bcf chr1:1000000-2000000

# Convert back to VCF
bcftools view variants.bcf > variants_restored.vcf

MAF: Mutation Annotation Format

MAF format specializes in storing somatic mutations with rich clinical and functional annotations.

MAF File Example:

Hugo_Symbol    Variant_Classification  Tumor_Sample_Barcode    HGVSp   HGVSc   Chromosome  Start_Position  End_Position    Reference_Allele    Tumor_Seq_Allele2
TP53    Missense_Mutation   TCGA-AA-A00A-01 p.R175H c.524G>A    chr17   7675088 7675088 C   T
KRAS    Missense_Mutation   TCGA-AA-A00A-01 p.G12D  c.35G>A chr12   25245350    25245350    C   T
PIK3CA    Missense_Mutation   TCGA-BB-B00B-01 p.E545K c.1633G>A   chr3    179218303   179218303   G   A

Critical MAF Fields:

  • Hugo_Symbol: Gene symbol
  • Variant_Classification: Functional impact (Missense, Nonsense, etc.)
  • Tumor_Sample_Barcode: Sample identifier
  • HGVSp/HGVSc: Protein and coding sequence notation
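  • Reference_Allele/Tumor_Seq_Allele2: Reference and tumor alleles on the forward genomic strand (for minus-strand genes such as TP53 and KRAS these are the complement of the bases in HGVSc)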

MAF Use Cases:

  • Cancer genomics analysis
  • Mutation burden calculations
  • Pathway enrichment analysis
  • Clinical data integration
  • Survival analysis correlation

BEDPE: Paired-End Breakpoint Format

BEDPE format stores structural variants and breakpoint information from paired-end sequencing.

BEDPE Structure:

chr1    1000    2000    chr1    5000    6000    variant_1   100 +   -   deletion
chr2    3000    3500    chr3    7000    7500    variant_2   150 +   +   translocation
chrX    8000    8200    chrX    9000    9200    variant_3   80  -   -   inversion

BEDPE Fields:

  • chrom1, start1, end1: First breakpoint
  • chrom2, start2, end2: Second breakpoint
  • name: Variant identifier
  • score: Confidence score
  • strand1, strand2: Breakpoint orientations
  • type: Structural variant type (an optional, tool-defined column beyond the 10 standard BEDPE fields)

Structural Variant Types:

  • Deletion: Loss of genomic sequence
  • Duplication: Copy number increase
  • Inversion: Sequence orientation reversal
  • Translocation: Inter-chromosomal rearrangement
  • Insertion: Novel sequence addition

Copy Number Variation Formats

CNVkit generates specialized formats for copy number analysis.

CNR (Copy Number Ratio) Format:

chromosome    start   end gene    log2    depth   weight
chr1    1000000 1001000 GENE1   -0.15   125.4   0.95
chr1    1001000 1002000 GENE1   0.23    138.2   0.98
chr1    1002000 1003000 GENE2   1.45    156.7   0.92

CNS (Copy Number Segment) Format:

chromosome    start   end gene    log2    cn  depth   p_ttest probes  weight
chr1    1000000 1500000 GENE1,GENE2 0.12    2   142.3   0.001   500 0.94
chr1    1500000 2000000 GENE3   1.00    4   165.8   0.000   250 0.96

Interpretation:

  • log2 ratio: Copy number relative to diploid (log2(cn/2))
  • cn: Absolute copy number
  • p_ttest: Statistical significance of segment
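
The relationship between the log2 ratio and absolute copy number can be written out directly. This sketch assumes a pure sample and a diploid baseline; real callers also correct for tumor purity and ploidy, which is not modeled here. The function names are illustrative.

```python
import math


def log2_ratio_to_copy_number(log2_ratio, ploidy=2):
    """Invert log2(cn / ploidy) to recover the (unrounded) copy number."""
    return ploidy * 2 ** log2_ratio


def copy_number_to_log2_ratio(copy_number, ploidy=2):
    """Expected log2 ratio for an absolute copy number."""
    return math.log2(copy_number / ploidy)


for ratio in (-1.0, 0.0, 0.585, 1.0):
    print(ratio, round(log2_ratio_to_copy_number(ratio), 2))
```

A log2 ratio of 0 therefore means two copies, -1 means a single-copy loss, about 0.58 means a single-copy gain, and 1 means four copies.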

Annotation and Feature Data Formats

Genomic annotations connect sequence data to biological knowledge, providing essential context for interpreting analysis results. These formats must efficiently represent diverse genomic features while maintaining compatibility with analysis tools.

GTF/GFF: Gene Transfer Format and General Feature Format

GTF and GFF formats store comprehensive gene structure and functional annotations.

GTF Format Example:

chr1    HAVANA  gene    11869   14409   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
chr1    HAVANA  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript";
chr1    HAVANA  exon    11869   12227   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1"; exon_id "ENSE00002234944";
chr1    HAVANA  exon    12613   12721   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "2"; exon_id "ENSE00003582793";

GTF Field Descriptions:

  1. seqname: Chromosome/scaffold identifier
  2. source: Annotation source (ENSEMBL, HAVANA, RefSeq, etc.)
  3. feature: Feature type (gene, transcript, exon, CDS)
  4. start: 1-based start coordinate (inclusive)
  5. end: 1-based end coordinate (inclusive)
  6. score: Confidence score (optional, "." if absent)
  7. strand: + (forward) or - (reverse)
  8. frame: Reading frame (0, 1, or 2) for CDS features
  9. attributes: Semicolon-separated key-value pairs
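
The attributes column is the part that usually needs custom parsing. Below is a minimal sketch of a GTF line parser, assuming tab-separated fields and double-quoted attribute values as in the example above. Repeated keys (some annotations repeat `tag`) would overwrite each other in this simple dictionary, and unquoted values are not handled. The function name is illustrative.

```python
import re

ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf_line(line):
    """Split one GTF line into named fields plus an attribute dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, found {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": dict(ATTRIBUTE.findall(attributes)),
    }


line = "\t".join([
    "chr1", "HAVANA", "exon", "11869", "12227", ".", "+", ".",
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1";',
])
record = parse_gtf_line(line)
print(record["feature"], record["end"] - record["start"] + 1, record["attributes"])
```

Because both coordinates are inclusive, the feature length is end - start + 1, which gives 359 bases for this exon.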

Common Feature Types:

  • gene: Complete gene locus
  • transcript: Individual transcript isoform
  • exon: Transcribed regions
  • CDS: Protein-coding sequences
  • UTR: Untranslated regions
  • start_codon/stop_codon: Translation boundaries

GFF3 Enhanced Features:

chr1    RefSeq  gene    11874   14409   .   +   .   ID=gene1;Name=DDX11L1;Dbxref=GeneID:100287102
chr1    RefSeq  mRNA    11874   14409   .   +   .   ID=rna1;Parent=gene1;Name=NR_046018.2
chr1    RefSeq  exon    11874   12227   .   +   .   ID=exon1;Parent=rna1
chr1    RefSeq  exon    12613   12721   .   +   .   ID=exon2;Parent=rna1

GFF3 vs GTF Comparison:

  • GFF3: Hierarchical relationships with ID/Parent structure
  • GTF: Flat structure with shared gene_id/transcript_id
  • GFF3: More flexible attribute system
  • GTF: Simpler parsing for RNA-seq workflows

BED: Browser Extensible Data Format

BED format provides a simple, flexible way to represent genomic intervals and annotations.

BED Format Variants:

BED3 (Minimal):

chr1    1000    2000
chr1    5000    6000
chr2    3000    4000

BED6 (Standard):

chr1    1000    2000    feature1    100 +
chr1    5000    6000    feature2    200 -
chr2    3000    4000    feature3    150 +

BED12 (Full):

chr1    1000    5000    gene1   1000    +   1200    4800    255,0,0 2   800,600 0,3400

BED Field Descriptions:

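  1. chrom: Chromosome name
  2. chromStart: 0-based start position (inclusive)
  3. chromEnd: End position (exclusive), so intervals are 0-based and half-open
  4. name: Feature identifier
  5. score: Display score (0-1000)
  6. strand: Orientation (+, -, or .)
  7. thickStart: Start of the thick-drawn region (e.g., start of coding sequence)
  8. thickEnd: End of the thick-drawn region (e.g., end of coding sequence)
  9. itemRgb: RGB color values
  10. blockCount: Number of sub-features (e.g., exons)
  11. blockSizes: Comma-separated block sizes
  12. blockStarts: Comma-separated block start positions relative to chromStart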
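
Since BED coordinates differ from the 1-based, inclusive coordinates used by GTF, VCF, and SAM, conversions are a frequent source of off-by-one errors. The sketch below expands the blocks of a BED12 record into 1-based inclusive intervals, using the BED12 example above as input. The function name `bed12_blocks` is an illustrative choice.

```python
def bed12_blocks(fields):
    """Return each BED12 block as (chrom, start, end) in 1-based inclusive coordinates."""
    chrom = fields[0]
    chrom_start = int(fields[1])
    # Trailing commas are allowed in the block lists, so strip them first
    sizes = [int(value) for value in fields[10].rstrip(",").split(",")]
    starts = [int(value) for value in fields[11].rstrip(",").split(",")]
    blocks = []
    for size, relative_start in zip(sizes, starts):
        start0 = chrom_start + relative_start  # 0-based, inclusive
        end0 = start0 + size                   # 0-based, exclusive
        blocks.append((chrom, start0 + 1, end0))
    return blocks


bed12 = "chr1 1000 5000 gene1 1000 + 1200 4800 255,0,0 2 800,600 0,3400".split()
print(bed12_blocks(bed12))
```

Converting a half-open 0-based interval to a 1-based inclusive one only changes the start (add 1); the end value stays the same, and the interval length is simply chromEnd - chromStart.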

BED Use Cases:

  • ChIP-seq peak regions
  • Gene promoter definitions
  • Regulatory element annotations
  • Copy number variation segments
  • Structural variant breakpoints

Peak Calling Formats

ChIP-seq and ATAC-seq analyses generate specialized peak formats.

narrowPeak Format:

chr1    1000    2000    peak1   100 .   5.2 10.1    8.3 500
chr1    5000    6000    peak2   200 .   7.8 15.2    12.1    400
chr2    3000    4000    peak3   150 .   6.1 12.5    9.8 300

narrowPeak Fields:
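
  1-3. chrom, chromStart, chromEnd: Standard BED3 fields
  4. name: Peak identifier
  5. score: Integer score (0-1000)
  6. strand: Orientation (usually ‘.’)
  7. signalValue: Signal enrichment
  8. pValue: -log10(p-value), or -1 if not assigned
  9. qValue: -log10(q-value), or -1 if not assigned
  10. peak: 0-based offset of the peak summit from chromStart, or -1 if no summit was called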
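
Two of these columns need a small transformation before use: the summit is stored as an offset from the peak start, and the significance values are stored as negative base-10 logarithms. A minimal parsing sketch follows, with an illustrative function name and the first example line above as input.

```python
def parse_narrowpeak(line):
    """Parse one narrowPeak line into a dictionary with derived fields."""
    fields = line.split()
    start = int(fields[1])
    neg_log10_p = float(fields[7])
    summit_offset = int(fields[9])
    return {
        "chrom": fields[0],
        "start": start,
        "end": int(fields[2]),
        "name": fields[3],
        "signal": float(fields[6]),
        # A value of -1 means "not assigned" in this format
        "p_value": 10 ** -neg_log10_p if neg_log10_p >= 0 else None,
        "summit": start + summit_offset if summit_offset >= 0 else None,
    }


peak = parse_narrowpeak("chr1    1000    2000    peak1   100 .   5.2 10.1    8.3 500")
print(peak["name"], peak["summit"], peak["p_value"])
```

The summit position is a 0-based coordinate, like chromStart, so it should be shifted by one when it is compared with 1-based formats such as VCF or GTF.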

broadPeak Format:

chr1    1000    5000    region1 100 .   5.2 10.1    8.3
chr1    10000   15000   region2 200 .   7.8 15.2    12.1

broadPeak Differences:

  • No peak summit column (column 10)
  • Represents broader enrichment regions
  • Suitable for histone modifications (H3K27me3, H3K36me3)

BigBED: Indexed Binary BED

BigBED format provides efficient storage and random access for large BED datasets.

BigBED Advantages:

  • Compression: Compressed binary storage, typically much smaller than plain-text BED
  • Indexing: Rapid region-based queries
  • Scalability: Handles millions of features efficiently
  • Browser Integration: Direct UCSC Genome Browser loading

BigBED Creation:

# Sort BED file by chromosome and position
sort -k1,1 -k2,2n input.bed > sorted.bed

# Get chromosome sizes
fetchChromSizes hg38 > hg38.chrom.sizes

# Convert BED to BigBED
bedToBigBed sorted.bed hg38.chrom.sizes output.bb

# Query specific region from BigBED
bigBedToBed output.bb -chrom=chr1 -start=1000000 -end=2000000 stdout

WIG and BigWig: Continuous Signal Data

Wiggle (WIG) and BigWig formats store continuous numerical data across genomic coordinates, essential for visualizing signal tracks.

WIG Format Types:

Variable Step WIG:

track type=wiggle_0 name="Sample1_Coverage" description="Coverage track"
variableStep chrom=chr1
1001    5.2
1002    5.8
1003    6.1
1005    4.9
1010    7.3

Fixed Step WIG:

track type=wiggle_0 name="Sample1_Coverage"
fixedStep chrom=chr1 start=1001 step=1
5.2
5.8
6.1
0.0
4.9

BedGraph Format (WIG alternative):

track type=bedGraph name="Sample1_Coverage"
chr1    1000    1001    5.2
chr1    1001    1002    5.8
chr1    1002    1003    6.1
chr1    1004    1005    4.9
chr1    1009    1010    7.3

Format Comparison:

  • Variable Step: Sparse data with irregular intervals
  • Fixed Step: Dense data with regular intervals
  • BedGraph: Most flexible, handles any interval structure
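
The coordinate shift between WIG (1-based positions) and bedGraph (0-based, half-open intervals) can be sketched as a small converter. This example handles only `fixedStep` sections and ignores `track` lines; `variableStep` input is not supported. The function name is an illustrative choice.

```python
def fixedstep_to_bedgraph(lines):
    """Convert fixedStep WIG lines into (chrom, start, end, value) bedGraph tuples."""
    chrom = position = step = span = None
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "#")):
            continue
        if line.startswith("fixedStep"):
            params = dict(token.split("=", 1) for token in line.split()[1:])
            chrom = params["chrom"]
            position = int(params["start"])  # 1-based position of the first value
            step = int(params["step"])
            span = int(params.get("span", 1))
            continue
        if chrom is None:
            raise ValueError("data line found before a fixedStep declaration")
        # 1-based position p with span s becomes the 0-based interval [p - 1, p - 1 + s)
        intervals.append((chrom, position - 1, position - 1 + span, float(line)))
        position += step
    return intervals


wig = """track type=wiggle_0 name="Sample1_Coverage"
fixedStep chrom=chr1 start=1001 step=1
5.2
5.8
6.1
0.0
4.9
"""
for interval in fixedstep_to_bedgraph(wig.splitlines()):
    print(*interval, sep="\t")
```

The first value, declared at WIG position 1001, lands on the bedGraph interval 1000-1001, matching the bedGraph example above.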

BigWig Advantages:

  • Performance: Indexed random access, far faster than scanning a text WIG file
  • Compression: Significant size reduction
  • Multi-resolution: Automatic data summarization at different zoom levels
  • Streaming: Efficient data transfer over networks

BigWig Creation and Usage:

# Convert bedGraph to BigWig
bedGraphToBigWig coverage.bedGraph hg38.chrom.sizes coverage.bw

# Extract signal from specific region  
bigWigToBedGraph coverage.bw -chrom=chr1 -start=1000000 -end=2000000 stdout

# Calculate summary statistics
bigWigSummary coverage.bw chr1 1000000 2000000 100

Common BigWig Applications:

  • ChIP-seq signal tracks
  • RNA-seq coverage visualization
  • ATAC-seq accessibility profiles
  • Hi-C derived one-dimensional tracks (e.g., insulation scores or compartment eigenvectors)
  • Methylation percentage tracks

Specialized NGS Data Formats

Advanced NGS applications have developed specialized formats to handle unique data types and optimize storage for specific use cases.

Compressed and Indexed Formats

VCF.gz + TBI: Tabix-Indexed Variants

Tabix indexing enables efficient random access to compressed VCF files.

# Compress and index VCF
bgzip variants.vcf
tabix -p vcf variants.vcf.gz

# Query specific region
tabix variants.vcf.gz chr1:1000000-2000000

# Multiple region query
tabix variants.vcf.gz chr1:100000-200000 chr2:300000-400000

Index File Structure:

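  • Binning index: Hierarchical bins that map genomic intervals to compressed file chunks
  • Linear index: File offsets for consecutive 16 kb windows, used to skip chunks that cannot overlap the query
  • Metadata: Sequence names and the column configuration (which columns hold sequence, start, and end)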

Tabix Advantages:

  • Works with any coordinate-sorted, tab-delimited format
  • Minimal memory footprint for large files
  • Supports multiple simultaneous queries
  • Network streaming compatibility

Sequencing Archive Formats

SRA: Sequence Read Archive

SRA format serves as the primary archive format for public sequencing data repositories.

SRA File Structure:

sample.sra
├── Metadata (experiment design, sample info)
├── Read data (sequences and qualities)
├── Alignment data (optional)
└── Analysis data (optional)

SRA Toolkit Usage:

# Download SRA file
prefetch SRR1234567

# Convert to FASTQ
fasterq-dump SRR1234567

# Split paired-end reads
fasterq-dump --split-files SRR1234567

# Dump alignments overlapping a region (only for runs that were submitted with alignment data)
sam-dump --aligned-region chr1:1000000-2000000 SRR1234567

SRA Advantages:

  • Comprehensive metadata storage
  • Efficient compression algorithms
  • Quality score optimization
  • International standard for data sharing

Long-Read Specific Formats

FAST5: Nanopore Raw Signal Data

FAST5 format stores raw electrical current measurements from Oxford Nanopore sequencing.

FAST5 HDF5 Structure:

/
├── UniqueGlobalKey/
│   ├── channel_id/
│   ├── context_tags/
│   ├── tracking_id/
│   └── sampling_rate
├── Raw/
│   └── Reads/
│       └── Read_[number]/
│           ├── Signal (raw current values)
│           └── Signal_metadata
└── Analyses/
    ├── Basecall_1D_[version]/
    │   ├── BaseCalled_template/
    │   │   ├── Fastq
    │   │   └── Events
    │   └── Summary/
    └── EventDetection_[version]/

FAST5 Data Components:

  • Raw Signal: Current measurements, typically sampled at around 4-5 kHz for DNA depending on chemistry and device settings
  • Event Data: Segmented signal regions
  • Basecalls: Sequence calls with quality scores
  • Metadata: Pore information, chemistry, temperature

FAST5 Analysis Tools:

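# Split multi-read FAST5 files into single-read files (ont_fast5_api)
multi_to_single_fast5 --input_path multi_fast5_dir/ --save_path single_reads/ --recursive

# Basecall with Guppy (writes FASTQ files to the output directory)
guppy_basecaller --input_path fast5_dir/ --save_path output_dir/ --config dna_r9.4.1_450bps_hac.cfg

# Extract signal data from a single-read FAST5 (path follows the structure shown above)
h5dump -d /Raw/Reads/Read_12345/Signal sample.fast5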

BAM/PBI: PacBio Data Formats

PacBio generates specialized formats for long-read sequencing data.

PacBio BAM Structure:

  • Standard BAM alignment records
  • Extended tags for PacBio-specific information
  • Pulse-level data (optional)

PBI Index Fields:

# PacBio BAM Index (.pbi)
- Reference ID and position
- Read quality scores  
- Subread information
- Barcode data (if multiplexed)
- Kinetic information

PacBio Analysis Example:

# Filter PacBio BAM records by tag value (here qs, the query start within the ZMW read)
bamtools filter -in pacbio.bam -out subreads.bam -tag "qs:>750"

# Generate consensus (HiFi) sequences; the executable in the pbccs package is named ccs
ccs pacbio.bam consensus.bam --min-passes 3 --min-rq 0.99

# Polish assembly with long reads (variantCaller is the legacy GenomicConsensus tool)
pbmm2 align reference.fa pacbio.bam aligned.bam --sort
variantCaller --algorithm=arrow aligned.bam -r reference.fa -o polished.fa

Configuration and Metadata Formats

JSON: JavaScript Object Notation

JSON format provides flexible metadata storage for NGS workflows.

Sample Metadata JSON:

{
  "experiment_id": "EXP001",
  "samples": [
    {
      "sample_id": "SAMPLE_001",
      "condition": "control",
      "replicate": 1,
      "library_prep": "TruSeq",
      "sequencing_depth": 30000000,
      "quality_metrics": {
        "mean_quality": 35.2,
        "percent_duplicates": 12.3,
        "mapping_rate": 94.5
      }
    },
    {
      "sample_id": "SAMPLE_002", 
      "condition": "treatment",
      "replicate": 1,
      "library_prep": "TruSeq",
      "sequencing_depth": 28500000,
      "quality_metrics": {
        "mean_quality": 34.8,
        "percent_duplicates": 15.1,
        "mapping_rate": 93.2
      }
    }
  ],
  "analysis_parameters": {
    "aligner": "bwa-mem",
    "peak_caller": "macs2",
    "fdr_threshold": 0.05
  }
}

YAML: Human-Readable Configuration

YAML provides an alternative to JSON with improved readability.

Pipeline Configuration YAML:

# NGS Analysis Pipeline Configuration
pipeline:
  name: "ChIP-seq Analysis"
  version: "1.2.0"

reference:
  genome: "hg38"
  index_path: "/data/genomes/hg38/bwa_index"
  annotation: "/data/annotations/gencode.v38.gtf"

quality_control:
  adapter_trimming: true
  quality_threshold: 20
  minimum_length: 30

alignment:
  tool: "bwa"
  parameters:
    - "-M"
    - "-t 16"

peak_calling:
  tool: "macs2"
  parameters:
    fdr: 0.05
    fold_change: 2.0

samples:
  - name: "ChIP_sample1"
    files: 
      - "sample1_R1.fastq.gz"
      - "sample1_R2.fastq.gz"
    condition: "treatment"

  - name: "Input_control1"
    files:
      - "input1_R1.fastq.gz" 
      - "input1_R2.fastq.gz"
    condition: "control"

Configuration Format Benefits:

  • Reproducibility: Document analysis parameters
  • Automation: Drive pipeline execution
  • Version Control: Track parameter changes
  • Collaboration: Share analysis protocols
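
To make the "Automation" point concrete, here is a small sketch of how a script might turn the alignment section of such a configuration into a command line. It rests on a few assumptions: the dictionary below mirrors part of the YAML above (in practice it would come from a YAML parser such as PyYAML's yaml.safe_load, a third-party package), the "mem" subcommand is assumed because the parameters look like bwa mem options, and alignment_command is a hypothetical helper name.

```python
import shlex

# Mirrors part of the YAML configuration above as a plain dictionary.
config = {
    "reference": {"index_path": "/data/genomes/hg38/bwa_index"},
    "alignment": {"tool": "bwa", "parameters": ["-M", "-t 16"]},
    "samples": [
        {
            "name": "ChIP_sample1",
            "files": ["sample1_R1.fastq.gz", "sample1_R2.fastq.gz"],
        },
    ],
}


def alignment_command(config, sample):
    """Build an argument list for the aligner from the configuration."""
    alignment = config["alignment"]
    # Assumption: the "bwa" tool is run with its "mem" subcommand.
    args = [alignment["tool"], "mem"]
    for parameter in alignment["parameters"]:
        # "-t 16" is stored as one string, so split it into "-t" and "16".
        args.extend(shlex.split(parameter))
    args.append(config["reference"]["index_path"])
    args.extend(sample["files"])
    return args


if __name__ == "__main__":
    for sample in config["samples"]:
        print(" ".join(alignment_command(config, sample)))
```

Returning a list of arguments rather than one string means the result can be passed to subprocess.run without going through a shell, which avoids quoting problems with unusual file names.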

Best Practices for NGS Data Management

File Organization and Naming Conventions

Hierarchical Directory Structure:

project_root/
├── raw_data/
│   ├── sample_001_R1.fastq.gz
│   ├── sample_001_R2.fastq.gz
│   └── checksums.md5
├── processed/
│   ├── trimmed/
│   ├── aligned/
│   └── quantified/
├── analysis/
│   ├── differential_expression/
│   ├── pathway_analysis/
│   └── figures/
├── metadata/
│   ├── sample_sheet.csv
│   └── experimental_design.yaml
└── scripts/
    ├── preprocessing.sh
    └── analysis.R

Naming Convention Examples:

# Good naming practices
ChIPseq_USF2_HepG2_rep1_treat_001.fastq.gz
RNAseq_WT_brain_12h_rep2_control_R1.fastq.gz
WGS_patient_001_tumor_primary_001.bam

# Include key information:
- Assay type (ChIPseq, RNAseq, WGS)
- Target/condition (USF2, WT, patient_001)
- Sample type (HepG2, brain, tumor)
- Time point (12h)
- Replicate (rep1, rep2)
- Condition (treat, control)
- Read pair (R1, R2)
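
A naming convention is easier to keep consistent when file names are generated rather than typed. The helper below is a hypothetical sketch: build_filename and its alphanumeric-only rule are assumptions for illustration, not part of any standard.

```python
import re

# Assumption: each field may contain only letters and digits, so that the
# underscore can serve as an unambiguous separator.
_SAFE_FIELD = re.compile(r"[A-Za-z0-9]+")


def build_filename(fields, extension):
    """Join name fields with underscores and append the extension."""
    for field in fields:
        if not _SAFE_FIELD.fullmatch(field):
            raise ValueError(f"unsafe field in file name: {field!r}")
    return "_".join(fields) + "." + extension.lstrip(".")


if __name__ == "__main__":
    fields = ["RNAseq", "WT", "brain", "12h", "rep2", "control", "R1"]
    print(build_filename(fields, "fastq.gz"))
```

Rejecting spaces and punctuation up front keeps the resulting names safe to use in shell commands and easy to split back into their fields.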

Data Integrity and Quality Control

Checksum Verification:

# Generate checksums during data transfer
md5sum *.fastq.gz > checksums.md5

# Verify file integrity
md5sum -c checksums.md5

# For stronger collision resistance, use SHA-256 (typically slower than MD5)
sha256sum large_file.bam > large_file.sha256
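
The same verification can be scripted, which helps when checksums need to be checked as part of a pipeline. This is a minimal sketch using only the Python standard library; the function names are illustrative, and failed_files assumes that paths in the manifest are relative to the manifest's own directory (md5sum -c resolves them relative to the current directory instead).

```python
import hashlib
import os


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash a file in chunks so large FASTQ/BAM files never sit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def failed_files(manifest_path, algorithm="md5"):
    """Return names from an md5sum-style manifest that are missing or differ."""
    # Assumption: listed paths are relative to the manifest's directory.
    base_dir = os.path.dirname(os.path.abspath(manifest_path))
    failures = []
    with open(manifest_path) as manifest:
        for line in manifest:
            if not line.strip():
                continue
            expected, name = line.rstrip("\n").split(None, 1)
            name = name.lstrip("*")  # md5sum marks binary mode with '*'
            path = os.path.join(base_dir, name)
            if not os.path.exists(path):
                failures.append(name)
            elif file_checksum(path, algorithm) != expected:
                failures.append(name)
    return failures
```

Passing algorithm="sha256" reads manifests written by sha256sum, since both tools use the same "checksum, whitespace, file name" line layout.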

File Format Validation:

# Validate FASTQ format
seqkit stats -T sample.fastq.gz

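# Quick BAM sanity check (header and end-of-file marker only; non-zero exit on failure)
samtools quickcheck aligned.bam

# Check that the VCF header parses (bcftools exits with an error if it does not)
bcftools view -h variants.vcf | head -20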
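
One of the things samtools quickcheck looks for is the BGZF end-of-file marker: a fixed 28-byte empty block, defined in the SAM/BAM specification, that terminates every complete BAM or bgzip-compressed file. The sketch below reproduces only that one check, which is often enough to catch a transfer that stopped early; the function name is an illustrative choice.

```python
import os

# The 28-byte empty BGZF block that marks end-of-file (SAM/BAM specification).
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00"
    " 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def has_bgzf_eof(path):
    """Return True if the file ends with the BGZF end-of-file block."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF
```

A file that passes this test can still be damaged elsewhere, so it complements checksum verification rather than replacing it.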

Storage and Compression Strategies

Compression Guidelines:

  • FASTQ files: Always compress with gzip (.gz)
  • BAM files: Already BGZF-compressed internally; compressing them again with gzip saves almost nothing
  • VCF files: Compress with bgzip for tabix compatibility
  • Text files: Use gzip for significant space savings

Archive Strategy:

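# Long-term storage as a gzip-compressed tar archive (default compression level;
# files that are already compressed, such as .fastq.gz and .bam, will shrink very little)
tar -czf project_archive.tar.gz project_directory/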

# Create separate archives for different data types
tar -czf raw_data.tar.gz raw_data/
tar -czf analysis_results.tar.gz analysis/

Data Backup and Version Control

Backup Strategy:

  • Primary storage: Active analysis workspace
  • Secondary backup: Network storage or cloud
  • Archive storage: Long-term compressed storage
  • Metadata backup: Critical sample information

Version Control for Analysis:

# Initialize a git repository inside the existing project directory
cd project_root
git init

# Track analysis scripts and metadata (keep large data files out of git)
git add scripts/ metadata/ README.md
git commit -m "Initial analysis setup"

# Create a branch for each analysis (switch with: git checkout <branch>)
git branch differential_expression
git branch pathway_analysis

Common Pitfalls and Troubleshooting

Format Compatibility Issues

Coordinate System Mismatches:

  • 0-based vs 1-based: BED (0-based) vs VCF/GTF (1-based)
  • Half-open intervals: BED uses [start, end) intervals
  • Always verify: Use tools like bedtools for coordinate conversions

# Convert BED intervals to 1-based, closed coordinates (the output is no longer valid BED)
awk '{print $1, $2+1, $3}' OFS='\t' input.bed > output_1based.tsv

# Convert VCF positions to BED format
bcftools query -f '%CHROM\t%POS0\t%END\n' variants.vcf > positions.bed
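
The arithmetic behind these conversions is small enough to write out. In the sketch below, which uses illustrative function names, a BED interval [start, end) is 0-based and half-open, while the 1-based form used by GTF and VCF is closed on both ends. Only the start shifts by one; the end keeps the same value.

```python
def bed_to_one_based(start, end):
    """Convert a 0-based half-open BED interval to 1-based closed."""
    if not 0 <= start < end:
        # Empty intervals (start == end) have no closed representation.
        raise ValueError(f"invalid BED interval: [{start}, {end})")
    return start + 1, end


def one_based_to_bed(start, end):
    """Convert a 1-based closed interval to 0-based half-open BED."""
    if not 1 <= start <= end:
        raise ValueError(f"invalid 1-based interval: [{start}, {end}]")
    return start - 1, end


def bed_length(start, end):
    """Length of a BED interval; no +1 is needed with half-open ends."""
    return end - start
```

For example, the first 100 bases of a chromosome are 0 to 100 in BED and 1 to 100 in GTF, and both describe 100 bases.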

Character Encoding Problems:

# Check file encoding
file -i sample.txt

# Convert encoding if necessary
iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt

# Remove Windows carriage returns (convert CRLF line endings to LF)
tr -d '\r' < windows_file.txt > unix_file.txt

Performance Optimization

Index Management:

# Always index coordinate-sorted files
samtools index sorted.bam
tabix -p vcf compressed.vcf.gz
samtools faidx reference.fa

# Verify index compatibility
samtools idxstats sorted.bam

Memory and Storage Optimization:

# Use streaming for large files
samtools view large.bam chr1:1000000-2000000 | process_reads.py

# Parallel processing with GNU parallel
ls *.fastq.gz | parallel -j 8 'process_sample.sh {}'

# Monitor resource usage
htop  # Interactive process monitor
iostat -x 1  # I/O statistics

Data Corruption and Recovery

Common Corruption Signs:

  • Unexpected file sizes (too small or truncated)
  • Parsing errors from standard tools
  • Missing headers or incomplete records
  • Checksum verification failures

Recovery Strategies:

# Attempt to recover truncated files (keeps the records that can still be read)
samtools view -h corrupted.bam | samtools view -b - > recovered.bam

# Extract partial data from corrupted files
head -n 4000000 corrupted.fastq > partial_recovery.fastq

# Re-pair reads after recovery (BBTools repair.sh restores R1/R2 pairing; it does not repair damaged records)
repair.sh in=corrupted.fastq out=repaired.fastq

Format-Specific Troubleshooting

FASTQ Issues:

# Check for malformed FASTQ records
awk 'NR%4==1{if($0!~/^@/)print "Line "NR": "$0}' sample.fastq

# Summarize quality scores (the Q20/Q30 columns require -a)
seqkit stats -a -T sample.fastq

# Fix truncated FASTQ files by keeping only complete 4-line records
head -n $(( $(wc -l < sample.fastq) / 4 * 4 )) sample.fastq > fixed.fastq
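
The awk check above only inspects header lines. A slightly fuller check also looks at the "+" separator, compares sequence and quality lengths, and reports a truncated final record. The following is a sketch under the assumption that every record occupies exactly four lines (no wrapped sequences); check_fastq and open_text are illustrative names.

```python
import gzip
from itertools import islice


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def check_fastq(path, max_problems=10):
    """Return (number of valid records, list of problem descriptions)."""
    valid = 0
    problems = []
    with open_text(path) as handle:
        block = 0
        while len(problems) < max_problems:
            # Read one record: four lines at a time, streaming through the file.
            chunk = [line.rstrip("\r\n") for line in islice(handle, 4)]
            if not chunk:
                break
            first_line = block * 4 + 1
            block += 1
            if len(chunk) < 4:
                problems.append(
                    f"line {first_line}: truncated record ({len(chunk)} of 4 lines)"
                )
                break
            header, sequence, separator, quality = chunk
            if not header.startswith("@"):
                problems.append(f"line {first_line}: header does not start with '@'")
            elif not separator.startswith("+"):
                problems.append(f"line {first_line + 2}: separator does not start with '+'")
            elif len(sequence) != len(quality):
                problems.append(f"line {first_line}: sequence and quality lengths differ")
            else:
                valid += 1
    return valid, problems
```

Capping the number of reported problems keeps the output readable when a file is shifted by one line and every subsequent record looks malformed.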

BAM/SAM Problems:

# Check BAM header consistency
samtools view -H sample.bam | grep "@RG"

# Check the declared sort order (expect SO:coordinate in the @HD line)
samtools view -H sample.bam | grep "^@HD"

# Fix header issues
samtools reheader new_header.sam sample.bam > fixed.bam

VCF Validation:

# Comprehensive VCF validation
bcftools view variants.vcf | vcf-validator

# Check for sorting issues
bcftools view variants.vcf | bcftools query -f '%CHROM\t%POS\n' | sort -k1,1 -k2,2n -c

# Normalize variants (left-align indels and check REF alleles against the reference)
bcftools norm -f reference.fa -O z -o normalized.vcf.gz variants.vcf
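
One caveat about the sort -c check above: it compares contig names as text, so chr10 sorts before chr2, whereas a coordinate-sorted VCF or BAM usually follows the contig order declared in its header. The sketch below checks positions against an explicit contig order instead; the function names are illustrative, and the contig list is assumed to have been read from the header beforehand.

```python
def read_positions(vcf_lines):
    """Yield (chrom, pos) for each data line of a VCF, skipping headers."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split("\t", 2)[:2]
        yield chrom, int(pos)


def first_unsorted(records, contig_order):
    """Return the index of the first out-of-order record, or None if sorted."""
    rank = {name: index for index, name in enumerate(contig_order)}
    previous = None
    for index, (chrom, pos) in enumerate(records):
        if chrom not in rank:
            raise ValueError(f"contig not in header order: {chrom}")
        key = (rank[chrom], pos)
        if previous is not None and key < previous:
            return index
        previous = key
    return None


if __name__ == "__main__":
    lines = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT",
        "chr1\t100\t.\tA\tG",
        "chr2\t50\t.\tC\tT",
        "chr10\t10\t.\tG\tA",
    ]
    print(first_unsorted(read_positions(lines), ["chr1", "chr2", "chr10"]))
```

The same first_unsorted function works for BAM records if it is given the reference name and position columns together with the @SQ order from the header.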

Future Trends in NGS Data Formats

Emerging Technologies and Formats

Real-Time Sequencing Data:

  • Streaming formats: Handle continuous data flow from real-time sequencers
  • Adaptive compression: Dynamic compression based on data characteristics
  • Event-driven processing: Process data as it’s generated

Multi-Modal Data Integration:

  • Multi-omics formats: Combine genomics, transcriptomics, and epigenomics
  • Spatial data formats: Integrate location information with expression data
  • Time-series formats: Handle temporal genomics experiments

Cloud-Native Formats:

  • Object storage optimization: Formats designed for cloud storage systems
  • Serverless compatibility: Support for function-as-a-service architectures
  • API-first design: RESTful interfaces for data access

Standards Evolution

International Coordination:

  • GA4GH standards: Global Alliance for Genomics and Health format specifications
  • FAIR principles: Findable, Accessible, Interoperable, Reusable data
  • Metadata standardization: Consistent sample and experimental annotations

Performance Improvements:

  • Columnar storage: Apache Parquet and similar formats for analytics
  • GPU acceleration: Formats optimized for parallel processing
  • Quantum-ready: Preparation for quantum computing applications

Conclusion: Mastering NGS Data Formats

Understanding NGS data formats is fundamental to successful genomics analysis. Each format has evolved to address specific challenges in storing, processing, and analyzing biological data. From the simple elegance of FASTA to the sophisticated compression of CRAM, these formats enable researchers to extract meaningful insights from vast genomic datasets.

Key Takeaways

Format Selection Strategy:

  • Choose formats based on your analysis requirements and computational resources
  • Consider long-term storage and accessibility needs
  • Balance compression efficiency with processing speed
  • Ensure compatibility with your analysis tools and collaborators

Quality Assurance:

  • Implement robust data validation procedures
  • Maintain comprehensive metadata throughout your analysis pipeline
  • Use checksums and version control to ensure data integrity
  • Document format conversions and processing steps

Future-Proofing:

  • Stay informed about emerging format standards
  • Design analysis pipelines with format flexibility
  • Participate in community discussions about best practices
  • Consider cloud compatibility in format selection

Practical Next Steps

  1. Audit Your Current Data: Review existing datasets and identify format optimization opportunities
  2. Standardize Workflows: Implement consistent naming conventions and directory structures
  3. Automate Validation: Create scripts to verify data integrity and format compliance
  4. Collaborate Effectively: Establish format standards within your research group
  5. Stay Updated: Follow developments in format standards and analysis tools

The landscape of NGS data formats continues to evolve with advancing sequencing technologies and computational capabilities. By mastering these fundamental formats and understanding their appropriate applications, researchers can build robust, efficient analysis pipelines that scale from individual experiments to large-scale genomics consortia.

Whether you’re analyzing your first ChIP-seq dataset or managing petabyte-scale population genomics data, a solid understanding of NGS data formats provides the foundation for reproducible, high-quality genomics research. The investment in learning these formats pays dividends in analysis efficiency, data management, and collaborative success.


This tutorial is part of the NGS101.com comprehensive guide to next-generation sequencing analysis. For more tutorials on specific analysis workflows and advanced techniques, explore our complete tutorial collection.
