The Complete Guide to NGS Data Types and Formats: From Raw Reads to Analysis-Ready Files

Master the essential file formats in next-generation sequencing analysis

Introduction: Understanding the NGS Data Ecosystem

Next-generation sequencing (NGS) has revolutionized biological research by enabling us to read DNA, RNA, and epigenetic modifications at an unprecedented scale. However, with this power comes complexity – NGS workflows generate dozens of different file formats, each serving specific purposes in the analysis pipeline. Understanding these formats is crucial for any researcher working with genomic data.

What Makes NGS Data Formats Unique?

NGS data formats have evolved to address several key challenges:

Scale: NGS experiments generate massive datasets, often containing millions to billions of sequencing reads
Compression: Raw sequencing data can occupy terabytes of storage, requiring efficient compression methods
Indexing: Random access to specific genomic regions requires sophisticated indexing schemes
Standardization: Interoperability between different analysis tools demands standardized formats
Metadata: Complex experimental designs require rich annotation and sample information

The NGS Analysis Journey: From Molecules to Insights

The path from biological sample to scientific insight involves multiple data transformations, each producing specific file types:

Sequencing Instruments generate raw signals (fluorescence images or electrical current traces) and base calls
Quality Control produces filtered and trimmed sequence reads
Alignment maps reads to reference genomes, producing alignment records that are usually coordinate-sorted afterward
Quantification summarizes read counts into expression matrices
Variant Calling identifies genetic differences from reference sequences
Annotation connects genomic features to biological knowledge

Each step requires specialized file formats optimized for different computational tasks, storage requirements, and analysis workflows.

Key Properties of NGS Data Formats

Understanding the characteristics of each format helps in choosing the right tools and approaches:

Size Considerations:

Raw FASTQ files can range from gigabytes to terabytes
Compressed alignment files (BAM) are typically 60-80% smaller than the equivalent SAM text files
Index files, while small, are essential for efficient random access

Format Types:

Text-based formats (FASTQ, SAM, VCF) are human-readable but larger
Binary formats (BAM, BCF) offer better compression and faster processing
Indexed formats enable rapid access to specific genomic regions

Critical Handling Considerations:

Always verify file integrity using checksums after transfers
Maintain consistent coordinate systems (0-based vs 1-based indexing)
Preserve metadata and sample information throughout the analysis pipeline
Use appropriate compression levels balancing file size and access speed

Raw Sequencing Data: The Foundation of NGS Analysis

Raw sequencing data represents the direct output from sequencing instruments before any computational processing. Understanding these formats is essential for quality assessment and troubleshooting.

Platform-Specific Raw Data Formats

Different sequencing technologies produce distinct raw data formats, each reflecting their underlying detection mechanisms:

Illumina Sequencing:

BCL files: Binary base call files containing per-cycle base calls and quality scores
FASTQ files: Text-based format with sequences and per-base quality scores
InterOp files: Binary files containing run metrics and quality statistics

Oxford Nanopore:

FAST5 files: HDF5-based format storing raw electrical current measurements
POD5 files: Newer, more efficient format replacing FAST5
FASTQ files: Basecalled sequences with quality scores

Pacific Biosciences (PacBio):

H5 files: HDF5 format for older RSII systems
BAM files: PacBio’s primary format for Sequel systems
FASTA/FASTQ: Extracted consensus sequences

Comparative Analysis of Raw Data Formats

Platform	Primary Format	File Size	Read Length	Error Profile	Use Cases
Illumina	FASTQ	1-50 GB	50-300bp	Low error rate, mostly substitutions	Genome sequencing, RNA-seq, ChIP-seq
Nanopore	FAST5/POD5	10-500 GB	1kb-2Mb	Indels, homopolymer	Long-read assembly, structural variants
PacBio	BAM/FASTQ	5-200 GB	1kb-100kb	Random errors	High-quality assembly, isoform analysis

Sequence Data Formats: The Building Blocks

Sequence data formats store the fundamental genetic information extracted from NGS experiments. These formats serve as input for most downstream analyses.

FASTQ: The Universal Sequence Format

FASTQ format dominates NGS workflows due to its simplicity and comprehensive information content.

Structure and Components:

@M00967:43:000000000-A3JHG:1:1101:18327:1699 1:N:0:1
CCTACGGGNGGCWGCAG
+
A1>1>11#-1>11<-<1

Detailed Breakdown:

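Header: Starts with @ and contains the instrument ID, run number, flowcell ID, lane, tile, and x/y cluster coordinates, followed by the read number, filter flag, control bits, and sample index
Sequence: Raw nucleotide sequence (A, C, G, T, or N for an uncalled base; the W in the example is an IUPAC ambiguity code that sequencers do not normally emit)
Separator: A line beginning with +, optionally repeating the header
Quality: One ASCII character per base encoding the Phred score (Phred+33 on current Illumina instruments), so this line must be the same length as the sequence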
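
To make the quality line concrete, here is a minimal sketch of decoding it, assuming the Phred+33 encoding used by current Illumina output. The function names are illustrative and not part of any library.

```python
def phred33_to_scores(quality):
    """Decode a Phred+33 quality string into a list of integer Phred scores."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(q):
    """Convert a Phred score Q into the base-call error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


# First eight quality characters of the example record above
scores = phred33_to_scores("A1>1>11#")
print(scores)
print([round(error_probability(q), 4) for q in scores])
```

A score of 20 corresponds to a 1% chance that the base call is wrong, and a score of 30 to 0.1%.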

Common FASTQ Variants:

Paired-end: Two FASTQ files (R1 and R2) with matching read IDs
Compressed: .fastq.gz files for storage efficiency
Multiplexed: Multiple samples in single file with barcode identification

Best Practices for FASTQ Handling:

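# Summarize FASTQ contents (read count, total bases, length range)
seqkit stats sample.fastq.gz

# Verify that the gzip stream is intact
gzip -t sample.fastq.gz

# Count total reads (4 lines per record)
echo $(( $(gzip -dc sample.fastq.gz | wc -l) / 4 ))

# Extract first 1000 reads
gzip -dc sample.fastq.gz | head -n 4000 > sample_subset.fastq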

FASTA: Simple Sequence Storage

FASTA format provides a streamlined approach for storing sequences without quality information.

Basic Structure:

>sequence_identifier optional_description
ATGCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCGA
>another_sequence
GCGATCGATCGATCGATCGATCGATCGATCGAT
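
Because a FASTA sequence may wrap across several lines, a reader has to join lines until the next > header. The following is a minimal sketch of that logic using only the Python standard library. The function names are illustrative, and established parsers should be preferred for production work.

```python
import io


def read_fasta(handle):
    """Yield (identifier, description, sequence) tuples from a FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)


def _record(header, chunks):
    # The identifier is everything up to the first space; the rest is free text.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


example = io.StringIO(
    ">sequence_identifier optional_description\n"
    "ATGCGATCG\n"
    "ATCGATCGA\n"
    ">another_sequence\n"
    "GCGATCGAT\n"
)
records = list(read_fasta(example))
for identifier, description, sequence in records:
    print(identifier, len(sequence))
```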

When to Use FASTA:

Reference genome sequences
Protein sequences
Assembled contigs or scaffolds
Consensus sequences from multiple sequence alignments
Primer and probe sequences

FASTA vs FASTQ Decision Matrix:

Use Case	Format Choice	Reasoning
Raw sequencing reads	FASTQ	Need quality scores for filtering
Reference genomes	FASTA	No quality information needed
Assembly output	FASTA	Consensus sequences
Database searches	FASTA	Standard for BLAST databases

Alignment Data Formats: Mapping Reads to Genomes

Once sequencing reads are generated, they must be aligned to reference genomes. Alignment formats store this crucial mapping information with varying levels of compression and accessibility.

SAM: The Human-Readable Alignment Standard

The Sequence Alignment/Map (SAM) format provides a comprehensive, text-based representation of alignments.

SAM File Structure:

@HD    VN:1.6  SO:coordinate
@SQ    SN:chr1 LN:248956422
@PG    ID:bwa  PN:bwa  VN:0.7.17-r1188
@RG    ID:sample1  SM:patient_001  PL:ILLUMINA
M00967:43:000000000-A3JHG:1:1101:18327:1699    99  chr1    1000    60  150M    =   1150    300 CCTACGGGNGGCWGCAG...    A1>1>11#-1>11<-<11...   AS:i:145    XS:i:20

Header Section (@-lines):

@HD: File format version and sort order
@SQ: Reference sequence information
@RG: Read group information (sample, library, platform)
@PG: Program information used for alignment

Alignment Records (11 mandatory fields):

QNAME: Read identifier
FLAG: Bitwise flag indicating alignment properties
RNAME: Reference sequence name (chromosome)
POS: 1-based leftmost alignment position
MAPQ: Mapping quality score
CIGAR: Concise alignment representation
RNEXT: Reference name of mate/next read
PNEXT: Position of mate/next read
TLEN: Template length
SEQ: Read sequence
QUAL: ASCII-encoded read quality
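
The FLAG field packs several yes/no properties into one integer. As a rough sketch, the bit values defined in the SAM specification can be decoded as follows; the short property names are my own labels, not official identifiers.

```python
# Bit values follow the SAM specification; the names are informal labels.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


print(decode_flag(99))
print(decode_flag(147))
```

The value 99 from the example record therefore describes the first read of a properly paired fragment whose mate aligns to the reverse strand.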

CIGAR String Interpretation:

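M: Alignment match (the aligned base may match or mismatch the reference)
I: Insertion relative to the reference (bases present in the read only)
D: Deletion relative to the reference (bases present in the reference only)
S: Soft clipping (clipped bases remain in SEQ)
H: Hard clipping (clipped bases are removed from SEQ)
N: Skipped region on the reference (e.g., an intron in RNA-seq)
=/X: Sequence match / sequence mismatch (used by some aligners instead of M)

Example: 50M2I25M = 50 aligned bases, a 2-base insertion, then 25 aligned bases (77 read bases spanning 75 reference bases)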
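
A short sketch of how a CIGAR string can be split into operations and used to compute how many read and reference bases an alignment covers. The helper names are illustrative; which operations consume the read or the reference follows the SAM specification.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")       # operations that use up read bases
CONSUMES_REFERENCE = set("MDN=X")   # operations that use up reference bases


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError("malformed CIGAR string: %r" % cigar)
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def cigar_spans(cigar):
    """Return (read bases consumed, reference bases consumed)."""
    ops = parse_cigar(cigar)
    query = sum(n for n, op in ops if op in CONSUMES_QUERY)
    reference = sum(n for n, op in ops if op in CONSUMES_REFERENCE)
    return query, reference


print(parse_cigar("50M2I25M"))
print(cigar_spans("50M2I25M"))
print(cigar_spans("10S40M1000N50M"))
```

The reference span is what determines the rightmost aligned position: POS plus the reference span minus one.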

BAM: Compressed Binary Alignments

BAM format provides the same information as SAM but in a compressed, binary format optimized for computational efficiency.

Key Advantages:

File Size: 60-80% smaller than equivalent SAM files
Processing Speed: Faster parsing and processing
Random Access: Efficient retrieval of specific genomic regions
Compression: Built-in BGZF (blocked gzip) compression

BAM Usage Examples:

# Convert SAM to BAM
samtools view -bS alignment.sam > alignment.bam

# Sort BAM file by coordinates
samtools sort -o alignment_sorted.bam alignment.bam

# Index BAM for random access
samtools index alignment_sorted.bam

# Extract reads from specific region
samtools view alignment_sorted.bam chr1:1000000-2000000

BAI: BAM Index Files

BAI files enable rapid random access to specific genomic regions within BAM files.

Index Structure:

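Binning Index: Hierarchical bins (from 512 Mb down to 16 kb) listing the file chunks that hold alignments overlapping each bin
Linear Index: File offset of the first alignment overlapping each 16 kb window, used to skip irrelevant chunks
Metadata: Per-reference counts of mapped and unmapped reads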

Critical Considerations:

BAI files must be regenerated after any BAM file modification
Index files should be stored alongside BAM files
Coordinate-sorted BAM files are required for indexing
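
For readers curious how an index bin is chosen, the sketch below ports the reg2bin calculation from the SAM specification to Python. It takes a 0-based, half-open interval and returns the smallest bin that fully contains it; treat it as an illustration of the binning scheme rather than a replacement for samtools.

```python
def reg2bin(beg, end):
    """Return the BAI bin number for the 0-based half-open interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)   # 16 kb bins
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)   # 128 kb bins
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)    # 1 Mb bins
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)    # 8 Mb bins
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)    # 64 Mb bins
    return 0                                        # single 512 Mb bin


print(reg2bin(0, 1))
print(reg2bin(0, 16385))
print(reg2bin(0, 1 << 29))
```

An alignment that sits inside one 16 kb window lands in a small bin, while one that straddles a window boundary is pushed up to a larger bin.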

CRAM: Ultra-Compressed Alignments

CRAM format offers superior compression by using reference-based compression algorithms.

Compression Benefits:

Size Reduction: 30-60% smaller than BAM files
Lossless by default: Maintains all alignment information unless lossy options (such as quality binning) are explicitly enabled
Reference-Based: Stores only differences from reference genome

CRAM Usage Scenarios:

Long-term data archiving
Large-scale population genomics projects
Cloud storage optimization
Bandwidth-limited data transfers

CRAM Workflow Example:

# Convert BAM to CRAM
samtools view -C -T reference.fa alignment.bam > alignment.cram

# Index CRAM file
samtools index alignment.cram

# Convert CRAM back to BAM
samtools view -b -T reference.fa alignment.cram > alignment_restored.bam

Quantification and Expression Data Formats

Gene expression analysis requires specialized formats to store quantified measurements across samples and conditions. These formats balance human readability with computational efficiency.

Count Matrices: The Foundation of Expression Analysis

Count matrices represent the core data structure for RNA-seq and single-cell analyses.

Tab-Separated Values (TSV) Format:

Gene_ID    Sample_1    Sample_2    Sample_3    Sample_4
ENSG00000000003    743 891 1205    567
ENSG00000000005    0   2   1   0
ENSG00000000419    1891    2103    2456    1678
ENSG00000000457    567 634 723 445
ENSG00000000460    89  123 156 67

Comma-Separated Values (CSV) Format:

Gene_ID,Sample_1,Sample_2,Sample_3,Sample_4
ENSG00000000003,743,891,1205,567
ENSG00000000005,0,2,1,0
ENSG00000000419,1891,2103,2456,1678

Best Practices for Count Matrices:

Use gene IDs (Ensembl, RefSeq) rather than gene names for consistency
Include metadata files describing samples and experimental conditions
Validate that row and column totals match expected values
Store raw counts separately from normalized values

Normalized Expression Tables

Normalization addresses technical biases and enables meaningful comparisons between samples.

TPM (Transcripts Per Million) Table:

Gene_ID    Gene_Length Sample_1_TPM    Sample_2_TPM    Sample_3_TPM
ENSG00000000003    2100    354.1   424.5   573.8
ENSG00000000005    1500    0.0 1.3 0.7
ENSG00000000419    3200    591.6   657.2   768.1

FPKM/RPKM Comparison:

TPM: Transcripts Per Million; values sum to 1 million within each sample
FPKM: Fragments Per Kilobase of transcript per Million mapped fragments; used for paired-end RNA-seq
RPKM: Reads Per Kilobase of transcript per Million mapped reads; used for single-end RNA-seq

When to Use Each Format:

TPM: Cross-sample comparisons and meta-analyses
FPKM/RPKM: Within-sample gene length normalization
Raw Counts: Differential expression analysis with DESeq2/edgeR
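
The difference between the two normalizations is easiest to see in code. This is a minimal sketch of the standard formulas, assuming one effective length per gene in base pairs and at least one non-zero count. The function names are illustrative.

```python
def fpkm(counts, lengths_bp):
    """FPKM = count * 1e9 / (gene length in bp * total counts in the sample)."""
    total = sum(counts)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """TPM: length-normalize first, then scale so the sample sums to one million."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [rate * 1e6 / scale for rate in rates]


counts = [743, 0, 1891]
lengths = [2100, 1500, 3200]
print([round(v, 1) for v in fpkm(counts, lengths)])
print([round(v, 1) for v in tpm(counts, lengths)])
print(round(sum(tpm(counts, lengths))))
```

Because TPM rescales after length normalization, every sample sums to the same total, which is why it is generally preferred over FPKM for comparing samples.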

Single-Cell Specific Formats

Single-cell RNA-seq generates sparse, high-dimensional datasets requiring specialized storage formats.

Matrix Market (MTX) Format:

%%MatrixMarket matrix coordinate integer general
32738 5000 8934756
1 1 4
1 3 1
2 1 2
2 2 8

Format Structure:

Line 1: Header with format information
Line 2: Matrix dimensions (genes, cells, non-zero entries)
Subsequent lines: Row, column, value triplets

Associated Files:

features.tsv: Gene IDs and symbols
barcodes.tsv: Cell barcode sequences
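
The triplet layout can be read with a few lines of standard-library Python. This sketch assumes an integer coordinate matrix like the one above and converts the 1-based MTX indices to 0-based keys. The function name is illustrative, and the tiny matrix is made up for the demonstration.

```python
import io


def read_mtx(handle):
    """Parse a MatrixMarket coordinate file into (n_rows, n_cols, entries)."""
    header = handle.readline()
    if not header.startswith("%%MatrixMarket matrix coordinate"):
        raise ValueError("not a MatrixMarket coordinate file")
    line = handle.readline()
    while line.startswith("%"):          # skip optional comment lines
        line = handle.readline()
    n_rows, n_cols, n_entries = (int(x) for x in line.split())
    entries = {}
    for line in handle:
        if not line.strip():
            continue
        row, col, value = line.split()
        # MTX indices are 1-based; store 0-based (row, col) keys
        entries[(int(row) - 1, int(col) - 1)] = int(value)
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return n_rows, n_cols, entries


demo = io.StringIO(
    "%%MatrixMarket matrix coordinate integer general\n"
    "3 3 4\n"
    "1 1 4\n"
    "1 3 1\n"
    "2 1 2\n"
    "2 2 8\n"
)
n_genes, n_cells, matrix = read_mtx(demo)
print(n_genes, n_cells, matrix)
```

Row i of the matrix corresponds to line i of features.tsv and column j to line j of barcodes.tsv.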

HDF5 and AnnData Formats

Modern single-cell analysis increasingly relies on hierarchical data formats.

HDF5 (.h5) Structure:

/
├── matrix/
│   ├── data
│   ├── indices
│   └── indptr
├── features/
│   ├── id
│   ├── name
│   └── feature_type
└── barcodes
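
The data, indices, and indptr datasets are the three arrays of a compressed sparse column (CSC) matrix, with one column per barcode. The sketch below shows how one cell's expression vector is recovered from them; it uses plain lists so that it runs without h5py, and the number of genes is passed in explicitly as an assumption (it is not among the datasets listed in the tree above).

```python
def csc_column(data, indices, indptr, n_rows, col):
    """Rebuild one dense column (one cell) from CSC arrays."""
    dense = [0] * n_rows
    # indptr[col]:indptr[col + 1] is the slice of data/indices for this column
    for k in range(indptr[col], indptr[col + 1]):
        dense[indices[k]] = data[k]
    return dense


# A 3-gene x 3-cell toy matrix with four non-zero entries
data = [4, 2, 8, 1]       # non-zero values, stored column by column
indices = [0, 1, 1, 0]    # gene (row) index of each value
indptr = [0, 2, 3, 4]     # where each cell's values start in data/indices

for cell in range(3):
    print(cell, csc_column(data, indices, indptr, 3, cell))
```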

AnnData (.h5ad) Components:

X: Primary data matrix (cells × genes, i.e., observations × variables)
obs: Cell metadata (cell type, cluster, etc.)
var: Gene metadata (gene symbols, biotype)
obsm: Multi-dimensional cell annotations
varm: Multi-dimensional gene annotations
uns: Unstructured metadata

Usage Example:

import scanpy as sc
import pandas as pd

# Load AnnData object
adata = sc.read_h5ad('single_cell_data.h5ad')

# Access expression matrix
expression = adata.X

# Access metadata
cell_metadata = adata.obs
gene_metadata = adata.var

Loom Format: Comprehensive Single-Cell Storage

Loom format provides a self-contained solution for single-cell genomics data.

Loom File Structure:

/
├── matrix (main data matrix)
├── row_attrs/
│   ├── Gene (gene symbols)
│   └── Accession (gene IDs)
├── col_attrs/
│   ├── CellID (cell barcodes)
│   └── CellType (annotations)
└── row_graphs/ (gene-gene relationships)

Advantages:

Cross-platform compatibility
Efficient chunked, compressed matrix storage built on HDF5
Built-in metadata management
Support for graphs and hierarchical relationships

R Data Formats

R-based analysis workflows commonly use native R storage formats.

RDS Format (.rds):

# Save single R object
expression_matrix <- read.csv("counts.csv", row.names=1)
saveRDS(expression_matrix, "expression_data.rds")

# Load RDS object
loaded_data <- readRDS("expression_data.rds")

RData Format (.rda/.RData):

# Save multiple R objects
sample_metadata <- read.csv("metadata.csv")
gene_annotations <- read.csv("genes.csv")
save(expression_matrix, sample_metadata, gene_annotations, 
     file="complete_dataset.RData")

# Load all objects
load("complete_dataset.RData")

Variant Data Formats: Capturing Genetic Diversity

Variant calling identifies genetic differences between sequenced samples and reference genomes. These formats must efficiently store diverse types of genetic variation while maintaining compatibility with analysis tools.

VCF: The Variant Call Format Standard

VCF format serves as the gold standard for storing genetic variants, from single nucleotide polymorphisms to complex structural variations.

VCF File Structure:

##fileformat=VCFv4.2
##reference=hg38
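##contig=<ID=chr1,length=248956422>
##contig=<ID=chr2,length=242193529>
##FILTER=<ID=LowQual,Description="Low quality call">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">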
#CHROM    POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample1 Sample2
chr1    1000    rs123456    A   G   99.0    PASS    DP=98   GT:AD:DP    0/1:25,25:50    1/1:0,48:48
chr1    2000    .   T   C   45.2    LowQual DP=30   GT:AD:DP    0/0:15,0:15 0/1:8,7:15
chr2    3000    .   GTC G   87.5    PASS    DP=64   GT:AD:DP    0/1:16,16:32    0/1:18,14:32

Header Section (##-lines):

fileformat: VCF version specification
reference: Reference genome used
contig: Chromosome/contig information
INFO: Variant-level annotation descriptions
FORMAT: Sample-level field descriptions

Variant Records (8 mandatory columns, plus FORMAT and sample columns when genotypes are present):

CHROM: Chromosome identifier
POS: 1-based position
ID: Variant identifier (e.g., dbSNP ID)
REF: Reference allele
ALT: Alternative allele(s)
QUAL: Quality score
FILTER: Filter status
INFO: Variant-level annotations
FORMAT: Colon-separated keys describing each sample field (column 9)
Samples: One column per sample from column 10 onward, holding genotype and related data

Genotype Encoding:

0/0: Homozygous reference
0/1: Heterozygous
1/1: Homozygous alternative
./.: Missing genotype
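
As a small illustration of how the FORMAT and sample columns fit together, the sketch below pairs FORMAT keys with one sample's values and classifies the genotype. It accepts both unphased (/) and phased (|) separators. The function names and category labels are my own.

```python
import re


def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with one sample's colon-separated values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def classify_genotype(gt):
    """Classify a diploid GT string such as 0/1, 1|1 or ./."""
    alleles = re.split(r"[/|]", gt)
    if any(allele == "." for allele in alleles):
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    if alleles[0] == "0":
        return "homozygous_reference"
    return "homozygous_alternate"


sample = parse_sample("GT:AD:DP", "0/1:25,25:50")
print(sample)
print(classify_genotype(sample["GT"]))
```

Note that a genotype such as 1/2 at a multi-allelic site is heterozygous even though neither allele is the reference.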

Complex Variant Examples:

# Single nucleotide variant (SNV)
chr1    1000    .   A   G   99.0    PASS    .   GT  0/1

# Insertion
chr1    2000    .   T   TAGA    87.5    PASS    .   GT  0/1

# Deletion
chr1    3000    .   ATCG    A   92.3    PASS    .   GT  1/1

# Multi-allelic site
chr1    4000    .   G   A,T 78.4    PASS    .   GT  1/2
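
The variant class is not stored explicitly in these records; it is implied by the REF and ALT strings. A simplified sketch of that inference follows. It compares allele lengths only, and groups symbolic alleles, breakends, and the * allele under a single "symbolic" label. The names are illustrative.

```python
def variant_type(ref, alt):
    """Infer a coarse variant class from one REF/ALT pair."""
    if alt.startswith("<") or "[" in alt or "]" in alt or alt == "*":
        return "symbolic"
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) == len(alt):
        return "MNV"
    return "insertion" if len(alt) > len(ref) else "deletion"


def classify_record(ref, alt_field):
    """Classify every allele of a possibly multi-allelic ALT field."""
    return [variant_type(ref, alt) for alt in alt_field.split(",")]


print(classify_record("A", "G"))
print(classify_record("T", "TAGA"))
print(classify_record("ATCG", "A"))
print(classify_record("G", "A,T"))
```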

BCF: Binary Variant Call Format

BCF provides a compressed, binary representation of VCF data optimized for computational processing.

Key Advantages:

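Performance: Faster parsing than text VCF because fields are stored in typed binary form
Size: Considerably smaller than uncompressed VCF; often comparable to bgzip-compressed VCF
Indexing: Efficient random access through a CSI index (bcftools index)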
Precision: Maintains full numerical precision

BCF Workflow:

# Convert VCF to BCF
bcftools view -Ob variants.vcf > variants.bcf

# Index BCF file
bcftools index variants.bcf

# Query specific region
bcftools view variants.bcf chr1:1000000-2000000

# Convert back to VCF
bcftools view variants.bcf > variants_restored.vcf

MAF: Mutation Annotation Format

MAF format specializes in storing somatic mutations with rich clinical and functional annotations.

MAF File Example:

Hugo_Symbol    Variant_Classification  Tumor_Sample_Barcode    HGVSp   HGVSc   Chromosome  Start_Position  End_Position    Reference_Allele    Tumor_Seq_Allele2
TP53    Missense_Mutation   TCGA-AA-A00A-01 p.R175H c.524G>A    chr17   7675088 7675088 C   T
KRAS    Missense_Mutation   TCGA-AA-A00A-01 p.G12D  c.35G>A chr12   25245350    25245350    C   T
PIK3CA    Missense_Mutation   TCGA-BB-B00B-01 p.E545K c.1633G>A   chr3    179218303   179218303   G   A

Critical MAF Fields:

Hugo_Symbol: Gene symbol
Variant_Classification: Functional impact (Missense, Nonsense, etc.)
Tumor_Sample_Barcode: Sample identifier
HGVSp/HGVSc: Protein and coding sequence notation
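Reference_Allele/Tumor_Seq_Allele2: Reference and tumor alleles, reported on the plus strand (so they can be the complement of the HGVSc change for genes on the minus strand)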

MAF Use Cases:

Cancer genomics analysis
Mutation burden calculations
Pathway enrichment analysis
Clinical data integration
Survival analysis correlation

BEDPE: Paired-End Breakpoint Format

BEDPE format stores structural variants and breakpoint information from paired-end sequencing.

BEDPE Structure:

chr1    1000    2000    chr1    5000    6000    variant_1   100 +   -   deletion
chr2    3000    3500    chr3    7000    7500    variant_2   150 +   +   translocation
chrX    8000    8200    chrX    9000    9200    variant_3   80  -   -   inversion

BEDPE Fields:

chrom1, start1, end1: First breakpoint
chrom2, start2, end2: Second breakpoint
name: Variant identifier
score: Confidence score
strand1, strand2: Breakpoint orientations
type: Structural variant type (an optional, tool-specific column beyond the 10 standard BEDPE fields)

Structural Variant Types:

Deletion: Loss of genomic sequence
Duplication: Copy number increase
Inversion: Sequence orientation reversal
Translocation: Inter-chromosomal rearrangement
Insertion: Novel sequence addition

Copy Number Variation Formats

CNVkit generates specialized formats for copy number analysis.

CNR (Copy Number Ratio) Format:

chromosome    start   end gene    log2    depth   weight
chr1    1000000 1001000 GENE1   -0.15   125.4   0.95
chr1    1001000 1002000 GENE1   0.23    138.2   0.98
chr1    1002000 1003000 GENE2   1.45    156.7   0.92

CNS (Copy Number Segment) Format:

chromosome    start   end gene    log2    cn  depth   p_ttest probes  weight
chr1    1000000 1500000 GENE1,GENE2 0.12    2   142.3   0.001   500 0.94
chr1    1500000 2000000 GENE3   1.00    4   165.8   0.000   250 0.96

Interpretation:

log2 ratio: Copy number relative to diploid (log2(cn/2))
cn: Absolute copy number
p_ttest: Statistical significance of segment
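
The relationship between the log2 ratio and copy number can be written out directly. This sketch assumes a pure sample against a diploid baseline; real tumor samples need a correction for purity and ploidy, which is not shown here. The function names are illustrative.

```python
import math


def copy_number_from_log2(log2_ratio, ploidy=2):
    """Invert log2(cn / ploidy): cn = ploidy * 2 ** log2_ratio."""
    return ploidy * 2 ** log2_ratio


def log2_from_copy_number(cn, ploidy=2):
    """Expected log2 ratio for an absolute copy number."""
    return math.log2(cn / ploidy)


for ratio in (-1.0, 0.0, 0.58, 1.0):
    print(ratio, round(copy_number_from_log2(ratio), 2))
```

A log2 ratio of 0 means two copies, +1 means four copies, and -1 means a single-copy loss.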

Annotation and Feature Data Formats

Genomic annotations connect sequence data to biological knowledge, providing essential context for interpreting analysis results. These formats must efficiently represent diverse genomic features while maintaining compatibility with analysis tools.

GTF/GFF: Gene Transfer Format and General Feature Format

GTF and GFF formats store comprehensive gene structure and functional annotations.

GTF Format Example:

chr1    HAVANA  gene    11869   14409   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
chr1    HAVANA  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript";
chr1    HAVANA  exon    11869   12227   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1"; exon_id "ENSE00002234944";
chr1    HAVANA  exon    12613   12721   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "2"; exon_id "ENSE00003582793";

GTF Field Descriptions:

seqname: Chromosome/scaffold identifier
source: Annotation source (ENSEMBL, RefSeq, etc.)
feature: Feature type (gene, transcript, exon, CDS)
start/end: 1-based, inclusive genomic coordinates
score: Confidence score (optional)
strand: + (forward) or - (reverse)
frame: Reading frame (0, 1, or 2) for CDS features
attributes: Semicolon-separated key-value pairs
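
The attributes column is the part of GTF that usually needs custom parsing. A minimal sketch follows, assuming values contain no semicolons and that keeping only the last value of a repeated key (such as tag) is acceptable. The function name is illustrative.

```python
def parse_gtf_attributes(field):
    """Turn 'key "value"; key "value";' into a dictionary."""
    attributes = {}
    for item in field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        # Values are normally quoted; some files leave numbers unquoted
        attributes[key] = value.strip().strip('"')
    return attributes


attrs = parse_gtf_attributes(
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1";'
)
print(attrs["gene_id"], attrs["exon_number"])
```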

Common Feature Types:

gene: Complete gene locus
transcript: Individual transcript isoform
exon: Transcribed regions
CDS: Protein-coding sequences
UTR: Untranslated regions
start_codon/stop_codon: Translation boundaries

GFF3 Enhanced Features:

chr1    RefSeq  gene    11874   14409   .   +   .   ID=gene1;Name=DDX11L1;Dbxref=GeneID:100287102
chr1    RefSeq  mRNA    11874   14409   .   +   .   ID=rna1;Parent=gene1;Name=NR_046018.2
chr1    RefSeq  exon    11874   12227   .   +   .   ID=exon1;Parent=rna1
chr1    RefSeq  exon    12613   12721   .   +   .   ID=exon2;Parent=rna1

GFF3 vs GTF Comparison:

GFF3: Hierarchical relationships with ID/Parent structure
GTF: Flat structure with shared gene_id/transcript_id
GFF3: More flexible attribute system
GTF: Simpler parsing for RNA-seq workflows

BED: Browser Extensible Data Format

BED format provides a simple, flexible way to represent genomic intervals and annotations.

BED Format Variants:

BED3 (Minimal):

chr1    1000    2000
chr1    5000    6000
chr2    3000    4000

BED6 (Standard):

chr1    1000    2000    feature1    100 +
chr1    5000    6000    feature2    200 -
chr2    3000    4000    feature3    150 +

BED12 (Full):

chr1    1000    5000    gene1   1000    +   1200    4800    255,0,0 2   800,600 0,3400

BED Field Descriptions:

chrom: Chromosome name
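chromStart: 0-based start position (included in the feature)
chromEnd: End position, not included in the feature (half-open interval), so feature length is chromEnd - chromStart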
name: Feature identifier
score: Display score (0-1000)
strand: Orientation
thickStart/thickEnd: Coding region boundaries
itemRgb: RGB color values
blockCount: Number of sub-features
blockSizes: Comma-separated block sizes
blockStarts: Relative block start positions
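
The block fields are relative to chromStart, which is easy to get wrong. The following sketch expands a BED12 record into absolute block (for example, exon) intervals, still 0-based and half-open. The function name is illustrative.

```python
def bed12_blocks(fields):
    """Expand a BED12 record into absolute (chrom, start, end) block intervals."""
    chrom, chrom_start = fields[0], int(fields[1])
    block_count = int(fields[9])
    # Some files end the comma-separated lists with a trailing comma
    sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    if not (len(sizes) == len(starts) == block_count):
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    return [
        (chrom, chrom_start + offset, chrom_start + offset + size)
        for offset, size in zip(starts, sizes)
    ]


record = "chr1 1000 5000 gene1 1000 + 1200 4800 255,0,0 2 800,600 0,3400".split()
print(bed12_blocks(record))
```

For the BED12 example above this gives two blocks, the second of which ends exactly at chromEnd as the format requires.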

BED Use Cases:

ChIP-seq peak regions
Gene promoter definitions
Regulatory element annotations
Copy number variation segments
Structural variant breakpoints

Peak Calling Formats

ChIP-seq and ATAC-seq analyses generate specialized peak formats.

narrowPeak Format:

chr1    1000    2000    peak1   100 .   5.2 10.1    8.3 500
chr1    5000    6000    peak2   200 .   7.8 15.2    12.1    400
chr2    3000    4000    peak3   150 .   6.1 12.5    9.8 300

narrowPeak Fields:

chrom, chromStart, chromEnd: Standard BED3 fields
name: Peak identifier
score: Integer score (0-1000)
strand: Orientation (usually '.')
signalValue: Signal enrichment
pValue: -log10(p-value), or -1 if not assigned
qValue: -log10(q-value), or -1 if not assigned
peak: 0-based offset of the peak summit from chromStart, or -1 if no summit was called
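
Two narrowPeak columns need conversion before use: the summit is an offset rather than a coordinate, and the significance values are stored as -log10. A small sketch follows, with illustrative field names in the returned dictionary.

```python
def parse_narrowpeak(line):
    """Parse one narrowPeak line and derive the summit position and raw p-value."""
    f = line.split()
    start, offset = int(f[1]), int(f[9])
    neg_log10_p = float(f[7])
    return {
        "chrom": f[0],
        "start": start,
        "end": int(f[2]),
        "name": f[3],
        "signal": float(f[6]),
        # -1 marks "not assigned" for both the p-value and the summit offset
        "p_value": None if neg_log10_p < 0 else 10 ** -neg_log10_p,
        "summit": None if offset < 0 else start + offset,
    }


peak = parse_narrowpeak("chr1 1000 2000 peak1 100 . 5.2 10.1 8.3 500")
print(peak["summit"], peak["p_value"])
```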

broadPeak Format:

chr1    1000    5000    region1 100 .   5.2 10.1    8.3
chr1    10000   15000   region2 200 .   7.8 15.2    12.1

broadPeak Differences:

No peak summit column (column 10)
Represents broader enrichment regions
Suitable for histone modifications (H3K27me3, H3K36me3)

BigBED: Indexed Binary BED

BigBED format provides efficient storage and random access for large BED datasets.

BigBED Advantages:

Compression: 70-90% size reduction
Indexing: Rapid region-based queries
Scalability: Handles millions of features efficiently
Browser Integration: Direct UCSC Genome Browser loading

BigBED Creation:

# Sort BED file by chromosome and position
sort -k1,1 -k2,2n input.bed > sorted.bed

# Get chromosome sizes
fetchChromSizes hg38 > hg38.chrom.sizes

# Convert BED to BigBED
bedToBigBed sorted.bed hg38.chrom.sizes output.bb

# Query specific region from BigBED
bigBedToBed output.bb -chrom=chr1 -start=1000000 -end=2000000 stdout

WIG and BigWig: Continuous Signal Data

Wiggle (WIG) and BigWig formats store continuous numerical data across genomic coordinates, essential for visualizing signal tracks.

WIG Format Types:

Variable Step WIG:

track type=wiggle_0 name="Sample1_Coverage" description="Coverage track"
variableStep chrom=chr1
1001    5.2
1002    5.8
1003    6.1
1005    4.9
1010    7.3

Fixed Step WIG:

track type=wiggle_0 name="Sample1_Coverage"
fixedStep chrom=chr1 start=1001 step=1
5.2
5.8
6.1
0.0
4.9

BedGraph Format (WIG alternative):

track type=bedGraph name="Sample1_Coverage"
chr1    1000    1001    5.2
chr1    1001    1002    5.8
chr1    1002    1003    6.1
chr1    1004    1005    4.9
chr1    1009    1010    7.3

Format Comparison:

Variable Step: Sparse data with irregular intervals
Fixed Step: Dense data with regular intervals
BedGraph: Most flexible, handles any interval structure
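
The three layouts carry the same information, and the main trap when converting is that WIG positions are 1-based while bedGraph intervals are 0-based and half-open. The sketch below converts variableStep data only, as an illustration; the function name is hypothetical and fixedStep input is rejected rather than handled.

```python
def variable_step_to_bedgraph(lines):
    """Yield (chrom, start, end, value) bedGraph tuples from variableStep WIG lines."""
    chrom, span = None, 1
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        if line.startswith("variableStep"):
            params = dict(p.split("=", 1) for p in line.split()[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            continue
        if line.startswith("fixedStep"):
            raise ValueError("fixedStep sections are not handled in this sketch")
        if chrom is None:
            raise ValueError("data line found before a variableStep declaration")
        position, value = line.split()
        start = int(position) - 1        # 1-based WIG to 0-based bedGraph
        yield chrom, start, start + span, float(value)


wig = [
    'track type=wiggle_0 name="Sample1_Coverage"',
    "variableStep chrom=chr1",
    "1001 5.2",
    "1002 5.8",
    "1010 7.3",
]
intervals = list(variable_step_to_bedgraph(wig))
for interval in intervals:
    print(*interval, sep="\t")
```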

BigWig Advantages:

Performance: Indexed random access, so region queries do not need to scan the whole file as they do with WIG
Compression: Significant size reduction
Multi-resolution: Automatic data summarization at different zoom levels
Streaming: Efficient data transfer over networks

BigWig Creation and Usage:

# Convert bedGraph to BigWig (input must be sorted by chromosome, then start)
bedGraphToBigWig coverage.bedGraph hg38.chrom.sizes coverage.bw

# Extract signal from specific region  
bigWigToBedGraph coverage.bw -chrom=chr1 -start=1000000 -end=2000000 stdout

# Calculate summary statistics
bigWigSummary coverage.bw chr1 1000000 2000000 100

Common BigWig Applications:

ChIP-seq signal tracks
RNA-seq coverage visualization
ATAC-seq accessibility profiles
Hi-C-derived one-dimensional tracks (e.g., insulation scores, compartment eigenvectors)
Methylation percentage tracks

Specialized NGS Data Formats

Advanced NGS applications have developed specialized formats to handle unique data types and optimize storage for specific use cases.

Compressed and Indexed Formats

VCF.gz + TBI: Tabix-Indexed Variants

Tabix indexing enables efficient random access to compressed VCF files.

# Compress and index VCF
bgzip variants.vcf
tabix -p vcf variants.vcf.gz

# Query specific region
tabix variants.vcf.gz chr1:1000000-2000000

# Multiple region query
tabix variants.vcf.gz chr1:100000-200000 chr2:300000-400000

Index File Structure:

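Binning index: Hierarchical bins (smallest bin 16 kb by default) listing the compressed-file chunks that overlap each bin
Linear index: Lowest file offset for each 16 kb window, used to skip irrelevant chunks
Metadata: Sequence names and the column numbers holding sequence name, start, and end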

Tabix Advantages:

Works with any coordinate-sorted, tab-delimited format
Minimal memory footprint for large files
Supports multiple simultaneous queries
Network streaming compatibility

Sequencing Archive Formats

SRA: Sequence Read Archive

SRA format serves as the primary archive format for public sequencing data repositories.

SRA File Structure:

sample.sra
├── Metadata (experiment design, sample info)
├── Read data (sequences and qualities)
├── Alignment data (optional)
└── Analysis data (optional)

SRA Toolkit Usage:

# Download SRA file
prefetch SRR1234567

# Convert to FASTQ
fasterq-dump SRR1234567

# Split paired-end reads
fasterq-dump --split-files SRR1234567

# Dump specific reads
sam-dump --aligned-region chr1:1000000-2000000 SRR1234567

SRA Advantages:

Comprehensive metadata storage
Efficient compression algorithms
Quality score optimization
International standard for data sharing

Long-Read Specific Formats

FAST5: Nanopore Raw Signal Data

FAST5 format stores raw electrical current measurements from Oxford Nanopore sequencing.

FAST5 HDF5 Structure:

/
├── UniqueGlobalKey/
│   ├── channel_id/
│   ├── context_tags/
│   ├── tracking_id/
│   └── sampling_rate
├── Raw/
│   └── Reads/
│       └── Read_[number]/
│           ├── Signal (raw current values)
│           └── Signal_metadata
└── Analyses/
    ├── Basecall_1D_[version]/
    │   ├── BaseCalled_template/
    │   │   ├── Fastq
    │   │   └── Events
    │   └── Summary/
    └── EventDetection_[version]/

FAST5 Data Components:

Raw Signal: Current measurements sampled at about 4 kHz (the exact rate is recorded in the file's sampling_rate field)
Event Data: Segmented signal regions
Basecalls: Sequence calls with quality scores
Metadata: Pore information, chemistry, temperature

FAST5 Analysis Tools:

# Split multi-read FAST5 files into single-read FAST5 files (ont_fast5_api)
multi_to_single_fast5 --input_path multi_fast5_dir/ --save_path single_reads/ --recursive

# Basecall with Guppy
guppy_basecaller --input_path fast5_dir/ --save_path output_dir/ --config dna_r9.4.1_450bps_hac.cfg

# Extract signal data
h5dump -d /read_12345/Raw/Signal sample.fast5

M5/PBI: PacBio Data Formats

PacBio generates specialized formats for long-read sequencing data.

PacBio BAM Structure:

Standard BAM alignment records
Extended tags for PacBio-specific information
Pulse-level data (optional)

PBI Index Fields:

# PacBio BAM Index (.pbi)
- Reference ID and position
- Read quality scores  
- Subread information
- Barcode data (if multiplexed)
- Kinetic information

PacBio Analysis Example:

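# Filter records on the qs (query start) tag; adjust the tag and threshold to your data
bamtools filter -in pacbio.bam -out subreads.bam -tag "qs:>750"

# Generate consensus (HiFi) reads; the pbccs package installs the executable as ccs
ccs pacbio.bam consensus.bam --min-passes 3 --min-rq 0.99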

# Polish assembly with long reads
pbmm2 align reference.fa pacbio.bam aligned.bam
variantCaller --algorithm=arrow aligned.bam -r reference.fa -o polished.fa

Configuration and Metadata Formats

JSON: JavaScript Object Notation

JSON format provides flexible metadata storage for NGS workflows.

Sample Metadata JSON:

{
  "experiment_id": "EXP001",
  "samples": [
    {
      "sample_id": "SAMPLE_001",
      "condition": "control",
      "replicate": 1,
      "library_prep": "TruSeq",
      "sequencing_depth": 30000000,
      "quality_metrics": {
        "mean_quality": 35.2,
        "percent_duplicates": 12.3,
        "mapping_rate": 94.5
      }
    },
    {
      "sample_id": "SAMPLE_002", 
      "condition": "treatment",
      "replicate": 1,
      "library_prep": "TruSeq",
      "sequencing_depth": 28500000,
      "quality_metrics": {
        "mean_quality": 34.8,
        "percent_duplicates": 15.1,
        "mapping_rate": 93.2
      }
    }
  ],
  "analysis_parameters": {
    "aligner": "bwa-mem",
    "peak_caller": "macs2",
    "fdr_threshold": 0.05
  }
}

YAML: Human-Readable Configuration

YAML provides an alternative to JSON with improved readability.

Pipeline Configuration YAML:

# NGS Analysis Pipeline Configuration
pipeline:
  name: "ChIP-seq Analysis"
  version: "1.2.0"

reference:
  genome: "hg38"
  index_path: "/data/genomes/hg38/bwa_index"
  annotation: "/data/annotations/gencode.v38.gtf"

quality_control:
  adapter_trimming: true
  quality_threshold: 20
  minimum_length: 30

alignment:
  tool: "bwa"
  parameters:
    - "-M"
    - "-t 16"

peak_calling:
  tool: "macs2"
  parameters:
    fdr: 0.05
    fold_change: 2.0

samples:
  - name: "ChIP_sample1"
    files: 
      - "sample1_R1.fastq.gz"
      - "sample1_R2.fastq.gz"
    condition: "treatment"

  - name: "Input_control1"
    files:
      - "input1_R1.fastq.gz" 
      - "input1_R2.fastq.gz"
    condition: "control"

Configuration Format Benefits:

Reproducibility: Document analysis parameters
Automation: Drive pipeline execution
Version Control: Track parameter changes
Collaboration: Share analysis protocols

Best Practices for NGS Data Management

File Organization and Naming Conventions

Hierarchical Directory Structure:

project_root/
├── raw_data/
│   ├── sample_001_R1.fastq.gz
│   ├── sample_001_R2.fastq.gz
│   └── checksums.md5
├── processed/
│   ├── trimmed/
│   ├── aligned/
│   └── quantified/
├── analysis/
│   ├── differential_expression/
│   ├── pathway_analysis/
│   └── figures/
├── metadata/
│   ├── sample_sheet.csv
│   └── experimental_design.yaml
└── scripts/
    ├── preprocessing.sh
    └── analysis.R

Naming Convention Examples:

# Good naming practices
ChIPseq_USF2_HepG2_rep1_treat_001.fastq.gz
RNAseq_WT_brain_12h_rep2_control_R1.fastq.gz
WGS_patient_001_tumor_primary_001.bam

# Include key information:
- Assay type (ChIPseq, RNAseq, WGS)
- Target/condition (USF2, WT, patient_001)
- Sample type (HepG2, brain, tumor)
- Time point (12h)
- Replicate (rep1, rep2)
- Condition (treat, control)
- Read pair (R1, R2)
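
Names like these are easier to keep consistent when they are generated rather than typed. The following is a minimal sketch, assuming fields are joined with underscores and restricted to letters, digits, and internal underscores; `build_filename` is a hypothetical helper, not part of any standard tool.

```python
import re

# Letters and digits, optionally separated by single underscores (no spaces or slashes).
_SAFE_FIELD = re.compile(r"^[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*$")


def build_filename(fields, extension):
    """Join descriptive fields with underscores and append the extension."""
    for field in fields:
        if not _SAFE_FIELD.match(field):
            raise ValueError(f"unsafe field for a filename: {field!r}")
    return "_".join(fields) + "." + extension.lstrip(".")
```

For example, `build_filename(["ChIPseq", "USF2", "HepG2", "rep1", "treat", "001"], "fastq.gz")` reproduces the first name in the list above.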

Data Integrity and Quality Control

Checksum Verification:

# Generate checksums during data transfer
md5sum *.fastq.gz > checksums.md5

# Verify file integrity
md5sum -c checksums.md5

# For stronger integrity guarantees, use SHA-256 (slower than MD5, but collision-resistant)
sha256sum large_file.bam > large_file.sha256
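
The same check can be scripted when checksum verification needs to run inside a pipeline. This is a sketch using only Python's standard `hashlib`; the helper names `file_digest` and `verify_manifest` are assumptions for illustration, and the manifest is assumed to follow the `digest  filename` layout that `md5sum` writes.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large files never sit fully in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path, algorithm="md5"):
    """Check every 'digest  filename' line of an md5sum-style manifest.

    File paths are used as written in the manifest, so run this from the
    directory where the manifest was created. Returns {filename: bool}.
    """
    results = {}
    with open(manifest_path) as manifest:
        for line in manifest:
            if not line.strip():
                continue
            expected, name = line.rstrip("\n").split(None, 1)
            name = name.lstrip("*")  # md5sum marks binary mode with a leading '*'
            results[name] = file_digest(name, algorithm) == expected.lower()
    return results
```

Passing `algorithm="sha256"` reads a manifest produced by `sha256sum` in the same way.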

File Format Validation:

# Validate FASTQ format
seqkit stats -T sample.fastq.gz

# Check BAM file integrity
samtools quickcheck aligned.bam

# Inspect the VCF header (bcftools exits with an error if it cannot parse it)
bcftools view -h variants.vcf | head -20
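
To make the BAM check less of a black box: BAM files are written as a series of BGZF blocks, and the SAM/BAM specification defines a fixed 28-byte empty block as the end-of-file marker. The sketch below tests only those two properties (BGZF magic bytes at the start, EOF marker at the end). It is a rough approximation of this kind of quick check, not a reimplementation of `samtools quickcheck`, and the function name is hypothetical.

```python
# First four bytes of every BGZF block: gzip magic, deflate, FEXTRA flag set.
BGZF_MAGIC = bytes.fromhex("1f8b0804")

# The 28-byte empty BGZF block defined by the SAM/BAM specification as the EOF marker.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600" "4243" "0200"
    "1b00" "0300" "00000000" "00000000"
)


def looks_like_complete_bgzf(path):
    """Return True if the file starts with a BGZF block and ends with the EOF marker."""
    with open(path, "rb") as handle:
        if handle.read(4) != BGZF_MAGIC:
            return False
        handle.seek(0, 2)  # jump to the end to learn the file size
        if handle.tell() < len(BGZF_EOF):
            return False
        handle.seek(-len(BGZF_EOF), 2)
        return handle.read() == BGZF_EOF
```

A file that fails this test was most likely truncated during writing or transfer; a file that passes can still contain damaged blocks in the middle, which only a full read will reveal.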

Storage and Compression Strategies

Compression Guidelines:

FASTQ files: Always compress with gzip (.gz)
BAM files: Use built-in compression (already compressed)
VCF files: Compress with bgzip for tabix compatibility
Text files: Use gzip for significant space savings

Archive Strategy:

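# Long-term storage as a single gzip-compressed archive
# (files that are already compressed, such as .fastq.gz and BAM, will shrink very little)
tar -czf project_archive.tar.gz project_directory/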

# Create separate archives for different data types
tar -czf raw_data.tar.gz raw_data/
tar -czf analysis_results.tar.gz analysis/

Data Backup and Version Control

Backup Strategy:

Primary storage: Active analysis workspace
Secondary backup: Network storage or cloud
Archive storage: Long-term compressed storage
Metadata backup: Critical sample information

Version Control for Analysis:

# Initialize git repository
git init project_analysis
cd project_analysis

# Track analysis scripts and metadata
git add scripts/ metadata/ README.md
git commit -m "Initial analysis setup"

# Create branches for different analyses
git checkout -b differential_expression
git checkout -b pathway_analysis

Common Pitfalls and Troubleshooting

Format Compatibility Issues

Coordinate System Mismatches:

0-based vs 1-based: BED (0-based) vs VCF/GTF (1-based)
Half-open intervals: BED uses [start, end) intervals
Always verify: Use tools like bedtools for coordinate conversions

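# Convert BED intervals to 1-based, fully closed coordinates (start + 1, end unchanged)
# The result is no longer valid BED, so it is written as a plain TSV
awk '{print $1, $2+1, $3}' OFS='\t' input.bed > output_1based.tsv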

# Convert VCF positions to BED format
bcftools query -f '%CHROM\t%POS0\t%END\n' variants.vcf > positions.bed
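
The arithmetic behind these conversions is small enough to write out. The functions below are an illustrative sketch with hypothetical names; `vcf_record_to_bed` assumes the interval of interest is the span of the REF allele and ignores any `END` value in the INFO field, which matters for symbolic alleles such as structural variants.

```python
def bed_to_one_based(start, end):
    """BED half-open [start, end) -> 1-based inclusive (start, end)."""
    return start + 1, end


def one_based_to_bed(start, end):
    """1-based inclusive (start, end) -> BED half-open [start, end)."""
    return start - 1, end


def vcf_record_to_bed(pos, ref):
    """BED interval covering the REF allele of a VCF record at 1-based POS."""
    return pos - 1, pos - 1 + len(ref)
```

Only the start coordinate shifts. The end stays the same because a half-open end at position N and an inclusive end at position N name the same last base.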

Character Encoding Problems:

# Check file encoding
file -i sample.txt

# Convert encoding if necessary
iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt

# Remove Windows carriage returns (convert CRLF line endings to LF)
tr -d '\r' < windows_file.txt > unix_file.txt

Performance Optimization

Index Management:

# Always index coordinate-sorted files
samtools index sorted.bam
tabix -p vcf compressed.vcf.gz
samtools faidx reference.fa

# Confirm the index is readable (prints per-reference mapped and unmapped read counts)
samtools idxstats sorted.bam

Memory and Storage Optimization:

# Use streaming for large files
samtools view large.bam chr1:1000000-2000000 | process_reads.py

# Parallel processing with GNU parallel
ls *.fastq.gz | parallel -j 8 'process_sample.sh {}'

# Monitor resource usage
htop  # Interactive process monitor
iostat -x 1  # I/O statistics
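
Streaming applies inside scripts as well as on the command line. The sketch below reads a FASTQ file one record at a time with a generator, so memory use stays flat regardless of file size. It assumes the common four-line record layout (no wrapped sequence lines), and `read_fastq` and `count_bases` are hypothetical helper names.

```python
import gzip
from itertools import islice


def read_fastq(path):
    """Yield (header, sequence, quality) tuples one record at a time."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            # Pull exactly four lines per record without loading the rest of the file.
            record = [line.rstrip("\r\n") for line in islice(handle, 4)]
            if not record:
                return
            if len(record) < 4:
                raise ValueError("file ends with an incomplete FASTQ record")
            header, sequence, _separator, quality = record
            yield header, sequence, quality


def count_bases(path):
    """Total number of bases, computed without holding records in memory."""
    return sum(len(sequence) for _, sequence, _ in read_fastq(path))
```

Because `read_fastq` is a generator, downstream code can filter or summarize records in a single pass over the file.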

Data Corruption and Recovery

Common Corruption Signs:

Unexpected file sizes (too small or truncated)
Parsing errors from standard tools
Missing headers or incomplete records
Checksum verification failures
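
Truncation of gzip-compressed files can be detected by decompressing the whole stream and discarding the output, which is what `gzip -t` does on the command line. The following is a sketch using only the standard library; the function name is an assumption, and `gzip.BadGzipFile` requires Python 3.8 or later.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream decompresses and passes its CRC check."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass  # discard the data; only errors matter here
    except (EOFError, gzip.BadGzipFile, zlib.error):
        # EOFError: stream ends early (truncation)
        # BadGzipFile: not gzip at all, or CRC/length mismatch
        # zlib.error: corrupted compressed data
        return False
    return True
```

A missing file still raises `FileNotFoundError`, so an absent file is not silently reported as corrupt.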

Recovery Strategies:

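# Salvage the records that precede the truncation point in a BAM file
samtools view -h corrupted.bam | samtools view -b -o recovered.bam -

# Keep the first 1,000,000 records (4 lines each) of an uncompressed FASTQ file
head -n 4000000 corrupted.fastq > partial_recovery.fastq

# Restore read pairing with BBTools repair.sh
# (re-pairs reads that have fallen out of sync; it does not fix corrupted records)
repair.sh in=corrupted.fastq out=repaired.fastq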

Format-Specific Troubleshooting

FASTQ Issues:

# Check for malformed FASTQ records
awk 'NR%4==1{if($0!~/^@/)print "Line "NR": "$0}' sample.fastq

# Summarize base qualities (the Q20/Q30 columns require -a)
seqkit stats -a -T sample.fastq

# Fix truncated FASTQ files by keeping only complete 4-line records
head -n $(( $(wc -l < sample.fastq) / 4 * 4 )) sample.fastq > fixed.fastq
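
A header-only check misses other common problems, such as a quality string that is shorter than its sequence. The sketch below checks each record's header, separator, and lengths, and reports an incomplete final record. It assumes four-line records, and `find_fastq_problems` is a hypothetical name rather than an existing tool.

```python
def find_fastq_problems(lines):
    """Return a list of (line_number, message) for structural FASTQ problems."""
    problems = []
    record = []
    start = 1  # line number where the current record begins
    for number, line in enumerate(lines, start=1):
        if not record:
            start = number
        record.append(line.rstrip("\r\n"))
        if len(record) == 4:
            header, sequence, separator, quality = record
            if not header.startswith("@"):
                problems.append((start, "header does not start with '@'"))
            if not separator.startswith("+"):
                problems.append((start + 2, "separator does not start with '+'"))
            if len(sequence) != len(quality):
                problems.append((start + 3, "quality length differs from sequence length"))
            record = []
    if record:
        problems.append((start, "incomplete final record"))
    return problems
```

The function accepts any iterable of lines, so it works on an open file handle (`find_fastq_problems(open("sample.fastq"))`) as well as on a list in a test.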

BAM/SAM Problems:

# Check BAM header consistency
samtools view -H sample.bam | grep "@RG"

# Check the declared sort order (expect SO:coordinate in the @HD line)
samtools view -H sample.bam | grep '^@HD'

# Fix header issues
samtools reheader new_header.sam sample.bam > fixed.bam
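
One subtlety when checking sort order yourself: "coordinate sorted" means ordered by the reference sequences as they appear in the header, then by position, which is not the same as alphabetical order (`chr10` sorts before `chr2` alphabetically). The following is a minimal sketch of that comparison, assuming mapped reads only and a contig list taken from the `@SQ` header lines; the function name is hypothetical.

```python
def is_coordinate_sorted(positions, contig_order):
    """Check (contig, pos) pairs against header contig order, then position.

    positions:    iterable of (contig_name, position) for mapped reads
    contig_order: contig names in the order of the @SQ header lines
    """
    rank = {name: index for index, name in enumerate(contig_order)}
    previous = None
    for contig, pos in positions:
        key = (rank[contig], pos)
        if previous is not None and key < previous:
            return False
        previous = key
    return True
```

The pairs can come from columns 3 and 4 of `samtools view` output, which hold the reference name and position.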

VCF Validation:

# Comprehensive VCF validation (vcf-validator ships with VCFtools)
vcf-validator variants.vcf

# Check for sorting issues (assumes contigs are expected in lexicographic order)
bcftools query -f '%CHROM\t%POS\n' variants.vcf | sort -k1,1 -k2,2n -c

# Left-align and normalize indels against the reference, writing bgzip-compressed output
bcftools norm -f reference.fa -O z -o normalized.vcf.gz variants.vcf
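
Many VCF parsing failures come from a damaged header, for example a missing `##fileformat` line or column names separated by spaces instead of tabs. The sketch below checks only those two points. It is a quick pre-check under that narrow scope, not a substitute for a full validator, and `check_vcf_header` is a hypothetical name.

```python
# The eight fixed, tab-separated columns every VCF column header must start with.
VCF_FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def check_vcf_header(lines):
    """Return a list of problems found in the header of an uncompressed VCF."""
    problems = []
    column_line = None
    first = True
    for line in lines:
        line = line.rstrip("\r\n")
        if first:
            if not line.startswith("##fileformat=VCF"):
                problems.append("first line is not a ##fileformat=VCF declaration")
            first = False
        if line.startswith("##"):
            continue  # meta-information line
        if line.startswith("#"):
            column_line = line
        break  # stop at the column header or the first data line
    if column_line is None:
        problems.append("no #CHROM column header line found")
    elif column_line.split("\t")[:8] != VCF_FIXED_COLUMNS:
        problems.append("column header does not start with the eight fixed VCF columns")
    return problems
```

For a bgzip-compressed file, the lines can be supplied by opening it with `gzip.open(path, "rt")`, since BGZF is gzip-compatible.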

Future Trends in NGS Data Formats

Emerging Technologies and Formats

Real-Time Sequencing Data:

Streaming formats: Handle continuous data flow from real-time sequencers
Adaptive compression: Dynamic compression based on data characteristics
Event-driven processing: Process data as it’s generated

Multi-Modal Data Integration:

Multi-omics formats: Combine genomics, transcriptomics, and epigenomics
Spatial data formats: Integrate location information with expression data
Time-series formats: Handle temporal genomics experiments

Cloud-Native Formats:

Object storage optimization: Formats designed for cloud storage systems
Serverless compatibility: Support for function-as-a-service architectures
API-first design: RESTful interfaces for data access

Standards Evolution

International Coordination:

GA4GH standards: Global Alliance for Genomics and Health format specifications
FAIR principles: Findable, Accessible, Interoperable, Reusable data
Metadata standardization: Consistent sample and experimental annotations

Performance Improvements:

Columnar storage: Apache Parquet and similar formats for analytics
GPU acceleration: Formats optimized for parallel processing
Quantum-ready: Preparation for quantum computing applications

Conclusion: Mastering NGS Data Formats

Understanding NGS data formats is fundamental to successful genomics analysis. Each format has evolved to address specific challenges in storing, processing, and analyzing biological data. From the simple elegance of FASTA to the sophisticated compression of CRAM, these formats enable researchers to extract meaningful insights from vast genomic datasets.

Key Takeaways

Format Selection Strategy:

Choose formats based on your analysis requirements and computational resources
Consider long-term storage and accessibility needs
Balance compression efficiency with processing speed
Ensure compatibility with your analysis tools and collaborators

Quality Assurance:

Implement robust data validation procedures
Maintain comprehensive metadata throughout your analysis pipeline
Use checksums and version control to ensure data integrity
Document format conversions and processing steps

Future-Proofing:

Stay informed about emerging format standards
Design analysis pipelines with format flexibility
Participate in community discussions about best practices
Consider cloud compatibility in format selection

Practical Next Steps

Audit Your Current Data: Review existing datasets and identify format optimization opportunities
Standardize Workflows: Implement consistent naming conventions and directory structures
Automate Validation: Create scripts to verify data integrity and format compliance
Collaborate Effectively: Establish format standards within your research group
Stay Updated: Follow developments in format standards and analysis tools

The landscape of NGS data formats continues to evolve with advancing sequencing technologies and computational capabilities. By mastering these fundamental formats and understanding their appropriate applications, researchers can build robust, efficient analysis pipelines that scale from individual experiments to large-scale genomics consortia.

Whether you’re analyzing your first ChIP-seq dataset or managing petabyte-scale population genomics data, a solid understanding of NGS data formats provides the foundation for reproducible, high-quality genomics research. The investment in learning these formats pays dividends in analysis efficiency, data management, and collaborative success.

This tutorial is part of the NGS101.com comprehensive guide to next-generation sequencing analysis. For more tutorials on specific analysis workflows and advanced techniques, explore our complete tutorial collection.