Understanding Whole Genome Sequencing and Its Applications
What is Whole Genome Sequencing (WGS)?
Whole Genome Sequencing represents one of the most comprehensive approaches to studying genetic variation across an entire organism’s DNA. Unlike targeted sequencing approaches that focus on specific regions of interest, WGS captures virtually every nucleotide in the genome, providing an unbiased view of genetic variation including single nucleotide variants (SNVs), insertions and deletions (indels), structural variants, and copy number variations.
The technology behind WGS has evolved dramatically since the completion of the Human Genome Project. Modern sequencing platforms can now generate billions of short DNA sequences (reads) in a single run, covering the human genome dozens of times over. This redundant coverage, known as sequencing depth, allows for accurate detection of genetic variants even in challenging genomic regions.
Beginner’s Tip: Think of WGS as taking millions of overlapping photographs of a landscape from different angles. Each photograph (sequencing read) shows a small portion, but when combined, they create a detailed map of the entire terrain (genome).
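The arithmetic behind sequencing depth is simple: total sequenced bases divided by genome size. A back-of-the-envelope sketch in shell, using illustrative numbers (not from a real run):

```shell
# Estimate theoretical mean coverage for a WGS run.
# The numbers below are illustrative, not from a real dataset.
num_reads=600000000      # total reads (e.g. 300M read pairs)
read_length=150          # bases per read
genome_size=3100000000   # approximate human genome size (bp)

# mean depth = total sequenced bases / genome size
depth=$(( num_reads * read_length / genome_size ))
echo "Estimated mean depth: ${depth}x"
```

At these numbers the run works out to roughly 29x, close to the ~30x commonly targeted for germline WGS.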
Clinical and Research Applications of WGS
WGS has revolutionized both research and clinical practice across multiple domains:
Medical Diagnostics and Precision Medicine:
- Rare disease diagnosis, particularly when standard genetic tests fail to identify causative mutations
- Pharmacogenomics applications to predict drug responses and optimize medication selection
- Cancer genomics to identify somatic mutations driving tumor growth and resistance to therapy
- Prenatal and neonatal screening for genetic disorders
Population and Evolutionary Genetics:
- Large-scale population studies to understand human genetic diversity and migration patterns
- Identification of genetic factors contributing to complex diseases through genome-wide association studies (GWAS)
- Agricultural genomics for crop improvement and livestock breeding programs
Infectious Disease Research:
- Pathogen surveillance and outbreak investigation through viral and bacterial genome sequencing
- Understanding antimicrobial resistance mechanisms in clinical isolates
- Tracking disease transmission patterns during epidemics
Biological Insights from WGS Analysis
WGS data provides unprecedented insights into genome organization and function:
- Variant Discovery: Identification of all types of genetic variants, from single base changes to large structural rearrangements
- Functional Impact Assessment: Prediction of how variants affect protein function and gene regulation
- Population Stratification: Understanding genetic ancestry and population structure
- Evolutionary Signatures: Detection of regions under positive or negative selection pressure
- Genome-wide Patterns: Analysis of mutation rates, recombination hotspots, and chromatin organization
WGS Analysis Pipeline Options
Several computational pipelines have been developed for WGS analysis, each with different strengths. The most commonly used tools include GATK (Genome Analysis Toolkit), DeepVariant, Strelka2, and FreeBayes. Each pipeline has been optimized for different use cases: some prioritize speed for clinical applications, others focus on maximum accuracy, and some offer extensive customization for research applications.
GATK, developed by the Broad Institute, has emerged as the industry standard for germline variant calling and is the most widely adopted pipeline in both clinical and research settings. It offers comprehensive best practices workflows that have been extensively validated across thousands of genomes, sophisticated algorithms for quality control and filtering, and excellent documentation with strong community support.
Why GATK for This Tutorial
We’ve chosen GATK for this tutorial because it provides the ideal learning foundation for newcomers to WGS analysis. Its well-documented workflows, extensive educational resources, and systematic approach to quality control make it perfect for understanding the fundamental principles of variant calling. The skills and concepts you learn from GATK will transfer readily to other pipelines as your expertise grows and your project requirements evolve.
GATK Workflow Overview
The GATK workflow for WGS analysis follows a systematic approach:
- Data Preparation: Quality control and preprocessing of raw sequencing reads
- Alignment: Mapping reads to a reference genome using BWA-MEM
- Post-alignment Processing: Marking duplicates and recalibrating base quality scores
- Variant Calling: Identifying genetic variants using HaplotypeCaller
- Variant Filtering: Applying quality filters to retain high-confidence variants
- Annotation: Adding functional information to variants for interpretation
Each step includes rigorous quality control measures and follows established best practices developed through analysis of thousands of genomes.
Setting Up Your WGS Analysis Environment
Before beginning the analysis, we need to establish a computational environment with all necessary tools and dependencies.
Creating a Dedicated Conda Environment
# Create a new conda environment for WGS analysis
# This isolates our tools from other projects and prevents conflicts
conda create -n wgs_analysis python=3.9
# Activate the environment - all subsequent installations will go here
conda activate wgs_analysis
# Add bioinformatics channels to conda
# bioconda: specialized channel for bioinformatics software
# conda-forge: community-maintained packages with high quality standards
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict # Prevents package conflicts
# Install GATK and essential tools for the entire workflow
# Tool roles:
#   gatk4        - main variant calling toolkit
#   bwa          - read alignment
#   samtools     - SAM/BAM file manipulation
#   picard       - Java-based tools for BAM processing
#   trim-galore  - adapter trimming and quality control
#   fastqc       - sequencing quality assessment
#   bcftools     - VCF file manipulation and statistics
#   snpeff       - variant annotation and effect prediction
#   bedtools     - genomic interval operations
#   vcftools     - VCF filtering and format conversion
#   tabix        - VCF indexing and querying
#   tree         - show directory structure
conda install -y \
    gatk4 \
    bwa \
    samtools \
    picard \
    trim-galore \
    fastqc \
    bcftools \
    snpeff \
    bedtools \
    vcftools \
    tabix \
    tree
# Verify GATK installation and check version
gatk --version
This creates an isolated environment for WGS analysis with all required tools. The bioconda and conda-forge channels provide specialized bioinformatics software with high quality standards. Setting channel priority to strict prevents package conflicts.
Reference Genome and Resource Files
The reference genome serves as the “map” against which we align our sequencing reads. We’ll use the human reference genome GRCh38/hg38, which is the current standard:
# Create organized directory structure for the project
mkdir -p ~/wgs_analysis/{reference,data,results}
cd ~/wgs_analysis/reference
# Download human reference genome (GRCh38/hg38 - the current standard)
# This is the "map" we'll align our reads to
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dict
# Create BWA index - this allows BWA to quickly find where reads should align
# This step takes ~2-3 hours but only needs to be done once
bwa index Homo_sapiens_assembly38.fasta
The BWA index step takes 2-3 hours but only needs to be done once. This creates specialized data structures that allow BWA to quickly find where reads should align.
Next, we’ll download the GATK resource bundle: high-quality variant datasets used for base quality score recalibration and variant filtering:
# Download GATK resource bundle - these are "truth sets" of known variants
# Used for base quality score recalibration and variant filtering
# HapMap: High-quality SNPs used for training variant filters
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/hapmap_3.3.hg38.vcf.gz
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/hapmap_3.3.hg38.vcf.gz.tbi
# 1000 Genomes Omni: Another high-quality variant set for training
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/1000G_omni2.5.hg38.vcf.gz
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/1000G_omni2.5.hg38.vcf.gz.tbi
# 1000 Genomes high-confidence SNPs: Large collection of validated variants
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi
# Mills and 1000G indels: High-quality insertions and deletions
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
# dbSNP: Database of known variants - helps identify novel vs. known variants
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.gz
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.gz.tbi
These files contain different types of high-quality variants:
- HapMap: Highest quality SNPs used for training variant filters
- 1000 Genomes Omni: Another high-quality variant set for training
- 1000 Genomes high-confidence SNPs: Large collection of validated variants
- Mills and 1000G indels: High-quality insertions and deletions
- dbSNP: Database of known variants for annotation
Important Note: The GATK resource bundle moved from the old genomics-public-data bucket to the new gcp-public-data--broad-references bucket. Most files now come pre-indexed with .tbi files.
Complete WGS Analysis Pipeline
Let’s walk through the complete validated pipeline. Each step includes comprehensive quality control and follows GATK best practices. For this tutorial, we’ll analyze a tumor sample called “tumor1” with paired-end FASTQ files.
Step 1: Quality Control and Preprocessing
Quality assessment helps identify potential issues with sequencing data before proceeding with alignment. We’ll examine base quality scores, GC content distribution, adapter contamination, and sequence length distribution:
# Set up variables and create directory structure
# Note: leave the ~ unquoted so the shell expands it to your home directory
REFERENCE=~/wgs_analysis/reference/Homo_sapiens_assembly38.fasta
SAMPLE="tumor1"
THREADS=20
# Create organized directory structure for results
mkdir -p ~/wgs_analysis/results/{qc,trimmed,aligned,recal,var}/${SAMPLE}
# Run FastQC to assess the quality of raw sequencing data
# This generates HTML reports showing:
# - Base quality scores across read positions
# - GC content distribution
# - Adapter contamination levels
# - Overrepresented sequences
fastqc \
~/wgs_analysis/data/${SAMPLE}_R1.fastq.gz \
~/wgs_analysis/data/${SAMPLE}_R2.fastq.gz \
-o ~/wgs_analysis/results/qc/${SAMPLE}
FastQC generates HTML reports showing base quality scores across read positions, GC content distribution, adapter contamination levels, and overrepresented sequences. Check these reports to ensure your data meets quality standards before proceeding.
Step 2: Adapter Trimming and Quality Filtering
Modern sequencing often includes adapter sequences that must be removed before alignment. We’ll also filter out low-quality bases and very short reads:
# Trim adapters and low-quality bases using Trim Galore
# This removes:
# - Illumina sequencing adapters that can interfere with alignment
# - Low-quality bases (quality score < 20) from read ends
# - Very short reads (< 50bp) that may align poorly
# --paired: process R1/R2 together; --quality 20: trim bases below Q20
# --length 50: discard reads shorter than 50bp after trimming
# --fastqc: re-run FastQC on the trimmed reads for comparison
trim_galore \
    --paired \
    --quality 20 \
    --length 50 \
    --fastqc \
    --output_dir ~/wgs_analysis/results/trimmed/${SAMPLE} \
    ~/wgs_analysis/data/${SAMPLE}_R1.fastq.gz \
    ~/wgs_analysis/data/${SAMPLE}_R2.fastq.gz
This step removes Illumina sequencing adapters, trims bases with quality scores below 20, discards reads shorter than 50bp after trimming, and generates FastQC reports on the cleaned data. The output files will be named with _val_1.fq.gz and _val_2.fq.gz suffixes.
Step 3: Read Alignment with BWA-MEM
BWA-MEM maps our cleaned sequencing reads to the reference genome. The algorithm is specifically designed for 70bp-1Mbp Illumina reads and handles chimeric reads effectively:
# Align reads to reference genome using BWA-MEM algorithm
# BWA-MEM is specifically designed for 70bp-1Mbp Illumina reads
# -t: CPU threads; -M: mark shorter split hits as secondary (Picard compatibility)
# -R: read group info (required by GATK); output is piped straight into
#     samtools sort to produce a coordinate-sorted BAM
bwa mem \
    -t ${THREADS} \
    -M \
    -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA\tLB:${SAMPLE}_lib" \
    ${REFERENCE} \
    ~/wgs_analysis/results/trimmed/${SAMPLE}/${SAMPLE}_R1_val_1.fq.gz \
    ~/wgs_analysis/results/trimmed/${SAMPLE}/${SAMPLE}_R2_val_2.fq.gz \
    | samtools sort -@ ${THREADS} -o ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}.bam
# Create index for the BAM file - this allows random access to specific regions
samtools index ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}.bam
The read group information (@RG) is essential for GATK processing. It includes the read group identifier (ID), sample name (SM), sequencing platform (PL), and library identifier (LB). The output is piped directly to samtools for coordinate sorting, and we create an index for random access to specific genomic regions.
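A malformed read-group string is a common source of downstream GATK errors, so it can help to preview the header line before launching a long alignment. `printf '%b'` expands the `\t` escapes the same way they end up as real tabs in the BAM header (a quick sanity-check sketch, reusing the tutorial's sample name):

```shell
# Preview how the -R read-group string will look once its \t escapes
# become real tabs in the BAM @RG header line.
SAMPLE="tumor1"
RG="@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA\tLB:${SAMPLE}_lib"
printf '%b\n' "$RG"
```

After alignment, `samtools view -H` on the BAM, filtered for lines starting with `@RG`, should show the same line.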
Step 4: Mark PCR and Optical Duplicates
PCR amplification during library preparation and optical duplicates from sequencing can bias variant calling by making variants appear more frequent than they actually are:
# Mark PCR and optical duplicates using GATK MarkDuplicates
# Duplicates arise from:
# - PCR amplification during library preparation
# - Optical duplicates from sequencing (reads from same DNA cluster)
# These can bias variant calling by making variants appear more frequent
gatk MarkDuplicates \
    -I ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}.bam \
    -O ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}_marked_duplicates.bam \
    -M ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}_duplicate_metrics.txt \
    --CREATE_INDEX true
# -M writes the duplication statistics; --CREATE_INDEX also builds a BAM index
The metrics file reports the total number of reads, number of duplicate reads, and duplication rate. For good WGS libraries, the duplication rate should be less than 30%. Higher duplication rates may indicate issues with library preparation or low input DNA amounts.
Step 5: Base Quality Score Recalibration (BQSR)
Sequencing machines sometimes systematically over- or under-estimate the probability that a base call is correct. BQSR corrects these systematic errors using patterns observed at known variant sites:
# Step 1: Generate recalibration data table
# This analyzes patterns of base quality score errors using known variant sites
# --known-sites: truth sets of known SNPs (dbSNP) and indels (Mills)
gatk BaseRecalibrator \
    -I ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}_marked_duplicates.bam \
    -R ${REFERENCE} \
    --known-sites ~/wgs_analysis/reference/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
    --known-sites ~/wgs_analysis/reference/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    -O ~/wgs_analysis/results/recal/${SAMPLE}/${SAMPLE}_recal_data.table
# Step 2: Apply base quality score recalibration
# This updates the quality scores in your BAM file based on the patterns found above
gatk ApplyBQSR \
-I ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}_marked_duplicates.bam \
-R ${REFERENCE} \
--bqsr-recal-file ~/wgs_analysis/results/recal/${SAMPLE}/${SAMPLE}_recal_data.table \
-O ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}_recalibrated.bam
The first step analyzes patterns of base quality score errors using known SNPs from dbSNP and known indels from the Mills dataset. The second step applies these corrections to update quality scores in the BAM file. This improves variant calling accuracy, particularly for low-frequency variant detection.
Step 6: Alignment Quality Assessment
Before proceeding to variant calling, we should assess the quality of our alignments to ensure they meet expected standards:
# Collect alignment summary metrics
# This provides statistics about how well your reads aligned to the reference
gatk CollectAlignmentSummaryMetrics \
-R ${REFERENCE} \
-I ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}_recalibrated.bam \
-O ~/wgs_analysis/results/qc/${SAMPLE}/${SAMPLE}_alignment_summary.txt
# Collect insert size metrics (for paired-end data)
# Insert size is the distance between paired reads - important for structural variant detection
gatk CollectInsertSizeMetrics \
-I ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}_recalibrated.bam \
-O ~/wgs_analysis/results/qc/${SAMPLE}/${SAMPLE}_insert_size_metrics.txt \
-H ~/wgs_analysis/results/qc/${SAMPLE}/${SAMPLE}_insert_size_histogram.pdf
Key metrics to check include:
- TOTAL_READS: Total number of reads processed
- PCT_PF_READS_ALIGNED: Percentage of reads aligned (should be >95% for human WGS)
- PF_MISMATCH_RATE: Rate of mismatches (should be <5%)
- Insert size distribution: Should match library preparation (typically 300-500bp for most WGS libraries)
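Mean depth is also worth checking directly before variant calling, since it is just the average of per-base depth values. The sketch below runs the computation on toy values; on real data you would stream `samtools depth -a` output from the recalibrated BAM instead:

```shell
# Mean depth is the average of the per-base depth column (column 3 of
# samtools depth output). Toy values below, not from a real sample.
cat > /tmp/demo_depth.txt <<'EOF'
chr1 1 28
chr1 2 30
chr1 3 32
chr1 4 30
EOF
awk '{ sum += $3 } END { printf "mean depth = %.1fx\n", sum/NR }' /tmp/demo_depth.txt
```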
Step 7: Variant Calling with HaplotypeCaller
HaplotypeCaller is GATK’s primary variant calling algorithm. It uses local assembly and haplotype-based calling to achieve high accuracy, particularly in complex genomic regions:
# Call variants in GVCF (Genomic Variant Call Format) mode
# GVCF mode records information about ALL sites (variant and non-variant)
# This allows for accurate joint genotyping when analyzing multiple samples
# -ERC GVCF: Emit Reference Confidence mode, producing a GVCF
# --dbsnp: annotate calls with known variant IDs
gatk HaplotypeCaller \
    -R ${REFERENCE} \
    -I ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}_recalibrated.bam \
    -O ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}.g.vcf.gz \
    -ERC GVCF \
    --dbsnp ~/wgs_analysis/reference/Homo_sapiens_assembly38.dbsnp138.vcf.gz
GVCF (Genomic Variant Call Format) mode records information about all genomic positions, not just variant sites. This enables accurate joint genotyping when analyzing multiple samples and is essential for population-scale studies and family analyses.
For single-sample analysis, we need to convert the GVCF to a standard VCF format:
# Convert GVCF to regular VCF for single-sample analysis
# This step is REQUIRED before variant filtering
gatk GenotypeGVCFs \
-R ${REFERENCE} \
-V ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}.g.vcf.gz \
-O ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_raw_variants.vcf.gz
Step 8: Variant Quality Score Recalibration (VQSR)
VQSR uses machine learning to distinguish true variants from artifacts based on annotation profiles from high-quality training datasets. This approach is more sophisticated than simple hard filtering with arbitrary thresholds.
SNP Recalibration:
# Build SNP recalibration model using multiple high-quality training sets
# Each training set has different confidence levels and purposes
# Training resources (priors reflect confidence):
#   hapmap (15.0) - highest-quality SNPs, used as truth
#   omni   (12.0) - high quality, training only
#   1000G  (10.0) - large high-confidence training set
#   dbsnp  (2.0)  - known variants, not used for training
gatk VariantRecalibrator \
    -R ${REFERENCE} \
    -V ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_raw_variants.vcf.gz \
    --resource:hapmap,known=false,training=true,truth=true,prior=15.0 \
        ~/wgs_analysis/reference/hapmap_3.3.hg38.vcf.gz \
    --resource:omni,known=false,training=true,truth=false,prior=12.0 \
        ~/wgs_analysis/reference/1000G_omni2.5.hg38.vcf.gz \
    --resource:1000G,known=false,training=true,truth=false,prior=10.0 \
        ~/wgs_analysis/reference/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 \
        ~/wgs_analysis/reference/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    -mode SNP \
    -O ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_snps.recal \
    --tranches-file ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_snps.tranches
# Apply SNP recalibration filter
# This assigns PASS/FAIL based on the machine learning model
# --truth-sensitivity-filter-level 99.0: retain 99% of true variants (high sensitivity)
gatk ApplyVQSR \
    -R ${REFERENCE} \
    -V ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_raw_variants.vcf.gz \
    --recal-file ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_snps.recal \
    --tranches-file ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_snps.tranches \
    -mode SNP \
    --truth-sensitivity-filter-level 99.0 \
    -O ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_snps_recalibrated.vcf.gz
The annotations used for SNP recalibration include:
- QD (QualByDepth): variant quality normalized by depth
- MQ (MappingQuality): how well reads map to the reference
- MQRankSum: difference in mapping quality between reads supporting the reference and alternate alleles
- ReadPosRankSum: position of variants within reads
- FS (FisherStrand): indicator of strand bias
- SOR (StrandOddsRatio): another measure of strand bias
Indel Recalibration:
# Build indel recalibration model
# Indels are harder to call accurately, so we use fewer annotations
# mills: high-quality indel truth set; dbsnp: known variants, not for training
# Fewer annotations and --max-gaussians 4 because indels have less training data
gatk VariantRecalibrator \
    -R ${REFERENCE} \
    -V ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_snps_recalibrated.vcf.gz \
    --resource:mills,known=false,training=true,truth=true,prior=12.0 \
        ~/wgs_analysis/reference/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 \
        ~/wgs_analysis/reference/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
    -an QD -an ReadPosRankSum -an FS -an SOR \
    -mode INDEL \
    --max-gaussians 4 \
    -O ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_indels.recal \
    --tranches-file ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_indels.tranches
# Apply indel recalibration
# Using 95% sensitivity (slightly more stringent than SNPs)
gatk ApplyVQSR \
-R ${REFERENCE} \
-V ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_snps_recalibrated.vcf.gz \
--recal-file ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_indels.recal \
--tranches-file ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_indels.tranches \
-mode INDEL \
--truth-sensitivity-filter-level 95.0 \
-O ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_filtered.vcf.gz
Indel recalibration uses fewer annotations because indels are harder to call accurately and have less training data available. We limit model complexity with --max-gaussians 4 and use slightly more stringent filtering (95% sensitivity vs 99% for SNPs). The final result contains high-quality variants that have passed both SNP and indel filters with a much lower false positive rate than raw calls.
Understanding the VCF File Format:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1
chr1 12345 rs123 A G 60 PASS DP=30 GT:DP 1/1:30
Columns:
- Position: Chromosome 1, position 12345, variant ID rs123
- Variant: A→G substitution
- Quality Score: 60
- Filter Status: PASS/FAIL
- Overall coverage: 30 reads at this position
- Format: Genotype:Depth
- Sample genotype: Homozygous for the G allele (both copies are G)
- Sample coverage: 30 reads supporting this call in this sample
Common genotype codes:
- 0/0: Homozygous reference (A/A in this example)
- 0/1: Heterozygous (A/G in this example)
- 1/1: Homozygous alternate (G/G in this example)
- 1/2: Heterozygous with two different alternate alleles
- ./.: Missing/unknown genotype
Separator meaning:
- /: Unphased (we don’t know which allele came from which parent)
- |: Phased (maternal vs. paternal inheritance is known)
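These codes are enough to compute the Het/Hom QC ratio yourself. A sketch over a toy genotype column; with real data you would feed it the GT values extracted from your VCF (e.g. via `bcftools query`) instead:

```shell
# Classify genotypes as heterozygous or homozygous-alternate and report
# the het/hom ratio. Toy genotype list, not from a real sample.
cat > /tmp/demo_genotypes.txt <<'EOF'
0/1
1/1
0/1
0/1
1/2
1|1
EOF
awk '{ gt = $1; gsub(/\|/, "/", gt); split(gt, a, "/")
       if (a[1] != a[2]) het++
       else if (a[1] != "0") hom++ }
     END { printf "het=%d hom=%d het/hom=%.2f\n", het, hom, het/hom }' /tmp/demo_genotypes.txt
```

Note that phased separators (`|`) are normalized to `/` first, so `1|1` counts as homozygous alternate just like `1/1`.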
Step 9: Functional Annotation with SnpEff
Functional annotation predicts the biological effect of each variant on genes and proteins. This information is crucial for interpreting variant significance:
# Annotate variants with functional consequences
# This predicts the effect of each variant on genes and proteins
# -Xmx32g: give the Java VM 32GB of memory (place it before the subcommand)
# -stats: write an HTML summary report
# GRCh38.105: Ensembl-based SnpEff database (downloaded on first use)
snpEff -Xmx32g ann \
    -stats ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_annotation_stats.html \
    GRCh38.105 \
    ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_filtered.vcf.gz \
    > ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_annotated.vcf
SnpEff annotation includes gene names and descriptions, transcript IDs and biotypes, predicted effects (HIGH/MODERATE/LOW/MODIFIER impact), amino acid changes for protein-coding variants, and distances to nearest genes for intergenic variants. The statistics HTML file provides a comprehensive overview of variant effects and quality metrics.
Convert to More Readable Formats:
The annotated VCF file is large and complex. Convert it to a tab-delimited table for easier analysis:
# Convert VCF to a readable table format with key annotation fields
# This extracts the most important information into columns
# -F selects the fields to extract into columns
gatk VariantsToTable \
    -V ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_annotated.vcf \
    -F CHROM -F POS -F REF -F ALT -F QUAL -F ANN \
    -O ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_variants_table.tsv

The resulting table contains these key columns:
- CHROM: Chromosome where the variant is located
- POS: Genomic position of the variant
- REF: Reference allele (what the reference genome has at this position)
- ALT: Alternate allele (the variant observed in your sample)
- QUAL: Quality score of the variant call (higher = more confident)
- ANN: SnpEff annotation containing effect predictions, gene names, and impact ratings
The ANN field contains detailed annotations separated by pipes (|), including effect type (e.g., missense_variant), impact level (HIGH/MODERATE/LOW/MODIFIER), gene name, and amino acid changes for protein-coding variants. This tabular format makes it much easier to filter, sort, and analyze variants in spreadsheet applications or downstream analysis tools.
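Because the impact level sits between fixed pipe separators in each ANN entry, high-impact candidates can be pulled out with a one-line awk filter. A sketch on a toy two-row table (gene names here are hypothetical placeholders; the real VariantsToTable output is tab-separated, so add `-F'\t'` when running on it):

```shell
# Keep the header plus any row whose ANN field contains a HIGH impact rating.
# Toy space-separated table with placeholder genes, standing in for the TSV.
cat > /tmp/demo_variants.tsv <<'EOF'
CHROM POS REF ALT QUAL ANN
chr1 12345 A G 60 G|missense_variant|MODERATE|GENE1|...
chr2 54321 C T 80 T|stop_gained|HIGH|GENE2|...
EOF
awk 'NR == 1 || $6 ~ /\|HIGH\|/' /tmp/demo_variants.tsv
```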
Step 10: Variant Statistics and Quality Assessment
Finally, we’ll generate comprehensive statistics about our variant calls to assess the overall quality of the analysis:
# Generate comprehensive variant statistics using bcftools
# This provides detailed metrics about your variant calls
bcftools stats ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_filtered.vcf.gz > ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_variant_stats.txt
# Count different types of variants
# SNPs (single nucleotide polymorphisms); -H suppresses the VCF header
bcftools view -H -v snps ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_filtered.vcf.gz | wc -l > ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_snp_count.txt
# Indels (insertions and deletions)
bcftools view -H -v indels ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_filtered.vcf.gz | wc -l > ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_indel_count.txt
Expected numbers for human WGS at 30x coverage:
- SNPs: 4-5 million per genome
- Indels: 500,000-800,000 per genome
- Ti/Tv ratio: 2.0-2.1 (transitions vs transversions)
- Het/Hom ratio: 1.5-2.0 (heterozygous vs homozygous variants)
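The Ti/Tv ratio summarizes how often substitutions are transitions (A↔G, C↔T) versus transversions; random errors would push it toward 0.5, so values near 2.0 indicate biologically plausible calls. A toy illustration of the computation on made-up REF/ALT pairs:

```shell
# Transitions are A<->G and C<->T; every other substitution is a transversion.
# Toy REF/ALT pairs below, not a real callset.
cat > /tmp/demo_snps.txt <<'EOF'
A G
C T
G A
A C
T C
G T
EOF
awk 'BEGIN { ti["AG"] = ti["GA"] = ti["CT"] = ti["TC"] = 1 }
     { if (ti[$1 $2]) t++; else v++ }
     END { printf "Ti=%d Tv=%d Ti/Tv=%.1f\n", t, v, t/v }' /tmp/demo_snps.txt
```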
Quick Quality Control Summary:
After completing the analysis, it’s helpful to generate a comprehensive QC summary to assess the overall quality of your WGS data:
#!/bin/bash
# Quick QC Summary Script for WGS Analysis
# This script extracts key quality metrics from all analysis steps
SAMPLE=tumor1
echo "========================================="
echo "WGS Quality Control Summary for ${SAMPLE}"
echo "========================================="
echo
echo "=== ALIGNMENT QUALITY ==="
echo -n "Mapping Rate: "
grep -A1 "FIRST_OF_PAIR" ~/wgs_analysis/results/qc/${SAMPLE}/${SAMPLE}_alignment_summary.txt | tail -1 | cut -f7
echo " (Benchmark: >95%)"
echo
echo -n "Duplicate Rate: "
grep -A1 "LIBRARY" ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}_duplicate_metrics.txt | tail -1 | cut -f9
echo " (Benchmark: <30%)"
echo
echo -n "Median Insert Size: "
grep -A1 "MEDIAN_INSERT_SIZE" ~/wgs_analysis/results/qc/${SAMPLE}/${SAMPLE}_insert_size_metrics.txt | tail -1 | cut -f1
echo " bp (Benchmark: 300-500bp)"
echo
echo "=== VARIANT CALLING QUALITY ==="
echo -n "Total SNPs: "
cat ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_snp_count.txt
echo " (Benchmark: 4-5 million)"
echo
echo -n "Total Indels: "
cat ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_indel_count.txt
echo " (Benchmark: 0.5-0.8 million)"
echo
echo -n "Ti/Tv Ratio: "
# Extract Ti/Tv ratio from bcftools stats output
grep "^TSTV" ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_variant_stats.txt | cut -f5
echo " (Benchmark: 2.0-2.1)"
echo
echo "For detailed variant annotation statistics, open:"
echo "~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_annotation_stats.html"
echo "========================================="
This script provides a quick overview of all critical quality metrics. Run it after completing your analysis to ensure your WGS data meets quality standards. The SnpEff HTML report provides additional detailed statistics and visualizations for comprehensive quality assessment.
To help you interpret your own results, here are the QC metrics from our tutorial example:
| Metric | Your Result | Benchmark | Status |
|---|---|---|---|
| Mapping Rate | 99.99% | >95% | ✅ Excellent |
| Duplicate Rate | 10.19% | <30% | ✅ Excellent |
| Insert Size | 197bp | 300-500bp | ⚠️ Shorter than typical |
| Total SNPs | 3.73M | 4-5M | ✅ Good |
| Total Indels | 683K | 500-800K | ✅ Good |
| Ti/Tv Ratio | 1.96 | 2.0-2.1 | ✅ Excellent |
Step 11: Creating Visualization Files
Generate files suitable for genome browser visualization and downstream analysis:
# Create BED file for genome browser visualization
# BED format is widely supported by genome browsers (IGV, UCSC, etc.)
bcftools query -f '%CHROM\t%POS0\t%END\t%ID\n' ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_filtered.vcf.gz > ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_variants.bed
# Generate coverage tracks (useful for validating variant calls)
# This shows sequencing depth across the genome
bedtools genomecov -ibam ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}_recalibrated.bam -bg > ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}_coverage.bedgraph
These files can be loaded into IGV (Integrative Genomics Viewer) to visualize variant calls in genomic context, check read alignments supporting each variant, and identify potential false positives or missed variants.
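Before loading the coverage track elsewhere, note that downstream tools (tabix indexing, bigWig conversion) generally require coordinate-sorted bedGraph input. The standard `sort` incantation, sketched on toy space-separated intervals (real bedGraph files are tab-separated but sort the same way):

```shell
# Sort a bedGraph by chromosome name, then numeric start position.
# Toy intervals below, standing in for the real coverage file.
cat > /tmp/demo_coverage.bedgraph <<'EOF'
chr2 100 200 28
chr1 500 600 31
chr1 100 200 30
EOF
# -k1,1: chromosome name; -k2,2n: numeric start position
sort -k1,1 -k2,2n /tmp/demo_coverage.bedgraph > /tmp/demo_coverage.sorted.bedgraph
head -1 /tmp/demo_coverage.sorted.bedgraph
```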
Quality Control Standards and Benchmarks
Pre-alignment QC Benchmarks
- Mean base quality scores: >20 across all positions
- Adapter contamination: <5%
- Sequence length distribution: Should match expected insert size and read length
Post-alignment QC Benchmarks
- Mapping rate: >95% for human WGS data
- Duplicate rate: <30% for WGS libraries
- Insert size distribution: Should be consistent with library preparation (typically 300-500bp)
- Coverage uniformity: Should be relatively even across autosomes
Variant Calling QC Benchmarks
- Ti/Tv ratio: 2.0-2.1 for whole genome data
- Total variant count: 4-5 million SNPs + 0.5-0.8 million indels per human genome
- Het/Hom ratio: 1.5-2.0 for most populations
- dbSNP overlap: >95% of variants should be present in dbSNP for typical populations
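The dbSNP overlap can be estimated from the ID column of the final VCF, since HaplotypeCaller's `--dbsnp` option fills in rsIDs for known sites while novel calls keep a `.`. A sketch of the computation on a toy site list (with real data, stream the header-stripped VCF body from `bcftools view -H` instead):

```shell
# Fraction of calls already present in dbSNP, judged by whether the
# ID column (column 3) carries an rsID rather than ".". Toy sites only.
cat > /tmp/demo_sites.txt <<'EOF'
chr1 12345 rs123 A G
chr1 22345 . C T
chr2 555 rs77 G A
chr2 999 rs88 T C
EOF
awk '{ n++; if ($3 != ".") known++ } END { printf "known fraction = %.2f\n", known/n }' /tmp/demo_sites.txt
```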
Best Practices for WGS Analysis
Computational Resources and Performance
Hardware Requirements:
- Minimum 64GB RAM for human WGS analysis (128GB+ recommended)
- 200GB+ temporary disk space per sample
- Multi-core processors significantly reduce runtime
- SSD storage improves I/O performance
Storage Considerations:
- Raw FASTQ files: ~100GB per 30x coverage human genome
- Intermediate BAM files: ~100GB per sample
- Final compressed VCF files: <1GB per sample
- Plan for 3-4x raw data size for intermediate files
Common Pitfalls and Solutions
Memory Issues:
- Increase the Java heap size for GATK: -Xmx32g or higher
- Use scatter-gather parallelization for large chromosomes
- Process samples individually if system resources are limited
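GATK takes JVM flags through its `--java-options` argument. The sketch below sizes the heap from physical RAM; the 80% fraction is a rule of thumb to leave headroom for the OS, not a GATK requirement, and the fallback value is arbitrary:

```shell
# Size the Java heap at ~80% of physical RAM (Linux /proc/meminfo is in kB;
# if it is unavailable, e.g. on macOS, fall back to assuming 40 GB of RAM)
mem_kb=$(awk '/MemTotal/ { print $2 }' /proc/meminfo 2>/dev/null || echo 41943040)
heap_gb=$(echo "$mem_kb" | awk '{ printf "%d", $1 * 0.8 / 1048576 }')
echo "Using -Xmx${heap_gb}g"

gatk --java-options "-Xmx${heap_gb}g" HaplotypeCaller \
    -R ${REFERENCE} \
    -I sample_recalibrated.bam \
    -O sample.g.vcf.gz \
    -ERC GVCF
```

The same `--java-options` flag works for every GATK tool, so the computed heap size can be reused across the pipeline.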
Performance Optimization:
- Utilize multi-threading options in all tools
- Use local SSD storage for temporary files
- Consider cloud computing for large-scale analyses
- Implement parallel processing for multiple samples
Data Quality Issues:
- Verify reference genome version matches your sequencing platform
- Check for unusual insert size distributions or high duplication rates
- Investigate potential sample contamination or batch effects
- Validate problematic regions with IGV visualization
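Reference-version mismatches often surface as contig-name disagreements (e.g. "chr1" vs "1"). A quick consistency check between a BAM header and the reference `.fai` index (file names are illustrative):

```shell
# Compare contig names in the BAM header against the reference .fai index;
# any line printed by comm -3 indicates a name present in only one of them
samtools view -H sample_recalibrated.bam \
    | awk -F'\t' '$1 == "@SQ" { sub(/^SN:/, "", $2); print $2 }' \
    | sort > bam_contigs.txt
cut -f1 ${REFERENCE}.fai | sort > ref_contigs.txt
comm -3 bam_contigs.txt ref_contigs.txt
```

Silence from `comm -3` means the two contig sets agree; any output means the BAM was aligned against a different reference build or naming convention.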
Integration with Other Data Types
Transcriptomic Data Integration:
- Combine with RNA-seq to assess functional impact of variants
- Use allele-specific expression analysis to validate variant calls
- Correlate variants with splicing changes and gene expression levels
Epigenomic Context:
- Integrate with ChIP-seq data to identify regulatory variants
- Use chromatin accessibility data (ATAC-seq) to prioritize variants
- Consider chromatin interaction data (Hi-C) for long-range effects
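As a concrete example of prioritizing by chromatin accessibility, the variant BED file produced earlier can be intersected with ATAC-seq peaks. A sketch (the `atac_peaks.bed` file is a placeholder for your own peak calls):

```shell
# Keep only variants that fall inside ATAC-seq accessible regions;
# -u reports each variant at most once, however many peaks it overlaps
bedtools intersect \
    -a ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_variants.bed \
    -b atac_peaks.bed \
    -u > ${SAMPLE}_variants_in_open_chromatin.bed
```

The same pattern applies to ChIP-seq peak files or any other BED-formatted regulatory annotation.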
Troubleshooting Common Issues
GATK-Specific Problems
VQSR Training Failures:
- Ensure sufficient variant density for machine learning training
- Check resource file compatibility with your reference genome version
- Consider using hard filtering for very small datasets (<1000 variants)
- Verify all resource files have proper indices
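When VQSR cannot be trained, GATK's documented hard filters are the standard fallback. A sketch for SNPs (the thresholds are GATK's recommended starting points; tune them for your data, and note that failing sites are tagged in the FILTER column rather than removed):

```shell
# Select SNPs, then apply GATK's recommended hard filters as a VQSR fallback
gatk SelectVariants -R ${REFERENCE} -V cohort_raw_variants.vcf.gz \
    --select-type-to-include SNP -O cohort_snps.vcf.gz

gatk VariantFiltration -R ${REFERENCE} -V cohort_snps.vcf.gz \
    --filter-expression "QD < 2.0"              --filter-name "QD2" \
    --filter-expression "FS > 60.0"             --filter-name "FS60" \
    --filter-expression "MQ < 40.0"             --filter-name "MQ40" \
    --filter-expression "MQRankSum < -12.5"     --filter-name "MQRankSum-12.5" \
    --filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
    -O cohort_snps_hardfiltered.vcf.gz
```

Indels need their own selection pass and a different threshold set (GATK recommends, e.g., QD < 2.0 and FS > 200.0 for indels), so do not reuse the SNP expressions verbatim.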
Joint Calling Issues:
- Verify GVCF files are properly formatted and indexed
- Check chromosome naming consistency across all input files
- Monitor GenomicsDB workspace size limits for large cohorts
- Ensure sufficient disk space for temporary files
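Before launching GenomicsDBImport, it is cheap to validate that every GVCF listed in the sample map (the tab-separated file built in the joint-calling workflow below) actually exists and is indexed:

```shell
# Check that each GVCF in the sample map exists and has a .tbi index;
# silence means everything was found
while IFS=$'\t' read -r sample gvcf; do
    [ -f "$gvcf" ]     || echo "Missing GVCF for $sample: $gvcf"
    [ -f "$gvcf.tbi" ] || echo "Missing index for $sample: $gvcf.tbi"
done < sample_map.txt
```

Catching a missing file here is far cheaper than having GenomicsDBImport fail hours into a large cohort import.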
Memory and Performance Problems:
- Monitor Java heap size usage and increase as needed
- Use GATK’s built-in progress monitoring to identify bottlenecks
- Consider splitting large chromosomes for parallel processing
- Implement checkpointing for long-running analyses
Data Interpretation Challenges
Variant Filtration:
- Balance sensitivity vs specificity based on your study goals
- Validate filtering strategies with known positive and negative controls
- Consider population-specific allele frequencies for rare variant analysis
- Use orthogonal validation methods for critical variants
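For the population-frequency step, bcftools expressions can drop common variants once the VCF carries an allele-frequency annotation. In the sketch below, `gnomAD_AF` is a placeholder for whatever INFO tag your annotation step actually produced; adjust the name and threshold to your study:

```shell
# Exclude variants with an annotated gnomAD allele frequency >= 0.1%;
# -e drops only records where the expression is true, so sites that lack
# the annotation (novel variants) are retained
bcftools view -e 'INFO/gnomAD_AF >= 0.001' \
    annotated.vcf.gz -Oz -o rare_candidates.vcf.gz
bcftools index -t rare_candidates.vcf.gz
```

Using `-e` (exclude) rather than `-i` (include) is deliberate here: an include expression on a missing tag would silently discard unannotated variants, which is usually the opposite of what a rare-disease analysis wants.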
Clinical Interpretation:
- Focus on clinically actionable genes and pathogenic variants
- Use established variant classification guidelines (ACMG/AMP)
- Establish policies for the return of incidental (secondary) findings
- Ensure appropriate genetic counseling resources are available
Advanced Analysis Considerations
Multi-Sample Joint Calling Workflow
For population studies or family analyses, joint calling provides more accurate variant calls by leveraging information across all samples:
# Create a directory for joint calling
mkdir -p ~/wgs_analysis/results/joint
cd ~/wgs_analysis/results/joint
# Create sample map for all samples in your cohort
echo -e "sample1\t/path/to/sample1.g.vcf.gz" > sample_map.txt
echo -e "sample2\t/path/to/sample2.g.vcf.gz" >> sample_map.txt
echo -e "sample3\t/path/to/sample3.g.vcf.gz" >> sample_map.txt
# Import GVCFs to GenomicsDB workspace
gatk GenomicsDBImport \
--sample-name-map sample_map.txt \
--genomicsdb-workspace-path genomicsdb \
--intervals chr1 \
--reader-threads 4
# Processing one chromosome at a time keeps memory and workspace size manageable for large datasets
# Perform joint genotyping
gatk GenotypeGVCFs \
-R ${REFERENCE} \
-V gendb://genomicsdb \
-O cohort_raw_variants.vcf.gz
# Once you have `cohort_raw_variants.vcf.gz` from joint calling, follow the same Steps 8-11 as in the single-sample analysis.
Joint calling improves accuracy for rare variants by borrowing statistical strength across samples and ensures consistent variant representation across the entire cohort.
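The per-chromosome import suggested in the comments above can be scripted as a simple loop. A sketch (contig names assume a chr-prefixed reference such as GRCh38; each chromosome then gets its own GenotypeGVCFs run against its `gendb://genomicsdb_${chrom}` workspace):

```shell
# Import GVCFs one chromosome at a time to bound memory use and
# keep each GenomicsDB workspace a manageable size
for chrom in chr{1..22} chrX chrY; do
    gatk GenomicsDBImport \
        --sample-name-map sample_map.txt \
        --genomicsdb-workspace-path genomicsdb_${chrom} \
        --intervals ${chrom} \
        --reader-threads 4
done
```

On a cluster, each loop iteration is independent and can be submitted as a separate job, turning a serial import into an embarrassingly parallel one.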
Conclusion
This comprehensive tutorial has guided you through the complete GATK workflow for WGS analysis, from raw sequencing reads to high-quality, annotated variants. You’ve learned to:
- Set up a robust computational environment for genomic analysis
- Perform quality control and preprocessing of sequencing data
- Execute accurate read alignment and post-processing steps
- Implement sophisticated variant calling and filtering approaches
- Apply functional annotation to understand variant consequences
The GATK pipeline represents a mature, well-validated approach that balances sensitivity and specificity in variant detection. While the computational requirements are substantial, the resulting high-quality variant calls provide a solid foundation for downstream analyses in both research and clinical contexts.
Key Takeaways
Quality is Paramount: Every step includes quality control measures to ensure reliable results. Never skip QC steps, as they often reveal issues that could compromise your entire analysis.
Resources Matter: Adequate computational resources (memory, storage, CPU) are essential for efficient WGS analysis. Plan your infrastructure carefully before beginning large-scale projects.
Validation is Essential: Use orthogonal methods to validate critical findings, especially for clinical applications. No computational pipeline is perfect, and validation provides confidence in your results.
Documentation Enables Reproducibility: Comprehensive documentation of parameters, software versions, and methodological choices enables reproducible science and facilitates troubleshooting.
Integration Adds Value: The greatest insights often come from integrating WGS data with other genomic data types (RNA-seq, ChIP-seq, clinical phenotypes) rather than analyzing variants in isolation.
Next Steps in Your Genomic Analysis Journey
With high-quality variants in hand, you’re now ready to explore:
- Population genomics approaches for understanding genetic diversity and evolution
- Pharmacogenomic analysis for personalized medicine applications
- Functional genomics integration with transcriptomic and epigenomic data
- Clinical variant interpretation using established guidelines and databases
- Advanced statistical genetics methods for association studies and heritability analysis
Each of these advanced analyses builds upon the solid foundation you’ve established through this tutorial. The principles of quality control, systematic analysis, and careful validation apply across all areas of genomics research.
Remember that variant calling is just the beginning of genomic analysis. The true value of WGS data emerges when variants are interpreted in their biological context, integrated with other data types, and translated into actionable insights for human health or biological understanding.
Additional Resources
Essential Databases and Tools
- GATK Resource Bundle: Comprehensive collection of reference datasets
- gnomAD: Population allele frequencies for variant interpretation
- ClinVar: Clinical variant significance database
- OMIM: Gene-disease relationship database
- IGV (Integrative Genomics Viewer): Essential for variant visualization
References
- McKenna A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297-303.
- Van der Auwera GA, et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics. 2013;43:11.10.1-33.
- DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43(5):491-8.
- Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754-60.
- Poplin R, Ruano-Rubio V, DePristo MA, et al. Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv. 2017. doi:10.1101/201178.
- Poplin R, et al. A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol. 2018;36(10):983-987.
- Chen X, et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics. 2016;32(8):1220-2.
- Cingolani P, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012;6(2):80-92.
- Li H, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078-9.
- Danecek P, et al. Twelve years of SAMtools and BCFtools. Gigascience. 2021;10(2):giab008.
- Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841-2.
This tutorial is part of the NGS101.com comprehensive guide to next-generation sequencing analysis. If this tutorial helped advance your research, please comment and share your experience to help other researchers! Subscribe to stay updated with our latest bioinformatics tutorials and resources.