Understanding Whole Genome Sequencing and Its Applications
What is Whole Genome Sequencing (WGS)?
Whole Genome Sequencing represents one of the most comprehensive approaches to studying genetic variation across an entire organism’s DNA. Unlike targeted sequencing approaches that focus on specific regions of interest, WGS captures virtually every nucleotide in the genome, providing an unbiased view of genetic variation including single nucleotide variants (SNVs), insertions and deletions (indels), structural variants, and copy number variations.
The technology behind WGS has evolved dramatically since the completion of the Human Genome Project. Modern sequencing platforms can now generate billions of short DNA sequences (reads) in a single run, covering the human genome dozens of times over. This redundant coverage, known as sequencing depth, allows for accurate detection of genetic variants even in challenging genomic regions.
Beginner’s Tip: Think of WGS as taking millions of overlapping photographs of a landscape from different angles. Each photograph (sequencing read) shows a small portion, but when combined, they create a detailed map of the entire terrain (genome).
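The arithmetic behind sequencing depth is simple: total sequenced bases divided by genome size. A back-of-the-envelope sketch in shell, using illustrative numbers (not from a real run):

```shell
# Estimate theoretical mean coverage for a WGS run.
# The numbers below are illustrative, not from a real dataset.
num_reads=600000000      # total reads (e.g. 300M read pairs)
read_length=150          # bases per read
genome_size=3100000000   # approximate human genome size (bp)

# mean depth = total sequenced bases / genome size
depth=$(( num_reads * read_length / genome_size ))
echo "Estimated mean depth: ${depth}x"
```

At these numbers the run works out to roughly 29x, close to the ~30x commonly targeted for germline WGS.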
Clinical and Research Applications of WGS
WGS has revolutionized both research and clinical practice across multiple domains:
Medical Diagnostics and Precision Medicine:
- Rare disease diagnosis, particularly when standard genetic tests fail to identify causative mutations
- Pharmacogenomics applications to predict drug responses and optimize medication selection
- Cancer genomics to identify somatic mutations driving tumor growth and resistance to therapy
- Prenatal and neonatal screening for genetic disorders
Population and Evolutionary Genetics:
- Large-scale population studies to understand human genetic diversity and migration patterns
- Identification of genetic factors contributing to complex diseases through genome-wide association studies (GWAS)
- Agricultural genomics for crop improvement and livestock breeding programs
Infectious Disease Research:
- Pathogen surveillance and outbreak investigation through viral and bacterial genome sequencing
- Understanding antimicrobial resistance mechanisms in clinical isolates
- Tracking disease transmission patterns during epidemics
Biological Insights from WGS Analysis
WGS data provides unprecedented insights into genome organization and function:
- Variant Discovery: Identification of all types of genetic variants, from single base changes to large structural rearrangements
- Functional Impact Assessment: Prediction of how variants affect protein function and gene regulation
- Population Stratification: Understanding genetic ancestry and population structure
- Evolutionary Signatures: Detection of regions under positive or negative selection pressure
- Genome-wide Patterns: Analysis of mutation rates, recombination hotspots, and chromatin organization
WGS Analysis Pipeline Options
Several computational pipelines have been developed for WGS analysis, each with different strengths. The most commonly used tools include GATK (Genome Analysis Toolkit), DeepVariant, Strelka2, and FreeBayes. Each pipeline has been optimized for different use cases: some prioritize speed for clinical applications, others focus on maximum accuracy, and some offer extensive customization for research applications.
GATK, developed by the Broad Institute, has emerged as the industry standard for germline variant calling and is the most widely adopted pipeline in both clinical and research settings. It offers comprehensive best practices workflows that have been extensively validated across thousands of genomes, sophisticated algorithms for quality control and filtering, and excellent documentation with strong community support.
Why GATK for This Tutorial
We’ve chosen GATK for this tutorial because it provides the ideal learning foundation for newcomers to WGS analysis. Its well-documented workflows, extensive educational resources, and systematic approach to quality control make it perfect for understanding the fundamental principles of variant calling. The skills and concepts you learn from GATK will transfer readily to other pipelines as your expertise grows and your project requirements evolve.
GATK Workflow Overview
The GATK workflow for WGS analysis follows a systematic approach:
- Data Preparation: Quality control and preprocessing of raw sequencing reads
- Alignment: Mapping reads to a reference genome using BWA-MEM
- Post-alignment Processing: Marking duplicates and recalibrating base quality scores
- Variant Calling: Identifying genetic variants using HaplotypeCaller
- Variant Filtering: Applying quality filters to retain high-confidence variants
- Annotation: Adding functional information to variants for interpretation
Each step includes rigorous quality control measures and follows established best practices developed through analysis of thousands of genomes.
Setting Up Your WGS Analysis Environment
Before beginning the analysis, we need to establish a computational environment with all necessary tools and dependencies.
Creating a Dedicated Conda Environment
# Create a new conda environment for WGS analysis
# This isolates our tools from other projects and prevents conflicts
conda create -n wgs_analysis python=3.9
# Activate the environment - all subsequent installations will go here
conda activate wgs_analysis
# Add bioinformatics channels to conda
# bioconda: specialized channel for bioinformatics software
# conda-forge: community-maintained packages with high quality standards
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict # Prevents package conflicts
# Install GATK and essential tools for the entire workflow
# Tool roles:
#   gatk4        - main variant calling toolkit
#   bwa          - read alignment
#   samtools     - SAM/BAM file manipulation
#   picard       - Java-based tools for BAM processing
#   trim-galore  - adapter trimming and quality control
#   fastqc       - sequencing quality assessment
#   bcftools     - VCF file manipulation and statistics
#   snpeff       - variant annotation and effect prediction
#   bedtools     - genomic interval operations
#   vcftools     - VCF filtering and format conversion
#   tabix        - VCF indexing and querying
#   tree         - show directory structure
conda install -y \
    gatk4 \
    bwa \
    samtools \
    picard \
    trim-galore \
    fastqc \
    bcftools \
    snpeff \
    bedtools \
    vcftools \
    tabix \
    tree
# Verify GATK installation and check version
gatk --version
This creates an isolated environment for WGS analysis with all required tools. The bioconda and conda-forge channels provide specialized bioinformatics software with high quality standards. Setting channel priority to strict prevents package conflicts.
Reference Genome and Resource Files
The reference genome serves as the “map” against which we align our sequencing reads. We’ll use the human reference genome GRCh38/hg38, which is the current standard:
# Create organized directory structure for the project
mkdir -p ~/wgs_analysis/{reference,data,results}
cd ~/wgs_analysis/reference
# Download human reference genome (GRCh38/hg38 - the current standard)
# This is the "map" we'll align our reads to
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dict
# Create BWA index - this allows BWA to quickly find where reads should align
# This step takes ~2-3 hours but only needs to be done once
bwa index Homo_sapiens_assembly38.fasta
The BWA index step takes 2-3 hours but only needs to be done once. This creates specialized data structures that allow BWA to quickly find where reads should align.
Next, we’ll download the GATK resource bundle: high-quality variant datasets used for base quality score recalibration and variant filtering:
# Download GATK resource bundle - these are "truth sets" of known variants
# Used for base quality score recalibration and variant filtering
# HapMap: High-quality SNPs used for training variant filters
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/hapmap_3.3.hg38.vcf.gz
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/hapmap_3.3.hg38.vcf.gz.tbi
# 1000 Genomes Omni: Another high-quality variant set for training
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/1000G_omni2.5.hg38.vcf.gz
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/1000G_omni2.5.hg38.vcf.gz.tbi
# 1000 Genomes high-confidence SNPs: Large collection of validated variants
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi
# Mills and 1000G indels: High-quality insertions and deletions
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
# dbSNP: Database of known variants - helps identify novel vs. known variants
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.gz
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.gz.tbi
These files contain different types of high-quality variants:
- HapMap: Highest quality SNPs used for training variant filters
- 1000 Genomes Omni: Another high-quality variant set for training
- 1000 Genomes high-confidence SNPs: Large collection of validated variants
- Mills and 1000G indels: High-quality insertions and deletions
- dbSNP: Database of known variants for annotation
Important Note: The GATK resource bundle moved from the old genomics-public-data bucket to the new gcp-public-data--broad-references bucket. Most files now come pre-indexed with .tbi files.
Complete WGS Analysis Pipeline
Let’s walk through the complete validated pipeline. Each step includes comprehensive quality control and follows GATK best practices. For this tutorial, we’ll analyze a tumor sample called “tumor1” with paired-end FASTQ files.
Step 1: Quality Control and Preprocessing
Quality assessment helps identify potential issues with sequencing data before proceeding with alignment. We’ll examine base quality scores, GC content distribution, adapter contamination, and sequence length distribution:
# Set up variables and create directory structure
# Note: leave the ~ unquoted so the shell expands it to your home directory
REFERENCE=~/wgs_analysis/reference/Homo_sapiens_assembly38.fasta
SAMPLE="tumor1"
THREADS=20
# Create organized directory structure for results
mkdir -p ~/wgs_analysis/results/{qc,trimmed,aligned,recal,var}/${SAMPLE}
# Run FastQC to assess the quality of raw sequencing data
# This generates HTML reports showing:
# - Base quality scores across read positions
# - GC content distribution
# - Adapter contamination levels
# - Overrepresented sequences
fastqc \
~/wgs_analysis/data/${SAMPLE}_R1.fastq.gz \
~/wgs_analysis/data/${SAMPLE}_R2.fastq.gz \
-o ~/wgs_analysis/results/qc/${SAMPLE}
FastQC generates HTML reports showing base quality scores across read positions, GC content distribution, adapter contamination levels, and overrepresented sequences. Check these reports to ensure your data meets quality standards before proceeding.
Step 2: Adapter Trimming and Quality Filtering
Modern sequencing often includes adapter sequences that must be removed before alignment. We’ll also filter out low-quality bases and very short reads:
# Trim adapters and low-quality bases using Trim Galore
# This removes:
# - Illumina sequencing adapters that can interfere with alignment
# - Low-quality bases (quality score < 20) from read ends
# - Very short reads (< 50bp) that may align poorly
# --paired: process R1/R2 together; --quality 20: trim bases below Q20
# --length 50: discard reads shorter than 50bp after trimming
# --fastqc: re-run FastQC on the trimmed reads for comparison
trim_galore \
    --paired \
    --quality 20 \
    --length 50 \
    --fastqc \
    --output_dir ~/wgs_analysis/results/trimmed/${SAMPLE} \
    ~/wgs_analysis/data/${SAMPLE}_R1.fastq.gz \
    ~/wgs_analysis/data/${SAMPLE}_R2.fastq.gz
This step removes Illumina sequencing adapters, trims bases with quality scores below 20, discards reads shorter than 50bp after trimming, and generates FastQC reports on the cleaned data. The output files will be named with _val_1.fq.gz and _val_2.fq.gz suffixes.
Step 3: Read Alignment with BWA-MEM
BWA-MEM maps our cleaned sequencing reads to the reference genome. The algorithm is specifically designed for 70bp-1Mbp Illumina reads and handles chimeric reads effectively:
# Align reads to reference genome using BWA-MEM algorithm
# BWA-MEM is specifically designed for 70bp-1Mbp Illumina reads
# -t: CPU threads; -M: mark shorter split hits as secondary (Picard compatibility)
# -R: read group info (required by GATK); output is piped straight into
#     samtools sort to produce a coordinate-sorted BAM
bwa mem \
    -t ${THREADS} \
    -M \
    -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA\tLB:${SAMPLE}_lib" \
    ${REFERENCE} \
    ~/wgs_analysis/results/trimmed/${SAMPLE}/${SAMPLE}_R1_val_1.fq.gz \
    ~/wgs_analysis/results/trimmed/${SAMPLE}/${SAMPLE}_R2_val_2.fq.gz \
    | samtools sort -@ ${THREADS} -o ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}.bam
# Create index for the BAM file - this allows random access to specific regions
samtools index ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}.bam
The read group information (@RG) is essential for GATK processing. It includes the read group identifier (ID), sample name (SM), sequencing platform (PL), and library identifier (LB). The output is piped directly to samtools for coordinate sorting, and we create an index for random access to specific genomic regions.
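A malformed read-group string is a common source of downstream GATK errors, so it can help to preview the header line before launching a long alignment. `printf '%b'` expands the `\t` escapes the same way they end up as real tabs in the BAM header (a quick sanity-check sketch, reusing the tutorial's sample name):

```shell
# Preview how the -R read-group string will look once its \t escapes
# become real tabs in the BAM @RG header line.
SAMPLE="tumor1"
RG="@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA\tLB:${SAMPLE}_lib"
printf '%b\n' "$RG"
```

After alignment, `samtools view -H` on the BAM, filtered for lines starting with `@RG`, should show the same line.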
Step 4: Mark PCR and Optical Duplicates
PCR amplification during library preparation and optical duplicates from sequencing can bias variant calling by making variants appear more frequent than they actually are:
# Mark PCR and optical duplicates using GATK MarkDuplicates
# Duplicates arise from:
# - PCR amplification during library preparation
# - Optical duplicates from sequencing (reads from same DNA cluster)
# These can bias variant calling by making variants appear more frequent
gatk MarkDuplicates \
    -I ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}.bam \
    -O ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}_marked_duplicates.bam \
    -M ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}_duplicate_metrics.txt \
    --CREATE_INDEX true
# -M writes the duplication statistics; --CREATE_INDEX also builds a BAM index
The metrics file reports the total number of reads, number of duplicate reads, and duplication rate. For good WGS libraries, the duplication rate should be less than 30%. Higher duplication rates may indicate issues with library preparation or low input DNA amounts.
Step 5: Base Quality Score Recalibration (BQSR)
Sequencing machines sometimes systematically over- or under-estimate the probability that a base call is correct. BQSR corrects these systematic errors using patterns observed at known variant sites:
# Step 1: Generate recalibration data table
# This analyzes patterns of base quality score errors using known variant sites
# --known-sites: truth sets of known SNPs (dbSNP) and indels (Mills)
gatk BaseRecalibrator \
    -I ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}_marked_duplicates.bam \
    -R ${REFERENCE} \
    --known-sites ~/wgs_analysis/reference/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
    --known-sites ~/wgs_analysis/reference/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    -O ~/wgs_analysis/results/recal/${SAMPLE}/${SAMPLE}_recal_data.table
# Step 2: Apply base quality score recalibration
# This updates the quality scores in your BAM file based on the patterns found above
gatk ApplyBQSR \
-I ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}_marked_duplicates.bam \
-R ${REFERENCE} \
--bqsr-recal-file ~/wgs_analysis/results/recal/${SAMPLE}/${SAMPLE}_recal_data.table \
-O ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}_recalibrated.bam
The first step analyzes patterns of base quality score errors using known SNPs from dbSNP and known indels from the Mills dataset. The second step applies these corrections to update quality scores in the BAM file. This improves variant calling accuracy, particularly for low-frequency variant detection.
Step 6: Alignment Quality Assessment
Before proceeding to variant calling, we should assess the quality of our alignments to ensure they meet expected standards:
# Collect alignment summary metrics
# This provides statistics about how well your reads aligned to the reference
gatk CollectAlignmentSummaryMetrics \
-R ${REFERENCE} \
-I ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}_recalibrated.bam \
-O ~/wgs_analysis/results/qc/${SAMPLE}/${SAMPLE}_alignment_summary.txt
# Collect insert size metrics (for paired-end data)
# Insert size is the distance between paired reads - important for structural variant detection
gatk CollectInsertSizeMetrics \
-I ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}_recalibrated.bam \
-O ~/wgs_analysis/results/qc/${SAMPLE}/${SAMPLE}_insert_size_metrics.txt \
-H ~/wgs_analysis/results/qc/${SAMPLE}/${SAMPLE}_insert_size_histogram.pdf
Key metrics to check include:
- TOTAL_READS: Total number of reads processed
- PCT_PF_READS_ALIGNED: Percentage of reads aligned (should be >95% for human WGS)
- PF_MISMATCH_RATE: Rate of mismatches (should be <5%)
- Insert size distribution: Should match library preparation (typically 300-500bp for most WGS libraries)
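Mean depth is also worth checking directly before variant calling, since it is just the average of per-base depth values. The sketch below runs the computation on toy values; on real data you would stream `samtools depth -a` output from the recalibrated BAM instead:

```shell
# Mean depth is the average of the per-base depth column (column 3 of
# samtools depth output). Toy values below, not from a real sample.
cat > /tmp/demo_depth.txt <<'EOF'
chr1 1 28
chr1 2 30
chr1 3 32
chr1 4 30
EOF
awk '{ sum += $3 } END { printf "mean depth = %.1fx\n", sum/NR }' /tmp/demo_depth.txt
```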
Step 7: Variant Calling with HaplotypeCaller
HaplotypeCaller is GATK’s primary variant calling algorithm. It uses local assembly and haplotype-based calling to achieve high accuracy, particularly in complex genomic regions:
# Call variants in GVCF (Genomic Variant Call Format) mode
# GVCF mode records information about ALL sites (variant and non-variant)
# This allows for accurate joint genotyping when analyzing multiple samples
# -ERC GVCF: Emit Reference Confidence mode, producing a GVCF
# --dbsnp: annotate calls with known variant IDs
gatk HaplotypeCaller \
    -R ${REFERENCE} \
    -I ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}_recalibrated.bam \
    -O ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}.g.vcf.gz \
    -ERC GVCF \
    --dbsnp ~/wgs_analysis/reference/Homo_sapiens_assembly38.dbsnp138.vcf.gz
GVCF (Genomic Variant Call Format) mode records information about all genomic positions, not just variant sites. This enables accurate joint genotyping when analyzing multiple samples and is essential for population-scale studies and family analyses.
For single-sample analysis, we need to convert the GVCF to a standard VCF format:
# Convert GVCF to regular VCF for single-sample analysis
# This step is REQUIRED before variant filtering
gatk GenotypeGVCFs \
-R ${REFERENCE} \
-V ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}.g.vcf.gz \
-O ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_raw_variants.vcf.gz
Step 8: Variant Quality Score Recalibration (VQSR)
VQSR uses machine learning to distinguish true variants from artifacts based on annotation profiles from high-quality training datasets. This approach is more sophisticated than simple hard filtering with arbitrary thresholds.
SNP Recalibration:
# Build SNP recalibration model using multiple high-quality training sets
# Each training set has different confidence levels and purposes
# Training resources (priors reflect confidence):
#   hapmap (15.0) - highest-quality SNPs, used as truth
#   omni   (12.0) - high quality, training only
#   1000G  (10.0) - large high-confidence training set
#   dbsnp  (2.0)  - known variants, not used for training
gatk VariantRecalibrator \
    -R ${REFERENCE} \
    -V ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_raw_variants.vcf.gz \
    --resource:hapmap,known=false,training=true,truth=true,prior=15.0 \
        ~/wgs_analysis/reference/hapmap_3.3.hg38.vcf.gz \
    --resource:omni,known=false,training=true,truth=false,prior=12.0 \
        ~/wgs_analysis/reference/1000G_omni2.5.hg38.vcf.gz \
    --resource:1000G,known=false,training=true,truth=false,prior=10.0 \
        ~/wgs_analysis/reference/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 \
        ~/wgs_analysis/reference/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    -mode SNP \
    -O ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_snps.recal \
    --tranches-file ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_snps.tranches
# Apply SNP recalibration filter
# This assigns PASS/FAIL based on the machine learning model
# --truth-sensitivity-filter-level 99.0: retain 99% of true variants (high sensitivity)
gatk ApplyVQSR \
    -R ${REFERENCE} \
    -V ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_raw_variants.vcf.gz \
    --recal-file ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_snps.recal \
    --tranches-file ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_snps.tranches \
    -mode SNP \
    --truth-sensitivity-filter-level 99.0 \
    -O ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_snps_recalibrated.vcf.gz
The annotations used for SNP recalibration include:
- QD (QualByDepth): variant quality normalized by depth
- MQ (MappingQuality): how well reads map to the reference
- MQRankSum: difference in mapping quality between reads supporting the reference and alternate alleles
- ReadPosRankSum: position of variants within reads
- FS (FisherStrand): indicator of strand bias
- SOR (StrandOddsRatio): another measure of strand bias
Indel Recalibration:
# Build indel recalibration model
# Indels are harder to call accurately, so we use fewer annotations
# mills: high-quality indel truth set; dbsnp: known variants, not for training
# Fewer annotations and --max-gaussians 4 because indels have less training data
gatk VariantRecalibrator \
    -R ${REFERENCE} \
    -V ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_snps_recalibrated.vcf.gz \
    --resource:mills,known=false,training=true,truth=true,prior=12.0 \
        ~/wgs_analysis/reference/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 \
        ~/wgs_analysis/reference/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
    -an QD -an ReadPosRankSum -an FS -an SOR \
    -mode INDEL \
    --max-gaussians 4 \
    -O ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_indels.recal \
    --tranches-file ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_indels.tranches
# Apply indel recalibration
# Using 95% sensitivity (slightly more stringent than SNPs)
gatk ApplyVQSR \
-R ${REFERENCE} \
-V ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_snps_recalibrated.vcf.gz \
--recal-file ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_indels.recal \
--tranches-file ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_indels.tranches \
-mode INDEL \
--truth-sensitivity-filter-level 95.0 \
-O ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_filtered.vcf.gz
Indel recalibration uses fewer annotations because indels are harder to call accurately and have less training data available. We limit model complexity with --max-gaussians 4 and use slightly more stringent filtering (95% sensitivity vs 99% for SNPs). The final result contains high-quality variants that have passed both SNP and indel filters with a much lower false positive rate than raw calls.
Understanding the VCF File Format:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1
chr1 12345 rs123 A G 60 PASS DP=30 GT:DP 1/1:30
Columns:
- Position: Chromosome 1, position 12345, variant ID rs123
- Variant: A→G substitution
- Quality Score: 60
- Filter Status: PASS/FAIL
- Overall coverage: 30 reads at this position
- Format: Genotype:Depth
- Sample genotype: Homozygous for the G allele (both copies are G)
- Sample coverage: 30 reads supporting this call in this sample
Common genotype codes:
- 0/0: Homozygous reference (A/A in this example)
- 0/1: Heterozygous (A/G in this example)
- 1/1: Homozygous alternate (G/G in this example)
- 1/2: Heterozygous with two different alternate alleles
- ./.: Missing/unknown genotype
Separator meaning:
- /: Unphased (we don’t know which allele came from which parent)
- |: Phased (maternal vs. paternal inheritance is known)
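These codes are enough to compute the Het/Hom QC ratio yourself. A sketch over a toy genotype column; with real data you would feed it the GT values extracted from your VCF (e.g. via `bcftools query`) instead:

```shell
# Classify genotypes as heterozygous or homozygous-alternate and report
# the het/hom ratio. Toy genotype list, not from a real sample.
cat > /tmp/demo_genotypes.txt <<'EOF'
0/1
1/1
0/1
0/1
1/2
1|1
EOF
awk '{ gt = $1; gsub(/\|/, "/", gt); split(gt, a, "/")
       if (a[1] != a[2]) het++
       else if (a[1] != "0") hom++ }
     END { printf "het=%d hom=%d het/hom=%.2f\n", het, hom, het/hom }' /tmp/demo_genotypes.txt
```

Note that phased separators (`|`) are normalized to `/` first, so `1|1` counts as homozygous alternate just like `1/1`.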
Step 9: Functional Annotation with SnpEff
Functional annotation predicts the biological effect of each variant on genes and proteins. This information is crucial for interpreting variant significance:
# Annotate variants with functional consequences
# This predicts the effect of each variant on genes and proteins
# -Xmx32g: give the Java VM 32GB of memory (place it before the subcommand)
# -stats: write an HTML summary report
# GRCh38.105: Ensembl-based SnpEff database (downloaded on first use)
snpEff -Xmx32g ann \
    -stats ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_annotation_stats.html \
    GRCh38.105 \
    ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_filtered.vcf.gz \
    > ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_annotated.vcf
SnpEff annotation includes gene names and descriptions, transcript IDs and biotypes, predicted effects (HIGH/MODERATE/LOW/MODIFIER impact), amino acid changes for protein-coding variants, and distances to nearest genes for intergenic variants. The statistics HTML file provides a comprehensive overview of variant effects and quality metrics.
Convert to More Readable Formats:
The annotated VCF file is large and complex. Convert it to a tab-delimited table for easier analysis:
# Convert VCF to a readable table format with key annotation fields
# This extracts the most important information into columns
# -F selects the fields to extract into columns
gatk VariantsToTable \
    -V ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_annotated.vcf \
    -F CHROM -F POS -F REF -F ALT -F QUAL -F ANN \
    -O ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_variants_table.tsv

The resulting table contains these key columns:
- CHROM: Chromosome where the variant is located
- POS: Genomic position of the variant
- REF: Reference allele (what the reference genome has at this position)
- ALT: Alternate allele (the variant observed in your sample)
- QUAL: Quality score of the variant call (higher = more confident)
- ANN: SnpEff annotation containing effect predictions, gene names, and impact ratings
The ANN field contains detailed annotations separated by pipes (|), including effect type (e.g., missense_variant), impact level (HIGH/MODERATE/LOW/MODIFIER), gene name, and amino acid changes for protein-coding variants. This tabular format makes it much easier to filter, sort, and analyze variants in spreadsheet applications or downstream analysis tools.
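Because the impact level sits between fixed pipe separators in each ANN entry, high-impact candidates can be pulled out with a one-line awk filter. A sketch on a toy two-row table (gene names here are hypothetical placeholders; the real VariantsToTable output is tab-separated, so add `-F'\t'` when running on it):

```shell
# Keep the header plus any row whose ANN field contains a HIGH impact rating.
# Toy space-separated table with placeholder genes, standing in for the TSV.
cat > /tmp/demo_variants.tsv <<'EOF'
CHROM POS REF ALT QUAL ANN
chr1 12345 A G 60 G|missense_variant|MODERATE|GENE1|...
chr2 54321 C T 80 T|stop_gained|HIGH|GENE2|...
EOF
awk 'NR == 1 || $6 ~ /\|HIGH\|/' /tmp/demo_variants.tsv
```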
Step 10: Variant Statistics and Quality Assessment
Finally, we’ll generate comprehensive statistics about our variant calls to assess the overall quality of the analysis:
# Generate comprehensive variant statistics using bcftools
# This provides detailed metrics about your variant calls
bcftools stats ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_filtered.vcf.gz > ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_variant_stats.txt
# Count different types of variants
# SNPs (single nucleotide polymorphisms); -H suppresses the VCF header
bcftools view -H -v snps ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_filtered.vcf.gz | wc -l > ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_snp_count.txt
# Indels (insertions and deletions)
bcftools view -H -v indels ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_filtered.vcf.gz | wc -l > ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_indel_count.txt
Expected numbers for human WGS at 30x coverage:
- SNPs: 4-5 million per genome
- Indels: 500,000-800,000 per genome
- Ti/Tv ratio: 2.0-2.1 (transitions vs transversions)
- Het/Hom ratio: 1.5-2.0 (heterozygous vs homozygous variants)
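The Ti/Tv ratio summarizes how often substitutions are transitions (A↔G, C↔T) versus transversions; random errors would push it toward 0.5, so values near 2.0 indicate biologically plausible calls. A toy illustration of the computation on made-up REF/ALT pairs:

```shell
# Transitions are A<->G and C<->T; every other substitution is a transversion.
# Toy REF/ALT pairs below, not a real callset.
cat > /tmp/demo_snps.txt <<'EOF'
A G
C T
G A
A C
T C
G T
EOF
awk 'BEGIN { ti["AG"] = ti["GA"] = ti["CT"] = ti["TC"] = 1 }
     { if (ti[$1 $2]) t++; else v++ }
     END { printf "Ti=%d Tv=%d Ti/Tv=%.1f\n", t, v, t/v }' /tmp/demo_snps.txt
```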
Quick Quality Control Summary:
After completing the analysis, it’s helpful to generate a comprehensive QC summary to assess the overall quality of your WGS data:
#!/bin/bash
# Quick QC Summary Script for WGS Analysis
# This script extracts key quality metrics from all analysis steps
SAMPLE=tumor1
echo "========================================="
echo "WGS Quality Control Summary for ${SAMPLE}"
echo "========================================="
echo
echo "=== ALIGNMENT QUALITY ==="
echo -n "Mapping Rate: "
grep -A1 "FIRST_OF_PAIR" ~/wgs_analysis/results/qc/${SAMPLE}/${SAMPLE}_alignment_summary.txt | tail -1 | cut -f7
echo " (Benchmark: >95%)"
echo
echo -n "Duplicate Rate: "
grep -A1 "LIBRARY" ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}_duplicate_metrics.txt | tail -1 | cut -f9
echo " (Benchmark: <30%)"
echo
echo -n "Median Insert Size: "
grep -A1 "MEDIAN_INSERT_SIZE" ~/wgs_analysis/results/qc/${SAMPLE}/${SAMPLE}_insert_size_metrics.txt | tail -1 | cut -f1
echo " bp (Benchmark: 300-500bp)"
echo
echo "=== VARIANT CALLING QUALITY ==="
echo -n "Total SNPs: "
cat ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_snp_count.txt
echo " (Benchmark: 4-5 million)"
echo
echo -n "Total Indels: "
cat ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_indel_count.txt
echo " (Benchmark: 0.5-0.8 million)"
echo
echo -n "Ti/Tv Ratio: "
# Extract Ti/Tv ratio from bcftools stats output
grep "^TSTV" ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_variant_stats.txt | cut -f5
echo " (Benchmark: 2.0-2.1)"
echo
echo "For detailed variant annotation statistics, open:"
echo "~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_annotation_stats.html"
echo "========================================="
This script provides a quick overview of all critical quality metrics. Run it after completing your analysis to ensure your WGS data meets quality standards. The SnpEff HTML report provides additional detailed statistics and visualizations for comprehensive quality assessment.
To help you interpret your own results, here are the QC metrics from our tutorial example:
| Metric | Your Result | Benchmark | Status |
|---|---|---|---|
| Mapping Rate | 99.99% | >95% | ✅ Excellent |
| Duplicate Rate | 10.19% | <30% | ✅ Excellent |
| Insert Size | 197bp | 300-500bp | ⚠️ Shorter than typical |
| Total SNPs | 3.73M | 4-5M | ✅ Good |
| Total Indels | 683K | 500-800K | ✅ Good |
| Ti/Tv Ratio | 1.96 | 2.0-2.1 | ✅ Excellent |
Step 11: Creating Visualization Files
Generate files suitable for genome browser visualization and downstream analysis:
# Create BED file for genome browser visualization
# BED format is widely supported by genome browsers (IGV, UCSC, etc.)
bcftools query -f '%CHROM\t%POS0\t%END\t%ID\n' ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_filtered.vcf.gz > ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_variants.bed
# Generate coverage tracks (useful for validating variant calls)
# This shows sequencing depth across the genome
bedtools genomecov -ibam ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}_recalibrated.bam -bg > ~/wgs_analysis/results/aligned/${SAMPLE}/${SAMPLE}_coverage.bedgraph
These files can be loaded into IGV (Integrative Genomics Viewer) to visualize variant calls in genomic context, check read alignments supporting each variant, and identify potential false positives or missed variants.
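Before loading the coverage track elsewhere, note that downstream tools (tabix indexing, bigWig conversion) generally require coordinate-sorted bedGraph input. The standard `sort` incantation, sketched on toy space-separated intervals (real bedGraph files are tab-separated but sort the same way):

```shell
# Sort a bedGraph by chromosome name, then numeric start position.
# Toy intervals below, standing in for the real coverage file.
cat > /tmp/demo_coverage.bedgraph <<'EOF'
chr2 100 200 28
chr1 500 600 31
chr1 100 200 30
EOF
# -k1,1: chromosome name; -k2,2n: numeric start position
sort -k1,1 -k2,2n /tmp/demo_coverage.bedgraph > /tmp/demo_coverage.sorted.bedgraph
head -1 /tmp/demo_coverage.sorted.bedgraph
```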
Quality Control Standards and Benchmarks
Pre-alignment QC Benchmarks
- Mean base quality scores: >20 across all positions
- Adapter contamination: <5%
- Sequence length distribution: Should match expected insert size and read length
Post-alignment QC Benchmarks
- Mapping rate: >95% for human WGS data
- Duplicate rate: <30% for WGS libraries
- Insert size distribution: Should be consistent with library preparation (typically 300-500bp)
- Coverage uniformity: Should be relatively even across autosomes
Variant Calling QC Benchmarks
- Ti/Tv ratio: 2.0-2.1 for whole genome data
- Total variant count: 4-5 million SNPs + 0.5-0.8 million indels per human genome
- Het/Hom ratio: 1.5-2.0 for most populations
- dbSNP overlap: >95% of variants should be present in dbSNP for typical populations
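The dbSNP overlap can be estimated from the ID column of the final VCF, since HaplotypeCaller's `--dbsnp` option fills in rsIDs for known sites while novel calls keep a `.`. A sketch of the computation on a toy site list (with real data, stream the header-stripped VCF body from `bcftools view -H` instead):

```shell
# Fraction of calls already present in dbSNP, judged by whether the
# ID column (column 3) carries an rsID rather than ".". Toy sites only.
cat > /tmp/demo_sites.txt <<'EOF'
chr1 12345 rs123 A G
chr1 22345 . C T
chr2 555 rs77 G A
chr2 999 rs88 T C
EOF
awk '{ n++; if ($3 != ".") known++ } END { printf "known fraction = %.2f\n", known/n }' /tmp/demo_sites.txt
```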
Best Practices for WGS Analysis
Computational Resources and Performance
Hardware Requirements:
- Minimum 64GB RAM for human WGS analysis (128GB+ recommended)
- 200GB+ temporary disk space per sample
- Multi-core processors significantly reduce runtime
- SSD storage improves I/O performance
Storage Considerations:
- Raw FASTQ files: ~100GB per 30x coverage human genome
- Intermediate BAM files: ~100GB per sample
- Final compressed VCF files: <1GB per sample
- Plan for 3-4x raw data size for intermediate files
Common Pitfalls and Solutions
Memory Issues:
- Increase the Java heap size for GATK: -Xmx32g or higher
- Use scatter-gather parallelization for large chromosomes
- Process samples individually if system resources are limited
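GATK takes JVM flags through its `--java-options` argument. The sketch below sizes the heap from physical RAM; the 80% fraction is a rule of thumb to leave headroom for the OS, not a GATK requirement, and the fallback value is arbitrary:

```shell
# Size the Java heap at ~80% of physical RAM (Linux /proc/meminfo is in kB;
# if it is unavailable, e.g. on macOS, fall back to assuming 40 GB of RAM)
mem_kb=$(awk '/MemTotal/ { print $2 }' /proc/meminfo 2>/dev/null || echo 41943040)
heap_gb=$(echo "$mem_kb" | awk '{ printf "%d", $1 * 0.8 / 1048576 }')
echo "Using -Xmx${heap_gb}g"

gatk --java-options "-Xmx${heap_gb}g" HaplotypeCaller \
    -R ${REFERENCE} \
    -I sample_recalibrated.bam \
    -O sample.g.vcf.gz \
    -ERC GVCF
```

The same `--java-options` flag works for every GATK tool, so the computed heap size can be reused across the pipeline.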
Performance Optimization:
- Utilize multi-threading options in all tools
- Use local SSD storage for temporary files
- Consider cloud computing for large-scale analyses
- Implement parallel processing for multiple samples
Data Quality Issues:
- Verify reference genome version matches your sequencing platform
- Check for unusual insert size distributions or high duplication rates
- Investigate potential sample contamination or batch effects
- Validate problematic regions with IGV visualization
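Reference-version mismatches often surface as contig-name disagreements (e.g. "chr1" vs "1"). A quick consistency check between a BAM header and the reference `.fai` index (file names are illustrative):

```shell
# Compare contig names in the BAM header against the reference .fai index;
# any line printed by comm -3 indicates a name present in only one of them
samtools view -H sample_recalibrated.bam \
    | awk -F'\t' '$1 == "@SQ" { sub(/^SN:/, "", $2); print $2 }' \
    | sort > bam_contigs.txt
cut -f1 ${REFERENCE}.fai | sort > ref_contigs.txt
comm -3 bam_contigs.txt ref_contigs.txt
```

Silence from `comm -3` means the two contig sets agree; any output means the BAM was aligned against a different reference build or naming convention.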
Integration with Other Data Types
Transcriptomic Data Integration:
- Combine with RNA-seq to assess functional impact of variants
- Use allele-specific expression analysis to validate variant calls
- Correlate variants with splicing changes and gene expression levels
Epigenomic Context:
- Integrate with ChIP-seq data to identify regulatory variants
- Use chromatin accessibility data (ATAC-seq) to prioritize variants
- Consider chromatin interaction data (Hi-C) for long-range effects
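As a concrete example of prioritizing by chromatin accessibility, the variant BED file produced earlier can be intersected with ATAC-seq peaks. A sketch (the `atac_peaks.bed` file is a placeholder for your own peak calls):

```shell
# Keep only variants that fall inside ATAC-seq accessible regions;
# -u reports each variant at most once, however many peaks it overlaps
bedtools intersect \
    -a ~/wgs_analysis/results/var/${SAMPLE}/${SAMPLE}_variants.bed \
    -b atac_peaks.bed \
    -u > ${SAMPLE}_variants_in_open_chromatin.bed
```

The same pattern applies to ChIP-seq peak files or any other BED-formatted regulatory annotation.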
Troubleshooting Common Issues
GATK-Specific Problems
VQSR Training Failures:
- Ensure sufficient variant density for machine learning training
- Check resource file compatibility with your reference genome version
- Consider using hard filtering for very small datasets (<1000 variants)
- Verify all resource files have proper indices
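When VQSR cannot be trained, GATK's documented hard filters are the standard fallback. A sketch for SNPs (the thresholds are GATK's recommended starting points; tune them for your data, and note that failing sites are tagged in the FILTER column rather than removed):

```shell
# Select SNPs, then apply GATK's recommended hard filters as a VQSR fallback
gatk SelectVariants -R ${REFERENCE} -V cohort_raw_variants.vcf.gz \
    --select-type-to-include SNP -O cohort_snps.vcf.gz

gatk VariantFiltration -R ${REFERENCE} -V cohort_snps.vcf.gz \
    --filter-expression "QD < 2.0"              --filter-name "QD2" \
    --filter-expression "FS > 60.0"             --filter-name "FS60" \
    --filter-expression "MQ < 40.0"             --filter-name "MQ40" \
    --filter-expression "MQRankSum < -12.5"     --filter-name "MQRankSum-12.5" \
    --filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
    -O cohort_snps_hardfiltered.vcf.gz
```

Indels need their own selection pass and a different threshold set (GATK recommends, e.g., QD < 2.0 and FS > 200.0 for indels), so do not reuse the SNP expressions verbatim.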
Joint Calling Issues:
- Verify GVCF files are properly formatted and indexed
- Check chromosome naming consistency across all input files
- Monitor GenomicsDB workspace size limits for large cohorts
- Ensure sufficient disk space for temporary files
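Before launching GenomicsDBImport, it is cheap to validate that every GVCF listed in the sample map (the tab-separated file built in the joint-calling workflow below) actually exists and is indexed:

```shell
# Check that each GVCF in the sample map exists and has a .tbi index;
# silence means everything was found
while IFS=$'\t' read -r sample gvcf; do
    [ -f "$gvcf" ]     || echo "Missing GVCF for $sample: $gvcf"
    [ -f "$gvcf.tbi" ] || echo "Missing index for $sample: $gvcf.tbi"
done < sample_map.txt
```

Catching a missing file here is far cheaper than having GenomicsDBImport fail hours into a large cohort import.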
Memory and Performance Problems:
- Monitor Java heap size usage and increase as needed
- Use GATK’s built-in progress monitoring to identify bottlenecks
- Consider splitting large chromosomes for parallel processing
- Implement checkpointing for long-running analyses
Data Interpretation Challenges
Variant Filtration:
- Balance sensitivity vs specificity based on your study goals
- Validate filtering strategies with known positive and negative controls
- Consider population-specific allele frequencies for rare variant analysis
- Use orthogonal validation methods for critical variants
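For the population-frequency step, bcftools expressions can drop common variants once the VCF carries an allele-frequency annotation. In the sketch below, `gnomAD_AF` is a placeholder for whatever INFO tag your annotation step actually produced; adjust the name and threshold to your study:

```shell
# Exclude variants with an annotated gnomAD allele frequency >= 0.1%;
# -e drops only records where the expression is true, so sites that lack
# the annotation (novel variants) are retained
bcftools view -e 'INFO/gnomAD_AF >= 0.001' \
    annotated.vcf.gz -Oz -o rare_candidates.vcf.gz
bcftools index -t rare_candidates.vcf.gz
```

Using `-e` (exclude) rather than `-i` (include) is deliberate here: an include expression on a missing tag would silently discard unannotated variants, which is usually the opposite of what a rare-disease analysis wants.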
Clinical Interpretation:
- Focus on clinically actionable genes and pathogenic variants
- Use established variant classification guidelines (ACMG/AMP)
- Establish policies for the return of incidental (secondary) findings
- Ensure appropriate genetic counseling resources are available
Advanced Analysis Considerations
Multi-Sample Joint Calling Workflow
For population studies or family analyses, joint calling provides more accurate variant calls by leveraging information across all samples:
# Create a directory for joint calling
mkdir -p ~/wgs_analysis/results/joint
cd ~/wgs_analysis/results/joint
# Create sample map for all samples in your cohort
echo -e "sample1\t/path/to/sample1.g.vcf.gz" > sample_map.txt
echo -e "sample2\t/path/to/sample2.g.vcf.gz" >> sample_map.txt
echo -e "sample3\t/path/to/sample3.g.vcf.gz" >> sample_map.txt
# Import GVCFs to GenomicsDB workspace
gatk GenomicsDBImport \
--sample-name-map sample_map.txt \
--genomicsdb-workspace-path genomicsdb \
--intervals chr1 \
--reader-threads 4
# Processing one chromosome at a time keeps memory and workspace size manageable for large datasets
# Perform joint genotyping
gatk GenotypeGVCFs \
-R ${REFERENCE} \
-V gendb://genomicsdb \
-O cohort_raw_variants.vcf.gz
# Once you have `cohort_raw_variants.vcf.gz` from joint calling, follow the same Steps 8-11 as in the single-sample analysis.
Joint calling improves accuracy for rare variants by borrowing statistical strength across samples and ensures consistent variant representation across the entire cohort.
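The per-chromosome import suggested in the comments above can be scripted as a simple loop. A sketch (contig names assume a chr-prefixed reference such as GRCh38; each chromosome then gets its own GenotypeGVCFs run against its `gendb://genomicsdb_${chrom}` workspace):

```shell
# Import GVCFs one chromosome at a time to bound memory use and
# keep each GenomicsDB workspace a manageable size
for chrom in chr{1..22} chrX chrY; do
    gatk GenomicsDBImport \
        --sample-name-map sample_map.txt \
        --genomicsdb-workspace-path genomicsdb_${chrom} \
        --intervals ${chrom} \
        --reader-threads 4
done
```

On a cluster, each loop iteration is independent and can be submitted as a separate job, turning a serial import into an embarrassingly parallel one.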
Conclusion
This comprehensive tutorial has guided you through the complete GATK workflow for WGS analysis, from raw sequencing reads to high-quality, annotated variants. You’ve learned to:
- Set up a robust computational environment for genomic analysis
- Perform quality control and preprocessing of sequencing data
- Execute accurate read alignment and post-processing steps
- Implement sophisticated variant calling and filtering approaches
- Apply functional annotation to understand variant consequences
The GATK pipeline represents a mature, well-validated approach that balances sensitivity and specificity in variant detection. While the computational requirements are substantial, the resulting high-quality variant calls provide a solid foundation for downstream analyses in both research and clinical contexts.
Key Takeaways
Quality is Paramount: Every step includes quality control measures to ensure reliable results. Never skip QC steps, as they often reveal issues that could compromise your entire analysis.
Resources Matter: Adequate computational resources (memory, storage, CPU) are essential for efficient WGS analysis. Plan your infrastructure carefully before beginning large-scale projects.
Validation is Essential: Use orthogonal methods to validate critical findings, especially for clinical applications. No computational pipeline is perfect, and validation provides confidence in your results.
Documentation Enables Reproducibility: Comprehensive documentation of parameters, software versions, and methodological choices enables reproducible science and facilitates troubleshooting.
Integration Adds Value: The greatest insights often come from integrating WGS data with other genomic data types (RNA-seq, ChIP-seq, clinical phenotypes) rather than analyzing variants in isolation.
Next Steps in Your Genomic Analysis Journey
With high-quality variants in hand, you’re now ready to explore:
- Population genomics approaches for understanding genetic diversity and evolution
- Pharmacogenomic analysis for personalized medicine applications
- Functional genomics integration with transcriptomic and epigenomic data
- Clinical variant interpretation using established guidelines and databases
- Advanced statistical genetics methods for association studies and heritability analysis
Each of these advanced analyses builds upon the solid foundation you’ve established through this tutorial. The principles of quality control, systematic analysis, and careful validation apply across all areas of genomics research.
Remember that variant calling is just the beginning of genomic analysis. The true value of WGS data emerges when variants are interpreted in their biological context, integrated with other data types, and translated into actionable insights for human health or biological understanding.
Additional Resources
Essential Databases and Tools
- GATK Resource Bundle: Comprehensive collection of reference datasets
- gnomAD: Population allele frequencies for variant interpretation
- ClinVar: Clinical variant significance database
- OMIM: Gene-disease relationship database
- IGV (Integrative Genomics Viewer): Essential for variant visualization
References
- McKenna A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297-303.
- Van der Auwera GA, et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics. 2013;43:11.10.1-33.
- DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43(5):491-8.
- Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754-60.
- Poplin R, Ruano-Rubio V, DePristo MA, et al. Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv. 2017. doi:10.1101/201178.
- Poplin R, et al. A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol. 2018;36(10):983-987.
- Chen X, et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics. 2016;32(8):1220-2.
- Cingolani P, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012;6(2):80-92.
- Li H, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078-9.
- Danecek P, et al. Twelve years of SAMtools and BCFtools. Gigascience. 2021;10(2):giab008.
- Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841-2.
This tutorial is part of the NGS101.com comprehensive guide to next-generation sequencing analysis. If this tutorial helped advance your research, please comment and share your experience to help other researchers! Subscribe to stay updated with our latest bioinformatics tutorials and resources.