Introduction: Understanding Tumor Copy Number Variants
Cancer is fundamentally a disease of genomic instability, where normal cells accumulate mutations that drive uncontrolled growth and metastasis. Among these mutations, somatic copy number alterations (SCNAs) – also known as tumor CNVs – play a pivotal role in cancer initiation, progression, and treatment resistance. This tutorial builds upon our Part 6: germline CNV analysis guide and Part 1 of our WGS analysis series to help you identify and interpret these critical cancer-driving alterations.
What Are Tumor Copy Number Variants?
Tumor copy number variants are somatic alterations where cancer cells gain or lose copies of chromosomal segments during tumorigenesis. Unlike germline CNVs that are present in all cells from birth, tumor CNVs are acquired during cancer development and are typically found only in malignant tissue.
These alterations can range from focal changes affecting single genes to massive chromosomal instability involving entire chromosome arms. The cancer genome often becomes a mosaic of cells with different copy number profiles, creating tumor heterogeneity that complicates both detection and treatment.
Key Characteristics of Tumor CNVs:
- Somatic Origin: Acquired during cancer development, not inherited
- Heterogeneity: Different cancer cells may have different copy number profiles
- Functional Impact: Often affect oncogenes and tumor suppressor genes
- Therapeutic Relevance: Can predict drug response and resistance mechanisms
- Prognostic Value: Copy number burden correlates with patient outcomes
Germline vs. Tumor CNVs: While germline CNVs affect constitutional DNA and are present in all cells, tumor CNVs are somatic alterations found only in cancer cells. Detection requires comparing tumor samples to matched normal controls to distinguish acquired alterations from inherited variants.
The Cancer Genomics Landscape and Copy Number Alterations
Copy number alterations are hallmarks of cancer genomes and contribute to tumorigenesis through multiple mechanisms:
Oncogene Amplification:
- ERBB2/HER2 amplification in breast cancer drives aggressive growth and guides targeted therapy with trastuzumab
- MYC amplification promotes cell proliferation and is found across many cancer types
- EGFR amplification in glioblastoma contributes to treatment resistance
Tumor Suppressor Loss:
- TP53 deletion removes critical cell cycle checkpoints
- CDKN2A/B deletions disable senescence pathways
- RB1 loss disrupts normal growth control
Chromosomal Instability:
- Aneuploidy (abnormal chromosome numbers) is found in the large majority of solid tumors
- Chromothripsis creates localized genomic chaos through chromosome shattering
- Whole-genome doubling followed by loss creates complex copy number landscapes
Clinical Applications:
- Diagnosis: Copy number profiles help classify tumor subtypes (e.g., medulloepithelioma vs. ependymoma)
- Prognosis: Copy number burden often correlates with patient survival
- Treatment Selection: HER2 amplification guides trastuzumab therapy; homologous recombination deficiency predicts PARP inhibitor response
- Resistance Monitoring: Acquired amplifications can drive treatment resistance
Challenges in Tumor CNV Detection
Detecting copy number alterations in tumor samples presents unique challenges compared to germline analysis:
Biological Challenges:
- Tumor Purity: Normal cell contamination dilutes true copy number signals
- Tumor Heterogeneity: Subclonal populations may have different copy number profiles
- Ploidy Variation: Many tumors are not diploid, complicating copy number interpretation
- Stromal Contamination: Infiltrating immune cells and fibroblasts affect copy number estimates
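To make the purity effect concrete, here is a minimal sketch of the usual two-population mixture arithmetic. It assumes normal cells carry two copies and that the sample's overall ploidy is close to diploid; `expected_log2` is a hypothetical helper written for this illustration, not a CNVkit command.

```shell
# expected_log2 PURITY TUMOR_COPIES [NORMAL_COPIES]
# Prints the log2 ratio expected when a fraction PURITY of cells carries
# TUMOR_COPIES copies and the remaining cells carry NORMAL_COPIES (default 2).
expected_log2() {
    awk -v p="$1" -v n="$2" -v d="${3:-2}" 'BEGIN {
        mix = p * n + (1 - p) * d        # average copies per cell in the mixture
        printf "%.3f\n", log(mix / d) / log(2)
    }'
}

# A single-copy loss at three purities (assumed values for illustration)
for purity in 1.0 0.7 0.3; do
    echo "purity=${purity} log2=$(expected_log2 "$purity" 1)"
done
```

A single-copy loss that gives a log2 ratio of -1 in a pure tumor shrinks to roughly -0.62 at 70% purity and to roughly -0.23 at 30% purity, which is why low-purity samples need more depth or larger segments to call reliably.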
Technical Challenges:
- Reference Selection: Requires matched normal samples or carefully selected controls
- Noise vs. Signal: Must distinguish true somatic alterations from technical artifacts
- Clonal Evolution: Copy number may change during disease progression or treatment
- Sample Quality: Degraded FFPE samples or low tumor content affect detection sensitivity
Analytical Challenges:
- Baseline Establishment: Determining the diploid baseline in aneuploid tumors
- Segmentation Sensitivity: Balancing detection of focal events with broad chromosomal changes
- Integration with Other Data: Combining copy number with mutation, expression, and methylation data
Tumor CNV Detection Workflow Overview
The workflow for tumor CNV detection differs from germline analysis in several key respects:
- Sample Preparation: Requires matched tumor-normal pairs or carefully selected normal references
- Purity Assessment: Estimating tumor content and ploidy
- Coverage Calculation: Computing read depth ratios between tumor and normal samples
- Normalization: Correcting for technical biases and systematic effects
- Segmentation: Identifying regions of constant copy number
- Calling: Determining copy number states relative to normal diploid baseline
Setting Up Your Analysis Environment
This tutorial builds directly on our previous WGS analysis tutorials, using the same computational environment and sample data.
Prerequisites
Before proceeding, ensure you have:
- Completed Part 1: From Raw Reads to High-Quality Variants Using GATK
- Reviewed our germline CNV analysis tutorial for CNVkit basics
- Aligned BAM files for tumor and normal samples (normal1, normal2, tumor1, tumor2)
- The conda environment from Part 1 (WGS_env) with CNVkit installed
- Sufficient computational resources (16+ GB RAM recommended for WGS)
Installing Additional Tools for Tumor Analysis
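Activate your existing environment and confirm that the tools used in this tutorial are available:
# Activate the conda environment from previous tutorials
conda activate ~/WGS_env
# Confirm CNVkit and bedtools (needed in Step 6) are on the PATH
cnvkit.py version && bedtools --version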
Sample Data Overview
For this tutorial, we’ll use the sample data from Part 1:
- normal1: First normal/control sample
- normal2: Second normal/control sample
- tumor1: First tumor sample
- tumor2: Second tumor sample
Important Note: Even if these samples are matched tumor-normal pairs in a real study, this tutorial pools all available normal samples into a single CNVkit reference. For read-depth normalization, a pooled reference is generally preferable to a single paired normal because:
- Increased statistical power: More normal samples reduce noise in the reference
- Better bias correction: Technical artifacts are better normalized across multiple samples
- Improved sensitivity: Small copy number changes are easier to detect against a stable reference
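Methodology Clarification: Unlike somatic variant callers that require strict tumor-normal pairing, CNVkit can build one reference from all normal samples, which then serves as the baseline for every tumor. The trade-off is that a pooled reference does not cancel a patient's own germline CNVs; excluding those requires checking the same region in that patient's matched normal.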
Download Additional Reference Files
Tumor CNV analysis benefits from additional annotation resources:
# Create directory for tumor-specific references (separate from germline CNV references)
mkdir -p ~/references/tumor_cnv
cd ~/references/tumor_cnv
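# Download the COSMIC Cancer Gene Census (genes with documented roles in cancer)
# COSMIC downloads require a registered account; without authentication this
# request returns an error page, so export the file from the COSMIC website if needed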
wget -O cancer_gene_census.csv \
"https://cancer.sanger.ac.uk/cosmic/file_download/GRCh38/cosmic/v95/cancer_gene_census.csv"
# Download OncoKB cancer gene list (curated oncogenes and tumor suppressors)
wget -O oncokb_cancer_genes.txt \
"https://www.oncokb.org/api/v1/utils/cancerGeneList.txt"
# Download the full ClinVar VCF (Step 6 extracts the deletion/duplication entries)
wget -O clinvar_cnvs.vcf.gz \
"https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz"
gunzip clinvar_cnvs.vcf.gz
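Because wget saves whatever the server returns, a login page or error message can end up stored under the expected file name. The following sketch checks for that; `looks_valid` is a hypothetical helper, and the HTML test is only a heuristic, not a format validation.

```shell
# looks_valid FILE: succeed only if FILE exists, is non-empty, and does not
# begin with HTML markup (a common sign of a login or error page).
looks_valid() {
    local file="$1"
    [ -s "$file" ] || return 1
    if head -c 512 "$file" | grep -qiE '<!doctype html|<html'; then
        return 1
    fi
    return 0
}

# Report the status of each reference file downloaded above
for f in cancer_gene_census.csv oncokb_cancer_genes.txt clinvar_cnvs.vcf; do
    if looks_valid "$HOME/references/tumor_cnv/$f"; then
        echo "OK     $f"
    else
        echo "CHECK  $f (missing, empty, or an HTML page)"
    fi
done
```

Any file flagged with CHECK should be downloaded again, manually if the source requires a login.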
Tumor CNV Detection Pipeline
Now let’s proceed with the tumor-specific CNV analysis workflow, highlighting the differences from germline analysis.
Key Difference: Building the reference exclusively from normal samples removes technical bias and copy number variation shared by those normals, so deviations in a tumor mostly reflect somatic changes. A pooled reference does not remove a patient's own rare germline CNVs, however: these still appear in the tumor profile and can only be excluded by checking the same region in that patient's matched normal.
Step 1: Preparing Your Example Dataset
For this tutorial, we’ll use the same example data from Part 1. If you’re working with different samples, the workflow remains identical – simply adjust the sample names.
# Create directory structure for tumor CNV analysis
PROJECT_DIR=~/WGS_Project
cd ${PROJECT_DIR}
# Create tumor-specific subdirectories within the existing CNV analysis structure
mkdir -p cnv_analysis/tumor/{coverage,reference,calls,plots,annotation,clinical}
TUMOR_CNV_DIR=${PROJECT_DIR}/cnv_analysis/tumor
# Set variables for the analysis
BAM_DIR=${PROJECT_DIR}/aligned_reads
NORMAL_SAMPLES=("normal1" "normal2")
TUMOR_SAMPLES=("tumor1" "tumor2")
ALL_SAMPLES=("normal1" "normal2" "tumor1" "tumor2")
REFERENCE_GENOME=~/references/hg38.fa
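Before starting the coverage step, it can save time to confirm that every expected BAM file is present. This is a small sketch that assumes the `<sample>_recalibrated.bam` naming used in this tutorial; `missing_inputs` is a hypothetical helper, and the index check is a convenience rather than a CNVkit requirement.

```shell
# missing_inputs BAM_DIR SAMPLE...: print one line per missing BAM or index.
# Prints nothing when all inputs are in place.
missing_inputs() {
    local bam_dir="$1" sample bam
    shift
    for sample in "$@"; do
        bam="${bam_dir}/${sample}_recalibrated.bam"
        [ -s "$bam" ] || echo "missing BAM: $bam"
        # Accept either index naming convention: sample.bam.bai or sample.bai
        [ -s "${bam}.bai" ] || [ -s "${bam%.bam}.bai" ] || echo "missing index: $bam"
    done
}

# Usage with the variables defined above:
#   missing_inputs "$BAM_DIR" "${ALL_SAMPLES[@]}"
```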
Step 2: Create Tumor-Optimized Reference
Unlike germline analysis, tumor CNV detection requires special consideration for reference creation:
#=============================================
# 2.1: Generate coverage profiles for all samples
#=============================================
# Generate coverage for normal samples (these will form our reference)
# targets.bed and antitargets.bed were created in Part 6; copy them into ${TUMOR_CNV_DIR}/reference/ before running
for sample in "${NORMAL_SAMPLES[@]}"; do
cnvkit.py coverage \
"${BAM_DIR}/${sample}_recalibrated.bam" \
${TUMOR_CNV_DIR}/reference/targets.bed \
-o ${TUMOR_CNV_DIR}/coverage/${sample}.targetcoverage.cnn
cnvkit.py coverage \
"${BAM_DIR}/${sample}_recalibrated.bam" \
${TUMOR_CNV_DIR}/reference/antitargets.bed \
-o ${TUMOR_CNV_DIR}/coverage/${sample}.antitargetcoverage.cnn
done
# Generate coverage for tumor samples (these will be compared to reference)
for sample in "${TUMOR_SAMPLES[@]}"; do
cnvkit.py coverage \
"${BAM_DIR}/${sample}_recalibrated.bam" \
${TUMOR_CNV_DIR}/reference/targets.bed \
-o ${TUMOR_CNV_DIR}/coverage/${sample}.targetcoverage.cnn
cnvkit.py coverage \
"${BAM_DIR}/${sample}_recalibrated.bam" \
${TUMOR_CNV_DIR}/reference/antitargets.bed \
-o ${TUMOR_CNV_DIR}/coverage/${sample}.antitargetcoverage.cnn
done
#=============================================
# 2.2: Build normal reference from control samples
#=============================================
cnvkit.py reference \
${TUMOR_CNV_DIR}/coverage/normal*.targetcoverage.cnn \
${TUMOR_CNV_DIR}/coverage/normal*.antitargetcoverage.cnn \
--fasta ${REFERENCE_GENOME} \
-o ${TUMOR_CNV_DIR}/reference/tumor_reference.cnn
Critical Difference from Germline Analysis: For tumor CNV detection, the reference MUST be built exclusively from normal samples. Including tumor samples in the reference would normalize out the very copy number alterations we’re trying to detect.
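A toy calculation shows how much signal is lost when a tumor leaks into the reference. This sketch uses a plain arithmetic mean of depths, which is a simplification: CNVkit combines normals with a more robust weighted estimate. The depths are invented for illustration, and `log2_vs_reference` is a hypothetical helper.

```shell
# log2_vs_reference SAMPLE_DEPTH REF_DEPTH...
# Prints log2(sample depth / mean of the reference depths) for one bin.
log2_vs_reference() {
    local sample_depth="$1"
    shift
    printf '%s\n' "$@" | awk -v s="$sample_depth" '
        { total += $1; n++ }
        END { printf "%.3f\n", log(s / (total / n)) / log(2) }'
}

# One bin: two normals at 30x, a tumor with a single-copy gain at 45x
echo "normals only:      $(log2_vs_reference 45 30 30)"
echo "tumor in the pool: $(log2_vs_reference 45 30 30 45)"
```

With only normals in the reference, the gain appears at about 0.585; adding the tumor itself to the pool pulls the same bin down to about 0.363.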
Step 3: Tumor-Normal Copy Number Analysis
Now we perform the core tumor CNV analysis, comparing each tumor sample to our normal reference:
#=============================================
# 3.1: Calculate copy number ratios for tumor samples
#=============================================
for sample in "${TUMOR_SAMPLES[@]}"; do
# Calculate log2 ratios relative to normal reference
cnvkit.py fix \
${TUMOR_CNV_DIR}/coverage/${sample}.targetcoverage.cnn \
${TUMOR_CNV_DIR}/coverage/${sample}.antitargetcoverage.cnn \
${TUMOR_CNV_DIR}/reference/tumor_reference.cnn \
-o ${TUMOR_CNV_DIR}/calls/${sample}.cnr
done
#=============================================
# 3.2: Segment tumor copy number profiles
#=============================================
for sample in "${TUMOR_SAMPLES[@]}"; do
# Segment the tumor sample using CBS (Circular Binary Segmentation)
# This algorithm is particularly well-suited for tumor samples with noise
cnvkit.py segment \
${TUMOR_CNV_DIR}/calls/${sample}.cnr \
-m cbs \
--smooth-cbs \
-o ${TUMOR_CNV_DIR}/calls/${sample}.cns
done
#=============================================
# 3.3: Call discrete copy number states
#=============================================
for sample in "${TUMOR_SAMPLES[@]}"; do
# Call integer copy number states
# --purity 0.7 and --ploidy 2 are placeholders: replace them with each sample's own estimates
cnvkit.py call \
${TUMOR_CNV_DIR}/calls/${sample}.cns \
-o ${TUMOR_CNV_DIR}/calls/${sample}_called.cns \
--purity 0.7 \
--ploidy 2 \
--drop-low-coverage
done
#=============================================
# 3.4: Quality control and metrics
#=============================================
for sample in "${TUMOR_SAMPLES[@]}"; do
# Calculate basic CNV statistics
cnvkit.py metrics \
${TUMOR_CNV_DIR}/calls/${sample}.cnr \
-s ${TUMOR_CNV_DIR}/calls/${sample}.cns \
> ${TUMOR_CNV_DIR}/calls/${sample}_metrics.txt
done
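The --purity value in Step 3.3 matters because it changes which integer copy number a given log2 ratio implies. The sketch below inverts the standard two-population mixture formula, assuming normal cells carry two copies. It is an illustration of the arithmetic, not CNVkit's implementation, and `implied_copies` is a hypothetical helper.

```shell
# implied_copies LOG2 PURITY [NORMAL_COPIES]
# Prints the tumor-cell copy number implied by an observed log2 ratio.
implied_copies() {
    awk -v r="$1" -v p="$2" -v d="${3:-2}" 'BEGIN {
        observed = d * exp(r * log(2))          # average copies per cell in the mixture
        tumor = (observed - (1 - p) * d) / p    # remove the normal-cell contribution
        if (tumor < 0) tumor = 0                # noise can push the estimate below zero
        printf "%.2f\n", tumor
    }'
}

# The same observed ratio read with and without a purity correction
echo "log2=-0.621 purity=1.0 -> $(implied_copies -0.621 1.0) copies"
echo "log2=-0.621 purity=0.7 -> $(implied_copies -0.621 0.7) copies"
```

Read as a pure sample, a log2 ratio of -0.621 suggests about 1.3 copies, an ambiguous value; at 70% purity it corresponds to a clean single-copy loss.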

Step 4: Tumor-Specific Filtering and Annotation
Tumor CNV filtering focuses on identifying high-confidence somatic alterations. The normal reference applied in Step 3 suppresses technical bias and variation shared by the normals, but patient-specific germline CNVs can remain, so treat the calls below as candidate somatic events:
#=============================================
# 4.1: Filter tumor CNVs for high-confidence calls
#=============================================
mkdir -p ${TUMOR_CNV_DIR}/filtered
for sample in "${TUMOR_SAMPLES[@]}"; do
# Apply tumor-specific thresholds
# These cutoffs are stricter for single-copy gains and losses than CNVkit's
# defaults, so low-amplitude noise is less likely to be called
# Note: write --thresholds with "=" because the list starts with a minus sign
cnvkit.py call \
${TUMOR_CNV_DIR}/calls/${sample}.cns \
-o ${TUMOR_CNV_DIR}/filtered/${sample}_filtered.cns \
--thresholds="-1.0,-0.5,0.5,1.0" \
--purity 0.7 \
--drop-low-coverage
# Filter by minimum segment size (both output files also keep copy-neutral segments)
# _broad.cns keeps segments ≥100kb; _focal.cns keeps segments ≥10kb, so it includes the broad ones too
awk 'NR==1 || ($3-$2) >= 100000' \
${TUMOR_CNV_DIR}/filtered/${sample}_filtered.cns \
> ${TUMOR_CNV_DIR}/filtered/${sample}_broad.cns
awk 'NR==1 || ($3-$2) >= 10000' \
${TUMOR_CNV_DIR}/filtered/${sample}_filtered.cns \
> ${TUMOR_CNV_DIR}/filtered/${sample}_focal.cns
done
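The size filters above keep copy-neutral segments and do not label events. As a complementary sketch, the function below lists only non-neutral segments from a called .cns file and tags each as focal or broad. It assumes the file has a header with `chromosome`, `start`, `end`, and `cn` columns and an autosomal baseline of 2 copies (so sex chromosomes need separate handling); `summarize_segments` and the 100 kb default cutoff are illustrative choices.

```shell
# summarize_segments CALLED_CNS [FOCAL_MAX_BP]
# Prints chromosome, start, end, copy number, and size class for every
# segment whose called copy number differs from 2.
summarize_segments() {
    awk -F '\t' -v OFS='\t' -v focal_max="${2:-100000}" '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header names
        $col["cn"] != 2 {
            size = $col["end"] - $col["start"]
            print $col["chromosome"], $col["start"], $col["end"], $col["cn"],
                  (size < focal_max ? "focal" : "broad")
        }' "$1"
}

# Usage with the files written above:
#   summarize_segments "${TUMOR_CNV_DIR}/filtered/tumor1_filtered.cns"
```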
Step 5: Visualization and Results Interpretation
Tumor CNV visualization requires specialized plots to capture tumor-specific features:
#=============================================
# 5.1: Generate tumor CNV visualization plots
#=============================================
for sample in "${TUMOR_SAMPLES[@]}"; do
# 1. Genome-wide copy number profile
cnvkit.py scatter \
${TUMOR_CNV_DIR}/calls/${sample}.cnr \
-s ${TUMOR_CNV_DIR}/calls/${sample}.cns \
-o ${TUMOR_CNV_DIR}/plots/${sample}_genome_wide.pdf \
--title "Tumor Copy Number Profile: $sample"
# 2. Individual chromosome plots for detailed review
for chr in {1..22} X Y; do
cnvkit.py scatter \
${TUMOR_CNV_DIR}/calls/${sample}.cnr \
-s ${TUMOR_CNV_DIR}/calls/${sample}.cns \
-c chr${chr} \
-o ${TUMOR_CNV_DIR}/plots/${sample}_chr${chr}.pdf \
--title "$sample Chromosome $chr"
done
# 3. Heatmap for copy number states
cnvkit.py heatmap \
${TUMOR_CNV_DIR}/calls/${sample}.cns \
-o ${TUMOR_CNV_DIR}/plots/${sample}_heatmap.pdf
# 4. Chromosome ideogram diagram with gained and lost regions marked
cnvkit.py diagram \
${TUMOR_CNV_DIR}/calls/${sample}.cnr \
-s ${TUMOR_CNV_DIR}/calls/${sample}.cns \
-o ${TUMOR_CNV_DIR}/plots/${sample}_diagram.pdf
done

Step 6: Clinical Interpretation
For tumor CNVs, clinical interpretation focuses on actionable alterations and known pathogenic variants from clinical databases:
#=============================================
# 6.1: Process ClinVar CNV data for clinical annotation
#=============================================
mkdir -p ${TUMOR_CNV_DIR}/clinical
# Note: ClinVar contains mostly germline variants, but can help identify:
# 1. Tumor suppressor genes where germline + somatic hits cause cancer
# 2. Regions with known pathogenic copy number changes
# 3. Genes where CNVs have established clinical significance
# Extract CNV-related entries from ClinVar VCF
# Look for copy number variants using CLNVC (ClinVar Variant Class) field
grep -E "(CLNVC=Deletion|CLNVC=Duplication|CLNVC=copy_number_gain|CLNVC=copy_number_loss)" ~/references/tumor_cnv/clinvar_cnvs.vcf > \
${TUMOR_CNV_DIR}/clinical/clinvar_cnvs_only.vcf
# Convert ClinVar CNVs to BED format for overlap analysis
# (gawk is required: the three-argument match() used below is a GNU extension)
grep -v "^#" ${TUMOR_CNV_DIR}/clinical/clinvar_cnvs_only.vcf | \
gawk 'BEGIN{OFS="\t"} {
# Extract basic coordinates and ensure chromosome has "chr" prefix
chr = $1
if (chr !~ /^chr/) {
chr = "chr" chr
}
start = $2 - 1 # Convert to 0-based for BED
# Determine end position based on variant type
end = start + 1 # default for SNVs
# For indels, calculate size from REF/ALT
ref_len = length($4)
alt_len = length($5)
if (ref_len > alt_len) {
# Deletion - use REF length
end = start + ref_len
} else if (alt_len > ref_len) {
# Insertion/duplication - for BED purposes, use minimal coordinates
end = start + 1
} else {
# Same length (substitution) or complex
end = start + ref_len
}
# Extract ClinVar-specific information
clnvc = "unknown"
if (match($8, /CLNVC=([^;]+)/, arr)) {
clnvc = arr[1]
}
# Extract clinical significance
clnsig = "unknown"
if (match($8, /CLNSIG=([^;]+)/, arr)) {
clnsig = arr[1]
}
# Extract gene info
gene = "unknown"
if (match($8, /GENEINFO=([^;:]+)/, arr)) {
gene = arr[1]
}
# Extract disease/phenotype information
clndn = "unknown"
if (match($8, /CLNDN=([^;]+)/, arr)) {
clndn = arr[1]
}
# Only include if it could be a CNV-like variant
if (clnvc ~ /[Dd]eletion|[Dd]uplication|copy_number/ || ref_len > 10 || alt_len > 10) {
# Print BED entry with ClinVar annotations
print chr, start, end, $3, clnvc, clnsig, gene, clndn
}
}' > ${TUMOR_CNV_DIR}/clinical/clinvar_cnvs.bed
#=============================================
# 6.2: Overlap tumor CNVs with ClinVar pathogenic variants
#=============================================
for sample in "${TUMOR_SAMPLES[@]}"; do
    # Convert our CNV calls to BED format for overlap analysis
    # The log2 column is looked up by name from the header, because
    # "cnvkit.py call" inserts a cn column right after log2
    awk 'BEGIN{FS=OFS="\t"}
    NR==1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    {
        # Ensure chromosome has "chr" prefix to match the ClinVar BED file
        chr = $1
        if (chr !~ /^chr/) {
            chr = "chr" chr
        }
        # Classify alteration type from the segment log2 ratio
        ratio = $col["log2"]
        if (ratio >= 0.5) alteration="gain"
        else if (ratio <= -0.5) alteration="loss"
        else alteration="neutral"
        print chr, $2, $3, $4, alteration, ratio
    }' ${TUMOR_CNV_DIR}/filtered/${sample}_filtered.cns \
    > ${TUMOR_CNV_DIR}/clinical/${sample}_cnvs_for_overlap.bed
    # Find ClinVar deletion/duplication entries that fall within tumor CNV segments
    # All clinical significance levels are reported; filter on CLNSIG (output column 12) afterwards
    # -F 0.5 requires at least half of each ClinVar variant to lie inside the segment
    bedtools intersect \
        -a ${TUMOR_CNV_DIR}/clinical/${sample}_cnvs_for_overlap.bed \
        -b ${TUMOR_CNV_DIR}/clinical/clinvar_cnvs.bed \
        -wa -wb \
        -F 0.5 \
        > ${TUMOR_CNV_DIR}/clinical/${sample}_clinvar_overlaps.bed
done
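Alongside ClinVar, a cancer gene list offers a quick way to prioritize segments. The sketch below assumes a plain text file with one gene symbol per line, which you would need to derive yourself from the Cancer Gene Census or OncoKB downloads, and a called .cns file whose `gene` column holds comma-separated symbols. `altered_cancer_genes` is a hypothetical helper.

```shell
# altered_cancer_genes CALLED_CNS GENE_LIST
# Prints gene, chromosome, start, end, and copy number for every listed gene
# that sits in a segment whose called copy number differs from 2.
altered_cancer_genes() {
    awk -F '\t' -v OFS='\t' '
        NR == FNR { wanted[$1] = 1; next }                         # first file: gene list
        FNR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # second file: header
        $col["cn"] != 2 {
            n = split($col["gene"], genes, ",")
            for (i = 1; i <= n; i++)
                if (genes[i] in wanted)
                    print genes[i], $col["chromosome"], $col["start"], $col["end"], $col["cn"]
        }' "$2" "$1"
}

# Usage (cancer_genes.txt is an assumed file name):
#   altered_cancer_genes "${TUMOR_CNV_DIR}/filtered/tumor1_filtered.cns" cancer_genes.txt
```

The gene column is only populated when the target regions were annotated, so an empty result may simply mean the annotation step was skipped.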

Advanced Analysis and Integration
Multi-Sample Analysis and Cohort Studies
For larger studies involving multiple tumor samples:
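#=============================================
# Cohort-level analysis (when you have multiple samples)
#=============================================
# Create directory for cohort analysis
COHORT_DIR=${TUMOR_CNV_DIR}/cohort_analysis
mkdir -p ${COHORT_DIR}
# List each tumor's segment file explicitly: a tumor*.cns glob would also
# match the *_called.cns files written in Step 3
SEGMENT_FILES=()
for sample in "${TUMOR_SAMPLES[@]}"; do
    SEGMENT_FILES+=("${TUMOR_CNV_DIR}/calls/${sample}.cns")
done
# "cnvkit.py call" and "cnvkit.py genemetrics" each take one sample per run
for sample in "${TUMOR_SAMPLES[@]}"; do
    # --center-at 0 leaves the log2 values unshifted; substitute a measured offset if needed
    cnvkit.py call \
        ${TUMOR_CNV_DIR}/calls/${sample}.cns \
        --center-at 0 \
        -o ${COHORT_DIR}/${sample}_cohort_call.cns
    # Gene-level gains and losses beyond a log2 threshold of 0.2
    cnvkit.py genemetrics \
        ${TUMOR_CNV_DIR}/calls/${sample}.cnr \
        -s ${TUMOR_CNV_DIR}/calls/${sample}.cns \
        -t 0.2 \
        -o ${COHORT_DIR}/${sample}_genemetrics.txt
done
# Generate cohort heatmap
cnvkit.py heatmap \
    "${SEGMENT_FILES[@]}" \
    -o ${COHORT_DIR}/cohort_heatmap.pdf
# Identify recurrent alterations: count how many tumors report each gene
for sample in "${TUMOR_SAMPLES[@]}"; do
    tail -n +2 ${COHORT_DIR}/${sample}_genemetrics.txt | cut -f1 | sort -u
done | sort | uniq -c | sort -rn > ${COHORT_DIR}/recurrent_cnvs.txt
echo "Cohort analysis complete"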
Single-Command Tumor CNV Analysis
For experienced users or production workflows, CNVkit provides a batch command that runs the whole workflow in one step: coverage calculation, reference building, log2 ratio calculation, segmentation, and calling, using CNVkit's default parameters for the chosen sequencing method.
The batch approach is ideal for:
- Production workflows requiring consistent, automated processing
- Large-scale studies processing many tumor-normal pairs
- Experienced users who understand the underlying methodology
- Standardized pipelines needing reproducible results
Complete Workflow Using CNVkit Batch
# Set up directory for batch analysis
mkdir -p ${TUMOR_CNV_DIR}/batch_analysis
cd ${TUMOR_CNV_DIR}/batch_analysis
# Define sample paths
NORMAL_BAMS=("${BAM_DIR}/normal1_recalibrated.bam" "${BAM_DIR}/normal2_recalibrated.bam")
TUMOR_BAMS=("${BAM_DIR}/tumor1_recalibrated.bam" "${BAM_DIR}/tumor2_recalibrated.bam")
# Single command to perform complete tumor CNV analysis
cnvkit.py batch \
"${TUMOR_BAMS[@]}" \
--normal "${NORMAL_BAMS[@]}" \
--fasta ${REFERENCE_GENOME} \
--annotate ~/references/refFlat_hg38.txt \
--output-reference pooled_normal_reference.cnn \
--output-dir results/ \
--method wgs \
--segment-method cbs \
--drop-low-coverage \
--scatter \
--diagram
Tumor-Only Analysis Using CNVkit Batch
For tumor-only analysis, pass --normal without any file names so that CNVkit builds a flat reference, which assumes equal expected coverage in every bin:
# Set up directory for tumor-only batch analysis
mkdir -p ${TUMOR_CNV_DIR}/tumor_only_analysis
cd ${TUMOR_CNV_DIR}/tumor_only_analysis
# Define tumor sample paths
TUMOR_BAMS=("${BAM_DIR}/tumor1_recalibrated.bam" "${BAM_DIR}/tumor2_recalibrated.bam")
# Single command to perform complete tumor-only CNV analysis
# (--normal with no files requests a flat reference)
cnvkit.py batch \
"${TUMOR_BAMS[@]}" \
--normal \
--fasta ${REFERENCE_GENOME} \
--annotate ~/references/refFlat_hg38.txt \
--output-reference flat_reference.cnn \
--output-dir results/ \
--method wgs \
--segment-method cbs \
--drop-low-coverage \
--scatter \
--diagram
Key Considerations For Tumor-Only Analysis:
What You Can Detect:
- Large amplifications (oncogenes like HER2, EGFR, MYC)
- Deep deletions (tumor suppressors like TP53, CDKN2A)
- Broad chromosomal gains/losses
- Clinically actionable alterations
Limitations:
- Cannot distinguish somatic vs. germline – inherited CNVs will appear as “tumor” alterations
- Higher false positive rate – normal population variation may be called as alterations
- Reduced sensitivity – subtle copy number changes harder to detect
- Interpretation challenges – requires more careful filtering and validation
Best Practices for Tumor-Only:
- Focus on large alterations – Use higher thresholds (log2 > 1.0 for gains, < -1.0 for losses)
- Filter with population databases – Remove common germline CNVs using DGV, gnomAD-SV
- Prioritize cancer genes – Focus on known oncogenes and tumor suppressors
- Validate key findings – Use FISH, qPCR, or array-based methods for actionable alterations
- Consider clinical context – Integrate with pathology, immunohistochemistry, and clinical presentation
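The first of these practices is simple to apply to a segment file. A minimal sketch, assuming a tab-separated .cns file with a `log2` column in its header; `high_amplitude_segments` is a hypothetical helper and the 1.0 default mirrors the threshold suggested above.

```shell
# high_amplitude_segments CNS_FILE [CUTOFF]
# Prints the header plus every segment with |log2| greater than CUTOFF (default 1.0).
high_amplitude_segments() {
    awk -F '\t' -v OFS='\t' -v cutoff="${2:-1.0}" '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; print; next }   # keep header
        { v = $col["log2"] + 0 }
        v > cutoff || v < -cutoff' "$1"
}

# Usage (results/ is the batch output directory; the file name is an example):
#   high_amplitude_segments results/tumor1_recalibrated.cns > tumor1_high_amplitude.cns
```

Segments that survive this cut are still candidates only: common germline CNVs can reach the same amplitude, so the population database filter remains necessary.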
Tumor-only analysis is particularly useful for:
- Biomarker testing for targeted therapy selection
- Clinical trials enrollment based on copy number alterations
- Retrospective studies using archived FFPE samples
- Resource-limited settings where matched normals aren’t feasible
Best Practices for Tumor CNV Analysis
Tumor CNV analysis requires specialized considerations beyond those for germline variants:
Experimental Design for Tumor Studies
Sample Collection:
- Normal Control Samples: Include high-quality normal tissue samples for reference building (blood, adjacent normal tissue, or other confirmed normal samples)
- Tumor Purity Assessment: Ensure tumor samples have >50% cancer cells for reliable detection
- Fresh vs. FFPE: Fresh-frozen samples provide higher quality data than FFPE, but both are analyzable
- Multiple Regions: Consider sampling multiple tumor regions to assess intratumor heterogeneity
When Tumor-Normal Pairing Matters:
- Individual patient analysis: For clinical reporting, knowing which normal belongs to which tumor patient is important for interpretation
- Germline variant filtering: Patient-specific germline variants can be filtered using matched normal
- Contamination assessment: Cross-sample contamination is easier to detect with known pairs
When Pooled Normal References Are Better (CNVkit approach):
- Copy number detection: Pooling multiple normals creates a more robust baseline
- Technical bias correction: Systematic artifacts are better normalized across multiple samples
- Statistical power: More normal samples improve the signal-to-noise ratio
Sequencing Considerations:
- Coverage Depth: Minimum 60-80x for tumor samples, 30-40x for normals
- Paired-End Sequencing: Recommended for breakpoint resolution and structural variant detection
- Insert Size: 350-500bp provides optimal coverage uniformity
- Quality Metrics: Monitor DNA integrity, library complexity, and sequencing quality
Computational Best Practices
Reference Building:
- Pure Normal Samples: Use only confirmed normal samples in reference construction
- Population Matching: Match tumor and normal samples by ancestry when possible
- Batch Correction: Process tumor-normal pairs together to minimize batch effects
- Reference Updates: Periodically update references as more normal samples become available
Quality Control Checkpoints:
- Purity Estimation: Validate tumor purity estimates with pathology review
- Sex Chromosome Analysis: Verify expected sex chromosome copy number
- Contamination Assessment: Check for cross-sample contamination
- Technical Artifacts: Filter regions prone to false positives (centromeres, segmental duplications)
Common Pitfalls in Tumor CNV Analysis
Technical Pitfalls:
- Contaminated References: Including tumor samples in normal reference panels
- Ignoring Tumor Purity: Not accounting for normal cell contamination in interpretation
- Batch Effects: Processing tumor and normal samples in different batches
- Over-segmentation: Using segmentation parameters that create too many small segments
Biological Pitfalls:
- Assuming Diploidy: Many tumors are aneuploid with complex ploidy states
- Ignoring Heterogeneity: Assuming all tumor cells have identical copy number profiles
- Temporal Changes: Not accounting for copy number evolution during treatment
- Germline Contamination: Misinterpreting inherited CNVs as somatic alterations
Clinical Pitfalls:
- Over-interpretation: Calling variants actionable without sufficient evidence
- Ignoring Context: Not considering tumor type-specific alteration patterns
- Single Time Point: Not monitoring copy number changes during treatment
- Poor Communication: Inadequate explanation of results to clinical teams
Troubleshooting Common Issues
Low Signal-to-Noise Ratio
Problem: CNV signals are weak or noisy, making reliable detection difficult.
Solutions:
- Increase sequencing depth (aim for 80-100x for problematic samples)
- Improve tumor purity through macro-dissection or laser capture microdissection
- Use larger bin sizes to reduce noise at the cost of resolution
- Consider alternative normalization strategies (GC correction, mappability filtering)
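Before choosing among these remedies, it helps to put a number on how noisy a sample is. The sketch below computes the median absolute difference between the log2 values of neighbouring bins in a .cnr file. This is a rough stand-in chosen for illustration, not one of the statistics reported by cnvkit.py metrics, and `adjacent_bin_noise` is a hypothetical helper that assumes `chromosome` and `log2` columns in the header.

```shell
# adjacent_bin_noise CNR_FILE
# Prints the median absolute log2 difference between consecutive bins
# on the same chromosome; fails if the file has no such pairs.
adjacent_bin_noise() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header names
        $col["chromosome"] == prev_chr {
            d = $col["log2"] - prev_log2
            printf "%.6f\n", (d < 0 ? -d : d)
        }
        { prev_chr = $col["chromosome"]; prev_log2 = $col["log2"] }' "$1" |
    sort -n |
    awk '{ v[NR] = $1 }
         END {
             if (NR == 0) exit 1
             m = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
             printf "%.3f\n", m
         }'
}

# Usage:
#   adjacent_bin_noise "${TUMOR_CNV_DIR}/calls/tumor1.cnr"
```

Comparing this value across samples processed the same way gives a quick ranking of which ones are likely to need deeper sequencing or larger bins.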
Ploidy and Purity Estimation Challenges
Problem: Difficulty determining tumor ploidy and purity affecting copy number interpretation.
Solutions:
- Use orthogonal methods (flow cytometry, FISH) to estimate ploidy
- Employ allele-specific copy number analysis (ASCAT, ABSOLUTE)
- Integrate with mutation data to estimate sample purity
- Consider using multiple purity/ploidy scenarios in interpretation
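The last suggestion can be tried quickly on a single segment. The sketch below prints the tumor copy number implied by one observed log2 ratio under several candidate purities, using the two-population mixture formula with normal cells assumed to carry two copies. `purity_scenarios` is a hypothetical helper and the purity values are examples.

```shell
# purity_scenarios LOG2 PURITY...
# For each candidate purity, prints the tumor-cell copy number implied by LOG2.
purity_scenarios() {
    local log2="$1" purity
    shift
    for purity in "$@"; do
        awk -v r="$log2" -v p="$purity" 'BEGIN {
            copies = (2 * exp(r * log(2)) - 2 * (1 - p)) / p   # mixture formula solved for the tumor
            if (copies < 0) copies = 0                         # clamp estimates pushed below zero
            printf "purity=%s copies=%.2f\n", p, copies
        }'
    done
}

# One segment at log2 = -0.5 under three purity assumptions
purity_scenarios -0.5 0.4 0.6 0.8
```

In this example the segment looks like a clonal single-copy loss at 60% purity, but like a subclonal loss at 80% and something between one and zero copies at 40%, which is why an independent purity estimate carries so much weight.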
High Tumor Heterogeneity
Problem: Different tumor regions show different copy number profiles.
Solutions:
- Sample multiple tumor regions to capture heterogeneity
- Use single-cell copy number analysis for detailed characterization
- Focus on early, clonal alterations present across all regions
- Consider bulk tissue analysis as average of heterogeneous populations
Germline vs. Somatic Distinction
Problem: Difficulty distinguishing inherited from acquired copy number alterations.
Solutions:
- Always include matched normal tissue controls
- Use population databases (DGV, gnomAD-SV) to filter common variants
- Validate suspicious germline alterations in independent normal tissue
- Consider constitutional CNV testing for patients with multiple rare variants
References and Further Reading
- Talevich E, Shain AH, Botton T, Bastian BC. (2016). CNVkit: Genome-wide copy number detection and visualization from targeted DNA sequencing. PLOS Computational Biology, 12(4): e1004873.
- Zack TI, Schumacher SE, Carter SL, et al. (2013). Pan-cancer patterns of somatic copy number alteration. Nature Genetics, 45(10): 1134-1140.
- Cancer Genome Atlas Research Network. (2013). The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 45(10): 1113-1120.
- Riggs ER, Andersen EF, Cherry AM, et al. (2020). Technical standards for the interpretation and reporting of constitutional copy-number variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics (ACMG) and the Clinical Genome Resource (ClinGen). Genetics in Medicine, 22(2): 245-257.
- Carter SL, Cibulskis K, Helman E, et al. (2012). Absolute quantification of somatic DNA alterations in human cancer. Nature Biotechnology, 30(5): 413-421.
- Van Loo P, Nordgard SH, Lingjærde OC, et al. (2010). Allele-specific copy number analysis of tumors. Proceedings of the National Academy of Sciences, 107(39): 16910-16915.
- Venkatraman ES, Olshen AB. (2007). A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics, 23(6): 657-663.
- Wolff AC, Hammond MEH, Allison KH, et al. (2018). Human epidermal growth factor receptor 2 testing in breast cancer: American Society of Clinical Oncology/College of American Pathologists clinical practice guideline focused update. Journal of Clinical Oncology, 36(20): 2105-2122.
- Lord CJ, Ashworth A. (2017). PARP inhibitors: Synthetic lethality in the clinic. Science, 355(6330): 1152-1158.
- Sondka Z, Bamford S, Cole CG, et al. (2018). The COSMIC Cancer Gene Census: describing genetic dysfunction across all human cancers. Nature Reviews Cancer, 18(11): 696-705.
- Lasolle H, Elsensohn MH, Wierinckx A, et al. (2020). Chromosomal instability in the prediction of pituitary neuroendocrine tumors prognosis. Acta Neuropathologica Communications, 8: 190. https://doi.org/10.1186/s40478-020-01067-5
This tutorial is part of the NGS101.com series on whole genome sequencing analysis.