How To Download Gene Expression Data From TCGA and GTEx Using R: A Complete Guide for Cancer Genomics Research

How To Download Gene Expression Data From TCGA and GTEx Using R: A Complete Guide for Cancer Genomics Research

By

Lei

A comprehensive step-by-step tutorial for accessing and analyzing large-scale genomic datasets

Introduction: Unlocking the Power of Public Cancer Genomics Data

Cancer research has been revolutionized by the availability of large-scale, high-quality genomic datasets. Two of the most valuable resources for researchers worldwide are The Cancer Genome Atlas (TCGA) and the Genotype-Tissue Expression (GTEx) project. This tutorial will guide you through the process of downloading and analyzing gene expression data from these databases using R, opening doors to powerful comparative analyses between cancer and normal tissues.

What Are TCGA and GTEx Databases?

The Cancer Genome Atlas (TCGA) represents one of the most ambitious cancer genomics projects ever undertaken. Launched by the National Cancer Institute and the National Human Genome Research Institute, TCGA has systematically characterized the molecular alterations across more than 30 different cancer types. The project has generated comprehensive genomic profiles from over 11,000 tumor samples, creating an unprecedented resource for understanding cancer biology.

The Genotype-Tissue Expression (GTEx) project complements TCGA by providing a comprehensive atlas of gene expression and regulation across diverse human tissues and individuals. GTEx has collected samples from 54 non-diseased tissue sites across nearly 1,000 individuals, establishing the largest resource to date for studying human gene expression variation across tissues and individuals.

Why These Databases Matter: Together, TCGA and GTEx provide researchers with the ability to compare cancer samples against appropriate normal tissue controls, enabling powerful differential expression analyses that can reveal cancer-specific molecular signatures.

Types of Data Hosted by These Databases

Both databases host multiple types of molecular data:

TCGA Data Types:

  • RNA Sequencing (RNA-seq): Gene expression quantification and transcript-level data
  • Whole Exome/Genome Sequencing: Mutation profiles and structural variations
  • DNA Methylation: Epigenetic modifications across the genome
  • Copy Number Variation: Chromosomal gains and losses
  • MicroRNA Sequencing: Small RNA expression profiles
  • Clinical Data: Patient demographics, treatment history, and outcomes

GTEx Data Types:

  • RNA Sequencing: Gene and transcript expression across tissues
  • Whole Genome Sequencing: Genetic variants and their associations with expression
  • Histology Images: Tissue morphology for quality assessment
  • Quantitative Trait Loci (eQTL) Data: Genetic variants affecting gene expression

Sample Types and Tissue Coverage

TCGA Sample Categories:

  • Primary Tumor: Fresh tumor tissue from the primary cancer site
  • Solid Tissue Normal: Adjacent normal tissue from the same patient
  • Blood Derived Normal: Normal blood samples for germline comparisons
  • Recurrent Tumor: Tumor samples collected after initial treatment
  • Metastatic: Secondary tumor sites in advanced disease cases

GTEx Tissue Types:
The GTEx project encompasses 54 tissue types across major organ systems:

  • Brain tissues: 13 different brain regions
  • Cardiovascular: Heart, arteries, and related tissues
  • Digestive system: Colon, stomach, liver, pancreas, and others
  • Endocrine: Thyroid, adrenal gland, pituitary
  • Reproductive: Ovary, testis, prostate, breast, uterus
  • Muscle and connective tissue: Skeletal muscle, skin, fat
  • Blood and immune: Whole blood, spleen, lymph nodes

Research Applications and Analysis Opportunities

The combination of TCGA and GTEx data enables several powerful research approaches:

Differential Expression Analysis:

  • Compare tumor vs. normal tissue expression patterns
  • Identify cancer-specific biomarkers and therapeutic targets
  • Understand tissue-specific expression differences

Pathway and Functional Analysis:

  • Discover disrupted biological pathways in cancer
  • Identify co-expressed gene modules and regulatory networks
  • Characterize immune infiltration patterns in tumors

Survival and Clinical Association Studies:

  • Correlate gene expression with patient outcomes
  • Identify prognostic signatures for risk stratification
  • Discover predictive biomarkers for treatment response

Multi-omics Integration:

  • Combine expression data with mutation, methylation, and copy number data
  • Understand the molecular mechanisms driving cancer development
  • Identify druggable targets and resistance mechanisms

Pan-cancer Analyses:

  • Compare molecular patterns across different cancer types
  • Identify common and unique features of various cancers
  • Discover universal cancer signatures and tissue-specific alterations

Setting Up Your Analysis Environment

Before diving into data download and analysis, we need to set up our R environment with the necessary packages. If you’re new to R environment setup, I recommend checking out my previous tutorial on RNA-seq analysis which covers the basics in detail.

Required Package Installation

We’ll install packages from both CRAN and Bioconductor. For beginners, here’s what each package does:

#-----------------------------------------------
# STEP 1: Install required packages for TCGA/GTEx analysis
#-----------------------------------------------

# Install BiocManager (manages bioinformatics packages)
if (!require("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

# Separate packages by source for proper installation
cran_packages <- c(
    "data.table",          # Fast data manipulation
    "ggplot2"              # Making plots
)

bioconductor_packages <- c(
    # Data access
    "TCGAbiolinks",        # Downloads TCGA data
    "recount3",            # Downloads GTEx data  
    "SummarizedExperiment", # Handles genomic data format

    # Analysis packages
    "limma",               # Differential expression analysis
    "edgeR",               # RNA-seq analysis tools
    "sva",                 # Batch effect correction

    # Visualization
    "pheatmap",            # Heatmaps
    "EnhancedVolcano",     # Volcano plots

    # Gene annotation
    "org.Hs.eg.db"        # Human gene information
)

# Install CRAN packages first
install.packages(cran_packages)

# Install Bioconductor packages
BiocManager::install(bioconductor_packages, ask = FALSE, update = FALSE)

# Load the packages we'll use
library(TCGAbiolinks)
library(recount3)
library(SummarizedExperiment)
library(data.table)
library(limma)
library(edgeR)

Setup Working Directory

Let’s create a clean workspace for our analysis:

#-----------------------------------------------
# STEP 2: Create project directory structure
#-----------------------------------------------

# Create main project folder
project_dir <- "~/TCGA_GTEx_Analysis"
dir.create(project_dir, recursive = TRUE, showWarnings = FALSE)

# Create subfolders
dir.create(file.path(project_dir, "data"), showWarnings = FALSE)
dir.create(file.path(project_dir, "results"), showWarnings = FALSE)
dir.create(file.path(project_dir, "plots"), showWarnings = FALSE)

# Set working directory
setwd(project_dir)

Downloading Gene Expression Data from TCGA

The Cancer Genome Atlas provides a wealth of cancer genomics data. We’ll use the TCGAbiolinks package to access and download TCGA data systematically.

Exploring Available TCGA Datasets

Let’s see what cancer types are available in TCGA:

#-----------------------------------------------
# STEP 3: Explore available TCGA datasets
#-----------------------------------------------

# Get all TCGA cancer projects
all_projects <- getGDCprojects()

# Filter for TCGA projects (they start with "TCGA-")
tcga_projects <- all_projects[grepl("^TCGA-", all_projects$project_id), ]

# Show available cancer types
print(tcga_projects[, c("project_id", "name")])

# For this tutorial, we'll use Breast Cancer (TCGA-BRCA)

#    project_id                                                             name
# 5   TCGA-BRCA                                        Breast Invasive Carcinoma
# 6   TCGA-UCEC                             Uterine Corpus Endometrial Carcinoma
# 42  TCGA-LAML                                           Acute Myeloid Leukemia
# 44  TCGA-CHOL                                               Cholangiocarcinoma
# 50   TCGA-ACC                                         Adrenocortical Carcinoma
# 51  TCGA-DLBC                  Lymphoid Neoplasm Diffuse Large B-cell Lymphoma
# 52  TCGA-CESC Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma
# 53  TCGA-ESCA                                             Esophageal Carcinoma
# 58  TCGA-KICH                                               Kidney Chromophobe
# 60  TCGA-HNSC                            Head and Neck Squamous Cell Carcinoma
# 61  TCGA-COAD                                             Colon Adenocarcinoma
# 62  TCGA-BLCA                                     Bladder Urothelial Carcinoma
# 64  TCGA-KIRP                            Kidney Renal Papillary Cell Carcinoma
# 65   TCGA-GBM                                          Glioblastoma Multiforme
# 66  TCGA-MESO                                                     Mesothelioma
# 67  TCGA-SARC                                                          Sarcoma
# 68  TCGA-PCPG                               Pheochromocytoma and Paraganglioma
# 69  TCGA-READ                                            Rectum Adenocarcinoma
# 70  TCGA-LUAD                                              Lung Adenocarcinoma
# 71  TCGA-LUSC                                     Lung Squamous Cell Carcinoma
# 72  TCGA-PAAD                                        Pancreatic Adenocarcinoma
# 73  TCGA-LIHC                                   Liver Hepatocellular Carcinoma
# 74  TCGA-KIRC                                Kidney Renal Clear Cell Carcinoma
# 75   TCGA-LGG                                         Brain Lower Grade Glioma
# 76  TCGA-SKCM                                          Skin Cutaneous Melanoma
# 77  TCGA-PRAD                                          Prostate Adenocarcinoma
# 78  TCGA-STAD                                           Stomach Adenocarcinoma
# 79    TCGA-OV                                Ovarian Serous Cystadenocarcinoma
# 80  TCGA-THYM                                                          Thymoma
# 81   TCGA-UCS                                           Uterine Carcinosarcoma
# 82   TCGA-UVM                                                   Uveal Melanoma
# 83  TCGA-TGCT                                      Testicular Germ Cell Tumors
# 85  TCGA-THCA                                                Thyroid Carcinoma

Understanding Sample Types

Before downloading, let’s see what sample types are available:

#-----------------------------------------------
# STEP 4: Check available samples for TCGA-BRCA
#-----------------------------------------------

# Query what data is available for breast cancer
brca_query <- GDCquery(
    project = "TCGA-BRCA",                    # Breast cancer project
    data.category = "Transcriptome Profiling", # RNA-seq data
    data.type = "Gene Expression Quantification",
    workflow.type = "STAR - Counts"           # Count data (not FPKM)
)

# Get sample information
sample_info <- getResults(brca_query)

# Get cancer subtypes
subtype_info <- TCGAquery_subtype(tumor = "BRCA")

table(subtype_info$BRCA_Subtype_PAM50)

 # Basal   Her2   LumA   LumB     NA Normal 
 #   192     82    562    209      2     40 

# Available subtype information for cancers
# "ACC", "BRCA", "BLCA", "CESC", "CHOL", "COAD", "ESCA", "GBM"
# "HNSC", "KICH", "KIRC", "KIRP", "LGG", "LIHC", "LUAD", "LUSC"
# "PAAD", "PCPG", "PRAD", "READ", "SKCM", "SARC", "STAD", "THCA"
# "UCEC", "UCS", "UVM"

# Check sample types available
sample_types <- table(sample_info$sample_type)
print(sample_types)

#          Metastatic    7
#       Primary Tumor 1111
# Solid Tissue Normal  113

Downloading TCGA Gene Expression Data

Now let’s download the actual data. We’ll focus on tumor and normal samples:

#-----------------------------------------------
# STEP 5: Download TCGA breast cancer data
#-----------------------------------------------

# We want both tumor and normal samples
target_samples <- c("Primary Tumor", "Solid Tissue Normal")

# Create focused query
brca_query_filtered <- GDCquery(
    project = "TCGA-BRCA",
    data.category = "Transcriptome Profiling",
    data.type = "Gene Expression Quantification", 
    workflow.type = "STAR - Counts",
    sample.type = target_samples
)

# Check how many samples we'll download
filtered_samples <- getResults(brca_query_filtered)
print(table(filtered_samples$sample_type))

# Download the data (this takes several minutes)
GDCdownload(brca_query_filtered, directory = "data/TCGA_BRCA")

# Process the downloaded data into R format
tcga_data <- GDCprepare(
    query = brca_query_filtered,
    directory = "data/TCGA_BRCA"
)

cat(paste("Downloaded", ncol(tcga_data), "samples with", nrow(tcga_data), "genes\n"))

# Downloaded 1224 samples with 60660 genes

Examine and Clean TCGA Data

Let’s look at our data structure and clean it up:

#-----------------------------------------------
# STEP 6: Clean and examine TCGA data
#-----------------------------------------------

# Get the expression matrix (genes x samples)
tcga_expression <- assay(tcga_data, "unstranded")  # Raw counts
tcga_samples <- colData(tcga_data)                 # Sample information
tcga_genes <- rowData(tcga_data)                   # Gene information

# Check sample distribution
sample_dist <- table(tcga_samples$sample_type)
print(sample_dist)

#       Primary Tumor 1111
# Solid Tissue Normal  113

# Keep only protein-coding genes (reduces noise)
if ("gene_type" %in% colnames(tcga_genes)) {
    protein_coding <- tcga_genes$gene_type == "protein_coding"
    tcga_data_clean <- tcga_data[protein_coding, ]
} else {
    tcga_data_clean <- tcga_data
}

# Save processed data
saveRDS(tcga_data_clean, "data/tcga_brca_processed.rds")

Saving TCGA Data

Let’s save our processed data for future use:

#-----------------------------------------------
# STEP 7: Save TCGA data in multiple formats
#-----------------------------------------------

# Also save as text files for use in other programs
tcga_expression_clean <- assay(tcga_data_clean, "unstranded")
tcga_samples_clean <- as.data.frame(colData(tcga_data_clean))
tcga_genes_clean <- as.data.frame(rowData(tcga_data_clean))

# Save expression matrix with gene names
expression_table <- data.table(
    gene_id = rownames(tcga_expression_clean),
    gene_name = tcga_genes_clean$gene_name,
    tcga_expression_clean
)
fwrite(expression_table, "data/tcga_brca_expression.tsv", sep = "\t")

# Save sample information
tcga_samples_clean$sample_id <- rownames(tcga_samples_clean)

# Convert list columns to character (more concise approach)
tcga_samples_clean[, sapply(tcga_samples_clean, is.list)] <- lapply(tcga_samples_clean[, sapply(tcga_samples_clean, is.list)], as.character)

fwrite(tcga_samples_clean, "data/tcga_brca_samples.tsv", sep = "\t")

Downloading Gene Expression Data from GTEx

The GTEx project provides normal tissue gene expression data that serves as an excellent control for cancer studies. We’ll use the recount3 package to access GTEx data.

Exploring Available GTEx Datasets

GTEx provides normal tissue expression data. Let’s see what’s available:

#-----------------------------------------------
# STEP 8: Explore available GTEx datasets
#-----------------------------------------------

# Check what projects are available in recount3
available_data <- available_projects()

# Look for GTEx data
gtex_data <- available_data[available_data$file_source == "gtex", ]
print(gtex_data[, c("project", "organism", "file_source")])

#         project organism file_source
#  ADIPOSE_TISSUE    human        gtex
#          MUSCLE    human        gtex
#    BLOOD_VESSEL    human        gtex
#           HEART    human        gtex
#           OVARY    human        gtex
#          UTERUS    human        gtex
#          VAGINA    human        gtex
#          BREAST    human        gtex

Download GTEx Sample Information

Let’s first get the sample information to see what tissues are available:

#-----------------------------------------------
# STEP 9: Get GTEx sample information
#-----------------------------------------------

# Create GTEx data object (this is lightweight - just gets metadata)
gtex_rse <- create_rse_manual(
    project = "BREAST",
    project_home = "data_sources/gtex", 
    organism = "human",
    annotation = "gencode_v26",
    type = "gene"
)

# Get sample information
gtex_samples <- colData(gtex_rse)

# See what tissue types are available
table(gtex_samples$gtex.smtsd)

# Breast - Mammary Tissue   482

Download GTEx Gene Expression Data

Now let’s download the actual expression data:

#-----------------------------------------------
# STEP 10: Download GTEx expression data
#-----------------------------------------------

# Get the actual expression data (this triggers download)
gtex_expression <- assay(gtex_rse, "raw_counts")
gtex_samples_filtered <- colData(gtex_rse)
gtex_genes <- rowData(gtex_rse)

Process and Clean GTEx Data

Let’s clean the GTEx data to match our TCGA processing:

#-----------------------------------------------
# STEP 11: Process GTEx data
#-----------------------------------------------

# Filter for protein-coding genes (to match TCGA)
gtex_gene_df <- as.data.frame(gtex_genes)
if ("gene_type" %in% colnames(gtex_gene_df)) {
    protein_coding <- gtex_gene_df$gene_type == "protein_coding"
    gtex_clean <- gtex_rse[protein_coding, ]
    cat(paste("Kept", sum(protein_coding), "protein-coding genes\n"))
} else {
    gtex_clean <- gtex_rse
}

# Focus on breast tissue samples for comparison with TCGA
breast_mask <- gtex_clean$gtex.smtsd == "Breast - Mammary Tissue"
gtex_breast <- gtex_clean[, breast_mask]

# Save processed GTEx data
saveRDS(gtex_clean, "data/gtex_processed.rds")
saveRDS(gtex_breast, "data/gtex_breast.rds")

Save GTEx Data

Let’s save the processed GTEx data:

#-----------------------------------------------
# STEP 12: Save GTEx data
#-----------------------------------------------

# Get clean data for saving
gtex_expression_clean <- assay(gtex_clean, "raw_counts")  # Use raw_counts assay
gtex_samples_clean <- as.data.frame(colData(gtex_clean))
gtex_genes_clean <- as.data.frame(rowData(gtex_clean))

# Save expression matrix with gene information
gtex_table <- data.table(
    gene_id = rownames(gtex_expression_clean),
    gene_name = gtex_genes_clean$gene_name,
    gtex_expression_clean
)
fwrite(gtex_table, "data/gtex_breast_expression.tsv", sep = "\t")

# Save sample information (handle list columns)
gtex_samples_clean$sample_id <- rownames(gtex_samples_clean)
gtex_samples_clean[, sapply(gtex_samples_clean, is.list)] <- 
    lapply(gtex_samples_clean[, sapply(gtex_samples_clean, is.list)], as.character)
fwrite(gtex_samples_clean, "data/gtex_breast_samples.tsv", sep = "\t")

Performing Differential Expression Analysis Between Tumor and Normal Tissues

Now we’ll combine TCGA tumor data with GTEx normal tissue data to find genes that are different between cancer and normal breast tissue.

Prepare Data for Analysis

First, let’s load and prepare our data:

#-----------------------------------------------
# STEP 13: Prepare data for differential expression analysis
#-----------------------------------------------

# Load our processed data
tcga_data <- readRDS("data/tcga_brca_processed.rds")
gtex_breast <- readRDS("data/gtex_breast.rds")

# Get expression matrices
tcga_expression <- assay(tcga_data, "unstranded")  # TCGA counts
gtex_expression <- assay(gtex_breast, "raw_counts")    # GTEx counts

# Remove genes that are not expressed in both datasets
tcga_expression <- tcga_expression[rowSums(tcga_expression > 0) >= ncol(tcga_expression), ]

gtex_expression <- gtex_expression[rowSums(gtex_expression > 0) >= ncol(gtex_expression), ]

# Get sample information
tcga_samples <- colData(tcga_data)
gtex_samples <- colData(gtex_breast)

# Get gene information
tcga_gene <- rowData(tcga_data)[rownames(tcga_expression), ]
gtex_gene <- rowData(gtex_breast)[rownames(gtex_expression), ]

rownames(tcga_gene) <- gsub("\\.\\d+$", "", rownames(tcga_gene))
rownames(gtex_gene) <- gsub("\\.\\d+$", "", rownames(gtex_gene))

# Find common genes between datasets
common_genes <- intersect(rownames(tcga_gene), rownames(gtex_gene))

tcga_gene_common <- tcga_gene[common_genes, ]
gtex_gene_common <- gtex_gene[common_genes, ]

# Keep only common genes
rownames(tcga_expression) <- gsub("\\.\\d+$", "", rownames(tcga_expression))
rownames(gtex_expression) <- gsub("\\.\\d+$", "", rownames(gtex_expression))

tcga_expression <- tcga_expression[rownames(tcga_gene_common), ]
gtex_expression <- gtex_expression[rownames(gtex_gene_common), ]

# Focus on TCGA tumor samples only
tumor_samples <- tcga_samples$sample_type == "Primary Tumor"
tcga_tumor_expression <- tcga_expression[, tumor_samples]

Combine Datasets and Handle Batch Effects

This is the tricky part – combining data from different studies:

#-----------------------------------------------
# STEP 14: Combine datasets and address batch effects
#-----------------------------------------------

# Combine expression matrices
combined_expression <- cbind(tcga_tumor_expression, gtex_expression)

# Create combined sample information
# Use data.table for efficient data manipulation
tcga_info <- data.table(
    sample_id = colnames(tcga_tumor_expression),
    dataset = "TCGA",
    condition = "Tumor"
)

gtex_info <- data.table(
    sample_id = colnames(gtex_expression),
    dataset = "GTEx", 
    condition = "Normal"
)

combined_samples <- rbind(tcga_info, gtex_info)

print(table(combined_samples$dataset, combined_samples$condition))

#       Normal Tumor
#  GTEx    482     0
#  TCGA      0  1111

# Filter out very low expressed genes
# Keep genes with at least 10 counts in at least 10% of samples
min_samples <- round(0.1 * ncol(combined_expression))
keep_genes <- rowSums(combined_expression > 10) >= min_samples
filtered_expression <- combined_expression[keep_genes, ]

Important Note About Batch Effects: When combining TCGA and GTEx data, we face a major challenge called “confounding.” All tumor samples come from TCGA and all normal samples come from GTEx. This means we can’t tell if differences are due to cancer or just differences between the two studies. We’ll proceed with caution and focus on large effect sizes.

Perform Differential Expression Analysis

Now let’s find genes that are different between tumor and normal:

#-----------------------------------------------
# STEP 15: Differential expression analysis with limma
#-----------------------------------------------

# Load required packages
library(limma)
library(edgeR)

# Create DGEList object for normalization
dge <- DGEList(counts = filtered_expression)

# Add gene symbols to the DGEList object
dge$genes <- data.frame(
    gene_id = rownames(filtered_expression),
    gene_name = tcga_gene_common$gene_name,
    gene_type = tcga_gene_common$gene_type
)

# Add sample group information
dge$samples$condition <- factor(combined_samples$condition)
dge$samples$dataset <- factor(combined_samples$dataset)

# Calculate normalization factors (accounts for library size differences)
dge <- calcNormFactors(dge, method = "TMM")

# Create design matrix
# Include both condition (tumor vs normal) and dataset (TCGA vs GTEx)
design <- model.matrix(~ condition, data = combined_samples)
colnames(design) <- c("Intercept", "Tumor_vs_Normal")

# Apply voom transformation (converts counts to log2-cpm with weights)
v <- voom(dge, design, plot = FALSE)

# Fit linear model
fit <- lmFit(v, design)
fit <- eBayes(fit)  # Apply empirical Bayes smoothing

# Extract results for tumor vs normal comparison
results <- topTable(
    fit,
    coef = "Tumor_vs_Normal",  # This coefficient represents tumor vs normal
    number = Inf,              # Get all genes
    adjust.method = "BH"       # Multiple testing correction
)

fwrite(results, "DEGs_Tumor_vs_Normal.tsv", sep = "\t")

#         gene_id       gene_name      gene_type     logFC P.Value adj.P.Val
# ENSG00000254996 ANKHD1-EIF4EBP3 protein_coding -7.739631       0         0
# ENSG00000166337           TAF10 protein_coding -6.742676       0         0
# ENSG00000255508      AP002990.1 protein_coding -7.434083       0         0
# ENSG00000167995           BEST1 protein_coding -8.792405       0         0
# ENSG00000131503          ANKHD1 protein_coding -6.586050       0         0
# ENSG00000243725            TTC4 protein_coding -5.127006       0         0

Create Visualizations

Let’s make some plots to visualize our results:

#-----------------------------------------------
# STEP 16: Create visualization plots
#-----------------------------------------------

# Load plotting library
library(ggplot2)

# 1. Volcano plot
volcano_data <- data.table(
    gene_name = results$gene_name,
    logFC = results$logFC,
    neg_log10_pval = -log10(results$adj.P.Val),
    significant = results$adj.P.Val < 0.05 & abs(results$logFC) > 1
)

volcano_plot <- ggplot(volcano_data, aes(x = logFC, y = neg_log10_pval)) +
    geom_point(aes(color = significant), alpha = 0.6, size = 1) +
    scale_color_manual(values = c("FALSE" = "gray", "TRUE" = "red")) +
    geom_vline(xintercept = c(-1, 1), linetype = "dashed", alpha = 0.5) +
    geom_hline(yintercept = -log10(0.05), linetype = "dashed", alpha = 0.5) +
    labs(
        title = "Volcano Plot: Tumor vs Normal Breast Tissue",
        x = "Log2 Fold Change (Tumor vs Normal)",
        y = "-Log10(Adjusted P-value)",
        color = "Significant"
    ) +
    theme_minimal()

ggsave("plots/volcano_plot.png", volcano_plot, width = 8, height = 6, dpi = 300)

# 2. MA plot (shows relationship between expression level and fold change)
ma_plot <- ggplot(results, aes(x = AveExpr, y = logFC)) +
    geom_point(aes(color = adj.P.Val < 0.05), alpha = 0.6, size = 0.8) +
    scale_color_manual(values = c("FALSE" = "gray", "TRUE" = "red")) +
    geom_hline(yintercept = c(-1, 0, 1), linetype = c("dashed", "solid", "dashed")) +
    labs(
        title = "MA Plot: Average Expression vs Fold Change",
        x = "Average Expression (log2 CPM)",
        y = "Log2 Fold Change",
        color = "FDR < 0.05"
    ) +
    theme_minimal()

ggsave("plots/ma_plot.png", ma_plot, width = 8, height = 6, dpi = 300)

Understanding Batch Effects and Study Design Caveats

The Challenge of Cross-Study Analysis

When we combine TCGA and GTEx data, we face a serious challenge called confounding:

  • All tumor samples come from TCGA
  • All normal samples come from GTEx
  • We can’t separate true biological differences from technical differences between studies

This means some of our “cancer genes” might actually be “TCGA vs GTEx” genes!

Caveats and Limitations

#-----------------------------------------------
# STEP 17: Assess potential batch effects
#-----------------------------------------------

# Let's check some "housekeeping" genes that shouldn't change much
housekeeping_genes <- c("ACTB", "GAPDH", "RPL13A", "SDHA")

# Find these genes in our results
hk_results <- results[results$gene_name %in% housekeeping_genes, ]

cat("Housekeeping gene results (should have small changes):\n")
print(hk_results[, c("gene_name", "logFC", "adj.P.Val")])

# If housekeeping genes show large changes, this suggests batch effects
large_hk_changes <- abs(hk_results$logFC) > 0.5
if (any(large_hk_changes)) {
    cat("⚠️  Warning: Some housekeeping genes show large changes\n")
    cat("This suggests possible batch effects between TCGA and GTEx\n")
} else {
    cat("✓ Housekeeping genes look stable\n")
}

#                 gene_name      logFC     adj.P.Val
# ENSG00000073578      SDHA -4.0925807  0.000000e+00
# ENSG00000142541    RPL13A  1.3622276 1.155829e-152
# ENSG00000075624      ACTB  0.4329943  8.901307e-35
# ENSG00000111640     GAPDH  0.4409455  3.127211e-18

# ⚠️  Warning: Some housekeeping genes show large changes
# This suggests possible batch effects between TCGA and GTEx

Suggestions for Handling Batch Effects

1. Use Conservative Thresholds:

# Focus on genes with very large effect sizes and high significance
high_confidence <- results[
    abs(results$logFC) > 2 &           # Large effect size
    results$adj.P.Val < 0.001 &       # Very significant
    results$AveExpr > 5,              # Well-expressed genes
]

cat(paste("High-confidence DE genes:", nrow(high_confidence), "\n"))

# High-confidence DE genes: 334 

2. Within-Study Analysis (When Possible):

# Check if TCGA has normal samples for within-study comparison
tcga_normal_count <- sum(tcga_samples$sample_type == "Solid Tissue Normal")
cat(paste("TCGA normal samples available:", tcga_normal_count, "\n"))

if (tcga_normal_count > 10) {
    cat("Recommendation: Use TCGA tumor vs TCGA normal for more reliable results\n")
}

# TCGA normal samples available: 113 
# Recommendation: Use TCGA tumor vs TCGA normal for more reliable results

3. Literature Validation:

# Check if our top genes are known cancer genes
known_cancer_genes <- c("MYC", "TP53", "BRCA1", "ERBB2", "ESR1")
cancer_gene_hits <- high_confidence$gene_name %in% known_cancer_genes

cat("Known cancer genes in high-confidence results:\n")
print(high_confidence[cancer_gene_hits, c("gene_name", "logFC", "adj.P.Val")])

#                 gene_name    logFC     adj.P.Val
# ENSG00000141736     ERBB2 2.086971 6.695777e-110
# ENSG00000091831      ESR1 2.238715  1.308510e-42

Alternative Approach: Pre-Processed Unified TCGA and GTEx Data

Why Consider Unified Data?

While downloading raw TCGA and GTEx data provides maximum flexibility, direct comparison between these datasets poses significant challenges due to differences in sample processing, sequencing platforms, and analysis pipelines. The RNA-seq expression levels typically range from 4-10 (log2 of normalized_count) for TCGA and 0-4 (log2 of RPKM) for GTEx, making direct comparisons problematic.

Wang et al. (2018) developed a comprehensive pipeline that processes and unifies RNA-seq data from TCGA and GTEx studies, including uniform realignment, gene expression quantification, and batch effect removal. Their analysis demonstrated that uniform alignment and quantification alone is insufficient – explicit batch effect correction is essential to facilitate meaningful data comparison between cancer and normal samples.

The Unified Dataset

The researchers processed data from 10,366 samples, including 2,790 from GTEx and 7,576 from TCGA, resulting in 9,109 high-quality samples after quality control and degradation filtering. This unified dataset covers tissues studied by both projects and provides three different data formats:

Downloading the Unified Data

The processed data is available on Figshare in three formats:

Conclusion: Best Practices for TCGA and GTEx Analysis

This tutorial has shown you how to download and analyze gene expression data from TCGA and GTEx databases. Here are the key takeaways:

What We’ve Accomplished

  1. Downloaded real cancer genomics data from public databases
  2. Processed and cleaned the data for analysis
  3. Identified genes that are different between cancer and normal tissue
  4. Created visualizations to understand the results
  5. Addressed important limitations of cross-study analysis

Key Best Practices

Data Quality:

  • Always filter low-expressed genes
  • Use appropriate normalization methods
  • Check for batch effects

Statistical Analysis:

  • Use proper multiple testing correction
  • Focus on large effect sizes for cross-study comparisons
  • Validate results when possible

Interpretation:

  • Be cautious with cross-study comparisons
  • Focus on biological relevance
  • Consider known cancer biology

Extending This Analysis

You can extend this analysis in several ways:

# Example extensions:
# 1. Analyze multiple cancer types
# 2. Include clinical data for survival analysis  
# 3. Perform pathway enrichment analysis
# 4. Compare with published gene signatures

Final Recommendations

  1. Start with well-characterized datasets like breast cancer
  2. Validate findings with literature and other datasets
  3. Be transparent about limitations in your analysis
  4. Consider experimental validation for important findings

The power of public genomics databases like TCGA and GTEx lies in their ability to generate hypotheses for further testing. By following the principles in this tutorial, you’ll be well-equipped to explore cancer genomics data responsibly and effectively.

For more detailed information about RNA-seq analysis best practices, check out my previous tutorial on differential expression analysis.

Troubleshooting Common Issues

Download Problems

  • Slow downloads: Try different times of day when servers are less busy
  • Failed downloads: Check internet connection and try restarting R
  • Package errors: Update R and Bioconductor packages

Memory Issues

# If R runs out of memory:
# 1. Close other programs
# 2. Increase memory limit (Windows)
memory.limit(size = 8000)  # 8GB

# 3. Work with smaller subsets
# subset_data &lt;- data[1:1000, ]  # First 1000 genes only

Analysis Issues

  • No significant genes: Try less stringent p-value cutoffs
  • Too many significant genes: Check for batch effects
  • Weird results: Verify sample labels and data processing steps

Remember: Real data analysis often involves troubleshooting. Don’t get discouraged if things don’t work perfectly the first time!

References and Further Reading

  1. The Cancer Genome Atlas Research Network. The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 2013.
  2. GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science, 2020.
  3. Colaprico, A., et al. TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Research, 2016.
  4. GTEx Consortium. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017). https://doi.org/10.1038/nature24277
  5. Wang, Q., Armenia, J., Zhang, C. et al. Unifying cancer and normal RNA sequencing data from different sources. Sci Data 5, 180061 (2018). https://doi.org/10.1038/sdata.2018.61

This tutorial is part of the NGS101.com beginner’s guide to genomics data analysis. For questions or suggestions, leave a comment below.

Comments

8 responses to “How To Download Gene Expression Data From TCGA and GTEx Using R: A Complete Guide for Cancer Genomics Research”

  1. Davis Avatar
    Davis

    If you wanted to filter patients by breast cancer subtype how would you do that? For example with TNBC, how would you find the appropriate IHC meta data to filter for HER2-/ER-/PR-?

    1. Lei Avatar

      Hi Davis,

      Good questions! I’ve updated the tutorial. Please check out the “Understanding Sample Types” section, where I’ve added details on how to access cancer subtype information.

  2. Leo Avatar
    Leo

    Say I have a list of non-coding RNAs which I want to see if they are differentially expressed once I have combined the TCGA and GTEx normal how would I ensure to run the differential expression to factor in the batch impacts this may have?

    1. Lei Avatar

      Thank you for this important question about batch effects. You’ve identified a key challenge with combining TCGA and GTEx data.

      As mentioned in the tutorial, when your cancer samples come from TCGA and normal samples from GTEx, you have a complete confounding design—the dataset source is perfectly correlated with the biological condition. This means standard batch correction methods won’t fully separate technical from biological variation.

      That said, there are a few approaches to consider:

      1. Proceed with awareness: Tumor-normal differences are often strong enough to dominate batch effects, so you may still capture true biological signals. However, you should validate findings with other datasets or experimental evidence.

      2. Alternative controls: If possible, use matched normal samples from TCGA instead of GTEx to avoid the confounding issue entirely.

      3. Complementary validation: Cross-reference your results with independent studies or datasets that examined the same non-coding RNAs.

      The key is to interpret results cautiously and always ask: do these differences make biological sense? Independent validation becomes especially critical with this study design.

      1. Leo Avatar
        Leo

        Thank you for the response, what would you suggest would be other suitable databases to help validate this data e.g. KMN plotter and GEPIA? Or do you have any other suggestions?
        For datasets like Ovarian there is no matched normal however. Also, actually to prevent batch effects what was the reasoning for using recount3 instead of UCSC Xena Toil ? Additionally, why use limma and edgeR instead of DESEQ2?

        1. Lei Avatar

          Great follow-up questions! Let me address each one:

          Regarding Recount3 and batch effects:

          I should clarify that this tutorial doesn’t use Recount3. We download data directly from TCGA and GTEx portals.

          More importantly, batch effects cannot be prevented or fully normalized away when combining TCGA tumor samples with GTEx normal samples, regardless of which data source you use (Recount3, UCSC Xena, or direct downloads). This is because the batch (dataset source) is perfectly confounded with the biological condition (tumor vs normal)—all your tumor samples come from TCGA and all your normals from GTEx. No computational method can separate technical batch effects from true biological differences in this design. This fundamental limitation exists regardless of what method you use.

          For validation, I’d suggest:

          – GEPIA2/3: Specifically designed for TCGA-GTEx comparisons with their own normalization approaches
          – KM plotter: Correlates expression with patient survival outcomes
          – Independent GEO datasets: Studies with properly controlled tumor-normal designs

          Regarding matched normals:

          You’re correct that some cancer types like ovarian have very limited matched TCGA normals. The TCGA-GTEx comparison is often the only large-scale option available, but this makes validation with independent datasets even more critical.

          Regarding DESeq2 vs. limma/edgeR:

          This tutorial demonstrates limma-voom, but DESeq2 is equally valid and would give highly similar results. Use whichever tool you’re most comfortable with—the key statistical principles are the same, and none of them can solve the fundamental confounding issue when comparing TCGA to GTEx.

  3. Reza Avatar
    Reza

    hello many thanks for your nice tutorials. I’m learning coding, and I’m a cancer researcher. Thanks to you im getting more and more interested in R coding. I was wondering if you could show us how we can use these databases to see whether the survival rate changes in the two groups of patients, one with lower and another with higher expression of a gene.
    thanks

    1. Lei Avatar

      Hi Reza,

      Absolutely, I’ll include survival analysis in the clinical data analysis tutorial series.
      Stay tuned!

Leave a Reply

Your email address will not be published. Required fields are marked *