How to Analyze RNAseq Data for Absolute Beginners Part 6: A Comprehensive Guide for Cancer Subtype Prediction

Meta Description: Learn how to predict cancer subtypes using RNA-seq data through practical implementations of PAM50, genefu, and GSVA methods. Perfect for bioinformaticians and computational biologists working with gene expression data.

Video Tutorial

Introduction

Cancer subtype prediction from RNA-seq data is crucial for personalized medicine and treatment optimization. This tutorial, part 6 in our RNA-seq analysis series, demonstrates how to implement various subtype prediction methods using R. We’ll use breast cancer as our model system, but the principles can be applied to other cancer types.

Note: This tutorial builds on concepts covered in Part 3: Differential Expression Analysis. If you’re new to RNA-seq analysis, we recommend starting with Part 1: Introduction to RNA-seq.

Understanding Molecular Subtypes

Breast Cancer Subtypes Overview

Breast cancer serves as an excellent model for molecular subtyping, featuring four well-characterized intrinsic subtypes. Each subtype has distinct characteristics affecting treatment decisions and prognosis:

Subtype	Key Characteristics	Treatment Response	Prognosis
Luminal A	ER+/PR+, HER2-, Low Ki-67	Excellent hormone therapy response	Best
Luminal B	ER+/PR+, HER2±, High Ki-67	Combined therapy needed	Intermediate
HER2-enriched	ER-/PR-, HER2+	Good trastuzumab response	Intermediate
Basal-like/Triple-negative	ER-/PR-, HER2-	Limited targeted options	Poor

Setup and Prerequisites

Required R Packages

First, let’s install all necessary packages. We’ll use BiocManager for Bioconductor packages and standard installation for CRAN packages.

# Install BiocManager if not already installed
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install Bioconductor packages
BiocManager::install(c(
    "GEOquery",
    "edgeR",
    "limma",
    "EnsDb.Hsapiens.v86",
    "genefu",
    "GSVA"
))

# Install CRAN packages
install.packages(c(
    "data.table",
    "xtable",
    "rmeta",
    "caret",
    "pheatmap",
    "devtools"
))

# Install GitHub packages
devtools::install_github("ccchang0111/PAM50")

Loading Libraries

# Load required libraries
library(GEOquery)
library(data.table)
library(edgeR)
library(limma)
library(xtable)
library(rmeta)
library(caret)
library(genefu)
library(AnnotationDbi)
library(EnsDb.Hsapiens.v86)
library(PAM50)
library(GSVA)
library(pheatmap)

Data Preparation

Processing Gene Expression Data

We’ll use the GSE209998 dataset as our example. This section shows how to process raw count data into a format suitable for subtype prediction.

# Fetch data from GEO
gse_meta <- getGEO("GSE209998")
gse_meta <- pData(gse_meta[[1]])
rownames(gse_meta) <- gse_meta$description.1

# Read count data
gse_count <- fread("GSE209998_AUR_129_raw_counts.txt", data.table = F)

# Verify sample matching
if(!identical(rownames(gse_meta), colnames(gse_count)[-1])) {
    stop("Sample names don't match between metadata and count data")
}

# Process count table
gse_count <- gse_count[!duplicated(gse_count[[1]]), ]
rownames(gse_count) <- gse_count[[1]]
gse_count <- gse_count[, -1]

# Filter low-expressed genes
perc_keep <- 0.8  # Keep genes expressed in 80% of samples
gene_keep <- rowSums(gse_count > 0) >= ceiling(perc_keep * ncol(gse_count))
count_tbl_low_rm <- gse_count[gene_keep, ]

# Create DGE list and normalize
dge_gse <- DGEList(counts=count_tbl_low_rm, samples = gse_meta)
dge_gse <- dge_gse[, dge_gse$samples$tissue.ch1 == "Breast"]
dge_gse <- calcNormFactors(dge_gse, method = "TMM")
dge_gse_v <- voom(dge_gse, plot=TRUE)

Preparing Gene Annotations

The prediction methods require ENTREZ IDs. Here’s how to map your genes to the correct identifiers:

# Get ENTREZ IDs
entrezid <- AnnotationDbi::select(
    EnsDb.Hsapiens.v86,
    keys = rownames(dge_gse_v),
    columns = c("ENTREZID", "SYMBOL"),
    keytype = "SYMBOL"
)

# Clean up annotations
entrezid <- na.omit(entrezid)
entrezid <- entrezid[!duplicated(entrezid[[1]]) & !duplicated(entrezid[[2]]), ]
rownames(entrezid) <- entrezid[[2]]

# Match genes
com_gene <- intersect(entrezid[[2]], rownames(dge_gse_v))
entrezid <- entrezid[com_gene, c(2, 2, 1)]
colnames(entrezid) <- c("probe", "Gene.Symbol", "EntrezGene.ID")
rownames(entrezid) <- entrezid[[1]]

# Update expression data
dge_gse_v_entrez <- dge_gse_v[rownames(entrezid), ]
dge_gse_v_entrez$genes <- entrezid

# Create subset for demonstration
set.seed(123)
dge_gse_v_entrez_sub <- dge_gse_v_entrez[, sample(colnames(dge_gse_v_entrez), 30)]

The resulting gene annotation table “entrezid” has the following format.

Subtype Prediction Methods

Method 1: PAM50 Classification

The PAM50 classifier from the PAM50 package uses a 50-gene signature to predict breast cancer subtypes.

# Format expression matrix
expr_dge_gse_v_entrez_sub <- as.data.frame(dge_gse_v_entrez_sub$E)
rownames(expr_dge_gse_v_entrez_sub) <- entrezid$EntrezGene.ID

# Predict subtypes
df.pred <- PAM50(expr_dge_gse_v_entrez_sub, cutoff = 0)

# Save results
fwrite(as.data.table(df.pred), "subtype_pred_pam50.csv")

# Visualize results
pdf("heatmap_subtype_pred_pam50_expression.pdf", width = 7, height = 9)
plotPAM50(as.data.frame(dge_gse_v_entrez_sub$E), df.pred$PAM50)
dev.off()

The resulting df.pred object contains a probability matrix where each row represents a sample and each column shows the likelihood of that sample belonging to a specific subtype.

Visualize the expression levels of PAM50’s diagnostic gene set using a heatmap, which highlights the distinct expression patterns that characterize each molecular subtype.

Method 2: genefu Package

The genefu package supports several breast cancer classification models:

scmgene
scmod1/scmod2
pam50
ssp2006/ssp2003
intClust
AIMS
claudinLow

To use a different model, simply change the sbt.model parameter in the molecular.subtyping() function.

Note: Each model requires its own reference dataset. If you encounter a ‘missing data’ error, load the corresponding reference data similar to how we loaded pam50.robust.

# Transpose expression matrix
expr_dge_gse_v_entrez_sub_t <- t(dge_gse_v_entrez_sub$E)

# Load PAM50 data
data(pam50.robust)

# Predict subtypes
subtype_pred_pam50 <- molecular.subtyping(
    sbt.model = "pam50",
    data = expr_dge_gse_v_entrez_sub_t,
    annot = entrezid,
    do.mapping = TRUE
)

# Process results
subtype_pred_pam50_df <- as.data.frame(subtype_pred_pam50$subtype.proba)
subtype_pred_pam50_df$Subtype <- as.character(subtype_pred_pam50$subtype)

# Save results
fwrite(subtype_pred_pam50_df, "subtype_pred_pam50_df.csv")

# Visualize results
pdf("heatmap_subtype_pred_pam50.pdf", width = 7, height = 9)
pheatmap(
    subtype_pred_pam50$subtype.proba,
    scale = "none",
    cluster_rows = TRUE,
    cluster_cols = FALSE,
    show_rownames = TRUE
)
dev.off()

The resulting subtype_pred_pam50_df object contains a probability matrix where each row represents a sample and each column shows the likelihood of that sample belonging to a specific subtype.

Visualize the classification probabilities in a heatmap format, where warm colors represent higher confidence in subtype assignments for each sample.

Method 3: GSVA Approach

If you are working on subtyping other cancers and have access to the relevant subtype gene signatures, the GSVA package is an excellent choice. GSVA offers flexibility for custom gene signatures:

# Define custom gene signatures
gene_sig_ls <- list(
    gene_basal1 = c("MRPS14", "DIP2B", /* ... other genes ... */),
    gene_basal2 = c("AKR1C1", "AMIGO2", /* ... other genes ... */),
    gene_luminal1 = c("ABCA3", "ABCG1", /* ... other genes ... */),
    gene_luminal2 = c("ADAMTS5", "ALCAM", /* ... other genes ... */)
)

# Run ssGSEA for single sample profiling
ssgseapar <- ssgseaParam(dge_gse_v_entrez_sub$E, gene_sig_ls)
gsva_pred <- gsva(ssgseapar)

# Use GSVA method for comparing samples in a larger cohort
# gsvapar <- gsvaParam(dge_gse_v_entrez_sub$E, gene_sig_ls)
# gsva_pred <- gsva(gsvapar)


# Process and save results
gsva_pred_t <- as.data.frame(t(gsva_pred))
fwrite(gsva_pred_t, "gsva_pred_t.csv", row.names = T)

# Visualize results
pdf("heatmap_subtype_pred_ssgsea.pdf", width = 7, height = 9)
pheatmap(
    gsva_pred_t,
    scale = "row",
    cluster_rows = TRUE,
    cluster_cols = FALSE,
    show_rownames = TRUE
)
dev.off()

The gsva_pred_t object stores normalized enrichment scores (NES) for each sample-subtype combination, where higher scores indicate stronger association with that particular molecular subtype.

Visualize the subtype enrichment patterns across all samples using a heatmap, where color intensity represents the strength of each sample’s association with different molecular subtypes.

Results Interpretation

When interpreting subtype predictions, consider:

Confidence Scores: Review prediction probabilities for each subtype
Gene Signature Quality: Assess the relevance of gene signatures
Sample Quality: Evaluate RNA-seq quality metrics
Clinical Correlation: Compare with known clinical parameters

Best Practices

Gene Signature Selection

Use validated gene signatures
Consider tissue-specific signatures
Update signatures based on current literature

Validation

Cross-validate predictions
Compare results across methods
Correlate with clinical data

References

Parker JS, et al. (2009). Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol.
Hänzelmann S, et al. (2013). GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics.
Gendoo DM, et al. (2016). Genefu: an R/Bioconductor package for computation of gene expression-based signatures in breast cancer. Bioinformatics.

NGS Learning Hub

How to Analyze RNAseq Data for Absolute Beginners Part 6: A Comprehensive Guide for Cancer Subtype Prediction

Video Tutorial

Introduction

Understanding Molecular Subtypes

Breast Cancer Subtypes Overview

Setup and Prerequisites

Required R Packages

Loading Libraries

Data Preparation

Processing Gene Expression Data

Preparing Gene Annotations

Subtype Prediction Methods

Method 1: PAM50 Classification

Method 2: genefu Package

Method 3: GSVA Approach

Results Interpretation

Best Practices

References

Like this:

Comments

Leave a Reply Cancel reply

Search

Subscribe

Categories

Recent Posts

Tags

How to Analyze RNAseq Data for Absolute Beginners Part 6: A Comprehensive Guide for Cancer Subtype Prediction

Video Tutorial

Introduction

Understanding Molecular Subtypes

Breast Cancer Subtypes Overview

Setup and Prerequisites

Required R Packages

Loading Libraries

Data Preparation

Processing Gene Expression Data

Preparing Gene Annotations

Subtype Prediction Methods

Method 1: PAM50 Classification

Method 2: genefu Package

Method 3: GSVA Approach

Results Interpretation

Best Practices

References

Share this:

Like this:

Comments

Leave a Reply Cancel reply

Search

Subscribe

Categories

Recent Posts

Tags