How to Analyze RNAseq Data for Absolute Beginners Part 10: Isoform Analysis

How to Analyze RNAseq Data for Absolute Beginners Part 10: Isoform Analysis

Video Tutorial

If you’ve followed our previous tutorials on RNA-seq analysis, you’re already familiar with gene-level analysis. But genes are far more complex than simple on/off switches – they can produce multiple versions of RNA transcripts through fascinating processes like alternative splicing. These different versions, called isoforms, allow a single gene to create multiple protein products, dramatically expanding the functional diversity of our genome. Today, we’ll dive deep into how to identify and quantify these isoforms from your RNA-seq data.

Why Isoform Analysis Matters

The importance of isoform analysis extends far beyond academic interest. When cells produce incorrect ratios of isoforms or generate aberrant isoforms, the consequences can be severe. We’ve seen this in numerous diseases:

  • Cancer cells often express unique isoform patterns that contribute to their aggressive behavior
  • Neurodegenerative disorders frequently involve misregulation of splicing
  • Cardiovascular diseases can arise from altered isoform expression

By understanding isoform expression patterns, researchers have identified new disease biomarkers and potential therapeutic targets. This makes isoform analysis an essential skill in modern genomics research.

Getting Started: Setting Up Your Environment

Before we dive into the analysis, let’s make sure you have everything you need. We’ll build upon the environment we created in our previous tutorials, adding some specialized tools for isoform analysis.

Installing Salmon: Your New Favorite Tool

For this tutorial, we’ll use Salmon – a tool that has revolutionized transcript quantification. What makes Salmon special? Unlike traditional aligners that meticulously map every read to the genome, Salmon uses a clever “quasi-mapping” strategy that’s both faster and highly accurate. Here’s how to get it running. We’ll install Salmon in our previously created RNA-seq environment.

# First, activate our trusty RNA-seq environment
conda activate rnaseq_env

# Install Salmon and trim-galore for adapter trimming
conda install -c bioconda salmon trim-galore

# Let's make sure it worked
salmon --version

Setting Up Your Reference Data

For Salmon to work its magic, we need a pre-built index. Think of this as a well-organized reference library that Salmon can quickly search through. You can download pre-built indices from refgenie:

The Analysis Pipeline: From Raw Reads to Isoform Counts

Step 1: Quality Control and Trimming

Just like in our gene-level analysis, we start with quality control. Clean data is crucial for accurate isoform quantification:

# Run trim_galore with FastQC for comprehensive QC
trim_galore --fastqc \
    --paired \
    --cores 8 \
    ~/RNAseq/raw/SRR28119110_R1_001.fastq.gz \
    ~/RNAseq/raw/SRR28119110_R2_001.fastq.gz \
    -o ~/RNAseq/isoform_analysis/trimmed/SRR28119110

Step 2: Quantifying Isoforms with Salmon

Now comes the exciting part – actually measuring isoform expression. Salmon will process your reads and tell you how much of each isoform is present:

# Run Salmon quantification
salmon quant -i ~/Genome_Index/Salmon_mm10/salmon_index/ \
    -l A \
    -1 ~/RNAseq/isoform_analysis/trimmed/SRR28119110/SRR28119110_R1_001_val_1.fq.gz \
    -2 ~/RNAseq/isoform_analysis/trimmed/SRR28119110/SRR28119110_R2_001_val_2.fq.gz \
    --validateMappings \
    -p 8 \
    -o ~/RNAseq/isoform_analysis/quantification/SRR28119110

The output includes a crucial file called quant.sf. This contains your isoform expression data in two key metrics:

  • TPM (Transcripts Per Million): Perfect for comparing between samples
  • Read counts: Essential for differential expression analysis

Step 3: Making Sense of the Results

Raw transcript IDs aren’t very informative. Let’s add some meaningful annotations to our results using R:

# Load essential packages
library(data.table)
library(stringr)

# Read in the GTF file - this contains our transcript annotations
gtf_mm10 <- fread("~/Genome_Index/Salmon_mm10/salmon_index/0f10d83b1050c08dd53189986f60970b92a315aa7a16a6f1.gtf", skip = 5)

# Focus on transcript-level information
gtf_mm10 <- gtf_mm10[gtf_mm10[[3]] == "transcript"]

# Create a clean annotation table
transcript_annot <- data.table(
    trascript_id = gsub("transcript_id| |\"", "", 
                        sapply(strsplit(gtf_mm10[[9]], ";"), "[[", 3)),
    trascript_name = gsub("transcript_name| |\"", "", 
                         sapply(strsplit(gtf_mm10[[9]], ";"), "[[", 8))
)

# Add annotations to our quantification results
quant_SRR28119110 <- fread("isoform_analysis/quantification/SRR28119110/quant.sf")

quant_SRR28119110$trascript_name <- gsub("\\..*$", "", quant_SRR28119110$trascript_name)

quant_SRR28119110$trascript_name <- transcript_annot$trascript_name[
    match(quant_SRR28119110$Name, transcript_annot$trascript_id)
]

# Save our annotated results
fwrite(quant_SRR28119110, 
       "~/RNAseq/isoform_analysis/quantification/SRR28119110/quant_annotated.sf")

Step 4: Differential Isoform Expression Analysis

Now that we have annotated quantification results for each sample, it’s time to bring everything together into a unified isoform expression matrix. This step is crucial for downstream analysis, particularly for identifying differentially expressed isoforms between conditions.

Creating the Isoform Expression Matrix

Let’s combine our individual sample results into a comprehensive matrix:

# Install tximport
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install(c("tximport", "rtracklayer"))

library(tximport)
library(rtracklayer)

# Estimate counts of the transcripts
files <- list.files("~/RNAseq/isoform_analysis/quantification", pattern = "quant.sf", full.names = TRUE, recursive = T)
names(files) <- basename(dirname(files))

txi <- tximport(files, type = "salmon", txOut = TRUE, countsFromAbundance = "lengthScaledTPM")
count_tbl <- as.data.frame(txi$counts)

# Remove trascript ID version suffix if necessary
rownames(count_tbl) <- gsub("\\..*$", "", rownames(count_tbl))
count_tbl$transcript_id <- rownames(count_tbl)

# Annotate the trascript ID 
count_tbl$trascript_name <- transcript_annot$trascript_name[
  match(rownames(count_tbl), transcript_annot$trascript_id)]

# Order the columns of the count table
count_tbl <- count_tbl[, c(5:6, 1:4)]

# Save the transcript count table
fwrite(count_tbl, "isoform_expression_counts.tsv", sep = "\t")


# Create a TPM matrix if you prefer TPMs
tpm_tbl <- as.data.frame(txi$abundance)
rownames(tpm_tbl) <- gsub("\\..*$", "", rownames(tpm_tbl))

tpm_tbl$transcript_id <- rownames(tpm_tbl)
tpm_tbl$trascript_name <- transcript_annot$trascript_name[
  match(rownames(tpm_tbl), transcript_annot$trascript_id)]

tpm_tbl <- tpm_tbl[, c(5:6, 1:4)]

fwrite(tpm_tbl, "isoform_expression_tpm.tsv", sep = "\t")

Creating the Gene-level Expression Matrix

Salmon can estimate gene-level expression in both raw counts and TPM formats.

# Load the GTF file
gtf_annot <- import("~/Genome_Index/Salmon_mm10/salmon_index/0f10d83b1050c08dd53189986f60970b92a315aa7a16a6f1.gtf")

# Create a transcript-gene annotation table
gene_symbols <- na.omit(unique(data.frame(
  gene_id = gtf_annot$transcript_id,
  gene_symbol = gtf_annot$gene_name
)))

# Quantify gene expression
txi_gene <- tximport(files, type = "salmon", tx2gene = gene_symbols, ignoreTxVersion = T)

txi_gene_count <- as.data.frame(txi_gene$counts)
txi_gene_tpm <- as.data.frame(txi_gene$abundance)

fwrite(txi_gene_count, "gene_expression_counts.tsv", sep = "\t", row.names = T)
fwrite(txi_gene_tpm, "gene_expression_tpm.tsv", sep = "\t", row.names = T)

Differential Analysis

Most modern differential expression analysis tools require raw count data as input. Since Salmon provides transcript-level abundance estimates, we use ‘tximport’ to convert these into count estimates. We can then perform differential analysis following the same workflow as our previous gene-level analysis.

Conclusion

Isoform analysis opens up exciting possibilities for understanding gene regulation and function. While it’s more complex than gene-level analysis, the insights it provides are invaluable for understanding disease mechanisms and identifying therapeutic targets.

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *