How to Analyze RNAseq Data for Absolute Beginners Part 10: Isoform Analysis

Video Tutorial

If you’ve followed our previous tutorials on RNA-seq analysis, you’re already familiar with gene-level analysis. But genes are far more complex than simple on/off switches – they can produce multiple versions of RNA transcripts through fascinating processes like alternative splicing. These different versions, called isoforms, allow a single gene to create multiple protein products, dramatically expanding the functional diversity of our genome. Today, we’ll dive deep into how to identify and quantify these isoforms from your RNA-seq data.

Why Isoform Analysis Matters

The importance of isoform analysis extends far beyond academic interest. When cells produce incorrect ratios of isoforms or generate aberrant isoforms, the consequences can be severe. We’ve seen this in numerous diseases:

Cancer cells often express unique isoform patterns that contribute to their aggressive behavior
Neurodegenerative disorders frequently involve misregulation of splicing
Cardiovascular diseases can arise from altered isoform expression

By understanding isoform expression patterns, researchers have identified new disease biomarkers and potential therapeutic targets. This makes isoform analysis an essential skill in modern genomics research.

Getting Started: Setting Up Your Environment

Before we dive into the analysis, let’s make sure you have everything you need. We’ll build upon the environment we created in our previous tutorials, adding some specialized tools for isoform analysis.

Installing Salmon: Your New Favorite Tool

For this tutorial, we’ll use Salmon – a tool that has revolutionized transcript quantification. What makes Salmon special? Unlike traditional aligners that meticulously map every read to the genome, Salmon uses a clever “quasi-mapping” strategy that’s both faster and highly accurate. Here’s how to get it running. We’ll install Salmon in our previously created RNA-seq environment.

# First, activate our trusty RNA-seq environment
conda activate rnaseq_env

# Install Salmon and trim-galore for adapter trimming
conda install -c bioconda salmon trim-galore

# Let's make sure it worked
salmon --version

Setting Up Your Reference Data

For Salmon to work its magic, we need a pre-built index. Think of this as a well-organized reference library that Salmon can quickly search through. Save the downloaded Salmon index in a clearly named directory for the quantification step, e.g., ~/Genome_Index/Salmon_mm10/salmon_index/. You can download pre-built indices from refgenie:

For human data: Human Genome Index
For mouse data: Mouse Genome Index

The Analysis Pipeline: From Raw Reads to Isoform Counts

Step 1: Quality Control and Trimming

Just like in our gene-level analysis, we start with quality control. Clean data is crucial for accurate isoform quantification:

# Run trim_galore with FastQC for comprehensive QC
trim_galore --fastqc \
    --paired \
    --cores 8 \
    ~/RNAseq/raw/SRR28119110_R1_001.fastq.gz \
    ~/RNAseq/raw/SRR28119110_R2_001.fastq.gz \
    -o ~/RNAseq/isoform_analysis/trimmed/SRR28119110

Step 2: Quantifying Isoforms with Salmon

Now comes the exciting part – actually measuring isoform expression. Salmon will process your reads and tell you how much of each isoform is present:

# Run Salmon quantification
salmon quant -i ~/Genome_Index/Salmon_mm10/salmon_index/ \
    -l A \
    -1 ~/RNAseq/isoform_analysis/trimmed/SRR28119110/SRR28119110_R1_001_val_1.fq.gz \
    -2 ~/RNAseq/isoform_analysis/trimmed/SRR28119110/SRR28119110_R2_001_val_2.fq.gz \
    --validateMappings \
    -p 8 \
    -o ~/RNAseq/isoform_analysis/quantification/SRR28119110

The output includes a crucial file called quant.sf. This contains your isoform expression data in two key metrics:

TPM (Transcripts Per Million): Perfect for comparing between samples
Read counts: Essential for differential expression analysis

Step 3: Making Sense of the Results

Raw transcript IDs aren’t very informative. Let’s add some meaningful annotations to our results using R:

# Load essential packages
library(data.table)
library(stringr)

# Read in the GTF file - this contains our transcript annotations
gtf_mm10 <- fread("~/Genome_Index/Salmon_mm10/salmon_index/0f10d83b1050c08dd53189986f60970b92a315aa7a16a6f1.gtf", skip = 5)

# Focus on transcript-level information
gtf_mm10 <- gtf_mm10[gtf_mm10[[3]] == "transcript"]

# Create a clean annotation table
transcript_annot <- data.table(
    trascript_id = gsub("transcript_id| |\"", "", 
                        sapply(strsplit(gtf_mm10[[9]], ";"), "[[", 3)),
    trascript_name = gsub("transcript_name| |\"", "", 
                         sapply(strsplit(gtf_mm10[[9]], ";"), "[[", 8))
)

# Add annotations to our quantification results
quant_SRR28119110 <- fread("isoform_analysis/quantification/SRR28119110/quant.sf")

quant_SRR28119110$trascript_name <- gsub("\\..*$", "", quant_SRR28119110$trascript_name)

quant_SRR28119110$trascript_name <- transcript_annot$trascript_name[
    match(quant_SRR28119110$Name, transcript_annot$trascript_id)
]

# Save our annotated results
fwrite(quant_SRR28119110, 
       "~/RNAseq/isoform_analysis/quantification/SRR28119110/quant_annotated.sf")

Step 4: Differential Isoform Expression Analysis

Now that we have annotated quantification results for each sample, it’s time to bring everything together into a unified isoform expression matrix. This step is crucial for downstream analysis, particularly for identifying differentially expressed isoforms between conditions.

Creating the Isoform Expression Matrix

Let’s combine our individual sample results into a comprehensive matrix:

# Install tximport
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install(c("tximport", "rtracklayer"))

library(tximport)
library(rtracklayer)

# Estimate counts of the transcripts
files <- list.files("~/RNAseq/isoform_analysis/quantification", pattern = "quant.sf", full.names = TRUE, recursive = T)
names(files) <- basename(dirname(files))

txi <- tximport(files, type = "salmon", txOut = TRUE, countsFromAbundance = "lengthScaledTPM")
count_tbl <- as.data.frame(txi$counts)

# Remove trascript ID version suffix if necessary
rownames(count_tbl) <- gsub("\\..*$", "", rownames(count_tbl))
count_tbl$transcript_id <- rownames(count_tbl)

# Annotate the trascript ID 
count_tbl$trascript_name <- transcript_annot$trascript_name[
  match(rownames(count_tbl), transcript_annot$trascript_id)]

# Order the columns of the count table
count_tbl <- count_tbl[, c(5:6, 1:4)]

# Save the transcript count table
fwrite(count_tbl, "isoform_expression_counts.tsv", sep = "\t")


# Create a TPM matrix if you prefer TPMs
tpm_tbl <- as.data.frame(txi$abundance)
rownames(tpm_tbl) <- gsub("\\..*$", "", rownames(tpm_tbl))

tpm_tbl$transcript_id <- rownames(tpm_tbl)
tpm_tbl$trascript_name <- transcript_annot$trascript_name[
  match(rownames(tpm_tbl), transcript_annot$trascript_id)]

tpm_tbl <- tpm_tbl[, c(5:6, 1:4)]

fwrite(tpm_tbl, "isoform_expression_tpm.tsv", sep = "\t")

Creating the Gene-level Expression Matrix

Salmon can estimate gene-level expression in both raw counts and TPM formats.

# Load the GTF file
gtf_annot <- import("~/Genome_Index/Salmon_mm10/salmon_index/0f10d83b1050c08dd53189986f60970b92a315aa7a16a6f1.gtf")

# Create a transcript-gene annotation table
gene_symbols <- na.omit(unique(data.frame(
  gene_id = gtf_annot$transcript_id,
  gene_symbol = gtf_annot$gene_name
)))

# Quantify gene expression
txi_gene <- tximport(files, type = "salmon", tx2gene = gene_symbols, ignoreTxVersion = T)

txi_gene_count <- as.data.frame(txi_gene$counts)
txi_gene_tpm <- as.data.frame(txi_gene$abundance)

fwrite(txi_gene_count, "gene_expression_counts.tsv", sep = "\t", row.names = T)
fwrite(txi_gene_tpm, "gene_expression_tpm.tsv", sep = "\t", row.names = T)

Differential Analysis

Most modern differential expression analysis tools require raw count data as input. Since Salmon provides transcript-level abundance estimates, we use ‘tximport’ to convert these into count estimates. We can then perform differential analysis following the same workflow as our previous gene-level analysis.

Conclusion

Isoform analysis opens up exciting possibilities for understanding gene regulation and function. While it’s more complex than gene-level analysis, the insights it provides are invaluable for understanding disease mechanisms and identifying therapeutic targets.

Comments

2 responses to “How to Analyze RNAseq Data for Absolute Beginners Part 10: Isoform Analysis”

Dallis

April 18, 2025

Hello and thank you for this valuable resource!

I performed the code for part 1 of your tutorial then moved to this part (part 10) as I am only interested in the isoform anlaysis.

I am stuck on “Setting up your reference data” of the tutorial. I downloaded the files but do not know where to put them or what to name the folder. Does it relate to “Genome_Index/Salmon_mm10/salmon_index” in step 2?

I am also wondering, if there is a way to loop the paired files so that you do not have to write the lines of code for each sample (I have 36).

I appreciate any information/advice you can provide.
– D

1. Lei
  
  April 24, 2025
  
  Hey Dallis,
  
  Apologies for the late response! You’re spot on – those downloaded reference data are indeed the Salmon index files you’ll need for the quantification step. To keep things organized, it’s a good idea to place these files in a clearly named directory, for example, “~/Genome_Index/Salmon_mm10/salmon_index/”.
  
  Absolutely, there are definitely ways to write code to process all your samples in a loop. However, the best approach really depends on your computing environment. Running analyses on a personal computer versus a High-Performance Computing (HPC) cluster involves quite different considerations. Generally, using a personal laptop isn’t ideal due to its limited processing power and storage capacity.
  
  I’ll be sure to share some scripts for handling multiple samples in the future!

NGS Learning Hub

How to Analyze RNAseq Data for Absolute Beginners Part 10: Isoform Analysis

Video Tutorial

Why Isoform Analysis Matters

Getting Started: Setting Up Your Environment

Installing Salmon: Your New Favorite Tool

Setting Up Your Reference Data

The Analysis Pipeline: From Raw Reads to Isoform Counts

Step 1: Quality Control and Trimming

Step 2: Quantifying Isoforms with Salmon

Step 3: Making Sense of the Results

Step 4: Differential Isoform Expression Analysis

Creating the Isoform Expression Matrix

Creating the Gene-level Expression Matrix

Differential Analysis

Conclusion

Like this:

Comments

2 responses to “How to Analyze RNAseq Data for Absolute Beginners Part 10: Isoform Analysis”

Leave a Reply Cancel reply

Search

Subscribe

Categories

Recent Posts

Tags

How to Analyze RNAseq Data for Absolute Beginners Part 10: Isoform Analysis

Video Tutorial

Why Isoform Analysis Matters

Getting Started: Setting Up Your Environment

Installing Salmon: Your New Favorite Tool

Setting Up Your Reference Data

The Analysis Pipeline: From Raw Reads to Isoform Counts

Step 1: Quality Control and Trimming

Step 2: Quantifying Isoforms with Salmon

Step 3: Making Sense of the Results

Step 4: Differential Isoform Expression Analysis

Creating the Isoform Expression Matrix

Creating the Gene-level Expression Matrix

Differential Analysis

Conclusion

Share this:

Like this:

Comments

2 responses to “How to Analyze RNAseq Data for Absolute Beginners Part 10: Isoform Analysis”

Leave a Reply Cancel reply

Search

Subscribe

Categories

Recent Posts

Tags