How to Analyze RNAseq Data for Absolute Beginners Part 10: Isoform Analysis

How to Analyze RNAseq Data for Absolute Beginners Part 10: Isoform Analysis

If you’ve followed our previous tutorials on RNA-seq analysis, you’re already familiar with gene-level analysis. But genes are far more complex than simple on/off switches – they can produce multiple versions of RNA transcripts through fascinating processes like alternative splicing. These different versions, called isoforms, allow a single gene to create multiple protein products, dramatically expanding the functional diversity of our genome. Today, we’ll dive deep into how to identify and quantify these isoforms from your RNA-seq data.

Why Isoform Analysis Matters

The importance of isoform analysis extends far beyond academic interest. When cells produce incorrect ratios of isoforms or generate aberrant isoforms, the consequences can be severe. We’ve seen this in numerous diseases:

  • Cancer cells often express unique isoform patterns that contribute to their aggressive behavior
  • Neurodegenerative disorders frequently involve misregulation of splicing
  • Cardiovascular diseases can arise from altered isoform expression

By understanding isoform expression patterns, researchers have identified new disease biomarkers and potential therapeutic targets. This makes isoform analysis an essential skill in modern genomics research.

Getting Started: Setting Up Your Environment

Before we dive into the analysis, let’s make sure you have everything you need. We’ll build upon the environment we created in our previous tutorials, adding some specialized tools for isoform analysis.

Installing Salmon: Your New Favorite Tool

For this tutorial, we’ll use Salmon – a tool that has revolutionized transcript quantification. What makes Salmon special? Unlike traditional aligners that meticulously map every read to the genome, Salmon uses a clever “quasi-mapping” strategy that’s both faster and highly accurate. Here’s how to get it running. We’ll install Salmon in our previously created RNA-seq environment.

# First, activate our trusty RNA-seq environment
conda activate rnaseq_env

# Install Salmon and trim-galore for adapter trimming
conda install -c bioconda salmon trim-galore

# Let's make sure it worked
salmon --version

Setting Up Your Reference Data

For Salmon to work its magic, we need a pre-built index. Think of this as a well-organized reference library that Salmon can quickly search through. You can download pre-built indices from refgenie:

The Analysis Pipeline: From Raw Reads to Isoform Counts

Step 1: Quality Control and Trimming

Just like in our gene-level analysis, we start with quality control. Clean data is crucial for accurate isoform quantification:

# Run trim_galore with FastQC for comprehensive QC
trim_galore --fastqc \
    --paired \
    --cores 8 \
    ~/RNAseq/raw/SRR28119110_R1_001.fastq.gz \
    ~/RNAseq/raw/SRR28119110_R2_001.fastq.gz \
    -o ~/RNAseq/isoform_analysis/trimmed/SRR28119110

Step 2: Quantifying Isoforms with Salmon

Now comes the exciting part – actually measuring isoform expression. Salmon will process your reads and tell you how much of each isoform is present:

# Run Salmon quantification
salmon quant -i ~/Genome_Index/Salmon_mm10/salmon_index/ \
    -l A \
    -1 ~/RNAseq/isoform_analysis/trimmed/SRR28119110/SRR28119110_R1_001_val_1.fq.gz \
    -2 ~/RNAseq/isoform_analysis/trimmed/SRR28119110/SRR28119110_R2_001_val_2.fq.gz \
    --validateMappings \
    -p 8 \
    -o ~/RNAseq/isoform_analysis/quantification/SRR28119110

The output includes a crucial file called quant.sf. This contains your isoform expression data in two key metrics:

  • TPM (Transcripts Per Million): Perfect for comparing between samples
  • Read counts: Essential for differential expression analysis

Step 3: Making Sense of the Results

Raw transcript IDs aren’t very informative. Let’s add some meaningful annotations to our results using R:

# Load essential packages
library(data.table)
library(stringr)

# Read in the GTF file - this contains our transcript annotations
gtf_mm10 <- fread("~/Genome_Index/Salmon_mm10/salmon_index/0f10d83b1050c08dd53189986f60970b92a315aa7a16a6f1.gtf", skip = 5)

# Focus on transcript-level information
gtf_mm10 <- gtf_mm10[gtf_mm10[[3]] == "transcript"]

# Create a clean annotation table
transcript_annot <- data.table(
    trascript_id = gsub("transcript_id| |\"", "", 
                        sapply(strsplit(gtf_mm10[[9]], ";"), "[[", 3)),
    trascript_name = gsub("transcript_name| |\"", "", 
                         sapply(strsplit(gtf_mm10[[9]], ";"), "[[", 8))
)

# Add annotations to our quantification results
quant_SRR28119110 <- fread("isoform_analysis/quantification/SRR28119110/quant.sf")
quant_SRR28119110$trascript_name <- transcript_annot$trascript_name[
    match(quant_SRR28119110$Name, transcript_annot$trascript_id)
]

# Save our annotated results
fwrite(quant_SRR28119110, 
       "~/RNAseq/isoform_analysis/quantification/SRR28119110/quant_annotated.sf")

Step 4: Differential Isoform Expression Analysis

Now that we have annotated quantification results for each sample, it’s time to bring everything together into a unified isoform expression matrix. This step is crucial for downstream analysis, particularly for identifying differentially expressed isoforms between conditions.

Creating the Isoform Expression Matrix

Let’s combine our individual sample results into a comprehensive matrix:

# List all quantification files
quant_files <- list.files(path = "~/RNAseq/isoform_analysis/quantification",
                         pattern = "quant_annotated.sf$",
                         recursive = TRUE,
                         full.names = TRUE)

# Read and combine all files
isoform_matrix <- lapply(quant_files, function(file) {
    sample_name <- basename(dirname(file))
    data <- fread(file)
    # Extract TPM values with transcript names
    setnames(data[, .(trascript_name, TPM)], "TPM", sample_name)
})

# Merge all samples into one matrix
isoform_matrix <- Reduce(function(x, y) merge(x, y, by = "trascript_name"), isoform_matrix)

# Set transcript names as row names
setDF(isoform_matrix)
rownames(isoform_matrix) <- isoform_matrix$trascript_name
isoform_matrix$trascript_name <- NULL

# Save the matrix for later use
fwrite(isoform_matrix, "isoform_matrix.tsv", sep = "\t", row.names = T)

Differential Analysis

After obtaining the isoform expression matrix, we can conduct differential analysis using the same approach as for gene-level data, as demonstrated in the previous tutorial.

Conclusion

Isoform analysis opens up exciting possibilities for understanding gene regulation and function. While it’s more complex than gene-level analysis, the insights it provides are invaluable for understanding disease mechanisms and identifying therapeutic targets.

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *