As we move forward in our RNAseq analysis journey, we’ll be transitioning from the Linux environment to R, a powerful and versatile statistical analysis tool. R is not only a programming language but also a platform widely used in data science, statistical computing, and predictive modeling. Tech giants like Microsoft, Meta, Google, Amazon, and Netflix rely on R for data analysis, modeling, and machine learning. The good news is that your laptop should be more than capable of running R for differential expression (DE) analysis, data visualization, and other downstream statistical analyses. Let’s dive into the step-by-step process of performing DE analysis in R.
Before we proceed, a quick reminder for our readers: If you haven’t obtained your gene expression matrix yet, don’t worry! We’ve covered this crucial step in our previous posts. Be sure to check out “Part 1” and “Part 2” of our series, where we walk you through the process of quantifying gene expression from raw sequencing data to a count matrix. These posts provide step-by-step guidance on how to process your FASTQ files and generate the gene expression data we’ll be using in this analysis.
Installing R and RStudio
Install R
Installing R is straightforward and similar to installing any other program on your computer:
- Visit the official R website.
- Download the latest version for your operating system (Windows, Mac, or Linux).
- Double-click the installation package and follow the prompts.
Install RStudio
RStudio, a user-friendly graphical interface for R, can be installed in a similar manner:
- Go to the RStudio download page.
- Download the appropriate version for your system.
- Install it like any other application.
RStudio enhances the R experience by providing a comprehensive set of tools and features specifically designed for data analysis, visualization, and statistical computing. Here’s a quick overview of the RStudio interface:
- Code Editor Window: This is where you write and edit your code. Create a new R script by clicking the green button with a white cross in the top left corner. You can run sections of code by highlighting them and clicking “Run,” or execute the entire script by clicking “Source.”
- R Console Window: This window allows real-time interaction with R. Any code you run from the script editor will appear here, along with errors, results, and print statements.
- Work Space Window: This area displays a list of all objects in your current R session, including variables, data sets, and functions.
- Plot & Files Window: Here, you can view visualizations generated by your code, such as graphs and charts. You can also export or zoom into plots from this window.
Setting Up Your R Environment
Installing Required Packages
R packages are collections of functions, data sets, and documentation bundled together to perform specialized tasks. We’ll need several packages for our DE analysis. These packages are stored on different repositories, primarily CRAN and Bioconductor.
To install the necessary packages, copy and paste the following code into the “R Console” window of RStudio:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("sva", "edgeR", "limma", "Biobase", "biomaRt",
"clusterProfiler", "EnhancedVolcano", "org.Hs.eg.db",
"org.Mm.eg.db", "org.Rn.eg.db"))
install.packages(c("data.table", "readxl", "stringr", "ggplot2", "ggrepel",
"ggfortify", "ggprism", "pheatmap", "VennDiagram",
"corrplot", "Hmisc", "stats", "tidyverse"))
Be patient as the packages install. If prompted for package updates, you can press “N” to skip them.
Loading Installed Packages
Once the packages are installed, load them into your RStudio session. Select the following code and click “Run” in the Code Editor window:
library(data.table)
library(edgeR)
library(limma)
Setting Your Working Directory
Create a folder on your computer for this project and set it as your working directory in R. This ensures all output files will be saved in this location unless specified otherwise. Here are examples for both MacOS and Windows:
# MacOS
setwd("/Users/yourusername/Documents/RNAseq")
# Windows
setwd("C:\Users\yourusername\Documents\RNAseq")
Replace “yourusername” with your actual username.
Data Preparation and Analysis
Reading Data
Start by reading your raw count data:
count_tbl <- fread("counts_raw.tsv", data.table = F)
This command reads a file named “counts_raw.tsv” and stores its contents in a variable called count_tbl
. You can view this data frame by clicking on count_tbl
in the “Work Space” window or by typing View(count_tbl)
in the “R Console” window.
Data Cleaning
Adding Row Names to the Count Table
Transform the first column into row names and remove it from the table:
rownames(count_tbl) <- count_tbl[[1]]
count_tbl <- count_tbl[, -1]
Filtering Low-Expressed Genes
Remove genes with low or no expression:
perc_keep <- 0.8
gene_keep <- rowSums(count_tbl > 0) >= ceiling(perc_keep * ncol(count_tbl))
count_tbl_low_rm <- count_tbl[gene_keep, ]
This code keeps genes that are expressed in at least 80% of the samples.
Creating Metadata
Create a metadata table with information about each sample:
meta <- data.frame(SampleID = colnames(count_tbl),
Treatment = c("KRAS_SPIB", "KRAS_SPIB", "KRAS", "KRAS"))
rownames(meta) <- meta$SampleID
Creating a DGE List
Combine the count data and sample information into a DGEList object:
dge <- DGEList(counts=count_tbl_low_rm, samples = meta)
Normalization
Normalize the data to adjust for technical differences between samples:
dge <- calcNormFactors(dge, method = "TMM")
dge_v <- voom(dge, plot=TRUE)
# save the dge_v object for later use
saveRDS(dge_v, "dge_v.rds")
This process will generate a mean-variance plot to visualize the normalization results.
Creating a PCA Plot
Generate a Principal Component Analysis (PCA) plot to visualize the overall structure of your data:
shape_column <- "Treatment"
color_column <- "Treatment"
label <- TRUE
label_size <- 4
plot_save_name <- "PCA_Plot.pdf"
meta_table <- dge_v$targets
count_table_t <- as.data.frame(t(dge_v$E))
pca_prep <- prcomp(count_table_t, scale. = TRUE)
pca_plot <- autoplot(pca_prep, label, shape = shape_column, data = meta_table, colour = color_column) +
geom_text_repel(aes(label = rownames(meta_table)), size = label_size) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
panel.grid.minor.y=element_blank(),
panel.grid.major.y=element_blank())
ggsave(plot_save_name, device = "pdf", units = "cm", width = 16, height = 14)
A PCA plot (Principal Component Analysis plot) is a graphical representation used in RNA-Seq analysis to visualize the overall structure of the data and to identify patterns such as sample clustering, batch effects, or outliers. In PCA plot, samples from the same treatment group tend to cluster.
Differential Expression Analysis
Perform the differential expression analysis using limma (special characters are not allowed in sample group names):
comparison <- "KRAS_SPIB-KRAS"
design <- model.matrix(~ 0 + dge_v$targets[["Treatment"]])
colnames(design) <- gsub(".*]]", "", colnames(design))
contrast_matrix <- makeContrasts(contrasts = comparison, levels = design)
fit <- lmFit(dge_v, design)
fit_2 <- contrasts.fit(fit, contrast_matrix)
fit_2 <- eBayes(fit_2)
deg_tbl <- topTable(fit_2, coef = colnames(contrast_matrix), n = Inf, p.value=1, lfc=0, sort.by = "p")
fwrite(deg_tbl, "DEG_P1_lfc0.tsv", sep = "\t", row.names = T)
This analysis compares the “KRAS_SPIB” group to the “KRAS” group. The results are stored in deg_tbl
and saved to a file named “DEG_P1_lfc0.tsv” in your working directory.
- logFC stands for log2 Fold Change.
- AveExpr shows the average expression level of the gene across all sample.
- t-statistic is a measure of how strongly the gene is differentially expressed between the conditions.
- The p-value tests the null hypothesis that there is no difference in gene expression between the two groups.
- The adjusted p-value account for multiple testing.
- The B-statistic (log-odds) represents the log-odds that a gene is differentially expressed.
Conclusion
Congratulations! You’ve successfully completed the differential expression analysis of your RNAseq data. The resulting list of differentially expressed genes (DEGs) opens up numerous possibilities for deeper exploration of gene expression patterns and functional enrichment. In future tutorials, we’ll delve into interpreting these results and performing downstream analyses to gain biological insights from your data.
Remember, this is just the beginning of your RNAseq analysis journey. As you become more comfortable with these techniques, you’ll be able to customize and expand upon this workflow to suit your specific research needs.
Comments
2 responses to “How to Analyze RNAseq Data for Absolute Beginners Part 3: From Count Table to DEGs – Best Practices”
In the “Reading Data” step, if there is an additional column titled “GeneID” preceding the “GeneSymbol” column in the”counts_raw.tsv” file, how should the code be modified to generate the “count_tbl” file? This modification is needed because having an extra column will introduce an additional row when creating the Metadata.
I attempted to delete the “GeneID” column, but since some GeneIDs share the same GeneSymbol in the “counts_raw” file, this caused the following error:
Error in `.rowNamesDF<-`(x, value = value) :
duplicate 'row.names' are not allowed
So how to fix this?
Thank you!
If you have a “GeneID” and “GeneSymbol” column and they are the first 2 columns of your count table, follow the steps below.