How to Analyze RNAseq Data for Absolute Beginners Part 3: From Count Table to DEGs – Best Practices

How to Analyze RNAseq Data for Absolute Beginners Part 3: From Count Table to DEGs – Best Practices

By

Lei

As we move forward in our RNAseq analysis journey, we’ll be transitioning from the Linux environment to R, a powerful and versatile statistical analysis tool. R is not only a programming language but also a platform widely used in data science, statistical computing, and predictive modeling. Tech giants like Microsoft, Meta, Google, Amazon, and Netflix rely on R for data analysis, modeling, and machine learning. The good news is that your laptop should be more than capable of running R for differential expression (DE) analysis, data visualization, and other downstream statistical analyses. Let’s dive into the step-by-step process of performing DE analysis in R.

Before we proceed, a quick reminder for our readers: If you haven’t obtained your gene expression matrix yet, don’t worry! We’ve covered this crucial step in our previous posts. Be sure to check out “Part 1” and “Part 2” of our series, where we walk you through the process of quantifying gene expression from raw sequencing data to a count matrix. These posts provide step-by-step guidance on how to process your FASTQ files and generate the gene expression data we’ll be using in this analysis.

Installing R and RStudio

Install R

Installing R is straightforward and similar to installing any other program on your computer:

  1. Visit the official R website.
  2. Download the latest version for your operating system (Windows, Mac, or Linux).
  3. Double-click the installation package and follow the prompts.

Install RStudio

RStudio, a user-friendly graphical interface for R, can be installed in a similar manner:

  1. Go to the RStudio download page.
  2. Download the appropriate version for your system.
  3. Install it like any other application.

RStudio enhances the R experience by providing a comprehensive set of tools and features specifically designed for data analysis, visualization, and statistical computing. Here’s a quick overview of the RStudio interface:

  • Code Editor Window: This is where you write and edit your code. Create a new R script by clicking the green button with a white cross in the top left corner. You can run sections of code by highlighting them and clicking “Run,” or execute the entire script by clicking “Source.”
  • R Console Window: This window allows real-time interaction with R. Any code you run from the script editor will appear here, along with errors, results, and print statements.
  • Work Space Window: This area displays a list of all objects in your current R session, including variables, data sets, and functions.
  • Plot & Files Window: Here, you can view visualizations generated by your code, such as graphs and charts. You can also export or zoom into plots from this window.

Setting Up Your R Environment

Installing Required Packages

R packages are collections of functions, data sets, and documentation bundled together to perform specialized tasks. We’ll need several packages for our DE analysis. These packages are stored on different repositories, primarily CRAN and Bioconductor.

To install the necessary packages, copy and paste the following code into the “R Console” window of RStudio:

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install(c("sva", "edgeR", "limma", "Biobase", "biomaRt", 
                       "clusterProfiler", "EnhancedVolcano", "org.Hs.eg.db", 
                       "org.Mm.eg.db", "org.Rn.eg.db"))

install.packages(c("data.table", "readxl", "stringr", "ggplot2", "ggrepel", 
                   "ggfortify", "ggprism", "pheatmap", "VennDiagram", 
                   "corrplot", "Hmisc", "stats", "tidyverse"))

Be patient as the packages install. If prompted for package updates, you can press “N” to skip them.

Loading Installed Packages

Once the packages are installed, load them into your RStudio session. Select the following code and click “Run” in the Code Editor window:

library(data.table)
library(edgeR)
library(limma)
library(ggplot2)
library(ggrepel)
library(ggfortify)
library(stats)
library(sva)

Setting Your Working Directory

Create a folder on your computer for this project and set it as your working directory in R. This ensures all output files will be saved in this location unless specified otherwise. Here are examples for both MacOS and Windows:

# MacOS
setwd("/Users/yourusername/Documents/RNAseq")

# Windows
setwd("C:\Users\yourusername\Documents\RNAseq")

Replace “yourusername” with your actual username.

Data Preparation and Analysis

Reading Data

If you haven’t combined the count files into a single expression matrix, here’s how to do it in R:

# Retrieve count file paths
file_names <- list.files(count_file_path, pattern = "featureCounts_exon\\.txt$", recursive = T, full.names = T)
  
# Read all count files into a DGElist
dgls <- readDGE(file_names, columns = c(1, 7), skip = 1)
counts_raw <- as.data.frame(dgls$counts)
  
# Rename samples
sample_name <- gsub("_featureCounts_exon", "", basename(colnames(counts_raw)))
colnames(counts_raw) <- sample_name

# Save the combined count table
fwrite(counts_raw, "counts_raw.tsv", sep = "\t")

Start by reading your raw count data:

count_tbl <- fread("counts_raw.tsv", data.table = F)

This command reads a file named “counts_raw.tsv” and stores its contents in a variable called count_tbl. You can view this data frame by clicking on count_tbl in the “Work Space” window or by typing View(count_tbl) in the “R Console” window.

Data Cleaning

Adding Row Names to the Count Table

Transform the first column into row names and remove it from the table:

rownames(count_tbl) <- count_tbl[[1]]
count_tbl <- count_tbl[, -1]

Filtering Low-Expressed Genes

Remove genes with low or no expression:

perc_keep <- 0.8
gene_keep <- rowSums(count_tbl > 0) >= ceiling(perc_keep * ncol(count_tbl))
count_tbl_low_rm <- count_tbl[gene_keep, ]

This code keeps genes that are expressed in at least 80% of the samples.

Creating Metadata

Create a metadata table with information about each sample:

meta <- data.frame(SampleID = colnames(count_tbl),
                   Treatment = c("KRAS_SPIB", "KRAS_SPIB", "KRAS", "KRAS"))
rownames(meta) <- meta$SampleID

Creating a DGE List

Combine the count data and sample information into a DGEList object:

dge <- DGEList(counts=count_tbl_low_rm, samples = meta)

Normalization

Normalize the data to adjust for technical differences between samples:

dge <- calcNormFactors(dge, method = "TMM")
dge_v <- voom(dge, plot=TRUE)
# save the dge_v object for later use
saveRDS(dge_v, "dge_v.rds")

This process will generate a mean-variance plot to visualize the normalization results.

Creating a PCA Plot

Generate a Principal Component Analysis (PCA) plot to visualize the overall structure of your data:

shape_column <- "Treatment"
color_column <- "Treatment"
label <- TRUE
label_size <- 4
plot_save_name <- "PCA_Plot.pdf"

meta_table <- dge_v$targets
count_table_t <- as.data.frame(t(dge_v$E))
pca_prep <- prcomp(count_table_t, scale. = TRUE)

pca_plot <- autoplot(pca_prep, label, shape = shape_column, data = meta_table, colour = color_column) +
  geom_text_repel(aes(label = rownames(meta_table)), size = label_size) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"),
        panel.grid.minor.y=element_blank(),
        panel.grid.major.y=element_blank())

ggsave(plot_save_name, device = "pdf", units = "cm", width = 16, height = 14)

A PCA plot (Principal Component Analysis plot) is a graphical representation used in RNA-Seq analysis to visualize the overall structure of the data and to identify patterns such as sample clustering, batch effects, or outliers. In PCA plot, samples from the same treatment group tend to cluster.

Differential Expression Analysis

Perform the differential expression analysis using limma (special characters are not allowed in sample group names):

comparison <- "KRAS_SPIB-KRAS"

design <- model.matrix(~ 0 + dge_v$targets[["Treatment"]])
colnames(design) <- gsub(".*]]", "", colnames(design))

contrast_matrix <- makeContrasts(contrasts = comparison, levels = design)

fit <- lmFit(dge_v, design)
fit_2 <- contrasts.fit(fit, contrast_matrix)
fit_2 <- eBayes(fit_2)

deg_tbl <- topTable(fit_2, coef = colnames(contrast_matrix), n = Inf, p.value=1, lfc=0, sort.by = "p")
fwrite(deg_tbl, "DEG_P1_lfc0.tsv", sep = "\t", row.names = T)

This analysis compares the “KRAS_SPIB” group to the “KRAS” group. The results are stored in deg_tbl and saved to a file named “DEG_P1_lfc0.tsv” in your working directory.

  • logFC stands for log2 Fold Change.
  • AveExpr shows the average expression level of the gene across all sample.
  • t-statistic is a measure of how strongly the gene is differentially expressed between the conditions.
  • The p-value tests the null hypothesis that there is no difference in gene expression between the two groups.
  • The adjusted p-value account for multiple testing.
  • The B-statistic (log-odds) represents the log-odds that a gene is differentially expressed.

Conclusion

Congratulations! You’ve successfully completed the differential expression analysis of your RNAseq data. The resulting list of differentially expressed genes (DEGs) opens up numerous possibilities for deeper exploration of gene expression patterns and functional enrichment. In future tutorials, we’ll delve into interpreting these results and performing downstream analyses to gain biological insights from your data.

Remember, this is just the beginning of your RNAseq analysis journey. As you become more comfortable with these techniques, you’ll be able to customize and expand upon this workflow to suit your specific research needs.

References

  • Burger T. Fudging the volcano-plot without dredging the data. Nat Commun. 2024 Feb 15;15(1):1392. doi: 10.1038/s41467-024-45834-7. PMID: 38360828; PMCID: PMC10869345.

Comments

20 responses to “How to Analyze RNAseq Data for Absolute Beginners Part 3: From Count Table to DEGs – Best Practices”

  1. QQD Avatar
    QQD

    In the “Reading Data” step, if there is an additional column titled “GeneID” preceding the “GeneSymbol” column in the”counts_raw.tsv” file, how should the code be modified to generate the “count_tbl” file? This modification is needed because having an extra column will introduce an additional row when creating the Metadata.

    I attempted to delete the “GeneID” column, but since some GeneIDs share the same GeneSymbol in the “counts_raw” file, this caused the following error:
    Error in `.rowNamesDF<-`(x, value = value) :
    duplicate 'row.names' are not allowed

    So how to fix this?
    Thank you!

  2. Lei Avatar
    Lei

    If you have a “GeneID” and “GeneSymbol” column and they are the first 2 columns of your count table, follow the steps below.

    # make a gene annotation table
    annot_tbl <- data.frame(GeneID = count_tbl[[1]], GeneSymbol = count_tbl[[2]])
    rownames(annot_tbl) <- annot_tbl[[1]]
    
    # remove the "GeneID" and "GeneSymbol" columns from your count table
    count_tbl <- count_tbl[, -c(1, 2)]
    
    # filter out the low-expressed genes the same way
    
    # update the annotation table
    annot_tbl <- annot_tbl[rownames(count_tbl_low_rm), ]
    
    # Create Metadata table the same way
    
    # Create a DGE List with the annotation table included
    dge <- DGEList(counts=count_tbl_low_rm, genes = annot_tbl, samples = meta)
    
    # The rest steps are all same
    
  3. Elva Avatar
    Elva

    “Error in taglist[[i]] : subscript out of bounds” came out after

    # Read all count files into a DGElist
    dgls <- readDGE(file_names, columns = c(1, 7), skip = 1)
    counts_raw <- as.data.frame(dgls$counts)

    What does this mean? How to fix this?

    Thank you!

    1. Lei Avatar
      Lei

      The R code you quoted processes the quantification tables generated by featureCounts. Specifically, it reads in all count tables, extracts the 1st and 7th columns, skips the first row, and merges them into a single count table.

      You can refer to Tutorial 2 to see an example of a featureCounts-generated count table. If your count tables have a different format, this code may produce errors. In that case, you can manually format your count table to match the expected structure shown in this tutorial before reading it into R.

  4. Megan S Avatar
    Megan S

    I think it would be helpful to include which files are needed from Part 2 to run Part 3. A lot of files were created, some are very large. Personally, I kept running out of space running STAR more than once, so ended up deleting everything except the final “featureCounts_gene.txt”. However, the next page requires “featureCounts_exon.txt” and “counts_file_path”.

    1. Lei Avatar
      Lei

      Hi Megan,

      Sorry for the confusion! The featureCounts_gene.txt and featureCounts_exon.txt files should contain the same content—I likely just named them differently across tutorials. You can double-check by comparing the content of your featureCounts results with the example shown in the tutorial.

      For gene quantification, it’s best not to use a personal computer, as tools like STAR require a lot of RAM and storage. However, all the downstream analyses can be comfortably run on a laptop.

      I’m currently setting up an HPC environment on AWS, which should be a great help for anyone without access to traditional HPC resources. Feel free to subscribe for updates!

  5. Ollie Avatar
    Ollie

    In the adding rownames to counts table, I have:
    rownames(CTRL_vs_EVs_Counts_table) <- CTRL_vs_EVs_Counts_table[[1]]
    this gives me a warning message:
    Warning message:
    Setting row names on a tibble is deprecated.
    Then if I do:
    CTRL_vs_EVs_Counts_table <- CTRL_vs_EVs_Counts_table[, -1]
    I lose the gene symbols entirely. If I don't do CTRL_vs_EVs_Counts_table <- CTRL_vs_EVs_Counts_table[, -1] and skip straight to:
    meta <- data.frame(SampleID = colnames(CTRL_vs_EVs_Counts_table),
    + Treatment = c("CTRL_Count", "W34_Count", "G34_Count", "W62_Count","G62_Count"))
    I get the error message:
    Error in data.frame(SampleID = colnames(CTRL_vs_EVs_Counts_table), Treatment = c("CTRL_Count", :
    arguments imply differing number of rows: 6, 5

    Please let me know how to fix this or what I'm doing wrong

    1. Lei Avatar

      Hi Ollie,

      The count table in the tutorial is read using fread() from the data.table package, which creates a data frame. Please double-check that you’re following those exact steps.

      The error you’re seeing when creating the “meta” table occurs because the “SampleID” and “Treatment” columns have different lengths (i.e., mismatched number of rows).

  6. Chaithra N Avatar
    Chaithra N

    Hi, first of all, thank you for such a wonderful tutorial. I was very much confused before. Currently, I am working on a plant genome, and there’s no GTF file for the same. I have got all novel transcripts in my result. How can I further annotate it?

    1. Lei Avatar

      Hi Chaithra,

      For annotating novel transcripts, I’d recommend Trinotate – it’s designed specifically for de novo transcriptome annotation and integrates multiple annotation sources (BLAST, Pfam, GO terms, etc.) into a single workflow.

      Once you have the Trinotate output, you can create a simple annotation table linking your transcript IDs to gene names and descriptions, which you can then use in your DEG analysis.

      The Trinotate documentation includes a step-by-step tutorial that should get you started: https://github.com/Trinotate/Trinotate

  7. Dieudonne Zongo Avatar
    Dieudonne Zongo

    Hello Lei,

    I am doing an RNAseq analysis and some pathways like the nitrate assimilation pathways that is not expressed in aerobic conditions show high significant DEGs. This is unexpected.

    I am working on E. coli bacterial strains with four conditions (empty vector, Gene of interest, Mutant of gene of interest, and RFP). And when I compared Gene_of_interest vs Vector, these genes are significantly downregulated in my condition Gene of interest. This is completely unexpected because those genes are only expressed in anaerobic condition.

    I filtered count ≥ 10 for all samples before DESeq2 analysis.

    But even on IGV, I saw differential expression with up to 5000 reads in some conditions.

    Is there anyway to make sure that these genes are just misinterpreted by DESeq2? I want to be sure that the difference is real? Or if I have to do other level of filtering.

    Best,

    1. Lei Avatar

      Hi Dieudonne,

      A few quick thoughts:

      Double-check your ‘count ≥ 10’ filtering step: was it applied so that genes needed ≥10 counts overall, or ≥10 in each sample individually? Genes with low or zero counts in some conditions often show spurious significance.

      For pathway analysis: if you used ORA, it’s heavily affected by your DE thresholds. GSEA (using ranked list) tends to be more robust in these cases.

      The big read differences you see in IGV are raw counts—not normalized for sequencing depth—so they don’t necessarily reflect true biological changes. DESeq2 accounts for that normalization.

      Look at your PCA plot: do the conditions separate well? Good clustering by group supports the DE results being reliable.

      To confirm, pick a couple of genes with known/expected expression patterns between conditions and see if their DE results match what you expect.

      1. Dieudonne Zongo Avatar
        Dieudonne Zongo

        Hi Lei,

        Here is the code for the DESeq2 filtering step:

        dds <- DESeqDataSetFromMatrix(
        countData = counts_mat,
        colData = smp,
        design = ~ condition
        )

        keep = 10
        dds <- dds[keep, ]
        message("After prefilter: ", nrow(dds), " genes kept")

        dds <- DESeq(dds)

        vsd <- vst(dds, blind = FALSE)

        norm_counts % rownames_to_column(“Geneid”),
        file.path(out_dir, “normalized_counts.tsv”))

        saveRDS(dds, file.path(out_dir, “dds.rds”))
        saveRDS(vsd, file.path(out_dir, “vst.rds”))

        message(“Available coefficients in dds (resultsNames):”)
        print(resultsNames(dds))

        I did GSEA enrichment because I once try ORA and it was difficult to handle with lots of dependencies.

        However for my PCA plot, the biological replicates cluster together mostly.

        1. Lei Avatar

          Your filtering step is wrong. Look at this part:

          keep = 10
          dds <- dds[keep, ] This subsets your dds object to only row 10 — meaning you're actually keeping just a single gene and passing it to DESeq2. I'm supprised how you carried out GO enrichment analysis with 1 gene. Here is the correct way for filtering for genes with ≥10 counts in all samples: keep <- rowSums(counts(dds) >= 10) == ncol(dds)
          dds <- dds[keep, ]

          1. Dieudonne Zongo Avatar
            Dieudonne Zongo

            sorry, I sent the wrong code to you, here is a portion of the exact code I used:

            # ============================================================
            # RNA-seq (STAR + featureCounts) -> DESeq2
            # Project: PDZ006 / MG1655_FTn10
            # Author: p.zongo
            # ============================================================

            # —————————-
            # BLOC 0 — Packages
            # —————————-
            if (!requireNamespace(“BiocManager”, quietly = TRUE)) install.packages(“BiocManager”)

            cran_pkgs <- c("readr","dplyr","tibble","stringr","ggplot2","pheatmap","RColorBrewer")
            to_install <- setdiff(cran_pkgs, rownames(installed.packages()))
            if (length(to_install)) install.packages(to_install)

            bio_pkgs <- c("DESeq2","apeglm")
            for (p in bio_pkgs) if (!requireNamespace(p, quietly = TRUE)) BiocManager::install(p, ask = FALSE, update = FALSE)

            suppressPackageStartupMessages({
            library(DESeq2)
            library(apeglm)
            library(readr)
            library(dplyr)
            library(tibble)
            library(stringr)
            library(ggplot2)
            library(pheatmap)
            library(RColorBrewer)
            })

            set.seed(42)

            # —————————-
            # BLOC 1 — Paths / outputs
            # —————————-
            counts_file <- "MG1655_FTn10.counts.deseq2.shortnames.tsv"
            samples_file <- "samples_deseq2.tsv"
            annot_file <- "MG1655_FTn10.annotation.tsv" # optionnel

            stopifnot(file.exists(counts_file), file.exists(samples_file))

            ##ts <- format(Sys.time(), "%Y%m%d_%H%M%S")
            out_dir <- file.path("deseq2_out", paste0("run_", ts))
            plot_dir <- file.path(out_dir, "plots")
            dir.create(plot_dir, recursive = TRUE, showWarnings = FALSE)

            # —————————-
            # BLOC 2 — Read inputs
            # —————————-
            message("Reading counts: ", counts_file)
            cts <- read_tsv(counts_file, show_col_types = FALSE)
            stopifnot("Geneid" %in% names(cts))
            stopifnot(!anyDuplicated(cts$Geneid))

            counts_mat %
            column_to_rownames(“Geneid”) %>%
            as.matrix()

            storage.mode(counts_mat) <- "integer"

            message("Reading samples: ", samples_file)
            smp % as.data.frame()
            stopifnot(all(c(“sample”,”condition”,”replicate”) %in% names(smp)))

            # rownames = sample (pratique pour DESeq2)
            rownames(smp) <- smp$sample

            # check matching + reorder
            stopifnot(setequal(colnames(counts_mat), smp$sample))
            smp <- smp[colnames(counts_mat), , drop = FALSE]

            # ordre stable des conditions
            cond_levels <- c("pHERD30T","FinQ","FinQ_HRR","RFP_myc")
            smp$condition <- factor(smp$condition, levels = cond_levels)
            smp$replicate <- factor(smp$replicate, levels = c("1","2","3"))

            # référence
            smp$condition <- relevel(smp$condition, ref = "pHERD30T")

            message("N genes: ", nrow(counts_mat), " | N samples: ", ncol(counts_mat))
            message("Condition counts:")
            print(table(smp$condition))

            # —————————-
            # BLOC 3 — DESeq2 (NE JAMAIS moyenner avant DESeq2)
            # —————————-
            dds <- DESeqDataSetFromMatrix(
            countData = counts_mat,
            colData = smp,
            design = ~ condition
            )

            keep = 10
            dds <- dds[keep, ]
            message("After prefilter: ", nrow(dds), " genes kept")

            dds <- DESeq(dds)

            vsd <- vst(dds, blind = FALSE)

            norm_counts % rownames_to_column(“Geneid”),
            file.path(out_dir, “normalized_counts.tsv”))

            saveRDS(dds, file.path(out_dir, “dds.rds”))
            saveRDS(vsd, file.path(out_dir, “vst.rds”))

            message(“Available coefficients in dds (resultsNames):”)
            print(resultsNames(dds))

          2. Lei Avatar

            It looks like the filtering issue I pointed out hasn’t been fixed yet — please revisit that and apply the correction as outlined in my previous reply.

            Going forward, I’m happy to discuss concepts and interpretation, but debugging individual scripts is outside the scope of what I can support here. For code-specific issues, Bioconductor support and Biostars are excellent resources.

            On a related note — I’m launching a workshop soon covering a comprehensive RNA-seq analysis course, including ready-to-use scripts for downstream statistical analysis and visualization. It sounds like it would be right up your alley. Let me know if you’re interested!

  8. Minh N Tran Avatar
    Minh N Tran

    Hi Lei,

    I have a question with my DESeq2 results for my experiment. We found that many of differentially expressed genes are actually pseudogenes/novel transcripts (GM003, etc). Should this be a technical concern or if there is biological reasoning behind this? (our experiment is between WT mouse vs transgenic mice)

    1. Lei Avatar

      Hi Minh,

      The GM genes in mouse are predicted/novel genes in the Ensembl or NCBI annotation — some are pseudogenes, but many are poorly characterized loci. Seeing them appear as DEGs is not unusual, but it’s worth a quick sanity check.

      Things to check:
      – Low count filtering — are these GM genes passing your pre-filtering threshold (e.g., minimum 10 counts across samples)? Low-count genes are more prone to noisy DE calls (Most common!).
      – Multi-mapping reads — pseudogenes often share high sequence similarity with their parent genes. If your aligner allowed multi-mappers, some counts may be spurious. Check your STAR or HISAT2 alignment flags.
      – Annotation version — make sure your GTF and genome FASTA are from the same source and version to avoid any mismatch artifacts.

      1. Minh N Tran Avatar
        Minh N Tran

        Thank you Lei! I used kallisto quantification as my input for DESeq2 and I have checked for the matching GTF and genome FASTA. So DESeq2 has internal filtering step, do I still need to filter out genes with low read counts?

        1. Lei Avatar

          Hi Minh,

          DESeq2’s internal independent filtering step runs after statistical testing, not before. It removes low-count genes from the final results to improve multiple testing correction power, but those genes still influence the analysis up to that point.

          It’s still good practice to do a pre-filter before running DESeq2 — something like keeping only genes with at least 10 counts across all samples. This reduces noise going into estimation and speeds up the run.

          Once you apply pre-filtering, it’s worth checking whether those GM genes still appear as DEGs or drop out — that’ll tell you a lot about whether they’re signal or noise.

Leave a Reply

Your email address will not be published. Required fields are marked *