How to Analyze RNAseq Data for Absolute Beginners Part 2: From Fastq to Counts – Best Practices

October 19, 2024

Lei

Introduction

The most straightforward way to obtain a count table is to request it directly from your sequencing company or your institution’s sequencing core. This option may involve an additional fee. However, for those eager to learn or save money, let’s walk through the process together.

Before we dive in, a quick reminder: if you haven’t set up your analysis environment yet, check out our previous post, “How to Analyze RNAseq Data for Absolute Beginners Part 1: Environment Setup”. That guide covers the tools and software used below. A properly configured environment is crucial for a smooth analysis workflow, so don’t skip it if you’re starting from scratch!

A Brief Review of RNAseq Technology

In simple terms, the RNAseq process involves:

Extracting mRNAs from subject tissues
Fragmenting the mRNAs into smaller pieces
Sequencing these fragments
Mapping the sequences back to the reference genome
Counting how many of these pieces (reads) correspond to each gene

The number of reads falling within each gene region serves as a measure of that gene’s expression level (before any normalization for sequencing depth or gene length).

Sequencing can be either single-end (sequencing from one end of the mRNA pieces) or paired-end (sequencing from both ends). Paired-end sequencing has become more popular due to its numerous advantages, including higher accuracy, improved quantification, and reduced ambiguity. This tutorial will focus on paired-end RNA sequencing data.

Step 1: Download the Required Files

Before we begin, ensure you have access to a powerful Linux computer, such as your lab server or your institution’s High-Performance Computing (HPC) cluster. A regular laptop won’t suffice for this process.

For this tutorial, we’ll use example data downloaded from NCBI GEO GSE259357, derived from mouse normal pancreas and early pancreatic neoplasia samples.

For those who want to download the FASTQ files for practicing, here is how:

Install sra-tools if you haven’t already done so in the Part 1 tutorial.

# Activate our RNA-seq environment
conda activate rnaseq_env

# Install sra-tools
mamba install sra-tools
mamba update sra-tools

Click on the SRA Run Selector at the bottom of the GSE259357 webpage.
Open a terminal on your Linux system and download every run listed in the “Run” column of the table at the bottom of that page:

# Change the directory
cd ~/Downloads

# Download the FASTQ files
fasterq-dump SRR28119110
fasterq-dump SRR28119111
fasterq-dump SRR28119112
fasterq-dump SRR28119113

Once all the FASTQ files have been downloaded, compress them with:

# Compress the FASTQ files
gzip *.fastq
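The four download commands and the compression step can also be written as one loop. The following is a minimal sketch, not part of the original workflow: `run` and `download_runs` are hypothetical helper names, and the code assumes every run is paired-end, so that fasterq-dump writes `<run>_1.fastq` and `<run>_2.fastq`. Setting `DRY_RUN=1` prints the commands instead of executing them, which is a convenient way to check them first.

```shell
#!/bin/sh
# Illustrative helper: with DRY_RUN=1, print a command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Download each SRA run into an output folder, then gzip the two FASTQ files.
# Assumption: paired-end runs, which fasterq-dump splits into _1 and _2 files.
download_runs() {
    outdir=$1
    shift
    for acc in "$@"; do
        run fasterq-dump "$acc" --outdir "$outdir"
        run gzip "$outdir/${acc}_1.fastq" "$outdir/${acc}_2.fastq"
    done
}

# Example (uncomment to run):
# download_runs ~/Downloads SRR28119110 SRR28119111 SRR28119112 SRR28119113
```

Running the example with `DRY_RUN=1` in front of it lists the eight commands without downloading anything.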

Now download the prebuilt STAR index for mouse (mm10):

Visit refgenie
Download the mm10 STAR Index
Name the folder “star_index_mm10” (avoid spaces in the name)
Remember the folder’s location on your computer

Next, download the corresponding GTF file (gene annotation file):

Go to GENCODE
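Download the GTF file that matches the genome build (for mm10/GRCm38, the commands later in this tutorial use gencode.vM25.annotation.gtf)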

Building A Customized STAR Index

In some cases, you may need to create a custom STAR index using your own genome file. The STAR (Spliced Transcripts Alignment to a Reference) aligner requires this indexed genome for efficient read mapping. Here’s an example command you can use on a Linux system:

# --runThreadN: number of CPU threads for parallel processing
# --limitGenomeGenerateRAM: maximum RAM in bytes (200000000000 = 200GB)
# --runMode genomeGenerate: build a genome index
# --genomeDir: directory where the index will be stored
# --genomeFastaFiles: input genome FASTA file
# --sjdbGTFfile: gene annotation file
STAR \
    --runThreadN 12 \
    --limitGenomeGenerateRAM 200000000000 \
    --runMode genomeGenerate \
    --genomeDir ~/Genome_Index/STAR_GRCm38 \
    --genomeFastaFiles ~/Genome_Index/Genome/GRCm38/GRCm38.primary_assembly.genome.fa \
    --sjdbGTFfile ~/Genome_Index/GTF/GRCm38/gencode.vM25.annotation.gtf

Building a STAR index is a computationally intensive process that requires substantial system resources. The example above allows STAR to use up to 200GB of RAM and 12 CPU threads. The --limitGenomeGenerateRAM value is an upper limit in bytes, not the amount the job necessarily consumes; the STAR manual suggests roughly 32GB for a human-sized genome. For the mouse genome (GRCm38), you can obtain both the genome FASTA file and the corresponding GTF annotation file from the GENCODE database as shown above. Before running the command, modify the file paths to match the actual locations on your system. It’s also advisable to check your system’s available RAM and adjust the --limitGenomeGenerateRAM parameter accordingly to prevent memory-related errors during the indexing process.
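Since --limitGenomeGenerateRAM takes a value in bytes, a small helper can reduce the chance of a miscounted zero. This is only a sketch: `gb_to_bytes` and `available_ram_gb` are hypothetical names, and the second function assumes a Linux system that exposes `MemAvailable` in `/proc/meminfo`.

```shell
# Convert a RAM budget in decimal GB to bytes, matching the
# 200000000000 (= 200GB) convention used in the index command.
gb_to_bytes() {
    echo $(( $1 * 1000000000 ))
}

# Report currently available memory in whole GiB (Linux only).
# /proc/meminfo lists MemAvailable in kB, so divide by 1024 * 1024.
available_ram_gb() {
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1048576 }' /proc/meminfo
}

# Example (uncomment to run):
# available_ram_gb
# STAR --runMode genomeGenerate --limitGenomeGenerateRAM "$(gb_to_bytes 32)" ...
```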

Step 2: Create Folders for Your Files

Use the following commands to create the necessary folders. Replace the paths with your desired locations:

# Create a folder for the STAR Index and GTF file
mkdir -p ~/Tutorials/RNAseq/star_index_mm10
mkdir -p ~/Tutorials/RNAseq/GTF

# Create a folder for the fastq files
mkdir -p ~/Tutorials/RNAseq/GSE259357/raw

# Create folders for the adapter-trimming results
mkdir -p ~/Tutorials/RNAseq/GSE259357/trimmed/SRR28119110

# Create folders for the mapping results
mkdir -p ~/Tutorials/RNAseq/GSE259357/aligned/SRR28119110

To download files directly using the terminal:

# Navigate to the STAR index folder
cd ~/Tutorials/RNAseq/star_index_mm10

# Download the index files
wget file_url

Replace “file_url” with the actual download link (right-click on the file hyperlink and select “copy link”).

To transfer the FASTQ files from the download folder used in Step 1 to the raw folder:

mv ~/Downloads/*.fastq.gz ~/Tutorials/RNAseq/GSE259357/raw

This command moves all files ending with .fastq.gz from ~/Downloads to the raw folder that Step 3 reads from.
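The folder commands in this step cover a single sample. A loop can create the same layout for every sample at once. The sketch below uses a hypothetical function name, `make_sample_dirs`, and assumes the raw/trimmed/aligned layout that the later steps read from.

```shell
# Create raw/, trimmed/<sample>/ and aligned/<sample>/ under one project folder.
# First argument: project folder; remaining arguments: sample IDs.
make_sample_dirs() {
    base=$1
    shift
    mkdir -p "$base/raw"
    for sample in "$@"; do
        mkdir -p "$base/trimmed/$sample" "$base/aligned/$sample"
    done
}

# Example (uncomment to run):
# make_sample_dirs ~/Tutorials/RNAseq/GSE259357 \
#     SRR28119110 SRR28119111 SRR28119112 SRR28119113
```

Because `mkdir -p` ignores folders that already exist, it is safe to rerun this after adding more samples.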

Step 3: Trim the Adapters

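Use the following command to trim adapters from your FASTQ files. The file names below are examples: fasterq-dump names paired files SRR28119110_1.fastq and SRR28119110_2.fastq (with .gz added after compression), so adjust the input names to match the files in your raw folder: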

trim_galore --fastqc --paired --cores 8 \
  ~/Tutorials/RNAseq/GSE259357/raw/SRR28119110_R1_001.fastq.gz \
  ~/Tutorials/RNAseq/GSE259357/raw/SRR28119110_R2_001.fastq.gz \
  -o ~/Tutorials/RNAseq/GSE259357/trimmed/SRR28119110

After trimming, list the output files in the trimmed folder:

ls ~/Tutorials/RNAseq/GSE259357/trimmed/SRR28119110

The trimmed FASTQ files (SRR28119110_R1_001_val_1.fq.gz and SRR28119110_R2_001_val_2.fq.gz) will be used in the next step.
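Before starting the alignment, it can help to confirm that both trimmed files exist. The sketch below derives the output names from the input names using the `_val_1` / `_val_2` pattern shown above. The function names are hypothetical, and the code assumes gzipped input files, for which Trim Galore also writes gzipped output.

```shell
# Predict the name of a validated paired-end output file from Trim Galore.
# Arguments: input FASTQ path, mate number (1 or 2).
trimmed_name() {
    fastq=$(basename "$1")   # drop the directory part
    mate=$2
    stem=${fastq%.gz}        # strip the extensions one at a time
    stem=${stem%.fastq}
    stem=${stem%.fq}
    echo "${stem}_val_${mate}.fq.gz"
}

# Succeed only if both trimmed files exist and are non-empty.
# Arguments: trimmed folder, read 1 input file, read 2 input file.
check_trimmed_pair() {
    dir=$1
    for f in "$dir/$(trimmed_name "$2" 1)" "$dir/$(trimmed_name "$3" 2)"; do
        if [ ! -s "$f" ]; then
            echo "Missing or empty: $f" >&2
            return 1
        fi
    done
}

# Example (uncomment to run):
# check_trimmed_pair ~/Tutorials/RNAseq/GSE259357/trimmed/SRR28119110 \
#     SRR28119110_R1_001.fastq.gz SRR28119110_R2_001.fastq.gz
```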

Step 4: Align to the Reference Genome

Use this command to align your data to the reference genome:

STAR --genomeDir ~/Tutorials/RNAseq/star_index_mm10 \
  --runThreadN 20 --readFilesIn \
  ~/Tutorials/RNAseq/GSE259357/trimmed/SRR28119110/SRR28119110_R1_001_val_1.fq.gz \
  ~/Tutorials/RNAseq/GSE259357/trimmed/SRR28119110/SRR28119110_R2_001_val_2.fq.gz \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMunmapped Within \
  --outSAMattributes Standard \
  --readFilesCommand zcat \
  --outFileNamePrefix ~/Tutorials/RNAseq/GSE259357/aligned/SRR28119110/SRR28119110_trimmed

After alignment, list the output files in the aligned folder:

ls ~/Tutorials/RNAseq/GSE259357/aligned/SRR28119110

The file SRR28119110_trimmedAligned.sortedByCoord.out.bam contains the aligned results needed for the next step.
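STAR also writes a summary file ending in Log.final.out next to the BAM file, using the same output prefix. The sketch below pulls the uniquely mapped percentage out of that file for a quick check without waiting for the full QC report. The function name `unique_map_rate` is hypothetical, and the parsing assumes the usual `label | value` layout of that log.

```shell
# Print the "Uniquely mapped reads %" value from a STAR Log.final.out file.
# Lines in that file look like:  Uniquely mapped reads % |<TAB>80.25%
unique_map_rate() {
    awk -F'|' '/Uniquely mapped reads %/ {
        gsub(/[ \t%]/, "", $2)   # strip whitespace and the percent sign
        print $2
    }' "$1"
}

# Example (uncomment to run):
# unique_map_rate ~/Tutorials/RNAseq/GSE259357/aligned/SRR28119110/SRR28119110_trimmedLog.final.out
```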

Step 5: Quantify Gene Expression

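Use this command to quantify gene expression from the aligned results. Because the data are paired-end, the -p flag is required; in Subread 2.0.2 or later, --countReadPairs additionally tells featureCounts to count fragments (read pairs) rather than individual reads. Drop --countReadPairs on older versions, which do not recognize it:

featureCounts -T 8 -p --countReadPairs -t exon -g gene_name -s 0 \
  -a ~/Tutorials/RNAseq/GTF/gencode.vM25.annotation.gtf \
  -o ~/Tutorials/RNAseq/GSE259357/aligned/SRR28119110/SRR28119110_featureCounts_exon.txt \
  ~/Tutorials/RNAseq/GSE259357/aligned/SRR28119110/SRR28119110_trimmedAligned.sortedByCoord.out.bam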

To view the resulting count table:

less ~/Tutorials/RNAseq/GSE259357/aligned/SRR28119110/SRR28119110_featureCounts_exon.txt

Press ‘q’ to exit the preview. You can also open this file using Microsoft Excel.

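The file starts with a comment line (beginning with “#”) that records the featureCounts command. The columns that follow are “Geneid” (the gene name here, because of -g gene_name), “Chr”, “Start”, “End”, “Strand”, “Length”, and a final column, named after the BAM file, that holds the counts. Because features are counted per exon (-t exon), the “Chr”, “Start”, “End”, and “Strand” cells contain one semicolon-separated value per exon; this is expected. For downstream analysis you only need the first and the last columns.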
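Only the gene identifier and the count are needed downstream, so the table can be reduced to those two columns. The sketch below assumes the standard tab-separated featureCounts layout: one comment line starting with `#`, one header row, then one row per gene with the count in the last column. The function name `extract_counts` is hypothetical.

```shell
# Print "gene<TAB>count" for every gene in a featureCounts output file.
# Skips the leading "#" comment line and the header row.
extract_counts() {
    awk -F'\t' '!/^#/ {
        if (seen_header++) print $1 "\t" $NF   # first non-comment line is the header
    }' "$1"
}

# Example (uncomment to run):
# extract_counts SRR28119110_featureCounts_exon.txt > SRR28119110_counts.tsv
```

The long semicolon-separated “Chr”, “Start”, “End”, and “Strand” cells never appear in the result, which makes it much easier to read.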

Repeat Steps 3-5 for Other Samples

To process all samples efficiently:

Create a text file for each sample (e.g., ~/Tutorials/RNAseq/GSE259357/RNAseq_Quant_SRR28119110.sh) containing the code from steps 3-5.
Execute the file using: bash ~/Tutorials/RNAseq/GSE259357/RNAseq_Quant_SRR28119110.sh
Repeat for all samples.
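Instead of keeping one script per sample, Steps 3-5 can be wrapped in a function that takes the sample ID as an argument. This is a sketch under several assumptions: the folder layout and file-name pattern follow the examples in this tutorial, `run` and `process_sample` are hypothetical helper names, and `-p --countReadPairs` is passed to featureCounts because the data are paired-end (this requires Subread 2.0.2 or later). With `DRY_RUN=1` the commands are printed rather than executed.

```shell
#!/bin/sh
# Paths mirror the ones used in this tutorial; adjust them to your system.
BASE="$HOME/Tutorials/RNAseq/GSE259357"
GENOME_DIR="$HOME/Tutorials/RNAseq/star_index_mm10"
GTF="$HOME/Tutorials/RNAseq/GTF/gencode.vM25.annotation.gtf"

# Illustrative helper: with DRY_RUN=1, print a command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run trimming, alignment and counting for one sample ID.
process_sample() {
    s=$1
    raw="$BASE/raw"
    trimmed="$BASE/trimmed/$s"
    aligned="$BASE/aligned/$s"

    run mkdir -p "$trimmed" "$aligned" || return 1

    # Step 3: trim adapters (input names assume the _R1_001/_R2_001 pattern)
    run trim_galore --fastqc --paired --cores 8 \
        "$raw/${s}_R1_001.fastq.gz" "$raw/${s}_R2_001.fastq.gz" \
        -o "$trimmed" || return 1

    # Step 4: align to the reference genome
    run STAR --genomeDir "$GENOME_DIR" --runThreadN 20 \
        --readFilesIn "$trimmed/${s}_R1_001_val_1.fq.gz" "$trimmed/${s}_R2_001_val_2.fq.gz" \
        --readFilesCommand zcat \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMunmapped Within \
        --outSAMattributes Standard \
        --outFileNamePrefix "$aligned/${s}_trimmed" || return 1

    # Step 5: count read pairs per gene
    run featureCounts -T 8 -p --countReadPairs -t exon -g gene_name -s 0 \
        -a "$GTF" \
        -o "$aligned/${s}_featureCounts_exon.txt" \
        "$aligned/${s}_trimmedAligned.sortedByCoord.out.bam"
}

# Example (uncomment to run):
# for s in SRR28119110 SRR28119111 SRR28119112 SRR28119113; do
#     process_sample "$s" || break
# done
```

Each step stops the function on failure, so a failed trim does not go on to launch a long alignment against missing files.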

Check Data Quality and Mapping Rates

After quantifying all samples, run:

multiqc ~/Tutorials/RNAseq/GSE259357/

This command generates a multiqc_report.html file in the ~/Tutorials/RNAseq/GSE259357/ folder, summarizing the quality of FASTQ files and mapping statistics.

As shown in the image, we’ve achieved excellent alignment rates (about 80%) to the mouse genome.

Conclusion

Congratulations! We’ve successfully quantified gene expression for all samples (4 in this case). With count tables for each sample, we’re now ready to perform differential gene expression analysis. Take a moment to appreciate your progress, and prepare for the next stage of our journey in How to Analyze RNAseq Data for Absolute Beginners (Part 3: From Count Table to DEGs – Best Practices).

Tags:

Adapter Trimming, BAM, Count, FASTQ, Gene Expression Quantification, Reads Mapping, RNAseq analysis

Comments

4 responses to “How to Analyze RNAseq Data for Absolute Beginners Part 2: From Fastq to Counts – Best Practices”

Megan S

April 6, 2025

When looking at my featureCounts preview, I have a GeneID column instead of a GeneCount column. Additionally, there are multiple chromosome, start, end, and strand per cell (i.e. for one gene, under chr, it says “chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr15;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chr16;chrX;chrX;chrX;chrX;chrX;chrY;chrY;chrY;chrY;chrY”)
What could have gone wrong and how do I fix it?

1. Lei
  
  April 7, 2025
  
  Hi Megan,
  
  I’m not sure which exact steps you followed for gene quantification, but if you’ve strictly followed the tutorial, you should have a featureCounts_gene.txt file with content matching what’s shown in the example.
  
  I recommend testing the code using just one sample first to make sure everything works as expected before processing all your samples.
  
Dal

June 3, 2025

Hello,

Step 4

“--readFilesCommand zcat” provided errors, which I was able to get around using
“--readFilesCommand gunzip” instead.

I had to increase the ulimit to 2048 to perform
“--outSAMtype BAM SortedByCoordinate \”

Step 5

“ERROR: Paired-end reads were detected in single-end read library”

I added -p to the code and it ran

-s 0 is the default, is it still important to include? https://subread.sourceforge.net/SubreadUsersGuide.pdf

I have different columns than you described in my count table. My columns are titled “Geneid” “Chr” “Start” “End” “Strand” and “Length”

when I run the “less” command this is what my exon.txt file looks like (in a nut shell):

Xkr4 chr1;chr1;chr1;chr1;chr1;chr1;chr1 3205901;3206523;3213439;3213609;3214482;3421702;3670552 3207317;3207317;3215632;3216344;3216968;3421901;3671498 -;-;-;-;-;-;- 6094 0

Gm18956 chr1 3252757 3253236 + 480 0
Gm37180 chr1 3365731 3368549 - 2819 0
Gm37363 chr1 3375556 3377788 - 2233 0
Gm37686 chr1 3464977 3467285 - 2309 0

Gm1992 chr1;chr1 3466587;3513405 3466687;3513553 +;+ 250 0
Gm19938 chr1;chr1 3647309;3658847 3650509;3658904 -;- 3259 0

Gm37381 chr1;chr1;chr1;chr1;chr1 3905739;3984225;3985160;3985160;3986147 3906134;3984298;3985984;3985351;3986215 -;-;-;-;- 1364 0

Rp1 chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1 3999557;4007656;4019070;4024736;4041888;4092617;4119668;4120015;4142612;4147812;4148612;4163855;4170205;4197534;4206660;4226611;4228443;4231053;4243133;4243417;4243543;4245031;4261527;4267469;4284766;4290846;4292926;4311270;4344146;4351910;4351910;4351910;4352202;4352202;4352202;4360200;4409170;4409170 3999617;4007737;4019148;4024890;4042107;4092780;4119712;4120073;4142766;4147963;4148744;4163941;4170404;4197641;4206837;4226823;4228619;4231144;4243262;4243448;4243619;4245106;4261605;4267620;4284898;4293012;4293012;4311433;4350091;4352081;4352081;4352081;4352837;4352837;4352837;4360314;4409241;4409241 -;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;- 12311 0

Do you have any advice on my last point? It seems like my issue might be similar to Megan’s comment.

Thank you,
-Dal

1. Lei
  
  June 3, 2025
  
  Hi Dal,
  
  The file appears to be correct. Please try opening it in Microsoft Excel. The final column contains the count data. The actual columns you need are the 1st one and the last one.
  
