Glossary

RNA-seq

A-to-I Editing: The conversion of adenosine (A) to inosine (I) in RNA. Inosine is interpreted as guanosine (G) by ribosomes and reverse transcription machinery. This is mediated by the ADAR (Adenosine Deaminase Acting on RNA) family of enzymes.
ADAR Enzymes: Enzymes that catalyze A-to-I editing in double-stranded RNA (dsRNA) regions, often in non-coding regions like introns and untranslated regions (UTRs), but also in coding regions to produce protein variants.
Alu Elements: Short, repetitive sequences found abundantly in primate genomes. These elements often form dsRNA structures, making them hotspots for A-to-I editing.
Alignment: The process of mapping RNA-seq reads to a reference genome or transcriptome to determine the origin and structure of transcripts.
Alternative Splicing: A process by which a single gene can produce multiple RNA isoforms through the inclusion or exclusion of specific exons, leading to the generation of multiple proteins.
Annotation: Information about the genomic features (e.g., gene locations, exons, introns) that helps interpret the RNA-seq data.
APOBEC Enzymes: Enzymes responsible for C-to-U editing, primarily seen in the editing of mRNA for specific proteins such as apolipoprotein B.
Back-Splicing: A unique process in circular RNA biogenesis where a downstream splice donor is joined to an upstream splice acceptor, forming a circular structure.
Base Quality Score: A measure of the accuracy of each nucleotide call in a sequencing read.
Batch Effect: Unwanted variation in data due to technical rather than biological factors, often arising from differences between experimental batches.
Counts: The number of reads aligned to a specific feature (e.g., gene or transcript), used as a measure of expression level.
Coverage: The number of reads that overlap a particular region of the genome or transcriptome, indicating how well that region is represented in the sequencing data.
CPM (Counts Per Million): A normalization method used to account for sequencing depth differences between samples.
C-to-U Editing: The conversion of cytosine (C) to uracil (U) in RNA. This is mediated by APOBEC (Apolipoprotein B mRNA Editing Enzyme, Catalytic Polypeptide-like) enzymes and is less prevalent than A-to-I editing.
DESeq2: A widely used R package for analyzing RNA-seq data to detect differential gene expression between different experimental conditions.
Differential Expression (DE): The process of identifying genes or transcripts whose expression levels significantly differ between conditions (e.g., treated vs. untreated samples).
Differential Splicing Analysis: A method to compare splicing patterns between different conditions or groups (e.g., treated vs. untreated samples) to identify changes in alternative splicing.
Downstream Analysis: Analysis steps that follow the initial processing of RNA-seq data, such as differential expression, pathway analysis, and functional enrichment.
edgeR: An R/Bioconductor package used for differential expression analysis of RNA-seq count data.
Editing Frequency (Editing Ratio): The proportion of RNA molecules at a specific site that are edited.
Editing Sites: Specific nucleotide positions in an RNA molecule where editing occurs. These sites can be identified through comparison of RNA-seq data to the corresponding DNA sequence.
Exon: A segment of a gene that codes for a portion of the final RNA transcript. Exons are retained in the mature mRNA after splicing.
Exon Skipping: An alternative splicing event in which an exon is omitted from the mRNA in some transcripts and included in others.
Expression Level: The abundance of a transcript in the sample, typically measured in counts or TPM (Transcripts Per Million).
False Discovery Rate (FDR): The expected proportion of false positives among the results called significant. Controlling the FDR (e.g., with the Benjamini-Hochberg procedure) is a standard way to correct for multiple hypothesis testing.
Feature: Any element in the genome that is analyzed in RNA-seq, such as a gene, transcript, exon, or intron.
FPKM (Fragments Per Kilobase of transcript per Million mapped reads): A normalization method used in RNA-seq analysis to measure gene expression levels, taking into account both the sequencing depth and the length of the transcript.
Gene Ontology (GO): A framework used for annotating genes and gene products based on their molecular function, biological process, and cellular component.
Gene Set Enrichment Analysis (GSEA): A method for determining whether a set of genes shows statistically significant differences in expression between two biological states.
GTF/GFF File: A file format that describes gene and transcript annotations, providing information about exon-intron structures.
Heatmap: A graphical representation of expression data, often used to visualize the expression levels of many genes across multiple samples.
Homopolymer: A sequence of identical nucleotides repeated consecutively in a stretch of RNA.
Hyper-Editing: A phenomenon where multiple editing events occur in close proximity, typically in regions of dsRNA. Hyper-editing is often observed in repetitive sequences such as Alu elements in primates.
Intron: A non-coding segment of a gene that is removed during RNA splicing and is not present in the mature mRNA.
Intron Retention: An intron is retained in the mRNA instead of being spliced out.
Isoform: Different mRNA molecules produced from the same gene through alternative splicing, leading to variations in the protein produced.
Junction Reads: Sequencing reads that span across splice junctions, providing evidence of splicing events.
Library Preparation: The process of converting RNA into a form that is compatible with sequencing, often involving reverse transcription into cDNA and adapter ligation.
limma: A popular R package used for the analysis of gene expression data, including both microarray and RNA-seq experiments.
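Log Fold Change (logFC): A measure of how much gene expression changes between conditions, usually the base-2 logarithm of the expression ratio, so that a logFC of 1 corresponds to a two-fold increase and -1 to a two-fold decrease.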
Mapped Reads: Reads that have been successfully aligned to the reference genome.
Mapping: The process of aligning RNA-seq reads to a reference genome or transcriptome.
Multimapping Reads: Reads that align to more than one location in the genome, making it difficult to determine their origin.
Mutually Exclusive Exons: An alternative splicing event in which only one of two exons is included in the mRNA, never both.
Normalization: The process of adjusting RNA-seq data to account for differences in sequencing depth or other technical biases, making comparisons between samples meaningful.
Normalization Factor: A scaling factor applied to RNA-seq data to correct for differences in sequencing depth or RNA composition between samples.
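P-Value: The probability of observing a difference at least as extreme as the one measured, assuming the null hypothesis (no true difference) is correct. A small p-value indicates that the data are unlikely under the null hypothesis.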
PCA (Principal Component Analysis): A dimensionality reduction technique used to visualize the variability in high-dimensional RNA-seq data, often used for sample clustering.
Poly-A Tail: A stretch of adenine nucleotides added to the 3′ end of eukaryotic mRNA, often used to enrich for mature mRNAs during RNA-seq library preparation.
PSI (Percent Spliced In): A metric used to quantify exon inclusion levels, typically calculated as the percentage of transcripts including a specific exon.
Quantification: The process of determining the abundance of transcripts in an RNA-seq experiment, usually expressed as counts, FPKM (Fragments Per Kilobase of transcript per Million mapped reads), or TPM (Transcripts Per Million).
Quality Control (QC): The process of assessing the quality of RNA-seq data to detect potential problems (e.g., low sequencing depth, contamination, or batch effects).
Read: A sequence of nucleotides produced by the sequencing process, representing a fragment of RNA.
Read Count: The number of sequencing reads mapping to a gene, exon, or splice junction, used as a measure of gene expression or splicing.
RNA Editing: A post-transcriptional process where specific nucleotides in RNA are altered, leading to changes in the RNA sequence that differ from the DNA template. The most common types are A-to-I (Adenosine-to-Inosine) and C-to-U (Cytosine-to-Uracil) editing.
RNA Isoforms: Different versions of RNA transcripts produced from the same gene through mechanisms like alternative splicing, alternative promoter usage, or alternative polyadenylation. Each isoform can have unique biological functions.
RNA-seq: RNA sequencing, a technique used to study gene expression by sequencing RNA molecules.
RPKM (Reads Per Kilobase of transcript per Million mapped reads): A normalization metric for RNA-seq data that accounts for gene length and sequencing depth.
Single-End Sequencing: RNA-seq where only one end of a fragment is sequenced, as opposed to paired-end sequencing, where both ends are sequenced.
Splice Junction: The boundary between an exon and an intron, where the splicing machinery cuts and joins RNA to remove introns and connect exons.
Splice Variants: Different versions of mRNA produced by a single gene through alternative splicing.
Spliceosome: A complex of proteins and small nuclear RNAs (snRNAs) responsible for removing introns and joining exons during RNA splicing.
Splicing: The process of removing introns and joining exons to produce a mature mRNA transcript.
Splicing Events: Specific types of alternative splicing patterns, such as:
- Exon Skipping: An exon is omitted from the mRNA in some transcripts and included in others.
- Intron Retention: An intron is retained in the mRNA instead of being spliced out.
- Alternative 5′ Splice Site: Use of a different donor site at the 5′ end of an intron, which shifts the 3′ boundary of the upstream exon.
- Alternative 3′ Splice Site: Use of a different acceptor site at the 3′ end of an intron, which shifts the 5′ boundary of the downstream exon.
- Mutually Exclusive Exons: Only one of two exons is included in the mRNA.
STAR: A popular RNA-seq read aligner known for its speed and accuracy.
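Subread Software Suite: A software suite for read alignment and read counting; it includes the featureCounts program for assigning reads to genes and other features.
TPM (Transcripts Per Million): A normalization metric for RNA-seq data that, like RPKM, accounts for transcript length and sequencing depth. Counts are divided by transcript length first and then rescaled so that the values in each sample sum to one million, which makes proportions more comparable between samples.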
Transcript: The RNA product of a gene, which may be spliced and processed into different isoforms.
Transcriptome: The complete set of RNA transcripts produced by the genome, under specific conditions or in specific cells.
Trim Galore: A popular wrapper tool that combines the functionalities of Cutadapt and FastQC to automate quality control and adapter trimming of high-throughput sequencing data.
Trimmomatic: A flexible read-trimming tool for Illumina data that removes adapter sequences and low-quality bases.
UMI (Unique Molecular Identifier): A barcode added to individual RNA molecules before sequencing to help differentiate between technical duplicates and biological duplicates.
Volcano Plot: A scatter plot that displays the significance (p-value) and magnitude of change (fold change) of genes between two conditions, often used in differential expression analysis.
WGCNA (Weighted Gene Co-Expression Network Analysis): A method used to identify clusters (modules) of co-expressed genes and correlate them with external traits.
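
The normalization metrics defined in this section (CPM, FPKM/RPKM, and TPM) differ mainly in the order in which depth and length are corrected. The following is a minimal Python sketch, assuming a single sample with per-gene read counts and transcript lengths in base pairs; the function names and the toy data are hypothetical and are not taken from any particular package.

```python
def cpm(counts):
    """Counts per million: scale each count by the library size."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]


def fpkm(counts, lengths_bp):
    """Fragments per kilobase per million mapped reads.

    Corrects for sequencing depth (total / 1e6) and transcript
    length (length / 1e3). RPKM uses the same formula with reads
    in place of fragments.
    """
    total = sum(counts)
    return [c / (length / 1e3) / (total / 1e6)
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million.

    Length is corrected first; the per-kilobase rates are then
    rescaled so that they sum to one million within the sample.
    """
    rates = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r / scale * 1e6 for r in rates]


# Hypothetical toy data: three genes in one sample.
counts = [100, 300, 600]
lengths_bp = [1000, 2000, 3000]

print("CPM: ", cpm(counts))
print("FPKM:", fpkm(counts, lengths_bp))
print("TPM: ", tpm(counts, lengths_bp))
```

Because TPM values always sum to one million within a sample, a given TPM represents the same share of the transcript pool in every sample, which is not guaranteed for FPKM or RPKM.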

Chromatin Accessibility

ChIP-seq

ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) is used to study protein-DNA interactions on a genome-wide scale.

ChIP-seq: A technique that combines chromatin immunoprecipitation (ChIP) with high-throughput sequencing to identify protein-DNA binding sites.
Chromatin Immunoprecipitation (ChIP): The process of using an antibody to isolate protein-DNA complexes from a sample.
Transcription Factor (TF): A protein that binds to specific DNA sequences to regulate gene expression.
Histone Modification: Covalent modifications to histone proteins (e.g., methylation, acetylation) that influence chromatin structure and gene expression.
Antibody: A protein used in ChIP to specifically bind and immunoprecipitate the protein of interest (e.g., a transcription factor or modified histone).
Crosslinking: The process of chemically linking proteins to DNA (e.g., using formaldehyde) to preserve protein-DNA interactions during ChIP.
Sonication: The process of fragmenting DNA into smaller pieces using ultrasonic waves.
Input DNA: A control sample consisting of sheared, non-immunoprecipitated DNA used to normalize ChIP-seq data.
Library Preparation: The process of preparing DNA fragments for sequencing, including end repair, adapter ligation, and PCR amplification.
Reads: Short sequences of DNA generated by high-throughput sequencing.
Alignment: The process of mapping sequencing reads to a reference genome.
Peak Calling: The computational process of identifying regions of the genome with significantly enriched ChIP-seq signals compared to a control (e.g., input DNA).
False Discovery Rate (FDR): A statistical measure used to control for false positives in peak calling.
Binding Site: A specific genomic region where a protein (e.g., transcription factor) binds to DNA.
Motif: A short, conserved DNA sequence pattern recognized by a protein (e.g., transcription factor binding motif).
Enrichment: The increased abundance of sequencing reads in a specific genomic region, indicating protein-DNA interaction.
BigWig: A file format used to store dense, continuous genomic data (e.g., ChIP-seq signal tracks).
BigBed: A compressed and indexed binary format for efficiently storing and remotely visualizing BED data.
BED File: A file format used to represent genomic regions (e.g., peaks) in a tab-delimited format.
BedGraph: A text format for numerical data, often used for read coverage or signal intensity.
Heatmap: A graphical representation of data where values are depicted by color, often used to visualize ChIP-seq signal intensity across genomic regions.
Differential Binding Analysis: Comparing ChIP-seq signals between conditions to identify changes in protein-DNA interactions.
Footprinting: A method to identify precise protein-DNA interaction sites at single-base resolution using ChIP-seq or related techniques.
Super-enhancer: A cluster of enhancers with exceptionally high levels of transcription factor binding and histone modifications.
Epigenome: The complete set of epigenetic modifications (e.g., DNA methylation, histone modifications) in a cell.
Integrative Analysis: Combining ChIP-seq data with other omics data (e.g., RNA-seq, ATAC-seq) to gain deeper insights into gene regulation.

ATAC-seq

ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) is used to study chromatin accessibility and identify open regions of the genome.

ATAC-seq: A technique that uses a transposase enzyme to insert sequencing adapters into open regions of chromatin, followed by sequencing to identify accessible DNA regions.
Transposase: An enzyme used in ATAC-seq to cut and tag accessible DNA regions with sequencing adapters.
Chromatin Accessibility: The degree to which DNA is open and accessible to proteins, often associated with regulatory elements like promoters and enhancers.
Tn5 Transposase: A hyperactive transposase enzyme commonly used in ATAC-seq to fragment and tag accessible DNA.
Open Chromatin: Regions of the genome that are not tightly packed into nucleosomes and are accessible to transcription factors and other proteins.
Nucleosome-Free Regions (NFRs): Areas of the genome where nucleosomes are absent, often associated with active regulatory elements.
Insertion Site: The location in the genome where the transposase inserts sequencing adapters, indicating regions of chromatin accessibility.
Peak Calling: The process of identifying regions of significant chromatin accessibility from ATAC-seq data.
Footprinting: A method to infer transcription factor binding sites at single-base resolution using ATAC-seq data.
TSS (Transcription Start Site): The region of DNA where transcription of a gene begins, often analyzed in ATAC-seq to assess promoter accessibility.

CUT&RUN

CUT&RUN (Cleavage Under Targets and Release Using Nuclease) is a method to study protein-DNA interactions with high resolution and low background noise.

CUT&RUN: A technique that uses an antibody to target a protein of interest, followed by cleavage of nearby DNA using a micrococcal nuclease (MNase).
Micrococcal Nuclease (MNase): An enzyme used in CUT&RUN to cleave DNA near the protein of interest.
Targeted Cleavage: The process of cutting DNA specifically at sites bound by the protein of interest.
Background Noise: Unwanted signal in sequencing data, which is typically lower in CUT&RUN compared to ChIP-seq.
Protein-DNA Complex: The complex formed by a protein (e.g., transcription factor or histone) bound to DNA, which is targeted in CUT&RUN.
High Resolution: The ability to precisely map protein-DNA interactions at a fine scale, often achieved with CUT&RUN.
Spike-in Control: A control sample (e.g., yeast DNA) added to CUT&RUN experiments to normalize data between samples.
Single-Nucleotide Resolution: The ability to identify protein-DNA interactions at the level of individual nucleotides.

CUT&TAG

CUT&TAG (Cleavage Under Targets and Tagmentation) is a method similar to CUT&RUN but uses a transposase for DNA tagging instead of MNase.

CUT&TAG: A technique that combines antibody targeting of a protein of interest with tagmentation (fragmentation and tagging) of nearby DNA using Tn5 transposase.
Tagmentation: The process of simultaneously fragmenting DNA and adding sequencing adapters using Tn5 transposase.
Tn5 Transposase: The enzyme used in CUT&TAG to fragment and tag DNA near the protein of interest.
Antibody-Tethered Transposase: A modified Tn5 transposase that is directed to specific genomic locations by an antibody.
Low Input Compatibility: The ability of CUT&TAG to work with small amounts of starting material, making it suitable for rare cell types or limited samples.
High Signal-to-Noise Ratio: The high specificity of CUT&TAG in targeting protein-DNA interactions with minimal background noise.
Epigenetic Profiling: The use of CUT&TAG to study histone modifications, transcription factor binding, and other epigenetic features.
Multiplexing: The ability to process multiple samples simultaneously in CUT&TAG by using unique barcodes during tagmentation.

Shared Terms Across Techniques

Sequencing Depth: The total number of reads generated for a sample, or the average number of reads covering each position in a region; it affects the sensitivity and resolution of the analysis.
Reads: Short DNA sequences generated by high-throughput sequencing.
Alignment: The process of mapping sequencing reads to a reference genome.
Peak Calling: Identifying regions of significant signal (e.g., protein binding or chromatin accessibility) in sequencing data.
Motif Analysis: Identifying conserved DNA sequence patterns associated with protein binding or regulatory elements.
BigWig: A file format for storing continuous genomic data (e.g., signal tracks).
BED File: A file format for representing genomic regions (e.g., peaks) in a tab-delimited format.
Heatmap: A graphical representation of data where values are depicted by color, often used to visualize signal intensity across genomic regions.
Differential Analysis: Comparing data between conditions to identify changes in protein binding or chromatin accessibility.
Epigenome: The complete set of epigenetic modifications (e.g., DNA methylation, histone modifications) in a cell.
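
Peaks from any of the techniques above are usually exchanged as BED files. As a minimal sketch, the Python below reads the three required BED columns (chromosome, start, end) and tests two regions for overlap. BED coordinates are 0-based with an exclusive end, so the length of a region is end minus start. The `Interval` type and the function names are hypothetical helpers written for this example.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def parse_bed_line(line):
    """Parse the first three tab-delimited columns of a BED line."""
    fields = line.rstrip("\n").split("\t")
    return Interval(fields[0], int(fields[1]), int(fields[2]))


def length(interval):
    """Region length in base pairs (end is exclusive, so no +1)."""
    return interval.end - interval.start


def overlaps(a, b):
    """True if two intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


# Hypothetical peaks for illustration.
peak_a = parse_bed_line("chr1\t100\t200\tpeak1\n")
peak_b = parse_bed_line("chr1\t150\t250\tpeak2\n")
peak_c = parse_bed_line("chr1\t200\t300\tpeak3\n")

print(length(peak_a))
print(overlaps(peak_a, peak_b))
print(overlaps(peak_a, peak_c))
```

With half-open coordinates, two regions that merely touch (one ends where the next begins) do not overlap, which is why `peak_a` and `peak_c` are reported as separate.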

DNA Sequencing

DNA Methylation

5-methylcytosine (5mC): The methylated form of cytosine, often called the “fifth base” of DNA. The primary epigenetic mark in mammalian genomes.
Beta Value: Methylation level metric ranging from 0 (unmethylated) to 1 (fully methylated), calculated as methylated signal intensity divided by total signal intensity.
Bisulfite Sequencing (BS-seq): Gold standard method for detecting DNA methylation. Bisulfite treatment converts unmethylated cytosines to uracil while methylated cytosines remain unchanged.
CpG Island: Genomic region with high frequency of CpG sites (typically >500 bp with >55% GC content). Often found at gene promoters and usually unmethylated in normal cells.
CpG Site: DNA region where a cytosine nucleotide is followed by a guanine nucleotide, linked by a phosphate (5′-C-phosphate-G-3′). These are the primary sites of DNA methylation in mammals.
Differentially Methylated Position (DMP): Individual CpG site with significant methylation change between conditions.
Differentially Methylated Region (DMR): Genomic region showing significant methylation differences between conditions or sample groups.
DNA Methylation: Chemical modification where a methyl group (CH₃) is added to cytosine bases, primarily at CpG sites, regulating gene expression without changing the DNA sequence.
DNA Methyltransferases (DNMTs): Enzymes that catalyze the addition of methyl groups to cytosine bases. DNMT1 maintains methylation patterns, while DNMT3A/3B establish new patterns.
Hypermethylation: Increased DNA methylation compared to normal, often occurring at promoter CpG islands and associated with gene silencing in cancer.
Hypomethylation: Decreased DNA methylation compared to normal, can lead to genomic instability and activation of normally silenced genes.
Illumina Methylation Arrays (450K/EPIC): Bead chip technology measuring methylation at specific CpG sites (450,000 or 850,000 sites) across the genome. Commonly used in large-scale studies like TCGA.
Methylome: Complete set of DNA methylation modifications in a cell or organism’s genome.
M-value: Alternative methylation metric calculated as log2 ratio of methylated to unmethylated probe intensities. More statistically valid for differential methylation analysis.
Reduced Representation Bisulfite Sequencing (RRBS): Cost-effective method using restriction enzymes to enrich for CpG-rich regions before bisulfite sequencing.
TET Enzymes: Ten-eleven translocation enzymes that oxidize 5-methylcytosine, initiating active DNA demethylation.
Whole Genome Bisulfite Sequencing (WGBS): Comprehensive method combining bisulfite conversion with whole genome sequencing to map methylation at single-base resolution across the entire genome.
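
The Beta Value and M-value entries above describe two views of the same probe intensities. The sketch below computes both in Python, assuming methylated and unmethylated signal intensities are available for one CpG site. The `offset` and `alpha` parameters are assumptions added here to avoid division by zero or log of zero at low intensities; they default to 0 so that the functions match the definitions above, and the function names are hypothetical.

```python
import math


def beta_value(meth, unmeth, offset=0.0):
    """Methylated intensity divided by total intensity (0 to 1).

    `offset` is an optional stabilizing constant added to the
    denominator (an assumption of this sketch).
    """
    return meth / (meth + unmeth + offset)


def m_value(meth, unmeth, alpha=0.0):
    """log2 ratio of methylated to unmethylated intensity.

    `alpha` is an optional constant added to both intensities
    (an assumption of this sketch).
    """
    return math.log2((meth + alpha) / (unmeth + alpha))


def beta_to_m(beta):
    """Convert a beta value to an M-value (valid for 0 < beta < 1)."""
    return math.log2(beta / (1.0 - beta))


# Hypothetical intensities for one CpG site.
meth, unmeth = 3000.0, 1000.0

print(beta_value(meth, unmeth))
print(m_value(meth, unmeth))
print(beta_to_m(beta_value(meth, unmeth)))
```

Without offsets the two metrics are related by M = log2(beta / (1 - beta)), so a beta value of 0.5 corresponds to an M-value of 0.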

Whole Genome/Exome Sequencing

Base Quality Score: Phred-scaled probability that a base call is incorrect (Q30 = 99.9% accuracy).
ClinVar: Public database of relationships between genetic variants and human health conditions.
Copy Number Variant (CNV): Segment of DNA present in variable copy numbers compared to a reference genome, typically >1 kb in size.
Coverage/Depth: Number of times a nucleotide is read during sequencing. Typical WES uses 100x coverage; WGS uses 30-50x for germline variants.
dbSNP: Public NCBI database of short genetic variations, including SNVs and small indels.
Exome: The complete set of exons in a genome. In humans, approximately 180,000 exons comprising ~30 million base pairs.
Exome Capture/Enrichment: Process using biotinylated probes to selectively capture and enrich exonic regions before sequencing.
GATK (Genome Analysis Toolkit): Widely used software suite for variant discovery in high-throughput sequencing data, developed by Broad Institute.
Germline Filtering: Process of removing inherited variants when searching for somatic mutations by comparing tumor and matched normal samples.
Germline Variant: Genetic variation inherited from parents, present in all cells of an organism, passed to offspring.
gnomAD: Genome Aggregation Database containing genetic variation from >140,000 individuals, used to filter common population variants.
Hard Filtering: Simple variant filtering using fixed thresholds for quality metrics (alternative to VQSR for smaller datasets).
Insertion/Deletion (Indel): Addition or removal of nucleotides in the genome, ranging from 1 to ~50 base pairs.
Mapping Quality: Measure of confidence that a read is mapped to the correct genomic location.
Mutect2: GATK tool designed for calling somatic SNVs and indels, most often from tumor-normal pairs.
Pathogenic Variant: Genetic change known or predicted to cause disease.
Single Nucleotide Variant (SNV): Single base pair substitution in DNA sequence. Most common type of genetic variation.
Somatic Variant: Genetic variation acquired during an organism’s lifetime, present only in specific cells (like tumor cells), not inherited.
Structural Variant (SV): Large-scale genomic alterations including deletions, duplications, inversions, and translocations (typically >50 bp).
Tumor Mutational Burden (TMB): Total number of somatic mutations per megabase of examined genomic sequence, used as biomarker for immunotherapy response.
Tumor-Normal Pair: Matched samples from the same patient (tumor tissue and normal/blood) used to distinguish somatic from germline variants.
Variant Allele Frequency (VAF): Proportion of sequencing reads supporting a variant allele at a given position. Important for detecting somatic mutations and tumor heterogeneity.
Variant Annotation: Process of adding biological context to identified variants, including gene names, functional predictions, population frequencies, and clinical significance.
Variant Call Format (VCF): Standard file format for storing gene sequence variations, including SNVs, indels, and structural variants.
Variant Calling: Computational process of identifying differences between sequenced reads and a reference genome.
Variant of Uncertain Significance (VUS): Genetic variant with unclear impact on disease risk or protein function.
Variant Quality Score Recalibration (VQSR): Machine learning approach to filter variant calls based on multiple quality metrics.
Whole Exome Sequencing (WES): Targeted sequencing of all protein-coding regions (exons) of the genome, representing ~1-2% of the genome but containing ~85% of disease-causing mutations.
Whole Genome Sequencing (WGS): Comprehensive method determining the complete DNA sequence of an organism’s genome in a single experiment, including coding and non-coding regions.
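
Several entries in this section are simple ratios or log transforms: Base Quality Score, Variant Allele Frequency, and Tumor Mutational Burden. The Python sketch below works through each one under stated assumptions: quality scores are Phred-scaled, VAF is computed from read counts at one position, and TMB is computed over the number of bases actually examined. All function names and example numbers are hypothetical.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def vaf(alt_reads, total_reads):
    """Fraction of reads at a position that support the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def tmb(n_somatic_mutations, examined_bases):
    """Somatic mutations per megabase of examined sequence."""
    if examined_bases <= 0:
        raise ValueError("examined_bases must be positive")
    return n_somatic_mutations / (examined_bases / 1e6)


# Q30 corresponds to an error probability of 0.001 (99.9% accuracy).
print(phred_to_error_prob(30))

# Hypothetical site: 12 of 48 reads carry the variant.
print(vaf(12, 48))

# Hypothetical tumor: 300 somatic mutations over a 30 Mb exome.
print(tmb(300, 30_000_000))
```

Using the examined (callable) region as the TMB denominator, rather than the full genome size, keeps values comparable between panels, exomes, and genomes of different footprint.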
