13  Transcriptomics

13.1 Beyond the Genome: The Transcriptome

13.1.1 What Is the Transcriptome?

If the genome is all your DNA, the transcriptome is all the RNA being made in your cells right now!

Think of it like:

  • Genome = Your complete cookbook (all possible recipes)

  • Transcriptome = The recipes you’re actually using today

The transcriptome changes:

  • Between different cell types (brain vs. muscle)

  • At different times (morning vs. evening)

  • In different conditions (healthy vs. sick)

Transcriptomics is the study of all RNA molecules in a cell or organism.

13.2 The Three Main Types of RNA

13.2.1 1. mRNA (Messenger RNA)

What it does: Carries the recipe from DNA to ribosomes

Think of it like: A photocopy of a recipe that goes to the kitchen

Key facts:

  • Temporary (breaks down quickly)

  • Only about 1-5% of total RNA

  • Codes for proteins

  • Each mRNA corresponds to one gene (in eukaryotes)

Process:

  1. Transcribed from DNA in the nucleus

  2. Processed (capping, poly-A tail, splicing)

  3. Travels to cytoplasm

  4. Read by ribosomes to make protein

  5. Eventually degrades

13.2.2 2. tRNA (Transfer RNA)

What it does: Brings amino acids to the ribosome during protein synthesis

Think of it like: Delivery trucks bringing ingredients to the kitchen

Key facts:

  • Cloverleaf shape (looks like a three-leaf clover when flat)

  • 3D L-shape structure

  • About 15% of total RNA

  • Each tRNA carries one specific amino acid

  • Has an “anticodon” that matches mRNA codons

How it works:

  1. tRNA picks up its specific amino acid

  2. Recognizes the matching codon on mRNA

  3. Delivers amino acid to growing protein chain

  4. Detaches and goes back for another amino acid

There are different tRNAs for each of the 20 amino acids!

13.2.3 3. rRNA (Ribosomal RNA)

What it does: Forms the structure of ribosomes (protein-making factories)

Think of it like: The factory building itself

Key facts:

  • About 80% of total RNA! (Most abundant)

  • Combines with proteins to form ribosomes

  • Does NOT code for proteins

  • Actually catalyzes protein synthesis (it’s an enzyme!)

  • Very ancient and conserved across all life

Structure:

  • Humans have 4 types of rRNA

  • Combine with ~80 ribosomal proteins

  • Form large and small ribosomal subunits

13.3 RNA Processing in Eukaryotes

13.3.1 From Pre-mRNA to Mature mRNA

In eukaryotes, RNA goes through major changes before it’s ready! The initial RNA copy is called pre-mRNA (precursor mRNA).

13.3.2 1. 5’ Capping

What happens: A special “cap” is added to the beginning (5’ end) of the RNA

The cap:

  • A modified guanine nucleotide

  • Added immediately after transcription starts

  • Like putting a protective cap on a pen

Why it matters:

  • Protects mRNA from degradation

  • Helps ribosome recognize and bind to mRNA

  • Helps transport mRNA out of nucleus

13.3.3 2. 3’ Poly-A Tail

What happens: A long tail of adenine (A) nucleotides is added to the end (3’ end)

The tail:

  • About 200-250 adenines in a row (AAAAAAAA…)

  • Added after transcription finishes

  • Like a protective tail

Why it matters:

  • Protects mRNA from degradation

  • Helps export mRNA from nucleus

  • Helps ribosomes find and bind mRNA

  • Longer tail = longer mRNA lifespan

13.3.4 3. Splicing: Removing Introns

What happens: Introns (non-coding regions) are cut out, exons (coding regions) are joined together

The process:

  1. Pre-mRNA contains both introns and exons

  2. Spliceosome (molecular machine) recognizes intron-exon boundaries

  3. Cuts out introns

  4. Joins exons together

  5. Introns are degraded
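
To make the cut-and-join idea concrete, here is a minimal sketch of splicing as plain string slicing. The toy sequence, the exon coordinates, and the function name `splice` are all made up for illustration; a real spliceosome recognizes sequence signals rather than being handed coordinates.

```python
def splice(pre_mrna, exons):
    """Join exon segments in order; everything between them is intron.

    `exons` is a list of (start, end) positions, 0-based and end-exclusive.
    """
    return "".join(pre_mrna[start:end] for start, end in exons)

# Hypothetical toy pre-mRNA: exons in upper case, introns in lower case.
# Each toy intron starts with "gu" and ends with "ag", like most real introns.
pre_mrna = "AUGGCC" + "guaagucag" + "UUCGGA" + "guacuuag" + "UAA"
exon_coords = [(0, 6), (15, 21), (29, 32)]

mature = splice(pre_mrna, exon_coords)
print(mature)  # AUGGCCUUCGGAUAA
```

Only the upper-case exon letters survive in the mature mRNA; the lower-case introns are gone.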

Think of it like:

  • Filming a movie (transcription)

  • Editing out unwanted scenes (splicing)

  • Final cut ready for theaters (mature mRNA)!

13.4 Alternative Splicing: One Gene, Many Proteins!

13.4.1 The Amazing Discovery

Here’s where things get really interesting! One gene can make multiple different proteins through alternative splicing!

13.4.2 How Alternative Splicing Works

Instead of always joining exons in the same order, cells can:

  • Skip exons: Leave some exons out

  • Choose between mutually exclusive exons: Include one exon or the other, but never both

  • Use alternative splice sites: Cut at different positions

  • Retain introns: Sometimes keep an intron in

Think of it like:

  • You have LEGO bricks numbered 1, 2, 3, 4, 5

  • You could build: 1-2-3-4-5 OR 1-3-4-5 OR 1-2-4-5 OR 1-2-3-5

  • Same bricks, different combinations, different structures!
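
The LEGO picture can be sketched in a few lines of code. This example assumes the simplest case only: the first and last exons are always kept, and any internal exon may be skipped. Real genes are more restricted, so treat this as a counting exercise rather than a model of any particular gene.

```python
from itertools import combinations

exons = [1, 2, 3, 4, 5]
internal = exons[1:-1]  # assumption: first and last exons are always kept

isoforms = []
for n_skipped in range(len(internal) + 1):
    for skipped in combinations(internal, n_skipped):
        # Keep every exon that was not skipped, preserving the original order
        isoforms.append([e for e in exons if e not in skipped])

for iso in isoforms:
    print("-".join(str(e) for e in iso))

print(len(isoforms), "possible isoforms")  # 8 possible isoforms
```

With just three skippable exons there are already 2 × 2 × 2 = 8 combinations, which hints at how one gene can give rise to many products.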

13.4.3 Why Alternative Splicing Is Powerful

Creates protein diversity:

  • Humans have ~20,000 genes

  • But make ~100,000+ different proteins!

  • Alternative splicing explains this!

Examples:

  • DSCAM gene (in fruit flies): Can make 38,000 different proteins from ONE gene!

  • Human titin gene: Makes different versions in different muscle types

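  • Antibody genes: Alternative splicing switches an antibody between its membrane-bound and secreted forms (the huge diversity of antibodies itself comes mainly from DNA rearrangement, called V(D)J recombination)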

Allows fine-tuned regulation:

  • Different tissues can make different protein versions

  • Respond to different conditions

  • Create proteins with slightly different functions

13.4.4 Regulation of Alternative Splicing

Special proteins called splicing factors control which exons are included:

  • Some promote exon inclusion

  • Some promote exon skipping

  • Different cells have different splicing factors

  • Responds to signals (hormones, stress, development)

13.5 Non-Coding RNAs

13.5.1 RNA That Doesn’t Make Protein

We used to think all important RNA coded for proteins. We were wrong!

Many RNAs have functions without making proteins:

13.5.2 microRNAs (miRNAs)

What they are: Tiny RNAs (about 22 nucleotides long)

What they do: Regulate gene expression by:

  • Binding to mRNA

  • Blocking translation OR

  • Causing mRNA degradation

Impact:

  • One miRNA can regulate hundreds of different genes!

  • Important for development, cell division, disease

  • Dysregulation linked to cancer

Think of them like: Volume controls that turn down gene expression

13.5.3 Long Non-Coding RNAs (lncRNAs)

What they are: Long RNA molecules (>200 nucleotides) that don’t code for protein

What they do: Many different functions:

  • Regulate gene expression

  • Organize chromatin

  • Guide proteins to specific DNA locations

  • Some functions still unknown!

Impact:

  • More common than expected (thousands in humans!)

  • Important for development and disease

  • Active area of research

13.5.4 Other Functional RNAs

  • snRNAs: Small nuclear RNAs (part of the spliceosome!)

  • snoRNAs: Small nucleolar RNAs (modify other RNAs)

  • Ribozymes: RNA molecules that act as enzymes

  • siRNAs: Small interfering RNAs (used in research and medicine)

13.6 Studying the Transcriptome

13.6.1 Why Study the Transcriptome?

Understanding gene expression:

  • Which genes are turned on in different cells?

  • How does gene expression change in disease?

  • How do cells respond to signals?

Better than studying genome alone:

  • Genome is essentially the same in almost all of your cells

  • But transcriptome differs between cells!

  • Tells you what’s actually happening

13.6.2 Why RNA is Unstable (And Why We Convert to cDNA)

Before we discuss methods to study RNA, we need to understand an important challenge: RNA is extremely unstable!

The problem with RNA:

  • Chemically unstable: RNA has an extra -OH group at the 2’ position
    • This makes RNA reactive and prone to breaking
    • DNA is more stable (no 2’-OH group)
  • RNases everywhere: Enzymes that degrade RNA (RNases) are literally everywhere!
    • On your skin
    • In the air
    • In cells (lots of them!)
    • Very stable and hard to inactivate

Why so many RNases? An evolutionary defense!

This is actually evolution being clever:

  • Viral defense: Many viruses have RNA genomes
  • Immune system: Our cells have RNases to destroy viral RNA
  • Quality control: Cells need to degrade old or damaged RNA quickly
  • Gene regulation: Controlling RNA degradation = controlling gene expression

Think of it like:

  • Your body treats all external RNA as potentially dangerous
  • RNases are like security guards everywhere
  • They destroy RNA before it can cause problems
  • Unfortunately, this makes working with RNA experimentally very difficult!

The solution: Convert RNA to cDNA

cDNA = Complementary DNA (DNA copy of RNA)

Why cDNA is better:

  • Stable: DNA is chemically stable (no 2’-OH group)
  • RNase-resistant: RNases only cut RNA, not DNA
  • Can be amplified: Use PCR to make many copies
  • Easier to work with: Standard DNA techniques apply
  • Long-term storage: Can store cDNA for years

How RNA → cDNA conversion works:

  1. Extract RNA from cells (work quickly, keep everything cold!)
  2. Add reverse transcriptase enzyme (from retroviruses)
    • This special enzyme makes DNA from RNA template
    • Regular DNA polymerase cannot do this!
  3. Add primers (usually oligo-dT primers that bind to poly-A tail)
  4. Reverse transcriptase synthesizes DNA copy of RNA
  5. Now you have stable cDNA to work with!
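
As a rough sketch of what reverse transcriptase produces, the snippet below builds the first-strand cDNA for a toy mRNA by base-pairing rules alone. The sequence and the function name `first_strand_cdna` are hypothetical, and the real enzyme of course needs a primer, nucleotides, and time.

```python
# Base-pairing rules for copying RNA into DNA (U in RNA pairs with A in DNA)
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}

def first_strand_cdna(mrna):
    """Return the first-strand cDNA, written 5'->3', for an mRNA written 5'->3'."""
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna))

mrna = "AUGGCCUUCAAAAAA"  # toy mRNA ending in a very short poly-A tail
cdna = first_strand_cdna(mrna)
print(cdna)  # TTTTTTGAAGGCCAT
```

Notice that the cDNA begins with a run of T's: that is where an oligo-dT primer would sit, paired with the poly-A tail.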

Think of it like:

  • RNA = fragile ice sculpture (melts quickly)
  • cDNA = photograph of the ice sculpture (stable, permanent)
  • You can’t keep the ice sculpture, but the photo preserves the information!

13.6.3 DNA Microarrays: Measuring Gene Expression

Before next-generation sequencing (NGS) became popular, scientists used microarrays to measure gene expression (Schena et al. 1995; Shalon, Smith, and Brown 1996). Microarrays are still used today for certain applications!

13.6.3.1 What Is a DNA Microarray?

A DNA microarray is like a gene expression chip:

  • Glass slide (like a microscope slide)
  • Thousands of spots, each containing DNA probes for one gene
  • Measures expression of thousands of genes simultaneously!

Think of it like:

  • Each spot = a fishing hook for one specific gene
  • Different colored fish = RNA from different samples
  • Count how many fish each hook catches = measure gene expression!

13.6.3.2 How Microarrays Work

Step 1: Create the array

  • Synthesize or print short DNA sequences (probes) onto glass slide
  • Each spot contains probes for one specific gene
  • Can have 10,000-50,000+ spots on one slide!
  • Each probe is complementary to a specific mRNA

Step 2: Prepare samples

  1. Extract RNA from cells (e.g., normal cells vs. cancer cells)
  2. Convert RNA to cDNA using reverse transcriptase
    • Why? RNA is unstable!
    • cDNA is much more stable and easier to work with
  3. Label cDNA with fluorescent dyes
    • Normal cells: Label with green dye (Cy3)
    • Cancer cells: Label with red dye (Cy5)

Step 3: Hybridization

  • Mix both labeled cDNA samples together
  • Apply to microarray slide
  • Incubate (allow cDNA to bind to matching probes)
  • Wash away unbound cDNA

Step 4: Scan and analyze

  • Scan slide with laser
    • Green laser excites Cy3 (normal sample)
    • Red laser excites Cy5 (cancer sample)
  • Camera captures fluorescent signals
  • Computer measures intensity at each spot

13.6.3.3 Interpreting Microarray Colors

Two-color microarray results:

  • Green spot: Gene higher in normal sample
  • Red spot: Gene higher in cancer sample
  • Yellow spot: Gene equal in both samples (red + green = yellow)
  • Black/dark spot: Gene not expressed in either sample

Example interpretation:

Spot Color    → Meaning
───────────────────────────────────────────
Red           → Gene upregulated in disease
Green         → Gene downregulated in disease
Yellow        → No change in expression
Black         → Gene not expressed

Quantitative data:

  • Actually, you get two intensity values per spot:
    • Green intensity (Cy3)
    • Red intensity (Cy5)
  • Calculate ratio: Red/Green
    • Ratio > 1 = upregulated in cancer
    • Ratio < 1 = downregulated in cancer
    • Ratio ≈ 1 = no change

13.6.4 Quantitative Microarray Analysis: The Details

Now let’s understand HOW we actually measure and interpret the data from microarrays!

13.6.4.1 Measuring Spot Intensities

What the scanner does:

  1. Shine lasers at the microarray slide
    • Green laser (532 nm wavelength) excites Cy3 (green dye)
    • Red laser (635 nm wavelength) excites Cy5 (red dye)
  2. Camera captures emission:
    • Cy3 emits green light → camera records green intensity
    • Cy5 emits red light → camera records red intensity
    • Each spot recorded separately
  3. Software measures intensity:
    • Measures average brightness of each spot
    • Units: fluorescence intensity (arbitrary units, like 1,000 or 50,000)
    • Higher number = more fluorescence = more RNA bound

Think of it like:

  • Shining a flashlight on glow-in-the-dark stickers
  • Brighter glow = more stickers
  • Measure how bright each spot glows!

Example raw data for one spot:

  • Green intensity (Cy3): 8,000 units (normal sample)
  • Red intensity (Cy5): 24,000 units (cancer sample)
  • Background: 200 units (non-specific signal)

13.6.4.2 Calculating Ratios and Fold-Changes

Step 1: Subtract background:

  • Green signal = 8,000 - 200 = 7,800
  • Red signal = 24,000 - 200 = 23,800
  • Background = signal from areas without probes (noise)

Step 2: Calculate ratio:

  • Ratio (R/G) = Red / Green = 23,800 / 7,800 = 3.05
  • This means: Gene is 3-fold upregulated in cancer!

Interpretation:

  • Ratio > 1: Gene higher in red sample (cancer)
  • Ratio < 1: Gene higher in green sample (normal)
  • Ratio ≈ 1: No change (equal expression)

Common cutoffs:

  • Ratio ≥ 2 or ≤ 0.5: Significant change (2-fold or more)
  • Ratio ≥ 1.5 or ≤ 0.67: Moderate change
  • 0.67 < Ratio < 1.5: Usually considered no significant change
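
The arithmetic above is easy to script. This is a minimal sketch using the example numbers from this section; the function names and the exact label strings are made up for illustration, and the cutoffs are the rule-of-thumb fold-change cutoffs listed above, not a statistical test.

```python
def fold_change(red, green, background):
    """Background-subtracted Red/Green ratio for one spot."""
    return (red - background) / (green - background)

def classify(ratio, strong=2.0, moderate=1.5):
    """Label a ratio using the fold-change cutoffs from the text."""
    if ratio >= strong or ratio <= 1 / strong:
        return "significant change"
    if ratio >= moderate or ratio <= 1 / moderate:
        return "moderate change"
    return "no significant change"

ratio = fold_change(red=24_000, green=8_000, background=200)
print(round(ratio, 2), classify(ratio))  # 3.05 significant change
```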

13.6.4.3 Log2 Transformation: Making Data Easier to Work With

The problem with ratios:

  • Upregulation: 2-fold, 3-fold, 4-fold… (numbers get large)
  • Downregulation: 0.5, 0.33, 0.25… (fractions, hard to compare)
  • Not symmetric!

Example:

  • 2-fold up = ratio of 2
  • 2-fold down = ratio of 0.5
  • These should look “equally different” but don’t!

The solution: Log2 transformation

Log2(ratio) = log base 2 of the ratio

How it works:

  • Log2(2) = 1 (2-fold increase)
  • Log2(0.5) = -1 (2-fold decrease)
  • Log2(1) = 0 (no change)
  • Log2(4) = 2 (4-fold increase)
  • Log2(0.25) = -2 (4-fold decrease)

Now upregulation and downregulation are symmetric!

Example calculation:

  • Ratio = 3.05
  • Log2(3.05) = 1.61
  • Interpretation: ~1.6 log2 fold-change (≈ 3-fold upregulation)

Common values:

Ratio    Log2(Ratio)    Fold-Change    Interpretation
───────────────────────────────────────────────────────────
2.0      +1.0           2-fold up      Upregulated
4.0      +2.0           4-fold up      Highly upregulated
0.5      -1.0           2-fold down    Downregulated
0.25     -2.0           4-fold down    Highly downregulated
1.0       0.0           No change      Equal expression

Why log2 is useful:

  • Symmetric: +1 and -1 look equally different from 0
  • Easy to visualize: Heatmaps, scatter plots, volcano plots
  • Standard in genomics: Everyone uses it!
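
You can check the symmetry yourself with Python's built-in `math.log2`. The helper name `log2_fold_change` is just for illustration.

```python
import math

def log2_fold_change(ratio):
    """Convert a Red/Green ratio into a log2 fold-change."""
    return math.log2(ratio)

for ratio in [2.0, 4.0, 0.5, 0.25, 1.0, 3.05]:
    print(f"ratio {ratio:<5} -> log2 = {log2_fold_change(ratio):+.2f}")
```

A 2-fold increase and a 2-fold decrease come out as +1 and -1, the same distance from zero, and the worked example (ratio 3.05) comes out at about +1.61.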

13.6.4.4 Assessing Statistical Significance

Just because a gene has a ratio of 2 doesn’t mean it’s really different! We need statistics to know if it’s real or just random noise.

Key question: Is this change statistically significant?

What affects significance?

  1. Magnitude of change: Bigger change = more likely significant
  2. Replicates: More replicates = more confident
  3. Variability: Less variation between replicates = more confident

Common statistical tests:

1. t-test:

  • Compares two groups (disease vs. normal)
  • Tests if means are significantly different
  • Outputs p-value
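
To see how magnitude, replicates, and variability come together, here is a small sketch that computes a t statistic for one gene. It assumes Welch's version of the t-test (which does not require equal variances); the text does not say which variant is meant, and the replicate values are invented. Turning the statistic into a p-value needs the t distribution, for example via `scipy.stats.ttest_ind(..., equal_var=False)` if SciPy is available.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic and approximate degrees of freedom for two groups."""
    var_a = variance(group_a) / len(group_a)  # squared standard error, group A
    var_b = variance(group_b) / len(group_b)  # squared standard error, group B
    t = (mean(group_a) - mean(group_b)) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(group_a) - 1) + var_b ** 2 / (len(group_b) - 1)
    )
    return t, df

# Hypothetical log2 intensities for one gene, three replicates per group
normal = [10.1, 9.9, 10.0]
disease = [11.6, 11.4, 11.5]

t, df = welch_t(disease, normal)
print(f"t = {t:.1f}, df = {df:.1f}")
```

A large difference between the means combined with tight replicates gives a large t statistic, which corresponds to a small p-value.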

2. ANOVA:

  • Compares multiple groups (>2 conditions)
  • Tests if any groups are different

3. SAM (Significance Analysis of Microarrays):

  • Specialized method for microarray data
  • Accounts for multiple testing problem
  • Very popular for microarrays

P-value interpretation:

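  • p < 0.05: Statistically significant (a result this extreme would show up less than 5% of the time if there were no real difference)
  • p < 0.01: Highly significant (less than 1% of the time if there were no real difference)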
  • p < 0.001: Very highly significant

Multiple testing correction:

  • Problem: Testing 40,000 genes simultaneously!
  • With p < 0.05, we would expect about 2,000 false positives (5% of 40,000) even if no gene truly changed!
  • Solution: Adjust p-values for multiple testing

Common corrections:

  • Bonferroni: Very strict (divide the significance cutoff by the number of tests)
    • New cutoff: 0.05 / 40,000 = 0.00000125
    • Too strict! Misses real changes
  • FDR (False Discovery Rate): Less strict, more practical
    • Controls proportion of false positives
    • FDR < 0.05: Expect 5% false positives among significant genes
    • Most commonly used!
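
Both corrections fit in a few lines. The text does not name a specific FDR procedure, so this sketch assumes the Benjamini-Hochberg method, which is the usual default; the five p-values are invented, and a real analysis would use a tested library function instead.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (a common way to control FDR)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-values get smaller
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p_values = [0.001, 0.008, 0.039, 0.041, 0.60]  # hypothetical, one per gene
bonf = bonferroni(p_values)
fdr = benjamini_hochberg(p_values)
for p, b, f in zip(p_values, bonf, fdr):
    print(f"p = {p:<6} Bonferroni = {b:.3f}  FDR = {f:.4f}")
```

Multiplying each p-value by the number of tests is the same as dividing the cutoff by it. In this toy example Bonferroni keeps two genes below 0.05, and the FDR adjustment also keeps two, but with much smaller adjusted values; across thousands of genes that gap is why FDR usually recovers more real changes.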

Example result:

  • Gene X: Ratio = 3.0, Log2 = 1.58, p = 0.002, FDR = 0.03
  • Interpretation: Gene X is significantly upregulated 3-fold, with high confidence!

13.6.4.5 Visualizing Microarray Data

1. Scatter plot (MA plot):

  • X-axis: Average intensity (A = average of log2 red and log2 green)
  • Y-axis: Log2 ratio (M = log2 red - log2 green)
  • Shows which genes change and their expression level
  • Horizontal line at Y=0 = no change
  • Points above = upregulated
  • Points below = downregulated
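
The two coordinates of an MA plot follow directly from the definitions above. This is a minimal sketch for a single spot; the function name `ma_values` and the intensities are made up for illustration.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    m = math.log2(red) - math.log2(green)          # log2 ratio (Y-axis)
    a = 0.5 * (math.log2(red) + math.log2(green))  # average log2 intensity (X-axis)
    return m, a

m, a = ma_values(red=4096, green=1024)
print(m, a)  # 2.0 11.0
```

Here M = 2 means the spot is 4-fold higher in the red channel, and A = 11 tells you it is a fairly bright spot overall.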

2. Volcano plot:

  • X-axis: Log2 fold-change
  • Y-axis: -log10(p-value)
  • Shows both magnitude and significance
  • Top corners = large change AND significant (best genes!)
  • Middle = not significant
  • Far left/right but low = big change but not significant (noisy)

3. Heatmap:

  • Rows = genes
  • Columns = samples
  • Colors = expression level
    • Red = high expression
    • Green = low expression
    • Black/Yellow = medium
  • Reveals patterns across many genes and samples
  • Often includes clustering (groups similar genes/samples)

Think of heatmap like:

  • Weather map showing temperatures
  • Red = hot (high expression)
  • Blue/Green = cold (low expression)
  • Patterns emerge visually!

13.6.4.6 From Scanner to Results: Complete Workflow

1. Scan microarray:

  • Get images (green and red channels)
  • Raw fluorescence intensities

2. Grid alignment:

  • Software identifies each spot
  • Draws circles/grids around spots
  • Measures intensity inside each spot

3. Background subtraction:

  • Measure intensity between spots (background)
  • Subtract from each spot intensity
  • Corrects for non-specific signal

4. Normalization (covered in next section):

  • Correct for technical variations
  • Make arrays comparable

5. Calculate ratios:

  • Red / Green for each spot
  • Log2 transform

6. Statistical analysis:

  • Identify significantly changed genes
  • Apply multiple testing correction
  • FDR < 0.05 typically

7. Biological interpretation:

  • What do these genes do?
  • Are they related (pathways, functions)?
  • Can we validate with other methods?

Example: Finding Cancer Biomarkers

Experiment:

  • Compare tumor samples (n=10) vs. normal tissue (n=10)
  • Each sample hybridized to microarray
  • 40,000 genes tested

Results:

  • 2,500 genes significantly different (FDR < 0.05)
  • 1,200 upregulated in tumors
  • 1,300 downregulated in tumors

Top hit:

  • EGFR (Epidermal Growth Factor Receptor)
  • Ratio = 8.5 (8.5-fold upregulated)
  • Log2 = 3.09
  • p < 0.0001
  • FDR < 0.001
  • Conclusion: EGFR highly overexpressed in tumors! Potential drug target!

13.6.4.7 Single-Color vs. Two-Color Microarrays

Two-color microarrays:

  • Compare two samples directly on same array
  • Use two fluorescent dyes (e.g., Cy3 and Cy5)
  • Measures relative differences
  • Good for comparing disease vs. normal

Single-color microarrays (e.g., Affymetrix):

  • One sample per array, one dye
  • Measure absolute intensity for each sample separately
  • Compare results across multiple arrays
  • Better for comparing many samples

13.6.4.8 Microarray Applications

Gene expression profiling:

  • Compare disease vs. healthy tissue
  • Identify disease biomarkers
  • Classify cancer subtypes

Drug response:

  • Which genes change when drug is added?
  • Predict drug efficacy
  • Identify drug targets

Developmental biology:

  • How does gene expression change during development?
  • Track changes over time

Example - Cancer classification:

  1. Extract RNA from tumor sample
  2. Convert to cDNA, label with fluorescent dye
  3. Hybridize to microarray
  4. Compare expression pattern to database
  5. Classify tumor type based on gene expression profile!

13.6.5 Microarray Normalization and Housekeeping Genes

Before we can compare microarray data, we need to normalize it - correct for technical differences that aren’t biological!

13.6.5.1 Why Normalization Is Needed

The problem: Technical variations between arrays make comparison difficult

Sources of variation (non-biological):

  1. Different amounts of RNA loaded:
    • Sample A: loaded 5 µg RNA
    • Sample B: loaded 7 µg RNA
    • Sample B will have higher signal, but NOT because genes are more expressed!
  2. Different labeling efficiency:
    • Cy3 dye might label better than Cy5
    • Or vice versa
    • Creates systematic bias
  3. Scanner settings:
    • Laser power varies slightly
    • PMT (photomultiplier tube) sensitivity differences
    • Different scan times
  4. Slide differences:
    • Spot sizes vary
    • Amount of probe printed varies
    • Slide-to-slide variation

Think of it like:

  • Comparing test scores from different teachers
  • One teacher grades easier (higher scores)
  • Need to normalize to compare fairly!

Goal of normalization: Remove technical variation, keep biological variation

13.6.5.2 Housekeeping Genes: Internal Controls

Housekeeping genes = Genes that are always expressed at constant levels

Why they’re called “housekeeping”:

  • Like doing housework - always needs to be done!
  • Essential for basic cell survival
  • Should NOT change between conditions

Common housekeeping genes:

1. GAPDH (Glyceraldehyde-3-phosphate dehydrogenase):

  • Enzyme in glycolysis (energy production)
  • Most commonly used!
  • Assumed to be the same in cancer vs. normal (check this in your own data!)
  • High expression (easy to detect)

2. ACTB (β-Actin):

  • Structural protein (part of cytoskeleton)
  • Always expressed
  • Very abundant

3. TUBB (β-Tubulin):

  • Another structural protein
  • Forms microtubules
  • Constantly needed

4. HPRT1 (Hypoxanthine Phosphoribosyltransferase):

  • Enzyme in the purine salvage pathway (recycles building blocks for nucleotides)
  • More stable than GAPDH in some conditions

5. 18S rRNA:

  • Ribosomal RNA component
  • VERY abundant
  • Very stable

How housekeeping genes are used:

Method 1: As normalization reference:

  1. Measure housekeeping gene expression
  2. Should be same in both samples (disease vs. normal)
  3. If not, normalize so they ARE the same
  4. Apply same normalization to all other genes

Example:

  • Sample A (normal): GAPDH intensity = 10,000
  • Sample B (cancer): GAPDH intensity = 15,000
  • Ratio = 1.5 (but GAPDH shouldn’t change!)
  • Problem: Sample B has 1.5× higher signal overall (technical artifact)
  • Solution: Divide all Sample B intensities by 1.5
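
The same correction can be written as a short script. The GAPDH values come from the example above; GENE_X and GENE_Y and their intensities are hypothetical, added only to show the factor being applied to other genes.

```python
def housekeeping_factor(reference_value, sample_value):
    """How many times brighter the housekeeping gene is in the sample."""
    return sample_value / reference_value

factor = housekeeping_factor(reference_value=10_000, sample_value=15_000)

# Hypothetical Sample B intensities (only the GAPDH value is from the text)
sample_b = {"GAPDH": 15_000, "GENE_X": 30_000, "GENE_Y": 6_000}
normalized_b = {gene: value / factor for gene, value in sample_b.items()}

print(factor)        # 1.5
print(normalized_b)  # GAPDH is now 10,000, matching Sample A
```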

Method 2: As positive controls:

  • If housekeeping genes show big changes → something went wrong!
  • Might indicate:
    • Poor RNA quality
    • Technical problems
    • Bad normalization
    • Need to repeat experiment

Important note: Housekeeping genes aren’t perfect!

  • Can change in some conditions (stress, cell cycle, etc.)
  • Always use multiple housekeeping genes
  • Validate that they’re actually stable in YOUR experiment

13.6.5.3 Common Normalization Methods

1. Global normalization (total intensity):

  • Assume: Total amount of RNA is same in all samples
  • Method: Make total signal equal across all arrays
  • Scale factor = Average total intensity / This array’s total intensity
  • Multiply all spots by scale factor

Example:

  • Array 1 total: 1,000,000
  • Array 2 total: 1,200,000
  • Average: 1,100,000
  • Scale Array 1 by: 1,100,000 / 1,000,000 = 1.1
  • Scale Array 2 by: 1,100,000 / 1,200,000 = 0.917

Pros: Simple, fast

Cons: Assumes most genes don’t change (not true if many change!)
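
Here is the scale-factor calculation from the example as a minimal sketch. The function name `global_scale_factors` is made up for illustration.

```python
def global_scale_factors(array_totals):
    """Scale factor per array = average total intensity / that array's total."""
    target = sum(array_totals) / len(array_totals)
    return [target / total for total in array_totals]

factors = global_scale_factors([1_000_000, 1_200_000])
print([round(f, 3) for f in factors])  # [1.1, 0.917]
```

Multiplying every spot on an array by its factor brings both arrays to the same total of 1,100,000.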

2. Housekeeping gene normalization:

  • Use genes that shouldn’t change (GAPDH, actin, etc.)
  • Make these equal across arrays
  • Scale other genes proportionally

Pros: Biological control

Cons: Requires knowing which genes are stable

3. Quantile normalization:

  • Make distribution of intensities identical across arrays
  • Statistical method
  • Most robust

Pros: Works well, widely used

Cons: More complex, requires software
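
The idea behind quantile normalization is easier to see in code than in words. This is a bare-bones sketch on two tiny made-up arrays; it ignores tied values, which real implementations handle more carefully.

```python
def quantile_normalize(arrays):
    """Give every array the same distribution of values.

    `arrays` is a list of equal-length lists, one list per microarray.
    Ties are not treated specially in this sketch.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)
    ]
    normalized = []
    for a in arrays:
        ranks = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, i in enumerate(ranks):
            out[i] = reference[rank]  # replace each value by the reference value of its rank
        normalized.append(out)
    return normalized

arrays = [[5.0, 2.0, 3.0, 4.0], [8.0, 1.0, 6.0, 3.0]]  # hypothetical intensities
normalized = quantile_normalize(arrays)
print(normalized)  # [[6.5, 1.5, 3.0, 5.0], [6.5, 1.5, 5.0, 3.0]]
```

Each gene keeps its rank within its own array, but both arrays now contain exactly the same set of values.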

4. Loess normalization (for two-color arrays):

  • Corrects intensity-dependent dye bias
  • Cy3 and Cy5 might behave differently at high/low intensities
  • Smooths out the bias

Pros: Corrects specific dye biases

Cons: Only for two-color arrays

13.6.5.4 Spike-In Controls

Spike-ins = Known amounts of synthetic RNA added to samples

How they work:

  1. Add same amount of synthetic RNA to all samples
  2. These RNAs don’t exist in your cells
  3. Sequence is known
  4. Probes for them on the microarray
  5. Should give same signal in all samples
  6. Use to normalize!

Think of spike-ins like:

  • Adding same amount of salt to different soups
  • Taste the salt level (measure signal)
  • Should be same in all soups
  • If not, something’s wrong with your measurement!

Common spike-in controls:

  • ERCC spike-ins: External RNA Controls Consortium
    • 92 synthetic RNAs
    • Different concentrations
    • Cover wide range of expression levels
    • Standard set used in many studies

Advantages of spike-ins:

  • Absolute control (know exact amount)
  • Independent of biological variation
  • Can assess technical reproducibility

Disadvantages:

  • Extra cost
  • Need to add carefully (pipetting error)
  • One more thing that can go wrong!

13.6.5.5 Quality Control: Checking Your Data

Before analysis, check data quality:

1. Check housekeeping genes:

  • Should be similar across all samples
  • If vary >2-fold → problem!

2. MA plot:

  • Should be centered around 0
  • If skewed → normalization needed

3. Box plots:

  • Distribution of intensities across arrays
  • Should be similar after normalization

4. Correlation between replicates:

  • Technical replicates should be very similar (R > 0.99)
  • Biological replicates should be similar (R > 0.95)
  • If not → outliers, technical problems
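
A replicate check like this takes only a few lines. The sketch below computes the Pearson correlation (the R in the thresholds above, assuming that is the correlation meant) for two invented replicates measured on five genes; real arrays would have thousands of values.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical log2 intensities for the same five genes in two replicates
rep1 = [8.1, 10.2, 12.0, 6.5, 9.3]
rep2 = [8.0, 10.4, 11.9, 6.7, 9.2]

r = pearson_r(rep1, rep2)
print(f"R = {r:.3f}")
```

These two replicates agree closely, so R comes out above 0.99.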

5. Check positive controls:

  • Spike-ins should work
  • Known upregulated genes should show up

6. Check negative controls:

  • Spots with no probe
  • Should have low/zero signal
  • High signal → contamination or high background

Example QC check:

Before normalization:

  • Array 1 median intensity: 5,000
  • Array 2 median intensity: 8,000
  • GAPDH Array 1: 10,000
  • GAPDH Array 2: 16,000
  • Problem: Array 2 systematically higher!

After normalization:

  • Array 1 median intensity: 6,000
  • Array 2 median intensity: 6,000
  • GAPDH Array 1: 12,000
  • GAPDH Array 2: 12,000
  • Good: Now comparable!

13.6.5.6 Limitations of Microarrays

Prior knowledge required:

  • Can only detect genes you put probes for
  • Cannot discover new genes or transcripts
  • Limited to known sequences

Relative quantification only:

  • Measures relative differences (fold-change)
  • Cannot tell absolute number of RNA molecules
  • Limited dynamic range

Background noise:

  • Non-specific binding creates noise
  • Cross-hybridization between similar sequences

Limited sensitivity:

  • Hard to detect low-abundance transcripts
  • Saturation at high expression levels

13.6.5.7 Microarrays vs. RNA-seq

Feature            Microarray              RNA-seq
──────────────────────────────────────────────────────────────────────
Prior knowledge    Required                Not required
New transcripts    Cannot detect           Can discover
Quantification     Relative                Absolute possible
Dynamic range      Limited (~3 orders)     Wide (~5 orders)
Cost               Lower for targeted      Lower for discovery
Sensitivity        Moderate                High
Applications       Known genes, routine    Discovery, novel transcripts

When to use microarrays:

  • Focused gene expression studies
  • Routine clinical diagnostics
  • Cost-effective for specific gene panels
  • Comparing expression of known genes

When to use RNA-seq:

  • Discovering new transcripts
  • Detecting splice variants
  • Measuring absolute expression levels
  • Studying organisms without reference genome

13.6.6 Absolute vs. Relative Quantification

This is a fundamental difference between microarrays and RNA-seq!

13.6.6.1 What Is Relative Quantification?

Relative quantification = Comparing expression BETWEEN samples

  • Can say: “Gene X is 3-fold higher in sample A vs. sample B”
  • CANNOT say: “Sample A has 1,000 copies of Gene X”

Think of it like:

  • “This box is twice as heavy as that box” (relative)
  • But can’t say “This box weighs 5 kg” (absolute)

Microarrays do relative quantification:

  • Measure fluorescence intensities
  • Calculate ratios between samples
  • Tell you fold-change (up or down)
  • Cannot tell you absolute number of RNA molecules!

Why microarrays can’t do absolute quantification:

  1. Fluorescence intensity is arbitrary:
    • Intensity of 10,000 units doesn’t mean 10,000 RNA molecules
    • Depends on:
      • Laser power
      • Dye efficiency
      • Probe binding efficiency
      • Scanner settings
  2. Different probes behave differently:
    • Some probes bind tightly (high signal)
    • Some probes bind weakly (low signal)
    • Can’t compare intensity across different genes directly
  3. No external standard:
    • Don’t know relationship between intensity and molecule count
    • Like having a thermometer with no numbers!

What microarrays CAN tell you:

  • Gene A is 2-fold upregulated in disease
  • Gene B is 5-fold downregulated in disease
  • Gene A changed more than Gene C

What microarrays CANNOT tell you:

  • How many copies of Gene A are in the cell?
  • Is Gene A more abundant than Gene B?
  • Absolute concentration of RNA

13.6.6.2 What Is Absolute Quantification?

Absolute quantification = Measuring actual number (or concentration) of RNA molecules

  • Can say: “Sample A has ~1,000 copies of Gene X per cell”
  • Can say: “Gene X is more abundant than Gene Y”

RNA-seq can estimate absolute expression:

  • Count how many reads map to each gene
  • Normalize by:
    • Gene length (longer genes get more reads)
    • Sequencing depth (total reads in sample)
  • Estimate relative abundance

Common RNA-seq metrics:

1. Raw counts:

  • Number of reads mapping to gene
  • NOT absolute! Depends on sequencing depth
  • Example: Gene A = 5,000 reads

2. RPKM (Reads Per Kilobase per Million mapped reads):

  • Normalizes for gene length AND sequencing depth
  • RPKM = (Reads × 10⁹) / (Gene length in bp × Total reads)
  • Used for single-end sequencing

3. FPKM (Fragments Per Kilobase per Million):

  • Same as RPKM but for paired-end sequencing
  • Counts fragments (pairs) not individual reads

4. TPM (Transcripts Per Million):

  • Similar to FPKM but better for comparison
  • Normalizes differently (gene length first, then depth)
  • Preferred metric today!
  • Sum of all TPM values in a sample = 1,000,000

Why TPM is better than FPKM:

  • TPM values are easier to compare across samples
  • Sum to same total (1 million)
  • Like percentages that always add to 100%!
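
To see the difference between the metrics, here is a minimal sketch that computes RPKM and TPM for a pretend transcriptome of just three genes. The counts and gene lengths are invented, and the function names are for illustration only.

```python
def rpkm(counts, lengths_bp):
    """RPKM = (reads x 10^9) / (gene length in bp x total reads)."""
    total_reads = sum(counts)
    return [c * 1e9 / (length * total_reads) for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """TPM: correct for gene length first, then rescale so values sum to 1,000,000."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]  # reads per base
    scale = 1e6 / sum(rates)
    return [rate * scale for rate in rates]

# Hypothetical three-gene "transcriptome"
counts = [1_500, 300, 8_200]
lengths = [2_000, 1_000, 4_000]

print([round(v) for v in rpkm(counts, lengths)])
print([round(v) for v in tpm(counts, lengths)])

# Doubling the sequencing depth doubles every count but leaves TPM unchanged
deeper = [2 * c for c in counts]
print([round(v) for v in tpm(deeper, lengths)])
```

The TPM values always add up to one million, and they stay the same when every count doubles, which is the property that makes them convenient to compare.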

Example:

Sample 1 (30 million reads):

  • Gene A: 1,500 reads, TPM = 50
  • Gene B: 300 reads, TPM = 10

Sample 2 (60 million reads):

  • Gene A: 3,000 reads, TPM = 50
  • Gene B: 600 reads, TPM = 10

Interpretation:

  • Gene A has same TPM → same relative abundance
  • Even though raw counts doubled (because sequencing depth doubled)
  • TPM corrects for this!

Can RNA-seq give TRUE absolute counts?

Not quite, but close:

  • TPM/FPKM are proportional to true abundance
  • With spike-ins, can estimate molecules per cell
  • More absolute than microarrays!

But still not perfect:

  • PCR bias (some sequences amplify better)
  • Mapping bias (repetitive sequences hard to map)
  • Gene length effects

13.6.6.3 Comparison: Absolute vs. Relative

Feature                        Microarray                RNA-seq
──────────────────────────────────────────────────────────────────────────────────
Type                           Relative only             Semi-absolute (TPM/FPKM)
Can compare within sample?     No (across genes)         Yes!
Can compare across samples?    Yes (fold-change)         Yes (fold-change AND TPM)
Absolute molecule count?       No                        Estimate (with spike-ins)
Compare Gene A to Gene B?      No                        Yes (via TPM)
Units                          Arbitrary fluorescence    Normalized counts (TPM)

When absolute quantification matters:

  1. Comparing different genes:
    • Is Gene A or Gene B more abundant?
    • Microarray can’t tell, RNA-seq can!
  2. Meta-analysis:
    • Combining data from multiple studies
    • TPM values more comparable than microarray intensities
  3. Mathematical modeling:
    • Need actual concentrations
    • TPM can be converted to estimates
  4. Therapeutic targeting:
    • Is gene expressed enough to target with drug?
    • Need to know abundance, not just fold-change

Example use case:

Question: Should we target Gene X or Gene Y for cancer therapy?

Microarray says:

  • Both upregulated 5-fold in cancer
  • Equally good targets? 🤷

RNA-seq says:

  • Gene X: TPM = 500 (highly expressed!)
  • Gene Y: TPM = 5 (barely expressed)
  • Gene X is better target! (100× more abundant)

This is information microarrays cannot provide!

13.6.7 Expression Analysis Workflows: From Sample to Discovery

Let’s walk through a complete gene expression analysis from start to finish!

13.6.7.1 Study Design: Disease vs. Normal

Research question: What genes are different in breast cancer vs. normal tissue?

Experimental design:

  • Samples:
    • 10 breast cancer tumors
    • 10 normal breast tissue (from same patients or matched controls)
  • Biological replicates: 10 per group (more is better!)
    • Why replicates? Biological variation between patients
    • Need enough to see consistent patterns
  • Technical considerations:
    • Collect tissue same way (fresh frozen or FFPE)
    • Extract RNA with same protocol
    • Check RNA quality (RIN score > 7)
    • Process all samples together, or spread cancer and normal samples evenly across batches (avoids batch effects!)

13.6.7.2 Step-by-Step Workflow

Step 1: RNA Extraction

  • Homogenize tissue (break cells open)
  • Use TRIzol or column-based extraction
  • Critical: Work quickly! RNA degrades fast!
  • Keep everything cold, use RNase-free materials
  • Check RNA quality on Bioanalyzer (RIN score)

Step 2: cDNA Synthesis (covered in detail below)

  • Convert RNA to cDNA using reverse transcriptase
  • Why? cDNA is stable, RNA is not!
  • Label with fluorescent dyes (Cy3, green, for normal; Cy5, red, for tumor)

Step 3: Hybridization to Microarray

  • Mix labeled cDNA samples
  • Apply to microarray slide
  • Incubate overnight (16-18 hours)
  • Wash away unbound cDNA

Step 4: Scanning

  • Scan with laser scanner
  • Get images (green and red channels)
  • Software measures spot intensities

Step 5: Data Processing

  1. Background subtraction
  2. Normalization (using housekeeping genes or quantile normalization)
  3. Calculate ratios (Red/Green, i.e. tumor/normal)
  4. Log2 transform
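The processing steps above can be sketched as follows. This is a simplified illustration: it uses global median centering of the log ratios as a stand-in for the housekeeping-gene or quantile normalization named in the list, and it applies that step after the log2 transform. The spot intensities and the function names are made up.

```python
import math
from statistics import median


def log_ratios(red, green, red_bg, green_bg, floor=1.0):
    """Subtract background from each channel, then return log2(red/green) per spot."""
    ratios = []
    for r, g, rb, gb in zip(red, green, red_bg, green_bg):
        r_net = max(r - rb, floor)  # floor avoids zero or negative intensities
        g_net = max(g - gb, floor)
        ratios.append(math.log2(r_net / g_net))
    return ratios


def median_center(values):
    """Shift all log ratios so that their median is 0 (simple global normalization)."""
    m = median(values)
    return [v - m for v in values]


# Hypothetical intensities for four spots (red = tumor, green = normal).
red = [1100, 2100, 500, 8100]
green = [1100, 1100, 900, 1100]
background = [100, 100, 100, 100]

raw = log_ratios(red, green, background, background)
normalized = median_center(raw)
print("log2 ratios:", raw)
print("median-centered:", normalized)
```

A log2 ratio of 1 means twice as much signal in the tumor channel, and -1 means half as much, which is why the log scale treats up- and downregulation symmetrically.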

Step 6: Quality Control

  • Check housekeeping genes (should be constant)
  • Check replicate correlation
  • Remove outliers
  • MA plots, box plots

Step 7: Statistical Analysis

  • Identify significantly changed genes
  • Use t-test or SAM (Significance Analysis of Microarrays)
  • Apply FDR correction
  • Cutoffs: |log2 fold-change| > 1 AND FDR < 0.05
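As a rough sketch of the last two bullets, the code below applies the Benjamini-Hochberg procedure (one common FDR correction; the text does not say which one is used) to a list of p-values and then applies both cutoffs. The p-values are assumed to come from an earlier t-test, and the gene names and numbers are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (FDR) in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_genes(names, log2_fc, pvalues, fc_cut=1.0, fdr_cut=0.05):
    """Keep genes with |log2 fold-change| > fc_cut AND FDR < fdr_cut."""
    fdr = benjamini_hochberg(pvalues)
    return [name for name, lfc, q in zip(names, log2_fc, fdr)
            if abs(lfc) > fc_cut and q < fdr_cut]


# Hypothetical results for four genes.
names = ["g1", "g2", "g3", "g4"]
log2_fc = [3.6, 0.4, -2.7, 1.5]
pvalues = [0.001, 0.002, 0.01, 0.2]

fdr = benjamini_hochberg(pvalues)
hits = significant_genes(names, log2_fc, pvalues)
print("FDR:", [round(q, 4) for q in fdr])
print("Significant:", hits)
```

Here g2 fails the fold-change cutoff and g4 fails the FDR cutoff, so only g1 and g3 pass both.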

Step 8: Results

Example results:

  • Total genes tested: 40,000
  • Significantly different: 2,847 genes (FDR < 0.05)
    • Upregulated in cancer: 1,523 genes
    • Downregulated in cancer: 1,324 genes

Top upregulated genes (excerpt):

| Gene | Fold-Change | Log2 FC | FDR | Function |
|---|---|---|---|---|
| ERBB2 (HER2) | 12.5 | 3.64 | <0.001 | Growth receptor |
| MKI67 (Ki67) | 8.2 | 3.04 | <0.001 | Cell proliferation |
| CCNB1 | 7.8 | 2.96 | <0.001 | Cell cycle |
| TOP2A | 6.5 | 2.70 | <0.001 | DNA replication |
| EGFR | 5.2 | 2.38 | <0.001 | Growth receptor |

Top downregulated genes (excerpt):

| Gene | Fold-Change | Log2 FC | FDR | Function |
|---|---|---|---|---|
| ESR1 (ER) | 0.15 | -2.74 | <0.001 | Estrogen receptor |
| PGR (PR) | 0.18 | -2.47 | <0.001 | Progesterone receptor |
| GATA3 | 0.22 | -2.18 | <0.001 | Transcription factor |

13.6.7.3 Step 9: Visualization

Create heatmap:

  • Rows = top 100 changed genes
  • Columns = samples (10 cancer, 10 normal)
  • Colors = expression level (red = high, green = low)
  • Clustering groups similar samples and genes

Result: Cancers cluster together, normals cluster together!

Create volcano plot:

  • X-axis = log2 fold-change
  • Y-axis = -log10(FDR)
  • Points in the top-left and top-right corners = significant AND large change (strongly down- or upregulated)
  • These are your biomarker candidates!
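The volcano plot coordinates can be computed before any plotting library is involved. The sketch below reuses the cutoffs from the statistics step; the log2 fold-changes for ERBB2 and ESR1 come from the example tables, but the exact FDR values and the third gene are hypothetical.

```python
import math


def volcano_points(genes, fc_cut=1.0, fdr_cut=0.05):
    """genes: list of (name, log2_fc, fdr). Returns (name, x, y, label) tuples."""
    points = []
    for name, log2_fc, fdr in genes:
        y = -math.log10(fdr)  # small FDR -> high on the plot
        if fdr < fdr_cut and log2_fc > fc_cut:
            label = "up"      # top-right corner
        elif fdr < fdr_cut and log2_fc < -fc_cut:
            label = "down"    # top-left corner
        else:
            label = "ns"      # not significant or change too small
        points.append((name, log2_fc, y, label))
    return points


# FDR values below are made up; the tables only report "<0.001".
example = [("ERBB2", 3.64, 1e-6), ("ESR1", -2.74, 1e-5), ("geneZ", 0.2, 0.6)]
points = volcano_points(example)
for name, x, y, label in points:
    print(f"{name}: x={x:+.2f}, y={y:.2f}, {label}")
```

The resulting x and y values can be passed to any scatter-plot function, with the label used to pick a color.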

13.6.7.4 Step 10: Pathway Analysis

Question: Are these 2,847 genes related? Do they work together?

Pathway enrichment analysis:

  • Use tools: DAVID, Enrichr, GSEA
  • Test if changed genes are enriched in known pathways
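Over-representation tools commonly rest on a hypergeometric (Fisher-type) test: if the changed genes were drawn at random, how likely is an overlap with the pathway at least this large? GSEA uses a different, rank-based approach that this sketch does not cover. The function name and the small numbers are made up for illustration.

```python
from math import comb


def enrichment_pvalue(total_genes, pathway_genes, changed_genes, overlap):
    """P(overlap or more pathway genes) when changed_genes are drawn at random."""
    all_draws = comb(total_genes, changed_genes)
    largest_overlap = min(pathway_genes, changed_genes)
    tail = sum(
        comb(pathway_genes, k)
        * comb(total_genes - pathway_genes, changed_genes - k)
        for k in range(overlap, largest_overlap + 1)
    )
    return tail / all_draws


# Hypothetical toy case: 20 genes tested, 5 in the pathway,
# 6 genes changed, 3 of the changed genes are in the pathway.
p_value = enrichment_pvalue(total_genes=20, pathway_genes=5,
                            changed_genes=6, overlap=3)
print(f"p = {p_value:.4f}")
```

A small p-value means the overlap is larger than chance would explain. Because many pathways are tested at once, these p-values also need multiple-testing correction.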

Example results:

Top enriched pathways (p < 0.001):

  1. Cell cycle (78 genes, p = 1.2 × 10⁻²⁵)
  2. DNA replication (35 genes, p = 3.4 × 10⁻¹⁸)
  3. p53 signaling (42 genes, p = 2.1 × 10⁻¹⁵)
  4. ECM-receptor interaction (51 genes, p = 8.7 × 10⁻¹²)

Interpretation:

  • Cell cycle genes highly upregulated → cancer cells dividing rapidly!
  • p53 pathway disrupted → tumor suppressor pathway broken
  • Makes biological sense!

13.6.7.5 Step 11: Validation

Never trust microarray results alone! Validate with independent method:

Methods for validation:

  1. qRT-PCR (quantitative RT-PCR):
    • Gold standard for validation
    • Measure specific genes in same samples
    • More accurate than microarray
    • Validate top 10-20 genes
  2. RNA-seq:
    • Sequence same samples
    • Check if same genes come up
    • More comprehensive
  3. Protein validation:
    • Western blot or immunohistochemistry
    • Check if protein levels match RNA levels
    • Sometimes they don’t! (post-transcriptional regulation)

Example validation:

  • Microarray: ERBB2 upregulated 12.5-fold
  • qRT-PCR: ERBB2 upregulated 11.8-fold ✓
  • Western blot: HER2 protein high in tumors ✓
  • Validated!

13.6.7.6 Step 12: Biological Interpretation

Key questions:

  1. What do these genes do?
    • Look up functions (Gene Ontology, UniProt)
    • Group by function
  2. Are known cancer genes here?
    • ERBB2/HER2 → known oncogene!
    • ESR1/ER → prognostic marker
    • Matches known biology ✓
  3. Any surprising discoveries?
    • Novel genes not previously linked to cancer
    • Potential new drug targets!
  4. Clinical implications?
    • Can we classify tumors by expression pattern?
    • Predict patient outcomes?
    • Suggest treatments?

Example outcome:

  • Identified HER2-positive subtype
  • These patients respond to Herceptin (anti-HER2 drug)
  • Gene expression predicts treatment!

13.6.8 Reverse Transcriptase: The Enzyme That Makes cDNA

Now let’s understand in detail HOW we convert RNA to cDNA!

13.6.8.1 What Is Reverse Transcriptase?

Reverse transcriptase (RT) = Enzyme that makes DNA from RNA template

  • “Reverse” because normal direction is DNA → RNA (transcription)
  • This goes backwards: RNA → DNA!
  • Also called RNA-dependent DNA polymerase

Where does RT come from?

  • Retroviruses: HIV, HTLV, etc.
  • These viruses have RNA genomes
  • Need to make DNA copy to insert into our genome
  • We “borrowed” their enzyme for research!

Think of it like:

  • Normal transcription = reading a book and taking notes (DNA → RNA)
  • Reverse transcription = reconstructing the book from notes (RNA → DNA)

13.6.8.2 Why HIV Needs Reverse Transcriptase

HIV life cycle (simplified):

  1. HIV enters cell (has RNA genome)
  2. RT makes DNA copy of HIV RNA
  3. DNA copy integrates into human chromosome
  4. Now human cell makes HIV proteins forever!
  5. This is why HIV is chronic - it’s in your DNA!

Why we can’t cure HIV easily:

  • Once integrated, HIV DNA stays in genome
  • Antiretroviral drugs block RT → stop new infections
  • But can’t remove integrated DNA

How RT inhibitors work (HIV drugs):

  • AZT, nevirapine, etc.
  • Block RT enzyme
  • Prevent HIV from making DNA copy
  • Stop virus from integrating

13.6.8.3 The RT Mechanism: Step by Step

Requirements for reverse transcription:

  1. Template RNA: Your RNA of interest
  2. Reverse transcriptase enzyme: Usually from MMLV or AMV
  3. Primer: Short DNA or RNA that binds to template
  4. dNTPs: Building blocks (dATP, dTTP, dGTP, dCTP)
  5. Buffer: Right salt and pH conditions

Why RT needs a primer (just like DNA polymerase):

  • Cannot start synthesis de novo (from scratch)
  • Needs 3’-OH group to add nucleotides to
  • Primer provides this starting point

Types of primers:

1. Oligo-dT primers (most common):

  • Short stretch of T nucleotides (15-20 Ts)
  • Binds to poly-A tail of mRNA
  • Sequence: 5’-TTTTTTTTTTTTTTT-3’
  • Binds to: 3’-AAAAAAAAAAAAAAA-5’ (poly-A tail)

Advantage: Specifically primes mRNA (not rRNA or tRNA, which lack a poly-A tail)

Disadvantage: Might not reach 5’ end of long transcripts

2. Random hexamer primers:

  • Short random sequences (6 nucleotides)
  • Bind randomly all along RNA
  • Example: 5’-NNNNNN-3’ (N = any base)

Advantage: Copies all RNA regions, including 5’ end

Disadvantage: Copies ribosomal RNA too (wasteful!)

3. Gene-specific primers:

  • Design primer for specific gene you want
  • Most specific!

Advantage: Only your gene of interest

Disadvantage: Need to design primers, one gene at a time
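To see where each primer type can start a reaction, the sketch below searches a short RNA for stretches that are exactly complementary to a DNA primer. It is a deliberately simplified model (perfect matches only, no temperature or mismatch effects). The toy RNA is the one used in this section's diagrams, and the gene-specific primer is invented to match it.

```python
# Which RNA base pairs with each DNA primer base.
PAIRS_WITH = {"A": "U", "T": "A", "G": "C", "C": "G"}


def binding_site(primer_dna):
    """RNA sequence (written 5'->3') that a DNA primer (5'->3') base-pairs with."""
    return "".join(PAIRS_WITH[base] for base in reversed(primer_dna))


def priming_positions(rna, primer_dna):
    """0-based start positions in the RNA where the primer can anneal perfectly."""
    site = binding_site(primer_dna)
    return [i for i in range(len(rna) - len(site) + 1)
            if rna[i:i + len(site)] == site]


rna = "GGCAUUUGCCAAA" + "A" * 10   # toy transcript with a short poly-A tail

oligo_dt = "T" * 10                # oligo-dT primer
gene_specific = "GGCAAATGCC"       # hypothetical primer designed for this RNA

oligo_dt_sites = priming_positions(rna, oligo_dt)
gene_specific_sites = priming_positions(rna, gene_specific)
print("oligo-dT anneals at:", oligo_dt_sites)
print("gene-specific primer anneals at:", gene_specific_sites)
```

The oligo-dT primer only finds sites at the A-rich 3’ end, while the gene-specific primer finds its single designed site.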

13.6.8.4 The Reverse Transcription Reaction

Step 1: Annealing

  • Mix RNA + primer
  • Heat to 70°C (denature any RNA secondary structure)
  • Cool to 25°C (for random hexamers) or 42°C (for oligo-dT)
  • Primer binds to complementary sequence on RNA

Example (with oligo-dT):

RNA:    5'---GGCAUUUGCCAAA[AAAAAAAAAA]-3' (poly-A tail)
                           ||||||||||
Primer:                 3'-TTTTTTTTTT-5' (oligo-dT)

Step 2: Synthesis

  • Add reverse transcriptase enzyme + dNTPs
  • RT synthesizes DNA complementary to RNA
  • Builds the new DNA strand in the 5’ → 3’ direction (like all polymerases), reading the RNA template 3’ → 5’
  • Creates RNA-DNA hybrid

After RT synthesis:

RNA:  5'---GGCAUUUGCCAAA[AAAAAAAAAA]-3'
           ||||||||||||| ||||||||||
cDNA: 3'---CCGTAAACGGTTT[TTTTTTTTTT]-5'
  • Top strand = original RNA (template)
  • Bottom strand = newly synthesized cDNA (complementary DNA)
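The base-pairing rule behind this diagram is simple enough to write out. The function below is a minimal sketch of what reverse transcriptase produces for a given RNA, ignoring primers and enzyme behavior; the name `reverse_transcribe` is made up for this example.

```python
# DNA base that reverse transcriptase adds opposite each RNA base.
RNA_TO_CDNA = {"A": "T", "U": "A", "G": "C", "C": "G"}


def reverse_transcribe(rna):
    """Return the first-strand cDNA, written 5'->3', for an RNA written 5'->3'."""
    # The cDNA is antiparallel to the RNA, so the RNA is read backwards.
    return "".join(RNA_TO_CDNA[base] for base in reversed(rna))


rna = "GGCAUUUGCCAAA" + "A" * 10   # same toy RNA as in the diagram
cdna = reverse_transcribe(rna)

# Reversing the string gives the 3'->5' orientation drawn under the RNA.
cdna_3_to_5 = cdna[::-1]
print("cDNA 5'->3':", cdna)
print("cDNA 3'->5':", cdna_3_to_5)
```

Note that the cDNA written 5’ → 3’ starts with the run of Ts, because synthesis begins at the primer annealed to the poly-A tail.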

Step 3: RNA degradation (optional):

  • Add RNase H enzyme
  • Degrades RNA part of RNA-DNA hybrid
  • Leaves single-stranded cDNA

Result: Single-stranded cDNA

cDNA: 3'---CCGTAAACGGTTT[TTTTTTTTTT]-5'

Step 4: Second strand synthesis (if needed):

  • Some applications need double-stranded cDNA
  • Add DNA polymerase + primers
  • Makes complementary strand

Result: Double-stranded cDNA

5'---GGCATTTGCCAAA[AAAAAAAAAA]-3'
3'---CCGTAAACGGTTT[TTTTTTTTTT]-5'

Now looks like normal DNA! Can be:

  • Cloned into vectors
  • Amplified by PCR
  • Sequenced
  • Stored long-term

13.6.8.5 RT Enzymes: Which One to Use?

Common reverse transcriptases:

1. MMLV-RT (Moloney Murine Leukemia Virus):

  • Most commonly used
  • Works at 37-42°C
  • Good processivity (makes long cDNA)
  • Sensitive to high temperature

2. AMV-RT (Avian Myeloblastosis Virus):

  • Works at higher temperature (42-55°C)
  • Better for RNA with secondary structure
  • More RNase H activity (degrades template)

3. SuperScript (engineered MMLV):

  • Modified version of MMLV
  • More stable at higher temperature (up to 50°C)
  • Lower RNase H activity (preserves template)
  • Very popular for difficult templates

Choosing RT enzyme:

  • Standard use: MMLV-RT or SuperScript II/III
  • GC-rich RNA: SuperScript III (higher temp denatures structure)
  • Long transcripts: SuperScript IV (processivity up to 12 kb!)
  • Budget: MMLV-RT (cheaper)

13.6.8.6 Common Problems and Solutions

Problem 1: No cDNA produced

Possible causes:

  • RNA degraded (use RNase-free everything!)
  • RT enzyme dead (check expiration, storage)
  • No primer binding (wrong primer for your RNA)

Solution:

  • Check RNA quality on gel/Bioanalyzer
  • Use fresh RT enzyme
  • Try different primer type

Problem 2: cDNA too short

  • Only part of the transcript is copied (with oligo-dT priming, the 3’ end is covered and the 5’ end is missing)

Cause:

  • RNA secondary structure blocks RT
  • RT falls off template

Solution:

  • Use higher temperature (SuperScript III at 50°C)
  • Add DMSO or betaine (disrupts structure)
  • Use random hexamers instead of oligo-dT

Problem 3: Genomic DNA contamination

  • Amplifying genomic DNA instead of cDNA!

Solution:

  • Treat RNA with DNase I (removes DNA)
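  • Design PCR primers that span exon-exon junctions (won’t amplify intron-containing genomic DNA)
  • Use oligo-dT primers (genomic DNA has no poly-A tail to prime from), but note that leftover genomic DNA can still be amplified in a later PCR step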

13.6.8.7 Why RT Is Essential for RNA Studies

Applications of reverse transcription:

  1. Microarrays: Need cDNA for labeling and hybridization
  2. RNA-seq: Most methods convert RNA to cDNA first
  3. qRT-PCR: Quantify specific genes
  4. cDNA libraries: Clone all genes from organism
  5. Northern blot alternative: Detect specific RNAs

Without RT:

  • RNA is fragile and hard to work with directly
  • RNases everywhere would destroy samples
  • Can’t amplify with PCR (needs DNA)
  • Can’t clone into vectors

With RT:

  • Convert to stable cDNA
  • Amplify, sequence, clone
  • Store long-term
  • Study gene expression!

Summary: Reverse transcriptase is THE KEY enzyme behind most RNA research methods!

13.6.9 RNA Sequencing (RNA-seq)

Modern method to study the transcriptome (Wang, Gerstein, and Snyder 2009; Mortazavi et al. 2008):

How it works:

  1. Extract all RNA from cells

  2. Convert RNA to cDNA with reverse transcriptase (more stable)

  3. Sequence the DNA

  4. Count how many copies of each RNA

  5. Determine which genes are active
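Step 4 is, at its core, a counting problem. The sketch below assumes the reads have already been aligned and each one has been assigned to a gene (or to `None` if it could not be assigned). Real pipelines use dedicated alignment and counting tools for this, and the read assignments here are invented.

```python
from collections import Counter


def count_reads(read_assignments):
    """Count how many reads were assigned to each gene, skipping unassigned reads."""
    return Counter(gene for gene in read_assignments if gene is not None)


# Hypothetical gene assignment for six sequenced reads.
assignments = ["ERBB2", "ERBB2", "ESR1", None, "ERBB2", "GATA3"]

raw_counts = count_reads(assignments)
for gene, n in raw_counts.most_common():
    print(gene, n)
```

These raw counts are the starting point for the normalized metrics (RPKM, FPKM, TPM) described earlier in the chapter.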

What we learn:

  • Which genes are expressed

  • How much each gene is expressed

  • Which splice variants are made

  • Discovery of new RNAs

13.6.10 Applications

Medicine:

  • Diagnose diseases by RNA patterns

  • Predict treatment response

  • Find new drug targets

  • Understand cancer

Development:

  • How embryos develop

  • Cell differentiation

  • Tissue formation

Evolution:

  • Compare gene expression across species

  • Understand how expression patterns evolved

13.7 Key Takeaways

  • Transcriptome = All RNA molecules in a cell

  • Transcriptomics = Study of the transcriptome

  • Three main RNA types:

    • mRNA (1-5%): Codes for proteins

    • tRNA (15%): Delivers amino acids

    • rRNA (80%): Forms ribosomes

  • RNA processing in eukaryotes:

    • 5’ capping: Protective cap

    • 3’ poly-A tail: Protective tail

    • Splicing: Removing introns, joining exons

  • Alternative splicing = One gene → multiple proteins

    • Explains how 20,000 genes make 100,000+ proteins

    • Regulated by splicing factors

  • Non-coding RNAs have important functions:

    • miRNAs: Regulate gene expression

    • lncRNAs: Various regulatory functions

  • RNA-seq = Modern method to study transcriptome

  • Transcriptome varies between cell types, conditions, and times


Sources: Information adapted from Khan Academy, Nature Education, and transcriptomics research literature (Wang, Gerstein, and Snyder 2009; Mortazavi et al. 2008; Schena et al. 1995).

Mortazavi, Ali, Brian A. Williams, Kenneth McCue, Lorian Schaeffer, and Barbara Wold. 2008. “Mapping and Quantifying Mammalian Transcriptomes by RNA-Seq.” Nature Methods 5 (7): 621–28.
Schena, Mark, Dari Shalon, Ronald W. Davis, and Patrick O. Brown. 1995. “Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray.” Science 270 (5235): 467–70.
Shalon, Dari, Stephen J. Smith, and Patrick O. Brown. 1996. “A DNA Microarray System for Analyzing Complex DNA Samples Using Two-Color Fluorescent Probe Hybridization.” Genome Research 6 (7): 639–45.
Wang, Zhong, Mark Gerstein, and Michael Snyder. 2009. “RNA-Seq: A Revolutionary Tool for Transcriptomics.” Nature Reviews Genetics 10 (1): 57–63.