13 Transcriptomics

13.1 Beyond the Genome: The Transcriptome

13.1.1 What Is the Transcriptome?

If the genome is all your DNA, the transcriptome is all the RNA being made in your cells right now!

Think of it like:

Genome = Your complete cookbook (all possible recipes)
Transcriptome = The recipes you’re actually using today

The transcriptome changes:

Between different cell types (brain vs. muscle)
At different times (morning vs. evening)
In different conditions (healthy vs. sick)

Transcriptomics is the study of all RNA molecules in a cell or organism.

13.2 The Three Main Types of RNA

13.2.1 1. mRNA (Messenger RNA)

What it does: Carries the recipe from DNA to ribosomes

Think of it like: A photocopy of a recipe that goes to the kitchen

Key facts:

Temporary (breaks down quickly)
Only about 1-5% of total RNA
Codes for proteins
Each mRNA corresponds to one gene (in eukaryotes)

Process:

Transcribed from DNA in the nucleus
Processed (capping, poly-A tail, splicing)
Travels to cytoplasm
Read by ribosomes to make protein
Eventually degrades

13.2.2 2. tRNA (Transfer RNA)

What it does: Brings amino acids to the ribosome during protein synthesis

Think of it like: Delivery trucks bringing ingredients to the kitchen

Key facts:

Cloverleaf shape (looks like a three-leaf clover when flat)
3D L-shape structure
About 15% of total RNA
Each tRNA carries one specific amino acid
Has an “anticodon” that matches mRNA codons

How it works:

tRNA picks up its specific amino acid
Recognizes the matching codon on mRNA
Delivers amino acid to growing protein chain
Detaches and goes back for another amino acid

There is at least one tRNA for each of the 20 amino acids!

13.2.3 3. rRNA (Ribosomal RNA)

What it does: Forms the structure of ribosomes (protein-making factories)

Think of it like: The factory building itself

Key facts:

About 80% of total RNA! (Most abundant)
Combines with proteins to form ribosomes
Does NOT code for proteins
Actually catalyzes protein synthesis (it’s a ribozyme, an RNA enzyme!)
Very ancient and conserved across all life

Structure:

Humans have 4 types of rRNA
Combine with ~80 ribosomal proteins
Form large and small ribosomal subunits

13.3 RNA Processing in Eukaryotes

13.3.1 From Pre-mRNA to Mature mRNA

In eukaryotes, RNA goes through major changes before it’s ready! The initial RNA copy is called pre-mRNA (precursor mRNA).

13.3.2 1. 5’ Capping

What happens: A special “cap” is added to the beginning (5’ end) of the RNA

The cap:

A modified guanine nucleotide
Added immediately after transcription starts
Like putting a protective cap on a pen

Why it matters:

Protects mRNA from degradation
Helps ribosome recognize and bind to mRNA
Helps transport mRNA out of nucleus

13.3.3 2. 3’ Poly-A Tail

What happens: A long tail of adenine (A) nucleotides is added to the end (3’ end)

The tail:

About 200-250 adenines in a row (AAAAAAAA…)
Added after transcription finishes
Like a protective tail

Why it matters:

Protects mRNA from degradation
Helps export mRNA from nucleus
Helps ribosomes find and bind mRNA
Longer tail = longer mRNA lifespan

13.3.4 3. Splicing: Removing Introns

What happens: Introns (non-coding regions) are cut out, exons (coding regions) are joined together

The process:

Pre-mRNA contains both introns and exons
Spliceosome (molecular machine) recognizes intron-exon boundaries
Cuts out introns
Joins exons together
Introns are degraded

Think of it like:

Filming a movie (transcription)
Editing out unwanted scenes (splicing)
Final cut ready for theaters (mature mRNA)!
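To make the splicing steps concrete, here is a minimal Python sketch. It assumes we already know where each exon starts and ends (in a real cell, the spliceosome finds these boundaries from sequence signals). The toy sequence, the coordinates, and the function name `splice` are made up for illustration.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join exon segments in order; coordinates are 0-based, end-exclusive."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Hypothetical toy pre-mRNA: UPPERCASE = exon, lowercase = intron
pre_mrna = "AUGGCC" + "guaaguuuuag" + "UUUGAA" + "guacgcag" + "CCCUAA"

# Assumed exon coordinates (in reality these come from a gene annotation)
exons = [(0, 6), (17, 23), (31, 37)]

mature = splice(pre_mrna, exons)
print(mature)  # AUGGCCUUUGAACCCUAA
```

Everything in lowercase (the introns) is gone, and the three exons are joined into one continuous message.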

13.4 Alternative Splicing: One Gene, Many Proteins!

13.4.1 The Amazing Discovery

Here’s where things get really interesting! One gene can make multiple different proteins through alternative splicing!

13.4.2 How Alternative Splicing Works

Instead of always joining exons in the same order, cells can:

Skip exons: Leave some exons out
Use mutually exclusive exons: Include one exon or another, but never both
Use alternative splice sites: Cut at different positions
Retain introns: Sometimes keep an intron in

Think of it like:

You have LEGO bricks numbered 1, 2, 3, 4, 5
You could build: 1-2-3-4-5 OR 1-3-4-5 OR 1-2-4-5 OR 1-2-3-5
Same bricks, different combinations, different structures!
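The LEGO idea can be sketched in a few lines of Python. This is only an illustration of exon skipping: it assumes the first and last exons are always kept and that every combination of internal exons is allowed, which real cells do not actually do. The function name is hypothetical.

```python
from itertools import combinations


def exon_skipping_isoforms(exons):
    """List every isoform that keeps the first and last exon
    and any subset of the internal exons, in their original order."""
    first, *internal, last = exons
    isoforms = []
    for n_kept in range(len(internal), -1, -1):
        for kept in combinations(internal, n_kept):
            isoforms.append([first, *kept, last])
    return isoforms


isoforms = exon_skipping_isoforms([1, 2, 3, 4, 5])
for isoform in isoforms:
    print("-".join(str(exon) for exon in isoform))
print(len(isoforms))  # 8
```

Three optional exons already give 2 × 2 × 2 = 8 possible combinations, which hints at how a gene with many optional exons can produce a huge number of versions.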

13.4.3 Why Alternative Splicing Is Powerful

Creates protein diversity:

Humans have ~20,000 genes
But make ~100,000+ different proteins!
Alternative splicing explains this!

Examples:

DSCAM gene (in fruit flies): Can potentially make over 38,000 different protein versions from ONE gene!
Human titin gene: Makes different versions in different muscle types
Antibody genes: Splicing switches antibodies between membrane-bound and secreted forms (most antibody diversity comes from DNA rearrangement, not splicing)

Allows fine-tuned regulation:

Different tissues can make different protein versions
Respond to different conditions
Create proteins with slightly different functions

13.4.4 Regulation of Alternative Splicing

Special proteins called splicing factors control which exons are included:

Some promote exon inclusion
Some promote exon skipping
Different cells have different splicing factors
Responds to signals (hormones, stress, development)

13.5 Non-Coding RNAs

13.5.1 RNA That Doesn’t Make Protein

We used to think all important RNA coded for proteins. We were wrong!

Many RNAs have functions without making proteins:

13.5.2 microRNAs (miRNAs)

What they are: Tiny RNAs (about 22 nucleotides long)

What they do: Regulate gene expression by:

Binding to mRNA
Blocking translation OR
Causing mRNA degradation

Impact:

One miRNA can regulate hundreds of different genes!
Important for development, cell division, disease
Dysregulation linked to cancer

Think of them like: Volume controls that turn down gene expression

13.5.3 Long Non-Coding RNAs (lncRNAs)

What they are: Long RNA molecules (>200 nucleotides) that don’t code for protein

What they do: Many different functions:

Regulate gene expression
Organize chromatin
Guide proteins to specific DNA locations
Some functions still unknown!

Impact:

More common than expected (thousands in humans!)
Important for development and disease
Active area of research

13.5.4 Other Functional RNAs

snRNAs: Small nuclear RNAs (part of the spliceosome!)
snoRNAs: Small nucleolar RNAs (modify other RNAs)
Ribozymes: RNA molecules that act as enzymes
siRNAs: Small interfering RNAs (used in research and medicine)

13.6 Studying the Transcriptome

13.6.1 Why Study the Transcriptome?

Understanding gene expression:

Which genes are turned on in different cells?
How does gene expression change in disease?
How do cells respond to signals?

Better than studying genome alone:

Genome is (almost) the same in all cells of your body
But transcriptome differs between cells!
Tells you what’s actually happening

13.6.2 Why RNA is Unstable (And Why We Convert to cDNA)

Before we discuss methods to study RNA, we need to understand an important challenge: RNA is extremely unstable!

The problem with RNA:

Chemically unstable: RNA has an extra -OH group at the 2’ position
- This makes RNA reactive and prone to breaking
- DNA is more stable (no 2’-OH group)
RNases everywhere: Enzymes that degrade RNA (RNases) are literally everywhere!
- On your skin
- In the air
- In cells (lots of them!)
- Very stable and hard to inactivate

Why so many RNases? An evolutionary defense!

This is actually evolution being clever:

Viral defense: Many viruses have RNA genomes
Immune system: Our cells have RNases to destroy viral RNA
Quality control: Cells need to degrade old or damaged RNA quickly
Gene regulation: Controlling RNA degradation = controlling gene expression

Think of it like:

Your body treats all external RNA as potentially dangerous
RNases are like security guards everywhere
They destroy RNA before it can cause problems
Unfortunately, this makes working with RNA experimentally very difficult!

The solution: Convert RNA to cDNA

cDNA = Complementary DNA (DNA copy of RNA)

Why cDNA is better:

Stable: DNA is chemically stable (no 2’-OH group)
RNase-resistant: RNases only cut RNA, not DNA
Can be amplified: Use PCR to make many copies
Easier to work with: Standard DNA techniques apply
Long-term storage: Can store cDNA for years

How RNA → cDNA conversion works:

Extract RNA from cells (work quickly, keep everything cold!)
Add reverse transcriptase enzyme (from retroviruses)
- This special enzyme makes DNA from RNA template
- Regular DNA polymerase cannot do this!
Add primers (usually oligo-dT primers that bind to poly-A tail)
Reverse transcriptase synthesizes DNA copy of RNA
Now you have stable cDNA to work with!

Think of it like:

RNA = fragile ice sculpture (melts quickly)
cDNA = photograph of the ice sculpture (stable, permanent)
You can’t keep the ice sculpture, but the photo preserves the information!
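The base-pairing logic behind reverse transcription can be sketched in Python. This is not a simulation of the enzyme, only of the rule it follows: each RNA base is paired with its DNA partner, and the new strand runs in the opposite direction. The mRNA sequence and the function name are hypothetical.

```python
# RNA base -> complementary DNA base
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna: str) -> str:
    """Return the first-strand cDNA, written 5'->3'
    (the reverse complement of the mRNA, in DNA letters)."""
    return "".join(COMPLEMENT[base] for base in reversed(mrna))


# Hypothetical short mRNA ending in a (very short) poly-A tail
mrna = "AUGGCCUUUAAAAAA"
cdna = first_strand_cdna(mrna)
print(cdna)  # TTTTTTAAAGGCCAT
```

Notice that the cDNA starts with a run of T's: that is where an oligo-dT primer would sit on the poly-A tail.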

13.6.3 DNA Microarrays: Measuring Gene Expression

Before next-generation sequencing (NGS) became popular, scientists used microarrays to measure gene expression (Schena et al. 1995; Shalon, Smith, and Brown 1996). Microarrays are still used today for certain applications!

13.6.3.1 What Is a DNA Microarray?

A DNA microarray is like a gene expression chip:

Glass slide (like a microscope slide)
Thousands of spots, each containing DNA probes for one gene
Measures expression of thousands of genes simultaneously!

Think of it like:

Each spot = a fishing hook for one specific gene
Different colored fish = RNA from different samples
Count how many fish each hook catches = measure gene expression!

13.6.3.2 How Microarrays Work

Step 1: Create the array

Synthesize or print short DNA sequences (probes) onto glass slide
Each spot contains probes for one specific gene
Can have 10,000-50,000+ spots on one slide!
Each probe is complementary to a specific mRNA

Step 2: Prepare samples

Extract RNA from cells (e.g., normal cells vs. cancer cells)
Convert RNA to cDNA using reverse transcriptase
- Why? RNA is unstable!
- cDNA is much more stable and easier to work with
Label cDNA with fluorescent dyes
- Normal cells: Label with green dye (Cy3)
- Cancer cells: Label with red dye (Cy5)

Step 3: Hybridization

Mix both labeled cDNA samples together
Apply to microarray slide
Incubate (allow cDNA to bind to matching probes)
Wash away unbound cDNA

Step 4: Scan and analyze

Scan slide with laser
- Green laser detects Cy3 (normal sample)
- Red laser detects Cy5 (cancer sample)
Camera captures fluorescent signals
Computer measures intensity at each spot

13.6.3.3 Interpreting Microarray Colors

Two-color microarray results:

Green spot: Gene higher in normal sample
Red spot: Gene higher in cancer sample
Yellow spot: Gene equal in both samples (red + green = yellow)
Black/dark spot: Gene not expressed in either sample

Example interpretation:

Spot Color    → Meaning
───────────────────────────────────────────
Red           → Gene upregulated in disease
Green         → Gene downregulated in disease
Yellow        → No change in expression
Black         → Gene not expressed

Quantitative data:

Actually, you get two intensity values per spot:
- Green intensity (Cy3)
- Red intensity (Cy5)
Calculate ratio: Red/Green
- Ratio > 1 = upregulated in cancer
- Ratio < 1 = downregulated in cancer
- Ratio ≈ 1 = no change

13.6.4 Quantitative Microarray Analysis: The Details

Now let’s understand HOW we actually measure and interpret the data from microarrays!

13.6.4.1 Measuring Spot Intensities

What the scanner does:

Shine lasers at the microarray slide
- Green laser (532 nm wavelength) excites Cy3 (green dye)
- Red laser (635 nm wavelength) excites Cy5 (red dye)
Camera captures emission:
- Cy3 emits green light → camera records green intensity
- Cy5 emits red light → camera records red intensity
- Each spot recorded separately
Software measures intensity:
- Measures average brightness of each spot
- Units: fluorescence intensity (arbitrary units, like 1,000 or 50,000)
- Higher number = more fluorescence = more labeled cDNA bound

Think of it like:

Shining a flashlight on glow-in-the-dark stickers
Brighter glow = more stickers
Measure how bright each spot glows!

Example raw data for one spot:

Green intensity (Cy3): 8,000 units (normal sample)
Red intensity (Cy5): 24,000 units (cancer sample)
Background: 200 units (non-specific signal)

13.6.4.2 Calculating Ratios and Fold-Changes

Step 1: Subtract background:

Green signal = 8,000 - 200 = 7,800
Red signal = 24,000 - 200 = 23,800
Background = signal from areas without probes (noise)

Step 2: Calculate ratio:

Ratio (R/G) = Red / Green = 23,800 / 7,800 = 3.05
This means: Gene is 3-fold upregulated in cancer!

Interpretation:

Ratio > 1: Gene higher in red sample (cancer)
Ratio < 1: Gene higher in green sample (normal)
Ratio ≈ 1: No change (equal expression)

Common cutoffs:

Ratio ≥ 2 or ≤ 0.5: Large change (2-fold or more)
Ratio ≥ 1.5 or ≤ 0.67: Moderate change
0.67 < Ratio < 1.5: Usually treated as no meaningful change (statistical significance is assessed separately)
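The two steps above (subtract background, then divide red by green) are easy to check in Python. This is a minimal sketch using the example numbers from this section; the function names and the default 2-fold cutoff are illustrative choices, not a standard API.

```python
def fold_change(red: float, green: float, background: float) -> float:
    """Background-subtracted red/green ratio for one spot."""
    red_signal = red - background
    green_signal = green - background
    if green_signal <= 0 or red_signal <= 0:
        raise ValueError("signal is at or below background; ratio is unreliable")
    return red_signal / green_signal


def classify(ratio: float, cutoff: float = 2.0) -> str:
    """Label a ratio using a symmetric fold-change cutoff."""
    if ratio >= cutoff:
        return "up in cancer"
    if ratio <= 1 / cutoff:
        return "down in cancer"
    return "no large change"


ratio = fold_change(red=24_000, green=8_000, background=200)
print(round(ratio, 2), classify(ratio))  # 3.05 up in cancer
```

The check for signals at or below background matters in practice: dividing by a tiny or negative number would produce a meaningless ratio.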

13.6.4.3 Log2 Transformation: Making Data Easier to Work With

The problem with ratios:

Upregulation: 2-fold, 3-fold, 4-fold… (numbers get large)
Downregulation: 0.5, 0.33, 0.25… (fractions, hard to compare)
Not symmetric!

Example:

2-fold up = ratio of 2
2-fold down = ratio of 0.5
These should look “equally different” but don’t!

The solution: Log2 transformation

Log2(ratio) = log base 2 of the ratio

How it works:

Log2(2) = 1 (2-fold increase)
Log2(0.5) = -1 (2-fold decrease)
Log2(1) = 0 (no change)
Log2(4) = 2 (4-fold increase)
Log2(0.25) = -2 (4-fold decrease)

Now upregulation and downregulation are symmetric!

Example calculation:

Ratio = 3.05
Log2(3.05) = 1.61
Interpretation: ~1.6 log2 fold-change (≈ 3-fold upregulation)

Common values:

Ratio	Log2(Ratio)	Fold-Change	Interpretation
2.0	+1.0	2-fold up	Upregulated
4.0	+2.0	4-fold up	Highly upregulated
0.5	-1.0	2-fold down	Downregulated
0.25	-2.0	4-fold down	Highly downregulated
1.0	0.0	No change	Equal expression

Why log2 is useful:

Symmetric: +1 and -1 look equally different from 0
Easy to visualize: Heatmaps, scatter plots, volcano plots
Standard in genomics: Everyone uses it!
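A quick way to convince yourself of the symmetry is to run the numbers. The short sketch below uses only Python's built-in `math` module and the ratios from the table above.

```python
import math

ratios = [2.0, 4.0, 0.5, 0.25, 1.0, 3.05]
log2_ratios = [math.log2(r) for r in ratios]

for r, lfc in zip(ratios, log2_ratios):
    print(f"ratio {r:>5} -> log2 = {lfc:+.2f}")

# Going back from log2 fold-change to a ratio is just 2 ** value
back = [2 ** lfc for lfc in log2_ratios]
```

A 2-fold increase (+1) and a 2-fold decrease (-1) now sit the same distance from zero.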

13.6.4.4 Assessing Statistical Significance

Just because a gene has a ratio of 2 doesn’t mean it’s really different! We need statistics to know if it’s real or just random noise.

Key question: Is this change statistically significant?

What affects significance?

Magnitude of change: Bigger change = more likely significant
Replicates: More replicates = more confident
Variability: Less variation between replicates = more confident

Common statistical tests:

1. t-test:

Compares two groups (disease vs. normal)
Tests if means are significantly different
Outputs p-value

2. ANOVA:

Compares multiple groups (>2 conditions)
Tests if any groups are different

3. SAM (Significance Analysis of Microarrays):

Specialized method for microarray data
Accounts for multiple testing problem
Very popular for microarrays

P-value interpretation:

p < 0.05: Statistically significant (if there were no real difference, a result this extreme would occur less than 5% of the time)
p < 0.01: Highly significant (less than 1% of the time)
p < 0.001: Very highly significant
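To see where a p-value comes from, here is a minimal sketch for a single gene. It computes Welch's t statistic and then an exact permutation p-value (shuffling the tumor/normal labels in every possible way), which is a simple stand-in for looking up the t distribution. It is not the SAM algorithm, and the intensity values are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic: difference in means scaled by its standard error."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def permutation_p_value(a, b):
    """Exact two-sided permutation p-value for the difference in group means."""
    pooled = list(a) + list(b)
    observed = abs(mean(a) - mean(b))
    n_extreme = 0
    n_total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        in_a = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in in_a]
        n_total += 1
        # Small tolerance guards against floating-point rounding
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_total


# Hypothetical log2 intensities for one gene (5 tumors, 5 normals)
tumor = [9.1, 9.4, 8.9, 9.6, 9.2]
normal = [7.8, 8.1, 7.9, 8.3, 8.0]

t_stat = welch_t(tumor, normal)
p_value = permutation_p_value(tumor, normal)
print(f"t = {t_stat:.2f}, permutation p = {p_value:.4f}")
```

Because every tumor value here is higher than every normal value, only 2 of the 252 possible relabelings are as extreme as the real one, so p = 2/252 ≈ 0.008.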

Multiple testing correction:

Problem: Testing 40,000 genes simultaneously!
With p < 0.05, we would expect about 2,000 false positives (5% of 40,000) even if no gene truly changed!
Solution: Adjust p-values for multiple testing

Common corrections:

Bonferroni: Very strict (divide the p-value cutoff by the number of tests)
- New cutoff: 0.05 / 40,000 = 0.00000125
- Too strict! Misses real changes
FDR (False Discovery Rate): Less strict, more practical
- Controls proportion of false positives
- FDR < 0.05: Expect 5% false positives among significant genes
- Most commonly used!
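Both corrections fit in a few lines of Python. The sketch below assumes the FDR method meant here is the widely used Benjamini-Hochberg procedure (the text above does not name a specific one), and the five p-values are hypothetical.

```python
def bonferroni_cutoff(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff under Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-values get smaller
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


print(bonferroni_cutoff(0.05, 40_000))  # 1.25e-06

# Hypothetical p-values for five genes
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(p_values)
significant = [q < 0.05 for q in adjusted]
print(significant)  # [True, True, False, False, False]
```

Four of the five genes have raw p < 0.05, but only two survive FDR correction, and none would pass the Bonferroni cutoff computed for 40,000 tests.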

Example result:

Gene X: Ratio = 3.0, Log2 = 1.58, p = 0.002, FDR = 0.03
Interpretation: Gene X is significantly upregulated 3-fold, with high confidence!

13.6.4.5 Visualizing Microarray Data

1. Scatter plot (MA plot):

X-axis: Average intensity (A = average of log2 red and log2 green)
Y-axis: Log2 ratio (M = log2 red - log2 green)
Shows which genes change and their expression level
Horizontal line at Y=0 = no change
Points above = upregulated
Points below = downregulated
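The M and A values for a spot follow directly from the definitions above. Here is a minimal sketch; the function name is illustrative, and the intensities are the background-subtracted values from the earlier worked example.

```python
import math


def ma_values(red: float, green: float) -> tuple[float, float]:
    """Return (M, A) for one spot from background-subtracted intensities."""
    log_red = math.log2(red)
    log_green = math.log2(green)
    m = log_red - log_green          # log2 ratio (Y-axis)
    a = 0.5 * (log_red + log_green)  # average log2 intensity (X-axis)
    return m, a


m, a = ma_values(red=23_800, green=7_800)
print(f"M = {m:.2f}, A = {a:.2f}")

# A spot with equal red and green signal sits exactly on the M = 0 line
print(ma_values(1024, 1024))  # (0.0, 10.0)
```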

2. Volcano plot:

X-axis: Log2 fold-change
Y-axis: -log10(p-value)
Shows both magnitude and significance
Top corners = large change AND significant (best genes!)
Middle = not significant
Far left/right but low = big change but not significant (noisy)

3. Heatmap:

Rows = genes
Columns = samples
Colors = expression level
- Red = high expression
- Green = low expression
- Black/Yellow = medium
Reveals patterns across many genes and samples
Often includes clustering (groups similar genes/samples)

Think of heatmap like:

Weather map showing temperatures
Red = hot (high expression)
Blue/Green = cold (low expression)
Patterns emerge visually!

13.6.4.6 From Scanner to Results: Complete Workflow

1. Scan microarray:

Get images (green and red channels)
Raw fluorescence intensities

2. Grid alignment:

Software identifies each spot
Draws circles/grids around spots
Measures intensity inside each spot

3. Background subtraction:

Measure intensity between spots (background)
Subtract from each spot intensity
Corrects for non-specific signal

4. Normalization (covered in next section):

Correct for technical variations
Make arrays comparable

5. Calculate ratios:

Red / Green for each spot
Log2 transform

6. Statistical analysis:

Identify significantly changed genes
Apply multiple testing correction
FDR < 0.05 typically

7. Biological interpretation:

What do these genes do?
Are they related (pathways, functions)?
Can we validate with other methods?

Example: Finding Cancer Biomarkers

Experiment:

Compare tumor samples (n=10) vs. normal tissue (n=10)
Each sample hybridized to microarray
40,000 genes tested

Results:

2,500 genes significantly different (FDR < 0.05)
1,200 upregulated in tumors
1,300 downregulated in tumors

Top hit:

EGFR (Epidermal Growth Factor Receptor)
Ratio = 8.5 (8.5-fold upregulated)
Log2 = 3.09
p < 0.0001
FDR < 0.001
Conclusion: EGFR highly overexpressed in tumors! Potential drug target!

13.6.4.7 Single-Color vs. Two-Color Microarrays

Two-color microarrays:

Compare two samples directly on same array
Use two fluorescent dyes (e.g., Cy3 and Cy5)
Measures relative differences
Good for comparing disease vs. normal

Single-color microarrays (e.g., Affymetrix):

One sample per array, one dye
Measure absolute intensity for each sample separately
Compare results across multiple arrays
Better for comparing many samples

13.6.4.8 Microarray Applications

Gene expression profiling:

Compare disease vs. healthy tissue
Identify disease biomarkers
Classify cancer subtypes

Drug response:

Which genes change when drug is added?
Predict drug efficacy
Identify drug targets

Developmental biology:

How does gene expression change during development?
Track changes over time

Example - Cancer classification:

Extract RNA from tumor sample
Convert to cDNA, label with fluorescent dye
Hybridize to microarray
Compare expression pattern to database
Classify tumor type based on gene expression profile!

13.6.5 Microarray Normalization and Housekeeping Genes

Before we can compare microarray data, we need to normalize it - correct for technical differences that aren’t biological!

13.6.5.1 Why Normalization Is Needed

The problem: Technical variations between arrays make comparison difficult

Sources of variation (non-biological):

Different amounts of RNA loaded:
- Sample A: loaded 5 µg RNA
- Sample B: loaded 7 µg RNA
- Sample B will have higher signal, but NOT because genes are more expressed!
Different labeling efficiency:
- Cy3 dye might label better than Cy5
- Or vice versa
- Creates systematic bias
Scanner settings:
- Laser power varies slightly
- PMT (photomultiplier tube) sensitivity differences
- Different scan times
Slide differences:
- Spot sizes vary
- Amount of probe printed varies
- Slide-to-slide variation

Think of it like:

Comparing test scores from different teachers
One teacher grades easier (higher scores)
Need to normalize to compare fairly!

Goal of normalization: Remove technical variation, keep biological variation

13.6.5.2 Housekeeping Genes: Internal Controls

Housekeeping genes = Genes expressed at roughly constant levels in nearly all cells

Why they’re called “housekeeping”:

Like doing housework - always needs to be done!
Essential for basic cell survival
Should NOT change between conditions

Common housekeeping genes:

1. GAPDH (Glyceraldehyde-3-phosphate dehydrogenase):

Enzyme in glycolysis (energy production)
Most commonly used!
Should be same in cancer vs. normal
High expression (easy to detect)

2. ACTB (β-Actin):

Structural protein (part of cytoskeleton)
Always expressed
Very abundant

3. TUBB (β-Tubulin):

Another structural protein
Forms microtubules
Constantly needed

4. HPRT1 (Hypoxanthine Phosphoribosyltransferase 1):

Enzyme in the purine salvage pathway (nucleotide recycling)
More stable than GAPDH in some conditions

5. 18S rRNA:

Ribosomal RNA component
VERY abundant
Very stable

How housekeeping genes are used:

Method 1: As normalization reference:

Measure housekeeping gene expression
Should be same in both samples (disease vs. normal)
If not, normalize so they ARE the same
Apply same normalization to all other genes

Example:

Sample A (normal): GAPDH intensity = 10,000
Sample B (cancer): GAPDH intensity = 15,000
Ratio = 1.5 (but GAPDH shouldn’t change!)
Problem: Sample B has 1.5× higher signal overall (technical artifact)
Solution: Divide all Sample B intensities by 1.5
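The same correction can be sketched in Python. The GAPDH values come from the example above; the ACTB and GENE_X values are hypothetical, added to show how averaging over more than one housekeeping gene and rescaling another gene might look. The function name is illustrative.

```python
def housekeeping_scale_factor(reference, sample, genes):
    """Average sample/reference ratio across the chosen housekeeping genes."""
    ratios = [sample[gene] / reference[gene] for gene in genes]
    return sum(ratios) / len(ratios)


# GAPDH from the worked example; ACTB and GENE_X are made-up values
normal = {"GAPDH": 10_000, "ACTB": 20_000, "GENE_X": 3_000}
cancer = {"GAPDH": 15_000, "ACTB": 30_000, "GENE_X": 9_000}

factor = housekeeping_scale_factor(normal, cancer, ["GAPDH", "ACTB"])
cancer_normalized = {gene: value / factor for gene, value in cancer.items()}

print(factor)                                           # 1.5
print(cancer_normalized["GENE_X"] / normal["GENE_X"])   # 2.0
```

Before normalization GENE_X looked 3-fold higher in cancer; after removing the 1.5× technical difference, the change is 2-fold.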

Method 2: As positive controls:

If housekeeping genes show big changes → something went wrong!
Might indicate:
- Poor RNA quality
- Technical problems
- Bad normalization
- Need to repeat experiment

Important note: Housekeeping genes aren’t perfect!

Can change in some conditions (stress, cell cycle, etc.)
Always use multiple housekeeping genes
Validate that they’re actually stable in YOUR experiment

13.6.5.3 Common Normalization Methods

1. Global normalization (total intensity):

Assume: Total amount of RNA is same in all samples
Method: Make total signal equal across all arrays
Scale factor = Average total intensity / This array’s total intensity
Multiply all spots by scale factor

Example:

Array 1 total: 1,000,000
Array 2 total: 1,200,000
Average: 1,100,000
Scale Array 1 by: 1,100,000 / 1,000,000 = 1.1
Scale Array 2 by: 1,100,000 / 1,200,000 = 0.917

Pros: Simple, fast
Cons: Assumes most genes don’t change (not true if many change!)
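The scale factors in the global normalization example can be reproduced with a short sketch. The function name is illustrative; the totals are the ones used in the example.

```python
def global_scale_factors(totals):
    """Scale factor per array = average total intensity / that array's total."""
    target = sum(totals) / len(totals)
    return [target / total for total in totals]


factors = global_scale_factors([1_000_000, 1_200_000])
print([round(f, 3) for f in factors])  # [1.1, 0.917]
```

Multiplying every spot on an array by its factor brings both arrays to the same total of 1,100,000.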

2. Housekeeping gene normalization:

Use genes that shouldn’t change (GAPDH, actin, etc.)
Make these equal across arrays
Scale other genes proportionally

Pros: Biological control
Cons: Requires knowing which genes are stable

3. Quantile normalization:

Make distribution of intensities identical across arrays
Statistical method
Most robust

Pros: Works well, widely used
Cons: More complex, requires software
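Quantile normalization sounds complicated, but the core idea is short: rank the values on each array, then replace each value with the average of all values that share its rank. The sketch below is a simplified version that ignores tied values (real software handles ties more carefully), and the intensities are hypothetical.

```python
def quantile_normalize(arrays):
    """Give every array the same intensity distribution.

    `arrays` is a list of arrays; each array lists intensities for the
    same genes in the same order. Ties are not specially handled.
    """
    n_genes = len(arrays[0])
    sorted_arrays = [sorted(array) for array in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n_genes)
    ]
    normalized = []
    for array in arrays:
        order = sorted(range(n_genes), key=lambda i: array[i])
        out = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            out[gene_index] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical intensities for four genes on two arrays
arrays = [[50, 20, 30, 40], [80, 10, 60, 30]]
normalized = quantile_normalize(arrays)
print(normalized)  # [[65.0, 15.0, 30.0, 50.0], [65.0, 15.0, 50.0, 30.0]]
```

After normalization both arrays contain exactly the same set of values (15, 30, 50, 65), but each gene keeps its rank within its own array.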

4. Loess normalization (for two-color arrays):

Corrects intensity-dependent dye bias
Cy3 and Cy5 might behave differently at high/low intensities
Smooths out the bias

Pros: Corrects specific dye biases
Cons: Only for two-color arrays

13.6.5.4 Spike-In Controls

Spike-ins = Known amounts of synthetic RNA added to samples

How they work:

Add same amount of synthetic RNA to all samples
These RNAs don’t exist in your cells
Sequence is known
Probes for them on the microarray
Should give same signal in all samples
Use to normalize!

Think of spike-ins like:

Adding same amount of salt to different soups
Taste the salt level (measure signal)
Should be same in all soups
If not, something’s wrong with your measurement!

Common spike-in controls:

ERCC spike-ins: External RNA Controls Consortium
- 92 synthetic RNAs
- Different concentrations
- Cover wide range of expression levels
- Standard set used in many studies

Advantages of spike-ins:

Absolute control (know exact amount)
Independent of biological variation
Can assess technical reproducibility

Disadvantages:

Extra cost
Need to add carefully (pipetting error)
One more thing that can go wrong!

13.6.5.5 Quality Control: Checking Your Data

Before analysis, check data quality:

1. Check housekeeping genes:

Should be similar across all samples
If vary >2-fold → problem!

2. MA plot:

Should be centered around 0
If skewed → normalization needed

3. Box plots:

Distribution of intensities across arrays
Should be similar after normalization

4. Correlation between replicates:

Technical replicates should be very similar (R > 0.99)
Biological replicates should be similar (R > 0.95)
If not → outliers, technical problems

5. Check positive controls:

Spike-ins should work
Known upregulated genes should show up

6. Check negative controls:

Spots with no probe
Should have low/zero signal
High signal → contamination or high background

Example QC check:

Before normalization:

Array 1 median intensity: 5,000
Array 2 median intensity: 8,000
GAPDH Array 1: 10,000
GAPDH Array 2: 16,000
Problem: Array 2 systematically higher!

After normalization:

Array 1 median intensity: 6,000
Array 2 median intensity: 6,000
GAPDH Array 1: 12,000
GAPDH Array 2: 12,000
Good: Now comparable!

13.6.5.6 Limitations of Microarrays

Prior knowledge required:

Can only detect genes you put probes for
Cannot discover new genes or transcripts
Limited to known sequences

Relative quantification only:

Measures relative differences (fold-change)
Cannot tell absolute number of RNA molecules
Limited dynamic range

Background noise:

Non-specific binding creates noise
Cross-hybridization between similar sequences

Limited sensitivity:

Hard to detect low-abundance transcripts
Saturation at high expression levels

13.6.5.7 Microarrays vs. RNA-seq

Feature	Microarray	RNA-seq
Prior knowledge	Required	Not required
New transcripts	Cannot detect	Can discover
Quantification	Relative	Absolute possible
Dynamic range	Limited (~3 orders)	Wide (~5 orders)
Cost	Lower for targeted	Lower for discovery
Sensitivity	Moderate	High
Applications	Known genes, routine	Discovery, novel transcripts

When to use microarrays:

Focused gene expression studies
Routine clinical diagnostics
Cost-effective for specific gene panels
Comparing expression of known genes

When to use RNA-seq:

Discovering new transcripts
Detecting splice variants
Measuring absolute expression levels
Studying organisms without reference genome

13.6.6 Absolute vs. Relative Quantification

This is a fundamental difference between microarrays and RNA-seq!

13.6.6.1 What Is Relative Quantification?

Relative quantification = Comparing expression BETWEEN samples

Can say: “Gene X is 3-fold higher in sample A vs. sample B”
CANNOT say: “Sample A has 1,000 copies of Gene X”

Think of it like:

“This box is twice as heavy as that box” (relative)
But can’t say “This box weighs 5 kg” (absolute)

Microarrays do relative quantification:

Measure fluorescence intensities
Calculate ratios between samples
Tell you fold-change (up or down)
Cannot tell you absolute number of RNA molecules!

Why microarrays can’t do absolute quantification:

Fluorescence intensity is arbitrary:
- Intensity of 10,000 units doesn’t mean 10,000 RNA molecules
- Depends on:
  - Laser power
  - Dye efficiency
  - Probe binding efficiency
  - Scanner settings
Different probes behave differently:
- Some probes bind tightly (high signal)
- Some probes bind weakly (low signal)
- Can’t compare intensity across different genes directly
No external standard:
- Don’t know relationship between intensity and molecule count
- Like having a thermometer with no numbers!

What microarrays CAN tell you:

Gene A is 2-fold upregulated in disease
Gene B is 5-fold downregulated in disease
Gene A changed more than Gene C

What microarrays CANNOT tell you:

How many copies of Gene A are in the cell?
Is Gene A more abundant than Gene B?
Absolute concentration of RNA

13.6.6.2 What Is Absolute Quantification?

Absolute quantification = Measuring actual number (or concentration) of RNA molecules

Can say: “Sample A has ~1,000 copies of Gene X per cell”
Can say: “Gene X is more abundant than Gene Y”

RNA-seq can estimate absolute expression:

Count how many reads map to each gene
Normalize by:
- Gene length (longer genes get more reads)
- Sequencing depth (total reads in sample)
Estimate relative abundance

Common RNA-seq metrics:

1. Raw counts:

Number of reads mapping to gene
NOT absolute! Depends on sequencing depth
Example: Gene A = 5,000 reads

2. RPKM (Reads Per Kilobase per Million mapped reads):

Normalizes for gene length AND sequencing depth
RPKM = (Reads × 10⁹) / (Gene length in bp × Total reads)
Used for single-end sequencing

3. FPKM (Fragments Per Kilobase per Million):

Same as RPKM but for paired-end sequencing
Counts fragments (pairs) not individual reads

4. TPM (Transcripts Per Million):

Similar to FPKM but better for comparison
Normalizes differently (gene length first, then depth)
Preferred metric today!
Sum of all TPM values in a sample = 1,000,000

Why TPM is better than FPKM:

TPM values are easier to compare across samples
Sum to same total (1 million)
Like percentages that always add to 100%!

Example:

Sample 1 (30 million reads):

Gene A: 1,500 reads, TPM = 50
Gene B: 300 reads, TPM = 10

Sample 2 (60 million reads):

Gene A: 3,000 reads, TPM = 50
Gene B: 600 reads, TPM = 10

Interpretation:

Gene A has same TPM → same relative abundance
Even though raw counts doubled (because sequencing depth doubled)
TPM corrects for this!
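The RPKM formula and the "length first, then depth" order of TPM can be sketched in Python. The three-gene sample below is a hypothetical toy, so its TPM values are far larger than the TPM = 50 in the example above, where thousands of genes share the million. The function names are illustrative.

```python
def rpkm(reads: int, length_bp: int, total_reads: int) -> float:
    """Reads Per Kilobase per Million mapped reads."""
    return reads * 1e9 / (length_bp * total_reads)


def tpm(counts: dict, lengths_bp: dict) -> dict:
    """Transcripts Per Million: divide by gene length first, then rescale
    so that all values in the sample sum to one million."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total_rate = sum(rates.values())
    return {gene: rate / total_rate * 1e6 for gene, rate in rates.items()}


# Hypothetical toy transcriptome with only three genes
lengths = {"A": 3_000, "B": 1_500, "C": 2_000}
counts_sample_1 = {"A": 1_500, "B": 300, "C": 1_200}
# Same sample sequenced twice as deeply: every count doubles
counts_sample_2 = {gene: 2 * n for gene, n in counts_sample_1.items()}

tpm_1 = tpm(counts_sample_1, lengths)
tpm_2 = tpm(counts_sample_2, lengths)

print({gene: round(value) for gene, value in tpm_1.items()})
print({gene: round(value) for gene, value in tpm_2.items()})
print(round(rpkm(1_500, 3_000, 30_000_000), 2))  # 16.67
```

The two TPM dictionaries are identical even though the raw counts doubled, and each sums to one million.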

Can RNA-seq give TRUE absolute counts?

Not quite, but close:

TPM/FPKM are proportional to true abundance
With spike-ins, can estimate molecules per cell
More absolute than microarrays!

But still not perfect:

PCR bias (some sequences amplify better)
Mapping bias (repetitive sequences hard to map)
Gene length effects

13.6.6.3 Comparison: Absolute vs. Relative

Feature	Microarray	RNA-seq
Type	Relative only	Semi-absolute (TPM/FPKM)
Can compare within sample?	No (across genes)	Yes!
Can compare across samples?	Yes (fold-change)	Yes (fold-change AND TPM)
Absolute molecule count?	No	Estimate (with spike-ins)
Compare Gene A to Gene B?	No	Yes (via TPM)
Units	Arbitrary fluorescence	Normalized counts (TPM)

When absolute quantification matters:

Comparing different genes:
- Is Gene A or Gene B more abundant?
- Microarray can’t tell, RNA-seq can!
Meta-analysis:
- Combining data from multiple studies
- TPM values more comparable than microarray intensities
Mathematical modeling:
- Need actual concentrations
- TPM can be converted to estimates
Therapeutic targeting:
- Is gene expressed enough to target with drug?
- Need to know abundance, not just fold-change

Example use case:

Question: Should we target Gene X or Gene Y for cancer therapy?

Microarray says:

Both upregulated 5-fold in cancer
Equally good targets? 🤷

RNA-seq says:

Gene X: TPM = 500 (highly expressed!)
Gene Y: TPM = 5 (barely expressed)
Gene X is better target! (100× more abundant)

This is information microarrays cannot provide!

13.6.7 Expression Analysis Workflows: From Sample to Discovery

Let’s walk through a complete gene expression analysis from start to finish!

13.6.7.1 Study Design: Disease vs. Normal

Research question: What genes are different in breast cancer vs. normal tissue?

Experimental design:

Samples:
- 10 breast cancer tumors
- 10 normal breast tissue (from same patients or matched controls)
Biological replicates: 10 per group (more is better!)
- Why replicates? Biological variation between patients
- Need enough to see consistent patterns
Technical considerations:
- Collect tissue same way (fresh frozen or FFPE)
- Extract RNA with same protocol
- Check RNA quality (RIN score > 7)
- Process all samples together (batch effects!)

13.6.7.2 Step-by-Step Workflow

Step 1: RNA Extraction

Homogenize tissue (break cells open)
Use TRIzol or column-based extraction
Critical: Work quickly! RNA degrades fast!
Keep everything cold, use RNase-free materials
Check RNA quality on Bioanalyzer (RIN score)

Step 2: cDNA Synthesis (covered in detail below)

Convert RNA to cDNA using reverse transcriptase
Why? cDNA is stable, RNA is not!
Label with fluorescent dyes (Cy3 for normal, Cy5 for tumor)

Step 3: Hybridization to Microarray

Mix labeled cDNA samples
Apply to microarray slide
Incubate overnight (16-18 hours)
Wash away unbound cDNA

Step 4: Scanning

Scan with laser scanner
Get images (green and red channels)
Software measures spot intensities

Step 5: Data Processing

Background subtraction
Normalization (using housekeeping genes or quantile normalization)
Calculate ratios (Red/Green = tumor/normal)
Log2 transform
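Here is a minimal sketch of Step 5, assuming made-up spot intensities. For normalization it uses simple median centering of the log ratios, which is a stand-in for (not the same as) the housekeeping-gene or quantile normalization listed above. The function names and the `floor` parameter are illustrative.

```python
import math
from statistics import median

def log_ratios(red, green, red_bg, green_bg, floor=1.0):
    """Background-subtract each channel, then return log2(Red/Green) per spot."""
    ratios = []
    for r, g, rb, gb in zip(red, green, red_bg, green_bg):
        r_net = max(r - rb, floor)  # floor avoids log of zero or negative values
        g_net = max(g - gb, floor)
        ratios.append(math.log2(r_net / g_net))
    return ratios

def median_center(values):
    """Simple global normalization: shift log ratios so their median is 0."""
    m = median(values)
    return [v - m for v in values]

# Hypothetical raw intensities for five spots (Red = tumor, Green = normal)
red = [1100, 2100, 500, 8100, 300]
green = [600, 1100, 500, 1100, 1100]
red_bg = [100] * 5
green_bg = [100] * 5

raw = log_ratios(red, green, red_bg, green_bg)
normalized = median_center(raw)
print([round(v, 2) for v in raw])
print([round(v, 2) for v in normalized])
```

Working in log2 makes up- and downregulation symmetric around 0, which is why the ratio is transformed before any statistics are run.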

Step 6: Quality Control

Check housekeeping genes (should be constant)
Check replicate correlation
Remove outliers
MA plots, box plots
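Two of these checks are easy to sketch: the correlation between replicates and the M and A values that go into an MA plot. The replicate values below are invented for illustration, and the helper names are assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long lists of values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ma_values(red, green):
    """Per spot: M = log2(R/G) (log ratio), A = average log2 intensity."""
    return [
        (math.log2(r / g), 0.5 * (math.log2(r) + math.log2(g)))
        for r, g in zip(red, green)
    ]

# Hypothetical log2 expression values for the same genes in two replicates
rep1 = [2.0, 4.0, 6.0, 8.0]
rep2 = [2.1, 3.9, 6.2, 7.8]
print("replicate correlation:", round(pearson(rep1, rep2), 3))
print("M, A for one spot:", ma_values([8.0], [2.0])[0])
```

A replicate pair with a much lower correlation than the others is a candidate outlier.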

Step 7: Statistical Analysis

Identify significantly changed genes
Use t-test or SAM
Apply FDR correction
Cutoffs: |log2 fold-change| > 1 AND FDR < 0.05
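The FDR correction step can be sketched with the Benjamini-Hochberg procedure, one common way to compute FDR-adjusted p-values (the text above does not say which FDR method was used, so treat this choice as an assumption). The p-values are made up, and the t-test that would produce them is not shown.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never decrease with rank
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four genes
p_values = [0.01, 0.04, 0.03, 0.20]
fdr = benjamini_hochberg(p_values)
significant = [i for i, q in enumerate(fdr) if q < 0.05]
print([round(q, 4) for q in fdr], significant)
```

With 40,000 genes tested, an uncorrected p < 0.05 cutoff would let through about 2,000 genes by chance alone, which is why the correction matters.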

Step 8: Results

Example results:

Total genes tested: 40,000
Significantly different: 2,847 genes (FDR < 0.05)
- Upregulated in cancer: 1,523 genes
- Downregulated in cancer: 1,324 genes

Top upregulated genes (selected examples):

Gene	Fold-Change	Log2 FC	FDR	Function
ERBB2 (HER2)	12.5	3.64	<0.001	Growth receptor
MKI67 (Ki67)	8.2	3.04	<0.001	Cell proliferation
CCNB1	7.8	2.96	<0.001	Cell cycle
TOP2A	6.5	2.70	<0.001	DNA replication
EGFR	5.2	2.38	<0.001	Growth receptor

Top downregulated genes (selected examples):

Gene	Fold-Change	Log2 FC	FDR	Function
ESR1 (ER)	0.15	-2.74	<0.001	Estrogen receptor
PGR (PR)	0.18	-2.47	<0.001	Progesterone receptor
GATA3	0.22	-2.18	<0.001	Transcription factor
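The Fold-Change and Log2 FC columns carry the same information. A quick check in Python, using values taken from the two tables above (the `fold_decrease` helper is just an illustrative name):

```python
import math

# Fold-changes copied from the tables above
fold_changes = {"ERBB2": 12.5, "MKI67": 8.2, "ESR1": 0.15, "GATA3": 0.22}

# log2 turns "x-fold up" and "x-fold down" into +/- values of the same size
log2_fc = {gene: round(math.log2(fc), 2) for gene, fc in fold_changes.items()}

def fold_decrease(fold_change):
    """Express a fold-change below 1 as 'this many times lower'."""
    return 1 / fold_change

for gene, value in log2_fc.items():
    print(gene, value)
print("ESR1 is about", round(fold_decrease(0.15), 1), "times lower in cancer")
```

A fold-change of 0.15 looks small next to 12.5, but on the log2 scale it is a change of comparable size in the opposite direction.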

13.6.7.3 Step 9: Visualization

Create heatmap:

Rows = top 100 changed genes
Columns = samples (10 cancer, 10 normal)
Colors = expression level (red = high, green = low)
Clustering groups similar samples and genes

Result: Cancers cluster together, normals cluster together!

Create volcano plot:

X-axis = log2 fold-change
Y-axis = -log10(FDR)
Points in the upper-left and upper-right corners = significant AND large change (down- or upregulated)
These are your biomarker candidates!
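The volcano plot coordinates are simple to compute. This sketch assigns each gene its x and y position and a label using the cutoffs from Step 7; the plotting itself is left out, and the function name is an assumption.

```python
import math

def volcano_point(log2_fc, fdr, fc_cutoff=1.0, fdr_cutoff=0.05):
    """Return (x, y, label) for one gene on a volcano plot."""
    x = log2_fc
    y = -math.log10(fdr)  # small FDR -> large y, so significant genes sit high
    if fdr < fdr_cutoff and log2_fc > fc_cutoff:
        label = "up"
    elif fdr < fdr_cutoff and log2_fc < -fc_cutoff:
        label = "down"
    else:
        label = "not significant"
    return x, y, label

# Hypothetical genes: (log2 fold-change, FDR)
examples = {"gene_1": (3.64, 0.001), "gene_2": (-2.74, 0.0001), "gene_3": (0.5, 0.0001)}
for name, (lfc, q) in examples.items():
    print(name, volcano_point(lfc, q))
```

Note that `gene_3` is highly significant but changes too little to pass the fold-change cutoff, so it lands in the top middle of the plot rather than in a corner.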

13.6.7.4 Step 10: Pathway Analysis

Question: Are these 2,847 genes related? Do they work together?

Pathway enrichment analysis:

Use tools: DAVID, Enrichr, GSEA
Test if changed genes are enriched in known pathways
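One common way to test for enrichment is a hypergeometric (one-sided Fisher) test: if we drew the changed genes at random, how likely is it to hit this many pathway genes? The sketch below is an illustration of that idea, not the exact statistic any particular tool uses. The pathway size of 120 in the usage line is a made-up number.

```python
from math import comb

def enrichment_p_value(total_genes, pathway_genes, changed_genes, overlap):
    """P(at least `overlap` pathway genes among the changed genes) by chance."""
    upper = min(pathway_genes, changed_genes)
    # Count the ways to draw k pathway genes and (changed_genes - k) other genes
    favorable = sum(
        comb(pathway_genes, k) * comb(total_genes - pathway_genes, changed_genes - k)
        for k in range(overlap, upper + 1)
    )
    return favorable / comb(total_genes, changed_genes)

# Small case that can be checked by hand: 10 genes, 4 in the pathway,
# 3 changed, at least 2 of them in the pathway -> 40/120 = 1/3
print(enrichment_p_value(10, 4, 3, 2))

# Study-sized case: 40,000 genes tested, 2,847 changed, 78 of them in a
# hypothetical 120-gene pathway
print(enrichment_p_value(40000, 120, 2847, 78))
```

By chance, only about 7% of any pathway (2,847 / 40,000) should show up in the changed list, so finding most of a pathway there gives a tiny p-value.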

Example results:

Top enriched pathways (p < 0.001):

Cell cycle (78 genes, p = 1.2 × 10⁻²⁵)
DNA replication (35 genes, p = 3.4 × 10⁻¹⁸)
p53 signaling (42 genes, p = 2.1 × 10⁻¹⁵)
ECM-receptor interaction (51 genes, p = 8.7 × 10⁻¹²)

Interpretation:

Cell cycle genes highly upregulated → cancer cells dividing rapidly!
p53 pathway disrupted → tumor suppressor pathway broken
Makes biological sense!

13.6.7.5 Step 11: Validation

Never trust microarray results alone! Validate with an independent method:

Methods for validation:

qRT-PCR (quantitative RT-PCR):
- Gold standard for validation
- Measure specific genes in same samples
- More accurate than microarray
- Validate top 10-20 genes
RNA-seq:
- Sequence same samples
- Check if same genes come up
- More comprehensive
Protein validation:
- Western blot or immunohistochemistry
- Check if protein levels match RNA levels
- Sometimes they don’t! (post-transcriptional regulation)

Example validation:

Microarray: ERBB2 upregulated 12.5-fold
qRT-PCR: ERBB2 upregulated 11.8-fold ✓
Western blot: HER2 protein high in tumors ✓
Validated!
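How does qRT-PCR give a fold-change? A widely used approach is the 2^-ΔΔCt calculation, sketched below. It is not described in the text above, so take it as one common option; it assumes the PCR doubles the product every cycle, and the Ct values and the reference gene are hypothetical.

```python
def ddct_fold_change(ct_target_tumor, ct_ref_tumor, ct_target_normal, ct_ref_normal):
    """Relative expression by the 2^-ddCt method (assumes ~100% PCR efficiency)."""
    d_ct_tumor = ct_target_tumor - ct_ref_tumor      # normalize to reference gene
    d_ct_normal = ct_target_normal - ct_ref_normal
    dd_ct = d_ct_tumor - d_ct_normal                 # tumor relative to normal
    return 2 ** (-dd_ct)

# Hypothetical Ct values: target gene vs. a housekeeping reference gene
fold = ddct_fold_change(
    ct_target_tumor=22.0, ct_ref_tumor=18.0,
    ct_target_normal=25.5, ct_ref_normal=18.0,
)
print(round(fold, 1))
```

A lower Ct means the target was detected earlier, so there was more of it to begin with. Here the target crosses the threshold 3.5 cycles earlier in tumor, which corresponds to roughly an 11-fold increase.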

13.6.7.6 Step 12: Biological Interpretation

Key questions:

What do these genes do?
- Look up functions (Gene Ontology, UniProt)
- Group by function
Are known cancer genes here?
- ERBB2/HER2 → known oncogene!
- ESR1/ER → prognostic marker
- Matches known biology ✓
Any surprising discoveries?
- Novel genes not previously linked to cancer
- Potential new drug targets!
Clinical implications?
- Can we classify tumors by expression pattern?
- Predict patient outcomes?
- Suggest treatments?

Example outcome:

Identified HER2-positive subtype
These patients respond to Herceptin (anti-HER2 drug)
Gene expression predicts treatment!

13.6.8 Reverse Transcriptase: The Enzyme That Makes cDNA

Now let’s understand in detail HOW we convert RNA to cDNA!

13.6.8.1 What Is Reverse Transcriptase?

Reverse transcriptase (RT) = Enzyme that makes DNA from RNA template

“Reverse” because normal direction is DNA → RNA (transcription)
This goes backwards: RNA → DNA!
Also called RNA-dependent DNA polymerase

Where does RT come from?

Retroviruses: HIV, HTLV, etc.
These viruses have RNA genomes
Need to make DNA copy to insert into our genome
We “borrowed” their enzyme for research!

Think of it like:

Normal transcription = reading a book and taking notes (DNA → RNA)
Reverse transcription = reconstructing the book from notes (RNA → DNA)

13.6.8.2 Why HIV Needs Reverse Transcriptase

HIV life cycle (simplified):

HIV enters cell (has RNA genome)
RT makes DNA copy of HIV RNA
DNA copy integrates into human chromosome
The infected cell can now make HIV RNA and proteins for as long as it survives!
This is why HIV infection is chronic - the virus is in the cell’s DNA!

Why we can’t cure HIV easily:

Once integrated, HIV DNA stays in genome
Antiretroviral drugs block RT → stop the virus from infecting new cells
But can’t remove integrated DNA

How RT inhibitors work (HIV drugs):

AZT, nevirapine, etc.
Block RT enzyme
Prevent HIV from making DNA copy
Stop virus from integrating

13.6.8.3 The RT Mechanism: Step by Step

Requirements for reverse transcription:

Template RNA: Your RNA of interest
Reverse transcriptase enzyme: Usually from MMLV or AMV
Primer: Short DNA or RNA that binds to template
dNTPs: Building blocks (dATP, dTTP, dGTP, dCTP)
Buffer: Right salt and pH conditions

Why RT needs a primer (just like DNA polymerase):

Cannot start synthesis de novo (from scratch)
Needs 3’-OH group to add nucleotides to
Primer provides this starting point

Types of primers:

1. Oligo-dT primers (most common):

Short stretch of T nucleotides (15-20 Ts)
Binds to poly-A tail of mRNA
Sequence: 5’-TTTTTTTTTTTTTTT-3’
Binds to: 3’-AAAAAAAAAAAAAAA-5’ (poly-A tail)

Advantage: Specifically primes mRNA (not rRNA or tRNA, which lack poly-A tails)
Disadvantage: Might not reach 5’ end of long transcripts

2. Random hexamer primers:

Short random sequences (6 nucleotides)
Bind randomly all along RNA
Example: 5’-NNNNNN-3’ (N = any base)

Advantage: Covers all RNA regions, including 5’ end
Disadvantage: Copies ribosomal RNA too (wasteful!)

3. Gene-specific primers:

Design primer for specific gene you want
Most specific!

Advantage: Only your gene of interest
Disadvantage: Need to design primers, one gene at a time
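The difference between the first two primer types can be shown with a small sketch. It treats priming as exact string matching, which is a big simplification of real annealing, and the two RNA sequences are invented for illustration.

```python
import re
from itertools import product

def poly_a_tail_length(rna):
    """Length of the run of A's at the 3' end of an RNA written 5'->3'."""
    match = re.search(r"A+$", rna)
    return len(match.group(0)) if match else 0

def oligo_dt_can_prime(rna, primer_length=15):
    """Oligo-dT needs a poly-A stretch at least as long as the primer."""
    return poly_a_tail_length(rna) >= primer_length

# All possible random hexamers: 4 choices at each of 6 positions
random_hexamers = ["".join(p) for p in product("ACGT", repeat=6)]

mrna = "GGCAUUUGCC" + "A" * 20     # hypothetical polyadenylated mRNA
rrna_fragment = "GGCAUUUGCCGUC"    # hypothetical RNA without a poly-A tail

print("oligo-dT primes mRNA:", oligo_dt_can_prime(mrna))
print("oligo-dT primes rRNA fragment:", oligo_dt_can_prime(rrna_fragment))
print("number of distinct hexamers:", len(random_hexamers))
```

Because a random hexamer mix contains every 6-base sequence, some primer will match almost anywhere on any RNA, which is both its advantage and its disadvantage.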

13.6.8.4 The Reverse Transcription Reaction

Step 1: Annealing

Mix RNA + primer
Heat to 70°C (denature any RNA secondary structure)
Cool to 25°C (for random hexamers) or 42°C (for oligo-dT)
Primer binds to complementary sequence on RNA

Example (with oligo-dT):

RNA:    5'---GGCAUUUGCCAAA[AAAAAAAAAA]-3' (poly-A tail)
                           ||||||||||
Primer:                 3'-TTTTTTTTTT-5' (oligo-dT)

Step 2: Synthesis

Add reverse transcriptase enzyme + dNTPs
RT synthesizes DNA complementary to RNA
Builds the new strand in the 5’ → 3’ direction (like all polymerases), reading the template 3’ → 5’
Creates RNA-DNA hybrid

After RT synthesis:

RNA:  5'---GGCAUUUGCCAAA[AAAAAAAAAA]-3'
           |||||||||||||||||||||||||
cDNA: 3'---CCGTAAACGGTTT[TTTTTTTTTT]-5'

Top strand = original RNA (template)
Bottom strand = newly synthesized cDNA (complementary DNA)
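The base-pairing rule RT follows can be written as a tiny function. This is only a sketch of the sequence logic (A pairs with T, U with A, G with C), not of the enzyme chemistry; the function name is an assumption, and the RNA is the example sequence from the diagram above.

```python
# RNA base -> complementary DNA base
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}

def first_strand_cdna(rna):
    """cDNA complementary to an RNA written 5'->3'.

    The result is also written 5'->3', so it is the reverse complement.
    """
    return "".join(COMPLEMENT[base] for base in reversed(rna))

rna = "GGCAUUUGCCAAA" + "A" * 10   # example transcript with a short poly-A tail
cdna = first_strand_cdna(rna)

# Reversing the string gives the 3'->5' orientation used in the diagram
cdna_3_to_5 = cdna[::-1]
print("cDNA 5'->3':", cdna)
print("cDNA 3'->5':", cdna_3_to_5)
```

The cDNA starts with a run of T's at its 5’ end because synthesis began at the oligo-dT primer sitting on the poly-A tail.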

Step 3: RNA degradation (optional):

Add RNase H enzyme
Degrades RNA part of RNA-DNA hybrid
Leaves single-stranded cDNA

Result: Single-stranded cDNA

cDNA: 3'---CCGTAAACGGTTT[TTTTTTTTTT]-5'

Step 4: Second strand synthesis (if needed):

Some applications need double-stranded cDNA
Add DNA polymerase + primers
Makes complementary strand

Result: Double-stranded cDNA

5'---GGCATTTGCCAAA[AAAAAAAAAA]-3'
3'---CCGTAAACGGTTT[TTTTTTTTTT]-5'

Now looks like normal DNA! Can be:

Cloned into vectors
Amplified by PCR
Sequenced
Stored long-term

13.6.8.5 RT Enzymes: Which One to Use?

Common reverse transcriptases:

1. MMLV-RT (Moloney Murine Leukemia Virus):

Most commonly used
Works at 37-42°C
Good processivity (makes long cDNA)
Sensitive to high temperature

2. AMV-RT (Avian Myeloblastosis Virus):

Works at higher temperature (42-55°C)
Better for RNA with secondary structure
More RNase H activity (degrades template)

3. SuperScript (engineered MMLV):

Modified version of MMLV
More stable at higher temperature (up to 50°C)
Lower RNase H activity (preserves template)
Very popular for difficult templates

Choosing RT enzyme:

Standard use: MMLV-RT or SuperScript II/III
GC-rich RNA: SuperScript III (higher temp denatures structure)
Long transcripts: SuperScript IV (processivity up to 12 kb!)
Budget: MMLV-RT (cheaper)

13.6.8.6 Common Problems and Solutions

Problem 1: No cDNA produced

Possible causes:

RNA degraded (use RNase-free everything!)
RT enzyme dead (check expiration, storage)
No primer binding (wrong primer for your RNA)

Solution:

Check RNA quality on gel/Bioanalyzer
Use fresh RT enzyme
Try different primer type

Problem 2: cDNA too short

Only getting the 3’ end of the transcript, missing the 5’ end (typical with oligo-dT, which starts at the poly-A tail)

Cause:

RNA secondary structure blocks RT
RT falls off template

Solution:

Use higher temperature (SuperScript III at 50°C)
Add DMSO or betaine (disrupts structure)
Use random hexamers instead of oligo-dT

Problem 3: Genomic DNA contamination

Amplifying genomic DNA instead of cDNA!

Solution:

Treat RNA with DNase I (removes DNA)
Design PCR primers that span exon-exon junctions (won’t amplify genomic DNA, which still contains the introns)
Run a no-RT control (same reaction without reverse transcriptase) to detect leftover genomic DNA
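Why does a primer that crosses a splice junction ignore genomic DNA? In genomic DNA the two exons are separated by the intron, so the primer’s binding site simply does not exist there. The sketch below shows this with invented exon and intron sequences and treats primer binding as an exact sequence match, which is a simplification.

```python
# Hypothetical sequences (sense strand, 5'->3')
exon1 = "ATGGCCATTGTAATGGGCCGC"
intron = "GTAAGTCTGACCTTTCTTTCAG"
exon2 = "TGGAACTGGATCGATTACGCA"

genomic = exon1 + intron + exon2   # genomic DNA still contains the intron
cdna = exon1 + exon2               # spliced mRNA copied into cDNA

# Forward primer: last 10 bases of exon 1 joined to first 10 bases of exon 2
junction_primer = exon1[-10:] + exon2[:10]

def primer_site_present(template, primer):
    """True if the primer sequence occurs exactly in the template."""
    return primer in template

print("binds cDNA:", primer_site_present(cdna, junction_primer))
print("binds genomic DNA:", primer_site_present(genomic, junction_primer))
```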

13.6.8.7 Why RT Is Essential for RNA Studies

Applications of reverse transcription:

Microarrays: Need cDNA for labeling and hybridization
RNA-seq: Most methods convert RNA to cDNA first
qRT-PCR: Quantify specific genes
cDNA libraries: Clone all genes from organism
Northern blot alternative: Detect specific RNAs

Without RT:

RNA too unstable to work with
RNases everywhere would destroy samples
Can’t amplify with PCR (needs DNA)
Can’t clone into vectors

With RT:

Convert to stable cDNA
Amplify, sequence, clone
Store indefinitely
Study gene expression!

Summary: Reverse transcriptase is THE KEY enzyme that enables most RNA research!

13.6.9 RNA Sequencing (RNA-seq)

Modern method to study the transcriptome (Wang, Gerstein, and Snyder 2009; Mortazavi et al. 2008):

How it works:

Extract all RNA from cells
Convert RNA to cDNA (more stable)
Sequence the cDNA
Count how many reads come from each RNA
Determine which genes are active
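The counting step can be sketched as follows, assuming the reads have already been assigned to genes. The gene names, read counts, and lengths are made up; the TPM calculation divides by gene length first and then scales so all values sum to one million.

```python
from collections import Counter

def count_reads(read_assignments):
    """Count reads per gene from a list of gene names (one entry per mapped read)."""
    return Counter(read_assignments)

def tpm(counts, lengths_kb):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = {gene: counts[gene] / lengths_kb[gene] for gene in lengths_kb}
    total = sum(rates.values())
    return {gene: rate / total * 1_000_000 for gene, rate in rates.items()}

# Hypothetical mapped reads and gene lengths (in kilobases)
reads = ["GENE_A"] * 200 + ["GENE_B"] * 300 + ["GENE_C"] * 100
lengths_kb = {"GENE_A": 2.0, "GENE_B": 6.0, "GENE_C": 1.0}

counts = count_reads(reads)
values = tpm(counts, lengths_kb)
for gene, value in values.items():
    print(gene, counts[gene], "reads ->", round(value), "TPM")
```

GENE_B has the most reads but the lowest TPM, because a longer transcript produces more fragments per molecule. This is why raw counts are not compared directly between genes.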

What we learn:

Which genes are expressed
How much each gene is expressed
Which splice variants are made
Discovery of new RNAs

13.6.10 Applications

Medicine:

Diagnose diseases by RNA patterns
Predict treatment response
Find new drug targets
Understand cancer

Development:

How embryos develop
Cell differentiation
Tissue formation

Evolution:

Compare gene expression across species
Understand how expression patterns evolved

13.7 Key Takeaways

Transcriptome = All RNA molecules in a cell
Transcriptomics = Study of the transcriptome
Three main RNA types:
- mRNA (1-5%): Codes for proteins
- tRNA (15%): Delivers amino acids
- rRNA (80%): Forms ribosomes
RNA processing in eukaryotes:
- 5’ capping: Protective cap
- 3’ poly-A tail: Protective tail
- Splicing: Removing introns, joining exons
Alternative splicing = One gene → multiple proteins
- Explains how 20,000 genes make 100,000+ proteins
- Regulated by splicing factors
Non-coding RNAs have important functions:
- miRNAs: Regulate gene expression
- lncRNAs: Various regulatory functions
RNA-seq = Modern method to study transcriptome
Transcriptome varies between cell types, conditions, and times

Sources: Information adapted from Khan Academy, Nature Education, and transcriptomics research literature (Wang, Gerstein, and Snyder 2009; Mortazavi et al. 2008; Schena et al. 1995).

Mortazavi, Ali, Brian A. Williams, Kenneth McCue, Lorian Schaeffer, and Barbara Wold. 2008. “Mapping and Quantifying Mammalian Transcriptomes by RNA-Seq.” Nature Methods 5 (7): 621–28.

Schena, Mark, Dari Shalon, Ronald W. Davis, and Patrick O. Brown. 1995. “Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray.” Science 270 (5235): 467–70.

Shalon, Dari, Stephen J. Smith, and Patrick O. Brown. 1996. “A DNA Microarray System for Analyzing Complex DNA Samples Using Two-Color Fluorescent Probe Hybridization.” Genome Research 6 (7): 639–45.

Wang, Zhong, Mark Gerstein, and Michael Snyder. 2009. “RNA-Seq: A Revolutionary Tool for Transcriptomics.” Nature Reviews Genetics 10 (1): 57–63.