13 Transcriptomics
13.1 Beyond the Genome: The Transcriptome
13.1.1 What Is the Transcriptome?
If the genome is all your DNA, the transcriptome is all the RNA being made in your cells right now!
Think of it like:
Genome = Your complete cookbook (all possible recipes)
Transcriptome = The recipes you’re actually using today
The transcriptome changes:
Between different cell types (brain vs. muscle)
At different times (morning vs. evening)
In different conditions (healthy vs. sick)
Transcriptomics is the study of all RNA molecules in a cell or organism.
13.2 The Three Main Types of RNA
13.2.1 1. mRNA (Messenger RNA)
What it does: Carries the recipe from DNA to ribosomes
Think of it like: A photocopy of a recipe that goes to the kitchen
Key facts:
Temporary (breaks down quickly)
Only about 1-5% of total RNA
Codes for proteins
Each mRNA corresponds to one gene (in eukaryotes)
Process:
Transcribed from DNA in the nucleus
Processed (capping, poly-A tail, splicing)
Travels to cytoplasm
Read by ribosomes to make protein
Eventually degrades
13.2.2 2. tRNA (Transfer RNA)
What it does: Brings amino acids to the ribosome during protein synthesis
Think of it like: Delivery trucks bringing ingredients to the kitchen
Key facts:
Cloverleaf shape (looks like a three-leaf clover when flat)
3D L-shape structure
About 15% of total RNA
Each tRNA carries one specific amino acid
Has an “anticodon” that matches mRNA codons
How it works:
tRNA picks up its specific amino acid
Recognizes the matching codon on mRNA
Delivers amino acid to growing protein chain
Detaches and goes back for another amino acid
There is at least one tRNA for each of the 20 amino acids!
13.2.3 3. rRNA (Ribosomal RNA)
What it does: Forms the structure of ribosomes (protein-making factories)
Think of it like: The factory building itself
Key facts:
About 80% of total RNA! (Most abundant)
Combines with proteins to form ribosomes
Does NOT code for proteins
Actually catalyzes protein synthesis (it’s an enzyme!)
Very ancient and conserved across all life
Structure:
Humans have 4 types of rRNA
Combine with ~80 ribosomal proteins
Form large and small ribosomal subunits
13.3 RNA Processing in Eukaryotes
13.3.1 From Pre-mRNA to Mature mRNA
In eukaryotes, RNA goes through major changes before it’s ready! The initial RNA copy is called pre-mRNA (precursor mRNA).
13.3.2 1. 5’ Capping
What happens: A special “cap” is added to the beginning (5’ end) of the RNA
The cap:
A modified guanine nucleotide
Added immediately after transcription starts
Like putting a protective cap on a pen
Why it matters:
Protects mRNA from degradation
Helps ribosome recognize and bind to mRNA
Helps transport mRNA out of nucleus
13.3.3 2. 3’ Poly-A Tail
What happens: A long tail of adenine (A) nucleotides is added to the end (3’ end)
The tail:
About 200-250 adenines in a row (AAAAAAAA…)
Added after transcription finishes
Like a protective tail
Why it matters:
Protects mRNA from degradation
Helps export mRNA from nucleus
Helps ribosomes find and bind mRNA
Longer tail = longer mRNA lifespan
13.3.4 3. Splicing: Removing Introns
What happens: Introns (non-coding regions) are cut out, exons (coding regions) are joined together
The process:
Pre-mRNA contains both introns and exons
Spliceosome (molecular machine) recognizes intron-exon boundaries
Cuts out introns
Joins exons together
Introns are degraded
Think of it like:
Filming a movie (transcription)
Editing out unwanted scenes (splicing)
Final cut ready for theaters (mature mRNA)!
13.4 Alternative Splicing: One Gene, Many Proteins!
13.4.1 The Amazing Discovery
Here’s where things get really interesting! One gene can make multiple different proteins through alternative splicing!
13.4.2 How Alternative Splicing Works
Instead of always joining exons in the same order, cells can:
Skip exons: Leave some exons out
Choose between mutually exclusive exons: Include one exon or the other, but never both
Use alternative splice sites: Cut at different positions
Retain introns: Sometimes keep an intron in
Think of it like:
You have LEGO bricks numbered 1, 2, 3, 4, 5
You could build: 1-2-3-4-5 OR 1-3-4-5 OR 1-2-4-5 OR 1-2-3-5
Same bricks, different combinations, different structures!
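The LEGO analogy can be sketched in a few lines of Python. This is a simplified illustration, assuming the first and last exons are always kept and each internal exon is either included or skipped; real cells regulate splicing and do not necessarily produce every combination. The function name `exon_skipping_isoforms` is made up for this example.

```python
from itertools import combinations

def exon_skipping_isoforms(exons):
    """List every isoform obtainable by skipping internal exons.

    Simplifying assumption: the first and last exons are always kept,
    and exon order never changes.
    """
    first, *internal, last = exons
    isoforms = []
    # Choose how many internal exons to keep, from all of them down to none.
    for keep in range(len(internal), -1, -1):
        for kept in combinations(internal, keep):
            isoforms.append([first, *kept, last])
    return isoforms

isoforms = exon_skipping_isoforms([1, 2, 3, 4, 5])
for iso in isoforms:
    print("-".join(str(e) for e in iso))
```

With five bricks this gives eight combinations, including the four listed above. Each extra internal exon doubles the count, which is how a single gene can reach thousands of versions.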
13.4.3 Why Alternative Splicing Is Powerful
Creates protein diversity:
Humans have ~20,000 genes
But make ~100,000+ different proteins!
Alternative splicing explains this!
Examples:
DSCAM gene (in fruit flies): Can make 38,000 different proteins from ONE gene!
Human titin gene: Makes different versions in different muscle types
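Antibody genes: Alternative splicing switches an antibody between its membrane-bound and secreted forms (the huge diversity of antibodies comes mainly from DNA rearrangement, not splicing)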
Allows fine-tuned regulation:
Different tissues can make different protein versions
Respond to different conditions
Create proteins with slightly different functions
13.4.4 Regulation of Alternative Splicing
Special proteins called splicing factors control which exons are included:
Some promote exon inclusion
Some promote exon skipping
Different cells have different splicing factors
Responds to signals (hormones, stress, development)
13.5 Non-Coding RNAs
13.5.1 RNA That Doesn’t Make Protein
We used to think all important RNA coded for proteins. We were wrong!
Many RNAs have functions without making proteins:
13.5.2 microRNAs (miRNAs)
What they are: Tiny RNAs (about 22 nucleotides long)
What they do: Regulate gene expression by:
Binding to mRNA
Blocking translation OR
Causing mRNA degradation
Impact:
One miRNA can regulate hundreds of different genes!
Important for development, cell division, disease
Dysregulation linked to cancer
Think of them like: Volume controls that turn down gene expression
13.5.3 Long Non-Coding RNAs (lncRNAs)
What they are: Long RNA molecules (>200 nucleotides) that don’t code for protein
What they do: Many different functions:
Regulate gene expression
Organize chromatin
Guide proteins to specific DNA locations
Some functions still unknown!
Impact:
More common than expected (thousands in humans!)
Important for development and disease
Active area of research
13.5.4 Other Functional RNAs
snRNAs: Small nuclear RNAs (part of the spliceosome!)
snoRNAs: Small nucleolar RNAs (modify other RNAs)
Ribozymes: RNA molecules that act as enzymes
siRNAs: Small interfering RNAs (used in research and medicine)
13.6 Studying the Transcriptome
13.6.1 Why Study the Transcriptome?
Understanding gene expression:
Which genes are turned on in different cells?
How does gene expression change in disease?
How do cells respond to signals?
Better than studying genome alone:
Genome is essentially the same in nearly all cells of an organism
But transcriptome differs between cells!
Tells you what’s actually happening
13.6.2 Why RNA is Unstable (And Why We Convert to cDNA)
Before we discuss methods to study RNA, we need to understand an important challenge: RNA is extremely unstable!
The problem with RNA:
- Chemically unstable: RNA has an extra -OH group at the 2’ position
- This makes RNA reactive and prone to breaking
- DNA is more stable (no 2’-OH group)
- RNases everywhere: Enzymes that degrade RNA (RNases) are literally everywhere!
- On your skin
- In the air
- In cells (lots of them!)
- Very stable and hard to inactivate
Why so many RNases? An evolutionary defense!
This is actually evolution being clever:
- Viral defense: Many viruses have RNA genomes
- Immune system: Our cells have RNases to destroy viral RNA
- Quality control: Cells need to degrade old or damaged RNA quickly
- Gene regulation: Controlling RNA degradation = controlling gene expression
Think of it like:
- Your body treats all external RNA as potentially dangerous
- RNases are like security guards everywhere
- They destroy RNA before it can cause problems
- Unfortunately, this makes working with RNA experimentally very difficult!
The solution: Convert RNA to cDNA
cDNA = Complementary DNA (DNA copy of RNA)
Why cDNA is better:
- Stable: DNA is chemically stable (no 2’-OH group)
- RNase-resistant: RNases only cut RNA, not DNA
- Can be amplified: Use PCR to make many copies
- Easier to work with: Standard DNA techniques apply
- Long-term storage: Can store cDNA for years
How RNA → cDNA conversion works:
- Extract RNA from cells (work quickly, keep everything cold!)
- Add reverse transcriptase enzyme (from retroviruses)
- This special enzyme makes DNA from RNA template
- Regular DNA polymerase cannot do this!
- Add primers (usually oligo-dT primers that bind to poly-A tail)
- Reverse transcriptase synthesizes DNA copy of RNA
- Now you have stable cDNA to work with!
Think of it like:
- RNA = fragile ice sculpture (melts quickly)
- cDNA = photograph of the ice sculpture (stable, permanent)
- You can’t keep the ice sculpture, but the photo preserves the information!
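The base-pairing logic behind reverse transcription can be sketched as below. This is a toy model only: it ignores the enzyme, the primer chemistry, and second-strand synthesis, and the sequence and the function name `first_strand_cdna` are invented for illustration.

```python
# Base-pairing rules for copying an RNA template into DNA.
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}

def first_strand_cdna(mrna):
    """Return the first-strand cDNA (written 5'->3') for an mRNA (written 5'->3')."""
    # The new strand is antiparallel to the template, so read the mRNA backwards.
    return "".join(RNA_TO_DNA[base] for base in reversed(mrna))

mrna = "AUGGCCUAA" + "AAAAAA"   # toy transcript ending in a short poly-A tail
cdna = first_strand_cdna(mrna)
print(cdna)
```

The cDNA starts with a run of T's. That is the part paired with the poly-A tail, which is where an oligo-dT primer would sit.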
13.6.3 DNA Microarrays: Measuring Gene Expression
Before next-generation sequencing (NGS) became popular, scientists used microarrays to measure gene expression (Schena et al. 1995; Shalon, Smith, and Brown 1996). Microarrays are still used today for certain applications!
13.6.3.1 What Is a DNA Microarray?
A DNA microarray is like a gene expression chip:
- Glass slide (like a microscope slide)
- Thousands of spots, each containing DNA probes for one gene
- Measures expression of thousands of genes simultaneously!
Think of it like:
- Each spot = a fishing hook for one specific gene
- Different colored fish = RNA from different samples
- Count how many fish each hook catches = measure gene expression!
13.6.3.2 How Microarrays Work
Step 1: Create the array
- Synthesize or print short DNA sequences (probes) onto glass slide
- Each spot contains probes for one specific gene
- Can have 10,000-50,000+ spots on one slide!
- Each probe is complementary to a specific mRNA
Step 2: Prepare samples
- Extract RNA from cells (e.g., normal cells vs. cancer cells)
- Convert RNA to cDNA using reverse transcriptase
- Why? RNA is unstable!
- cDNA is much more stable and easier to work with
- Label cDNA with fluorescent dyes
- Normal cells: Label with green dye (Cy3)
- Cancer cells: Label with red dye (Cy5)
Step 3: Hybridization
- Mix both labeled cDNA samples together
- Apply to microarray slide
- Incubate (allow cDNA to bind to matching probes)
- Wash away unbound cDNA
Step 4: Scan and analyze
- Scan slide with laser
- Green laser excites Cy3 (normal sample)
- Red laser excites Cy5 (cancer sample)
- Camera captures fluorescent signals
- Computer measures intensity at each spot
13.6.3.3 Interpreting Microarray Colors
Two-color microarray results:
- Green spot: Gene higher in normal sample
- Red spot: Gene higher in cancer sample
- Yellow spot: Gene equal in both samples (red + green = yellow)
- Black/dark spot: Gene not expressed in either sample
Example interpretation:
Spot Color → Meaning
───────────────────────────────────────────
Red → Gene upregulated in disease
Green → Gene downregulated in disease
Yellow → No change in expression
Black → Gene not expressed
Quantitative data:
- Actually, you get two intensity values per spot:
- Green intensity (Cy3)
- Red intensity (Cy5)
- Calculate ratio: Red/Green
- Ratio > 1 = upregulated in cancer
- Ratio < 1 = downregulated in cancer
- Ratio ≈ 1 = no change
13.6.4 Quantitative Microarray Analysis: The Details
Now let’s understand HOW we actually measure and interpret the data from microarrays!
13.6.4.1 Measuring Spot Intensities
What the scanner does:
- Shine lasers at the microarray slide
- Green laser (532 nm wavelength) excites Cy3 (green dye)
- Red laser (635 nm wavelength) excites Cy5 (red dye)
- Camera captures emission:
- Cy3 emits green light → camera records green intensity
- Cy5 emits red light → camera records red intensity
- Each spot recorded separately
- Software measures intensity:
- Measures average brightness of each spot
- Units: fluorescence intensity (arbitrary units, like 1,000 or 50,000)
- Higher number = more fluorescence = more RNA bound
Think of it like:
- Shining a flashlight on glow-in-the-dark stickers
- Brighter glow = more stickers
- Measure how bright each spot glows!
Example raw data for one spot:
- Green intensity (Cy3): 8,000 units (normal sample)
- Red intensity (Cy5): 24,000 units (cancer sample)
- Background: 200 units (non-specific signal)
13.6.4.2 Calculating Ratios and Fold-Changes
Step 1: Subtract background:
- Green signal = 8,000 - 200 = 7,800
- Red signal = 24,000 - 200 = 23,800
- Background = signal from areas without probes (noise)
Step 2: Calculate ratio:
- Ratio (R/G) = Red / Green = 23,800 / 7,800 = 3.05
- This means: Gene is about 3-fold upregulated in cancer!
Interpretation:
- Ratio > 1: Gene higher in red sample (cancer)
- Ratio < 1: Gene higher in green sample (normal)
- Ratio ≈ 1: No change (equal expression)
Common cutoffs:
- Ratio ≥ 2 or ≤ 0.5: Large change (2-fold or more), though it still needs a statistical test
- Ratio ≥ 1.5 or ≤ 0.67: Moderate change
- 0.67 < Ratio < 1.5: Usually considered no significant change
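Here is the same arithmetic as a small Python sketch, using the example spot from above. The function names and the wording of the labels are assumptions made for this example, and the cutoffs are the rule-of-thumb values from the list, not statistical tests.

```python
def background_corrected_ratio(red, green, background):
    """Return the red/green ratio after subtracting background from both channels."""
    red_signal = red - background
    green_signal = green - background
    if red_signal <= 0 or green_signal <= 0:
        raise ValueError("signal at or below background; ratio is not meaningful")
    return red_signal / green_signal

def classify(ratio, strong=2.0, moderate=1.5):
    """Label a ratio using the rule-of-thumb cutoffs from the text."""
    if ratio >= strong or ratio <= 1 / strong:
        return "2-fold or more"
    if ratio >= moderate or ratio <= 1 / moderate:
        return "moderate"
    return "no meaningful change"

ratio = background_corrected_ratio(red=24_000, green=8_000, background=200)
label = classify(ratio)
print(round(ratio, 2), label)
```

Refusing to compute a ratio when a channel is at or below background is a design choice for this sketch. Real analysis software often flags such spots instead of raising an error.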
13.6.4.3 Log2 Transformation: Making Data Easier to Work With
The problem with ratios:
- Upregulation: 2-fold, 3-fold, 4-fold… (numbers get large)
- Downregulation: 0.5, 0.33, 0.25… (fractions, hard to compare)
- Not symmetric!
Example:
- 2-fold up = ratio of 2
- 2-fold down = ratio of 0.5
- These should look “equally different” but don’t!
The solution: Log2 transformation
Log2(ratio) = log base 2 of the ratio
How it works:
- Log2(2) = 1 (2-fold increase)
- Log2(0.5) = -1 (2-fold decrease)
- Log2(1) = 0 (no change)
- Log2(4) = 2 (4-fold increase)
- Log2(0.25) = -2 (4-fold decrease)
Now upregulation and downregulation are symmetric!
Example calculation:
- Ratio = 3.05
- Log2(3.05) = 1.61
- Interpretation: ~1.6 log2 fold-change (≈ 3-fold upregulation)
Common values:
Ratio | Log2(Ratio) | Fold-Change | Interpretation |
---|---|---|---|
2.0 | +1.0 | 2-fold up | Upregulated |
4.0 | +2.0 | 4-fold up | Highly upregulated |
0.5 | -1.0 | 2-fold down | Downregulated |
0.25 | -2.0 | 4-fold down | Highly downregulated |
1.0 | 0.0 | No change | Equal expression |
Why log2 is useful:
- Symmetric: +1 and -1 look equally different from 0
- Easy to visualize: Heatmaps, scatter plots, volcano plots
- Standard in genomics: Everyone uses it!
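A quick way to convince yourself of the symmetry is to run the numbers. This minimal sketch uses only Python's standard `math` module. The helper `fold_change_from_log2` is a made-up name for converting back.

```python
import math

ratios = [4.0, 2.0, 1.0, 0.5, 0.25, 3.05]
log2_ratios = [math.log2(r) for r in ratios]

for r, l in zip(ratios, log2_ratios):
    print(f"ratio {r:>5} -> log2 {l:+.2f}")

def fold_change_from_log2(log2_value):
    """Convert a log2 ratio back into a plain ratio."""
    return 2 ** log2_value
```

Going back is just as easy: a log2 value of -2 converts to a ratio of 0.25, which is 4-fold down.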
13.6.4.4 Assessing Statistical Significance
Just because a gene has a ratio of 2 doesn’t mean it’s really different! We need statistics to know if it’s real or just random noise.
Key question: Is this change statistically significant?
What affects significance?:
- Magnitude of change: Bigger change = more likely significant
- Replicates: More replicates = more confident
- Variability: Less variation between replicates = more confident
Common statistical tests:
1. t-test:
- Compares two groups (disease vs. normal)
- Tests if means are significantly different
- Outputs p-value
2. ANOVA:
- Compares multiple groups (>2 conditions)
- Tests if any groups are different
3. SAM (Significance Analysis of Microarrays):
- Specialized method for microarray data
- Accounts for multiple testing problem
- Very popular for microarrays
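To make the t-test less of a black box, here is a minimal sketch of the statistic for one gene. It uses Welch's unequal-variance version, which is an assumption of this example and not something the chapter prescribes, and the replicate values are invented. Turning the statistic into a p-value needs the t distribution, which the standard library does not provide. SciPy's `scipy.stats.ttest_ind` with `equal_var=False` does the whole job in one call.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    n_a, n_b = len(group_a), len(group_b)
    # Squared standard error of each group mean.
    se_a, se_b = var_a / n_a, var_b / n_b
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical log2 intensities for one gene, four replicates per group.
tumor = [11.6, 11.9, 11.4, 11.7]
normal = [10.1, 10.3, 9.9, 10.2]
t_stat, dof = welch_t(tumor, normal)
print(round(t_stat, 2), round(dof, 1))
```

Notice the three ingredients from the list above: the difference in means sits on top, while the variability and the number of replicates sit underneath. A bigger difference, tighter replicates, or more replicates all push the statistic further from zero.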
P-value interpretation:
- p < 0.05: Statistically significant (if there were no real difference, a result this extreme would turn up less than 5% of the time)
- p < 0.01: Highly significant (less than 1% of the time)
- p < 0.001: Very highly significant
Multiple testing correction:
- Problem: Testing 40,000 genes simultaneously!
- With p < 0.05, we expect 2,000 false positives (5% of 40,000)!
- Solution: Adjust p-values for multiple testing
Common corrections:
- Bonferroni: Very strict (divide the p-value cutoff by the number of tests)
- New cutoff: 0.05 / 40,000 = 0.00000125
- Too strict! Misses real changes
- FDR (False Discovery Rate): Less strict, more practical
- Controls proportion of false positives
- FDR < 0.05: Expect about 5% of the genes called significant to be false positives
- Most commonly used!
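The numbers above, and one common way of computing FDR-adjusted p-values, can be sketched as follows. The chapter does not say which FDR procedure it has in mind, so this sketch assumes the Benjamini-Hochberg method. The five p-values are invented for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value
    # never exceeds the one ranked above it.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

n_genes = 40_000
expected_false_positives = 0.05 * n_genes   # about 2,000 if no gene truly changes
bonferroni_cutoff = 0.05 / n_genes          # about 0.00000125

p_values = [0.001, 0.008, 0.039, 0.041, 0.20]   # hypothetical raw p-values
fdr = benjamini_hochberg(p_values)
significant = [p for p, q in zip(p_values, fdr) if q < 0.05]
```

In this toy list, four genes have a raw p-value below 0.05, but only the first two stay below 0.05 after adjustment.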
Example result:
- Gene X: Ratio = 3.0, Log2 = 1.58, p = 0.002, FDR = 0.03
- Interpretation: Gene X is significantly upregulated 3-fold, with high confidence!
13.6.4.5 Visualizing Microarray Data
1. Scatter plot (MA plot):
- X-axis: Average intensity (A = average of log2 red and log2 green)
- Y-axis: Log2 ratio (M = log2 red - log2 green)
- Shows which genes change and their expression level
- Horizontal line at Y=0 = no change
- Points above = upregulated
- Points below = downregulated
2. Volcano plot:
- X-axis: Log2 fold-change
- Y-axis: -log10(p-value)
- Shows both magnitude and significance
- Top corners = large change AND significant (best genes!)
- Middle = not significant
- Far left/right but low = big change but not significant (noisy)
3. Heatmap:
- Rows = genes
- Columns = samples
- Colors = expression level
- Red = high expression
- Green = low expression
- Black/Yellow = medium
- Reveals patterns across many genes and samples
- Often includes clustering (groups similar genes/samples)
Think of heatmap like:
- Weather map showing temperatures
- Red = hot (high expression)
- Blue/Green = cold (low expression)
- Patterns emerge visually!
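The coordinates for the MA plot and the volcano plot are simple to compute by hand. This sketch reuses the example spot (background-corrected signals of 23,800 and 7,800) and the Gene X result from earlier. The function names are assumptions, and drawing the plots themselves is left to whatever plotting tool you prefer.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from background-corrected intensities."""
    m = math.log2(red) - math.log2(green)          # log2 ratio
    a = 0.5 * (math.log2(red) + math.log2(green))  # average log2 intensity
    return m, a

def volcano_point(ratio, p_value):
    """Return (x, y) = (log2 fold-change, -log10 p) for a volcano plot."""
    return math.log2(ratio), -math.log10(p_value)

m, a = ma_values(red=23_800, green=7_800)
x, y = volcano_point(ratio=3.0, p_value=0.002)
print(round(m, 2), round(a, 2), round(x, 2), round(y, 2))
```

Taking -log10 of the p-value flips the scale, so smaller p-values land higher on the plot. That is why the most convincing genes end up in the top corners.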
13.6.4.6 From Scanner to Results: Complete Workflow
1. Scan microarray:
- Get images (green and red channels)
- Raw fluorescence intensities
2. Grid alignment:
- Software identifies each spot
- Draws circles/grids around spots
- Measures intensity inside each spot
3. Background subtraction:
- Measure intensity between spots (background)
- Subtract from each spot intensity
- Corrects for non-specific signal
4. Normalization (covered in next section):
- Correct for technical variations
- Make arrays comparable
5. Calculate ratios:
- Red / Green for each spot
- Log2 transform
6. Statistical analysis:
- Identify significantly changed genes
- Apply multiple testing correction
- FDR < 0.05 typically
7. Biological interpretation:
- What do these genes do?
- Are they related (pathways, functions)?
- Can we validate with other methods?
Example: Finding Cancer Biomarkers
Experiment:
- Compare tumor samples (n=10) vs. normal tissue (n=10)
- Each sample hybridized to microarray
- 40,000 genes tested
Results:
- 2,500 genes significantly different (FDR < 0.05)
- 1,200 upregulated in tumors
- 1,300 downregulated in tumors
Top hit:
- EGFR (Epidermal Growth Factor Receptor)
- Ratio = 8.5 (8.5-fold upregulated)
- Log2 = 3.09
- p < 0.0001
- FDR < 0.001
- Conclusion: EGFR highly overexpressed in tumors! Potential drug target!
13.6.4.7 Single-Color vs. Two-Color Microarrays
Two-color microarrays:
- Compare two samples directly on same array
- Use two fluorescent dyes (e.g., Cy3 and Cy5)
- Measures relative differences
- Good for comparing disease vs. normal
Single-color microarrays (e.g., Affymetrix):
- One sample per array, one dye
- Measure absolute intensity for each sample separately
- Compare results across multiple arrays
- Better for comparing many samples
13.6.4.8 Microarray Applications
Gene expression profiling:
- Compare disease vs. healthy tissue
- Identify disease biomarkers
- Classify cancer subtypes
Drug response:
- Which genes change when drug is added?
- Predict drug efficacy
- Identify drug targets
Developmental biology:
- How does gene expression change during development?
- Track changes over time
Example - Cancer classification:
- Extract RNA from tumor sample
- Convert to cDNA, label with fluorescent dye
- Hybridize to microarray
- Compare expression pattern to database
- Classify tumor type based on gene expression profile!
13.6.5 Microarray Normalization and Housekeeping Genes
Before we can compare microarray data, we need to normalize it - correct for technical differences that aren’t biological!
13.6.5.1 Why Normalization Is Needed
The problem: Technical variations between arrays make comparison difficult
Sources of variation (non-biological):
- Different amounts of RNA loaded:
- Sample A: loaded 5 µg RNA
- Sample B: loaded 7 µg RNA
- Sample B will have higher signal, but NOT because genes are more expressed!
- Different labeling efficiency:
- Cy3 dye might label better than Cy5
- Or vice versa
- Creates systematic bias
- Scanner settings:
- Laser power varies slightly
- PMT (photomultiplier tube) sensitivity differences
- Different scan times
- Slide differences:
- Spot sizes vary
- Amount of probe printed varies
- Slide-to-slide variation
Think of it like:
- Comparing test scores from different teachers
- One teacher grades easier (higher scores)
- Need to normalize to compare fairly!
Goal of normalization: Remove technical variation, keep biological variation
13.6.5.2 Housekeeping Genes: Internal Controls
Housekeeping genes = Genes that are expressed in essentially all cells at roughly constant levels
Why they’re called “housekeeping”:
- Like doing housework - always needs to be done!
- Essential for basic cell survival
- Should NOT change between conditions
Common housekeeping genes:
1. GAPDH (Glyceraldehyde-3-phosphate dehydrogenase):
- Enzyme in glycolysis (energy production)
- Most commonly used!
- Should be same in cancer vs. normal
- High expression (easy to detect)
2. ACTB (β-Actin):
- Structural protein (part of cytoskeleton)
- Always expressed
- Very abundant
3. TUBB (β-Tubulin):
- Another structural protein
- Forms microtubules
- Constantly needed
4. HPRT1 (Hypoxanthine Phosphoribosyltransferase):
- Enzyme in nucleotide synthesis
- More stable than GAPDH in some conditions
5. 18S rRNA:
- Ribosomal RNA component
- VERY abundant
- Very stable
How housekeeping genes are used:
Method 1: As normalization reference:
- Measure housekeeping gene expression
- Should be same in both samples (disease vs. normal)
- If not, normalize so they ARE the same
- Apply same normalization to all other genes
Example:
- Sample A (normal): GAPDH intensity = 10,000
- Sample B (cancer): GAPDH intensity = 15,000
- Ratio = 1.5 (but GAPDH shouldn’t change!)
- Problem: Sample B has 1.5× higher signal overall (technical artifact)
- Solution: Divide all Sample B intensities by 1.5
Method 2: As positive controls:
- If housekeeping genes show big changes → something went wrong!
- Might indicate:
- Poor RNA quality
- Technical problems
- Bad normalization
- Need to repeat experiment
Important note: Housekeeping genes aren’t perfect!
- Can change in some conditions (stress, cell cycle, etc.)
- Always use multiple housekeeping genes
- Validate that they’re actually stable in YOUR experiment
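Method 1 can be sketched like this. The GAPDH values match the example above, while the ACTB and GENE_X intensities are invented. Averaging several housekeeping genes with a geometric mean is one reasonable choice for this sketch, not the only way to do it.

```python
import math

def housekeeping_scale_factor(reference, sample, genes):
    """Factor that brings the sample's housekeeping genes onto the reference.

    Uses the geometric mean of the per-gene ratios so that no single
    housekeeping gene dominates.
    """
    log_ratios = [math.log(sample[g] / reference[g]) for g in genes]
    return math.exp(sum(log_ratios) / len(log_ratios))

# Hypothetical intensities; GAPDH matches the example in the text.
normal = {"GAPDH": 10_000, "ACTB": 20_000, "GENE_X": 1_000}
cancer = {"GAPDH": 15_000, "ACTB": 30_000, "GENE_X": 4_500}

factor = housekeeping_scale_factor(normal, cancer, ["GAPDH", "ACTB"])
cancer_normalized = {gene: value / factor for gene, value in cancer.items()}
print(round(factor, 2), round(cancer_normalized["GENE_X"]))
```

After dividing by the factor of 1.5, GENE_X comes out 3-fold up in cancer. The raw numbers would have suggested 4.5-fold, so part of the apparent change was a technical artifact.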
13.6.5.3 Common Normalization Methods
1. Global normalization (total intensity):
- Assume: Total amount of RNA is same in all samples
- Method: Make total signal equal across all arrays
- Scale factor = Average total intensity / This array’s total intensity
- Multiply all spots by scale factor
Example:
- Array 1 total: 1,000,000
- Array 2 total: 1,200,000
- Average: 1,100,000
- Scale Array 1 by: 1,100,000 / 1,000,000 = 1.1
- Scale Array 2 by: 1,100,000 / 1,200,000 = 0.917
Pros: Simple, fast
Cons: Assumes most genes don’t change (not true if many change!)
2. Housekeeping gene normalization:
- Use genes that shouldn’t change (GAPDH, actin, etc.)
- Make these equal across arrays
- Scale other genes proportionally
Pros: Biological control
Cons: Requires knowing which genes are stable
3. Quantile normalization:
- Make distribution of intensities identical across arrays
- Statistical method
- Most robust
Pros: Works well, widely used
Cons: More complex, requires software
4. Loess normalization (for two-color arrays):
- Corrects intensity-dependent dye bias
- Cy3 and Cy5 might behave differently at high/low intensities
- Smooths out the bias
Pros: Corrects specific dye biases
Cons: Only for two-color arrays
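Two of these methods are short enough to sketch in plain Python. The arrays below are tiny invented examples whose totals mirror the 1,000,000 and 1,200,000 from the global normalization example, scaled down by 1,000. The quantile version is a bare-bones illustration that breaks ties by position. Dedicated analysis packages handle ties and missing values more carefully.

```python
def global_scale(arrays):
    """Scale each array so that all arrays share the same total intensity."""
    totals = [sum(a) for a in arrays]
    target = sum(totals) / len(totals)
    return [[x * target / t for x in a] for a, t in zip(arrays, totals)]

def quantile_normalize(arrays):
    """Give every array the same intensity distribution.

    Each value is replaced by the mean, across arrays, of the values
    holding the same rank. Ties are broken by position (a simplification).
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        normalized = [0.0] * n
        for rank, i in enumerate(order):
            normalized[i] = rank_means[rank]
        result.append(normalized)
    return result

array1 = [100.0, 400.0, 200.0, 300.0]   # total 1,000
array2 = [480.0, 120.0, 360.0, 240.0]   # total 1,200
scaled = global_scale([array1, array2])
quantiled = quantile_normalize([array1, array2])
```

Global scaling only stretches or shrinks each array, so the shape of its distribution is untouched. Quantile normalization goes further and forces the sorted values of every array to be identical, which is why it needs the assumption that the overall distribution really should be the same across samples.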
13.6.5.4 Spike-In Controls
Spike-ins = Known amounts of synthetic RNA added to samples
How they work:
- Add same amount of synthetic RNA to all samples
- These RNAs don’t exist in your cells
- Sequence is known
- Probes for them on the microarray
- Should give same signal in all samples
- Use to normalize!
Think of spike-ins like:
- Adding same amount of salt to different soups
- Taste the salt level (measure signal)
- Should be same in all soups
- If not, something’s wrong with your measurement!
Common spike-in controls:
- ERCC spike-ins: External RNA Controls Consortium
- 92 synthetic RNAs
- Different concentrations
- Cover wide range of expression levels
- Standard set used in many studies
Advantages of spike-ins:
- Absolute control (know exact amount)
- Independent of biological variation
- Can assess technical reproducibility
Disadvantages:
- Extra cost
- Need to add carefully (pipetting error)
- One more thing that can go wrong!
13.6.5.5 Quality Control: Checking Your Data
Before analysis, check data quality:
1. Check housekeeping genes:
- Should be similar across all samples
- If vary >2-fold → problem!
2. MA plot:
- Should be centered around 0
- If skewed → normalization needed
3. Box plots:
- Distribution of intensities across arrays
- Should be similar after normalization
4. Correlation between replicates:
- Technical replicates should be very similar (R > 0.99)
- Biological replicates should be similar (R > 0.95)
- If not → outliers, technical problems
5. Check positive controls:
- Spike-ins should work
- Known upregulated genes should show up
6. Check negative controls:
- Spots with no probe
- Should have low/zero signal
- High signal → contamination or high background
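The replicate check in item 4 boils down to a correlation coefficient. Here is a minimal sketch that computes Pearson's R by hand. The three arrays of log2 intensities are invented, and the 0.99 and 0.95 thresholds are simply the ones quoted in the list.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical log2 intensities for the same six genes on three arrays.
replicate_1 = [8.1, 10.4, 12.9, 6.2, 9.7, 14.0]
replicate_2 = [8.0, 10.6, 12.7, 6.4, 9.8, 13.9]    # close technical replicate
failed_array = [9.5, 9.9, 10.2, 9.6, 10.1, 9.8]    # compressed, noisy signal

r_good = pearson_r(replicate_1, replicate_2)
r_bad = pearson_r(replicate_1, failed_array)
print(round(r_good, 3), round(r_bad, 3))
```

The two good replicates clear the 0.99 bar, while the third array falls well below 0.95 and would be flagged as an outlier. Real arrays have thousands of spots, so the correlation is usually read alongside a scatter plot and not on its own.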
Example QC check:
Before normalization:
- Array 1 median intensity: 5,000
- Array 2 median intensity: 8,000
- GAPDH Array 1: 10,000
- GAPDH Array 2: 16,000
- Problem: Array 2 systematically higher!
After normalization:
- Array 1 median intensity: 6,000
- Array 2 median intensity: 6,000
- GAPDH Array 1: 12,000
- GAPDH Array 2: 12,000
- Good: Now comparable!
13.6.5.6 Limitations of Microarrays
Prior knowledge required:
- Can only detect genes you put probes for
- Cannot discover new genes or transcripts
- Limited to known sequences
Relative quantification only:
- Measures relative differences (fold-change)
- Cannot tell absolute number of RNA molecules
- Limited dynamic range
Background noise:
- Non-specific binding creates noise
- Cross-hybridization between similar sequences
Limited sensitivity:
- Hard to detect low-abundance transcripts
- Saturation at high expression levels
13.6.5.7 Microarrays vs. RNA-seq
Feature | Microarray | RNA-seq |
---|---|---|
Prior knowledge | Required | Not required |
New transcripts | Cannot detect | Can discover |
Quantification | Relative | Absolute possible |
Dynamic range | Limited (~3 orders) | Wide (~5 orders) |
Cost | Lower for targeted | Lower for discovery |
Sensitivity | Moderate | High |
Applications | Known genes, routine | Discovery, novel transcripts |
When to use microarrays:
- Focused gene expression studies
- Routine clinical diagnostics
- Cost-effective for specific gene panels
- Comparing expression of known genes
When to use RNA-seq:
- Discovering new transcripts
- Detecting splice variants
- Measuring absolute expression levels
- Studying organisms without reference genome
13.6.6 Absolute vs. Relative Quantification
This is a fundamental difference between microarrays and RNA-seq!
13.6.6.1 What Is Relative Quantification?
Relative quantification = Comparing expression BETWEEN samples
- Can say: “Gene X is 3-fold higher in sample A vs. sample B”
- CANNOT say: “Sample A has 1,000 copies of Gene X”
Think of it like:
- “This box is twice as heavy as that box” (relative)
- But can’t say “This box weighs 5 kg” (absolute)
Microarrays do relative quantification:
- Measure fluorescence intensities
- Calculate ratios between samples
- Tell you fold-change (up or down)
- Cannot tell you absolute number of RNA molecules!
Why microarrays can’t do absolute quantification:
- Fluorescence intensity is arbitrary:
- Intensity of 10,000 units doesn’t mean 10,000 RNA molecules
- Depends on:
- Laser power
- Dye efficiency
- Probe binding efficiency
- Scanner settings
- Different probes behave differently:
- Some probes bind tightly (high signal)
- Some probes bind weakly (low signal)
- Can’t compare intensity across different genes directly
- No external standard:
- Don’t know relationship between intensity and molecule count
- Like having a thermometer with no numbers!
What microarrays CAN tell you:
- Gene A is 2-fold upregulated in disease
- Gene B is 5-fold downregulated in disease
- Gene A changed more than Gene C
What microarrays CANNOT tell you:
- How many copies of Gene A are in the cell?
- Is Gene A more abundant than Gene B?
- Absolute concentration of RNA
13.6.6.2 What Is Absolute Quantification?
Absolute quantification = Measuring actual number (or concentration) of RNA molecules
- Can say: “Sample A has ~1,000 copies of Gene X per cell”
- Can say: “Gene X is more abundant than Gene Y”
RNA-seq can estimate absolute expression:
- Count how many reads map to each gene
- Normalize by:
- Gene length (longer genes get more reads)
- Sequencing depth (total reads in sample)
- Estimate relative abundance
Common RNA-seq metrics:
1. Raw counts:
- Number of reads mapping to gene
- NOT absolute! Depends on sequencing depth
- Example: Gene A = 5,000 reads
2. RPKM (Reads Per Kilobase per Million mapped reads):
- Normalizes for gene length AND sequencing depth
- RPKM = (Reads × 10⁹) / (Gene length in bp × Total reads)
- Designed for single-end sequencing
3. FPKM (Fragments Per Kilobase per Million):
- Same as RPKM but for paired-end sequencing
- Counts fragments (pairs) not individual reads
4. TPM (Transcripts Per Million):
- Similar to FPKM but better for comparison
- Normalizes differently (gene length first, then depth)
- Preferred metric today!
- Sum of all TPM values in a sample = 1,000,000
Why TPM is better than FPKM:
- TPM values are easier to compare across samples
- Sum to same total (1 million)
- Like percentages that always add to 100%!
Example:
Sample 1 (30 million reads):
- Gene A: 1,500 reads, TPM = 50
- Gene B: 300 reads, TPM = 10
Sample 2 (60 million reads):
- Gene A: 3,000 reads, TPM = 50
- Gene B: 600 reads, TPM = 10
Interpretation:
- Gene A has same TPM → same relative abundance
- Even though raw counts doubled (because sequencing depth doubled)
- TPM corrects for this!
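The RPKM formula and the TPM recipe (length first, then depth) can be written out as a short sketch. The three-gene "transcriptome" and its gene lengths are invented, so the TPM values it produces are far larger than the 50 and 10 in the example above. What matters is that doubling the sequencing depth leaves them unchanged.

```python
def rpkm(reads, length_bp, total_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return reads * 1e9 / (length_bp * total_reads)

def tpm(counts, lengths_bp):
    """Transcripts per million for every gene in one sample."""
    # Length first: reads per kilobase for each gene.
    rpk = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}
    # Depth second: rescale so the sample sums to one million.
    per_million = sum(rpk.values()) / 1e6
    return {g: value / per_million for g, value in rpk.items()}

# Hypothetical three-gene transcriptome; lengths in base pairs.
lengths = {"A": 2_000, "B": 1_000, "C": 4_000}
sample_1 = {"A": 1_500, "B": 300, "C": 1_200}
sample_2 = {g: 2 * n for g, n in sample_1.items()}   # same RNA, twice the depth

tpm_1 = tpm(sample_1, lengths)
tpm_2 = tpm(sample_2, lengths)
gene_a_rpkm = rpkm(reads=1_500, length_bp=2_000, total_reads=30_000_000)
```

Genes B and C end up with the same TPM even though C has four times as many reads. C is four times longer, so per kilobase the two are equally abundant.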
Can RNA-seq give TRUE absolute counts?
Not quite, but close:
- TPM/FPKM are proportional to true abundance
- With spike-ins, can estimate molecules per cell
- More absolute than microarrays!
But still not perfect:
- PCR bias (some sequences amplify better)
- Mapping bias (repetitive sequences hard to map)
- Gene length effects
13.6.6.3 Comparison: Absolute vs. Relative
Feature | Microarray | RNA-seq |
---|---|---|
Type | Relative only | Semi-absolute (TPM/FPKM) |
Can compare within sample? | No (across genes) | Yes! |
Can compare across samples? | Yes (fold-change) | Yes (fold-change AND TPM) |
Absolute molecule count? | No | Estimate (with spike-ins) |
Compare Gene A to Gene B? | No | Yes (via TPM) |
Units | Arbitrary fluorescence | Normalized counts (TPM) |
When absolute quantification matters:
- Comparing different genes:
- Is Gene A or Gene B more abundant?
- Microarray can’t tell, RNA-seq can!
- Meta-analysis:
- Combining data from multiple studies
- TPM values more comparable than microarray intensities
- Mathematical modeling:
- Need actual concentrations
- TPM can be converted to estimates
- Therapeutic targeting:
- Is gene expressed enough to target with drug?
- Need to know abundance, not just fold-change
Example use case:
Question: Should we target Gene X or Gene Y for cancer therapy?
Microarray says:
- Both upregulated 5-fold in cancer
- Equally good targets? 🤷
RNA-seq says:
- Gene X: TPM = 500 (highly expressed!)
- Gene Y: TPM = 5 (barely expressed)
- Gene X is better target! (100× more abundant)
This is information microarrays cannot provide!
13.6.7 Expression Analysis Workflows: From Sample to Discovery
Let’s walk through a complete gene expression analysis from start to finish!
13.6.7.1 Study Design: Disease vs. Normal
Research question: What genes are different in breast cancer vs. normal tissue?
Experimental design:
- Samples:
- 10 breast cancer tumors
- 10 normal breast tissue (from same patients or matched controls)
- Biological replicates: 10 per group (more is better!)
- Why replicates? Biological variation between patients
- Need enough to see consistent patterns
- Technical considerations:
- Collect tissue same way (fresh frozen or FFPE)
- Extract RNA with same protocol
- Check RNA quality (RIN score > 7)
- Process all samples together (batch effects!)
13.6.7.2 Step-by-Step Workflow
Step 1: RNA Extraction
- Homogenize tissue (break cells open)
- Use TRIzol or column-based extraction
- Critical: Work quickly! RNA degrades fast!
- Keep everything cold, use RNase-free materials
- Check RNA quality on Bioanalyzer (RIN score)
Step 2: cDNA Synthesis (covered in detail below)
- Convert RNA to cDNA using reverse transcriptase
- Why? cDNA is stable, RNA is not!
- Label with fluorescent dyes (Cy3 for normal, Cy5 for tumor)
Step 3: Hybridization to Microarray
- Mix labeled cDNA samples
- Apply to microarray slide
- Incubate overnight (16-18 hours)
- Wash away unbound cDNA
Step 4: Scanning
- Scan with laser scanner
- Get images (green and red channels)
- Software measures spot intensities
Step 5: Data Processing
- Background subtraction
- Normalization (using housekeeping genes or quantile normalization)
- Calculate ratios (Red/Green)
- Log2 transform
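The arithmetic in Step 5 can be sketched in a few lines of Python. This is only an illustration: the intensities are invented, the `floor` parameter is an assumption to keep the logarithm defined, and median centering is used here as a simpler stand-in for quantile normalization.

```python
import math
from statistics import median


def log_ratios(red, green, red_bg, green_bg, floor=1.0):
    """Background-subtract both channels and return log2(Red/Green) per spot."""
    ratios = []
    for r, g, rb, gb in zip(red, green, red_bg, green_bg):
        # Background subtraction; clamp at a small floor so log2 is defined
        r_net = max(r - rb, floor)
        g_net = max(g - gb, floor)
        ratios.append(math.log2(r_net / g_net))
    return ratios


def median_center(values):
    """Simple global normalization: shift log ratios so their median is 0."""
    m = median(values)
    return [v - m for v in values]


# Hypothetical spot intensities (Cy5 = red = tumor, Cy3 = green = normal)
red = [1700, 900, 500, 2100]
green = [300, 900, 900, 1100]
background = [100, 100, 100, 100]

raw = log_ratios(red, green, background, background)
normalized = median_center(raw)
print("log2 ratios:", raw)
print("median-centered:", normalized)
```

Working in log2 makes up- and downregulation symmetric: a ratio of 2 becomes +1 and a ratio of 0.5 becomes -1.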
Step 6: Quality Control
- Check housekeeping genes (should be constant)
- Check replicate correlation
- Remove outliers
- MA plots, box plots
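The replicate-correlation check can be sketched as follows. The sample names, the log ratios, and the 0.9 threshold are assumptions chosen for illustration; a real analysis would use thousands of genes per sample.

```python
import math
from statistics import median


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def flag_outliers(samples, threshold=0.9):
    """Return samples whose median correlation with the others is below threshold."""
    flagged = []
    for name, values in samples.items():
        others = [pearson(values, v) for other, v in samples.items() if other != name]
        if median(others) < threshold:
            flagged.append(name)
    return flagged


# Hypothetical log2 ratios for four genes in four tumor replicates
samples = {
    "tumor_1": [2.1, 0.3, -1.2, 1.8],
    "tumor_2": [2.0, 0.4, -1.0, 1.9],
    "tumor_3": [2.2, 0.2, -1.1, 1.7],
    "tumor_4": [-1.5, 1.0, 2.2, -0.3],  # behaves differently from the rest
}
print("Possible outliers:", flag_outliers(samples))
```

The median is used instead of the mean so that one bad sample does not drag down the score of the good ones.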
Step 7: Statistical Analysis
- Identify significantly changed genes
- Use t-test or SAM (Significance Analysis of Microarrays)
- Apply FDR correction
- Cutoffs: |log2 fold-change| > 1 AND FDR < 0.05
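To make the FDR step concrete, here is a minimal sketch of the Benjamini-Hochberg procedure, one common way to compute FDR-adjusted p-values, followed by the two cutoffs from Step 7. The gene names, fold-changes, and p-values are hypothetical, and this is not necessarily the exact correction a given software package applies.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping adjusted values monotonic
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_genes(genes, log2_fc, p_values, fc_cutoff=1.0, fdr_cutoff=0.05):
    """Apply both cutoffs: |log2 fold-change| > fc_cutoff AND FDR < fdr_cutoff."""
    fdr = benjamini_hochberg(p_values)
    return [
        gene
        for gene, lfc, q in zip(genes, log2_fc, fdr)
        if abs(lfc) > fc_cutoff and q < fdr_cutoff
    ]


# Hypothetical results for five genes
genes = ["GeneA", "GeneB", "GeneC", "GeneD", "GeneE"]
log2_fc = [3.6, -2.5, 0.4, 1.5, 2.0]
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]

print("FDR:", benjamini_hochberg(p_values))
print("Significant:", significant_genes(genes, log2_fc, p_values))
```

Notice that GeneD has a raw p-value below 0.05 but does not survive the correction. That is the point of FDR control when tens of thousands of genes are tested at once.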
Step 8: Results
Example results:
- Total genes tested: 40,000
- Significantly different: 2,847 genes (FDR < 0.05)
- Upregulated in cancer: 1,523 genes
- Downregulated in cancer: 1,324 genes
Top upregulated genes:
Gene | Fold-Change | Log2 FC | FDR | Function |
---|---|---|---|---|
ERBB2 (HER2) | 12.5 | 3.64 | <0.001 | Growth receptor |
MKI67 (Ki67) | 8.2 | 3.04 | <0.001 | Cell proliferation |
CCNB1 | 7.8 | 2.96 | <0.001 | Cell cycle |
TOP2A | 6.5 | 2.70 | <0.001 | DNA replication |
EGFR | 5.2 | 2.38 | <0.001 | Growth receptor |
Top downregulated genes:
Gene | Fold-Change | Log2 FC | FDR | Function |
---|---|---|---|---|
ESR1 (ER) | 0.15 | -2.74 | <0.001 | Estrogen receptor |
PGR (PR) | 0.18 | -2.47 | <0.001 | Progesterone receptor |
GATA3 | 0.22 | -2.18 | <0.001 | Transcription factor |
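
The Log2 FC column in these tables is simply the base-2 logarithm of the fold-change. A short Python check using the fold-change values from the tables above (the helper name `to_log2` is just for illustration):

```python
import math


def to_log2(fold_change):
    """Convert a fold-change (tumor / normal) to log2 fold-change."""
    return math.log2(fold_change)


# Fold-changes taken from the two tables above
fold_changes = {
    "ERBB2": 12.5, "MKI67": 8.2, "CCNB1": 7.8, "TOP2A": 6.5, "EGFR": 5.2,
    "ESR1": 0.15, "PGR": 0.18, "GATA3": 0.22,
}

for gene, fc in fold_changes.items():
    direction = "up" if fc > 1 else "down"
    print(f"{gene}: fold-change {fc} -> log2 FC {to_log2(fc):+.2f} ({direction})")
```

Fold-changes below 1 become negative log2 values, so a downregulated gene such as ESR1 (0.15-fold, roughly a 6.7-fold decrease) sits on the same scale as the upregulated genes.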
13.6.7.3 Step 9: Visualization
Create heatmap:
- Rows = top 100 changed genes
- Columns = samples (10 cancer, 10 normal)
- Colors = expression level (red = high, green = low)
- Clustering groups similar samples and genes
Result: Cancers cluster together, normals cluster together!
Create volcano plot:
- X-axis = log2 fold-change
- Y-axis = -log10(FDR)
- Points in top corners = significant AND large change
- These are your biomarker candidates!
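The coordinates of each point in a volcano plot are easy to compute by hand. The sketch below skips the plotting itself and only shows how a gene is placed and labeled; the function name, the cutoffs (taken from Step 7), and the example values are assumptions for illustration.

```python
import math


def volcano_point(log2_fc, fdr, fc_cutoff=1.0, fdr_cutoff=0.05):
    """Return (x, y, label) for one gene in a volcano plot."""
    x = log2_fc
    y = -math.log10(fdr)  # small FDR -> large y -> top of the plot
    if fdr < fdr_cutoff and log2_fc > fc_cutoff:
        label = "up"  # top-right corner
    elif fdr < fdr_cutoff and log2_fc < -fc_cutoff:
        label = "down"  # top-left corner
    else:
        label = "not significant"
    return x, y, label


# Hypothetical genes: (name, log2 fold-change, FDR)
examples = [("GeneA", 3.64, 0.001), ("GeneB", -2.74, 0.001), ("GeneC", 0.2, 0.5)]
for name, lfc, fdr in examples:
    x, y, label = volcano_point(lfc, fdr)
    print(f"{name}: x = {x:+.2f}, y = {y:.2f}, {label}")
```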
13.6.7.4 Step 10: Pathway Analysis
Question: Are these 2,847 genes related? Do they work together?
Pathway enrichment analysis:
- Use tools: DAVID, Enrichr, GSEA
- Test if changed genes are enriched in known pathways
Example results:
Top enriched pathways (p < 0.001):
- Cell cycle (78 genes, p = 1.2 × 10⁻²⁵)
- DNA replication (35 genes, p = 3.4 × 10⁻¹⁸)
- p53 signaling (42 genes, p = 2.1 × 10⁻¹⁵)
- ECM-receptor interaction (51 genes, p = 8.7 × 10⁻¹²)
Interpretation:
- Cell cycle genes highly upregulated → cancer cells dividing rapidly!
- p53 pathway disrupted → tumor suppressor pathway broken
- Makes biological sense!
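One common way to test for enrichment is the hypergeometric test: if you drew your changed genes at random, how likely would it be to see this many pathway members by chance? The sketch below is a simplified illustration with made-up numbers; tools such as DAVID, Enrichr, and GSEA each use their own statistics, which differ in detail.

```python
from math import comb


def enrichment_p_value(total_genes, pathway_genes, changed_genes, overlap):
    """Hypergeometric P(X >= overlap).

    total_genes   -- genes tested on the array
    pathway_genes -- genes annotated to the pathway
    changed_genes -- genes called significantly changed
    overlap       -- changed genes that belong to the pathway
    """
    max_overlap = min(pathway_genes, changed_genes)
    # Count the ways to draw at least `overlap` pathway genes by chance
    favorable = sum(
        comb(pathway_genes, k) * comb(total_genes - pathway_genes, changed_genes - k)
        for k in range(overlap, max_overlap + 1)
    )
    return favorable / comb(total_genes, changed_genes)


# Hypothetical numbers: 1,000 genes tested, 50 in the pathway,
# 100 changed, 15 of the changed genes are in the pathway
p = enrichment_p_value(1000, 50, 100, 15)
print(f"Expected overlap by chance: {100 * 50 / 1000:.0f} genes")
print(f"Observed overlap: 15 genes, p = {p:.2e}")
```

By chance you would expect about 5 pathway genes among the 100 changed genes, so an overlap of 15 gives a small p-value.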
13.6.7.5 Step 11: Validation
Never trust microarray results alone! Validate with independent method:
Methods for validation:
- qRT-PCR (quantitative RT-PCR):
- Gold standard for validation
- Measure specific genes in same samples
- More accurate than microarray
- Validate top 10-20 genes
- RNA-seq:
- Sequence same samples
- Check if same genes come up
- More comprehensive
- Protein validation:
- Western blot or immunohistochemistry
- Check if protein levels match RNA levels
- Sometimes they don’t! (post-transcriptional regulation)
Example validation:
- Microarray: ERBB2 upregulated 12.5-fold
- qRT-PCR: ERBB2 upregulated 11.8-fold ✓
- Western blot: HER2 protein high in tumors ✓
- Validated!
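A qRT-PCR fold-change like the one above is usually calculated from Ct (cycle threshold) values with the ΔΔCt method. The sketch below assumes 100% PCR efficiency (the product doubles every cycle) and uses invented Ct values and a hypothetical reference (housekeeping) gene.

```python
def fold_change_ddct(ct_target_tumor, ct_ref_tumor, ct_target_normal, ct_ref_normal):
    """Relative expression (tumor vs. normal) by the delta-delta-Ct method."""
    # Normalize the target gene to the reference gene within each sample
    delta_tumor = ct_target_tumor - ct_ref_tumor
    delta_normal = ct_target_normal - ct_ref_normal
    # Compare the normalized values between conditions
    ddct = delta_tumor - delta_normal
    # Each cycle is a doubling, and a lower Ct means more starting template
    return 2 ** (-ddct)


# Hypothetical Ct values (lower Ct = more RNA)
fc = fold_change_ddct(
    ct_target_tumor=22.0, ct_ref_tumor=18.0,
    ct_target_normal=25.5, ct_ref_normal=18.0,
)
print(f"Target gene is {fc:.1f}-fold higher in tumor")
```

A difference of 3.5 cycles corresponds to 2^3.5, or about 11.3-fold, which shows how a few cycles translate into a large difference in expression.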
13.6.7.6 Step 12: Biological Interpretation
Key questions:
- What do these genes do?
- Look up functions (Gene Ontology, UniProt)
- Group by function
- Are known cancer genes here?
- ERBB2/HER2 → known oncogene!
- ESR1/ER → prognostic marker
- Matches known biology ✓
- Any surprising discoveries?
- Novel genes not previously linked to cancer
- Potential new drug targets!
- Clinical implications?
- Can we classify tumors by expression pattern?
- Predict patient outcomes?
- Suggest treatments?
Example outcome:
- Identified HER2-positive subtype
- These patients respond to Herceptin (anti-HER2 drug)
- Gene expression predicts treatment!
13.6.8 Reverse Transcriptase: The Enzyme That Makes cDNA
Now let’s understand in detail HOW we convert RNA to cDNA!
13.6.8.1 What Is Reverse Transcriptase?
Reverse transcriptase (RT) = Enzyme that makes DNA from RNA template
- “Reverse” because normal direction is DNA → RNA (transcription)
- This goes backwards: RNA → DNA!
- Also called RNA-dependent DNA polymerase
Where does RT come from?
- Retroviruses: HIV, HTLV, etc.
- These viruses have RNA genomes
- Need to make DNA copy to insert into our genome
- We “borrowed” their enzyme for research!
Think of it like:
- Normal transcription = reading a book and taking notes (DNA → RNA)
- Reverse transcription = reconstructing the book from notes (RNA → DNA)
13.6.8.2 Why HIV Needs Reverse Transcriptase
HIV life cycle (simplified):
- HIV enters cell (has RNA genome)
- RT makes DNA copy of HIV RNA
- DNA copy integrates into human chromosome
- Now human cell makes HIV proteins forever!
- This is why HIV is chronic - it’s in your DNA!
Why we can’t cure HIV easily:
- Once integrated, HIV DNA stays in genome
- Antiretroviral drugs block RT → stop new infections
- But can’t remove integrated DNA
How RT inhibitors work (HIV drugs):
- AZT, nevirapine, etc.
- Block RT enzyme
- Prevent HIV from making DNA copy
- Stop virus from integrating
13.6.8.3 The RT Mechanism: Step by Step
Requirements for reverse transcription:
- Template RNA: Your RNA of interest
- Reverse transcriptase enzyme: Usually from MMLV or AMV
- Primer: Short DNA or RNA that binds to template
- dNTPs: Building blocks (dATP, dTTP, dGTP, dCTP)
- Buffer: Right salt and pH conditions
Why RT needs a primer (just like DNA polymerase):
- Cannot start synthesis de novo (from scratch)
- Needs 3’-OH group to add nucleotides to
- Primer provides this starting point
Types of primers:
1. Oligo-dT primers (most common):
- Short stretch of T nucleotides (15-20 Ts)
- Binds to poly-A tail of mRNA
- Sequence: 5’-TTTTTTTTTTTTTTT-3’
- Binds to: 3’-AAAAAAAAAAAAAAA-5’ (poly-A tail)
Advantage: Selectively primes polyadenylated mRNA (not rRNA or tRNA)
Disadvantage: cDNA might not reach the 5’ end of long transcripts
2. Random hexamer primers:
- Short random sequences (6 nucleotides)
- Bind randomly all along RNA
- Example: 5’-NNNNNN-3’ (N = any base)
Advantage: Primes along the whole RNA, so all regions are covered, including the 5’ end
Disadvantage: Copies ribosomal RNA too (wasteful!)
3. Gene-specific primers:
- Design primer for specific gene you want
- Most specific!
Advantage: Copies only your gene of interest
Disadvantage: Need to design primers, one gene at a time
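Why oligo-dT primers select for mRNA can be shown with a toy check: the primer only has somewhere to bind if the RNA ends in a long enough run of A's. This is a deliberate simplification (real annealing depends on temperature and tolerates some mismatches), and the function names and sequences are hypothetical.

```python
def poly_a_tail_length(rna):
    """Count consecutive A's at the 3' end of an RNA written 5'->3'."""
    return len(rna) - len(rna.rstrip("A"))


def oligo_dt_can_prime(rna, primer_length=15):
    """True if the poly-A tail is at least as long as the oligo-dT primer."""
    return poly_a_tail_length(rna) >= primer_length


# Hypothetical sequences
mrna = "GGCAUUUGCC" + "A" * 20   # polyadenylated mRNA
rrna_fragment = "GGCAUUUGCCGU"   # no poly-A tail

print("mRNA:", oligo_dt_can_prime(mrna))
print("rRNA fragment:", oligo_dt_can_prime(rrna_fragment))
```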
13.6.8.4 The Reverse Transcription Reaction
Step 1: Annealing
- Mix RNA + primer
- Heat to 70°C (denature any RNA secondary structure)
- Cool to 25°C (for random hexamers) or 42°C (for oligo-dT)
- Primer binds to complementary sequence on RNA
Example (with oligo-dT):
RNA:    5'---GGCAUUUGCCAAA[AAAAAAAAAA]-3' (poly-A tail)
                           ||||||||||
Primer:                 3'-TTTTTTTTTT-5' (oligo-dT)
Step 2: Synthesis
- Add reverse transcriptase enzyme + dNTPs
- RT synthesizes DNA complementary to RNA
- New DNA strand grows in the 5’ → 3’ direction (like all polymerases), so RT reads the RNA template 3’ → 5’
- Creates RNA-DNA hybrid
After RT synthesis:
RNA:  5'---GGCAUUUGCCAAA[AAAAAAAAAA]-3'
           ||||||||||||| ||||||||||
cDNA: 3'---CCGTAAACGGTTT[TTTTTTTTTT]-5'
- Top strand = original RNA (template)
- Bottom strand = newly synthesized cDNA (complementary DNA)
Step 3: RNA degradation (optional):
- Add RNase H enzyme
- Degrades RNA part of RNA-DNA hybrid
- Leaves single-stranded cDNA
Result: Single-stranded cDNA
cDNA: 3'---CCGTAAACGGTTT[TTTTTTTTTT]-5'
Step 4: Second strand synthesis (if needed):
- Some applications need double-stranded cDNA
- Add DNA polymerase + primers
- Makes complementary strand
Result: Double-stranded cDNA
5'---GGCATTTGCCAAA[AAAAAAAAAA]-3'
3'---CCGTAAACGGTTT[TTTTTTTTTT]-5'
Now looks like normal DNA! Can be:
- Cloned into vectors
- Amplified by PCR
- Sequenced
- Stored long-term
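The base-pairing in the synthesis steps above can be checked with a short Python sketch. It only models complementarity (A-T, U-A, G-C), not the enzyme itself, and the function names are hypothetical. The RNA is the example sequence from the diagrams, with a 10-nucleotide poly-A tail.

```python
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}
DNA_TO_DNA = {"A": "T", "T": "A", "G": "C", "C": "G"}


def first_strand_cdna(rna):
    """cDNA (written 5'->3') complementary to an RNA template (written 5'->3')."""
    # RT reads the template 3'->5', so walk the RNA backwards
    return "".join(RNA_TO_DNA[base] for base in reversed(rna))


def second_strand(cdna):
    """Second DNA strand (written 5'->3'): the reverse complement of the cDNA."""
    return "".join(DNA_TO_DNA[base] for base in reversed(cdna))


rna = "GGCAUUUGCCAAA" + "A" * 10  # example mRNA 3' end with poly-A tail
cdna = first_strand_cdna(rna)
sense = second_strand(cdna)

print("RNA    5'-" + rna + "-3'")
print("cDNA   3'-" + cdna[::-1] + "-5'")  # reversed to line up under the RNA
print("2nd    5'-" + sense + "-3'")
```

The second strand has the same sequence as the original RNA, with T in place of U, which is why double-stranded cDNA can stand in for the transcript in cloning, PCR, and sequencing.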
13.6.8.5 RT Enzymes: Which One to Use?
Common reverse transcriptases:
1. MMLV-RT (Moloney Murine Leukemia Virus):
- Most commonly used
- Works at 37-42°C
- Good processivity (makes long cDNA)
- Sensitive to high temperature
2. AMV-RT (Avian Myeloblastosis Virus):
- Works at higher temperature (42-55°C)
- Better for RNA with secondary structure
- More RNase H activity (degrades template)
3. SuperScript (engineered MMLV):
- Modified version of MMLV
- More stable at higher temperature (up to 50°C)
- Lower RNase H activity (preserves template)
- Very popular for difficult templates
Choosing RT enzyme:
- Standard use: MMLV-RT or SuperScript II/III
- GC-rich RNA: SuperScript III (higher temp denatures structure)
- Long transcripts: SuperScript IV (processivity up to 12 kb!)
- Budget: MMLV-RT (cheaper)
13.6.8.6 Common Problems and Solutions
Problem 1: No cDNA produced
Possible causes:
- RNA degraded (use RNase-free everything!)
- RT enzyme dead (check expiration, storage)
- No primer binding (wrong primer for your RNA)
Solution:
- Check RNA quality on gel/Bioanalyzer
- Use fresh RT enzyme
- Try different primer type
Problem 2: cDNA too short
- Only getting the 3’ end of the transcript, missing the 5’ end (typical with oligo-dT priming)
Cause:
- RNA secondary structure blocks RT
- RT falls off template
Solution:
- Use higher temperature (SuperScript III at 50°C)
- Add DMSO or betaine (disrupts structure)
- Use random hexamers instead of oligo-dT
Problem 3: Genomic DNA contamination
- Amplifying genomic DNA instead of cDNA!
Solution:
- Treat RNA with DNase I (removes DNA)
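- Design PCR primers that span exon-exon junctions (won’t amplify intron-containing genomic DNA)
- Run a no-RT control (oligo-dT priming alone does not help, because contaminating genomic DNA is still carried into the PCR step)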
13.6.8.7 Why RT Is Essential for RNA Studies
Applications of reverse transcription:
- Microarrays: Need cDNA for labeling and hybridization
- RNA-seq: Most methods convert RNA to cDNA first
- qRT-PCR: Quantify specific genes
- cDNA libraries: Clone the expressed genes of a tissue or organism
- Northern blot alternative: Detect specific RNAs
Without RT:
- RNA too unstable to work with
- RNases everywhere would destroy samples
- Can’t amplify with PCR (needs DNA)
- Can’t clone into vectors
With RT:
- Convert to stable cDNA
- Amplify, sequence, clone
- Store indefinitely
- Study gene expression!
Summary: Reverse transcriptase is THE KEY enzyme behind nearly all RNA research methods!
13.6.9 RNA Sequencing (RNA-seq)
Modern method to study the transcriptome (Wang, Gerstein, and Snyder 2009; Mortazavi et al. 2008):
How it works:
- Extract all RNA from cells
- Convert RNA to cDNA (more stable)
- Sequence the cDNA
- Count how many reads come from each RNA
- Determine which genes are active
What we learn:
- Which genes are expressed
- How much each gene is expressed
- Which splice variants are made
- Discovery of new RNAs
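The counting step is conceptually simple. The sketch below assumes the reads have already been aligned and each one assigned to a gene, which is the hard part in practice; the read assignments are invented for illustration.

```python
from collections import Counter


def count_reads_per_gene(read_assignments):
    """Tally how many sequenced reads were assigned to each gene."""
    return Counter(read_assignments)


# Hypothetical output of the alignment step: one gene name per read
read_assignments = ["ERBB2"] * 5 + ["GATA3"] * 2 + ["ESR1"]

counts = count_reads_per_gene(read_assignments)
for gene, n in counts.most_common():
    print(f"{gene}: {n} reads")

# Genes with no reads simply have a count of zero
print("TP53:", counts["TP53"], "reads")
```

Raw counts like these still depend on gene length and sequencing depth, which is why they are normalized (for example to TPM) before genes or samples are compared.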
13.6.10 Applications
Medicine:
- Diagnose diseases by RNA patterns
- Predict treatment response
- Find new drug targets
- Understand cancer
Development:
- How embryos develop
- Cell differentiation
- Tissue formation
Evolution:
- Compare gene expression across species
- Understand how expression patterns evolved
13.7 Key Takeaways
- Transcriptome = All RNA molecules in a cell
- Transcriptomics = Study of the transcriptome
- Three main RNA types:
- mRNA (1-5%): Codes for proteins
- tRNA (15%): Delivers amino acids
- rRNA (80%): Forms ribosomes
- RNA processing in eukaryotes:
- 5’ capping: Protective cap
- 3’ poly-A tail: Protective tail
- Splicing: Removing introns, joining exons
- Alternative splicing = One gene → multiple proteins
- Explains how 20,000 genes make 100,000+ proteins
- Regulated by splicing factors
- Non-coding RNAs have important functions:
- miRNAs: Regulate gene expression
- lncRNAs: Various regulatory functions
- RNA-seq = Modern method to study transcriptome
- Transcriptome varies between cell types, conditions, and times
Sources: Information adapted from Khan Academy, Nature Education, and transcriptomics research literature (Wang, Gerstein, and Snyder 2009; Mortazavi et al. 2008; Schena et al. 1995).