18 Sequencing Technologies
18.1 Reading the Book of Life
18.1.1 What Is DNA Sequencing?
DNA sequencing = Reading the order of A, T, G, and C letters in DNA
Think of it like:
Reading the text in a book, letter by letter
Decoding a secret message
Reading the instruction manual of life!
18.2 From Sanger to Next-Generation Sequencing
18.2.1 The First Method: Sanger Sequencing (1977)
Developed by: Frederick Sanger (won his 2nd Nobel Prize for this!) (Sanger, Nicklen, and Coulson 1977)
How it works (simplified):
Start with DNA template
Add special labeled nucleotides that STOP copying when incorporated
Create DNA fragments of different lengths
Separate fragments by size
Read the sequence from the pattern!
Characteristics:
Accurate: 99.99% accuracy!
Slow: One sequence at a time
Expensive: Cost millions for whole genomes
Read length: Up to ~1,000 base pairs per read
Impact:
Used for the Human Genome Project!
Still used today for small-scale sequencing
Gold standard for accuracy
The Human Genome Project timeline:
Started: 1990
Finished: 2003
Cost: $3 billion
Time: 13 years
Method: Sanger sequencing
That’s $1 per base pair!
18.2.2 Sanger Sequencing: The Detailed Mechanism
18.2.2.1 The Biological Foundation
Sanger sequencing relies on template-directed DNA synthesis, the same chemistry that underlies semi-conservative DNA replication in cells.
What is semi-conservative replication?
Think of DNA like a zipper with two sides:
One side stays (the template)
One new side is built (the copy)
The old side guides making the new side
Key facts:
DNA replication needs a template strand
One strand serves as the template (parent)
A new strand is built matching the template (daughter)
Watson and Crick predicted this in 1953
Meselson-Stahl proved it in 1958
18.2.2.2 Why DNA Polymerase Needs a Primer
This is an important question! Here’s the simple answer:
The problem:
RNA polymerase → Can start making RNA from nothing
DNA polymerase → Cannot start from nothing, needs a primer first
Why the difference?
Think of it like starting a zipper:
RNA polymerase is like a self-starting zipper
DNA polymerase is like a zipper that needs the first tooth in place
Three reasons DNA polymerase needs a primer:
Reason 1: Error checking
DNA polymerase checks each base it adds (proofreading)
It needs something to hold onto to check properly
Like needing a ruler to measure accurately
RNA polymerase doesn’t check as carefully (RNA is temporary anyway)
Reason 2: Chemical differences
RNA polymerase uses NTPs (with a 2’-OH group on the sugar)
DNA polymerase uses dNTPs (missing that 2’-OH group)
DNA polymerase can only add a nucleotide onto an existing 3’-OH end
With no primer, there is no 3’-OH end for it to extend
Reason 3: Accuracy matters
DNA is permanent → needs to be perfect
Primer gives DNA polymerase a stable starting point
Like building a house - you need a foundation first
This reduces copying errors
Why this matters:
PCR → Needs primers to work
Sanger sequencing → Needs primers to work
Cell DNA replication → Uses RNA primers (made by primase enzyme)
18.2.3 The Chemistry of Sanger Sequencing
18.2.3.1 The Special Ingredients
Sanger sequencing uses TWO types of building blocks:
1. Normal dNTPs (regular DNA building blocks):
What they are: dATP, dTTP, dGTP, dCTP
What they do: Build DNA normally
Key feature: Have a 3’-OH group
Result: DNA chain keeps growing
2. Special ddNTPs (chain terminators):
What they are: ddATP, ddTTP, ddGTP, ddCTP
What they do: STOP DNA growth when added
Key feature: MISSING the 3’-OH group
Result: DNA chain stops immediately
Bonus: Each is labeled with a different fluorescent color
ddA = Green
ddT = Red
ddG = Yellow
ddC = Blue
18.2.3.2 The Simple Difference
Normal dNTP (keeps going):
Has -OH group → Next base can attach → Chain grows
**Special ddNTP** (stops):
NO -OH group → Next base CANNOT attach → Chain stops
Think of it like:
- Normal dNTP = Lego brick with connectors on top
- Special ddNTP = Lego brick with flat top (nothing can attach)
#### How Sanger Sequencing Works (Step by Step)
Let's sequence this DNA: `5'-TACGGCATGCTA-3'`
**Step 1: Add primer**
The primer binds to the 3' end of the template and tells DNA polymerase where to start. Polymerase can only extend the primer's 3' end, so the new strand grows 5'→3':
- Template DNA (written 3'→5' so the primer lines up): `3'-ATCGTACGGCAT-5'`
- Primer: `5'-TAGC-3'` (extension starts at its 3' end)
**Step 2: Add the mixture**
We add:
- **LOTS** of normal dNTPs (A, T, G, C)
- **A FEW** special ddNTPs (ddA, ddT, ddG, ddC) with colors
**Step 3: DNA polymerase starts copying**
Sometimes it adds a normal base → keeps going
Sometimes it adds a special colored base → STOPS
This creates fragments of different lengths (positions are counted from the 5' end of the primer):
- Fragment 1: `5'-TAGC-ddA-3'` (stopped at position 5, GREEN)
- Fragment 2: `5'-TAGCA-ddT-3'` (stopped at position 6, RED)
- Fragment 3: `5'-TAGCAT-ddG-3'` (stopped at position 7, YELLOW)
- Fragment 4: `5'-TAGCATG-ddC-3'` (stopped at position 8, BLUE)
- Fragment 5: `5'-TAGCATGC-ddC-3'` (stopped at position 9, BLUE)
- Fragment 6: `5'-TAGCATGCC-ddG-3'` (stopped at position 10, YELLOW)
- Fragment 7: `5'-TAGCATGCCG-ddT-3'` (stopped at position 11, RED)
- Fragment 8: `5'-TAGCATGCCGT-ddA-3'` (stopped at position 12, GREEN)
Each fragment ends with a colored base!
**Step 4: Separate by size**
- Run the fragments through a thin capillary (electrophoresis)
- Small fragments run fast → detected first
- Large fragments run slow → detected last
- Like a race where smaller runners are faster

**Figure 15.1**: Sanger sequencing chain termination method showing how ddNTPs terminate DNA synthesis at different positions, creating fragments of varying lengths that reveal the DNA sequence.
*Image credit: Estevezj, Wikimedia Commons, CC BY-SA 3.0*
**Step 5: Read the colors**
The detector sees the colors in order of fragment size:
- Position 5: GREEN → A was added
- Position 6: RED → T was added
- Position 7: YELLOW → G was added
- Position 8: BLUE → C was added
- Position 9: BLUE → C was added
- Position 10: YELLOW → G was added
- Position 11: RED → T was added
- Position 12: GREEN → A was added
New strand = ATGCCGTA (read 5'→3')
The new strand is complementary to the template, so its reverse complement, TACGGCAT, is the template sequence upstream of the primer site.
Done! We just read the DNA sequence!
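The chain-termination idea can be sketched in a few lines of Python. This is a toy model, not a description of any real instrument: it assumes a primer written `5'-TAGC-3'` that anneals to the 3' end of the example template, it uses an exaggerated ddNTP fraction (`dd_fraction=0.2`) so that every position is hit with only a few thousand simulated molecules, and the function names are made up for illustration.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
DYE = {"A": "GREEN", "T": "RED", "G": "YELLOW", "C": "BLUE"}  # colors used in this chapter


def reverse_complement(seq):
    """Return the reverse complement of a DNA string written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def extend_once(template, primer, dd_fraction, rng):
    """Extend the primer until a ddNTP is incorporated or the template ends.

    Returns (fragment_length, terminal_base), or None if no ddNTP was added.
    """
    full_copy = reverse_complement(template)  # the strand the polymerase builds
    assert full_copy.startswith(primer), "primer must match the template's 3' end"
    for pos in range(len(primer), len(full_copy)):
        if rng.random() < dd_fraction:  # a ddNTP was picked: the chain stops here
            return pos + 1, full_copy[pos]
    return None  # reached the end without a ddNTP: no dye, so not detected


def sanger_read(template, primer, dd_fraction=0.2, n_molecules=5000, seed=0):
    """Simulate many extensions and read the dye color at each fragment length."""
    rng = random.Random(seed)
    color_by_length = {}
    for _ in range(n_molecules):
        result = extend_once(template, primer, dd_fraction, rng)
        if result is not None:
            length, base = result
            color_by_length[length] = DYE[base]
    # Electrophoresis: the shortest fragments reach the detector first.
    color_to_base = {color: base for base, color in DYE.items()}
    return "".join(color_to_base[color_by_length[n]] for n in sorted(color_by_length))


template = "TACGGCATGCTA"
new_strand = sanger_read(template, primer="TAGC")
print(new_strand)                      # bases added after the primer, 5'->3'
print(reverse_complement(new_strand))  # the template region that was copied
```

Sorting the fragments by length plays the role of electrophoresis, and looking up the color of the last base plays the role of the detector.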
### The Original Sanger Method: Four Tubes
**Historical note**: The method we just described is the MODERN version (1-tube with 4 colors). The original Sanger method (1977) was different!
**How the original method worked**:
Instead of one tube with all four ddNTPs (each a different color), Sanger used **FOUR SEPARATE TUBES**:
- **Tube 1**: dNTPs + ddATP (stops at A)
- **Tube 2**: dNTPs + ddTTP (stops at T)
- **Tube 3**: dNTPs + ddGTP (stops at G)
- **Tube 4**: dNTPs + ddCTP (stops at C)
Think of it like:
- Modern method = one race with 4 colored jerseys
- Original method = 4 separate races!
**The original labeling**:
- Used **radioactive labels** (³²P or ³⁵S), not fluorescent colors
- All fragments in one tube were the same "color" (radioactive)
- No fancy colors - just radioactive signal!
**Reading the results**:
1. Run each tube on separate lanes of a gel
2. Use **gel electrophoresis** (fragments separate by size)
3. Expose gel to X-ray film (autoradiography)
4. Dark bands appear where radioactive fragments are
5. Read sequence by comparing the 4 lanes!
**Example reading**:
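| Fragment size | A lane | T lane | G lane | C lane | Base read |
|---------------|--------|--------|--------|--------|-----------|
| Smallest (first base) | - | - | band | - | G |
| ↓ | band | - | - | - | A |
| ↓ | - | band | - | - | T |
| ↓ | - | - | - | band | C |
| ↓ | - | - | band | - | G |
| Largest (last base) | - | band | - | - | T |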
Sequence = GATCGT
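Reading a four-lane gel is mechanical enough to write down as code. The sketch below assumes each lane is given as a list of band positions (1 = smallest fragment); the `read_gel` function and the dictionary layout are illustrative choices, not part of any real gel-analysis tool.

```python
def read_gel(lanes):
    """Read a four-lane Sanger gel.

    `lanes` maps each base to the band positions seen in that lane.
    Reading from the smallest fragment to the largest gives the newly
    synthesized strand in the 5'->3' direction.
    """
    band_to_base = {}
    for base, positions in lanes.items():
        for position in positions:
            if position in band_to_base:
                raise ValueError(f"two lanes show a band at position {position}")
            band_to_base[position] = base
    return "".join(band_to_base[p] for p in sorted(band_to_base))


# Band positions from the example gel: 1 = smallest fragment, 6 = largest.
example_gel = {
    "A": [2],
    "T": [3, 6],
    "G": [1, 5],
    "C": [4],
}
print(read_gel(example_gel))  # GATCGT
```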
**Why this was challenging**:
- Needed 4 separate reactions (4× more work!)
- Manual reading of gel bands (tedious, error-prone)
- Radioactive materials (safety concerns)
- Limited to ~300-400 bases per gel
- Took hours to days per sequence
**The improvement to modern method**:
- **1986-1987**: Fluorescent dyes instead of radioactivity
  - Safer (no radiation)
  - Four colors = one tube instead of four!
  - Automated detection with lasers
- **1990s**: Capillary electrophoresis instead of gel
  - Faster (minutes instead of hours)
  - Automated (no manual reading)
  - Longer reads (up to 1,000 bp)
  - High-throughput (96 samples simultaneously)
**Why learn about the old method?**
- Helps understand the principle (ddNTPs stop synthesis)
- Helps you appreciate modern automation
- Many papers from 1977-2005 use this method
- Nobel Prize-winning technique!
### Modern Sanger Sequencing Protocol
**Workflow**:
**Step 1: PCR Amplification** (optional but common):
- Amplify target region
- Purify PCR product
- Ensures enough template
**Step 2: Sequencing Reaction**:
- Mix template DNA + sequencing primer
- Add DNA polymerase (thermostable, e.g., Taq)
- Add mix of dNTPs and fluorescent-labeled ddNTPs
- Thermal cycling (like PCR, but linear amplification)
**Typical mix ratio**:
- dNTPs:ddNTPs = ~100:1
- Ensures most reactions continue
- But some terminate at each position
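A rough way to see why the ratio matters: if a ddNTP is picked with probability `p` at each step, fragment lengths follow a geometric distribution. The sketch below assumes the polymerase incorporates dNTPs and ddNTPs equally well, which real enzymes do not, so treat the numbers as illustrative only.

```python
def termination_probability(n, dntp_to_ddntp=100.0):
    """Chance that a growing chain stops exactly at the n-th added base."""
    p_stop = 1.0 / (dntp_to_ddntp + 1.0)  # chance a ddNTP is picked at any one step
    return (1.0 - p_stop) ** (n - 1) * p_stop


def fraction_still_growing(n, dntp_to_ddntp=100.0):
    """Fraction of chains that have added n bases without terminating."""
    p_stop = 1.0 / (dntp_to_ddntp + 1.0)
    return (1.0 - p_stop) ** n


for n in (1, 100, 500, 1000):
    print(n, f"{termination_probability(n):.2e}", f"{fraction_still_growing(n):.4f}")
```

Under this model, too many ddNTPs would leave almost no long fragments, while too few would leave a weak signal at every position.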
**Step 3: Cleanup**:
- Remove excess ddNTPs
- Remove salts (interfere with electrophoresis)
- Use spin columns or magnetic beads
**Step 4: Capillary Electrophoresis**:
- Inject sample into capillary
- Apply electric field
- Fragments separate by size
- Laser excites fluorophores
- Camera detects emission
**Step 5: Data Analysis**:
- Software converts peaks to sequence
- Quality scores assigned to each base
- Generate chromatogram (peak visualization)
### Reading a Sanger Chromatogram
**What you see**:
A G C T A G C T
| | | | | | | |
Peak Peak Peak Peak Peak Peak Peak Peak
**Quality indicators**:
**Good quality**:
- Sharp, well-separated peaks
- Single color at each position
- Uniform peak height
- High signal-to-noise ratio
**Poor quality**:
- Overlapping peaks (multiple colors)
- Broad peaks
- Low peak height
- Often at start (primer) or end (long reads) of sequence
**Phred quality score**:
- Q20 = 99% accuracy (1 error in 100 bases)
- Q30 = 99.9% accuracy (1 error in 1,000 bases)
- Most modern Sanger reads reach Q30 or better for the first 700-800 bp
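The Phred scale is defined as Q = -10 × log10(P), where P is the probability that the base call is wrong. A minimal sketch of the conversion in both directions (the function names are illustrative):

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):g}%")
```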
### Semi-Conservative Replication Revisited
**Meselson-Stahl Experiment (1958)** proved semi-conservative replication:
**Experiment**:
1. Grow E. coli in heavy nitrogen (¹⁵N) medium
2. DNA becomes "heavy" (labeled)
3. Switch to normal nitrogen (¹⁴N) medium
4. After one replication: DNA is "hybrid" (one heavy, one light strand)
5. After two replications: Half hybrid, half light
**Result**: Proved DNA replicates semi-conservatively
- One template strand conserved
- One new strand synthesized
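The band pattern that semi-conservative replication predicts can be worked out with simple counting: after n rounds in ¹⁴N medium, each original duplex has become 2ⁿ duplexes, and exactly two of them still carry one heavy strand. A small sketch (the function name is illustrative):

```python
from fractions import Fraction


def density_bands(generations):
    """Predicted fraction of heavy, hybrid and light DNA after growth in 14N medium.

    Assumes the starting DNA is fully 15N-labeled and replication is
    strictly semi-conservative.
    """
    if generations == 0:
        return {"heavy": Fraction(1), "hybrid": Fraction(0), "light": Fraction(0)}
    hybrid = Fraction(2, 2 ** generations)  # the two original strands never disappear
    return {"heavy": Fraction(0), "hybrid": hybrid, "light": 1 - hybrid}


for n in range(4):
    bands = density_bands(n)
    print(n, {name: str(value) for name, value in bands.items()})
```

Generation 1 gives all hybrid DNA and generation 2 gives half hybrid, half light, matching steps 4 and 5 of the experiment.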
**This principle underlies**:
- All DNA replication in cells
- PCR amplification
- Sanger sequencing
- All DNA synthesis!
**Connection to Sanger sequencing**:
- Template strand = original DNA
- Newly synthesized strand = sequence read
- ddNTPs terminate synthesis at random positions
- Pattern reveals template sequence
### Why Semi-Conservative Replication Matters for Sequencing
**Advantages**:
1. **Accuracy**: Template strand provides perfect guide
2. **Complementarity**: A pairs with T, G pairs with C
3. **Proofreading**: DNA polymerase checks each base
4. **Reliability**: Same template gives same sequence
**Fidelity**:
- DNA polymerase error rate: ~1 in 10⁷ (with proofreading)
- Sanger sequencing error rate: ~1 in 10⁴ (mostly from signal and base-calling limits rather than from the polymerase)
- Still extremely accurate!
### Sanger Sequencing Applications Today
**Despite NGS dominance, Sanger is still used for**:
**1. Variant Validation**:
- NGS finds potential mutation
- Sanger confirms it
- Gold standard for clinical diagnostics
**2. Small-Scale Projects**:
- Sequencing few genes
- Checking clones
- Verifying plasmids
- More cost-effective than NGS setup
**3. Long Reads**:
- Up to 1,000 bp in single read
- NGS reads much shorter (typically 150-300 bp)
- Useful for spanning repetitive regions
**4. Low-Throughput Needs**:
- 1-96 samples at a time
- Don't need millions of reads
- Academic labs, clinical labs
### Limitations of Sanger Sequencing
**Compared to NGS**:
- **Throughput**: Only 96 samples/run vs. millions of reads per run in NGS
- **Cost**: $5-10 per reaction vs. $0.01 per Mb for NGS
- **Speed**: Hours per run for Sanger vs. days per run for a whole genome on NGS (but NGS returns millions of sequences!)
- **Scalability**: Cannot do whole genomes economically
**When NOT to use Sanger**:
- Whole genome sequencing
- RNA-seq experiments
- ChIP-seq experiments
- Metagenomics
- Any high-throughput application
### Historical Impact
**Before Sanger sequencing** (pre-1977):
- No way to read DNA sequence directly
- Relied on protein sequencing (slower, harder)
- Genetic code deciphered with difficulty
**After Sanger sequencing** (1977-2005):
- First genes sequenced
- First genomes sequenced (viruses, bacteria)
- **Human Genome Project possible**!
- Molecular biology revolution
- Medical genetics born
**Key milestones using Sanger**:
- 1977: First complete DNA genome sequenced (bacteriophage φX174, 5,375 bp)
- 1995: First bacterial genome (H. influenzae, 1.8 Mb)
- 1996: First eukaryotic genome (yeast, 12 Mb)
- 2001: Human genome draft (3 Gb!)
- 2003: Human genome complete
**Nobel Prizes related to sequencing**:
- 1980: Sanger - DNA sequencing method
- 1993: Mullis - PCR (enables sequencing)
- 2020: Charpentier & Doudna - CRISPR (genome editing, relies on sequencing)
### Comparison Table: Sanger vs. NGS
| Feature | Sanger Sequencing | Next-Generation Sequencing |
|---------|-------------------|----------------------------|
| **Read length** | 700-1,000 bp | 100-300 bp (Illumina) |
| **Accuracy** | 99.9%+ (Q30 or better) | 99%+ (depends on platform) |
| **Throughput** | 1-96 sequences/run | Millions to billions/run |
| **Cost per Mb** | $500-1,000 | $0.01-0.10 |
| **Time** | Hours | Days (for whole genome) |
| **Applications** | Validation, cloning, small projects | WGS, RNA-seq, variant discovery |
| **Best for** | Low throughput, long reads | High throughput, discovery |
### The Revolution: Next-Generation Sequencing (NGS)
**What changed**: In the mid-2000s, new technologies emerged!
**NGS characteristics**:
- **Massively parallel**: Sequence millions of fragments simultaneously!
- **Fast**: Whole genome in days instead of years
- **Cheap**: Now costs less than $1,000 per genome
- **High-throughput**: Generate billions of base pairs of data
**Key difference from Sanger**:
- Sanger: Read ONE sequence at a time (serial)
- NGS: Read MILLIONS of sequences at once (parallel)
Think of it like:
- **Sanger**: Reading one book
- **NGS**: Reading an entire library simultaneously!
### How NGS Works (General Process)
**1. Library Preparation**:
- Break DNA into small fragments
- Add special "adapters" to fragment ends
- Like putting address labels on letters
**2. Amplification**:
- Make many copies of each fragment
- Like photocopying letters many times
**3. Sequencing**:
- Fragments attach to a surface
- Nucleotides added one at a time
- Each nucleotide emits light (different colors for A, T, G, C)
- Camera records the lights
- Computer determines sequence!
**4. Data Analysis**:
- Assemble millions of short sequences
- Like putting together a jigsaw puzzle
- Compare to reference genome
- Identify differences
### NGS Platforms
**Illumina** (most popular):
- Short reads (100-300 bp)
- Very accurate
- Most widely used
- Relatively cheap
**Ion Torrent**:
- Detects pH changes (not light)
- Faster
- Good for targeted sequencing
**Others**: Many platforms exist, each with trade-offs!
### The Evolution of Read Lengths
**Read length** = How many bases we can sequence in one continuous read
Think of it like:
- **Short reads** = reading one sentence at a time
- **Long reads** = reading entire paragraphs or pages!
**Historical progression of NGS read lengths**:
**Early days (2005-2008)**:
- **25-35 bp** reads
- Illumina Genome Analyzer
- Very short! Like reading only 2-3 words at a time
- Hard to assemble genomes
- But: massively parallel, millions of reads!
**Early improvements (2008-2012)**:
- **50-75 bp** reads (single-end)
- **2 × 50 bp** (paired-end) = 100 bp total information
- Better, but still challenging
- Could map to reference genomes
- Assembly still difficult
**Maturing chemistry (2012-2015)**:
- **100-150 bp** reads
- **2 × 100 bp** or **2 × 150 bp** (paired-end)
- Big improvement!
- Good for most applications
- Became the standard
**Modern era (2015-present)**:
- **150-300 bp** reads (standard)
- **2 × 150 bp** or **2 × 250 bp** common
- Some platforms: **2 × 300 bp**!
- Long enough for most needs
- Good balance of length, accuracy, and cost
**Longest Illumina reads today**:
- MiSeq: Up to **2 × 300 bp** (600 bp total)
- NovaSeq: Usually **2 × 150 bp** (300 bp total); up to **2 × 250 bp** on some flow cells
- NextSeq: **2 × 150 bp** (300 bp total)
### Why Read Length Matters
**Longer reads are better for**:
1. **Genome assembly**:
   - Longer reads span more of the genome
   - Easier to put pieces together
   - Like having larger puzzle pieces
2. **Repetitive regions**:
   - Genomes have many repeats (same sequence multiple times)
   - Short reads get "lost" in repeats
   - Longer reads can span repeats
   - Like reading "AAAAAAA..." - need to read the whole thing to know how many As!
3. **Structural variants**:
   - Large deletions, insertions, inversions
   - Need to span the variant
   - Short reads might miss it entirely
4. **Isoform detection** (RNA-seq):
   - Full-length transcripts are thousands of bases long
   - Longer reads capture more of each isoform
   - Better identification of splice variants
**Shorter reads are better for**:
1. **Accuracy**:
   - Short reads have fewer errors
   - Quality degrades toward the end of reads
   - 50 bp at Q30 is better than 300 bp at Q20
2. **Cost**:
   - Shorter reads = cheaper per base
   - More reads per run
   - Better for high-throughput applications
3. **Speed**:
   - Shorter reads sequence faster
   - Fewer cycles = faster run time
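The point about repetitive regions can be made concrete with a toy example. The sketch below counts how many places a read could have come from in a made-up "genome" containing a CA repeat; the sequences and the `count_placements` helper are invented for illustration, and real aligners work very differently.

```python
def count_placements(genome, read):
    """Count the positions where `read` matches `genome` exactly."""
    return sum(
        1
        for start in range(len(genome) - len(read) + 1)
        if genome[start:start + len(read)] == read
    )


# Toy genome: unique flank + (CA) repeated 8 times + unique flank.
genome = "TTGACC" + "CA" * 8 + "GGATCT"

short_read = "CACACA"                   # lies entirely inside the repeat
long_read = "ACC" + "CA" * 8 + "GGA"    # spans the repeat and both flanks

print(count_placements(genome, short_read))  # more than one possible origin
print(count_placements(genome, long_read))   # exactly one possible origin
```

The short read fits the repeat in several places, so it cannot say how long the repeat is; the long read pins down both the position and the repeat length.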
### The Trade-Offs
| Read Length | Accuracy | Cost | Speed | Best For |
|-------------|----------|------|-------|----------|
| **50-75 bp** | Highest | Cheapest | Fastest | RNA-seq, ChIP-seq, targeted sequencing |
| **100-150 bp** | High | Moderate | Moderate | Standard WGS, exomes, most applications |
| **250-300 bp** | Good | Higher | Slower | Assembly, amplicon sequencing, metagenomics |
**Most common today**: **2 × 150 bp** (paired-end, 300 bp total information)
- Good balance of everything
- Cheap enough for routine use
- Long enough for most applications
- Standard for clinical sequencing
### Why We Couldn't Make Long Reads Earlier
**Technical challenges**:
1. **Reversible terminators don't work well for long reads**:
   - Each cycle adds chemicals that must be removed
   - After ~300 cycles, too much residue builds up
   - Signals get weaker and noisier
   - Like photocopying a photocopy 300 times - gets fuzzy!
2. **Chemistry degradation**:
   - DNA polymerase efficiency decreases
   - Fluorescent dyes accumulate
   - Clusters get dimmer
   - Error rate increases
3. **Phasing problems**:
   - In a cluster of 1,000 identical molecules
   - All should be at the same position
   - But some lag behind or jump ahead (phasing errors)
   - After 200-300 cycles, too much variation
   - Signal becomes mixed and unclear
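The phasing problem compounds with every cycle, which a back-of-the-envelope model makes visible. The sketch below assumes that in each cycle a small fixed fraction of molecules falls behind or jumps ahead and never gets back in step; the 0.1% rates are placeholder values chosen for illustration, not measured instrument figures.

```python
def in_phase_fraction(cycles, p_lag=0.001, p_lead=0.001):
    """Fraction of a cluster's molecules still in step after `cycles` cycles.

    p_lag and p_lead are hypothetical per-cycle probabilities of falling
    behind or jumping ahead.
    """
    return (1.0 - p_lag - p_lead) ** cycles


for cycles in (50, 100, 150, 300, 600):
    print(cycles, f"{in_phase_fraction(cycles):.2f}")
```

Even with tiny per-cycle error rates the in-phase signal shrinks geometrically, which is why read length is capped at a few hundred cycles.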
**Recent improvements**:
- **Better chemistry**: More stable reagents
- **Better enzymes**: DNA polymerase that works longer
- **Better optics**: Brighter signals, better cameras
- **Better algorithms**: Correct for phasing errors
- **Result**: 300 bp reads now routine!
### Paired-End vs. Single-End
**Single-end sequencing**:
- Sequence only one end of fragment
- One read per fragment
- Example: 150 bp read
**Paired-end sequencing**:
- Sequence BOTH ends of fragment
- Two reads per fragment
- Example: 2 × 150 bp = two 150 bp reads from same fragment
- Know the distance between reads (~300-500 bp)
**Why paired-end is better**:
- **More information**: Two reads instead of one!
- **Better mapping**: Knowing both ends helps place the fragment
- **Detect structural variants**: If distance is wrong, something's inserted or deleted
- **Span gaps**: One end maps, helps place the other end
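The "distance is wrong" idea can be written as a simple rule. This sketch assumes the expected distance window of 300-500 bp mentioned above and classifies a single read pair; the function name and return strings are illustrative, and real variant callers use the whole distribution of distances across many pairs rather than one fixed window.

```python
def classify_pair(observed_distance, expected_min=300, expected_max=500):
    """Interpret the distance between two mates after mapping to a reference."""
    if observed_distance > expected_max:
        # The mates map too far apart: the reference contains sequence
        # that is missing from the sample.
        return "possible deletion in the sample"
    if observed_distance < expected_min:
        # The mates map too close together: the sample contains sequence
        # that is missing from the reference.
        return "possible insertion in the sample"
    return "consistent with the library"


for distance in (120, 400, 1500):
    print(distance, "->", classify_pair(distance))
```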
**Standard today**: Almost always paired-end!
### Long-Read Technologies (3rd Generation)
**The limit of short-read technologies**:
- Illumina: Maximum ~300 bp per read (up to 2 × 300 bp paired-end)
- Even with improvements, can't go much longer
- Fundamental chemistry limitations
**Solution**: Completely different technologies!
**PacBio**:
- **10,000-100,000+ bp** reads
- Average: ~10-20 kb
- Longest: > 100 kb!
- Uses single-molecule real-time sequencing
- Different chemistry (not reversible terminators)
**Oxford Nanopore**:
- **Regular reads**: 10-50 kb
- **Ultra-long reads**: 100-200 kb
- **Record**: > 2 million bp!
- Uses electrical current (not fluorescence)
- Can sequence indefinitely until DNA breaks
**Trade-offs for long reads**:
- **Pros**: Much longer! Span repetitive regions, complete genomes
- **Cons**: Lower raw read accuracy (~90-95% vs. 99%+), more expensive, lower throughput
**The future**: Combine short accurate reads (Illumina) with long reads (PacBio/Nanopore) for best of both!
## Illumina Sequencing: The Details
### Why Illumina Dominates NGS
**Illumina/Solexa sequencing** is the most widely used NGS technology today. It was developed by Solexa (acquired by Illumina in 2007) and has become the gold standard for high-throughput sequencing (Bentley et al. 2008).
**Key advantages**:
- **Massively parallel**: Sequence millions of DNA fragments simultaneously
- **High accuracy**: 99%+ accuracy with quality scores
- **Cost-effective**: Cheapest per-base cost
- **Versatile**: Used for genome sequencing, RNA-seq, ChIP-seq, and more
### The Core Principle: Sequencing by Synthesis (SBS)
Illumina uses **sequencing by synthesis (SBS)** - it sequences DNA by watching DNA polymerase add nucleotides one at a time!
**The brilliant idea**:
- Use special fluorescent nucleotides
- Each base (A, T, G, C) glows a different color
- Watch which color appears → know which base was added
- Repeat for every position in the DNA!
Think of it like:
- DNA polymerase is building a LEGO tower
- Each LEGO brick glows a different color
- A camera takes a picture after each brick is added
- The color sequence reveals the DNA sequence!
### The Magic Ingredient: Reversible Terminator Nucleotides
**Normal dNTPs** (regular DNA building blocks):
- DNA polymerase can add them continuously
- Hard to control - might add multiple bases at once
**Illumina's reversible terminators**:
- **Modified dNTPs** with two special features:
  1. **Fluorescent dye** attached (different color for A, T, G, C)
  2. **Reversible terminator** group at 3' position
**How they work**:
1. DNA polymerase adds ONE modified nucleotide
2. The terminator BLOCKS the next base from being added
3. Camera captures the fluorescent signal
4. A chemical cleavage step REMOVES the fluorescent dye and terminator
5. Next cycle begins!
**The four colors**:
- **Adenine (A)**: Green fluorescence
- **Thymine (T)**: Red fluorescence
- **Guanine (G)**: Yellow fluorescence
- **Cytosine (C)**: Blue fluorescence
(Note: Exact colors vary by Illumina platform, but each base has a unique signal)
### Step-by-Step: How Illumina Sequencing Works
#### Step 1: Library Preparation
**Goal**: Prepare DNA fragments with special adapters
Library preparation is like preparing letters for mailing - you need to put addresses and return addresses on them so they get to the right place!
**Process**:
**1. Fragment DNA**: Break genomic DNA into small pieces (typically 200-500 bp)
**Methods to fragment DNA**:
**a) Enzymatic fragmentation** (using enzymes):
- **Restriction enzymes**: Cut DNA at specific sequences
  - Like molecular scissors that only cut at certain words
  - Problem: Cuts are not random, biased toward certain sequences
  - Less common for NGS
- **Tagmentation** (Nextera method):
  - Enzyme (transposase) cuts AND adds adapters simultaneously!
  - Very fast (minutes instead of hours)
  - Becoming very popular
  - Like a machine that stamps and addresses letters at once
- **Enzymatic shearing**: Enzymes that cut nearly at random
  - Gentle on the DNA, and less biased than restriction enzymes
  - Like using scissors randomly on a string
**b) Physical fragmentation** (mechanical):
- **Sonication**: Use sound waves to break DNA
  - High-frequency sound vibrates DNA until it breaks
  - Like using a jackhammer on ice
  - Creates random fragments
  - Takes 5-30 minutes
- **Nebulization**: Force DNA through tiny holes
  - High pressure shears the DNA
  - Like pushing spaghetti through a screen
  - Also creates random fragments
  - Quick but uses more DNA
**Why fragment size matters**:
- **Too short** (< 100 bp): Hard to map uniquely to genome
- **Too long** (> 800 bp): Forms clusters poorly on the flow cell, harder to amplify
- **Just right** (200-500 bp): Perfect for Illumina!
**2. End repair**: Make fragment ends smooth and blunt
After fragmentation, DNA ends are ragged (like torn paper edges). We need to smooth them out!
- Remove overhangs (extra single-stranded bits)
- Fill in gaps
- Make all ends blunt (flat)
- Like filing down rough edges on wood
**Why this matters**: Clean, blunt ends are needed for the next step (A-tailing), which prepares the fragments for adapter ligation
**3. A-tailing**: Add single "A" nucleotide to 3' ends
- Adds one adenine (A) to each 3' end
- Prepares for adapter ligation (adapters carry a matching single "T" overhang)
- Like adding a hook to hang things on
**4. Add adapters**: Attach short DNA sequences to both ends
**What are adapters?**
Adapters are short DNA sequences (20-100 bp) that get attached to BOTH ends of every DNA fragment.
Think of adapters like:
- **Mailing addresses** on letters
- **Barcode labels** on packages
- **Handles** on suitcases
**Adapter structure**:
5’— [P5 adapter] —[Your DNA fragment]— [P7 adapter] —3’
- P5 adapter: binds to flow cell + sequencing primer
- P7 adapter: binds to flow cell + sequencing primer
What adapters contain:
- Flow cell binding sequences:
  - Let fragments stick to flow cell surface
  - Complementary to oligos on flow cell
  - Like Velcro hooks that match Velcro loops
- Sequencing primer binding sites:
  - Where sequencing primers attach
  - Start point for DNA polymerase
  - Like “Start here” signs
- Index/Barcode sequences (optional):
  - Unique tags (6-8 bp) to identify samples
  - Like apartment numbers on an address
  - Allows multiplexing!
5. Ligation: Glue adapters to DNA fragments
- Use DNA ligase enzyme (molecular glue)
- Creates covalent bonds
- Very stable connection
- Like super-gluing handles onto boxes
6. Size selection: Keep only fragments of desired size
After adapter ligation, we have fragments of many sizes. We need to keep only the ones we want!
Methods:
- Gel electrophoresis: Separate by size, cut out desired band
- Magnetic beads (SPRI beads): Most common today
  - Beads bind DNA based on fragment size
  - Add magnets to pull beads (with DNA) to side
  - Wash away unwanted fragments
  - Like using a magnet to sort iron filings by size
Typical target: 300-500 bp fragments (including adapters)
7. PCR amplification: Make many copies of each fragment
- Need millions of each fragment for clustering
- Use PCR primers that bind to adapters
- Typically 8-12 PCR cycles
- Too many cycles = bias and duplicates
- Too few cycles = not enough DNA
Think of it like:
- Photocopying important documents
- Need enough copies to work with
- But too many copies waste paper!
Quality control at each step:
- After fragmentation: Check size distribution (should be 200-500 bp)
- After adapter ligation: Check that adapters attached (size increases by ~120 bp)
- After PCR: Check DNA concentration (need enough!) and size
- Use: Bioanalyzer or TapeStation (measures DNA size electronically)
Why library prep matters:
- Good library = high-quality sequencing data
- Bad library = garbage data (garbage in, garbage out!)
- Most common source of sequencing errors is poor library prep
18.2.4 Multiplexing: Sequencing Many Samples at Once
The problem: Illumina machines generate BILLIONS of reads per run. But you might only need MILLIONS for your sample. Wasteful!
The solution: Multiplexing = mix multiple samples in one sequencing run!
How it works:
- Sample 1: Add index “ATCACG” to adapters
- Sample 2: Add index “CGATGT” to adapters
- Sample 3: Add index “TTAGGC” to adapters
- Mix all three samples together
- Sequence them all at once!
- Computer reads the index and sorts reads by sample
Think of it like:
- Mailing 100 letters from different people in one mailbag
- Each letter has a return address (index)
- Post office sorts them by return address at destination!
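Sorting reads by index (demultiplexing) is easy to sketch. The example below uses the three index sequences from the list above and tolerates one mismatch in the index read; the function names, the `(index_read, sequence)` tuple layout and the one-mismatch default are assumptions made for illustration, not the behavior of any specific demultiplexing tool.

```python
from collections import defaultdict

# Index sequences taken from the example above.
SAMPLE_INDEXES = {"ATCACG": "Sample 1", "CGATGT": "Sample 2", "TTAGGC": "Sample 3"}


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, indexes=SAMPLE_INDEXES, max_mismatches=1):
    """Group (index_read, sequence) pairs by sample.

    A read is assigned only if exactly one known index is within
    `max_mismatches` of its index read; otherwise it is "undetermined".
    """
    bins = defaultdict(list)
    for index_read, sequence in reads:
        matches = [
            sample
            for index, sample in indexes.items()
            if len(index) == len(index_read)
            and hamming(index, index_read) <= max_mismatches
        ]
        sample = matches[0] if len(matches) == 1 else "undetermined"
        bins[sample].append(sequence)
    return dict(bins)


reads = [
    ("ATCACG", "GATTACAGATTACA"),  # exact match to Sample 1
    ("CGATGA", "TTGGCCAATTGGCC"),  # one sequencing error in the index
    ("GGGGGG", "ACGTACGTACGTAC"),  # matches no sample
]
print(demultiplex(reads))
```

Allowing a mismatch only works because the indexes differ from each other at several positions, which is why index sets are designed to be far apart.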
Benefits:
- Cost-effective: Share one expensive sequencing run
- Efficient: Don’t waste billions of reads on one sample
- Common: Can multiplex 12, 24, 96, or even 384 samples!
Real example:
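- NovaSeq generates 20 billion reads
- You need only 100 million reads per sample (enough for a small genome, an exome, or an RNA-seq sample; a 30× human genome needs several times more)
- Can sequence 200 samples in one run!
- Cost: $10,000 run ÷ 200 samples = $50 per sample!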
18.2.5 Vector-Based Sequencing (Older Method)
Historical note: Before modern library prep, scientists used vectors (bacterial plasmids):
Old method:
- Insert your DNA fragment into a plasmid (circular DNA)
- Transform bacteria with plasmid
- Grow bacteria (each makes many copies)
- Extract plasmid DNA
- Sequence using Sanger sequencing
- Primers bind to known vector sequences
Why vectors were used:
- Amplified DNA (bacteria copy it millions of times)
- Provided known primer binding sites
- Cloned DNA for storage
Why we don’t use them for NGS:
- Too slow (growing bacteria takes days)
- Not needed (PCR amplifies DNA faster)
- Can’t multiplex easily
- Modern library prep is faster and better!
Still used for:
- Sanger sequencing of cloned genes
- Making DNA constructs for experiments
- Long-term storage of DNA sequences
Why adapters matter:
- Let fragments stick to flow cell surface (P5/P7 sequences)
- Provide primer binding sites for sequencing
- Add barcodes (for multiplexing - mixing multiple samples)
- Enable bridge amplification (clustering on flow cell)
Library = collection of all DNA fragments with adapters attached
Think of library like:
- All books in a library have call numbers (adapters)
- Call numbers let you find and organize books
- Same way, adapters let us find and sequence DNA fragments!
18.2.5.1 Step 2: Cluster Generation (Bridge Amplification)
The flow cell:
- Glass slide with millions of tiny spots
- Each spot has oligonucleotides (short DNA sequences) attached
- These oligos match the adapter sequences
Bridge amplification process:
- Attach: DNA fragment binds to flow cell via adapter
- Bend: Fragment bends over and binds to nearby oligo (makes a “bridge”)
- Amplify: DNA polymerase copies the fragment
- Separate: Double-stranded DNA is denatured (separated)
- Repeat: Process repeats ~30-35 times
Result: Each single DNA molecule becomes a “cluster” of ~1,000 identical copies in one tiny spot!
Why clustering is brilliant:
- Signal amplification: 1,000 copies glow much brighter than 1 molecule
- Accuracy: All copies in cluster have same sequence
- Millions of clusters: Each original DNA fragment gets its own cluster
- Parallel sequencing: All clusters sequenced simultaneously!
Think of it like:
- One person shouting = hard to hear
- 1,000 people shouting the same thing = very loud and clear!
18.2.5.2 Step 3: Sequencing by Synthesis
Now the actual sequencing begins!
Cycle 1: Add first base
Add mix to flow cell:
- All four fluorescent reversible terminator dNTPs (A, T, G, C)
- DNA polymerase
- Sequencing primer (binds to adapter)
Incorporation:
- DNA polymerase adds ONE base to each cluster
- Only the complementary base binds
- Terminator prevents adding more bases
Wash: Remove unincorporated nucleotides
Image:
- Laser excites fluorescent dyes
- Camera captures emission from each cluster
- Computer records which color (which base)
Cleave:
- A chemical cleavage step removes the fluorescent dye
- The same step removes the terminator group
- 3’-OH is restored for next cycle
Cycle 2: Add second base
- Repeat exact same process
- Add second base
- Image the color
- Record the base
Cycles 3, 4, 5… up to 150-300 cycles:
- Continue adding one base at a time
- Image each cycle
- Build up the complete sequence!
Key features:
- One base per cycle: Reversible terminators prevent errors from homopolymers (like AAAAA)
- All four dNTPs present: Natural competition minimizes bias
- Massively parallel: Millions of clusters sequenced simultaneously
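The cycle structure (incorporate one base, image, cleave, repeat) can be mimicked with strings. This is a toy sketch with no errors, no phasing and no signal decay; it uses the four colors listed earlier in this chapter, and it assumes each template is written in the order the polymerase reads it, starting right after the sequencing primer. The names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
DYE_COLOR = {"A": "green", "T": "red", "G": "yellow", "C": "blue"}
COLOR_TO_BASE = {color: base for base, color in DYE_COLOR.items()}


def run_sbs(cluster_templates, n_cycles):
    """Return one read per cluster after n_cycles of sequencing by synthesis."""
    reads = [""] * len(cluster_templates)
    for cycle in range(n_cycles):
        for i, template in enumerate(cluster_templates):
            if cycle >= len(template):
                continue  # this cluster has run out of template
            # Incorporate: exactly one blocked, dye-labeled base is added.
            incorporated = COMPLEMENT[template[cycle]]
            # Image: the camera sees only a color for this cluster.
            color = DYE_COLOR[incorporated]
            # Base call: translate the color back into a base.
            reads[i] += COLOR_TO_BASE[color]
            # Cleave: dye and block are removed, so the next cycle can proceed.
    return reads


print(run_sbs(["ATGC", "GGCA", "AAAAA"], n_cycles=4))
```

Because the terminator allows only one base per cycle, the homopolymer template `AAAAA` is read one T at a time instead of as a single bright flash.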
18.2.5.3 Step 4: Data Analysis and Base Calling
What happens to the images:
Image analysis: Software identifies each cluster spot
Intensity measurement: Measures fluorescence intensity and color for each cluster
Base calling: Determines which base (A, T, G, C) was added
- Compares signal to expected wavelengths
- Assigns most likely base
Quality scoring: Assigns confidence score (Phred score) to each base
- Q20 = 99% accuracy (1 error in 100 bases)
- Q30 = 99.9% accuracy (1 error in 1,000 bases)
- Q40 = 99.99% accuracy (1 error in 10,000 bases)
FASTQ file generation: Creates file with:
- Read sequence
- Quality score for each base
- Read identifier
Quality control:
- Low-quality reads filtered out
- Adapter sequences trimmed
- Reads ready for alignment or assembly!
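To make the FASTQ and quality-control steps concrete, here is a minimal sketch that parses one four-line FASTQ record and applies a simple quality filter. It assumes the common Phred+33 encoding; the mean-quality threshold and the function names are assumptions for illustration, and real pipelines use dedicated tools with more careful trimming logic.

```python
def parse_fastq_record(lines):
    """Parse one 4-line FASTQ record into (identifier, sequence, qualities)."""
    header, sequence, separator, quality_string = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality_string):
        raise ValueError("sequence and quality lengths differ")
    # Phred+33 encoding: quality = ASCII code of the character minus 33.
    qualities = [ord(char) - 33 for char in quality_string]
    return header[1:], sequence, qualities


def passes_filter(qualities, min_mean_quality=20):
    """Keep a read only if its mean Phred score reaches the threshold."""
    return sum(qualities) / len(qualities) >= min_mean_quality


record = ["@read_1", "GATTACA", "+", "IIIII5#"]
name, seq, quals = parse_fastq_record(record)
print(name, seq)             # read_1 GATTACA
print(quals)                 # [40, 40, 40, 40, 40, 20, 2]
print(passes_filter(quals))  # True
```

The last base of this example read has a very low score (Q2), which is the kind of tail that trimming tools typically remove before alignment.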
18.2.6 Paired-End Sequencing
Even more powerful: Illumina can sequence BOTH ends of each DNA fragment!
How it works:
- Sequence one end (forward read, 150 bp)
- Wash away sequencing reagents
- Regenerate the clusters with another round of bridge synthesis, then remove the original forward strands so the complementary strands remain
- Sequence the OTHER end (reverse read, 150 bp)
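The relationship between the two reads of a pair can be sketched in a few lines. The example fragment and the helper names are hypothetical; the point is that the reverse read comes from the opposite strand, so it is the reverse complement of the fragment's far end.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(seq))


def paired_end_reads(fragment, read_length):
    """Return (forward_read, reverse_read) for one DNA fragment."""
    forward_read = fragment[:read_length]
    # The reverse read is taken from the opposite strand, starting at the
    # other end of the fragment.
    reverse_read = reverse_complement(fragment[-read_length:])
    return forward_read, reverse_read


fragment = "ATGCGTACCTGAAGTCCA"  # made-up 18-base fragment
fwd, rev = paired_end_reads(fragment, read_length=5)
print(fwd)  # ATGCG
print(rev)  # TGGAC
print(len(fragment) - 2 * 5)  # 8 unsequenced bases between the reads
```

When both reads are aligned to a reference, the distance between them should roughly match the fragment size, which is the basis for the mapping and rearrangement advantages listed next.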
Advantages:
- Better mapping: Know the approximate distance between the two ends (the fragment size)
- Detect rearrangements: Identify insertions, deletions, inversions
- Span repetitive regions: One end maps uniquely, helps place the other
- Improved accuracy: Two reads better than one
Applications:
- Genome assembly: Connect contigs
- Variant detection: Find structural variants
- RNA-seq: Identify splice junctions
- Metagenomics: Better species identification
18.2.7 Illumina Platforms Comparison
Platform | Reads per Run | Output per Run | Read Length | Time |
---|---|---|---|---|
MiSeq | 25 million | 15 Gb | 2 × 300 bp | 4-55 hours |
NextSeq | 400 million | 120 Gb | 2 × 150 bp | 12-30 hours |
HiSeq | 5 billion | 1,500 Gb | 2 × 150 bp | 1-6 days |
NovaSeq | 20 billion | 6,000 Gb | 2 × 150 bp | 13-44 hours |
Choosing a platform:
- Small projects (targeted sequencing): MiSeq
- Medium projects (RNA-seq, exomes): NextSeq
- Large projects (whole genomes): NovaSeq
- Clinical labs: MiSeq, NextSeq (faster turnaround)
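The output column in the platform table can be sanity-checked with simple arithmetic. The sketch below assumes that "Reads per Run" counts clusters (read pairs) and that every read reaches full length; `paired_end_output_gb` is a name chosen for this example.

```python
def paired_end_output_gb(clusters, read_length_bp):
    """Total output in gigabases for a paired-end run (two reads per cluster)."""
    return clusters * 2 * read_length_bp / 1e9


print(paired_end_output_gb(25e6, 300))   # MiSeq:   15.0
print(paired_end_output_gb(400e6, 150))  # NextSeq: 120.0
print(paired_end_output_gb(5e9, 150))    # HiSeq:   1500.0
```

The same function gives a quick way to estimate how many samples fit on a run once a target depth per sample is known.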
18.2.8 Common Applications of Illumina Sequencing
Whole Genome Sequencing (WGS):
- Sequence entire genome
- Human genome: ~30× coverage typical
- Detect all types of variants
Whole Exome Sequencing (WES):
- Sequence only protein-coding regions (~1% of genome)
- Much cheaper than WGS
- Good for finding disease-causing mutations
RNA-seq (Transcriptomics):
- Sequence all RNA in sample
- Measure gene expression levels
- Find splice variants, fusion genes
ChIP-seq (Epigenomics):
- Map protein-DNA interactions
- Find transcription factor binding sites
- Map histone modifications
Targeted Sequencing:
- Sequence specific genes or regions
- Cancer panels (e.g., 50 cancer-related genes)
- Very deep coverage of specific regions
Metagenomics:
- Sequence all DNA in environmental sample
- Identify microbes in gut, soil, ocean
- Don’t need to culture organisms!
18.2.9 Limitations and Challenges
Compared to Sanger sequencing:
- Shorter reads: 150-300 bp vs. 700-1,000 bp
- More complex data analysis: Millions of short reads to assemble
- Higher upfront cost: Equipment is expensive
Compared to long-read sequencing (PacBio, Nanopore):
- Cannot span large repetitive regions: Short reads get “lost”
- Miss structural variants: Large insertions/deletions hard to detect
- Phasing difficult: Hard to tell which variants are on same chromosome
Technical challenges:
- GC bias: GC-rich regions harder to sequence
- PCR duplicates: Multiple reads from same original molecule
- Coverage uniformity: Some regions get more reads than others
- Data storage: Terabytes of data per run!
18.2.10 The Future of Illumina Technology
Recent improvements:
- Longer reads: Now up to 2 × 250 bp or 2 × 300 bp
- Higher accuracy: Improved chemistry, better base calling algorithms
- Faster runs: Hours instead of days
- Lower cost: Approaching $100 per human genome
Emerging technologies:
- Complete Long Reads: Illumina's long-read assay, which runs on its existing short-read instruments
- Single-cell sequencing: Sequence individual cells
- Spatial transcriptomics: Sequence RNA with spatial location information
18.2.11 Summary: Illumina Sequencing
What it is: Sequencing by synthesis using fluorescent reversible terminators
How it works:
- Fragment DNA and add adapters
- Create clusters by bridge amplification
- Sequence one base at a time (different color for each base)
- Image after each base addition
- Repeat for 150-300 cycles
Key advantages:
- Massively parallel (millions of reads simultaneously)
- High accuracy (99%+)
- Cost-effective
- Versatile applications
Key limitations:
- Short reads
- GC bias
- Complex data analysis
Impact: Revolutionized genomics, enabling personalized medicine, cancer genomics, microbiome research, and much more!
18.3 Coverage and Depth: Reading Each Base Multiple Times
18.3.1 What Is Coverage?
When sequencing a genome, we don’t just read each base once - we read it MANY times! This is called coverage or depth.
Coverage = How many times, on average, each base is sequenced
Think of it like:
- Reading a book once → might miss typos
- Reading a book 30 times → very confident about every word!
Notation:
- 30× coverage (pronounced “30 ex”) = each base read 30 times on average
- 50× coverage = each base read 50 times
- 100× coverage = each base read 100 times
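Average coverage follows from a simple relationship: total sequenced bases divided by genome size. The sketch below is a back-of-the-envelope calculation that assumes all reads are the same length and all of them map; the function names are illustrative.

```python
import math


def mean_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average depth = (number of reads x read length) / genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage, read_length_bp, genome_size_bp):
    """Number of reads required to reach a target average depth."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


human_genome_bp = 3_000_000_000
print(mean_coverage(600_000_000, 150, human_genome_bp))  # 30.0
print(reads_needed(30, 150, human_genome_bp))            # 600000000
```

So a 30× human genome with 150 bp reads needs on the order of 600 million reads, or about 300 million read pairs.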
18.3.2 Why Do We Need Multiple Reads?
Sequencing errors happen!
- Illumina: ~99% accuracy per base
- That means 1 error per 100 bases!
- Human genome = 3 billion bases
- 1% error = 30 million errors if we only read once!
Solution: Read each base many times and vote!
Example:
If we sequence one position 30 times and get:
- 28 reads say “A”
- 1 read says “T”
- 1 read says “C”
We’re confident the correct base is A (the errors were T and C).
Think of it like:
- 30 witnesses to an event
- 28 say “the car was red”
- 2 say “the car was blue”
- You trust the majority → the car was red!
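The voting idea can be written down directly. This is a deliberately naive sketch: real variant callers also weigh base qualities, mapping qualities, and the possibility of heterozygous sites, none of which is modeled here.

```python
from collections import Counter


def consensus_base(observed_bases):
    """Return the most common base and the fraction of reads supporting it."""
    counts = Counter(observed_bases)
    base, count = counts.most_common(1)[0]
    return base, count / len(observed_bases)


# The pileup from the example: 28 reads say A, 1 says T, 1 says C.
pileup = ["A"] * 28 + ["T"] + ["C"]
base, support = consensus_base(pileup)
print(base)               # A
print(round(support, 3))  # 0.933

# Expected errors if a 3-billion-base genome were read once at 99% accuracy.
print(3_000_000_000 * 0.01)  # 30000000.0
```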
18.3.3 How Coverage Works in Practice
Low coverage (10× or less):
- Pros: Cheap, fast
- Cons: Less confident, might miss variants
- Uses: Population studies (many individuals, shallow depth)
Medium coverage (30×):
- Pros: Good balance of cost and accuracy
- Cons: Might miss variants present in only a small fraction of cells
- Uses: Clinical sequencing, most research
High coverage (50-100×):
- Pros: Very confident, detects low-frequency (e.g., mosaic or subclonal) variants
- Cons: Expensive, more data to store
- Uses: Cancer sequencing, rare disease diagnosis
Ultra-high coverage (500× or more):
- Pros: Can detect tiny fractions of variant alleles
- Cons: Very expensive
- Uses: Liquid biopsies (detecting 1% cancer DNA in blood)
18.3.4 Coverage in Different Applications
Whole genome sequencing (WGS):
- Clinical diagnostic: 30× coverage
- Research: 30-50× coverage
- Population studies: 10-15× coverage
Whole exome sequencing (WES):
- Typical: 100× coverage (only sequencing 1% of genome, so can afford more depth)
RNA-seq:
- Depth here is usually counted as reads per sample or per gene
- Not the same as genome coverage!
Targeted sequencing (specific genes):
- Cancer panels: 500-1,000× coverage
- Why so high? To detect mutations in small % of cells
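A simple binomial model shows why low-fraction mutations need very deep coverage. The sketch assumes each read independently carries the mutation with probability equal to the variant allele fraction, ignores sequencing errors, and uses a hypothetical rule of "at least 3 supporting reads" to call a variant.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    below_k = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below_k


allele_fraction = 0.01  # mutation present in 1% of the DNA
for depth in (30, 100, 500, 1000):
    detection = prob_at_least(3, depth, allele_fraction)
    print(f"{depth:>5}x: P(>=3 variant reads) = {detection:.4f}")
```

Under these assumptions the chance of seeing three variant reads is well under 1% at 30× but above 99% at 1,000×, which matches the coverage ranges quoted for cancer panels.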
18.3.5 The Cost-Coverage Trade-off
More coverage = more cost:
- 30× human genome: ~$1,000
- 50× human genome: ~$1,500
- 100× human genome: ~$3,000
Diminishing returns:
- Going from 10× to 30× = huge improvement
- Going from 30× to 50× = moderate improvement
- Going from 50× to 100× = small improvement
Think of it like:
- Adding the 10th security camera helps a lot
- Adding the 100th security camera doesn’t help much more
18.3.6 Coverage Is Not Uniform
Important note: Coverage is an AVERAGE, but some regions get more reads than others!
Reasons for uneven coverage:
- GC content: GC-rich regions often have lower coverage
- Repetitive sequences: Hard to sequence, lower coverage
- PCR bias: Some fragments amplify better than others
- Random chance: Statistical variation
Example:
With 30× average coverage:
- Some bases might have 50× coverage
- Other bases might have 10× coverage
- A few bases might have 0× coverage (gaps!)
18.3.7 Quality Scores and Coverage
Coverage and quality scores work together:
- High coverage + high quality = very confident
- High coverage + low quality = less confident (many bad reads)
- Low coverage + high quality = somewhat confident
- Low coverage + low quality = not confident!
Quality score (Phred score):
- Q30 = 99.9% accuracy
- Q20 = 99% accuracy
- Below Q20 = often filtered out
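The Phred scale is logarithmic: Q = -10 × log10(P), where P is the probability that the base call is wrong. A short sketch of the conversion in both directions (function names are illustrative):

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")
```

Every 10-point increase in Q means a tenfold drop in the error probability, which is why Q30 is ten times more reliable than Q20.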
18.3.8 Real-World Example: Diagnosing a Genetic Disease
Scenario: Patient with unknown genetic disease
Approach:
- Sequence patient’s genome at 30× coverage
- Each base read ~30 times
- Computer identifies variants (differences from reference)
- High coverage = confident these variants are real
- Find causative mutation!
Why 30× matters:
- Can confidently call heterozygous variants (one copy mutated)
- Expected ratio: 15 reads normal, 15 reads mutated
- Clear signal!
With only 5× coverage:
- Might see: 3 reads normal, 2 reads mutated
- OR: 4 reads normal, 1 read mutated
- Hard to tell if it’s real or an error!
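The difference between 5× and 30× can be quantified with a binomial model. The sketch assumes a true heterozygous site where each read shows the mutated allele with probability 0.5, and it uses a hypothetical calling rule of at least 3 mutated reads.

```python
from math import comb


def prob_alt_reads_at_least(k, depth, alt_fraction=0.5):
    """P(at least k reads carry the variant) at a given depth."""
    return sum(
        comb(depth, i) * alt_fraction**i * (1 - alt_fraction) ** (depth - i)
        for i in range(k, depth + 1)
    )


print(prob_alt_reads_at_least(3, 5))   # 0.5
print(prob_alt_reads_at_least(3, 30))  # very close to 1
```

Under these assumptions a real heterozygous variant would clear the threshold only half the time at 5×, while at 30× it is almost never missed.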
18.3.9 Key Takeaways on Coverage
- Coverage = average number of times each base is sequenced
- 30× is standard for clinical sequencing
- Higher coverage = more confident, but more expensive
- Coverage is uneven across the genome
- Multiple reads allow error correction through voting
- Different applications need different coverage levels
Think of coverage like:
- Insurance: More coverage = more protection, but costs more!
- Witnesses: More witnesses = more confident about the truth!
18.4 Human Genome Sequencing: Then vs. Now
18.4.1 The Amazing Progress
Feature | 2003 (HGP) | 2025 (Today) |
---|---|---|
Cost | $3 billion | <$1,000 |
Time | 13 years | 1-2 days |
Method | Sanger sequencing | NGS (Illumina) |
Accuracy | 99.99% | 99.9% |
Availability | Research only | Clinical & consumer |
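The fold changes implied by the table can be checked with a few lines of arithmetic. The sketch takes the table's round numbers at face value ($1,000 as the current cost, 1 or 2 days as the current time).

```python
hgp_cost_usd = 3_000_000_000
today_cost_usd = 1_000
cost_fold = hgp_cost_usd / today_cost_usd
print(f"cost reduction: {cost_fold:,.0f}-fold")  # 3,000,000-fold

hgp_days = 13 * 365
time_fold_if_two_days = hgp_days / 2
time_fold_if_one_day = hgp_days / 1
print(f"time reduction: {time_fold_if_two_days:,.0f}- to {time_fold_if_one_day:,.0f}-fold")
```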
That’s a reduction of:
Cost: about 3 million-fold!
Time: roughly 2,000- to 5,000-fold!
It’s like going from the price of a house to the price of a coffee!
18.4.2 What This Enables
1. Personalized Medicine:
Sequence your genome to find disease risks
Choose treatments based on your genetics
Predict drug responses
2. Cancer Genomics:
Sequence tumor DNA
Find specific mutations
Target treatments to specific mutations
Monitor treatment response
3. Rare Disease Diagnosis:
Sequence patients with unknown diseases
Find causative mutations
Enable treatment or management
4. Prenatal Testing:
Non-invasive prenatal testing (NIPT)
Detect genetic disorders before birth
From mother’s blood (no risk to baby!)
5. Consumer Genomics:
23andMe, Ancestry.com
Learn about ancestry
Find genetic relatives
Some health information
18.5 Third-Generation Sequencing
18.5.1 Long-Read Technologies
Problem with NGS (2nd gen):
Short reads (100-300 bp)
Hard to assemble repetitive regions
Miss large structural variations
Solution: Third-generation sequencing!
18.5.2 PacBio (Pacific Biosciences)
Key feature: Very long reads (10,000-100,000+ bp)
How it works:
Single molecule sequencing
Watch DNA polymerase add nucleotides in real-time
Each nucleotide carries a fluorescent label that flashes as it is incorporated
Incredible! Like watching a molecular movie!
Advantages:
Long reads span difficult regions
Can sequence through repetitive DNA
Detects DNA modifications directly
Disadvantages:
Higher raw error rate per pass (but HiFi consensus reads now exceed 99% accuracy)
More expensive
Lower throughput
18.5.3 Oxford Nanopore
Key feature: Extremely long reads (up to 2 million bp!)
How it works:
The revolutionary principle: Instead of detecting fluorescence, Oxford Nanopore measures changes in electrical current!
Step-by-step:
- The nanopore: A tiny protein pore embedded in a membrane
- Think of it like a donut hole that DNA passes through
- Maintains an electrical current flowing through it
- DNA threading: Single-stranded DNA is pulled through the pore
- Motor protein controls speed
- DNA passes through the pore base by base
- Current disruption: The bases sitting in the pore (A, T, G, C) block the current differently
- Each short run of bases produces a characteristic electrical signature
- Computer records current changes in real-time
- Pattern of disruptions reveals the sequence!
- No amplification needed: Sequences single DNA molecules directly
- No PCR required (unlike Illumina!)
- Avoids PCR errors and biases
- Detects DNA modifications directly (methylation, etc.)
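The idea of reading sequence from current levels can be illustrated with a toy model. Everything numeric here is hypothetical: the picoamp levels are invented, each base is given its own level, and bases are called by nearest level. Real pores sense several bases at once, and real basecallers use trained models rather than a lookup like this.

```python
import random

# Hypothetical mean current level (picoamps) for each base, for illustration only.
LEVELS_PA = {"A": 80.0, "C": 95.0, "G": 110.0, "T": 125.0}


def simulate_signal(sequence, noise_sd=3.0, seed=0):
    """Return one noisy current measurement per base as it passes the pore."""
    rng = random.Random(seed)
    return [rng.gauss(LEVELS_PA[base], noise_sd) for base in sequence]


def call_bases(signal):
    """Assign each measurement to the base with the closest expected level."""
    return "".join(
        min(LEVELS_PA, key=lambda base: abs(LEVELS_PA[base] - level))
        for level in signal
    )


clean = simulate_signal("GATTACA", noise_sd=0.0)
print(call_bases(clean))  # GATTACA

noisy = simulate_signal("GATTACA", noise_sd=10.0)
print(call_bases(noisy))  # may contain miscalled bases
```

Raising `noise_sd` makes neighboring levels overlap and miscalls appear, which is a rough picture of why raw nanopore reads have a higher error rate than fluorescence-based reads.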
Key advantages of no PCR:
- Faster workflow (skip amplification step)
- Fewer artifacts (no PCR errors or duplicate reads)
- Detects epigenetic modifications
- Lower sample requirements
Device features:
- Portable: USB stick size (MinION device)
- Real-time sequencing: Watch sequences appear live!
- Direct RNA sequencing: Can sequence RNA without converting to DNA first
Advantages:
- Longest reads available: Regular reads >10 kb, ultra-long reads >100 kb, record >2 Mb!
- Portable: Used in field research, remote locations, even the International Space Station!
- Real-time results: No need to wait for run to finish
- Direct RNA sequencing: Preserves modifications, no reverse transcription needed
- No amplification bias: Sequences original molecules
- Epigenetic detection: Detects methylation and other modifications directly
Disadvantages:
- Higher error rate: ~5-8% of bases miscalled in raw reads (92-95% accuracy, vs. 99%+ for Illumina)
  - But improving rapidly with better chemistry and algorithms
  - Errors are mostly random, although homopolymer runs are a known systematic weak spot
  - Random errors can be corrected with higher coverage
- More expensive per base: Still more costly than Illumina for high coverage
- Newer technology: Still improving (algorithms, chemistry, accuracy)
Applications where Nanopore excels:
- Structural variants: Long reads easily span deletions, insertions, inversions
- Repetitive regions: Can read through repeats that confuse short reads
- Transcript isoforms: Full-length RNA sequencing reveals splice variants
- Metagenomics: Long reads better for species identification
- Rapid diagnostics: Real-time results for outbreak response
- Field work: Portable sequencing where labs don’t exist!
18.5.4 Why Long Reads Matter
1. Complete Genome Assembly:
2022: First truly complete human genome
Used long reads to fill gaps
Repetitive regions finally sequenced!
2. Structural Variants:
Large deletions, insertions, inversions
Hard to detect with short reads
Easy with long reads
3. Phasing:
Determine which variants are on same chromosome
Important for understanding genetics
4. Repeat Regions:
Centromeres, telomeres
Highly repetitive
Spanned by long reads
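A tiny string-matching example shows why read length matters in repeats. The "genome" below is made up and contains the same repeat twice; exact matching stands in for a real aligner, which is a simplification.

```python
def find_all(genome, read):
    """Return every start position where the read matches the genome exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


repeat = "CAGCAGCAGCAG"
genome = "TTGA" + repeat + "GGCT" + repeat + "ACGT"

short_read = "CAGCAG"              # lies entirely inside the repeat
long_read = "GA" + repeat + "GG"   # spans the repeat plus unique flanks

print(find_all(genome, short_read))  # [4, 7, 10, 20, 23, 26]
print(find_all(genome, long_read))   # [2]
```

The short read fits in six places, so its origin is ambiguous; the long read reaches the unique sequence on both sides of the repeat and can be placed in exactly one position.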
18.6 Applications of Sequencing
18.6.1 Research
Genomics:
Sequence genomes of any organism
Understand evolution
Discover new species
Transcriptomics (RNA-seq):
Sequence all RNA in cells
Measure gene expression
Find new transcripts
Metagenomics:
Sequence all DNA in an environment
Discover microbes in gut, ocean, soil
Don’t need to culture them!
Epigenomics:
Some methods detect DNA methylation
Map epigenetic modifications
18.6.2 Medicine
Diagnosis:
Identify genetic diseases
Diagnose infections (sequence pathogen)
Cancer genomics
Pharmacogenomics:
Predict drug responses
Avoid adverse reactions
Personalize medication
Liquid Biopsies:
Sequence cell-free DNA in blood
Early cancer detection
Monitor treatment
18.6.3 Agriculture
Crop Improvement:
Sequence crop genomes
Find beneficial genes
Speed up breeding
Livestock:
Breed healthier animals
Understand genetics
Disease resistance
18.6.4 Forensics
Criminal Justice:
DNA fingerprinting
Identify suspects or victims
Solve cold cases
Paternity Testing:
- Determine biological relationships
18.6.5 Conservation
Endangered Species:
Sequence threatened species
Preserve genetic diversity
Plan breeding programs
Ancient DNA:
Sequence extinct species (mammoths, Neanderthals!)
Understand evolution
Learn about history
18.7 The Future of Sequencing
18.7.1 Emerging Technologies
Real-Time Single-Molecule Sequencing:
Watch individual DNA molecules being read
No amplification needed
Detect modifications directly
In Situ Sequencing:
Sequence DNA/RNA inside intact cells and tissues
See spatial organization
Preserve 3D context
$100 Genome:
Even cheaper than today
Routine for everyone
Preventive medicine
18.7.2 Challenges Ahead
Data Analysis:
Generating data is easy now
Interpreting data is hard!
Need better algorithms and AI
Ethical Issues:
Privacy of genetic information
Insurance discrimination
Incidental findings (finding diseases you weren’t looking for)
Equity:
Making sequencing available to everyone
Not just wealthy countries/people
Global health applications
18.8 Key Takeaways
DNA sequencing = Reading the order of A, T, G, C in DNA
Sanger sequencing (1977):
First method
Accurate but slow and expensive
Used for Human Genome Project
Next-Generation Sequencing (NGS):
Massively parallel (millions of sequences at once)
Fast, cheap, high-throughput
Revolutionized genomics!
Progress: $3 billion/13 years → <$1,000/1-2 days
Third-generation sequencing:
Long reads (up to millions of base pairs!)
PacBio and Oxford Nanopore
Completed the human genome
Applications: Medicine, research, agriculture, forensics, conservation
Future: Even cheaper, faster, better analysis, ethical considerations
Sources: Information adapted from Illumina, NHGRI, Technology Networks, and sequencing technology literature (Bentley et al. 2008; Shendure and Ji 2008; Mardis 2017).