18 Sequencing Technologies

18.1 Reading the Book of Life

18.1.1 What Is DNA Sequencing?

DNA sequencing = Reading the order of A, T, G, and C letters in DNA

Think of it like:

Reading the text in a book, letter by letter
Decoding a secret message
Reading the instruction manual of life!

18.2 From Sanger to Next-Generation Sequencing

18.2.1 The First Method: Sanger Sequencing (1977)

Developed by: Frederick Sanger (won his 2nd Nobel Prize for this!) (Sanger, Nicklen, and Coulson 1977)

How it works (simplified):

Start with DNA template
Add special labeled nucleotides that STOP copying when incorporated
Create DNA fragments of different lengths
Separate fragments by size
Read the sequence from the pattern!

Characteristics:

Accurate: 99.99% accuracy!
Slow: One sequence at a time
Expensive: Cost millions for whole genomes
Read length: Up to ~1,000 base pairs per read

Impact:

Used for the Human Genome Project!
Still used today for small-scale sequencing
Gold standard for accuracy

The Human Genome Project timeline:

Started: 1990
Finished: 2003
Cost: $3 billion
Time: 13 years
Method: Sanger sequencing

That’s $1 per base pair!

18.2.2 Sanger Sequencing: The Detailed Mechanism

18.2.2.1 The Biological Foundation

Sanger sequencing is based on semi-conservative DNA replication - the same process cells use to copy DNA.

What is semi-conservative replication?

Think of DNA like a zipper with two sides:

One side stays (the template)
One new side is built (the copy)
The old side guides making the new side

Key facts:

DNA replication needs a template strand
One strand serves as the template (parent)
A new strand is built matching the template (daughter)
Watson and Crick predicted this in 1953
Meselson-Stahl proved it in 1958

18.2.2.2 Why DNA Polymerase Needs a Primer

This is an important question! Here’s the simple answer:

The problem:

RNA polymerase → Can start making RNA from nothing
DNA polymerase → Cannot start from nothing, needs a primer first

Why the difference?

Think of it like starting a zipper:

RNA polymerase is like a self-starting zipper
DNA polymerase is like a zipper that needs the first tooth in place

Three reasons DNA polymerase needs a primer:

Reason 1: Error checking

DNA polymerase checks each base it adds (proofreading)
It needs something to hold onto to check properly
Like needing a ruler to measure accurately
RNA polymerase doesn’t check as carefully (RNA is temporary anyway)

Reason 2: Chemical differences

RNA polymerase uses NTPs (ribose sugar with a 2’-OH group)
DNA polymerase uses dNTPs (deoxyribose, missing that 2’-OH group)
DNA polymerase can only add a nucleotide onto an existing 3’-OH end
A primer supplies that first 3’-OH

Reason 3: Accuracy matters

DNA is permanent → needs to be perfect
Primer gives DNA polymerase a stable starting point
Like building a house - you need a foundation first
This reduces copying errors

Why this matters:

PCR → Needs primers to work
Sanger sequencing → Needs primers to work
Cell DNA replication → Uses RNA primers (made by primase enzyme)

18.2.3 The Chemistry of Sanger Sequencing

18.2.3.1 The Special Ingredients

Sanger sequencing uses TWO types of building blocks:

1. Normal dNTPs (regular DNA building blocks):

What they are: dATP, dTTP, dGTP, dCTP
What they do: Build DNA normally
Key feature: Have a 3’-OH group
Result: DNA chain keeps growing

2. Special ddNTPs (chain terminators):

What they are: ddATP, ddTTP, ddGTP, ddCTP
What they do: STOP DNA growth when added
Key feature: MISSING the 3’-OH group
Result: DNA chain stops immediately
Bonus: Each is labeled with a different fluorescent color
- ddA = Green
- ddT = Red
- ddG = Yellow
- ddC = Blue

18.2.3.2 The Simple Difference

Normal dNTP (keeps going):

Has -OH group → Next base can attach → Chain grows


**Special ddNTP** (stops):

NO -OH group → Next base CANNOT attach → Chain stops


Think of it like:

- Normal dNTP = Lego brick with connectors on top

- Special ddNTP = Lego brick with flat top (nothing can attach)

#### How Sanger Sequencing Works (Step by Step)

Let's sequence along this template strand, written 3'→5' so that the new strand reads 5'→3' from left to right: `3'-TACGGCATGCTA-5'`

**Step 1: Add primer**

The primer pairs with the 3' end of the template and gives DNA polymerase a free 3'-OH to extend from:

    Template DNA: 3'---TACGGCATGCTA---5'
    Primer:          5'-ATGC-3'
                              ↑
                  Polymerase extends from here →


**Step 2: Add the mixture**

We add:

- **LOTS** of normal dNTPs (A, T, G, C)

- **A FEW** special ddNTPs (ddA, ddT, ddG, ddC) with colors

**Step 3: DNA polymerase starts copying**

At each position, DNA polymerase adds either:

- a normal base → the chain keeps going
- a special colored base (ddNTP) → the chain STOPS

This creates fragments of different lengths. The new strand is written 5'→3' and the primer counts as positions 1-4:

    Fragment 1: 5'-ATGCCGddT-3'       (Stopped at position 7, RED)
    Fragment 2: 5'-ATGCCGTddA-3'      (Stopped at position 8, GREEN)
    Fragment 3: 5'-ATGCCGTAddC-3'     (Stopped at position 9, BLUE)
    Fragment 4: 5'-ATGCCGTACddG-3'    (Stopped at position 10, YELLOW)
    Fragment 5: 5'-ATGCCGTACGddA-3'   (Stopped at position 11, GREEN)
    Fragment 6: 5'-ATGCCGTACGAddT-3'  (Stopped at position 12, RED)


Each fragment ends with a colored base at its 3' end! (Fragments that stop at positions 5 and 6 form too; they are left out to keep the list short.)

**Step 4: Separate by size**

- Run the fragments through a thin, gel-filled tube (a capillary) using an electric field

- Small fragments run fast → detected first

- Large fragments run slow → detected last

- Like a race where smaller runners are faster

![Sanger Sequencing Method](images/ch15/sanger-sequencing.svg)

**Figure 15.1**: Sanger sequencing chain termination method showing how ddNTPs terminate DNA synthesis at different positions, creating fragments of varying lengths that reveal the DNA sequence.

*Image credit: Estevezj, Wikimedia Commons, CC BY-SA 3.0*

**Step 5: Read the colors**

Camera sees colors in order:

    Position 7:  RED    → T was added
    Position 8:  GREEN  → A was added
    Position 9:  BLUE   → C was added
    Position 10: YELLOW → G was added
    Position 11: GREEN  → A was added
    Position 12: RED    → T was added

    New strand = ...TACGAT (5'→3'), the complement of the template


Done! We just read the DNA sequence!
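The steps above can be sketched as a toy simulation. This is only an illustration, not lab software: the function names (`new_strand`, `terminated_fragments`, `read_sequence`) are made up for this example, the template is assumed to be written 3'→5', and the model assumes that every possible stopping point after the primer is represented.

```python
# Toy model of dye-terminator Sanger sequencing (illustration only).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
DYE = {"A": "green", "T": "red", "G": "yellow", "C": "blue"}  # colors used in this chapter


def new_strand(template_3to5: str) -> str:
    """Return the strand the polymerase builds, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in template_3to5)


def terminated_fragments(template_3to5: str, primer_len: int) -> list[str]:
    """One fragment for every position after the primer where a ddNTP could stop the chain."""
    full = new_strand(template_3to5)
    return [full[:end] for end in range(primer_len + 1, len(full) + 1)]


def read_sequence(fragments: list[str]) -> str:
    """Sort by length (what electrophoresis does) and read each fragment's last base."""
    return "".join(frag[-1] for frag in sorted(fragments, key=len))


fragments = terminated_fragments("TACGGCATGCTA", primer_len=4)
for frag in sorted(fragments, key=len):
    print(f"{len(frag):2d} {frag:<12} ends in dd{frag[-1]} ({DYE[frag[-1]]})")
print("Read:", read_sequence(fragments))
```

Sorting by length stands in for the capillary, and reading the last base of each fragment stands in for the camera.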

### The Original Sanger Method: Four Tubes

**Historical note**: The method we just described is the MODERN version (1-tube with 4 colors). The original Sanger method (1977) was different!

**How the original method worked**:

Instead of one tube with all four ddNTPs (each a different color), Sanger used **FOUR SEPARATE TUBES**:

- **Tube 1**: dNTPs + ddATP (stops at A)
- **Tube 2**: dNTPs + ddTTP (stops at T)
- **Tube 3**: dNTPs + ddGTP (stops at G)
- **Tube 4**: dNTPs + ddCTP (stops at C)

Think of it like:

- Modern method = one race with 4 colored jerseys
- Original method = 4 separate races!

**The original labeling**:

- Used **radioactive labels** (³²P or ³⁵S), not fluorescent colors
- All fragments in one tube were the same "color" (radioactive)
- No fancy colors - just radioactive signal!

**Reading the results**:

1. Run each tube on separate lanes of a gel
2. Use **gel electrophoresis** (fragments separate by size)
3. Expose gel to X-ray film (autoradiography)
4. Dark bands appear where radioactive fragments are
5. Read sequence by comparing the 4 lanes!

**Example reading**:

              A lane   T lane   G lane   C lane
              ----     ----     ----     ----
    Small     -        -        band     -        → G (smallest = first base)
              band     -        -        -        → A (next)
              -        band     -        -        → T (next)
              -        -        -        band     → C (next)
              -        -        band     -        → G (next)
    Large     -        band     -        -        → T (largest = last base)

    Sequence = GATCGT
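Reading a four-lane gel comes down to sorting bands by fragment size and noting which lane each band sits in. Here is a minimal sketch of that idea; the `lanes` dictionary and the `read_gel` function are hypothetical names, and the fragment lengths 1-6 are simply the six bands from the example.

```python
# Toy reader for a four-lane Sanger gel (illustration only).
# Each lane lists the fragment lengths (bands) seen in that ddNTP reaction.
lanes = {
    "A": [2],
    "T": [3, 6],
    "G": [1, 5],
    "C": [4],
}


def read_gel(lanes: dict[str, list[int]]) -> str:
    """Read bands from the shortest fragment to the longest."""
    bands = [(length, base) for base, lengths in lanes.items() for length in lengths]
    return "".join(base for _, base in sorted(bands))


print(read_gel(lanes))
```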


**Why this was challenging**:

- Needed 4 separate reactions (4× more work!)
- Manual reading of gel bands (tedious, error-prone)
- Radioactive materials (safety concerns)
- Limited to ~300-400 bases per gel
- Took hours to days per sequence

**The improvement to modern method**:

- **1986-1987**: Fluorescent dyes instead of radioactivity
  - Safer (no radiation)
  - Four colors = one tube instead of four!
  - Automated detection with lasers

- **1990s**: Capillary electrophoresis instead of gel
  - Faster (minutes instead of hours)
  - Automated (no manual reading)
  - Longer reads (up to 1,000 bp)
  - High-throughput (96 samples simultaneously)

**Why learn about the old method?**

- Helps understand the principle (ddNTPs stop synthesis)
- Helps you appreciate modern automation
- Many papers from 1977-2005 use this method
- Nobel Prize-winning technique!

### Modern Sanger Sequencing Protocol

**Workflow**:

**Step 1: PCR Amplification** (optional but common):

- Amplify target region

- Purify PCR product

- Ensures enough template

**Step 2: Sequencing Reaction**:

- Mix template DNA + sequencing primer

- Add DNA polymerase (thermostable, e.g., Taq)

- Add mix of dNTPs and fluorescent-labeled ddNTPs

- Thermal cycling (like PCR, but linear amplification)

**Typical mix ratio**:

- dNTPs:ddNTPs = ~100:1

- Ensures most reactions continue

- But some terminate at each position
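To see why the ratio matters, here is a rough probability sketch. It assumes, purely for illustration, that the polymerase picks up dNTPs and ddNTPs equally well, so a ~100:1 mix gives about a 1-in-101 chance of stopping at each position. Real enzymes and kits behave differently, and the function names are invented for this example.

```python
def stop_probability(dntp_to_ddntp_ratio: float) -> float:
    """Chance that the next base added is a ddNTP (assumes equal uptake)."""
    return 1.0 / (dntp_to_ddntp_ratio + 1.0)


def fraction_stopping_at(position: int, p_stop: float) -> float:
    """Fraction of chains that terminate exactly at `position` (1-based)."""
    return (1.0 - p_stop) ** (position - 1) * p_stop


def fraction_still_growing(position: int, p_stop: float) -> float:
    """Fraction of chains that extend past `position` without stopping."""
    return (1.0 - p_stop) ** position


p = stop_probability(100)  # the ~100:1 ratio from the text
for pos in (1, 100, 500, 1000):
    print(f"position {pos:4d}: stop here {fraction_stopping_at(pos, p):.5f}, "
          f"still growing {fraction_still_growing(pos, p):.5f}")
```

In this toy model, a higher dNTP:ddNTP ratio shifts the fragments toward longer lengths, and a lower ratio gives more short fragments.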

**Step 3: Cleanup**:

- Remove excess ddNTPs

- Remove salts (interfere with electrophoresis)

- Use spin columns or magnetic beads

**Step 4: Capillary Electrophoresis**:

- Inject sample into capillary

- Apply electric field

- Fragments separate by size

- Laser excites fluorophores

- Camera detects emission

**Step 5: Data Analysis**:

- Software converts peaks to sequence

- Quality scores assigned to each base

- Generate chromatogram (peak visualization)

### Reading a Sanger Chromatogram

**What you see**:

      A    G    C    T    A    G    C    T
      |    |    |    |    |    |    |    |
     Peak Peak Peak Peak Peak Peak Peak Peak


**Quality indicators**:

**Good quality**:

- Sharp, well-separated peaks

- Single color at each position

- Uniform peak height

- High signal-to-noise ratio

**Poor quality**:

- Overlapping peaks (multiple colors)

- Broad peaks

- Low peak height

- Often at start (primer) or end (long reads) of sequence

**Phred quality score**:

- Q20 = 99% accuracy (1 error in 100 bases)

- Q30 = 99.9% accuracy (1 error in 1,000 bases)

- Most modern Sanger reads reach Q30 or better across the first 700-800 bp
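The Phred scale is just a logarithm of the error probability: Q = -10 × log10(P). A small sketch of the conversion in both directions (the function names are chosen for this example):

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert an error probability back to a Phred quality score."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p_err = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p_err:.4%}, accuracy {1 - p_err:.4%}")
```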

### Semi-Conservative Replication Revisited

**Meselson-Stahl Experiment (1958)** proved semi-conservative replication:

**Experiment**:

1. Grow E. coli in heavy nitrogen (¹⁵N) medium

2. DNA becomes "heavy" (labeled)

3. Switch to normal nitrogen (¹⁴N) medium

4. After one replication: DNA is "hybrid" (one heavy, one light strand)

5. After two replications: Half hybrid, half light

**Result**: Proved DNA replicates semi-conservatively

- One template strand conserved

- One new strand synthesized
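The band pattern Meselson and Stahl saw can be reproduced with a tiny bookkeeping model. This is a sketch only: "H" and "L" stand for heavy and light strands, and `replicate` and `band_fractions` are hypothetical helper names.

```python
from collections import Counter


def replicate(duplexes: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Semi-conservative replication: each old strand pairs with a new light strand."""
    return [(strand, "L") for duplex in duplexes for strand in duplex]


def band_fractions(generations: int) -> dict[str, float]:
    """Fraction of heavy (HH), hybrid (HL), and light (LL) DNA after n rounds in light medium."""
    duplexes = [("H", "H")]  # start with fully heavy DNA
    for _ in range(generations):
        duplexes = replicate(duplexes)
    counts = Counter("".join(sorted(duplex)) for duplex in duplexes)
    return {band: n / len(duplexes) for band, n in counts.items()}


for generation in range(4):
    print(generation, band_fractions(generation))
```

A conservative model (old duplex stays intact) would never produce the all-hybrid band after one round, which is what made the experiment decisive.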

**This principle underlies**:

- All DNA replication in cells

- PCR amplification

- Sanger sequencing

- All DNA synthesis!

**Connection to Sanger sequencing**:

- Template strand = original DNA

- Newly synthesized strand = sequence read

- ddNTPs terminate synthesis at random positions

- Pattern reveals template sequence

### Why Semi-Conservative Replication Matters for Sequencing

**Advantages**:

1. **Accuracy**: Template strand provides perfect guide

2. **Complementarity**: A pairs with T, G pairs with C

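3. **Proofreading**: Replicative DNA polymerases check each base (polymerases used for sequencing usually have proofreading switched off, so they do not remove ddNTPs)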

4. **Reliability**: Same template gives same sequence

**Fidelity**:

- DNA polymerase error rate: ~1 in 10⁷ (with proofreading)

- Sanger sequencing error rate: ~1 in 10⁴ (limited mainly by signal quality and base calling, not by the polymerase)

- Still extremely accurate!

### Sanger Sequencing Applications Today

**Despite NGS dominance, Sanger is still used for**:

**1. Variant Validation**:

- NGS finds potential mutation

- Sanger confirms it

- Gold standard for clinical diagnostics

**2. Small-Scale Projects**:

- Sequencing few genes

- Checking clones

- Verifying plasmids

- More cost-effective than NGS setup

**3. Long Reads**:

- Up to 1,000 bp in single read

- Short-read NGS reads are much shorter (typically 150-300 bp)

- Useful for spanning repetitive regions

**4. Low-Throughput Needs**:

- 1-96 samples at a time

- Don't need millions of reads

- Academic labs, clinical labs

### Limitations of Sanger Sequencing

**Compared to NGS**:

- **Throughput**: Only 96 samples/run vs. millions in NGS

- **Cost**: $5-10 per reaction vs. $0.01 per Mb for NGS

- **Speed**: Hours per sample vs. days for whole genome (but millions of sequences!)

- **Scalability**: Cannot do whole genomes economically

**When NOT to use Sanger**:

- Whole genome sequencing

- RNA-seq experiments

- ChIP-seq experiments

- Metagenomics

- Any high-throughput application

### Historical Impact

**Before Sanger sequencing** (pre-1977):

- No fast, practical way to read long DNA sequences directly

- Relied on protein sequencing (slower, harder)

- Genetic code deciphered with difficulty

**After Sanger sequencing** (1977-2005):

- First genes sequenced

- First genomes sequenced (viruses, bacteria)

- **Human Genome Project possible**!

- Molecular biology revolution

- Medical genetics born

**Key milestones using Sanger**:

- 1977: First complete DNA genome sequenced (bacteriophage φX174, 5,375 bp)

- 1995: First bacterial genome (H. influenzae, 1.8 Mb)

- 1996: First eukaryotic genome (yeast, 12 Mb)

- 2001: Human genome draft (3 Gb!)

- 2003: Human genome complete

**Nobel Prizes related to sequencing**:

- 1980: Sanger - DNA sequencing method

- 1993: Mullis - PCR (enables sequencing)

- 2020: Charpentier & Doudna - CRISPR (genome editing, relies on sequencing)

### Comparison Table: Sanger vs. NGS

| Feature | Sanger Sequencing | Next-Generation Sequencing |
|---------|-------------------|----------------------------|
| **Read length** | 700-1,000 bp | 100-300 bp (Illumina) |
| **Accuracy** | 99.9-99.99% (Q30-Q40) | 99%+ (depends on platform) |
| **Throughput** | 1-96 sequences/run | Millions to billions/run |
| **Cost per Mb** | $500-1,000 | $0.01-0.10 |
| **Time** | Hours | Days (for whole genome) |
| **Applications** | Validation, cloning, small projects | WGS, RNA-seq, variant discovery |
| **Best for** | Low throughput, long reads | High throughput, discovery |

### The Revolution: Next-Generation Sequencing (NGS)

**What changed**: In the mid-2000s, new technologies emerged!

**NGS characteristics**:

- **Massively parallel**: Sequence millions of fragments simultaneously!

- **Fast**: Whole genome in days instead of years

- **Cheap**: Now costs less than $1,000 per genome

- **High-throughput**: Generate billions of base pairs of data

**Key difference from Sanger**:

- Sanger: Read ONE sequence at a time (serial)

- NGS: Read MILLIONS of sequences at once (parallel)

Think of it like:

- **Sanger**: Reading one book

- **NGS**: Reading an entire library simultaneously!

### How NGS Works (General Process)

**1. Library Preparation**:

- Break DNA into small fragments

- Add special "adapters" to fragment ends

- Like putting address labels on letters

**2. Amplification**:

- Make many copies of each fragment

- Like photocopying letters many times

**3. Sequencing**:

- Fragments attach to a surface

- Nucleotides added one at a time

- Each nucleotide emits light (different colors for A, T, G, C)

- Camera records the lights

- Computer determines sequence!

**4. Data Analysis**:

- Assemble millions of short sequences

- Like putting together a jigsaw puzzle

- Compare to reference genome

- Identify differences

### NGS Platforms

**Illumina** (most popular):

- Short reads (100-300 bp)

- Very accurate

- Most widely used

- Relatively cheap

**Ion Torrent**:

- Detects pH changes (not light)

- Faster

- Good for targeted sequencing

**Others**: Many platforms exist, each with trade-offs!

### The Evolution of Read Lengths

**Read length** = How many bases we can sequence in one continuous read

Think of it like:

- **Short reads** = reading one sentence at a time
- **Long reads** = reading entire paragraphs or pages!

**Historical progression of NGS read lengths**:

**Early days (2005-2008)**:

- **25-35 bp** reads
- Illumina Genome Analyzer
- Very short! Like reading only 2-3 words at a time
- Hard to assemble genomes
- But: massively parallel, millions of reads!

**First generation NGS (2008-2012)**:

- **50-75 bp** reads (single-end)
- **2 × 50 bp** (paired-end) = 100 bp total information
- Better, but still challenging
- Could map to reference genomes
- Assembly still difficult

**Second generation (2012-2015)**:

- **100-150 bp** reads
- **2 × 100 bp** or **2 × 150 bp** (paired-end)
- Big improvement!
- Good for most applications
- Became the standard

**Modern era (2015-present)**:

- **150-300 bp** reads (standard)
- **2 × 150 bp** or **2 × 250 bp** common
- Some platforms: **2 × 300 bp**!
- Long enough for most needs
- Good balance of length, accuracy, and cost

**Longest Illumina reads today**:

- MiSeq: Up to **2 × 300 bp** (600 bp total)
- NovaSeq: Up to **2 × 250 bp** (500 bp total) on some flow cells; **2 × 150 bp** is typical
- NextSeq: **2 × 150 bp** (300 bp total)

### Why Read Length Matters

**Longer reads are better for**:

1. **Genome assembly**:
   - Longer reads span more of the genome
   - Easier to put pieces together
   - Like having larger puzzle pieces

2. **Repetitive regions**:
   - Genomes have many repeats (same sequence multiple times)
   - Short reads get "lost" in repeats
   - Longer reads can span repeats
   - Like reading "AAAAAAA..." - need to read the whole thing to know how many As!

3. **Structural variants**:
   - Large deletions, insertions, inversions
   - Need to span the variant
   - Short reads might miss it entirely

4. **Isoform detection** (RNA-seq):
   - Full-length transcripts are thousands of bases long
   - Longer reads capture more of each isoform
   - Better identification of splice variants

**Shorter reads are better for**:

1. **Accuracy**:
   - Short reads have fewer errors
   - Quality degrades toward the end of reads
   - 50 bp at Q30 is better than 300 bp at Q20

2. **Cost**:
   - Shorter reads = cheaper per read (fewer cycles, less reagent)
   - More reads for the same budget
   - Better for high-throughput applications

3. **Speed**:
   - Shorter reads sequence faster
   - Fewer cycles = faster run time

### The Trade-Offs

| Read Length | Accuracy | Cost | Speed | Best For |
|-------------|----------|------|-------|----------|
| **50-75 bp** | Highest | Cheapest | Fastest | RNA-seq, ChIP-seq, targeted sequencing |
| **100-150 bp** | High | Moderate | Moderate | Standard WGS, exomes, most applications |
| **250-300 bp** | Good | Higher | Slower | Assembly, amplicon sequencing, metagenomics |

**Most common today**: **2 × 150 bp** (paired-end, 300 bp total information)

- Good balance of everything
- Cheap enough for routine use
- Long enough for most applications
- Standard for clinical sequencing

### Why We Couldn't Make Long Reads Earlier

**Technical challenges**:

1. **Reversible terminators don't work well for long reads**:
   - Each cycle adds chemicals that must be removed
   - After ~300 cycles, too much residue builds up
   - Signals get weaker and noisier
   - Like photocopying a photocopy 300 times - gets fuzzy!

2. **Chemistry degradation**:
   - DNA polymerase efficiency decreases
   - Fluorescent dyes accumulate
   - Clusters get dimmer
   - Error rate increases

3. **Phasing problems**:
   - In a cluster of 1,000 identical molecules
   - All should be at the same position
   - But some lag behind or jump ahead (phasing errors)
   - After 200-300 cycles, too much variation
   - Signal becomes mixed and unclear
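The phasing problem can be pictured with a very simple decay model. The sketch below assumes that each molecule in a cluster has a small, fixed chance of falling behind in every cycle and never catches up; the 0.5% lag rate is a made-up illustrative number, not an instrument specification, and molecules that jump ahead are ignored.

```python
def in_phase_fraction(cycles: int, lag_rate: float) -> float:
    """Fraction of molecules in a cluster that are still in step after `cycles` cycles."""
    return (1.0 - lag_rate) ** cycles


LAG_RATE = 0.005  # hypothetical: 0.5% of molecules fall behind per cycle

for cycles in (50, 100, 150, 300):
    print(f"after {cycles:3d} cycles: {in_phase_fraction(cycles, LAG_RATE):.1%} still in phase")
```

Even a tiny per-cycle loss compounds, which is why the signal from late cycles is weaker and more mixed than the signal from early ones.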

**Recent improvements**:

- **Better chemistry**: More stable reagents
- **Better enzymes**: DNA polymerase that works longer
- **Better optics**: Brighter signals, better cameras
- **Better algorithms**: Correct for phasing errors
- **Result**: 300 bp reads now routine!

### Paired-End vs. Single-End

**Single-end sequencing**:

- Sequence only one end of fragment
- One read per fragment
- Example: 150 bp read

**Paired-end sequencing**:

- Sequence BOTH ends of fragment
- Two reads per fragment
- Example: 2 × 150 bp = two 150 bp reads from same fragment
- Know the distance between reads (~300-500 bp)

**Why paired-end is better**:

- **More information**: Two reads instead of one!
- **Better mapping**: Knowing both ends helps place the fragment
- **Detect structural variants**: If distance is wrong, something's inserted or deleted
- **Span gaps**: One end maps, helps place the other end
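The "distance is wrong" idea can be sketched as a simple check on how far apart the two reads of a pair land on the reference. The expected distance (400 bp) and tolerance (100 bp) below are assumed values for illustration, and `classify_pair` is a hypothetical helper; real variant callers use the measured insert-size distribution, read orientation, and many pairs of evidence.

```python
def classify_pair(mapped_distance: int, expected: int = 400, tolerance: int = 100) -> str:
    """Compare the distance between mapped read ends with the expected fragment size."""
    if mapped_distance > expected + tolerance:
        # Reads land too far apart on the reference: the sample is missing sequence.
        return "possible deletion in sample"
    if mapped_distance < expected - tolerance:
        # Reads land too close together: the sample carries extra sequence.
        return "possible insertion in sample"
    return "consistent"


for distance in (420, 1500, 150):
    print(distance, "→", classify_pair(distance))
```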

**Standard today**: Almost always paired-end!

### Long-Read Technologies (3rd Generation)

**The limit of short-read technologies**:

- Illumina: Maximum ~2 × 300 bp (600 bp per fragment, paired-end)
- Even with improvements, can't go much longer
- Fundamental chemistry limitations

**Solution**: Completely different technologies!

**PacBio**:

- **10,000-100,000+ bp** reads
- Average: ~10-20 kb
- Longest: > 100 kb!
- Uses single-molecule real-time sequencing
- Different chemistry (not reversible terminators)

**Oxford Nanopore**:

- **Regular reads**: 10-50 kb
- **Ultra-long reads**: 100-200 kb
- **Record**: > 2 million bp!
- Uses electrical current (not fluorescence)
- Can sequence indefinitely until DNA breaks

**Trade-offs for long reads**:

- **Pros**: Much longer! Span repetitive regions, complete genomes
- **Cons**: Lower raw-read accuracy (historically ~90-95% vs. 99%+ for Illumina), more expensive, lower throughput

**The future**: Combine short accurate reads (Illumina) with long reads (PacBio/Nanopore) for best of both!

## Illumina Sequencing: The Details

### Why Illumina Dominates NGS

**Illumina/Solexa sequencing** is the most widely used NGS technology today. It was developed by Solexa (acquired by Illumina in 2007) and has become the gold standard for high-throughput sequencing [@bentley2008accurate].

**Key advantages**:

- **Massively parallel**: Sequence millions of DNA fragments simultaneously
- **High accuracy**: 99%+ accuracy with quality scores
- **Cost-effective**: Cheapest per-base cost
- **Versatile**: Used for genome sequencing, RNA-seq, ChIP-seq, and more

### The Core Principle: Sequencing by Synthesis (SBS)

Illumina uses **sequencing by synthesis (SBS)** - it sequences DNA by watching DNA polymerase add nucleotides one at a time!

**The brilliant idea**:

- Use special fluorescent nucleotides
- Each base (A, T, G, C) glows a different color
- Watch which color appears → know which base was added
- Repeat for every position in the DNA!

Think of it like:

- DNA polymerase is building a LEGO tower
- Each LEGO brick glows a different color
- A camera takes a picture after each brick is added
- The color sequence reveals the DNA sequence!

### The Magic Ingredient: Reversible Terminator Nucleotides

**Normal dNTPs** (regular DNA building blocks):

- DNA polymerase can add them continuously
- Hard to control - might add multiple bases at once

**Illumina's reversible terminators**:

- **Modified dNTPs** with two special features:
  1. **Fluorescent dye** attached (different color for A, T, G, C)
  2. **Reversible terminator** group at 3' position

**How they work**:

1. DNA polymerase adds ONE modified nucleotide
2. The terminator BLOCKS the next base from being added
3. Camera captures the fluorescent signal
4. A chemical cleavage step REMOVES the fluorescent dye and terminator
5. Next cycle begins!

**The four colors**:

- **Adenine (A)**: Green fluorescence
- **Thymine (T)**: Red fluorescence
- **Guanine (G)**: Yellow fluorescence
- **Cytosine (C)**: Blue fluorescence

(Note: Exact colors vary by Illumina platform, but each base has a unique signal)

### Step-by-Step: How Illumina Sequencing Works

#### Step 1: Library Preparation

**Goal**: Prepare DNA fragments with special adapters

Library preparation is like preparing letters for mailing - you need to put addresses and return addresses on them so they get to the right place!

**Process**:

**1. Fragment DNA**: Break genomic DNA into small pieces (typically 200-500 bp)

**Methods to fragment DNA**:

**a) Enzymatic fragmentation** (using enzymes):

- **Restriction enzymes**: Cut DNA at specific sequences
  - Like molecular scissors that only cut at certain words
  - Problem: Cuts are not random, biased toward certain sequences
  - Less common for NGS

- **Tagmentation** (Nextera method):
  - Enzyme (transposase) cuts AND adds adapters simultaneously!
  - Very fast (minutes instead of hours)
  - Becoming very popular
  - Like a machine that stamps and addresses letters at once

- **Enzymatic shearing**: Enzymes that cut at near-random positions
  - Gentler than mechanical shearing, less biased than restriction enzymes
  - Like using scissors randomly on a string

**b) Physical fragmentation** (mechanical):

- **Sonication**: Use sound waves to break DNA
  - High-frequency sound vibrates DNA until it breaks
  - Like using a jackhammer on ice
  - Creates random fragments
  - Takes 5-30 minutes

- **Nebulization**: Force DNA through tiny holes
  - High pressure shears the DNA
  - Like pushing spaghetti through a screen
  - Also creates random fragments
  - Quick but uses more DNA

**Why fragment size matters**:

- **Too short** (< 100 bp): Hard to map uniquely to genome
- **Too long** (> 800 bp): Forms clusters poorly on the flow cell, harder to amplify
- **Just right** (200-500 bp): Perfect for Illumina!

**2. End repair**: Make fragment ends smooth and blunt

After fragmentation, DNA ends are ragged (like torn paper edges). We need to smooth them out!

- Remove overhangs (extra single-stranded bits)
- Fill in gaps
- Make all ends blunt (flat)
- Like filing down rough edges on wood

**Why this matters**: The next steps (A-tailing and adapter ligation) only work efficiently on clean, blunt ends

**3. A-tailing**: Add single "A" nucleotide to 3' ends

- Adds one adenine (A) to the 3' end of each strand
- Prepares for adapter ligation: adapters carry a matching single "T" overhang
- Like adding a hook to hang things on

**4. Add adapters**: Attach short DNA sequences to both ends

**What are adapters?**

Adapters are short DNA sequences (20-100 bp) that get attached to BOTH ends of every DNA fragment.

Think of adapters like:

- **Mailing addresses** on letters
- **Barcode labels** on packages
- **Handles** on suitcases

**Adapter structure**:

    5'-[P5 adapter]-[Your DNA fragment]-[P7 adapter]-3'
             ↑                                ↑
    Binds to flow cell               Binds to flow cell
    + sequencing primer              + sequencing primer

What adapters contain:

Flow cell binding sequences:
- Let fragments stick to flow cell surface
- Complementary to oligos on flow cell
- Like Velcro hooks that match Velcro loops
Sequencing primer binding sites:
- Where sequencing primers attach
- Start point for DNA polymerase
- Like “Start here” signs
Index/Barcode sequences (optional):
- Unique tags (6-8 bp) to identify samples
- Like apartment numbers on an address
- Allows multiplexing!

5. Ligation: Glue adapters to DNA fragments

Use DNA ligase enzyme (molecular glue)
Creates covalent bonds
Very stable connection
Like super-gluing handles onto boxes

6. Size selection: Keep only fragments of desired size

After adapter ligation, we have fragments of many sizes. We need to keep only the ones we want!

Methods:

Gel electrophoresis: Separate by size, cut out desired band
Magnetic beads (SPRI beads): Most common today
- Beads bind DNA based on fragment size
- Add magnets to pull beads (with DNA) to side
- Wash away unwanted fragments
- Like using a magnet to sort iron filings by size

Typical target: 300-500 bp fragments (including adapters)

7. PCR amplification: Make many copies of each fragment

Enriches fragments that carry adapters on both ends and makes enough DNA to load the flow cell
Use PCR primers that bind to adapters
Typically 8-12 PCR cycles
Too many cycles = bias and duplicates
Too few cycles = not enough DNA

Think of it like:

Photocopying important documents
Need enough copies to work with
But too many copies waste paper!

Quality control at each step:

After fragmentation: Check size distribution (should be 200-500 bp)
After adapter ligation: Check that adapters attached (size increases by ~120 bp)
After PCR: Check DNA concentration (need enough!) and size
Use: Bioanalyzer or TapeStation (measures DNA size electronically)

Why library prep matters:

Good library = high-quality sequencing data
Bad library = garbage data (garbage in, garbage out!)
Poor library prep is one of the most common causes of poor sequencing data

18.2.4 Multiplexing: Sequencing Many Samples at Once

The problem: Illumina machines generate BILLIONS of reads per run. But you might only need MILLIONS for your sample. Wasteful!

The solution: Multiplexing = mix multiple samples in one sequencing run!

How it works:

Sample 1: Add index “ATCACG” to adapters
Sample 2: Add index “CGATGT” to adapters
Sample 3: Add index “TTAGGC” to adapters
Mix all three samples together
Sequence them all at once!
Computer reads the index and sorts reads by sample
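The sorting step (demultiplexing) can be sketched in a few lines. This toy version uses the three example indexes from the list above and only accepts exact matches; the `demultiplex` function and the example reads are made up for illustration, and real demultiplexing software typically also tolerates a limited number of index mismatches.

```python
from collections import defaultdict

# Index sequences from the example above, mapped to sample names.
SAMPLE_INDEXES = {"ATCACG": "Sample 1", "CGATGT": "Sample 2", "TTAGGC": "Sample 3"}


def demultiplex(reads: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group reads by sample using the index read; unknown indexes go to 'Undetermined'."""
    bins = defaultdict(list)
    for index_seq, insert_seq in reads:
        sample = SAMPLE_INDEXES.get(index_seq, "Undetermined")
        bins[sample].append(insert_seq)
    return dict(bins)


# Hypothetical (index, insert) pairs coming off one mixed run.
mixed_reads = [
    ("ATCACG", "GATTACA"),
    ("TTAGGC", "CCGGTTA"),
    ("ATCACG", "TTGACCA"),
    ("NNNNNN", "ACGTACG"),
]
for sample, inserts in demultiplex(mixed_reads).items():
    print(sample, inserts)
```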

Think of it like:

Mailing 100 letters from different people in one mailbag
Each letter has a return address (index)
Post office sorts them by return address at destination!

Benefits:

Cost-effective: Share one expensive sequencing run
Efficient: Don’t waste billions of reads on one sample
Common: Can multiplex 12, 24, 96, or even 384 samples!

Real example:

NovaSeq generates 20 billion reads
Suppose each sample needs only 100 million reads (for example, a small genome or an RNA-seq sample)
Can sequence 200 samples in one run!
Cost: $10,000 run ÷ 200 samples = $50 per sample!
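The arithmetic in this example is easy to check, and it is worth adding one more number: how much coverage a given read count actually buys. The sketch below assumes 150 bp reads and the 3 Gb human genome size used earlier in this chapter; the function names are invented for this example.

```python
def samples_per_run(reads_per_run: int, reads_per_sample: int) -> int:
    """How many samples fit in one run if reads are shared evenly."""
    return reads_per_run // reads_per_sample


def cost_per_sample(run_cost: float, n_samples: int) -> float:
    """Sequencing cost per sample when a run is shared."""
    return run_cost / n_samples


def mean_coverage(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average number of times each base of the genome is read."""
    return n_reads * read_length_bp / genome_size_bp


n = samples_per_run(20_000_000_000, 100_000_000)
print("samples per run:", n)
print("cost per sample: $", cost_per_sample(10_000, n))
print("coverage of a 3 Gb genome:", mean_coverage(100_000_000, 150, 3_000_000_000), "x")
```

Under these assumptions, 100 million reads of 150 bp cover a human-sized genome only about five times, so how many samples really fit in a run depends on the genome size and the coverage you need.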

18.2.5 Vector-Based Sequencing (Older Method)

Historical note: Before modern library prep, scientists used vectors (bacterial plasmids):

Old method:

Insert your DNA fragment into a plasmid (circular DNA)
Transform bacteria with plasmid
Grow bacteria (each makes many copies)
Extract plasmid DNA
Sequence using Sanger sequencing
Primers bind to known vector sequences

Why vectors were used:

Amplified DNA (bacteria copy it millions of times)
Provided known primer binding sites
Cloned DNA for storage

Why we don’t use them for NGS:

Too slow (growing bacteria takes days)
Not needed (PCR amplifies DNA faster)
Can’t multiplex easily
Modern library prep is faster and better!

Still used for:

Sanger sequencing of cloned genes
Making DNA constructs for experiments
Long-term storage of DNA sequences

Why adapters matter:

Let fragments stick to flow cell surface (P5/P7 sequences)
Provide primer binding sites for sequencing
Add barcodes (for multiplexing - mixing multiple samples)
Enable bridge amplification (clustering on flow cell)

Library = collection of all DNA fragments with adapters attached

Think of library like:

All books in a library have call numbers (adapters)
Call numbers let you find and organize books
Same way, adapters let us find and sequence DNA fragments!

18.2.5.1 Step 2: Cluster Generation (Bridge Amplification)

The flow cell:

Glass slide with millions of tiny spots
Each spot has oligonucleotides (short DNA sequences) attached
These oligos match the adapter sequences

Bridge amplification process:

Attach: DNA fragment binds to flow cell via adapter
Bend: Fragment bends over and binds to nearby oligo (makes a “bridge”)
Amplify: DNA polymerase copies the fragment
Separate: Double-stranded DNA is denatured (separated)
Repeat: Process repeats ~30-35 times

Result: Each single DNA molecule becomes a “cluster” of ~1,000 identical copies in one tiny spot!

Why clustering is brilliant:

Signal amplification: 1,000 copies glow much brighter than 1 molecule
Accuracy: All copies in cluster have same sequence
Millions of clusters: Each original DNA fragment gets its own cluster
Parallel sequencing: All clusters sequenced simultaneously!

Think of it like:

One person shouting = hard to hear
1,000 people shouting the same thing = very loud and clear!

18.2.5.2 Step 3: Sequencing by Synthesis

Now the actual sequencing begins!

Cycle 1: Add first base

Add mix to flow cell:
- All four fluorescent reversible terminator dNTPs (A, T, G, C)
- DNA polymerase
- Sequencing primer (binds to adapter)
Incorporation:
- DNA polymerase adds ONE base to each cluster
- Only the complementary base binds
- Terminator prevents adding more bases
Wash: Remove unincorporated nucleotides
Image:
- Laser excites fluorescent dyes
- Camera captures emission from each cluster
- Computer records which color (which base)
Cleave:
- A chemical cleavage step removes the fluorescent dye
- The same step removes the terminator group
- 3’-OH is restored for next cycle

Cycle 2: Add second base

Repeat exact same process
Add second base
Image the color
Record the base

Cycles 3, 4, 5… up to 150-300 cycles:

Continue adding one base at a time
Image each cycle
Build up the complete sequence!
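The cycle-by-cycle logic can be sketched for a single cluster. This is a toy model: it uses the color scheme listed in this chapter (real instruments differ), assumes the template is given 3'→5', and ignores errors and phasing. `sequencing_by_synthesis` is a hypothetical function name.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
COLOR = {"A": "green", "T": "red", "G": "yellow", "C": "blue"}


def sequencing_by_synthesis(template_3to5: str, cycles: int) -> list[tuple[int, str, str]]:
    """Run one incorporate-image-cleave cycle per template base and record each call."""
    calls = []
    for cycle, template_base in enumerate(template_3to5[:cycles], start=1):
        incorporated = COMPLEMENT[template_base]  # only the complementary base is added
        color = COLOR[incorporated]               # imaging: the color identifies the base
        calls.append((cycle, color, incorporated))
        # Cleavage would remove the dye and terminator here, restoring the 3'-OH.
    return calls


calls = sequencing_by_synthesis("TACGGCATGCTA", cycles=6)
for cycle, color, base in calls:
    print(f"cycle {cycle}: {color:<6} → {base}")
print("Read:", "".join(base for _, _, base in calls))
```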

Key features:

One base per cycle: Reversible terminators prevent errors from homopolymers (like AAAAA)
All four dNTPs present: Natural competition minimizes bias
Massively parallel: Millions of clusters sequenced simultaneously

18.2.5.3 Step 4: Data Analysis and Base Calling

What happens to the images:

Image analysis: Software identifies each cluster spot
Intensity measurement: Measures fluorescence intensity and color for each cluster
Base calling: Determines which base (A, T, G, C) was added
- Compares signal to expected wavelengths
- Assigns most likely base
Quality scoring: Assigns confidence score (Phred score) to each base
- Q20 = 99% accuracy (1 error in 100 bases)
- Q30 = 99.9% accuracy (1 error in 1,000 bases)
- Q40 = 99.99% accuracy (1 error in 10,000 bases)
FASTQ file generation: Creates file with:
- Read sequence
- Quality score for each base
- Read identifier
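
The Phred scale and the FASTQ record described above can be sketched in a few lines. The Phred definition, Q = -10 × log10(error probability), is standard; the helper names and the example read are made up for illustration, and the parser assumes the common Phred+33 quality encoding (each quality character's ASCII code minus 33).

```python
def phred_to_error_probability(q):
    """Phred definition: Q = -10 * log10(P_error), so P_error = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def parse_fastq_record(lines):
    """Parse one four-line FASTQ record, assuming Phred+33 quality encoding."""
    header, sequence, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a valid FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Each quality character encodes one Phred score: ASCII code minus 33.
    scores = [ord(char) - 33 for char in quality]
    return {"id": header[1:], "sequence": sequence, "qualities": scores}


# Hypothetical read: 'I' encodes Q40, '5' encodes Q20, '+' encodes Q10.
record = parse_fastq_record(["@read_1", "GATTACA", "+", "IIII5I+"])
print(record["qualities"])  # [40, 40, 40, 40, 20, 40, 10]
print([phred_to_error_probability(q) for q in (20, 30, 40)])
```

Every 10 points on the Phred scale means a 10-fold drop in error probability, which is why Q20, Q30, and Q40 correspond to 1 error in 100, 1,000, and 10,000 bases.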

Quality control:

Low-quality reads filtered out
Adapter sequences trimmed
Reads ready for alignment or assembly!
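
A minimal sketch of the two quality-control steps above, assuming exact adapter matching and a mean-quality cutoff. The function names, the `ADAPTER` string, and the Q20 threshold are example choices; real trimming tools also handle partial and mismatched adapters, which this sketch does not.

```python
# Sketch of two common clean-up steps: adapter trimming and quality filtering.
# ADAPTER and the Q20 threshold are example values chosen for illustration.
ADAPTER = "AGATCGGAAGAGC"


def trim_adapter(sequence, qualities, adapter=ADAPTER):
    """Cut the read at the first exact occurrence of the adapter, if present."""
    index = sequence.find(adapter)
    if index == -1:
        return sequence, qualities
    return sequence[:index], qualities[:index]


def passes_quality_filter(qualities, min_mean_quality=20):
    """Keep a read only if it is non-empty and its mean Phred score is high enough."""
    if not qualities:
        return False
    return sum(qualities) / len(qualities) >= min_mean_quality


# Hypothetical read: 8 good bases followed by 13 low-quality adapter bases.
sequence, qualities = trim_adapter("ACGTACGT" + ADAPTER, [35] * 8 + [12] * 13)
print(sequence, passes_quality_filter(qualities))  # ACGTACGT True
```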

18.2.6 Paired-End Sequencing

Even more powerful: Illumina can sequence BOTH ends of each DNA fragment!

How it works:

Sequence one end (forward read, 150 bp)
Wash away sequencing reagents
Regenerate the complementary strand with another round of bridge synthesis, then remove the original strand
Sequence the OTHER end (reverse read, 150 bp)
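
To make the two reads concrete, here is a small sketch of what a forward and a reverse read look like for one fragment. The fragment and the 5 bp read length are made up to keep the output short; the function names are illustrative.

```python
# The reverse read comes from the opposite strand, so it is the reverse
# complement of the far end of the fragment.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return sequence.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment, read_length):
    """Return (forward read, reverse read) for a single DNA fragment."""
    forward = fragment[:read_length]
    reverse = reverse_complement(fragment)[:read_length]
    return forward, reverse


fragment = "AAACCCGGGTTTACGT"
forward, reverse = paired_end_reads(fragment, read_length=5)
print(forward, reverse)  # AAACC ACGTA
```

The bases between the two reads are never sequenced, but the aligner knows the pair came from one fragment, which is what makes the distance between the mapped reads informative.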

Advantages:

Better mapping: Know the approximate distance between the two ends (the fragment size)
Detect rearrangements: Identify insertions, deletions, inversions
Span repetitive regions: One end maps uniquely, helps place the other
Improved accuracy: Two reads better than one

Applications:

Genome assembly: Connect contigs
Variant detection: Find structural variants
RNA-seq: Identify splice junctions
Metagenomics: Better species identification

18.2.7 Illumina Platforms Comparison

Platform	Reads per Run	Output per Run	Read Length	Time
MiSeq	25 million	15 Gb	2 × 300 bp	4-55 hours
NextSeq	400 million	120 Gb	2 × 150 bp	12-30 hours
HiSeq	5 billion	1,500 Gb	2 × 150 bp	1-6 days
NovaSeq	20 billion	6,000 Gb	2 × 150 bp	13-44 hours
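
The output column can be sanity-checked with simple arithmetic. The sketch below assumes that the "Reads per Run" figures count clusters (read pairs), so that output = clusters × 2 reads × read length; that reading is an assumption, and the function name is illustrative.

```python
def output_gigabases(clusters, read_length_bp, paired=True):
    """Total sequenced bases in Gb, assuming every read reaches full length."""
    reads_per_cluster = 2 if paired else 1
    return clusters * reads_per_cluster * read_length_bp / 1e9


print(output_gigabases(25e6, 300))   # MiSeq:   15.0
print(output_gigabases(400e6, 150))  # NextSeq: 120.0
print(output_gigabases(5e9, 150))    # HiSeq:   1500.0
```

The same relationship is useful in reverse: dividing the output you need by the read length tells you how many clusters a project requires.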

Choosing a platform:

Small projects (targeted sequencing): MiSeq
Medium projects (RNA-seq, exomes): NextSeq
Large projects (whole genomes): NovaSeq
Clinical labs: MiSeq, NextSeq (faster turnaround)

18.2.8 Common Applications of Illumina Sequencing

Whole Genome Sequencing (WGS):

Sequence entire genome
Human genome: ~30× coverage typical
Detect all types of variants

Whole Exome Sequencing (WES):

Sequence only protein-coding regions (~1% of genome)
Much cheaper than WGS
Good for finding disease-causing mutations

RNA-seq (Transcriptomics):

Sequence all RNA in sample
Measure gene expression levels
Find splice variants, fusion genes

ChIP-seq (Epigenomics):

Map protein-DNA interactions
Find transcription factor binding sites
Map histone modifications

Targeted Sequencing:

Sequence specific genes or regions
Cancer panels (e.g., 50 cancer-related genes)
Very deep coverage of specific regions

Metagenomics:

Sequence all DNA in environmental sample
Identify microbes in gut, soil, ocean
Don’t need to culture organisms!

18.2.9 Limitations and Challenges

Compared to Sanger sequencing:

Shorter reads: 150-300 bp vs. 700-1,000 bp
More complex data analysis: Millions of short reads to assemble
Higher upfront cost: Equipment is expensive

Compared to long-read sequencing (PacBio, Nanopore):

Cannot span large repetitive regions: Short reads get “lost”
Miss structural variants: Large insertions/deletions hard to detect
Phasing difficult: Hard to tell which variants are on the same copy of a chromosome

Technical challenges:

GC bias: GC-rich regions harder to sequence
PCR duplicates: Multiple reads from same original molecule
Coverage uniformity: Some regions get more reads than others
Data storage: Terabytes of data per run!

18.2.10 The Future of Illumina Technology

Recent improvements:

Longer reads: Now up to 2 × 250 bp or 2 × 300 bp
Higher accuracy: Improved chemistry, better base calling algorithms
Faster runs: Hours instead of days
Lower cost: Approaching $100 per human genome

Emerging technologies:

Complete Long Reads: Illumina’s long-read assay that runs on its existing short-read sequencers
Single-cell sequencing: Sequence individual cells
Spatial transcriptomics: Sequence RNA with spatial location information

18.2.11 Summary: Illumina Sequencing

What it is: Sequencing by synthesis using fluorescent reversible terminators

How it works:

Fragment DNA and add adapters
Create clusters by bridge amplification
Sequence one base at a time (different color for each base)
Image after each base addition
Repeat for 150-300 cycles

Key advantages:

Massively parallel (millions of reads simultaneously)
High accuracy (99%+)
Cost-effective
Versatile applications

Key limitations:

Short reads
GC bias
Complex data analysis

Impact: Revolutionized genomics, enabling personalized medicine, cancer genomics, microbiome research, and much more!

18.3 Coverage and Depth: Reading Each Base Multiple Times

18.3.1 What Is Coverage?

When sequencing a genome, we don’t just read each base once - we read it MANY times! This is called coverage or depth.

Coverage = How many times, on average, each base is sequenced

Think of it like:

Reading a book once → might miss typos
Reading a book 30 times → very confident about every word!

Notation:

30× coverage (pronounced “30 ex”) = each base read 30 times on average
50× coverage = each base read 50 times on average
100× coverage = each base read 100 times on average
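
Average coverage follows from a simple relationship: coverage = (number of reads × read length) / genome size. The sketch below applies it to a human genome, assuming 150 bp reads and a 3 billion bp genome; the function names are illustrative.

```python
import math


def coverage(num_reads, read_length_bp, genome_size_bp):
    """Average coverage = total sequenced bases / genome size."""
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage, read_length_bp, genome_size_bp):
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


HUMAN_GENOME_BP = 3_000_000_000

print(reads_needed(30, 150, HUMAN_GENOME_BP))        # 600000000
print(coverage(600_000_000, 150, HUMAN_GENOME_BP))   # 30.0
```

So 30× coverage of a human genome takes roughly 600 million reads of 150 bp, or 300 million read pairs.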

18.3.2 Why Do We Need Multiple Reads?

Sequencing errors happen!

Illumina: ~99% accuracy per base
That means 1 error per 100 bases!
Human genome = 3 billion bases
1% error = 30 million errors if we only read once!

Solution: Read each base many times and vote!

Example:

If we sequence one position 30 times and get:

28 reads say “A”
1 read says “T”
1 read says “C”

We’re confident the correct base is A (the errors were T and C).

Think of it like:

30 witnesses to an event
28 say “the car was red”
2 say “the car was blue”
You trust the majority → the car was red!
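
The voting example above (28 A, 1 T, 1 C) can be sketched as a simple majority vote. This is a deliberately simplified illustration: real variant callers also weigh base quality scores and allow for heterozygous positions. The function name is made up.

```python
from collections import Counter


def consensus_base(observed_bases):
    """Return the most common base and the fraction of reads supporting it."""
    counts = Counter(observed_bases)
    base, count = counts.most_common(1)[0]
    return base, count / len(observed_bases)


# The 30 reads covering one position, as in the example above.
pileup = ["A"] * 28 + ["T"] + ["C"]
base, support = consensus_base(pileup)
print(base, round(support, 3))  # A 0.933
```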

18.3.3 How Coverage Works in Practice

Low coverage (10× or less):

Pros: Cheap, fast
Cons: Less confident, might miss variants
Uses: Population studies (many individuals, shallow depth)

Medium coverage (30×):

Pros: Good balance of cost and accuracy
Cons: Might miss some rare variants
Uses: Clinical sequencing, most research

High coverage (50-100×):

Pros: Very confident, detects rare variants
Cons: Expensive, more data to store
Uses: Cancer sequencing, rare disease diagnosis

Ultra-high coverage (500× or more):

Pros: Can detect tiny fractions of variant alleles
Cons: Very expensive
Uses: Liquid biopsies (detecting 1% cancer DNA in blood)

18.3.4 Coverage in Different Applications

Whole genome sequencing (WGS):

Clinical diagnostic: 30× coverage
Research: 30-50× coverage
Population studies: 10-15× coverage

Whole exome sequencing (WES):

Typical: 100× coverage (only sequencing 1% of genome, so can afford more depth)

RNA-seq:

Coverage here means number of reads per gene
Not the same as genome coverage!

Targeted sequencing (specific genes):

Cancer panels: 500-1,000× coverage
Why so high? To detect mutations in small % of cells
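
A rough binomial calculation shows why rare mutations need deep coverage. The sketch assumes each read independently carries the mutation with probability equal to the variant allele fraction, ignores sequencing errors, and uses an arbitrary rule of "at least 3 supporting reads" to count as detected; the function name and that threshold are illustrative.

```python
from math import comb


def prob_at_least(k, depth, allele_fraction):
    """P(at least k of `depth` reads carry the variant), binomial model."""
    p_fewer = sum(
        comb(depth, i) * allele_fraction**i * (1 - allele_fraction) ** (depth - i)
        for i in range(k)
    )
    return 1 - p_fewer


# Chance of seeing 3 or more mutant reads when 1% of the DNA is mutated.
for depth in (30, 500, 1000):
    print(depth, round(prob_at_least(3, depth, 0.01), 3))
```

At 30× a 1% variant is expected in only 0.3 reads, so it is almost never seen three times; at 1,000× it is expected in about 10 reads and is detected almost every time.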

18.3.5 The Cost-Coverage Trade-off

More coverage = more cost:

30× human genome: ~$1,000
50× human genome: ~$1,500
100× human genome: ~$3,000

Diminishing returns:

Going from 10× to 30× = huge improvement
Going from 30× to 50× = moderate improvement
Going from 50× to 100× = small improvement

Think of it like:

Adding the 10th security camera helps a lot
Adding the 100th security camera doesn’t help much more

18.3.6 Coverage Is Not Uniform

Important note: Coverage is an AVERAGE, but some regions get more reads than others!

Reasons for uneven coverage:

GC content: GC-rich regions often have lower coverage
Repetitive sequences: Hard to sequence, lower coverage
PCR bias: Some fragments amplify better than others
Random chance: Statistical variation

Example:

With 30× average coverage:

Some bases might have 50× coverage
Other bases might have 10× coverage
A few bases might have 0× coverage (gaps!)
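
The "random chance" part of this variation is often approximated with a Poisson distribution whose mean is the average coverage. The sketch below uses that idealized model (reads landing uniformly at random), which is an assumption; the function names are illustrative.

```python
from math import exp, factorial


def poisson_pmf(k, mean):
    """P(a base is covered exactly k times) under a Poisson model."""
    return mean**k * exp(-mean) / factorial(k)


def prob_depth_at_most(k, mean):
    """P(a base is covered k times or fewer)."""
    return sum(poisson_pmf(i, mean) for i in range(k + 1))


print(poisson_pmf(0, 30))          # chance of a gap at 30x average coverage
print(prob_depth_at_most(10, 30))  # chance of 10x or less at 30x average
```

Under pure random sampling, a gap at 30× average coverage has a probability of about 1 in 10 trillion, so the gaps and low-coverage regions seen in real data come mostly from the systematic causes listed above (GC content, repeats, PCR bias) rather than from chance.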

18.3.7 Quality Scores and Coverage

Coverage and quality scores work together:

High coverage + high quality = very confident
High coverage + low quality = less confident (many bad reads)
Low coverage + high quality = somewhat confident
Low coverage + low quality = not confident!

Quality score (Phred score):

Q30 = 99.9% accuracy
Q20 = 99% accuracy
Below Q20 = often filtered out

18.3.8 Real-World Example: Diagnosing a Genetic Disease

Scenario: Patient with unknown genetic disease

Approach:

Sequence patient’s genome at 30× coverage
Each base read ~30 times
Computer identifies variants (differences from reference)
High coverage = confident these variants are real
Find causative mutation!

Why 30× matters:

Can confidently call heterozygous variants (one copy mutated)
Expected ratio: 15 reads normal, 15 reads mutated
Clear signal!

With only 5× coverage:

Might see: 3 reads normal, 2 reads mutated
OR: 4 reads normal, 1 read mutated
Hard to tell if it’s real or an error!
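
The difference between 5× and 30× can be quantified with a binomial model. The sketch assumes a true heterozygous site where each read shows the mutated copy with probability 0.5, and asks how often the mutation would appear in at most 1 read (easy to mistake for a sequencing error). The function name is illustrative.

```python
from math import comb


def prob_alt_reads_at_most(k, depth, alt_fraction=0.5):
    """P(k or fewer of `depth` reads show the mutated allele), binomial model."""
    return sum(
        comb(depth, i) * alt_fraction**i * (1 - alt_fraction) ** (depth - i)
        for i in range(k + 1)
    )


print(prob_alt_reads_at_most(1, 5))   # 0.1875
print(prob_alt_reads_at_most(1, 30))  # about 3 in 100 million
```

At 5× coverage, nearly 1 in 5 real heterozygous variants would show up in only 0 or 1 reads; at 30× that essentially never happens.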

18.3.9 Key Takeaways on Coverage

Coverage = average number of times each base is sequenced
30× is standard for clinical sequencing
Higher coverage = more confident, but more expensive
Coverage is uneven across the genome
Multiple reads allow error correction through voting
Different applications need different coverage levels

Think of coverage like:

Insurance: More coverage = more protection, but costs more!
Witnesses: More witnesses = more confident about the truth!

18.4 Human Genome Sequencing: Then vs. Now

18.4.1 The Amazing Progress

Feature	2003 (HGP)	2025 (Today)
Cost	$3 billion	<$1,000
Time	13 years	1-2 days
Method	Sanger sequencing	NGS (Illumina)
Accuracy	99.99%	99.9%
Availability	Research only	Clinical & consumer

That’s a reduction of:

Cost: 3 million-fold!
Time: more than 2,000-fold!

It’s like going from the price of a house to the price of a coffee!
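
The fold reductions follow directly from the figures in the table. A quick check, taking the cost today as $1,000 and counting 13 years as 13 × 365 days:

```python
def fold_change(before, after):
    """How many times smaller `after` is than `before`."""
    return before / after


hgp_days = 13 * 365  # 13 years expressed in days

print(fold_change(3_000_000_000, 1_000))  # cost: 3000000.0
print(fold_change(hgp_days, 2))           # time at 2 days: 2372.5
print(fold_change(hgp_days, 1))           # time at 1 day:  4745.0
```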

18.4.2 What This Enables

1. Personalized Medicine:

Sequence your genome to find disease risks
Choose treatments based on your genetics
Predict drug responses

2. Cancer Genomics:

Sequence tumor DNA
Find specific mutations
Target treatments to specific mutations
Monitor treatment response

3. Rare Disease Diagnosis:

Sequence patients with unknown diseases
Find causative mutations
Enable treatment or management

4. Prenatal Testing:

Non-invasive prenatal testing (NIPT)
Detect genetic disorders before birth
From mother’s blood (no risk to baby!)

5. Consumer Genomics:

23andMe, Ancestry.com (these mostly use genotyping arrays rather than full genome sequencing)
Learn about ancestry
Find genetic relatives
Some health information

18.5 Third-Generation Sequencing

18.5.1 Long-Read Technologies

Problem with NGS (2nd gen):

Short reads (100-300 bp)
Hard to assemble repetitive regions
Miss large structural variations

Solution: Third-generation sequencing!

18.5.2 PacBio (Pacific Biosciences)

Key feature: Very long reads (10,000-100,000+ bp)

How it works:

Single molecule sequencing
Watch DNA polymerase add nucleotides in real-time
Each incorporated nucleotide gives off a flash of fluorescent light, with a different color for each base
Incredible! Like watching a molecular movie!

Advantages:

Long reads span difficult regions
Can sequence through repetitive DNA
Detects DNA modifications directly

Disadvantages:

Higher raw error rate (but getting better: HiFi consensus reads now exceed 99% accuracy!)
More expensive
Lower throughput

18.5.3 Oxford Nanopore

Key feature: Extremely long reads (up to 2 million bp!)

How it works:

The revolutionary principle: Instead of detecting fluorescence, Oxford Nanopore measures changes in electrical current!

Step-by-step:

The nanopore: A tiny protein pore embedded in a membrane
- Think of it like a donut hole that DNA passes through
- Maintains an electrical current flowing through it
DNA threading: Single-stranded DNA is pulled through the pore
- Motor protein controls speed
- DNA passes through one base at a time
Current disruption: The bases (A, T, G, C) inside the pore block the current differently
- Each short stretch of bases in the pore (a k-mer) produces a characteristic electrical signal
- Computer records current changes in real-time
- Pattern of disruptions reveals the sequence!
No amplification needed: Sequences single DNA molecules directly
- No PCR required (unlike Illumina!)
- Avoids PCR errors and biases
- Detects DNA modifications directly (methylation, etc.)

Key advantages of no PCR:

Faster workflow (skip amplification step)
Fewer artifacts (no PCR-introduced errors or duplicate reads)
Detects epigenetic modifications
Lower sample requirements

Device features:

Portable: USB stick size (MinION device)
Real-time sequencing: Watch sequences appear live!
Direct RNA sequencing: Can sequence RNA without converting to DNA first

Advantages:

Longest reads available: Regular reads >10 kb, ultra-long reads >100 kb, record >2 Mb!
Portable: Used in field research, remote locations, even the International Space Station!
Real-time results: No need to wait for run to finish
Direct RNA sequencing: Preserves modifications, no reverse transcription needed
No amplification bias: Sequences original molecules
Epigenetic detection: Detects methylation and other modifications directly

Disadvantages:

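Higher error rate: Historically ~92-95% raw read accuracy (vs. 99%+ for Illumina)
- Improving rapidly: recent chemistries and basecalling algorithms report raw accuracy of about 99%
- Errors are mostly random, though homopolymer runs remain a systematic weak spot
- Random errors can be corrected with higher coverage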
More expensive per base: Still more costly than Illumina for high coverage
Newer technology: Still improving (algorithms, chemistry, accuracy)

Applications where Nanopore excels:

Structural variants: Long reads easily span deletions, insertions, inversions
Repetitive regions: Can read through repeats that confuse short reads
Transcript isoforms: Full-length RNA sequencing reveals splice variants
Metagenomics: Long reads better for species identification
Rapid diagnostics: Real-time results for outbreak response
Field work: Portable sequencing where labs don’t exist!

18.5.4 Why Long Reads Matter

1. Complete Genome Assembly:

2022: First truly complete human genome
Used long reads to fill gaps
Repetitive regions finally sequenced!

2. Structural Variants:

Large deletions, insertions, inversions
Hard to detect with short reads
Easy with long reads

3. Phasing:

Determine which variants are on the same copy of a chromosome (maternal or paternal)
Important for understanding genetics

4. Repeat Regions:

Centromeres, telomeres
Highly repetitive
Spanned by long reads

18.6 Applications of Sequencing

18.6.1 Research

Genomics:

Sequence genomes of any organism
Understand evolution
Discover new species

Transcriptomics (RNA-seq):

Sequence all RNA in cells
Measure gene expression
Find new transcripts

Metagenomics:

Sequence all DNA in an environment
Discover microbes in gut, ocean, soil
Don’t need to culture them!

Epigenomics:

Some methods detect DNA methylation
Map epigenetic modifications

18.6.2 Medicine

Diagnosis:

Identify genetic diseases
Diagnose infections (sequence pathogen)
Cancer genomics

Pharmacogenomics:

Predict drug responses
Avoid adverse reactions
Personalize medication

Liquid Biopsies:

Sequence cell-free DNA in blood
Early cancer detection
Monitor treatment

18.6.3 Agriculture

Crop Improvement:

Sequence crop genomes
Find beneficial genes
Speed up breeding

Livestock:

Breed healthier animals
Understand genetics
Disease resistance

18.6.4 Forensics

Criminal Justice:

DNA fingerprinting
Identify suspects or victims
Solve cold cases

Paternity Testing:

Determine biological relationships

18.6.5 Conservation

Endangered Species:

Sequence threatened species
Preserve genetic diversity
Plan breeding programs

Ancient DNA:

Sequence extinct species (mammoths, Neanderthals!)
Understand evolution
Learn about history

18.7 The Future of Sequencing

18.7.1 Emerging Technologies

Real-Time Single-Molecule Sequencing:

Watch individual DNA molecules being read
No amplification needed
Detect modifications directly

In Situ Sequencing:

Sequence DNA/RNA inside intact cells and tissues
See spatial organization
Preserve 3D context

$100 Genome:

Even cheaper than today
Routine for everyone
Preventive medicine

18.7.2 Challenges Ahead

Data Analysis:

Generating data is easy now
Interpreting data is hard!
Need better algorithms and AI

Ethical Issues:

Privacy of genetic information
Insurance discrimination
Incidental findings (finding diseases you weren’t looking for)

Equity:

Making sequencing available to everyone
Not just wealthy countries/people
Global health applications

18.8 Key Takeaways

DNA sequencing = Reading the order of A, T, G, C in DNA
Sanger sequencing (1977):
- First method
- Accurate but slow and expensive
- Used for Human Genome Project
Next-Generation Sequencing (NGS):
- Massively parallel (millions of sequences at once)
- Fast, cheap, high-throughput
- Revolutionized genomics!
Progress: $3 billion/13 years → <$1,000/1-2 days
Third-generation sequencing:
- Long reads (up to millions of base pairs!)
- PacBio and Oxford Nanopore
- Completed the human genome
Applications: Medicine, research, agriculture, forensics, conservation
Future: Even cheaper, faster, better analysis, ethical considerations

Sources: Information adapted from Illumina, NHGRI, Technology Networks, and sequencing technology literature (Bentley et al. 2008; Shendure and Ji 2008; Mardis 2017).

Bentley, David R., Shankar Balasubramanian, Harold P. Swerdlow, Geoffrey P. Smith, John Milton, Clive G. Brown, Kevin P. Hall, et al. 2008. “Accurate Whole Human Genome Sequencing Using Reversible Terminator Chemistry.” Nature 456 (7218): 53–59.

Mardis, Elaine R. 2017. “DNA Sequencing Technologies: 2006–2016.” Nature Protocols 12 (2): 213–18.

Sanger, Frederick, Steven Nicklen, and Alan R. Coulson. 1977. “DNA Sequencing with Chain-Terminating Inhibitors.” Proceedings of the National Academy of Sciences 74 (12): 5463–67.

Shendure, Jay, and Hanlee Ji. 2008. “Next-Generation DNA Sequencing.” Nature Biotechnology 26 (10): 1135–45.