18  Sequencing Technologies

18.1 Reading the Book of Life

18.1.1 What Is DNA Sequencing?

DNA sequencing = Reading the order of A, T, G, and C letters in DNA

Think of it like:

  • Reading the text in a book, letter by letter

  • Decoding a secret message

  • Reading the instruction manual of life!

18.2 From Sanger to Next-Generation Sequencing

18.2.1 The First Method: Sanger Sequencing (1977)

Developed by: Frederick Sanger (won his 2nd Nobel Prize for this!) (Sanger, Nicklen, and Coulson 1977)

How it works (simplified):

  1. Start with DNA template

  2. Add special labeled nucleotides that STOP copying when incorporated

  3. Create DNA fragments of different lengths

  4. Separate fragments by size

  5. Read the sequence from the pattern!

Characteristics:

  • Accurate: 99.99% accuracy!

  • Slow: One sequence at a time

  • Expensive: Cost millions for whole genomes

  • Read length: Up to ~1,000 base pairs per read

Impact:

  • Used for the Human Genome Project!

  • Still used today for small-scale sequencing

  • Gold standard for accuracy

The Human Genome Project timeline:

  • Started: 1990

  • Finished: 2003

  • Cost: $3 billion

  • Time: 13 years

  • Method: Sanger sequencing

That’s about $1 per base pair (roughly 3 billion bases for $3 billion)!

18.2.2 Sanger Sequencing: The Detailed Mechanism

18.2.2.1 The Biological Foundation

Sanger sequencing is based on semi-conservative DNA replication - the same process cells use to copy DNA.

What is semi-conservative replication?

Think of DNA like a zipper with two sides:

  • One side stays (the template)

  • One new side is built (the copy)

  • The old side guides making the new side

Key facts:

  • DNA replication needs a template strand

  • One strand serves as the template (parent)

  • A new strand is built matching the template (daughter)

  • Watson and Crick predicted this in 1953

  • Meselson-Stahl proved it in 1958

18.2.2.2 Why DNA Polymerase Needs a Primer

This is an important question! Here’s the simple answer:

The problem:

  • RNA polymerase → Can start making RNA from nothing

  • DNA polymerase → Cannot start from nothing, needs a primer first

Why the difference?

Think of it like starting a zipper:

  • RNA polymerase is like a self-starting zipper

  • DNA polymerase is like a zipper that needs the first tooth in place

Three reasons DNA polymerase needs a primer:

Reason 1: Error checking

  • DNA polymerase checks each base it adds (proofreading)

  • It needs something to hold onto to check properly

  • Like needing a ruler to measure accurately

  • RNA polymerase doesn’t check as carefully (RNA is temporary anyway)

Reason 2: Chemical differences

  • RNA polymerase uses NTPs (have an extra -OH group at the 2’ position)

  • DNA polymerase uses dNTPs (missing that 2’-OH group)

  • That extra -OH helps start from nothing

  • DNA polymerase lost this ability during evolution

Reason 3: Accuracy matters

  • DNA is permanent → needs to be perfect

  • Primer gives DNA polymerase a stable starting point

  • Like building a house - you need a foundation first

  • This reduces copying errors

Why this matters:

  • PCR → Needs primers to work

  • Sanger sequencing → Needs primers to work

  • Cell DNA replication → Uses RNA primers (made by primase enzyme)

18.2.3 The Chemistry of Sanger Sequencing

18.2.3.1 The Special Ingredients

Sanger sequencing uses TWO types of building blocks:

1. Normal dNTPs (regular DNA building blocks):

  • What they are: dATP, dTTP, dGTP, dCTP

  • What they do: Build DNA normally

  • Key feature: Have a 3’-OH group

  • Result: DNA chain keeps growing

2. Special ddNTPs (chain terminators):

  • What they are: ddATP, ddTTP, ddGTP, ddCTP

  • What they do: STOP DNA growth when added

  • Key feature: MISSING the 3’-OH group

  • Result: DNA chain stops immediately

  • Bonus: Each is labeled with a different fluorescent color

    • ddA = Green

    • ddT = Red

    • ddG = Yellow

    • ddC = Blue

18.2.3.2 The Simple Difference

Normal dNTP (keeps going):

Has 3’-OH group → Next base can attach → Chain grows

Special ddNTP (stops):

NO 3’-OH group → Next base CANNOT attach → Chain stops

Think of it like:

- Normal dNTP = Lego brick with connectors on top

- Special ddNTP = Lego brick with flat top (nothing can attach)

#### How Sanger Sequencing Works (Step by Step)

Let's sequence this DNA template, written 3'→5' so that the new strand reads 5'→3' from left to right: `3'-TACGGCATGCTA-5'`

**Step 1: Add primer**

The primer pairs with the template and gives DNA polymerase a free 3' end to extend:
Template DNA: 3'---TACGGCATGCTA---5'
Primer:       5'---ATGC-3'
                       ↑
              Start here! (next base goes onto the primer's 3' end)

**Step 2: Add the mixture**

We add:

- **LOTS** of normal dNTPs (A, T, G, C)

- **A FEW** special ddNTPs (ddA, ddT, ddG, ddC) with colors

**Step 3: DNA polymerase starts copying**

Sometimes it adds a normal base → keeps going
Sometimes it adds a special colored base → STOPS

This creates fragments of different lengths (positions are counted from the 5' end of the primer):
Fragment 1: 5'-ATGCddC-3'         (Stopped at position 5, BLUE)
Fragment 2: 5'-ATGCCddG-3'        (Stopped at position 6, YELLOW)
Fragment 3: 5'-ATGCCGddT-3'       (Stopped at position 7, RED)
Fragment 4: 5'-ATGCCGTddA-3'      (Stopped at position 8, GREEN)
Fragment 5: 5'-ATGCCGTAddC-3'     (Stopped at position 9, BLUE)
Fragment 6: 5'-ATGCCGTACddG-3'    (Stopped at position 10, YELLOW)
Fragment 7: 5'-ATGCCGTACGddA-3'   (Stopped at position 11, GREEN)
Fragment 8: 5'-ATGCCGTACGAddT-3'  (Stopped at position 12, RED)

Each fragment ends with a colored base at its 3' end!

**Step 4: Separate by size**

- Pull fragments through a thin, gel-filled tube (a capillary) with an electric field

- Small fragments run fast → detected first

- Large fragments run slow → detected last

- Like a race where smaller runners are faster

![Sanger Sequencing Method](images/ch15/sanger-sequencing.svg)

**Figure 15.1**: Sanger sequencing chain termination method showing how ddNTPs terminate DNA synthesis at different positions, creating fragments of varying lengths that reveal the DNA sequence.

*Image credit: Estevezj, Wikimedia Commons, CC BY-SA 3.0*

**Step 5: Read the colors**

Camera sees colors in order:
Position 5:  BLUE   → C was added
Position 6:  YELLOW → G was added
Position 7:  RED    → T was added
Position 8:  GREEN  → A was added
Position 9:  BLUE   → C was added
Position 10: YELLOW → G was added
Position 11: GREEN  → A was added
Position 12: RED    → T was added

New strand after the primer = CGTACGAT

This is the complement of the template (3'-GCATGCTA-5'), so the template sequence follows directly from base pairing.

Done! We just read the DNA sequence!
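
The chain-termination idea can be tried out in a few lines of Python. This is only a toy sketch, not how sequencing software works: the function name `simulate_sanger`, the number of molecules, and the ddNTP fraction of 0.2 (far higher than in a real reaction, to keep the example small) are all assumptions made for illustration. The template is given 3'→5' so that the new strand reads 5'→3' from left to right.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
DYE = {"A": "GREEN", "T": "RED", "G": "YELLOW", "C": "BLUE"}  # colors used in this chapter

def simulate_sanger(template_3to5, primer_len, n_molecules=5000, dd_fraction=0.2, seed=0):
    """Return (fragment_length, base, color) tuples sorted by fragment length."""
    rng = random.Random(seed)
    # New strand written 5'->3'; position i pairs with template position i.
    new_strand = "".join(COMPLEMENT[b] for b in template_3to5)
    terminal = {}  # fragment length -> base carried by the terminating ddNTP
    for _ in range(n_molecules):
        for pos in range(primer_len, len(new_strand)):
            if rng.random() < dd_fraction:  # a ddNTP went in: no 3'-OH, chain stops
                terminal[pos + 1] = new_strand[pos]
                break
    # Electrophoresis: the shortest fragments reach the detector first.
    return [(n, terminal[n], DYE[terminal[n]]) for n in sorted(terminal)]

trace = simulate_sanger("TACGGCATGCTA", primer_len=4)
read = "".join(base for _, base, _ in trace)
print(read)  # CGTACGAT
```

Sorting by fragment length plays the role of electrophoresis, and reading the last base of each fragment plays the role of the camera.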

### The Original Sanger Method: Four Tubes

**Historical note**: The method we just described is the MODERN version (1-tube with 4 colors). The original Sanger method (1977) was different!

**How the original method worked**:

Instead of one tube with all four ddNTPs (each a different color), Sanger used **FOUR SEPARATE TUBES**:

- **Tube 1**: dNTPs + ddATP (stops at A)
- **Tube 2**: dNTPs + ddTTP (stops at T)
- **Tube 3**: dNTPs + ddGTP (stops at G)
- **Tube 4**: dNTPs + ddCTP (stops at C)

Think of it like:

- Modern method = one race with 4 colored jerseys
- Original method = 4 separate races!

**The original labeling**:

- Used **radioactive labels** (³²P or ³⁵S), not fluorescent colors
- All fragments in one tube were the same "color" (radioactive)
- No fancy colors - just radioactive signal!

**Reading the results**:

1. Run each tube on separate lanes of a gel
2. Use **gel electrophoresis** (fragments separate by size)
3. Expose gel to X-ray film (autoradiography)
4. Dark bands appear where radioactive fragments are
5. Read sequence by comparing the 4 lanes!

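**Example reading**:

| Fragment size | A lane | T lane | G lane | C lane | Base read |
|---------------|--------|--------|--------|--------|-----------|
| Smallest (first base) | - | - | band | - | G |
| Next | band | - | - | - | A |
| Next | - | band | - | - | T |
| Next | - | - | - | band | C |
| Next | - | - | band | - | G |
| Largest (last base) | - | band | - | - | T |

Sequence = GATCGT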
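
Reading a four-lane gel is really just sorting bands by size and noting which lane each band sits in. A minimal sketch, assuming each lane is represented as a list of fragment sizes (the `read_gel` function and the `lanes` dictionary are made up for this example):

```python
def read_gel(lanes):
    """lanes maps a base ('A', 'T', 'G', 'C') to the fragment sizes seen in that lane."""
    bands = [(size, base) for base, sizes in lanes.items() for size in sizes]
    # Smallest fragment = first base, largest fragment = last base.
    return "".join(base for _, base in sorted(bands))

# Same band pattern as the example reading: sizes are relative positions 1-6.
lanes = {"A": [2], "T": [3, 6], "G": [1, 5], "C": [4]}
print(read_gel(lanes))  # GATCGT
```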


**Why this was challenging**:

- Needed 4 separate reactions (4× more work!)
- Manual reading of gel bands (tedious, error-prone)
- Radioactive materials (safety concerns)
- Limited to ~300-400 bases per gel
- Took hours to days per sequence

**The improvement to modern method**:

- **1986-1987**: Fluorescent dyes instead of radioactivity
  - Safer (no radiation)
  - Four colors = one tube instead of four!
  - Automated detection with lasers

- **1990s**: Capillary electrophoresis instead of gel
  - Faster (minutes instead of hours)
  - Automated (no manual reading)
  - Longer reads (up to 1,000 bp)
  - High-throughput (96 samples simultaneously)

**Why learn about the old method?**

- Helps understand the principle (ddNTPs stop synthesis)
- Helps you appreciate modern automation
- Many papers from 1977-2005 use this method
- Nobel Prize-winning technique!

### Modern Sanger Sequencing Protocol

**Workflow**:

**Step 1: PCR Amplification** (optional but common):

- Amplify target region

- Purify PCR product

- Ensures enough template

**Step 2: Sequencing Reaction**:

- Mix template DNA + sequencing primer

- Add DNA polymerase (thermostable, e.g., Taq)

- Add mix of dNTPs and fluorescent-labeled ddNTPs

- Thermal cycling (like PCR, but linear amplification)

**Typical mix ratio**:

- dNTPs:ddNTPs = ~100:1

- Ensures most reactions continue

- But some terminate at each position
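
To see why the ratio matters, here is a rough back-of-the-envelope model. It assumes (as a simplification) that the polymerase picks a ddNTP purely in proportion to its concentration, so each position stops with probability 1/(ratio + 1). Real polymerases and dyes do not behave this evenly, and the function names are illustrative only.

```python
def termination_probability(n, ratio=100.0):
    """Chance that a chain stops exactly at the n-th added base (geometric model)."""
    p_stop = 1.0 / (ratio + 1.0)
    return (1.0 - p_stop) ** (n - 1) * p_stop

def fraction_still_growing(n, ratio=100.0):
    """Chance that a chain has NOT stopped after n added bases."""
    p_stop = 1.0 / (ratio + 1.0)
    return (1.0 - p_stop) ** n

for n in (1, 100, 500, 1000):
    print(n, termination_probability(n), fraction_still_growing(n))
```

Under this model, too many ddNTPs would stop almost every chain early (only short reads), while too few would leave most chains unlabeled.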

**Step 3: Cleanup**:

- Remove excess ddNTPs

- Remove salts (interfere with electrophoresis)

- Use spin columns or magnetic beads

**Step 4: Capillary Electrophoresis**:

- Inject sample into capillary

- Apply electric field

- Fragments separate by size

- Laser excites fluorophores

- Camera detects emission

**Step 5: Data Analysis**:

- Software converts peaks to sequence

- Quality scores assigned to each base

- Generate chromatogram (peak visualization)

### Reading a Sanger Chromatogram

**What you see**:
      A    G    C    T    A    G    C    T
      |    |    |    |    |    |    |    |
     Peak Peak Peak Peak Peak Peak Peak Peak

**Quality indicators**:

**Good quality**:

- Sharp, well-separated peaks

- Single color at each position

- Uniform peak height

- High signal-to-noise ratio

**Poor quality**:

- Overlapping peaks (multiple colors)

- Broad peaks

- Low peak height

- Often at start (primer) or end (long reads) of sequence

**Phred quality score**:

- Q20 = 99% accuracy (1 error in 100 bases)

- Q30 = 99.9% accuracy (1 error in 1,000 bases)

- Most modern Sanger reads reach Q30 or better over the first 700-800 bp
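
The Phred score is defined as Q = -10 × log10(P), where P is the estimated probability that the base call is wrong. A short sketch of the conversion in both directions (the function names are chosen for this example):

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)

for q in (20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.2f}%")
```

Every 10 points on the Phred scale means a 10-fold drop in the error probability.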

### Semi-Conservative Replication Revisited

**Meselson-Stahl Experiment (1958)** proved semi-conservative replication:

**Experiment**:

1. Grow E. coli in heavy nitrogen (¹⁵N) medium

2. DNA becomes "heavy" (labeled)

3. Switch to normal nitrogen (¹⁴N) medium

4. After one replication: DNA is "hybrid" (one heavy, one light strand)

5. After two replications: Half hybrid, half light
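
The expected band pattern can be worked out by simple bookkeeping. The sketch below assumes ideal semi-conservative replication with every duplex copied once per generation; the function name and the "heavy/hybrid/light" labels are chosen for illustration.

```python
def meselson_stahl(generations):
    """Fractions of heavy, hybrid and light duplexes after growth in 14N medium."""
    duplexes = {"heavy": 1, "hybrid": 0, "light": 0}
    for _ in range(generations):
        new = {"heavy": 0, "hybrid": 0, "light": 0}
        # Heavy duplex: each 15N strand pairs with a new 14N strand -> 2 hybrids.
        new["hybrid"] += 2 * duplexes["heavy"]
        # Hybrid duplex: the 15N strand gives a hybrid, the 14N strand gives a light duplex.
        new["hybrid"] += duplexes["hybrid"]
        new["light"] += duplexes["hybrid"]
        # Light duplex: both daughters are light.
        new["light"] += 2 * duplexes["light"]
        duplexes = new
    total = sum(duplexes.values())
    return {kind: count / total for kind, count in duplexes.items()}

for g in range(4):
    print(g, meselson_stahl(g))
```

Generation 1 comes out all hybrid and generation 2 half hybrid, half light, which matches the bands Meselson and Stahl observed.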

**Result**: Proved DNA replicates semi-conservatively

- One template strand conserved

- One new strand synthesized

**This principle underlies**:

- All DNA replication in cells

- PCR amplification

- Sanger sequencing

- All DNA synthesis!

**Connection to Sanger sequencing**:

- Template strand = original DNA

- Newly synthesized strand = sequence read

- ddNTPs terminate synthesis at random positions

- Pattern reveals template sequence

### Why Semi-Conservative Replication Matters for Sequencing

**Advantages**:

1. **Accuracy**: Template strand provides perfect guide

2. **Complementarity**: A pairs with T, G pairs with C

3. **Proofreading**: DNA polymerase checks each base

4. **Reliability**: Same template gives same sequence

**Fidelity**:

- DNA polymerase error rate: ~1 in 10⁷ (with proofreading)

- Sanger sequencing error rate: ~1 in 10⁴ (mostly from signal and base-calling noise, not from polymerase mistakes)

- Still extremely accurate!

### Sanger Sequencing Applications Today

**Despite NGS dominance, Sanger is still used for**:

**1. Variant Validation**:

- NGS finds potential mutation

- Sanger confirms it

- Gold standard for clinical diagnostics

**2. Small-Scale Projects**:

- Sequencing few genes

- Checking clones

- Verifying plasmids

- More cost-effective than NGS setup

**3. Long Reads**:

- Up to 1,000 bp in single read

- Short-read NGS reads are much shorter (typically 150-300 bp)

- Useful for spanning repetitive regions

**4. Low-Throughput Needs**:

- 1-96 samples at a time

- Don't need millions of reads

- Academic labs, clinical labs

### Limitations of Sanger Sequencing

**Compared to NGS**:

- **Throughput**: Only 96 samples/run vs. millions in NGS

- **Cost**: $5-10 per reaction vs. $0.01 per Mb for NGS

- **Speed**: Hours per sample vs. days for whole genome (but millions of sequences!)

- **Scalability**: Cannot do whole genomes economically

**When NOT to use Sanger**:

- Whole genome sequencing

- RNA-seq experiments

- ChIP-seq experiments

- Metagenomics

- Any high-throughput application

### Historical Impact

**Before Sanger sequencing** (pre-1977):

- No way to read DNA sequence directly

- Relied on protein sequencing (slower, harder)

- Genetic code deciphered with difficulty

**After Sanger sequencing** (1977-2005):

- First genes sequenced

- First genomes sequenced (viruses, bacteria)

- **Human Genome Project possible**!

- Molecular biology revolution

- Medical genetics born

**Key milestones using Sanger**:

- 1977: First complete DNA genome sequenced (bacteriophage φX174, about 5,400 bp)

- 1995: First bacterial genome (H. influenzae, 1.8 Mb)

- 1996: First eukaryotic genome (yeast, 12 Mb)

- 2001: Human genome draft (3 Gb!)

- 2003: Human genome complete

**Nobel Prizes related to sequencing**:

- 1980: Sanger - DNA sequencing method

- 1993: Mullis - PCR (enables sequencing)

- 2020: Charpentier & Doudna - CRISPR (genome editing, relies on sequencing)

### Comparison Table: Sanger vs. NGS

| Feature | Sanger Sequencing | Next-Generation Sequencing |
|---------|-------------------|----------------------------|
| **Read length** | 700-1,000 bp | 100-300 bp (Illumina) |
| **Accuracy** | 99.9-99.99% (Q30-Q40) | 99%+ (depends on platform) |
| **Throughput** | 1-96 sequences/run | Millions to billions/run |
| **Cost per Mb** | $500-1,000 | $0.01-0.10 |
| **Time** | Hours | Days (for whole genome) |
| **Applications** | Validation, cloning, small projects | WGS, RNA-seq, variant discovery |
| **Best for** | Low throughput, long reads | High throughput, discovery |

### The Revolution: Next-Generation Sequencing (NGS)

**What changed**: In the mid-2000s, new technologies emerged!

**NGS characteristics**:

- **Massively parallel**: Sequence millions of fragments simultaneously!

- **Fast**: Whole genome in days instead of years

- **Cheap**: Now costs less than $1,000 per genome

- **High-throughput**: Generate billions of base pairs of data

**Key difference from Sanger**:

- Sanger: Read ONE sequence at a time (serial)

- NGS: Read MILLIONS of sequences at once (parallel)

Think of it like:

- **Sanger**: Reading one book

- **NGS**: Reading an entire library simultaneously!

### How NGS Works (General Process)

**1. Library Preparation**:

- Break DNA into small fragments

- Add special "adapters" to fragment ends

- Like putting address labels on letters

**2. Amplification**:

- Make many copies of each fragment

- Like photocopying letters many times

**3. Sequencing**:

- Fragments attach to a surface

- Nucleotides added one at a time

- Each nucleotide emits light (different colors for A, T, G, C)

- Camera records the lights

- Computer determines sequence!

**4. Data Analysis**:

- Assemble millions of short sequences

- Like putting together a jigsaw puzzle

- Compare to reference genome

- Identify differences

### NGS Platforms

**Illumina** (most popular):

- Short reads (100-300 bp)

- Very accurate

- Most widely used

- Relatively cheap

**Ion Torrent**:

- Detects pH changes (not light)

- Faster

- Good for targeted sequencing

**Others**: Many platforms exist, each with trade-offs!

### The Evolution of Read Lengths

**Read length** = How many bases we can sequence in one continuous read

Think of it like:

- **Short reads** = reading one sentence at a time
- **Long reads** = reading entire paragraphs or pages!

**Historical progression of NGS read lengths**:

**Early days (2005-2008)**:

- **25-35 bp** reads
- Illumina Genome Analyzer
- Very short! Like reading only 2-3 words at a time
- Hard to assemble genomes
- But: massively parallel, millions of reads!

**Incremental gains (2008-2012)**:

- **50-75 bp** reads (single-end)
- **2 × 50 bp** (paired-end) = 100 bp total information
- Better, but still challenging
- Could map to reference genomes
- Assembly still difficult

**Maturing chemistry (2012-2015)**:

- **100-150 bp** reads
- **2 × 100 bp** or **2 × 150 bp** (paired-end)
- Big improvement!
- Good for most applications
- Became the standard

**Modern era (2015-present)**:

- **150-300 bp** reads (standard)
- **2 × 150 bp** or **2 × 250 bp** common
- Some platforms: **2 × 300 bp**!
- Long enough for most needs
- Good balance of length, accuracy, and cost

**Longest Illumina reads today**:

- MiSeq: Up to **2 × 300 bp** (600 bp total)
- NovaSeq: Up to **2 × 250 bp** (500 bp total)
- NextSeq: **2 × 150 bp** (300 bp total)

### Why Read Length Matters

**Longer reads are better for**:

1. **Genome assembly**:
   - Longer reads span more of the genome
   - Easier to put pieces together
   - Like having larger puzzle pieces

2. **Repetitive regions**:
   - Genomes have many repeats (same sequence multiple times)
   - Short reads get "lost" in repeats
   - Longer reads can span repeats
   - Like reading "AAAAAAA..." - need to read the whole thing to know how many As!

3. **Structural variants**:
   - Large deletions, insertions, inversions
   - Need to span the variant
   - Short reads might miss it entirely

4. **Isoform detection** (RNA-seq):
   - Full-length transcripts are thousands of bases long
   - Longer reads capture more of each isoform
   - Better identification of splice variants

**Shorter reads are better for**:

1. **Accuracy**:
   - Short reads have fewer errors
   - Quality degrades toward the end of reads
   - 50 bp at Q30 is better than 300 bp at Q20

2. **Cost**:
   - Shorter reads = cheaper per base
   - More reads per run
   - Better for high-throughput applications

3. **Speed**:
   - Shorter reads sequence faster
   - Fewer cycles = faster run time

### The Trade-Offs

| Read Length | Accuracy | Cost | Speed | Best For |
|-------------|----------|------|-------|----------|
| **50-75 bp** | Highest | Cheapest | Fastest | RNA-seq, ChIP-seq, targeted sequencing |
| **100-150 bp** | High | Moderate | Moderate | Standard WGS, exomes, most applications |
| **250-300 bp** | Good | Higher | Slower | Assembly, amplicon sequencing, metagenomics |

**Most common today**: **2 × 150 bp** (paired-end, 300 bp total information)

- Good balance of everything
- Cheap enough for routine use
- Long enough for most applications
- Standard for clinical sequencing

### Why We Couldn't Make Long Reads Earlier

**Technical challenges**:

1. **Reversible terminators don't work well for long reads**:
   - Each cycle adds chemicals that must be removed
   - After ~300 cycles, too much residue builds up
   - Signals get weaker and noisier
   - Like photocopying a photocopy 300 times - gets fuzzy!

2. **Chemistry degradation**:
   - DNA polymerase efficiency decreases
   - Fluorescent dyes accumulate
   - Clusters get dimmer
   - Error rate increases

3. **Phasing problems**:
   - In a cluster of 1,000 identical molecules
   - All should be at the same position
   - But some lag behind or jump ahead (phasing errors)
   - After 200-300 cycles, too much variation
   - Signal becomes mixed and unclear
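
A very rough way to picture phasing: if every molecule has a small chance per cycle of falling behind or jumping ahead, the share of the cluster that is still "in step" shrinks geometrically. The per-cycle rates below are hypothetical round numbers, not measured values for any instrument, and the model ignores molecules that drift back into step.

```python
def in_phase_fraction(cycles, p_lag=0.002, p_lead=0.002):
    """Approximate fraction of a cluster still synchronized after `cycles` cycles."""
    return (1.0 - p_lag - p_lead) ** cycles

for cycles in (50, 150, 300):
    print(cycles, round(in_phase_fraction(cycles), 3))
```

Even tiny per-cycle slip rates add up over hundreds of cycles, which is why signal quality drops toward the end of long reads.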

**Recent improvements**:

- **Better chemistry**: More stable reagents
- **Better enzymes**: DNA polymerase that works longer
- **Better optics**: Brighter signals, better cameras
- **Better algorithms**: Correct for phasing errors
- **Result**: 300 bp reads now routine!

### Paired-End vs. Single-End

**Single-end sequencing**:

- Sequence only one end of fragment
- One read per fragment
- Example: 150 bp read

**Paired-end sequencing**:

- Sequence BOTH ends of fragment
- Two reads per fragment
- Example: 2 × 150 bp = two 150 bp reads from same fragment
- Know the distance between reads (~300-500 bp)

**Why paired-end is better**:

- **More information**: Two reads instead of one!
- **Better mapping**: Knowing both ends helps place the fragment
- **Detect structural variants**: If distance is wrong, something's inserted or deleted
- **Span gaps**: One end maps, helps place the other end
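
The "distance is wrong" idea can be sketched as a simple check on how far apart the two reads of a pair land on the reference genome. The expected fragment size (400 bp) and the tolerance (100 bp) are assumptions for this example; real variant callers estimate them from the data and use more careful statistics.

```python
def classify_pair(observed_distance, expected=400, tolerance=100):
    """Flag read pairs whose mapped distance differs from the library fragment size."""
    if observed_distance > expected + tolerance:
        # Reads land too far apart on the reference: the sample is missing DNA here.
        return "possible deletion"
    if observed_distance < expected - tolerance:
        # Reads land too close together: the sample carries extra DNA here.
        return "possible insertion"
    return "concordant"

for distance in (380, 1500, 120):
    print(distance, classify_pair(distance))
```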

**Standard today**: Almost always paired-end!

### Long-Read Technologies (3rd Generation)

**The limit of short-read technologies**:

- Illumina: Maximum ~2 × 300 bp (paired-end, 600 bp total)
- Even with improvements, can't go much longer
- Fundamental chemistry limitations

**Solution**: Completely different technologies!

**PacBio**:

- **10,000-100,000+ bp** reads
- Average: ~10-20 kb
- Longest: > 100 kb!
- Uses single-molecule real-time sequencing
- Different chemistry (not reversible terminators)

**Oxford Nanopore**:

- **Regular reads**: 10-50 kb
- **Ultra-long reads**: 100-200 kb
- **Record**: > 2 million bp!
- Uses electrical current (not fluorescence)
- Can sequence indefinitely until DNA breaks

**Trade-offs for long reads**:

- **Pros**: Much longer! Span repetitive regions, complete genomes
- **Cons**: Lower raw accuracy (~90-95% vs. 99%+), more expensive, lower throughput

**The future**: Combine short accurate reads (Illumina) with long reads (PacBio/Nanopore) for best of both!

## Illumina Sequencing: The Details

### Why Illumina Dominates NGS

**Illumina/Solexa sequencing** is the most widely used NGS technology today. It was developed by Solexa (acquired by Illumina in 2007) and has become the gold standard for high-throughput sequencing [@bentley2008accurate].

**Key advantages**:

- **Massively parallel**: Sequence millions of DNA fragments simultaneously
- **High accuracy**: 99%+ accuracy with quality scores
- **Cost-effective**: Cheapest per-base cost
- **Versatile**: Used for genome sequencing, RNA-seq, ChIP-seq, and more

### The Core Principle: Sequencing by Synthesis (SBS)

Illumina uses **sequencing by synthesis (SBS)** - it sequences DNA by watching DNA polymerase add nucleotides one at a time!

**The brilliant idea**:

- Use special fluorescent nucleotides
- Each base (A, T, G, C) glows a different color
- Watch which color appears → know which base was added
- Repeat for every position in the DNA!

Think of it like:

- DNA polymerase is building a LEGO tower
- Each LEGO brick glows a different color
- A camera takes a picture after each brick is added
- The color sequence reveals the DNA sequence!

### The Magic Ingredient: Reversible Terminator Nucleotides

**Normal dNTPs** (regular DNA building blocks):

- DNA polymerase can add them continuously
- Hard to control - might add multiple bases at once

**Illumina's reversible terminators**:

- **Modified dNTPs** with two special features:
  1. **Fluorescent dye** attached (different color for A, T, G, C)
  2. **Reversible terminator** group at 3' position

**How they work**:

1. DNA polymerase adds ONE modified nucleotide
2. The terminator BLOCKS the next base from being added
3. Camera captures the fluorescent signal
4. A chemical cleavage step REMOVES the fluorescent dye and terminator
5. Next cycle begins!

**The four colors**:

- **Adenine (A)**: Green fluorescence
- **Thymine (T)**: Red fluorescence
- **Guanine (G)**: Yellow fluorescence
- **Cytosine (C)**: Blue fluorescence

(Note: Exact colors vary by Illumina platform, but each base has a unique signal)

### Step-by-Step: How Illumina Sequencing Works

#### Step 1: Library Preparation

**Goal**: Prepare DNA fragments with special adapters

Library preparation is like preparing letters for mailing - you need to put addresses and return addresses on them so they get to the right place!

**Process**:

**1. Fragment DNA**: Break genomic DNA into small pieces (typically 200-500 bp)

**Methods to fragment DNA**:

**a) Enzymatic fragmentation** (using enzymes):

- **Restriction enzymes**: Cut DNA at specific sequences
  - Like molecular scissors that only cut at certain words
  - Problem: Cuts are not random, biased toward certain sequences
  - Less common for NGS

- **Tagmentation** (Nextera method):
  - Enzyme (transposase) cuts AND adds adapters simultaneously!
  - Very fast (minutes instead of hours)
  - Becoming very popular
  - Like a machine that stamps and addresses letters at once

- **Enzymatic shearing**: Enzymes that cut randomly
  - Gentler on the DNA and less sequence-biased than restriction enzymes
  - Like using scissors randomly on a string

**b) Physical fragmentation** (mechanical):

- **Sonication**: Use sound waves to break DNA
  - High-frequency sound vibrates DNA until it breaks
  - Like using a jackhammer on ice
  - Creates random fragments
  - Takes 5-30 minutes

- **Nebulization**: Force DNA through tiny holes
  - High pressure shears the DNA
  - Like pushing spaghetti through a screen
  - Also creates random fragments
  - Quick but uses more DNA

**Why fragment size matters**:

- **Too short** (< 100 bp): Hard to map uniquely to genome
- **Too long** (> 800 bp): Forms clusters poorly on the flow cell, harder to amplify
- **Just right** (200-500 bp): Perfect for Illumina!

**2. End repair**: Make fragment ends smooth and blunt

After fragmentation, DNA ends are ragged (like torn paper edges). We need to smooth them out!

- Remove overhangs (extra single-stranded bits)
- Fill in gaps
- Make all ends blunt (flat)
- Like filing down rough edges on wood

**Why this matters**: The next steps (A-tailing and adapter ligation) only work reliably on clean, blunt ends

**3. A-tailing**: Add single "A" nucleotide to 3' ends

- Adds one adenine (A) to each 3' end
- Prepares for adapter ligation: the adapters carry a matching single T overhang
- Like adding a hook to hang things on

**4. Add adapters**: Attach short DNA sequences to both ends

**What are adapters?**

Adapters are short DNA sequences (20-100 bp) that get attached to BOTH ends of every DNA fragment.

Think of adapters like:

- **Mailing addresses** on letters
- **Barcode labels** on packages
- **Handles** on suitcases

**Adapter structure**:

5’— [P5 adapter] —[Your DNA fragment]— [P7 adapter] —3’

- P5 adapter: binds to flow cell + sequencing primer
- P7 adapter: binds to flow cell + sequencing primer

What adapters contain:

  1. Flow cell binding sequences:
    • Let fragments stick to flow cell surface
    • Complementary to oligos on flow cell
    • Like Velcro hooks that match Velcro loops
  2. Sequencing primer binding sites:
    • Where sequencing primers attach
    • Start point for DNA polymerase
    • Like “Start here” signs
  3. Index/Barcode sequences (optional):
    • Unique tags (6-8 bp) to identify samples
    • Like apartment numbers on an address
    • Allows multiplexing!

5. Ligation: Glue adapters to DNA fragments

  • Use DNA ligase enzyme (molecular glue)
  • Creates covalent bonds
  • Very stable connection
  • Like super-gluing handles onto boxes

6. Size selection: Keep only fragments of desired size

After adapter ligation, we have fragments of many sizes. We need to keep only the ones we want!

Methods:

  • Gel electrophoresis: Separate by size, cut out desired band
  • Magnetic beads (SPRI beads): Most common today
    • Beads bind DNA based on fragment size
    • Add magnets to pull beads (with DNA) to side
    • Wash away unwanted fragments
    • Like using a magnet to sort iron filings by size

Typical target: 300-500 bp fragments (including adapters)

7. PCR amplification: Make many copies of each fragment

  • Need enough adapter-carrying DNA to load onto the flow cell
  • Use PCR primers that bind to adapters
  • Typically 8-12 PCR cycles
  • Too many cycles = bias and duplicates
  • Too few cycles = not enough DNA

Think of it like:

  • Photocopying important documents
  • Need enough copies to work with
  • But too many copies waste paper!

Quality control at each step:

  • After fragmentation: Check size distribution (should be 200-500 bp)
  • After adapter ligation: Check that adapters attached (size increases by ~120 bp)
  • After PCR: Check DNA concentration (need enough!) and size
  • Use: Bioanalyzer or TapeStation (measures DNA size electronically)

Why library prep matters:

  • Good library = high-quality sequencing data
  • Bad library = garbage data (garbage in, garbage out!)
  • Poor library prep is one of the most common reasons for poor sequencing data

18.2.4 Multiplexing: Sequencing Many Samples at Once

The problem: Illumina machines generate BILLIONS of reads per run. But you might only need MILLIONS for your sample. Wasteful!

The solution: Multiplexing = mix multiple samples in one sequencing run!

How it works:

  1. Sample 1: Add index “ATCACG” to adapters
  2. Sample 2: Add index “CGATGT” to adapters
  3. Sample 3: Add index “TTAGGC” to adapters
  4. Mix all three samples together
  5. Sequence them all at once!
  6. Computer reads the index and sorts reads by sample
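
Step 6 (sorting reads by index) can be sketched in Python as a dictionary lookup. This is a simplified illustration that uses the three example indexes above and exact matching only; the sample names, the `demultiplex` function, and the input format (pairs of index and read) are assumptions, and real demultiplexing software also tolerates sequencing errors in the index.

```python
INDEX_TO_SAMPLE = {"ATCACG": "sample1", "CGATGT": "sample2", "TTAGGC": "sample3"}

def demultiplex(reads):
    """Sort (index_sequence, read_sequence) pairs into per-sample bins."""
    bins = {name: [] for name in INDEX_TO_SAMPLE.values()}
    bins["undetermined"] = []  # reads whose index matches no sample
    for index_seq, read_seq in reads:
        sample = INDEX_TO_SAMPLE.get(index_seq, "undetermined")
        bins[sample].append(read_seq)
    return bins

# Made-up reads for illustration.
reads = [("ATCACG", "GGCTA"), ("TTAGGC", "ACGTT"), ("ATCACG", "TTGCA"), ("NNNNNN", "CCCCC")]
bins = demultiplex(reads)
print({name: len(seqs) for name, seqs in bins.items()})
```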

Think of it like:

  • Mailing 100 letters from different people in one mailbag
  • Each letter has a return address (index)
  • Post office sorts them by return address at destination!

Benefits:

  • Cost-effective: Share one expensive sequencing run
  • Efficient: Don’t waste billions of reads on one sample
  • Common: Can multiplex 12, 24, 96, or even 384 samples!

Real example:

  • NovaSeq generates 20 billion reads
  • You need only 100 million reads per genome
  • Can sequence 200 genomes in one run!
  • Cost: $10,000 run ÷ 200 samples = $50 per genome!
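
The arithmetic in this example is easy to redo for your own project. A small sketch using the round numbers above (the function names are made up; actual read counts and run prices vary by instrument and provider):

```python
def samples_per_run(reads_per_run, reads_per_sample):
    """How many samples fit into one run if reads are shared evenly."""
    return reads_per_run // reads_per_sample

def cost_per_sample(run_cost, n_samples):
    """Run cost split evenly across the multiplexed samples."""
    return run_cost / n_samples

n = samples_per_run(20_000_000_000, 100_000_000)
print(n, cost_per_sample(10_000, n))  # 200 50.0
```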

18.2.5 Vector-Based Sequencing (Older Method)

Historical note: Before modern library prep, scientists used vectors (bacterial plasmids):

Old method:

  1. Insert your DNA fragment into a plasmid (circular DNA)
  2. Transform bacteria with plasmid
  3. Grow bacteria (each makes many copies)
  4. Extract plasmid DNA
  5. Sequence using Sanger sequencing
  6. Primers bind to known vector sequences

Why vectors were used:

  • Amplified DNA (bacteria copy it millions of times)
  • Provided known primer binding sites
  • Cloned DNA for storage

Why we don’t use them for NGS:

  • Too slow (growing bacteria takes days)
  • Not needed (PCR amplifies DNA faster)
  • Can’t multiplex easily
  • Modern library prep is faster and better!

Still used for:

  • Sanger sequencing of cloned genes
  • Making DNA constructs for experiments
  • Long-term storage of DNA sequences

Why adapters matter:

  • Let fragments stick to flow cell surface (P5/P7 sequences)
  • Provide primer binding sites for sequencing
  • Add barcodes (for multiplexing - mixing multiple samples)
  • Enable bridge amplification (clustering on flow cell)

Library = collection of all DNA fragments with adapters attached

Think of library like:

  • All books in a library have call numbers (adapters)
  • Call numbers let you find and organize books
  • Same way, adapters let us find and sequence DNA fragments!

18.2.5.1 Step 2: Cluster Generation (Bridge Amplification)

The flow cell:

  • Glass slide with millions of tiny spots
  • Each spot has oligonucleotides (short DNA sequences) attached
  • These oligos match the adapter sequences

Bridge amplification process:

  1. Attach: DNA fragment binds to flow cell via adapter
  2. Bend: Fragment bends over and binds to nearby oligo (makes a “bridge”)
  3. Amplify: DNA polymerase copies the fragment
  4. Separate: Double-stranded DNA is denatured (separated)
  5. Repeat: Process repeats ~30-35 times

Result: Each single DNA molecule becomes a “cluster” of ~1,000 identical copies in one tiny spot!

Why clustering is brilliant:

  • Signal amplification: 1,000 copies glow much brighter than 1 molecule
  • Accuracy: All copies in cluster have same sequence
  • Millions of clusters: Each original DNA fragment gets its own cluster
  • Parallel sequencing: All clusters sequenced simultaneously!

Think of it like:

  • One person shouting = hard to hear
  • 1,000 people shouting the same thing = very loud and clear!

18.2.5.2 Step 3: Sequencing by Synthesis

Now the actual sequencing begins!

Cycle 1: Add first base

  1. Add mix to flow cell:

    • All four fluorescent reversible terminator dNTPs (A, T, G, C)
    • DNA polymerase
    • Sequencing primer (binds to adapter)
  2. Incorporation:

    • DNA polymerase adds ONE base to each cluster
    • Only the complementary base binds
    • Terminator prevents adding more bases
  3. Wash: Remove unincorporated nucleotides

  4. Image:

    • Laser excites fluorescent dyes
    • Camera captures emission from each cluster
    • Computer records which color (which base)
  5. Cleave:

    • A chemical cleavage step removes the fluorescent dye
    • The same step removes the terminator (3’ blocking) group
    • 3’-OH is restored for next cycle

Cycle 2: Add second base

  • Repeat exact same process
  • Add second base
  • Image the color
  • Record the base

Cycles 3, 4, 5… up to 150-300 cycles:

  • Continue adding one base at a time
  • Image each cycle
  • Build up the complete sequence!

Key features:

  • One base per cycle: Reversible terminators prevent errors from homopolymers (like AAAAA)
  • All four dNTPs present: Natural competition minimizes bias
  • Massively parallel: Millions of clusters sequenced simultaneously
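The cycle described above can be sketched as a toy simulation. This is an error-free illustration for a single cluster, not how instrument software works: the dye colors are hypothetical, and the template string is assumed to be written in the order the polymerase reads it.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Hypothetical dye colors; real instruments use their own dye/channel schemes.
DYE_COLOR = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
COLOR_TO_BASE = {color: base for base, color in DYE_COLOR.items()}


def sequence_by_synthesis(template: str, cycles: int) -> str:
    """Simulate error-free SBS: one complementary base is added and imaged per cycle."""
    read = []
    for position in range(min(cycles, len(template))):
        # Incorporation: only the base complementary to the template fits.
        incorporated = COMPLEMENT[template[position]]
        # Imaging: the camera sees the dye color, not the base itself.
        observed_color = DYE_COLOR[incorporated]
        # Base calling: map the color back to a base.
        read.append(COLOR_TO_BASE[observed_color])
        # Cleavage (dye and terminator removal) is implicit: the loop moves on.
    return "".join(read)


template = "TACGGATTTTT"
read = sequence_by_synthesis(template, cycles=8)
print(read)
```

Because exactly one base is added per cycle, the run of T’s at the end of the template is read one position at a time, which is why homopolymers are less of a problem for this chemistry.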

18.2.5.3 Step 4: Data Analysis and Base Calling

What happens to the images:

  1. Image analysis: Software identifies each cluster spot

  2. Intensity measurement: Measures fluorescence intensity and color for each cluster

  3. Base calling: Determines which base (A, T, G, C) was added

    • Compares signal to expected wavelengths
    • Assigns most likely base
  4. Quality scoring: Assigns confidence score (Phred score) to each base

    • Q20 = 99% accuracy (1 error in 100 bases)
    • Q30 = 99.9% accuracy (1 error in 1,000 bases)
    • Q40 = 99.99% accuracy (1 error in 10,000 bases)
  5. FASTQ file generation: Creates file with:

    • Read sequence
    • Quality score for each base
    • Read identifier
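The Phred scale follows Q = -10 × log10(P), where P is the probability that the base call is wrong. The sketch below converts between Q and P and decodes the quality line of one FASTQ record. It assumes the common Phred+33 encoding (quality character = Q + 33), and the record itself is made up for illustration.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Q = -10 * log10(P)  =>  P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Inverse conversion: error probability to Phred score."""
    return -10 * math.log10(p)


def parse_fastq_record(lines):
    """Parse one 4-line FASTQ record, assuming Phred+33 quality encoding."""
    header, sequence, _plus, quality = (line.strip() for line in lines)
    scores = [ord(ch) - 33 for ch in quality]
    return header[1:], sequence, scores


# Made-up record: identifier, bases, separator, quality characters.
record = ["@read_001", "ACGT", "+", "I?5+"]
read_id, seq, quals = parse_fastq_record(record)
print(read_id, seq, quals)
print(phred_to_error_prob(30))
```

Here the four quality characters decode to Q40, Q30, Q20, and Q10, so the last base of this example read would have a 1-in-10 chance of being wrong.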

Quality control:

  • Low-quality reads filtered out
  • Adapter sequences trimmed
  • Reads ready for alignment or assembly!
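A minimal sketch of these two quality-control steps follows. The mean-quality threshold, the exact-match adapter search, and the adapter string are illustrative assumptions; dedicated trimming tools also handle partial and mismatched adapters, which this toy version does not.

```python
def trim_adapter(sequence, quals, adapter):
    """Cut the read at the first exact occurrence of the adapter (toy version)."""
    index = sequence.find(adapter)
    if index == -1:
        return sequence, quals
    return sequence[:index], quals[:index]


def passes_quality(quals, min_mean_q=20.0):
    """Keep reads whose mean Phred score meets the (assumed) threshold."""
    return len(quals) > 0 and sum(quals) / len(quals) >= min_mean_q


ADAPTER = "AGATCGGAAGAGC"  # example adapter prefix, used here for illustration only

reads = [
    ("r1", "ACGTACGT" + ADAPTER, [35] * 21),  # good read with adapter read-through
    ("r2", "ACGTACGT", [10] * 8),             # low-quality read
]

kept = []
for read_id, seq, quals in reads:
    seq, quals = trim_adapter(seq, quals, ADAPTER)
    if passes_quality(quals):
        kept.append((read_id, seq))

print(kept)
```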

18.2.6 Paired-End Sequencing

Even more powerful: Illumina can sequence BOTH ends of each DNA fragment!

How it works:

  1. Sequence one end (forward read, 150 bp)
  2. Wash away sequencing reagents
  3. Regenerate the complementary strand of each fragment by bridge amplification, then remove the original strand
  4. Sequence the OTHER end (reverse read, 150 bp)

Advantages:

  • Better mapping: Know the approximate distance between the two ends (the fragment size)
  • Detect rearrangements: Identify insertions, deletions, inversions
  • Span repetitive regions: One end maps uniquely, helps place the other
  • Improved accuracy: Two reads better than one
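To see how the known distance between the two ends helps detect rearrangements, here is a hedged sketch. The class, the expected fragment size of 400 bp, and the tolerance of 100 bp are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class MappedPair:
    forward_start: int  # leftmost reference position of the forward read
    reverse_end: int    # rightmost reference position of the reverse read


def fragment_length(pair: MappedPair) -> int:
    """Distance on the reference spanned by the two reads of a pair."""
    return pair.reverse_end - pair.forward_start + 1


def is_discordant(pair: MappedPair, expected: int = 400, tolerance: int = 100) -> bool:
    """Flag pairs whose apparent fragment size is far from the library's expected size."""
    return abs(fragment_length(pair) - expected) > tolerance


normal_pair = MappedPair(forward_start=1000, reverse_end=1399)
odd_pair = MappedPair(forward_start=1000, reverse_end=5999)

print(fragment_length(normal_pair), is_discordant(normal_pair))
print(fragment_length(odd_pair), is_discordant(odd_pair))
```

A pair that maps much farther apart than the library’s fragment size, like the second one, hints that the sample may carry a deletion relative to the reference between the two reads.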

Applications:

  • Genome assembly: Connect contigs
  • Variant detection: Find structural variants
  • RNA-seq: Identify splice junctions
  • Metagenomics: Better species identification

18.2.7 Illumina Platforms Comparison

Platform | Reads per Run | Output per Run | Read Length | Run Time
MiSeq    | 25 million    | 15 Gb          | 2 × 300 bp  | 4-55 hours
NextSeq  | 400 million   | 120 Gb         | 2 × 150 bp  | 12-30 hours
HiSeq    | 5 billion     | 1,500 Gb       | 2 × 150 bp  | 1-6 days
NovaSeq  | 20 billion    | 6,000 Gb       | 2 × 250 bp  | 13-44 hours

Choosing a platform:

  • Small projects (targeted sequencing): MiSeq
  • Medium projects (RNA-seq, exomes): NextSeq
  • Large projects (whole genomes): NovaSeq
  • Clinical labs: MiSeq, NextSeq (faster turnaround)
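The output figures in the platform comparison are roughly the number of reads times the bases sequenced per read. The sketch below checks that arithmetic, assuming "reads per run" counts clusters that are each read from both ends; that interpretation is an assumption, not something the comparison states.

```python
def run_output_gb(reads: float, read_length_bp: int, paired: bool = True) -> float:
    """Approximate run output in gigabases: reads x bases per read."""
    bases_per_read = read_length_bp * (2 if paired else 1)
    return reads * bases_per_read / 1e9


miseq = run_output_gb(25e6, 300)        # 2 x 300 bp
nextseq = run_output_gb(400e6, 150)     # 2 x 150 bp
hiseq = run_output_gb(5e9, 150)         # 2 x 150 bp
novaseq_150 = run_output_gb(20e9, 150)  # 2 x 150 bp
novaseq_250 = run_output_gb(20e9, 250)  # 2 x 250 bp

print(miseq, nextseq, hiseq, novaseq_150, novaseq_250)
```

Under this assumption the MiSeq, NextSeq, and HiSeq rows are self-consistent. The NovaSeq figure of 6,000 Gb matches a 2 × 150 bp run, whereas 20 billion reads at 2 × 250 bp would give 10,000 Gb, so the listed maximum read length and maximum output presumably come from different run configurations.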

18.2.8 Common Applications of Illumina Sequencing

Whole Genome Sequencing (WGS):

  • Sequence entire genome
  • Human genome: ~30× coverage typical
  • Detect all types of variants

Whole Exome Sequencing (WES):

  • Sequence only protein-coding regions (~1% of genome)
  • Much cheaper than WGS
  • Good for finding disease-causing mutations

RNA-seq (Transcriptomics):

  • Sequence all RNA in sample
  • Measure gene expression levels
  • Find splice variants, fusion genes

ChIP-seq (Epigenomics):

  • Map protein-DNA interactions
  • Find transcription factor binding sites
  • Map histone modifications

Targeted Sequencing:

  • Sequence specific genes or regions
  • Cancer panels (e.g., 50 cancer-related genes)
  • Very deep coverage of specific regions

Metagenomics:

  • Sequence all DNA in environmental sample
  • Identify microbes in gut, soil, ocean
  • Don’t need to culture organisms!

18.2.9 Limitations and Challenges

Compared to Sanger sequencing:

  • Shorter reads: 150-300 bp vs. 700-1,000 bp
  • More complex data analysis: Millions of short reads to assemble
  • Higher upfront cost: Equipment is expensive

Compared to long-read sequencing (PacBio, Nanopore):

  • Cannot span large repetitive regions: Short reads get “lost”
  • Miss structural variants: Large insertions/deletions hard to detect
  • Phasing difficult: Hard to tell which variants are on same chromosome

Technical challenges:

  • GC bias: GC-rich regions harder to sequence
  • PCR duplicates: Multiple reads from same original molecule
  • Coverage uniformity: Some regions get more reads than others
  • Data storage: Terabytes of data per run!

18.2.10 The Future of Illumina Technology

Recent improvements:

  • Longer reads: Now up to 2 × 250 bp or 2 × 300 bp
  • Higher accuracy: Improved chemistry, better base calling algorithms
  • Faster runs: Hours instead of days
  • Lower cost: Approaching $100 per human genome

Emerging technologies:

  • Complete Long Reads: Illumina’s approach for obtaining long-read information on its short-read instruments
  • Single-cell sequencing: Sequence individual cells
  • Spatial transcriptomics: Sequence RNA with spatial location information

18.2.11 Summary: Illumina Sequencing

What it is: Sequencing by synthesis using fluorescent reversible terminators

How it works:

  1. Fragment DNA and add adapters
  2. Create clusters by bridge amplification
  3. Sequence one base at a time (different color for each base)
  4. Image after each base addition
  5. Repeat for 150-300 cycles

Key advantages:

  • Massively parallel (millions of reads simultaneously)
  • High accuracy (99%+)
  • Cost-effective
  • Versatile applications

Key limitations:

  • Short reads
  • GC bias
  • Complex data analysis

Impact: Revolutionized genomics, enabling personalized medicine, cancer genomics, microbiome research, and much more!

18.3 Coverage and Depth: Reading Each Base Multiple Times

18.3.1 What Is Coverage?

When sequencing a genome, we don’t just read each base once - we read it MANY times! This is called coverage or depth.

Coverage = How many times, on average, each base is sequenced

Think of it like:

  • Reading a book once → might miss typos
  • Reading a book 30 times → very confident about every word!

Notation:

  • 30× coverage (pronounced “30 ex”) = each base read 30 times on average
  • 50× coverage = each base read 50 times
  • 100× coverage = each base read 100 times
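Average coverage can be estimated as C = N × L / G, where N is the number of reads, L the read length, and G the genome size. A minimal sketch, using the 3-billion-base human genome mentioned in this chapter and an assumed read length of 150 bp:

```python
import math


def mean_coverage(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average coverage C = N * L / G."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


HUMAN_GENOME_BP = 3_000_000_000

n_reads_30x = reads_needed(30, 150, HUMAN_GENOME_BP)
coverage_check = mean_coverage(n_reads_30x, 150, HUMAN_GENOME_BP)

print(n_reads_30x)
print(coverage_check)
```

So a 30× human genome with 150 bp reads needs about 600 million reads, or about 300 million read pairs in a paired-end run.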

18.3.2 Why Do We Need Multiple Reads?

Sequencing errors happen!

  • Illumina: ~99% accuracy per base
  • That means 1 error per 100 bases!
  • Human genome = 3 billion bases
  • 1% error = 30 million errors if we only read once!

Solution: Read each base many times and vote!

Example:

If we sequence one position 30 times and get:

  • 28 reads say “A”
  • 1 read says “T”
  • 1 read says “C”

We’re confident the correct base is A (the errors were T and C).
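The voting idea can be sketched in a few lines. This is a simple majority vote under the assumption that every read counts equally; real variant callers also weigh quality scores and allow for two different alleles at a position.

```python
from collections import Counter


def consensus_base(observed):
    """Return the most common base and the fraction of reads supporting it."""
    counts = Counter(observed)
    base, support = counts.most_common(1)[0]
    return base, support / len(observed)


# The example above: 28 reads say A, 1 says T, 1 says C.
pileup = ["A"] * 28 + ["T"] + ["C"]
base, fraction = consensus_base(pileup)

print(base, round(fraction, 3))
```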

Think of it like:

  • 30 witnesses to an event
  • 28 say “the car was red”
  • 2 say “the car was blue”
  • You trust the majority → the car was red!

18.3.3 How Coverage Works in Practice

Low coverage (10× or less):

  • Pros: Cheap, fast
  • Cons: Less confident, might miss variants
  • Uses: Population studies (many individuals, shallow depth)

Medium coverage (30×):

  • Pros: Good balance of cost and accuracy
  • Cons: Might miss some rare variants
  • Uses: Clinical sequencing, most research

High coverage (50-100×):

  • Pros: Very confident, detects rare variants
  • Cons: Expensive, more data to store
  • Uses: Cancer sequencing, rare disease diagnosis

Ultra-high coverage (500× or more):

  • Pros: Can detect tiny fractions of variant alleles
  • Cons: Very expensive
  • Uses: Liquid biopsies (detecting 1% cancer DNA in blood)

18.3.4 Coverage in Different Applications

Whole genome sequencing (WGS):

  • Clinical diagnostic: 30× coverage
  • Research: 30-50× coverage
  • Population studies: 10-15× coverage

Whole exome sequencing (WES):

  • Typical: 100× coverage (only sequencing 1% of genome, so can afford more depth)

RNA-seq:

  • Coverage here means number of reads per gene
  • Not the same as genome coverage!

Targeted sequencing (specific genes):

  • Cancer panels: 500-1,000× coverage
  • Why so high? To detect mutations in small % of cells

18.3.5 The Cost-Coverage Trade-off

More coverage = more cost:

  • 30× human genome: ~$1,000
  • 50× human genome: ~$1,500
  • 100× human genome: ~$3,000

Diminishing returns:

  • Going from 10× to 30× = huge improvement
  • Going from 30× to 50× = moderate improvement
  • Going from 50× to 100× = small improvement

Think of it like:

  • Adding the 10th security camera helps a lot
  • Adding the 100th security camera doesn’t help much more

18.3.6 Coverage Is Not Uniform

Important note: Coverage is an AVERAGE, but some regions get more reads than others!

Reasons for uneven coverage:

  • GC content: GC-rich (and very AT-rich) regions often have lower coverage
  • Repetitive sequences: Hard to sequence, lower coverage
  • PCR bias: Some fragments amplify better than others
  • Random chance: Statistical variation

Example:

With 30× average coverage:

  • Some bases might have 50× coverage
  • Other bases might have 10× coverage
  • A few bases might have 0× coverage (gaps!)
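The "random chance" part of this variation can be estimated with a Poisson model. The sketch assumes reads land uniformly at random across the genome, which ignores GC and PCR bias, so it gives only a lower bound on how uneven real coverage is.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability that a base is covered exactly k times at a given mean depth."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def prob_depth_at_most(k: int, mean: float) -> float:
    """Probability that a base is covered k times or fewer."""
    return sum(poisson_pmf(i, mean) for i in range(k + 1))


p_zero_30 = poisson_pmf(0, 30)         # chance of a gap at 30x average
p_le10_30 = prob_depth_at_most(10, 30)  # chance of 10x or less at 30x average
p_zero_5 = poisson_pmf(0, 5)           # chance of a gap at 5x average

print(p_zero_30, p_le10_30, p_zero_5)
```

Under pure random sampling, gaps at 30× would be vanishingly rare, while at 5× a bit under 1% of bases would be missed entirely. That gaps still appear in real 30× data suggests they come mostly from the systematic causes listed above rather than from chance.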

18.3.7 Quality Scores and Coverage

Coverage and quality scores work together:

  • High coverage + high quality = very confident
  • High coverage + low quality = less confident (many bad reads)
  • Low coverage + high quality = somewhat confident
  • Low coverage + low quality = not confident!

Quality score (Phred score):

  • Q30 = 99.9% accuracy
  • Q20 = 99% accuracy
  • Below Q20 = often filtered out

18.3.8 Real-World Example: Diagnosing a Genetic Disease

Scenario: Patient with unknown genetic disease

Approach:

  1. Sequence patient’s genome at 30× coverage
  2. Each base read ~30 times
  3. Computer identifies variants (differences from reference)
  4. High coverage = confident these variants are real
  5. Find causative mutation!

Why 30× matters:

  • Can confidently call heterozygous variants (one copy mutated)
  • Expected ratio: 15 reads normal, 15 reads mutated
  • Clear signal!

With only 5× coverage:

  • Might see: 3 reads normal, 2 reads mutated
  • OR: 4 reads normal, 1 read mutated
  • Hard to tell if it’s real or an error!
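The difference between 5× and 30× can be quantified with a binomial model. The sketch assumes that each read at a heterozygous site has a 50% chance of carrying the mutated allele and that a variant needs at least 3 supporting reads to be believed; the 3-read rule is a hypothetical threshold chosen for illustration.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """P(at least k of n reads carry the variant), each with probability p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


p_detect_5x = prob_at_least(3, 5)    # at least 3 mutated reads out of 5
p_detect_30x = prob_at_least(3, 30)  # at least 3 mutated reads out of 30
p_unseen_5x = 0.5 ** 5               # no mutated read at all out of 5

print(p_detect_5x, p_detect_30x, p_unseen_5x)
```

With these assumptions, a true heterozygous variant clears the threshold only half the time at 5× and is missed completely about 3% of the time, while at 30× it is detected almost always.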

18.3.9 Key Takeaways on Coverage

  • Coverage = number of times each base is sequenced
  • 30× is standard for clinical sequencing
  • Higher coverage = more confident, but more expensive
  • Coverage is uneven across the genome
  • Multiple reads allow error correction through voting
  • Different applications need different coverage levels

Think of coverage like:

  • Insurance: More coverage = more protection, but costs more!
  • Witnesses: More witnesses = more confident about the truth!

18.4 Human Genome Sequencing: Then vs. Now

18.4.1 The Amazing Progress

Feature 2003 (HGP) 2025 (Today)
Cost $3 billion <$1,000
Time 13 years 1-2 days
Method Sanger sequencing NGS (Illumina)
Accuracy 99.99% 99.9%
Availability Research only Clinical & consumer

That’s a reduction of:

  • Cost: 3 million-fold!

  • Time: roughly 2,400- to 4,700-fold (13 years ≈ 4,750 days vs. 1-2 days)!

It’s like going from the price of a house to the price of a coffee!

18.4.2 What This Enables

1. Personalized Medicine:

  • Sequence your genome to find disease risks

  • Choose treatments based on your genetics

  • Predict drug responses

2. Cancer Genomics:

  • Sequence tumor DNA

  • Find specific mutations

  • Target treatments to specific mutations

  • Monitor treatment response

3. Rare Disease Diagnosis:

  • Sequence patients with unknown diseases

  • Find causative mutations

  • Enable treatment or management

4. Prenatal Testing:

  • Non-invasive prenatal testing (NIPT)

  • Detect genetic disorders before birth

  • From mother’s blood (no risk to baby!)

5. Consumer Genomics:

  • 23andMe, Ancestry.com (mostly genotyping arrays rather than full genome sequencing)

  • Learn about ancestry

  • Find genetic relatives

  • Some health information

18.5 Third-Generation Sequencing

18.5.1 Long-Read Technologies

Problem with NGS (2nd gen):

  • Short reads (100-300 bp)

  • Hard to assemble repetitive regions

  • Miss large structural variations

Solution: Third-generation sequencing!

18.5.2 PacBio (Pacific Biosciences)

Key feature: Very long reads (10,000-100,000+ bp)

How it works:

  • Single molecule sequencing

  • Watch DNA polymerase add nucleotides in real-time

  • Each nucleotide fluoresces

  • Incredible! Like watching a molecular movie!

Advantages:

  • Long reads span difficult regions

  • Can sequence through repetitive DNA

  • Detects DNA modifications directly

Disadvantages:

  • Higher raw error rate for single-pass reads (but HiFi consensus reads now exceed 99.9% accuracy)

  • More expensive

  • Lower throughput

18.5.3 Oxford Nanopore

Key feature: Extremely long reads (up to 2 million bp!)

How it works:

The revolutionary principle: Instead of detecting fluorescence, Oxford Nanopore measures changes in electrical current!

Step-by-step:

  1. The nanopore: A tiny protein pore embedded in a membrane
    • Think of it like a donut hole that DNA passes through
    • Maintains an electrical current flowing through it
  2. DNA threading: Single-stranded DNA is pulled through the pore
    • Motor protein controls speed
    • DNA passes through one base at a time
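  3. Current disruption: The bases inside the pore block the current in a sequence-dependent way
    • The signal at any moment reflects the few bases sitting in the pore, so each short stretch of bases has a characteristic electrical signature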
    • Computer records current changes in real-time
    • Pattern of disruptions reveals the sequence!
  4. No amplification needed: Sequences single DNA molecules directly
    • No PCR required (unlike Illumina!)
    • Avoids PCR errors and biases
    • Detects DNA modifications directly (methylation, etc.)
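The idea of reading a sequence from current levels can be illustrated with a toy decoder. Everything here is a simplification: the current values are hypothetical, each base is given its own level (real pores sense several neighbouring bases at once), and modern basecallers typically rely on neural networks rather than a nearest-level lookup.

```python
# Hypothetical current levels (in picoamps) for each base; not real measurements.
TOY_LEVELS = {"A": 80.0, "C": 95.0, "G": 110.0, "T": 125.0}


def simulate_signal(sequence, offsets=None):
    """Turn a sequence into current readings, optionally perturbed by per-base offsets."""
    offsets = offsets or [0.0] * len(sequence)
    return [TOY_LEVELS[base] + delta for base, delta in zip(sequence, offsets)]


def call_bases(signal):
    """Assign each reading to the base whose toy level is closest."""
    return "".join(
        min(TOY_LEVELS, key=lambda base: abs(TOY_LEVELS[base] - level))
        for level in signal
    )


# Fixed small offsets stand in for measurement noise.
signal = simulate_signal("GATTACA", offsets=[1.5, -2.0, 3.0, -1.0, 0.5, 2.5, -3.5])
decoded = call_bases(signal)

print(signal)
print(decoded)
```

As long as the noise stays well below the spacing between levels, the sequence is recovered; when levels overlap, errors appear, which is one intuition for why raw nanopore reads have been noisier than Illumina reads.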

Key advantages of no PCR:

  • Faster workflow (skip amplification step)
  • No PCR-introduced errors or duplicate reads
  • Detects epigenetic modifications
  • Simpler library preparation (although amplification-free protocols usually need more input DNA)

Device features:

  • Portable: USB stick size (MinION device)
  • Real-time sequencing: Watch sequences appear live!
  • Direct RNA sequencing: Can sequence RNA without converting to DNA first

Advantages:

  • Longest reads available: Regular reads >10 kb, ultra-long reads >100 kb, record >2 Mb!
  • Portable: Used in field research, remote locations, even the International Space Station!
  • Real-time results: No need to wait for run to finish
  • Direct RNA sequencing: Preserves modifications, no reverse transcription needed
  • No amplification bias: Sequences original molecules
  • Epigenetic detection: Detects methylation and other modifications directly

Disadvantages:

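  • Higher error rate: historically ~92-95% raw read accuracy (vs. 99%+ for Illumina)

    • Improving rapidly with newer pore chemistries and base-calling algorithms
    • Many errors are random and average out with coverage, but some are systematic (for example, in homopolymer runs)
    • Random errors can be corrected with higher coverage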
  • More expensive per base: Still more costly than Illumina for high coverage

  • Newer technology: Still improving (algorithms, chemistry, accuracy)

Applications where Nanopore excels:

  • Structural variants: Long reads easily span deletions, insertions, inversions
  • Repetitive regions: Can read through repeats that confuse short reads
  • Transcript isoforms: Full-length RNA sequencing reveals splice variants
  • Metagenomics: Long reads better for species identification
  • Rapid diagnostics: Real-time results for outbreak response
  • Field work: Portable sequencing where labs don’t exist!

18.5.4 Why Long Reads Matter

1. Complete Genome Assembly:

  • 2022: First truly complete human genome

  • Used long reads to fill gaps

  • Repetitive regions finally sequenced!

2. Structural Variants:

  • Large deletions, insertions, inversions

  • Hard to detect with short reads

  • Easy with long reads

3. Phasing:

  • Determine which variants are on same chromosome

  • Important for understanding genetics
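Phasing with long reads can be sketched as follows: a read long enough to span two variant sites shows which alleles travel together on the same chromosome copy. The data and the rule of keeping the two best-supported combinations are illustrative assumptions, not a description of any particular phasing tool.

```python
from collections import Counter


def phase_two_sites(reads):
    """Given (allele_at_site_1, allele_at_site_2) per spanning read,
    return the two most frequently observed allele combinations."""
    counts = Counter(reads)
    return [haplotype for haplotype, _ in counts.most_common(2)]


# Made-up long reads covering both sites; one read contains a sequencing error.
spanning_reads = [("A", "T")] * 6 + [("G", "C")] * 5 + [("A", "C")]
haplotypes = phase_two_sites(spanning_reads)

print(haplotypes)
```

Short reads that cover only one of the two sites cannot provide this information, because they never show both alleles on the same molecule.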

4. Repeat Regions:

  • Centromeres, telomeres

  • Highly repetitive

  • Spanned by long reads

18.6 Applications of Sequencing

18.6.1 Research

Genomics:

  • Sequence genomes of any organism

  • Understand evolution

  • Discover new species

Transcriptomics (RNA-seq):

  • Sequence all RNA in cells

  • Measure gene expression

  • Find new transcripts

Metagenomics:

  • Sequence all DNA in an environment

  • Discover microbes in gut, ocean, soil

  • Don’t need to culture them!

Epigenomics:

  • Some methods detect DNA methylation

  • Map epigenetic modifications

18.6.2 Medicine

Diagnosis:

  • Identify genetic diseases

  • Diagnose infections (sequence pathogen)

  • Cancer genomics

Pharmacogenomics:

  • Predict drug responses

  • Avoid adverse reactions

  • Personalize medication

Liquid Biopsies:

  • Sequence cell-free DNA in blood

  • Early cancer detection

  • Monitor treatment

18.6.3 Agriculture

Crop Improvement:

  • Sequence crop genomes

  • Find beneficial genes

  • Speed up breeding

Livestock:

  • Breed healthier animals

  • Understand genetics

  • Disease resistance

18.6.4 Forensics

Criminal Justice:

  • DNA fingerprinting

  • Identify suspects or victims

  • Solve cold cases

Paternity Testing:

  • Determine biological relationships

18.6.5 Conservation

Endangered Species:

  • Sequence threatened species

  • Preserve genetic diversity

  • Plan breeding programs

Ancient DNA:

  • Sequence extinct species (mammoths, Neanderthals!)

  • Understand evolution

  • Learn about history

18.7 The Future of Sequencing

18.7.1 Emerging Technologies

Real-Time Single-Molecule Sequencing:

  • Watch individual DNA molecules being read

  • No amplification needed

  • Detect modifications directly

In Situ Sequencing:

  • Sequence DNA/RNA inside intact cells and tissues

  • See spatial organization

  • Preserve 3D context

$100 Genome:

  • Even cheaper than today

  • Routine for everyone

  • Preventive medicine

18.7.2 Challenges Ahead

Data Analysis:

  • Generating data is easy now

  • Interpreting data is hard!

  • Need better algorithms and AI

Ethical Issues:

  • Privacy of genetic information

  • Insurance discrimination

  • Incidental findings (finding diseases you weren’t looking for)

Equity:

  • Making sequencing available to everyone

  • Not just wealthy countries/people

  • Global health applications

18.8 Key Takeaways

  • DNA sequencing = Reading the order of A, T, G, C in DNA

  • Sanger sequencing (1977):

    • First method

    • Accurate but slow and expensive

    • Used for Human Genome Project

  • Next-Generation Sequencing (NGS):

    • Massively parallel (millions of sequences at once)

    • Fast, cheap, high-throughput

    • Revolutionized genomics!

  • Progress: $3 billion/13 years → <$1,000/1-2 days

  • Third-generation sequencing:

    • Long reads (up to millions of base pairs!)

    • PacBio and Oxford Nanopore

    • Completed the human genome

  • Applications: Medicine, research, agriculture, forensics, conservation

  • Future: Even cheaper, faster, better analysis, ethical considerations


Sources: Information adapted from Illumina, NHGRI, Technology Networks, and sequencing technology literature (Bentley et al. 2008; Shendure and Ji 2008; Mardis 2017).

Bentley, David R., Shankar Balasubramanian, Harold P. Swerdlow, Geoffrey P. Smith, John Milton, Clive G. Brown, Kevin P. Hall, et al. 2008. “Accurate Whole Human Genome Sequencing Using Reversible Terminator Chemistry.” Nature 456 (7218): 53–59.
Mardis, Elaine R. 2017. “DNA Sequencing Technologies: 2006–2016.” Nature Protocols 12 (2): 213–18.
Sanger, Frederick, Steven Nicklen, and Alan R. Coulson. 1977. “DNA Sequencing with Chain-Terminating Inhibitors.” Proceedings of the National Academy of Sciences 74 (12): 5463–67.
Shendure, Jay, and Hanlee Ji. 2008. “Next-Generation DNA Sequencing.” Nature Biotechnology 26 (10): 1135–45.