BASICS ON MOLECULAR BIOLOGY
Cell – DNA – RNA – protein
Sequencing methods
Arising questions for handling the data and making sense of it
Next two weeks' lectures: sequence alignment and genome assembly

Cells
• Fundamental working units of every living system.
• Every organism is composed of one of two radically different types of cells:
  – prokaryotic cells
  – eukaryotic cells, which have DNA inside a nucleus.
• Prokaryotes and eukaryotes are descended from primitive cells and are the result of 3.5 billion years of evolution.

Prokaryotes and Eukaryotes
• According to the most recent evidence, there are three main branches to the tree of life
• Prokaryotes include Archaea (“ancient ones”) and bacteria
• Eukaryotes form the domain Eukarya, which includes plants, animals, fungi and certain algae
Lecture: Phylogenetic trees covers this topic in more detail

All Cells have common Cycles
• Born, eat, replicate, and die

Common features of organisms
• Chemical energy is stored in ATP
• Genetic information is encoded by DNA
• Information is transcribed into RNA
• There is a common triplet genetic code
  – some variations are known, however
• Translation into proteins involves ribosomes
• Shared metabolic pathways
• Similar proteins among diverse groups of organisms

All Life depends on 3 critical molecules
• DNAs (deoxyribonucleic acid)
  – Hold information on how the cell works
• RNAs (ribonucleic acid)
  – Act to transfer short pieces of information to different parts of the cell
  – Provide templates to synthesize into protein
• Proteins
  – Form enzymes that send signals to other cells and regulate gene activity
  – Form the body’s major components

DNA structure
• DNA has a double helix structure which is composed of
  – sugar molecule
  – phosphate group
  – and a base (A, C, G, T)
• By convention, we read DNA strings in the direction of transcription: from the 5’ end to the 3’ end

  5’ ATTTAGGCC 3’
  3’ TAAATCCGG 5’

DNA is contained in chromosomes
• In eukaryotes, DNA is packed into linear chromosomes
• In prokaryotes, DNA is usually contained in a single, circular chromosome

Chromosomes
• Somatic cells (cells in all tissues except the germline) in humans have 2 copies of each of the 22 autosomes + XX (female) or XY (male) = total of 46 chromosomes
• Gametes (germline cells) have 22 chromosomes + either X or Y = total of 23 chromosomes
[Figure: karyogram of a human male using Giemsa staining]

RNA
• RNA is chemically similar to DNA. It is usually only a single strand. T(hymine) is replaced by U(racil)
• Several types of RNA exist for different functions in the cell.
[Figure: tRNA, linear and 3D view]

DNA, RNA, and the Flow of Information
• “The central dogma”: replication, transcription, translation
• Is this true? Denis Noble: The principles of Systems Biology illustrated using the virtual heart

Proteins
• Proteins are polypeptides (strings of amino acid residues)
• Represented using strings of letters from an alphabet of 20: AEGLV…WKKLAG
• Typical length 50…1000 residues
[Figure: urease enzyme from Helicobacter pylori]

Amino acids

How DNA/RNA codes for protein
• The DNA alphabet contains four letters but must specify a protein, or polypeptide, sequence over an alphabet of 20 letters.
• Trinucleotides (triplets) allow 4^3 = 64 possible trinucleotides
• Triplets are also called codons

Proteins
• 20 different amino acids
  – different chemical properties cause the protein chains to fold up into specific three-dimensional structures that define their particular functions in the cell.
• Proteins do all essential work for the cell
  – build cellular structures
  – digest nutrients
  – execute metabolic functions
  – mediate information flow within a cell and among cellular communities.
• Proteins work together with other proteins or nucleic acids as "molecular machines"
  – structures that fit together and function in highly specific, lock-and-key ways.
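
The triplet code described above can be made concrete with a short Python sketch. It enumerates the 4^3 = 64 codons, pairs them with the standard genetic code (written out here in the conventional T, C, A, G ordering), and translates a coding-strand DNA string. The names CODON_TABLE, transcribe and translate are illustrative choices, not part of the lecture material, and the sketch assumes the input is already in the correct reading frame.

```python
from itertools import product

# Bases in the conventional ordering used for genetic code tables.
BASES = "TCAG"

# Standard genetic code, one letter per codon in TCAG order ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 4^3 = 64 codons to its amino acid.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Return the mRNA for a coding-strand DNA string (T replaced by U)."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate coding-strand DNA codon by codon, stopping at a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(len(CODON_TABLE))         # number of distinct codons
    print(transcribe("atgagtgga"))  # mRNA for the example used in the Genes slides
    print(translate("atgagtgga"))   # its protein prefix
```

With the sequence atgagtgga from the Genes slides, translate returns MSG, and the allele atgagtcga gives MSR, so a single substituted base changes one amino acid.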

Genes
• “A gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products”
• A DNA segment whose information is expressed either as an RNA molecule or protein

           (folding)
  MSG …
           (translation)
           (transcription)
  5’ … a t g a g t g g a … 3’
  3’ … t a c t c a c c t … 5’

  http://fold.it

Genes and alleles
• A gene can have different variants
• The variants of the same gene are called alleles

  MSG …                            MSR …
  5’ … a t g a g t g g a … 3’      5’ … a t g a g t c g a … 3’
  3’ … t a c t c a c c t … 5’      3’ … t a c t c a g c t … 5’

Genes can be found on both strands
[Figure: genes located on both strands of a double-stranded DNA segment]

Exons, introns and splicing
• Introns are removed from RNA after transcription
• Exons are joined: this process is called splicing
[Figure: exons marked on a double-stranded DNA segment]

Alternative splicing
• Different splice variants may be generated
[Figure: exons A, B and C and the splice variants formed from them]

DNA and continuum of life....
• Prokaryotes are typically haploid: they have a single (circular) chromosome
• DNA is usually inherited vertically (parent to daughter)
• Inheritance is clonal
  – Descendants are faithful copies of an ancestral DNA
  – Variation is introduced via mutations, transposable elements, and horizontal transfer of DNA
[Figure: chromosome map of S. dysenteriae; the nine rings describe different properties of the genome]

Biological string manipulation
• Point mutation: substitution of a base
  – …ACGGCT… → …ACGCCT…
• Deletion: removal of one or more contiguous bases (substring)
  – …TTGATCA… → …TTTCA…
• Insertion: insertion of a substring
  – …GGCTAG… → …GGTCAACTAG…
Lecture: Sequence alignment
Lecture: Genome rearrangements

Genome sequencing and assembly
• DNA sequencing
  – How do we obtain DNA sequence information from organisms?
• Genome assembly
  – What is needed to put together DNA sequence information from sequencing?
• First statement of the sequence assembly problem:
  – Peltola, Söderlund, Tarhio, Ukkonen: Algorithms for some string matching problems arising in molecular genetics. Proc. 9th IFIP World Computer Congress, 1983

Recovery of shredded newspaper

DNA sequencing
• DNA sequencing: resolving a nucleotide sequence (whole-genome or less)
• Many different methods developed
  – Maxam-Gilbert method (1977)
  – Sanger method (1977)
  – High-throughput methods, “next-generation” methods

Sanger sequencing: sequencing by synthesis
• A sequencing technique developed in 1977
• Also called dideoxy sequencing
• A DNA polymerase is an enzyme that catalyzes DNA synthesis
• DNA polymerase needs a primer
• Synthesis always proceeds in the 5’→3’ direction
• In Sanger sequencing, chain-terminating dideoxynucleoside triphosphates (ddXTPs) are employed
  – ddATP, ddCTP, ddGTP, ddTTP lack the 3’-OH group of dXTPs
• A mixture of dXTPs with a small amount of ddXTPs is given to DNA polymerase with DNA template and primer
• ddXTPs are given fluorescent labels
• When DNA polymerase incorporates a ddXTP, the synthesis cannot proceed
• The process yields copied sequences of different lengths
• Each sequence is terminated by a labeled ddXTP

Determining the sequence
• Sequences are sorted according to length by capillary electrophoresis
• Fluorescent signals corresponding to labels are registered
• Base calling: identifying which base corresponds to each position in a read
  – Non-trivial problem
• Output sequences from base calling are called reads

Reads are short
• Modern Sanger sequencers can produce quality reads up to 750 bases
  – Instruments provide you with a quality file for bases in reads, in addition to actual sequence data
• Compare the read length against the size of the human genome (2.9×10^9 bases)
• Reads have to be assembled

Problems
• Sanger sequencing error rate per base varies from 1% to 3%
• Repeats in DNA
  – For example, the 300-base-long Alu sequence is repeated over a million times in the human genome
  – Repeats occur in different scales
• What happens if repeat length is longer than read length?
• Shortest superstring problem
  – Find the shortest string that “explains” the reads
  – Given a set of strings (reads), find a shortest string that contains all of them

Sequence assembly and combination locks
• What is common with sequence assembly and opening keypad locks?

Whole-genome shotgun sequencing
• Whole-genome shotgun sequence assembly starts with a large sample of genomic DNA
  1. Sample is randomly partitioned into inserts of length 500 bases
  2. Inserts are multiplied by cloning them into a vector which is used to infect bacteria
  3. DNA is collected from bacteria and sequenced
  4. Reads are assembled

Assembly of reads with the Overlap-Layout-Consensus algorithm
• Overlap
  – Finding potentially overlapping reads
• Layout
  – Finding the order of reads along DNA
• Consensus (Multiple alignment)
  – Deriving the DNA sequence from the layout
• Next, the method is described at a very abstract level, skipping a lot of details

Finding overlaps
• First, pairwise overlap alignment of reads is resolved

  acggagtcc
      agtccgcgctt

• Reads can be from either DNA strand: the reverse complement of each read r has to be considered

  5’ … a t g a g t g g a … 3’
  3’ … t a c t c a c c t … 5’

  r1: tgagt, reverse complement of r1: actca
  r2: tccac, reverse complement of r2: gtgga

Example sequence to assemble
5’ – CAGCGCGCTGCGTGACGAGTCTGACAAAGACGGTATGCGCATCGTGATTGAAGTGAAACGCGATGCGGTCGGTCGGTGAAGTTGTGCT – 3’
• 20 reads:

   #  Read      Reverse complement     #  Read      Reverse complement
   1  CATCGTGA  TCACGATG              11  GGTCGGTG  CACCGACC
   2  CGGTGAAG  CTTCACCG              12  ATCGTGAT  ATCACGAT
   3  TATGCGCA  TGCGCATA              13  GCGCTGCG  CGCAGCGC
   4  GACGAGTC  GACTCGTC              14  GCATCGTG  CACGATGC
   5  CTGACAAA  TTTGTCAG              15  AGCGCGCT  AGCGCGCT
   6  ATGCGCAT  ATGCGCAT              16  GAAGTTGT  ACAACTTC
   7  ATGCGGTC  GACCGCAT              17  AGTGAAAC  GTTTCACT
   8  CTGCGTGA  TCACGCAG              18  ACGCGATG  CATCGCGT
   9  GCGTGACG  CGTCACGC              19  GCGCATCG  CGATGCGC
  10  GTCGGTGA  TCACCGAC              20  AAGTGAAA  TTTCACTT

Finding overlaps
• Overlap between two reads can be found with a dynamic programming algorithm
  – Errors can be taken into account
• Dynamic programming will be discussed more during the next two weeks

   6  ATGCGCAT                Overlap(1, 6) = 3
   1       CATCGTGA
  12        ATCGTGAT          Overlap(1, 12) = 7

[Figure: overlap matrix over reads 1, 6 and 12, with entry 3 for reads 1 and 6 and entry 7 for reads 1 and 12]
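
The reverse-complement and overlap steps above can be sketched in a few lines of Python. This is a minimal sketch under a simplifying assumption: reads are error-free, so exact suffix-prefix matching is used in place of the dynamic programming alignment the slides refer to. The function names are illustrative, and read 1 is taken as CATCGTGA, the form consistent with its listed reverse complement TCACGATG and with the example genome.

```python
# Translation table mapping each base to its complement (both cases).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(read):
    """Return the read as seen from the opposite strand, in 5'->3' direction."""
    return read.translate(COMPLEMENT)[::-1]


def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b (exact match)."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def overlap_matrix(reads):
    """Suffix-prefix overlap for every ordered pair of distinct reads."""
    return {
        (i, j): overlap(reads[i], reads[j])
        for i in reads
        for j in reads
        if i != j
    }


if __name__ == "__main__":
    # Three reads from the example; keys are the read numbers in the table.
    reads = {1: "CATCGTGA", 6: "ATGCGCAT", 12: "ATCGTGAT"}
    for pair, score in sorted(overlap_matrix(reads).items()):
        print(pair, score)
    print(reverse_complement("tgagt"), reverse_complement("tccac"))
```

Here overlap(reads[6], reads[1]) is 3 and overlap(reads[1], reads[12]) is 7, the two values shown in the figure. A real assembler would also compare each read against the reverse complements of the others and would tolerate mismatches.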
• Overlap scores are stored in the overlap matrix
  – Entries (i, j) below the diagonal denote the overlap of reads r_i and r_j

Finding layout and consensus
• Method extends the assembly greedily by choosing the best overlaps
• Both orientations are considered
• Sequence is extended as far as possible
[Figure: layout of reads 7 (GACCGCAT), 6 (ATGCGCAT), 14 (GCATCGTG), 1 (CATCGTGA), 12 (ATCGTGAT), 19 (GCGCATCG) and 13 (CGCAGCGC), with ambiguous bases marked; consensus sequence CGCATCGTGAT]

Finding layout and consensus
• We move on to next best overlaps and extend the sequence from there
• The method stops when there are no more overlaps to consider
• A number of contigs is produced
• Contig stands for contiguous sequence, resulting from merging reads

   7  ATGCGGTC
  11      GGTCGGTG
  10       GTCGGTGA
   2         CGGTGAAG
      ATGCGGTCGGTGAAG

Whole-genome shotgun sequencing: summary
[Figure: original genome sequence and reads; a non-overlapping read forms a contig by itself, overlapping reads merge into a contig]
• Ordering of the reads is initially unknown
• Overlaps resolved by aligning the reads
• In a 3×10^9 bp genome with 500 bp reads and 5x coverage, there are on the order of 10^7 reads and 10^7(10^7 − 1)/2 ≈ 5×10^13 pairwise sequence comparisons

Repeats in DNA and genome assembly
[Figure: two instances of the same repeat]

Repeats in DNA cause problems in sequence assembly
• Recap: if repeat length exceeds read length, we might not get the correct assembly
• This is a problem especially in eukaryotes
  – 3.1% of the genome consists of repeats in Drosophila, 45% in human
• Possible solutions
  1. Increase read length – feasible?
  2. Divide genome into smaller parts, with known order, and sequence parts individually

“Divide and conquer” sequencing approaches: BAC-by-BAC
[Figure: whole-genome shotgun sequencing of the genome compared with divide-and-conquer sequencing via a BAC library]

BAC-by-BAC sequencing
• Each BAC (Bacterial Artificial Chromosome) is about 150 kbp
• Covering the human genome requires 30000 BACs
• BACs shotgun-sequenced separately
  – Number of repeats in each BAC is significantly smaller than in the whole genome...
  – ...needs much more manual work compared to whole-genome shotgun sequencing

Hybrid method
• Divide-and-conquer and whole-genome shotgun approaches can be combined
  – Obtain high coverage from whole-genome shotgun sequencing for short contigs
  – Generate a set of BAC contigs with low coverage
  – Use BAC contigs to “bin” short contigs to correct places
• This approach was used to sequence the brown Norway rat genome in 2004

First whole-genome shotgun sequencing project: Drosophila melanogaster
• Fruit fly is a common model organism in biological studies
• Whole-genome assembly reported in Eugene Myers, et al., A Whole-Genome Assembly of Drosophila, Science, 24 March 2000
• Genome size 120 Mbp

Sequencing of the Human Genome
• The (draft) human genome was published in 2001
• Two efforts:
  – Human Genome Project (public consortium)
  – Celera (private company)
• HGP: BAC-by-BAC approach
  – HGP: Nature 15 February 2001, Vol 409, Number 6822
• Celera: whole-genome shotgun sequencing
  – Celera: Science 16 February 2001, Vol 291, Issue 5507

Next-gen sequencing: 454
• Sanger sequencing is the prominent first-generation sequencing method
• Many new sequencing methods are emerging
• Genome Sequencer FLX (454 Life Science / Roche)
  – 100 Mb / 7.5 h run
  – Read length 250-300 bp
  – 99.5% accuracy / base in a single run
  – 99.99% accuracy / base in consensus

Figure: The method used by the Roche/454 sequencer to amplify single-stranded DNA copies from a fragment library on agarose beads. DNA fragments and agarose beads carrying oligonucleotides complementary to the adapters at the fragment ends are mixed in an approximately 1:1 ratio. The mixture is encapsulated by vigorous vortexing into aqueous micelles that contain PCR reactants surrounded by oil, and pipetted into a 96-well microtiter plate for PCR amplification. The resulting beads are decorated with approximately 1 million copies of the original single-stranded fragment, which provides sufficient signal strength during the pyrosequencing reaction that follows to detect and record nucleotide incorporation events. sstDNA, single-stranded template DNA.

Next-gen sequencing: Illumina Solexa
• Illumina / Solexa Genome Analyzer
  – Read length 35-50 bp
  – 1-2 Gb / 3-6 day run
  – 98.5% accuracy / base in a single run
  – 99.99% accuracy / consensus with 3x coverage

Figure: The Illumina sequencing-by-synthesis approach. Cluster strands created by bridge amplification are primed, and all four fluorescently labeled, 3’-OH-blocked nucleotides are added to the flow cell with DNA polymerase. The cluster strands are extended by one nucleotide. Following the incorporation step, the unused nucleotides and DNA polymerase molecules are washed away, a scan buffer is added to the flow cell, and the optics system scans each lane of the flow cell by imaging units called tiles. Once imaging is completed, chemicals that effect cleavage of the fluorescent labels and the 3’-OH blocking groups are added to the flow cell, which prepares the cluster strands for another round of fluorescent nucleotide incorporation.

Next-gen sequencing: SOLiD
• SOLiD
  – Read length 25-30 bp
  – 1-2 Gb / 5-10 day run
  – 99.94% accuracy / base
  – 99.999% accuracy / consensus with 15x coverage

Figure: (a) The ligase-mediated sequencing approach of the Applied Biosystems SOLiD sequencer. In a manner similar to Roche/454 emulsion PCR amplification, DNA fragments for SOLiD sequencing are amplified on the surfaces of 1 μm magnetic beads to provide sufficient signal during the sequencing reactions, and are then deposited onto a flow cell slide. Ligase-mediated sequencing begins by annealing a primer to the shared adapter sequences on each amplified fragment, and then DNA ligase is provided along with specific fluorescent-labeled 8-mers, whose 4th and 5th bases are encoded by the attached fluorescent group. Each ligation step is followed by fluorescence detection, after which a regeneration step removes bases from the ligated 8-mer (including the fluorescent group) and concomitantly prepares the extended primer for another round of ligation. (b) Principles of two-base encoding. Because each fluorescent group on a ligated 8-mer identifies a two-base combination, the resulting sequence reads can be screened for base-calling errors versus true polymorphisms versus single-base deletions by aligning the individual reads to a known high-quality reference sequence.
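
The two-base encoding mentioned in the caption can be illustrated with a small Python sketch. It uses a common way of describing the SOLiD color code: each base gets a 2-bit value and the color of an adjacent base pair is the XOR of the two values. The slides do not spell out the encoding, so the bit assignment (A=0, C=1, G=2, T=3) and the function names are assumptions made for illustration.

```python
# Assumed 2-bit values per base; the color of a pair is the XOR of its two values.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_colors(seq):
    """Return the first base plus one color (0-3) per adjacent pair of bases."""
    codes = [BASE_TO_BITS[base] for base in seq]
    return seq[0], [x ^ y for x, y in zip(codes, codes[1:])]


def decode_colors(first_base, colors):
    """Rebuild the base sequence from the first base and the color calls."""
    code = BASE_TO_BITS[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color  # each color tells how to step from one base to the next
        bases.append(BITS_TO_BASE[code])
    return "".join(bases)


def color_differences(seq_a, seq_b):
    """Count positions where the color strings of two equal-length sequences differ."""
    _, colors_a = encode_colors(seq_a)
    _, colors_b = encode_colors(seq_b)
    return sum(1 for x, y in zip(colors_a, colors_b) if x != y)


if __name__ == "__main__":
    reference = "ATGGC"
    first, colors = encode_colors(reference)
    print(first, colors)
    print(decode_colors(first, colors))
    # A true single-base substitution changes two adjacent colors.
    print(color_differences(reference, "ATCGC"))
    # A single miscalled color changes every base downstream when decoded.
    print(decode_colors("A", [3, 1, 1, 3]))
```

Under this scheme an isolated substitution in the middle of a read changes two adjacent colors, while a single wrong color call changes only one color and corrupts all decoded bases after it. That difference is what lets color-space reads be screened against a reference for errors versus polymorphisms.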