Bioinformatics algorithms ppt

bioinformatics and computer science ppt and bioinformatics dynamic programming ppt
FrankRoberts,France,Researcher
Published Date:11-07-2017
BIOINFORMATICS
Introduction
Mark Gerstein, Yale University
bioinfo.mbb.yale.edu/mbb452a
(c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu

Bioinformatics
Biological Data + Computer Calculations

What is Bioinformatics?
• (Molecular) Bio-informatics
• One idea for a definition? Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale.
• Bioinformatics is “MIS” for Molecular Biology Information

Molecular Biology: an Information Science
Central Dogma of Molecular Biology
DNA -> RNA -> Protein -> Phenotype -> DNA
Molecules
◊ Sequence, Structure, Function
Processes
◊ Mechanism, Specificity, Regulation
Central Paradigm for Bioinformatics
Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Phenotype
Large Amounts of Information
◊ Standardized
◊ Statistical
(idea from D Brutlag, Stanford, graphics from S Strobel)
Most cellular functions are performed or facilitated by proteins.
Primary biocatalyst
Cofactor transport/storage
Mechanical motion/support
Immune protection
Control of growth/differentiation
Information transfer (mRNA)
Protein synthesis (tRNA/mRNA)
Genetic material
Some catalytic activity

Molecular Biology Information - DNA
Raw DNA Sequence
◊ Coding or Not?
◊ Parse into genes?
atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgca
gcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatac
atggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtg
aaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatcca
gcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattc
ttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaact
ggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgca
ggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgt
gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact
gcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgca
tcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacct
gcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgtt
gttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatc
aaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacact
gaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgca
gacgctggtatcgcattaactgattctttcgttaaattggtatc
. . . . . .
caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaa
caacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtgg
cgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgctt
gctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgg
gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact
acaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaacc
aatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtc
ggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaa
aaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg
◊ 4 bases: AGCT
◊ 1 K in a gene, 2 M in genome

Molecular Biology Information: Protein Sequence
20 letter alphabet
◊ ACDEFGHIKLMNPQRSTVWY but not BJOUXZ
Strings of 300 aa in an average protein (in bacteria), 200 aa in a domain
200 K known protein sequences

d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
d4dfra_ ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTLNKPVIMGRHTWESI
d3dfr__ TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTVGKIMVVGRRTYESF

d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
d4dfra_ ISLIAALAVDRVIGMENAMPW-NLPADLAWFKRNTLDKPVIMGRHTWESI
d3dfr__ TAFLWAQDRNGLIGKDGHLPW-HLPDDLHYFRAQTVGKIMVVGRRTYESF

d1dhfa_ VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP
d8dfr__ VPEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP
d4dfra_ -G-RPLPGRKNIILS-SQPGTDDRV-TWVKSVDEAIAACGDVPEIMVIGGGRVYEQFLPKA
d3dfr__ -PKRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLDQELVIAGGAQIFTAFKDDV

d1dhfa_ -PEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP
d8dfr__ -PEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP
d4dfra_ -G-RPLPGRKNIILSSSQPGTDDRV-TWVKSVDEAIAACGDVPE-.IMVIGGGRVYEQFLPKA
d3dfr__ -PKRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLDQELVIAGGAQIFTAFKDDV

Molecular Biology Information: Macromolecular Structure
DNA/RNA/Protein
◊ Almost all protein
(RNA Adapted From D Soll Web Page, Right Hand Top Protein from M Levitt web page)

Molecular Biology Information: Protein Structure Details
Statistics on Number of XYZ triplets
◊ 200 residues/domain -> 200 CA atoms, separated by 3.8 A
◊ Avg. Residue is Leu: 4 backbone atoms + 4 sidechain atoms, 150 cubic A
◊ => ~1500 xyz triplets (= 8 x 200) per protein domain
◊ 10 K known domains, 300 folds

ATOM      1  C   ACE     0       9.401  30.166  60.595  1.00 49.88      1GKY  67
ATOM      2  O   ACE     0      10.432  30.832  60.722  1.00 50.35      1GKY  68
ATOM      3  CH3 ACE     0       8.876  29.767  59.226  1.00 50.04      1GKY  69
ATOM      4  N   SER     1       8.753  29.755  61.685  1.00 49.13      1GKY  70
ATOM      5  CA  SER     1       9.242  30.200  62.974  1.00 46.62      1GKY  71
ATOM      6  C   SER     1      10.453  29.500  63.579  1.00 41.99      1GKY  72
ATOM      7  O   SER     1      10.593  29.607  64.814  1.00 43.24      1GKY  73
ATOM      8  CB  SER     1       8.052  30.189  63.974  1.00 53.00      1GKY  74
ATOM      9  OG  SER     1       7.294  31.409  63.930  1.00 57.79      1GKY  75
ATOM     10  N   ARG     2      11.360  28.819  62.827  1.00 36.48      1GKY  76
ATOM     11  CA  ARG     2      12.548  28.316  63.532  1.00 30.20      1GKY  77
ATOM     12  C   ARG     2      13.502  29.501  63.500  1.00 25.54      1GKY  78
...
ATOM   1444  CB  LYS   186      13.836  22.263  57.567  1.00 55.06      1GKY1510
ATOM   1445  CG  LYS   186      12.422  22.452  58.180  1.00 53.45      1GKY1511
ATOM   1446  CD  LYS   186      11.531  21.198  58.185  1.00 49.88      1GKY1512
ATOM   1447  CE  LYS   186      11.452  20.402  56.860  1.00 48.15      1GKY1513
ATOM   1448  NZ  LYS   186      10.735  21.104  55.811  1.00 48.41      1GKY1514
ATOM   1449  OXT LYS   186      16.887  23.841  56.647  1.00 62.94      1GKY1515
TER    1450      LYS   186                                              1GKY1516

Genomes highlight the Finiteness of the World of Sequences
1995 Bacteria, 1.6 Mb, 1600 genes (Science 269: 496)
1997 Eukaryote, 13 Mb, 6K genes (Nature 387: 1)
1998 Animal, 100 Mb, 20K genes (Science 282: 1945)
2000? Human, 3 Gb, 100K genes (???)

Molecular Biology Information: Whole Genomes
The Revolution Driving Everything
Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J. M., McKenney, K., Sutton, G., Fitzhugh, W., Fields, C., Gocayne, J. D., Scott, J., Shirley, R., Liu, L. I., Glodek, A., Kelley, J. M., Weidman, J. F., Phillips, C. A., Spriggs, T., Hedblom, E., Cotton, M. D., Utterback, T. R., Hanna, M. C., Nguyen, D. T., Saudek, D. M., Brandon, R. C., Fine, L. D., Fritchman, J. L., Fuhrmann, J. L., Geoghagen, N. S. M., Gnehm, C. L., McDonald, L. A., Small, K. V., Fraser, C. M., Smith, H. O. & Venter, J. C. (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae Rd." Science 269: 496-512.
(Picture adapted from TIGR website, http://www.tigr.org)
Integrative Data
1995, HI (bacteria): 1.6 Mb & 1600 genes done
1997, yeast: 13 Mb & 6000 genes for yeast
1998, worm: 100 Mb with 19 K genes
1999: 30 completed genomes
2003, human: 3 Gb & 100 K genes...
"Genome sequences now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading."
G A Petsko, Nature 401: 115-116 (1999)

Gene Expression Datasets: the Transcriptosome
Young/Lander, Chips, Abs. Exp.
Brown, µarray, Rel. Exp. over Timecourse
Snyder, Transposons, Protein Exp.
Also: SAGE; Samson and Church, Chips; Aebersold, Protein Expression

Array Data
Yeast Expression Data in Academia: levels for all 6000 genes
Can only sequence genome once but can do an infinite variety of these array experiments
at 10 time points, 6000 x 10 = 60K floats
telling signal from background
(courtesy of J Hager)

Other Whole-Genome Experiments
Systematic Knockouts
Winzeler, E. A., Shoemaker, D. D., Astromoff, A., Liang, H., Anderson, K., Andre, B., Bangham, R., Benito, R., Boeke, J. D., Bussey, H., Chu, A. M., Connelly, C., Davis, K., Dietrich, F., Dow, S. W., El Bakkoury, M., Foury, F., Friend, S. H., Gentalen, E., Giaever, G., Hegemann, J. H., Jones, T., Laub, M., Liao, H., Davis, R. W. et al. (1999). Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285, 901-6
2 hybrids, linkage maps
Hua, S. B., Luo, Y., Qiu, M., Chan, E., Zhou, H. & Zhu, L. (1998). Construction of a modular yeast two-hybrid cDNA library from human EST clones for the human genome protein linkage map. Gene 215, 143-52
For yeast: 6000 x 6000 / 2 ~ 18M interactions

Molecular Biology Information: Other Integrative Data
Information to understand genomes
◊ Metabolic Pathways (glycolysis), traditional biochemistry
◊ Regulatory Networks
◊ Whole Organisms Phylogeny, traditional zoology
◊ Environments, Habitats, ecology
◊ The Literature (MEDLINE)
The Future....
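The dataset sizes quoted on the array and two-hybrid slides above follow from simple counting. As a rough sketch (the function names are illustrative assumptions, not from the lecture), the Python below reproduces the 6000 x 10 expression matrix size and the roughly 18M possible pairwise interactions among 6000 yeast genes. The slide's "6000 x 6000 / 2" is an approximation; the exact count depends on whether self-pairs are included.

```python
def expression_matrix_size(n_genes: int, n_timepoints: int) -> int:
    """Number of expression values (floats) for one timecourse experiment."""
    return n_genes * n_timepoints


def unordered_pairs(n_genes: int, include_self: bool = False) -> int:
    """Number of unordered gene pairs, i.e. candidate pairwise interactions.

    With include_self=True a gene may also pair with itself.
    """
    if include_self:
        return n_genes * (n_genes + 1) // 2
    return n_genes * (n_genes - 1) // 2


if __name__ == "__main__":
    print(expression_matrix_size(6000, 10))         # 60000, the slide's "60K floats"
    print(unordered_pairs(6000))                    # 17997000, about 18M
    print(unordered_pairs(6000, include_self=True)) # 18003000, also about 18M
```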
(Pathway drawing from P Karp’s EcoCyc, Phylogeny from S J Gould, Dinosaur in a Haystack)

Exponential Growth of Data Matched by Development of Computer Technology
CPU vs Disk & Net
◊ As important as the increase in computer speed has been, the ability to store large amounts of information on computers is even more crucial
Driving Force in Bioinformatics
[Figure: Internet Hosts, 1979-1995]
[Figure: Num. Protein Domain Structures and CPU Instruction Time (ns), 1980-1995]
(Internet picture adapted from D Brutlag, Stanford)

Bioinformatics is born
(courtesy of Finn Drablos)

Weber Cartoon

The Character of Molecular Biology Information: Redundancy and Multiplicity
Different Sequences Have the Same Structure
Organism has many similar genes
Single Gene May Have Multiple Functions
Genes are grouped into Pathways
Genomic Sequence Redundancy due to the Genetic Code
How do we find the similarities?.....
Integrative Genomics - genes ↔ structures ↔ functions ↔ pathways ↔ expression levels ↔ regulatory systems ↔ ….

New Paradigm for Scientific Computing
Because of increase in data and improvement in computers, new calculations become possible
But Bioinformatics has a new style of calculation...
◊ Two Paradigms
Physics
◊ Prediction based on physical principles
◊ Exact Determination of Rocket Trajectory
◊ Supercomputer, CPU
Biology
◊ Classifying information and discovering unexpected relationships
◊ globin colicin plastocyanin repressor
◊ networks, “federated” database

General Types of “Informatics” in Bioinformatics
Databases
◊ Building, Querying
◊ Object DB
Text String Comparison
◊ Text Search
◊ 1D Alignment
◊ Significance Statistics
◊ Alta Vista, grep
Finding Patterns
◊ AI / Machine Learning
◊ Clustering
◊ Datamining
Geometry
◊ Robotics
◊ Graphics (Surfaces, Volumes)
◊ Comparison and 3D Matching (Vision, recognition)
Physical Simulation
◊ Newtonian Mechanics
◊ Electrostatics
◊ Numerical Algorithms
◊ Simulation
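
The "1D Alignment" item under Text String Comparison is commonly approached with dynamic programming. As a minimal sketch that is not taken from the slides, the Python function below computes a global alignment score in the Needleman-Wunsch style. The function name and the match, mismatch, and gap scores are illustrative assumptions; practical tools typically use substitution matrices and affine gap penalties instead of these flat scores.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -1) -> int:
    """Best global alignment score of a and b under a simple linear gap model.

    Only the previous row of the dynamic-programming table is kept,
    so memory use is O(len(b)) while time is O(len(a) * len(b)).
    """
    # Row 0: aligning the empty prefix of a against prefixes of b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against the empty prefix of b.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # First 15 residues of the d1dhfa_ and d8dfr__ rows from the alignment slide.
    print(global_alignment_score("LNCIVAVSQNMGIGK", "LNSIVAVCQNMGIGK"))
```

Keeping only one row is enough when just the score is needed; recovering the alignment itself would require storing the full table or traceback pointers.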