



BIOINFORMATICS Introduction

Mark Gerstein, Yale University
bioinfo.mbb.yale.edu/mbb452a
(c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu

Bioinformatics
Biological Data + Computer Calculations

What is Bioinformatics?
• (Molecular) Bio-informatics
• One idea for a definition: Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large scale.
• Bioinformatics is “MIS” for Molecular Biology Information

Molecular Biology: an Information Science
Central Dogma of Molecular Biology
DNA → RNA → Protein → Phenotype
Molecules
◊ Sequence, Structure, Function
Processes
◊ Mechanism, Specificity, Regulation
Central Paradigm for Bioinformatics
Genomic Sequence Information → mRNA (level) → Protein Sequence → Protein Structure → Protein Function → Phenotype
Large Amounts of Information
◊ Standardized
◊ Statistical
(idea from D Brutlag, Stanford, graphics from S Strobel)
Most cellular functions are performed or facilitated by proteins.
◊ Primary biocatalyst
◊ Cofactor transport/storage
◊ Mechanical motion/support
◊ Immune protection
◊ Control of growth/differentiation
Nucleic acid roles shown in the same graphic:
◊ Information transfer (mRNA)
◊ Protein synthesis (tRNA/mRNA)
◊ Genetic material
◊ Some catalytic activity

Molecular Biology Information: DNA
Raw DNA Sequence
◊ Coding or Not
◊ Parse into genes
◊ 4 bases: AGCT
◊ 1 K in a gene, 2 M in genome
atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgca
gcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatac
atggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtg
aaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatcca
gcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattc
ttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaact
ggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgca
ggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgt
gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact
gcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgca
tcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacct
gcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgtt
gttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatc
aaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacact
gaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgca
gacgctggtatcgcattaactgattctttcgttaaattggtatc
. . . . . .
caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaa
caacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtgg
cgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgctt
gctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgg
gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact
acaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaacc
aatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtc
ggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaa
aaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg

Molecular Biology Information: Protein Sequence
20 letter alphabet
◊ ACDEFGHIKLMNPQRSTVWY but not BJOUXZ
Strings of 300 aa in an average protein (in bacteria), 200 aa in a domain
200 K known protein sequences

d1dhfa  LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQNLVIMGKKTWFSI
d8dfr   LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQNAVIMGKKTWFSI
d4dfra  ISLIAALAVDRVIGMENAMPWNLPADLAWFKRNTLNKPVIMGRHTWESI
d3dfr   TAFLWAQDRDGLIGKDGHLPWHLPDDLHYFRAQTVGKIMVVGRRTYESF

d1dhfa  LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQNLVIMGKKTWFSI
d8dfr   LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQNAVIMGKKTWFSI
d4dfra  ISLIAALAVDRVIGMENAMPWNLPADLAWFKRNTLDKPVIMGRHTWESI
d3dfr   TAFLWAQDRNGLIGKDGHLPWHLPDDLHYFRAQTVGKIMVVGRRTYESF

d1dhfa  VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP
d8dfr   VPEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP
d4dfra  GRPLPGRKNIILSSQPGTDDRVTWVKSVDEAIAACGDVPEIMVIGGGRVYEQFLPKA
d3dfr   PKRPLPERTNVVLTHQEDYQAQGAVVVHDVAAVFAYAKQHLDQELVIAGGAQIFTAFKDDV

d1dhfa  PEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP
d8dfr   PEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP
d4dfra  GRPLPGRKNIILSSSQPGTDDRVTWVKSVDEAIAACGDVPE.IMVIGGGRVYEQFLPKA
d3dfr   PKRPLPERTNVVLTHQEDYQAQGAVVVHDVAAVFAYAKQHLDQELVIAGGAQIFTAFKDDV

Molecular Biology Information: Macromolecular Structure
DNA/RNA/Protein
◊ Almost all protein
(RNA adapted from D Soll web page, right-hand top protein from M Levitt web page)

Molecular Biology Information: Protein Structure Details
Statistics on Number of XYZ triplets
◊ 200 residues/domain, so 200 CA atoms, separated by 3.8 Å
◊ Avg. residue is Leu: 4 backbone atoms + 4 sidechain atoms, 150 cubic Å
◊ So roughly 1500 xyz triplets (8 x 200 = 1600) per protein domain
◊ 10 K known domains, 300 folds

ATOM      1  C   ACE     0       9.401  30.166  60.595  1.00 49.88      1GKY  67
ATOM      2  O   ACE     0      10.432  30.832  60.722  1.00 50.35      1GKY  68
ATOM      3  CH3 ACE     0       8.876  29.767  59.226  1.00 50.04      1GKY  69
ATOM      4  N   SER     1       8.753  29.755  61.685  1.00 49.13      1GKY  70
ATOM      5  CA  SER     1       9.242  30.200  62.974  1.00 46.62      1GKY  71
ATOM      6  C   SER     1      10.453  29.500  63.579  1.00 41.99      1GKY  72
ATOM      7  O   SER     1      10.593  29.607  64.814  1.00 43.24      1GKY  73
ATOM      8  CB  SER     1       8.052  30.189  63.974  1.00 53.00      1GKY  74
ATOM      9  OG  SER     1       7.294  31.409  63.930  1.00 57.79      1GKY  75
ATOM     10  N   ARG     2      11.360  28.819  62.827  1.00 36.48      1GKY  76
ATOM     11  CA  ARG     2      12.548  28.316  63.532  1.00 30.20      1GKY  77
ATOM     12  C   ARG     2      13.502  29.501  63.500  1.00 25.54      1GKY  78
...
ATOM   1444  CB  LYS   186      13.836  22.263  57.567  1.00 55.06      1GKY1510
ATOM   1445  CG  LYS   186      12.422  22.452  58.180  1.00 53.45      1GKY1511
ATOM   1446  CD  LYS   186      11.531  21.198  58.185  1.00 49.88      1GKY1512
ATOM   1447  CE  LYS   186      11.452  20.402  56.860  1.00 48.15      1GKY1513
ATOM   1448  NZ  LYS   186      10.735  21.104  55.811  1.00 48.41      1GKY1514
ATOM   1449  OXT LYS   186      16.887  23.841  56.647  1.00 62.94      1GKY1515
TER    1450      LYS   186                                              1GKY1516

Genomes highlight the Finiteness of the World of Sequences
◊ 1995: Bacteria, 1.6 Mb, 1600 genes (Science 269: 496)
◊ 1997: Eukaryote, 13 Mb, 6K genes (Nature 387: 1)
◊ 1998: Animal, 100 Mb, 20K genes (Science 282: 1945)
◊ 2000: Human, 3 Gb, 100K genes

Molecular Biology Information: Whole Genomes
The Revolution Driving Everything
Fleischmann, R. D., Adams, M. D., White, O., Clayton, R.
A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J. M., McKenney, K., Sutton, G., Fitzhugh, W., Fields, C., Gocayne, J. D., Scott, J., Shirley, R., Liu, L. I., Glodek, A., Kelley, J. M., Weidman, J. F., Phillips, C. A., Spriggs, T., Hedblom, E., Cotton, M. D., Utterback, T. R., Hanna, M. C., Nguyen, D. T., Saudek, D. M., Brandon, R. C., Fine, L. D., Fritchman, J. L., Fuhrmann, J. L., Geoghagen, N. S. M., Gnehm, C. L., McDonald, L. A., Small, K. V., Fraser, C. M., Smith, H. O. & Venter, J. C. (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae rd." Science 269: 496-512.
(Picture adapted from TIGR website, http://www.tigr.org)
"Genome sequences now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading." G A Petsko, Nature 401: 115-116 (1999)
Integrative Data
◊ 1995, HI (bacteria): 1.6 Mb, 1600 genes, done
◊ 1997, yeast: 13 Mb, 6000 genes for yeast
◊ 1998, worm: 100 Mb with 19 K genes
◊ 1999: 30 completed genomes
◊ 2003, human: 3 Gb, 100 K genes...

Gene Expression Datasets: the Transcriptosome
◊ Young/Lander, Chips, Abs. Exp.
◊ Brown, µarray, Rel. Exp. over Timecourse
◊ Snyder, Transposons, Protein Exp.
◊ Also: SAGE; Samson and Church, Chips; Aebersold, Protein Expression

Array Data
Yeast Expression Data in Academia: levels for all 6000 genes
Can only sequence genome once but can do an infinite variety of these array experiments
◊ at 10 time points, 6000 x 10 = 60K floats
◊ telling signal from background
(courtesy of J Hager)

Other Whole-Genome Experiments
Systematic Knockouts
Winzeler, E. A., Shoemaker, D. D., Astromoff, A., Liang, H., Anderson, K., Andre, B., Bangham, R., Benito, R., Boeke, J. D., Bussey, H., Chu, A. M., Connelly, C., Davis, K., Dietrich, F., Dow, S. W., El Bakkoury, M., Foury, F., Friend, S. H., Gentalen, E., Giaever, G., Hegemann, J. H., Jones, T., Laub, M., Liao, H., Davis, R. W. et al. (1999). Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285, 901-6
2 hybrids, linkage maps
Hua, S. B., Luo, Y., Qiu, M., Chan, E., Zhou, H. & Zhu, L. (1998). Construction of a modular yeast two-hybrid cDNA library from human EST clones for the human genome protein linkage map. Gene 215, 143-52
For yeast: 6000 x 6000 / 2 = 18M interactions

Molecular Biology Information: Other Integrative Data
Information to understand genomes
◊ Metabolic Pathways (glycolysis), traditional biochemistry
◊ Regulatory Networks
◊ Whole Organisms Phylogeny, traditional zoology
◊ Environments, Habitats, ecology
◊ The Literature (MEDLINE)
The Future....
(Pathway drawing from P Karp’s EcoCyc, Phylogeny from S J Gould, Dinosaur in a Haystack)

Exponential Growth of Data Matched by Development of Computer Technology
CPU vs Disk and Net
◊ As important as the increase in computer speed has been, the ability to store large amounts of information on computers is even more crucial
Driving Force in Bioinformatics
[Figures: Internet Hosts, 1979 to 1995; Num. Protein Domain Structures and CPU Instruction Time (ns), 1980 to 1995]
(Internet picture adapted from D Brutlag, Stanford)

Bioinformatics is born
(courtesy of Finn Drablos)

Weber Cartoon

The Character of Molecular Biology Information: Redundancy and Multiplicity
Different Sequences Have the Same Structure
Organism has many similar genes
Single Gene May Have Multiple Functions
Genes are grouped into Pathways
Genomic Sequence Redundancy due to the Genetic Code
How do we find the similarities?
Integrative Genomics
genes ↔ structures ↔ functions ↔ pathways ↔ expression levels ↔ regulatory systems ↔ ….

New Paradigm for Scientific Computing
Because of the increase in data and the improvement in computers, new calculations become possible
But Bioinformatics has a new style of calculation...
◊ Two Paradigms
Physics
◊ Prediction based on physical principles
◊ Exact Determination of Rocket Trajectory
◊ Supercomputer, CPU
Biology
◊ Classifying information and discovering unexpected relationships
◊ globin, colicin, plastocyanin, repressor
◊ networks, “federated” database

General Types of “Informatics” in Bioinformatics
Databases
◊ Building, Querying
◊ Object DB
Text String Comparison
◊ Text Search
◊ 1D Alignment
◊ Significance Statistics
◊ Alta Vista, grep
Finding Patterns
◊ AI / Machine Learning
◊ Clustering
◊ Datamining
Geometry
◊ Robotics
◊ Graphics (Surfaces, Volumes)
◊ Comparison and 3D Matching (Vision, recognition)
Physical Simulation
◊ Newtonian Mechanics
◊ Electrostatics
◊ Numerical Algorithms
◊ Simulation

Bioinformatics Topics: Genome Sequence
Finding Genes in Genomic DNA
◊ introns
◊ exons
◊ promoters
Characterizing Repeats in Genomic DNA
◊ Statistics
◊ Patterns
Duplications in the Genome

Bioinformatics Topics: Protein Sequence
Sequence Alignment
◊ non-exact string matching, gaps
◊ How to align two strings optimally via Dynamic Programming
◊ Local vs Global Alignment
◊ Suboptimal Alignment
◊ Hashing to increase speed (BLAST, FASTA)
◊ Amino acid substitution scoring matrices
Multiple Alignment and Consensus Patterns
◊ How to align more than one sequence and then fuse the result in a consensus representation
◊ Transitive Comparisons
◊ HMMs, Profiles
◊ Motifs
Scoring schemes and Matching statistics
◊ How to tell if a given alignment or match is statistically significant
◊ A P-value (or an e-value)
◊ Score Distributions (extreme val. dist.)
◊ Low Complexity Sequences

Bioinformatics Topics: Sequence / Structure
Secondary Structure “Prediction”
◊ via Propensities
◊ Neural Networks, Genetic Alg.
◊ Simple Statistics
◊ TM-helix finding
◊ Assessing Secondary Structure Prediction
Tertiary Structure Prediction
◊ Fold Recognition
◊ Threading
◊ Ab initio
Function Prediction
◊ Active site identification
Relation of Sequence Similarity to Structural Similarity

Topics: Structures
Basic Protein Geometry and Least-Squares Fitting
◊ Distances, Angles, Axes, Rotations
◊ Calculating a helix axis in 3D via fitting a line
◊ LSQ fit of 2 structures
◊ Molecular Graphics
Calculation of Volume and Surface
◊ How to represent a plane
◊ How to represent a solid
◊ How to calculate an area
◊ Docking and Drug Design as Surface Matching
◊ Packing Measurement
Structural Alignment
◊ Aligning sequences on the basis of 3D structure
◊ DP does not converge, unlike sequences; what to do
◊ Other Approaches: Distance Matrices, Hashing
◊ Fold Library

Topics: Databases
Relational Database Concepts
◊ Keys, Foreign Keys
◊ SQL, OODBMS, views, forms, transactions, reports, indexes
◊ Joining Tables, Normalization: Natural Join as "where" selection on cross product; Array Referencing (perl/dbm)
◊ Forms and Reports
◊ Cross-tabulation
Protein Units
◊ What are the units of biological information: sequence, structure; motifs, modules, domains
◊ How classified: folds, motions, pathways, functions
Clustering and Trees
◊ Basic clustering: UPGMA, single-linkage, multiple linkage
◊ Other Methods: Parsimony, Maximum likelihood
◊ Evolutionary implications
The Bias Problem
◊ sequence weighting
◊ sampling

Topics: Genomics
Expression Analysis
◊ Time Courses clustering
◊ Measuring differences
◊ Identifying Regulatory Regions
Large scale cross referencing of information
Function Classification and Orthologs
The Genomic vs. Single-molecule Perspective
Genome Trees
Genome Comparisons
◊ Ortholog Families, pathways
◊ Large-scale censuses
◊ Frequent Words Analysis
◊ Genome Annotation
◊ Trees from Genomes
◊ Identification of interacting proteins
Structural Genomics
◊ Folds in Genomes, shared and common folds
◊ Bulk Structure Prediction

Topics: Simulation
Molecular Simulation
◊ Geometry → Energy → Forces
◊ Basic interactions, potential energy functions
◊ Electrostatics
◊ VDW Forces
◊ Bonds as Springs
◊ How structure changes over time: how to measure the change in a vector (gradient)
◊ Molecular Dynamics, MC
◊ Energy Minimization
Parameter Sets
Number Density
Poisson-Boltzmann Equation
Lattice Models and Simplification

Bioinformatics Schematic

Background
Need to Know Today
◊ Math: Calculation of Standard Deviation, a Bell-shaped Distribution (of test scores), a 3D vector
◊ Biology: DNA, RNA, alpha helix, the cell nucleus, ATP
What You’ll Learn
◊ Math: Force is the Derivative (grad) of Energy, Rotation Matrices (3D), a P-value of .01 and an Extreme Value Distribution
◊ Biology: Proteins are tightly packed, sequence homology twilight zone, protein families
Not really necessary….
◊ Math: Poisson-Boltzmann Equation, Design a Hashing Function, Write a Recursive Descent Parser
◊ Biology: What GroEL does, a worm is a metazoa, E. coli is gram negative, what chemokines are

Are They or Aren’t They Bioinformatics? (1)
Digital Libraries
◊ Automated Bibliographic Search and Textual Comparison
◊ Knowledge bases for biological literature
Motif Discovery Using Gibbs Sampling
Methods for Structure Determination
◊ Computational Crystallography Refinement
◊ NMR Structure Determination
Distance Geometry
Metabolic Pathway Simulation
The DNA Computer

Are They or Aren’t They Bioinformatics? (1, Answers)
(YES) Digital Libraries
◊ Automated Bibliographic Search and Textual Comparison
◊ Knowledge bases for biological literature
(YES) Motif Discovery Using Gibbs Sampling
(NO) Methods for Structure Determination
◊ Computational Crystallography Refinement
◊ NMR Structure Determination
(YES) Distance Geometry
(YES) Metabolic Pathway Simulation
(NO) The DNA Computer

Are They or Aren’t They Bioinformatics? (2)
Gene identification by sequence inspection
◊ Prediction of splice sites
DNA methods in forensics
Modeling of Populations of Organisms
◊ Ecological Modeling
Genomic Sequencing Methods
◊ Assembling Contigs
◊ Physical and genetic mapping
Linkage Analysis
◊ Linking specific genes to various traits

Are They or Aren’t They Bioinformatics? (2, Answers)
(YES) Gene identification by sequence inspection
◊ Prediction of splice sites
(YES) DNA methods in forensics
(NO) Modeling of Populations of Organisms
◊ Ecological Modeling
(NO) Genomic Sequencing Methods
◊ Assembling Contigs
◊ Physical and genetic mapping
(YES) Linkage Analysis
◊ Linking specific genes to various traits

Are They or Aren’t They Bioinformatics? (3)
RNA structure prediction, identification in sequences
Radiological Image Processing
◊ Computational Representations for Human Anatomy (visible human)
Artificial Life Simulations
◊ Artificial Immunology / Computer Security
◊ Genetic Algorithms in molecular biology
Homology modeling
Determination of Phylogenies Based on Non-molecular Organism Characteristics
Computerized Diagnosis based on Genetic Analysis (Pedigrees)

Are They or Aren’t They Bioinformatics? (3, Answers)
(YES) RNA structure prediction, identification in sequences
(NO) Radiological Image Processing
◊ Computational Representations for Human Anatomy (visible human)
(NO) Artificial Life Simulations
◊ Artificial Immunology / Computer Security
◊ (NO) Genetic Algorithms in molecular biology
(YES) Homology modeling
(NO) Determination of Phylogenies Based on Non-molecular Organism Characteristics
(NO) Computerized Diagnosis based on Genetic Analysis (Pedigrees)

Major Application I: Designing Drugs
Understanding How Structures Bind Other Molecules (Function)
Designing Inhibitors
Docking, Structure Modeling
(From left to right, figures adapted from Olsen Group Docking Page at Scripps, Dyson NMR Group Web page at Scripps, and from Computational Chemistry Page at Cornell Theory Center.)

Major Application II: Finding Homologues
Find Similar Ones in Different Organisms
Human vs. Mouse vs. Yeast
◊ Easier to do Expts. on latter
(Section from NCBI Disease Genes Database Reproduced Below.)

Best Sequence Similarity Matches to Date Between Positionally Cloned Human Genes and S. cerevisiae Proteins
Human Disease | MIM | Human Gene | GenBank Acc for Human cDNA | BLASTX P-value | Yeast Gene | GenBank Acc for Yeast cDNA | Yeast Gene Description
Hereditary Non-polyposis Colon Cancer | 120436 | MSH2 | U03911 | 9.2e-261 | MSH2 | M84170 | DNA repair protein
Hereditary Non-polyposis Colon Cancer | 120436 | MLH1 | U07418 | 6.3e-196 | MLH1 | U07187 | DNA repair protein
Cystic Fibrosis | 219700 | CFTR | M28668 | 1.3e-167 | YCF1 | L35237 | Metal resistance protein
Wilson Disease | 277900 | WND | U11700 | 5.9e-161 | CCC2 | L36317 | Probable copper transporter
Glycerol Kinase Deficiency | 307030 | GK | L13943 | 1.8e-129 | GUT1 | X69049 | Glycerol kinase
Bloom Syndrome | 210900 | BLM | U39817 | 2.6e-119 | SGS1 | U22341 | Helicase
Adrenoleukodystrophy, X-linked | 300100 | ALD | Z21876 | 3.4e-107 | PXA1 | U17065 | Peroxisomal ABC transporter
Ataxia Telangiectasia | 208900 | ATM | U26455 | 2.8e-90 | TEL1 | U31331 | PI3 kinase
Amyotrophic Lateral Sclerosis | 105400 | SOD1 | K00065 | 2.0e-58 | SOD1 | J03279 | Superoxide dismutase
Myotonic Dystrophy | 160900 | DM | L19268 | 5.4e-53 | YPK1 | M21307 | Serine/threonine protein kinase
Lowe Syndrome | 309000 | OCRL | M88162 | 1.2e-47 | YIL002C | Z47047 | Putative IPP-5-phosphatase
Neurofibromatosis, Type 1 | 162200 | NF1 | M89914 | 2.0e-46 | IRA2 | M33779 | Inhibitory regulator protein
Choroideremia | 303100 | CHM | X78121 | 2.1e-42 | GDI1 | S69371 | GDP dissociation inhibitor
Diastrophic Dysplasia | 222600 | DTD | U14528 | 7.2e-38 | SUL1 | X82013 | Sulfate permease
Lissencephaly | 247200 | LIS1 | L13385 | 1.7e-34 | MET30 | L26505 | Methionine metabolism
Thomsen Disease | 160800 | CLC1 | Z25884 | 7.9e-31 | GEF1 | Z23117 | Voltage-gated chloride channel
Wilms Tumor | 194070 | WT1 | X51630 | 1.1e-20 | FZF1 | X67787 | Sulphite resistance protein
Achondroplasia | 100800 | FGFR3 | M58051 | 2.0e-18 | IPL1 | U07163 | Serine/threonine protein kinase
Menkes Syndrome | 309400 | MNK | X69208 | 2.1e-17 | CCC2 | L36317 | Probable copper transporter

Major Application II: Finding Homologues (cont.)
Cross-Referencing, one thing to another thing
Sequence Comparison and Scoring
Analogous Problems for Structure Comparison
Comparison has two parts:
(1) Optimally Aligning 2 entities to get a Comparison Score
(2) Assessing Significance of this score in a given Context
Integrated Presentation
◊ Align Sequences
◊ Align Structures
◊ Score in a Uniform Framework

Major Application II: Overall Genome Characterization
Overall Occurrence of a Certain Feature in the Genome
◊ e.g. how many kinases in Yeast
Compare Organisms and Tissues
◊ Expression levels in Cancerous vs Normal Tissues
Databases, Statistics
(Clock figures, yeast v. Synechocystis, adapted from GeneQuiz Web Page, Sander Group, EBI)

Simplifying Genomes with Folds, Pathways, etc.
[Figure: 100000 genes (human); 1000 folds; 1000 genes (T. pallidum)]

At What Structural Resolution Are Organisms Different?
[Figure: scale running from 1 m through 100 Å and 10 Å to 1 Å, labeled person, plant; protein fold (Ig); supersecondary structure (ββ, TM-TM, αβαβ, ααα); helix, strand; individual atom (C, H, O...)]
Practical Relevance
(Pathogen-only folds as possible Drug targets)
[Figure: numbered fold lists for human and T. pallidum]
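
The two-part comparison described under "Finding Homologues (cont.)" (optimally align two entities to get a score, then assess the significance of that score) can be sketched in a few lines of Python. This is only an illustrative sketch and not code from the course: the function names, the scoring values (match +1, mismatch -1, gap -2) and the shuffle-based significance estimate are all assumptions chosen for brevity. A realistic protein comparison would more likely use an amino acid substitution matrix and an extreme-value score distribution, both of which appear in the topic lists earlier in the deck.

```python
import random


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch style) alignment score via dynamic programming.

    The scoring values are illustrative assumptions, not a real substitution matrix.
    """
    # prev[j] = best score for aligning the rows of `a` processed so far with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix costs i gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def empirical_p_value(a, b, trials=200, seed=0):
    """Fraction of shuffled copies of `b` that score at least as well as `b` itself.

    A simple permutation estimate (an assumption made for this sketch).
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    observed = global_alignment_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if global_alignment_score(a, "".join(letters)) >= observed:
            hits += 1
    # add-one correction so the estimate is never exactly zero
    return (hits + 1) / (trials + 1)


# Usage: the first d1dhfa and d8dfr fragments shown on the protein sequence slide.
d1dhfa = "LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQNLVIMGKKTWFSI"
d8dfr = "LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQNAVIMGKKTWFSI"
print("score:", global_alignment_score(d1dhfa, d8dfr))
print("empirical p-value:", empirical_p_value(d1dhfa, d8dfr))
```

Shuffling keeps the amino acid composition of the second sequence fixed, so the estimate asks whether the observed score depends on residue order rather than on composition alone. With only a few hundred shuffles the smallest reportable value is about 1/(trials + 1), which is one reason extreme-value statistics are preferred for database-scale searches.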