Lecture notes on Bioinformatics

Published Date:11-07-2017
Lecture Notes
Institute of Bioinformatics, Johannes Kepler University Linz

Bioinformatics I
Sequence Analysis and Phylogenetics
Winter Semester 2013/2014
by Sepp Hochreiter

Institute of Bioinformatics, Johannes Kepler University Linz, A-4040 Linz, Austria
Tel. +43 732 2468 8880, Fax +43 732 2468 9511
http://www.bioinf.jku.at

Chapter 1
Biological Basics

This chapter gives an overview of the biological basics needed in bioinformatics. Students with a background in biology or life sciences may skip this chapter if they are familiar with cell biology or molecular biology. The chapter starts with the structure of the eukaryotic cell, then states the “central dogma of molecular biology”, explains DNA and RNA, discusses transcription, explains splicing, introduces amino acids, describes the genetic code, explains translation, and finally summarizes the protein folding process.

1.1 The Cell

Each human consists of 10 to 100 trillion (10^13 to 10^14) cells, which have quite different functions. Muscle cells transform chemical energy into mechanical energy, nerve cells transport information via electrical potentials, liver cells produce enzymes, sensory cells respond to external conditions, blood cells transport oxygen, sperm and egg cells are needed for reproduction, and connective tissue cells are needed for bone, fat, fibers, etc.

We focus on eukaryotic cells, i.e. complex cells with a nucleus as in mammals, in contrast to prokaryotic cells (no nucleus) found in bacteria and archaea (organisms similar to bacteria which live in extreme conditions). Each cell is a very complex organization, like a whole country with power plants, export and import of products, a library, production machines, a highly developed organization to protect its property, delivery systems, defense mechanisms, an information network, and control, repair and regulation mechanisms.

A cell’s diameter is between 10 and 30 µm, and a cell consists mostly of water inside a membrane “bag”.
The membrane is a phospholipid bilayer with pores which allow things to go out of and into the cell. The fluid within a cell is called “the cytoplasm”. Besides water it consists of free amino acids, proteins, nucleic acids (RNA and DNA), glucose (the energy supply medium), and more. The molecules of the cytoplasm are 50% proteins, 15% nucleic acids, 15% carbohydrates (storage devices or building blocks for structures), 10% lipids (structures with water hating tails; needed to build membranes), and 10% other. Inside the cytoplasm there are various structures called organelles (with membranes), whereas the remaining fluid is called “cytosol” (mostly water).

Organelles:
Nucleus: location of the DNA, transcription and many “housekeeping” proteins; its center is the nucleolus, where ribosomal RNA is produced.
Endoplasmic Reticulum (ER): protein construction and transport machinery; the smooth ER also participates in the synthesis of various lipids, fatty acids and steroids (e.g., hormones) and in carbohydrate metabolism.
Ribosomes: either located on the ER or free in the cytosol; machinery for translation, i.e. mRNA is transformed into amino acid sequences which fold and become the proteins.
Golgi Apparatus: glycosylation, secretion; processes proteins which are transported in vesicles (chemical changes or adding of molecules).
Lysosomes: digestion; contain digestive enzymes (acid hydrolases) to digest macromolecules, including lipases, which digest lipids, carbohydrases for the digestion of carbohydrates (e.g., sugars), proteases for proteins, and nucleases, which digest nucleic acids.
Centrosome: important for the cell cycle.
Peroxisomes: catabolic reactions through oxygen; they rid the cell of toxic substances.
Microtubules: built from tubulin; cell structure elements (size of the cell) and transport ways for transport proteins.
Cytoskeleton: microtubules, actin and intermediate filaments. These are structure building components.
Mitochondria: energy (ATP) production from food; they have their own genetic material and ribosomes (37 genes in humans; variants are called “haplotypes”) and are inherited only maternally.

Figure 1.2: Eukaryotic cell of a plant.

The only difference between cells is the different proteins they produce. Protein production not only determines the cell type but also body functions, thinking, immune response, healing, hormone production and more. The cells are built of proteins, and everything which occurs in the human body is realized by proteins. Proteins are the substances of life. In detail they are
enzymes catalyzing chemical reactions,
sensors (pH value, chemical concentration),
storage containers (fat),
transporters of molecules (hemoglobin transports O2),
structural components of the tissue (tubulin, actin, collagen),
mechanical devices (muscle contraction, transport),
communication machines in the cell (decoding information, transcription, translation),
markers,
gene regulation parts (binding to nucleic acids),
hormones and their receptors (regulation of target cells),
components of the defense and immune system (antibodies),
neurotransmitters and their receptors,
nano-machines for building, reconfiguring, and reassembling proteins,
and more.

All information about the proteins and, therefore, about the organism is coded in the DNA. The decoding of the human DNA became famous under the term “human genome project”, since all information about an organism is called its genome (see Fig. ?? for a cartoon of this project).

1.2 Central Dogma of Molecular Biology

The central dogma of molecular biology says "DNA makes RNA makes protein". Therefore, all knowledge about life and its building blocks, the proteins, is coded in the DNA. RNA is a blueprint of parts of the DNA which is read out and supplied to the protein construction site. The making of RNA from DNA is called “transcription” and the making of protein from RNA is called “translation”.
In eukaryotic cells the DNA is located in the nucleus, but chloroplasts (in plants) and mitochondria also contain DNA. The part of the DNA which codes for a single protein is called a “gene”. However, scientists were forced to modify the statement "one gene makes one protein" in two ways. First, some proteins consist of substructures each of which is coded by a separate gene. Secondly, through alternative splicing one gene can code for different proteins.

Figure 1.3: Cartoon of the “human genome project”.

1.3 DNA

The deoxyribonucleic acid (DNA) codes all information of life (with some viral exceptions where information is coded in RNA) and represents the human genome. It is a double helix in which each helix (strand) is a sequence of nucleotides containing a deoxyribose (see Fig. ??). The ends of a single DNA strand are called 5’ and 3’ ("five prime" and "three prime"), which refers to the sides of the sugar molecule, with 5’ at the phosphate side and 3’ at the hydroxyl group. The DNA is written from 5’ to 3’; upstream means towards the 5’ end and downstream towards the 3’ end.

There exist 5 nucleotides (see Fig. ??): adenine (A), thymine (T), cytosine (C), guanine (G), and uracil (U). The first 4 are found in the DNA, whereas uracil is used in RNA instead of thymine. They form two classes: the purines (A, G) and the pyrimidines (C, U, T). The nucleotides are often called nucleobases.

In the double helix there exist hydrogen bonds between a purine and a pyrimidine, where the pairing is A–T and C–G (see Fig. ?? and Fig. ??). These pairings are called base pairs. Therefore each of the two helices of the DNA is complementary to the other (i.e. the code is redundant). The DNA uses a 4-letter alphabet, analogous to the binary (2-letter) alphabet used in computer science.

Figure 1.5: The deoxyribonucleic acid (DNA) is depicted.
Figure 1.6: The 5 nucleotides.
Figure 1.7: The hydrogen bonds between base pairs.
Figure 1.8: The base pairs in the double helix.
Figure 1.9: The DNA is depicted in detail.

The DNA is condensed in the nucleus through various processes and many proteins, resulting in chromosomes (humans have 23 pairs). The DNA wraps around histones (special proteins), resulting in a structure called chromatin. Two strands of chromatin linked together at the centromere give a chromosome. See Fig. ?? and Fig. ??.

Figure 1.10: The storage of the DNA in the nucleus. (1) DNA, (2) chromatin (DNA with histones), (3) chromatin strand, (4) chromatin (2 copies of the DNA linked at the centromere), (5) chromosome.
Figure 1.11: The storage of the DNA in the nucleus as cartoon.

However, the DNA of humans differs from person to person, as single nucleotides differ, which makes us individual. Our characteristics, such as eye or hair color, being tall or not, ear or nose form, skills, etc., are determined by small differences in our DNA. The DNA, including its small differences to other persons, is inherited from both parents, each contributing 23 chromosomes. An exception is the mitochondrial DNA, which is inherited only from the mother.

If a variation in the DNA at the same position occurs in at least 1% of the population, then it is called a single nucleotide polymorphism (SNP, pronounced "snip"). SNPs occur every 100 to 300 base pairs. Currently many research groups try to relate predispositions for particular diseases (e.g. schizophrenia or alcohol dependence) to SNPs.

Note that the DNA double helix is right-handed, i.e. it twists as a "right-hand screw" (see Fig. ?? for an error).

Figure 1.12: The DNA is right-handed.

1.4 RNA

Like the DNA, the ribonucleic acid (RNA) is a sequence of nucleotides. However, in contrast to DNA, RNA nucleotides contain ribose rings instead of deoxyribose and uracil instead of thymine (see Fig. ??). RNA is transcribed from DNA through RNA polymerases (enzymes) and further processed by other proteins.
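The base-pairing rule (A–T, C–G), the purine/pyrimidine classes, and the thymine/uracil difference between DNA and RNA can be sketched in a few lines of Python. This is only an illustration, not part of the lecture material: the function names are hypothetical, and `transcribe` assumes its input is the coding (non-template) strand written 5’ to 3’, so that the RNA is simply the same sequence with U in place of T.

```python
# Watson-Crick pairing from the text: A pairs with T, C pairs with G.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")

PURINES = set("AG")        # two-ring bases
PYRIMIDINES = set("CTU")   # one-ring bases (U replaces T in RNA)


def reverse_complement(dna: str) -> str:
    """Return the complementary strand, written 5' to 3'."""
    # Complement every base, then reverse so the result also reads 5' -> 3'.
    return dna.upper().translate(DNA_COMPLEMENT)[::-1]


def transcribe(coding_strand: str) -> str:
    """Return the RNA copy of a coding strand (assumption: input is the
    coding strand, so only T -> U has to be replaced)."""
    return coding_strand.upper().replace("T", "U")


def is_base_pair(a: str, b: str) -> bool:
    """True if the two DNA bases form a Watson-Crick pair."""
    return a.upper().translate(DNA_COMPLEMENT) == b.upper()


if __name__ == "__main__":
    seq = "ATGCGT"
    print(reverse_complement(seq))  # ACGCAT
    print(transcribe(seq))          # AUGCGU
```

Because the two strands are complementary, applying `reverse_complement` twice returns the original sequence, which is the redundancy of the code mentioned in Section 1.3.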
Very different kinds of RNA exist:
Messenger RNA (mRNA): first it is transcribed from the DNA (eukaryotic pre-mRNA), after maturation (in eukaryotes) it is transported to the protein production site, and then it is translated into a protein by the ribosome. It is a “blueprint” or template used to translate genes into proteins, which occurs at a huge nano-machine called the ribosome.
Transfer RNA (tRNA): non-coding small RNA (74-93 nucleotides) needed by the ribosome to translate the mRNA into a protein (see Fig. ??); each tRNA has at one end the bases complementary to a codon (three nucleotides which code for a certain amino acid), and at the other end an amino acid is attached; it is the basic tool to translate nucleotide triplets (the codons) into amino acids.
Double-stranded RNA (dsRNA): two complementary strands, similar to the DNA (sometimes found in viruses).
Micro-RNA (miRNA): two approximately complementary single-stranded RNAs of 20-25 nucleotides transcribed from the DNA; they are not translated, but build a dsRNA shaped as a hairpin loop which is called primary miRNA (pri-miRNA); miRNA regulates the expression of other genes as it is complementary to parts of mRNAs.
RNA interference (RNAi): fragments of dsRNA interfere with the expression of genes which are at some locations similar to the dsRNA.
Small/short interfering RNA (siRNA): 20-25 nucleotide-long RNA which regulates the expression of genes; produced in the RNAi pathway by the enzyme Dicer (which cuts dsRNA into siRNAs).
Non-coding RNA (ncRNA), small RNA (sRNA), non-messenger RNA (nmRNA), functional RNA (fRNA): RNA which is not translated.
Ribosomal RNA (rRNA): non-coding RNAs which form the ribosome together with various proteins.
Small nuclear RNA (snRNA): non-coding, within the nucleus (eukaryotic cells); used for RNA splicing.
Small nucleolar RNA (snoRNA): non-coding, small RNA molecules for modifications of rRNAs.
Guide RNA (gRNA): non-coding, found only in a few organisms, used for RNA editing.
Efference RNA (eRNA): non-coding, from intron sequences or from non-coding DNA; its function is assumed to be the regulation of translation.
Signal recognition particle (SRP): non-coding, RNA-protein complex; attaches to the mRNA of proteins which leave the cell.
pRNA: non-coding, observed in phages as mechanical machines.
tmRNA: found in bacteria, with tRNA- and mRNA-like regions.

Figure 1.13: The difference between RNA and DNA is depicted.
Figure 1.14: Detailed image of a tRNA.

1.5 Transcription

In transcription, RNA polymerase enzymatically copies parts of the DNA sequence to a complementary RNA. There are 3 types of RNA polymerase, denoted by I, II, and III, responsible for rRNA, mRNA, and tRNA, respectively. Transcription reads the DNA in the 3’ to 5’ direction; therefore the complementary RNA is produced in the 5’ to 3’ direction (see Fig. ??).

Figure 1.15: The transcription from DNA to RNA is depicted.

Transcription consists of 3 phases: initiation, elongation and termination. We will focus on eukaryotic transcription (prokaryotic transcription is different, but simpler).

1.5.1 Initiation

The start is marked by a so-called promoter region, to which specific proteins can bind. The core promoter of a gene contains binding sites for the basal transcription complex and RNA polymerase II and lies within 50 bases upstream of the transcription initiation site. It is normally marked by a TATA pattern to which a TATA binding protein (TBP) binds.
Subsequently different proteins (transcription factors) attach to this TBP, which is then recognized by the polymerase, and the polymerase starts the transcription. The transcription factors together with polymerase II are the basal transcriptional complex (BTC).

Some promoters are not associated with the TATA pattern. Some genes share promoter regions and are transcribed simultaneously. The TATA pattern is most conserved as TATAAA or TATATA, which means that these variants are observed more often than others.

For polymerase II the order of the TBP associated factors is as follows:
TFIID (Transcription Factor for polymerase II D) binds at the TATA box.
TFIIA holds TFIID and DNA together and enforces the interactions between them.
TFIIB binds downstream of TFIID.
TFIIF and polymerase II join the complex; the σ-subunit of the polymerase is important for finding the promoter as the DNA is scanned, but will be removed later (see Fig. ??).
TFIIE enters and makes polymerase II mobile.
TFIIH binds and identifies the correct template strand, initiates the separation of the two DNA strands through a helicase which obtains energy via ATP, phosphorylates one end of polymerase II, which acts as a starting signal, and even repairs damaged DNA.

Figure 1.16: The interaction of RNA polymerase and promoter for transcription is shown. (1) The polymerase binds at the DNA and scans it until (2) the promoter is found. (3) The polymerase/promoter complex is built. (4) Initiation of the transcription. (5) and (6) Elongation with release of the polymerase σ-subunit.

TFIIH and TFIIE strongly interact with one another, as TFIIH requires TFIIE to unwind the promoter.

The initiation is also regulated by interfering proteins and by inhibition through the chromatin structure. Proteins act as signals and interact with the promoter or the transcription complex and prevent or delay transcription (see Fig. ??).
The chromatin structure is able to stop the initiation of the transcription by hiding the promoter, and it can be altered by changing the histones.

1.5.2 Elongation

After initiation the RNA is actually written. After the generation of about 8 nucleotides the σ-subunit dissociates from the polymerase.

There are different kinds of elongation-promoting factors: factors that affect sequence-dependent arrest, chromatin-structure-oriented factors that modify the histones (phosphorylation, acetylation, methylation and ubiquitination), and factors that improve RNA polymerase II catalysis.

The transcription can be stimulated, e.g., through a CAAT pattern to which other transcription factors bind. Furthermore, transcription is regulated via upstream control elements (UCEs, 200 bases upstream of initiation). There also exist far away enhancer elements, which can be thousands of bases upstream or downstream of the transcription initiation site. Combinations of all these control elements regulate transcription.

1.5.3 Termination

Termination disassembles the polymerase complex and ends the RNA strand. It is a comparatively simple process which can happen automatically (see Fig. ??). The automatic termination occurs because the RNA forms a very stable 3D structure (the stem-loop structure) through G–C pairs (3 hydrogen bonds each), while the weakly bound A–U regions dissociate.

1.6 Introns, Exons, and Splicing

Splicing modifies the pre-mRNA which is released after transcription. Non-coding sequences called introns (intragenic regions) are removed, and coding sequences called exons are glued together. The exon sequence codes for a certain protein (see Fig. ??).

A snRNA complex, the spliceosome, performs the splicing, but some RNA sequences can perform autonomous splicing. Fig. ?? shows the process of splicing, where nucleotide patterns stabilize a 3D conformation needed for splicing.
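As a rough illustration of what "removing introns and gluing exons together" means at the sequence level, the following Python sketch cuts given intervals out of a pre-mRNA string. It is a toy model under stated assumptions: the function name `splice`, the 0-based half-open interval convention, and the example sequence are all hypothetical, and the intron positions are simply given as input rather than recognized from nucleotide patterns as the spliceosome does.

```python
def splice(pre_mrna: str, introns: list[tuple[int, int]]) -> str:
    """Remove introns and join the remaining exons.

    `introns` holds (start, end) positions, 0-based with `end` exclusive
    (an assumption of this sketch, not a convention from the notes).
    """
    exons = []
    pos = 0  # start of the exon currently being collected
    for start, end in sorted(introns):
        if start < pos or start >= end or end > len(pre_mrna):
            raise ValueError(f"invalid or overlapping intron: {(start, end)}")
        exons.append(pre_mrna[pos:start])  # exon in front of this intron
        pos = end                          # continue after the intron
    exons.append(pre_mrna[pos:])           # exon after the last intron
    return "".join(exons)


if __name__ == "__main__":
    # Toy pre-mRNA: exon + intron + exon (sequence made up for illustration).
    pre = "AUGGCC" + "GUAAGUUUUCAG" + "UUUUAA"
    print(splice(pre, [(6, 18)]))  # AUGGCCUUUUAA
```

Calling `splice` on the same pre-mRNA with different intron lists yields different mature mRNAs.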
However, the pre-mRNA corresponding to a gene can be spliced in different ways (called alternative splicing); therefore a gene can code for different proteins. This is a dense coding, because proteins which share the same genetic subsequence (and, therefore, the same 3D substructure) can be coded by a single gene (see Fig. ??). Alternative splicing is controlled by various signaling molecules. Interestingly, introns can convey old genetic code corresponding to proteins which are no longer needed.

Figure 1.17: Mechanism to regulate the initiation of transcription. Top (a): Repressor mRNA binds to the operator immediately downstream of the promoter and stops transcription. Bottom (b): Repressor mRNA is inactivated by an inducer and transcription can start.

1.7 Amino Acids

An amino acid is a molecule with amino and carboxylic acid groups (see Fig. ??). There exist 20 standard amino acids (see Fig. ??).

Figure 1.22: A generic cartoon for an amino acid. “R” denotes the side chain, which is different for different amino acids; all other atoms are identical for all amino acids except for proline.

In the following, properties of amino acids are given, such as water hating (hydrophobic) or water loving (hydrophilic) (see Tab. ?? and Tab. ??) and electrically charged (acidic = negative, basic = positive) (see Tab. ??). The main properties are depicted in Fig. ??. Hydrophobic amino acids are found in the inside of the protein because this is energetically favorable. Only charged or polar amino acids can build hydrogen bonds with water molecules (which are polar). If all molecules which cannot form these hydrogen bonds with water are put together, then more of the remaining molecules can form hydrogen bonds, leading to an energy minimum. Think of fat on a water surface (soup), which also forms clusters. During folding of the protein the main force is the hydrophobic effect, which also stabilizes the protein in its 3D structure.
Other forces stabilizing the 3D structure of a protein are salt bridges, which can exist between a positively and a negatively charged amino acid. Furthermore, disulfide bridges (between two Cys residues) are important both for folding and for the stability of the 3D structure. The remaining forces forming the 3D structure are mainly hydrogen bonds between two backbones or two side chains as well as between backbone and side chain.

A sequence of amino acids, i.e. residues, folds to a 3D structure and is called a protein. The property of amino acids to form chains is essential for building proteins. The chains are formed through peptide bonds. An amino acid residue is what remains of an amino acid after it has formed peptide bonds with other amino acids, whereby a water molecule is set free (see Fig. ??). The peptide bonds are formed during translation.

All proteins consist of these 20 amino acids. The specific 3D structure of the proteins and the position and interaction of the amino acids result in various chemical and mechanical properties of the proteins. All nano-machines are built from the amino acids, and these nano-machines configure themselves if the correct sequence of amino acids is provided.

Residue   SA       Hyd Res   Hyd side
Gly       47       1.18       0.0
Ala       86       2.15       1.0
Val       135      3.38       2.2
Ile       155      3.88       2.7
Leu       164      4.10       2.9
Pro       124      3.10       1.9
Cys       48       1.20       0.0
Met       137      3.43       2.3
Phe       39+155   3.46       2.3
Trp       37+199   4.11       2.9
Tyr       38+116   2.81       1.6
His       43+86    2.45       1.3
Thr       90       2.25       1.1
Ser       56       1.40       0.2
Gln       66       1.65       0.5
Asn       42       1.05      -0.1
Glu       69       1.73       0.5
Asp       45       1.13      -0.1
Lys       122      3.05       1.9
Arg       89       2.23       1.1

Table 1.2: Hydrophobicity scales (P. A. Karplus, Protein Science 6 (1997) 1302-1307). “SA”: residue non-polar surface area in Å² (all surfaces associated with main- and side-chain carbon atoms were included except for amide, carboxylate and guanidino carbons; for aromatic side chains, the aliphatic and aromatic surface areas are reported separately). “Hyd Res”: estimated hydrophobic effect for residue burial in kcal/mol. “Hyd side”: estimated hydrophobic effect for side chain burial in kcal/mol (the values are obtained from the previous column by subtracting the value for Gly, 1.18 kcal/mol, from each residue).

1.8 Genetic Code

The genetic code is the set of instructions for producing proteins out of the DNA information. A protein is coded in the DNA through a gene, which is a DNA subsequence with start and end markers. A gene is first transcribed into mRNA, which is subsequently translated into an amino acid sequence which folds to the protein. The genetic code gives the rules for translating a nucleotide sequence into an amino acid sequence. These rules are quite simple because 3 nucleotides correspond to one amino acid, where the nucleotide triplet is called a codon. The genetic code is given in Tab. ??.

First      Second position                                  Third
(5’ end)   U          C          A           G              (3’ end)
U          UUU Phe    UCU Ser    UAU Tyr     UGU Cys        U
           UUC Phe    UCC Ser    UAC Tyr     UGC Cys        C
           UUA Leu    UCA Ser    UAA Stop    UGA Stop       A
           UUG Leu    UCG Ser    UAG Stop    UGG Trp        G
C          CUU Leu    CCU Pro    CAU His     CGU Arg        U
           CUC Leu    CCC Pro    CAC His     CGC Arg        C
           CUA Leu    CCA Pro    CAA Gln     CGA Arg        A
           CUG Leu    CCG Pro    CAG Gln     CGG Arg        G
A          AUU Ile    ACU Thr    AAU Asn     AGU Ser        U
           AUC Ile    ACC Thr    AAC Asn     AGC Ser        C
           AUA Ile    ACA Thr    AAA Lys     AGA Arg        A
           AUG Met    ACG Thr    AAG Lys     AGG Arg        G
G          GUU Val    GCU Ala    GAU Asp     GGU Gly        U
           GUC Val    GCC Ala    GAC Asp     GGC Gly        C
           GUA Val    GCA Ala    GAA Glu     GGA Gly        A
           GUG Val    GCG Ala    GAG Glu     GGG Gly        G

Table 1.3: The genetic code. AUG not only codes for methionine but also serves as a start codon.
AUG is the standard start codon, and CUG occasionally serves as an alternative start codon; for prokaryotes the start codons are AUG, GUG and, more rarely, UUG or AUU.
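The translation rule described above (read the mRNA three nucleotides at a time and look each codon up in the genetic code) can be sketched as follows. This is a minimal sketch under several assumptions that are not from the notes: amino acids are written in the common one-letter code instead of the three-letter abbreviations of Table 1.3, `*` marks a stop codon, translation simply starts at the first AUG found, and the names `CODON_TABLE` and `translate` are hypothetical.

```python
from itertools import product

BASES = "UCAG"  # same row/column order as the genetic code table

# One-letter amino acid codes for all 64 codons, enumerated with the first
# position changing slowest and the third fastest (UUU, UUC, UUA, UUG, UCU, ...).
# '*' marks the stop codons UAA, UAG and UGA.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # first base U
    "LLLLPPPPHHQQRRRR"   # first base C
    "IIIMTTTTNNKKSSRR"   # first base A
    "VVVVAAAADDEEGGGG"   # first base G
)

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate an mRNA from its first AUG up to the first stop codon.

    Simplification: real start-site selection is more involved than
    'first AUG in the string'. Raises KeyError on letters other than A, C, G, U.
    """
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("GGAUGGCCUUUUAAGG"))  # MAF  (Met-Ala-Phe, then stop at UAA)
```

A dictionary lookup is used because the code is a fixed mapping of 64 codons to 20 amino acids and 3 stop signals, so several codons map to the same amino acid.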
