This tutorial provides an overview of what DNA and proteins are, and the surrounding context from biology. We’ll also see how these biological molecules can be represented as a sequence of characters, so that we can apply various string algorithms and other computational techniques.
DNA (deoxyribonucleic acid) stores all the information a living being requires to grow and function. An organism’s DNA can be thought of as the blueprint, or design, for that organism.
A DNA molecule is made up of a chain of nucleotides. Each nucleotide consists of a sugar (deoxyribose), a phosphate group, and one of four nucleobases: Adenine, Cytosine, Guanine, or Thymine (often referred to by their first letters, A, C, G and T). These nucleotides are arranged in a double-stranded helical structure as shown below.
Model diagram of DNA’s double-stranded helical structure. The strands are composed of sugar and phosphate groups. The ladder rungs are made of nucleobases.
The helical strands (blue) are composed of deoxyribose sugar and phosphate groups, and are usually called the “backbone” of the DNA molecule. The ladder rungs are composed of nucleobases. Each rung has two nucleobases, which hold the strands together. The nucleobases are complementary to each other, i.e. Adenine pairs only with Thymine, and Guanine pairs only with Cytosine. The figure above shows this by using a different color for each of the 4 nucleobases.
A short segment of the DNA can hence be represented by a sequence of characters as shown below.
The first row represents the first strand, and the second row represents the second strand. Each column is one ladder rung. Since the two strands must obey the rule regarding complementary base pairs, only A and T, or G and C, may appear opposite each other.
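The pairing rule can be sketched in a few lines of Python. This is a minimal illustration, not a standard library routine: the function name `complement` and the example strand `ATGCCGTA` are made up for this sketch, and the second strand is written column by column, in the same orientation as the two-row layout described above.

```python
# Complementary base pairs: A pairs with T, and G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the opposite strand, one base per ladder rung."""
    return "".join(PAIR[base] for base in strand)

first_strand = "ATGCCGTA"          # hypothetical example sequence
second_strand = complement(first_strand)

print(first_strand)   # ATGCCGTA
print(second_strand)  # TACGGCAT
```

Applying `complement` twice returns the original strand, which is why a single row of characters carries the same information as both rows.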
Moreover, since one strand gives us complete information about the second strand, we can represent a DNA sequence by a single sequence of characters. For example,
The size of an organism’s DNA is measured in base pairs, and it varies from species to species, ranging from tens of thousands to hundreds of billions. For example, a rice plant has about 420 million base pairs and yeast has about 12.1 million base pairs, whereas humans have approximately 3.2 billion base pairs. In reality, these long strands of DNA exist in a highly compressed, coiled form to fit inside the cell nucleus.
In a living being, cells are continuously being destroyed and created. Since each cell contains a copy of the DNA, the DNA needs to be copied for a new cell to be created. For this, the DNA’s two strands are first separated, and then each strand serves as a template for the construction of its complementary strand. Once the construction is complete, we have two copies of the DNA.
DNA replication and combination also need to happen for the purposes of reproduction.
During DNA replication, errors (called mutations) may occur. Different types of mutations are possible, ranging from single-nucleotide errors to large sections of the DNA getting copied twice or deleted.
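On the character-sequence representation, these kinds of mutation become simple string edits. The sketch below is illustrative only: the function names, the index convention (0-based, end-exclusive) and the sequence `ACGTTGCA` are assumptions made for this example.

```python
def point_mutation(seq: str, pos: int, base: str) -> str:
    """Replace the single nucleotide at index pos with base."""
    return seq[:pos] + base + seq[pos + 1:]

def duplicate(seq: str, start: int, end: int) -> str:
    """Copy the section seq[start:end] a second time, right after itself."""
    return seq[:end] + seq[start:end] + seq[end:]

def delete(seq: str, start: int, end: int) -> str:
    """Remove the section seq[start:end]."""
    return seq[:start] + seq[end:]

original = "ACGTTGCA"  # hypothetical example sequence

print(point_mutation(original, 2, "A"))  # ACATTGCA
print(duplicate(original, 2, 5))         # ACGTTGTTGCA
print(delete(original, 2, 5))            # ACGCA
```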
DNA serves as the blueprint for constructing proteins, which directly perform a wide variety of functions in an organism. Proteins can be thought of as the building blocks of life. A protein’s ability to perform a specific function is highly dependent on its shape and structure.
A protein consists of a chain of amino acids, and there are 20 such amino acids. The sequence of amino acids that makes up a particular protein determines its structure, and how the protein will respond to a particular environment and to other molecules.
A chain of amino acids is often called a peptide (or polypeptide). It folds itself into a 3-D structure based on the sequence of amino acids and the environment it is in, giving the protein its shape.
Each of the 20 amino acids that proteins are made of has a 3-letter as well as a 1-letter abbreviation, as shown in the following table.
Hence, we can represent a protein as a sequence of characters, as we did with the DNA. For example, the following sequence would represent a protein with 25 amino acids,
The protein consists of the following sequence of amino acids: tryptophan, threonine, glutamine, glutamine, glutamine, proline, and so on.
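As a small illustration of this encoding, the Python sketch below maps each 1-letter code to its 3-letter abbreviation and full name, using the standard abbreviations for the 20 amino acids. The dictionary and function names are made up for this example, and only the six residues named in the text (`WTQQQP`) are decoded, since those are the only ones spelled out here.

```python
# 1-letter code -> (3-letter abbreviation, full name) for the 20 amino acids.
AMINO_ACIDS = {
    "A": ("Ala", "alanine"),       "R": ("Arg", "arginine"),
    "N": ("Asn", "asparagine"),    "D": ("Asp", "aspartic acid"),
    "C": ("Cys", "cysteine"),      "Q": ("Gln", "glutamine"),
    "E": ("Glu", "glutamic acid"), "G": ("Gly", "glycine"),
    "H": ("His", "histidine"),     "I": ("Ile", "isoleucine"),
    "L": ("Leu", "leucine"),       "K": ("Lys", "lysine"),
    "M": ("Met", "methionine"),    "F": ("Phe", "phenylalanine"),
    "P": ("Pro", "proline"),       "S": ("Ser", "serine"),
    "T": ("Thr", "threonine"),     "W": ("Trp", "tryptophan"),
    "Y": ("Tyr", "tyrosine"),      "V": ("Val", "valine"),
}

def spell_out(protein: str) -> list[str]:
    """Expand a string of 1-letter codes into full amino acid names."""
    return [AMINO_ACIDS[code][1] for code in protein]

print(spell_out("WTQQQP"))
# ['tryptophan', 'threonine', 'glutamine', 'glutamine', 'glutamine', 'proline']
```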
The size of a protein is measured either by its mass (in kilodaltons, or kDa, where 1 kDa is 1000 atomic mass units) or by the length of its amino acid sequence. The largest known proteins are titins, present in human muscle cells, which have a mass of up to 3000 kDa and a length of up to 27,000 amino acids. However, not all proteins are so large. For example, the average yeast protein has an approximate mass of 53 kDa and is made up of around 460 amino acids.
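The two measures are related: dividing the mass by the length gives a rough mass per amino acid. The short Python sketch below does this arithmetic with the figures quoted above. The function name is made up for this example, and the result is only an average, since individual amino acids differ in mass.

```python
DALTONS_PER_KDA = 1000  # 1 kDa = 1000 atomic mass units (daltons)

def mean_residue_mass_da(mass_kda: float, n_amino_acids: int) -> float:
    """Average mass per amino acid, in daltons."""
    return mass_kda * DALTONS_PER_KDA / n_amino_acids

print(round(mean_residue_mass_da(53, 460), 1))      # average yeast protein: 115.2
print(round(mean_residue_mass_da(3000, 27000), 1))  # titin: 111.1
```

Both figures come out at a little over 100 daltons per amino acid, which is a handy rule of thumb for converting between the two measures.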
In this section, we’ll see how a protein is synthesized starting from the DNA. This is a really interesting, neat and precise process. It happens in two steps:
- Transcription: A small portion of the DNA is transcribed to construct an RNA (ribonucleic acid) molecule. To make RNA from DNA, certain enzymes first open the DNA helix and copy one of its strands (RNA molecules have a single strand). RNA differs from DNA in composition: wherever the DNA has thymine, the RNA has uracil (refer to image below). Moreover, RNA molecules are much shorter than the DNA, because only small parts of the DNA are transcribed.
- Translation: Ribosomes, the protein builders, read the RNA 3 nucleobases at a time. For each set of three letters (referred to as a codon), the ribosome produces a specific amino acid. This reading and converting of codons continues until a stop codon is reached, producing the protein’s amino acids one by one.
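The transcription step can be sketched on the character representation as follows. This is a simplified illustration: it pairs each base of the copied DNA strand with its complementary RNA base (with uracil in place of thymine) and ignores strand direction and the enzymes involved. The function name and the DNA segment are assumptions for this example. The segment was chosen so that the resulting RNA starts with GUG, like the example discussed further below.

```python
# Complementary RNA base for each DNA base; RNA uses uracil (U) instead of thymine.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_dna: str) -> str:
    """Build the RNA that is complementary to the given DNA strand."""
    return "".join(DNA_TO_RNA[base] for base in template_dna)

template = "CACGTGGACTGAGGACTCCTCTTC"  # hypothetical DNA segment
rna = transcribe(template)

print(rna)  # GUGCACCUGACUCCUGAGGAGAAG
```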
Diagram of protein synthesis: DNA to RNA (transcription) and RNA to Protein (translation)
Example: In the above example, we see a segment of DNA which is first converted to RNA. Note that the RNA is complementary in nucleobase composition to the DNA strand. Now, the ribosome reads the RNA 3 letters at a time. The first 3 letters in the RNA sequence, GUG (guanine-uracil-guanine), code for the amino acid V (valine). Similarly, the other codons are decoded, and eventually a protein is formed as a series of amino acids, which in this example is VHLTPEEK (each letter here is the 1-letter abbreviation of the amino acid).
Mapping from Codon to Amino Acid.
The entire mapping from 3-letter codons to amino acids is included above. Note that multiple codons can map to the same amino acid. Also note the start and stop codons, which determine where translation begins and where it stops.
Diagram of cell showing where transcription and translation occur
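A minimal Python sketch of translation, using the standard codon table, might look like the following. The compact table construction, the function name, and the example RNA strings are assumptions made for this illustration. The first RNA string is one possible spelling that yields VHLTPEEK, not necessarily the exact sequence in the figure. For simplicity the function starts reading at the first letter and does not search for a start codon.

```python
from itertools import product

# Standard codon table, listed in U, C, A, G order for each of the three positions.
# "*" marks the three stop codons (UAA, UAG, UGA).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(rna: str) -> str:
    """Read the RNA three letters at a time until a stop codon or the end."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":  # stop codon: translation ends here
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("GUGCACCUGACUCCUGAGGAGAAG"))  # VHLTPEEK
print(translate("AUGGUGUAAGGG"))              # MV (stops at UAA)
```

The table also shows the redundancy mentioned above: for instance, six different codons map to leucine (`L`).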
To give some more context, see the above image. The larger “bubble” is the biological cell, and the smaller bubble is the cell nucleus. The DNA resides in the cell nucleus, which is also where transcription (the synthesis of RNA) occurs. Which parts of the DNA are transcribed depends on several factors, including the type of cell (for example, nerve cell vs lung cell vs heart cell) and regulatory segments within the DNA close to the site that gets transcribed. The RNA then leaves the nucleus and enters the cytoplasm (the remaining area within the cell, excluding the nucleus). Ribosomes, which exist in the cytoplasm, then start creating a chain of amino acids to form a protein. The cytoplasm is the environment in which the protein is synthesized, and its chemical properties also play a role in determining how the protein folds (in addition to the sequence of amino acids that makes up the protein).
In this tutorial, we saw what DNA and proteins are, how they can be represented as sequences of characters, and that they are core parts of the biology of any organism. This provides us with enough background from biology to get into different problems that bioinformatics can be used to solve.