Halomicrobium mukohataei Genome Fall 2009

From GcatWiki

This page will be used by Davidson College students in the Genomics Laboratory course.

Links to Multiple Databases

  • Manatee at JCVI
  • KEGG
    We can submit our genes to KEGG to have them mapped out, but SEED and Manatee may already do this. Do we want to ask them to upload the genes into their database?

Papers of Interest

Proteins from extremophiles as stable tools for advanced biotechnological applications of high social interest.

Molecular ecology of extremely halophilic Archaea and Bacteria

Submitted Course Assignments

Genome_comparisons summarizes information found by the class about each of the nine species we are comparing.

Tutorials for Annotating Genomes

Media:Creation of Sequence Logos Using WebLogo.doc (Katie)

Determining whether genes called in JGI and RAST are identical (Karen)

The Ins and Outs of ClustalW2 (Sarah)

Mastering the Art of NCBI: It's a BLAST (Claudia)

Media:ClustalW_Tutorial.doc - (Olivia)

Media:KEGG_pathway_tutorial.doc - (Megan)

Tutorials for Whole Genome Analysis

Olivia - perl script to compare proteomes (links to Katie's and Megan's pages)
Katie - two web pages, one for downloading original perl scripts and one for sample small scale version (convert to fasta and compare proteomes)
link Proteome Compare
Claudia - How To Find and Format Genome Sequences
Megan - Determining Unique and Conserved Proteins: How to Use Katie's Webpage
Karen - how to deal with output from web pages
Sarah - CRISPR resources

Oral Reports on Individual System Research Projects

Claudia's Assignment
Degradation of Xenobiotics by Halomicrobium mukohataei (Megan Reilly)
Sarah's Assignment
Olivia's Assignment
Karen's Assignment
Katie's Assignment

Oral Reports for Whole Genome Projects

Claudia - Cysteine Metabolism
Megan: ABC Transporters - External link. Cyberducky was being problematic.
Sarah Media:Cas_ProteinsFinal.ppt
Olivia Media:Hoshing_CRISPRdirectRepeats.ppt
Katie CRISPR spacers and the capturing of viral DNA Media:CRISPR_spacers.ppt

Final Term Papers


Student Research Results

RNA Genes

16S rRNA: 3' TCCTCCA 5'
.....mRNA: 5' nGGAGGt 3'
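
The pairing shown above can be checked with a short script. This is only a sketch: the function names and the example upstream sequence are made up for illustration, and the search looks for exact matches to the GGAGG core only, whereas real ribosome binding sites often match imperfectly.

```python
# Complement table for DNA letters (T is used in place of U for simplicity).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


# The 16S rRNA tail is printed 3'->5' on this page, so reverse it to read 5'->3'.
ANTI_SD = "TCCTCCA"[::-1]                    # "ACCTCCT"
# Drop the flanking bases to keep only the GGAGG core of the mRNA motif.
SD_CORE = reverse_complement(ANTI_SD)[1:6]   # "GGAGG"


def find_rbs(upstream, core=SD_CORE):
    """Return 0-based positions of exact core matches in an upstream region."""
    upstream = upstream.upper()
    width = len(core)
    return [i for i in range(len(upstream) - width + 1)
            if upstream[i:i + width] == core]
```

For example, `find_rbs("TTAGGAGGTAACATG")` reports a match starting at position 3, a few bases upstream of the ATG.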

Research into One Gene/Protein

Histone Deacetylase (Katie Richeson)

Uncharacterized Protein Family (UPF0153) (Katie Richeson)

CTP Synthase (Sarah Pyfrom)

Unknown Protein (Sarah Pyfrom)

Bacteriophytochrome (light-regulated signal transduction histidine kinase) (Claudia M. Carcelen)

Protein of unknown function (DUF861) (Claudia M. Carcelen)

Conserved hypothetical protein TIGR00162 (Megan Reilly)

Electron transfer flavoprotein, alpha subunit (Megan Reilly)

Cellulase (Karen Hasty)

Hypothetical Protein 644031642 (Karen Hasty)

Beta-galactosidase (Olivia Ho-Shing)

Hypothetical Protein 644029933 (Olivia Ho-Shing)

Whole Proteome Comparisons of Halomicrobium mukohataei vs 9 Halophile Proteomes by Megan, Karen and Claudia

Links to compared proteomes. These links allow you to view the proteins the two organisms share and, in a separate file, the proteins unique to Halomicrobium mukohataei.

  1. Halomicrobium mukohataei vs. Haloarcula sinaiiensis: Shared proteins and Unique Proteins KH
  2. Halomicrobium mukohataei vs. Haloarcula vallismortis: Shared proteins and Unique Proteins KH
  3. Halomicrobium mukohataei vs. Haloarcula californiae: Shared proteins and Unique Proteins MR
  4. Halomicrobium mukohataei vs. Haloferax denitrificans: Shared proteins and Unique Proteins CMC
  5. Halomicrobium mukohataei vs. Haloferax mediterranei: Shared proteins and Unique Proteins KH
  6. Halomicrobium mukohataei vs. Haloferax volcanii: Shared proteins and Unique Proteins KR
  7. Halomicrobium mukohataei vs. Haloferax sulfurifontis: Shared proteins and Unique Proteins KR
  8. Halomicrobium mukohataei vs. Haloferax mucosum: Shared proteins and Unique Proteins SP
  9. Halomicrobium mukohataei vs. Halorhabdus utahensis: Shared proteins and Unique Proteins SP
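
The shared/unique split in the list above can be sketched in a few lines. This sketch assumes a protein counts as "shared" only when the identical amino acid sequence occurs in the other proteome; the class Perl scripts may use a different criterion, and the function name and dictionary layout here are hypothetical.

```python
def compare_proteomes(query, other):
    """Split the query proteome into shared and unique protein names.

    Both arguments map protein name -> amino acid sequence.
    Assumption: "shared" means an identical sequence exists in `other`.
    """
    other_sequences = set(other.values())
    shared = {name for name, seq in query.items() if seq in other_sequences}
    unique = set(query) - shared
    return shared, unique
```

Proteins shared by all ten species could then be found by intersecting the `shared` sets from the nine pairwise comparisons.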

Final List of Shared Proteins for All 10 Species (Alphabetical)

Genome comparisons Genbank format files of genomes and proteomes:

  1. GenBank Format of Proteome of Haloarcula sinaiiensis ATCC 33800
  2. GenBank Format of Proteome of Haloarcula vallismortis ATCC 29715
  3. GenBank Format of Proteome of Haloarcula californiae ATCC 33799
  4. GenBank Format of Proteome of Haloferax denitrificans ATCC 35960
  5. GenBank Format of Proteome of Haloferax mediterranei ATCC 33500
  6. GenBank Format of Proteome of Haloferax volcanii ATCC 29605
  7. GenBank Format of Proteome of Haloferax sulfurifontis ATCC BAA-897
  8. GenBank Format of Proteome of Haloferax mucosum ATCC BAA-1512
  9. GenBank Format of Proteome of Halomicrobium mukohataei DSM 12286
  10. GenBank Format of Proteome of Halorhabdus utahensis

FASTA Version of only proteomes:

  1. Haloarcula sinaiiensis ATCC 33800
  2. Haloarcula vallismortis ATCC 29715
  3. Haloarcula californiae ATCC 33799
  4. Haloferax denitrificans ATCC 35960
  5. Haloferax mediterranei ATCC 33500
  6. Haloferax volcanii ATCC 29605
  7. Haloferax sulfurifontis ATCC BAA-897
  8. Haloferax mucosum ATCC BAA-1512
  9. Halomicrobium mukohataei DSM 12286
  10. Halorhabdus utahensis

Useful tool: NCBI Archaea Taxonomy List. If you want to find all proteins in FASTA format for a limited number of halophiles and many different archaea, click the link above and choose an accession number, then pick protein FASTA format from the drop-down menu. All of the proteins in FASTA format will be listed in alphabetical order.

CRISPR Project by Katie, Olivia and Sarah

CRISPR systems may represent a prokaryotic analog of eukaryotic RNA interference systems, which gives bacteria a simplified type of immune defense system (Wikipedia). It may be interesting and fruitful to look into doing a genomic comparison between our organism's CRISPR sequences and the sequences that make up eukaryotic RNA interference systems.

Useful Databases and Tools:

JGI Halomicrobium mukohataei



CRISPR Database

CRISPR Finder- finds CRISPR sequences when you enter genomic data up to 67,000,000 bp



Brainstorming Ideas for Research Topics

Brainstorming by Sarah, Claudia, Megan

In general, we thought that specifying a few key pathways or processes might be the best way to break up the genome.

High salinity:

How, exactly, do halophiles manage to live in such high-salt conditions?
To investigate this question, I believe it would be highly beneficial to compare our genome with the genomes of other bacteria and yeasts that live at very high salinity, some of which grow in 24% brine (Bender). Specifically, we could delve into the genes that deal with the high-salinity environment. Charged amino acids on the surface and osmotic pressure maintain the correct proton balance.
Perhaps proton pumps would be a good place to begin?
There are a number of Na+ pumps encoded in our organism's genome, several of which are in fact protein dependent. This provides further evidence for the background information Dr. Campbell found that not only is our organism able to survive in such an environment, it requires it. These organisms may require ions (from salts) to drive a number of symporters and antiporters, especially those that keep the insides of their cells from dehydrating. Specifically, items such as the Na+-dependent transporter might be very interesting to research, especially since it is a conserved protein.
Cl- pump.jpg
This article discusses the genome of Haloarcula marismortui, specifically mentioning the unusually high number of environmental response regulators. Whether we focus on light, salinity, or other environmental factors, I believe that investigating this excess of environmental response regulators would prove to be successful and provide interesting results.


Halophiles use a different kind of photosynthesis, using energy gathered from pigments in order to create ATP. Bacteriorhodopsin is one of these pigments; our species has 4 predicted proteins.

How does this pathway work?

Simplified archaeal photosynthetic pathway:

Archaeal chemiosmosis.png

Does our species use the same pathway for photosynthesis as other halophiles that have been studied?

Similar halophiles' photosynthetic pathways include a protein called halorhodopsin as well as bacteriorhodopsin. Our species, however, only has the latter. While bacteriorhodopsin is a light-driven proton pump, halorhodopsin is a light-driven anion pump specific for chloride ions. Both the wet lab and the JGI annotation agree that our species does not have this secondary pump. I think this would be a very interesting place to start since there has been a lot of research done on bacteriorhodopsin and the pathways it is involved with. We could compare the similarities as well as how our species differs from the norm.

How do other metabolic pathways work? (i.e. methane, sulfur, nitrogen) Our species has some, but not all, genes that have been identified for methane, sulfur, and nitrogen metabolic pathways. Are these genes functional within each of these pathways? If so, are there other unidentified genes that exist within these pathways?

This article compares many metabolic pathways among three halophilic species.

Halophiles cannot fix carbon. How does this affect our species? How do they compensate?

JGI shows (through a KEGG pathway) that our species has most of the common genes involved in the reductive carboxylate cycle in photosynthetic bacteria. Can our halophile indeed fix carbon in the form of CO2?


Why does our organism require a high salinity to survive? What is the mechanism?
How do adaptations and/or pathways that deal with high salinity affect other processes within our organism?
Why does our organism appear to have only cation pumps and mechanisms, and not anions? How does this affect the organism's environment? What happens to those anions?
What similarities and differences are observed when comparing the genome of our organism to other high-salinity organisms such as halophiles and yeasts? Can we uncover what genes are vital to this adaptation, and where they may have come from? What are the differences, at the gene level, between halotolerant organisms and halophilic organisms?

Brainstorming by Katie, Olivia, Karen

Potential Research Directions:

1) Sugar metabolism - comparison between species; How do these species produce energy from light and sugar? Which source of energy is more important? How is chitin involved?

2) What gene products allow halophiles to survive in high salt content?

3) Heat shock genes - How does our species adapt to high heat conditions? Does this compare to how other halophiles adapt to salt or high heat conditions?

4) Compare genes with an RBS upstream of the gene with genes that are identically identified for all three annotation sites?

5) DNA repair mechanisms - What are they? How do they work? Do other halophiles have them?

6) Comparison between specific extreme genes - extreme heat/extreme salt - and bacteria and eukarya. Do these groups share any of these genes?

-combine this bullet with number 2.

7) Two 'other' RNA genes - Does our species have these genes?

8) Start codon issues - Do all proteins start with Met? Really?? Look at other annotation sites and see what they give. Wet lab research publications? tRNA i?

9) Light proton pump - what genes encode it? How does it work? Have all proteins necessary for this process been discovered?

10) Eating? Do archaea do this? Digestion? Little brine shrimpies? Yum.

-I think this bullet point may go better with the metabolism bullet (1)

11) Alien genes? Do we have any? What species are they from? Are they from BRINE SHRIMP??

Dr. C.

1) I was reading about how halophiles not only survive in high salt, but that they require it. WikiPedia

2) I also think it would be interesting to compare all 10 genomes annotated by RAST and see how they differ. Then we could see what known and unknown genes distinguish them from each other. Perhaps we could map out an evolutionary map by comparing who has which genes? We could compare a phylogenetic tree using rRNA sequences vs. conserved/deleted genes.
I collected all the 16S rRNA gene sequences I could find in NCBI. Most of these are NOT from the whole genome sequences listed online. Nevertheless, I was able to use ClustalW and PHYLIP to produce a dendrogram with horizontal distances listed. Sequences available here: rRNA Sequences in Word File
List of 8 species we could study:

  1. Haloarcula californiae
  2. Haloarcula sinaiiensis
  3. Haloarcula vallismortis
  4. Haloferax denitrificans
  5. Haloferax mediterranei
  6. Haloferax mucosum
  7. Haloferax sulfurifontis
  8. Haloferax volcanii

Halophile tree3.png
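
A tree like the one above starts from pairwise distances between aligned 16S rRNA sequences. The sketch below computes the simplest such measure, the uncorrected p-distance (fraction of differing sites). This is an illustrative stand-in, not the model ClustalW or PHYLIP actually applied for the dendrogram; the function names are hypothetical, and the input sequences are assumed to be already aligned to equal length.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of differing sites, ignoring columns where either has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(aligned):
    """Map each pair of names (in sorted order) to its p-distance.

    `aligned` maps species name -> aligned sequence.
    """
    return {(m, n): p_distance(aligned[m], aligned[n])
            for m, n in combinations(sorted(aligned), 2)}
```

The resulting distances could be handed to a clustering method such as neighbor joining to draw the tree.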

3) It might be interesting to map out the metabolic pathways to explain the experimentally determined sources of energy that support our species' growth and also try to determine why our species was not able to grow on some other energy sources.

4) We know our species has flagella. Are these flagella typical, or atypical?

Bacterial Flagella.png
Image of bacterial flagellum from Wikipedia.

Image from Encyclopedia of Life

AglH: ??
AglC: ??
AglA: ??
AglB: ??
FlaK: NO
FlaF: YES!!
FlaB1: YES
FlaB2: NO
FlaB3: NO
FlaA: NO

Flagella1 annotated.png
Flagella2 annotated.png

Glossary words (A - Z):


16S rRNA - ribosomal RNA found in the small subunit of prokaryotic ribosomes. rRNA functions in decoding mRNA and interacting with tRNAs in translation. In particular, 16S rRNA is a well-conserved gene found in all organisms (in prokaryotes and eukaryotic mitochondria) and is often used in comparative genomics when studying phylogeny (Lecture, Olivia)


accession number - a unique identifier given to DNA and protein sequences to allow for tracking of sequence information within a single database [1] (Will).

adsorption - the accumulation of molecules on the surface of a material. This can be part of a lab procedure to purify and isolate a specific portion of a cell or a protein (Wikipedia, Olivia)

alien genes - genes found in a genome that appear to have been inserted into an organism's genome from another species, more than likely through horizontal gene transfer ([1] Campbell, Claudia)

antisense (RNA or DNA)-a piece of DNA or RNA that binds to a complementary sequence of DNA or RNA. These segments of genetic material can be used to identify the existence of a disease gene and they can also be used to bind to specific DNA or mRNA sequences to inhibit their function (5 Pallavi).

Arabidopsis thaliana - the scientific name for the thale cress plant; it was the first plant to have its genome sequenced, and is a model organism for understanding plant biology and genetics (Wikipedia.org, Jay)

Archaea - one of the three evolutionary domains. A group of unicellular prokaryotes that were previously grouped with Bacteria, but have some genes and metabolic pathways more similar to eukaryotes, such as those involved in transcription and translation. Many Archaea are extremophiles, such as Halobacteria that thrive in high-salt environments (Lecture, Olivia)

Archaeal rhodopsins - Archaeal rhodopsins are light-sensitive and light-activated transmembrane proteins only found in archaeal plasma membranes. Bacteriorhodopsin (BR) and Halorhodopsin (HR) are both archaeal rhodopsins; they are light-driven proton and chloride pumps, respectively, indicating that the functionality of archaeal rhodopsins is diverse [2] (Katie)


BAC - bacterial artificial chromosome, a DNA construct used for transforming or cloning segments of DNA and often used to sequence the genetic code of organisms (Wikipedia.org, Jay)

Bacteriorhodopsin- A transmembrane archaeal rhodopsin protein that uses light energy to move protons across membranes, creating an electrochemical gradient that is converted into chemical energy [3] (Katie).

Bacterioruberin - Bacterioruberin is a “carotenoid pigment” found in some halophiles giving them a red color and providing assumed protection from strong sunlight [4]. The structure also plays a stabilizing role in the archaeal rhodopsin proteins [5] (Katie).

bioinformatics - the multi-disciplinary approach of using biology, computer science and mathematics to solve or better understand biological problems [6] (Matt)

BLAST - (Basic Local Alignment Search Tool) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. [7] (Mary)

Bligh-Dyer method- A lipid extraction method that uses chloroform-methanol as a solvent but also includes a re-extraction of the sample, just with chloroform, before evaporation of the solvent to capture more non-polar lipids. [8] The lipid membrane of archaea is extremely unique not only in composition (see Isoprenoid lipids) but also in the archaeal rhodopsins that are scattered among the plasma membrane [9]. In order to study the uniqueness of archaeal membranes one needs to observe the lipids outside of the membrane, which the Bligh-Dyer method accomplishes (Katie)

bioperl - a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications such as accessing sequence data from local and remote databases, transforming database formats, manipulating individual sequences, searching for similar sequences, searching for genes and other structures on genomic DNA, or developing machine-readable sequence annotations. [10] (Wikipedia, Max Win)

bootstrap value - common reliability test of a phylogenetic tree, calculated as a percentage. In generating a phylogenetic tree, the sequences will be resampled, or rerun, multiple times. If a pair of sequences are consistently grouped together for 100 out of 100 resamplings, then the certainty that those sequences are correctly grouped would be very high, and the bootstrap value would be 100. If a pair of samples were grouped together only 50 out of 100 resamplings, the certainty that those sequences are correctly grouped would be lower; the bootstrap value would be 50. On phylogenetic trees, these values may be placed adjacent to the group to which they refer. (Lecture, Olivia)
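
The resampling idea can be sketched as follows. A real bootstrap rebuilds a full tree for every replicate; to keep the example short, this sketch substitutes a much simpler grouping rule (the two sequences with the fewest mismatches count as "grouped together"). That rule, the function name, and the parameters are assumptions for illustration only.

```python
import random
from itertools import combinations


def bootstrap_support(aligned, pair, replicates=100, seed=0):
    """Percent of replicates in which `pair` is the closest pair of sequences.

    `aligned` maps name -> aligned sequence (all the same length).
    Each replicate resamples alignment columns with replacement.
    """
    rng = random.Random(seed)
    names = sorted(aligned)
    length = len(aligned[names[0]])
    target = tuple(sorted(pair))
    hits = 0
    for _ in range(replicates):
        # Draw a new alignment of the same width by sampling columns.
        columns = [rng.randrange(length) for _ in range(length)]

        def mismatches(two):
            first, second = two
            return sum(aligned[first][c] != aligned[second][c] for c in columns)

        # Simplified stand-in for tree building: take the closest pair.
        closest = min(combinations(names, 2), key=mismatches)
        if closest == target:
            hits += 1
    return 100 * hits / replicates
```

A value near 100 means the grouping survives almost every resampling; a value near 50 means it appears in only about half of them.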


carbon fixation - using carbon dioxide to create organic materials [11] (Samantha)

CCCP - carbonyl cyanide m-chlorophenyl hydrazone; a nitrile ionophore that inhibits oxidative phosphorylation and photophosphorylation. Ionophores are lipid-soluble molecules allowing them to transfer across membranes, creating pores that disrupt transmembrane ion gradients. (Sugiyama 1994 article, Olivia)

cell division control (Cdc) protein - for example, Cdc6 found in Halorhabdus utahensis; protein responsible for activating and maintaining mechanisms of cell division. Cell division control proteins are important in annotation because the presence of a Cdc gene is a good indicator for finding the origin of replication in a circular chromosome. (Bakke et al 2009, Olivia)

CDD (Conserved Domains Database)- a database used to identify the conserved domains present in a protein query sequence [12] (Mary)

cDNA - DNA that is reverse-transcribed from mature mRNA. A cDNA library provides templates for genes that are expressed within an organism. [13]. (Pyfrom)

chaperonin - a protein complex that assists some newly formed polypeptide chains by folding them into their final, functional, three-dimensional form [14] (Matt)

chemoorganotrophic - refers to organisms that obtain energy from oxidation/reduction reactions using organic electron donors (Link, Earthlife Claudia)

chemotaxis - the process in which cells will seek out or flee from a high concentration of certain chemicals and is found in both uni- and multicellular organisms. This process is used to avoid toxins or find food in unicellular organisms, or for tasks such as reproduction in multicellular organisms [15] (Nick)

chemotaxonomy - the attempt to classify and identify organisms according to demonstrable differences and similarities in their biochemical compositions [16] (Mary)

chimeric genome - A genome that consists of a mixture of genes from distinct species Baliga et al., 2004 (Karen)

ClustalW - A web-based or command line tool that performs multiple sequence alignments to determine evolutionary relationships between three or more sequences [17] (Will).

COG (Cluster of Orthologous Groups)- corresponds to a highly conserved domain and generally consists of either individual proteins or groups of paralogs (COG Pallavi)

comparative genomics - the study of relationships between genomes of different strains and species. Comparative genomics aims to define similarities and differences in structure and/or function of different proteins, RNAs and regulation between organisms (Wikipedia and Lecture, Olivia)

concatemer - long continuous DNA molecule that contains the same DNA sequence repeated in series [18](Samantha)

congenic - two strains of an organism that are nearly identical, varying only at a single locus (also called coisogenic) [19] (Megan)

consensus sequence - a nucleotide sequence that is common, though not necessarily identical, in different genes and in genes from different organisms that are associated with a particular function. [20] (Megan)
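
One simple way to derive a consensus is to take the most common letter in each column of an alignment. The sketch below does that; the function name is hypothetical, the sequences are assumed to be already aligned, and ties are broken by whichever letter appears first in the column.

```python
from collections import Counter


def consensus(aligned_seqs):
    """Return the most common letter at each column of equal-length sequences."""
    if len({len(seq) for seq in aligned_seqs}) != 1:
        raise ValueError("need at least one sequence, all of the same length")
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*aligned_seqs))
```

For example, `consensus(["GGAGG", "GGAGG", "GGTGG"])` gives `"GGAGG"`, even though the third sequence differs at one position.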

contigs (contiguous DNA) - overlapping DNA segments that as a collection form a longer and gapless segment of DNA. (Discovery Genomics, Proteomics and Bioinformatics [21], Max Win)

controlled vocabulary - a set of terms used to standardize the description of characteristics in organisms' genomes, as designated by the Gene Ontology (GO) project ([1] Campbell, Claudia)

coverage - refers to the number of times, on average, any piece of DNA in a sequenced genome has been individually sequenced (Lecture, Jay)
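
Average coverage follows directly from the number of reads, their length, and the genome size. The sketch below shows the arithmetic; the function name is hypothetical and the numbers in the usage note are invented, not measurements from any genome on this page.

```python
def mean_coverage(read_count, read_length, genome_size):
    """Average number of times each base is sequenced: reads x length / genome."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return read_count * read_length / genome_size
```

For instance, 1,000 reads of 500 bp over a 100,000 bp genome give `mean_coverage(1000, 500, 100000)`, which is 5X.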

CPAN (Comprehensive Perl Archive Network) - an archive of over 12,200 modules of software written in Perl, as well as documentation for it. It contains a module called CPAN (or CPAN.pm) which is used as an installer for Perl modules such as BioPerl [22](Will).

Cytogenetics-the study of normal and abnormal chromosomes. This involves studying the causes of chromosomal abnormalities and looking at the structure of chromosomes (7 Pallavi).


DCCD - dicyclohexylcarbodiimide; compound that acts as a proton ATPase inhibitor (Sugiyama 1994 article, Olivia)

de novo synthesis - the synthesis of complex molecules from simple molecules (e.g. sugars and nucleotides), rather than from recycled molecules; from the Latin "of the new" [23] (Matt)

dehydrogenase - a type of enzyme that oxidizes a substrate by transferring one or more protons and a pair of electrons to an acceptor. [24] (Peter)

deoxyribodipyrimidine photolyase - enzyme which breaks the errant covalent bonds that form pyrimidine dimers. UV light is a common cause of this particular anomaly and causes covalent bonds to form between adjacent pyrimidines. Many archaea and bacteria use deoxyribodipyrimidine photolyases in order to break these bonds and avoid errors during replication or transcription [25]. (Pyfrom)

diatom - a major group of eukaryotic algae, and one of the most common types of phytoplankton. A characteristic feature of diatom cells is that they are encased within a unique cell wall made of silica called a frustule. These frustules show a wide diversity in form, but usually consist of two asymmetrical sides with a split between them. [26] (Mary)

domain (protein) - the structural and functional groups of a protein, which can exist independently of the protein itself. Domains typically perform a specific function, such as binding to promoters or substrates, and many proteins can have one or several domains in common. Evolutionarily-linked proteins are more likely to have domains in common. Domains are used to organize proteins into families. (Wikipedia article, Laura)

dot plot-graphical display comparing sequence conservation between two genomes with dots indicating strings of identical bases. (Discovery Genomics, Proteomics and Bioinformatics[27], Max Win)
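
The dots in such a plot can be computed by recording every position where a short word from one sequence also occurs in the other. This is a minimal sketch with a hypothetical function name; it finds exact word matches on one strand only, and real dot plot tools usually also handle the reverse strand and tolerate mismatches.

```python
def dot_plot(seq_a, seq_b, word=3):
    """Return (i, j) pairs where seq_a[i:i+word] equals seq_b[j:j+word]."""
    # Index every word of seq_b by its start positions.
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)
    # Look up each word of seq_a in that index.
    return [(i, j)
            for i in range(len(seq_a) - word + 1)
            for j in index.get(seq_a[i:i + word], [])]
```

Plotting the returned pairs with `i` on one axis and `j` on the other shows conserved stretches as diagonal lines.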

draft genome- a genome that has been sequenced by computers and programs but has not yet been reviewed by humans in order to create a finished genome. Draft genomes usually contain gaps or mistakes due to the limited capacity of the programs used for sequencing (Lecture, Pyfrom).


EC number (Enzyme Commission Number)- a numerical classification scheme for enzymes, based on the chemical reactions they catalyze [28] (Mary)

Edman degradation-A method for sequencing amino acids in a peptide chain. It allows the ordered protein sequence to be determined by proceeding from the N-terminus of the chain and piecing together fragmented sequenced chains of a protein [29] (Katie).

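E-value (Expect value) - When performing a BLAST search, you will obtain an E-value for each sequence that is retrieved. An E-value can be thought of loosely as the probability that two sequences are similar to each other by chance; more precisely, it is the number of alignments with an equal or better score that would be expected by chance in a database of that size, so smaller values indicate more significant matches. (Discovery Genomics, Proteomics and Bioinformatics[30], Max Win)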
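
For readers who want to see where the number comes from, the standard Karlin-Altschul statistics behind BLAST give E = K x m x n x e^(-lambda x S), where S is the raw alignment score, m and n are the query and database lengths, and K and lambda depend on the scoring system. The sketch below only illustrates that formula: the function name is hypothetical, and any K and lambda values passed in are assumptions rather than the parameters BLAST would report for a given search.

```python
import math


def expect_value(raw_score, query_length, database_length, k, lam):
    """Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S)."""
    return k * query_length * database_length * math.exp(-lam * raw_score)
```

Because the score sits in a negative exponent, a modest increase in alignment score lowers the E-value sharply, while a larger database raises it in direct proportion.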

epistasis - the interaction between two or more genes to control a single phenotype. Epistasis is not the same as dominance; dominance involves the interaction of two alleles for the same gene, whereas epistasis is the interaction of different genes. [31] (Megan)

expressed sequence tag (EST) – a short piece (200-500bp) of transcribed cDNA that can be used to determine the position of an expressed gene within the genome [32]. (Pyfrom)

extremophile - an organism that thrives in and may even require physically or geochemically extreme conditions that are detrimental to the majority of life on Earth [33] (Will).


FASTA format - a format used to convey either nucleic acid sequences or peptide sequences, in which base pairs or amino acids are represented by single-letter codes. The sequence name and other descriptors often precede the amino acid sequence. [34] (Nick)
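
A FASTA file is simple enough to read with a few lines of code. The sketch below is one minimal way to do it, with a hypothetical function name; it takes the first word after ">" as the record name and joins the following lines into one sequence. Established libraries such as BioPerl or Biopython handle more edge cases.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word of the header line.
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {name: "".join(parts) for name, parts in records.items()}
```

For example, `parse_fasta(">p1 example protein\nMKT\nAAL\n")` returns `{"p1": "MKTAAL"}`.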

family (protein) - a group of evolutionarily-related proteins, often with one or several domains in common. Families are organized by domain overlap, structural/functional similarity, and sequence similarity. (Wikipedia article and lecture, Laura)

finished genome - a genome that has been sequenced at least partly by hand, resulting in at least 99.99% sequence accuracy (Lecture, Jay)

frustule - a hard, porous cell wall made up of silica that makes up the outermost layer of diatoms. These structures have complex and elaborate designs (Wikipedia Claudia)

fusion mRNA-mRNA that results from the transcription of a gene after a chromosomal translocation event. This results in an mRNA sequence that comes from two different genes (Rowley and Blumenthal 2008 Science Pallavi)


GAF Domain - A GAF domain is a small-molecule binding unit present in all domains of life. It is a light-responsive domain found in plant and cyanobacterial phytochromes (a pigment photoreceptor used to detect light). This domain plays an important role in an organism's ability to respond to its environment. (Baliga et. al., Molecular Interventions, Ecomii Claudia)

gap - a region of the genome for which no sequence is currently available. Two types of gaps exist: heterochromatic gaps consist largely of highly repetitive sequence (whose exact non-overlapping sequence is therefore difficult to determine), and euchromatic gaps are more likely to contain genes. [35] (Megan)

GC Content - the percentage of bases within a certain sequence of DNA (e.g. a gene or a genome) that are either guanine or cytosine; a higher GC content is characteristic of a coding region of a gene; differences in GC content between a gene and a genome can be used as evidence for horizontal gene transfer [36] (Matt)
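
GC content is a straightforward count. The sketch below uses a hypothetical function name and ignores any letter other than A, C, G, or T (for example N), which is an assumption about how ambiguous bases should be treated.

```python
def gc_content(seq):
    """Percentage of A/C/G/T bases in `seq` that are G or C."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A, C, G, or T")
    return 100 * sum(base in "GC" for base in counted) / len(counted)
```

Comparing `gc_content` of a single gene with that of the whole genome is one quick way to flag candidates for horizontal gene transfer.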

GC-skew – uneven distribution of guanine and cytosine bases between the two strands of DNA where GC base pairs occur. (Discovery Genomics, Proteomics and Bioinformatics[37], Max Win)
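
GC skew is commonly calculated as (G - C) / (G + C) in windows along one strand. The sketch below follows that convention; the function name, window size, and step are arbitrary choices for illustration.

```python
def gc_skew(seq, window=1000, step=1000):
    """Return (start, skew) per window, where skew = (G - C) / (G + C)."""
    seq = seq.upper()
    result = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # Windows with no G or C are reported as zero skew.
        skew = (g - c) / (g + c) if g + c else 0.0
        result.append((start, skew))
    return result
```

In many circular prokaryotic chromosomes the sign of the skew changes near the origin and terminus of replication, which makes this a useful companion to searching for Cdc genes.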

gene amplification - production of multiple copies of a gene in order to amplify the amount of protein that the gene encodes for [38] [39] (Matt)

gene calling - Determining which parts of a sequenced genome represent genes. This process could also be called gene finding. The process is generally fully automated. Magnaporthe grisea Automated Gene Calling(Karen)

gene fusion-occurs when DNA segments of two different genes come together. Can result in hybrid proteins (9 Pallavi)

gene knockout - a process in which a gene is deactivated within a test organism in order to better understand the function of the gene in that organism [40] (Matt)

gene ontology - a collaborative effort of investigators to unify and standardize terms associated with the role a gene or protein plays in an organism. (Discovery Genomics, Proteomics and Bioinformatics[41], Max Win)

gene transfer - the incorporation of a DNA segment into an organism's cells, or DNA. This usually occurs through a vector such as a virus. This method is used in gene therapy. (Genomics.energy.gov Claudia)

genome annotation - the process of attaching biological meaning to sequence data. In other words, genome annotation involves determining where genes are located in a genome and discovering functions of these genes. Genome annotation: from sequence to biology (Karen)

glaucophyte - freshwater algae that have not been studied well [42](Samantha)


haemolysin or hemolysin - a chemical produced by a bacteria that causes lysis of red blood cells [43] (Nick)

halophile - an organism, most often of the Archaea domain, that lives in environments containing high concentrations of salt [44] (Matt)

haplotype-collection of alleles that travel together (Lecture, Pallavi)

haptophyte - phylum of algae [45](Samantha)

heterokont - major line of eukaryotes consisting of about 10,500 known species, most of which are algae [46](Samantha)

Heterologous -literally meaning, “derived from a different organism,” heterologous refers to the fact that the gene/protein of interest was taken from a different cell type or species than the gene/protein recipient [47]. (Katie)

Hidden Markov Model - a statistical model used in protein recognition databases such as Pfam. A Hidden Markov Model keeps track of several variables and possible variations thereof, such as the possible amino acid sequences that make up a protein domain (since there can be some variance in an amino acid sequence) or the variations in the component sounds that make up a word, and uses those points to match a given sequence to the word, domain, or other complex sequence it most closely matches. An HMM in speech recognition software, for example, can identify that a certain set of sounds make up a certain word, even with the variations in pronunciation and accent that different people will give those sounds. (Wikipedia and lecture, Laura)

hierarchical genome shotgun sequencing - a method for sequencing genomic DNA. Genomic DNA is cut into pieces of about 150 kb and inserted into BAC vectors, which are transformed into E. coli, where they are replicated and stored. The BAC inserts are isolated and mapped to determine the order of each cloned 150 kb fragment. This is referred to as the Golden Tiling Path. Each BAC fragment in the Golden Path is fragmented randomly into smaller pieces and each piece is cloned into a plasmid and sequenced on both strands. These sequences are aligned so that identical sequences are overlapping. These contiguous pieces are then assembled into finished sequence once each strand has been sequenced about 4 times to produce 8X coverage of high quality data [48]. (Pyfrom)

HMM Logo - a graphical representation of an HMM, detailing the possible amino acid sequences, the relative frequencies and probabilities of each amino acid in the sequence, the relative contribution each amino acid has to the overall protein family, and the charge or nature of the amino acids themselves. (How to read HMM Logos, on Pfam, Laura)

homeobox - DNA sequence within transcription factor genes that allow the cell to respond to patterns of development by having the transcription factors switch on gene cascades [49](Samantha)

homodimer - a protein made of paired identical polypeptides (Answers.com, Jay)

horizontal gene transfer - DNA transmission between species and incorporation of the DNA into the recipient's genome (horizontal gene transfer Pallavi)

Hox gene - a gene that contains a homeobox region and is involved in morphogenesis along the cranio-caudal body axis (4 Pallavi)

hydrolase - an enzyme that catalyzes hydrolysis, the cleavage of a chemical bond by the addition of water; the resulting products often take part in subsequent reactions [50] (Nick)

Hydropathy analysis - a method for determining the hydrophobic nature of an amino acid sequence. It moves a window through the sequence, summing the Gibbs free energy values for the amino acids in each window, and uses the resulting values to identify hydrophobic segments. [51] With respect to halophiles, there is evidence to suggest that protein stability may, in some cases, depend on high salt concentrations; since the hydrophobic nature of a protein increases its stability, it is important to be able to assess stability in terms of hydropathy [52] (Katie)
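
The sliding-window idea can be sketched in a few lines of Python. This is only an illustration: it assumes the Kyte-Doolittle hydropathy scale and averages (rather than sums) each window, which may differ from the programs cited above. The function name, the window size, and the example peptide are hypothetical.

```python
# Kyte-Doolittle hydropathy values (an assumed scale; other scales exist).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Return the mean hydropathy of every window along the sequence."""
    sequence = sequence.upper()
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


if __name__ == "__main__":
    # Hypothetical peptide: a hydrophobic stretch flanked by charged residues.
    peptide = "DEKRLLIVAVLLIAGDEKR"
    for start, value in enumerate(hydropathy_profile(peptide, window=5), start=1):
        print(start, round(value, 2))
```

Windows with high positive averages mark candidate hydrophobic segments; a larger window smooths the profile at the cost of resolution.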

hypothetical protein - a protein predicted from a gene in a genome whose existence and function have not been experimentally tested or proved. Any predicted function is inferred from the protein's structural similarities to proteins of known function as well as the protein's sequence makeup; often it has no characterized counterparts in the protein database. (Web Definitions Claudia)


ideogram - in genomics, usually describes a stylized representation of a chromosome with banding patterns (Campbell-Heyer Genomics textbook, Jay)

identities - in a BLAST output, the number and fraction of total residues which are identical in a given alignment [www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)
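
As a rough illustration of how the identities figure is derived, the sketch below counts identical positions in two already-aligned rows. The function name and the two aligned strings are made up for the example, and the gap character is assumed to be "-"; it is not BLAST's own code.

```python
def count_identities(aligned_query, aligned_subject):
    """Count identical residues in two gapped alignment rows of equal length."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned rows must have the same length")
    matches = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"  # a gap never counts as an identity
    )
    return matches, len(aligned_query)


# Hypothetical aligned rows, including one gap in the query.
matches, length = count_identities("MKT-AYIAK", "MKTQAYLAK")
print(f"Identities = {matches}/{length}")
```

The denominator here is the alignment length, gaps included, so a gapped alignment can report a lower fraction than an ungapped one covering the same residues.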

immunoprecipitation - the technique of precipitating a protein out of solution using an antibody that specifically binds to that particular protein. This process can be used to isolate and concentrate a particular protein from a sample containing many thousands of different proteins [53]. (Pyfrom)

indel - term used to describe insertions or deletions within a genome. Since an insertion in one genome is a deletion in another, "indel" is a catch-all term coined to remove the relative subjectivity of classifying a mutation as either an insertion or a deletion (Lecture, Pyfrom).

indole - a chemical compound that is produced from the breakdown of tryptophan (indole Pallavi)

inclusion body - Inclusion bodies are collections of stainable substances, usually proteins, that are found either in the nucleus or the cytoplasm. It is thought that these bodies are often the result of viral proteins that misfolded [54] (Nick)

intergenic distance - The distance (in base pairs) between genes wikipedia (Karen)

intron - a region of DNA in a gene that is not part of the final coding sequence for the protein. [55] (Peter)

IS elements - (insertion sequence element) sequences of DNA that can transpose to new positions in the genome. This can cause disruptions in other gene coding regions and major reorganizations of the genome Baliga et al., 2004 (Karen)

isoelectric point - the pH at which a molecule carries no net electrical charge [56] (Nick)

Isoprenoid lipids - lipids made from five-carbon isoprene units (also known as isoterpene units); isoprene is the organic compound CH2=C(CH3)CH=CH2 [57]. In archaea, the side chains of phospholipids are built from isoprene instead of fatty acids, making them isoprenoid lipids [58]. (Katie)

isozymes - members of a gene family with very similar cellular roles (Campbell-Heyer Genomics textbook, Jay)


Junk DNA - sections of DNA that do not code for genes, or a label for stretches of DNA for which no function has been identified. Non-coding DNA is often referred to as "junk DNA." [59] (Megan)


KEGG (Kyoto Encyclopedia of Genes and Genomes) - a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The Pathway database records networks of molecular interactions in the cells, and variants of them specific to particular organisms [60](Will).

kinase - a type of enzyme that transfers a phosphate group from a high-energy donor molecule to a target molecule in a process called phosphorylation. [61] (Peter)


lateral gene transfer - see "horizontal gene transfer" (Pallavi)

Liposome - a microscopic, fluid-filled vesicle whose phospholipid walls are similar in structure to the cell membrane. Liposomes are often used as models for artificial cell membranes, which is useful for studying the unique properties of archaeal membranes outside the archaeal organism, and for drug delivery [1] (Katie).


Manatee - a web-based gene evaluation and genome annotation tool that can view, modify, and store annotation for prokaryotic and eukaryotic genomes. This ongoing, open-source initiative was developed with two missions: first, to allow biologists to functionally annotate their genomes using a powerful, stand-alone web application with a robustly designed relational annotation database; and second, to give outside developers the opportunity to contribute their own ideas and requirements to enhance Manatee's ability to accomplish biological goals [62](Will).

metabolism - chemical reactions organisms utilize in order to maintain life. Metabolism can be constructive such as anabolism in which energy is used to create cell components like protein, or it can be destructive such as catabolism where a substance such as sugar is systematically broken down in order to harvest energy for the organism. Wikipedia (Karen)

microsatellites - stretches of repetitive, short DNA segments that can be used to track the inheritance of certain traits within families (3 Pallavi)

minisatellites - segments of DNA that can be used for individual identification (e.g. DNA fingerprinting) or in determining relationships between people (e.g. paternity cases) (2 Pallavi).

motif - a sequence of amino acids or nucleotides that performs a particular role and is often conserved in other species or molecules. (Discovery Genomics, Proteomics and Bioinformatics[63], Max Win)

mycoplasma - genus of bacteria that lack a cell wall [64] (Nick)


NCBI - (The National Center for Biotechnology Information) is a division of the National Library of Medicine (NLM) in the National Institutes of Health (NIH). This organization seeks to develop and make available information technologies for use in discovering and deciphering the fundamental molecular and genetic processes affecting health and disease. (NCBI Claudia)

NORFs (nonannotated open reading frames) - an open reading frame that was considered not to be a real gene when the genome was annotated. (Discovery Genomics, Proteomics and Bioinformatics[65], Max Win)

nucleomorph - reduced eukaryotic nuclei found in plastids [66](Samantha)


object-oriented programming - a programming paradigm in which collections of data, associated with operations on that data, are modularly defined and then built upon (CSC 121 Lecture, Will).

open reading frame (ORF) - a segment of DNA that can potentially encode a protein; it begins with a start codon (usually ATG) and ends at a stop codon ORF (Pallavi)
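
A minimal sketch of how ORFs might be located computationally is given below. It is a simplification under stated assumptions: only ATG is treated as a start codon, only the three forward-strand reading frames are scanned, and the function name and the `min_codons` cutoff are hypothetical. Real gene callers also consider the reverse strand and alternative start codons.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    Coordinates are 0-based and end-exclusive; the span includes the
    start codon and the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        orf_start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if orf_start is None and codon == START:
                orf_start = i
            elif orf_start is not None and codon in STOPS:
                if (i + 3 - orf_start) // 3 >= min_codons:
                    orfs.append((orf_start, i + 3, frame))
                orf_start = None
    return orfs


# Hypothetical toy sequence containing one short ORF: ATG AAA TTT TAG.
print(find_orfs("CCATGAAATTTTAGGG"))
```

Raising `min_codons` is the usual way to filter out short ORFs that arise by chance.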

operon - a segment of DNA involving an operator, promoter, and one or more genes that operate as a single unit during transcription [67] (Nick)

opsin - In eukarya, this is a group of light-sensitive G protein-coupled receptors often found in the retina. In prokaryotes, opsins such as bacteriorhodopsin harvest energy from light, for example by pumping protons across the membrane. Additionally, these proteins are independent of any chlorophyll pathway Wikipedia (Karen)

optical mapping - DNA sequences of the organism in question are compared against a karyotype that specifically looks at restriction sites found within the DNA to correctly order the DNA sequences on a chromosome. This methodology gives very detailed haplotype information and allows for the detection of sequence variations across an entire genome optical mapping (Pallavi)

ortholog - one of a group of homologous DNA sequences, each found in a separate genome, that descend from a single sequence in a common ancestor and were separated by speciation. Orthologs usually look very similar. (Lecture, Olivia)

oxidoreductase - an enzyme that catalyzes redox reactions by transferring electrons from one molecule (the reductant) to another (the oxidant) [68] (Nick)


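paralog - one of a set of homologous DNA sequences within the same genome that arose by duplication; paralogs are similar but not necessarily identical (Lecture, Pallavi)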

p-arm - the shorter arm of a chromosome's two arms separated by the centromere (compare to q-arm, the longer arm) (MedTerms Dictionary, Jay)

Perl - Developed by Larry Wall in 1987, Perl is a high-level programming language used frequently by biologists and bioinformaticists [69] (Will).

periplasmic space - the space between the inner cytoplasmic membrane and external outer membrane in bacteria or archaea. [70] (Peter)

Pfam - a database for protein domain families that matches amino acid sequences or nucleotide sequences to the related group of proteins to which they most likely belong. (Pfam Help, Laura)

plasmid - an extra-chromosomal DNA molecule that is capable of replicating independently of the chromosomal DNA. Commonly found in bacteria and archaea. [71](Peter)

plastid - major organelles in plants or algae [72](Samantha)

pleomorphism - the occurrence of two or more structural forms during a life cycle [73] (Mary)

phylogenetic tree - a diagram showing the evolutionary relationships between biological species that are thought to share a common ancestor [74] (Nick)

phylotypes – a term intended to resolve the challenge of “species” when classifying prokaryotes using DNA sequence comparisons. (Discovery Genomics, Proteomics and Bioinformatics[75], Max Win)

phytanyl lipids - Organically, a phytanyl is a branched-chain hydrocarbon containing 20 carbon atoms [76]. Phytanyl lipids are often found in the membrane of archaea and are thought to contribute to increased membrane stability at high salt concentrations [van de Vossenberg et al. Extremophiles (1999) 3:253-257]. (Katie)

phytochrome - a pigment that acts as a photoreceptor that triggers a response or signaling cascade in many plants and bacterial organisms as well as some animals. It is made up of a chromophore, or a compound that absorbs visible light, which is bound to a protein. Phytochrome is one of the most intensely colored pigments found in nature. This intense pigmentation allows the organism to sense even dim light. (Ecomii, Phytochrome Claudia)

positives - in a BLAST output, the number and fraction of residues for which the alignment scores have positive rather than negative values [77] (Mary)

promoter - a region of DNA that facilitates transcription of a gene; promoters are typically located closely upstream of the gene they regulate [78] (Megan)

proteome - entire set of proteins expressed by a genome, cell, tissue, or organism. It may refer to expressed proteins under certain conditions [79](Samantha)

proton pump - an integral membrane protein capable of transporting protons across a membrane. Mitochondria utilize proton pumps in order to create a proton gradient used for producing ATP. Wikipedia (Karen)

PSORT - a prediction server that judges where a mature protein could be in the cell, based on its transmembrane domains, its predicted mature amino acid composition, and its signal sequences. (PSORT, Laura)

pseudogenes - sequences of DNA that look like genes, but most likely contain many stop codons. A pseudogene may have evolved away from a real gene, or a paralog might have taken its place (Lecture, Pallavi)

purine - a category of nitrogenous base consisting of a pyrimidine ring fused to an imidazole ring. Notable purine bases are adenine and guanine. [80] (Peter)

p-value - the probability associated with a statistical test of the difference between populations: the probability of observing a difference at least as large as the one measured if the populations did not actually differ. Populations are considered significantly different if the associated p-value is small (typically 0.05 or smaller). (Discovery Genomics, Proteomics and Bioinformatics[81], Pyfrom)

pyrimidine - a category of nitrogenous base consisting of a heterocyclic aromatic ring containing two nitrogen atoms at positions 1 and 3 of the six-member ring. Notable pyrimidine bases are cytosine, thymine, and uracil. [82] (Peter)


q-arm - the longer arm of a chromosome's two arms separated by the centromere (compare to p-arm, the shorter arm) (MedTerms Dictionary, Jay)

query sequence - the sequence (whether amino acid or nucleotide) entered into a database’s search function and checked against the database entries. (BLAST on Wikipedia, Laura)


RAST - (Rapid Annotation using Subsystem Technology)- a fully-automated service for annotating bacterial and archaeal genomes. It provides high quality genome annotations for these genomes across the whole phylogenetic tree. ([83], Max Win)

rDNA - DNA sequences that encode ribosomal RNA. Note that rDNA can also stand for recombinant DNA. (rDNA Pallavi)

replicon - a region of DNA or RNA that replicates from a single origin of replication [84] (Megan)

repressor - a protein that binds to a section of DNA in order to regulate one or more genes by decreasing the rate of transcription [85] (Megan)

residue (protein) - the remaining portion of an amino acid after a water molecule has been removed and it has been incorporated into a protein. Functional residues, referred to in Pfam, are the residues that perform some specific identifiable function or are part of a domain, and can be conserved across evolutionarily-related proteins. (Pfam Help, Laura)

retinal - vitamin A aldehyde; a chromophore (colour-producing molecule) that is bound to proteins called opsins. For example, Haloarcula and other halophilic archaea have light-driven proton pumps such as bacteriorhodopsin. This pump contains a reddish-purple retinal that absorbs green visible light. (Wikipedia, Olivia)

retropseudogenes - genes that have been reverse-transcribed from mRNA, with the resulting DNA sequence incorporated back into the genome. They are non-functional segments of DNA and can be distinguished from other pseudogenes in that they do not have intron sequences. (1 Pallavi)

retrotransposons - transposable elements that move through an RNA intermediate: the RNA is reverse-transcribed back into DNA and added into the genome [86](Samantha)

ribonuclease - a nuclease that catalyzes the degradation of RNA into smaller components [87] (Mary)

ribosome binding site (RBS) - short purine-rich sequence found directly (4-8 bp) upstream of the start codon of a protein coding sequence to which ribosomes bind to begin translation. The RBS sequence tends to be species-specific, and the consensus sequence acts as a good indicator of the start site of a gene (Bakke et al 2009 and Lecture, Olivia)

ribozyme - an RNA molecule that acts as an enzyme to catalyze a reaction. Some ribozymes can catalyze self-splicing by folding in order to remove introns without the need for a protein. (Lecture, Olivia)

RNAi (RNA interference) - a process by which short pieces of RNA are used to degrade larger pieces of complementary RNA. It is found in many eukaryotes and is being considered as a possible approach for gene therapy where a reduced gene product would alleviate symptoms [88]. (Pyfrom)

RNaseP - a ribozyme that cleaves off a precursor section of RNA from a tRNA molecule. Previously, it was thought that this ribozyme was necessary for life and therefore ubiquitous. However, species of archaea have been discovered that have adapted to life without it. Wikipedia; Life without RNaseP (Karen)


Serovar - a subdivision of a species based on the characteristics of its cell surface antigens (serovar Pallavi)

sequence tag site (STS) - A sequence-tagged site (or STS) is a short (200 to 500 base pair) DNA sequence that has a single occurrence in the genome and whose location and base sequence are known [89]. (Pyfrom)

scaffold - a section of a sequenced genome composed of contigs that are in the right order but not necessarily connected (MedTerms Dictionary, Jay)

Shadow enhancers - secondary enhancers that are thought to be important for natural selection to occur in regulatory DNA segments. They evolve much faster than primary enhancers, which suggests that they are under fewer functional constraints (Wray and Babbit 2008 Science Pallavi)

Shine-Dalgarno sequence - A ribosomal binding site on an mRNA, usually a sequence of six bases located about six or seven bases upstream of the start codon. An anti-Shine-Dalgarno sequence exists on the rRNA in the small subunit of the ribosome; when the two sequences pair, the mRNA is lined up and prepared for translation. (Lecture and Wikipedia article, Laura)
Note: The Shine-Dalgarno consensus sequence for our genome is ccGGAGGt.
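
A simple way to look for this signal is to search a short window upstream of a candidate start codon for the conserved core of the consensus. The sketch below assumes the upper-case GGAGG portion of ccGGAGGt is the core, works on the DNA (coding-strand) sequence rather than the mRNA, and uses a hypothetical function name, window size, and example sequence; it requires an exact match, which real genes often do not have.

```python
SD_CORE = "GGAGG"  # assumed core of the ccGGAGGt consensus


def find_sd_upstream(dna, start_index, max_upstream=20):
    """Find the SD core nearest to the start codon.

    Returns (motif_position, spacer_length), where spacer_length is the
    number of bases between the end of the core and the start codon,
    or None if the core is absent from the upstream window.
    """
    dna = dna.upper()
    window_start = max(0, start_index - max_upstream)
    upstream = dna[window_start:start_index]
    hit = upstream.rfind(SD_CORE)  # rightmost hit is closest to the start
    if hit == -1:
        return None
    motif_position = window_start + hit
    spacer = start_index - (motif_position + len(SD_CORE))
    return motif_position, spacer


# Hypothetical sequence with a start codon (ATG) at index 17.
example = "TTACCGGAGGTACGATCATGAAACGT"
print(find_sd_upstream(example, example.find("ATG")))
```

A hit with a spacer of roughly six or seven bases supports the chosen start codon; a missing hit does not rule the gene out, since the match here must be exact.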

SignalP - a prediction server that judges whether or not a query protein contains a signal peptide. SignalP measures each amino acid against the amino acid sequences of probable signal peptide matches and predicts the cleavage site of the signal peptide. (SignalP Output explained, Laura)

signal peptide - a short peptide chain that directs the post-translational transport of a protein [90] (Matt)

Smith-Waterman alignment - A well-known algorithm for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure [91](Will).
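
The algorithm can be sketched compactly as a dynamic-programming table in which no cell is allowed to drop below zero, followed by a traceback from the highest-scoring cell. The scoring values below (match +2, mismatch -1, linear gap -2) and the example sequences are illustrative assumptions; practical tools use substitution matrices and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the table; the max with 0 lets an alignment restart anywhere.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        score = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + score:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical sequences sharing only the segment ACGT.
print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))
```

Because the table is reset to zero wherever the score would go negative, unrelated flanking sequence does not penalize the shared segment, which is what makes the alignment local rather than global.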

SNP (Single Nucleotide Polymorphism) - a DNA sequence variation occurring when a single nucleotide in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual) [92](Will).

subject sequence - In BLAST, the sequences retrieved from the database, which are compared for similarity to the query sequence, are considered subject sequences. As a general rule, subject sequences should be longer than the query sequence. BLAST searching (Karen)

symporter - an integral membrane protein that is involved in movement of two or more different molecules or ions in the same direction across a phospholipid membrane. [93] (Peter)

synteny - a neologism from the Greek for "on the same ribbon". Genes that are syntenic in one species are on the same chromosome; genes that are syntenic across species retain the same order on respective chromosomes as a result of descent from a common ancestor (Answers.com, Jay)

synthetase - a type of enzyme that creates a new covalent bond and requires direct input of energy from a high-energy phosphate. [94] (Peter)


toxicogenomics - a subdiscipline of genomics that deals with gene and protein activity in order to determine how organisms respond to toxins in the environment. This has important implications for research concerning the effects of toxins on genetic material, and how that affects the organism in question (MedTerms, WebDefinitions Claudia).

transcriptome - the set of all mRNA molecules transcribed from a genome [95] (Megan)

transferase - an enzyme that catalyzes the transfer of a functional group from one molecule (the donor) to another (the acceptor) [96] (Matt)

transmembrane helix - a single transmembrane alpha helix of a transmembrane protein, usually about twenty amino acids in length. They are usually predicted by hydrophobicity. [97](Mary)

transposons / transposable elements - DNA sequences that can move around to different positions in a single cell's genome. Transposons can cause mutations and change the length of the genome. [98](Samantha)

transposon mutagenesis - a procedure in which a transposon is inserted into a gene, which inactivates the gene and can lead to the discovery of the phenotype associated with this gene (transposon mutagenesis Pallavi)

trans-splicing - fragmented exon sequences fuse to form a mature species of mRNA. This process results in fusion mRNA (8 Pallavi).

tRNA splicing endonuclease - an enzyme that cleaves intervening sequences of precursor tRNA. [99] (Peter)

type strain - an isolated sample of an organism that acts as the reference point for defining that species (Lecture, Olivia)



Vertical gene transfer - the transmission or absorption of genetic material that is associated with sexual reproduction and, thus, acknowledges species-specific boundaries (6 Pallavi)


whole genome shotgun sequencing - a method of sequencing in which DNA is cut into small pieces and cloned into vectors, then about 500 bp is sequenced from both ends of every insert to form mate pairs. Mate pairs rarely overlap, but are used to reassemble the sequence using software. [100](Samantha)


xenobiotic - a substance that is found within an organism that is not normally produced or expected to be found within that organism [101] (Megan)

xenolog - homologs that are created by horizontal gene transfer between two different species [102] (Matt)