Blueberry Genome Project for Bio343

From GcatWiki
Revision as of 15:44, 14 February 2012 by Minuttle (talk | contribs) (M)
This page will be used by Davidson College students in the Genomics Laboratory course.

Wiki Glossary


Spring 2012

Aaron_D

Mike_N

Shamita_P


SSR Guidelines: Ideally, I’d like my primer as close to the gene as possible. The farther away you get, the more likely you are to have recombination between the marker and the gene of interest. I also tend to prefer di- and trinucleotide repeats of lengths greater than 5, as these tend to be the most polymorphic among different lines. Total fragment length (both primers plus the sequence between them) is ideally above 100 bp and less than 700 bp. Smaller fragments are hard to score accurately, and fragments longer than 700 bp can’t be scored accurately on automated capillary sequencers due to the limits of the PCR reaction and the lane standards in fragment analysis kits.
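
The guidelines above can be turned into a rough screening script. The following is only a sketch: the function names are hypothetical, and "lengths greater than 5" is read here as "more than 5 repeat units", which is an assumption about the intended meaning.

```python
import re

# Thresholds taken from the SSR guidelines; the interpretation of
# "lengths greater than 5" as "at least 6 repeat units" is an assumption.
MIN_REPEATS = 6
MIN_FRAGMENT = 100   # bp, exclusive lower bound
MAX_FRAGMENT = 700   # bp, exclusive upper bound

# A 2- or 3-base motif followed by at least MIN_REPEATS - 1 further copies.
SSR_PATTERN = re.compile(r"([ACGT]{2,3}?)\1{%d,}" % (MIN_REPEATS - 1))


def find_ssrs(sequence):
    """Return (start, motif, repeat_count) for each di/trinucleotide SSR."""
    hits = []
    for match in SSR_PATTERN.finditer(sequence.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            # Skip homopolymer runs such as AAAAAA, which are not di/tri repeats.
            continue
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


def fragment_length_ok(forward_start, reverse_end):
    """Check total fragment length (both primers plus the sequence between them).

    Coordinates are 0-based and half-open, so the length is end - start.
    """
    length = reverse_end - forward_start
    return MIN_FRAGMENT < length < MAX_FRAGMENT
```

A candidate marker would then be a repeat reported by find_ssrs that lies near the gene of interest and for which a primer pair passes fragment_length_ok.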





Personal Lab Notebooks

Laura

Lexi

Dylan

Puneet

Leland

Jared

Lauren

William

Team Lab Notebooks

Leland & Will

Dylan & Jared

Lauren & Puneet

Lexi & Laura

Team Foci For Projects

Priority List of Topics


Small-scale Projects


Large-scale Projects

Tutorials, Past and Present

Spring 2011

  1. Laura = rRNA gene identification File:RRNAtutorial.docx
  2. Lexi = find gene structure of orthologs File:Genomics Tutorial.docx
  3. Puneet = tRNAs identification File:Finding tRNAs.docx; powerpoint File:Finding tRNA tutorial.pptx
  4. Leland = Parsing Blast Results from Your Favorite Database
  5. Jared = Potential Gene Across-Species Phylogenetic Analysis with Mr. Bayes
  6. Lauren = how to deal with multi-named genes
  7. Dylan = tBLASTn and Protein Sequence Analysis
  8. Will = File:How to Deal With 3 Partial Genome.docx


Fall 2009

  1. Media:Creation of Sequence Logos Using WebLogo.doc (Katie)
  2. Determining whether genes called in JGI and RAST are identical (Karen)
  3. The Ins and Outs of ClustalW2 (Sarah)
  4. Mastering the Art of NCBI: It's a BLAST (Claudia)
  5. Media:ClustalW_Tutorial.doc - (Olivia, Fall 2009)
  6. Media:KEGG_pathway_tutorial.doc - (Megan)
  7. Olivia - perl script to compare proteomes (links to Katie's and Megan's pages)
  8. Katie - two web pages, one for downloading original perl scripts and one for sample small scale version (convert to fasta and compare proteomes)
    link Proteome Compare
  9. Claudia - How To Find and Format Genome Sequences
  10. Megan - Determining Unique and Conserved Proteins: How to Use Katie's Webpage
  11. Karen - how to deal with output from web pages
  12. Sarah - CRISPR resources

Fall 2008

  1. Will DeLoache - BioPerl Installation
  2. Max Win - Introduction to Perl for non-programmers (with step-by-step explanations, simple exercises, and solutions)
  3. Pallavi - Conserved Domains Database (CDD) Media:CDDtutorial.doc
  4. Mary - Protein Data Bank (PDB) Media:PDB Tutorial.doc
  5. Laura Voss - Pfam Database Pfam Tutorial
  6. Samantha Simpson - NCBI BLAST
  7. Peter Bakke - Media:ShineDalgarnoTutorial.doc
  8. Jay McNair - Origin of Replication Tutorial
  9. Nick Carney - Navigating the JGI Database Media:NavigatingJGItutorial.doc
  10. Matt Lotz - SEED Viewer - Media:SEEDTutorial.doc
  11. Pallavi: I will compare RAST and KEGG in pathway annotations and use Glycolysis/Gluconeogenesis as my example: Media:Pallavitutorial.doc
  12. Matt: WikiPathways Media:WikiPathwaysTutorial2.doc
  13. Mary: ENZYME Media:ENZYME tutorial.doc
  14. Samantha: How To Determine EC Numbers
  15. Nick: Metacyc Media:MetaCyc tutorial.doc
  16. Max: KGML How to color EC numbers in KEGG maps and view it in KGML graph editor
  17. Jay: SEED Scenario Paths (a tool to determine completeness of pathways)
  18. Laura: Pathway Entrances and Exits
  19. Will: Running BLAST Locally
  20. Peter: Exploring Proteases: MEROPS Peptidase Database Tutorial - Media:MEROPStutorial_PB.doc



Links to Multiple Databases


Papers of Interest


Submitted Course Assignments


Glossary words (A - Z):

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

5' Cap - a methylated guanine nucleotide that is added to the 5' end of a mRNA molecule in eukaryotes. It is added by a 5' to 5' triphosphate linkage, and it gives the mRNA resistance to 5' exonucleases. [1] (Laura M.)

16S rRNA - ribosomal RNA found in the small subunit of prokaryotic ribosomes. rRNA functions in decoding mRNA and interacting with tRNAs in translation. In particular, the 16S rRNA gene is well conserved and found in all organisms (in prokaryotes and in eukaryotic mitochondria), and it is often used in comparative genomics when studying phylogeny (Lecture, Olivia)

454 Sequencing - 454 instruments are pyrosequencers that carry out many reactions at a time (parallel sequencing) in wells of a PicoTiter Plate. Beads coated with thousands of homogeneous DNA fragments are added to individual wells on the plate. The DNA fragments are amplified in an oil emulsion mixture with DNA polymerase and primers. dNTPs are sequentially added to the wells one at a time and washed. The process of continuous washing and the sequential addition of dNTPs, DNA polymerase, luciferase, and ATP-sulfurylase explains the high reagent costs of sequencing. ATP-sulfurylase converts the PPi released from each dNTP addition to the complementary strand of the original ssDNA to ATP. ATP fuels luciferase in each well. The light produced is detected with a fluorescence microscope. The current (2009) 454 FLX system has the ability to sequence 100 Mb of DNA in 8 hours with an average read of 250 bp and raw accuracy of 99.5%. [2] [3] (Jared)

A

accession number - a unique identifier given to DNA and protein sequences to allow for tracking of sequence information within a single database [4] (Will).

acid invertase- an enzyme essential to sucrose metabolism, specifically in fruit, that hydrolyzes sucrose into fructose and glucose. Low levels of acid invertase have been shown to be associated with high levels of intracellular sucrose, and hence, to regulate storage and breakdown of sugar (sucrose) in fruit.[5] (Lauren)

acyltransferases - Enzymes that catalyze the transfer of an acyl group from a donor (such as acetyl CoA) to an acceptor. Activity of these enzymes adds a great deal of diversity to the anthocyanins, flavonoids, and phenolic compounds in Vaccinium corymbosum [6] (Puneet)

adsorption - the accumulation of molecules on the surface of a material. This can be part of a lab procedure to purify and isolate a specific portion of a cell or a protein (Wikipedia, Olivia)

alien genes - genes found in a genome that appear to have been inserted into an organism's genome from another species, more than likely through horizontal gene transfer ([1] Campbell, Claudia)

alternative splicing - the process by which one gene can be translated into different protein isoforms. This is done by reconnecting the exons of the RNA produced in transcription in multiple ways during RNA splicing. ([7] Dylan)

allogeneic - variation in alleles among members of the same species. ([8] William G.)

anthocyanins - a member of the flavonoid family that changes color with pH, giving various fruits their coloration. The health benefits of anthocyanin are potentially great, with laboratory results suggesting positive effects against cancer, aging and neurological diseases, inflammation, diabetes, and bacterial infections. It is, however, poorly conserved during digestion and would have to be modified somehow for medicinal use. [9] [10] (Dylan)

antisense (RNA or DNA)-a piece of DNA or RNA that binds to a complementary sequence of DNA or RNA. These segments of genetic material can be used to identify the existence of a disease gene and they can also be used to bind to specific DNA or mRNA sequences to inhibit their function (5 Pallavi).

Apollo - Gene annotation software that allows you to visualize genes you have identified, your annotations for them, and where they lie within a genome Berkeley(Lexi).

Arabidopsis thaliana - the scientific name for the thale cress plant; it was the first plant to have its genome sequenced, and is a model organism for understanding plant biology and genetics (Wikipedia.org, Jay)

Archaea - one of the three evolutionary domains. A group of unicellular prokaryotes that were previously grouped with Bacteria, but have some genes and metabolic pathways more similar to eukaryotes, such as those involved in transcription and translation. Many Archaea are extremophiles, such as Halobacteria that thrive in high-salt environments (Lecture, Olivia)

Archaeal rhodopsins - Archaeal rhodopsins are light-sensitive and light-activated transmembrane proteins found only in archaeal plasma membranes. Bacteriorhodopsin (BR) and Halorhodopsin (HR) are both archaeal rhodopsins that are light-driven proton and chloride pumps, respectively, indicating that the functionality of archaeal rhodopsins is diverse [11] (Katie)

assembly - the process of taking many short sequences of DNA, often from whole genome shotgun sequencing, and compiling overlapping regions to create a representation of the original chromosomes from which the DNA originated. ([12] Mike)

B

BAC - bacterial artificial chromosome, a DNA construct used for transforming or cloning segments of DNA and often used to sequence the genetic code of organisms (Wikipedia.org, Jay)

Bacteriorhodopsin- A transmembrane archaeal rhodopsin protein that uses light energy to move protons across membranes, creating an electrochemical gradient that is converted into chemical energy [13] (Katie).

Bacterioruberin - Bacterioruberin is a “carotenoid pigment” found in some halophiles giving them a red color and providing assumed protection from strong sunlight [14]. The structure also plays a stabilizing role in the archaeal rhodopsin proteins [15] (Katie).

bioinformatics - the multi-disciplinary approach of using biology, computer science and mathematics to solve or better understand biological problems [16] (Matt)

BLAST - (Basic Local Alignment Search Tool) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. [17] (Mary)

Blastula - a hollow sphere of cells, formed in the early stages of embryonic development through a process of cell division known as cleavage, that later transitions to the gastrula. [18] (William G.)

BLASTx - a BLAST search (see BLAST) in which a translated nucleotide sequence is entered and compared to a protein database. [19] (Aaron)

Bligh-Dyer method- A lipid extraction method that uses chloroform-methanol as a solvent but also includes a re-extraction of the sample, just with chloroform, before evaporation of the solvent to capture more non-polar lipids. [20] The lipid membrane of archaea is extremely unique not only in composition (see Isoprenoid lipids) but also in the archaeal rhodopsins that are scattered among the plasma membrane [21]. In order to study the uniqueness of archaeal membranes one needs to observe the lipids outside of the membrane, which the Bligh-Dyer method accomplishes (Katie)

bioperl - a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications such as accessing sequence data from local and remote databases, transforming database formats, manipulating individual sequences, searching for similar sequences, searching for genes and other structures on genomic DNA, or developing machine-readable sequence annotations. [22] (Wikipedia, Max Win)

bootstrap value - common reliability test of a phylogenetic tree, calculated as a percentage. In generating a phylogenetic tree, the sequences will be resampled, or rerun, multiple times. If a pair of sequences are consistently grouped together for 100 out of 100 resamplings, then the certainty that those sequences are correctly grouped would be very high, and the bootstrap value would be 100. If a pair of samples were grouped together only 50 out of 100 resamplings, the certainty that those sequences are correctly grouped would be lower; the bootstrap value would be 50. On phylogenetic trees, these values may be placed adjacent to the group to which they refer. (Lecture, Olivia)
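
The resampling idea can be sketched in a few lines. This is a simplified illustration rather than a real phylogenetics workflow: an actual bootstrap rebuilds a full tree for every replicate, whereas this sketch assumes that "grouped together" can be approximated by two sequences being each other's nearest neighbor by Hamming distance. The function names are hypothetical.

```python
import random


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def bootstrap_support(alignment, name_a, name_b, replicates=100, seed=0):
    """Percentage of replicates in which name_a and name_b group together.

    alignment maps sequence names to aligned sequences of equal length.
    Each replicate resamples alignment columns with replacement.
    Stand-in for tree building (assumption): two sequences count as grouped
    when each is the other's nearest sequence; ties go to the first name listed.
    """
    rng = random.Random(seed)
    names = list(alignment)
    n_cols = len(alignment[names[0]])
    grouped = 0
    for _ in range(replicates):
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        resampled = {n: [alignment[n][c] for c in cols] for n in names}

        def nearest(name):
            others = [o for o in names if o != name]
            return min(others, key=lambda o: hamming(resampled[name], resampled[o]))

        if nearest(name_a) == name_b and nearest(name_b) == name_a:
            grouped += 1
    return 100.0 * grouped / replicates
```

A returned value of 100.0 corresponds to the "100 out of 100 resamplings" case in the definition.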

C

CAGE - Cap Analysis Gene Expression. A technique for identifying the start sites for transcription and determining the amount of promoter usage in eukaryotic genomes. Small fragments (20-21 nucleotides) from the beginnings of mRNAs are extracted, reverse-transcribed to DNA, PCR amplified, and sequenced. These sequences (called "tags") are compared against a known genome to identify exact transcription start sites. ([23] Dylan)

carbon fixation - using carbon dioxide to create organic materials [24] (Samantha)

CCCP - carbonyl cyanide m-chlorophenyl hydrazone; a nitrile ionophore that inhibits oxidative phosphorylation and photophosphorylation. Ionophores are lipid-soluble molecules allowing them to transfer across membranes, creating pores that disrupt transmembrane ion gradients. (Sugiyama 1994 article, Olivia)

cell division control (Cdc) protein - for example, Cdc6 found in Halorhabdus utahensis; protein responsible for activating and maintaining mechanisms of cell division. Cell division control proteins are important in annotation because the presence of a Cdc gene is a good indicator for finding the origin of replication in a circular chromosome. (Bakke et al 2009, Olivia)

CDD (Conserved Domains Database)- a database used to identify the conserved domains present in a protein query sequence [25] (Mary)

cDNA - DNA that is reverse-transcribed from mature mRNA. A cDNA library provides templates for genes that are expressed within an organism. [26]. (Pyfrom)

chaperonin - a protein complex that assists some newly formed polypeptide chains by folding them into their final, functional, three-dimensional form [27] (Matt)

chemoorganotrophic - refers to organisms that obtain energy from oxidation/reduction reactions using organic electron donors (Link, Earthlife Claudia)

chemotaxis - the process in which cells seek out or flee from a high concentration of certain chemicals; it is found in both uni- and multicellular organisms. This process is used to avoid toxins or find food in unicellular organisms, or for tasks such as reproduction in multicellular organisms [28] (Nick)

chemotaxonomy - the attempt to classify and identify organisms according to demonstrable differences and similarities in their biochemical compositions [29] (Mary)

chilling requirement - the minimum time period a fruit-bearing plant must spend in cold weather in order to blossom, often expressed in chill hours, which are calculated from the duration spent at certain temperatures. ([30] Mike)

chimeric genome - A genome that consists of a mixture of genes from distinct species Baliga et al., 2004 (Karen)

Chloroplast chromosome - circular DNA found in the photosynthesizing organelle (chloroplast) of plants instead of the cell nucleus where most genetic material is located. This genome codes mostly for redox proteins involved in electron transport in photosynthesis. ([31] Dylan)

Circos Plot - A circular representation of the genome(s) of one or more species. It shows the extent of homology within a species and/or orthology between multiple species by drawing lines between regions of the chromosomes that share similar DNA sequences. In many cases, Circos plots give us a "tangible" understanding of where gene duplication may have occurred; e.g., if a region on chromosome 2 is paralogous with regions on chromosomes 4 and 6, we will see two lines, one connecting the region on 2 with the region on 4 and another connecting the region on 2 with the region on 6. (Berger et al, 2011, Shamita)

cladogram - A visual representation of relatedness among species that shows common ancestry via the formation of branch points on the tree. The species similarity is computationally determined, and based on the similarity of their DNA and/or RNA sequences. ([32], Shamita)

cloud computing - dividing data processing tasks and distributing the parts among nodes, so that heavy computational workloads are spread across many computers or sections of computers running simultaneously. Cloud computing has become especially popular in the field of genomics. Assembly algorithms may take days to sort through terabytes of data for a genome with high coverage. One option for external cloud services is Amazon's Elastic Compute Cloud (EC2). A laboratory could also build an internal cloud, linking all computers in the lab together. Ubuntu, an open source, Linux-based operating system, now has cloud support. [33],[34] (Jared)

climacteric/non-climacteric fruit - Climacteric fruits ripen in response to a release of ethylene and an increase in respiration rate. Some examples include bananas, apples, apricots, and peaches. Blueberries, strawberries, and grapes are all non-climacteric and ripen without this ethylene burst; however, genes for the ethylene receptor may still be active in non-climacteric fruit. ([35], Shamita)

ClustalW - A web-based or command line tool that performs multiple sequence alignments to determine evolutionary relationships between three or more sequences [36] (Will).

COG (Cluster of Orthologous Groups)- corresponds to a highly conserved domain and generally consists of either individual proteins or groups of paralogs (COG Pallavi)

comparative genomics - the study of relationships between genomes of different strains and species. Comparative genomics aims to define similarities and differences in structure and/or function of different proteins, RNAs and regulation between organisms (Wikipedia and Lecture, Olivia)

concatemer - long continuous DNA molecule that contains the same DNA sequence repeated in series [37](Samantha)

congenic - two strains of an organism that are nearly identical, varying only at a single locus (also called coisogenic) [38] (Megan)

consensus sequence - a nucleotide sequence that is common, though not necessarily identical, in different genes and in genes from different organisms that are associated with a particular function. [39] (Megan)

conserved genes - regions of similar or identical sequences within DNA or proteins across species. Sequence conservation generally implies that there is a conserved gene in that location. Highly conserved genes are oftentimes necessary for survival and, therefore, any mutations are eliminated through natural selection. ([40] Dylan)

contigs (contiguous DNA) - overlapping DNA segments that, as a collection, form a longer and gapless segment of DNA. (Discovery Genomics, Proteomics and Bioinformatics [41], Max Win)

controlled vocabulary - a set of terms used to standardize the description of characteristics in organisms' genomes, as designated by the Gene Ontology (GO) project ([1] Campbell, Claudia)

coverage - refers to the number of times, on average, any piece of DNA in a sequenced genome has been individually sequenced (Lecture, Jay)

CPAN (Comprehensive Perl Archive Network) - an archive of over 12,200 modules of software written in Perl, as well as documentation for it. It contains a module called CPAN (or CPAN.pm) which is used as an installer for Perl modules such as BioPerl [42](Will).

Cytogenetics-the study of normal and abnormal chromosomes. This involves studying the causes of chromosomal abnormalities and looking at the structure of chromosomes (7 Pallavi).

D

digenic phenotype - phenotype caused by two genes, not one. ([43], Leland)

DCCD - dicyclohexylcarbodiimide; compound that acts as a proton ATPase inhibitor (Sugiyama 1994 article, Olivia)

de Bruijn graphs - graphical representations of groups of short letter strings (k-mers). Used in genomic assembly, the graphs consist of rectangles of short nucleotide sequences and their reverse complements. Sequences vertically protruding from these rectangles overlap and share these rectangle base sequences. Arcs connect nodes of linked overlapping sequences.

Zerbino and Birney (2008) developed Velvet, a set of algorithms designed to manipulate these graphs in order to assemble high coverage genomes consisting of short reads. [44] (Jared)
DeBruijin.gif
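
As a rough illustration of the idea, the sketch below builds the textbook form of a de Bruijn graph, in which nodes are (k-1)-mers and each k-mer in a read contributes one arc. This is not Velvet's data structure (it ignores reverse complements and node merging), and the function name is hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a simple de Bruijn graph from a list of reads.

    Every k-mer found in a read adds an arc from its first k-1 bases
    (prefix) to its last k-1 bases (suffix). Repeated k-mers add
    repeated arcs, which reflects how often an overlap was observed.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)
```

Walking the arcs from node to node spells out longer sequences, which is how overlapping short reads can be merged during assembly.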

de novo synthesis - the synthesis of complex molecules from simple molecules (e.g. sugars and nucleotides), rather than from recycled molecules; from the Latin "of the new" [45] (Matt)

dehydrogenase - a type of enzyme that oxidizes a substrate by transferring one or more protons and a pair of electrons to an acceptor. [46] (Peter)

dendrogram - a tree diagram used to illustrate the arrangement of the clusters produced by hierarchical clustering based on the degree of similarity of characteristics. Dendrograms are often used in computational biology to illustrate the grouping of genes or samples. [47] (William G.)

deoxyribodipyrimidine photolyase - an enzyme that breaks the errant covalent bonds that form pyrimidine dimers. UV light is a common cause of this particular anomaly and causes covalent bonds to form between adjacent pyrimidines. Many archaea and bacteria use deoxyribodipyrimidine photolyases in order to break these bonds and avoid errors during replication or transcription [48]. (Pyfrom)

diatom - a major group of eukaryotic algae, and one of the most common types of phytoplankton. A characteristic feature of diatom cells is that they are encased within a unique cell wall made of silica called a frustule. These frustules show a wide diversity in form, but usually consist of two asymmetrical sides with a split between them. [49] (Mary)

DICER1 - a protein in the RNA induced silencing complex (RISC). DICER cleaves double stranded mRNAs, rendering them untranslatable. The protein belongs to the helicase family. Defects in the enzyme have been implicated in pleuropulmonary blastoma, a developmental cancer of the lungs. [50] (Jared)

dicotyledon - a member of a group of flowering plants that have two leaves in the embryo of the seed. Most have net-veined leaves, and the vessels in the stem are arranged in a circle near the stem surface. [51] Blueberries are dicotyledons. [52] (Laura M.)

DNA (deoxyribonucleic acid) - The nucleic acid that forms the basis of the genetic material in most organisms. DNA is composed of the four nitrogenous bases Adenine, Cytosine, Guanine, and Thymine, covalently bonded to a backbone of deoxyribose-phosphate to form a DNA strand. Two complementary strands (where all Gs pair with Cs and As with Ts) form a double helical structure which is held together by hydrogen bonding between the complementary bases. ( [53] [54] Mike)

domain (protein) - the structural and functional groups of a protein, which can exist independently of the protein itself. Domains typically perform a specific function, such as binding to promoters or substrates, and many proteins can have one or several domains in common. Evolutionarily-linked proteins are more likely to have domains in common. Domains are used to organize proteins into families. (Wikipedia article, Laura)

dirigent proteins - proteins that control the stereochemistry of a compound synthesized by other enzymes. Ex: In lignin formation, dirigent proteins are suggested to "direct the coupling of two monolignol radicals, producing a dimer with a single regio- and stereo- configuration." [55] (William G.)

dot plot-graphical display comparing sequence conservation between two genomes with dots indicating strings of identical bases. (Discovery Genomics, Proteomics and Bioinformatics[56], Max Win)

draft genome- a genome that has been sequenced by computers and programs but has not yet been reviewed by humans in order to create a finished genome. Draft genomes usually contain gaps or mistakes due to the limited capacity of the programs used for sequencing (Lecture, Pyfrom).

E

epigenetic regulation - changes in phenotypes that are caused by mechanisms other than changes in DNA sequence. DNA methylation is an example of this. ([57], Leland)

EC number (Enzyme Commission Number)- a numerical classification scheme for enzymes, based on the chemical reactions they catalyze [58] (Mary)

Edman degradation-A method for sequencing amino acids in a peptide chain. It allows the ordered protein sequence to be determined by proceeding from the N-terminus of the chain and piecing together fragmented sequenced chains of a protein [59] (Katie).

E-value (Expect value) - When performing a BLAST search, you will obtain an E-value for each sequence that is retrieved. An E-value is the number of matches with an equal or better score that would be expected by chance in a database of that size; for very small values it can be thought of as the probability that two sequences are similar to each other by chance. (Discovery Genomics, Proteomics and Bioinformatics[60], Max Win)
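
To make the relationship between score and E-value concrete, the sketch below uses the standard bit-score form E = m * n * 2^(-S'), where m is the query length, n is the database length, and S' is the bit score. BLAST itself uses length-corrected "effective" values for m and n, so treat this as an approximation; the function name is hypothetical.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Uses raw lengths rather than BLAST's effective lengths (assumption),
    so results will differ somewhat from a real BLAST report.
    """
    return query_length * database_length * 2.0 ** (-bit_score)
```

Each additional bit of score halves the E-value, while doubling the database size doubles it, which is why the same alignment looks less significant when searched against a larger database.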

ENZYME - an enzyme database with links to a variety of resources (KEGG, BRENDA, PubMed, etc.) specific to a query. Users can search based on enzyme commission (EC) number, enzyme family, cofactor, and more. [61] (Aaron)

epistasis - the interaction between two or more genes to control a single phenotype. Epistasis is not the same as dominance; dominance involves the interaction of two alleles for the same gene, whereas epistasis is the interaction of different genes. [62] (Megan)

Ericaceae - The family of plants that blueberry belongs to. This family includes herbs, subshrubs, shrubs and trees, and grows best in acidic soils Flora of North America (Lexi).

ELSI - A research initiative funded by the US Department of Energy and National Institutes of Health to study the ethical, legal, and social issues (ELSI) brought about by the availability of genetic information. This program dealt with knowledge in both the Human Genome Project and other work of medicinal and health import. ([63] Dylan)

expressed sequence tag (EST) – a short piece (200-500bp) of transcribed cDNA that can be used to determine the position of an expressed gene within the genome [64]. (Pyfrom)

extremophile - an organism that thrives in and may even require physically or geochemically extreme conditions that are detrimental to the majority of life on Earth [65] (Will).

exon - a portion of a nucleic acid sequence that is represented in mature RNA, as opposed to introns, which are spliced out. ( [66] Mike)

F

FASTA format - a format used to convey either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented by single-letter codes. A header line beginning with ">" that gives the sequence name and other descriptors precedes each sequence. [67] (Nick)
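
A minimal reader for this format might look like the following. It is a sketch that assumes the common convention of a ">" header line followed by one or more sequence lines; the function name is hypothetical, and a maintained parser (for example the one in Biopython) is a better choice for real work.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples.

    The header is everything after '>' on the header line; sequence lines
    are joined together with whitespace removed. Lines that appear before
    the first header are ignored.
    """
    records = []
    header = None
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(parts)))
            header = line[1:].strip()
            parts = []
        elif header is not None:
            parts.append(line)
    if header is not None:
        records.append((header, "".join(parts)))
    return records
```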

family (protein) - a group of evolutionarily-related proteins, often with one or several domains in common. Families are organized by domain overlap, structural/functional similarity, and sequence similarity. (Wikipedia article and lecture, Laura)

finished genome - a genome that has been sequenced at least partly by hand, resulting in at least 99.99% sequence accuracy (Lecture, Jay)

fold coverage - c = (L*N)/G, where L = average read length, N = number of reads, and G = genome size. Higher fold coverage gives higher final accuracy because a larger sample is available when taking the most common (mode) nucleotide at each position where reads disagree; e.g., 12X coverage means 12X redundancy of bases, higher base accuracy, and higher accuracy of assembly [68] (Jared)
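
The formula and the "mode nucleotide" idea can be sketched as follows. The function names and the numbers in the usage note are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import Counter


def fold_coverage(avg_read_length, num_reads, genome_size):
    """Fold coverage c = (L * N) / G."""
    return avg_read_length * num_reads / genome_size


def consensus_base(bases_at_position):
    """Most common (mode) base among the reads covering one position."""
    return Counter(bases_at_position).most_common(1)[0][0]
```

For example, 4,800,000 reads averaging 250 bp over a 100 Mb genome give fold_coverage(250, 4_800_000, 100_000_000) = 12.0, i.e. 12X coverage. With 12 reads covering a position, a single sequencing error is easily outvoted by consensus_base.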

Fragaria vesca - Strawberry, a fruit related to blueberry that had its genome sequenced in 2010. Strawberry has a relatively small genome (240 Mb), compared to the 487 Mb genome of the grape, demonstrating that there is great variability in the genomic structure of related species Strawberry Genome Paper Grape Genome Paper (Lexi).

frustule - a hard, porous cell wall made up of silica that makes up the outermost layer of diatoms. These structures have complex and elaborate designs (Wikipedia Claudia)

fusion mRNA-mRNA that results from the transcription of a gene after a chromosomal translocation event. This results in an mRNA sequence that comes from two different genes (Rowley and Blumenthal 2008 Science Pallavi)


Flavonoids - polyphenolic biochemical compounds that have been shown to have antioxidant effects. They are found in fruits, vegetables, olive oil, cocoa and beverages such as tea and red wine. The most common flavonoids include anthocyanins, flavonols, flavones, flavanones, flavan-3-ols, and isoflavones. [69] (Lauren)

G

GAF Domain - A GAF domain is a small-molecule binding unit present in all domains of life. It is a light-responsive domain found in plant and cyanobacterial phytochromes (a pigment photoreceptor used to detect light). This domain plays an important role in an organism's ability to respond to its environment. (Baliga et. al., Molecular Interventions, Ecomii Claudia)

gap - a region of the genome for which no sequence is currently available. Two types of gaps exist: heterochromatic gaps consist largely of highly repetitive sequence (which makes the exact non-overlapping sequence difficult to determine), while euchromatic gaps are more likely to contain genes. [70] (Megan)

gap penalty - The penalty applied due to gap(s) during sequence alignment, necessary to see similarities between sequences that would otherwise be considered radically dissimilar. Gaps arise during sequence comparison due to insertions or deletions. Gap penalties are usually subtracted from a cumulative score being determined by an optimization algorithm that attempts to maximize that score. A higher gap penalty will cause less favourable characters to be aligned, to avoid creating as many gaps. ( [71] Mike)
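
The effect of the penalty on the cumulative score can be seen in a small global alignment scorer. This sketch uses Needleman-Wunsch dynamic programming with a linear gap penalty; the match and mismatch values are arbitrary choices for illustration, and real tools typically use substitution matrices and separate gap-open and gap-extend penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap_penalty=2):
    """Best global alignment score with a linear gap penalty.

    Each gap position subtracts gap_penalty from the cumulative score.
    Only the previous row of the scoring table is kept in memory.
    """
    prev = [-gap_penalty * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [-gap_penalty * i]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] - gap_penalty        # gap in b
            left = curr[j - 1] - gap_penalty  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

Aligning "ACGT" with "AGT" requires one gap, so the best score is three matches minus the penalty; raising gap_penalty lowers that score and, for longer sequences, pushes the algorithm toward mismatches instead of gaps.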

GC Content - the percentage of bases within a certain sequence of DNA (e.g. a gene or a genome) that are either guanine or cytosine; a higher GC content is characteristic of a coding region of a gene; differences in GC content between a gene and a genome can be used as evidence for horizontal gene transfer [72] (Matt)
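
Computing GC content is a simple count. The sketch below uses a hypothetical function name and ignores ambiguity codes such as N, which simply count toward the sequence length.

```python
def gc_content(sequence):
    """Percentage of bases in the sequence that are G or C."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
```

Comparing gc_content for a single gene against the value for the whole genome is one way to flag candidates for horizontal gene transfer, as described above.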

GC-skew – uneven distribution of guanine and cytosine bases between the two strands of DNA where GC base pairs occur. (Discovery Genomics, Proteomics and Bioinformatics[73], Max Win)
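
GC-skew is commonly quantified per window along one strand as (G - C) / (G + C). The sketch below assumes that convention; the function name and the default window size are arbitrary choices for illustration.

```python
def gc_skew(sequence, window=1000):
    """GC skew, (G - C) / (G + C), for consecutive non-overlapping windows.

    Windows containing no G or C are reported as 0.0.
    The final window may be shorter than the others.
    """
    seq = sequence.upper()
    skews = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews
```

Plotting these values along a circular chromosome often shows a change of sign, which is one of the signals used when looking for an origin of replication.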

gene amplification - production of multiple copies of a gene in order to amplify the amount of protein that the gene encodes for [74] [75] (Matt)

gene calling - Determining which parts of a sequenced genome represent genes. This process could also be called gene finding. The process is generally fully automated. Magnaporthe grisea Automated Gene Calling(Karen)

gene fusion-occurs when DNA segments of two different genes come together. Can result in hybrid proteins (9 Pallavi)

gene knockdown - similar to gene knockout, this technique involves the reduction of expression through use of complementary DNA or RNA that lasts only a short period of time before returning to normal. [76] (William G.)

gene knockout - a process in which a gene is deactivated within a test organism in order to better understand the function of the gene in that organism [77] (Matt)

Gene Network - A network shows the interactions among parts of a whole and can be applied to any level of biology, from the genetic to the ecosystem level. Within the study of genomics, networks are typically represented as gene regulatory networks, which show how genes, transcripts and proteins interact to regulate a particular pathway. Institute for Systems Biology(Lexi)

gene ontology - a collaborative effort of investigators to unify and standardize terms associated with the role a gene or protein plays in an organism. (Discovery Genomics, Proteomics and Bioinformatics[78], Max Win)

gene patent - In genetics, a patent applies to a particular gene sequence discovery and reserves rights to it and any process involved in obtaining or using the gene product for the individual or group responsible for the discovery. ([79] Dylan)

gene transfer - the incorporation of a DNA segment into an organism's cells, or DNA. This usually occurs through a vector such as a virus. This method is used in gene therapy. (Genomics.energy.gov Claudia)

Genome - The full set of an organism's hereditary information. The genome is encoded as either DNA or RNA and includes both genes and non-coding regions. Wikipedia article (Puneet)

genome annotation - the process of attaching biological meaning to sequence data. In other words, genome annotation involves determining where genes are located in a genome and discovering functions of these genes. Genome annotation: from sequence to biology (Karen)

glaucophyte - freshwater algae that have not been studied well [80](Samantha)

gynandromorph - organisms that contain both male and female cells and thereby express both male and female characteristics. [81] (William G.)

H

haemolysin or hemolysin - a substance produced by bacteria that causes lysis of red blood cells [82] (Nick)

halophile - an organism, most often of the Archaea domain, that lives in environments containing high concentrations of salt [83] (Matt)

haplogroup - branches on the ancestry tree of Homo sapiens that reflect early migrations. Geneticists differentiate these groups by examining variations in mtDNA (origins of mother) and the Y chromosome (origins of father) [84] (Jared)

haplotype-collection of alleles that travel together (Lecture, Pallavi)

haptophyte - phylum of algae [85](Samantha)

heterokont - major line of eukaryotes consisting of about 10,500 known species, most of which are algae [86](Samantha)

Heterologous -literally meaning, “derived from a different organism,” heterologous refers to the fact that the gene/protein of interest was taken from a different cell type or species than the gene/protein recipient [87]. (Katie)

Hidden Markov Model - a statistical model used in protein recognition databases such as Pfam. A Hidden Markov Model keeps track of several variables and possible variations thereof, such as the possible amino acid sequences that make up a protein domain (since there can be some variance in an amino acid sequence) or the variations in the component sounds that make up a word, and uses those points to match a given sequence to the word, domain, or other complex sequence it most closely matches. An HMM in speech recognition software, for example, can identify that a certain set of sounds make up a certain word, even with the variations in pronunciation and accent that different people will give those sounds. (Wikipedia and lecture, Laura)

hierarchical genome shotgun sequencing - a method for sequencing genomic DNA. Genomic DNA is cut into pieces of about 150 kb and inserted into BAC vectors, which are transformed into E. coli where they are replicated and stored. The BAC inserts are isolated and mapped to determine the order of each cloned 150 kb fragment. This is referred to as the Golden Tiling Path. Each BAC fragment in the Golden Path is fragmented randomly into smaller pieces and each piece is cloned into a plasmid and sequenced on both strands. These sequences are aligned so that identical sequences overlap. These contiguous pieces are then assembled into finished sequence once each strand has been sequenced about 4 times to produce 8X coverage of high quality data [88]. (Pyfrom)

High Throughput Biology (Sequencing, Genomics, etc) - Method of biology which utilizes new technologies to collect and analyze large volumes of data through biochemical manipulations of large numbers of samples 1 (Lexi)

HMM Logo - a graphical representation of an HMM, detailing the possible amino acid sequences, the relative frequencies and probabilities of each amino acid in the sequence, the relative contribution each amino acid has to the overall protein family, and the charge or nature of the amino acids themselves. (How to read HMM Logos, on Pfam, Laura)

homeobox - DNA sequence within transcription factor genes that allow the cell to respond to patterns of development by having the transcription factors switch on gene cascades [89](Samantha)

homodimer - a protein made of paired identical polypeptides (Answers.com, Jay)

Homolog - Protein or gene that is derived from a common ancestor (Lecture; Wikipedia article) (Puneet)

horizontal gene transfer-DNA transmission between species and incorporation of the DNA into the recipient's genome (horizontal gene transfer Pallavi)

Hox gene-a gene that contains a homeobox region that is involved in morphogenesis along the cranio-caudal body axis (4 Pallavi)

hydrolase - an enzyme that catalyzes hydrolysis, the cleavage of a chemical bond by the addition of water [90] (Nick)

Hydropathy analysis - This method determines the hydrophobic nature of an amino acid sequence. It uses a window moving through the sequence, summing the Gibbs free energy values for each amino acid and running these values through programs to determine hydrophobic segments. [91] With respect to halophiles, there is evidence to suggest that protein stability, in some cases, may depend on high salt concentrations, and since the hydrophobic nature of a protein contributes to its stability, it is important to be able to measure stability in terms of hydropathy [92] (Katie)

hypothetical protein - A hypothetical protein is a protein predicted from a gene in a genome whose predicted function has not been experimentally tested or proved. The predicted function is determined by the protein's structural similarities to proteins of known function as well as the protein's sequence makeup. It has no analogs in the protein database. (Web Definitions Claudia)

I

inducer - a molecule that amplifies gene expression. ([93], Leland)

ideogram - in genomics, usually describes a stylized representation of a chromosome with banding patterns (Campbell-Heyer Genomics textbook, Jay)

identities - in a BLAST output, the number and fraction of total residues which are identical in a given alignment [www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)
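As a rough illustration of the "identities" line in a BLAST report, the sketch below counts identical residues in a pairwise alignment that has already been gapped to equal length. The function name and the choice to divide by the full alignment length (gaps included) are assumptions made for this example, not BLAST's own code.

```python
from typing import Tuple


def identities(aligned_query: str, aligned_subject: str) -> Tuple[int, int, float]:
    """Return (identical residues, alignment length, fraction identical).

    Both inputs must be the same length and use '-' for gap positions.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    # A column counts as an identity only when both residues match and neither is a gap.
    matches = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"
    )
    length = len(aligned_query)
    fraction = (matches / length) if length else 0.0
    return matches, length, fraction


if __name__ == "__main__":
    n, total, frac = identities("MKT-LV", "MKSALV")
    print(f"Identities = {n}/{total} ({frac:.0%})")
```

For the toy alignment above, four of the six columns are identical, so the report line would read 4/6.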

Illumina sequencing - Illumina instruments amplify DNA fragments in situ on a flow cell. Fragment colonies are dispersed on the flow cell at a low concentration at first, allowing for non-overlapping fragment colonies. Clusters are promoted by isothermal bridging amplification. The amplification increases the density of these colonies. Fluorescently labeled nucleotides are cyclically washed over the flow cell. These nucleotides are conjugated with reversible terminators so that the four nucleotide bases can be simultaneously incorporated base by base across the flow cell. Laser-induced excitation of the cell allows imaging of the excited fluorophores. The use of a flow cell and reversible terminators allows the Illumina Genome Analyzer to produce 600 Mb of DNA per day with only 36 bp reads. The tradeoff between pyrosequencing methods and the flow cell method is increased throughput for shorter reads. The raw accuracy of the Illumina Genome Analyzer is over 98.5%. Increased coverage is necessary when using sequencers with high raw error rates. [94] [95] (Jared)

immunoprecipitation - the technique of precipitating a protein out of solution using an antibody that specifically binds to that particular protein. This process can be used to isolate and concentrate a particular protein from a sample containing many thousands of different proteins [96]. (Pyfrom)

indel - term used to describe insertions or deletions within a genome. Since an insertion in one genome is a deletion in another, "indel" is a catch-all term coined to remove the relative subjectivity of determining a mutation as being either an insertion or deletion (Lecture, Pyfrom).

indole-a chemical compound that is produced from the break down of tryptophan (indole Pallavi)

inclusion body - Inclusion bodies are collections of stainable substances, usually proteins, that are found either in the nucleus or the cytoplasm. It is thought that these bodies are often the result of viral proteins that misfolded [97] (Nick)

intergenic distance - The distance (in base pairs) between genes wikipedia (Karen)

intron - a region of DNA in a gene that is not part of the final coding sequence for the protein. [98] (Peter)

ion torrent - (Aaron)

IS elements - (insertion sequence element) sequences of DNA that can transpose to new positions in the genome. This can cause disruptions in other gene coding regions and major reorganizations of the genome Baliga et al., 2004 (Karen)

isoelectric point - the pH at which a molecule carries no net electrical charge [99] (Nick)

Isoprenoid lipids -lipids made from five carbon isoprene units, also known as isoterpene units which is the organic compound CH2=C(CH3)CH=CH2. [100]. The side chains in phospholipids are built from isoprene instead of fatty acids in archaea, making them isoprenoid lipids [101]. (Katie)

isozymes - members of a gene family with very similar cellular roles (Campbell-Heyer Genomics textbook, Jay)

J

Junk DNA - sections of DNA that do not code for genes, or a label for stretches of DNA for which no function has been identified. Non-coding DNA is often referred to as "junk DNA." [102] (Megan)

K

KEGG (Kyoto Encyclopedia of Genes and Genomes) - a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The Pathway database records networks of molecular interactions in the cells, and variants of them specific to particular organisms [103](Will).

kinase - a type of enzyme that transfers a phosphate group from a high-energy donor molecule to a target molecule in a process called phosphorylation. [104] (Peter)

Kozak consensus sequence - a sequence present in eukaryotic mRNA and that is upstream of the start codon, and plays a major role in the initial binding of mRNA to ribosomes that facilitate translation. [105] (Lauren)

Kyte Doolittle Hydropathy plot - a plot used to determine the hydrophobic character of an amino acid sequence. Peaks higher than 1.6 on the plot, suggest the sequence in question contains hydrophobic regions and is possibly localized within or around a membrane. Peaks less than 1.6, suggest the amino acid sequence does not have a membrane spanning domain. [106] Lauren
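A minimal sketch of how such a plot can be computed: each residue gets its Kyte-Doolittle hydropathy value, and the values are averaged over a sliding window. The function names are hypothetical, and the default window of 19 residues is an assumption (a width often paired with the 1.6 cutoff when looking for membrane-spanning segments); adjust it to match the tool you are using.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Return the mean hydropathy of every full window along the sequence."""
    protein = protein.upper()
    if window < 1 or len(protein) < window:
        return []
    # Raises KeyError on non-standard residue codes such as X or *.
    scores = [KD_SCALE[aa] for aa in protein]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def has_hydrophobic_peak(protein, window=19, threshold=1.6):
    """True if any window average rises above the threshold."""
    return any(v > threshold for v in hydropathy_profile(protein, window))


if __name__ == "__main__":
    print(has_hydrophobic_peak("L" * 25))  # strongly hydrophobic stretch
    print(has_hydrophobic_peak("K" * 25))  # strongly hydrophilic stretch
```

Plotting the list returned by `hydropathy_profile` against window position gives the familiar hydropathy curve.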

L

lateral gene transfer - see "horizontal gene transfer" (Pallavi)

lignin - a complex phenolic polymer found in the cell wall of plants. It is important in the stiffness and strength of the plant stem. It also makes the cell wall waterproof, allowing transport of water and solutes through the vascular system. [107] (Laura M.)

linkage groups- Genes that are often inherited as a single unit are said to form a linkage group because the rate of recombination between them is so low. ([108], Shamita)

Liposome - microscopic fluid filled vesicle whose phospholipid walls are identical to that of the cell membrane and are often used as models for artificial cell membranes, which is useful in studying the uniqueness of archaeal membranes outside of the archaea organism, and drug delivery [1] (Katie).

M

Manatee - a web-based gene evaluation and genome annotation tool that can view, modify, and store annotation for prokaryotic and eukaryotic genomes. This on-going, open source initiative was developed with two missions. One, to allow biologists the ability to functionally annotate their genomes using a powerful, stand-alone web application with a robustly designed relational annotation database. And secondly, to invite outside developers the opportunity to contribute their own ideas and requirements to enhance Manatee's ability to accomplish biological goals [109](Will).

marker assisted selection - a process whereby a marker, in our case genetic, is used for indirect selection of a genetic determinant or determinants of a trait of interest. ( [110] Mike)

metabolism - chemical reactions organisms utilize in order to maintain life. Metabolism can be constructive such as anabolism in which energy is used to create cell components like protein, or it can be destructive such as catabolism where a substance such as sugar is systematically broken down in order to harvest energy for the organism. Wikipedia (Karen)

methylation - when DNA is methylated proteins (like transcription factors) can no longer bind to it. This is important to genomics because methylation is a way to activate or inactivate genes throughout the genome. A methylome is a complete description of the methylation status of a genome. (Discovering Genomics, Proteomics, & Bioinformatics pg 57, Leland)

metabolome - The complete set of small molecule metabolites (e.g. intermediates, products, etc.) found within an organism. The metabolome gives one an idea of the mechanisms underlying various metabolic pathways in an organism [111] (Puneet)

microsatellites-stretches of repetitive, short DNA segments that can be used to track the inheritance of certain traits within families (3 Pallavi)

minisatellites-segments of DNA that can be used for individual identification (ex. DNA fingerprinting) or in determining relationships between people (ex. paternity cases) (2 Pallavi).

monocotyledon - a group of flowering plants that has one seed-leaf (cotyledon). In most, the leaf veins are parallel, and the vessels in the stem are scattered. [112] (Laura M.)

monosomy - only one copy of a chromosome is present instead of two (typically found in pairs, ex. humans). [113] (William G.)

mosaicism - the presence of two or more genetically different populations of cells that originated from the same zygote. Earliest examples involved the transplantation of a blastula stage embryo from one genetic background into another of a different genetic background. This allowed for expanding study of genes early in development. [114] (William G.)

motif - a sequence of amino acids or nucleotides that performs a particular role and is often conserved in other species or molecules. (Discovery Genomics, Proteomics and Bioinformatics[115], Max Win)

mycoplasma - genus of bacteria that lack a cell wall [116] (Nick)

Myb transcription factors - a family of proteins that regulate gene expression within the cell by binding directly to DNA. Absence of Myb factors has been shown to cause various types of cancer by inhibiting cell division. Myb proteins are identified by a number of imperfect tandem repeats known as the "Myb domain" which serve to identify where the protein binds to the DNA. Myb factors have been linked to various flavonoid pathways within plants. [117] (Dylan)

N

NCBI - (The National Center for Biotechnology Information) is a division of the National Library of Medicine (NLM) in the National Institutes of Health (NIH). This organization seeks to develop and make available information technologies for use in discovering and deciphering the fundamental molecular and genetic processes affecting health and disease. (NCBI Claudia)

Nhx - Family of antiporter proteins in plants responsible for regulating intercellular pH. One member of the family, Nhx1, is a Na+/H+ antiporter. 1 (Lexi)

NORFs (nonannotated open reading frame) - an open reading frame that was considered not to be a real gene when the genome was annotated. (Discovery Genomics, Proteomics and Bioinformatics[118], Max Win)

nucleolar organizer - the region of a chromosome around which the nucleolus forms after cell division. It contains tandem repeats of rRNA genes, which are transcribed, processed and formed into ribosomes (with the addition of ribosomal proteins) in the nucleolus. [119] [120] (Laura M.)

nucleomorph - reduced eukaryotic nuclei found in plastids [121](Samantha)

O

object-oriented programming - a programming paradigm in which collections of data, associated with operations on that data, are modularly defined and then built upon (CSC 121 Lecture, Will).

oligonucleotide - a short nucleic acid sequence (typically 50 or fewer bases) that is used as a DNA synthesis primer. They are formed from individual nucleotides to allow creation of any sequence necessary. Oligonucleotides are used in a number of procedures, including DNA microarrays, Southern blots, ASO analysis, fluorescent in situ hybridization (FISH), and the synthesis of artificial genes. ([122] Dylan)

ohnology - paralogous genes originating from a whole genome duplication. These genes are important to genomic analysis because they provide a series of genes that have all been diverging for the same amount of time since the duplication event. ([123] Dylan)

open reading frame (ORF)-a segment of DNA that can potentially encode for a protein and it begins with a start codon (usually ATG) ORF (Pallavi)
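To make the definition concrete, here is a small sketch that scans the three forward reading frames of a DNA string for stretches that run from an ATG to the first in-frame stop codon. It is deliberately simplified: the function name and `min_codons` parameter are hypothetical, only ATG is treated as a start, and the reverse strand is not searched.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each forward-strand ORF.

    start is the 0-based index of the A in ATG; end is one past the stop codon.
    min_codons counts the start and stop codons as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon in this frame opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # look for the next ORF in the same frame
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATAGGG"):
        print(orf)
```

Real gene callers add much more (alternative start codons, both strands, length and codon-usage statistics), so treat this only as a picture of what "open" means.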

operon - a segment of DNA involving an operator, promoter, and one or more genes that operate as a single unit during transcription [124] (Nick)

opsin - In eukarya, this is a group of light sensitive G protein-coupled receptors often found in the retina. In prokaryotes, opsins are used to fix carbon by harvesting energy from light. Additionally, these receptors are independent of any chlorophyll pathway Wikipedia (Karen)

optical mapping-DNA sequences of the organism in question are compared against a karyotype that specifically looks at restriction sites found within the DNA to correctly order the DNA sequences on a chromosome. This methodology gives very detailed haplotype information and allows for the detection of sequence variations across an entire genome optical mapping (Pallavi)

origin of replication - the sequence in a genome where DNA replication (in eukaryotes and prokaryotes) or RNA replication (in RNA viruses) is initiated. In eukaryotes there are multiple origins of replication, which speed up the process of replication within the cell. ([125], Lauren)

ortholog - one within a group of DNA sequences each found in separate genomes that look very similar. Orthologs may have an evolutionary relationship, but the term itself does not imply the presence or absence of one. (Lecture, Olivia)

oxidoreductase - an enzyme that catalyzes redox reactions by transferring electrons from one molecule (the reductant) to another (the oxidant) [126] (Nick)

P

polymerase chain reaction (PCR) - A technique used to amplify specific segments of DNA. The technique can be used to detect and amplify trace amounts of DNA into millions of copies. In a genomics setting, PCR has been adapted to quickly identify the species of an organism by using species-specific primers. ([127] and Discovering Genomics, Proteomics, & Bioinformatics pg 146, Leland)

penetrance - refers to varying degrees of phenotypic expression of a gene. A gene with high penetrance always expresses the same phenotype. ([128], Leland)

paralog - similar DNA sequences within a species (Lecture, Pallavi)

p-arm - the shorter arm of a chromosome's two arms separated by the centromere (compare to q-arm, the longer arm) (MedTerms Dictionary, Jay)

pectin - a polysaccharide found in and between the cell walls of plants, which helps to keep cells rigid by regulating water flow between cells. It functions as a gelling agent in making fruit jellies and jams. [129] (Laura M.)

peptidyl transferase - an enzymatic part of the ribosome that catalyzes the formation of peptide bonds between amino acids during translation. Peptidyl transferase activity is carried out by rRNA in the large subunit (60S in eukaryotes) of the ribosome. [130] [131] (Laura M.)

Perl - Developed by Larry Wall in 1987, Perl is a high-level programming language used frequently by biologists and bioinformaticists [132] (Will).

periplasmic space - the space between the inner cytoplasmic membrane and external outer membrane in bacteria or archaea. [133] (Peter)

Pfam - a database for protein domain families that matches amino acid sequences or nucleotide sequences to the related group of proteins to which they most likely belong. (Pfam Help, Laura)

pharmacogenomics - how inherited genetic variations and the resulting genomic interactions alter the intended effects and side effects of drugs. Discovering Genomics, Proteomics, & Bioinformatics pg 333 (Jared)

phenylpropanoids - Plant-derived organic compounds derived from the amino acid phenylalanine. Phenylpropanoids are involved in a variety of essential functions such as plant defense, plant pollinator reactions, etc. [134] They potentially may be related to dietary health benefits seen in blueberries, as well. (Puneet)

phylogenetic tree - a diagram showing the evolutionary relationships between biological species that are thought to share a common ancestor [135] (Nick)

phylotypes – a term intended to resolve the challenge of “species” when classifying prokaryotes using DNA sequence comparisons. (Discovery Genomics, Proteomics and Bioinformatics[136], Max Win)

phytanyl lipids - Organically, a phytanyl is a branched-chain hydrocarbon containing 20 carbon atoms [137]. Phytanyl lipids are often found in the membrane of archaea and are thought to contribute to increased membrane stability at high salt concentrations [van de Vossenberg et al. Extremophiles (1999) 3:253-257]. (Katie)

phytochrome - a pigment that acts as a photoreceptor that triggers a response or signaling cascade in many plants and bacterial organisms as well as some animals. It is made up of a chromophore, or a compound that absorbs visible light, which is bound to a protein. Phytochrome is one of the most intensely colored pigments found in nature. This intense pigmentation allows the organism to sense even dim light. (Ecomii, Phytochrome Claudia)

plasmid - an extra-chromosomal DNA molecule that is capable of replicating independently of the chromosomal DNA. Commonly found in bacteria and archaea. [138](Peter)

plastid - major organelles in plants or algae [139](Samantha)

pleiotropy - a single gene that causes many different physical traits like multiple disease symptoms. [140] (William G.)

pleomorphism - the occurrence of two or more structural forms during a life cycle [141] (Mary)

polymorphism - A type of genetic variation that occurs at the same locus between individuals of the same species. The variants produced by a polymorphism constitute different alleles of that locus. Examples include SNPs (single nucleotide polymorphisms) and RFLPs (restriction fragment length polymorphisms). ([142], Shamita)

Populus trichocarpa - Also known as the California poplar, Populus is a deciduous broadleaf tree species often used as a model organism in plant biology. Its genome was published in 2006. [143] (Puneet)

positives - in a BLAST output, the number and fraction of residues for which the alignment scores have positive rather than negative values [144] (Mary)

promoter - a region of DNA that facilitates transcription of a gene; promoters are typically located closely upstream of the gene they regulate [145] (Megan)

proteome - entire set of proteins expressed by a genome, cell, tissue, or organism. It may refer to expressed proteins under certain conditions [146](Samantha)

proton pump - an integral membrane protein capable of transporting protons across a membrane. Mitochondria utilize proton pumps in order to create a proton gradient used for producing ATP. Wikipedia (Karen)

PSORT - a prediction server that judges where a mature protein could be in the cell, based on its transmembrane domains, its predicted mature amino acid composition, and its signal sequences. (PSORT, Laura)

pseudogenes-A sequence of DNA that looks like a gene, but most likely contains many stop codons. It may have evolved away from a real gene or a paralog might have taken its place (Lecture, Pallavi)

purine - a category of nitrogenous base consisting of a pyrimidine ring fused to an imidazole ring. Notable purine bases are adenine and guanine. [147] (Peter)

p-value - probability associated with a statistical test of the difference between populations. Populations are considered significantly different if the associated p-value is small (typically 0.05 or smaller). (Discovery Genomics, Proteomics and Bioinformatics[148], Pyfrom)

pyrimidine - a category of nitrogenous base consisting of a heterocyclic aromatic ring containing two nitrogen atoms at positions 1 and 3 of the six-member ring. Notable pyrimidine bases are cytosine, thymine, and uracil. [149] (Peter)

pyrosequencing - Pyro.jpg(image from [150]) (Jared)

Q

q-arm - the longer arm of a chromosome's two arms separated by the centromere (compare to p-arm, the shorter arm) (MedTerms Dictionary, Jay)

quantitative real time polymerase chain reaction (rt-PCR or qrt-PCR) - An experiment that serves to amplify and quantify the amount of a gene in a cell over time. There are many variations of the experiment, but cells are commonly placed in various external environments, and their expressed mRNAs are simultaneously collected. From the expressed mRNAs, cDNAs are produced. Those cDNAs are then used in the detection when added reagents produce another complementary DNA strand that binds to the cDNAs and fluoresces. The intensity of fluorescence is detected over the range of external conditions. This allows us to determine the extent to which genes are expressed in different environments. (USCM Webpage, Shamita)

quantitative trait loci (QTL) - the effect of multiple loci on a trait that can be quantified phenotypically, and that varies in degree depending on the loci involved (Campbell & Heyer, 2007, Shamita)

query sequence - the sequence (whether amino acid or nucleotide) entered into a database’s search function and checked against the database entries. (BLAST on Wikipedia, Laura)

R

RAST - (Rapid Annotation using Subsystem Technology)- a fully-automated service for annotating bacterial and archaeal genomes. It provides high quality genome annotations for these genomes across the whole phylogenetic tree. ([151], Max Win)

rDNA-These are DNA sequences that encode for ribosomal RNA. Note that rDNA can also stand for recombinant DNA. (rDNA Pallavi)

replicon - a region of DNA or RNA that replicates from a single origin of replication [152] (Megan)

repressor - a protein that binds to a section of DNA in order to regulate one or more genes by decreasing the rate of transcription [153] (Megan)

residue (protein) - the remaining portion of an amino acid after a water molecule has been removed and it has been incorporated into a protein. Functional residues, referred to in Pfam, are the residues that perform some specific identifiable function or are part of a domain, and can be conserved across evolutionarily-related proteins. (Pfam Help, Laura)

Resveratrol - a polyphenol compound in the stilbene family, found in grapes, blueberries, and other foods, that has been shown to have cancer-preventive, antioxidant, antimutagenic, and anti-inflammatory activity. [154](Lauren)

retinal - vitamin A aldehyde; a chromophore (colour-producing molecule) that is bound to proteins called opsins. For example, Haloarcula and other halophilic archaea have a light-driven proton pump such as bacteriorhodopsin. This pump contains a reddish-purple retinal that absorbs green visible light. (Wikipedia, Olivia)

retropseudogenes-these are genes that have been reverse-transcribed from mRNA and the resulting DNA sequence is incorporated back into the genome. They are non-functional segments of DNA and can be distinguished from pseudogenes in that they do not have intron sequences. (1 Pallavi)

retrotransposons - RNA transcribed back into DNA and added into the genome [155](Samantha)

ribonuclease - a nuclease that catalyzes the degradation of RNA into smaller components [156] (Mary)

ribosome binding site (RBS) - short purine-rich sequence found directly (4-8 bp) upstream of the start codon of a protein coding sequence to which ribosomes bind to begin translation. The RBS sequence tends to be species-specific, and the consensus sequence acts as a good indicator of the start site of a gene (Bakke et al 2009 and Lecture, Olivia)

ribozyme - an RNA molecule that acts as an enzyme to catalyze a reaction. Some ribozymes can catalyze self-splicing by folding in order to remove introns without the need for a protein. (Lecture, Olivia)

RNA (Ribonucleic Acid) - A category of nucleic acids in which the component sugar is ribose and whose nucleotides contain the four bases Cytosine, Uracil, Guanine, and Adenine. The three types of RNA are messenger RNA (mRNA), transfer RNA (tRNA) and ribosomal RNA (rRNA). RNAs are essential to all known forms of life. ( [157] [158] Mike)

RNAi (RNA interference) - a process by which short pieces of RNA are used to degrade larger pieces of complementary RNA. It is found in all eukaryotes and is being considered as a possible approach for gene therapy where a reduced gene product would alleviate symptoms [159]. (Pyfrom)

RNA polymerase I - an enzyme in eukaryotic organisms that transcribes the 45S pre-rRNA, which is processed to form the 28S, 18S, and 5.8S rRNA molecules. These forms of RNA account for over 50% of the RNA synthesized in a typical cell. [160] [161] (Laura M.)

RNaseP - a ribozyme that cleaves off a precursor section of RNA from a tRNA molecule. Previously, it was thought that this gene was necessary for life and therefore ubiquitous. However, species of archaea have been discovered that have adapted to life without this ribozyme. Wikipedia; Life without RNaseP (Karen)

S

Serovar-a subdivision of a species based on the characteristics of their cell surface antigens (serovar Pallavi)

sequence tag site (STS) - A sequence-tagged site (or STS) is a short (200 to 500 base pair) DNA sequence that has a single occurrence in the genome and whose location and base sequence are known [162]. (Pyfrom)

scaffold - a section of a sequenced genome composed of contigs that are in the right order but not necessarily connected (MedTerms Dictionary, Jay)

Section - A taxonomic term analogous to subgenus. Highbush blueberry belongs to the Cyanococcus section of Vaccinium (Personal Communication, Grant Proposal). (Lexi)

Shadow enhancers - secondary enhancers that are thought to be important for natural selection to occur in regulatory DNA segments. They evolve much faster than primary enhancers, which suggests that they are under fewer functional constraints (Wray and Babbit 2008 Science Pallavi)

Shine-Dalgarno sequence - A ribosomal binding site on an mRNA, usually a sequence of about six bases located about six or seven bases upstream of the start codon. An anti-Shine-Dalgarno sequence exists on the rRNA in the small subunit of the ribosome; when the two sequences align, the mRNA is lined up and prepared for translation. (Lecture and Wikipedia article, Laura)
Note: The Shine-Dalgarno consensus sequence for our genome is ccGGAGGt.
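One way to check a called gene for a Shine-Dalgarno site is to look just upstream of its start codon for something close to the GGAGG core of the consensus noted above. The sketch below is only an illustration: the function name, the 20-base search window, and the one-mismatch tolerance are all assumptions, and coordinates are 0-based on the coding strand.

```python
def find_shine_dalgarno(seq, start_codon_pos, motif="GGAGG",
                        upstream=20, max_mismatches=1):
    """Search the bases upstream of a start codon for the best motif match.

    Returns (motif_start, spacing, mismatches), where spacing is the number
    of bases between the end of the motif and the start codon, or None if
    no window is within max_mismatches of the motif.
    """
    seq = seq.upper()
    window_start = max(0, start_codon_pos - upstream)
    region = seq[window_start:start_codon_pos]
    best = None
    for i in range(len(region) - len(motif) + 1):
        candidate = region[i:i + len(motif)]
        mismatches = sum(a != b for a, b in zip(candidate, motif))
        # Keep the first window with the fewest mismatches.
        if mismatches <= max_mismatches and (best is None or mismatches < best[2]):
            spacing = start_codon_pos - (window_start + i + len(motif))
            best = (window_start + i, spacing, mismatches)
    return best


if __name__ == "__main__":
    # Hypothetical example: GGAGG core, six bases of spacer, then ATG at index 15.
    example = "TTCC" + "GGAGG" + "TAACAA" + "ATGAAA"
    print(find_shine_dalgarno(example, 15))
```

A spacing far outside the six-or-seven-base range described above is a hint that the start codon may have been miscalled.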

SignalP - a prediction server that judges whether or not a query protein contains a signal peptide. SignalP measures each amino acid against the amino acid sequences of probable signal peptide matches and predicts the cleavage site of the signal peptide. (SignalP Output explained, Laura)

signal peptide - a short peptide chain that directs the post-translational transport of a protein [163] (Matt)

simple sequence repeat (SSR) - short, repetitive fragments of DNA that display a polymorphism in length, giving rise to allele variation in SSRs between individuals within a species. Also see microsatellite.(Soybean and Alfalfa Research Lab Shamita)
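A quick way to find candidate SSRs in a sequence is a regular expression with a backreference, which matches a short motif followed by further copies of itself. The sketch below looks for di- and trinucleotide repeats; the function name is hypothetical and the default of six or more repeat units is an assumption that should be set to whatever cutoff your marker design calls for.

```python
import re


def find_ssrs(dna, min_repeats=6):
    """Return (start, motif, repeat_count) for di- and trinucleotide SSRs."""
    dna = dna.upper()
    # ([ACGT]{2,3}) captures the motif; \1{n,} requires n or more further copies.
    pattern = re.compile(r"([ACGT]{2,3})\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(dna):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAA
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


if __name__ == "__main__":
    print(find_ssrs("GGC" + "AT" * 7 + "CCG"))
```

Each hit gives the position to design flanking primers around; the repeat count is what varies between individuals and makes the marker polymorphic.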

singleton - (Aaron)

small nuclear ribonucleic acid (snRNA) - small RNA molecules found in the nucleus of eukaryotic cells. They combine with specific proteins (called Sm proteins) to form ribonucleoprotein complexes (snRNPs), which function in removal of introns during RNA splicing. [164] (Laura M.)

Smith-Waterman alignment - A well-known algorithm for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure [165](Will).
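The idea can be sketched in a few lines: fill a scoring matrix in which no cell is allowed to drop below zero, then trace back from the highest-scoring cell until a zero is reached. The scoring values below (match +2, mismatch -1, linear gap -2) are assumptions chosen for illustration; real tools use substitution matrices such as BLOSUM and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; the 0 option lets an alignment start anywhere.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("GGGACGTCCC", "TTACGTAA"))
```

In the example, only the shared ACGT segment is reported, which is the difference from a global alignment that would force the flanking bases to line up as well.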

SNP (Single Nucleotide Polymorphism) - a DNA sequence variation occurring when a single nucleotide in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual) [166](Will).

SOAPdenovo - a package of algorithms developed by BGI for short-read de novo assembly of human-sized genomes. [167] (Jared)

Solanum lycopersicum - Commonly referred to as the tomato, Solanum lycopersicum is an effective model system for testing the functionality of various genes through transformation e.g. via agrobacteria (lecture) (Puneet)

SOLiD - (Aaron)

Stilbenes - polyphenolic compounds that have been the focus of clinical research for cancer prevention. [4] One of the best-known stilbenes, resveratrol, has been shown to have anticancer properties and the ability to suppress proliferation of cancer cells. [168] (Lauren)


subject sequence - In BLAST, the sequences retrieved from the database, which are compared for similarity to the query sequence, are considered subject sequences. As a general rule, subject sequences should be longer than the query sequence. BLAST searching (Karen)

subtracted cDNA library - The genetic library that results from a comparison of two different expression conditions (i.e., two different tissues of an organism, two different species, or two different physical environments). The library is produced by gathering all expressed mRNAs from the two conditions and constructing cDNAs from those mRNAs. Then, each set of cDNAs is mixed with the mRNAs from the opposite expression condition to observe whether mRNA-cDNA complexes form. If some cDNAs from condition 1 fail to bind to the mRNAs from condition 2, it is assumed that those cDNAs are expressed in condition 1 only. The resulting unique cDNAs form a "subtracted" cDNA library. (PubMed: Subtracted cDNA Library, Shamita)

sucrose synthase - an enzyme essential to sucrose metabolism in fruits that catalyzes the reversible formation of sucrose from UDP-glucose and fructose. Loss or reduction of sucrose synthase has been shown both to reduce intracellular sugars and to slow growth rates in fruits. [169] (Lauren)

symporter - an integral membrane protein that moves two or more different molecules or ions across a phospholipid membrane in the same direction. [170] (Peter)

syngenic - members of the same species that are genetically identical. [171] (William G.)

synteny - a neologism from the Greek for "on the same ribbon". Genes that are syntenic in one species are on the same chromosome; genes that are syntenic across species retain the same order on respective chromosomes as a result of descent from a common ancestor (Answers.com, Jay)

synthetase - a type of enzyme that creates a new covalent bond and requires direct input of energy from a high-energy phosphate. [172] (Peter)

Systems Biology - An emerging school of biology which utilizes high throughput data collection and analysis to study biological systems in a complex, integrated way that accounts for interactions within and among all levels of the system. The availability of full genome sequences has been crucial to the growth of this field. Institute for Systems Biology (Lexi)

T

tandem array - a series of copies of a gene back-to-back on a chromosome. These genes are transcribed at the same time and ensure that many copies of the gene product are made by the cell. Ribosomal RNA genes are often in tandem arrays. [173] (Laura M.)

tannin - a polyphenol molecule found in nuts, coffee, and fruits such as pomegranates, grapes, blueberries and cranberries that aids in the ripening of fruit and the aging process of wine. [174] (Lauren)

TATA box - a DNA sequence often found in promoters of archaea and eukaryotes. Useful in identifying possible promoter regions, and thereby genes after these regions. ([175], Leland)

tBLASTn - a BLAST search (see BLAST) in which a protein sequence is entered and compared to the translated nucleotide database. [176] (Aaron)

tBLASTx - a BLAST search (see BLAST) in which a translated nucleotide sequence is entered and compared to the translated nucleotide database [177] (Aaron)
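
Both tBLASTn and tBLASTx rely on translating nucleotide sequences in all six reading frames. The Python sketch below shows that translation step only, using the standard genetic code; the function names and frame labels are assumptions for this example, and this is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become X, stops become *."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(six_frame_translation("ATGGCC"))
```

In a tBLASTn search the protein query is compared against translations like these for every database sequence; in tBLASTx the query is translated the same way as well.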

transcription factors - proteins that bind to specific sequences of DNA and regulate transcription (and thus expression). In genomics this concept is important because it means an organism can get more variation from fewer genes (different combinations can be switched on or off). ([178], Leland)

toxicogenomics - a subdiscipline of genomics that deals with gene and protein activity in order to determine how organisms respond to toxins in the environment. This has important implications for research concerning the effects of toxins on genetic material, and how that affects the organism in question (MedTerms, WebDefinitions Claudia).

transcriptome - the set of all mRNA molecules transcribed from a genome [179] (Megan)

transferase - an enzyme that catalyzes the transfer of a functional group from one molecule (the donor) to another (the acceptor) [180] (Matt)

transmembrane helix - a single membrane-spanning alpha helix of a transmembrane protein, usually about twenty amino acids in length. Such helices are usually predicted from hydrophobicity. [181] (Mary)
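
One common way to predict such helices from hydrophobicity is a sliding-window average of a hydropathy scale. The Python sketch below uses the Kyte-Doolittle scale; the function names, the 19-residue window, and the 1.6 cutoff are assumptions for illustration, and dedicated predictors use more elaborate models.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Mean hydropathy of each window; raises KeyError on non-standard residues."""
    scores = [KD[aa] for aa in protein.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def candidate_tm_windows(protein, window=19, threshold=1.6):
    """Return 0-based start positions of windows at or above the threshold."""
    profile = hydropathy_profile(protein, window)
    return [i for i, score in enumerate(profile) if score >= threshold]


toy = "D" * 10 + "L" * 20 + "K" * 10  # charged ends, hydrophobic core
print(candidate_tm_windows(toy))
```

Overlapping windows above the threshold would normally be merged into a single predicted helix.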

transposons / transposable elements - DNA sequences that can move around to different positions in a single cell's genome. Transposons can cause mutations and change the length of the genome. [182](Samantha)

transposon mutagenesis - a procedure in which a transposon is inserted into a gene, which inactivates the gene and can lead to the discovery of the phenotype associated with this gene (transposon mutagenesis Pallavi)

trans-splicing - a process in which exon sequences from separate RNA transcripts are joined to form a mature mRNA. This process results in a fusion mRNA (8 Pallavi).

tRNADB-CE - The tRNA gene database curated by experts. As of 2011 it is composed of 927 complete and 1301 draft genomes of Bacteria and Archaea, 171 complete virus genomes, 121 complete chloroplast genomes, and 12 complete eukaryote (plant and fungi) genomes. Entries in this database were generated using tRNAscan-SE, a computer program widely used for tRNA gene searches, in combination with ARAGORN and tRNAfinder. [183] (Puneet)

tRNAscan-SE - Supported by the Lowe lab, tRNAscan-SE is an online tool used to identify tRNA genes in DNA sequences. tRNAscan-SE can identify 99-100% of tRNA genes in a DNA sequence while giving fewer than one false positive per 15 gigabases. [184] (Puneet)

tRNA splicing endonuclease - an enzyme that cleaves intervening sequences of precursor tRNA. [185] (Peter)

Tribe - Taxonomic term that ranks between a subfamily and a genus (Wikipedia, Lexi)

type strain - an isolated sample of an organism that acts as the reference point for defining that species (Lecture, Olivia)

U

V

Variable number tandem repeats (VNTRs) - See SSR and microsatellite (Soybean and Alfalfa Research Lab, Shamita)

Vertical gene transfer - the transmission of genetic material from parent to offspring that is associated with sexual reproduction and thus respects species-specific boundaries (6 Pallavi)

Vitis vinifera - the grape or grapevine, a dicotyledonous plant in the family Vitaceae. It is a relative of the blueberry, which belongs to a different family (Ericaceae; see Vaccinium). Ranging from purple to red to black, grapes are commonly used to make wine and have been shown to exhibit antioxidant properties. [186], [187] (Lauren)

Vaccinium - A genus of shrubs in the family Ericaceae. Its fruits include the cranberry, blueberry, bilberry, lingonberry, and huckleberry; these fruits have health-promoting properties, most likely due to their anthocyanin, flavonoid, and phenylpropanoid content. Typically, they grow in acidic soil. [Wikipedia article] (Puneet)

Vaccinium corymbosum - the Northern highbush blueberry plant, native to eastern North America. This genome was the basis of the Spring Genomics 2011 class. ([188], Leland)

Vaccinium macrocarpon - Cranberry, a fruit closely related to the blueberry, belonging to the subgenus (or section) Oxycoccus of Vaccinium (Lexi).

W

whole genome duplication (WGD) - an evolutionary event characterized by the duplication of a species' entire genome, which allows for gene innovation and genome diversity. Duplication events contribute to paralogs within species and orthologs between species that allow for the tracing of evolutionary relationships. [189] (Lauren)

whole genome shotgun sequencing - a method of sequencing in which DNA is cut into small pieces and cloned into vectors; about 500 bp is then sequenced from both ends of every cloned insert to form mate pairs. Mate pairs rarely overlap, but software uses them to reassemble the sequence. [190] (Samantha)

X

xenobiotic - a substance that is found within an organism that is not normally produced or expected to be found within that organism [191] (Megan)

xenolog - homologs that are created by horizontal gene transfer between two different species [192] (Matt)

Y

Yeast Artificial Chromosome (YAC) - an artificial chromosome used as a vector to clone or hold (as in a DNA library) DNA inserts from 150 kb to 1.5 Mb in size. (Discovering Genomics, Proteomics, & Bioinformatics pg 50, Leland)

Z