GcatWiki - User contributions [en]

Davidson Missouri W/Davidson Protocols

2009-01-19T21:01:56Z

SaSimpson:

# [http://www.bio.davidson.edu/courses/molbio/labnotebook.html How to Keep a Lab Notebook]
# [http://www.bio.davidson.edu/courses/Molbio/Protocols/reagents.html Common molecular reagents]
# [http://parts.mit.edu/registry/index.php/Assembly:Standard_assembly Standard Assembly]
# [http://partsregistry.org/Help:BioBrick_Prefix_and_Suffix BioBrick Ends]
#[http://www.bio.davidson.edu/courses/Molbio/Protocols/ORIs.html '''Compatibility of Plasmids''']
# [http://www.bio.davidson.edu/courses/Molbio/Protocols/anneal_oligos.html Building dsDNA with Oligos]
# [http://www.bio.davidson.edu/courses/Molbio/Protocols/pcr.html Setting up PCR mixtures]
# [http://www.bio.davidson.edu/courses/Molbio/Protocols/magnesium.html PCR and Mg2+ concentration]
#[http://www.bio.davidson.edu/courses/Molbio/Protocols/Clean_Concentrate.html Clean and Concentrate DNA (after PCR, before digestion)]
# [http://www.bio.davidson.edu/courses/Molbio/Protocols/pourgel.html Pouring an agarose gel]
# [http://www.bio.davidson.edu/courses/Molbio/Protocols/molwt.html Calculate MWs]
# [http://www.bio.davidson.edu/courses/Molbio/Protocols/digestion.html Digest DNA with restriction enzymes]
# [[Davidson Missouri W/Double Digest Guide| Double Digest Guide]]
# [http://www.bio.davidson.edu/courses/Molbio/Protocols/clean_short.html Ethanol Precipitate DNA (short protocol)]
# [http://www.bio.davidson.edu/courses/Molbio/Protocols/gels2002/1kbladder.pdf 1kb MW markers]
# [http://www.bio.davidson.edu/courses/Molbio/Protocols/SAP.html Shrimp Alkaline Phosphatase]
# [http://www.bio.davidson.edu/courses/Molbio/Protocols/Qiagen_gelpure.html Qiagen QIAquick Gel Purification]
# [http://www.bio.davidson.edu/courses/Molbio/Protocols/QIAQuick_recycle.html Qiagen QIAquick Column Regeneration Protocol]
# [http://www.bio.davidson.edu/courses/Molbio/Protocols/gelpure.html ElectroElute Gel Purification]
# [http://www.bio.davidson.edu/courses/Molbio/Protocols/ligation.html Ligation Protocol]
# [http://www.bio.davidson.edu/courses/Molbio/Protocols/Promegacompcells.pdf Heat Shock Transformation] OR [http://www.bio.davidson.edu/courses/Molbio/Protocols/transformation.html Short version of Heat Shock]
#[http://www.bio.davidson.edu/courses/Molbio/Protocols/Zippy_Transformation.html Zippy Transformation]
#[http://www.bio.davidson.edu/courses/Molbio/Protocols/ColonyPCR_Screening.html Colony PCR to Screen for Successful Ligations]
# [http://www.bio.davidson.edu/courses/Molbio/Protocols/miniprepPrmega.html Promega miniprep]
#[http://www.bio.davidson.edu/courses/Molbio/Protocols/Tranformation_list.html Choices for Transformation: Heat Shock vs. Zyppy]
#[http://www.bio.davidson.edu/courses/Molbio/Protocols/MiniPrep_list.html Choices for Mini-Preps: Promega vs. Zyppy]
# [[Davidson Missouri W/Primer_dimer| Making dsDNA Using Primer Dimers]]
# When inducing with IPTG, use '''3 µL of stock''' (0.2 g/mL = 20% w/v) '''to every 1 mL''' of LB or other liquid.
# When inducing with Arabinose, use '''2 µL of stock''' (10% w/v L-Arabinose) '''to every 1 mL''' of LB or other liquid.
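
The two induction rules above scale linearly with culture volume. A minimal sketch of that arithmetic follows; the function and dictionary names are illustrative and not part of any lab protocol, and only the per-mL volumes come from the notes above.

```python
# µL of stock per 1 mL of culture, as stated in the induction notes above.
UL_STOCK_PER_ML = {"IPTG": 3.0, "arabinose": 2.0}


def inducer_volume_ul(inducer: str, culture_ml: float) -> float:
    """Return the µL of inducer stock to add to `culture_ml` mL of LB or other liquid."""
    if culture_ml < 0:
        raise ValueError("culture volume must be non-negative")
    return UL_STOCK_PER_ML[inducer] * culture_ml


if __name__ == "__main__":
    # A 5 mL overnight culture induced with IPTG.
    print(inducer_volume_ul("IPTG", 5))
```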

'''Web Tools We Use'''
#[http://gcat.davidson.edu/iGEM08/tools.html All Sites In One Place]
#[http://gcat.davidson.edu/iGEM08/gelwebsite/gelwebsite.html Optimize your Gel]
#[http://gcat.davidson.edu/iGEM07/genesplitter.html Gene Splitting Web Site]
#[http://gcat.davidson.edu/iGEM08/bbprimer.html PCR Primers w/ BioBricks]
# [http://www.promega.com/biomath/calc11.htm Promega Tm Calculator]
# [http://partsregistry.org/AHL List of auto-inducers and their catalog numbers]
#[[Davidson Missouri W/CUGI_Seuqencing| Sequencing at CUGI]]
#[http://gcat.davidson.edu/IGEM06/oligo.html Lance-olator Oligos for dsDNA assembly]

File:GenomicsLabFinalSS.pdf

2008-12-09T02:03:23Z

SaSimpson:

Halorhabdus utahensis Genome

2008-12-09T02:03:04Z

SaSimpson: /* My Favorite Term Paper */

This page will be used by Davidson College students in the [http://www.bio.davidson.edu/Courses/Bio343/LabMethods.html Genomics Laboratory course].
__NOTOC__
== Links to Multiple Databases ==
*[http://imgweb.jgi-psf.org/cgi-bin/img_edu_v260/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2500575004 JGI IMG EDU] public access
**[[Media:JGIAnnotation.xls|JGI Annotation Excel Spreadsheet]]
*[http://www.tigr.org/tigr-scripts/prok_manatee/shared/login.cgi Manatee at JCVI] use the davidson number sent by email as username and password (database is nthu01 - this is case sensitive)
**[[Media:ManateeAnnotation.xls|Manatee Annotation Excel Spreadsheet]]
*[http://rast.nmpdr.org/ SEED view via RAST] use the username and password combination sent to you by SEED
**[[Media:RastAnnotation.xls|RAST Annotation Excel Spreadsheet]]
**[http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=18261238 RAST Publication in PubMed]
*[http://www.genome.jp/kegg/kaas/ KEGG] We can submit our genes to KEGG to have it mapped out, but SEED and Manatee may already do this. Do we want to ask them to upload it into their database? 
*[http://wishart.biology.ualberta.ca/basys/cache/135af8726ad6f61ec4c5f1e9c4d139ac/index.html BASYs] 
*[http://gcat.davidson.edu/Registry/compare/ Pairwise comparisons of All Three Annotations]

[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI_5contigs.txt JGI Full genome, 5 separate contigs & 3.1 Mbp, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.txt JGI gene DNA sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.xls JGI gene annotations, Excel] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_proteins.txt JGI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_merged.txt JCVI Full genome, 5 contigs fused, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_ORFs.txt JCVI gene sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_proteins.txt JCVI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/GeneLengths.xls 3-way comparison, Excel] 
[[Venn_diagrams]] Venn diagram of 3-way comparison 

== RNA Genes ==

*[[tRNA Genes Check List]] 
*[[rRNA operon]] 
*[[2 misc. RNA genes]] (short summary list) 
*[[Missing tRNA-trp gene found]] 

== Other Resources ==
*[[Consensus Shine Dalgarno]] Excel File for ''H. utahensis'' 
*[[References]] 
*[[Gene Annotation Template]] 
*[[General Questions]] 
*[[Page for Annotated Genes]] 
*[http://www.bio.davidson.edu/courses/genomics/2008/Win/ec/ Search EC number in RAST, JGI or Manatee] 
*[http://gcat.davidson.edu/Wideloache/Webfiles/ecNumBlast.html Blast an EC number against the H. utahensis genome] 
*[http://gcat.davidson.edu/Wideloache/Webfiles/AnnotationSearcher.html Perform a text-based search of the Rast, JGI, and Manatee protein calls] 

== Data ==
*[[GC content of the contigs]] 
*[[Alternative Start Codons]] 
*[[Gene Length Histograms|Gene Length Comparison (all genes)]] 
*[[Venn_diagrams|Gene Prediction Overlap (Venn diagrams)]] 
*[[Shine Dalgarno Sequence Logo]] (RBS if on mRNA and SD if on 16S RNA) 
*[http://gcat.davidson.edu/Registry/kegg/geneSize.html Average Gene Length Comparison (shared and differing genes)]
*[http://gcat.davidson.edu/Registry/kegg/pathWay.html kegg pathways with colored ECs for our genome]

== Research Questions ==
#How do the three systems compare for finding ORFs and RNA genes?
#Is there a pattern of missed genes for any of the 3 sites?
#Do the three systems differ in their ability to find good start codons and Shine-Dalgarno sequences? [We need a standard set of genes for comparison. Only highly conserved or a range of genes?]
# Were Shine-Dalgarno sequences calculated for our species or default values used? If default, what sequence?
#Can we fill any holes in their automated annotation? Is there a mechanism for users to add in genes?
#How do the 3 sites compare for ease of use?
#What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working?
#How does each of the 3 sites compare for pathway detection and visualization?
#Do they find the origin of replication? Can we find it?

* How do the 3 systems compare when one system calls a gene hypothetical and another calls it a functional protein? How can they vary, and which is closer to correct (however you define that, possibly by date of matched entry)? (Pallavi and Mary)
* Why did one system call a gene when the other two did not? (Matt and Laura)
* How do the 3 sites compare for ease of use? What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working? (Samantha and Nick)
* Where is the origin of replication and did the 3 systems attempt to identify this?
* Did the 3 systems utilize Shine-Dalgarno sequences to help them call start codons? Did they utilize our species's consensus Shine-Dalgarno? (Peter)
* We need to fill in the [[Venn diagrams]] for our 3-way comparison. Let's compare the size of ORFs and generate a [[Gene Length Histograms|graph comparing the distributions]] for all 3. (Max and Will - they also take requests).
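
The 3-way comparison in the last item comes down to set arithmetic once every gene call has a key that is comparable across the three systems. The sketch below assumes such keys already exist (for example a strand plus stop coordinate, which is an assumption here and not something the sites provide directly); the function name and the example identifiers are hypothetical.

```python
def venn3_counts(jgi, rast, manatee):
    """Count gene calls in each of the seven regions of a 3-way Venn diagram.

    Each argument is an iterable of hashable gene keys that are assumed to be
    directly comparable across the three annotation systems.
    """
    a, b, c = set(jgi), set(rast), set(manatee)
    return {
        "JGI only": len(a - b - c),
        "RAST only": len(b - a - c),
        "Manatee only": len(c - a - b),
        "JGI+RAST": len((a & b) - c),
        "JGI+Manatee": len((a & c) - b),
        "RAST+Manatee": len((b & c) - a),
        "all three": len(a & b & c),
    }


if __name__ == "__main__":
    # Made-up gene keys, purely for illustration.
    counts = venn3_counts({"g1", "g2", "g3"}, {"g2", "g3", "g4"}, {"g3", "g5"})
    for region, n in counts.items():
        print(region, n)
```

The hard part in practice is choosing the key: two systems can call the same gene with different start codons, so matching on the stop coordinate is one possible choice, not the only one.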

<hr>
=Our Favorites=
==My Favorite Term Paper==
Pallavi - [[Media:genomicsfinalpallavi.pdf]] 
Samantha - [[Media:GenomicsLabFinalSS.pdf]] 

== My favorite genes==
*Pallavi - Monooxygenase vs. Peroxiredoxin [[Media:peroxiredoxinormonooxygenase.ppt]]

*Mary - JGI gene 2500588521 (922976...924046) [[Media:My favorite gene.ppt]]

*Max - [http://app.sliderocket.com/app/FullPlayer.aspx?id=f2058b94-845f-4a11-94eb-142f251a7fea JGI gene 2500587636 (2-1849)]

*Samantha - JGI gene 2500575882 (80504-80878) [[Media:Earl.ppt]]

*Nick - JGI gene 2300587691 (69942...72866) [[Media:Gene presentation.ppt]]

*Will - [http://gcat.davidson.edu/GcatWiki/images/e/e7/Halomucin.ppt JGI gene 2500590430 (2847205..2854335)]

*Jay - JGI gene 2500588397 (806410..807321) [http://www.bio.davidson.edu/courses/genomics/2008/McNair/Fav_Gene/FavoriteGenePresentation.pptx Co/Zn/Cd PowerPoint]

*Matt - Transcriptional Regulator nrdR (3109722..3110204 + 7274..7765) [[Media:BioFavoriteGeneNrdR.ppt]]

*Peter - tRNA intron endonuclease [[Media:TRNAtrpintronendonuclease.ppt]]

*Laura - 16S Small ribosomal subunit, JGI gene 2500590728 (2397347..2398825)

== My Favorite Pathways==
Pallavi - Carbohydrate Metabolism, specifically glycolysis/gluconeogenesis[[Media:PallaviPathway.ppt]]

Jay - [[Media:Jay's_Favorite_Pathway.ppt]]

Will - [http://docs.google.com/Presentation?id=d977q88_2ghfq6tg8 RBS Consensus and Alternative Start Codons]

Max -[http://www.bio.davidson.edu/Courses/Bio343/SD_Max.pptx RBS/Shine-Dalgarno Part B]

Peter Bakke - [http://www.bio.davidson.edu/Courses/Bio343/origin_Bakke.ppt Origin of Replication]

Samantha - Purine Metabolism!!! [[Media:Purines.ppt]]

Laura - [http://www.bio.davidson.edu/Courses/Bio343/Laura_aa.ppt Amino Acid Biosynthesis]

Nick - Pentose Phosphate [http://www.bio.davidson.edu/Courses/Bio343/nick_Pathway.pptx Pentose Phosphate Pathway]

Matt - Chitin Metabolism [[Media: ChitinMetabolism.ppt]]

Mary - Citric acid cycle [[Media: Citric acid cycle.ppt]]

Malcolm - protein export [http://www.bio.davidson.edu/Courses/Bio343/Protein_Secretions.ppt Protein Secretion]

''Suggestions by Kjeld'' 
'''[[Cellulase]]''' by Pallavi 
I think it would be very interesting to look for genes involved in cellulose degradation: endocellulases, exocellulases (=cellobiohydrolases) and b-glucosidases.
Many cellulose degraders produce a range of each type. A cellulolytic system able to function at 4.6 M of NaCl is an interesting one. We either did not observe (or did not look for) cellulose degradation. However, these systems are normally inducible and you need to test several substrates and inducers. It would be nice to have a compilation of putative “cellulase” genes.
There are several good recent reviews on cellulases (also mentioning E.C. numbers and enzyme families) that your students could consult.

'''[[Chitinase]]''' by Matt 
Apparently you detected a chitinase, but according to our records it does not grow on N-acetyl-glucosamine, which is somewhat strange. It grows on glucose though.

'''[[Lipases]]''' by Mary 
Lipases (/esterases) would also be interesting to look for – some lipases have important industrial applications.

'''[[Amylases]]''' by Samantha 
We did not observe growth on starch. Did you find any “amylase-coding genes”?

'''[[Xylose (glucose) isomerase)]]''' by Nick 
An enzyme of great commercial value.

'''[[Amino acids]]''' led by Laura and assisted by Max, Jay, Nick and Samantha 
According to our records AX-2 is able to grow in a “defined medium”. This is at variance with your “holes” for synthesis of amino acids. However, there could have been some “carry over” of amino acids when inoculating a culture grown in complex medium (e.g. containing yeast extract). However, we are normally aware of this problem and do repeated culturing to dilute out potential growth factors present in yeast extract.

'''[[Proteases]]''' by Peter 
We did not detect protease activity – albeit only checking a few substrates.

'''[[Protein Export]]''' by Malcolm 
We need to know how these proteins might reach outside the cell which is where the food would be and thus the digestive enzymes or importers need to reach the outside world or the cell membrane.

'''[[Phage Proteins]]''' by Malcolm 
Does our species have any phage pathogens?

'''Transposons''' by Malcolm 
As many as 21 different transposase genes or gene fragments.

'''Plasmids''' by Malcolm 
plasmid stability protein StbB, 
Protein affecting phage T7 exclusion by the F plasmid 

= Student-created tutorials: =
== Tutorials for Annotating Genomes ==

# Will DeLoache - [http://www.bio.davidson.edu/courses/genomics/2008/DeLoache/BioPerlTutorial/BioPerl.htm BioPerl Installation] 
# Max Win - [http://www.bio.davidson.edu/courses/genomics/2008/Win/home/perl.html Introduction to Perl for non-programmers (with step-by-step explanations, simple exercises and solutions)] 
# Pallavi - Conserved Domains Database (CDD) [[Media:CDDtutorial.doc]] 
# Mary - Protein Data Bank (PDB) [[Media:PDB Tutorial.doc]] 
# Laura Voss - Pfam Database [http://www.bio.davidson.edu/Courses/Bio343/Pfam_tutorial.doc Pfam Tutorial] 
# Samantha Simpson - [http://www.bio.davidson.edu/courses/genomics/2008/Simpson/Tutorial.html NCBI BLAST] 
# Peter Bakke - [[Media:ShineDalgarnoTutorial.doc]] 
# Jay McNair - [http://www.bio.davidson.edu/courses/genomics/2008/McNair/Origin_Tutorial/OriginTutorial.doc Origin of Replication Tutorial] 
# Nick Carney - Navigating the JGI Database [[Media:NavigatingJGItutorial.doc]] 
# Matt Lotz - SEED Viewer - [[Media:SEEDTutorial.doc]] 

== Pathway Tutorials==
[http://www.pathguide.org/ Pathguide] - a possible source of tutorials and extensive information

[http://www.bigre.ulb.ac.be/Users/didier/pathfinding/ Shortest Path Tool]
<hr>
*Pallavi: I will compare RAST and KEGG in pathway annotations and use Glycolysis/Gluconeogenesis as my example: [[Media:Pallavitutorial.doc]]

*Matt: WikiPathways [[Media:WikiPathwaysTutorial2.doc]]

*Mary: ENZYME [[Media:ENZYME tutorial.doc]]

*Samantha: [http://www.bio.davidson.edu/courses/genomics/2008/Simpson/Tutorial2.html How To Determine EC Numbers] 

*Nick: Metacyc [[Media:MetaCyc tutorial.doc]]

*Max: [http://www.bio.davidson.edu/courses/genomics/2008/Win/kgml.html KGML How to color EC numbers in KEGG maps and view it in KGML graph editor] 

*Jay: [http://www.bio.davidson.edu/courses/genomics/2008/McNair/Pathways_Tutorial/SEED_Scenario_Paths.doc SEED Scenario Paths] (a tool to determine completeness of pathways)

*Laura: [http://www.bio.davidson.edu/Courses/Bio343/Pathway_Entrances_Exits.doc Pathway Entrances and Exits]

*Will: [http://www.bio.davidson.edu/courses/genomics/2008/DeLoache/LocalBlastTutorial/LocalBlast.html Running BLAST Locally]

*Peter: Exploring Proteases: MEROPS Peptidase Database Tutorial - [[Media:MEROPStutorial_PB.doc]]

=Glossary words (A - Z):=
[[#A| A ]] [[#B| B ]] [[#C| C ]] [[#D| D ]] [[#E| E ]] [[#F| F ]] [[#G| G ]] [[#H| H ]] [[#I| I ]] [[#J| J ]] [[#K| K ]] [[#L| L ]] [[#M| M ]] [[#N| N ]] [[#O| O ]] [[#P| P ]] [[#Q| Q ]] [[#R| R ]] [[#S| S ]] [[#T| T ]] [[#U| U ]] [[#V| V ]] [[#W| W ]] [[#X| X ]] [[#Y| Y ]] [[#Z| Z ]]

== A ==
'''Accession Number''' - a unique identifier given to DNA and protein sequences to allow for tracking of sequence information within a single database [http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)] (Will).

'''Antisense (RNA or DNA)'''-a piece of DNA or RNA that binds to a complementary sequence of DNA or RNA. These segments of genetic material can be used to identify the existence of a disease gene and they can also be used to bind to specific DNA or mRNA sequences to inhibit their function ([http://biotech.fyicenter.com/glossary/Bioinformatics_Glossary.html 5] Pallavi).

'''Arabidopsis thaliana''' - the scientific name for the thale cress plant; it was the first plant to have its genome sequenced, and is a model organism for understanding plant biology and genetics ([http://en.wikipedia.org/wiki/Thale_cress Wikipedia.org], Jay)

== B ==
'''BAC''' - bacterial artificial chromosome, a DNA construct used for transforming or cloning segments of DNA and often used to sequence the genetic code of organisms ([http://en.wikipedia.org/wiki/Bacterial_artificial_chromosome Wikipedia.org], Jay)

'''bioinformatics''' - the multi-disciplinary approach of using biology, computer science and mathematics to solve or better understand biological problems [http://en.wikipedia.org/wiki/Bioinformatics] (Matt)

'''BLAST''' - (Basic Local Alignment Search Tool) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. [http://blast.ncbi.nlm.nih.gov/Blast.cgi] (Mary)

'''bioperl'''- a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications such as accessing sequence data from local and remote databases, converting between database formats, manipulating individual sequences, searching for similar sequences, searching for genes and other structures on genomic DNA, or developing machine-readable sequence annotations. [http://en.wikipedia.org/wiki/BioPerl] (Wikipedia, Max Win)

== C ==
'''carbon fixation''' - using carbon dioxide to create organic materials [http://en.wikipedia.org/wiki/Carbon_fixation] (Samantha) 

'''CDD''' (Conserved Domains Database)- a database used to identify the conserved domains present in a protein query sequence [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml] (Mary)

'''chaperonin''' - a protein complex that assists some newly formed polypeptide chains by folding them into their final, functional, three-dimensional form [http://en.wikipedia.org/wiki/Chaperonins] (Matt)

'''chemotaxis''' - the process in which cells seek out or flee from a high concentration of certain chemicals; it is found in both uni- and multicellular organisms. Unicellular organisms use this process to avoid toxins or find food, and multicellular organisms use it for tasks such as reproduction [http://en.wikipedia.org/wiki/Chemotaxis] (Nick)

'''chemotaxonomy''' - the attempt to classify and identify organisms according to demonstrable differences and similarities in their biochemical compositions [http://en.wikipedia.org/wiki/Chemotaxonomy] (Mary)

'''ClustalW''' - A web-based or command line tool that performs multiple sequence alignments to determine evolutionary relationships between three or more sequences [http://en.wikipedia.org/wiki/Clustal] (Will).

'''COG''' (Cluster of Orthologous Groups)- corresponds to a highly conserved domain and generally consists of either individual proteins or groups of paralogs ([http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml COG] Pallavi) 

'''concatemer''' - long continuous DNA molecule that contains the same DNA sequence repeated in series [http://en.wikipedia.org/wiki/Concatemer](Samantha) 

'''contigs''' (contiguous DNA)- overlapping DNA segments that as a collection form a longer and gapless segment of DNA. (Discovery Genomics, Proteomics and Bioinformatics [http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''coverage''' - refers to the number of times, on average, any piece of DNA in a sequenced genome has been individually sequenced (Lecture, Jay)
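
As a rough illustration of the coverage definition above, average coverage can be estimated as total sequenced bases divided by genome length. This is a minimal sketch under that simplifying assumption (uniform read length, no filtering); the function name and the read numbers in the example are hypothetical, and only the 3.1 Mbp genome size comes from this page.

```python
def average_coverage(num_reads: int, read_length_bp: int, genome_length_bp: int) -> float:
    """Estimate average coverage as (reads x read length) / genome length."""
    if genome_length_bp <= 0:
        raise ValueError("genome length must be positive")
    return num_reads * read_length_bp / genome_length_bp


if __name__ == "__main__":
    # Hypothetical run: 40,000 reads of 700 bp against a 3.1 Mbp genome.
    print(round(average_coverage(40_000, 700, 3_100_000), 2))
```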

'''CPAN (Comprehensive Perl Archive Network)''' - an archive of over 12,200 modules of software written in Perl, as well as documentation for it. It contains a module called CPAN (or CPAN.pm) which is used as an installer for Perl modules such as BioPerl [http://en.wikipedia.org/wiki/CPAN](Will).

'''Cytogenetics'''-the study of normal and abnormal chromosomes. This involves studying the causes of chromosomal abnormalities and looking at the structure of chromosomes ([http://www.vivo.colostate.edu/hbooks/genetics/medgen/chromo/index.html 7] Pallavi).

== D ==
'''''de novo'' synthesis''' - the synthesis of complex molecules from simple molecules (e.g. sugars and nucleotides), rather than from recycled molecules; from the Latin "of the new" [http://en.wikipedia.org/wiki/De_novo_synthesis] (Matt)

'''dehydrogenase''' - a type of enzyme that oxidizes a substrate by transferring one or more protons and a pair of electrons to an acceptor. [http://en.wikipedia.org/wiki/Dehydrogenase] (Peter)

'''diatom''' - a major group of eukaryotic algae, and one of the most common types of phytoplankton. A characteristic feature of diatom cells is that they are encased within a unique cell wall made of silica called a frustule. These frustules show a wide diversity in form, but usually consist of two asymmetrical sides with a split between them. [http://en.wikipedia.org/wiki/Diatom] (Mary)

'''domain (protein)''' - the structural and functional groups of a protein, which can exist independently of the protein itself. Domains typically perform a specific function, such as binding to promoters or substrates, and many proteins can have one or several domains in common. Evolutionarily-linked proteins are more likely to have domains in common. Domains are used to organize proteins into families. ([http://en.wikipedia.org/wiki/Domain_(protein) Wikipedia article], Laura)

'''dot plot'''-graphical display comparing sequence conservation between two genomes with dots indicating strings of identical bases. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
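
A dot plot as defined above can be sketched by recording every pair of positions where the two sequences share an identical string of k bases. The code below is a simplified illustration, not the method of any particular tool; the function name and the choice of exact k-mer matching are assumptions.

```python
from collections import defaultdict


def dot_plot(seq1: str, seq2: str, k: int = 3):
    """Return sorted (i, j) pairs where seq1[i:i+k] equals seq2[j:j+k]."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Index every k-mer in seq2 by its start positions.
    positions = defaultdict(list)
    for j in range(len(seq2) - k + 1):
        positions[seq2[j:j + k]].append(j)
    # Look up each k-mer of seq1 in that index.
    dots = []
    for i in range(len(seq1) - k + 1):
        for j in positions.get(seq1[i:i + k], []):
            dots.append((i, j))
    return sorted(dots)


if __name__ == "__main__":
    for i, j in dot_plot("ACGTACGT", "TACGTT", k=4):
        print(i, j)
```

Plotting the returned pairs as points gives the familiar picture: conserved regions show up as diagonal runs of dots.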

== E ==

'''EC number''' (Enzyme Commission Number)- a numerical classification scheme for enzymes, based on the chemical reactions they catalyze [http://en.wikipedia.org/wiki/EC_number] (Mary)

'''E-value''' (Expect value)- When performing a BLAST search, you will obtain an E-value for each sequence that is retrieved. An E-value can be thought of as the number of matches with an equal or better score that would be expected to occur by chance in a database of that size; the closer it is to zero, the less likely the similarity is due to chance. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
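
For intuition about how an E-value scales, BLAST statistics relate it to the bit score S' by E = m × n × 2^(−S'), where m is the query length and n is the database length. The sketch below applies that relation directly; it uses raw lengths rather than the edge-corrected effective lengths BLAST reports, so treat its output as an approximation. The function name is illustrative.

```python
def expect_value(bit_score: float, query_length: int, database_length: int) -> float:
    """Approximate a BLAST E-value from a bit score: E = m * n * 2**(-S')."""
    if query_length <= 0 or database_length <= 0:
        raise ValueError("lengths must be positive")
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Each extra bit of score halves the expected number of chance hits.
    print(expect_value(40, 300, 3_100_000))
    print(expect_value(41, 300, 3_100_000))
```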

'''Extremophile''' - an organism that thrives in and may even require physically or geochemically extreme conditions that are detrimental to the majority of life on Earth [http://en.wikipedia.org/wiki/Extremophile] (Will).

== F ==

'''FASTA format''' - a text format used to convey either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented by single-letter codes. The sequence name and other descriptors precede the sequence on a header line that begins with ">". [http://en.wikipedia.org/wiki/FASTA_format] (Nick) 
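
A minimal sketch of reading FASTA text, assuming the common convention that each record starts with a ">" header line followed by one or more sequence lines. The function name is illustrative; for real work a maintained parser such as the ones in BioPerl or Biopython is the safer choice.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


if __name__ == "__main__":
    example = ">gene1 hypothetical protein\nATGAAA\nTTTTAG\n>gene2\nATGCCCTGA\n"
    for name, seq in parse_fasta(example).items():
        print(name, len(seq))
```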

'''family (protein)''' - a group of evolutionarily-related proteins, often with one or several domains in common. Families are organized by domain overlap, structural/functional similarity, and sequence similarity. ([http://en.wikipedia.org/wiki/Protein_family Wikipedia article] and lecture, Laura)

'''finished genome''' - a genome that has been sequenced at least partly by hand, resulting in at least 99.99% sequence accuracy (Lecture, Jay) 

'''fusion mRNA'''-mRNA that results from the transcription of a gene after a chromosomal translocation event. This results in an mRNA sequence that comes from two different genes (Rowley and Blumenthal 2008 ''Science'' Pallavi)

== G ==

'''GC Content''' - the percentage of bases within a certain sequence of DNA (e.g. a gene or a genome) that are either guanine or cytosine; a higher GC content is characteristic of a coding region of a gene; differences in GC content between a gene and a genome can be used as evidence for horizontal gene transfer [http://en.wikipedia.org/wiki/GC-content] (Matt) 
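
The percentage described above is straightforward to compute. This is a small sketch that counts only unambiguous A, C, G and T bases; ignoring other symbols (such as N) is a choice made here for illustration, and the function name is hypothetical.

```python
def gc_content(sequence: str) -> float:
    """Return the percentage of A/C/G/T bases in `sequence` that are G or C."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * gc / (gc + at)


if __name__ == "__main__":
    print(gc_content("ATGGCGCGTTAA"))
```

Running the same function on a single gene and on the whole genome gives the two numbers that would be compared when looking for evidence of horizontal gene transfer.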

'''GC-skew''' – uneven distribution of guanine and cytosine bases between the two strands of DNA where GC base pairs occur. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
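
GC skew is commonly quantified per window as (G − C) / (G + C) along one strand, and a sign change in the skew is often used as a hint for locating an origin of replication. The sketch below computes that ratio over non-overlapping windows; the window scheme and function name are assumptions made for illustration.

```python
def gc_skew_windows(sequence: str, window: int):
    """Return (G - C) / (G + C) for each non-overlapping window of the sequence."""
    if window < 1:
        raise ValueError("window must be at least 1")
    seq = sequence.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # A window with no G or C has undefined skew; report it as 0.0.
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


if __name__ == "__main__":
    print(gc_skew_windows("GGGCATGCCCAT", 6))
```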

'''gene amplification''' - production of multiple copies of a gene in order to amplify the amount of protein that the gene encodes for [http://www.medterms.com/script/main/art.asp?articlekey=13537] [http://www.answers.com/topic/gene-amplification] (Matt)

'''gene fusion'''-occurs when DNA segments of two different genes come together. Can result in hybrid proteins ([http://www.biochem.northwestern.edu/holmgren/Glossary/Definitions/Def-G/gene_fusion.html 9] Pallavi)

'''gene knockout''' - a process in which a gene is deactivated within a test organism in order to better understand the function of the gene in that organism [http://en.wikipedia.org/wiki/Gene_knockout] (Matt)

'''gene ontology'''- a collaborative effort of investigators to unify and standardize terms associated with the role a gene or protein plays in an organism. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''glaucophyte''' - freshwater algae that have not been studied well [http://en.wikipedia.org/wiki/Glaucophyte](Samantha) 

== H ==

'''haemolysin or hemolysin''' - a chemical produced by a bacterium that causes lysis of red blood cells [http://en.wikipedia.org/wiki/Hemolysis_(microbiology)] (Nick)

'''halophile''' - an organism, most often of the Archaea domain, that lives in environments containing high concentrations of salt [http://en.wikipedia.org/wiki/Halophile] (Matt)

'''haplotype'''-collection of alleles that travel together (Lecture, Pallavi)

'''haptophyte''' - phylum of algae [http://en.wikipedia.org/wiki/Haptophyte](Samantha)

'''heterokont''' - major line of eukaryotes consisting of about 10,500 known species, most of which are algae [http://en.wikipedia.org/wiki/Heterokont](Samantha)

'''Hidden Markov Model''' - a statistical model used in protein recognition databases such as Pfam. A Hidden Markov Model keeps track of several variables and possible variations thereof, such as the possible amino acid sequences that make up a protein domain (since there can be some variance in an amino acid sequence) or the variations in the component sounds that make up a word, and uses those points to match a given sequence to the word, domain, or other complex sequence it most closely matches. An HMM in speech recognition software, for example, can identify that a certain set of sounds make up a certain word, even with the variations in pronunciation and accent that different people will give those sounds. ([http://en.wikipedia.org/wiki/Hidden_Markov_Model Wikipedia] and lecture, Laura)

'''HMM Logo''' - a graphical representation of an HMM, detailing the possible amino acid sequences, the relative frequencies and probabilities of each amino acid in the sequence, the relative contribution each amino acid has to the overall protein family, and the charge or nature of the amino acids themselves. ([http://www.sanger.ac.uk/Software/analysis/logomat-m/help.shtml How to read HMM Logos, on Pfam], Laura)

'''homeobox''' - DNA sequence within transcription factor genes that allow the cell to respond to patterns of development by having the transcription factors switch on gene cascades [http://en.wikipedia.org/wiki/Homeobox](Samantha)

'''homodimer''' - a protein made of paired identical polypeptides ([http://www.answers.com/topic/homodimer Answers.com], Jay)

'''horizontal gene transfer'''-DNA transmission between species and incorporation of the DNA into the recipient's genome ([http://www.csrees.usda.gov/nea/biotech/res/biotechnology_res_glossary.html horizontal gene transfer] Pallavi)

'''''Hox'' gene'''-a gene that contains a homeobox region that is involved in morphogenesis along the cranio-caudal body axis ([http://www.uprightape.net/UA_Glossary.html 4] Pallavi)

'''hydrolase''' - an enzyme that catalyzes hydrolysis, the cleavage of a chemical bond by the addition of water, splitting a larger molecule into smaller ones [http://en.wikipedia.org/wiki/Hydrolase] (Nick)

== I ==

'''ideogram''' - in genomics, usually describes a stylized representation of a chromosome with banding patterns (Campbell-Heyer Genomics textbook, Jay)

'''identities''' - in a BLAST output, the number and fraction of total residues which are identical in a given alignment [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)
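
To make the "number and fraction" concrete, the sketch below counts identical columns in a pairwise alignment given as two equal-length strings with "-" marking gaps. It is an illustration of the idea only, not BLAST's own code; the function name and the gap convention are assumptions.

```python
def identities(aligned_query: str, aligned_subject: str):
    """Return (identical_columns, alignment_length) for two aligned strings."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(
        1
        for q, s in zip(aligned_query.upper(), aligned_subject.upper())
        if q == s and q != "-"  # a gap never counts as an identity
    )
    return matches, len(aligned_query)


if __name__ == "__main__":
    same, total = identities("MKT-LLV", "MKSALLV")
    print(f"Identities = {same}/{total} ({100 * same // total}%)")
```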

'''indole'''-a chemical compound that is produced from the break down of tryptophan ([http://medical-dictionary.thefreedictionary.com/indole indole] Pallavi)

'''inclusion body''' - Inclusion bodies are collections of stainable substances, usually proteins, that are found either in the nucleus or the cytoplasm. It is thought that these bodies are often the result of viral proteins that misfolded [http://en.wikipedia.org/wiki/Inclusion_body] (Nick)

'''intron''' - a region of DNA in a gene that is not part of the final coding sequence for the protein. [http://en.wikipedia.org/wiki/Intron] (Peter)

'''isoelectric point''' - the pH at which a molecule carries no net electrical charge [http://en.wikipedia.org/wiki/Isoelectric_point] (Nick)

'''isozymes''' - members of a gene family with very similar cellular roles (Campbell-Heyer Genomics textbook, Jay)

== J ==

== K ==
'''KEGG (Kyoto Encyclopedia of Genes and Genomes)''' - a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The Pathway database records networks of molecular interactions in the cells, and variants of them specific to particular organisms [http://en.wikipedia.org/wiki/KEGG](Will).

'''kinase''' - a type of enzyme that transfers a phosphate group from a high-energy donor molecule to a target molecule in a process called phosphorylation. [http://en.wikipedia.org/wiki/Kinase] (Peter)

== L ==

== M ==
'''Manatee''' - a web-based gene evaluation and genome annotation tool that can view, modify, and store annotation for prokaryotic and eukaryotic genomes. This ongoing, open-source initiative was developed with two missions: first, to let biologists functionally annotate their genomes using a powerful, stand-alone web application with a robustly designed relational annotation database; and second, to give outside developers the opportunity to contribute their own ideas and requirements to enhance Manatee's ability to accomplish biological goals [http://manatee.sourceforge.net/](Will). 

'''microsatellites'''-stretches of repetitive, short DNA segments that can be used to track the inheritance of certain traits within families ([http://www.clanlindsay.com/genetic_dna_glossary.htm 3] Pallavi)

'''minisatellites'''-segments of DNA that can be used for individual identification (ex. DNA fingerprinting) or in determining relationships between people (ex. paternity cases) ([http://www.clanlindsay.com/genetic_dna_glossary.htm 2] Pallavi).

'''motif''' - a sequence of amino acids or nucleotides that performs a particular role and is often conserved in other species or molecules. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''mycoplasma''' - genus of bacteria that lack a cell wall [http://en.wikipedia.org/wiki/Mycoplasma] (Nick)

== N ==

'''NORFs''' (nonannotated open reading frames) - open reading frames that were considered not to be real genes when the genome was annotated. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''nucleomorph''' - reduced eukaryotic nuclei found in plastids [http://en.wikipedia.org/wiki/Nucleomorph](Samantha)

== O ==
'''object-oriented programming''' - a programming paradigm in which collections of data, associated with operations on that data, are modularly defined and then built upon (CSC 121 Lecture, Will).

'''open reading frame (ORF)'''-a segment of DNA that can potentially encode a protein; it begins with a start codon (usually ATG) and ends at an in-frame stop codon [http://www.fao.org/DOCREP/003/X3910E/X3910E18.htm ORF] (Pallavi)
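
As a rough illustration of the ORF definition above, the following Python sketch scans the three forward reading frames for an ATG followed by an in-frame stop codon. The function name <code>find_orfs</code>, the <code>min_codons</code> cutoff, and the choice to ignore the reverse strand and alternative start codons are simplifying assumptions for this example, not how RAST, JGI, or Manatee call genes.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return sorted (start, end) pairs for forward-strand ORFs.

    start is the index of the A in ATG; end is one past the stop codon.
    min_codons counts the start codon plus coding codons (stop excluded).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == START:
                # Walk codon by codon until the first in-frame stop.
                for j in range(i + 3, len(dna) - 2, 3):
                    if dna[j:j + 3] in STOPS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3))
                        i = j  # continue scanning after this stop codon
                        break
            i += 3
    return sorted(orfs)

if __name__ == "__main__":
    print(find_orfs("CCATGAAATGATAGCC"))
```

A real gene caller would also check the reverse complement, allow alternative start codons, and apply a much larger minimum length.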

'''operon''' - a segment of DNA involving an operator, promoter, and one or more genes that operate as a single unit during transcription [http://en.wikipedia.org/wiki/Operon] (Nick)

'''optical mapping'''-DNA sequences of the organism in question are compared against a karyotype that specifically looks at restriction sites found within the DNA to correctly order the DNA sequences on a chromosome. This methodology gives very detailed haplotype information and allows for the detection of sequence variations across an entire genome [http://www.geocities.com/bioinformaticsweb/genomicglossary.html optical mapping] (Pallavi)

'''ortholog'''-homologous genes in different species that descended from a single gene in their last common ancestor; they diverged by speciation and usually have very similar sequences (Lecture, Pallavi)

'''oxidoreductase''' - an enzyme that catalyzes redox reactions by transferring electrons from one molecule (the reductant) to another (the oxidant) [http://en.wikipedia.org/wiki/Oxidoreductase] (Nick)

== P ==

'''paralog'''-homologous genes within the same genome that arose by gene duplication; their sequences are similar but not necessarily identical (Lecture, Pallavi)

'''p-arm''' - the shorter arm of a chromosome's two arms separated by the centromere (compare to q-arm, the longer arm) ([http://www.medterms.com/script/main/art.asp?articlekey=4715 MedTerms Dictionary], Jay)

'''Perl''' - Developed by Larry Wall in 1987, Perl is a [http://en.wikipedia.org/wiki/High-level_programming_language high-level programming language] used frequently by biologists and bioinformaticists [http://en.wikipedia.org/wiki/Perl] (Will).

'''periplasmic space''' - the space between the inner cytoplasmic membrane and external outer membrane in bacteria or archaea. [http://en.wikipedia.org/wiki/Periplasmic_space] (Peter)

'''Pfam''' - a database for protein domain families that matches amino acid sequences or nucleotide sequences to the related group of proteins to which they most likely belong. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura)

'''plasmid''' - an extra-chromosomal DNA molecule that is capable of replicating independently of the chromosomal DNA. Commonly found in bacteria and archaea. [http://en.wikipedia.org/wiki/Plasmid](Peter)

'''plastid''' - major organelles in plants or algae [http://en.wikipedia.org/wiki/Plastid](Samantha)

'''pleomorphism''' - the occurrence of two or more structural forms during a life cycle [http://en.wikipedia.org/wiki/Pleomorphism] (Mary)

'''phylogenetic tree''' - a diagram showing the evolutionary relationships between biological species that are thought to share a common ancestor [http://en.wikipedia.org/wiki/Phylogenetic_tree] (Nick)

'''phylotypes''' – a term intended to resolve the challenge of “species” when classifying prokaryotes using DNA sequence comparisons. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''positives''' - in a BLAST output, the number and fraction of residues for which the alignment scores have positive rather than negative values [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)

'''proteome''' - entire set of proteins expressed by a genome, cell, tissue, or organism. It may refer to expressed proteins under certain conditions [http://en.wikipedia.org/wiki/Proteome](Samantha)

'''PSORT''' - a prediction server that judges where a mature protein could be in the cell, based on its transmembrane domains, its predicted mature amino acid composition, and its signal sequences. ([http://psort.ims.u-tokyo.ac.jp/form.html PSORT], Laura)

'''pseudogenes'''-sequences of DNA that look like genes but most likely contain many stop codons. A pseudogene may have evolved away from a real gene, or a paralog might have taken its place (Lecture, Pallavi)

'''purine''' - a category of nitrogenous base consisting of a pyrimidine ring fused to an imidazole ring. Notable purine bases are adenine and guanine. [http://en.wikipedia.org/wiki/Purine] (Peter)

'''pyrimidine''' - a category of nitrogenous base consisting of a heterocyclic aromatic ring containing two nitrogen atoms at positions 1 and 3 of the six-member ring. Notable pyrimidine bases are cytosine, thymine, and uracil. [http://en.wikipedia.org/wiki/Pyrimidine] (Peter)

== Q ==

'''q-arm''' - the longer arm of a chromosome's two arms separated by the centromere (compare to p-arm, the shorter arm) ([http://www.medterms.com/script/main/art.asp?articlekey=5152 MedTerms Dictionary], Jay) 

'''query sequence''' - the sequence (whether amino acid or nucleotide) entered into a database’s search function and checked against the database entries. ([http://en.wikipedia.org/wiki/BLAST BLAST on Wikipedia], Laura)

== R ==

'''RAST''' - (Rapid Annotation using Subsystem Technology)- a fully-automated service for annotating bacterial and archaeal genomes. It provides high quality genome annotations for these genomes across the whole phylogenetic tree. ([http://rast.nmpdr.org/], Max Win)

'''rDNA'''-These are DNA sequences that encode for ribosomal RNA. Note that rDNA can also stand for recombinant DNA. ([http://en.wikipedia.org/wiki/Ribosomal_DNA rDNA] Pallavi)

'''residue (protein)''' - the remaining portion of an amino acid after a water molecule has been removed and it has been incorporated into a protein. Functional residues, referred to in Pfam, are the residues that perform some specific identifiable function or are part of a domain, and can be conserved across evolutionarily-related proteins. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura) 

'''retropseudogenes'''-these are genes that have been reverse-transcribed from mRNA and the resulting DNA sequence is incorporated back into the genome. They are non-functional segments of DNA and can be distinguished from pseudogenes in that they do not have intron sequences. ([http://genome.cshlp.org/cgi/content/full/10/5/672 1] Pallavi)

'''retrotransposons''' - RNA transcribed back into DNA and added into the genome [http://en.wikipedia.org/wiki/Retrotransposon](Samantha)

'''ribonuclease''' - a nuclease that catalyzes the degradation of RNA into smaller components [http://en.wikipedia.org/wiki/Ribonuclease] (Mary)

== S ==
'''Serovar'''-a subdivision of a species based on the characteristics of their cell surface antigens ([http://www.biology-online.org/dictionary/Serovar serovar] Pallavi)

'''scaffold''' - a section of a sequenced genome composed of contigs that are in the right order but not necessarily connected ([http://www.medterms.com/script/main/art.asp?articlekey=25223 MedTerms Dictionary], Jay)

'''"Shadow enhancers"'''-secondary enhancers that are thought to be important for natural selection to occur in regulatory DNA segments. They evolve much faster than primary enhancers, which suggests that they are under fewer functional constraints (Wray and Babbit 2008 ''Science'' Pallavi)

'''Shine-Dalgarno sequence''' - A ribosomal binding site on an mRNA, usually a sequence of about six bases located six or seven bases upstream of the start codon. An anti-Shine-Dalgarno sequence exists on the rRNA in the small subunit of the ribosome; when the two sequences pair, the mRNA is lined up and prepared for translation. (Lecture and [http://en.wikipedia.org/wiki/Shine-dalgarno Wikipedia article], Laura) 
Note: The Shine-Dalgarno consensus sequence for our genome is ccGGAGGt.
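
A minimal Python sketch of how one might look for a Shine-Dalgarno-like motif upstream of a candidate start codon. The core <code>GGAGG</code> is taken from the consensus in the note above; the function name, the 20-base search window, and the exact-match requirement are assumptions made for illustration only.

```python
def find_sd_upstream(seq, start_index, core="GGAGG", window=20):
    """Return the spacer length between the SD core and the start codon.

    seq is the DNA-sense sequence, start_index is the position of the
    first base of the start codon. Returns None if the core is absent
    from the upstream window.
    """
    seq = seq.upper()
    lo = max(0, start_index - window)
    upstream = seq[lo:start_index]
    pos = upstream.rfind(core)  # closest occurrence to the start codon
    if pos == -1:
        return None
    return len(upstream) - (pos + len(core))

if __name__ == "__main__":
    example = "TTCCGGAGGTAACGATATG"  # made-up sequence; ATG starts at index 16
    print(find_sd_upstream(example, 16))
```

Real ribosomal binding sites rarely match the consensus exactly, so a scoring matrix or a mismatch-tolerant search would be more realistic.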

'''SignalP''' - a prediction server that judges whether or not a query protein contains a signal peptide. SignalP measures each amino acid against the amino acid sequences of probable signal peptide matches and predicts the cleavage site of the signal peptide. ([http://www.cbs.dtu.dk/services/SignalP-3.0/output.php SignalP Output explained], Laura)

'''signal peptide''' - a short peptide chain that directs the post-translational transport of a protein [http://en.wikipedia.org/wiki/Signal_peptide] (Matt)

'''Smith-Waterman alignment''' - A well-known algorithm for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure [http://en.wikipedia.org/wiki/Smith_waterman](Will).
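
The local-alignment idea can be sketched in a few lines of Python. This version only returns the best local score (no traceback), and the scoring values (match +2, mismatch -1, linear gap -2) are arbitrary assumptions chosen for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            # The floor of 0 lets an alignment restart anywhere,
            # which is what makes the alignment local.
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best

if __name__ == "__main__":
    print(smith_waterman_score("TTTACGTTTT", "GGACGGG"))
```

Recovering the aligned segments themselves would require a traceback from the highest-scoring cell back to the first cell with score 0.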

'''SNP (Single Nucleotide Polymorphism)''' - a DNA sequence variation occurring when a single nucleotide in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual) [http://en.wikipedia.org/wiki/Single_nucleotide_polymorphism](Will).

'''symporter''' - an integral membrane protein that is involved in movement of two or more different molecules or ions across a phospholipid membrane in the same direction. [http://en.wikipedia.org/wiki/Symporter] (Peter)

'''synteny''' - a neologism from the Greek for "on the same ribbon". Genes that are syntenic in one species are on the same chromosome; genes that are syntenic across species retain the same order on respective chromosomes as a result of descent from a common ancestor ([http://www.answers.com/synteny Answers.com], Jay)

'''synthetase''' - a type of enzyme that creates a new covalent bond and requires direct input of energy from a high-energy phosphate. [http://books.google.com/books?id=bB8XnCykRmIC&pg=PA522&lpg=PA522&dq=%22synthetase+is+an+enzyme%22&source=web&ots=wkws4ksMsg&sig=zWLkDIk7T78hcf9S84nWs3u5Apw&hl=en&sa=X&oi=book_result&resnum=9&ct=result] (Peter)

== T ==
'''transferase''' - an enzyme that catalyzes the transfer of a functional group from one molecule (the donor) to another (the acceptor) [http://en.wikipedia.org/wiki/Transferase] (Matt)

'''transmembrane helix''' - a single transmembrane alpha helix of a transmembrane protein, usually about twenty amino acids in length. They are usually predicted by hydrophobicity. [http://en.wikipedia.org/wiki/Transmembrane_domain](Mary)

'''transposons / transposable elements''' - DNA sequences that can move around to different positions in a single cell's genome. Transposons can cause mutations and change the length of the genome. [http://en.wikipedia.org/wiki/Transposon](Samantha)

'''Transposon Mutagenesis'''-a procedure in which a transposon is inserted into a gene, which inactivates the gene and can lead to the discovery of the phenotype associated with this gene ([http://cancerweb.ncl.ac.uk/cgi-bin/omd?transposon+mutagenesis transposon mutagenesis] Pallavi)

'''Trans-splicing'''-fragmented exon sequences fuse to form a mature species of mRNA. This process results in fusion mRNA ([http://www.representinggenes.org/Glossary.html 8] Pallavi).

'''tRNA splicing endonuclease''' - an enzyme that cleaves intervening sequences of precursor tRNA. [http://cancerweb.ncl.ac.uk/cgi-bin/omd?splicing+endonuclease] (Peter) 

== U ==

== V ==
'''Vertical gene transfer'''-the transmission or absorption of genetic material that is associated with sexual reproduction and, thus, acknowledges species-specific boundaries ([http://www.gmo-compass.org/eng/glossary/#G 6] Pallavi)

== W ==

'''whole genome shotgun sequencing''' - a method of sequencing where DNA is cut into small pieces and cloned into vectors, then about 500 bp is sequenced from each end of every cloned insert to form mate pairs. Mate pairs rarely overlap, but are used to reassemble the sequence using software. [http://en.wikipedia.org/wiki/Whole_genome_shotgun](Samantha)
 

== X ==
'''xenolog''' - homologs that are created by horizontal gene transfer between two different species [http://en.wikipedia.org/wiki/Xenolog#Xenology] (Matt) 

== Y ==

== Z ==

Halorhabdus utahensis Genome

2008-11-20T04:59:09Z

SaSimpson: /* My Favorite Pathways */

This page will be used by Davidson College students in the [http://www.bio.davidson.edu/Courses/Bio343/LabMethods.html Genomics Laboratory course].
__NOTOC__
== Links to Multiple Databases ==
*[http://imgweb.jgi-psf.org/cgi-bin/img_edu_v260/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2500575004 JGI IMG EDU] public access *[[Media:JGIAnnotation.xls|JGI Annotation Excel Spreadsheet]]
*[http://www.tigr.org/tigr-scripts/prok_manatee/shared/login.cgi Manatee at JCVI] use the davidson number sent by email as username and password (database is nthu01 - this is case sensitive) *[[Media:ManateeAnnotation.xls|Manatee Annotation Excel Spreadsheet]]
*[http://rast.nmpdr.org/ SEED view via RAST] use the username and password combination sent to you by SEED *[[Media:RastAnnotation.xls|RAST Annotation Excel Spreadsheet]] *[http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=18261238 RAST Publication in PubMed]
*[http://www.genome.jp/kegg/kaas/ KEGG] We can submit our genes to KEGG to have it mapped out, but SEED and Manatee may already do this. Do we want to ask them to upload it into their database? 
*[http://wishart.biology.ualberta.ca/basys/cache/135af8726ad6f61ec4c5f1e9c4d139ac/index.html BASYs] 
*[http://gcat.davidson.edu/Registry/compare/ Pairwise comparisons of All Three Annotations]

[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI_5contigs.txt JGI Full genome, 5 separate contigs & 3.1 Mbp, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.txt JGI gene DNA sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.xls JGI gene annotations, Excel] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_proteins.txt JGI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_merged.txt JCVI Full genome, 5 contigs fused, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_ORFs.txt JCVI gene sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_proteins.txt JCVI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/GeneLengths.xls 3-way comparison, Excel] 
[[Venn_diagrams]] Venn diagram of 3-way comparison 

== RNA Genes ==

*[[tRNA Genes Check List]] 
*[[rRNA operon]] 
*[[2 misc. RNA genes]] (short summary list) 
*[[Missing tRNA-trp gene found]] 

== Other Resources ==
*[[Consensus Shine Dalgarno]] Excel File for ''H. utahensis'' 
*[[References]] 
*[[Gene Annotation Template]] 
*[[General Questions]] 
*[[Page for Annotated Genes]] 
*[http://www.bio.davidson.edu/courses/genomics/2008/Win/ec/ Search EC number in RAST, JGI or Manatee] 
*[http://gcat.davidson.edu/Wideloache/Webfiles/ecNumBlast.html Blast an EC number against the H. utahensis genome] 
*[http://gcat.davidson.edu/Wideloache/Webfiles/AnnotationSearcher.html Perform a text-based search of the Rast, JGI, and Manatee protein calls] 

== Data ==
*[[Alternative Start Codons]] 
*[[Gene Length Histograms|Gene Length Comparison]] 
*[[Venn_diagrams]] 
*[[Shine Dalgarno Sequence Logo ]] 

== Research Questions ==
#How do the three systems compare for finding ORFs and RNA genes?
#Is there a pattern of missed genes for any of the 3 sites?
#Do the three systems differ in their ability to find good start codons and Shine-Dalgarno sequences? [We need a standard set of genes for comparison. Only highly conserved or a range of genes?]
# Were Shine-Dalgarno sequences calculated for our species or default values used? If default, what sequence?
#Can we fill any holes in their automated annotation? Is there a mechanism for users to add in genes?
#How do the 3 sites compare for ease of use?
#What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working?
#How does each of the 3 sites compare for pathway detection and visualization?
#Do they find the origin of replication? Can we find it?

* How do the 3 systems compare when one calls a gene hypothetical and another calls it a functional protein? How can they vary, and who is getting it closer to correct (however you define that, possibly by date of matched entry)? (Pallavi and Mary)
* Why did one system call a gene when the other two did not? (Matt and Laura)
* How do the 3 sites compare for ease of use? What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working? (Samantha and Nick)
* Where is the origin of replication and did the 3 systems attempt to identify this?
* Did the 3 systems utilize Shine-Dalgarno sequences to help them call start codons? Did they utilize our species's consensus Shine-Dalgarno? (Peter)
* We need to fill in the [[Venn diagrams]] for our 3-way comparison. Let's compare the size of ORFs and generate a [[Gene Length Histograms|graph comparing the distributions]] for all 3. (Max and Will - they also take requests).

<hr>
=Our Favorites=
== My favorite genes==
*Pallavi - Monooxygenase vs. Peroxiredoxin [[Media:peroxiredoxinormonooxygenase.ppt]]

*Mary - JGI gene 2500588521 (922976...924046) [[Media:My favorite gene.ppt]]

*Max - [http://app.sliderocket.com/app/FullPlayer.aspx?id=f2058b94-845f-4a11-94eb-142f251a7fea JGI gene 2500587636 (2-1849)]

*Samantha - JGI gene 2500575882 (80504-80878) [[Media:Earl.ppt]]

*Nick - JGI gene 2300587691 (69942...72866) [[Media:Gene presentation.ppt]]

*Will - [http://gcat.davidson.edu/GcatWiki/images/e/e7/Halomucin.ppt JGI gene 2500590430 (2847205..2854335)]

*Jay - JGI gene 2500588397 (806410..807321) [http://www.bio.davidson.edu/courses/genomics/2008/McNair/Fav_Gene/FavoriteGenePresentation.pptx Co/Zn/Cd PowerPoint]

*Matt - Transcriptional Regulator nrdR (3109722..3110204 + 7274..7765) [[Media:BioFavoriteGeneNrdR.ppt]]

*Peter - tRNA intron endonuclease [[Media:TRNAtrpintronendonuclease.ppt]]

*Laura - 16S Small ribosomal subunit, JGI gene 2500590728 (2397347..2398825)

== My Favorite Pathways==
Pallavi - Carbohydrate Metabolism, specifically glycolysis/gluconeogenesis

Jay - Phosphotransferases

Will - [[Ribosomal Binding Site Conservation]]

Max - Energy

Samantha - Purine Metabolism!!! [[Media:Purines.ppt]]

Laura - Amino Acid Biosynthesis

Nick - Pentose Phosphate

Matt - Chitin Metabolism

Mary - Citric acid cycle

''Suggestions by Kjeld'' 
'''[[Cellulase]]''' by Pallavi 
I think it would be very interesting to look for genes involved in cellulose degradation: endocellulases, exocellulases (=cellobiohydrolases) and b-glucosidases.
Many cellulose degraders produce a range of each type. A cellulolytic system able to function at 4.6 M NaCl is an interesting one. We either did not observe cellulose degradation or did not look for it. However, these systems are normally inducible and you need to test several substrates and inducers. It would be nice to have a compilation of putative “cellulase” genes.
There are several good recent reviews on cellulases (also mentioning E.C. numbers and enzyme families) that your students could consult.

'''[[Chitinase]]''' by Matt 
Apparently you detected a chitinase, but according to our records it does not grow on N-acetyl-glucosamine, which is somewhat strange. It grows on glucose though.

'''[[Lipases]]''' by Mary 
Lipases (/esterases) would also be interesting to look for – some lipases have important industrial applications.

'''[[Amylases]]''' by Samantha 
We did not observe growth on starch. Did you find any “amylase-coding genes”?

'''[[Xylose (glucose) isomerase)]]''' by Nick 
An enzyme of great commercial value.

'''[[Amino acids]]''' led by Laura and assisted by Max, Jay, Nick and Samantha 
According to our records AX-2 is able to grow in a “defined medium”. This is at variance with your “holes” for synthesis of amino acids. However, there could have been some “carry over” of amino acids when inoculating a culture grown in complex medium (e.g. containing yeast extract). However, we are normally aware of this problem and do repeated culturing to dilute out potential growth factors present in yeast extract.

'''[[Proteases]]''' by Peter 
We did not detect protease activity – albeit only checking a few substrates.

'''[[Protein Export]]''' by Malcolm 
We need to know how these proteins might reach outside the cell which is where the food would be and thus the digestive enzymes or importers need to reach the outside world or the cell membrane.

'''[[Phage Proteins]]''' by Malcolm
search for phage

= Student-created tutorials: =
== Tutorials for Annotating Genomes ==

# Will DeLoache - [http://www.bio.davidson.edu/courses/genomics/2008/DeLoache/BioPerlTutorial/BioPerl.htm BioPerl Installation] 
# Max Win - [http://www.bio.davidson.edu/courses/genomics/2008/Win/home/perl.html Introduction to Perl for non-programmers (with step-by-step explanations, simple exercises and solutions)] 
# Pallavi - Conserved Domains Database (CDD) [[Media:CDDtutorial.doc]] 
# Mary - Protein Data Bank (PDB) [[Media:PDB Tutorial.doc]] 
# Laura Voss - Pfam Database [http://www.bio.davidson.edu/Courses/Bio343/Pfam_tutorial.doc Pfam Tutorial] 
# Samantha Simpson - [http://www.bio.davidson.edu/courses/genomics/2008/Simpson/Tutorial.html NCBI BLAST] 
# Peter Bakke - [[Media:ShineDalgarnoTutorial.doc]] 
# Jay McNair - [http://www.bio.davidson.edu/courses/genomics/2008/McNair/Origin_Tutorial/OriginTutorial.doc Origin of Replication Tutorial] 
# Nick Carney - Navigating the JGI Database [[Media:NavigatingJGItutorial.doc]] 
# Matt Lotz - SEED Viewer - [[Media:SEEDTutorial.doc]] 

== Pathway Tutorials==
[http://www.pathguide.org/ Pathguide] - a possible source of tutorials and extensive information

[http://www.bigre.ulb.ac.be/Users/didier/pathfinding/ Shortest Path Tool]
<hr>
*Pallavi: I will compare RAST and KEGG in pathway annotations and use Glycolysis/Gluconeogenesis as my example: [[Media:Pallavitutorial.doc]]

*Matt: WikiPathways [[Media:WikiPathwaysTutorial.doc]]

*Mary: ENZYME [[Media:ENZYME tutorial.doc]]

*Samantha: [http://www.bio.davidson.edu/courses/genomics/2008/Simpson/Tutorial2.html How To Determine EC Numbers] 

*Nick: Metacyc [[Media:MetaCyc tutorial.doc]]

*Max: [http://www.bio.davidson.edu/courses/genomics/2008/Win/kgml.html KGML How to color EC numbers in KEGG maps and view it in KGML graph editor] 

*Jay: [http://www.bio.davidson.edu/courses/genomics/2008/McNair/Pathways_Tutorial/SEED_Scenario_Paths.doc SEED Scenario Paths] (a tool to determine completeness of pathways)

*Laura: [http://www.bio.davidson.edu/Courses/Bio343/Pathway_Entrances_Exits.doc Pathway Entrances and Exits]

*Will: [http://www.bio.davidson.edu/courses/genomics/2008/DeLoache/LocalBlastTutorial/LocalBlast.html Running BLAST Locally]

*Peter: Exploring Proteases: MEROPS Peptidase Database Tutorial - [[Media:MEROPStutorial_PB.doc]]

=Glossary words (A - Z):=
[[#A| A ]] [[#B| B ]] [[#C| C ]] [[#D| D ]] [[#E| E ]] [[#F| F ]] [[#G| G ]] [[#H| H ]] [[#I| I ]] [[#J| J ]] [[#K| K ]] [[#L| L ]] [[#M| M ]] [[#N| N ]] [[#O| O ]] [[#P| P ]] [[#Q| Q ]] [[#R| R ]] [[#S| S ]] [[#T| T ]] [[#U| U ]] [[#V| V ]] [[#W| W ]] [[#X| X ]] [[#Y| Y ]] [[#Z| Z ]]

== A ==
'''Accession Number''' - a unique identifier given to DNA and protein sequences to allow for tracking of sequence information within a single database [http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)] (Will).

'''Antisense (RNA or DNA)'''-a piece of DNA or RNA that binds to a complementary sequence of DNA or RNA. These segments of genetic material can be used to identify the existence of a disease gene and they can also be used to bind to specific DNA or mRNA sequences to inhibit their function ([http://biotech.fyicenter.com/glossary/Bioinformatics_Glossary.html 5] Pallavi).

'''Arabidopsis thaliana''' - the scientific name for the thale cress plant; it was the first plant to have its genome sequenced, and is a model organism for understanding plant biology and genetics ([http://en.wikipedia.org/wiki/Thale_cress Wikipedia.org], Jay)

== B ==
'''BAC''' - bacterial artificial chromosome, a DNA construct used for transforming or cloning segments of DNA and often used to sequence the genetic code of organisms ([http://en.wikipedia.org/wiki/Bacterial_artificial_chromosome Wikipedia.org], Jay)

'''bioinformatics''' - the multi-disciplinary approach of using biology, computer science and mathematics to solve or better understand biological problems [http://en.wikipedia.org/wiki/Bioinformatics] (Matt)

'''BLAST''' - (Basic Local Alignment Search Tool) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. [http://blast.ncbi.nlm.nih.gov/Blast.cgi] (Mary)

'''bioperl'''- a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications such as accessing sequence data from local and remote databases, converting between database formats, manipulating individual sequences, searching for similar sequences, searching for genes and other structures on genomic DNA, or developing machine-readable sequence annotations. [http://en.wikipedia.org/wiki/BioPerl] (Wikipedia, Max Win)

== C ==
'''carbon fixation''' - using carbon dioxide to create organic materials [http://en.wikipedia.org/wiki/Carbon_fixation] (Samantha) 

'''CDD''' (Conserved Domains Database)- a database used to identify the conserved domains present in a protein query sequence [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml] (Mary)

'''chaperonin''' - a protein complex that assists some newly formed polypeptide chains by folding them into their final, functional, three-dimensional form [http://en.wikipedia.org/wiki/Chaperonins] (Matt)

'''chemotaxis''' - the process in which cells seek out or flee from a high concentration of certain chemicals; it is found in both uni- and multicellular organisms. This process is used to avoid toxins or find food in unicellular organisms, or for tasks such as reproduction in multicellular organisms [http://en.wikipedia.org/wiki/Chemotaxis] (Nick)

'''chemotaxonomy''' - the attempt to classify and identify organisms according to demonstrable differences and similarities in their biochemical compositions [http://en.wikipedia.org/wiki/Chemotaxonomy] (Mary)

'''ClustalW''' - A web-based or command line tool that performs multiple sequence alignments to determine evolutionary relationships between three or more sequences [http://en.wikipedia.org/wiki/Clustal] (Will).

'''COG''' (Cluster of Orthologous Groups)- corresponds to a highly conserved domain and generally consists of either individual proteins or groups of paralogs ([http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml COG] Pallavi) 

'''concatemer''' - long continuous DNA molecule that contains the same DNA sequence repeated in series [http://en.wikipedia.org/wiki/Concatemer](Samantha) 

'''contigs''' (contiguous DNA)- overlapping DNA segments that as a collection form a longer and gapless segment of DNA. (Discovery Genomics, Proteomics and Bioinformatics [http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''coverage''' - refers to the number of times, on average, any piece of DNA in a sequenced genome has been individually sequenced (Lecture, Jay)
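
As a worked example of the coverage definition above, average coverage can be estimated as (number of reads × read length) / genome length. In this Python sketch the genome length of 3.1 Mbp comes from the JGI contig listing on this page, while the read count and the 500 bp read length are made-up numbers for illustration.

```python
def average_coverage(num_reads, read_length, genome_length):
    """Average number of times each base is expected to be sequenced."""
    return num_reads * read_length / genome_length

if __name__ == "__main__":
    # Hypothetical run: 62,000 reads of 500 bp over a 3.1 Mbp genome.
    print(average_coverage(62000, 500, 3100000))
```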

'''CPAN (Comprehensive Perl Archive Network)''' - an archive of over 12,200 modules of software written in Perl, as well as documentation for it. It contains a module called CPAN (or CPAN.pm) which is used as an installer for Perl modules such as BioPerl [http://en.wikipedia.org/wiki/CPAN](Will).

'''Cytogenetics'''-the study of normal and abnormal chromosomes. This involves studying the causes of chromosomal abnormalities and looking at the structure of chromosomes ([http://www.vivo.colostate.edu/hbooks/genetics/medgen/chromo/index.html 7] Pallavi).

== D ==
'''''de novo'' synthesis''' - the synthesis of complex molecules from simple molecules (e.g. sugars and nucleotides), rather than from recycled molecules; from the Latin "of the new" [http://en.wikipedia.org/wiki/De_novo_synthesis] (Matt)

'''dehydrogenase''' - a type of enzyme that oxidizes a substrate by transferring one or more protons and a pair of electrons to an acceptor. [http://en.wikipedia.org/wiki/Dehydrogenase] (Peter)

'''diatom''' - a major group of eukaryotic algae, and one of the most common types of phytoplankton. A characteristic feature of diatom cells is that they are encased within a unique cell wall made of silica called a frustule. These frustules show a wide diversity in form, but usually consist of two asymmetrical sides with a split between them. [http://en.wikipedia.org/wiki/Diatom] (Mary)

'''domain (protein)''' - the structural and functional groups of a protein, which can exist independently of the protein itself. Domains typically perform a specific function, such as binding to promoters or substrates, and many proteins can have one or several domains in common. Evolutionarily-linked proteins are more likely to have domains in common. Domains are used to organize proteins into families. ([http://en.wikipedia.org/wiki/Domain_(protein) Wikipedia article], Laura)

'''dot plot'''-graphical display comparing sequence conservation between two genomes with dots indicating strings of identical bases. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
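
A small Python sketch of the dot plot idea: every position where a short word from one sequence exactly matches a word in the other becomes a dot. The function name <code>dot_plot</code> and the word size of 3 are assumptions for this example; genome-scale tools use longer words and smarter indexing.

```python
def dot_plot(seq_a, seq_b, word=3):
    """Return the set of (i, j) with seq_a[i:i+word] == seq_b[j:j+word]."""
    # Index every word of seq_b by its starting positions.
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)
    dots = set()
    for i in range(len(seq_a) - word + 1):
        for j in index.get(seq_a[i:i + word], []):
            dots.add((i, j))
    return dots

if __name__ == "__main__":
    a, b = "ACGTACGGT", "TTACGTAC"
    dots = dot_plot(a, b)
    # Crude text rendering: rows follow seq_a, columns follow seq_b.
    for i in range(len(a)):
        print("".join("*" if (i, j) in dots else "." for j in range(len(b))))
```

Conserved stretches show up as diagonal runs of dots; rearrangements show up as breaks or shifts in the diagonal.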

== E ==

'''EC number''' (Enzyme Commission Number)- a numerical classification scheme for enzymes, based on the chemical reactions they catalyze [http://en.wikipedia.org/wiki/EC_number] (Mary)

'''E-value''' (Expect value)- When performing a BLAST search, you will obtain an E-value for each sequence that is retrieved. An E-value can be thought of as the number of matches with a score at least this good that would be expected by chance in a database of that size; the closer it is to zero, the less likely the similarity is due to chance. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''Extremophile''' - an organism that thrives in and may even require physically or geochemically extreme conditions that are detrimental to the majority of life on Earth [http://en.wikipedia.org/wiki/Extremophile] (Will).

== F ==

'''FASTA format''' - a format used to convey either nucleic acid sequences or peptide sequences, in which bases or amino acids are represented by single-letter codes. The sequence name and other descriptors precede the sequence on a header line that begins with ">". [http://en.wikipedia.org/wiki/FASTA_format] (Nick) 
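
To make the format concrete, here is a minimal Python sketch of a FASTA reader. The function name <code>parse_fasta</code> and the choice to use the first word of the header as the record name are assumptions for this example; the sequences in the usage text are made up.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: keep the first word as the record name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            # Sequence lines may be wrapped; collect and join them later.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

if __name__ == "__main__":
    example = ">gene1 hypothetical protein\nATGAAA\nTAG\n>gene2\nATGCCCTGA\n"
    print(parse_fasta(example))
```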

'''family (protein)''' - a group of evolutionarily-related proteins, often with one or several domains in common. Families are organized by domain overlap, structural/functional similarity, and sequence similarity. ([http://en.wikipedia.org/wiki/Protein_family Wikipedia article] and lecture, Laura)

'''finished genome''' - a genome that has been sequenced at least partly by hand, resulting in at least 99.99% sequence accuracy (Lecture, Jay) 

'''fusion mRNA'''-mRNA that results from the transcription of a gene after a chromosomal translocation event. This results in an mRNA sequence that comes from two different genes (Rowley and Blumenthal 2008 ''Science'' Pallavi)

== G ==

'''GC Content''' - the percentage of bases within a certain sequence of DNA (e.g. a gene or a genome) that are either guanine or cytosine; a higher GC content is characteristic of a coding region of a gene; differences in GC content between a gene and a genome can be used as evidence for horizontal gene transfer [http://en.wikipedia.org/wiki/GC-content] (Matt) 
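
GC content is simple to compute. This Python sketch counts only unambiguous bases (A, C, G, T), which is an assumption of the example; how to treat N or other ambiguity codes is a choice each analysis has to make.

```python
def gc_content(seq):
    """Return the percentage of G and C among the A/C/G/T bases in seq."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)

if __name__ == "__main__":
    print(gc_content("ccGGAGGt"))  # the Shine-Dalgarno consensus from this glossary
```

Comparing the value for a single gene with the value for the whole genome is the kind of check mentioned above as possible evidence for horizontal gene transfer.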

'''GC-skew''' – uneven distribution of guanine and cytosine bases between the two strands of DNA where GC base pairs occur. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
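
GC skew is commonly quantified on one strand as (G - C) / (G + C), often in windows along the sequence. The sketch below assumes that formula; the function names and default window size are arbitrary choices for this example.

```python
def gc_skew(sequence):
    """Return (G - C) / (G + C) for one strand, or 0.0 if there is no G or C."""
    seq = sequence.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed_gc_skew(sequence, window=1000, step=1000):
    """Return the GC skew of successive windows along the sequence."""
    last_start = max(len(sequence) - window + 1, 1)
    return [gc_skew(sequence[i:i + window]) for i in range(0, last_start, step)]


if __name__ == "__main__":
    print(windowed_gc_skew("GGGCATGCCC" * 10, window=10, step=10))
```

In many bacterial genomes the sign of the skew changes near the origin and terminus of replication, so a plot of windowed or cumulative skew is one way to look for a candidate origin.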

'''gene amplification''' - production of multiple copies of a gene in order to amplify the amount of protein that the gene encodes [http://www.medterms.com/script/main/art.asp?articlekey=13537] [http://www.answers.com/topic/gene-amplification] (Matt)

'''gene fusion'''-occurs when DNA segments of two different genes come together. Can result in hybrid proteins ([http://www.biochem.northwestern.edu/holmgren/Glossary/Definitions/Def-G/gene_fusion.html 9] Pallavi)

'''gene knockout''' - a process in which a gene is deactivated within a test organism in order to better understand the function of the gene in that organism [http://en.wikipedia.org/wiki/Gene_knockout] (Matt)

'''gene ontology'''- a collaborative effort of investigators to unify and standardize terms associated with the role a gene or protein plays in an organism. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''glaucophyte''' - freshwater algae that have not been studied well [http://en.wikipedia.org/wiki/Glaucophyte](Samantha) 

== H ==

'''haemolysin or hemolysin''' - a chemical produced by a bacterium that causes lysis of red blood cells [http://en.wikipedia.org/wiki/Hemolysis_(microbiology)] (Nick)

'''halophile''' - an organism, most often of the Archaea domain, that lives in environments containing high concentrations of salt [http://en.wikipedia.org/wiki/Halophile] (Matt)

'''haplotype'''-collection of alleles that travel together (Lecture, Pallavi)

'''haptophyte''' - phylum of algae [http://en.wikipedia.org/wiki/Haptophyte](Samantha)

'''heterokont''' - major line of eukaryotes consisting of about 10,500 known species, most of which are algae [http://en.wikipedia.org/wiki/Heterokont](Samantha)

'''Hidden Markov Model''' - a statistical model used in protein recognition databases such as Pfam. A Hidden Markov Model keeps track of several variables and possible variations thereof, such as the possible amino acid sequences that make up a protein domain (since there can be some variance in an amino acid sequence) or the variations in the component sounds that make up a word, and uses those points to match a given sequence to the word, domain, or other complex sequence it most closely matches. An HMM in speech recognition software, for example, can identify that a certain set of sounds make up a certain word, even with the variations in pronunciation and accent that different people will give those sounds. ([http://en.wikipedia.org/wiki/Hidden_Markov_Model Wikipedia] and lecture, Laura)

'''HMM Logo''' - a graphical representation of an HMM, detailing the possible amino acid sequences, the relative frequencies and probabilities of each amino acid in the sequence, the relative contribution each amino acid has to the overall protein family, and the charge or nature of the amino acids themselves. ([http://www.sanger.ac.uk/Software/analysis/logomat-m/help.shtml How to read HMM Logos, on Pfam], Laura)

'''homeobox''' - a DNA sequence within transcription factor genes that allows the cell to respond to patterns of development by having the transcription factors switch on gene cascades [http://en.wikipedia.org/wiki/Homeobox](Samantha)

'''homodimer''' - a protein made of paired identical polypeptides ([http://www.answers.com/topic/homodimer Answers.com], Jay)

'''horizontal gene transfer'''-DNA transmission between species and incorporation of the DNA into the recipient's genome ([http://www.csrees.usda.gov/nea/biotech/res/biotechnology_res_glossary.html horizontal gene transfer] Pallavi)

'''''Hox'' gene'''-a gene that contains a homeobox region that is involved in morphogenesis along the cranio-caudal body axis ([http://www.uprightape.net/UA_Glossary.html 4] Pallavi)

'''hydrolase''' - an enzyme that catalyzes hydrolysis, the cleavage of a chemical bond by the addition of water, with the H and OH of the water molecule ending up on the two resulting products [http://en.wikipedia.org/wiki/Hydrolase] (Nick)

== I ==

'''ideogram''' - in genomics, usually describes a stylized representation of a chromosome with banding patterns (Campbell-Heyer Genomics textbook, Jay)

'''identities''' - in a BLAST output, the number and fraction of total residues which are identical in a given alignment [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)
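
To make the count concrete, here is a small sketch that computes identities for two already-aligned sequences of equal length, with "-" marking gaps. The function name and the example alignment are invented; this illustrates the definition and does not reproduce BLAST's own code.

```python
def identities(aligned_a, aligned_b):
    """Return (identical positions, alignment length, fraction identical)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    length = len(aligned_a)
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    fraction = matches / length if length else 0.0
    return matches, length, fraction


if __name__ == "__main__":
    n, total, frac = identities("MKT-LV", "MKSALV")
    print(f"Identities = {n}/{total} ({frac:.0%})")
```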

'''indole'''-a chemical compound that is produced from the breakdown of tryptophan ([http://medical-dictionary.thefreedictionary.com/indole indole] Pallavi)

'''inclusion body''' - Inclusion bodies are collections of stainable substances, usually proteins, that are found either in the nucleus or the cytoplasm. It is thought that these bodies are often the result of viral proteins that misfolded [http://en.wikipedia.org/wiki/Inclusion_body] (Nick)

'''intron''' - a region of DNA in a gene that is not part of the final coding sequence for the protein. [http://en.wikipedia.org/wiki/Intron] (Peter)

'''isoelectric point''' - the pH at which a molecule carries no net electrical charge [http://en.wikipedia.org/wiki/Isoelectric_point] (Nick)

'''isozymes''' - members of a gene family with very similar cellular roles (Campbell-Heyer Genomics textbook, Jay)

== J ==

== K ==
'''KEGG (Kyoto Encyclopedia of Genes and Genomes)''' - a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The Pathway database records networks of molecular interactions in the cells, and variants of them specific to particular organisms [http://en.wikipedia.org/wiki/KEGG](Will).

'''kinase''' - a type of enzyme that transfers a phosphate group from a high-energy donor molecule to a target molecule in a process called phosphorylation. [http://en.wikipedia.org/wiki/Kinase] (Peter)

== L ==

== M ==
'''Manatee''' - a web-based gene evaluation and genome annotation tool that can view, modify, and store annotation for prokaryotic and eukaryotic genomes. This ongoing, open-source initiative was developed with two missions: first, to let biologists functionally annotate their genomes using a powerful, stand-alone web application with a robustly designed relational annotation database; and second, to invite outside developers to contribute their own ideas and requirements to enhance Manatee's ability to accomplish biological goals [http://manatee.sourceforge.net/](Will).

'''microsatellites'''-stretches of repetitive, short DNA segments that can be used to track the inheritance of certain traits within families ([http://www.clanlindsay.com/genetic_dna_glossary.htm 3] Pallavi)

'''minisatellites'''-segments of DNA that can be used for individual identification (ex. DNA fingerprinting) or in determining relationships between people (ex. paternity cases) ([http://www.clanlindsay.com/genetic_dna_glossary.htm 2] Pallavi).

'''motif''' - a sequence of amino acids or nucleotides that performs a particular role and is often conserved in other species or molecules. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''mycoplasma''' - genus of bacteria that lack a cell wall [http://en.wikipedia.org/wiki/Mycoplasma] (Nick)

== N ==

'''NORFs''' (nonannotated open reading frames) - open reading frames that were considered not to be real genes when the genome was annotated. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''nucleomorph''' - reduced eukaryotic nuclei found in plastids [http://en.wikipedia.org/wiki/Nucleomorph](Samantha)

== O ==
'''object-oriented programming''' - a programming paradigm in which collections of data, associated with operations on that data, are modularly defined and then built upon (CSC 121 Lecture, Will).
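
A small sketch of the idea, using made-up class names: a general Sequence class bundles data with operations on it, and a DNASequence class builds on it by adding a DNA-specific operation.

```python
class Sequence:
    """Data (a name and residues) bundled with operations on that data."""

    def __init__(self, name, residues):
        self.name = name
        self.residues = residues.upper()

    def __len__(self):
        return len(self.residues)

    def composition(self):
        """Count how often each residue letter occurs."""
        counts = {}
        for residue in self.residues:
            counts[residue] = counts.get(residue, 0) + 1
        return counts


class DNASequence(Sequence):
    """Builds on Sequence and adds an operation that only makes sense for DNA."""

    _COMPLEMENT = str.maketrans("ACGT", "TGCA")

    def reverse_complement(self):
        flipped = self.residues.translate(self._COMPLEMENT)[::-1]
        return DNASequence(self.name + "_rc", flipped)


if __name__ == "__main__":
    dna = DNASequence("example", "atgc")
    print(len(dna), dna.composition(), dna.reverse_complement().residues)
```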

'''open reading frame (ORF)'''-a segment of DNA that can potentially encode a protein; it begins with a start codon (usually ATG) and continues in the same reading frame until a stop codon [http://www.fao.org/DOCREP/003/X3910E/X3910E18.htm ORF] (Pallavi)
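
The sketch below scans the three forward reading frames for stretches that run from ATG to the first in-frame stop codon. It is deliberately simplified: the function name is hypothetical, only ATG is treated as a start, the reverse strand is not scanned, and the minimum length is an arbitrary parameter. Real gene callers consider much more.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for forward-strand ORFs.

    start and end are 0-based slice indices, so dna[start:end] runs from the
    ATG through the stop codon. min_codons counts the start and stop codons.
    """
    seq = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    dna = "CCATGAAATAGGG"
    for start, end, frame in find_orfs(dna):
        print(frame, dna[start:end])
```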

'''operon''' - a segment of DNA involving an operator, promoter, and one or more genes that operate as a single unit during transcription [http://en.wikipedia.org/wiki/Operon] (Nick)

'''optical mapping'''-DNA sequences of the organism in question are compared against a karyotype that specifically looks at restriction sites found within the DNA to correctly order the DNA sequences on a chromosome. This methodology gives very detailed haplotype information and allows for the detection of sequence variations across an entire genome [http://www.geocities.com/bioinformaticsweb/genomicglossary.html optical mapping] (Pallavi)

'''ortholog'''-genes in different species that look very similar because they descend from a single gene in the last common ancestor of those species, having diverged through speciation (Lecture, Pallavi)

'''oxidoreductase''' - an enzyme that catalyzes redox reactions by transferring electrons from one molecule (the reductant) to another (the oxidant) [http://en.wikipedia.org/wiki/Oxidoreductase] (Nick)

== P ==

'''paralog'''-similar genes within the genome of one species that arose by duplication of an ancestral gene (Lecture, Pallavi)

'''p-arm''' - the shorter arm of a chromosome's two arms separated by the centromere (compare to q-arm, the longer arm) ([http://www.medterms.com/script/main/art.asp?articlekey=4715 MedTerms Dictionary], Jay)

'''Perl''' - Developed by Larry Wall in 1987, Perl is a [http://en.wikipedia.org/wiki/High-level_programming_language high-level programming language] used frequently by biologists and bioinformaticists [http://en.wikipedia.org/wiki/Perl] (Will).

'''periplasmic space''' - the space between the inner cytoplasmic membrane and external outer membrane in bacteria or archaea. [http://en.wikipedia.org/wiki/Periplasmic_space] (Peter)

'''Pfam''' - a database for protein domain families that matches amino acid sequences or nucleotide sequences to the related group of proteins to which they most likely belong. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura)

'''plasmid''' - an extra-chromosomal DNA molecule that is capable of replicating independently of the chromosomal DNA. Commonly found in bacteria and archaea. [http://en.wikipedia.org/wiki/Plasmid](Peter)

'''plastid''' - major organelles in plants or algae [http://en.wikipedia.org/wiki/Plastid](Samantha)

'''pleomorphism''' - the occurrence of two or more structural forms during a life cycle [http://en.wikipedia.org/wiki/Pleomorphism] (Mary)

'''phylogenetic tree''' - a diagram showing the evolutionary relationships between biological species that are thought to share a common ancestor [http://en.wikipedia.org/wiki/Phylogenetic_tree] (Nick)

'''phylotypes''' – a term intended to resolve the challenge of “species” when classifying prokaryotes using DNA sequence comparisons. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''positives''' - in a BLAST output, the number and fraction of residues for which the alignment scores have positive rather than negative values [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)

'''proteome''' - entire set of proteins expressed by a genome, cell, tissue, or organism. It may refer to expressed proteins under certain conditions [http://en.wikipedia.org/wiki/Proteome](Samantha)

'''PSORT''' - a prediction server that judges where a mature protein could be in the cell, based on its transmembrane domains, its predicted mature amino acid composition, and its signal sequences. ([http://psort.ims.u-tokyo.ac.jp/form.html PSORT], Laura)

'''pseudogenes'''-A sequence of DNA that looks like a gene, but most likely contains many stop codons. It may have evolved away from a real gene or a paralog might have taken its place (Lecture, Pallavi)

'''purine''' - a category of nitrogenous base consisting of a pyrimidine ring fused to an imidazole ring. Notable purine bases are adenine and guanine. [http://en.wikipedia.org/wiki/Purine] (Peter)

'''pyrimidine''' - a category of nitrogenous base consisting of a heterocyclic aromatic ring containing two nitrogen atoms at positions 1 and 3 of the six-member ring. Notable pyrimidine bases are cytosine, thymine, and uracil. [http://en.wikipedia.org/wiki/Pyrimidine] (Peter)

== Q ==

'''q-arm''' - the longer arm of a chromosome's two arms separated by the centromere (compare to p-arm, the shorter arm) ([http://www.medterms.com/script/main/art.asp?articlekey=5152 MedTerms Dictionary], Jay) 

'''query sequence''' - the sequence (whether amino acid or nucleotide) entered into a database’s search function and checked against the database entries. ([http://en.wikipedia.org/wiki/BLAST BLAST on Wikipedia], Laura)

== R ==

'''RAST''' - (Rapid Annotation using Subsystem Technology)- a fully-automated service for annotating bacterial and archaeal genomes. It provides high quality genome annotations for these genomes across the whole phylogenetic tree. ([http://rast.nmpdr.org/], Max Win)

'''rDNA'''-These are DNA sequences that encode for ribosomal RNA. Note that rDNA can also stand for recombinant DNA. ([http://en.wikipedia.org/wiki/Ribosomal_DNA rDNA] Pallavi)

'''residue (protein)''' - the remaining portion of an amino acid after a water molecule has been removed and it has been incorporated into a protein. Functional residues, referred to in Pfam, are the residues that perform some specific identifiable function or are part of a domain, and can be conserved across evolutionarily-related proteins. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura) 

'''retropseudogenes'''-these are genes that have been reverse-transcribed from mRNA and the resulting DNA sequence is incorporated back into the genome. They are non-functional segments of DNA and can be distinguished from pseudogenes in that they do not have intron sequences. ([http://genome.cshlp.org/cgi/content/full/10/5/672 1] Pallavi)

'''retrotransposons''' - RNA transcribed back into DNA and added into the genome [http://en.wikipedia.org/wiki/Retrotransposon](Samantha)

'''ribonuclease''' - a nuclease that catalyzes the degradation of RNA into smaller components [http://en.wikipedia.org/wiki/Ribonuclease] (Mary)

== S ==
'''Serovar'''-a subdivision of a species based on the characteristics of their cell surface antigens ([http://www.biology-online.org/dictionary/Serovar serovar] Pallavi)

'''scaffold''' - a section of a sequenced genome composed of contigs that are in the right order but not necessarily connected ([http://www.medterms.com/script/main/art.asp?articlekey=25223 MedTerms Dictionary], Jay)

'''"Shadow enhancers"'''-secondary enhancers that are thought to be important for natural selection to occur in regulatory DNA segments. They evolve much faster than primary enhancers, which suggests that they are under fewer functional constraints (Wray and Babbit 2008 ''Science'' Pallavi)

'''Shine-Dalgarno sequence''' - A ribosomal binding site on an mRNA, usually a sequence of about six bases located about six or seven bases upstream of the start codon. An anti-Shine-Dalgarno sequence exists on the rRNA in the small subunit of the ribosome; when the two sequences pair, the mRNA is lined up and prepared for translation. (Lecture and [http://en.wikipedia.org/wiki/Shine-dalgarno Wikipedia article], Laura)
Note: The Shine-Dalgarno consensus sequence for our genome is ccGGAGGt.
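
One way to look for a candidate Shine-Dalgarno site is to search a short window upstream of a start codon for the core of the consensus. The sketch below assumes the core GGAGG taken from the consensus noted above; the function name, the 20-base window, and the one-mismatch tolerance are arbitrary choices for illustration.

```python
def find_sd_site(dna, start_codon_index, core="GGAGG", window=20, max_mismatches=1):
    """Return (position, spacer_length, mismatches) of the best core match
    upstream of the start codon, or None if no match is close enough.

    position is the 0-based index of the match; spacer_length is the number
    of bases between the end of the match and the start codon.
    """
    seq = dna.upper()
    first = max(0, start_codon_index - window)
    best = None
    for i in range(first, start_codon_index - len(core) + 1):
        candidate = seq[i:i + len(core)]
        mismatches = sum(1 for a, b in zip(candidate, core) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[2]):
            spacer = start_codon_index - (i + len(core))
            best = (i, spacer, mismatches)
    return best


if __name__ == "__main__":
    dna = "TTTGGAGGTAAACCATGAAA"
    print(find_sd_site(dna, dna.index("ATG")))
```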

'''SignalP''' - a prediction server that judges whether or not a query protein contains a signal peptide. SignalP measures each amino acid against the amino acid sequences of probable signal peptide matches and predicts the cleavage site of the signal peptide. ([http://www.cbs.dtu.dk/services/SignalP-3.0/output.php SignalP Output explained], Laura)

'''signal peptide''' - a short peptide chain that directs the post-translational transport of a protein [http://en.wikipedia.org/wiki/Signal_peptide] (Matt)

'''Smith-Waterman alignment''' - A well-known algorithm for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure [http://en.wikipedia.org/wiki/Smith_waterman](Will).
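
A compact sketch of the algorithm with a simple scoring scheme. The match, mismatch, and gap values are arbitrary assumptions, and a linear gap penalty is used rather than the affine penalties or substitution matrices that practical tools apply.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores never drop below zero, which keeps alignments local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTACGTAA", "GGACGTCC"))
```

Because every cell is filled, the running time grows with the product of the two sequence lengths, which is why heuristic tools such as BLAST are used for database-scale searches.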

'''SNP (Single Nucleotide Polymorphism)''' - a DNA sequence variation occurring when a single nucleotide in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual) [http://en.wikipedia.org/wiki/Single_nucleotide_polymorphism](Will).
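
As an illustration, candidate SNP positions between two already-aligned sequences of equal length can be listed by comparing them base by base. The function name is hypothetical, and gaps or ambiguous bases are skipped as an assumption of this sketch.

```python
def find_snps(reference, sample):
    """Return (1-based position, reference base, sample base) for each
    position where two aligned sequences carry different A/C/G/T bases."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    pairs = zip(reference.upper(), sample.upper())
    return [
        (pos, r, s)
        for pos, (r, s) in enumerate(pairs, start=1)
        if r != s and r in "ACGT" and s in "ACGT"
    ]


if __name__ == "__main__":
    print(find_snps("ATGCCA", "ATGTCA"))
```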

'''symporter''' - an integral membrane protein that is involved in movement of two or more different molecules or ions across a phospholipid membrane in the same direction. [http://en.wikipedia.org/wiki/Symporter] (Peter)

'''synteny''' - a neologism from the Greek for "on the same ribbon". Genes that are syntenic in one species are on the same chromosome; genes that are syntenic across species retain the same order on respective chromosomes as a result of descent from a common ancestor ([http://www.answers.com/synteny Answers.com], Jay)

'''synthetase''' - a type of enzyme that creates a new covalent bond and requires direct input of energy from a high-energy phosphate. [http://books.google.com/books?id=bB8XnCykRmIC&pg=PA522&lpg=PA522&dq=%22synthetase+is+an+enzyme%22&source=web&ots=wkws4ksMsg&sig=zWLkDIk7T78hcf9S84nWs3u5Apw&hl=en&sa=X&oi=book_result&resnum=9&ct=result] (Peter)

== T ==
'''transferase''' - an enzyme that catalyzes the transfer of a functional group from one molecule (the donor) to another (the acceptor) [http://en.wikipedia.org/wiki/Transferase] (Matt)

'''transmembrane helix''' - a single transmembrane alpha helix of a transmembrane protein, usually about twenty amino acids in length. They are usually predicted by hydrophobicity. [http://en.wikipedia.org/wiki/Transmembrane_domain](Mary)

'''transposons / transposable elements''' - DNA sequences that can move around to different positions in a single cell's genome. Transposons can cause mutations and change the length of the genome. [http://en.wikipedia.org/wiki/Transposon](Samantha)

'''Transposon Mutagenesis'''-a procedure in which a transposon is inserted into a gene, which inactivates the gene and can lead to the discovery of the phenotype associated with this gene ([http://cancerweb.ncl.ac.uk/cgi-bin/omd?transposon+mutagenesis transposon mutagenesis] Pallavi)

'''Trans-splicing'''-fragmented exon sequences fuse to form a mature species of mRNA. This process results in fusion mRNA ([http://www.representinggenes.org/Glossary.html 8] Pallavi).

'''tRNA splicing endonuclease''' - an enzyme that cleaves intervening sequences of precursor tRNA. [http://cancerweb.ncl.ac.uk/cgi-bin/omd?splicing+endonuclease] (Peter) 

== U ==

== V ==
'''Vertical gene transfer'''-the transmission or absorption of genetic material that is associated with sexual reproduction and, thus, acknowledges species-specific boundaries ([http://www.gmo-compass.org/eng/glossary/#G 6] Pallavi)

== W ==

'''whole genome shotgun sequencing''' - a method of sequencing where DNA is cut into small pieces and cloned into vectors, then both ends of every cloned insert are sequenced in reads of about 500 bp to form mate pairs. Mate pairs rarely overlap, but are used to reassemble the sequence using software. [http://en.wikipedia.org/wiki/Whole_genome_shotgun](Samantha)
 

== X ==
'''xenolog''' - homologs that are created by horizontal gene transfer between two different species [http://en.wikipedia.org/wiki/Xenolog#Xenology] (Matt) 

== Y ==

== Z ==

File:Purines.ppt

2008-11-20T04:58:45Z

SaSimpson:

Halorhabdus utahensis Genome

2008-11-20T04:57:38Z

SaSimpson: /* My Favorite Pathways */

This page will be used by Davidson College students in the [http://www.bio.davidson.edu/Courses/Bio343/LabMethods.html Genomics Laboratory course].
__NOTOC__
== Links to Multiple Databases ==
*[http://imgweb.jgi-psf.org/cgi-bin/img_edu_v260/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2500575004 JGI IMG EDU] public access
**[[Media:JGIAnnotation.xls|JGI Annotation Excel Spreadsheet]]
*[http://www.tigr.org/tigr-scripts/prok_manatee/shared/login.cgi Manatee at JCVI] use the davidson number sent by email as username and password (database is nthu01 - this is case sensitive)
**[[Media:ManateeAnnotation.xls|Manatee Annotation Excel Spreadsheet]]
*[http://rast.nmpdr.org/ SEED view via RAST] use the username and password combination sent to you by SEED
**[[Media:RastAnnotation.xls|RAST Annotation Excel Spreadsheet]]
**[http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=18261238 RAST Publication in PubMed]
*[http://www.genome.jp/kegg/kaas/ KEGG] We can submit our genes to KEGG to have it mapped out, but SEED and Manatee may already do this. Do we want to ask them to upload it into their database? 
*[http://wishart.biology.ualberta.ca/basys/cache/135af8726ad6f61ec4c5f1e9c4d139ac/index.html BASYs] 
*[http://gcat.davidson.edu/Registry/compare/ Pairwise comparisons of All Three Annotations]

[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI_5contigs.txt JGI Full genome, 5 separate contigs & 3.1 Mbp, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.txt JGI gene DNA sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.xls JGI gene annotations, Excel] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_proteins.txt JGI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_merged.txt JCVI Full genome, 5 contigs fused, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_ORFs.txt JCVI gene sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_proteins.txt JCVI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/GeneLengths.xls 3-way comparison, Excel] 
[[Venn_diagrams]] Venn diagram of 3-way comparison 

== RNA Genes ==

*[[tRNA Genes Check List]] 
*[[rRNA operon]] 
*[[2 misc. RNA genes]] (short summary list) 
*[[Missing tRNA-trp gene found]] 

== Other Resources ==
*[[Consensus Shine Dalgarno]] Excel File for ''H. utahensis'' 
*[[References]] 
*[[Gene Annotation Template]] 
*[[General Questions]] 
*[[Page for Annotated Genes]] 
*[http://www.bio.davidson.edu/courses/genomics/2008/Win/ec/ Search EC number in RAST, JGI or Manatee] 
*[http://gcat.davidson.edu/Wideloache/Webfiles/ecNumBlast.html Blast an EC number against the H. utahensis genome] 
*[http://gcat.davidson.edu/Wideloache/Webfiles/AnnotationSearcher.html Perform a text-based search of the Rast, JGI, and Manatee protein calls] 

== Data ==
*[[Alternative Start Codons]] 
*[[Gene Length Histograms|Gene Length Comparison]] 
*[[Venn_diagrams]] 
*[[Shine Dalgarno Sequence Logo ]] 

== Research Questions ==
#How do the three systems compare for finding ORFs and RNA genes?
#Is there a pattern of missed genes for any of the 3 sites?
#Do the three systems differ in their ability to find good start codons and Shine-Dalgarno sequences? [We need a standard set of genes for comparison. Only highly conserved or a range of genes?]
# Were Shine-Dalgarno sequences calculated for our species or default values used? If default, what sequence?
#Can we fill any holes in their automated annotation? Is there a mechanism for users to add in genes?
#How do the 3 sites compare for ease of use?
#What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working?
#How does each of the 3 sites compare for pathway detection and visualization?
#Do they find the origin of replication? Can we find it?

* How do the 3 systems compare when one system calls a gene hypothetical and another calls it a functional protein? How can they vary and who is getting it closer to correct (however you define that, possibly by date of matched entry)? (Pallavi and Mary)
* Why did one system call a gene when the other two did not? (Matt and Lara)
* How do the 3 sites compare for ease of use? What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working? (Samantha and Nick)
* Where is the origin of replication and did the 3 systems attempt to identify this?
* Did the 3 systems utilize Shine-Dalgarno sequences to help them call start codons? Did they utilize our species's consensus Shine-Dalgarno? (Peter)
* We need to fill in the [[Venn diagrams]] for our 3-way comparison. Let's compare the size of ORFs and generate a [[Gene Length Histograms|graph comparing the distributions]] for all 3. (Max and Will - they also take requests).

<hr>
=Our Favorites=
== My favorite genes==
*Pallavi - Monooxygenase vs. Peroxiredoxin [[Media:peroxiredoxinormonooxygenase.ppt]]

*Mary - JGI gene 2500588521 (922976...924046) [[Media:My favorite gene.ppt]]

*Max - [http://app.sliderocket.com/app/FullPlayer.aspx?id=f2058b94-845f-4a11-94eb-142f251a7fea JGI gene 2500587636 (2-1849)]

*Samantha - JGI gene 2500575882 (80504-80878) [[Media:Earl.ppt]]

*Nick - JGI gene 2300587691 (69942...72866) [[Media:Gene presentation.ppt]]

*Will - [http://gcat.davidson.edu/GcatWiki/images/e/e7/Halomucin.ppt JGI gene 2500590430 (2847205..2854335)]

*Jay - JGI gene 2500588397 (806410..807321) [http://www.bio.davidson.edu/courses/genomics/2008/McNair/Fav_Gene/FavoriteGenePresentation.pptx Co/Zn/Cd PowerPoint]

*Matt - Transcriptional Regulator nrdR (3109722..3110204 + 7274..7765) [[Media:BioFavoriteGeneNrdR.ppt]]

*Peter - tRNA intron endonuclease [[Media:TRNAtrpintronendonuclease.ppt]]

*Laura - 16S Small ribosomal subunit, JGI gene 2500590728 (2397347..2398825)

== My Favorite Pathways==
Pallavi - Carbohydrate Metabolism, specifically glycolysis/gluconeogenesis

Jay - Phosphotransferases

Will - [[Ribosomal Binding Site Conservation]]

Max - Energy

Samantha - Purine Metabolism!!![[Media:Purine.pptx]]

Laura - Amino Acid Biosynthesis

Nick - Pentose Phosphate

Matt - Chitin Metabolism

Mary - Citric acid cycle

''Suggestions by Kjeld'' 
'''[[Cellulase]]''' by Pallavi 
I think it would be very interesting to look for genes involved in cellulose degradation: endocellulases, exocellulases (=cellobiohydrolases) and β-glucosidases.
Many cellulose degraders produce a range of each type. A cellulolytic system able to function at 4.6 M NaCl is an interesting one. We either did not observe or did not look for cellulose degradation. However, these systems are normally inducible and you need to test several substrates and inducers. It would be nice to have a compilation of putative “cellulase” genes.
There are several good recent reviews on cellulases (also mentioning E.C. numbers and enzyme families) that your students could consult.

'''[[Chitinase]]''' by Matt 
Apparently you detected a chitinase, but according to our records it does not grow on N-acetyl-glucosamine, which is somewhat strange. It grows on glucose though.

'''[[Lipases]]''' by Mary 
Lipases (/esterases) would also be interesting to look for – some lipases have important industrial applications.

'''[[Amylases]]''' by Samantha 
We did not observe growth on starch. Did you find any “amylase-coding genes”?

'''[[Xylose (glucose) isomerase)]]''' by Nick 
An enzyme of great commercial value.

'''[[Amino acids]]''' led by Laura and assisted by Max, Jay, Nick and Samantha 
According to our records AX-2 is able to grow in a “defined medium”. This is at variance with your “holes” for synthesis of amino acids. However, there could have been some “carry over” of amino acids when inoculating a culture grown in complex medium (e.g. containing yeast extract). However, we are normally aware of this problem and do repeated culturing to dilute out potential growth factors present in yeast extract.

'''[[Proteases]]''' by Peter 
We did not detect protease activity – albeit only checking a few substrates.

'''[[Protein Export]]''' by Malcolm 
We need to know how these proteins might reach outside the cell which is where the food would be and thus the digestive enzymes or importers need to reach the outside world or the cell membrane.

'''[[Phage Proteins]]''' by Malcolm
search for phage

= Student-created tutorials: =
== Tutorials for Annotating Genomes ==

# Will DeLoache - [http://www.bio.davidson.edu/courses/genomics/2008/DeLoache/BioPerlTutorial/BioPerl.htm BioPerl Installation] 
# Max Win - [http://www.bio.davidson.edu/courses/genomics/2008/Win/home/perl.html Introduction to Perl for non-programmers (with step-by-step explanations, simple exercises, and solutions)] 
# Pallavi - Conserved Domains Database (CDD) [[Media:CDDtutorial.doc]] 
# Mary - Protein Data Bank (PDB) [[Media:PDB Tutorial.doc]] 
# Laura Voss - Pfam Database [http://www.bio.davidson.edu/Courses/Bio343/Pfam_tutorial.doc Pfam Tutorial] 
# Samantha Simpson - [http://www.bio.davidson.edu/courses/genomics/2008/Simpson/Tutorial.html NCBI BLAST] 
# Peter Bakke - [[Media:ShineDalgarnoTutorial.doc]] 
# Jay McNair - [http://www.bio.davidson.edu/courses/genomics/2008/McNair/Origin_Tutorial/OriginTutorial.doc Origin of Replication Tutorial] 
# Nick Carney - Navigating the JGI Database [[Media:NavigatingJGItutorial.doc]] 
# Matt Lotz - SEED Viewer - [[Media:SEEDTutorial.doc]] 

== Pathway Tutorials==
[http://www.pathguide.org/ Pathguide] - a possible source of tutorials and extensive information

[http://www.bigre.ulb.ac.be/Users/didier/pathfinding/ Shortest Path Tool]
<hr>
*Pallavi: I will compare RAST and KEGG in pathway annotations and use Glycolysis/Gluconeogenesis as my example: [[Media:Pallavitutorial.doc]]

*Matt: WikiPathways [[Media:WikiPathwaysTutorial.doc]]

*Mary: ENZYME [[Media:ENZYME tutorial.doc]]

*Samantha: [http://www.bio.davidson.edu/courses/genomics/2008/Simpson/Tutorial2.html How To Determine EC Numbers] 

*Nick: Metacyc [[Media:MetaCyc tutorial.doc]]

*Max: [http://www.bio.davidson.edu/courses/genomics/2008/Win/kgml.html KGML How to color EC numbers in KEGG maps and view it in KGML graph editor] 

*Jay: [http://www.bio.davidson.edu/courses/genomics/2008/McNair/Pathways_Tutorial/SEED_Scenario_Paths.doc SEED Scenario Paths] (a tool to determine completeness of pathways)

*Laura: [http://www.bio.davidson.edu/Courses/Bio343/Pathway_Entrances_Exits.doc Pathway Entrances and Exits]

*Will: [http://www.bio.davidson.edu/courses/genomics/2008/DeLoache/LocalBlastTutorial/LocalBlast.html Running BLAST Locally]

*Peter: Exploring Proteases: MEROPS Peptidase Database Tutorial - [[Media:MEROPStutorial_PB.doc]]

=Glossary words (A - Z):=
[[#A| A ]] [[#B| B ]] [[#C| C ]] [[#D| D ]] [[#E| E ]] [[#F| F ]] [[#G| G ]] [[#H| H ]] [[#I| I ]] [[#J| J ]] [[#K| K ]] [[#L| L ]] [[#M| M ]] [[#N| N ]] [[#O| O ]] [[#P| P ]] [[#Q| Q ]] [[#R| R ]] [[#S| S ]] [[#T| T ]] [[#U| U ]] [[#V| V ]] [[#W| W ]] [[#X| X ]] [[#Y| Y ]] [[#Z| Z ]]

== A ==
'''Accession Number''' - a unique identifier given to DNA and protein sequences to allow for tracking of sequence information within a single database [http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)] (Will).

'''Antisense (RNA or DNA)'''-a piece of DNA or RNA that binds to a complementary sequence of DNA or RNA. These segments of genetic material can be used to identify the existence of a disease gene and they can also be used to bind to specific DNA or mRNA sequences to inhibit their function ([http://biotech.fyicenter.com/glossary/Bioinformatics_Glossary.html 5] Pallavi).

'''Arabidopsis thaliana''' - the scientific name for the thale cress plant; it was the first plant to have its genome sequenced, and is a model organism for understanding plant biology and genetics ([http://en.wikipedia.org/wiki/Thale_cress Wikipedia.org], Jay)

== B ==
'''BAC''' - bacterial artificial chromosome, a DNA construct used for transforming or cloning segments of DNA and often used to sequence the genetic code of organisms ([http://en.wikipedia.org/wiki/Bacterial_artificial_chromosome Wikipedia.org], Jay)

'''bioinformatics''' - the multi-disciplinary approach of using biology, computer science and mathematics to solve or better understand biological problems [http://en.wikipedia.org/wiki/Bioinformatics] (Matt)

'''BLAST''' - (Basic Local Alignment Search Tool) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. [http://blast.ncbi.nlm.nih.gov/Blast.cgi] (Mary)

'''bioperl'''- a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications such as accessing sequence data from local and remote databases, converting between database formats, manipulating individual sequences, searching for similar sequences, searching for genes and other structures on genomic DNA, and developing machine-readable sequence annotations. [http://en.wikipedia.org/wiki/BioPerl] (Wikipedia, Max Win)

== C ==
'''carbon fixation''' - using carbon dioxide to create organic materials [http://en.wikipedia.org/wiki/Carbon_fixation] (Samantha) 

'''CDD''' (Conserved Domains Database)- a database used to identify the conserved domains present in a protein query sequence [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml] (Mary)

'''chaperonin''' - a protein complex that assists some newly formed polypeptide chains by folding them into their final, functional, three-dimensional form [http://en.wikipedia.org/wiki/Chaperonins] (Matt)

'''chemotaxis''' - the process in which cells seek out or flee from high concentrations of certain chemicals; it is found in both uni- and multicellular organisms. Unicellular organisms use it to avoid toxins or find food, and multicellular organisms use it for tasks such as reproduction [http://en.wikipedia.org/wiki/Chemotaxis] (Nick)

'''chemotaxonomy''' - the attempt to classify and identify organisms according to demonstrable differences and similarities in their biochemical compositions [http://en.wikipedia.org/wiki/Chemotaxonomy] (Mary)

'''ClustalW''' - A web-based or command line tool that performs multiple sequence alignments to determine evolutionary relationships between three or more sequences [http://en.wikipedia.org/wiki/Clustal] (Will).

'''COG''' (Clusters of Orthologous Groups)- each COG corresponds to a highly conserved domain and generally consists of either individual proteins or groups of paralogs ([http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml COG] Pallavi) 

'''concatemer''' - long continuous DNA molecule that contains the same DNA sequence repeated in series [http://en.wikipedia.org/wiki/Concatemer](Samantha) 

'''contigs''' (contiguous DNA)- overlapping DNA segments that as a collection form a longer and gapless segment of DNA. (Discovery Genomics, Proteomics and Bioinformatics [http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''coverage''' - refers to the number of times, on average, any piece of DNA in a sequenced genome has been individually sequenced (Lecture, Jay)
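
A minimal sketch of how average coverage is usually estimated: total bases sequenced divided by genome length. The read count and read length below are hypothetical; the 3.1 Mbp genome size is the figure quoted for the JGI contigs on this page.

```python
def average_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average coverage = (number of reads * read length) / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


# Hypothetical project: 50,000 reads of 500 bp each against a 3.1 Mbp genome.
print(round(average_coverage(50_000, 500, 3_100_000), 2))
```

This is an average only; real coverage varies along the genome, so some regions will be sequenced more or fewer times than this number suggests.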

'''CPAN (Comprehensive Perl Archive Network)''' - an archive of over 12,200 modules of software written in Perl, as well as documentation for it. It contains a module called CPAN (or CPAN.pm) which is used as an installer for Perl modules such as BioPerl [http://en.wikipedia.org/wiki/CPAN](Will).

'''Cytogenetics'''-the study of normal and abnormal chromosomes. This involves studying the causes of chromosomal abnormalities and looking at the structure of chromosomes ([http://www.vivo.colostate.edu/hbooks/genetics/medgen/chromo/index.html 7] Pallavi).

== D ==
'''''de novo'' synthesis''' - the synthesis of complex molecules from simple molecules (e.g. sugars and nucleotides), rather than from recycled molecules; from the Latin for "from the new" [http://en.wikipedia.org/wiki/De_novo_synthesis] (Matt)

'''dehydrogenase''' - a type of enzyme that oxidizes a substrate by transferring one or more protons and a pair of electrons to an acceptor. [http://en.wikipedia.org/wiki/Dehydrogenase] (Peter)

'''diatom''' - a major group of eukaryotic algae, and one of the most common types of phytoplankton. A characteristic feature of diatom cells is that they are encased within a unique cell wall made of silica called a frustule. These frustules show a wide diversity in form, but usually consist of two asymmetrical sides with a split between them. [http://en.wikipedia.org/wiki/Diatom] (Mary)

'''domain (protein)''' - the structural and functional groups of a protein, which can exist independently of the protein itself. Domains typically perform a specific function, such as binding to promoters or substrates, and many proteins can have one or several domains in common. Evolutionarily-linked proteins are more likely to have domains in common. Domains are used to organize proteins into families. ([http://en.wikipedia.org/wiki/Domain_(protein) Wikipedia article], Laura)

'''dot plot'''-graphical display comparing sequence conservation between two genomes with dots indicating strings of identical bases. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
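
As an illustration of the idea, the sketch below builds a text-only dot plot for two short sequences. The function name and the default word size of 3 are assumptions made for this example, not part of any particular tool.

```python
def dot_plot(seq_a: str, seq_b: str, word: int = 3) -> list[str]:
    """Return one text row per position in seq_a.

    A '*' at row i, column j means the `word` bases starting at seq_a[i]
    are identical to the `word` bases starting at seq_b[j].
    """
    rows = []
    for i in range(len(seq_a) - word + 1):
        row = ""
        for j in range(len(seq_b) - word + 1):
            row += "*" if seq_a[i:i + word] == seq_b[j:j + word] else "."
        rows.append(row)
    return rows


for line in dot_plot("ACGTACGT", "ACGTTCGT"):
    print(line)
```

Conserved regions show up as diagonal runs of marks. Larger word sizes reduce background noise at the cost of missing short matches.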

== E ==

'''EC number''' (Enzyme Commission Number)- a numerical classification scheme for enzymes, based on the chemical reactions they catalyze [http://en.wikipedia.org/wiki/EC_number] (Mary)

'''E-value''' (Expect value)- When performing a BLAST search, you will obtain an E-value for each sequence that is retrieved. An E-value is the number of hits with an equal or better score that you would expect to find by chance in a database of the same size; the closer it is to zero, the less likely it is that the match is due to chance. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
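
Under the Karlin-Altschul statistics that BLAST is based on, the expected number of chance hits with bit score at least S' is approximately m × n × 2<sup>−S'</sup>, where m is the query length and n is the database length. The sketch below is a simplification: it uses raw lengths, whereas BLAST itself applies corrected "effective" lengths, so its reported values will differ somewhat.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value: E = m * n * 2**(-S'), with S' the bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Same bit score, larger database: more chance hits are expected.
print(expect_value(40.0, 300, 1_000_000))
print(expect_value(40.0, 300, 100_000_000))
```

The formula shows why the same alignment gets a worse (larger) E-value when it is searched against a bigger database.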

'''Extremophile''' - an organism that thrives in and may even require physically or geochemically extreme conditions that are detrimental to the majority of life on Earth [http://en.wikipedia.org/wiki/Extremophile] (Will).

== F ==

'''FASTA format''' - a format used to convey either nucleic acid sequences or peptide sequences, in which base pairs or amino acids are represented by single-letter codes. The sequence name and other descriptors often precede the amino acid sequence. [http://en.wikipedia.org/wiki/FASTA_format] (Nick) 
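
A minimal sketch of reading FASTA text, assuming each record starts with a ">" header line followed by one or more sequence lines. The function name is illustrative; for real work a library such as BioPerl or Biopython is the safer choice.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each header line (without the leading '>') to its full sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            # Sequence lines may be wrapped; join them back together.
            records[name] += line.upper()
    return records


example = ">gene1 hypothetical protein\nATGAAA\nTTTTAG\n>gene2\nATGCCCTGA\n"
for header, sequence in parse_fasta(example).items():
    print(header, len(sequence))
```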

'''family (protein)''' - a group of evolutionarily-related proteins, often with one or several domains in common. Families are organized by domain overlap, structural/functional similarity, and sequence similarity. ([http://en.wikipedia.org/wiki/Protein_family Wikipedia article] and lecture, Laura)

'''finished genome''' - a genome that has been sequenced at least partly by hand, resulting in at least 99.99% sequence accuracy (Lecture, Jay) 

'''fusion mRNA'''-mRNA that results from the transcription of a gene after a chromosomal translocation event. This results in an mRNA sequence that comes from two different genes (Rowley and Blumenthal 2008 ''Science'' Pallavi)

== G ==

'''GC Content''' - the percentage of bases within a certain sequence of DNA (e.g. a gene or a genome) that are either guanine or cytosine; a higher GC content is characteristic of a coding region of a gene; differences in GC content between a gene and a genome can be used as evidence for horizontal gene transfer [http://en.wikipedia.org/wiki/GC-content] (Matt) 
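
The calculation itself is simple, as this sketch shows. It assumes that ambiguous characters such as N should be left out of the denominator, which is one common convention but not the only one.

```python
def gc_content(seq: str) -> float:
    """Percentage of A/C/G/T bases in seq that are G or C."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


print(gc_content("ATGCGCGCTA"))
```

Running the same function on a single gene and on the whole genome gives the two numbers to compare when looking for evidence of horizontal gene transfer.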

'''GC-skew''' – uneven distribution of guanine and cytosine bases between the two strands of DNA where GC base pairs occur. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
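
GC-skew is commonly quantified on one strand as (G − C) / (G + C), computed in windows along the sequence. The sketch below uses that formula; the window size is an arbitrary choice for illustration.

```python
def gc_skew(seq: str, window: int = 1000) -> list[float]:
    """Return (G - C) / (G + C) for each consecutive window of seq."""
    seq = seq.upper()
    values = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # Windows with no G or C get a skew of 0 to avoid dividing by zero.
        values.append((g - c) / (g + c) if g + c else 0.0)
    return values


print(gc_skew("GGGCAAAACCCGTTTT", window=4))
```

In many bacterial genomes the sign of the skew flips near the origin and terminus of replication, which is why skew plots are often used as a first guess at where replication begins.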

'''gene amplification''' - production of multiple copies of a gene in order to amplify the amount of protein that the gene encodes for [http://www.medterms.com/script/main/art.asp?articlekey=13537] [http://www.answers.com/topic/gene-amplification] (Matt)

'''gene fusion'''-occurs when DNA segments of two different genes come together. Can result in hybrid proteins ([http://www.biochem.northwestern.edu/holmgren/Glossary/Definitions/Def-G/gene_fusion.html 9] Pallavi)

'''gene knockout''' - a process in which a gene is deactivated within a test organism in order to better understand the function of the gene in that organism [http://en.wikipedia.org/wiki/Gene_knockout] (Matt)

'''gene ontology'''- a collaborative effort of investigators to unify and standardize terms associated with the role a gene or protein plays in an organism. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''glaucophyte''' - freshwater algae that have not been studied well [http://en.wikipedia.org/wiki/Glaucophyte](Samantha) 

== H ==

'''haemolysin or hemolysin''' - a chemical produced by a bacterium that causes lysis of red blood cells [http://en.wikipedia.org/wiki/Hemolysis_(microbiology)] (Nick)

'''halophile''' - an organism, most often of the Archaea domain, that lives in environments containing high concentrations of salt [http://en.wikipedia.org/wiki/Halophile] (Matt)

'''haplotype'''-collection of alleles that travel together (Lecture, Pallavi)

'''haptophyte''' - phylum of algae [http://en.wikipedia.org/wiki/Haptophyte](Samantha)

'''heterokont''' - major line of eukaryotes consisting of about 10,500 known species, most of which are algae [http://en.wikipedia.org/wiki/Heterokont](Samantha)

'''Hidden Markov Model''' - a statistical model used in protein recognition databases such as Pfam. A Hidden Markov Model keeps track of several variables and possible variations thereof, such as the possible amino acid sequences that make up a protein domain (since there can be some variance in an amino acid sequence) or the variations in the component sounds that make up a word, and uses those points to match a given sequence to the word, domain, or other complex sequence it most closely matches. An HMM in speech recognition software, for example, can identify that a certain set of sounds make up a certain word, even with the variations in pronunciation and accent that different people will give those sounds. ([http://en.wikipedia.org/wiki/Hidden_Markov_Model Wikipedia] and lecture, Laura)

'''HMM Logo''' - a graphical representation of an HMM, detailing the possible amino acid sequences, the relative frequencies and probabilities of each amino acid in the sequence, the relative contribution each amino acid has to the overall protein family, and the charge or nature of the amino acids themselves. ([http://www.sanger.ac.uk/Software/analysis/logomat-m/help.shtml How to read HMM Logos, on Pfam], Laura)

'''homeobox''' - DNA sequence within transcription factor genes that allows the cell to respond to patterns of development by having the transcription factors switch on gene cascades [http://en.wikipedia.org/wiki/Homeobox](Samantha)

'''homodimer''' - a protein made of paired identical polypeptides ([http://www.answers.com/topic/homodimer Answers.com], Jay)

'''horizontal gene transfer'''-DNA transmission between species and incorporation of the DNA into the recipient's genome ([http://www.csrees.usda.gov/nea/biotech/res/biotechnology_res_glossary.html horizontal gene transfer] Pallavi)

'''''Hox'' gene'''-a gene that contains a homeobox region that is involved in morphogenesis along the cranio-caudal body axis ([http://www.uprightape.net/UA_Glossary.html 4] Pallavi)

'''hydrolase''' - an enzyme that catalyzes hydrolysis, the cleavage of a chemical bond by the addition of water; the resulting products often take part in subsequent reactions [http://en.wikipedia.org/wiki/Hydrolase] (Nick)

== I ==

'''ideogram''' - in genomics, usually describes a stylized representation of a chromosome with banding patterns (Campbell-Heyer Genomics textbook, Jay)

'''identities''' - in a BLAST output, the number and fraction of total residues which are identical in a given alignment [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)
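
A small sketch of how the identities figure can be computed from two aligned strings, assuming gaps are written as "-" and that the fraction is taken over the full alignment length, gaps included.

```python
def identities(aligned_a: str, aligned_b: str) -> tuple[int, int, float]:
    """Return (identical positions, alignment length, percent identity)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    length = len(aligned_a)
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    percent = 100.0 * same / length if length else 0.0
    return same, length, percent


print(identities("ACG-T", "ACGAT"))
```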

'''indole'''-a chemical compound that is produced from the break down of tryptophan ([http://medical-dictionary.thefreedictionary.com/indole indole] Pallavi)

'''inclusion body''' - Inclusion bodies are collections of stainable substances, usually proteins, that are found either in the nucleus or the cytoplasm. It is thought that these bodies are often the result of viral proteins that misfolded [http://en.wikipedia.org/wiki/Inclusion_body] (Nick)

'''intron''' - a region of DNA in a gene that is not part of the final coding sequence for the protein. [http://en.wikipedia.org/wiki/Intron] (Peter)

'''isoelectric point''' - the pH at which a molecule carries no net electrical charge [http://en.wikipedia.org/wiki/Isoelectric_point] (Nick)

'''isozymes''' - members of a gene family with very similar cellular roles (Campbell-Heyer Genomics textbook, Jay)

== J ==

== K ==
'''KEGG (Kyoto Encyclopedia of Genes and Genomes)''' - a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The Pathway database records networks of molecular interactions in the cells, and variants of them specific to particular organisms [http://en.wikipedia.org/wiki/KEGG](Will).

'''kinase''' - a type of enzyme that transfers a phosphate group from a high-energy donor molecule to a target molecule in a process called phosphorylation. [http://en.wikipedia.org/wiki/Kinase] (Peter)

== L ==

== M ==
'''Manatee''' - a web-based gene evaluation and genome annotation tool that can view, modify, and store annotation for prokaryotic and eukaryotic genomes. This ongoing, open-source initiative was developed with two missions: first, to give biologists the ability to functionally annotate their genomes using a powerful, stand-alone web application with a robustly designed relational annotation database; and second, to give outside developers the opportunity to contribute their own ideas and requirements to enhance Manatee's ability to accomplish biological goals [http://manatee.sourceforge.net/](Will). 

'''microsatellites'''-stretches of repetitive, short DNA segments that can be used to track the inheritance of certain traits within families ([http://www.clanlindsay.com/genetic_dna_glossary.htm 3] Pallavi)

'''minisatellites'''-segments of DNA that can be used for individual identification (ex. DNA fingerprinting) or in determining relationships between people (ex. paternity cases) ([http://www.clanlindsay.com/genetic_dna_glossary.htm 2] Pallavi).

'''motif''' - a sequence of amino acids or nucleotides that performs a particular role and is often conserved in other species or molecules. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''mycoplasma''' - genus of bacteria that lack a cell wall [http://en.wikipedia.org/wiki/Mycoplasma] (Nick)

== N ==

'''NORFs''' (nonannotated open reading frames) - open reading frames that were considered not to be real genes when the genome was annotated. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''nucleomorph''' - reduced eukaryotic nuclei found in plastids [http://en.wikipedia.org/wiki/Nucleomorph](Samantha)

== O ==
'''object-oriented programming''' - a programming paradigm in which collections of data, associated with operations on that data, are modularly defined and then built upon (CSC 121 Lecture, Will).

'''open reading frame (ORF)'''-a segment of DNA that can potentially encode for a protein and it begins with a start codon (usually ATG) [http://www.fao.org/DOCREP/003/X3910E/X3910E18.htm ORF] (Pallavi)
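
A minimal ORF scan, offered as a sketch under several simplifying assumptions: it checks only the three forward-strand reading frames, accepts only ATG as a start codon, and requires an in-frame stop codon. Real gene callers also scan the reverse strand and allow alternative start codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) positions of ORFs on the given strand.

    Positions are 0-based with the end exclusive, and the end includes the
    stop codon. min_codons counts codons from the ATG up to, but not
    including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


dna = "CCATGAAATAGCC"
for begin, end in find_orfs(dna):
    print(begin, end, dna[begin:end])
```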

'''operon''' - a segment of DNA involving an operator, promoter, and one or more genes that operate as a single unit during transcription [http://en.wikipedia.org/wiki/Operon] (Nick)

'''optical mapping'''-DNA sequences of the organism in question are compared against a karyotype that specifically looks at restriction sites found within the DNA to correctly order the DNA sequences on a chromosome. This methodology gives very detailed haplotype information and allows for the detection of sequence variations across an entire genome [http://www.geocities.com/bioinformaticsweb/genomicglossary.html optical mapping] (Pallavi)

'''ortholog'''-homologous genes in different species that diverged from a single ancestral gene through speciation; orthologs look very similar because they share an evolutionary origin (Lecture, Pallavi)

'''oxidoreductase''' - an enzyme that catalyzes redox reactions by transferring electrons from one molecule (the reductant) to another (the oxidant) [http://en.wikipedia.org/wiki/Oxidoreductase] (Nick)

== P ==

'''paralog'''-homologous genes within the same species that arose by duplication of an ancestral gene; paralogs are similar but usually not identical (Lecture, Pallavi)

'''p-arm''' - the shorter arm of a chromosome's two arms separated by the centromere (compare to q-arm, the longer arm) ([http://www.medterms.com/script/main/art.asp?articlekey=4715 MedTerms Dictionary], Jay)

'''Perl''' - Developed by Larry Wall in 1987, Perl is a [http://en.wikipedia.org/wiki/High-level_programming_language high-level programming language] used frequently by biologists and bioinformaticists [http://en.wikipedia.org/wiki/Perl] (Will).

'''periplasmic space''' - the space between the inner cytoplasmic membrane and external outer membrane in bacteria or archaea. [http://en.wikipedia.org/wiki/Periplasmic_space] (Peter)

'''Pfam''' - a database for protein domain families that matches amino acid sequences or nucleotide sequences to the related group of proteins to which they most likely belong. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura)

'''plasmid''' - an extra-chromosomal DNA molecule that is capable of replicating independently of the chromosomal DNA. Commonly found in bacteria and archaea. [http://en.wikipedia.org/wiki/Plasmid](Peter)

'''plastid''' - major organelles in plants or algae [http://en.wikipedia.org/wiki/Plastid](Samantha)

'''pleomorphism''' - the occurrence of two or more structural forms during a life cycle [http://en.wikipedia.org/wiki/Pleomorphism] (Mary)

'''phylogenetic tree''' - a diagram showing the evolutionary relationships between biological species that are thought to share a common ancestor [http://en.wikipedia.org/wiki/Phylogenetic_tree] (Nick)

'''phylotypes''' – a term intended to resolve the challenge of “species” when classifying prokaryotes using DNA sequence comparisons. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''positives''' - in a BLAST output, the number and fraction of residues for which the alignment scores have positive rather than negative values [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)

'''proteome''' - entire set of proteins expressed by a genome, cell, tissue, or organism. It may refer to expressed proteins under certain conditions [http://en.wikipedia.org/wiki/Proteome](Samantha)

'''PSORT''' - a prediction server that judges where a mature protein could be in the cell, based on its transmembrane domains, its predicted mature amino acid composition, and its signal sequences. ([http://psort.ims.u-tokyo.ac.jp/form.html PSORT], Laura)

'''pseudogenes'''-sequences of DNA that look like genes but most likely contain many stop codons. A pseudogene may have evolved away from a real gene, or a paralog might have taken its place (Lecture, Pallavi)

'''purine''' - a category of nitrogenous base consisting of a pyrimidine ring fused to an imidazole ring. Notable purine bases are adenine and guanine. [http://en.wikipedia.org/wiki/Purine] (Peter)

'''pyrimidine''' - a category of nitrogenous base consisting of a heterocyclic aromatic ring containing two nitrogen atoms at positions 1 and 3 of the six-member ring. Notable pyrimidine bases are cytosine, thymine, and uracil. [http://en.wikipedia.org/wiki/Pyrimidine] (Peter)

== Q ==

'''q-arm''' - the longer arm of a chromosome's two arms separated by the centromere (compare to p-arm, the shorter arm) ([http://www.medterms.com/script/main/art.asp?articlekey=5152 MedTerms Dictionary], Jay) 

'''query sequence''' - the sequence (whether amino acid or nucleotide) entered into a database’s search function and checked against the database entries. ([http://en.wikipedia.org/wiki/BLAST BLAST on Wikipedia], Laura)

== R ==

'''RAST''' - (Rapid Annotation using Subsystem Technology)- a fully-automated service for annotating bacterial and archaeal genomes. It provides high quality genome annotations for these genomes across the whole phylogenetic tree. ([http://rast.nmpdr.org/], Max Win)

'''rDNA'''-These are DNA sequences that encode for ribosomal RNA. Note that rDNA can also stand for recombinant DNA. ([http://en.wikipedia.org/wiki/Ribosomal_DNA rDNA] Pallavi)

'''residue (protein)''' - the remaining portion of an amino acid after a water molecule has been removed and it has been incorporated into a protein. Functional residues, referred to in Pfam, are the residues that perform some specific identifiable function or are part of a domain, and can be conserved across evolutionarily-related proteins. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura) 

'''retropseudogenes'''-these are genes that have been reverse-transcribed from mRNA and the resulting DNA sequence is incorporated back into the genome. They are non-functional segments of DNA and can be distinguished from pseudogenes in that they do not have intron sequences. ([http://genome.cshlp.org/cgi/content/full/10/5/672 1] Pallavi)

'''retrotransposons''' - RNA transcribed back into DNA and added into the genome [http://en.wikipedia.org/wiki/Retrotransposon](Samantha)

'''ribonuclease''' - a nuclease that catalyzes the degradation of RNA into smaller components [http://en.wikipedia.org/wiki/Ribonuclease] (Mary)

== S ==
'''Serovar'''-a subdivision of a species based on the characteristics of their cell surface antigens ([http://www.biology-online.org/dictionary/Serovar serovar] Pallavi)

'''scaffold''' - a section of a sequenced genome composed of contigs that are in the right order but not necessarily connected ([http://www.medterms.com/script/main/art.asp?articlekey=25223 MedTerms Dictionary], Jay)

'''"Shadow enhancers"'''-secondary enhancers that are thought to be important for natural selection to occur in regulatory DNA segments. They evolve much faster than primary enhancers, which suggests that they are under fewer functional constraints (Wray and Babbit 2008 ''Science'' Pallavi)

'''Shine-Dalgarno sequence''' - A ribosomal binding site on an mRNA, usually a sequence of about six bases located six or seven bases upstream of the start codon. An anti-Shine-Dalgarno sequence exists on the rRNA in the small subunit of the ribosome; when the two sequences align, the mRNA is lined up and prepared for translation. (Lecture and [http://en.wikipedia.org/wiki/Shine-dalgarno Wikipedia article], Laura) 
Note: The Shine-Dalgarno consensus sequence for our genome is ccGGAGGt.
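
One way to check a proposed start codon is to look for the motif a short distance upstream. The sketch below assumes the core of the consensus is the upper-case portion, GGAGG, and uses a hypothetical 20-base search window; both are choices made for illustration rather than settings taken from any annotation system.

```python
def find_shine_dalgarno(seq: str, start_codon_index: int,
                        core: str = "GGAGG", max_upstream: int = 20):
    """Return the number of bases between the end of the core motif and the
    start codon, or None if the motif is absent from the upstream window."""
    seq = seq.upper()
    window_start = max(0, start_codon_index - max_upstream)
    upstream = seq[window_start:start_codon_index]
    pos = upstream.rfind(core)  # closest occurrence to the start codon
    if pos == -1:
        return None
    return len(upstream) - (pos + len(core))


dna = "TTCCGGAGGTAAACCATGAAA"
print(find_shine_dalgarno(dna, dna.index("ATG")))
```

A spacing of roughly six or seven bases would be consistent with the definition above; much larger or smaller values suggest the start codon call deserves a second look.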

'''SignalP''' - a prediction server that judges whether or not a query protein contains a signal peptide. SignalP measures each amino acid against the amino acid sequences of probable signal peptide matches and predicts the cleavage site of the signal peptide. ([http://www.cbs.dtu.dk/services/SignalP-3.0/output.php SignalP Output explained], Laura)

'''signal peptide''' - a short peptide chain that directs the post-translational transport of a protein [http://en.wikipedia.org/wiki/Signal_peptide] (Matt)

'''Smith-Waterman alignment''' - A well-known algorithm for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure [http://en.wikipedia.org/wiki/Smith_waterman](Will).
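
The sketch below is a compact version of the algorithm with a simple linear gap penalty. The scoring values (match +2, mismatch −1, gap −2) are arbitrary choices for illustration; protein alignments normally use a substitution matrix and separate gap-open and gap-extend penalties.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -2) -> tuple[int, str, str]:
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores never drop below zero, which makes it local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTACGTAA", "GGACGTCC"))
```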

'''SNP (Single Nucleotide Polymorphism)''' - a DNA sequence variation occurring when a single nucleotide in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual) [http://en.wikipedia.org/wiki/Single_nucleotide_polymorphism](Will).

'''symporter''' - an integral membrane protein that is involved in movement of two or more different molecules or ions across a phospholipid membrane. [http://en.wikipedia.org/wiki/Symporter] (Peter)

'''synteny''' - a neologism from the Greek for "on the same ribbon". Genes that are syntenic in one species are on the same chromosome; genes that are syntenic across species retain the same order on respective chromosomes as a result of descent from a common ancestor ([http://www.answers.com/synteny Answers.com], Jay)

'''synthetase''' - a type of enzyme that creates a new covalent bond and requires direct input of energy from a high-energy phosphate. [http://books.google.com/books?id=bB8XnCykRmIC&pg=PA522&lpg=PA522&dq=%22synthetase+is+an+enzyme%22&source=web&ots=wkws4ksMsg&sig=zWLkDIk7T78hcf9S84nWs3u5Apw&hl=en&sa=X&oi=book_result&resnum=9&ct=result] (Peter)

== T ==
'''transferase''' - an enzyme that catalyzes the transfer of a functional group from one molecule (the donor) to another (the acceptor) [http://en.wikipedia.org/wiki/Transferase] (Matt)

'''transmembrane helix''' - a single transmembrane alpha helix of a transmembrane protein, usually about twenty amino acids in length. They are usually predicted by hydrophobicity. [http://en.wikipedia.org/wiki/Transmembrane_domain](Mary)
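
Prediction by hydrophobicity can be sketched as a sliding-window average over a hydropathy scale. The example uses the Kyte-Doolittle scale; the 19-residue window and the 1.6 cutoff are commonly quoted settings for that scale and are treated here as assumptions, not as the parameters of any specific prediction server.

```python
# Kyte-Doolittle hydropathy values (more positive = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 19) -> list[float]:
    """Average hydropathy of each window; unknown residues count as 0."""
    protein = protein.upper()
    values = [KYTE_DOOLITTLE.get(residue, 0.0) for residue in protein]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def candidate_tm_windows(protein: str, window: int = 19,
                         cutoff: float = 1.6) -> list[int]:
    """Start positions (0-based) of windows whose average exceeds the cutoff."""
    profile = hydropathy_profile(protein, window)
    return [i for i, value in enumerate(profile) if value > cutoff]


print(candidate_tm_windows("DDDDD" + "L" * 19 + "KKKKK"))
```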

'''transposons / transposable elements''' - DNA sequences that can move around to different positions in a single cell's genome. Transposons can cause mutations and change the length of the genome. [http://en.wikipedia.org/wiki/Transposon](Samantha)

'''Transposon Mutagenesis'''-a procedure in which a transposon is inserted into a gene, which inactivates the gene and can lead to the discovery of the phenotype associated with this gene ([http://cancerweb.ncl.ac.uk/cgi-bin/omd?transposon+mutagenesis transposon mutagenesis] Pallavi)

'''Trans-splicing'''-fragmented exon sequences fuse to form a mature species of mRNA. This process results in fusion mRNA ([http://www.representinggenes.org/Glossary.html 8] Pallavi).

'''tRNA splicing endonuclease''' - an enzyme that cleaves intervening sequences of precursor tRNA. [http://cancerweb.ncl.ac.uk/cgi-bin/omd?splicing+endonuclease] (Peter) 

== U ==

== V ==
'''Vertical gene transfer'''-the transmission or absorption of genetic material that is associated with sexual reproduction and, thus, acknowledges species-specific boundaries ([http://www.gmo-compass.org/eng/glossary/#G 6] Pallavi)

== W ==

'''whole genome shotgun sequencing''' - a method of sequencing where DNA is cut into small pieces and cloned into vectors, then both ends of every insert are sequenced for about 500 bp each to form mate pairs. Mate pairs rarely overlap, but are used to reassemble the sequence using software. [http://en.wikipedia.org/wiki/Whole_genome_shotgun](Samantha)
 

== X ==
'''xenolog''' - homologs that are created by horizontal gene transfer between two different species [http://en.wikipedia.org/wiki/Xenolog#Xenology] (Matt) 

== Y ==

== Z ==

Halorhabdus utahensis Genome

2008-11-09T17:11:53Z

SaSimpson: /* Pathway Tutorials */

This page will be used by Davidson College students in the [http://www.bio.davidson.edu/Courses/Bio343/LabMethods.html Genomics Laboratory course].
__NOTOC__
== Links to Multiple Databases ==
*[http://imgweb.jgi-psf.org/cgi-bin/img_edu_v260/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2500575004 JGI IMG EDU] public access
**[[Media:JGIAnnotation.xls|JGI Annotation Excel Spreadsheet]]
*[http://www.tigr.org/tigr-scripts/prok_manatee/shared/login.cgi Manatee at JCVI] use the davidson number sent by email as username and password (database is nthu01 - this is case sensitive)
**[[Media:ManateeAnnotation.xls|Manatee Annotation Excel Spreadsheet]]
*[http://rast.nmpdr.org/ SEED view via RAST] use the username and password combination sent to you by SEED
**[[Media:RastAnnotation.xls|RAST Annotation Excel Spreadsheet]]
**[http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=18261238 RAST Publication in PubMed]
*[http://www.genome.jp/kegg/kaas/ KEGG] We can submit our genes to KEGG to have it mapped out, but SEED and Manatee may already do this. Do we want to ask them to upload it into their database? 
*[http://wishart.biology.ualberta.ca/basys/cache/135af8726ad6f61ec4c5f1e9c4d139ac/index.html BASYs] 
*[http://gcat.davidson.edu/Registry/compare/ Pairwise comparisons of All Three Annotations]

[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI_5contigs.txt JGI Full genome, 5 separate contigs & 3.1 Mbp, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.txt JGI gene DNA sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.xls JGI gene annotations, Excel] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_proteins.txt JGI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_merged.txt JCVI Full genome, 5 contigs fused, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_ORFs.txt JCVI gene sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_proteins.txt JCVI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/GeneLengths.xls 3-way comparison, Excel] 
[[Venn_diagrams]] Venn diagram of 3-way comparison

 

== RNA Genes ==

*[[tRNA Genes Check List]] 
*[[rRNA operon]] 
*[[2 misc. RNA genes]] (short summary list) 
*[[Missing tRNA-trp gene found]] 

== Other Resources ==
*[[Consensus Shine Dalgarno]] Excel File for ''H. utahensis'' 
*[[References]] 
*[[Gene Annotation Template]] 
*[[General Questions]] 
*[[Page for Annotated Genes]] 
*[http://www.bio.davidson.edu/courses/genomics/2008/Win/ec/ Search EC number in RAST, JGI or Manatee] 
*[http://gcat.davidson.edu/Wideloache/Webfiles/ecNumBlast.html Blast an EC number against the H. utahensis genome] 
*[http://gcat.davidson.edu/Wideloache/Webfiles/AnnotationSearcher.html Perform a text-based search of the Rast, JGI, and Manatee protein calls] 

== Research Questions ==
#How do the three systems compare for finding ORFs and RNA genes?
#Is there a pattern of missed genes for any of the 3 sites?
#Do the three systems differ in their ability to find good start codons and Shine-Dalgarno sequences? [We need a standard set of genes for comparison. Only highly conserved or a range of genes?]
# Were Shine-Dalgarno sequences calculated for our species or default values used? If default, what sequence?
#Can we fill any holes in their automated annotation? Is there a mechanism for users to add in genes?
#How do the 3 sites compare for ease of use?
#What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working?
#How does each of the 3 sites compare for pathway detection and visualization?
#Do they find the origin of replication? Can we find it?

* How do the 3 systems compare when one system calls a gene hypothetical and another calls it a functional protein? How can they vary, and which is closer to correct (however you define that, possibly by date of matched entry)? (Pallavi and Mary)
* Why did one system call a gene when the other two did not? (Matt and Lara)
* How do the 3 sites compare for ease of use? What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working? (Samantha and Nick)
* Where is the origin of replication and did the 3 systems attempt to identify this?
* Did the 3 systems utilize Shine-Dalgarno sequences to help them call start codons? Did they utilize our species's consensus Shine-Dalgarno? (Peter)
* We need to fill in the [[Venn diagrams]] for our 3-way comparison. Let's compare the size of ORFs and generate a [[Gene Length Histograms|graph comparing the distributions]] for all 3. (Max and Will - they also take requests).
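
The Venn diagram counts for a 3-way comparison can be produced with plain set arithmetic, as sketched below. The key assumption is how a gene call is identified across systems: here each call is reduced to a hypothetical (strand, stop position) pair, so two systems "agree" on a gene even if they chose different start codons. The example values are made up.

```python
def venn_counts(a: set, b: set, c: set) -> dict[str, int]:
    """Sizes of the seven regions of a three-set Venn diagram."""
    return {
        "only_a": len(a - b - c),
        "only_b": len(b - a - c),
        "only_c": len(c - a - b),
        "a_b": len((a & b) - c),
        "a_c": len((a & c) - b),
        "b_c": len((b & c) - a),
        "all": len(a & b & c),
    }


# Made-up gene calls, each identified by (strand, stop position).
jgi = {("+", 1200), ("+", 3400), ("-", 5600), ("+", 7800)}
manatee = {("-", 5600), ("+", 7800), ("-", 9100)}
rast = {("+", 7800), ("-", 9100), ("+", 9900)}

for region, count in venn_counts(jgi, manatee, rast).items():
    print(region, count)
```

Matching on the stop position keeps start-codon disagreements out of the counts; those can be examined separately by comparing gene lengths for calls that fall in the shared regions.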

<hr>
=Our Favorites=
== My favorite genes==
Pallavi - Monooxygenase vs. Peroxiredoxin

Mary - JGI gene 2500588521 (922976...924046) [[Media:My favorite gene.ppt]]

Max - [http://app.sliderocket.com/app/FullPlayer.aspx?id=f2058b94-845f-4a11-94eb-142f251a7fea JGI gene 2500587636 (2-1849)]

Samantha - JGI gene 2500575882 (80504-80878) [[Media:Earl.ppt]]

Nick - JGI gene 2300587691 (69942...72866) [[Media:Gene presentation.ppt]]

Will - JGI gene 2500590430 (2847205..2854335)

Jay - JGI gene 2500588397 (806410..807321) [http://www.bio.davidson.edu/courses/genomics/2008/McNair/Fav_Gene/FavoriteGenePresentation.pptx Co/Zn/Cd PowerPoint]

Matt - Transcriptional Regulator nrdR (3109722..3110204 + 7274..7765)

Peter - tRNA intron endonuclease [[Media:TRNAtrpintronendonuclease.ppt]]

Laura - 16S Small ribosomal subunit, JGI gene 2500590728 (2397347..2398825)

== My Favorite Pathways==
Pallavi - Carbohydrate Metabolism

Jay - Membrane Transport

Will - Signal Transduction

Max - Energy

Samantha - Purine Metabolism!!!

''Suggestions by Kjeld'' 
'''[[Cellulase]]''' by Pallavi 
I think it would be very interesting to look for genes involved in cellulose degradation: endocellulases, exocellulases (=cellobiohydrolases) and β-glucosidases.
Many cellulose degraders produce a range of each type. A cellulolytic system able to function at 4.6 M NaCl is an interesting one. We either did not observe or did not look for cellulose degradation. However, these systems are normally inducible and you need to test several substrates and inducers. It would be nice to have a compilation of putative “cellulase” genes.
There are several good recent reviews on cellulases (also mentioning E.C. numbers and enzyme families) that your students could consult.

'''[[Chitinase]]''' by Matt 
Apparently you detected a chitinase, but according to our records it does not grow on N-acetyl-glucosamine, which is somewhat strange. It grows on glucose though.

'''[[Lipases]]''' by Mary 
Lipases (/esterases) would also be interesting to look for – some lipases have important industrial applications.

'''[[Amylases]]''' by Samantha 
We did not observe growth on starch. Did you find any “amylase-coding genes”?

'''[[Xylose (glucose) isomerase)]]''' by Nick 
An enzyme of great commercial value.

'''[[Amino acids]]''' led by Laura and assisted by Max, Jay, Nick and Samantha 
According to our records AX-2 is able to grow in a “defined medium”. This is at variance with your “holes” for synthesis of amino acids. However, there could have been some “carry over” of amino acids when inoculating a culture grown in complex medium (e.g. containing yeast extract). However, we are normally aware of this problem and do repeated culturing to dilute out potential growth factors present in yeast extract.

'''[[Proteases]]''' by Peter 
We did not detect protease activity – albeit only checking a few substrates.

'''[[Protein Export]]''' by Malcolm 
We need to know how these proteins might reach outside the cell which is where the food would be and thus the digestive enzymes or importers need to reach the outside world or the cell membrane.

= Student-created tutorials: =
== Tutorials for Annotating Genomes ==

# Will DeLoache - [http://www.bio.davidson.edu/courses/genomics/2008/DeLoache/BioPerlTutorial/BioPerl.htm BioPerl Installation] 
# Max Win - [http://www.bio.davidson.edu/courses/genomics/2008/Win/perl.html Introduction to Perl for non-programmers (with step-by-step explanations, simple exercises, and solutions)] 
# Pallavi - Conserved Domains Database (CDD) [[Media:CDDtutorial.doc]] 
# Mary - Protein Data Bank (PDB) [[Media:PDB Tutorial.doc]] 
# Laura Voss - Pfam Database [http://www.bio.davidson.edu/Courses/Bio343/Pfam_tutorial.doc Pfam Tutorial] 
# Samantha Simpson - [http://www.bio.davidson.edu/courses/genomics/2008/Simpson/Tutorial.html NCBI BLAST] 
# Peter Bakke - [[Media:ShineDalgarnoTutorial.doc]] 
# Jay McNair - [http://www.bio.davidson.edu/courses/genomics/2008/McNair/Origin_Tutorial/OriginTutorial.doc Origin of Replication Tutorial] 
# Nick Carney - Navigating the JGI Database [[Media:NavigatingJGItutorial.doc]] 
# Matt Lotz - SEED Viewer - [[Media:SEEDTutorial.doc]] 
== Pathway Tutorials==
[http://www.pathguide.org/ Pathguide] - a possible source of tutorials and extensive information

[http://www.bigre.ulb.ac.be/Users/didier/pathfinding/ Shortest Path Tool]
<hr>
*Pallavi: I will compare RAST and KEGG in pathway annotations and use Glycolysis/Gluconeogenesis as my example

*Matt: WikiPathways

*Mary: ENZYME

*Samantha: [http://www.bio.davidson.edu/courses/genomics/2008/Simpson/Tutorial2.html How To Determine EC Numbers] 

=Glossary words (A - Z):=
[[#A| A ]] [[#B| B ]] [[#C| C ]] [[#D| D ]] [[#E| E ]] [[#F| F ]] [[#G| G ]] [[#H| H ]] [[#I| I ]] [[#J| J ]] [[#K| K ]] [[#L| L ]] [[#M| M ]] [[#N| N ]] [[#O| O ]] [[#P| P ]] [[#Q| Q ]] [[#R| R ]] [[#S| S ]] [[#T| T ]] [[#U| U ]] [[#V| V ]] [[#W| W ]] [[#X| X ]] [[#Y| Y ]] [[#Z| Z ]]

== A ==
'''Accession Number''' - a unique identifier given to DNA and protein sequences to allow for tracking of sequence information within a single database [http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)] (Will).

'''Antisense (RNA or DNA)'''-a piece of DNA or RNA that binds to a complementary sequence of DNA or RNA. These segments of genetic material can be used to identify the existence of a disease gene and they can also be used to bind to specific DNA or mRNA sequences to inhibit their function ([http://biotech.fyicenter.com/glossary/Bioinformatics_Glossary.html 5] Pallavi).

'''Arabidopsis thaliana''' - the scientific name for the thale cress plant; it was the first plant to have its genome sequenced, and is a model organism for understanding plant biology and genetics ([http://en.wikipedia.org/wiki/Thale_cress Wikipedia.org], Jay)

== B ==
'''BAC''' - bacterial artificial chromosome, a DNA construct used for transforming or cloning segments of DNA and often used to sequence the genetic code of organisms ([http://en.wikipedia.org/wiki/Bacterial_artificial_chromosome Wikipedia.org], Jay)

'''bioinformatics''' - the multi-disciplinary approach of using biology, computer science and mathematics to solve or better understand biological problems [http://en.wikipedia.org/wiki/Bioinformatics] (Matt)

'''BLAST''' - (Basic Local Alignment Search Tool) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. [http://blast.ncbi.nlm.nih.gov/Blast.cgi] (Mary)

'''bioperl'''- a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications such as accessing sequence data from local and remote databases, converting between database formats, manipulating individual sequences, searching for similar sequences, searching for genes and other structures on genomic DNA, or developing machine-readable sequence annotations. [http://en.wikipedia.org/wiki/BioPerl] (Wikipedia, Max Win)

== C ==
'''carbon fixation''' - using carbon dioxide to create organic materials [http://en.wikipedia.org/wiki/Carbon_fixation] (Samantha) 

'''CDD''' (Conserved Domains Database)- a database used to identify the conserved domains present in a protein query sequence [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml] (Mary)

'''chaperonin''' - a protein complex that assists some newly formed polypeptide chains by folding them into their final, functional, three-dimensional form [http://en.wikipedia.org/wiki/Chaperonins] (Matt)

'''chemotaxis''' - the process in which cells seek out or flee from a high concentration of certain chemicals; it is found in both uni- and multicellular organisms. Unicellular organisms use it to avoid toxins or find food, while multicellular organisms use it for tasks such as reproduction [http://en.wikipedia.org/wiki/Chemotaxis] (Nick)

'''chemotaxonomy''' - the attempt to classify and identify organisms according to demonstrable differences and similarities in their biochemical compositions [http://en.wikipedia.org/wiki/Chemotaxonomy] (Mary)

'''ClustalW''' - A web-based or command line tool that performs multiple sequence alignments to determine evolutionary relationships between three or more sequences [http://en.wikipedia.org/wiki/Clustal] (Will).

'''COG''' (Clusters of Orthologous Groups)- each COG corresponds to a highly conserved domain and generally consists of either individual proteins or groups of paralogs ([http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml COG] Pallavi) 

'''concatemer''' - long continuous DNA molecule that contains the same DNA sequence repeated in series [http://en.wikipedia.org/wiki/Concatemer](Samantha) 

'''contigs''' (contiguous DNA)- overlapping DNA segments that as a collection form a longer and gapless segment of DNA. (Discovery Genomics, Proteomics and Bioinformatics [http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''coverage''' - refers to the number of times, on average, any piece of DNA in a sequenced genome has been individually sequenced (Lecture, Jay)
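
As a small worked example of the definition above, average coverage can be estimated as (number of reads × read length) / genome size. The numbers below are hypothetical and are not taken from our sequencing project.

```python
def average_coverage(num_reads, read_length, genome_size):
    """Average number of times each base is sequenced: (N * L) / G."""
    return num_reads * read_length / genome_size

# Hypothetical project: 48,000 reads of 500 bp over a 3,000,000 bp genome
coverage = average_coverage(48_000, 500, 3_000_000)
```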

'''CPAN (Comprehensive Perl Archive Network)''' - an archive of over 12,200 modules of software written in Perl, as well as documentation for it. It contains a module called CPAN (or CPAN.pm) which is used as an installer for Perl modules such as BioPerl [http://en.wikipedia.org/wiki/CPAN](Will).

'''Cytogenetics'''-the study of normal and abnormal chromosomes. This involves studying the causes of chromosomal abnormalities and looking at the structure of chromosomes ([http://www.vivo.colostate.edu/hbooks/genetics/medgen/chromo/index.html 7] Pallavi).

== D ==
'''''de novo'' synthesis''' - the synthesis of complex molecules from simple molecules (e.g. sugars and nucleotides), rather than from recycled molecules; from the Latin "of the new" [http://en.wikipedia.org/wiki/De_novo_synthesis] (Matt)

'''dehydrogenase''' - a type of enzyme that oxidizes a substrate by transferring one or more protons and a pair of electrons to an acceptor. [http://en.wikipedia.org/wiki/Dehydrogenase] (Peter)

'''diatom''' - a major group of eukaryotic algae, and one of the most common types of phytoplankton. A characteristic feature of diatom cells is that they are encased within a unique cell wall made of silica called a frustule. These frustules show a wide diversity in form, but usually consist of two asymmetrical sides with a split between them. [http://en.wikipedia.org/wiki/Diatom] (Mary)

'''domain (protein)''' - the structural and functional groups of a protein, which can exist independently of the protein itself. Domains typically perform a specific function, such as binding to promoters or substrates, and many proteins can have one or several domains in common. Evolutionarily-linked proteins are more likely to have domains in common. Domains are used to organize proteins into families. ([http://en.wikipedia.org/wiki/Domain_(protein) Wikipedia article], Laura)

'''dot plot'''-graphical display comparing sequence conservation between two genomes with dots indicating strings of identical bases. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
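
A minimal sketch of how a dot plot can be computed, assuming exact matches of a fixed word length; the function names and the default word length are illustrative choices, and real tools use much faster indexing than this double loop.

```python
def dot_plot(seq_a, seq_b, word=3):
    """Return (i, j) positions where a word of length `word` matches exactly."""
    dots = []
    for i in range(len(seq_a) - word + 1):
        for j in range(len(seq_b) - word + 1):
            if seq_a[i:i + word] == seq_b[j:j + word]:
                dots.append((i, j))
    return dots

def render(seq_a, seq_b, word=3):
    """Draw the plot as text: '*' for a matching word, '.' otherwise."""
    dots = set(dot_plot(seq_a, seq_b, word))
    rows = []
    for i in range(len(seq_a) - word + 1):
        rows.append("".join("*" if (i, j) in dots else "."
                            for j in range(len(seq_b) - word + 1)))
    return "\n".join(rows)

# Comparing a sequence to itself gives an unbroken diagonal
self_dots = dot_plot("ACGTAC", "ACGTAC")
```

Conserved regions show up as diagonal runs of dots; a diagonal running the other way suggests an inversion.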

== E ==

'''EC number''' (Enzyme Commission Number)- a numerical classification scheme for enzymes, based on the chemical reactions they catalyze [http://en.wikipedia.org/wiki/EC_number] (Mary)

'''E-value''' (Expect value)- When performing a BLAST search, you will obtain an E-value for each sequence that is retrieved. An E-value is the number of hits with an equal or better score that you would expect to find by chance in a database of that size; the smaller the E-value, the less likely the match is due to chance. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
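
To make the relationship between score and E-value concrete, here is a sketch based on the standard BLAST statistics, where E = m × n × 2<sup>−S'</sup> for a bit score S', query length m, and database length n. Using raw lengths instead of BLAST's "effective" lengths is a simplification, and the numbers are hypothetical.

```python
import math

def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)

def p_value(e_value):
    """Probability of at least one chance hit, given the E-value."""
    return 1.0 - math.exp(-e_value)

# Hypothetical search: 300-residue query against a 1,000,000,000-residue database
e40 = expect_value(40, 300, 10**9)
e41 = expect_value(41, 300, 10**9)
```

Each additional bit of score halves the E-value, and for small E-values the P-value is nearly equal to the E-value.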

'''Extremophile''' - an organism that thrives in and may even require physically or geochemically extreme conditions that are detrimental to the majority of life on Earth [http://en.wikipedia.org/wiki/Extremophile] (Will).

== F ==

'''FASTA format''' - a format used to convey either nucleic acid sequences or peptide sequences, in which base pairs or amino acids are represented by single-letter codes. The sequence name and other descriptors often precede the amino acid sequence. [http://en.wikipedia.org/wiki/FASTA_format] (Nick) 
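
A minimal sketch of reading FASTA-formatted text, assuming each record starts with a ">" header line and that the first word of the header is the sequence name. The record names below are made up; for real work a library such as BioPerl or Biopython would be the usual choice.

```python
def parse_fasta(text):
    """Return {name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: keep the first word as the record name
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

example = """>gene1 hypothetical protein
ATGAAACCC
GGGTAA
>gene2
ATGTAG
"""
sequences = parse_fasta(example)
```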

'''family (protein)''' - a group of evolutionarily-related proteins, often with one or several domains in common. Families are organized by domain overlap, structural/functional similarity, and sequence similarity. ([http://en.wikipedia.org/wiki/Protein_family Wikipedia article] and lecture, Laura)

'''finished genome''' - a genome that has been sequenced at least partly by hand, resulting in at least 99.99% sequence accuracy (Lecture, Jay) 

'''fusion mRNA'''-mRNA that results from the transcription of a gene after a chromosomal translocation event. This results in an mRNA sequence that comes from two different genes (Rowley and Blumenthal 2008 ''Science'' Pallavi)

== G ==

'''GC Content''' - the percentage of bases within a certain sequence of DNA (e.g. a gene or a genome) that are either guanine or cytosine; a higher GC content is characteristic of a coding region of a gene; differences in GC content between a gene and a genome can be used as evidence for horizontal gene transfer [http://en.wikipedia.org/wiki/GC-content] (Matt) 
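
The calculation itself is simple; a minimal sketch, assuming the sequence is given as a plain string of A, C, G, and T:

```python
def gc_content(seq):
    """Percentage of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

example_gc = gc_content("GGCCAATT")
```

Comparing the value for a single gene against the value for the whole genome is one way to flag candidates for horizontal gene transfer.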

'''GC-skew''' – uneven distribution of guanine and cytosine bases between the two strands of DNA where GC base pairs occur. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
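
A minimal sketch of a GC-skew calculation, using the common definition (G − C) / (G + C) per window on one strand. The window size is an arbitrary choice. In many bacteria the minimum of the cumulative skew lies near the origin of replication, but archaea can have more than one origin, so treat the result as a hint rather than an answer.

```python
def gc_skew(window):
    """(G - C) / (G + C) for one window of sequence."""
    g = window.count("G")
    c = window.count("C")
    return (g - c) / (g + c) if g + c else 0.0

def cumulative_skew(seq, window=1000):
    """Running sum of per-window GC skew; returns [(window_start, total), ...]."""
    seq = seq.upper()
    total = 0.0
    out = []
    for start in range(0, len(seq) - window + 1, window):
        total += gc_skew(seq[start:start + window])
        out.append((start, total))
    return out

# Tiny illustrative sequence with a G-rich half followed by a C-rich half
skew_curve = cumulative_skew("GGGGCCCC", window=4)
```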

'''gene amplification''' - production of multiple copies of a gene in order to amplify the amount of protein that the gene encodes for [http://www.medterms.com/script/main/art.asp?articlekey=13537] [http://www.answers.com/topic/gene-amplification] (Matt)

'''gene fusion'''-occurs when DNA segments of two different genes come together. Can result in hybrid proteins ([http://www.biochem.northwestern.edu/holmgren/Glossary/Definitions/Def-G/gene_fusion.html 9] Pallavi)

'''gene knockout''' - a process in which a gene is deactivated within a test organism in order to better understand the function of the gene in that organism [http://en.wikipedia.org/wiki/Gene_knockout] (Matt)

'''gene ontology'''- a collaborative effort of investigators to unify and standardize terms associated with the role a gene or protein plays in an organism. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''glaucophyte''' - freshwater algae that have not been studied well [http://en.wikipedia.org/wiki/Glaucophyte](Samantha) 

== H ==

'''haemolysin or hemolysin''' - a chemical produced by a bacterium that causes lysis of red blood cells [http://en.wikipedia.org/wiki/Hemolysis_(microbiology)] (Nick)

'''halophile''' - an organism, most often of the Archaea domain, that lives in environments containing high concentrations of salt [http://en.wikipedia.org/wiki/Halophile] (Matt)

'''haplotype'''-collection of alleles that travel together (Lecture, Pallavi)

'''haptophyte''' - phylum of algae [http://en.wikipedia.org/wiki/Haptophyte](Samantha)

'''heterokont''' - major line of eukaryotes consisting of about 10,500 known species, most of which are algae [http://en.wikipedia.org/wiki/Heterokont](Samantha)

'''Hidden Markov Model''' - a statistical model used in protein recognition databases such as Pfam. A Hidden Markov Model keeps track of several variables and possible variations thereof, such as the possible amino acid sequences that make up a protein domain (since there can be some variance in an amino acid sequence) or the variations in the component sounds that make up a word, and uses those points to match a given sequence to the word, domain, or other complex sequence it most closely matches. An HMM in speech recognition software, for example, can identify that a certain set of sounds make up a certain word, even with the variations in pronunciation and accent that different people will give those sounds. ([http://en.wikipedia.org/wiki/Hidden_Markov_Model Wikipedia] and lecture, Laura)

'''HMM Logo''' - a graphical representation of an HMM, detailing the possible amino acid sequences, the relative frequencies and probabilities of each amino acid in the sequence, the relative contribution each amino acid has to the overall protein family, and the charge or nature of the amino acids themselves. ([http://www.sanger.ac.uk/Software/analysis/logomat-m/help.shtml How to read HMM Logos, on Pfam], Laura)

'''homeobox''' - a DNA sequence within transcription factor genes that allows the cell to respond to patterns of development by having the transcription factors switch on gene cascades [http://en.wikipedia.org/wiki/Homeobox](Samantha)

'''homodimer''' - a protein made of paired identical polypeptides ([http://www.answers.com/topic/homodimer Answers.com], Jay)

'''horizontal gene transfer'''-DNA transmission between species and incorporation of the DNA into the recipient's genome ([http://www.csrees.usda.gov/nea/biotech/res/biotechnology_res_glossary.html horizontal gene transfer] Pallavi)

'''''Hox'' gene'''-a gene that contains a homeobox region that is involved in morphogenesis along the cranio-caudal body axis ([http://www.uprightape.net/UA_Glossary.html 4] Pallavi)

'''hydrolase''' - an enzyme that catalyzes hydrolysis, the cleavage of a chemical bond by the addition of water [http://en.wikipedia.org/wiki/Hydrolase] (Nick)

== I ==

'''ideogram''' - in genomics, usually describes a stylized representation of a chromosome with banding patterns (Campbell-Heyer Genomics textbook, Jay)

'''identities''' - in a BLAST output, the number and fraction of total residues which are identical in a given alignment [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)

'''indole'''-a chemical compound that is produced from the break down of tryptophan ([http://medical-dictionary.thefreedictionary.com/indole indole] Pallavi)

'''inclusion body''' - Inclusion bodies are collections of stainable substances, usually proteins, that are found either in the nucleus or the cytoplasm. It is thought that these bodies are often the result of viral proteins that misfolded [http://en.wikipedia.org/wiki/Inclusion_body] (Nick)

'''intron''' - a region of DNA in a gene that is not part of the final coding sequence for the protein. [http://en.wikipedia.org/wiki/Intron] (Peter)

'''isoelectric point''' - the pH at which a molecule carries no net electrical charge [http://en.wikipedia.org/wiki/Isoelectric_point] (Nick)

'''isozymes''' - members of a gene family with very similar cellular roles (Campbell-Heyer Genomics textbook, Jay)

== J ==

== K ==
'''KEGG (Kyoto Encyclopedia of Genes and Genomes)''' - a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The Pathway database records networks of molecular interactions in the cells, and variants of them specific to particular organisms [http://en.wikipedia.org/wiki/KEGG](Will).

'''kinase''' - a type of enzyme that transfers a phosphate group from a high-energy donor molecule to a target molecule in a process called phosphorylation. [http://en.wikipedia.org/wiki/Kinase] (Peter)

== L ==

== M ==
'''Manatee''' - a web-based gene evaluation and genome annotation tool that can view, modify, and store annotation for prokaryotic and eukaryotic genomes. This ongoing, open source initiative was developed with two missions: first, to give biologists the ability to functionally annotate their genomes using a powerful, stand-alone web application with a robustly designed relational annotation database; and second, to give outside developers the opportunity to contribute their own ideas and requirements to enhance Manatee's ability to accomplish biological goals [http://manatee.sourceforge.net/](Will). 

'''microsatellites'''-stretches of repetitive, short DNA segments that can be used to track the inheritance of certain traits within families ([http://www.clanlindsay.com/genetic_dna_glossary.htm 3] Pallavi)

'''minisatellites'''-segments of DNA that can be used for individual identification (ex. DNA fingerprinting) or in determining relationships between people (ex. paternity cases) ([http://www.clanlindsay.com/genetic_dna_glossary.htm 2] Pallavi).

'''motif''' - a sequence of amino acids or nucleotides that performs a particular role and is often conserved in other species or molecules. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''mycoplasma''' - genus of bacteria that lack a cell wall [http://en.wikipedia.org/wiki/Mycoplasma] (Nick)

== N ==

'''NORFs''' (nonannotated open reading frames) - open reading frames that were considered not to be real genes when the genome was annotated. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''nucleomorph''' - reduced eukaryotic nuclei found in plastids [http://en.wikipedia.org/wiki/Nucleomorph](Samantha)

== O ==
'''object-oriented programming''' - a programming paradigm in which collections of data, associated with operations on that data, are modularly defined and then built upon (CSC 121 Lecture, Will).

'''open reading frame (ORF)'''-a segment of DNA that can potentially encode for a protein and it begins with a start codon (usually ATG) [http://www.fao.org/DOCREP/003/X3910E/X3910E18.htm ORF] (Pallavi)
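
A minimal sketch of an ORF search on the forward strand, assuming ORFs start with ATG and end at the first in-frame stop codon (TAA, TAG, or TGA). The minimum length is an arbitrary cutoff, alternative start codons such as GTG and TTG are ignored, and the reverse strand could be covered by running the same function on the reverse complement.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start is 0-based, end is exclusive and includes the stop codon;
    min_codons counts codons from ATG up to (not including) the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# Illustrative sequence: ATG AAA CCC GGG TAA embedded in frame 2
orfs = find_orfs("CCATGAAACCCGGGTAACC")
```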

'''operon''' - a segment of DNA involving an operator, promoter, and one or more genes that operate as a single unit during transcription [http://en.wikipedia.org/wiki/Operon] (Nick)

'''optical mapping'''-DNA sequences of the organism in question are compared against a karyotype that specifically looks at restriction sites found within the DNA to correctly order the DNA sequences on a chromosome. This methodology gives very detailed haplotype information and allows for the detection of sequence variations across an entire genome [http://www.geocities.com/bioinformaticsweb/genomicglossary.html optical mapping] (Pallavi)

'''ortholog'''-a gene in one species that is related by descent to a gene in another species; the two copies diverged from a single ancestral gene through speciation and often look very similar (Lecture, Pallavi)

'''oxidoreductase''' - an enzyme that catalyzes redox reactions by transferring electrons from one molecule (the reductant) to another (the oxidant) [http://en.wikipedia.org/wiki/Oxidoreductase] (Nick)

== P ==

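'''paralog'''-a gene related to another gene in the same genome through a duplication event; paralogs are similar in sequence but not necessarily identical (Lecture, Pallavi)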

'''p-arm''' - the shorter arm of a chromosome's two arms separated by the centromere (compare to q-arm, the longer arm) ([http://www.medterms.com/script/main/art.asp?articlekey=4715 MedTerms Dictionary], Jay)

'''Perl''' - Developed by Larry Wall in 1987, Perl is a [http://en.wikipedia.org/wiki/High-level_programming_language high-level programming language] used frequently by biologists and bioinformaticists [http://en.wikipedia.org/wiki/Perl] (Will).

'''periplasmic space''' - the space between the inner cytoplasmic membrane and external outer membrane in bacteria or archaea. [http://en.wikipedia.org/wiki/Periplasmic_space] (Peter)

'''Pfam''' - a database for protein domain families that matches amino acid sequences or nucleotide sequences to the related group of proteins to which they most likely belong. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura)

'''plasmid''' - an extra-chromosomal DNA molecule that is capable of replicating independently of the chromosomal DNA. Commonly found in bacteria and archaea. [http://en.wikipedia.org/wiki/Plasmid](Peter)

'''plastid''' - major organelles in plants or algae [http://en.wikipedia.org/wiki/Plastid](Samantha)

'''pleomorphism''' - the occurrence of two or more structural forms during a life cycle [http://en.wikipedia.org/wiki/Pleomorphism] (Mary)

'''phylogenetic tree''' - a diagram showing the evolutionary relationships between biological species that are thought to share a common ancestor [http://en.wikipedia.org/wiki/Phylogenetic_tree] (Nick)

'''phylotypes''' – a term intended to resolve the challenge of “species” when classifying prokaryotes using DNA sequence comparisons. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''positives''' - in a BLAST output, the number and fraction of residues for which the alignment scores have positive rather than negative values [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)

'''proteome''' - entire set of proteins expressed by a genome, cell, tissue, or organism. It may refer to expressed proteins under certain conditions [http://en.wikipedia.org/wiki/Proteome](Samantha)

'''PSORT''' - a prediction server that judges where a mature protein could be in the cell, based on its transmembrane domains, its predicted mature amino acid composition, and its signal sequences. ([http://psort.ims.u-tokyo.ac.jp/form.html PSORT], Laura)

'''pseudogenes'''-sequences of DNA that look like genes but most likely contain many stop codons. A pseudogene may have evolved away from a real gene, or a paralog might have taken its place (Lecture, Pallavi)

'''purine''' - a category of nitrogenous base consisting of a pyrimidine ring fused to an imidazole ring. Notable purine bases are adenine and guanine. [http://en.wikipedia.org/wiki/Purine] (Peter)

'''pyrimidine''' - a category of nitrogenous base consisting of a heterocyclic aromatic ring containing two nitrogen atoms at positions 1 and 3 of the six-member ring. Notable pyrimidine bases are cytosine, thymine, and uracil. [http://en.wikipedia.org/wiki/Pyrimidine] (Peter)

== Q ==

'''q-arm''' - the longer arm of a chromosome's two arms separated by the centromere (compare to p-arm, the shorter arm) ([http://www.medterms.com/script/main/art.asp?articlekey=5152 MedTerms Dictionary], Jay) 

'''query sequence''' - the sequence (whether amino acid or nucleotide) entered into a database’s search function and checked against the database entries. ([http://en.wikipedia.org/wiki/BLAST BLAST on Wikipedia], Laura)

== R ==

'''RAST''' - (Rapid Annotation using Subsystem Technology)- a fully-automated service for annotating bacterial and archaeal genomes. It provides high quality genome annotations for these genomes across the whole phylogenetic tree. ([http://rast.nmpdr.org/], Max Win)

'''rDNA'''-These are DNA sequences that encode for ribosomal RNA. Note that rDNA can also stand for recombinant DNA. ([http://en.wikipedia.org/wiki/Ribosomal_DNA rDNA] Pallavi)

'''residue (protein)''' - the remaining portion of an amino acid after a water molecule has been removed and it has been incorporated into a protein. Functional residues, referred to in Pfam, are the residues that perform some specific identifiable function or are part of a domain, and can be conserved across evolutionarily-related proteins. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura) 

'''retropseudogenes'''-these are genes that have been reverse-transcribed from mRNA and the resulting DNA sequence is incorporated back into the genome. They are non-functional segments of DNA and can be distinguished from pseudogenes in that they do not have intron sequences. ([http://genome.cshlp.org/cgi/content/full/10/5/672 1] Pallavi)

'''retrotransposons''' - RNA transcribed back into DNA and added into the genome [http://en.wikipedia.org/wiki/Retrotransposon](Samantha)

'''ribonuclease''' - a nuclease that catalyzes the degradation of RNA into smaller components [http://en.wikipedia.org/wiki/Ribonuclease] (Mary)

== S ==
'''Serovar'''-a subdivision of a species based on the characteristics of their cell surface antigens ([http://www.biology-online.org/dictionary/Serovar serovar] Pallavi)

'''scaffold''' - a section of a sequenced genome composed of contigs that are in the right order but not necessarily connected ([http://www.medterms.com/script/main/art.asp?articlekey=25223 MedTerms Dictionary], Jay)

'''"Shadow enhancers"'''-secondary enhancers that are thought to be important for natural selection to occur in regulatory DNA segments. They evolve much faster than primary enhancers, which suggests that they are under fewer functional constraints (Wray and Babbit 2008 ''Science'' Pallavi)

'''Shine-Dalgarno sequence''' - A ribosomal binding site on an mRNA, usually a sequence of about six bases located six or seven bases upstream of the start codon. An anti-Shine-Dalgarno sequence exists on the rRNA in the small subunit of the ribosome; when the two sequences pair, the mRNA is lined up and prepared for translation. (Lecture and [http://en.wikipedia.org/wiki/Shine-dalgarno Wikipedia article], Laura) 
Note: The Shine-Dalgarno consensus sequence for our genome is ccGGAGGt.
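
A minimal sketch of checking whether a called start codon has a Shine-Dalgarno motif upstream. It searches for the core GGAGG from the consensus noted above; the exact-match requirement and the allowed spacing of 4 to 12 bases are assumptions made for illustration, not values taken from any of the three annotation systems.

```python
def find_sd(seq, start_codon_pos, core="GGAGG", min_gap=4, max_gap=12):
    """Look upstream of a start codon for the core motif.

    Returns (motif_start, gap), where gap is the number of bases between
    the end of the motif and the start codon, or None if nothing is found.
    """
    seq = seq.upper()
    for gap in range(min_gap, max_gap + 1):
        motif_end = start_codon_pos - gap      # index just past the motif
        motif_start = motif_end - len(core)
        if motif_start >= 0 and seq[motif_start:motif_end] == core:
            return motif_start, gap
    return None

# Illustrative sequence: consensus CCGGAGGT, a 7-base spacer, then ATG at index 16
hit = find_sd("TTCCGGAGGTAACAATATGAAA", 16)
```

Running this for the alternative start codons proposed by each system would show which call sits at a more plausible distance from the motif.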

'''SignalP''' - a prediction server that judges whether or not a query protein contains a signal peptide. SignalP measures each amino acid against the amino acid sequences of probable signal peptide matches and predicts the cleavage site of the signal peptide. ([http://www.cbs.dtu.dk/services/SignalP-3.0/output.php SignalP Output explained], Laura)

'''signal peptide''' - a short peptide chain that directs the post-translational transport of a protein [http://en.wikipedia.org/wiki/Signal_peptide] (Matt)

'''Smith-Waterman alignment''' - A well-known algorithm for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure [http://en.wikipedia.org/wiki/Smith_waterman](Will).
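
A minimal sketch of the Smith-Waterman algorithm with a simple linear gap penalty. The scoring values (match +2, mismatch −1, gap −2) are arbitrary choices for illustration; real searches use substitution matrices such as BLOSUM62 and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores never drop below zero (this makes it local)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

# The shared segment ACGT is recovered from two otherwise unrelated sequences
result = smith_waterman("TTTACGTTT", "GGACGTCC")
```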

'''SNP (Single Nucleotide Polymorphism)''' - a DNA sequence variation occurring when a single nucleotide in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual) [http://en.wikipedia.org/wiki/Single_nucleotide_polymorphism](Will).

'''symporter''' - an integral membrane protein that is involved in movement of two or more different molecules or ions across a phospholipid membrane in the same direction. [http://en.wikipedia.org/wiki/Symporter] (Peter)

'''synteny''' - a neologism from the Greek for "on the same ribbon". Genes that are syntenic in one species are on the same chromosome; genes that are syntenic across species retain the same order on respective chromosomes as a result of descent from a common ancestor ([http://www.answers.com/synteny Answers.com], Jay)

'''synthetase''' - a type of enzyme that creates a new covalent bond and requires direct input of energy from a high-energy phosphate. [http://books.google.com/books?id=bB8XnCykRmIC&pg=PA522&lpg=PA522&dq=%22synthetase+is+an+enzyme%22&source=web&ots=wkws4ksMsg&sig=zWLkDIk7T78hcf9S84nWs3u5Apw&hl=en&sa=X&oi=book_result&resnum=9&ct=result] (Peter)

== T ==
'''transferase''' - an enzyme that catalyzes the transfer of a functional group from one molecule (the donor) to another (the acceptor) [http://en.wikipedia.org/wiki/Transferase] (Matt)

'''transmembrane helix''' - a single transmembrane alpha helix of a transmembrane protein, usually about twenty amino acids in length. They are usually predicted by hydrophobicity. [http://en.wikipedia.org/wiki/Transmembrane_domain](Mary)

'''transposons / transposable elements''' - DNA sequences that can move around to different positions in a single cell's genome. Transposons can cause mutations and change the length of the genome. [http://en.wikipedia.org/wiki/Transposon](Samantha)

'''Transposon Mutagenesis'''-a procedure in which a transposon is inserted into a gene, which inactivates the gene and can lead to the discovery of the phenotype associated with this gene ([http://cancerweb.ncl.ac.uk/cgi-bin/omd?transposon+mutagenesis transposon mutagenesis] Pallavi)

'''Trans-splicing'''-fragmented exon sequences fuse to form a mature species of mRNA. This process results in fusion mRNA ([http://www.representinggenes.org/Glossary.html 8] Pallavi).

'''tRNA splicing endonuclease''' - an enzyme that cleaves intervening sequences of precursor tRNA. [http://cancerweb.ncl.ac.uk/cgi-bin/omd?splicing+endonuclease] (Peter) 

== U ==

== V ==
'''Vertical gene transfer'''-the transmission or absorption of genetic material that is associated with sexual reproduction and, thus, acknowledges species-specific boundaries ([http://www.gmo-compass.org/eng/glossary/#G 6] Pallavi)

== W ==

'''whole genome shotgun sequencing''' - a method of sequencing in which DNA is cut into small pieces and cloned into vectors, then about 500 bp is sequenced from each end of every cloned insert to form mate pairs. Mate pairs rarely overlap, but software uses them to reassemble the sequence. [http://en.wikipedia.org/wiki/Whole_genome_shotgun](Samantha)
 

== X ==
'''xenolog''' - homologs that are created by horizontal gene transfer between two different species [http://en.wikipedia.org/wiki/Xenolog#Xenology] (Matt) 

== Y ==

== Z ==

Halorhabdus utahensis Genome

2008-11-06T16:10:55Z

SaSimpson: /* Pathway Tutorials */

This page will be used by Davidson College students in the [http://www.bio.davidson.edu/Courses/Bio343/LabMethods.html Genomics Laboratory course].
__NOTOC__
== Links to Multiple Databases ==
*[http://imgweb.jgi-psf.org/cgi-bin/img_edu_v260/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2500575004 JGI IMG EDU] public access
**[[Media:JGIAnnotation.xls|JGI Annotation Excel Spreadsheet]]
*[http://www.tigr.org/tigr-scripts/prok_manatee/shared/login.cgi Manatee at JCVI] use the davidson number sent by email as username and password (database is nthu01 - this is case sensitive)
**[[Media:ManateeAnnotation.xls|Manatee Annotation Excel Spreadsheet]]
*[http://rast.nmpdr.org/ SEED view via RAST] use the username and password combination sent to you by SEED
**[[Media:RastAnnotation.xls|RAST Annotation Excel Spreadsheet]]
**[http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=18261238 RAST Publication in PubMed]
*[http://www.genome.jp/kegg/kaas/ KEGG] We can submit our genes to KEGG to have it mapped out, but SEED and Manatee may already do this. Do we want to ask them to upload it into their database? 
*[http://wishart.biology.ualberta.ca/basys/cache/135af8726ad6f61ec4c5f1e9c4d139ac/index.html BASYs] 
*[http://gcat.davidson.edu/Registry/compare/ Pairwise comparisons of All Three Annotations]

[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI_5contigs.txt JGI Full genome, 5 separate contigs & 3.1 Mbp, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.txt JGI gene DNA sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.xls JGI gene annotations, Excel] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_proteins.txt JGI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_merged.txt JCVI Full genome, 5 contigs fused, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_ORFs.txt JCVI gene sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_proteins.txt JCVI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/GeneLengths.xls 3-way comparison, Excel] 
[[Venn_diagrams]] Venn diagram of 3-way comparison

 

== RNA Genes ==

*[[tRNA Genes Check List]] 
*[[rRNA operon]] 
*[[2 misc. RNA genes]] (short summary list) 
*[[Missing tRNA-trp gene found]] 

== Other Resources ==
*[[Consensus Shine Dalgarno]] Excel File for ''H. utahensis'' 
*[[References]] 
*[[Gene Annotation Template]] 
*[[General Questions]] 
*[[Page for Annotated Genes]] 
*[http://www.bio.davidson.edu/courses/genomics/2008/Win/ec/ Search EC number in RAST, JGI or Manatee] 
*[http://gcat.davidson.edu/Wideloache/Webfiles/ecNumBlast.html Blast an EC number against the H. utahensis genome] 

== Research Questions ==
#How do the three systems compare for finding ORFs and RNA genes?
#Is there a pattern of missed genes for any of the 3 sites?
#Do the three systems differ in their ability to find good start codons and Shine-Dalgarno sequences? [We need a standard set of genes for comparison. Only highly conserved or a range of genes?]
# Were Shine-Dalgarno sequences calculated for our species or default values used? If default, what sequence?
#Can we fill any holes in their automated annotation? Is there a mechanism for users to add in genes?
#How do the 3 sites compare for ease of use?
#What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working?
#How does each of the 3 sites compare for pathway detection and visualization?
#Do they find the origin of replication? Can we find it?

* How do the 3 systems compare when one system calls a gene hypothetical and another calls it a functional protein? How can they vary, and which is closer to correct (however you define that, possibly by date of matched entry)? (Pallavi and Mary)
* Why did one system call a gene when the other two did not? (Matt and Lara)
* How do the 3 sites compare for ease of use? What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working? (Samantha and Nick)
* Where is the origin of replication and did the 3 systems attempt to identify this?
* Did the 3 systems utilize Shine-Dalgarno sequences to help them call start codons? Did they utilize our species's consensus Shine-Dalgarno? (Peter)
* We need to fill in the [[Venn diagrams]] for our 3-way comparison. Let's compare the size of ORFs and generate a [[Gene Length Histograms|graph comparing the distributions]] for all 3. (Max and Will - they also take requests).

<hr>
=Our Favorites=
== My favorite genes==
Pallavi - Monooxygenase vs. Peroxiredoxin

Mary - JGI gene 2500588521 (922976...924046) [[Media:My favorite gene.ppt]]

Max - [http://app.sliderocket.com/app/FullPlayer.aspx?id=f2058b94-845f-4a11-94eb-142f251a7fea JGI gene 2500587636 (2-1849)]

Samantha - JGI gene 2500575882 (80504-80878) [[Media:Earl.ppt]]

Nick - JGI gene 2300587691 (69942...72866) [[Media:Gene presentation.ppt]]

Will - JGI gene 2500590430 (2847205..2854335)

Jay - JGI gene 2500588397 (806410..807321) [http://www.bio.davidson.edu/courses/genomics/2008/McNair/Fav_Gene/FavoriteGenePresentation.pptx Co/Zn/Cd PowerPoint]

Matt - Transcriptional Regulator nrdR (3109722..3110204 + 7274..7765)

Peter - tRNA intron endonuclease [[Media:TRNAtrpintronendonuclease.ppt]]

Laura - 16S Small ribosomal subunit, JGI gene 2500590728 (2397347..2398825)

== My Favorite Pathways==
Pallavi - Carbohydrate Metabolism

Jay - Membrane Transport

Will - Signal Transduction

Max - Energy

Samantha - Purine Metabolism!!!

''Suggestions by Kjeld'' 
'''[[Cellulase]]''' by Pallavi 
I think it would be very interesting to look for genes involved in cellulose degradation: endocellulases, exocellulases (=cellobiohydrolases) and b-glucosidases.
Many cellulose degraders produce a range of each type. A cellulolytic system able to function at 4.6 M of NaCl is an interesting one. We either did not observe or did not look for cellulose degradation. However, these systems are normally inducible and you need to test several substrates and inducers. It would be nice to have a compilation of putative “cellulase” genes.
There are several good recent reviews on cellulases (also mentioning E.C. numbers and enzyme families) that your students could consult.

'''[[Chitinase]]''' by Matt 
Apparently you detected a chitinase, but according to our records it does not grow on N-acetyl-glucosamine, which is somewhat strange. It grows on glucose though.

'''[[Lipases]]''' by Mary 
Lipases (/esterases) would also be interesting to look for – some lipases have important industrial applications.

'''[[Amylases]]''' by Samantha 
We did not observe growth on starch. Did you find any “amylase-coding genes”?

'''[[Xylose (glucose) isomerase)]]''' by Nick 
An enzyme of great commercial value.

'''[[Amino acids]]''' led by Laura and assisted by Max, Jay, Nick and Samantha 
According to our records AX-2 is able to grow in a “defined medium”. This is at variance with your “holes” for synthesis of amino acids. However, there could have been some “carry over” of amino acids when inoculating a culture grown in complex medium (e.g. containing yeast extract). However, we are normally aware of this problem and do repeated culturing to dilute out potential growth factors present in yeast extract.

'''[[Proteases]]''' by Peter 
We did not detect protease activity – albeit only checking a few substrates.

'''[[Protein Export]]''' by Malcolm 
We need to know how these proteins might reach outside the cell which is where the food would be and thus the digestive enzymes or importers need to reach the outside world or the cell membrane.

= Student-created tutorials: =
== Tutorials for Annotating Genomes ==

# Will DeLoache - [http://www.bio.davidson.edu/courses/genomics/2008/DeLoache/BioPerlTutorial/BioPerl.htm BioPerl Installation] 
# Max Win - [http://www.bio.davidson.edu/courses/genomics/2008/Win/perl.html Introduction to Perl for non-programmers (with step-by-step explanations, simple exercises, and solutions)] 
# Pallavi - Conserved Domains Database (CDD) [[Media:CDDtutorial.doc]] 
# Mary - Protein Data Bank (PDB) [[Media:PDB Tutorial.doc]] 
# Laura Voss - Pfam Database [http://www.bio.davidson.edu/Courses/Bio343/Pfam_tutorial.doc Pfam Tutorial] 
# Samantha Simpson - [http://www.bio.davidson.edu/courses/genomics/2008/Simpson/Tutorial.html NCBI BLAST] 
# Peter Bakke - [[Media:ShineDalgarnoTutorial.doc]] 
# Jay McNair - [http://www.bio.davidson.edu/courses/genomics/2008/McNair/Origin_Tutorial/OriginTutorial.doc Origin of Replication Tutorial] 
# Nick Carney - Navigating the JGI Database [[Media:NavigatingJGItutorial.doc]] 
# Matt Lotz - SEED Viewer - [[Media:SEEDTutorial.doc]] 
== Pathway Tutorials==
[http://www.pathguide.org/ Pathguide] - a possible source of tutorials and extensive information

[http://www.bigre.ulb.ac.be/Users/didier/pathfinding/ Shortest Path Tool]
<hr>
*Pallavi: I will compare RAST and KEGG in pathway annotations and use Glycolysis/Gluconeogenesis as my example

*Matt: WikiPathways

*Mary: ENZYME

*Samantha: How are EC numbers determined?

=Glossary words (A - Z):=
[[#A| A ]] [[#B| B ]] [[#C| C ]] [[#D| D ]] [[#E| E ]] [[#F| F ]] [[#G| G ]] [[#H| H ]] [[#I| I ]] [[#J| J ]] [[#K| K ]] [[#L| L ]] [[#M| M ]] [[#N| N ]] [[#O| O ]] [[#P| P ]] [[#Q| Q ]] [[#R| R ]] [[#S| S ]] [[#T| T ]] [[#U| U ]] [[#V| V ]] [[#W| W ]] [[#X| X ]] [[#Y| Y ]] [[#Z| Z ]]

== A ==
'''Accession Number''' - a unique identifier given to DNA and protein sequences to allow for tracking of sequence information within a single database [http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)] (Will).

'''Antisense (RNA or DNA)'''-a piece of DNA or RNA that binds to a complementary sequence of DNA or RNA. These segments of genetic material can be used to identify the existence of a disease gene and they can also be used to bind to specific DNA or mRNA sequences to inhibit their function ([http://biotech.fyicenter.com/glossary/Bioinformatics_Glossary.html 5] Pallavi).

'''Arabidopsis thaliana''' - the scientific name for the thale cress plant; it was the first plant to have its genome sequenced, and is a model organism for understanding plant biology and genetics ([http://en.wikipedia.org/wiki/Thale_cress Wikipedia.org], Jay)

== B ==
'''BAC''' - bacterial artificial chromosome, a DNA construct used for transforming or cloning segments of DNA and often used to sequence the genetic code of organisms ([http://en.wikipedia.org/wiki/Bacterial_artificial_chromosome Wikipedia.org], Jay)

'''bioinformatics''' - the multi-disciplinary approach of using biology, computer science and mathematics to solve or better understand biological problems [http://en.wikipedia.org/wiki/Bioinformatics] (Matt)

'''BLAST''' - (Basic Local Alignment Search Tool) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. [http://blast.ncbi.nlm.nih.gov/Blast.cgi] (Mary)

'''bioperl'''- a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications such as accessing sequence data from local and remote databases, transforming database formats, manipulating individual sequences, searching for similar sequences, searching for genes and other structures on genomic DNA, or developing machine-readable sequence annotations. [http://en.wikipedia.org/wiki/BioPerl] (Wikipedia, Max Win)

== C ==
'''carbon fixation''' - using carbon dioxide to create organic materials [http://en.wikipedia.org/wiki/Carbon_fixation] (Samantha) 

'''CDD''' (Conserved Domains Database)- a database used to identify the conserved domains present in a protein query sequence [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml] (Mary)

'''chaperonin''' - a protein complex that assists some newly formed polypeptide chains by folding them into their final, functional, three-dimensional form [http://en.wikipedia.org/wiki/Chaperonins] (Matt)

'''chemotaxis''' - the process in which cells seek out or flee from a high concentration of certain chemicals; it is found in both uni- and multicellular organisms. Unicellular organisms use this process to avoid toxins or find food, and multicellular organisms use it for tasks such as reproduction [http://en.wikipedia.org/wiki/Chemotaxis] (Nick)

'''chemotaxonomy''' - the attempt to classify and identify organisms according to demonstrable differences and similarities in their biochemical compositions [http://en.wikipedia.org/wiki/Chemotaxonomy] (Mary)

'''ClustalW''' - A web-based or command line tool that performs multiple sequence alignments to determine evolutionary relationships between three or more sequences [http://en.wikipedia.org/wiki/Clustal] (Will).

'''COG''' (Cluster of Orthologous Groups)- corresponds to a highly conserved domain and generally consists of either individual proteins or groups of paralogs ([http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml COG] Pallavi) 

'''concatemer''' - long continuous DNA molecule that contains the same DNA sequence repeated in series [http://en.wikipedia.org/wiki/Concatemer](Samantha) 

'''contigs''' (contiguous DNA)- overlapping DNA segments that as a collection form a longer and gapless segment of DNA. (Discovery Genomics, Proteomics and Bioinformatics [http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''coverage''' - refers to the number of times, on average, any piece of DNA in a sequenced genome has been individually sequenced (Lecture, Jay)
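As a rough illustration of the coverage definition, average coverage is commonly estimated as the total number of sequenced bases divided by the genome size. The sketch below assumes that estimate; the function name and the read counts are hypothetical, and the 3.1 Mbp genome size is the figure quoted for the JGI assembly on this page.

```python
def average_coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Estimate average coverage as (reads x read length) / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return num_reads * read_length_bp / genome_size_bp


# Hypothetical run: 50,000 reads of 500 bp each over a 3.1 Mbp genome.
print(round(average_coverage(50_000, 500, 3_100_000), 2))  # prints 8.06
```

This is only an average: individual regions can be covered many more times or not at all.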

'''CPAN (Comprehensive Perl Archive Network)''' - an archive of over 12,200 modules of software written in Perl, as well as documentation for it. It contains a module called CPAN (or CPAN.pm) which is used as an installer for Perl modules such as BioPerl [http://en.wikipedia.org/wiki/CPAN](Will).

'''Cytogenetics'''-the study of normal and abnormal chromosomes. This involves studying the causes of chromosomal abnormalities and looking at the structure of chromosomes ([http://www.vivo.colostate.edu/hbooks/genetics/medgen/chromo/index.html 7] Pallavi).

== D ==
'''''de novo'' synthesis''' - the synthesis of complex molecules from simple molecules (e.g. sugars and nucleotides), rather than from recycled molecules; from the Latin for "anew" [http://en.wikipedia.org/wiki/De_novo_synthesis] (Matt)

'''dehydrogenase''' - a type of enzyme that oxidizes a substrate by transferring one or more protons and a pair of electrons to an acceptor. [http://en.wikipedia.org/wiki/Dehydrogenase] (Peter)

'''diatom''' - a major group of eukaryotic algae, and one of the most common types of phytoplankton. A characteristic feature of diatom cells is that they are encased within a unique cell wall made of silica called a frustule. These frustules show a wide diversity in form, but usually consist of two asymmetrical sides with a split between them. [http://en.wikipedia.org/wiki/Diatom] (Mary)

'''domain (protein)''' - the structural and functional groups of a protein, which can exist independently of the protein itself. Domains typically perform a specific function, such as binding to promoters or substrates, and many proteins can have one or several domains in common. Evolutionarily-linked proteins are more likely to have domains in common. Domains are used to organize proteins into families. ([http://en.wikipedia.org/wiki/Domain_(protein) Wikipedia article], Laura)

'''dot plot'''-graphical display comparing sequence conservation between two genomes with dots indicating strings of identical bases. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

== E ==

'''EC number''' (Enzyme Commission Number)- a numerical classification scheme for enzymes, based on the chemical reactions they catalyze [http://en.wikipedia.org/wiki/EC_number] (Mary)

'''E-value''' (Expect value)- When performing a BLAST search, you will obtain an E-value for each sequence that is retrieved. An E-value is the number of matches with an equal or better score that would be expected by chance in a database of that size; for very small values it can be thought of as the probability that two sequences are similar to each other by chance. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
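To make the E-value more concrete, here is a minimal sketch of the standard relationship between a bit score S' and the expected number of chance hits, E = m × n × 2^(−S'), where m is the query length and n is the database length. The function names and the example lengths are hypothetical, and real BLAST output uses length-corrected ("effective") values, so treat this as an approximation rather than a reproduction of BLAST's numbers.

```python
import math


def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


def chance_probability(e_value: float) -> float:
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Hypothetical search: 300-residue query against a 1,000,000-residue database.
e = expect_value(bit_score=40.0, query_len=300, db_len=1_000_000)
print(e, chance_probability(e))
```

For small E the two numbers are nearly equal, which is why a small E-value is often read as a probability.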

'''Extremophile''' - an organism that thrives in and may even require physically or geochemically extreme conditions that are detrimental to the majority of life on Earth [http://en.wikipedia.org/wiki/Extremophile] (Will).

== F ==

'''FASTA format''' - a format used to convey either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented by single-letter codes. The sequence name and other descriptors precede the sequence on a header line that begins with ">". [http://en.wikipedia.org/wiki/FASTA_format] (Nick) 
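The structure of a FASTA file can be sketched with a small reader. This is a minimal illustration, not the course's tooling: the function name and the two example records are hypothetical, and it assumes each record starts with a ">" header line followed by one or more sequence lines.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (text after '>') to its concatenated sequence."""
    records: Dict[str, list] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {header: "".join(parts) for header, parts in records.items()}


example = """>gene1 hypothetical protein
ATGGCC
TTAA
>gene2
ATGTGA
"""
for header, seq in parse_fasta(example.splitlines()).items():
    print(header, len(seq))
```

Sequence lines are joined because FASTA files usually wrap long sequences across several lines.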

'''family (protein)''' - a group of evolutionarily-related proteins, often with one or several domains in common. Families are organized by domain overlap, structural/functional similarity, and sequence similarity. ([http://en.wikipedia.org/wiki/Protein_family Wikipedia article] and lecture, Laura)

'''finished genome''' - a genome that has been sequenced at least partly by hand, resulting in at least 99.99% sequence accuracy (Lecture, Jay) 

'''fusion mRNA'''-mRNA that results from the transcription of a gene after a chromosomal translocation event. This results in an mRNA sequence that comes from two different genes (Rowley and Blumenthal 2008 ''Science'' Pallavi)

== G ==

'''GC Content''' - the percentage of bases within a certain sequence of DNA (e.g. a gene or a genome) that are either guanine or cytosine; a higher GC content is characteristic of a coding region of a gene; differences in GC content between a gene and a genome can be used as evidence for horizontal gene transfer [http://en.wikipedia.org/wiki/GC-content] (Matt) 
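The GC content definition translates directly into a short calculation. The sketch below is illustrative only; the function name is hypothetical, and it assumes that ambiguous characters such as N should be left out of the total.

```python
def gc_content(seq: str) -> float:
    """Percentage of unambiguous bases (A, C, G, T) that are G or C."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


print(gc_content("ATGC"))    # prints 50.0
print(gc_content("atgcNN"))  # prints 50.0, the N characters are ignored
```

Comparing this value for one gene against the value for the whole genome is the kind of check the definition mentions as evidence for horizontal gene transfer.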

'''GC-skew''' – uneven distribution of guanine and cytosine bases between the two strands of DNA where GC base pairs occur. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
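GC-skew is usually quantified on one strand as (G − C) / (G + C), often computed in windows along the sequence. The sketch below assumes that common formula; the function names and the window size are hypothetical choices for illustration.

```python
from typing import List


def gc_skew(seq: str) -> float:
    """(G - C) / (G + C) for one strand; 0.0 if the sequence has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return 0.0 if g + c == 0 else (g - c) / (g + c)


def windowed_skew(seq: str, window: int) -> List[float]:
    """GC-skew of consecutive, non-overlapping windows of the given size."""
    return [gc_skew(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, window)]


print(windowed_skew("GGGGCCCC", 4))  # prints [1.0, -1.0]
```

A change of sign between windows, as in the toy example, is the kind of signal people look for when plotting skew along a chromosome.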

'''gene amplification''' - production of multiple copies of a gene in order to amplify the amount of protein that the gene encodes for [http://www.medterms.com/script/main/art.asp?articlekey=13537] [http://www.answers.com/topic/gene-amplification] (Matt)

'''gene fusion'''-occurs when DNA segments of two different genes come together. Can result in hybrid proteins ([http://www.biochem.northwestern.edu/holmgren/Glossary/Definitions/Def-G/gene_fusion.html 9] Pallavi)

'''gene knockout''' - a process in which a gene is deactivated within a test organism in order to better understand the function of the gene in that organism [http://en.wikipedia.org/wiki/Gene_knockout] (Matt)

'''gene ontology'''- a collaborative effort of investigators to unify and standardize terms associated with the role a gene or protein plays in an organism. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''glaucophyte''' - freshwater algae that have not been studied well [http://en.wikipedia.org/wiki/Glaucophyte](Samantha) 

== H ==

'''haemolysin or hemolysin''' - a chemical produced by a bacterium that causes lysis of red blood cells [http://en.wikipedia.org/wiki/Hemolysis_(microbiology)] (Nick)

'''halophile''' - an organism, most often of the Archaea domain, that lives in environments containing high concentrations of salt [http://en.wikipedia.org/wiki/Halophile] (Matt)

'''haplotype'''-collection of alleles that travel together (Lecture, Pallavi)

'''haptophyte''' - phylum of algae [http://en.wikipedia.org/wiki/Haptophyte](Samantha)

'''heterokont''' - major line of eukaryotes consisting of about 10,500 known species, most of which are algae [http://en.wikipedia.org/wiki/Heterokont](Samantha)

'''Hidden Markov Model''' - a statistical model used in protein recognition databases such as Pfam. A Hidden Markov Model keeps track of several variables and possible variations thereof, such as the possible amino acid sequences that make up a protein domain (since there can be some variance in an amino acid sequence) or the variations in the component sounds that make up a word, and uses those points to match a given sequence to the word, domain, or other complex sequence it most closely matches. An HMM in speech recognition software, for example, can identify that a certain set of sounds make up a certain word, even with the variations in pronunciation and accent that different people will give those sounds. ([http://en.wikipedia.org/wiki/Hidden_Markov_Model Wikipedia] and lecture, Laura)

'''HMM Logo''' - a graphical representation of an HMM, detailing the possible amino acid sequences, the relative frequencies and probabilities of each amino acid in the sequence, the relative contribution each amino acid has to the overall protein family, and the charge or nature of the amino acids themselves. ([http://www.sanger.ac.uk/Software/analysis/logomat-m/help.shtml How to read HMM Logos, on Pfam], Laura)

'''homeobox''' - a DNA sequence within transcription factor genes that allows the cell to respond to patterns of development by having the transcription factors switch on gene cascades [http://en.wikipedia.org/wiki/Homeobox](Samantha)

'''homodimer''' - a protein made of paired identical polypeptides ([http://www.answers.com/topic/homodimer Answers.com], Jay)

'''horizontal gene transfer'''-DNA transmission between species and incorporation of the DNA into the recipient's genome ([http://www.csrees.usda.gov/nea/biotech/res/biotechnology_res_glossary.html horizontal gene transfer] Pallavi)

'''''Hox'' gene'''-a gene that contains a homeobox region that is involved in morphogenesis along the cranio-caudal body axis ([http://www.uprightape.net/UA_Glossary.html 4] Pallavi)

'''hydrolase''' - an enzyme that catalyzes hydrolysis, the cleavage of a chemical bond by the addition of water [http://en.wikipedia.org/wiki/Hydrolase] (Nick)

== I ==

'''ideogram''' - in genomics, usually describes a stylized representation of a chromosome with banding patterns (Campbell-Heyer Genomics textbook, Jay)

'''identities''' - in a BLAST output, the number and fraction of total residues which are identical in a given alignment [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)
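The identities figure can be reproduced by hand from a pairwise alignment. This is a minimal sketch, assuming the two aligned strings have equal length and use "-" for gaps; the function name and the example alignment are hypothetical.

```python
from typing import Tuple


def identities(aligned_query: str, aligned_subject: str) -> Tuple[int, int]:
    """Return (identical positions, alignment length) for a gapped alignment."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for q, s in zip(aligned_query, aligned_subject)
                  if q == s and q != "-")
    return matches, len(aligned_query)


matches, length = identities("MKT-AYL", "MRTGAYL")
print(f"Identities = {matches}/{length} ({100 * matches // length}%)")
# prints Identities = 5/7 (71%)
```

Positives are counted the same way, except that a position also counts when the substitution has a positive score in the scoring matrix.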

'''indole'''-a chemical compound that is produced from the break down of tryptophan ([http://medical-dictionary.thefreedictionary.com/indole indole] Pallavi)

'''inclusion body''' - Inclusion bodies are collections of stainable substances, usually proteins, that are found either in the nucleus or the cytoplasm. It is thought that these bodies are often the result of viral proteins that misfolded [http://en.wikipedia.org/wiki/Inclusion_body] (Nick)

'''intron''' - a region of DNA in a gene that is not part of the final coding sequence for the protein. [http://en.wikipedia.org/wiki/Intron] (Peter)

'''isoelectric point''' - the pH at which a molecule carries no net electrical charge [http://en.wikipedia.org/wiki/Isoelectric_point] (Nick)

'''isozymes''' - members of a gene family with very similar cellular roles (Campbell-Heyer Genomics textbook, Jay)

== J ==

== K ==
'''KEGG (Kyoto Encyclopedia of Genes and Genomes)''' - a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The Pathway database records networks of molecular interactions in the cells, and variants of them specific to particular organisms [http://en.wikipedia.org/wiki/KEGG](Will).

'''kinase''' - a type of enzyme that transfers a phosphate group from a high-energy donor molecule to a target molecule in a process called phosphorylation. [http://en.wikipedia.org/wiki/Kinase] (Peter)

== L ==

== M ==
'''Manatee''' - a web-based gene evaluation and genome annotation tool that can view, modify, and store annotation for prokaryotic and eukaryotic genomes. This on-going, open source initiative was developed with two missions: first, to give biologists the ability to functionally annotate their genomes using a powerful, stand-alone web application with a robustly designed relational annotation database; and second, to give outside developers the opportunity to contribute their own ideas and requirements to enhance Manatee's ability to accomplish biological goals [http://manatee.sourceforge.net/](Will). 

'''microsatellites'''-stretches of repetitive, short DNA segments that can be used to track the inheritance of certain traits within families ([http://www.clanlindsay.com/genetic_dna_glossary.htm 3] Pallavi)

'''minisatellites'''-segments of DNA that can be used for individual identification (ex. DNA fingerprinting) or in determining relationships between people (ex. paternity cases) ([http://www.clanlindsay.com/genetic_dna_glossary.htm 2] Pallavi).

'''motif''' - a sequence of amino acids or nucleotides that performs a particular role and is often conserved in other species or molecules. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''mycoplasma''' - genus of bacteria that lack a cell wall [http://en.wikipedia.org/wiki/Mycoplasma] (Nick)

== N ==

'''NORFs''' (nonannotated open reading frame) - an open reading frame that was considered not to be a real gene when the genome was annotated. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''nucleomorph''' - reduced eukaryotic nuclei found in plastids [http://en.wikipedia.org/wiki/Nucleomorph](Samantha)

== O ==
'''object-oriented programming''' - a programming paradigm in which collections of data, associated with operations on that data, are modularly defined and then built upon (CSC 121 Lecture, Will).

'''open reading frame (ORF)'''-a segment of DNA that can potentially encode a protein; it begins with a start codon (usually ATG) and continues in the same reading frame to a stop codon [http://www.fao.org/DOCREP/003/X3910E/X3910E18.htm ORF] (Pallavi)
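A naive ORF search follows directly from the definition: scan each reading frame for a start codon and extend to the next in-frame stop codon. The sketch below is a simplified illustration, not how JGI, Manatee, or RAST call genes. It assumes ATG as the only start codon, scans the forward strand only, and uses a hypothetical function name and minimum-length parameter.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, 0-based with end exclusive, forward strand only."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # include the stop codon
                start = None
    return sorted(orfs)


dna = "CCATGAAATAGGG"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])  # prints 2 11 ATGAAATAG
```

A fuller search would also scan the reverse complement and allow alternative start codons such as GTG and TTG, which prokaryotes sometimes use.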

'''operon''' - a segment of DNA involving an operator, promoter, and one or more genes that operate as a single unit during transcription [http://en.wikipedia.org/wiki/Operon] (Nick)

'''optical mapping'''-DNA sequences of the organism in question are compared against a karyotype that specifically looks at restriction sites found within the DNA to correctly order the DNA sequences on a chromosome. This methodology gives very detailed haplotype information and allows for the detection of sequence variations across an entire genome [http://www.geocities.com/bioinformaticsweb/genomicglossary.html optical mapping] (Pallavi)

'''ortholog'''-homologous genes in different species that descend from a single gene in their last common ancestor and were separated by a speciation event; orthologs often retain the same function (Lecture, Pallavi)

'''oxidoreductase''' - an enzyme that catalyzes redox reactions by transferring electrons from one molecule (the reductant) to another (the oxidant) [http://en.wikipedia.org/wiki/Oxidoreductase] (Nick)

== P ==

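'''paralog'''-homologous genes within a single genome that arose by gene duplication; paralogs are similar but not necessarily identical, and often diverge in function (Lecture, Pallavi)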

'''p-arm''' - the shorter arm of a chromosome's two arms separated by the centromere (compare to q-arm, the longer arm) ([http://www.medterms.com/script/main/art.asp?articlekey=4715 MedTerms Dictionary], Jay)

'''Perl''' - Developed by Larry Wall in 1987, Perl is a [http://en.wikipedia.org/wiki/High-level_programming_language high-level programming language] used frequently by biologists and bioinformaticists [http://en.wikipedia.org/wiki/Perl] (Will).

'''periplasmic space''' - the space between the inner cytoplasmic membrane and external outer membrane in bacteria or archaea. [http://en.wikipedia.org/wiki/Periplasmic_space] (Peter)

'''Pfam''' - a database for protein domain families that matches amino acid sequences or nucleotide sequences to the related group of proteins to which they most likely belong. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura)

'''plasmid''' - an extra-chromosomal DNA molecule that is capable of replicating independently of the chromosomal DNA. Commonly found in bacteria and archaea. [http://en.wikipedia.org/wiki/Plasmid](Peter)

'''plastid''' - major organelles in plants or algae [http://en.wikipedia.org/wiki/Plastid](Samantha)

'''pleomorphism''' - the occurrence of two or more structural forms during a life cycle [http://en.wikipedia.org/wiki/Pleomorphism] (Mary)

'''phylogenetic tree''' - a diagram showing the evolutionary relationships between biological species that are thought to share a common ancestor [http://en.wikipedia.org/wiki/Phylogenetic_tree] (Nick)

'''phylotypes''' – a term intended to resolve the challenge of “species” when classifying prokaryotes using DNA sequence comparisons. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''positives''' - in a BLAST output, the number and fraction of residues for which the alignment scores have positive rather than negative values [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)

'''proteome''' - entire set of proteins expressed by a genome, cell, tissue, or organism. It may refer to expressed proteins under certain conditions [http://en.wikipedia.org/wiki/Proteome](Samantha)

'''PSORT''' - a prediction server that judges where a mature protein could be in the cell, based on its transmembrane domains, its predicted mature amino acid composition, and its signal sequences. ([http://psort.ims.u-tokyo.ac.jp/form.html PSORT], Laura)

'''pseudogenes'''-A sequence of DNA that looks like a gene, but most likely contains many stop codons. It may have evolved away from a real gene or a paralog might have taken its place (Lecture, Pallavi)

'''purine''' - a category of nitrogenous base consisting of a pyrimidine ring fused to an imidazole ring. Notable purine bases are adenine and guanine. [http://en.wikipedia.org/wiki/Purine] (Peter)

'''pyrimidine''' - a category of nitrogenous base consisting of a heterocyclic aromatic ring containing two nitrogen atoms at positions 1 and 3 of the six-member ring. Notable pyrimidine bases are cytosine, thymine, and uracil. [http://en.wikipedia.org/wiki/Pyrimidine] (Peter)

== Q ==

'''q-arm''' - the longer arm of a chromosome's two arms separated by the centromere (compare to p-arm, the shorter arm) ([http://www.medterms.com/script/main/art.asp?articlekey=5152 MedTerms Dictionary], Jay) 

'''query sequence''' - the sequence (whether amino acid or nucleotide) entered into a database’s search function and checked against the database entries. ([http://en.wikipedia.org/wiki/BLAST BLAST on Wikipedia], Laura)

== R ==

'''RAST''' - (Rapid Annotation using Subsystem Technology)- a fully-automated service for annotating bacterial and archaeal genomes. It provides high quality genome annotations for these genomes across the whole phylogenetic tree. ([http://rast.nmpdr.org/], Max Win)

'''rDNA'''-These are DNA sequences that encode ribosomal RNA. Note that rDNA can also stand for recombinant DNA. ([http://en.wikipedia.org/wiki/Ribosomal_DNA rDNA] Pallavi)

'''residue (protein)''' - the remaining portion of an amino acid after a water molecule has been removed and it has been incorporated into a protein. Functional residues, referred to in Pfam, are the residues that perform some specific identifiable function or are part of a domain, and can be conserved across evolutionarily-related proteins. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura) 

'''retropseudogenes'''-these are gene copies made when mRNA is reverse-transcribed and the resulting DNA sequence is incorporated back into the genome. They are non-functional segments of DNA and can be distinguished from other pseudogenes in that they do not have intron sequences. ([http://genome.cshlp.org/cgi/content/full/10/5/672 1] Pallavi)

'''retrotransposons''' - transposable elements that are transcribed into RNA, reverse-transcribed back into DNA, and added into the genome at a new position [http://en.wikipedia.org/wiki/Retrotransposon](Samantha)

'''ribonuclease''' - a nuclease that catalyzes the degradation of RNA into smaller components [http://en.wikipedia.org/wiki/Ribonuclease] (Mary)

== S ==
'''Serovar'''-a subdivision of a species based on the characteristics of their cell surface antigens ([http://www.biology-online.org/dictionary/Serovar serovar] Pallavi)

'''scaffold''' - a section of a sequenced genome composed of contigs that are in the right order but not necessarily connected ([http://www.medterms.com/script/main/art.asp?articlekey=25223 MedTerms Dictionary], Jay)

'''"Shadow enhancers"'''-secondary enhancers that are thought to be important for natural selection to occur in regulatory DNA segments. They evolve much faster than primary enhancers, which suggests that they are under fewer functional constraints (Wray and Babbit 2008 ''Science'' Pallavi)

'''Shine-Dalgarno sequence''' - A ribosomal binding site on an mRNA, usually a sequence of about six bases located six or seven bases upstream of the start codon. An anti-Shine-Dalgarno sequence exists on the rRNA in the small subunit of the ribosome; when the two sequences pair, the mRNA is lined up and prepared for translation. (Lecture and [http://en.wikipedia.org/wiki/Shine-dalgarno Wikipedia article], Laura) 
Note: The Shine-Dalgarno consensus sequence for our genome is ccGGAGGt.
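A minimal Python sketch of how one might check a called start codon for an upstream Shine-Dalgarno match. It assumes that the uppercase letters of the consensus above (GGAGG) are the conserved core and that a 20-base upstream window is a reasonable search region; the function name, the window size, and the example sequence are hypothetical, not part of any annotation pipeline.

```python
def find_shine_dalgarno(sequence, start_index, core="GGAGG", window=20):
    """Search the `window` bases upstream of a start codon for `core`.

    Returns the number of bases between the end of the core motif and the
    start codon, or None if the motif is absent from the window.
    `core` and `window` are illustrative assumptions.
    """
    sequence = sequence.upper()
    region_start = max(0, start_index - window)
    upstream = sequence[region_start:start_index]
    # rfind picks the occurrence closest to the start codon
    pos = upstream.rfind(core)
    if pos == -1:
        return None
    return len(upstream) - (pos + len(core))


# Made-up example: consensus ccGGAGGt, a short spacer, then ATG
example = "TTTCCGGAGGTAACAAAATGAAACCC"
spacing = find_shine_dalgarno(example, example.find("ATG"))
print(spacing)
```

Exact string matching is the simplest choice; a real comparison across the three annotation systems would probably need to tolerate mismatches.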

'''SignalP''' - a prediction server that judges whether or not a query protein contains a signal peptide. SignalP measures each amino acid against the amino acid sequences of probable signal peptide matches and predicts the cleavage site of the signal peptide. ([http://www.cbs.dtu.dk/services/SignalP-3.0/output.php SignalP Output explained], Laura)

'''signal peptide''' - a short peptide chain that directs the post-translational transport of a protein [http://en.wikipedia.org/wiki/Signal_peptide] (Matt)

'''Smith-Waterman alignment''' - A well-known algorithm for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure [http://en.wikipedia.org/wiki/Smith_waterman](Will).
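The idea can be sketched in a few lines of Python. This is a teaching sketch, not the implementation used by any particular tool: the scoring values (match +2, mismatch -1, gap -2) are arbitrary assumptions, and a simple linear gap penalty is used.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Scores are illustrative assumptions; gaps use a linear penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere (local, not global)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("AAAGGGTTT", "CCGGGCC"))
```

In the example only the shared GGG stretch is reported, because extending the alignment into the flanking mismatches would lower the score.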

'''SNP (Single Nucleotide Polymorphism)''' - a DNA sequence variation occurring when a single nucleotide in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual) [http://en.wikipedia.org/wiki/Single_nucleotide_polymorphism](Will).

'''symporter''' - an integral membrane protein that is involved in movement of two or more different molecules or ions across a phospholipid membrane in the same direction. [http://en.wikipedia.org/wiki/Symporter] (Peter)

'''synteny''' - a neologism from the Greek for "on the same ribbon". Genes that are syntenic in one species are on the same chromosome; genes that are syntenic across species retain the same order on respective chromosomes as a result of descent from a common ancestor ([http://www.answers.com/synteny Answers.com], Jay)

'''synthetase''' - a type of enzyme that creates a new covalent bond and requires direct input of energy from a high-energy phosphate. [http://books.google.com/books?id=bB8XnCykRmIC&pg=PA522&lpg=PA522&dq=%22synthetase+is+an+enzyme%22&source=web&ots=wkws4ksMsg&sig=zWLkDIk7T78hcf9S84nWs3u5Apw&hl=en&sa=X&oi=book_result&resnum=9&ct=result] (Peter)

== T ==
'''transferase''' - an enzyme that catalyzes the transfer of a functional group from one molecule (the donor) to another (the acceptor) [http://en.wikipedia.org/wiki/Transferase] (Matt)

'''transmembrane helix''' - a single transmembrane alpha helix of a transmembrane protein, usually about twenty amino acids in length. They are usually predicted by hydrophobicity. [http://en.wikipedia.org/wiki/Transmembrane_domain](Mary)

'''transposons / transposable elements''' - DNA sequences that can move around to different positions in a single cell's genome. Transposons can cause mutations and change the length of the genome. [http://en.wikipedia.org/wiki/Transposon](Samantha)

'''Transposon Mutagenesis'''-a procedure in which a transposon is inserted into a gene, which inactivates the gene and can lead to the discovery of the phenotype associated with this gene ([http://cancerweb.ncl.ac.uk/cgi-bin/omd?transposon+mutagenesis transposon mutagenesis] Pallavi)

'''Trans-splicing'''-exon sequences from separate RNA transcripts are joined to form a single mature mRNA. This process results in fusion mRNA ([http://www.representinggenes.org/Glossary.html 8] Pallavi).

'''tRNA splicing endonuclease''' - an enzyme that cleaves intervening sequences of precursor tRNA. [http://cancerweb.ncl.ac.uk/cgi-bin/omd?splicing+endonuclease] (Peter) 

== U ==

== V ==
'''Vertical gene transfer'''-the transmission of genetic material from parent to offspring through reproduction; because it follows lines of descent, it respects species-specific boundaries ([http://www.gmo-compass.org/eng/glossary/#G 6] Pallavi)

== W ==

'''whole genome shotgun sequencing''' - a method of sequencing where DNA is cut into small pieces and cloned into vectors, then about 500 bp are sequenced from both ends of every insert to form mate pairs. Mate pairs rarely overlap, but are used to reassemble the sequence using software. [http://en.wikipedia.org/wiki/Whole_genome_shotgun](Samantha)
 

== X ==
'''xenolog''' - homologs that are created by horizontal gene transfer between two different species [http://en.wikipedia.org/wiki/Xenolog#Xenology] (Matt) 

== Y ==

== Z ==

Halorhabdus utahensis Genome

2008-11-06T16:01:33Z

SaSimpson: /* My Favorite Pathways */

This page will be used by Davidson College students in the [http://www.bio.davidson.edu/Courses/Bio343/LabMethods.html Genomics Laboratory course].
__NOTOC__
== Links to Multiple Databases ==
*[http://imgweb.jgi-psf.org/cgi-bin/img_edu_v260/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2500575004 JGI IMG EDU] public access
**[[Media:JGIAnnotation.xls|JGI Annotation Excel Spreadsheet]]
*[http://www.tigr.org/tigr-scripts/prok_manatee/shared/login.cgi Manatee at JCVI] use the davidson number sent by email as username and password (database is nthu01 - this is case sensitive)
**[[Media:ManateeAnnotation.xls|Manatee Annotation Excel Spreadsheet]]
*[http://rast.nmpdr.org/ SEED view via RAST] use the username and password combination sent to you by SEED
**[[Media:RastAnnotation.xls|RAST Annotation Excel Spreadsheet]]
**[http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=18261238 RAST Publication in PubMed]
*[http://www.genome.jp/kegg/kaas/ KEGG] We can submit our genes to KEGG to have it mapped out, but SEED and Manatee may already do this. Do we want to ask them to upload it into their database? 
*[http://wishart.biology.ualberta.ca/basys/cache/135af8726ad6f61ec4c5f1e9c4d139ac/index.html BASYs] 
*[http://gcat.davidson.edu/Registry/compare/ Pairwise comparisons of All Three Annotations]

[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI_5contigs.txt JGI Full genome, 5 separate contigs & 3.1 Mbp, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.txt JGI gene DNA sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.xls JGI gene annotations, Excel] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_proteins.txt JGI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_merged.txt JCVI Full genome, 5 contigs fused, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_ORFs.txt JCVI gene sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_proteins.txt JCVI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/GeneLengths.xls 3-way comparison, Excel] 
[[Venn_diagrams]] Venn diagram of 3-way comparison

 

== RNA Genes ==

*[[tRNA Genes Check List]] 
*[[rRNA operon]] 
*[[2 misc. RNA genes]] (short summary list) 
*[[Missing tRNA-trp gene found]] 

== Other Resources ==
*[[Consensus Shine Dalgarno]] Excel File for ''H. utahensis'' 
*[[References]] 
*[[Gene Annotation Template]] 
*[[General Questions]] 
*[[Page for Annotated Genes]] 
*[http://www.bio.davidson.edu/courses/genomics/2008/Win/ec/ Search EC number in RAST, JGI or Manatee] 
*[http://gcat.davidson.edu/Wideloache/Webfiles/ecNumBlast.html Blast an EC number against the H. utahensis genome] 

== Research Questions ==
#How do the three systems compare for finding ORFs and RNA genes?
#Is there a pattern of missed genes for any of the 3 sites?
#Do the three systems differ in their ability to find good start codons and Shine-Dalgarno sequences? [We need a standard set of genes for comparison. Only highly conserved or a range of genes?]
# Were Shine-Dalgarno sequences calculated for our species or default values used? If default, what sequence?
#Can we fill any holes in their automated annotation? Is there a mechanism for users to add in genes?
#How do the 3 sites compare for ease of use?
#What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working?
#How does each of the 3 sites compare for pathway detection and visualization?
#Do they find the origin of replication? Can we find it?

* How do the 3 systems compare when one system calls a gene hypothetical and another calls it a functional protein? How can they vary, and which is closer to correct (however you define that, possibly by date of matched entry)? (Pallavi and Mary)
* Why did one system call a gene when the other two did not? (Matt and Lara)
* How do the 3 sites compare for ease of use? What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working? (Samantha and Nick)
* Where is the origin of replication and did the 3 systems attempt to identify this?
* Did the 3 systems utilize Shine-Dalgarno sequences to help them call start codons? Did they utilize our species's consensus Shine-Dalgarno? (Peter)
* We need to fill in the [[Venn diagrams]] for our 3-way comparison. Let's compare the size of ORFs and generate a [[Gene Length Histograms|graph comparing the distributions]] for all 3. (Max and Will - they also take requests).

<hr>
=Our Favorites=
== My favorite genes==
Pallavi - Monooxygenase vs. Peroxiredoxin

Mary - JGI gene 2500588521 (922976...924046) [[Media:My favorite gene.ppt]]

Max - [http://app.sliderocket.com/app/FullPlayer.aspx?id=f2058b94-845f-4a11-94eb-142f251a7fea JGI gene 2500587636 (2-1849)]

Samantha - JGI gene 2500575882 (80504-80878) [[Media:Earl.ppt]]

Nick - JGI gene 2300587691 (69942...72866) [[Media:Gene presentation.ppt]]

Will - JGI gene 2500590430 (2847205..2854335)

Jay - JGI gene 2500588397 (806410..807321) [http://www.bio.davidson.edu/courses/genomics/2008/McNair/Fav_Gene/FavoriteGenePresentation.pptx Co/Zn/Cd PowerPoint]

Matt - Transcriptional Regulator nrdR (3109722..3110204 + 7274..7765)

Peter - tRNA intron endonuclease [[Media:TRNAtrpintronendonuclease.ppt]]

Laura - 16S Small ribosomal subunit, JGI gene 2500590728 (2397347..2398825)

== My Favorite Pathways==
Pallavi - Carbohydrate Metabolism

Jay - Membrane Transport

Will - Signal Transduction

Max - Energy

Samantha - Purine Metabolism!!!

''Suggestions by Kjeld'' 
'''[[Cellulase]]''' by Pallavi 
I think it would be very interesting to look for genes involved in cellulose degradation: endocellulases, exocellulases (=cellobiohydrolases) and β-glucosidases.
Many cellulose degraders produce a range of each type. A cellulolytic system able to function at 4.6 M NaCl is an interesting one. We either did not observe or did not look for cellulose degradation. However, these systems are normally inducible and you need to test several substrates and inducers. It would be nice to have a compilation of putative “cellulase” genes.
There are several good recent reviews on cellulases (also mentioning E.C. numbers and enzyme families) that your students could consult.

'''[[Chitinase]]''' by Matt 
Apparently you detected a chitinase, but according to our records it does not grow on N-acetyl-glucosamine, which is somewhat strange. It grows on glucose though.

'''[[Lipases]]''' by Mary 
Lipases (/esterases) would also be interesting to look for – some lipases have important industrial applications.

'''[[Amylases]]''' by Samantha 
We did not observe growth on starch. Did you find any “amylase-coding genes”?

'''[[Xylose (glucose) isomerase)]]''' by Nick 
An enzyme of great commercial value.

'''[[Amino acids]]''' led by Laura and assisted by Max, Jay, Nick and Samantha 
According to our records AX-2 is able to grow in a “defined medium”. This is at variance with your “holes” for synthesis of amino acids. However, there could have been some “carry over” of amino acids when inoculating a culture grown in complex medium (e.g. containing yeast extract). However, we are normally aware of this problem and do repeated culturing to dilute out potential growth factors present in yeast extract.

'''[[Proteases]]''' by Peter 
We did not detect protease activity – albeit only checking a few substrates.

'''[[Protein Export]]''' by Malcolm 
We need to know how these proteins might reach the outside of the cell, which is where the food would be. The digestive enzymes or importers therefore need to reach the outside world or the cell membrane.

= Student-created tutorials: =
== Tutorials for Annotating Genomes ==

# Will DeLoache - [http://www.bio.davidson.edu/courses/genomics/2008/DeLoache/BioPerlTutorial/BioPerl.htm BioPerl Installation] 
# Max Win - [http://www.bio.davidson.edu/courses/genomics/2008/Win/perl.html Introduction to Perl for non-programmers (with step-by-step explanations, simple exercises and solutions)] 
# Pallavi - Conserved Domains Database (CDD) [[Media:CDDtutorial.doc]] 
# Mary - Protein Data Bank (PDB) [[Media:PDB Tutorial.doc]] 
# Laura Voss - Pfam Database [http://www.bio.davidson.edu/Courses/Bio343/Pfam_tutorial.doc Pfam Tutorial] 
# Samantha Simpson - [http://www.bio.davidson.edu/courses/genomics/2008/Simpson/Tutorial.html NCBI BLAST] 
# Peter Bakke - [[Media:ShineDalgarnoTutorial.doc]] 
# Jay McNair - [http://www.bio.davidson.edu/courses/genomics/2008/McNair/Origin_Tutorial/OriginTutorial.doc Origin of Replication Tutorial] 
# Nick Carney - Navigating the JGI Database [[Media:NavigatingJGItutorial.doc]] 
# Matt Lotz - SEED Viewer - [[Media:SEEDTutorial.doc]] 
== Pathway Tutorials==
[http://www.pathguide.org/ Pathguide] - a possible source of tutorials and extensive information

[http://www.bigre.ulb.ac.be/Users/didier/pathfinding/ Shortest Path Tool]
<hr>
*Pallavi: I will compare RAST and KEGG in pathway annotations and use Glycolysis/Gluconeogenesis as my example

*Matt: WikiPathways

*Mary: ENZYME

=Glossary words (A - Z):=
[[#A| A ]] [[#B| B ]] [[#C| C ]] [[#D| D ]] [[#E| E ]] [[#F| F ]] [[#G| G ]] [[#H| H ]] [[#I| I ]] [[#J| J ]] [[#K| K ]] [[#L| L ]] [[#M| M ]] [[#N| N ]] [[#O| O ]] [[#P| P ]] [[#Q| Q ]] [[#R| R ]] [[#S| S ]] [[#T| T ]] [[#U| U ]] [[#V| V ]] [[#W| W ]] [[#X| X ]] [[#Y| Y ]] [[#Z| Z ]]

== A ==
'''Accession Number''' - a unique identifier given to DNA and protein sequences to allow for tracking of sequence information within a single database [http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)] (Will).

'''Antisense (RNA or DNA)'''-a piece of DNA or RNA that binds to a complementary sequence of DNA or RNA. These segments of genetic material can be used to identify the existence of a disease gene and they can also be used to bind to specific DNA or mRNA sequences to inhibit their function ([http://biotech.fyicenter.com/glossary/Bioinformatics_Glossary.html 5] Pallavi).

'''Arabidopsis thaliana''' - the scientific name for the thale cress plant; it was the first plant to have its genome sequenced, and is a model organism for understanding plant biology and genetics ([http://en.wikipedia.org/wiki/Thale_cress Wikipedia.org], Jay)

== B ==
'''BAC''' - bacterial artificial chromosome, a DNA construct used for transforming or cloning segments of DNA and often used to sequence the genetic code of organisms ([http://en.wikipedia.org/wiki/Bacterial_artificial_chromosome Wikipedia.org], Jay)

'''bioinformatics''' - the multi-disciplinary approach of using biology, computer science and mathematics to solve or better understand biological problems [http://en.wikipedia.org/wiki/Bioinformatics] (Matt)

'''BLAST''' - (Basic Local Alignment Search Tool) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. [http://blast.ncbi.nlm.nih.gov/Blast.cgi] (Mary)

'''bioperl'''- a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications such as accessing sequence data from local and remote databases, converting between database formats, manipulating individual sequences, searching for similar sequences, searching for genes and other structures on genomic DNA, or developing machine-readable sequence annotations. [http://en.wikipedia.org/wiki/BioPerl] (Wikipedia, Max Win)

== C ==
'''carbon fixation''' - using carbon dioxide to create organic materials [http://en.wikipedia.org/wiki/Carbon_fixation] (Samantha) 

'''CDD''' (Conserved Domains Database)- a database used to identify the conserved domains present in a protein query sequence [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml] (Mary)

'''chaperonin''' - a protein complex that assists some newly formed polypeptide chains by folding them into their final, functional, three-dimensional form [http://en.wikipedia.org/wiki/Chaperonins] (Matt)

'''chemotaxis''' - the process in which cells seek out or flee from a high concentration of certain chemicals; it is found in both uni- and multicellular organisms. Unicellular organisms use it to avoid toxins or find food, and multicellular organisms use it for tasks such as reproduction [http://en.wikipedia.org/wiki/Chemotaxis] (Nick)

'''chemotaxonomy''' - the attempt to classify and identify organisms according to demonstrable differences and similarities in their biochemical compositions [http://en.wikipedia.org/wiki/Chemotaxonomy] (Mary)

'''ClustalW''' - A web-based or command line tool that performs multiple sequence alignments to determine evolutionary relationships between three or more sequences [http://en.wikipedia.org/wiki/Clustal] (Will).

'''COG''' (Clusters of Orthologous Groups)- each COG corresponds to a highly conserved domain and generally consists of either individual orthologous proteins or groups of paralogs ([http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml COG] Pallavi) 

'''concatemer''' - long continuous DNA molecule that contains the same DNA sequence repeated in series [http://en.wikipedia.org/wiki/Concatemer](Samantha) 

'''contigs''' (contiguous DNA)- overlapping DNA segments that as a collection form a longer and gapless segment of DNA. (Discovery Genomics, Proteomics and Bioinformatics [http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''coverage''' - refers to the number of times, on average, any piece of DNA in a sequenced genome has been individually sequenced (Lecture, Jay)
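A small Python sketch of the usual back-of-the-envelope estimate: average coverage is the total number of sequenced bases divided by the genome size. The function name and the example read counts are hypothetical; only the arithmetic is standard.

```python
def average_coverage(num_reads, read_length, genome_size):
    """Average coverage = (number of reads * read length) / genome size."""
    return num_reads * read_length / genome_size


# Hypothetical numbers: 10,000 reads of 500 bp over a 1 Mbp genome
print(average_coverage(10_000, 500, 1_000_000))  # 5.0, i.e. "5x coverage"
```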

'''CPAN (Comprehensive Perl Archive Network)''' - an archive of over 12,200 modules of software written in Perl, as well as documentation for it. It contains a module called CPAN (or CPAN.pm) which is used as an installer for Perl modules such as BioPerl [http://en.wikipedia.org/wiki/CPAN](Will).

'''Cytogenetics'''-the study of normal and abnormal chromosomes. This involves studying the causes of chromosomal abnormalities and looking at the structure of chromosomes ([http://www.vivo.colostate.edu/hbooks/genetics/medgen/chromo/index.html 7] Pallavi).

== D ==
'''''de novo'' synthesis''' - the synthesis of complex molecules from simple molecules (e.g. sugars and nucleotides), rather than from recycled molecules; from the Latin "of the new" [http://en.wikipedia.org/wiki/De_novo_synthesis] (Matt)

'''dehydrogenase''' - a type of enzyme that oxidizes a substrate by transferring one or more protons and a pair of electrons to an acceptor. [http://en.wikipedia.org/wiki/Dehydrogenase] (Peter)

'''diatom''' - a major group of eukaryotic algae, and one of the most common types of phytoplankton. A characteristic feature of diatom cells is that they are encased within a unique cell wall made of silica called a frustule. These frustules show a wide diversity in form, but usually consist of two asymmetrical sides with a split between them. [http://en.wikipedia.org/wiki/Diatom] (Mary)

'''domain (protein)''' - the structural and functional groups of a protein, which can exist independently of the protein itself. Domains typically perform a specific function, such as binding to promoters or substrates, and many proteins can have one or several domains in common. Evolutionarily-linked proteins are more likely to have domains in common. Domains are used to organize proteins into families. ([http://en.wikipedia.org/wiki/Domain_(protein) Wikipedia article], Laura)

'''dot plot'''-graphical display comparing sequence conservation between two genomes with dots indicating strings of identical bases. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
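A minimal Python sketch of how the dots could be computed, assuming a "string of identical bases" means an exact shared word of fixed length. The function name and the default word size of 3 are illustrative assumptions; real genome-scale tools use much longer words and more efficient indexing.

```python
def dot_plot(seq_a, seq_b, word=3):
    """Return (i, j) pairs where the word starting at seq_a[i] equals
    the word starting at seq_b[j]. Each pair is one dot on the plot."""
    # Index every word of seq_b by its start positions
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)

    dots = []
    for i in range(len(seq_a) - word + 1):
        for j in index.get(seq_a[i:i + word], []):
            dots.append((i, j))
    return dots


# Identical sequences give an unbroken diagonal
print(dot_plot("ACGTAC", "ACGTAC"))
```

Plotting the pairs with i on one axis and j on the other gives the familiar picture: conserved regions appear as diagonal lines, and rearrangements appear as shifted diagonals.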

== E ==

'''EC number''' (Enzyme Commission Number)- a numerical classification scheme for enzymes, based on the chemical reactions they catalyze [http://en.wikipedia.org/wiki/EC_number] (Mary)

'''E-value''' (Expect value)- When performing a BLAST search, you will obtain an E-value for each sequence that is retrieved. An E-value is the number of hits with an equally good or better score that would be expected by chance in a database of that size; for small values it can be thought of as the probability that two sequences are similar to each other by chance. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
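For readers who want to see the arithmetic, here is a hedged Python sketch based on the standard Karlin-Altschul relations: E = m * n * 2^(-S'), where S' is the bit score and m and n are the query and database lengths, and P = 1 - e^(-E) for the probability of at least one chance hit. The function names are hypothetical, and BLAST itself uses corrected "effective" lengths, so its reported values will differ somewhat.

```python
import math


def e_value_from_bit_score(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least `bit_score`.

    Uses raw lengths; BLAST applies edge-effect corrections that this
    sketch deliberately ignores.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


def p_value_from_e_value(e_value):
    """Probability of seeing at least one chance hit, given E."""
    return 1.0 - math.exp(-e_value)


e = e_value_from_bit_score(10, 32, 32)
print(e, p_value_from_e_value(e))
```

For small E the two quantities are nearly equal, which is why an E-value is often read as a probability.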

'''Extremophile''' - an organism that thrives in and may even require physically or geochemically extreme conditions that are detrimental to the majority of life on Earth [http://en.wikipedia.org/wiki/Extremophile] (Will).

== F ==

'''FASTA format''' - a text format used to convey either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented by single-letter codes. The sequence name and other descriptors precede the sequence on a header line that begins with ">". [http://en.wikipedia.org/wiki/FASTA_format] (Nick) 
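A short Python sketch of reading this format, assuming the common convention that each record starts with a ">" header line and that the first word of the header is the sequence name. The function name and the example records are made up for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping sequence name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            # First word is the name; the rest is free-form description
            name = header.split()[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    # Sequences are often wrapped across lines, so join the pieces
    return {n: "".join(parts) for n, parts in records.items()}


example = """>gene1 hypothetical protein
ATGAAA
TTTTAG
>gene2
ATGCCCTGA
"""
print(parse_fasta(example))
```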

'''family (protein)''' - a group of evolutionarily-related proteins, often with one or several domains in common. Families are organized by domain overlap, structural/functional similarity, and sequence similarity. ([http://en.wikipedia.org/wiki/Protein_family Wikipedia article] and lecture, Laura)

'''finished genome''' - a genome that has been sequenced at least partly by hand, resulting in at least 99.99% sequence accuracy (Lecture, Jay) 

'''fusion mRNA'''-mRNA that results from the transcription of a gene after a chromosomal translocation event. This results in an mRNA sequence that comes from two different genes (Rowley and Blumenthal 2008 ''Science'' Pallavi)

== G ==

'''GC Content''' - the percentage of bases within a certain sequence of DNA (e.g. a gene or a genome) that are either guanine or cytosine; a higher GC content is characteristic of a coding region of a gene; differences in GC content between a gene and a genome can be used as evidence for horizontal gene transfer [http://en.wikipedia.org/wiki/GC-content] (Matt) 
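The percentage is simple to compute; a minimal Python sketch follows. The function name is hypothetical, and ambiguous bases such as N are counted in the length but not as G or C, which is one of several possible conventions.

```python
def gc_content(seq):
    """Return the percentage of bases in `seq` that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


print(gc_content("GGCCAATT"))  # 50.0
```

Comparing the value for a single gene with the value for the whole genome is the kind of check used when looking for evidence of horizontal gene transfer.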

'''GC-skew''' – uneven distribution of guanine and cytosine bases between the two strands of DNA where GC base pairs occur. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
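GC skew is commonly computed on one strand as (G - C) / (G + C), often in windows along the genome. The Python sketch below assumes that formula and uses non-overlapping windows; the function names and window size are illustrative choices rather than a fixed standard.

```python
def gc_skew(seq):
    """Return (G - C) / (G + C) for one strand, or 0.0 if there is no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed_gc_skew(seq, window):
    """GC skew for consecutive, non-overlapping windows along `seq`."""
    return [gc_skew(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, window)]


print(windowed_gc_skew("GGGCCCCG", 4))  # [0.5, -0.5]
```

In many bacterial genomes the sign of the skew changes near the origin and terminus of replication, so a windowed plot like this is one possible starting point for the origin-of-replication question on this page.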

'''gene amplification''' - production of multiple copies of a gene in order to amplify the amount of protein that the gene encodes for [http://www.medterms.com/script/main/art.asp?articlekey=13537] [http://www.answers.com/topic/gene-amplification] (Matt)

'''gene fusion'''-occurs when DNA segments of two different genes come together. Can result in hybrid proteins ([http://www.biochem.northwestern.edu/holmgren/Glossary/Definitions/Def-G/gene_fusion.html 9] Pallavi)

'''gene knockout''' - a process in which a gene is deactivated within a test organism in order to better understand the function of the gene in that organism [http://en.wikipedia.org/wiki/Gene_knockout] (Matt)

'''gene ontology'''- a collaborative effort of investigators to unify and standardize terms associated with the role a gene or protein plays in an organism. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''glaucophyte''' - freshwater algae that have not been studied well [http://en.wikipedia.org/wiki/Glaucophyte](Samantha) 

== H ==

'''haemolysin or hemolysin''' - a substance produced by bacteria that causes lysis of red blood cells [http://en.wikipedia.org/wiki/Hemolysis_(microbiology)] (Nick)

'''halophile''' - an organism, most often of the Archaea domain, that lives in environments containing high concentrations of salt [http://en.wikipedia.org/wiki/Halophile] (Matt)

'''haplotype'''-collection of alleles that travel together (Lecture, Pallavi)

'''haptophyte''' - phylum of algae [http://en.wikipedia.org/wiki/Haptophyte](Samantha)

'''heterokont''' - major line of eukaryotes consisting of about 10,500 known species, most of which are algae [http://en.wikipedia.org/wiki/Heterokont](Samantha)

'''Hidden Markov Model''' - a statistical model used in protein recognition databases such as Pfam. A Hidden Markov Model keeps track of several variables and possible variations thereof, such as the possible amino acid sequences that make up a protein domain (since there can be some variance in an amino acid sequence) or the variations in the component sounds that make up a word, and uses those points to match a given sequence to the word, domain, or other complex sequence it most closely matches. An HMM in speech recognition software, for example, can identify that a certain set of sounds make up a certain word, even with the variations in pronunciation and accent that different people will give those sounds. ([http://en.wikipedia.org/wiki/Hidden_Markov_Model Wikipedia] and lecture, Laura)

'''HMM Logo''' - a graphical representation of an HMM, detailing the possible amino acid sequences, the relative frequencies and probabilities of each amino acid in the sequence, the relative contribution each amino acid has to the overall protein family, and the charge or nature of the amino acids themselves. ([http://www.sanger.ac.uk/Software/analysis/logomat-m/help.shtml How to read HMM Logos, on Pfam], Laura)

'''homeobox''' - DNA sequence within transcription factor genes that allow the cell to respond to patterns of development by having the transcription factors switch on gene cascades [http://en.wikipedia.org/wiki/Homeobox](Samantha)

'''homodimer''' - a protein made of paired identical polypeptides ([http://www.answers.com/topic/homodimer Answers.com], Jay)

'''horizontal gene transfer'''-DNA transmission between species and incorporation of the DNA into the recipient's genome ([http://www.csrees.usda.gov/nea/biotech/res/biotechnology_res_glossary.html horizontal gene transfer] Pallavi)

'''''Hox'' gene'''-a gene that contains a homeobox region that is involved in morphogenesis along the cranio-caudal body axis ([http://www.uprightape.net/UA_Glossary.html 4] Pallavi)

'''hydrolase''' - an enzyme that catalyzes hydrolysis, the cleavage of a chemical bond by the addition of water, in which the hydrogen and hydroxyl groups of the water molecule are added to the two resulting products [http://en.wikipedia.org/wiki/Hydrolase] (Nick)

== I ==

'''ideogram''' - in genomics, usually describes a stylized representation of a chromosome with banding patterns (Campbell-Heyer Genomics textbook, Jay)

'''identities''' - in a BLAST output, the number and fraction of total residues which are identical in a given alignment [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)
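As an illustration, the count can be reproduced from two aligned strings. This Python sketch assumes gaps are written as "-" and that both strings have already been aligned to the same length; the function name is hypothetical.

```python
def identities(aligned_a, aligned_b):
    """Return (identical_columns, alignment_length) for two aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # A column counts as an identity only if both residues match and
    # neither one is a gap
    matches = sum(1 for x, y in zip(aligned_a, aligned_b)
                  if x == y and x != "-")
    return matches, len(aligned_a)


matches, length = identities("ACGT-A", "ACCTTA")
print(f"Identities = {matches}/{length} ({100 * matches // length}%)")
```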

'''indole'''-a chemical compound that is produced from the breakdown of tryptophan ([http://medical-dictionary.thefreedictionary.com/indole indole] Pallavi)

'''inclusion body''' - Inclusion bodies are collections of stainable substances, usually proteins, that are found either in the nucleus or the cytoplasm. It is thought that these bodies are often the result of viral proteins that misfolded [http://en.wikipedia.org/wiki/Inclusion_body] (Nick)

'''intron''' - a region of DNA in a gene that is not part of the final coding sequence for the protein. [http://en.wikipedia.org/wiki/Intron] (Peter)

'''isoelectric point''' - the pH at which a molecule carries no net electrical charge [http://en.wikipedia.org/wiki/Isoelectric_point] (Nick)

'''isozymes''' - members of a gene family with very similar cellular roles (Campbell-Heyer Genomics textbook, Jay)

== J ==

== K ==
'''KEGG (Kyoto Encyclopedia of Genes and Genomes)''' - a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The Pathway database records networks of molecular interactions in the cells, and variants of them specific to particular organisms [http://en.wikipedia.org/wiki/KEGG](Will).

'''kinase''' - a type of enzyme that transfers a phosphate group from a high-energy donor molecule to a target molecule in a process called phosphorylation. [http://en.wikipedia.org/wiki/Kinase] (Peter)

== L ==

== M ==
'''Manatee''' - a web-based gene evaluation and genome annotation tool that can view, modify, and store annotation for prokaryotic and eukaryotic genomes. This on-going, open source initiative was developed with two missions: first, to give biologists the ability to functionally annotate their genomes using a powerful, stand-alone web application with a robustly designed relational annotation database; and second, to give outside developers the opportunity to contribute their own ideas and requirements to enhance Manatee's ability to accomplish biological goals [http://manatee.sourceforge.net/](Will). 

'''microsatellites'''-stretches of repetitive, short DNA segments that can be used to track the inheritance of certain traits within families ([http://www.clanlindsay.com/genetic_dna_glossary.htm 3] Pallavi)

'''minisatellites'''-segments of DNA that can be used for individual identification (ex. DNA fingerprinting) or in determining relationships between people (ex. paternity cases) ([http://www.clanlindsay.com/genetic_dna_glossary.htm 2] Pallavi).

'''motif''' - a sequence of amino acids or nucleotides that performs a particular role and is often conserved in other species or molecules. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''mycoplasma''' - genus of bacteria that lack a cell wall [http://en.wikipedia.org/wiki/Mycoplasma] (Nick)

== N ==

'''NORFs''' (nonannotated open reading frames) - open reading frames that were considered not to be real genes when the genome was annotated. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''nucleomorph''' - reduced eukaryotic nuclei found in plastids [http://en.wikipedia.org/wiki/Nucleomorph](Samantha)

== O ==
'''object-oriented programming''' - a programming paradigm in which collections of data, associated with operations on that data, are modularly defined and then built upon (CSC 121 Lecture, Will).

'''open reading frame (ORF)''' - a segment of DNA that can potentially encode a protein; it begins with a start codon (usually ATG) and continues in the same reading frame to a stop codon [http://www.fao.org/DOCREP/003/X3910E/X3910E18.htm ORF] (Pallavi)
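
The definition above can be illustrated with a short Python sketch. It is only a minimal example, not how any of the annotation systems call genes: it scans the forward strand only, treats ATG as the only start codon, and the function name <code>find_orfs</code> and the <code>min_codons</code> parameter are made up for illustration.

```python
# Minimal forward-strand ORF scan (illustrative sketch only).
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) pairs, 0-based with end exclusive.

    Each ORF runs from an ATG to the first in-frame stop codon and
    includes that stop codon. min_codons counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == START:
                # Walk codon by codon until an in-frame stop codon.
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j + 3] in STOPS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3))
                        i = j  # continue scanning after this ORF
                        break
            i += 3
    return sorted(orfs)
```

For example, <code>find_orfs("CCATGAAATAGGG")</code> reports one ORF covering ATG AAA TAG. Real gene callers also check the reverse strand, alternative start codons, and ribosome-binding sites.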

'''operon''' - a segment of DNA involving an operator, promoter, and one or more genes that operate as a single unit during transcription [http://en.wikipedia.org/wiki/Operon] (Nick)

'''optical mapping'''-DNA sequences of the organism in question are compared against a karyotype that specifically looks at restriction sites found within the DNA to correctly order the DNA sequences on a chromosome. This methodology gives very detailed haplotype information and allows for the detection of sequence variations across an entire genome [http://www.geocities.com/bioinformaticsweb/genomicglossary.html optical mapping] (Pallavi)

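'''ortholog''' - homologous genes in different species that descend from a single gene in their last common ancestor (that is, they were separated by a speciation event) and therefore often look very similar (Lecture, Pallavi)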

'''oxidoreductase''' - an enzyme that catalyzes redox reactions by transferring electrons from one molecule (the reductant) to another (the oxidant) [http://en.wikipedia.org/wiki/Oxidoreductase] (Nick)

== P ==

'''paralog''' - homologous genes within the same genome that arose by gene duplication; their sequences are similar but not necessarily identical (Lecture, Pallavi)

'''p-arm''' - the shorter arm of a chromosome's two arms separated by the centromere (compare to q-arm, the longer arm) ([http://www.medterms.com/script/main/art.asp?articlekey=4715 MedTerms Dictionary], Jay)

'''Perl''' - Developed by Larry Wall in 1987, Perl is a [http://en.wikipedia.org/wiki/High-level_programming_language high-level programming language] used frequently by biologists and bioinformaticists [http://en.wikipedia.org/wiki/Perl] (Will).

'''periplasmic space''' - the space between the inner cytoplasmic membrane and external outer membrane in bacteria or archaea. [http://en.wikipedia.org/wiki/Periplasmic_space] (Peter)

'''Pfam''' - a database for protein domain families that matches amino acid sequences or nucleotide sequences to the related group of proteins to which they most likely belong. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura)

'''plasmid''' - an extra-chromosomal DNA molecule that is capable of replicating independently of the chromosomal DNA. Commonly found in bacteria and archaea. [http://en.wikipedia.org/wiki/Plasmid](Peter)

'''plastid''' - major organelles in plants or algae [http://en.wikipedia.org/wiki/Plastid](Samantha)

'''pleomorphism''' - the occurrence of two or more structural forms during a life cycle [http://en.wikipedia.org/wiki/Pleomorphism] (Mary)

'''phylogenetic tree''' - a diagram showing the evolutionary relationships between biological species that are thought to share a common ancestor [http://en.wikipedia.org/wiki/Phylogenetic_tree] (Nick)

'''phylotypes''' – a term intended to resolve the challenge of “species” when classifying prokaryotes using DNA sequence comparisons. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''positives''' - in a BLAST output, the number and fraction of residues for which the alignment scores have positive rather than negative values [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)

'''proteome''' - entire set of proteins expressed by a genome, cell, tissue, or organism. It may refer to expressed proteins under certain conditions [http://en.wikipedia.org/wiki/Proteome](Samantha)

'''PSORT''' - a prediction server that judges where a mature protein could be in the cell, based on its transmembrane domains, its predicted mature amino acid composition, and its signal sequences. ([http://psort.ims.u-tokyo.ac.jp/form.html PSORT], Laura)

'''pseudogenes''' - sequences of DNA that look like genes but are no longer functional, often because they contain premature stop codons. A pseudogene may have evolved away from a real gene, or a paralog might have taken its place (Lecture, Pallavi)

'''purine''' - a category of nitrogenous base consisting of a pyrimidine ring fused to an imidazole ring. Notable purine bases are adenine and guanine. [http://en.wikipedia.org/wiki/Purine] (Peter)

'''pyrimidine''' - a category of nitrogenous base consisting of a heterocyclic aromatic ring containing two nitrogen atoms at positions 1 and 3 of the six-member ring. Notable pyrimidine bases are cytosine, thymine, and uracil. [http://en.wikipedia.org/wiki/Pyrimidine] (Peter)

== Q ==

'''q-arm''' - the longer arm of a chromosome's two arms separated by the centromere (compare to p-arm, the shorter arm) ([http://www.medterms.com/script/main/art.asp?articlekey=5152 MedTerms Dictionary], Jay) 

'''query sequence''' - the sequence (whether amino acid or nucleotide) entered into a database’s search function and checked against the database entries. ([http://en.wikipedia.org/wiki/BLAST BLAST on Wikipedia], Laura)

== R ==

'''RAST''' - (Rapid Annotation using Subsystem Technology)- a fully-automated service for annotating bacterial and archaeal genomes. It provides high quality genome annotations for these genomes across the whole phylogenetic tree. ([http://rast.nmpdr.org/], Max Win)

'''rDNA'''-These are DNA sequences that encode for ribosomal RNA. Note that rDNA can also stand for recombinant DNA. ([http://en.wikipedia.org/wiki/Ribosomal_DNA rDNA] Pallavi)

'''residue (protein)''' - the remaining portion of an amino acid after a water molecule has been removed and it has been incorporated into a protein. Functional residues, referred to in Pfam, are the residues that perform some specific identifiable function or are part of a domain, and can be conserved across evolutionarily-related proteins. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura) 

'''retropseudogenes'''-these are genes that have been reverse-transcribed from mRNA and the resulting DNA sequence is incorporated back into the genome. They are non-functional segments of DNA and can be distinguished from pseudogenes in that they do not have intron sequences. ([http://genome.cshlp.org/cgi/content/full/10/5/672 1] Pallavi)

'''retrotransposons''' - RNA transcribed back into DNA and added into the genome [http://en.wikipedia.org/wiki/Retrotransposon](Samantha)

'''ribonuclease''' - a nuclease that catalyzes the degradation of RNA into smaller components [http://en.wikipedia.org/wiki/Ribonuclease] (Mary)

== S ==
'''Serovar'''-a subdivision of a species based on the characteristics of their cell surface antigens ([http://www.biology-online.org/dictionary/Serovar serovar] Pallavi)

'''scaffold''' - a section of a sequenced genome composed of contigs that are in the right order but not necessarily connected ([http://www.medterms.com/script/main/art.asp?articlekey=25223 MedTerms Dictionary], Jay)

'''"Shadow enhancers"''' - secondary enhancers that are thought to be important for natural selection to occur in regulatory DNA segments. They evolve much faster than primary enhancers, which suggests that they are under fewer functional constraints (Wray and Babbitt 2008 ''Science'' Pallavi)

'''Shine-Dalgarno sequence''' - a ribosome-binding site on an mRNA, usually a sequence of about six bases located about six or seven bases upstream of the start codon. An anti-Shine-Dalgarno sequence exists on the rRNA in the small subunit of the ribosome; when the two sequences pair, the mRNA is lined up and prepared for translation. (Lecture and [http://en.wikipedia.org/wiki/Shine-dalgarno Wikipedia article], Laura) 
Note: The Shine-Dalgarno consensus sequence for our genome is ccGGAGGt.
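
As a rough illustration of how such a site could be checked for a candidate start codon, the Python sketch below looks for the core of the consensus noted above just upstream of a given start position. The choice of <code>GGAGG</code> as the core (the upper-case part of ccGGAGGt), the 20-base window, and the function name <code>find_sd</code> are assumptions made for this example; it requires an exact match, which real sites often do not have.

```python
def find_sd(seq, start_codon_index, core="GGAGG", window=20):
    """Return the number of bases between the end of the nearest core
    motif and the start codon, or None if the window has no exact match.

    start_codon_index is the 0-based position of the first base of the
    start codon in seq (DNA alphabet, coding strand).
    """
    seq = seq.upper()
    lo = max(0, start_codon_index - window)
    upstream = seq[lo:start_codon_index]
    pos = upstream.rfind(core)  # occurrence closest to the start codon
    if pos == -1:
        return None
    return len(upstream) - (pos + len(core))
```

For instance, in <code>TTCCGGAGGTAACACCATGAAA</code> the start codon is at index 16 and the motif ends seven bases before it, which fits the typical spacing described in the definition.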

'''SignalP''' - a prediction server that judges whether or not a query protein contains a signal peptide. SignalP measures each amino acid against the amino acid sequences of probable signal peptide matches and predicts the cleavage site of the signal peptide. ([http://www.cbs.dtu.dk/services/SignalP-3.0/output.php SignalP Output explained], Laura)

'''signal peptide''' - a short peptide chain that directs the post-translational transport of a protein [http://en.wikipedia.org/wiki/Signal_peptide] (Matt)

'''Smith-Waterman alignment''' - A well-known algorithm for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure [http://en.wikipedia.org/wiki/Smith_waterman](Will).
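
A minimal Python sketch of the scoring step may make the idea concrete. The scoring values (match +2, mismatch -1, linear gap -2) and the function name are arbitrary choices for illustration; the sketch returns only the best local score and does not trace back the aligned segments.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            # The floor of 0 lets an alignment restart anywhere, which is
            # what makes the comparison local rather than global.
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best
```

With these values, <code>smith_waterman_score("TTACGTTT", "GGACGGG")</code> scores the shared segment ACG and ignores the dissimilar flanks.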

'''SNP (Single Nucleotide Polymorphism)''' - a DNA sequence variation occurring when a single nucleotide in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual) [http://en.wikipedia.org/wiki/Single_nucleotide_polymorphism](Will).

'''symporter''' - an integral membrane protein that is involved in movement of two or more different molecules or ions across a phospholipid membrane in the same direction. [http://en.wikipedia.org/wiki/Symporter] (Peter)

'''synteny''' - a neologism from the Greek for "on the same ribbon". Genes that are syntenic in one species are on the same chromosome; genes that are syntenic across species retain the same order on respective chromosomes as a result of descent from a common ancestor ([http://www.answers.com/synteny Answers.com], Jay)

'''synthetase''' - a type of enzyme that creates a new covalent bond and requires direct input of energy from a high-energy phosphate. [http://books.google.com/books?id=bB8XnCykRmIC&pg=PA522&lpg=PA522&dq=%22synthetase+is+an+enzyme%22&source=web&ots=wkws4ksMsg&sig=zWLkDIk7T78hcf9S84nWs3u5Apw&hl=en&sa=X&oi=book_result&resnum=9&ct=result] (Peter)

== T ==
'''transferase''' - an enzyme that catalyzes the transfer of a functional group from one molecule (the donor) to another (the acceptor) [http://en.wikipedia.org/wiki/Transferase] (Matt)

'''transmembrane helix''' - a single transmembrane alpha helix of a transmembrane protein, usually about twenty amino acids in length. They are usually predicted by hydrophobicity. [http://en.wikipedia.org/wiki/Transmembrane_domain](Mary)

'''transposons / transposable elements''' - DNA sequences that can move around to different positions in a single cell's genome. Transposons can cause mutations and change the length of the genome. [http://en.wikipedia.org/wiki/Transposon](Samantha)

'''Transposon Mutagenesis'''-a procedure in which a transposon is inserted into a gene, which inactivates the gene and can lead to the discovery of the phenotype associated with this gene ([http://cancerweb.ncl.ac.uk/cgi-bin/omd?transposon+mutagenesis transposon mutagenesis] Pallavi)

'''Trans-splicing''' - exons from two separate RNA transcripts are joined to form a mature mRNA. This process results in a fusion mRNA ([http://www.representinggenes.org/Glossary.html 8] Pallavi).

'''tRNA splicing endonuclease''' - an enzyme that cleaves intervening sequences of precursor tRNA. [http://cancerweb.ncl.ac.uk/cgi-bin/omd?splicing+endonuclease] (Peter) 

== U ==

== V ==
'''Vertical gene transfer'''-the transmission or absorption of genetic material that is associated with sexual reproduction and, thus, acknowledges species-specific boundaries ([http://www.gmo-compass.org/eng/glossary/#G 6] Pallavi)

== W ==

'''whole genome shotgun sequencing''' - a method of sequencing where DNA is cut into small pieces and cloned into vectors, then both ends of every vector are sequenced in about 500 bps to form mate pairs. Mate pairs rarely overlap, but are used to reassemble the sequence using software. [http://en.wikipedia.org/wiki/Whole_genome_shotgun](Samantha)
 

== X ==
'''xenolog''' - homologs that are created by horizontal gene transfer between two different species [http://en.wikipedia.org/wiki/Xenolog#Xenology] (Matt) 

== Y ==

== Z ==

Amino acids

2008-11-06T15:45:24Z

SaSimpson:

'''A = alanine''' 
The RAST KEGG viewer says there is no way to convert between pyruvate and alanine; however, the RAST spreadsheet calls enzyme 2.6.1.44 (an alanine-glyoxylate aminotransferase).
 

'''C = cysteine''' 
 

'''D = aspartic acid''' 
 

'''E = glutamic acid''' 
 

'''F = phenylalanine''' 
 

'''G = glycine''' 
Appears to be missing EC numbers 2.6.1.44 and 1.4.4.2. The RAST spreadsheet calls 2.6.1.44.
 

'''H = histidine''' 
Appears to be complete.
 

'''I = isoleucine''' 
Appears to be complete. 

'''K = lysine''' 
Appears to be complete. 

'''L = leucine''' 
Appears to be complete. 

'''M = methionine''' 
 

'''N = asparagine''' 
Our organism appears to have no enzyme that makes L-asparagine from L-aspartate (asparagine synthetase). On the RAST spreadsheet, however, RAST calls an "asparagine family protein" at bps 1139281-1140141 (EC 6.3.5.4).
 

'''P = proline''' 
Appears to be complete.
 

'''Q = glutamine''' 
Appears to be fairly complete when compared with Haloarcula marismortui, with the exceptions being a missing NAD synthase (EC 6.3.5.1), a semialdehyde dehydrogenase (EC 1.2.1.16), a 4-aminobutyrate aminotransferase (EC 2.6.1.19), and a carbamoyl-phosphate synthetase (EC 6.3.4.16). A glutamyl-tRNA(Gln) amidotransferase subunit (EC 6.3.5.7) appears to be missing, but the RAST spreadsheet calls several glutamyl-tRNA amidotransferase subunits, listing the EC as 6.3.5.-. Also, an arginine decarboxylase (EC 4.1.1.19) appears to be missing, but the RAST spreadsheet calls an arginine decarboxylase with this EC number.
 

'''R = arginine''' 
Appears to be complete.
 

'''S = serine''' 
Matches Haloarcula marismortui, except that in the RAST KEGG diagram it appears to be missing enzyme 2.7.8.8, a CDP-diacylglycerol-serine O-phosphatidyltransferase. However, according to the RAST spreadsheet, there is a CDP-diacylglycerol-serine O-phosphatidyltransferase from bps 1440328-1439621.
 

'''T = threonine''' 
Appears to be complete.
 

'''V = valine''' 
Appears to be complete. 

'''W = tryptophan''' 
 

'''Y = tyrosine'''
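
Several notes above follow the same pattern: an EC number looks missing in the RAST KEGG view but is called on the RAST spreadsheet, sometimes only as a partial number such as 6.3.5.-. The Python sketch below shows one way to cross-check the two lists. The function names are hypothetical, the lists must be typed in by hand from the two sources, and a partial EC number that covers a missing one is only a possible match, not a confirmed one.

```python
def ec_matches(query, called):
    """True if `called` equals `query` or covers it with '-' wildcards,
    e.g. called '6.3.5.-' covers query '6.3.5.7'."""
    q, c = query.split("."), called.split(".")
    return len(q) == len(c) and all(cp == "-" or cp == qp for qp, cp in zip(q, c))

def unresolved_gaps(missing_in_kegg_view, called_on_spreadsheet):
    """Return EC numbers missing from the KEGG view that the spreadsheet
    does not call either (exactly or via a partial EC number)."""
    return [
        ec for ec in missing_in_kegg_view
        if not any(ec_matches(ec, c) for c in called_on_spreadsheet)
    ]

# Glycine, using the numbers noted above.
glycine_gaps = unresolved_gaps(["2.6.1.44", "1.4.4.2"], ["2.6.1.44"])
```

Here <code>glycine_gaps</code> contains only 1.4.4.2, the one enzyme that neither source accounts for.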

Amino acids

2008-11-06T15:35:12Z

SaSimpson:

'''A = alanine''' 
The RAST KEGG viewer says there is no way to convert between pyruvate and alanine; however, the RAST spreadsheet calls enzyme 2.6.1.44 (an alanine-glyoxylate aminotransferase).
 

'''C = cysteine''' 
 

'''D = aspartic acid''' 
 

'''E = glutamic acid''' 
 

'''F = phenylalanine''' 
 

'''G = glycine''' 
Appears to be missing EC numbers 2.6.1.44 and 1.4.4.2. The RAST spreadsheet calls 2.6.1.44.
 

'''H = histidine''' 
Appears to be complete.
 

'''I = isoleucine''' 
Appears to be complete. 

'''K = lysine''' 
Appears to be complete for biosynthesis but very incomplete for degradation.

'''L = leucine''' 
Appears to be complete. 

'''M = methionine''' 
 

'''N = asparagine''' 
Our organism appears to have no enzyme that makes L-asparagine from L-aspartate (asparagine synthetase). On the RAST spreadsheet, however, RAST calls an "asparagine family protein" at bps 1139281-1140141.
 

'''P = proline''' 
Appears to be complete.
 

'''Q = glutamine''' 
Appears to be fairly complete when compared with Haloarcula marismortui, with the exceptions being a missing NAD synthase (EC 6.3.5.1), a semialdehyde dehydrogenase (EC 1.2.1.16), a 4-aminobutyrate aminotransferase (EC 2.6.1.19), and a carbamoyl-phosphate synthetase (EC 6.3.4.16). A glutamyl-tRNA(Gln) amidotransferase subunit (EC 6.3.5.7) appears to be missing, but the RAST spreadsheet calls several glutamyl-tRNA amidotransferase subunits, listing the EC as 6.3.5.-. Also, an arginine decarboxylase (EC 4.1.1.19) appears to be missing, but the RAST spreadsheet calls an arginine decarboxylase with this EC number.
 

'''R = arginine''' 
Appears to be complete.
 

'''S = serine''' 
Matches Haloarcula marismortui, except that in the RAST KEGG diagram it appears to be missing enzyme 2.7.8.8, a CDP-diacylglycerol-serine O-phosphatidyltransferase. However, according to the RAST spreadsheet, there is a CDP-diacylglycerol-serine O-phosphatidyltransferase from bps 1440328-1439621.
 

'''T = threonine''' 
Appears to be complete.
 

'''V = valine''' 
Appears to be complete. 

'''W = tryptophan''' 
 

'''Y = tyrosine'''

Amino acids

2008-11-06T15:24:00Z

SaSimpson:

'''A = alanine''' 
The RAST KEGG viewer says there is no way to convert between pyruvate and alanine; however, the RAST spreadsheet calls enzyme 2.6.1.44 (an alanine-glyoxylate aminotransferase).
 

'''C = cysteine''' 
 

'''D = aspartic acid''' 
 

'''E = glutamic acid''' 
 

'''F = phenylalanine''' 
 

'''G = glycine''' 
Appears to be missing EC numbers 2.6.1.44 and 1.4.4.2. The RAST spreadsheet calls 2.6.1.44.
 

'''H = histidine''' 
Appears to be complete.
 

'''I = isoleucine''' 
 

'''K = lysine''' 
 

'''L = leucine''' 
 

'''M = methionine''' 
 

'''N = asparagine''' 
Our organism appears to have no enzyme that makes L-asparagine from L-aspartate (asparagine synthetase). On the RAST spreadsheet, however, RAST calls an "asparagine family protein" at bps 1139281-1140141.
 

'''P = proline''' 
 

'''Q = glutamine''' 
Appears to be fairly complete when compared with Haloarcula marismortui, with the exceptions being a missing NAD synthase (EC 6.3.5.1) and a carbamoyl-phosphate synthetase (EC 6.3.4.16). A glutamyl-tRNA(Gln) amidotransferase subunit (EC 6.3.5.7) appears to be missing, but the RAST spreadsheet calls several glutamyl-tRNA amidotransferase subunits, listing the EC as 6.3.5.-.
 

'''R = arginine''' 
 

'''S = serine''' 
Matches Haloarcula marismortui, except that in the RAST KEGG diagram it appears to be missing enzyme 2.7.8.8, a CDP-diacylglycerol-serine O-phosphatidyltransferase. However, according to the RAST spreadsheet, there is a CDP-diacylglycerol-serine O-phosphatidyltransferase from bps 1440328-1439621.
 

'''T = threonine''' 
Appears to be complete.
 

'''V = valine''' 
 

'''W = tryptophan''' 
 

'''Y = tyrosine'''

Amino acids

2008-11-06T15:21:09Z

SaSimpson:

Amino acids

2008-11-06T03:54:51Z

SaSimpson:

Amino acids

2008-11-06T03:44:45Z

SaSimpson:

Amino acids

2008-11-06T03:38:27Z

SaSimpson:

'''A = alanine''' 
 
'''C = cysteine''' 
 
'''D = aspartic acid''' 
 
'''E = glutamic acid''' 
 
'''F = phenylalanine''' 
 
'''G = glycine''' 
 
'''H = histidine''' 
 
'''I = isoleucine''' 
 
'''K = lysine''' 
 
'''L = leucine''' 
 
'''M = methionine''' 
 
'''N = asparagine''' 
Our organism appears to have no enzyme that makes L-asparagine from L-aspartate (asparagine synthetase). On the RAST spreadsheet, however, RAST calls an "asparagine family protein" at bps 1139281-1140141.
 
'''P = proline''' 
 
'''Q = glutamine''' 
Appears to be fairly complete when compared with Haloarcula marismortui, with one exception being a missing NAD synthase (EC 6.3.5.1).
 
'''R = arginine''' 
 
'''S = serine''' 
 
'''T = threonine''' 
 
'''V = valine''' 
 
'''W = tryptophan''' 
 
'''Y = tyrosine'''

Amino acids

2008-11-06T03:31:26Z

SaSimpson:

'''A = alanine''' 
 
'''C = cysteine''' 
 
'''D = aspartic acid''' 
 
'''E = glutamic acid''' 
 
'''F = phenylalanine''' 
 
'''G = glycine''' 
 
'''H = histidine''' 
 
'''I = isoleucine''' 
 
'''K = lysine''' 
 
'''L = leucine''' 
 
'''M = methionine''' 
 
'''N = asparagine''' 
 
'''P = proline''' 
 
'''Q = glutamine''' 
Appears to be fairly complete when compared with Haloarcula marismortui, with one exception being a missing NAD synthase (EC 6.3.5.1).
 
'''R = arginine''' 
 
'''S = serine''' 
 
'''T = threonine''' 
 
'''V = valine''' 
 
'''W = tryptophan''' 
 
'''Y = tyrosine'''

Amylases

2008-11-06T03:19:50Z

SaSimpson:

There are three types of amylases: alpha, beta, and gamma. All can be present in bacteria, although they each have preferred environments. Alpha amylase is a major digestive enzyme in animals and works optimally at a pH of 6.7-7.0. Beta amylase is present in fruit and microbes. Gamma amylase works best in environments with a pH of around 3.0.

In H. utahensis, I found both alpha and gamma amylases. Alpha amylase is from bps 1748453-1750477. A BLAST of this DNA segment with proteins in NCBI's database matches it to an alpha amylase from Haloquadratum walsbyi with an e-value of 7x10^-172. Gamma amylase is from bps 1751071-1755111. A BLAST of this DNA segment with NCBI's protein database matches it to a glucoamylase (another name for gamma amylase) from Haloarcula marismortui with an e-value of 0.0. Despite these findings, the RAST KEGG Pathway annotation did not identify any amylase genes. Here is an image of the KEGG pathway annotation. Enzymes identified are in boxes with a green background. Amylase genes are boxed in red. Alpha amylase is 3.2.1.1 and gamma amylase is 3.2.1.3.

[[Image:Amylase.jpg]]

The KEGG pathway annotation also did not indicate that RAST found any enzymes whose substrates or products were dextrin or alpha-D-glucose. However, I found a possible 4-alpha-glucanotransferase (EC 2.4.1.25) when BLASTing the sequence for this enzyme from Burkholderia vietnamiensis G4 against our genome (E-value 8e-21). I also found this enzyme listed on the RAST spreadsheet from bps 1746766 to 1745279. I also found a potential oligo-1,6-glucosidase (EC 3.2.1.10) when BLASTing the sequence for this enzyme from Lactobacillus salivarius UCC118 against our genome (E-value 7e-24). However, there were no oligo-1,6-glucosidases reported on the RAST spreadsheet.
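
The base-pair ranges quoted on this page can be turned into a gene length, a strand, and a sequence with a few lines of Python. This is a sketch under two assumptions that the page does not state explicitly: coordinates are 1-based and inclusive, and a range written with the larger number first (such as 1746766 to 1745279) lies on the reverse strand. The function names are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def gene_span(start, end):
    """Return (length_bp, strand) for 1-based inclusive coordinates."""
    length = abs(end - start) + 1
    strand = "+" if start <= end else "-"
    return length, strand

def extract(genome, start, end):
    """Return the gene sequence, reverse-complemented when start > end."""
    lo, hi = min(start, end), max(start, end)
    segment = genome[lo - 1:hi].upper()
    if start <= end:
        return segment
    return segment.translate(COMPLEMENT)[::-1]

alpha_amylase = gene_span(1748453, 1750477)
gamma_amylase = gene_span(1751071, 1755111)
glucanotransferase = gene_span(1746766, 1745279)
```

Under these assumptions the three lengths are 2025, 4041, and 1488 bp, each a multiple of three as expected for a complete coding sequence.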

Amylases

2008-11-06T03:14:16Z

SaSimpson:

There are three types of amylases: alpha, beta, and gamma. All can be present in bacteria, although they each have preferred environments. Alpha amylase is a major digestive enzyme in animals and works optimally at a pH of 6.7-7.0. Beta amylase is present in fruit and microbes. Gamma amylase works best in environments with a pH of around 3.0.

In H. utahensis, I found both alpha and gamma amylases. Alpha amylase is from bps 1748453-1750477. A BLAST of this DNA segment with proteins in NCBI's database matches it to an alpha amylase from Haloquadratum walsbyi with an e-value of 7x10^-172. Gamma amylase is from bps 1751071-1755111. A BLAST of this DNA segment with NCBI's protein database matches it to a glucoamylase (another name for gamma amylase) from Haloarcula marismortui with an e-value of 0.0. Despite these findings, the RAST KEGG Pathway annotation did not identify any amylase genes. Here is an image of the KEGG pathway annotation. Enzymes identified are in boxes with a green background. Amylase genes are boxed in red. Alpha amylase is 3.2.1.1 and gamma amylase is 3.2.1.3.

[[Image:Amylase.jpg]]

The KEGG pathway annotation also did not find any enzymes whose substrates or products were dextrin or alpha-D-glucose. However, I found a possible 4-alpha-glucanotransferase (EC 2.4.1.25) when BLASTing the sequence for this enzyme from Burkholderia vietnamiensis G4 against our genome (E-value 8e-21). I also found a potential oligo-1,6-glucosidase (EC 3.2.1.10) when BLASTing the sequence for this enzyme from Lactobacillus salivarius UCC118 against our genome (E-value 7e-24).

Amylases

2008-11-06T03:08:39Z

SaSimpson:

There are three types of amylases: alpha, beta, and gamma. All can be present in bacteria, although they each have preferred environments. Alpha amylase is a major digestive enzyme in animals and works optimally at a pH of 6.7-7.0. Beta amylase is present in fruit and microbes. Gamma amylase works best in environments with a pH of around 3.0.

In H. utahensis, I found both alpha and gamma amylases. Alpha amylase is from bps 1748453-1750477. A BLAST of this DNA segment with proteins in NCBI's database matches it to an alpha amylase from Haloquadratum walsbyi with an e-value of 7x10^-172. Gamma amylase is from bps 1751071-1755111. A BLAST of this DNA segment with NCBI's protein database matches it to a glucoamylase (another name for gamma amylase) from Haloarcula marismortui with an e-value of 0.0. Despite these findings, the RAST KEGG Pathway annotation did not identify any amylase genes. Here is an image of the KEGG pathway annotation. Enzymes identified are in boxes with a green background. Amylase genes are boxed in red. Alpha amylase is 3.2.1.1 and gamma amylase is 3.2.1.3.

[[Image:Amylase.jpg]]

The KEGG pathway annotation also did not find any enzymes whose substrates or products were dextrin or alpha-D-glucose. However, I found a possible 4-alpha-glucanotransferase (EC 2.4.1.25) when BLASTing the sequence for this enzyme from Burkholderia vietnamiensis G4 against our genome (E-value 8e-21).

Amylases

2008-11-06T03:05:33Z

SaSimpson:

There are three types of amylases: alpha, beta, and gamma. All can be present in bacteria, although they each have preferred environments. Alpha amylase is a major digestive enzyme in animals and works optimally at a pH of 6.7-7.0. Beta amylase is present in fruit and microbes. Gamma amylase works best in environments with a pH of around 3.0.

In H. utahensis, I found both alpha and gamma amylases. Alpha amylase is from bps 1748453-1750477. A BLAST of this DNA segment with proteins in NCBI's database matches it to an alpha amylase from Haloquadratum walsbyi with an e-value of 7x10^-172. Gamma amylase is from bps 1751071-1755111. A BLAST of this DNA segment with NCBI's protein database matches it to a glucoamylase (another name for gamma amylase) from Haloarcula marismortui with an e-value of 0.0. Despite these findings, the RAST KEGG Pathway annotation did not identify any amylase genes. Here is an image of the KEGG pathway annotation. Enzymes identified are in boxes with a green background. Amylase genes are boxed in red. Alpha amylase is 3.2.1.1 and gamma amylase is 3.2.1.3.

[[Image:Amylase.jpg]]

The KEGG pathway annotation also did not find any enzymes whose substrates or products were dextrin or alpha-D-glucose. However, I found a possible 4-alpha-glucanotransferase (EC 2.4.1.25).

Amylases

2008-11-04T16:18:56Z

SaSimpson:

Amylases

2008-11-04T16:17:11Z

SaSimpson:

File:Amylase.jpg

2008-11-04T16:14:37Z

SaSimpson:

Amylases

2008-11-04T16:14:06Z

SaSimpson:

Halorhabdus utahensis Genome

2008-11-04T16:05:02Z

SaSimpson: /* My Favorite Pathways */

This page will be used by Davidson College students in the [http://www.bio.davidson.edu/Courses/Bio343/LabMethods.html Genomics Laboratory course].
__NOTOC__
== Links to Multiple Databases ==
*[http://imgweb.jgi-psf.org/cgi-bin/img_edu_v260/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2500575004 JGI IMG EDU] public access
**[[Media:JGIAnnotation.xls|JGI Annotation Excel Spreadsheet]]
*[http://www.tigr.org/tigr-scripts/prok_manatee/shared/login.cgi Manatee at JCVI] use the davidson number sent by email as username and password (database is nthu01 - this is case sensitive)
**[[Media:ManateeAnnotation.xls|Manatee Annotation Excel Spreadsheet]]
*[http://rast.nmpdr.org/ SEED view via RAST] use the username and password combination sent to you by SEED
**[[Media:RastAnnotation.xls|RAST Annotation Excel Spreadsheet]]
**[http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=18261238 RAST Publication in PubMed]
*[http://www.genome.jp/kegg/kaas/ KEGG] We can submit our genes to KEGG to have it mapped out, but SEED and Manatee may already do this. Do we want to ask them to upload it into their database? 
*[http://wishart.biology.ualberta.ca/basys/cache/135af8726ad6f61ec4c5f1e9c4d139ac/index.html BASYs] 
*[http://gcat.davidson.edu/Registry/compare/ Pairwise comparisons of All Three Annotations]

[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI_5contigs.txt JGI Full genome, 5 separate contigs & 3.1 Mbp, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.txt JGI gene DNA sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.xls JGI gene annotations, Excel] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_proteins.txt JGI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_merged.txt JCVI Full genome, 5 contigs fused, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_ORFs.txt JCVI gene sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_proteins.txt JCVI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/GeneLengths.xls 3-way comparison, Excel] 
[[Venn_diagrams]] Venn diagram of 3-way comparison

 

== RNA Genes ==

*[[tRNA Genes Check List]] 
*[[rRNA operon]] 
*[[2 misc. RNA genes]] (short summary list) 
*[[Missing tRNA-trp gene found]] 

== Other Resources ==
*[[Consensus Shine Dalgarno]] Excel File for ''H. utahensis'' 
*[[References]] 
*[[Gene Annotation Template]] 
*[[General Questions]] 
*[[Page for Annotated Genes]] 
*[http://www.bio.davidson.edu/courses/genomics/2008/Win/ec/ Search EC number in RAST, JGI or Manatee] 
*[http://gcat.davidson.edu/Wideloache/Webfiles/ecNumBlast.html Blast an EC number against the H. utahensis genome] 

== Research Questions ==
#How do the three systems compare for finding ORFs and RNA genes?
#Is there a pattern of missed genes for any of the 3 sites?
#Do the three systems differ in their ability to find good start codons and Shine-Dalgarno sequences? [We need a standard set of genes for comparison. Only highly conserved or a range of genes?]
# Were Shine-Dalgarno sequences calculated for our species or default values used? If default, what sequence?
#Can we fill any holes in their automated annotation? Is there a mechanism for users to add in genes?
#How do the 3 sites compare for ease of use?
#What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working?
#How does each of the 3 sites compare for pathway detection and visualization?
#Do they find the origin of replication? Can we find it?

* How do the 3 systems compare when one calls a gene hypothetical and another calls it a functional protein? How can they vary, and which is closer to correct (however you define that, possibly by date of matched entry)? (Pallavi and Mary)
* Why did one system call a gene when the other two did not? (Matt and Lara)
* How do the 3 sites compare for ease of use? What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working? (Samantha and Nick)
* Where is the origin of replication and did the 3 systems attempt to identify this?
* Did the 3 systems utilize Shine-Dalgarno sequences to help them call start codons? Did they utilize our species's consensus Shine-Dalgarno? (Peter)
* We need to fill in the [[Venn diagrams]] for our 3-way comparison. Let's compare the size of ORFs and generate a [[Gene Length Histograms|graph comparing the distributions]] for all 3. (Max and Will - they also take requests).
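
For the gene-length comparison in the last item, a binned count per annotation system is enough to plot or tabulate the distributions. The Python sketch below is one possible starting point; the 300 bp bin size and the function name are arbitrary choices, and the lengths would come from the gene-length spreadsheet linked above.

```python
from collections import Counter

def length_histogram(lengths, bin_size=300):
    """Count gene lengths per bin; each key is the lower edge of a bin."""
    counts = Counter((n // bin_size) * bin_size for n in lengths)
    return dict(sorted(counts.items()))

# Made-up lengths, only to show the output shape.
example = length_histogram([150, 299, 300, 950])
```

Running the same function on the ORF lengths from each of the three systems gives three dictionaries with matching bin edges that can be compared side by side.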

<hr>
=Our Favorites=
== My favorite genes==
Pallavi - Monooxygenase vs. Peroxiredoxin

Mary - JGI gene 2500588521 (922976...924046) [[Media:My favorite gene.ppt]]

Max - [http://app.sliderocket.com/app/FullPlayer.aspx?id=f2058b94-845f-4a11-94eb-142f251a7fea JGI gene 2500587636 (2-1849)]

Samantha - JGI gene 2500575882 (80504-80878) [[Media:Earl.ppt]]

Nick - JGI gene 2300587691 (69942...72866) [[Media:Gene presentation.ppt]]

Will - JGI gene 2500590430 (2847205..2854335)

Jay - JGI gene 2500588397 (806410..807321) [http://www.bio.davidson.edu/courses/genomics/2008/McNair/Fav_Gene/FavoriteGenePresentation.pptx Co/Zn/Cd PowerPoint]

Matt - Transcriptional Regulator nrdR (3109722..3110204 + 7274..7765)

Peter - tRNA intron endonuclease [[Media:TRNAtrpintronendonuclease.ppt]]

Laura - 16S Small ribosomal subunit, JGI gene 2500590728 (2397347..2398825)

== My Favorite Pathways==
Pallavi - Carbohydrate Metabolism

Jay - Membrane Transport

Will - Signal Transduction

Max - Energy

''Suggestions by Kjeld'' 
'''[[Cellulase]]''' by Pallavi 
I think it would be very interesting to look for genes involved in cellulose degradation: endocellulases, exocellulases (=cellobiohydrolases) and β-glucosidases.
Many cellulose degraders produce a range of each type. A cellulolytic system able to function at 4.6 M NaCl is an interesting one. We either did not observe cellulose degradation or did not look for it. However, these systems are normally inducible and you need to test several substrates and inducers. It would be nice to have a compilation of putative “cellulase” genes.
There are several good recent reviews on cellulases (also mentioning E.C. numbers and enzyme families) that your students could consult.

'''[[Chitinase]]''' by Matt 
Apparently you detected a chitinase, but according to our records it does not grow on N-acetyl-glucosamine, which is somewhat strange. It grows on glucose though.

'''[[Lipases]]''' by Mary 
Lipases (/esterases) would also be interesting to look for – some lipases have important industrial applications.

'''[[Amylases]]''' by Samantha 
We did not observe growth on starch. Did you find any “amylase-coding genes”?

'''[[Xylose (glucose) isomerase)|Xylose (glucose) isomerase]]''' by Nick
An enzyme of great commercial value.

'''[[Amino acids]]''' led by Laura and assisted by Max, Jay, Nick and Samantha 
According to our records AX-2 is able to grow in a “defined medium”. This is at variance with your “holes” for synthesis of amino acids. There could have been some “carry over” of amino acids when inoculating from a culture grown in complex medium (e.g. containing yeast extract). However, we are normally aware of this problem and do repeated culturing to dilute out potential growth factors present in yeast extract.

'''[[Proteases]]''' by Peter 
We did not detect protease activity – albeit only checking a few substrates.

'''[[Protein Export]]''' by Malcolm 
We need to know how these proteins might reach the outside of the cell, which is where the food would be. The digestive enzymes and importers therefore need to reach the outside world or the cell membrane.

= Student-created tutorials: =
== Tutorials for Annotating Genomes ==

# Will DeLoache - [http://www.bio.davidson.edu/courses/genomics/2008/DeLoache/BioPerlTutorial/BioPerl.htm BioPerl Installation] 
# Max Win - [http://www.bio.davidson.edu/courses/genomics/2008/Win/perl.html Introduction to Perl for non-programmers (with step-by-step explanations, simple exercises and solutions)] 
# Pallavi - Conserved Domains Database (CDD) [[Media:CDDtutorial.doc]] 
# Mary - Protein Data Bank (PDB) [[Media:PDB Tutorial.doc]] 
# Laura Voss - Pfam Database [http://www.bio.davidson.edu/Courses/Bio343/Pfam_tutorial.doc Pfam Tutorial] 
# Samantha Simpson - [http://www.bio.davidson.edu/courses/genomics/2008/Simpson/Tutorial.html NCBI BLAST] 
# Peter Bakke - [[Media:ShineDalgarnoTutorial.doc]] 
# Jay McNair - [http://www.bio.davidson.edu/courses/genomics/2008/McNair/Origin_Tutorial/OriginTutorial.doc Origin of Replication Tutorial] 
# Nick Carney - Navigating the JGI Database [[Media:NavigatingJGItutorial.doc]] 
# Matt Lotz - SEED Viewer - [[Media:SEEDTutorial.doc]] 
== Pathway Tutorials==
*Pallavi: I will compare RAST and KEGG in pathway annotations and use Glycolysis/Gluconeogenesis as my example

<hr>

=Glossary words (A - Z):=
[[#A| A ]] [[#B| B ]] [[#C| C ]] [[#D| D ]] [[#E| E ]] [[#F| F ]] [[#G| G ]] [[#H| H ]] [[#I| I ]] [[#J| J ]] [[#K| K ]] [[#L| L ]] [[#M| M ]] [[#N| N ]] [[#O| O ]] [[#P| P ]] [[#Q| Q ]] [[#R| R ]] [[#S| S ]] [[#T| T ]] [[#U| U ]] [[#V| V ]] [[#W| W ]] [[#X| X ]] [[#Y| Y ]] [[#Z| Z ]]

== A ==
'''Accession Number''' - a unique identifier given to DNA and protein sequences to allow for tracking of sequence information within a single database [http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)] (Will).

'''Antisense (RNA or DNA)'''-a piece of DNA or RNA that binds to a complementary sequence of DNA or RNA. These segments of genetic material can be used to identify the existence of a disease gene and they can also be used to bind to specific DNA or mRNA sequences to inhibit their function ([http://biotech.fyicenter.com/glossary/Bioinformatics_Glossary.html 5] Pallavi).

'''Arabidopsis thaliana''' - the scientific name for the thale cress plant; it was the first plant to have its genome sequenced, and is a model organism for understanding plant biology and genetics ([http://en.wikipedia.org/wiki/Thale_cress Wikipedia.org], Jay)

== B ==
'''BAC''' - bacterial artificial chromosome, a DNA construct used for transforming or cloning segments of DNA and often used to sequence the genetic code of organisms ([http://en.wikipedia.org/wiki/Bacterial_artificial_chromosome Wikipedia.org], Jay)

'''bioinformatics''' - the multi-disciplinary approach of using biology, computer science and mathematics to solve or better understand biological problems [http://en.wikipedia.org/wiki/Bioinformatics] (Matt)

'''BLAST''' - (Basic Local Alignment Search Tool) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. [http://blast.ncbi.nlm.nih.gov/Blast.cgi] (Mary)

'''bioperl'''- a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications such as accessing sequence data from local and remote databases, converting between database formats, manipulating individual sequences, searching for similar sequences, searching for genes and other structures on genomic DNA, or developing machine-readable sequence annotations. [http://en.wikipedia.org/wiki/BioPerl] (Wikipedia, Max Win)

== C ==
'''carbon fixation''' - using carbon dioxide to create organic materials [http://en.wikipedia.org/wiki/Carbon_fixation] (Samantha) 

'''CDD''' (Conserved Domains Database)- a database used to identify the conserved domains present in a protein query sequence [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml] (Mary)

'''chaperonin''' - a protein complex that assists some newly formed polypeptide chains by folding them into their final, functional, three-dimensional form [http://en.wikipedia.org/wiki/Chaperonins] (Matt)

'''chemotaxis''' - the process in which cells seek out or flee from a high concentration of certain chemicals; it is found in both uni- and multicellular organisms. Unicellular organisms use this process to avoid toxins or find food, and multicellular organisms use it for tasks such as reproduction [http://en.wikipedia.org/wiki/Chemotaxis] (Nick)

'''chemotaxonomy''' - the attempt to classify and identify organisms according to demonstrable differences and similarities in their biochemical compositions [http://en.wikipedia.org/wiki/Chemotaxonomy] (Mary)

'''ClustalW''' - A web-based or command line tool that performs multiple sequence alignments to determine evolutionary relationships between three or more sequences [http://en.wikipedia.org/wiki/Clustal] (Will).

'''COG''' (Clusters of Orthologous Groups)- each COG corresponds to a highly conserved domain and generally consists of either individual proteins or groups of paralogs ([http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml COG] Pallavi) 

'''concatemer''' - long continuous DNA molecule that contains the same DNA sequence repeated in series [http://en.wikipedia.org/wiki/Concatemer](Samantha) 

'''contigs''' (contiguous DNA)- overlapping DNA segments that as a collection form a longer and gapless segment of DNA. (Discovery Genomics, Proteomics and Bioinformatics [http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''coverage''' - refers to the number of times, on average, any piece of DNA in a sequenced genome has been individually sequenced (Lecture, Jay)
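
As a rough worked example of the definition above, average coverage is commonly estimated as (number of reads × read length) / genome size. The sketch below is in Python; the function name and the read count are hypothetical, while the 3.1 Mbp genome size and ~500 bp read length echo figures mentioned elsewhere on this wiki.

```python
def average_coverage(num_reads, read_length_bp, genome_size_bp):
    """Estimate average sequencing coverage (fold) from total bases sequenced."""
    return num_reads * read_length_bp / genome_size_bp

# Hypothetical run: 62,000 reads of 500 bp over a 3.1 Mbp genome.
print(average_coverage(62_000, 500, 3_100_000))  # 10.0
```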

'''CPAN (Comprehensive Perl Archive Network)''' - an archive of over 12,200 modules of software written in Perl, as well as documentation for it. It contains a module called CPAN (or CPAN.pm) which is used as an installer for Perl modules such as BioPerl [http://en.wikipedia.org/wiki/CPAN](Will).

'''Cytogenetics'''-the study of normal and abnormal chromosomes. This involves studying the causes of chromosomal abnormalities and looking at the structure of chromosomes ([http://www.vivo.colostate.edu/hbooks/genetics/medgen/chromo/index.html 7] Pallavi).

== D ==
'''''de novo'' synthesis''' - the synthesis of complex molecules from simple molecules (e.g. sugars and nucleotides), rather than from recycled molecules; from the Latin "of the new" [http://en.wikipedia.org/wiki/De_novo_synthesis] (Matt)

'''dehydrogenase''' - a type of enzyme that oxidizes a substrate by transferring one or more protons and a pair of electrons to an acceptor. [http://en.wikipedia.org/wiki/Dehydrogenase] (Peter)

'''diatom''' - a major group of eukaryotic algae, and one of the most common types of phytoplankton. A characteristic feature of diatom cells is that they are encased within a unique cell wall made of silica called a frustule. These frustules show a wide diversity in form, but usually consist of two asymmetrical sides with a split between them. [http://en.wikipedia.org/wiki/Diatom] (Mary)

'''domain (protein)''' - the structural and functional groups of a protein, which can exist independently of the protein itself. Domains typically perform a specific function, such as binding to promoters or substrates, and many proteins can have one or several domains in common. Evolutionarily-linked proteins are more likely to have domains in common. Domains are used to organize proteins into families. ([http://en.wikipedia.org/wiki/Domain_(protein) Wikipedia article], Laura)

'''dot plot'''-graphical display comparing sequence conservation between two genomes with dots indicating strings of identical bases. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

== E ==

'''EC number''' (Enzyme Commission Number)- a numerical classification scheme for enzymes, based on the chemical reactions they catalyze [http://en.wikipedia.org/wiki/EC_number] (Mary)

'''E-value''' (Expect value)- When performing a BLAST search, you will obtain an E-value for each sequence that is retrieved. An E-value can be thought of as the number of matches with an equal or better score that would be expected to occur by chance in a database of that size; the smaller the E-value, the less likely it is that the similarity is due to chance. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
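
The way an E-value depends on alignment score and database size can be sketched with the Karlin-Altschul formula that underlies BLAST statistics, E = K·m·n·e^(−λS), where m is the query length, n the database length and S the raw score. The Python below is an illustration only: the function names are made up, the default λ and K are typical values for gapped protein searches and are assumptions here, and real BLAST additionally uses corrected "effective" lengths.

```python
import math

def expect_value(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Karlin-Altschul expected number of chance hits scoring >= raw_score.

    lam and k are illustrative defaults (assumed), not values read from BLAST.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)

def p_value(e_value):
    """Probability of at least one chance hit, given the expected count."""
    return 1.0 - math.exp(-e_value)

# Higher scores give smaller E-values; doubling the database doubles E.
for score in (40, 60, 80):
    print(score, expect_value(score, query_len=300, db_len=1_000_000))
```

For very small E-values the P-value and the E-value are nearly equal, which is why the two are often used interchangeably in casual discussion.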

'''Extremophile''' - an organism that thrives in and may even require physically or geochemically extreme conditions that are detrimental to the majority of life on Earth [http://en.wikipedia.org/wiki/Extremophile] (Will).

== F ==

'''FASTA format''' - a text format used to convey either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented by single-letter codes. A header line beginning with “>” carries the sequence name and other descriptors and precedes the sequence itself. [http://en.wikipedia.org/wiki/FASTA_format] (Nick) 
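
To make the layout of a FASTA file concrete, here is a minimal Python sketch that reads FASTA text into a dictionary. The function name and the two example records are hypothetical; for real work a library parser such as the one in BioPerl (covered in the tutorials on this wiki) is the safer choice.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}

example = """>gene1 hypothetical protein
ATGAAA
TTTTAA
>gene2
ATGCCCTGA
"""
print(parse_fasta(example))
# {'gene1 hypothetical protein': 'ATGAAATTTTAA', 'gene2': 'ATGCCCTGA'}
```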

'''family (protein)''' - a group of evolutionarily-related proteins, often with one or several domains in common. Families are organized by domain overlap, structural/functional similarity, and sequence similarity. ([http://en.wikipedia.org/wiki/Protein_family Wikipedia article] and lecture, Laura)

'''finished genome''' - a genome that has been sequenced at least partly by hand, resulting in at least 99.99% sequence accuracy (Lecture, Jay) 

'''fusion mRNA'''-mRNA that results from the transcription of a gene after a chromosomal translocation event. This results in an mRNA sequence that comes from two different genes (Rowley and Blumenthal 2008 ''Science'' Pallavi)

== G ==

'''GC Content''' - the percentage of bases within a certain sequence of DNA (e.g. a gene or a genome) that are either guanine or cytosine; a higher GC content is characteristic of a coding region of a gene; differences in GC content between a gene and a genome can be used as evidence for horizontal gene transfer [http://en.wikipedia.org/wiki/GC-content] (Matt) 
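
The percentage described above is simple to compute. A minimal Python sketch follows (the function name is hypothetical); the example string is the Shine-Dalgarno consensus listed under S in this glossary.

```python
def gc_content(seq):
    """Return the percentage of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

print(gc_content("ATGC"))      # 50.0
print(gc_content("ccGGAGGt"))  # 75.0
```

Comparing the value for a single gene with the value for the whole genome is one way to flag candidates for horizontal gene transfer.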

'''GC-skew''' – uneven distribution of guanine and cytosine bases between the two strands of DNA where GC base pairs occur. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
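
GC skew is usually computed on one strand as (G − C) / (G + C) over a sliding window. The Python sketch below uses hypothetical function names and a toy sequence. In many bacterial chromosomes the minimum of the cumulative skew falls near the origin of replication, but whether that holds for our archaeal genome is an assumption that would need checking.

```python
def gc_skew(seq):
    """Return (G - C) / (G + C) for one strand; 0.0 if there are no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0

def cumulative_gc_skew(seq, window=1000):
    """Return [(window_start, cumulative_skew), ...] along the sequence."""
    points, total = [], 0.0
    for start in range(0, len(seq), window):
        total += gc_skew(seq[start:start + window])
        points.append((start, total))
    return points

print(gc_skew("GGGC"))                            # 0.5
print(cumulative_gc_skew("CCCCGGGG", window=4))   # [(0, -1.0), (4, 0.0)]
```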

'''gene amplification''' - production of multiple copies of a gene in order to amplify the amount of protein that the gene encodes for [http://www.medterms.com/script/main/art.asp?articlekey=13537] [http://www.answers.com/topic/gene-amplification] (Matt)

'''gene fusion'''-occurs when DNA segments of two different genes come together. Can result in hybrid proteins ([http://www.biochem.northwestern.edu/holmgren/Glossary/Definitions/Def-G/gene_fusion.html 9] Pallavi)

'''gene knockout''' - a process in which a gene is deactivated within a test organism in order to better understand the function of the gene in that organism [http://en.wikipedia.org/wiki/Gene_knockout] (Matt)

'''gene ontology'''- a collaborative effort of investigators to unify and standardize terms associated with the role a gene or protein plays in an organism. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''glaucophyte''' - freshwater algae that have not been studied well [http://en.wikipedia.org/wiki/Glaucophyte](Samantha) 

== H ==

'''haemolysin or hemolysin''' - a substance produced by a bacterium that causes lysis of red blood cells [http://en.wikipedia.org/wiki/Hemolysis_(microbiology)] (Nick)

'''halophile''' - an organism, most often of the Archaea domain, that lives in environments containing high concentrations of salt [http://en.wikipedia.org/wiki/Halophile] (Matt)

'''haplotype'''-collection of alleles that travel together (Lecture, Pallavi)

'''haptophyte''' - phylum of algae [http://en.wikipedia.org/wiki/Haptophyte](Samantha)

'''heterokont''' - major line of eukaryotes consisting of about 10,500 known species, most of which are algae [http://en.wikipedia.org/wiki/Heterokont](Samantha)

'''Hidden Markov Model''' - a statistical model used in protein recognition databases such as Pfam. A Hidden Markov Model keeps track of several variables and possible variations thereof, such as the possible amino acid sequences that make up a protein domain (since there can be some variance in an amino acid sequence) or the variations in the component sounds that make up a word, and uses those points to match a given sequence to the word, domain, or other complex sequence it most closely matches. An HMM in speech recognition software, for example, can identify that a certain set of sounds make up a certain word, even with the variations in pronunciation and accent that different people will give those sounds. ([http://en.wikipedia.org/wiki/Hidden_Markov_Model Wikipedia] and lecture, Laura)

'''HMM Logo''' - a graphical representation of an HMM, detailing the possible amino acid sequences, the relative frequencies and probabilities of each amino acid in the sequence, the relative contribution each amino acid has to the overall protein family, and the charge or nature of the amino acids themselves. ([http://www.sanger.ac.uk/Software/analysis/logomat-m/help.shtml How to read HMM Logos, on Pfam], Laura)

'''homeobox''' - DNA sequence within transcription factor genes that allows the cell to respond to patterns of development by having the transcription factors switch on gene cascades [http://en.wikipedia.org/wiki/Homeobox](Samantha)

'''homodimer''' - a protein made of paired identical polypeptides ([http://www.answers.com/topic/homodimer Answers.com], Jay)

'''horizontal gene transfer'''-DNA transmission between species and incorporation of the DNA into the recipient's genome ([http://www.csrees.usda.gov/nea/biotech/res/biotechnology_res_glossary.html horizontal gene transfer] Pallavi)

'''''Hox'' gene'''-a gene that contains a homeobox region that is involved in morphogenesis along the cranio-caudal body axis ([http://www.uprightape.net/UA_Glossary.html 4] Pallavi)

'''hydrolase''' - an enzyme that catalyzes hydrolysis, the cleavage of a chemical bond by the addition of water [http://en.wikipedia.org/wiki/Hydrolase] (Nick)

== I ==

'''ideogram''' - in genomics, usually describes a stylized representation of a chromosome with banding patterns (Campbell-Heyer Genomics textbook, Jay)

'''identities''' - in a BLAST output, the number and fraction of total residues which are identical in a given alignment [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)

'''indole'''-a chemical compound that is produced from the break down of tryptophan ([http://medical-dictionary.thefreedictionary.com/indole indole] Pallavi)

'''inclusion body''' - Inclusion bodies are collections of stainable substances, usually proteins, that are found either in the nucleus or the cytoplasm. It is thought that these bodies are often the result of viral proteins that misfolded [http://en.wikipedia.org/wiki/Inclusion_body] (Nick)

'''intron''' - a region of DNA in a gene that is not part of the final coding sequence for the protein. [http://en.wikipedia.org/wiki/Intron] (Peter)

'''isoelectric point''' - the pH at which a molecule carries no net electrical charge [http://en.wikipedia.org/wiki/Isoelectric_point] (Nick)

'''isozymes''' - members of a gene family with very similar cellular roles (Campbell-Heyer Genomics textbook, Jay)

== J ==

== K ==
'''KEGG (Kyoto Encyclopedia of Genes and Genomes)''' - a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The Pathway database records networks of molecular interactions in the cells, and variants of them specific to particular organisms [http://en.wikipedia.org/wiki/KEGG](Will).

'''kinase''' - a type of enzyme that transfers a phosphate group from a high-energy donor molecule to a target molecule in a process called phosphorylation. [http://en.wikipedia.org/wiki/Kinase] (Peter)

== L ==

== M ==
'''Manatee''' - a web-based gene evaluation and genome annotation tool that can view, modify, and store annotation for prokaryotic and eukaryotic genomes. This ongoing, open source initiative was developed with two missions: first, to give biologists the ability to functionally annotate their genomes using a powerful, stand-alone web application with a robustly designed relational annotation database; and second, to give outside developers the opportunity to contribute their own ideas and requirements to enhance Manatee's ability to accomplish biological goals [http://manatee.sourceforge.net/](Will). 

'''microsatellites'''-stretches of repetitive, short DNA segments that can be used to track the inheritance of certain traits within families ([http://www.clanlindsay.com/genetic_dna_glossary.htm 3] Pallavi)

'''minisatellites'''-segments of DNA that can be used for individual identification (ex. DNA fingerprinting) or in determining relationships between people (ex. paternity cases) ([http://www.clanlindsay.com/genetic_dna_glossary.htm 2] Pallavi).

'''motif''' - a sequence of amino acids or nucleotides that performs a particular role and is often conserved in other species or molecules. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''mycoplasma''' - genus of bacteria that lack a cell wall [http://en.wikipedia.org/wiki/Mycoplasma] (Nick)

== N ==

'''NORFs''' (nonannotated open reading frame) - an open reading frame that was considered not to be a real gene when the genome was annotated. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''nucleomorph''' - reduced eukaryotic nuclei found in plastids [http://en.wikipedia.org/wiki/Nucleomorph](Samantha)

== O ==
'''object-oriented programming''' - a programming paradigm in which collections of data, associated with operations on that data, are modularly defined and then built upon (CSC 121 Lecture, Will).

'''open reading frame (ORF)'''-a segment of DNA that can potentially encode a protein; it begins with a start codon (usually ATG) and ends with a stop codon [http://www.fao.org/DOCREP/003/X3910E/X3910E18.htm ORF] (Pallavi)
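
A naive ORF search can be sketched in a few lines of Python. This is an illustration only, not how JGI, Manatee or RAST call genes: the function name is hypothetical, only the three forward-strand frames are scanned, and ATG is the only start codon by default (prokaryotes also use alternatives such as GTG and TTG, which can be passed in).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2, start_codons=("ATG",)):
    """Return (start, end, frame) for forward-strand ORFs.

    start is 0-based, end is exclusive and includes the stop codon.
    min_codons counts the codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon in start_codons:
                orf_start = i  # first start codon opens the ORF
            elif orf_start is not None and codon in STOP_CODONS:
                if (i - orf_start) // 3 >= min_codons:
                    orfs.append((orf_start, i + 3, frame))
                orf_start = None  # look for the next ORF in this frame
    return orfs

seq = "CCATGAAATTTTAGGG"
print(find_orfs(seq))  # [(2, 14, 2)]
print(seq[2:14])       # ATGAAATTTTAG
```

Running the same function on the reverse complement would cover the other three frames.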

'''operon''' - a segment of DNA involving an operator, promoter, and one or more genes that operate as a single unit during transcription [http://en.wikipedia.org/wiki/Operon] (Nick)

'''optical mapping'''-DNA sequences of the organism in question are compared against a karyotype that specifically looks at restriction sites found within the DNA to correctly order the DNA sequences on a chromosome. This methodology gives very detailed haplotype information and allows for the detection of sequence variations across an entire genome [http://www.geocities.com/bioinformaticsweb/genomicglossary.html optical mapping] (Pallavi)

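'''ortholog'''-homologous genes in different species that descend from a single gene in their last common ancestor and were separated by speciation; they often look very similar and retain the same function (Lecture, Pallavi)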

'''oxidoreductase''' - an enzyme that catalyzes redox reactions by transferring electrons from one molecule (the reductant) to another (the oxidant) [http://en.wikipedia.org/wiki/Oxidoreductase] (Nick)

== P ==

'''paralog'''-homologous genes within the same genome that arose by gene duplication; they are similar in sequence but not necessarily identical (Lecture, Pallavi)

'''p-arm''' - the shorter arm of a chromosome's two arms separated by the centromere (compare to q-arm, the longer arm) ([http://www.medterms.com/script/main/art.asp?articlekey=4715 MedTerms Dictionary], Jay)

'''Perl''' - Developed by Larry Wall in 1987, Perl is a [http://en.wikipedia.org/wiki/High-level_programming_language high-level programming language] used frequently by biologists and bioinformaticists [http://en.wikipedia.org/wiki/Perl] (Will).

'''periplasmic space''' - the space between the inner cytoplasmic membrane and external outer membrane in bacteria or archaea. [http://en.wikipedia.org/wiki/Periplasmic_space] (Peter)

'''Pfam''' - a database for protein domain families that matches amino acid sequences or nucleotide sequences to the related group of proteins to which they most likely belong. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura)

'''plasmid''' - an extra-chromosomal DNA molecule that is capable of replicating independently of the chromosomal DNA. Commonly found in bacteria and archaea. [http://en.wikipedia.org/wiki/Plasmid](Peter)

'''plastid''' - major organelles in plants or algae [http://en.wikipedia.org/wiki/Plastid](Samantha)

'''pleomorphism''' - the occurrence of two or more structural forms during a life cycle [http://en.wikipedia.org/wiki/Pleomorphism] (Mary)

'''phylogenetic tree''' - a diagram showing the evolutionary relationships between biological species that are thought to share a common ancestor [http://en.wikipedia.org/wiki/Phylogenetic_tree] (Nick)

'''phylotypes''' – a term intended to resolve the challenge of “species” when classifying prokaryotes using DNA sequence comparisons. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''positives''' - in a BLAST output, the number and fraction of residues for which the alignment scores have positive rather than negative values [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)

'''proteome''' - entire set of proteins expressed by a genome, cell, tissue, or organism. It may refer to expressed proteins under certain conditions [http://en.wikipedia.org/wiki/Proteome](Samantha)

'''PSORT''' - a prediction server that judges where a mature protein could be in the cell, based on its transmembrane domains, its predicted mature amino acid composition, and its signal sequences. ([http://psort.ims.u-tokyo.ac.jp/form.html PSORT], Laura)

'''pseudogenes'''-A sequence of DNA that looks like a gene, but most likely contains many stop codons. It may have evolved away from a real gene or a paralog might have taken its place (Lecture, Pallavi)

'''purine''' - a category of nitrogenous base consisting of a pyrimidine ring fused to an imidazole ring. Notable purine bases are adenine and guanine. [http://en.wikipedia.org/wiki/Purine] (Peter)

'''pyrimidine''' - a category of nitrogenous base consisting of a heterocyclic aromatic ring containing two nitrogen atoms at positions 1 and 3 of the six-member ring. Notable pyrimidine bases are cytosine, thymine, and uracil. [http://en.wikipedia.org/wiki/Pyrimidine] (Peter)

== Q ==

'''q-arm''' - the longer arm of a chromosome's two arms separated by the centromere (compare to p-arm, the shorter arm) ([http://www.medterms.com/script/main/art.asp?articlekey=5152 MedTerms Dictionary], Jay) 

'''query sequence''' - the sequence (whether amino acid or nucleotide) entered into a database’s search function and checked against the database entries. ([http://en.wikipedia.org/wiki/BLAST BLAST on Wikipedia], Laura)

== R ==

'''RAST''' - (Rapid Annotation using Subsystem Technology)- a fully-automated service for annotating bacterial and archaeal genomes. It provides high quality genome annotations for these genomes across the whole phylogenetic tree. ([http://rast.nmpdr.org/], Max Win)

'''rDNA'''-These are DNA sequences that encode for ribosomal RNA. Note that rDNA can also stand for recombinant DNA. ([http://en.wikipedia.org/wiki/Ribosomal_DNA rDNA] Pallavi)

'''residue (protein)''' - the remaining portion of an amino acid after a water molecule has been removed and it has been incorporated into a protein. Functional residues, referred to in Pfam, are the residues that perform some specific identifiable function or are part of a domain, and can be conserved across evolutionarily-related proteins. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura) 

'''retropseudogenes'''-these are genes that have been reverse-transcribed from mRNA and the resulting DNA sequence is incorporated back into the genome. They are non-functional segments of DNA and can be distinguished from pseudogenes in that they do not have intron sequences. ([http://genome.cshlp.org/cgi/content/full/10/5/672 1] Pallavi)

'''retrotransposons''' - transposable elements that copy themselves through an RNA intermediate, which is reverse-transcribed back into DNA and inserted into the genome [http://en.wikipedia.org/wiki/Retrotransposon](Samantha)

'''ribonuclease''' - a nuclease that catalyzes the degradation of RNA into smaller components [http://en.wikipedia.org/wiki/Ribonuclease] (Mary)

== S ==
'''Serovar'''-a subdivision of a species based on the characteristics of their cell surface antigens ([http://www.biology-online.org/dictionary/Serovar serovar] Pallavi)

'''scaffold''' - a section of a sequenced genome composed of contigs that are in the right order but not necessarily connected ([http://www.medterms.com/script/main/art.asp?articlekey=25223 MedTerms Dictionary], Jay)

'''"Shadow enhancers"'''-secondary enhancers that are thought to be important for natural selection to occur in regulatory DNA segments. They evolve much faster than primary enhancers, which suggests that they are under fewer functional constraints (Wray and Babbit 2008 ''Science'' Pallavi)

'''Shine-Dalgarno sequence''' - A ribosomal binding site on an mRNA, usually a sequence of about six bases located six or seven bases upstream of the start codon. An anti-Shine-Dalgarno sequence exists on the rRNA in the small subunit of the ribosome; when the two sequences pair, the mRNA is lined up and prepared for translation. (Lecture and [http://en.wikipedia.org/wiki/Shine-dalgarno Wikipedia article], Laura) 
Note: The Shine-Dalgarno consensus sequence for our genome is ccGGAGGt.
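
One simple way to check whether a called start codon has a Shine-Dalgarno sequence in front of it is to search a short upstream window for the core of the consensus. The Python sketch below is a simplification under stated assumptions: the function name is hypothetical, only exact matches to the GGAGG core of ccGGAGGt are found, the 20-base window is an arbitrary choice, and the test sequence is made up.

```python
def find_shine_dalgarno(seq, start_codon_pos, core="GGAGG", max_upstream=20):
    """Return the spacer length (bases) between the motif and the start codon.

    start_codon_pos is the 0-based index of the first base of the start codon.
    Returns None if the core motif is not found in the upstream window.
    """
    seq = seq.upper()
    window_start = max(0, start_codon_pos - max_upstream)
    upstream = seq[window_start:start_codon_pos]
    idx = upstream.rfind(core)  # closest match to the start codon
    if idx == -1:
        return None
    motif_end = window_start + idx + len(core)
    return start_codon_pos - motif_end

seq = "TTCCGGAGGTAACAAATGAAA"  # made-up sequence; ATG begins at index 15
print(find_shine_dalgarno(seq, seq.find("ATG")))  # 6
```

A spacer of roughly six or seven bases would support the start codon call; a missing motif or an unusual spacer might suggest looking at alternative start codons nearby.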

'''SignalP''' - a prediction server that judges whether or not a query protein contains a signal peptide. SignalP measures each amino acid against the amino acid sequences of probable signal peptide matches and predicts the cleavage site of the signal peptide. ([http://www.cbs.dtu.dk/services/SignalP-3.0/output.php SignalP Output explained], Laura)

'''signal peptide''' - a short peptide chain that directs the post-translational transport of a protein [http://en.wikipedia.org/wiki/Signal_peptide] (Matt)

'''Smith-Waterman alignment''' - A well-known algorithm for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure [http://en.wikipedia.org/wiki/Smith_waterman](Will).
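
The algorithm can be sketched in Python as follows. This is a minimal version assuming a simple match/mismatch score and a linear gap penalty; the scoring values and function name are illustrative choices, and real tools use substitution matrices and affine gap penalties instead.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the matrix; scores are never allowed to drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

print(smith_waterman("TTTGGAGGAAA", "CCGGAGGCC"))  # (10, 'GGAGG', 'GGAGG')
```

Resetting negative scores to zero is what makes the alignment local: poorly matching flanks are dropped rather than dragging the score down.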

'''SNP (Single Nucleotide Polymorphism)''' - a DNA sequence variation occurring when a single nucleotide in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual) [http://en.wikipedia.org/wiki/Single_nucleotide_polymorphism](Will).

'''symporter''' - an integral membrane protein that is involved in movement of two or more different molecules or ions across a phospholipid membrane in the same direction. [http://en.wikipedia.org/wiki/Symporter] (Peter)

'''synteny''' - a neologism from the Greek for "on the same ribbon". Genes that are syntenic in one species are on the same chromosome; genes that are syntenic across species retain the same order on respective chromosomes as a result of descent from a common ancestor ([http://www.answers.com/synteny Answers.com], Jay)

'''synthetase''' - a type of enzyme that creates a new covalent bond and requires direct input of energy from a high-energy phosphate. [http://books.google.com/books?id=bB8XnCykRmIC&pg=PA522&lpg=PA522&dq=%22synthetase+is+an+enzyme%22&source=web&ots=wkws4ksMsg&sig=zWLkDIk7T78hcf9S84nWs3u5Apw&hl=en&sa=X&oi=book_result&resnum=9&ct=result] (Peter)

== T ==
'''transferase''' - an enzyme that catalyzes the transfer of a functional group from one molecule (the donor) to another (the acceptor) [http://en.wikipedia.org/wiki/Transferase] (Matt)

'''transmembrane helix''' - a single transmembrane alpha helix of a transmembrane protein, usually about twenty amino acids in length. They are usually predicted by hydrophobicity. [http://en.wikipedia.org/wiki/Transmembrane_domain](Mary)

'''transposons / transposable elements''' - DNA sequences that can move around to different positions in a single cell's genome. Transposons can cause mutations and change the length of the genome. [http://en.wikipedia.org/wiki/Transposon](Samantha)

'''Transposon Mutagenesis'''-a procedure in which a transposon is inserted into a gene, which inactivates the gene and can lead to the discovery of the phenotype associated with this gene ([http://cancerweb.ncl.ac.uk/cgi-bin/omd?transposon+mutagenesis transposon mutagenesis] Pallavi)

'''Trans-splicing'''-fragmented exon sequences fuse to form a mature species of mRNA. This process results in fusion mRNA ([http://www.representinggenes.org/Glossary.html 8] Pallavi).

'''tRNA splicing endonuclease''' - an enzyme that cleaves intervening sequences of precursor tRNA. [http://cancerweb.ncl.ac.uk/cgi-bin/omd?splicing+endonuclease] (Peter) 

== U ==

== V ==
'''Vertical gene transfer'''-the transmission or absorption of genetic material that is associated with sexual reproduction and, thus, acknowledges species-specific boundaries ([http://www.gmo-compass.org/eng/glossary/#G 6] Pallavi)

== W ==

'''whole genome shotgun sequencing''' - a method of sequencing where DNA is cut into small pieces and cloned into vectors, then both ends of every insert are sequenced for about 500 bp each to form mate pairs. Mate pairs rarely overlap, but software uses them to reassemble the sequence. [http://en.wikipedia.org/wiki/Whole_genome_shotgun](Samantha)
 

== X ==
'''xenolog''' - homologs that are created by horizontal gene transfer between two different species [http://en.wikipedia.org/wiki/Xenolog#Xenology] (Matt) 

== Y ==

== Z ==

Halorhabdus utahensis Genome

2008-10-28T14:01:13Z

SaSimpson: /* Links to Multiple Databases */

This page will be used by Davidson College students in the [http://www.bio.davidson.edu/Courses/Bio343/LabMethods.html Genomics Laboratory course].
__NOTOC__
== Links to Multiple Databases ==
*[http://imgweb.jgi-psf.org/cgi-bin/img_edu_v260/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2500575004 JGI IMG EDU] public access
*[[Media:JGIAnnotation.xls|JGI Annotation Excel Spreadsheet]]
*[http://www.tigr.org/tigr-scripts/prok_manatee/shared/login.cgi Manatee at JCVI] use the davidson number sent by email as username and password (database is nthu01 - this is case sensitive)
*[[Media:ManateeAnnotation.xls|Manatee Annotation Excel Spreadsheet]]
*[http://rast.nmpdr.org/ SEED view via RAST] use the username and password combination sent to you by SEED
*[[Media:RastAnnotation.xls|RAST Annotation Excel Spreadsheet]]
*[http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=18261238 RAST Publication in PubMed]
*[http://www.genome.jp/kegg/kaas/ KEGG] We can submit our genes to KEGG to have it mapped out, but SEED and Manatee may already do this. Do we want to ask them to upload it into their database? 
*[http://wishart.biology.ualberta.ca/basys/cache/135af8726ad6f61ec4c5f1e9c4d139ac/index.html BASYs] 
*[http://gcat.davidson.edu/Registry/compare/ Pairwise comparisons of All Three Annotations]

[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI_5contigs.txt JGI Full genome, 5 separate contigs & 3.1 Mbp, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.txt JGI gene DNA sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.xls JGI gene annotations, Excel] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_proteins.txt JGI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_merged.txt JCVI Full genome, 5 contigs fused, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_ORFs.txt JCVI gene sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_proteins.txt JCVI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/GeneLengths.xls 3-way comparison, Excel] 
[[Venn_diagrams]] Venn diagram of 3-way comparison

 

== RNA Genes ==

*[[tRNA Genes Check List]] 
*[[rRNA operon]] 
*[[2 misc. RNA genes]] (short summary list) 
*[[Missing tRNA-trp gene found]] 

== Other Resources ==
*[[Consensus Shine Dalgarno]] Excel File for ''H. utahensis'' 
*[[References]] 
*[[Gene Annotation Template]] 
*[[General Questions]] 
*[[Page for Annotated Genes]] 
*[http://www.bio.davidson.edu/courses/genomics/2008/Win/ec/ Search EC number in RAST, JGI or Manatee] 

== Tutorials for Annotating Genomes ==

# Will DeLoache - [http://www.bio.davidson.edu/courses/genomics/2008/DeLoache/BioPerlTutorial/BioPerl.htm BioPerl Installation] 
# Max Win - [http://www.bio.davidson.edu/courses/genomics/2008/Win/perl.html Introduction to Perl for non-programmers (with step-by-step explanations, simple exercises and solutions)] 
# Pallavi - Conserved Domains Database (CDD) [[Media:CDDtutorial.doc]] 
# Mary - Protein Data Bank (PDB) [[Media:PDB Tutorial.doc]] 
# Laura Voss - Pfam Database [http://www.bio.davidson.edu/Courses/Bio343/Pfam_tutorial.doc Pfam Tutorial] 
# Samantha Simpson - [http://www.bio.davidson.edu/courses/genomics/2008/Simpson/Tutorial.html NCBI BLAST] 
# Peter Bakke - [[Media:ShineDalgarnoTutorial.doc]] 
# Jay McNair - [http://www.bio.davidson.edu/courses/genomics/2008/McNair/Origin_Tutorial/OriginTutorial.doc Origin of Replication Tutorial] 
# Nick Carney - Navigating the JGI Database [[Media:NavigatingJGItutorial.doc]] 
# Matt Lotz - SEED Viewer - [[Media:SEEDTutorial.doc]] 

== Research Questions ==
#How do the three systems compare for finding ORFs and RNA genes?
#Is there a pattern of missed genes for any of the 3 sites?
#Do the three systems differ in their ability to find good start codons and Shine-Dalgarno sequences? [We need a standard set of genes for comparison. Only highly conserved or a range of genes?]
# Were Shine-Dalgarno sequences calculated for our species or default values used? If default, what sequence?
#Can we fill any holes in their automated annotation? Is there a mechanism for users to add in genes?
#How do the 3 sites compare for ease of use?
#What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working?
#How does each of the 3 sites compare for pathway detection and visualization?
#Do they find the origin of replication? Can we find it?

* How do the 3 systems compare when one calls a gene hypothetical and another calls it a functional protein? How can they vary, and which is closer to correct (however you define that, possibly by date of matched entry)? (Pallavi and Mary)
* Why did one system call a gene when the other two did not? (Matt and Lara)
* How do the 3 sites compare for ease of use? What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working? (Samantha and Nick)
* Where is the origin of replication and did the 3 systems attempt to identify this?
* Did the 3 systems utilize Shine-Dalgarno sequences to help them call start codons? Did they utilize our species' consensus Shine-Dalgarno? (Peter)
* We need to fill in the [[Venn diagrams]] for our 3-way comparison. Let's compare the size of ORFs and generate a [[Gene Length Histograms|graph comparing the distributions]] for all 3. (Max and Will - they also take requests).

<hr>
== My favorite genes==
Pallavi - Monooxygenase vs. Peroxiredoxin

Mary - JGI gene 2500588521 (922976...924046) [[Media:My favorite gene.ppt]]

Max - [http://app.sliderocket.com/app/FullPlayer.aspx?id=f2058b94-845f-4a11-94eb-142f251a7fea JGI gene 2500587636 (2-1849)]

Samantha - JGI gene 2500575882 (80504-80878) [[Media:Earl.ppt]]

Nick - JGI gene 2300587691 (69942...72866) [[Media:Gene presentation.ppt]]

Will - JGI gene 2500590430 (2847205..2854335)

Jay - JGI gene 2500588397 (806410..807321) [http://www.bio.davidson.edu/courses/genomics/2008/McNair/Fav_Gene/FavoriteGenePresentation.pptx Co/Zn/Cd PowerPoint]

Matt - Transcriptional Regulator nrdR (3109722..3110204 + 7274..7765)

Peter - tRNA intron endonuclease [[Media:TRNAtrpintronendonuclease.ppt]]

Laura - 16S Small ribosomal subunit, JGI gene 2500590728 (2397347..2398825)

<hr>
== Pathway Tutorials==
*Pallavi: I will compare RAST and KEGG in pathway annotations and use Glycolysis/Gluconeogenesis as my example

<hr>

== My Favorite Pathways==
Pallavi - Carbohydrate Metabolism

Jay - Membrane Transport

Will - Signal Transduction

Nick - Drug Development

Max - Energy

== This is a list of glossary words (A - Z): ==
[[#A| A ]] [[#B| B ]] [[#C| C ]] [[#D| D ]] [[#E| E ]] [[#F| F ]] [[#G| G ]] [[#H| H ]] [[#I| I ]] [[#J| J ]] [[#K| K ]] [[#L| L ]] [[#M| M ]] [[#N| N ]] [[#O| O ]] [[#P| P ]] [[#Q| Q ]] [[#R| R ]] [[#S| S ]] [[#T| T ]] [[#U| U ]] [[#V| V ]] [[#W| W ]] [[#X| X ]] [[#Y| Y ]] [[#Z| Z ]]

== A ==
'''Accession Number''' - a unique identifier given to DNA and protein sequences to allow for tracking of sequence information within a single database [http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)] (Will).

'''Antisense (RNA or DNA)'''-a piece of DNA or RNA that binds to a complementary sequence of DNA or RNA. These segments of genetic material can be used to identify the existence of a disease gene and they can also be used to bind to specific DNA or mRNA sequences to inhibit their function ([http://biotech.fyicenter.com/glossary/Bioinformatics_Glossary.html 5] Pallavi).

'''Arabidopsis thaliana''' - the scientific name for the thale cress plant; it was the first plant to have its genome sequenced, and is a model organism for understanding plant biology and genetics ([http://en.wikipedia.org/wiki/Thale_cress Wikipedia.org], Jay)

== B ==
'''BAC''' - bacterial artificial chromosome, a DNA construct used for transforming or cloning segments of DNA and often used to sequence the genetic code of organisms ([http://en.wikipedia.org/wiki/Bacterial_artificial_chromosome Wikipedia.org], Jay)

'''bioinformatics''' - the multi-disciplinary approach of using biology, computer science and mathematics to solve or better understand biological problems [http://en.wikipedia.org/wiki/Bioinformatics] (Matt)

'''BLAST''' - (Basic Local Alignment Search Tool) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. [http://blast.ncbi.nlm.nih.gov/Blast.cgi] (Mary)

'''bioperl'''- a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications such as accessing sequence data from local and remote databases, converting between database formats, manipulating individual sequences, searching for similar sequences, searching for genes and other structures on genomic DNA, or developing machine-readable sequence annotations. [http://en.wikipedia.org/wiki/BioPerl] (Wikipedia, Max Win)

== C ==
'''carbon fixation''' - using carbon dioxide to create organic materials [http://en.wikipedia.org/wiki/Carbon_fixation] (Samantha) 

'''CDD''' (Conserved Domains Database)- a database used to identify the conserved domains present in a protein query sequence [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml] (Mary)

'''chaperonin''' - a protein complex that assists some newly formed polypeptide chains by folding them into their final, functional, three-dimensional form [http://en.wikipedia.org/wiki/Chaperonins] (Matt)

'''chemotaxis''' - the process by which cells move toward or away from high concentrations of certain chemicals; it occurs in both unicellular and multicellular organisms. Unicellular organisms use it to find food or avoid toxins, while multicellular organisms use it for tasks such as reproduction [http://en.wikipedia.org/wiki/Chemotaxis] (Nick)

'''chemotaxonomy''' - the attempt to classify and identify organisms according to demonstrable differences and similarities in their biochemical compositions [http://en.wikipedia.org/wiki/Chemotaxonomy] (Mary)

'''ClustalW''' - A web-based or command line tool that performs multiple sequence alignments to determine evolutionary relationships between three or more sequences [http://en.wikipedia.org/wiki/Clustal] (Will).

'''COG''' (Clusters of Orthologous Groups)- each COG corresponds to a highly conserved domain and generally consists of either individual proteins or groups of paralogs ([http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml COG] Pallavi) 

'''concatemer''' - long continuous DNA molecule that contains the same DNA sequence repeated in series [http://en.wikipedia.org/wiki/Concatemer](Samantha) 

'''contigs''' (contiguous DNA)- overlapping DNA segments that, as a collection, form a longer and gapless segment of DNA. (Discovery Genomics, Proteomics and Bioinformatics [http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''coverage''' - refers to the number of times, on average, any piece of DNA in a sequenced genome has been individually sequenced (Lecture, Jay)
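
As a rough illustration of this definition, average coverage is commonly estimated as (number of reads x read length) / genome size. The Python sketch below assumes that simple formula, which is not stated in the lecture notes themselves. The function name and the read count are hypothetical; the 500 bp read length and 3.1 Mbp genome size echo figures used elsewhere on this wiki.

```python
def average_coverage(num_reads, read_length, genome_size):
    """Estimate how many times, on average, each base was sequenced.

    Assumes every read has the same length and that reads are spread
    evenly across the genome (a simplification).
    """
    return num_reads * read_length / genome_size


# Hypothetical example: 50,000 reads of 500 bp over a 3.1 Mbp genome.
estimate = average_coverage(50_000, 500, 3_100_000)
print(f"approximate coverage: {estimate:.1f}x")
```

Real projects usually have reads of varying length, so summing the actual read lengths would give a better estimate than multiplying by a single value.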

'''CPAN (Comprehensive Perl Archive Network)''' - an archive of over 12,200 modules of software written in Perl, as well as documentation for it. It contains a module called CPAN (or CPAN.pm) which is used as an installer for Perl modules such as BioPerl [http://en.wikipedia.org/wiki/CPAN](Will).

'''Cytogenetics'''-the study of normal and abnormal chromosomes. This involves studying the causes of chromosomal abnormalities and looking at the structure of chromosomes ([http://www.vivo.colostate.edu/hbooks/genetics/medgen/chromo/index.html 7] Pallavi).

== D ==
'''''de novo'' synthesis''' - the synthesis of complex molecules from simple molecules (e.g. sugars and nucleotides), rather than from recycled molecules; from the Latin "of the new" [http://en.wikipedia.org/wiki/De_novo_synthesis] (Matt)

'''dehydrogenase''' - a type of enzyme that oxidizes a substrate by transferring one or more protons and a pair of electrons to an acceptor. [http://en.wikipedia.org/wiki/Dehydrogenase] (Peter)

'''diatom''' - a major group of eukaryotic algae, and one of the most common types of phytoplankton. A characteristic feature of diatom cells is that they are encased within a unique cell wall made of silica called a frustule. These frustules show a wide diversity in form, but usually consist of two asymmetrical sides with a split between them. [http://en.wikipedia.org/wiki/Diatom] (Mary)

'''domain (protein)''' - the structural and functional groups of a protein, which can exist independently of the protein itself. Domains typically perform a specific function, such as binding to promoters or substrates, and many proteins can have one or several domains in common. Evolutionarily-linked proteins are more likely to have domains in common. Domains are used to organize proteins into families. ([http://en.wikipedia.org/wiki/Domain_(protein) Wikipedia article], Laura)

'''dot plot'''-graphical display comparing sequence conservation between two genomes with dots indicating strings of identical bases. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
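
The idea of a dot plot can be sketched in a few lines of Python. This is only a minimal illustration, assuming exact matches of short "words" (here 3 bases by default); the function name and word size are hypothetical choices, and real dot plot tools typically add scoring thresholds and graphics.

```python
def dot_plot(seq_a, seq_b, word=3):
    """Return (i, j) pairs where seq_a[i:i+word] equals seq_b[j:j+word]."""
    # Index every word in seq_b by its starting position.
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)

    # Look up each word of seq_a; every hit is one dot on the plot.
    dots = []
    for i in range(len(seq_a) - word + 1):
        for j in index.get(seq_a[i:i + word], []):
            dots.append((i, j))
    return dots


# Identical sequences give dots along the main diagonal.
print(dot_plot("ACGTAC", "ACGTAC"))
```

Plotting the pairs with i on one axis and j on the other shows conserved regions as diagonal runs of dots.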

== E ==

'''EC number''' (Enzyme Commission Number)- a numerical classification scheme for enzymes, based on the chemical reactions they catalyze [http://en.wikipedia.org/wiki/EC_number] (Mary)

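'''E-value''' (Expect value)- When performing a BLAST search, you will obtain an E-value for each sequence that is retrieved. An E-value can be thought of as the probability that two sequences are similar to each other by chance; more precisely, it is the number of hits with an equal or better score that would be expected by chance in a database of that size, so lower E-values indicate more significant matches. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)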

'''Extremophile''' - an organism that thrives in and may even require physically or geochemically extreme conditions that are detrimental to the majority of life on Earth [http://en.wikipedia.org/wiki/Extremophile] (Will).

== F ==

'''FASTA format''' - a text format used to convey either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented by single-letter codes. Each sequence is preceded by a header line that begins with ">" and gives the sequence name and other descriptors. [http://en.wikipedia.org/wiki/FASTA_format] (Nick) 
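
A minimal sketch of reading FASTA text in Python is shown below. It assumes the usual layout (a header line starting with ">" followed by one or more sequence lines); the function name and the two example records are made up for illustration. For real work, an established parser such as the one in BioPerl or Biopython is a safer choice.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


# Made-up example records.
example = """>gene1 hypothetical protein
ATGGCC
TTTTAA
>gene2
ATGAAATAG
"""
records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```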

'''family (protein)''' - a group of evolutionarily-related proteins, often with one or several domains in common. Families are organized by domain overlap, structural/functional similarity, and sequence similarity. ([http://en.wikipedia.org/wiki/Protein_family Wikipedia article] and lecture, Laura)

'''finished genome''' - a genome that has been sequenced at least partly by hand, resulting in at least 99.99% sequence accuracy (Lecture, Jay) 

'''fusion mRNA'''-mRNA that results from the transcription of a gene after a chromosomal translocation event. This results in an mRNA sequence that comes from two different genes (Rowley and Blumenthal 2008 ''Science'' Pallavi)

== G ==

'''GC Content''' - the percentage of bases within a certain sequence of DNA (e.g. a gene or a genome) that are either guanine or cytosine; a higher GC content is characteristic of a coding region of a gene; differences in GC content between a gene and a genome can be used as evidence for horizontal gene transfer [http://en.wikipedia.org/wiki/GC-content] (Matt) 

'''GC-skew''' – uneven distribution of guanine and cytosine bases between the two strands of DNA where GC base pairs occur. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
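
GC content and GC skew are both easy to compute from a sequence string. The Python sketch below assumes the commonly used skew formula (G - C) / (G + C), which is not spelled out in the definitions above; the function names are hypothetical.

```python
def gc_content(seq):
    """Percentage of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def gc_skew(seq):
    """(G - C) / (G + C) for one strand; 0.0 if the sequence has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


print(gc_content("GGCCAATT"))  # half of the bases are G or C
print(gc_skew("GGGC"))         # more G than C on this strand
```

Computing the skew in windows along a genome, rather than once for the whole sequence, is one common way to look for features such as an origin of replication.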

'''gene amplification''' - production of multiple copies of a gene in order to amplify the amount of protein that the gene encodes for [http://www.medterms.com/script/main/art.asp?articlekey=13537] [http://www.answers.com/topic/gene-amplification] (Matt)

'''gene fusion'''-occurs when DNA segments of two different genes come together. Can result in hybrid proteins ([http://www.biochem.northwestern.edu/holmgren/Glossary/Definitions/Def-G/gene_fusion.html 9] Pallavi)

'''gene knockout''' - a process in which a gene is deactivated within a test organism in order to better understand the function of the gene in that organism [http://en.wikipedia.org/wiki/Gene_knockout] (Matt)

'''gene ontology'''- a collaborative effort of investigators to unify and standardize terms associated with the role a gene or protein plays in an organism. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''glaucophyte''' - freshwater algae that have not been studied well [http://en.wikipedia.org/wiki/Glaucophyte](Samantha) 

== H ==

'''haemolysin or hemolysin''' - a substance produced by bacteria that causes lysis of red blood cells [http://en.wikipedia.org/wiki/Hemolysis_(microbiology)] (Nick)

'''halophile''' - an organism, most often of the Archaea domain, that lives in environments containing high concentrations of salt [http://en.wikipedia.org/wiki/Halophile] (Matt)

'''haplotype'''-collection of alleles that travel together (Lecture, Pallavi)

'''haptophyte''' - phylum of algae [http://en.wikipedia.org/wiki/Haptophyte](Samantha)

'''heterokont''' - major line of eukaryotes consisting of about 10,500 known species, most of which are algae [http://en.wikipedia.org/wiki/Heterokont](Samantha)

'''Hidden Markov Model''' - a statistical model used in protein recognition databases such as Pfam. A Hidden Markov Model keeps track of several variables and possible variations thereof, such as the possible amino acid sequences that make up a protein domain (since there can be some variance in an amino acid sequence) or the variations in the component sounds that make up a word, and uses those points to match a given sequence to the word, domain, or other complex sequence it most closely matches. An HMM in speech recognition software, for example, can identify that a certain set of sounds make up a certain word, even with the variations in pronunciation and accent that different people will give those sounds. ([http://en.wikipedia.org/wiki/Hidden_Markov_Model Wikipedia] and lecture, Laura)

'''HMM Logo''' - a graphical representation of an HMM, detailing the possible amino acid sequences, the relative frequencies and probabilities of each amino acid in the sequence, the relative contribution each amino acid has to the overall protein family, and the charge or nature of the amino acids themselves. ([http://www.sanger.ac.uk/Software/analysis/logomat-m/help.shtml How to read HMM Logos, on Pfam], Laura)

'''homeobox''' - a DNA sequence within transcription factor genes that allows the cell to respond to patterns of development by having the transcription factors switch on gene cascades [http://en.wikipedia.org/wiki/Homeobox](Samantha)

'''homodimer''' - a protein made of paired identical polypeptides ([http://www.answers.com/topic/homodimer Answers.com], Jay)

'''horizontal gene transfer'''-DNA transmission between species and incorporation of the DNA into the recipient's genome ([http://www.csrees.usda.gov/nea/biotech/res/biotechnology_res_glossary.html horizontal gene transfer] Pallavi)

'''''Hox'' gene'''-a gene that contains a homeobox region that is involved in morphogenesis along the cranio-caudal body axis ([http://www.uprightape.net/UA_Glossary.html 4] Pallavi)

'''hydrolase''' - an enzyme that catalyzes hydrolysis, the cleavage of a chemical bond by the addition of water [http://en.wikipedia.org/wiki/Hydrolase] (Nick)

== I ==

'''ideogram''' - in genomics, usually describes a stylized representation of a chromosome with banding patterns (Campbell-Heyer Genomics textbook, Jay)

'''identities''' - in a BLAST output, the number and fraction of total residues which are identical in a given alignment [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)

'''indole'''-a chemical compound that is produced from the breakdown of tryptophan ([http://medical-dictionary.thefreedictionary.com/indole indole] Pallavi)

'''inclusion body''' - Inclusion bodies are collections of stainable substances, usually proteins, that are found either in the nucleus or the cytoplasm. It is thought that these bodies are often the result of viral proteins that misfolded [http://en.wikipedia.org/wiki/Inclusion_body] (Nick)

'''intron''' - a region of DNA in a gene that is not part of the final coding sequence for the protein. [http://en.wikipedia.org/wiki/Intron] (Peter)

'''isoelectric point''' - the pH at which a molecule carries no net electrical charge [http://en.wikipedia.org/wiki/Isoelectric_point] (Nick)

'''isozymes''' - members of a gene family with very similar cellular roles (Campbell-Heyer Genomics textbook, Jay)

== J ==

== K ==
'''KEGG (Kyoto Encyclopedia of Genes and Genomes)''' - a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The Pathway database records networks of molecular interactions in the cells, and variants of them specific to particular organisms [http://en.wikipedia.org/wiki/KEGG](Will).

'''kinase''' - a type of enzyme that transfers a phosphate group from a high-energy donor molecule to a target molecule in a process called phosphorylation. [http://en.wikipedia.org/wiki/Kinase] (Peter)

== L ==

== M ==
'''Manatee''' - a web-based gene evaluation and genome annotation tool that can view, modify, and store annotation for prokaryotic and eukaryotic genomes. This ongoing, open source initiative was developed with two missions: first, to give biologists the ability to functionally annotate their genomes using a powerful, stand-alone web application with a robustly designed relational annotation database; and second, to give outside developers the opportunity to contribute their own ideas and requirements to enhance Manatee's ability to accomplish biological goals [http://manatee.sourceforge.net/](Will). 

'''microsatellites'''-stretches of repetitive, short DNA segments that can be used to track the inheritance of certain traits within families ([http://www.clanlindsay.com/genetic_dna_glossary.htm 3] Pallavi)

'''minisatellites'''-segments of DNA that can be used for individual identification (ex. DNA fingerprinting) or in determining relationships between people (ex. paternity cases) ([http://www.clanlindsay.com/genetic_dna_glossary.htm 2] Pallavi).

'''motif''' - a sequence of amino acids or nucleotides that performs a particular role and is often conserved in other species or molecules. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''mycoplasma''' - genus of bacteria that lack a cell wall [http://en.wikipedia.org/wiki/Mycoplasma] (Nick)

== N ==

'''NORFs''' (nonannotated open reading frames) - open reading frames that were considered not to be real genes when the genome was annotated. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''nucleomorph''' - reduced eukaryotic nuclei found in plastids [http://en.wikipedia.org/wiki/Nucleomorph](Samantha)

== O ==
'''object-oriented programming''' - a programming paradigm in which collections of data, associated with operations on that data, are modularly defined and then built upon (CSC 121 Lecture, Will).

'''open reading frame (ORF)'''-a segment of DNA that can potentially encode a protein; it begins with a start codon (usually ATG) and continues in the same reading frame until a stop codon [http://www.fao.org/DOCREP/003/X3910E/X3910E18.htm ORF] (Pallavi)
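
A simple ORF scan can be sketched in Python as follows. This is a minimal illustration, not the method used by JGI, Manatee or RAST: it assumes ATG as the only start codon, the standard stop codons TAA, TAG and TGA, and it scans only the three forward-strand reading frames. The function name and the <code>min_codons</code> parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for ATG...stop ORFs on the forward strand.

    start is the 0-based index of the A in ATG; end is exclusive and
    includes the stop codon. min_codons counts the start and stop codons.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG in this frame opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


for start, end, frame in find_orfs("CCATGAAATAGGG"):
    print(start, end, frame)
```

A fuller version would also scan the reverse complement and allow alternative start codons such as GTG and TTG, which prokaryotic gene callers commonly consider.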

'''operon''' - a segment of DNA involving an operator, promoter, and one or more genes that operate as a single unit during transcription [http://en.wikipedia.org/wiki/Operon] (Nick)

'''optical mapping'''-DNA sequences of the organism in question are compared against a karyotype that specifically looks at restriction sites found within the DNA to correctly order the DNA sequences on a chromosome. This methodology gives very detailed haplotype information and allows for the detection of sequence variations across an entire genome [http://www.geocities.com/bioinformaticsweb/genomicglossary.html optical mapping] (Pallavi)

'''ortholog'''-genes in different species that look very similar because they descend from a single gene in the last common ancestor of those species; they diverged through speciation (Lecture, Pallavi)

'''oxidoreductase''' - an enzyme that catalyzes redox reactions by transferring electrons from one molecule (the reductant) to another (the oxidant) [http://en.wikipedia.org/wiki/Oxidoreductase] (Nick)

== P ==

'''paralog'''-similar DNA sequences within a species that arose from duplication of an ancestral gene; the copies need not remain identical (Lecture, Pallavi)

'''p-arm''' - the shorter arm of a chromosome's two arms separated by the centromere (compare to q-arm, the longer arm) ([http://www.medterms.com/script/main/art.asp?articlekey=4715 MedTerms Dictionary], Jay)

'''Perl''' - Developed by Larry Wall in 1987, Perl is a [http://en.wikipedia.org/wiki/High-level_programming_language high-level programming language] used frequently by biologists and bioinformaticists [http://en.wikipedia.org/wiki/Perl] (Will).

'''periplasmic space''' - the space between the inner cytoplasmic membrane and external outer membrane in bacteria or archaea. [http://en.wikipedia.org/wiki/Periplasmic_space] (Peter)

'''Pfam''' - a database for protein domain families that matches amino acid sequences or nucleotide sequences to the related group of proteins to which they most likely belong. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura)

'''plasmid''' - an extra-chromosomal DNA molecule that is capable of replicating independently of the chromosomal DNA. Commonly found in bacteria and archaea. [http://en.wikipedia.org/wiki/Plasmid](Peter)

'''plastid''' - major organelles in plants or algae [http://en.wikipedia.org/wiki/Plastid](Samantha)

'''pleomorphism''' - the occurrence of two or more structural forms during a life cycle [http://en.wikipedia.org/wiki/Pleomorphism] (Mary)

'''phylogenetic tree''' - a diagram showing the evolutionary relationships between biological species that are thought to share a common ancestor [http://en.wikipedia.org/wiki/Phylogenetic_tree] (Nick)

'''phylotypes''' – a term intended to resolve the challenge of “species” when classifying prokaryotes using DNA sequence comparisons. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''positives''' - in a BLAST output, the number and fraction of residues for which the alignment scores have positive rather than negative values [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)

'''proteome''' - entire set of proteins expressed by a genome, cell, tissue, or organism. It may refer to expressed proteins under certain conditions [http://en.wikipedia.org/wiki/Proteome](Samantha)

'''PSORT''' - a prediction server that judges where a mature protein could be in the cell, based on its transmembrane domains, its predicted mature amino acid composition, and its signal sequences. ([http://psort.ims.u-tokyo.ac.jp/form.html PSORT], Laura)

'''pseudogenes'''-DNA sequences that look like genes but are no longer functional, often because they contain premature stop codons. A pseudogene may have evolved away from a real gene, or a paralog might have taken its place (Lecture, Pallavi)

'''purine''' - a category of nitrogenous base consisting of a pyrimidine ring fused to an imidazole ring. Notable purine bases are adenine and guanine. [http://en.wikipedia.org/wiki/Purine] (Peter)

'''pyrimidine''' - a category of nitrogenous base consisting of a heterocyclic aromatic ring containing two nitrogen atoms at positions 1 and 3 of the six-member ring. Notable pyrimidine bases are cytosine, thymine, and uracil. [http://en.wikipedia.org/wiki/Pyrimidine] (Peter)

== Q ==

'''q-arm''' - the longer arm of a chromosome's two arms separated by the centromere (compare to p-arm, the shorter arm) ([http://www.medterms.com/script/main/art.asp?articlekey=5152 MedTerms Dictionary], Jay) 

'''query sequence''' - the sequence (whether amino acid or nucleotide) entered into a database’s search function and checked against the database entries. ([http://en.wikipedia.org/wiki/BLAST BLAST on Wikipedia], Laura)

== R ==

'''RAST''' - (Rapid Annotation using Subsystem Technology)- a fully-automated service for annotating bacterial and archaeal genomes. It provides high quality genome annotations for these genomes across the whole phylogenetic tree. ([http://rast.nmpdr.org/], Max Win)

'''rDNA'''-These are DNA sequences that encode for ribosomal RNA. Note that rDNA can also stand for recombinant DNA. ([http://en.wikipedia.org/wiki/Ribosomal_DNA rDNA] Pallavi)

'''residue (protein)''' - the remaining portion of an amino acid after a water molecule has been removed and it has been incorporated into a protein. Functional residues, referred to in Pfam, are the residues that perform some specific identifiable function or are part of a domain, and can be conserved across evolutionarily-related proteins. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura) 

'''retropseudogenes'''-these are genes that have been reverse-transcribed from mRNA and the resulting DNA sequence is incorporated back into the genome. They are non-functional segments of DNA and can be distinguished from pseudogenes in that they do not have intron sequences. ([http://genome.cshlp.org/cgi/content/full/10/5/672 1] Pallavi)

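'''retrotransposons''' - transposable elements that copy themselves through an RNA intermediate: the element is transcribed into RNA, which is reverse-transcribed back into DNA and inserted at a new position in the genome [http://en.wikipedia.org/wiki/Retrotransposon](Samantha)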

'''ribonuclease''' - a nuclease that catalyzes the degradation of RNA into smaller components [http://en.wikipedia.org/wiki/Ribonuclease] (Mary)

== S ==
'''Serovar'''-a subdivision of a species based on the characteristics of their cell surface antigens ([http://www.biology-online.org/dictionary/Serovar serovar] Pallavi)

'''scaffold''' - a section of a sequenced genome composed of contigs that are in the right order but not necessarily connected ([http://www.medterms.com/script/main/art.asp?articlekey=25223 MedTerms Dictionary], Jay)

'''"Shadow enhancers"'''-secondary enhancers that are thought to be important for natural selection to occur in regulatory DNA segments. They evolve much faster than primary enhancers, which suggests that they are under fewer functional constraints (Wray and Babbit 2008 ''Science'' Pallavi)

'''Shine-Dalgarno sequence''' - A ribosomal binding site on a prokaryotic mRNA, usually a sequence of about six bases located about six or seven bases upstream of the start codon. An anti-Shine-Dalgarno sequence exists on the rRNA in the small subunit of the ribosome; when the two sequences pair, the mRNA is lined up and prepared for translation. (Lecture and [http://en.wikipedia.org/wiki/Shine-dalgarno Wikipedia article], Laura) 
Note: The Shine-Dalgarno consensus sequence for our genome is ccGGAGGt.
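
One way to check a called start codon against this consensus is to scan the bases just upstream of it for the best match to the core motif. The Python sketch below is only an illustration under stated assumptions: it uses GGAGG (the uppercase core of ccGGAGGt) as the motif, a 20-base upstream window, and simple identity counting. The function name, window size and example sequence are hypothetical.

```python
def best_sd_match(seq, start, motif="GGAGG", window=20):
    """Find the best ungapped match to motif in the window upstream of start.

    Returns (matches, spacing), where matches is the number of identical
    bases and spacing is the number of bases between the end of the motif
    and the start codon. Returns (0, None) if nothing matches.
    """
    seq = seq.upper()
    upstream = seq[max(0, start - window):start]
    best = (0, None)
    for i in range(len(upstream) - len(motif) + 1):
        candidate = upstream[i:i + len(motif)]
        matches = sum(a == b for a, b in zip(candidate, motif))
        spacing = len(upstream) - (i + len(motif))
        if matches > best[0]:
            best = (matches, spacing)
    return best


# Made-up example: GGAGG sits six bases upstream of the ATG at index 14.
example = "TTTGGAGGTAACCTATGAAA"
print(best_sd_match(example, example.index("ATG")))
```

Because ties keep the first match found, the reported spacing favors the motif position farthest from the start codon; a real analysis would probably weight spacing as well as identity.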

'''SignalP''' - a prediction server that judges whether or not a query protein contains a signal peptide. SignalP measures each amino acid against the amino acid sequences of probable signal peptide matches and predicts the cleavage site of the signal peptide. ([http://www.cbs.dtu.dk/services/SignalP-3.0/output.php SignalP Output explained], Laura)

'''signal peptide''' - a short peptide chain that directs the post-translational transport of a protein [http://en.wikipedia.org/wiki/Signal_peptide] (Matt)

'''Smith-Waterman alignment''' - A well-known algorithm for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure [http://en.wikipedia.org/wiki/Smith_waterman](Will).
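
The algorithm can be sketched compactly in Python. This is a teaching sketch rather than a production aligner: the scoring values (match +2, mismatch -1, gap -2) and the linear gap penalty are assumptions chosen for illustration, and the function name is hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores never drop below zero, which makes it local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTACGTTT", "GGACGTCC"))
```

The matrix takes time and memory proportional to the product of the two sequence lengths, which is why heuristic tools such as BLAST are used for database-scale searches.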

'''SNP (Single Nucleotide Polymorphism)''' - a DNA sequence variation occurring when a single nucleotide in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual) [http://en.wikipedia.org/wiki/Single_nucleotide_polymorphism](Will).

'''symporter''' - an integral membrane protein that moves two or more different molecules or ions across a phospholipid membrane in the same direction. [http://en.wikipedia.org/wiki/Symporter] (Peter)

'''synteny''' - a neologism from the Greek for "on the same ribbon". Genes that are syntenic in one species are on the same chromosome; genes that are syntenic across species retain the same order on respective chromosomes as a result of descent from a common ancestor ([http://www.answers.com/synteny Answers.com], Jay)

'''synthetase''' - a type of enzyme that creates a new covalent bond and requires direct input of energy from a high-energy phosphate. [http://books.google.com/books?id=bB8XnCykRmIC&pg=PA522&lpg=PA522&dq=%22synthetase+is+an+enzyme%22&source=web&ots=wkws4ksMsg&sig=zWLkDIk7T78hcf9S84nWs3u5Apw&hl=en&sa=X&oi=book_result&resnum=9&ct=result] (Peter)

== T ==
'''transferase''' - an enzyme that catalyzes the transfer of a functional group from one molecule (the donor) to another (the acceptor) [http://en.wikipedia.org/wiki/Transferase] (Matt)

'''transmembrane helix''' - a single transmembrane alpha helix of a transmembrane protein, usually about twenty amino acids in length. They are usually predicted by hydrophobicity. [http://en.wikipedia.org/wiki/Transmembrane_domain](Mary)

'''transposons / transposable elements''' - DNA sequences that can move around to different positions in a single cell's genome. Transposons can cause mutations and change the length of the genome. [http://en.wikipedia.org/wiki/Transposon](Samantha)

'''Transposon Mutagenesis'''-a procedure in which a transposon is inserted into a gene, which inactivates the gene and can lead to the discovery of the phenotype associated with this gene ([http://cancerweb.ncl.ac.uk/cgi-bin/omd?transposon+mutagenesis transposon mutagenesis] Pallavi)

'''Trans-splicing''' - a form of splicing in which exon sequences from separate RNA transcripts are joined to form a single mature mRNA. This process results in fusion mRNA ([http://www.representinggenes.org/Glossary.html 8] Pallavi).

'''tRNA splicing endonuclease''' - an enzyme that cleaves intervening sequences of precursor tRNA. [http://cancerweb.ncl.ac.uk/cgi-bin/omd?splicing+endonuclease] (Peter) 

== U ==

== V ==
'''Vertical gene transfer'''-the transmission or absorption of genetic material that is associated with sexual reproduction and, thus, acknowledges species-specific boundaries ([http://www.gmo-compass.org/eng/glossary/#G 6] Pallavi)

== W ==

'''whole genome shotgun sequencing''' - a method of sequencing where DNA is cut into small pieces and cloned into vectors, then both ends of every cloned insert are sequenced for about 500 bp to form mate pairs. The two reads of a mate pair rarely overlap, but software uses them to reassemble the sequence. [http://en.wikipedia.org/wiki/Whole_genome_shotgun](Samantha)
 

== X ==
'''xenolog''' - homologs that are created by horizontal gene transfer between two different species [http://en.wikipedia.org/wiki/Xenolog#Xenology] (Matt) 

== Y ==

== Z ==

 
<HR>
<HR>

== This is a list of the student-created tutorials: ==

File:Earl.ppt

2008-10-09T13:47:46Z

SaSimpson:

Halorhabdus utahensis Genome

2008-10-09T13:47:22Z

SaSimpson: /* My favorite genes */

This page will be used by Davidson College students in the [http://www.bio.davidson.edu/Courses/Bio343/LabMethods.html Genomics Laboratory course].
__NOTOC__
== Links to Multiple Databases ==
*[http://imgweb.jgi-psf.org/cgi-bin/img_edu_v260/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2500575004 JGI IMG EDU] public access
*[[Media:JGIAnnotation.xls|JGI Annotation Excel Spreadsheet]]
*[http://www.tigr.org/tigr-scripts/prok_manatee/shared/login.cgi Manatee at JCVI] use the davidson number sent by email as username and password (database is nthu01 - this is case sensitive)
*[[Media:ManateeAnnotation.xls|Manatee Annotation Excel Spreadsheet]]
*[http://rast.nmpdr.org/ SEED view via RAST] use the username and password combination sent to you by SEED
*[[Media:RastAnnotation.xls|RAST Annotation Excel Spreadsheet]]
*[http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=18261238 RAST Publication in PubMed]
*[http://www.genome.jp/kegg/kaas/ KEGG] We can submit our genes to KEGG to have it mapped out, but SEED and Manatee may already do this. Do we want to ask them to upload it into their database? 
*[http://gcat.davidson.edu/Registry/compare/ Pairwise comparisons of All Three Annotations]

[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI_5contigs.txt JGI Full genome, 5 separate contigs & 3.1 Mbp, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.txt JGI gene DNA sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.xls JGI gene annotations, Excel] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_proteins.txt JGI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_merged.txt JCVI Full genome, 5 contigs fused, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_ORFs.txt JCVI gene sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_proteins.txt JCVI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/GeneLengths.xls 3-way comparison, Excel] 
[[Venn_diagrams]] Venn diagram of 3-way comparison

 

== RNA Genes ==

*[[tRNA Genes Check List]] 
*[[rRNA operon]] 
*[[2 misc. RNA genes]] (short summary list) 
*[[Missing tRNA-trp gene found]] 

== Other Resources ==
*[[Consensus Shine Dalgarno]] Excel File for ''H. utahensis'' 
*[[References]] 
*[[Gene Annotation Template]] 
*[[General Questions]] 
*[[Page for Annotated Genes]] 

== Tutorials for Annotating Genomes ==

# Will DeLoache - [http://www.bio.davidson.edu/courses/genomics/2008/DeLoache/BioPerlTutorial/BioPerl.htm BioPerl Installation] 
# Max Win - [http://www.bio.davidson.edu/courses/genomics/2008/Win/perl.html Introduction to Perl for non-programmers (with step-by-step explanations, simple exercises and solutions)] 
# Pallavi - Conserved Domains Database (CDD) [[Media:CDDtutorial.doc]] 
# Mary - Protein Data Bank (PDB) [[Media:PDB Tutorial.doc]] 
# Laura Voss - Pfam Database [http://www.bio.davidson.edu/Courses/Bio343/Pfam_tutorial.doc Pfam Tutorial] 
# Samantha Simpson - [http://www.bio.davidson.edu/courses/genomics/2008/Simpson/Tutorial.html NCBI BLAST] 
# Peter Bakke - [[Media:ShineDalgarnoTutorial.doc]] 
# Jay McNair - [http://www.bio.davidson.edu/courses/genomics/2008/McNair/OriginTutorial.doc Origin of Replication Tutorial] 
# Nick Carney - Navigating the JGI Database [[Media:NavigatingJGItutorial.doc]] 
# Matt Lotz - SEED Viewer - [[Media:SEEDTutorial.doc]] 

== Research Questions ==
#How do the three systems compare for finding ORFs and RNA genes?
#Is there a pattern of missed genes for any of the 3 sites?
#Do the three systems differ in their ability to find good start codons and Shine-Dalgarno sequences? [We need a standard set of genes for comparison. Only highly conserved or a range of genes?]
# Were Shine-Dalgarno sequences calculated for our species or default values used? If default, what sequence?
#Can we fill any holes in their automated annotation? Is there a mechanism for users to add in genes?
#How do the 3 sites compare for ease of use?
#What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working?
#How does each of the 3 sites compare for pathway detection and visualization?
#Do they find the origin of replication? Can we find it?

* How do the 3 systems compare when one calls a gene hypothetical and another calls it a functional protein? How can they vary, and which is closer to correct (however you define that, possibly by date of matched entry)? (Pallavi and Mary)
* Why did one system call a gene when the other two did not? (Matt and Laura)
* How do the 3 sites compare for ease of use? What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working? (Samantha and Nick)
* Where is the origin of replication and did the 3 systems attempt to identify this?
* Did the 3 systems utilize Shine-Dalgarno sequences to help them call start codons? Did they utilize our species's consensus Shine-Dalgarno? (Peter)
* We need to fill in the [[Venn diagrams]] for our 3-way comparison. Let's compare the size of ORFs and generate a [[Gene Length Histograms|graph comparing the distributions]] for all 3. (Max and Will - they also take requests).
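
One possible way to bin ORF lengths for the histogram comparison, sketched in Python. The lengths and the 300 bp bin size are placeholders chosen for illustration; real input would come from the gene tables linked above, one list per annotation system.

```python
from collections import Counter


def length_histogram(lengths, bin_size=300):
    """Count ORFs per length bin; keys are the lower edge of each bin."""
    counts = Counter((n // bin_size) * bin_size for n in lengths)
    return dict(sorted(counts.items()))


# Made-up ORF lengths in bp, standing in for one annotation system.
example_lengths = [100, 250, 300, 950]
print(length_histogram(example_lengths))  # {0: 2, 300: 1, 900: 1}
```

Running the same function on each of the three gene lists with the same bin size would give directly comparable distributions.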

<hr>
== My favorite genes==
Pallavi - Monooxygenase vs. Peroxiredoxin

Mary - JGI gene 2500588521 (922976...924046)

Max - JGI gene 2500587636 (2-1849)

Samantha - JGI gene 2500575882 (80504-80878) [[Media:Earl.ppt]]

Nick - JGI gene 2300587691 (69942...72866)

Will - JGI gene 2500590430 (2847205..2854335)

Jay - JGI gene 2500588397 (806410..807321) [http://www.bio.davidson.edu/courses/genomics/2008/McNair/FavoriteGenePresentation.pptx Co/Zn/Cd PowerPoint]

Matt - Transcriptional Regulator nrdR (3109722..3110204 + 7274..7765)

Peter - tRNA intron endonuclease [[Media:TRNAtrpintronendonuclease.ppt]]

Laura - 16S Small ribosomal subunit, JGI gene 2500590728 (2397347..2398825)

== This is a list of glossary words (A - Z): ==
[[#A| A ]] [[#B| B ]] [[#C| C ]] [[#D| D ]] [[#E| E ]] [[#F| F ]] [[#G| G ]] [[#H| H ]] [[#I| I ]] [[#J| J ]] [[#K| K ]] [[#L| L ]] [[#M| M ]] [[#N| N ]] [[#O| O ]] [[#P| P ]] [[#Q| Q ]] [[#R| R ]] [[#S| S ]] [[#T| T ]] [[#U| U ]] [[#V| V ]] [[#W| W ]] [[#X| X ]] [[#Y| Y ]] [[#Z| Z ]]

== A ==
'''Accession Number''' - a unique identifier given to DNA and protein sequences to allow for tracking of sequence information within a single database [http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)] (Will).

'''Antisense (RNA or DNA)'''-a piece of DNA or RNA that binds to a complementary sequence of DNA or RNA. These segments of genetic material can be used to identify the existence of a disease gene and they can also be used to bind to specific DNA or mRNA sequences to inhibit their function ([http://biotech.fyicenter.com/glossary/Bioinformatics_Glossary.html 5] Pallavi).

'''Arabidopsis thaliana''' - the scientific name for the thale cress plant; it was the first plant to have its genome sequenced, and is a model organism for understanding plant biology and genetics ([http://en.wikipedia.org/wiki/Thale_cress Wikipedia.org], Jay)

== B ==
'''BAC''' - bacterial artificial chromosome, a DNA construct used for transforming or cloning segments of DNA and often used to sequence the genetic code of organisms ([http://en.wikipedia.org/wiki/Bacterial_artificial_chromosome Wikipedia.org], Jay)

'''bioinformatics''' - the multi-disciplinary approach of using biology, computer science and mathematics to solve or better understand biological problems [http://en.wikipedia.org/wiki/Bioinformatics] (Matt)

'''BLAST''' - (Basic Local Alignment Search Tool) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. [http://blast.ncbi.nlm.nih.gov/Blast.cgi] (Mary)

'''bioperl''' - a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications such as accessing sequence data from local and remote databases, converting between database formats, manipulating individual sequences, searching for similar sequences, searching for genes and other structures on genomic DNA, or developing machine-readable sequence annotations. [http://en.wikipedia.org/wiki/BioPerl] (Wikipedia, Max Win)

== C ==
'''carbon fixation''' - using carbon dioxide to create organic materials [http://en.wikipedia.org/wiki/Carbon_fixation] (Samantha) 

'''CDD''' (Conserved Domains Database)- a database used to identify the conserved domains present in a protein query sequence [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml] (Mary)

'''chaperonin''' - a protein complex that assists some newly formed polypeptide chains by folding them into their final, functional, three-dimensional form [http://en.wikipedia.org/wiki/Chaperonins] (Matt)

'''chemotaxis''' - the process in which cells seek out or flee from a high concentration of certain chemicals; it is found in both uni- and multicellular organisms. Unicellular organisms use it to avoid toxins or find food, and multicellular organisms use it for tasks such as reproduction [http://en.wikipedia.org/wiki/Chemotaxis] (Nick)

'''chemotaxonomy''' - the attempt to classify and identify organisms according to demonstrable differences and similarities in their biochemical compositions [http://en.wikipedia.org/wiki/Chemotaxonomy] (Mary)

'''ClustalW''' - A web-based or command line tool that performs multiple sequence alignments to determine evolutionary relationships between three or more sequences [http://en.wikipedia.org/wiki/Clustal] (Will).

'''COG''' (Clusters of Orthologous Groups)- each COG corresponds to a highly conserved domain and generally consists of either individual proteins or groups of paralogs ([http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml COG] Pallavi) 

'''concatemer''' - long continuous DNA molecule that contains the same DNA sequence repeated in series [http://en.wikipedia.org/wiki/Concatemer](Samantha) 

'''contigs''' (contiguous DNA)- overlapping DNA segments that together form a longer, gapless segment of DNA. (Discovery Genomics, Proteomics and Bioinformatics [http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''coverage''' - refers to the number of times, on average, any piece of DNA in a sequenced genome has been individually sequenced (Lecture, Jay)
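
As a rough worked example, average coverage can be estimated as (number of reads × read length) / genome size. The Python sketch below uses the approximate 3.1 Mbp genome size and the roughly 500 bp read length mentioned elsewhere on this page; the read count is a made-up number, not a figure from the actual sequencing project.

```python
def average_coverage(num_reads, read_length, genome_size):
    """Average number of times each base is expected to be sequenced."""
    return num_reads * read_length / genome_size


# 62,000 reads is a hypothetical count chosen for illustration.
print(average_coverage(62_000, 500, 3_100_000))  # 10.0
```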

'''CPAN (Comprehensive Perl Archive Network)''' - an archive of over 12,200 modules of software written in Perl, as well as documentation for it. It contains a module called CPAN (or CPAN.pm) which is used as an installer for Perl modules such as BioPerl [http://en.wikipedia.org/wiki/CPAN](Will).

'''Cytogenetics'''-the study of normal and abnormal chromosomes. This involves studying the causes of chromosomal abnormalities and looking at the structure of chromosomes ([http://www.vivo.colostate.edu/hbooks/genetics/medgen/chromo/index.html 7] Pallavi).

== D ==
'''''de novo'' synthesis''' - the synthesis of complex molecules from simple molecules (e.g. sugars and nucleotides), rather than from recycled molecules; from the Latin "of the new" [http://en.wikipedia.org/wiki/De_novo_synthesis] (Matt)

'''dehydrogenase''' - a type of enzyme that oxidizes a substrate by transferring one or more protons and a pair of electrons to an acceptor. [http://en.wikipedia.org/wiki/Dehydrogenase] (Peter)

'''diatom''' - a major group of eukaryotic algae, and one of the most common types of phytoplankton. A characteristic feature of diatom cells is that they are encased within a unique cell wall made of silica called a frustule. These frustules show a wide diversity in form, but usually consist of two asymmetrical sides with a split between them. [http://en.wikipedia.org/wiki/Diatom] (Mary)

'''domain (protein)''' - the structural and functional groups of a protein, which can exist independently of the protein itself. Domains typically perform a specific function, such as binding to promoters or substrates, and many proteins can have one or several domains in common. Evolutionarily-linked proteins are more likely to have domains in common. Domains are used to organize proteins into families. ([http://en.wikipedia.org/wiki/Domain_(protein) Wikipedia article], Laura)

'''dot plot'''-graphical display comparing sequence conservation between two genomes with dots indicating strings of identical bases. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
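
A small Python sketch of the idea: a dot is recorded wherever a short word from one sequence occurs exactly in the other. The word size of 3 and the text rendering are illustrative assumptions; genome-scale tools use longer words and graphical output.

```python
def dot_plot(seq_a, seq_b, word=3):
    """Return the set of (i, j) with seq_a[i:i+word] == seq_b[j:j+word]."""
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)
    dots = set()
    for i in range(len(seq_a) - word + 1):
        for j in index.get(seq_a[i:i + word], []):
            dots.add((i, j))
    return dots


def render(dots, len_a, len_b, word=3):
    """Draw the dots as text, one row per position in the first sequence."""
    rows = []
    for i in range(len_a - word + 1):
        rows.append("".join("*" if (i, j) in dots else "."
                            for j in range(len_b - word + 1)))
    return "\n".join(rows)


a = b = "ACGTAC"
print(render(dot_plot(a, b), len(a), len(b)))
```

Comparing a sequence with itself, as here, gives an unbroken main diagonal; rearrangements between two genomes would show up as shifted or broken diagonals.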

== E ==

'''EC number''' (Enzyme Commission Number)- a numerical classification scheme for enzymes, based on the chemical reactions they catalyze [http://en.wikipedia.org/wiki/EC_number] (Mary)

'''E-value''' (Expect value)- When performing a BLAST search, you will obtain an E-value for each sequence that is retrieved. An E-value is the number of hits with a score at least as good that would be expected by chance in a database of the same size, so smaller E-values indicate more significant matches; for very small values it is close to the probability that the match arose by chance. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
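
For readers who want the arithmetic, BLAST statistics are commonly written as E = K·m·n·e^(−λS), where S is the raw alignment score, m and n are the query and database lengths, and K and λ depend on the scoring system. The Python sketch below applies that formula; the K and λ values in the example call are placeholders, not parameters reported by any real search.

```python
import math


def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e_value):
    """Probability of at least one chance hit, P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# lam and k below are illustrative placeholders only.
e = expect_value(raw_score=60, query_len=300, db_len=3_100_000,
                 lam=0.3, k=0.1)
print(e, p_value(e))
```

Because P = 1 − e^(−E), the E-value and the probability are nearly equal when E is much smaller than 1, which is why the two are often used interchangeably for strong hits.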

'''Extremophile''' - an organism that thrives in and may even require physically or geochemically extreme conditions that are detrimental to the majority of life on Earth [http://en.wikipedia.org/wiki/Extremophile] (Will).

== F ==

'''FASTA format''' - a text format used to convey either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented by single-letter codes. The sequence name and other descriptors appear on a header line beginning with ">" that precedes the sequence itself. [http://en.wikipedia.org/wiki/FASTA_format] (Nick) 
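
A minimal Python sketch of reading FASTA text, assuming the usual layout of a ">" header line followed by one or more sequence lines. The record names and sequences are made up, and the first word of each header is used as the identifier, which is a simplifying assumption.

```python
def parse_fasta(text):
    """Return {identifier: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence may span several lines
    return {n: "".join(parts) for n, parts in records.items()}


example = """>gene1 hypothetical protein
ATGC
GGTA
>gene2
TTAA
"""
print(parse_fasta(example))  # {'gene1': 'ATGCGGTA', 'gene2': 'TTAA'}
```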

'''family (protein)''' - a group of evolutionarily-related proteins, often with one or several domains in common. Families are organized by domain overlap, structural/functional similarity, and sequence similarity. ([http://en.wikipedia.org/wiki/Protein_family Wikipedia article] and lecture, Laura)

'''finished genome''' - a genome that has been sequenced at least partly by hand, resulting in at least 99.99% sequence accuracy (Lecture, Jay) 

'''fusion mRNA'''-mRNA that results from the transcription of a gene after a chromosomal translocation event. This results in an mRNA sequence that comes from two different genes (Rowley and Blumenthal 2008 ''Science'' Pallavi)

== G ==

'''GC Content''' - the percentage of bases within a certain sequence of DNA (e.g. a gene or a genome) that are either guanine or cytosine; a higher GC content is characteristic of a coding region of a gene; differences in GC content between a gene and a genome can be used as evidence for horizontal gene transfer [http://en.wikipedia.org/wiki/GC-content] (Matt) 
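
GC content is simple to compute; a short Python sketch follows. The example sequences are made up, and ambiguous bases such as N are counted in the total length here, which is one of several reasonable conventions.

```python
def gc_content(seq):
    """Percentage of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


print(gc_content("ATGC"))  # 50.0
print(gc_content("ggcc"))  # 100.0
```

Comparing the value for a single gene with the value for the whole genome is the kind of check the definition mentions as evidence for horizontal gene transfer.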

'''GC-skew''' – uneven distribution of guanine and cytosine bases between the two strands of DNA where GC base pairs occur. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
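
GC skew is commonly quantified on one strand as (G − C) / (G + C), often in windows along the genome. The Python sketch below uses non-overlapping windows; the window size and example sequence are illustrative assumptions.

```python
def gc_skew(seq):
    """(G - C) / (G + C) for one strand; 0.0 if there is no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return 0.0 if g + c == 0 else (g - c) / (g + c)


def windowed_skew(seq, window):
    """Skew in consecutive non-overlapping windows along the sequence."""
    return [gc_skew(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, window)]


print(windowed_skew("GGGGCCCC", 4))  # [1.0, -1.0]
```

In many bacterial genomes the sign of the skew changes near the origin and terminus of replication, so a windowed plot is one common starting point for locating an origin; whether that holds for this genome is not assumed here.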

'''gene amplification''' - production of multiple copies of a gene in order to amplify the amount of protein that the gene encodes for [http://www.medterms.com/script/main/art.asp?articlekey=13537] [http://www.answers.com/topic/gene-amplification] (Matt)

'''gene fusion'''-occurs when DNA segments of two different genes come together. Can result in hybrid proteins ([http://www.biochem.northwestern.edu/holmgren/Glossary/Definitions/Def-G/gene_fusion.html 9] Pallavi)

'''gene knockout''' - a process in which a gene is deactivated within a test organism in order to better understand the function of the gene in that organism [http://en.wikipedia.org/wiki/Gene_knockout] (Matt)

'''gene ontology''' - a collaborative effort of investigators to unify and standardize terms associated with the role a gene or protein plays in an organism. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''glaucophyte''' - freshwater algae that have not been studied well [http://en.wikipedia.org/wiki/Glaucophyte](Samantha) 

== H ==

'''haemolysin or hemolysin''' - a chemical produced by bacteria that causes lysis of red blood cells [http://en.wikipedia.org/wiki/Hemolysis_(microbiology)] (Nick)

'''halophile''' - an organism, most often of the Archaea domain, that lives in environments containing high concentrations of salt [http://en.wikipedia.org/wiki/Halophile] (Matt)

'''haplotype'''-collection of alleles that travel together (Lecture, Pallavi)

'''haptophyte''' - phylum of algae [http://en.wikipedia.org/wiki/Haptophyte](Samantha)

'''heterokont''' - major line of eukaryotes consisting of about 10,500 known species, most of which are algae [http://en.wikipedia.org/wiki/Heterokont](Samantha)

'''Hidden Markov Model''' - a statistical model used in protein recognition databases such as Pfam. A Hidden Markov Model keeps track of several variables and possible variations thereof, such as the possible amino acid sequences that make up a protein domain (since there can be some variance in an amino acid sequence) or the variations in the component sounds that make up a word, and uses those points to match a given sequence to the word, domain, or other complex sequence it most closely matches. An HMM in speech recognition software, for example, can identify that a certain set of sounds make up a certain word, even with the variations in pronunciation and accent that different people will give those sounds. ([http://en.wikipedia.org/wiki/Hidden_Markov_Model Wikipedia] and lecture, Laura)

'''HMM Logo''' - a graphical representation of an HMM, detailing the possible amino acid sequences, the relative frequencies and probabilities of each amino acid in the sequence, the relative contribution each amino acid has to the overall protein family, and the charge or nature of the amino acids themselves. ([http://www.sanger.ac.uk/Software/analysis/logomat-m/help.shtml How to read HMM Logos, on Pfam], Laura)

'''homeobox''' - a DNA sequence within transcription factor genes that allows the cell to respond to patterns of development by having the transcription factors switch on gene cascades [http://en.wikipedia.org/wiki/Homeobox](Samantha)

'''homodimer''' - a protein made of paired identical polypeptides ([http://www.answers.com/topic/homodimer Answers.com], Jay)

'''horizontal gene transfer'''-DNA transmission between species and incorporation of the DNA into the recipient's genome ([http://www.csrees.usda.gov/nea/biotech/res/biotechnology_res_glossary.html horizontal gene transfer] Pallavi)

'''''Hox'' gene'''-a gene that contains a homeobox region that is involved in morphogenesis along the cranio-caudal body axis ([http://www.uprightape.net/UA_Glossary.html 4] Pallavi)

'''hydrolase''' - an enzyme that catalyzes hydrolysis, the cleavage of a chemical bond by the addition of water [http://en.wikipedia.org/wiki/Hydrolase] (Nick)

== I ==

'''ideogram''' - in genomics, usually describes a stylized representation of a chromosome with banding patterns (Campbell-Heyer Genomics textbook, Jay)

'''identities''' - in a BLAST output, the number and fraction of total residues which are identical in a given alignment [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)

'''indole'''-a chemical compound that is produced from the break down of tryptophan ([http://medical-dictionary.thefreedictionary.com/indole indole] Pallavi)

'''inclusion body''' - Inclusion bodies are collections of stainable substances, usually proteins, that are found either in the nucleus or the cytoplasm. It is thought that these bodies are often the result of viral proteins that misfolded [http://en.wikipedia.org/wiki/Inclusion_body] (Nick)

'''intron''' - a region of DNA in a gene that is not part of the final coding sequence for the protein. [http://en.wikipedia.org/wiki/Intron] (Peter)

'''isoelectric point''' - the pH at which a molecule carries no net electrical charge [http://en.wikipedia.org/wiki/Isoelectric_point] (Nick)

'''isozymes''' - members of a gene family with very similar cellular roles (Campbell-Heyer Genomics textbook, Jay)

== J ==

== K ==
'''KEGG (Kyoto Encyclopedia of Genes and Genomes)''' - a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The Pathway database records networks of molecular interactions in the cells, and variants of them specific to particular organisms [http://en.wikipedia.org/wiki/KEGG](Will).

'''kinase''' - a type of enzyme that transfers a phosphate group from a high-energy donor molecule to a target molecule in a process called phosphorylation. [http://en.wikipedia.org/wiki/Kinase] (Peter)

== L ==

== M ==
'''Manatee''' - a web-based gene evaluation and genome annotation tool that can view, modify, and store annotation for prokaryotic and eukaryotic genomes. This on-going, open source initiative was developed with two missions: first, to give biologists the ability to functionally annotate their genomes using a powerful, stand-alone web application with a robustly designed relational annotation database; and second, to give outside developers the opportunity to contribute their own ideas and requirements to enhance Manatee's ability to accomplish biological goals [http://manatee.sourceforge.net/](Will). 

'''microsatellites'''-stretches of repetitive, short DNA segments that can be used to track the inheritance of certain traits within families ([http://www.clanlindsay.com/genetic_dna_glossary.htm 3] Pallavi)

'''minisatellites'''-segments of DNA that can be used for individual identification (ex. DNA fingerprinting) or in determining relationships between people (ex. paternity cases) ([http://www.clanlindsay.com/genetic_dna_glossary.htm 2] Pallavi).

'''motif''' - a sequence of amino acids or nucleotides that performs a particular role and is often conserved in other species or molecules. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''mycoplasma''' - genus of bacteria that lack a cell wall [http://en.wikipedia.org/wiki/Mycoplasma] (Nick)

== N ==

'''NORFs''' (nonannotated open reading frames) - open reading frames that were considered not to be real genes when the genome was annotated. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''nucleomorph''' - reduced eukaryotic nuclei found in plastids [http://en.wikipedia.org/wiki/Nucleomorph](Samantha)

== O ==
'''object-oriented programming''' - a programming paradigm in which collections of data, associated with operations on that data, are modularly defined and then built upon (CSC 121 Lecture, Will).

'''open reading frame (ORF)''' - a segment of DNA that can potentially encode a protein; it begins with a start codon (usually ATG) and continues in the same reading frame to a stop codon [http://www.fao.org/DOCREP/003/X3910E/X3910E18.htm ORF] (Pallavi)
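
A minimal Python sketch of an ORF scan. It assumes ATG as the only start codon and the standard stop codons TAA, TAG and TGA, and it reads only the three forward frames; a fuller search would also scan the reverse complement and allow alternative start codons. The example sequence is made up.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=1):
    """Return (start, end) index pairs of ATG...stop ORFs on the
    forward strand; end is exclusive and includes the stop codon."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                # Count coding codons, excluding the stop itself.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


seq = "CCATGAAATGACC"
print(find_orfs(seq))  # [(2, 11)]
print(seq[2:11])       # ATGAAATGA
```

The `min_codons` threshold is where the choice of minimum gene length enters, and differences in such thresholds could be one source of disagreement between gene callers.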

'''operon''' - a segment of DNA involving an operator, promoter, and one or more genes that operate as a single unit during transcription [http://en.wikipedia.org/wiki/Operon] (Nick)

'''optical mapping'''-DNA sequences of the organism in question are compared against a karyotype that specifically looks at restriction sites found within the DNA to correctly order the DNA sequences on a chromosome. This methodology gives very detailed haplotype information and allows for the detection of sequence variations across an entire genome [http://www.geocities.com/bioinformaticsweb/genomicglossary.html optical mapping] (Pallavi)

'''ortholog''' - genes in different species that descend from a single gene in their last common ancestor (i.e. they were separated by a speciation event) and often retain the same function (Lecture, Pallavi)

'''oxidoreductase''' - an enzyme that catalyzes redox reactions by transferring electrons from one molecule (the reductant) to another (the oxidant) [http://en.wikipedia.org/wiki/Oxidoreductase] (Nick)

== P ==

'''paralog''' - genes within the same genome that are related by a gene duplication event; their sequences are similar but need not be identical (Lecture, Pallavi)

'''p-arm''' - the shorter arm of a chromosome's two arms separated by the centromere (compare to q-arm, the longer arm) ([http://www.medterms.com/script/main/art.asp?articlekey=4715 MedTerms Dictionary], Jay)

'''Perl''' - Developed by Larry Wall in 1987, Perl is a [http://en.wikipedia.org/wiki/High-level_programming_language high-level programming language] used frequently by biologists and bioinformaticists [http://en.wikipedia.org/wiki/Perl] (Will).

'''periplasmic space''' - the space between the inner cytoplasmic membrane and external outer membrane in bacteria or archaea. [http://en.wikipedia.org/wiki/Periplasmic_space] (Peter)

'''Pfam''' - a database for protein domain families that matches amino acid sequences or nucleotide sequences to the related group of proteins to which they most likely belong. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura)

'''plasmid''' - an extra-chromosomal DNA molecule that is capable of replicating independently of the chromosomal DNA. Commonly found in bacteria and archaea. [http://en.wikipedia.org/wiki/Plasmid](Peter)

'''plastid''' - major organelles in plants or algae [http://en.wikipedia.org/wiki/Plastid](Samantha)

'''pleomorphism''' - the occurrence of two or more structural forms during a life cycle [http://en.wikipedia.org/wiki/Pleomorphism] (Mary)

'''phylogenetic tree''' - a diagram showing the evolutionary relationships between biological species that are thought to share a common ancestor [http://en.wikipedia.org/wiki/Phylogenetic_tree] (Nick)

'''phylotypes''' – a term intended to resolve the challenge of “species” when classifying prokaryotes using DNA sequence comparisons. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''positives''' - in a BLAST output, the number and fraction of residues for which the alignment scores have positive rather than negative values [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)

'''proteome''' - entire set of proteins expressed by a genome, cell, tissue, or organism. It may refer to expressed proteins under certain conditions [http://en.wikipedia.org/wiki/Proteome](Samantha)

'''PSORT''' - a prediction server that judges where a mature protein could be in the cell, based on its transmembrane domains, its predicted mature amino acid composition, and its signal sequences. ([http://psort.ims.u-tokyo.ac.jp/form.html PSORT], Laura)

'''pseudogenes''' - sequences of DNA that look like genes but are non-functional, most likely because they contain many stop codons. A pseudogene may have evolved away from a real gene, or a paralog might have taken its place (Lecture, Pallavi)

'''purine''' - a category of nitrogenous base consisting of a pyrimidine ring fused to an imidazole ring. Notable purine bases are adenine and guanine. [http://en.wikipedia.org/wiki/Purine] (Peter)

'''pyrimidine''' - a category of nitrogenous base consisting of a heterocyclic aromatic ring containing two nitrogen atoms at positions 1 and 3 of the six-member ring. Notable pyrimidine bases are cytosine, thymine, and uracil. [http://en.wikipedia.org/wiki/Pyrimidine] (Peter)

== Q ==

'''q-arm''' - the longer arm of a chromosome's two arms separated by the centromere (compare to p-arm, the shorter arm) ([http://www.medterms.com/script/main/art.asp?articlekey=5152 MedTerms Dictionary], Jay) 

'''query sequence''' - the sequence (whether amino acid or nucleotide) entered into a database’s search function and checked against the database entries. ([http://en.wikipedia.org/wiki/BLAST BLAST on Wikipedia], Laura)

== R ==

'''RAST''' - (Rapid Annotation using Subsystem Technology)- a fully-automated service for annotating bacterial and archaeal genomes. It provides high quality genome annotations for these genomes across the whole phylogenetic tree. ([http://rast.nmpdr.org/], Max Win)

'''rDNA'''-These are DNA sequences that encode for ribosomal RNA. Note that rDNA can also stand for recombinant DNA. ([http://en.wikipedia.org/wiki/Ribosomal_DNA rDNA] Pallavi)

'''residue (protein)''' - the remaining portion of an amino acid after a water molecule has been removed and it has been incorporated into a protein. Functional residues, referred to in Pfam, are the residues that perform some specific identifiable function or are part of a domain, and can be conserved across evolutionarily-related proteins. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura) 

'''retropseudogenes'''-these are genes that have been reverse-transcribed from mRNA and the resulting DNA sequence is incorporated back into the genome. They are non-functional segments of DNA and can be distinguished from pseudogenes in that they do not have intron sequences. ([http://genome.cshlp.org/cgi/content/full/10/5/672 1] Pallavi)

'''retrotransposons''' - RNA transcribed back into DNA and added into the genome [http://en.wikipedia.org/wiki/Retrotransposon](Samantha)

'''ribonuclease''' - a nuclease that catalyzes the degradation of RNA into smaller components [http://en.wikipedia.org/wiki/Ribonuclease] (Mary)

== S ==
'''Serovar'''-a subdivision of a species based on the characteristics of their cell surface antigens ([http://www.biology-online.org/dictionary/Serovar serovar] Pallavi)

'''scaffold''' - a section of a sequenced genome composed of contigs that are in the right order but not necessarily connected ([http://www.medterms.com/script/main/art.asp?articlekey=25223 MedTerms Dictionary], Jay)

'''"Shadow enhancers"'''-secondary enhancers that are thought to be important for natural selection to occur in regulatory DNA segments. They evolve much faster than primary enhancers, which suggests that they are under fewer functional constraints (Wray and Babbit 2008 ''Science'' Pallavi)

'''Shine-Dalgarno sequence''' - A ribosomal binding site on an mRNA, usually a sequence of about six nucleotides located six or seven nucleotides upstream of the start codon. An anti-Shine-Dalgarno sequence exists on the rRNA in the small subunit of the ribosome; when the two sequences pair, the mRNA is lined up and prepared for translation. (Lecture and [http://en.wikipedia.org/wiki/Shine-dalgarno Wikipedia article], Laura) 
Note: The Shine-Dalgarno consensus sequence for our genome is ccGGAGGt.
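
A minimal Python sketch of how one might look for a Shine-Dalgarno-like motif upstream of a candidate start codon. It assumes that the core of the consensus quoted above (GGAGG) is the motif of interest and that a 20-base search window is reasonable; the function name and both choices are illustrative assumptions, not part of any annotation pipeline.

```python
# Core of the consensus "ccGGAGGt" quoted above. Treating GGAGG as the
# core motif is an assumption made for this sketch.
SD_CORE = "GGAGG"


def find_sd_upstream(dna, start_index, window=20):
    """Return (motif_start, spacer) for the SD core nearest the start codon.

    dna         -- coding-strand DNA sequence (A/C/G/T)
    start_index -- 0-based index of the first base of the start codon
    window      -- number of bases upstream of the start codon to search
    Returns None when no exact copy of the core motif is found.
    """
    dna = dna.upper()
    lo = max(0, start_index - window)
    upstream = dna[lo:start_index]
    pos = upstream.rfind(SD_CORE)  # rightmost hit = closest to the start codon
    if pos == -1:
        return None
    motif_start = lo + pos
    spacer = start_index - (motif_start + len(SD_CORE))
    return motif_start, spacer


# The ATG starts at index 17; GGAGG sits at 5..9, leaving a 7-base spacer.
print(find_sd_upstream("TTTCCGGAGGTAACATCATGAAA", 17))  # (5, 7)
```

Exact matching is the simplest possible choice; a real comparison would probably allow mismatches or score the pairing with the anti-Shine-Dalgarno sequence.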

'''SignalP''' - a prediction server that judges whether or not a query protein contains a signal peptide. SignalP measures each amino acid against the amino acid sequences of probable signal peptide matches and predicts the cleavage site of the signal peptide. ([http://www.cbs.dtu.dk/services/SignalP-3.0/output.php SignalP Output explained], Laura)

'''signal peptide''' - a short peptide chain that directs the post-translational transport of a protein [http://en.wikipedia.org/wiki/Signal_peptide] (Matt)

'''Smith-Waterman alignment''' - A well-known algorithm for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure [http://en.wikipedia.org/wiki/Smith_waterman](Will).
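
The local-alignment idea can be sketched in a few lines of Python. This version returns only the best local score, not the alignment itself, and the scoring values (match +2, mismatch -1, linear gap -1) are assumptions chosen for illustration rather than values used by any particular tool.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b.

    Each cell holds the best score of any local alignment ending at that
    pair of positions; scores are never allowed to drop below zero, which
    is what lets an alignment start anywhere in either sequence.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared segment ACGT scores 4 matches x 2 = 8 despite different flanks.
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))  # 8
```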

'''SNP (Single Nucleotide Polymorphism)''' - a DNA sequence variation occurring when a single nucleotide in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual) [http://en.wikipedia.org/wiki/Single_nucleotide_polymorphism](Will).

'''symporter''' - an integral membrane protein that moves two or more different molecules or ions across a phospholipid membrane in the same direction. [http://en.wikipedia.org/wiki/Symporter] (Peter)

'''synteny''' - a neologism from the Greek for "on the same ribbon". Genes that are syntenic in one species are on the same chromosome; genes that are syntenic across species retain the same order on respective chromosomes as a result of descent from a common ancestor ([http://www.answers.com/synteny Answers.com], Jay)

'''synthetase''' - a type of enzyme that creates a new covalent bond and requires direct input of energy from a high-energy phosphate. [http://books.google.com/books?id=bB8XnCykRmIC&pg=PA522&lpg=PA522&dq=%22synthetase+is+an+enzyme%22&source=web&ots=wkws4ksMsg&sig=zWLkDIk7T78hcf9S84nWs3u5Apw&hl=en&sa=X&oi=book_result&resnum=9&ct=result] (Peter)

== T ==
'''transferase''' - an enzyme that catalyzes the transfer of a functional group from one molecule (the donor) to another (the acceptor) [http://en.wikipedia.org/wiki/Transferase] (Matt)

'''transmembrane helix''' - a single transmembrane alpha helix of a transmembrane protein, usually about twenty amino acids in length. They are usually predicted by hydrophobicity. [http://en.wikipedia.org/wiki/Transmembrane_domain](Mary)
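
Prediction by hydrophobicity can be sketched as a sliding-window average over a hydropathy scale. The example below uses the Kyte-Doolittle scale; the 19-residue window and the 1.6 cutoff are commonly quoted values for that scale, but treating them as the right settings here is an assumption, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_windows(protein, window=19):
    """Return the mean hydropathy of every full window along the protein."""
    protein = protein.upper()
    scores = []
    for i in range(len(protein) - window + 1):
        segment = protein[i:i + window]
        scores.append(sum(KD[aa] for aa in segment) / window)
    return scores


def candidate_tm_starts(protein, window=19, threshold=1.6):
    """Return start positions of windows hydrophobic enough to span a membrane."""
    scores = hydropathy_windows(protein, window)
    return [i for i, score in enumerate(scores) if score >= threshold]


# A run of leucines flanked by aspartates: the all-leucine window starts at 5.
print(5 in candidate_tm_starts("D" * 5 + "L" * 19 + "D" * 5))  # True
```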

'''transposons / transposable elements''' - DNA sequences that can move around to different positions in a single cell's genome. Transposons can cause mutations and change the length of the genome. [http://en.wikipedia.org/wiki/Transposon](Samantha)

'''Transposon Mutagenesis'''-a procedure in which a transposon is inserted into a gene, which inactivates the gene and can lead to the discovery of the phenotype associated with this gene ([http://cancerweb.ncl.ac.uk/cgi-bin/omd?transposon+mutagenesis transposon mutagenesis] Pallavi)

'''Trans-splicing''' - exon sequences from two separate RNA transcripts are joined to form a mature species of mRNA. This process results in fusion mRNA ([http://www.representinggenes.org/Glossary.html 8] Pallavi).

'''tRNA splicing endonuclease''' - an enzyme that cleaves intervening sequences of precursor tRNA. [http://cancerweb.ncl.ac.uk/cgi-bin/omd?splicing+endonuclease] (Peter) 

== U ==

== V ==
'''Vertical gene transfer'''-the transmission or absorption of genetic material that is associated with sexual reproduction and, thus, acknowledges species-specific boundaries ([http://www.gmo-compass.org/eng/glossary/#G 6] Pallavi)

== W ==

'''whole genome shotgun sequencing''' - a method of sequencing where DNA is cut into small pieces and cloned into vectors, then about 500 bp is sequenced from each end of every cloned insert to form mate pairs. Mate pairs rarely overlap, but software uses them to reassemble the sequence. [http://en.wikipedia.org/wiki/Whole_genome_shotgun](Samantha)
 

== X ==
'''xenolog''' - homologs that are created by horizontal gene transfer between two different species [http://en.wikipedia.org/wiki/Xenolog#Xenology] (Matt) 

== Y ==

== Z ==

 
<HR>
<HR>

== This is a list of the student-created tutorials: ==

Halorhabdus utahensis Genome

2008-10-07T14:40:24Z

SaSimpson: /* My favorite genes */

This page will be used by Davidson College students in the [http://www.bio.davidson.edu/Courses/Bio343/LabMethods.html Genomics Laboratory course].

== Links to Multiple Databases ==
*[http://imgweb.jgi-psf.org/cgi-bin/img_edu_v260/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2500575004 JGI IMG EDU] public access
**[[Media:JGIAnnotation.xls|JGI Annotation Excel Spreadsheet]]
*[http://www.tigr.org/tigr-scripts/prok_manatee/shared/login.cgi Manatee at JCVI] use the davidson number sent by email as username and password (database is nthu01 - this is case sensitive)
**[[Media:ManateeAnnotation.xls|Manatee Annotation Excel Spreadsheet]]
*[http://rast.nmpdr.org/ SEED view via RAST] use the username and password combination sent to you by SEED
**[[Media:RastAnnotation.xls|RAST Annotation Excel Spreadsheet]]
**[http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=18261238 RAST Publication in PubMed]
*[http://www.genome.jp/kegg/kaas/ KEGG] We can submit our genes to KEGG to have it mapped out, but SEED and Manatee may already do this. Do we want to ask them to upload it into their database? 
*[http://gcat.davidson.edu/Registry/compare/ Pairwise comparisons of All Three Annotations]

[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI_5contigs.txt JGI Full genome, 5 separate contigs & 3.1 Mbp, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.txt JGI gene DNA sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.xls JGI gene annotations, Excel] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_proteins.txt JGI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_merged.txt JCVI Full genome, 5 contigs fused, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_ORFs.txt JCVI gene sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_proteins.txt JCVI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/GeneLengths.xls 3-way comparison, Excel] 
[[Venn_diagrams]] Venn diagram of 3-way comparison

 

== RNA Genes ==

*[[tRNA Genes Check List]] 
*[[rRNA operon]] 
*[[2 misc. RNA genes]] (short summary list) 
*[[Missing tRNA-trp gene found]] 

== Other Resources ==
*[[Consensus Shine Dalgarno]] Excel File for ''H. utahensis'' 
*[[References]] 
*[[Gene Annotation Template]] 
*[[General Questions]] 
*[[Page for Annotated Genes]] 

== Tutorials for Annotating Genomes ==

# Will DeLoache- [http://www.bio.davidson.edu/courses/genomics/2008/DeLoache/BioPerlTutorial/BioPerl.htm BioPerl Installation] 
# Max Win- [http://www.bio.davidson.edu/courses/genomics/2008/Win/perl/ Introduction to Perl for non-programmers (with step-by-step explanations, simple exercises and solutions)] 

# Pallavi-Conserved Domains Database (CDD) [[Media:CDDtutorial.doc]] 
# Mary- Protein Data Bank (PDB) [[Media:PDB Tutorial.doc]] 
# Laura Voss - Pfam Database 
# Samantha Simpson - [http://www.bio.davidson.edu/courses/genomics/2008/Simpson/Tutorial.html NCBI BLAST] 
# Peter Bakke - [[Media:ShineDalgarnoTutorial.doc]] 
# Jay McNair - [http://www.bio.davidson.edu/courses/genomics/2008/McNair/OriginTutorial.doc Origin of Replication Tutorial] 
# Nick Carney - Navigating the JGI Database [[Media:NavigatingJGItutorial.doc]] 
# Matt Lotz - SEED Viewer - [[Media:SEEDTutorial.doc]] 

== Research Questions ==
#How do the three systems compare for finding ORFs and RNA genes?
#Is there a pattern of missed genes for any of the 3 sites?
#Do the three systems differ in their ability to find good start codons and Shine-Dalgarno sequences? [We need a standard set of genes for comparison. Only highly conserved or a range of genes?]
# Were Shine-Dalgarno sequences calculated for our species or default values used? If default, what sequence?
#Can we fill any holes in their automated annotation? Is there a mechanism for users to add in genes?
#How do the 3 sites compare for ease of use?
#What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working?
#How does each of the 3 sites compare for pathway detection and visualization?
#Do they find the origin of replication? Can we find it?

* How do the 3 systems compare when one system calls a gene hypothetical and another calls it a functional protein? How can they vary and who is getting it closer to correct (however you define that, possibly by date of matched entry)? (Pallavi and Mary)
* Why did one system call a gene when the other two did not? (Matt and Lara)
* How do the 3 sites compare for ease of use? What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working? (Samantha and Nick)
* Where is the origin of replication and did the 3 systems attempt to identify this?
* Did the 3 systems utilize Shine-Dalgarno sequences to help them call start codons? Did they utilize our species's consensus Shine-Dalgarno? (Peter)
* We need to fill in the [[Venn diagrams]] for our 3-way comparison. Let's compare the size of ORFs and generate a [[Gene Length Histograms|graph comparing the distributions]] for all 3. (Max and Will - they also take requests).

<hr>
== My favorite genes ==
Pallavi-Monooxygenase vs. Peroxiredoxin

Mary- JGI gene 2500588521 (922976...924046)

Max - JGI gene 2500587636 (2-1849)

Samantha - JGI gene 2500575882 (80504-80878)

== This is a list of glossary words (A - Z): ==
[[#A| A ]] [[#B| B ]] [[#C| C ]] [[#D| D ]] [[#E| E ]] [[#F| F ]] [[#G| G ]] [[#H| H ]] [[#I| I ]] [[#J| J ]] [[#K| K ]] [[#L| L ]] [[#M| M ]] [[#N| N ]] [[#O| O ]] [[#P| P ]] [[#Q| Q ]] [[#R| R ]] [[#S| S ]] [[#T| T ]] [[#U| U ]] [[#V| V ]] [[#W| W ]] [[#X| X ]] [[#Y| Y ]] [[#Z| Z ]]

== A ==
'''Accession Number''' - a unique identifier given to DNA and protein sequences to allow for tracking of sequence information within a single database [http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)] (Will).

'''Antisense (RNA or DNA)'''-a piece of DNA or RNA that binds to a complementary sequence of DNA or RNA. These segments of genetic material can be used to identify the existence of a disease gene and they can also be used to bind to specific DNA or mRNA sequences to inhibit their function ([http://biotech.fyicenter.com/glossary/Bioinformatics_Glossary.html 5] Pallavi).
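
As a small illustration of complementarity, the sequence that would pair with a given DNA strand is its reverse complement. The sketch below is a generic helper (the function name is hypothetical) and handles only unambiguous A/C/G/T bases.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # GCAT
```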

'''Arabidopsis thaliana''' - the scientific name for the thale cress plant; it was the first plant to have its genome sequenced, and is a model organism for understanding plant biology and genetics ([http://en.wikipedia.org/wiki/Thale_cress Wikipedia.org], Jay)

== B ==
'''BAC''' - bacterial artificial chromosome, a DNA construct used for transforming or cloning segments of DNA and often used to sequence the genetic code of organisms ([http://en.wikipedia.org/wiki/Bacterial_artificial_chromosome Wikipedia.org], Jay)

'''bioinformatics''' - the multi-disciplinary approach of using biology, computer science and mathematics to solve or better understand biological problems [http://en.wikipedia.org/wiki/Bioinformatics] (Matt)

'''BLAST''' - (Basic Local Alignment Search Tool) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. [http://blast.ncbi.nlm.nih.gov/Blast.cgi] (Mary)

'''bioperl'''- a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications such as accessing sequence data from local and remote databases, converting between database and file formats, manipulating individual sequences, searching for similar sequences, searching for genes and other structures on genomic DNA, and developing machine-readable sequence annotations. [http://en.wikipedia.org/wiki/BioPerl] (Wikipedia, Max Win)

== C ==
'''carbon fixation''' - using carbon dioxide to create organic materials [http://en.wikipedia.org/wiki/Carbon_fixation] (Samantha) 

'''CDD''' (Conserved Domains Database)- a database used to identify the conserved domains present in a protein query sequence [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml] (Mary)

'''chaperonin''' - a protein complex that assists some newly formed polypeptide chains by folding them into their final, functional, three-dimensional form [http://en.wikipedia.org/wiki/Chaperonins] (Matt)

'''chemotaxis''' - the process by which cells move toward or away from a high concentration of certain chemicals; it is found in both uni- and multicellular organisms. Unicellular organisms use it to avoid toxins or find food, while multicellular organisms use it for tasks such as reproduction [http://en.wikipedia.org/wiki/Chemotaxis] (Nick)

'''chemotaxonomy''' - the attempt to classify and identify organisms according to demonstrable differences and similarities in their biochemical compositions [http://en.wikipedia.org/wiki/Chemotaxonomy] (Mary)

'''ClustalW''' - A web-based or command line tool that performs multiple sequence alignments to determine evolutionary relationships between three or more sequences [http://en.wikipedia.org/wiki/Clustal] (Will).

'''COG''' (Clusters of Orthologous Groups)- each COG corresponds to a highly conserved domain and generally consists of either individual proteins or groups of paralogs ([http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml COG] Pallavi) 

'''concatemer''' - long continuous DNA molecule that contains the same DNA sequence repeated in series [http://en.wikipedia.org/wiki/Concatemer](Samantha) 

'''contigs''' (contiguous DNA)- overlapping DNA segments that as a collection form a longer and gapless segment of DNA. (Discovery Genomics, Proteomics and Bioinformatics [http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''coverage''' - refers to the number of times, on average, any piece of DNA in a sequenced genome has been individually sequenced (Lecture, Jay)
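
Average coverage is usually estimated as (number of reads × read length) / genome size. The short Python sketch below applies that formula; the read count in the example is a made-up number, while the 3.1 Mbp genome size and roughly 500 bp read length are the figures mentioned elsewhere on this page.

```python
def average_coverage(num_reads, read_length, genome_size):
    """Return the average number of times each base is expected to be sequenced."""
    return num_reads * read_length / genome_size


# Hypothetical example: 62,000 reads of 500 bp over a 3.1 Mbp genome.
print(average_coverage(62_000, 500, 3_100_000))  # 10.0
```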

'''CPAN (Comprehensive Perl Archive Network)''' - an archive of over 12,200 modules of software written in Perl, as well as documentation for it. It contains a module called CPAN (or CPAN.pm) which is used as an installer for Perl modules such as BioPerl [http://en.wikipedia.org/wiki/CPAN](Will).

'''Cytogenetics'''-the study of normal and abnormal chromosomes. This involves studying the causes of chromosomal abnormalities and looking at the structure of chromosomes ([http://www.vivo.colostate.edu/hbooks/genetics/medgen/chromo/index.html 7] Pallavi).

== D ==
'''''de novo'' synthesis''' - the synthesis of complex molecules from simple molecules (e.g. sugars and nucleotides), rather than from recycled molecules; from the Latin for "from the new" [http://en.wikipedia.org/wiki/De_novo_synthesis] (Matt)

'''dehydrogenase''' - a type of enzyme that oxidizes a substrate by transferring one or more protons and a pair of electrons to an acceptor. [http://en.wikipedia.org/wiki/Dehydrogenase] (Peter)

'''diatom''' - a major group of eukaryotic algae, and one of the most common types of phytoplankton. A characteristic feature of diatom cells is that they are encased within a unique cell wall made of silica called a frustule. These frustules show a wide diversity in form, but usually consist of two asymmetrical sides with a split between them. [http://en.wikipedia.org/wiki/Diatom] (Mary)

'''domain (protein)''' - the structural and functional groups of a protein, which can exist independently of the protein itself. Domains typically perform a specific function, such as binding to promoters or substrates, and many proteins can have one or several domains in common. Evolutionarily-linked proteins are more likely to have domains in common. Domains are used to organize proteins into families. ([http://en.wikipedia.org/wiki/Domain_(protein) Wikipedia article], Laura)

'''dot plot'''-graphical display comparing sequence conservation between two genomes with dots indicating strings of identical bases. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
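
A dot plot can be computed by recording every pair of positions at which the two sequences share an identical string of k bases. The sketch below returns those coordinates instead of drawing them; the word size k=3 and the function name are assumptions for illustration, and real genome comparisons would use much longer words.

```python
def dot_plot(seq_a, seq_b, k=3):
    """Return (i, j) pairs where seq_a[i:i+k] is identical to seq_b[j:j+k]."""
    # Index every k-base word of seq_b by its start position.
    index = {}
    for j in range(len(seq_b) - k + 1):
        index.setdefault(seq_b[j:j + k], []).append(j)
    # Look up every k-base word of seq_a in that index.
    dots = []
    for i in range(len(seq_a) - k + 1):
        for j in index.get(seq_a[i:i + k], []):
            dots.append((i, j))
    return dots


print(dot_plot("ACGTAC", "TACGT"))  # [(0, 1), (1, 2), (3, 0)]
```

Runs of dots along a diagonal correspond to conserved stretches; plotting the pairs with any graphing tool gives the familiar picture.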

== E ==

'''EC number''' (Enzyme Commission Number)- a numerical classification scheme for enzymes, based on the chemical reactions they catalyze [http://en.wikipedia.org/wiki/EC_number] (Mary)

'''E-value''' (Expect value)- When performing a BLAST search, you will obtain an E-value for each sequence that is retrieved. An E-value can be thought of as the number of matches with an equally good or better score that would be expected by chance in a database of the same size; the lower the E-value, the more significant the match. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
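
To get a feel for how E-values behave, BLAST-style statistics relate a normalized bit score S' to the expect value by E = m × n × 2^(-S'), where m is the query length and n is the database length. The sketch below assumes that formula with raw lengths; real BLAST applies effective-length corrections, so the numbers are only indicative, and the function name is hypothetical.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score using E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Every extra bit of score halves the number of chance matches expected.
print(expect_value(10, 100, 1024))  # 100.0
print(expect_value(11, 100, 1024))  # 50.0
```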

'''Extremophile''' - an organism that thrives in and may even require physically or geochemically extreme conditions that are detrimental to the majority of life on Earth [http://en.wikipedia.org/wiki/Extremophile] (Will).

== F ==

'''FASTA format''' - a text format used to convey either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented by single-letter codes. A header line beginning with ">" gives the sequence name and other descriptors and precedes the sequence itself. [http://en.wikipedia.org/wiki/FASTA_format] (Nick) 
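
A minimal Python sketch of reading FASTA text, assuming the usual layout of a ">" header line followed by one or more sequence lines. The function name and the example records are made up for illustration; for real work a library such as BioPerl or Biopython would be the more robust choice.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its full sequence."""
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        records[header] = "".join(chunks)
    return records


example = ">gene1 hypothetical protein\nATGC\nGGTA\n>gene2\nTTAA\n"
print(parse_fasta(example))
# {'gene1 hypothetical protein': 'ATGCGGTA', 'gene2': 'TTAA'}
```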

'''family (protein)''' - a group of evolutionarily-related proteins, often with one or several domains in common. Families are organized by domain overlap, structural/functional similarity, and sequence similarity. ([http://en.wikipedia.org/wiki/Protein_family Wikipedia article] and lecture, Laura)

'''finished genome''' - a genome that has been sequenced at least partly by hand, resulting in at least 99.99% sequence accuracy (Lecture, Jay) 

'''fusion mRNA'''-mRNA that results from the transcription of a gene after a chromosomal translocation event. This results in an mRNA sequence that comes from two different genes (Rowley and Blumenthal 2008 ''Science'' Pallavi)

== G ==

'''GC Content''' - the percentage of bases within a certain sequence of DNA (e.g. a gene or a genome) that are either guanine or cytosine; a higher GC content is characteristic of a coding region of a gene; differences in GC content between a gene and a genome can be used as evidence for horizontal gene transfer [http://en.wikipedia.org/wiki/GC-content] (Matt) 
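
GC content is straightforward to compute. The sketch below returns a percentage and counts only G and C against the full sequence length; how ambiguous bases such as N should be handled is left open, and the function name is hypothetical.

```python
def gc_content(seq):
    """Return the percentage of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


print(gc_content("GGCCAATT"))  # 50.0
```

Comparing the value for a single gene with the value for the whole genome is one way to flag candidates for horizontal gene transfer, as described above.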

'''GC-skew''' – uneven distribution of guanine and cytosine bases between the two strands of DNA where GC base pairs occur. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
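
GC skew is commonly quantified on one strand as (G - C) / (G + C) in consecutive windows along the sequence. The sketch below assumes that definition with non-overlapping windows; the window size and function name are illustrative choices.

```python
def gc_skew(seq, window=1000):
    """Return (G - C) / (G + C) for each full non-overlapping window of seq."""
    seq = seq.upper()
    skews = []
    for i in range(0, len(seq) - window + 1, window):
        chunk = seq[i:i + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # Windows with no G or C are reported as 0.0 to avoid dividing by zero.
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


print(gc_skew("GGGCAAAACCCG", window=6))  # [0.5, -0.5]
```

In many bacterial chromosomes the sign of the skew switches near the origin and terminus of replication, so a plot of these values is one possible starting point for locating an origin. Whether that holds for a given genome has to be checked.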

'''gene amplification''' - production of multiple copies of a gene in order to amplify the amount of protein that the gene encodes for [http://www.medterms.com/script/main/art.asp?articlekey=13537] [http://www.answers.com/topic/gene-amplification] (Matt)

'''gene fusion'''-occurs when DNA segments of two different genes come together. Can result in hybrid proteins ([http://www.biochem.northwestern.edu/holmgren/Glossary/Definitions/Def-G/gene_fusion.html 9] Pallavi)

'''gene knockout''' - a process in which a gene is deactivated within a test organism in order to better understand the function of the gene in that organism [http://en.wikipedia.org/wiki/Gene_knockout] (Matt)

'''gene ontology'''- a collaborative effort of investigators to unify and standardize terms associated with the role a gene or protein plays in an organism. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''glaucophyte''' - freshwater algae that have not been studied well [http://en.wikipedia.org/wiki/Glaucophyte](Samantha) 

== H ==

'''haemolysin or hemolysin''' - a chemical produced by a bacterium that causes lysis of red blood cells [http://en.wikipedia.org/wiki/Hemolysis_(microbiology)] (Nick)

'''halophile''' - an organism, most often of the Archaea domain, that lives in environments containing high concentrations of salt [http://en.wikipedia.org/wiki/Halophile] (Matt)

'''haplotype'''-collection of alleles that travel together (Lecture, Pallavi)

'''haptophyte''' - phylum of algae [http://en.wikipedia.org/wiki/Haptophyte](Samantha)

'''heterokont''' - major line of eukaryotes consisting of about 10,500 known species, most of which are algae [http://en.wikipedia.org/wiki/Heterokont](Samantha)

'''Hidden Markov Model''' - a statistical model used in protein recognition databases such as Pfam. A Hidden Markov Model keeps track of several variables and possible variations thereof, such as the possible amino acid sequences that make up a protein domain (since there can be some variance in an amino acid sequence) or the variations in the component sounds that make up a word, and uses those points to match a given sequence to the word, domain, or other complex sequence it most closely matches. An HMM in speech recognition software, for example, can identify that a certain set of sounds make up a certain word, even with the variations in pronunciation and accent that different people will give those sounds. ([http://en.wikipedia.org/wiki/Hidden_Markov_Model Wikipedia] and lecture, Laura)

'''HMM Logo''' - a graphical representation of an HMM, detailing the possible amino acid sequences, the relative frequencies and probabilities of each amino acid in the sequence, the relative contribution each amino acid has to the overall protein family, and the charge or nature of the amino acids themselves. ([http://www.sanger.ac.uk/Software/analysis/logomat-m/help.shtml How to read HMM Logos, on Pfam], Laura)

'''homeobox''' - DNA sequence within transcription factor genes that allows the cell to respond to patterns of development by having the transcription factors switch on gene cascades [http://en.wikipedia.org/wiki/Homeobox](Samantha)

'''homodimer''' - a protein made of paired identical polypeptides ([http://www.answers.com/topic/homodimer Answers.com], Jay)

'''horizontal gene transfer'''-DNA transmission between species and incorporation of the DNA into the recipient's genome ([http://www.csrees.usda.gov/nea/biotech/res/biotechnology_res_glossary.html horizontal gene transfer] Pallavi)

'''''Hox'' gene'''-a gene that contains a homeobox region that is involved in morphogenesis along the cranio-caudal body axis ([http://www.uprightape.net/UA_Glossary.html 4] Pallavi)

'''hydrolase''' - an enzyme that catalyzes hydrolysis, the cleavage of a chemical bond by the addition of water [http://en.wikipedia.org/wiki/Hydrolase] (Nick)

== I ==

'''ideogram''' - in genomics, usually describes a stylized representation of a chromosome with banding patterns (Campbell-Heyer Genomics textbook, Jay)

'''identities''' - in a BLAST output, the number and fraction of total residues which are identical in a given alignment [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)
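
The count can be reproduced from any pair of aligned sequences. The sketch below assumes the two strings are already aligned to equal length with "-" marking gaps, and reports matches over the full alignment length; the function name is hypothetical.

```python
def identities(aligned_a, aligned_b):
    """Return (identical_positions, alignment_length) for two aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return matches, len(aligned_a)


matches, length = identities("ACG-T", "ACGAT")
print(f"Identities = {matches}/{length} ({100 * matches // length}%)")
# Identities = 4/5 (80%)
```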

'''indole'''-a chemical compound that is produced from the breakdown of tryptophan ([http://medical-dictionary.thefreedictionary.com/indole indole] Pallavi)

'''inclusion body''' - Inclusion bodies are collections of stainable substances, usually proteins, that are found either in the nucleus or the cytoplasm. It is thought that these bodies are often the result of viral proteins that misfolded [http://en.wikipedia.org/wiki/Inclusion_body] (Nick)

'''intron''' - a region of DNA in a gene that is not part of the final coding sequence for the protein. [http://en.wikipedia.org/wiki/Intron] (Peter)

'''isoelectric point''' - the pH at which a molecule carries no net electrical charge [http://en.wikipedia.org/wiki/Isoelectric_point] (Nick)

'''isozymes''' - members of a gene family with very similar cellular roles (Campbell-Heyer Genomics textbook, Jay)

== J ==

== K ==
'''KEGG (Kyoto Encyclopedia of Genes and Genomes)''' - a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The Pathway database records networks of molecular interactions in the cells, and variants of them specific to particular organisms [http://en.wikipedia.org/wiki/KEGG](Will).

'''kinase''' - a type of enzyme that transfers a phosphate group from a high-energy donor molecule to a target molecule in a process called phosphorylation. [http://en.wikipedia.org/wiki/Kinase] (Peter)

== L ==

== M ==
'''Manatee''' - a web-based gene evaluation and genome annotation tool that can view, modify, and store annotation for prokaryotic and eukaryotic genomes. This on-going, open source initiative was developed with two missions: first, to give biologists the ability to functionally annotate their genomes using a powerful, stand-alone web application with a robustly designed relational annotation database; and second, to give outside developers the opportunity to contribute their own ideas and requirements to enhance Manatee's ability to accomplish biological goals [http://manatee.sourceforge.net/](Will). 

'''microsatellites'''-stretches of repetitive, short DNA segments that can be used to track the inheritance of certain traits within families ([http://www.clanlindsay.com/genetic_dna_glossary.htm 3] Pallavi)

'''minisatellites'''-segments of DNA that can be used for individual identification (ex. DNA fingerprinting) or in determining relationships between people (ex. paternity cases) ([http://www.clanlindsay.com/genetic_dna_glossary.htm 2] Pallavi).

'''motif''' - a sequence of amino acids or nucleotides that performs a particular role and is often conserved in other species or molecules. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''mycoplasma''' - genus of bacteria that lack a cell wall [http://en.wikipedia.org/wiki/Mycoplasma] (Nick)

== N ==

'''NORFs''' (nonannotated open reading frame) - an open reading frame that was considered not to be a real gene when the genome was annotated. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''nucleomorph''' - reduced eukaryotic nuclei found in plastids [http://en.wikipedia.org/wiki/Nucleomorph](Samantha)

== O ==
'''object-oriented programming''' - a programming paradigm in which collections of data, associated with operations on that data, are modularly defined and then built upon (CSC 121 Lecture, Will).

'''open reading frame (ORF)'''-a segment of DNA that can potentially encode a protein; it begins with a start codon (usually ATG) and ends with a stop codon [http://www.fao.org/DOCREP/003/X3910E/X3910E18.htm ORF] (Pallavi)
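
A minimal Python sketch of ORF finding on the forward strand only. It assumes ATG as the sole start codon and the standard stop codons TAA, TAG and TGA; the minimum length, the function name, and the decision to ignore the reverse strand are simplifications for illustration, and real gene callers are considerably more careful.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) pairs for forward-strand ORFs in all three frames.

    start is the 0-based index of the A in ATG; end is one past the last
    base of the stop codon, so dna[start:end] is the whole ORF.
    min_codons counts every codon in the ORF, including start and stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG in this frame opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


seq = "CCATGAAATTTTAGGG"
print(find_orfs(seq))  # [(2, 14)]
print(seq[2:14])       # ATGAAATTTTAG
```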

'''operon''' - a segment of DNA involving an operator, promoter, and one or more genes that operate as a single unit during transcription [http://en.wikipedia.org/wiki/Operon] (Nick)

'''optical mapping'''-DNA sequences of the organism in question are compared against a karyotype that specifically looks at restriction sites found within the DNA to correctly order the DNA sequences on a chromosome. This methodology gives very detailed haplotype information and allows for the detection of sequence variations across an entire genome [http://www.geocities.com/bioinformaticsweb/genomicglossary.html optical mapping] (Pallavi)

'''ortholog'''-genes in different species that look very similar because they descend from a single gene in a common ancestor (Lecture, Pallavi)

'''oxidoreductase''' - an enzyme that catalyzes redox reactions by transferring electrons from one molecule (the reductant) to another (the oxidant) [http://en.wikipedia.org/wiki/Oxidoreductase] (Nick)

== P ==

'''paralog'''-similar DNA sequences within a species that arose by duplication of a gene (Lecture, Pallavi)

'''p-arm''' - the shorter arm of a chromosome's two arms separated by the centromere (compare to q-arm, the longer arm) ([http://www.medterms.com/script/main/art.asp?articlekey=4715 MedTerms Dictionary], Jay)

'''Perl''' - Developed by Larry Wall in 1987, Perl is a [http://en.wikipedia.org/wiki/High-level_programming_language high-level programming language] used frequently by biologists and bioinformaticists [http://en.wikipedia.org/wiki/Perl] (Will).

'''periplasmic space''' - the space between the inner cytoplasmic membrane and external outer membrane in bacteria or archaea. [http://en.wikipedia.org/wiki/Periplasmic_space] (Peter)

'''Pfam''' - a database for protein domain families that matches amino acid sequences or nucleotide sequences to the related group of proteins to which they most likely belong. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura)

'''plasmid''' - an extra-chromosomal DNA molecule that is capable of replicating independently of the chromosomal DNA. Commonly found in bacteria and archaea. [http://en.wikipedia.org/wiki/Plasmid](Peter)

'''plastid''' - major organelles in plants or algae [http://en.wikipedia.org/wiki/Plastid](Samantha)

'''pleomorphism''' - the occurrence of two or more structural forms during a life cycle [http://en.wikipedia.org/wiki/Pleomorphism] (Mary)

'''phylogenetic tree''' - a diagram showing the evolutionary relationships between biological species that are thought to share a common ancestor [http://en.wikipedia.org/wiki/Phylogenetic_tree] (Nick)

'''phylotypes''' – a term intended to resolve the challenge of “species” when classifying prokaryotes using DNA sequence comparisons. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''positives''' - in a BLAST output, the number and fraction of residues for which the alignment scores have positive rather than negative values [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)

'''proteome''' - entire set of proteins expressed by a genome, cell, tissue, or organism. It may refer to expressed proteins under certain conditions [http://en.wikipedia.org/wiki/Proteome](Samantha)

'''PSORT''' - a prediction server that judges where a mature protein could be in the cell, based on its transmembrane domains, its predicted mature amino acid composition, and its signal sequences. ([http://psort.ims.u-tokyo.ac.jp/form.html PSORT], Laura)

'''pseudogenes'''-A sequence of DNA that looks like a gene, but most likely contains many stop codons. It may have evolved away from a real gene or a paralog might have taken its place (Lecture, Pallavi)

'''purine''' - a category of nitrogenous base consisting of a pyrimidine ring fused to an imidazole ring. Notable purine bases are adenine and guanine. [http://en.wikipedia.org/wiki/Purine] (Peter)

'''pyrimidine''' - a category of nitrogenous base consisting of a heterocyclic aromatic ring containing two nitrogen atoms at positions 1 and 3 of the six-member ring. Notable pyrimidine bases are cytosine, thymine, and uracil. [http://en.wikipedia.org/wiki/Pyrimidine] (Peter)

== Q ==

'''q-arm''' - the longer arm of a chromosome's two arms separated by the centromere (compare to p-arm, the shorter arm) ([http://www.medterms.com/script/main/art.asp?articlekey=5152 MedTerms Dictionary], Jay) 

'''query sequence''' - the sequence (whether amino acid or nucleotide) entered into a database’s search function and checked against the database entries. ([http://en.wikipedia.org/wiki/BLAST BLAST on Wikipedia], Laura)

== R ==

'''RAST''' - (Rapid Annotation using Subsystem Technology)- a fully-automated service for annotating bacterial and archaeal genomes. It provides high quality genome annotations for these genomes across the whole phylogenetic tree. ([http://rast.nmpdr.org/], Max Win)

'''rDNA'''-DNA sequences that encode ribosomal RNA. Note that rDNA can also stand for recombinant DNA. ([http://en.wikipedia.org/wiki/Ribosomal_DNA rDNA] Pallavi)

'''residue (protein)''' - the remaining portion of an amino acid after a water molecule has been removed and it has been incorporated into a protein. Functional residues, referred to in Pfam, are the residues that perform some specific identifiable function or are part of a domain, and can be conserved across evolutionarily-related proteins. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura) 

'''retropseudogenes'''-these are genes that have been reverse-transcribed from mRNA and the resulting DNA sequence is incorporated back into the genome. They are non-functional segments of DNA and can be distinguished from pseudogenes in that they do not have intron sequences. ([http://genome.cshlp.org/cgi/content/full/10/5/672 1] Pallavi)

'''retrotransposons''' - genetic elements that copy themselves through an RNA intermediate, which is reverse-transcribed back into DNA and added into the genome at a new position [http://en.wikipedia.org/wiki/Retrotransposon](Samantha)

'''ribonuclease''' - a nuclease that catalyzes the degradation of RNA into smaller components [http://en.wikipedia.org/wiki/Ribonuclease] (Mary)

== S ==
'''Serovar'''-a subdivision of a species based on the characteristics of their cell surface antigens ([http://www.biology-online.org/dictionary/Serovar serovar] Pallavi)

'''scaffold''' - a section of a sequenced genome composed of contigs that are in the right order but not necessarily connected ([http://www.medterms.com/script/main/art.asp?articlekey=25223 MedTerms Dictionary], Jay)

'''"Shadow enhancers"'''-secondary enhancers that are thought to be important for natural selection to occur in regulatory DNA segments. They evolve much faster than primary enhancers, which suggests that they are under fewer functional constraints (Wray and Babbit 2008 ''Science'' Pallavi)

'''Shine-Dalgarno sequence''' - A ribosomal binding site on an mRNA, usually a sequence of about six bases located six or seven bases upstream of the start codon. An anti-Shine-Dalgarno sequence exists on the rRNA in the small subunit of the ribosome; when the two sequences pair, the mRNA is lined up and prepared for translation. (Lecture and [http://en.wikipedia.org/wiki/Shine-dalgarno Wikipedia article], Laura) 
Note: The Shine-Dalgarno consensus sequence for our genome is ccGGAGGt.
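
A minimal Python sketch of how one might look for the core of this consensus upstream of a start codon. The function names, the choice of GGAGG as the core motif, and the example region are assumptions made for illustration; none of the three annotation systems is known to work this way.

```python
def find_sd_sites(upstream, core="GGAGG"):
    """Return the 0-based offsets of every occurrence of the core motif."""
    upstream = upstream.upper()
    hits = []
    pos = upstream.find(core)
    while pos != -1:
        hits.append(pos)
        pos = upstream.find(core, pos + 1)
    return hits


def sd_spacing(upstream, core="GGAGG"):
    """Number of bases between the last core motif and the start codon.

    Assumes `upstream` is written 5' to 3' and ends immediately before
    the start codon. Returns None when the motif is absent.
    """
    hits = find_sd_sites(upstream, core)
    if not hits:
        return None
    return len(upstream) - (hits[-1] + len(core))


# Hypothetical upstream region, not taken from the H. utahensis genome.
region = "ttaccggaggtgatcac"
print(find_sd_sites(region), sd_spacing(region))
```

An exact string match is the simplest possible test; a real search would probably tolerate mismatches to the consensus.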

'''SignalP''' - a prediction server that judges whether or not a query protein contains a signal peptide. SignalP measures each amino acid against the amino acid sequences of probable signal peptide matches and predicts the cleavage site of the signal peptide. ([http://www.cbs.dtu.dk/services/SignalP-3.0/output.php SignalP Output explained], Laura)

'''signal peptide''' - a short peptide chain that directs the post-translational transport of a protein [http://en.wikipedia.org/wiki/Signal_peptide] (Matt)

'''Smith-Waterman alignment''' - A well-known algorithm for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure [http://en.wikipedia.org/wiki/Smith_waterman](Will).
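
A short Python sketch of the scoring step of a Smith-Waterman local alignment. The scoring values (match +2, mismatch -1, gap -1) and the linear gap penalty are assumptions chosen for illustration; real tools use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of any local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            # The floor of 0 lets an alignment restart anywhere, which is
            # what makes the alignment local rather than global.
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best


print(smith_waterman_score("XXACGTXX", "YYACGTYY"))
```

Recovering the aligned segments themselves requires a traceback from the highest-scoring cell, which is omitted here to keep the sketch short.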

'''SNP (Single Nucleotide Polymorphism)''' - a DNA sequence variation occurring when a single nucleotide in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual) [http://en.wikipedia.org/wiki/Single_nucleotide_polymorphism](Will).
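
As an illustration, single-nucleotide differences between two already-aligned sequences can be listed with a few lines of Python. The function name and example sequences are hypothetical, and gap positions are skipped because they represent insertions or deletions rather than SNPs.

```python
def find_snps(seq_a, seq_b):
    """List (position, base_a, base_b) for every single-base difference.

    Both sequences must already be aligned to the same length.
    Positions are 0-based.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    snps = []
    for pos, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        # Skip gap characters: those mark indels, not substitutions.
        if a != b and a != "-" and b != "-":
            snps.append((pos, a, b))
    return snps


print(find_snps("ACGTACGT", "ACGAACGT"))
```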

'''symporter''' - an integral membrane protein that is involved in movement of two or more different molecules or ions across a phospholipid membrane in the same direction. [http://en.wikipedia.org/wiki/Symporter] (Peter)

'''synteny''' - a neologism from the Greek for "on the same ribbon". Genes that are syntenic in one species are on the same chromosome; genes that are syntenic across species retain the same order on respective chromosomes as a result of descent from a common ancestor ([http://www.answers.com/synteny Answers.com], Jay)

'''synthetase''' - a type of enzyme that creates a new covalent bond and requires direct input of energy from a high-energy phosphate. [http://books.google.com/books?id=bB8XnCykRmIC&pg=PA522&lpg=PA522&dq=%22synthetase+is+an+enzyme%22&source=web&ots=wkws4ksMsg&sig=zWLkDIk7T78hcf9S84nWs3u5Apw&hl=en&sa=X&oi=book_result&resnum=9&ct=result] (Peter)

== T ==
'''transferase''' - an enzyme that catalyzes the transfer of a functional group from one molecule (the donor) to another (the acceptor) [http://en.wikipedia.org/wiki/Transferase] (Matt)

'''transmembrane helix''' - a single transmembrane alpha helix of a transmembrane protein, usually about twenty amino acids in length. They are usually predicted by hydrophobicity. [http://en.wikipedia.org/wiki/Transmembrane_domain](Mary)
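
A minimal Python sketch of prediction by hydrophobicity, using a sliding-window average of the Kyte-Doolittle hydropathy scale. The window of 19 residues and the cutoff of 1.6 are commonly cited values for this scale, but treating them as fixed is an assumption here, and the example protein is made up.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Average hydropathy of every window of `window` residues."""
    scores = [KD[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_windows(protein, window=19, threshold=1.6):
    """0-based start positions of windows whose average meets the threshold."""
    profile = hydropathy_profile(protein, window)
    return [i for i, score in enumerate(profile) if score >= threshold]


# Hypothetical protein: a 19-residue leucine stretch flanked by aspartates.
protein = "D" * 10 + "L" * 19 + "D" * 10
print(candidate_tm_windows(protein))
```

Overlapping windows above the cutoff would normally be merged into a single predicted helix.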

'''transposons / transposable elements''' - DNA sequences that can move around to different positions in a single cell's genome. Transposons can cause mutations and change the length of the genome. [http://en.wikipedia.org/wiki/Transposon](Samantha)

'''Transposon Mutagenesis'''-a procedure in which a transposon is inserted into a gene, which inactivates the gene and can lead to the discovery of the phenotype associated with this gene ([http://cancerweb.ncl.ac.uk/cgi-bin/omd?transposon+mutagenesis transposon mutagenesis] Pallavi)

'''Trans-splicing'''-a form of RNA processing in which exon sequences from separate RNA transcripts are joined to form a single mature mRNA. This process results in fusion mRNA ([http://www.representinggenes.org/Glossary.html 8] Pallavi).

'''tRNA splicing endonuclease''' - an enzyme that cleaves intervening sequences of precursor tRNA. [http://cancerweb.ncl.ac.uk/cgi-bin/omd?splicing+endonuclease] (Peter) 

== U ==

== V ==
'''Vertical gene transfer'''-the transmission or absorption of genetic material that is associated with sexual reproduction and, thus, acknowledges species-specific boundaries ([http://www.gmo-compass.org/eng/glossary/#G 6] Pallavi)

== W ==

'''whole genome shotgun sequencing''' - a method of sequencing where DNA is cut into small pieces and cloned into vectors, then both ends of every vector are sequenced in about 500 bps to form mate pairs. Mate pairs rarely overlap, but are used to reassemble the sequence using software. [http://en.wikipedia.org/wiki/Whole_genome_shotgun](Samantha)
 

== X ==
'''xenolog''' - homologs that are created by horizontal gene transfer between two different species [http://en.wikipedia.org/wiki/Xenolog#Xenology] (Matt) 

== Y ==

== Z ==

 
<HR>
<HR>

== This is a list of the student-created tutorials: ==

Halorhabdus utahensis Genome

2008-10-07T14:38:15Z

SaSimpson: /* My favorite genes */

This page will be used by Davidson College students in the [http://www.bio.davidson.edu/Courses/Bio343/LabMethods.html Genomics Laboratory course].

== Links to Multiple Databases ==
*[http://imgweb.jgi-psf.org/cgi-bin/img_edu_v260/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2500575004 JGI IMG EDU] public access
**[[Media:JGIAnnotation.xls|JGI Annotation Excel Spreadsheet]]
*[http://www.tigr.org/tigr-scripts/prok_manatee/shared/login.cgi Manatee at JCVI] use the davidson number sent by email as username and password (database is nthu01 - this is case sensitive)
**[[Media:ManateeAnnotation.xls|Manatee Annotation Excel Spreadsheet]]
*[http://rast.nmpdr.org/ SEED view via RAST] use the username and password combination sent to you by SEED
**[[Media:RastAnnotation.xls|RAST Annotation Excel Spreadsheet]]
**[http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=18261238 RAST Publication in PubMed]
*[http://www.genome.jp/kegg/kaas/ KEGG] We can submit our genes to KEGG to have it mapped out, but SEED and Manatee may already do this. Do we want to ask them to upload it into their database? 
*[http://gcat.davidson.edu/Registry/compare/ Pairwise comparisons of All Three Annotations]

*[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI_5contigs.txt JGI Full genome, 5 separate contigs & 3.1 Mbp, FASTA]
*[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.txt JGI gene DNA sequences, FASTA]
*[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.xls JGI gene annotations, Excel]
*[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_proteins.txt JGI protein sequences, FASTA]
*[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_merged.txt JCVI Full genome, 5 contigs fused, FASTA]
*[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_ORFs.txt JCVI gene sequences, FASTA]
*[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_proteins.txt JCVI protein sequences, FASTA]
*[http://www.bio.davidson.edu/Courses/Bio343/sequences/GeneLengths.xls 3-way comparison, Excel]
*[[Venn_diagrams]] Venn diagram of 3-way comparison

 

== RNA Genes ==

*[[tRNA Genes Check List]] 
*[[rRNA operon]] 
*[[2 misc. RNA genes]] (short summary list) 
*[[Missing tRNA-trp gene found]] 

== Other Resources ==
*[[Consensus Shine Dalgarno]] Excel File for ''H. utahensis'' 
*[[References]] 
*[[Gene Annotation Template]] 
*[[General Questions]] 
*[[Page for Annotated Genes]] 

== Tutorials for Annotating Genomes ==

# Will DeLoache- [http://www.bio.davidson.edu/courses/genomics/2008/DeLoache/BioPerlTutorial/BioPerl.htm BioPerl Installation] 
# Max Win- [http://www.bio.davidson.edu/courses/genomics/2008/Win/perl/ Introduction to Perl for non-programmers (with step-by-step explanations, simple exercises, and solutions)] 

# Pallavi-Conserved Domains Database (CDD) [[Media:CDDtutorial.doc]] 
# Mary- Protein Data Bank (PDB) [[Media:PDB Tutorial.doc]] 
# Laura Voss - Pfam Database 
# Samantha Simpson - [http://www.bio.davidson.edu/courses/genomics/2008/Simpson/Tutorial.html NCBI BLAST] 
# Peter Bakke - [[Media:ShineDalgarnoTutorial.doc]] 
# Jay McNair - [http://www.bio.davidson.edu/courses/genomics/2008/McNair/OriginTutorial.doc Origin of Replication Tutorial] 
# Nick Carney - Navigating the JGI Database [[Media:NavigatingJGItutorial.doc]] 
# Matt Lotz - SEED Viewer - [[Media:SEEDTutorial.doc]] 

== Research Questions ==
#How do the three systems compare for finding ORFs and RNA genes?
#Is there a pattern of missed genes for any of the 3 sites?
#Do the three systems differ in their ability to find good start codons and Shine-Dalgarno sequences? [We need a standard set of genes for comparison. Only highly conserved or a range of genes?]
# Were Shine-Dalgarno sequences calculated for our species or default values used? If default, what sequence?
#Can we fill any holes in their automated annotation? Is there a mechanism for users to add in genes?
#How do the 3 sites compare for ease of use?
#What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working?
#How does each of the 3 sites compare for pathway detection and visualization?
#Do they find the origin of replication? Can we find it?

* How do the 3 systems compare when one system calls a gene hypothetical and another calls it a functional protein? How can they vary, and which one is getting it closer to correct (however you define that, possibly by date of matched entry)? (Pallavi and Mary)
* Why did one system call a gene when the other two did not? (Matt and Laura)
* How do the 3 sites compare for ease of use? What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working? (Samantha and Nick)
* Where is the origin of replication and did the 3 systems attempt to identify this?
* Did the 3 systems utilize Shine-Dalgarno sequences to help them call start codons? Did they utilize our species's consensus Shine-Dalgarno? (Peter)
* We need to fill in the [[Venn diagrams]] for our 3-way comparison. Let's compare the size of ORFs and generate a [[Gene Length Histograms|graph comparing the distributions]] for all 3. (Max and Will - they also take requests).
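
For the gene-length comparison in the last item, the binning step might be sketched in Python as follows. The bin size, the function name, and the example lengths are hypothetical; the real lengths would come from the 3-way comparison spreadsheet.

```python
from collections import Counter


def length_histogram(lengths, bin_size=300):
    """Count genes per length bin; keys are the lower edge of each bin in bp."""
    counts = Counter((n // bin_size) * bin_size for n in lengths)
    return dict(sorted(counts.items()))


# Made-up ORF lengths (bp) standing in for the three annotation sets.
annotations = {
    "JGI": [100, 250, 300, 899, 900],
    "JCVI": [120, 310, 320, 950],
    "RAST": [290, 305, 610, 615, 905],
}

for system, lengths in annotations.items():
    print(system, length_histogram(lengths))
```

Using the same bin edges for all three systems keeps the distributions directly comparable.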

<hr>
== My favorite genes==
Pallavi-Monooxygenase vs. Peroxiredoxin

Mary- JGI gene 2500588521 (922976...924046)

Max - JGI gene 2500587636 (2-1849)

Samantha - JGI gene (80504-80878)

== This is a list of glossary words (A - Z): ==
[[#A| A ]] [[#B| B ]] [[#C| C ]] [[#D| D ]] [[#E| E ]] [[#F| F ]] [[#G| G ]] [[#H| H ]] [[#I| I ]] [[#J| J ]] [[#K| K ]] [[#L| L ]] [[#M| M ]] [[#N| N ]] [[#O| O ]] [[#P| P ]] [[#Q| Q ]] [[#R| R ]] [[#S| S ]] [[#T| T ]] [[#U| U ]] [[#V| V ]] [[#W| W ]] [[#X| X ]] [[#Y| Y ]] [[#Z| Z ]]

== A ==
'''Accession Number''' - a unique identifier given to DNA and protein sequences to allow for tracking of sequence information within a single database [http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)] (Will).

'''Antisense (RNA or DNA)'''-a piece of DNA or RNA that binds to a complementary sequence of DNA or RNA. These segments of genetic material can be used to identify the existence of a disease gene and they can also be used to bind to specific DNA or mRNA sequences to inhibit their function ([http://biotech.fyicenter.com/glossary/Bioinformatics_Glossary.html 5] Pallavi).

'''Arabidopsis thaliana''' - the scientific name for the thale cress plant; it was the first plant to have its genome sequenced, and is a model organism for understanding plant biology and genetics ([http://en.wikipedia.org/wiki/Thale_cress Wikipedia.org], Jay)

== B ==
'''BAC''' - bacterial artificial chromosome, a DNA construct used for transforming or cloning segments of DNA and often used to sequence the genetic code of organisms ([http://en.wikipedia.org/wiki/Bacterial_artificial_chromosome Wikipedia.org], Jay)

'''bioinformatics''' - the multi-disciplinary approach of using biology, computer science and mathematics to solve or better understand biological problems [http://en.wikipedia.org/wiki/Bioinformatics] (Matt)

'''BLAST''' - (Basic Local Alignment Search Tool) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. [http://blast.ncbi.nlm.nih.gov/Blast.cgi] (Mary)

'''bioperl'''- a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications such as accessing sequence data from local and remote databases, converting between database formats, manipulating individual sequences, searching for similar sequences, searching for genes and other structures on genomic DNA, or developing machine-readable sequence annotations. [http://en.wikipedia.org/wiki/BioPerl] (Wikipedia, Max Win)

== C ==
'''carbon fixation''' - using carbon dioxide to create organic materials [http://en.wikipedia.org/wiki/Carbon_fixation] (Samantha) 

'''CDD''' (Conserved Domains Database)- a database used to identify the conserved domains present in a protein query sequence [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml] (Mary)

'''chaperonin''' - a protein complex that assists some newly formed polypeptide chains by folding them into their final, functional, three-dimensional form [http://en.wikipedia.org/wiki/Chaperonins] (Matt)

'''chemotaxis''' - the process in which cells seek out or flee from a high concentration of certain chemicals; it is found in both uni- and multicellular organisms. This process is used to avoid toxins or find food in unicellular organisms, or for tasks such as reproduction in multicellular organisms [http://en.wikipedia.org/wiki/Chemotaxis] (Nick)

'''chemotaxonomy''' - the attempt to classify and identify organisms according to demonstrable differences and similarities in their biochemical compositions [http://en.wikipedia.org/wiki/Chemotaxonomy] (Mary)

'''ClustalW''' - A web-based or command line tool that performs multiple sequence alignments to determine evolutionary relationships between three or more sequences [http://en.wikipedia.org/wiki/Clustal] (Will).

'''COG''' (Cluster of Orthologous Groups)- corresponds to a highly conserved domain and generally consists of either individual proteins or groups of paralogs ([http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml COG] Pallavi) 

'''concatemer''' - long continuous DNA molecule that contains the same DNA sequence repeated in series [http://en.wikipedia.org/wiki/Concatemer](Samantha) 

'''contigs''' (contiguous DNA)- overlapping DNA segments that as a collection form a longer and gapless segment of DNA. (Discovery Genomics, Proteomics and Bioinformatics [http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''coverage''' - refers to the number of times, on average, any piece of DNA in a sequenced genome has been individually sequenced (Lecture, Jay)
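
Average coverage is usually estimated as (number of reads × read length) / genome size. A tiny Python sketch, in which the 3.1 Mbp genome size and the roughly 500 bp read length echo figures used elsewhere on this page, while the read count is a made-up value:

```python
def average_coverage(num_reads, read_length, genome_size):
    """Average number of times each base is expected to be sequenced."""
    return num_reads * read_length / genome_size


# 62,000 reads is a hypothetical number chosen to give a round answer.
print(average_coverage(num_reads=62_000, read_length=500, genome_size=3_100_000))
```

This is only an average; individual regions can be covered much more or much less often.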

'''CPAN (Comprehensive Perl Archive Network)''' - an archive of over 12,200 modules of software written in Perl, as well as documentation for it. It contains a module called CPAN (or CPAN.pm) which is used as an installer for Perl modules such as BioPerl [http://en.wikipedia.org/wiki/CPAN](Will).

'''Cytogenetics'''-the study of normal and abnormal chromosomes. This involves studying the causes of chromosomal abnormalities and looking at the structure of chromosomes ([http://www.vivo.colostate.edu/hbooks/genetics/medgen/chromo/index.html 7] Pallavi).

== D ==
'''''de novo'' synthesis''' - the synthesis of complex molecules from simple molecules (e.g. sugars and nucleotides), rather than from recycled molecules; from the Latin for "anew" [http://en.wikipedia.org/wiki/De_novo_synthesis] (Matt)

'''dehydrogenase''' - a type of enzyme that oxidizes a substrate by transferring one or more protons and a pair of electrons to an acceptor. [http://en.wikipedia.org/wiki/Dehydrogenase] (Peter)

'''diatom''' - a major group of eukaryotic algae, and one of the most common types of phytoplankton. A characteristic feature of diatom cells is that they are encased within a unique cell wall made of silica called a frustule. These frustules show a wide diversity in form, but usually consist of two asymmetrical sides with a split between them. [http://en.wikipedia.org/wiki/Diatom] (Mary)

'''domain (protein)''' - the structural and functional groups of a protein, which can exist independently of the protein itself. Domains typically perform a specific function, such as binding to promoters or substrates, and many proteins can have one or several domains in common. Evolutionarily-linked proteins are more likely to have domains in common. Domains are used to organize proteins into families. ([http://en.wikipedia.org/wiki/Domain_(protein) Wikipedia article], Laura)

'''dot plot'''-graphical display comparing sequence conservation between two genomes with dots indicating strings of identical bases. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
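
A minimal Python sketch of how the dots could be computed: every position pair where a short word is identical in both sequences becomes one dot. The word length of 3 and the function name are assumptions for illustration; whole-genome comparisons use much longer words.

```python
def dot_plot(seq_a, seq_b, word=3):
    """Sorted (i, j) pairs where seq_a[i:i+word] equals seq_b[j:j+word]."""
    # Index every word of seq_b by its start positions.
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)
    dots = set()
    for i in range(len(seq_a) - word + 1):
        for j in index.get(seq_a[i:i + word], []):
            dots.add((i, j))
    return sorted(dots)


# Comparing a sequence with itself gives the main diagonal.
print(dot_plot("ACGTAC", "ACGTAC"))
```

Plotting the pairs with i on one axis and j on the other shows conserved regions as diagonal lines.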

== E ==

'''EC number''' (Enzyme Commission Number)- a numerical classification scheme for enzymes, based on the chemical reactions they catalyze [http://en.wikipedia.org/wiki/EC_number] (Mary)

'''E-value''' (Expect value)- When performing a BLAST search, you will obtain an E-value for each sequence that is retrieved. An E-value can be thought of as the number of matches with an equal or better score that you would expect to find by chance in a database of that size; the lower the E-value, the more significant the match. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
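
BLAST statistics relate the E-value to the bit score S' by E = m × n × 2^(-S'), where m is the query length and n is the total database length. A short Python sketch of that relationship; it ignores the effective-length corrections that BLAST applies, and the example numbers are made up:

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least `bit_score`."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: a 300-residue query against 10 million residues.
for score in (30, 50, 100):
    print(score, expect_value(score, 300, 10_000_000))
```

Each additional bit of score halves the E-value, which is why high-scoring hits have E-values very close to zero.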

'''Extremophile''' - an organism that thrives in and may even require physically or geochemically extreme conditions that are detrimental to the majority of life on Earth [http://en.wikipedia.org/wiki/Extremophile] (Will).

== F ==

'''FASTA format''' - a text format used to convey either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented by single-letter codes. The sequence name and other descriptors precede the sequence on a header line that begins with ">". [http://en.wikipedia.org/wiki/FASTA_format] (Nick) 
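
A minimal Python sketch of reading FASTA-formatted text, relying on the convention that each record starts with a header line beginning with ">". The function name and the sample records are hypothetical; for real work a library such as BioPerl or Biopython is the safer choice.

```python
def parse_fasta(text):
    """Return a dict mapping each sequence name to its full sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The name is the first word of the header; the rest is description.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    # Sequences may be wrapped over several lines, so join the pieces.
    return {n: "".join(parts) for n, parts in records.items()}


sample = """>gene1 hypothetical protein
ATGAAA
TTTTAG
>gene2
ATGCCC
"""
print(parse_fasta(sample))
```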

'''family (protein)''' - a group of evolutionarily-related proteins, often with one or several domains in common. Families are organized by domain overlap, structural/functional similarity, and sequence similarity. ([http://en.wikipedia.org/wiki/Protein_family Wikipedia article] and lecture, Laura)

'''finished genome''' - a genome that has been sequenced at least partly by hand, resulting in at least 99.99% sequence accuracy (Lecture, Jay) 

'''fusion mRNA'''-mRNA that results from the transcription of a gene after a chromosomal translocation event. This results in an mRNA sequence that comes from two different genes (Rowley and Blumenthal 2008 ''Science'' Pallavi)

== G ==

'''GC Content''' - the percentage of bases within a certain sequence of DNA (e.g. a gene or a genome) that are either guanine or cytosine; a higher GC content is characteristic of a coding region of a gene; differences in GC content between a gene and a genome can be used as evidence for horizontal gene transfer [http://en.wikipedia.org/wiki/GC-content] (Matt) 
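
A short Python sketch of the calculation. The function name and the example sequences are made up, and ambiguous bases such as N are simply counted as non-GC here, which is a simplifying assumption.

```python
def gc_content(seq):
    """Percentage of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


# Comparing a gene with its genome, as in the horizontal-transfer argument.
gene = "ATGGCGCCGGGCTAA"
genome_sample = "ATGAAATTTACGTTAGCATTA"
print(round(gc_content(gene), 1), round(gc_content(genome_sample), 1))
```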

'''GC-skew''' – uneven distribution of guanine and cytosine bases between the two strands of DNA where GC base pairs occur. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
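
GC-skew is commonly quantified along one strand as (G - C) / (G + C), computed window by window. A minimal Python sketch; the formula is the usual convention rather than something stated in the textbook entry above, and the window size is an arbitrary choice.

```python
def gc_skew(seq):
    """(G - C) / (G + C) for one stretch of a single strand."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    if g + c == 0:
        return 0.0
    return (g - c) / (g + c)


def gc_skew_windows(seq, window):
    """GC-skew of consecutive, non-overlapping windows along the sequence."""
    return [
        gc_skew(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, window)
    ]


print(gc_skew_windows("GGGGCCCC", 4))
```

In many bacterial genomes the sign of the skew changes near the origin and terminus of replication, which is why this measure comes up when searching for an origin.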

'''gene amplification''' - production of multiple copies of a gene in order to amplify the amount of protein that the gene encodes for [http://www.medterms.com/script/main/art.asp?articlekey=13537] [http://www.answers.com/topic/gene-amplification] (Matt)

'''gene fusion'''-occurs when DNA segments of two different genes come together. Can result in hybrid proteins ([http://www.biochem.northwestern.edu/holmgren/Glossary/Definitions/Def-G/gene_fusion.html 9] Pallavi)

'''gene knockout''' - a process in which a gene is deactivated within a test organism in order to better understand the function of the gene in that organism [http://en.wikipedia.org/wiki/Gene_knockout] (Matt)

'''gene ontology'''- a collaborative effort of investigators to unify and standardize terms associated with the role a gene or protein plays in an organism. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''glaucophyte''' - freshwater algae that have not been studied well [http://en.wikipedia.org/wiki/Glaucophyte](Samantha) 

== H ==

'''haemolysin or hemolysin''' - a substance produced by bacteria that causes lysis of red blood cells [http://en.wikipedia.org/wiki/Hemolysis_(microbiology)] (Nick)

'''halophile''' - an organism, most often of the Archaea domain, that lives in environments containing high concentrations of salt [http://en.wikipedia.org/wiki/Halophile] (Matt)

'''haplotype'''-collection of alleles that travel together (Lecture, Pallavi)

'''haptophyte''' - phylum of algae [http://en.wikipedia.org/wiki/Haptophyte](Samantha)

'''heterokont''' - major line of eukaryotes consisting of about 10,500 known species, most of which are algae [http://en.wikipedia.org/wiki/Heterokont](Samantha)

'''Hidden Markov Model''' - a statistical model used in protein recognition databases such as Pfam. A Hidden Markov Model keeps track of several variables and possible variations thereof, such as the possible amino acid sequences that make up a protein domain (since there can be some variance in an amino acid sequence) or the variations in the component sounds that make up a word, and uses those points to match a given sequence to the word, domain, or other complex sequence it most closely matches. An HMM in speech recognition software, for example, can identify that a certain set of sounds make up a certain word, even with the variations in pronunciation and accent that different people will give those sounds. ([http://en.wikipedia.org/wiki/Hidden_Markov_Model Wikipedia] and lecture, Laura)

'''HMM Logo''' - a graphical representation of an HMM, detailing the possible amino acid sequences, the relative frequencies and probabilities of each amino acid in the sequence, the relative contribution each amino acid has to the overall protein family, and the charge or nature of the amino acids themselves. ([http://www.sanger.ac.uk/Software/analysis/logomat-m/help.shtml How to read HMM Logos, on Pfam], Laura)

'''homeobox''' - DNA sequence within transcription factor genes that allow the cell to respond to patterns of development by having the transcription factors switch on gene cascades [http://en.wikipedia.org/wiki/Homeobox](Samantha)

'''homodimer''' - a protein made of paired identical polypeptides ([http://www.answers.com/topic/homodimer Answers.com], Jay)

'''horizontal gene transfer'''-DNA transmission between species and incorporation of the DNA into the recipient's genome ([http://www.csrees.usda.gov/nea/biotech/res/biotechnology_res_glossary.html horizontal gene transfer] Pallavi)

'''''Hox'' gene'''-a gene that contains a homeobox region that is involved in morphogenesis along the cranio-caudal body axis ([http://www.uprightape.net/UA_Glossary.html 4] Pallavi)

'''hydrolase''' - an enzyme that catalyzes hydrolysis, the cleavage of a chemical bond by the addition of water [http://en.wikipedia.org/wiki/Hydrolase] (Nick)

== I ==

'''ideogram''' - in genomics, usually describes a stylized representation of a chromosome with banding patterns (Campbell-Heyer Genomics textbook, Jay)

'''identities''' - in a BLAST output, the number and fraction of total residues which are identical in a given alignment [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)
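
As a small illustration, the identities figure for a pairwise alignment can be counted in Python like this. The aligned strings are hypothetical, and "-" is assumed to mark a gap.

```python
def identities(aligned_a, aligned_b):
    """Return (identical positions, alignment length) for two aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return matches, len(aligned_a)


matches, length = identities("ACGT-A", "ACCTTA")
print(f"Identities = {matches}/{length} ({100 * matches // length}%)")
```

The denominator is the full alignment length, gaps included, which is how the fraction is reported in BLAST output.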

'''indole'''-a chemical compound that is produced from the break down of tryptophan ([http://medical-dictionary.thefreedictionary.com/indole indole] Pallavi)

'''inclusion body''' - Inclusion bodies are collections of stainable substances, usually proteins, that are found either in the nucleus or the cytoplasm. It is thought that these bodies are often the result of viral proteins that misfolded [http://en.wikipedia.org/wiki/Inclusion_body] (Nick)

'''intron''' - a region of DNA in a gene that is not part of the final coding sequence for the protein. [http://en.wikipedia.org/wiki/Intron] (Peter)

'''isoelectric point''' - the pH at which a molecule carries no net electrical charge [http://en.wikipedia.org/wiki/Isoelectric_point] (Nick)

'''isozymes''' - members of a gene family with very similar cellular roles (Campbell-Heyer Genomics textbook, Jay)

== J ==

== K ==
'''KEGG (Kyoto Encyclopedia of Genes and Genomes)''' - a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The Pathway database records networks of molecular interactions in the cells, and variants of them specific to particular organisms [http://en.wikipedia.org/wiki/KEGG](Will).

'''kinase''' - a type of enzyme that transfers a phosphate group from a high-energy donor molecule to a target molecule in a process called phosphorylation. [http://en.wikipedia.org/wiki/Kinase] (Peter)

== L ==

== M ==
'''Manatee''' - a web-based gene evaluation and genome annotation tool that can view, modify, and store annotation for prokaryotic and eukaryotic genomes. This ongoing, open-source initiative was developed with two missions: first, to give biologists the ability to functionally annotate their genomes using a powerful, stand-alone web application with a robustly designed relational annotation database; and second, to give outside developers the opportunity to contribute their own ideas and requirements to enhance Manatee's ability to accomplish biological goals [http://manatee.sourceforge.net/](Will). 

'''microsatellites'''-stretches of repetitive, short DNA segments that can be used to track the inheritance of certain traits within families ([http://www.clanlindsay.com/genetic_dna_glossary.htm 3] Pallavi)

'''minisatellites'''-segments of DNA that can be used for individual identification (ex. DNA fingerprinting) or in determining relationships between people (ex. paternity cases) ([http://www.clanlindsay.com/genetic_dna_glossary.htm 2] Pallavi).

'''motif''' - a sequence of amino acids or nucleotides that performs a particular role and is often conserved in other species or molecules. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''mycoplasma''' - genus of bacteria that lack a cell wall [http://en.wikipedia.org/wiki/Mycoplasma] (Nick)

== N ==

'''NORFs''' (nonannotated open reading frames) - open reading frames that were considered not to be real genes when the genome was annotated. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''nucleomorph''' - reduced eukaryotic nuclei found in plastids [http://en.wikipedia.org/wiki/Nucleomorph](Samantha)

== O ==
'''object-oriented programming''' - a programming paradigm in which collections of data, associated with operations on that data, are modularly defined and then built upon (CSC 121 Lecture, Will).

'''open reading frame (ORF)'''-a segment of DNA that can potentially encode a protein; it begins with a start codon (usually ATG) and continues in the same reading frame to a stop codon [http://www.fao.org/DOCREP/003/X3910E/X3910E18.htm ORF] (Pallavi)
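
A minimal Python sketch of an ORF search on the forward strand. It assumes ATG is the only start codon and uses the three standard stop codons; both are simplifications (prokaryotes also use alternative starts such as GTG and TTG, and a real search would scan the reverse strand too). The function name, minimum length, and example sequence are made up.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) pairs of forward-strand ORFs.

    Coordinates are 0-based and half-open, and `end` includes the stop
    codon. `min_codons` counts codons from the start codon up to, but
    not including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next ORF
    return sorted(orfs)


seq = "CCATGAAATTTTAGGG"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])
```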

'''operon''' - a segment of DNA involving an operator, promoter, and one or more genes that operate as a single unit during transcription [http://en.wikipedia.org/wiki/Operon] (Nick)

'''optical mapping'''-DNA sequences of the organism in question are compared against a karyotype that specifically looks at restriction sites found within the DNA to correctly order the DNA sequences on a chromosome. This methodology gives very detailed haplotype information and allows for the detection of sequence variations across an entire genome [http://www.geocities.com/bioinformaticsweb/genomicglossary.html optical mapping] (Pallavi)

'''ortholog'''-genes in different species that look very similar because they descend from a single gene in a common ancestor and diverged through speciation (Lecture, Pallavi)

'''oxidoreductase''' - an enzyme that catalyzes redox reactions by transferring electrons from one molecule (the reductant) to another (the oxidant) [http://en.wikipedia.org/wiki/Oxidoreductase] (Nick)

== P ==

'''paralog'''-similar DNA sequences within a species that arose from duplication of an ancestral gene (Lecture, Pallavi)

'''p-arm''' - the shorter arm of a chromosome's two arms separated by the centromere (compare to q-arm, the longer arm) ([http://www.medterms.com/script/main/art.asp?articlekey=4715 MedTerms Dictionary], Jay)

'''Perl''' - Developed by Larry Wall in 1987, Perl is a [http://en.wikipedia.org/wiki/High-level_programming_language high-level programming language] used frequently by biologists and bioinformaticists [http://en.wikipedia.org/wiki/Perl] (Will).

'''periplasmic space''' - the space between the inner cytoplasmic membrane and external outer membrane in bacteria or archaea. [http://en.wikipedia.org/wiki/Periplasmic_space] (Peter)

'''Pfam''' - a database for protein domain families that matches amino acid sequences or nucleotide sequences to the related group of proteins to which they most likely belong. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura)

'''plasmid''' - an extra-chromosomal DNA molecule that is capable of replicating independently of the chromosomal DNA. Commonly found in bacteria and archaea. [http://en.wikipedia.org/wiki/Plasmid](Peter)

'''plastid''' - major organelles in plants or algae [http://en.wikipedia.org/wiki/Plastid](Samantha)

'''pleomorphism''' - the occurrence of two or more structural forms during a life cycle [http://en.wikipedia.org/wiki/Pleomorphism] (Mary)

'''phylogenetic tree''' - a diagram showing the evolutionary relationships between biological species that are thought to share a common ancestor [http://en.wikipedia.org/wiki/Phylogenetic_tree] (Nick)

'''phylotypes''' – a term intended to resolve the challenge of “species” when classifying prokaryotes using DNA sequence comparisons. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''positives''' - in a BLAST output, the number and fraction of residues for which the alignment scores have positive rather than negative values [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)

'''proteome''' - entire set of proteins expressed by a genome, cell, tissue, or organism. It may refer to expressed proteins under certain conditions [http://en.wikipedia.org/wiki/Proteome](Samantha)

'''PSORT''' - a prediction server that judges where a mature protein could be in the cell, based on its transmembrane domains, its predicted mature amino acid composition, and its signal sequences. ([http://psort.ims.u-tokyo.ac.jp/form.html PSORT], Laura)

'''pseudogenes''' - DNA sequences that resemble genes but are no longer functional, often because they contain premature stop codons or other disabling mutations. A pseudogene may have diverged from a functional gene, or a paralog may have taken over its role (Lecture, Pallavi)

'''purine''' - a category of nitrogenous base consisting of a pyrimidine ring fused to an imidazole ring. Notable purine bases are adenine and guanine. [http://en.wikipedia.org/wiki/Purine] (Peter)

'''pyrimidine''' - a category of nitrogenous base consisting of a heterocyclic aromatic ring containing two nitrogen atoms at positions 1 and 3 of the six-member ring. Notable pyrimidine bases are cytosine, thymine, and uracil. [http://en.wikipedia.org/wiki/Pyrimidine] (Peter)

== Q ==

'''q-arm''' - the longer arm of a chromosome's two arms separated by the centromere (compare to p-arm, the shorter arm) ([http://www.medterms.com/script/main/art.asp?articlekey=5152 MedTerms Dictionary], Jay) 

'''query sequence''' - the sequence (whether amino acid or nucleotide) entered into a database’s search function and checked against the database entries. ([http://en.wikipedia.org/wiki/BLAST BLAST on Wikipedia], Laura)

== R ==

'''RAST''' - (Rapid Annotation using Subsystem Technology)- a fully-automated service for annotating bacterial and archaeal genomes. It provides high quality genome annotations for these genomes across the whole phylogenetic tree. ([http://rast.nmpdr.org/], Max Win)

'''rDNA'''-These are DNA sequences that encode for ribosomal RNA. Note that rDNA can also stand for recombinant DNA. ([http://en.wikipedia.org/wiki/Ribosomal_DNA rDNA] Pallavi)

'''residue (protein)''' - the remaining portion of an amino acid after a water molecule has been removed and it has been incorporated into a protein. Functional residues, referred to in Pfam, are the residues that perform some specific identifiable function or are part of a domain, and can be conserved across evolutionarily-related proteins. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura) 

'''retropseudogenes'''-these are genes that have been reverse-transcribed from mRNA and the resulting DNA sequence is incorporated back into the genome. They are non-functional segments of DNA and can be distinguished from pseudogenes in that they do not have intron sequences. ([http://genome.cshlp.org/cgi/content/full/10/5/672 1] Pallavi)

'''retrotransposons''' - transposable elements that are transcribed into RNA, reverse-transcribed back into DNA, and inserted at a new position in the genome [http://en.wikipedia.org/wiki/Retrotransposon](Samantha)

'''ribonuclease''' - a nuclease that catalyzes the degradation of RNA into smaller components [http://en.wikipedia.org/wiki/Ribonuclease] (Mary)

== S ==
'''Serovar'''-a subdivision of a species based on the characteristics of their cell surface antigens ([http://www.biology-online.org/dictionary/Serovar serovar] Pallavi)

'''scaffold''' - a section of a sequenced genome composed of contigs that are in the right order but not necessarily connected ([http://www.medterms.com/script/main/art.asp?articlekey=25223 MedTerms Dictionary], Jay)

'''"Shadow enhancers"'''-secondary enhancers that are thought to be important for natural selection to occur in regulatory DNA segments. They evolve much faster than primary enhancers, which suggests that they are under fewer functional constraints (Wray and Babbit 2008 ''Science'' Pallavi)

'''Shine-Dalgarno sequence''' - A ribosomal binding site on an mRNA, usually a sequence of about six bases located six or seven bases upstream of the start codon. An anti-Shine-Dalgarno sequence exists on the rRNA in the small subunit of the ribosome; when the two sequences pair, the mRNA is lined up and prepared for translation. (Lecture and [http://en.wikipedia.org/wiki/Shine-dalgarno Wikipedia article], Laura) 
Note: The Shine-Dalgarno consensus sequence for our genome is ccGGAGGt.
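
As a rough illustration of how a Shine-Dalgarno site might be located, the sketch below looks for the core of the consensus given above (GGAGG) in a window upstream of a known start codon and reports the spacing. The function name <code>find_sd</code>, the exact-match search, and the 20-base window are assumptions made for this example; real tools usually score imperfect matches as well.

```python
def find_sd(seq, start_pos, motif="GGAGG", window=20):
    """Return the number of bases between the motif and the start codon.

    seq       : DNA sequence of the coding strand
    start_pos : 0-based index of the first base of the start codon
    Returns None when the motif is not found in the upstream window.
    """
    seq = seq.upper()
    lo = max(0, start_pos - window)
    upstream = seq[lo:start_pos]
    idx = upstream.rfind(motif)  # closest occurrence to the start codon
    if idx == -1:
        return None
    return len(upstream) - (idx + len(motif))


# Hypothetical test sequence: motif, 7-base spacer, then ATG.
example = "TTCC" + "GGAGG" + "TAACAAA" + "ATGAAA"
print(find_sd(example, example.index("ATG")))
```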

'''SignalP''' - a prediction server that judges whether or not a query protein contains a signal peptide. SignalP measures each amino acid against the amino acid sequences of probable signal peptide matches and predicts the cleavage site of the signal peptide. ([http://www.cbs.dtu.dk/services/SignalP-3.0/output.php SignalP Output explained], Laura)

'''signal peptide''' - a short peptide chain that directs the post-translational transport of a protein [http://en.wikipedia.org/wiki/Signal_peptide] (Matt)

'''Smith-Waterman alignment''' - A well-known algorithm for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure [http://en.wikipedia.org/wiki/Smith_waterman](Will).
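
A compact sketch of the Smith-Waterman recurrence may help make the definition concrete. It assumes a simple scoring scheme (match +2, mismatch -1, linear gap penalty -2) chosen only for illustration; production tools use substitution matrices and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # scoring matrix, first row/column stay 0
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero: this is what makes the alignment local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("GGGACGTCCC", "TTACGTAA"))
```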

'''SNP (Single Nucleotide Polymorphism)''' - a DNA sequence variation occurring when a single nucleotide in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual) [http://en.wikipedia.org/wiki/Single_nucleotide_polymorphism](Will).

'''symporter''' - an integral membrane protein that moves two or more different molecules or ions across a phospholipid membrane in the same direction. [http://en.wikipedia.org/wiki/Symporter] (Peter)

'''synteny''' - a neologism from the Greek for "on the same ribbon". Genes that are syntenic in one species are on the same chromosome; genes that are syntenic across species retain the same order on respective chromosomes as a result of descent from a common ancestor ([http://www.answers.com/synteny Answers.com], Jay)

'''synthetase''' - a type of enzyme that creates a new covalent bond and requires direct input of energy from a high-energy phosphate. [http://books.google.com/books?id=bB8XnCykRmIC&pg=PA522&lpg=PA522&dq=%22synthetase+is+an+enzyme%22&source=web&ots=wkws4ksMsg&sig=zWLkDIk7T78hcf9S84nWs3u5Apw&hl=en&sa=X&oi=book_result&resnum=9&ct=result] (Peter)

== T ==
'''transferase''' - an enzyme that catalyzes the transfer of a functional group from one molecule (the donor) to another (the acceptor) [http://en.wikipedia.org/wiki/Transferase] (Matt)

'''transmembrane helix''' - a single transmembrane alpha helix of a transmembrane protein, usually about twenty amino acids in length. They are usually predicted by hydrophobicity. [http://en.wikipedia.org/wiki/Transmembrane_domain](Mary)

'''transposons / transposable elements''' - DNA sequences that can move around to different positions in a single cell's genome. Transposons can cause mutations and change the length of the genome. [http://en.wikipedia.org/wiki/Transposon](Samantha)

'''Transposon Mutagenesis'''-a procedure in which a transposon is inserted into a gene, which inactivates the gene and can lead to the discovery of the phenotype associated with this gene ([http://cancerweb.ncl.ac.uk/cgi-bin/omd?transposon+mutagenesis transposon mutagenesis] Pallavi)

'''Trans-splicing''' - a process in which exons from two separate RNA transcripts are joined to form a single mature mRNA. This process results in fusion mRNA ([http://www.representinggenes.org/Glossary.html 8] Pallavi).

'''tRNA splicing endonuclease''' - an enzyme that cleaves intervening sequences of precursor tRNA. [http://cancerweb.ncl.ac.uk/cgi-bin/omd?splicing+endonuclease] (Peter) 

== U ==

== V ==
'''Vertical gene transfer''' - the transmission of genetic material from parent to offspring through reproduction, which therefore respects species-specific boundaries (compare to horizontal gene transfer) ([http://www.gmo-compass.org/eng/glossary/#G 6] Pallavi)

== W ==

'''whole genome shotgun sequencing''' - a method of sequencing where DNA is cut into small pieces and cloned into vectors, then both ends of every vector are sequenced in about 500 bps to form mate pairs. Mate pairs rarely overlap, but are used to reassemble the sequence using software. [http://en.wikipedia.org/wiki/Whole_genome_shotgun](Samantha)
 

== X ==
'''xenolog''' - homologs that are created by horizontal gene transfer between two different species [http://en.wikipedia.org/wiki/Xenolog#Xenology] (Matt) 

== Y ==

== Z ==

 
<HR>
<HR>

== This is a list of the student-created tutorials: ==

Halorhabdus utahensis Genome

2008-10-04T17:08:24Z

SaSimpson: /* Tutorials for Annotating Genomes */

This page will be used by Davidson College students in the [http://www.bio.davidson.edu/Courses/Bio343/LabMethods.html Genomics Laboratory course].

== Links to Multiple Databases ==
*[http://imgweb.jgi-psf.org/cgi-bin/img_edu_v260/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2500575004 JGI IMG EDU] public access *[[Media:JGIAnnotation.xls|JGI Annotation Excel Spreadsheet]]
*[http://www.tigr.org/tigr-scripts/prok_manatee/shared/login.cgi Manatee at JCVI] use the davidson number sent by email as username and password (database is nthu01 - this is case sensitive) *[[Media:ManateeAnnotation.xls|Manatee Annotation Excel Spreadsheet]]
*[http://rast.nmpdr.org/ SEED view via RAST] use the username and password combination sent to you by SEED *[[Media:RastAnnotation.xls|RAST Annotation Excel Spreadsheet]] *[http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=18261238 RAST Publication in PubMed]
*[http://www.genome.jp/kegg/kaas/ KEGG] We can submit our genes to KEGG to have it mapped out, but SEED and Manatee may already do this. Do we want to ask them to upload it into their database? 
*[http://gcat.davidson.edu/Registry/compare/ Pairwise comparisons of All Three Annotations]

[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI_5contigs.txt JGI Full genome, 5 separate contigs & 3.1 Mbp, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.txt JGI gene DNA sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.xls JGI gene annotations, Excel] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_proteins.txt JGI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_merged.txt JCVI Full genome, 5 contigs fused, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_ORFs.txt JCVI gene sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_proteins.txt JCVI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/GeneLengths.xls 3-way comparison, Excel] 
[[Venn_diagrams]] Venn diagram of 3-way comparison

 

== RNA Genes ==

*[[tRNA Genes Check List]] 
*[[rRNA operon]] 
*[[2 misc. RNA genes]] (short summary list) 
*[[Missing tRNA-trp gene found]] 

== Other Resources ==
*[[Consensus Shine Dalgarno]] Excel File for ''H. utahensis'' 
*[[References]] 
*[[Gene Annotation Template]] 
*[[General Questions]] 
*[[Page for Annotated Genes]] 

== Tutorials for Annotating Genomes ==

# Will DeLoache- BioPerl Installation 
# Max Win- Introduction to Perl for non-programmers (with step-by-step explanations, simple exercises and solutions) 

# Pallavi-Conserved Domains Database (CDD) 
# Mary- Protein Data Bank 
# Laura Voss - Pfam Database 
# Samantha Simpson - [http://www.bio.davidson.edu/courses/genomics/2008/Simpson/Tutorial.html NCBI BLAST] 
# Peter Bakke - Finding species-specific Shine-Dalgarno sequence 
# Jay McNair - How to determine the origin of replication 
# Nick Carney - JGI Database 

== Research Questions ==
#How do the three systems compare for finding ORFs and RNA genes?
#Is there a pattern of missed genes for any of the 3 sites?
#Do the three systems differ in their ability to find good start codons and Shine-Dalgarno sequences? [We need a standard set of genes for comparison. Only highly conserved or a range of genes?]
# Were Shine-Dalgarno sequences calculated for our species or default values used? If default, what sequence?
#Can we fill any holes in their automated annotation? Is there a mechanism for users to add in genes?
#How do the 3 sites compare for ease of use?
#What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working?
#How does each of the 3 sites compare for pathway detection and visualization?
#Do they find the origin of replication? Can we find it?

* How do the 3 systems compare when one calls a gene hypothetical and another calls it a functional protein? How can they vary, and who is getting it closer to correct (however you define that, possibly by date of matched entry)? (Pallavi and Mary)
* Why did one system call a gene when the other two did not? (Matt and Lara)
* How do the 3 sites compare for ease of use? What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working? (Samantha and Nick)
* Where is the origin of replication and did the 3 systems attempt to identify this?
* Did the 3 systems utilize Shine-Dalgarno sequences to help them call start codons? Did they utilize our species' consensus Shine-Dalgarno? (Peter)
* We need to fill in the [[Venn diagrams]] for our 3-way comparison. Let's compare the size of ORFs and generate a [[Gene Length Histograms|graph comparing the distributions]] for all 3. (Max and Will - they also take requests).

<hr>

== This is a list of glossary words (A - Z): ==
[[#A| A ]] [[#B| B ]] [[#C| C ]] [[#D| D ]] [[#E| E ]] [[#F| F ]] [[#G| G ]] [[#H| H ]] [[#I| I ]] [[#J| J ]] [[#K| K ]] [[#L| L ]] [[#M| M ]] [[#N| N ]] [[#O| O ]] [[#P| P ]] [[#Q| Q ]] [[#R| R ]] [[#S| S ]] [[#T| T ]] [[#U| U ]] [[#V| V ]] [[#W| W ]] [[#X| X ]] [[#Y| Y ]] [[#Z| Z ]]

== A ==
'''Accession Number''' - a unique identifier given to DNA and protein sequences to allow for tracking of sequence information within a single database [http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)] (Will).

'''Arabidopsis thaliana''' - the scientific name for the thale cress plant; it was the first plant to have its genome sequenced, and is a model organism for understanding plant biology and genetics ([http://en.wikipedia.org/wiki/Thale_cress Wikipedia.org], Jay)

== B ==
'''BAC''' - bacterial artificial chromosome, a DNA construct used for transforming or cloning segments of DNA and often used to sequence the genetic code of organisms ([http://en.wikipedia.org/wiki/Bacterial_artificial_chromosome Wikipedia.org], Jay)

'''bioinformatics''' - the multi-disciplinary approach of using biology, computer science and mathematics to solve or better understand biological problems [http://en.wikipedia.org/wiki/Bioinformatics] (Matt)

'''BLAST''' - (Basic Local Alignment Search Tool) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. [http://blast.ncbi.nlm.nih.gov/Blast.cgi] (Mary)

'''bioperl'''- a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications such as accessing sequence data from local and remote databases, converting between database formats, manipulating individual sequences, searching for similar sequences, searching for genes and other structures on genomic DNA, or developing machine-readable sequence annotations. [http://en.wikipedia.org/wiki/BioPerl] (Wikipedia, Max Win)

== C ==
'''carbon fixation''' - using carbon dioxide to create organic materials [http://en.wikipedia.org/wiki/Carbon_fixation] (Samantha) 

'''CDD''' (Conserved Domains Database)- a database used to identify the conserved domains present in a protein query sequence [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml] (Mary)

'''chaperonin''' - a protein complex that assists some newly formed polypeptide chains by folding them into their final, functional, three-dimensional form [http://en.wikipedia.org/wiki/Chaperonins] (Matt)

'''chemotaxis''' - the process by which cells move toward or away from high concentrations of certain chemicals; it is found in both uni- and multicellular organisms. Unicellular organisms use it to find food or avoid toxins, and multicellular organisms use it for tasks such as reproduction [http://en.wikipedia.org/wiki/Chemotaxis] (Nick)

'''chemotaxonomy''' - the attempt to classify and identify organisms according to demonstrable differences and similarities in their biochemical compositions [http://en.wikipedia.org/wiki/Chemotaxonomy] (Mary)

'''ClustalW''' - A web-based or command line tool that performs multiple sequence alignments to determine evolutionary relationships between three or more sequences [http://en.wikipedia.org/wiki/Clustal] (Will).

'''COG''' (Cluster of Orthologous Groups)- corresponds to a highly conserved domain and generally consists of either individual proteins or groups of paralogs ([http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml COG] Pallavi) 

'''concatemer''' - long continuous DNA molecule that contains the same DNA sequence repeated in series [http://en.wikipedia.org/wiki/Concatemer](Samantha) 

'''contigs''' (contiguous DNA)- overlapping DNA segments that as a collection form a longer and gapless segment of DNA. (Discovery Genomics, Proteomics and Bioinformatics [http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''coverage''' - refers to the number of times, on average, any piece of DNA in a sequenced genome has been individually sequenced (Lecture, Jay)
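
Average coverage is commonly estimated as (number of reads × read length) / genome size. The sketch below applies that estimate; the read count is a made-up number for illustration, while the 3.1 Mbp genome size and roughly 500 bp read length are taken from elsewhere on this page.

```python
def average_coverage(num_reads, read_length, genome_size):
    """Estimate average sequencing coverage (fold coverage)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


# Hypothetical run: 62,000 reads of 500 bp over a 3.1 Mbp genome.
print(average_coverage(62_000, 500, 3_100_000))
```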

'''CPAN (Comprehensive Perl Archive Network)''' - an archive of over 12,200 modules of software written in Perl, as well as documentation for it. It contains a module called CPAN (or CPAN.pm) which is used as an installer for Perl modules such as BioPerl [http://en.wikipedia.org/wiki/CPAN](Will).

== D ==
'''''de novo'' synthesis''' - the synthesis of complex molecules from simple molecules (e.g. sugars and nucleotides), rather than from recycled molecules; from the Latin "of the new" [http://en.wikipedia.org/wiki/De_novo_synthesis] (Matt)

'''dehydrogenase''' - a type of enzyme that oxidizes a substrate by transferring one or more protons and a pair of electrons to an acceptor. [http://en.wikipedia.org/wiki/Dehydrogenase] (Peter)

'''diatom''' - a major group of eukaryotic algae, and one of the most common types of phytoplankton. A characteristic feature of diatom cells is that they are encased within a unique cell wall made of silica called a frustule. These frustules show a wide diversity in form, but usually consist of two asymmetrical sides with a split between them. [http://en.wikipedia.org/wiki/Diatom] (Mary)

'''domain (protein)''' - the structural and functional groups of a protein, which can exist independently of the protein itself. Domains typically perform a specific function, such as binding to promoters or substrates, and many proteins can have one or several domains in common. Evolutionarily-linked proteins are more likely to have domains in common. Domains are used to organize proteins into families. ([http://en.wikipedia.org/wiki/Domain_(protein) Wikipedia article], Laura)

'''dot plot'''-graphical display comparing sequence conservation between two genomes with dots indicating strings of identical bases. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
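
A dot plot can be sketched in a few lines. This example assumes an exact-match word of fixed length (3 bases here, an arbitrary choice) and prints a text grid; the names <code>dot_plot</code> and <code>render</code> are illustrative only.

```python
def dot_plot(a, b, word=3):
    """Return the set of (i, j) positions where a[i:i+word] == b[j:j+word]."""
    # Index every word in b so each word in a can be looked up directly.
    index = {}
    for j in range(len(b) - word + 1):
        index.setdefault(b[j:j + word], []).append(j)
    return {(i, j)
            for i in range(len(a) - word + 1)
            for j in index.get(a[i:i + word], [])}


def render(a, b, dots):
    """Draw the plot as text: rows follow sequence a, columns follow sequence b."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for j in range(len(b)))
        for i in range(len(a))
    )


dots = dot_plot("ACGTAC", "ACGTAC")
print(render("ACGTAC", "ACGTAC", dots))
```

Identical sequences give a continuous diagonal; inversions and repeats show up as off-diagonal lines.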

== E ==

'''EC number''' (Enzyme Commission Number)- a numerical classification scheme for enzymes, based on the chemical reactions they catalyze [http://en.wikipedia.org/wiki/EC_number] (Mary)

'''E-value''' (Expect value)- When performing a BLAST search, you will obtain an E-value for each sequence that is retrieved. An E-value is the number of alignments with an equal or better score that would be expected to occur by chance in a database of that size; the closer it is to zero, the less likely the match is due to chance. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
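
For reference, BLAST statistics are conventionally based on the Karlin-Altschul formula shown here. The symbols are not from the glossary entry: ''m'' is the query length, ''n'' the total length of the database, ''S'' the raw alignment score, and ''K'' and ''λ'' are parameters that depend on the scoring system.

```latex
E = K \, m \, n \, e^{-\lambda S}
```

The formula shows why the same alignment score gives a larger E-value when searching a larger database.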

'''Extremophile''' - an organism that thrives in and may even require physically or geochemically extreme conditions that are detrimental to the majority of life on Earth [http://en.wikipedia.org/wiki/Extremophile] (Will).

== F ==

'''FASTA format''' - a text format used to convey either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented by single-letter codes. Each sequence is preceded by a header line that begins with ">" and holds the sequence name and other descriptors. [http://en.wikipedia.org/wiki/FASTA_format] (Nick) 
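
Reading a FASTA file by hand takes only a few lines. The sketch below is a minimal parser, assuming headers start with ">" and sequences may wrap over several lines; the function name <code>parse_fasta</code> is illustrative, and BioPerl or similar libraries provide more robust readers.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()  # header without the leading ">"
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence lines may be wrapped
    return {header: "".join(parts) for header, parts in records.items()}


example = ">gene1 hypothetical protein\nATGC\nGGTA\n>gene2\nTTAA\n"
print(parse_fasta(example))
```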

'''family (protein)''' - a group of evolutionarily-related proteins, often with one or several domains in common. Families are organized by domain overlap, structural/functional similarity, and sequence similarity. ([http://en.wikipedia.org/wiki/Protein_family Wikipedia article] and lecture, Laura)

'''finished genome''' - a genome that has been sequenced at least partly by hand, resulting in at least 99.99% sequence accuracy (Lecture, Jay) 

== G ==

'''GC Content''' - the percentage of bases within a certain sequence of DNA (e.g. a gene or a genome) that are either guanine or cytosine; a higher GC content is characteristic of a coding region of a gene; differences in GC content between a gene and a genome can be used as evidence for horizontal gene transfer [http://en.wikipedia.org/wiki/GC-content] (Matt) 
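
GC content is simple to compute directly from the definition. This is a minimal sketch; the function name <code>gc_content</code> is illustrative, and ambiguous bases such as N are counted in the length but not as G or C.

```python
def gc_content(seq):
    """Return the percentage of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence is empty")
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


print(gc_content("ATGC"))
```

Comparing the value for a single gene with the value for the whole genome is one way to look for the horizontal gene transfer evidence mentioned above.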

'''GC-skew''' – uneven distribution of guanine and cytosine bases between the two strands of DNA where GC base pairs occur. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
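
GC skew is usually quantified on one strand as (G - C) / (G + C) in successive windows. The sketch below uses that conventional formula, which is not stated in the entry above; the window size and the function name <code>gc_skew</code> are arbitrary choices for illustration.

```python
def gc_skew(seq, window=4):
    """Return (G - C) / (G + C) for each non-overlapping window of seq."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # A window with no G or C has no defined skew; report 0.0 for it.
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


print(gc_skew("GGGCCCCG", window=4))
```

On a real genome the window would be thousands of bases wide, and the values are often plotted along the chromosome.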

'''gene amplification''' - production of multiple copies of a gene in order to amplify the amount of protein that the gene encodes for [http://www.medterms.com/script/main/art.asp?articlekey=13537] [http://www.answers.com/topic/gene-amplification] (Matt)

'''gene knockout''' - a process in which a gene is deactivated within a test organism in order to better understand the function of the gene in that organism [http://en.wikipedia.org/wiki/Gene_knockout] (Matt)

'''gene ontology'''- a collaborative effort of investigators to unify and standardize terms associated with the role a gene or protein plays in an organism. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''glaucophyte''' - freshwater algae that have not been studied well [http://en.wikipedia.org/wiki/Glaucophyte](Samantha) 

== H ==

'''haemolysin or hemolysin''' - a substance produced by a bacterium that causes lysis of red blood cells [http://en.wikipedia.org/wiki/Hemolysis_(microbiology)] (Nick)

'''halophile''' - an organism, most often of the Archaea domain, that lives in environments containing high concentrations of salt [http://en.wikipedia.org/wiki/Halophile] (Matt)

'''haplotype'''-collection of alleles that travel together (Lecture, Pallavi)

'''haptophyte''' - phylum of algae [http://en.wikipedia.org/wiki/Haptophyte](Samantha)

'''heterokont''' - major line of eukaryotes consisting of about 10,500 known species, most of which are algae [http://en.wikipedia.org/wiki/Heterokont](Samantha)

'''Hidden Markov Model''' - a statistical model used in protein recognition databases such as Pfam. A Hidden Markov Model keeps track of several variables and possible variations thereof, such as the possible amino acid sequences that make up a protein domain (since there can be some variance in an amino acid sequence) or the variations in the component sounds that make up a word, and uses those points to match a given sequence to the word, domain, or other complex sequence it most closely matches. An HMM in speech recognition software, for example, can identify that a certain set of sounds make up a certain word, even with the variations in pronunciation and accent that different people will give those sounds. ([http://en.wikipedia.org/wiki/Hidden_Markov_Model Wikipedia] and lecture, Laura)

'''HMM Logo''' - a graphical representation of an HMM, detailing the possible amino acid sequences, the relative frequencies and probabilities of each amino acid in the sequence, the relative contribution each amino acid has to the overall protein family, and the charge or nature of the amino acids themselves. ([http://www.sanger.ac.uk/Software/analysis/logomat-m/help.shtml How to read HMM Logos, on Pfam], Laura)

'''homeobox''' - DNA sequence within transcription factor genes that allow the cell to respond to patterns of development by having the transcription factors switch on gene cascades [http://en.wikipedia.org/wiki/Homeobox](Samantha)

'''homodimer''' - a protein made of paired identical polypeptides ([http://www.answers.com/topic/homodimer Answers.com], Jay)

'''horizontal gene transfer'''-DNA transmission between species and incorporation of the DNA into the recipient's genome ([http://www.csrees.usda.gov/nea/biotech/res/biotechnology_res_glossary.html horizontal gene transfer] Pallavi)

'''hydrolase''' - an enzyme that catalyzes hydrolysis, the cleavage of a chemical bond by the addition of water [http://en.wikipedia.org/wiki/Hydrolase] (Nick)

== I ==

'''ideogram''' - in genomics, usually describes a stylized representation of a chromosome with banding patterns (Campbell-Heyer Genomics textbook, Jay)

'''identities''' - in a BLAST output, the number and fraction of total residues which are identical in a given alignment [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)

'''indole'''-a chemical compound that is produced from the break down of tryptophan ([http://medical-dictionary.thefreedictionary.com/indole indole] Pallavi)

'''inclusion body''' - Inclusion bodies are collections of stainable substances, usually proteins, that are found either in the nucleus or the cytoplasm. It is thought that these bodies are often the result of viral proteins that misfolded [http://en.wikipedia.org/wiki/Inclusion_body] (Nick)

'''intron''' - a region of DNA in a gene that is not part of the final coding sequence for the protein. [http://en.wikipedia.org/wiki/Intron] (Peter)

'''isoelectric point''' - the pH at which a molecule carries no net electrical charge [http://en.wikipedia.org/wiki/Isoelectric_point] (Nick)

'''isozymes''' - members of a gene family with very similar cellular roles (Campbell-Heyer Genomics textbook, Jay)

== J ==

== K ==
'''KEGG (Kyoto Encyclopedia of Genes and Genomes)''' - a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The Pathway database records networks of molecular interactions in the cells, and variants of them specific to particular organisms [http://en.wikipedia.org/wiki/KEGG](Will).

'''kinase''' - a type of enzyme that transfers a phosphate group from a high-energy donor molecule to a target molecule in a process called phosphorylation. [http://en.wikipedia.org/wiki/Kinase] (Peter)

== L ==

== M ==
'''Manatee''' - a web-based gene evaluation and genome annotation tool that can view, modify, and store annotation for prokaryotic and eukaryotic genomes. This on-going, open source initiative was developed with two missions: first, to give biologists the ability to functionally annotate their genomes using a powerful, stand-alone web application with a robustly designed relational annotation database; and second, to give outside developers the opportunity to contribute their own ideas and requirements to enhance Manatee's ability to accomplish biological goals [http://manatee.sourceforge.net/](Will).

'''motif''' - a sequence of amino acids or nucleotides that performs a particular role and is often conserved in other species or molecules. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''mycoplasma''' - genus of bacteria that lack a cell wall [http://en.wikipedia.org/wiki/Mycoplasma] (Nick)

== N ==

'''NORFs''' (nonannotated open reading frame) - an open reading frame that was considered not to be a real gene when the genome was annotated. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''nucleomorph''' - reduced eukaryotic nuclei found in plastids [http://en.wikipedia.org/wiki/Nucleomorph](Samantha)

== O ==
'''object-oriented programming''' - a programming paradigm in which collections of data, associated with operations on that data, are modularly defined and then built upon (CSC 121 Lecture, Will).

'''open reading frame (ORF)''' - a segment of DNA that can potentially encode a protein; it begins with a start codon (usually ATG) and ends with a stop codon [http://www.fao.org/DOCREP/003/X3910E/X3910E18.htm ORF] (Pallavi)

'''operon''' - a segment of DNA involving an operator, promoter, and one or more genes that operate as a single unit during transcription [http://en.wikipedia.org/wiki/Operon] (Nick)

'''optical mapping'''-DNA sequences of the organism in question are compared against a karyotype that specifically looks at restriction sites found within the DNA to correctly order the DNA sequences on a chromosome. This methodology gives very detailed haplotype information and allows for the detection of sequence variations across an entire genome [http://www.geocities.com/bioinformaticsweb/genomicglossary.html optical mapping] (Pallavi)

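'''ortholog''' - one of two or more homologous genes found in different species that descended from a single gene in their last common ancestor through speciation; orthologs often retain the same function (Lecture, Pallavi)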

'''oxidoreductase''' - an enzyme that catalyzes redox reactions by transferring electrons from one molecule (the reductant) to another (the oxidant) [http://en.wikipedia.org/wiki/Oxidoreductase] (Nick)

== P ==

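'''paralog''' - one of two or more homologous genes within the same genome that arose by gene duplication; paralogs are similar but not necessarily identical in sequence and may diverge in function (Lecture, Pallavi)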

'''p-arm''' - the shorter arm of a chromosome's two arms separated by the centromere (compare to q-arm, the longer arm) ([http://www.medterms.com/script/main/art.asp?articlekey=4715 MedTerms Dictionary], Jay)

'''Perl''' - Developed by Larry Wall in 1987, Perl is a [http://en.wikipedia.org/wiki/High-level_programming_language high-level programming language] used frequently by biologists and bioinformaticists [http://en.wikipedia.org/wiki/Perl] (Will).

'''periplasmic space''' - the space between the inner cytoplasmic membrane and external outer membrane in bacteria or archaea. [http://en.wikipedia.org/wiki/Periplasmic_space] (Peter)

'''Pfam''' - a database for protein domain families that matches amino acid sequences or nucleotide sequences to the related group of proteins to which they most likely belong. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura)

'''plasmid''' - an extra-chromosomal DNA molecule that is capable of replicating independently of the chromosomal DNA. Commonly found in bacteria and archaea. [http://en.wikipedia.org/wiki/Plasmid](Peter)

'''plastid''' - major organelles in plants or algae [http://en.wikipedia.org/wiki/Plastid](Samantha)

'''pleomorphism''' - the occurrence of two or more structural forms during a life cycle [http://en.wikipedia.org/wiki/Pleomorphism] (Mary)

'''phylogenetic tree''' - a diagram showing the evolutionary relationships between biological species that are thought to share a common ancestor [http://en.wikipedia.org/wiki/Phylogenetic_tree] (Nick)

'''phylotypes''' – a term intended to resolve the challenge of “species” when classifying prokaryotes using DNA sequence comparisons. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''positives''' - in a BLAST output, the number and fraction of residues for which the alignment scores have positive rather than negative values [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)

'''proteome''' - entire set of proteins expressed by a genome, cell, tissue, or organism. It may refer to expressed proteins under certain conditions [http://en.wikipedia.org/wiki/Proteome](Samantha)

'''PSORT''' - a prediction server that judges where a mature protein could be in the cell, based on its transmembrane domains, its predicted mature amino acid composition, and its signal sequences. ([http://psort.ims.u-tokyo.ac.jp/form.html PSORT], Laura)

'''pseudogenes'''-sequences of DNA that look like genes but are not functional, most likely because they contain many stop codons. A pseudogene may have evolved away from a real gene, or a paralog might have taken its place (Lecture, Pallavi)

'''purine''' - a category of nitrogenous base consisting of a pyrimidine ring fused to an imidazole ring. Notable purine bases are adenine and guanine. [http://en.wikipedia.org/wiki/Purine] (Peter)

'''pyrimidine''' - a category of nitrogenous base consisting of a heterocyclic aromatic ring containing two nitrogen atoms at positions 1 and 3 of the six-member ring. Notable pyrimidine bases are cytosine, thymine, and uracil. [http://en.wikipedia.org/wiki/Pyrimidine] (Peter)

== Q ==

'''q-arm''' - the longer arm of a chromosome's two arms separated by the centromere (compare to p-arm, the shorter arm) ([http://www.medterms.com/script/main/art.asp?articlekey=5152 MedTerms Dictionary], Jay) 

'''query sequence''' - the sequence (whether amino acid or nucleotide) entered into a database’s search function and checked against the database entries. ([http://en.wikipedia.org/wiki/BLAST BLAST on Wikipedia], Laura)

== R ==

'''RAST''' - (Rapid Annotation using Subsystem Technology)- a fully-automated service for annotating bacterial and archaeal genomes. It provides high quality genome annotations for these genomes across the whole phylogenetic tree. ([http://rast.nmpdr.org/], Max Win)

'''rDNA'''-These are DNA sequences that encode for ribosomal RNA. Note that rDNA can also stand for recombinant DNA. ([http://en.wikipedia.org/wiki/Ribosomal_DNA rDNA] Pallavi)

'''residue (protein)''' - the remaining portion of an amino acid after a water molecule has been removed and it has been incorporated into a protein. Functional residues, referred to in Pfam, are the residues that perform some specific identifiable function or are part of a domain, and can be conserved across evolutionarily-related proteins. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura)

'''retrotransposons''' - transposable elements that are transcribed into RNA, reverse-transcribed back into DNA, and inserted at a new position in the genome [http://en.wikipedia.org/wiki/Retrotransposon](Samantha)

'''ribonuclease''' - a nuclease that catalyzes the degradation of RNA into smaller components [http://en.wikipedia.org/wiki/Ribonuclease] (Mary)

== S ==
'''Serovar'''-a subdivision of a species based on the characteristics of their cell surface antigens ([http://www.biology-online.org/dictionary/Serovar serovar] Pallavi)

'''scaffold''' - a section of a sequenced genome composed of contigs that are in the right order but not necessarily connected ([http://www.medterms.com/script/main/art.asp?articlekey=25223 MedTerms Dictionary], Jay)

'''Shine-Dalgarno sequence''' - A ribosomal binding site on an mRNA, usually a sequence of about six bases located six or seven bases upstream of the start codon. An anti-Shine-Dalgarno sequence exists on the rRNA in the small subunit of the ribosome; when the two sequences pair, the mRNA is lined up and prepared for translation. (Lecture and [http://en.wikipedia.org/wiki/Shine-dalgarno Wikipedia article], Laura)
Note: The Shine-Dalgarno consensus sequence for our genome is ccGGAGGt.
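
A minimal sketch of how one might check a called start codon for a nearby Shine-Dalgarno motif. It assumes the <code>GGAGG</code> core of the consensus noted above, an arbitrary 20-base search window, and a hypothetical function name <code>find_sd_upstream</code>; a real analysis would probably allow mismatches and score the spacing.

```python
def find_sd_upstream(dna, start_index, motif="GGAGG", window=20):
    """Return the spacer length between the motif and the start codon.

    dna         -- DNA sequence of the coding strand
    start_index -- index of the first base of the start codon
    motif       -- exact motif to look for (assumed core of the consensus)
    window      -- how many bases upstream of the start codon to search

    Returns the number of bases between the end of the closest motif
    occurrence and the start codon, or None if the motif is absent.
    """
    dna = dna.upper()
    lo = max(0, start_index - window)
    upstream = dna[lo:start_index]
    pos = upstream.rfind(motif)  # occurrence closest to the start codon
    if pos == -1:
        return None
    return len(upstream) - (pos + len(motif))
```

With the made-up sequence <code>TTCCGGAGGTAACAAATGAAA</code> and the ATG at index 15, the function returns a spacer of 6 bases.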

'''SignalP''' - a prediction server that judges whether or not a query protein contains a signal peptide. SignalP measures each amino acid against the amino acid sequences of probable signal peptide matches and predicts the cleavage site of the signal peptide. ([http://www.cbs.dtu.dk/services/SignalP-3.0/output.php SignalP Output explained], Laura)

'''signal peptide''' - a short peptide chain that directs the post-translational transport of a protein [http://en.wikipedia.org/wiki/Signal_peptide] (Matt)

'''Smith-Waterman alignment''' - A well-known algorithm for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure [http://en.wikipedia.org/wiki/Smith_waterman](Will).
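
The dynamic-programming idea behind Smith-Waterman can be sketched in a few lines of Python. This version returns only the best local alignment score, not the alignment itself, and the scoring values (match +2, mismatch -1, linear gap -2) are arbitrary assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    H[i][j] holds the best score of any local alignment ending at
    a[i-1] and b[j-1]; scores are never allowed to drop below zero,
    which is what makes the alignment local rather than global.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best
```

Recovering the aligned segments would require a traceback from the highest-scoring cell, which is omitted here to keep the sketch short.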

'''SNP (Single Nucleotide Polymorphism)''' - a DNA sequence variation occurring when a single nucleotide in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual) [http://en.wikipedia.org/wiki/Single_nucleotide_polymorphism](Will).

'''symporter''' - an integral membrane protein that is involved in movement of two or more different molecules or ions across a phospholipid membrane in the same direction. [http://en.wikipedia.org/wiki/Symporter] (Peter)

'''synteny''' - a neologism from the Greek for "on the same ribbon". Genes that are syntenic in one species are on the same chromosome; genes that are syntenic across species retain the same order on respective chromosomes as a result of descent from a common ancestor ([http://www.answers.com/synteny Answers.com], Jay)

'''synthetase''' - a type of enzyme that creates a new covalent bond and requires direct input of energy from a high-energy phosphate. [http://books.google.com/books?id=bB8XnCykRmIC&pg=PA522&lpg=PA522&dq=%22synthetase+is+an+enzyme%22&source=web&ots=wkws4ksMsg&sig=zWLkDIk7T78hcf9S84nWs3u5Apw&hl=en&sa=X&oi=book_result&resnum=9&ct=result] (Peter)

== T ==
'''transferase''' - an enzyme that catalyzes the transfer of a functional group from one molecule (the donor) to another (the acceptor) [http://en.wikipedia.org/wiki/Transferase] (Matt)

'''transmembrane helix''' - a single transmembrane alpha helix of a transmembrane protein, usually about twenty amino acids in length. They are usually predicted by hydrophobicity. [http://en.wikipedia.org/wiki/Transmembrane_domain](Mary)

'''transposons / transposable elements''' - DNA sequences that can move around to different positions in a single cell's genome. Transposons can cause mutations and change the length of the genome. [http://en.wikipedia.org/wiki/Transposon](Samantha)

'''Transposon Mutagenesis'''-a procedure in which a transposon is inserted into a gene, which inactivates the gene and can lead to the discovery of the phenotype associated with this gene ([http://cancerweb.ncl.ac.uk/cgi-bin/omd?transposon+mutagenesis transposon mutagenesis] Pallavi)

'''tRNA splicing endonuclease''' - an enzyme that cleaves intervening sequences of precursor tRNA. [http://cancerweb.ncl.ac.uk/cgi-bin/omd?splicing+endonuclease] (Peter) 

== U ==

== V ==

== W ==

'''whole genome shotgun sequencing''' - a method of sequencing where DNA is cut into small pieces and cloned into vectors, then about 500 bp is sequenced from both ends of every cloned insert to form mate pairs. Mate pairs rarely overlap, but software uses them to reassemble the sequence. [http://en.wikipedia.org/wiki/Whole_genome_shotgun](Samantha)
 

== X ==
'''xenolog''' - homologs that are created by horizontal gene transfer between two different species [http://en.wikipedia.org/wiki/Xenolog#Xenology] (Matt) 

== Y ==

== Z ==

 
<HR>
<HR>

== This is a list of the student-created tutorials: ==

Halorhabdus utahensis Genome

2008-10-04T17:07:30Z

SaSimpson: /* Tutorials for Annotating Genomes */

This page will be used by Davidson College students in the [http://www.bio.davidson.edu/Courses/Bio343/LabMethods.html Genomics Laboratory course].

== Links to Multiple Databases ==
*[http://imgweb.jgi-psf.org/cgi-bin/img_edu_v260/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2500575004 JGI IMG EDU] public access *[[Media:JGIAnnotation.xls|JGI Annotation Excel Spreadsheet]]
*[http://www.tigr.org/tigr-scripts/prok_manatee/shared/login.cgi Manatee at JCVI] use the davidson number sent by email as username and password (database is nthu01 - this is case sensitive) *[[Media:ManateeAnnotation.xls|Manatee Annotation Excel Spreadsheet]]
*[http://rast.nmpdr.org/ SEED view via RAST] use the username and password combination sent to you by SEED *[[Media:RastAnnotation.xls|RAST Annotation Excel Spreadsheet]] *[http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=18261238 RAST Publication in PubMed]
*[http://www.genome.jp/kegg/kaas/ KEGG] We can submit our genes to KEGG to have it mapped out, but SEED and Manatee may already do this. Do we want to ask them to upload it into their database? 
*[http://gcat.davidson.edu/Registry/compare/ Pairwise comparisons of All Three Annotations]

[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI_5contigs.txt JGI Full genome, 5 separate contigs & 3.1 Mbp, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.txt JGI gene DNA sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.xls JGI gene annotations, Excel] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_proteins.txt JGI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_merged.txt JCVI Full genome, 5 contigs fused, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_ORFs.txt JCVI gene sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_proteins.txt JCVI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/GeneLengths.xls 3-way comparison, Excel] 
[[Venn_diagrams]] Venn diagram of 3-way comparison

 

== RNA Genes ==

*[[tRNA Genes Check List]] 
*[[rRNA operon]] 
*[[2 misc. RNA genes]] (short summary list) 
*[[Missing tRNA-trp gene found]] 

== Other Resources ==
*[[Consensus Shine Dalgarno]] Excel File for ''H. utahensis'' 
*[[References]] 
*[[Gene Annotation Template]] 
*[[General Questions]] 
*[[Page for Annotated Genes]] 

== Tutorials for Annotating Genomes ==

# Will DeLoache- BioPerl Installation 
# Max Win- Introduction to Perl for non-programmers (with step-by-step explanations, simple exercises, and solutions) 

# Pallavi-Conserved Domains Database (CDD) 
# Mary- Protein Data Bank 
# Laura Voss - Pfam Database 
# Samantha Simpson - [[NCBI BLAST]] 
# Peter Bakke - Finding species-specific Shine-Dalgarno sequence 
# Jay McNair - How to determine the origin of replication 
# Nick Carney - JGI Database 

== Research Questions ==
#How do the three systems compare for finding ORFs and RNA genes?
#Is there a pattern of missed genes for any of the 3 sites?
#Do the three systems differ in their ability to find good start codons and Shine-Dalgarno sequences? [We need a standard set of genes for comparison. Only highly conserved or a range of genes?]
# Were Shine-Dalgarno sequences calculated for our species or default values used? If default, what sequence?
#Can we fill any holes in their automated annotation? Is there a mechanism for users to add in genes?
#How do the 3 sites compare for ease of use?
#What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working?
#How does each of the 3 sites compare for pathway detection and visualization?
#Do they find the origin of replication? Can we find it?

* How do the 3 systems compare when one calls a gene hypothetical and another calls it a functional protein? How can they vary, and which is closer to correct (however you define that, possibly by date of matched entry)? (Pallavi and Mary)
* Why did one system call a gene when the other two did not? (Matt and Lara)
* How do the 3 sites compare for ease of use? What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working? (Samantha and Nick)
* Where is the origin of replication and did the 3 systems attempt to identify this?
* Did the 3 systems utilize Shine-Dalgarno sequences to help them call start codons? Did they utilize our species's consensus Shine-Dalgarno? (Peter)
* We need to fill in the [[Venn diagrams]] for our 3-way comparison. Let's compare the size of ORFs and generate a [[Gene Length Histograms|graph comparing the distributions]] for all 3. (Max and Will - they also take requests).

<hr>

== This is a list of glossary words (A - Z): ==
[[#A| A ]] [[#B| B ]] [[#C| C ]] [[#D| D ]] [[#E| E ]] [[#F| F ]] [[#G| G ]] [[#H| H ]] [[#I| I ]] [[#J| J ]] [[#K| K ]] [[#L| L ]] [[#M| M ]] [[#N| N ]] [[#O| O ]] [[#P| P ]] [[#Q| Q ]] [[#R| R ]] [[#S| S ]] [[#T| T ]] [[#U| U ]] [[#V| V ]] [[#W| W ]] [[#X| X ]] [[#Y| Y ]] [[#Z| Z ]]

== A ==
'''Accession Number''' - a unique identifier given to DNA and protein sequences to allow for tracking of sequence information within a single database [http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)] (Will).

'''Arabidopsis thaliana''' - the scientific name for the thale cress plant; it was the first plant to have its genome sequenced, and is a model organism for understanding plant biology and genetics ([http://en.wikipedia.org/wiki/Thale_cress Wikipedia.org], Jay)

== B ==
'''BAC''' - bacterial artificial chromosome, a DNA construct used for transforming or cloning segments of DNA and often used to sequence the genetic code of organisms ([http://en.wikipedia.org/wiki/Bacterial_artificial_chromosome Wikipedia.org], Jay)

'''bioinformatics''' - the multi-disciplinary approach of using biology, computer science and mathematics to solve or better understand biological problems [http://en.wikipedia.org/wiki/Bioinformatics] (Matt)

'''BLAST''' - (Basic Local Alignment Search Tool) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. [http://blast.ncbi.nlm.nih.gov/Blast.cgi] (Mary)

'''bioperl'''- a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications such as accessing sequence data from local and remote databases, converting between database formats, manipulating individual sequences, searching for similar sequences, searching for genes and other structures on genomic DNA, or developing machine-readable sequence annotations. [http://en.wikipedia.org/wiki/BioPerl] (Wikipedia, Max Win)

== C ==
'''carbon fixation''' - using carbon dioxide to create organic materials [http://en.wikipedia.org/wiki/Carbon_fixation] (Samantha) 

'''CDD''' (Conserved Domains Database)- a database used to identify the conserved domains present in a protein query sequence [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml] (Mary)

'''chaperonin''' - a protein complex that assists some newly formed polypeptide chains by folding them into their final, functional, three-dimensional form [http://en.wikipedia.org/wiki/Chaperonins] (Matt)

'''chemotaxis''' - the process by which cells move toward or away from high concentrations of certain chemicals; it is found in both uni- and multicellular organisms. Unicellular organisms use it to avoid toxins or find food, and multicellular organisms use it for tasks such as reproduction [http://en.wikipedia.org/wiki/Chemotaxis] (Nick)

'''chemotaxonomy''' - the attempt to classify and identify organisms according to demonstrable differences and similarities in their biochemical compositions [http://en.wikipedia.org/wiki/Chemotaxonomy] (Mary)

'''ClustalW''' - A web-based or command line tool that performs multiple sequence alignments to determine evolutionary relationships between three or more sequences [http://en.wikipedia.org/wiki/Clustal] (Will).

'''COG''' (Clusters of Orthologous Groups)- each COG corresponds to a highly conserved domain and generally consists of either individual proteins or groups of paralogs ([http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml COG] Pallavi) 

'''concatemer''' - long continuous DNA molecule that contains the same DNA sequence repeated in series [http://en.wikipedia.org/wiki/Concatemer](Samantha) 

'''contigs''' (contiguous DNA)- overlapping DNA segments that as a collection form a longer, gapless segment of DNA. (Discovery Genomics, Proteomics and Bioinformatics [http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''coverage''' - refers to the number of times, on average, any piece of DNA in a sequenced genome has been individually sequenced (Lecture, Jay)
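
The arithmetic behind average coverage is simply the total number of sequenced bases divided by the genome length. The short Python sketch below makes that explicit; the function name and the example numbers are hypothetical and are not taken from the ''H. utahensis'' sequencing project.

```python
def average_coverage(read_lengths, genome_length):
    """Return average coverage: total sequenced bases / genome length.

    read_lengths  -- iterable of read lengths in bases
    genome_length -- genome size in bases
    """
    return sum(read_lengths) / genome_length
```

For instance, 60,000 reads of 500 bases each over a 3,000,000-base genome give 10x coverage.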

'''CPAN (Comprehensive Perl Archive Network)''' - an archive of over 12,200 modules of software written in Perl, as well as documentation for it. It contains a module called CPAN (or CPAN.pm) which is used as an installer for Perl modules such as BioPerl [http://en.wikipedia.org/wiki/CPAN](Will).

== D ==
'''''de novo'' synthesis''' - the synthesis of complex molecules from simple molecules (e.g. sugars and nucleotides), rather than from recycled molecules; from the Latin "of the new" [http://en.wikipedia.org/wiki/De_novo_synthesis] (Matt)

'''dehydrogenase''' - a type of enzyme that oxidizes a substrate by transferring one or more protons and a pair of electrons to an acceptor. [http://en.wikipedia.org/wiki/Dehydrogenase] (Peter)

'''diatom''' - a major group of eukaryotic algae, and one of the most common types of phytoplankton. A characteristic feature of diatom cells is that they are encased within a unique cell wall made of silica called a frustule. These frustules show a wide diversity in form, but usually consist of two asymmetrical sides with a split between them. [http://en.wikipedia.org/wiki/Diatom] (Mary)

'''domain (protein)''' - the structural and functional groups of a protein, which can exist independently of the protein itself. Domains typically perform a specific function, such as binding to promoters or substrates, and many proteins can have one or several domains in common. Evolutionarily-linked proteins are more likely to have domains in common. Domains are used to organize proteins into families. ([http://en.wikipedia.org/wiki/Domain_(protein) Wikipedia article], Laura)

'''dot plot'''-graphical display comparing sequence conservation between two genomes with dots indicating strings of identical bases. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
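
A dot plot can be computed by recording every pair of positions where the two sequences share an identical string of bases. The sketch below is one simple way to do that in Python; the word length of 3 and the function name <code>dot_plot</code> are assumptions for illustration, and plotting the resulting coordinates is left to whatever graphing tool is at hand.

```python
def dot_plot(seq_a, seq_b, word=3):
    """Return the set of (i, j) positions where seq_a[i:i+word] equals
    seq_b[j:j+word]. Each pair is one dot in the plot."""
    # Index every word in seq_b by its starting positions.
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)
    # Look up each word of seq_a in that index.
    dots = set()
    for i in range(len(seq_a) - word + 1):
        for j in index.get(seq_a[i:i + word], []):
            dots.add((i, j))
    return dots
```

Comparing a sequence against itself yields dots along the main diagonal; conserved regions between two genomes show up as diagonal runs elsewhere.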

== E ==

'''EC number''' (Enzyme Commission Number)- a numerical classification scheme for enzymes, based on the chemical reactions they catalyze [http://en.wikipedia.org/wiki/EC_number] (Mary)

'''E-value''' (Expect value)- When performing a BLAST search, you will obtain an E-value for each sequence that is retrieved. An E-value can be thought of as the number of matches with an equally good or better score that you would expect to find by chance in a database of that size; the closer it is to zero, the less likely the similarity is due to chance. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''Extremophile''' - an organism that thrives in and may even require physically or geochemically extreme conditions that are detrimental to the majority of life on Earth [http://en.wikipedia.org/wiki/Extremophile] (Will).

== F ==

'''FASTA format''' - a format used to convey either nucleic acid sequences or peptide sequences, in which bases or amino acids are represented by single-letter codes. The sequence name and other descriptors precede the sequence on a header line that begins with ">". [http://en.wikipedia.org/wiki/FASTA_format] (Nick) 
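
To make the layout concrete, here is a small Python sketch that reads FASTA text into a dictionary. It assumes each record starts with a header line beginning with ">" and that sequence lines may be wrapped; the function name <code>parse_fasta</code> is hypothetical, and BioPerl or similar libraries provide more robust readers.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {header: sequence}.

    The header is everything after '>' on the header line; sequence
    lines belonging to one record are joined into a single string.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}
```

Calling it on a file's contents, for example <code>parse_fasta(open("genes.txt").read())</code> with a hypothetical file name, returns one entry per sequence.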

'''family (protein)''' - a group of evolutionarily-related proteins, often with one or several domains in common. Families are organized by domain overlap, structural/functional similarity, and sequence similarity. ([http://en.wikipedia.org/wiki/Protein_family Wikipedia article] and lecture, Laura)

'''finished genome''' - a genome that has been sequenced at least partly by hand, resulting in at least 99.99% sequence accuracy (Lecture, Jay) 

== G ==

'''GC Content''' - the percentage of bases within a certain sequence of DNA (e.g. a gene or a genome) that are either guanine or cytosine; a higher GC content is characteristic of a coding region of a gene; differences in GC content between a gene and a genome can be used as evidence for horizontal gene transfer [http://en.wikipedia.org/wiki/GC-content] (Matt) 
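
GC content is a straightforward count, as this minimal Python sketch shows. The function name is hypothetical, and characters other than G and C (including ambiguous bases such as N) are simply counted as non-GC, which is an assumption a careful analysis might revisit.

```python
def gc_content(seq):
    """Return the percentage of bases in seq that are G or C.

    Returns 0.0 for an empty sequence to avoid dividing by zero.
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
```

Comparing the value for a single gene with the value for the whole genome is the kind of check the definition above mentions as evidence for horizontal gene transfer.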

'''GC-skew''' – uneven distribution of guanine and cytosine bases between the two strands of DNA where GC base pairs occur. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
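
GC skew is commonly quantified on one strand as (G - C) / (G + C) within successive windows. The Python sketch below computes that value per window; the window size and the function name are assumptions. In many bacterial genomes a change in the sign of the skew is used as a hint about where replication begins or ends, though whether that holds for a given genome has to be checked.

```python
def gc_skew(seq, window=1000):
    """Return a list of (G - C) / (G + C) values, one per window.

    Windows are non-overlapping; the last one may be shorter.
    A window with no G or C is reported as 0.0.
    """
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews
```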

'''gene amplification''' - production of multiple copies of a gene in order to amplify the amount of protein that the gene encodes for [http://www.medterms.com/script/main/art.asp?articlekey=13537] [http://www.answers.com/topic/gene-amplification] (Matt)

'''gene knockout''' - a process in which a gene is deactivated within a test organism in order to better understand the function of the gene in that organism [http://en.wikipedia.org/wiki/Gene_knockout] (Matt)

'''gene ontology'''- a collaborative effort of investigators to unify and standardize terms associated with the role a gene or protein plays in an organism. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''glaucophyte''' - freshwater algae that have not been studied well [http://en.wikipedia.org/wiki/Glaucophyte](Samantha) 

== H ==

'''haemolysin or hemolysin''' - a chemical produced by bacteria that causes lysis of red blood cells [http://en.wikipedia.org/wiki/Hemolysis_(microbiology)] (Nick)

'''halophile''' - an organism, most often of the Archaea domain, that lives in environments containing high concentrations of salt [http://en.wikipedia.org/wiki/Halophile] (Matt)

'''haplotype'''-collection of alleles that travel together (Lecture, Pallavi)

'''haptophyte''' - phylum of algae [http://en.wikipedia.org/wiki/Haptophyte](Samantha)

'''heterokont''' - major line of eukaryotes consisting of about 10,500 known species, most of which are algae [http://en.wikipedia.org/wiki/Heterokont](Samantha)

'''Hidden Markov Model''' - a statistical model used in protein recognition databases such as Pfam. A Hidden Markov Model keeps track of several variables and possible variations thereof, such as the possible amino acid sequences that make up a protein domain (since there can be some variance in an amino acid sequence) or the variations in the component sounds that make up a word, and uses those points to match a given sequence to the word, domain, or other complex sequence it most closely matches. An HMM in speech recognition software, for example, can identify that a certain set of sounds make up a certain word, even with the variations in pronunciation and accent that different people will give those sounds. ([http://en.wikipedia.org/wiki/Hidden_Markov_Model Wikipedia] and lecture, Laura)

'''HMM Logo''' - a graphical representation of an HMM, detailing the possible amino acid sequences, the relative frequencies and probabilities of each amino acid in the sequence, the relative contribution each amino acid has to the overall protein family, and the charge or nature of the amino acids themselves. ([http://www.sanger.ac.uk/Software/analysis/logomat-m/help.shtml How to read HMM Logos, on Pfam], Laura)

'''homeobox''' - DNA sequence within transcription factor genes that allow the cell to respond to patterns of development by having the transcription factors switch on gene cascades [http://en.wikipedia.org/wiki/Homeobox](Samantha)

'''homodimer''' - a protein made of paired identical polypeptides ([http://www.answers.com/topic/homodimer Answers.com], Jay)

'''horizontal gene transfer'''-DNA transmission between species and incorporation of the DNA into the recipient's genome ([http://www.csrees.usda.gov/nea/biotech/res/biotechnology_res_glossary.html horizontal gene transfer] Pallavi)

'''hydrolase''' - an enzyme that catalyzes hydrolysis, the cleavage of a chemical bond through the addition of a water molecule [http://en.wikipedia.org/wiki/Hydrolase] (Nick)

== I ==

'''ideogram''' - in genomics, usually describes a stylized representation of a chromosome with banding patterns (Campbell-Heyer Genomics textbook, Jay)

'''identities''' - in a BLAST output, the number and fraction of total residues which are identical in a given alignment [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)

'''indole'''-a chemical compound that is produced from the break down of tryptophan ([http://medical-dictionary.thefreedictionary.com/indole indole] Pallavi)

'''inclusion body''' - Inclusion bodies are collections of stainable substances, usually proteins, that are found either in the nucleus or the cytoplasm. It is thought that these bodies are often the result of viral proteins that misfolded [http://en.wikipedia.org/wiki/Inclusion_body] (Nick)

'''intron''' - a region of DNA in a gene that is not part of the final coding sequence for the protein. [http://en.wikipedia.org/wiki/Intron] (Peter)

'''isoelectric point''' - the pH at which a molecule carries no net electrical charge [http://en.wikipedia.org/wiki/Isoelectric_point] (Nick)

'''isozymes''' - members of a gene family with very similar cellular roles (Campbell-Heyer Genomics textbook, Jay)

== J ==

== K ==
'''KEGG (Kyoto Encyclopedia of Genes and Genomes)''' - a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The Pathway database records networks of molecular interactions in the cells, and variants of them specific to particular organisms [http://en.wikipedia.org/wiki/KEGG](Will).

'''kinase''' - a type of enzyme that transfers a phosphate group from a high-energy donor molecule to a target molecule in a process called phosphorylation. [http://en.wikipedia.org/wiki/Kinase] (Peter)

== L ==

== M ==
'''Manatee''' - a web-based gene evaluation and genome annotation tool that can view, modify, and store annotation for prokaryotic and eukaryotic genomes. This on-going, open source initiative was developed with two missions: first, to give biologists the ability to functionally annotate their genomes using a powerful, stand-alone web application with a robustly designed relational annotation database; and second, to give outside developers the opportunity to contribute their own ideas and requirements to enhance Manatee's ability to accomplish biological goals [http://manatee.sourceforge.net/](Will).

'''motif''' - a sequence of amino acids or nucleotides that performs a particular role and is often conserved in other species or molecules. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''mycoplasma''' - genus of bacteria that lack a cell wall [http://en.wikipedia.org/wiki/Mycoplasma] (Nick)

== N ==

'''NORFs''' (nonannotated open reading frames) - open reading frames that were considered not to be real genes when the genome was annotated. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''nucleomorph''' - reduced eukaryotic nuclei found in plastids [http://en.wikipedia.org/wiki/Nucleomorph](Samantha)

== O ==
'''object-oriented programming''' - a programming paradigm in which collections of data, associated with operations on that data, are modularly defined and then built upon (CSC 121 Lecture, Will).

'''open reading frame (ORF)'''-a segment of DNA that can potentially encode a protein; it begins with a start codon (usually ATG) and ends at a stop codon [http://www.fao.org/DOCREP/003/X3910E/X3910E18.htm ORF] (Pallavi)

'''operon''' - a segment of DNA involving an operator, promoter, and one or more genes that operate as a single unit during transcription [http://en.wikipedia.org/wiki/Operon] (Nick)

'''optical mapping'''-DNA sequences of the organism in question are compared against a karyotype that specifically looks at restriction sites found within the DNA to correctly order the DNA sequences on a chromosome. This methodology gives very detailed haplotype information and allows for the detection of sequence variations across an entire genome [http://www.geocities.com/bioinformaticsweb/genomicglossary.html optical mapping] (Pallavi)

'''ortholog'''-homologous genes in different species that descend from a single gene in their last common ancestor and diverged through speciation (Lecture, Pallavi)

'''oxidoreductase''' - an enzyme that catalyzes redox reactions by transferring electrons from one molecule (the reductant) to another (the oxidant) [http://en.wikipedia.org/wiki/Oxidoreductase] (Nick)

== P ==

'''paralog'''-homologous genes within the same genome that arose by gene duplication; their sequences are similar but not necessarily identical (Lecture, Pallavi)

'''p-arm''' - the shorter arm of a chromosome's two arms separated by the centromere (compare to q-arm, the longer arm) ([http://www.medterms.com/script/main/art.asp?articlekey=4715 MedTerms Dictionary], Jay)

'''Perl''' - Developed by Larry Wall in 1987, Perl is a [http://en.wikipedia.org/wiki/High-level_programming_language high-level programming language] used frequently by biologists and bioinformaticists [http://en.wikipedia.org/wiki/Perl] (Will).

'''periplasmic space''' - the space between the inner cytoplasmic membrane and external outer membrane in bacteria or archaea. [http://en.wikipedia.org/wiki/Periplasmic_space] (Peter)

'''Pfam''' - a database for protein domain families that matches amino acid sequences or nucleotide sequences to the related group of proteins to which they most likely belong. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura)

'''plasmid''' - an extra-chromosomal DNA molecule that is capable of replicating independently of the chromosomal DNA. Commonly found in bacteria and archaea. [http://en.wikipedia.org/wiki/Plasmid](Peter)

'''plastid''' - major organelles in plants or algae [http://en.wikipedia.org/wiki/Plastid](Samantha)

'''pleomorphism''' - the occurrence of two or more structural forms during a life cycle [http://en.wikipedia.org/wiki/Pleomorphism] (Mary)

'''phylogenetic tree''' - a diagram showing the evolutionary relationships between biological species that are thought to share a common ancestor [http://en.wikipedia.org/wiki/Phylogenetic_tree] (Nick)

'''phylotypes''' – a term intended to resolve the challenge of “species” when classifying prokaryotes using DNA sequence comparisons. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''positives''' - in a BLAST output, the number and fraction of residues for which the alignment scores have positive rather than negative values [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)

'''proteome''' - entire set of proteins expressed by a genome, cell, tissue, or organism. It may refer to expressed proteins under certain conditions [http://en.wikipedia.org/wiki/Proteome](Samantha)

'''PSORT''' - a prediction server that judges where a mature protein could be in the cell, based on its transmembrane domains, its predicted mature amino acid composition, and its signal sequences. ([http://psort.ims.u-tokyo.ac.jp/form.html PSORT], Laura)

'''pseudogenes''' - DNA sequences that look like genes but most likely contain many stop codons. A pseudogene may have evolved away from a real gene, or a paralog might have taken its place (Lecture, Pallavi)

'''purine''' - a category of nitrogenous base consisting of a pyrimidine ring fused to an imidazole ring. Notable purine bases are adenine and guanine. [http://en.wikipedia.org/wiki/Purine] (Peter)

'''pyrimidine''' - a category of nitrogenous base consisting of a heterocyclic aromatic ring containing two nitrogen atoms at positions 1 and 3 of the six-member ring. Notable pyrimidine bases are cytosine, thymine, and uracil. [http://en.wikipedia.org/wiki/Pyrimidine] (Peter)

== Q ==

'''q-arm''' - the longer arm of a chromosome's two arms separated by the centromere (compare to p-arm, the shorter arm) ([http://www.medterms.com/script/main/art.asp?articlekey=5152 MedTerms Dictionary], Jay) 

'''query sequence''' - the sequence (whether amino acid or nucleotide) entered into a database’s search function and checked against the database entries. ([http://en.wikipedia.org/wiki/BLAST BLAST on Wikipedia], Laura)

== R ==

'''RAST''' - (Rapid Annotation using Subsystem Technology)- a fully-automated service for annotating bacterial and archaeal genomes. It provides high quality genome annotations for these genomes across the whole phylogenetic tree. ([http://rast.nmpdr.org/], Max Win)

'''rDNA'''-These are DNA sequences that encode for ribosomal RNA. Note that rDNA can also stand for recombinant DNA. ([http://en.wikipedia.org/wiki/Ribosomal_DNA rDNA] Pallavi)

'''residue (protein)''' - the remaining portion of an amino acid after a water molecule has been removed and it has been incorporated into a protein. Functional residues, referred to in Pfam, are the residues that perform some specific identifiable function or are part of a domain, and can be conserved across evolutionarily-related proteins. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura)

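'''retrotransposons''' - genetic elements that spread through an RNA intermediate: the element is transcribed into RNA, which is reverse-transcribed back into DNA and inserted into the genome at a new position [http://en.wikipedia.org/wiki/Retrotransposon](Samantha)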

'''ribonuclease''' - a nuclease that catalyzes the degradation of RNA into smaller components [http://en.wikipedia.org/wiki/Ribonuclease] (Mary)

== S ==
'''Serovar'''-a subdivision of a species based on the characteristics of their cell surface antigens ([http://www.biology-online.org/dictionary/Serovar serovar] Pallavi)

'''scaffold''' - a section of a sequenced genome composed of contigs that are in the right order but not necessarily connected ([http://www.medterms.com/script/main/art.asp?articlekey=25223 MedTerms Dictionary], Jay)

'''Shine-Dalgarno sequence''' - A ribosomal binding site on an mRNA, usually a sequence of about six bases located six or seven bases upstream of the start codon. An anti-Shine-Dalgarno sequence exists on the rRNA in the small subunit of the ribosome; when the two sequences pair, the mRNA is lined up and prepared for translation. (Lecture and [http://en.wikipedia.org/wiki/Shine-dalgarno Wikipedia article], Laura)
Note: The Shine-Dalgarno consensus sequence for our genome is ccGGAGGt.
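
As a rough illustration of how a Shine-Dalgarno site might be located by hand, the sketch below searches the bases just upstream of a start codon for the GGAGG core of the consensus noted above. The function name <code>find_sd</code>, the <code>max_gap</code> cutoff, and the example sequence are assumptions made for this example; they are not taken from any of the annotation systems.

```python
def find_sd(upstream, core="GGAGG", max_gap=12):
    """Return the number of bases between the core motif and the start codon,
    or None if the motif is absent or too far away.

    `upstream` is the DNA immediately 5' of the start codon, written 5'->3',
    so its last base sits directly next to the start codon.
    """
    upstream = upstream.upper()
    pos = upstream.rfind(core)  # occurrence closest to the start codon
    if pos == -1:
        return None
    gap = len(upstream) - (pos + len(core))
    return gap if gap <= max_gap else None


# Hypothetical upstream region containing the consensus ccGGAGGt
print(find_sd("ttaccGGAGGtcatacc"))
```

An exact string match is the simplest possible test; real sites often differ from the consensus by a base or two, so a scoring approach that tolerates mismatches would probably find more of them.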

'''SignalP''' - a prediction server that judges whether or not a query protein contains a signal peptide. SignalP compares each amino acid against the amino acid sequences of probable signal peptide matches and predicts the cleavage site of the signal peptide. ([http://www.cbs.dtu.dk/services/SignalP-3.0/output.php SignalP Output explained], Laura)

'''signal peptide''' - a short peptide chain that directs the post-translational transport of a protein [http://en.wikipedia.org/wiki/Signal_peptide] (Matt)

'''Smith-Waterman alignment''' - A well-known algorithm for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure [http://en.wikipedia.org/wiki/Smith_waterman](Will).
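
The idea of scoring every possible local segment can be sketched with a short dynamic-programming routine. This is a minimal version that returns only the best local score, and the scoring values (match +2, mismatch -1, linear gap penalty -2) are arbitrary assumptions chosen for the example, not values used by any particular tool.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which lets an alignment
            # start fresh anywhere in either sequence.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Resetting negative scores to zero is what makes the alignment local rather than global. Recovering the aligned segments themselves would require keeping the full matrix and tracing back from the highest-scoring cell.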

'''SNP (Single Nucleotide Polymorphism)''' - a DNA sequence variation occurring when a single nucleotide in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual) [http://en.wikipedia.org/wiki/Single_nucleotide_polymorphism](Will).

'''symporter''' - an integral membrane protein that moves two or more different molecules or ions across a phospholipid membrane in the same direction. [http://en.wikipedia.org/wiki/Symporter] (Peter)

'''synteny''' - a neologism from the Greek for "on the same ribbon". Genes that are syntenic in one species are on the same chromosome; genes that are syntenic across species retain the same order on respective chromosomes as a result of descent from a common ancestor ([http://www.answers.com/synteny Answers.com], Jay)

'''synthetase''' - a type of enzyme that creates a new covalent bond and requires direct input of energy from a high-energy phosphate. [http://books.google.com/books?id=bB8XnCykRmIC&pg=PA522&lpg=PA522&dq=%22synthetase+is+an+enzyme%22&source=web&ots=wkws4ksMsg&sig=zWLkDIk7T78hcf9S84nWs3u5Apw&hl=en&sa=X&oi=book_result&resnum=9&ct=result] (Peter)

== T ==
'''transferase''' - an enzyme that catalyzes the transfer of a functional group from one molecule (the donor) to another (the acceptor) [http://en.wikipedia.org/wiki/Transferase] (Matt)

'''transmembrane helix''' - a single transmembrane alpha helix of a transmembrane protein, usually about twenty amino acids in length. They are usually predicted by hydrophobicity. [http://en.wikipedia.org/wiki/Transmembrane_domain](Mary)

'''transposons / transposable elements''' - DNA sequences that can move around to different positions in a single cell's genome. Transposons can cause mutations and change the length of the genome. [http://en.wikipedia.org/wiki/Transposon](Samantha)

'''Transposon Mutagenesis'''-a procedure in which a transposon is inserted into a gene, which inactivates the gene and can lead to the discovery of the phenotype associated with this gene ([http://cancerweb.ncl.ac.uk/cgi-bin/omd?transposon+mutagenesis transposon mutagenesis] Pallavi)

'''tRNA splicing endonuclease''' - an enzyme that cleaves intervening sequences of precursor tRNA. [http://cancerweb.ncl.ac.uk/cgi-bin/omd?splicing+endonuclease] (Peter) 

== U ==

== V ==

== W ==

'''whole genome shotgun sequencing''' - a method of sequencing in which DNA is cut into small pieces and cloned into vectors, then about 500 bp is sequenced from each end of every insert to form mate pairs. Mate pairs rarely overlap, but software uses them to reassemble the sequence. [http://en.wikipedia.org/wiki/Whole_genome_shotgun](Samantha)
 

== X ==
'''xenolog''' - homologs that are created by horizontal gene transfer between two different species [http://en.wikipedia.org/wiki/Xenolog#Xenology] (Matt) 

== Y ==

== Z ==

 
<HR>
<HR>

== This is a list of the student-created tutorials: ==

Halorhabdus utahensis Genome

2008-10-04T17:07:04Z

SaSimpson: /* Tutorials for Annotating Genomes */

This page will be used by Davidson College students in the [http://www.bio.davidson.edu/Courses/Bio343/LabMethods.html Genomics Laboratory course].

== Links to Multiple Databases ==
*[http://imgweb.jgi-psf.org/cgi-bin/img_edu_v260/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2500575004 JGI IMG EDU] public access
*[[Media:JGIAnnotation.xls|JGI Annotation Excel Spreadsheet]]
*[http://www.tigr.org/tigr-scripts/prok_manatee/shared/login.cgi Manatee at JCVI] use the davidson number sent by email as username and password (database is nthu01 - this is case sensitive)
*[[Media:ManateeAnnotation.xls|Manatee Annotation Excel Spreadsheet]]
*[http://rast.nmpdr.org/ SEED view via RAST] use the username and password combination sent to you by SEED
*[[Media:RastAnnotation.xls|RAST Annotation Excel Spreadsheet]]
*[http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=18261238 RAST Publication in PubMed]
*[http://www.genome.jp/kegg/kaas/ KEGG] We can submit our genes to KEGG to have it mapped out, but SEED and Manatee may already do this. Do we want to ask them to upload it into their database? 
*[http://gcat.davidson.edu/Registry/compare/ Pairwise comparisons of All Three Annotations]

[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI_5contigs.txt JGI Full genome, 5 separate contigs & 3.1 Mbp, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.txt JGI gene DNA sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_genes.xls JGI gene annotations, Excel] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/JGI2500575004_proteins.txt JGI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_merged.txt JCVI Full genome, 5 contigs fused, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_ORFs.txt JCVI gene sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/h_utahensis_proteins.txt JCVI protein sequences, FASTA] 
[http://www.bio.davidson.edu/Courses/Bio343/sequences/GeneLengths.xls 3-way comparison, Excel] 
[[Venn_diagrams]] Venn diagram of 3-way comparison

 

== RNA Genes ==

*[[tRNA Genes Check List]] 
*[[rRNA operon]] 
*[[2 misc. RNA genes]] (short summary list) 
*[[Missing tRNA-trp gene found]] 

== Other Resources ==
*[[Consensus Shine Dalgarno]] Excel File for ''H. utahensis'' 
*[[References]] 
*[[Gene Annotation Template]] 
*[[General Questions]] 
*[[Page for Annotated Genes]] 

== Tutorials for Annotating Genomes ==

# Will DeLoache- BioPerl Installation 
# Max Win- Introduction to Perl for non-programmers (with step-by-step explanations, simple exercises, and solutions) 

# Pallavi-Conserved Domains Database (CDD) 
# Mary- Protein Data Bank 
# Laura Voss - Pfam Database 
# Samantha Simpson - NCBI Blast (protein, nucleotide, and blast2) 
# Peter Bakke - Finding species-specific Shine-Dalgarno sequence 
# Jay McNair - How to determine the origin of replication 
# Nick Carney - JGI Database 

== Research Questions ==
#How do the three systems compare for finding ORFs and RNA genes?
#Is there a pattern of missed genes for any of the 3 sites?
#Do the three systems differ in their ability to find good start codons and Shine-Dalgarno sequences? [We need a standard set of genes for comparison. Only highly conserved or a range of genes?]
# Were Shine-Dalgarno sequences calculated for our species or default values used? If default, what sequence?
#Can we fill any holes in their automated annotation? Is there a mechanism for users to add in genes?
#How do the 3 sites compare for ease of use?
#What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working?
#How does each of the 3 sites compare for pathway detection and visualization?
#Do they find the origin of replication? Can we find it?

* How do the 3 systems compare when one system calls a gene hypothetical and another calls it a functional protein? How can they vary, and which is closer to correct (however you define that, possibly by date of matched entry)? (Pallavi and Mary)
* Why did one system call a gene when the other two did not? (Matt and Lara)
* How do the 3 sites compare for ease of use? What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working? (Samantha and Nick)
* Where is the origin of replication and did the 3 systems attempt to identify this?
* Did the 3 systems utilize Shine-Dalgarno sequences to help them call start codons? Did they utilize our species's consensus Shine-Dalgarno? (Peter)
* We need to fill in the [[Venn diagrams]] for our 3-way comparison. Let's compare the size of ORFs and generate a [[Gene Length Histograms|graph comparing the distributions]] for all 3. (Max and Will - they also take requests).

<hr>

== This is a list of glossary words (A - Z): ==
[[#A| A ]] [[#B| B ]] [[#C| C ]] [[#D| D ]] [[#E| E ]] [[#F| F ]] [[#G| G ]] [[#H| H ]] [[#I| I ]] [[#J| J ]] [[#K| K ]] [[#L| L ]] [[#M| M ]] [[#N| N ]] [[#O| O ]] [[#P| P ]] [[#Q| Q ]] [[#R| R ]] [[#S| S ]] [[#T| T ]] [[#U| U ]] [[#V| V ]] [[#W| W ]] [[#X| X ]] [[#Y| Y ]] [[#Z| Z ]]

== A ==
'''Accession Number''' - a unique identifier given to DNA and protein sequences to allow for tracking of sequence information within a single database [http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)] (Will).

'''Arabidopsis thaliana''' - the scientific name for the thale cress plant; it was the first plant to have its genome sequenced, and is a model organism for understanding plant biology and genetics ([http://en.wikipedia.org/wiki/Thale_cress Wikipedia.org], Jay)

== B ==
'''BAC''' - bacterial artificial chromosome, a DNA construct used for transforming or cloning segments of DNA and often used to sequence the genetic code of organisms ([http://en.wikipedia.org/wiki/Bacterial_artificial_chromosome Wikipedia.org], Jay)

'''bioinformatics''' - the multi-disciplinary approach of using biology, computer science and mathematics to solve or better understand biological problems [http://en.wikipedia.org/wiki/Bioinformatics] (Matt)

'''BLAST''' - (Basic Local Alignment Search Tool) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. [http://blast.ncbi.nlm.nih.gov/Blast.cgi] (Mary)

'''bioperl'''- a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications such as accessing sequence data from local and remote databases, converting between database formats, manipulating individual sequences, searching for similar sequences, searching for genes and other structures on genomic DNA, or developing machine-readable sequence annotations. [http://en.wikipedia.org/wiki/BioPerl] (Wikipedia, Max Win)

== C ==
'''carbon fixation''' - using carbon dioxide to create organic materials [http://en.wikipedia.org/wiki/Carbon_fixation] (Samantha) 

'''CDD''' (Conserved Domains Database)- a database used to identify the conserved domains present in a protein query sequence [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml] (Mary)

'''chaperonin''' - a protein complex that assists some newly formed polypeptide chains by folding them into their final, functional, three-dimensional form [http://en.wikipedia.org/wiki/Chaperonins] (Matt)

'''chemotaxis''' - the process in which cells move toward or away from high concentrations of certain chemicals; it is found in both uni- and multicellular organisms. Unicellular organisms use it to avoid toxins or find food, while multicellular organisms use it for tasks such as reproduction [http://en.wikipedia.org/wiki/Chemotaxis] (Nick)

'''chemotaxonomy''' - the attempt to classify and identify organisms according to demonstrable differences and similarities in their biochemical compositions [http://en.wikipedia.org/wiki/Chemotaxonomy] (Mary)

'''ClustalW''' - A web-based or command line tool that performs multiple sequence alignments to determine evolutionary relationships between three or more sequences [http://en.wikipedia.org/wiki/Clustal] (Will).

'''COG''' (Cluster of Orthologous Groups)- corresponds to a highly conserved domain and generally consists of either individual proteins or groups of paralogs ([http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml COG] Pallavi) 

'''concatemer''' - long continuous DNA molecule that contains the same DNA sequence repeated in series [http://en.wikipedia.org/wiki/Concatemer](Samantha) 

'''contigs''' (contiguous DNA)- overlapping DNA segments that as a collection form a longer, gapless segment of DNA. (Discovery Genomics, Proteomics and Bioinformatics [http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''coverage''' - refers to the number of times, on average, any piece of DNA in a sequenced genome has been individually sequenced (Lecture, Jay)
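
A common back-of-the-envelope estimate of average coverage is the total number of sequenced bases divided by the genome size. The sketch below works through that arithmetic; the read count is a made-up number for illustration, while the 500 bp read length and 3.1 Mbp genome size echo figures that appear elsewhere on these pages.

```python
def mean_coverage(num_reads, read_length, genome_size):
    """Average number of times each base is sequenced: (N * L) / G."""
    return num_reads * read_length / genome_size


# Hypothetical project: 50,000 reads of about 500 bp on a 3.1 Mbp genome
print(round(mean_coverage(50_000, 500, 3_100_000), 1))
```

This is only an average. Because reads land more or less at random, some regions will be covered many more times than the mean and others not at all.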

'''CPAN (Comprehensive Perl Archive Network)''' - an archive of over 12,200 modules of software written in Perl, as well as documentation for it. It contains a module called CPAN (or CPAN.pm) which is used as an installer for Perl modules such as BioPerl [http://en.wikipedia.org/wiki/CPAN](Will).

== D ==
'''''de novo'' synthesis''' - the synthesis of complex molecules from simple molecules (e.g. sugars and nucleotides), rather than from recycled molecules; from the Latin "of the new" [http://en.wikipedia.org/wiki/De_novo_synthesis] (Matt)

'''dehydrogenase''' - a type of enzyme that oxidizes a substrate by transferring one or more protons and a pair of electrons to an acceptor. [http://en.wikipedia.org/wiki/Dehydrogenase] (Peter)

'''diatom''' - a major group of eukaryotic algae, and one of the most common types of phytoplankton. A characteristic feature of diatom cells is that they are encased within a unique cell wall made of silica called a frustule. These frustules show a wide diversity in form, but usually consist of two asymmetrical sides with a split between them. [http://en.wikipedia.org/wiki/Diatom] (Mary)

'''domain (protein)''' - the structural and functional groups of a protein, which can exist independently of the protein itself. Domains typically perform a specific function, such as binding to promoters or substrates, and many proteins can have one or several domains in common. Evolutionarily-linked proteins are more likely to have domains in common. Domains are used to organize proteins into families. ([http://en.wikipedia.org/wiki/Domain_(protein) Wikipedia article], Laura)

'''dot plot'''-graphical display comparing sequence conservation between two genomes with dots indicating strings of identical bases. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

== E ==

'''EC number''' (Enzyme Commission Number)- a numerical classification scheme for enzymes, based on the chemical reactions they catalyze [http://en.wikipedia.org/wiki/EC_number] (Mary)

'''E-value''' (Expect value)- When performing a BLAST search, you will obtain an E-value for each sequence that is retrieved. An E-value is the number of matches with an equal or better score that would be expected by chance in a database of that size; the closer it is to zero, the less likely the similarity is due to chance. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''Extremophile''' - an organism that thrives in and may even require physically or geochemically extreme conditions that are detrimental to the majority of life on Earth [http://en.wikipedia.org/wiki/Extremophile] (Will).

== F ==

'''FASTA format''' - a format used to convey either nucleic acid sequences or peptide sequences, in which base pairs or amino acids are represented by single-letter codes. The sequence name and other descriptors often precede the amino acid sequence. [http://en.wikipedia.org/wiki/FASTA_format] (Nick) 
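
To make the layout concrete, here is a small sketch of reading FASTA text: each record starts with a <code>&gt;</code> header line, and the sequence may wrap over several following lines. The function name <code>parse_fasta</code> and the two example records are assumptions for illustration; in Perl, the BioPerl modules would normally handle this job.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span many lines
    return {name: "".join(parts) for name, parts in records.items()}


example = ">gene1 demo\nATGC\nGGTA\n>gene2\nTTAA\n"
print(parse_fasta(example))
```

Storing sequence lines in a list and joining them once at the end avoids repeatedly rebuilding long strings, which matters for genome-sized records.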

'''family (protein)''' - a group of evolutionarily-related proteins, often with one or several domains in common. Families are organized by domain overlap, structural/functional similarity, and sequence similarity. ([http://en.wikipedia.org/wiki/Protein_family Wikipedia article] and lecture, Laura)

'''finished genome''' - a genome that has been sequenced at least partly by hand, resulting in at least 99.99% sequence accuracy (Lecture, Jay) 

== G ==

'''GC Content''' - the percentage of bases within a certain sequence of DNA (e.g. a gene or a genome) that are either guanine or cytosine; a higher GC content is characteristic of a coding region of a gene; differences in GC content between a gene and a genome can be used as evidence for horizontal gene transfer [http://en.wikipedia.org/wiki/GC-content] (Matt) 

'''GC-skew''' – uneven distribution of guanine and cytosine bases between the two strands of DNA where GC base pairs occur. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
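
Both GC content and GC-skew can be computed directly from base counts on one strand. The sketch below uses the usual skew formula (G - C) / (G + C); the function names are made up for this example, and in practice skew is typically calculated in sliding windows along the genome rather than for one whole sequence.

```python
def gc_content(seq):
    """Percentage of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def gc_skew(seq):
    """(G - C) / (G + C) for one strand; 0.0 if the strand has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


print(gc_content("ATGC"), gc_skew("GGGC"))
```

A positive skew means the strand has more G than C, and a negative skew means the reverse. The point along a chromosome where the sign of the skew flips is often used as a clue to the origin of replication.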

'''gene amplification''' - production of multiple copies of a gene in order to amplify the amount of protein that the gene encodes for [http://www.medterms.com/script/main/art.asp?articlekey=13537] [http://www.answers.com/topic/gene-amplification] (Matt)

'''gene knockout''' - a process in which a gene is deactivated within a test organism in order to better understand the function of the gene in that organism [http://en.wikipedia.org/wiki/Gene_knockout] (Matt)

'''gene ontology'''- a collaborative effort of investigators to unify and standardize terms associated with the role a gene or protein plays in an organism. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''glaucophyte''' - freshwater algae that have not been studied well [http://en.wikipedia.org/wiki/Glaucophyte](Samantha) 

== H ==

'''haemolysin or hemolysin''' - a chemical produced by a bacterium that causes lysis of red blood cells [http://en.wikipedia.org/wiki/Hemolysis_(microbiology)] (Nick)

'''halophile''' - an organism, most often of the Archaea domain, that lives in environments containing high concentrations of salt [http://en.wikipedia.org/wiki/Halophile] (Matt)

'''haplotype'''-collection of alleles that travel together (Lecture, Pallavi)

'''haptophyte''' - phylum of algae [http://en.wikipedia.org/wiki/Haptophyte](Samantha)

'''heterokont''' - major line of eukaryotes consisting of about 10,500 known species, most of which are algae [http://en.wikipedia.org/wiki/Heterokont](Samantha)

'''Hidden Markov Model''' - a statistical model used in protein recognition databases such as Pfam. A Hidden Markov Model keeps track of several variables and possible variations thereof, such as the possible amino acid sequences that make up a protein domain (since there can be some variance in an amino acid sequence) or the variations in the component sounds that make up a word, and uses those points to match a given sequence to the word, domain, or other complex sequence it most closely matches. An HMM in speech recognition software, for example, can identify that a certain set of sounds make up a certain word, even with the variations in pronunciation and accent that different people will give those sounds. ([http://en.wikipedia.org/wiki/Hidden_Markov_Model Wikipedia] and lecture, Laura)

'''HMM Logo''' - a graphical representation of an HMM, detailing the possible amino acid sequences, the relative frequencies and probabilities of each amino acid in the sequence, the relative contribution each amino acid has to the overall protein family, and the charge or nature of the amino acids themselves. ([http://www.sanger.ac.uk/Software/analysis/logomat-m/help.shtml How to read HMM Logos, on Pfam], Laura)

'''homeobox''' - DNA sequence within transcription factor genes that allow the cell to respond to patterns of development by having the transcription factors switch on gene cascades [http://en.wikipedia.org/wiki/Homeobox](Samantha)

'''homodimer''' - a protein made of paired identical polypeptides ([http://www.answers.com/topic/homodimer Answers.com], Jay)

'''horizontal gene transfer'''-DNA transmission between species and incorporation of the DNA into the recipient's genome ([http://www.csrees.usda.gov/nea/biotech/res/biotechnology_res_glossary.html horizontal gene transfer] Pallavi)

'''hydrolase''' - an enzyme that catalyzes hydrolysis, the cleavage of a chemical bond by the addition of a water molecule [http://en.wikipedia.org/wiki/Hydrolase] (Nick)

== I ==

'''ideogram''' - in genomics, usually describes a stylized representation of a chromosome with banding patterns (Campbell-Heyer Genomics textbook, Jay)

'''identities''' - in a BLAST output, the number and fraction of total residues which are identical in a given alignment [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)

'''indole'''-a chemical compound that is produced from the break down of tryptophan ([http://medical-dictionary.thefreedictionary.com/indole indole] Pallavi)

'''inclusion body''' - Inclusion bodies are collections of stainable substances, usually proteins, that are found either in the nucleus or the cytoplasm. It is thought that these bodies are often the result of viral proteins that misfolded [http://en.wikipedia.org/wiki/Inclusion_body] (Nick)

'''intron''' - a region of DNA in a gene that is not part of the final coding sequence for the protein. [http://en.wikipedia.org/wiki/Intron] (Peter)

'''isoelectric point''' - the pH at which a molecule carries no net electrical charge [http://en.wikipedia.org/wiki/Isoelectric_point] (Nick)

'''isozymes''' - members of a gene family with very similar cellular roles (Campbell-Heyer Genomics textbook, Jay)

== J ==

== K ==
'''KEGG (Kyoto Encyclopedia of Genes and Genomes)''' - a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The Pathway database records networks of molecular interactions in the cells, and variants of them specific to particular organisms [http://en.wikipedia.org/wiki/KEGG](Will).

'''kinase''' - a type of enzyme that transfers a phosphate group from a high-energy donor molecule to a target molecule in a process called phosphorylation. [http://en.wikipedia.org/wiki/Kinase] (Peter)

== L ==

== M ==
'''Manatee''' - a web-based gene evaluation and genome annotation tool that can view, modify, and store annotation for prokaryotic and eukaryotic genomes. This ongoing, open-source initiative was developed with two missions: first, to give biologists the ability to functionally annotate their genomes using a powerful, stand-alone web application with a robustly designed relational annotation database; and second, to give outside developers the opportunity to contribute their own ideas and requirements to enhance Manatee's ability to accomplish biological goals [http://manatee.sourceforge.net/](Will).

'''motif''' - a sequence of amino acids or nucleotides that performs a particular role and is often conserved in other species or molecules. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''mycoplasma''' - genus of bacteria that lack a cell wall [http://en.wikipedia.org/wiki/Mycoplasma] (Nick)

== N ==

'''NORFs''' (nonannotated open reading frames) - open reading frames that were considered not to be real genes when the genome was annotated. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''nucleomorph''' - reduced eukaryotic nuclei found in plastids [http://en.wikipedia.org/wiki/Nucleomorph](Samantha)

== O ==
'''object-oriented programming''' - a programming paradigm in which collections of data, associated with operations on that data, are modularly defined and then built upon (CSC 121 Lecture, Will).

'''open reading frame (ORF)'''-a segment of DNA that can potentially encode a protein; it begins with a start codon (usually ATG) and ends with a stop codon [http://www.fao.org/DOCREP/003/X3910E/X3910E18.htm ORF] (Pallavi)
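
A simple way to see what an ORF finder does is to walk each reading frame codon by codon, opening an ORF at ATG and closing it at the first in-frame stop codon. The sketch below is a simplified assumption-laden version: it scans only the forward strand, recognizes only ATG as a start, and uses a made-up <code>min_codons</code> cutoff, so it should not be read as how JGI, Manatee, or RAST call genes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=1):
    """Return sorted (start, end) pairs for forward-strand ORFs.

    start is the index of the A in ATG; end is the index just past the
    stop codon. min_codons counts codons before the stop, start included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open an ORF at the first ATG in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


print(find_orfs("CCATGAAATAGCC"))
```

Searching the reverse complement as well, and allowing alternative start codons such as GTG or TTG, would be natural extensions for a prokaryotic genome.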

'''operon''' - a segment of DNA involving an operator, promoter, and one or more genes that operate as a single unit during transcription [http://en.wikipedia.org/wiki/Operon] (Nick)

'''optical mapping'''-DNA sequences of the organism in question are compared against a karyotype that specifically looks at restriction sites found within the DNA to correctly order the DNA sequences on a chromosome. This methodology gives very detailed haplotype information and allows for the detection of sequence variations across an entire genome [http://www.geocities.com/bioinformaticsweb/genomicglossary.html optical mapping] (Pallavi)

'''ortholog'''-genes in different species that look very similar because they descend from a single gene in the last common ancestor, having diverged through speciation (Lecture, Pallavi)

'''oxidoreductase''' - an enzyme that catalyzes redox reactions by transferring electrons from one molecule (the reductant) to another (the oxidant) [http://en.wikipedia.org/wiki/Oxidoreductase] (Nick)

== P ==

'''paralog'''-similar DNA sequences within a single species' genome that arose by gene duplication (Lecture, Pallavi)

'''p-arm''' - the shorter arm of a chromosome's two arms separated by the centromere (compare to q-arm, the longer arm) ([http://www.medterms.com/script/main/art.asp?articlekey=4715 MedTerms Dictionary], Jay)

'''Perl''' - Developed by Larry Wall in 1987, Perl is a [http://en.wikipedia.org/wiki/High-level_programming_language high-level programming language] used frequently by biologists and bioinformaticists [http://en.wikipedia.org/wiki/Perl] (Will).

'''periplasmic space''' - the space between the inner cytoplasmic membrane and external outer membrane in bacteria or archaea. [http://en.wikipedia.org/wiki/Periplasmic_space] (Peter)

'''Pfam''' - a database for protein domain families that matches amino acid sequences or nucleotide sequences to the related group of proteins to which they most likely belong. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura)

'''plasmid''' - an extra-chromosomal DNA molecule that is capable of replicating independently of the chromosomal DNA. Commonly found in bacteria and archaea. [http://en.wikipedia.org/wiki/Plasmid](Peter)

'''plastid''' - major organelles in plants or algae [http://en.wikipedia.org/wiki/Plastid](Samantha)

'''pleomorphism''' - the occurrence of two or more structural forms during a life cycle [http://en.wikipedia.org/wiki/Pleomorphism] (Mary)

'''phylogenetic tree''' - a diagram showing the evolutionary relationships between biological species that are thought to share a common ancestor [http://en.wikipedia.org/wiki/Phylogenetic_tree] (Nick)

'''phylotypes''' – a term intended to resolve the challenge of “species” when classifying prokaryotes using DNA sequence comparisons. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''positives''' - in a BLAST output, the number and fraction of residues for which the alignment scores have positive rather than negative values [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)

'''proteome''' - entire set of proteins expressed by a genome, cell, tissue, or organism. It may refer to expressed proteins under certain conditions [http://en.wikipedia.org/wiki/Proteome](Samantha)

'''PSORT''' - a prediction server that judges where a mature protein could be in the cell, based on its transmembrane domains, its predicted mature amino acid composition, and its signal sequences. ([http://psort.ims.u-tokyo.ac.jp/form.html PSORT], Laura)

'''pseudogenes''' - DNA sequences that look like genes but most likely contain many stop codons. A pseudogene may have evolved away from a real gene, or a paralog might have taken its place (Lecture, Pallavi)

'''purine''' - a category of nitrogenous base consisting of a pyrimidine ring fused to an imidazole ring. Notable purine bases are adenine and guanine. [http://en.wikipedia.org/wiki/Purine] (Peter)

'''pyrimidine''' - a category of nitrogenous base consisting of a heterocyclic aromatic ring containing two nitrogen atoms at positions 1 and 3 of the six-member ring. Notable pyrimidine bases are cytosine, thymine, and uracil. [http://en.wikipedia.org/wiki/Pyrimidine] (Peter)

== Q ==

'''q-arm''' - the longer arm of a chromosome's two arms separated by the centromere (compare to p-arm, the shorter arm) ([http://www.medterms.com/script/main/art.asp?articlekey=5152 MedTerms Dictionary], Jay) 

'''query sequence''' - the sequence (whether amino acid or nucleotide) entered into a database’s search function and checked against the database entries. ([http://en.wikipedia.org/wiki/BLAST BLAST on Wikipedia], Laura)

== R ==

'''RAST''' - (Rapid Annotation using Subsystem Technology)- a fully-automated service for annotating bacterial and archaeal genomes. It provides high quality genome annotations for these genomes across the whole phylogenetic tree. ([http://rast.nmpdr.org/], Max Win)

'''rDNA''' - DNA sequences that encode ribosomal RNA. Note that rDNA can also stand for recombinant DNA. ([http://en.wikipedia.org/wiki/Ribosomal_DNA rDNA] Pallavi)

'''residue (protein)''' - the remaining portion of an amino acid after a water molecule has been removed and it has been incorporated into a protein. Functional residues, referred to in Pfam, are the residues that perform some specific identifiable function or are part of a domain, and can be conserved across evolutionarily-related proteins. ([http://pfam.sanger.ac.uk/help Pfam Help], Laura)

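'''retrotransposons''' - genetic elements that copy themselves through an RNA intermediate: the element is transcribed into RNA, reverse-transcribed back into DNA, and added into the genome at a new site [http://en.wikipedia.org/wiki/Retrotransposon](Samantha)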

'''ribonuclease''' - a nuclease that catalyzes the degradation of RNA into smaller components [http://en.wikipedia.org/wiki/Ribonuclease] (Mary)

== S ==
'''Serovar'''-a subdivision of a species based on the characteristics of their cell surface antigens ([http://www.biology-online.org/dictionary/Serovar serovar] Pallavi)

'''scaffold''' - a section of a sequenced genome composed of contigs that are in the right order but not necessarily connected ([http://www.medterms.com/script/main/art.asp?articlekey=25223 MedTerms Dictionary], Jay)

'''Shine-Dalgarno sequence''' - A ribosomal binding site on an mRNA, usually a sequence of about six nucleotides located six or seven nucleotides upstream of the start codon. An anti-Shine-Dalgarno sequence exists on the rRNA in the small subunit of the ribosome; when the two sequences pair, the mRNA is lined up and prepared for translation. (Lecture and [http://en.wikipedia.org/wiki/Shine-dalgarno Wikipedia article], Laura) 
Note: The Shine-Dalgarno consensus sequence for our genome is ccGGAGGt.
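
As a rough illustration of the definition above, the sketch below looks for a Shine-Dalgarno-like motif just upstream of a known start codon and reports the spacing between them. The function name, the default motif <code>GGAGG</code> (the core of the consensus in the note above), and the 20-nucleotide search window are assumptions chosen for the example, not part of any annotation tool.

```python
def find_sd_upstream(dna, start_codon_index, motif="GGAGG", max_upstream=20):
    """Return the number of nucleotides between the end of the motif and the
    start codon, or None if the motif is not found in the upstream window.

    dna               -- coding-strand sequence, 5' to 3'
    start_codon_index -- 0-based index of the first base of the start codon
    motif, max_upstream -- illustrative defaults (assumptions)
    """
    dna = dna.upper()
    window_start = max(0, start_codon_index - max_upstream)
    region = dna[window_start:start_codon_index]
    pos = region.rfind(motif)  # take the occurrence closest to the start codon
    if pos == -1:
        return None
    motif_end = window_start + pos + len(motif)
    return start_codon_index - motif_end


if __name__ == "__main__":
    seq = "TTTT" + "CCGGAGGT" + "AACAAC" + "ATG" + "AAATAA"
    print(find_sd_upstream(seq, seq.index("ATG")))
```

An exact string match is the simplest possible choice; real Shine-Dalgarno sites often differ from the consensus by a base or two, so a scoring approach would be needed for actual annotation.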

'''SignalP''' - a prediction server that judges whether or not a query protein contains a signal peptide. SignalP measures each amino acid against the amino acid sequences of probable signal peptide matches and predicts the cleavage site of the signal peptide. ([http://www.cbs.dtu.dk/services/SignalP-3.0/output.php SignalP Output explained], Laura)

'''signal peptide''' - a short peptide chain that directs the post-translational transport of a protein [http://en.wikipedia.org/wiki/Signal_peptide] (Matt)

'''Smith-Waterman alignment''' - A well-known algorithm for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure [http://en.wikipedia.org/wiki/Smith_waterman](Will).
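
The local-alignment idea can be sketched in a few lines of Python. This is a minimal version with a linear gap penalty; the scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions for illustration, and real tools use substitution matrices and affine gap costs.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere, which makes it local
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGACGGG"))
```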

'''SNP (Single Nucleotide Polymorphism)''' - a DNA sequence variation occurring when a single nucleotide in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual) [http://en.wikipedia.org/wiki/Single_nucleotide_polymorphism](Will).

'''symporter''' - an integral membrane protein that is involved in movement of two or more different molecules or ions in the same direction across a phospholipid membrane. [http://en.wikipedia.org/wiki/Symporter] (Peter)

'''synteny''' - a neologism from the Greek for "on the same ribbon". Genes that are syntenic in one species are on the same chromosome; genes that are syntenic across species retain the same order on respective chromosomes as a result of descent from a common ancestor ([http://www.answers.com/synteny Answers.com], Jay)

'''synthetase''' - a type of enzyme that creates a new covalent bond and requires direct input of energy from a high-energy phosphate. [http://books.google.com/books?id=bB8XnCykRmIC&pg=PA522&lpg=PA522&dq=%22synthetase+is+an+enzyme%22&source=web&ots=wkws4ksMsg&sig=zWLkDIk7T78hcf9S84nWs3u5Apw&hl=en&sa=X&oi=book_result&resnum=9&ct=result] (Peter)

== T ==
'''transferase''' - an enzyme that catalyzes the transfer of a functional group from one molecule (the donor) to another (the acceptor) [http://en.wikipedia.org/wiki/Transferase] (Matt)

'''transmembrane helix''' - a single transmembrane alpha helix of a transmembrane protein, usually about twenty amino acids in length. They are usually predicted by hydrophobicity. [http://en.wikipedia.org/wiki/Transmembrane_domain](Mary)
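
One simple way to see how hydrophobicity can flag a candidate transmembrane helix is a sliding-window average over the Kyte-Doolittle hydropathy scale. The sketch below is only an illustration: the function names are made up for this page, and the window of 19 residues and cutoff of 1.6 are commonly cited defaults treated here as assumptions, not the method used by any particular prediction server.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_windows(protein, window=19):
    """Average hydropathy of every window of the given length."""
    protein = protein.upper()
    return [
        sum(KD[aa] for aa in protein[i:i + window]) / window
        for i in range(len(protein) - window + 1)
    ]


def candidate_tm_starts(protein, window=19, threshold=1.6):
    """0-based start positions of windows whose average meets the cutoff."""
    scores = hydropathy_windows(protein, window)
    return [i for i, score in enumerate(scores) if score >= threshold]


if __name__ == "__main__":
    toy = "D" * 10 + "L" * 20 + "K" * 10  # charged ends, hydrophobic core
    print(candidate_tm_starts(toy))
```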

'''transposons / transposable elements''' - DNA sequences that can move around to different positions in a single cell's genome. Transposons can cause mutations and change the length of the genome. [http://en.wikipedia.org/wiki/Transposon](Samantha)

'''Transposon Mutagenesis'''-a procedure in which a transposon is inserted into a gene, which inactivates the gene and can lead to the discovery of the phenotype associated with this gene ([http://cancerweb.ncl.ac.uk/cgi-bin/omd?transposon+mutagenesis transposon mutagenesis] Pallavi)

'''tRNA splicing endonuclease''' - an enzyme that cleaves intervening sequences of precursor tRNA. [http://cancerweb.ncl.ac.uk/cgi-bin/omd?splicing+endonuclease] (Peter) 

== U ==

== V ==

== W ==

'''whole genome shotgun sequencing''' - a method of sequencing where DNA is cut into small pieces and cloned into vectors, then both ends of every vector are sequenced in about 500 bps to form mate pairs. Mate pairs rarely overlap, but are used to reassemble the sequence using software. [http://en.wikipedia.org/wiki/Whole_genome_shotgun](Samantha)
 

== X ==
'''xenolog''' - homologs that are created by horizontal gene transfer between two different species [http://en.wikipedia.org/wiki/Xenolog#Xenology] (Matt) 

== Y ==

== Z ==

 
<HR>
<HR>

== This is a list of the student-created tutorials: ==

Halorhabdus utahensis Genome

2008-09-23T13:52:45Z

SaSimpson: /* Other Resources */

This page will be used by Davidson College students in the [http://www.bio.davidson.edu/Courses/Bio343/LabMethods.html Genomics Laboratory course].
 

== RNA Genes ==

[[tRNA Genes Check List]] 
[[rRNA operon]] 
[[2 misc. RNA genes]] (short summary list) 
[[References]] 
[[Gene Annotation Template]] 
[[General Questions]] 
[[Page for Annotated Genes]] 

== Other Resources ==

[[Consensus Shine Dalgarno]] Excel File for ''H. utahensis'' 
Tutorials for annotating genomes 
# Will DeLoache- BioPerl Installation 
# Max Win- Introduction to Perl for non-programmers.(with step by step explanations,simple exercises and solutions) 
# Pallavi-Conserved Domains Database (CDD) 
# Mary- Protein Data Bank 
# Laura Voss - Pfam Database 
# Samantha Simpson - NCBI Blast (protein, nucleotide, and blast2) 

<hr>

== This is a list of glossary words (A - Z): ==
[[#A| A ]] [[#B| B ]] [[#C| C ]] [[#D| D ]] [[#E| E ]] [[#F| F ]] [[#G| G ]] [[#H| H ]] [[#I| I ]] [[#J| J ]] [[#K| K ]] [[#L| L ]] [[#M| M ]] [[#N| N ]] [[#O| O ]] [[#P| P ]] [[#Q| Q ]] [[#R| R ]] [[#S| S ]] [[#T| T ]] [[#U| U ]] [[#V| V ]] [[#W| W ]] [[#X| X ]] [[#Y| Y ]] [[#Z| Z ]]

== A ==
'''Accession Number''' - a unique identifier given to DNA and protein sequences to allow for tracking of sequence information within a single database [http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)] (Will).

'''Arabidopsis thaliana''' - the scientific name for the thale cress plant; it was the first plant to have its genome sequenced, and is a model organism for understanding plant biology and genetics ([http://en.wikipedia.org/wiki/Thale_cress Wikipedia.org], Jay)

== B ==
'''BAC''' - bacterial artificial chromosome, a DNA construct used for transforming or cloning segments of DNA and often used to sequence the genetic code of organisms ([http://en.wikipedia.org/wiki/Bacterial_artificial_chromosome Wikipedia.org], Jay)

'''bioinformatics''' - the multi-disciplinary approach of using biology, computer science and mathematics to solve or better understand biological problems [http://en.wikipedia.org/wiki/Bioinformatics] (Matt)

'''BLAST''' - (Basic Local Alignment Search Tool) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. [http://blast.ncbi.nlm.nih.gov/Blast.cgi] (Mary)

'''bioperl''' - a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications such as accessing sequence data from local and remote databases, converting between database formats, manipulating individual sequences, searching for similar sequences, searching for genes and other structures on genomic DNA, or developing machine-readable sequence annotations. [http://en.wikipedia.org/wiki/BioPerl] (Wikipedia, Max Win)

== C ==
'''carbon fixation''' - using carbon dioxide to create organic materials [http://en.wikipedia.org/wiki/Carbon_fixation] (Samantha) 

'''CDD''' (Conserved Domains Database)- a database used to identify the conserved domains present in a protein query sequence [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml] (Mary)

'''chaperonin''' - a protein complex that assists some newly formed polypeptide chains by folding them into their final, functional, three-dimensional form [http://en.wikipedia.org/wiki/Chaperonins] (Matt)

'''chemotaxis''' - the process in which cells seek out or flee from a high concentration of certain chemicals; it is found in both uni- and multicellular organisms. This process is used to avoid toxins or find food in unicellular organisms, or for tasks such as reproduction in multicellular organisms [http://en.wikipedia.org/wiki/Chemotaxis] (Nick)

'''chemotaxonomy''' - the attempt to classify and identify organisms according to demonstrable differences and similarities in their biochemical compositions [http://en.wikipedia.org/wiki/Chemotaxonomy] (Mary)

'''ClustalW''' - A web-based or command line tool that performs multiple sequence alignment to determine evolutionary relationships between three or more sequences [http://en.wikipedia.org/wiki/Clustal] (Will).

'''COG''' (Clusters of Orthologous Groups) - each COG corresponds to a highly conserved domain and generally consists of either individual proteins or groups of paralogs ([http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml COG] Pallavi) 

'''concatemer''' - long continuous DNA molecule that contains the same DNA sequence repeated in series [http://en.wikipedia.org/wiki/Concatemer](Samantha) 

'''contigs''' (contiguous DNA) - overlapping DNA segments that as a collection form a longer and gapless segment of DNA. (Discovery Genomics, Proteomics and Bioinformatics [http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''coverage''' - refers to the number of times, on average, any piece of DNA in a sequenced genome has been individually sequenced (Lecture, Jay)
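
For a quick estimate, average coverage is usually computed as (number of reads × read length) / genome length. The sketch below shows that arithmetic; the function name and the example numbers are invented for illustration and do not describe any particular sequencing project.

```python
def average_coverage(num_reads, read_length, genome_length):
    """Average number of times each base is sequenced.

    Assumes all reads have the same length and map somewhere in the genome.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return num_reads * read_length / genome_length


if __name__ == "__main__":
    # Hypothetical numbers: 48,000 reads of 500 bp over a 3 Mb genome
    print(average_coverage(48_000, 500, 3_000_000))
```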

'''CPAN (Comprehensive Perl Archive Network)''' - an archive of over 12,200 modules of software written in Perl, as well as documentation for it. It contains a module called CPAN (or CPAN.pm) which is used as an installer for Perl modules such as BioPerl [http://en.wikipedia.org/wiki/CPAN](Will).

== D ==
'''''de novo'' synthesis''' - the synthesis of complex molecules from simple molecules (e.g. sugars and nucleotides), rather than from recycled molecules; from the Latin for "of the new" [http://en.wikipedia.org/wiki/De_novo_synthesis] (Matt)

'''dehydrogenase''' - a type of enzyme that oxidizes a substrate by transferring one or more protons and a pair of electrons to an acceptor. [http://en.wikipedia.org/wiki/Dehydrogenase] (Peter)

'''diatom''' - a major group of eukaryotic algae, and one of the most common types of phytoplankton. A characteristic feature of diatom cells is that they are encased within a unique cell wall made of silica called a frustule. These frustules show a wide diversity in form, but usually consist of two asymmetrical sides with a split between them. [http://en.wikipedia.org/wiki/Diatom] (Mary)

'''dot plot'''-graphical display comparing sequence conservation between two genomes with dots indicating strings of identical bases. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
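
A dot plot is easy to compute for short sequences. The sketch below places a dot wherever a word of a fixed length is identical in both sequences; the function names and the default word length of 3 are assumptions for the example.

```python
def dot_plot(seq_a, seq_b, word=3):
    """Return the set of (i, j) where seq_a[i:i+word] == seq_b[j:j+word]."""
    # Index every word of seq_b by its start position
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)
    dots = set()
    for i in range(len(seq_a) - word + 1):
        for j in index.get(seq_a[i:i + word], ()):
            dots.add((i, j))
    return dots


def render(dots, len_a, len_b, word=3):
    """Draw the plot as text rows: '*' for a dot, '.' otherwise."""
    return [
        "".join("*" if (i, j) in dots else "." for j in range(len_b - word + 1))
        for i in range(len_a - word + 1)
    ]


if __name__ == "__main__":
    a = b = "ACGTAC"
    print("\n".join(render(dot_plot(a, b), len(a), len(b))))
```

Conserved regions appear as diagonal runs of dots; a longer word length reduces random background dots at the cost of missing short matches.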

== E ==

'''EC number''' (Enzyme Commission Number)- a numerical classification scheme for enzymes, based on the chemical reactions they catalyze [http://en.wikipedia.org/wiki/EC_number] (Mary)

'''E-value''' (Expect value) - When performing a BLAST search, you will obtain an E-value for each sequence that is retrieved. An E-value is the number of hits with a score at least this good that you would expect to find by chance in a database of that size; the closer it is to zero, the less likely it is that the two sequences are similar to each other by chance. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
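
In BLAST's statistics, the E-value is tied to the bit score S' by E = m × n × 2<sup>-S'</sup>, where m and n are the lengths of the query and the database. The sketch below just evaluates that formula; the function names are made up, and it uses raw lengths where BLAST itself applies effective-length corrections, so the numbers will not match BLAST output exactly.

```python
import math


def expect_value(bit_score, query_length, database_length):
    """E = m * n * 2**(-S') for a bit score S' (raw lengths, an assumption)."""
    return query_length * database_length * 2.0 ** (-bit_score)


def p_value(e_value):
    """Probability of at least one chance hit, P = 1 - exp(-E)."""
    return 1.0 - math.exp(-e_value)


if __name__ == "__main__":
    e = expect_value(bit_score=40, query_length=300, database_length=10**9)
    print(e, p_value(e))
```

For small E-values, P and E are nearly equal, which is why the E-value is often read informally as a probability.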

'''Extremophile''' - an organism that thrives in and may even require physically or geochemically extreme conditions that are detrimental to the majority of life on Earth [http://en.wikipedia.org/wiki/Extremophile] (Will).

== F ==

'''FASTA format''' - a format used to convey either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented by single-letter codes. A header line beginning with ">" and holding the sequence name and other descriptors precedes each sequence. [http://en.wikipedia.org/wiki/FASTA_format] (Nick) 
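
To make the layout concrete, here is a minimal FASTA reader in plain Python. It is a sketch for illustration only (the function name and the example records are invented); for real work a library such as BioPerl or Biopython handles many more edge cases.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">gene1 hypothetical protein\nATGGCC\nTAA\n>gene2\nATGTAG\n"
    for name, seq in parse_fasta(example):
        print(name, len(seq))
```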

'''finished genome''' - a genome that has been sequenced at least partly by hand, resulting in at least 99.99% sequence accuracy (Lecture, Jay) 

== G ==

'''GC Content''' - the percentage of bases within a certain sequence of DNA (e.g. a gene or a genome) that are either guanine or cytosine; a higher GC content is characteristic of a coding region of a gene; differences in GC content between a gene and a genome can be used as evidence for horizontal gene transfer [http://en.wikipedia.org/wiki/GC-content] (Matt) 
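
GC content is simple to compute directly. The following sketch counts only unambiguous bases (A, C, G, T) and returns a percentage; the function name and the choice to ignore other characters such as N are assumptions for the example.

```python
def gc_content(seq):
    """Percentage of A/C/G/T bases in seq that are G or C (0.0 if none)."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


if __name__ == "__main__":
    print(gc_content("GGCCAATT"))
```

Comparing the value for a single gene with the value for the whole genome is the kind of check described above for spotting possible horizontal gene transfer.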

'''GC-skew''' – uneven distribution of guanine and cytosine bases between the two strands of DNA where GC base pairs occur. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)
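
GC-skew is commonly quantified on one strand as (G - C) / (G + C), computed in windows along the sequence. The sketch below assumes that formula; the function names and non-overlapping windows are choices made for this example.

```python
def gc_skew(seq):
    """(G - C) / (G + C) for one strand; 0.0 if the window has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return 0.0 if g + c == 0 else (g - c) / (g + c)


def windowed_gc_skew(seq, window):
    """GC-skew of consecutive non-overlapping windows along the sequence."""
    return [
        gc_skew(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, window)
    ]


if __name__ == "__main__":
    print(windowed_gc_skew("GGGGCCCC", 4))
```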

'''gene amplification''' - production of multiple copies of a gene in order to amplify the amount of protein that the gene encodes for [http://www.medterms.com/script/main/art.asp?articlekey=13537] [http://www.answers.com/topic/gene-amplification] (Matt)

'''gene knockout''' - a process in which a gene is deactivated within a test organism in order to better understand the function of the gene in that organism [http://en.wikipedia.org/wiki/Gene_knockout] (Matt)

'''gene ontology''' - a collaborative effort of investigators to unify and standardize terms associated with the role a gene or protein plays in an organism. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''glaucophyte''' - freshwater algae that have not been studied well [http://en.wikipedia.org/wiki/Glaucophyte](Samantha) 

== H ==

'''haemolysin or hemolysin''' - a chemical produced by bacteria that causes lysis of red blood cells [http://en.wikipedia.org/wiki/Hemolysis_(microbiology)] (Nick)

'''halophile''' - an organism, most often of the Archaea domain, that lives in environments containing high concentrations of salt [http://en.wikipedia.org/wiki/Halophile] (Matt)

'''haplotype'''-collection of alleles that travel together (Lecture, Pallavi)

'''haptophyte''' - phylum of algae [http://en.wikipedia.org/wiki/Haptophyte](Samantha)

'''heterokont''' - major line of eukaryotes consisting of about 10,500 known species, most of which are algae [http://en.wikipedia.org/wiki/Heterokont](Samantha)

'''homeobox''' - DNA sequence within transcription factor genes that allow the cell to respond to patterns of development by having the transcription factors switch on gene cascades [http://en.wikipedia.org/wiki/Homeobox](Samantha)

'''homodimer''' - a protein made of paired identical polypeptides ([http://www.answers.com/topic/homodimer Answers.com], Jay)

'''horizontal gene transfer'''-DNA transmission between species and incorporation of the DNA into the recipient's genome ([http://www.csrees.usda.gov/nea/biotech/res/biotechnology_res_glossary.html horizontal gene transfer] Pallavi)

'''hydrolase''' - an enzyme that catalyzes hydrolysis, the cleavage of a chemical bond by the addition of water [http://en.wikipedia.org/wiki/Hydrolase] (Nick)

== I ==

'''ideogram''' - in genomics, usually describes a stylized representation of a chromosome with banding patterns (Campbell-Heyer Genomics textbook, Jay)

'''identities''' - in a BLAST output, the number and fraction of total residues which are identical in a given alignment [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)
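
The identities figure can be reproduced by hand from a pairwise alignment. The sketch below counts identical columns in two gapped, equal-length strings and formats the result in the style BLAST reports it; the function name is invented, and treating a gap as a non-identical column is an assumption of this example.

```python
def identities(aligned_a, aligned_b):
    """Return 'matches/length (percent%)' for two aligned, gapped strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return "0/0 (0%)"
    # A column counts as identical only if both residues match and neither is a gap
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    percent = round(100.0 * matches / len(aligned_a))
    return f"{matches}/{len(aligned_a)} ({percent}%)"


if __name__ == "__main__":
    print(identities("ACG-TA", "ACGTTA"))
```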

'''indole'''-a chemical compound that is produced from the break down of tryptophan ([http://medical-dictionary.thefreedictionary.com/indole indole] Pallavi)

'''inclusion body''' - Inclusion bodies are collections of stainable substances, usually proteins, that are found either in the nucleus or the cytoplasm. It is thought that these bodies are often the result of viral proteins that misfolded [http://en.wikipedia.org/wiki/Inclusion_body] (Nick)

'''intron''' - a region of DNA in a gene that is not part of the final coding sequence for the protein. [http://en.wikipedia.org/wiki/Intron] (Peter)

'''isoelectric point''' - the pH at which a molecule carries no net electrical charge [http://en.wikipedia.org/wiki/Isoelectric_point] (Nick)

'''isozymes''' - members of a gene family with very similar cellular roles (Campbell-Heyer Genomics textbook, Jay)

== J ==

== K ==

'''kinase''' - a type of enzyme that transfers a phosphate group from a high-energy donor molecule to a target molecule in a process called phosphorylation. [http://en.wikipedia.org/wiki/Kinase] (Peter)

== L ==

== M ==

'''motif''' - a sequence of amino acids or nucleotides that performs a particular role and is often conserved in other species or molecules. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''mycoplasma''' - genus of bacteria that lack a cell wall [http://en.wikipedia.org/wiki/Mycoplasma] (Nick)

== N ==

'''NORFs''' (nonannotated open reading frames) - open reading frames that were considered not to be real genes when the genome was annotated. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''nucleomorph''' - reduced eukaryotic nuclei found in plastids [http://en.wikipedia.org/wiki/Nucleomorph](Samantha)

== O ==
'''object-oriented programming''' - a programming paradigm in which collections of data, associated with operations on that data, are modularly defined and then built upon (CSC 121 Lecture, Will).

'''open reading frame (ORF)''' - a segment of DNA that can potentially encode a protein; it begins with a start codon (usually ATG) and ends at a stop codon [http://www.fao.org/DOCREP/003/X3910E/X3910E18.htm ORF] (Pallavi)
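
A very small ORF finder shows what "potentially encode" means in practice. This sketch scans only the three forward reading frames, accepts only ATG as a start, and uses the standard stop codons; the function name and the <code>min_codons</code> parameter are assumptions for the example, and real gene callers also check the reverse strand and alternative start codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=1):
    """Return sorted (start, end) pairs, end exclusive, for ATG...stop ORFs.

    min_codons is the minimum number of codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG in this frame opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


if __name__ == "__main__":
    seq = "CCATGAAATTTTAGCC"
    for begin, end in find_orfs(seq):
        print(begin, end, seq[begin:end])
```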

'''operon''' - a segment of DNA involving an operator, promoter, and one or more genes that operate as a single unit during transcription [http://en.wikipedia.org/wiki/Operon] (Nick)

'''optical mapping'''-DNA sequences of the organism in question are compared against a karyotype that specifically looks at restriction sites found within the DNA to correctly order the DNA sequences on a chromosome. This methodology gives very detailed haplotype information and allows for the detection of sequence variations across an entire genome [http://www.geocities.com/bioinformaticsweb/genomicglossary.html optical mapping] (Pallavi)

'''ortholog''' - genes in different species that look very similar because they descend from a single gene in the last common ancestor of those species (Lecture, Pallavi)

'''oxidoreductase''' - an enzyme that catalyzes redox reactions by transferring electrons from one molecule (the reductant) to another (the oxidant) [http://en.wikipedia.org/wiki/Oxidoreductase] (Nick)

== P ==

'''paralog''' - similar (not necessarily identical) genes within a species that arose by duplication of an ancestral gene (Lecture, Pallavi)

'''p-arm''' - the shorter arm of a chromosome's two arms separated by the centromere (compare to q-arm, the longer arm) ([http://www.medterms.com/script/main/art.asp?articlekey=4715 MedTerms Dictionary], Jay)

'''Perl''' - Developed by Larry Wall in 1987, Perl is a [http://en.wikipedia.org/wiki/High-level_programming_language high-level programming language] used frequently by biologists and bioinformaticists [http://en.wikipedia.org/wiki/Perl] (Will).

'''periplasmic space''' - the space between the inner cytoplasmic membrane and external outer membrane in bacteria or archaea. [http://en.wikipedia.org/wiki/Periplasmic_space] (Peter)

'''plasmid''' - an extra-chromosomal DNA molecule that is capable of replicating independently of the chromosomal DNA. Commonly found in bacteria and archaea. [http://en.wikipedia.org/wiki/Plasmid](Peter)

'''plastid''' - major organelles in plants or algae [http://en.wikipedia.org/wiki/Plastid](Samantha)

'''pleomorphism''' - the occurrence of two or more structural forms during a life cycle [http://en.wikipedia.org/wiki/Pleomorphism] (Mary)

'''phylogenetic tree''' - a diagram showing the evolutionary relationships between biological species that are thought to share a common ancestor [http://en.wikipedia.org/wiki/Phylogenetic_tree] (Nick)

'''phylotypes''' – a term intended to resolve the challenge of “species” when classifying prokaryotes using DNA sequence comparisons. (Discovery Genomics, Proteomics and Bioinformatics[http://wps.aw.com/bc_campbell_genomics_2/43/11232/2875502.cw/index.html], Max Win)

'''positives''' - in a BLAST output, the number and fraction of residues for which the alignment scores have positive rather than negative values [http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)

'''proteome''' - entire set of proteins expressed by a genome, cell, tissue, or organism. It may refer to expressed proteins under certain conditions [http://en.wikipedia.org/wiki/Proteome](Samantha)

'''pseudogenes''' - A sequence of DNA that looks like a gene but is most likely nonfunctional, often because it contains premature stop codons. It may have evolved away from a real gene or a paralog might have taken its place (Lecture, Pallavi)

'''purine''' - a category of nitrogenous base consisting of a pyrimidine ring fused to an imidazole ring. Notable purine bases are adenine and guanine. [http://en.wikipedia.org/wiki/Purine] (Peter)

'''pyrimidine''' - a category of nitrogenous base consisting of a heterocyclic aromatic ring containing two nitrogen atoms at positions 1 and 3 of the six-member ring. Notable pyrimidine bases are cytosine, thymine, and uracil. [http://en.wikipedia.org/wiki/Pyrimidine] (Peter)

== Q ==

'''q-arm''' - the longer arm of a chromosome's two arms separated by the centromere (compare to p-arm, the shorter arm) ([http://www.medterms.com/script/main/art.asp?articlekey=5152 MedTerms Dictionary], Jay)

== R ==

'''RAST''' - (Rapid Annotation using Subsystem Technology)- a fully-automated service for annotating bacterial and archaeal genomes. It provides high quality genome annotations for these genomes across the whole phylogenetic tree. ([http://rast.nmpdr.org/], Max Win)

'''rDNA''' - DNA sequences that encode ribosomal RNA. Note that rDNA can also stand for recombinant DNA. ([http://en.wikipedia.org/wiki/Ribosomal_DNA rDNA] Pallavi)

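'''retrotransposons''' - genetic elements that copy themselves through an RNA intermediate: the element is transcribed into RNA, reverse-transcribed back into DNA, and added into the genome at a new site [http://en.wikipedia.org/wiki/Retrotransposon](Samantha)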

'''ribonuclease''' - a nuclease that catalyzes the degradation of RNA into smaller components [http://en.wikipedia.org/wiki/Ribonuclease] (Mary)

== S ==
'''Serovar'''-a subdivision of a species based on the characteristics of their cell surface antigens ([http://www.biology-online.org/dictionary/Serovar serovar] Pallavi)

'''scaffold''' - a section of a sequenced genome composed of contigs that are in the right order but not necessarily connected ([http://www.medterms.com/script/main/art.asp?articlekey=25223 MedTerms Dictionary], Jay)

'''Shine-Dalgarno sequence''' - A ribosomal binding site on an mRNA, usually a sequence of about six nucleotides located six or seven nucleotides upstream of the start codon. An anti-Shine-Dalgarno sequence exists on the rRNA in the small subunit of the ribosome; when the two sequences pair, the mRNA is lined up and prepared for translation. (Lecture and [http://en.wikipedia.org/wiki/Shine-dalgarno Wikipedia article], Laura) 
Note: The Shine-Dalgarno consensus sequence for our genome is TAGGAGG.

'''signal peptide''' - a short peptide chain that directs the post-translational transport of a protein [http://en.wikipedia.org/wiki/Signal_peptide] (Matt)

'''Smith-Waterman alignment''' - A well-known algorithm for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure [http://en.wikipedia.org/wiki/Smith_waterman](Will).

'''SNP (Single Nucleotide Polymorphism)''' - a DNA sequence variation occurring when a single nucleotide in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual) [http://en.wikipedia.org/wiki/Single_nucleotide_polymorphism](Will).

'''symporter''' - an integral membrane protein that is involved in movement of two or more different molecules or ions in the same direction across a phospholipid membrane. [http://en.wikipedia.org/wiki/Symporter] (Peter)

'''synteny''' - a neologism from the Greek for "on the same ribbon". Genes that are syntenic in one species are on the same chromosome; genes that are syntenic across species retain the same order on respective chromosomes as a result of descent from a common ancestor ([http://www.answers.com/synteny Answers.com], Jay)

'''synthetase''' - a type of enzyme that creates a new covalent bond and requires direct input of energy from a high-energy phosphate. [http://books.google.com/books?id=bB8XnCykRmIC&pg=PA522&lpg=PA522&dq=%22synthetase+is+an+enzyme%22&source=web&ots=wkws4ksMsg&sig=zWLkDIk7T78hcf9S84nWs3u5Apw&hl=en&sa=X&oi=book_result&resnum=9&ct=result] (Peter)

== T ==
'''transferase''' - an enzyme that catalyzes the transfer of a functional group from one molecule (the donor) to another (the acceptor) [http://en.wikipedia.org/wiki/Transferase] (Matt)

'''transmembrane helix''' - a single transmembrane alpha helix of a transmembrane protein, usually about twenty amino acids in length. They are usually predicted by hydrophobicity. [http://en.wikipedia.org/wiki/Transmembrane_domain](Mary)

'''transposons / transposable elements''' - DNA sequences that can move around to different positions in a single cell's genome. Transposons can cause mutations and change the length of the genome. [http://en.wikipedia.org/wiki/Transposon](Samantha)

'''Transposon Mutagenesis'''-a procedure in which a transposon is inserted into a gene, which inactivates the gene and can lead to the discovery of the phenotype associated with this gene ([http://cancerweb.ncl.ac.uk/cgi-bin/omd?transposon+mutagenesis transposon mutagenesis] Pallavi)

'''tRNA splicing endonuclease''' - an enzyme that cleaves intervening sequences of precursor tRNA. [http://cancerweb.ncl.ac.uk/cgi-bin/omd?splicing+endonuclease] (Peter) 

== U ==

== V ==

== W ==

'''whole genome shotgun sequencing''' - a method of sequencing where DNA is cut into small pieces and cloned into vectors, then both ends of every vector are sequenced in about 500 bps to form mate pairs. Mate pairs rarely overlap, but are used to reassemble the sequence using software. [http://en.wikipedia.org/wiki/Whole_genome_shotgun](Samantha)
 

== X ==
'''xenolog''' - homologs that are created by horizontal gene transfer between two different species [http://en.wikipedia.org/wiki/Xenolog#Xenology] (Matt) 

== Y ==

== Z ==

 
<HR>
<HR>

== This is a list of the student-created tutorials: ==

Halorhabdus utahensis Genome

2008-09-18T14:50:17Z

SaSimpson: /* Other Resources */

Halorhabdus utahensis Genome

2008-09-18T02:25:26Z

SaSimpson: /* W */

Halorhabdus utahensis Genome

2008-09-18T02:24:52Z

SaSimpson: /* T */

Halorhabdus utahensis Genome

2008-09-18T02:24:13Z

SaSimpson: /* R */

Halorhabdus utahensis Genome

2008-09-18T02:23:37Z

SaSimpson: /* P */

Halorhabdus utahensis Genome

2008-09-18T02:22:43Z

SaSimpson: /* N */

Halorhabdus utahensis Genome

2008-09-18T02:22:05Z

SaSimpson: /* H */

Halorhabdus utahensis Genome

2008-09-18T02:19:31Z

SaSimpson: /* G */

Halorhabdus utahensis Genome

2008-09-18T02:16:57Z

SaSimpson: /* C */

TRNA Genes Check List

2008-08-31T16:52:10Z

SaSimpson:

As you find your anticodons, please change the 3 letter codon in the table to '''''bold and italics'''''. For example, if you find 5' GGG 3' is your anticodon, then it will bind to the codon CCC listed in the table and you will convert the '''''CCC''''' to bold and italics. To keep track of the 5' to 3' portion, use the genetic code and the amino acid automatically assigned to your tRNA gene to make sure you don't have it reversed.
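
If it helps to double-check the direction, the codon that pairs with an anticodon is its reverse complement when both are written 5' to 3'. The short sketch below does that conversion with DNA letters, as used in the table; the function name is made up for this page, and wobble pairing is deliberately ignored.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def codon_for_anticodon(anticodon):
    """Given an anticodon written 5'->3', return the codon written 5'->3'."""
    dna = anticodon.upper().replace("U", "T")  # accept RNA letters too
    return dna.translate(COMPLEMENT)[::-1]     # complement, then reverse


if __name__ == "__main__":
    for anticodon in ("GGG", "CAT"):
        print(anticodon, "->", codon_for_anticodon(anticodon))
```

For the example above, 5' GGG 3' gives CCC (Pro), and 5' CAT 3' gives ATG (Met).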

<center>
<h1>Table of Standard Genetic Code</h1>
<hr>

<table border>
<tr>
<td></td>
<th colspan=1>T</th>
<th colspan=1>C</th>
<th colspan=1>A</th>
<th colspan=1>G</th>
</tr>

<tr>
<th rowspan=1>T</th>
<td>
TTT Phe (F) 
'''''TTC''''' Phe (F) 
'''''TTA''''' Leu (L) 
TTG Leu (L) 
</td>

<td>
TCT Ser (S) 
'''''TCC''''' Ser (S) 
TCA Ser (S) 
TCG Ser (S) 

</td>

<td>
TAT Tyr (Y) 
'''''TAC''''' Tyr (Y) 
TAA Stop 
TAG Stop 
</td>

<td>

TGT Cys (C) 
TGC Cys (C) 
TGA Stop 
TGG Trp (W) 
</td>
</tr>

<tr>
<th rowspan=1>C</th>

<td>
CTT Leu (L) 
CTC Leu (L) 
CTA Leu (L) 
'''''CTG''''' Leu (L) 
</td>

<td>
CCT Pro (P) 
CCC Pro (P) 
'''''CCA''''' Pro (P) 
'''''CCG''''' Pro (P) 
</td>

<td>
CAT His (H) 
CAC His (H) 
CAA Gln (Q) 
CAG Gln (Q) 
</td>

<td>
CGT Arg (R) 
CGC Arg (R) 
CGA Arg (R) 
CGG Arg (R) 
</td>
</tr>

<tr>
<th rowspan=1>A</th>

<td>
ATT Ile (I) 
ATC Ile (I) 
ATA Ile (I) 
'''''ATG''''' Met (M) 
</td>

<td>
ACT Thr (T) 
ACC Thr (T) 
ACA Thr (T) 
ACG Thr (T) 
</td>

<td>
AAT Asn (N) 
AAC Asn (N) 
'''''AAA''''' Lys (K) 
AAG Lys (K) 
</td>

<td>
AGT Ser (S) 
AGC Ser (S) 
'''''AGA''''' Arg (R) 
AGG Arg (R) 
</td>
</tr>

<tr>
<th rowspan=1>G</th>

<td>
GTT Val (V) 
'''''GTC''''' Val (V) 
'''''GTA''''' Val (V) 
GTG Val (V) 
</td>

<td>
GCT Ala (A) 
GCC Ala (A) 
GCA Ala (A) 
GCG Ala (A) 

</td>

<td>
GAT Asp (D) 
GAC Asp (D) 
GAA Glu (E) 
GAG Glu (E) 
</td>

<td>
GGT Gly (G) 
GGC Gly (G) 
GGA Gly (G) 
'''''GGG''''' Gly (G) 
</td>
</tr>

</table>
</center>
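
The same table can be written compactly in code, which is handy for checking the amino acid assigned to a tRNA gene. The sketch below builds the standard genetic code from a 64-letter string in T, C, A, G order (the same order as the rows and columns above); the names <code>CODON_TABLE</code> and <code>translate</code> are chosen for this example.

```python
BASES = "TCAG"
# Amino acids for TTT, TTC, TTA, TTG, TCT, ... GGG; '*' marks a stop codon
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a coding sequence codon by codon; '*' marks stop codons."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3)
    )


if __name__ == "__main__":
    print(CODON_TABLE["CCC"], translate("ATGGCCTAA"))
```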

Halorhabdus utahensis Genome

2008-08-27T03:47:38Z

SaSimpson: /* This is a list of glossary words (A - Z): */

This page will be used by Davidson College students in the [http://www.bio.davidson.edu/Courses/Bio343/LabMethods.html Genomics Laboratory course].
 
== This is a list of glossary words (A - Z): ==

carbon fixation - using carbon dioxide to create organic materials (Samantha)

concatemer - long continuous DNA molecule that contains the same DNA sequence repeated in series (Samantha)

glaucophyte - freshwater algae that have not been studied well (Samantha)

haptophyte - phylum of algae (Samantha)

heterokont - major line of eukaryotes consisting of about 10,500 known species, most of which are algae (Samantha)

homeobox - DNA sequence within transcription factor genes that allow the cell to respond to patterns of development by having the transcription factors switch on gene cascades (Samantha)

nucleomorph - reduced eukaryotic nuclei found in plastids (Samantha)

plastid - major organelles in plants or algae (Samantha)

proteome - entire set of proteins expressed by a genome, cell, tissue, or organism. It may refer to expressed proteins under certain conditions (Samantha)

retrotransposons - RNA transcribed back into DNA and added into the genome (Samantha)

transposons / transposable elements - DNA sequences that can move around to different positions in a single cell's genome. Transposons can cause mutations and change the length of the genome. (Samantha)

whole genome shotgun sequencing - a method of sequencing where DNA is cut into small pieces and cloned into vectors, then both ends of every vector are sequenced in about 500 bps to form mate pairs. Mate pairs rarely overlap, but are used to reassemble the sequence using software. (Samantha)

== This is a list of the student-created tutorials: ==

Needed From Davidson

2008-07-07T15:16:18Z

SaSimpson:

'''Parts Needed From Davidson'''

LuxR expression cassette (promoter+RBS+LuxR+TT) 
James Barron/Erin Feeney

LuxI gene (RBS+LuxI)

LacI double mutant expression cassette (promoter+RBS+LacI+TT) 
Pallavi Penumetcha

lasR expression cassette (promoter+RBS+LasR+TT) 
James Barron

low copy amp vector (I150042 in I51020) Samantha (hopefully)

medium copy amp vector (I50032 in I51020) Samantha

low copy kan vector (not built yet)

medium copy kan vector (not built yet)