Halorhabdus utahensis Genome

From GcatWiki
Revision as of 03:09, 25 September 2008 by Lavoss (talk | contribs) (S)

This page will be used by Davidson College students in the Genomics Laboratory course.

Links to Multiple Databases


RNA Genes

Other Resources

Tutorials for Annotating Genomes

  1. Will DeLoache - BioPerl Installation
  2. Max Win - Introduction to Perl for non-programmers (with step-by-step explanations, simple exercises, and solutions)
  3. Pallavi - Conserved Domains Database (CDD)
  4. Mary - Protein Data Bank
  5. Laura Voss - Pfam Database
  6. Samantha Simpson - NCBI Blast (protein, nucleotide, and blast2)
  7. Peter Bakke - Finding species-specific Shine-Dalgarno sequence

Research Questions

  1. How do the three systems compare for finding ORFs and RNA genes?
  2. Is there a pattern of missed genes for any of the 3 sites?
  3. Do the three systems differ in their ability to find good start codons and Shine-Dalgarno sequences? [We need a standard set of genes for comparison. Only highly conserved or a range of genes?]
  4. Were Shine-Dalgarno sequences calculated for our species or default values used? If default, what sequence?
  5. Can we fill any holes in their automated annotation? Is there a mechanism for users to add in genes?
  6. How do the 3 sites compare for ease of use?
  7. What are the strengths and weaknesses of each system? What did they publish as their special features and how do we see these working?
  8. How does each of the 3 sites compare for pathway detection and visualization?
  9. Do they find the origin of replication? Can we find it?

This is a list of glossary words (A - Z):

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

A

Accession Number - a unique identifier given to DNA and protein sequences to allow for tracking of sequence information within a single database [1] (Will).

Arabidopsis thaliana - the scientific name for the thale cress plant; it was the first plant to have its genome sequenced, and is a model organism for understanding plant biology and genetics (Wikipedia.org, Jay)

B

BAC - bacterial artificial chromosome, a DNA construct used for transforming or cloning segments of DNA and often used to sequence the genomes of organisms (Wikipedia.org, Jay)

bioinformatics - the multi-disciplinary approach of using biology, computer science and mathematics to solve or better understand biological problems [2] (Matt)

BLAST - (Basic Local Alignment Search Tool) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. [3] (Mary)

BioPerl - a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications, such as accessing sequence data from local and remote databases, converting between database formats, manipulating individual sequences, searching for similar sequences, searching for genes and other structures on genomic DNA, and creating machine-readable sequence annotations. [4] (Wikipedia, Max Win)

C

carbon fixation - using carbon dioxide to create organic materials [5] (Samantha)

CDD (Conserved Domains Database)- a database used to identify the conserved domains present in a protein query sequence [6] (Mary)

chaperonin - a protein complex that assists some newly formed polypeptide chains by folding them into their final, functional, three-dimensional form [7] (Matt)

chemotaxis - the process by which cells move toward or away from high concentrations of certain chemicals; it occurs in both uni- and multicellular organisms. Unicellular organisms use it to avoid toxins or find food, while multicellular organisms use it for tasks such as reproduction [8] (Nick)

chemotaxonomy - the attempt to classify and identify organisms according to demonstrable differences and similarities in their biochemical compositions [9] (Mary)

ClustalW - A web-based or command line tool that performs multiple sequence alignments to determine evolutionary relationships between three or more sequences [10] (Will).

COG (Clusters of Orthologous Groups) - each COG corresponds to a highly conserved domain and generally consists of either individual proteins or groups of paralogs (COG Pallavi)

concatemer - long continuous DNA molecule that contains the same DNA sequence repeated in series [11](Samantha)

contigs (contiguous DNA) - overlapping DNA segments that, as a collection, form a longer and gapless segment of DNA. (Discovery Genomics, Proteomics and Bioinformatics [12], Max Win)

coverage - refers to the number of times, on average, any piece of DNA in a sequenced genome has been individually sequenced (Lecture, Jay)
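
As a rough illustration of the definition above, average coverage can be estimated as the total number of sequenced bases divided by the genome length. This is a minimal sketch; the function name and the example numbers are hypothetical and are not taken from the course project.

```python
def average_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average coverage = total sequenced bases / genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


# Hypothetical example: 50,000 reads of 500 bp over a 3,000,000 bp genome
print(average_coverage(50_000, 500, 3_000_000))
```

A value of 8, for example, would mean that each base was sequenced about eight times on average; individual positions can still be covered more or fewer times.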

CPAN (Comprehensive Perl Archive Network) - an archive of over 12,200 modules of software written in Perl, as well as documentation for it. It contains a module called CPAN (or CPAN.pm) which is used as an installer for Perl modules such as BioPerl [13](Will).

D

de novo synthesis - the synthesis of complex molecules from simple molecules (e.g. sugars and nucleotides), rather than from recycled molecules; from the Latin for "anew" (literally "from the new") [14] (Matt)

dehydrogenase - a type of enzyme that oxidizes a substrate by transferring one or more protons and a pair of electrons to an acceptor. [15] (Peter)

diatom - a major group of eukaryotic algae, and one of the most common types of phytoplankton. A characteristic feature of diatom cells is that they are encased within a unique cell wall made of silica called a frustule. These frustules show a wide diversity in form, but usually consist of two asymmetrical sides with a split between them. [16] (Mary)

domain (protein) - the structural and functional groups of a protein, which can exist independently of the protein itself. Domains typically perform a specific function, such as binding to promoters or substrates, and many proteins can have one or several domains in common. Evolutionarily-linked proteins are more likely to have domains in common. Domains are used to organize proteins into families. (Wikipedia article, Laura)

dot plot - a graphical display comparing sequence conservation between two genomes, with dots indicating strings of identical bases. (Discovery Genomics, Proteomics and Bioinformatics[17], Max Win)
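
The idea behind a dot plot can be sketched in a few lines: place a dot wherever a short word from one sequence exactly matches a word in the other. The function names and the word size below are illustrative assumptions; real dot plot tools add filtering and handle much longer sequences.

```python
def dot_plot(seq_a, seq_b, word=3):
    """Return the set of (i, j) where seq_a[i:i+word] == seq_b[j:j+word]."""
    dots = set()
    for i in range(len(seq_a) - word + 1):
        for j in range(len(seq_b) - word + 1):
            if seq_a[i:i + word] == seq_b[j:j + word]:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, word=3):
    """Draw the plot as text: one row per position in seq_a, '*' for a dot."""
    dots = dot_plot(seq_a, seq_b, word)
    rows = []
    for i in range(len(seq_a) - word + 1):
        row = "".join("*" if (i, j) in dots else "."
                      for j in range(len(seq_b) - word + 1))
        rows.append(row)
    return "\n".join(rows)


print(render("ACGTACGT", "ACGTTCGT", word=3))
```

Conserved regions show up as diagonal runs of dots; breaks or shifts in the diagonal suggest insertions, deletions, or rearrangements.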

E

EC number (Enzyme Commission Number)- a numerical classification scheme for enzymes, based on the chemical reactions they catalyze [18] (Mary)

E-value (Expect value) - When performing a BLAST search, you will obtain an E-value for each sequence that is retrieved. An E-value is the number of matches with an equal or better score that would be expected to occur by chance in a database of that size; the closer it is to zero, the less likely the similarity is due to chance. (Discovery Genomics, Proteomics and Bioinformatics[19], Max Win)
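
To make the relationship between score and E-value concrete, the sketch below uses the standard bit-score form of the BLAST statistics, E = m * n * 2^(-S'), where S' is the bit score and m and n are the query and database lengths. It is a simplification: BLAST itself uses corrected "effective" lengths, and the function names here are hypothetical.

```python
import math


def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score.

    Assumption: raw lengths are used instead of BLAST's effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def chance_probability(e_value):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


e = expect_value(bit_score=50, query_len=300, db_len=3_000_000)
print(e, chance_probability(e))
```

For very small values, E and P are nearly equal, which is why an E-value is often loosely read as a probability.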

Extremophile - an organism that thrives in and may even require physically or geochemically extreme conditions that are detrimental to the majority of life on Earth [20] (Will).

F

FASTA format - a text format used to convey either nucleic acid sequences or peptide sequences, in which bases or amino acids are represented by single-letter codes. A description line beginning with ">" gives the sequence name and other descriptors and precedes the sequence itself. [21] (Nick)
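
A minimal sketch of reading FASTA-formatted text is shown below. The function name and the example record are made up for illustration; for real work, an established parser such as the one in BioPerl is the safer choice.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping each header line to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]       # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequences may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


example = """>seq1 hypothetical example record
ATGGCC
TTAA
>seq2
MKV
"""
print(parse_fasta(example))
```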

family (protein) - a group of evolutionarily-related proteins, often with one or several domains in common. Families are organized by domain overlap, structural/functional similarity, and sequence similarity. (Wikipedia article and lecture, Laura)

finished genome - a genome that has been sequenced at least partly by hand, resulting in at least 99.99% sequence accuracy (Lecture, Jay)

G

GC Content - the percentage of bases within a certain sequence of DNA (e.g. a gene or a genome) that are either guanine or cytosine; a higher GC content is characteristic of a coding region of a gene; differences in GC content between a gene and a genome can be used as evidence for horizontal gene transfer [22] (Matt)
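
GC content is simple to compute directly from a sequence. The sketch below is a minimal example with a hypothetical function name; it counts only unambiguous G and C characters and ignores ambiguity codes such as N.

```python
def gc_content(seq):
    """Return the percentage of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


print(gc_content("ATGCGCGTTA"))
```

Comparing the value for a single gene against the value for the whole genome is one way to look for the horizontal gene transfer evidence mentioned in the definition.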

GC-skew – uneven distribution of guanine and cytosine bases between the two strands of DNA where GC base pairs occur. (Discovery Genomics, Proteomics and Bioinformatics[23], Max Win)
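
GC-skew is commonly quantified as (G - C) / (G + C), computed on one strand in consecutive windows. The sketch below assumes that formula and a hypothetical, non-overlapping window scheme; in many bacterial genomes the sign of the skew changes near the origin and terminus of replication, which may be relevant to the origin-of-replication research question, though whether this holds for our genome is not established here.

```python
def gc_skew(seq):
    """Return (G - C) / (G + C) for seq, or 0.0 if it has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return 0.0 if g + c == 0 else (g - c) / (g + c)


def windowed_gc_skew(seq, window):
    """GC-skew in consecutive, non-overlapping windows along seq."""
    return [gc_skew(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, window)]


print(windowed_gc_skew("GGGGATCCCCAT", 6))
```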

gene amplification - production of multiple copies of a gene in order to amplify the amount of protein that the gene encodes for [24] [25] (Matt)

gene knockout - a process in which a gene is deactivated within a test organism in order to better understand the function of the gene in that organism [26] (Matt)

gene ontology - a collaborative effort of investigators to unify and standardize terms associated with the role a gene or protein plays in an organism. (Discovery Genomics, Proteomics and Bioinformatics[27], Max Win)

glaucophyte - freshwater algae that have not been studied well [28](Samantha)

H

haemolysin or hemolysin - a chemical produced by bacteria that causes lysis of red blood cells [29] (Nick)

halophile - an organism, most often of the Archaea domain, that lives in environments containing high concentrations of salt [30] (Matt)

haplotype - a collection of alleles that travel together (Lecture, Pallavi)

haptophyte - phylum of algae [31](Samantha)

heterokont - major line of eukaryotes consisting of about 10,500 known species, most of which are algae [32](Samantha)

Hidden Markov Model - a statistical model used in protein recognition databases such as Pfam. A Hidden Markov Model keeps track of several variables and possible variations thereof, such as the possible amino acid sequences that make up a protein domain (since there can be some variance in an amino acid sequence) or the variations in the component sounds that make up a word, and uses those points to match a given sequence to the word, domain, or other complex sequence it most closely matches. An HMM in speech recognition software, for example, can identify that a certain set of sounds make up a certain word, even with the variations in pronunciation and accent that different people will give those sounds. (Wikipedia and lecture, Laura)

HMM Logo - a graphical representation of an HMM, detailing the possible amino acid sequences, the relative frequencies and probabilities of each amino acid in the sequence, the relative contribution each amino acid has to the overall protein family, and the charge or nature of the amino acids themselves. (How to read HMM Logos, on Pfam, Laura)

homeobox - a DNA sequence within transcription factor genes that allows the cell to respond to patterns of development by having the transcription factors switch on gene cascades [33](Samantha)

homodimer - a protein made of paired identical polypeptides (Answers.com, Jay)

horizontal gene transfer - DNA transmission between species and incorporation of the DNA into the recipient's genome (horizontal gene transfer Pallavi)

hydrolase - an enzyme that catalyzes hydrolysis, the cleavage of a chemical bond by the addition of water [34] (Nick)

I

ideogram - in genomics, usually describes a stylized representation of a chromosome with banding patterns (Campbell-Heyer Genomics textbook, Jay)

identities - in a BLAST output, the number and fraction of total residues which are identical in a given alignment [www.ncbi.nlm.nih.gov/blast/blast_help.shtml] (Mary)
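
The identities figure can be reproduced by hand from any pairwise alignment. The sketch below assumes the two aligned strings have equal length and use "-" for gaps, and it reports the count and the alignment length in the same "n/total" spirit as BLAST output; the function name is hypothetical.

```python
def identities(aligned_a, aligned_b):
    """Return (identical_columns, alignment_length) for two aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    same = sum(1 for x, y in zip(aligned_a, aligned_b)
               if x == y and x != "-")
    return same, len(aligned_a)


same, total = identities("MKV-LA", "MKVTLG")
print(f"Identities = {same}/{total} ({100 * same // total}%)")
```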

indole - a chemical compound that is produced from the breakdown of tryptophan (indole Pallavi)

inclusion body - Inclusion bodies are collections of stainable substances, usually proteins, that are found either in the nucleus or the cytoplasm. It is thought that these bodies are often the result of viral proteins that misfolded [35] (Nick)

intron - a region of DNA in a gene that is not part of the final coding sequence for the protein. [36] (Peter)

isoelectric point - the pH at which a molecule carries no net electrical charge [37] (Nick)

isozymes - members of a gene family with very similar cellular roles (Campbell-Heyer Genomics textbook, Jay)

J

K

KEGG (Kyoto Encyclopedia of Genes and Genomes) - a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The Pathway database records networks of molecular interactions in the cells, and variants of them specific to particular organisms [38](Will).

kinase - a type of enzyme that transfers a phosphate group from a high-energy donor molecule to a target molecule in a process called phosphorylation. [39] (Peter)

L

M

Manatee - a web-based gene evaluation and genome annotation tool that can view, modify, and store annotation for prokaryotic and eukaryotic genomes. This ongoing, open source initiative was developed with two missions: first, to give biologists the ability to functionally annotate their genomes using a powerful, stand-alone web application with a robustly designed relational annotation database; and second, to give outside developers the opportunity to contribute their own ideas and requirements to enhance Manatee's ability to accomplish biological goals [40](Will).

motif - a sequence of amino acids or nucleotides that performs a particular role and is often conserved in other species or molecules. (Discovery Genomics, Proteomics and Bioinformatics[41], Max Win)

mycoplasma - genus of bacteria that lack a cell wall [42] (Nick)

N

NORF (nonannotated open reading frame) - an open reading frame that was considered not to be a real gene when the genome was annotated. (Discovery Genomics, Proteomics and Bioinformatics[43], Max Win)

nucleomorph - reduced eukaryotic nuclei found in plastids [44](Samantha)

O

object-oriented programming - a programming paradigm in which collections of data, associated with operations on that data, are modularly defined and then built upon (CSC 121 Lecture, Will).

open reading frame (ORF) - a segment of DNA that can potentially encode a protein; it begins with a start codon (usually ATG) and continues in the same reading frame until a stop codon (ORF, Pallavi)
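
A very simple ORF scan can be sketched as follows. This is only an illustration under stated assumptions: it looks at the three forward reading frames of one strand, accepts only ATG as a start codon, and uses the standard stop codons TAA, TAG, and TGA. Real gene finders also scan the reverse strand and allow alternative start codons; the function name and the minimum-length parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Find ORFs in the three forward frames of seq.

    Returns sorted (start, end) index pairs; end is exclusive and the
    slice seq[start:end] includes the stop codon. min_codons counts the
    codons before the stop, including the start codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it and keep scanning
    return sorted(orfs)


dna = "CCATGAAATGACC"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```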

operon - a segment of DNA involving an operator, promoter, and one or more genes that operate as a single unit during transcription [45] (Nick)

optical mapping - DNA sequences of the organism in question are compared against a karyotype that specifically looks at restriction sites found within the DNA to correctly order the DNA sequences on a chromosome. This methodology gives very detailed haplotype information and allows for the detection of sequence variations across an entire genome (optical mapping, Pallavi)

ortholog - one of two or more homologous genes in different species that descend from a single gene in their last common ancestor and often retain the same function (Lecture, Pallavi)

oxidoreductase - an enzyme that catalyzes redox reactions by transferring electrons from one molecule (the reductant) to another (the oxidant) [46] (Nick)

P

paralog - one of two or more homologous genes within the same species that arose by gene duplication and may have diverged in function (Lecture, Pallavi)

p-arm - the shorter arm of a chromosome's two arms separated by the centromere (compare to q-arm, the longer arm) (MedTerms Dictionary, Jay)

Perl - Developed by Larry Wall in 1987, Perl is a high-level programming language used frequently by biologists and bioinformaticists [47] (Will).

periplasmic space - the space between the inner cytoplasmic membrane and external outer membrane in bacteria or archaea. [48] (Peter)

Pfam - a database for protein domain families that matches amino acid sequences or nucleotide sequences to the related group of proteins to which they most likely belong. (Pfam Help, Laura)

plasmid - an extra-chromosomal DNA molecule that is capable of replicating independently of the chromosomal DNA. Commonly found in bacteria and archaea. [49](Peter)

plastid - a major class of organelles found in plants and algae, including chloroplasts [50](Samantha)

pleomorphism - the occurrence of two or more structural forms during a life cycle [51] (Mary)

phylogenetic tree - a diagram showing the evolutionary relationships between biological species that are thought to share a common ancestor [52] (Nick)

phylotypes – a term intended to resolve the challenge of “species” when classifying prokaryotes using DNA sequence comparisons. (Discovery Genomics, Proteomics and Bioinformatics[53], Max Win)

positives - in a BLAST output, the number and fraction of residues for which the alignment scores have positive rather than negative values [54] (Mary)

proteome - entire set of proteins expressed by a genome, cell, tissue, or organism. It may refer to expressed proteins under certain conditions [55](Samantha)

pseudogene - a sequence of DNA that looks like a gene but is no longer functional, most likely because it contains stop codons or other disabling mutations. It may have evolved away from a real gene, or a paralog might have taken its place (Lecture, Pallavi)

purine - a category of nitrogenous base consisting of a pyrimidine ring fused to an imidazole ring. Notable purine bases are adenine and guanine. [56] (Peter)

pyrimidine - a category of nitrogenous base consisting of a heterocyclic aromatic ring containing two nitrogen atoms at positions 1 and 3 of the six-member ring. Notable pyrimidine bases are cytosine, thymine, and uracil. [57] (Peter)

Q

q-arm - the longer arm of a chromosome's two arms separated by the centromere (compare to p-arm, the shorter arm) (MedTerms Dictionary, Jay)

query sequence - the sequence (whether amino acid or nucleotide) entered into a database’s search function and checked against the database entries. (BLAST on Wikipedia, Laura)

R

RAST - (Rapid Annotation using Subsystem Technology)- a fully-automated service for annotating bacterial and archaeal genomes. It provides high quality genome annotations for these genomes across the whole phylogenetic tree. ([58], Max Win)

rDNA-These are DNA sequences that encode for ribosomal RNA. Note that rDNA can also stand for recombinant DNA. (rDNA Pallavi)

residue (protein) - the remaining portion of an amino acid after a water molecule has been removed and it has been incorporated into a protein. Functional residues, referred to in Pfam, are the residues that perform some specific identifiable function or are part of a domain, and can be conserved across evolutionarily-related proteins. (Pfam Help, Laura)

retrotransposons - transposable elements that copy themselves through an RNA intermediate, which is transcribed back into DNA and added into the genome [59](Samantha)

ribonuclease - a nuclease that catalyzes the degradation of RNA into smaller components [60] (Mary)

S

serovar - a subdivision of a species based on the characteristics of its cell surface antigens (serovar Pallavi)

scaffold - a section of a sequenced genome composed of contigs that are in the right order but not necessarily connected (MedTerms Dictionary, Jay)

Shine-Dalgarno sequence - A ribosomal binding site on an mRNA, usually a sequence of about six bases located about six or seven bases upstream of the start codon. An anti-Shine-Dalgarno sequence exists on the rRNA in the small subunit of the ribosome; when the two sequences pair, the mRNA is lined up and prepared for translation. (Lecture and Wikipedia article, Laura)
Note: The Shine-Dalgarno consensus sequence for our genome is TAGGAGG.
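
One simple way to look for a Shine-Dalgarno site is to slide the consensus along the region upstream of a start codon and keep the window with the most matching bases. The sketch below uses the TAGGAGG consensus from the note above; the function name, the ungapped match-counting approach, and the example upstream sequence are illustrative assumptions, not the method used by any of the annotation systems.

```python
def best_sd_match(upstream, consensus="TAGGAGG"):
    """Return (position, matches) for the best ungapped match to consensus.

    position is the 0-based index in upstream where the best window starts;
    (-1, -1) is returned if upstream is shorter than the consensus.
    """
    upstream = upstream.upper()
    width = len(consensus)
    best = (-1, -1)
    for i in range(len(upstream) - width + 1):
        window = upstream[i:i + width]
        matches = sum(a == b for a, b in zip(window, consensus))
        if matches > best[1]:
            best = (i, matches)   # ties keep the earliest window
    return best


# Hypothetical 17-base region immediately upstream of a start codon
print(best_sd_match("AAAATAGGAGGAAAAAA"))
```

A threshold on the number of matches, and on the distance between the match and the start codon, would be needed before calling a site; both would have to be chosen for the genome at hand.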

SignalP - a prediction server that judges whether or not a query protein contains a signal peptide. SignalP measures each amino acid against the amino acid sequences of probable signal peptide matches and predicts the cleavage site of the signal peptide. (SignalP Output explained, Laura)

signal peptide - a short peptide chain that directs the post-translational transport of a protein [61] (Matt)

Smith-Waterman alignment - A well-known algorithm for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure [62](Will).
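
The algorithm can be sketched with a small dynamic programming table. This is a minimal version that assumes a simple match/mismatch score and a linear gap penalty (the values below are arbitrary choices); production tools use substitution matrices such as BLOSUM and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]   # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which is what makes the alignment local
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        score = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + score:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("AAACGTAAA", "TTTCGTTTT"))
```

Because every cell is floored at zero, poorly matching flanks are dropped and only the best-scoring local region is reported.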

SNP (Single Nucleotide Polymorphism) - a DNA sequence variation occurring when a single nucleotide in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual) [63](Will).

symporter - an integral membrane protein that is involved in movement of two or more different molecules or ions across a phospholipid membrane in the same direction. [64] (Peter)

synteny - a neologism from the Greek for "on the same ribbon". Genes that are syntenic in one species are on the same chromosome; genes that are syntenic across species retain the same order on respective chromosomes as a result of descent from a common ancestor (Answers.com, Jay)

synthetase - a type of enzyme that creates a new covalent bond and requires direct input of energy from a high-energy phosphate. [65] (Peter)

T

transferase - an enzyme that catalyzes the transfer of a functional group from one molecule (the donor) to another (the acceptor) [66] (Matt)

transmembrane helix - a single transmembrane alpha helix of a transmembrane protein, usually about twenty amino acids in length. They are usually predicted by hydrophobicity. [67](Mary)

transposons / transposable elements - DNA sequences that can move around to different positions in a single cell's genome. Transposons can cause mutations and change the length of the genome. [68](Samantha)

transposon mutagenesis - a procedure in which a transposon is inserted into a gene, which inactivates the gene and can lead to the discovery of the phenotype associated with this gene (transposon mutagenesis Pallavi)

tRNA splicing endonuclease - an enzyme that cleaves intervening sequences of precursor tRNA. [69] (Peter)

U

V

W

whole genome shotgun sequencing - a method of sequencing in which DNA is cut into small pieces and cloned into vectors, and then about 500 bp are sequenced from both ends of every insert to form mate pairs. The two reads of a mate pair rarely overlap each other, but software uses them to reassemble the sequence. [70](Samantha)

X

xenolog - homologs that are created by horizontal gene transfer between two different species [71] (Matt)

Y

Z




This is a list of the student-created tutorials: