Explaining My Project

Shotgun Sequencing


Counting k-mers to Learn About the Genome

Kmer.png

  • Bad k-mer rate = number of bad-multiplicity k-mers / total number of k-mers
  • Sequencing error rate = bad k-mer rate / k-mer size (a single base error can corrupt up to k overlapping k-mers)
  • Genome coverage = fit a gamma distribution to the good-multiplicity values for the best k-mer size (usually the largest). The peak of the fitted curve gives the genome coverage (the red line in the figure; here about 47.11x)
  • Genome size = total number of good-multiplicity k-mers (counting every occurrence) / coverage
  • 1st peak (at low multiplicity) = k-mers created by sequencing errors
  • 2nd peak = true genomic k-mers; each location in the genome is sequenced multiple times, so its k-mers appear at about the coverage depth
    • If a k-mer occurs n times in the genome, we expect to see it n times as often in the sequencing data, so there should be additional peaks for k-mers that occur in repeats.
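
The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code, and it makes several assumptions. The function names and the `min_good` cutoff (the multiplicity below which a k-mer counts as "bad") are hypothetical. The most common good multiplicity stands in for the peak of the gamma fit. All counts include every occurrence of a k-mer, not just the distinct k-mers.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_stats(counts, k, min_good):
    """Apply the formulas from the list above to a table of k-mer counts.

    min_good is a hypothetical cutoff: multiplicities below it are "bad".
    """
    # Histogram: multiplicity -> number of distinct k-mers seen that often.
    hist = Counter(counts.values())

    # Count occurrences (multiplicity * number of distinct k-mers).
    total = sum(m * n for m, n in hist.items())
    bad = sum(m * n for m, n in hist.items() if m < min_good)
    good_hist = {m: n for m, n in hist.items() if m >= min_good}

    bad_rate = bad / total
    # One base error can corrupt up to k overlapping k-mers.
    error_rate = bad_rate / k
    # Stand-in for the gamma fit: the most common good multiplicity.
    coverage = max(good_hist, key=good_hist.get)
    # Assumption: genome size uses total good k-mer occurrences.
    genome_size = (total - bad) / coverage

    return {
        "bad_kmer_rate": bad_rate,
        "seq_error_rate": error_rate,
        "coverage": coverage,
        "genome_size": genome_size,
    }


# Toy data: four error-free reads of one sequence plus one read whose
# last base is wrong.
genome = "ACGTTGCA"
reads = [genome] * 4 + ["ACGTTGCT"]
stats = kmer_stats(count_kmers(reads, k=3), k=3, min_good=2)
print(stats)
```

In the toy data the k-mer GCT appears only once, so with `min_good=2` it is the only k-mer treated as a sequencing error. On real data the cutoff would normally be chosen from the valley between the first and second peaks, and a proper gamma fit would replace the simple mode used here.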