Difference between revisions of "Davidson Missouri W/Mathematical Modeling"

From GcatWiki
Jump to: navigation, search
 
(22 intermediate revisions by 2 users not shown)
Line 1: Line 1:
<center>[[Davidson/Missouri Western iGEM 2007| <span style="color:red">Home</span>]] | [[Background Information| <span style="color:red">Background Information</span>]] | [[Solving the HPP in vivo| <span style="color:red">Current Project: Solving the Hamiltonian Path Problem ''in vivo''</span>]] | [[Probability and Statistics| <span style="color:red">Probability and Statistics</span>]] | [[Gene splitting| <span style="color:red"> Gene Splitting </span>]] | [[Controlling Expression| <span style="color:red"> Controlling Expression </span>]] | [[Traveling Salesperson Problem| <span style="color:red">Traveling Salesperson Problem</span> ]] | [[Backwards promotion and read-through transcription| <span style="color:red">Backwards promotion and Read-Through Transcription</span>]] | [[Resources and Citations|<span style="color:red">Resources and Citations</span>]]</center>
+
<center>[[Davidson Missouri W| <span style="color:red">Home</span>]] | [[Davidson Missouri W/Background Information| <span style="color:red">Background Information</span>]] | [[Davidson Missouri W/Solving the HPP in vivo| <span style="color:red">Current Project: Solving the Hamiltonian Path Problem ''in vivo''</span>]] | [[Davidson Missouri W/Mathematical Modeling| <span style="color:black">Mathematical Modeling</span>]] | [[Davidson Missouri W/Gene splitting| <span style="color:red"> Gene Splitting </span>]] | [[Davidson Missouri W/Controlling Expression| <span style="color:red"> Controlling Expression </span>]] | [[Davidson Missouri W/Traveling Salesperson Problem| <span style="color:red">Traveling Salesperson Problem</span> ]] | [[Davidson Missouri W/Backwards promotion and read-through transcription| <span style="color:red">Backwards Promotion and Read-Through Transcription</span>]] | [[Davidson Missouri W/Resources and Citations|<span style="color:red">Resources and Citations</span>]]</center>
  
 
<hr>
 
<hr>
  
'''False Positives'''
+
=Markov Chain Model For Flipping=
  
The false positive programs will tell how many false positives there are in  Adelman’s graph without two edges (12) and the other is for the original graph with 14 edges. To develop the program we first made a list of all the different ways to make each number represented in the path and numbered them. We also, numbered each of the edges. we took the numbered edges and made vectors of the ways that each number can be represented at the top of the program.
+
'''Does starting orientation matter?'''
  
First we made a loop to make sure that MATLAB picks up at least one representation of each number. Then the perms function is used to arrange all the representations of each number except the promoter in all the different ways and call it matrix g.Then you want to find all the rows where the last number of each row of the matrix ends with the number that represents the terminator. Then using the ord function all the different ways of putting the two promoters in the front of each row is developed. Now the program changes the label of the numbers from there number representation number to there edge number where the label goes from 1 to 2 and transposes them. The way that the program lays out the numbers need to be transposed again so we called that matrix sl. Now an if statements were made for the total number of repeats for each row equal to the number of repeats next to each other. The counter is used to find the total number of repeats that are next to each other. For all the number of edges left after all of the repeats are taken out we replaced those with zeros. Now we had to make a vector of ones that will be from 1 to the length of sl. What we call usl is where we get rid of one of the repeated numbers that’s next to each other because one number shows the back half and front half of the gene. Now the new usl is relabeled and includes the numbers for the edges with the promoter at the beginning of each and the terminator at the end of each with the zeros replacing the edges that are after the terminator.
+
Question: Will the order in which we place the edges in our E.coli computers affect the probability of us detecting a Hamiltonian Path?
  
If there is a row in matrix g that does not end with a terminator then we use the command elseif and add a terminator to each of those rows. Since there are two terminators then the program goes through the process one time with the first terminator and another time with the other terminator making these rows longer than if they originally had a terminator at the end. After this whole process all the rows start with a promoter and end with a terminator and have zeros after the terminator is represented.
+
Our mathematical reasoning relies on the fact that the Hin/HixC DNA recombination system makes a large number of flips and for any one flip there is some element of chance that dictates which Hix sites are chosen to flip.  
  
At the end of the program we print out the total number of arrangements of false positives before the zeros after the terminator are replaced with the extra edges. Also, the rows that are not repeated are printed out. Now the program finds all the rows that have terminators starting in column 8 to 12. We found how many ways we can arrange the remaining edges depending on where the terminator is and take that number and multiply by the amount of times the terminator is in each spot and add them together for each column number. This number gives the number of false positives for a Hamiltonian path.
+
Answer: We developed a Markov Chain model that takes every signed permutation of edges as a state and the connections between these states are found and categorized by our Matlab programs by both the number of edges that need to be flipped to move between those states and categorized by which specific edge needed to be flipped. Further, our Matlab programs creates a transition matrix based on the weights a user enters to bias the probabilities of making different kinds of flips.  
  
MATLAB programs used to find the false positives:
+
Once this transition matrix is generated we then compute, for each possible starting state, the probability of that starting state transitioning to any of the solved states after x flips given the biasing of different types of flips. When our Markov Chain model is run for more than 2 edges we always observe a convergence of these probabilities no matter what bias we apply. This shows that starting orientation does not matter since the probability of being solved after a large number of flips is the same for each starting orientation.
 
 
[http://gcat.davidson.edu/iGEM07/For_Missouri_Western/fpalmostadelman.m 1. Adelman's False Positives with 12 nodes]
 
  
[http://gcat.davidson.edu/iGEM07/For_Missouri_Western/fpadelman.m 2. Adelman's False Positives with 14 nodes]
+
This begs the question: what is a large number of flips? Mostly we observe that this convergence occurs in the first 20 flips for three edges and only for very extreme probabilities does this convergence occur after 60 flips. However, as the number of edges increase we can expect this convergence to occur later. This does not appear to be a problem. Our flipping mechanism flips edges very quickly, so quickly that it is hard to measure. Additionally, the diameter of our transition diagrams increases linearly with the number of edges so there is no reason to believe that the number of flips to convergence would increase at more than a linear pace.
  
[http://gcat.davidson.edu/iGEM07/For_Missouri_Western/fpweighter.m 3. Counter Program for Adelman's False Positives]
 
<br>
 
= Projects Completed =
 
  
'''Markov Chains'''
+
[[Image:MarkovChaingraph4nod3edg.jpg]]
  
One of the first projects we worked on was to develop a mathematical model of graphs for the pancake problem. These graphs were used to find the percent of plasmids that solved problem based on the number of flips. We also developed graphs with biases to see if the length of the flips effect the probability of making the flip. In some of the graphs we saw bouncing because from a starting point you can only get to a solution by an odd number of flips or an even number of flips. If you get to a point on an even number of flips you can never get to an odd number. When the graph shows convergence, regardless whether the number of flips is even or odd you will have a chance to get to one of the solutions. The convergence occurs at .25 because there are 2/8 chances of getting to the solution after a high number of flips which reduces to ¼. We also did graphs using bias wieghts where it takes more flips for the graph to have the bouncing behavior. The biologist suspect that the length of the flips effects the probability of making the flips. Then you would compare mathematical graph to the biologist graph to see the bias.
 
  
 
MATLAB programs that we developed using Markov Chains
 
MATLAB programs that we developed using Markov Chains
Line 36: Line 30:
  
  
[http://gcat.davidson.edu/iGEM07/For_Missouri_Western/Poisson.doc '''Poisson''']
 
  
Using this statistical method we used to make a chart of  the probability of finding true positives based on the number of plasmids.
+
<br>
 +
= True Positives / False Positives =
  
'''P(missing all true positives)'''
+
[[Image:5_node-wiki.jpg]] 
  
We first looked at trying to find the probability of there not being a Hamiltonian Path(HP) and not seeing one.(if they don’t see anything what is the probability that nothing is there.) Knowing that the probability of seeing a HP given that there is not a HP is 0 and the probability of seeing a HP given that there is not one is .99.Since we did not know the probability of not seeing a HP nor do we not know the probability that there is not a HP we decided we need to change the wording of the question to what is the probability of missing all the true positives. We would be able to find this probability if we know the number of cells that can be seen. After talking to the biologist they said if there is a HP there is a 100% probability that they will see it. The answer to our original question of finding the probability for missing all the true positives is 0.
+
True positives are the Hamiltonian Path represented in the above graph with solid red arrows and below as the PCR Fragment Length where the edges after the Hamiltonian path can be arranged in any order.
  
'''How many orderings of positively oriented edges are 2 flips away from being solved?'''
+
[[Image:True_positive.jpg]]
  
We started off stating that n=nodes, e=the number of edges and we needed to arrange n-1 edges. First, we proved that the number of positively orientated edges could not be 1 flip away. Flips are represented by
+
A false positive is represented in the graph above with the solid red arrows and the dotted red arrow showing a "teleport" which causes this path to be a false positive. Teleporting occurs when an edge entering a node is followed by an edge leaving from a different node. Below you can see an extra edge that is highlighted, representing the teleport. In this example, the edge from 4 to 3 is followed by the edge from 4 to 5.
  
Example 1
+
[[Image:Fpospic.jpg]]
  
1 2∧ 3 4∧
+
         
 +
MATLAB programs used to find the false positives:
  
1 2 -3 -4 New row
 
  
Since everything is starts positive we need 2 chunks and they must be in numerical order [1,.,a].[a+1,…..,n-1]
+
[http://gcat.davidson.edu/iGEM07/For_Missouri_Western/fpVTgraph.m 1. False positives for the 8 edge / 5 node graph shown above]
 +
 
 +
true positives = 384
 +
false positives = 1,362
 +
 
 +
[http://gcat.davidson.edu/iGEM07/For_Missouri_Western/davidson85Iweighter.m 2. Counter Program for the 8 edge / 5 node graph]
 +
 
 +
[http://gcat.davidson.edu/iGEM07/For_Missouri_Western/adelman147.m 2. Adelman's false positives with 14 edges / 7 nodes]
 +
 
 +
true positives = 10,321,920
 +
false positives = 168,006,848
 +
 
 +
[http://gcat.davidson.edu/iGEM07/For_Missouri_Western/fpweighter.m 3. Counter Program for Adelman's False Positives]
 +
 
  
Example 2
+
Below are two graphs that show how quickly graphs can be complex so it makes the Hamiltonian Path harder to detect. Also, from these graphs you can tell that even though the graphs are very complex that there is still a 5% change of detecting a Hamiltonian Path or a true positive.
  
1 2  ∧ _ _ 3 4 ∧
+
[[Image:Positivenodedg.jpg]]                                                                                              [[Image:Truetotalpos.jpg]]
  
1 2 ∧-4 -3 ∧ _ _
+
=Possion Model For the Number of Plasmids=
  
1 2 3 4
 
  
Example 3
+
'''How many E.coli computers?'''
  
1 2 3 ∧ _ 4 ∧ _
+
Question: If a Hamiltonian Path exists in the graph, how many E.coli computers do we need to have working on the problem to be highly confident that at least k of our computers will find it? 
  
1 2 3 ∧ -4 ∧ _ _
+
Our mathematical reasoning relies on the fact that the Hin/HixC DNA recombination system makes a large number of flips and for any one flip there is some element of chance that dictates which Hix sites are chosen to flip.  Note: This has now been shown to be true but it needs to be stated that our math relies on it.
  
1 2 3 4 _ _
+
If this is the case then our E.coli computers will be equally likely to have flipped into each possible signed ordering of edges by the time we screen our results. For more details on this see our section answering the question does starting orientation matter?
  
Next we figured out a formula for how many different ways to have chunks.
+
We can use a cumulative Poisson distribution to answer this question.  
Number of edges after 1st chunk - number of edges in 2nd chunk
 
  
n-1
+
Answer: The Poisson distribution gives the probability of k occurrences when the expected number of occurrences is λ as (e^-λ)*(λ^k)/(k!). By taking 1 minus the sum of the Poisson probabilities from 0 to k-1 we get the cumulative probability of at least k occurrences.
  
∑ (e-i) – (n-1-i) = Number of unimportant edges
+
For example examine the Adleman graph. It has 14 edges and 7 nodes. This means that a Hamiltonian Path is represented in our plasmids by 6 edges in the proper order and orientation followed by 8 edges in any order and in any orientation.
  
i=0
+
Thus there are 8!*2^8 orderings that are true positives and 14!*2^14 possible orderings. The probability of any one plasmid holding a true positive is then 8!*2^8 / 14!*2^14. If we have m plasmids, λ= m*8!*2^8 / 14!*2^14.
  
n-1
+
Now if we want to know the probability of at least one true positive in m plasmids, we use k =1 and the expression becomes 1-(e^-(m*8!*2^8/14!*2^14)) as the summand collapses to the case where k=0 and (λ^0)/(0!) then equals 1.
  
∑ (e-n+1)
+
We can then solve for m depending on how confident we want to be. If we use
 +
m=1 billion plasmids then we expect to find the Hamiltonian Path when one exists 99.92% of the time.
  
i=0
+
1 billion E.coli computers fit conveniently into 1 mL of culture.
  
n(e-n+1)
 
  
The equation for the number of orderings of positively oriented edges 2 flips away from being solved is:
+
Using this statistical method we made a chart showing the probability of finding true positives based on the number of plasmids.
  
(e-(n-1) ! × (n(e-n+1))
+
[[Image:Plasmidsneeded.jpg]]

Latest revision as of 18:45, 9 August 2007

Home | Background Information | Current Project: Solving the Hamiltonian Path Problem in vivo | Mathematical Modeling | Gene Splitting | Controlling Expression | Traveling Salesperson Problem | Backwards Promotion and Read-Through Transcription | Resources and Citations

Markov Chain Model For Flipping

Does starting orientation matter?

Question: Will the order in which we place the edges in our E.coli computers affect the probability of us detecting a Hamiltonian Path?

Our mathematical reasoning relies on the fact that the Hin/HixC DNA recombination system makes a large number of flips and for any one flip there is some element of chance that dictates which Hix sites are chosen to flip.

Answer: We developed a Markov Chain model that takes every signed permutation of edges as a state and the connections between these states are found and categorized by our Matlab programs by both the number of edges that need to be flipped to move between those states and categorized by which specific edge needed to be flipped. Further, our Matlab programs creates a transition matrix based on the weights a user enters to bias the probabilities of making different kinds of flips.

Once this transition matrix is generated we then compute, for each possible starting state, the probability of that starting state transitioning to any of the solved states after x flips given the biasing of different types of flips. When our Markov Chain model is run for more than 2 edges we always observe a convergence of these probabilities no matter what bias we apply. This shows that starting orientation does not matter since the probability of being solved after a large number of flips is the same for each starting orientation.

This begs the question: what is a large number of flips? Mostly we observe that this convergence occurs in the first 20 flips for three edges and only for very extreme probabilities does this convergence occur after 60 flips. However, as the number of edges increase we can expect this convergence to occur later. This does not appear to be a problem. Our flipping mechanism flips edges very quickly, so quickly that it is hard to measure. Additionally, the diameter of our transition diagrams increases linearly with the number of edges so there is no reason to believe that the number of flips to convergence would increase at more than a linear pace.


MarkovChaingraph4nod3edg.jpg


MATLAB programs that we developed using Markov Chains

1.Flip Length

2.Pure Flip

3.Bias Weighter



True Positives / False Positives

5 node-wiki.jpg

True positives are the Hamiltonian Path represented in the above graph with solid red arrows and below as the PCR Fragment Length where the edges after the Hamiltonian path can be arranged in any order.

True positive.jpg

A false positive is represented in the graph above with the solid red arrows and the dotted red arrow showing a "teleport" which causes this path to be a false positive. Teleporting occurs when an edge entering a node is followed by an edge leaving from a different node. Below you can see an extra edge that is highlighted, representing the teleport. In this example, the edge from 4 to 3 is followed by the edge from 4 to 5.

Fpospic.jpg


MATLAB programs used to find the false positives:


1. False positives for the 8 edge / 5 node graph shown above

true positives = 384 false positives = 1,362

2. Counter Program for the 8 edge / 5 node graph

2. Adelman's false positives with 14 edges / 7 nodes

true positives = 10,321,920 false positives = 168,006,848

3. Counter Program for Adelman's False Positives


Below are two graphs that show how quickly graphs can be complex so it makes the Hamiltonian Path harder to detect. Also, from these graphs you can tell that even though the graphs are very complex that there is still a 5% change of detecting a Hamiltonian Path or a true positive.

Positivenodedg.jpg Truetotalpos.jpg

Possion Model For the Number of Plasmids

How many E.coli computers?

Question: If a Hamiltonian Path exists in the graph, how many E.coli computers do we need to have working on the problem to be highly confident that at least k of our computers will find it?

Our mathematical reasoning relies on the fact that the Hin/HixC DNA recombination system makes a large number of flips and for any one flip there is some element of chance that dictates which Hix sites are chosen to flip. Note: This has now been shown to be true but it needs to be stated that our math relies on it.

If this is the case then our E.coli computers will be equally likely to have flipped into each possible signed ordering of edges by the time we screen our results. For more details on this see our section answering the question does starting orientation matter?

We can use a cumulative Poisson distribution to answer this question.

Answer: The Poisson distribution gives the probability of k occurrences when the expected number of occurrences is λ as (e^-λ)*(λ^k)/(k!). By taking 1 minus the sum of the Poisson probabilities from 0 to k-1 we get the cumulative probability of at least k occurrences.

For example examine the Adleman graph. It has 14 edges and 7 nodes. This means that a Hamiltonian Path is represented in our plasmids by 6 edges in the proper order and orientation followed by 8 edges in any order and in any orientation.

Thus there are 8!*2^8 orderings that are true positives and 14!*2^14 possible orderings. The probability of any one plasmid holding a true positive is then 8!*2^8 / 14!*2^14. If we have m plasmids, λ= m*8!*2^8 / 14!*2^14.

Now if we want to know the probability of at least one true positive in m plasmids, we use k =1 and the expression becomes 1-(e^-(m*8!*2^8/14!*2^14)) as the summand collapses to the case where k=0 and (λ^0)/(0!) then equals 1.

We can then solve for m depending on how confident we want to be. If we use m=1 billion plasmids then we expect to find the Hamiltonian Path when one exists 99.92% of the time.

1 billion E.coli computers fit conveniently into 1 mL of culture.


Using this statistical method we made a chart showing the probability of finding true positives based on the number of plasmids.

Plasmidsneeded.jpg