Modeling Promoter Activity

From GcatWiki

To use synthetic promoters to their fullest potential, we have to understand how they work. Synthetic promoters cannot help us model gene circuit activity unless models are developed for the activity of the promoter itself. Determining exactly how a promoter's strength correlates with its mutations is not easy, since for the most part it requires working with promoters at the level of individual sets of nucleotides.

Jensen and Hammer (1997): Spacer Sequences

In this 1997 paper, Jensen and Hammer constructed a library of synthetic promoters based on promoters from the prokaryote Lactococcus lactis in order to better determine how a promoter's sequence is tied to its strength. Specifically, Jensen and Hammer were looking for a way to construct a constitutively active promoter (one that is always turned on, without needing an inducer) that could be safely used to tune gene expression in industrial-scale metabolic engineering projects, where inducers might be impractical or hazardous.

To tune steady-state expression from the L. lactis promoter without using an inducer, Jensen and Hammer had to create a library of L. lactis mutant promoters with a range of activity levels. To generate the library, they used the method described in Promoters and Reporters in Synthetic Biology: constructing oligonucleotides that matched the sequences common to all previous L. lactis promoters and mutants, with those conserved sequences joined together by random spacer sequences.
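The library design described above can be sketched in a few lines of Python. This is only an illustration: the two hexamers are the canonical bacterial -35 and -10 consensus boxes, and the 17 bp spacer length, the function names, and the library size are assumptions made for the example. The conserved sequences Jensen and Hammer actually used may differ.

```python
import random

# Assumed conserved boxes (canonical bacterial consensus hexamers).
# The conserved sequences in the original library may differ.
MINUS_35 = "TTGACA"
MINUS_10 = "TATAAT"
SPACER_LENGTH = 17  # assumed spacer length between the two boxes


def random_spacer(length, rng):
    """Return a random DNA sequence of the requested length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def build_library(size, seed=0):
    """Build `size` promoters: fixed boxes joined by a random spacer."""
    rng = random.Random(seed)
    return [
        MINUS_35 + random_spacer(SPACER_LENGTH, rng) + MINUS_10
        for _ in range(size)
    ]


library = build_library(5)
for promoter in library:
    print(promoter)
```

Every promoter in such a library shares the same conserved boxes, so any difference in strength between clones has to come from the randomized spacer.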

After the promoter library was synthesized, the promoters were cloned into both L. lactis and E. coli, and each culture, carrying a different promoter, was assayed for beta-galactosidase activity. The activity of each promoter (in Miller units, a standard measure of beta-galactosidase activity) is shown in Figure 3.
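For readers unfamiliar with Miller units, the sketch below computes them with the standard textbook formula, 1000 × (OD420 − 1.75 × OD550) / (t × v × OD600). The function name and the sample readings are hypothetical, and the exact assay protocol used by Jensen and Hammer may have differed in its details.

```python
def miller_units(od420, od550, od600, minutes, volume_ml):
    """Standard Miller formula for beta-galactosidase activity.

    od420: absorbance of the reaction (o-nitrophenol plus scatter)
    od550: absorbance used to correct for light scattering by cell debris
    od600: cell density of the culture at the time of assay
    minutes: reaction time
    volume_ml: volume of culture used in the assay
    """
    if minutes <= 0 or volume_ml <= 0 or od600 <= 0:
        raise ValueError("minutes, volume_ml and od600 must be positive")
    return 1000.0 * (od420 - 1.75 * od550) / (minutes * volume_ml * od600)


# Hypothetical readings, for illustration only.
example = miller_units(od420=0.9, od550=0.02, od600=0.5,
                       minutes=10, volume_ml=0.1)
print(round(example, 1))
```

Because the value is normalized by cell density, reaction time, and culture volume, it reflects enzyme activity per cell rather than an absolute enzyme concentration.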

Figure 3. Library of synthetic promoters for L. lactis. Promoter activities (Miller units) were assayed from the expression of a reporter gene (lacLM) encoding β-galactosidase transcribed from the different synthetic promoter clones on the promoter cloning vector pAK80. The patterns of the data points indicate which promoter clones contain errors in either the −35 or the −10 consensus sequence or in the length of the spacer between these sequences. From Jensen and Hammer (1997). Permission Pending.

The mutant promoters showed a wide range of activity, increasing in small increments. Note that not all of the clones were "perfect": a few had mutations in the oligonucleotide sequences that were supposed to be preserved across the library. Those clones are indicated in the graph above. However, their data were not removed because they fell within the range of the data from the perfect clones and caused no break in the general trend. In addition, all clones were tested to ensure that they were truly constitutive.

When the promoters were cloned into E. coli, the same basic trend was observed. While the promoters did not demonstrate the same level of activity as they did in L. lactis, there was still a wide range of activity observed, with the activity level increasing in steady increments.

Jensen and Hammer constructed a library of synthetic promoters that could be constitutively expressed and covered a range of activity levels, but it was still not known for certain what determined a given promoter's level of activity. Jensen and Hammer suggested in their Discussion that "it seems that the overall three-dimensional structure which arises from a particular nucleotide sequence could be important".

Jensen, Alper, Fischer, and Stephanopoulos (2006): Statistical Modeling and Critical Mutation Sites

In this paper, Jensen et al. tried to determine exactly why some promoters in a promoter library were stronger than others, and which mutations might cause the change in strength. Jensen et al. proposed examining promoter libraries statistically rather than by assaying each mutation individually: they determined which mutations are associated with which phenotypes based on how often each mutation appears in each phenotypic class.

Imagine you are creating a mutant library of a protein that can fluoresce one of three colors: red, blue, or green. If a given point mutation – let’s call it A – has no effect on the color of the fluorescence, then (assuming the mutagenesis is truly random) that mutation should appear in every phenotype proportional to the amount of protein with that phenotype. It will not appear in one phenotype significantly more than the others unless there is significantly more protein with that phenotype. It follows, then, that if point mutation B appears much more often in, say, blue protein without there being much more blue protein than red or green protein, mutation B might have some effect on the protein’s phenotype. It is probably not the sole cause of the blue color, but it is associated with it.

To test their statistical analysis, Jensen et al. generated different variants of a single promoter via error-prone PCR, fused each promoter into a plasmid with a GFP reporter gene, and then measured the amount of GFP via flow cytometry. The promoters were then sequenced, and any with insertions or deletions were removed, leaving 69 promoters.
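The sequence-filtering step can be illustrated with a small Python sketch that compares a mutant to the wild-type sequence, discards length changes, and labels each substitution as a transition or a transversion. The function name and the toy sequences are hypothetical. Treating any length difference as an insertion or deletion is a simplification; a real pipeline would use a proper alignment.

```python
PURINES = {"A", "G"}


def call_mutations(wild_type, mutant):
    """Return point mutations as (position, ref, alt, kind) tuples.

    Returns None when the lengths differ, which this sketch treats as
    evidence of an insertion or deletion (a simplifying assumption).
    """
    if len(mutant) != len(wild_type):
        return None
    calls = []
    for index, (ref, alt) in enumerate(zip(wild_type, mutant)):
        if ref != alt:
            # Purine<->purine or pyrimidine<->pyrimidine is a transition.
            same_class = (ref in PURINES) == (alt in PURINES)
            kind = "transition" if same_class else "transversion"
            calls.append((index, ref, alt, kind))
    return calls


print(call_mutations("ACGT", "ACAT"))
print(call_mutations("ACGT", "ACG"))
```

Positions here are zero-based offsets into the toy sequence, not the promoter coordinates used in the paper.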

Now, assume that each mutant can be classified into one of an unknown number of phenotypic (descriptive) classes; let's call that number M. There would then be n(m) mutants in each class m, with the sum of n(m) over all classes equalling N, the total number of mutants. Now, say you have a set of mutated promoters of size X, where X < N, all with one particular point mutation. If that mutation has no effect on the phenotype of the promoter, then the fraction of mutants in any given class carrying that point mutation would equal X/N (the total number of those mutants divided by the total number of promoters), so the expected number of carriers in class m is n(m) × X/N. In other words, they would be distributed across the classes in proportion to class size.
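As a minimal sketch of this null expectation, the Python below computes how many carriers of a neutral mutation would be expected in each class when carriers are spread in proportion to class size. The class names and counts are made up for illustration, borrowing the three-color example from earlier in this section.

```python
def expected_counts(class_sizes, carriers):
    """Expected number of carriers per class if the mutation is neutral.

    class_sizes: mapping of class name -> n(m), the mutants in that class
    carriers: X, the number of mutants carrying the point mutation
    """
    total = sum(class_sizes.values())  # N, the total number of mutants
    return {name: carriers * size / total
            for name, size in class_sizes.items()}


# Hypothetical library: 100 mutants in three phenotypic classes,
# 20 of which carry the mutation of interest.
class_sizes = {"red": 40, "blue": 30, "green": 30}
expected = expected_counts(class_sizes, carriers=20)
print(expected)
```

An observed count far from these expected values, for example 15 of the 20 carriers landing in the blue class, is the kind of skew the statistical test is designed to flag.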

In multinomial statistics, the probability that the X mutants carrying the mutation are distributed across the classes with a particular set of counts y is:

Fd2 1.gif

Where the sum of the y values is equal to X. Given that constraint, the probability that q or more of the mutants carrying a specific mutation appear in a particular class i, P(i), is:

Fd5 4.gif
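Written out under standard multinomial assumptions, the two expressions described above would take roughly the following form. This is a sketch rather than a transcription of the paper: the symbol p_m = n(m)/N (the probability that a neutral mutation falls in class m) is introduced here as an assumption, and the notation used by Jensen et al. may differ.

```latex
% Probability of observing counts y_1, ..., y_M across the M classes
P(y_1, \dots, y_M) = \frac{X!}{y_1! \, y_2! \cdots y_M!} \prod_{m=1}^{M} p_m^{\,y_m},
\qquad \sum_{m=1}^{M} y_m = X, \qquad p_m = \frac{n(m)}{N}

% Probability that q or more carriers fall in class i
% (the marginal of a multinomial is binomial)
P(i) = P(y_i \ge q) = \sum_{k=q}^{X} \binom{X}{k} \, p_i^{\,k} \, (1 - p_i)^{X-k}
```

With only two equally sized classes, p_i = 0.5 and the second expression reduces to a simple binomial tail.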

The 69 promoters being examined were divided into two phenotypic classes based on their fluorescence: the top 50th percentile (brightest) and the bottom 50th percentile (dimmest). Because there are only two classes, the statistical analysis is simplified somewhat. The complete statistical analysis can be seen here in Figure 5:


Figure 5. Statistical distribution of mutations and their effects on mutant fluorescence. In panel A, the vertical axis shows the mutant number, where the mutants are sorted in descending order by their relative fluorescence. In general, the single-cell fluorescence distribution for each mutant strain was log normal distributed. The horizontal axis shows the mean of the log relative fluorescence for each mutant strain, where the error is the standard deviation of this distribution. Reading to the right from panel A into panel B reveals the point mutations present in each mutant. For each location in a mutant (where location is indicated on the horizontal axis) that was changed via the error-prone PCR, a black dot is indicated. With only two exceptions, all of these changes are base transitions rather than transversions, so the sequence of each of the 69 clones can be inferred from the wild-type sequence shown in panel D. (All of the mutations indicated in panel B are transitions with the exception of one A-C transversion at –125 bp in clone 53 and one T-G transversion at –8 in clone 68. These were treated as though they were transitions in our analysis.) Reading down from panel B into panel C shows how mutations at a particular location partition between the two classes of mutants: the top and bottom 50th percentiles. Sites that have no effect on the fluorescence phenotype should partition equally between the two classes, i.e., they should follow a binomial distribution with P = 0.5. Sites that deviate from this distribution are labeled with a dot and are colored either green or red, corresponding to the apparent effect of a mutation at the site. For these sites, P values are indicated, where this value is the probability of seeing a distribution at least as skewed to one side. Sites that were subsequently tested experimentally (see text) are indicated with an asterisk, where the color of the asterisk denotes the expected effect of a mutation at the site. 
We chose a range of sites to test experimentally from sites with high-confidence (low P value) positive effects to those with low-confidence (P value 0.5) negative effects (Table 1). These sites are also shown in panel D, which contains the wild-type nucleotide sequence of the promoter region that was subjected to mutation. From Jensen et al. (2006). Permission pending.
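The P values described in the caption can be approximated with a one-sided binomial tail, assuming that a neutral site partitions between the two classes with probability 0.5. The function names and the example counts below are hypothetical; this is a sketch of the calculation, not the authors' code.

```python
from math import comb


def binomial_tail(successes, trials, p=0.5):
    """Probability of at least `successes` hits in `trials` draws."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def skew_p_value(top, bottom):
    """P value for a site mutated in `top` bright and `bottom` dim clones.

    Returns the probability of a split at least as skewed toward the
    larger side, under the neutral assumption p = 0.5.
    """
    return binomial_tail(max(top, bottom), top + bottom)


# Hypothetical site: mutated in 9 top-half clones and 1 bottom-half clone.
print(skew_p_value(9, 1))
```

A site that splits evenly between the classes gives a large P value, while a site found almost entirely in one class gives a small one, marking it as a candidate for experimental testing.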

Statistical analysis revealed seven nucleotide positions that were significantly correlated with one of the two classes. These seven positions were then tested individually, to see whether the phenotype of each mutation in isolation matched the phenotype associated with it in the random library (where it was accompanied by many other mutations).

When tested, six of the seven mutants had a phenotype in isolation similar to the one associated with them in the random library, indicating that the statistical model correctly predicted most of the significant mutations.

De Mey, Maertens, Lequeux, Soetaert, and Vandamme (2007): Probability and Partial Least Squares Modeling