E. coli DNA Polymerase I

This page shows the sequence of the DNA polymerase I gene ("polA") from Escherichia coli strain K-12, substrain MG1655. Various sections of the gene and the corresponding protein sequence are indicated.

Segment | Nucleotides | Amino acids | Description
DnaA binding site | -132 to -124 | (N/A) | The replication initiation protein, DnaA, acts as a transcription factor for polA.
'CAAT box' (-35 box) and 'Pribnow box' (-10 box) | -57 to -52 and -36 to -31 | (N/A) | These are parts of the RNA polymerase binding site.
→ transcription start site | -19 | (N/A) | This is the beginning of the region that is copied into mRNA by RNA polymerase.
ribosome binding site | -10 to -6 | (N/A) | This is the Shine-Dalgarno sequence, which is important for ribosomes to bind to the mRNA and begin translation.
cleaved region | 1 to 969 | 1 to 323 | Removed by proteolytic cleavage to leave the Klenow fragment.
5' to 3' exonuclease | 19 to 756 | 7 to 252 | This is the enzymatic domain responsible for removing DNA from the template ahead of the enzyme. This is important in vivo for removing the RNA primers from the lagging strand in DNA replication, and in vitro for nick translation.
Klenow fragment | 970 to 2784 | 324 to 928 | Produced from intact DNA polymerase I by proteolytic cleavage, which removes the 'cleaved' section from the amino terminus. This fragment contains both polymerase and proofreading activities, but lacks the 5' to 3' exonuclease.
3' to 5' exonuclease | 1033 to 1617 | 345 to 539 | This is the "proofreading" domain. If the enzyme makes a mistake, this domain makes it possible to back up, remove the erroneous base, and try again. This leads to much higher fidelity of replication.
polymerase | 1639 to 2778 | 547 to 926 | This domain is responsible for the actual polymerization of the new DNA strand.
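
The nucleotide and amino acid columns above are tied together by the three-bases-per-codon rule. As a small sketch (the function name `nucleotides_to_residues` is ours, not part of any library), the following Python converts a nucleotide range in the coding region into the residue numbers it covers, using ranges copied from the table:

```python
def nucleotides_to_residues(nt_start, nt_end):
    """Map a 1-based nucleotide range (coding region) to 1-based residue numbers.

    Nucleotides 1-3 encode residue 1, nucleotides 4-6 encode residue 2, and so on.
    """
    first = (nt_start - 1) // 3 + 1
    last = (nt_end - 1) // 3 + 1
    return first, last


# Nucleotide ranges copied from the table above.
segments = {
    "cleaved region": (1, 969),
    "5' to 3' exonuclease": (19, 756),
    "3' to 5' exonuclease": (1033, 1617),
    "polymerase": (1639, 2778),
}

for name, (start, end) in segments.items():
    print(name, nucleotides_to_residues(start, end))
```

Each printed pair should agree with the amino acid column of the corresponding row.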

Sequence of the E. coli polA gene

  -300 ATCCTTAAGGAGAAAAATAATTCATATCTATCCACATTAGAAAAAATCCCATTATCTCAA   -241
                                                                          
  -240 TTATTAGGGATGGATTTATTTTTAACTGCATGAAAAACAAAGACAAACATCATGCTGTAA   -181
                                                                          
  -180 AAAGCATGATAATAAATTAAAAGCGATGTAAATAATTTATGCACAAAGTTATCCACATGA   -121
                                                                          
  -120 CGATTTGCGAGCGATCCAGAAGATCTACAAAAGATTTTCACGAAAAGCGGTGAAAAACTC    -61

   -60 ATGTTTTCATCCTGTCTGTGGCATCCTTTACCCATAATCTGATAAACAGGCACGGACATT     -1
                                                                          
     1 ATGGTTCAGATCCCCCAAAATCCACTTATCCTTGTAGATGGTTCATCTTATCTTTATCGC     60
     1 MetValGlnIleProGlnAsnProLeuIleLeuValAspGlySerSerTyrLeuTyrArg     20
                                                                          
    61 GCATATCACGCGTTTCCCCCGCTGACTAACAGCGCAGGCGAGCCGACCGGTGCGATGTAT    120
    21 AlaTyrHisAlaPheProProLeuThrAsnSerAlaGlyGluProThrGlyAlaMetTyr     40
                                                                          
   121 GGTGTCCTCAACATGCTGCGCAGTCTGATCATGCAATATAAACCGACGCATGCAGCGGTG    180
    41 GlyValLeuAsnMetLeuArgSerLeuIleMetGlnTyrLysProThrHisAlaAlaVal     60
                                                                          
   181 GTCTTTGACGCCAAGGGAAAAACCTTTCGTGATGAACTGTTTGAACATTACAAATCACAT    240
    61 ValPheAspAlaLysGlyLysThrPheArgAspGluLeuPheGluHisTyrLysSerHis     80
                                                                          
   241 CGCCCGCCAATGCCGGACGATCTGCGTGCACAAATCGAACCCTTGCACGCGATGGTTAAA    300
    81 ArgProProMetProAspAspLeuArgAlaGlnIleGluProLeuHisAlaMetValLys    100
                                                                          
   301 GCGATGGGACTGCCGCTGCTGGCGGTTTCTGGCGTAGAAGCGGACGACGTTATCGGTACT    360
   101 AlaMetGlyLeuProLeuLeuAlaValSerGlyValGluAlaAspAspValIleGlyThr    120
                                                                          
   361 CTGGCGCGCGAAGCCGAAAAAGCCGGGCGTCCGGTGCTGATCAGCACTGGCGATAAAGAT    420
   121 LeuAlaArgGluAlaGluLysAlaGlyArgProValLeuIleSerThrGlyAspLysAsp    140
                                                                          
   421 ATGGCGCAGCTGGTGACGCCAAATATTACGCTTATCAATACCATGACGAATACCATCCTC    480
   141 MetAlaGlnLeuValThrProAsnIleThrLeuIleAsnThrMetThrAsnThrIleLeu    160
                                                                          
   481 GGACCGGAAGAGGTGGTGAATAAGTACGGCGTGCCGCCAGAACTGATCATCGATTTCCTG    540
   161 GlyProGluGluValValAsnLysTyrGlyValProProGluLeuIleIleAspPheLeu    180
                                                                          
   541 GCGCTGATGGGTGACTCCTCTGATAACATTCCTGGCGTACCGGGCGTCGGTGAAAAAACC    600
   181 AlaLeuMetGlyAspSerSerAspAsnIleProGlyValProGlyValGlyGluLysThr    200
                                                                          
   601 GCGCAGGCATTGCTGCAAGGTCTTGGCGGACTGGATACGCTGTATGCCGAGCCAGAAAAA    660
   201 AlaGlnAlaLeuLeuGlnGlyLeuGlyGlyLeuAspThrLeuTyrAlaGluProGluLys    220
                                                                          
   661 ATTGCTGGGTTGAGCTTCCGTGGCGCGAAAACAATGGCAGCGAAGCTCGAGCAAAACAAA    720
   221 IleAlaGlyLeuSerPheArgGlyAlaLysThrMetAlaAlaLysLeuGluGlnAsnLys    240
                                                                          
   721 GAAGTTGCTTATCTCTCATACCAGCTGGCGACGATTAAAACCGACGTTGAACTGGAGCTG    780
   241 GluValAlaTyrLeuSerTyrGlnLeuAlaThrIleLysThrAspValGluLeuGluLeu    260
                                                                          
   781 ACCTGTGAACAACTGGAAGTGCAGCAACCGGCAGCGGAAGAGTTGTTGGGGCTGTTCAAA    840
   261 ThrCysGluGlnLeuGluValGlnGlnProAlaAlaGluGluLeuLeuGlyLeuPheLys    280
                                                                          
   841 AAGTATGAGTTCAAACGCTGGACTGCTGATGTCGAAGCGGGCAAATGGTTACAGGCCAAA    900
   281 LysTyrGluPheLysArgTrpThrAlaAspValGluAlaGlyLysTrpLeuGlnAlaLys    300
                                                                          
   901 GGGGCAAAACCAGCCGCGAAGCCACAGGAAACCAGTGTTGCAGACGAAGCACCAGAAGTG    960
   301 GlyAlaLysProAlaAlaLysProGlnGluThrSerValAlaAspGluAlaProGluVal    320
                                                                          
   961 ACGGCAACGGTGATTTCTTATGACAACTACGTCACCATCCTTGATGAAGAAACACTGAAA   1020
   321 ThrAlaThrValIleSerTyrAspAsnTyrValThrIleLeuAspGluGluThrLeuLys    340
                                                                          
  1021 GCGTGGATTGCGAAGCTGGAAAAAGCGCCGGTATTTGCATTTGATACCGAAACCGACAGC   1080
   341 AlaTrpIleAlaLysLeuGluLysAlaProValPheAlaPheAspThrGluThrAspSer    360
                                                                          
  1081 CTTGATAACATCTCTGCTAACCTGGTCGGGCTTTCTTTTGCTATCGAGCCAGGCGTAGCG   1140
   361 LeuAspAsnIleSerAlaAsnLeuValGlyLeuSerPheAlaIleGluProGlyValAla    380
                                                                          
  1141 GCATATATTCCGGTTGCTCATGATTATCTTGATGCGCCCGATCAAATCTCTCGCGAGCGT   1200
   381 AlaTyrIleProValAlaHisAspTyrLeuAspAlaProAspGlnIleSerArgGluArg    400
                                                                          
  1201 GCACTCGAGTTGCTAAAACCGCTGCTGGAAGATGAAAAGGCGCTGAAGGTCGGGCAAAAC   1260
   401 AlaLeuGluLeuLeuLysProLeuLeuGluAspGluLysAlaLeuLysValGlyGlnAsn    420
                                                                          
  1261 CTGAAATACGATCGCGGTATTCTGGCGAACTACGGCATTGAACTGCGTGGGATTGCGTTT   1320
   421 LeuLysTyrAspArgGlyIleLeuAlaAsnTyrGlyIleGluLeuArgGlyIleAlaPhe    440
                                                                          
  1321 GATACCATGCTGGAGTCCTACATTCTCAATAGCGTTGCCGGGCGTCACGATATGGACAGC   1380
   441 AspThrMetLeuGluSerTyrIleLeuAsnSerValAlaGlyArgHisAspMetAspSer    460
                                                                          
  1381 CTCGCGGAACGTTGGTTGAAGCACAAAACCATCACTTTTGAAGAGATTGCTGGTAAAGGC   1440
   461 LeuAlaGluArgTrpLeuLysHisLysThrIleThrPheGluGluIleAlaGlyLysGly    480
                                                                          
  1441 AAAAATCAACTGACCTTTAACCAGATTGCCCTCGAAGAAGCCGGACGTTACGCCGCCGAA   1500
   481 LysAsnGlnLeuThrPheAsnGlnIleAlaLeuGluGluAlaGlyArgTyrAlaAlaGlu    500
                                                                          
  1501 GATGCAGATGTCACCTTGCAGTTGCATCTGAAAATGTGGCCGGATCTGCAAAAACACAAA   1560
   501 AspAlaAspValThrLeuGlnLeuHisLeuLysMetTrpProAspLeuGlnLysHisLys    520
                                                                          
  1561 GGGCCGTTGAACGTCTTCGAGAATATCGAAATGCCGCTGGTGCCGGTGCTTTCACGCATT   1620
   521 GlyProLeuAsnValPheGluAsnIleGluMetProLeuValProValLeuSerArgIle    540
                                                                          
  1621 GAACGTAACGGTGTGAAGATCGATCCGAAAGTGCTGCACAATCATTCTGAAGAGCTCACC   1680
   541 GluArgAsnGlyValLysIleAspProLysValLeuHisAsnHisSerGluGluLeuThr    560
                                                                          
  1681 CTTCGTCTGGCTGAGCTGGAAAAGAAAGCGCATGAAATTGCAGGTGAGGAATTTAACCTT   1740
   561 LeuArgLeuAlaGluLeuGluLysLysAlaHisGluIleAlaGlyGluGluPheAsnLeu    580
                                                                          
  1741 TCTTCCACCAAGCAGTTACAAACCATTCTCTTTGAAAAACAGGGCATTAAACCGCTGAAG   1800
   581 SerSerThrLysGlnLeuGlnThrIleLeuPheGluLysGlnGlyIleLysProLeuLys    600
                                                                          
  1801 AAAACGCCGGGTGGCGCGCCGTCAACGTCGGAAGAGGTACTGGAAGAACTGGCGCTGGAC   1860
   601 LysThrProGlyGlyAlaProSerThrSerGluGluValLeuGluGluLeuAlaLeuAsp    620
                                                                          
  1861 TATCCGTTGCCAAAAGTGATTCTGGAGTATCGTGGTCTGGCGAAGCTGAAATCGACCTAC   1920
   621 TyrProLeuProLysValIleLeuGluTyrArgGlyLeuAlaLysLeuLysSerThrTyr    640
                                                                          
  1921 ACCGACAAGCTGCCGCTGATGATCAACCCGAAAACCGGGCGTGTGCATACCTCTTATCAC   1980
   641 ThrAspLysLeuProLeuMetIleAsnProLysThrGlyArgValHisThrSerTyrHis    660
                                                                          
  1981 CAGGCAGTAACTGCAACGGGACGTTTATCGTCAACCGATCCTAACCTGCAAAACATTCCG   2040
   661 GlnAlaValThrAlaThrGlyArgLeuSerSerThrAspProAsnLeuGlnAsnIlePro    680
                                                                          
  2041 GTGCGTAACGAAGAAGGTCGTCGTATCCGCCAGGCGTTTATTGCGCCAGAGGATTATGTG   2100
   681 ValArgAsnGluGluGlyArgArgIleArgGlnAlaPheIleAlaProGluAspTyrVal    700
                                                                          
  2101 ATTGTCTCAGCGGACTACTCGCAGATTGAACTGCGCATTATGGCGCATCTTTCGCGTGAC   2160
   701 IleValSerAlaAspTyrSerGlnIleGluLeuArgIleMetAlaHisLeuSerArgAsp    720
                                                                          
  2161 AAAGGCTTGCTGACCGCATTCGCGGAAGGAAAAGATATCCACCGGGCAACGGCGGCAGAA   2220
   721 LysGlyLeuLeuThrAlaPheAlaGluGlyLysAspIleHisArgAlaThrAlaAlaGlu    740
                                                                          
  2221 GTGTTTGGTTTGCCACTGGAAACCGTCACCAGCGAGCAACGCCGTAGCGCGAAAGCGATC   2280
   741 ValPheGlyLeuProLeuGluThrValThrSerGluGlnArgArgSerAlaLysAlaIle    760
                                                                          
  2281 AACTTTGGTCTGATTTATGGCATGAGTGCTTTCGGTCTGGCGCGGCAATTGAACATTCCA   2340
   761 AsnPheGlyLeuIleTyrGlyMetSerAlaPheGlyLeuAlaArgGlnLeuAsnIlePro    780
                                                                          
  2341 CGTAAAGAAGCGCAGAAGTACATGGACCTTTACTTCGAACGCTACCCTGGCGTGCTGGAG   2400
   781 ArgLysGluAlaGlnLysTyrMetAspLeuTyrPheGluArgTyrProGlyValLeuGlu    800
                                                                          
  2401 TATATGGAACGCACCCGTGCTCAGGCGAAAGAGCAGGGCTACGTTGAAACGCTGGACGGA   2460
   801 TyrMetGluArgThrArgAlaGlnAlaLysGluGlnGlyTyrValGluThrLeuAspGly    820
                                                                          
  2461 CGCCGTCTGTATCTGCCGGATATCAAATCCAGCAATGGTGCTCGTCGTGCAGCGGCTGAA   2520
   821 ArgArgLeuTyrLeuProAspIleLysSerSerAsnGlyAlaArgArgAlaAlaAlaGlu    840
                                                                          
  2521 CGTGCAGCCATTAACGCGCCAATGCAGGGAACCGCCGCCGACATTATCAAACGGGCGATG   2580
   841 ArgAlaAlaIleAsnAlaProMetGlnGlyThrAlaAlaAspIleIleLysArgAlaMet    860
                                                                          
  2581 ATTGCCGTTGATGCGTGGTTACAGGCTGAGCAACCGCGTGTACGTATGATCATGCAGGTA   2640
   861 IleAlaValAspAlaTrpLeuGlnAlaGluGlnProArgValArgMetIleMetGlnVal    880
                                                                          
  2641 CACGATGAACTGGTATTTGAAGTTCATAAAGATGATGTTGATGCCGTCGCGAAGCAGATT   2700
   881 HisAspGluLeuValPheGluValHisLysAspAspValAspAlaValAlaLysGlnIle    900
                                                                          
  2701 CATCAACTGATGGAAAACTGTACCCGTCTGGATGTGCCGTTGCTGGTGGAAGTGGGGAGT   2760
   901 HisGlnLeuMetGluAsnCysThrArgLeuAspValProLeuLeuValGluValGlySer    920
                                                                          
  2761 GGCGAAAACTGGGATCAGGCGCACTAAGATTCGCCTGAACATGCCTTTTTTCGTAAGTAA   2820
   921 GlyGluAsnTrpAspGlnAlaHisEnd                                     928

DNA sequence is shown in black on a white background, and the corresponding protein is shown using various text and background colors. Each is numbered according to where protein translation starts.

The area before the protein coding region contains the promoter, which affects how much mRNA will be transcribed from this gene by RNA polymerase under various circumstances. Several interesting parts of the promoter sequence are marked.

The binding site for RNA polymerase contains two segments that are relatively conserved among genes. These are called the -35 box (labeled 'CAAT box' on this page, although that name more commonly refers to a eukaryotic promoter element) and the Pribnow box, or -10 box, based on their normal locations relative to the transcription start site. Remember that our numbering starts from the translation start site instead. You may also notice that the -35 box is not quite 35 bases upstream of the transcription start site in this gene.

The polA promoter also contains a binding site (possibly two) for DnaA, the initiator protein for DNA replication in E. coli. The main function of DnaA is to help unwind DNA at the oriC origin of replication when DNA replication begins. Here it also acts as a transcription factor, increasing the amount of DNA polymerase I produced when the cell is replicating.

The start site of transcription is shown with an arrow. This is where RNA polymerase begins to copy the gene into mRNA. A short region at the beginning of the mRNA does not code for protein; this is called the 5' untranslated region (5' UTR). One of the most important features of the 5' UTR is the ribosome binding site (the Shine-Dalgarno sequence), which controls where translation starts in the mRNA sequence. It is complementary to the UCCU core sequence at the 3' end of the 16S rRNA in the 30S ribosomal subunit.

The first amino acid is numbered 1. This amino acid is special, since it starts a new chain, rather than being added to an existing chain. In prokaryotes, the initial amino acid is a modified methionine (N-formyl methionine, or fMet). The tRNA for fMet usually matches an 'ATG' codon, just like normal methionine; this is the case here. In some cases, bacterial genes use different codons for the fMet (like 'GTG'). The first base of this first codon (here an 'A') is numbered 1 in the DNA sequence. Nucleotides before the coding region have negative numbers. Note that there is no zero in this counting scheme.
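
Because there is no position zero, converting these position numbers into ordinary string indices takes a little care. The following Python is a minimal sketch of one way to do it; the helper names `to_index` and `fetch` are hypothetical, and the two sequence strings are copied from the rows labeled -60 and 1 above.

```python
UPSTREAM = "ATGTTTTCATCCTGTCTGTGGCATCCTTTACCCATAATCTGATAAACAGGCACGGACATT"  # -60 to -1
CODING = "ATGGTTCAGATCCCCCAAAATCCACTTATCCTTGTAGATGGTTCATCTTATCTTTATCGC"    # 1 to 60
SEQ = UPSTREAM + CODING
FIRST_POS = -60  # position number of SEQ[0]


def to_index(pos, first_pos=FIRST_POS):
    """Convert a position in the 'no zero' numbering to a 0-based string index."""
    if pos == 0:
        raise ValueError("position 0 does not exist in this numbering scheme")
    offset = pos - first_pos
    if pos > 0 and first_pos < 0:
        offset -= 1  # skip over the missing zero
    return offset


def fetch(start, end):
    """Return the bases from position start to position end, inclusive."""
    return SEQ[to_index(start):to_index(end) + 1]


print(fetch(1, 3))   # the start codon: ATG
print(fetch(-1, 1))  # the two bases on either side of the boundary: TA
```

Note that `fetch(-1, 1)` returns two bases, not three, because -1 and 1 are adjacent positions.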

This protein actually has three distinct enzymatic activities. It is, of course, a DNA polymerase, that is, an enzyme that catalyzes polymerization of dNTPs into DNA. More specifically, this is a DNA-dependent DNA polymerase, since it constructs DNA from a DNA template. In addition, it has two different exonuclease activities. A nuclease is an enzyme that degrades nucleic acids. An exonuclease eats away from one end of a DNA strand (as opposed to an endonuclease, which attacks DNA strands in the middle). Since DNA has two distinct ends (a 5' end and a 3' end), it may not be surprising that some exonucleases eat DNA strands from one end, while others eat it from the other end. DNA polymerase I has both kinds of exonuclease activity, and each is carried out by a separate part of the protein.
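
The difference between the three kinds of nuclease can be pictured with simple string operations. This is only a toy illustration, assuming a single strand written 5' to 3' from left to right; the function names are ours, and real enzymes act on chemical bonds, not characters.

```python
def exo_5_to_3(strand, n):
    """Remove n nucleotides from the 5' end (the left end of the string)."""
    return strand[n:]


def exo_3_to_5(strand, n):
    """Remove n nucleotides from the 3' end (the right end of the string)."""
    return strand[:max(len(strand) - n, 0)]


def endo_cut(strand, i):
    """Cut the strand internally after i nucleotides, giving two pieces."""
    return strand[:i], strand[i:]


strand = "ATGGTTCAG"  # written 5' to 3'
print(exo_5_to_3(strand, 3))  # GTTCAG
print(exo_3_to_5(strand, 3))  # ATGGTT
print(endo_cut(strand, 4))    # ('ATGG', 'TTCAG')
```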

A 5' to 3' exonuclease eats DNA strands starting at the 5' end. When DNA polymerase is performing primer extension along a DNA template, the 5' to 3' exonuclease serves to remove any DNA strands it might run into that are bound to the template ahead of the strand it is making.

A 3' to 5' exonuclease activity allows "proofreading"; it can chop off the end of a DNA strand that has just been made (that is, the 3' end). Now, if it always chopped off the bases that had just been added, we would have a lot of trouble ever making progress on our new strand. But we do make progress, because the 3' to 5' exonuclease activity is very slow compared to polymerization. It only really makes much difference when the polymerase accidentally incorporates an incorrect base. This causes a mismatch at the end of the strand being extended, and slows down the polymerase enough that the proofreading activity has a chance to chop off the 3' end. The net result is that errors get corrected. In fact, DNA polymerases without proofreading activity tend to make a lot more mistakes than those that have proofreading. This means they introduce mutations into the product strands at a much higher frequency.
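
The back-up-and-retry logic of proofreading can be sketched in a few lines of Python. This is a deliberately simplified model, not a description of the enzyme's kinetics: the template is written 3' to 5' so that position i pairs with position i of the new strand, and the function names `proofread` and `extend` are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def proofread(template_3to5, new_strand):
    """Trim mismatched bases from the 3' end of the new strand (3' to 5' exonuclease)."""
    strand = new_strand
    while strand and COMPLEMENT[template_3to5[len(strand) - 1]] != strand[-1]:
        strand = strand[:-1]  # remove the base at the 3' end
    return strand


def extend(template_3to5, strand, n=1):
    """Add up to n correctly paired bases to the 3' end (polymerase activity)."""
    for _ in range(n):
        if len(strand) >= len(template_3to5):
            break
        strand += COMPLEMENT[template_3to5[len(strand)]]
    return strand


template = "TACCAAGTC"        # 3' to 5'; pairs with ATGGTTCAG
faulty = "ATGGA"              # the last base should have been T
repaired = proofread(template, faulty)
print(repaired)                       # ATGG
print(extend(template, repaired, 2))  # ATGGTT
```

In this toy model a correctly paired 3' end is never trimmed, which stands in for the point made above that the exonuclease is slow compared to polymerization unless a mismatch stalls the enzyme.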

Historically, this protein was purified and well characterized before its gene was cloned or sequenced. It was discovered that partial digestion with specific proteolytic enzymes cut it neatly into two parts. The larger of these parts, the Klenow fragment, retained the polymerase and proofreading activities, and proved very useful for a variety of biochemistry applications, including the earliest experiments in PCR. Since that time other enzymes, such as Taq polymerase, have taken over most of the applications of E. coli DNA polymerase I and the Klenow fragment. It is interesting to note that Taq polymerase, the DNA polymerase I from Thermus aquaticus, has a different pair of activities: it retains the polymerase and the 5' to 3' exonuclease, but lacks the 3' to 5' proofreading exonuclease.

Now that we have the complete sequence of this gene, as well as many similar genes from other organisms, we can identify "conserved domains" in the sequence. There are three such domains, each with an associated enzymatic activity. The enzyme can be described as a fusion of a 5' to 3' DNA polymerase, a 3' to 5' exonuclease, and a 5' to 3' exonuclease. The various regions of the enzyme are given in the entry for polA in the Entrez Gene database.

The termination codon is marked with the word 'End' in red on the amino acid sequence. 'End' is not really an amino acid.
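
The amino acid rows in the listing can be reproduced from the DNA rows with a short translation routine. The sketch below builds the standard genetic code from the conventional TCAG-ordered string and writes stop codons as 'End', as this page does; the names `CODON_TABLE` and `translate` are ours. Both test sequences are copied from the listing above.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "End",
}


def translate(dna):
    """Translate DNA codon by codon, stopping after the first stop codon."""
    residues = []
    for i in range(0, len(dna) - 2, 3):
        name = THREE_LETTER[CODON_TABLE[dna[i:i + 3]]]
        residues.append(name)
        if name == "End":
            break
    return "".join(residues)


# First ten codons of the gene (positions 1 to 30).
print(translate("ATGGTTCAGATCCCCCAAAATCCACTTATC"))
# Last row of the listing (positions 2761 to 2820).
print(translate("GGCGAAAACTGGGATCAGGCGCACTAAGATTCGCCTGAACATGCCTTTTTTCGTAAGTAA"))
```

The second call stops at the TAA codon, so the bases after it in that row are not translated.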

Resources

References