Nayra M. Al-Thani1, Simeon S. Andrews2, Dietrich Büsselberg1*
1Weill Cornell Medicine in Qatar, Qatar Foundation-Education City, Doha, Qatar
Weill Cornell Medicine in Qatar
Qatar Foundation-Education City
POB 24144, Doha, Qatar
Tel no: +974 33480728
Article Type: Review Article
Manuscript ID: MMMB-1-105
Publisher: Boffin Access Limited
Journal Type: Open Access
Copyright: © 2018 Büsselberg D, et al.
Creative Commons Attribution 4.0
Al-Thani NM, Andrews SS, Büsselberg D. Open Reading Frame Filtering Methods for Identification of Genic Sequences. Methods Microbiol Mol Biol. 2018 Jan; 1(1): 105
Protein-protein interactions (PPIs) help us understand disease processes and their mechanisms. Ideally, researchers would like to understand the full network of PPIs that take place within a cell, the “interactome”, which is why the open reading frame (ORF) filtering method has been used. An ORF is a DNA fragment that lacks a stop codon and has the potential to demonstrate a physiological interaction. ORFs are obtained by randomly shearing DNA into small fragments. The fragments are filtered by inserting them upstream of a selectable marker, so that only cells carrying an ORF survive, because the vast majority of out-of-frame fragments encode a premature stop codon. Here we present a method for filtering genic ORFs, that is, ORFs from a ‘real gene’ that result in a physiological protein. The use of fragment libraries rather than full-length libraries is perhaps counterintuitive, as it might be expected to result in a high rate of false negatives. However, researchers have found that fragment libraries reduce the number of false-negative interactions compared with full-length libraries. Lastly, a major advantage of a fragmented library is that it allows the interaction site to be localized, offering a robust path toward drug targets and treatments for cancer and for emerging antibiotic resistance.
Open Reading Frame; ORF filtering; Two-hybrid system; Gene fragment; Protein-protein interaction
Most cellular processes are influenced or directly mediated by protein-protein interactions (PPIs). Studying PPIs is therefore essential for understanding normal and pathological physiology within a cell, including disease processes such as cancer and their mechanisms. Ideally, we would like to study the entire network of PPIs that take place within a cell, which is referred to as the “interactome”.
Diverse methods have been used to identify interactomes, including proteomics methods. However, the chemistry and technology of protein quantitation still pose substantial challenges. In contrast, DNA-sequencing technologies have long been robust, and in the past decade in particular “next-generation” sequencing has revolutionized our ability to sequence vast amounts of genetic material quickly and accurately. Various methods have used DNA readouts to study PPIs. Of these, the two-hybrid system is probably the most frequently used. The original system used yeast sequences encoding a DNA-binding domain and an activation domain. One protein of interest (“X”) is N-terminally fused to the binding domain, while another protein (“Y”) is similarly attached to the activation domain. The binding domain binds to a specific DNA sequence, and the activation domain recruits transcription factors that are necessary to initiate transcription of a reporter gene. Only if the “X” and “Y” proteins interact, bringing the binding domain and the activation domain into proximity, will the reporter gene be transcribed (Figure 1).
The two-hybrid system has been widely used in the decades since, and it has two major variations: the “yeast” and the “bacterial” two-hybrid systems.
In 1997, Fromont-Racine and co-workers used the yeast two-hybrid (Y2H) system to identify protein interactions in yeast. Thereafter, the Y2H model was used to screen for PPIs on a large scale in species such as Saccharomyces cerevisiae, Helicobacter pylori, Drosophila melanogaster, Caenorhabditis elegans, and Homo sapiens [5-11]. Three years after the first screening with the Y2H system, Joung and colleagues were the first to use a bacterial two-hybrid (B2H) system with a large library (~10^8 in size). The B2H system has two major advantages over the Y2H system: a faster growth rate and a higher transformation efficiency.
Initial experiments with two-hybrid systems generally employed full-length genes. The use of fragment libraries rather than full-length libraries is perhaps counterintuitive, as many protein structures and interactions are impossible with fragments, so fragments might be expected to result in a high rate of false negatives. Surprisingly, researchers have found that gene fragment libraries yield fewer false-negative interactions than full-length gene libraries, and that thorough screening reduces false-positive interactions [7,14]. When Boxem et al. used fragments rather than full-length genes, they recovered more physiological interactions and, in their limited set, no false-positive interactions. Presumably, this is because fragments of a protein may avoid the folding or translocation problems encountered by full-length proteins. Any false-positive interactions can also be eliminated with more thorough library screening.
The use of fragments has the added benefit of permitting more rapid screening, as one does not have to first build a library of all protein-coding genes with specific primers before testing pairs. Random fragmentation quickly converts cDNA into testable fragments. Finally, note that the use of fragments allows interactions to be localized to specific regions of proteins. Rather than knowing only that two proteins interact, we can define their interacting regions as well. With overlapping fragments, we can even identify the minimal interacting region.
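As a minimal sketch of how overlapping fragments can narrow down an interaction site, the snippet below intersects the coordinates of fragments that scored positive in a screen. The function name and the example coordinates are hypothetical and are not taken from any of the cited studies. The sketch also assumes a single contiguous interaction site that is fully contained in every positive fragment.

```python
from typing import List, Optional, Tuple

# (start, end) in amino-acid coordinates of the full-length protein, end exclusive
Interval = Tuple[int, int]


def minimal_interacting_region(positives: List[Interval]) -> Optional[Interval]:
    """Return the region shared by all interacting fragments, or None."""
    if not positives:
        return None
    start = max(s for s, _ in positives)  # latest start among the fragments
    end = min(e for _, e in positives)    # earliest end among the fragments
    return (start, end) if start < end else None


# Hypothetical fragments of one protein that all scored positive.
fragments = [(10, 120), (60, 200), (40, 150)]
print(minimal_interacting_region(fragments))
```

If the positive fragments do not share a common region, the function returns `None`, which in practice would suggest more than one interaction site or a false positive among the fragments.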
For a fragment to potentially demonstrate a physiological interaction, it must be an open reading frame (ORF) (Figure 2). An ORF is a DNA sequence that contains no stop codon and therefore has the potential to encode a protein. If DNA is sheared into random fragments and these are inserted into a vector, the majority of the inserts (5 out of 6, or 83%) are not in the reading frame of the gene (termed out of frame) and therefore encode non-physiological proteins. Of the 6 possible frames (Figure 3), only 1 corresponds to the gene frame, which encodes the physiologically relevant protein. The fragments therefore need to be filtered to discard those that are non-physiological. This process of removing non-ORF sequences is termed “ORF filtering.”
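The six-frame idea can be sketched in a few lines of Python. The helper names below are illustrative, and the example fragment is invented. The sketch assumes the standard genetic code (stop codons TAA, TAG, TGA) and sequences made up only of A, C, G and T.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def has_stop(seq: str, offset: int = 0) -> bool:
    """True if reading seq in triplets from offset hits a stop codon."""
    return any(seq[i:i + 3] in STOP_CODONS
               for i in range(offset, len(seq) - 2, 3))


def open_frames(fragment: str) -> list:
    """List the frames (+1..+3, -1..-3) of a fragment that lack a stop codon."""
    frames = []
    for strand, seq in (("+", fragment), ("-", reverse_complement(fragment))):
        for offset in range(3):
            if not has_stop(seq, offset):
                frames.append(f"{strand}{offset + 1}")
    return frames


# Invented 12-nt fragment; very short fragments are open in several frames.
print(open_frames("GCTAAGGCTTGG"))
```

A frame without a stop codon is only an ORF in the sense used here. Whether it is also the genic frame has to be checked by aligning the fragment to the reference gene.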
All methods currently in use for ORF filtering rely on the underrepresentation of stop codons in truly coding sequences. Because 3 of the 64 codons are stop codons, a truly random DNA sequence will, on average, encode a stop codon every 21 triplets. Indeed, with only 63 codons there is a 95% probability of at least one stop codon if the sequence is random, and for fragments encoding 100 amino acids (300 bp) there is a 99% chance that a random sequence will contain a stop codon. By contrast, a coding sequence will of course avoid stop codons until the end of the sequence has been reached. If we take fragments of cDNA just 300 bp long, but in random frames, then 5/6 (83.3%) will be in the wrong frame. Yet 99% of those will contain a stop codon, so if we can selectively eliminate fragments with stop codons, the in-frame percentage of the library will rise from 16.7% to 96%. This approach has the disadvantage of selecting against in-frame sequences that include the physiological stop codon, and thus the C-termini of proteins are expected to be underrepresented.
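The figures in this paragraph can be reproduced with a short calculation. The sketch below assumes uniformly random bases, so that each codon is a stop with probability 3/64, and it treats out-of-frame reads of cDNA as random sequence, as the text does. The function names are illustrative.

```python
P_STOP = 3 / 64  # 3 of the 64 codons are stops, assuming uniform random bases


def p_at_least_one_stop(n_codons: int) -> float:
    """Probability that a random sequence of n_codons contains a stop codon."""
    return 1 - (1 - P_STOP) ** n_codons


def in_frame_fraction_after_filter(n_codons: int) -> float:
    """In-frame share of the library once stop-containing inserts are removed.

    Assumes 1 in 6 inserts is in frame and survives, and that the other
    5 in 6 survive only if they happen to contain no stop codon.
    """
    in_frame = 1 / 6
    out_of_frame_survivors = (5 / 6) * (1 - p_at_least_one_stop(n_codons))
    return in_frame / (in_frame + out_of_frame_survivors)


print(f"mean spacing between stops: {1 / P_STOP:.1f} codons")
print(f"P(stop within 63 codons):   {p_at_least_one_stop(63):.1%}")
print(f"P(stop within 100 codons):  {p_at_least_one_stop(100):.1%}")
print(f"in-frame after filtering:   {in_frame_fraction_after_filter(100):.1%}")
```

Under these assumptions the calculation gives roughly 21 codons, 95%, 99% and 96%, matching the values quoted above. Real genomes deviate from uniform base composition, so the numbers are estimates.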
In order to filter and express ORFs, a sheared DNA fragment is cloned upstream of a selectable marker. If the fragment is an ORF, the selectable marker will be translated and expressed in those cells. The vast majority of DNA fragments that are out of frame will encode a premature stop codon, so the selectable marker is not translated, resulting in a difference in selectability between ORF-containing cells and those without ORF fragments. For instance, if an antibiotic resistance gene is the selectable marker, only the cells with ORF fragments will survive in the presence of the antibiotic (Figure 4). This method was adopted by Weinstock and co-workers (1983), who inserted random fragments into a vector between the outer membrane protein (Omp) gene and the beta-galactosidase (LacZ) gene (which can be used for blue-white colony screening). The vector carries LacZ(-), a nonfunctional version of the gene. However, when an inserted random fragment realigns the Omp and LacZ genes, LacZ becomes functional (LacZ(+)) and is expressed in the colonies, which can therefore be selected by blue-white screening. Furthermore, libraries of random fragments were enriched from 54% to 100% ORFs by antibiotic selection [17,18]. Such a selection therefore improves the quality of the library. Moreover, this method is also capable of localizing the sites of interaction within the sequence [20,21].
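The enrichment described here can be illustrated with a toy in silico simulation. This is only a sketch under simplifying assumptions that are not part of any cited protocol: the "gene" is a random string of sense codons, every fragment is exactly 300 bp, orientation is chosen at random, and a clone survives selection if and only if its insert has no stop codon in the vector's reading frame. All names and parameter values are hypothetical.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}
SENSE_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"
                if a + b + c not in STOP_CODONS]
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def passes_filter(insert: str) -> bool:
    """True if the insert has no stop codon when read from its first base."""
    return all(insert[i:i + 3] not in STOP_CODONS
               for i in range(0, len(insert) - 2, 3))


def simulate(n_fragments: int = 6000, frag_len: int = 300, seed: int = 0):
    """Return the genic (in-frame) fraction before and after filtering."""
    rng = random.Random(seed)
    # Toy gene: 2000 random sense codons, so the genic frame has no stop.
    cds = "".join(rng.choice(SENSE_CODONS) for _ in range(2000))
    library = []
    for _ in range(n_fragments):
        start = rng.randrange(len(cds) - frag_len + 1)
        frag = cds[start:start + frag_len]
        forward = rng.random() < 0.5
        if not forward:
            frag = frag.translate(COMPLEMENT)[::-1]  # reverse complement
        genic = forward and start % 3 == 0  # correct strand and frame
        library.append((frag, genic))
    survivors = [(f, g) for f, g in library if passes_filter(f)]
    before = sum(g for _, g in library) / len(library)
    after = sum(g for _, g in survivors) / len(survivors)
    return before, after


before, after = simulate()
print(f"genic fraction before selection: {before:.1%}")
print(f"genic fraction after selection:  {after:.1%}")
```

In this model about one sixth of the unselected library is genic, while nearly all surviving clones are. Wet-lab results differ because real selection is leaky and because alternative start sites and insert lengths also affect marker expression.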
Several hosts can be used for ORF filtering, including bacterial strains, viruses such as phages, and yeast. All these methods are based on a series of experimental steps: 1) isolating a gene, 2) shearing it into fragments, 3) amplifying by polymerase chain reaction (PCR), 4) ligating into a vector, 5) introducing the vector into cells, 6) applying selective pressure to the library, and 7) sequencing the targeted DNA.
To identify ORFs in bacteria, a marker such as AmpR (the marker for ampicillin resistance) is used to test for the presence of ORFs. DNA fragments are inserted upstream of the AmpR gene and downstream of its leader sequence. The leader sequence allows the product of the AmpR gene to be exported to the periplasm (Figure 4), its site of action. Resistance to other antibiotics (e.g. chloramphenicol, kanamycin, spectinomycin, tetracycline) can also be used as a selectable marker. Furthermore, some methods use more complex cloning, inserting sequences such as the lox sequence, which is cleaved by the Cre recombinase [18,23]. This allows recombination of the ORF and formation of a fused DNA product with a tag gene. By flanking the ORFs with recombination elements, we can facilitate the isolation of ORFs for further studies and validation of ORF interactions [24,25].
The frame concept was first used to generate the MH3000 E. coli strain, which carries an out-of-frame β-galactosidase (LacZ) gene. These researchers inserted fragments downstream of the OmpF gene and upstream of the out-of-frame LacZ gene. Inserted fragments that shifted the frame to generate a functional LacZ gene produced blue colonies in the presence of X-Gal. These blue colonies could be verified to contain functional proteins.
Davis and Benzer showed that ORF frequencies depend on the antibiotic concentration used for selection, the library size, and the bacterial strain. They showed that 8% of the clones were in frame before selection, and that this fraction increased to 70% after selection. Furthermore, selection frequency varied with ORF library size. Both the XL1-Blue and DH10B strains are capable of cloning larger fragments; however, XL1-Blue gave a higher transformation efficiency than DH10B. They concluded that for a smaller library a higher antibiotic concentration resulted in better selection, while the opposite was true for large libraries. By chance, an ORF fragment can also be inserted in the wrong orientation. To overcome this issue, Davis and co-workers used directional cloning with two different restriction enzymes to clone the ORFs into the expression vector. Moreover, they used PCR primers to modify the kanamycin resistance gene so that the sequence ATGA provides a stop codon in its second reading frame. This ensured that clones would not survive if the reading frame started from the second nucleotide.
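One way to read the ATGA design is that the same four nucleotides contain a start codon in the first frame and a stop codon in the second. The short sketch below checks this for the sequence quoted in the text. The helper name is illustrative, and the sketch does not attempt to model the rest of the construct.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons(seq: str, offset: int) -> list:
    """Split seq into complete triplets, starting at the given offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


junction = "ATGA"  # sequence quoted in the text
for offset in (0, 1):
    triplets = codons(junction, offset)
    hits_stop = any(t in STOP_CODONS for t in triplets)
    print(f"frame {offset + 1}: {triplets} stop={hits_stop}")
```

Frame 1 reads ATG, the start codon, while frame 2 reads TGA, a stop codon, so translation entering in the second frame terminates before the resistance gene.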
Four test genes were used to confirm the “theory of frame”: two of the genes were frame-shifted by adding one or two bases. The vector carried a chloramphenicol resistance gene and an ampicillin resistance gene. The four test genes were inserted upstream of the AmpR gene, transformed, and plated on two different plates (chloramphenicol and ampicillin). Colonies with frame-shifted genes did not survive on ampicillin plates, since they lack a functional AmpR gene. In contrast, all four test genes allowed growth on chloramphenicol plates.
To filter genic ORFs, that is, ORFs from a ‘real gene’ that result in a physiological protein, researchers used a vector carrying chloramphenicol resistance to grow colonies on plates. Thereafter, they harvested the cells and grew them in selective media supplemented with chloramphenicol and with different concentrations of ampicillin (as the selective marker). This step was followed by sequencing, which showed that 96% of the ORFs obtained corresponded to real genes. Statistical analysis showed that beta-lactamase activity rose with increasing concentrations of ampicillin, indicating that higher expression of beta-lactamase is essential for colonies to grow at high ampicillin concentrations.
The phage display method inserts the gene encoding the protein of interest into a phage coat protein gene, so that the protein is expressed on the surface of the phage. Thereafter, the gene is expressed in bacteria (as a host), a process called transduction. The primary bacteriophages used for phage display are T7 and M13, both of which can use Escherichia coli as a host.
In one of the first reports of ORF filtering for phage display, researchers modified a vector by eliminating the original multiple cloning site (MCS) and inserting a new site through inverse PCR. This step allowed them to design ORFs that remain in frame even when subcloned into the derivative vectors they constructed. After ligation of the fragments upstream of AmpR, the vector was transformed into the XL1-Blue strain and ampicillin selection was applied. To display the ORFs on the phage surface, the insert was cloned into a derivative vector and transformed into a bacterial strain (ER2738) grown under kanamycin selection. Phagemids were rescued with helper phage, and sequencing analysis of those samples determined that 97% were ORFs.
Zacchi et al. fused fragments upstream of the beta-lactamase gene to filter ORFs and flanked the insert with lox recombination sequences. Following ampicillin selection, they excised the beta-lactamase gene from the vector. To facilitate purification of those ORFs by phage display, the constructed vector carried the fd phage “gene 3” tag. The results confirmed that a high concentration of antibiotic eliminates out-of-frame sequences (with 12 μM ampicillin they obtained 100% ORFs and 0.2% out of frame, whereas with 25 μM ampicillin they obtained 85% ORFs and none out of frame). Of the 100% filtered library, 80% was detected using dot blot for protein detection, and 50% of the mapped ORFs were genic ORFs.
In 2010, Di Niro et al. applied the antibiotic screening technique to prepare ORFs for expression in phage display vectors in a more seamless way. Genes were fragmented into pieces of 100-600 bp, and the fragments were cloned in the right orientation using restriction enzyme sites. The ORF fragments were inserted upstream of the AmpR gene, with lox sequences flanking the AmpR sequence. Downstream of the AmpR and lox sequences is the g3p gene, which encodes a phage coat protein used to display ORFs on the phage surface. Once ligated, the vectors were transformed into bacteria and grown on ampicillin-containing plates for ORF selection. Positive clones were then transformed into a bacterial strain with a constitutively active Cre recombinase. This cleaves the lox sequences and recombines the ORFs with the g3p gene, eliminating the AmpR gene. In this way, the ORF needs to be cloned into the vector only once, while still permitting expression of the ORF/g3p fusion without AmpR. To validate interactions, ORFs were displayed on phage for the enzyme transglutaminase 2 (TG2). This method allowed the selection of 99% ORFs, of which 85% corresponded to the correct frame of the gene, and identified the local regions of the interaction domains.
By contrast, Caberoy et al. used phage display itself to select ORFs. They used the T7 phage, inserting their ORF at the C-terminal end of the capsid 10B protein. Crucially, however, they added a 3C protease cleavage site and then a biotinylation site further downstream. Consequently, only virus particles encoding an ORF will be biotinylated. The cDNA library was selected using streptavidin to isolate biotinylated ORFs, and these were cleaved from the resin with the 3C protease. The recovered library was re-amplified in bacteria to generate an ORF-selected library suitable for use in selection experiments. They found 17 ORFs, of which 13, encoding different proteins, were selected using phage display. Phage display of the cDNA library fused to a C-terminal biotinylation tag confirmed that, following selection, 90% of the clones carried ORF inserts.
In another study, genes were sheared and fragments of 100-300 bp were gel purified. Ampicillin selection was performed to filter ORFs, and the vector was then transformed into a strain expressing a constitutively active Cre gene to remove the ampicillin resistance gene from the ORFs after selection. Phage display of the ORFs, achieved by infecting the bacteria with helper phage M13K07, yielded a library with 94% ORFs.
In a further study, genes were fragmented into pieces of 200-800 bp by sonication, and the fragments were cloned into a vector. This was followed by transformation into a bacterial strain, with chloramphenicol and ampicillin resistance as selective markers. To confirm that the target sequences had been obtained, the samples were sequenced, and crystallization was performed to determine the structure of the enzyme. For purification and crystallization of the proteins, a His-tag was attached at the N-terminus to keep the protein in soluble form. Following this approach, the authors were able to identify two domains on the gene, and their library covered 739 genes from chromosome 1 and 540 genes from chromosome 2, for a total of 1279 ORFs.
Open reading frame percentage (ORFs%) is the percentage of isolated sequences that lack a stop codon. Genic ORF percentage (ORFs genic%) is the percentage of isolated sequences that lack a stop codon and that, when aligned to the reference gene, match the correct gene frame. The selective marker is the marker used in the ORF filtering vector to select for ORF sequences and filter out those with a stop codon: antibiotic selective pressure for bacteria and yeast, and tag presence for phage. Fragment size is the size of the sheared DNA library inserts in base pairs (bp). Application lists uses such as phage display, and validation methods lists the further tests applied to validate the ORFs. Lastly, the table gives the strain used for selection (bacteria, phage, or yeast) and the corresponding authors' names. Authors highlighted in yellow performed ORF selection using bacteria, those in green using yeast, and those in blue using phage.
ORFs have also been tested in yeast. The vector was first transformed into a bacterial strain for propagation, and ampicillin-resistant clones carrying the desired vector were selected by plating on Amp plates. Following selection of the desired vector, the plasmid was transformed into yeast, and ORFs tagged with a histidine gene were selected on histidine induction medium. Whereas other researchers used ampicillin as the selective marker, Holz and colleagues used the histidine gene as the selective marker for yeast. In this experiment they achieved 60% ORFs.
Using bacteria as a host to obtain ORFs is more efficient and reliable than using phage or yeast. Obtaining an ORF library by phage display requires enrichment through multiple rounds of purification and amplification. The only advantage of that method is that it allows purification of larger fragments [13,30]. Although Caberoy and colleagues achieved 90% ORFs, this does not correspond to the genic ORF percentage, since an ORF is just a sequence that lacks a stop codon: when some ORF sequences are aligned to the corresponding gene, they do not align to the correct frame of the gene. More importantly, ORF-filtered libraries in bacteria benefit from a faster growth rate and higher transformation efficiency. ORF-filtered libraries in yeast have not yet been tested as extensively as those in bacteria; Holz et al. applied ORF filtering in yeast, where they achieved 60% ORFs using inserts of 200-20000 bp [32,33]. However, given that yeast generally have much lower transformation efficiencies, they are unlikely to be an ideal host except under specific conditions, such as when a fragment is not believed to fold properly except in a eukaryotic environment.
From Table 1, most ORF selection in bacteria is done using the DH5alphaF’ strain, since it has a high plasmid yield and a high transformation efficiency (1×10^9 cfu/µg). Furthermore, its recA gene, which is responsible for homologous recombination, is mutated, ensuring high stability of the insert. In addition, DH5alphaF’ lacks some endonucleases that would otherwise start digesting the plasmid during the isolation process.
Table 1: Summary of ORF filtering methods from literature
A notable point from Table 1 is that ampicillin is the most common antibiotic used as a selective marker for ORF filtering. One reason for this is that ampicillin resistance requires the expression of the resistance gene together with its leader sequence. If the fragment is inserted between them, this ensures that resistance is expressed only through the full inserted sequence of the ORF. In contrast, the other antibiotic resistance markers, which lack a leader sequence (since their products are not exported), will allow the expression of some out-of-frame fragments that have alternate translation start sites. Thus, the leader sequence of the ampicillin resistance gene ensures that all of the expressed ORFs contain the full insert sequence.
A higher ORF percentage is achieved using smaller insert fragments. In bacteria, ORF-filtered libraries achieved the highest ORF percentage with insert sizes ranging from 100 to 800 bp (Table 1). However, fragments of 100-500 bp are preferable, in the sense that an out-of-frame fragment of 300 bp already has a 99% chance of containing a stop codon. The great advantage of using a fragmented library is that it allows localization of the interaction site [20,21]. ORF filtering eliminates fragments with a stop codon, providing good selection for more genic ORFs (Figure 5). When fragmented libraries were used with ORF filtering, more physiological interactions were recovered.
Figure 5: a) DNA sequence of four full-length genes that are randomly sheared into a pool of DNA fragments. The pool represents a mixture of all possible sequence fragments within the genome.
b) The pooled fragments are inserted into an ORF expression vector to retain only the fragments that do not have a stop codon (which allows survival under antibiotic selection).
c) Finally, when the ORF-filtered library is validated for interaction and aligned to the reading frame using bioinformatics tools, it allows the interaction site to be localized within the gene. As shown for gene.1, the interaction site is covered by fragments B, F, and J, whereas for gene.2 it is covered by fragments D, H, and K. For gene.3 it is covered by fragments B, C, G, and part of K, which could appear as a weak interaction during the test. Similarly, for gene.4 the strongest interaction will come from fragment F, then J, while A and B will be weaker.
ORF filtering is a powerful tool for identifying functional sequences within a gene. This offers a robust path for the discovery of drug targets and for the treatment of cancer and of infections, especially those resistant to antibiotics.
The authors thank Dr. Joel Malek (Genomics core at WCM-Q) for sharing his extensive knowledge in the field of genomics research.
The authors declare no competing interest.