The purpose of this small program is to show
how prior information can be used to optimize the efficiency of initial
crystallization screening in High Throughput Protein Crystallography (HTPX).
Effective initial crystallization screening aims to identify with the highest
overall efficiency (least material, supplies, and resources and thus
cost) the proteins that are most likely to yield successful crystals and
structures. The purpose of efficient initial screening is not to find
conditions for each and every protein, but to focus resources (upscale, Se-Met,
etc.) on those proteins which have the highest probability to yield structures
with the least effort (a.k.a. 'the first cut', 'cherry picking', etc.).
The isoelectric point (pI) is the pH at
which the charges of the amino acid residues (C, D, E, H, K, R, Y) and of the
amino- and carboxy-terminus of the peptide chain compensate to a zero net
charge, resulting in minimum solubility of the protein in aqueous solution. The
relevance of decreased solubility for crystallization success is still
debated, and there exists no simple correlation between pI and pH of
crystallization. However, following the proper distributions of crystallization
pH (or pH-pI) for a given pI increases the likelihood of crystallization, and
thus pI can be employed as a predictor for crystallization success. The data
have been extracted from the 9000+ sequence records of the PDB and the
corresponding reported pH of crystallization.
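To illustrate the definition of pI given above, the following is a minimal sketch that sums Henderson-Hasselbalch charges for the ionizable residues and both termini, then finds the pH of zero net charge by bisection. The pKa values and the function names net_charge and isoelectric_point are assumptions chosen for illustration; they are not the set used by this program, and a different pKa set will shift the result.

```python
# Illustrative pKa values (ASSUMED for this sketch; not the program's own set).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}            # protonated form carries +1
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}   # deprotonated form carries -1
PKA_N_TERM = 8.6
PKA_C_TERM = 3.6


def net_charge(sequence, ph):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    seq = sequence.upper()
    # Amino terminus contributes a positive charge, carboxy terminus a negative one.
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))
    for residue, pka in PKA_POSITIVE.items():
        charge += seq.count(residue) / (1.0 + 10 ** (ph - pka))
    for residue, pka in PKA_NEGATIVE.items():
        charge -= seq.count(residue) / (1.0 + 10 ** (pka - ph))
    return charge


def isoelectric_point(sequence, lo=0.0, hi=14.0, tol=1e-4):
    """pH at which the net charge is zero, found by bisection.

    The net charge decreases monotonically with pH, so a single root exists.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(sequence, mid) > 0:
            lo = mid   # still positively charged: pI lies at higher pH
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    construct = "DDEEAAGG"          # made-up toy sequence
    tagged = "HHHHHH" + construct   # same toy sequence with a His tag
    print(round(isoelectric_point(construct), 2))
    print(round(isoelectric_point(tagged), 2))
```

The toy comparison at the end shows why the full construct matters: adding a His tag to the same sequence raises the computed pI.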
The following caveats apply:
a) The pI calculation is not exact. Only after the structure is known, local
environment determining the actual pKa values of the residues could be
b) The distributions are not further discriminated by protein properties, and
represent probabilities for the average, 'garden variety' protein reported in
the PDB. They may not be valid for special cases, such as membrane proteins,
complexes, etc. They provide, however, the most efficient overall strategy for
c) It is mandatory to use the actually
crystallized construct sequence when calculating the pI. All affinity tags,
fusions, linkers, cleavage site residuals, etc. must be included.
d) The delta-distributions used are
coarsely binned (9000 data points for a 2-d set
of distributions is not much) but show clearly how the shape and mean/mode of the
distributions differ. The binning width of 1 pH unit is a realistic estimate of
the error in the calculated pIs (and reported pH perhaps).
e) The distribution of the pH in the PDB is
biased by usage (negative results are not reported), and the
distributions, judging from random experiments, are perhaps broader than
those extracted from the PDB.
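The coarse binning described in caveat d) can be sketched as follows. This is only an illustration under stated assumptions: the function name delta_distribution is hypothetical, the records are made-up (pI, pH) pairs rather than PDB data, and assigning each value to the nearest whole pH unit is an assumed binning rule.

```python
import math
from collections import Counter


def delta_distribution(records, pi_center, width=1.0):
    """Percentage of records per (pH - pI) bin for proteins near pi_center.

    records   : iterable of (pI, crystallization pH) pairs
    pi_center : centre of the pI bin of interest
    width     : bin width in pH units (1.0, as in caveat d)
    Binning to the nearest multiple of `width` is an ASSUMED rule.
    """
    counts = Counter()
    for pi, ph in records:
        if abs(pi - pi_center) <= width / 2.0:
            bin_label = math.floor((ph - pi) / width + 0.5) * width
            counts[bin_label] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in sorted(counts.items())}


if __name__ == "__main__":
    # Made-up example records, NOT taken from the PDB.
    toy_records = [(8.1, 7.5), (7.9, 8.0), (8.3, 6.0), (5.0, 4.5)]
    for label, pct in delta_distribution(toy_records, pi_center=8.0).items():
        print(f"pH-pI bin {label:+.1f}: {pct:.1f}%")
```

With only a few thousand records spread over a two-dimensional set of bins, each bin holds few counts, which is why a bin width of 1 pH unit is about as fine as the data support.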
Please cite the published reference when you use this program:
The bin data
extracted from the latest non-redundant PDB data set can be
downloaded from here.
Above the distribution graphs, you will see a set of tables. For example, the
following is returned for a pI of 8.0:
Table for cutoff excluding bins with expected success rates below 1.0%

pH-pI bin   :  -8.0  -7.0  -6.0  -5.0  -4.0  -3.0  -2.0  -1.0   0.0   1.0   2.0   3.0   4.0   5.0
Expected %  :   0.0   0.0   0.0   0.0   4.8  12.5  21.8  24.4  26.7   7.3   1.9   0.0   0.0   0.0

Population of 288 experiments in 7 bins :
equal pop.  :     0     0     0     0    41    41    41    41    41    41    41     0     0     0   287
suggested   :     0     0     0     0    13    36    63    70    77    21     5     0     0     0   287

Expected relative hit rates
equal pop.  :   0.0   0.0   0.0   0.0   2.0   5.1   9.0  10.0  10.9   3.0   0.8   0.0   0.0   0.0  40.8
suggested   :   0.0   0.0   0.0   0.0   0.7   4.5  13.8  17.3  20.6   1.5   0.1   0.0   0.0   0.0  58.5

pH          :   ---   ---   ---   ---   4.0   5.0   6.0   7.0   8.0   9.0  10.0   ---   ---   ---
Experiments :   ---   ---   ---   ---    13    36    63    70    77    21     5   ---   ---   ---

Expected efficiency increase compared to pH screening with equally populated bins: 43%
The first set of blue lines indicates:
  the pH-pI bin of the distribution (shown in the graph) as column headers;
  the prior (expected) distribution of successes based on the analysis from
    the PDB crystallization data;
  the population of the screen with experiments, first with equal frequency,
    then with the frequency suggested by evidence;
  the relative expected hit rates (scale is arbitrary) for equal frequency
    and for suggested frequency.
The final red lines give:
  the suggested pH range for screening;
  the number of experiments to set up;
  and finally, the estimated increase in efficiency based on the expected hit
    rates.
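The arithmetic behind the table rows can be retraced with a short sketch. It uses the numbers from the pI 8.0 example and assumes that the relative hit rate of a bin is its population times its expected success percentage, and that the suggested population is proportional to the expected percentage. The function names are hypothetical, and the program's own rounding is not known, so the proportional allocation below may differ from the printed row by an experiment or two.

```python
# Values copied from the pI 8.0 example table (1.0% cutoff).
BINS = [-8.0, -7.0, -6.0, -5.0, -4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
EXPECTED_PCT = [0.0, 0.0, 0.0, 0.0, 4.8, 12.5, 21.8, 24.4, 26.7, 7.3, 1.9, 0.0, 0.0, 0.0]
EQUAL = [0, 0, 0, 0, 41, 41, 41, 41, 41, 41, 41, 0, 0, 0]
SUGGESTED = [0, 0, 0, 0, 13, 36, 63, 70, 77, 21, 5, 0, 0, 0]


def proportional_allocation(expected_pct, n_experiments, cutoff_pct=1.0):
    """Distribute experiments in proportion to expected success (ASSUMED rule).

    Bins below the cutoff receive no experiments.
    """
    kept = [p if p >= cutoff_pct else 0.0 for p in expected_pct]
    total = sum(kept)
    return [round(n_experiments * p / total) for p in kept]


def relative_hit_rate(population, expected_pct):
    """Sum over bins of population times expected success fraction."""
    return sum(n * p / 100.0 for n, p in zip(population, expected_pct))


def efficiency_gain(suggested, equal, expected_pct):
    """Fractional increase of the suggested screen over the equal screen."""
    return (relative_hit_rate(suggested, expected_pct)
            / relative_hit_rate(equal, expected_pct) - 1.0)


def screen_ph(pi, bins, population):
    """Convert populated pH-pI bins into (screening pH, experiments) pairs."""
    return [(pi + delta, n) for delta, n in zip(bins, population) if n > 0]


if __name__ == "__main__":
    gain = efficiency_gain(SUGGESTED, EQUAL, EXPECTED_PCT)
    print(f"efficiency increase: {100 * gain:.0f}%")
    print(screen_ph(8.0, BINS, SUGGESTED))
    print(proportional_allocation(EXPECTED_PCT, 288))
```

Computed from the rounded values printed in the table, the gain again comes out at about 43%. Small differences in the individual hit rates are to be expected, presumably because the table was generated from unrounded numbers.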
The table repeats for different screen widths (i.e., neglecting bins with
populations below a certain cutoff as listed). Note how this affects the
efficiency increase - the gain is largest for wide (improbable) screen ranges.
In the above example, one sees that there is not much point in screening far above
the pI, but even up to 3 pH units below pI there is a good statistical chance
that the pH is conducive to crystallization. For pI 6.0, this distribution would
have similar centroid values, but a different shape. For more extreme pI
values, both the centroids and the distribution shape change substantially.
Using the suggested values and frequencies maximizes the chance for success with
a minimal number of experiments.
In an initial screening, for example, you might consider a more limited range
of pHs - at the risk of losing a few percentage points of chance for success.
Comprehensiveness versus material demands need to be balanced for maximum efficiency
- a decision you need to make based on your situation.
NOTE: For consistency with
the pI calculation used to derive the statistics, use the calculator provided
below. Depending on which pI calculator you use, deviations of +/- 0.5 pH units
or more are not uncommon - see
Enter either pI, or the sequence of
your protein to be crystallized (see above):