Difference between revisions of "Merging"

From cctbx_xfel

Latest revision as of 22:44, 29 September 2014

The result of Indexing and integration is a set of Python pickle files, each of which essentially contains a table of Miller indices of the observed reflections, their integrated intensities, and estimated errors. In the general case, these files reflect the measurements from single shots, each exposing different crystals with a unique pulse of X-rays. Merging refers to the procedure applied to unite all these observations into a single data set. During merging, a distinct multiplicative factor, which accounts for the variance in pulse intensity and crystal size, is applied to the observations from a single shot to bring all the observations onto a common scale. After rejecting possible outliers, the intensities for individual reflections are summed, and their errors are propagated in quadrature. The result of merging is an mtz file suited for further processing, e.g. molecular replacement.
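The merging step described above can be sketched in a few lines of Python. This is a minimal illustration, not the cxi.merge implementation: the function name, the data layout (one list of (hkl, intensity, sigma) tuples per shot) and the choice to report the mean of the summed intensities are all assumptions made for the example.

```python
import math
from collections import defaultdict

def merge_observations(shots, scale_factors):
    """Merge per-shot observations into one intensity per Miller index.

    shots: one list of (hkl, intensity, sigma) tuples per shot (hypothetical layout).
    scale_factors: one multiplicative factor per shot.
    Returns hkl -> (mean scaled intensity, propagated sigma, number of observations).
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # hkl -> [sum I, sum sigma^2, n]
    for observations, k in zip(shots, scale_factors):
        for hkl, intensity, sigma in observations:
            acc = sums[hkl]
            acc[0] += k * intensity        # bring the observation onto the common scale
            acc[1] += (k * sigma) ** 2     # errors are propagated in quadrature
            acc[2] += 1
    merged = {}
    for hkl, (sum_i, sum_var, n) in sums.items():
        merged[hkl] = (sum_i / n, math.sqrt(sum_var) / n, n)
    return merged

# Two toy shots; the second one is weaker and gets a scale factor of 2.
shots = [
    [((1, 0, 0), 100.0, 10.0), ((1, 1, 0), 50.0, 5.0)],
    [((1, 0, 0), 50.0, 5.0)],
]
merged = merge_observations(shots, scale_factors=[1.0, 2.0])
print(merged)
```

Outlier rejection is left out of the sketch; it would act on the per-shot lists before they are accumulated.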


Merging a set of integration files

In cctbx.xfel the per-image scale factors are determined using a scaling reference. This scaling reference is expected to be a previously solved, isomorphous data set. The scale factor is determined by a least-squares fit of the observations to the reference intensities, after applying corrections for polarization<ref>Kahn, R, et al. Macromolecular Crystallography with Synchrotron Radiation: Photographic Data Collection and Polarization Correction. J Appl Cryst 15, 330–337 (1982).</ref>, and a significance filter, which limits the resolution of each diffraction pattern based on the signal-to-noise ratio. Lattices not conforming to the Bravais symmetry of the scaling reference are rejected. By default, non-isomorphous lattices whose cell lengths differ from the mean by more than 10%, or whose angles differ by more than 2°, are also rejected, as are lattices that correlate poorly with the scaling reference.
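As a rough illustration of the per-image scaling and the correlation test, the sketch below fits a single multiplicative factor k by least squares and computes Pearson's correlation against the reference. It assumes the simplest possible model (minimise the sum of (I_ref - k * I_obs)^2 over the common reflections); the function names and toy data are hypothetical, and the fit used by cxi.merge may differ in its details.

```python
import math

def least_squares_scale(observed, reference):
    """Return k minimising sum((I_ref - k * I_obs)^2) over common reflections."""
    common = [hkl for hkl in observed if hkl in reference]
    numerator = sum(reference[hkl] * observed[hkl] for hkl in common)
    denominator = sum(observed[hkl] ** 2 for hkl in common)
    return numerator / denominator

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_to_reference(observed, reference):
    """Correlate one image with the reference over their common reflections."""
    common = sorted(hkl for hkl in observed if hkl in reference)
    return pearson([observed[h] for h in common], [reference[h] for h in common])

# Toy data: the image is exactly half as strong as the reference.
reference = {(1, 0, 0): 200.0, (1, 1, 0): 100.0, (2, 0, 0): 400.0}
observed = {(1, 0, 0): 100.0, (1, 1, 0): 50.0, (2, 0, 0): 200.0, (3, 0, 0): 10.0}

k = least_squares_scale(observed, reference)
corr = correlation_to_reference(observed, reference)
min_corr = 0.1                      # placeholder cutoff, not a recommended value
accepted = corr >= min_corr
print(k, corr, accepted)
```

Reflections absent from the reference, such as (3, 0, 0) here, do not contribute to either the fit or the correlation.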

cxi.merge may take several passes over the integrated images. In order to speed up processing, cxi.merge will write the scaled data to a back end database during each pass. Currently three database back ends are implemented.

  • FS is the simplest back end. It stores the scaled intensities, their Miller indices, and information about the shots during which they were observed in three flat files on the file system.
  • The MySQL back end stores data in a MySQL database. The database must be set up beforehand, and credentials to access it must be supplied in the parameters passed to cxi.merge.
  • The SQLite back end uses a simple SQLite database, which is written to a single file on the file system. It is easier to use than the MySQL back end and more efficient than the FS back end. Regrettably, the SQLite back end does not appear to work on the Lustre file system.
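To give a feel for what a database back end offers over flat files, the following sketch stores a few scaled observations in SQLite and reads them back grouped by Miller index. The table name, column names and values are hypothetical and do not reproduce the schema cxi.merge actually uses; an in-memory database stands in for the single file on disk.

```python
import sqlite3

# ":memory:" keeps the example self-contained; a file path would be used in practice.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observation ("
    "h INTEGER, k INTEGER, l INTEGER, i REAL, sigi REAL, frame_id INTEGER)"
)

# (h, k, l, scaled intensity, sigma, frame the observation came from)
rows = [
    (1, 0, 0, 100.0, 10.0, 0),
    (1, 0, 0, 120.0, 11.0, 1),
    (1, 1, 0, 50.0, 5.0, 0),
]
conn.executemany("INSERT INTO observation VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

# Count and average the observations of each unique reflection.
per_reflection = conn.execute(
    "SELECT h, k, l, COUNT(*), AVG(i) FROM observation "
    "GROUP BY h, k, l ORDER BY h, k, l"
).fetchall()
conn.close()
print(per_reflection)
```

Keeping the unmerged observations queryable in this way is what allows a later program to compute additional statistics without rescaling the images.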

In cctbx.xfel, images are merged using the cxi.merge command. cxi.merge accepts several options, which can conveniently be assembled in a phil file. Only the subset of the available options used in this tutorial is described here.

backend
Back end database; FS for flat-file ASCII data storage, MySQL and SQLite for the respective proper database back ends.
d_min
Limiting resolution for scaling and merging
pixel_size
Pixel size of the detector in use, given as a single value in mm (0.11 for the CSPAD)
data
Directory containing integrated data in pickle format. Repeat to specify additional directories.
merge_anomalous
True to merge anomalous contributors (i.e. Bijvoet mates), False to preserve them
min_corr
Correlation cutoff for rejecting individual frames
model
The scaling reference: name of a PDB file containing atomic coordinates and an isomorphous CRYST1 record. A PDB file can be downloaded with phenix.fetch_pdb [pdb_code].
nproc
Specifies the number of scaling processes cxi.merge may have running at any one time
output.prefix
Prefix for all output file names
rawdata.sdfac_auto
True to apply SDFAC correction to each image, assuming negative intensities are normally distributed noise
rescale_with_average_cell
Rescale the images a second time, requiring images to conform to the average unit cell. If set to True, set_average_unit_cell must also be set to True.
set_average_unit_cell
If True set the unit cell of the merged data to the average of the merged images, otherwise use the unit cell of the scaling reference
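As an illustration of how these options might be assembled, the short Python helper below renders them as "name = value" lines, repeating the name for options that take several values. Every value is a placeholder chosen for the example (the paths, the model file name, nproc and min_corr in particular), and the flat layout with dotted names is an assumption about phil syntax; the phil files distributed with the tutorial remain the authoritative reference.

```python
def format_phil(options):
    """Render (name, value) pairs as 'name = value' lines.

    A list value repeats the option name once per entry, as for several data directories.
    """
    lines = []
    for name, value in options:
        values = value if isinstance(value, list) else [value]
        for v in values:
            lines.append(f"{name} = {v}")
    return "\n".join(lines) + "\n"

# Placeholder values only; adjust every entry to the data set at hand.
options = [
    ("backend", "FS"),
    ("d_min", 3.0),
    ("pixel_size", 0.11),
    ("data", ["/path/to/integration/run1", "/path/to/integration/run2"]),
    ("merge_anomalous", True),
    ("min_corr", 0.1),
    ("model", "reference.pdb"),
    ("nproc", 4),
    ("output.prefix", "Ls04-lysozyme"),
    ("rawdata.sdfac_auto", True),
    ("rescale_with_average_cell", True),
    ("set_average_unit_cell", True),
]

phil_text = format_phil(options)
print(phil_text)
```

Note that rescale_with_average_cell and set_average_unit_cell are both True here, in line with the constraint stated in the option list.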


The merged data is written to an mtz-file named output.prefix.mtz. Standard output, redirected to merge.out above, mainly consists of statistics for each individual image as it is scaled. The output concludes with a section labelled FINISHED MERGING, which first lists the number of accepted and rejected images:

 962 of 1185 integration files were accepted
 127 rejected due to wrong Bravais group
 1 rejected for unit cell outliers
 12 rejected for low signal
 83 rejected due to poor correlation

This is followed by histograms of the unit cell distribution, and the merging table:

---------------------------------------------------------------------------------
                                     <asu   <obs
Bin  Resolution Range  Completeness redun> redun> n_meas   <I>    <I/sig(I)>
---------------------------------------------------------------------------------
  1 -1.0000 -  6.4633     [306/309]  67.61  68.28  20893    35621    27.044
  2  6.4633 -  5.1299     [285/285]  43.18  43.18  12305    19272    12.490
  3  5.1299 -  4.4814     [275/275]  46.77  46.77  12862    24391    14.261
  4  4.4814 -  4.0716     [269/269]  43.38  43.38  11670    29436    14.819
  5  4.0716 -  3.7798     [249/249]  35.50  35.50   8840    29811    12.716
  6  3.7798 -  3.5569     [267/267]  32.07  32.07   8562    24212    10.523
  7  3.5569 -  3.3787     [274/274]  20.99  20.99   5750    23626     8.171
  8  3.3787 -  3.2317     [256/256]  15.96  15.96   4085    24168     6.746
  9  3.2317 -  3.1072     [263/264]  12.16  12.20   3209    23936     5.859
 10  3.1072 -  3.0000     [263/263]   8.74   8.74   2298    20613     4.556

All                     [2707/2711]  33.37  33.42  90474    25594    11.978
----------------------------------------------------------------------------------

In this case, the overall completeness to 3.0 Å is 2,707 / 2,711, or approximately 100%, and each observed reflection is measured 33.42 times on average (<obs redun>). In total 90,474 observations, with an average scaled and integrated intensity of 25,594 analog-to-digital units (ADU), were merged and the mean I / σ(I) was determined to be 11.978.
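The figures quoted above can be checked with a few lines of arithmetic. The sketch assumes that <obs redun> is the number of measurements divided by the number of observed reflections and that <asu redun> divides by the number of possible reflections instead; these definitions are inferred from the table, and they reproduce its values.

```python
# Acceptance summary from the FINISHED MERGING section.
accepted, total = 962, 1185
rejected = {
    "wrong Bravais group": 127,
    "unit cell outliers": 1,
    "low signal": 12,
    "poor correlation": 83,
}
rejections_add_up = accepted + sum(rejected.values()) == total

# "All" row of the merging table.
observed, possible, n_meas = 2707, 2711, 90474
completeness = observed / possible       # fraction of possible reflections observed
obs_redundancy = n_meas / observed       # measurements per observed reflection
asu_redundancy = n_meas / possible       # measurements per possible reflection

print(rejections_add_up)
print(f"completeness   = {completeness:.2%}")
print(f"<obs redun>    = {obs_redundancy:.2f}")
print(f"<asu redun>    = {asu_redundancy:.2f}")
```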

Additional merging statistics

cxi.xmerge retrieves the scaled, unmerged intensities from the database back end, and calculates the CC1/2<ref>Karplus, P. A. & Diederichs, K. Linking Crystallographic Model and Data Quality. Science 336, 1030–1033 (2012).</ref> and CCiso statistics. CC1/2 is defined as Pearson's correlation coefficient between two sets, such that for each unique reflection the average intensities of two randomly chosen halves of its independent observations are assigned to different sets. CCiso is the correlation coefficient between the merged data and the isomorphous scaling reference. Both statistics are computed in each resolution bin, as well as for the full set of reflections.
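The definition of CC1/2 translates directly into code. The sketch below is a plain-Python illustration rather than the cxi.xmerge implementation: the input layout (a dictionary mapping each Miller index to its list of scaled observations), the function names, and the decision to skip reflections with fewer than two observations are assumptions.

```python
import math
import random

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def cc_half(observations, seed=0):
    """CC1/2 from a dict hkl -> list of scaled intensities."""
    rng = random.Random(seed)            # fixed seed makes the random split repeatable
    half_a, half_b = [], []
    for hkl in sorted(observations):
        values = list(observations[hkl])
        if len(values) < 2:
            continue                     # a single observation cannot be split
        rng.shuffle(values)              # random assignment to the two halves
        mid = len(values) // 2
        half_a.append(sum(values[:mid]) / mid)
        half_b.append(sum(values[mid:]) / (len(values) - mid))
    return pearson(half_a, half_b)

# Noise-free toy data: both halves agree exactly, so CC1/2 comes out as 1.
observations = {
    (1, 0, 0): [100.0] * 4,
    (1, 1, 0): [50.0] * 4,
    (2, 0, 0): [200.0] * 3,
    (3, 0, 0): [10.0],
}
cc = cc_half(observations)
print(cc)
```

With real, noisy data the value depends on the random split, which is why the seed is exposed as a parameter.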

Once the database has been populated using cxi.merge, cxi.xmerge can be run, using the parameters defined in a phil-file, Ls04-lysozyme-xmerge.phil.

$ cxi.xmerge Ls04-lysozyme-xmerge.phil

The options used in this tutorial not already described in Merging a set of integration files are explained below.

scaling.mtz_file
mtz-file containing experimentally determined reference structure factors, such as synchrotron measurements, to be used for CCiso. Experimental measurements, when available, can be downloaded from the PDB using phenix.fetch_pdb --mtz [pdb_code].
scaling.mtz_column_F
Name of the column in the experimental measurements mtz-file that contains the reference data, which can be either amplitudes or intensities. The column name should be given in lower case. The columns present in an MTZ-format file can be listed with phenix.mtz.dump [code].mtz. If experimental measurements are unavailable, mock experimental structure factors can be generated with
phenix.fmodel <pdb file> high_resolution=<value> output.type=real, in which case one would use scaling.mtz_column_F=fmodel.
scaling.log_cutoff
Intensities less than exp(scaling.log_cutoff), that is, e raised to the power scaling.log_cutoff, will not be included in the calculation of CCiso and Riso (however, these measurements will still be included in the final merged data).
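The effect of scaling.log_cutoff can be shown with a small filter. The cutoff value and the intensities below are made up for the example, and the function name is hypothetical.

```python
import math

def passes_log_cutoff(intensity, log_cutoff):
    """True if the intensity is at least exp(log_cutoff) and so enters CCiso and Riso."""
    return intensity >= math.exp(log_cutoff)

log_cutoff = 2.0                         # placeholder; exp(2.0) is roughly 7.389
intensities = [5.0, 7.4, 100.0, -3.0]
kept = [i for i in intensities if passes_log_cutoff(i, log_cutoff)]
print(kept)
```

Negative intensities can never pass the test, since exp(x) is positive for every x.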

The results are printed on standard output and conclude with a table listing the correlation coefficients, R factors, and scale factors in each resolution bin:

---------------------------------------------------------------------------------------
                                      CC            CC            R     R   Scale Scale
Bin  Resolution Range  Completeness  int     N     iso     N     int   iso   int   iso
---------------------------------------------------------------------------------------
  1 -1.0000 -  6.4633     [233/309] 45.1%     233 56.3%     142 57.5% 45.6% 0.749 3.084
  2  6.4633 -  5.1299     [184/285] 22.5%     184 56.7%      62 57.2% 31.5% 0.862 2.841
  3  5.1299 -  4.4814     [203/275] 43.7%     203 57.6%      85 55.3% 37.0% 0.830 2.977
  4  4.4814 -  4.0716     [220/269] 39.0%     220 59.7%     101 51.4% 30.3% 0.805 3.296
  5  4.0716 -  3.7798     [187/249] 77.5%     187 33.6%      63 54.3% 39.4% 0.775 4.243
  6  3.7798 -  3.5569     [198/267] 35.4%     198 47.2%      56 56.8% 27.5% 0.903 3.587
  7  3.5569 -  3.3787     [198/274] 32.0%     198 70.7%      40 59.9% 31.4% 0.814 3.846
  8  3.3787 -  3.2317     [163/256] 30.6%     163 24.5%      28 63.9% 49.1% 0.734 4.278
  9  3.2317 -  3.1072     [164/264] 40.6%     164 53.6%       7 80.4% 32.1% 0.622 5.543
 10  3.1072 -  3.0000     [137/261]  8.2%     137 93.9%       5 85.6% 33.4% 0.221 4.372

All                                 34.7%    1887 47.7%     589 62.5% 39.5% 0.754 3.157
---------------------------------------------------------------------------------------

In this case the overall CC1/2 (CC int) is 34.7%, and the correlation coefficient to the isomorphous reference, CCiso (CC iso) is 47.7%. Note that the completeness for the cxi.xmerge output appears lower than that for the cxi.merge output. This is because the cxi.xmerge final table shows the completeness for the data used in the calculation of correlation coefficients, and so the completeness in the previous table should be used for general statistics.

Merging the tutorial data

Compared to indexing and integration, merging is a relatively quick procedure. However, particularly for large datasets, it may significantly strain computational resources. Therefore, it is recommended to merge data on SLAC's interactive nodes.

$ ssh psanacs.slac.stanford.edu
$ cd myrelease
$ cp /reg/g/cctbx/tutorials/merging/Ls04-lysozyme-*merge.phil .

Before running the merging programs, the phil-files may need to be edited. In particular, the location of the directories containing pickle files with the integrated intensities may need to be adjusted. Since both cxi.merge and cxi.xmerge log to standard output, it is advisable to redirect the stream to a file.

$ cxi.merge Ls04-lysozyme-merge.phil > merge.out
$ cxi.xmerge Ls04-lysozyme-xmerge.phil > xmerge.out

The merged reflections, suitable for further processing, are written to Ls04-lysozyme.mtz. Note that although the data for this tutorial extends to 2.2 Å, the statistics become exceedingly poor around 3.0 Å.

References

<references/>