2017 cxi merge tutorial

From cctbx_xfel
Jump to navigation Jump to search

This is an updated, worked example of data merging using cxi.merge, for presentation at the Feb 17, 2017 Berkeley Lab Serial Crystallography Workshop. Previous documentation sets are here and here. Literature description is in the Hattne (2014), the PRIME paper, the Sauter (2014) and Sauter (2015) papers. Math derivations are further described in the source code release in file postrefinement_rs_model.pdf.

Initial characterization

In this example, we are given integrated still-shot data collected by Danny Axford at Diamond, for P6 myoglobin, PDB code 5M3S.

  • /net/dials/raid1/aaron/zurich0038/jr_006_batches/split_reintegrated/extracted # cctbx-style integration pickles
  • /net/dials/raid1/aaron/zurich0038/jr_006_batches/sig_filter/split_reintegrated/extracted # same data, with per-image resolution cutoff during integration

Unix ls reveals 5031 *.pickle files in each directory.

Immediately there is a problem:

$ cxi.print_pickle /net/dials/raid1/aaron/zurich0038/jr_006_batches/sig_filter/split_reintegrated/extracted/*.pickle

...fails on image 0059 with a traceback; it looks like the file is corrupted.

So focus on the data without integration resolution cutoff:

$ cxi.print_pickle /net/dials/raid1/aaron/zurich0038/jr_006_batches/split_reintegrated/extracted/*.pickle

Some conclusions with the aid of grep:

  • all integration pickles have space group P6 (good)
  • distance and beam center is fixed throughout the integrated dataset
  • Unit cells are variable but do seem to cluster around 91.4 91.4 45.9 90 90 120
phenix.fetch_pdb --mtz 5m3s

Merge command file:

#!/bin/csh -f

set effective_params = "d_min=DMIN \
data=/net/dials/raid1/aaron/zurich0038/jr_006_batches/split_reintegrated/extracted/*.pickle \
output.n_bins=10 \
pixel_size=0.172 \
backend=FS \
nproc=1 \
model=5m3s.pdb \
merge_anomalous=True \
plot_single_index_histograms=False \
scaling.algorithm=mark0 \
raw_data.sdfac_auto=False \
scaling.mtz_file=5m3s.mtz \
scaling.show_plots=False \
scaling.log_cutoff=None \
scaling.mtz_column_F=i-obs \
scaling.report_ML=True \
set_average_unit_cell=True \
rescale_with_average_cell=False \
significance_filter.apply=True \
significance_filter.min_ct=30 \
significance_filter.sigma=0.2 \
include_negatives=NEG \
postrefinement.enable=True \
postrefinement.algorithm=rs \
output.prefix=TAG"
set tag = p6m
set dmin = 2.5
set neg = True
set eff = `echo $effective_params|sed -e "s,FS,Flex,g"|sed -e "s,DMIN,$dmin,g"|sed -e "s,NEG,$neg,g"|sed -e "s,TAG,$tag,g"`

cxi.merge ${eff}
exit
cxi.xmerge ${eff}
phenix.xtriage ${tag}_s0_mark0.mtz scaling.input.xray_data.obs_labels=imean

Initial trial nproc=1 just to see if it runs. Had to fix PDB reference. Can't use *.pickle on the data= line

Scale-up trial nproc=60, no postrefinement. set the MTZ flag = i_obs

 4493 of 5031 integration files were accepted
 0 rejected due to wrong Bravais group
 11 rejected for unit cell outliers
 22 rejected for low signal
 505 rejected due to up-front poor correlation under min_corr parameter
 0 rejected for file errors or no reindex matrix

Usage: 5m3s.mtz does not contain any observations labelled [fobs, imean, i-obs]. Please set scaling.mtz_column_F to one of [iobs].

 File "/net/viper/raid1/sauter/proj-e/modules/cctbx_project/xfel/cxi/util.py", line 13, in is_odd_numbered
   return int(os.path.basename(file_name).split(allowable)[0][-1])%2==1

ValueError: invalid literal for int() with base 10: 'd'

Something is wrong in the ability to determine even/odd numbered-ness. Added "_extracted.pickle" in the code; had to put it first.

Table of Scaling Results:

---------------------------------------------------------------------------------------------------------
                                      CC      N     CC     N     R     R     R   Scale  Scale    SpSig
Bin  Resolution Range  Completeness  int    int    iso   iso    int  split  iso   int    iso      Test
---------------------------------------------------------------------------------------------------------
  1 -1.0000 -  5.3861     [809/809] 80.0%     809 75.2%    805 61.0% 40.1% 52.9% 0.551 214.059 12489.8850
  2  5.3861 -  4.2749     [791/791] 54.9%     791 74.5%    791 53.0% 38.8% 49.7% 0.693 270.307 1785.4625
  3  4.2749 -  3.7345     [781/781] 65.8%     781 81.6%    781 46.5% 33.6% 40.7% 0.762 337.287 1149.4218
  4  3.7345 -  3.3930     [776/776] 63.9%     776 74.5%    776 49.3% 36.4% 48.6% 0.764 283.109  758.0388
  5  3.3930 -  3.1498     [765/765] 67.1%     765 81.9%    765 48.4% 35.6% 43.4% 0.795 338.091  533.7650
  6  3.1498 -  2.9641     [771/771] 58.6%     771 72.4%    771 49.3% 36.6% 50.7% 0.759 286.707  222.4718
  7  2.9641 -  2.8156     [765/765] 56.0%     765 72.3%    765 48.5% 35.3% 46.7% 0.765 320.954  154.5299
  8  2.8156 -  2.6930     [746/746] 63.0%     746 76.1%    746 46.4% 34.3% 42.6% 0.867 357.183  99.4430
  9  2.6930 -  2.5894     [790/790] 52.1%     790 69.4%    790 50.4% 37.4% 47.5% 0.814 314.326  113.1264
 10  2.5894 -  2.5000     [757/757] 54.9%     757 78.6%    757 52.4% 38.9% 44.4% 0.794 306.403  109.0768

All                     [7751/7751] 74.9%    7751 78.8%   7747 51.9% 36.9% 50.1% 0.680 266.538   1298.0
---------------------------------------------------------------------------------------------------------

Of course we know the data do not scale because this is a polar space group, and data must be sorted by Brehm/Diederichs method.

Breaking the indexing ambiguity

Take note of our detail instructions on Resolving an Indexing Ambiguity. Do this in three steps:

1) Generate a database of observations

step1.csh:

#!/bin/csh -f

set effective_params = "d_min=DMIN \
data=/net/dials/raid1/aaron/zurich0038/jr_006_batches/split_reintegrated/extracted \
output.n_bins=10 \
pixel_size=0.172 \
backend=FS \
nproc=60 \
merge_anomalous=True \
plot_single_index_histograms=False \
scaling.algorithm=mark1 \
target_unit_cell=91.4,91.4,45.9,90,90,120 \
target_space_group=P6 \
raw_data.sdfac_auto=False \
include_negatives=NEG \
postrefinement.enable=False \
output.prefix=TAG"

set tag = p6m
set dmin = 2.5
set neg = False
set eff = `echo $effective_params|sed -e "s,FS,Flex,g"|sed -e "s,DMIN,$dmin,g"|sed -e "s,NEG,$neg,g"|sed -e "s,TAG,$tag,g"`

cxi.merge ${eff}

This yields 4988 of 5031 integration files accepted.

2) Sort the lattices

step2.csh:

#!/bin/csh -f

set effective_params = "d_min=DMIN \
pixel_size=0.172 \
target_unit_cell=91.4,91.4,45.9,90,90,120 \
target_space_group=P6 \
backend=FS \
nproc=60 \
merge_anomalous=True \
output.prefix=TAG"

set tag = p6m
set dmin = 3.5
set neg = False
set eff = `echo $effective_params|sed -e "s,FS,Flex,g"|sed -e "s,DMIN,$dmin,g"|sed -e "s,NEG,$neg,g"|sed -e "s,TAG,$tag,g"`

cxi.brehm_diederichs ${eff}

14 plots total. h,k,l=2503 h,-h-k,-1=2485 total 4988

3) Apply reindexing operators and merge

cxi.merge program output

----------------------------------------------------------------------------------------
                                            <asu   <obs
Bin  Resolution Range  Completeness   %    multi> multi> n_meas   <I>    <I/sig(I)>
----------------------------------------------------------------------------------------
  1   -1.000 - 5.386    [1490/1490] 100.00 102.21 102.21 152295   103994   103.244
  2    5.386 - 4.275    [1500/1500] 100.00  62.76  62.76  94141   128403    95.046
  3    4.275 - 3.735    [1499/1499] 100.00  53.90  53.90  80795   143552    92.607
  4    3.735 - 3.393    [1497/1497] 100.00  47.14  47.14  70571   112723    70.575
  5    3.393 - 3.150    [1477/1477] 100.00  43.96  43.96  64928    76925    51.011
  6    3.150 - 2.964    [1488/1488] 100.00  39.87  39.87  59330    57060    37.899
  7    2.964 - 2.816    [1483/1483] 100.00  38.17  38.17  56611    44079    32.085
  8    2.816 - 2.693    [1455/1455] 100.00  36.34  36.34  52874    37117    27.460
  9    2.693 - 2.589    [1530/1530] 100.00  34.49  34.49  52763    30496    24.443
 10    2.589 - 2.500    [1476/1476] 100.00  31.83  31.83  46974    27147    21.564

All                   [14895/14895] 100.00  49.10  49.10 731282    76275    55.681
----------------------------------------------------------------------------------------

cxi.xmerge program output

--------------------------------------
Bin  Resolution Range # images %accept
--------------------------------------
  1 -1.0000 -  5.3861     4712  100.00
  2  5.3861 -  4.2749     4663  98.96
  3  4.2749 -  3.7345     4646  98.60
  4  3.7345 -  3.3930     4614  97.92
  5  3.3930 -  3.1498     4578  97.16
  6  3.1498 -  2.9641     4552  96.60
  7  2.9641 -  2.8156     4521  95.95
  8  2.8156 -  2.6930     4499  95.48
  9  2.6930 -  2.5894     4477  95.01
 10  2.5894 -  2.5000     4416  93.72

All                       4721
--------------------------------------
--------------------------------------------------------------------------------------------------------
                                      CC      N     CC     N     R     R     R   Scale  Scale    SpSig
Bin  Resolution Range  Completeness  int    int    iso   iso    int  split  iso   int    iso      Test
--------------------------------------------------------------------------------------------------------
  1 -1.0000 -  5.3861   [1490/1490] 87.3%    1490 88.1%   1484 46.3% 32.9% 42.6% 0.772 300.328 8084.8580
  2  5.3861 -  4.2749   [1500/1500] 76.4%    1500 89.3%   1500 43.8% 30.6% 34.5% 0.761 425.498 1728.0907
  3  4.2749 -  3.7345   [1499/1499] 80.1%    1499 91.6%   1499 42.5% 26.7% 34.5% 0.684 430.028 1556.6316
  4  3.7345 -  3.3930   [1497/1497] 80.5%    1497 90.3%   1497 37.9% 27.2% 29.9% 0.846 481.795  600.5001
  5  3.3930 -  3.1498   [1477/1477] 84.2%    1477 90.0%   1477 37.2% 26.4% 31.4% 0.838 477.825  269.5784
  6  3.1498 -  2.9641   [1492/1492] 80.0%    1492 91.5%   1492 39.8% 28.6% 28.3% 0.866 511.386  165.9517
  7  2.9641 -  2.8156   [1483/1483] 76.7%    1483 90.0%   1483 39.3% 28.7% 30.1% 0.865 470.331  102.0659
  8  2.8156 -  2.6930   [1451/1451] 76.8%    1451 90.7%   1451 38.5% 28.2% 27.3% 0.883 492.758  88.6666
  9  2.6930 -  2.5894   [1532/1532] 76.6%    1532 89.4%   1532 40.1% 29.3% 30.5% 0.879 452.831  52.0092
 10  2.5894 -  2.5000   [1472/1472] 77.2%    1472 88.9%   1474 42.9% 31.4% 35.3% 0.801 393.866  52.6667

All                   [14893/14893] 84.7%   14893 88.6%  14889 41.6% 29.0% 39.8% 0.771 378.964   804.8
--------------------------------------------------------------------------------------------------------

Table of results

Tag Method Details Resolution
(Angstrom)
# files
accepted
CC1/2
(highest shell)
CCiso
(highest shell)
<|L|> test
(0.5 perfect)
nopost no postrefinement scale only 2.5 4962 (4828) 77.5% (66.2%) 84.0% (85.8%) 0.475
basic rs refine scale, B, rotx,roty 2.5 4942 (4650) 84.7% (77.2%) 88.6% (88.9%) 0.477
trial1 rs2
unit weighting
lorentzian lineshape
analytical derivatives
better convergence test
Flex database
2.5 4719 (4458) 88.2% (74.8%) 89.5% (89.1%) 0.480
trial2 rs2
unit weighting
gaussian lineshape
2.5 4721 (4416) 90.9% (69.6%) 90.9% (89.1%) 0.508
trial3 rs_hybrid
gentle weighting (|I|/sigma**2)
gaussian lineshape
rs2: LBFGS
LevMar to refine Rs
2.5 4059 (3783) 93.5% (37.3%) 95.4% (89.1%) 0.518
trial3 / cycle2 rs_hybrid
gentle weighting (|I|/sigma**2)
gaussian lineshape
recycle model
Use mtz from trial 3
as a scaling reference
2.5 3716 (3432) 92.8% (48.1%) 93.3% (85.7%) 0.522

Useful:

export BOOST_ADAPTBX_FPE_DEFAULT=1
nproc=1
postrefinement.show_trumpet_plot=True