2017 cxi merge tutorial
This is an updated, worked example of data merging using cxi.merge, for presentation at the Feb 17, 2017 Berkeley Lab Serial Crystallography Workshop. Previous documentation sets are here and here.
Initial characterization
In this example, we are given integrated still-shot data collected by Danny Axford at Diamond, for P6 myoglobin, PDB code 5M3S.
- /net/dials/raid1/aaron/zurich0038/jr_006_batches/split_reintegrated/extracted # cctbx-style integration pickles
- /net/dials/raid1/aaron/zurich0038/jr_006_batches/sig_filter/split_reintegrated/extracted # same data, with per-image resolution cutoff during integration
Unix ls reveals 5031 *.pickle files in each directory.
Immediately there is a problem:
$ cxi.print_pickle /net/dials/raid1/aaron/zurich0038/jr_006_batches/sig_filter/split_reintegrated/extracted/*.pickle
...fails on image 0059 with a traceback; it looks like the file is corrupted.
So focus on the data without integration resolution cutoff:
$ cxi.print_pickle /net/dials/raid1/aaron/zurich0038/jr_006_batches/split_reintegrated/extracted/*.pickle
Some conclusions with the aid of grep:
- all integration pickles have space group P6 (good)
- distance and beam center is fixed throughout the integrated dataset
- Unit cells are variable but do seem to cluster around 91.4 91.4 45.9 90 90 120
phenix.fetch_pdb --mtz 5m3s
Merge command file:
#!/bin/csh -f set effective_params = "d_min=DMIN \ data=/net/dials/raid1/aaron/zurich0038/jr_006_batches/split_reintegrated/extracted/*.pickle \ output.n_bins=10 \ pixel_size=0.172 \ backend=FS \ nproc=1 \ model=5m3s.pdb \ merge_anomalous=True \ plot_single_index_histograms=False \ scaling.algorithm=mark0 \ raw_data.sdfac_auto=False \ scaling.mtz_file=5m3s.mtz \ scaling.show_plots=False \ scaling.log_cutoff=None \ scaling.mtz_column_F=i-obs \ scaling.report_ML=True \ set_average_unit_cell=True \ rescale_with_average_cell=False \ significance_filter.apply=True \ significance_filter.min_ct=30 \ significance_filter.sigma=0.2 \ include_negatives=NEG \ postrefinement.enable=True \ postrefinement.algorithm=rs \ output.prefix=TAG" set tag = p6m set dmin = 2.5 set neg = True set eff = `echo $effective_params|sed -e "s,FS,Flex,g"|sed -e "s,DMIN,$dmin,g"|sed -e "s,NEG,$neg,g"|sed -e "s,TAG,$tag,g"` cxi.merge ${eff} exit cxi.xmerge ${eff} phenix.xtriage ${tag}_s0_mark0.mtz scaling.input.xray_data.obs_labels=imean
Initial trial nproc=1 just to see if it runs. Had to fix PDB reference. Can't use *.pickle on the data= line
Scale-up trial nproc=60, no postrefinement. set the MTZ flag = jobs
4493 of 5031 integration files were accepted 0 rejected due to wrong Bravais group 11 rejected for unit cell outliers 22 rejected for low signal 505 rejected due to up-front poor correlation under min_corr parameter 0 rejected for file errors or no reindex matrix
Usage: 5m3s.mtz does not contain any observations labelled [fobs, imean, i-obs]. Please set scaling.mtz_column_F to one of [iobs].
File "/net/viper/raid1/sauter/proj-e/modules/cctbx_project/xfel/cxi/util.py", line 13, in is_odd_numbered return int(os.path.basename(file_name).split(allowable)[0][-1])%2==1
ValueError: invalid literal for int() with base 10: 'd'
Something is wrong in the ability to determine even/odd numbered-ness. Added "_extracted.pickle" in the code; had to put it first.
Table of Scaling Results:
--------------------------------------------------------------------------------------------------------- CC N CC N R R R Scale Scale SpSig Bin Resolution Range Completeness int int iso iso int split iso int iso Test --------------------------------------------------------------------------------------------------------- 1 -1.0000 - 5.3861 [809/809] 80.0% 809 75.2% 805 61.0% 40.1% 52.9% 0.551 214.059 12489.8850 2 5.3861 - 4.2749 [791/791] 54.9% 791 74.5% 791 53.0% 38.8% 49.7% 0.693 270.307 1785.4625 3 4.2749 - 3.7345 [781/781] 65.8% 781 81.6% 781 46.5% 33.6% 40.7% 0.762 337.287 1149.4218 4 3.7345 - 3.3930 [776/776] 63.9% 776 74.5% 776 49.3% 36.4% 48.6% 0.764 283.109 758.0388 5 3.3930 - 3.1498 [765/765] 67.1% 765 81.9% 765 48.4% 35.6% 43.4% 0.795 338.091 533.7650 6 3.1498 - 2.9641 [771/771] 58.6% 771 72.4% 771 49.3% 36.6% 50.7% 0.759 286.707 222.4718 7 2.9641 - 2.8156 [765/765] 56.0% 765 72.3% 765 48.5% 35.3% 46.7% 0.765 320.954 154.5299 8 2.8156 - 2.6930 [746/746] 63.0% 746 76.1% 746 46.4% 34.3% 42.6% 0.867 357.183 99.4430 9 2.6930 - 2.5894 [790/790] 52.1% 790 69.4% 790 50.4% 37.4% 47.5% 0.814 314.326 113.1264 10 2.5894 - 2.5000 [757/757] 54.9% 757 78.6% 757 52.4% 38.9% 44.4% 0.794 306.403 109.0768 All [7751/7751] 74.9% 7751 78.8% 7747 51.9% 36.9% 50.1% 0.680 266.538 1298.0 ---------------------------------------------------------------------------------------------------------
Of course we know the data do not scale because this is a polar space group, and data must be sorted by Brehm/Diederichs method.
Breaking the indexing ambiguity
Take note of our detail instructions on Resolving an Indexing Ambiguity. Do this in three steps:
1) Generate a database of observations
step1.csh:
#!/bin/csh -f set effective_params = "d_min=DMIN \ data=/net/dials/raid1/aaron/zurich0038/jr_006_batches/split_reintegrated/extracted \ output.n_bins=10 \ pixel_size=0.172 \ backend=FS \ nproc=60 \ merge_anomalous=True \ plot_single_index_histograms=False \ scaling.algorithm=mark1 \ target_unit_cell=91.4,91.4,45.9,90,90,120 \ target_space_group=P6 \ raw_data.sdfac_auto=False \ include_negatives=NEG \ postrefinement.enable=False \ output.prefix=TAG" set tag = p6m set dmin = 2.5 set neg = False set eff = `echo $effective_params|sed -e "s,FS,Flex,g"|sed -e "s,DMIN,$dmin,g"|sed -e "s,NEG,$neg,g"|sed -e "s,TAG,$tag,g"` cxi.merge ${eff}
This yields 4988 of 5031 integration files accepted.
2) Sort the lattices
step2.csh:
#!/bin/csh -f set effective_params = "d_min=DMIN \ pixel_size=0.172 \ target_unit_cell=91.4,91.4,45.9,90,90,120 \ target_space_group=P6 \ backend=FS \ nproc=60 \ merge_anomalous=True \ output.prefix=TAG" set tag = p6m set dmin = 3.5 set neg = False set eff = `echo $effective_params|sed -e "s,FS,Flex,g"|sed -e "s,DMIN,$dmin,g"|sed -e "s,NEG,$neg,g"|sed -e "s,TAG,$tag,g"` cxi.brehm_diederichs ${eff}
BOOST crash--floating point error
^Z; kill %%
Try using d_min 3.5 instead of 2.5--still crash
Try using fewer proc; use 30 instead of 60. (increases problem size by 2**2=4) --still crash
Try nproc=15
It looks like the crash is associated with the matplotlib plot as I only experience it when I mouse-over the plot.
setenv BOOST_ADAPTBX_FPE_DEFAULT 1
14 plots total. h,k,l=2503 h,-h-k,-1=2485 total 4988
3) Apply reindexing operators and merge
cxi.merge program output
---------------------------------------------------------------------------------------- <asu <obs Bin Resolution Range Completeness % multi> multi> n_meas <I> <I/sig(I)> ---------------------------------------------------------------------------------------- 1 -1.000 - 5.386 [1490/1490] 100.00 102.21 102.21 152295 103994 103.244 2 5.386 - 4.275 [1500/1500] 100.00 62.76 62.76 94141 128403 95.046 3 4.275 - 3.735 [1499/1499] 100.00 53.90 53.90 80795 143552 92.607 4 3.735 - 3.393 [1497/1497] 100.00 47.14 47.14 70571 112723 70.575 5 3.393 - 3.150 [1477/1477] 100.00 43.96 43.96 64928 76925 51.011 6 3.150 - 2.964 [1488/1488] 100.00 39.87 39.87 59330 57060 37.899 7 2.964 - 2.816 [1483/1483] 100.00 38.17 38.17 56611 44079 32.085 8 2.816 - 2.693 [1455/1455] 100.00 36.34 36.34 52874 37117 27.460 9 2.693 - 2.589 [1530/1530] 100.00 34.49 34.49 52763 30496 24.443 10 2.589 - 2.500 [1476/1476] 100.00 31.83 31.83 46974 27147 21.564 All [14895/14895] 100.00 49.10 49.10 731282 76275 55.681 ----------------------------------------------------------------------------------------
cxi.xmerge program output
-------------------------------------------------------------------------------------------------------- CC N CC N R R R Scale Scale SpSig Bin Resolution Range Completeness int int iso iso int split iso int iso Test -------------------------------------------------------------------------------------------------------- 1 -1.0000 - 5.3861 [1490/1490] 87.3% 1490 88.1% 1484 46.3% 32.9% 42.6% 0.772 300.328 8084.8580 2 5.3861 - 4.2749 [1500/1500] 76.4% 1500 89.3% 1500 43.8% 30.6% 34.5% 0.761 425.498 1728.0907 3 4.2749 - 3.7345 [1499/1499] 80.1% 1499 91.6% 1499 42.5% 26.7% 34.5% 0.684 430.028 1556.6316 4 3.7345 - 3.3930 [1497/1497] 80.5% 1497 90.3% 1497 37.9% 27.2% 29.9% 0.846 481.795 600.5001 5 3.3930 - 3.1498 [1477/1477] 84.2% 1477 90.0% 1477 37.2% 26.4% 31.4% 0.838 477.825 269.5784 6 3.1498 - 2.9641 [1492/1492] 80.0% 1492 91.5% 1492 39.8% 28.6% 28.3% 0.866 511.386 165.9517 7 2.9641 - 2.8156 [1483/1483] 76.7% 1483 90.0% 1483 39.3% 28.7% 30.1% 0.865 470.331 102.0659 8 2.8156 - 2.6930 [1451/1451] 76.8% 1451 90.7% 1451 38.5% 28.2% 27.3% 0.883 492.758 88.6666 9 2.6930 - 2.5894 [1532/1532] 76.6% 1532 89.4% 1532 40.1% 29.3% 30.5% 0.879 452.831 52.0092 10 2.5894 - 2.5000 [1472/1472] 77.2% 1472 88.9% 1474 42.9% 31.4% 35.3% 0.801 393.866 52.6667 All [14893/14893] 84.7% 14893 88.6% 14889 41.6% 29.0% 39.8% 0.771 378.964 804.8 --------------------------------------------------------------------------------------------------------
Table of results
Tag | Method | Details | Resolution (Angstrom) |
# files accepted |
CC1/2 (highest shell) |
CCiso (highest shell) |
<|L|> test (0.5 perfect) |
nopost | no postrefinement | scale only | 2.5 | 4962 (4828) | 77.5% (66.2%) | 84.0% (85.8%) | 0.423 |
basic | rs | refine scale, B, rotx,roty | 2.5 | 4942 (4650) | 84.7% (77.2%) | 88.6% (88.9%) | 0.455 |
trial1 | rs2 unit weighting lorentzian lineshape |
analytical derivatives better convergence test Flex database |
2.5 | 4719 (4458) | 88.2% (74.8%) | 89.5% (89.1%) | 0.459 |
trial2 | rs2 unit weighting gaussian lineshape |
2.5 | 4721 (4416) | 90.9% (69.6%) | 90.9% (89.1%) | 0.470 | |
trial3 | rs_hybrid gentle weighting (|I|/sigma**2) gaussian lineshape |
rs2: LBFGS LevMar to refine Rs |
2.5 | 4059 (3783) | 93.5% (37.3%) | 95.4% (89.1%) | 0.504 |
trial3 / cycle2 | rs_hybrid gentle weighting (|I|/sigma**2) gaussian lineshape recycle model |
Use mtz from the first cycle as a scaling reference |
2.5 | 3973 (3700) | 93.3% (55.5%) | 93.6% (87.0%) | 0.509 |