Processing Blog

From cctbx_xfel
Revision as of 10:27, 11 August 2015 by Aaron (Talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

Data Processing Blog

Reindexing operator needed to identify C-axis in an orthorhombic setting (6 May 2014 NKS)

The user brought us data from an orthorhombic space group. The synchrotron structure had been solved and published with this unit cell:

space group P 21 21 21 unit cell 107.5 75.0 51.3 90 90 90

However, the program cxi.merge fails to merge any XFEL data when the synchrotron-solved PDB file is used as the "model" for isomorphous scaling. Knowing that the axes of an orthorhombic group can be permuted, we immediately suspected a mismatch between the XFEL unit cell and the PDB unit cell. To inspect the XFEL integration results, we use the command:

cxi.print_pickle [integration_result_file.pickle]

All of the integration pickle files had this unit cell:

Laue group P 2 2 2 unit cell 50.1 73.6 103.8 90 90 90

Therefore we added the following item to the phil-parameters for cxi.merge:

data_reindex_op=l,k,h

This reindexing had the desired effect of aligning the new data with the reference model.

A tricky case of merging resolution limits

Merging some data today where the default per-frame resolution filter settings were cutting things a little too harshly, and discarding a lot more data than expected (based on manual inspection of a maximum projection pattern). So, how to change this? CXTBX.XFEL performs a per-frame resolution test at two stages: during integration, and during merging. Here, we are going to discuss editing the during merging part of things. There are two phil parameters you can play with for this: significance_filter.sigma (default 0.5), and significance_filter.max_ct (default 50). The sigma option will remove resolution bins such that all accepted bins have Mean(I/sigma) >= sigma, and the max_ct option will set the bin sizes so that the number of bins is maximized, subject to mean bin population <= max_ct.

A quick inspection of the terminal output from cxi.merge showed that in many cases, the Mean(I/sigma) was jumping up and down a lot, for example:

Step 4. Filter on global resolution and map to asu

 Bin  Resolution Range  Compl.         <I>     <I/sig(I)>
   1 22.4216 -  8.1030 [127/5229]       646.3       3.8
   2  8.1030 -  6.4828 [ 45/5157]       428.9       1.7
   3  6.4828 -  5.6785 [103/5257]        89.3       0.7
   4  5.6785 -  5.1663 [107/5182]        62.2       0.4
   5  5.1663 -  4.7999 [101/5201]       121.2       0.5
   6  4.7999 -  4.5193 [ 78/5189]        93.9       0.3
   7  4.5193 -  4.2946 [ 83/5191]        36.2       0.2
   8  4.2946 -  4.1089 [ 71/5156]       197.4       0.7
   9  4.1089 -  3.9516 [ 67/5199]        43.5       0.3
  10  3.9516 -  3.8159 [ 74/5170]        83.3       0.4
  11  3.8159 -  3.6971 [110/5194]        12.7       0.2
  12  3.6971 -  3.5919 [ 76/5227]        16.9       0.2

Using the defaults, this file would cut off at 5.7 A, but I would like it to go to 4.8, since it looks like that is a legitimate 0.5 in the I/sigI column. Changing the bin size should average this out. I therefore re-ran the merging with the following in my phil file:

significance_filter{ 
    sigma=0.5
    max_ct=100
}

This helped a bit, and the per-image cuttoffs were looking more sensible. I then tweaked the sigma down to 0.4, since it decreased gradually to around this point before going mental, and got even better results.

When doing this, how do you know when you are just adding junk to your file? If you observe lots of variation in your intensities per image per bin, then it's fairly safe to increase max_ct. This has the effect of smoothing things out, but may also underestimate your resolution if the bins become too large. Reducing sigma, on the other hand, can easily introduce junk data. The only real way to test for this is to see if your statistics from cxi.xmerge, in particular CC1/2, get better.

--Zeldin (talk) 23:19, 17 April 2014 (UTC)

A Traceback error in April

15 April 2014. A user writes:

I am seeing a Traceback error during integration with pyana:

/cctbx_project/myrelease/cxi_data/present_image_0000001

Beam center is not immediately clear; rigorously retesting 9 solutions
Beam x 96.6 y 97.1, initial score 9; refined rmsd: None
Beam x 96.8 y 97.5, initial score 8; refined rmsd: None
Beam x 96.6 y 97.0, initial score 7; refined rmsd: None
Beam x 96.6 y 97.2, initial score 7; refined rmsd: None
Beam x 96.7 y 97.4, initial score 6; refined rmsd: None
Beam x 96.6 y 97.3, initial score 6; refined rmsd: None
Beam x 96.9 y 97.0, initial score 6; refined rmsd: None
Beam x 96.9 y 97.1, initial score 6; refined rmsd: None
Process Process-2:

Traceback (most recent call last):
  File "/usr/local/crys_test/psdm/sw/external/python/2.7.2/x86_64-rhel6-gcc44-opt/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/local/crys_test/psdm/sw/external/python/2.7.2/x86_64-rhel6-gcc44-opt/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/crys/psdm/sw/releases/ana-current/arch/x86_64-rhel6-gcc44-opt/python/pyana/pyanamod.py", line 126, in _proc
    for req in proto.getRequest():
  File "/usr/local/crys/psdm/sw/releases/ana-current/arch/x86_64-rhel6-gcc44-opt/python/pyana/mp_proto.py", line 121, in getRequest
    opcode = self._conn.recv_bytes()
EOFError

Reply:

  • It's not clear what's causing the EOFError. However, we should at least check for memory leaks that would cause Pyana processing to draw to a standstill. The diagnostic is to use the "top" command in the Unix shell. Observe whether or not the subprocess memory use continually increases as more images are processed. Correct behavior is for the memory use to reach a stationary plateau. We recently fixed a memory leak bug (March 2014), but it is worth checking this to make sure.
  • You should generally not get a "rigorously retesting 9 solutions" message for XFEL indexing. Generally for XFELs we know the beam position accurately so there is no need to search for the direct beam. Your phil file should include beam_search_scope=0.5 to avoid a large search radius. However, your log indicates a narrow search, likely indicating that spot finder sees numerous spots that are unphysically close. Many things could be done to safeguard you from this:
    • Careful use of detector masks, as described in Preparatory steps on the tutorial wiki . This would avoid picking spots on compromised areas of the detector. Aaron is this correct?
    • Use distl.minimum_spot_area=1 instead of =0. This would avoid picking random local-maxima pixels as spots, but there is a tradeoff.
    • Set the phil file resolution cutoff (3 different places) to a value close to where you actually intend to cut off the merging. This avoids picking spots for indexing that aren't real spots.
  • I notice you are using integration.detector_gain=7.5. However, we've been having a lot of trouble with setting this to anything else but 1.0. Until we can sort out the code, I recommend using gain=1.0 during processing, and then in the cxi.merge phil use significance_filter.sigma=0.5. This combination should help retain as many reflections as possible.


Low resolution reflections missing

A user was indexing using cxi.index and finding many low resolution spots not being predicted. Here, cxi.index refuses to predict spots that are overlapped. Setting integration.background_factor=1 and integration.guard_width_sq=3 can force a better outcome.


Indexing images with fewer spots

A user was having a hard time with images that had clear diffraction patterns with between 20 to 40 spots but that didn't index with cxi.index. Here's how we resolved the problem.

Add the following to the indexing phil file:

1) target cell and known setting. These parameters provide constraints on the still-image indexing procedure algorithm that finds basis vectors without the benefit of additional images with rotational data. See the phil section. Example for lysozyme:

target_cell = 38 74 74 90 90 90
known_setting = 9

2) Add distl_minimum_number_spots_for_indexing=20 and model_refinement_minimum_N=10 to the phil file. These parameters set lower limits on the number of spots needed for indexing and crystal model refinement. For patterns with fewer spots, it disables two checks for enough spots to proceed.

Wrong Bravais lattice chosen during indexing

Some of the user's images produced integration results in the triclinic setting, but they had included a target cell in the hexagonal setting. Here are the lines at the end of the log file that show the symptom:

 $ cxi.index -d -n 16 target=<parameters.phil> <image.pickle>
 New Horizons Integration results:
 Solution  SpaceGroup Beam x   y  distance  Resolution Mosaicity RMS
     12           P3 156.63 156.52  374.79       4.00    0.5    0.246
 :)   1           P1 156.63 156.52  374.79      11.63    0.5    0.041

The :) shows that the software chose the triclinic setting for its final solution. The explanation is that LABELIT compares the ratio of RMS values from the highest to the lowest symmetry settings (RMS is the root mean squared difference between observed and predicted spot positions, in mm). The default is to allow a factor 3.5 ratio, no more. This parameter can be altered in the *.phil file or on the command line:

$ cxi.index -d -n 16 target=<parameters.phil> <image.pickle> mosflm_rmsd_tolerance=10

The parameter is called "mosflm" rmsd tolerance for legacy reasons: LABELIT uses MOSFLM as the integration engine; however cctbx.xfel has its own integration engine. Result:

 New Horizons Integration results:
 Solution  SpaceGroup Beam x   y  distance  Resolution Mosaicity RMS
 :)  12           P3 156.63 156.52  374.79       4.00    0.5    0.246
      1           P1 156.63 156.52  374.79      11.63    0.5    0.041

Now the higher symmetry setting is chosen as the final solution.



Processing stills

Here's a recent example of processing still from the MAR detector at XPP

General reference: Indexing individual stills

All the commands listed here have some help with the --help option.

cd <path to directory with .mccd files from xpp>
cxi.image2pickle -v -c *.mccd
mkdir pickles
mv -v *.pickle pickles

Converting the images to pickle format, using the -c (crop) option, sets them up a bit nicer for the indexer Next, I picked one of the high res images and looked at the spotfinding parameters for the program:

distl.image_viewer distl.minimum_signal_height=5 distl.minimum_spot_area=2 distl.minimum_spot_height=5 pickles/<one of the image files>.pickle

My initial guess (based on my experience with the MAR detector) showed a reasonable diffraction pattern picked out in red by the spotfinder.

I then took a phil file that had been working well for us from another experiment, and edited it using these spotfinding parameters. Seeing as I didn't know the unit cell, I removed the target cell. I created a results directory and called this first test trial 000, and indexed all the images.

mkdir -p results/000
cxi.index -d -s -n 12 -o results/000 target=stills_target.phil pickles | tee results/trial_000.out

This indexed around 202 images. To get an idea of the unit cell, I entered the trial 000 directory and ran cluster.unit_cell --log . Most of the images were in the following cluster:

cluster_26      166       192.5(1.3 ) 192.5(1.5 ) 291.1(2.9 ) 90.00 (0.00) 90.00 (0.00) 90.00 (0.00)
163 in P4, 3 in P222.

The numbers in parentheses are standard deviations for those cell dimensions. Now, indexing stills is greatly helped when the symmetry is known upfront, so I added a target_cell to the phil file, like so:

# Derived from average cell after scaling and merging.
target_cell = 192.5 192.5 291.1 90.00 90.00 90.00
known_setting = 9

For the space group, cxi.index uses a known_setting parameter. P4 is in the tetragonal setting. From this webpage:

Phil

that means I need to use a known setting value of 9.

I decided to use the LCLS computing cluster for the next indexing trial. I was logged into psana on an interactive node provided to me by the cluster, and others were logged in to the same node concurrently. The queuing system would ensure I got my own node for the job:

mkdir results/001
bsub -n 12 -q psanaq -o results/trial_001.out cxi.index -s -n 12 -d -o results/001 target=stills_target.phil pickles

This yielded 218 indexed images.

I did two rounds of merging/postrefinement with prime.postrefine. Post refinement trial 000, using indexing trial 001, in directory 001_000, I used the wrong number of amino acids per asymmetric unit, but it got me a sense of the completeness limits. Using that I remerged to an ambitious 3.3 angstroms, in directory 001_001. The files of interest are:

prime/001_001/log.txt prime/001_001/postref_cycle_2_merge.mtz

Command for merging:

bsub -n 12 -q psanaq -o results/trial_001.out cxi.index -s -n 12 -d -o results/001 target=prime.phil pickles

Documentation for prime is on the wiki.

Where to go from here? I would look at which files indexed and which didn’t. See if there are some that didn’t index that should have. Maybe that could improve the completeness. One could look into prime.iota, which does a grid search for spotfinder parameters. Also, cxi.index, if you don’t supply the -d flag, will pop up a few images showing the indexing results overlaid on the image (first in the triclinic setting and then in the best symmetry setting). Red pixels are spotfinder spots, blue are integrated and yellow are background. One could run cxi.index with a single file to see how it's doing (don’t forget to supply a target, like the command above).