Resolving an Indexing Ambiguity

Here we describe the use of Brehm & Diederichs algorithm 2 to resolve the indexing ambiguity for XFEL data. This is applicable for all polar space groups (where the Bravais symmetry is higher than the space group symmetry) and also for cases with pseudo symmetry (e.g., a monoclinic cell with a near 90-degree beta angle).

Brief Description of the Workflow

It is assumed that the reader is familiar with the tutorial for Merging. The usual workflow of cxi.merge followed by cxi.xmerge will fail in the xmerge step if reindexing operators other than the identity (h,k,l) relate the individual indexed lattices to one another. We resolve this by (1) running cxi.merge to generate a database with all the observations; (2) running cxi.brehm_diederichs, which uses Brehm-Diederichs algorithm 2 to identify the reindexing operators; and (3) re-running the cxi.merge + cxi.xmerge process with the list of reindexing operators as additional input.

Detailed Step-By-Step Instructions

Create a Database of Observations

$ vi myoglobin_step1.csh
#!/bin/csh -f
set trial=${1}

set runs = 127,130,132,134,135,140,141,142,144
set datastring = \
`python -c "print(' '.join(['data=/my_results/L785/r%04d/${trial}/integration'%i for i in [${runs}]]))"`
set tag = myoglobin_${trial}

set effective_params = "d_min=2.0 \
output.n_bins=20 \
${datastring} \
scaling.algorithm=mark1 \
target_unit_cell=90.3,90.3,45.2,90,90,120 \
target_space_group=P6 \
nproc=16 \
merge_anomalous=True \
mysql.runtag=${tag} \
mysql.passwd=terp888 \
mysql.user=nick \
mysql.database=xfelnks \
scaling.mtz_file=fake_filename.mtz \
pixel_size=0.079346 \
output.prefix=${tag}"

cxi.merge ${effective_params} # Note the xmerge script is NOT run here

$ ./myoglobin_step1.csh 009 # create the database from trial 009

Noteworthy parameters:

  • d_min is the high-resolution limit to be used for resolving the indexing ambiguity, not final merging. There is a trade-off: including too many data slows down the determination of lattice-lattice correlation coefficients, which scales as the number of common Miller indices. Including too few data makes the determination of correlation coefficients unreliable or undefined. Future: the program will print out <L>, the average number of observation pairs per correlation coefficient so the resolution limit can be sensibly adjusted.
  • scaling.algorithm is set to mark1 here (no scaling) to illustrate how data would be processed from an unknown structure. As a consequence (see the Advanced Merging page) we set the target_unit_cell and target_space_group but do not provide a PDB model. Also scaling.mtz_file is set to a dummy value.
  • nproc is the number of processors to be used for writing the database. Use as many single-host processors as are available up to a limit of about 16, beyond which there is no further benefit, at least on Linux.
  • merge_anomalous=True. Since we are trying to maximize the number of common Miller indices for each lattice pair, we want to merge the Bijvoet pairs, even if we will look for anomalous differences in the final merging step.
  • backend. Use whatever backend is available on your system; choices are FS, MySQL (default), and SQLite.
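
The one-line Python command embedded in the script expands the comma-separated run list into one data= argument per run. As a rough standalone sketch of what that expansion produces (the function name build_data_args and its root argument are illustrative only and are not part of cxi.merge):

```python
def build_data_args(runs, trial, root="/my_results/L785"):
    """Return one 'data=...' argument per run, mirroring the one-liner in the script.

    build_data_args and root are illustrative names, not part of cctbx.
    """
    return ["data=%s/r%04d/%s/integration" % (root, run, trial) for run in runs]


runs = [127, 130, 132, 134, 135, 140, 141, 142, 144]
# Join with spaces, as the csh script does before passing the string to cxi.merge.
print(" ".join(build_data_args(runs, "009")))
```

Run numbers are zero-padded to four digits, so run 127 of trial 009 becomes data=/my_results/L785/r0127/009/integration.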

Sort the Lattices

$ vi myoglobin_step2.csh
#!/bin/csh -f
set trial=${1}
set tag = myoglobin_${trial}

set effective_params = "d_min=2.0 \
target_unit_cell=90.3,90.3,45.2,90,90,120 \
target_space_group=P6 \
nproc=32 \
merge_anomalous=True \
mysql.runtag=${tag} \
mysql.passwd=terp888 \
mysql.user=nick \
mysql.database=xfelnks \
pixel_size=0.079346 \
output.prefix=${tag}"

cxi.brehm_diederichs ${effective_params}

$ ./myoglobin_step2.csh 009 # sort the lattices from trial 009

Noteworthy parameters:

  • d_min is the high-resolution limit to be used for resolving the indexing ambiguity, not final merging. There is a trade-off: including too many data slows down the determination of lattice-lattice correlation coefficients, which scales as the number of common Miller indices. Including too few data makes the determination of correlation coefficients unreliable or undefined. Future: the program will print out <L>, the average number of observation pairs per correlation coefficient so the resolution limit can be sensibly adjusted.
  • nproc is the number of processors used on a single host. It must be set either to 1 or to a value >=5; there is no implementation for 2-4. If there are a large number of images (>2000) it is advantageous to set this number as high as possible [we use 64 on our AMD Linux machine]. If there are fewer than ~500 images, however, you must use nproc=1.
  • target_unit_cell, target_space_group and merge_anomalous are mandatory and should be set to the same values used in the myoglobin_step1.csh script above.
  • Likewise, mysql.runtag and output.prefix should use the same values.
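
To make the d_min trade-off concrete, here is a minimal sketch of a lattice-lattice correlation coefficient computed over the Miller indices two lattices have in common. It is not the code used by cxi.brehm_diederichs; the function name lattice_cc, the dictionary representation of a lattice, and the min_pairs cutoff are all assumptions made for illustration.

```python
import math


def lattice_cc(obs_a, obs_b, min_pairs=3):
    """Pearson correlation over the Miller indices common to two lattices.

    obs_a, obs_b: dict mapping (h, k, l) -> intensity (illustrative layout).
    Returns (cc, n_pairs); cc is None when the coefficient is undefined.
    """
    common = sorted(set(obs_a) & set(obs_b))
    n = len(common)
    if n < min_pairs:
        return None, n  # too few common observations
    xs = [obs_a[hkl] for hkl in common]
    ys = [obs_b[hkl] for hkl in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None, n  # no variance, correlation undefined
    return sxy / math.sqrt(sxx * syy), n


# Two toy lattices sharing three Miller indices.
lattice_1 = {(1, 0, 0): 10.0, (1, 1, 0): 20.0, (2, 0, 1): 30.0, (3, 1, 2): 5.0}
lattice_2 = {(1, 0, 0): 20.0, (1, 1, 0): 40.0, (2, 0, 1): 60.0, (0, 0, 4): 7.0}
cc, n_pairs = lattice_cc(lattice_1, lattice_2)
print(cc, n_pairs)
```

The work per lattice pair grows with the number of common indices, and the coefficient becomes undefined when too few indices overlap; these are the two sides of the trade-off that d_min controls.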

Output files from this script:

  • ${tag}_intensities_presort.pickle contains a Python tuple with two Miller arrays, the first giving all intensities input into the calculation, and the second assigning each intensity to a lattice [crystal] number. For development, not actually used for anything.
  • ${tag}_lookup.pickle is a Python dictionary whose keys are the reindexing operators, and values are lists of lattice [crystal] numbers to be reindexed accordingly.
  • ${tag}_reverse_lookup.pickle is a Python dictionary whose keys are the original integrated data pickle file names, and values are the reindexing operator to be applied to each file. This output is actually used in the next step.
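
Before moving on it can be useful to check how the lattices were split among the reindexing operators. The sketch below assumes only what is stated above: that the reverse lookup is a dictionary from file name to reindexing operator. The helper names, the example file names, and the assumption that each operator prints sensibly via str() are illustrative. If the real pickle contains cctbx objects, it will likely need to be loaded with the Python interpreter from the cctbx installation.

```python
import pickle
from collections import Counter


def load_reverse_lookup(path):
    """Load a reverse-lookup pickle (file name -> reindexing operator)."""
    with open(path, "rb") as handle:
        # encoding="latin1" lets Python 3 read pickles written by Python 2.
        return pickle.load(handle, encoding="latin1")


def tally_operators(reverse_lookup):
    """Count how many integration files were assigned each operator."""
    return Counter(str(op) for op in reverse_lookup.values())


# Stand-in data with hypothetical file names; a real run would instead call
# load_reverse_lookup("myoglobin_009_reverse_lookup.pickle").
example = {
    "r0127/009/integration/int-0001.pickle": "h,k,l",
    "r0127/009/integration/int-0002.pickle": "h,-h-k,-l",
    "r0130/009/integration/int-0001.pickle": "h,k,l",
}
for op, count in sorted(tally_operators(example).items()):
    print(op, count)
```

A very lopsided tally, or a lookup with far fewer entries than integrated images, may be a reason to revisit d_min in step 2, since images without an operator are ignored in the next step.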

Apply Reindexing Operators and Merge

$ vi myoglobin_step3.csh
#!/bin/csh -f
set trial=${1}

set runs = 127,130,132,134,135,140,141,142,144
set datastring = \
`python -c "print ' '.join(['data=/my_results/L785/r%04d/${trial}/integration'%i for i in [${runs}]])"`
set tag = myoglobin_${trial}_step3

set effective_params = "d_min=1.8 \
d_max=6.0 \
output.n_bins=20 \
${datastring} \
merging.reverse_lookup=myoglobin_${trial}_reverse_lookup.pickle \
scaling.algorithm=mark0 \
model=3u3e.pdb \
model_reindex_op=h,-h-k,-l \
nproc=16 \
merge_anomalous=False \
mysql.runtag=${tag} \
mysql.passwd=terp888 \
mysql.user=nick \
mysql.database=xfelnks \
scaling.mtz_file=3u3e.data.mtz \
scaling.mtz_column_F=fmodel \
pixel_size=0.079346 \
output.prefix=${tag}"

cxi.merge ${effective_params} 
cxi.xmerge ${effective_params} 

$ ./myoglobin_step3.csh 009 # apply the reindexing operators and merge

Noteworthy parameters:

  • d_min is now set to the high-resolution limit desired for the merging step.
  • merge_anomalous=False. Set this to False if looking for anomalous differences, so that the Bijvoet pairs are kept separate in the final merge.
  • scaling.algorithm is set to mark0 here to illustrate scaling of a P6 myoglobin XFEL data set against an isomorphous structure already in the PDB.
  • model is the PDB file used for scaling individual crystals in the cxi.merge step.
  • model_reindex_op. A total kludge. Although we've sorted out the indexing ambiguity for the XFEL data, there is still the possibility that the reference model is indexed the "wrong" way. This parameter lets us reindex the reference to coincide with the XFEL data. The default value is "h,k,l" (no reindexing).
  • scaling.mtz_file. For the statistics in cxi.xmerge, this is set to an mtz file with Fcalc's from phenix.fmodel 3u3e.pdb high_resolution=1.78 output.type=real.
  • scaling.mtz_column_F is set to whatever column name contains the reference F's, in this case "fmodel".
  • d_max is set so that isomorphous correlation coefficients are not calculated on the very low resolution reflections, since the Fcalc's have no solvent contribution. If an observed reference data set were used for the mtz_file this precaution would be unnecessary.
  • nproc is the number of processors to be used for merging. Use as many single-host processors as are available up to a limit of about 16, beyond which there is no further benefit, at least on Linux.
  • merging.reverse_lookup tells the cxi.merge script to reindex all the integrated data files according to the operators determined in step 2, and to ignore any integrated data for which no reindexing operator was determined. In the cxi.xmerge script, each reindexing operator is considered in turn in order to report isomorphous CC and R statistics against the reference data set. It is up to the user to inspect the output and find the reindexing that gives the highest CCiso values.
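
For readers unfamiliar with the operator notation, the following sketch shows what a reindexing operator such as h,-h-k,-l does to a single Miller index. It is a plain-Python illustration, not the cctbx implementation (cctbx handles this through its own change-of-basis machinery in sgtbx). The function name apply_reindex_op is hypothetical, and the parser assumes operators built only from signed h, k and l terms with no numeric coefficients.

```python
import re

# One signed term such as "h", "+k" or "-l".
_TERM = re.compile(r"([+-]?)([hkl])")


def apply_reindex_op(op, hkl):
    """Apply an operator string like 'h,-h-k,-l' to a Miller index (h, k, l)."""
    values = dict(zip("hkl", hkl))
    parts = op.lower().replace(" ", "").split(",")
    if len(parts) != 3:
        raise ValueError("operator needs three comma-separated parts: %r" % op)
    result = []
    for expr in parts:
        terms = _TERM.findall(expr)
        # Reject anything the simple term pattern did not fully consume.
        if not terms or "".join(sign + sym for sign, sym in terms) != expr:
            raise ValueError("unsupported expression: %r" % expr)
        total = 0
        for sign, sym in terms:
            total += -values[sym] if sign == "-" else values[sym]
        result.append(total)
    return tuple(result)


original = (1, 2, 3)
reindexed = apply_reindex_op("h,-h-k,-l", original)
print(reindexed)                                  # (1, -3, -3)
print(apply_reindex_op("h,-h-k,-l", reindexed))   # (1, 2, 3)
```

Applying this particular operator twice returns the original index, and the default "h,k,l" leaves every index unchanged.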