Difference between revisions of "Cctbx.prime"

From cctbx_xfel
(Running in manual mode)
 
(One intermediate revision by one other user not shown)
Line 1: Line 1:
  
 
== Prime: '''p'''ost-'''r'''ef'''i'''nement and '''me'''rging ==
 
== Prime: '''p'''ost-'''r'''ef'''i'''nement and '''me'''rging ==
With the latest update, prime can be used to process data on multiple nodes (on queuing system). At the moment, only LSF (bsub) is supported. See documentation below for more information how to use the queuing system.
+
With the latest update, prime can be used to process data on multiple nodes (on a queuing system). At the moment, only LSF (bsub) is supported. See documentation below for more information on how to use the queuing system.
  
 
This major update replaces prime.postrefine with  
 
This major update replaces prime.postrefine with  
Line 8: Line 8:
 
prime.run  
 
prime.run  
 
</pre>
 
</pre>
 
For auto mode, you can still use prime.run with your parameter phil file like before. For manual mode, the available sub commands in prime are:
 
 
<pre>
 
prime.genref #generates a reference set from given integration results
 
prime.postrefine #refines all images
 
prime.merge #merges all refined results for an mtz file
 
</pre>
 
 
You can choose to run these commands independently (ideally in the above order) using the same phil file. See [https://commons.wikimedia.org/wiki/File:Prime_flowcharts.tif "PRIME flowchart"]. This gives you the freedom to change something (e.g. the set of parameters to refine, the resolution cut-off, etc.) at different stages of post-refinement and merging. See running prime in manual mode for more detail.
 
  
 
Step-by-step guidelines to post-refine and merge XFEL diffraction images. For more detail and citation, see  
 
Step-by-step guidelines to post-refine and merge XFEL diffraction images. For more detail and citation, see  
Line 23: Line 13:
 
[http://elifesciences.org/content/4/e05421 "DOI: http://dx.doi.org/10.7554/eLife.05421"]
 
[http://elifesciences.org/content/4/e05421 "DOI: http://dx.doi.org/10.7554/eLife.05421"]
  
== Prime is gui-ed (new!)==  
+
== Prime is gui-ed ==  
 
Thanks to Dr. Lyubimov, PRIME is also available as a graphical user interface program. Try it by running
 
Thanks to Dr. Lyubimov, PRIME is also available as a graphical user interface program. Try it by running
  
Line 124: Line 114:
 
Also, don't forget to set your pixel size in millimeters. Check which detector you used and note its pixel size.
 
Also, don't forget to set your pixel size in millimeters. Check which detector you used and note its pixel size.
  
'''Running post-refinement in automatic mode'''<br>
+
'''Running post-refinement'''<br>
 
Once you have the input .phil file, you can run ''prime'' by calling
 
Once you have the input .phil file, you can run ''prime'' by calling
 
<pre>
 
<pre>
Line 170: Line 160:
 
</pre>
 
</pre>
  
== More detail with input parameters ==
+
== Solving indexing ambiguity (New) ==  
Now that you have your first trial merged data set, you can explore different parameter settings to merge or to obtain the Bijvoet pairs (I+/I-) for your anomalous data set.
+
* SOFTWARE UPDATE REQUIRED *
 +
With the latest version (Aug 31, 2016), you can solve the indexing ambiguity problem directly in prime. The Brehm & Diederichs algorithms ([http://dx.doi.org/10.1107/S1399004713025431 "doi:10.1107/S1399004713025431"]) have been implemented with bootstrap capability to handle large datasets.
 +
<br>
 +
'''Merohedral Twinning'''<br>
 +
For merohedral twinning (27 space groups e.g. P6), the indexing choices will be determined automatically in prime. Use this default setting in your .phil file,
  
'''Anomalous data:'''
 
 
<pre>
 
<pre>
target_anomalous_flag = True
+
indexing_ambiguity {
 +
  mode = Auto
 +
  index_basis_in = None
 +
  assigned_basis = None
 +
  n_sample_frames = 300
 +
  n_selected_frames = 100
 +
}
 +
</pre>
 +
 
 +
The n_sample_frames parameter sets the number of images used to calculate the scoring function. After that, only n_selected_frames images are used in the B&D algorithms. This saves a lot of computing time, since only the selected images are used to determine the ambiguity. You can change these two parameters to suit your experiment. The default values are 300 and 100 (sample 300, use 100).
 +
 
 +
'''Pseudo-Merohedral Twinning'''<br>
 +
For pseudo-merohedral twinning, there are several possible indexing choices, so prime doesn't determine them automatically. If you suspect pseudo-merohedral twinning (b and c are similar, and the beta angle is almost but not quite 90 degrees), you can force prime to resolve the ambiguity using the choices you supply.
 +
<pre>
 +
indexing_ambiguity {
 +
  mode = Forced
 +
  index_basis_in = None
 +
  assigned_basis = -h,l,k
 +
  assigned_basis = -k, l, h
 +
  n_sample_frames = 300
 +
  n_selected_frames = 100
 +
}
 
</pre>
 
</pre>
In the last cycle, prime will output a reflection set with I+ and I-.
 
  
'''Indexing ambiguity'''
+
When you set indexing_ambiguity.mode to Forced, you can assign indexing choices to suit your problem. In this example, two additional choices (-h, l, k and -k, l, h) were assigned as candidate indexing choices.
  
For space groups with indexing ambiguity, use the solutions from cctbx.xfel (see Tutorial for resolving indexing ambiguity) to merge the data set.
+
'''Reusing the solution''' <br>
 +
At the end of the run, your solution pickle is saved to your_run_no/index_ambiguity/solution_pickle.pickle. If you don't want to spend time solving the ambiguity again in the next run, you can reuse this solution pickle by setting these parameters:
 
<pre>
 
<pre>
 
indexing_ambiguity {
 
indexing_ambiguity {
   flag_on = True
+
   mode = Auto
   index_basis_in = /path/to/solution/pickle_file.pickle
+
   index_basis_in = your_run_no/index_ambiguity/solution_pickle.pickle
 
}
 
}
 
</pre>
 
</pre>
 +
 +
This will bypass the indexing ambiguity module. Prime will use the solution file to perform normal post-refinement and merging.
 +
 +
'''Using an external reference set''' <br>
 +
To use another isomorphous dataset (e.g. from a synchrotron experiment) as a reference set to solve the ambiguity, you can specify an mtz file as part of these parameters:
 +
<pre>
 +
indexing_ambiguity {
 +
  mode = Auto
 +
  index_basis_in = path/to/your/mtz/file.mtz
 +
}
 +
</pre>
 +
Again, you can choose to do Auto or Forced (with a list of assigned_basis parameters) depending on your problem.
 +
 +
== More detail with input parameters ==
 +
Now that you have your first trial merged data set, you can explore different parameter settings to merge or to obtain the Bijvoet pairs (I+/I-) for your anomalous data set.
 +
 +
'''Anomalous data:'''
 +
<pre>
 +
target_anomalous_flag = True
 +
</pre>
 +
In the last cycle, prime will output a reflection set with I+ and I-.
  
 
'''Number of micro- and macrocycles'''
 
'''Number of micro- and macrocycles'''
 
<pre>
 
<pre>
 
n_postref_cycle = 3
 
n_postref_cycle = 3
n_postref_sub_cycle = 3
+
n_postref_sub_cycle = 1
 
</pre>
 
</pre>
  
Line 207: Line 242:
 
</pre>
 
</pre>
  
== Running in manual mode ==
+
== Running on multiple nodes ==
With the same phil file, you can run prime manually. This gives you more freedom in terms of parameter settings at different stages (generating reference set, post-refining images, and merging) or at different cycle of post-refinement.
+
For LCLS users (or other users with LSF bsub), you can use the psana queuing system (or your own) to parallelize the entire process. For example, to run your job on 100 nodes using psanaq, you can specify:
 
+
'''Example A''': I want to generate a reference set then post-refine all the images on the '''scale factors only''' for '''three cycles''' then refine '''all parameters''' in the '''4th cycle'''. To do this, you can follow these steps:
+
 
+
To generate a reference set,
+
 
<pre>
 
<pre>
prime.genref prime.phil
+
queue {
</pre>
+
   mode = bsub
 
+
   qname = psanaq
To post-refine on scale factors only, modify your .phil file so that all parameters are turned ''off''.
+
   n_nodes = 100
<pre>
+
...
+
scale {
+
   d_min = 2.5
+
   d_max = 45
+
   sigma_min = 1.5
+
 
}
 
}
postref {
+
timeout_seconds = 300
  residual_threshold = 5
+
  residual_threshold_xy = 5
+
  scale {
+
    d_min = 2.5
+
    d_max = 45
+
    sigma_min = 1.5
+
    partiality_min = 0.1
+
  }
+
  crystal_orientation {
+
    flag_on = False
+
    d_min = 2.5
+
    d_max = 45
+
    sigma_min = 1.5
+
    partiality_min = 0.1
+
  }
+
  reflecting_range {
+
    flag_on = False
+
    d_min = 2.5
+
    d_max = 45
+
    sigma_min = 1.5
+
    partiality_min = 0.1
+
  }
+
  unit_cell {
+
    flag_on = False
+
    d_min = 2.5
+
    d_max = 45
+
    sigma_min = 1.5
+
    partiality_min = 0.1
+
    uc_tolerance = 5
+
  }
+
  allparams {
+
    flag_on = False
+
    d_min = 2.5
+
    d_max = 45
+
    sigma_min = 1.5
+
    partiality_min = 0.1
+
    uc_tolerance = 5
+
  }
+
}
+
...
+
n_postref_cycle = 3
+
...
+
</pre>
+
Then run,
+
<pre>
+
prime.postrefine prime.phil
+
</pre>
+
To refine all parameters one more cycle, update your .phil file again
+
<pre>
+
...
+
  allparams {
+
    flag_on = '''True'''
+
    d_min = 2.5
+
    d_max = 45
+
    sigma_min = 1.5
+
    partiality_min = 0.1
+
    uc_tolerance = 5
+
  }
+
}
+
...
+
n_postref_cycle = 1
+
...
+
</pre>
+
Then run,
+
<pre>
+
prime.postrefine prime.phil
+
</pre>
+
To obtain the final merged mtz, run
+
<pre>
+
prime.merge prime.phil
+
 
</pre>
 
</pre>
 +
Prime will divide all the images into 100 batches and submit them to different nodes. It waits until the images in every batch are done before proceeding to the merging step (or the exit step in manual mode). Use the timeout_seconds parameter to tell prime how long to wait for all the image batches to finish. Usually the timeout is not reached (all images should return within 300 seconds), but if you need to wait longer or shorter, you can modify this parameter.

Latest revision as of 19:25, 14 February 2017

Prime: post-refinement and merging

With the latest update, prime can be used to process data on multiple nodes (on a queuing system). At the moment, only LSF (bsub) is supported. See documentation below for more information on how to use the queuing system.

This major update replaces prime.postrefine with

prime.run 

Step-by-step guidelines to post-refine and merge XFEL diffraction images. For more detail and citation, see "Enabling X-ray Free Electron Laser Crystallography for Challenging Biological Systems from a Limited Number of Crystals" "DOI: http://dx.doi.org/10.7554/eLife.05421"

Prime is gui-ed

Thanks to Dr. Lyubimov, PRIME is also available as a graphical user interface program. Try it by running

prime

Click to see "PRIME main gui" and "Advanced options"

Getting started

Generating input phil file
Like most programs developed under the cctbx framework, prime reads an input .phil file, which stores all the parameters needed to run the post-refinement and merging steps. To generate a template .phil file, do a dry run by calling

$ prime.run

An example of the template .phil file:

data = None
run_no = None
title = None
scale {
  d_min = 0.1
  d_max = 99
  sigma_min = 1.5
}
...

You can save the content of the output to any file name - in this tutorial, let's save it to thermolysin.phil.

First look at your phil file
To run prime, set the required parameters to match your experiment (you can leave the other parameters at their default values, or just delete them from your .phil file). The most interesting parameters are shown below:

data = /path/to/your/integration/result/pickle_files
run_no = 001
title = First trial for thermolysin
scale {
  d_min = 2.1
  d_max = 45
  sigma_min = 1.5
}
postref {
  scale {
    d_min = 2.1
    d_max = 45
    sigma_min = 1.5
    partiality_min = 0.1
  }
  crystal_orientation {
    flag_on = True
    d_min = 2.1
    d_max = 45
    sigma_min = 1.5
    partiality_min = 0.1
  }
  reflecting_range {
    flag_on = True
    d_min = 2.1
    d_max = 45
    sigma_min = 1.5
    partiality_min = 0.1
  }
  unit_cell {
    flag_on = True
    d_min = 2.1
    d_max = 45
    sigma_min = 1.5
    partiality_min = 0.1
    uc_tolerance = 3
  }
  allparams {
    flag_on = False
    d_min = 2.1
    d_max = 45
    sigma_min = 1.5
    partiality_min = 0.1
    uc_tolerance = 3
  }
}
merge {
  d_min = 2.1
  d_max = 45
  sigma_min = 1.5
  partiality_min = 0.1
  uc_tolerance = 3
}
target_unit_cell = 93.99,93.99,130.87,90,90,120
target_space_group = P 61 2 2
pixel_size_mm = 0.102

Pay attention to d_min and d_max in the refinement and merging parameters. If you use IOTA to integrate the images, IOTA will output a .phil file for prime with the optimal resolution range. If not, a few trial-and-error runs may be needed to find the best resolution range for your dataset. Use the merging statistics output by prime, checking the values of CC1/2 and I/sigI, to find your optimal resolution range.
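
As a rough illustration of how per-bin statistics can guide the choice of d_min, here is a minimal Python sketch. The helper suggest_d_min and both thresholds are hypothetical and are not part of prime; the numbers are copied from a few rows of the example merging statistics shown later on this page.

```python
# Hypothetical helper, not part of prime. The thresholds are illustrative
# assumptions; choose values appropriate for your own data.
def suggest_d_min(bins, cc_half_min=30.0, i_over_sig_min=2.0):
    """Scan bins from low to high resolution and return the high-resolution
    edge of the last bin before CC1/2 or <I/sigI> first drops below its
    threshold. Returns None if even the first bin fails."""
    d_min = None
    for d_high, cc_half, i_over_sig in bins:
        if cc_half < cc_half_min or i_over_sig < i_over_sig_min:
            break
        d_min = d_high
    return d_min


# (high-resolution edge in Angstrom, CC1/2 in %, <I/sigI>) for bins 12 to 17
# of the example table.
bins = [
    (2.49, 45.37, 9.45),
    (2.42, 34.20, 8.93),
    (2.37, 37.15, 9.27),
    (2.31, 43.21, 9.60),
    (2.26, 26.89, 10.29),
    (2.22, 31.28, 9.87),
]

print(suggest_d_min(bins))
```

Because CC1/2 is noisy in sparsely populated bins, treat a value obtained this way as a starting point for the trial-and-error runs rather than a final answer.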

Cell parameters (target_unit_cell and target_space_group) are required to run prime. The target unit cell is used to remove outlier images, controlled by the uc_tolerance parameter (by default, a difference of up to 3% is tolerated). The space group is used both in removing outliers and in merging with the given symmetry.
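
The idea behind a percentage tolerance on the unit cell can be sketched in a few lines of Python. This is only an illustration: the function within_uc_tolerance is hypothetical, and it assumes every cell parameter (lengths and angles) is compared against the target, which may not be exactly how prime applies uc_tolerance.

```python
# Hypothetical illustration, not prime's own code. Assumes all six cell
# parameters are compared against the target with the same percentage.
def within_uc_tolerance(cell, target_cell, uc_tolerance=3.0):
    """Return True if every cell parameter differs from the target by at
    most uc_tolerance percent."""
    return all(
        abs(value - target) <= target * uc_tolerance / 100.0
        for value, target in zip(cell, target_cell)
    )


# target_unit_cell from the example .phil file above
target = (93.99, 93.99, 130.87, 90.0, 90.0, 120.0)

close_cell = (94.5, 94.5, 131.2, 90.0, 90.0, 120.0)
outlier_cell = (99.0, 93.99, 130.87, 90.0, 90.0, 120.0)

print(within_uc_tolerance(close_cell, target))
print(within_uc_tolerance(outlier_cell, target))
```

Here the first cell is kept, while the second is rejected because its a axis differs from the target by more than 3%.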

Also, don't forget to set your pixel size in millimeters. Check which detector you used and note its pixel size.

Running post-refinement
Once you have the input .phil file, you can run prime by calling

prime.run thermolysin.phil

Prime will post-refine and merge the reflection sets using three macrocycles (the default). At the end of the run, the merging statistics for the last cycle are reported; statistics for all other cycles are available in log.txt.

An example of merging statistics:

Summary for 001/postref_cycle_1_merge.mtz
Bin Resolution Range     Completeness      <N_obs>  |Rsplit  CC1/2  N_ind |CCanom   N_ind| <I/sigI>   <I>
-------------------------------------------------------------------------------------------------------------
02    5.70 -    4.52 100.0   1055 /   1055   65.89   16.02   89.15   1055    0.00      0    20.17    2101.97
03    4.52 -    3.95 100.0   1032 /   1032   61.53   14.48   92.03   1032    0.00      0    20.39    2529.90
04    3.95 -    3.59 100.0   1016 /   1016   54.15   15.61   90.13   1016    0.00      0    16.69    1971.43
05    3.59 -    3.33 100.0   1004 /   1004   42.67   17.66   89.23   1004    0.00      0    14.21    1502.14
06    3.33 -    3.14 100.0   1013 /   1013   32.77   20.40   84.26   1013    0.00      0    11.76    1077.60
07    3.14 -    2.98 100.0    995 /    995   27.36   23.00   78.72    995    0.00      0    11.58     935.37
08    2.98 -    2.85 100.0   1006 /   1006   23.57   22.63   82.26   1006    0.00      0    10.56     722.62
09    2.85 -    2.74 100.0    986 /    986   16.64   28.51   72.90    985    0.00      0    10.01     591.56
10    2.74 -    2.65  99.9    989 /    990   12.41   31.35   72.95    987    0.00      0     9.91     515.07
11    2.65 -    2.56  99.7    979 /    982    9.35   37.14   65.31    970    0.00      0     9.31     438.96
12    2.56 -    2.49  98.0    979 /    999    6.06   45.98   45.37    930    0.00      0     9.45     390.05
13    2.49 -    2.42  95.1    931 /    979    4.46   50.68   34.20    834    0.00      0     8.93     334.80
14    2.42 -    2.37  91.7    896 /    977    3.35   55.66   37.15    729    0.00      0     9.27     320.17
15    2.37 -    2.31  83.9    829 /    988    2.61   56.92   43.21    600    0.00      0     9.60     296.67
16    2.31 -    2.26  72.4    702 /    969    1.97   65.81   26.89    386    0.00      0    10.29     284.39
17    2.26 -    2.22  59.1    582 /    985    1.75   64.72   31.28    275    0.00      0     9.87     284.06
18    2.22 -    2.18  52.9    513 /    970    1.51   71.27   16.86    188    0.00      0     8.93     215.31
19    2.18 -    2.14  35.7    349 /    978    1.32   62.26   68.25     90    0.00      0     8.22     199.09
20    2.14 -    2.10  23.1    227 /    981    1.20   92.14   -9.20     42    0.00      0     8.59     224.44
-------------------------------------------------------------------------------------------------------------
        TOTAL         85.9  17224 /  20046   27.11   21.11   92.07  15305    0.00      0    12.87     999.53
-------------------------------------------------------------------------------------------------------------

Summary of refinement and merging
 No. good frames:                  1809
 No. bad cc frames:                 153
 No. bad G frames) :                  0
 No. bad unit cell frames:            5
 No. bad gamma_e frames:              0
 No. bad SE:                          0
 No. observations:               466997

Solving indexing ambiguity (New)

* SOFTWARE UPDATE REQUIRED *

With the latest version (Aug 31, 2016), you can solve the indexing ambiguity problem directly in prime. The Brehm & Diederichs algorithms ("doi:10.1107/S1399004713025431") have been implemented with bootstrap capability to handle large datasets.
Merohedral Twinning
For merohedral twinning (27 space groups e.g. P6), the indexing choices will be determined automatically in prime. Use this default setting in your .phil file,

indexing_ambiguity {
  mode = Auto
  index_basis_in = None
  assigned_basis = None
  n_sample_frames = 300
  n_selected_frames = 100
}

The n_sample_frames parameter sets the number of images used to calculate the scoring function. After that, only n_selected_frames images are used in the B&D algorithms. This saves a lot of computing time, since only the selected images are used to determine the ambiguity. You can change these two parameters to suit your experiment. The default values are 300 and 100 (sample 300, use 100).

Pseudo-Merohedral Twinning
For pseudo-merohedral twinning, there are several possible indexing choices, so prime doesn't determine them automatically. If you suspect pseudo-merohedral twinning (b and c are similar, and the beta angle is almost but not quite 90 degrees), you can force prime to resolve the ambiguity using the choices you supply.

indexing_ambiguity {
  mode = Forced
  index_basis_in = None
  assigned_basis = -h,l,k
  assigned_basis = -k, l, h
  n_sample_frames = 300
  n_selected_frames = 100
}

When you set indexing_ambiguity.mode to Forced, you can assign indexing choices to suit your problem. In this example, two additional choices (-h, l, k and -k, l, h) were assigned as candidate indexing choices.
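
To make the meaning of an assigned_basis entry concrete, the following Python sketch applies such an operator to a Miller index. The functions parse_basis and reindex are hypothetical helpers written for this page; they only handle signed permutations of h, k and l, and they are not how prime or cctbx implement reindexing.

```python
# Hypothetical helpers for illustration only. They support operators that
# are signed permutations of h, k, l (e.g. "-h,l,k"), nothing more general.
def parse_basis(op):
    """Turn an operator string such as '-h,l,k' into (sign, source_index) pairs."""
    axes = {"h": 0, "k": 1, "l": 2}
    terms = []
    for term in op.replace(" ", "").split(","):
        sign = -1 if term.startswith("-") else 1
        terms.append((sign, axes[term.lstrip("+-")]))
    if len(terms) != 3:
        raise ValueError("expected three comma-separated terms: %r" % op)
    return terms


def reindex(hkl, op):
    """Apply the reindexing operator op to the Miller index hkl."""
    return tuple(sign * hkl[i] for sign, i in parse_basis(op))


print(reindex((1, 2, 3), "-h,l,k"))
print(reindex((1, 2, 3), "-k, l, h"))
```

With the two operators from the example above, the reflection (1, 2, 3) is relabelled as (-1, 3, 2) and (-2, 3, 1) respectively.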

Reusing the solution
At the end of the run, your solution pickle is saved to your_run_no/index_ambiguity/solution_pickle.pickle. If you don't want to spend time solving the ambiguity again in the next run, you can reuse this solution pickle by setting these parameters:

indexing_ambiguity {
  mode = Auto
  index_basis_in = your_run_no/index_ambiguity/solution_pickle.pickle
}

This will bypass the indexing ambiguity module. Prime will use the solution file to perform normal post-refinement and merging.

Using an external reference set
To use another isomorphous dataset (e.g. from a synchrotron experiment) as a reference set to solve the ambiguity, you can specify an mtz file as part of these parameters:

indexing_ambiguity {
  mode = Auto
  index_basis_in = path/to/your/mtz/file.mtz
}

Again, you can choose to do Auto or Forced (with a list of assigned_basis parameters) depending on your problem.

More detail with input parameters

Now that you have your first trial merged data set, you can explore different parameter settings to merge or to obtain the Bijvoet pairs (I+/I-) for your anomalous data set.

Anomalous data:

target_anomalous_flag = True

In the last cycle, prime will output a reflection set with I+ and I-.

Number of micro- and macrocycles

n_postref_cycle = 3
n_postref_sub_cycle = 1

Number of bins for merging statistics

n_bins = 20

Help with input parameters

Most input parameters are self-explanatory. However, you can use the -h switch to view help information for each parameter.

prime.run -h

Running on multiple nodes

For LCLS users (or other users with LSF bsub), you can use the psana queuing system (or your own) to parallelize the entire process. For example, to run your job on 100 nodes using psanaq, you can specify:

queue {
  mode = bsub
  qname = psanaq
  n_nodes = 100
}
timeout_seconds = 300

Prime will divide all the images into 100 batches and submit them to different nodes. It waits until the images in every batch are done before proceeding to the merging step (or the exit step in manual mode). Use the timeout_seconds parameter to tell prime how long to wait for all the image batches to finish. Usually the timeout is not reached (all images should return within 300 seconds), but if you need to wait longer or shorter, you can modify this parameter.
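
The batching step can be pictured with a short Python sketch. The function split_into_batches and the file names are hypothetical, and the sketch assumes contiguous, near-equal batches; prime's own batching and job submission may differ.

```python
# Hypothetical illustration of dividing images into n_nodes batches.
# This is not prime's code and it does not submit anything to a queue.
def split_into_batches(items, n_batches):
    """Split items into n_batches contiguous batches whose sizes differ
    by at most one."""
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    base, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        size = base + (1 if i < extra else 0)
        batches.append(items[start:start + size])
        start += size
    return batches


# Made-up file names standing in for integration result pickles.
frames = ["int_%05d.pickle" % i for i in range(1809)]
batches = split_into_batches(frames, 100)

print(len(batches), min(len(b) for b in batches), max(len(b) for b in batches))
```

With 1809 images and n_nodes = 100, each batch in this sketch holds 18 or 19 images, so the time to finish is governed by the slowest batch.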