Cctbx.prime

From cctbx_xfel

Latest revision as of 19:25, 14 February 2017

Prime: post-refinement and merging

With the latest update, prime can be used to process data on multiple nodes (on a queuing system). At the moment, only LSF (bsub) is supported. See documentation below for more information on how to use the queuing system.

This major update replaces prime.postrefine with

prime.run 

Step-by-step guidelines to post-refine and merge XFEL diffraction images. For more detail and the citation, see "Enabling X-ray Free Electron Laser Crystallography for Challenging Biological Systems from a Limited Number of Crystals" (DOI: http://dx.doi.org/10.7554/eLife.05421).

Prime is gui-ed

Thanks to Dr. Lyubimov, PRIME is also available as a graphical user interface (GUI) program. Try it by running

prime

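Screenshots: "PRIME main gui" (https://commons.wikimedia.org/wiki/File:PRIME_main.png) and "Advanced options" (https://commons.wikimedia.org/wiki/File:PRIME_advanced_options.png).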

Getting started

Generating input phil file
Like most programs developed under the cctbx framework, prime reads an input .phil file, which stores all the parameters needed to run the post-refinement and merging steps. To generate a template .phil file, do a dry run by calling

$ prime.run

An example of the template .phil file:

data = None
run_no = None
title = None
scale {
  d_min = 0.1
  d_max = 99
  sigma_min = 1.5
}
...

You can save the output under any file name; in this tutorial, let's save it as thermolysin.phil.

First look at your phil file
To run prime, set the required parameters to match your experiment. You can leave the other parameters at their default values, or simply delete them from your .phil file. The most relevant parameters are shown below:

data = /path/to/your/integration/result/pickle_files
run_no = 001
title = First trial for thermolysin
scale {
  d_min = 2.1
  d_max = 45
  sigma_min = 1.5
}
postref {
  scale {
    d_min = 2.1
    d_max = 45
    sigma_min = 1.5
    partiality_min = 0.1
  }
  crystal_orientation {
    flag_on = True
    d_min = 2.1
    d_max = 45
    sigma_min = 1.5
    partiality_min = 0.1
  }
  reflecting_range {
    flag_on = True
    d_min = 2.1
    d_max = 45
    sigma_min = 1.5
    partiality_min = 0.1
  }
  unit_cell {
    flag_on = True
    d_min = 2.1
    d_max = 45
    sigma_min = 1.5
    partiality_min = 0.1
    uc_tolerance = 3
  }
  allparams {
    flag_on = False
    d_min = 2.1
    d_max = 45
    sigma_min = 1.5
    partiality_min = 0.1
    uc_tolerance = 3
  }
}
merge {
  d_min = 2.1
  d_max = 45
  sigma_min = 1.5
  partiality_min = 0.1
  uc_tolerance = 3
}
target_unit_cell = 93.99,93.99,130.87,90,90,120
target_space_group = P 61 2 2
pixel_size_mm = 0.102

Pay attention to d_min and d_max in the refinement and merging parameters. If you use IOTA to integrate the images, IOTA writes a .phil file for prime that already has the optimal resolution range. If not, a few trial-and-error runs may be needed to find the best resolution range for your dataset: use the merging statistics output by prime and check the values of CC1/2 and I/sigI.
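
Because d_min and d_max recur in every scope of the .phil file, a trial-and-error scan over resolution ranges means changing many lines at once. The following is a minimal sketch of one way to do that with plain text substitution. The helper name set_resolution_range is hypothetical and is not part of prime, and the sketch assumes that every d_min and d_max line should receive the same value, as in the example file.

```python
import re

def set_resolution_range(phil_text, d_min, d_max):
    """Return phil_text with every 'd_min = ...' and 'd_max = ...' line updated."""
    def substitute(name, value, text):
        # (?m) lets ^ and $ match at every line; the indentation is preserved.
        pattern = r"(?m)^([ \t]*" + name + r"[ \t]*=[ \t]*).*$"
        return re.sub(pattern, lambda m: m.group(1) + str(value), text)

    text = substitute("d_min", d_min, phil_text)
    return substitute("d_max", d_max, text)

# A shortened template, for illustration only.
template = """scale {
  d_min = 0.1
  d_max = 99
  sigma_min = 1.5
}
merge {
  d_min = 0.1
  d_max = 99
}
"""

print(set_resolution_range(template, 2.1, 45))
```

Other parameters such as sigma_min are left untouched, so the same template can be reused for each trial range.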

The cell parameters (target_unit_cell and target_space_group) are required to run prime. The target unit cell is used, together with the uc_tolerance parameter, to reject outlier images (the default tolerance is a 3% difference). The space group is used both in removing outliers and in merging with the given symmetry.
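
To make the tolerance concrete, here is a minimal sketch of a percentage-based unit-cell filter. It assumes that each cell parameter is compared with its target value independently and that an image is rejected when any parameter deviates by more than uc_tolerance percent. The exact criterion prime applies is not spelled out here, so treat the function (within_uc_tolerance, a hypothetical name) as an illustration only.

```python
def within_uc_tolerance(observed_cell, target_cell, uc_tolerance=3.0):
    """True if every cell parameter is within uc_tolerance percent of its target."""
    for observed, target in zip(observed_cell, target_cell):
        percent_diff = abs(observed - target) / target * 100.0
        if percent_diff > uc_tolerance:
            return False
    return True

# Target cell from the example .phil file (a, b, c, alpha, beta, gamma).
target = (93.99, 93.99, 130.87, 90.0, 90.0, 120.0)

# Hypothetical per-image cells, made up for illustration.
close_cell = (94.50, 94.50, 131.50, 90.0, 90.0, 120.0)
far_cell = (99.00, 99.00, 131.00, 90.0, 90.0, 120.0)

print(within_uc_tolerance(close_cell, target))  # a and b differ by about 0.5%
print(within_uc_tolerance(far_cell, target))    # a and b differ by about 5.3%
```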

Also remember to set pixel_size_mm, the pixel size in millimeters, to match your detector.

Running post-refinement
Once you have the input .phil file, you can run prime by calling

prime.run thermolysin.phil

Prime will post-refine and merge the reflection sets using three macrocycles (the default). At the end of the run, the merging statistics for the last cycle are reported; the statistics for all other cycles are also available in log.txt.

An example of merging statistics:

Summary for 001/postref_cycle_1_merge.mtz
Bin Resolution Range     Completeness      <N_obs>  |Rsplit  CC1/2  N_ind |CCanom   N_ind| <I/sigI>   <I>
-------------------------------------------------------------------------------------------------------------
02    5.70 -    4.52 100.0   1055 /   1055   65.89   16.02   89.15   1055    0.00      0    20.17    2101.97
03    4.52 -    3.95 100.0   1032 /   1032   61.53   14.48   92.03   1032    0.00      0    20.39    2529.90
04    3.95 -    3.59 100.0   1016 /   1016   54.15   15.61   90.13   1016    0.00      0    16.69    1971.43
05    3.59 -    3.33 100.0   1004 /   1004   42.67   17.66   89.23   1004    0.00      0    14.21    1502.14
06    3.33 -    3.14 100.0   1013 /   1013   32.77   20.40   84.26   1013    0.00      0    11.76    1077.60
07    3.14 -    2.98 100.0    995 /    995   27.36   23.00   78.72    995    0.00      0    11.58     935.37
08    2.98 -    2.85 100.0   1006 /   1006   23.57   22.63   82.26   1006    0.00      0    10.56     722.62
09    2.85 -    2.74 100.0    986 /    986   16.64   28.51   72.90    985    0.00      0    10.01     591.56
10    2.74 -    2.65  99.9    989 /    990   12.41   31.35   72.95    987    0.00      0     9.91     515.07
11    2.65 -    2.56  99.7    979 /    982    9.35   37.14   65.31    970    0.00      0     9.31     438.96
12    2.56 -    2.49  98.0    979 /    999    6.06   45.98   45.37    930    0.00      0     9.45     390.05
13    2.49 -    2.42  95.1    931 /    979    4.46   50.68   34.20    834    0.00      0     8.93     334.80
14    2.42 -    2.37  91.7    896 /    977    3.35   55.66   37.15    729    0.00      0     9.27     320.17
15    2.37 -    2.31  83.9    829 /    988    2.61   56.92   43.21    600    0.00      0     9.60     296.67
16    2.31 -    2.26  72.4    702 /    969    1.97   65.81   26.89    386    0.00      0    10.29     284.39
17    2.26 -    2.22  59.1    582 /    985    1.75   64.72   31.28    275    0.00      0     9.87     284.06
18    2.22 -    2.18  52.9    513 /    970    1.51   71.27   16.86    188    0.00      0     8.93     215.31
19    2.18 -    2.14  35.7    349 /    978    1.32   62.26   68.25     90    0.00      0     8.22     199.09
20    2.14 -    2.10  23.1    227 /    981    1.20   92.14   -9.20     42    0.00      0     8.59     224.44
-------------------------------------------------------------------------------------------------------------
        TOTAL         85.9  17224 /  20046   27.11   21.11   92.07  15305    0.00      0    12.87     999.53
-------------------------------------------------------------------------------------------------------------

Summary of refinement and merging
 No. good frames:                  1809
 No. bad cc frames:                 153
 No. bad G frames) :                  0
 No. bad unit cell frames:            5
 No. bad gamma_e frames:              0
 No. bad SE:                          0
 No. observations:               466997
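
When comparing several trial runs for a resolution cutoff, it can help to read the per-bin table programmatically. The sketch below parses rows in the layout shown above and reports the high-resolution limit of the last bin reached before CC1/2 or completeness first falls below a threshold. The function names and the thresholds (CC1/2 of at least 50%, completeness of at least 90%) are assumptions chosen for illustration, not recommendations from prime; the sample rows are copied from the table above.

```python
def parse_bin_row(line):
    """Parse one per-bin row of the merging statistics table into a dict."""
    tokens = line.split()
    # Layout: bin, d_max, '-', d_min, completeness, n_obs, '/', n_possible,
    #         <N_obs>, Rsplit, CC1/2, N_ind, CCanom, N_ind, <I/sigI>, <I>
    return {
        "bin": int(tokens[0]),
        "d_max": float(tokens[1]),
        "d_min": float(tokens[3]),
        "completeness": float(tokens[4]),
        "cc_half": float(tokens[10]),
        "i_over_sigi": float(tokens[14]),
    }

def suggest_d_min(rows, min_cc_half=50.0, min_completeness=90.0):
    """Return d_min of the last bin before either threshold is first violated."""
    d_min = None
    for row in rows:
        if row["cc_half"] < min_cc_half or row["completeness"] < min_completeness:
            break
        d_min = row["d_min"]
    return d_min

table = """\
10    2.74 -    2.65  99.9    989 /    990   12.41   31.35   72.95    987    0.00      0     9.91     515.07
11    2.65 -    2.56  99.7    979 /    982    9.35   37.14   65.31    970    0.00      0     9.31     438.96
12    2.56 -    2.49  98.0    979 /    999    6.06   45.98   45.37    930    0.00      0     9.45     390.05
13    2.49 -    2.42  95.1    931 /    979    4.46   50.68   34.20    834    0.00      0     8.93     334.80
"""

rows = [parse_bin_row(line) for line in table.splitlines() if line.strip()]
print(suggest_d_min(rows))
```

With these sample rows and thresholds the scan stops at bin 12, where CC1/2 drops to 45.37, so the suggested limit is the d_min of bin 11.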

Solving indexing ambiguity (New)

* SOFTWARE UPDATE REQUIRED *

With the latest version (Aug 31, 2016), you can solve the indexing ambiguity problem directly in prime. The Brehm & Diederichs algorithms (doi:10.1107/S1399004713025431) have been implemented with a bootstrap capability to handle large datasets.

Merohedral Twinning
For merohedral twinning (27 space groups, e.g. P6), the indexing choices are determined automatically in prime. Use this default setting in your .phil file:

indexing_ambiguity {
  mode = Auto
  index_basis_in = None
  assigned_basis = None
  n_sample_frames = 300
  n_selected_frames = 100
}

The n_sample_frames parameter sets the number of images used to calculate the scoring function. After that, only n_selected_frames images are used in the Brehm & Diederichs (B&D) algorithms. This saves a lot of computing time, since only the selected images are used to determine the ambiguity. You can change these two parameters to suit your experiment. The default values are 300 and 100 (sample 300, use 100).
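
The two-stage idea (score a sample, then keep a subset) can be sketched as follows. How prime draws the sample, what its scoring function computes, and how it picks the selected frames are not described here, so this sketch assumes a random draw, takes the scores as given, and keeps the highest-scoring frames. The function pick_frames and the frame names are hypothetical.

```python
import random

def pick_frames(frame_scores, n_sample_frames=300, n_selected_frames=100, seed=0):
    """Sample up to n_sample_frames frames, then keep the best n_selected_frames."""
    rng = random.Random(seed)
    names = sorted(frame_scores)  # sort first so the draw is reproducible
    sampled = rng.sample(names, min(n_sample_frames, len(names)))
    # Assumption: a higher score means a better frame.
    ranked = sorted(sampled, key=lambda name: frame_scores[name], reverse=True)
    return ranked[:n_selected_frames]

# Hypothetical scores for 1000 frames; a real run would derive them from the data.
score_rng = random.Random(1)
scores = {"frame_%04d" % i: score_rng.random() for i in range(1000)}

selected = pick_frames(scores)
print(len(selected))
```

Only the selected frames would then go on to the more expensive ambiguity determination, which is where the time saving comes from.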

Pseudo-Merohedral Twinning
For pseudo-merohedral twinning, there are different possibilities for the indexing choice, so prime does not determine the choices automatically. If you suspect pseudo-merohedral twinning (b and c are similar, and the beta angle is almost but not quite 90 degrees), you have the option to force prime to resolve the ambiguity using the choices you supply.

indexing_ambiguity {
  mode = Forced
  index_basis_in = None
  assigned_basis = -h,l,k
  assigned_basis = -k, l, h
  n_sample_frames = 300
  n_selected_frames = 100
}

When you set indexing_ambiguity.mode to Forced, you can assign indexing choices according to your problem. In this example, two additional choices (-h,l,k and -k,l,h) are assigned as indexing choices.
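
To see what an assigned_basis entry means for a single reflection, the sketch below applies an operator string to a Miller index (h, k, l). It is a plain-Python illustration, not prime's own code: the function name apply_basis is hypothetical, it handles only single-letter terms with an optional sign, and it assumes the operator maps old indices to new ones term by term.

```python
def apply_basis(basis, hkl):
    """Apply a reindexing operator such as '-h,l,k' to a Miller index (h, k, l)."""
    values = dict(zip("hkl", hkl))
    reindexed = []
    for term in basis.split(","):
        term = term.strip()
        sign = -1 if term.startswith("-") else 1
        reindexed.append(sign * values[term.lstrip("+-")])
    return tuple(reindexed)

print(apply_basis("-h,l,k", (1, 2, 3)))    # (-1, 3, 2)
print(apply_basis("-k, l, h", (1, 2, 3)))  # (-2, 3, 1)
```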

Reusing the solution
At the end of the run, your solution pickle is saved to your_run_no/index_ambiguity/solution_pickle.pickle. If you don't want to spend time solving the ambiguity again in the next run, you can reuse this solution pickle by setting these parameters:

indexing_ambiguity {
  mode = Auto
  index_basis_in = your_run_no/index_ambiguity/solution_pickle.pickle
}

This will bypass the indexing ambiguity module. Prime will use the solution file to perform normal post-refinement and merging.

Using an external reference set
To use another isomorphous dataset (e.g. from a synchrotron experiment) as a reference set to solve the ambiguity, you can specify an mtz file as part of these parameters:

indexing_ambiguity {
  mode = Auto
  index_basis_in = path/to/your/mtz/file.mtz
}

Again, you can choose Auto or Forced mode (with a list of assigned_basis parameters), depending on your problem.

More detail on input parameters

Now that you have your first trial merged data set, you can explore different parameter settings to merge or to obtain the Bijvoet pairs (I+/I-) for your anomalous data set.

Anomalous data:

target_anomalous_flag = True

In the last cycle, prime will output a reflection set with I+ and I-.

Number of micro- and macrocycles

n_postref_cycle = 3
n_postref_sub_cycle = 1

Number of bins for merging statistics

n_bins = 20

Help with input parameters

Most input parameters are self-explanatory. However, you can use the -h switch to view help information for each parameter.

prime.run -h

Running on multiple nodes

For LCLS users (or other users with LSF bsub), you can use the psana (or your own) queuing system to parallelize the entire process. For example, to run your job on 100 nodes using psanaq, you can specify:

queue {
  mode = bsub
  qname = psanaq
  n_nodes = 100
}
timeout_seconds = 300

Prime will divide the images into 100 batches and submit them to different nodes. It waits until all images in every batch are done before returning to the merging step (or the exit step in manual mode). The timeout_seconds parameter tells prime how long to wait for all the image batches to finish. Usually the timeout is not reached (all images should return within 300 seconds), but you can change it if you need to wait longer or shorter.
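
As a rough picture of the batching step, the sketch below splits a list of image files into n_nodes batches. How prime actually assigns images to batches is not stated here; the sketch assumes contiguous batches of nearly equal size, and the function name, file names, and frame count are hypothetical.

```python
def split_into_batches(items, n_nodes):
    """Split items into n_nodes contiguous batches whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_nodes)
    batches = []
    start = 0
    for i in range(n_nodes):
        # The first `extra` batches take one additional item each.
        size = base + (1 if i < extra else 0)
        batches.append(items[start:start + size])
        start += size
    return batches

frames = ["frame_%04d.pickle" % i for i in range(1967)]  # hypothetical file names
batches = split_into_batches(frames, 100)
print(len(batches), len(batches[0]), len(batches[-1]))
```

With 1967 frames and 100 nodes, the first 67 batches hold 20 frames and the remaining 33 hold 19, so no node receives more than one frame above the average.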