To log in from a Bash prompt, run
ssh <username>@cabernet.genomecenter.ucdavis.edu
Today, this welcomed me when I logged in:
________________________________________
/ Cabernet currently consists of 27 nodes \
| with 808 cores total. Type sinfo for |
\ more options. /
----------------------------------------- . .
______________________________________________ . .
/\ ___ _ \ . .
/ : / (_) | | \ . .
| : | __, | | _ ,_ _ _ _ _|_ \________ ___
| : | / | |/ \_|/ / | / |/ | |/ | | |
| : \___/\_/|_/\_/ |__/ |_/ | |_/|__/|_/ ________|___|
| : .genomecenter.ucdavis.edu /
\ ; /
\/_____________________________________________/
First, write a script that runs your Rosetta protocol. In this case, we want to relax an input crystal structure into Rosetta's energy function with the relax app, producing 1000 output structures in parallel.
Put the following in a file called sub.sh:
#!/bin/bash
#
#SBATCH --job-name=relax
#SBATCH --output=log.txt
#SBATCH --array=1-1000
module load rosetta
relax.linuxgccrelease @flags -suffix $SLURM_ARRAY_TASK_ID
The -suffix $SLURM_ARRAY_TASK_ID option tells Rosetta which array task it's running in, so it can name output files uniquely.
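To see what each task ends up running, here is a hypothetical dry run in plain Bash that just echoes the command each array task would execute (no Rosetta is invoked; a three-task array stands in for the 1000-task one above):

```shell
#!/bin/bash
# Dry-run sketch: simulate the commands a 3-task array would run.
# SLURM normally sets SLURM_ARRAY_TASK_ID; here we loop over it by hand.
for SLURM_ARRAY_TASK_ID in 1 2 3; do
    echo "relax.linuxgccrelease @flags -suffix $SLURM_ARRAY_TASK_ID"
done
```

Each task sees a different value of the variable, which is why a single script can drive the whole array.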
To run relax on an apo protein structure, the flags file contains the following Rosetta flags:
-s <input PDB>
-constrain_relax_to_start_coords 1
-renumber_pdb 1
The -renumber_pdb flag renumbers all the residues in the structure starting from 1.
Change to the directory containing your sub.sh and input files, and submit your job with:
sbatch sub.sh
When you want to run the same protocol on multiple input structures, you can take an embarrassingly parallel approach by running all of the jobs concurrently rather than consecutively. This more complex setup adds a list of inputs in a file called list, and uses the SLURM_ARRAY_TASK_ID environment variable to select which input should be used from list. When you submit an array job, each task gets an integer ID (1, 2, ... n, as set by --array). We add a short Bash command to print a particular line of list, which contains the Rosetta flags for that run. The list, in this case specifying three different input PDB structures, looks like:
-s 3JYO.pdb
-s 3ORF.pdb
-s 1VLJ.pdb
For a parallel run, make a sub.sh like this:
#!/bin/bash
#
#SBATCH --job-name=parallel_relax
#SBATCH --output=log.txt
#SBATCH --array=1-3
S=$( head -${SLURM_ARRAY_TASK_ID} list | tail -1 )
module load rosetta
relax.linuxgccrelease @flags $S
The command head -${SLURM_ARRAY_TASK_ID} list | tail -1 returns the nth line of the file list, where n is equal to SLURM_ARRAY_TASK_ID.
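You can try this line-selection trick locally, outside of SLURM, by setting the variable yourself (here using the three-entry list from above):

```shell
#!/bin/bash
# Recreate the three-line list file from the example above.
printf -- '-s 3JYO.pdb\n-s 3ORF.pdb\n-s 1VLJ.pdb\n' > list

# Pretend we are array task 2; SLURM would set this for us.
SLURM_ARRAY_TASK_ID=2

# head keeps lines 1..n, tail keeps the last of those, i.e. line n.
S=$( head -${SLURM_ARRAY_TASK_ID} list | tail -1 )
echo "$S"    # -> -s 3ORF.pdb
```

Task 1 would get 3JYO.pdb, task 2 gets 3ORF.pdb, and task 3 gets 1VLJ.pdb, so each job runs on its own structure.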
Submit the run on Cabernet with
sbatch sub.sh
This approach also works well when you need to permute variables in your XML. For example, if we want to make a series of point mutations to the same protein, we can add special %%variable%% placeholders to our RosettaScripts XML protocol that get replaced at runtime.
Note the %%target%% and %%new_res%% variables in the MutateResidue mover declaration.
<ROSETTASCRIPTS>
    <SCOREFXNS>
        <myscore weights="talaris2013_cst.wts"/>
    </SCOREFXNS>
    <TASKOPERATIONS>
        <DetectProteinLigandInterface name="repack_sphere" design="0" cut1="8.0" cut2="10.0" cut3="12.0" cut4="14.0" catres_interface="1"/>
    </TASKOPERATIONS>
    <FILTERS>
        <EnzScore name="allcst" score_type="cstE" scorefxn="myscore" whole_pose="1" energy_cutoff="100"/>
    </FILTERS>
    <MOVERS>
        <MutateResidue name="mutate" target="%%target%%" new_res="%%new_res%%"/>
        <AddOrRemoveMatchCsts name="cstadd" cst_instruction="add_new" accept_blocks_missing_header="1" fail_on_constraints_missing="0"/>
        <PredesignPerturbMover name="predock"/>
        <EnzRepackMinimize name="cst_opt" cst_opt="1" minimize_rb="1" minimize_sc="1" minimize_bb="0" min_in_stages="0" minimize_lig="1"/>
        <EnzRepackMinimize name="repack_wbb" design="0" repack_only="1" scorefxn_minimize="myscore" scorefxn_repack="myscore" minimize_rb="1" minimize_sc="1" minimize_bb="1" minimize_lig="1" min_in_stages="0" backrub="0" task_operations="repack_sphere"/>
        <ParsedProtocol name="iterate">
            <Add mover="predock"/>
            <Add mover="cst_opt"/>
            <Add mover="repack_wbb"/>
        </ParsedProtocol>
        <GenericMonteCarlo name="monte_repack" mover_name="iterate" filter_name="allcst"/>
    </MOVERS>
    <APPLY_TO_POSE>
    </APPLY_TO_POSE>
    <PROTOCOLS>
        <Add mover="cstadd"/>
        <Add mover="mutate"/>
        <Add mover="monte_repack"/>
    </PROTOCOLS>
</ROSETTASCRIPTS>
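To get a feel for what the %%var%% placeholders do, here is a rough textual analogy using sed. This is only an illustration: Rosetta performs the substitution internally when you pass -parser:script_vars, not with sed.

```shell
#!/bin/bash
# Illustration only: emulate %%var%% substitution as plain text replacement.
line='<MutateResidue name=mutate target=%%target%% new_res=%%new_res%% />'
echo "$line" | sed -e 's/%%target%%/325/' -e 's/%%new_res%%/GLU/'
# -> <MutateResidue name=mutate target=325 new_res=GLU />
```

Passing `-parser:script_vars target=325 new_res=GLU` on the command line has the analogous effect inside Rosetta.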
Now, we can generate a list of the mutations we want in a file called list:
-suffix "_325GLU" -parser:script_vars target=325 new_res=GLU
-suffix "_220GLU" -parser:script_vars target=220 new_res=GLU
-suffix "_298GLU" -parser:script_vars target=298 new_res=GLU
-suffix "_294LEU" -parser:script_vars target=294 new_res=LEU
-suffix "_407TYR" -parser:script_vars target=407 new_res=TYR
-suffix "_315ARG" -parser:script_vars target=315 new_res=ARG
-suffix "_164ASP" -parser:script_vars target=164 new_res=ASP
-suffix "_166GLU" -parser:script_vars target=166 new_res=GLU
-suffix "_415ASN" -parser:script_vars target=415 new_res=ASN
-suffix "_227TRP" -parser:script_vars target=227 new_res=TRP
Note that I've added suffixes to the output structures so they are written out with unique names containing the mutation.
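If you have many mutations, typing the list by hand gets tedious. One way to generate it, sketched here with the first few positions and residues from the list above, is a short Bash loop:

```shell
#!/bin/bash
# Generate the list file from parallel arrays of positions and residues.
# These five entries are taken from the hand-written list above.
positions=(325 220 298 294 407)
residues=(GLU GLU GLU LEU TYR)

> list    # truncate any existing list
for i in "${!positions[@]}"; do
    p=${positions[$i]}
    r=${residues[$i]}
    echo "-suffix \"_${p}${r}\" -parser:script_vars target=${p} new_res=${r}" >> list
done
```

Adjust --array in sub.sh to match the number of lines in the generated file.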
The sub.sh is the same except for which binary we're calling:
#!/bin/bash
#
#SBATCH --job-name=mutants
#SBATCH --output=log.txt
#SBATCH --array=1-10
S=$( head -${SLURM_ARRAY_TASK_ID} list | tail -1 )
module load rosetta
rosetta_scripts.linuxgccrelease @flags $S
and I've used these flags:
# options
-s bglb.pdb
-out:path:all out
-parser::protocol protocol.xml
-extra_res_fa pNPG.params
-enzdes::cstfile pNPG.enzdes.cst
# packing
-packing::ex1
-packing::ex2
-packing::ex1aro:level 6
-packing::ex2aro
-packing::extrachi_cutoff 1
-packing::use_input_sc
-packing::flip_HNQ
-packing::no_optH false
-packing::optH_MCA false
# enzdes-specific
-score::weights talaris2013_cst
-jd2::enzdes_out
# memory
-run::preserve_header
-run:version
-nblist_autoupdate
-linmem_ig 10
-chemical:exclude_patches LowerDNA UpperDNA Cterm_amidation VirtualBB ShoveBB VirtualDNAPhosphate VirtualNTerm CTermConnect sc_orbitals pro_hydroxylated_case1 pro_hydroxylated_case2 ser_phosphorylated thr_phosphorylated tyr_phosphorylated tyr_sulfated lys_dimethylated lys_monomethylated lys_trimethylated lys_acetylated glu_carboxylated cys_acetylated tyr_diiodinated N_acetylated C_methylamidated MethylatedProteinCterm
Run with
sbatch sub.sh
Stop a running job and kill all associated processes with
scancel <jobid>
If you submit a job and you want to watch the output, do (note that sbatch options must come before the script name):
sbatch --output=log.txt sub.sh
tail -f log.txt
tail -f will follow the progress of the log file. Quit with ^C.
rsyncing tons of PDBs and log files back and forth to the cluster sucks. Running your data analysis with Jupyter notebooks on the cluster rocks. The recommended style is:
On Cabernet:
screen -d -m jupyter-notebook --no-browser --port 8889
then disconnect your session with exit. Note: we share the ports on Cabernet, so if 8889 is taken, try a number in the range 8000 to 9000.
On your machine:
ssh -N -f -L localhost:8888:localhost:8889 <user name>@cabernet.genomecenter.ucdavis.edu
and open localhost:8888 in your browser.
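Since ports on Cabernet are shared, you can probe for a free one before starting the notebook. A minimal sketch using Bash's /dev/tcp feature (a failed connection suggests nothing is listening, so that port is likely free):

```shell
#!/bin/bash
# Scan a range of ports and report the first one with no listener.
# A successful /dev/tcp connection means something is already using it.
for port in $(seq 8889 8999); do
    if ! (exec 3<>/dev/tcp/localhost/$port) 2>/dev/null; then
        echo "port $port looks free"
        break
    fi
done
```

Then pass the free port to jupyter-notebook's --port option and to the ssh tunnel.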
For those familiar with Sun Grid Engine (the scheduler used on Epiphany), the SLURM commands are very similar to SGE. However, the way they're configured is different.
SGE | SLURM |
---|---|
qsub sub.sh | sbatch sub.sh |
qsub -t 1-100 sub.sh | sbatch --array=1-100 sub.sh |
qstat (your jobs) | squeue -u <user> (your jobs) |
qstat -u '*' (all jobs) | squeue (all jobs) |
qlogin | salloc |