BioShell cookbook

This cookbook provides a bunch of handy one-liners that simplify daily tasks in structural bioinformatics.

bash-only recipes

Combine a bunch of .pdb files into a single multimodel-pdb:

k=0;
for i in *.pdb; do
    k=$(($k+1));
    echo "MODEL     "$k;
    cat $i;
    echo "ENDMDL";
done > all-pdb
mv all-pdb all.pdb

1. seqc recipes

1.1 Create FASTA from PDB (prints FASTA on a screen):

seqc -in:pdb=2gb1.pdb -out:fasta

1.2 Create FASTA from PDB, including secondary structure:

seqc -in:pdb=2gb1.pdb -out:fasta -in::pdb::header -out:fasta:secondary

Secondary structure annotation is extracted from the PDB file header (-in::pdb::header option is necessary to parse it)

1.3 Create SS2 file from PDB:

seqc -in:pdb=2gb1.pdb -out:ss2 -in::pdb::header

As above, the secondary structure is extracted from the PDB file header; all the probability values (last three columns in a SS2 file) are set either to \(1.0\) or \(0.0\)

1.4 Count secondary structure elements in a bunch of PDB files, create a nice table:

for i in 2gb1.pdb 2fdo.pdb 1rrx.pdb
do
  ss=`seqc -in:pdb=$i -out:ss2 -in::pdb::header -of -out::sequence::width=0 \
     | tail -1 | fold -w1 | uniq | sort | uniq -c | tr '\n' ' '`
  echo $i $ss
done 2>/dev/null

As in recipe 1.2, but this time a combination of a few bash commands is used to parse the ouput and count the number of secondary structure elements: coil (C), strands (E) and helices (H). Example output looks as below:

2gb1.pdb 6 C 4 E 1 H
2fdo.pdb 7 C 6 E 3 H
1rrx.pdb 16 C 11 E 5 H

1.5 Write FASTA file with only one line per sequence (un-wrap sequences)

seqc -in:fasta=in.fasta -out:sequence:width=0 -out:fasta

1.6 Convert ASN.1 sequence profile (psiblast output) into a text format

seqc -in:profile:asn1=d1or4A_.asn1 -out:profile:txt

1.7 As in recipe 1.5 (i.e. .asn1 -> .txt), but this time reorder profile columns

seqc -in:profile:asn1=d1or4A_.asn1 -out:profile:txt  \
    -out:profile:columns=GAPVILMCHWFYKRQDNQST

1.8 Sort sequences from the longest to the shortest

seqc -in:fasta=in.fasta -seqc:sort -out:fasta

This recipe can obviously be combined with the one above (every FASTA sequence in a single line)

1.9 Basic sequence filtering

seqc -in:fasta=in.fasta -seqc:sort -select::sequence::protein -out:fasta \
    -select::sequence::long_at_least=30

Print only amino acid sequences (due to -select::sequence::protein filter) that are at least 30 residues long

1.10 Basic sequence filtering: keep nucleotide sequences

seqc -in:fasta=in.fasta -seqc:sort -select::sequence::nucleic -out:fasta \
    -select::sequence::long_at_least=30

Print only nucleic acid sequences (due to -select::sequence::nucleic filter) that are at least 30 residues long

2. strc recipes

2.1 Write only chain A of the given input PDB file

strc -in:pdb=5edw.pdb -select::chains=A -out:pdb=5edwA.pdb

2.2 Write only aminoacids of chain A (ligands, water etc will be removed)

strc -in:pdb=5edw.pdb -select::chains=A -out:pdb=5edwA.pdb -select::aa

2.3 Write only selected fragment of a given protein (residues from 1 to 83 of chain A)

strc -in:pdb=1PQX.pdb -select::substructure=A:1-83 -op=out.pdb

3. str_calc recipes

3.1 Find all pairwise all-atom crmsd distances between all the models in a given PDB

str_calc -in:pdb=2kmk-1.pdb -calc::crmsd -in:pdb::all_models -in:pdb:native=2KMK.pdb.gz

3.2 Read in only CA atoms; find all pairwise crmsd distances between all the models in a given PDB

str_calc -select::ca -in:pdb=2kmk-1.pdb -calc::crmsd -in:pdb::all_models \
        -in:pdb:native=2KMK.pdb.gz

3.3 Generate theoretical NOE restraints on for a protein backbone

str_calc -in::pdb=2kwi.pdb -in:pdb:with_hydrogens \
  -calc::distmap::describe -calc::distmap::allatom

This command lists all distances between any two backbone atoms; -in:pdb:with_hydrogens option forces BioShell to read hydrogen atoms, which is false by default, -calc::distmap::describe turns on longer atom descriptions. The output may look as below:

A GLN    9  N     10  A GLY    8  N      1    3.602
A GLN    9  N     10  A GLY    8  CA     2    2.418
A GLN    9  N     10  A GLY    8  C      3    1.326
A GLN    9  N     10  A GLY    8  O      4    2.245
A GLN    9  N     10  A GLY    8  HA2    8    2.506
A GLN    9  N     10  A GLY    8  HA3    9    2.959
A GLN    9  CA    11  A GLY    8  N      1    4.834
A GLN    9  CA    11  A GLY    8  CA     2    3.788
A GLN    9  CA    11  A GLY    8  C      3    2.425
A GLN    9  CA    11  A GLY    8  O      4    2.756
str_calc -in::pdb=2kwi.pdb -in:pdb:with_hydrogens -calc::distmap::describe \
    -calc::distmap::allatom  | awk '{if(($11<2.5) && ($3-$8>4)) print $0}'

This output is the filtered with awk. The ouput lines must satisfy the criteria: distance below 2.5 Angstroms, sequence separation at least 4 residues.

3.3 Find all-atom crmsd distances between all models in a single PDB and the reference native structure

str_calc -in:pdb=2kmk-1.pdb -calc::crmsd -in:pdb::all_models -in:pdb:native=2KMK.pdb.gz

3.4 As in the above example, but after superimposing alpha-carbons, calculate crmsd on all the atoms:

str_calc -in:pdb=2kmk-1.pdb -calc::crmsd -in:pdb::all_models -in:pdb:native=2KMK.pdb.gz \
        -calc::crmsd::matching_atoms=A:*:_CA_ -calc::crmsd::rotated_atoms=A:*:*

Check peptide docking results: superimpose two structures using alpha carbons of chain A (i.e. the receptor) and calculate crmsd of CA atoms of chain B (i.e. the ligand)

str_calc -in:pdb=model-1.pdb -calc::crmsd -in:pdb::all_models -in:pdb:native=native.pdb \
        -calc::crmsd::matching_atoms=A:*:_CA_ -calc::crmsd::rotated_atoms=B:*:_CA_

3.5 Check peptide docking results: superimpose two structures using alpha carbons of chain A (i.e. the receptor) and calculate crmsd of CA atoms of chain B (i.e. the ligand)

str_calc -in:pdb=models-1.pdb -calc::crmsd -in:pdb::all_models -in:pdb:native=native.pdb \
        -calc::crmsd::matching_atoms=A:*:_CA_ -calc::crmsd::rotated_atoms=B:*:_CA_

Note, that this recipe loads all models from the models-1.pdb file. For instance, if that file contains 10 structures, one can expect the following output:

#  name   crmsd  len  crmsd  len
models-1-1.pdb  0.000   96  0.000    4
models-1-2.pdb  0.262   96 22.598    4
models-1-3.pdb  0.274   96 16.670    4
models-1-4.pdb  0.260   96 16.123    4
models-1-5.pdb  0.292   96 24.524    4
models-1-6.pdb  0.320   96 27.575    4
models-1-7.pdb  0.351   96 24.200    4
models-1-8.pdb  0.385   96 24.613    4
models-1-9.pdb  0.297   96 22.778    4
models-1-10.pdb  0.325   96 25.136    4

The first column identifies a model structure (name-of-input-file + dash + model number), the second and third provide crmsd on the atoms used for superposition (CA atoms of chains A inthis case) and the number of these atoms (here 96), respectively. Finaly the last two columns provude crmds and atom count for the rotated atom set. The results come for tetrapeptide docking experiment, hence only 4 CA atoms were rotated.

4. clust recipes

4.1 Calculate hierarchical clustering of 140 elements; distances are stored in tm_dist file.

clust -i=tm_dist -n=140 -clustering:out:distance=0.4

Prints clusters for critical distance 0.4. By default single link clustering strategy is used

5. hist recipes

5.1 Calculate a histogram from the 14th column of a given input file:

hist -in:file=default.fsc -in:column=14 -hist:x_max=10 -hist:x_min=0

The command reads a score file produced by Rosetta and makes a histogram of crmsd, assuming it’s in the 14th column