BioShell cookbook

This cookbook collects handy one-liners that simplify everyday tasks in structural bioinformatics.

seqc recipes

Create FASTA from PDB (prints the FASTA to the screen):

seqc -in:pdb=2gb1.pdb -out:fasta

Create FASTA from PDB, including secondary structure:

seqc -in:pdb=2gb1.pdb -out:fasta -in::pdb::header -out:fasta:secondary

The secondary structure annotation is extracted from the PDB file header (the -in::pdb::header option is required to parse it).

Create SS2 output from PDB (secondary structure in SS2 format):

seqc -in:pdb=2gb1.pdb -out:ss2 -in::pdb::header

As above, the secondary structure is extracted from the PDB file header; all the probability values (the last three columns of an SS2 file) are set to either 1.0 or 0.0.
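To make the 1.0 / 0.0 convention concrete, here is a minimal Python sketch (not BioShell code) that builds SS2-style rows from a sequence and a secondary structure string. It assumes the PSIPRED column order for the three probabilities (coil, helix, strand); the function name ss2_rows and the exact column spacing are illustrative assumptions.

```python
def ss2_rows(sequence, secondary):
    """Return SS2-style rows with one-hot probabilities (assumed order: C, H, E)."""
    if len(sequence) != len(secondary):
        raise ValueError("sequence and secondary structure must have equal length")
    rows = []
    for i, (aa, ss) in enumerate(zip(sequence, secondary), start=1):
        # The annotated state gets probability 1.0, the other two states get 0.0
        probs = [1.0 if ss == state else 0.0 for state in "CHE"]
        rows.append("%4d %s %s   %5.3f  %5.3f  %5.3f" % (i, aa, ss, *probs))
    return rows


for row in ss2_rows("MTYK", "CEEH"):
    print(row)
```

Because the annotation comes from the header rather than from a predictor, every row carries exactly one 1.000 and two 0.000 values.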

Write FASTA file with only one line per sequence (un-wrap sequences)

seqc -in:fasta=in.fasta -out:sequence:width=0 -out:fasta
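For readers who want to see what un-wrapping means, the following Python sketch reproduces the effect of the recipe above on a FASTA string. It is an illustration only and says nothing about how seqc is implemented; the function name unwrap_fasta is made up for this example.

```python
def unwrap_fasta(text):
    """Join the wrapped lines of every FASTA record into a single sequence line."""
    out, chunks = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record
            if chunks:
                out.append("".join(chunks))
                chunks = []
            out.append(line)
        else:
            chunks.append(line)
    if chunks:
        out.append("".join(chunks))
    return "\n".join(out) + "\n"


print(unwrap_fasta(">a\nMTY\nKLI\n>b\nGG\n"), end="")
```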

Sort sequences from the longest to the shortest

seqc -in:fasta=in.fasta -seqc:sort -out:fasta

This recipe can be combined with the one above, so that every sorted FASTA sequence is written on a single line.
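The ordering produced by the sorting recipe can be sketched in a few lines of Python. The (name, sequence) tuples and the function name sort_longest_first are assumptions made for illustration; how seqc orders sequences of equal length is not stated here, so this sketch simply keeps their input order.

```python
def sort_longest_first(records):
    """Sort (name, sequence) pairs from the longest sequence to the shortest."""
    # sorted() is stable, so sequences of equal length keep their input order
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)


records = [("short", "MTYK"), ("long", "MTYKLILNGKTLKG"), ("mid", "MTYKLILN")]
for name, seq in sort_longest_first(records):
    print(">%s\n%s" % (name, seq))
```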

Basic sequence filtering: keep protein sequences

seqc -in:fasta=in.fasta -seqc:sort -select::sequence::protein -out:fasta \
        -select::sequence::long_at_least=30

Prints only amino acid sequences (because of the -select::sequence::protein filter) that are at least 30 residues long.

Basic sequence filtering: keep nucleotide sequences

seqc -in:fasta=in.fasta -seqc:sort -select::sequence::nucleic -out:fasta \
        -select::sequence::long_at_least=30

Prints only nucleic acid sequences (because of the -select::sequence::nucleic filter) that are at least 30 nucleotides long.
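The two filtering recipes combine a sequence-type test with a minimum-length test. The Python sketch below shows one way such a filter could work. The rule used to tell nucleic acids from proteins (every letter is one of A, C, G, T, U, N) is a simple heuristic assumed for this example, not BioShell's actual criterion, and the names is_nucleic and select are hypothetical.

```python
NUCLEOTIDES = set("ACGTUN")


def is_nucleic(seq):
    """Heuristic (an assumption): a sequence made only of nucleotide letters."""
    return len(seq) > 0 and set(seq.upper()) <= NUCLEOTIDES


def select(records, kind, min_length=30):
    """Keep (name, sequence) pairs of the requested kind that are long enough."""
    keep = []
    for name, seq in records:
        if len(seq) < min_length:
            continue
        # kind is either "protein" or "nucleic"
        if (kind == "nucleic") == is_nucleic(seq):
            keep.append((name, seq))
    return keep


records = [
    ("prot", "MTYKLILNGKTLKGETTTEAVDAATAEKVFKQ"),
    ("dna", "ACGT" * 10),
    ("tiny", "MTYK"),
]
print([name for name, _ in select(records, "protein")])
print([name for name, _ in select(records, "nucleic")])
```

A short peptide that happens to contain only the letters A, C, G and T would be misclassified by this heuristic, which is one reason a real tool may use a different rule.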

strc recipes

Write only chain A of the given input PDB file

strc -in:pdb=5edw.pdb -select::chains=A -out:pdb=5edwA.pdb

Write only the amino acids of chain A (ligands, water, etc. will be removed)

strc -in:pdb=5edw.pdb -select::chains=A -out:pdb=5edwA.pdb -select::aa
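The two chain-selection recipes can be pictured as a filter over PDB coordinate records. The Python sketch below relies on the fixed-column PDB layout (residue name in columns 18-20, chain identifier in column 22). It treats only ATOM records of the 20 standard residues as amino acids, which is a simplifying assumption: strc may handle modified residues differently. The helper atom_line is hypothetical and only builds shortened demo records without coordinates.

```python
STANDARD_AA = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def select_chain(pdb_lines, chain, amino_acids_only=False):
    """Keep ATOM/HETATM records of one chain; optionally keep amino acids only."""
    selected = []
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 22:
            continue
        if line[21] != chain:  # chain identifier sits in column 22
            continue
        is_amino_acid = line.startswith("ATOM") and line[17:20] in STANDARD_AA
        if amino_acids_only and not is_amino_acid:
            continue
        selected.append(line)
    return selected


def atom_line(record, serial, name, res_name, chain, res_seq):
    """Hypothetical helper: build a coordinate-less record for the demo."""
    return "%-6s%5d %-4s %3s %1s%4d" % (record, serial, name, res_name, chain, res_seq)


demo = [
    atom_line("ATOM", 1, " N", "MET", "A", 1),
    atom_line("ATOM", 2, " N", "GLY", "B", 1),
    atom_line("HETATM", 3, " O", "HOH", "A", 101),
]
print(len(select_chain(demo, "A")))
print(len(select_chain(demo, "A", amino_acids_only=True)))
```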

Write only a selected fragment of a given protein (residues 1 to 83 of chain A)

strc -in:pdb=1PQX.pdb -select::substructure=A:1-83 -op=out.pdb

str_calc recipes

Find all pairwise all-atom crmsd distances between all the models in a given PDB

str_calc -in:pdb=2kmk-1.pdb -calc::crmsd -in:pdb::all_models -in:pdb:native=2KMK.pdb.gz

Read in only CA atoms; find all pairwise crmsd distances between all the models in a given PDB

str_calc -select::ca -in:pdb=2kmk-1.pdb -calc::crmsd -in:pdb::all_models \
        -in:pdb:native=2KMK.pdb.gz

Find all-atom crmsd distances between all models in a single PDB and the reference native structure

str_calc -in:pdb=2kmk-1.pdb -calc::crmsd -in:pdb::all_models -in:pdb:native=2KMK.pdb.gz

As in the above example, but superimpose the structures on alpha carbons only and then calculate crmsd over all atoms:

str_calc -in:pdb=2kmk-1.pdb -calc::crmsd -in:pdb::all_models -in:pdb:native=2KMK.pdb.gz \
        -calc::crmsd::matching_atoms=A:*:_CA_ -calc::crmsd::rotated_atoms=A:*:*
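The idea behind the last recipe (fit on one atom subset, measure on another) can be sketched in plain Python. This is an illustration under stated assumptions, not BioShell's implementation: the optimal rotation is found with Horn's quaternion method, the dominant eigenvector is obtained by a simple shifted power iteration to avoid external dependencies, and the index list fit_idx stands in for the -calc::crmsd::matching_atoms selection while all atoms play the role of -calc::crmsd::rotated_atoms.

```python
import math


def centroid(points):
    return [sum(p[k] for p in points) / len(points) for k in range(3)]


def superposition(mobile, target, iterations=500):
    """Rotation matrix and centroids that fit mobile onto target (Horn's method)."""
    cm, ct = centroid(mobile), centroid(target)
    # Correlation matrix S[a][b] = sum of mobile_a * target_b over centred atoms
    S = [[sum((m[a] - cm[a]) * (t[b] - ct[b]) for m, t in zip(mobile, target))
          for b in range(3)] for a in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = S
    N = [[sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
         [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
         [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
         [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz]]
    # Shift makes all eigenvalues non-negative so power iteration finds the largest
    shift = max(sum(abs(v) for v in row) for row in N) + 1e-12
    q = [1.0, 0.5, 0.25, 0.125]  # arbitrary, deliberately asymmetric start vector
    for _ in range(iterations):
        q = [sum(N[i][j] * q[j] for j in range(4)) + shift * q[i] for i in range(4)]
        norm = math.sqrt(sum(v * v for v in q))
        q = [v / norm for v in q]
    q0, qx, qy, qz = q
    R = [[q0*q0 + qx*qx - qy*qy - qz*qz, 2*(qx*qy - q0*qz), 2*(qx*qz + q0*qy)],
         [2*(qy*qx + q0*qz), q0*q0 - qx*qx + qy*qy - qz*qz, 2*(qy*qz - q0*qx)],
         [2*(qz*qx - q0*qy), 2*(qz*qy + q0*qx), q0*q0 - qx*qx - qy*qy + qz*qz]]
    return R, cm, ct


def transform(points, R, cm, ct):
    """Move points: subtract the mobile centroid, rotate, add the target centroid."""
    moved = []
    for p in points:
        d = [p[k] - cm[k] for k in range(3)]
        moved.append([sum(R[i][k] * d[k] for k in range(3)) + ct[i] for i in range(3)])
    return moved


def rmsd(a, b):
    total = sum((x - y) ** 2 for pa, pb in zip(a, b) for x, y in zip(pa, pb))
    return math.sqrt(total / len(a))


def crmsd_fit_subset(model, native, fit_idx):
    """Superimpose on the atoms listed in fit_idx, then measure over all atoms."""
    R, cm, ct = superposition([model[i] for i in fit_idx],
                              [native[i] for i in fit_idx])
    return rmsd(transform(model, R, cm, ct), native)


native = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0),
          (0.0, 1.5, 1.5), (3.0, 1.0, 2.0)]
# Model: native rotated by 90 degrees about z and shifted
model = [(-y + 5.0, x - 2.0, z + 1.0) for x, y, z in native]
print(round(crmsd_fit_subset(model, native, [0, 1, 2, 3]), 6))
```

Fitting on a rigid core and evaluating on every atom keeps flexible regions from distorting the superposition while still counting them in the reported value.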

clust recipes

Calculate hierarchical clustering of 140 elements; the distances are stored in the tm_dist file.

clust -i=tm_dist -n=140 -clustering:out:distance=0.4

Prints the clusters obtained at a critical distance of 0.4. By default, the single-link clustering strategy is used.
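Cutting a single-link hierarchy at a critical distance is equivalent to finding the connected components of the graph whose edges are the pairs closer than that distance. The Python sketch below illustrates this with a union-find structure. It is not clust's implementation: the in-memory dictionary of distances, the function name, and the choice of "less than or equal" at the cutoff are all assumptions made for the example.

```python
def single_link_clusters(n, distances, cutoff):
    """Clusters of elements 0..n-1 linked by chains of distances <= cutoff."""
    parent = list(range(n))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), d in distances.items():
        if d <= cutoff:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


distances = {(0, 1): 0.1, (1, 2): 0.35, (0, 2): 0.9, (2, 3): 0.8, (3, 4): 0.5}
print(single_link_clusters(5, distances, 0.4))
```

Elements 0 and 2 end up in the same cluster although their direct distance is 0.9, because both are within 0.4 of element 1; this chaining is characteristic of the single-link strategy.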