clust tutorial : clustering sequences and structures

Clustering procedure allows one to divide arbitrary number of objects into groups accordint to their mutual (dis)similarity. This method is widely used in bioinformatics and molecular modeling to deal with data sets that are too large to be inspected manually. Here we give two examples of Hierarchical Agglomerative Clustering with BioShell package:

  1. to cluster a pool of protein sequences
  2. to cluster results of protein-peptide docking

The BioShell procedure for clustering divides the task into three steps:

  1. calculate a matrix of distances between elements subjected to a clustering analysis.

    As a result, a flat text file should be produced. The three columns of that file must provide i-th element ID, j-th element ID and the respective distance value

  2. run the actuall clustering procedure.

    Although the procedure can be stopped at a particular cutoff distance, we advise to conduct the calculations i.e. until all the objects are merged into a single cluster. Clustering tree will be stored in an output file

  3. analyse the clustering tree to retrieve clusters at a desired cutoff level

Below we show how to perform these three steps for two different clustering applications

Example 1. Clustering protein sequences by their mutual sequence identity

Step 1: Calculating the distances

Clustering procedure should merge close sequences (i.e. small mutual distance) into a single cluster, while dissimilar sequences should be placed in different clusters. Unfortunately, sequence identity value (seq_id) cannot be used here because its largest value (1.0) denotes identical sequences. Here we propose to use 1.0 - seq_id as a distance function.

First we use ap_PairwiseSequenceIdentityProtocol program to evaluate all pairwise distances:

ap_PairwiseSequenceIdentityProtocol inp.fasta 8 0.4  > seq_id.out 2>LOG

where inp.fasta is the input file (FASTA format), 8 is the number of cores (threads run in parallel) and 0.4 is the smallest seq_id value to be written to a file.

Then the seq_id values are converted into distances with awk command line tool:

awk '{print $1,$2,1.0-$3}' seq_id.out > distances.out

Step 2: Clustering the data

Then we run the clust tool:

clust -in::file=distances.out \
    -n=46621 \
    -complete \
    -clustering:missing_distance=1.1 \
    -clustering:out:tree=tree-complete >clust_out 2>clust_log

The -n option is necessary to provide the number of objects subjected to clustering (not the number of distance values!). -clustering:missing_distance Provides the default distance value for the cases it’s undefined. The clustering tree will be stored in a file specified by -clustering:out:tree option

Step 3: Analysis

clust  -in::file=distances.out \
    -n=46621 \
    -clustering:in:tree=tree-complete \
    -clustering:out:clusters \
    -clustering:out:distance=0.4 \

Example 2. Clustering results of protein-peptide docking

The input data set contains 12500 conformations of a protein receptor (1jd4) with a short peptide bound to its surface. The conformations were calculated with FlexPepDocking program from Rosetta modelling suite.

Step 1: Calculating the distances

Step 2: Clustering the data

We run clust program as above, just should remember to put the correct imput file name and to change the number of data elements (i.e. protein conformations)

clust -in::file=1jd4-pep-crsmd \
  -n=12500 \
  -complete \
  -clustering:missing_distance=15.1 \
  -clustering:out:tree=tree-complete >clust_out 2>clust_log

Step 3: Analysis

clust -in::file=all_vs_all_crmsd_15 \
  -n=12500 -clustering:out:clusters \
  -clustering:out:distance=2.5 \
  -clustering:out:min_size=10 \