ap_SequenceWeightingProtocol

ap_SequenceWeightingProtocol reads a set of protein sequences and computes a real-valued weight for each of them.

If the input is a FASTA file, every pair of sequences is aligned and sequence identity values are evaluated from these pairwise alignments. If the input is an .aln file (i.e. the ClustalO MSA file format; the .clustalw extension is accepted as well), the sequences are assumed to be already aligned and sequence identity values are computed directly from the MSA.
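To make the notion of sequence identity between two already aligned sequences concrete, here is a minimal sketch. The function `aligned_identity` is a hypothetical helper written for this illustration and is not part of the BioShell API. It assumes that identity is the fraction of identical residues among the columns where neither sequence has a gap; other conventions for the denominator exist, and the protocol may use a different one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a BioShell function).
// Returns the fraction of identical residues among the columns
// in which neither of the two aligned rows contains a gap ('-').
double aligned_identity(const std::string& a, const std::string& b) {
  assert(a.size() == b.size()); // rows of one MSA must have equal length
  std::size_t compared = 0, identical = 0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (a[i] == '-' || b[i] == '-') continue; // skip gapped columns
    ++compared;
    if (a[i] == b[i]) ++identical;
  }
  // Two rows with no ungapped column in common are treated as unrelated
  return compared == 0 ? 0.0 : static_cast<double>(identical) / compared;
}
```

For example, `aligned_identity("ACDE", "ACDF")` compares four columns, finds three identical, and returns 0.75.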

Sequence identity values are then transformed into real-valued weights. These weights may be used further, e.g. in sequence profile construction.
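This page does not state how the protocol converts identity values into weights. As an illustration only, the sketch below uses a commonly applied neighbour-count scheme: the weight of a sequence is the reciprocal of the number of sequences (itself included) whose identity to it reaches a cutoff. The function `weights_from_identity` and the use of the cutoff in this way are assumptions made for the example, not the documented behaviour of SequenceWeightingProtocol.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a BioShell function).
// identity[i][j] holds the sequence identity between sequences i and j.
// Assumed scheme: weight_i = 1 / (1 + number of other sequences j
// with identity[i][j] >= cutoff).
std::vector<double> weights_from_identity(
    const std::vector<std::vector<double>>& identity, double cutoff) {
  const std::size_t n = identity.size();
  std::vector<double> weights(n, 1.0);
  for (std::size_t i = 0; i < n; ++i) {
    assert(identity[i].size() == n); // the matrix must be square
    std::size_t neighbours = 1;      // every sequence counts itself
    for (std::size_t j = 0; j < n; ++j)
      if (j != i && identity[i][j] >= cutoff) ++neighbours;
    weights[i] = 1.0 / static_cast<double>(neighbours);
  }
  return weights;
}
```

Under this scheme, two near-duplicate sequences share one unit of weight (0.5 each), while a sequence with no close relatives keeps a weight of 1. This down-weights redundant sequences so that they do not dominate a profile built from the set.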

USAGE:
ap_SequenceWeightingProtocol input.fasta
ap_SequenceWeightingProtocol input.aln

Keywords:

  • FASTA input
  • sequence alignment
  • sequence identity
  • sequence weighting

Categories:

  • core/protocols/SequenceWeightingProtocol

Output files:

Program source:

#include <chrono>
#include <iostream>
#include <string>
#include <vector>

#include <utils/exit.hh>
#include <core/data/basic/Array2D.hh>
#include <core/data/sequence/Sequence.hh>
#include <core/data/io/fasta_io.hh>
#include <core/data/io/clustalw_io.hh>
#include <core/protocols/SequenceWeightingProtocol.hh>
#include <utils/io_utils.hh>

std::string program_info = R"(

ap_SequenceWeightingProtocol reads a set of protein sequences and computes a real weight for each of those sequences.

If the FASTA file is the input, every pair of sequences will be aligned and sequence identity values will be evaluated
based on these alignments. If .aln is the input (i.e. ClustalO MSA file format), it is assumed the sequences are already
aligned and sequence identity values will be computed based on the MSA.

Sequence identity values will be transformed into real weights. These weights may be further used e.g.
in sequence profile construction

USAGE:
    ap_SequenceWeightingProtocol input.fasta
    ap_SequenceWeightingProtocol input.aln

)";

/** @brief Shows how to use SequenceWeightingProtocol class
 *
 * CATEGORIES: core/protocols/SequenceWeightingProtocol
 * KEYWORDS:   FASTA input; sequence alignment; sequence identity; sequence weighting
 * GROUP: Sequence calculations;
 */
int main(const int argc, const char* argv[]) {

  if(argc < 2) utils::exit_OK_with_message(program_info); // --- complain about missing program parameter

  using namespace core::data::sequence;
  using namespace core::protocols;

  bool if_align = true;
  std::vector<Sequence_SP> input_sequences;
  auto root_extn = utils::root_extension(argv[1]);
  if ((root_extn.second == "aln") || (root_extn.second == "clustalw")) {
    core::data::io::read_clustalw_file(argv[1], input_sequences);
    if_align = false;
  } else
    core::data::io::read_fasta_file(argv[1], input_sequences);

  core::protocols::SequenceWeightingProtocol protocol;
  protocol.seq_identity_cutoff(0.25).n_threads(1);
  protocol.if_align_sequences(if_align).add_input_sequences(input_sequences);
  auto start = std::chrono::high_resolution_clock::now(); // --- timer starts!
  protocol.run();
  auto end = std::chrono::high_resolution_clock::now();
  std::chrono::duration<double> time_span = std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
  std::cerr << input_sequences.size() * (input_sequences.size() - 1) / 2.0
            << " sequence similarities calculated within " << time_span.count() << " [s]\n";

  protocol.print_weights(std::cout);
}