Hacking KvarQ

This chapter is a good starting point if you intend to change anything in the KvarQ source code, especially in the modules kvarq.genes and kvarq.analyse. If you’re merely interested in writing a new testsuite, you may also find valuable information here, in addition to the example testsuites shipped with KvarQ. Make sure you have read the overview first.

About flanks

In general, KvarQ searches for a given sequence within the genome and checks for mutations within that “region of interest”. Because .fastq reads are only accepted if their overlap with the target sequence reaches a specified minimum (minreadlength), the coverage falls off rapidly at both ends of the target sequence. For this reason, a region of interest is flanked with supplementary bases on either side. Mutations in these flanks are disregarded; their sole purpose is to keep the low-coverage border effect out of the region of interest. A TemplateFromGenome reads its bases from a Genome and can easily add an arbitrary flank on either side of the region of interest.
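
The border effect can be illustrated with a toy model. The sketch below is not KvarQ code: it assumes one read starting at every possible position, and the function name coverage and the parameter min_overlap (standing in for the minimum overlap described above) are hypothetical.

```python
def coverage(target_len, read_len, min_overlap):
    """Count accepted reads per target position in a toy model.

    One read of length read_len starts at every position. A read is
    accepted only if it overlaps the target by at least min_overlap bases.
    """
    cov = [0] * target_len
    for read_start in range(-read_len + 1, target_len):
        lo = max(read_start, 0)
        hi = min(read_start + read_len, target_len)
        if hi - lo >= min_overlap:
            for i in range(lo, hi):
                cov[i] += 1
    return cov


# Region of interest of 10 bases, scanned without any flank:
# coverage drops towards both ends.
without_flank = coverage(target_len=10, read_len=5, min_overlap=3)

# Same region with 2 flanking bases on either side: the drop now
# falls into the flanks, which are disregarded.
flank = 2
with_flank = coverage(target_len=10 + 2 * flank, read_len=5, min_overlap=3)
region_of_interest = with_flank[flank:-flank]

print(without_flank)
print(region_of_interest)
```

In this model the flank only needs to be as wide as the stretch over which coverage drops; real data will of course be noisier.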

About positions

  • All positions within a Genome are one-based. For example, the first codon of the genome (if it were coding) would consist of bases 1-3, which correspond to the first three bytes in the file (at file positions 0-2). Positions increase in the reading direction of the plus strand.
  • The TemplateFromGenome’s start and stop attributes refer to positions in the genome.
  • Sequences are simply strings of bases; indexing within a sequence therefore starts at the first character of the sequence string. This sequence string may contain flanks on both sides. The (optional) attribute start refers to the first base after the flank (i.e. the first base of the sequence of interest).
  • In a Coverage, positions simply refer to its Sequences on the plus strand. A coverage has no pos attribute and the start, stop attributes correspond to the (plus strand) sequence’s left and right.
  • The Hit structure
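
The conventions above can be sketched in a few lines. This is an illustration only, not the KvarQ implementation: the helpers file_offset and extract_with_flank are hypothetical, and treating stop as inclusive is an assumption.

```python
def file_offset(genome_pos):
    """Convert a one-based genome position to a zero-based file offset."""
    if genome_pos < 1:
        raise ValueError("genome positions start at 1")
    return genome_pos - 1


def extract_with_flank(genome, start, stop, flank):
    """Cut out genome positions start..stop (one-based, stop assumed
    inclusive) plus up to `flank` bases on either side.

    Returns the sequence string and the index, within that string, of the
    first base after the left flank.
    """
    left = max(file_offset(start) - flank, 0)   # clip at the genome start
    right = min(stop + flank, len(genome))      # clip at the genome end
    return genome[left:right], file_offset(start) - left


genome = "ACGTACGTACGTACGTACGT"

# The first codon occupies genome positions 1-3, i.e. file offsets 0-2.
first_codon_offsets = [file_offset(pos) for pos in (1, 2, 3)]

# Region of interest at genome positions 9-12 with a flank of 3 bases.
seq, roi_index = extract_with_flank(genome, 9, 12, 3)
print(seq, roi_index, seq[roi_index:roi_index + 4])
```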

About the complementary DNA strand

In general, everything refers to the + strand of the genome. Just before scanning, the analyser creates a complementary copy of every Sequence in Analyser.scan(). When assembling the hits in Coverage.apply_hit(), hits on the complementary strand are mapped back onto the positive strand using the methods Sequence.plus_idx() and Sequence.plus_base(). Finally, the TemplateFromGenome and the Gene attached to a Test know whether the + or the complementary - strand is coding and accordingly convert the base mutations into (non-)synonymous amino acid changes.
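
The mapping back to the plus strand can be sketched as follows. This is a minimal illustration that assumes the complementary copy is stored as the reverse complement. The helper names to_plus_index and to_plus_base are hypothetical and do not reproduce the signatures of Sequence.plus_idx() and Sequence.plus_base().

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(bases):
    """Return the complementary strand, read in its own 5'->3' direction."""
    return bases.translate(_COMPLEMENT)[::-1]


def to_plus_index(minus_index, length):
    """Map an index on the reverse complement back to the plus strand."""
    return length - 1 - minus_index


def to_plus_base(minus_base):
    """Map a base seen on the complementary strand to the plus-strand base."""
    return minus_base.translate(_COMPLEMENT)


plus = "AACGT"
minus = reverse_complement(plus)

# Every base found on the minus strand maps to the matching plus-strand base.
for i, base in enumerate(minus):
    assert to_plus_base(base) == plus[to_plus_index(i, len(plus))]
```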

Sequence of tests

The Analyser is initialized with a set of testsuites, each of which defines a number of Test objects. Just before scanning, the analyser creates a list (actually an OrderedDict) that contains every test of every testsuite. The base sequences that the engine searches for in the .fastq file are ordered in the same sequence (with the complementary base sequences added at the end of the list). Later on, the Coverage objects are ordered in the same sequence, and everything that gets saved to the .json file (in encode()) uses the same sequence again. To be able to reconstruct the same order when loading a .json file, decode() tries to identify every test by its name (as returned by its __str__() method) and reconstructs the same sequence of tests. The Analyser[...] method returns a Coverage and accepts integers as well as strings and Test instances.
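
The lookup behaviour described above can be sketched with a toy container. This is not the Analyser itself: ToyTest, ToyAnalyser and the test names are hypothetical, and the coverages are plain placeholder strings.

```python
from collections import OrderedDict


class ToyTest:
    """Stand-in for a test that is identified by its string representation."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name


class ToyAnalyser:
    """Keeps one coverage per test, in the order the tests were given."""

    def __init__(self, tests):
        self._coverages = OrderedDict(
            (str(test), "coverage of " + str(test)) for test in tests
        )

    def __getitem__(self, key):
        if isinstance(key, int):
            # Integer keys index into the fixed sequence of tests.
            return list(self._coverages.values())[key]
        # Strings and test objects are both resolved through the test name.
        return self._coverages[str(key)]


tests = [ToyTest("suite_a/test_1"), ToyTest("suite_b/test_2")]
analyser = ToyAnalyser(tests)

print(analyser[1])
print(analyser["suite_a/test_1"])
print(analyser[tests[0]])
```

Keying everything on the test name is what allows the same order to be rebuilt from a saved file, provided the names are unique.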

About clonal variants (“heterozygous calls”)

When the DNA is not extracted with great care from single colonies, this can easily result in mixed DNA from different clones. Although KvarQ is not designed to interpret polyclonal variants, the kvarq.analyse.Coverage lists detailed information about the frequency of every mutation and new testsuites can use this information to interpret the results of mixed colony sequencing.

Currently, the explorer displays clonal variants (defined as base calls with the most dominant base below 90%) with a ~ sign, and the MTBC/phylo as well as the MTBC/resistance testsuites display a remark when the sequencing data seems to stem from a mixed population.
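
A new testsuite could apply the same 90% criterion to per-position base counts. The sketch below assumes the counts are available as a plain dictionary. It does not use the actual kvarq.analyse.Coverage interface, and the function names are hypothetical.

```python
def dominant_fraction(base_counts):
    """Return the fraction of reads supporting the most frequent base,
    or None if the position is not covered at all."""
    total = sum(base_counts.values())
    if total == 0:
        return None
    return max(base_counts.values()) / total


def is_clonal_variant(base_counts, threshold=0.9):
    """Flag a position whose most dominant base stays below the threshold."""
    fraction = dominant_fraction(base_counts)
    return fraction is not None and fraction < threshold


clean_call = {"A": 95, "C": 0, "G": 5, "T": 0}
mixed_call = {"A": 70, "C": 0, "G": 30, "T": 0}

print(is_clonal_variant(clean_call))
print(is_clonal_variant(mixed_call))
```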