Conrad Bailey

Determining DNA Health: Aho And Corasick

This problem comes from HackerRank.

You can see my full solution in the GitHub repo.

Problem Summary

You're given a list of non-unique patterns (strings), each of which has a positive integer value assigned to it.

Next you're given a list of texts in which to find those patterns. However only a certain, contiguous range of patterns is valid for any given text.

The goal for every text is to find all the insances of valid matches and sum up their associated values.

Finally report the minimum and maximum sums observed

Variable Names

  • text_len_sum is the sum of the lengths of all the texts.
  • pattern_len_sum is the sum of the lengths of all the patterns.
  • num_patterns is the number of patterns given.
  • num_texts is the number of texts one must process.
  • num_matches the total number of matches found in all texts.

Naive Approach

For every character in every text check whether its the beginning of any valid pattern. This is clearly O(text_len_sum*pattern_len_sum) which is terrible, but easy to code. That solved 7 cases and timed out on the other 26.

Initial Approach: Knuth-Morris-Pratt (KMP)

This approach requires building a KMP automaton for every pattern. Then for every character in every text, advance the state of every valid automaton and report accepted states.

I chose to experiment with this approach because I was already familiar with the KMP algorithm, and the code was fresh in my head. I expected the performance to be poor however: O(text_len_sum*num_patterns), still quadratic.

The overhead of this approach made it even worse than the naive approach, only solving 2 of 33 test cases without timing out.

Second Approach: Composed KMP -- Aho-Corasick

I recalled composing a lot of finite state machines in my Theory of Computing class; I didn't actually think that highly theoretical knowledge would be very useful. Yet here I was considering composing all the KMP automata for every pattern into a giant FSM that could match against every pattern in parallel, i.e. O(text_len_sum).

Before I went about rigging that up though, I decided it was time to do some research. That's when I discovered the Aho-Corasick Algorithm, which is basically the same thing. It preprocesses all the patterns into a FSM via a trie, and then that FSM can be used on any number of texts.

If you're looking into Aho-Corasick for the first time, try out these Stanford lecture notes.

Implementing Aho-Corasick: Node Basics

The problem guarantees that all texts and patterns are composed exclusively of lowercase, English letters. Therefore a vector<Node *> children of size 26 is sufficient for each node in the trie to hold a reference to every possible child. if children[i] == nullptr then that node does not have a child for the ith letter of the alphabet.

The basic Node data members are as follows

char val; // The character value belonging to this node.
bool accept = false; // Whether it represents an accept state
                     // (the last character of a pattern).
Node *suffix = nullptr; // See Aho-Corasick
Node *output = nullptr; // See Aho-Corasick
vector<Node *> children;

Implementing Aho-Corasick: Returning Associated Values

Yet a little more book-keeping is necessary to record the values associated with the accept states (read patterns). The simple fix is to give each accepting Node a data member to store it's value in; then when a match is found at that state, add that value to the running sum.

There's a problem though: If a pattern is outside the range of valid patterns for a text, then we don't want to recognize that pattern's accept state in our sum. This means we must create a new Aho-Corasick automaton for every range of patterns necessary, potentially for every text. Every Aho-Corasick FSM construction is O(pattern_len_sum), so this approach is <O(pattern_len_sum * num_texts), O(text_len_sum)>, quadratic.

...unless we give accept states the ability to hold a range of integer values. Every Node now has two new data members (the problem says the pattern-associated-values are "health values", hence the name)

// Values associated with patterns that match at this node
vector<unsigned long long> healths;

// index[j] holds the index at which healths[j] could be found
// in the orginal patterns list. Naturally this is in sorted
// order.
vector<unsigned long long> index;

This allowed me, on a per match basis, to find the range of valid values for that match (O(log(num_patterns)) thanks to the natural sorting) without reconstructing the Aho-Corasick FSM. See below

// Given the text being processed only has valid matches in
// the range [start, finish]
for (auto out = state; out != nullptr; out = out->output)
  if (out->accept)
    for (auto i = lower_bound(out->index.begin(),
                              out->index.end(), start);
         i != out->index.end() && *i <= finish;  ++i) {
       sum += out->healths[distance(out->index.begin(), i)];

Final Complexity

I should've/could've saved a lot of space with pointers to `long long` types instead of copying them all the time.

  • Building the automaton is O(pattern_len_sum)
  • Processing the texts is O(text_len_sum + num_matches) but every letter could be an accept state, requiring a O(log(num_patterns)) range lookup.
  • Therefore the final complexity is <O(pattern_len_sum), O((text_len_sum + num_matches)log(num_patterns))>
  • This version did complete all test cases correctly and under the time limit.