Implementing Aho-Corasick in the Streaming Model

A Note About this Blog

Now that I have started my Master’s dissertation, which is on a related topic, this blog is being re-purposed to document my progress through to May.

My final-year project is again on the topic of pattern matching with fingerprints, but now looks at the problem of dictionary matching. Dictionary matching is a generalisation of exact pattern matching in which multiple patterns are matched against a single text.

This is defined formally as follows: We have a text T of length n. We have a set of k patterns \mathcal{P} = \{P_1,\dots,P_k\} of corresponding lengths m_1,\dots,m_k. The total size of the dictionary is denoted M = \sum_{i=1}^{k} m_i and the length of the longest pattern is m = \max_{1 \leq i \leq k} m_i.

Aho-Corasick

The standard solution to this problem is the Aho-Corasick algorithm, first published in 1975. The Unix command fgrep, which searches files for multiple fixed patterns, was originally based on this algorithm.

Aho-Corasick is based on the Knuth-Morris-Pratt algorithm for single patterns. Instead of matching against a single pattern, it uses a trie composed of multiple patterns. Every vertex in the trie has a failure edge, which links it to the vertex whose string is the longest proper suffix of the current vertex’s string that is also present in the trie. When a mismatch occurs between the text and the trie, we follow failure edges until we either reach the root or find a vertex with a matching child. This takes O(m) time per character in the worst case.

Aho and Corasick’s paper then provides a method of reducing this to O(1) time per character by preprocessing. For each vertex, we work out the next vertex it should move to for each character, based on its child and failure edges, and store these in a table. This removes the need to traverse many failure edges when a mismatch is found.
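To illustrate this step (a minimal sketch only, and not the representation my implementation uses: it assumes a dense child table g[s][c] with -1 for a missing edge, failure links already computed in fail[], a byte alphabet, and the states listed in breadth-first order with the root first):

    #define SIGMA 256

    /* Fill in the next-move table delta[s][c] from the child table g and
     * the failure links, processing states in breadth-first order so that
     * delta[fail[s]] is always complete before delta[s] needs it. */
    void build_next_move(int num_states, int g[][SIGMA], const int *fail,
                         const int *bfs_order, int delta[][SIGMA])
    {
        for (int c = 0; c < SIGMA; c++)
            delta[0][c] = (g[0][c] != -1) ? g[0][c] : 0;   /* root stays put */

        for (int k = 1; k < num_states; k++) {
            int s = bfs_order[k];          /* parents appear before children */
            for (int c = 0; c < SIGMA; c++)
                delta[s][c] = (g[s][c] != -1) ? g[s][c] : delta[fail[s]][c];
        }
    }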

Implementing Aho-Corasick in the streaming model is straightforward. Each character of the text is read exactly once and in order, so the only change needed is to feed the algorithm one character at a time.
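For instance, the streaming driver reduces to something like this (a sketch with illustrative names: delta is a flattened next-move table and is_output marks states at which some pattern ends):

    #include <stdio.h>

    /* Feed the automaton one character of the text at a time; each
     * character is read exactly once and in order. */
    void ac_stream(const int *delta, const char *is_output, int sigma)
    {
        int c, state = 0;                      /* start at the root */
        long pos = 0;
        while ((c = getchar()) != EOF) {
            state = delta[state * sigma + c];  /* O(1) next-move lookup */
            if (is_output[state])
                printf("match ending at position %ld\n", pos);
            pos++;
        }
    }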

The asymptotic complexity of the algorithm is O(M) space and O(1) time per character. Note that the time complexity relies on the alphabet being of a fixed size; a method of fixing this is presented below.

Implementing the Trie

Two methods were attempted for implementing the trie.

The first was an object-oriented approach, with each vertex represented by its own data structure holding pointers to its children and its failure vertex. This method was quickly abandoned due to difficulties with memory allocation.

The second was based on an idea similar to an adjacency matrix. Each state is represented by an integer, numbered in the order in which it was added to the trie. The next-move function is represented by two 2-D arrays: one of integers giving the states to move to from a given state, and one of characters giving the key for each of those moves. For space efficiency, if the next move for a state and character is not in the array, the machine moves to the root state.

The second technique was chosen due to simpler memory allocation.
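A rough sketch of the lookup under this representation (names are illustrative; keys[s] and targets[s] hold the explicit next moves of state s and degree[s] their number):

    /* Next-move lookup for the adjacency-matrix-style representation.
     * Any character without an explicit entry sends the machine back to
     * the root, which is state 0. */
    int next_move(char **keys, int **targets, const int *degree,
                  int state, char c)
    {
        for (int i = 0; i < degree[state]; i++)
            if (keys[state][i] == c)
                return targets[state][i];
        return 0;   /* default: fall back to the root state */
    }

Note that this scan is linear in the number of keys stored for the state, which is exactly the |\Sigma| factor that the hashing described below removes.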

Removing the Fixed Alphabet Constraint

As already stated, these running times assume that the alphabet is of a fixed size. This is not guaranteed in the streaming model, and the time to look up a next move is currently O(|\Sigma|), where \Sigma is the alphabet of the text and patterns.

This can be sped up by hashing. Once the next-move function is computed, the C Minimal Perfect Hashing library (CMPH) is used to map each next-move key to an index into a smaller array in constant time. The vertex to move to next is stored at that index, which gives constant-time lookup of the next move.
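As a sketch of how this looks with CMPH (the exact configuration in my code may differ; here the keys of one state are single-character strings, and CHD is picked as the construction algorithm):

    #include <string.h>
    #include <cmph.h>

    /* Build a minimal perfect hash function over the next-move keys of
     * one state; keys is an array of NUL-terminated strings. */
    cmph_t *build_state_hash(char **keys, unsigned int nkeys)
    {
        cmph_io_adapter_t *source = cmph_io_vector_adapter(keys, nkeys);
        cmph_config_t *config = cmph_config_new(source);
        cmph_config_set_algo(config, CMPH_CHD);
        cmph_t *hash = cmph_new(config);
        cmph_config_destroy(config);
        cmph_io_vector_adapter_destroy(source);
        return hash;
    }

    /* Constant-time lookup: map a key to an index into the small
     * per-state array that stores the vertex to move to. */
    unsigned int lookup_index(cmph_t *hash, const char *key)
    {
        return cmph_search(hash, key, (cmph_uint32)strlen(key));
    }

Since a minimal perfect hash function maps unknown keys to arbitrary slots, the stored key still has to be compared against the incoming character before the move is accepted; otherwise we fall back to the root as before.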

Note that the preprocessing time still depends on the size of the alphabet, as hashing is not yet used at that stage.

Next Step

The Aho-Corasick algorithm was implemented as a baseline for testing the next algorithm. This algorithm, being developed with the Algorithms Team at the University of Bristol, uses the Rabin-Karp fingerprinting method from before and techniques similar to Porat and Porat’s approach to achieve dictionary matching in O(k\log m) space.

Analysis and Results

This work was completed between the 8th and 12th of September, but has only been posted now due to other commitments.

Both the implementations for Parameterised Matching and Exact Matching with Fingerprints have been analysed. The performance of both implementations, along with a reference version of Knuth-Morris-Pratt, was tested using random extracts from 50MB of English text from the Gutenberg Project, compiled in turn by the Pizza&Chili Corpus. This text is available here.

A program was written to select random portions of a given size from the text to act as the pattern, and another was used to invoke the pattern matching algorithms. Both of these programs were written in C. A Python script was used with NumPy and SciPy to compile the results into graphical form. And finally, a poster was produced using LaTeX.

The end results are available on GitHub here, and the poster in particular can be directly accessed here.

A couple of questions are raised on the poster itself, but I would like to mention a few more which were cut due to space:

  • How does exact matching scale with text size and precision? One of the unstated assumptions in Breslauer and Galil’s paper is that an integer takes up a constant amount of space. But this is only true if we limit the size of our modulus p. If we want arbitrary precision for any text size, then we cannot limit the size of our integers to a constant amount of space. This was one of my justifications for using The GMP Bignum Library. But this means that as p increases, so will our space requirement and the time taken for each fingerprint operation, since all of them rely on GMP.
  • How precise do we need to be? During all of my development and testing of exact matching, I had an accuracy level of p \in \Theta(n^2), which is the lowest accuracy setting mentioned in the paper. Despite this, I never saw a false positive reported. This is not to say that false positives can’t happen, but are they so unlikely that we can be more lenient in practice? This might also help with the above question of size/time complexity.
  • Are there better fingerprinting algorithms? Division and modulo operations are widely known to be inefficient, yet Karp-Rabin fingerprints rely heavily on modular multiplication. Would we be better off finding a different fingerprinting technique that avoids these operations?

This will be the last post on this blog for a while as I focus on University, but I am hoping to return to it in the new year, when my Thesis will focus largely on using these fingerprinting techniques for related problems.

Implementing Exact Matching with Fingerprints

Over the past two weeks my work has focused on implementing exact pattern matching in constant time per character of text with O(\log m) space. An algorithm for this was devised by Breslauer and Galil in their paper Real-Time Streaming String-Matching.

This post will provide a brief overview of the algorithm presented as well as some interesting points during my implementation.

Karp-Rabin Fingerprints

The main trick in achieving O(\log m) space is a compression technique first used in Karp and Rabin’s paper Efficient Randomized Pattern-Matching Algorithms: given a random prime p and a random number r \in [p], the fingerprint of a string S of length k is \sum_{i = 0}^{k - 1} r^i S[i] \bmod p. Because this is a compressed representation of the string, there is a chance of collisions. This can be reduced by increasing the size of p. More precisely, if p \in \Theta(n^{2+\alpha}) for some \alpha \geq 0, then the probability of a collision is smaller than \frac{1}{n^{1 + \alpha}}.

This is simple enough in theory, but rather restrictive in practice. If we want our text to be on the scale of gigabytes (i.e. n = 2^{30}), then on a 64-bit system our accuracy level is limited to \alpha \leq 1. If we want a larger degree of accuracy, then our calculation of p will overflow.

To avoid this, I implemented a library for Karp-Rabin fingerprinting using the GNU Multiple Precision Arithmetic library (GMP). GMP provides tools and functions for arbitrarily large integer arithmetic, and is an industry standard for fast computation with large numbers.
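To give a flavour of what this looks like (a minimal sketch, not the interface of my library; the fields are assumed to be initialised elsewhere with mpz_init, with p and r chosen at start-up), appending one character to a running fingerprint \sum_i r^i S[i] \bmod p takes only a handful of GMP operations:

    #include <gmp.h>

    typedef struct {
        mpz_t fp;   /* current fingerprint                  */
        mpz_t rk;   /* r^k mod p, for the k characters seen */
        mpz_t r;    /* the random base r                    */
        mpz_t p;    /* the random prime p                   */
    } fingerprint_t;

    /* Append character c: fp = (fp + r^k * c) mod p, then advance r^k. */
    void fingerprint_append(fingerprint_t *f, unsigned char c)
    {
        mpz_t term;
        mpz_init(term);
        mpz_mul_ui(term, f->rk, c);     /* term = r^k * c      */
        mpz_add(f->fp, f->fp, term);    /* fp  += term         */
        mpz_mod(f->fp, f->fp, f->p);
        mpz_mul(f->rk, f->rk, f->r);    /* r^k -> r^(k+1)      */
        mpz_mod(f->rk, f->rk, f->p);
        mpz_clear(term);
    }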

The other advantage of GMP is that it provides a collection of fast number-theory functions, including a probabilistic function for finding prime numbers. This saves us from having to worry about the computation of our prime p.

At first glance, it may seem that we require a large number of modular exponentiations for these fingerprints. This is not the case; the only place a modular exponentiation is needed is in calculating p, and the rest can be performed with modular multiplications. Similarly, concatenating two fingerprints, or recovering a prefix/suffix fingerprint from two others, can be done with additions, subtractions and modular multiplications, avoiding exponentiation there too.
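For example (a sketch under my own conventions rather than the exact identities of the paper), writing \varphi(S) = \sum_i r^i S[i] \bmod p, two fingerprints combine as \varphi(uv) = \varphi(u) + r^{|u|}\varphi(v) \bmod p, and a known prefix can be removed again with one subtraction and one multiplication by r^{-|u|}, a value that can itself be maintained incrementally after a single mpz_invert at start-up:

    #include <gmp.h>

    /* Concatenation: phi(uv) = phi(u) + r^{|u|} * phi(v)  (mod p).
     * out must not alias fp_u. */
    void fp_concat(mpz_t out, const mpz_t fp_u, const mpz_t r_len_u,
                   const mpz_t fp_v, const mpz_t p)
    {
        mpz_mul(out, r_len_u, fp_v);
        mpz_add(out, out, fp_u);
        mpz_mod(out, out, p);
    }

    /* Remove a known prefix u: phi(v) = (phi(uv) - phi(u)) * r^{-|u|} (mod p),
     * where r_inv_len_u = r^{-|u|} mod p is kept alongside r^{|u|}. */
    void fp_remove_prefix(mpz_t out, const mpz_t fp_uv, const mpz_t fp_u,
                          const mpz_t r_inv_len_u, const mpz_t p)
    {
        mpz_sub(out, fp_uv, fp_u);
        mpz_mul(out, out, r_inv_len_u);
        mpz_mod(out, out, p);
    }

Extracting a prefix instead is symmetric: \varphi(u) = \varphi(uv) - r^{|u|}\varphi(v) \bmod p.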

Given the number of modular multiplications performed in this process, an interesting avenue would be to see whether it could be sped up by Montgomery reduction. Another interesting question is the overhead of GMP itself, and whether the gain in accuracy for large amounts of data is worth the more complex arithmetic. This has been ignored for now; for simplicity, the GMP operations used are assumed to run in constant time.

Breaking the Pattern into Stages

An O(\log m) time per character algorithm is achieved by matching the pattern in \log m stages, where stage i checks that the last 2^i characters of the text match the first 2^i characters of the pattern, for 0 \leq i < \log m. When a new character comes in, we iterate through every stage and look for any viable occurrences (VOs) that can be checked at that stage. If there is one and it matches that stage, we promote the VO to the next stage; otherwise we discard it.

There are two ways we can compare the text to the pattern. In both methods, we keep a running fingerprint of the whole text.

The first is to store for each stage i the fingerprint of P[0:2^i]. At stage 0, if we find a match at location j, then we store the fingerprint of T[0:j] – in other words, the fingerprint of the whole text up to but not including the j-th character. For the i-th stage afterwards, we can use this fingerprint and the fingerprint of the whole text to produce the fingerprint for the last 2^i characters of the text, which we can then compare with the pattern. If it passes, then we promote the VO to the next stage with the original fingerprint from stage 0.

The second is for each stage i to store the fingerprint of P[2^{i-1}:2^i] and P[0] for stage 0. At stage 0, if there is a match then we promote it with the fingerprint of the whole text up to and including the latest character. For each subsequent stage i, we find the fingerprint we need by using the fingerprint promoted from the previous stage and the fingerprint of the whole text to produce the fingerprint of the last 2^{i-1} characters, which we can then compare with the pattern for that stage. If it passes, then we promote the VO to the next stage with the fingerprint of the whole text at that point.

After implementation it became clear that both techniques were very similar – a diff on the source code revealed six lines of changes. For the rest of this work I have gone with the second option due to the fact that it required fewer operations.

At this point the amount of space required was still O(m), as I was storing all viable occurrences. This was compressed to O(\log m) space by taking advantage of the pattern’s periodicity. If more than two VOs are stored at the same time for a given stage, then the pattern prefix for that stage must be periodic, in which case we can simply store the fingerprints of the first and last VOs along with the period. Adding and discarding VOs can still be done in constant time, and the rest of the algorithm is unaffected as we only ever read the first VO for a given stage.
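In code this comes down to a constant-size record per stage; a rough sketch with illustrative field names (reconstructing the fingerprints of intermediate VOs from the stored ones is omitted here):

    #include <gmp.h>

    /* Viable occurrences of one stage. If three or more VOs are alive at
     * the same time, the corresponding pattern prefix is periodic and the
     * VOs form an arithmetic progression, so the first VO, the last VO and
     * the common period describe all of them. */
    typedef struct {
        long first_pos, last_pos;  /* text positions of first and last VO */
        long period;               /* distance between consecutive VOs    */
        long count;                /* number of VOs currently represented */
        mpz_t first_fp, last_fp;   /* fingerprints promoted with them     */
    } stage_vos_t;

    /* Discard the first VO in constant time: the next one, if any, starts
     * exactly `period` positions later. */
    void stage_pop_front(stage_vos_t *s)
    {
        s->first_pos += s->period;
        s->count--;
    }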

Amortised Constant Time per Character

The next step was to make the program run in amortised constant time per character. This is achieved by using Knuth-Morris-Pratt (KMP) to match the first \log m characters of the pattern if these characters are not periodic. If they are periodic, we simply extend the portion of the pattern handled by KMP until either it stops being periodic or the pattern ends.

This extended version can still be compressed to O(\log m) space as the period is guaranteed to have length at most \frac{\log m}{2}. As long as the pattern is periodic, we only need to store the first plen characters of the pattern and the first 2 \cdot plen entries of the failure table – where plen is the length of the period – and the rest can be worked out in constant time. If the period does break, then we only need one extra character to store the character of the pattern that breaks the period, and one integer to store the failure value for that character.
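A sketch of the constant-time lookups this compression allows (assuming, as above, that plen is the shortest period of the prefix handled by KMP; stored_fail and stored_prefix are the explicitly stored portions):

    /* Failure value at index i when only the first 2*plen entries are
     * stored: beyond that point periodicity forces fail(i) = i - plen. */
    int failure_at(const int *stored_fail, int plen, int i)
    {
        return (i < 2 * plen) ? stored_fail[i] : i - plen;
    }

    /* Pattern character at index i when only the first plen characters of
     * the periodic prefix are stored. */
    char pattern_char(const char *stored_prefix, int plen, int i)
    {
        return stored_prefix[i % plen];
    }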

If the whole pattern is periodic, then we can simply perform KMP on the whole pattern, which runs in amortised constant time per character and O(\log m) space thanks to our compression trick. Otherwise, the fact that the prefix handled by KMP is not periodic guarantees that any VOs are at least \log m apart. Since we only have \log m stages, this allows us to process only one VO of one stage per round, bringing each round down to amortised constant time.

One issue with this comes from requiring fingerprints of the text at previous points. The paper fixes this by using a circular buffer to store the last \log m fingerprints of the text. Instead of a circular buffer, I simply use a standard array, using the index of the row I’m currently verifying to dictate which index of the array to write to. This has the disadvantage of requiring a modular operation when reading, but the same is true of many circular buffer implementations which allow rotation.

But another problem is that there can be a delay between reading the next character of the text and a match being reported. This delay is at most \log m characters. There are two methods for solving this:

The first, suggested by the paper, is to check for VOs on the final stage in every round. Since at most two stages are then checked per round, this is still amortised constant time.

The second, proposed by Ben Sach and Markus Jalsenius in conversation, is to use the fingerprint technique to match the first m - \log m characters of the pattern, and use KMP to match the rest. So that a match can still be reported instantly, results from the fingerprint part are stored in an array of size \log m, again read and written with modular indexing. This can also be done in amortised constant time with O(\log m) memory due to the number of characters processed by this second KMP.

Both methods have been implemented. It will be seen which is better once tested on larger data sets.

Deamortising KMP

The only hold-back for this algorithm now is KMP, which runs in amortised constant time per character but in the worst case can take O(m) time for a single character. The paper removes this worst-case running time by using Galil’s adaptation in String Matching in Real Time.

Instead, I use a method first devised by Simon in String Matching Algorithms and Automata and explained by Jalsenius in Pattern Matching in Multiple Streams (slide 23). Instead of using a failure table, this method works out, for each index in the pattern, how far to shift the pattern for each character of the alphabet. To keep space usage low, we only store characters which do not match the pattern at a particular index but still shift the pattern to some index other than -1; the total number of these is at most the length of the pattern handled by KMP.

To store these characters succinctly, I create CHD hash functions using CMPH, a library for generating static minimal perfect hash functions from strings to integers. I then store two arrays: one of keys (characters of the alphabet) and one of values (the pattern index to shift to). Each pattern index matched via KMP has one of these structures, resulting in O(\log m) structures. Most of them have empty arrays and therefore take constant space, and the ones that do have entries have a total size of at most m.
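A sketch of what one of these per-index structures and its lookup might look like (field names and the indexing convention are illustrative, and the CHD construction itself is the same as sketched earlier for the trie):

    #include <cmph.h>

    /* Shift table for one KMP pattern index: most of these are empty and
     * constant size; the non-empty ones have total size at most the
     * length of the pattern handled by KMP. */
    typedef struct {
        cmph_t *hash;       /* CHD hash over keys, or NULL if empty        */
        char  **keys;       /* mismatching characters with a stored shift  */
        int    *shifts;     /* pattern index to move to for each key       */
        unsigned int nkeys;
    } shift_table_t;

    /* Next pattern index after reading character c while at index i:
     * a match advances, a stored mismatch jumps to its shift, and
     * anything else falls back to the start of the pattern. */
    int kmp_next_index(const shift_table_t *t, const char *pattern,
                       int i, char c)
    {
        if (pattern[i] == c)
            return i + 1;
        if (t->nkeys > 0) {
            char key[2] = { c, '\0' };
            unsigned int slot = cmph_search(t->hash, key, 1);
            if (t->keys[slot][0] == c)     /* verify the hashed key */
                return t->shifts[slot];
        }
        return 0;
    }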

One question that arises is whether we can compress these structures, as we did with the pattern and failure table, by taking advantage of the pattern being periodic. The answer is yes; as with the failure table, we only need to store the first 2 \cdot plen tables, and the rest can be worked out from those in constant time.

The disadvantage is that preprocessing is slower. The failure table is still required to work out these shifts, but can be discarded afterwards. The current preprocessing time is O(m|\Sigma|) for KMP. The advantage is that matching with KMP (and thus the whole algorithm) now takes constant time per character.

The current implementation is available here. Future goals include testing performance and moving from this to a parameterised matching variant proposed by Jalsenius, Porat and Sach here and here.

Implementing Parameterised Matching

Parameterised matching is a variant of pattern matching where the text T and pattern P contain symbols from the alphabet \Sigma \cup \Pi, where \Sigma and \Pi are the alphabets of non-variables and variables respectively. Informally, a pattern p-matches a text if all the non-variables match exactly and a one-to-one mapping between variables can be made such that all the variables match.

As an example, if the dictionaries are \Sigma=\{a,b\}, \Pi=\{c,d\}, then ‘aaacd’ p-matches ‘aaadc’, as the map function is simply \{(c,d),(d,c)\}, but ‘aaacd’ does not p-match ‘aaacc’, as both ‘c’ and ‘d’ would map to ‘c’. Likewise, ‘aaacd’ does not p-match ‘aabcd’ as the non-variables don’t exactly match, and ‘aaacd’ does not p-match ‘aaccd’ as this mixes variables and non-variables.

Amir, Farach and Muthukrishnan investigated this in their paper Alphabet Dependence in Parameterised Matching. In it, they proposed an algorithm that runs in O(n\log\pi) amortised time, where \pi = \min(m, |\Pi|). Their solution was to break the problem into two sub-problems:

  1. An exact pattern matching problem between all the non-variables, but where the variables don’t necessarily m-match,
  2. And a mapped-matching problem, where all of the variables m-match but the non-variables don’t necessarily match.

They then argue that the intersection of these two sets of results gives the points which p-match. Their paper also provides an O(n\log\pi) solution to the mapped-matching problem based on the Knuth-Morris-Pratt algorithm for exact matching, which gives the upper bound for the overall algorithm.

The rest of this post is a discussion of interesting points during my experience implementing this algorithm in C.

Construction of the Predecessor Table

A predecessor table A for pattern P is an array where A_i is the index of the previous occurrence of the letter p_i, or i if there is no previous occurrence of p_i.

This can be performed for the whole pattern in O(m\log\pi) time by storing the most recent occurrence of each letter in a balanced search tree. For this, I used an implementation of Red-Black trees from Literate Programs.
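As a simplified illustration (this sketch assumes a byte alphabet so that a plain last-occurrence array can stand in for the balanced search tree; the tree is what gives the O(m\log\pi) bound when the alphabet is large or unbounded):

    #include <string.h>

    /* Build the predecessor table A for a pattern P of length m:
     * A[i] is the index of the previous occurrence of P[i], or i itself
     * if P[i] has not appeared before. */
    void build_predecessor_table(const unsigned char *P, int m, int *A)
    {
        int last[256];
        memset(last, -1, sizeof last);   /* -1: character not seen yet */
        for (int i = 0; i < m; i++) {
            A[i] = (last[P[i]] >= 0) ? last[P[i]] : i;
            last[P[i]] = i;
        }
    }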

A common occurrence with search structures in C is to return a void* type. This means you can return whatever result you found, regardless of type, or NULL if no result was found. This was a problem, however, because NULL compares equal to the integer zero. Since there could be cases where the previous occurrence of a character was at index zero of the pattern, it was difficult to distinguish this case from the case where this is the first occurrence of the character in the pattern.

I resolved this problem in the end by modifying the code for the Red-Black tree. Instead of simply returning NULL, a default value is passed in as a parameter to be used instead. The change can be viewed here. This meant that I could do a search for p_i and specify the call to return i if no results are found.

Implementing the Compare Function

The m-matching algorithm works by implementing the KMP algorithm for exact matching, but replacing any equals operation with a Compare function written for this problem. The Compare function is specified on page 4 of the paper linked above, and essentially checks that there is the same distance between each character and its previous occurrence.

Compare(p_i, t_j) was simple enough to implement naively and then improve to the stated running time by again storing the last occurrence of each text character in a Red-Black tree. But there were issues with Compare(p_i, p_j), which would fail for any case other than the first occurrence of a given letter. This can also be seen in the mathematics, in the second case of the function:

i - A[i] \geq j - 1 \Rightarrow 0 \geq j - (i - A[i]) - 1 \Rightarrow 0 \geq j - i + A[i] - 1 \Rightarrow 1 \geq j - i + A[i]

Because arrays in this paper are one-indexed, this meant that this second case would only succeed if p_j = p_1. This did not seem correct.

I resolved this by re-writing the Compare function as follows, for i \leq j:

Compare(p_i, p_j)
        if i - A[i] = j - A[j] return equals
        if A[i] = i and j - A[j] > i return equals
        return not equals
end

The first case in this function checks that either the distance to each character’s previous occurrence is the same, or that this is the first occurrence of both characters. The second case is where this is the first occurrence of p_i and we have seen p_j before, but far enough back that it occurs before the start of the pattern.

The function Compare(p_i, t_j) was also re-written in this style, using the result of the Red-Black tree lookup in place of A[j]. Both functions have the same running times as in the original paper: O(\log\pi) for Compare(p_i, t_j) and O(1) for Compare(p_i, p_j).
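A direct C transcription of the rewritten functions (a sketch; the indices follow the convention of the pseudocode above, and prev_tj stands for the Red-Black-tree lookup of the previous occurrence of t_j, or j itself if there is none):

    enum { NOT_EQUAL = 0, EQUAL = 1 };

    /* Compare two pattern positions i <= j using the predecessor table A. */
    int compare_pp(const int *A, int i, int j)
    {
        if (i - A[i] == j - A[j])
            return EQUAL;       /* same distance back, or both first seen */
        if (A[i] == i && j - A[j] > i)
            return EQUAL;       /* p_i is new; p_j last occurred before the
                                   start of the pattern */
        return NOT_EQUAL;
    }

    /* Compare pattern position i against text position j. */
    int compare_pt(const int *A, int i, int j, int prev_tj)
    {
        if (i - A[i] == j - prev_tj)
            return EQUAL;
        if (A[i] == i && j - prev_tj > i)
            return EQUAL;
        return NOT_EQUAL;
    }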

Streaming Adjustments

After p-matching was successfully implemented for a fixed text and pattern, the only task left was to adjust it to make it streamable. This was trivial: because KMP iterates through each character of the text only once, a streaming version could be made by storing the internal state of the algorithm in a structure and turning the main loop into its own function.

Note that this only guarantees O(\log\pi) amortised streaming time per character. In the worst case scenario, the streaming time is O(m\log\pi) per character. There is a method of deamortising this algorithm by Clifford and Sach, which will be a future goal of this project.

The current implementation is available here. Note that current tests are only for functional verification. Testing performance – in particular with different search structures – is one of the next goals with this implementation.

Analysis of Range Minimum Query Implementations

Range Minimum Queries (RMQ) are a common dependency of pattern matching algorithms. The problem is: given a list A of n numbers and two integers i, j with 0 \leq i \leq j < n, find the smallest number in the range A[i:j].

During my first week, one of the tasks I did was to look at a few implementations of RMQ algorithms and compare their performance. The two implementations I analysed were based on the Segment Tree and Sparse Table approaches, originally implemented by Shrey Banga. I adapted his testing code to accept queries as input and write the results to a file, as well as to display performance statistics.
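For reference, a generic sketch of the Sparse Table approach (not Banga’s code): the table stores the minimum of every window whose length is a power of two, after which any query is the minimum of two overlapping windows.

    #include <stdlib.h>

    /* Sparse Table RMQ: O(n log n) preprocessing, O(1) per query.
     * sparse[k][i] holds the minimum of A[i .. i + 2^k - 1]. Assumes n >= 1. */
    typedef struct {
        int **sparse;
        int  *logtab;    /* floor(log2(x)) for 1 <= x <= n */
        int   n, levels;
    } rmq_t;

    rmq_t *rmq_build(const int *A, int n)
    {
        rmq_t *r = malloc(sizeof *r);
        r->n = n;
        r->logtab = malloc((n + 1) * sizeof *r->logtab);
        r->logtab[1] = 0;
        for (int x = 2; x <= n; x++)
            r->logtab[x] = r->logtab[x / 2] + 1;
        r->levels = r->logtab[n] + 1;

        r->sparse = malloc(r->levels * sizeof *r->sparse);
        for (int k = 0; k < r->levels; k++)
            r->sparse[k] = malloc(n * sizeof **r->sparse);

        for (int i = 0; i < n; i++)
            r->sparse[0][i] = A[i];
        for (int k = 1; k < r->levels; k++)
            for (int i = 0; i + (1 << k) <= n; i++) {
                int a = r->sparse[k - 1][i];
                int b = r->sparse[k - 1][i + (1 << (k - 1))];
                r->sparse[k][i] = a < b ? a : b;
            }
        return r;
    }

    /* Minimum of A[i..j] inclusive, for 0 <= i <= j < n. */
    int rmq_query(const rmq_t *r, int i, int j)
    {
        int k = r->logtab[j - i + 1];
        int a = r->sparse[k][i];
        int b = r->sparse[k][j - (1 << k) + 1];
        return a < b ? a : b;
    }

The Segment Tree version trades this O(n\log n) table for an O(n) tree, at the cost of an O(\log n) walk per query – the space/time trade-off visible in the results below.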

The performance is measured as the time taken to perform all the queries, using the system clock, and the total and stack memory as reported by /proc/self/statm.

I wrote some simple Python scripts to generate the test data and verify the results, a third script using NumPy and MatPlotLib to analyse the performance, and some small bash scripts to tie things together.

Both data structures were run on the same input. Each input consisted of a randomly generated list and 10000 random queries on that list. The size of each list ranged from 5000 to 100000 items in steps of 5000. For each size, 10 different lists and query sets were run and the results averaged. Note that because these tests were randomly generated, they cannot be used to analyse worst-case behaviour.

This was run on a HP laptop with an AMD A6 processor and 7.2GB of RAM.

The final results – as well as profiles for other sizes – can be seen in my fork of the repository here. Graphical analysis of the performance is presented below.

[Figure: data structure space against problem size]

[Figure: query time against problem size]

The first thing I noticed was the way the space usage scaled with problem size, remaining relatively flat but then jumping every so often. These jumps aren’t a coincidence: because both solutions divide the problem into bins whose sizes are powers of two, each time the problem size passes a power of two the data structures have to add another level.

The second interesting point is the large amount of noise in the time results. This is due to only being able to achieve millisecond accuracy with this number of queries and this size of list. Larger problem sizes and more queries are possible, but would require more time for verification.

Despite the noise, a general trend can be seen: Segment Trees take more time, while Sparse Tables take more space. This matches the theoretical worst-case analysis.

It is worth noting that these aren’t the theoretically best-performing solutions to this problem. A variant of Sparse Tables using low-resolution RMQs takes O(n\log\log n) space with constant query time, and another using Lowest Common Ancestor (LCA) queries takes O(n) space with constant query time. Despite this theoretical performance, the only implementation I could find of either was a JavaScript implementation of RMQ-via-LCA here. This could be because of how difficult they are to implement, or because of how large the problem has to be before the performance benefit shows. Either way, this could be an interesting area to develop another time.

Introduction

Under the advice of my Supervisor, I have set up this blog in order to keep track of my progress throughout this Summer project. So what better way to start than explaining what this project exactly is.

Pattern matching is a famous and well-developed problem in Computer Science and Mathematics: ‘Does this piece of text occur within this other piece of text, and if so, where?’[1] There are already many solutions to the basic problem, such as Suffix Trees, which have total space and running time linear in the size of the text and the pattern. So what is there to improve?

Quite a bit, actually. Algorithms like the one mentioned above require the whole of the text and pattern to be stored in memory. How practical is this? Well, the Pizza & Chili Corpus test their text indexing on up to two gigabytes of English text. UniRef, the reference protein sequence cluster from UniProt is 16GB in size, and the even larger UniParc is 21GB when compressed. Even these are relatively small data sets for protein matching.

To alleviate this issue, research is currently being conducted into techniques based on data streams. Unlike traditional pattern matching where the whole text is available from the start, in streaming algorithms we only receive a portion of the text in a buffer. More text is added over time, and for each character that enters the buffer, one character at the other end is popped off. For a broader look at streaming algorithms, see Data Streams: Algorithms and Applications by S. Muthukrishnan.

There has already been significant development in the theoretical side of this area, with several algorithms being developed to work well under these conditions. But many of these algorithms have never been practically implemented, and some even require underlying building blocks that have never been written in physical code. This is where my work comes in.

The aims of this project are:

  1. To investigate current tools and projects, including primitives and dependent problems.
  2. To implement and analyse several algorithms to see where the bottlenecks actually lie.
  3. To improve these implementations and find compromises between what works well in theory and what works well in practice.

This project is being conducted over eight weeks at the University of Bristol, under the supervision of Dr. Raphaël Clifford.

[1] Text is used as just one example, of course. Other examples could be images, binary digits or even more complex data structures, as long as there is some means of equating them.