Over the past two weeks my work has been focused on implementing exact pattern matching in constant time per character of text using O(log m) space, where m is the length of the pattern. An algorithm for this has been devised by Breslauer and Galil in their paper Real-Time Streaming String-Matching.
This post provides a brief overview of the algorithm as well as some interesting points from my implementation.
The main trick in achieving O(log m) space is to use a compression technique first used in Karp and Rabin's paper Efficient Randomized Pattern-Matching Algorithms, where given a random prime p and a random number r, the fingerprint of a string s of length ℓ is φ(s) = (s[1]·r + s[2]·r^2 + … + s[ℓ]·r^ℓ) mod p. Because this is a compressed format of the string, there is a chance of collisions. This can be reduced by increasing the size of p. More precisely, if p ≥ n^(2+α) for some α > 0, where n is the length of the text, then the probability of a collision is smaller than 1/n^(1+α).
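As a concrete illustration, here is a minimal fixed-precision sketch of the fingerprint function φ(s) = Σ s[i]·r^i mod p. This is not the GMP-backed implementation described below – it uses 64-bit arithmetic with a 128-bit intermediate, and the names (`mulmod`, `fingerprint`) and the small constants are mine:

```c
#include <stdint.h>
#include <stddef.h>

/* Modular multiplication via a 128-bit intermediate, valid for p < 2^63.
   (The real implementation sidesteps this limit with GMP.) */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p) {
    return (uint64_t)((__uint128_t)a * b % p);
}

/* Karp-Rabin fingerprint: phi(s) = sum_{i=1}^{len} s[i] * r^i  (mod p). */
uint64_t fingerprint(const char *s, size_t len, uint64_t r, uint64_t p) {
    uint64_t phi = 0, rpow = r;  /* rpow holds r^i as i advances */
    for (size_t i = 0; i < len; i++) {
        phi  = (phi + mulmod((uint64_t)(unsigned char)s[i], rpow, p)) % p;
        rpow = mulmod(rpow, r, p);
    }
    return phi;
}
```

In practice p is a large random prime and r a random element of [1, p−1]; the toy values used when exercising this sketch are purely illustrative.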
This is simple enough in theory, but is in fact rather restricting in practice. If we want our text to be on the scale of GB in size (i.e. n ≈ 2^30), then on a 64-bit system our accuracy level α can be at most roughly 0.13, since p ≈ n^(2+α) must fit in a machine word. If we want a larger degree of accuracy, then our calculation of the fingerprints modulo p will overflow.
To avoid this, I implemented a library for Rabin-Karp fingerprinting using GNU Multiple Precision Arithmetic (GMP). GMP provides tools and functions for arbitrarily large integer arithmetic, and is an industry standard for fast computation of large numbers.
The other advantage of GMP is that it provides a collection of fast number theory functions, including a probabilistic function for finding prime numbers. This means we need not worry about the computation of our prime number p.
At first glance, it may seem that we require a large number of modular exponentiations for these fingerprints. But this is not true; the only place where a modular exponentiation is needed is in calculating the inverse r^-1 mod p, which is done once, and the rest can be performed with modular multiplications. Similarly, any concatenation of two fingerprints, or recovering the fingerprint of a prefix/suffix from two fingerprints, can be done with additions, subtractions and modular multiplications, thus avoiding exponentiations for these operations too.
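The fingerprint algebra can be sketched as follows, again in fixed precision rather than GMP. The "sketch" struct carrying r^ℓ and r^-ℓ alongside φ, and all the names here, are my own; the single exponentiation computes r^-1 once, after which every operation is additions, subtractions and modular multiplications:

```c
#include <stdint.h>
#include <stddef.h>

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p) {
    return (uint64_t)((__uint128_t)a * b % p);
}

/* The one exponentiation: used once to obtain r^-1 = r^(p-2) mod p. */
static uint64_t powmod(uint64_t b, uint64_t e, uint64_t p) {
    uint64_t acc = 1;
    for (; e; e >>= 1) {
        if (e & 1) acc = mulmod(acc, b, p);
        b = mulmod(b, b, p);
    }
    return acc;
}

/* Fingerprint sketch: phi(s) together with r^len and r^-len, so that
   concatenation and suffix recovery need no further exponentiations. */
typedef struct {
    uint64_t phi;  /* sum s[i] * r^i mod p */
    uint64_t rl;   /* r^len mod p  */
    uint64_t rli;  /* r^-len mod p */
} fp_t;

/* Build a sketch directly from a string (rinv = r^-1 mod p). */
fp_t fp_make(const char *s, size_t len, uint64_t r, uint64_t rinv, uint64_t p) {
    fp_t f = { 0, 1, 1 };
    uint64_t rpow = r;
    for (size_t i = 0; i < len; i++) {
        f.phi = (f.phi + mulmod((uint64_t)(unsigned char)s[i], rpow, p)) % p;
        f.rl  = mulmod(f.rl, r, p);
        f.rli = mulmod(f.rli, rinv, p);
        rpow  = mulmod(rpow, r, p);
    }
    return f;
}

/* Concatenation: phi(uv) = phi(u) + r^|u| * phi(v). */
fp_t fp_concat(fp_t u, fp_t v, uint64_t p) {
    fp_t uv;
    uv.phi = (u.phi + mulmod(u.rl, v.phi, p)) % p;
    uv.rl  = mulmod(u.rl, v.rl, p);
    uv.rli = mulmod(u.rli, v.rli, p);
    return uv;
}

/* Suffix recovery: given phi(uv) and phi(u),
   phi(v) = (phi(uv) - phi(u)) * r^-|u|. */
fp_t fp_suffix(fp_t uv, fp_t u, uint64_t p) {
    fp_t v;
    v.phi = mulmod((uv.phi + p - u.phi) % p, u.rli, p);
    v.rl  = mulmod(uv.rl, u.rli, p);
    v.rli = mulmod(uv.rli, u.rl, p);
    return v;
}
```

The suffix operation is exactly what the sliding-window comparison below relies on: subtract the fingerprint of an old prefix of the text from the fingerprint of the whole text, and rescale.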
Because of the number of modular multiplications being performed in this process, an interesting avenue would be to see whether it could be sped up by Montgomery reduction. Another question is the overhead of GMP itself, and whether the accuracy benefit for large amounts of data is worth the more complex arithmetic. This has been ignored for now; for the sake of simplicity the GMP operations used have been assumed to run in constant time.
Breaking the Pattern into Stages
An O(1) time per character algorithm is achieved by matching the pattern in O(log m) stages, where stage i checks that the last 2^i characters of the text match the first 2^i characters of the pattern, for each i up to log m. When a new character comes in, we iterate through every stage and look for any viable occurrences (VOs) that we can check for that stage. If there is one and it matches that stage, then we promote that VO to the next stage, otherwise we discard it.
There are two ways we can compare the text to the pattern. In both methods, we keep a running fingerprint of the whole text.
The first is to store for each stage i the fingerprint of the pattern prefix P[1..2^i]. At stage 0, if we find a match at location j, then we store the fingerprint of T[1..j-1] – in other words, the fingerprint of the whole text up to but not including the j-th character. For the i-th stage afterwards, we can use this fingerprint and the fingerprint of the whole text to produce the fingerprint of the last 2^i characters of the text, which we can then compare with the pattern prefix for that stage. If it passes, then we promote the VO to the next stage with the original fingerprint from stage 0.
The second is to store for each stage i ≥ 1 the fingerprint of the pattern block P[2^(i-1)+1..2^i], and the fingerprint of P[1] for stage 0. At stage 0, if there is a match then we promote it with the fingerprint of the whole text up to and including the latest character. For each subsequent stage i, we use the fingerprint promoted from the previous stage and the fingerprint of the whole text to produce the fingerprint of the last 2^(i-1) characters, which we can then compare with the pattern block for that stage. If it passes, then we promote the VO to the next stage with the fingerprint of the whole text at that point.
After implementation it became clear that both techniques were very similar – a diff on the source code revealed six lines of changes. For the rest of this work I have gone with the second option, because it requires fewer operations.
At this point the amount of space required was still O(m), as I was storing all viable occurrences. This was compressed to O(log m) space by taking advantage of the pattern's periodicity. If more than two VOs are being stored at the same time for a given stage, then the pattern prefix for that stage must be periodic, in which case we can simply store the fingerprints for the first and last VOs and the period. Adding and discarding VOs can still be done in constant time, and the rest of the algorithm is unaffected as we only ever read the first VO for a given stage.
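The compressed VO storage can be sketched like this (positions only – the real structure also carries the promoted fingerprints; the struct and function names are mine). It relies on the periodicity argument above: when three or more VOs for one stage are alive at once, their positions form an arithmetic progression:

```c
#include <stdbool.h>
#include <stddef.h>

/* Compressed list of viable occurrences for one stage: first position,
   last position and the common difference (the period) suffice. */
typedef struct {
    size_t first;   /* position of oldest VO */
    size_t last;    /* position of newest VO */
    size_t period;  /* gap between consecutive VOs (0 if fewer than 2) */
    size_t count;
} vo_list_t;

/* Append a new VO in O(1); returns false if it would break the
   arithmetic progression, which the periodicity argument rules out. */
bool vo_push(vo_list_t *l, size_t pos) {
    if (l->count == 0) {
        l->first = l->last = pos;
        l->period = 0;
    } else if (l->count == 1) {
        l->period = pos - l->last;
        l->last = pos;
    } else {
        if (pos - l->last != l->period) return false;
        l->last = pos;
    }
    l->count++;
    return true;
}

/* Remove and return the oldest VO -- the only one the algorithm reads. */
size_t vo_pop(vo_list_t *l) {
    size_t pos = l->first;
    l->first += l->period;
    l->count--;
    return pos;
}
```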
Amortised Constant Time per Character
The next step was to make the program run in amortised constant time per character. This is achieved using Knuth-Morris-Pratt (KMP) to match the first O(log m) characters of the pattern if these characters are not periodic. If they are periodic, then we simply extend the portion of the pattern handled by KMP until either it stops being periodic or the pattern ends.
This extended version can still be compressed to O(log m) space, as the period is guaranteed to be at most half the length of the KMP prefix. As long as the pattern is periodic, we only need to store the first ρ characters of the pattern and the first ρ elements of the failure table – where ρ is the length of the period – and the rest can be worked out in constant time. If the period does break, then we only need one extra character to store the character of the pattern that breaks the period, and one integer to store the failure value for that character.
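The compressed failure table can be sketched as follows. I assume the usual convention that fail(i) is the length of the longest proper border of the first i characters, so that for a prefix with shortest period ρ, fail(i) = i − ρ once i exceeds the stored portion; the struct and names are mine:

```c
#include <stddef.h>

/* Compressed KMP failure function for a periodic pattern prefix:
   only the first few entries are stored explicitly; beyond them,
   periodicity gives fail(i) = i - rho in constant time. */
typedef struct {
    const size_t *stored;  /* explicit entries fail(1..stored_len) */
    size_t stored_len;
    size_t rho;            /* length of the period */
} cfail_t;

size_t cfail_get(const cfail_t *f, size_t i) {
    if (i <= f->stored_len) return f->stored[i - 1];
    return i - f->rho;  /* longest border of a periodic prefix */
}
```

For example, for the prefix "ababab" (period ρ = 2) the full failure table is 0, 0, 1, 2, 3, 4, but only the first two entries ever need to be stored.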
If the whole pattern is periodic, then we can simply perform KMP on the whole pattern, which runs in amortised constant time per character and O(log m) space due to our compression trick. Otherwise, the KMP prefix not being periodic guarantees that any VOs are at least half its length apart. Since we only have O(log m) stages, this allows us to process only one VO of one stage per round, thus bringing each round down to amortised constant time.
One issue with this comes from requiring fingerprints of the text at previous points. The paper fixes this by using a circular buffer to store the last O(log m) fingerprints of the text. Instead of a circular buffer, I have simply used a standard array, using the index of the row I'm currently verifying to dictate which index of the array to write to. This has the disadvantage of requiring a modular operation for reading, but this is also true of many circular buffer implementations which allow for rotation.
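The array-instead-of-circular-buffer idea amounts to the following small sketch, where the capacity `HIST` and all names are mine; the step counter itself selects the slot, so no separate head pointer is rotated:

```c
#include <stdint.h>
#include <stddef.h>

#define HIST 64  /* capacity: at least the number of past fingerprints needed */

/* Plain array standing in for a circular buffer: the fingerprint of the
   text after character t lives at slot t % HIST, silently overwriting
   entries too old to ever be read again. */
typedef struct {
    uint64_t slot[HIST];
} history_t;

void hist_put(history_t *h, size_t t, uint64_t phi) {
    h->slot[t % HIST] = phi;
}

/* Valid only while t is within the last HIST steps -- the algorithm's
   access pattern guarantees this. The % is the modular read the post
   mentions. */
uint64_t hist_get(const history_t *h, size_t t) {
    return h->slot[t % HIST];
}
```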
But another problem is that there can be a delay between reading in the next character of the text and a match being reported. This delay is at most O(log m) characters. There are two methods for solving this:
The first, suggested by the paper, is to check for VOs on the final stage in every round. Since this still means at most two stages are checked per round, the algorithm remains amortised constant time.
The second, proposed by Ben Sach and Markus Jalsenius in conversation, is to use the fingerprint technique to match all but the last O(log m) characters of the pattern, and use KMP to match the rest. So that a match can still be reported instantly, results are stored in an array of size O(log m), again requiring modular operations to read from and write to. This can also be done in amortised constant time with O(log m) memory due to the small number of characters processed by this second KMP.
Both methods have been implemented. It will be seen which is better once tested on larger data sets.
The only hold-back for this algorithm now is KMP, which runs in amortised constant time per character, but in the worst case a single character can take time proportional to the length of the pattern handled by KMP. The paper removes this worst-case running time by using Galil's adaptation in String Matching in Real Time.
Instead, I use a method first devised by Simon in String Matching Algorithms and Automata and explained by Jalsenius in Pattern Matching in Multiple Streams (slide 23). Instead of using a failure table, this method precomputes, for each index in the pattern, how far to shift the pattern for each character in the alphabet. To keep space usage low, we only store characters which don't match the pattern at a particular index and result in us shifting the pattern to some index other than -1; the total number of these is at most the length of the pattern handled by KMP.
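The per-index transition structure can be sketched like this, with a linear scan standing in for the perfect-hash lookup described next; the struct, names, and the convention that state i means "i pattern characters matched" are mine:

```c
#include <stddef.h>

/* Sparse automaton transitions for one pattern index, in the style of
   Simon's construction: the matching character advances the state, and
   a mismatch falls back to the start unless it appears in the small
   exception list of mismatches that still preserve a partial match. */
typedef struct {
    char match;        /* pattern[i]: the one advancing character */
    const char *keys;  /* exceptional mismatch characters */
    const int *vals;   /* state to shift to for each key */
    size_t n;          /* exceptions here (at most pattern-length total) */
} strans_t;

/* Next state after reading c in state i (i characters matched so far). */
int strans_next(const strans_t *t, int i, char c) {
    if (c == t->match) return i + 1;
    for (size_t k = 0; k < t->n; k++)
        if (t->keys[k] == c) return t->vals[k];
    return 0;  /* default: restart from the beginning */
}
```

For the pattern "aba" in state 1 (having matched "a"), reading 'b' advances to state 2, while the mismatch 'a' is an exception that stays in state 1 rather than restarting. In the real implementation, a minimal perfect hash over the exception keys replaces the linear scan, making the lookup constant time.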
To store these characters succinctly, I create CHD hash functions using CMPH (C Minimal Perfect Hashing), a library for generating static minimal perfect hash functions from strings to integers. I then store two arrays of keys (characters in the alphabet) and values (the pattern index to shift to). Each index in the pattern matched via KMP has one of these structures, giving one structure per index. Most of these structures have empty arrays and therefore take up constant space, and the ones that do have entries have total size at most the length of the pattern handled by KMP.
One question that arises is whether we can compress these structures, as we did with the pattern and failure table, by taking advantage of the pattern being periodic. The answer is yes; like the failure table, we only need to store the tables for the first period of the pattern, and the rest can be worked out from those in constant time.
The disadvantage is that preprocessing is slower: the failure table is still required to work out these shifts, though it can be discarded after this step, and building the shift tables adds to the preprocessing cost for KMP. The advantage is that matching for KMP (and thus the whole algorithm) is now constant time per character.
The current implementation is available here. Future goals include testing performance and moving from this to a parameterised matching variant proposed by Jalsenius, Porat and Sach here and here.