## A Note About this Blog

Now that I have started my Master’s dissertation which is on a related topic, this blog is being re-purposed for documenting my progress through until May.

My final year project is on the topic of pattern matching with fingerprints again, but is now looking at the problem of dictionary matching. Dictionary matching is a generalisation of exact pattern matching, where multiple patterns are matched against a single text.

This is defined formally as follows: We have a text of length . We have a set of patterns of corresponding lengths . The total size of the dictionary is denoted and the length of the longest pattern is .

## Aho-Corasick

The standard solution to this problem is the Aho-Corasick algorithm, first published in 1975. The Unix command fgrep, which searches files for multiple fixed patterns, was originally based on this algorithm.

Aho-Corasick is based on the Knuth-Morris-Pratt algorithm for single patterns. Instead of having a single pattern to match against, it uses a trie composed of multiple patterns. Every vertex in the trie has a failure edge, which links the vertex to the previous vertex with the longest prefix that matches its suffix. When a mismatch occurs between the text and the trie, we travel up the failure edges until we either hit the root or a match is found. This takes worst case time per character.

Aho and Corasick’s paper then provides a method of reducing this to time per character by preprocessing. For each vertex, we work out the next verted it should move to for a given character based on the child and failure edges, and store these in a table. This removes the need to traverse many failure edges when we find a mismatch.

Implementing Aho-Corasick in the streaming model is easily done. Each character of the text is only read once and they are read in order, so the only change we need to implement is to feed the algorithm one character at a time.

The asymptotic complexity of the algorithm is space and time per character. Note that the time complexity relies on the alphabet being of a fixed size; a method of fixing this is presented below.

## Implementing the Trie

There were two methods attempted to implement the trie.

The first was based on an object-oriented approach, with each vertex being represented by a data structure, and pointers leading to its children and the failure item. This method was quickly abandoned due to difficulty with memory allocation.

The second was based on an idea similar to an adjacency matrix. Each state was represented by an integer in order of when they were added to the trie. The next-move function was represented by two 2-D arrays: the first of integers to represent the states to move to for a given node, and the second of characters to represent the keys for each next-move. For space efficiency, if the next move for a state and character is not in the array, then the machine moves to the root state.

The second technique was chosen due to simpler memory allocation.

## Removing the Fixed Alphabet Constraint

As already stated, these run times assume that the alphabet is of a fixed size. This is not guaranteed in the streaming model, meaning that the run-time for looking up a next move is currently , where is the alphabet of the text and pattern.

This can be sped up by hashing. Once the next-move function is computed, the C Minimum Perfect Hashing library (CMPH) is used to map each of the next-move keys to the index of a smaller array in constant time. The vertex to move to next is stored in this index, which gives us a constant running time for looking up the next move.

Note that the preprocessing time will still depend on the size of the alphabet, as hashing still is not used at this point.

## Next Step

The Aho-Corasick algorithm was implemented as a baseline for testing the next algorithm. This algorithm, in development with the Algorithms Team at the University of Bristol, utilises the Rabin-Karp fingerprinting method of before and techniques similar to Porat and Porat’s approach to achieve dictionary matching in space.