Under the advice of my Supervisor, I have set up this blog in order to keep track of my progress throughout this Summer project. So what better way to start than explaining what this project exactly is.
Pattern matching is a very famous and well-developed problem in Computer Science and Mathematics: ‘Does this piece of text occur in this piece of text, and if so where?’ There are already many solutions to the basic problem, such as Suffix Trees, which have a total space and running time linear to the size of the text and the pattern. So what is there to improve?
Quite a bit, actually. Algorithms like the one mentioned above require the whole of the text and pattern to be stored in memory. How practical is this? Well, the Pizza & Chili Corpus test their text indexing on up to two gigabytes of English text. UniRef, the reference protein sequence cluster from UniProt is 16GB in size, and the even larger UniParc is 21GB when compressed. Even these are relatively small data sets for protein matching.
To alleviate this issue, research is currently being conducted into techniques based on data streams. Unlike traditional pattern matching where the whole text is available from the start, in streaming algorithms we only receive a portion of the text in a buffer. More text is added over time, and for each character that enters the buffer, one character at the other end is popped off. For a broader look at streaming algorithms, see Data Streams: Algorithms and Applications by S. Muthukrishnan.
There has already been significant development in the theoretical side of this area, with several algorithms being developed to work well under these conditions. But many of these algorithms have never been practically implemented, and some even require underlying building blocks that have never been written in physical code. This is where my work comes in.
The aims of this project are:
- To investigate current tools and projects, including primitives and dependent problems.
- To implement and analyse several algorithms to see where the bottlenecks actually lie.
- To improve these implementations and find compromises between what works well in theory and what works well in practice.
This project is being conducted over eight weeks at the University of Bristol, under the supervision of Dr. Raphaël Clifford.
 Text is used as just one example, of course. Other examples could be images, binary digits or even more complex data structures, as long as there is some means of equating them.