In this post, we will describe our quest to create Alfred–Margaret, the fastest Haskell implementation of the Aho–Corasick string searching algorithm, which powers string search in Channable.
Channable is a feed processing tool where users can define rules to optimize their product feeds. Oftentimes these rules rely on substring searching, for example to categorize products based on keywords that occur in the title, or to exclude products that use inappropriate words in their description. Substring searching is also a subproblem of string replacement, which is another common operation for us. Because substring searching makes up such a big chunk of our workload, we need it to be fast in order to keep our tool fast and responsive.
Typically, a rule will have anywhere from one to a few hundred needles (strings to search for), which will be used to search through possibly millions of products. Aho–Corasick is a good string matching algorithm in this situation: it constructs a data structure once, which can then be used to search through many haystacks (strings to search through). In theory this algorithm is efficient: the time complexity is linear in the length of the haystack plus the number of matches, and independent of the number of needles. Although algorithmic complexity definitely matters, in practice a fast string matching implementation is even more important.
In the sections to follow, we will touch upon the algorithm briefly, and then explain how we got from high-level Haskell to tight machine code that outperforms the fastest Rust implementation we could find.
Before embarking on our own implementation, we decided to check what libraries are out there, to see what kind of performance is possible, and to evaluate if we could use an off-the-shelf implementation directly. Our feed processing system is written in Haskell, so we were open to binding a library with a C ABI through the foreign function interface. And if we would have to write our own, having a fast implementation to take inspiration from would also be useful.
We constructed a benchmark to evaluate string matching implementations. For a realistic comparison, we took a sample of the 242 slowest rules that our users had created, and dumped their needles and haystacks. This amounted to 3.5 gigabytes of UTF-16 data, with the number of needles per sample ranging from 1 to 5136, and a median needle length of 22 bytes. Because we use the
Text type for strings in Haskell, which is an array of UTF-16 code units in native endianness, the benchmarks would take UTF-16 data as input. The goal was to count the number of matches as quickly as possible. The measured time includes the time to construct the automaton.
We considered the following contenders:
- A naive Python script that loops over all needles and calls
str.findin a loop to find all matches. This is not Aho–Corasick, but it provides a baseline for timing, and it provides a reference implementation which proved very valuable for checking correctness. We use Python 3.6.7.
- A Haskell implementation by yuttie. This is the library we used before we started our optimization journey. We used version 0.2.0.5 of this library on GHC 8.2.2.
- A Java implementation by hankcs. We used this library in the precursor to our current feed processing system, which was written in Scala. We use version 1.2.0 of this library on OpenJDK 10.0.2.
- A C implementation by mischasan. This library looked promising, but we could not get its test suite to build, so we moved on.
- A Rust implementation by BurntSushi, which is used in Ripgrep. We used version 0.6.8 of the library on Rust 1.29.2.
- Finally, to establish a lower bound, we benchmarked a Rust program that just calls
to_vec()on the haystack, the equivalent of a
Our benchmarks established the following times for working through all 3.5 gigabytes of test data once:
The times shown are the means over 25 runs, the error bars indicate standard deviation.
A few things to note here:
The library we use currently is far from optimal!
The Python implementation decodes the UTF-16 input at load time. The runtime representation need not be UTF-16, so it has less bytes to traverse than the other contenders.
The Rust library only supports byte slices. If we would use this library, transcoding UTF-16 to UTF-8 on every FFI call would be expensive, and for string replacement we need match positions in the UTF-16 text. Therefore we match on the UTF-16 bytes directly. This means that the program is effectively doing twice the work: it treats every code unit as two characters.
The naive Python implementation is surprisingly fast, given that Python is an interpreted language, and it implements a suboptimal algorithm. There are two explanations for this. Firstly, the program spends most of its time in the heavily-optimized
str.findfunction, which is implemented in C. Interpreter overhead is small. Secondly, most samples have few needles, so looping over all needles is not so bad. Half of the samples have only 1 or 2 needles.
Clearly there was plenty of room for speeding up our string matching. We considered binding the Rust library, but due to the way that Haskell’s foreign function interface interacts with the garbage collector, passing a string to a foreign function requires copying it. That copy alone puts any foreign library at a significant disadvantage, so we decided to see how close we could get in Haskell alone.
The Aho–Corasick Automaton
Before we can explain our other optimizations, we need to explain a bit of how Aho–Corasick works. At the heart of the algorithm is an automaton. The construction of this automaton is beyond the scope of this post. All we need to know for now is that the automaton consists of states and transitions between them. Transitions between states correspond to input characters. Additionally, every state apart from the initial state has a failure transition to follow when there is no transition for the current input character. Every state can have zero or more needles associated with it. When we enter a state, we report all of the associated needles as matched. There can be multiple needles per state when needles overlap.
To traverse the haystack, we take one or more steps in the automaton for every input character. If at the current state, there is a transition for the current input, then we take it, and move on to the next input character. If there is no transition, we first follow the failure transition and try again from there, until we end up in the initial state. If there is no transition from the initial state, we move on to the next input character.
To give an example, suppose we we wanted to match tshirt, shirts, and shorts. We would use the following automaton:
State 0 is the initial state. The solid arrows indicate transitions associated with an input character, whereas the dashed arrows indicate a failure transition. For the sake of readability, failure transitions back to the initial state are not drawn.
When matching on the text short tshirts, we would step through the following states:
Input: s h o r t _ t s h i r t s State: 0 7 8 13 14 15 0 1 2 3 4 5 6 12 Match: ^ ^
For the first few input characters, we follow the input arrows. In state 15, when there is no transition for space, we first fall back to state 1, which has no transition for space either, and then we fall back to state 0. (Recall that the dotted line to state 0 is not drawn in the diagram.) In state 6, we report having matched tshirt and in state 12 we report having matched shirts. Now that we have seen how to step the automaton conceptually, we can start to consider implementation details.
As is often the case in performance sensitive code, the key to our fast implementation was to use the right representation of the automaton in memory. To step the automaton, we need to keep track of three things for each state:
- A transition table: a mapping from input character to next state.
- The state where the failure transition leads.
- A (possibly empty) set of needles to report as matched in this state.
The state itself is an integer, and in building the automaton we can assign consecutive unsigned integers to states. In the worst case, the number of states is the sum of the length of the needles, so we opted for a 32-bit integer.
For the transition table, a hash map is an obvious data structure, and it is also what yuttie’s library uses. However, unlike array-backed mutable hash tables in Rust or C, Haskell’s
HashMap is implemented as a tree, a pointer-heavy data structure that is less cache-friendly than a flat array. The tree admits efficient inserts with structural sharing (after the insert, both the old and the new hash map are available), but we do not need any of that. We only care about the performance of lookups. Building the automaton is a one-time task, whereas we need to traverse the automaton over and over.
We could build our own hash tables on top of unboxed vectors, a safe Haskell abstraction with the same memory representation as a Rust slice or C array. But do we need a hash table at all? It turns out that most transition tables are small. In the example in the previous section, 12 of the 17 states have a single transition (not counting the failure transition), three states have only the failure transition, and the remaining two have two non-failure transitions. So we tried an unordered array with linear search instead. We packed the 16-bit input code unit and the 32-bit go-to state together in a 64-bit integer, and now the transition table was a cache-friendly array of 64-bit integers.
For every state, we now have this transition table, implemented as an array, and one failure transition. If we are going to do a linear scan over the array anyway, we can store the failure transition at the end. This has three advantages:
- The failure go-to state does not need to be stored in a separate table, which is good for memory locality.
- It acts as a terminator for the linear scan, so we do not need to know the size of the table. For now
Vectorstill tracks a length for bounds checks, but later we will pack multiple tables in one vector.
- The lookup with default operation can be implemented in one go: when the linear scan terminates, we either found a regular transition, or the failure transition.
HashMap with these custom transition tables got us a 66% reduction in running time. Memory layout trumps algorithmic complexity for our workload.
Loops, Recursion and Goto
Although we had made great progress, our Haskell program still took twice as long as the Python version. Fiddling with strictness annotations and unboxed types helped to shave off a few percent of the running time, but we needed another order of magnitude improvement. We knew how fast string matching could be, and our memory layout was tight. Why was our program still so slow? We needed to dig into GHC’s compilation pipeline to learn what was going on. If you are not interested in Haskell optimization, you can skip this part and jump straight to the results.
Stepping the automaton is fairly simple, it can be written as a triply nested loop. To stick with the earlier example, suppose we had the automaton set up like so, in pseudo-Python to illustrate the idea without too much noise:
haystack = 'short tshirts' failure_marker = 0x10000 transitions = 0: [('t', 1), ('s', 7), (failure_marker, 0)], 1: [('s', 2), (failure_marker, 0)], 2: [('h', 3), (failure_marker, 7)], ... 8: [('i', 9), ('o', 13), (failure_marker, 0)], ... needles_ending_at_state = 0: , 1: , ... 6: ['tshirt'], ...
Then we could step the automaton with a triply nested loop: one over the input characters, one to follow transitions, and one for the linear scan through the transition table.
It would not be very hard to do this in a low-level language such as Rust, in a way that compiles down to efficient machine code, but in Haskell we have a problem: Haskell does not have loops.
The way to loop in Haskell, is to make a tail recursive function. GHC compiles these down to loops, and we can verify that with
-ddump-simpl. This flag tells GHC to dump optimized Core, a basic, explicitly typed language which is an intermediate target in GHC’s compilation pipeline.
For example, the following functions implement transition table lookup:
Note that infix
! is the vector indexing operator.
! as prefix is unrelated; it is a strictness annotation. The dumped Core for
lookup, after renaming and omitting a few things for readability, looks like this:
A few things stand out:
#is all over the place. An
Int#is an unboxed signed integer, with a target-dependent size, like
Word#is the unsigned variant. This is in contrast to
Wordin the original code, which are boxed: pointers to integers or thunks.
+#compiles to an ordinary add instruction,
and#to a bitwise and. An integer literal with
#suffix is an unboxed signed integer constant; a
##suffix makes it unsigned.
- The inliner inlined
transitionCodeUnitand the other helper functions.
Vectoris a base pointer, offset, and length (to support slicing), and GHC deconstructed its fields into function arguments to avoid indirection.
- The tail call has been replaced with a jump.
- The function does construct either a
Righton the heap as return value.
The middle loop for automaton stepping follows transitions. If a transition exists for the input character, the loop terminates after one step, but if the transition was a failure transition, we try again from the failure state. This translates to
After inspecting the Core for this function, the culprit became clear:
lookup was not inlined. This meant that for every transition lookup,
followEdge would do a function call,
lookup would allocate an
Either return value on the heap, and then
followEdge would immediately match on it. We could get
lookup to be inlined in
followEdge, but then
followEdge would not be inlined into the calling
step function, the function that implements the outer loop. At this point we were stuck. We could avoid a function call for one of the loops, but not both. We could not manually inline these functions by copying the bodies either, because they were recursive. We knew exactly what kind of machine code we wanted, but not how to tell the compiler to generate it.
From State Machines to Mutual Recursion
My colleague Radek offered a suggestion: what if — rather than implementing the triple loop as three functions — we would write one function with three loop variables? The rough idea is this:
Every constructor corresponds to one of the three loops, and all of the local variables have been moved into fields.
ConsumeInput carries the matches found so far, the current state, and the index of the current input code unit. When looking up a transition, we also need to track the index into the transition table. The
go function would now first match on the current step, perform a part of the loop, and then do a tail call. This way we can do three loops in one tail recursive function!
The new approach got rid of function calls in the compiled program. The tail call in the
go function compiles to a jump. This helped, but now we were allocating a
Step value on the heap at every iteration. Could we somehow avoid the allocation too?
The next insight was that we do not need the
Step type at all. Why construct it right before the tail call, only to match on it immediately? We could instead specialize the
go function for every step:
We ended up with a set of local mutually recursive functions. None of these is self-recursive, but they do all make a tail call to one of the other
go functions. Optimization often comes at the cost of clarity, but in this case, what remains is still very readable Haskell. The code reads a lot like labels and jumps: every function is a label, and a call is a jump to that label. In fact, labels and jumps are exactly what the code compiles down to.
After discovering what we needed to write to get GHC to generate good code, we could optimize further. With strictness annotations we eliminated almost all allocations from the
find function, save from building up the list of matches. Later, by modifying
find to act as a fold over all matches, we eliminated those allocations too. After a struggle, the power of a high-level language paid off.
As a fold,
find looks like this:
In our benchmark program we fold with an increment function. The
Next type also allows us to early out, for example when testing whether any of the needles occurs in the haystack:
countMatches, the generated code is specialized with a strict integer accumulator, without any allocations.
containsAny is similarly efficent, without allocations, returning
True at the first match. We added a few further optimizations, such as a sparse lookup table for the initial state, and packing all transition tables into one vector for better cache utilization, but this post is already quite long, so let’s move on.
The Glasgow Haskell Compiler uses its own code generation backend by default. By passing
-fllvm, GHC will use the newer LLVM backend, which benefits from the low-level optimizations that LLVM is good at. Build times suffered, but in return we got about a 20% reduction in running time. Enabling LLVM was by far the most worthwhile optimization in our entire journey, in terms of speedup per effort invested.
With our fully optimized implementation, we are now faster than any other implementation we could find on UTF-16 text:
Keep in mind that the Rust version matches UTF-16 text bytewise, whereas our Haskell version acts on 16-bit code units. For completeness, we also show how the Rust library would perform if our inputs would have been UTF-8, roughly halving the data volume. We did not benchmark our own implementation on UTF-8 strings, as it is specialized to 16-bit code units for Haskell’s
Text type. Compared to what we started with, we reduced string matching time by 94%.
We set out to speed up string searching, a common operation for Channable. We investigated several Aho–Corasick implementations, and concluded that our best bet was to build a fast Haskell implementation, which we are open-sourcing today. Though Haskell is not the best tool for doing low-level optimization, it is definitely possible. In the end our implementation came out faster than any of the others we had tried before.
Finding the right way to instruct GHC to generate fast code remains a challenge. Although the end result is code that is clear and fast, the journey to get there was quite the adventure. Relying so much on optimizations to ensure that very high level code compiles to the right low-level code is fragile. Case in point: after upgrading our Stackage snapshot from LTS 10 with GHC 8.2.2 to LTS 13 with GHC 8.6.3, we saw the running time of our benchmark double. After some investigation we got the times back down, but this involved a subtle combination of code changes and inlining pragmas. Still, our journey was definitely worth the trouble, for in the end we can service users with even faster previews and a more responsive tool.
: The library offers a, sparse, a dense, and a full automaton representation. The full representation contained a bug that caused it to miss matches, and of the sparse and dense representation, dense was faster for our use case, so we benchmarked the dense automaton. ↩