Copyright	(c) 2010 Daniel Fischer
License	BSD3
Maintainer	Daniel Fischer <daniel.is.fischer@googlemail.com>
Stability	Provisional
Portability	non-portable (BangPatterns)
Safe Haskell	None
Language	Haskell98

Data.ByteString.Lazy.Search.KarpRabin

Contents

Overview
- Caution
Function

Description

Simultaneous search for multiple patterns in a lazy ByteString using the Karp-Rabin algorithm.

A description of the algorithm for a single pattern can be found at http://www-igm.univ-mlv.fr/~lecroq/string/node5.html#SECTION0050.

Synopsis

indicesOfAny :: [ByteString] -> ByteString -> [(Int64, [Int])]

Overview

The Karp-Rabin algorithm works by calculating a hash of the pattern and comparing that hash with the hash of a slice of the target string with the same length as the pattern. If the hashes are equal, the slice of the target is compared to the pattern character by character (since the hash function generally isn't injective).
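The single-pattern procedure described above can be sketched as follows. This is only an illustration on strict ByteStrings, not this module's implementation: the names `karpRabin`, `hashBytes`, `base` and `modulus` are hypothetical, and the polynomial hash with these constants is an assumption chosen for simplicity.

```haskell
import qualified Data.ByteString as S

-- Hypothetical hash parameters for this sketch (not the module's actual hash).
base, modulus :: Int
base = 256
modulus = 1000003

-- Polynomial hash of a whole string, reduced modulo 'modulus'.
hashBytes :: S.ByteString -> Int
hashBytes = S.foldl' (\h w -> (h * base + fromIntegral w) `mod` modulus) 0

-- | Start indices of all (possibly overlapping) occurrences of a pattern.
--   An empty pattern yields no matches in this sketch.
karpRabin :: S.ByteString -> S.ByteString -> [Int]
karpRabin pat target
  | m == 0 || m > n = []
  | otherwise       = go 0 (hashBytes (S.take m target))
  where
    m = S.length pat
    n = S.length target
    patHash = hashBytes pat
    -- base^(m-1) mod modulus: the weight of the byte that leaves the window.
    topWeight = foldl (\acc _ -> acc * base `mod` modulus) 1 [2 .. m]
    -- Constant-time update: drop the oldest byte, shift, add the new byte.
    roll h out new =
      ((h - fromIntegral out * topWeight) `mod` modulus * base + fromIntegral new)
        `mod` modulus
    go i h = [i | h == patHash, S.take m (S.drop i target) == pat] ++ next
      where
        -- Equal hashes are only a hint, so the slice is compared above.
        next
          | i + m >= n = []
          | otherwise  = go (i + 1) (roll h (S.index target i) (S.index target (i + m)))
```

In this sketch the explicit comparison guards against hash collisions, so the result is correct regardless of how good the hash is; a poor hash only costs time.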

For a single pattern, this tends to be more efficient than the naïve algorithm, but it cannot compete with algorithms like Knuth-Morris-Pratt or Boyer-Moore.

However, the algorithm can be generalised to search for multiple patterns simultaneously. If the shortest pattern has length k, hash the prefix of length k of all patterns and compare the hash of the target's slices of length k to them. If there's a match, check whether the slice is part of an occurrence of the corresponding pattern.

With a hash function that

- allows the hash of one slice to be computed in constant time from the hash of the previous slice, the new character and the dropped character, and
- produces few spurious matches,

searching for occurrences of any of n patterns has a best-case complexity of O(targetLength * lookup n), where lookup n is the cost of looking up a hash among the hashes of the n patterns. The worst-case complexity is O(targetLength * lookup n * sum patternLengths); the average is not much worse than the best case.

The functions in this module store the hashes of the patterns in an IntMap, so the lookup is O(log n). Re-hashing is done in constant time and spurious matches of the hashes should be sufficiently rare. The maximal length of the prefixes to be hashed is 32.
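To make the multi-pattern scheme concrete, here is a minimal sketch that keeps the hashes of the pattern prefixes in an IntMap and rolls a hash over the target. It is not the module's code: it works on strict ByteStrings, the names `multiSearch`, `hashBytes`, `base` and `modulus` are hypothetical, the hash function is an assumed polynomial hash, and numbering patterns by their position in the list as given (before empty patterns are dropped) is an assumption of this sketch.

```haskell
import qualified Data.ByteString as S
import qualified Data.IntMap.Strict as IM

-- Hypothetical hash parameters for this sketch (not the module's actual hash).
base, modulus :: Int
base = 256
modulus = 1000003

hashBytes :: S.ByteString -> Int
hashBytes = S.foldl' (\h w -> (h * base + fromIntegral w) `mod` modulus) 0

-- | For each index where at least one pattern starts, the index and the
--   positions (in the given list) of the patterns starting there.
multiSearch :: [S.ByteString] -> S.ByteString -> [(Int, [Int])]
multiSearch pats target
  | null live || k > n = []
  | otherwise          = go 0 (hashBytes (S.take k target))
  where
    -- Assumption: empty patterns are skipped but keep their list position.
    live = [(ix, p) | (ix, p) <- zip [0 ..] pats, not (S.null p)]
    -- Hash only a prefix: the shortest pattern's length, capped at 32.
    k = min 32 (minimum (map (S.length . snd) live))
    n = S.length target
    -- Prefix hash -> patterns sharing that hash, in list order.
    table = IM.fromListWith (flip (++))
              [(hashBytes (S.take k p), [(ix, p)]) | (ix, p) <- live]
    topWeight = foldl (\acc _ -> acc * base `mod` modulus) 1 [2 .. k]
    roll h out new =
      ((h - fromIntegral out * topWeight) `mod` modulus * base + fromIntegral new)
        `mod` modulus
    go i h = [(i, found) | not (null found)] ++ next
      where
        -- O(log n) lookup, then verify each candidate against the target.
        found = [ ix | (ix, p) <- IM.findWithDefault [] h table
                     , p `S.isPrefixOf` S.drop i target ]
        next
          | i + k >= n = []
          | otherwise  = go (i + 1) (roll h (S.index target i) (S.index target (i + k)))
```

Each target position costs one constant-time re-hash and one IntMap lookup; the byte-wise verification only runs for patterns whose prefix hash equals the hash of the current slice.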

Caution

Unfortunately, the constant factors are high, so these functions are slow. Unless the number of patterns to search for is high (larger than 50 at least), repeated search for single patterns using Boyer-Moore or DFA and manual merging of the indices is faster, and much faster for fewer than 40 or so patterns.
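The manual merging mentioned above could look roughly like the following sketch. It assumes that each single-pattern search yields its match indices in ascending order and that the per-pattern lists are supplied in pattern order; `mergeIndices` is a hypothetical helper, not part of this package.

```haskell
import Data.Int (Int64)

-- | Merge per-pattern index lists (each ascending, given in pattern order)
--   into the same shape that indicesOfAny returns.
mergeIndices :: [[Int64]] -> [(Int64, [Int])]
mergeIndices = foldr merge2 [] . zipWith tag [0 ..]
  where
    -- Label every index with the position of the pattern it belongs to.
    tag ix = map (\pos -> (pos, [ix]))
    -- Lazy merge of two ascending lists, combining entries at equal indices.
    merge2 xs [] = xs
    merge2 [] ys = ys
    merge2 xa@((p, is) : xs) ya@((q, js) : ys)
      | p < q     = (p, is) : merge2 xs ya
      | p > q     = (q, js) : merge2 xa ys
      | otherwise = (p, is ++ js) : merge2 xs ys
```

For example, `mergeIndices [[0, 3], [0], [1]]` evaluates to `[(0, [0, 1]), (1, [2]), (3, [0])]`. The merge itself is lazy, but every per-pattern search still traverses the target separately.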

However, indicesOfAny has the advantage over multiple single-pattern searches that it doesn't hold on to large parts of the string (which is likely to happen with multiple searches), so, in contrast to the strict version, it may be useful even for relatively few patterns.

Nevertheless, this module seems more of an interesting curiosity than anything else.

Function

indicesOfAny

Arguments

:: [ByteString]	List of non-empty patterns
-> ByteString	String to search
-> [(Int64, [Int])]	List of matches

indicesOfAny finds all occurrences of any of several non-empty strict patterns in a lazy target string. If no non-empty patterns are given, the result is an empty list. Otherwise the result list contains one pair for each index at which any of the (non-empty) patterns starts: the index, and the list of all patterns starting at that index, the patterns being represented by their (zero-based) position in the pattern list. Empty patterns are filtered out before processing begins.
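The shape of the result can be illustrated with a naive reference model. This is a sketch for clarification only, not how the function is implemented: `referenceIndicesOfAny` is a hypothetical name, and it assumes that pattern positions refer to the list as given, which may differ from the library if positions are counted after empty patterns are removed.

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import Data.Int (Int64)

-- | Naive model of the result: try every pattern at every target position.
referenceIndicesOfAny :: [S.ByteString] -> L.ByteString -> [(Int64, [Int])]
referenceIndicesOfAny pats target =
  [ (pos, hits)
  | (pos, suffix) <- zip [0 ..] (L.tails target)
  , let hits = [ ix
               | (ix, p) <- zip [0 ..] pats   -- assumed: position in the given list
               , not (S.null p)               -- empty patterns never match
               , L.fromStrict p `L.isPrefixOf` suffix ]
  , not (null hits)
  ]
```

With the patterns "ab", "abc" and "bc" and the target "abcab", this model yields `[(0, [0, 1]), (1, [2]), (3, [0])]`: both "ab" and "abc" start at index 0, "bc" starts at index 1, and "ab" starts again at index 3.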