Musser-Nishanov generic sequence-matching algorithm(s) by jeremy-murphy · Pull Request #25 · boostorg/algorithm

jeremy-murphy · 2016-09-11T16:02:57Z

Introduction

In 1997, David R. Musser and Gor V. Nishanov wrote a hitherto unpublished paper and accompanying code, A Fast Generic Sequence Matching Algorithm.

It struck me as odd that this algorithm had not been widely adopted, as Musser's introsort had been.
So I contacted the authors and with their permission am attempting to bring it to a wider audience.

I provide the abstract from their paper here for completeness:

A string matching—and more generally, sequence matching—algorithm is presented
that has a linear worst-case computing time bound, a low worst-case bound on the
number of comparisons (2n), and sublinear average-case behavior that is better than
that of the fastest versions of the Boyer-Moore algorithm. The algorithm retains its
efficiency advantages in a wide variety of sequence matching problems of practical
interest, including traditional string matching; large-alphabet problems (as in Unicode
strings); and small-alphabet, long-pattern problems (as in DNA searches). Since it is
expressed as a generic algorithm for searching in sequences over an arbitrary type T, it
is well suited for use in generic software libraries such as the C++ Standard Template
Library. The algorithm was obtained by adding to the Knuth-Morris-Pratt algorithm
one of the pattern-shifting techniques from the Boyer-Moore algorithm, with provision
for use of hashing in this technique. In situations in which a hash function or random
access to the sequences is not available, the algorithm falls back to an optimized version
of the Knuth-Morris-Pratt algorithm.

In the benchmarks that I have run, it is faster than Boyer-Moore, but equivalent to Boyer-Moore-Horspool, on 8-bit text.

On '2-bit' text (DNA sequences), however, the custom search traits make it as much as 5-10x faster than the nearest rival on the longer patterns. I have demonstrated this in a new search test, no. 5.
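To give a feel for what a DNA-specific search trait might look like, here is a minimal sketch. It is not the trait from this PR: the names `dna_search_traits_sketch`, `nucleotide_code`, `suffix_size` and `hash_range_max`, the A/C/G/T ordering, and the bit-packing scheme are all assumptions made for illustration. The idea shown is simply that each nucleotide needs only 2 bits, so a suffix of several symbols can be packed into one small hash value.

```cpp
#include <cassert>
#include <cstddef>

// Map a nucleotide to a 2-bit code. The A/C/G/T ordering is an arbitrary
// choice made for this sketch.
inline unsigned nucleotide_code(char c)
{
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:
            assert(false && "not a nucleotide");
            return 0;
    }
}

// Hypothetical trait: hashes the last SuffixSize symbols into
// 2 * SuffixSize bits, so hash values lie in [0, hash_range_max).
template <std::size_t SuffixSize>
struct dna_search_traits_sketch
{
    static_assert(SuffixSize >= 1 && SuffixSize <= 8, "illustrative bound only");

    static constexpr std::size_t suffix_size = SuffixSize;
    static constexpr std::size_t hash_range_max = std::size_t(1) << (2 * SuffixSize);

    // Precondition: i and the SuffixSize - 1 positions before it are
    // dereferenceable.
    template <typename RandomAccessIterator>
    static std::size_t hash(RandomAccessIterator i)
    {
        std::size_t h = 0;
        for (std::size_t k = 0; k < SuffixSize; ++k) {
            const auto offset = static_cast<std::ptrdiff_t>(k);
            h |= static_cast<std::size_t>(nucleotide_code(*(i - offset))) << (2 * k);
        }
        return h;
    }
};
```

With a suffix size of 2 there are 16 possible hash values instead of the 4 distinct symbols, which is presumably where a trait like this gets its extra skipping power; I have not measured this sketch.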

What is also interesting is that a simpler, fallback algorithm is provided for corpus iterators that are not random access.

Design

I have not changed the fundamental algorithms in any material way; apart from some simplifications, they should be recognizable from the original code.

What I have done personally is to restructure the interface to match the existing Boost.Algorithm searchers. The Musser-Nishanov search strategy is complicated by having two algorithms, accelerated linear (AL) and hashed accelerated linear (HAL), that are mostly but not entirely independent. If the corpus iterator is not random-access or the (static) suffix size is zero, then AL is selected at compile time. Otherwise, a HAL search object is created, but this does not guarantee that HAL will be used. If the pattern size (known at run time) is smaller than the suffix size, then the HAL search object must fall back to AL.
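The two-stage selection described above can be sketched roughly as follows. This is an illustration of the decision logic only, not the PR's code: `algorithm`, `hal_is_candidate` and `select_algorithm` are hypothetical names, and the suffix size is passed as a plain template parameter rather than through a search trait.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <type_traits>

// Hypothetical names throughout; not the PR's actual interface.
enum class algorithm { AL, HAL };

// Static choice: HAL is only a candidate when the corpus iterator is
// random-access and the (static) suffix size is non-zero.
template <typename CorpusIter, std::size_t SuffixSize>
constexpr bool hal_is_candidate()
{
    using category = typename std::iterator_traits<CorpusIter>::iterator_category;
    return std::is_base_of<std::random_access_iterator_tag, category>::value
        && SuffixSize != 0;
}

// Run-time choice: even a HAL candidate falls back to AL when the pattern
// is shorter than the suffix size.
template <typename CorpusIter, std::size_t SuffixSize>
algorithm select_algorithm(std::size_t pattern_size)
{
    if (!hal_is_candidate<CorpusIter, SuffixSize>())
        return algorithm::AL;
    return pattern_size < SuffixSize ? algorithm::AL : algorithm::HAL;
}

// A pointer is random-access; a list iterator is only bidirectional.
static_assert(hal_is_candidate<const char*, 2>(), "pointer corpus may use HAL");
static_assert(!hal_is_candidate<std::list<char>::const_iterator, 2>(),
              "list corpus always uses AL");
```

For example, `select_algorithm<const char*, 2>(1)` yields `algorithm::AL` because a one-element pattern is shorter than the assumed suffix size of 2.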

The user-facing API is a two-part musser_nishanov class that makes the initial static choice between AL and HAL. Since HAL uses the same data structures as AL plus one more, it is easy to implement HAL in terms of AL. This justified making AL a distinct class that is inherited publicly by the non-random-access musser_nishanov class, and inherited privately by the HAL class so that it can be used as the fallback algorithm. The HAL class is currently the musser_nishanov class for random-access iterators, as opposed to being a distinct class.
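A stripped-down skeleton of that inheritance arrangement might look like the following. The class bodies are placeholders and every name (`musser_nishanov_sketch`, `algorithm_kind`, `is_random_access`, the two aliases) is an assumption for illustration; the static check on suffix size is left out to keep the sketch short.

```cpp
#include <iterator>
#include <list>
#include <type_traits>

enum class algorithm_kind { accelerated_linear, hashed_accelerated_linear };

// AL as a distinct class; the members stand in for its real data structures.
template <typename PatIter>
class accelerated_linear
{
public:
    accelerated_linear(PatIter first, PatIter last)
        : pat_first_(first), pat_last_(last) {}
    algorithm_kind kind() const { return algorithm_kind::accelerated_linear; }
protected:
    PatIter pat_first_, pat_last_;
};

template <typename CorpusIter>
using is_random_access = std::is_base_of<
    std::random_access_iterator_tag,
    typename std::iterator_traits<CorpusIter>::iterator_category>;

// Two-part user-facing class, split on the corpus iterator category.
template <typename PatIter, typename CorpusIter,
          bool RandomAccess = is_random_access<CorpusIter>::value>
class musser_nishanov_sketch;

// Non-random-access corpus: the searcher *is* an AL searcher (public base).
template <typename PatIter, typename CorpusIter>
class musser_nishanov_sketch<PatIter, CorpusIter, false>
    : public accelerated_linear<PatIter>
{
public:
    using accelerated_linear<PatIter>::accelerated_linear;
};

// Random-access corpus: HAL is implemented in terms of AL (private base),
// which stays available as the fallback.
template <typename PatIter, typename CorpusIter>
class musser_nishanov_sketch<PatIter, CorpusIter, true>
    : private accelerated_linear<PatIter>
{
    using base = accelerated_linear<PatIter>;
public:
    musser_nishanov_sketch(PatIter first, PatIter last) : base(first, last) {}
    algorithm_kind kind() const { return algorithm_kind::hashed_accelerated_linear; }
    algorithm_kind fallback_kind() const { return base::kind(); }
};

using list_searcher =
    musser_nishanov_sketch<const char*, std::list<char>::const_iterator>;
using pointer_searcher = musser_nishanov_sketch<const char*, const char*>;
```

The private base means a `pointer_searcher*` does not convert to `accelerated_linear<const char*>*` outside the class, whereas a `list_searcher*` does, which mirrors the "implemented in terms of" versus "is a" distinction.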

The following is a crude UML diagram of the class hierarchy, showing the private inheritance in brown and public inheritance in blue, following Doxygen style.

Since the HAL search object makes its final decision about algorithm at run time, I had to represent this choice somehow. The simplest, though not necessarily most efficient, way is with bind() and function<>. The HAL search object also chooses a null searcher if the pattern is empty. This method of storing which algorithm to call might be causing a slight performance penalty, but confirming that requires more detailed tests.
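A minimal sketch of that run-time dispatch, using `std::bind` and `std::function` as stand-ins, could look like this. It is not the PR's implementation: the class name, the `choice` enum, the suffix size of 2, and the use of `std::search` in place of the real AL and HAL bodies are all assumptions made so that the example is self-contained.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

class dispatching_searcher_sketch
{
public:
    using iterator = std::string::const_iterator;
    using result = std::pair<iterator, iterator>;
    enum class choice { null, al, hal };

    static constexpr std::size_t suffix_size = 2;  // hypothetical value

    explicit dispatching_searcher_sketch(std::string pattern)
        : pattern_(std::move(pattern))
    {
        using namespace std::placeholders;
        using self = dispatching_searcher_sketch;
        if (pattern_.empty()) {
            choice_ = choice::null;
            search_ = std::bind(&self::null_search, this, _1, _2);
        } else if (pattern_.size() < suffix_size) {
            choice_ = choice::al;   // pattern too short for HAL
            search_ = std::bind(&self::al_search, this, _1, _2);
        } else {
            choice_ = choice::hal;
            search_ = std::bind(&self::hal_search, this, _1, _2);
        }
    }

    // The stored callable captures `this`, so copying would leave it
    // pointing at the wrong object; the sketch simply forbids copies.
    dispatching_searcher_sketch(const dispatching_searcher_sketch&) = delete;
    dispatching_searcher_sketch& operator=(const dispatching_searcher_sketch&) = delete;

    result operator()(iterator first, iterator last) const
    {
        return search_(first, last);
    }

    choice chosen() const { return choice_; }

private:
    // Empty pattern: report an empty match at the start of the corpus.
    result null_search(iterator first, iterator) const { return {first, first}; }

    // Stand-ins: both defer to std::search rather than the real algorithms.
    result al_search(iterator first, iterator last) const { return find(first, last); }
    result hal_search(iterator first, iterator last) const { return find(first, last); }

    result find(iterator first, iterator last) const
    {
        const iterator hit = std::search(first, last, pattern_.begin(), pattern_.end());
        if (hit == last)
            return {last, last};
        return {hit, hit + static_cast<std::ptrdiff_t>(pattern_.size())};
    }

    std::string pattern_;
    choice choice_ = choice::null;
    std::function<result(iterator, iterator)> search_;
};
```

Every call goes through the type-erased `std::function`, which is the kind of indirection that could account for the slight penalty mentioned above. The sketch also shows a side effect of binding `this`: the object cannot be copied safely without rebinding.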

Selected Benchmarks

I whittled the benchmark output down for display here. These values should be taken with a grain of 5-10% salt: Musser-Nishanov is not consistently faster than Boyer-Moore-Horspool, for example.

8-bit random characters

This is an edited search_test2 output:

Corpus  is 2756252 entries long
---- Middle -----
Pattern is 105 entries long
                       std::search 0.4848 seconds     100%        484781
                boyer_moore_search 0.1171 seconds   24.16%        117117
       boyer_moore_horspool_search 0.1076 seconds    22.2%        107602
            musser_nishanov_search 0.1033 seconds   21.31%        103296
------ End ------
Pattern is 43 entries long
                       std::search 0.6206 seconds     100%        620587
                boyer_moore_search 0.2075 seconds   33.44%        207515
       boyer_moore_horspool_search 0.1742 seconds   28.07%        174218
            musser_nishanov_search 0.1686 seconds   27.17%        168633
--- Not found ---
Pattern is 91 entries long
                       std::search  1.028 seconds     100%       1028228
                boyer_moore_search 0.1971 seconds   19.17%        197079
       boyer_moore_horspool_search 0.1825 seconds   17.75%        182502
            musser_nishanov_search 0.1702 seconds   16.56%        170235

DNA sequences

And this is an edited search_test5 output for '2-bit' (DNA) sequence data. Musser-Nishanov has a few customized search traits for such data, but I have included just one here. The number between the dashes (--- N ---) is the pattern size. The search is timed for finding all matches (of which there may be none; this benchmark needs some work).
Just to be clear, there is no musser_nishanov_dna class; that label is just shorthand for the DNA specialization.
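For readers unfamiliar with this style of benchmark, "finding all matches" amounts to a loop like the one below. This is a sketch only: `count_matches` is a hypothetical helper, `std::search` stands in for whichever searcher is being timed, and restarting one element past each hit (so that overlapping matches are counted) is an assumption about how the test advances.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Count every occurrence of pattern in corpus, overlapping ones included.
inline std::size_t count_matches(const std::string& corpus, const std::string& pattern)
{
    if (pattern.empty())
        return 0;  // avoid "matching" at every position

    std::size_t matches = 0;
    auto it = corpus.begin();
    while ((it = std::search(it, corpus.end(), pattern.begin(), pattern.end()))
           != corpus.end()) {
        ++matches;
        ++it;  // resume one element past the start of the last hit
    }
    return matches;
}
```

When the pattern does not occur at all, the loop degenerates into a single pass over the whole corpus, which is why a "matches: 0" row still represents a full search.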

Corpus  is 997532 entries long
--- 10 ---
matches: 2
                       std::search  2.239 seconds     100%       2239480
                boyer_moore object 0.6368 seconds   28.43%        636794
       boyer_moore_horspool object  0.664 seconds   29.65%        663994
            musser_nishanov object 0.7033 seconds    31.4%        703286
        musser_nishanov_dna object 0.3932 seconds   17.56%        393151
--- 20 ---
matches: 0
                       std::search   1.99 seconds     100%       1989855
                boyer_moore object 0.9167 seconds   46.07%        916750
       boyer_moore_horspool object  1.006 seconds   50.54%       1005750
            musser_nishanov object 0.9499 seconds   47.74%        949912
        musser_nishanov_dna object 0.1749 seconds   8.791%        174933
--- 40 ---
matches: 0
                       std::search  1.881 seconds     100%       1880573
                boyer_moore object 0.7201 seconds   38.29%        720095
       boyer_moore_horspool object 0.8477 seconds   45.08%        847738
            musser_nishanov object 0.8151 seconds   43.34%        815128
        musser_nishanov_dna object 0.09294 seconds  4.942%         92943
--- 150 ---
matches: 11
                       std::search  1.983 seconds     100%       1982931
                boyer_moore object 0.6318 seconds   31.86%        631754
       boyer_moore_horspool object 0.6806 seconds   34.32%        680592
            musser_nishanov object 0.6594 seconds   33.25%        659393
        musser_nishanov_dna object 0.04976 seconds  2.509%         49759

Caveats (or, what is unfinished)

I did not want to let perfect be the enemy of good, but more importantly I wanted to get a conversation about this code underway as soon as possible, so I have opened this PR well before everything is complete.

Documentation.
'wide char' benchmark.
Non-random-access corpus benchmark.
Worst-case benchmarks.
The benchmark suite for search test 5 needs to be either trimmed or better utilized.

zamazan4ik · 2018-02-08T21:21:19Z

@mclow can you check this PR please?

jeremy-murphy · 2018-02-12T04:00:42Z

@zamazan4ik it's OK, I'll call for Marshall's attention when it's finished. The list of caveats/unfinished things is real; I'll tick them off when they're done, or remove them if I change my mind about them. No need to keep drawing his attention until then, but I'm always ready to engage in a conversation about what is currently here.

mclow · 2020-12-15T01:33:28Z

Very, very odd. I just got an email about three new commits. But when I come here, I see they were made 16 days ago.

jeremy-murphy · 2020-12-15T01:36:37Z

> Very, very odd. I just got an email about three new commits. But when I come here, I see they were made 16 days ago.

Some of the commits might be that old; I only just pushed for the first time in ages.

jeremy-murphy added 30 commits September 2, 2016 23:15

Add DNA corpus and complete DNA test pattern set.

80af2da

Reorganize patterns into files by size.

5df3227

DNA search test.

1de242a

Remove some comments.

36e8870

Add Musser-Nishanov search algorithm.

2a67f57

Add HAL search and dna[234] variations to search_test_5.

ba11299

Simplify test running slightly with a typedef.

ab26b58

Remove redundant pattern_size variable.

c6dc97d

Make compute_skip more debug friendly.

fbde8e4

Deal with empty patterns correctly.

753d55c

Fix index type and comment and what still needs doing.

7be81d7

Return unsigned value of the same size as char type from hash function.

b3dad73

Add musser-nishanov-HAL to search test 1.

2d4eaa3

Remove this->, it seems a bit strange.

68b10ce

Add musser-nishanov-HAL to search test 2.

016ed5b

Include Boost assertion headers.

3fb9d7d

Simplify test to just find all matches of pattern in corpus.

7230137

Distinct function to read the corpus because it is a multi-line file. The pattern files are multi-line too, but we're treating them as one per line.

Move HAL and AL into detail directory.

4914d9e

Most of skeleton of musser_nishanov search class.

479b858

Bind and assign the right search algorithm to search member function.

0f47bc6

Return something from AL/HAL.

12265cd

Add HAL initialization on first use; fill in operator()s.

11a2d7a

Static assert that corpus and pattern iterator value types are same.

77506ad

Use base_of instead of same in light of C++17 contiguous iterator.

55348dd

compute_next and compute_skip.

58e4af7

Remove template argument from constructor.

46d8b1b

Split searcher class on corpus iterator category.

c74c313

Add AL stub.

44065bf

Add AL and tweak to the Boost interface; rename next to next_.

624e77b

Test for empty pattern in AL and move j variable inside loop.

cd4caf6

jeremy-murphy added 3 commits January 28, 2018 15:50

A short unsigned search trait class.

333c52c

Make some updates to the documentation.

e17a191

[Musser-Nishanov] Remove the short unsigned search trait specialization.

688d571

It's very slow; the default search trait is much better.

jeremy-murphy added 3 commits November 29, 2020 09:43

Merge branch 'develop' into musser-nishanov-search

6260b4d

Substantial refactor and update to C++14; use Boost.Variant2.

27ff8f3

Merge branch 'develop' into musser-nishanov-search

423513f

jeremy-murphy and others added 20 commits December 15, 2020 12:48

Add error reporting to test4.

c988c9e

Merge branch 'develop' into musser-nishanov-search

5d0dcec

# Conflicts: # test/search_test1.cpp

Minor non-functional improvements: debugging and performance

279299f

hashed_accelerated_linear: Simplify interface

1de025a

Simplify, whitespace, etc

35d0f48

Replace k with corpus_first

e205291

Test for single-char search and empty pattern in empty haystack

a9e91c2

Update copyright years

d73c8e1

Replace MPL with MP11 and C++11 type_traits

3b06d2c

This fixes a bug with using the DNA type_traits in C++11, which maybe worked in C++14.

More auto

872e83a

Less foo

1e8658b

Move hashable predicate to detail namespace

34485f5

Remove redundant <vector>

dd44402

Use std::is_same

f101f67

Replace C++14 generic lambda with hand-written class

0dbc5db

Change all/remaining make_pair(a, b) to {a, b}

ba9228b

Remove superfluous boost:: and boost::algorithm:: ns qualification

780c36a

Also use std::begin/end in place of boost::begin/end.

std::find is fine

5602930

Prefer pattern_length and next_.size() over distance(next_)

bace96b

Remove unnecessary recalculation of the result's last iterator

e5977de

jeremy-murphy wants to merge 110 commits into boostorg:develop from jeremy-murphy:musser-nishanov-search