Musser-Nishanov generic sequence-matching algorithm(s) #25
Open
jeremy-murphy wants to merge 110 commits into boostorg:develop from
Distinct function to read the corpus because it is a multi-line file. The pattern files are multi-line too, but we're treating them as one per line.
It's very slow; the default search trait is much better.
Contributor
@mclow can you check this PR please?
Contributor
Author
@zamazan4ik it's OK, I'll call for Marshall's attention when it's finished. The list of caveats/unfinished things is real; I'll tick items off when they're done, or remove them completely if I change my mind about them. No need to keep drawing his attention until then, but I'm always ready to engage in a conversation about what is currently here.
Collaborator
Very, very odd. I just got an email about three new commits. But when I come here, I see they were made 16 days ago.
Contributor
Author
Some of the commits might be that old; I only just pushed for the first time in ages.
# Conflicts: # test/search_test1.cpp
This fixes a bug with using the DNA type_traits in C++11, which maybe worked in C++14.
Also use std::begin/end in place of boost::begin/end.
Introduction
In 1997, David R. Musser and Gor V. Nishanov wrote a hitherto unpublished paper and accompanying code, A Fast Generic Sequence Matching Algorithm.
It struck me as odd that this algorithm had not been widely adopted, as Musser's introsort had been.
So I contacted the authors and with their permission am attempting to bring it to a wider audience.
I provide the abstract from their paper here for completeness:
In the benchmarks that I have run, it is faster than Boyer-Moore, but equivalent to Boyer-Moore-Horspool, on 8-bit text.
On '2-bit' text (DNA sequence), however, the custom search traits make it 5-10x faster than the nearest rival. I have demonstrated this in a new search test, no. 5.
What is also interesting is that a simpler, fallback algorithm is provided for corpus iterators that are not random access.
Design
I have not changed the fundamental algorithms in a material way; they should be recognizable from the original code apart from some simplifications.
What I have done personally is to restructure the interface to match the existing Boost.Algorithm searchers. The Musser-Nishanov search strategy is complicated by having two algorithms, accelerated linear (AL) and hashed accelerated linear (HAL), that are mostly but not entirely independent. If the corpus iterator is not random-access or the (static) suffix size is zero, then AL is selected at compile time. Otherwise, a HAL search object is created, but this does not guarantee that HAL will be used. If the pattern size (known at run time) is smaller than the suffix size, then the HAL search object must fall back to AL.
The user-facing API is a two-part musser_nishanov class that does the initial static choice between AL and HAL. Since HAL uses the same data structures as AL plus one more, it is easy to implement HAL in terms of AL. This justified making AL a distinct class that is inherited publicly by the non-random-access musser_nishanov class, and inherited privately by the HAL class so that it could be used as the fallback algorithm. The HAL class currently is the musser_nishanov class for random access iterators, as opposed to being a distinct class.

The following is a crude UML diagram of the class hierarchy, showing the private inheritance in brown and public inheritance in blue, following Doxygen style.

Since the HAL search object makes its final decision about algorithm at run time, I had to represent this choice somehow. The simplest, though not necessarily most efficient, way is with bind() and function<>. It also chooses a null searcher if the pattern is empty. This method of storing which algorithm to call might be causing a slight performance penalty, but confirming that requires more detailed tests.
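The run-time choice described above can be sketched as follows, using std::bind and std::function from the standard library. This is an illustration under stated assumptions rather than the PR's code: the class dispatching_searcher and its members are hypothetical, the default suffix size of 4 is arbitrary, and std::search stands in for both the AL and HAL algorithms so that the example stays short.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>

class dispatching_searcher {
public:
    typedef std::string::const_iterator iterator;
    enum kind { null_kind, al_kind, hal_kind };

    explicit dispatching_searcher(const std::string& pattern,
                                  std::size_t suffix_size = 4)
        : pattern_(pattern) {
        using namespace std::placeholders;
        if (pattern_.empty()) {
            kind_ = null_kind;  // empty pattern: nothing to search for
            search_ = std::bind(&dispatching_searcher::null_search, this, _1, _2);
        } else if (pattern_.size() < suffix_size) {
            kind_ = al_kind;    // pattern too short to hash: fall back to AL
            search_ = std::bind(&dispatching_searcher::al_search, this, _1, _2);
        } else {
            kind_ = hal_kind;
            search_ = std::bind(&dispatching_searcher::hal_search, this, _1, _2);
        }
    }

    // The stored callable captures `this`, so copying would leave the copy
    // pointing at the original object; copying is disabled in this sketch.
    dispatching_searcher(const dispatching_searcher&) = delete;
    dispatching_searcher& operator=(const dispatching_searcher&) = delete;

    iterator operator()(iterator first, iterator last) const {
        return search_(first, last);  // one indirect call per search
    }

    kind chosen() const { return kind_; }

private:
    iterator null_search(iterator first, iterator) const { return first; }

    // Stand-ins: a real implementation would run AL or HAL here.
    iterator al_search(iterator first, iterator last) const {
        return std::search(first, last, pattern_.begin(), pattern_.end());
    }
    iterator hal_search(iterator first, iterator last) const {
        return std::search(first, last, pattern_.begin(), pattern_.end());
    }

    std::string pattern_;
    kind kind_;
    std::function<iterator(iterator, iterator)> search_;
};
```

The indirection through std::function is the plausible source of the penalty mentioned above, since each search goes through a type-erased call that the compiler usually cannot inline.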
Selected Benchmarks
I whittled the benchmark output down for display here. These values should be taken with a grain of 5-10% salt: Musser-Nishanov is not consistently faster than Boyer-Moore-Horspool, for example.
8-bit random characters
This is an edited search_test2 output:

DNA sequences
And this is an edited search_test5 output, which is '2-bit' (DNA) sequence data, for which Musser-Nishanov has a few customized search traits, but I included just one here. The number --- here --- is the pattern size; search is timed for finding all matches (which may be zero -- this benchmark needs some work).

Just to be clear, there is no musser_nishanov_dna class; that is just shorthand for specialization.

Caveats (or, what is unfinished)
I did not want to let perfect be the enemy of good, but more importantly I wanted to get a conversation about this code underway as soon as possible, so I have opened this PR well before everything is complete.