/* Notes by Jerry Z.X. 2005/05/30 */
/*
** we use this for pattern matching algorithm performance studies
** any problem please contact jerryzx@gmail.com
** by Xin Zhou @Security Lab, RIIT, Tsinghua University, Beijing ,China
** http://security.riit.tsinghua.edu.cn/
*/
/*
**
** $Id: mwm.c,v 1.3 2003/12/17 21:25:14 jh8 Exp $
**
** mwm.c - A Modified Wu-Manber Style Multi-Pattern Matcher
**
** Copyright (C) 2002 Sourcefire,Inc
** Marc Norton
**
**
** This program is free software; you can redistribute it and/or modify
** it under the terms of the GNU General Public License as published by
** the Free Software Foundation; either version 2 of the License, or
** (at your option) any later version.
**
** This program is distributed in the hope that it will be useful,
** but WITHOUT ANY WARRANTY; without even the implied warranty of
** MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
** GNU General Public License for more details.
**
** You should have received a copy of the GNU General Public License
** along with this program; if not, write to the Free Software
** Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
**
**
**
**
** A Fast Multi-Pattern Search Engine
**
** This algorithm is based on the Wu & Manber '94 paper. This algorithm is not an
** exact implementation in that it uses a standard one or 2 byte Bad Character shift table,
** whereas Wu & Manber used a 2 byte Bad Characeter shift table. This implementation
** also uses a fixed 2 byte prefix hash table, Wu & Manber used a variable size
** 2 byte hash. Hash groups are defined as in Wu & Manber. The Pattern groups are searched
** using a reverse string compare as in Wu & Manber.
**
** A pattern match tracking mechanism has been added to reduce the effect of DOS attacks,
** and generally improve the worst case performanmce scenario. This feature is
** enabled/disabled via BITOP_TEST and requires the bitop.h file.
**
**
** Initial Version - Marc Norton - April 2002
**
**
Algorithm:
This is a simplified implementation of Wu & Manber's '92 and '94 papers.
1) A multi-pattern Boyer-Moore style bad character or word shift is done until
a possible pattern is detected.
2) Hashing is used on the 2 character prefix of the current text to index into a
group of patterns with the same prefix. Patterns must be sorted. Wu & Manber used
the last 2 characters of the psuedo multi-pattern minimum length. Either works fine.
3) A reverse string compare is performed against each pattern in the
group to see if any of the patterns match the current text prefix.
Algorithm Steps
Preprocess:
a) Sort The Patterns, forming a sorted list of patterns.
b) Build a hash table of the patterns as follows: For each pattern find it's
hash, in the hash index vector save this patterns index, now count how
many patterns following this one have the same hash as it, save this value
with the 1st pattern that has this hash.
c) Build a Boyer-Moore style bad Character shift table, using the min shift from
all patterns for the shift value for each characters in the shift table. For the
purposes of the Bad Character Shift we assume all patterns are the same length as the
shortest pattern. This works quite well in practice. Also build a Bad Word
shift table.
Search:
a) Perform Boyer-Moore Bad Character or Bad Word Shift loop to find a possible
pattern match.
b) Check if a hashed pattern group entry exists for this possible patterns prefix,
If no pattern hash exists for this suffix shift go to d)
c) If a hash exists, test all of the patterns in the pattern group to see
which ones if any match the text suffix. Use a reverse string comparison
to benefit from the increased likelihood of misses at the ends of the string.
This tidbit is based on Boyer-Moore searching. Uses bit flags to eliminate
previously hit matches from contention. This provides a better worst case
scenario for this setwise pattern matching.
d) Use a one character shift and go back to a)
Pattern Matcher Selection Heuristics:
We can use A Boyer-Moore for small sets of patterns. We can use one of 3 matchers
for larger sets. When the minimum pattern size within the pattern group is one
character we use the version without a Boyer-Moore bad character or bad word
shift. When we have groups with a minimum pattern size of 2 or more characters
we use the version with the bad character shift. When we have groups with a minimum
pattern size of 2 or more characters and the number of patterns small-medium we can use
the version with the bad word shift. More testing is needed to help determine the optimal
switching points between these algorithms.
Case vs NoCase:
We convert all patterns and search texts to one case, find a match, and if case is
important we retest just the pattern characters against the text in exact mode.
Algorithm Strengths:
Performance is limited by the minimum pattern length of the group. The bigger
the minumum, the more the bad character shift helps, and the faster the search.
The minimum pattern length also determines whether a tree based search is faster or slower.
Small groups of patterns 10-100 with a large minimum pattern size (3+ chars) get the best
performanc. Oh, if it were all that simple.
Improvements or Variations:
where to start ...much more to come...
Notes:
Observations:
This algorithm is CPU cache sensitive, aren't they all.
Ancestory:
The structure of the API interface is based on the interface used by the samples
from Dan Gusfields book.
Some References:
Boyer-Moore - The original and still one of the best.
Horspool - Boyer-Moore-Horspool.
Sunday - Quicksearch and the Tuned Boyer-Moore algorithm.
Wu and Manber - A Fast Multi-Pattern Matcher '94 -- agrep, fgrep, sgrep
Aho & Corasick- State machine approach, very slick as well.
Dan Gusfield - Algorithms on strings, trees, and sequences.
Steven Graham -
Crochemere -
NOTES:
4-2002 - Marc Norton
Initial implementation
5/23/2002 - Marc Norton -
Added Boyer-Moore-Horspool locally, so it's inlined.
We use a simple <5 pattern count to decide to use the
BMH method over the MWM method. This is not always right
but close enough for now. This may get replaced with the standard
Boyer-Moore in mString - the standard Boyer Moore may protect
better against some types of DOS'ing attempts.
11/02 - Fixed bug for multiple duplicate one byte patterns.
10/03 - Changed ID to a void * for better 64 bit compatability.
- Added BITOP_TEST to make testing rotuines easier.
- Added Windows __inline command back into bitops.
- Added Standalone Windows support for UNIT64 for standalone testing
- modified mwm.c, mwm.h, bitop.h
10/28/03 - Fixed bug in mwmPrepHashedPatternGroups(), it was setting resetting psIID
as it walked through the patterns counting the number of patterns in each group.
This caused an almost random occurrence of an already tested rule's bitop
flag to cause another rule to never get tested.
12/08/03 - Removed the usage of NumArray1, it was unneccessary, reverted back to NumArray
mwmPrepHashedPatternGroups:
When counting one byte patterns in 'ningroup' added a check for psLen==1
Broke out com