@MinDoc(copyright="Copyright 2010 A. Weinert", author="Albrecht Weinert", version="V.63", lastModified="4.08.2021", usage="extend for clever substring search", purpose="common type for clever searches like Rabin Karp (RK), KMP etc.") public abstract class CleverSSS extends Object implements CharSequence, Comparable<CleverSSS>
CleverSSS
object holdsRK
, Knuth Morris Pratt (KMP
) and others is their
very successful striving for O(n). The standard ("naive"
loop-in-loop) algorithms, also used in String
and consorts, are
of type O(n*m).RK
and KMP
) hide their
constructors behind factory respectively make() methods. Inheritors do
neither have to handle the cases, where an
"optimistic
" search is
sufficient nor do they have to handle the the trivial cases ofignored
optionally.java.lang.String
:
String
regards an empty subsequence as occurring everywhere
or being every other String
's prefix. But the implementing classes
factories must recognise those special cases and indirectly call this
classes factories or constructors accordingly.CleverSSS
objects made or returned by the factories are immutable.
The rationale behind making objects of this type immutable is their
multiple and threadsafe re-usability. Inheritors must be immutable too
and should preferably be final.CharSequence
a CleverSSS
object, besides providing
the search algorithms, is at the same time (as of type
CharSequence
) the sub sequence to search for. In case of
ignoreCase
being true this sequence
is all
lower case. In case of ignoreWS
it does not contain any blanks at
all and may hence be significantly shorter than the CharSequence
supplied to the factory method as the original search sequence.indexOf()
type are where heavy duty text processing tools — like
SVNkeys
or FuR
—
do spend most of their processing time [sic!]. For example, beautifying
all SVN keys of all (some 200) Frame4J's source files by
SVNkeys
once took over 15 seconds. Shifting to
RK
and then to KMP
took this down to 600ms on the same
machine. CleverSSS
's inheritors is usually well worth the virtually nil
extra effort.RK
Modifier and Type | Class and Description |
---|---|
static class |
CleverSSS.Simple
Simple substring search, clever only for the short and optimistic
cases.
|
Modifier and Type | Field and Description |
---|---|
String |
algName
The algorithm's (short) name.
|
boolean |
ignoreCase
The comparison / search is not case sensitive.
|
boolean |
ignoreWS
The comparison / search has to ignore all white space.
|
int |
len
The sub sequence's length.
|
static int |
mskMod
The modulo mask used in the (RK sliding) hash.
|
boolean |
optimisticOK
The very simple optimistic search is OK.
|
static int |
primF
The prime factor used in the (RK sliding) hash.
|
protected char[] |
subC
The internal character array.
|
Modifier | Constructor and Description |
---|---|
protected |
CleverSSS(String algName,
boolean ignoreCase,
boolean ignoreWS)
Constructor for subclass / factory use, empty search sequence.
|
protected |
CleverSSS(String algName,
char[] subC,
boolean ignoreCase,
boolean ignoreWS,
boolean optimisticOK)
Constructor for subclass / factory use, no checks.
|
protected |
CleverSSS(String algName,
char subOne,
boolean ignoreCase,
boolean ignoreWS)
Constructor for subclass / factory use, single character.
|
Modifier and Type | Method and Description |
---|---|
int |
allWhere(long[] therFnd,
CharSequence sequ,
int sI,
int mxLen,
boolean overlap)
All (respectively up to limit) findings of this sub sequence.
|
static StringBuilder |
asPair(StringBuilder bastel,
long val)
Formate a long as int pair.
|
static StringBuilder |
asPairs(StringBuilder bastel,
long[] vals)
Formate a long array as int pairs .
|
char |
charAt(int index)
The subsequence's value at
index . |
StringBuffer |
commonState()
The (common) state as StringBuffer.
|
int |
compareTo(CleverSSS other)
Compare with other CleverSSS object for order.
|
int |
endIndexOf(CharSequence sequ,
int sI)
An "endIndexOf" using this object's settings.
|
int |
endIndexOf(CharSequence sequ,
int sI,
int mxLen)
An "endIndexOf" using this object's settings.
|
boolean |
equals(Object other)
Compare with other CleverSSS object for same state.
|
int |
hashCode()
The sub sequence's (RK) hash value.
|
protected abstract long |
implAlgWhere(CharSequence sequ,
int lk,
int sI,
int mxSi)
The algorithmic implementation of whereImpl().
|
protected int |
implAlgWhere(long[] therFnd,
CharSequence sequ,
int lk,
int sI,
int mxSi,
boolean overlap)
The algorithmic implementation of allWhere().
|
int |
indexOf(CharSequence sequ,
int sI)
An "indexOf" using this object's final settings.
|
int |
indexOf(CharSequence sequ,
int sI,
int mxLen)
An "indexOf" using this object's final settings.
|
int |
lastIndexOf(CharSequence sequ,
int sI)
A "lastIndexOf" using this object's settings.
|
int |
lastIndexOf(CharSequence sequ,
int sI,
int minSI)
A "lastIndexOf" using this object's settings.
|
long |
lastWhereImpl(CharSequence sequ,
int sI,
int minSI)
A "lastIndexOf" (lastWhere) using this object's
settings.
|
int |
length()
The sub sequence's length.
|
static CleverSSS |
make(boolean ignoreCase,
boolean ignoreWS)
Make a CleverSSS object, to search for the empty substring.
|
static CleverSSS |
make(char subOne,
boolean ignoreCase,
boolean ignoreWS)
Make a CleverSSS object, to search for a single character.
|
static CleverSSS |
makeSimple(CharSequence sub,
boolean ignoreCase,
boolean ignoreWS)
Make a "stupid" or simple reference implementation.
|
static long |
pair2long(int sI,
int eI)
An int pair as long value.
|
String |
state()
The state as String.
|
CharSequence |
subSequence(int start,
int end)
A "sub sub" sequence (made as String) .
|
static char[] |
toCharA(CharSequence sub,
boolean ignoreCase,
boolean ignoreWS)
Prepare a character array from a sequence.
|
String |
toString()
The content (sub sequence) as String.
|
int[] |
where(CharSequence sequ,
int sI)
An "indexOf" (where) using this object's settings.
|
int[] |
where(CharSequence sequ,
int sI,
int mxLen)
An "indexOf" (where) using this object's settings.
|
long |
whereImpl(CharSequence sequ,
int sI,
int mxLen)
An "indexOf" (where) using this object's settings.
|
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
chars, codePoints
public static final int primF
(RK)
implementations / inheritors of this class.RK
, but that has to be explored then.347
hashCode()
,
mskMod
,
Constant Field Valuespublic static final int mskMod
primF
and mskMod
here is
transporting Rabin Karp
's implementation details to its (this)
superclass. The rationale behind this design decision is to have the same
hash
algorithm in all CleverSSS
types. And that is by nature clearly determined by
Rabin Karp
.1073741823
= 0x3FFFFFFF = 0x40000000-1hashCode()
,
primF
,
Constant Field Valuespublic final int len
length()
protected final char[] subC
public final boolean ignoreCase
hashCode()
hashing and else.public final boolean ignoreWS
internal array
. They are not counted for the
subsequence's length
nor considered for
hashing
.String.trim()
.public final boolean optimisticOK
case
or
white space
, as well as, of course, for empty or
length one sequences.optimisticOK
true all searches are handled in this class
without calling the inheritor's
implementation
that
can restrict its cleverness to the "hard" requests.public final String algName
protected CleverSSS(String algName, char[] subC, boolean ignoreCase, boolean ignoreWS, boolean optimisticOK)
optimisticOK
being forced to true regardless of the respective
parameter if subC
is null, empty or of length one.algName
- the algorithms short name; null or empty defaults to
"CleverSSS (ext.)"subC
- the sequence to search foroptimisticOK
- true for trivial (length <= 1) cases as well as
those where an optimistic O(n) left to right search is feasibleignoreCase
,
ignoreWS
protected CleverSSS(String algName, char subOne, boolean ignoreCase, boolean ignoreWS)
ignoreWS
is true and subOne
is a space (<=' ') this
is considered as the empty sequence case.subOne
- single character search sequenceignoreCase
- true: ignore caseignoreWS
- true: ignore white spacesignoreCase
,
ignoreWS
protected CleverSSS(String algName, boolean ignoreCase, boolean ignoreWS)
ignoring
white space if so requested.ignoreCase
- true: ignore caseignoreWS
- true: ignore white spacesignoreCase
,
ignoreWS
public final int hashCode()
CleverSSS
objects, like RK
and KMP
, may very well
be used as (hash) keys. In the case of ignoreCase
true the hash
value respectively the hashing function is not case sensitive. This would
be a feature not available with String
s as keys.Rabin Karp
algorithm but irrelevant here KMP
uses RK
's hash function
for consistency within the type CleverSSS
.public final boolean equals(Object other)
other
object is of type
CleverSSS
and has (completely) the same state.compareTo(other)
.public final int length()
length
in interface CharSequence
public final String toString()
CleverSSS
's type CharSequence
and the naive use in String concatenating expressions.toString
in interface CharSequence
toString
in class Object
public final char charAt(int index) throws NullPointerException, ArrayIndexOutOfBoundsException
index
.CharSequence
.CharSequence
.charAt
in interface CharSequence
NullPointerException
- if this sequence is emptyArrayIndexOutOfBoundsException
- if index < 0
or >= length
public final CharSequence subSequence(int start, int end)
subSequence
in interface CharSequence
public final int compareTo(CleverSSS other)
other
CleverSSS and a negative value if the
other
CleverSSS is regarded as greater.longer
(sub) sequence or
if the length
s are equal the (sub) sequence being
greater by simple character value compare.equals()
: this.equals(other)
true implies that this.compareTo(other)
returns 0. But not the other way round!this.compareTo(other)
== 0 implies equal
subsequences to search for both objects. So, the consistency might be
adequate for most uses as Comparable
.ignoreCase
and
ignoreWS
, that might be somewhat surprising, especially when set
differently on both CleverSSS
objects to compare.compareTo
in interface Comparable<CleverSSS>
NullPointerException
- if other is nullpublic String state()
CleverSSS
object. The purpose is debugging or fabrication of
often used searches. The result is usually multi line.public final StringBuffer commonState()
state()
public static CleverSSS make(boolean ignoreCase, boolean ignoreWS)
public static CleverSSS make(char subOne, boolean ignoreCase, boolean ignoreWS)
ignoreCase
- true: ignore caseignoreWS
- true: ignore white spacespublic static CleverSSS makeSimple(CharSequence sub, boolean ignoreCase, boolean ignoreWS)
simple
"
CleverSSS
object that is not so clever in the light of Rabin,
Knuth & alii by implementing the real searches (naively) by O(n*m)
loop in loop (as does String
.String
's
indexOf methods — non regarding the extra ability to optionally
ignore case
and / or
white space
.sub
- the sequence to search forignoreCase
- true: ignore caseignoreWS
- true: ignore white spacesignoreCase
,
ignoreWS
public static char[] toCharA(CharSequence sub, boolean ignoreCase, boolean ignoreWS)
sub
- the input sequenceignoreCase
- true means convert to lower caseignoreWS
- true means do not transfer white spacespublic static final StringBuilder asPair(StringBuilder bastel, long val)
public static long pair2long(int sI, int eI)
sI
- first int (start index) goes to the lower 32 bitseI
- first int (end index) goes to the upper 32 bitspublic static final StringBuilder asPairs(StringBuilder bastel, long[] vals)
asPair(StringBuilder, long)
. The formatting stops at the end
of the array or at the first element == -1L (exclusive).public final int indexOf(CharSequence sequ, int sI, int mxLen)
CleverSSS
's (sub)
sequence
in the CharSequence
seq
supplied. The search is done once starting at position sI
and
going to the end of seq
respectively mxLen -1
.case
or ignoring
white space
(or both or none) according to this
CleverSSS
object's final settings.(int)whereImpl(sequ, sI, mxLen)
whereImpl(CharSequence, int, int)
.sequ
- the sequence in which sub is searched forsI
- the index in sequ
to start the search for this
object's (sub) sequence (<0 is regarded as 0)mxLen
- if > 0 and < seq
's length it is taken as
"shortened" length of sequ
public final int indexOf(CharSequence sequ, int sI)
indexOf(seq, sI, 0)
and(int)whereImpl(sequ, sI, 0)
whereImpl(CharSequence, int, int)
.sequ
- the sequence in which sub is searched forsI
- the index in sequ
to start the search for sub
(<0 is regarded as 0)public final int endIndexOf(CharSequence sequ, int sI, int mxLen)
indexOf(seq, sI, mxLen
except
for returning the match's end index + 1 instead of the first index. This
value can only be unequal to first index plus subsequence's
length
in the case of ignoring
white space.(int)(whereImpl(sequ, sI, mxLen) >>> 32)
whereImpl(CharSequence, int, int)
.sequ
- the sequence in which sub is searched forsI
- the index in sequ
to start the search for sub
(<0 is regarded as 0)mxLen
- if > 0 and < seq
's length it is taken as
"shortened" length of sequ
public final int endIndexOf(CharSequence sequ, int sI)
indexOf(seq, sI
except for
returning the match's end index + 1 instead of the first index. This
value can only be unequal to first index plus subsequence's
length
in the case of ignoring
white space.enIndexOf(seq, sI, 0)
and (int)(whereImpl(sequ, sI, 0) >>> 32)
whereImpl(CharSequence, int, int)
.sequ
- the sequence in which sub is searched forsI
- the index in sequ
to start the search for sub
(<0 is regarded as 0)seq
or -1 if no matchpublic final int[] where(CharSequence sequ, int sI, int mxLen)
final long ret = whereImpl(sequ, sI, mxLen);
return new int[] {(int) ret, (int)(ret >>> 32)};
*
this being also the implementation based on the inheritors implementation
of whereImpl(CharSequence, int, int)
.sequ
- the sequence in which sub is searched forsI
- the index in sequ
to start the search for sub
(<0 is regarded as 0)mxLen
- if > 0 and < seq
's length it is taken as
shortened length of sequ
public final int[] where(CharSequence sequ, int sI)
sequ
- the sequence in which sub is searched forsI
- the index in sequ
to start the search for sub
(<0 is regarded as 0)public final int allWhere(long[] therFnd, CharSequence sequ, int sI, int mxLen, boolean overlap)
therFnd
)
findings of this substring in seq
. The found spots are returned
as modifications of the array therFnd
supplied; see
whereImpl()
for the
interpretation of the long value.therFnd
supplied is longer as the (returned) number of findings
the next element ([# of findings]) will be set to -1L. This ensures a
comfortable break out solution in (enhanced for) loops over that
array.implAlgWhere(long[], CharSequence, int, int, int, boolean)
that
might cleverly be overridden in extending classes, if applicable.therFnd
- where to put all findings; this array will be
modified up to [return value] and at least up to [0] (if that
element exists)sequ
- the sequence in which sub is searched forsI
- the index in sequ
to start the search for sub
(<0 is regarded as 0)mxLen
- if > 0 and < seq
's length it is taken as
shortened length of sequ
overlap
- if true the findings may overlap, meaning that
"otto" is twice found in "ottotto" ( 0 and 3)
instead of once (from beginning to end)implAlgWhere(long[], CharSequence, int, int, int, boolean)
,
whereImpl(CharSequence, int, int)
protected int implAlgWhere(long[] therFnd, CharSequence sequ, int lk, int sI, int mxSi, boolean overlap)
allWhere()
after handling all trivial, exceptional, illegal and optimistic
cases.implAlgWhere(CharSequence, int, int, int)
in an adequately
organised loop. That's what can be done for most algorithms. But some
(like Knuth Morris Pratt
) could do those multiple searches
much better.public final long whereImpl(CharSequence sequ, int sI, int mxLen)
CleverSSS
objects subsequence within the sequence
seq
supplied as parameter. This method handles all trivial
and optimistic
cases by itself
delegating
only
the hard requests to the inheritor's clever
algorthm
.RK (Rabin Karp)
or
KMP (Knuth &alii)
do implement the famous inventor's
clever substring search algorithm by implementing the abstract method
implAlgWhere(CharSequence, int, int, int)
.CleverSSS.Simple
on the other hand provides the
(String
standard) O(n*m) simple loop in loop implementation as
a reference.indexOf(CharSequence, int, int)
,
indexOf(CharSequence, int)
,
endIndexOf(CharSequence, int, int)
,
where(CharSequence, int, int)
or
where(CharSequence, int)
.whereImpl(CharSequence, int, int)
instead of only a
where(..)
is avoiding the
throwaway object (returning one long primitive value instead of an
two element int array).sequ
- the sequence in which sub is searched forsI
- the index in sequ
to start the search for sub
(<0 is regarded as 0)mxLen
- if > 0 and < seq
's length it is taken as
shortened length of sequ
implAlgWhere(CharSequence, int, int, int)
,
optimisticOK
protected abstract long implAlgWhere(CharSequence sequ, int lk, int sI, int mxSi)
whereImpl(CharSequence, int, int)
does all parameter
checks and handles all exceptional / illegal cases by returning -1. It
also handles the optimistic
cases.sequ
- the sequence to search inlk
- its correct or shortened) lengthsI
- the starting index for the searchmxSi
- the maximum possible value of the match indexpublic long lastWhereImpl(CharSequence sequ, int sI, int minSI)
TextHelper.lastIndexOf(CharSequence, CharSequence, int)
with the
difference of returning both the index of the first matching and of the
first non matching character (behind) in seq
in case of
a hit.ignoreWS
false these both numbers just differ by the
length
of the sub sequence searched for.ignoring
search the
difference mentioned can be greater than length
as well as shorter than the subsequence used to make this
CleverSSS
object.whereImpl(CharSequence, int, int)
.simple
"
search objects, but may disappoint if used with an object of one of the
(clever) inheritors like RK
or KMP
.sequ
- the sequence in which sub is searched forsI
- the index in sequ
to start the search for sub
(<0 or too large is regarded as from the end)minSI
- a stopper down to where seq will be searched;
<= 0 is taken as 0; > (adjusted) sI will return -1public final int lastIndexOf(CharSequence sequ, int sI)
(int)lastWhereImpl(sequ, sI, 0)
.public final int lastIndexOf(CharSequence sequ, int sI, int minSI)
(int)lastWhereImpl(sequ, sI, minSI)
.