有关字符串查找截取相关方法探究
程序员文章站
2022-07-14 16:37:43
...
主要分析
Stringtokenizer > string.subString > splitter.on(guava) 三种字符串截取类
1.首先介绍 String.subString () 方法 :不支持正则;
public String substring(int beginIndex, int endIndex) {
int length = length();
checkBoundsBeginEnd(beginIndex, endIndex, length);
int subLen = endIndex - beginIndex;
if (beginIndex == 0 && endIndex == length) {
return this;
}
return isLatin1() ? StringLatin1.newString(value, beginIndex, subLen)
: StringUTF16.newString(value, beginIndex, subLen);
}
它首先new一个string 对象,对于这个string对象如何赋值,如何截取。下一个函数
public static String newString(byte[] val, int index, int len) {
return new String(Arrays.copyOfRange(val, index, index + len),
LATIN1);
}
Arrays.copyOfRange() // 使用这个方法
public static byte[] copyOfRange(byte[] original, int from, int to) {
int newLength = to - from;
if (newLength < 0)
throw new IllegalArgumentException(from + " > " + to);
byte[] copy = new byte[newLength];
System.arraycopy(original, from, copy, 0,
Math.min(original.length - from, newLength));
return copy;
}
最后还是归结到 System.arraycopy() 这个使用 C语言使用的native方法上,还是 多占用了内存,但是在性能上使用了单次遍历截取,没有回溯。
2. Splitter.on() 这个方法性能就很低,他是guava包中的新方法,代码使用了两次循环遍历并错误回溯的 朴素字符串匹配,性能极差;
优点就是返回一个list集合,编码很简洁,但是不排除有很多坑,源码注释不够多;
这是一张简单的性能对比图面;
下面是一个源码;
public static Splitter on(final String separator) {
checkArgument(separator.length() != 0, "The separator may not be the empty string.");
if (separator.length() == 1) {
return Splitter.on(separator.charAt(0));
}
return new Splitter(
new Strategy() {
@Override
public SplittingIterator iterator(Splitter splitter, CharSequence toSplit) {
return new SplittingIterator(splitter, toSplit) {
@Override
public int separatorStart(int start) {
int separatorLength = separator.length();
positions:
for (int p = start, last = toSplit.length() - separatorLength; p <= last; p++) {
for (int i = 0; i < separatorLength; i++) {
if (toSplit.charAt(i + p) != separator.charAt(i)) {
continue positions;
}
}
return p;
}
return -1;
}
@Override
public int separatorEnd(int separatorPosition) {
return separatorPosition + separator.length();
}
};
}
});
}
这是Splitter.on(String )的核心方法
上图代码就展示了它的极差性能体现之处。
3. Stringtokenier 这个类是三种方法中性能最快的,但是它的返回 只能通过迭代获取每一个值,在代码简洁性上不如上两种。
因为它不支持正则,所以它不需要迭代回溯 ;
它分为初始化,和迭代求解两个过程 完成字符串截取; 但它所要维护的类变量会多一些,内存少量增加;
private int scanToken(int startPos) {
int position = startPos;
while (position < maxPosition) {
if (!hasSurrogates) {
char c = str.charAt(position);
if ((c <= maxDelimCodePoint) && (delimiters.indexOf(c) >= 0))
break;
position++;
} else {
int c = str.codePointAt(position);
if ((c <= maxDelimCodePoint) && isDelimiter(c))
break;
position += Character.charCount(c);
}
}
if (retDelims && (startPos == position)) {
if (!hasSurrogates) {
char c = str.charAt(position);
if ((c <= maxDelimCodePoint) && (delimiters.indexOf(c) >= 0))
position++;
} else {
int c = str.codePointAt(position);
if ((c <= maxDelimCodePoint) && isDelimiter(c))
position += Character.charCount(c);
}
}
return position;
}
这是它的核心函数; 可见它是不支持正则的
4. string.split() 速度一般但支持正则;
String.split()也可以支持正则,返回数组;
核心函数:
public String[] split(String regex, int limit) {
/* fastpath if the regex is a
(1)one-char String and this character is not one of the
RegEx's meta characters ".$|()[{^?*+\\", or
(2)two-char String and the first char is the backslash and
the second is not the ascii digit or ascii letter.
*/
char ch = 0;
if (((regex.length() == 1 &&
".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
(regex.length() == 2 &&
regex.charAt(0) == '\\' &&
(((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
((ch-'a')|('z'-ch)) < 0 &&
((ch-'A')|('Z'-ch)) < 0)) &&
(ch < Character.MIN_HIGH_SURROGATE ||
ch > Character.MAX_LOW_SURROGATE))
{
int off = 0;
int next = 0;
boolean limited = limit > 0;
ArrayList<String> list = new ArrayList<>();
while ((next = indexOf(ch, off)) != -1) {
if (!limited || list.size() < limit - 1) {
list.add(substring(off, next));
off = next + 1;
} else { // last one
//assert (list.size() == limit - 1);
int last = length();
list.add(substring(off, last));
off = last;
break;
}
}
// If no match was found, return this
if (off == 0)
return new String[]{this};
// Add remaining segment
if (!limited || list.size() < limit)
list.add(substring(off, length()));
// Construct result
int resultSize = list.size();
if (limit == 0) {
while (resultSize > 0 && list.get(resultSize - 1).length() == 0) {
resultSize--;
}
}
String[] result = new String[resultSize];
return list.subList(0, resultSize).toArray(result);
}
return Pattern.compile(regex).split(this, limit);
}