A look at Java methods for searching and splitting strings

Programmer Article Station (程序员文章站) 2022-07-14 16:37:43


What this article analyzes

StringTokenizer > String.substring > Splitter.on (Guava): three ways of cutting strings apart, ranked here by speed, fastest first.

1. First, String.substring(): it does not support regular expressions.

public String substring(int beginIndex, int endIndex) {
        int length = length();
        checkBoundsBeginEnd(beginIndex, endIndex, length);
        int subLen = endIndex - beginIndex;
        if (beginIndex == 0 && endIndex == length) {
            return this;
        }
        return isLatin1() ? StringLatin1.newString(value, beginIndex, subLen)
                          : StringUTF16.newString(value, beginIndex, subLen);
    }

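Unless the requested range covers the whole string, it creates a new String object. How that object gets its contents, and how the range is cut out, is handled by the next function: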

public static String newString(byte[] val, int index, int len) {
        return new String(Arrays.copyOfRange(val, index, index + len),
                          LATIN1);
    }

newString in turn relies on Arrays.copyOfRange():

    public static byte[] copyOfRange(byte[] original, int from, int to) {
        int newLength = to - from;
        if (newLength < 0)
            throw new IllegalArgumentException(from + " > " + to);
        byte[] copy = new byte[newLength];
        System.arraycopy(original, from, copy, 0,
                         Math.min(original.length - from, newLength));
        return copy;
    }

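In the end everything comes down to System.arraycopy(), a native method implemented inside the JVM rather than in Java. The copy does cost extra memory, because every substring gets its own array, but the range is copied in a single pass with no backtracking.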
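The two branches of substring() described above can be observed directly. The following is a minimal sketch, not part of the JDK source; the class name SubstringDemo and the helper copyRange are made up for illustration.

```java
import java.util.Arrays;

class SubstringDemo {
    // Hypothetical helper: the same copy step newString() performs,
    // allocating a new array and copying the range [from, to) in one pass.
    static byte[] copyRange(byte[] original, int from, int to) {
        return Arrays.copyOfRange(original, from, to);
    }

    public static void main(String[] args) {
        String s = "hello world";

        // Full range: substring() takes the `return this` branch,
        // so no new object is created.
        System.out.println(s.substring(0, s.length()) == s); // true

        // Any other range: a new String backed by a copied array.
        System.out.println(s.substring(6, 11)); // world

        // The underlying copy, shown on a plain byte array.
        byte[] copy = copyRange(new byte[] {1, 2, 3, 4}, 1, 3);
        System.out.println(Arrays.toString(copy)); // [2, 3]
    }
}
```

Because every call outside the full-range case allocates, taking many substrings of one large string in a loop creates one array per call.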

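2. Splitter.on() is considerably slower. It comes from the Guava library, and for multi-character separators its code is a naive string match: two nested loops that restart the comparison at the next position after every mismatch. In the worst case that costs O(n·m) character comparisons for an input of length n and a separator of length m, so its performance is poor.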

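Its advantage is concise calling code, since the result can be collected into a List (for example with splitToList). There may still be pitfalls, though, and the source is sparsely commented.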

[Figure: a simple performance comparison chart]

The source code follows:

public static Splitter on(final String separator) {
    checkArgument(separator.length() != 0, "The separator may not be the empty string.");
    if (separator.length() == 1) {
      return Splitter.on(separator.charAt(0));
    }
    return new Splitter(
        new Strategy() {
          @Override
          public SplittingIterator iterator(Splitter splitter, CharSequence toSplit) {
            return new SplittingIterator(splitter, toSplit) {
              @Override
              public int separatorStart(int start) {
                int separatorLength = separator.length();

                positions:
                for (int p = start, last = toSplit.length() - separatorLength; p <= last; p++) {
                  for (int i = 0; i < separatorLength; i++) {
                    if (toSplit.charAt(i + p) != separator.charAt(i)) {
                      continue positions;
                    }
                  }
                  return p;
                }
                return -1;
              }

              @Override
              public int separatorEnd(int separatorPosition) {
                return separatorPosition + separator.length();
              }
            };
          }
        });
  }

This is the core of Splitter.on(String).


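The code above shows where the poor performance comes from: separatorStart tries every start position in turn and compares the separator character by character at each one.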
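To experiment with this search without pulling in Guava, the nested loop can be lifted into a standalone method. This is only a sketch: the class NaiveSeparatorSearch and its split method are hypothetical and are not Guava API, and the separator is assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.List;

class NaiveSeparatorSearch {
    // Same shape as separatorStart above: try each start position p,
    // compare char by char, and move on to p + 1 at the first mismatch.
    static int separatorStart(CharSequence toSplit, String separator, int start) {
        int separatorLength = separator.length();
        positions:
        for (int p = start, last = toSplit.length() - separatorLength; p <= last; p++) {
            for (int i = 0; i < separatorLength; i++) {
                if (toSplit.charAt(i + p) != separator.charAt(i)) {
                    continue positions;
                }
            }
            return p;
        }
        return -1;
    }

    // Hypothetical driver: collects the pieces between separator matches.
    // Assumes a non-empty separator, as Splitter.on(String) also requires.
    static List<String> split(String toSplit, String separator) {
        List<String> parts = new ArrayList<>();
        int offset = 0;
        int pos;
        while ((pos = separatorStart(toSplit, separator, offset)) != -1) {
            parts.add(toSplit.substring(offset, pos));
            offset = pos + separator.length();
        }
        parts.add(toSplit.substring(offset));
        return parts;
    }

    public static void main(String[] args) {
        String text = "a::b::c";
        System.out.println(separatorStart(text, "::", 0)); // 1
        System.out.println(separatorStart(text, "::", 3)); // 4
        System.out.println(split(text, "::"));             // [a, b, c]
    }
}
```

An input such as "aaaaaaaa" with the separator "aab" brings out the worst case: almost every start position matches two characters before failing on the third.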

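3. StringTokenizer is the fastest of the three, but its results can only be retrieved one at a time by iterating, so the calling code is less concise than with the other two.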

Because it does not support regular expressions, it never has to backtrack while scanning.

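It works in two phases, initialization and then iteration. It has to maintain a few more fields than the other approaches, so memory use goes up slightly.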

private int scanToken(int startPos) {
        int position = startPos;
        while (position < maxPosition) {
            if (!hasSurrogates) {
                char c = str.charAt(position);
                if ((c <= maxDelimCodePoint) && (delimiters.indexOf(c) >= 0))
                    break;
                position++;
            } else {
                int c = str.codePointAt(position);
                if ((c <= maxDelimCodePoint) && isDelimiter(c))
                    break;
                position += Character.charCount(c);
            }
        }
        if (retDelims && (startPos == position)) {
            if (!hasSurrogates) {
                char c = str.charAt(position);
                if ((c <= maxDelimCodePoint) && (delimiters.indexOf(c) >= 0))
                    position++;
            } else {
                int c = str.codePointAt(position);
                if ((c <= maxDelimCodePoint) && isDelimiter(c))
                    position += Character.charCount(c);
            }
        }
        return position;
    }

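This is its core function. As the code shows, delimiters are matched one character (or code point) at a time, and there is no regex support.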
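For comparison with the list-returning approaches, here is a small sketch of the iteration that StringTokenizer requires. The class TokenizerDemo and its tokens helper are illustrative names, not part of the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerDemo {
    // Hypothetical helper: drains a StringTokenizer into a List.
    // Every character of `delimiters` is treated as a separate delimiter.
    static List<String> tokens(String input, String delimiters) {
        StringTokenizer st = new StringTokenizer(input, delimiters);
        List<String> out = new ArrayList<>(st.countTokens());
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Consecutive delimiters do not produce empty tokens.
        System.out.println(tokens("a,b;;c", ",;")); // [a, b, c]
    }
}
```

Note that the delimiter argument is a set of characters, not a multi-character separator, and that empty tokens are skipped. Both behaviors differ from String.split().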

4. String.split(): average speed, but it supports regular expressions.

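String.split() accepts a regular expression and returns an array. As the source below shows, it takes a fast path that avoids the regex engine when the delimiter is a single non-metacharacter (or an escaped one), and falls back to Pattern.compile(regex).split(...) otherwise.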

Core function:

public String[] split(String regex, int limit) {
        /* fastpath if the regex is a
         (1)one-char String and this character is not one of the
            RegEx's meta characters ".$|()[{^?*+\\", or
         (2)two-char String and the first char is the backslash and
            the second is not the ascii digit or ascii letter.
         */
        char ch = 0;
        if (((regex.length() == 1 &&
             ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
             (regex.length() == 2 &&
              regex.charAt(0) == '\\' &&
              (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
              ((ch-'a')|('z'-ch)) < 0 &&
              ((ch-'A')|('Z'-ch)) < 0)) &&
            (ch < Character.MIN_HIGH_SURROGATE ||
             ch > Character.MAX_LOW_SURROGATE))
        {
            int off = 0;
            int next = 0;
            boolean limited = limit > 0;
            ArrayList<String> list = new ArrayList<>();
            while ((next = indexOf(ch, off)) != -1) {
                if (!limited || list.size() < limit - 1) {
                    list.add(substring(off, next));
                    off = next + 1;
                } else {    // last one
                    //assert (list.size() == limit - 1);
                    int last = length();
                    list.add(substring(off, last));
                    off = last;
                    break;
                }
            }
            // If no match was found, return this
            if (off == 0)
                return new String[]{this};

            // Add remaining segment
            if (!limited || list.size() < limit)
                list.add(substring(off, length()));

            // Construct result
            int resultSize = list.size();
            if (limit == 0) {
                while (resultSize > 0 && list.get(resultSize - 1).length() == 0) {
                    resultSize--;
                }
            }
            String[] result = new String[resultSize];
            return list.subList(0, resultSize).toArray(result);
        }
        return Pattern.compile(regex).split(this, limit);
    }
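A few calls make the behavior of this method concrete, in particular the trailing-empty-string trimming when limit is 0 and the difference between a metacharacter and its escaped form. This is a minimal sketch; SplitDemo is just an illustrative class name.

```java
import java.util.Arrays;

class SplitDemo {
    public static void main(String[] args) {
        // limit == 0 (the default): trailing empty strings are dropped.
        System.out.println(Arrays.toString("a,b,,".split(",")));     // [a, b]

        // Negative limit: trailing empty strings are kept.
        System.out.println(Arrays.toString("a,b,,".split(",", -1))); // [a, b, , ]

        // "." is a regex metacharacter, so this goes through Pattern and
        // every character counts as a delimiter; all pieces are empty and
        // get trimmed, leaving a zero-length array.
        System.out.println("a.b".split(".").length);                 // 0

        // Escaped dot: two-char form handled by the fast path.
        System.out.println(Arrays.toString("a.b".split("\\.")));     // [a, b]
    }
}
```

When the same regex delimiter is applied many times, precompiling it once with Pattern.compile and calling split on the Pattern avoids recompiling on every call.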
