Measuring “String.indexOfAny(String)” performance

(Java micro-benchmarking with jmh)

Background

Recently I came across a bit of code solving a relatively common problem — that of finding if a given contains one of specified characters.
This may be for input validation (refuse Strings that contains certain character(s)), or as a precondition to String modification (escaping / quoting), for example.

It might be used like:

String xmlToCheck = ...;
if (textToCheck.containsAnyOf("&<>'\"")) { // if JDK String had it
// do some escape/quote magic
}

The implementation I saw (which I’ll show in a bit) looked like it might benefit from optimization, or perhaps could be replaced by a 3rd party implementation — JDK does not (perhaps surprisingly) contain a method like that as member method.

But before trying to optimize something, it is a very good idea to have a baseline of performance: both to know if you are making any improvements and and (of course) have an idea of whether performance even matters.

Off to Benchmarking!

This seemed like a nice little use case for micro-benchmarking (*) using the venerable jmh micro-benchmark library (for more background, see f.ex this blog post).

So I decided to create a new Github repo — https://github.com/cowtowncoder/misc-micro-benchmarks — with tests under test.
After cloning, , you could run test with:

java -jar target/microbenchmarks.jar StringContainsChars 

But before getting there, let’s see what we had to work with

(*) the usual disclaimer: micro-benchmarks can easily be misleading; should not be trusted to prove general cases and if used at all should be specifically used in a way as similar as possible to the actual production use.

Code as it was in the beginning

Ok so the initial code looked like this:

List<Character> charsToCheck = Arrays.asList('&', '<', '>', '\'', '"'); 
return charsToCheck.stream().anyMatch(ch -> str.indexOf(ch);

Code is functionally correct and relatively simple. But based on some of earlier work, I suspected that the use of Streams with lambdas might incur non-trivial (*) overhead here.

Benchmarking with a small set of about dozen Strings — some with, most without such characters — gave us the baseline number (on my machine, about 1.5 million calls per second).
Note: it is worth repeating that this would NOT be a good test set for actual performance engineering: you really should have a representative set(s) of input if aiming at serious improvements. Multiple sets could also be used to compare kinds of inputs (with, without matching character(s); short, long Strings; perhaps pathological cases).

(*) non-trivial obviously depends on context — for casual use, test cases, it’s almost certainly trivial. But in tight loop for a, say, document validation, it might be non-trivial. As usual, “it depends”.

First custom alternative: loop without Stream

The first trivial simplification (from perspective of underlying machinery, not code size) is just to change code to do plain looping, call of :

// charsToCheck = "&<>'\"";
public boolean containsAnyOf(String inputString, String charsToCheck)
{
for (int i = 0, len = toCheck.length(); i < len; ++i) {
if (inputString.indexOf(toCheck.charAt(i)) >= 0) {
return true;
}
}
return false;
}

that is, iterate over s to check, call on them one by one.
(note: we could also have removed looping altogether and do 5 separate calls — but for now wanted to keep approach dynamic, close to the original code — trying out that approach could be interesting too)

Should this have much effect? Aren’t s and lambdas pretty efficient?

Yes (and yes)! Pretty efficient is true, for what is being done — yet in this particular case, eliminating doubled throughtput: from 1.5 million per second to almost 3.5 million…

Second and third custom alternatives

Another thing to consider is that instead of looping over characters to check, it’d be possible to scan over characters of the input String and check if they are contained in “characters-to-check-as-String” — that is, loop in the “other” direction. Github repo has this case too (see “ and that is perhaps surprisingly quite a bit faster (I am not 100% sure why: perhaps has special intrinsic translation for Short Strings like here?).

But as mentioned earlier, for this particular use case we could also make use of the fact that we are only interested in a small, statically defined set of characters. So we do not really need at all — we can use statement. Like so:

public boolean containsCharToQuote(String inputString)
{
for (int i = 0, len = str.length(); i < len; ++i) {
switch (str.charAt(i)) {
case '&':
case '<':
case '>':
case '\'':
case '"':
return true;
}
}
return false;
}

So does this help? You betcha! We went from 3.5 million to 9 million!

Or maybe JDK regexps could do it faster?

Knowing that JDK has regular expression functionality — and rather efficient implementation at that, based on my experiences — perhaps we could make use of that. This is relatively easy (if a few more lines than other approaches) as we can either hand-code matcher (for small number of characters) or statically build it. Something like:

StringBuilder sb = new StringBuilder().append("[");
for (char c : new char[] { '&', '<', '>', '\'', '"' }) {
sb.append(Pattern.quote(String.valueOf(c)));
}
Pattern checker = Pattern.compile(sb.append("]").toString());
// used like:
if (checker.matcher(stringToCheck).find()) {
...
}

Turns out this approach gives only marginally better results than the original code — 1.8 M/s.

But isn’t this already written by someone?

Ok so, normally one should probably peek into something like Guava or one of Apache commons libraries to see if they have something, before proceeding to write code. There are likely to be existing implementations there.

Turns out there are indeed:

  • Guava has
  • Apache Commons-lang has (similar to our utility method)

So how do these fare? Surprisingly, perhaps, not-so-well and quite-well, respectively:

  • Guava only does 1.8M/s (similar to our initial implementation)
  • Commons-lang does much better with 5.3M/s

Without looking at Guava implementation, numbers are so similar to JDK Regexp based approach that I wonder if that is how it is implemented.
Commons-lang uses approach similar to the dynamic “loop over String, use String.indexOf()” which makes it quite a bit faster.

One final manual approach

So far so good — we have a few approaches evaluated. But there is one other possibly simple and performant alternative to use of statement — knowing that the characters to check are within lower section of Unicode/Ascii characters, we can use a simple based bitset approach (for some sets of characters even would do — but not here).
Code looks something like this (for full details, check test itself):


final long mask = calculateMask(); // omitted here, check full test
final String stringToCheck = ....;
// loop over input String, check bitmask to see if we have match
for (int i = 0, len = str.length(); i < len; ++i) {
final int ch = str.charAt(i);
int offset = ch - 32;
if ((offset < 64) && (offset >= 0)
&& (CHECKED_CHARS_MASK & (1L << offset)) != 0) {
return true;
}
}
return false;

In my tests, this approach gave roughly same performance as -based alternative (9.8 M/s).

Results table

As a summary of all tests, here are the numbers I saw (plus couple of other entries)

Benchmark                                              Mode  Cnt        Score        Error  Units
StringContainsChars.method1_streamWithIndexOf thrpt 15 1451222.313 ± 6683.463 ops/s
StringContainsChars.method2a_stringIndexOfTimesN thrpt 15 3485362.606 ± 147550.312 ops/s
StringContainsChars.method2b_scanStringAndIndexOf thrpt 15 6007080.938 ± 24865.942 ops/s
StringContainsChars.method2c_scanWithSwitch thrpt 15 9464115.514 ± 377737.768 ops/s
StringContainsChars.method3a_guavaBasedCheck thrpt 15 1840523.997 ± 12226.246 ops/s
StringContainsChars.method3b_commonsLang3ContainsAny thrpt 15 5335174.154 ± 19938.776 ops/s
StringContainsChars.method4a_jdkRegExpBasedCheck thrpt 15 1838346.183 ± 6484.875 ops/s
StringContainsChars.method4b_bitsetBasedCheck thrpt 15 9819436.570 ± 300016.151 ops/s

Tests were run on a Mac mini, using Java 8.

Thoughts?

With the usual caveat that one should always consider real use cases when evaluating performance — as well as whether and how much performance matters — I guess I have some thoughts.

First: commons-lang implementation looks like a solid, fast-enough one, given that it takes dynamic input. I would be happy using it.

Second: a relatively simple scan-over-String-use-switch solution is surprisingly efficient for the use case (half a dozen of characters, fixed set).
So for exact use case I might go with that, if (and only if) performance really does matter.

Third: simple benchmarking can be fun :)

Fourth: the sample set of inputs I used is not great and it is quite possible results above would be different with different kind of input.

Open Source developer, most known for Jackson data processor (nee “JSON library”), author of many, many other OSS libraries for Java, from ClassMate to Woodstox