
Scanner vs. StringTokenizer vs. String.split



I just learned about Java's Scanner class and now I'm wondering how it compares/competes with StringTokenizer and String.split(). I know that StringTokenizer and String.split() only work on Strings, so why would I want to use Scanner for a String? Is Scanner just intended to be one-stop shopping for splitting?




Comments

  1. They're essentially horses for courses.

    Scanner is designed for cases where you need to parse a string, pulling out data of different types. It's very flexible, but arguably doesn't give you the simplest API for simply getting an array of strings delimited by a particular expression.

    String.split() and Pattern.split() give you an easy syntax for doing the latter, but that's essentially all that they do. If you want to parse the resulting strings, or change the delimiter halfway through depending on a particular token, they won't help you with that.

    StringTokenizer is even more restrictive than String.split(), and also a bit fiddlier to use. It is essentially designed for pulling out tokens delimited by fixed substrings. Because of this restriction, it's about twice as fast as String.split(). (See my comparison of String.split() and StringTokenizer.) It also predates the regular expressions API, of which String.split() is a part.

    You'll note from my timings that String.split() can still tokenize thousands of strings in a few milliseconds on a typical machine. In addition, it has the advantage over StringTokenizer that it gives you the output as a string array, which is usually what you want. Using an Enumeration, as provided by StringTokenizer, is too "syntactically fussy" most of the time. From this point of view, StringTokenizer is a bit of a waste of space nowadays, and you may as well just use String.split().
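
    As a rough illustration of the three styles described above, here is a minimal sketch. The class name TokenizeDemo, its method names, and the sample "name quantity price" input format are assumptions made up for this example, not anything from the original comparison.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Locale;
    import java.util.Scanner;
    import java.util.StringTokenizer;

    // Hypothetical demo class; names and input formats are illustrative only.
    class TokenizeDemo {

        // String.split(): regex delimiter, tokens come back as an array.
        static List<String> withSplit(String input) {
            return Arrays.asList(input.split(","));
        }

        // StringTokenizer: fixed delimiter characters, Enumeration-style loop.
        static List<String> withTokenizer(String input) {
            List<String> tokens = new ArrayList<>();
            StringTokenizer st = new StringTokenizer(input, ",");
            while (st.hasMoreTokens()) {
                tokens.add(st.nextToken());
            }
            return tokens;
        }

        // Scanner: pulls values of different types out of one string.
        // Assumed input format: repeated "name quantity unitPrice" triples.
        static double totalPrice(String input) {
            double total = 0.0;
            // Fix the locale so "1.5" parses the same way on every machine.
            try (Scanner sc = new Scanner(input).useLocale(Locale.US)) {
                while (sc.hasNext()) {
                    sc.next();                        // item name, not used here
                    int quantity = sc.nextInt();
                    double unitPrice = sc.nextDouble();
                    total += quantity * unitPrice;
                }
            }
            return total;
        }
    }
    ```

    One behavioral difference worth knowing: for the input "a,,b", split(",") keeps the empty token in the middle, while StringTokenizer skips it.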

  2. Let's start by eliminating StringTokenizer. It is getting old and doesn't even support regular expressions. Its documentation states:


    StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.


    So let's throw it out right away. That leaves split() and Scanner. What's the difference between them?

    For one thing, split() simply returns an array, which makes it easy to use a foreach loop:

    for (String token : input.split("\\s+")) { ... }


    Scanner is built more like a stream:

    while (myScanner.hasNext()) {
        String token = myScanner.next();
        ...
    }


    or

    while (myScanner.hasNextDouble()) {
        double token = myScanner.nextDouble();
        ...
    }


    (It has a rather large API, so don't think that it's always restricted to such simple things.)

    This stream-style interface can be useful for parsing simple text files or console input, when you don't have (or can't get) all the input before starting to parse.

    Personally, the only time I can remember using Scanner is for school projects, when I had to get user input from the command line. It makes that sort of operation easy. But if I have a String that I want to split up, it's almost a no-brainer to go with split().

  3. If you have a String object you want to tokenize, favor using String's split method over a StringTokenizer. If you're parsing text data from a source outside your program, like from a file, or from the user, that's where a Scanner comes in handy.

  4. StringTokenizer has always been there. It is the fastest of the three, but its Enumeration-like idiom may not look as elegant as the others.

    split() arrived in JDK 1.4. It is slower than StringTokenizer but easier to use, since it can be called directly on a String.

    Scanner arrived in JDK 1.5. It is the most flexible of the three and fills a long-standing gap in the Java API: an equivalent of C's famous scanf family of functions.

  5. String.split() seems to be much slower than StringTokenizer. The advantages of split() are that you get an array of the tokens and that you can use any regular expression as the delimiter.
    org.apache.commons.lang.StringUtils has a split method that works much faster than either StringTokenizer or String.split().
    But the CPU utilization for all three is nearly the same, so we also need a method that is less CPU intensive, which I have not yet been able to find.

  6. I recently ran some experiments on the poor performance of String.split() in highly performance-sensitive situations. You may find this useful.

    http://eblog.chrononsystems.com/hidden-evils-of-javas-stringsplit-and-stringr

    The gist is that String.split() compiles a regular expression pattern on each call, which can slow your program down compared with compiling a Pattern object once and applying it directly to each String.
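
    A minimal sketch of the precompiled approach, assuming a comma delimiter with optional surrounding whitespace. The class name PrecompiledSplit, the constant name, and the delimiter are assumptions chosen for illustration; Pattern.compile() and Pattern.split() are the standard java.util.regex calls.

    ```java
    import java.util.regex.Pattern;

    // Hypothetical helper; the delimiter regex is an assumption for the example.
    class PrecompiledSplit {

        // Compiled once when the class loads, then reused for every line.
        private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

        // Same result as line.split("\\s*,\\s*"), without recompiling the regex.
        static String[] splitLine(String line) {
            return COMMA.split(line);
        }

        // Example of reuse in a loop: counts the fields across many lines.
        static int countFields(String[] lines) {
            int count = 0;
            for (String line : lines) {
                count += splitLine(line).length;
            }
            return count;
        }
    }
    ```

    How much this saves depends on the regex and the JDK version, so it is worth measuring on your own data before committing to it.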

  7. I am currently using split() to scan through a file where each line has a number of strings delimited by ~. I read somewhere that Scanner could perform better on a long file, so I thought about checking it out. But my question is: would I have to create two instances of Scanner, one to read each line and one on that line to get the tokens for the delimiter? If so, I doubt I would gain anything by using it. Maybe I am missing something here?
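
    One possible arrangement, sketched under assumptions: a single Scanner reads the lines, and each line is split with a Pattern compiled once, so no second Scanner is needed. The class name TildeFileDemo and the method readRecords are hypothetical, and whether this is any faster than the existing split() loop would have to be measured.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Scanner;
    import java.util.regex.Pattern;

    // Hypothetical reader for lines of ~-delimited fields.
    class TildeFileDemo {

        // Compiled once and reused for every line.
        private static final Pattern TILDE = Pattern.compile("~");

        // Accepts any Readable (for example a FileReader or a StringReader).
        static List<String[]> readRecords(Readable source) {
            List<String[]> records = new ArrayList<>();
            // One Scanner walks the input line by line; closing it also
            // closes the underlying source if that source is Closeable.
            try (Scanner lines = new Scanner(source)) {
                while (lines.hasNextLine()) {
                    records.add(TILDE.split(lines.nextLine()));
                }
            }
            return records;
        }
    }
    ```

    For a real file, the same method could be called with new java.io.FileReader(path), where path is whatever file you are reading.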
