Skip to main content

Find duplicates in large file - java



I have really large file with approximately 15 million entries. Each line in the file contains a single string (call it key).





I need to find the duplicate entries in the file using java. I tried to use a hashmap and detect duplicate entries. Apparently that approach is throwing me a "java.lang.OutOfMemoryError: Java heap space" error.





How can I solve this problem?





I think I could increase the heap space and try it, but I wanted to know if there are better efficient solutions without having to tweak the heap space.


Comments

  1. The key is that your data will not fit into memory. You can use external merge sort for this:

    Partition your file into multiple smaller chunks that fit into memory. Sort each chunk, eliminate the duplicates (now neighboring elements).

    Merge the chunks and again eliminate the duplicates when merging. Since you will have an n-nway merge here you can keep the next k elements from each chunk in memory, once the items for a chunk are depleted (they have been merged already) grab more from disk.

    ReplyDelete
  2. I'm not sure if you'd consider doing this outside of java, but if so, this is very simple in a shell:

    cat file | sort | uniq

    ReplyDelete
  3. You probably can't load the entire file at one time but you can store the hash and line-number in a HashSet no problem.

    Pseudo code...

    for line in file
    entries.put(line.hashCode, line-number)
    for entry in entries
    if entry.lineNumbers > 1
    fetch each line by line number and compare

    ReplyDelete
  4. One way I can imagine solving this is to first use an external sorting algorithm to sort the file (searching for external sort java yields lots of results with code). Then you can iterate the file line by line, duplicates will now obviously be directly following each other so you only need to remember the previous line while iterating.

    ReplyDelete
  5. If you cannot build up a complete list since you don't have enough memory, you might try do it in loops. I.e. create a hashmap but only store a small portion of the items (for example, those starting with A). Then you gather the duplicates, then continue with 'B' etc.

    Of course you can select any kind of 'grouping' (i.e. first 3 characters, first 6 etc).

    It only will take (many) more iterations.

    ReplyDelete
  6. I don't think you need to sort the data to eliminate duplicates. Just use quicksort inspired approach.


    Pick k pivots from the data (unless your data is really wacky this should be pretty straightforward )
    Using these k pivots divide the data into k+1 small files
    If any of these chunks are too large to fit in memory repeat the process just for that chunk
    Once you have manageable sized chunks just apply your favorite method (hashing?) to find duplicates


    Note that k can be equal to 1.

    ReplyDelete
  7. You might try a Bloom filter, if you're willing to accept a certain amount of statistical error. Guava provides one, but there's a pretty major bug in it right now that should be fixed probably next week with release 11.0.2.

    ReplyDelete

Post a Comment

Popular posts from this blog

Why is this Javascript much *slower* than its jQuery equivalent?

I have a HTML list of about 500 items and a "filter" box above it. I started by using jQuery to filter the list when I typed a letter (timing code added later): $('#filter').keyup( function() { var jqStart = (new Date).getTime(); var search = $(this).val().toLowerCase(); var $list = $('ul.ablist > li'); $list.each( function() { if ( $(this).text().toLowerCase().indexOf(search) === -1 ) $(this).hide(); else $(this).show(); } ); console.log('Time: ' + ((new Date).getTime() - jqStart)); } ); However, there was a couple of seconds delay after typing each letter (particularly the first letter). So I thought it may be slightly quicker if I used plain Javascript (I read recently that jQuery's each function is particularly slow). Here's my JS equivalent: document.getElementById('filter').addEventListener( 'keyup', function () { var jsStart = (new Date).getTime()...

Is it possible to have IF statement in an Echo statement in PHP

Thanks in advance. I did look at the other questions/answers that were similar and didn't find exactly what I was looking for. I'm trying to do this, am I on the right path? echo " <div id='tabs-".$match."'> <textarea id='".$match."' name='".$match."'>". if ($COLUMN_NAME === $match) { echo $FIELD_WITH_COLUMN_NAME; } else { } ."</textarea> <script type='text/javascript'> CKEDITOR.replace( '".$match."' ); </script> </div>"; I am getting the following error message in the browser: Parse error: syntax error, unexpected T_IF Please let me know if this is the right way to go about nesting an IF statement inside an echo. Thank you.