
How to count word occurrences with limited memory?


The task is to count the occurrences of each word in an input file.



The input file has 8 characters per line and 10M lines, for example:




aaaaaaaa
bbbbbbbb
aaaaaaaa
abcabcab
bbbbbbbb
...



The output is:




aaaaaaaa 2
abcabcab 1
bbbbbbbb 2
...



It would take 80MB of memory if I loaded all of the words into memory, but only 60MB of system memory is available for this task. So how can I solve this problem?



My approach is to use a Map<String,Integer>, but the JVM throws Exception in thread "main" java.lang.OutOfMemoryError: Java heap space. I know I could work around this by setting -Xmx1024m, for example, but I want to solve it using less memory.



Comments

  1. I suck at explaining theoretical answers but here we go....

    I have made an assumption about your question as it is not entirely clear.


    The memory used to store all the distinct words is 80MB (the entire file is bigger).
    The words could contain non-ASCII characters (so we just treat the data as raw bytes).


    It is sufficient to read over the file twice, storing ~40MB of distinct words each time.

    // Loop over the file and for each word:
    //
    // Compute a hash of the word.
    // Convert the hash to a number by some means (skip if possible).
    // If the number is odd then skip to the next word.
    // Use conventional means to store the distinct word.
    //
    // Do something with all the distinct words.


    Then repeat the above a second time using even instead of odd.

    Then you have divided the task into 2 and can do each separately.
    No words from the first set will appear in the second set.

    The hash is necessary because the words could (in theory) all end with the same letter.

    The solution can be extended to work with different memory constraints. Rather than saying just odd/even we can divide the words into X groups by using number MOD X.
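
    A minimal Java sketch of this scheme (the file name input.txt and the use of String.hashCode() as the hash are my assumptions, not part of the answer):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    // Makes X passes over the file; pass i only counts words whose hash
    // falls in partition i, so only ~1/X of the distinct words are in
    // memory at any time.
    public class PartitionedHashCount {
        public static void main(String[] args) throws IOException {
            int partitions = 2; // the odd/even case; raise X for tighter memory
            for (int pass = 0; pass < partitions; pass++) {
                Map<String, Integer> counts = new HashMap<>();
                try (BufferedReader in = new BufferedReader(new FileReader("input.txt"))) {
                    String word;
                    while ((word = in.readLine()) != null) {
                        if (Math.floorMod(word.hashCode(), partitions) != pass) {
                            continue; // this word belongs to another pass
                        }
                        counts.merge(word, 1, Integer::sum);
                    }
                }
                for (Map.Entry<String, Integer> e : counts.entrySet()) {
                    System.out.println(e.getKey() + " " + e.getValue());
                }
            }
        }
    }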

  2. I believe that the most robust solution is to use the disk space.

    For example, you can sort your file into another file, using an algorithm for sorting large files (one that uses disk space), and then count the consecutive occurrences of the same word.

    I believe that this post can help you. Or search by yourself something about external sorting.

    Update 1

    Or, as @jordeu suggests, you can use an embedded Java database library, like H2, JavaDB, or similar.

    Update 2

    I thought about another possible solution, using a prefix tree (trie). However, I still prefer the first one, because I'm not an expert on tries.

  3. Read one line at a time and then have e.g. a HashMap<String,Integer> where you put each word as the key and its count as the value.

    If a key exists, increase the count. Otherwise add the key to the map with a count of 1.

    There is no need to keep the whole file in memory.
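
    A minimal sketch of this (input.txt is an assumed file name; note this is essentially what the question already tried, so it only works if the heap can hold all the distinct words):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    // Streams the file line by line; only the distinct words and their
    // counts are held in memory, never the whole file.
    public class StreamingCount {
        public static void main(String[] args) throws IOException {
            Map<String, Integer> counts = new HashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader("input.txt"))) {
                String word;
                while ((word = in.readLine()) != null) {
                    counts.merge(word, 1, Integer::sum); // insert 1, or increment
                }
            }
            counts.forEach((w, c) -> System.out.println(w + " " + c));
        }
    }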

  4. I guess you mean the number of distinct words, do you?

    So the obvious approach is to store (distinctive information about) each different word as a key in a map, where the value is the associated counter. Depending on how many distinct words are expected, storing all of them may even fit into your memory; however, it won't in the worst-case scenario when all the words are different.

    To lessen memory needs, you could calculate a checksum for the words and store that, instead of the words themselves. Storing e.g. a 4-byte checksum instead of an 8-character word (requiring at least 9 bytes to store) requires 40M instead of 90M. Plus you need a counter for each word too. Depending on the expected number of occurrences for a specific word, you may be able to get by with 2 bytes (for max 65535 occurrences), which requires max 60M of memory for 10M distinct words.

    Update

    Of course, the checksum can be calculated in many different ways, and it can be lossless or not. This also depends a lot on the character set used in the words. E.g. if only lowercase standard ASCII characters are used (as shown in the examples above), we have 26 different characters at each position. Consequently, each character can be losslessly encoded in 5 bits. Thus 8 characters fit into 5 bytes, which is a bit more than the limit, but may be dense enough, depending on the circumstances.
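
    A sketch of the lossless 5-bit encoding described above, assuming words contain only lowercase ASCII letters (the class and method names are mine):

    // Packs an 8-letter lowercase word into 40 bits of a long, losslessly.
    public class PackedWord {
        static long pack(String word) {
            long packed = 0;
            for (int i = 0; i < word.length(); i++) {
                packed = (packed << 5) | (word.charAt(i) - 'a'); // 5 bits per letter
            }
            return packed;
        }

        // Reverses pack() for an 8-letter word.
        static String unpack(long packed) {
            char[] chars = new char[8];
            for (int i = 7; i >= 0; i--) {
                chars[i] = (char) ('a' + (packed & 0x1F)); // low 5 bits
                packed >>>= 5;
            }
            return new String(chars);
        }

        public static void main(String[] args) {
            long key = pack("abcabcab");
            System.out.println(key + " -> " + unpack(key)); // round-trips losslessly
        }
    }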

  5. Use the H2 Database Engine; it can work on disk or in memory as necessary, and it has really good performance.

  6. I'd create a SHA-1 hash of each word, then store these hashes in a Set. Then, when reading a word, check whether its hash is already in the Set (not strictly necessary, since a Set is unique by definition; you can simply add each word's SHA-1 hash).

  7. Depending on what kind of characters the words are built of, you can choose this scheme:

    If a word might contain any letter of the alphabet in upper and lower case, you will have (26*2)^8 = 52^8 combinations, which is 53,459,728,531,456. This number fits in a long datatype.

    So compute a checksum for the strings like this:

    public static long checksum(String str)
    {
        String tokens = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
        long checksum = 0;

        for (int i = 0; i < str.length(); ++i)
        {
            int c = tokens.indexOf(str.charAt(i));

            checksum *= tokens.length();
            checksum += c;
        }

        return checksum;
    }


    This will reduce the memory taken per word to 8 bytes (one long). A String is backed by an array of chars, and each char in Java is 2 bytes, so 8 chars = 16 bytes. But the String class contains more data than just the char array: it also holds some ints for size and offset, at 4 bytes per int. Don't forget the memory pointers to the String and char array objects as well. So a rough estimate suggests this saves about 28 bytes per word.

    So, at 8 bytes per word with 10,000,000 words, that still gives 76 MB, above your limit. (Your original 80MB figure was already a wrong estimation, because it forgot all the overhead I noted.) This means that even this method won't work if all the words are distinct.

  8. You can convert each 8-byte word into a long and use TLongIntHashMap, which is quite a bit more efficient than a Map<String, Integer> or Map<Long, Integer>.

    If you just need the distinct words you can use TLongHashSet
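
    A sketch of this approach (assumes GNU Trove 3 on the classpath and plain ASCII words, so each of the 8 chars fits in one byte of the long):

    import gnu.trove.map.hash.TLongIntHashMap;
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Packs each 8-char line into a long and counts with a primitive map,
    // avoiding the per-entry object overhead of Map<String, Integer>.
    public class TroveCount {
        public static void main(String[] args) throws IOException {
            TLongIntHashMap counts = new TLongIntHashMap();
            try (BufferedReader in = new BufferedReader(new FileReader("input.txt"))) {
                String word;
                while ((word = in.readLine()) != null) {
                    long key = 0;
                    for (int i = 0; i < word.length(); i++) {
                        key = (key << 8) | word.charAt(i); // one byte per char
                    }
                    counts.adjustOrPutValue(key, 1, 1); // increment, or insert 1
                }
            }
            counts.forEachEntry((k, v) -> {
                System.out.println(k + " " + v);
                return true; // keep iterating
            });
        }
    }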

  9. If you can sort your file first (e.g. using the memory-efficient "sort" utility on Unix), then it's easy. You simply read the sorted items, counting the neighboring duplicates as you go, and write the totals to a new file immediately.

    If you need to sort using Java, this post might help:

    http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194
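
    The counting step over an already-sorted file takes constant memory; a sketch (sorted.txt is assumed to be the output of the external sort):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Reads a sorted file and counts runs of neighboring duplicates,
    // writing each total as soon as the run ends.
    public class CountSorted {
        public static void main(String[] args) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader("sorted.txt"))) {
                String prev = null;
                int count = 0;
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.equals(prev)) {
                        count++;
                    } else {
                        if (prev != null) System.out.println(prev + " " + count);
                        prev = line;
                        count = 1;
                    }
                }
                if (prev != null) System.out.println(prev + " " + count); // last run
            }
        }
    }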

  10. You can use constant memory by reading your file multiple times.

    Basic idea:

    Treat the file as n partitions p_1...p_n, sized so that you can load each of them into RAM.


    Load p_i into a Map structure, scan through the whole file, and keep track of counts of the p_i elements only (see the answer by Heiko Rupp)
    Remove an element if we encounter the same value in a partition p_j with j smaller than i
    Output the resulting counts for the elements in the Map
    Clear the Map and repeat for all p_1...p_n (see the sketch below)
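
    A sketch of these steps, partitioning by line ranges (the partition count, file name, and two-phase pass per partition are my concrete choices):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    // One pass per partition; pass i keeps only partition i's words in
    // memory and each distinct word is counted in exactly one pass.
    public class PartitionCount {
        static final String FILE = "input.txt";
        static final int PARTITIONS = 4;               // pick so one partition fits in RAM
        static final long LINES_PER_PART = 2_500_000L; // 10M lines / 4 partitions

        public static void main(String[] args) throws IOException {
            for (int i = 0; i < PARTITIONS; i++) {
                Map<String, Integer> counts = new HashMap<>();
                // Phase 1: collect the distinct words of partition i.
                try (BufferedReader in = new BufferedReader(new FileReader(FILE))) {
                    String word;
                    long line = 0;
                    while ((word = in.readLine()) != null) {
                        if (line++ / LINES_PER_PART == i) counts.put(word, 0);
                    }
                }
                // Phase 2: scan the whole file; drop words already counted in
                // an earlier pass, count all occurrences of the rest.
                try (BufferedReader in = new BufferedReader(new FileReader(FILE))) {
                    String word;
                    long line = 0;
                    while ((word = in.readLine()) != null) {
                        long part = line++ / LINES_PER_PART;
                        if (part < i) counts.remove(word); // handled in a pass j < i
                        else counts.computeIfPresent(word, (w, c) -> c + 1);
                    }
                }
                counts.forEach((w, c) -> System.out.println(w + " " + c));
            }
        }
    }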

  11. As in any optimization, there are tradeoffs. In your case, you can do the same task with less memory but it comes at the cost of increasing runtime.

    Your scarce resource is memory, so you can't store the words in RAM.

    You could use a hash instead of the word as other posts mention, but if your file grows in size this is no solution, since at some point you'll run into the same problem again.

    Yes, you could use an external web server to crunch the file and do the job for your client app, but reading your question it seems that you want to do the whole thing in one app (yours).

    So my proposal is to iterate over the file, and for each word:


    If the word is found for the first time, write the string to a result file together with the integer value 1.
    If the word was processed before (it will appear in the result file), increment the record's value.


    This solution scales well regardless of the number of lines in your input file or the length of the words*.

    You can optimize the way you do the writes in the output file, so that the search is made faster, but the basic version described above is enough to work.
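
    A sketch of this basic version, assuming fixed 8-character words and a fixed-width decimal counter in the result file (the record layout and file names are my choices; the linear search makes it slow, but it uses constant memory):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Fixed-width records "word counter\n" in a result file; each input
    // word is looked up with a linear scan and appended or incremented.
    public class DiskCount {
        static final int WORD_LEN = 8, COUNT_LEN = 7;
        static final int RECORD_LEN = WORD_LEN + 1 + COUNT_LEN + 1;

        public static void main(String[] args) throws IOException {
            try (RandomAccessFile in = new RandomAccessFile("input.txt", "r");
                 RandomAccessFile out = new RandomAccessFile("result.txt", "rw")) {
                String word;
                while ((word = in.readLine()) != null) {
                    long pos = find(out, word);
                    if (pos < 0) {                  // first time: append "word 0000001"
                        out.seek(out.length());
                        out.writeBytes(word + " " + String.format("%07d", 1) + "\n");
                    } else {                        // seen before: rewrite the counter
                        out.seek(pos + WORD_LEN + 1);
                        byte[] buf = new byte[COUNT_LEN];
                        out.readFully(buf);
                        int count = Integer.parseInt(new String(buf)) + 1;
                        out.seek(pos + WORD_LEN + 1);
                        out.writeBytes(String.format("%07d", count));
                    }
                }
            }
        }

        // Linear scan over the fixed-width records; returns the record
        // offset of the word, or -1 if it is not in the result file yet.
        static long find(RandomAccessFile out, String word) throws IOException {
            byte[] buf = new byte[WORD_LEN];
            for (long pos = 0; pos + RECORD_LEN <= out.length(); pos += RECORD_LEN) {
                out.seek(pos);
                out.readFully(buf);
                if (word.equals(new String(buf))) return pos;
            }
            return -1;
        }
    }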

    EDIT:
    *It scales well until you run out of disk space XD. So the precondition would be to have a disk with at least 2N bytes of free usable space, where N is the input file size in bytes.

  12. Possible solutions:


    Use file sorting and then just count the consecutive occurrences of each value.
    Load the file into a database and use a count statement like this: select value, count(*) from table group by value (see the H2 sketch below)
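
    A sketch of the database option using embedded H2, as suggested in earlier answers (assumes the H2 jar on the classpath; CSVREAD is H2's built-in function for reading a file as a table, and the file name and column name are my choices):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Lets H2 do the grouping on disk: reads the one-column file as a
    // table and runs the GROUP BY from the answer above.
    public class DbCount {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection("jdbc:h2:./words", "sa", "");
                 Statement st = con.createStatement()) {
                ResultSet rs = st.executeQuery(
                    "SELECT word, COUNT(*) FROM CSVREAD('input.txt', 'WORD') " +
                    "GROUP BY word ORDER BY word");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " " + rs.getLong(2));
                }
            }
        }
    }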

