
How much to grow buffer in a StringBuilder-like C module?


In C, I'm working on a "class" that manages a byte buffer, allowing arbitrary data to be appended to the end. I'm now looking into automatic resizing, using calls to realloc as the underlying array fills up. This should make sense to anyone who's ever used the Java or C# StringBuilder. I understand how to go about the resizing. But does anyone have any suggestions, with rationale provided, on how much to grow the buffer with each resize?



Obviously, there's a trade off to be made between wasted space and excessive realloc calls (which could lead to excessive copying). I've seen some tutorials/articles that suggest doubling. That seems wasteful if the user manages to supply a good initial guess. Is it worth trying to round to some power of two or a multiple of the alignment size on a platform?



Does anyone know what Java or C# does under the hood?


Source: Tips4all

Comments

  1. In C# the strategy used to grow the internal buffer used by a StringBuilder has changed over time.

    There are three basic strategies for solving this problem, and they have different performance characteristics.

    The first basic strategy is:


    Make an array of characters
    When you run out of room, create a new array with k more characters, for some constant k.
    Copy the old array to the new array, and orphan the old array.


    This strategy has a number of problems, the most obvious of which is that it is O(n^2) in time if the string being built is extremely large. Let's say that k is a thousand characters and the final string is a million characters. You end up reallocating the string at 1000, 2000, 3000, 4000, ... and therefore copying 1000 + 2000 + 3000 + 4000 + ... + 999000 characters, which sums to on the order of 500 million characters copied!
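As a quick arithmetic check on that example (a worked sum, assuming k = 1000 and a final length of 1,000,000 characters as stated above):

```latex
\[
\sum_{i=1}^{999} 1000\,i \;=\; 1000 \cdot \frac{999 \cdot 1000}{2} \;=\; 499{,}500{,}000 \;\approx\; 5 \times 10^{8}
\]
```

More generally, reaching n characters in steps of k copies roughly n^2 / (2k) characters, which is where the quadratic behaviour comes from.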

    This strategy has the nice property that the amount of "wasted" memory is bounded by k.

    In practice this strategy is seldom used because of that n-squared problem.

    The second basic strategy is


    Make an array
    When you run out of room, create a new array with k% more characters, for some constant k.
    Copy the old array to the new array, and orphan the old array.


    k% is usually 100%; if it is then this is called the "double when full" strategy.

    This strategy has the nice property that its total copying cost is O(n), which works out to an amortized O(1) per appended character. Suppose again the final string is a million characters and you start with a thousand. You make copies at 1000, 2000, 4000, 8000, ... and end up copying 1000 + 2000 + 4000 + 8000 + ... + 512000 characters, which sums to about a million characters copied; much better.

    The strategy has the property that the amortized cost is linear no matter what percentage you choose.
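The claim that the cost stays linear for any percentage can be checked with a small simulation. This is only a sketch: the function name and signature are made up for illustration, and it counts characters copied rather than performing real allocations.

```c
#include <assert.h>
#include <stddef.h>

/* Total characters copied while growing a buffer by `percent` percent each
 * time it fills, until it can hold `target` characters.
 * Hypothetical helper for illustration; assumes target is small enough
 * that the capacity never overflows size_t. */
unsigned long long copied_with_percent_growth(size_t initial, size_t target,
                                              unsigned percent)
{
    assert(initial > 0 && percent > 0);
    unsigned long long copied = 0;
    size_t capacity = initial;
    while (capacity < target) {
        size_t extra = (size_t)((unsigned long long)capacity * percent / 100);
        if (extra == 0)
            extra = 1;               /* always make progress on tiny buffers */
        copied += capacity;          /* old contents move to the new array */
        capacity += extra;
    }
    return copied;
}
```

With an initial capacity of 1000 and a target of 1,000,000, a 100% growth step copies 1,023,000 characters, and a 50% step copies roughly three million. A smaller percentage raises the constant factor, but the total stays proportional to the final length.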

    This strategy has the downsides that an individual copy operation is occasionally extremely expensive, and that you can waste up to k% of the final string length in unused memory.
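For the byte buffer in the question, a double-when-full append built on realloc might look like the sketch below. The struct, the function name, and the starting capacity of 16 are assumptions made for illustration, not something prescribed by the answer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer; zero-initialise before first use. */
typedef struct {
    unsigned char *data;
    size_t len;      /* bytes in use */
    size_t cap;      /* bytes allocated */
} bytebuf;

/* Append n bytes, doubling the capacity until the data fits.
 * Returns 0 on success, -1 on size overflow or allocation failure
 * (the buffer is left unchanged in that case). */
int bytebuf_append(bytebuf *b, const void *src, size_t n)
{
    assert(b != NULL && (src != NULL || n == 0));
    if (n > SIZE_MAX - b->len)
        return -1;                       /* len + n would overflow */
    size_t need = b->len + n;
    if (need > b->cap) {
        size_t new_cap = b->cap ? b->cap : 16;   /* 16 is an arbitrary start */
        while (new_cap < need) {
            if (new_cap > SIZE_MAX / 2) { new_cap = need; break; }
            new_cap *= 2;
        }
        unsigned char *p = realloc(b->data, new_cap);
        if (p == NULL)
            return -1;                   /* the old block is still valid */
        b->data = p;
        b->cap = new_cap;
    }
    if (n > 0)
        memcpy(b->data + b->len, src, n);
    b->len = need;
    return 0;
}
```

The caller releases the storage with free(b.data). Assigning the realloc result to a temporary first matters: writing it straight into b->data would leak the old block on failure.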

    The third strategy is to make a linked list of arrays, each array of size k. When you overflow an existing array, a new one is allocated and appended to the end of the list.

    This strategy has the nice property that no operation is particularly expensive, the total wasted memory is bounded by k, and you don't need to be able to locate large blocks in the heap on a regular basis. It has the downside that finally turning the thing into a string can be expensive as the arrays in the linked list might have poor locality.
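A linked list of fixed-size blocks could be sketched in C as follows. The type names, function names, and the block size of 4096 are illustrative assumptions; the point is that appending never moves existing data, while flattening has to walk every block.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE 4096   /* the constant k from the text; value is arbitrary */

typedef struct chunk {
    struct chunk *next;
    size_t used;
    unsigned char bytes[CHUNK_SIZE];
} chunk;

/* Hypothetical chunked buffer; zero-initialise before first use. */
typedef struct {
    chunk *head, *tail;
    size_t total;            /* bytes stored across all chunks */
} chunkbuf;

/* Append n bytes, allocating new chunks as needed.
 * Returns 0 on success, -1 if malloc fails (bytes already stored stay). */
int chunkbuf_append(chunkbuf *cb, const void *src, size_t n)
{
    assert(cb != NULL && (src != NULL || n == 0));
    const unsigned char *p = src;
    while (n > 0) {
        if (cb->tail == NULL || cb->tail->used == CHUNK_SIZE) {
            chunk *c = malloc(sizeof *c);
            if (c == NULL)
                return -1;
            c->next = NULL;
            c->used = 0;
            if (cb->tail) cb->tail->next = c; else cb->head = c;
            cb->tail = c;
        }
        size_t room = CHUNK_SIZE - cb->tail->used;
        size_t take = n < room ? n : room;
        memcpy(cb->tail->bytes + cb->tail->used, p, take);
        cb->tail->used += take;
        cb->total += take;
        p += take;
        n -= take;
    }
    return 0;
}

/* Copy everything into one contiguous array of at least cb->total bytes. */
void chunkbuf_flatten(const chunkbuf *cb, unsigned char *out)
{
    for (const chunk *c = cb->head; c != NULL; c = c->next) {
        memcpy(out, c->bytes, c->used);
        out += c->used;
    }
}

void chunkbuf_free(chunkbuf *cb)
{
    chunk *c = cb->head;
    while (c != NULL) {
        chunk *next = c->next;
        free(c);
        c = next;
    }
    cb->head = cb->tail = NULL;
    cb->total = 0;
}
```

Only the last chunk can be partly empty, so the unused space is bounded by CHUNK_SIZE, matching the property described above.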

    The string builder in the .NET framework used to use a double-when-full strategy; it now uses a linked-list-of-blocks strategy.

  2. You generally want to keep the growth factor a little smaller than the golden mean (~1.6). When it's smaller than the golden mean, the discarded segments will be large enough to satisfy a later request, as long as they're adjacent to each other. If your growth factor is larger than the golden mean, that can't happen.

    I've found that reducing the factor to 1.5 still works quite nicely, and has the advantage of being easy to implement in integer math (size = (size + (size << 1))>>1; -- with a decent compiler you can write that as (size * 3)/2, and it should still compile to fast code).
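One way to write the 1.5x step so that it cannot wrap around on very large sizes is sketched below. The function name is hypothetical, and the choices to bump sizes 0 and 1 and to saturate at SIZE_MAX are mine rather than the answer's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Next capacity under a 1.5x growth policy, in integer math.
 * size + size / 2 equals (size * 3) / 2 whenever the multiplication does
 * not overflow, but avoids the overflowing intermediate product. */
size_t grow_by_half(size_t size)
{
    size_t extra = size / 2;
    if (extra == 0)
        extra = 1;                   /* sizes 0 and 1 must still grow */
    if (size > SIZE_MAX - extra)
        return SIZE_MAX;             /* saturate instead of wrapping */
    assert(size + extra > size);     /* growth is strictly increasing here */
    return size + extra;
}
```

For example, 16 grows to 24 and 7 grows to 10.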

    I seem to recall a conversation some years ago on Usenet in which P.J. Plauger (or maybe it was Pete Becker) of Dinkumware said they'd run rather more extensive tests than I ever did and reached the same conclusion (so, for example, the implementation of std::vector in their C++ standard library uses 1.5).

  3. When working with expanding and contracting buffers, the key property you want is to grow or shrink by a multiple of your size, not a constant difference.

    Consider the case where you have a 16-byte array: increasing its size by 128 bytes is overkill. If instead you had a 4096-byte array and increased it by only 128 bytes, you would end up copying a lot.

    I was taught to always double or halve arrays. If you really have no hint as to the size or maximum, multiplying by two ensures that you have a lot of capacity for a long time, and unless you're working on a resource constrained system, allocating at most twice the space isn't too terrible. Additionally, keeping things in powers of two can let you use bit shifts and other tricks and the underlying allocation is usually in powers of two.
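If capacities are kept at powers of two, a requested size has to be rounded up first. A minimal sketch, with a made-up function name and a return value of 0 chosen here to signal that no representable power of two is large enough:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest power of two that is >= n (1 for n == 0).
 * Returns 0 if that power of two does not fit in size_t. */
size_t next_pow2(size_t n)
{
    size_t p = 1;
    while (p < n) {
        if (p > SIZE_MAX / 2)
            return 0;                /* shifting again would overflow */
        p <<= 1;
    }
    assert((p & (p - 1)) == 0);      /* result has exactly one bit set */
    return p;
}
```

A request for 17 bytes becomes 32, and 4096 stays 4096.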

  4. Does anyone know what Java or C# does under the hood?


    Have a look at the following link to see how it's done in Java's StringBuilder from JDK7, in particular, the expandCapacity method.
    http://hg.openjdk.java.net/build-infra/jdk7/jdk/file/0f8da27a3ea3/src/share/classes/java/lang/AbstractStringBuilder.java
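As I read the expandCapacity method at that link, it grows to twice the current capacity plus two, or to the requested minimum if that is larger. A C rendering of that rule might look like the sketch below; the function name is hypothetical, and the saturation at SIZE_MAX stands in for Java's int overflow handling rather than reproducing it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* New capacity following a "double plus two, or the minimum" rule.
 * Illustrative only; not a transcription of the JDK source. */
size_t double_plus_two_capacity(size_t current, size_t minimum)
{
    if (current > (SIZE_MAX - 2) / 2)
        return SIZE_MAX;             /* current * 2 + 2 would overflow */
    size_t grown = current * 2 + 2;
    assert(grown > current);
    return grown < minimum ? minimum : grown;
}
```

The "+ 2" means that even a zero-length buffer grows on the first expansion.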

  5. It's implementation-specific, according to the documentation, but starts with 16:


    The default capacity for this implementation is 16, and the default
    maximum capacity is Int32.MaxValue.

    A StringBuilder object can allocate more memory to store characters
    when the value of an instance is enlarged, and the capacity is
    adjusted accordingly. For example, the Append, AppendFormat,
    EnsureCapacity, Insert, and Replace methods can enlarge the value of
    an instance.

    The amount of memory allocated is implementation-specific, and an
    exception (either ArgumentOutOfRangeException or OutOfMemoryException)
    is thrown if the amount of memory required is greater than the maximum
    capacity.


    Based on some other .NET framework things, I would suggest multiplying it by 1.1 each time the current capacity is reached. If extra space is needed, just have an equivalent to EnsureCapacity that will expand it to the necessary size manually.
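An EnsureCapacity equivalent for a C byte buffer could be as small as the sketch below. The struct and function names are assumptions for illustration; the function reserves exactly what the caller asks for and applies no growth factor of its own.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable byte buffer; zero-initialise before first use. */
typedef struct {
    unsigned char *data;
    size_t len;      /* bytes in use */
    size_t cap;      /* bytes allocated */
} bytebuf;

/* Make sure at least min_cap bytes are allocated.
 * Returns 0 on success, -1 if realloc fails (the buffer is unchanged). */
int bytebuf_ensure_capacity(bytebuf *b, size_t min_cap)
{
    assert(b != NULL);
    if (min_cap <= b->cap)
        return 0;                    /* already large enough; never shrinks */
    unsigned char *p = realloc(b->data, min_cap);
    if (p == NULL)
        return -1;
    b->data = p;
    b->cap = min_cap;
    return 0;
}
```

A caller who knows the final size up front can call this once and skip the incremental growth entirely.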

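  6. Translate this to C.

    I would probably maintain a List<List<string>> list.

    using System.Collections.Generic;

    class StringBuilder
    {
        // Each Append stores a reference to the caller's list; nothing is copied.
        private readonly List<List<string>> list = new List<List<string>>();

        public void Append(List<string> listOfCharsToAppend)
        {
            list.Add(listOfCharsToAppend);
        }
    }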


    This way you are just maintaining a list of Lists and allocating memory on demand rather than allocating memory well ahead.

  7. List<T> in the .NET Framework uses this algorithm: if an initial capacity is specified, it creates a buffer of that size; otherwise no buffer is allocated until the first item(s) are added, at which point it allocates space equal to the number of items added, but no less than 4. When more space is needed, it allocates a new buffer with 2x the previous capacity and copies all items from the old buffer to the new one. Earlier versions of StringBuilder used a similar algorithm.

    In .NET 4, StringBuilder allocates an initial buffer of the size specified in the constructor (the default is 16 characters). When the allocated buffer is too small, nothing is copied. Instead, it fills the current buffer to the brim, then creates a new StringBuilder instance, which allocates a buffer of size *MAX(length_of_remaining_data_to_add, MIN(length_of_all_previous_buffers, 8000))*, so that all the remaining data fits into the new buffer and, while the total is below the 8000-character cap, the total size of all buffers at least doubles. The new StringBuilder keeps a reference to the old one, so the individual instances form a linked list of buffers.
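The block-size formula quoted above translates directly into C. This sketch mirrors only the formula as stated in this comment, not the actual .NET source; the function and macro names are made up.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHUNK 8000   /* the cap quoted in the text */

/* Size of the next block: MAX(remaining, MIN(total_previous, MAX_CHUNK)). */
size_t next_chunk_size(size_t remaining, size_t total_previous)
{
    size_t capped = total_previous < MAX_CHUNK ? total_previous : MAX_CHUNK;
    size_t result = remaining > capped ? remaining : capped;
    assert(result >= remaining);     /* the pending data always fits */
    return result;
}
```

With 16 characters already stored and 5 more to add, the next block holds 16 characters. Once 20,000 characters are stored, new blocks are 8000 characters unless a single append needs more.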
