Posted by aog Thursday, 30 November 2006 at 16:22

When weblogs and real life collide

Serendipitously, as we were discussing debugging and tinkering, I was in the midst of debugging code (and who says weblogging interferes with real life?). I tracked the problem down to a (very likely) bug in Visual Studio 2005.

Here’s code that demonstrates the bug:

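#include <fstream>

int main(int, char**)
{
    std::ifstream f;
    f.open("data.txt");
    if (f) {
        // Throw on any stream error so the failure is visible.
        f.exceptions(std::ios_base::badbit | std::ios_base::failbit | std::ios_base::eofbit);
        // Consume exactly 4*4096 characters.
        for ( int i(0) ; i < 4*4096 ; ++i )
            f.get();
        f.peek();   // look at the next character without consuming it
        f.unget();  // this is the call that throws
    }
    return 0;
}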

This will throw an exception at the unget if the data file is more than 16,385 characters long and it has UNIX style (linefeed only) line endings. I tracked it down through an alternating pattern of tinkering and thinking (with some help from SWIPIAW). I tinkered to gather data about the nature of the problem, and then thought about how a problem I had observed could occur. I then would tinker some more to disprove or refine my theory. As is almost always the case, the two approaches are complementary, not oppositional.
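To try this at home, the data file has to keep its bare linefeeds even on Windows, which means writing it in binary mode. Here is a minimal sketch of a generator; the helper name `write_unix_file`, the 32 byte line, and the size passed in are illustrative assumptions, not part of the original test case.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: writes at least `min_size` bytes of text to `path`
// using linefeed-only line endings. Returns the number of bytes written,
// or 0 if the file could not be opened or written.
std::size_t write_unix_file(const std::string& path, std::size_t min_size)
{
    // Binary mode stops a Windows runtime from turning '\n' into "\r\n".
    std::ofstream out(path.c_str(),
                      std::ios_base::out | std::ios_base::binary | std::ios_base::trunc);
    if (!out)
        return 0;

    const std::string line = "0123456789abcdef0123456789abcde\n";
    assert(line.size() == 32); // keeps the byte count easy to reason about

    std::size_t written = 0;
    while (written < min_size) {
        out << line;
        written += line.size();
    }
    out.flush();
    if (!out)
        return 0;
    return written;
}
```

Calling `write_unix_file("data.txt", 5 * 4096)` before running the program above should produce a file comfortably past the 16,385 character mark.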

Comments — Formatting by Textile
joe shropshire Thursday, 30 November 2006 at 22:21

for int i(0)? Please to explain.

Annoying Old Guy Thursday, 30 November 2006 at 23:07

C++ regularized object initialization so that all of the builtin types can be initialized with constructor syntax, just like user defined classes. This was done primarily to support templates (so that templates do not have to special case builtin types). This means that the type int now behaves as if it had a copy constructor like int::int(int const& src). The notation you asked about uses that constructor form.

For built in types, int i(0); and int i = 0; mean exactly the same thing, but that’s not the case for user classes, where the latter may construct a temporary and then copy it in some compilers, while the former does only a single construction. Since there’s no downside to this style with builtin types, I try to inculcate the habit of using the constructor form at all times.
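A small sketch of the two forms side by side may help. The classes `Counter` and `Meters` and the function name are made up for illustration; the `explicit` case is included because it is the one place where the `=` form is rejected outright.

```cpp
#include <cassert>

// Hypothetical class with an ordinary (non-explicit) constructor.
struct Counter {
    int count;
    Counter(int start) : count(start) {}
};

// Hypothetical class whose constructor is explicit.
struct Meters {
    int value;
    explicit Meters(int v) : value(v) {}
};

int sum_of_initialized_values()
{
    int i(0);        // constructor form on a builtin
    int j = 0;       // assignment-looking form, same meaning for int

    Counter a(1);    // constructor form: one constructor call
    Counter b = 2;   // converts 2 to a Counter, then initializes b from it
                     // (compilers are allowed to skip the extra copy)

    Meters m(3);     // fine
    // Meters n = 4; // would not compile: the constructor is explicit

    return i + j + a.count + b.count + m.value;
}
```

Using the constructor form everywhere means the same line keeps compiling if `int` is later swapped for a class with an explicit constructor.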

joe shropshire Friday, 01 December 2006 at 08:41

Very cool, I had never seen that syntax before.

cjm Friday, 08 December 2006 at 21:00

it burns my wick that supposedly oo languages have “primitive” types. smalltalk had it right, all objects, all the time. Java is really bad this way, in that arrays are pseudo objects. grrrrr

Annoying Old Guy Friday, 08 December 2006 at 21:33

Concerns about efficiency tend to get in the way of not having primitives. At least C++ made some effort to smooth out some of the rough spots (e.g., raw pointers can be used as iterators, constructor initialization syntax for primitives, etc.).
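Both of those smoothed-over spots can be seen in a few lines. This is only a sketch; `sum_range` and `sum_array_and_vector` are names invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <numeric>
#include <vector>

// Sums a half-open range. The same template works for raw pointers and for
// class iterators, and T() yields zero when T is a builtin such as int.
template <typename Iter>
typename std::iterator_traits<Iter>::value_type sum_range(Iter first, Iter last)
{
    typedef typename std::iterator_traits<Iter>::value_type T;
    return std::accumulate(first, last, T());
}

int sum_array_and_vector()
{
    int raw[] = { 4, 1, 3, 2 };
    int* first = raw;
    int* last = raw + 4;

    std::sort(first, last);          // raw pointers used as random access iterators
    std::vector<int> v(first, last); // and as the input range for a container

    return sum_range(first, last) + sum_range(v.begin(), v.end());
}
```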

On the other hand, you could go the CLOS route and have only primitives. This makes reflection trivial, since it comes for free. Perl works the same way, which makes user code to manipulate class types possible (since a class type is really just a composition of primitives like every other data structure).

Annoying Old Guy Friday, 08 December 2006 at 21:39

Update: The Dark Empire has confirmed that this is in fact a real bug in their STL implementation for which there is no workaround. I doubt, though, that my business case will result in a hot fix.

The actual bug is that the stream buffer logic keeps a single 4K buffer and no unget lookaside. So when unget is invoked, the stream buffer just decrements the current buffer pointer. Normally, that works because the next buffer isn’t fetched until the first character of the buffer is needed. That fails, of course, if the reason the next character is needed is just to peek at it. That leaves the buffer pointer on the first character in the buffer so unget moves it past the front, which is the error condition that sets off the bug. The appropriate fix is left as an exercise for the reader.
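The failure mode described above can be modeled in a few dozen lines. To be clear, this is a toy and not the vendor's code: `ToyBuffer`, its members, and the tiny block size are all assumptions made for the sketch. The `lookaside` flag switches on one possible fix, which is to remember the last character of the previous block so that unget has somewhere to go.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy model of a reader that holds one block at a time (hypothetical).
class ToyBuffer {
public:
    ToyBuffer(const std::string& source, std::size_t block, bool lookaside)
        : source_(source), block_(block), lookaside_(lookaside),
          next_(0), pos_(0), saved_(-1), pushed_back_(false) {}

    int get()   // next character, or -1 at end of input
    {
        if (pushed_back_) { pushed_back_ = false; return saved_; }
        if (pos_ == buf_.size() && !refill()) return -1;
        return static_cast<unsigned char>(buf_[pos_++]);
    }

    int peek()  // next character without consuming it, or -1 at end
    {
        if (pushed_back_) return saved_;
        if (pos_ == buf_.size() && !refill()) return -1;  // may load a new block
        return static_cast<unsigned char>(buf_[pos_]);
    }

    bool unget()  // step back one character; false means the error case
    {
        if (pos_ > 0) { --pos_; return true; }
        // pos_ == 0: the previous character lived in a block that is gone.
        if (lookaside_ && saved_ != -1 && !pushed_back_) {
            pushed_back_ = true;
            return true;
        }
        return false;
    }

private:
    bool refill()
    {
        if (next_ >= source_.size()) return false;
        if (!buf_.empty())   // remember the last character of the old block
            saved_ = static_cast<unsigned char>(buf_[buf_.size() - 1]);
        buf_ = source_.substr(next_, block_);
        next_ += buf_.size();
        pos_ = 0;
        return true;
    }

    std::string source_;
    std::size_t block_;
    bool lookaside_;
    std::size_t next_;   // offset in source_ of the next block to load
    std::string buf_;    // the single current block
    std::size_t pos_;    // index of the next unread character in buf_
    int saved_;          // lookaside character, or -1 if none
    bool pushed_back_;   // true when saved_ must be returned next
};

// Reads exactly one block, peeks, then ungets, as in the test program.
bool unget_after_boundary_peek(bool lookaside)
{
    ToyBuffer b("abcdefgh", 4, lookaside);
    for (int i(0); i < 4; ++i)
        b.get();
    b.peek();
    return b.unget();
}
```

With `lookaside` set to false, `unget_after_boundary_peek` reports failure, which mirrors the bug; with it set to true the unget succeeds and the next get returns the character before the boundary.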

Really, that’s an embarrassing bug to have in base code like the STL stream buffer. On the other hand, it seems odd that I am the first person to notice it. I suppose most people don’t use unget.

Peter Burnet Saturday, 09 December 2006 at 05:59

I suppose most people don’t use unget.

There’s a safe assumption, I’d say. AOG, I must tell you once more how much I treasure your posts and comments on code. They’re pure poetry. Look at that middle paragraph! Not a noun, not even a verb is familiar. And the fact that you can switch back to English with such ease and facility marks you as a true Renaissance man. Brit, if you are there, do you think we could make a shekel or two with a compendium of them? Or maybe invent a party game where everybody writes down what he thinks AOG is saying and the rest guess which one is the real one?

Annoying Old Guy Saturday, 09 December 2006 at 09:39

Mr. Burnet;

Thanks. I think. I changed the formatting a bit. Is it comprehensible now?

For the geeks, just enclose the code snippet in @ characters, e.g. @unget@ yields unget.

cjm Saturday, 09 December 2006 at 09:39

my impression is that everything ms does is “naive”. they desperately needed to slow the pace of change simply because their people are such plodders. now that things are moving so quickly, they fall further and further behind. i would have thought every parser compiled with their tool would have blown up eventually (when the source tokens lined up just so).

Annoying Old Guy Saturday, 09 December 2006 at 10:12

Yes, I have found Dark Empire technology powerful but fragile for that very reason. I have a long history of breaking it by not doing things precisely the way the implementors thought I should. I think in this case, anyone who needs more than just get does their own read-ahead management, in which case the bug doesn’t manifest.

The other thing is that my tokenizer is different from standard language tokenizers. Note that for C and C++, by language definition you don’t need to peek because the longest possible token is always the correct interpretation (hence the closing nested template ‘>>’ problem). The language I am parsing doesn’t have that property, so sometimes it is necessary to peek to see if the next character is part of the token (equivalently, the language is LL(2), not LL(1)). I think that at the lexical level, the languages the Dark Empire supports are all LL(1) so the issue doesn’t arise.
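For anyone wondering what doing your own read-ahead management looks like, here is a rough sketch. Everything in it is hypothetical: `LookaheadReader` is an invented wrapper that only ever calls get on the underlying stream, and the grammar (a '.' belongs to a number only when a digit follows it) is a made-up stand-in, not the language actually being parsed.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical wrapper that keeps its own queue of read-ahead characters,
// so it never calls peek() or unget() on the underlying stream.
class LookaheadReader {
public:
    explicit LookaheadReader(std::istream& in) : in_(in) {}

    // Character `offset` positions ahead (0 is the next one), or -1 at end.
    int peek(std::size_t offset = 0)
    {
        while (pending_.size() <= offset) {
            int c = in_.get();
            if (c == std::char_traits<char>::eof())
                return -1;
            pending_.push_back(c);
        }
        return pending_[offset];
    }

    // Consumes and returns the next character, or -1 at end.
    int get()
    {
        int c = peek(0);
        if (c != -1)
            pending_.pop_front();
        return c;
    }

private:
    std::istream& in_;
    std::deque<int> pending_;
};

bool is_digit(int c) { return c >= '0' && c <= '9'; }

// Made-up rule: "1.5" is one token, but "1.x" stops after "1".
std::string read_number(LookaheadReader& r)
{
    std::string token;
    while (is_digit(r.peek()))
        token += static_cast<char>(r.get());
    // Two characters of lookahead: the '.' and whatever follows it.
    if (r.peek(0) == '.' && is_digit(r.peek(1))) {
        token += static_cast<char>(r.get());   // the '.'
        while (is_digit(r.peek()))
            token += static_cast<char>(r.get());
    }
    return token;
}

// Convenience function for trying the rule on a string.
std::string first_number(const std::string& text)
{
    std::istringstream in(text);
    LookaheadReader r(in);
    return read_number(r);
}
```

Because the wrapper owns the pending characters, the decision about the '.' never depends on how the stream buffer underneath handles a block boundary.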
