Code for Concinnity


Reading a Unicode (UTF16) file in Windows (C++)

I can’t believe it’s so convoluted. This is a step-by-step account of how I discovered the Elegant Way to read a Unicode file in Windows: it’s only 4 lines!

(The Unicode file I’m referring to here is what you get when you save a file as Unicode in Notepad: that’s little-endian UTF-16 with a byte order mark.)

For the impatient

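wstring ReadUTF16(const string & filename)
{
    // Thanks to neminem for reminding us that we need
    // std::ios::binary for Windows: in text mode the runtime
    // stops at a 0x1A byte (Ctrl-Z) and rewrites CR LF pairs
    ifstream file(filename.c_str(), std::ios::binary);
    stringstream ss;
    // The extra '\0' plus the string's own terminator make up
    // one 2-byte wide null
    ss << file.rdbuf() << '\0';
    return wstring((wchar_t *)ss.str().c_str());
}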

The problem

To read UTF16, one would expect to use the widechar variants from iostream:

// Failed attempt
wifstream file("file.utf8.txt");
wstringstream ss;
ss << file.rdbuf();
wstring result = ss.str();

However, the above code will give you bogus text.

The investigation

Why? If you debug it and look at memory, you will see ss.str().c_str() evaluates to something like this:

0x00367738  ff 00 fe 00 60 00 4f 00 7d 00 59 00

The first 4 bytes look strikingly suspicious! ff fe is the magic sequence (the BOM, U+FEFF) for little-endian UTF-16, yet wifstream has broken it up into two wide chars (4 bytes).

It turns out that iostream will try very hard to conform to the current code page (more precisely, to the codecvt facet of the stream's locale), something so obscure about C++ that apparently nobody talks about it (judging from my search). In most cases, the code page will be something Latin. Because wifstream assumes every file it reads is single-byte, it goes the extra mile and widens each byte into its own wide char.
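To make the dump above concrete, here is a small sketch that imitates the effect. It does not call wifstream at all; `WidenEachByte` and `RepackPairs` are helper names made up for this illustration. Each byte of the file becomes its own wide character, and gluing neighbours back together recovers the original UTF-16 code units.

```cpp
#include <cstddef>
#include <string>

// Imitates what the default wide stream did to the file: every single
// byte is promoted to its own wide character (ff -> 0x00ff, fe -> 0x00fe).
// Hypothetical helper, not part of iostream.
std::wstring WidenEachByte(const std::string& bytes)
{
    std::wstring out;
    for (unsigned char b : bytes)
        out.push_back(static_cast<wchar_t>(b));
    return out;
}

// Undoes the damage: glues each pair of widened bytes back into one
// little-endian UTF-16 code unit (low byte first).
std::u16string RepackPairs(const std::wstring& widened)
{
    std::u16string out;
    for (std::size_t i = 0; i + 1 < widened.size(); i += 2)
        out.push_back(static_cast<char16_t>(
            (widened[i] & 0xff) | ((widened[i + 1] & 0xff) << 8)));
    return out;
}
```

Feeding in the six file bytes ff fe 60 4f 7d 59 yields six wide chars, which is the bogus text from the debugger. Repacking them gives three code units: 0xFEFF (the BOM), 0x4F60 and 0x597D.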

I heard you can actually tell iostream to treat the file as UTF-16 instead. There is a setting somewhere, probably related to codecvt (click it, look for 5 seconds and then come back). OK, back? That’s definitely not something you want to touch just to read a damn file.
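For the curious, the codecvt route looks roughly like this. It is a sketch, not something I would call the blessed way: it uses `std::codecvt_utf16` from `<codecvt>`, which is deprecated since C++17 and removed in C++26, and `ReadUTF16WithCodecvt` is a name made up for illustration.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Sketch of the codecvt route: the facet decodes UTF-16 on the way in.
// consume_header skips a leading BOM; little_endian is the byte order
// assumed when there is no BOM.
// Note: <codecvt> is deprecated since C++17 and removed in C++26.
std::wstring ReadUTF16WithCodecvt(const std::string& filename)
{
    typedef std::codecvt_utf16<
        wchar_t, 0x10ffff,
        static_cast<std::codecvt_mode>(std::consume_header |
                                       std::little_endian)> Utf16Facet;

    // Still binary: the facet wants the raw bytes untouched
    std::wifstream file(filename.c_str(), std::ios::binary);
    // The locale takes ownership of the facet and deletes it
    file.imbue(std::locale(file.getloc(), new Utf16Facet));

    std::wstringstream ss;
    ss << file.rdbuf();
    return ss.str();
}
```

Unlike the cast trick, this also works where wchar_t is 4 bytes, because the facet produces one wide character per code point instead of reinterpreting memory.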

The solution

It turns out to be quite anticlimactic. Let’s look at our solution again:

wstring ReadUTF16(const string & filename)
{
    ifstream file(filename.c_str(), std::ios::binary);
    stringstream ss;
    ss << file.rdbuf() << '\0';
    return wstring((wchar_t *)ss.str().c_str());
}

You see, basically what the above does is tell iostream to read the file as a single-byte file without breaking up the bytes into wide chars (so things remain nicely packed), and then we do a hardcore conversion ourselves. Problem solved.
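The hardcore conversion can also be written without the pointer cast. Here is a minimal sketch that copies the packed bytes instead. It assumes a little-endian host (true on Windows) and uses char16_t so that a code unit is 2 bytes everywhere; `BytesToUTF16` and `ReadUTF16Packed` are names made up for illustration.

```cpp
#include <cstring>
#include <fstream>
#include <sstream>
#include <string>

// Reinterprets a packed byte buffer as UTF-16 code units by copying
// the memory as-is. Assumes the host is little-endian, like the file.
// A trailing odd byte, if any, is ignored.
std::u16string BytesToUTF16(const std::string& bytes)
{
    std::u16string text(bytes.size() / 2, u'\0');
    if (!text.empty())
        std::memcpy(&text[0], bytes.data(), text.size() * sizeof(char16_t));
    return text;
}

// Same shape as ReadUTF16: slurp the raw bytes, then reinterpret them.
std::u16string ReadUTF16Packed(const std::string& filename)
{
    std::ifstream file(filename.c_str(), std::ios::binary);
    std::stringstream ss;
    ss << file.rdbuf();
    return BytesToUTF16(ss.str());
}
```

Because the length comes from the byte count, no extra '\0' is needed, and an embedded U+0000 in the file does not cut the text short. The BOM is not removed: it stays as the first code unit.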

Caveats

I didn’t bother to track down where the 0xfffe went after the conversion; it just worked and I was done with it. I also suspect some garbage might be appended at the end of the text stream. Again, Worked For Me. In any case, just take a substring to crop out the parts you don’t want.
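The cropping can be as small as this. It is a sketch with `StripBOM` as a hypothetical helper name, and it assumes that the BOM, if present, arrives as a single leading U+FEFF wide character.

```cpp
#include <string>

// Drops a leading byte order mark (U+FEFF) if there is one and
// returns the rest of the text unchanged.
std::wstring StripBOM(const std::wstring& text)
{
    if (!text.empty() && text[0] == wchar_t(0xFEFF))
        return text.substr(1);
    return text;
}
```

Calling it on the result, as in `StripBOM(ReadUTF16(filename))`, leaves text without a BOM untouched.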

Why is it marked “Windows”?

This is marked Windows because the code above is not portable: many *nix platforms have sizeof(wchar_t) == 4. In general, most modern POSIX systems support UTF-8 natively through plain old char. You can also use the excellent libiconv, which converts everything to everything in one function call, making all this fuss irrelevant.
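If you want something that does not care about sizeof(wchar_t) or host byte order, one option is to decode the bytes by hand and pass the text around as UTF-8 in a plain std::string. The sketch below is hand-rolled (it is not libiconv). `AppendUTF8` and `UTF16LEToUTF8` are made-up names, and replacing unpaired surrogates with U+FFFD is my own assumption about what you would want.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Appends one code point to `out` as UTF-8.
static void AppendUTF8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts raw UTF-16LE bytes (as read in binary mode) to UTF-8.
// A leading BOM is skipped; a trailing odd byte is ignored.
std::string UTF16LEToUTF8(const std::string& bytes)
{
    std::string out;
    std::size_t n = bytes.size() / 2;
    // Assembles code unit i from its two bytes, low byte first
    auto unit = [&](std::size_t i) -> std::uint32_t {
        return static_cast<unsigned char>(bytes[2 * i]) |
               (static_cast<unsigned char>(bytes[2 * i + 1]) << 8);
    };
    for (std::size_t i = 0; i < n; ++i) {
        std::uint32_t u = unit(i);
        if (i == 0 && u == 0xFEFF)
            continue;                      // skip the BOM
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < n) {
            std::uint32_t lo = unit(i + 1);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                // Valid surrogate pair: combine into one code point
                u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        if (u >= 0xD800 && u <= 0xDFFF)
            u = 0xFFFD;                    // unpaired surrogate
        AppendUTF8(out, u);
    }
    return out;
}
```

Feed it the string you get from the binary-mode stringstream in ReadUTF16 (without the extra '\0') and the result can go straight to anything that expects UTF-8.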

Alternatives

If you deal with UTF8, you might want to check out UTF8-CPP. It probably also supports reading UTF16 but I couldn’t find a quick way to make it work. It’s less heavyweight than ICU — but who can beat 4 lines when you just want to read a darn file?

(UTF8-CPP does support reading UTF-16. Unfortunately it only supports converting it to UTF-8, which is something Windows can’t do conveniently)

Published by kizzx2, on August 3rd, 2010 at 12:15 am. Filed under: Interesting things. 1 Comment

One Response to “Reading a Unicode (UTF16) file in Windows (C++)”

  1. So, I spent way too long yesterday trying to figure out why certain input files were failing to read all the way through using this method. Then I gave up and made my first question post on stackoverflow, and wasted a few other peoples’ time, until someone finally discovered: at least in VS2008, your solution doesn’t work on files containing, among other things, the unicode character named FULLWIDTH COLON (U+FF1A), unless you open the fstream in binary mode. If that line instead looks like “std::ifstream file(filename.c_str(), std::ios::binary);”, then it works.

    Comment by neminem on May 9, 2012 at 6:04 am


