I can’t believe it’s so convoluted. This is a step-by-step account of how I discovered the Elegant Way to read a Unicode file in Windows: it’s only 4 lines!
(The Unicode file I’m referring to here is what you get when you save a file as Unicode in Notepad, which is little-endian UTF-16.)
For the impatient
    wstring ReadUTF16(const string & filename)
    {
        // Binary mode: keep the bytes exactly as stored (no newline or Ctrl-Z handling)
        ifstream file(filename.c_str(), ios::binary);
        stringstream ss;
        ss << file.rdbuf() << '\0';
        return wstring((const wchar_t *)ss.str().c_str());
    }
The problem
To read UTF16, one would expect to use the widechar variants from iostream:
    // Failed attempt
    wifstream file("file.utf8.txt");
    wstringstream ss;
    ss << file.rdbuf();
    wstring result = ss.str();
However, the above code will give you bogus text.
The investigation
Why? If you debug it and look at memory, you will see ss.str().c_str() evaluates to something like this:
0x00367738 ff 00 fe 00 60 00 4f 00 7d 00 59 00
The first 4 bytes look strikingly suspicious! The byte pair ff fe is the byte order mark (BOM) of little-endian UTF-16, yet it has been broken up by wifstream into two wide chars (4 bytes).
It turns out that iostream will try very hard to conform to the current codepage, something so obscure about C++ that apparently nobody talks about it (judging from my search). In most cases, the code page will be something Latin. iostream goes the extra mile and widens every byte into its own widechar, because it assumes every file it reads is single-byte.
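To make that concrete, here is a minimal sketch of what the widening amounts to. `WidenBytes` is a hypothetical helper, not what iostream literally runs. It simply assumes that each byte is mapped to one wide char, which matches the memory dump above.

```cpp
#include <string>

// Hypothetical helper: mimics the per-byte widening seen in the dump.
// Each input byte becomes its own wchar_t, so the UTF-16 byte pairs
// are torn apart (ff fe -> 00ff 00fe) instead of being combined.
std::wstring WidenBytes(const std::string& bytes)
{
    std::wstring out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes)          // unsigned: avoid sign extension
        out.push_back(static_cast<wchar_t>(b));
    return out;
}
```

Feeding it the first four bytes of the file (ff fe 60 4f) yields four wide chars rather than two, which is exactly the bogus layout shown in the debugger.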
I heard you can actually tell iostream to treat the file as UTF16 instead. There is a setting somewhere, probably related to codecvt (look it up, stare at it for 5 seconds and then come back). OK, back? That’s definitely not something you want to touch just to read a damn file.
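For the curious, this is roughly what the codecvt route might look like. Treat it as a sketch under assumptions, not a tested recipe for every compiler: it uses `std::codecvt_utf16` from `<codecvt>` (standard since C++11, deprecated since C++17), and `ReadUTF16WithCodecvt` is a name made up for illustration.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical alternative: let the stream decode UTF-16 itself.
std::wstring ReadUTF16WithCodecvt(const std::string& filename)
{
    // Binary mode so the raw bytes reach the conversion facet untouched.
    std::wifstream file(filename.c_str(), std::ios::binary);

    // consume_header: skip the BOM if present.
    // little_endian: assumed byte order when there is no BOM.
    typedef std::codecvt_utf16<
        wchar_t, 0x10ffff,
        std::codecvt_mode(std::consume_header | std::little_endian)> Utf16Facet;

    // The locale takes ownership of the facet allocated here.
    file.imbue(std::locale(file.getloc(), new Utf16Facet));

    std::wstringstream ss;
    ss << file.rdbuf();
    return ss.str();
}
```

It works on paper, but it is a lot of ceremony next to the 4-line version.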
The solution
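It turns out to be quite anticlimactic. Let’s look at our solution again:
    wstring ReadUTF16(const string & filename)
    {
        // Binary mode: keep the bytes exactly as stored (no newline or Ctrl-Z handling)
        ifstream file(filename.c_str(), ios::binary);
        stringstream ss;
        // The extra '\0' plus the one c_str() adds form a 2-byte wide terminator
        ss << file.rdbuf() << '\0';
        // Reinterpret the packed bytes as UTF-16 code units (needs sizeof(wchar_t) == 2)
        return wstring((const wchar_t *)ss.str().c_str());
    }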
You see, basically what the above does is tell iostream to read the file as a single-byte file without breaking up the bytes into widechars (so things remain nicely packed), and then we do a hardcore conversion ourselves. Problem solved.
Caveats
I didn’t bother to track down where the 0xfffe went after the conversion; it just worked and I was done with it. I also suspect some garbage might be appended at the end of the text stream. Again, Worked For Me. In any case, just take a substring to crop out the parts you don’t want.
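For the cropping, a minimal sketch. It assumes the BOM bytes ff fe survive the cast as a single leading wide char with value 0xFEFF, which is what reinterpreting little-endian bytes should produce on Windows. `StripBom` is a hypothetical helper name.

```cpp
#include <string>

// Hypothetical helper: drop a leading byte order mark (U+FEFF) if present.
std::wstring StripBom(std::wstring text)
{
    if (!text.empty() && text[0] == 0xFEFF)
        text.erase(0, 1);   // remove only the first code unit
    return text;
}
```

Usage would be something like `wstring text = StripBom(ReadUTF16("file.txt"));`, where the file name is just a placeholder.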
Why is it marked “Windows”?
This is marked Windows because the code above is not portable: many *nix platforms have sizeof(wchar_t) == 4. In general, most modern POSIX systems support UTF-8 natively through plain old char. You can also use the excellent libiconv, which converts everything to everything in one function call, making all this fuss irrelevant.
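If you want the same trick without depending on sizeof(wchar_t), one option is to assemble the code units by hand. The sketch below is an assumption-laden variant, not the 4-liner: it returns a `std::u16string` (C++11) instead of a `wstring`, keeps the BOM as the first code unit, and silently ignores a trailing odd byte. `DecodeUTF16LE` and `ReadUTF16Portable` are hypothetical names.

```cpp
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>

// Combine each little-endian byte pair into one 16-bit code unit.
std::u16string DecodeUTF16LE(const std::string& bytes)
{
    std::u16string out;
    out.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const unsigned lo = static_cast<unsigned char>(bytes[i]);
        const unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        out.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    return out;
}

// Read the raw bytes exactly as before, then decode explicitly.
std::u16string ReadUTF16Portable(const std::string& filename)
{
    std::ifstream file(filename.c_str(), std::ios::binary);
    std::stringstream ss;
    ss << file.rdbuf();
    return DecodeUTF16LE(ss.str());
}
```

The explicit shift replaces the pointer cast, so the result does not depend on the platform’s wchar_t size or byte order.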
Alternatives
If you deal with UTF8, you might want to check out UTF8-CPP. It probably also supports reading UTF16 but I couldn’t find a quick way to make it work. It’s less heavyweight than ICU — but who can beat 4 lines when you just want to read a darn file?
(UTF8-CPP does support reading UTF-16. Unfortunately it only supports converting it to UTF-8, which is something Windows can’t do conveniently)