Code for Concinnity

beautiful and elegant solutions


Reading a Unicode (UTF16) file in Windows (C++)

I can’t believe it’s so convoluted. This is a step-by-step account of how I discovered the Elegant Way to read a Unicode file in Windows: it’s only 4 lines!

(The Unicode file I’m referring to here is what you get when you save a file as Unicode in Notepad, which is little-endian UTF-16.)

For the impatient

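// Needs <fstream>, <sstream>, <string>; assumes sizeof(wchar_t) == 2 (Windows)
wstring ReadUTF16(const string & filename)
{
    // Binary mode, so the runtime doesn't translate line endings or stop at a 0x1A byte
    ifstream file(filename.c_str(), ios::binary);
    stringstream ss;
    // The extra '\0' plus the string's own terminator form a wide-char null
    ss << file.rdbuf() << '\0';
    return wstring((wchar_t *)ss.str().c_str());
}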

The problem

To read UTF16, one would expect to use the widechar variants from iostream:

// Failed attempt
wifstream file("file.utf8.txt");
wstringstream ss;
ss << file.rdbuf();
wstring result = ss.str();

However, the above code will give you bogus text.

The investigation

Why? If you debug it and look at memory, you will see ss.str().c_str() evaluates to something like this:

0x00367738  ff 00 fe 00 60 00 4f 00 7d 00 59 00

The first 4 bytes look strikingly suspicious! The byte pair ff fe is the magic sequence (the BOM, U+FEFF) for little-endian UTF-16, yet it’s been broken up by wifstream into two wide chars (4 bytes).

It turns out that iostream will try very hard to conform to the current locale and its code page, something so obscure about C++ that apparently nobody talks about it (from my search). A wide stream converts the file’s bytes to wide chars through the locale’s codecvt facet, and in most cases that means something Latin or plain single-byte. So iostream goes the extra mile and widens every byte into its own wide char, because it assumes all files it reads to be single-byte.
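To make the memory dump above less mysterious, here is a minimal sketch that mimics the byte-by-byte widening. WidenBytes is a made-up helper for illustration only, not what the standard library actually runs internally:

```cpp
#include <string>

// Hypothetical helper: widens every byte into its own wide char, which is
// the effect observed when wifstream reads a UTF-16 file as single-byte text.
std::wstring WidenBytes(const std::string& bytes)
{
    std::wstring wide;
    for (std::string::size_type i = 0; i < bytes.size(); ++i)
    {
        // Go through unsigned char so bytes >= 0x80 don't sign-extend
        wide.push_back(static_cast<wchar_t>(static_cast<unsigned char>(bytes[i])));
    }
    return wide;
}
```

Feeding it the file bytes ff fe 60 4f gives four wide chars (0x00ff, 0x00fe, 0x0060, 0x004f) instead of the two UTF-16 code units you wanted.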

I heard you can actually tell iostream to treat the file as UTF16 instead. There is a setting somewhere probably related to (Click it, look 5 seconds and then come back) codecvt. OK, back? That’s definitely not something you want to touch just to read a damn file.

The solution

It turns out to be quite anticlimactic. Let’s look at our solution again:

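// Needs <fstream>, <sstream>, <string>; assumes sizeof(wchar_t) == 2 (Windows)
wstring ReadUTF16(const string & filename)
{
    // Binary mode, so the runtime doesn't translate line endings or stop at a 0x1A byte
    ifstream file(filename.c_str(), ios::binary);
    stringstream ss;
    // The extra '\0' plus the string's own terminator form a wide-char null
    ss << file.rdbuf() << '\0';
    return wstring((wchar_t *)ss.str().c_str());
}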

You see, basically what the above does is tell iostream to read the file as a single-byte file without breaking up the bytes into wide chars (so things remain nicely packed), and then we do a hardcore conversion ourselves. Problem solved.

Caveats

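I didn’t bother to trace down where the 0xfffe went after the conversion; it just worked and I was done with it. (It doesn’t actually go anywhere: the BOM stays in the result as the first character, U+FEFF, which displays as nothing.) I also suspect some garbage might be appended at the end of the text stream, for example if the file has an odd number of bytes. Again, Worked For Me. In any case, just use substr to crop out the parts you don’t want.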
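If the leading BOM bothers you, cropping it can look something like the sketch below. StripBom is a hypothetical helper name, and it assumes the text came from a function like ReadUTF16 above:

```cpp
#include <string>

// Hypothetical helper: drops a leading byte order mark (U+FEFF), if any.
std::wstring StripBom(const std::wstring& text)
{
    if (!text.empty() && text[0] == L'\xFEFF')
        return text.substr(1);
    return text;
}
```

Usage would be along the lines of StripBom(ReadUTF16("file.txt")).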

Why is it marked “Windows”?

This is marked Windows because the code above is not portable — many *nix platforms have sizeof(wchar_t) == 4. In general, most modern POSIX supports UTF8 natively through plain old char. You can also use the excellent libiconv which converts everything to everything in one function call — making all this fuss irrelevant.
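If you do need this on a platform where wchar_t is 4 bytes, one option is to skip the cast and assemble the 16-bit code units by hand. This is only a sketch: DecodeUtf16Le is a made-up name, it needs C++11 for std::u16string, and it does not validate surrogate pairs or strip the BOM:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: turns raw UTF-16LE bytes into 16-bit code units
// without relying on sizeof(wchar_t).
std::u16string DecodeUtf16Le(const std::string& bytes)
{
    std::u16string units;
    units.reserve(bytes.size() / 2);
    // Walk the bytes in pairs; a trailing odd byte is ignored
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2)
    {
        const unsigned char lo = static_cast<unsigned char>(bytes[i]);
        const unsigned char hi = static_cast<unsigned char>(bytes[i + 1]);
        // Little endian: the low byte comes first
        units.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    return units;
}
```

The bytes would come from the same binary-mode ifstream and stringstream combination as in ReadUTF16; only the final conversion step changes.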

Alternatives

If you deal with UTF8, you might want to check out UTF8-CPP. It probably also supports reading UTF16 but I couldn’t find a quick way to make it work. It’s less heavyweight than ICU — but who can beat 4 lines when you just want to read a darn file?

(UTF8-CPP does support reading UTF-16. Unfortunately it only supports converting it to UTF-8, which is something Windows can’t work with conveniently.)

Published by kizzx2, on August 3rd, 2010 at 12:15 am. Filed under: Interesting things.

Tuples considered harmful

(This post started out as an elaborate explanation to someone who just couldn’t wrap his head around C++’s boost::tuple. The title is probably a misnomer. It should be renamed “boost::tuples considered harmful outside of quick throw-away internal uses and TMP where they were actually intended to be used,” :P )

The examples here use C++, which has a very broken static type system. Some of these problems are only the ill effects of that. Some of these are inherent to tuples.

In particular, if you’re a budding C++ programmer who finally wants to try this new-agey feature from boost called tuples, this is a tutorial to tell you why you “don’t” want to go down that road.

Sin #1: Tuples make you anti-social

Let’s look at the following piece of code:

// awesome_math_library.hpp
#include <boost/tuple/tuple.hpp>
boost::tuple<int, int> divide(int numerator, int denominator);

By looking at the function declaration, how would you use divide?

int quotient, remainder;

// Possibility 1
boost::tie(quotient, remainder) = divide(42, 10);

// Possibility 2
boost::tie(remainder, quotient) = divide(42, 10);

How do you know which one is the correct usage? It turns out there is no way to know unless you look at the source code.

Forcing people to have to read your source code before they can use it is plain wrong.

Now it’s probably OK for dynamic languages (they tend to come from open source folks), but it will doom all C++ hardcore machos, because having to know about the implementation breaks the OOP creed: encapsulation.

Some people think “parameter names suffer the same problem, you just need to name your method appropriately.” Here you go:

boost::tuple<int, int> DivideFirstResultIsRemainderSecondIsQuotient(int numerator, int denominator);

I don’t think I need to say more. No need to thank me for the laugh :P

Sin #2: Tuples lead to fragile code

Another example:

boost::tuple<void *, int> get_cube();

Fair enough, the type system actually helped us deduce the sensible usage:

void * vertices;
int num_vertices;
boost::tie(vertices, num_vertices) = get_cube();

So far so good. Fast-forward 6 months, it turns out our method wants to also include color information in the cube:

boost::tuple<void *, int, char, char, char> GetCubeVerticesWithColorsRGBInThisOrder();

All of a sudden, the original code breaks, since functions can’t be overloaded on return type. Using a tuple as your return type is a fast way to seal off your function from future extensions.

Sin #3: Tuples obfuscate your code

using boost::tuple;
using boost::get;

int PickUpTreasure(tuple<int, int, std::string> player, tuple<char, char, int> treasure_chest)
{
    // .. after 200 lines of code, some months later you look at
    // this misery: wtf does it do?
    return get<0>(player) + get<2>(treasure_chest) * get<1>(player);
}

This demonstrates the serious problem of Sin #1: every user of tuples needs to go look up the source code.

It also depicts another problem with tuples: PickUpTreasure()’s writer must spell out the full definition of the tuples even though he only uses a handful of the values.

Sin #4: Tuples break type-safety

void TranslateCoordinate(tuple<int, int> point);

This function can be easily abused, and it compiles without problem:

TranslateCoordinate(divide(30, 50));

What a pity.

So am I forbidden to return multiple values?

No. It’s called a plain old struct:

struct cube
{
    void * vertices;
    int num_vertices;
    unsigned char r, g, b;
};

// old code that doesn't know about colors
cube old_code()
{
    cube a = {NULL, 0};
    return a;
}

// new code that uses colors
cube new_code()
{
    cube a = old_code();
    a.r = 10;
    a.g = 20;
    a.b = 42;
    return a;
}

struct point
{
    int x, y;
};

struct division_result
{
    int remainder, quotient;
};

void TranslateCoordinate(point p);

// Notice that it nicely conveys the intent of the function.
// With divide() changed to return division_result, this is now illegal, as it should be:
TranslateCoordinate(divide(50, 21));    // ERROR

Old code can safely use the new struct as-is, without any modifications. Old code can even pass updated objects verbatim to new code that can enhance it. ’nuff said.
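For completeness, here is roughly what divide() could look like once it returns the struct. The post only ever declares divide(), so the body below is a hypothetical implementation, and the assert is an added assumption about how you want to treat division by zero:

```cpp
#include <cassert>

struct division_result
{
    int remainder, quotient;
};

// Hypothetical implementation; callers read the fields by name, so the
// order of the members no longer matters to them.
division_result divide(int numerator, int denominator)
{
    assert(denominator != 0);
    division_result result;
    result.quotient = numerator / denominator;
    result.remainder = numerator % denominator;
    return result;
}
```

The call site then reads divide(42, 10).quotient, and there is nothing left to guess.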

FAQs

1. Your second example sucks. I can say “using int as your return type is a fast way to seal off your function for future extensions”

True. Primitive types are usually expected to represent a very fine-grained entity that shouldn’t change. Of course it’s kind of an assumption. The problem with tuple is, tuple<int, int> is actually more like an object. tuple<int, int, char, char, std::string> even more so. The more types you pack together in a tuple, the more fragile that tuple is.

Think about it this way: if a function that originally returns int is suddenly changed to return std::string, the whole function probably needs to be rewritten anyway and the return type is the least of our concerns (the more important concern is behavior, obviously). However, if your return type is tuple<int, std::string, double>, it is very likely that your function does more than one thing and you may well need to expand your return type to tuple<int, std::string, double, void *> some time later so you can expose more of your stuff that you originally thought could be encapsulated (this is a whole different topic).

Of course, no one can be absolutely sure he won’t want to return more things when he writes a function that returns more than one value. Think how struct handles it elegantly.

2. Defining a new type every time I want to stitch together a bunch of variables sucks!

It does, but that’s a fact of life, embrace it. Understand that what you think is “a bunch of variables” may evolve into a full blown object sooner than you may think.

In a few convenient places, though, you can use tuple internally if you don’t expose tuples to the outside world (so other people don’t get confused). But how many non-trivial projects are internal? Even if your project is from the same company, it’s just good practice to write your part like an API so other people (including yourself 2 years later) can use it conveniently.

With those in mind, we can come to the conclusion that useful scenarios for tuples are really limited.

3. Your third example is stupid. Nobody uses tuples in parameter list. Functions have a natural parameter list that supports parameter naming

Right, let’s change it:

int PickUpTreasure(int playerHP, int playerLevel, int treasureMoney);

Seems better, let’s see how our caller adapts

PickUpTreasure(get<0>(player), get<1>(player), get<2>(treasure));

Yikes!

Just for your comparison, here’s a well-formed and well-typed version using plain old OOP:

player.PickUp(treasure);

Who said programming languages must be cryptic!?
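Here is a sketch of what could sit behind that one-liner. Player and TreasureChest are hypothetical types; the member names are borrowed from the parameter names above, the arithmetic is copied from PickUpTreasure(), and the chest’s two char fields are left out because the post never says what they mean:

```cpp
#include <string>

// Hypothetical type: only the value the calculation needs.
struct TreasureChest
{
    int money;
};

class Player
{
public:
    Player(int hp, int level, const std::string& name)
        : hp_(hp), level_(level), name_(name) {}

    // Same arithmetic as PickUpTreasure(), with names instead of indices
    int PickUp(const TreasureChest& chest) const
    {
        return hp_ + chest.money * level_;
    }

private:
    int hp_;
    int level_;
    std::string name_;
};
```

Nobody has to remember which slot of which tuple holds the level any more.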

4. Your last remark in the third example shows your noobness. The author doesn’t have to spell out the definition of the tuple. He could have used a simple typedef

Right, but that kind of defeats the purpose of tuples IMO. The fact that I want to use a tuple is because it’s quick and dirty. If I go through the trouble to type:

typedef tuple<int,int>point;

Maybe I should just type a few more characters and benefit from named values:

struct point{int x, y;};

Oops, it turned out to be fewer characters, ironically :P

5. How about out params? (Not really related to tuples)

They’re sometimes needed but should be avoided as much as possible:

big_object * my_object = NULL;
create_big_object(&my_object);

// Danger! my_object may be NULL
use_one_attribute(my_object->name);

In general, it’s better to return a struct by value, provided your struct holds a small number of primitive types (such as Point, Matrix, Rectangle, etc.). There is no performance penalty for returning a simple struct by value (see explanation below), and the added benefit is that the intent becomes crystal clear. The pass-by-value syntax is how programming languages should work, as God intended. I no longer need to worry about object lifetimes and a whole bunch of unimportant stuff.

(For low-level machos, it’s notable that it’s faster for the CPU to juggle around primitive types in registers* than accessing them from memory. But that’s a different beast of a topic.)

If you need to return a pointer, use their modern variants instead:

// Clear intent -- create_big_object will give up ownership so the caller
// should take care of the object's life time
std::auto_ptr<big_object> my_object = create_big_object();

If you think about it, what’s an auto_ptr? It’s a struct! (class actually, synonymous in C++)
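A note for readers on newer compilers: std::auto_ptr was deprecated in C++11 and removed in C++17, and std::unique_ptr expresses the same “caller takes ownership” intent. A minimal sketch, assuming C++11 and a made-up big_object with a single name member:

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for the big_object used above.
struct big_object
{
    std::string name;
};

// The return type says it all: the caller owns the object from here on.
std::unique_ptr<big_object> create_big_object()
{
    std::unique_ptr<big_object> object(new big_object);
    object->name = "example";
    return object;
}
```

The object is destroyed automatically when the returned unique_ptr goes out of scope, so there is no delete to forget.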

* I made that remark without really giving it deep consideration. I did a very crude test using a 2-double Point struct and came to that conclusion. Obviously it’s very compiler specific. Most of the time, you’d find that the pass-by-value version is faster when you have 2 to 4 members in the struct, and your compiler is using some sort of fastcall or x64 calling convention. Having said that, it’s probably safer to stick to good ol’ pass-by-const-reference most of the time anyway. For return values, we have RVO so it’s usually OK to return whole objects.

Published by kizzx2, on July 31st, 2010 at 2:03 am. Filed under: Interesting things.

Composition vs. Inheritance

I was thinking about composition vs. inheritance today and I wanted to make everything crystal clear. Here’s some quick brain-dump of my research and thoughts:

  • Use composition when possible. It leaves wiring up the dependency to the call site. Since modern programming idioms prefer composition to inheritance, most of the time using delegates and Dependency Injection to achieve composition would be a Good Thing.

  • Why prefer composition to inheritance? Simply because inheritance is almost the strongest coupling you can introduce between two classes (second only to friend classes in C++). We all know that coupling is bad and that separation of concerns is good.

  • Use Inheritance whenever necessary. An example of a necessary case is when the new functionality (injection point) needs to access the protected members of the base class.

  • When using Inheritance, prefer nonpublic inheritance to public inheritance. Use private inheritance when you need to access the protected members of the base class; use protected inheritance when you need to access the protected members of the base class, while at the same time exposing them to your derivatives.

  • Only use public inheritance when your derived class is a true IS-A relationship that satisfies the Liskov Substitution Principle.

  • There is a good quote from somewhere:

Inherit (publicly) not to reuse, but to be reused.
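To make the first bullet concrete, here is a small sketch of composition through Dependency Injection. All the names (Logger, MemoryLogger, Downloader) are invented for illustration:

```cpp
#include <string>
#include <vector>

// Hypothetical interface: public inheritance is used here "to be reused".
class Logger
{
public:
    virtual ~Logger() {}
    virtual void Log(const std::string& message) = 0;
};

// One possible implementation; it simply remembers what was logged.
class MemoryLogger : public Logger
{
public:
    void Log(const std::string& message) { lines.push_back(message); }
    std::vector<std::string> lines;
};

// Composition: Downloader has a Logger, it is not one. The call site
// decides which Logger gets injected.
class Downloader
{
public:
    explicit Downloader(Logger& logger) : logger_(logger) {}
    void Fetch(const std::string& url) { logger_.Log("fetching " + url); }

private:
    Logger& logger_;
};
```

Swapping in a different Logger needs no change to Downloader, which is the loose coupling the bullets above are after.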

Is there anything that inheritance can do but composition can’t do?

  • Yes, when you want to access protected members of the base class.

  • Theoretically, this could still be done using composition by having the base class pass along the required attributes, but that often quickly degenerates into passing a chain of messages.

  • Even if you pass an instance of the Base class to the Derived class in the hope that the Derived class can access the Base class, it’s still limited to the public interface only.
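The difference in access can be sketched like this. Base, Wrapper and Inspector are made-up names; the commented-out line is the one the compiler would reject:

```cpp
class Base
{
public:
    Base() : counter_(0) {}
    void Tick() { ++counter_; }

protected:
    int counter_;   // visible to derived classes only
};

// Composition: only Base's public interface is reachable.
class Wrapper
{
public:
    void Tick() { base_.Tick(); }
    // int Count() const { return base_.counter_; }   // error: counter_ is protected

private:
    Base base_;
};

// Private inheritance: the protected member is reachable, and Base's
// interface stays hidden from Inspector's users unless re-exported.
class Inspector : private Base
{
public:
    using Base::Tick;
    int Count() const { return counter_; }
};
```

Inspector can read counter_ directly, while Wrapper would need Base to grow a public getter first.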

Published by kizzx2, on June 13th, 2010 at 3:21 pm. Filed under: Interesting things.

Some great bash command line tricks I learned lately

Many adopted from Peteris Krumins’ blog post

Display mounted file systems nicely

The main point here is really about the column command:

$ mount
/dev/root on / type ext3 (rw)
/proc on /proc type proc (rw)
/dev/mapper/lvmraid-home on /home type ext3 (rw,noatime)

$ mount | column -t
/dev/root                 on  /      type  ext3   (rw)
/proc                     on  /proc  type  proc   (rw)
/dev/mapper/lvmraid-home  on  /home  type  ext3   (rw,noatime)

# woot, now it's printing out nicely!

Repeat arguments of the most recent command

Alt + .

$ echo hello world
hello world

$ echo (Alt + .) # this becomes...
$ echo world

# Cool, but what if I wanted "hello"?

$ echo hello world
hello world

$ echo (Alt + 1)(Alt + .)
$ echo hello # here you go!

# Doing the same thing with history expansion
$ echo hello world
$ echo !!:1
echo hello
hello

# There's a shorthand for the last argument in history expansion
$ echo hello world
$ echo !$
echo world
world

Edit the whole command line in $EDITOR

This one is extremely huge. Ever got tired of typing those multi-line ffmpeg command lines? Here’s your salvation (and his name is vi):

$ ffmpeg -i my-input-file.avi <Ctrl+x><Ctrl+e>
# You can specify multiple command lines in your $EDITOR and they'll be executed one by one, cool!

Published by kizzx2, on April 22nd, 2010 at 2:34 am. Filed under: Interesting things.

Firefox vs. Chrome — is Mozilla going down?

Just read an article on Tom’s Hardware: Why Mozilla Needs to go into Survival Mode. Wolfgang Gruener said that “Google’s royalties account for 80-90% of Mozilla’s entire revenues.” If that’s true, then when Google’s got their own browser, Mozilla would surely go down.

Since its launch last year, Chrome’s been iterating extremely rapidly. Chrome v5 is about to launch; that’s five freaking versions in a year! Firefox, on the other hand, has pretty much (seemingly) been stuck since 3.6. This is probably the weakness of open source software: changes are so damn slow!

I would be very sad to see Firefox go down, because the market really needs more open source software instead of giving more personal information to Google.

Strangely enough, my personal experience with Chrome isn’t particularly spectacular. Yes, it launches (way) faster than Firefox and I use it on my Eee PC 1000H, but it’s not that much more stable. Ironically, given its tab-process separation thing, I’ve had Chrome crash on me several times, bringing down the whole system. Another thing is that I’ve never managed to install any extensions in Chrome; every time I bothered to try, it crashed and would possibly bring my system down with it.

Published by kizzx2, on April 19th, 2010 at 1:13 am. Filed under: Interesting things.

Google’s SketchUp vs Autodesk’s AutoCAD

I just used Google’s SketchUp for some quick personal project, and I was just wondering: “This is a very powerful piece of software, and it’s available for free! What’s the catch?”

I mean, there are “commercial grade” software packages out there like Autodesk’s famous AutoCAD. How does SketchUp stack up against AutoCAD?

The information was surprisingly scarce, so I decided to collect my findings (mainly from forums) into this post:

What’s missing in SketchUp

SketchUp handles large drawings poorly

It’s quite a well-known fact that SketchUp slows to a crawl when you load a drawing with a high poly count. This effectively puts SketchUp in the “sketch” category instead of that of a professional drawing tool.

From my experience, SketchUp’s definition of “large drawings” may actually be way smaller than you might think. For instance, importing a typical “house” sketch from the Google 3D Warehouse with a garden and a car will actually stretch SketchUp quite far to its limits.

Also, from my own experience, SketchUp as a whole is a less stable piece of software, probably because it’s new (and probably because it’s written in Ruby).

Create complex animations

You could do it with Podium’s Animate, which is a plugin for SketchUp. But it works by creating a new scene for every frame; you get the idea. SketchUp’s scene-based animation is only useful for simple presentations (which suits its intended purpose).

Efficient user interface

In AutoCAD you can access commands by typing them in the command line. For example, I can type rect to access the rectangle tool, or ext to use extrude. Whereas in SketchUp I have to move my left hand all the way to the P key just to use the Push/Pull tool.

Where SketchUp is actually better than AutoCAD

Google 3D Warehouse

This is probably THE most winning feature of SketchUp. I mean, the major selling point of SketchUp is to be able to do rapid prototyping, but with practice, AutoCAD masters can probably dish out sketches as fast as SketchUp.

The 3D Warehouse is actually where SketchUp users will gain a huge edge in productivity. After several years of release, the warehouse covers an amazingly comprehensive array of objects ready to be imported to SketchUp without opening the Web browser (well, it does open an internal browser inside SketchUp).

Ruby script

SketchUp supports scripts with Ruby, so I don’t have to learn another new scripting language like I have to in AutoCAD. It looks like AutoCAD supports scripting in Ruby with IronRuby, but that’s probably not as fluid as using a native, standard scripting language.

Common myth: SketchUp’s look and feel isn’t professional

SketchUp seems to give you a really fast and easy to use tool to do some visualization, but some people may think that SketchUp’s look and feel simply don’t look professional for a presentation. While it doesn’t come with any decent renderer out of the box, the image quality is pretty much dependent on the renderer, not the modelling software itself. Look at this render by VRay for SketchUp made with SketchUp:

A render by VRay for SketchUp

Free renderer Kerkythea for SketchUp also gives satisfactory results:

A render by Kerkythea for SketchUp

Verdict

I think what it really comes down to is your budget for money and time investment. SketchUp is free and very quick to pick up, but when it comes to large drawings or extreme details, professional packages like AutoCAD are still worth the money.

Published by kizzx2, on April 8th, 2010 at 5:45 pm. Filed under: Interesting things.