Typing Chinese like English

I’ve long wished that typing Chinese could be as intuitive as typing English. By that, I mean that you should be able to approach a Chinese keyboard for the first time in your life and type any character you know, regardless of what dialect you speak. You shouldn’t need prior study of an input method like 倉頡. You shouldn’t need to know Mandarin. You shouldn’t even need to know Chinese beyond a cursory study of the writing system. I wanted there to be a Chinese equivalent of hunt-and-peck typing. 看見打字.

For about 15 years now, I’ve attempted a solution to this problem. I’ve had a few periods where I’ve worked on it intensely, but it’s mostly been a background hobby. To be honest, if I had known how long it was going to take, I might never have started.

I’m making it available for use today. Is it perfect? No. But I think it’s good enough to show other people. Also, I’m kinda sick of working on it.

So introducing: http://wuhou.im

I named it after Empress Wu, the only female ruler of imperial China, and somewhat relevant to the issue of Chinese input for the fact that she invented her own new Chinese characters.

For a long time I was stuck on the fact that HTML didn’t provide a way to know exactly which physical key the user is pressing when typing on a website, but thanks to Firefox’s support of DOM Level 3 key events, that is now possible. (If you’re using a different browser, the website might not work properly if you have a non-QWERTY keyboard layout.)

My current employer doesn’t let me work on Open Source projects or personal projects without prior permission, so let me be clear that this is something I developed prior to my current employment. And if I made any minor tweaks recently, you have no proof of that.

I myself do not know Chinese well (my interest in Chinese is mostly as a historical language of Vietnam and a liturgical language of Buddhism), so if you want to leave any comments, I’m gonna need them in English (or French).

Edit: You gotta press SHIFT to see more radicals. I’m sorry that’s not obvious. I’ll try to make that more intuitive.

Rebranding

In case any of my colleagues haven’t heard, I no longer work at Mozilla. I apologize for not sending out a company-wide farewell email. I wasn’t expecting my email access to be cut off so promptly.

Hence, this is no longer my work blog, but I still hope to contribute to the Mozilla Project (I am working on some patches as we speak), and I will continue to blog here about tech-related topics (and especially Firefox-related topics) on the occasion that I have something to say.

Python-style ‘with’ in C++

One of the most important idioms in C++ is “Resource Acquisition Is Initialization” (RAII), where the constructor of a class acquires a resource (like locking a mutex, or opening a file) and the corresponding destructor releases the resource.

Such classes are almost always used as local variables. The lifetime of the variable lasts from its point of declaration to the end of its innermost containing block.

A classic example is something like:

extern int n;
extern Mutex m;

void f() {
    Lock l(&m); // This is an RAII class.
    n++;
}

The problem is, code written with these classes can easily be misunderstood.

1) Sometimes the variable is used solely for the side-effects of its constructor and destructor. To a naive coder, this can look like an unused variable, and they might be tempted to remove it.

2) In order to control where the object is destroyed, the coder sometimes needs to add another pair of braces (curly brackets) to create a new block scope. Something like:

void f() {
    [...]

    {
        Lock l(&m);
        n++;
    }

    [...]
}

The problem is, it’s not always obvious why that extra block scope is there. (For trivial code, like the above, it’s obvious. But in real code, ‘l’ might not be the only variable defined in the block.) To a naive coder, it might even look like an unnecessary scope, and they might be tempted to remove it.

The usual solutions to this situation are: (a) Write a comment, or (b) trust people to understand what you meant. Those are both bad options.

That’s why I’m fond of the following MACRO.

#define with(decl) \
for (bool __f = true; __f; ) \
for (decl; __f; __f = false)

This allows you to write:

void f() {
    [...]

    with (Lock l(&m)) {
        n++;
    }

    [...]
}

This creates a clear association between the RAII object and the statements which depend on its existence. The code is less likely to be misunderstood.

If you’re familiar with Python, you’ll recognize this as analogous to Python’s with-statement.

Any kind of variable declaration can go in the head of the with-statement (as long as it can appear in the head of a for-statement). The body of the with-statement executes once. The variable declared in the head of the statement is destroyed after the body executes.

I didn’t invent this kind of MACRO. (I think I first read about something similar in a Dr Dobb’s article.) I just find it really useful, and I hope you do too.

The Canadian, Day 5

Today I woke up on the outskirts of Greater Vancouver, judging by the signs in this industrial area. The steward has just announced we’ll be arriving in Vancouver in one hour.

Thus concludes my journey. Though the journey spanned five calendar days, I left late on Thursday night and I’m arriving early Monday morning, so it’s more like four nights and three days. (A total travelling time of 3 days and 14.5 hours.) Because I travelled over a weekend, I only lost one day of office time and gained a scenic weekend.

The trip roughly breaks down to one day in the forests of Ontario, one day across the plains of Manitoba and Saskatchewan, and one day across the mountains of Alberta and British Columbia.

In retrospect, I should have done my work on the second day, when my mobile internet connection was good and the scenery was less interesting. (Sorry, Manitoba. Sorry, Saskatchewan.)

From here I take a bus to Whistler to attend what Mozilla calls a “work week” — a collection of presentations, team meetings, and planning sessions. It ends with a party on Friday night, the theme of which is “lumberjack”. (Between the bears and the hipsters, don’t I already get enough of people dressing like lumberjacks back in Toronto?)

Because I’m a giant hypocrite, I’ll be flying back to Toronto. But I heartily recommend travel by Via Rail. Their rail passes (good for visiting multiple destinations, or doing a round trip) are an especially good deal.

I wonder what Kylie Minogue has to say about rail travel.

The Canadian, Day 4

The great thing about travelling by train is that there is almost zero environmental impact.

Before you say “but what about those giant diesel engines burning diesel”, let me explain that I’m talking about marginal impact. That is, this train was going to be running regardless of whether I was on it or not. By choosing to be on it, I’m not contributing any additional environmental impact.

That’s not true for a bus or plane. If 100 people suddenly decided to travel from Toronto to Vancouver by bus, the bus company would have to schedule another couple busses. If 100 extra people decided to travel by plane, the airline would have to schedule another plane. But if 100 people decided to all take the train, Via Rail can simply add another couple of cars to their existing scheduled train, with negligible environmental impact.

(Caveat: There are reports that Via Rail no longer does this, and Elizabeth May is not happy about it. See Point 1 in her letter.)

There are downsides too. It’s more expensive than travelling by plane, though still within my company’s travel policy. (Note that I’m travelling in Economy. Note also that if you’re a member of CAA or Hostelling International, you get 10% off Via Rail tickets.) It also takes more time, but it’s not unproductive time. You can get quite a lot done on the train.

The main downside is sleeping in a seat. I was fine for the first couple nights, but after that third night (last night), I’m starting to feel a bit rough. And I have one more night to go.

The cost of the Via Rail ticket includes one stopover in any city for as long as you want. If I had more time, I might have scheduled a stopover in Winnipeg or Jasper for a couple nights to recharge.

Update: Jasper
Apparently I slept past Saskatoon and Edmonton.

Jasper is beautiful, but you already knew that.

Update: Approaching Kamloops
I can’t help wishing I was seeing this scenery in winter instead of summer. Green mountains just look like big hills.

We’re back on tracks laid by the Canadian Northern Railway. It still boggles my mind that we’re following a path chosen by someone 100 years ago. I spend so much of my life surrounded by new technology, it’s strange to be using something created even just 3 or 4 generations ago.

The 1950s and 1960s saw the decline of passenger rail in Canada. It couldn’t compete with the rise of air travel and new highways like the Trans-Canada and the 401. The federal government created Via Rail in the 1970s to take over passenger operations from CN and CP, since those services were no longer profitable.

I’m not exactly sure why the government keeps Via Rail running, but I’m glad they do. Last time I travelled by air, security confiscated my penknife.

Musical Interlude IV
[This one would have made more sense last night. I might swap them in a later edit.]

The Canadian, Day 3

I expected to wake up in Winnipeg, but instead we’re stopped in the middle of nowhere. I can tell we’re in Manitoba, because the landscape has become flat and grassy instead of Ontario’s rocky forests and lakes, but I don’t see Winnipeg anywhere. Still no internet connection.

The train is moving again. We just passed a farm that seems to be raising cows and abandoned trucks.

Update: Winnipeg

Our train has a layover of about 3 hours in Winnipeg. The Via Rail station is located beside The Forks, a historic part of Winnipeg of significance to First Nations people. This site has been used as a meeting place for at least 6000 years. It’s especially active today because today is Aboriginal Day.

The site also hosts the Canadian Museum of Human Rights (which was still under construction last time I was here), a farmer’s market that is open today, and a couple historic rail cars. A pedestrian bridge (the Esplanade Riel) links it to Winnipeg’s French quarter, St Boniface, across the river. (Until 1971, St Boniface was its own predominantly-francophone city, a rarity in Western Canada.)

I didn’t spend much time looking about, however. I went straight to the Assiniboine Athletic Club, just two short blocks from the Via Rail station’s main entrance, to take a shower. This is the only opportunity to shower between Toronto and Vancouver. It’s $11 for a day pass to the gym. I would have had a workout too, but I didn’t pack any shorts.

Former rail car of the Temiskaming and Northern Ontario Railway

The rail car in the above picture used to belong to the Temiskaming and Northern Ontario Railway, later known as the Ontario Northland Railway. Owned by the Ontario government, it’s one of the few railways in Canada which isn’t privately owned. Unfortunately, the Ontario government shut down ONR’s passenger services last year. It now only provides freight service and occasional tourist service.

Update: Somewhere in Manitoba

Manitoba is more scenic than you’d think. We’re currently in a rather pleasant-looking valley.

We’re now on tracks that were built by Grand Trunk Pacific. These tracks go from Winnipeg to Prince Rupert in northern BC. Together, Grand Trunk Pacific (which operated west of Winnipeg) and National Transcontinental Railway (which operated east of Winnipeg) were the third and final transcontinental rail route across Canada.

Like most Canadian railways, they eventually went bankrupt and were nationalized by the Canadian government, becoming part of the government-owned Canadian National Railway. The only major railway to escape this fate was Canadian Pacific Railway, leading to today’s situation where CNR and CPR (usually now called CN and CP) dominate Canada’s rail industry as a duopoly. CN was privatized in 1995. Apparently the largest shareholder is now Bill Gates.

The GTP tracks are still well used. Prince Rupert is a popular port for shipments from China. Due to the shape of the Earth, Prince Rupert is closer to China than Vancouver is. It’s easier to see on a globe.

Musical Interlude III

The Canadian, Day 2

This is what I saw when I woke up.

attachment1

(Taken with my Firefox OS phone.)

Update: Sudbury
So, it took us 8 hours to reach Sudbury. Sudbury is only about a 3 and half hour drive by car on the highway. You could call this a leisurely pace.

We are now on tracks that were originally owned by the Canadian Northern Railway (not to be confused with the Northern Railway of Canada, whose tracks we were on before). These tracks stretch across the country from Vancouver to Quebec City. Canadian Nothern Railway was the second railway to provide transcontinental service across Canada (the first being Canadian Pacific Railway, who still dominate Canada’s rail industry today).

Now these tracks are owned by CNR (Canadian National Railway).

Update: Hornepayne

If you look at a population map of Canada, you’ll see there’s a mostly unpopulated gap between Sault Ste Marie and Kenora, separating western Canada from eastern Canada. I’m currently in that gap.

I didn’t realize it would be so hard to get internet here. I haven’t managed to get an internet connection since leaving Sudbury, and I suspect I won’t be able to until we’re near Thunder Bay.

I’m supposed to be working today, but lack of internet is really hindering what I can do. I spent a while glued to my phone, hoping to glimpse a bar of service or two, but eventually gave up. It’s frustrating, because I’m trying to submit a patch for review, and I was hoping to get it approved before this weekend. I don’t like my chances of getting it reviewed next week when everyone’s busy at Whistler.

After a while, I went to the observation car and stared out the window, which I found relaxing. There seem to be countless beautiful lakes and rivers up here. We never seem to be far from a watercourse. I wonder if these rivers were a route used by fur traders in the early days of Nouvelle France.

The train stopped a couple times to drop off canoeists. I suspect we’re in a part of the country only accessible by rail.

I saw a beaver! You know, I’ve been in Canada all these years, and I think that’s the first time I’ve actually seen a beaver.

Most of the signs on this train are embossed with Braille. The signs use uncontracted Braille. I notice that the English is prefixed with dot-6 (⠠) while the French is prefixed with dot-46 (⠨). I wonder if that’s a standard convention.

Update: Sioux Lookout

We’re near Thunder Bay. For about 15 minutes we had cell phone and internet coverage, and everyone was staring to their phones.

We’re now on tracks that were part of the National Transcontinental Railway, built in 1885. This route spanned from Winnipeg to Moncton. The section from Winnipeg to Quebec City is almost a straight line stretching across northern Ontario and northern Quebec. The idea was that it would provide the quickest route from the Prairies to the Maritimes by bypassing cities like Toronto, Ottawa, Montreal. The railway was never successful.

Most of the railway is now abandoned and owned by no one. Parts of it are owned by CNR, and we’re currently on one of those parts.

Many francophone towns in Northern Ontario, like Kapuskasing and Hearst, lie along the abandoned portion of this railway. Those towns are now serviced by Highway 17.

But there are no highways along this section. There’s barely a road in sight.

Musical Interlude II

The Canadian

I’m currently sitting aboard Via Rail’s The Canadian, waiting for the train to depart from Toronto’s Union Station. I’m on my way to a company-wide Mozilla event in Whistler.

This isn’t the first time I’ve travelled on Via Rail, but this is the first time I’ll be travelling all the way to Vancouver.

I’m thinking I might live blog it.

Update: Vaughan

The train is currently stationary, and I figure we’re probably in Vaughan, north of Toronto. We passed a sign saying “Parc Downsview Park” about 10 minutes ago, so that must mean we’re on Metrolinx’s Barrie Line (formerly the CNR Newmarket Sub).

According to Wikipedia, this is the oldest trackage in Ontario, dating back to 1835. This track was laid by the Northern Railway of Canada, which was acquired by Grand Trunk Railway in 1888, which was in turn absorbed into CNR in 1923. CNR sold this trackage to Metrolinx in 2009.

I guess we’re waiting for a freight train to pass. (They said that would happen a bit.)

Oh, we’re moving again.

Update: Richmond Hill

We just passed Richmond Hill Centre. I recognize it from the giant Cineplex.

I remember when there was a proposal to build a casino here. But that didn’t happen.

Occasionally you hear people complain that Richmond Hill (and nearby Markham) is too dominated by Chinese people. That’s obviously bullshit. If this place was dominated by Chinese people, there would have been a casino here years ago.

Update: ???

I have no idea where I am. It is completely dark outside and I can’t see a thing.

Update: Musical interlude

Which Unicode character should represent the English apostrophe? (And why the Unicode committee is very wrong.)

The Unicode committee is very clear that U+2019 (RIGHT SINGLE QUOTATION MARK) should represent the English apostrophe.

Section 6.2 of the Unicode Standard 7.0.0 states:

U+2019 […] is preferred where the character is to represent a punctuation mark, as for contractions: “We’ve been here before.”

This is very, very wrong. The character you should use to represent the English apostrophe is U+02BC (MODIFIER LETTER APOSTROPHE). I’m here to tell you why why.

Using U+2019 is inconsistent with the rest of the standard

Earlier in section 6.2, the standard explains the difference between punctuation marks and modifier letters:

Punctuation marks generally break words; modifier letters generally are considered part of a word.

Consider any English word with an apostrophe, e.g. “don’t”. The word “don’t” is a single word. It is not the word “don” juxtaposed against the word “t”. The apostrophe is part of the word, which, in Unicode-speak, means it’s a modifier letter, not a punctuation mark, regardless of what colloquial English calls it.

According to the Unicode character database, U+2019 is a punctuation mark (General Category = Pf), while U+02BC is a modifier letter (General Category = Lm). Since English apostrophes are part of the words they’re in, they are modifier letters, and hence should be represented by U+02BC, not U+2019.

(It would be different if we were talking about French. In French, I think it makes more sense to consider «L’Homme» as two words, or «jusqu’ici» as two words. But that’s a conversation for another time. Right now I’m talking about English.)

Using U+2019 breaks regular expressions

When doing word matching on Unicode text, programmers might reasonably assume they can detect “words” with the regex /\w+/ (which, in a Unicode context, matches characters with General Category L*, M*, N*, or Pc). This won’t actually work with English words that contain apostrophes if the apostrophes are represented as U+2019, but it will work if the apostrophes are represented as U+02BC.

To be fair, this problem exists in ASCII right now, where /\w+/ fails to match \x27 (the ASCII apostrophe). This leads to common bugs where users named O’Brien get told they can’t enter their name on a form, or where blog titles get auto-formatted as “Don’T Stop The Music”. Programmers soon learn they need to include the ASCII apostrophe in their regex as an exception.

But we shouldn’t be perpetuating this problem. When a programmer is writing a regex that can match text in Chinese, Arabic, or any other human language supported by Unicode, they shouldn’t have to add an exception for English. Furthermore, if apostrophes are represented as U+2019, the programmer would have to add both \x27 and \u2019 to their regex as exceptions.

The solution is to represent apostrophes as U+02BC, and let programmers simply write /\w+/ to match words like O’Brien and don’t.

[Edit: If you’re about to tell me that word segmentation should be done using UAX #29, guess what: Using U+2019 for the apostrophe breaks UAX #29! Using U+02BC would fix it. See comments. –Ed]

Using U+2019 means that Word Processors can’t distinguish between apostrophes and actual quotation marks, leading to a heap of problems.

How many times have you seen things like ‘Tis the Season or Up and at ‘em with the apostrophe curled the wrong way, because someone’s word processor mistook the apostrophe for an opening single quotation mark? Or, have you ever cut-and-pasted a block of text to use as a quote, put quotation marks around it, and then had to manually change all the nested quotation marks from double to single? Or maybe you’ve received text from the UK to use in your American presentation, but first you had to change all the quotation marks, because the UK prefers single-quotes while the US prefers double-quotes.

These are all things your word processor should be able to handle automatically and properly, but it can’t due to the ambiguity of whether a U+2019 character represents a single quotation mark or an apostrophe. We wouldn’t have these problems if apostrophes were represented by U+02BC. Allow me to explain.

1) In my perfect world, Word processors would automatically ensure that quotes (and nested quotes) were properly formatted according to your locale’s conventions, whether it be US, UK, or one of the myriad crazy quote conventions found across Europe. All quote marks would be reformatted on the fly according to their position in the paragraph, e.g. changing to (and vice versa) or to (and vice versa).

2) Right now, Microsoft word users use the " key to type a double-quote (either opening or closing) and the ' key to type either a single-quote (either opening or closing) or an apostrophe. In my perfect world, since the word processor could automatically convert between single- and double-quotes as needed, you’d only need one key (the " key) to type all quotation marks, and the ' key would be reserved exclusively for typing apostrophes. Therefore, the word processor would know that '-t-i-l means ʼtil, not ‘til.

But because U+2019 can represent either an apostrophe or single quote, it’s hard for Word Processors to do (1), which means they also can’t do (2).

So whenever you see ‘Til we meet again with the apostrophe curled the wrong way, remember that’s because of the Unicode committee telling you to use U+2019 for English apostrophes.

Common bloody sense

For godsake, apostrophes are not closing quotation marks!

U+2019 (RIGHT SINGLE QUOTATION MARK) is classed as a closing punctuation mark. Its general category is Pf, which is defined in section 4.5 of the standard as meaning:

Punctuation, final quote (may behave like Ps [opening punctuation] or Pe [closing punctuation] depending on usage [They’re talking about right-to-left text –Ed])

When you use U+2019 to represent an apostrophe, it’s behaving as neither a Ps or Pe. (Or, if it is, it’s an unbalanced one. As of Unicode 6.3, the Unicode bidi algorithm attempts to detect bracket pairs for bidi processing. They would be unable to do the same for quotation marks, due to all these unbalanced “quotation marks”.)

Compare that to U+02BC (MODIFIER LETTER APOSTROPHE) which has “apostrophe” right in its name. LOOK AT IT. RIGHT THERE, IT SAYS APOSTROPHE.

Which do you think make more sense for representing apostrophes?

C’mon, let’s fix this.