Which Unicode character should represent the English apostrophe? (And why the Unicode committee is very wrong.)

June 3, 2015 · June 11, 2015 · tedclancy · Uncategorized

The Unicode committee is very clear that U+2019 (RIGHT SINGLE QUOTATION MARK) should represent the English apostrophe.

Section 6.2 of the Unicode Standard 7.0.0 states:

U+2019 […] is preferred where the character is to represent a punctuation mark, as for contractions: “We’ve been here before.”

This is very, very wrong. The character you should use to represent the English apostrophe is U+02BC (MODIFIER LETTER APOSTROPHE). I’m here to tell you why.

Using U+2019 is inconsistent with the rest of the standard

Earlier in section 6.2, the standard explains the difference between punctuation marks and modifier letters:

Punctuation marks generally break words; modifier letters generally are considered part of a word.

Consider any English word with an apostrophe, e.g. “don’t”. The word “don’t” is a single word. It is not the word “don” juxtaposed against the word “t”. The apostrophe is part of the word, which, in Unicode-speak, means it’s a modifier letter, not a punctuation mark, regardless of what colloquial English calls it.

According to the Unicode character database, U+2019 is a punctuation mark (General Category = Pf), while U+02BC is a modifier letter (General Category = Lm). Since English apostrophes are part of the words they’re in, they are modifier letters, and hence should be represented by U+02BC, not U+2019.
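These properties are easy to check for yourself. Here is a minimal sketch using Python's standard `unicodedata` module (Python and the helper name `describe` are just choices for illustration; the reported values come from whatever version of the Unicode character database ships with your Python):

```python
import unicodedata

# The three characters that compete for the role of "apostrophe".
CANDIDATES = ["\u0027", "\u2019", "\u02bc"]

def describe(ch):
    """Return (code point, official name, General Category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

for ch in CANDIDATES:
    print(*describe(ch))
```

The second field of each row is the character's official name and the third is its General Category, so the punctuation-versus-letter split is visible at a glance.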

(It would be different if we were talking about French. In French, I think it makes more sense to consider «L’Homme» as two words, or «jusqu’ici» as two words. But that’s a conversation for another time. Right now I’m talking about English.)

Using U+2019 breaks regular expressions

When doing word matching on Unicode text, programmers might reasonably assume they can detect “words” with the regex /\w+/ (which, in a Unicode context, matches characters with General Category L*, M*, N*, or Pc). This won’t actually work with English words that contain apostrophes if the apostrophes are represented as U+2019, but it will work if the apostrophes are represented as U+02BC.
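To make that concrete, here is a small sketch in Python. Python's `\w` is not defined in exactly the terms given above, but for string patterns it does accept modifier letters and reject quotation marks, which is all this example relies on. The helper name `words` is made up for the illustration.

```python
import re

def words(text):
    """Split text into 'words' the naive way: maximal runs of word characters."""
    return re.findall(r"\w+", text)

print(words("don\u2019t stop"))  # apostrophe encoded as U+2019
print(words("don\u02bct stop"))  # apostrophe encoded as U+02BC
```

With U+2019 the contraction is cut into two pieces; with U+02BC it survives as one word.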

To be fair, this problem exists in ASCII right now, where /\w+/ fails to match \x27 (the ASCII apostrophe). This leads to common bugs where users named O’Brien get told they can’t enter their name on a form, or where blog titles get auto-formatted as “Don’T Stop The Music”. Programmers soon learn they need to include the ASCII apostrophe in their regex as an exception.
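The title-casing bug is easy to reproduce. The sketch below is a hypothetical auto-formatter, not the code of any particular blogging platform: it capitalizes the first character of every run of word characters. (Python's built-in `str.title()` is documented to have the same problem with the ASCII apostrophe.)

```python
import re

def title_case(text):
    """Capitalize the first character of each run of word characters."""
    return re.sub(
        r"\w+",
        lambda m: m.group(0)[0].upper() + m.group(0)[1:].lower(),
        text,
    )

print(title_case("don't stop the music"))        # ASCII apostrophe
print(title_case("don\u02bct stop the music"))   # U+02BC apostrophe
```

The ASCII version comes out with a capital T after the apostrophe, because the formatter believes a new word starts there. The U+02BC version is treated as one word and is capitalized correctly.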

But we shouldn’t be perpetuating this problem. When a programmer is writing a regex that can match text in Chinese, Arabic, or any other human language supported by Unicode, they shouldn’t have to add an exception for English. Furthermore, if apostrophes are represented as U+2019, the programmer would have to add both \x27 and \u2019 to their regex as exceptions.
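For illustration, this is roughly what the workaround looks like in Python (the pattern and helper name are my own sketch, not taken from any real codebase):

```python
import re

# Word characters, plus two English-specific exceptions bolted on:
# \x27 (ASCII apostrophe) and \u2019 (RIGHT SINGLE QUOTATION MARK).
WORD_WITH_EXCEPTIONS = re.compile(r"[\w\x27\u2019]+")

def words(text):
    """Find 'words', treating both apostrophe encodings as word characters."""
    return WORD_WITH_EXCEPTIONS.findall(text)

print(words("O\u2019Brien said don't"))
print(words("\u2018hello\u2019"))  # a genuinely quoted word
```

Note what happens on the second line: because U+2019 is also a real closing quotation mark, the closing quote gets glued onto the quoted word. The exception fixes O’Brien at the cost of mis-tokenizing quotations.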

The solution is to represent apostrophes as U+02BC, and let programmers simply write /\w+/ to match words like O’Brien and don’t.

[Edit: If you’re about to tell me that word segmentation should be done using UAX #29, guess what: Using U+2019 for the apostrophe breaks UAX #29! Using U+02BC would fix it. See comments. –Ed]

Using U+2019 means that Word Processors can’t distinguish between apostrophes and actual quotation marks, leading to a heap of problems.

How many times have you seen things like ‘Tis the Season or Up and at ‘em with the apostrophe curled the wrong way, because someone’s word processor mistook the apostrophe for an opening single quotation mark? Or, have you ever cut-and-pasted a block of text to use as a quote, put quotation marks around it, and then had to manually change all the nested quotation marks from double to single? Or maybe you’ve received text from the UK to use in your American presentation, but first you had to change all the quotation marks, because the UK prefers single-quotes while the US prefers double-quotes.

These are all things your word processor should be able to handle automatically and properly, but it can’t due to the ambiguity of whether a U+2019 character represents a single quotation mark or an apostrophe. We wouldn’t have these problems if apostrophes were represented by U+02BC. Allow me to explain.
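First, here is how the wrong-way curl comes about. The sketch below is a deliberately naive "smart quotes" heuristic written in Python. It is an assumption about the general approach, not the actual algorithm of any particular word processor: it decides which way to curl each ' by looking only at the character before it.

```python
OPEN_SINGLE = "\u2018"   # LEFT SINGLE QUOTATION MARK
CLOSE_SINGLE = "\u2019"  # RIGHT SINGLE QUOTATION MARK

def naive_smart_quotes(text):
    """Curl each ' based only on the preceding character (hypothetical heuristic)."""
    out = []
    for i, ch in enumerate(text):
        if ch == "'":
            prev = text[i - 1] if i > 0 else " "
            # Start of text, whitespace, or an opening bracket: guess "opening quote".
            opening = prev.isspace() or prev in "([{"
            out.append(OPEN_SINGLE if opening else CLOSE_SINGLE)
        else:
            out.append(ch)
    return "".join(out)

print(naive_smart_quotes("'Tis the season"))
print(naive_smart_quotes("Up and at 'em"))
print(naive_smart_quotes("don't"))
```

Word-internal apostrophes come out fine, but any apostrophe at the start of a word is indistinguishable from an opening quote and gets curled the wrong way.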

1) In my perfect world, word processors would automatically ensure that quotes (and nested quotes) were properly formatted according to your locale’s conventions, whether it be US, UK, or one of the myriad crazy quote conventions found across Europe. All quote marks would be reformatted on the fly according to their position in the paragraph, e.g. changing ‘ to “ (and vice versa) or ’ to ” (and vice versa).

2) Right now, Microsoft Word users use the " key to type a double-quote (either opening or closing) and the ' key to type either a single-quote (either opening or closing) or an apostrophe. In my perfect world, since the word processor could automatically convert between single- and double-quotes as needed, you’d only need one key (the " key) to type all quotation marks, and the ' key would be reserved exclusively for typing apostrophes. Therefore, the word processor would know that '-t-i-l means ʼtil, not ‘til.

But because U+2019 can represent either an apostrophe or single quote, it’s hard for word processors to do (1), which means they also can’t do (2).
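Points (1) and (2) can be sketched in a few lines of Python. Everything here is hypothetical: the function name `typeset`, the two-entry style table, and the rule for guessing whether a " opens or closes are all simplifications of my own, and real locale conventions are more varied than this.

```python
APOSTROPHE = "\u02bc"  # MODIFIER LETTER APOSTROPHE

# (opening, closing) marks by nesting level; levels alternate beyond the second.
QUOTE_STYLES = {
    "US": [("\u201c", "\u201d"), ("\u2018", "\u2019")],
    "UK": [("\u2018", "\u2019"), ("\u201c", "\u201d")],
}

def typeset(keys, locale="US"):
    """Turn apostrophe-key and quote-key presses into typeset characters."""
    pairs = QUOTE_STYLES[locale]
    out = []
    depth = 0            # how many quotations are currently open
    prev_opened = False  # did the previous keystroke produce an opening quote?
    for i, ch in enumerate(keys):
        opened = False
        if ch == "'":
            out.append(APOSTROPHE)  # the ' key only ever means apostrophe
        elif ch == '"':
            prev = keys[i - 1] if i > 0 else " "
            if prev.isspace() or prev in "([{" or prev_opened:
                out.append(pairs[depth % 2][0])  # opening mark for this level
                depth += 1
                opened = True
            else:
                depth = max(depth - 1, 0)
                out.append(pairs[depth % 2][1])  # closing mark for this level
        else:
            out.append(ch)
        prev_opened = opened
    return "".join(out)

print(typeset("\"'Tis the season,\" she said."))
print(typeset('He said "call it "done" now"', "US"))
print(typeset('He said "call it "done" now"', "UK"))
```

Because the ' key never has to be interpreted as a quotation mark, the word-initial apostrophe in ʼTis cannot be mistaken for an opening quote, and switching the nested quotation between US and UK style is just a table lookup.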

So whenever you see ‘Til we meet again with the apostrophe curled the wrong way, remember that’s because of the Unicode committee telling you to use U+2019 for English apostrophes.

Common bloody sense

For godsake, apostrophes are not closing quotation marks!

U+2019 (RIGHT SINGLE QUOTATION MARK) is classed as a closing punctuation mark. Its general category is Pf, which is defined in section 4.5 of the standard as meaning:

Punctuation, final quote (may behave like Ps [opening punctuation] or Pe [closing punctuation] depending on usage [They’re talking about right-to-left text –Ed])

When you use U+2019 to represent an apostrophe, it’s behaving as neither a Ps nor a Pe. (Or, if it is, it’s an unbalanced one. As of Unicode 6.3, the Unicode bidi algorithm attempts to detect bracket pairs for bidi processing. They would be unable to do the same for quotation marks, due to all these unbalanced “quotation marks”.)
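To show what "unbalanced" means in practice, here is a toy stack-based matcher in Python. It is only loosely inspired by bracket pairing and is not the Unicode bidi algorithm; the function name is invented for this sketch.

```python
OPEN_SINGLE = "\u2018"
CLOSE_SINGLE = "\u2019"

def pair_single_quotes(text):
    """Pair single quotation marks with a stack.

    Returns (pairs, unmatched): index pairs of matched marks, and the
    indices of closing marks that had no opener.
    """
    stack, pairs, unmatched = [], [], []
    for i, ch in enumerate(text):
        if ch == OPEN_SINGLE:
            stack.append(i)
        elif ch == CLOSE_SINGLE:
            if stack:
                pairs.append((stack.pop(), i))
            else:
                unmatched.append(i)
    return pairs, unmatched

# Apostrophe encoded as U+2019: it is paired with the opening quote by mistake.
print(pair_single_quotes("\u2018Don\u2019t go,\u2019 she said."))
# Apostrophe encoded as U+02BC: the quotation marks pair up as intended.
print(pair_single_quotes("\u2018Don\u02bct go,\u2019 she said."))
```

In the first case the apostrophe in "Don’t" steals the opening quote and the real closing quote is left dangling. In the second case the apostrophe is simply not a quotation mark, so the matcher never sees it.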

Compare that to U+02BC (MODIFIER LETTER APOSTROPHE) which has “apostrophe” right in its name. LOOK AT IT. RIGHT THERE, IT SAYS APOSTROPHE.

Which do you think makes more sense for representing apostrophes?

C’mon, let’s fix this.

46 thoughts on “Which Unicode character should represent the English apostrophe? (And why the Unicode committee is very wrong.)”

karl says:

June 3, 2015 at 7:48 am

About French. 🙂 I’m curious now.

because as you are writing for l’homme, [l’amour] (the love) is indeed the contraction of two words le + amour, but [aujourd’hui] (today) or [d’abord] (first of all) or [d’ailleurs] (btw, for that matter) is one word.

- tedclancy says:
  
  June 4, 2015 at 10:18 pm
  
  The question of “Which Unicode character should represent the French apostrophe?” is a little trickier, because there’s no perfect answer.
  
  First there’s the question of whether the French apostrophe should be considered something that separates words (punctuation mark), or something that is part of words (modifier letter). As you point out, it can be either. If forced to choose one, though, I think punctuation mark makes more sense, since that’s the usual case in French, with d’abord, d’ailleurs, and aujourd’hui being exceptions that obviously derive from the usual case. (That is, «d’abord» can be recognized as «d’» + «abord». Same for d’ailleurs. Aujourd’hui is possibly less obvious than the others because the word hui is now archaic in French, but until recently, aujourd’hui was spelt «au-jour-d’hui».)
  
  However, we also have to realize that there are limited options in Unicode.
  
  I definitely think that U+2019 (RIGHT SINGLE QUOTATION MARK) is the wrong character, for the reasons I gave above.
  
  I think U+0027 (APOSTROPHE) is the semantically correct choice, since it has the right General Category (Po, meaning “Punctuation, other”) and is named “apostrophe”. However, there’s the practical problem that in many fonts, it appears as a vertical straight glyph, and not the curled glyph desired for an apostrophe. As far as I’m aware, there’s nothing in the Unicode standard that requires this (there’s a note saying that this character has a straight glyph, but it is non-normative), and as a presentation issue, perhaps it’s outside of Unicode’s scope. But nevertheless, it’s a practical problem.
  
  There’s also the issue that this character has a history of being used as a quote character in ASCII. (Unicode says that this character can be called “apostrophe-quote”, but that alias is also non-normative.)
  
  Because of these problems, perhaps U+02BC (MODIFIER LETTER APOSTROPHE) would be the best choice for the French apostrophe. It’s a compromise, though. At least it has the benefit of being the same character as the one that should be used in English.
  
- mjfgates says:
  
  June 8, 2015 at 8:43 pm
  
  The “correct” way to handle that would be to use U+02BC in cases where the apostrophe is in the middle of a single word, and U+0027 in cases where the apostrophe divides one word from another. All of which must absolutely thrill the guys working on French auto-correct features.
  
- Hervé Pfeiffer says:
  
  August 7, 2015 at 7:30 am
  
  Not really no. [D’abord] is the contraction of de+abord, [d’ailleurs] is de+ailleurs, and [aujourd’hui] is a weird pleonasm: it’s a mashup of au+jour+de+hui (‘hui’ being an old form of the word ‘day’), meaning “at the day of today!”
  
  Very interesting article. I may set a find/change request in InDesign just for that matter!
  
Josh says:

June 3, 2015 at 8:49 pm

You are obviously correct.

Unicode Tees says:

June 3, 2015 at 10:01 pm

This is a very reasoned argument, with a number of current use cases as proof. Well done. 🙂

Victor Tramp says:

June 3, 2015 at 10:10 pm

but how do i fix this, I’m just one person? =(

- tedclancy says:
  
  June 4, 2015 at 10:32 pm
  
  Let’s write a letter to Michael Everson.
  
  - Ben says:
    
    June 11, 2015 at 9:50 pm
    
    I can’t help but think that he *must* have thought of this already, but at the same time, this makes a hell of a lot of sense. This is the first time I’ve read a blog entry criticizing an aspect of Unicode where my response was “Actually, that makes a whole lot of sense” instead of “You’re an idiot and have no idea how Unicode works”.
    
    I’m actually very curious as to what Michael Everson’s response would be.
    
  - tedclancy says:
    
    June 11, 2015 at 10:49 pm
    
    I tweeted this post to Michael Everson, but he didn’t reply 😦
    
Greg says:

June 4, 2015 at 5:08 am

Can’t agree more, and I’ve updated my AutoHotKey file to reflect acknowledgement of this. So now those words with apostrophes in them are automatically corrected on the fly.

m50d says:

June 4, 2015 at 9:02 am

Yeah! Let’s fix this!

How? Is this post aimed solely at word-processor developers? Are you submitting a proposal to the unicode consortium? I agree with you, so what’s the next step?

- tedclancy says:
  
  June 4, 2015 at 8:52 pm
  
  Simple. We form a people’s revolutionary proletariat army, seize control of the state, and legislate a new version of Unicode.
  
  Who’s with me?
  
drhyde says:

June 4, 2015 at 12:09 pm

You think that a committee that takes “pile of poo” and “snowman” seriously will give a shit?

Arthur Breitman says:

June 4, 2015 at 7:11 pm

While we’re at it, the convention that “don’t” is a single word is also stupid. “don’t” is a contraction of “do” and “not” which is written and pronounced in a particular way.

Words are first and foremost semantic tokens, not strings of letters. When you read “hipopotamus”, you parse it as {<hippopotamus> with a misspelling}, and not as {some unknown word}.

So whenever the language produces the tokens <do> and <not> in succession, their graphical and phonetic representation can be changed, but “don’t” isn’t a word.

- tedclancy says:
  
  June 4, 2015 at 8:50 pm
  
  “don’t” is a contraction of “do” and “not” which is written and pronounced in a particular way. […] “don’t” isn’t a word
  
  I disagree strongly. I think the quickest way to demonstrate your error is to note that “Don’t you mind?” is a valid English sentence, but *“Do not you mind?” is not. (The correct word order would be “Do you not mind?”.) That means “don’t” is not a literal abbreviation of “do not”, despite its etymology.
  
  For a longer answer, I refer you to the field of linguistics, which has no patience for this kind of nonsense.
  
  There’s a pervasive myth amongst bad school teachers that contractions are “incorrect” English, in which case every great English writer from Shakespeare to Atwood (and even the people who peddle this myth themselves) has been writing incorrect English. I suspect you might have been misled by such myths.
  
  In any case, it’s irrelevant to the discussion on Unicode, since (a) not all words with apostrophes are contractions of multiple words (my favourite example is the word “fo’c’sle”); and (b) Unicode should be able to represent informal English.
  
  - chlewey says:
    
    June 5, 2015 at 1:30 pm
    
    The apostrophe, both in English and French (@karl) denotes a contraction, not a contraction of words but a contraction inside a word. The ʼ in “donʼt” is not a punctuation that separates “don” from “t” but a symbol that replaces the omitted ‘o’ in the affix “-not”; in French “lʼhomme”, it is a mark of the omitted soft ‘e’ in “le”, as well as marking that /lom/ is phonetically and prosodically one unit rather than two lexemes.
    
    The closing single quotation is definitively not the right symbol for the apostrophe in either English or French. The ASCII apostrophe (U+0027) is (and should be) ambiguous (I both code and typeset, and when coding the ASCII apostrophe is definitively a punctuation symbol in virtually any coding language, except TeX which has its own idiosyncrasies), so the ASCII apostrophe is probably not the best option for a semantically typeset apostrophe.[*]
    
    Iʼm not completely convinced that U+02BC is the best solution, but it seems the best compromise.
    
    [*] For most practical cases, however, the ASCII apostrophe is good enough from a semantic point of view as long as there is a clear context that we are typesetting text rather than coding.
    
  - anon says:
    
    June 14, 2015 at 3:32 pm
    
    Consider any English word with an apostrophe, e.g. “don’t”. The word “don’t” is a single word. It is not the word “don” juxtaposed against the word “t”. The apostrophe is part of the word, which, in Unicode-speak, means it’s a modifier letter, not a punctuation mark, regardless of what colloquial English calls it.
    
    According to the Unicode character database, U+2019 is a punctuation mark (General Category = Pf), while U+02BC is a modifier letter (General Category = Lm). Since English apostrophes are part of the words they’re in, they are modifier letters, and hence should be represented by U+02BC, not U+2019.
    
    (It would be different if we were talking about French. In French, I think it makes more sense to consider «L’Homme» as two words, or «jusqu’ici» as two words. But that’s a conversation for another time. Right now I’m talking about English.)
    
    You’re making a real effort to gloss over a very common way to use apostrophes in English. We use them to separate clitics from the words they attach to exactly the same way French does. (And the line you open with, “consider any English word with an apostrophe“, is poorly chosen rhetoric when it’s clear that you haven’t put much thought into this.)
    
    So consider the following sentences, each containing a “word” of the category “any English word with an apostrophe”:
    
    1. I’m on my way. (clitic am)
    
    2. That man’s pants are on fire! (english “genitive” marker)
    
    3. That man’s about to jump! (clitic is)
    
    4. My professor’s already looked it over. (clitic has)
    
    You could make a case that “I’m” is an ossified relic of a time when it was possible to contract I and am into one syllable. But there’s no case that clitic is and has are fossils, because they can freely attach to any modern English noun phrase. (The same is true of the ‘s genitive marker, but the convention in linguistics is to consider it a “phrasal case marker” and not an independent word regardless of that.)
    
    In general, it’s very poor rhetorical form to point out a gaping hole in your argument and then say “but I’m not going to consider this, because it’s very common in French and I want to talk about English”. It’s very common in English too. You need to consider it.
    
    Use U+0027 for English apostrophes. ;p
    
  - tedclancy says:
    
    June 15, 2015 at 5:16 am
    
    You’re making a real effort to gloss over a very common way to use apostrophes in English. … it’s clear that you haven’t put much thought into this.
    
    Which is it? Am I making an effort to gloss over it, or have I not thought of it? Am I being intellectually dishonest, or am I stupid? Please be consistent with your insults, douchebag.
    
    We use them to separate clitics from the words they attach to
    
    I disagree. I think the apostrophe is part of the clitic. e.g. The clitic is ʼs, not simply s. (At least, that’s how I’ve usually heard it described, and it’s necessary if you consider something like n’t to be a clitic.)
    
    In any case, it doesn’t matter. What really matters is: What constitutes a word?
    
    Let me be clear: For Unicode’s purposes, what matters is the orthographic word, which is largely shaped by spelling conventions. Firstly, Unicode is all about orthography. Secondly, the orthographic word corresponds to what a typical computer user thinks of as a word. When the user double-clicks on text in a document, what do they expect to light up? You’ll find it’s the orthographic word.
    
    In an alphabetic language like English, an orthographic word goes from one “word separator” (space or punctuation) to another. It is my strong belief that readers and writers of English do not consider the apostrophe to be a word separator. That is, they see boy’s in “The boy’s ball” as a single word. Unicode agrees with me here, because under UAX #29, boy’s does not contain a word break. (However, UAX #29 does not treat boys’ as a single word in “The boys’ ball” when the apostrophe is represented by U+0027 or U+2019, which is a problem I’m proposing to fix.)
    
    Similarly, most readers and writers of Chinese consider “蝴蝶” to be two words, based on the orthography, even though it is a single morpheme. Again, UAX #29 appropriately treats it as two words, based on the orthography.
    
    For the record, I recognize that ʼs can occur after almost every word in English (“not only nouns and pronouns, but also verbs and prepositions, and adjectives, adverbs, foreign words quoted and even animal noises”), and so a spell-checker might need to handle it specially. But that doesn’t make ʼs or s a word.
    
    The reason I mention French as a separate case is because I think readers and writers of French are more likely to see the apostrophe as a word boundary (which it arguably is in French). I’m aware that French speakers are influential over a number of Engineering standards, through organizations like ISO and ITU. I don’t know how much influence they have over Unicode, but I did wonder if French ideas about the apostrophe had incorrectly influenced the choice of U+2019 for the English apostrophe.
    
  - anon says:
    
    June 15, 2015 at 8:15 am
    
    OK. You’re being dishonest. There is no difference between the use of an apostrophe in “l’homme” and the use of an apostrophe in “professor’s”, where the latter token includes a clitic has. If the apostrophe is a word separator in French, it’s one in English too.
    
    If what you really mean is that English should be encoded differently from French because the English-writing audience generally holds less accurate beliefs about their language than does the French-writing audience, say that in your post. You say French is different because it makes “more sense” to consider French l’homme as two words (than to do something unspecified). From context, I have to conclude that you’re saying it makes more sense to consider French l’homme as two words than to consider English professor’s as two words. This is (a) without basis and (b) completely irrelevant to your new goal of reflecting what French-writers think about their own language. (You might instead be saying that it makes more sense to view French l’homme as two words than to view English don’t as two words. That would be correct, but it would also be an exercise in extreme dishonesty.)
    
    To the Chinese example, I would offer two points:
    
    1. China wages a constant misinformation campaign about written Chinese, and a lot of people who should know better will tell you that Chinese 方言 are, when written down, identical to (written) Mandarin. This is not a position I would care to defend in any context.
    
    2. Nobody’s ever had any problem explaining to me that two or more characters form a single word. In fact, they do it all the time. 虽然蝴蝶是两个字，但是是一个词。
    
    Written English does not draw the distinction between word-internal apostrophes and “French standard apostrophes” that you’re trying to advocate for. It’s not a good idea to try and force the distinction back in at a later point. It’s also not a good idea, as pointed out in another comment, to try to get French people to type one kind of apostrophe in l’homme and another in aujourd’hui.
    
    There’s a reason “opening-style” and “closing-style” apostrophe- and quotation-like marks are coded by different characters in TeX. Instructing a computer to recognize when it should use which falls not far short of getting the computer to understand written English. Look at what happened in my earlier comment, where I write “consider any English word with an apostrophe” — the quotation mark after “apostrophe” is an opening one. Hard to blame that on miscoded apostrophes, since it’s a double quote. So yes, having a special character for non-quotative apostrophes will fix the problem with non-quotative apostrophes curling the wrong way… but as a solution to the problem of getting marks to curl the right way, it’s misaimed.
    
    Finally, you veered into a longstanding peeve of mine:
    
    Let me be clear: For Unicode’s purposes, what matters is the orthographic word, which is largely shaped by spelling conventions. Firstly, Unicode is all about orthography.
    
    If Unicode were all about orthography, there wouldn’t be separate code points for Greek capital letter omega and the Ohm sign. The Ohm sign is, by definition, a capital Greek letter omega. There can never be an orthographic difference. But somehow the Ohm sign has a dedicated code point while seconds make do with U+0073 “latin small letter s”.
    
  - tedclancy says:
    
    June 15, 2015 at 2:47 pm
    
    Here are some differences between how the apostrophe is used in English and how the apostrophe is used in French:
    
    1) In French, apostrophes only occur at morpheme boundaries. In English, apostrophes can occur within a morpheme.
    
    2) When French clitics contain an apostrophe (l’, d’, s’, m’, etc.), they are recognizable variants of words that can stand on their own. That is not always true in English, where the possessive ʼs is neither a word nor a recognizable variant of any word. (This is the difference between a special clitic and a simple clitic.)
    
    3) The letter after an apostrophe in French is sometimes capitalized (e.g. L’Homme). This doesn’t happen in English, except for in proper names like O’Brien.
    
    The bottom line is that French speakers are likely to view l’homme as two words. (And even if they don’t, I don’t care. It’s irrelevant to the discussion of the English apostrophe.)
    
  - tros says:
    
    August 6, 2015 at 8:59 am
    
    [quote]2) When French clitics contain an apostrophe (l’, d’, s’, m’, etc.), they are recognizable variants of words that can stand on their own. That is not always true in English, where the possessive ʼs is neither a word nor a recognizable variant of any word. (This is the difference between a special clitic and a simple clitic.)[/quote]
    The possessive ‘s in English is the contraction of ‘his’ which in turn lost any gender (so her would also become ‘s), that is my view. So that’s still a separate word.
    
  - tedclancy says:
    
    August 11, 2015 at 2:37 pm
    
    > The possessive ‘s in English is the contraction of ‘his’ which in turn lost any gender
    
    That’s actually a myth.
    
    https://en.wikipedia.org/wiki/English_possessive#History
    
moyogo says:

June 4, 2015 at 7:23 pm

To put this in context: Before Unicode 2.1, the modifier letter apostrophe U+02BC was the preferred character for the “apostrophe”. This was corrected in Unicode 2.1, see http://www.unicode.org/reports/tr8/tr8-3.html
You might also want to look at http://unicode.org/L2/L1998/98053.pdf and http://www.unicode.org/L2/L1999/n2043.pdf

- tedclancy says:
  
  June 4, 2015 at 8:37 pm
  
  Thanks for the links.
  
  It’s interesting that none of those documents explain why the change from U+02BC to U+2019 was made, except for a vague reference to “mapping problems” from Windows and Mac character sets.
  
  I find it telling that the document at your third link contains the statement “The semantics of U+2019 are therefore context dependent. If surrounded by text, it behaves as an in text punctuation character (does not separate words or lines). If bordered by space on one side, it is a quotation character.” It makes me wonder if this whole mess was caused by people who don’t realize that English words can start and end with apostrophes.
  
  - Michael Douglas says:
    
    June 10, 2015 at 6:05 am
    
    Well, this whole mess was caused by people who try to make things work and compatible with legacy charsets. It is not an easy task or one that’s devoid of compromises.
    
Jaquez says:

June 4, 2015 at 8:14 pm

In addition to the above, it would also help real-time spell-checkers not to highlight, for example, {isn’} as a misspelling in the middle of typing the word {isn’t} since it would know not to count the apostrophe as a word boundary, triggering the spell check for that word. Admittedly a very small annoyance, but one that computers should be able to eliminate trivially.

- tedclancy says:
  
  June 4, 2015 at 8:28 pm
  
  Good point!
  
concerted internet anonymous says:

June 6, 2015 at 7:49 pm

It’s a modifier symbol, but the apostrophe in “don’t” has nothing to do with “n”. Moreover in “someone’s”, it relates to s (“’s” is clitic).

U+02BC is there for different cases.

- tedclancy says:
  
  June 8, 2015 at 5:28 am
  
  I believe you’re saying that U+02BC is inappropriate for representing the English apostrophe because U+02BC is a modifier letter while the English apostrophe is not a modifier.
  
  I would agree with you, except that the Unicode committee seems liberal in what it calls a “modifier letter”. U+02BE (MODIFIER LETTER RIGHT HALF RING) and U+02BF (MODIFIER LETTER LEFT HALF RING) are both classed as modifier letters, even though they are not modifiers. (Their primary purpose is for transliterating consonants from Hebrew and Arabic.) There’s also a note on U+02BC itself saying “many languages use this as a letter of their alphabets”, which seems to suggest it’s not always a modifier. CJK iteration marks are also classified as modifier letters.
  
  It’s as if the Unicode committee uses “modifier letter” as a fallback category for things that behave like letters but which aren’t traditionally considered to be part of their writing system’s alphabet (or syllabary, or collection of ideograms).
  
  Maybe Unicode needs a new category (“Lp”?) for letters that look like punctuation marks but behave like letters? Until that time, I continue to think that U+02BC is the best choice for representing the English apostrophe. (And is definitely better than U+2019.)
  
Bill Stewart says:

June 8, 2015 at 9:23 pm

The best thing about this problem is that if the people who keep wanting to “fix” single and double quote marks to be “smarter” and use the “right” characters can’t figure out how to do it right, they won’t do it as often. I can recognize Mac users by the former quote marks rendered as random-seeming non-ASCII characters in my browser, and recognize Unicode users by the blocks with four hex characters for the same reason.
Almost anybody who can geek about Unicode is also a computer programmer, and when I’m writing examples of code in a document, I *really* don’t want y’all “fixin'” it by converting the correct characters to incorrect ones, *especially* switching single and double quotes around in ways that might not even correctly `correct’ English, much less shell scripts.

tedclancy says:

June 9, 2015 at 9:50 pm

Hi. I see that my post has generated a little discussion. I’d like to address a few criticisms I’ve seen on the internet.

== UAX #29 ==

A couple people pointed out that /\w+/ is a naive way to detect words. One said that it would be better to detect word boundaries as described in UAX #29. Another said that Perl 5.22 (which was released just last week!) adds support for detecting word boundaries using the syntax \b{wb}.

I’m aware of UAX #29, but I’m also aware that it’s not convenient to use for many programmers. I did not know about Perl’s new \b{wb}, but I’m happy to hear about it. (I’m a fan of Perl.) I hope other languages adopt it too. I should note that Perl’s \b{wb} is an implementation of UAX #29.

However, even UAX #29 can’t handle apostrophes correctly at the start or end of a word if they are represented by U+0027 or U+2019. (It erroneously thinks there’s a word boundary between the apostrophe and the rest of the word.) But if you represent your apostrophes using U+02BC, then UAX #29 handles them correctly.
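Python's standard library has no UAX #29 implementation, so the sketch below is only a toy approximation of the handful of rules that matter here (roughly WB5 through WB7: letters stick together, and a mid-word apostrophe only joins two letters if it has a letter on both sides). It is not real UAX #29 segmentation, and the names are made up for the illustration.

```python
import re

# Letters, optionally joined by ONE apostrophe-like mark (U+0027 or U+2019)
# that has a letter on both sides. U+02BC needs no special case: it is a letter.
TOY_WORD = re.compile(r"[^\W\d_]+(?:[\x27\u2019][^\W\d_]+)*")

def toy_words(text):
    """Very rough stand-in for UAX #29 word segmentation (letters only)."""
    return TOY_WORD.findall(text)

print(toy_words("The boy\u2019s ball"))   # word-internal U+2019
print(toy_words("The boys\u2019 ball"))   # word-final U+2019
print(toy_words("The boys\u02bc ball"))   # word-final U+02BC
print(toy_words("\u2019Tis"), toy_words("\u02bcTis"))
```

The word-internal apostrophe is kept either way. The word-initial and word-final apostrophes are only kept when they are encoded as U+02BC.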

So I think this one is a point in my favour.

== Is detection of quotation mark pairs possible? ==

I mentioned above that Unicode 6.3 detects bracket pairs for bidi processing, and said that the usage of U+2019 for apostrophes makes it impossible to do the same for quotation marks.

Someone commented that detecting pairs of quotation marks would be impossible anyway, because when quotations span multiple paragraphs, it’s usual to put an opening quotation mark at the start of each paragraph, but only the final paragraph gets a closing quotation mark.

That’s not actually a problem. The Unicode bracket matching algorithm works paragraph-by-paragraph (no state is carried over between paragraphs), and extra opening brackets at the outermost level don’t affect how inner brackets are paired. It would be the same for quotation marks, presumably.

Similarly, any word processor that detects quotation mark pairs (whether for automatically localizing quotation marks or for some other purpose) would have to work paragraph-by-paragraph, with unmatched opening quotation marks at the start of the paragraph not affecting how inner quotation marks are paired.

== Would this be confusing for the user? ==

There was some talk about how a user using a word processing application could differentiate a closing quotation mark from an apostrophe.

I’m hesitant to get into UI design, but there are many ways this could be accomplished. For a start, there’s the spell checker, which should flag an error if the word “don’t” contains a closing quotation mark by accident, or if an apostrophe was accidentally put at the end of a quotation.

But in my ideal future where word processors assist users with the correct use of quotation marks, I imagine that closing single quotation marks would be automatically matched with their corresponding opening single quotation marks in some way that was visible to the user. I imagine that would be sufficient to distinguish them from apostrophes.

(Unmatched closing single quotation marks should be flagged as an error to the user. Though unmatched opening quotation marks are customary when a quotation spans multiple paragraphs, unmatched closing quotation marks are always an error.)

I’d like to reiterate that I believe the apostrophe should be typed with the ' key (the one we call &apos; in HTML) and all quotation marks should be typed with the " key (the one we call &quot; in HTML). Although I again hesitate to get involved in UI design, anything else will make the HTML entity names not match, and that would really bug me.

wlindley says:

June 9, 2015 at 10:49 pm

English words that end with apostrophes‽ To paraphrase Mark Twain, ainʼ nothinʼ wrong witʼ it.

- sirriccoukRic says:
  
  July 7, 2015 at 2:38 pm
  
  Nice interrobang!
  
Everyone says:

June 14, 2015 at 4:20 pm

Who cares?

- tedclancy says:
  
  June 15, 2015 at 1:34 am
  
  Me.
  
  - Piotr Gabryjeluk says:
    
    June 17, 2015 at 7:37 pm
    
    The amount of all the teams’ work required to do this is outrageous!
    
jdlh says:

June 14, 2015 at 11:19 pm

Very interesting proposition! I think the Unicode Standard has an important role to play in facilitating textual communication, and I’m glad to see you take it seriously. Many posts of the form “Unicode got X wrong” or “Unicode should do Y” don’t take facilitation of communication seriously. (I’m looking at you, taco emoji partisans.)

I think your disagreement is not with the Unicode Technical Committee’s decisions about U+02BC versus U+2019. It is with the UTC’s understanding of English language orthography. Is the apostrophe used for contractions within a word (e.g. “we’ve”) a letter, or is it punctuation, or is it something else?

You motivated me to write several more paragraphs, and rather than take up room here, I wrote a blog post: ‘’tain’t right, says he’: storm in apostrophe.

So while I don’t share your apparent indignation, or even your conviction that the Unicode Technical Committee got it wrong, I am impressed by your willingness to engage with the Unicode Standard at a fairly sophisticated level. We need more people doing this. Thank you.

Seumas McLeod says:

June 15, 2015 at 1:36 am

Even if “don’t” were to be considered two words, the words are definitely _not_ “don” and “t” (the words are “do” and “n’t”), so I guess using U+2019 would still be wrong.

Bruce Lawson’s personal site : Reading List says:

June 19, 2015 at 5:01 pm

[…] Which Unicode character should represent the English apostrophe? (And why the Unicode committee is v… – and you thought I bang on about niche stuff. […]

Linguist says:

October 31, 2015 at 12:32 pm

I think the main problem would be that although U+02BC has existed in Unicode since version 1.1, not many fonts contain this character. Neither “Windows Glyph List 4 (WGL)” nor the OpenType “World glyph set” contains this character. Microsoft began to include the character in its fonts only recently, if I’m not wrong, since Windows Vista or 7. Yet many if not most Windows fonts (even very basic ones like Candara or Corbel) still lack it. The giants of the typographic business like Linotype/Monotype do not include this character in their fonts by default, not to mention much smaller companies or private designers. I believe all of this was behind Unicode’s decision to favour U+2019. They wanted backward compatibility and they knew established typographic business standards. If you choose to use U+02BC everywhere, then end users would likely see some sort of replacement character (square □ or ?).

See:
http://www.fileformat.info/info/unicode/char/02bc/fontsupport.htm
http://www.linotype.com/1697-21120/opentype-character-sets-opentype-std.html
http://www.linotype.com/5801/european-ot-character-set-w1g.html

Linguist says:

October 31, 2015 at 12:44 pm

I believe they (Unicode) included U+02BC only for some more or less exotic usage like denoting ejective consonants in phonetic transcriptions. Note that in such usage U+02BC *is* indeed a letter modifier: writing it after, say, /k/ does modify the letter, so you get /kʼ/, that is, an ejective /k/. Writing /k’/ (with U+2019) would be a technically wrong transcription. True letter modifiers work in the same manner: /ʰ/ (aspiration), /ʲ/ (palatalization), /ʷ/ (labialization) and so on. I think the main confusion came from assigning the name MODIFIER LETTER APOSTROPHE to U+02BC. A less confusing name would be MODIFIER LETTER EJECTIVE or something like that.

So U+02BC is a letter modifier for consonants in the first place. Second, it is used as an *independent* letter in some orthographies like Navajo or Nenets. Better not to use it in place of the apostrophe. In “don’t” the letter “n” is not ejective, nor is the apostrophe a letter here.

The real solution would be to create fonts in which U+0027 looks like a curly apostrophe. And yes, not to use single quotes at all. I dislike this Britishism very much, though I’m more inclined to use British spelling.

unbob says:

September 6, 2016 at 2:27 pm

Wow. Fascinating discussion. (No, really. I like this kinda stuff).

The takeaway for me is: if you’re a standards body wading into a space with a lotta existing implementations, you’re probably gonna break some things and annoy some people, no matter what you choose. It comes down to what you’re going to break and who you’re going to annoy. Sometimes hindsight will reveal you could’ve made better choices. Sometimes. . .it’s not so clear.

Wolf-Dieter says:

December 15, 2016 at 8:48 pm

Thank you thank you thank you! This is what I (almost desperately) looked for!

Brian Tristam Williams says:

December 22, 2020 at 7:56 am

Ted, I agree 100%, but now Iʼm sharing it to my blog, and Iʼve just noticed that your blogʼs title needs an amendment. It should be Tedʼs Blog, not Ted’s Blog.

- tedclancy says:
  
  December 29, 2020 at 6:30 am
  
  Done!
  