From: Chris DeSalvo <[email protected]>
Subject: Why we can't process Emoji anymore
Date: Thu, 12 Jan 2012 18:49:20 -0800
Message-Id: <[email protected]>

If you are not interested in the technical details of why Emoji
currently do not work in our iOS client, you can stop reading now.

Many, many years ago a Japanese cell phone carrier called SoftBank
came up with the idea for emoji and built it into the cell phones that
it sold for its network. The problem they had was in deciding how to
represent the characters in electronic form. They decided to use
Unicode code points in the private use areas. This is a perfectly
valid thing to do as long as your data stays completely within your
product. However, with text messages the data has to interoperate
with other carriers' phones.

Unfortunately SoftBank decided to copyright their entire set of
images, their encoding, etc etc etc and refused to license them to
anyone. So, when NTT and KDDI (two other Japanese carriers) decided
that they wanted emoji they had to do their own implementations. To
make things even more sad, they decided not to work with each other
and gang up on SoftBank. So, in Japan, there were three competing
emoji standards that did not interoperate.

In 2008 Apple released iOS 2.2 and added support for the SoftBank
implementation of emoji. Since SoftBank would not license their emoji
out for use on networks other than their own, Apple agreed to only
make the emoji keyboard visible on iPhones that were on the SoftBank
network. That's why you used to have to run an ad-ware app to make
that keyboard visible.

In October 2010 the Unicode consortium released version 6.0 of the
Unicode standard. (In case anyone cares, Unicode originated in 1987
as a joint research project between Xerox and Apple.) The smart
Unicode folks added all of emoji (about 740 glyphs) plus the new
Indian Rupee sign, more symbols needed for several African languages,
and hundreds more CJK symbols for, well, Chinese/Japanese/Korean (CJK
also covers Vietnamese, but now, like then, nobody gives Vietnam any
credit).

With iOS 5.0 Apple (wisely) decided to adopt Unicode 6.0. The emoji
keyboard was made available to all users and generates code points
from their new Unicode 6.0 locations. Apple also added this support
to OS X Lion.

You may be asking, "So this all sounds great. Why can't I type a
smiley in Voxer and have the damn thing show up?" Glad you asked.
Consider the following glyph:

😄
SMILING FACE WITH OPEN MOUTH AND SMILING EYES
Unicode: U+1F604 (U+D83D U+DE04), UTF-8: F0 9F 98 84

You can get this info for any character that OS X can render by
bringing up the Character Viewer panel and right-clicking on a glyph
and selecting "Copy Character Info". So, what this shows us is that
for this smiley face the Unicode code point is 0x1F604. For those of
you who are not hex-savvy that is the decimal number 128,516. That's
a pretty big number.
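
If you want to check my arithmetic, any JavaScript console (node.js
included) will do the hex/decimal conversion for you:

    parseInt('1F604', 16)   // 128516
    (128516).toString(16)   // '1f604'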

The code point that SoftBank had used was 0xFB55 (or 64,341 decimal).
That's a pretty tiny number. You can represent 64,341 with just 16
bits. Dealing with 16 bits is something computers do really well. To
represent 0x1F604 you need 17 bits. Since bits come in 8-packs you
end up using 24 total. Computers hate odd numbers and dealing with a
group of 3 bytes is a real pain.
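
Again, this is easy to verify in a JavaScript console: toString(2)
gives you the binary digits and .length counts them:

    (0xFB55).toString(2).length    // 16 -- fits in 16 bits
    (0x1F604).toString(2).length   // 17 -- needs that 17th bit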

I have to make a side-trip now and explain Unicode character
encodings. Different kinds of computer systems, and the networks that
connect them, think of data in different ways. Inside of the computer
the processor thinks of data in terms defined by its physical
properties. An old Commodore 64 operated on one byte, 8 bits, at a
time. Later computers had 16-bit hardware, then 32, and now most of
the computers you will encounter on your desk prefer to operate on
data 64 bits (8 bytes) at a time. Networks still like to think of
data as a string of individual bytes and try to ignore any such
logical groupings. To represent the entire Unicode code space you
need 21 bits. That is a frustrating size. Also, if you tend to work
in Latin script (English, French, Italian, etc) where all of the codes
you'll ever use fit neatly in 8 bits (the ISO Latin-1 set) then it is
wasteful to have to use 24 bits (21 rounded up to the next byte
boundary) because those top 16 bits will always be unused. So what do
you do? You make alternate encodings.
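
That 21-bit figure falls out of the highest legal code point,
U+10FFFF, which you can check the same way as before:

    (0x10FFFF).toString(2).length   // 21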

There are many encodings, the most common being UTF-8 and UTF-16.
There is also a UTF-32, but it isn't very popular since it's not
space-friendly. UTF-8 has the nice property that all of the original
ASCII characters preserve their encoding. So far in this email every
single character I've typed (other than the smiley) has been an ASCII
character and fits neatly in 7 bits. One byte per character is really
friendly to work with, fits nicely in memory, and doesn't take much
space on disk. If you sometimes need to represent a big character,
like that smiley up there, then you do that with a multi-byte
sequence. As we can see in the info above the UTF-8 for that smiley
is the 4-byte sequence [F0 9F 98 84]. Make a file with those four
bytes in it and open it in any editor that is UTF-8 aware and you'll
get that smiley.
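
You don't even need to make the file; a recent node.js will do the
same trick in one line (the Buffer API has shifted between node
versions, so treat this as a sketch):

    Buffer.from([0xF0, 0x9F, 0x98, 0x84]).toString('utf8')   // '😄'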

Some Unicode-aware programming languages such as Java, Objective-C,
and (most) JavaScript systems use the UTF-16 encoding internally.
UTF-16 has some really good properties of its own that I won't
digress into here. The thing to note is that it uses 16 bits for most
characters. So, whereas a small letter 'a' would be the single byte
0x61 in ASCII or UTF-8, in UTF-16 it is the 2-byte 0x0061. Note that
the SoftBank 0xFB55 fits nicely in that 16-bit space. Hmm, but our
smiley has a Unicode value of U+1F604 (we use U+ when throwing
Unicode values around in hexadecimal) and that will NOT fit in 16
bits. Remember, we need 17. So what do we do? Well, the Unicode guys
are really smart (UTF-8 is fucking brilliant, no, really!) and they
invented a thing called a "surrogate pair". With a surrogate pair you
can use two 16-bit values to encode a code point that is too big to
fit into a single 16-bit field. Surrogate pairs have a specific bit
pattern in their top bits that lets UTF-16 compliant systems know
that they are a surrogate pair representing a single code point and
not two separate UTF-16 code points. In the example smiley above we
find that the UTF-16 surrogate pair that encodes U+1F604 is [U+D83D
U+DE04]. Put those four bytes into a file and open it in any program
that understands UTF-16 and you'll see that smiley. He really is
quite cheery.
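
The arithmetic behind the pair is simple enough to show inline. This
is the algorithm from the Unicode spec, written out in JavaScript:
subtract 0x10000, then split the remaining 20 bits into two 10-bit
halves:

    var cp     = 0x1F604;
    var offset = cp - 0x10000;              // 0x0F604, a 20-bit value
    var high   = 0xD800 + (offset >> 10);   // 0xD83D (top 10 bits)
    var low    = 0xDC00 + (offset & 0x3FF); // 0xDE04 (bottom 10 bits)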

So, I've already said that Objective-C and Java and (most) JavaScript
systems use UTF-16 internally so we should be all cool, right? Well,
see, it's that "(most)" that is the problem.

Before there was UTF-16 there was another encoding used by Java and
JavaScript called UCS-2. UCS-2 is a strict 16-bit encoding. You get
16 bits per character and no more. So how do you represent U+1F604,
which needs 17 bits? You don't. Period. UCS-2 has no notion of
surrogate pairs. For most of its history this was OK because the
Unicode consortium hadn't defined many code points beyond the 16-bit
range so there was nothing out there to encode. But in 1996 it was
clear that to encode all the CJK languages (and Vietnamese!) we'd
start needing those 17+ bit code points. Sun updated Java to stop
using UCS-2 as its default encoding and switched to UTF-16. NeXT did
the same thing with NeXTSTEP (the precursor to iOS). Many JavaScript
systems updated as well.

Now, here's what you've all been waiting for: the V8 runtime for
JavaScript, which is what our node.js servers are built on, uses
UCS-2 internally as its encoding and is not capable of handling any
code point outside the base 16-bit range (we call that the BMP, or
Basic Multilingual Plane). V8 fundamentally has no ability to
represent the U+1F604 that we need to make that smiley.
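
To make the failure concrete, here's a hypothetical sketch of what a
strict UCS-2 encoder is allowed to do (the function is made up; the
16-bit limit is the point):

    // Hypothetical: strict UCS-2 gets exactly one 16-bit unit per
    // character, so anything past U+FFFF simply cannot be encoded.
    function ucs2Encode(codePoint) {
      if (codePoint > 0xFFFF) {
        throw new Error('UCS-2 cannot encode code points above U+FFFF');
      }
      return codePoint;   // one 16-bit code unit
    }

    ucs2Encode(0xFB55);    // fine -- SoftBank's old smiley
    ucs2Encode(0x1F604);   // throws -- our Unicode 6.0 smiley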

Danny confirmed this with the node guys today. Matt Ranney is going
to talk to the V8 guys about it and see what they want to do about it.

Wow, you read through all of that? You rock. I'm humbled that you
gave me so much of your attention. I feel that we've accomplished
something together. Together we are now part of the tiny community of
people who actually know anything about Unicode. You may have guessed
by now that I am a text geek. I have had to implement
java.lang.String for three separate projects. I love this stuff. If
you have any questions about anything I've written here, or want more
info so that you don't have to read the 670-page Unicode 6.0 core
specification (there are many, many addenda as well) then please
don't hesitate to hit me up.

Love,
Chris

p.s. Remember that this narrative is almost all ASCII characters, and
ASCII is a subset of UTF-8. That smiley is the only non-ASCII
character. In UTF-8 this email (everything up to, but not including,
my signature) is 8,553 bytes. In UTF-16 it is 17,102 bytes. In UTF-32
it would be 34,200 bytes (the surrogate pair collapses back into a
single 4-byte unit, so it isn't quite double the UTF-16 size). These
space considerations are one of the many reasons we have multiple
encodings.