From: Chris DeSalvo <[email protected]>
Subject: Why we can't process Emoji anymore
Date: Thu, 12 Jan 2012 18:49:20 -0800
Message-Id: <[email protected]>

If you are not interested in the technical details of why Emoji
currently do not work in our iOS client, you can stop reading now.

Many, many years ago a Japanese cell phone carrier called SoftBank
came up with the idea for emoji and built it into the cell phones that
it sold for its network. The problem they had was in deciding how to
represent the characters in electronic form. They decided to use
Unicode code points in the private use areas. This is a perfectly
valid thing to do as long as your data stays completely within your
product. However, with text messages the data has to interoperate
with other carriers' phones.

Unfortunately SoftBank decided to copyright their entire set of
images, their encoding, etc etc etc and refused to license them to
anyone. So, when NTT and KDDI (two other Japanese carriers) decided
that they wanted emoji they had to do their own implementations. To
make things even more sad, they decided not to work with each other
and gang up on SoftBank. So, in Japan, there were three competing
emoji standards that did not interoperate.

In 2008 Apple released iOS 2.2 and added support for the SoftBank
implementation of emoji. Since SoftBank would not license their emoji
out for use on networks other than their own, Apple agreed to only
make the emoji keyboard visible on iPhones that were on the SoftBank
network. That's why you used to have to run an ad-ware app to make
that keyboard visible.

In October 2010 the Unicode consortium released version 6.0 of the
Unicode standard. (In case anyone cares, Unicode originated in 1987
as a joint research project between Xerox and Apple.) The smart
Unicode folks added all of emoji (about 740 glyphs) plus the new
Indian Rupee sign, more symbols needed for several African languages,
and hundreds more CJK symbols for, well, Chinese/Japanese/Korean (CJK
also covers Vietnamese, but now, like then, nobody gives Vietnam any
credit).

With iOS 5.0 Apple (wisely) decided to adopt Unicode 6.0. The emoji
keyboard was made available to all users and generates code points
from their new Unicode 6.0 locations. Apple also added this support
to OS X Lion.

You may be asking, "So this all sounds great. Why can't I type a
smiley in Voxer and have the damn thing show up?" Glad you asked.
Consider the following glyph:

😄
SMILING FACE WITH OPEN MOUTH AND SMILING EYES
Unicode: U+1F604 (U+D83D U+DE04), UTF-8: F0 9F 98 84

You can get this info for any character that OS X can render by
bringing up the Character Viewer panel and right-clicking on a glyph
and selecting "Copy Character Info". So, what this shows us is that
for this smiley face the Unicode code point is 0x1F604. For those of
you who are not hex-savvy that is the decimal number 128,516. That's
a pretty big number.
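
If you want to check my arithmetic, any JavaScript console (node.js
included) will do the hex/decimal conversion for you:

    parseInt('1F604', 16)   // 128516
    (128516).toString(16)   // '1f604'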

The code point that SoftBank had used was 0xFB55 (or 64,341 decimal).
That's a pretty tiny number. You can represent 64,341 with just 16
bits. Dealing with 16 bits is something computers do really well. To
represent 0x1F604 you need 17 bits. Since bits come in 8-packs you
end up using 24 total. Computers hate odd numbers and dealing with a
group of 3 bytes is a real pain.
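
Again, this is easy to verify in a JavaScript console: toString(2)
gives you the binary digits and .length counts them:

    (0xFB55).toString(2).length    // 16 -- fits in 16 bits
    (0x1F604).toString(2).length   // 17 -- needs that 17th bit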

I have to make a side-trip now and explain Unicode character
encodings. Different kinds of computer systems, and the networks that
connect them, think of data in different ways. Inside of the computer
the processor thinks of data in terms defined by its physical
properties. An old Commodore 64 operated on one byte, 8 bits, at a
time. Later computers had 16-bit hardware, then 32, and now most of
the computers you will encounter on your desk prefer to operate on
data 64 bits (8 bytes) at a time. Networks still like to think of
data as a string of individual bytes and try to ignore any such
logical groupings. To represent the entire Unicode code space you
need 21 bits. That is a frustrating size. Also, if you tend to work
in Latin script (English, French, Italian, etc) where all of the codes
you'll ever use fit neatly in 8 bits (the ISO Latin-1 set) then it is
wasteful to have to use 24 bits (21 rounded up to the next byte
boundary) because those top 16 bits will always be unused. So what do
you do? You make alternate encodings.
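
That 21-bit figure falls out of the highest legal code point,
U+10FFFF, which you can check the same way as before:

    (0x10FFFF).toString(2).length   // 21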

There are many encodings, the most common being UTF-8 and UTF-16.
There is also a UTF-32, but it isn't very popular since it's not
space-friendly. UTF-8 has the nice property that all of the original
ASCII characters preserve their encoding. So far in this email every
single character I've typed (other than the smiley) has been an ASCII
character and fits neatly in 7 bits. One byte per character is really
friendly to work with, fits nicely in memory, and doesn't take much
space on disk. If you sometimes need to represent a big character,
like that smiley up there, then you do that with a multi-byte
sequence. As we can see in the info above the UTF-8 for that smiley
is the 4-byte sequence [F0 9F 98 84]. Make a file with those four
bytes in it and open it in any editor that is UTF-8 aware and you'll
get that smiley.
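
You don't even need to make the file; a recent node.js will do the
same trick in one line (the Buffer API has shifted between node
versions, so treat this as a sketch):

    Buffer.from([0xF0, 0x9F, 0x98, 0x84]).toString('utf8')   // '😄'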

Some Unicode-aware programming languages such as Java, Objective-C,
and (most) JavaScript systems use the UTF-16 encoding internally.
UTF-16 has some really good properties of its own that I won't
digress into here. The thing to note is that it uses 16 bits for most
characters. So, whereas a small letter 'a' would be the single byte
0x61 in ASCII or UTF-8, in UTF-16 it is the 2-byte 0x0061. Note that
the SoftBank 0xFB55 fits nicely in that 16-bit space. Hmm, but our
smiley has a Unicode value of U+1F604 (we use U+ when throwing
Unicode values around in hexadecimal) and that will NOT fit in 16
bits. Remember, we need 17. So what do we do? Well, the Unicode guys
are really smart (UTF-8 is fucking brilliant, no, really!) and they
invented a thing called a "surrogate pair". With a surrogate pair you
can use two 16-bit values to encode a code point that is too big to
fit into a single 16-bit field. Surrogate pairs have a specific bit
pattern in their top bits that lets UTF-16 compliant systems know
that they are a surrogate pair representing a single code point and
not two separate UTF-16 code points. In the example smiley above we
find that the UTF-16 surrogate pair that encodes U+1F604 is [U+D83D
U+DE04]. Put those four bytes into a file and open it in any program
that understands UTF-16 and you'll see that smiley. He really is
quite cheery.
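
The arithmetic behind the pair is simple enough to show inline. This
is the algorithm from the Unicode spec, written out in JavaScript:
subtract 0x10000, then split the remaining 20 bits into two 10-bit
halves:

    var cp     = 0x1F604;
    var offset = cp - 0x10000;              // 0x0F604, a 20-bit value
    var high   = 0xD800 + (offset >> 10);   // 0xD83D (top 10 bits)
    var low    = 0xDC00 + (offset & 0x3FF); // 0xDE04 (bottom 10 bits)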

So, I've already said that Objective-C and Java and (most) JavaScript
systems use UTF-16 internally so we should be all cool, right? Well,
see, it's that "(most)" that is the problem.

Before there was UTF-16 there was another encoding used by Java and
JavaScript called UCS-2. UCS-2 is a strict 16-bit encoding. You get
16 bits per character and no more. So how do you represent U+1F604,
which needs 17 bits? You don't. Period. UCS-2 has no notion of
surrogate pairs. For most of its history this was OK because the
Unicode consortium hadn't defined many code points beyond the 16-bit
range so there was nothing out there to encode. But in 1996 it was
clear that to encode all the CJK languages (and Vietnamese!) we'd
start needing those 17+ bit code points. Sun updated Java to stop
using UCS-2 as its default encoding and switched to UTF-16. NeXT did
the same thing with NeXTSTEP (the precursor to iOS). Many JavaScript
systems updated as well.

Now, here's what you've all been waiting for: the V8 runtime for
JavaScript, which is what our node.js servers are built on, uses
UCS-2 internally as its encoding and is not capable of handling any
code point outside the base 16-bit range (we call that the BMP, or
Basic Multilingual Plane). V8 fundamentally has no ability to
represent the U+1F604 that we need to make that smiley.
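
To make the failure concrete, here's a hypothetical sketch of what a
strict UCS-2 encoder is allowed to do (the function is made up; the
16-bit limit is the point):

    // Hypothetical: strict UCS-2 gets exactly one 16-bit unit per
    // character, so anything past U+FFFF simply cannot be encoded.
    function ucs2Encode(codePoint) {
      if (codePoint > 0xFFFF) {
        throw new Error('UCS-2 cannot encode code points above U+FFFF');
      }
      return codePoint;   // one 16-bit code unit
    }

    ucs2Encode(0xFB55);    // fine -- SoftBank's old smiley
    ucs2Encode(0x1F604);   // throws -- our Unicode 6.0 smiley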

Danny confirmed this with the node guys today. Matt Ranney is going
to talk to the V8 guys about it and see what they want to do about it.

Wow, you read through all of that? You rock. I'm humbled that you
gave me so much of your attention. I feel that we've accomplished
something together. Together we are now part of the tiny community of
people who actually know anything about Unicode. You may have guessed
by now that I am a text geek. I have had to implement
java.lang.String for three separate projects. I love this stuff. If
you have any questions about anything I've written here, or want more
info so that you don't have to read the 670-page Unicode 6.0 core
specification (there are many, many addenda as well) then please
don't hesitate to hit me up.

Love,
Chris

p.s. Remember that this narrative is almost all ASCII characters, and
ASCII is a subset of UTF-8. That smiley is the only non-ASCII
character. In UTF-8 this email (everything up to, but not including,
my signature) is 8,553 bytes. In UTF-16 it is 17,102 bytes. In UTF-32
it would be 34,200 bytes (the surrogate pair collapses back into a
single 4-byte unit, so it isn't quite double the UTF-16 size). These
space considerations are one of the many reasons we have multiple
encodings.