Created February 17, 2012 04:50
{ "inline":
  { "unicode-support-in-js-today":"💩"
  , "unicode-support-in-js-someday":"😁" }
, "surrogates":
  { "unicode-support-in-js-today":"\ud83d\udca9"
  , "unicode-support-in-js-someday":"\ud83d\ude01" }
}
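For reference, here is a minimal sketch of how the UTF-16 surrogate pair for a non-BMP code point is derived. The helper names `toSurrogatePair` and `toEscape` are made up for illustration and are not part of the gist. The arithmetic itself is the standard UTF-16 encoding: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```javascript
// Hypothetical helper (not part of the gist): split a code point above
// U+FFFF into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("expected a code point in U+10000..U+10FFFF")
  }
  var offset = codePoint - 0x10000
  var high = 0xD800 + (offset >> 10)   // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF)  // bottom 10 bits
  return [high, low]
}

// Hypothetical helper: format one code unit as a \uXXXX escape.
function toEscape(unit) {
  return "\\u" + ("0000" + unit.toString(16)).slice(-4)
}

var pileOfPoo = toSurrogatePair(0x1F4A9).map(toEscape).join("")
var grinningFace = toSurrogatePair(0x1F601).map(toEscape).join("")
console.log(pileOfPoo)     // \ud83d\udca9
console.log(grinningFace)  // \ud83d\ude01
```

Note that these differ from the UTF-8 bytes of the same characters (f0 9f 92 a9 and f0 9f 98 81), which are easy to mistake for a pair of 16-bit units.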
function assert(x) {
  if (!x) console.error("assertion failed")
  else console.error("assertion passed")
}
{ "use unicode" // opt-in so we don't break the web
  var x = "\u1F638" // > 2 byte unicode code point
  var y = "😸" // grinning cat face with smiling eyes (U+1F638)
  assert(x.length === 1) // less important, but ideal
  assert(y.length === 1) // less important, but ideal
  assert(x === y) // unicode code points should match literals
  console.log(x) // <-- should output a smiley, not "ὣ8"
  console.log(y) // <-- should output a smiley, not mojibake
  assert(JSON.stringify(y) === JSON.stringify(x))
  assert(JSON.parse(JSON.stringify(y)) === y)
  assert(JSON.parse(JSON.stringify(x)) === x)
  assert(x.indexOf(y) === 0)
  assert(y.indexOf(x) === 0)
  var arr = ["a", "b", "c"]
  var axbxc = arr.join(x)
  var aybyc = arr.join(y)
  assert(axbxc.split(x)[1] === arr[1])
  assert(axbxc.split(y)[1] === arr[1])
  // etc.
  // They're just characters, and just strings.
  // No special anything, just treat it like any other character.
}
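For contrast with the wishlist above, here is a rough sketch of what you have to do without any such opt-in: strings are sequences of UTF-16 code units, so a non-BMP character reports a `length` of 2, and counting characters means walking the surrogate pairs by hand. `countCodePoints` is a hypothetical helper written for this illustration, not something the gist or the language provides.

```javascript
// Hypothetical helper: count code points rather than UTF-16 code units.
function countCodePoints(str) {
  var count = 0
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i)
    // A high surrogate followed by a low surrogate is one code point.
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1)
      if (next >= 0xDC00 && next <= 0xDFFF) i++ // skip the low surrogate
    }
    count++
  }
  return count
}

var cat = "\uD83D\uDE38" // U+1F638 written out as a surrogate pair
console.log(cat.length)                        // 2
console.log(countCodePoints(cat))              // 1
console.log(countCodePoints("a" + cat + "b"))  // 3
```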
Please see Gist 1861530
For some reason, I couldn't post it as a comment here.
It appears that, in node at least, we're being bitten by http://code.google.com/p/v8/issues/detail?id=761. We will work with V8 to figure out the best solution there, to get from UTF-8 bytes into a JavaScript string, which doesn't arbitrarily trash non-BMP characters. I apologize for misunderstanding the issue and impugning the good name of JavaScript. (In my defense, it's a particularly complicated issue, and JavaScript's name isn't really all that good ;)
Nevertheless, I think that clearly the long-term correct fix is for JavaScript to handle Unicode intelligently (albeit with the presence of big red switches), so I'm very happy to see your proposal.
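To make "get from UTF-8 bytes into a JavaScript string without trashing non-BMP characters" concrete, here is a minimal sketch of the mapping. It is not how V8 or node actually does it; `utf8ToString` is a hypothetical helper, and it assumes well-formed input with no validation at all. The point is only that a 4-byte sequence has to come out as a surrogate pair rather than being truncated to 16 bits.

```javascript
// Hypothetical helper: decode an array of UTF-8 byte values into a
// JavaScript string. Assumes well-formed input; no validation is done.
function utf8ToString(bytes) {
  var out = ""
  var i = 0
  while (i < bytes.length) {
    var b = bytes[i]
    var cp, extra
    if (b < 0x80)      { cp = b;        extra = 0 } // 1-byte (ASCII)
    else if (b < 0xE0) { cp = b & 0x1F; extra = 1 } // 2-byte lead
    else if (b < 0xF0) { cp = b & 0x0F; extra = 2 } // 3-byte lead
    else               { cp = b & 0x07; extra = 3 } // 4-byte lead (non-BMP)
    i++
    for (var k = 0; k < extra; k++) {
      cp = (cp << 6) | (bytes[i++] & 0x3F) // fold in 6 continuation bits
    }
    if (cp > 0xFFFF) {
      // Non-BMP: emit a surrogate pair instead of truncating to 16 bits.
      cp -= 0x10000
      out += String.fromCharCode(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF))
    } else {
      out += String.fromCharCode(cp)
    }
  }
  return out
}

var poo = utf8ToString([0xF0, 0x9F, 0x92, 0xA9]) // UTF-8 bytes of U+1F4A9
console.log(poo === "\uD83D\uDCA9") // true
```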
Hm, github ate my reply. I'll try to do it justice. I'm sure the original was much more compelling :)
@BrendanEich I wasn't glibly asserting it'd work. I was glibly asking what specifically would break. I've heard several claims about how it would break things. Those claims have the ring of truth, but I've grown skeptical of rings, even truthy ones. I'd like to see a program in the wild that'll actually be affected, especially one that isn't using strings as a makeshift binary array, or doing fancy byte-shuffling in order to work around this very issue.
Skepticism aside, I'm not completely insane. Of course this would have to be opt-in. If it can't be a pragma, fine; a BRS, or even a separate special type would be perfectly acceptable, as long as it would enable us to serialize and deserialize the string faithfully, and know what the characters should be rather than rely on the DOM to sort it out for us.
Yes, that's true. It'd have to either be somehow framed, like
\U+[1F638]
or something, or we just bite the bullet and write out the surrogates.
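For what it's worth, a framed escape along these lines did later land in the language: ECMAScript 2015 added the `\u{...}` form for string literals, along with `String.fromCodePoint` and `String.prototype.codePointAt`. The sketch below assumes an ES2015-capable engine. It shows that the framed form and the written-out surrogates produce the same string, which is still two code units long.

```javascript
var framed = "\u{1F638}"         // ES2015 framed escape
var surrogates = "\uD83D\uDE38"  // same code point, written as surrogates

console.log(framed === surrogates)                     // true
console.log(framed.length)                             // 2
console.log(String.fromCodePoint(0x1F638) === framed)  // true
console.log(framed.codePointAt(0).toString(16))        // 1f638
```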