Moving to Unicode 5.1
googleblog.blogspot.com - Google has just begun supporting Unicode 5.1, less than one month after it was released. It's now available in search, so people speaking languages such as Malayalam can now search for words containing the new characters in Unicode 5.1. Web pages can use a variety of different character encodings, like ASCII, Latin-1, Windows 1252, or Unicode.
- 531 diggs
- mrclark411, on 05/06/2008, -4/+18A seriously nerdy article.
- sfacets, on 05/06/2008, -0/+8Yes... you are on Digg...
- bl4k3r, on 05/06/2008, -0/+8...in Technology: Programming.
- KibibyteBrain, on 05/06/2008, -0/+10Maybe for English speakers, but unfortunately character encoding support is still such a nightmare that normal people with native languages that center around the need for extended character sets are all experts on the issue.
- arjie, on 05/06/2008, -1/+9Hurray for Unicode. I was seriously amazed when I could have URLs in Hindi. That was so freaking cool. Then it was back to being bothered by how hard it is to type those with a qwerty keyboard. Now, another language I should know and the input phase is the part that's tough.
- Hangly, on 05/06/2008, -6/+1I guess it still doesn't support vertical scripts like Mongolian.
le sigh
- mijelh, on 05/06/2008, -0/+2In fact, Mongolian script was introduced as early as Unicode 3.0 (range U+1800 - U+18AF)
- slavingia, on 05/06/2008, -31/+3Come on Digg, who the ***** cares...
- Ocelot13, on 05/06/2008, -1/+28we actually get an article that is related to computers, and you bitch.
go back to 4chan and your lolcats.
- DeathJux, on 05/06/2008, -0/+3Enough to get it to the front page, and, thus, goo-gobs of traffic... whatever that's worth.
- KibibyteBrain, on 05/06/2008, -0/+7What a selfish attitude. You'd care if your languages or special need characters were in the Unicode update. And developers should care unless they want to shrug off that market's ability to use their products.
- p0tent1al, on 05/11/2008, -0/+1Um, dude. Checked your profile. Every one of your comments has been buried.
All 4 of them.
Go back to MySpace, maybe someone there will care about your ***** ass site with the wallpaper you stole from a Deviantart user.
- xErath, on 05/06/2008, -7/+1Great! It supports a month old unicode format, but still does not support xhtml served as xml or
- gsnedders, on 05/06/2008, -1/+16Difference: XML (in total) accounts for c. 0.004% of the web, Unicode accounts for > 25%. Which do you think has more demand?
- lordmetroid, on 05/06/2008, -6/+1Now we don't need to worry about running out of URLs at least...
- byronm, on 05/06/2008, -0/+4Is this why Google bots have slowed down crawling? They're getting updated and building a new index based on Unicode 5.1?
- irCuBiC, on 05/06/2008, -1/+3I wonder how they find if a page really is unicode, do they just look at the metadata, or do they actually infer it from the content of the file?
If from metadata, I wonder how much of that is just pages copy-pasting doctypes/xml descriptors from W3C and other places, without actually saving as unicode.
- skmice2, on 05/06/2008, -0/+2That is really a good question - on top of meta tags you have a few ways to check that:
1.) as you said, the file can be saved as Unicode (thus having a Unicode byte-order mark at its start)
2.) the server may 'tell' the browser that it serves the content as Unicode
3.) analyze the content (check if words make sense in the encoding which is set in the metadata of each page)
I find the combination of the first and third method most likely, since this report came from Google - they are perhaps the only ones with a large enough database of individual words in all their forms for each language + the processing capabilities to run a report such as this one.
- astrosmash, on 05/06/2008, -0/+1It comes from the Content-Type header, which is provided by the Web Server by default and can be overridden in the HTML document using the meta http-equiv tag. If you view source on this page you'll see that it specifies UTF-8, which means you should be able to see this umbrella ☂ unless you override your browser's encoding detection.
For plain English content, UTF8 == ASCII, so the fall of ASCII in favor of UTF8 is simply a configuration issue. The content of the pages remains the same.
People who deal with non-English content will already be aware of their plain-text encoding and should be actively migrating away from whatever banana republic encoding they used to use in favor of UTF-8.
From what I understand, Chinese, Japanese, and Korean sites aren't keen to migrate to Unicode because the same Unicode character can look different depending on the language (i.e., different fonts for the same letter).
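The checks listed in this sub-thread (byte-order mark, server header, meta tag, content analysis) can be sketched in Python roughly as follows. This is only an illustration, not how Google or any browser actually does it: the function name `guess_encoding`, the order of the checks, the 1024-byte meta scan, and the Windows-1252 fallback are all assumptions made for the example.

```python
import codecs
import re
from typing import Optional

# Byte-order marks, longest first so a UTF-32 LE mark is not
# mistaken for the UTF-16 LE mark it begins with.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding(body: bytes, content_type: Optional[str] = None) -> str:
    """Hypothetical helper: guess the encoding of a fetched page."""
    # 1. The file itself may start with a byte-order mark.
    for bom, name in BOMS:
        if body.startswith(bom):
            return name

    # 2. The server may declare a charset in the Content-Type header.
    if content_type:
        match = re.search(r'charset\s*=\s*["\']?([\w.:-]+)', content_type, re.I)
        if match:
            return match.group(1).lower()

    # 3. The page may declare a charset in a meta tag near the top.
    match = re.search(rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', body[:1024], re.I)
    if match:
        return match.group(1).decode("ascii").lower()

    # 4. Fall back to inspecting the content: bytes that decode as
    #    strict UTF-8 are very likely UTF-8 (pure ASCII always passes).
    try:
        body.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "windows-1252"  # assumed legacy default for this sketch


# Plain English text is byte-for-byte identical in ASCII and UTF-8.
assert "plain English".encode("ascii") == "plain English".encode("utf-8")
```

A real crawler would also have to handle declarations that disagree with the actual bytes, which is exactly the copy-pasted-metadata problem raised above.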
- sexylegs, on 05/06/2008, -3/+5I see a breast. Just me?
- acdx1, on 05/06/2008, -2/+1Personally I'd be thrilled to see Google serve valid HTML result pages, or..*gasp* XHTML!
- everling, on 05/06/2008, -0/+3XHTML 1.0 and 1.1 are not compatible with XHTML 2.0.
Afaik, no web browser supports XHTML 2.0.
Most web giants are backing *gasp* HTML 5!
- keralablogger, on 05/06/2008, -0/+2I am from Kerala and I loved seeing my mother tongue typed on Google's official blog!
- antdude, on 05/06/2008, -0/+1I am surprised China is not going up big.
- spectre_25gt, on 05/06/2008, -1/+5I'm a big fan of the whole unicode movement, but I still feel in the dark about certain things. For one thing, I've noticed that Firefox never seems to correctly auto-detect unicode pages. Sometimes, when it does, I see blocks with question marks in them. Is that an issue with font support?
Now when you get into editing, it gets really strange. I find it hard to tell whether I'm saving files in a unicode character set or not. I know this is a problem with software implementation, but it affects the standard.
- irCuBiC, on 05/06/2008, -0/+2Firefox can only infer the encoding of a page to a certain degree; it's usually up to the author to specify the encoding of a page, either in metadata or HTTP headers. If it can't find it any other way, it'll guess. A fault can be made at either of these points: a page can be served with a header saying it's UTF-8 while being saved as Latin-1, or Firefox could guess the wrong encoding.
When this happens, strange signs can result because the browser reads the page wrong, and yes, if your font doesn't support the sign, it'll show a strange question-mark block. Even worse is when it actually knows the signs and you end up with chinese/russian/[language] text instead of half-readable text. =)
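The mismatch described here is easy to reproduce. A minimal sketch, assuming a hypothetical helper `misread` that saves text in one encoding and reads it back in another, the way a browser would if the declared encoding were wrong:

```python
def misread(text: str, saved_as: str, read_as: str) -> str:
    """Encode text one way and decode it another, as a misconfigured page would.

    Undecodable bytes become U+FFFD, the Unicode replacement character.
    """
    return text.encode(saved_as).decode(read_as, errors="replace")


# Saved as UTF-8 but read as Latin-1: each accented letter turns into
# two unrelated characters, because its two bytes are read separately.
garbled = misread("café", "utf-8", "latin-1")

# Saved as Latin-1 but declared as UTF-8: the lone 0xE9 byte is not
# valid UTF-8, so the decoder substitutes the replacement character.
replaced = misread("café", "latin-1", "utf-8")

print(repr(garbled))
print(repr(replaced))
```

Note that the replacement character produced by a decoding error is a different thing from the placeholder box a browser draws when the text decoded correctly but no installed font has a glyph for it.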
- fuzzynyanko, on 05/06/2008, -0/+3I like the concept of UNICODE, but the implementation is a mess. UTF-7, UTF-8, UTF-16, and UTF-32 with the many endians. It's transparent for end users, but if you are programming...
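To make the programmer's side of this concrete, here is a small sketch that encodes one character in each of the forms mentioned. The helper name `encodings_of` is made up for the example; the codec names are the ones Python's standard library uses.

```python
CODECS = ("utf-7", "utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")


def encodings_of(ch: str) -> dict:
    """Return the raw bytes of ch under each Unicode encoding form."""
    return {name: ch.encode(name) for name in CODECS}


# U+2602 UMBRELLA: the same code point, six different byte sequences.
for name, raw in encodings_of("\u2602").items():
    print(f"{name:10} {raw.hex(' ')}")
```

The little-endian and big-endian variants contain the same bytes in opposite order, which is why a byte-order mark or an explicit label is needed to tell them apart.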