35 Comments
- Ocelot13, on 05/06/2008, -1/+28we actually get an article that is related to computers, and you bitch.
go back to 4chan and your lolcats.
- gsnedders, on 05/06/2008, -1/+16Difference: XML (in total) accounts for c. 0.004% of the web, Unicode accounts for > 25%. Which do you think has more demand?
- mrclark411, on 05/06/2008, -4/+18A seriously nerdy article.
- irCuBiC, on 05/06/2008, -0/+10No it isn't, it's a one-byte character encoding, as is latin-1.
Just because they REFER to unicode codepoints some places does not mean they use said codepoints when saved to file.
- KibibyteBrain, on 05/06/2008, -0/+10Maybe for English speakers, but unfortunately character encoding support is still such a nightmare that normal people with native languages that center around the need for extended character sets are all experts on the issue.
- arjie, on 05/06/2008, -1/+9Hurray for Unicode. I was seriously amazed when I could have URLs in Hindi. That was so freaking cool. Then it was back to being bothered by how hard it is to type those with a qwerty keyboard. Now, another language I should know and the input phase is the part that's tough.
- sfacets, on 05/06/2008, -0/+8Yes... you are on Digg...
- bl4k3r, on 05/06/2008, -0/+8...in Technology: Programming.
- KibibyteBrain, on 05/06/2008, -0/+7What a selfish attitude. You'd care if your languages or special need characters were in the Unicode update. And developers should care unless they want to shrug off that market's ability to use their products.
- spectre_25gt, on 05/06/2008, -1/+5I'm a big fan of the whole unicode movement, but I still feel in the dark about certain things. For one thing, I've noticed that Firefox never seems to correctly auto-detect unicode pages. Sometimes, when it does, I see blocks with question marks in them. Is that an issue with font support?
Now when you get into editing, it gets really strange. I find it hard to tell whether I'm saving files in a unicode character set or not. I know this is a problem with software implementation, but it affects the standard.
- byronm, on 05/06/2008, -0/+4Is this why Google bots have slowed down crawling? They're getting updated and building a new index based on Unicode 5.1?
- fuzzynyanko, on 05/06/2008, -0/+3I like the concept of UNICODE, but the implementation is a mess. UTF-7, UTF-8, UTF-16, and UTF-32 with the many endians. It's transparent for end users, but if you are programming...
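To make the "many endians" point concrete, here is a minimal Python sketch of how a single code point comes out under several of the encoding forms listed above. The sample character (U+20AC, the euro sign) is an arbitrary choice and not from the thread.

```python
# Encode one code point under several Unicode encoding forms.
# The sample character (U+20AC EURO SIGN) is an arbitrary choice for illustration.
sample = "\u20ac"

encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]
encoded = {name: sample.encode(name) for name in encodings}

for name, raw in encoded.items():
    # bytes.hex() with a separator needs Python 3.8 or later.
    print(f"{name:10} {raw.hex(' ')}")

# Plain "utf-16" (no explicit byte order) prepends a byte order mark (BOM)
# so a reader can tell which endianness was used.
with_bom = sample.encode("utf-16")
print(f"{'utf-16':10} {with_bom.hex(' ')}")
```

Same character, five different byte sequences. That is the part a programmer has to get right even though the end user never sees it.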
- everling, on 05/06/2008, -0/+3XHTML 1.0 and 1.1 are not compatible with XHTML 2.0.
Afaik, no web browser supports XHTML 2.0.
Most web giants are backing *gasp* HTML 5!
- DeathJux, on 05/06/2008, -0/+3Enough to get it to the front page, and, thus, goo-gobs of traffic... whatever that's worth.
- sexylegs, on 05/06/2008, -3/+5I see a breast. Just me?
- mijelh, on 05/06/2008, -0/+2In fact, Mongolian script was introduced as early as Unicode 3.0 (range U+1800 - U+18AF)
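The range quoted above can be inspected with Python's standard `unicodedata` module. This is only a sketch: which code points in the range have names depends on the Unicode version bundled with the interpreter, and the variable names are made up for illustration.

```python
import unicodedata

# The block quoted in the comment above: U+1800 through U+18AF inclusive.
MONGOLIAN_BLOCK = range(0x1800, 0x18B0)

# Keep only code points that the bundled Unicode database assigns a name to.
named = [
    (cp, unicodedata.name(chr(cp)))
    for cp in MONGOLIAN_BLOCK
    if unicodedata.name(chr(cp), "")
]

# Show the first few entries.
for cp, name in named[:5]:
    print(f"U+{cp:04X} {name}")
```

Note that this only shows the characters are encoded. Laying them out vertically is the renderer's job, not the encoding's.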
- irCuBiC, on 05/06/2008, -1/+3I wonder how they find if a page really is unicode, do they just look at the metadata, or do they actually infer it from the content of the file?
If from metadata, I wonder how much of that is just pages copy-pasting doctypes/xml descriptors from W3C and other places, without actually saving as unicode.
- irCuBiC, on 05/06/2008, -0/+2Firefox can only infer the encoding of a page to a certain degree, it's usually up to the author to specify the encoding of a page, either in metadata or HTTP headers. If it can't find it any other way, it'll guess. At either of these points a fault can be made; A page can be served with header saying it's UTF-8, while being saved as latin-1; Firefox could guess the wrong encoding.
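Both failure cases just described (bytes saved in one encoding, read as another) are easy to reproduce. A minimal Python sketch with a made-up sample word; it mimics the mismatch itself, not Firefox's actual detection logic:

```python
# A page saved in one encoding but decoded as another.
# The sample text is a made-up example.
text = "café"

latin1_bytes = text.encode("latin-1")   # 'é' becomes the single byte 0xE9
utf8_bytes = text.encode("utf-8")       # 'é' becomes the two bytes 0xC3 0xA9

# Case 1: header says UTF-8, file is Latin-1. A lone 0xE9 is not valid UTF-8,
# so a lenient decoder substitutes U+FFFD (the replacement character).
wrong_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# Case 2: file is UTF-8, reader assumes Latin-1. Every byte maps to some
# Latin-1 character, so decoding "succeeds" and silently yields mojibake.
wrong_as_latin1 = utf8_bytes.decode("latin-1")

print(repr(wrong_as_utf8))
print(repr(wrong_as_latin1))
```

The second case is the nastier one because nothing reports an error; the text is just wrong.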
When this happens, strange signs can result because the browser reads the page wrong, and yes, if your font doesn't support the sign, it'll show a strange questionmark block. Even worse is when it actually knows the signs and you end up with chinese/russian/[language] text instead of half-readable text. =)
- skmice2, on 05/06/2008, -0/+2That is really a good question - on top of meta tags you have a few ways how to check that:
1.) as you said the file can be saved as Unicode (thus having a Unicode identification byte at its start)
2.) the server may 'tell' the browser that it serves the content as Unicode
3.) analyze the content (check if words make sense in the encoding which is set in the metadata of each page)
I find the combination of the first and third method most likely, since this report came from Google - they are perhaps the only ones with a large enough database of individual words in all their forms for each language + the processing capabilities to run a report such as this one.
- keralablogger, on 05/06/2008, -0/+2I am from kerala and i Loved to see my mother tongue typed on google official blog.!!!!!!
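The three checks skmice2 lists above (byte order mark, server header, content analysis) can be sketched roughly as follows. This is a hypothetical helper, not how Google or any browser actually does it: `guess_encoding` and its parameters are invented for illustration, and the third step swaps the dictionary-based word check described above for a much cruder test (does the content decode as strict UTF-8?).

```python
import codecs
import re

# BOM signatures, longest first: the UTF-32 LE mark begins with the same
# two bytes as the UTF-16 LE mark, so it has to be tested before it.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(raw, content_type=None):
    """Hypothetical sketch: return a codec name for the bytes in `raw`."""
    # 1) Identification bytes (BOM) at the start of the file.
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name

    # 2) What the server claims, e.g. "text/html; charset=ISO-8859-1".
    if content_type:
        match = re.search(r'charset="?([\w.:-]+)', content_type, re.IGNORECASE)
        if match:
            return match.group(1).lower()

    # 3) Content analysis, reduced here to "is it valid UTF-8?".
    #    Falling back to Latin-1 is an assumption of this sketch.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"


print(guess_encoding(codecs.BOM_UTF8 + b"hello"))
print(guess_encoding(b"hello", "text/html; charset=ISO-8859-1"))
print(guess_encoding("café".encode("latin-1")))
```

Step 2 simply trusts the header, which is exactly the weakness discussed above: a page can claim UTF-8 and still be saved as something else.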
- astrosmash, on 05/06/2008, -0/+1It comes from the Content-Type header, which is provided by the Web Server by default and can be overridden in the HTML document using the meta http-equiv tag. If you view source on this page you'll see that it specifies UTF-8, which means you should be able to see this umbrella ☂ unless you override your browser's encoding detection.
For plain English content, UTF8 == ASCII, so the fall of ASCII in favor of UTF8 is simply a configuration issue. The content of the pages remains the same.
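The UTF8 == ASCII point is easy to verify. A small Python sketch; the English sample string is made up, the umbrella is the one from the paragraph above:

```python
# Plain ASCII text produces identical bytes under ASCII and UTF-8.
# The English sample string is made up for illustration.
english = "Plain English content"
same_bytes = english.encode("ascii") == english.encode("utf-8")
print(same_bytes)

# The umbrella (U+2602) is outside ASCII and takes several bytes in UTF-8.
umbrella = "\u2602"
umbrella_utf8 = umbrella.encode("utf-8")
print(len(umbrella_utf8))

# Trying to store it as ASCII fails outright.
try:
    umbrella.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
print(fits_in_ascii)
```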
People who deal with non-English content will already be aware of their plain-text encoding and should be actively migrating away from whatever banana republic encoding they used to use in favor of UTF-8.
From what I understand, Chinese, Japanese, and Korean sites aren't keen to migrate to Unicode because the same Unicode character can look different depending on the language (i.e., different fonts for the same letter).
- antdude, on 05/06/2008, -0/+1I am surprised China is not going up big.
- p0tent1al, on 05/11/2008, -0/+1Um, dude. Checked your profile. Everyone of your comments has been buried.
All 4 of them.
Go back to MySpace, maybe someone there will care about your ***** ass site with the wallpaper you stole from a Deviantart user.
- malayalamnews, on 08/05/2008, -0/+0go to our-kerala.com Malayalam UTF-8 UNICODE
Malayalam news
http://www.our-kerala.com/
- irCuBiC, on 05/06/2008, -1/+1Um, UTF-8 is an entirely different encoding; It's one of the few encodings that actually use Unicode.
Windows-1252, Latin-1 (aka ISO 8859-1), et al. are just different 8-bit encodings made to support ASCII plus another 128 domain-specific characters. Latin-1, for example, is intended for most of western Europe.
- yenta4shop, on 09/07/2008, -0/+0http://www.yenta4shop.co.uk/
http://astore.amazon.com/12.volt.battery.charger-2 ...
http://astore.amazon.com/5.gallon.water.bottle-20
http://astore.amazon.com/aerobed.raised-20
http://astore.amazon.com/bug.zapper-20
http://astore.amazon.com/flowtron.insect.killer-20
http://astore.amazon.com/furniture.chaise.lounge-2 ...
http://astore.amazon.com/inflatable.bed-20
http://astore.amazon.com/steam.cleaner.mop-20
- acdx1, on 05/06/2008, -2/+1Personally I'd be thrilled to see Google serve valid HTML result pages, or..*gasp* XHTML!
- everling, on 05/06/2008, -1/+0Windows-1252 supports 255 characters, Unicode supports hundreds of thousands and potentially millions.
Windows-1252 is at best a subset of Unicode, at least for the first 7-bits.
- Hangly, on 05/06/2008, -6/+1I guess it still doesn't support vertical scripts like Mongolian.
le sigh
- lordmetroid, on 05/06/2008, -6/+1Now we don't need to worry about running out of URLs at least...
- xErath, on 05/06/2008, -7/+1Great! It supports a month old unicode format, but still does not support xhtml served as xml or
- slavingia, on 05/06/2008, -31/+3Come on Digg, who the ***** cares...


