29 Comments
- ThatsUnpossible, on 10/12/2007, -0/+2Here's a question: How does Google treat pages that are XML/XSL transformations? Does it just index the XML tags, or is its bot smart enough to do client-side transformation so that it can index the resulting XHTML?
Sorry, for a second I thought this was slashdot.
- MikeyC, on 10/12/2007, -0/+1 The whole article is based on speculation. The guy sees "mozilla" in the useragent string and assumes it's now Mozilla-based?? Browsers (including Internet Explorer) have always added "Mozilla" to their useragent string for the sake of compatibility with servers sniffing for Netscape (when Netscape was King).
It's more likely that "mozilla" has been added to the Googlebot string for the same reason.
- bebopbass, on 10/12/2007, -0/+1 I don't know if this is news. I regularly surf as Googlebot 2.1, using a user agent extension for FF, but yeah, if you run under it you see pretty much what you'd see as any other browser.
- Rjx_, on 10/12/2007, -0/+1i don't think google have used lynx in the past few years (if at all), given that as far as 3 years back they were testing for black-on-black and white-on-white text.
secondly, where's the source for this information? i see no links to external verification..
- inactive, on 10/12/2007, -0/+1 I suspect this has been done to work around any filters that prohibit anything but Mozilla-compatible clients. Back in the day, Internet Explorer added "Mozilla compatible" in the agent header to avoid all the greedy webmasters blocking it.
- ScoTTeh, on 10/12/2007, -0/+0Searching Google for the Mozilla version of the crawler brings up articles as old as April 2005 (Googlebot v2.1). This is probably linked to the whole 'Bigdaddy' project they've got going. The article also just seems to be based on speculation and not fact.
- neffy, on 10/12/2007, -0/+0 Another wonderful feature they could add to their new, highly-extensible bot is validation. Let's say... page 1 has to validate as valid markup. That's taking your market power and leveraging it for Good™.
- ogletree, on 10/12/2007, -0/+0 Yeah, he needs to check the IP of that visitor. I bet it is not Google.
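One way to run the IP check suggested above is a reverse DNS lookup followed by a forward lookup of the returned name. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, and the accepted hostname suffixes are an assumption (`.googlebot.com` matches the hostname another commenter in this thread reports, `crawl-66-249-66-194.googlebot.com`; `.google.com` is added as a guess). The lookup functions are passed in as parameters so the logic can be exercised without network access.

```python
import socket

# Assumed suffixes for Google crawler hostnames; adjust as needed.
GOOGLE_CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")


def hostname_is_google(hostname):
    """Return True if the hostname ends with one of the assumed suffixes."""
    return hostname.rstrip(".").lower().endswith(GOOGLE_CRAWLER_SUFFIXES)


def is_verified_googlebot(ip,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the name, then forward-confirm it."""
    try:
        # gethostbyaddr returns (hostname, aliases, addresses)
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False
    if not hostname_is_google(hostname):
        return False
    try:
        # gethostbyname_ex returns (hostname, aliases, addresses)
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    # The forward lookup must lead back to the IP we started from,
    # otherwise the reverse record could simply be spoofed.
    return ip in addresses
```

Calling `is_verified_googlebot("66.249.66.194")` with the default arguments performs real DNS queries, so the result depends on the resolver and on current DNS records.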
- listrophy, on 10/12/2007, -0/+0This may have ramifications on the hidden keyword scam. Previously, one could hide keywords on the page with an external CSS declaration. Avoiding such schemes would probably improve google's search accuracy.
That is, of course, if the UA wasn't just faked and this is a story about nothing.
- inactive, on 10/12/2007, -0/+0 Here's what it looks like from the Apache server logs:
Before:
"GET /robots.txt HTTP/1.1" 200 26 "-"
"Googlebot/2.1 (+http://www.google.com/bot.html)"
After:
"GET /robots.txt HTTP/1.1" 200 26 "-"
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
- dvvarf, on 10/12/2007, -0/+0 if googlebot really is looking for color, it might mean adsense that better synchronises with the layout.. that would be fun
- Nanobe, on 10/12/2007, -0/+0I have looked through my logs, and I see a third basic Googlebot user agent:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Googlebot 2.1
However, I'm not convinced that it's a bot. It may well be a Firefox/Mozilla user who has simply modified his/her user agent string. The one instance of it I see in my logs does not look like a bot to me. It apparently reached my site from a Google search page, and content was downloaded from my site as Firefox or Mozilla would after a page visit. Stylesheets and a .js file were requested, as well as two images from the default stylesheet, and then it was gone. Looks to me like a real person who just hit his back button quickly after visiting.
I have no CSS/JS file requests from known Googlebot user agent strings in the last few months of logs.
- inactive, on 10/12/2007, -0/+0 I use Googlebot/2.1 as my useragent for Firefox and Opera.. just to be a dick.
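The user agent strings quoted in this thread fall into three shapes: the older `Googlebot/2.1 (...)` form, the `Mozilla/5.0 (compatible; Googlebot/2.1; ...)` form, and a browser-style string that merely contains `Googlebot 2.1`. A rough sketch of how a log script might sort them is below. The function name and the category labels are made up for illustration, and, as several commenters point out, a matching string proves nothing about who sent the request.

```python
import re

# Matches "Googlebot/2.1" as well as "Googlebot 2.1".
GOOGLEBOT_TOKEN = re.compile(r"Googlebot[/ ]\d+(\.\d+)*")


def classify_user_agent(ua):
    """Sort a user agent string into one of four illustrative buckets."""
    if not GOOGLEBOT_TOKEN.search(ua):
        return "other"
    if ua.startswith("Googlebot/"):
        return "classic"
    if ua.startswith("Mozilla/5.0 (compatible; Googlebot/"):
        return "mozilla-compatible"
    # Mentions Googlebot but matches neither form seen in the server logs,
    # e.g. a browser whose user agent string was edited by hand.
    return "unrecognized-googlebot-claim"
```

Anything landing in the last bucket is a candidate for the kind of manual log inspection Nanobe describes.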
- skxy, on 10/12/2007, -0/+0Well I certainly see the user-agent being used "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" - and the IP address it is used by is registered to Google.com (194.66.249.66.in-addr.arpa name = crawl-66-249-66-194.googlebot.com.)
Still, this article is light on details and could easily be mistaken.
- ScoTTeh, on 10/12/2007, -0/+0 Here's a recent and more interesting article on Googlebot/Bigdaddy: http://www.smarthouse.com.au/Computing/Industry/?article=/Computing/Industry/News/J4P3B2S7
- ypnbites, on 10/12/2007, -0/+0Wow, this is cool. Front page of digg. That's great.
@redux
I think you completely and deliberately missed the point so you could come out sounding condescending. I simply stated how Google currently calculates relevance. In no way did I imply it is necessary to mark up specifically for Google other than via html comment.
@everyone else
The article is based on things and information gathered by myself along with many others I know in the same field, and we all came to the same conclusion. It is of course speculation, as I state in the article; but I just write so people can read. I'm not trying to predict the future.
- challahc, on 10/12/2007, -0/+0 Why would googlebot use lynx or mozilla? Both are browsers. Humans use browsers to read html. I don't see why the useragent string makes any difference unless you are serving different versions of your site for each browser.
Where's the bury button
- MalDON, on 10/12/2007, -0/+0 Maybe this will take all those crap sites off of page one.
- Nanobe, on 10/12/2007, -0/+0The new user agent string thing is old news. I blogged about it back in 2004: http://nanobox.chipx86.com/blog/2004/09/new-googlebot.php
Although I hadn't heard about it requesting CSS and JS files. I'll have to look through my logs and check it out.
- Evroccck, on 10/12/2007, -0/+0 this is old, they've had this for a while.
- otwist, on 10/12/2007, -0/+0If this is how it works I hope this keeps bots from flooding comments in blogs.
- redux, on 10/12/2007, -0/+0From the article: "Previously the only way to tell if information was important, was if we told Google it was by using various forms of markup" and "As the web evolves, so must Google"
hmmm...firstly, using various forms of *correct* markup (if something's a heading, mark it as a heading, etc) isn't just done for google...it's the foundation upon which the entire www rests; standards, definitions of how content gets marked up.
so the evolution of the web foreseen by the author is one where everything can just be marked up as whatever, as long as it's styled? a huge step back! that's devolution, not evolution.
- rsullivan25, on 10/12/2007, -0/+0 I have been watching this crawler - it does exist and has been doing the things mentioned in the article and more. It IS a Google IP. Something else I've noticed and confirmed: This crawler can fill out and submit forms. I work with sites where the activity this crawler is involved in is not your typical search engine spider activity, i.e. filling out forms. It also gets between 2 and 10 times the pages the old crawler did in the same time period.
- br0ken1128, on 10/12/2007, -0/+0"This may have ramifications on the hidden keyword scam. Previously, one could hide keywords on the page with an external CSS declaration. Avoiding such schemes would probably improve google's search accuracy."
Not necessarily. If you have your hiding done in the external css file and then add that css file to the robots.txt file, then people will still be able to get around it.
Unless google decides to start ignoring robots.txt
- nano, on 10/12/2007, -0/+0 It'd be cool if they also "punished" IE-only sites and sites using ActiveX
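The robots.txt workaround br0ken1128 describes can be illustrated with Python's standard `urllib.robotparser`. This is only a sketch: the stylesheet path, the `example.com` host, and the `allowed` helper are hypothetical, and it shows how a crawler that honours robots.txt would read the rule, not how Googlebot actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that hides one stylesheet from all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /css/hidden.css
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, `allowed(ROBOTS_TXT, "Googlebot", "http://example.com/css/hidden.css")` is false while the same call for `http://example.com/index.html` is true, so a well-behaved crawler would never see the styles doing the hiding.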
- wolrah, on 10/12/2007, -1/+0Useful. Now not only will Google be able to index better, but sites that are broken in Mozilla will probably lose a good number of spots. A win-win situation (except for those who like to lazily code IE-only sites).
- DisposableRob, on 10/12/2007, -1/+0What's so irritating about thumbnail searches? It's the quickest way to see if a page is what you are looking for, random junk, or porn.
- volz0r, on 10/12/2007, -1/+0 Laugh. Spammers have been indexing web information like this for several years now. It's not a huge step to add pseudo visualization information to categorize importance to a crawler.
- brlewis, on 10/12/2007, -2/+0 Dugg for interest, but I'll check the comments and undigg if this turns out to be a hoax.

