31 Comments
- geminitojanus, on 10/12/2007, -1/+11 A Web Spider is a tool that search engines (such as Google) use to index webpages. Ever wonder how Google knows exactly what is on a website, and how it is able to pick a relevant website out of billions of pages? This is the first step.
Basically, it's a small utility that goes through a webpage at a selected URL and downloads the content. From there, it passes it off to an engine which starts to take apart the webpage, usually throwing away unnecessary information such as HTML tags, CSS stylesheets, and JavaScript. That leaves basically a jumble of words, which gets passed on to the next part of the spider, which counts the individual words and builds a frequency table for them. Then all of that information is indexed into a database server, so that when you search, it can look at the frequency table and see how closely your query fits the page.
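For anyone who wants to see those steps in code, here is a minimal Python sketch of the "strip the markup, count the words" part. It is only an illustration under simple assumptions, not how Google or the linked tool actually works: the names `TextExtractor`, `word_frequencies`, and `SAMPLE_PAGE` are made up for this example, the download step is left out (the page is a hard-coded string), and nothing is written to a database.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping whatever sits inside <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def word_frequencies(html):
    """Return a Counter mapping each lowercased word to its number of occurrences."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z0-9']+", " ".join(parser.chunks).lower())
    return Counter(words)


# Hypothetical page standing in for the downloaded content.
SAMPLE_PAGE = """<html><head><title>Spider test</title>
<style>body { color: red; }</style>
<script>var spider = 1;</script></head>
<body><h1>Spider</h1><p>A spider indexes pages. Pages link to pages.</p></body></html>"""

freq = word_frequencies(SAMPLE_PAGE)
for word, count in freq.most_common(3):
    print(word, count)
```

The resulting `Counter` is the frequency table from the description above; a real indexer would store it per URL so queries can be matched against it later.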
Of course, that's a very generalized overview, and there are a lot more steps than are shown here, but that should give you the gist of how a search engine works.
- seenthefuture, on 10/12/2007, -1/+7 Meh, I like this one much better... http://se-spider.com
- armbar, on 10/12/2007, -0/+4 @turgor:
Search engine spiders don't see keywords, but the tool referred to in the article does.
The reason I commented in the first place is because you seemed to suggest Lynx as an alternative to the spider tool, which it's not, since it doesn't do the things I mentioned: keyword density, page links and word count calculations.
By the way, you mention that spiders see ASCII. This is only true if the page encoding is ASCII. Most of the time, it's actually a UTF or ISO variation. The only reason I even mention this is because you seem like a dickweed.
Anyway, be sure to leave me another hot-headed comment explaining how stupid I am.
- neoform, on 10/12/2007, -0/+3 This spider kinda sucks though: it doesn't remove any of the noise words from the word density, which makes it pretty much useless for actual indexing of content.
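To make the noise-word point concrete, here is a rough Python sketch of a keyword density calculation that drops noise words before counting. The `NOISE_WORDS` list and the `keyword_density` function are assumptions made up for this example; real engines use far longer lists and their exact rules are not public.

```python
import re
from collections import Counter

# Tiny illustrative noise-word list (an assumption, not any engine's real list).
NOISE_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "on", "for"}


def keyword_density(text, noise_words=NOISE_WORDS):
    """Return each remaining word's share of the total after removing noise words."""
    words = [
        w for w in re.findall(r"[a-z0-9']+", text.lower())
        if w not in noise_words
    ]
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(words)
    return {word: count / total for word, count in counts.items()}


density = keyword_density("The spider crawls the web and the spider indexes the web")
for word, share in sorted(density.items(), key=lambda item: -item[1]):
    print(f"{word}: {share:.0%}")
```

Without the filter, "the" would be the top keyword for that sentence, which is exactly the problem with leaving noise words in.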
- armbar, on 10/12/2007, -1/+4 Um, because Lynx doesn't show you keyword density, meta tags, page links, and word counts?
Why do people always say "OMG UES LYNX"? Any modern browser will let you turn off CSS styling, for similar effect.
- sembetu, on 10/12/2007, -0/+2 If you need a convenient way of viewing your page as a spider (or a screen reader, for that matter) would see it, try Opera. Seriously. I have been a web designer for quite a while now, and in the interest of cross-browser compatibility, I recently started using a great feature in Opera: User Mode / Emulate Text Browser. You can switch between the two views easily, similar to how you can turn page styling off in Firefox. However, this displays almost identically to a text browser a la Lynx or Links (depending on the system). And, by the way, you can use Opera on Mac and Windows.
/Disclaimer/ I still prefer Firefox for everyday use, and I still ensure my sites progressively enhance even from 5.0 browsers.
- Aquilla, on 10/12/2007, -0/+2 This is a super old idea, it's been around for at least 5 years, and there are plenty of other tools around.
http://www.1-hit.com/all-in-one/tool.search-engine-viewer.htm
http://www.searchwho.com/sw5-spider.html
are just a couple of examples.
This stuff is used for SEO purposes mostly.
- leffunov, on 10/12/2007, -0/+1 This is what is destroying search engines right now. Search engines all worked well before optimizers; now even Google struggles to find what I want.
- fakerjohn, on 10/12/2007, -0/+1 I really like that "searchwho" one. It shows wordcounts and that's pretty nice.
I use a number of Firefox extensions for all of this, including SEOPen and Web Developer Toolbar, that let you do just about anything in the way of looking at your pages from a different perspective.
BUT, much of this is moot if you're thinking about SEO. Google, which provides by far the largest share of search-based traffic to my sites, uses so many weird criteria in its algorithm (like expert ranking, click-thrus, user popularity) that you can't expect on-page optimization to do all the work for you. That said, knowing the status of your alt and title tags and whether or not your site makes "sense" without its CSS intact is extremely important.
This is a good thread. Digged.
- toxicredm, on 10/12/2007, -0/+1 http://www.webopedia.com/TERM/s/spider.html
- cesclaveria, on 10/12/2007, -1/+2 thanks for the link, that page is great.
- defe007, on 10/12/2007, -5/+6 What exactly is a Search Engine Spider?
- etruscan, on 10/12/2007, -0/+1 A good tool for checking what the Googlebot sees is to bring up your indexed page in search and view the Cached page. In the frame above the page select the "cached text" view. This will give you the content, alts, headers, etc. It's a great way to check for content density.
- diggmatter, on 10/12/2007, -0/+1 Agreed. This site has some extra information and concatenates followed links nicely.
- sembetu, on 10/12/2007, -0/+1 Valid tool, however:
1. There is a growing base of Opera users.
2. The simplicity of simply switching to "user mode / emulate text browser" is really a novelty.
I do like the power I get from the WebDev TB; however, I was only pointing out the usefulness of a specific component in Opera as it related to THIS particular thread.
- mistercharlie, on 10/12/2007, -0/+1 very very cool. should be useful for competitive intelligence.
all your data are belong to us.
- brentzilla, on 10/12/2007, -1/+1 Umm, how about Firefox + Web Developer Extension (http://chrispederick.com/work/webdeveloper/)? Once again, no need for Opera.
- mpancha, on 10/12/2007, -1/+1 very useful, added to my bookmarks.
- bede, on 10/12/2007, -1/+1 Very useful, thanks.
- FZero, on 10/12/2007, -1/+1 The page bumped into the abuse detection at Dreamhost for my site. Interesting. http://se-spider.com gave some interesting additional information with a note from DH:
"Precondition Failed We're sorry, but we could not fulfill your request for / on this server. We have established rules for access to this server, and any person or robot that violates these rules will be unable to access this site. To resolve this problem, please try the following steps:
Ensure that your computer is free of viruses, Trojan horses, spyware or any other sort of malicious software.
If you are using any sort of personal firewall or browser privacy software, check to ensure that its settings do not cause your web browser to inadvertently violate any of the rules listed below.
If you are behind a Web proxy or corporate firewall, the proxy must conform to the HTTP specification with respect to proxy servers. Contact your network administrator if the trouble persists, or bypass the proxy and connect directly if possible.
Disable any download accelerators you may be using. They don't speed up your downloads anyway; in most cases, they actually run slower! If all else fails, try using a different Web browser, such as Firefox.
If you still need assistance, please contact fzero at geradorzero.com.
More Information:
For your reference, the conditions for access to this server are:
Robots: MUST read and obey robots.txt. MUST identify themselves properly; for example MUST NOT identify as Mozilla. MUST NOT pretend to be a human.
Humans: MUST NOT pretend to be a robot. MUST NOT use a computer infected with viruses, Trojan horses or other malicious software.
Both: MUST NOT harvest email addresses. MUST NOT attempt to send spam. MUST NOT attempt to compromise server security. MUST NOT use excessive amounts of bandwidth or other server resources.
The precondition on the request for the URL / evaluated to false."
- BloatedBunny, on 10/12/2007, -1/+1 Very cool. I'll have to keep this in mind for later use.
- inactive, on 10/12/2007, -3/+3 Awesome tool to check your Website for:
Page title
Meta keywords
Headers
Links
Total and unique word counts
Word list @ Stats
- Lee69, on 10/12/2007, -0/+0 Thanks for this, will definitely bookmark and try later.
- searchoptimize, on 10/12/2007, -0/+0 good read, rather interesting tool here, I'll have to look further into it for my love of search engine stuff.
- smotheredinhugs, on 10/12/2007, -0/+0 This is an interesting tool. There's no information on what spidering methodology they are using for the tool: different engines use different spiders, some look at meta info and some do not, some obey the robots.txt exclusion protocol and some do not. As I understand it, Google uses one algorithm and MSN uses another. I would try to find out more about the source of the statistics before mounting an SEO initiative based on these results.
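For reference, this is roughly what "obeying robots.txt" looks like from the spider's side, sketched in Python with the standard library's `urllib.robotparser`. The robots.txt content, the `ExampleSpider` user agent, and the example.com URLs are all made up for illustration; whether the tool in the article does anything like this is unknown.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; a real spider would download it from the site root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="ExampleSpider"):
    """Return True if the parsed rules let the given user agent fetch the URL."""
    return parser.can_fetch(agent, url)


print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/data.html"))
```

A well-behaved spider runs a check like this before every request and skips anything that is disallowed; a spider that ignores the protocol simply never asks.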
- neonic, on 10/12/2007, -0/+0 Haha, try to search a page with frames. Other than that little weird thing, it seems pretty interesting.
- inactive, on 10/12/2007, -2/+1 Why not just look at the text-only version of the page cached in Google/Yahoo or MSN? These are the only places where you really see what the page looks like through the eyes of a search engine.

