40 Comments
- HouseCentipede, on 07/27/2008, -2/+18Why does linux have a ® next to it?
- stygiansonic, on 07/28/2008, -0/+8Because Linus owns the trademark: (At least in the US)
http://en.wikipedia.org/wiki/Linux#Copyright_and_n ...
- YodaJones, on 07/27/2008, -0/+7Nice article. Many many moons ago I believe there was a pretty good article in either PC Magazine or Byte about spiders also.
- fLUx1337, on 07/28/2008, -0/+6It's done exactly the same on Windows, OS X, etc. As long as you can get Python/Ruby up and running, you're fine...
- takatoo, on 07/28/2008, -1/+7I love ruby
simple yet powerful
- azbmr, on 07/28/2008, -0/+5I know, we hate to have technology-related articles around here, huh?
- inactive, on 07/28/2008, -1/+6Because this is how you build it on Linux. It's Linux-specific because it's an article for Linux. If you want to build a web spider on Windows or any other OS, then there are other articles.
- daftman, on 07/28/2008, -0/+4Because it is a registered trademark.
- kevdotbadger, on 07/28/2008, -0/+4Awesome article. I'm considering making a duggmirror clone, where the spider goes through the most popular upcoming articles, steals the text/images/video/flash, and rebuilds the site on a dedicated server.
...but then someone will have to make a similar application to mirror that website, since I'm not made of money to pay for huge bandwidth costs.
- MacBookForMe, on 07/28/2008, -0/+4Linux has a 'charm'...
- phatfiend, on 07/28/2008, -0/+4import webspider
GG python
- kevdotbadger, on 07/28/2008, -0/+3The joke's getting a little overrated now.
- diggdiggdug, on 07/28/2008, -0/+2Now I'd like to find out how to build a spider that will crawl into my boss's bed and bite him in the arse.
- spinchange, on 07/28/2008, -0/+2Open Source, DIY search FTW! Who says Google, or anyone else for that matter, has to be the only one to "define" the modern world, and thus be its only oracle?
- masfenix, on 07/28/2008, -0/+2really ... netscape 7.2? older than the interwebz
- nothin2g, on 07/28/2008, -3/+5***** you, 2 of my friends got killed by build-essentials.
- billizm, on 07/28/2008, -0/+1fLUx1337 explains my point exactly. I didn't think I would have to actually say it. So that being the case, why the limit to Linux?
- jay019, on 07/30/2008, -0/+1'Cause most times your default Linux install includes all the necessary tools. A default install of Windows contains, well, er, nothing useful.
- inactive, on 07/28/2008, -0/+1Every ***** article? Really?
- Commodore13, on 07/31/2008, -0/+1I think you may also have a legal problem there.
- inactive, on 07/28/2008, -0/+1Don't bury it just for that, though I really think the proper way is with the trademark symbol.
- tehmoth, on 07/28/2008, -0/+1I love how they use Ruby and Python, and not Perl, which has a much longer history and more robust modules for spidering websites, such as WWW::Mechanize. Not really surprised though.
- vade79, on 07/28/2008, -0/+1Ruby is kind of the wrong language for writing a web crawler. I wrote a crawler a few months ago in C, and it still takes up a good deal of processing/memory/sockets/etc. Had to streamline the parsing of the HTML several times just to make it keep up. I guess this will cut it for your basic one-socket-at-a-time deal...but crawling the web like that is pretty unusable.
- Commodore13, on 07/31/2008, -0/+1It has the trademark symbol because the submitter didn't bother to actually write a description, he just copied the first paragraph of the article. It makes sense that IBM would do that because IBM is a very professional company.
- jay019, on 07/30/2008, -0/+1What's so special about Crysis? I tried it and thought it was utter crap. Nothing new or innovative. *yawn*
- malexan, on 07/28/2008, -0/+1An old one, but still a good read. For those interested in spidering software, there's SocSciBot3, WebBot, and the Stanford WebBase Project. Use Google to track down the home pages :-)
- nothin2g, on 07/28/2008, -0/+1crysis® anyone?
- rolty125, on 07/29/2008, -0/+1Is there a reason why the Linux in the Title of the article doesn't have it while description does?
- DelMonte, on 07/29/2008, -0/+1According to http://en.wikipedia.org/wiki/Robots.txt :
"The protocol, however, is purely advisory."
However, I believe that public sites that mirror/copy the actual content could use the robots.txt protocol as a legal defense against a company that tries to sue them for copyright infringement, arguing that the protocol is a well-known way to "opt out" of being cached/searched by bots.
Archive.org actually uses that legal defense. Any site that currently has a robots.txt file, even one added after the site was archived, won't be accessible through the archive.
- blackturtleus, on 07/28/2008, -0/+1I'd have preferred Perl code, but Ruby is easy to understand and so it'll do!!! Great article on a very interesting topic!!!
- sansculottes, on 07/28/2008, -1/+1The fact that Linux is a registered trademark does not preclude it from being "open source." The GPL is--admittedly, only theoretically at this point--enforceable by copyrights/trademarks. Ergo, much GPL software/media is copyrighted/trademarked precisely to protect it from someone else copyrighting/trademarking it with the intention of denying others the rights guaranteed by the GPL.
- jlebrech, on 07/28/2008, -1/+1Remember to check robots.txt when spidering or you'll get sued.
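As a rough illustration of the robots.txt check mentioned above, here is a minimal Python sketch using the standard library's urllib.robotparser. The spider name ExampleSpider, the example.com URLs, and the sample rules are made up for illustration and are not from the article; a real spider would download the site's actual robots.txt instead of parsing an inline sample.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
""".splitlines()


def build_parser(robots_lines):
    """Return a RobotFileParser loaded with the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


parser = build_parser(SAMPLE_ROBOTS)

# Ask before fetching: is this user agent allowed to request this URL?
allowed_public = parser.can_fetch("ExampleSpider", "http://example.com/index.html")
allowed_private = parser.can_fetch("ExampleSpider", "http://example.com/private/page.html")

print(allowed_public)
print(allowed_private)
```

Against a live site, the same parser can be pointed at the real file with set_url() followed by read() before calling can_fetch(). As DelMonte notes, the protocol is advisory, so the check only tells the spider what the site asks for.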
- Aciid, on 07/28/2008, -3/+3Yet another IBM article. IBM's Linux articles keep popping up on the Digg front page like every other day.....
- broodking, on 07/28/2008, -1/+1now that's a good article
- Rich43, on 07/28/2008, -1/+0Python has a Mechanize module too.
Perl is a bracket soup mess.... Python is clean :)
- staed, on 07/28/2008, -2/+0No, not really. I'm sorry. It won't happen again.
- crazyjake, on 07/28/2008, -5/+2I guess the Linux® name isn't open source.
- Skrezium, on 07/28/2008, -8/+1Buried for the "Linux®" lame thing.
- staed, on 07/28/2008, -7/+0***** you! Two of my friends died when they were bitten by a web spider!
- billizm, on 07/28/2008, -8/+1Why limit to Linux?