38 Comments
- GerryBot, on 10/12/2007, -5/+21 Looks like everybody wants to kill off Google these days...
- aliengoods, on 10/11/2007, -3/+19 While I love open source, I don't believe an open source Google would work. You already have hordes of people gaming the Google system. With an open source model, I think they would be able to do it far quicker. What you would have is an arms race pitting SEOs against OSS developers, and SEOs get paid by the hour, so they don't mind the time.
- dorkus999, on 10/11/2007, -3/+11 Clone Google. Riiiiiight. Also, shoot me if my job title ever becomes "Search Evangelist".
- Robotsu, on 10/11/2007, -0/+8 I had this same idea sometime back in 2004 (I know, nobody cares). But to my surprise, there was already a company enacting it: http://en.wikipedia.org/wiki/Grub_%28search_engine%29
"Users could download the grubclient software and let it run during computer idle time. The client indexed URLs and sent them back to the main grub server in a highly compressed form. The collective cache could then be searched on the Grub website. Grub was able to quickly build a large cache by asking thousands of clients to cache a small portion of the web each."
However, it is now apparently defunct, so the idea is ripe for the taking. But will it actually work? It seems possible, but the monetization needed to keep the project current with evolving search engine tactics and features is going to be hard to find. This needs a real business solution as much as a technical one.
- Philodox, on 10/11/2007, -2/+8 I like how this "how to" article at no point actually discusses how to do it. The article makes some vague references to a distributed architecture and that's about it.
- fuzzmeister, on 10/11/2007, -0/+5 The main problem with having clients crawl the web could be bandwidth (for the clients). Having your computer downloading and uploading constantly 24/7 might draw your ISP's ire.
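The client-side crawling idea discussed above (Grub's approach of asking each client to cache a small portion of the web) needs some rule for deciding which client fetches which URLs. A minimal sketch, assuming a fixed number of clients and a hash of the host name as the assignment rule; the function names `assign_client` and `partition` are made up for illustration and are not taken from Grub or YaCy:

```python
import hashlib
from urllib.parse import urlsplit


def assign_client(url: str, num_clients: int) -> int:
    """Map a URL to a client index by hashing its host name.

    Hashing the host (not the full URL) keeps every page of one site
    on the same client. This is an assumed design choice.
    """
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_clients


def partition(urls, num_clients):
    """Split a list of URLs into one crawl list per client."""
    buckets = {i: [] for i in range(num_clients)}
    for url in urls:
        buckets[assign_client(url, num_clients)].append(url)
    return buckets


if __name__ == "__main__":
    todo = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.org/",
    ]
    for client, work in partition(todo, 3).items():
        print(client, work)
```

A real system would also have to cope with clients joining and leaving (a plain modulo reshuffles most assignments when the client count changes) and with the per-client bandwidth cost raised above.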
- Anpheus, on 10/11/2007, -3/+7 No, he's saying that with an open algorithm, it'll be much easier for SEOs to know exactly what to do in order to achieve results.
And the potential for surreptitious and seemingly innocuous modification of the algorithm is of course still there. Minor tweaks might not seem like a big deal, but a well-qualified computer scientist could easily turn a minor tweak into a profit.
- krif, on 10/11/2007, -0/+4 Are they talking about this then?
http://en.wikipedia.org/wiki/YaCy
- 1021, on 10/11/2007, -0/+3 "99% of people in IT have no idea what 'Distributed system' is..."
Hey, 99% of people in IT have no idea what anything fundamentally is in IT. This is why we can refer to Google and Wikipedia... just enjoy the irony of this statement in the context of this story.
- geminitojanus, on 10/11/2007, -2/+5 "I think you missed the point. Using an open source algorithm would let SEO specialists know EXACTLY what they must do to get top search results. And they WILL do it, whatever it takes."
Using Open Source Software lets Malware designers know EXACTLY what they must do to exploit the systems. And they WILL exploit them. Wait a second, how many major Linux exploits have there been over the years? As compared to closed-software exploits?
"The only REAL way to fight this is by putting most of the weight on how many "link to"s a site has, and the popularity of each site linking to the given site, which is part of what google already does."
Nobody said we have to do anything different than Google. All that matters is that it works. You realize that any system that can be constructed can also be exploited, but Open Source systems tend to move faster than they can be exploited. I can name you at least 10 Apache exploits I've seen over the years; none of them work today, and few of them worked in their heyday. People patched the system so quickly that by the time anyone had developed Malware against those specific defects, everyone had already patched their software and moved on.
The only thing SEOs gain by looking at the system is transparency; they can see exactly why top rated sites are top rated. That doesn't grant them the magical ability to beat the system, it just gives them the knowledge of how the system works. How is this any different than Google? Pretty much the exact specifications of Google's central software have been detailed and explained with great detail by Google engineers in patents, lectures, articles. SEOs developed software exactly to counteract that, but because Google is fast moving themselves, they can keep ahead. Because their systems are becoming smarter as Google's data centers get more power, they can do more sophisticated filtering.
Now imagine we removed the data center constraint; distributed computing means we can literally spread the network across the globe, no central computer necessary. Now, armor that system with techniques borrowed from P2P networks (BitTorrent would be a great place to start). Introduce Bayesian filtering (currently too computationally expensive to do as Google). Start actual parsing of the language and pages, statistically generating results based on more than who links to who and reverse indexes for words. We do have the technology to greatly hamper spammers, to make it so difficult to spam it's not worth their time. We should use it.
So no, I think you're the one that doesn't "get it". Open Source has faced these exact same criticisms for years now, and year after year Open Source projects globally stand up to them and defeat them. Just because this project is "hard" doesn't mean it's not worthwhile, nor does it mean it's impossible. The truth is, we've already started on this revolution; distributed hash caching of the Internet is being undertaken on a global scale with Coral. Pair Coral with a search engine for its hash tables and we can start replacing the "centralized" web with a "distributed web".
Do I think I'll see this in my lifetime? Probably not. Too many companies have too much money invested in search engines the way they're currently implemented, and in the way the net is currently architected, to move to newer systems. But is it possible? Hell yes, it's possible.
- SniperX, on 10/11/2007, -0/+3 You know... to add to my above statement, I would LOVE an alternative open AND distributed search system. The idea of Google being able to censor and restrict search results (which they have done many times before) has always left a bad taste in my mouth.
It might not be as good, as perfect, or as fast, but it would certainly be a welcome option to have available.
- 1021, on 10/11/2007, -0/+3 Distributed computing and P2P combined solve both questions raised above, but there needs to be some thought put into how the details will work. I believe this is doable... remotely, but doable nonetheless.
- tybris, on 10/11/2007, -3/+6 99% of people in IT have no idea what 'Distributed system' means, but it sure sounds fancy.
- Zachariah, on 10/11/2007, -0/+2 I don't care about replacing the search engine, but I'd love to install software on *my* server that's similar to all of google's online apps. That way Google wouldn't have all my data, but I could still have some fancy web apps.
- SniperX, on 10/11/2007, -1/+3 geminito:
I think you missed the point. Using an open source algorithm would let SEO specialists know EXACTLY what they must do to get top search results. And they WILL do it, whatever it takes. Yes, changes can then be made to ward off that series of SEO tweaks but no matter what is changed, the SEO people will adapt, and they don't even care about how their sites look as long as they're showing up at the top of search results (as we've seen). So the only people that will be hurt by this are the people running the legit sites. The only REAL way to fight this is by putting most of the weight on how many "link to"s a site has, and the popularity of each site linking to the given site, which is part of what google already does.
Thus this open system can never be as good as a closed system in this particular case.
- grumpyrain, on 10/11/2007, -2/+3 I agree that if such an approach were taken, then you would need safeguards in place to protect people from gaming the system, but I don't think that an open model inherently makes it easier. Yes the source code is public, so people can see how to popularise their sites, but achieving this would be much harder. You do not just have to fool one body (like Google), you need to fool a distributed body, each member of which has the potential to realise what you have done and report it.
Monopolies are not a good thing, and Google has an effective monopoly. Just because they are yet to abuse it does not mean that it is going to be run so responsibly forever. It is a bit like how a good dictator can lead in a manner that brings freedom and prosperity, but dictators are not good in the long run. Just look at the compromises made to enter the lucrative Chinese market, which in the not so distant future will be more profitable than the US and EU combined. In the long run, I don't think it is a healthy state of affairs when what I learn about a topic is largely sourced from the information indexed by a single company. Information should not be owned like that.
- gmillerd, on 10/11/2007, -2/+3 This idea is stupid; purely indexing the web has no value. Who gives a ***** if it's open source?
It's the personalized tuning, the "web-to-one" aspect that Google continues to strive for, that has value. Not to mention Google's integration with most everything of meaning push-wise: every time something changes, "Submit Form" -> "Store in Database" && "Submit to Google" { index && rank ?}. Most every blog does this, and Google almost has a virtual "peering agreement" with every big source of data out there.
These projects are just hype, a waste of cash until they find real jobs or a bigger free lunch.
@Riya.com is pretty slick though, would love to see some of that head over to GIS
- inactive, on 10/11/2007, -1/+2 You could just point out the link on Wikipedia instead of making fun of IT people's ignorance...
"Distributed computing is a method of computer processing in which different parts of a program run simultaneously on two or more computers that are communicating with each other over a network. Distributed computing is a type of parallel computing. But the latter term is most commonly used to refer to processing in which different parts of a program run simultaneously on two or more processors that are part of the same computer..."
More at: http://en.wikipedia.org/wiki/Distributed_system
- san1ty, on 10/11/2007, -0/+1 This doesn't explain how to do anything, it's just the over-caffeinated ramblings of some no-skillz blogger, typical.
- Anpheus, on 10/11/2007, -0/+1 I don't know how 'origins' got in there, but it's funny nonetheless.
- whackjob, on 10/11/2007, -0/+1 Jimmy Wales' Wikia Search is already trying to do this, and there are abundant discussions along those lines on its discussion list, search-l (c/o the Wikia Search Wiki).
- chrisutley, on 10/11/2007, -0/+1 "I'd love to install software on *my* server that's similar to all of google's online apps. That way Google wouldn't have all my data, but I could still have some fancy web apps."
Set up a Linux box at home and install Zimbra. That will get you part of the way there.
- BlipBertMon, on 10/11/2007, -0/+0 Dead on. Google spends hundreds of millions USD to index less than 20% of the total Web; they can't afford to build a compute farm of the capacity to index the *entire* Web. The only thing big enough to handle that job is the Web itself, all 8.426M machines as of last count in 1999 ( www.isc.org/index.pl?/ops/ds/ ). Google's 450,000 servers as of 2006 ( forums.randi.org/showthread.php?t=58518 ) can't hold a candle to that. A P2P F/OSS Google@Home could implement all kinds of rankings other than 'numbers of links to a page', which is only a popularity ranking, and only one of many other possible metrics of merit. The only reason Schmidt & co. haven't made the 'perestroika move' to abandon their catbird seat as the nexus through which the vast resources of the Internet are actually brought into usable focus is ... revenue? Right. I think somebody out there is draining a few billion $US of our collective attention-seconds and pixel real estate for something which I'd be happy to pay my power bill and give up a few stray CPU cycles for instead: to get *everything* on the Web indexed and fulltext-searchable.
- maninalift, on 10/11/2007, -0/+0 Why do you want to kill Google? For the first time one of the highest grossing companies is built on a fundamentally liberal model. Free access to consumers and free use for companies not charging the consumer, cooperation with open source community etc. It is a company that has an ethos of having a positive impact on society as well as making a profit.
OK so they harvest a lot of user information, but for the moment they only want to use it to display unintrusive adverts and improve their searches. I for one would rather have text links to things that are relevant to me than full page animated flash screens inviting me to spend my "bud bucks" on the Maxim party, or worse those bloody smilies saying "helooo". Something has to pay for the web, and advertising pays for the bits that are not paid for directly by the consumer. Take your choice: either you have volume of advertising or you have targeted advertising, which means someone knowing what links you are likely to click on. I'd rather the person with that data be the one with "don't be evil" over their desk.
Power may corrupt, but cooperation, not antagonism, is the answer.
The main reason Google search is so good is the famous Google-number backlink counting. It's patented. Tough. The Google servers are fast (gmail is so fast it makes me weep); distributed searches are not.
- aryo, on 10/11/2007, -1/+1 I think the open source part of it is good. A distributed search engine though? I don't know...
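The backlink-weighted ranking that several comments above refer to (weight a site by how many pages link to it and by the popularity of each linking page) can be sketched as a simple iterative computation. This is a simplified PageRank-style illustration under stated assumptions, not Google's actual algorithm; the function name `link_rank`, the damping value of 0.85 and the fixed iteration count are all assumptions made for the example:

```python
def link_rank(links, damping=0.85, iterations=50):
    """Score pages by incoming links, weighted by the linkers' own scores.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with every page equally popular.
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page gets a small base score regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(link_rank(web).items()):
        print(page, round(score, 3))
```

In this toy graph, page "c" ends up with the highest score because three pages link to it, and "a" comes second because its single incoming link is from the popular "c".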
- Anpheus, on 10/11/2007, -1/+1 Your example is apples and origins. In the case of the Linux operating system you're dealing with the definite flaws that may exist, that may allow an exploit to gain control of an account or the system itself. Those flaws can always be sealed and are frequently discovered and quietly patched. (People love to make a big deal out of how many security fixes Microsoft publishes, but they pay no mind to the hundreds of security patches that may occur to the Linux Kernel between even minor revisions.)
However, a search engine algorithm is a different beast; it is a subjective measure of the quality of a website. The problem is, once you've got an algorithm like that, something that turns one set of data into another, regardless of how difficult it is, you can undo it. And unlike a cryptographically secure hash or a one-way function, the input and output have strict rules that create semantic meaning.
Creating an open source Google and having it be successful would be like loading the gun and the target, and handing both to the SEOs. Google doesn't give them the gun or the target. When Google changes their algorithm slightly, the SEOs often lose a little ground, for a little while. But they have the time and money to adapt. In an open source algorithm they won't just be able to adapt afterward, they'll be able to study the algorithm before it goes live, or even implement it themselves and see just how well various ideas do. The very nature of an open source program is that the entire process should be transparent and interactive to anyone, and that the results can be replicated by anyone.
Open Source Google just wouldn't work, and I think Larry and Sergey know that; after all, they've open sourced so much. Of course they're reluctant to divulge the algorithm.
- maninalift, on 10/11/2007, -0/+0 Sorry, I see my mistake... the last comment about speed of distributed search was irrelevant.
- geminitojanus, on 10/11/2007, -3/+1 "With an open source model, I think they would be able to do it far quicker."
People have used that same argument with Open Source Software. It's failed every time. Open means agile; someone attempts to game the system, the system adapts. SEOs' attempts to engineer broken patches, backdoors, etc. will all be transparent and easily ignored. Distributed systems mean no one system holds the entire vote. Viruses and Malware are practically a non-issue, vs. closed systems being continuously exploited.
Using your logic, Apache, Linux, etc. would never exist/work.
- duster12, on 10/11/2007, -7/+3 One thing: who's gonna pay for the servers? I guess they forgot about that.
- Tarnum, on 10/11/2007, -5/+1 Who's gonna pay for the pipes? The 1000 TBps link you'd need to download the internet?
- sarusa, on 10/11/2007, -7/+3 If by 'open source' you mean '***** and inadequate', perhaps you are correct. Really, there are some areas where OSS just dominates and is awesome, and then there's doing something dumb for ideological reasons.
- jacqueschirac, on 10/11/2007, -7/+3 @aliengoods, yes but it doesn't have to accept every patch or something. Does Gnome accept every patch request? I think you're talking about Wikiasari - wiki search engine!!!!
- chrisutley, on 10/11/2007, -8/+2 Step 1. Collect Underpants
Step 2. ????
Step 3. Profit
- Dhalgren, on 10/11/2007, -7/+1 Ok, this is lame
- f00xx0riz3r, on 10/11/2007, -7/+1 Stupid blog spam about nothing. Just marketing drivel.
- tybris, on 10/11/2007, -7/+1 Project failed.
- Nebukadnezzar, on 10/11/2007, -6/+0 Isn't that what http://www.yacy.net/yacy/ is trying to do right _now_?

