67 Comments
- inactive, on 10/12/2007, -17/+72 Not wanting to flame Digg, but does anybody else find it to be one of the slowest sites to load? Especially when you consider the lack of advertising compared to other sites...
- pennyfan87, on 10/12/2007, -4/+29 The digg comments are the slowest pages on my personal internets. I'm pretty sure it's not the connection though. My CPU usage on a core shoots to 100% until the page finishes loading.
This happen to anyone else?
(BTW-I'm running Firefox on a 3GHz Pentium D)
- philovivero, on 10/12/2007, -4/+28 I had wanted to talk about how Digg is a community project, front *AND* back. On the front, the people choose the important stories for the world.
But even more subtly, the entire back-end of Digg is also a community project. The MySQL engine, PHP language, Linux OS, etc etc etc are all community projects, and Digg employees are buried deep in the tech community to make it work.
I just couldn't think of a way to say this in the talk that wouldn't sound cheesy or cliche, so I finally decided to just not launch into it. But here, here in the comments on this story, I'd like to say: hey tech community. Thanks. You guys rawk. I hope we've been able to help you out, too.
- fkr3, on 10/12/2007, -4/+26 I experience a slow-assed site constantly.
As for their scaling, I believe their approach is to just throw a ridiculous amount of hardware at the problem and hope it compensates for poor design decisions and architecture.
--------
http://plentyoffish.wordpress.com/2006/10/08/digg-is-doomed-unless-they-fire-their-tech-staff/
A reader pointed out that digg.com is doing 7 million pageviews and has 75 servers and not 25, a number which I thought was really bloated. Read about digg.com’s server setup here.
7 million pageviews / 24 hours / 60 minutes / 60 seconds = 81 Pageviews/second on average.
On average digg.com is serving about 1 pageview per second per server. Given that most of the pageviews are hitting the homepage and are CACHED, the servers are probably only handling sub 50 pageviews a second on average.
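For anyone who wants to check the averages quoted above, here is a minimal sketch in Python. It uses only the two figures from the post (7 million pageviews a day, 75 servers); the function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def pageviews_per_second(pageviews_per_day: float) -> float:
    """Average pageviews per second over a full day."""
    return pageviews_per_day / SECONDS_PER_DAY


def per_server_rate(pageviews_per_day: float, servers: int) -> float:
    """Average pageviews per second that each server would handle."""
    return pageviews_per_second(pageviews_per_day) / servers


# Figures quoted in the post: 7 million pageviews a day across 75 servers.
digg_rate = pageviews_per_second(7_000_000)
digg_per_server = per_server_rate(7_000_000, 75)

print(f"{digg_rate:.1f} pageviews/s, {digg_per_server:.2f} per server")
# prints "81.0 pageviews/s, 1.08 per server"
```

These are daily averages, so they say nothing about peak load or about how much of the traffic is served from cache.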
Using MySpace's most recent numbers of 1.5 billion pageviews a day they would be processing 17361 pages per second on average. If their infrastructure was as bad as digg.com’s they would need over 18,750 servers!!! I think digg.com wins the worst infrastructure/setup award of any major site hands down.
- brundlefly76, on 10/12/2007, -1/+19 "Preventing Digg’s enthusiastic developers from adding powerful but CPU-intensive features is "a political thing I constantly have to deal with as a DBA," said White."
Organizationally, that should not be a legitimate chokepoint for product development if you rethink your infrastructure.
I see this problem all the time with db-driven sites. When I worked at Yahoo!, this was dealt with very very early on, as many of their original senior engineers came from database companies.
Their solution: don't use DBMS systems!
That's right - the guys who previously developed enterprise database systems were the primary evangelists on how many ways you could *avoid* using a database for large-scale web applications and why it was almost always a better solution.
Overall, databases like MySQL are *way* overused for web sites, when there are far less complicated options which perform & scale much much better, have much better reliability, and far less management. In short, the answer is the FILESYSTEM.
When you access dynamically generated personalized pages at Yahoo!, you are working on the filesystem - your prefs are a small formatted text file (and there are *tons* more prefs across Yahoo! properties than digg), and your web page is assembled from snippets on disk, etc - THERE IS NO DATABASE HIT - READ OR WRITE.
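To make the "prefs are a small text file" idea concrete, here is a rough sketch in Python. It is not Yahoo!'s actual layout: the directory fan-out, the `.prefs` extension, the `key=value` format, and every function name are assumptions for illustration. It also assumes user IDs are safe to use as file names and that values contain no newlines.

```python
import hashlib
from pathlib import Path


def prefs_path(root: Path, user_id: str) -> Path:
    """Map a user to a file, fanning out over two directory levels so that
    no single directory has to hold millions of files."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return root / digest[:2] / digest[2:4] / f"{user_id}.prefs"


def save_prefs(root: Path, user_id: str, prefs: dict[str, str]) -> None:
    """Write prefs as key=value lines, replacing any previous file."""
    path = prefs_path(root, user_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    body = "".join(f"{key}={value}\n" for key, value in sorted(prefs.items()))
    tmp.write_text(body, encoding="utf-8")
    tmp.replace(path)  # rename over the old file so readers never see a partial write


def load_prefs(root: Path, user_id: str) -> dict[str, str]:
    """Read prefs back; a missing file simply means no prefs yet."""
    path = prefs_path(root, user_id)
    if not path.exists():
        return {}
    prefs = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        key, _, value = line.partition("=")
        prefs[key] = value
    return prefs
```

Reading a user's prefs is then one path computation and one file read, with no query involved. The trade-off is that questions the layout was not designed for (say, "which users chose this setting?") mean walking every file.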
The question: "how do I do this without a database" is asked for every single feature, and, invariably, there is a solution (yes there are exceptions but few).
As a result 300 engineers can launch new features on any site on Yahoo! every day without ever going through a database admin gatekeeper - it would be impossible to scale over 100 properties as fast as they did with that model, and it would hold up product development (which it clearly is doing at Digg).
- burtonbe, on 10/12/2007, -2/+11 That's a ***** for just text data.
- fkr3, on 10/12/2007, -3/+12 It wouldn't just sound cheesy or cliche, it'd be inaccurate.
Digg, the part we see in our browsers, is community driven. The rest is a proprietary and commercial business that happens to use some open source software on their servers, and those software packages have their own independent communities. Any overflow in userbase has nothing to do with digg itself.
What exactly has been given back, darkphan? What's digg done for mysql or php or fedora exactly? I don't recall seeing any "Digg sponsors improvement in [platform] they use!!!" stories.
- bofhcabbit, on 10/12/2007, -2/+11 Try the mobile version for a fast-loading digg: diggriver.com
- samuelcotterall, on 10/12/2007, -3/+11 30GB of data?
I'm actually surprised it's not greater than that.
- tpink, on 10/12/2007, -0/+7 "Digg tackles the scalability issue by distributing it across multiple low-medium cost servers. Which is the right way to do it."
If you disregard the cost of floor space, power, and cooling. Big servers tend to have smaller footprints and use less power than the equivalent number of cheap boxes. There are also benefits to be had in performance if you have multiple servers running on a single machine because inter-server network traffic moves at the speed of memory instead of the speed of physical ethernet cables and switches. Both architectures have their advantages and disadvantages, but there is no single silver bullet, right way to do it.
- bmeshier, on 10/12/2007, -0/+7 The slow loading is probably more attributable to the massive JavaScript libraries and not so much the back end data processing.
- merreborn, on 10/12/2007, -0/+7 Second Life is the antithesis of scalability. Being able to support a maximum of 40 users per server is not the sort of thing you want in a good scalable architecture.
- rudy23, on 10/12/2007, -0/+4 that's sharting
- willclarke, on 10/12/2007, -0/+4 I love Digg, but it's slow as hell. Show me an article about how Facebook scales and I'll digg it.
- dlsspy, on 10/12/2007, -0/+4 @coryking
The article states they have 98% reads, so an occasional sharded lock for a comment such as this is not a big deal.
BTW, memcached is awesome. I dislike MySQL as much as the next guy, but memcached is an appropriate thing to use when you have several machines and you want to have a consistent cache across them. I use it in a couple of applications where I programmatically generate images. It has an obvious positive performance benefit in my applications.
Similarly, in my last job, I was pushing several fairly intensive transactions through per second. Each one required loading a fairly complicated graph of objects at the beginning, and saving them at the end. The DB was a mix of Oracle and Postgres and still benefited greatly from avoiding DB access.
- geronimo, on 10/12/2007, -0/+4 "Not to brag, but once I dumped MySQL in favour of PostgreSQL"
Don't worry, I brag about postgresql all the time! (insert bagging..) Years ago I decided to test the two DBs side by side with about 100-200 threads doing writes/reads. I noticed mysql was blazingly fast with 1 or 2 threads. Then I upped it to 100-200 and it fell over while postgres chunked along like a steadfast cadillac. Postgres was designed right from the beginning, with transactions and ACID in mind. Mysql seems more like a hack to me, they hacked it to try and be consistent. I used mysql for years and I worked around all its little quirks, trying to get it to use indices, and we got it to work on a site with more traffic than digg. But, there were those odd corner cases that just left you scratching your head, postgres seems more deterministic - I can trust it. That's a bare minimum for me.
- coryking, on 10/12/2007, -0/+4 A 2% write rate is still a lot - that is 1 out of every 50 queries. That single write forces all the readers to sit around and wait until the data is flushed to disk (and I pray they are still using either fsync or a battery backed RAID controller).
Memcache is indeed awesome. Its simple API in Perl makes it a no-brainer for shared caching. I've got a page-count query on my site that is about as close to SELECT COUNT(*) as you can get that I cache. I'm also considering moving the session data out of the database completely and shoving it into memcached.
My "problem" with memcache is that it still adds a lot of complexity and back when my website was on MySQL, I had to shove everything in it. I had all kinds of strange voodoo bugs where users would post a comment or story and the change didn't immediately show up. It might have been a lot to do with the fact that memcached used to sit on expired data for a while before removing it to avoid race conditions. The codebase was becoming very brittle and bloated as I tried to fix all the bugs.
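The "posted a comment and it didn't show up" bug is easy to reproduce in miniature. The sketch below is a toy, not anyone's production code: `TTLCache` is an in-process stand-in for memcached (it is not the memcached client API), a plain dict plays the database, and all class and method names are invented for illustration.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a shared cache such as memcached."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None or entry[0] <= self._clock():
            self._items.pop(key, None)  # drop missing or expired entries
            return None
        return entry[1]

    def set(self, key, value):
        self._items[key] = (self._clock() + self._ttl, value)

    def delete(self, key):
        self._items.pop(key, None)


class CommentStore:
    """Cache-aside reads over a dict that plays the role of the database."""

    def __init__(self, cache: TTLCache):
        self._db = {}  # story_id -> list of comment texts
        self._cache = cache

    def comments(self, story_id):
        cached = self._cache.get(story_id)
        if cached is not None:
            return cached
        fresh = list(self._db.get(story_id, []))
        self._cache.set(story_id, fresh)
        return fresh

    def post(self, story_id, text, invalidate=True):
        self._db.setdefault(story_id, []).append(text)
        if invalidate:
            self._cache.delete(story_id)  # next read goes back to the database
```

With `invalidate=False` a new comment stays invisible until the cached entry expires, which is the stale-read symptom described above. Deleting the key on every write fixes it, at the price of cache bookkeeping on every write path in the application.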
Not to brag, but once I dumped MySQL in favour of PostgreSQL, and after I actually started to use complex queries, views, triggers, and all the other cool stuff - I was able to gut almost all the cache code I had written and magically all the voodoo bugs went away. The best part is since the database could do a significant amount of work for me, the size of the codebase went down considerably. The biggest win was the front page - probably 500 lines of code turned into about 6 lines and a database view :-)
- counterplex, on 10/12/2007, -0/+3 @brundlefly76
I've used this approach for a number of smaller things myself. Since the filesystem is a database itself (albeit a very lightweight one) it's a great resource to use when dealing with data that conforms to the limitations of a filesystem. The user preferences example is perfect. Session management is another.
- geronimo, on 10/12/2007, -0/+3 http://www.baselinemag.com/article2/0,1540,2084131,00.asp
"As of November, MySpace was exceeding the number of simultaneous connections supported by SQL Server, causing the software to crash."
...
"We were scratching our heads for about a month trying to figure out why our Windows 2003 servers kept shutting themselves off," Benedetto says. Finally, with help from Microsoft, his team figured out how to tell the server to "ignore distributed denial of service; this is friendly fire."
^^ that is exactly why I avoid Microsoft products. I guarantee that the MySpace sysadmins suffer from high stress levels. My servers are situated 10 feet from their servers in LA, and I sleep very well every night. No hidden gotchas, if something like that were to happen, I look at the code and it won't take me months to figure out the problem. It takes me hours, then I tell the maintainer of the open source package and I have a patch in a day at most. Just a few weeks ago I noticed a core dump in the open source 'file' package, I sent an email to the maintainer, had a patch in an hour. I sleep very well every night, but sometimes I have nightmares of my Microsoft days. Microsoft is working hand in hand propping MySpace up, I know that they did this with one of their largest Exchange customers (etrade).
- dasilva333, on 10/12/2007, -1/+4 FTA: Preventing Digg’s enthusiastic developers from adding powerful but CPU-intensive features is "a political thing I constantly have to deal with as a DBA," said White.
I really do wish they would add more features, like the greasemonkey duggmirror script and a damn picture section
- geronimo, on 10/12/2007, -0/+3 That is nice that they don't have to use transactions and can use ISAM tables, that helps a lot I imagine. That way their databases aren't taxed, which is important as MySQL w/ transactions doesn't scale as well as postgres. I have been doing 'sharding' since before the phrase 'sharding' was made - back in the 90's. 'Sharding' or partitioning across machines is the proven way to go. I think a lot of the slowness of digg that people report is actually due to the non-trivial amount of JavaScript code, when I view sites like reddit, rendering is much faster due to the JavaScript being more lightweight. I have noticed that sometimes I submit a comment and the refresh doesn't show my new comment, 10 seconds later it is there. I imagine they write directly to the database hosting the topic (assuming they partition based on topic which makes the most sense), then read the page from the cache, and there is a delay between the time you write to the DB and the time the topic is updated in the cache. This is a small price to pay for scaling given the architecture chosen.
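Partitioning by topic, as guessed at above, boils down to a routing function. This is a minimal Python sketch under that guess, not a description of Digg's real setup: plain dicts stand in for database servers, and `ShardedStore` and its methods are hypothetical names.

```python
import zlib


class ShardedStore:
    """Routes each topic to one of N shards; dicts stand in for database servers."""

    def __init__(self, shard_count: int):
        self._shards = [dict() for _ in range(shard_count)]

    def shard_for(self, topic_id: str) -> int:
        # crc32 gives the same answer in every process, unlike Python's
        # built-in hash() for strings, which is randomized per process.
        return zlib.crc32(topic_id.encode("utf-8")) % len(self._shards)

    def add_comment(self, topic_id: str, comment: str) -> None:
        shard = self._shards[self.shard_for(topic_id)]
        shard.setdefault(topic_id, []).append(comment)

    def comments(self, topic_id: str) -> list[str]:
        shard = self._shards[self.shard_for(topic_id)]
        return list(shard.get(topic_id, []))
```

Every read and write for a topic touches exactly one shard, so load spreads across machines. Simple modulo routing like this remaps most topics whenever the shard count changes, so real deployments usually add a lookup table or consistent hashing on top.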
I prefer the postgres+slony way, write to the master, query the slave database and use a cache for the queries on the slave.
- brundlefly76, on 10/12/2007, -0/+3 "Digg runs on less than 10 web servers and around 40 mySQL servers"
As far as web architectures go, that is my worst nightmare.
- SoCalDissident, on 10/12/2007, -8/+11 I love LAMP!
- DeadlyBrad42, on 10/12/2007, -0/+3 The site, which lets its users vote on, or "digg," their favorite news stories hosted on other sites...
Ohhhh, that's how it works...
- Kailash.Nadh, on 10/12/2007, -0/+3 100 servers and 30GB data. Regardless of the traffic, the numbers seem bloated.
- coryking, on 10/12/2007, -0/+3 "That is nice that they don't have to use transactions and can use ISAM tables, that helps a lot I imagine"
Until you realize that MyISAM locks the entire table for an update/insert... goodbye scaling, hello gobs of caching code. Arg... I hate MySQL with a passion. The minute you see the word "memcached" you know the system is backed with MySQL.
- inactive, on 10/12/2007, -0/+3 This explains why Digg's search is so critically useful.
Wait...
- motionblur, on 10/12/2007, -1/+4 @ fkr3: At the Web 2.0 Expo last week, Owen Byrne mentioned that Digg runs on less than 10 web servers and around 40 mySQL servers. The rest are for staff and other non-site related functions.
- brundlefly76, on 10/12/2007, -0/+2 @coryking
Lol I get this all the time when I consult.
There is a data management learning curve for programmers - first they learn to use flat files for data, maybe even small flat file databases - then they learn how to use a relational database, which is very powerful, and then they start thinking of all data management in terms of databases. At the top of the curve where you manage dynamic web services at the highest scale you have to dispose of the entire notion of the relational database wherever possible, and start working with custom filesystem solutions on extremely fast and reliable NAS (usually NetApps).
So, you are associating flat files with naivety because that's where you started as a programmer, but there is massive scalability to be found there, and it is exploited by the largest data applications in the world.
Also - you are thinking in terms of complex queries, not data access. Most web applications either do not need to do a complex query or they can lay their data out on the filesystem, indexed in anticipation of the specific data questions they need answered - it's all about specificity vs flexibility.
- Ellsass, on 11/05/2008, -1/+3 FTA: "The other atypical feature of Digg’s setup is its use of what Tim Ellis, another Digg engineer, calls 'sharding.'
A term apparently coined by Google engineers, sharding involves breaking a database into smaller parts in order to isolate heavy loads for better performance."
I thought it was when you think you're going to pass gas but you end up soiling yourself... I guess I know nothing about web sites :(
- coredump0x01, on 10/12/2007, -1/+3 @pennyfan87
I experience that too, mostly when expanding buried comments. Firefox (2.0.0.3 + Arch Linux + AMD Athlon 64 4000+) shoots to ~95% CPU usage. It's especially chunky when the page has 500+ comments (like a fat person running on a treadmill). I'm no web programmer but I assume it's because of the way Digg uses AJAX/CSS (or whatever) to render the comment pages.
One thing that helps ease the pain is using a version of Firefox that's compiled with optimizations for your processor. If you use Linux, I recommend getting Swiftfox ( http://getswiftfox.com ). Go to the download page and grab a copy that matches your processor type. It really helps reduce the chunkiness: the folding/unfolding of buried comments is a lot smoother on pages with a ***** of comments, and pages with ~100 comments are silky smooth.
- ricksite, on 10/12/2007, -1/+3 fkr3, although many diggers are probably up at all hours of the night, there are probably higher peak loads during parts of the day. Average pageviews per minute don't mean much here.
- Robotsu, on 10/12/2007, -0/+2 Funny this article comes up, IBM's developerWorks just had an article on using Memcache with PHP to speed up your web application:
http://www-128.ibm.com/developerworks/library/os-php-fastapps3/index.html?ca=drs-&ca=dkw-php
- classicrock39, on 10/12/2007, -0/+2 I appreciate all of the technology that LAMP has to offer, don't get me wrong.
The db is not always the fastest, but it's more manageable and can be very fast. MySQL is great and getting better, but there are better databases and geez, maybe you have to pay. How much effort is it to maintain 25 servers? We just bought a 4CPU dual core IBM 3755 w/32GB for $26k. Hmm...
- donjaime, on 10/12/2007, -2/+4 Floor space and cooling are addressed by using server racks. By multiple low-medium cost servers, I didn't mean separate towers filling up a datacenter.
There are cost/size limitations to memory on single boxes. If your bottleneck is the database, generally you are I/O bound and have issues with constant paging to disk. The more disks the better.
"There are also benefits to be had in performance if you have multiple servers running on a single machine because inter-server network traffic moves at the speed of memory instead of the speed of physical ethernet cables and switches"
Not if you are I/O bound. If you can fit the entire database in memory, then you are right. Networks are FASTER than disk accesses. It is easier to fit more data in memory on multiple machines, than it is to have it on a single box with a lot of memory.
The cost of a single machine with 32GB of RAM is not worth it given cheaper alternatives. Power footprints are generally better for big boxes, but that difference is small compared to the cost of the hardware. Digg has smart people running their infrastructure. They have 100 machines for a reason.
- alex4u2nv, on 10/12/2007, -0/+2 From the article: "Most people come to Digg’s front page, read it and leave, which is kind of nice," said Ellis
Makes sort of sense. Ever tried using Digg's search feature? It seems they focus on the idea quoted above, because features such as the search function are very poor at returning accurate results, when they aren't returning an error.
- coryking, on 10/12/2007, -0/+2 Yikes! 20 database servers! WTF are these guys doing?
"...Also, Digg was having a problem with its storage misreporting the status of data synchronizations. "Our hardware wanted to be fast," White said. "It was telling us things were synced to disk when it was not."
Finally, there is the mundane challenge of minimizing "schema cruft," or redundant tables of data which, if read, can slow down performance, said White..."
In other words, the people who originally designed this stuff had no idea how to design a database schema. Somehow, "premature optimization" comes to mind. Hundred bucks says somebody said "we are smarter than our database! let's denormalize everything so we can 'scale'" or worse "The database should be dumb! Everything should be done in the application code!! Long live David Heinemeier Hansson!!!".
Sounds like new DBA has quite a job ahead of him :-)
- coredump0x01, on 10/12/2007, -0/+1 To those who are commenting on how slow Digg loads consider this: Think about how many people view/refresh Digg, how many aggregators leech from the homepage/RSS feed, and how quickly lesser webservers die out even before they hit the frontpage. Of course a more thought out infrastructure would benefit Digg, but the same can be said for any web service. Look at Google, they run on redundant hardware with as little as 533MHz to 1.4GHz systems ( http://en.wikipedia.org/wiki/Google_platform#Server_hardware_and_software ) despite the mass, and despite the power consumption, there's room for innovation in everything. And the cost savings they reap from using "trash" hardware is of great benefit infrastructure-wise. Think about how swiftly Google can fulfill a search request running on redundant "trash" hardware running on customized Linux software and think about how many queries Google must fulfill every hour. If you scale that to the much smaller user demand of Digg (in comparison to Google) it's probably more efficient (cost-wise) to throw more "trash" hardware at the problem than to rethink your entire infrastructure.
/drunken rant
- panique, on 10/12/2007, -0/+1 Uh yeah. When loading the "main" page, or in my case, the Technology topic page, it is the slowest site I have ever experienced, and the first time I used the Internet, all I had was a 14.4K modem, which I promptly went out and replaced with a 28.8. The comments pages seem to be pretty snappy unless there are 160+ comments, then the browser chugs rendering the page.
- coryking, on 10/12/2007, -3/+4 "Flatfiles are faster than a database" is a claim only a novice would make. You really think the filesystem, which has many of the same problems as a database, will somehow be faster when it has 2 million "rows" (er, files)?
Look up "indexes" and "binary trees" and get back to me with how your file based design will scale like a modern relational database.
- skyfire1, on 10/12/2007, -0/+1 Unless a truther posts I never have to wait but a few seconds for the page to load.
- donjaime, on 10/12/2007, -2/+3 Scale out. Scaling up refers to improving the hardware on individual machines.
Digg tackles the scalability issue by distributing it across multiple low-medium cost servers. Which is the right way to do it.
- dbr_onix, on 10/12/2007, -0/+1 Err, Ruby On Rails seems far more "bloated" than PHP does - C or some other native language would make far more sense
- FairlyStupid, on 10/12/2007, -0/+1 How do HUGE sites like Facebook or MySpace pull off this infrastructure setup?
- smackhero, on 10/12/2007, -0/+1 @coryking:
PostgreSQL is a more full-featured database, but MySQL is faster. That's why a lot of web developers use MySQL, because they don't need the extra feature set of PostgreSQL, and they benefit far more from the faster speed of MySQL.
- puggy, on 10/12/2007, -0/+1 Clustering is supposed to "scale out" not "scale up". Scaling-up is when you upgrade the hardware on a node.
- smackhero, on 10/12/2007, -0/+1 i would look to google rather than myspace for a good distributed system. myspace is always having problems and not working right. while the site was based on some pretty good ideas, the implementation is horrible.
- KIERANMULLEN, on 10/12/2007, -0/+1 There should be an option in the user's profile to have comments hidden or shown by default.
Off = Quicker loading times.
Hopefully each comment is not a separate sql entry. That wouldn't make sense.
- petsounds, on 10/12/2007, -0/+1 The slow loading is not only from the JS libraries, but also from the browser polling for responses from the ad networks serving the banner ads. Just watch the status bar...when the ads finally load in suddenly the page content is up in no time. This is not specific to digg; most of these ad revenue type of sites have this issue. The obvious solution is to delay the ad requests until the page content is loaded, but that never seems to happen.
- schestowitz, on 10/12/2007, -0/+1 I think the initial design is to blame here. They weren't expecting it to get that huge, but it's too late to redesign. It's like the design of the World Wide Web/Internet.