Looking for the perfect DB storage array for your hot Web 2.0 company?
blogs.smugmug.com — We do billions of queries per day in our DBs, so we have very specific requirements for what our storage does. It turns out YouTube has very similar requirements, so I wrote up a nice, detailed list of things to watch for if you're a startup looking to scale your web application.
- 663 diggs
- fkr3, on 10/12/2007, -36/+10
Step 1 - don't use MySQL, use an rdbms that actually scales out of the box. The time and money you waste making MySQL keep up with your needs is going to quickly overtake the cost of Oracle or SQL Server.
- onethumb, on 10/12/2007, -5/+19
Try reading the article (and linked items) before you post. :)
Paul, YouTube's DBA, has been an Oracle DBA for over 15 years. He ran PayPal's Oracle installation, handling billions of dollars.
He doesn't miss Oracle. MySQL is his choice at YouTube, YouTube obviously has a massive budget, and he's only been using it for 8 months. Sure doesn't sound like they're wasting time & money.
- fkr3, on 10/12/2007, -22/+13
I did. And they've obviously mucked around with a bunch of hardware and configurations to reach the optimal setup for MySQL and YouTube. And that mucking around is a waste of time and money since better solutions exist that scale natively.
There's a reason why none of the biggest databases in the world are MySQL and it's not because MySQL is "bad", it's because it's not "great". It's not a forward-compatible solution if you plan on growing huge.
And you're right, his snippet of a bio does say he doesn't miss Oracle much. But did you miss the bit where it also says .... "He has been a MySQL DBA for 8 months, solving relentless scalability challenges at YouTube."
Keywords being "relentless scalability challenges".
Thanks for playing.
- m3mnoch, on 10/12/2007, -0/+27
so, raid controller caching, striping, spindles... what parts of all of that have anything to do with the rdbms itself?
in addition to that, some facts:
1) youtube (google) has a *****-ton of money.
2) they can afford oracle, but aren't running it.
3) they can afford mssql, but aren't running it.
4) google and youtube are #3 and #4 heaviest trafficked sites on the web which speaks to their understanding of "load."
pretty much, it seems like a "if it's good enough for them, it's good enough for me" situation that relies on facts and not some random internet dork's opinion on how "mucking around with a bunch of hardware" is time and money wasted. especially considering paul's been an oracle dba for over 15 years.
some reading for you:
http://www.mysql.com/customers/customer.php?id=75
m3mnoch.
- geminitojanus, on 10/12/2007, -0/+8
"4) google and youtube are #3 and #4 heaviest trafficked sites on the web which speaks to their understanding of "load.""
To be fair, most of Google runs on its own internally created database software for its indexes, but YouTube was created by external people who were bought into Google, and MySQL was selected then.
That being said, if MySQL doesn't work for you, there's always PostgreSQL, which might be a bit more "traditional" and to your liking. Then again, MySQL is likely to be more than you ever need if you know what you're doing and your hardware is up to the task.
- djlosch, on 10/12/2007, -0/+5
step 0.1) make sure you're out of beta and your app actually has a sizable userbase before you start buying specialized equipment and office supplies. most startups focus on their "professional environment" before they focus on having an app that actually generates revenue. you can start your system on a decent dedicated host, then move to colo, then up from there.
- onethumb, on 10/12/2007, -1/+5
@geminitojanus:
True that Google uses lots of their own stuff (BigTable, MapReduce, Sawzall, etc.), but they do use a lot of MySQL internally, too. I think they just pick the best tool for each job, and sometimes that's MySQL and sometimes it's not.
- MattCruikshank, on 10/12/2007, -0/+5
@fkr3 - I don't mean to be rude, but you kind of sound like a fanboy. You're simply going to have to put up (tell us how much traffic you run, compared to SmugMug), or shut up.
Personally, I doubt you know what you're talking about, because you're comparing apples (expertise with the software) with oranges (expertise with the hardware).
- Duncan3, on 10/12/2007, -3/+4
They used Oracle at PayPal because it mattered if the database screwed up. They work very hard to screw their customers in other ways of course.
If Google loses match #1,938,267 for "cat" or a video, nobody really cares.
- gmillerd, on 10/12/2007, -0/+1
If your web2.0 company is so sophisticated out of the box that you cannot do it in mysql out of the gate you have other problems that this article can provide for.
- phlux, on 10/12/2007, -3/+5
I love articles like this.
I have been in IT for a long time - but I typically don't get down to DB issues like this, as I run large organizations that run all corporate services, rather than dealing directly with DBAs etc.
DBs are the biggest area of unknown for me, so I love candid real-world info about what's proper to do.
I am in the process of starting a site - and although I don't expect the load to be as large as these guys', I want to architect it correctly from the get go. I have no problem architecting for massive amounts of *network traffic* - but I fall a little short in architecting for actual transaction traffic (hitting a DB).
- tuzziel, on 10/12/2007, -20/+1
That's n00b talk, massive traffic site rules are:
1) do not use Apache
2) do not use MySQL
3) do NOT use PHP or other interpreters
4) do not use sockets or pipes to talk to your DB
know your HTTP code, know your DNS code, know your DB code (do not use some 3rd party *****)
new rule of 2007:
5) do not use harddisks (lol yes, real bottlenecks), unless you run YouTube of course
- optize, on 10/12/2007, -0/+7
I know plenty of "massive" sites that have more than one of those, if not all, and run fine.
If it runs like *****, it's most likely your code or structure.
- Xiretsa, on 10/12/2007, -0/+7
I thought Digg uses 1, 2 and 3. If they use number 4 or not I have no idea.
- onethumb, on 10/12/2007, -1/+5
I think you just described almost every "massive" traffic site you can name, including the one you're posting this comment on. Since you're obviously trolling, I think I'll leave it at that. Apologies for even taking a tiny taste of the troll bait. :)
- bobdobolena, on 10/12/2007, -4/+1
Whilst this article is interesting, I would probably wager that youtube/google/insert big name company here is still not using DAS for their infrastructure. What is also wrong in this article is that SAN storage does not _have_ to be all that expensive. A low-end HP EVA or Sun StorageTek 6140 would provide initial storage and allow startups to expand later while providing the performance that you would need. Add in a clustered filesystem that many DB front ends can access and you probably can get more performance that you can have fun with. :) Direct attached storage offerings are OK for small shops, but when you start talking spindles, etc., that is so 1997... If you are committed to Sun hardware I would speak to your reps about the StorageTek stuff they just bought, etc... Personally I would stay away from EMC ($$$). There are a lot of other SAN-in-a-box companies out there as well that sell some crazy stuff.
- onethumb, on 10/12/2007, -1/+4
I'll take your wager. I'll bet Google *and* YouTube *and* nearly any other big web company you can think of is either using internal disks or DAS on their DB masters and slaves.
The cost / performance benefit just isn't there with other solutions. Believe me, we've talked to Sun in obscene detail about their storage, and the SAN / NAS stuff just costs too much with limited (really, no) gain.
I seriously doubt you can find any major shop that doesn't care about spindles. It's not "so 1997" - it's crucial and it's an unavoidable physical problem. Disks, even the very fastest, only spin so fast. To handle a large load, you have to add more disks somehow. There are lots of ways to do that - partitioning your data, replication, "sharding," etc. But the fact remains - one way or another, you're going to care about spindles.
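To put rough numbers on the spindle point, here is a back-of-the-envelope sketch in Python. It is not from the article: the function names, the seek times, and the 10,000 IOPS target are illustrative assumptions, and it ignores controller caching, sequential I/O, and RAID overhead. The only physics in it is that, on average, a disk must wait half a revolution (plus a seek) before a random block comes under the head.

```python
import math


def service_time_ms(rpm: float, avg_seek_ms: float) -> float:
    """Approximate time for one random I/O on a single spinning disk."""
    # Average rotational latency is half a revolution, in milliseconds.
    avg_rotational_latency_ms = 0.5 * 60_000.0 / rpm
    return avg_seek_ms + avg_rotational_latency_ms


def max_random_iops(rpm: float, avg_seek_ms: float) -> float:
    """Rough upper bound on random I/Os per second for one disk."""
    return 1000.0 / service_time_ms(rpm, avg_seek_ms)


def spindles_needed(target_iops: float, rpm: float, avg_seek_ms: float) -> int:
    """Minimum disk count to serve target_iops of purely random I/O."""
    return math.ceil(target_iops * service_time_ms(rpm, avg_seek_ms) / 1000.0)


if __name__ == "__main__":
    target = 10_000  # hypothetical random-I/O load, not a figure from the article
    # (rpm, assumed average seek in ms) for three common disk classes
    for rpm, seek in [(7_200, 8.5), (10_000, 4.5), (15_000, 3.5)]:
        print(
            f"{rpm:>6} rpm: ~{max_random_iops(rpm, seek):.0f} IOPS per disk, "
            f"{spindles_needed(target, rpm, seek)} spindles for {target} IOPS"
        )
```

Whatever seek times you plug in, a single disk tops out at a few hundred random I/Os per second, so a heavy random workload has to be spread over more disks, whether by striping, replication, or sharding.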
- fenris6644, on 10/12/2007, -0/+5
Finally, an interesting technical article makes the home page. I was getting sick of seeing ***** like "creates index on your MySQL tables" or "do not execute SQL statements against the DB while iterating a loop in your PHP script."
- akirakurosawa, on 10/12/2007, -4/+5
Youtube, Digg and even Google (all use MySQL) don't require transactionally consistent data, unlike, say, Amazon, Paypal or eBay (all use Oracle).
And while i am a big fan of MySQL, let's not pretend it can handle the same amount of load as Oracle (Real Application Clusters, Partitioning, Advanced DW features, Parallel Queries, etc.).
If you want to see a real high-end database, check out DAYTONA. it was developed and used by AT&T and handles the largest production database in the world. check it out. http://www.trappedbydogma.com/blog/the-worlds-most-massively-scalable-database-management-system-daytona/
- nycmac247, on 10/12/2007, -4/+1
Xserve with IP over firewire for failover - why not mentioned?
BSD is not good enough?
http://www.apple.com/xserve/specs.html
??????
- Ndric, on 10/12/2007, -2/+2
Yeah, buy Apple, waste more money
- theduke25, on 10/12/2007, -0/+2
Maybe myspace should take notes on performance, lol
- ElMoselYEE, on 10/12/2007, -0/+3
myspace is beyond help
- willynilly, on 10/12/2007, -1/+1
It seems that the Digg community would like to style itself as technologically savvy. Yet it was dumb enough to promote a story using the ***** term "Web 2.0", after all these months of people making it clear that THERE IS NO "WEB 2.0".
So really, this site is not a source of reliable tech info. Congratulations.