54 Comments
- thinkingserious, on 12/01/2008, -4/+27I always thought it had something to do with gnomes.
- zoydberg, on 12/03/2008, -3/+23http://64.233.183.132/search?q=cache:n1L5gK5nA0UJ: ...
What Your Computer Does While You Wait
This post takes a look at the speed - latency and throughput - of various subsystems in a modern commodity PC, an Intel Core 2 Duo at 3.0GHz. I hope to give a feel for the relative speed of each component and a cheatsheet for back-of-the-envelope performance calculations. I’ve tried to show real-world throughputs (the sources are posted as a comment) rather than theoretical maximums. Time units are nanoseconds (ns, 10^-9 seconds), milliseconds (ms, 10^-3 seconds), and seconds (s). Throughput units are in megabytes and gigabytes per second. Let’s start with CPU and memory, the north of the northbridge:
Latency and throughput in an Intel Core 2 Duo computer, North Side
The first thing that jumps out is how absurdly fast our processors are. Most simple instructions on the Core 2 take one clock cycle to execute, hence a third of a nanosecond at 3.0GHz. For reference, light only travels ~4 inches (10 cm) in the time taken by a clock cycle. It’s worth keeping this in mind when you’re thinking of optimization - instructions are comically cheap to execute nowadays.
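The cycle-time and light-distance figures are easy to check. Here is a minimal Python sketch of the arithmetic; the function names are illustrative, and only the 3.0GHz clock comes from the post:

```python
# Back-of-the-envelope check: how long is one clock cycle,
# and how far does light travel in that time?

SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre
CLOCK_HZ = 3.0e9                      # the 3.0GHz Core 2 Duo from the post


def cycle_time_ns(clock_hz: float) -> float:
    """Duration of one clock cycle in nanoseconds."""
    return 1e9 / clock_hz


def light_distance_cm(seconds: float) -> float:
    """Distance light covers in a vacuum during `seconds`, in centimetres."""
    return SPEED_OF_LIGHT_M_PER_S * seconds * 100


cycle_ns = cycle_time_ns(CLOCK_HZ)
distance_cm = light_distance_cm(cycle_ns * 1e-9)
print(f"{cycle_ns:.3f} ns per cycle, light travels {distance_cm:.1f} cm")
```

The result lands at roughly a third of a nanosecond and about 10 cm, matching the numbers quoted above.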
As the CPU works away, it must read from and write to system memory, which it accesses via the L1 and L2 caches. The caches use static RAM, a much faster (and more expensive) type of memory than the DRAM used as the main system memory. The caches are part of the processor itself, and in exchange for the pricier memory we get very low latency. One way in which instruction-level optimization is still very relevant is code size. Due to caching, there can be massive performance differences between code that fits wholly into the L1/L2 caches and code that needs to be marshalled into and out of the caches as it executes.
Normally when the CPU needs to touch the contents of a memory region they must either be in the L1/L2 caches already or be brought in from the main system memory. Here we see our first major hit, a massive ~250 cycles of latency that often leads to a stall, when the CPU has no work to do while it waits. To put this into perspective, reading from L1 cache is like grabbing a piece of paper from your desk (3 seconds), L2 cache is picking up a book from a nearby shelf (14 seconds), and main system memory is taking a 4-minute walk down the hall to buy a Twix bar.
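The office analogy appears to stretch one clock cycle into one second. The sketch below assumes that reading, so the 3 and 14 cycle figures for L1 and L2 are inferred from the analogy rather than stated outright; the ~250 cycles for main memory comes from the text:

```python
# Convert latencies in cycles to nanoseconds and to the "office" time scale,
# assuming the analogy maps one clock cycle to one second.

CLOCK_HZ = 3.0e9  # 3.0GHz Core 2 Duo from the post

# Cycle counts: main memory is from the text; L1/L2 are inferred from the
# 3-second and 14-second figures in the analogy (an assumption).
LATENCY_CYCLES = {"L1 cache": 3, "L2 cache": 14, "main memory": 250}


def cycles_to_ns(cycles: int, clock_hz: float = CLOCK_HZ) -> float:
    """Wall-clock duration of `cycles` clock cycles, in nanoseconds."""
    return cycles / clock_hz * 1e9


def human_scale(cycles: int) -> str:
    """Format a latency as if each cycle lasted one second."""
    minutes, seconds = divmod(cycles, 60)
    return f"{minutes} min {seconds} s" if minutes else f"{seconds} s"


for name, cycles in LATENCY_CYCLES.items():
    print(f"{name}: {cycles} cycles = {cycles_to_ns(cycles):.1f} ns, "
          f"feels like {human_scale(cycles)}")
```

Under this assumption, the 250-cycle memory access works out to a little over four minutes on the human scale, consistent with the walk down the hall.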
The exact latency of main memory is variable and depends on the application and many other factors. For example, it depends on the CAS latency and specifications of the actual RAM stick that is in the computer. It also depends on how successful the processor is at prefetching - guessing which parts of memory will be needed based on the code that is executing and having them brought into the caches ahead of time.
Looking at L1/L2 cache performance versus main memory performance, it is clear how much there is to gain from larger L2 caches and from applications designed to use them well. For a discussion of all things memory, see Ulrich Drepper’s What Every Programmer Should Know About Memory (pdf), a fine paper on the subject.
People refer to the bottleneck between CPU and memory as the von Neumann bottleneck. Now, the front side bus bandwidth, ~10GB/s, actually looks decent. At that rate, you could read all of 8GB of system memory in less than one second or read 100 bytes in 10ns. Sadly this throughput is a theoretical maximum (unlike most others in the diagram) and cannot be achieved due to delays in the main RAM circuitry. Many discrete wait periods are required when accessing memory. The electrical protocol for access calls for delays after a memory row is selected, after a column is selected, before data can be read reliably, and so on. The use of capacitors calls for periodic refreshes of the data stored in memory lest some bits get corrupted, which adds further overhead. Certain consecutive memory accesses may happen more quickly but there are still delays, and more so for random access. Latency is always present.
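To see why the theoretical bandwidth overstates things, here is a rough sketch of the arithmetic. The transfer-time figures follow directly from the ~10GB/s number; the "effective throughput" part is a deliberately crude model (one full memory latency of ~250 cycles, then streaming at peak rate) and is an assumption for illustration, not a description of how the memory controller actually behaves:

```python
# Theoretical front side bus transfer times, plus a crude model of what a
# single isolated read costs once memory latency is included.

FSB_BYTES_PER_S = 10e9            # ~10GB/s theoretical maximum from the post
MEMORY_LATENCY_S = 250 / 3.0e9    # ~250 cycles at 3.0GHz, from the post


def transfer_time_s(num_bytes: float, bytes_per_s: float = FSB_BYTES_PER_S) -> float:
    """Time to move `num_bytes` at the peak rate, ignoring all latency."""
    return num_bytes / bytes_per_s


def effective_bytes_per_s(num_bytes: float) -> float:
    """Crude model (assumption): pay one full latency, then stream at peak."""
    return num_bytes / (MEMORY_LATENCY_S + transfer_time_s(num_bytes))


full_memory_s = transfer_time_s(8e9)          # all of 8GB
small_read_ns = transfer_time_s(100) * 1e9    # 100 bytes
small_read_effective = effective_bytes_per_s(100)

print(f"8GB at peak rate: {full_memory_s:.1f} s")
print(f"100 bytes at peak rate: {small_read_ns:.0f} ns")
print(f"100 bytes with latency: {small_read_effective / 1e9:.2f} GB/s effective")
```

Even in this simplified model, a small isolated read sees only about a tenth of the headline bandwidth, because the latency dwarfs the transfer time.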
Down in the southbridge we have a number of other buses (e.g., PCIe, USB) and peripherals connected:
Latency and throughput in an Intel Core 2 Duo computer, South Side
Sadly the southbridge hosts some truly sluggish performers, for even main memory is blazing fast compared to hard drives. Keeping with the office analogy, waiting for a hard drive seek is like leaving the building to roam the earth for one year and three months. This is why so many workloads are dominated by disk I/O and why database performance can drive off a cliff once the in-memory buffers are exhausted. It is also why plentiful RAM (for buffering) and fast hard drives are so important for overall system performance.
While the “sustained” disk throughput is real in the sense that it is actually achieved by the disk in real-world situations, it does not tell the whole story. The bane of disk performance is seeks, which involve moving the read/write heads across the platter to the right track and then waiting for the platter to spin around to the right position so that the desired sector can be read. Disk RPMs refer to the speed of rotation of the platters: the faster the RPMs, the less time you wait on average for the rotation to give you the desired sector, hence higher RPMs mean faster disks. A cool place to read about the impact of seeks is the paper where a couple of Stanford grad students describe the Anatomy of a Large-Scale Hypertextual Web Search Engine (pdf).
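The link between RPM and waiting time can be made concrete. On average the desired sector is half a revolution away, so the average rotational latency is half the time of one full rotation. A small sketch follows; the RPM values are common drive speeds chosen for illustration, not figures from the post:

```python
# Average rotational latency: on average, the platter must turn half a
# revolution before the desired sector passes under the head.


def avg_rotational_latency_ms(rpm: float) -> float:
    """Average wait for the platter to rotate into position, in milliseconds."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2


# Illustrative drive speeds (assumptions, not taken from the post).
for rpm in (5400, 7200, 10_000, 15_000):
    print(f"{rpm} RPM: {avg_rotational_latency_ms(rpm):.2f} ms average rotational latency")
```

This covers only the rotational part of a seek; moving the heads to the right track adds further time on top.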
When the disk is reading one large continuous file it achieves greater sustained read speeds due to the lack of seeks. Filesystem defragmentation aims to keep files in continuous chunks on the disk to minimize seeks and boost throughput. When it comes to how fast a computer feels, sustained throughput is less important than seek times and the number of random I/O operations (reads/writes) that a disk can do per time unit. Solid state disks can make for a great option here.
Hard drive caches also help performance. Their tiny size - a 16MB cache in a 750GB drive covers only 0.002% of the disk - suggests they’re useless, but in reality their contribution is allowing a disk to queue up writes and then perform them in one bunch, thereby allowing the disk to plan the order of the writes in a way that - surprise - minimizes seeks. Reads can also be grouped in this way for performance, and both the OS and the drive firmware engage in these optimizations.
Finally, the diagram has various real-world throughputs for networking and other buses. Firewire is shown for reference but is not available natively in the Intel X48 chipset. It’s fun to think of the Internet as a computer bus. The latency to a fast website (say, google.com) is about 45ms, comparable to hard drive seek latency. In fact, while hard drives are 5 orders of magnitude removed from main memory, they’re in the same order of magnitude as the Internet. Residential bandwidth still lags behind that of sustained hard drive reads, but the ‘network is the computer’ in a pretty literal sense now. What happens when the Internet is faster than a hard drive?
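The orders-of-magnitude comparison can be sketched with a logarithm. The memory latency (~250 cycles at 3.0GHz) and the 45ms Internet latency come from the text; the 10ms disk seek is an assumed round number for illustration, since the exact seek time lives in the diagram rather than in the prose:

```python
import math

# How many powers of ten separate two latencies?


def orders_of_magnitude_apart(slow_s: float, fast_s: float) -> float:
    """Base-10 logarithm of the ratio between a slow and a fast latency."""
    return math.log10(slow_s / fast_s)


MEMORY_LATENCY_S = 250 / 3.0e9   # ~250 cycles at 3.0GHz, from the post
INTERNET_LATENCY_S = 45e-3       # ~45ms to a fast website, from the post
DISK_SEEK_S = 10e-3              # ASSUMPTION: a round 10ms seek for illustration

disk_vs_memory = orders_of_magnitude_apart(DISK_SEEK_S, MEMORY_LATENCY_S)
internet_vs_disk = orders_of_magnitude_apart(INTERNET_LATENCY_S, DISK_SEEK_S)

print(f"Disk seek vs main memory: {disk_vs_memory:.1f} orders of magnitude")
print(f"Internet vs disk seek:    {internet_vs_disk:.1f} orders of magnitude")
```

With the assumed seek time, the disk sits about five powers of ten above main memory and less than one power of ten below the Internet round trip.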
I hope this diagram is useful. It’s fascinating for me to look at all these numbers together and see how far we’ve come. Sources are posted as a comment. I posted a full diagram showing both north and south bridges here if you’re interested.
- Myztry, on 12/03/2008, -3/+16 All this leads back to how bloatware Operating Systems are killing the advances made in computing. Gates' Law is undermining Moore's Law. Stalling the hardware left, right and center.
Layer upon layer of compatibility hacks inbred all the way back to QDOS. Applications no matter how well designed internally must comply with, and pass through the quagmire of Windows, and unavoidably get bogged down.
Writing applications for the Amiga was a joy. But then it was designed from the ground up as a modern computer, and was thus able to run like one in less than 512k, and on a 7MHz processor.
As I sit here on my 3.2GHz Quad core, it is obviously apparent that Operating Systems haven't got 1700 times more complex. It's just the Operating System is at the very least hundreds of times less capable of efficiently utilizing given hardware. Our GHz are being stolen...
That simple fact pisses me off to no end!
- jasonh1234, on 12/03/2008, -1/+9 Nothing.
It's Ceiling Cat you need to worry about.
- Finalreminder, on 12/03/2008, -0/+8 That was so interesting.
Has anyone got a flow chart on what my boiler does when I'm sleeping?
- d0nkeym0nkey, on 12/03/2008, -1/+8 Because digg is the only way to access the web, right?
- hartley, on 12/03/2008, -1/+8 it does.
well, at least for desktop environments.
- jerwong, on 12/03/2008, -0/+7 How about "What the Server Does While you Wait"?
- walk1355, on 12/03/2008, -1/+7Is it really that difficult to press the 'h' key on your keyboard?
- inactive, on 12/03/2008, -0/+6right
- inactive, on 12/03/2008, -0/+6part of the lesson is that if you want to save money, go for the cheapest processor with a certain L2 cache size because that will make the biggest difference in performance, not the clock speed or the front side bus speed.
- walk1355, on 12/03/2008, -0/+5yep yep
- Giga, on 12/03/2008, -1/+6Yes. Yes it is. Wy do you ave to be so mean?
- frsrblch, on 12/03/2008, -0/+5That was a pretty shortsighted comment, seeing as the page is down already.
- nunlover, on 12/03/2008, -0/+5mine stands over my bed and watches me while i sleep...or my dad, or somebody...
- headband, on 12/03/2008, -0/+5http://www.tomshardware.com/reviews/cache-size-mat ...
actually quite a bit of difference
- JKAL, on 12/03/2008, -1/+5 how long is it going to take for you n00bs to learn the amount of diggs != the total traffic, even if it all came via Digg, most people just check the link and move on without digging.
- torressr3, on 12/03/2008, -1/+5how does i web outside digg?
- jbmcb, on 12/03/2008, -0/+4He's using wifi - which adds in quite a bit of latency, especially if you're using a bandwidth-enhancing technology that relies on multipath - the driver needs to reassemble the modulated signal and process out the bits, which takes time.
- BobCFC, on 12/03/2008, -0/+3Change title to what Firefox does while you wait for the bloody server.
- damien1989, on 12/03/2008, -0/+3absolutely
- JerodSlay, on 12/02/2008, -2/+5Pretty awesome.
- garvallagh, on 12/03/2008, -0/+3Right. ***** out lads, its a tech article.
- catinthebox, on 12/03/2008, -0/+3I just assumed that they all secretly networked to plot the end of human kind.
- torressr3, on 12/03/2008, -0/+2mirror(google cache):
http://209.85.129.132/search?q=cache:n1L5gK5nA0UJ: ...
- Lokomis, on 12/03/2008, -1/+3 http://www.instantrimshot.com/
- Myztry, on 12/03/2008, -0/+2All (x86) processors do the equivalent functions. Which is what the article is directly about. Linux does similar things (caching, etc) but it does not do the same thing as Windows. You just don't get the same drive thrashing on Linux like is so common on equivalent spec Windows systems for a start.
Ram and the HDD are bottle necks for the processor. I agree. The whole point you are missing is that poorly designed systems bloat out, and overflow into these bottlenecks. An Operating System is meant to manage these resources, and Windows in particular does so very poorly. The fat man can't complain about the size of the doorway!
Microsoft has near unlimited resources. Linux is a community largely of volunteers. Yet, Linux is gaining ground, even dominating The Internet and Supercomputers. The 'heavy lifting' computer applications. Windows 7 may provide some hope, as Linux has caused Microsoft to try to match efficiency, so they too can be a player in the convenience sized device market.
You are right about it being pointless optimizing obscure rarely used application functions. But WTF has that got to do with it. I'm talking Operating System, and in particular the one that is the lowest common denominator.
If you cut 1 second off task times at that level, for the user base of Windows, you save millions of seconds of people's time. Quite substantial. The Vista file copy bug cost 10's of thousands of man hours. They have improved on that with SP1, but it is nowhere near fixed.
Hard enough getting XP, let alone Vista in 70 years. If Vista was still relevant (when Microsoft is trying desperately to push past it now) then it would indeed be a sad state of affairs.
- exscape, on 12/03/2008, -0/+2 Well, perhaps in theory when you look at this... but do you have ANY benchmarks to back your claims up? I'll bet that a 3 GHz C2D with 6MB L2 is faster than a 2.8 GHz/12MB at most tasks.
- damien1989, on 12/03/2008, -0/+2yeh, i looked at the link, thats all thats needed
- GorfTron, on 12/03/2008, -4/+5What does it do while I jerk-off?
Giggity.
- Trax91, on 12/03/2008, -0/+1 It probably went through Reddit traffic before reaching Digg.
- choicenotchance, on 12/03/2008, -1/+2Great article!
- fr34k5h0w, on 12/03/2008, -0/+1It makes one wonder how much overhead different application designs have. For example, the MVC methodology. True it makes code extremely reusable, but how many clock cycles are spent communicating between the various objects compared to a procedural program doing the same tasks?
- adlep, on 12/03/2008, -0/+1Person writing this blog is very smart. All of his entries are very well written...
- fidgy, on 12/04/2008, -0/+1You're totally right Myztry.
- jbmcb, on 12/03/2008, -1/+1Really nice overview, should go a bit more in depth about SIMD though, with instruction prefetching and a lot of pipes it's possible for commands to take zero cycles, which is a cool feature.
That's a sweet blog, by the way.
- toastgodsupreme, on 12/03/2008, -2/+2 bleh, site is fail. but what IS interesting is to download diskmon and watch it. seriously, it's actually kind of interesting to see all the stuff behind the scenes just as your computer sits there and "does nothing".
- exscape, on 12/03/2008, -1/+1I never said cache size doesn't matter, because it clearly does (until the point where the increased latency slows down more than added cache speeds up). All I said was I think core frequency usually (not always, either) does more for performance.
- Thoku, on 12/03/2008, -1/+1Gigabit ethernet is 125MB/s, not 30...
- erichh, on 12/03/2008, -1/+1 The L2 cache can catch you out when working on large data structures. I remember when L2 caches were 4MB getting caught out scaling images for a texture mapped game. Once the images approached 4MB in size the performance dropped off a cliff. I know that now the GPU does a lot of this scaling and heavy-lifting for you but the principle when managing any large structure is the same.
- soupedupchicken, on 12/02/2008, -3/+3Very interesting. Thanks.
- whoreable, on 12/02/2008, -4/+4The first diagram goes to show how important cache really is. Also for having cable 80ms ping isn't that good. Maybe for a world wide average that is pretty good but inside the US with cable and good routing 80 is kinda high. Just like me.
- drmsux, on 12/03/2008, -3/+2You're an idiot. Linux is open source, and it does same things Windows does at the same speed. If you'd RTFA, you would know that your performance is bottlenecked by RAM size and HDD, not OS or evil greedy conspirators. Apps, too, can (and do) suck in this regard - Outlook working with a 10GB pst file, for example, is ***** scary. Still, there's a known tradeoff between perf and development complexity (and cost), and nobody will ship a program that takes an hour to load, just as nobody would rewrite some obscure function in assembly, that may execute just once in the background for .5% of all the users. If you don't understand that (and I'm afraid you don't) then please lock yourself in the basement and start rewriting Windows in assembly. Maybe when you're 90 years old, you will have Vista kernel performing 15% faster in some benchmarks...I'm pretty sure in 2080 Vista will still be very relevant, even more than Amiga today...
- mechnoch, on 12/03/2008, -2/+1My guess is Star Trek porn.
http://youtube.com/watch?v=1PwpcUawjK0
perv.
- slickwatson, on 12/03/2008, -2/+1 ????
Profit
- iMatt711, on 12/02/2008, -13/+11 Digg effect at 66 diggs. Seriously?
- GoKings, on 12/03/2008, -5/+1wat
- bobbonew, on 12/03/2008, -4/+0Alternate source anyone?
- inactive, on 12/03/2008, -6/+2Lol @ comment #22
- meinrosebud, on 12/03/2008, -5/+1Like I care, sheesh, get a life dude!