The Nehalem Preview: Intel Does It Again
by Anand Lal Shimpi on June 5, 2008 12:05 AM EST - Posted in CPUs
A Quick Path to Memory
Our investigation begins with the most visibly changed part of Nehalem's architecture: the memory subsystem. Nehalem implements a very Phenom-like memory hierarchy consisting of small, fast individual L1 and L2 caches for each of its four cores and then a single, larger shared L3 cache feeding the entire chip.
Nehalem's L1 cache, despite appearing unchanged from Penryn, does grow in latency: it now takes 4 cycles to access vs. 3. The L2 cache is now only 256KB per core, where Penryn's L2 (6MB shared between each pair of cores) was 24x that size, and thus it can be accessed in only 11 cycles, down from 15 (Penryn added a clock cycle over Conroe to access L2).
CPU / CPU-Z Latency | L1 Cache | L2 Cache | L3 Cache |
Nehalem (2.66GHz) | 4 cycles | 11 cycles | 39 cycles |
Core 2 Quad Q9450 - Penryn - (2.66GHz) | 3 cycles | 15 cycles | N/A |
The L3 cache is quite possibly the most impressive, requiring only 39 cycles to access at 2.66GHz. The L3 cache is a very large 8MB cache, 4x the size of Phenom's L3, yet it can be accessed much faster. In our testing we found that Phenom's L3 cache takes a similar 43 cycles to access but at much lower clock speeds (2.0GHz). If we put these numbers into relative terms it takes 21.5 ns to get a request back from Phenom's L3 vs. 14.6 ns with Nehalem's - that's nearly 50% longer in Phenom.
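The cycle-to-nanosecond conversion behind that comparison is easy to reproduce. The snippet below is a minimal sketch using only the cycle counts and clock speeds quoted above; the function and variable names are made up for illustration.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in core clock cycles to nanoseconds."""
    # With the clock in GHz, one cycle lasts 1 / clock_ghz nanoseconds.
    return cycles / clock_ghz


# L3 figures quoted in the text (cycles, clock in GHz).
nehalem_l3_ns = cycles_to_ns(39, 2.66)
phenom_l3_ns = cycles_to_ns(43, 2.0)

# How much longer Phenom's L3 access takes, relative to Nehalem's.
phenom_extra_pct = (phenom_l3_ns / nehalem_l3_ns - 1) * 100

print(f"Nehalem L3: {nehalem_l3_ns:.2f} ns")
print(f"Phenom L3:  {phenom_l3_ns:.2f} ns")
print(f"Phenom takes {phenom_extra_pct:.1f}% longer")
```

Run as written, this puts Nehalem's L3 at a little under 14.7 ns and Phenom's at 21.5 ns, a gap of roughly 47%, which is where the "nearly 50% longer" figure comes from.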
While Intel did a lot of tinkering with Nehalem's caches, the inclusion of a multi-channel on-die DDR3 memory controller was the most apparent change. AMD has been using an integrated memory controller (IMC) since 2003 on its K8-based microprocessors, and for years Intel resisted doing the same, citing the complexity of choosing what memory to support, among other reasons.
With clock speeds increasing and up to 8 cores (including GPUs) making their way into Nehalem based CPUs in the coming year, the time to narrow the memory gap is upon us. You can already tell that Nehalem was designed to mask the distance between the individual CPU cores and main memory with its cache design, and the IMC is a further extension of the philosophy.
The motherboard implementation of our 2.66GHz system needed some work, so our memory bandwidth/latency numbers on it were way off (slower than Core 2). Luckily we had another platform at our disposal, running at 2.93GHz, which was working perfectly. We turned to Everest Ultimate 4.50 to give us memory bandwidth and latency numbers from Nehalem.
Note that these figures are from a completely untuned motherboard and are using DDR3-1066 (dual-channel on the Core 2 system and triple-channel on the Nehalem system):
CPU / Everest Ultimate 4.50 | Memory Read | Memory Write | Memory Copy | Memory Latency |
Nehalem (2.93GHz) | 13.1 GB/s | 12.7 GB/s | 12.0 GB/s | 46.9 ns |
Core 2 Extreme QX9650 - Penryn - (3.00GHz) | 7.6 GB/s | 7.1 GB/s | 6.9 GB/s | 66.7 ns |
Memory accesses on Conroe/Penryn were quick due to Intel's very aggressive prefetchers; memory accesses on Nehalem are just plain fast. Nehalem takes a little over 2/3 the time Penryn does to complete a memory request, and although we didn't have time to run comparable Phenom numbers, I believe Nehalem's DDR3 memory controller is faster than Phenom's DDR2 controller.
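As a quick check on the "little over 2/3" figure, here is a minimal sketch that works from the two Everest latency numbers in the table above; the variable names are illustrative only.

```python
# Everest Ultimate 4.50 memory latency figures from the table (nanoseconds).
nehalem_latency_ns = 46.9
penryn_latency_ns = 66.7

# Fraction of Penryn's memory latency that Nehalem needs.
latency_ratio = nehalem_latency_ns / penryn_latency_ns

# The same comparison expressed as a percentage reduction.
latency_reduction_pct = (1 - latency_ratio) * 100

print(f"Nehalem needs {latency_ratio:.2f}x the time of Penryn")
print(f"That is a {latency_reduction_pct:.1f}% reduction in latency")
```

The ratio comes out at about 0.70, just above 2/3, or roughly a 30% cut in memory latency.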
Memory bandwidth is obviously greater with three DDR3 channels; Everest measured around a 70% increase in read bandwidth. While we don't have the memory bandwidth figures here, Gary measured a 10% difference in WinRAR performance (a test that's highly influenced by memory bandwidth and latency) between single-channel and triple-channel Nehalem configurations.
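To put the bandwidth numbers in context, the sketch below computes the percentage gains from the Everest table and compares them with a theoretical DRAM peak. The peak calculation is an assumption on our part, not something Everest reports: it takes DDR3-1066 as a nominal 1066 MT/s over a 64-bit (8-byte) channel, and it ignores the front-side bus that also sits between the Core 2 CPU and its memory. All names are illustrative.

```python
# Measured Everest figures in GB/s: (Nehalem 2.93GHz, Core 2 Extreme QX9650).
measured_gb_s = {
    "read": (13.1, 7.6),
    "write": (12.7, 7.1),
    "copy": (12.0, 6.9),
}


def pct_increase(new: float, old: float) -> float:
    """Percentage by which `new` exceeds `old`."""
    return (new / old - 1) * 100


def peak_gb_s(mt_per_s: float, channels: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical DRAM peak in GB/s (assumes 64-bit channels by default)."""
    return mt_per_s * bytes_per_transfer * channels / 1000


increases = {name: pct_increase(n, c) for name, (n, c) in measured_gb_s.items()}

# Assumed nominal peaks for DDR3-1066: triple-channel vs. dual-channel.
nehalem_peak = peak_gb_s(1066, channels=3)
core2_peak = peak_gb_s(1066, channels=2)

for name, pct in increases.items():
    print(f"{name}: +{pct:.1f}%")
print(f"Nehalem read vs. assumed peak: {13.1 / nehalem_peak:.0%}")
print(f"Core 2 read vs. assumed peak:  {7.6 / core2_peak:.0%}")
```

Under those assumptions both systems land at around half of their theoretical DRAM peak, which fits the note that these figures come from a completely untuned motherboard.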
While we didn't really expect Intel to somehow do wrong with Nehalem's memory architecture, it's important to point out that it is very well implemented. Intel managed to change the cache structure and introduce an integrated memory controller while making both significantly faster than what AMD managed despite a four-year head start.
In short: Nehalem can get data out of memory quick like bunnies.
108 Comments
SiliconDoc - Monday, July 28, 2008
Crysis- etc. :Pete, you can be very happy knowing it will do folding like mad, and you can fantasize that you've cured cancer while you spend your money for some tax subsidized already to the hilt University program, because you're such a good and loving person.
( I know YOU didn't mean anything like that - see sarcasm! )
In the mean time, the OLD HT single core chips will do just fine cranking most games, and dual core or core2duo or 2180 or some other then $40 chip will be a few percentage pts. shy.
My gawd, they've got our number.
I bet they "unlock it !!!!! " OMG ! for like 2 grand if you're cooooool you can get one!
Crank the Planet - Thursday, June 5, 2008
I know it may be exciting but the article sounds fan-boyish. For most of the marks it shows what intel is claiming 20-30% boost. He gets one mark to go 50% and now it's 20-50% boost?? He compares in another mark AMD 21 and nehalem 14 and says it's almost 50% faster!!! and then compares penryn 18 and nehalem 14 and says it's 28%. I think the AMD mark was more like 35%.
As I've said before everybody knows AMD was going to hurt themselves in the short run by buying ATI. If they didn't buy ATI I think things would be very different. Now that the last year of payments is being made for buying ATI AMD will be able to get back into the game.
Intel has only now integrated the memory controller. Everybody knew as soon as they did they would see a nice bump. They haven't had any significant innovations in a long time. AMD is in the same position they were before K8. Just give them some time to finish absorbing ATI, then watch out- fusion is just around the corner :)
hs635 - Tuesday, June 17, 2008
Fuck off retard
masouth - Friday, June 6, 2008
What kind of idiot fan-boy drivel is this?
"He gets one mark to go 50% and now it's 20-50% boost??"
Ummm, yes?
1, 2, 3, 4, 8
What is the range of those numbers? 1-8, right?
Does the majority of them being in the 1-5 range somehow negate the fact that the actual range is 1-8?
THINK PEOPLE!
michael2k - Thursday, June 5, 2008
You're the one that sounds like a fanboy.
What makes you think Intel's CPU-GPU integration won't be as fabulous as their IMC or quad-core components? Intel doesn't need "significant innovations" (nor does AMD), they just need higher performance, lower power, and lower cost, which is exactly what they have.
Innovations only exist to serve those aspects.
Justin Case - Sunday, June 8, 2008
Wrong.
AMD64 (the instruction set) isn't about "more performance". Virtualization isn't about "more performance". Hardware no-execute flags aren't about "more performance". SATA's hot-plug ability isn't about "more performance".
Your statement shows the kind of lack of vision that brought us the Pentium 4.
I for one am far more excited about technology that allows me to do something new or different than "technology" that simply lets me do the same stuff faster. 99% of CPU cycles in the planet go unused anyway.
zsdersw - Thursday, June 5, 2008
Given the overall tone of your reply, the criticism of the article as "fan-boyish" is, really, the pot calling the kettle black.
Visual - Thursday, June 5, 2008
so you agree as well? yeah, me too.
they are both black. they are both fanboys :)
zsdersw - Thursday, June 5, 2008
I've said nothing about agreeing with anything. What I have said, though, is that a fanboy calling someone else a fanboy is perhaps not indicative of any objective truth.
Jynx980 - Saturday, June 7, 2008
It will be a great day when I can read any CPU discussion without the word fanboy in it.
The close up of the chip has waaaaaay too much thermal compound on it.
Is it just me or is the first pic of the Intel roadmap rather... phallic?