AMD's Quad-Core Barcelona: Defending New Territory
by Johan De Gelas on September 10, 2007 12:15 AM EST, posted in IT Computing
"Native Quad-Core"
AMD has told the whole world and their pets that Barcelona is the first true quad-core, as opposed to Intel's quad-cores, which are really two dual-cores on one package. This should result in much better scaling, partly because the cores should be able to exchange cache information much more quickly.
To quantify the delay that a "snooping" CPU encounters when it tries to get up-to-date data from another CPU's cache, take a look at the numbers below. We have used Cache2Cache before; you can find more info here. Cache2Cache measures the propagation time from a store by one processor to a load by the other processor. The results that we publish are approximately twice the propagation time.
Cache coherency ping-pong (ns)
CPU | Same die, same package | Different die, same package | Different die, different socket
Opteron 2350 | 152 | N/A | 199
Xeon E5345 | 59 | 154 | 225
Xeon DP 5160 | 53 | N/A | 237
Xeon DP 5060 | 201 | N/A | 265
Xeon 7130 | 111 | N/A | 348
Opteron 880 | 134 | N/A | 169-188
AMD's native quad-core needs about 76ns to exchange (L1) cache information. That's not bad, but it's not fantastic either, as the shared L2 cache approach of the Xeons allows two cores on the same die to exchange information via the L2 in about 26-30ns. Once you need to get information from core 0 to core 3, Intel's dual-die CPU still doesn't need much more time (77ns) than the quad-core Opteron (76ns). The complex L1-L2-L3 hierarchy might negate the advantages of being a "native" quad-core somewhat, but we have to study this a bit further as it is quite a complex matter.
Memory Subsystem
AMD has improved the memory subsystem of the newest Opteron significantly. The L1 cache is about the only thing that has not changed: it's still the same 2-way set associative 64KB L1 cache as in K8, and it can be accessed in three cycles. Like every modern CPU, the new Opteron 2350 is capable of transferring about 16 bytes per cycle from it.
Lavalys Everest L1 Bandwidth
CPU | Read (MB/s) | Write (MB/s) | Copy (MB/s) | Bytes/cycle (Read) | Latency (ns)
Opteron 2350 2GHz | 32117 | 16082 | 23935 | 16.06 | 1.5
Xeon 5160 3.0GHz | 47860 | 47746 | 95475 | 15.95 | 1
Xeon E5345 2.33GHz | 37226 | 37134 | 74268 | 15.96 | 1.3
Opteron 2224 SE | 51127 | 25601 | 44080 | 15.98 | 0.9
Opteron 8218HE 2.6GHz | 41541 | 20801 | 35815 | 15.98 | 1.1
L2 bandwidth has been a weakness in AMD's architectures for ages. Back in the "K7 Thunderbird" days, AMD simply "bolted" the L2 cache onto the core. The result was a relatively narrow 64-bit path from the L2 cache to the L1 cache, which could at best deliver about 2.4 to 3 bytes per cycle. The K8 architecture improved this number by 50% and more, but that still wasn't even close to what Intel's L2 caches could deliver per cycle. In the Barcelona architecture, the data paths into the L1 cache have been doubled once again, to 256 bits. And it shows:
Lavalys Everest L2 Bandwidth
CPU | Read (MB/s) | Write (MB/s) | Copy (MB/s) | Bytes/cycle (Read) | Bytes/cycle (Write) | Bytes/cycle (Copy) | Latency (ns)
Opteron 2350 2GHz | 14925 | 12170 | 13832 | 7.46 | 6.09 | 6.92 | 1.7
Dual Xeon 5160 3.0GHz | 22019 | 17751 | 23628 | 7.34 | 5.92 | 7.88 | 5.7
Xeon E5345 2.33GHz | 17610 | 14878 | 18291 | 7.55 | 6.38 | 7.84 | 6.4
Opteron 2224 SE | 14636 | 12636 | 14630 | 4.57 | 3.95 | 4.57 | 3.8
Opteron 8218HE 2.6GHz | 11891 | 10266 | 11891 | 4.57 | 3.95 | 4.57 | 4.6
Lavalys Everest L2 Comparisons
Comparison | Bytes/cycle (Read) | Bytes/cycle (Write) | Bytes/cycle (Copy)
Barcelona versus Santa Rosa | 63% | 54% | 51%
Barcelona versus Core | -1% | -5% | -12%
Santa Rosa versus Core | -39% | -38% | -42%
Barcelona, aka Opteron 23xx, is capable of delivering no less than 50-60% more bandwidth per cycle from its L2 cache than K8. We also measure a latency of 15 cycles, which puts the AMD L2 cache in the same league as the Intel Core caches.
The memory controllers of the third generation of Opterons have also been vastly improved:
- Deeper buffers. The low-latency integrated memory controller was already one of the Opteron's strongest points, but the amount of bandwidth it could extract from DDR2 was mediocre: only at higher frequencies was the Opteron able to gain a bit of extra performance from fast DDR2-667 DIMMs (compared to DDR-400). The third-generation Opteron remedies this with deeper request and response buffers.
- Write buffer. When Socket 939 and dual-channel memory support were introduced, we found that the number of cycles bus turnaround takes had a substantial impact on the performance of the Athlon 64. With a half-duplex bus to the memory, it takes some time to switch between writing and reading. When you fill up all the DIMM slots in a Socket 939 system, bus turnaround has to be set to two cycles instead of one, which results in up to a 9% performance hit, depending on how memory intensive your application is. The best performance therefore comes from using one DIMM per channel and keeping bus turnaround at one cycle. However, even better than minimizing bus turnarounds is avoiding them altogether: a 16-entry write buffer in the memory controller allows Barcelona to group writes together and then burst them out sequentially.
- More flexible accesses. Each memory controller now supports independent 64-bit accesses, where the dual-core Opteron performed a single 128-bit access across both controllers.
- DRAM prefetchers. The DRAM prefetcher requests data from memory before it is needed when it sees memory being accessed in regular patterns, and it can walk forward or backward through memory.
- Better "open page" management. By keeping the right rows open on the DRAM, the memory controller only has to pick out the right columns (CAS) to get the necessary data, instead of activating the right row, copying it, and then picking out the right column. This saves a lot of latency (e.g. RAS-to-CAS) and can also save some power.
- Split power planes. Feeding the memory controller and the core from different power rails is not a direct improvement to the memory subsystem, but it does allow the memory controller to be clocked higher than the CPU core.
Okay, let's see if we can make all those promises of better memory performance materialize. We first tested with Lavalys Everest 4.0.11.
Lavalys Everest Memory Bandwidth
CPU | Read (MB/s) | Write (MB/s) | Copy (MB/s) | Bytes/cycle (Read) | Bytes/cycle (Write) | Bytes/cycle (Copy) | Latency (ns)
Opteron 2350 2GHz | 5895 | 4463 | 6614 | 2.95 | 2.23 | 3.31 | 76
Dual Xeon 5160 3.0GHz | 3656 | 2771 | 3800 | 1.22 | 0.92 | 1.27 | 112.2
Xeon E5345 2.33GHz | 3578 | 2793 | 3665 | 1.53 | 1.2 | 1.57 | 114.9
Opteron 2224 SE | 7466 | 6980 | 6863 | 2.33 | 2.18 | 2.14 | 58.9
Opteron 8218HE 2.6GHz | 6944 | 6186 | 5895 | 2.67 | 2.38 | 2.27 | 64
Lavalys Everest Memory Bandwidth Comparison
Comparison | Bytes/cycle (Read) | Bytes/cycle (Write) | Bytes/cycle (Copy) | Latency
Barcelona versus Santa Rosa | 26% | 2% | 54% | 29%
Barcelona versus Core | 92% | 86% | 111% | -34%
Santa Rosa versus Core | 74% | 99% | 44% | -44%
The deeper buffers and more flexible 2x64-bit accesses have increased the read bandwidth, but the write buffer may have blunted their effect on pure writes a bit. That is not a problem, as very few applications do nothing but write for a long period of time. Notice that the biggest per-cycle gain, 54%, is in copy bandwidth. This is most likely because a copy interleaves writes and reads, allowing the split memory access design to come into play.
With much higher L2 cache and memory bandwidth combined with low latency access, the memory subsystem of the 3rd generation of Opterons is probably the best you can find on the market. Now let's try to find out if this superior memory subsystem offers some real world benefits.
46 Comments
erikejw - Tuesday, September 11, 2007 - link
I take back what I said. I mixed up 3 different reviews that do not correlate and are not comparable.
I did not realize that until now, even though I looked at them again.
The claim about optimizations being turned off on AMD processors was just hearsay, and it seemed plausible given the results presented, but since I was wrong about the results, that part is probably wrong too.
So now everything I have to say is, great article :)
Now I look forward to the tests with the 2.5GHz part and some overclock on it to see what a 2.8 or even a 3GHz part would do.
kalyanakrishna - Tuesday, September 11, 2007 - link
Sorry ... with all the discussion, your methodology is incomplete and leading to a biased result. Maybe there is code that is optimized for Intel processors - but the focus of the article is performance - that's what you intended it to be ... if not, please redo the article, change your deductions, and focus it on code compatibility. No one measuring performance on their systems will use Intel Xeon optimized code on AMD processors. There are a bunch of other compilers and performance libraries available. If not, please use a compiler that WILL optimize for both - PathScale, gcc and more ... I agree with your processor frequency aspect ... however, neither did Intel have a high-frequency part on the launch date. The way it should have been presented is "at the same frequency ... there is not much difference in performance" and "at higher clocks, Intel does have an advantage that comes at a price". Is this the same deduction you brought out in your article? Far from it ... do you concur?
And your reasoning that you didn't have time to optimize the code is not acceptable. What was the point of this article - to throw out some incomplete article on the day of the launch so everyone doesn't think AnandTech doesn't have a comment on Barcelona, or to maintain your high standards and put out a well written, mature and complete article based on results from a rock solid testing methodology with critical analysis?
The article was leaning more towards a dramatic touch than presenting a neutral analysis. And, please stop saying Linpack is Intel friendly. The code is NOT; the way you compiled it is what made it Intel-optimized!! There is a HUGE difference. Code can only be Intel "friendly" when it is written with special attention to make sure it fully exploits all the features that Xeon has to offer and not necessarily any other processor. And, if you do read my email to you - you will notice my stand on that point and a lot more.
So, I kindly request you to immediately take down this article with a correction, or redo your article and change the focus. Maybe you had a different idea in your mind when writing it ... but the way it was written is not what you said you wanted it to be. The points you are making now in the comments are not brought out in the article.
Thank you for your time.
kalyanakrishna - Tuesday, September 11, 2007 - link
And of course, we didn't even get to the point where the test setup says "BIOS Note: Hardware prefetching turned off" but your analysis section says "but masterly optimization together with hardware prefetching ensures most of the data is already in the cache. The quad-core Xeon wins again, but the victory is a bit smaller: the advantage is 20%-23%."
That says enough about the completeness and accuracy of your article. The article is full of superlatives like "masterly" and "meticulous" to describe Intel processors. The bias can't be any more blatant.
Now, will you please take it down and stop spreading the wrong message!!! There is nothing wrong in saying it was an incomplete article and that, in the interests of accuracy, you would like to retract your claims!! Stop sending the wrong message to your huge reader base and influencing their opinion of a potentially good product!
fitten - Tuesday, September 11, 2007 - link
Potentially... but not yet a good product, IMO. Hopefully AMD will have another stepping out sometime by the end of the year that may actually be competitive. As of right now, Barcelona isn't competitive with Intel's offerings. The problem is that the target is moving, as Intel will be releasing new chips by the end of the year. As far as Intel compilers are concerned, you do realize that Intel's compilers are better than GCC (which is NOT known for aggressive optimizations and stellar performance) and are downloadable from their site. Code compiled with Intel compilers tends to execute faster on both Intel and AMD processors than code compiled with GCC in many cases.
As far as accuracy of the various reviews... it's AMD's fault for getting only a few systems to a few reviewers only 48 hours before the launch date. I believe this was intentional in order to delay any thorough testing of Barcelona in the short term. Plus, there's the whole bit about AMD requiring that reviewers submit reviews to AMD for sanitizing before publishing them, as well. I'm quite convinced that AMD knew (and knows) that Barcelona is a turd and are just trying to buy time by various nefarious methods so that they can have a little more time to get their act together. If it weren't for investors and the world pushing AMD to actually release on their (much delayed) launch date, I'm quite sure AMD would have rather waited a few months so they'd have a better stepping to debut.
kalyanakrishna - Tuesday, September 11, 2007 - link
This is exactly what I am talking about ... see the comments on Digg: http://www.digg.com/hardware/Finally_AMD_s_Barcelo...
Now, please retract your observations.
kalyanakrishna - Wednesday, September 12, 2007 - link
Johan,
Comment by swindelljd below ...
I believe you underestimated the impact your article has on purchasing decisions of the customers. :)
I hope customers do continue to look at AnandTech as a source of impartial, genuine and correct data on performance of new technologies.
As Spiderman's uncle would say, "With great power comes great responsibility". :) :)
swindelljd - Thursday, September 13, 2007 - link
Yes, I would say anandtech.com has the most comprehensive, thoughtful, well organized, unbiased and current analysis of any site/content that currently exists. Many other sites even reference or simply use anandtech.com's analysis, barely augmenting it with their own. I have definitely used them in the past for both personal purchases (enthusiast OC'ing) and business purchases for my production hardware environment. In each case I've used multiple sources but always find myself returning to anandtech.com.
I'd hate to see them delay the release of an article just because there was "just one more test to run". Like many things in life, sometimes it's more important to simply work with the information at hand (even if not quite complete) than to wait to make a decision. Some might call that "analysis-paralysis".
Ultimately it's up to me when making purchasing decisions to weigh all the information and consider how much issues such as you pointed out regarding "not quite complete" analysis would impact a real world scenario.
I applaud anandtech.com for all the work they do (and the LONG hours they must put in) in quantifying what in some cases is unquantifiable.
Now back to my original question - why do the Woodcrest/MySQL benchmarks taken approx 14 months apart vary by so much and for the worse? Did the benchmark used change or am I just misreading the benchmark?
thanks,
John
kalyanakrishna - Thursday, September 13, 2007 - link
John,
It's great that you trust the site content so much. I know many people who do. That is why I was shocked to see the shortcomings in the article ... most of which are, I must say, basic to some extent.
I myself have been reading the site for many years and I know many colleagues who refer to this site for just about anything ... hence my stand that they should realize the importance of their work and publications.
Maybe it's your fondness for the site ... but the specific comments I made are very important and do affect real world results. For anyone looking to build a cluster or an HPC system, that's their real world, just like database performance is real world to you.
Just to make it explicit ... it's not a flame war or anything like that ... it is to make sure that the data is correct and a relevant comparison is made.
thanks.
flyck - Tuesday, September 11, 2007 - link
When will there be an update available? :).
JohanAnandtech - Monday, September 10, 2007 - link
2GHz Intel parts were not available to us. And considering AMD's price points, the 2GHz Opteron 2350 is targeting 2.33GHz Xeons. It is fairly accepted that AMD has to lure customers with a small price advantage.
Because there is a lot of Intel optimized code out there? Do you deny that there are developers out there that use the Intel MKL?