SMP limited by memory controller

Psi*

Tech Monkey
I have been teetering for the past few weeks about buying an i7-980X. This response from the support team of one of the programs I use has kind of spoiled the notion. The software is actually a collection of various "solvers", all multi-threaded number-crunchers. Runs can take from minutes to days.

"You will see almost no benefit for the hexacore (6 thread) processors compared to the quad-core processors for transient solvers (FIT, FDTD, TLM).

The single memory controller will be the bottleneck and will almost destroy the possible performance improvement. The effect will be the same as for the octacore (8 thread) processors we have already benchmarked.

There's a single memory controller per CPU; in prior-generation Intel sockets the memory controller was shared by 2 CPUs. This was improved for the Intel Nehalem (X5500, X5600) family, so each socket has its own memory controller. But it can only take advantage of 4 cores for transient solvers, which involve memory swap."

I have never seen any benchmark program comment on or even hint at this limitation. What the heck?! This was a problem for my older dual Opteron 290 system (~4 years old) ... 4 cores, but only 2 threads would launch. :mad: I did find another web site discussing this as it applies to a different solver than this company's. A few searches for "smp memory controller limitation" do produce a few other pages that discuss this.

I haven't paid much attention to AMD for a couple of years, but I wonder if they have a similar architecture?

So, I guess I am in the market for multi-socket motherboards with "cheap" dual-core processors. But somehow this doesn't seem right! :confused: And given the overclocker that I am, I am arguing with myself about just building another system that duplicates the one in my signature ... OC-ed to 4.4GHz, it makes a pretty competitive choice.
 

Rob Williams

Editor-in-Chief
Staff member
Moderator
Alright, this is a bit strange. Those solvers must be quite unique, because I've never heard of the single-channel memory controller ever being a problem - at least, if we're talking about single socket, which in this case, we are.

I don't quite understand it, but at least the company was straightforward about it. The fact that it even benchmarked octal-core processors is rather impressive. I still don't understand where the limitation lies, but if what the company is saying is true, then even faster memory or more memory shouldn't make any difference.

That's unfortunate, and rather mind-blowing. You'd expect that moving up to a six-core would deliver tremendous improvements, not none at all! It was a smart move to e-mail them rather than just rush and purchase the CPU. That would have stung a little.

I am not familiar with AMD's server processors to be honest, but it may be that they do improve the situation a bit. Still, I wonder how AMD vs. Intel would compare clock-to-clock. You might be able to take advantage of all six cores on the AMD, but those six cores may still only equal four cores on Intel.

If each socket tops out at utilizing a quad-core, the best bet might be to stick with what you have, unless you want to go with a dual-socket board and get two quad-cores (that's the only way you'd ever see an improvement over a single quad, naturally).

There isn't some sort of benchmark the company offers that I could run on our six-core processor, is there?
 

Psi*

Tech Monkey
Thanks, Rob. I have been struggling with this issue for over a year. I did not believe it & thought that they were just trying to figure out a way to charge more for the software. BUT, I have finally bought into it.

This link about FDTD (finite difference time domain) from a different company, at the bottom of the article, makes the same claim about the limitation of the memory controller.

FYI ... the transient solvers that he mentions are "time domain". They model an electrical problem by sending an electrical impulse into whatever it is that you are interested in ... anything from connectors (like CPU sockets) to antennas.

I have emailed him asking how they benchmarked this & what they used. SPECfp has GemsFDTD. Of course, even knowing that, how does one discern the difference between bottlenecks ... RAM, bus, memory controller, or ...?
 

Kougar

Techgage Staff
Staff member
"You will see almost no benefit for the hexacore (6 thread) processors compared to the quad-core processors for transient solvers (FIT, FDTD, TLM).

The single memory controller will be the bottleneck and will almost destroy the possible performance improvement.

Okay, I'm going to balk at this one. Intel chips that use the LGA1366 socket have a triple-channel integrated memory controller, and it is pretty good. This is why they use triple-channel memory kits, one module per channel.

I have never before heard of a program that was limited by these and not by the number of cores. It sounds like the program is not well written, does not scale well, or perhaps I do not understand the workload involved. Nehalem does best with HPC workloads (which this sounds like), and the large pool of L3 cache ensures it is not fetching from RAM all the time.

I run Folding@home Bigadv work units, and these hold the CPU at 100% load for up to four days on a stock Core i7 920 using 6GB of RAM. Overclocked as in my sig, that drops to roughly 2.1 days. Changing the memory speed from 1333MHz to 1600MHz saves about 3-4 minutes per run, across 100 runs.



As far as a "smp memory controller limitation" goes, it sounds like hogwash. I didn't spot any valid links when searching for it; were there any you found relevant to this that I could look at?

AMD's adoption of the NUMA platform gave them a huge advantage here, but Nehalem also adopted NUMA, so the playing field became fairly equal. Intel's 980X has a triple-channel memory controller that can handle 2,000MHz memory with ease. AMD's Phenom II X6 only has a dual-channel memory controller, and doesn't see any gains above 1333-1600MHz because of the 1.8GHz chipset limitation.


--------------------------------------------------------

Psi, from the aforementioned link, in direct regard to Nehalem:

The results from the previous study showed a dramatic 3x performance improvement for processors released in the same calendar year,

They are saying the opposite about the performance there.

Here is what they say about NUMA, bolding is mine:

The NUMA architecture used in Nehalem helps FDTD Solutions scale nearly ideally by increasing memory bandwidth and processing power simultaneously.

Where exactly are they saying it is a bottleneck? They are saying the opposite here. ;) I think you got confused with the rest of the paragraph, which I will quote below:

Previous generations of multi-processor systems by Intel used a Symmetric Multiprocessor Architecture (SMP). In this configuration, all processors access a single memory bank through a common memory controller hub, as shown in Figure 7. That memory controller hub often tends to be a performance bottleneck for memory intensive applications such as FDTD Solutions.

These "previous generations" would be Intel's CPUs dating from 2001-2005. Netburst-based designs like the Pentium 4 and so forth. AMD's versions would be the Athlon XP series. Modern desktop AMD and Intel CPUs have not used a "memory controller hub" since 2006. ;)

The 980X would be the best processor for HPC workloads, especially when combined with 2,000MHz RAM like the G.Skill Ripjaws. Because only the 980X can run its uncore at a 1.5x multiplier, I cannot test 2,000MHz on a Core i7 920 or 930; most of those chips can't run the 4GHz uncore required for 2,000MHz RAM. The Core i7 980X, by comparison, only requires a 3GHz uncore.
 

Psi*

Tech Monkey
In this link, Figure 7's description at the bottom of the page says, "The Symmetric Multiprocessor Architecture (SMP) results in performance bottlenecks owing to the shared memory controller hub." The paragraph above it describes it slightly better.

The same company has a frequency domain (FD) solver which, I think he told me in a phone call, is not affected. All of the FD solvers that I am familiar with use a finite element mesh made up of tetrahedra. That makes for coordinate transformation overhead ... converting from an odd, pyramid-ish-looking mesh to XYZ coordinates ... in so many words. So they may not suffer from this issue so much because of that additional overhead.

This is where I have backed off from arguing with *them*. BUT I, like you (Kougar & Rob), am suspicious of how the code was written, or at least of the vintage of the technology at the time, on one hand. On the other, this company has thrown huge amounts of money into development. By some estimates, they have invested more than almost all of the rest of their competitors!

This is why I decided to post here about this supposed issue.
 

Kougar

Techgage Staff
Staff member
In this link, Figure 7 description at the bottom of the page

Psi, please reread the paragraph before the illustration as I believe you are missing one key point they made clear on their page. I'm going to bold the parts in question:

Previous generations of multi-processor systems by Intel used a Symmetric Multiprocessor Architecture (SMP). In this configuration, all processors access a single memory bank through a common memory controller hub, as shown in Figure 7.

This tells you Figure 7 is in reference to the pre-2006 Intel architecture I was referring to in my earlier post. It applies to FSB-based systems that use an MCH, which means Figure 7 does not apply to NUMA-based platforms such as Nehalem, Phenom, or any other modern single- or dual-CPU motherboard today. By their own words, this bottleneck no longer applies.

If you still have any doubt, study the architecture itself... The 980X has a triple-channel memory controller built directly into the CPU, not a single external one. For this SMP bottleneck to exist, the memory controller must reside in the chipset itself. On the Core i7 900 series, each memory channel is routed directly to the RAM via the QPI/VTT DRAM bus; it is not routed through the PCH (platform controller hub) or IOH (I/O controller hub).
 

Psi*

Tech Monkey
I saw that paragraph & re-read it a few times. And I am *not* arguing against your reasoning or point of view or opinion ... different, eh?? In fact I share all of that with you ... meaning "I agree". But this does not make it proof positive, beyond a shadow of a doubt, or even irrefutable, indisputable ... uh ... that we are correct. :confused:

I am trying to do the holiday thing this weekend, which is a very foreign experience for me, but I am just putting the flag up to see who salutes. In other words, I posted this "issue" to see what kind of response I get from anyone. It is great that we are in agreement, but how do you prove that this "issue" does not exist???

When the business week starts up again in the US, I am going to call software vendors that make competitive packages as well as those using similar software technology to see what they say about their experience.

The question begging to be asked, ignoring current CPUs like the i7: how did this problem show up in the first place? What metric of whatever benchmark pointed to the memory controller, versus poorly written multi-threaded software or a weakness in the hardware architecture?
 

Kougar

Techgage Staff
Staff member
Ah, then I misunderstood your post to an extent. I thought you were using their information to prove as to why this SMP bottleneck might be the case.

Unfortunately I can't prove it beyond showing that my point of view aligns with the article and that the facts match. Looking directly at the platform architecture clearly shows that the chipset-memory-controller scenario doesn't exist with NUMA, so this bottleneck can't exist either.

If you want more information on this, I might suggest you look into HPC and server benchmarks. AnandTech IT has a few they perform every quarter or so, but only on their IT site. If you check out those benchmarks you can see Nehalem's NUMA architecture (and AMD's Magny-Cours NUMA architecture) are both extremely powerful and not bottlenecked by memory bandwidth. Although I will point out that Nehalem's platform does offer higher memory bandwidth than AMD's can, currently.

If you ever believe there is a memory bandwidth bottleneck, then increasing the available memory bandwidth would still show some improvement during testing. I am not talking about synthetic programs, but actual workloads. Synthetic programs are great at testing memory bandwidth, but pure memory bandwidth as seen by the processor is not a guarantee of performance.

Even with Core 2 Duo, which had the memory controller on the chipset, if you increased the RAM from DDR2-667 to DDR2-800 or even 1066MHz you would see a small improvement in performance. For HPC workloads that span 8-12 threads or more on Core i7, if there were a memory bottleneck then any increase in memory bandwidth should result in a tangible performance gain for the application in question.

I got myself mixed up: 2006 is the year Core 2 Duo launched, but the memory controller was still on the chipset for the Q6600, Q9650, and other Core 2 processors; I need to make that clear. Regardless, something to point out is that despite this disadvantage, Core 2 easily outperformed the AMD Athlon 64 series, which had no such bottleneck. The point to take away is that for average desktop workloads this handicap was not enough to matter significantly, and this is why Intel waited until Nehalem's launch in 2008 to remove it from their processors.

Again, if you want to see examples of benchmarks that highlight this, focus on ones that feature the Intel Xeon X5100 series (dual-core) or Xeon X5300 series (quad-core), as these were the initial "Core" architecture parts that launched in 2006 for the server market and featured this FSB/memory controller handicap. The problem is these processors still chewed up AMD's Opterons in the benchmarks, but a few will show the memory advantages if you look for them.

To try and answer your question directly, I did some googling. This was one of the first results I found: http://www.anandtech.com/show/1254/3 It is from 2004, when AMD was introducing their NUMA platform, but it highlights the problem exactly. Amusingly, their new site design broke their articles; the charts in that one are not working, so you will need to email them if you want to see the data.

After testing 1700MHz and 1866MHz memory on my system with a special processor, I did not see any improvement over 1600MHz memory with F@H. The number of CPU cores and the CPU frequency were the bottleneck. I would have liked to try 2,000MHz, but I could not get it fully stable at suitable timings to make it a fair comparison. Still, for HPC workloads I think a 980X paired with 1600MHz CAS 7 or CAS 8 memory is the best option. The only thing better would be an EVGA SR-2 running two Xeon Gulftowns. Either platform would have more memory bandwidth than any other computer before it. As one site phrased it, Core i7 has enough memory bandwidth to equal some midrange GPUs...
 

Psi*

Tech Monkey
Ah, then I misunderstood your post to an extent. I thought you were using their information to prove as to why this SMP bottleneck might be the case.

Unfortunately I can't prove it beyond showing that my point-of-view aligns with the article and that the facts match. Looking directly at the platform architecture clearly shows the scenario of a chipset memory controller doesn't exist with NUMA, so this bottleneck can't exist either
I am guilty of trying to bait someone into proving the opposing point of view (which I do not actually agree with). Meaning that you and I are in agreement! Thanks in advance for your indulgence; I don't mean to offend. :eek:

I'll follow up with the link as well as anything else I find out. Thanks.
 