Dynamic Power Management: A Quantitative Approach
by Johan De Gelas on January 18, 2010 2:00 AM EST, posted in IT Computing
Analysis: What Happened?
The measurements on the previous page are fine, but we also want to understand how well the hardware and operating system coped with the "low load" scenario. What did Windows Server 2008 R2 do? We asked the Windows Driver Kit "Powertest" tool to tell us more. The first thing we want to know is the clock speed each core was asked to run at in "Balanced" mode. The differences are very telling. First, the Xeon's clock speed changes:
Xeon L3426 Core Speeds (requested clock in MHz)
Occurrences | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7 |
10 times | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 |
20 times | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 |
1 time | 1463 | 1463 | 1463 | 1463 | 1463 | 1463 | 1463 | 1463 |
10 times | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 |
1 time | 1729 | 1729 | 1729 | 1729 | 1729 | 1729 | 1729 | 1729 |
Many | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 |
The Xeon L3426 almost always ran at 1.86GHz. In a period of 30 seconds, we noticed only two P-state change requests: one to a single speed bin lower (-133MHz) and one to three speed bins lower (-400MHz). All cores were always asked to run at the same clock speed.
Next those of the Opteron:
Opteron 2435 Core Speeds (requested clock in MHz)
Occurrences | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 |
1 time | 800 | 1400 | 800 | 2600 | 800 | 800 |
1 time | 800 | 800 | 1400 | 1400 | 800 | 800 |
1 time | 800 | 800 | 800 | 800 | 800 | 800 |
1 time | 800 | 800 | 800 | 2600 | 800 | 800 |
1 time | 800 | 800 | 800 | 800 | 800 | 800 |
1 time | 800 | 800 | 800 | 800 | 2600 | 1400 |
1 time | 800 | 800 | 800 | 800 | 800 | 800 |
Where the Xeon hardly gets any P-state changes, the six-core Opteron 2435 frequently switches between 0.8GHz, 1.4GHz, and 2.6GHz. Often one of the cores runs at 1400MHz, another one at 2600MHz, and the rest at 800MHz. Basically, the above table is repeated over and over again. This means that the frequency scaling is far from ideal: we should see two cores at 2.6GHz most of the time, as the application spawns two threads that each require 100% of a core. This in turn explains the 15% performance hit between "Balanced" and "Performance". If the hardware and OS worked together better, the performance hit should not be more than a few percent. We conclude that in this case, the 4W power savings are not worth the performance hit.
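To make that observation concrete, the sampled rows can be checked against the "two cores at 2.6GHz" expectation. The following is a minimal sketch that simply hard-codes the seven rows of the Opteron table; the variable and function names are our own and are not part of the Powertest output.

```python
# Requested core speeds (MHz) for the Opteron 2435, copied from the table above.
# Each inner list is one sample; all names here are illustrative only.
OPTERON_SAMPLES = [
    [800, 1400, 800, 2600, 800, 800],
    [800, 800, 1400, 1400, 800, 800],
    [800, 800, 800, 800, 800, 800],
    [800, 800, 800, 2600, 800, 800],
    [800, 800, 800, 800, 800, 800],
    [800, 800, 800, 800, 2600, 1400],
    [800, 800, 800, 800, 800, 800],
]
FULL_SPEED_MHZ = 2600  # highest P-state seen in the table
BUSY_THREADS = 2       # the benchmark keeps two threads fully busy


def cores_at_full_speed(sample):
    """Count how many cores were asked to run at the top clock in one sample."""
    return sum(1 for mhz in sample if mhz == FULL_SPEED_MHZ)


full_speed_counts = [cores_at_full_speed(s) for s in OPTERON_SAMPLES]
samples_meeting_demand = sum(1 for n in full_speed_counts if n >= BUSY_THREADS)

print("cores at 2600MHz per sample:", full_speed_counts)
print("samples with at least two cores at 2600MHz:",
      samples_meeting_demand, "of", len(OPTERON_SAMPLES))
```

In these seven samples no row ever has two cores at 2600MHz, which is consistent with the performance hit described above.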
Sleeping
We have focused on the active cores so far, but important power savings can also come from putting idle cores in sleep states. Did the CPU driver and OS scheduler work well together? Again, there are remarkable differences.
CPU Sleep State Comparison
CPU | % Idle | ACPI C1 (% of idle time) | ACPI C2 (% of idle time) | ACPI C3 (% of idle time) |
Opteron 2435 | 86 | 100 | 0 | 0 |
Xeon L3426 | 81 | 7 | 93 | 0 |
Opteron 2389 | 72.4 | 100 | 0 | 0 |
The six-core Opteron had more idle cores than the quad-core Opteron, and as a result it experienced more idle time. All idle time with the Opterons was spent in the C1/"Halt" state.
The Xeon was quite a bit more aggressive: 93% of the idle time was spent in the C2 state, but C2 at the operating system level does not mean the hardware actually runs in C2. In theory, the hardware is capable of putting the core into a "deeper" CC (Core Sleep) state. Intel promised that the idle Nehalem cores would be able to reach even the deepest C6 sleep while other cores were working. Did that actually happen?
Software tools read out the API of the OS and thus - as far as we know - always read out the ACPI states. We followed the guidelines in Intel's White Paper, "Intel Turbo Boost Technology in Intel Core Microarchitecture Based Processors", and did some programming (in assembly) to find the actual hardware C-states.
First we read out the Time Stamp Counter:
RDTSC
0x000086FCCA7EBD0E
Next we read out the relevant Model Specific Register:
RDMSR 0x3FD
High 32 bits (EDX) = 0x00007265, Low 32 bits (EAX) = 0xF842A000
We wait for 1500ms and then repeat the previous procedure:
RDTSC
0x000086FD78268DC2
RDMSR 0x3FD
High 32 bits (EDX) = 0x00007265, Low 32 bits (EAX) = 0xFA3F0000
In some cases, the MSR did not advance by a single tick, clearly indicating that the core had not entered C6 during the 1.5 second period. A physical core and its logical sibling report the same TSC and MSR values, so it is quite easy to distinguish the real cores from the logical cores that result from SMT (Hyper-Threading).
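The arithmetic behind the two readings can be sketched as follows. This is only an illustration using the hexadecimal values printed above: it combines EDX:EAX into one 64-bit counter, takes the difference between the two samples, and divides by the elapsed TSC ticks. The names are our own, and it assumes (as the tables below do) that the MSR ticks and the TSC ticks can be compared directly.

```python
# Readings copied from the text above; all names are illustrative.
TSC_START = 0x000086FCCA7EBD0E
TSC_END = 0x000086FD78268DC2


def combine_msr(edx, eax):
    """RDMSR returns a 64-bit value split across EDX (high) and EAX (low)."""
    return (edx << 32) | eax


C6_START = combine_msr(0x00007265, 0xF842A000)
C6_END = combine_msr(0x00007265, 0xFA3F0000)

elapsed_ticks = TSC_END - TSC_START   # clock ticks between the two samples
c6_ticks = C6_END - C6_START          # ticks the core spent in C6 in that window
c6_percent = 100.0 * c6_ticks / elapsed_ticks

print("elapsed ticks:", elapsed_ticks)
print("ticks in C6:  ", c6_ticks)
print(f"C6 residency:  {c6_percent:.2f}%")
```

With these inputs the script reports 2913456308 elapsed ticks, 33316864 ticks in C6, and a residency of 1.14%.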
With the "Performance" power plan we get:
"Performance" Power Profile C6
Core | Clockticks | Ticks spent in C6 | Percentage C6 |
Core 1 | 2913456308 | 33316864 | 1.14% |
Core 2 | 2933155470 | 0 | 0.00% |
Core 3 | 2950461391 | 2809569280 | 95.22% |
Core 4 | 2957802638 | 0 | 0.00% |
So on average the four cores spend about 24% of the time in C6, which is quite impressive. However, the way we measure this is not perfect: the measurement itself puts an extra load (slightly less than a chess thread) on the CPU. So the load on the CPU is not two but rather three threads. This means that the CPU probably spends even more time in C6 with only two active threads.
Next the same measurement but with the "Balanced" power plan:
"Balanced" Power Profile C6
Core | Clockticks | Ticks spent in C6 | Percentage C6 |
Core 1 | 2961019252 | 0 | 0.00% |
Core 2 | 2991271044 | 2371919872 | 79.29% |
Core 3 | 3012220038 | 74088448 | 2.46% |
Core 4 | 3012878436 | 22192128 | 0.74% |
This time we spend a little less time in C6: about 21%. Setting the power plan to Performance allows the idle cores to spend slightly more time in deep sleep, as the active cores are working harder. Of course, total power does not decline, as the higher power consumption of the Turbo Boosted cores is much more important than the small effect of some cores being in deep sleep an extra 10% of the time.
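The per-core percentages and the two averages quoted above can be reproduced from the tick counts. The sketch below hard-codes both tables and takes an unweighted mean of the four per-core percentages, which is our assumption about how the 24% and 21% figures were derived; the names are illustrative.

```python
# (clockticks, ticks spent in C6) per core, copied from the two tables above.
PERFORMANCE = [
    (2913456308, 33316864),
    (2933155470, 0),
    (2950461391, 2809569280),
    (2957802638, 0),
]
BALANCED = [
    (2961019252, 0),
    (2991271044, 2371919872),
    (3012220038, 74088448),
    (3012878436, 22192128),
]


def c6_percentages(rows):
    """Percentage of the measured window each core spent in C6."""
    return [100.0 * c6 / total for total, c6 in rows]


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


for name, rows in (("Performance", PERFORMANCE), ("Balanced", BALANCED)):
    per_core = c6_percentages(rows)
    formatted = ", ".join(f"{p:.2f}%" for p in per_core)
    print(f"{name}: {formatted} (mean {mean(per_core):.1f}%)")
```

The unweighted mean works out to roughly 24% for "Performance" and roughly 21% for "Balanced", so the difference between the two plans is about three percentage points.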
35 Comments
n0nsense - Monday, January 18, 2010
Here is what the system sees... only one is 2.5, the other three are 2.0 :)
nons ~ # cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
stepping : 7
cpu MHz : 2497.000
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority
bogomips : 5009.38
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
stepping : 7
cpu MHz : 1998.000
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority
bogomips : 7012.69
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
stepping : 7
cpu MHz : 1998.000
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 4
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority
bogomips : 5009.08
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
stepping : 7
cpu MHz : 1998.000
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority
bogomips : 5009.09
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
VJ - Tuesday, January 19, 2010
These are mobile CPUs, however: with Linux on a Latitude (Intel T7200 or T7500), CPU Frequency Scaling Monitor allows one to scale the frequency of one core to its max while leaving the other core at its minimum.
With an AMD TL62, this is not possible. The induced scaling of one core causes the frequency of the other core to follow.
With an AMD ZM84 this is possible. Just like with the Latitude, one can have one core at its max with the other core at its minimum.
Maybe what's shown is not what's taking place.
Additionally:
http://www.intel.com/technology/itj/2006/volume10i...
"For example, in a Dual-Processor system, when the OS decides to reduce the frequency of a single core, the other core can still run at full speed. In the Intel Core Duo system, however, lowering the frequency to one core slows down the other core as well."
VJ - Tuesday, January 19, 2010
Additionally, AMD's ZM84 allows each core to operate at different frequencies. The lowest frequency is 575MHz while the highest is 2300MHz. I can set one core to 1150MHz with the other set at 2300MHz. This is different from the Intel (mobile) CPUs I've come across, where a difference in frequency between cores is only possible when one core is (seemingly) operating at its lowest frequency (in a dual-core system).
What is also interesting from the aforementioned cpuinfo output is that only one core is running at its max frequency while all three other cores are (seemingly) at their minimum frequency. Considering my previous conjecture on C2 and C0 states, it would be surprising if one could show cpuinfo output where two cores are running at max frequency while the other two cores are running at any frequency other than max frequency. That shouldn't be possible at all.
valnar - Thursday, May 6, 2010
Does anyone know if this kind of power management for Lynnfield processors is available in Windows 2003?
hshen1 - Sunday, June 23, 2013
This is really a good article for power management researchers like me!!