Chip Architect: Cache efficiency for SPEC 2000 benchmarks

August 29, 2003

Cache efficiency for the SPEC 2000 benchmarks

by Hans de Vries

The SPEC 2000 benchmarks are subject to much debate in the scientific community. Are they broken? Do they just depend on memory bandwidth? Do they fit entirely in the cache? The recent publication of new benchmark results for the hp server rx5670 gives us a chance to produce some metrics. This small server is a four-processor machine with a single memory controller. The memory bandwidth is 6.4 GByte/second. We look at the scores for four different configurations:

1) Single 1000 MHz Itanium 2 with 3.0 MByte on-chip L3 cache (CINT2000, CFP2000)

2) Four 1000 MHz Itanium 2 with 3.0 MByte on-chip L3 cache (CINT2000, CFP2000)

3) Single 1500 MHz Itanium 2 with 6.0 MByte on-chip L3 cache (CINT2000, CFP2000)

4) Four 1500 MHz Itanium 2 with 6.0 MByte on-chip L3 cache (CINT2000, CFP2000)

We define the cache efficiency here as 100% if four processors running the benchmark simultaneously finish just as fast as a single processor. The cache efficiency is 0% if four processors take four times as long to finish the benchmark: the run time is then entirely determined by the throughput of the single memory controller.
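The definition above fixes only the two endpoints of the scale. As a minimal sketch, assume a linear interpolation in throughput between them: efficiency = (4 x t1 / t4 - 1) / 3, where t1 and t4 are the one- and four-processor run times. This formula is not stated in the article, but it reproduces the tabulated percentages (for example 91% for 181.mcf and 11% for 171.swim at 1000 MHz). The function name is illustrative.

```python
def cache_efficiency(t_1p: float, t_4p: float, n_cpus: int = 4) -> float:
    """Cache efficiency in percent from 1P and 4P run times in seconds.

    Assumed formula (inferred from the tables, not given in the article):
    100% when t_4p == t_1p, 0% when t_4p == n_cpus * t_1p.
    """
    # Throughput speedup of n_cpus processors over a single processor.
    speedup = n_cpus * t_1p / t_4p
    return 100.0 * (speedup - 1.0) / (n_cpus - 1)


if __name__ == "__main__":
    # Run times taken from the 1000 MHz columns of the tables.
    print(f"181.mcf : {cache_efficiency(220, 236):.0f} %")
    print(f"171.swim: {cache_efficiency(91.9, 274):.0f} %")
```

The same expression applied to the SPECint_rate and SPECfp_rate rows uses the ratio of the 4P and 1P rate scores directly as the speedup.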

We also give the performance ratios between the 1500 MHz and 1000 MHz systems for the one- and four-processor configurations. The ratio should be 1.5 (1500/1000) if the application fits entirely in the caches. It should be higher than 1.5 if the application fits better in the 6.0 MByte cache than in the 3.0 MByte cache. The ratio is lower than 1.5 if the memory controller becomes a bottleneck. A ratio of 1.0 effectively means that the performance is entirely determined by the memory controller throughput: the 1000 MHz processors run just as fast as the 1500 MHz processors. Some small differences are due to the fact that a newer version of the compiler is used for the 1500 MHz Itanium 2 configurations.
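The reading of these ratios can be sketched in a few lines. The ratio is the 1000 MHz run time divided by the 1500 MHz run time. The tolerance of 0.05 and the wording of the categories are assumptions made for illustration only; the article draws no sharp boundaries, and the compiler difference alone can shift a ratio slightly.

```python
def performance_ratio(t_1000: float, t_1500: float) -> float:
    """Speedup of the 1500 MHz / 6 MB system over the 1000 MHz / 3 MB system."""
    return t_1000 / t_1500


def interpret(ratio: float, clock_scaling: float = 1.5, tol: float = 0.05) -> str:
    """Classify a ratio along the lines of the text; tol is a hypothetical margin."""
    if ratio <= 1.0 + tol:
        return "limited by memory controller throughput"
    if ratio < clock_scaling - tol:
        return "partly limited by the memory controller"
    if ratio <= clock_scaling + tol:
        return "scales with the clock (fits in the caches)"
    return "gains extra from the larger cache"


if __name__ == "__main__":
    # (benchmark, run time at 1000 MHz, run time at 1500 MHz), from the tables.
    samples = [
        ("176.gcc 1P", 109, 73.2),
        ("181.mcf 1P", 220, 80.4),
        ("171.swim 4P", 274, 273),
        ("173.applu 4P", 114, 101),
    ]
    for name, t_1000, t_1500 in samples:
        r = performance_ratio(t_1000, t_1500)
        print(f"{name}: {r:.2f} -> {interpret(r)}")
```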

We'll see that we get very different results for the Integer and Floating Point benchmarks.

Cache efficiency for Integer SPEC 2000 benchmarks.

www.chip-architect.org
hp server rx5670
SPECint_rate	1000 MHz Itanium 2 3 MByte L3			1500 MHz Itanium 2 6 MByte L3			Ratios:* 1000 to 1500 MHz		Memory Footprint SPEC
	1P sec.	4P sec.	Cache efficiency	1P sec.	4P sec.	Cache efficiency	1P ratio	4P ratio	size resident	size virtual
164.gzip	240	241	99 %	143	145	99 %	1.68	1.66	180 MB	199 MB
175.vpr	200	203	98 %	126	127	99 %	1.59	1.60	50 MB	54 MB
176.gcc	109	111	98 %	73.2	74.5	98 %	1.49	1.49	154 MB	156 MB
181.mcf	220	236	91 %	80.4	82.5	97 %	2.74	2.86	190 MB	190 MB
186.crafty	128	128	100 %	81.3	81.4	100 %	1.57	1.57	2.0 MB	2.6 MB
197.parser	273	273	100 %	184	182	101 %	1.48	1.50	37 MB	67 MB
252.eon	129	130	99 %	84.1	84.2	100 %	1.53	1.54	0.6 MB	1.5 MB
253.perlbmk	221	221	100 %	151	150	101 %	1.46	1.47	146 MB	158 MB
254.gap	162	172	92 %	103	116	85 %	1.57	1.48	192 MB	194 MB
255.vortex	160	161	99 %	98.3	98.3	100 %	1.63	1.64	72 MB	79 MB
256.bzip2	198	199	99 %	123	124	99 %	1.61	1.60	185 MB	199 MB
300.twolf	341	342	100 %	234	234	100 %	1.46	1.46	3.4 MB	4.0 MB
SPECint_rate	9.36	36.8	98 %	15.2	60.0	98 %	1.62	1.63

*) different compiler versions are used for the 1000 MHz and 1500 MHz results


The caches seem to work quite well for the integer benchmarks. 10 out of 12 benchmarks have a cache efficiency of 98% or higher with the 3.0 MByte caches. This becomes 11 out of 12 with the 6.0 MByte caches (97% or higher). The integer benchmarks do a lot of work on the data they load into the cache. Tasks like compilation and compression have a high re-use of the data they work on. Caching works fine here even though the memory footprints as given by the SPEC committee are much larger than the cache itself.

With the 6.0 MByte cache, benchmark 181.mcf becomes number 11 of the highly efficient ones. We see performance improvements of 2.74x and 2.86x for the one- and four-processor versions. The only one left (254.gap) actually becomes less efficient, dropping from 92% to 85%. Four processors running this benchmark simultaneously are hindered more by the throughput of the single memory controller than they benefit from the increased cache size.

Cache efficiency for Floating Point SPEC 2000 benchmarks.

hp server rx5670
SPECfp_rate	1000 MHz Itanium 2 3 MByte L3			1500 MHz Itanium 2 6 MByte L3			Ratios:* 1000 to 1500 MHz		Memory Footprint SPEC
	1P sec.	4P sec.	Cache efficiency	1P sec.	4P sec.	Cache efficiency	1P ratio	4P ratio	size resident	size virtual
168.wupwise	155	169	89 %	112	129	82 %	1.38	1.31	176 MB	177 MB
171.swim	91.9	274	11 %	73.8	273	2.7 %	1.25	1.00	191 MB	192 MB
172.mgrid	97	238	21 %	71.8	225	9.2 %	1.35	1.06	56 MB	57 MB
173.applu	96.2	114	79 %	61.7	101	48 %	1.56	1.13	181 MB	191 MB
177.mesa	193	193	100 %	122	122	100 %	1.58	1.58	9.4 MB	23 MB
178.galgel	114	122	91 %	68.3	69	99 %	1.67	1.77	63 MB	155 MB
179.art	62.2	72.2	82 %	39	39.7	98 %	1.59	1.82	3.7 MB	4.3 MB
183.equake	67.8	125	39 %	44.6	108	22 %	1.52	1.16	49 MB	49 MB
187.facerec	151	182	77 %	97	124	71 %	1.56	1.47	16 MB	19 MB
188.ammp	243	249	97 %	155	156	99 %	1.57	1.60	26 MB	28 MB
189.lucas	151	274	40 %	114	266	24 %	1.32	1.03	142 MB	143 MB
191.fma3d	271	325	78 %	199	251	72 %	1.36	1.29	103 MB	105 MB
200.sixtrack	122	122	100 %	80	80.4	99 %	1.53	1.52	26 MB	60 MB
301.apsi	381	417	88 %	260	285	88 %	1.47	1.46	191 MB	192 MB
SPECfp_rate	16.6	49.3	66 %	24.5	66.4	57 %	1.48	1.35

*) different compiler versions are used for the 1000 MHz and 1500 MHz results


We see a very different picture for the floating point benchmarks, however. Only 3 out of 14 have a cache efficiency of 97% or higher with the 3.0 MByte cache. This becomes 5 out of 14 with the 6.0 MByte cache. Numbers 4 and 5 are the "infamous" 179.art and 178.galgel. 179.art becomes entirely cache resident in the 6.0 MByte cache. 178.galgel has a memory footprint much larger than the cache. The re-use of data, however, makes it very efficient in a 6.0 MByte cache.

Most of these scientific and technical benchmarks operate on large data structures. The re-use becomes very high if a significant part of the data structure fits into the cache. For instance, if the data structure is a 3D volume then it may be enough to hold three planes in the cache. This is because new values are often calculated from the directly surrounding points:

New Data [ i, j, k ] = Function { Data [ i-1, j-1, k-1 ] ..... Data [ i+1, j+1, k+1 ] }

The re-use of data then becomes 26 / 27 = 96%, where 27 is the number of points in the 3x3x3 sub-cube: only one new point needs to be loaded from memory when the 3x3x3 sub-cube moves a single position through the volume.
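The 26 / 27 figure can be checked with a small counting sketch. It assumes an idealized cache that always holds the three planes currently in use, so that every point of an n x n x n volume is loaded from memory exactly once, and it counts the 27 reads made for each interior point. The function name and the boundary handling (only interior points are updated) are assumptions for illustration.

```python
def stencil_reuse(n: int) -> float:
    """Fraction of 27-point stencil reads served from cache for an n^3 volume.

    Assumes each point is loaded from memory exactly once (three planes of
    the volume stay resident in the cache while the stencil sweeps through).
    """
    if n < 3:
        raise ValueError("volume must be at least 3 points wide")
    interior_points = (n - 2) ** 3      # points with a full 3x3x3 neighbourhood
    total_reads = 27 * interior_points  # every update reads 27 points
    memory_loads = n ** 3               # each point fetched from memory once
    return 1.0 - memory_loads / total_reads


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(f"n = {n:4d}: re-use = {100 * stencil_reuse(n):.1f} %")
    print(f"limit 26/27 = {100 * 26 / 27:.1f} %")
```

For small volumes the boundary points lower the re-use somewhat; as n grows the value approaches 26 / 27.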

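We see that two benchmarks are almost completely bandwidth starved at 1500 MHz, even for a single processor: 171.swim and 172.mgrid. Their cache efficiencies drop to 2.7% and 9.2%, and their four-processor ratios are only 1.00 and 1.06.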

8 out of 14 benchmarks become less cache efficient if we go from the 1000 MHz / 3 MB cache Itanium 2's to the newer 1500 MHz / 6 MB Itanium 2's. The memory controller bottleneck more than cancels the advantage of the larger caches.

If you're looking to use the Itanium 2 for large scale scientific or technical calculations then you'll have to look at SGI with its proprietary memory controllers. SGI uses two SHUB memory controllers for every four processors. The fully loaded 64 processor Altix 3000 has a 65% higher throughput than the half loaded 32 processor Altix 3000.

The memory footprint of the SPEC2000 benchmarks is kept below 200 MByte so that they can run on systems with 256 MByte of DRAM. Heavier applications using multi-gigabyte data structures are likely to see much greater degradations. AMD's distributed memory solution based on HyperTransport links is likely to pay off in these cases. A four-processor 2200 MHz Opteron may reach a SPEC2000_rate performance similar to that of a four-way 1500 MHz Itanium 2, even though the latter has a much higher single processor score. Again, larger floating point memory footprints may skew the results even further.

Regards, Hans