August 29, 2003

Cache efficiency for the SPEC 2000 benchmarks 

 by Hans de Vries 

 

 


 

The SPEC 2000 benchmarks are subject to much debate in the scientific community. Are they broken? Do they just depend on memory bandwidth? Do they fit entirely in the cache? The recent publication of new benchmark results for the hp server rx5670 gives us a chance to produce some metrics. This small server is a four-processor machine with a single memory controller. The memory bandwidth is 6.4 GByte/second. We look at the scores for four different configurations:

 

1)  Single 1000 MHz Itanium 2 with 3.0 MByte L3 on Chip Cache     CINT2000     CFP2000

2)  Four    1000 MHz Itanium 2 with 3.0 MByte L3 on Chip Cache     CINT2000     CFP2000

3)  Single 1500 MHz Itanium 2 with 6.0 MByte L3 on Chip Cache     CINT2000     CFP2000

4)  Four    1500 MHz Itanium 2 with 6.0 MByte L3 on Chip Cache     CINT2000     CFP2000

 

We define the cache efficiency here as 100% if four processors, each running its own copy of the benchmark, finish just as fast as a single processor. The cache efficiency is 0% if four processors take four times as long to finish the benchmark. This means that the run-time is entirely determined by the throughput of the single memory controller.
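
The definition above only fixes the two endpoints. The following is a minimal sketch in Python, assuming the efficiency is interpolated linearly in throughput between them. That interpolation is an assumption, not something stated in the text, but it reproduces the percentages listed in the tables (for example 91% for 181.mcf at 1000 MHz). The function name `cache_efficiency` is hypothetical.

```python
def cache_efficiency(t_1p: float, t_4p: float, n_cpus: int = 4) -> float:
    """Cache efficiency as a fraction (1.0 = 100 %).

    t_1p: run time in seconds with one processor running one copy.
    t_4p: run time in seconds with n_cpus processors, one copy each.
    Assumption: linear interpolation in throughput between the two
    endpoints given in the text.
    """
    speedup = n_cpus * t_1p / t_4p          # throughput gain over one processor
    return (speedup - 1.0) / (n_cpus - 1)


# Endpoints from the definition.
print(cache_efficiency(100, 100))   # 1.0: four processors finish just as fast
print(cache_efficiency(100, 400))   # 0.0: four processors take four times as long

# Run times taken from the tables.
print(f"{cache_efficiency(220, 236):.0%}")    # 181.mcf, 1000 MHz: 91%
print(f"{cache_efficiency(73.8, 273):.1%}")   # 171.swim, 1500 MHz: 2.7%
```

With this reading, a value slightly above 100% (such as 101% for 197.parser) simply means the four-processor run finished marginally faster than the single-processor run.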

 

We also give the performance ratios between the 1000 MHz and 1500 MHz parts for the one and four processor configurations. The ratio should be 1.5 (1500/1000) if the application fits entirely in the caches. It should be higher than 1.5 if it fits better in the 6.0 MByte cache than in the 3.0 MByte cache. The ratio is lower than 1.5 if the memory controller becomes a bottleneck. A ratio of 1.0 effectively means that the performance is entirely determined by the memory controller throughput: the 1000 MHz processors run just as fast as the 1500 MHz processors. Some small differences are due to the fact that a newer version of the compiler is used for the 1500 MHz Itanium 2 configurations.
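
The reading of these ratios can be sketched in a few lines of Python. The ratio is simply the 1000 MHz run time divided by the 1500 MHz run time. The helper names `speed_ratio` and `classify`, the category labels, and the tolerance of 0.05 (to absorb the compiler differences mentioned above) are assumptions made for illustration only.

```python
CLOCK_RATIO = 1500 / 1000   # clock ratio of the two Itanium 2 parts


def speed_ratio(t_1000: float, t_1500: float) -> float:
    """Run time at 1000 MHz divided by run time at 1500 MHz."""
    return t_1000 / t_1500


def classify(ratio: float, tol: float = 0.05) -> str:
    """Rough interpretation of a ratio; tol is an arbitrary tolerance."""
    if ratio <= 1.0 + tol:
        return "memory-bound"               # faster clock gives no gain
    if ratio < CLOCK_RATIO - tol:
        return "partly memory-limited"      # gain below the clock ratio
    if ratio <= CLOCK_RATIO + tol:
        return "scales with clock"          # about 1.5
    return "gains from larger cache"        # clearly above 1.5


# Run times in seconds taken from the tables.
samples = [
    ("176.gcc 1P", 109, 73.2),
    ("181.mcf 1P", 220, 80.4),
    ("171.swim 4P", 274, 273),
    ("183.equake 4P", 125, 108),
]
for name, t_1000, t_1500 in samples:
    r = speed_ratio(t_1000, t_1500)
    print(f"{name}: {r:.2f} ({classify(r)})")
```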

 

We'll see that we get very different results for the Integer and Floating Point benchmarks.

 

    Cache efficiency for  Integer SPEC 2000 benchmarks.

 

www.chip-architect.org

hp server rx5670, SPECint_rate

               | 1000 MHz Itanium 2, 3 MByte L3 | 1500 MHz Itanium 2, 6 MByte L3 | Ratios:*          | Memory footprint
               |                                |                                | 1000 to 1500 MHz  | (SPEC)
Benchmark      | 1P sec.  4P sec.  Cache eff.   | 1P sec.  4P sec.  Cache eff.   | 1P ratio 4P ratio | Resident  Virtual
---------------+--------------------------------+--------------------------------+-------------------+------------------
164.gzip       | 240      241      99 %         | 143      145      99 %         | 1.68     1.66     | 180 MB    199 MB
175.vpr        | 200      203      98 %         | 126      127      99 %         | 1.59     1.60     | 50 MB     54 MB
176.gcc        | 109      111      98 %         | 73.2     74.5     98 %         | 1.49     1.49     | 154 MB    156 MB
181.mcf        | 220      236      91 %         | 80.4     82.5     97 %         | 2.74     2.86     | 190 MB    190 MB
186.crafty     | 128      128      100 %        | 81.3     81.4     100 %        | 1.57     1.57     | 2.0 MB    2.6 MB
197.parser     | 273      273      100 %        | 184      182      101 %        | 1.48     1.50     | 37 MB     67 MB
252.eon        | 129      130      99 %         | 84.1     84.2     100 %        | 1.53     1.54     | 0.6 MB    1.5 MB
253.perlbmk    | 221      221      100 %        | 151      150      101 %        | 1.46     1.47     | 146 MB    158 MB
254.gap        | 162      172      92 %         | 103      116      85 %         | 1.57     1.48     | 192 MB    194 MB
255.vortex     | 160      161      99 %         | 98.3     98.3     100 %        | 1.63     1.64     | 72 MB     79 MB
256.bzip2      | 198      199      99 %         | 123      124      99 %         | 1.61     1.60     | 185 MB    199 MB
300.twolf      | 341      342      100 %        | 234      234      100 %        | 1.46     1.46     | 3.4 MB    4.0 MB
---------------+--------------------------------+--------------------------------+-------------------+------------------
SPECint_rate   | 9.36     36.8     98 %         | 15.2     60.0     98 %         | 1.62     1.63     |

*) different compiler versions are used for the 1000 MHz and 1500 MHz results

The SPECint_rate row lists the overall rate scores (higher is better), not run times in seconds.


 

The caches seem to work quite well for the integer benchmarks. 10 out of 12 benchmarks have a cache efficiency of 98% or higher for the 3.0 MByte cache. This becomes 11 out of 12 for the 6.0 MByte cache (97% or higher). The integer benchmarks do a lot of work on the data they load into the cache. Tasks like compilation and compression have a high re-use of the data they work on. Caching works fine here even though the memory footprints as given by the SPEC committee are much larger than the cache itself.
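
These counts can be reproduced directly from the cache efficiency columns of the integer table. The snippet below is only a small check; the dictionary name `INT_EFFICIENCY` and the helper `count_at_least` are hypothetical, and the percentages are copied from the table.

```python
# Cache efficiency in percent: (3 MByte L3 at 1000 MHz, 6 MByte L3 at 1500 MHz)
INT_EFFICIENCY = {
    "164.gzip": (99, 99),
    "175.vpr": (98, 99),
    "176.gcc": (98, 98),
    "181.mcf": (91, 97),
    "186.crafty": (100, 100),
    "197.parser": (100, 101),
    "252.eon": (99, 100),
    "253.perlbmk": (100, 101),
    "254.gap": (92, 85),
    "255.vortex": (99, 100),
    "256.bzip2": (99, 99),
    "300.twolf": (100, 100),
}


def count_at_least(column: int, threshold: float) -> int:
    """Number of benchmarks whose efficiency in the given column is at or above threshold."""
    return sum(1 for eff in INT_EFFICIENCY.values() if eff[column] >= threshold)


print(count_at_least(0, 98), "of", len(INT_EFFICIENCY))   # 3 MByte cache, 98% or higher
print(count_at_least(1, 97), "of", len(INT_EFFICIENCY))   # 6 MByte cache, 97% or higher
```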

 

With the 6.0 MByte cache, benchmark 181.mcf becomes number 11 of the highly efficient ones (91% to 97%). We see performance improvements of 2.74x and 2.86x for the one and four processor versions, far more than the 1.5x increase in clock speed. The only one left, 254.gap, actually becomes less efficient: 92% to 85%. Four processors running this benchmark simultaneously are hindered more by the throughput of the single memory controller than they benefit from the increased cache size.

 

     Cache efficiency for Floating Point SPEC 2000 benchmarks.

 

www.chip-architect.org

hp server rx5670, SPECfp_rate

               | 1000 MHz Itanium 2, 3 MByte L3 | 1500 MHz Itanium 2, 6 MByte L3 | Ratios:*          | Memory footprint
               |                                |                                | 1000 to 1500 MHz  | (SPEC)
Benchmark      | 1P sec.  4P sec.  Cache eff.   | 1P sec.  4P sec.  Cache eff.   | 1P ratio 4P ratio | Resident  Virtual
---------------+--------------------------------+--------------------------------+-------------------+------------------
168.wupwise    | 155      169      89 %         | 112      129      82 %         | 1.38     1.31     | 176 MB    177 MB
171.swim       | 91.9     274      11 %         | 73.8     273      2.7 %        | 1.25     1.00     | 191 MB    192 MB
172.mgrid      | 97       238      21 %         | 71.8     225      9.2 %        | 1.35     1.06     | 56 MB     57 MB
173.applu      | 96.2     114      79 %         | 61.7     101      48 %         | 1.56     1.13     | 181 MB    191 MB
177.mesa       | 193      193      100 %        | 122      122      100 %        | 1.58     1.58     | 9.4 MB    23 MB
178.galgel     | 114      122      91 %         | 68.3     69       99 %         | 1.67     1.77     | 63 MB     155 MB
179.art        | 62.2     72.2     82 %         | 39       39.7     98 %         | 1.59     1.82     | 3.7 MB    4.3 MB
183.equake     | 67.8     125      39 %         | 44.6     108      22 %         | 1.52     1.16     | 49 MB     49 MB
187.facerec    | 151      182      77 %         | 97       124      71 %         | 1.56     1.47     | 16 MB     19 MB
188.ammp       | 243      249      97 %         | 155      156      99 %         | 1.57     1.60     | 26 MB     28 MB
189.lucas      | 151      274      40 %         | 114      266      24 %         | 1.32     1.03     | 142 MB    143 MB
191.fma3d      | 271      325      78 %         | 199      251      72 %         | 1.36     1.29     | 103 MB    105 MB
200.sixtrack   | 122      122      100 %        | 80       80.4     99 %         | 1.53     1.52     | 26 MB     60 MB
301.apsi       | 381      417      88 %         | 260      285      88 %         | 1.47     1.46     | 191 MB    192 MB
---------------+--------------------------------+--------------------------------+-------------------+------------------
SPECfp_rate    | 16.6     49.3     66 %         | 24.5     66.4     57 %         | 1.48     1.35     |

*) different compiler versions are used for the 1000 MHz and 1500 MHz results

The SPECfp_rate row lists the overall rate scores (higher is better), not run times in seconds.


 

We see a very different picture for the Floating Point benchmarks, however. Only 3 out of 14 have a cache efficiency of 97% or higher for the 3.0 MByte cache. This becomes 5 out of 14 for the 6.0 MByte cache. Numbers 4 and 5 are the "infamous" 179.art and 178.galgel. 179.art becomes entirely cache resident in the 6.0 MByte cache. 178.galgel has a memory footprint much larger than the cache. The re-use of data, however, makes it very efficient in a 6.0 MByte cache.

 

Most of these scientific and technical benchmarks operate on large data structures. The re-use becomes very high if a significant part of the data structure fits into the cache. For instance, if the data structure is a 3D volume then it may be enough to hold three planes in the cache. This is because new values are often calculated from the directly surrounding points:

 

New Data [ i, j, k ]  = Function { Data [ i-1, j-1, k-1 ] ..... Data [ i+1, j+1, k+1 ]   }

 

The re-use of data then becomes 26 / 27 = 96%, where 27 is the number of points in the 3x3x3 sub-cube. Only one new point needs to be loaded from memory if the 3x3x3 sub-cube moves a single position through the volume.
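
The 26 / 27 figure can be checked with a small sweep. The following is a minimal sketch in Python, assuming an ideal cache that never evicts a point once it has been loaded (a stronger assumption than the three planes mentioned above). The helper name `stencil_reuse` is hypothetical.

```python
from itertools import product


def stencil_reuse(n: int) -> float:
    """Fraction of stencil reads served from cache for an n*n*n volume.

    Sweeps a 3x3x3 stencil over all interior points and counts how many
    of the reads touch a point that was already loaded earlier.
    Assumption: an ideal cache that never evicts a loaded point.
    """
    offsets = list(product((-1, 0, 1), repeat=3))   # the 27 stencil points
    loaded = set()                                   # points fetched from memory
    reads = 0                                        # total stencil reads
    for i, j, k in product(range(1, n - 1), repeat=3):
        for di, dj, dk in offsets:
            reads += 1
            loaded.add((i + di, j + dj, k + dk))
    return 1.0 - len(loaded) / reads


print(f"steady state: {26 / 27:.1%}")   # 96.3%
for n in (10, 20, 40):
    print(f"n = {n}: {stencil_reuse(n):.1%}")
```

For small volumes the measured re-use stays somewhat below 26 / 27 because the boundary points are loaded but never used as a stencil centre; the value approaches 26 / 27 as the volume grows.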

 

We see that 2 benchmarks are completely bandwidth starved at 1500 MHz: 171.swim and 172.mgrid. A single processor already comes close to saturating the memory controller, so four processors take three to four times as long (cache efficiencies of 2.7% and 9.2%).

8 out of 14 benchmarks become less cache efficient if we go from the 1000 MHz / 3 MB cache Itanium 2's to the newer 1500 MHz / 6 MB Itanium 2's. The memory controller bottleneck more than cancels the advantage of the larger caches.

 

If you're looking to use the Itanium 2 for large-scale scientific or technical calculations then you'll have to look at SGI with its proprietary memory controllers. SGI uses two SHUB memory controllers for every four processors. The fully loaded 64 processor Altix 3000 has a 65% higher throughput than the half loaded 32 processor Altix 3000.

 

The memory footprint of the SPEC2000 benchmarks is kept below 200 MByte so that they can run on systems with 256 MByte of DRAM. Heavier applications using multi-gigabyte data structures are likely to see much greater degradation. AMD's distributed memory solution based on HyperTransport links is likely to pay off in these cases. A four processor 2200 MHz Opteron may reach a SPEC2000_rate performance similar to that of a four-way 1500 MHz Itanium 2, even though the latter has a much higher single processor score. Again, larger floating point memory footprints may skew the results even further.

 

 

 

    Regards,  Hans

 

 

 
