Fresh from the foundry, Intel's fast new 64-bit chip leaps ahead of previous 32-bit Xeons. But can it challenge Opteron?

Infatuated with Itanium, Intel has long resisted the obvious: creating a 64-bit chip that simply extends the x86 architecture on which the company built its fortune. Instead, Intel has ceded that ground to AMD, whose 64-bit, x86-compatible Opteron has steadily eaten into the market shares of both Itanium and the 32-bit Xeon throughout the past year. For many, Itanium has been too expensive and too much of a departure. And Xeon has lacked the 64-bit headroom provided by Opteron.

The game changed in late June when Intel introduced its newest Xeon processor, code-named Nocona. Intel unveiled the most recent arrival in its storied Xeon processor line with little fanfare, but in fact, it’s a major departure given that Intel has actually adapted the x86-64 instruction set that AMD developed for the Opteron. Intel’s new architecture, the EM64T (Extended Memory 64 Technology), brings a Xeon chip into direct competition with the Opteron for the first time.

The newly released CPU will take a while to make its mark in the 64-bit computing world because all x86-64 code written so far has been for Opteron. Software tweaks and optimizations are needed before recently developed x86-64 code will run on the processor. For example, the 64-bit beta version of Windows Server 2003 will not run on this chip, as it’s been coded for the Opteron and will not install if CPU detection fails to find an Opteron. This will be remedied soon, no doubt, but similar issues will abound for some time. Similarly, Red Hat Advanced Server Update 1 will not install on an EM64T system, but Update 3 will because EM64T support has been added.

Downshifting to 32 bit

Lack of 64-bit support is the main reason I benchmarked the Nocona chip in 32-bit mode only, using a 32-bit Xeon system as a basis for comparison rather than an Opteron system.
Given time and development effort, 64-bit benchmarks will be more accurate and will provide a clearer picture as to how Nocona measures up against Opteron in 64-bit performance. Nonetheless, I was able to install Red Hat Linux Advanced Server 3.0 for x86-64 Update 3 on our prerelease, Nocona-based Dell PowerEdge 2800 test system. A few new drivers were required, notably the MegaRAID2 driver for the PERC RAID controller. Otherwise, the installation was smooth and uneventful (see more on Opteron vs. the new EM64T-based Xeon).

Running strictly 32-bit code, I tested the new chip on Red Hat Linux Advanced Server 3.0 for i386. The Dell PowerEdge 2800 test system had dual 3.6GHz Xeon EM64T CPUs with 1MB of Level 2 cache, an 800MHz FSB (front-side bus), and 4GB of DDR2 RAM. The 32-bit Xeon contender was a Dell PowerEdge 2600 that had dual 3.2GHz Pentium 4 Xeon CPUs with 512KB of Level 2 cache, a 533MHz FSB, and 4GB of DDR RAM. For reference, I also ran the Linpack tests on a Hewlett-Packard ML350 with dual 2.8GHz Pentium 4 Xeon CPUs with a 400MHz FSB, 512KB of Level 2 cache, and 4GB of RAM. All tests were run with Hyper-Threading disabled, and I ran all tests on the stock Red Hat 2.4.21-EL kernel (these numbers would probably improve with a v2.6 kernel).

The test suite was similar to the one I used to benchmark the v2.6 Linux kernel in January, providing a snapshot of real-world performance as measured by static and dynamic Web serving tests run on Apache 2.0.46 and a single-threaded database test run on MySQL 3.23.58. This time, I also ran Linpack tests to compute GFLOPS ratings for the systems (1 GFLOPS equals 1 billion floating-point operations per second). The results of these tests were more indicative of the next generation of Xeon-based servers (with PCI Express, an 800MHz FSB, and higher-throughput DDR2 RAM) and less indicative of specific CPU performance.
The systems tested varied in FSB speed, RAM architecture, and Level 2 cache sizes, but the numbers suggest what we can expect from EM64T-based servers over the next year.

The Web serving tests were driven by Apache’s ab benchmarking tool, measuring performance in requests per second as delivered by each system across a Gigabit copper connection. The static test page was a highly graphical page totaling 100KB, with 10,000 requests run per test, maintaining 10 concurrent connections at all times. The 3.6GHz EM64T-based PowerEdge 2800 easily outperformed the 3.2GHz PowerEdge 2600 by a whopping 67 percent. The dynamic Web tests were run with 5,000 requests per test, 10 concurrent, calling a CGI script written in Perl that presented an HTML table with 200 rows of eight columns selected from a MySQL database of 35,000 rows. The MySQL database ran on the same system as the Web server to eliminate any performance interference that a separate database server might have introduced during the test. The numbers here are definitely closer but still show the 3.6GHz PowerEdge 2800 outpacing the 3.2GHz PowerEdge 2600 by almost 12 percent.

Using MySQL’s single-threaded sql-bench benchmark suite, I ran the database tests from another Red Hat Advanced Server over a gigabit LAN. Rather than proceed down the slippery slope of database performance tweaking, I ran all tests with no optimization whatsoever. Again, the PowerEdge 2800 easily beat the 3.2GHz Xeon server, with a 23 percent average performance gain across all tests.

More than just another Xeon

To get a better reading on the true performance of the chips, I ran HPL (High-Performance Linpack) tests on the two servers, as well as an older Pentium 4 Xeon 2.8GHz server.
Although I expected the Nocona system to outperform the others, the results were still impressive: Running 32-bit with a single thread per processor, the PowerEdge 2800 was well ahead of the PowerEdge 2600 system, posting a maximum performance increase of 44 percent at a problem size of 10,000.

The raw figures put the PowerEdge 2800 at a peak of 10.244 GFLOPS at the highest problem size of 20,000, whereas the PowerEdge 2600 topped out at 7.209 GFLOPS at the same problem size. Obviously, the Nocona chip has solid floating-point chops. The 2.8GHz Xeon HP ML350 unsurprisingly clocked in at the bottom, with a 6.078 GFLOPS rating at a problem size of 20,000. Thus, the performance delta between the 2.8GHz and 3.2GHz x86 Xeon systems at that problem size was slightly more than 1 GFLOPS, given identical Level 2 cache sizes. The delta between the 3.2GHz x86 Xeon PowerEdge 2600 and the 3.6GHz EM64T-based PowerEdge 2800 was an impressive 3 GFLOPS. These results were derived from independent math libraries, but Dell requested that I run Intel’s Linpack 1000 benchmarks as well. These tests were run with Intel-optimized libraries. The results of the Linpack 1000 tests show similar performance ranges to the straight Linpack tests, with higher overall GFLOPS ratings.

In all but one test I ran, the 3.6GHz Nocona's lead was larger, sometimes much larger, than the 12 percent clock speed difference between it and its nearest rival, the 3.2GHz Xeon, would suggest. No doubt the larger cache and faster FSB accounted for the big margins.

But that’s not the end of the argument. In the tests I’m running for an upcoming article, a quad-CPU 2.2GHz Opteron 848-based server is outpacing a quad-CPU Xeon 3.06GHz server handily in 32-bit performance, although rough Linpack tests show that the 3.6GHz EM64T-based PowerEdge 2800 is pulling ahead of the 2.2GHz Opteron 848-based server in 32-bit floating-point performance.
Regardless of whether 64-bit code is being run, it appears that the x86-64 chips are strong performers. My preliminary tests suggest that for high-performance computing, Intel’s Nocona is probably not the best choice. Intel’s Itanium has better floating-point performance than any Xeon CPU, as does AMD’s Opteron. But for demanding 32-bit x86-based tasks, EM64T-based CPUs provide solid performance. As far as 64-bit performance goes, however, the jury is still out.
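
For readers who want to check the Linpack arithmetic quoted above, the deltas can be recomputed in a few lines of Python. This is only a rough sketch: the GFLOPS ratings and clock speeds are the figures reported in this review, while the dictionary keys, variable names, and the `percent_gain` helper are illustrative and not part of any benchmark tool.

```python
# Peak HPL ratings in GFLOPS at a problem size of 20,000, as reported in the review.
# The short keys are illustrative labels, not official product identifiers.
PEAK_GFLOPS = {
    "pe2800": 10.244,  # Dell PowerEdge 2800, dual 3.6GHz EM64T Xeon
    "pe2600": 7.209,   # Dell PowerEdge 2600, dual 3.2GHz Xeon
    "ml350": 6.078,    # HP ML350, dual 2.8GHz Xeon
}


def percent_gain(new, old):
    """Return how much larger `new` is than `old`, as a percentage of `old`."""
    return (new - old) / old * 100.0


# Absolute deltas between neighboring systems, in GFLOPS.
delta_2800_vs_2600 = PEAK_GFLOPS["pe2800"] - PEAK_GFLOPS["pe2600"]
delta_2600_vs_ml350 = PEAK_GFLOPS["pe2600"] - PEAK_GFLOPS["ml350"]

# Relative Linpack gain of the Nocona system versus its clock speed advantage.
linpack_gain = percent_gain(PEAK_GFLOPS["pe2800"], PEAK_GFLOPS["pe2600"])
clock_gain = percent_gain(3.6, 3.2)

print(f"PowerEdge 2800 vs. 2600: {delta_2800_vs_2600:.3f} GFLOPS")
print(f"PowerEdge 2600 vs. ML350: {delta_2600_vs_ml350:.3f} GFLOPS")
print(f"Linpack gain at 20,000: {linpack_gain:.1f} percent")
print(f"Clock speed gain: {clock_gain:.1f} percent")
```

At a problem size of 20,000, the gain works out to roughly 42 percent, a little under the 44 percent maximum seen at 10,000 and far more than the 12.5 percent clock speed difference alone would explain.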