
Introducing a New Dynamically and Design-
Scalable Microarchitecture that Rewrites the
Book On Energy Efficiency and Performance
Since the introduction of Intel® Core™ microarchitecture in 2006
and its 45nm enhancements—the 45nm next generation Intel
Core microarchitecture (Penryn family of processors) in 2007—the
blistering performance and energy efficiency of Intel® microprocessors
have delivered unprecedented capability to end users.
Now in 2008, a new microarchitecture code named Nehalem
stands to further build on these microarchitectural marvels,
rewriting the book on processor energy efficiency, performance,
and scalability.
The first chapter is all about scalability. Next generation Intel®
microarchitecture (Nehalem) is a dynamically scalable and design-scalable
microarchitecture. At runtime, it dynamically manages
cores, threads, cache, interfaces, and power to deliver outstanding
energy efficiency and performance on demand. At design time, it
scales, enabling Intel to easily provide versions that are optimized
for each server, desktop, and notebook market. Intel will deliver
versions differing in the number of cores, caches, interconnect
capability, and memory controller capability, as well as in the
segmented use of an integrated graphics controller. This allows
Intel to deliver a wide range of price, performance, and energy
efficiency targets for servers, workstations, desktops, and laptops.
To extract greater performance from this new microarchitecture,
in targeted market segments, Intel is also introducing a new
platform architecture: Intel® QuickPath Architecture. Through
integrated memory controllers and a high-speed interconnect for
connecting processors and other components, Intel QuickPath
Architecture delivers best-in-class performance, bandwidth, and
reliability. In turn, it truly enables systems to fully unleash the
new levels of performance that new and more powerful next
generation microarchitecture-based processor cores will deliver.
Next generation Intel microarchitecture (Nehalem) marks the
next step (a “tock”) in Intel’s rapid “tick-tock” cadence for
delivering a new process technology (tick) or an entirely new
microarchitecture (tock) every year. The first Nehalem-based
processors are expected to release in the latter part of 2008.
The family will grow to include server, workstation, desktop, and
mobile processors.
Unlocking All the Power of Intel’s 45nm Hi-k
Metal Gate Process Technology
Next generation Intel microarchitecture (Nehalem) has been
designed from the ground up to capitalize on all the advantages
of Intel’s industry-leading 45-nanometer (nm) Hi-k metal gate
silicon technology. This new process technology is one of the
biggest advancements in fundamental transistor design in 40
years. It uses a new material combination of Hi-k gate dielectrics
and conductors to enable Intel to continue record-breaking PC,
laptop, and server processor performance while reducing the
electrical leakage from transistors that can hamper chip and PC
design, size, power consumption, and costs.
Intel’s 45nm Hi-k silicon process technology increases transistor
switching speeds to enable higher core and bus clock frequencies
and thus more performance in the same power and thermal
envelope. This performance efficiency is helping Intel extend
Moore’s Law (a high-tech industry axiom that transistor counts
double about every two years to deliver ever more functionality
at exponentially decreasing cost) well into the next decade.
An Overview of Intel’s New Microarchitecture
Next generation Intel microarchitecture (Nehalem) is the next
step in Intel’s continuing success in leading the industry in
processor performance and energy efficiency. In fact, it represents
another big leap in performance and energy efficiency,
similar to the leap made by Intel Core microarchitecture over
the first 90nm Intel® Pentium® M processors.
Next generation Intel microarchitecture (Nehalem) continues
Intel’s philosophy of focusing on improvements in how the
processor uses available clock cycles and power, rather than just
pushing up ever higher clock speeds and energy needs. The goal
is to do more in the same power envelope—or even reduced
envelopes. In turn, like its Intel Core microarchitecture predecessor,
next generation Intel microarchitecture (Nehalem) includes the
ability to process up to four instructions per clock cycle on a
sustained basis compared to just three instructions per clock
cycle or less processed by other processors. However, the next
generation microarchitecture’s biggest innovations come from
new optimizations of the individual cores and the overall
multi-core microarchitecture to increase single-thread and multi-thread
performance.
The next generation microarchitecture’s performance
and power management innovations include:
• Dynamically managed cores, threads, cache, interfaces,
and power.
• Simultaneous multi-threading (SMT) for enabling a more energy
efficient means of increasing performance for multi-threaded
workloads. The next generation microarchitecture’s SMT
capability enables running two simultaneous threads per
core—an amazing eight simultaneous threads per quad-core
processor and 16 simultaneous threads for dual-processor
quad-core designs.
• Innovative extensions to the Intel® Streaming SIMD Extensions
4 (SSE4) that center on enhancing XML, string, and text
processing performance.
• Superior multi-level cache, including an inclusive shared
L3 cache.
• New high-end system architecture that delivers from two
to three times more peak bandwidth and up to four times more
realized bandwidth (depending on configuration) as compared
to today’s Intel® Xeon® processors.
• Performance-enhanced dynamic power management.
On the design side, next generation Intel microarchitecture
(Nehalem) enables optimal price/performance/energy efficiency
for each market segment through:
• Scalable performance from one to 16 (or more) threads
and from one to eight (or more) cores.
• Scalable and configurable system interconnects and integrated
memory controllers.
• High-performance integrated graphics engine for client platforms.
Let’s look at how the next generation microarchitecture’s dynamically
scalable and design-scalable features directly contribute to power
efficiency and performance.
Power Efficiency
In the past, when a computer’s energy efficiency wasn’t a
concern, nearly every architecture feature that could improve
processor performance would be included without worrying
about the power cost. But in an age of increasing concern for
limited resources and increased energy costs, every segment
(server, workstation, desktop, and mobile) is power-constrained
and designing a microarchitecture requires a different approach.
Processor manufacturers must consider the power cost whether
the processor is intended for the home, data center, or ultra-light
laptop. Consequently, Intel weighed every architectural feature
added to the next generation microarchitecture against a strict
power/performance efficiency threshold. If the feature couldn’t
add more than a one percent performance gain vs. one percent
power gain for a less than three percent power cost, Intel wouldn’t
add it. By measuring the benefit of the performance gain against
the power cost, Intel was able to design the next generation
microarchitecture to deliver greater power efficiency at any
power envelope.
Performance Improvement Features
With the next generation microarchitecture, Intel made significant
core enhancements to further improve the performance of the
individual processor cores. Below we describe some of
these enhancements.
Instructions Per Cycle Improvements. The more instructions that
can be run per clock cycle, the greater the performance. In
addition, in many cases, by running more instructions in any given
clock cycle, the work task can complete sooner, enabling the
processor to more quickly get back into a lower power state. To run
more instructions per cycle, Intel made several key innovations.
• Greater Parallelism. One way to extract more parallelism out
of software code is to increase the number of instructions
that can be run “out of order.” This enables more simultaneous
processing and lets latencies overlap. To be able to identify more
independent operations that can be run in parallel, Intel increased
the size of the out-of-order window and scheduler, giving them
a wider window from which to look for these operations. Intel
also increased the size of the other buffers in the core to
ensure they wouldn’t become a limiting factor.
• More Efficient Algorithms. With each new microarchitecture,
Intel has included improved algorithms in places where previous
processor generations saw lost performance due to stalls (dead
cycles). Next generation Intel microarchitecture (Nehalem)
brings many such improved algorithms to increase performance.
These include:
- Faster Synchronization Primitives: As multi-threaded
software becomes more prevalent, the need to synchronize
threads is also becoming more common. Next generation
Intel microarchitecture (Nehalem) speeds up the common
legacy synchronization primitives (such as instructions with
a LOCK prefix or the XCHG instruction) so that existing
threaded software will see a performance boost.
- Faster Handling of Branch Mispredictions: A common
way to increase performance is through the prediction of
branches. Next generation Intel microarchitecture (Nehalem)
optimizes the cases where the predictions are wrong, so
that the effective penalty of branch mispredictions overall
is lower than on prior processors.
- Improved Hardware Prefetch and Better Load-Store
Scheduling: Next generation Intel microarchitecture
(Nehalem) continues the many advances Intel made
with the 45nm next generation Intel Core microarchitecture
(Penryn) family of processors in reducing memory
access latencies through prefetch and load-store
scheduling improvements.
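The "Greater Parallelism" point above can be illustrated with a toy scheduler. This is only a sketch under stated assumptions, not a model of the actual Nehalem core: the function name cycles_to_finish, the sample program, the latencies, and the simplification that the window covers the oldest not-yet-issued instructions are all hypothetical.

```python
def cycles_to_finish(program, window, width=4):
    """Return the cycle at which the last result becomes available.

    program: list of (latency, deps); deps are indices of earlier instructions.
    window:  how many of the oldest not-yet-issued instructions the scheduler sees.
    width:   maximum number of instructions issued per cycle.
    """
    done_at = {}                         # instruction index -> cycle its result is ready
    pending = list(range(len(program)))  # not-yet-issued instructions, in program order
    cycle = 0
    while pending:
        issued = []
        for idx in pending[:window]:
            latency, deps = program[idx]
            ready = all(d in done_at and done_at[d] <= cycle for d in deps)
            if ready and len(issued) < width:
                done_at[idx] = cycle + latency
                issued.append(idx)
        pending = [i for i in pending if i not in issued]
        cycle += 1
    return max(done_at.values())


# Eight independent long-latency loads, each followed by one dependent add.
program = []
for _ in range(8):
    load = len(program)
    program.append((10, []))      # hypothetical 10-cycle load, no dependencies
    program.append((1, [load]))   # 1-cycle add that consumes the load

small = cycles_to_finish(program, window=2)
large = cycles_to_finish(program, window=16)
print("window=2:", small, "cycles; window=16:", large, "cycles")
```

With the small window, the scheduler keeps stalling behind an add that is waiting for its load. The larger window lets it look past the stalled add, start the independent loads early, and overlap their latencies.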
Enhanced Branch Prediction. Branch prediction attempts to
guess whether a conditional branch will be taken or not. Branch
predictors are crucial in today’s processors for achieving high
performance. They allow processors to fetch and execute instructions
without waiting for a branch to be resolved. Processors also
use branch target prediction to attempt to guess the target of
the branch or unconditional jump before it is computed by parsing
the instruction itself. In addition to greater performance, an
additional benefit of increased branch prediction accuracy is that it
can enable the processor to consume less energy by spending less
time executing mis-predicted branch paths.
Next generation Intel microarchitecture (Nehalem) uses several
innovations to reduce branch mispredicts that can hinder performance
and to improve the handling of branch mispredicts.
• New Second-Level Branch Target Buffer (BTB). To improve
branch predictions in applications that have large code footprints,
such as database applications, Intel added a second-level branch
target buffer (BTB). BTBs reduce the performance penalty of
branches in pipelined processors by predicting the path of the
branch and caching information used by the branch.
• New Renamed Return Stack Buffer (RSB). RSBs store
forward and return pointers associated with call and return
instructions. Next generation microarchitecture’s renamed RSB
helps avoid many common return instruction mispredictions.
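To make the idea of direction prediction concrete, here is a minimal sketch of a classic textbook predictor, the two-bit saturating counter. It is an illustration only: the article does not describe the predictor Nehalem uses, and the function name and sample branch patterns are hypothetical.

```python
def mispredictions(outcomes, counter=2):
    """Count mispredictions made by a single 2-bit saturating counter.

    Counter states 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    outcomes is a sequence of booleans (True = branch was taken).
    """
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


loop_branch = ([True] * 9 + [False]) * 10   # loop back-edge: taken 9 times, then exit
alternating = [True, False] * 50            # pattern a single counter cannot learn

print("loop branch misses: ", mispredictions(loop_branch), "of", len(loop_branch))
print("alternating misses: ", mispredictions(alternating), "of", len(alternating))
```

The loop branch is mispredicted only at each loop exit, while the alternating pattern defeats this simple scheme half the time. Patterns like the second one are why real processors layer more elaborate prediction structures on top of simple counters.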
Simultaneous Multi-Threading. For next generation Intel
microarchitecture (Nehalem), Intel introduces an enhanced
version of Intel® Hyper-Threading Technology (HT), a technique
used previously on some Intel Pentium and Intel Xeon processors
that enabled a single execution core to run two threads at the
same time. In a multi-core processor, simultaneous multi-threading
doubles the potential number of overall threads that can be run
simultaneously by each of the processors. This means a quad-core
processor could run up to eight threads simultaneously. What’s
unique to next generation Intel microarchitecture (Nehalem) is
that its larger cache and larger bandwidth provide even more
opportunities to take advantage of HT.
Incorporating simultaneous multi-threading significantly boosts
performance for very little power cost. It can deliver substantial
performance gains (up to 20 to 30 percent¹) depending on the application
for only a slight amount of power. That makes simultaneous
multi-threading a perfect processor technology for today’s
power-constrained environments.
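The thread counts quoted in this article follow from simple multiplication. The helper below is a small sketch of that arithmetic; its name and parameters are hypothetical.

```python
def hardware_threads(sockets, cores_per_socket, threads_per_core=2):
    """Number of threads the hardware can run at the same time."""
    return sockets * cores_per_socket * threads_per_core


print(hardware_threads(sockets=1, cores_per_socket=4))  # one quad-core processor
print(hardware_threads(sockets=2, cores_per_socket=4))  # dual-processor quad-core system
```

These two calls reproduce the eight-thread and 16-thread figures given earlier for quad-core and dual-processor quad-core designs.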
Intel® Smart Cache Enhancements. Next generation Intel
microarchitecture (Nehalem) enhances the Intel Smart Cache by
adding an inclusive shared L3 (last-level) cache that can be up to
8 MB in size. In addition to this cache being shared across all
cores, the inclusive shared L3 cache can increase performance
while reducing traffic to the processor cores. Some architectures
use exclusive L3 cache, which contains data not stored in other
caches. Thus, if a data request misses on the L3 cache, each
processor core must still be searched, or snooped, in case their
individual caches might contain the requested data. This can
increase latency and snoop traffic between cores. With next
generation microarchitecture, a miss of its inclusive shared L3
cache guarantees the data is outside the processor and thus is
designed to eliminate unnecessary core snoops to reduce latency
and improve performance.
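The snoop-filtering effect described above can be sketched with a toy model. This is an assumption-laden illustration rather than Intel's implementation: the class and method names are hypothetical, cache lines are plain integers, evictions are not modeled, and the non-inclusive mode simply leaves core-held lines out of the L3.

```python
class Chip:
    """Toy multi-core chip that tracks which cache lines each level holds."""

    def __init__(self, num_cores, inclusive):
        self.core_caches = [set() for _ in range(num_cores)]  # private L1/L2 contents
        self.l3 = set()
        self.inclusive = inclusive
        self.snoops = 0

    def fill(self, core, line):
        """A core brings a line into its private caches."""
        self.core_caches[core].add(line)
        if self.inclusive:
            self.l3.add(line)  # inclusive L3 keeps a copy of every core-held line

    def lookup(self, requester, line):
        """Where a request that missed the requester's own caches is served from."""
        if line in self.l3:
            return "l3"
        if self.inclusive:
            return "memory"    # an L3 miss proves no core holds the line
        for core, cache in enumerate(self.core_caches):
            if core == requester:
                continue
            self.snoops += 1   # must ask every other core
            if line in cache:
                return "core"
        return "memory"


for inclusive in (True, False):
    chip = Chip(num_cores=4, inclusive=inclusive)
    chip.fill(core=1, line=0x40)
    where = chip.lookup(requester=0, line=0x80)  # a line nobody holds
    print("inclusive" if inclusive else "non-inclusive", where, "snoops:", chip.snoops)
```

In the inclusive case the miss goes straight to memory with no snoops. In the non-inclusive case the same miss first costs one snoop per other core.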
The new three-level cache hierarchy for next generation Intel
microarchitecture (Nehalem) consists of:
• Same L1 cache as Intel Core microarchitecture (32 KB Instruction
Cache, 32 KB Data Cache)
• New L2 cache per core for very low latency (256 KB per core
for handling data and instruction)
• New fully inclusive, fully shared 8 MB L3 cache (all applications
can use entire cache)
A new two-level Translation Lookaside Buffer (TLB) hierarchy
is also included in next generation Intel microarchitecture
(Nehalem). A TLB is a processor cache that is used by memory
management hardware to improve the speed of virtual address
translation. The TLB references physical memory addresses in its
table. All current desktop and server processors use a TLB, but
next generation Intel microarchitecture (Nehalem) adds a new
second level 512 entry TLB to further improve performance.
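A two-level TLB lookup can be sketched as follows. Only the 512-entry second level comes from the text above. The 64-entry first level, the 4 KB page size, the LRU replacement policy, and all names are assumptions made for illustration.

```python
from collections import OrderedDict


class TwoLevelTLB:
    """Toy two-level TLB with LRU replacement at each level."""

    def __init__(self, l1_entries=64, l2_entries=512, page_size=4096):
        # l1_entries and page_size are assumed values; 512 is from the article.
        self.levels = [("l1", OrderedDict(), l1_entries),
                       ("l2", OrderedDict(), l2_entries)]
        self.page_size = page_size
        self.counts = {"l1": 0, "l2": 0, "walk": 0}

    def translate(self, vaddr, page_table):
        """Translate a virtual address, recording which level supplied the mapping."""
        vpn, offset = divmod(vaddr, self.page_size)
        source, pfn = "walk", None
        for name, entries, _ in self.levels:
            if vpn in entries:
                source, pfn = name, entries[vpn]
                break
        if pfn is None:
            pfn = page_table[vpn]          # slow path: walk the page table
        self.counts[source] += 1
        for _, entries, capacity in self.levels:
            entries[vpn] = pfn             # install or refresh in both levels
            entries.move_to_end(vpn)
            if len(entries) > capacity:
                entries.popitem(last=False)  # evict the least recently used entry
        return pfn * self.page_size + offset


# Working set of 100 pages: larger than the first level, smaller than the second.
page_table = {vpn: vpn + 1000 for vpn in range(100)}
tlb = TwoLevelTLB()
for _ in range(2):
    for vpn in range(100):
        tlb.translate(vpn * 4096, page_table)
print(tlb.counts)
```

On the second pass over the working set, every translation that has been pushed out of the small first level is still found in the second level, so no further page-table walks are needed.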
New Application Targeted Accelerators and Intel SSE4.
Next generation Intel microarchitecture (Nehalem) includes all the
additional Intel SSE4 instructions Intel included in the 45nm next
generation Intel Core microarchitecture (Penryn) for faster
computation/manipulation of media (graphics, video encoding
and processing, 3-D imaging, and gaming).² In addition, next
generation Intel microarchitecture (Nehalem) adds seven new
Application Targeted Accelerators for more efficient accelerated
string and text processing of applications like Extensible Markup
Language (XML).
Application Targeted Accelerators extend the capabilities of
Intel® architecture by adding performance-optimized, low-latency,
lower power fixed-function accelerators on the processor die to
benefit specific applications. Such accelerators are the start of a
natural evolution where gradually more and more advantageous
implementations of fixed-function capabilities will be developed
and added to the processor. Just as the evolution of silicon technology
from 65nm to 45nm to 32nm enables more transistors for
additional cores and cache, so too will it also enable more of these
fixed-function on-die implementations. The benefit will be greater
performance—and superior energy efficiency—for these
specific applications.
The seven Application Targeted Accelerators included in the
next generation microarchitecture provide new string and text
processing instructions to improve performance of string and
text processing operations. For example, they enable parsing of
XML strings and text at a much higher speed. These Application
Targeted Accelerators will be useful for lexing, tokenizing, regular
expression evaluation, virus scanning, and intrusion detection.
Improved Virtualization Performance. Virtualization partitions
a computer so that it can run separate operating systems and
software in each partition, allowing one computer to act as many.
Virtualization enables computers, particularly servers, to better
leverage multi-core processing power and increase efficiency.
Next generation Intel microarchitecture (Nehalem) adds new
features that enable software to further improve its performance
in virtualized environments. For example, the next
generation microarchitecture includes an Extended Page Table
(EPT) for reconciling memory type specification in a guest
operating system with memory type specification in the host
operating system in virtualization systems that support memory
type specification.
New System Architecture: Intel® QuickPath
Technology
With more powerful processors, a potential bottleneck can form
anytime a processor or its individual cores can’t fetch instructions
and data as fast as they’re being executed. Whenever this
happens, performance slows. Of particular importance to the
performance of a system is the speed at which a microprocessor
and its execution cores can access system memory (in addition to
internal cache). In multi-processor systems, not only is the actual
access to data important, but also the multi-processor communication
required to ensure memory coherency (also called snoop traffic).
For years Intel kept instructions and data flowing quickly to
the processor through an external bi-directional data bus called a
front-side bus (FSB). This bus performed as a backbone between
the processor cores and a chipset that contained the memory
controller hub and served as the connection point for all other
buses (PCI, AGP, etc.) in the system. In turn, this has delivered
industry-leading processor performance on the Intel Core
microarchitecture family of processors.
In its long-range planning, Intel has long anticipated that the
development of a high-performance, dynamically and design-scalable
microarchitecture like next generation Intel microarchitecture
(Nehalem) would lead to moving beyond FSBs to a new system
architecture. The result was the development of Intel QuickPath
Architecture, a new system architecture that integrates a memory
controller into each microprocessor, dedicates specific areas of
system memory to each processor, and connects processors and
other components with a new high-speed interconnect. Previously
referenced under the code name Common System Interface or
CSI, Intel® QuickPath Interconnect unleashes the performance of
next generation microarchitecture-based processors and future
generations of Intel® multi-core processors.
Intel QuickPath Architecture is a platform architecture that
provides high-speed connections between microprocessors and
external memory, and between microprocessors and the I/O hub.
One of its biggest changes is the implementation of scalable
shared memory. Instead of using a single shared pool of memory
connected to all the processors in a server or high-end workstation
through FSBs and memory controller hubs, each processor
has its own dedicated memory that it accesses directly through
an Integrated Memory Controller on the processor die. (For
dual-core desktop and mobile processors, the memory controller
will be implemented in the processor package.) In cases where
a processor needs to access the dedicated memory of another
processor, it can do so through a high-speed Intel QuickPath
Interconnect that links all the processors.
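The local-versus-remote access pattern described above can be sketched as a simple routing function. This is a hypothetical model: it assumes each processor owns one contiguous, equally sized range of physical addresses, which real systems need not do, and the function names and path labels are invented for illustration.

```python
GIB = 2 ** 30


def home_node(phys_addr, bytes_per_node):
    """Processor whose integrated memory controller owns this address."""
    return phys_addr // bytes_per_node


def access_path(requesting_node, phys_addr, bytes_per_node):
    """Components a memory request passes through, in order."""
    home = home_node(phys_addr, bytes_per_node)
    path = [f"CPU{requesting_node}"]
    if home != requesting_node:
        # Remote memory: cross the point-to-point interconnect first.
        path.append(f"QPI link CPU{requesting_node}->CPU{home}")
    path += [f"memory controller on CPU{home}", f"DRAM attached to CPU{home}"]
    return path


print(access_path(0, 1 * GIB, bytes_per_node=4 * GIB))  # local access
print(access_path(0, 5 * GIB, bytes_per_node=4 * GIB))  # remote access via CPU1
```

A local access never leaves the processor's own memory controller, while a remote access adds one interconnect hop before reaching the other processor's controller.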
An advantage of Intel QuickPath Interconnect is that it is point-to-point.
There is no single bus that all the processors must use and
contend with each other to reach memory and I/O. This improves
scalability and eliminates the competition between processors
for bus bandwidth.
Intel QuickPath Architecture is already receiving strong industry
support. More than 10 industry leaders have licensed Intel®
QuickPath Technology and are developing innovative products.
Intel® QuickPath Architecture Performance
Intel QuickPath Interconnect’s throughput clearly demonstrates
its best-in-class interconnect performance in the server/
workstation market segment.
• Intel QuickPath Interconnect uses up to 6.4 Gigatransfers/second
links, delivering up to 25 Gigabytes/second (GB/s) of total
bandwidth. That’s up to 300 percent greater than any other
interconnect solution used today. (Gigatransfer refers to the
number of data transfers or operations.)
• Intel QuickPath Interconnect’s superior architecture reduces
the amount of communication required in the interface of
multi-processor systems to deliver faster payloads.
• Intel QuickPath Interconnect’s tightly integrated reliability,
availability and serviceability (RAS) features ensure high
reliability. These features include:
– Implicit Cyclic Redundancy Check (CRC) with link-level
retry to ensure data quality and performance by providing
CRC without the performance penalty of additional cycles.
– Self-healing links that avoid persistent errors by
re-configuring themselves to use the good parts of the link.
– Clock fail-over to automatically re-route clock function
to a data lane in the event of clock-pin failure.
– Hot plug capability to support hot plugging of nodes,
such as processor cards.
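The bandwidth figure quoted in the first bullet above can be reproduced with a short calculation. The 6.4 gigatransfers/second rate is from the article. The payload of 2 bytes per transfer in each direction and the two simultaneous directions per link are assumptions brought in from outside the article.

```python
TRANSFERS_PER_SECOND = 6.4e9   # from the article: 6.4 gigatransfers/second
BYTES_PER_TRANSFER = 2         # assumption: 16 data bits per transfer, per direction
DIRECTIONS = 2                 # assumption: a link sends and receives at the same time

per_direction_gbs = TRANSFERS_PER_SECOND * BYTES_PER_TRANSFER / 1e9
total_gbs = per_direction_gbs * DIRECTIONS

print(f"per direction: {per_direction_gbs:.1f} GB/s")
print(f"both directions: {total_gbs:.1f} GB/s")
```

Under these assumptions the total works out to 25.6 GB/s per link, consistent with the "up to 25 GB/s" figure in the text.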
Coming Soon to a Server, Desktop,
or Laptop Near You
The next generation Intel microarchitecture (Nehalem) processor
family and its Intel QuickPath Architecture will be available in new
upcoming Intel® desktop and mobile processors and Intel Xeon and
Intel® Itanium® processor-based servers and workstation platforms.
Developed as a design-scalable microarchitecture, next generation
Intel microarchitecture (Nehalem) will enable Intel to rapidly tailor
versions optimized for a wide range of price, performance, and
energy efficiency targets. This will make it easy for users to take
advantage of the many ways this microarchitecture dynamically
manages cores, threads, cache, interfaces, and power to deliver
outstanding energy efficiency and performance on demand.
Next generation Intel microarchitecture (Nehalem) will earn
another distinction in the 2009/2010 time frame. An enhanced
version of next generation Intel microarchitecture (Nehalem) will
be Intel’s first microarchitecture used with our upcoming 32nm
Hi-k metal gate silicon technology.
1. Source: Intel estimates based on internal measurements March 2008.
2. For more on Intel SSE4, download the white paper, “Extending the World’s Most Popular Processor Architecture,” at
http://download.intel.com/technology/architecture/new-instructions-paper.pdf.
All products, platforms, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
All data is based on comparisons of engineering data sheets or measurements using actual hardware or simulators.
Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM) and, for
some uses, certain platform software enabled for it. Functionality, performance or other benefits will vary depending on hardware and
software configurations and may require a BIOS update. Software applications may not be compatible with all operating systems. Please
check with your application vendor.
Copyright © 2008 Intel Corporation. All rights reserved. Intel, the Intel logo, the Intel. Leap ahead. logo, Intel Xeon, Intel Core, Pentium,
and Itanium are trademarks of Intel Corporation in the U.S. and other countries.
Tuesday, April 7, 2009
Posted by Kevin Center at 1:47 PM