
With silicon clock scaling largely dead thanks to the laws of physics, computer scientists and chip designers have had to search for performance improvements in other areas — typically by improving architectural efficiency and reducing power consumption. Multi-core scaling may have paid big dividends with dual- and quad-core chips, but it largely plateaued thereafter. There are a number of reasons why adding new processor cores yields diminishing marginal returns. One of the critical causes is the steep overhead associated with maintaining cache coherency between cores that share data sets. Now, a team of researchers working with Intel think they may have found a solution. If their work proves useful, it could offer a significant performance boost in certain applications.

Before we discuss the solution, we need to spend a bit of time talking about the problem. Imagine two separate CPU cores, each of which is working on part of a common computation. Each CPU will have its own L2 cache, where data related to the problem is stored. In a coherent cache, CPU 0 completes part of its calculations, writes a new value to a block of memory, then communicates that it has done so. CPU 1 now knows that its own data is out of sync with CPU 0 and can update its own L2 cache accordingly. There are several methods of implementing coherence, but at the simplest level, it's a method for ensuring that all of the CPUs are "on the same page," as it were. Cache coherence is essential to multi-core scaling, but it also represents a substantial bottleneck as core counts increase. The more CPUs in a system, the more CPU time must be spent enforcing whatever coherence strategy has been chosen, and the less bandwidth is available for actually solving the compute problem in question.


Cache coherency — image from Wikipedia

The North Carolina researchers and Intel have jointly proposed a combined software-hardware solution they call a Communication Accelerated Framework (CAF). The CAF would include a queue management device (QMD) implemented in hardware. The researchers describe its benefits as follows:

QMD achieves several significant benefits. First, it makes queue operations fast. Instead of executing hundreds of instructions at the core to manage a software queue, a core can execute an enqueue or dequeue instruction, with QMD handling the rest. Consequently, QMD frees up the core to work on more useful jobs. Second, QMD can handle multiple producers and consumers without requiring locks or synchronizations. Third, QMD removes most coherence-related communication incurred in software queue implementations, both in the control plane and in the data plane. The last two benefits increase the scalability ceiling vs. software queues. Furthermore, the scalability ceiling of QMD can be further lifted by making QMD distributed. Our results show up to 2–12× throughput improvement compared to a fully optimized software queue structure.


The proposed hardware queue manager

The QMD proved capable of delivering up to a 20-fold performance improvement in test simulations, and Intel is said to be keenly interested in the results. It's important to note that tests like this don't solve all the problems of multi-core scaling, even if they prove valuable. The same forces pushing Intel and other companies towards cloud computing would keep pushing that way, especially since multi-core communication doesn't really bottleneck modern CPUs running desktop applications. Finding solutions to many of these issues is difficult; finding solutions that justify incorporating them into all processors is even more so.

"We have to improve performance by improving energy efficiency," Yan Solihin, lead author on the study and a professor of electrical and computer engineering, told IEEE Spectrum. "The only way to do that is to move some software to hardware. The challenge is to figure out which software is used frequently enough that we could justify implementing it in hardware. There is a sweet spot."

Now read: How L1 and L2 CPU caches work, and why they're an essential part of modern chips