An old Jewish fable tells of a poor man who asks his rabbi for advice. The family is large, the house is small, and it feels very crowded. The rabbi tells his follower to bring a goat into the house, and to come see him again after a month. The man is confused, but does not argue. He brings a goat to live in the house. A month later, the rabbi tells the man to take the goat back out of the house and to see him again in a week. Sure enough, a week later the man returns to thank the rabbi: he feels much better, and the house doesn’t feel as crowded anymore.
As long as GPGPUs can only be programmed using exclusive languages, programmers may feel that the reason for their inability to write code that can be used across both CPUs and GPUs is the lack of a language that supports all HW targets. This is the situation of having a goat living in the house. Removal of the goat merely restores the original problem and only provides a false sense of a solution.
We are seeing some excited reactions to advances in languages, such as OpenCL™, OpenACC, and soon, the ability to program GPUs in C++. Each of these advances is typically accompanied by papers and blog posts about portable code, and even about performance portability. There is some recent news about advancements in C++ support for GPGPUs, which might lead the reader to conclude that it is becoming possible to maintain a single-source code base across CPUs and GPUs.
What is the real problem?
Even if there were a common language that supported both CPUs and GPGPUs, the problem would remain: the HW architectures of most CPUs on the one hand and most GPGPUs on the other diverge in multiple ways, which precludes performance portability.
The difference between just “portability” and “performance portability” is, of course, performance. In practical terms, we can each decide what it means for ourselves depending, at least to some extent, on application requirements. Suppose you have two different applications, one for a CPU and one for a GPGPU, and they perform reasonably well. You are, however, aware that long-term maintenance of two code bases is more expensive than maintaining a single code base, and would be interested in a performance portable application to replace both. How much performance would you be willing to give up by replacing the two optimal applications by a single one?
Probably more of us would be willing to give up five percent of performance than would be happy with an application that runs five times slower.
Some of the differences in HW architecture support for parallel computing are well known and broadly understood:
These HW architectural differences dictate different algorithmic choices. Algorithmic choices are typically made in conjunction with choices on how to parallelize the algorithms. As is well known, some algorithms are more efficient when run sequentially but cannot be parallelized, or cannot be parallelized efficiently for some HW targets, while alternative algorithms are less efficient sequentially but parallelize better.
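One familiar illustration of this trade-off is the prefix sum. The sketch below is not taken from any particular code base; the function names `scan_sequential` and `scan_step_parallel` are made up for this example, and the second function follows the textbook Hillis-Steele scan. Both are written here as plain single-threaded C++ so that the structure of the two algorithms can be compared side by side.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sequential inclusive prefix sum: n - 1 additions in total, but every
// iteration reads the result of the previous one, so the loop as written
// cannot be split across threads.
std::vector<long> scan_sequential(const std::vector<long>& in) {
    std::vector<long> out(in);
    for (std::size_t i = 1; i < out.size(); ++i) {
        out[i] += out[i - 1];
    }
    return out;
}

// Hillis-Steele style inclusive prefix sum: roughly n * log2(n) additions
// in total, organized as about log2(n) passes. Within a pass, each output
// element depends only on the previous pass, so the inner loop iterations
// are independent of each other. That independence is what a wide parallel
// target could exploit; here the inner loop is simply run in order.
std::vector<long> scan_step_parallel(const std::vector<long>& in) {
    std::vector<long> cur(in);
    std::vector<long> next(in.size());
    for (std::size_t stride = 1; stride < cur.size(); stride *= 2) {
        for (std::size_t i = 0; i < cur.size(); ++i) {
            next[i] = (i >= stride) ? cur[i] + cur[i - stride] : cur[i];
        }
        std::swap(cur, next);  // the pass just written becomes the input of the next pass
    }
    return cur;
}
```

Both functions return the same result for the same input. The first does the least work and suits a single fast core; the second does more total work but exposes independent operations in every pass. Which one is the better choice depends on the HW target, and that choice is made outside the language.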
The implications of the HW differences on algorithmic choices are many, and it is hard to provide a succinct summary. The implications on the parallelization of algorithms are also many, but at least a subset of them is well understood:
In summary, deficiencies in language syntax can certainly get in the way of productivity while trying to optimize performance, and at the same time there are considerations in performance engineering, especially in parallel programming, that are outside the scope of the language. Advancements in languages and their cross-platform availability are good progress, but do not by themselves deliver performance portability.
Source: https://software.intel.com/en-us/blogs/2017/03/30/rainbows-unicorns-and-performance-portability