
“256 cores by 2013”?

40 points by hamidr over 12 years ago

11 comments

varelse over 12 years ago
With respect to GPUs, I prefer to consider the SMs and SMXs as "cores" rather than the ALUs within an SM/SMX. With that definition, a single K20 SMX can issue up to 256 instructions per cycle (4 dual-issues of instructions across 4 32-way warps), and each one effectively carries 368K of L1 cache (64K true L1, plus an enormous 256K register file, plus 48K of read-only cache). Speaking from experience, ~40 warps are needed to saturate the instruction pipeline. The 1.25 MB L2 is shared among the 15 SMXs, and 208 GB/s of main memory bandwidth is shared between them, giving each one ~14 GB/s.

Each Xeon Phi core has a 32K L1 data cache and a 512K L2 data cache. A Xeon Phi core can issue 16 SIMD operations per cycle, and it needs 4 threads to saturate the instruction pipeline. The 60 cores in a 5110P have to share 320 GB/s of main memory bandwidth, or 5.33 GB/s per core.

And all of this means that code customized for one processor (OpenCL) will run like crap on the other one. The Xeon Phi needs that 30 MB of total individual L2 cache to avoid slamming into memory bus contention, while the K20 needs to operate entirely inside its L1 cache and register file to hit peak performance. Main memory fetches are less fatal to the K20, both because optimized code will be running 40+ warps to bury the latency and because of the significantly higher bandwidth.

What strikes me at this point is the absolute paucity of compelling Xeon Phi benchmarks. All we see are SGEMM, DGEMM, and a bunch of synthetic tests. They've had 6 years to get this right, so why didn't they go after all the jewels in NVIDIA's many-core crown from the get-go?

Finally, languages like OpenCL and CUDA subsume SIMD, multithreading, multi-core, and cache optimization into the programming model, all but implicitly forcing the programmer into optimizing many-core performance. In contrast, Intel continues to expect programmers to use processor-specific intrinsics to hit peak performance, intrinsics that change with vendor and processor generation. Sure, it's easier to write serial code for a serial Intel core. But I thoroughly disagree that it's easier to write many-core applications by adding processor intrinsics and a threading library to fundamentally serial code.
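A minimal CUDA sketch (not from the comment, just an illustration) of what "running 40+ warps to bury the latency" looks like from the programmer's side: the kernel itself is trivial, and the only real decision is launching with enough threads that each SMX keeps well over the ~40 resident warps the comment says are needed. The kernel name and launch dimensions are made up for the example, not tuned for any particular part.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Trivial streaming kernel: y[i] = a * x[i] + y[i].
    // Throughput depends on keeping enough warps resident per SMX so that
    // memory stalls in one warp are hidden by arithmetic in the others.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 24;
        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));
        cudaMalloc(&y, n * sizeof(float));
        cudaMemset(x, 0, n * sizeof(float));  // contents don't matter here;
        cudaMemset(y, 0, n * sizeof(float));  // the sketch shows launch shape only

        // 256 threads per block = 8 warps per block. With several blocks
        // resident per SMX this comfortably exceeds the ~40 warps the
        // comment above says are needed to saturate a K20's pipeline.
        const int threadsPerBlock = 256;
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();

        cudaFree(x);
        cudaFree(y);
        return 0;
    }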
karterk over 12 years ago
The mainstream languages currently do not offer enough abstraction for developers to make use of hardware parallelism effectively. Threads do get the job done, but nowhere near as elegantly as I would want. Until that happens, we won't get to reap the full benefits of multi-core systems.

Edit: Worth mentioning that there have been lots of interesting things happening in Haskell to support concurrency: http://stackoverflow.com/questions/3063652/whats-the-status-of-multicore-programming-in-haskell
pooriaazimi over 12 years ago
There's something I always wanted to know, but was too lazy to look up myself (I have a CS degree (or will, in a year) and have taken microprocessor, assembly, computer architecture and courses like that, but none of them talked about multi-core, so I don't know anything about it).

When you have lots of cores, I think your bottleneck would be the cache and, to a lesser degree, memory bandwidth. You can't have multiple cores manipulate the same memory address simultaneously.

Am I right?
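A rough CUDA sketch of the contention the question describes (the same principle applies to CPU cores fighting over one cache line): when every thread updates a single shared counter, the hardware serializes the updates, whereas giving each block its own partial sum removes most of the contention. Kernel names and sizes are invented for the sketch.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Every thread hammers a single global counter: the atomic updates are
    // serialized, so adding more threads mostly adds contention.
    __global__ void count_contended(const int *data, int n, int *counter) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && data[i] > 0) atomicAdd(counter, 1);
    }

    // Each block accumulates into its own slot, so threads from different
    // blocks never touch the same address; only numBlocks partial sums
    // need to be combined afterwards.
    __global__ void count_privatized(const int *data, int n, int *partial) {
        __shared__ int block_count;
        if (threadIdx.x == 0) block_count = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && data[i] > 0) atomicAdd(&block_count, 1);  // on-chip, cheap

        __syncthreads();
        if (threadIdx.x == 0) partial[blockIdx.x] = block_count;
    }

    int main() {
        const int n = 1 << 20, threads = 256;
        const int blocks = (n + threads - 1) / threads;
        int *data, *counter, *partial;
        cudaMalloc(&data, n * sizeof(int));
        cudaMalloc(&counter, sizeof(int));
        cudaMalloc(&partial, blocks * sizeof(int));
        cudaMemset(data, 1, n * sizeof(int));   // every element positive
        cudaMemset(counter, 0, sizeof(int));

        count_contended<<<blocks, threads>>>(data, n, counter);
        count_privatized<<<blocks, threads>>>(data, n, partial);
        cudaDeviceSynchronize();

        cudaFree(data); cudaFree(counter); cudaFree(partial);
        return 0;
    }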
SageRaven over 12 years ago
Are hardware threads really that useful in any scenario? I tend to disable the functionality and use actual cores for capacity planning, especially in virtualization. Running a "make -j<cores>" generally runs faster than "make -j<threads>" on the same hardware in scenarios I've tried.

Is there solid empirical data on the threads vs cores thing?
rubinelli over 12 years ago
I'm under the impression that the reason Intel isn't producing 256-core CPUs has less to do with the technical problems involved than with the fact that, outside CPU-heavy mathematical computing, nobody needs them. The few embarrassingly parallel tasks we have are most of the time well served by GPUs, or distributed in a cluster of cheap servers to increase I/O.
derda over 12 years ago
I know people who have worked with the Intel 48-core SCC (Single-Chip Cloud Computer). While I don't remember the technical details, it was very hard for them to build software that actually benefits from that many cores. If I remember correctly, sharing data between the cores was a big bottleneck.
aneth4 over 12 years ago
I can't be the only one who stops reading articles when the entire first section is contorted self-congratulation. This is up there with The Next Web always vainly touting what they "told us" in the last article.

Say something interesting. Don't tell me how great you are. It immediately diminishes my opinion of people and companies, because it makes it clear they are more interested in recognition and taking credit than in being interesting.
mitchi over 12 years ago
It's much smarter for Intel to work on power, efficiency, size and memory (I hear that RAM is going to disappear soon) rather than building mammoth-like processors. If you want parallel computing, you buy many computers and make them work together.
shocks over 12 years ago
CPU cores may not be on the up, but GPU cores certainly are.

IIRC my GTX 670s have 1344 cores each, compared to 4 in my 3770K.
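For reference, the 1344 figure is SMX count times ALU lanes per SMX (7 × 192 on a GTX 670), which is exactly the counting convention varelse pushes back on near the top of the thread. A small CUDA sketch of what the runtime itself reports; the Kepler 192-lanes-per-SMX figure is mentioned only in a comment, since the API exposes SM counts rather than "cores."

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, d);
            // The runtime reports SMs, not "cores"; the marketing number is
            // SM count times ALU lanes per SM (192 on Kepler-class parts,
            // which is where 7 * 192 = 1344 comes from for a GTX 670).
            printf("%s: %d SMs, warp size %d, compute capability %d.%d\n",
                   p.name, p.multiProcessorCount, p.warpSize, p.major, p.minor);
        }
        return 0;
    }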
cmccabe over 12 years ago
I usually like Herb Sutter's articles, but this feels a bit like special pleading to me. He made a prediction -- that the amount of hardware parallelism we'd have would be between 16 and 256 -- and it didn't really turn out to be true. I would say that we now have something like 8- to 12-way parallelism on most systems. If you take a 6-core i7 with hyperthreading, that's 12-way parallelism.

Talking about specialty hardware like the Xeon Phi HPC accelerator (which most developers have never heard of, let alone programmed for) just doesn't make sense. Similarly, GPUs may be ubiquitous, but how many programmers have actually written for them? Very few.

The reality is that CPU architects have done everything they possibly can to keep down the amount of parallelism. When you get more transistors, you can use them for things like L1 cache instead of adding more execution units.

The bottom line is that it's hard to predict the future. We can read stuff from the past, like mailing list posts about how Linux is irrelevant because "we'll all be using Sparcs in a few years," but the lesson that it's hard to make predictions never really seems to sink in.

I hope that we'll see more manycore chips in the future for developers to play with. The realities of physics seem to be dragging CPU architects kicking and screaming into the multicore world. But it may not be as fast as we once thought.
goggles99 over 12 years ago
HAHA, whatever happened to "Intel pledges 80 cores in five years"? (article from 2006) http://news.cnet.com/Intel-pledges-80-cores-in-five-years/2100-1006_3-6119618.html

The Nvidia Tesla K20 has 2496 cores (I know that it is a GPU and not a CPU).