Tech Echo

13 comments

bluetomcat, more than 10 years ago

The constant widening of the SIMD registers is an interesting trend to watch. MMX started off as 64-bit, and now AVX-512 is 8 times that. With many cores and a couple of SIMD units in each core, aren't CPUs becoming ever more suitable for graphics workloads that were once reasonable only for GPUs?

pedrocr, more than 10 years ago

> If we set _foo to 0 and have two threads that both execute incl (_foo) 10000 times each, incrementing the same location with a single instruction 20000 times, is guaranteed not to exceed 20000, but it could (theoretically) be as low as 2.

Initially it seemed to me the theoretical minimum should be 10000 (as the practical minimum seems to be). But it does indeed seem possible to get 2:

1) Both threads load 0
2) Thread 1 increments 9999 times and stores
3) Thread 2 increments and stores 1
4) Both threads load 1
5) Thread 2 increments 9999 times and stores
6) Thread 1 increments and stores 2

Is there anything in the x86 coherency setup that would disallow this and make 10000 the actual minimum?
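
The six-step interleaving above can be replayed deterministically. What follows is a minimal Python sketch, not a model of real x86 hardware: it assumes each unlocked incl behaves as a separate load followed by a store of the loaded value plus one, and it says nothing about how likely such a schedule is in practice. The names simulate, LOST_UPDATE_SCHEDULE, SINGLE_WINDOW_SCHEDULE and SERIAL_SCHEDULE are made up for illustration.

```python
# Illustrative model only: an unlocked `incl (_foo)` is treated as a load into
# a thread-private register followed by a store of register + 1. All names
# here (simulate, the *_SCHEDULE lists) are hypothetical, not from the article.

PER_THREAD = 10000


def simulate(schedule, per_thread=PER_THREAD):
    """Replay an explicit interleaving and return the final value of _foo.

    Each step is (thread_id, op) or (thread_id, "inc", n):
      "load"  - copy the shared value into the thread's register
      "store" - write register + 1 back, completing one increment
      "inc"   - n complete load+store pairs with no interference
    """
    foo = 0
    reg = {1: None, 2: None}  # value each thread has loaded but not yet stored
    done = {1: 0, 2: 0}       # increments completed per thread

    for tid, op, *rest in schedule:
        if op == "load":
            reg[tid] = foo
        elif op == "store":
            foo = reg[tid] + 1
            reg[tid] = None
            done[tid] += 1
        elif op == "inc":
            n = rest[0]
            assert reg[tid] is None, "thread has a pending load"
            foo += n          # n uninterrupted increments add exactly n
            done[tid] += n
        else:
            raise ValueError(f"unknown op: {op}")

    # Every schedule must account for all increments of both threads.
    assert done == {1: per_thread, 2: per_thread}, done
    return foo


# The schedule from the comment, step by step.
LOST_UPDATE_SCHEDULE = [
    (1, "load"), (2, "load"),        # 1) both threads load 0
    (1, "store"), (1, "inc", 9998),  # 2) thread 1 finishes 9999 increments
    (2, "store"),                    # 3) thread 2 stores 0 + 1 = 1
    (1, "load"), (2, "load"),        # 4) both threads load 1
    (2, "store"), (2, "inc", 9998),  # 5) thread 2 finishes its remaining 9999
    (1, "store"),                    # 6) thread 1 stores 1 + 1 = 2
]

# The "intuitive minimum": all of thread 1's work lands inside a single
# load/store window of thread 2 and is overwritten.
SINGLE_WINDOW_SCHEDULE = [
    (2, "load"),
    (1, "inc", PER_THREAD),
    (2, "store"),
    (2, "inc", PER_THREAD - 1),
]

# No interleaving at all: nothing is lost.
SERIAL_SCHEDULE = [(1, "inc", PER_THREAD), (2, "inc", PER_THREAD)]

if __name__ == "__main__":
    print(simulate(LOST_UPDATE_SCHEDULE))    # 2
    print(simulate(SINGLE_WINDOW_SCHEDULE))  # 10000
    print(simulate(SERIAL_SCHEDULE))         # 20000
```

The difference between the 10000 and 2 outcomes is that in the second schedule each thread overwrites the other's work once: thread 2's stale store wipes out thread 1's first 9999 increments, and thread 1's final stale store wipes out thread 2's remaining ones.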

acallan, more than 10 years ago

The author is incorrect in the section about memory fences. x86 has strong memory ordering [1], which means that writes always appear in program order with respect to other cores. Use a memory fence to guarantee that reads and writes are memory bus visible.

The example that the author gives does not apply to x86.

[1] There are memory types that do not have strong memory ordering, and if you use non-temporal instructions for streaming SIMD, SFENCE/LFENCE/MFENCE are useful.

gshrikant, more than 10 years ago

For a somewhat dated, introductory, high-level survey of the state of CPU design up to 2001, I find [1] to be quite informative. Of course, having been written in the frequency scaling era, some of the 'predictions' are way off the mark. Nevertheless, I find it a good resource for someone looking to get a bird's eye view of the 30+ years of development in the field.

[1] http://www.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=00964437.pdf

antiuniverse, more than 10 years ago

If you enjoyed this article, you might also want to check out Agner Fog's optimization manuals and blog (which I decided were probably worth a separate submission): https://news.ycombinator.com/item?id=8874206

pjmlp, more than 10 years ago

All features that make C look high level nowadays, contrary to what many think. None of them are exposed in ANSI C.

martincmartin, more than 10 years ago

In the section on rdtsc, the author should really mention that there's a new, serializing version called rdtscp, so you should prefer that.

majke, more than 10 years ago

Another exciting thing with the new PCI-Express standard is Direct Cache Access, especially useful for high-speed networking: http://web.stanford.edu/group/comparch/papers/huggahalli05.pdf

avian, more than 10 years ago

> If we set _foo to 0 and have two threads that both execute incl (_foo) 10000 times each, incrementing the same location with a single instruction 20000 times, is guaranteed not to exceed 20000, but it could (theoretically) be as low as 2.

Can someone explain the scenario where this test would result in _foo = 2? The lowest theoretical value I can understand is _foo = 10000 (all 10000 incls from thread 1 are executed between one pair of load and store in thread 2 and hence lost).

zurn, more than 10 years ago

These features debuted way before the 80s, arguably with the exception of out-of-order execution. Take any feature discussed and its Wikipedia history section will tell you about CPUs and computers in the 60s that first had them. Caches, SIMD (called vectors back then), speculative execution, branch prediction, virtual machines, virtual memory, accelerated IO bypassing the CPU... all from the 60s.

uxcn, more than 10 years ago

One of the other weird things is that the general purpose chips are supposed to be getting large chunks of memory behind the northbridge. If I'm not mistaken, it's supposed to be used as another level in the cache hierarchy, which could be another boon to things that leverage concurrency.

amelius, more than 10 years ago

Nice improvements. But they seem only marginal. Yes, speed may have gone up by a certain factor (which has remained surprisingly stable in the last decade, or so it seems).

On the other hand, programming complexity has gone way up, while predictability of performance has been lost.

cowardlydragon, more than 10 years ago

Doesn't this come down to:

#1: Fit in cache.
#2: Try to multithread.

All the rest is marginal.

What's New in CPUs Since the 80s and How Does It Affect Programmers?
