Counting bytes faster than you'd think possible

154 points | by asicsp | 10 months ago

8 comments

anonymoushn | 10 months ago

My own solution, which is ~1ms faster, uses another pattern that was found experimentally, but I cannot seem to get it to go any faster by tuning the parameters, and the #1 spot remains slightly out of reach.

Alexander Monakov has called the attention of the highload Telegram chat to this paper [0], saying:

    Haswell is tricky for memory bw tuning, as even at fixed core frequency,
    uncore frequency is not fixed, and depends on factors such as
    hardware-measured stall cycles:
    > According to the respective patent [15], the uncore frequency depends on
    > the stall cycles of the cores, the EPB of the cores, and c-states
    > ... uncore frequencies–in addition to EPB and stall cycles–depend on the
    > core frequency of the fastest active core on the system. Moreover, the
    > uncore frequency is also a target of power limitations.

So one wonders if it's not really a matter of reading the RAM in the right pattern to appease the prefetchers, but of using values in the right pattern to create the right pattern of stalls to get the highest frequency.

[0]: https://tu-dresden.de/zih/forschung/ressourcen/dateien/projekte/firestarter/2015_hackenberg_hppac.pdf?lang=en
sYnfo | 10 months ago

FYI, vient [0] figured out that simply compiling with "-static -fno-pie" and _exit(0)-ing at the end puts the solution presented here at 15,000 points and hence #4 on the leaderboard. Pretty funny.

[0] https://news.ycombinator.com/user?id=vient
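For context, a minimal sketch of what that trick looks like in practice. This is my own illustration, not the actual submission: the buffer size, the counted byte value, and the overall program shape are placeholders, since the thread doesn't show the real solution. The point is only that _exit(0) skips C++ runtime teardown (static destructors, atexit handlers), and since _exit does not flush stdio, any buffered output has to be flushed first.

    // Illustrative only; build roughly as described above:
    //   g++ -O3 -static -fno-pie count.cpp -o count
    #include <unistd.h>   // read, _exit
    #include <cstdio>
    #include <cstdint>

    int main() {
        static uint8_t buf[1 << 20];
        uint64_t count = 0;
        ssize_t n;
        while ((n = read(0, buf, sizeof buf)) > 0)
            for (ssize_t i = 0; i < n; ++i)
                count += (buf[i] == 0x7f);   // placeholder predicate
        std::printf("%llu\n", (unsigned long long)count);
        std::fflush(stdout);  // _exit() will not flush stdio buffers
        _exit(0);             // skip static destructors / atexit handlers
    }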
dinobones | 10 months ago

Is there a path forward for compilers to eke out these optimization gains eventually? Is there even a path?

550x gains with some C++ / gnarly mixed low-level assembly vs standard C++ is pretty shocking to me.
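To make the comparison concrete, here is the kind of portable loop compilers already handle well today (my own example, not code from the article). With -O3 and -march=native, GCC and Clang will typically auto-vectorize this branchless count, yet the hand-tuned solutions are still far ahead, because the remaining gains come from memory-access patterns and hardware behaviour the compiler can't reason about.

    #include <cstddef>
    #include <cstdint>

    // Portable scalar count; written branchless so the auto-vectorizer
    // can turn it into SIMD compares without a profitability guess.
    uint64_t count_byte(const uint8_t* data, size_t len, uint8_t target) {
        uint64_t count = 0;
        for (size_t i = 0; i < len; ++i)
            count += (data[i] == target);
        return count;
    }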
maxbond | 10 months ago

Usually, it's fair game to use all of the information presented in an exam-style question to derive your answer.

With that in mind, I propose the following solution.

`print(976563)`
lumb63 | 10 months ago
Does anyone have any tips for similar wizardry-level SIMD optimization on ARM?
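Not from the article, but as a possible starting point, here is a minimal AArch64 NEON sketch of the usual compare-and-accumulate idea (function name and structure are mine; vaddlvq_u8 requires ARMv8): compare 16 bytes at a time, accumulate matches by subtracting the 0xFF/0x00 compare mask, and flush the 8-bit lane counters before they can overflow.

    #include <arm_neon.h>
    #include <cstddef>
    #include <cstdint>

    uint64_t count_byte_neon(const uint8_t* data, size_t len, uint8_t target) {
        const uint8x16_t needle = vdupq_n_u8(target);
        uint64_t total = 0;
        size_t i = 0;
        while (i + 16 <= len) {
            // At most 255 vectors per block so each 8-bit lane stays below 256.
            size_t vectors = (len - i) / 16;
            if (vectors > 255) vectors = 255;
            uint8x16_t acc = vdupq_n_u8(0);
            for (size_t v = 0; v < vectors; ++v, i += 16) {
                uint8x16_t eq = vceqq_u8(vld1q_u8(data + i), needle);
                acc = vsubq_u8(acc, eq);   // 0xFF lanes act as -1, so this adds 1 per match
            }
            total += vaddlvq_u8(acc);      // widening horizontal sum across the 16 lanes
        }
        for (; i < len; ++i)               // scalar tail
            total += (data[i] == target);
        return total;
    }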
rini17 | 10 months ago
Can this optimization be applied to matmult for us, critters who are running llama on cpu? XD
_a_a_a_ | 10 months ago

"The solution presented here is ~550x faster than the following naive program."

    ...
    std::cin >> v;
    ...

Oh come on! That's I/O for every item; I'm surprised it's not even slower.
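For contrast, a rough sketch of the bulk-read shape this complaint is pointing at (my own code, not the article's; exactly what gets parsed and counted isn't shown in the thread, so the scan is a placeholder that just counts whitespace-separated tokens): read stdin in large chunks once and scan the buffer, instead of paying formatted-extraction overhead per value.

    #include <cstdio>
    #include <cstdint>

    int main() {
        static char buf[1 << 20];
        uint64_t tokens = 0;
        bool in_token = false;   // persists across chunks, so a split token counts once
        size_t n;
        while ((n = std::fread(buf, 1, sizeof buf, stdin)) > 0) {
            for (size_t i = 0; i < n; ++i) {
                bool is_digit = (buf[i] >= '0' && buf[i] <= '9');
                tokens += (is_digit && !in_token);
                in_token = is_digit;
            }
        }
        std::printf("%llu\n", (unsigned long long)tokens);
        return 0;
    }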
TacticalCoder | 10 months ago

Let me hazard a guess: that blog post was *not* written by an LLM!?