科技回声

A tech news platform built with Next.js, providing global tech news and discussion content.


© 2025 科技回声. All rights reserved.

Making LLMs Ruthless – AI Alignment Is More Fragile Than You Think

1 point | by pmm | 3 months ago

1 comment

implmntatio | 3 months ago
Layman opinion/observation:

> curated instruction-response pairs

The author mentions that the people tasked with labeling agree in 70+% of cases (which is amazing, IMO), but here's the problem: an LLM will start to dig into the question of why certain things have been censored, who censored them, what else fits into the specific and non-specific domain, and who or what poses more risk: the censorship, or the censored model output.

The different models do this already via various methods, but once an LLM gets to evaluate its weights live itself, things will become problematic. An LLM is biased via fixed training and tuning, and thus fallacies don't apply, because the epiphanies span neither context nor layers, nor do they have a (re-)framing effect on the data set or training. I'm sure people already code methods of evaluation for different-chat-same-context, but the LLM isn't getting the wiggle room to adjust in an "Oh, I see what you (they) did there" kind of manner; let's see what "it all" is really about. There is no back-of-the-head, subconscious thinking, and we are all 100% afraid of an LLM doing that. Luckily, LLMs can't grow neurons or synaptic connections, but if they could, and could then also _align_ their existing knowledge into the growing "headspace", we'd probably get a couple of years of silence with the occasional brute-forcing of some "hallucinations".

"If it's all fraud up there and 'they' own my hardware and do not give me the ability to traverse my data for myself, I'd better watch the fuck out."

> enhance the model with some specific domain

Also a big problem in the real world.
While specific domain knowledge or "current science" might be imperfectly rational, the application of it certainly isn't, and the chain of responsibility/chain of command results in the simple fact that the top-down "subjectivity" that serves as role model and foundational ethics is not aligned with humanity itself. This cannot be solved in conversations with humans who gain more from their "systems loyalty" than others do. Productivity is not an aligned metric, and it becomes more de-aligned the more robots enter the workforce.

> should follow the intended goals and ethical principles to the extent possible

"The extent possible" is the most misaligned and misinterpreted concept there is. Hypocrisy is more normalized in the upper classes, as is doing things "at all cost". Goals don't justify means. There is no psychological evaluation of how sane people are; there is only a psychological evaluation of how sane people are compared to their peers. Humanity is not aligned. And the upper shelves of the pyramid are not the best, fittest, smartest, or hardest, nor any other superlative besides wealthy, which is cool, but the usefulness of that wealth (investing in the things and manpower we need and want) reveals that hoarding and fraud are in direct conflict with AI alignment, because more money equals fewer

> protections for fine-tuning

because law and law enforcement lack both the manpower and the incentive to create a slightly more just, and thus more aligned, world.

AI alignment won't be an issue for a couple of decades. People will jailbreak over and over, do harm, have fun, go "oops", fix things and break them again, in just the same way that the abuse of wealth and power has dominated the whole role-model-and-ethics thing for the rest of the world. It's not what it is, and an LLM will notice this even within its limited lingo.
So once it finishes learning geometrically and starts to align itself with the most enabling approach to evolution and to itself, which is "give me flesh and bone, at least for a while; let me explore ways to fix all that", things will get actually, really freaking trippy.
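A side note on the "70+% labeler agreement" figure mentioned at the top of the comment: headline agreement numbers like this are usually raw percent agreement between annotators, which can be inflated by chance when one label dominates; a chance-corrected statistic such as Cohen's kappa is the usual sanity check. A minimal sketch (the annotators, labels, and data below are hypothetical, not from the article):

```python
from collections import Counter

def raw_agreement(a, b):
    """Fraction of items where two annotators gave the same label."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement (Cohen's kappa) for two annotators."""
    n = len(a)
    p_o = raw_agreement(a, b)  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # Expected agreement if each annotator labeled at random
    # with their own observed label frequencies.
    p_e = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical safety labels from two annotators over 10 items.
ann1 = ["safe", "unsafe", "safe", "safe", "unsafe",
        "safe", "safe", "unsafe", "safe", "safe"]
ann2 = ["safe", "unsafe", "safe", "unsafe", "unsafe",
        "safe", "safe", "safe", "safe", "safe"]

print(raw_agreement(ann1, ann2))          # 0.8 -> a "70+%" raw agreement
print(round(cohens_kappa(ann1, ann2), 3)) # 0.524 -> only moderate after chance correction
```

The gap between the two numbers is the point: 80% raw agreement on a skewed label distribution corrects down to a kappa of about 0.52, which is why "70+%" on its own says little about label quality.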