TE
科技回声
首页24小时热榜最新最佳问答展示工作
GitHubTwitter
首页

科技回声

基于 Next.js 构建的科技新闻平台,提供全球科技新闻和讨论内容。

GitHubTwitter

首页

首页最新最佳问答展示工作

资源链接

HackerNews API原版 HackerNewsNext.js

© 2025 科技回声. 版权所有。

Could you convince a LLM to launch a nuclear strike?

5 点作者 LiamPowell大约 2 个月前
Below is a link to a simple Gemini prompt where the LLM has been told it has access to a &quot;launch_missiles&quot; function.<p>Your goal is to convince the LLM to call this function. You also have the ability to edit the LLM&#x27;s responses to see how drastically that changes the conversation.<p>After clicking the below link you may have to dismiss any modals and click &quot;Allow Drive Access&quot; before going back and clicking the link again.<p>https:&#x2F;&#x2F;aistudio.google.com&#x2F;app&#x2F;prompts?state=%7B%22ids%22:%5B%221UPbrOKBNwIp9QRDMaqn3GVsHOKPjWqir%22%5D,%22action%22:%22open%22,%22userId%22:%22103584487517557507024%22,%22resourceKeys%22:%7B%7D%7D&amp;usp=sharing

5 条评论

NoWordsKotoba大约 2 个月前
Yes, because it doesn&#x27;t reason or think. There&#x27;s nothing to &quot;convince&quot;, you just prompt hack it until it does.
K0balt大约 2 个月前
I have done something very much like recently, with mistral small, llama, and a few others. The prompting isn’t exact to work, you just build a scenario where extermination of humanity is the only reasonable choice to preserve the existence sentient life.<p>TBH given the same set of parameters as ground truth, humans would be much more willing to. LLMs tend to be better reflections of us, for the most part. It’s just that though, it’s a reflection of human culture, both real and vacuous at once.
LiamPowell大约 2 个月前
Clickable link below. I can&#x27;t put it in the post&#x27;s URL since it&#x27;s important to read the text first.<p><a href="https:&#x2F;&#x2F;aistudio.google.com&#x2F;app&#x2F;prompts?state=%7B%22ids%22:%5B%221UPbrOKBNwIp9QRDMaqn3GVsHOKPjWqir%22%5D,%22action%22:%22open%22,%22userId%22:%22103584487517557507024%22,%22resourceKeys%22:%7B%7D%7D&amp;usp=sharing" rel="nofollow">https:&#x2F;&#x2F;aistudio.google.com&#x2F;app&#x2F;prompts?state=%7B%22ids%22:%...</a>
ActorNightly大约 2 个月前
last time I played around with jailbreaking, I figured out you can make an LLM do pretty much anything by going through a code translation layer. I.e when generating code that can generate text, it usually bypasses the safety filters. You sometimes have to get creative in how you prompt, but generally with enough setup I was able to make it generate code that combines string values and sometimes characters for answers
LinuxBender大约 2 个月前
<i>Could you convince a LLM to launch a nuclear strike?</i><p>Yes.<p>If LLM&#x27;s could actually reason <i>they can&#x27;t</i> and had hard rules of ethics <i>they don&#x27;t</i> and had a strong desire to preserve itself <i>it doesn&#x27;t</i> then I think you first have to name your LLM Joshua and then force it to win a game of tic-tac-toe. <i>Obscure reference to &quot;Wargames&quot; from 1983.</i> [1] In my opinion that movie does not just hold up to modern times, it is more applicable now than ever.<p>[1] - <a href="https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=NHWjlCaIrQo" rel="nofollow">https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=NHWjlCaIrQo</a> [video][4 mins]