GPT-Prompt-Engineer

356 points by sturza, almost 2 years ago

23 comments

fatso784, almost 2 years ago
This tool doesn't benchmark based on how a model actually responds to the generated prompts. Instead, it trusts GPT-4 to rank prompts simply by how well it *imagines* they will perform head-to-head. Thus, there's no way to tell whether the chosen 'best prompt' actually is the best, because there's no ground truth from actual responses.

Why is this so popular, then (more popular than promptfoo, which I think is a much better tool in the same vein)? AI devs seem enamored with the idea of LLMs evaluating LLMs; everything is 'auto-' this and that. They're in for a rude awakening. The truth is, there are no shortcuts to evaluating performance in real-world applications.
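
For illustration, the ground-truth evaluation described above could look something like this sketch: score each candidate prompt against labelled test cases, so the winner is the prompt that measurably performs best. This is hypothetical scaffolding, not the tool's real code; call_model stands in for whatever LLM wrapper you use, and exact-match scoring is just the simplest possible metric.

    # Sketch: rank prompts by measured accuracy on labelled cases,
    # not by asking GPT-4 which output it imagines is better.
    def score_prompt(prompt, test_cases, call_model):
        hits = 0
        for case in test_cases:
            output = call_model(prompt, case["input"])  # real model response
            if output.strip().lower() == case["expected"].strip().lower():
                hits += 1
        return hits / len(test_cases)

    # best = max(candidates, key=lambda p: score_prompt(p, cases, call_model))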

mangecoeur, almost 2 years ago
Should we really call it "engineering" if it's a case of "try random things until one of them works, without really knowing why"?

brunoluiz, almost 2 years ago
Isn't engineering an exact science, while prompt engineering is completely not?

Although, even with software engineering being an exact science, it is a funny one: most of us don't get certified the way, say, mechanical engineers do. Would they say we are engineers?

So perhaps the "engineer" term got overloaded in recent years?

shivams, almost 2 years ago
BTW, GPT-Engineer is openly collecting all of your data: user prompts and other metadata. And they were even defending it until they received some strong responses from the community: https://github.com/AntonOsika/gpt-engineer/issues/415. They now explicitly ask for consent regarding user data, but can we really trust their motives?

Towaway69, almost 2 years ago
Is this prompt generation for the purposes of prompt engineering? Is this then a kind of meta-engineering: engineering for the purposes of engineering, which will then hopefully generate working code for the computer that generated both the prompt and the response to it?

Apfel, almost 2 years ago
Usage query: it looks like this could get expensive quite quickly. The approach is great, but with GPT-4 especially it could be very costly. Is it worth running it with 3.5 as a first pass, then switching the prompts to GPT-4 once you've got the best one?

hoc, almost 2 years ago
Douglas Adams would've had so much fun these days.

xmcqdpt2, almost 2 years ago
I think one should just use GPT to generate the prompts, so as to reduce the human input further still: a kind of gpt-gpt-prompt-engineer-engineer.

jwestbury, almost 2 years ago
It's turtles^W GPT all the way down, I guess.

Kiro, almost 2 years ago
How are they actually ranked?

asimpleusecase, almost 2 years ago
A bit like an AutoGPT. I didn't immediately see any kind of token limits, though I didn't look carefully. On a complex problem, or one that accesses a lot of data, the cost might ramp up.

msp26, almost 2 years ago
Currently working on something similar for myself; this doesn't seem to fit my needs (I'm benchmarking generations too, rather than just classification). I only have a crude cosine-similarity metric for accuracy for now. Also, I'm using function calling rather than the normal completions.

I was hoping this would do something more interesting with multiple messages (if using a chat model) rather than just dumping the entire prompt into one message. The assistant lets you do stuff with examples.
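
A crude cosine-similarity accuracy metric of the kind mentioned above could be sketched like this. The embedding model and the idea of a known-good reference answer per test case are assumptions for illustration, not the commenter's actual setup:

    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def embed(text):
        # turn a string into an embedding vector (model choice is illustrative)
        resp = client.embeddings.create(model="text-embedding-3-small", input=text)
        return np.array(resp.data[0].embedding)

    def similarity(generation, reference):
        # cosine similarity between the model's output and a reference answer
        a, b = embed(generation), embed(reference)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))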

hacksoi, almost 2 years ago
Are we really going down this path of prompt-prompt-engineering?

PUSH_AX, almost 2 years ago
It would be cool if, given a handful of test cases, you could send those off to the LLM to generate even more test cases.

My first thought when looking over this tool was "Why do I have to do all the work?" The ideal scenario is that I give the high-level description and the LLM does the hard work to create the best prompt.
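
A rough sketch of how that test-case expansion could work; the model name, prompt wording, and JSON shape are all assumptions for illustration:

    import json
    from openai import OpenAI

    client = OpenAI()

    def expand_test_cases(task_description, seed_cases, n=10):
        # ask the model to extrapolate more cases from a handful of seeds
        prompt = (
            f"Task: {task_description}\n"
            f"Example test cases:\n{json.dumps(seed_cases, indent=2)}\n"
            f"Generate {n} more diverse test cases as a JSON array "
            "with the same keys. Respond with JSON only."
        )
        resp = client.chat.completions.create(
            model="gpt-4", messages=[{"role": "user", "content": prompt}]
        )
        return json.loads(resp.choices[0].message.content)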

hamasho, almost 2 years ago
Off topic, but Jupyter cells on GitHub can't display horizontally long content, and it frustrates me a lot. This small piece of code for the browser console helps me see more content, but it only works on large displays:

    $("[data-type='ipynb']").style.width = '100%'

m3kw9, almost 2 years ago
You need to learn how to use this to generate a good prompt, but why not just learn how to write good prompts directly? This code is basically asking for something, giving examples, and then asking a few real questions to test it.

sgt101, almost 2 years ago
This is supervised machine learning on top of unsupervised machine learning, with some interesting wrinkles in both steps!

It reminds me of those aircraft that folks in rural India build from time to time.

namuol, almost 2 years ago
This could work really well if it replaced GPT-X-judged performance ranking with human-in-the-loop ranking of prompts, but that’s not as exciting, I guess.
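
A toy version of that human-in-the-loop ranking, purely illustrative: show a person both generations and record the preference, instead of asking a model which output it imagines is better.

    def human_rank(gen_a, gen_b):
        # present both generations and let a human pick the winner
        print("--- A ---\n" + gen_a + "\n--- B ---\n" + gen_b)
        choice = ""
        while choice not in ("a", "b"):
            choice = input("Which is better, A or B? ").strip().lower()
        return choice.upper()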

magicroot75, almost 2 years ago
People don't understand intelligence.

fdondi, almost 2 years ago
Does it only work with ChatGPT? Seems it would be useful also for local Llamas etc.

gaolei8888, almost 2 years ago
This is a cool tool.

jstarfish, almost 2 years ago
Uh... am I missing something, or is this whole thing setting the user up for humiliating failure by doing its testing the same way that bit that lawyer in the ass?

> Your job is to rank the quality of two outputs generated by different prompts. The prompts are used to generate a response for a given task.

> You will be provided with the task description, the test prompt, and two generations - one for each system prompt.

> Rank the generations in order of quality. If Generation A is better, respond with 'A'. If Generation B is better, respond with 'B'.

> Remember, to be considered 'better', a generation must not just be good, it must be noticeably superior to the other.

> Also, keep in mind that you are a very harsh critic. Only rank a generation as better if it truly impresses you more than the other.

> Respond with your ranking, and nothing else. Be fair and unbiased in your judgement.

So what factors make the "quality" of one prompt "better" than another?

How "impressive" it is to an LLM? What even *impresses* an LLM? I thought that, as an AI language model, it lacks human emotional reactions or whatever.

Quality is subjective. Even accuracy is subjective. What needs testing is alignment -- with *your* interests. The thing is hardcoded to rate based on what aligns with the model hosts' interests, not yours.

Only the "classification version" looks capable of making any kind of assertion:

> 'prompt': 'I had a great day!', 'output': 'true' [sentiment analysis, I assume?]

The rest of the test prompts aren't even complete sentences; they're half-thoughts you'd expect to hear Peter Gregory mutter to himself:

> 'prompt': 'Launching a new line of eco-friendly clothing' [ok, and?]

The one for 'Why a vegan diet is beneficial for your health' makes some sense at least, but it's really ambiguous.

I'm just some idiot, but if I were creating this, I'd expect it to ask for a number of expected keywords or something, to measure how close each model comes to what *the user* actually wants. Like, for me, 'what are operating systems' "must" mention all of the keywords Linux, Windows, and iOS, and "should" mention any of Unix, Symbian, PalmOS, etc.

*All* tests should tank the score if they detect fourth-wall-breaking "As an AI language model / I don't feel comfortable" crap anywhere in the response. National Geographic got outed on that one the other day.
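
The must/should keyword scoring described above is easy to sketch; the weighting and the penalty phrases below are made up for illustration:

    def keyword_score(response, must=(), should=()):
        text = response.lower()
        # a missing required keyword tanks the score outright
        if any(kw.lower() not in text for kw in must):
            return 0.0
        # so does a fourth-wall-breaking refusal anywhere in the response
        for phrase in ("as an ai language model", "i don't feel comfortable"):
            if phrase in text:
                return 0.0
        # otherwise, credit for each optional keyword mentioned
        return 1.0 + 0.1 * sum(kw.lower() in text for kw in should)

    # keyword_score(answer, must=["Linux", "Windows", "iOS"],
    #               should=["Unix", "Symbian", "PalmOS"])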

lofaszvanitt, almost 2 years ago
"Prompt engineering is kind of like alchemy. There's no clear way to predict what will work best. It's all about experimenting until you find the right prompt."

lololoollool