Almost all of this should flow from common sense. I would use what makes sense for your application and not worry about the rest. It's a toolbox, not a rulebook. The one point that comes more from experience than from common sense is to always pin your model versions. As a final tip, if, despite trying everything, you still don't like the LLM's output, just run it again!<p>Here is a summary of all points:<p>1. Focus on Prompting Techniques:<p><pre><code> 1.1. Start with n-shot prompts to provide examples demonstrating the task (see the sketch after this list).
1.2. Use Chain-of-Thought (CoT) prompting for complex tasks, making instructions specific.
 1.3. Incorporate relevant resources via Retrieval-Augmented Generation (RAG).
</code></pre>
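A minimal sketch of 1.1 and 1.2, assuming an OpenAI-style chat client; the model id and the toy summarization task are placeholders, not recommendations:<p><pre><code>  from openai import OpenAI

  client = OpenAI()

  SHOTS = [  # 1.1: a couple of examples demonstrating the task
      {"role": "user", "content": "Summarize: Revenue rose 12% in Q3."},
      {"role": "assistant", "content": "Q3 revenue grew 12%."},
  ]

  def summarize(text: str) -> str:
      messages = [
          # 1.2: CoT with specific steps, not just "think step by step"
          {"role": "system", "content": (
              "You summarize documents. First list the key facts, then "
              "verify each against the source, then write one sentence."
          )},
          *SHOTS,
          {"role": "user", "content": f"Summarize: {text}"},
      ]
      resp = client.chat.completions.create(
          model="gpt-4o-2024-08-06",  # pinned snapshot, see 9.3
          messages=messages,
          temperature=0,
      )
      return resp.choices[0].message.content
</code></pre>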
2. Structure Inputs and Outputs:<p><pre><code> 2.1. Format inputs using serialization methods like XML, JSON, or Markdown (see the sketch below).
2.2. Ensure outputs are structured to integrate seamlessly with downstream systems.
</code></pre>
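Points 2.1 and 2.2 in miniature; the tag names and the JSON schema here are illustrative, not a standard:<p><pre><code>  import json

  def build_prompt(doc: str, question: str) -> str:
      # 2.1: XML-style tags make each input's boundaries unambiguous
      return (
          f"&lt;document&gt;\n{doc}\n&lt;/document&gt;\n"
          f"&lt;question&gt;{question}&lt;/question&gt;\n"
          'Respond with JSON: {"answer": "...", "confidence": 0.0}'
      )

  def parse_response(raw: str) -> dict:
      # 2.2: structured output feeds straight into downstream code,
      # and fails loudly the moment the model drifts from the schema
      reply = json.loads(raw)
      assert {"answer", "confidence"} <= reply.keys()
      return reply
</code></pre>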
3. Simplify Prompts:<p><pre><code> 3.1. Break down complex prompts into smaller, focused ones (see the sketch below).
3.2. Iterate and evaluate each prompt individually for better performance.
</code></pre>
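A sketch of 3.1 and 3.2: one do-everything prompt split into two focused calls, each of which can be iterated on and evaluated on its own. Here `complete()` is a stand-in for whatever LLM call you use, not a real API:<p><pre><code>  def complete(prompt: str) -> str:
      return ""  # placeholder for your LLM call

  def extract_decisions(transcript: str) -> str:
      return complete(f"List every decision made in this meeting:\n{transcript}")

  def draft_summary(decisions: str) -> str:
      return complete(f"Summarize these decisions in three sentences:\n{decisions}")

  def summarize_meeting(transcript: str) -> str:
      # two small prompts, each testable in isolation (3.2), instead of
      # one prompt that extracts, filters, and summarizes all at once
      return draft_summary(extract_decisions(transcript))
</code></pre>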
4. Optimize Context Tokens:<p><pre><code> 4.1. Minimize redundant or irrelevant context in prompts (sketch below).
4.2. Structure the context clearly to emphasize relationships between parts.
</code></pre>
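One way to act on 4.1 and 4.2; the tag layout and character budget are illustrative choices, not prescriptions:<p><pre><code>  def build_context(passages: list[str], max_chars: int = 8000) -> str:
      seen, kept = set(), []
      for p in passages:
          key = p.strip().lower()
          if key and key not in seen:  # 4.1: drop exact-duplicate passages
              seen.add(key)
              kept.append(p.strip())
      # 4.2: explicit structure shows the model where one passage ends
      # and the next begins
      blocks = [f"&lt;passage id={i}&gt;\n{p}\n&lt;/passage&gt;" for i, p in enumerate(kept)]
      return "\n\n".join(blocks)[:max_chars]  # 4.1: hard cap on size
</code></pre>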
5. Leverage Information Retrieval/RAG:<p><pre><code> 5.1. Use RAG to provide the LLM with knowledge to improve output.
5.2. Ensure retrieved documents are relevant, dense, and detailed.
 5.3. Utilize hybrid search methods combining keyword and embedding-based retrieval (see the sketch below).
</code></pre>
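A sketch of 5.3 using reciprocal rank fusion to merge the two result lists; `bm25_search` and `vector_search` are placeholders for your own keyword and embedding backends, not real library calls:<p><pre><code>  def bm25_search(query: str, k: int) -> list[str]:
      return []  # placeholder: ranked doc ids from your keyword index

  def vector_search(query: str, k: int) -> list[str]:
      return []  # placeholder: ranked doc ids from your vector index

  def hybrid_search(query: str, k: int = 10, c: int = 60) -> list[str]:
      scores: dict[str, float] = {}
      for ranked in (bm25_search(query, k), vector_search(query, k)):
          for rank, doc_id in enumerate(ranked):
              # reciprocal rank fusion: a document that ranks well in
              # either list floats to the top of the merged list
              scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank + 1)
      return sorted(scores, key=scores.get, reverse=True)[:k]
</code></pre>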
6. Workflow Optimization:<p><pre><code> 6.1. Decompose tasks into multi-step workflows for better accuracy.
6.2. Prioritize deterministic execution for reliability and predictability.
 6.3. Use caching to save costs and reduce latency (see the sketch below).
</code></pre>
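Point 6.3 as a sketch; the in-memory dict stands in for a real cache (Redis, disk) and `complete()` for your LLM call:<p><pre><code>  import hashlib, json

  _cache: dict[str, str] = {}  # swap for Redis or disk in production

  def complete(model: str, prompt: str) -> str:
      return ""  # placeholder for your LLM call

  def cached_complete(model: str, prompt: str) -> str:
      # identical (model, prompt) pairs hit the cache, not the API;
      # only safe when generation is deterministic (temperature=0, 6.2)
      key = hashlib.sha256(json.dumps([model, prompt]).encode()).hexdigest()
      if key not in _cache:
          _cache[key] = complete(model, prompt)
      return _cache[key]
</code></pre>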
7. Evaluation and Monitoring:<p><pre><code> 7.1. Create assertion-based unit tests using real input/output samples (see the sketch below).
7.2. Use LLM-as-Judge for pairwise comparisons to evaluate outputs.
7.3. Regularly review LLM inputs and outputs for new patterns or issues.
</code></pre>
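Sketches of 7.1 and 7.2; the sample data, `answer_question`, and the judge prompt are all illustrative:<p><pre><code>  def answer_question(question: str) -> str:
      return ""  # the system under test (placeholder)

  # 7.1: assertion-based checks over real input/output samples (pytest style)
  SAMPLES = [("What is your refund policy?", "refund")]

  def test_answers_mention_required_terms():
      for question, must_contain in SAMPLES:
          answer = answer_question(question)
          assert must_contain in answer.lower()
          assert len(answer) < 500  # no rambling

  # 7.2: pairwise comparison is easier to judge reliably than absolute scores
  JUDGE_PROMPT = (
      "Question: {q}\nAnswer A: {a}\nAnswer B: {b}\n"
      "Which answer is better? Reply with exactly 'A' or 'B'."
  )
</code></pre>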
8. Address Hallucinations and Guardrails:<p><pre><code> 8.1. Combine prompt engineering with factual inconsistency guardrails.
 8.2. Use content moderation APIs and PII detection packages to filter outputs (see the sketch below).
</code></pre>
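A crude stand-in for 8.2: a real deployment would use a moderation API plus a dedicated PII package (e.g. Presidio); these two patterns are nowhere near complete coverage:<p><pre><code>  import re

  PII_PATTERNS = [
      re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN shape
      re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email address
  ]

  def redact(text: str) -> str:
      # scrub anything matching a PII pattern before the output ships
      for pattern in PII_PATTERNS:
          text = pattern.sub("[REDACTED]", text)
      return text
</code></pre>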
9. Operational Practices:<p><pre><code> 9.1. Regularly check for development-prod data skew.
9.2. Ensure data logging and review input/output samples daily.
 9.3. Pin specific model versions to maintain consistency and avoid unexpected changes (see the sketch below).
</code></pre>
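Points 9.2 and 9.3 in miniature; the model ids just show the shape of a pinned snapshot and the log path is arbitrary:<p><pre><code>  import json, time

  # 9.3: one pinned model id per task, kept in config rather than
  # scattered through the code; "gpt-4o" drifts, a dated snapshot does not
  MODELS = {
      "summarize": "gpt-4o-2024-08-06",
      "classify": "gpt-4o-mini-2024-07-18",
  }

  def log_call(task: str, prompt: str, output: str) -> None:
      # 9.2: append every input/output pair so samples can be reviewed daily
      record = {"ts": time.time(), "task": task, "model": MODELS[task],
                "prompt": prompt, "output": output}
      with open("llm_calls.jsonl", "a") as f:
          f.write(json.dumps(record) + "\n")
</code></pre>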
10. Team and Roles:<p><pre><code> 10.1. Educate and empower all team members to use AI technology.
10.2. Include designers early in the process to improve user experience and reframe user needs.
10.3. Ensure the right progression of roles and hire based on the specific phase of the project.
</code></pre>
11. Risk Management:<p><pre><code> 11.1. Calibrate risk tolerance based on the use case and audience.
11.2. Focus on internal applications first to manage risk and gain confidence before expanding to customer-facing use cases.</code></pre>