
In the land of LLMs, can we do better mock data generation?

140 points by pncnmnp 8 months ago

22 comments

alex-moon 7 months ago

Big fan of this write-up, as it presents a really easy to understand and at the same time brutally honest example of a domain in which a) you would expect LLMs to perform very well, b) they don't, and c) the solution is to make the use of ML more targeted, a complement to human reasoning rather than a replacement for it.

Over and over again we see businesses sinking money into "AI" where they are effectively doing a) and then calling it a day, blithely expecting profit to roll in. The day cannot come too soon when these businesses all lose their money and the hype finally dies, and we can go back to using ML the way this write-up does (i.e. the way it is meant to be used). Let's hope no critical systems (e.g. healthcare or law enforcement) make the same mistake these businesses are making before that time.
jumploops 7 months ago

The title and the contents don't match.

The author expected to use LLMs to just solve the mock data problem, including traversing the schema and generating the correct Rust code for DB insertions.

This demonstrates little about using LLMs for _mock data_ and more about using LLMs for understanding existing system architecture.

The latter is a hard problem, as humans are known to create messy and complex systems (see: any engineer joining a new company).

For mock data generation, we've[0] actually found LLMs to be fantastic; however, there are a few tricks:

1. Few-shot prompting: use a couple of example "records" by inserting user/assistant messages to "prime" the context.
2. Keep the records you've generated in context, as in, treat every record generated as a historical chat message. This helps avoid duplicates/repeats of common tropes (e.g. John Smith).
3. Split your tables into multiple generation steps, e.g. start with "users" and then for each user generate an "address" (with history!), and so on. Model your mock data creation after your schema and its constraints; don't rely on the LLM for this step.
4. Separate out mock data generation and DB updates into disparate steps. First generate CSVs (or JSON/YAML) of your data, and then use separate script(s) to insert that data. This helps avoid issues at insertion, as you can easily tweak, retry, or pass on malformed data.

LLMs are fantastic tools for mock data creation, but don't expect them to also solve the problem of understanding your legacy DB schemas and application code all at once (yet?).

[0] https://www.youtube.com/watch?v=BJ1wtjdHn-E
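A minimal sketch of the loop described in that list, assuming an OpenAI-style chat client (the model name, prompts, and record shape are illustrative, not from the comment):

```python
# Sketch: few-shot mock-data generation that keeps every generated record
# in the chat history (tricks 1 and 2), then writes to a file rather than
# straight to the DB (trick 4). Assumes the `openai` Python client.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = "You generate one realistic mock user record as JSON with keys: name, email, city."

# Few-shot priming: a couple of hand-written example records.
history = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "Generate a record."},
    {"role": "assistant", "content": json.dumps(
        {"name": "Priya Raman", "email": "priya.r@example.net", "city": "Austin"})},
    {"role": "user", "content": "Generate another record."},
    {"role": "assistant", "content": json.dumps(
        {"name": "Tomasz Nowak", "email": "t.nowak@example.org", "city": "Gdansk"})},
]

records = []
for _ in range(5):
    history.append({"role": "user", "content": "Generate another record. Avoid repeating earlier names."})
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    content = resp.choices[0].message.content
    history.append({"role": "assistant", "content": content})  # stays in context to deter duplicates
    records.append(json.loads(content))

# Generation and insertion are separate steps: dump to a file first,
# and let a different script handle DB inserts (and retries on bad rows).
with open("users.json", "w") as f:
    json.dump(records, f, indent=2)
```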
edrenova 7 months ago

Nice write-up; mock data generation with LLMs is pretty tough. We spent time trying to do it across multiple tables and it always had issues. Whether you look at classical ML models like GANs or even LLMs, they struggle with producing a lot of data while respecting FKs, constraints, and other relationships.

Maybe some day it gets better, but for now we've found that using a more traditional algorithmic approach is more consistent.

Transparency: founder of Neosync - open source data anonymization - github.com/nucleuscloud/neosync
danielbln 7 months ago

Did I miss it, or did the article not mention which LLM they tried or what prompts they've used? They also mention zero-shot only, meaning no in-context learning. And they didn't think to tweak the instructions after it failed the first time? I don't know, it doesn't seem like they really tried all that hard and basically just quickly checked the "yep, LLMs don't work here" box.
dogma1138 7 months ago

Most LLMs I've played with are terrible at generating mock data that is in any way useful because they are strongly reinforced against anything that could be used for "recall".

At least when playing around with llama2 for this, you need to abliterate it to the point of lobotomy to do anything, and then the usefulness drops for other reasons.
pitah1 7 months ago

The world of mock data generation is now flooded with ML/AI solutions generating data, but this is a solution that understands it is better to generate metadata to help guide the data generation. I found this to be the case given that the former solutions rely on production data, require retraining, run slowly, consume huge resources, offer no guarantee against leaking sensitive data, and are unable to retain referential integrity.

As mentioned in the article, I think there is a lot of potential in this area for improvement. I've been working on a tool called Data Caterer (https://github.com/data-catering/data-caterer) which is a metadata-driven data generator that can also validate based on the generated data. Then you have full end-to-end testing using a single tool. There are also other metadata sources that can help drive these kinds of tools outside of using LLMs (i.e. data catalogs, data quality).
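For illustration, a rough sketch of the metadata-driven idea: column metadata picks a generator, so no production data or model training is needed. The metadata format and generator table below are hypothetical, not Data Caterer's actual API; it uses the `faker` package:

```python
# Sketch: metadata drives per-column generators instead of an ML model.
from faker import Faker

fake = Faker()

# Hypothetical metadata, e.g. exported from a data catalog.
schema = {
    "users": [
        {"name": "id",    "type": "uuid"},
        {"name": "email", "type": "email"},
        {"name": "city",  "type": "city"},
    ]
}

GENERATORS = {
    "uuid":  fake.uuid4,
    "email": fake.email,
    "city":  fake.city,
}

def generate(table: str, n: int) -> list[dict]:
    cols = schema[table]
    return [{c["name"]: GENERATORS[c["type"]]() for c in cols} for _ in range(n)]

print(generate("users", 3))
```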
SkyVoyager99 7 months ago

I think this article does a good job of capturing the complexities of generating test data for real-world databases. Generating mock data using LLMs for individual tables based on the naming of the fields is one thing, but doing it across multiple tables, while honoring complex relationships across them (primary-foreign keys across 1:1, 1:N, and M:N with intermediate tables), is a whole other level of challenge. And it's even harder for databases such as MongoDB, where the relationships across collections are often implicit and can best be inferred from the names of the fields.
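A minimal sketch of the algorithmic way to keep those relationships intact (all table and column names here are made up): generate parent rows first, then draw child FKs only from ids that actually exist.

```python
# Sketch: honor 1:N and M:N relationships by generating parents first
# and sampling child FKs from the set of existing ids.
import random
import uuid

rng = random.Random(0)

users = [{"id": str(uuid.uuid4()), "name": f"user{i}"} for i in range(10)]
tags = [{"id": str(uuid.uuid4()), "label": f"tag{i}"} for i in range(5)]

# 1:N -- each post references one existing user.
posts = [
    {"id": str(uuid.uuid4()), "user_id": rng.choice(users)["id"], "title": f"post {i}"}
    for i in range(20)
]

# M:N -- the intermediate table pairs existing posts with existing tags,
# sampled without replacement so there are no duplicate pairs.
post_tags = rng.sample([(p["id"], t["id"]) for p in posts for t in tags], k=15)
```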
nonameiguess 7 months ago

We faced probably about the worst form of this problem you can face when working for the NRO on ground processing of satellite data. When new orbital sensor platforms are developed, new processing software has to be developed in tandem, but the software has to be developed and tested before the platforms are actually launched, so real data is impossible and you have to generate and process synthetic data instead.

Even then, it's an entirely tractable problem. If you understand the physical characteristics and capabilities of the sensors and the basic physics of satellite imaging in general, you simply use that knowledge. You can't possibly know what you're really going to see when you get into space and look, but you at least know the mathematical characteristics the data will have.

The entire problem here is you need a lot of expertise to do this. It's not even expertise I have or any other software developer had or has. We needed PhDs in orbital mechanics, atmospheric studies, and image science to do it. There isn't and probably never will be a "one-click" button to just make it happen, but this kind of thing might honestly be a great test for anyone who truly believes LLMs can reason at a level equal to human experts: generate a form of data that has never existed, thus cannot have been in your training set, from first principles of basic physics.
sgarland 7 months ago
IMO, nothing beats a carefully curated selection of data, randomly selected (with correlations as needed). The problem is you rapidly start getting into absurd levels of detail for things like postal addresses, at least, if you want them to be accurate.
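A small sketch of that approach (the curated pool below is illustrative): sampling whole rows from a vetted pool keeps correlated fields, such as city/state/zip, consistent by construction.

```python
# Sketch: sample from a curated pool so correlated fields stay consistent.
import random

# Curated (city, state, zip) triples -- vetted once, reused everywhere.
ADDRESS_POOL = [
    ("Austin", "TX", "78701"),
    ("Portland", "OR", "97201"),
    ("Madison", "WI", "53703"),
]
FIRST_NAMES = ["Ada", "Grace", "Alan", "Edsger"]
LAST_NAMES = ["Hopper", "Lovelace", "Turing", "Dijkstra"]

def mock_person(rng: random.Random) -> dict:
    city, state, zip_code = rng.choice(ADDRESS_POOL)  # drawn together, never mixed
    return {
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "city": city, "state": state, "zip": zip_code,
    }

rng = random.Random(42)  # seeded so test fixtures are reproducible
print([mock_person(rng) for _ in range(3)])
```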
zebomon 7 months ago

Good read. I wonder to what degree this kind of step-making, which I suppose is what is often happening under the hood of OpenAI's o1 "reasoning" model, is set up manually (manually as in on a case-by-case basis) as you've done here.

I'm reminded of an evening that I spent playing Overcooked 2 with my partner recently. We made it through to the 4-star rounds, which are very challenging, and we realized that for one of the later 4-star rounds, one could reach the goal rather easily - by taking advantage of a glitch in the way that items are stored on the map. This realization brought up an interesting conversation as to whether or not we should then beat the round twice, once using the glitch and once not.

With LLMs right now, I think there's still a widespread hope (wish?) that the emergent capabilities seen in scaled-up data and training epochs will yield ALL capabilities hereon. Fortunately for the users of this site, hacking together solutions seems like it's going to remain necessary for many goals.
yawnxyz 7 months ago

OK, so a long time ago I used "real-looking examples" in a bunch of client prototypes (for a big, widely known company's web store) and the account managers couldn't tell whether these were new items that had been released or not... so somehow the mock data ended up in production (before it got caught and snipped).

Ever since then I use "real-but-dumb examples" so people know at a glance that it can't possibly be real.

The reason I don't like Latin placeholder text is b/c the word lengths are different than English, so sentence widths end up very different.
benxh 7 months ago

I'm pretty sure that Neosync[0] does this to a pretty good degree; it is open source and YC funded too.

[0] https://www.neosync.dev/
WhiteOwlEd 7 months ago

Building on this, human preference optimization (such as Direct Preference Optimization or Kahneman-Tversky Optimization) could be used to help refine models to create better data.

I wrote about this more recently in the context of using LLMs to improve data pipelines: https://www.linkedin.com/posts/ralphbrooks_bigdata-dataengineering-artificialintelligence-activity-7247270705803743233-lXTe
larodi 7 months ago

The thing is that this test data generation does not work if you don't account for the schema. The author did so; well done. I've been following the same algo for a year, and it works as long as the context is big enough to keep the generated ids; otherwise you feed in the ids for the missing FKs.

But this is really not a breakthrough; anyone with fair knowledge of LLMs and E/R should be able to devise it. The fact that not many people have interdisciplinary knowledge is very much evident from all the text2sql papers, for example, which is a similar domain.
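A sketch of that FK-feeding step: when the context window can't hold every generated id, inject the valid parent ids directly into the prompt (the prompt wording and record shape are illustrative).

```python
# Sketch: constrain FK columns by injecting known-valid parent ids into the prompt.
import json

def order_prompt(user_ids: list[str], n: int) -> str:
    return (
        f"Generate {n} mock 'orders' rows as a JSON array. "
        "Each row has keys: order_id, user_id, amount. "
        f"user_id MUST be one of: {json.dumps(user_ids)}."  # the fed-in FK values
    )

# user_ids would come from the already-generated parent table.
print(order_prompt(["u_01", "u_02", "u_03"], n=5))
```

Even then, it is worth validating the returned FKs against the known set and retrying on mismatches, since the model is not guaranteed to comply.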
eesmith 7 months ago

A European friend of mine told me about some of the problems of mock data generation.

A hard one, at least for the legal requirements in her field, is that it must not include a real person's information.

Like, if it says "John Smith, 123 Oak St." and someone actually lives there with that name, then it's a privacy violation.

You end up having to use addresses that specifically do not exist, driver's license numbers which are invalid, etc.
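A sketch of that invalid-by-construction idea. The reserved ranges below are the well-known ones (RFC 2606 example domains; the North American numbering plan's fictional 555-0100 to 555-0199 phone block); the address line is just an illustrative placeholder and would need checking against the local postal rules that apply.

```python
# Sketch: generate identifiers that look plausible but are guaranteed not real.
import random

rng = random.Random(7)

def fake_email(name: str) -> str:
    # example.com is reserved by RFC 2606 and never belongs to a real person.
    return f"{name.lower().replace(' ', '.')}@example.com"

def fake_phone() -> str:
    # 555-0100 through 555-0199 is reserved for fictional use in the NANP.
    return f"+1-202-555-{rng.randint(100, 199):04d}"

def fake_address() -> str:
    # Illustrative only: pick a street name known not to exist in the
    # target jurisdiction, per whatever legal requirements apply.
    return "0 Nowhere Lane, Springfield"

print(fake_email("John Smith"), fake_phone(), fake_address())
```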
chromanoid 7 months ago
The article reads like it was a bullet point list inflated by AI. But maybe I am just allergic to long texts nowadays.<p>I wonder if we will use AI users to generate mock data and e2e test our applications in the near future. This would probably generate even more realistic data.
lysecret 7 months ago

This is a very good point; that's probably my number one use-case for things like Copilot chat: just to fill in some of my types and generate some test cases.
roywiggins 7 months ago

A digression, but

> this text has been the industry's standard dummy text ever since some printed in the 1500s

doesn't seem to be true:

https://slate.com/news-and-politics/2023/01/lorem-ipsum-history-origins.html
hluska 7 months ago

From the article:

"It should generate realistic data based solely on the schema, without requiring any external user input—a 'one-click' solution with minimal friction."

This is extremely ambitious, and ambition will always be very cool.
dartos 7 months ago

Maybe I'm confused, but why would an LLM be better at mapping tuples to functions as opposed to a kind of switch statement?

Especially since it doesn't seem to totally understand the breadth of possible kinds of faked data?
erehweb 7 months ago

See also the Charlie Javice case, where she allegedly defrauded JP Morgan into buying her student financial aid company using mock data: https://www.nbcnews.com/news/us-news/startup-founder-charlie-javice-go-trial-2024-alleged-jpmorgan-fraud-rcna124530
thelostdragon 8 months ago
This looks quite interesting and promising.