
Shapeshift: Semantically map JSON objects using key-level vector embeddings

114 points · by marvinkennis · 10 months ago

12 comments

simonw · 10 months ago
This is the code that does the work: https://github.com/rectanglehq/Shapeshift/blob/d954dab2a866c750cd6862d1eb44830b825bcb06/index.ts#L132-L160

There are a few ways this could be made less expensive to run:

1. Cache those embeddings somewhere. You're only embedding simple strings like "name" and "address" - no need to do that work more than once in an entire lifetime of running the tool.

2. As suggested here https://news.ycombinator.com/item?id=40973028, change the design of the tool so that instead of doing the work it returns a reusable data structure mapping input keys to output keys. That way you only have to run it once, and can then use the generated data structure to apply the transformations to large amounts of data in the future.

3. Since so many of the keys are going to have predictable names ("name", "address", etc.) you could even pre-calculate embeddings for the 1,000 most common keys across all three embedding providers and ship those as part of the package.

Also: in https://github.com/rectanglehq/Shapeshift/blob/d954dab2a866c750cd6862d1eb44830b825bcb06/index.ts#L57-L64 you're using Promise.map() to run multiple embeddings through the OpenAI API at once, which risks tripping their rate limit. You should be able to pass the text as an array in a single call instead, something like this:

    const response = await this.openai!.embeddings.create({
      model: this.embeddingModel,
      input: texts,
      encoding_format: "float",
    });
    return response.data.map(item => item.embedding);

https://platform.openai.com/docs/api-reference/embeddings/create says input can be a string OR an array - that's reflected in the TypeScript library here too: https://github.com/openai/openai-node/blob/5873a017f0f2040ef97040a8df19c5b4dc2a66fd/src/resources/embeddings.ts#L80-L90
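On point 1, a minimal sketch of what such a cache could look like, assuming the official openai npm package; the embedTexts helper and model name are illustrative, not Shapeshift's actual API:

    import OpenAI from "openai";

    const openai = new OpenAI();
    const cache = new Map<string, number[]>();

    // Embed a batch of key strings, calling the API only for strings not seen before.
    async function embedTexts(texts: string[]): Promise<number[][]> {
      const missing = texts.filter((t) => !cache.has(t));
      if (missing.length > 0) {
        const response = await openai.embeddings.create({
          model: "text-embedding-3-small", // illustrative model choice
          input: missing,                  // one batched call instead of one call per key
          encoding_format: "float",
        });
        response.data.forEach((item, i) => cache.set(missing[i], item.embedding));
      }
      return texts.map((t) => cache.get(t)!);
    }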
WhatsName · 10 months ago
Maybe I'm not the target audience, but here are some simple questions for the author or potential users:

What about anything more complex, like mapping date of birth to age or the other way round? Also, since we will inevitably incur costs, why not let an LLM write a transformation rule for us?
lordofmoria · 10 months ago
Since LLMs are bad at the null hypothesis (in this case, when a key does not exist in the source JSON), how does this prevent hallucinating transformations for missing keys?
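One way to avoid that, though nothing in the Shapeshift code is confirmed to do this, is to accept a mapping only when the best cosine similarity clears a threshold and otherwise leave the target key null rather than guessing. A rough sketch, with a made-up 0.8 cutoff that would need tuning:

    // Cosine similarity between two embedding vectors.
    function cosine(a: number[], b: number[]): number {
      let dot = 0, na = 0, nb = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
      }
      return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Return the best-matching source key, or null if nothing clears the threshold.
    function bestMatch(
      targetVec: number[],
      sourceVecs: Map<string, number[]>,
      threshold = 0.8 // illustrative; would need tuning per embedding model
    ): string | null {
      let bestKey: string | null = null;
      let bestScore = threshold;
      for (const [key, vec] of sourceVecs) {
        const score = cosine(targetVec, vec);
        if (score > bestScore) {
          bestScore = score;
          bestKey = key;
        }
      }
      return bestKey; // null means "no confident match; don't invent one"
    }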
leobg · 10 months ago
The example could be handled with no machine learning at all. Just use a bag-of-words comparison with a subword tokenizer. And if you do need embeddings (to map synonyms/topics), fastText is faster, cheaper and runs locally. For hard cases, you can feed the source/target schemas to gpt-4o once to create a map - and then apply that one map to all instances.
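A rough sketch of that no-ML route, with a crude regex split standing in for a real subword tokenizer (all helper names here are illustrative):

    // Split a key name into lowercase tokens; a real subword tokenizer
    // (BPE, fastText character n-grams) would be more robust than this regex.
    function tokens(key: string): Set<string> {
      return new Set(
        key
          .replace(/([a-z0-9])([A-Z])/g, "$1 $2") // fullName -> full Name
          .split(/[^a-zA-Z0-9]+/)
          .filter(Boolean)
          .map((t) => t.toLowerCase())
      );
    }

    // Overlap score in [0, 1]: shared tokens divided by the larger token set.
    function keySimilarity(a: string, b: string): number {
      const ta = tokens(a);
      const tb = tokens(b);
      let shared = 0;
      for (const t of ta) if (tb.has(t)) shared++;
      return shared / Math.max(ta.size, tb.size);
    }

    // keySimilarity("fullName", "full_name") === 1
    // keySimilarity("dateOfBirth", "dob") === 0 -- the hard case where embeddings or an LLM help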
lukasb · 10 months ago
What is this for? The examples given could be handled deterministically. Is this for situations where you don't know JSON schemas in advance? What situations are those?
henry700 · 10 months ago
Keep the bug generators going, we will need the jobs
visarga · 10 months ago
This task, in its most general form, is better done with a question-answering prompt than with embeds. How do you solve "Full Name" -> "First Name", "Last Name" with embeds? QA is the right level of abstraction for schema-conversion tasks. And it's simple: just put the source JSON + target JSON schema in the prompt and ask for value extraction.
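A minimal sketch of that QA-style approach, assuming the official openai npm client and gpt-4o; the prompt wording is illustrative, not anything the library ships:

    import OpenAI from "openai";

    const openai = new OpenAI();

    // Ask the model to fill the target JSON schema from the source object in one call.
    async function extract(source: object, targetSchema: object): Promise<object> {
      const response = await openai.chat.completions.create({
        model: "gpt-4o",
        response_format: { type: "json_object" },
        messages: [
          {
            role: "system",
            content:
              "Fill the target JSON schema using only values from the source JSON. " +
              "If a field has no corresponding value, set it to null.",
          },
          {
            role: "user",
            content: `Source:\n${JSON.stringify(source)}\n\nTarget schema:\n${JSON.stringify(targetSchema)}`,
          },
        ],
      });
      return JSON.parse(response.choices[0].message.content ?? "{}");
    }

    // Unlike key-level embeddings, this can handle splits such as
    // "Full Name" -> "First Name" + "Last Name".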
yetanotherjosh · 10 months ago
So this identifies keys from source and target objects that are fuzzy synonyms and copies the values over. What is a real-world use case for this? Add the fact that it's fuzzy and won't always work, so it would require a great deal of extra effort in QA/testing (harder than just mapping the keys programmatically), and I'm puzzled.
happy_bacon · 10 months ago
Here is another DSL for implementing object model mappings: https://github.com/patleahy/lir
benzguo · 10 months ago
Put together a quick version with an LLM, using Substrate: https://www.val.town/v/substrate/shapeshift

I've turned the target object into a JSON schema, but you could probably generate that JSON schema pretty reliably using a codegen LLM.
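For reference, deriving a flat JSON schema from a target example object is small enough to do without an LLM; a minimal sketch that only handles flat objects with primitive values (nested objects and arrays would need recursion):

    // Build a rough JSON schema from an example target object.
    function toJsonSchema(example: Record<string, unknown>): object {
      const properties: Record<string, { type: string }> = {};
      for (const [key, value] of Object.entries(example)) {
        properties[key] = { type: value === null ? "string" : typeof value };
      }
      return { type: "object", properties, required: Object.keys(properties) };
    }

    // toJsonSchema({ firstName: "", age: 0 })
    // -> { type: "object", properties: { firstName: { type: "string" }, age: { type: "number" } }, ... }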
explaininjs · 10 months ago
What'd be really great is a codegen aspect. A non-negligible part of any data munging operation is "this input object has fields X, Y, Z and we need an output object with fields X, f(X), Y, f(Y, Z)". This is something an LLM has a decent chance of being really quite good at.
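For illustration, the kind of deterministic mapping function such a one-off codegen step might emit; field names like dateOfBirth -> age are made up, and the point is that the generated code is reviewable and free to run on every record afterwards:

    interface SourceRecord { firstName: string; lastName: string; dateOfBirth: string }
    interface TargetRecord { fullName: string; age: number }

    // X stays, f(X) derives age from dateOfBirth, f(Y, Z) combines first and last name.
    function mapRecord(src: SourceRecord): TargetRecord {
      const birth = new Date(src.dateOfBirth);
      const now = new Date();
      let age = now.getFullYear() - birth.getFullYear();
      const hadBirthdayThisYear =
        now.getMonth() > birth.getMonth() ||
        (now.getMonth() === birth.getMonth() && now.getDate() >= birth.getDate());
      if (!hadBirthdayThisYear) age -= 1; // adjust if this year's birthday hasn't happened yet
      return { fullName: `${src.firstName} ${src.lastName}`, age };
    }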
hendler · 10 months ago
Created a Rust version using devin.ai (untested): https://github.com/HumanAssisted/shapeshift-rust