
Show HN: Weave - actually measure engineering productivity

22 points by adchurch 6 months ago
Hey HN,

We're building Weave: an ML-powered tool to measure engineering output that actually understands engineering output!

Why? Here's the thing: almost every eng leader already measures output - either openly or behind closed doors. But they rely on metrics like lines of code (correlation with effort: ~0.3), number of PRs, or story points (slightly better at ~0.35). These metrics are, frankly, terrible proxies for productivity.

We've developed a custom model that analyzes code and its impact directly, with a far better 0.94 correlation. The result? A standardized engineering output metric that doesn't reward vanity. Even better, you can benchmark your team's output against peers while keeping everything private.

Although this one metric is much better than anything else out there, of course it still doesn't tell the whole story. In the future, we'll build more metrics that go deeper into things like code quality and technical leadership. And we'll build actionable suggestions on top of all of it to help teams improve and track progress.

After testing with several startups, the feedback has been fantastic, so we're opening it up today. Connect your GitHub and see what Weave can tell you: https://app.workweave.ai/welcome.

I'll be around all day to chat, answer questions, or take a beating. Fire away!
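To make the correlation figures above concrete, here is a minimal sketch of how one might check a proxy metric such as lines changed against reported effort. The per-PR numbers are invented purely for illustration, and the Pearson helper is standard textbook math; none of this reflects Weave's actual model.

    import statistics

    def pearson(xs, ys):
        # Pearson correlation coefficient between two equal-length samples.
        mx, my = statistics.fmean(xs), statistics.fmean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    # Invented per-PR data: lines changed vs. hours of effort (illustration only).
    lines_changed = [1200, 40, 300, 15, 800, 60, 500, 25]
    effort_hours  = [3, 16, 8, 12, 5, 20, 6, 10]

    # For this toy data the result is negative (about -0.8):
    # diff size and effort do not have to track each other at all.
    print(round(pearson(lines_changed, effort_hours), 2))

Swapping in real per-PR data is a quick way to see for yourself how weak lines of code is as an effort proxy.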

10 comments

senko 6 months ago
"Hello Jane, please have a seat. We need to talk about your productivity. Yes, I know you helped the team through a crunch and delivered the new feature, which works flawlessly and is loved by our users. And our balance sheet is much healthier after you found that optimization that saves us $1mm/year. We also appreciate that younger teammates look to you for guidance and learn a lot from you.

But you see, the AI scored your productivity at 47%, barely "meets expectations", while we expect everyone to score at least 72%, "exceeds expectations". How is that calculated? The AI is a state of the art proprietary model, I don't know the details...

Anyways, we've got to design a Personal Improvement Plan for you. Here's what our AI recommends. We'll start with the TPS reports..."
Comment #42198741 not loaded
rkagerer 6 months ago
How did you come up with those magic correlation numbers?

Is this generally just sniffing surface quality and quantity of written code, or is consideration given to how architecturally sound the system is built, whether the features introduced and their implementations make sense, how that power is exposed to users and whether the UI is approachable and efficient, user feedback resulting from the effort, long-term sustainability and technical debt left behind (inadvertently or with deliberation), healthy practices for things like passwords & sensitive data, etc.?

I'm glad to see an effort at capturing better metrics, but my own feeling is that trying to precisely measure developer productivity is like trying to measure IQ - it's a flawed errand and all you wind up capturing is one corner of a larger picture. Your website shares zero information prior to login, and I'm looking forward to you elaborating a little more on your offering!

EDIT: Would also love to hear feedback from developers at the startups you tested at - did they like it and feel it better reflected their efforts during periods they felt productive vs. not, was there any initial or ongoing resistance & skepticism, did it make managers more aware of factors not traditionally captured by the alternative metrics you mentioned, etc.?
Comment #42198642 not loaded
adchurch 6 months ago
Our metric is approximately "hours of work for an expert engineer." Here are some example open source PRs and their output metrics calculated by our algorithm:

https://github.com/PostHog/posthog/pull/25056: 15.266 (Adds backend, frontend, and tests for a new feature)

https://github.com/microsoft/vscode/pull/222315: 8.401 (Refactors code to use a new service and adds new tests)

https://github.com/facebook/react/pull/27977: 5.787 (Small change with extensive, high-effort tests; approximately 1 day of work for an expert engineer)

https://github.com/microsoft/vscode/pull/213262: 1.06 (Mostly straightforward refactor; well under 1 day of work)
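For a raw-size point of comparison with those scores, the public GitHub REST API exposes additions, deletions, and changed_files for each pull request. A minimal sketch using the requests library follows; it only fetches diff-size stats and says nothing about how Weave computes its scores.

    import requests

    # The example PRs quoted above, as (repo, pull number) pairs.
    prs = [
        ("PostHog/posthog", 25056),
        ("microsoft/vscode", 222315),
        ("facebook/react", 27977),
        ("microsoft/vscode", 213262),
    ]

    for repo, number in prs:
        # Unauthenticated requests are rate-limited; add a token for heavier use.
        resp = requests.get(
            f"https://api.github.com/repos/{repo}/pulls/{number}",
            headers={"Accept": "application/vnd.github+json"},
            timeout=10,
        )
        pr = resp.json()
        print(repo, number, pr.get("additions"), pr.get("deletions"), pr.get("changed_files"))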
Comment #42198054 not loaded
Comment #42197826 not loaded
jaredsohn 6 months ago
If you build something that doesn't solve problems with impact to the business, your real productivity is zero. How does this account for that?

https://blog.pragmaticengineer.com/the-product-minded-engineer/
Comment #42203378 not loaded
Comment #42199744 not loaded
henning 6 months ago
As soon as people know how the metric is calculated, they will game that metric and it will cease to be useful.
Comment #42198397 not loaded
Comment #42197870 not loaded
id00 6 months ago
Let me just ignore my natural disdain for the whole thing (as an engineer and a manager).

> We've developed a custom model that analyzes code and its impact directly...

This is a bold claim, all things considered. Don't you need to fine-tune this model for every customer, as their business metrics are likely vastly different? How do you measure the impact of refactoring? What about regressions or design mistakes that surface themselves after months or even years?
Comment #42198795 not loaded
Comment #42198819 not loaded
jaredsohn 6 months ago
I'm looking forward to developers setting up LLM prompts to make their code seem more complex and like it required more effort.
itsdrewmiller 6 months ago
What do you see as the major threats to validity for your approach?
mg57 6 months ago
Pretty dumb to think you can infer effort from the code itself. You make one "smart invocation" to a remote microservice and replace 1000 lines of code!

The information for effort is not available at the code level - sorry to burst your bubble.
Comment #42198476 not loaded
adambeecee 6 months ago
Hey HN! I'm one of the co-founders of Weave, and I wanted to jump in here to share a bit more.

Building this has been a wild ride. The challenge of measuring engineering output in a way that's fair and useful is something we've thought deeply about, especially because so many of the existing metrics feel fundamentally broken.

The 0.94 correlation is based on rigorous validation with several teams (happy to dive into the details if anyone's curious). We're also really mindful that even the best metrics only tell part of the story, which is why our focus is on building a broader set of signals and actionable insights as the next step.

Would love to hear your thoughts, feedback, or even skepticism; it's all helpful as we keep refining the product.
Comment #42197766 not loaded
Comment #42197855 not loaded