TechEcho


Launch HN: Sematic (YC S22) – Open-source framework to build ML pipelines faster

121 points by neutralino1, almost 3 years ago
Hi HN – I’m Emmanuel, founder of Sematic (https://sematic.dev). Sematic is an open-source framework to prototype and productionize end-to-end Machine Learning (ML) and Data Science (DS) pipelines in days instead of weeks or months. The idea is to do for ML development what Rails and Heroku did for web development.

I started my career searching for Supersymmetry and the Higgs boson on the Large Hadron Collider at CERN, then moved to industry. I spent the last four years building ML infrastructure at Cruise. In both academia and industry, I witnessed researchers, data scientists, and ML engineers spending an absurd share of their time building makeshift tooling, stitching up infrastructure, and battling obscure systems, instead of focusing on their core area of expertise: extracting insights and predictions from data.

This was painfully apparent at Cruise where the ML Platform team needed to grow linearly with the number of users to support and models to ship to the car. What should have just taken a click (e.g. retraining a model when world conditions change – COVID parklets, road construction sites, deployment to new cities) often required weeks of painstaking work. Existing tools for prototyping and productionizing ML/DS models did not enable developers to become autonomous and tackle new projects instead of babysitting current ones.

For example, a widely adopted tool such as Kubeflow Pipelines requires users to learn an obscure Python API, package and deploy their code and dependencies by hand, and does not offer exhaustive tracking and visualization of artifacts beyond simple metadata.

In order to become autonomous, users needed a dead-simple way to iterate seamlessly between local and cloud environments (change code, validate locally, run at scale in the cloud, repeat) and visualize objects (metrics, plots, datasets, configs) in a UI. Strong guarantees around dependency packaging, traceability of artifact lineage, and reproducibility would have to be provided out-of-the-box.

Sematic lets ML/DS developers build and run pipelines of arbitrary complexity with nothing more than minimalistic Python APIs. Business logic, dynamic pipeline graphs, configurations, resource requirements, etc., all with only Python. We are bringing the lovable aspects of Jupyter Notebooks (iterative development, visualizations) to the actual pipeline.

How it works: Sematic resolves dynamic nested graphs of pipeline steps (simple Python functions) and intercepts all inputs and outputs of each step to type-check, serialize, version, and track them. Individual steps are orchestrated as Kubernetes jobs according to required resources (e.g. GPU, high-memory), and all tracking and visualization information is surfaced in a modern UI. Build assets (user code, third-party dependencies, drivers, static libraries) are packaged and shipped to remote workers at runtime, which enables a fast and seamless iterative development experience.

Sematic lets you achieve results much faster by not wasting time on packaging dependencies, foraging for output artifacts to visualize, investigating obscure failures in black-box container jobs, bookkeeping configurations, writing complex YAML templates to run multiple experiments, etc.

It can run on a local machine or be deployed to leverage cloud resources (e.g. GPUs, high-memory instances, map/reduce clusters, etc.) with minimal external dependencies: Python, PostgreSQL, and Kubernetes.

Sematic is open-source and free to use locally or self-hosted in your own cloud. We will provide a SaaS offering to enable access to cloud resources without the hassle of maintaining a cloud deployment. To get started, simply run `$ pip install sematic; sematic start`. Check us out at https://sematic.dev, star our GitHub repo, and join our Discord for updates, feature requests, and bug reports.

We would love to hear from everyone about your experience building reliable end-to-end ML training pipelines, and anything else you’d like to share in the comments!
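
To make the "How it works" description in the post more concrete, here is a toy sketch of the general idea of intercepting a step's inputs and outputs to type-check, serialize, version, and track them. It is not Sematic's API or implementation: the `step` decorator, the `TRACKED` list, and the `_fingerprint` helper are all hypothetical names invented for illustration, and it runs everything in-process rather than as Kubernetes jobs.

```python
import functools
import hashlib
import inspect
import json
import typing

# Hypothetical in-memory tracking store; a real system would persist this.
TRACKED = []


def _fingerprint(value):
    """Serialize a value to JSON and return a short content hash as its version."""
    payload = json.dumps(value, sort_keys=True, default=repr)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def step(func):
    """Hypothetical decorator: type-check, fingerprint, and record every call."""
    hints = typing.get_type_hints(func)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()
        # Check each input against its annotation (plain classes only).
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{func.__name__}: '{name}' should be {expected.__name__}"
                )
        result = func(*args, **kwargs)
        # Check the output against the return annotation.
        expected = hints.get("return")
        if isinstance(expected, type) and not isinstance(result, expected):
            raise TypeError(
                f"{func.__name__}: output should be {expected.__name__}"
            )
        # Record content hashes so artifacts can be matched across steps.
        TRACKED.append({
            "step": func.__name__,
            "inputs": {n: _fingerprint(v) for n, v in bound.arguments.items()},
            "output": _fingerprint(result),
        })
        return result

    return wrapper


@step
def load_data(n: int) -> list:
    return list(range(n))


@step
def mean(values: list) -> float:
    return sum(values) / len(values)


@step
def pipeline(n: int) -> float:
    # A pipeline is itself a step that calls other steps (a nested graph).
    return mean(load_data(n))


print(pipeline(4))  # 1.5
for record in TRACKED:
    print(record)
```

In this sketch the output hash recorded for `load_data` equals the input hash recorded for `mean`, which is the simplest form of artifact lineage: the same content hash links a producer to its consumer.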

16 comments

ricklamers, almost 3 years ago
For people in this thread interested in what this tool is an alternative to: Airflow, Luigi, Kubeflow, Kedro, Flyte, Metaflow, Sagemaker Pipelines, GCP Vertex Workbench, Azure Data Factory, Azure ML, Dagster, DVC, ClearML, Prefect, Pachyderm, and Orchest.

Disclaimer: author of Orchest https://github.com/orchest/orchest
kajecounterhack, almost 3 years ago
Looks cool!

> Sematic makes I/O between steps in your pipelines as simple as passing an output of one python function as the input of another. Airflow provides APIs which can pass data between tasks, but involves some boilerplate around explicitly pushing/pulling data around, and coupling producers and consumers via named data keys.

In robotics you sometimes need high-performance data transformation, e.g. convert a pile of raw robot log data protos --> pile of simulation inputs --> pile of extracted data --> munged into net input format.

Does Sematic support this if the communication between tasks uses Python functions? Like if my simulator is C++, will I have to use SWIG?

In some of the competing systems, the input/output between nodes are just produced files as side effects, which is nice because it doesn't care what language / infra you use as long as you produce the required input/output.
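
As an illustration of the file-based pattern described in the comment above, here is a general Python sketch in which a step exchanges data with an external program through files and a subprocess call, so no language bindings such as SWIG are involved. Whether Sematic supports or recommends this is not stated in the thread. `SIMULATOR_CMD` is a stand-in that uses the Python interpreter so the sketch runs anywhere; in practice it would be the path to a compiled executable, and `run_simulator` is a hypothetical name.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for a compiled simulator (hypothetical): reads a JSON list from
# the first path argument and writes the doubled values to the second.
SIMULATOR_CMD = [
    sys.executable,
    "-c",
    "import json, sys, pathlib; "
    "data = json.loads(pathlib.Path(sys.argv[1]).read_text()); "
    "pathlib.Path(sys.argv[2]).write_text(json.dumps([2 * x for x in data]))",
]


def run_simulator(log_values: list) -> list:
    """Plain Python function whose real work happens in an external process."""
    with tempfile.TemporaryDirectory() as workdir:
        in_path = Path(workdir) / "input.json"
        out_path = Path(workdir) / "output.json"
        # Hand the input to the external program as a file.
        in_path.write_text(json.dumps(log_values))
        # check=True raises CalledProcessError if the program exits non-zero.
        subprocess.run([*SIMULATOR_CMD, str(in_path), str(out_path)], check=True)
        # Read the produced file back and return it as an ordinary value.
        return json.loads(out_path.read_text())


print(run_simulator([1, 2, 3]))  # [2, 4, 6]
```

From the caller's side this is still a function with typed inputs and outputs, while the external program only needs to read and write the agreed files.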
llaolleh, almost 3 years ago
I will check it out after work. Let me just say that this is indeed a legitimate problem. After you train the model, to me it takes at least 3x the amount of effort to deploy and push to production.

I wish it was as easy as dragging and dropping the model onto target servers after building it.
Smergnus, almost 3 years ago
I have an idea where I want to build an ML system that generates different sets of board game rules (think tic-tac-toe type games), then trains models to play that game, and scores each set of rules based on a set of criteria. For example: no side should always win, the skill ceiling should be high (models should keep improving when trained more). A less skilled (trained) model should sometimes be able to beat a more skilled model. The games should end within a reasonable number of turns. Etc. The high level system should then generate new rulesets, searching for a ruleset that scores optimally on the criteria. Would Sematic be good for this?
actusual, almost 3 years ago
> Machine Learning (ML) and Data Science (DS) developers are not software engineers, and vice-versa.

Woof, I'm out.
boredumb, almost 3 years ago
> with minimal external dependencies: Python, PostgreSQL, and *Kubernetes*

What a time to be alive.

All jokes aside, this is really awesome and I'm glad to see more and more tools to make ML more developer-friendly and accessible. Out of curiosity, do you guys come from a TF, PyTorch, JAX, etc. background?
phissenschaft, almost 3 years ago
Congratulations on the launch! Best wishes! Would absolutely love to dive into it soon.

Here are some high-level questions:

- How does it handle failure of individual tasks in the pipeline?
- What if the underlying jobs (e.g. training or dataset extraction or metrics evaluation) need to run outside the k8s cluster (e.g. running bare-metal, slurm, sagemaker, or even a separate k8s cluster)?
- How does caching work if multiple pipelines can share some common components (e.g. dataset extraction)?
benjismith, almost 3 years ago
Sounds great! Very interested when the SaaS offering opens up. Definitely not keen on running a Kubernetes cluster for the sake of simplifying ML operations.
wodenokoto, almost 3 years ago
Do I still need to manage a Kubernetes cluster?
rob-lambert, almost 3 years ago
Love this. As an MLOps practitioner who has repeatedly needed to build this at multiple banks, I think Sematic finally seems like a real solution for the wider world, a real place to bring best practices to data science pipelining.
morelandjs, almost 3 years ago
What are my options for big data that won’t fit completely into memory? Is it easy to hook up to a Spark cluster?

Do I have the option to access the underlying infra through a Unix shell, when the UI isn’t enough?
edublancas, almost 3 years ago
> The idea is to do for ML development what Rails and Heroku did for web development.

I think this is a great way to explain what you're doing. I'm working in the same space (ML/DS tooling) and I feel like we, as the ML/DS community, haven't cracked exactly what Rails for data looks like. I actually wrote some ideas on this a while ago (https://ploomber.io/blog/rails4ml/).

Congrats on the launch and best of luck with the product!
buntha, almost 3 years ago
How is it different from MLflow? Does it have any similarities with the recent MLflow pipeline?
brochington, almost 3 years ago
Just a note to say that even though the name "Sematic" is the same, this is not the same open source project as mine that I posted to Show HN about a week ago here: https://news.ycombinator.com/item?id=32364193.
pottertheotter, almost 3 years ago
How is this different from or complementary to Tecton?
rubenfiszel, almost 3 years ago
This is amazing. Long live open-source platforms that let developers and data scientists focus on the interesting parts of their jobs and make them more productive.