Launch HN: Sematic (YC S22) – Open-source framework to build ML pipelines faster

121 points by neutralino1 almost 3 years ago

Hi HN – I’m Emmanuel, founder of Sematic (<a href="https://sematic.dev" rel="nofollow">https://sematic.dev</a>). Sematic is an open-source framework to prototype and productionize end-to-end Machine Learning (ML) and Data Science (DS) pipelines in days instead of weeks or months. The idea is to do for ML development what Rails and Heroku did for web development.

I started my career searching for Supersymmetry and the Higgs boson on the Large Hadron Collider at CERN, then moved to industry. I spent the last four years building ML infrastructure at Cruise. In both academia and industry, I witnessed researchers, data scientists, and ML engineers spending an absurd share of their time building makeshift tooling, stitching up infrastructure, and battling obscure systems, instead of focusing on their core area of expertise: extracting insights and predictions from data.

This was painfully apparent at Cruise where the ML Platform team needed to grow linearly with the number of users to support and models to ship to the car. What should have just taken a click (e.g. retraining a model when world conditions change – COVID parklets, road construction sites, deployment to new cities) often required weeks of painstaking work. Existing tools for prototyping and productionizing ML/DS models did not enable developers to become autonomous and tackle new projects instead of babysitting current ones.

For example, a widely adopted tool such as Kubeflow Pipelines requires users to learn an obscure Python API, package and deploy their code and dependencies by hand, and does not offer exhaustive tracking and visualization of artifacts beyond simple metadata.

In order to become autonomous, users needed a dead-simple way to iterate seamlessly between local and cloud environments (change code, validate locally, run at scale in the cloud, repeat) and visualize objects (metrics, plots, datasets, configs) in a UI. Strong guarantees around dependency packaging, traceability of artifact lineage, and reproducibility would have to be provided out-of-the-box.

Sematic lets ML/DS developers build and run pipelines of arbitrary complexity with nothing more than minimalistic Python APIs. Business logic, dynamic pipeline graphs, configurations, resource requirements, etc.: all with only Python. We are bringing the lovable aspects of Jupyter Notebooks (iterative development, visualizations) to the actual pipeline.

How it works: Sematic resolves dynamic nested graphs of pipeline steps (simple Python functions) and intercepts all inputs and outputs of each step to type-check, serialize, version, and track them. Individual steps are orchestrated as Kubernetes jobs according to required resources (e.g. GPU, high-memory), and all tracking and visualization information is surfaced in a modern UI. Build assets (user code, third-party dependencies, drivers, static libraries) are packaged and shipped to remote workers at runtime, which enables a fast and seamless iterative development experience.

Sematic lets you achieve results much faster by not wasting time on packaging dependencies, foraging for output artifacts to visualize, investigating obscure failures in black-box container jobs, bookkeeping configurations, writing complex YAML templates to run multiple experiments, etc.

It can run on a local machine or be deployed to leverage cloud resources (e.g. GPUs, high-memory instances, map/reduce clusters, etc.) with minimal external dependencies: Python, PostgreSQL, and Kubernetes.

Sematic is open-source and free to use locally or self-hosted in your own cloud. We will provide a SaaS offering to enable access to cloud resources without the hassle of maintaining a cloud deployment.

To get started, simply run `$ pip install sematic; sematic start`. Check us out at <a href="https://sematic.dev" rel="nofollow">https://sematic.dev</a>, star our GitHub repo, and join our Discord for updates, feature requests, and bug reports.

We would love to hear from everyone about your experience building reliable end-to-end ML training pipelines, and anything else you’d like to share in the comments!
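
The "How it works" description above (steps are plain Python functions whose inputs and outputs are intercepted, type-checked, serialized, versioned, and tracked) can be illustrated with a small standard-library sketch. This is not Sematic's API or implementation: `tracked_step`, `RUN_LOG`, and `_fingerprint` are hypothetical names invented here, the "version" is just a content hash of a JSON dump, and the type check only handles plain classes such as `int` or `list`.

```python
import functools
import hashlib
import inspect
import json
from typing import Any, Callable, Dict, List, get_type_hints

# Hypothetical in-memory store; stands in for a real tracking backend.
RUN_LOG: List[Dict[str, Any]] = []


def _fingerprint(value: Any) -> str:
    """Serialize a value to JSON and return a short content hash as its version."""
    payload = json.dumps(value, sort_keys=True, default=repr)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def tracked_step(func: Callable) -> Callable:
    """Wrap a plain function so its inputs and output are type-checked and logged."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()
        # Check each input against its annotation (plain classes only).
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(f"{func.__name__}: {name} must be {expected.__name__}")
        result = func(*args, **kwargs)
        # Check the output against the return annotation.
        expected = hints.get("return")
        if isinstance(expected, type) and not isinstance(result, expected):
            raise TypeError(f"{func.__name__}: return value must be {expected.__name__}")
        # Record a versioned fingerprint of every input and of the output.
        RUN_LOG.append({
            "step": func.__name__,
            "inputs": {k: _fingerprint(v) for k, v in bound.arguments.items()},
            "output": _fingerprint(result),
        })
        return result

    return wrapper


@tracked_step
def load_data(n: int) -> list:
    return list(range(n))


@tracked_step
def train(data: list, learning_rate: float) -> float:
    return sum(data) * learning_rate


@tracked_step
def pipeline(n: int, learning_rate: float) -> float:
    # Nested steps: the output of one function is passed as the input of another.
    return train(load_data(n), learning_rate)
```

Calling `pipeline(4, 0.5)` leaves one record per step in `RUN_LOG`, and the output fingerprint of `load_data` matches the `data` input fingerprint of `train`, which is the kind of link a lineage view would be built from.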

16 comments

ricklamers almost 3 years ago

For people in this thread interested in what this tool is an alternative to: Airflow, Luigi, Kubeflow, Kedro, Flyte, Metaflow, Sagemaker Pipelines, GCP Vertex Workbench, Azure Data Factory, Azure ML, Dagster, DVC, ClearML, Prefect, Pachyderm, and Orchest.

Disclaimer: author of Orchest <a href="https://github.com/orchest/orchest" rel="nofollow">https://github.com/orchest/orchest</a>

kajecounterhack almost 3 years ago

Looks cool!

> Sematic makes I/O between steps in your pipelines as simple as passing an output of one python function as the input of another. Airflow provides APIs which can pass data between tasks, but involves some boilerplate around explicitly pushing/pulling data around, and coupling producers and consumers via named data keys.

In robotics you sometimes need high-performance data transformation, e.g. convert pile of raw robot log data protos --> pile of simulation inputs --> pile of extracted data --> munged into net input format.

Does Sematic support this if the communication between tasks uses Python functions? Like if my simulator is C++, will I have to use SWIG?

In some of the competing systems, the inputs/outputs between nodes are just files produced as side effects, which is nice because the system doesn't care what language / infra you use as long as you produce the required input/output.
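
The two models contrasted in this comment (Python function I/O versus files produced as side effects) can be bridged in general by a Python step that shells out to an external program and exchanges files with it. The sketch below is a generic pattern, not something confirmed for Sematic: `run_external_step`, `simulate`, and `FAKE_SIMULATOR` are hypothetical names, the `<input path> <output path>` calling convention is an assumption, and a small Python script stands in for a compiled C++ simulator so the example runs anywhere.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List


def run_external_step(command: List[str], payload: dict) -> dict:
    """Run an external program that reads an input file and writes an output file."""
    with tempfile.TemporaryDirectory() as workdir:
        input_path = Path(workdir) / "input.json"
        output_path = Path(workdir) / "output.json"
        input_path.write_text(json.dumps(payload))
        # Assumed calling convention: <program> <input path> <output path>.
        subprocess.run(command + [str(input_path), str(output_path)], check=True)
        return json.loads(output_path.read_text())


# Stand-in for a compiled simulator: a tiny script run as a separate process.
FAKE_SIMULATOR = """
import json, sys
with open(sys.argv[1]) as f:
    data = json.load(f)
with open(sys.argv[2], "w") as f:
    json.dump({"frames": len(data["logs"])}, f)
"""


def simulate(logs: List[str]) -> dict:
    """Python-level step whose real work happens in another process."""
    return run_external_step([sys.executable, "-c", FAKE_SIMULATOR], {"logs": logs})
```

With this shape the caller only sees a Python function taking and returning plain values, while the heavy work stays in whatever language the external binary is written in; no binding generator such as SWIG is involved.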

llaolleh almost 3 years ago

I will check it out after work. Let me just say that this is indeed a legitimate problem. After you train the model, to me it takes at least 3x the amount of effort to deploy and push to production.

I wish it was as easy as dragging and dropping the model onto target servers after building it.

Smergnus almost 3 years ago

I have an idea where I want to build an ML system that generates different sets of board game rules (think tic-tac-toe type games), then trains models to play that game, and scores each set of rules based on a set of criteria. For example: no side should always win, the skill ceiling should be high (models should keep improving when trained more). A less skilled (trained) model should sometimes be able to beat a more skilled model. The games should end within a reasonable number of turns. Etc. The high level system should then generate new rulesets, searching for a ruleset that scores optimally on the criteria. Would Sematic be good for this?

actusual almost 3 years ago

> Machine Learning (ML) and Data Science (DS) developers are not software engineers, and vice-versa.

Woof, I'm out.

boredumb almost 3 years ago

> with minimal external dependencies: Python, PostgreSQL, and Kubernetes

What a time to be alive.

All jokes aside, this is really awesome and I'm glad to see more and more tools to make ML more developer-friendly and accessible. Out of curiosity, do you guys come from a TF, PyTorch, JAX, etc. background?

phissenschaft almost 3 years ago

Congratulations on the launch! Best wishes! Would absolutely love to dive into it soon.

Here are some high-level questions:

- How does it handle failure of individual tasks in the pipeline?
- What if the underlying jobs (e.g. training or dataset extraction or metrics evaluation) need to run outside the k8s cluster (e.g. running bare-metal, slurm, sagemaker, or even a separate k8s cluster)?
- How does caching work if multiple pipelines can share some common components (e.g. dataset extraction)?

benjismith almost 3 years ago

Sounds great! Very interested when the SaaS offering opens up. Definitely not keen on running a Kubernetes cluster for the sake of simplifying ML operations.

wodenokoto almost 3 years ago

Do I still need to manage a Kubernetes cluster?

rob-lambert almost 3 years ago

Love this. As an MLOps practitioner who has repeatedly needed to build this at multiple banks, I think Sematic finally seems like a real solution for the wider world, a real place to bring best practices to data science pipelining.

morelandjs almost 3 years ago

What are my options for big data that won’t fit completely into memory? Is it easy to hook up to a Spark cluster?

Do I have the option to access the underlying infra through a Unix shell, when the UI isn’t enough?

edublancas almost 3 years ago

> The idea is to do for ML development what Rails and Heroku did for web development.

I think this is a great way to explain what you're doing. I'm working in the same space (ML/DS tooling) and I feel like we, as the ML/DS community, haven't cracked exactly what Rails for data looks like. I actually wrote up some ideas on this a while ago (<a href="https://ploomber.io/blog/rails4ml/" rel="nofollow">https://ploomber.io/blog/rails4ml/</a>).

Congrats on the launch and best of luck with the product!

buntha almost 3 years ago

How is it different from MLflow? Does it have any similarities to the recent MLflow pipeline feature?

brochington almost 3 years ago

Just a note to say that even though the name "Sematic" is the same, this is not the same open source project as mine that I posted to Show HN about a week ago here: <a href="https://news.ycombinator.com/item?id=32364193" rel="nofollow">https://news.ycombinator.com/item?id=32364193</a>.

pottertheotter almost 3 years ago

How is this different from or complementary to Tecton?

rubenfiszel almost 3 years ago

This is amazing. Long live open-source platforms that let developers and data scientists focus on the interesting parts of their jobs and days, and make them more productive.