TechEcho

Hi HN, we’re Eduardo & Ido, the founders of Ploomber (https://ploomber.io). We’re building an open-source framework (https://github.com/ploomber/ploomber) that helps data scientists quickly deploy the code they develop in interactive environments (Jupyter/VSCode/PyCharm), eliminating the need for time-consuming manual porting to production platforms.

Jupyter and other interactive environments are the go-to tools for most data scientists. However, many production data pipeline platforms (e.g. Airflow, Kubernetes) drag them into non-interactive development paradigms. Hence, when moving to production, the data scientist’s code needs to move from the interactive environment to a more traditional software environment (e.g. declaring workflows as Python classes). This process creates friction since the code needs to cross this gap every time the data scientist deploys their work. Data scientists often pair with software engineers to work on the conversion, but this is time-consuming and costly. It’s also frustrating because it’s just busy work.

We encountered this problem while working in the data space. Eduardo was a data scientist at Fidelity for a few years. He deployed ML models and always found it annoying and wasteful to port the code from his notebooks into a production framework like Airflow or Kubernetes. Ido worked as a consultant at AWS and constantly found that data science projects would allocate about 30% of their time to converting a notebook prototype into a production pipeline.

Interactive environments have historically been used for prototyping and are considered unsuitable for production; this is reasonable because, in our experience, most of the code developed interactively exists in a single file with little to no structure (e.g., a gigantic notebook). However, we believe it’s possible to apply software engineering best practices to the interactive development world so data scientists can produce maintainable projects that streamline deployment.

Ploomber allows data scientists to quickly develop their code in modular pipelines rather than a giant single file. When developed this way, their code is suitable for deployment to production platforms; we currently support exporting to Kubernetes, AWS Batch, Airflow, Kubeflow, and SLURM with no code changes. Our integration with Jupyter/VSCode/PyCharm allows them to iteratively build these modular pipelines without moving away from the interactive environment. In addition, modularizing the work enables them to create more maintainable and testable projects. Our goal is ease of use, with minimal disturbance to the data scientist’s existing workflows.

Users can install Ploomber with pip, open Jupyter/VSCode/PyCharm, and start building in minutes. We’ve made a significant effort to create a simple tool so people can get started quickly and learn the advanced features when they need them. Ploomber is available at https://github.com/ploomber/ploomber under the Apache 2.0 license. In addition, we are working on a cloud version to help enterprises operationalize models. We’re still working on the pricing details, but if you’d like us to let you know when we open the private beta, you can sign up here: https://ploomber.io/cloud. However, the core of our offering is the open-source framework, and it will remain free.

We’re thrilled to share Ploomber with you! If you’re a data scientist who has experienced these endless cycles of porting your code for deployment, an ML engineer who helps data scientists deploy their work, or you have any feedback, please share your thoughts! We love chatting about this domain since exchanging ideas always sheds light on aspects we haven’t considered before! You may also reach out to me at eduardo@ploomber.io.
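
To make the "modular pipelines rather than a giant single file" idea concrete, here is a minimal sketch in plain Python. It is only an illustration of the concept and does not use Ploomber's API: the task functions, the `PIPELINE` dictionary, and the `run` helper are all hypothetical names invented for this example. Each task is a small function that declares which upstream tasks it depends on, and a tiny runner executes them in dependency order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def load():
    """First task: produce raw data (hard-coded here for illustration)."""
    return [3, 1, 2]


def clean(raw):
    """Second task: consumes the output of `load`."""
    return sorted(raw)


def report(cleaned):
    """Third task: consumes the output of `clean`."""
    return {"n": len(cleaned), "max": max(cleaned)}


# Hypothetical pipeline declaration: task name -> (function, upstream task names).
PIPELINE = {
    "load": (load, []),
    "clean": (clean, ["load"]),
    "report": (report, ["clean"]),
}


def run(pipeline):
    """Run each task after its upstream tasks; return all outputs keyed by task name."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {name: set(upstream) for name, (_, upstream) in pipeline.items()}
    outputs = {}
    for name in TopologicalSorter(graph).static_order():
        func, upstream = pipeline[name]
        # Pass upstream outputs positionally, in the declared order.
        outputs[name] = func(*(outputs[dep] for dep in upstream))
    return outputs


results = run(PIPELINE)
print(results["report"])
```

Because each task is a separate unit with explicit inputs, it can be tested or swapped on its own, which is the property that makes this style easier to hand to a production scheduler than one long notebook.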

8 comments

ensemblehq over 3 years ago

Congrats on the launch. I'm an MLOps consultant who helps enterprises productionize their models on cloud platforms. Previously, I was also a startup founder who iterated in the same space, so we can probably exchange notes.

The problem is definitely a time-consuming and costly one, and I'm intrigued to play around with Ploomber. How does Ploomber handle collaboration/code sharing across data scientists?

tracyhenry over 3 years ago

Hey, congrats on the launch! This is definitely a useful concept.

I haven't dug deep, but are code reviews possible? A big point of the whole data-as-code movement is to enable easier review of the data generation process, abstraction, and versioning. Being able to generate pipelines from Jupyter notebooks sounds exciting in theory, but I'd imagine code reviewing the generated pipelines can be a pain.

wizwit999 over 3 years ago

I think this is a good idea. Decoupling seems like an interesting approach. When I worked in this space as an engineer, bridging the divide between notebooks and productionization was annoying. I'd be interested to see if this solves it.

arjvik over 3 years ago

I've had a great experience using DVC for both data versioning and pipelines before. Can you tell us why Ploomber is a better solution than DVC?

cardosof over 3 years ago

Congrats on the launch! Do you guys by any chance know Deepnote? They're in YC as well, also in the tools for data scientists space. I lead a small team of DSs in a big corp and we'd happily pay for a single tool that would be Deepnote+Ploomber in terms of features (collaboration + deployment)

jiriro over 3 years ago

The audio in the landing page video is hard to understand. Is it just my broken speakers?

Also, the video cannot be made fullscreen on my phone. Is this by design?

hoerzu over 3 years ago

Really helpful for keeping notebooks tidy :)

ricklamers over 3 years ago

Congrats on the launch! It’s great to see validation of the usefulness of notebooks in data workflows even when moving beyond the proof-of-concept/exploration stage into production-type workloads and deployments. Once deployed, iteration is often still necessary or desirable, and that’s where having notebooks available for continued iteration is a big advantage.

For those who’d like to compare and contrast different solutions that support the use of notebooks in the (batch) deployment context, you can also check out Orchest (https://github.com/orchest/orchest). I’d say a meaningful point of difference between Ploomber and Orchest is that we are more container-oriented, as we’ve found that gives robust units to deploy in production with isolated and well-defined dependencies. In addition, we have a more GUI-first approach, which might be more familiar to those who come from RStudio, JupyterLab, Spyder, MATLAB, etc.

Disclaimer: I’m one of the Orchest creators.

Launch HN: Ploomber (YC W22) – Quickly deploy data pipelines from Jupyter/VSCode