Hello Hacker News! We are Rick & Yannick from Orchest (https://www.orchest.io - https://github.com/orchest/orchest). We're building a visual pipeline tool for data scientists. The tool can be considered high-code because you write your own Python/R notebooks and scripts, while we manage the underlying infrastructure to make it 'just work™'. You can think of it as a simplified version of Kubeflow.<p>We created Orchest to free data scientists from the tedious engineering-related tasks of their job, similar to how companies like Netflix, Uber, and Booking.com support their data scientists with internal tooling and frameworks to increase productivity. When we worked as data scientists ourselves, we noticed how heavily we had to depend on our software engineering skills to perform all kinds of tasks, from configuring cloud instances for distributed training to optimizing networking and storage for processing large amounts of data. We believe data scientists should be able to focus on the data and the domain-specific challenges.<p>Today we are just at the very beginning of making better tooling available for data science, and we are launching our GitHub project to give enhanced pipelining abilities to data scientists using the PyData/R stack, with deep integration of Jupyter Notebooks.<p>Currently Orchest supports:<p>1) visually and interactively editing a pipeline that is represented using a simple JSON schema;<p>2) running remote container-based kernels through the Jupyter Enterprise Gateway integration;<p>3) scheduling experiments by launching parameterized pipelines on top of our Celery task scheduler;<p>4) configuring local and remote data sources to separate code versioning from the data passing through your pipelines.<p>We are here to learn and get feedback from the community. As youngsters we don't have all the answers and are always looking to improve.
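To make the "pipeline as a simple JSON schema" idea concrete, here is a minimal sketch of what such a representation and its execution order might look like. The field names (`steps`, `file`, `incoming`) are illustrative assumptions for this example, not Orchest's actual schema:

```python
import json

# Hypothetical pipeline description in JSON; the field names below are
# illustrative assumptions, not Orchest's actual schema.
pipeline_json = json.dumps({
    "name": "example-pipeline",
    "steps": {
        "load-data": {"file": "load_data.ipynb", "incoming": []},
        "train-model": {"file": "train_model.ipynb", "incoming": ["load-data"]},
        "evaluate": {"file": "evaluate.ipynb", "incoming": ["train-model"]},
    },
})

def execution_order(pipeline: dict) -> list:
    """Derive a run order from step dependencies (a simple topological sort)."""
    steps = pipeline["steps"]
    done, order = set(), []
    while len(order) < len(steps):
        for name, step in steps.items():
            if name not in done and all(dep in done for dep in step["incoming"]):
                done.add(name)
                order.append(name)
    return order

order = execution_order(json.loads(pipeline_json))
print(order)  # ['load-data', 'train-model', 'evaluate']
```

The appeal of a plain JSON representation is that the visual editor, the version-control diff, and a task scheduler like Celery can all consume the same artifact.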
Reminds me a bit of <a href="https://plynx.com/" rel="nofollow">https://plynx.com/</a>, which is also open source. Is there a major differentiator I'm missing? Also, what is your idea regarding the use case? Why would I need to run it locally, for example? Is it mostly about productionizing ML?
This looks cool! A couple of questions:<p>1. Currently, if I install something in the notebook, does it get re-installed every time the pipeline is run? Is there any way to "snapshot" the state of the container?<p>2. Where is the data stored between the steps?<p>3. How well-integrated is it with AWS cloud primitives such as EC2 instances, EFS, and S3?
Congratulations! I remember your earlier project, Grid Studio. Do you support scheduling periodic tasks? Do you support execution triggered by a webhook, or some way to expose a notebook as a REST API?