Metaflow, Netflix's Python framework for data science, is now open source

517 points, posted by vtuulos over 5 years ago

22 comments

amirathi, over 5 years ago
After going through a lot of marketing fluff, I landed on this useful page which explains Metaflow basics: https://docs.metaflow.org/metaflow/basics

Here's my understanding:

- It's a Python library for creating & executing DAGs
- Each node is a processing step & the results are stored after each step, so you can restart a failed workflow from where it failed
- Tight integration with AWS ECS to run the whole DAG in the cloud

I don't know why their .org site oddly feels like a paid SaaS tool. Anyway, thank you Netflix for open sourcing Metaflow.
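
Editorial illustration: the first two points in amirathi's summary (a DAG of steps, with each step's result stored so a failed run can resume) can be sketched in plain Python. This is not Metaflow's API. The names `run_dag`, `steps`, `order`, and `flaky_total`, and the JSON-file-per-step store, are all hypothetical and chosen only to show the idea.

```python
import json
import tempfile
from pathlib import Path


def run_dag(steps, order, store_dir):
    """Run steps in topological order, persisting each result as JSON.

    A step whose result file already exists is loaded instead of re-run,
    which is what lets a failed workflow resume where it stopped.
    (Hypothetical helper for illustration, not part of Metaflow.)
    """
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    results = {}
    executed = []  # names of steps actually computed in this call
    for name in order:
        path = store / f"{name}.json"
        if path.exists():
            results[name] = json.loads(path.read_text())
            continue
        func, deps = steps[name]
        results[name] = func(*(results[d] for d in deps))
        path.write_text(json.dumps(results[name]))
        executed.append(name)
    return results, executed


attempts = {"count": 0}


def flaky_total(values):
    """Fail on the first call to simulate a crashed step, then succeed."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("simulated failure")
    return sum(values)


# Each entry maps a step name to (function, names of upstream steps).
steps = {
    "start": (lambda: [1, 2, 3], []),
    "double": (lambda xs: [2 * x for x in xs], ["start"]),
    "total": (flaky_total, ["double"]),
}
order = ["start", "double", "total"]

store_dir = tempfile.mkdtemp()
try:
    run_dag(steps, order, store_dir)  # "total" fails; earlier results are kept
except RuntimeError:
    pass

# The second run reloads "start" and "double" and only recomputes "total".
results, executed = run_dag(steps, order, store_dir)
print(results["total"], executed)
```

The sketch runs steps one at a time on a single machine. Branching, parallel steps, and cloud execution, which the thread also mentions, are outside its scope.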
Thorentis, over 5 years ago
How is this different from / better than existing tools or workflows? I don't like to criticise new frameworks / tools without first understanding them, but I like to know what some key differences are without the marketing/PR fluff before giving one a go.

For instance, this tutorial example here (https://github.com/Netflix/metaflow/blob/master/metaflow/tutorials/01-playlist/playlist.py) does not look substantially different from what I could achieve just as easily in R, or other Python data wrangling frameworks.

Is the main feature the fact that I can quickly put my workflows into the cloud?
vtuulos, over 5 years ago
Hey, I'm one of the authors of Metaflow. Happy to answer any questions! Netflix has been using Metaflow internally for about two years, so we have many war stories :)
aniketpanjwani, over 5 years ago
This looks exciting! I'll play around with the tutorial and try to set up the AWS environment this weekend. I have several questions.

1. At what sort of scale does Metaflow become useful? Would you expect Metaflow to augment the productivity of a lone data scientist working by himself? Or is it more likely that you would need 3, 10, 25, or more data scientists before Metaflow is likely to become useful?

2. When you move to a new text editor, there are some initial frictions while you're trying to wrap your head around how things work. So, it can take some time before you become productive. Analogously, I imagine there are initial frictions when moving to Metaflow. In your experience, after Metaflow's environment has already been established, how long does it take for data scientists to get back to their initial productivity? It would be useful to have a sense of this for the data scientist who would want to sell their organization on adopting Metaflow.

3. Many data scientists work in organizations which have far less mature data infrastructure than Netflix, and/or data science needs of a much smaller scale than Netflix. In particular, I may not even have batch processing needs (e.g. a social scientist working on datasets which can be held entirely in memory). In that case, is Metaflow useful?

4. What's the closest open-source alternative to Metaflow on the market? Off the top of my head, I can't think of anything which quite matches.
Datenstrom, over 5 years ago
Is there a reason to use this over DVC [1], which is language and framework agnostic and supports a large number of storage backends? It works with any git repo, even polyglot implementations, and can run the DAG on any system.

We currently use DVC, MLflow just for metadata visualization and notes on experiments, and Anaconda for (Python) dependency management. We are an embedded shop so we don't deploy to the "cloud."

[1]: https://dvc.org/
edparcell, over 5 years ago
My team has a similar library called Loman, which we open-sourced. Instead of nodes representing tasks, they represent data, and the library keeps track of which nodes are up-to-date or stale as you provide new inputs or change how nodes are computed. Each node is either an input node with a provided value, or a computed node with a function to calculate its value. Think of it as a grown-up Excel calculation tree. We've found it quite useful for quant research, and in production it works nicely because you can serialize the entire computation graph, which gives an easy way to diagnose what failed and why in hundreds of interdependent computations. It's also useful for real-time displays, where you can bind market and UI inputs to nodes and calculated nodes back to the UI: some things you want to recalculate frequently, whereas some are slow and need to happen infrequently in the background.

[1] GitHub: https://github.com/janushendersonassetallocation/loman

[2] Docs: https://loman.readthedocs.io/en/latest/

[3] Examples: https://github.com/janushendersonassetallocation/loman/tree/master/examples
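
Editorial illustration: the "nodes are data, tracked as up-to-date or stale" model described above can be sketched in a few lines of Python. This is not Loman's API. The class `Graph` and its methods `add_input`, `add_computed`, and `get` are hypothetical names used only to show the idea of marking downstream nodes stale and recomputing them lazily.

```python
class Graph:
    """Minimal data-centric computation graph (illustrative, not Loman)."""

    def __init__(self):
        self.values = {}    # node name -> current value
        self.funcs = {}     # computed node name -> (function, dependency names)
        self.stale = set()  # computed nodes whose value is missing or outdated

    def add_input(self, name, value):
        """Set an input node and mark everything downstream as stale."""
        self.values[name] = value
        self._mark_dependents_stale(name)

    def add_computed(self, name, func, deps):
        """Define (or redefine) how a node is calculated from other nodes."""
        self.funcs[name] = (func, deps)
        self.stale.add(name)
        self._mark_dependents_stale(name)

    def _mark_dependents_stale(self, name):
        for other, (_, deps) in self.funcs.items():
            if name in deps and other not in self.stale:
                self.stale.add(other)
                self._mark_dependents_stale(other)

    def get(self, name):
        """Return a node's value, recomputing it first only if it is stale."""
        if name in self.stale:
            func, deps = self.funcs[name]
            self.values[name] = func(*(self.get(d) for d in deps))
            self.stale.discard(name)
        return self.values[name]


g = Graph()
g.add_input("price", 10)
g.add_input("quantity", 3)
g.add_computed("notional", lambda p, q: p * q, ["price", "quantity"])
g.add_computed("total", lambda n: n + 5, ["notional"])

first_total = g.get("total")          # computes notional, then total

g.add_input("price", 20)              # new input invalidates downstream nodes
stale_after_update = sorted(g.stale)

second_total = g.get("total")         # recomputes only what was stale
print(first_total, stale_after_update, second_total)
```

Serializing the graph and binding nodes to a UI, both mentioned in the comment, are not covered by this sketch.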
russfink, over 5 years ago
I am disappointed that when I click on documentation, "why metaflow," I get a bunch of cartoony BS instead of a simple text explanation. Glad these folks don't write RFCs.

Edit: just went to the Amazon CodeGuru homepage. Fantastic! Wish they were all like that.
purple-again, over 5 years ago
We are on Azure using Spark via Databricks. We had to abandon scikit-learn because of this choice. Does your service require AWS, and can it be used in conjunction with Spark? Thank you for your time and consideration.
vtuulos, over 5 years ago
Btw, if you happen to be at AWS re:Invent right now, you can get a stylish, collector's edition Metaflow t-shirt if you drop by the Netflix booth at the expo hall and/or ping us otherwise!
cpintomammee, over 5 years ago
How does this compare to snakemake [1] and nextflow [2]?

[1] https://snakemake.readthedocs.io/en/stable/

[2] https://www.nextflow.io/
softwarelimits, over 5 years ago
Can anybody provide a good comparison, e.g. with Meltano?

I am not affiliated with the Meltano people, but I like the idea of keeping the system modular, which seems to make it easier to replace components.

I have no doubt that we will see better replacements for every component of a data pipeline in the coming years. If there is only one thing to do right, then it's to not bet on one tool but keep the whole stack flexible.

I am still missing well-established standards for data formats, workflow definitions and project descriptions. Hopefully open source ninjas will deliver on this front before proprietary pirates destroy the field with progress-inhibiting closed things. It seems to be too late to create an "Autocad" or "Word" file format for data science, but I see no clear winner atm. Hopefully my sight is bad; please enlighten me!
dj18, over 5 years ago
Seems like a cool addition to the DAG ML tooling family. Thanks for sharing! Do you support, or plan to support, features commonly found in data science platform tools like Domino (https://www.dominodatalab.com/)? I'm thinking of container management, automatic publishing of web apps and API endpoints, providing a search for artifacts like code or projects, etc.
tristanz, over 5 years ago
This looks like a fantastically clean API for Python data and ML pipelines. Congratulations!

It would be great to have a scheduler and monitoring UI that are equally lightweight.
MostlyAmiable, over 5 years ago
The link in the docs to the CloudFormation template source is broken: https://docs.metaflow.org/metaflow-on-aws/deploy-to-aws#cloudformation-template

Instead of /Netflix/metaflow-tools/aws it should probably be /Netflix/metaflow-tools/tree/master/aws
posedge, over 5 years ago
Very interesting project. I love that this allows you to transparently switch "runtime" from local to cloud, like Spark does, but integrated with common Python tools like sklearn/tf etc. Looking forward to testing Metaflow out myself.
somurzakov, over 5 years ago
I looked over the tutorials and am curious to know whether they are representative of how Netflix does ML.

Is data really being read in .csv format and processed in memory with pandas?

I see "petabytes of data" being thrown around everywhere, and I am just trying to understand how one can read gigabytes of .csv and do simple stats like group-bys in pandas. Shouldn't a simple SQL DWH do the same thing more efficiently with partitioned tables, clustered indexes and the power of the SQL language?

I would love to take a look at one representative ML pipeline (even with masked names of datasets and features) just to see how "terabytes" of data get processed into a model.
bitfhacker, over 5 years ago
It's so simple and intuitive to run two steps in parallel. Thank you, Netflix!
manojlds, over 5 years ago
How does it compare to dagster.io?

https://github.com/dagster-io/dagster
elwell, over 5 years ago
At first glance, I see BASIC's GOTO statement.
firedup, over 5 years ago
How does this compare to Kedro?
sriharshams, over 5 years ago
Awesome!!!
ZenPsycho, over 5 years ago
This seems very similar to http://metaflow.fr. Is there any relation, or is this a name collision?