I dislike that PyTorch advertises TPU support. PyTorch doesn't support TPUs. PyTorch supports a gimped version of TPUs that has no access to the TPU CPU, a massive 300GB memory store that handles infeed. No infeed means you have to feed the TPUs manually, on demand, like a GPU. And TPUs are not GPUs. When you try to do that, you're talking *at least* a 40x slowdown, no exaggeration. The TPU CPU is the heart of the TPU's power and advantage over GPUs, and neither PyTorch nor JAX supports it at all yet. No MLPerf benchmark will ever use PyTorch in its current form on TPUs.

Luckily, that form is changing. There are interesting plans. But they are still just plans.

It's better to go the other direction, I think. I ported PyTorch to TensorFlow: https://twitter.com/theshawwn/status/1311925180126511104?s=21

PyTorch is mostly just an API. And that API is mostly Python. When people say they "like PyTorch", they're expressing a preference for how to organize ML code, not for the set of operations available to you when you use PyTorch.
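To make the infeed point concrete: on the TensorFlow side, the input pipeline runs on the TPU host CPU, which prefetches and streams batches into the cores while they are still computing, instead of the client pushing batches over one at a time. A rough sketch, purely illustrative (the TPU address and the toy dataset are placeholders, and exact API names have shifted across TF 2.x releases):

```python
import numpy as np
import tensorflow as tf

# Placeholder TPU address; on a real Cloud TPU you'd resolve it by name.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="grpc://10.0.0.2:8470")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Toy data just to have something to feed.
features = np.random.rand(1024, 32).astype("float32")
labels = np.random.randint(0, 10, size=(1024,))

# The tf.data pipeline is executed by the TPU host, which handles infeed:
# shuffling, batching, and prefetching happen asynchronously ahead of the cores.
dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(1024)
    .batch(128, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

model.fit(dataset, epochs=1)
```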
I am extremely pessimistic about ML ops startups like this. At the end of the day, cloud service providers have too much of an incentive to provide these tools for free as a cloud value add.

The other thing is that stitching together other open source tools like this is simply not enough value. Who will be incentivised to buy?

Saying this as a FAANG ML org person who sees the push to open source ops tooling like this.
Congratulations to the Grid team on the fundraise and the announcement! Exciting stuff.

It seems like there is an emerging consensus that (a) DL development requires access to massive compute, but (b) if you're only using off-the-shelf PyTorch or TensorFlow, moving your model from your personal development environment to a cluster or cloud setting is too difficult: it is easy to spend most of your time managing infrastructure rather than developing models. At Determined AI, we've spent the last few years building an open source DL training platform that tries to make that process a lot simpler (https://github.com/determined-ai/determined), but I think it's fair to say that this is still very much an open space and an important problem. Curious to take a look at Grid AI and see how it compares to other tools in the space -- some other alternatives include Kubeflow, Polyaxon, and Spell AI.
So *this* is the endgame of pytorch-lightning, which was always a mystery to me. (If you haven't used it, it's strongly recommended if you use PyTorch: https://github.com/PyTorchLightning/pytorch-lightning)

IMO, open source is at its best when it's supported by a SaaS, as it provides a strong incentive to keep the project up-to-date, and the devs of PL have been very proactive.
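For anyone who hasn't tried it, the core idea is that your model code stays plain PyTorch while Lightning owns the training loop, device placement, and checkpointing. A minimal sketch of the pattern (the model and names here are made up for illustration, not taken from Lightning's or Grid's docs):

```python
import torch
from torch import nn
import torch.nn.functional as F
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    """Toy classifier: ordinary PyTorch layers, Lightning hooks around them."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.layer(x.view(x.size(0), -1))

    def training_step(self, batch, batch_idx):
        # Lightning calls this once per batch; you just return the loss.
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# The Trainer owns the loop, so hardware and features (mixed precision,
# multi-GPU, etc.) become Trainer arguments rather than a rewrite:
# trainer = pl.Trainer(max_epochs=1)
# trainer.fit(LitClassifier(), train_dataloader)  # train_dataloader: any DataLoader
```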
How do you handle the security of training data? If the data is super sensitive, how do you deal with it?

I know the same could be said about Azure and AWS, but the big-name cloud providers stake their prestige on having tight security, while a startup has much less to lose.
The name is unfortunately close to “The Grid”, an AI website builder that had a lot of buzz, then scammed a lot of people out of money, then disappeared: https://medium.com/@seibelj/the-grid-over-promise-under-deliver-and-the-lies-told-by-ai-startups-40aa98415d8e
More on this here: https://techcrunch.com/2020/10/08/grid-ai-raises-18-6m-series-a-to-help-ai-researchers-and-engineers-bring-their-models-to-production/

What do you think, folks?
Seems like PyTorch Lightning is the only first-class citizen in your offering. Is that true? Or are there value-added features for TensorFlow and for non-DL libraries such as scikit-learn?

Also, is there support for distributed training on large datasets that don't fit into a single instance's memory, or just distributed grid search / hyperparameter optimization?
I used PyTorch Lightning back in May when I was working on pretraining GPT-2 on TPUs (https://bkkaggle.github.io/blog/nlp-research-part-2/). It was really impressive how stable it was, especially given that a lot of features were still being added at a very fast pace.

Also, this was probably the first (and maybe still is?) high-level PyTorch library that let you train on TPUs without a lot of refactoring and bugs, which was a really nice thing to be able to do given how unstable the pytorch-xla API still was at that point. <3
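The "without a lot of refactoring" part is mostly that the TPU switch lives in the Trainer rather than in the model code. Roughly what that looked like; the exact flag has changed across Lightning versions (around mid-2020 it was tpu_cores, newer releases use accelerator/devices), so treat this as illustrative:

```python
import pytorch_lightning as pl

# Any LightningModule works unchanged; LitClassifier is a placeholder
# (e.g. the toy module sketched in an earlier comment).
model = LitClassifier()

# GPU training:
# trainer = pl.Trainer(gpus=1, max_epochs=1)

# TPU training on a v3-8: the model code stays the same, only the
# Trainer arguments change (flag name varies by Lightning version).
trainer = pl.Trainer(tpu_cores=8, max_epochs=1)
trainer.fit(model)  # assumes the module defines a train_dataloader() hook
```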