
Show HN: GPU-Accelerated Inference Hosting

56 points, by theo31, nearly 4 years ago

12 comments

etaioinshrdlu, nearly 4 years ago
This is nice, and I've wanted this kind of thing repeatedly over the last 5 years! I think you often want to run little bits of CPU-based code in addition to your deep learning graph. So I think a better deployment model might be basically Lambda but with CUDA access... or something like that.

The factors that I think would make this service most valuable are low cost (think lower than GPUs on AWS or similar, even at scale), high burst capability from a cold start (1000 QPS is a good target), and of course low cold-start delays (< 1 s, or 0.5 s).

This led me down a rabbit hole in years past, and the technical solution generally seems to be the ability to swap models in and out of GPU RAM very quickly, possibly using NVIDIA's unified memory subsystem.
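A rough sketch of that swap-in/swap-out idea in PyTorch (the framework, the ResNet50 model, and the pinned-memory approach are illustrative assumptions, not something the commenter or the product specifies): keep the cold copy of the weights in page-locked host memory so pushing a model onto the GPU is a fast asynchronous copy.

```python
import torch
from torchvision import models

# Keep the "cold" copy of the weights in pinned (page-locked) host memory
# so the host-to-device copy can run as a fast asynchronous DMA transfer.
cpu_model = models.resnet50().eval()
for p in cpu_model.parameters():
    p.data = p.data.pin_memory()

def swap_in(model, device="cuda"):
    # Pull the model onto the GPU just before serving a request.
    return model.to(device, non_blocking=True)

def swap_out(model):
    # Evict the model to free GPU RAM for the next tenant.
    model.to("cpu")
    torch.cuda.empty_cache()
```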
37ef_ced3, nearly 4 years ago
Or, do your inference using an AVX-512 CPU:

https://NN-512.com (open source, free software, no dependencies)

With batch size 1, NN-512 is easily 2x faster than TensorFlow and does 27 ResNet50 inferences per second on a c5.xlarge instance. For more unusual networks, like DenseNet or ResNeXt, the performance gap is wider.

Even if you allow TensorFlow to use a larger ResNet50 batch size, NN-512 is easily 1.3x faster.

If you need a few dozen inferences per second per server, this is the cheapest way. And you're not depending on a proprietary solution whose parent company could go out of business in a year.

If you need Transformers instead of convolutions, Fabrice Bellard's LibNC is a good solution: https://bellard.org/libnc/
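For context, a minimal sketch of how one might time the TensorFlow batch-size-1 baseline that NN-512 is being compared against (the Keras ResNet50, the input shape, and the iteration count are illustrative assumptions, not taken from the comment):

```python
import time
import numpy as np
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None)  # weights are irrelevant for timing
x = np.random.rand(1, 224, 224, 3).astype("float32")

model.predict(x, verbose=0)  # warm-up call so graph tracing isn't timed
n = 100
start = time.perf_counter()
for _ in range(n):
    model.predict(x, verbose=0)
elapsed = time.perf_counter() - start
print(f"{n / elapsed:.1f} batch-1 inferences per second")
```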
ackbar03, nearly 4 years ago
So is this mainly focused on deployment for applications with high-speed inference requirements? I didn't dive into the product in detail. I run my own deep-learning-based web app, and inference speed optimization is pretty non-trivial. As far as I know, production-level speed requirements call for TensorRT, which is definitely not hot-start and takes more than a few minutes to load (I'm not too sure what's going on under the hood, not an expert), but it gives inference speedups of 2x or more. So I'm not quite sure what you're targeting, or whether you've actually managed to solve that problem, which would be highly impressive.
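The minutes-long load the commenter describes is typically the TensorRT engine build; a serialized engine cached on disk deserializes in seconds. A rough sketch of that caching pattern, assuming the engine was already built once beforehand (the file name is hypothetical and the build step is elided):

```python
import os
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
ENGINE_PATH = "resnet50_fp16.engine"  # hypothetical cache location

def load_engine():
    # Deserializing a previously built engine takes seconds,
    # unlike building one from ONNX, which can take minutes.
    with open(ENGINE_PATH, "rb") as f:
        runtime = trt.Runtime(TRT_LOGGER)
        return runtime.deserialize_cuda_engine(f.read())

engine = load_engine() if os.path.exists(ENGINE_PATH) else None
```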
johndough, nearly 4 years ago
> Guaranteed < 200ms response time

This sounds confusing to me. Surely it is possible to craft a neural network that takes longer to process?

> Max. model size: X GB

Do you really mean model size, or should this also include the size of the intermediate tensors?

The full-screen option on the YouTube video is turned off, by the way, so it is impossible to read without leaving your website.

Overall, this offer looks quite competitive. Are you planning to offer your service in the EU in the future?
rootdevelop, nearly 4 years ago
What are the specs of an Nvidia m80?

I've never heard of that type before, and I wasn't able to find anything with Google.

Furthermore, the lack of company information (address, company registration number, etc.) and the fact that it's not clear where the servers are located geographically make me a bit hesitant.
sjnair96, nearly 4 years ago
Looks awesome. Do you know if and how you support NVIDIA's software stack? For my project, the NVIDIA software I'm using states it needs:

CUDA 11.3.0

cuBLAS 11.5.1.101

cuDNN 8.2.0.41

NCCL 2.9.6

TensorRT 7.2.3.4

Triton Inference Server 2.9.0

I'm new to deploying production inference, so I'm not sure whether those are easily portable across such platforms or not.
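A rough sketch of how one might check which of those components a host actually exposes, assuming PyTorch is installed there (purely illustrative; the hosting service's real environment isn't described in the thread):

```python
import torch

# Versions the local PyTorch build was compiled against; compare these
# with the versions the NVIDIA software above requires.
print("CUDA :", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("NCCL :", torch.cuda.nccl.version())
print("GPU  :", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
```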
spullara, nearly 4 years ago
Does it need to reinitialize for each request, or is there a warm start / cold start model like Lambda? I don't really understand how you can charge per request.
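For reference, the Lambda-style warm/cold pattern the commenter is describing usually looks something like the sketch below, assuming a TorchScript model and a hypothetical handler() entry point (nothing here is from the product being discussed):

```python
import torch

_model = None  # module-level state survives warm invocations of the same worker

def handler(event):
    global _model
    if _model is None:
        # Cold start: pay the model-load and GPU-transfer cost once.
        _model = torch.jit.load("model.pt").eval().cuda()
    x = torch.tensor(event["inputs"]).cuda()
    with torch.inference_mode():
        return _model(x).cpu().tolist()
```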
nextaccountic, nearly 4 years ago
Looking at the examples on the landing page... so I don't need any kind of authentication to do inference? Anyone can run the models I upload?
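What the commenter seems to be asking for is a per-request credential. A generic sketch of what an authenticated inference call could look like (the endpoint, header scheme, and payload shape are all hypothetical, not from the product's docs):

```python
import requests

resp = requests.post(
    "https://api.example-inference-host.com/v1/predict",  # hypothetical endpoint
    headers={"Authorization": "Bearer MY_API_KEY"},        # hypothetical auth scheme
    json={"model": "my-model", "inputs": [[0.1, 0.2, 0.3]]},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```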
gigatexal, nearly 4 years ago
Looks amazing! A 3-line getting-started animation? Sold. That's all I need to see. Very good work, folks.
derekhsu, nearly 4 years ago
May I deploy multiple models in the same billing account?
manceraio, nearly 4 years ago
Could I run spleeter on it?
inshadows, nearly 4 years ago
White screen without JS