Object Detection from 9 FPS to 650 FPS

165 points by briggers, over 4 years ago

11 comments

lostdog, over 4 years ago
This is such a great post. It really shows how much room for improvement there is in all released deep learning code. Almost none of the open source work is really production-ready for fast inference, and tuning the systems requires a good working knowledge of the GPU.

The article does skip the most important step for getting great inference speeds: drop Python and move fully into C++.
t-vi, over 4 years ago
> The solution to Python’s GIL bottleneck is not some trick, it is to stop using Python for data-path code.

At least for the PyTorch bits of it, using the PyTorch JIT works well. When you run PyTorch code through Python, the intermediate results are created as Python objects (with the GIL and all), while when you run it in TorchScript, the intermediates only exist as C++ PyTorch Tensors, all without the GIL. We have a small comment about this in our PyTorch book, in the section on what improvements to expect from the PyTorch JIT, and it seems rather relevant in practice.
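To make the TorchScript point concrete, here is a minimal sketch (the SmallHead module and the shapes are made up for illustration): once the module is scripted, the intermediates it produces live entirely in the C++ runtime instead of being GIL-bound Python objects.

    import torch
    import torch.nn as nn

    class SmallHead(nn.Module):
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # In eager mode each of these lines materialises an intermediate
            # as a Python tensor object; after scripting, the intermediates
            # stay inside the TorchScript interpreter, with no GIL involved.
            y = torch.relu(x)
            y = y * 2.0
            return y.sum(dim=1)

    model = SmallHead().eval()
    scripted = torch.jit.script(model)      # compile to TorchScript
    scripted = torch.jit.freeze(scripted)   # optional: inline weights/constants

    with torch.no_grad():
        out = scripted(torch.randn(8, 256))
        print(out.shape)                    # torch.Size([8])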
nraynaud, over 4 years ago
How do you keep track of the shutter clock in this kind of system? For example, the camera clocks at 60 fps but the image processing is a few frames late, the gyroscope clocks at 4 kHz, the accelerometer is way slower, the lidar is a slug, etc. Then you have to feed all of that into your Kalman filter to estimate the state, and the central question is: “when did you collect this data?” I guess “no clue, it came from USB then disappeared into a GPU pipeline” is not a scientifically sound answer; you want to know whether it arrived before or after sample no. 3864 of the gyroscope.

Long story short: that’s good, you’ve used a neural net to avoid using a human or an animal as a pose estimation datum, but how do you correlate that with the rest of the sensor suite?
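One common pattern (sketched below with made-up names and a toy 4 kHz gyro clock) is to stamp each frame as close to capture as the driver allows and carry that timestamp alongside the pixels, so sensor fusion is anchored to exposure time rather than to when the GPU pipeline finished.

    from dataclasses import dataclass

    import numpy as np

    @dataclass
    class StampedFrame:
        # Timestamp taken as close to the shutter as the capture API allows;
        # it travels with the pixels through the whole (possibly slow) pipeline.
        t_capture: float
        image: np.ndarray

    def nearest_gyro_sample(frame: StampedFrame, gyro_t: np.ndarray) -> int:
        # Index of the gyro sample closest to the capture time, so the filter
        # update is anchored to when the light hit the sensor, not to when
        # inference ended a few frames later.
        return int(np.argmin(np.abs(gyro_t - frame.t_capture)))

    # Toy usage: a 4 kHz gyro clock and one frame stamped at capture time.
    gyro_t = np.arange(0.0, 1.0, 1.0 / 4000.0)
    frame = StampedFrame(t_capture=0.51, image=np.zeros((480, 640), np.uint8))
    print(nearest_gyro_sample(frame, gyro_t))   # sample 2040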
NikolaeVarius, over 4 years ago
I've been trying to coax better performance out of a Jetson Nano camera, currently using Python's OpenCV lib with some threading, and can only manage about 29 fps at best.

I would love an alternative that is reasonably simple to implement. I dislike having to handle raw bits.
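One reasonably simple option on the Jetson Nano, assuming OpenCV was built with GStreamer support and a CSI camera is attached, is to hand capture and colorspace conversion to a GStreamer pipeline so that only the final BGR frames ever reach Python (the resolution and framerate below are illustrative):

    import cv2

    # Hardware-assisted capture for the Jetson CSI camera: nvarguscamerasrc
    # grabs frames through the ISP, nvvidconv converts them out of NVMM
    # memory, and appsink hands only the final BGR frame to Python/OpenCV.
    pipeline = (
        "nvarguscamerasrc ! "
        "video/x-raw(memory:NVMM), width=1280, height=720, framerate=60/1 ! "
        "nvvidconv ! video/x-raw, format=BGRx ! "
        "videoconvert ! video/x-raw, format=BGR ! "
        "appsink drop=true max-buffers=1"
    )

    cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # ... run detection on `frame` here ...
    cap.release()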
vj44, over 4 years ago
Good job digging into all of this, Paul! At my company (onspecta.com) we solve similar problems (and more!) to accelerate AI / deep learning / computer vision workloads, across CPUs, GPUs, and other types of chips.

This is a fascinating space, and there are tons of speed-up opportunities. Depending on the type of workload you're running, you might be able to ditch the GPU entirely and run everything on the CPU, greatly reducing cost and deployment complexity. Or, at the very least, improve SLAs and cut the GPU (or CPU) cost by 10x.

I've seen this over and over again. Glad someone's documenting this publicly :-) If any of you readers have more questions about this, I'm happy to discuss in the comments here. Or you can reach out to me at victor at onspecta dot com.
spockz, over 4 years ago
I think this is a great explanation. Is this kind of manual optimisation still needed when using the higher-level frameworks? Or at least, those should make it clear in the types when a pipeline moves from CPU to GPU and vice versa.
threatripper, over 4 years ago
How would one accelerate object tracking on a video stream where each frame depends on the result of the previous one? Batching and multi-threading don't work here.

Are there CNN libraries with far less overhead for small batch sizes? TensorFlow (GPU-accelerated) seems to drop from 10,000 fps on large batches to 200 fps on single frames for a small CNN.
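For the batch-size-1 case, the usual levers are cutting the Python work done per frame and keeping all buffers preallocated. Below is a minimal sketch of that pattern in PyTorch (the toy model, shapes, and loop are made up, and it assumes a CUDA GPU is available):

    import torch

    # Toy stand-in for the per-frame tracker network.
    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 16, 3, padding=1),
        torch.nn.ReLU(),
        torch.nn.AdaptiveAvgPool2d(1),
        torch.nn.Flatten(),
        torch.nn.Linear(16, 4),
    ).cuda().eval()
    model = torch.jit.script(model)         # keep intermediates out of Python

    torch.backends.cudnn.benchmark = True   # pick fast kernels for the fixed shape

    # Preallocated pinned host buffer and device buffer for batch size 1.
    host = torch.empty(1, 3, 128, 128, pin_memory=True)
    dev = torch.empty_like(host, device="cuda")

    with torch.no_grad():
        for _ in range(1000):               # stand-in for the video loop
            # copy the latest decoded frame into `host` here, then:
            dev.copy_(host, non_blocking=True)
            box = model(dev)                # output feeds the next frame's crop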
O5vYtytb, over 4 years ago
> The solution to Python’s GIL bottleneck is not some trick, it is to stop using Python for data-path code.

What about using PyTorch multiprocessing[1]?

[1] https://pytorch.org/docs/stable/notes/multiprocessing.html
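Multiprocessing does sidestep the GIL, since each worker gets its own interpreter. The sketch below shows the basic torch.multiprocessing pattern (the worker function and queue sizes are made up for illustration):

    import numpy as np
    import torch
    import torch.multiprocessing as mp

    def preprocess(frame_q, tensor_q):
        # Runs in its own process and interpreter, so its GIL never contends
        # with the main loop; torch tensors placed on the queue are passed
        # through shared memory rather than copied.
        while True:
            frame = frame_q.get()
            if frame is None:
                break
            t = torch.from_numpy(frame).permute(2, 0, 1).float() / 255.0
            tensor_q.put(t)

    if __name__ == "__main__":
        mp.set_start_method("spawn", force=True)
        frame_q, tensor_q = mp.Queue(maxsize=4), mp.Queue(maxsize=4)
        worker = mp.Process(target=preprocess, args=(frame_q, tensor_q), daemon=True)
        worker.start()

        frame_q.put(np.zeros((480, 640, 3), dtype=np.uint8))  # stand-in camera frame
        print(tensor_q.get().shape)                           # torch.Size([3, 480, 640])
        frame_q.put(None)
        worker.join()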
andrewbridger, over 4 years ago
Has anyone looked at Julia? Its claim is C-like performance with the ease of use of a language like Python.
mleonhard, over 4 years ago
Has any company tried putting the GPU and CPU in the same chip, sharing the same data caches? That could greatly increase the performance of the CPU-GPU data transfers.
egberts1, over 4 years ago
Try this one.

https://github.com/streamlit/demo-self-driving

It uses Streamlit:

https://github.com/streamlit/streamlit