Been working on this tool for my PhD, which involves training multi-task vision models that use various pre-trained models as inputs or pseudo-labels in order to improve generalization. I work mostly on UAV datasets, but it should work fine on indoor scenes or self-driving too (at least Marigold and Mask2Former).

For example, this dataset was generated with this tool: https://huggingface.co/datasets/Meehai/dronescapes

I'm quite aggressively trying to "just get the nn.Module" from the public repos that other researchers put up in their overly convoluted frameworks. A simple `forward(rgb_input: torch.Tensor) -> torch.Tensor` is nice; having 100 imports from a generic framework with version incompatibilities against everything else is not. (A small sketch of what I mean is below, after the links.)

PS: most mains are standalone runnable too, e.g.
- https://gitlab.com/meehai/video-representations-extractor/-/blob/master/vre/representations/depth/marigold/marigold.py

or

- https://gitlab.com/meehai/video-representations-extractor/-/blob/master/vre/representations/semantic_segmentation/mask2former/mask2former.py?ref_type=heads#L110
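To make the "just get the nn.Module" point concrete, here is a minimal sketch of the kind of wrapper I aim for. The names (`DepthWrapper`, the assumed `(B, 3, H, W)` input / `(B, 1, H, W)` output shapes) are illustrative, not the actual classes in the repo:

```python
import torch
from torch import nn

class DepthWrapper(nn.Module):
    """Hides a pre-trained model behind a single tensor-in / tensor-out call."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone  # the raw nn.Module pulled out of its framework

    @torch.no_grad()
    def forward(self, rgb_input: torch.Tensor) -> torch.Tensor:
        # rgb_input: (B, 3, H, W) float tensor in [0, 1] (assumed convention)
        # returns:   (B, 1, H, W) per-pixel prediction, e.g. depth
        return self.backbone(rgb_input)
```

The point is that downstream code only ever sees this one call, instead of pulling in the original framework's config system, registries, and dependency pins.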