author here:

hey HN, we ran an ML consultancy for a year, helping companies build & host models in prod. We learned how tedious & expensive it was to host ML: customer models had to run on a fleet of always-on GPUs that often sat at under 10% utilization, which felt like a big money sink.

Over time we built infrastructure to improve GPU utilization. Six months ago we pivoted to focus solely on productizing this infra into a hosting platform for ML teams, one that removes the pain of deployment and reduces the cost of hosting models.

We deploy on A100 GPUs, and you pay per second of inference. If you aren’t running inferences, you pay nothing. A couple of points to clarify: yes, the models are actually cold-booted; we aren’t just running them in the background. We boot models faster because of how we manage OS memory. And yes, there is still cold-boot time. It’s not instant, but it’s significantly faster (e.g., 15 seconds instead of 10 minutes for some transformers like GPT-J).

Lastly, model quality is not lost on Banana, because we aren’t doing traditional weight quantization or network pruning, which make networks smaller/faster but sacrifice quality. You can think of Banana more as a compiler + hosting platform: we break down your code to run faster on GPUs.

Try it out and let us know what you think!
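To give a sense of the developer experience, here’s a rough sketch of what calling a deployed model can look like from Python. The package name banana_dev, the run() signature, and the key/input parameters are illustrative assumptions, not a verbatim copy of the SDK docs:

    # Illustrative sketch only: names and signatures are assumptions,
    # not the exact Banana SDK.
    import banana_dev as banana

    api_key = "YOUR_API_KEY"      # account credential
    model_key = "YOUR_MODEL_KEY"  # identifies your deployed model

    # On a call, the platform cold-boots the model if it isn't warm,
    # runs the inference on an A100, and bills per second of compute.
    model_inputs = {"prompt": "Hello, I am a language model"}

    output = banana.run(api_key, model_key, model_inputs)
    print(output)

The point of the serverless shape is that nothing in this snippet changes whether the model was warm or cold-booted; you only notice the difference in latency on the first call.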
This is really cool, but I can't wait for a classic HN comment like:

-HN midwit: "Who names a company after a fruit?"
-Erik and Kyle: "Well..."