I may need to do some speech-to-text (English at least, possibly multilingual later on) from video or audio files.
Which speech-to-text model or API would you recommend?
Which one performs best, and can it also cope with noisy recordings (noise reduction and the like)?
Whisper, 100%. It's small, fast, and does a really good job with most of the recordings I feed it. IIRC, there are both English-only and multilingual models to choose from as well.
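For what it's worth, here is a rough sketch of how that might look with the open-source `openai-whisper` Python package. It assumes the package and `ffmpeg` are installed. The helper names `pick_model` and `transcribe_file` are mine, made up for illustration, and the file name in the usage comment is a placeholder. There is no separate noise-reduction step in this sketch; the audio goes to the model as-is.

```python
def pick_model(size: str = "base", english_only: bool = True) -> str:
    """Return a Whisper checkpoint name.

    The English-only checkpoints carry a ".en" suffix and exist for the
    tiny, base, small and medium sizes; other sizes are multilingual only.
    """
    if english_only and size in {"tiny", "base", "small", "medium"}:
        return f"{size}.en"
    return size


def transcribe_file(path: str, size: str = "base", english_only: bool = True) -> str:
    """Transcribe an audio or video file and return the text (hypothetical helper)."""
    # Imported lazily so the helper above works without the package installed.
    # Assumption: `pip install openai-whisper`, with ffmpeg available on PATH.
    import whisper

    model = whisper.load_model(pick_model(size, english_only))
    result = model.transcribe(path)  # dict with "text", "segments", "language"
    return result["text"]


# Usage (placeholder file name):
#   print(transcribe_file("interview.mp4"))
#   print(transcribe_file("interview.mp4", size="small", english_only=False))
```

Pass `english_only=False` to get a multilingual checkpoint, which detects the spoken language on its own. Larger sizes are slower but tend to be more accurate, so it is worth trying a couple on your own recordings.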