Creating ad hoc microphone arrays from personal devices (2019)

186 points by tomstokes, about 5 years ago

11 comments

This is a really interesting technical concept.

Capturing high-quality audio in a meeting room for videoconferencing is a notoriously complicated problem.

Microphones are crazy sensitive and pick up things like footsteps and conversations outside the door, shuffling feet and tapping on keyboards, and construction and HVAC noise like you wouldn't believe.

So filtering those things out, and then capturing the best quality audio from the current speaker, and trying to get everyone's voice at roughly the same volume whether they're sitting directly across from the microphone or are piping up from the corner of the room...

...and do this all while cancelling 100% of the echo that might be coming from two or three speakers at once...

...it's an insanely hard problem. Beamforming microphones absolutely help in a huge way, because if you know the speaker's voice is coming from 45° then knowing that any sound coming from any other angle can be removed is a really helpful piece of info.

Now, with beamforming microphones, the precise relative location and direction of each mic is known. The idea of creating one big beamforming mic for the room out of people's individual mics is... insanely hard, but super cool.

It's interesting to me that this article is about measuring the quality of voice transcription, rather than about the quality of audio in an actual meeting. But I suppose the voice transcription quality measurement is simply a proxy for the speaker audio quality generally, no?

This could actually be a huge step forward in not needing videoconferencing equipment in meeting rooms. So far, one of the biggest reasons has actually been dealing with echo and feedback -- when people are in the same call with multiple devices in the same room, it tends to end badly. But if the audio processing is designed for that... the results could actually be quite amazing.

And it's well-known that the "bowling alley" visual of meeting participants (camera at the end of a long conference table) isn't ideal. If each participant has their own laptop camera on themselves, it could be a vastly better experience for remote participants.
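
To make the "voice is coming from 45°" point concrete, here is a minimal delay-and-sum sketch in Python. It is only an illustration of classic beamforming with known mic positions, not the method from the article. The function names, the linear array layout, the far-field assumption and the whole-sample delays are all assumptions made for the example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value at room temperature


def steering_delays(mic_positions_m, angle_deg, sample_rate):
    """Arrival delays in whole samples for a far-field source (hypothetical helper).

    mic_positions_m: mic positions along a straight line, in metres.
    angle_deg: source direction measured from the array axis.
    """
    cos_a = math.cos(math.radians(angle_deg))
    # A plane wave from angle_deg reaches a mic at position x earlier by x*cos(a)/c.
    raw = [-x * cos_a / SPEED_OF_SOUND * sample_rate for x in mic_positions_m]
    earliest = min(raw)
    return [round(d - earliest) for d in raw]


def delay_and_sum(channels, delays):
    """Advance each channel by its delay, then average across microphones."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[d + n] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]


# Toy example: the same click reaches three mics 0, 1 and 2 samples late.
channels = [
    [0, 0, 0, 1, 0, 0, 0, 0],  # farthest mic, click arrives 2 samples late
    [0, 0, 1, 0, 0, 0, 0, 0],  # middle mic, 1 sample late
    [0, 1, 0, 0, 0, 0, 0, 0],  # nearest mic
]
aligned = delay_and_sum(channels, [2, 1, 0])    # steered at the source
unaligned = delay_and_sum(channels, [0, 0, 0])  # steered elsewhere
```

Steered at the source, the click adds up coherently; steered elsewhere, it is smeared across samples and attenuated. The catch the thread is pointing at is that `steering_delays` needs the mic positions, which an ad hoc array of phones and laptops does not have.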

pjc50, about 5 years ago

My employer calls this "far field" audio, and has a number of hardware/firmware solutions: https://www.cirrus.com/products/cs48lv41f/ (we're also very secretive, so I can't really discuss it beyond the public website)

The specific improvement Microsoft are touting is blind beamforming, without knowing where the microphones are located relative to each other. Regular beamforming is already in use in some products.

itchyjunk, about 5 years ago

There are obvious(?) privacy issues and whatnot here. But ignoring all that for a second, it does sound pretty cool to be able to leverage all the little computers we walk around with.

Think of all those shitty little video clips people take at a concert. Could all those be combined to make some high quality panoramic video? Probably a lot of other cool applications that I can't even comprehend for now. What a time to be alive.

Zenst, about 5 years ago

Interesting and doable. From my experience in this area, you need a reference sound to calibrate against, though for something like this the calibration could be ongoing.

It gets down to matching a single sound and working out the timing of that sound across the multiple sources. Then you also need to factor in the frequency response.

That last part would be important for handling things like the devices picking up vibrations from the table they are sitting on. Remember that phones don't have a rubber base to isolate them from the table, so any vibration of that surface would propagate into the device and microphone. Then there's the whole aspect of varying devices and, with that, varying microphone quality and device housings. So calibrating at some level would be key for this to work. It's doable, though, and processing-wise you could even run a master device and handle the processing there, removing the server aspect, with some of the processing done on each local device and passed on to the main device for correlating. Certainly some phones have the power to handle this and replace the server. But that would be more work/effort, and something we may well see later on. It does make it harder to sell a bit of server processing software, though.

One test I'd like to see this system handle is how well it filters out those vibrations. After all, you don't want to hear somebody writing or putting a cup or other object down whilst somebody else is talking.

I'd also wonder what kind of jitter tolerances they are working with across those devices and how that scales with the number of devices: does jitter increase after so many devices?
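
The "matching a single sound and working out the timing" step can be sketched with a brute-force cross-correlation. This is a minimal illustration under assumptions, not how the article's system works: `estimate_lag` is a hypothetical helper, the two recordings are assumed to share a sample rate, and clock drift, jitter and frequency response are ignored.

```python
def estimate_lag(reference, other, max_lag):
    """Return the lag (in samples) at which `other` best matches `reference`.

    A positive lag means `other` recorded the sound later than `reference`.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Cross-correlation at this lag: sum of reference[n] * other[n + lag].
        score = 0.0
        for n, r in enumerate(reference):
            m = n + lag
            if 0 <= m < len(other):
                score += r * other[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy calibration sound heard by two devices, the second one 3 samples later
# and at half the level (a stand-in for a worse or more distant microphone).
phone_a = [0, 0, 1, -1, 0.5, 0, 0, 0, 0, 0]
phone_b = [0, 0, 0, 0, 0, 0.5, -0.5, 0.25, 0, 0]

lag_ab = estimate_lag(phone_a, phone_b, max_lag=5)
lag_ba = estimate_lag(phone_b, phone_a, max_lag=5)
```

With real recordings the correlation peak is much less clean, and the search would usually be done with an FFT-based correlation rather than this quadratic loop.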

peter_d_sherman, about 5 years ago

Excerpt:

"While the idea sounds simple, it requires overcoming many technical challenges to be effective. The audio quality of devices varies significantly. The speech signals captured by different microphones are not aligned with each other. The number of devices and their relative positions are unknown. For these reasons and others, consolidating the information streams from multiple independent devices in a coherent way is much more complicated than it may seem. In fact, although the concept of ad hoc microphone arrays dates back to the beginning of this century, to our knowledge it has not been realized as a product or public prototype so far."

Thoughts:

There's something deep here, not with respect to microphones and speech transcription (although I wish Microsoft and whoever else attempts to wrestle with those problems the greatest of success!)

There's a related deep problem in physics here.

If we consider signals that emanate from outer space, let's say they're from the big bang, or heck, let's just say they're from one of our past-the-edge-of-this-solar-system satellites -- that wants to communicate back to earth.

Well, due to the incredible distances involved, the signal will get garbled in various ways...

So here's the $64,000 question:

When that signal from deep space gets garbled, isn't it possible that it turns into various other signals, at various different other frequencies and wavelengths?

In other words, space itself, over long distances, acts as a prism (not really, but as an easy way to wrap your mind around this concept), for radio, and other electromagnetic waves...

Now, if you want to reconstruct the original message at these long distances, you must be able to reconstruct garbled radio (and other em) waves, which are moving at different frequencies, and may even arrive at the destination at different rates of speed with various time shifts...

Basically, you've got to take those pieces -- move them to the correct frequency, time correct them, speed them up or slow them down, sync them, and overlay them -- to reconstruct the original message...

That's the greater question in physics -- the ability to do all of that, with em signals from a long way off in space...

The article referenced -- is the microphone/audio/slow speed equivalent -- of that larger problem...

pabs3, about 5 years ago

This reminds me of this open source project (and its predecessor manyears and open hardware projects 8/16soundsusb).

https://github.com/introlab/odas

https://github.com/introlab/manyears

https://github.com/introlab/16SoundsUSB

Website of the team behind these:

https://introlab.3it.usherbrooke.ca/

geokon, about 5 years ago

Does anyone have any insight into why neural nets are used for the "blind" beamforming? I don't have first-hand experience with machine learning, but this just doesn't seem to me like a machine learning type of problem. I get it's not trivial, but it seems like there should be an analytic solution, more or less.

stragies, about 5 years ago

I look forward to exploring that GitHub source drop.

stuaxo, about 5 years ago

Oh, I wanted this years ago when phones had terrible microphones and audio codecs.

The idea was that at a gig loads of people would record and you could reconstruct a much better recording.

andrewfromx, about 5 years ago

wow i just added https://news.ycombinator.com/item?id=22956082 a few days ago, on point no?

kohtatsu, about 5 years ago

Would be cool if Microsoft gave more shits about privacy.

Edit: This would be cool if I trusted Microsoft to properly handle privacy.
