telescreens, or, well, audio telescreens.<p>Whisper, the new speech recognition model from OpenAI, is very, very good with English, and really decent with other languages. With Polish, for example, in a normal conversation, it gets almost everything except some very specialized words.<p>For now, such models require somewhat significant resources, and running them for every conversation of every citizen, 24/7, isn't really feasible, but Moore's law in AI is going strong, and that will change in a few years. Even if you want to run the largest available model right now, an RTX 3090 can do it faster than real time.<p>I'm pretty sure that, with the amount of data the Chinese government has access to, they can build something even better than Whisper, and there are enough labor camps in China in case they can't.<p>I'm imagining something like a watch, or maybe AR glasses, which every citizen has to wear and which transmits everything it hears in occasional bursts, encoded with something like Opus to save on bandwidth.<p>Once you get every conversation of every citizen transcribed and meticulously cataloged, you can do a lot of pretty interesting things. You can pass it through transformer models for sentiment analysis (to flag anti-party opinions), you can search for specific subjects (and with AI, those searches can be much more advanced than a simple keyword search), etc. Because of how smart these AI systems actually are, you can't trivially get around them by rephrasing what you say. If you start replacing "I want the men in power to die" with "I want the humans in electricity to paint", it will eventually pick up on that and flag you anyway.<p>If your watches also identify who's around and use voice recognition to figure out who's talking to whom, you basically have the whole social graph mapped out. One criminal slips up and gets flagged, and you can start going from there. 
Decrease the alarm threshold for anybody in close contact with that person, take additional signals into account, like locations where trouble often starts, and that gives you a lot of possibilities.