Think of large crawls, for example Common Crawl, or companies that run in-house web scrapers/crawlers to collect data from the web for training AI.
1. How are blog spam and other regular spam cleaned out of such huge datasets?
2. Is it possible to train a model on a small human-annotated dataset and then use that model to clean large datasets for AI training? (See the sketch after this list for what I mean.)
3. How is misinfo/disinfo cleaned?
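
To make question 2 concrete, here is a minimal sketch of the kind of thing I have in mind: train a small quality classifier on a few thousand human-labelled documents, then use it to filter a much larger crawl. This is just an illustration assuming a TF-IDF + logistic regression pipeline; the file names (`labelled.tsv`, `crawl.txt`) and the 0.5 keep-threshold are made up.

```python
# Sketch: small human-annotated set -> classifier -> filter a big crawl.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# 1. Load the small human-annotated set: one document per line, "label<TAB>text".
texts, labels = [], []
with open("labelled.tsv", encoding="utf-8") as f:
    for line in f:
        label, text = line.rstrip("\n").split("\t", 1)
        labels.append(int(label))  # 1 = keep, 0 = spam/boilerplate
        texts.append(text)

# 2. Fit a simple bag-of-words classifier on the small labelled set.
clf = make_pipeline(
    TfidfVectorizer(max_features=100_000, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)

# 3. Score the large crawl and keep only documents above a chosen threshold.
with open("crawl.txt", encoding="utf-8") as src, \
     open("crawl.filtered.txt", "w", encoding="utf-8") as dst:
    for doc in src:
        p_keep = clf.predict_proba([doc])[0][1]
        if p_keep >= 0.5:  # threshold is a tunable assumption
            dst.write(doc)
```

Is this roughly how it is done at scale, or do the big labs use something entirely different?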