Think of large crawls, for example Common Crawl, or companies that run in-house web scrapers/crawlers to collect data from the web for training AI.
1. How are blog spam and other regular spam cleaned out of such huge datasets?
2. Is it possible to train a model on a small human-annotated dataset and then use that model to clean large datasets for AI training? (See the sketch after this list for what I mean.)
3. How is misinfo/disinfo cleaned?
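
To make question 2 concrete, here is a minimal sketch of the kind of thing I have in mind: train a small quality classifier on a few thousand human-labelled documents, then use it to filter a much larger crawl. This is just an illustration assuming a TF-IDF + logistic regression pipeline; the file names (`labelled.tsv`, `crawl.txt`) and the 0.5 keep-threshold are made up.

```python
# Sketch: small human-annotated set -> classifier -> filter a big crawl.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# 1. Load the small human-annotated set: one document per line, "label<TAB>text".
texts, labels = [], []
with open("labelled.tsv", encoding="utf-8") as f:
    for line in f:
        label, text = line.rstrip("\n").split("\t", 1)
        labels.append(int(label))  # 1 = keep, 0 = spam/boilerplate
        texts.append(text)

# 2. Fit a simple bag-of-words classifier on the small labelled set.
clf = make_pipeline(
    TfidfVectorizer(max_features=100_000, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)

# 3. Score the large crawl and keep only documents above a chosen threshold.
with open("crawl.txt", encoding="utf-8") as src, \
     open("crawl.filtered.txt", "w", encoding="utf-8") as dst:
    for doc in src:
        p_keep = clf.predict_proba([doc])[0][1]
        if p_keep >= 0.5:  # threshold is a tunable assumption
            dst.write(doc)
```

Is this roughly how it is done at scale, or do the big labs use something entirely different?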