Ask HN: How is spam removed from large web-crawl datasets used for AI training?

1 point by arromatic about 1 year ago
This is about large crawls, for example Common Crawl, or the in-house web scrapers/crawlers that companies run to collect data from the web to train AI.
1. How are blogspam and regular spam cleaned from such huge datasets?
2. Is it possible to teach an AI with a small human-annotated dataset to clean large datasets for AI training?
3. How is misinfo/disinfo cleaned?
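
To make question 2 concrete, here is a minimal sketch of the kind of thing being asked about: a tiny naive Bayes text classifier trained on a handful of hand-labeled pages and then used to filter a larger list. Everything in it is an assumption for illustration only: the labeled examples, the function names (`train`, `spam_probability`, `filter_crawl`) and the 0.5 threshold are made up, and nothing here is claimed to be what any real crawl-cleaning pipeline does.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train(labeled_docs):
    """Fit a multinomial naive Bayes model from (text, label) pairs."""
    word_counts = {}        # label -> Counter of token frequencies
    doc_counts = Counter()  # label -> number of training documents
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}

def spam_probability(model, text):
    """Return P(spam | text) using add-one (Laplace) smoothing."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    log_scores = {}
    for label, counts in model["word_counts"].items():
        total_tokens = sum(counts.values())
        score = math.log(model["doc_counts"][label] / total_docs)
        for tok in tokenize(text):
            if tok in model["vocab"]:  # tokens never seen in training are ignored
                score += math.log((counts[tok] + 1) / (total_tokens + vocab_size))
        log_scores[label] = score
    # Convert log scores to a normalized probability in a numerically safe way.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    return weights.get("spam", 0.0) / sum(weights.values())

def filter_crawl(model, pages, threshold=0.5):
    """Keep only pages whose estimated spam probability is below the threshold."""
    return [p for p in pages if spam_probability(model, p) < threshold]

# Hypothetical hand-labeled seed set ("ham" means not spam).
LABELED = [
    ("buy cheap pills online now best price", "spam"),
    ("click here to win free money now", "spam"),
    ("best casino bonus click here free spins", "spam"),
    ("how to configure a web crawler in python", "ham"),
    ("notes on training language models with clean data", "ham"),
    ("a tutorial on parsing html documents", "ham"),
]

model = train(LABELED)
crawl = [
    "click here for free pills",
    "tutorial on training a python crawler",
]
kept = filter_crawl(model, crawl)
```

With this toy seed set, `kept` ends up holding only the second page. Whether something along these lines holds up at web scale, with a realistic amount of annotation, is exactly what the question is asking.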

no comments