I have this problem right now in postgres. I have 600000 feed_items with the schema (feed_item_id uuid, author varchar, content text, guid varchar, link varchar, title varchar, summary text, feed_id integer) The content and summary columns especially for some of the news items are highly similar but not equal. For any given 2 such news items, I am trying to cut them down to 1. Any ideas?