I'd imagine Tanimoto or cosine similarity would get you most of the way there while being very off-the-shelf.<p>If you're going the binary classification route, I'd personally use random forests (RFs), since variable importance (here, product importance) is built in. (But that's just personal preference.)<p>I think it's a solid step-by-step thought process for tackling the problem. I'd probably think about false positives vs. false negatives in terms of the relative expected values of success/failure in those classifications. (And perhaps cost-sensitivity could even be added to your original classifier: if you had a forest of 500 trees and got even 100 votes for pregnant, that's enough to decide to send a pregnancy-targeted mailer.)
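To make the two ideas above concrete, here is a rough sketch in plain Python. The function names, the set-of-products representation, and the gain/cost figures are all assumptions for illustration, not anything from the original analysis. It treats the fraction of trees voting "pregnant" as a probability estimate, which is itself a simplification.

```python
import math

def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of purchased products."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def cosine(a: set, b: set) -> float:
    """Cosine similarity for binary purchase vectors, represented as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def vote_threshold(n_trees: int, gain_tp: float, cost_fp: float) -> int:
    """Smallest number of tree votes at which sending the mailer breaks even.

    Assumes the vote fraction p approximates P(pregnant). Sending has
    expected value p * gain_tp - (1 - p) * cost_fp, which is non-negative
    when p >= cost_fp / (gain_tp + cost_fp).
    """
    return math.ceil(n_trees * cost_fp / (gain_tp + cost_fp))

# Hypothetical baskets: a known-pregnant shopper vs. a candidate shopper.
known = {"unscented lotion", "zinc", "cotton balls"}
candidate = {"zinc", "cotton balls", "wine"}
print(tanimoto(known, candidate))  # 2 shared / 4 distinct products
print(cosine(known, candidate))    # 2 / sqrt(3 * 3)

# Hypothetical payoffs: a correct mailer is worth 4x what a wrong one costs.
print(vote_threshold(500, gain_tp=4, cost_fp=1))
```

With those made-up payoffs the break-even point works out to 100 of 500 votes, which is where a figure like the one in the comment could come from. Different cost assumptions move the threshold accordingly.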
After watching a coworker innocently ask a woman who wasn't expecting, "When are you due?", I've developed a simple rule for this:<p>If she tells you she's pregnant: <i>Congratulate her.</i><p>If she doesn't: <i>Keep your mouth shut.</i><p>Seriously, if you want to target expectant mothers, let them register for a discount program. <i>Diapers are expensive!</i> Any marketing effort that begins, "We think you might be pregnant..." is doomed.