Non-obvious indicators that a transaction might be fraudulent

117 points by teuobk, about 9 years ago

18 comments

Being pegged as a fraud risk just because the user is concerned about privacy rubs me the wrong way. Even if it is entirely legal and the statistical model is valid.

Some years back, some law enforcement agency in the US (but I forgot which) produced a highly reliable indicator of drug mules on highways. The indicator was black male, driving a late-model sedan, and traveling at exactly the speed limit (when everyone else was traveling 10-15 mph over the limit). The police were stopping and searching anyone who fit these indicators.

Putting aside the questions of racial profiling and whether the statistical model was correct, do you think this is a good idea? I can't articulate my feelings exactly, but I feel it's wrong.

jedberg, about 9 years ago

Ugh. The author just gave away all the good tricks.

This is one area where security by obscurity actually works (well, worked, depending on how many fraudsters read this).

Fraudsters are generally pretty dumb when it comes to technology, so even if a lot of these seem obvious to the tech-savvy HN audience, they weren't obvious to the fraudsters till now.

The good news is that most of them don't read HN.

danieltillett, about 9 years ago

My wife was seeing a lot of fraud with her business. She only takes payment via PayPal. The fraudsters have been ordering via a local hacked computer with a hacked PayPal account that matches. Everything looks 100% OK (physical address matches, IP address matches, browser nice and normal, etc), but she gets hit by a PayPal chargeback 3 to 4 weeks later for a "non authorised payment". She now has to call each new customer to make sure that they are real. Interestingly, when she posted that she would call all new customers the fraud attempts went down to about 5% of the previous level.

recursive, about 9 years ago

What is meant by referrer "history"? As far as I know, the referrer in HTTP headers can refer only to the one most recently visited resource.

Edit: And what's with this?

> There is another feature in browsers which is “Do Not Track” (http://donottrack.us/). For organic/real users the possible options are “Yes”, “No”, “Unspecified”;

The DNT HTTP header has 2 values: "0" and "1".
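One way to reconcile the quoted three options with the two header values is that a browser may also send no DNT header at all. A minimal Python sketch of that reading follows; the function name `classify_dnt` and the four labels it returns are assumptions made for illustration, not anything taken from the article.

```python
from typing import Mapping, Optional


def classify_dnt(headers: Mapping[str, str]) -> str:
    """Map a request's DNT header to one of four illustrative labels.

    The labels are hypothetical names chosen for this sketch.
    """
    value: Optional[str] = None
    # HTTP header names are case-insensitive, so compare in lowercase.
    for name, raw in headers.items():
        if name.lower() == "dnt":
            value = raw.strip()
            break

    if value is None:
        return "unspecified"       # header not sent at all
    if value == "1":
        return "do-not-track"      # user asked not to be tracked
    if value == "0":
        return "tracking-allowed"  # user explicitly allowed tracking
    return "malformed"             # anything else, e.g. "null" or ""


if __name__ == "__main__":
    print(classify_dnt({"DNT": "1"}))
    print(classify_dnt({"User-Agent": "example"}))
```

Under this reading, a literal "null" or empty value would land in the "malformed" bucket, which is distinct from the header simply being absent.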

scottm30, about 9 years ago

How do they know that fraudsters with fresh cookies and no referrer history aren't just in private browser mode? Sounds like the server would view them as the same in most cases.

studentrob, about 9 years ago

This type of service seems like it would be really useful to smaller merchants.

I remember reading a while back on HN that smaller e-commerce shops were often targeted by fraudsters. So, many use Amazon as a go-between when they'd prefer to have their own site and payment processing.

Companies like this could help smaller shops compete with the services provided by Amazon, eBay, etc.

nodesocket, about 9 years ago

The no referrer and user-agent (non Mac) are certainly signals, but there are some others that are interesting like caps lock on, geo-ip (obvious), and screen-size.

jpeg_hero, about 9 years ago

Are all of these indicators available "over the wire" in browser fingerprinting?

Can you tell the CPU from browser fingerprinting?

And cookie age? Is he talking about cross-domain cookies from the ad networks?

logicallee, about 9 years ago

I admire the research that went into collecting these signals, but I consider it a poor idea to have published what could be used as a checklist. I believe some of the kind of people (criminals, bad people: you should stop) who take the technical actions listed certainly read Hacker News. Yet without exception all of these are easy to modify, losing your hard-won signals. Better not to mention what they are.

That said, perhaps they did not publish all of the signals they found.

statictype, about 9 years ago

Can a web server get a list of plugins installed by the client? That can't be right, can it?

DyslexicAtheist, about 9 years ago

About #3 and the DNT null values...

DNT wasn't proposed until 2009 (and implemented in 2010). So this (the DNT header not being sent) would be normal on older OS versions. It's the same reason as for the 32-bit OS on a 64-bit machine, assuming people use the IE that came with the installation (or downloaded an old browser that is still usable in 32-bit).

cpr, about 9 years ago

I assume that the bigger guys like Stripe are already doing this? Is that a valid assumption?

cheez, about 9 years ago

They call it machine learning, so did they apply unsupervised learning to determine these factors? How did they determine these factors were relevant?

gculliss, about 9 years ago

These were reliable indicators until they were disclosed here...

graycat, about 9 years ago

It appears that we have a special, powerful, valuable opportunity for how to manipulate the data in the OP.

So, the OP has "7 Leading Fraud Indicators: From Fresh Cookies to Null Values".

Suppose for those 7 indicators, 4 of them have just two possible values and the other three have just 4 possible values or some such. Then for one connection to the server from a Web browser, the 7 signals have jointly just

    2^4 * 4^3 = 1,024

possible values. That is, there are only n = 1,024 possible cases of signal data from a Web browser from a connection to the server. And apparently we have good data on each of the cases.

Or, to be practical, if for some case we have no data at all, then we just assume that the reason is that the probability of that case is so low that we can ignore that case.

The central problem here is how to detect "fraudsters". For such detection, necessarily there are two ways to be wrong: (1) a false alarm when we say that a connection is from fraud when it is not and (2) a missed detection when we say that a connection is not from fraud when it is.

Our mission, and we have to accept it, is essentially to find ways of manipulating the large amount of relevant data so that (A) from the false alarm (1), we can specify the highest probability of a false alarm f we are willing to tolerate, (B) get that probability of a false alarm f in practice, and (C) from the missed detections in (2), for that probability of a false alarm f, get the lowest probability of a missed detection (2) we can.

Or, for the false alarms we are willing to tolerate, we want to manipulate the data to get all the detections we can.

So, for some notation:

    P -- probability
    n -- positive integer, number of different possible cases of data
         from connections, e.g., as above, n = 1,024
    B -- event, connection is bad, fraud
    G -- event, connection is good, not fraud

    P(B) + P(G) = P(B OR G) = 1

    C -- random variable, case of connection, i = 1, 2, ..., n.

So random variable C takes values in the set {1, 2, ..., n}.

    p(i) = P(C = i)

    b(i) = P(B | C = i)
         = P(B AND C = i)/P(C = i)
         = P(B AND C = i)/p(i)

    g(i) = P(G | C = i)
         = P(G AND C = i)/P(C = i)
         = P(G AND C = i)/p(i)

    B = U_i {B AND C = i}

    P(B) = Sum_i P(B AND C = i)
         = Sum_i p(i) P(B | C = i)
         = Sum_i p(i) b(i)

    P(G) = Sum_i p(i) g(i)

    b(i) + g(i) = P(B | C = i) + P(G | C = i)
                = P(B AND C = i)/P(C = i) + P(G AND C = i)/P(C = i)
                = ( P(B AND C = i) + P(G AND C = i) )/P(C = i)
                = P( (B AND C = i) OR (G AND C = i) )/P(C = i)
                = P(C = i)/P(C = i)
                = 1

    M -- event, a missed detection of a bad connection, fraud
    D -- event, detection of a bad connection, fraud
    F -- event, false alarm

Detection Rule:

Suppose for some set I a subset of {1, 2, ..., n} we raise an alarm of a detection of a bad connection, that is, fraud, when C in I.

With this detection rule, the probability of a false alarm is

    P(F) = Sum_{i in I} P(G AND C = i)
         = Sum_{i in I} P(G | C = i) p(i)
         = Sum_{i in I} g(i) p(i)

the probability of a detection is

    P(D) = Sum_{i in I} P(B AND C = i)
         = Sum_{i in I} P(B | C = i) p(i)
         = Sum_{i in I} b(i) p(i)

and the probability of a missed detection is

    P(M) = P(B AND C not in I)
         = Sum_{j not in I} P(B AND C = j)
         = Sum_{j not in I} P(B | C = j) p(j)
         = Sum_j P(B | C = j) p(j) - Sum_{i in I} P(B | C = i) p(i)
         = Sum_j P(B | C = j) p(j) - P(D)
         = Sum_j P(B AND C = j) - P(D)
         = P(B) - P(D)

So, to minimize the probability of a missed detection P(M) we want to maximize the probability of a detection P(D). We guessed this intuitively.

To maximize the probability of a detection P(D), suppose we have sorted our data on the n cases so that the ratios b(i)/g(i) are in descending order, that is, so that

    b(1)/g(1) >= b(2)/g(2) >= ... >= b(n)/g(n)

Suppose we pick k in {1, 2, ..., n} and let I = {1, 2, ..., k}.

Then for our detection rule with this k and I, the probability of a false alarm is

    P(F) = Sum_{i in I} g(i) p(i)

So, note that here really we are just summing i = 1, 2, ..., k where, as just above,

    b(1)/g(1) >= b(2)/g(2) >= ... >= b(n)/g(n)

So, we just sort these ratios and then sum the products g(i) p(i) on i until we get our selected probability of false alarms f.

As we will prove below, this is just the thing we should do.

If we pick k too large, then our probability of false alarms will be larger than our selected value f. If we pick k too small, then our probability of detection will be smaller than we want.

Also for our detection rule with this k and I, the probability of a detection, what we want to maximize, is

    P(D) = Sum_{i in I} b(i) p(i)

So, suppose we pick k just large enough that P(F) = f (or close enough for government work).

Claim: With this selection of k and I, we get, as in (1), the probability of a false alarm f we selected and, for that probability of a false alarm f, get the probability of a detection P(D) the largest possible and, as in (2), the probability of a missed detection the smallest possible.

To see this claim, we want to select x_1, x_2, ..., x_n to solve the operations research applied mathematics resource allocation optimization problem

Problem 1:

    max z = P(D) = Sum_i x_i b(i) p(i)

    subject to

    P(F) = Sum_i x_i g(i) p(i) <= f

    x_i = 0, 1

Yes, from the x_i, I = {i | x_i = 1}.

Problem 2:

Suppose for some L >= 0, x = (x_i) solves

    max z = Sum_i x_i b(i) p(i) - L ( Sum_i x_i g(i) p(i) - f )

    subject to

    x_i = 0, 1

Then since x = (x_i) solves Problem 2, we have that for any y = (y_i) that satisfies the constraints of Problem 1, that is

    Sum_i y_i g(i) p(i) <= f

and

    y_i = 0, 1

we have that

    Sum_i x_i b(i) p(i)
        =  Sum_i x_i b(i) p(i) - L ( Sum_i x_i g(i) p(i) - f )
        >= Sum_i y_i b(i) p(i) - L ( Sum_i y_i g(i) p(i) - f )
        >= Sum_i y_i b(i) p(i)

so that x = (x_i) solves Problem 1. (The first equality uses Sum_i x_i g(i) p(i) = f, which we arrange below; the last inequality uses L >= 0 and Sum_i y_i g(i) p(i) <= f.)

For more, in Problem 2, we have

    max z = Sum_i x_i b(i) p(i) - L ( Sum_i x_i g(i) p(i) - f )
          = Sum_i ( x_i b(i) p(i) - L x_i g(i) p(i) ) + L f
          = Sum_i x_i ( b(i) p(i) - L g(i) p(i) ) + L f

so that we set x_i = 1 if and only if

    b(i) p(i) - L g(i) p(i) >= 0

that is, if and only if

    b(i)/g(i) >= L

So, the way to solve Problem 1 is to pick k from 1, 2, ..., n, set I = {1, 2, ..., k}, and set x_i = 1 for i in I and x_i = 0 otherwise so that

    Sum_{i in I} x_i g(i) p(i) = f

In particular,

    L = b(k)/g(k)

That is, intuitively, we are making investments in real estate, and our probability of a false alarm

    P(F) = Sum_i x_i g(i) p(i) = f

is like money.

We get to invest the money in cases i = 1, 2, ..., n. For case i,

    b(i)/g(i) = (b(i) p(i))/(g(i) p(i))

is our return on investment, that is, at investment i, the probability of detection we get for the probability of false alarms we are willing to tolerate.

So, we sort so that the ratios

    b(1)/g(1) >= b(2)/g(2) >= ... >= b(n)/g(n)

and make investments in the order i = 1, 2, ... until we have spent all our money.

Then, for the money we have spent, that is the best return on our investment we can get.

Thanks to J. Lagrange, K. Pearson, J. Neyman, and H. Everett.
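The sort-and-spend rule in the comment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pick_alarm_cases`, the example probabilities, and the choice to stop at the first case that would exceed the budget f (so that I is a prefix {1, ..., k} of the sorted order, as in the comment) are all choices made here, not part of the original post.

```python
from typing import List, Sequence, Tuple


def pick_alarm_cases(
    p: Sequence[float],  # p[i] = P(C = i)
    b: Sequence[float],  # b[i] = P(B | C = i)
    f: float,            # largest tolerated false alarm probability
) -> Tuple[List[int], float, float]:
    """Return (I, P(F), P(D)) for the prefix rule described above."""
    g = [1.0 - bi for bi in b]  # g[i] = P(G | C = i), since b + g = 1

    def ratio(i: int) -> float:
        # A case with g[i] == 0 costs no false alarms, so rank it first.
        return b[i] / g[i] if g[i] > 0 else float("inf")

    order = sorted(range(len(p)), key=ratio, reverse=True)

    chosen: List[int] = []
    p_false_alarm = 0.0
    p_detect = 0.0
    for i in order:
        cost = g[i] * p[i]  # false alarm "money" spent on case i
        if p_false_alarm + cost > f + 1e-12:
            break  # budget exhausted; keep I a prefix of the sorted order
        chosen.append(i)
        p_false_alarm += cost
        p_detect += b[i] * p[i]
    return chosen, p_false_alarm, p_detect


if __name__ == "__main__":
    # Hypothetical numbers: four equally likely cases.
    cases, pf, pd = pick_alarm_cases(
        p=[0.25, 0.25, 0.25, 0.25], b=[0.5, 0.1, 0.9, 0.0], f=0.2
    )
    print(cases, pf, pd)
```

Because the cases are discrete, the spent budget will usually land somewhat below f rather than exactly on it, which is the "close enough" caveat in the comment.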

brightball, about 9 years ago

This would have been such a huge help to me a couple of years ago. Good read and good looking service.

eveningcoffee, about 9 years ago

Or a big FY to privacy-conscious users. That does not mean these are not strong indicators of possible fraud, but applied fiercely they would really hurt legitimate users.

danharaj, about 9 years ago

These are not useful statistics at all the way they're presented. Base rate fallacy or something? Idk. There's issues.
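For readers unfamiliar with the base rate point, a small Python sketch of Bayes' rule may help. All numbers here are hypothetical and chosen only for illustration (a 1% fraud rate, a signal seen on 90% of fraudulent and 5% of legitimate transactions); none of them come from the article, and `posterior_fraud` is a name made up for this sketch.

```python
def posterior_fraud(base_rate: float, hit_rate: float, false_alarm_rate: float) -> float:
    """P(fraud | signal present) by Bayes' rule.

    base_rate        = P(fraud)
    hit_rate         = P(signal | fraud)
    false_alarm_rate = P(signal | not fraud)
    """
    flagged_fraud = hit_rate * base_rate
    flagged_good = false_alarm_rate * (1.0 - base_rate)
    total = flagged_fraud + flagged_good
    if total == 0.0:
        raise ValueError("the signal never fires, so the posterior is undefined")
    return flagged_fraud / total


if __name__ == "__main__":
    # Hypothetical inputs: 1% fraud, 90% hit rate, 5% false alarm rate.
    print(posterior_fraud(0.01, 0.90, 0.05))
```

With these made-up inputs the posterior comes out at roughly 15%, so most flagged transactions would still be legitimate even though the signal looks strong in isolation.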
