Causal inference as a blind spot of data scientists

225 点作者 Dzidas超过 1 年前

19 条评论

bertil超过 1 年前

The main reason for not using causal inference is not because data scientists don’t know about the different approaches or can’t imagine something equivalent (a lot of reinvention); forecasting is one of the most common tasks, after all.The main reason is that they generally work for software companies where it’s easier and less susceptible to analyst influence to implement the suggested change and test it with a Random Control Trial. I remember running an analysis that found that gender was a significant explaining factor for behavior on our site; my boss asked (dismissively): What can we do with that information? If there is an assumption of how things work that doesn’t translate to a product change, that insight isn’t useful; if there is a product intuition, testing the product change itself is key, and there’s no reason to delay that.There are cases where RCTs are hard to organize (for example, multi-sided platform businesses) of changes that can’t be tested in isolation (major brand changes). Those tend to benefit from the techniques described there——and they have dedicated teams. But this is a classic case of a complicated tool that doesn’t fit most use cases.

评论 #37891305 未加载

评论 #37891682 未加载

评论 #37891891 未加载

Anon84超过 1 年前

For a hands on introduction to Causality, I would recommend “Causal Inference in Python” by M. Facure <a href="https://amzn.to/46byWnl" rel="nofollow noreferrer">https://amzn.to/46byWnl</a> Well written and to the point.<ShamelessSelfPromotion> I also have a series of blog posts on the topic: <a href="https://github.com/DataForScience/Causality">https://github.com/DataForScience/Causality</a> where I work through Pearls Primer: <a href="https://amzn.to/3gsFlkO" rel="nofollow noreferrer">https://amzn.to/3gsFlkO</a> </ShamelessSelfPromotion>

评论 #37893090 未加载

评论 #37892410 未加载

评论 #37894447 未加载

mbowcut2超过 1 年前

For what it’s worth, my undergraduate was in Economics with an emphasis in econometrics and this article touched on probably 80% of the curriculum.The only problem is by the time I graduated I was somewhat disillusioned with most causal inference methods. It takes a perfect storm natural experiment to get any good results. Plus every 5 years a paper comes out that refutes all previous papers that use whatever method was in vogue at the time.This article makes me want to get back into this type of thinking though. It’s refreshing after years of reading hand-wavy deep learning papers where SOTA is king and most theoretical thinking seems to occur post hoc, the day of the submission deadline.

评论 #37894542 未加载

评论 #37892091 未加载

g42gregory超过 1 年前

In Corporate and Medical data science fields, people begin to accept causal inference. It is difficult, as the subject is still in flux and under development.I am aware of three reputable causal inference frameworks:1. Judea Pearl's framework, which dominates in CS and AI circles2. Neyman-Rubin causal model: <a href="https://en.wikipedia.org/wiki/Rubin_causal_model" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/Rubin_causal_model</a>3. Structural equation modelling: <a href="https://en.wikipedia.org/wiki/Structural_equation_modeling" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/Structural_equation_modeling</a>None of them would acknowledge each other, but I believe the underlying methodology is the same/similar. :-)It's good to see that it is becoming more accepted, especially in Medicine, as it will give more, potentially life-saving, information to make decisions.In Social Sciences, on the other hand, causal inference is being completely willfully ignored. Why? Causal inference is an obstacle to making a preconceived conclusions based on pure correlations: something correlates with something, therefore ... invest large sums of money, change laws in our favor, etc... This works for both sides. Sadly, I don't think this could be fixed.

评论 #37892880 未加载

评论 #37892104 未加载

评论 #37895153 未加载

评论 #37893955 未加载

评论 #37891819 未加载

评论 #37897551 未加载

mikpanko超过 1 年前

An important topic. Today most tech companies worship a/b experiments as the main way of being data-driven and bringing causality into decision-making. It deserves to be the gold standard.However, most experiments are usually expensive: they require investing in building the feature in question and then collecting data for 1-4 weeks before being certain of the effects (plus there are long-term ones to worry about). Some companies report that fewer than 50% of their experiments prove truly impactful (my experience as well). That’s why only a small number of business decisions are made using experiments today.Observational causal inference offers another approach, trading off full confidence in causality with speed and cost. It was pretty hard to run correctly so far, so it is not widely adopted. We are working on changing that with Motif Analytics and wrote a post with an in depth exploration of the problem: <a href="https://www.motifanalytics.com/blog/bringing-more-causality-to-analytics" rel="nofollow noreferrer">https://www.motifanalytics.com/blog/bringing-more-causality-...</a> .

tmoertel超过 1 年前

Interestingly, recent research suggest that you can make better decisions by combining experimental and observational data than by using either alone:<a href="https://ftp.cs.ucla.edu/pub/stat_ser/r513.pdf" rel="nofollow noreferrer">https://ftp.cs.ucla.edu/pub/stat_ser/r513.pdf</a>> Abstract: Personalized decision making targets the behavior of a specific individual, while population-based decision making concerns a sub-population resembling that individual. This paper clarifies the distinction between the two and explains why the former leads to more informed decisions. We further show that by combining experimental and observational studies we can obtain valuable information about individual behavior and, consequently, improve decisions over those obtained from experimental studies alone.

mwexler超过 1 年前

I 100% agree with this blind spot. Most data science coursework avoids the very thing making it a science: the explanation of what change causes what effect. I've been surprised that year after year, programs at so many "Schools of Data Science" keep gliding over this area, perhaps alluding to it in an early stats course if at all.It's an important part of validating that your data-driven output or decision is actually creating the change you hope for. So many fields either do poor experimentation or none at all, others are prevented from doing the usual "full unrestricted RCT": med and fin svcs and other regulated industries have legal constraints on what they can experiment with; in other cases, data privacy restricts the measures one can take.I've had many data folks throw up their hands if they can't do a full RCT, and instead look to pre-post with lots of methodological errors. You can guess how many of those projects end up. (No, not every change needs a full test, and some things are easy rollback. But think of how many others would have benefitted from some uncertainty reduction.)Sure, "LLM everything" and "just gbm it!" and "ok, just need a new feature table and I'm done!" are all important and fun parts of a data science day. But if I can't show that a data driven decision or output makes things better, then it's just noise.Causal modeling gets us there. It improves the impact of ml models that recognize the power of causal interventions, and it gives us evidence that we are helping (or harming).It's (IMO) necessary, but of course, not sufficient. Lots of other great things are done by ML eng and data scientists and data eng and the rest, having nothing to do with casual inference... But I keep thinking how much better things get when we apply a causal lens to our work.(And next on my list would be having more data folks understanding slowly changing dimension tables, but this can wait for another time).

评论 #37891506 未加载

tomrod超过 1 年前

I've self-learned for a long time in the causal inference space and model evaluation is a concern for me. My biggest concern is falsification of hypotheses. In ML, you have a clear mechanism to check estimation/prediction through holdout approaches. In classical metrics, you have model metrics that can be used to define reasonable rejection regions for hypothesis tests. But causal inference doesn't seem to have this, outside traditional model fit metrics or ML holdout assessment? So the only way a model is deemed acceptable is by prior biases?If my understanding is right, this means that each model has to be hand-crafted, adding significant technical debt to complex systems, and we can't get ahead of the assessment. And yet, it's probably the only way forward for viable AI governance.

评论 #37891777 未加载

评论 #37891367 未加载

mumblemumble超过 1 年前

I would argue it's more a blind spot of big data, which tends to tacitly imply just doing correlational studies on data that happens to be laying around.Most data scientists work for companies that don't really want to pay for controlled experiments outside of maybe letting the UI team do A/B tests. Natural experiments can be hard to come by in a business setting. And all of the wild mathematical gyrations that econometricians and political scientists have developed to try to do causal inference from correlational data have a tendency not to be as popular in business because, outside of some special domains such as politics and consumer finance, it can be rather difficult to get away with dressing your emperor in math that nobody can understand instead of actual clothing.

评论 #37893523 未加载

felixleungsc超过 1 年前

The Atlantic/American Causal Inference Conference (ACIC) hosts a data challenge every year, I think. Useful to see many different methods compared on simulated data.Does anyone know similar challenges/competitions?ACIC links to years I could find:- 2016: <a href="https://arxiv.org/abs/1707.02641" rel="nofollow noreferrer">https://arxiv.org/abs/1707.02641</a>- 2017: <a href="https://arxiv.org/abs/1905.09515" rel="nofollow noreferrer">https://arxiv.org/abs/1905.09515</a>- 2019: <a href="https://sites.google.com/view/acic2019datachallenge/data-challenge" rel="nofollow noreferrer">https://sites.google.com/view/acic2019datachallenge/data-cha...</a>- 2022: <a href="https://acic2022.mathematica.org/results" rel="nofollow noreferrer">https://acic2022.mathematica.org/results</a>- 2023: <a href="https://sci-info.org/data-competition/" rel="nofollow noreferrer">https://sci-info.org/data-competition/</a>

shahbazac超过 1 年前

I’ve tried to understand causal inference several times and failed. Tutorials seem unnecessarily long winded. I wish authors would give simple, to the point examples.Say I have a simple table of outdoor temperatures and ice cream sales.What can the machinery of causal inference do for me in this situation?If it doesn’t apply here, what do I need to add to my dataset to make it appropriate for causal inference? More columns of data? Explicit assumptions?If I can use causal inference, what can it tell me? If I think of it as a function CA(data), can it tell me if the relationship is actually causal? Can it tell me the direction of the relationship? If there were more columns, could it return a graph of causal relationships and their strength? Or do I need to provide that graph to this function?I know a wet pavement can be caused by rain or spilled water or that an alarm can go off due to an earthquake or a burglary. I have common sense. I also understand the basics of graph traversal from comp sci classes.How do I practically use causal inference?To the authors of future articles on this (or any technical tutorial), please explain the essence, the easy path, then the caveats and corner cases. Only then will abstract philosophizing make sense.

评论 #37894195 未加载

评论 #37895905 未加载

评论 #37893833 未加载

评论 #37893643 未加载

评论 #37893700 未加载

评论 #37894175 未加载

评论 #37893664 未加载

评论 #37894347 未加载

gordon_freeman超过 1 年前

There is an excellent video on YouTube by MIT Prof Sontag on Casual Inference worth checking out [1]And if you like it, 2nd part is here [2][1] <a href="https://youtu.be/gRkUhg9Wb-I?si=6oMUgdjia_4g6-DR" rel="nofollow noreferrer">https://youtu.be/gRkUhg9Wb-I?si=6oMUgdjia_4g6-DR</a>[2] <a href="https://www.youtube.com/watch?v=g5v-NvNoJQQ">https://www.youtube.com/watch?v=g5v-NvNoJQQ</a>

spywaregorilla超过 1 年前

> The DoubleML method is founded on machine learning modeling and consists of two key steps. First, we build a model that predicts the treatment variable based on the input variables . Then, we create a separate model that predicts the outcome variable using the same set of input variables . Subsequently, we calculate the residuals from the former model and regress them against the residuals from the latter model. An important feature of this method is its flexibility in accommodating non-linear models, which allows us to capture non-linear relationships — a distinctive advantage of this approach.Just... don't do this. You're not going to be able to math your way to better conclusions. Make your model, make your plots, and use critical thinking to evaluate your results.

评论 #37894249 未加载

civilized超过 1 年前

Contrasting frequentist statistics and causal inference, and saying the latter often goes beyond the former, makes for a bizarre opening. It's like saying apples have nutritional value, unlike soccer balls. It's like saying trigonometry often goes beyond the scope of calculus.

tqi超过 1 年前

I think in practice most of these techniques are useless (or worse, confer a false sense of precision) because they require so many nuanced judgement calls that they become little more than a way to launder biases.

Acsmaggart超过 1 年前

A co-worker pointed me to this e-book, which I thought did a great job of presenting he concepts in a relatable and applied way:<a href="https://matheusfacure.github.io/python-causality-handbook/landing-page.html" rel="nofollow noreferrer">https://matheusfacure.github.io/python-causality-handbook/la...</a>But I agree with other comments here, at the end of the day it seems like causal analysis often boils down to whether you trust the analyst and/or their techniques since it is hard to validate the results.

loa_observer超过 1 年前

We built some causal discovery and inference features with graph visualization in Kanaries RATH: <a href="https://docs.kanaries.net/rath/discover-causals/causal-analysis" rel="nofollow noreferrer">https://docs.kanaries.net/rath/discover-causals/causal-analy...</a>It's also open-sourced. Welcome to have a try.

pocketsand超过 1 年前

One gripe with this article—-regression coefficient doesn’t provide ATE under most circumstances using observational data.

paulpauper超过 1 年前

It's not that hard. if the causality cannot make sense logically or plausibly, then you can reasonably reject it . no reasonable person would ever get the umbrella puddles thing confused.

评论 #37891549 未加载

评论 #37894274 未加载