I think that if someone proposes multiple answers to a question, the fact that an answer appears among their first guesses is some evidence that this answer is correct. If the set of possible answers is large, the amount of evidence might be substantial. I describe a thought experiment to illustrate this, and an actual experiment that might test it.
I'm not totally sure my example holds up.
Adjusting for single comparisons
Here's a thought experiment. Conventional wisdom says: when you're doing research, if you test multiple hypotheses you need to make special adjustments to your results, whereas you can just do the "normal thing" (whatever that is) if you're only testing one hypothesis. I'm going to suggest that this is, in a sense, backwards: at least from a Bayesian point of view, it is single comparisons, not multiple comparisons, that merit special treatment, because in the single-comparison case the choice of hypothesis is itself quite informative.
Suppose some economists publish a study that shows that people whose third grade teacher was called "Margaret" earn $1000 more each year on average than people whose third grade teacher was not called "Margaret", p=0.001 (they had a big sample). It sounds like nonsense, but on further investigation you discover that
- this study was pre-registered
- the research group pre-registers all of their studies
- the only income association they ever investigated was the one they found: between having a third grade teacher called "Margaret" and earnings in later life
It seems to me that I would then be inclined to conclude that this association is real, and that if I am trying to forecast someone's earnings it might be a little bit helpful to ask them the name of their third grade teacher. I can even think of explanations - "Margarets" are, perhaps, more likely to work at affluent schools.
Now imagine exactly the same situation, except the research team investigated 10 000 different associations, none of which they singled out in advance as especially plausible. Now it seems clear that the association is most likely an artefact of multiple comparisons.
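To see why in crude terms, here's a toy simulation (the data are made-up nulls, not anything from the scenario): if all 10 000 associations were truly zero, we would still expect about 10 of them to come in at p ≤ 0.001 by chance.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tests, alpha = 10_000, 0.001

# Simulate p-values for 10 000 tests where every null hypothesis is true:
# under the null, p-values are uniform on [0, 1].
p_values = rng.uniform(size=n_tests)

print((p_values <= alpha).sum())  # typically around 10 "significant" results
print(n_tests * alpha)            # expected count under the null: 10.0
```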
There's a standard answer for how to deal with this second case from a Bayesian perspective, which is to say (roughly) "my prior probability that having a 'Margaret' for third grade is strongly associated with earnings is too low for this evidence to rescue it". But this position would also oblige you to reject the association in the first case: it's the same evidence about the same hypothesis, so if your conclusion depends only on this evidence and your prior probability, and the prior probability is identical in both cases, then you must reach the same conclusion. This is, importantly, different to the non-Bayesian answer, which says you need to treat multiple comparisons in a special way. To the Bayesian, the evidence has exactly the same relevance to hypothesis number 1 whether it's the only hypothesis tested or hypothesis 1 out of 10 000. Thus if we are approximately Bayesian and we come to different conclusions for the two experiments, it must be because we bring different priors to each of them.
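As a minimal sketch of that point (the prior probabilities and Bayes factor below are purely illustrative numbers, not derived from the scenarios): the Bayes factor supplied by the data is identical in both cases, so any difference in the posterior has to come from the prior.

```python
def posterior(prior, bayes_factor):
    """Posterior probability of the hypothesis after seeing evidence
    with the given Bayes factor (likelihood ratio) in its favour."""
    odds = prior / (1 - prior) * bayes_factor
    return odds / (1 + odds)

# The Bayes factor from the data is the same whether the hypothesis was
# the only one tested or one of 10 000; only the prior can differ.
bf = 50  # hypothetical evidence strength

print(posterior(prior=0.05, bayes_factor=bf))    # more generous prior -> ~0.72
print(posterior(prior=0.0001, bayes_factor=bf))  # sceptical prior -> ~0.005
```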
Do we draw different conclusions from each experiment? We could test whether people do treat these two scenarios differently. Present the first scenario to one group of people and ask them how likely they think it is that the association is real, and present the second scenario to another group and ask them the same question (I'm speaking loosely here; I think the language would need to be clearer to actually run a test). My guess is that the first group would think the association is quite likely real, and the second group would think it is not very likely to be real, but I'm pretty unsure about this. Of course, whether people do treat these scenarios differently and whether people should treat them differently are different questions - maybe we're just not approximately Bayesian in the appropriate sense.
But let's suppose we are approximately Bayesian. Then, if there is an asymmetry, that asymmetry must be due to one of two things:
- In the first case (single hypothesis test), the hypothesis of a substantial association between a third grade "Margaret" and income was one of the first things the researchers thought of, whereas in the second case it was not, and this in and of itself is significant evidence
- In the second case (multiple hypothesis test), we "adjust our priors" on the strength of income/demographic relationships on the basis of the full set of relationships studied, and substantially revise down the probability of large relationships in the process
We could distinguish the two by positing a third scenario: the first scenario plays out exactly as described, then the same researchers follow up with a second paper examining 9 999 other prospective associations with income, finding that only 1 in 100 is stronger than the "3rd grade Margaret" association, and that the vast majority of the stronger associations make more intuitive sense. Now we have evidence truly identical to scenario 2 in all ways except for the researchers' hypothesis prioritisation. In particular, the follow-up experiment yields exactly the same "downward adjustment" in association strengths proposed in explanation (2). So does this, or should this, cause people to revise their initial assessment of the reality of the "3rd grade Margaret" association? I think probably not too much, but again I'm pretty unsure. My main reason for thinking this is that I already expect reasonable priors here to put a fairly low weight on strong inexplicable relationships, and the additional evidence to do relatively little work.
So of the two explanations, I'm inclined to think the authors' prioritisation is substantial evidence in favour of the prioritised hypothesis. We could reason something like this: the authors chose this hypothesis first (out of 10 000 alternatives!), so we can suppose that it is the best of all alternatives according to some "scoring function" (to be clear: we don't know much about what this scoring function is; it could be anything from a highly considered prioritisation method to a very-slightly-educated guess). We then need to make some reasonable guess about how strongly this scoring function correlates with "actual effect size".
Even a modest probability that the scoring function has a substantial correlation with "actual effect size" will often yield a huge boost to the probability of an actually large effect. If we initially had a Gaussian prior over effect sizes and this effect was, say, 4 standard deviations above the mean, then our prior probability of an effect at least this large is tiny (roughly 3 in 100 000). But if we instead allow a 10 percent chance that the authors' scoring function is pretty good - specifically, that the correlation between score and actual effect size is 0.5 - then there is a 10 percent chance the effect sits only 2 standard deviations above the relevant (conditional) mean. Suddenly we're up to roughly a 2 in 1000 chance of an effect this large, an improvement of roughly a factor of 70, which means that we could come to believe in such an effect on the basis of much weaker data.
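Here is that back-of-the-envelope calculation written out, under the same assumptions as above (Gaussian prior, an effect 4 standard deviations out, and a 10 percent chance of a scoring function that brings it to within 2 standard deviations of the relevant mean):

```python
from scipy.stats import norm

# Baseline: Gaussian prior over effect sizes, effect 4 SD above the mean.
p_baseline = norm.sf(4)  # ~3.2e-5, i.e. about 3 in 100 000

# Alternative: 10% chance the scoring function correlates 0.5 with the true
# effect size, in which case (per the argument above) the effect sits only
# about 2 SD above the relevant conditional mean.
p_with_scoring = 0.9 * norm.sf(4) + 0.1 * norm.sf(2)  # ~2.3e-3, about 2 in 1000

print(p_baseline, p_with_scoring, p_with_scoring / p_baseline)  # boost of ~70x
```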
I think one of the unusual features of this example is that the following two things usually go together, but I've tried here to separate them:
- By my own lights, I think some hypothesis is a priori reasonably likely
- Some researcher thinks the same hypothesis is a priori reasonably likely
and I'm arguing that the latter, on its own, should generally cause me to update my own views substantially.
A rough theory of what's going on
Here's a rough theory of what might be going on:
- When we're trying to answer a question, we (or economics researchers) can propose rough answers before we've fully worked it out[1]
- The likelihood of an answer being proposed is proportional to its weight in some prior distribution over answers, subject to the constraint that it be sufficiently different to answers already considered
- As a result, early answers tend to have higher prior probability than later ones (the toy simulation after this list illustrates this)
- If we know that someone with some degree of expertise (not necessarily a particularly high degree) assigns high probability to some hypothesis X, we should probably assign nontrivial probability to it
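Here's the toy simulation mentioned above (the heavy-tailed prior and the number of candidate answers are arbitrary modelling choices, not anything estimated from data): when answers are proposed with probability proportional to their prior weight, earlier proposals carry more prior mass on average.

```python
import numpy as np

rng = np.random.default_rng(0)

n_answers, n_trials = 1_000, 2_000
avg_prior_by_rank = np.zeros(5)

for _ in range(n_trials):
    # A made-up heavy-tailed prior over candidate answers, normalised to sum to 1.
    prior = rng.pareto(1.5, size=n_answers)
    prior /= prior.sum()
    # Propose five answers one by one, without replacement, with probability
    # proportional to prior weight (ignoring the "sufficiently different"
    # constraint for simplicity).
    order = rng.choice(n_answers, size=5, replace=False, p=prior)
    avg_prior_by_rank += prior[order]

# Decreasing sequence: earlier proposals carry more prior mass on average.
print(avg_prior_by_rank / n_trials)
```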
This might be related to "anchoring bias", where people's numerical estimates can depend on obviously unreliable pieces of information presented just before the estimates are elicited. Unlike in the classic anchoring bias experiments, I think that taking authors' opinions seriously in my example is reasonable. The restrictions I specified (preregistration, credible absence of alternative tests) imply that the authors' choice of hypothesis is likely to be somewhat representative of how plausible they think it is relative to other things they could have investigated. The setup of anchoring bias experiments, where the initial guesses are specifically crafted to be unreliable, might be unusual in this sense - perhaps, usually, initial guesses are in fact fairly honest and somewhat credible.
To the extent that this is a sensible heuristic, it's also an easily exploited one. In my first example, I went out of my way to specify that the researchers didn't have some large body of hidden research in addition to the headline result. It is easy, in general, for someone to say "I always thought X was likely" whether or not they actually did, and if they want to persuade you of X then they might well say it dishonestly.
Conclusion
I'm not sure what the practical implications of this are. I know that I don't typically make any adjustments for authors' beliefs when reading studies, and given that I rarely know how many other hypotheses they've tested, I'm unlikely to change this policy much. But it does suggest something odd is going on in the way we interpret research: the fact that multiple comparisons adjustments are the exception and not the norm seems to suggest that "normal interpretation of normal research" involves both significantly lower priors on the size of arbitrary effects and significant votes of confidence in study authors' judgement. This suggests, for example, that attempts to aggregate study results without including any terms for author reliability may be missing information that we implicitly make use of when interpreting studies in an ad hoc way, and may therefore persistently underperform.
- ^
I'm agnostic about whether the generation process is brainstorming or something more involved.
