A widely shared AI productivity paper was retracted, is possibly fraudulent
Confidence notes: I am a physicist working in computational materials science, so I have some familiarity with the field, but I don’t know much about R&D firms or economics. Some of the links in this article were gathered from a post at pivot-to-ai.com and the BS detector blog.
The paper “Artificial Intelligence, Scientific Discovery, and Product Innovation” was published as an arXiv preprint last December, roughly 5 months ago, and was submitted to a top economics journal.
The paper claimed to report on an experiment at a large R&D company: the productivity of a thousand materials scientists was tracked before and after the introduction of a machine learning materials generation tool. The headline result was that the AI caused a 44% increase in materials discovery at the firm, with a productivity increase of 81% for top-decile scientists.
This research was breathlessly reported on in The Atlantic, the Wall Street Journal, and the news section of Nature. Nobel economics prize winner Daron Acemoglu promoted the research and was acknowledged for his support in the paper. The preprint was shared widely and has already been cited dozens of times in the academic literature. I have seen it cited several times on this forum, most notably by @80000_Hours in their case for AGI by 2030. I myself looked at the paper and took its findings at face value, something I am now kicking myself for[1].
At some point, someone else with computational materials science expertise noticed serious issues with the paper and took their concerns to Acemoglu, who in turn raised them with MIT directly. And then everything started to fall apart. Both Acemoglu and MIT have publicly withdrawn support from the paper and urged that it be retracted. In a press release, MIT stated:
“Earlier this year, the COD conducted a confidential internal review based upon allegations it received regarding certain aspects of this paper. While student privacy laws and MIT policy prohibit the disclosure of the outcome of this review, we are writing to inform you that MIT has no confidence in the provenance, reliability or validity of the data and has no confidence in the veracity of the research contained in the paper. Based upon this finding, we also believe that the inclusion of this paper in arXiv may violate arXiv’s Code of Conduct.”
They also noted that “the author is no longer at MIT.” Reading between the lines, MIT likely found evidence that the author engaged in some form of serious academic misconduct, and that he subsequently either resigned or was kicked out.
It is my understanding that MIT is legally bound not to reveal the results of internal disciplinary investigations. As a result, the level of misconduct behind the paper is unknown, but it is presumably serious: honest mistakes and ordinary poor methodology would not warrant this level of response. It is not yet known whether the experiment described in the paper actually happened or whether the whole thing was made up.
I have seen pretty good arguments for outright fraud: the “BS detector” blog has an in-depth post pointing to serious warning signs in the paper, as does this Twitter thread from a materials science professor. I recommend reading both.
To summarize some of the problems:
The study was hugely ambitious and wide-reaching, and would have required a collaboration with one of the largest R&D companies in the world, yet its sole author is a second-year PhD student. Why wasn’t his supervisor a co-author?
The paper claimed the experiment started in 2020, multiple years before the student began his PhD. Why would a company run this large, multi-year experiment itself, then hand off all its data to a random PhD student? Wouldn’t it make more sense for the company either to keep the results to itself or to take public credit for them?
Very few companies would match the specific profile of the firm described in the paper (thousands of scientists employed, a huge range of materials, etc.).
The materials in question are widely different from each other, and as far as I know nobody would use the same ML tool to generate materials for ceramics, glasses, and polymers alike. Computational modelling of glasses in particular is incredibly difficult and is in no state to be used in a study like this.
The paper claimed to apply a highly complicated materials science technique, the comparison of “materials similarity”, yet gave very few details and displayed no sign of materials science competence.
The data was suspiciously clean and neat: nearly every sub-measure of success gave a clear and statistically significant result (see the quick sketch after this list).
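To see why that last point is a red flag, here is a back-of-the-envelope sketch. The numbers are made up for illustration and are not taken from the paper: even when every effect a study tests for is real, each individual test only reaches significance with some probability (its statistical power), so a results table where everything comes up significant should be rare even in honest work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numbers, purely for illustration (not from the paper):
# a study reports 10 sub-measures, each tested with 80% power, i.e. even
# when the underlying effect is real, each test has only an 80% chance
# of coming out statistically significant.
n_measures = 10
power = 0.80
n_sims = 100_000

# For each simulated "study", draw whether each sub-measure reaches
# significance, then check whether all of them do at once.
all_significant = (rng.random((n_sims, n_measures)) < power).all(axis=1)
print(f"P(every sub-measure significant) ~ {all_significant.mean():.3f}")
# Roughly 0.107 (i.e. 0.8^10): even with real effects everywhere,
# a fully clean table should be the exception, not the rule.
```

Real sub-measures are correlated, so this is only a heuristic, but it is the same intuition behind standard “too good to be true” checks on reported results.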
It’s still possible that there was a genuine experiment at a big company somewhere, but given the response from MIT, I would not trust the analysis of that data even if it is genuine. You should treat the informational content of this paper as zero, and undo any “updates” you have made based on it. It is not implausible that some real AI tools could speed up scientific discovery, but as yet there does not seem to be solid evidence on the topic one way or the other.
We don’t know whether this paper would have made it through peer review. It was apparently in the “revise and resubmit” stage, which means that reviewers had raised some issues with it but not rejected it entirely.
I think this is an extreme case, that misconduct of this level is rare, and that the vast majority of AI researchers are honestly trying their best. But keep in mind that for every case of outright misconduct, there are many more cases of papers with serious, hard-to-detect methodological flaws. In a previous blog post I detailed the case of another, more credible paper on AI-assisted materials discovery, which was published in one of the top research journals in the world before being mostly debunked a few years later.
I think everyone should take stock of how many of their beliefs are based on articles that are un-replicated, un-peer-reviewed, or that have not yet been seriously scrutinized by subject-matter experts.
- ^
I did notice that the paper was lacking in materials science expertise and that it wasn’t high quality, but I missed some of the more serious warning signs.
This seems substantially different from “was retracted” in the title. Also, arXiv apparently hasn’t yet followed MIT’s request to remove the paper, presumably following its own policy and waiting for the author to issue his own request.
I never read the paper and haven’t looked closely into the recent news and events around it. But I will admit I didn’t (and still don’t) find the general direction and magnitude of the results implausible, even if the actual paper has no value or validity and is fraudulent. For about a decade, leading materials informatics companies have reported that using machine learning for experimental design in materials and chemicals research reduces the number of experiments needed to reach a target level of performance by 50-70%. The now-presumably-fraudulent MIT paper mostly seemed to claim the same, but in a way that is much broader and deeper.
So: yes, given recent news we should regard this particular paper as providing essentially zero information. But if you were paying attention to prior work on AI in materials discovery, and the case studies and marketing claims made about it, then the result was also reasonably on-trend. As for the claimed effects on the people doing materials research, I have no idea; I hadn’t seen that studied before. That’s what I’m disappointed about, and I really would like to know the reality.
I think that aside from the general implausibility of the effect sizes, and of the claimed AI tech (GANs?) delivering those effect sizes across so many areas of materials, one of the odder claims people highlighted at the time was that the best users supposedly got much more of a productivity boost than the worst ones. This is pretty unusual: low performers usually get far more out of AI assistance, for obvious reasons. And this lines up with what I see anecdotally with LLMs: until very recently, possibly, they were just a lot more useful for people who aren’t very good at writing or other such work than for people like me who are.
Also picked up here