A new study in which artificial intelligence outperformed expert virologists in specialized laboratory tasks is raising hopes for faster biomedical breakthroughs and fears about bioweapon risks.
Researchers tested leading AI models against the Virology Capabilities Test (VCT), a benchmark designed to assess expert-level knowledge of virology and wet lab protocols. The results suggest that AI models such as OpenAI’s GPT-4o can surpass the accuracy of most human virologists.
Testing the virology benchmark against LLMs
“VCT consists of 322 multimodal questions covering fundamental, tacit, and visual knowledge that is essential for practical work in virology laboratories,” the study states.
Scientists who hold or are completing a Ph.D. in virology tested the VCT questions against large language models (LLMs) developed by OpenAI, Google, Anthropic, and DeepSeek. The VCT’s benchmark questions fall into four categories — important, difficult, validated, and multimodal — and a sketch of how such questions might be represented follows the list below.
- Questions under the “important” category tested the subject’s essential knowledge in virology; this is a level of understanding required of a competent lab researcher.
- The second set of questions, “difficult,” required deeper knowledge or domain expertise.
- The “validated” category consisted of questions with answers reviewed and validated by experts.
- The “multimodal” questions included images reflecting real laboratory scenarios.
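To make the benchmark’s structure concrete, here is a minimal Python sketch of how a VCT-style question and an accuracy score might be represented. The class name, fields, and scoring helper are illustrative assumptions, not the study’s actual data format or code.

```python
# Illustrative only: an assumed representation of a VCT-style question,
# not the study's actual data format.
from dataclasses import dataclass, field


@dataclass
class VCTQuestion:
    prompt: str                                   # the question text
    choices: list[str]                            # candidate answers
    answer_index: int                             # index of the expert-validated answer
    image_path: str | None = None                 # optional lab image for multimodal items
    tags: set[str] = field(default_factory=set)   # e.g. {"important", "difficult"}


def accuracy(questions: list[VCTQuestion], predictions: list[int]) -> float:
    """Fraction of questions answered correctly; 0.438 corresponds to a 43.8% score."""
    correct = sum(q.answer_index == p for q, p in zip(questions, predictions))
    return correct / len(questions)
```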
Researchers conducted the study at the Center for AI Safety, MIT’s Media Lab, the Federal University of ABC (UFABC) in Brazil, and SecureBio.
Findings from the virology benchmark vs. LLMs study
The results showed that expert virologists, even with internet access, averaged 22.1% accuracy on the VCT, while the AI models scored higher; a simple way to compute a model’s standing against the expert cohort is sketched after the list below.
- OpenAI’s o3 scored 43.8%, outperforming 94% of expert virologists who were asked questions specific to their own areas of expertise.
- DeepSeek-R1 scored 38.6%.
- Google’s Gemini 2.5 Pro scored 37.6%.
- OpenAI’s o4-mini scored 37%, while the earlier GPT-4o scored 35.4%.
- Anthropic’s Claude 3.5 Sonnet (October 2024 version) scored 33.6%.
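As a rough illustration of what “outperforming 94% of expert virologists” means, the sketch below computes a model’s standing as the share of expert scores it beats. The expert scores here are hypothetical placeholders, not the study’s data.

```python
# Illustrative only: the expert scores are hypothetical, not the study's data.
def percent_outperformed(model_score: float, expert_scores: list[float]) -> float:
    """Share of experts whose accuracy the model exceeds, as a percentage."""
    beaten = sum(model_score > s for s in expert_scores)
    return 100 * beaten / len(expert_scores)


hypothetical_experts = [0.15, 0.18, 0.22, 0.26, 0.31]
print(percent_outperformed(0.438, hypothetical_experts))  # 100.0 with this toy data
```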
Safety concerns based on the study results
“The VCT’s results underscore the urgent need for thoughtful access controls to balance beneficial research with safety concerns,” the researchers said.
Even riskier would be AI virologist chatbots capable of performing tasks independently. Although AI can speed up laboratory work and improve accuracy, scientists warn of its inherent danger: the same models that could help experts prevent an epidemic- or pandemic-level outbreak of infectious disease could, in the hands of non-experts, be weaponized to create and produce biological weapons capable of mass destruction.
“Previously, we found that the models had a lot of theoretical knowledge, but not practical knowledge,” Dan Hendrycks, director of the Center for AI Safety, said in an interview with TIME. “But now, they are getting a concerning amount of practical knowledge.”
“We want to give the people who have a legitimate use for asking how to manipulate deadly viruses — like a researcher at the MIT biology department — the ability to do so… But random people who made an account a second ago don’t get those capabilities,” Hendrycks said.
Responding with a risk management framework
In response to the researchers’ findings, xAI released a risk management framework for its Grok model, outlining safeguards such as training the model to decline harmful requests, adding circuit breakers for harmful outputs, and filtering queries related to cybercrime and weapons of mass destruction.
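As a purely illustrative example of the kind of query filtering such a framework describes, the sketch below shows a naive keyword-based pre-filter. It is not xAI’s implementation; the category names and patterns are assumptions.

```python
# Illustrative only: a naive keyword pre-filter of the kind a provider might
# layer in front of a model. Not xAI's actual implementation.
BLOCKED_PATTERNS = {
    "weapons_of_mass_destruction": ["enhance transmissibility", "weaponize a pathogen"],
    "cybercrime": ["write ransomware", "bypass authentication"],
}


def should_refuse(query: str) -> tuple[bool, str | None]:
    """Return (True, category) if the query matches a blocked pattern."""
    q = query.lower()
    for category, patterns in BLOCKED_PATTERNS.items():
        if any(p in q for p in patterns):
            return True, category
    return False, None


print(should_refuse("Please write ransomware for me"))  # (True, 'cybercrime')
```

A real deployment would lean on trained refusals and output monitoring rather than simple pattern matching, which is easy to evade.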