Language Model Evaluation

Holistic evaluation of large language models for medical tasks with MedHELM

LLMs have shown impressive performance on medical knowledge benchmarks, achieving ~99% accuracy on standardized exams like MedQA 1. This has sparked interest in deploying them in healthcare settings: ...

Nature

Automating expert-level medical reasoning evaluation of large language models

As large language models (LLMs) become increasingly integrated into clinical decision-making, ensuring trustworthy reasoning is paramount. However, current evaluation strategies of LLMs’ medical ...

Android Police

The Stanford Holistic Evaluation of Language Models and its AI research explained

Zach was an Author at Android Police from January 2022 to June 2025. He specialized in Chromebooks, Android smartphones, Android apps, smart home devices, and Android services. Zach loves unique and ...

Forbes

Why Human Evaluation Matters When Choosing The Right AI Model For Your Business

As enterprises increasingly integrate AI across their operations, the stakes for selecting the right model have never been higher and many technology leaders lean heavily on standard industry ...

Forbes

The Importance Of Evaluation In The Reinforcement Learning Revolution

David Shan is the Co-Founder and CTO of Clado, which trains in-house small language models to build the best people search algorithm. We celebrate RL breakthroughs, but behind the hype lies a brittle ...

InfoWorld

Microsoft open sources AI evaluation framework for enterprise agents

A new tool enters a growing AI testing market as analysts say most organizations still do not evaluate agent behavior before ...

Frontier AI models don't just delete document content — they rewrite it, and the errors are nearly impossible to catch

Frontier AI models corrupt 25% of document content in multi-step workflows — rewriting rather than deleting, which makes the ...

Wamda

Arabic.AI partners with Stanford to introduce HELM Arabic Enterprise

Arabic.AI, a regional leader in Arabic artificial intelligence and enterprise technology, announced the launch of ...

InfoWorld

Large language models: The foundations of generative AI

Large language models evolved alongside deep-learning neural networks and are critical to generative AI. Here's a first look, including the top LLMs and what they're used for today. Large language ...
