Large Language Model Benchmarks

Ai2 releases Olmo 3 open models, rivaling Meta, DeepSeek and others on performance and efficiency

The Allen Institute for AI (Ai2) unveiled Olmo 3, a new generation of open language models that it says outperforms rivals ...

SiliconANGLE

MLCommons releases new AILuminate benchmark for measuring AI model safety

MLCommons today released AILuminate, a new benchmark test for evaluating the safety of large language models. Launched in 2020, MLCommons is an industry consortium backed by several dozen tech firms.

Alibaba's AgentEvolver lifts model performance in tool use by ~30% using synthetic, auto-generated tasks

The new framework from Tongyi Lab enables agents to create their own training data by exploring and interacting with new ...

Morning Overview on MSN

New AI benchmark checks if chatbots protect human well-being

Artificial intelligence systems are increasingly woven into everyday decisions about health, money and work, yet most tests ...

8d on MSN

Elon Musk’s xAI Grok 4.1 Gets Big Upgrade: Check Features, Benchmarks And How To Use It

Elon Musk's xAI has launched Grok 4.1, an upgraded AI model that significantly enhances speed, stability, and answer accuracy ...

Gizmodo

AI Capabilities May Be Overhyped on Bogus Benchmarks, Study Finds

You know all of those reports about artificial intelligence models successfully passing the bar or achieving Ph.D.-level intelligence? Looks like we should start taking those degrees back. A new study ...

The Battalion

‘The future of AI is here’: How Texas A&M students and faculty use large language models in research, classrooms

This year, Stanford University organized Agents4Science, the first open conference to accept papers written entirely by ...

Anthropic introduces cheaper, more powerful, more efficient Opus 4.5 model

Anthropic today released Opus 4.5, its flagship frontier model, and it brings improvements in coding performance, as well as ...

eWeek

9 Best Large Language Models (2025) For Your Tech Stack

eSpeaks host Corey Noles sits down with Qualcomm's Craig Tellalian to explore a workplace computing transformation: the rise of AI-ready PCs. Matt Hillary, VP of Security and CISO at Drata, details ...

InfoWorld

How to test large language models

Companies investing in generative AI find that testing and quality assurance are two of the most critical areas for improvement. Here are four strategies for testing LLMs embedded in generative AI ...
