Large Language Models Benchmarks

Ai2 releases Olmo 3 open models, rivaling Meta, DeepSeek and others on performance and efficiency

The Allen Institute for AI (Ai2) unveiled Olmo 3, a new generation of open language models that it says outperforms rivals ...

SiliconANGLE

MLCommons releases new AILuminate benchmark for measuring AI model safety

MLCommons today released AILuminate, a new benchmark test for evaluating the safety of large language models. Launched in 2020, MLCommons is an industry consortium backed by several dozen tech firms.

Alibaba's AgentEvolver lifts model performance in tool use by ~30% using synthetic, auto-generated tasks

The new framework from Tongyi Lab enables agents to create their own training data by exploring and interacting with new ...

Morning Overview on MSN

New AI benchmark checks if chatbots protect human well-being

Artificial intelligence systems are increasingly woven into everyday decisions about health, money and work, yet most tests ...

8d on MSN

Elon Musk’s xAI Grok 4.1 Gets Big Upgrade: Check Features, Benchmarks And How To Use It

Elon Musk's xAI has launched Grok 4.1, an upgraded AI model that significantly enhances speed, stability, and answer accuracy ...

The Battalion

‘The future of AI is here’: How Texas A&M students and faculty use large language models in research, classrooms

This year, Stanford University organized Agents4Science, the first open conference to accept papers written entirely by ...

Anthropic introduces cheaper, more powerful, more efficient Opus 4.5 model

Anthropic today released Opus 4.5, its flagship frontier model, and it brings improvements in coding performance, as well as ...

InfoWorld

How to test large language models

Companies investing in generative AI find that testing and quality assurance are two of the most critical areas for improvement. Here are four strategies for testing LLMs embedded in generative AI ...
