AI Model Evaluation

MMLU Benchmark (Multi-task Language Understanding) | Papers With Code - 智海流光AI导航网

According to Papers With Code, the current state of the art on MMLU is Gemini Ultra ~1760B; the site offers a full comparison of 109 papers with code.

cluebenchmarks.com/static/superclue.html

OpenCompass司南 - Evaluation Leaderboard

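The leaderboard aims to provide comprehensive, objective, and neutral scores and rankings for large language models and multimodal models. It also reports scores across multiple capability dimensions, so that users can form a more complete picture of what each large model can do.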
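As a rough illustration of how per-dimension scores can be rolled up into a single ranking, here is a minimal Python sketch. It assumes an unweighted mean across dimensions. The model names, dimension names, and scores are hypothetical placeholders, and this is not OpenCompass's actual scoring method.

```python
from statistics import mean

# Hypothetical per-dimension scores (0-100). Model names, dimensions, and
# values are placeholders, not real leaderboard data.
scores = {
    "model_a": {"knowledge": 70.0, "reasoning": 60.0, "language": 80.0},
    "model_b": {"knowledge": 65.0, "reasoning": 75.0, "language": 78.0},
    "model_c": {"knowledge": 55.0, "reasoning": 50.0, "language": 70.0},
}


def overall_ranking(scores):
    """Return (model, overall) pairs sorted best first.

    The overall score is the unweighted mean of the per-dimension scores,
    an assumption of this sketch.
    """
    overall = {model: mean(dims.values()) for model, dims in scores.items()}
    return sorted(overall.items(), key=lambda item: item[1], reverse=True)


# Print the ranking while keeping the per-dimension scores available for inspection.
for rank, (model, score) in enumerate(overall_ranking(scores), start=1):
    print(f"{rank}. {model}: {score:.2f}")
```

With these placeholder numbers, model_b ranks first (72.67), followed by model_a (70.00) and model_c (58.33). An unweighted mean is only one possible choice. A real leaderboard may weight dimensions differently or normalize scores before combining them.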

haonan-li/CMMLU: CMMLU: Measuring massive multitask language understanding in Chinese

MMBench

Holistic Evaluation of Language Models (HELM)

The Holistic Evaluation of Language Models (HELM) serves as a living benchmark for transparency in language models. It emphasizes broad coverage (while acknowledging that coverage is incomplete), multi-metric measurement, and standardization. All data and analysis are freely available on the website for exploration and study.
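HELM's leaderboard summarizes results across many scenarios with a "mean win rate". The Python sketch below illustrates that idea under stated assumptions:

- Within each scenario, a model's win rate is the fraction of other models it beats on one metric where higher is better.
- Ties count as half a win. This is an assumption.
- The per-scenario rates are averaged.

Scenario names, model names, and numbers are hypothetical. HELM's own implementation may differ in details such as tie handling, metric direction, and missing results.

```python
# Hypothetical results: one metric (higher is better) per scenario and model.
# Scenario names, model names, and values are placeholders, not HELM data.
results = {
    "scenario_1": {"model_a": 0.80, "model_b": 0.70, "model_c": 0.60},
    "scenario_2": {"model_a": 0.50, "model_b": 0.90, "model_c": 0.40},
    "scenario_3": {"model_a": 0.70, "model_b": 0.60, "model_c": 0.65},
}


def mean_win_rate(results):
    """Average, over scenarios, of the fraction of other models each model beats.

    Ties count as half a win (an assumption of this sketch).
    """
    totals, counts = {}, {}
    for per_model in results.values():
        for model, score in per_model.items():
            others = [other for other in per_model if other != model]
            if not others:
                continue  # a win rate is undefined with no competitors
            wins = 0.0
            for other in others:
                if score > per_model[other]:
                    wins += 1.0
                elif score == per_model[other]:
                    wins += 0.5
            totals[model] = totals.get(model, 0.0) + wins / len(others)
            counts[model] = counts.get(model, 0) + 1
    return {model: totals[model] / counts[model] for model in totals}


# Print models from highest to lowest mean win rate.
for model, rate in sorted(mean_win_rate(results).items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {rate:.3f}")
```

With these placeholder numbers, model_a ranks first (about 0.833), followed by model_b (0.500) and model_c (about 0.167). A win rate depends only on pairwise orderings, so it is insensitive to differences in scale between metrics. The trade-off is that it discards how large each margin is.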

