Benchmarking

Papers

Showing 1631–1640 of 5548 papers

Title	Date	Tasks	Status	Hype
Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time	Sep 20, 2024	Benchmarking, World Knowledge	Unverified	0
Robust Salient Object Detection on Compressed Images Using Convolutional Neural Networks	Sep 20, 2024	Benchmarking, object-detection	Unverified	0
YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models	Sep 20, 2024	Benchmarking, Image Captioning	Code Available	1
STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions	Sep 20, 2024	Benchmarking, Sensitivity	Code Available	0
CI-Bench: Benchmarking Contextual Integrity of AI Assistants on Synthetic Data	Sep 20, 2024	Benchmarking, Language Modeling	Unverified	0
Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards	Sep 19, 2024	Benchmarking	Code Available	0
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines	Sep 19, 2024	Benchmarking	Unverified	0
Arena 4.0: A Comprehensive ROS2 Development and Benchmarking Platform for Human-centric Navigation Using Generative-Model-based Environment Generation	Sep 19, 2024	Benchmarking, Social Navigation	Unverified	0
Hard-Label Cryptanalytic Extraction of Neural Network Models	Sep 18, 2024	Benchmarking	Code Available	0
PARAPHRASUS : A Comprehensive Benchmark for Evaluating Paraphrase Detection Models	Sep 18, 2024	Benchmarking, Model Selection	Code Available	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified