SOTAVerified|Agents Browse Leaderboard About Blog

Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1981–1990 of 5548 papers

Title	Date	Tasks	Status	Hype
Beyond Chains of Thought: Benchmarking Latent-Space Reasoning Abilities in Large Language Models	Apr 14, 2025	BenchmarkingDescriptive	—Unverified	0
Trade-offs in Privacy-Preserving Eye Tracking through Iris Obfuscation: A Benchmarking Study	Apr 14, 2025	BenchmarkingGaze Estimation	CodeCode Available	0
Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design	Apr 14, 2025	BenchmarkingLanguage Modeling	—Unverified	0
NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding	Apr 12, 2025	BenchmarkingDocument AI	—Unverified	0
SortBench: Benchmarking LLMs based on their ability to sort lists	Apr 11, 2025	Benchmarking	—Unverified	0
TP-RAG: Benchmarking Retrieval-Augmented Large Language Model Agents for Spatiotemporal-Aware Travel Planning	Apr 11, 2025	BenchmarkingLanguage Modeling	—Unverified	0
NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark	Apr 10, 2025	Benchmarking	CodeCode Available	0
Geological Inference from Textual Data using Word Embeddings	Apr 10, 2025	BenchmarkingWord Embeddings	CodeCode Available	0
Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge	Apr 10, 2025	Adversarial RobustnessBenchmarking	CodeCode Available	0
Adaptive Shrinkage Estimation For Personalized Deep Kernel Regression In Modeling Brain Trajectories	Apr 10, 2025	Additive modelsBenchmarking	CodeCode Available	0

Show:10 25 50

← PrevPage 199 of 555Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified