Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1426–1450 of 5548 papers

Title	Date	Tasks	Status	Hype
MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding	Oct 25, 2024	Benchmarkingdocument understanding	—Unverified	0
A Survey of Small Language Models	Oct 25, 2024	BenchmarkingModel Compression	—Unverified	0
OReole-FM: successes and challenges toward billion-parameter foundation models for high-resolution satellite imagery	Oct 25, 2024	Benchmarkingimage-classification	—Unverified	0
FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs	Oct 25, 2024	BenchmarkingFairness	—Unverified	0
An Auditing Test To Detect Behavioral Shift in Language Models	Oct 25, 2024	BenchmarkingChange Detection	CodeCode Available	0
CoqPilot, a plugin for LLM-based generation of proofs	Oct 25, 2024	Benchmarking	CodeCode Available	2
AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios	Oct 25, 2024	BenchmarkingDiversity	CodeCode Available	1
Open6DOR: Benchmarking Open-instruction 6-DoF Object Rearrangement and A VLM-based Approach	Oct 24, 2024	BenchmarkingInstruction Following	CodeCode Available	2
Conditional diffusions for amortized neural posterior estimation	Oct 24, 2024	Bayesian InferenceBenchmarking	CodeCode Available	0
Benchmarking Graph Learning for Drug-Drug Interaction Prediction	Oct 24, 2024	BenchmarkingGraph Learning	—Unverified	0
From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical Problems	Oct 24, 2024	BenchmarkingCommon Sense Reasoning	—Unverified	0
Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework	Oct 24, 2024	BenchmarkingDiversity	CodeCode Available	0
Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances	Oct 24, 2024	BenchmarkingImage to Video Generation	CodeCode Available	3
Benchmarking Foundation Models on Exceptional Cases: Dataset Creation and Validation	Oct 23, 2024	ArticlesBenchmarking	CodeCode Available	0
Benchmarking Floworks against OpenAI & Anthropic: A Novel Framework for Enhanced LLM Function Calling	Oct 23, 2024	Benchmarking	—Unverified	0
FuzzWiz -- Fuzzing Framework for Efficient Hardware Coverage	Oct 23, 2024	Benchmarking	—Unverified	0
Benchmarking Large Language Models for Image Classification of Marine Mammals	Oct 22, 2024	Benchmarkingimage-classification	CodeCode Available	0
VoiceBench: Benchmarking LLM-Based Voice Assistants	Oct 22, 2024	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	CodeCode Available	3
Benchmarking Smoothness and Reducing High-Frequency Oscillations in Continuous Control Policies	Oct 22, 2024	Benchmarkingcontinuous-control	—Unverified	0
Benchmarking Multi-Scene Fire and Smoke Detection	Oct 22, 2024	Benchmarking	CodeCode Available	1
ISImed: A Framework for Self-Supervised Learning using Intrinsic Spatial Information in Medical Images	Oct 22, 2024	BenchmarkingSelf-Supervised Learning	CodeCode Available	0
Safe Load Balancing in Software-Defined-Networking	Oct 22, 2024	BenchmarkingDeep Reinforcement Learning	—Unverified	0
Polyp-E: Benchmarking the Robustness of Deep Segmentation Models via Polyp Editing	Oct 22, 2024	AttributeBenchmarking	—Unverified	0
Building Conformal Prediction Intervals with Approximate Message Passing	Oct 21, 2024	BenchmarkingConformal Prediction	CodeCode Available	0
Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following	Oct 21, 2024	BenchmarkingInstruction Following	CodeCode Available	2

Show:10 25 50

← PrevPage 58 of 222Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified