Benchmarking

Papers

Title	Date	Tasks	Status
AlphaZip: Neural Network-Enhanced Lossless Text Compression	Sep 23, 2024	Benchmarking, Data Compression	Code Available
Towards Ground-truth-free Evaluation of Any Segmentation in Medical Images	Sep 23, 2024	Benchmarking, Segmentation	Code Available
Building a continuous benchmarking ecosystem in bioinformatics	Sep 23, 2024	Benchmarking	Unverified
Benchmarking Edge AI Platforms for High-Performance ML Inference	Sep 23, 2024	Benchmarking, CPU	Unverified
Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking	Sep 23, 2024	Benchmarking, Diversity	Code Available
The Ability of Large Language Models to Evaluate Constraint-satisfaction in Agent Responses to Open-ended Requests	Sep 22, 2024	Benchmarking	Unverified
Sketch 'n Solve: An Efficient Python Package for Large-Scale Least Squares Using Randomized Numerical Linear Algebra	Sep 22, 2024	Benchmarking	Unverified
Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance	Sep 22, 2024	AutoML, Benchmarking	Code Available
Margin-bounded Confidence Scores for Out-of-Distribution Detection	Sep 22, 2024	Autonomous Driving, Benchmarking	Code Available
@Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology	Sep 21, 2024	Benchmarking, Depth Estimation	Unverified
Present and Future Generalization of Synthetic Image Detectors	Sep 21, 2024	Benchmarking, Diversity	Code Available
Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators	Sep 21, 2024	Benchmarking	Code Available
An Evolutionary Algorithm For the Vehicle Routing Problem with Drones with Interceptions	Sep 21, 2024	Benchmarking, Scheduling	Unverified
CONGRA: Benchmarking Automatic Conflict Resolution	Sep 21, 2024	Benchmarking	Code Available
Efficient and Effective Model Extraction	Sep 21, 2024	Benchmarking, model	Code Available
Time and Tokens: Benchmarking End-to-End Speech Dysfluency Detection	Sep 20, 2024	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time	Sep 20, 2024	Benchmarking, World Knowledge	Unverified
STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions	Sep 20, 2024	Benchmarking, Sensitivity	Code Available
CI-Bench: Benchmarking Contextual Integrity of AI Assistants on Synthetic Data	Sep 20, 2024	Benchmarking, Language Modeling	Unverified
Robust Salient Object Detection on Compressed Images Using Convolutional Neural Networks	Sep 20, 2024	Benchmarking, object-detection	Unverified
Arena 4.0: A Comprehensive ROS2 Development and Benchmarking Platform for Human-centric Navigation Using Generative-Model-based Environment Generation	Sep 19, 2024	Benchmarking, Social Navigation	Unverified
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines	Sep 19, 2024	Benchmarking	Unverified
Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards	Sep 19, 2024	Benchmarking	Code Available
ASR Benchmarking: Need for a More Representative Conversational Dataset	Sep 18, 2024	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Code Available
Efficacy of Synthetic Data as a Benchmark	Sep 18, 2024	Benchmarking, Few-Shot Learning	Unverified

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified