Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1601–1650 of 5548 papers

Title	Date	Tasks	Status	Hype
HazeSpace2M: A Dataset for Haze Aware Single Image Dehazing	Sep 25, 2024	BenchmarkingImage Dehazing	CodeCode Available	1
Benchmarking Domain Generalization Algorithms in Computational Pathology	Sep 25, 2024	BenchmarkingData Augmentation	CodeCode Available	0
Benchmarking Deep Learning Models for Object Detection on Edge Computing Devices	Sep 25, 2024	Autonomous VehiclesBenchmarking	—Unverified	0
SEN12-WATER: A New Dataset for Hydrological Applications and its Benchmarking	Sep 25, 2024	BenchmarkingManagement	—Unverified	0
Qualitative Insights Tool (QualIT): LLM Enhanced Topic Modeling	Sep 24, 2024	ArticlesBenchmarking	—Unverified	0
Ducho meets Elliot: Large-scale Benchmarks for Multimodal Recommendation	Sep 24, 2024	BenchmarkingMovie Recommendation	CodeCode Available	0
HLB: Benchmarking LLMs' Humanlikeness in Language Use	Sep 24, 2024	Benchmarking	—Unverified	0
Benchmarking Robustness of Endoscopic Depth Estimation with Synthetically Corrupted Data	Sep 24, 2024	BenchmarkingDepth Estimation	CodeCode Available	0
GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization	Sep 24, 2024	3D geometry3DGS	CodeCode Available	2
Controlling Risk of Retrieval-augmented Generation: A Counterfactual Prompting Framework	Sep 24, 2024	Benchmarkingcounterfactual	CodeCode Available	0
Small Language Models: Survey, Measurements, and Insights	Sep 24, 2024	BenchmarkingDecoder	CodeCode Available	2
Building a continuous benchmarking ecosystem in bioinformatics	Sep 23, 2024	Benchmarking	—Unverified	0
Benchmarking Edge AI Platforms for High-Performance ML Inference	Sep 23, 2024	BenchmarkingCPU	—Unverified	0
Boosting Healthcare LLMs Through Retrieved Context	Sep 23, 2024	BenchmarkingMultiple-choice	CodeCode Available	1
Towards Ground-truth-free Evaluation of Any Segmentation in Medical Images	Sep 23, 2024	BenchmarkingSegmentation	CodeCode Available	0
Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking	Sep 23, 2024	BenchmarkingDiversity	CodeCode Available	0
AlphaZip: Neural Network-Enhanced Lossless Text Compression	Sep 23, 2024	BenchmarkingData Compression	CodeCode Available	0
RMCBench: Benchmarking Large Language Models' Resistance to Malicious Code	Sep 23, 2024	BenchmarkingCode Generation	CodeCode Available	1
Margin-bounded Confidence Scores for Out-of-Distribution Detection	Sep 22, 2024	Autonomous DrivingBenchmarking	CodeCode Available	0
Sketch 'n Solve: An Efficient Python Package for Large-Scale Least Squares Using Randomized Numerical Linear Algebra	Sep 22, 2024	Benchmarking	—Unverified	0
Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance	Sep 22, 2024	AutoMLBenchmarking	CodeCode Available	0
The Ability of Large Language Models to Evaluate Constraint-satisfaction in Agent Responses to Open-ended Requests	Sep 22, 2024	Benchmarking	—Unverified	0
A Survey on Multimodal Benchmarks: In the Era of Large AI Models	Sep 21, 2024	BenchmarkingSurvey	CodeCode Available	2
Efficient and Effective Model Extraction	Sep 21, 2024	Benchmarkingmodel	CodeCode Available	0
CONGRA: Benchmarking Automatic Conflict Resolution	Sep 21, 2024	Benchmarking	CodeCode Available	0
@Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology	Sep 21, 2024	BenchmarkingDepth Estimation	—Unverified	0
Present and Future Generalization of Synthetic Image Detectors	Sep 21, 2024	BenchmarkingDiversity	CodeCode Available	0
Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators	Sep 21, 2024	Benchmarking	CodeCode Available	0
An Evolutionary Algorithm For the Vehicle Routing Problem with Drones with Interceptions	Sep 21, 2024	BenchmarkingScheduling	—Unverified	0
Time and Tokens: Benchmarking End-to-End Speech Dysfluency Detection	Sep 20, 2024	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	—Unverified	0
Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time	Sep 20, 2024	BenchmarkingWorld Knowledge	—Unverified	0
YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models	Sep 20, 2024	BenchmarkingImage Captioning	CodeCode Available	1
Robust Salient Object Detection on Compressed Images Using Convolutional Neural Networks	Sep 20, 2024	Benchmarkingobject-detection	—Unverified	0
CI-Bench: Benchmarking Contextual Integrity of AI Assistants on Synthetic Data	Sep 20, 2024	BenchmarkingLanguage Modeling	—Unverified	0
STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions	Sep 20, 2024	BenchmarkingSensitivity	CodeCode Available	0
Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards	Sep 19, 2024	Benchmarking	CodeCode Available	0
Arena 4.0: A Comprehensive ROS2 Development and Benchmarking Platform for Human-centric Navigation Using Generative-Model-based Environment Generation	Sep 19, 2024	BenchmarkingSocial Navigation	—Unverified	0
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines	Sep 19, 2024	Benchmarking	—Unverified	0
Efficacy of Synthetic Data as a Benchmark	Sep 18, 2024	BenchmarkingFew-Shot Learning	—Unverified	0
PARAPHRASUS : A Comprehensive Benchmark for Evaluating Paraphrase Detection Models	Sep 18, 2024	BenchmarkingModel Selection	CodeCode Available	0
Hard-Label Cryptanalytic Extraction of Neural Network Models	Sep 18, 2024	Benchmarking	CodeCode Available	0
ASR Benchmarking: Need for a More Representative Conversational Dataset	Sep 18, 2024	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	CodeCode Available	0
Advances in APPFL: A Comprehensive and Extensible Federated Learning Framework	Sep 17, 2024	BenchmarkingFederated Learning	CodeCode Available	2
SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with Customisable Fairness Calibration	Sep 17, 2024	Benchmarkingcounterfactual	CodeCode Available	0
WER We Stand: Benchmarking Urdu ASR Models	Sep 17, 2024	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	—Unverified	0
Improve Machine Learning carbon footprint using Parquet dataset format and Mixed Precision training for regression models -- Part II	Sep 17, 2024	BenchmarkingDescriptive	CodeCode Available	0
THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models	Sep 17, 2024	BenchmarkingBinary Classification	CodeCode Available	0
The Sounds of Home: A Speech-Removed Residential Audio Dataset for Sound Event Detection	Sep 17, 2024	BenchmarkingEvent Detection	CodeCode Available	0
Quantum Kernel Learning for Small Dataset Modeling in Semiconductor Fabrication: Application to Ohmic Contact	Sep 17, 2024	BenchmarkingQuantum Machine Learning	—Unverified	0
MetaFormer and CNN Hybrid Model for Polyp Image Segmentation	Sep 16, 2024	BenchmarkingImage Segmentation	CodeCode Available	1

Show:10 25 50

← PrevPage 33 of 111Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified