SOTAVerified

Benchmarking

Papers

Showing 1451–1500 of 5548 papers

Title | Status | Hype
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style | Code | 2
Benchmarking Pathology Foundation Models: Adaptation Strategies and Scenarios | Code | 0
Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following | Code | 2
Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping | | 0
A Framework for Evaluating Predictive Models Using Synthetic Image Covariates and Longitudinal Data | | 0
Comprehensive benchmarking of large language models for RNA secondary structure prediction | Code | 1
Dynamic Intelligence Assessment: Benchmarking LLMs on the Road to AGI with a Focus on Model Confidence | | 0
IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning | Code | 2
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation | Code | 2
FlexMol: A Flexible Toolkit for Benchmarking Molecular Relational Learning | Code | 0
Advancing Histopathology with Deep Learning Under Data Scarcity: A Decade in Review | | 0
LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs | | 0
Benchmarking Deep Reinforcement Learning for Navigation in Denied Sensor Environments | Code | 1
MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems | Code | 1
Benchmarking Transcriptomics Foundation Models for Perturbation Analysis : one PCA still rules them all | Code | 1
UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models | Code | 0
Sum Secrecy Rate Maximization for Full Duplex ISAC Systems | | 0
Trust but Verify: Programmatic VLM Evaluation in the Wild | | 0
Ab Initio Nonparametric Variable Selection for Scalable Symbolic Regression with Large p | Code | 0
debiaSAE: Benchmarking and Mitigating Vision-Language Model Bias | Code | 0
ORCHID: A Chinese Debate Corpus for Target-Independent Stance Detection and Argumentative Dialogue Summarization | Code | 0
Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs | Code | 0
WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation | Code | 1
Understanding the Role of LLMs in Multimodal Evaluation Benchmarks | Code | 0
Configurable Embodied Data Generation for Class-Agnostic RGB-D Video Segmentation | | 0
AERO: Softmax-Only LLMs for Efficient Private Inference | | 0
Benchmarking Defeasible Reasoning with Large Language Models -- Initial Experiments and Future Directions | | 0
Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs | | 0
MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from Microwatts to Megawatts for Sustainable AI | Code | 4
Benchmarking Data Efficiency in Δ-ML and Multifidelity Models for Quantum Chemistry | Code | 0
Analysis and Benchmarking of Extending Blind Face Image Restoration to Videos | | 0
FoundTS: Comprehensive and Unified Benchmarking of Foundation Models for Time Series Forecasting | | 0
RClicks: Realistic Click Simulation for Benchmarking Interactive Segmentation | Code | 1
The Trap of Presumed Equivalence: Artificial General Intelligence Should Not Be Assessed on the Scale of Human Intelligence | | 0
Personalised Feedback Framework for Online Education Programmes Using Generative AI | | 0
ChakmaNMT: A Low-resource Machine Translation On Chakma Language | | 0
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory | Code | 3
Revisiting and Benchmarking Graph Autoencoders: A Contrastive Learning Perspective | Code | 0
Building a Multivariate Time Series Benchmarking Datasets Inspired by Natural Language Processing (NLP) | | 0
SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing | Code | 0
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | Code | 1
Transforming Game Play: A Comparative Study of DCQN and DTQN Architectures in Reinforcement Learning | | 0
RMB: Comprehensively Benchmarking Reward Models in LLM Alignment | Code | 1
LLM-Based Multi-Agent Systems are Scalable Graph Generative Models | Code | 2
LoLI-Street: Benchmarking Low-Light Image Enhancement and Beyond | Code | 1
Yesterday's News: Benchmarking Multi-Dimensional Out-of-Distribution Generalisation of Misinformation Detection Models | Code | 0
LexSumm and LexT5: Benchmarking and Modeling Legal Summarization Tasks in English | Code | 0
FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs' Responsiveness to Human Feedback | Code | 0
A Comparative Analysis on Ethical Benchmarking in Large Language Models | | 0
Enterprise Benchmarks for Large Language Model Evaluation | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified