Benchmarking

Papers

Showing 2601–2650 of 5548 papers

Title	Date	Tasks	Status
Image2Struct: Benchmarking Structure Extraction for Vision-Language Models	Oct 29, 2024	Benchmarking	Unverified
SS3DM: Benchmarking Street-View Surface Reconstruction with a Synthetic 3D Mesh Dataset	Oct 29, 2024	3D Reconstruction, Autonomous Driving	Unverified
AI Cyber Risk Benchmark: Automated Exploitation Capabilities	Oct 29, 2024	Benchmarking, Vulnerability Detection	Unverified
Benchmarking LLM Guardrails in Handling Multilingual Toxicity	Oct 29, 2024	Benchmarking	Unverified
Benchmarking Human and Automated Prompting in the Segment Anything Model	Oct 29, 2024	Benchmarking, Image Segmentation	Code Available
CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants	Oct 28, 2024	Benchmarking	Code Available
CODES: Benchmarking Coupled ODE Surrogates	Oct 28, 2024	Benchmarking, Uncertainty Quantification	Code Available
NewTerm: Benchmarking Real-Time New Terms for Large Language Models with Annual Updates	Oct 28, 2024	Benchmarking	Code Available
BongLLaMA: LLaMA for Bangla Language	Oct 28, 2024	Benchmarking, Data Augmentation	Unverified
LLM-initialized Differentiable Causal Discovery	Oct 28, 2024	Benchmarking, Causal Discovery	Unverified
Hierarchical Knowledge Graph Construction from Images for Scalable E-Commerce	Oct 28, 2024	Benchmarking, graph construction	Unverified
AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?	Oct 28, 2024	Benchmarking, Question Answering	Code Available
Exploring Capabilities of Time Series Foundation Models in Building Analytics	Oct 28, 2024	Benchmarking, energy management	Unverified
Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training	Oct 28, 2024	Benchmarking, Language Modeling	Unverified
Project MPG: towards a generalized performance benchmark for LLM capabilities	Oct 28, 2024	Benchmarking, Chatbot	Unverified
Sequential Large Language Model-Based Hyper-parameter Optimization	Oct 27, 2024	Bayesian Optimization, Benchmarking	Code Available
Multi-input Multi-output Loewner Framework for Vibration-based Damage Detection on a Trainer Jet	Oct 26, 2024	Benchmarking, Cantilever Beam	Unverified
AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels	Oct 26, 2024	Benchmarking, Information Retrieval	Code Available
SFTrack: A Robust Scale and Motion Adaptive Algorithm for Tracking Small and Fast Moving Objects	Oct 26, 2024	Benchmarking, Multi-Object Tracking	Unverified
An Auditing Test To Detect Behavioral Shift in Language Models	Oct 25, 2024	Benchmarking, Change Detection	Code Available
MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding	Oct 25, 2024	Benchmarking, document understanding	Unverified
FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs	Oct 25, 2024	Benchmarking, Fairness	Unverified
A Survey of Small Language Models	Oct 25, 2024	Benchmarking, Model Compression	Unverified
OReole-FM: successes and challenges toward billion-parameter foundation models for high-resolution satellite imagery	Oct 25, 2024	Benchmarking, image-classification	Unverified
Benchmarking Graph Learning for Drug-Drug Interaction Prediction	Oct 24, 2024	Benchmarking, Graph Learning	Unverified
Conditional diffusions for amortized neural posterior estimation	Oct 24, 2024	Bayesian Inference, Benchmarking	Code Available
Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework	Oct 24, 2024	Benchmarking, Diversity	Code Available
From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical Problems	Oct 24, 2024	Benchmarking, Common Sense Reasoning	Unverified
FuzzWiz -- Fuzzing Framework for Efficient Hardware Coverage	Oct 23, 2024	Benchmarking	Unverified
Benchmarking Foundation Models on Exceptional Cases: Dataset Creation and Validation	Oct 23, 2024	Articles, Benchmarking	Code Available
Benchmarking Floworks against OpenAI & Anthropic: A Novel Framework for Enhanced LLM Function Calling	Oct 23, 2024	Benchmarking	Unverified
Safe Load Balancing in Software-Defined-Networking	Oct 22, 2024	Benchmarking, Deep Reinforcement Learning	Unverified
Benchmarking Smoothness and Reducing High-Frequency Oscillations in Continuous Control Policies	Oct 22, 2024	Benchmarking, continuous-control	Unverified
Polyp-E: Benchmarking the Robustness of Deep Segmentation Models via Polyp Editing	Oct 22, 2024	Attribute, Benchmarking	Unverified
ISImed: A Framework for Self-Supervised Learning using Intrinsic Spatial Information in Medical Images	Oct 22, 2024	Benchmarking, Self-Supervised Learning	Code Available
Benchmarking Large Language Models for Image Classification of Marine Mammals	Oct 22, 2024	Benchmarking, image-classification	Code Available
Building Conformal Prediction Intervals with Approximate Message Passing	Oct 21, 2024	Benchmarking, Conformal Prediction	Code Available
Benchmarking Pathology Foundation Models: Adaptation Strategies and Scenarios	Oct 21, 2024	Benchmarking, Few-Shot Learning	Code Available
Hiding in Plain Sight: Reframing Hardware Trojan Benchmarking as a Hide&Seek Modification	Oct 21, 2024	Benchmarking	Unverified
A Framework for Evaluating Predictive Models Using Synthetic Image Covariates and Longitudinal Data	Oct 21, 2024	Benchmarking	Unverified
Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping	Oct 21, 2024	Benchmarking	Unverified
Dynamic Intelligence Assessment: Benchmarking LLMs on the Road to AGI with a Focus on Model Confidence	Oct 20, 2024	Benchmarking	Unverified
FlexMol: A Flexible Toolkit for Benchmarking Molecular Relational Learning	Oct 19, 2024	Benchmarking, Drug Discovery	Code Available
Advancing Histopathology with Deep Learning Under Data Scarcity: A Decade in Review	Oct 18, 2024	Benchmarking, Deep Learning	Unverified
LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs	Oct 18, 2024	Benchmarking, Fairness	Unverified
Trust but Verify: Programmatic VLM Evaluation in the Wild	Oct 17, 2024	Benchmarking, Language Modelling	Unverified
Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs	Oct 17, 2024	Benchmarking	Code Available
Ab Initio Nonparametric Variable Selection for Scalable Symbolic Regression with Large p	Oct 17, 2024	Benchmarking, regression	Code Available
debiaSAE: Benchmarking and Mitigating Vision-Language Model Bias	Oct 17, 2024	Benchmarking, Bias Detection	Code Available
UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models	Oct 17, 2024	Benchmarking	Code Available

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified