Benchmarking

Papers

Showing 2001–2050 of 5548 papers

Title	Date	Tasks	Status	Hype
Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective	Jun 19, 2024	Benchmarking, Continual Pretraining	Unverified	0
A large-scale multicenter breast cancer DCE-MRI benchmark dataset with expert segmentations	Jun 19, 2024	Benchmarking	Code Available	2
Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models	Jun 19, 2024	Benchmarking, Open-Domain Question Answering	Unverified	0
Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration	Jun 19, 2024	Benchmarking, Distractor Generation	Unverified	0
GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation	Jun 19, 2024	Benchmarking, Image Generation	Code Available	3
BeHonest: Benchmarking Honesty in Large Language Models	Jun 19, 2024	Benchmarking, Misinformation	Code Available	1
Benchmarking Unsupervised Online IDS for Masquerade Attacks in CAN	Jun 19, 2024	Benchmarking, Intrusion Detection	Code Available	0
Comparison of Open-Source and Proprietary LLMs for Machine Reading Comprehension: A Practical Analysis for Industrial Applications	Jun 19, 2024	Benchmarking, Machine Reading Comprehension	Unverified	0
M4Fog: A Global Multi-Regional, Multi-Modal, and Multi-Stage Dataset for Marine Fog Detection and Forecasting to Bridge Ocean and Atmosphere	Jun 19, 2024	Benchmarking, Spatio-Temporal Forecasting	Code Available	0
Exploring and Benchmarking the Planning Capabilities of Large Language Models	Jun 18, 2024	Benchmarking, In-Context Learning	Unverified	0
UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions	Jun 18, 2024	Benchmarking, Multiple-choice	Code Available	0
Exploring the Impact of a Transformer's Latent Space Geometry on Downstream Task Performance	Jun 18, 2024	Benchmarking	Unverified	0
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI	Jun 18, 2024	Benchmarking, scientific discovery	Code Available	2
GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models	Jun 18, 2024	Benchmarking, Depth Estimation	Code Available	2
MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts	Jun 18, 2024	Articles, Benchmarking	Unverified	0
Automatic benchmarking of large multimodal models via iterative experiment programming	Jun 18, 2024	Benchmarking, Language Modeling	Code Available	0
WebCanvas: Benchmarking Web Agents in Online Environments	Jun 18, 2024	AI Agent, Benchmarking	Code Available	3
TSI-Bench: Benchmarking Time Series Imputation	Jun 18, 2024	Benchmarking, Deep Learning	Code Available	3
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning	Jun 18, 2024	Benchmarking, World Knowledge	Code Available	0
JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language Models	Jun 17, 2024	Benchmarking, counterfactual	Unverified	0
Unleashing OpenTitan's Potential: a Silicon-Ready Embedded Secure Element for Root of Trust and Cryptographic Offloading	Jun 17, 2024	Autonomous Vehicles, Benchmarking	Unverified	0
InternalInspector I^2: Robust Confidence Estimation in LLMs through Internal States	Jun 17, 2024	Benchmarking, Contrastive Learning	Unverified	0
A Systematic Survey of Text Summarization: From Statistical Methods to Large Language Models	Jun 17, 2024	Benchmarking, Survey	Unverified	0
Job-SDF: A Multi-Granularity Dataset for Job Skill Demand Forecasting and Benchmarking	Jun 17, 2024	Benchmarking, Demand Forecasting	Code Available	1
Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams	Jun 17, 2024	All, Benchmarking	Code Available	0
Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models	Jun 17, 2024	Benchmarking	Code Available	2
The Liouville Generator for Producing Integrable Expressions	Jun 17, 2024	Benchmarking	Unverified	0
GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations	Jun 17, 2024	Benchmarking, Dataset Generation	Code Available	0
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content	Jun 17, 2024	Benchmarking, General Knowledge	Code Available	0
Standardizing Structural Causal Models	Jun 17, 2024	Benchmarking, Causal Inference	Code Available	0
Benchmarking of LLM Detection: Comparing Two Competing Approaches	Jun 17, 2024	Benchmarking	Unverified	0
MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models	Jun 17, 2024	Benchmarking, Fact Checking	Code Available	1
Benchmarking Out-of-Distribution Generalization Capabilities of DNN-based Encoding Models for the Ventral Visual Cortex	Jun 16, 2024	Benchmarking, Object Recognition	Unverified	0
RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models	Jun 16, 2024	Benchmarking	Code Available	0
NovoBench: Benchmarking Deep Learning-based De Novo Peptide Sequencing Methods in Proteomics	Jun 16, 2024	Benchmarking, de novo peptide sequencing	Unverified	0
VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment	Jun 16, 2024	Action Understanding, Benchmarking	Unverified	0
Evaluating the Performance of Large Language Models via Debates	Jun 16, 2024	Benchmarking	Unverified	0
Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning	Jun 16, 2024	Benchmarking, Math	Unverified	0
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models	Jun 16, 2024	Adversarial Attack, Benchmarking	Code Available	2
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences	Jun 16, 2024	Benchmarking, Spatial Reasoning	Unverified	0
GANmut: Generating and Modifying Facial Expressions	Jun 16, 2024	Benchmarking, Diversity	Unverified	0
Benchmarking Label Noise in Instance Segmentation: Spatial Noise Matters	Jun 16, 2024	Benchmarking, Instance Segmentation	Code Available	0
Benchmarking Children's ASR with Supervised and Self-supervised Speech Foundation Models	Jun 15, 2024	Benchmarking, Data Augmentation	Code Available	0
Reactor Mk.1 performances: MMLU, HumanEval and BBH test results	Jun 15, 2024	Benchmarking, HumanEval	Unverified	0
A GPU-accelerated Large-scale Simulator for Transportation System Optimization Benchmarking	Jun 15, 2024	Benchmarking, GPU	Code Available	1
Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework	Jun 14, 2024	Benchmarking	Unverified	0
SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading	Jun 14, 2024	Benchmarking, Mathematical Proofs	Code Available	0
ClimRetrieve: A Benchmarking Dataset for Information Retrieval from Corporate Climate Disclosures	Jun 14, 2024	Answer Generation, Benchmarking	Code Available	0
VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs	Jun 14, 2024	Anomaly Detection, Benchmarking	Code Available	1
Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming	Jun 14, 2024	Benchmarking, General Knowledge	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified