Benchmarking

Papers

Showing 3551–3575 of 5548 papers

Title	Date	Tasks	Status
Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7-10 on Advanced Tests	Oct 31, 2023	Benchmarking	Unverified
A Metadata-Driven Approach to Understand Graph Neural Networks	Oct 30, 2023	Benchmarking, Graph Learning	Unverified
Domain Generalization in Computational Pathology: Survey and Guidelines	Oct 30, 2023	Benchmarking, Diagnostic	Unverified
LLMs and Finetuning: Benchmarking cross-domain performance for hate speech detection	Oct 29, 2023	Benchmarking, Diversity	Unverified
Evaluating LLP Methods: Challenges and Approaches	Oct 29, 2023	Benchmarking, Model Selection	Code Available
Benchmark Generation Framework with Customizable Distortions for Image Classifier Robustness	Oct 28, 2023	Benchmarking, image-classification	Code Available
On General Language Understanding	Oct 27, 2023	Benchmarking, Ethics	Unverified
OpenDMC: An Open-Source Library and Performance Evaluation for Deep-learning-based Multi-frame Compression	Oct 27, 2023	Benchmarking, GPU	Code Available
OrionBench: Benchmarking Time Series Generative Models in the Service of the End-User	Oct 26, 2023	Anomaly Detection, Benchmarking	Unverified
RDBench: ML Benchmark for Relational Databases	Oct 25, 2023	Benchmarking	Unverified
ConDefects: A New Dataset to Address the Data Leakage Concern for LLM-based Fault Localization and Program Repair	Oct 25, 2023	Benchmarking, Fault localization	Unverified
XFEVER: Exploring Fact Verification across Languages	Oct 25, 2023	Benchmarking, Fact Verification	Code Available
Quantum Long Short-Term Memory (QLSTM) vs Classical LSTM in Time Series Forecasting: A Comparative Study in Solar Power Forecasting	Oct 25, 2023	Benchmarking, Hyperparameter Optimization	Unverified
BLESS: Benchmarking Large Language Models on Sentence Simplification	Oct 24, 2023	Benchmarking, Diversity	Code Available
Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction Following: A Case Study of Arabic	Oct 23, 2023	Benchmarking, Instruction Following	Unverified
XTSC-Bench: Quantitative Benchmarking for Explainers on Time Series Classification	Oct 23, 2023	Benchmarking, Time Series	Code Available
DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design	Oct 23, 2023	Benchmarking, Image Generation	Code Available
A Quantitative Evaluation of Dense 3D Reconstruction of Sinus Anatomy from Monocular Endoscopic Video	Oct 22, 2023	3D Reconstruction, Anatomy	Unverified
MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation	Oct 21, 2023	Benchmarking, Language Model Evaluation	Unverified
Benchmarking and Improving Text-to-SQL Generation under Ambiguity	Oct 20, 2023	Benchmarking, Diversity	Code Available
Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models	Oct 20, 2023	Activity Prediction, Benchmarking	Code Available
Standardised workflow for mass spectrometry-based single-cell proteomics data processing and analysis using the scp package	Oct 20, 2023	Benchmarking	Unverified
Almost Equivariance via Lie Algebra Convolutions	Oct 19, 2023	Benchmarking	Unverified
Benchmarking GPUs on SVBRDF Extractor Model	Oct 19, 2023	Benchmarking, GPU	Unverified
InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions	Oct 18, 2023	Benchmarking, Visual Grounding	Code Available

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified