Benchmarking

Papers

Showing 2801–2850 of 5548 papers

Title	Date	Tasks	Status	Hype
Grounded Intuition of GPT-Vision's Abilities with Scientific Images	Nov 3, 2023	Benchmarking, counterfactual	Code Available	0
An Empirical Study of Benchmarking Chinese Aspect Sentiment Quad Prediction	Nov 3, 2023	Benchmarking, Sentence	Unverified	0
Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval	Nov 3, 2023	Benchmarking, Fairness	Code Available	0
Decentralized Federated Learning on the Edge over Wireless Mesh Networks	Nov 2, 2023	Benchmarking, Federated Learning	Unverified	0
Replicable Benchmarking of Neural Machine Translation (NMT) on Low-Resource Local Languages in Indonesia	Nov 2, 2023	Benchmarking, Machine Translation	Code Available	0
Ultra-Efficient On-Device Object Detection on AI-Integrated Smart Glasses with TinyissimoYOLO	Nov 2, 2023	Benchmarking, Edge-computing	Code Available	1
EMPOT: partial alignment of density maps and rigid body fitting using unbalanced Gromov-Wasserstein divergence	Nov 1, 2023	Benchmarking, Cryogenic Electron Microscopy (cryo-EM)	Code Available	1
Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs	Nov 1, 2023	Benchmarking, Question Answering	Unverified	0
SCPO: Safe Reinforcement Learning with Safety Critic Policy Optimization	Nov 1, 2023	Benchmarking, reinforcement-learning	Unverified	0
UAV Immersive Video Streaming: A Comprehensive Survey, Benchmarking, and Open Challenges	Oct 31, 2023	Benchmarking	Unverified	0
A Two-Step Framework for Multi-Material Decomposition of Dual Energy Computed Tomography from Projection Domain	Oct 31, 2023	Benchmarking, Diagnostic	Unverified	0
Next-generation MRD assays: do we have the tools to evaluate them properly?	Oct 31, 2023	Benchmarking, Sensitivity	Unverified	0
In Search of Lost Online Test-time Adaptation: A Survey	Oct 31, 2023	Benchmarking, GPU	Code Available	1
What's In My Big Data?	Oct 31, 2023	Benchmarking	Code Available	2
Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7-10 on Advanced Tests	Oct 31, 2023	Benchmarking	Unverified	0
Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks	Oct 30, 2023	Benchmarking, object-detection	Code Available	2
Domain Generalization in Computational Pathology: Survey and Guidelines	Oct 30, 2023	Benchmarking, Diagnostic	Unverified	0
A Metadata-Driven Approach to Understand Graph Neural Networks	Oct 30, 2023	Benchmarking, Graph Learning	Unverified	0
Re-evaluating Retrosynthesis Algorithms with Syntheseus	Oct 30, 2023	Benchmarking, Multi-step retrosynthesis	Code Available	1
LLMs and Finetuning: Benchmarking cross-domain performance for hate speech detection	Oct 29, 2023	Benchmarking, Diversity	Unverified	0
Evaluating LLP Methods: Challenges and Approaches	Oct 29, 2023	Benchmarking, Model Selection	Code Available	0
Benchmark Generation Framework with Customizable Distortions for Image Classifier Robustness	Oct 28, 2023	Benchmarking, image-classification	Code Available	0
OpenDMC: An Open-Source Library and Performance Evaluation for Deep-learning-based Multi-frame Compression	Oct 27, 2023	Benchmarking, GPU	Code Available	0
On General Language Understanding	Oct 27, 2023	Benchmarking, Ethics	Unverified	0
OrionBench: Benchmarking Time Series Generative Models in the Service of the End-User	Oct 26, 2023	Anomaly Detection, Benchmarking	Unverified	0
Quantum Long Short-Term Memory (QLSTM) vs Classical LSTM in Time Series Forecasting: A Comparative Study in Solar Power Forecasting	Oct 25, 2023	Benchmarking, Hyperparameter Optimization	Unverified	0
RDBench: ML Benchmark for Relational Databases	Oct 25, 2023	Benchmarking	Unverified	0
ConDefects: A New Dataset to Address the Data Leakage Concern for LLM-based Fault Localization and Program Repair	Oct 25, 2023	Benchmarking, Fault localization	Unverified	0
XFEVER: Exploring Fact Verification across Languages	Oct 25, 2023	Benchmarking, Fact Verification	Code Available	0
MLFMF: Data Sets for Machine Learning for Mathematical Formalization	Oct 24, 2023	Benchmarking, Recommendation Systems	Code Available	1
BLESS: Benchmarking Large Language Models on Sentence Simplification	Oct 24, 2023	Benchmarking, Diversity	Code Available	0
CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks	Oct 23, 2023	Benchmarking	Code Available	1
Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction Following: A Case Study of Arabic	Oct 23, 2023	Benchmarking, Instruction Following	Unverified	0
DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design	Oct 23, 2023	Benchmarking, Image Generation	Code Available	0
XTSC-Bench: Quantitative Benchmarking for Explainers on Time Series Classification	Oct 23, 2023	Benchmarking, Time Series	Code Available	0
A Quantitative Evaluation of Dense 3D Reconstruction of Sinus Anatomy from Monocular Endoscopic Video	Oct 22, 2023	3D Reconstruction, Anatomy	Unverified	0
MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation	Oct 21, 2023	Benchmarking, Language Model Evaluation	Unverified	0
Fast hyperboloid decision tree algorithms	Oct 20, 2023	Benchmarking, Riemannian optimization	Code Available	1
Benchmarking and Improving Text-to-SQL Generation under Ambiguity	Oct 20, 2023	Benchmarking, Diversity	Code Available	0
Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models	Oct 20, 2023	Activity Prediction, Benchmarking	Code Available	0
MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark	Oct 20, 2023	Benchmarking, de-en	Code Available	1
Standardised workflow for mass spectrometry-based single-cell proteomics data processing and analysis using the scp package	Oct 20, 2023	Benchmarking	Unverified	0
Benchmarking GPUs on SVBRDF Extractor Model	Oct 19, 2023	Benchmarking, GPU	Unverified	0
Almost Equivariance via Lie Algebra Convolutions	Oct 19, 2023	Benchmarking	Unverified	0
OODRobustBench: a Benchmark and Large-Scale Analysis of Adversarial Robustness under Distribution Shift	Oct 19, 2023	Adversarial Robustness, Benchmarking	Code Available	1
Formalizing and Benchmarking Prompt Injection Attacks and Defenses	Oct 19, 2023	Benchmarking	Code Available	2
FactCHD: Benchmarking Fact-Conflicting Hallucination Detection	Oct 18, 2023	Benchmarking, Hallucination	Code Available	1
InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions	Oct 18, 2023	Benchmarking, Visual Grounding	Code Available	0
To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now	Oct 18, 2023	Adversarial Robustness	Code Available	1
Object-aware Inversion and Reassembly for Image Editing	Oct 18, 2023	Benchmarking, Denoising	Code Available	1

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified