Benchmarking

Papers

Showing 2301–2350 of 5548 papers

Title	Date	Tasks	Status	Hype
From Bytes to Borsch: Fine-Tuning Gemma and Mistral for the Ukrainian Language Representation	Apr 14, 2024	Benchmarking, Diversity	Code Available	0
Towards Sim-to-Real Industrial Parts Classification with Synthetic Dataset	Apr 12, 2024	Benchmarking	Code Available	1
Practical Guidelines for Cell Segmentation Models Under Optical Aberrations in Microscopy	Apr 12, 2024	Benchmarking, Cell Segmentation	Unverified	0
Exploring the Decentraland Economy: Multifaceted Parcel Attributes, Key Insights, and Benchmarking	Apr 11, 2024	Attribute, Benchmarking	Unverified	0
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments	Apr 11, 2024	Benchmarking	Code Available	7
DyKnow: Dynamically Verifying Time-Sensitive Factual Knowledge in LLMs	Apr 10, 2024	Benchmarking, knowledge editing	Code Available	0
Certifying almost all quantum states with few single-qubit measurements	Apr 10, 2024	All, Benchmarking	Unverified	0
GoodDrag: Towards Good Practices for Drag Editing with Diffusion Models	Apr 10, 2024	Benchmarking, Denoising	Unverified	0
Implicit Multi-Spectral Transformer: An Lightweight and Effective Visible to Infrared Image Translation Model	Apr 10, 2024	Benchmarking, Image-to-Image Translation	Code Available	1
Accel-NASBench: Sustainable Benchmarking for Accelerator-Aware NAS	Apr 9, 2024	Benchmarking, Neural Architecture Search	Code Available	0
From Protoscience to Epistemic Monoculture: How Benchmarking Set the Stage for the Deep Learning Revolution	Apr 9, 2024	Benchmarking	Unverified	0
WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs	Apr 9, 2024	Benchmarking, Code Generation	Unverified	0
AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents	Apr 9, 2024	Benchmarking	Code Available	1
EFSA: Towards Event-Level Financial Sentiment Analysis	Apr 8, 2024	Articles, Benchmarking	Code Available	0
MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering	Apr 8, 2024	Benchmarking, Medical Question Answering	Unverified	0
HOEG: A New Approach for Object-Centric Predictive Process Monitoring	Apr 8, 2024	Benchmarking, Graph Neural Network	Code Available	0
Towards Objectively Benchmarking Social Intelligence for Language Agents at Action Level	Apr 8, 2024	Benchmarking	Code Available	0
A Comparison of Cryptocurrency Volatility-benchmarking New and Mature Asset Classes	Apr 7, 2024	Benchmarking	Unverified	0
MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models	Apr 7, 2024	Benchmarking, knowledge editing	Code Available	0
PoLLMgraph: Unraveling Hallucinations in Large Language Models via State Transition Dynamics	Apr 6, 2024	Benchmarking, Hallucination	Code Available	0
SDFR: Synthetic Data for Face Recognition Competition	Apr 6, 2024	Benchmarking, Face Recognition	Unverified	0
Multicalibration for Confidence Scoring in LLMs	Apr 6, 2024	Benchmarking, Question Answering	Unverified	0
Enhancing Video Summarization with Context Awareness	Apr 6, 2024	Benchmarking, Informativeness	Code Available	0
Benchmarking and Improving Compositional Generalization of Multi-aspect Controllable Text Generation	Apr 5, 2024	Attribute, Benchmarking	Code Available	0
GNNBENCH: Fair and Productive Benchmarking for Single-GPU GNN System	Apr 5, 2024	Benchmarking, GPU	Unverified	0
Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)	Apr 5, 2024	Benchmarking	Code Available	0
Dynamic Risk Assessment Methodology with an LDM-based System for Parking Scenarios	Apr 5, 2024	Benchmarking	Unverified	0
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance	Apr 4, 2024	Benchmarking, Image Generation	Code Available	2
Outlier-Efficient Hopfield Layers for Large Transformer-Based Models	Apr 4, 2024	Benchmarking, Quantization	Code Available	1
PARIS3D: Reasoning-based 3D Part Segmentation Using Large Multimodal Model	Apr 4, 2024	3D Part Segmentation, Benchmarking	Code Available	1
Benchmarking ChatGPT on Algorithmic Reasoning	Apr 4, 2024	Benchmarking	Code Available	0
Benchmarking Parameter Control Methods in Differential Evolution for Mixed-Integer Black-Box Optimization	Apr 4, 2024	Benchmarking	Code Available	0
Schroedinger's Threshold: When the AUC doesn't predict Accuracy	Apr 4, 2024	Benchmarking	Code Available	0
A Comparative Analysis of Word-Level Metric Differential Privacy: Benchmarking The Privacy-Utility Trade-off	Apr 4, 2024	Benchmarking	Code Available	0
DiffBody: Human Body Restoration by Imagining with Generative Diffusion Prior	Apr 4, 2024	Benchmarking, Image Restoration	Unverified	0
NL2KQL: From Natural Language to Kusto Query	Apr 3, 2024	Benchmarking, Natural Language Queries	Unverified	0
Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT	Apr 3, 2024	Benchmarking, General Knowledge	Code Available	1
Atom-Level Optical Chemical Structure Recognition with Limited Supervision	Apr 2, 2024	Benchmarking	Code Available	1
On the reduction of Linear Parameter-Varying State-Space models	Apr 2, 2024	Benchmarking, Dimensionality Reduction	Unverified	0
PATCH! Psychometrics-AssisTed BenCHmarking of Large Language Models against Human Populations: A Case Study of Proficiency in 8th Grade Mathematics	Apr 2, 2024	Benchmarking	Code Available	0
PREGO: online mistake detection in PRocedural EGOcentric videos	Apr 2, 2024	Action Recognition, Benchmarking	Code Available	1
Advancing LLM Reasoning Generalists with Preference Trees	Apr 2, 2024	Benchmarking, Code Generation	Code Available	3
EV2Gym: A Flexible V2G Simulator for EV Smart Charging Research and Benchmarking	Apr 2, 2024	Benchmarking, Reinforcement Learning (RL)	Code Available	2
Stereotype Detection in LLMs: A Multiclass, Explainable, and Benchmark-Driven Approach	Apr 2, 2024	Benchmarking, Common Sense Reasoning	Unverified	0
Diffusion-Driven Domain Adaptation for Generating 3D Molecules	Apr 1, 2024	Benchmarking, Decoder	Unverified	0
IsoBench: Benchmarking Multimodal Foundation Models on Isomorphic Representations	Apr 1, 2024	Benchmarking, Math	Unverified	0
Are large language models superhuman chemists?	Apr 1, 2024	Benchmarking	Code Available	2
SpiralMLP: A Lightweight Vision MLP Architecture	Mar 31, 2024	Benchmarking	Unverified	0
Comparing Hyper-optimized Machine Learning Models for Predicting Efficiency Degradation in Organic Solar Cells	Mar 29, 2024	Benchmarking	Unverified	0
IndiBias: A Benchmark Dataset to Measure Social Biases in Language Models for Indian Context	Mar 29, 2024	Benchmarking, Sentence	Code Available	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified