Benchmarking

Papers

Showing 2576–2600 of 5548 papers

Title	Date	Tasks	Status
Performance-Guided LLM Knowledge Distillation for Efficient Text Classification at Scale	Nov 7, 2024	Active Learning, Benchmarking	Unverified
Generating Synthetic Electronic Health Record (EHR) Data: A Review with Benchmarking	Nov 6, 2024	Benchmarking	Unverified
Beemo: Benchmark of Expert-edited Machine-generated Outputs	Nov 6, 2024	Benchmarking	Code Available
SPINEX_ Symbolic Regression: Similarity-based Symbolic Regression with Explainable Neighbors Exploration	Nov 5, 2024	Benchmarking, regression	Unverified
On the Loss of Context-awareness in General Instruction Fine-tuning	Nov 5, 2024	Benchmarking, Instruction Following	Code Available
TDDBench: A Benchmark for Training data detection	Nov 5, 2024	Benchmarking, Computational Efficiency	Unverified
Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level	Nov 5, 2024	Bayesian Optimisation, Benchmarking	Unverified
Imagining and building wise machines: The centrality of AI metacognition	Nov 4, 2024	Benchmarking, Navigate	Unverified
Benchmarking XAI Explanations with Human-Aligned Evaluations	Nov 4, 2024	Benchmarking	Unverified
SinaTools: Open Source Toolkit for Arabic Natural Language Processing	Nov 3, 2024	Benchmarking, Lemmatization	Unverified
Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models	Nov 2, 2024	Benchmarking	Unverified
FEET: A Framework for Evaluating Embedding Techniques	Nov 2, 2024	Benchmarking, Representation Learning	Code Available
Artificial Intelligence for Microbiology and Microbiome Research	Nov 2, 2024	Benchmarking, Deep Learning	Unverified
Modern, Efficient, and Differentiable Transport Equation Models using JAX: Applications to Population Balance Equations	Nov 1, 2024	Benchmarking, Computational Efficiency	Unverified
Benchmarking Bias in Large Language Models during Role-Playing	Nov 1, 2024	Benchmarking, Fairness	Unverified
Cityscape-Adverse: Benchmarking Robustness of Semantic Segmentation with Realistic Scene Modifications via Diffusion-Based Image Editing	Nov 1, 2024	Benchmarking, Semantic Segmentation	Code Available
Improving Few-Shot Cross-Domain Named Entity Recognition by Instruction Tuning a Word-Embedding based Retrieval Augmented Large Language Model	Nov 1, 2024	Benchmarking, Cross-Domain Named Entity Recognition	Unverified
A Review of Reinforcement Learning in Financial Applications	Nov 1, 2024	Benchmarking, Decision Making	Unverified
IdeaBench: Benchmarking Large Language Models for Research Idea Generation	Oct 31, 2024	Benchmarking, scientific discovery	Code Available
Benchmark Data Repositories for Better Benchmarking	Oct 31, 2024	Benchmarking	Unverified
NCAdapt: Dynamic adaptation with domain-specific Neural Cellular Automata for continual hippocampus segmentation	Oct 30, 2024	Benchmarking, Continual Learning	Code Available
VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning	Oct 30, 2024	Benchmarking, Hallucination	Unverified
Evaluating Cultural and Social Awareness of LLM Web Agents	Oct 30, 2024	Benchmarking, Navigate	Unverified
Low-Density 3D Point Cloud Classification	Oct 30, 2024	3D Point Cloud Classification, Autonomous Driving	Unverified
DexGraspNet 2.0: Learning Generative Dexterous Grasping in Large-scale Synthetic Cluttered Scenes	Oct 30, 2024	Benchmarking	Unverified

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified