Benchmarking

Papers

Showing 1376–1400 of 5548 papers

Title	Date	Tasks	Status	Hype
Modern, Efficient, and Differentiable Transport Equation Models using JAX: Applications to Population Balance Equations	Nov 1, 2024	Benchmarking, Computational Efficiency	Unverified	0
Improving Few-Shot Cross-Domain Named Entity Recognition by Instruction Tuning a Word-Embedding based Retrieval Augmented Large Language Model	Nov 1, 2024	Benchmarking, Cross-Domain Named Entity Recognition	Unverified	0
MIRFLEX: Music Information Retrieval Feature Library for Extraction	Nov 1, 2024	Benchmarking, Information Retrieval	Code Available	1
Benchmarking Bias in Large Language Models during Role-Playing	Nov 1, 2024	Benchmarking, Fairness	Unverified	0
Cityscape-Adverse: Benchmarking Robustness of Semantic Segmentation with Realistic Scene Modifications via Diffusion-Based Image Editing	Nov 1, 2024	Benchmarking, Semantic Segmentation	Code Available	0
LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models	Nov 1, 2024	Benchmarking, Mixture-of-Experts	Code Available	1
LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators	Oct 31, 2024	Benchmarking, Text Generation	Code Available	2
IdeaBench: Benchmarking Large Language Models for Research Idea Generation	Oct 31, 2024	Benchmarking, scientific discovery	Code Available	0
LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property Prediction	Oct 31, 2024	Benchmarking, Prediction	Code Available	1
Pedestrian Trajectory Prediction with Missing Data: Datasets, Imputation, and Benchmarking	Oct 31, 2024	Benchmarking, Imputation	Code Available	1
EMGBench: Benchmarking Out-of-Distribution Generalization and Adaptation for Electromyography	Oct 31, 2024	Benchmarking, Electromyography (EMG)	Code Available	1
Benchmark Data Repositories for Better Benchmarking	Oct 31, 2024	Benchmarking	Unverified	0
XRDSLAM: A Flexible and Modular Framework for Deep Learning based SLAM	Oct 31, 2024	3DGS, Benchmarking	Code Available	3
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents	Oct 31, 2024	Benchmarking	Code Available	3
DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios	Oct 31, 2024	Benchmarking, LLM-generated Text Detection	Code Available	1
AllClear: A Comprehensive Dataset and Benchmark for Cloud Removal in Satellite Imagery	Oct 31, 2024	Benchmarking, Cloud Removal	Code Available	1
CALE: Continuous Arcade Learning Environment	Oct 31, 2024	Atari Games, Benchmarking	Code Available	7
Low-Density 3D Point Cloud Classification	Oct 30, 2024	3D Point Cloud Classification, Autonomous Driving	Unverified	0
Survey of Cultural Awareness in Language Models: Text and Beyond	Oct 30, 2024	Benchmarking	Code Available	1
NCAdapt: Dynamic adaptation with domain-specific Neural Cellular Automata for continual hippocampus segmentation	Oct 30, 2024	Benchmarking, Continual Learning	Code Available	0
VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning	Oct 30, 2024	Benchmarking, Hallucination	Unverified	0
DexGraspNet 2.0: Learning Generative Dexterous Grasping in Large-scale Synthetic Cluttered Scenes	Oct 30, 2024	Benchmarking	Unverified	0
InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models	Oct 30, 2024	Benchmarking	Code Available	2
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation	Oct 30, 2024	Benchmarking, Passage Retrieval	Code Available	2
Evaluating Cultural and Social Awareness of LLM Web Agents	Oct 30, 2024	Benchmarking, Navigate	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified