Benchmarking

Papers

Showing 2701–2750 of 5548 papers

Title	Date	Tasks	Status
PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms	Oct 5, 2024	Benchmarking, GPU	Unverified
Implicit to Explicit Entropy Regularization: Benchmarking ViT Fine-tuning under Noisy Labels	Oct 5, 2024	Benchmarking	Unverified
PersoBench: Benchmarking Personalized Response Generation in Large Language Models	Oct 4, 2024	Benchmarking, Dialogue Generation	Code Available
How Do Large Language Models Understand Graph Patterns? A Benchmark for Graph Pattern Comprehension	Oct 4, 2024	Benchmarking, Computational chemistry	Unverified
Ward: Provable RAG Dataset Inference via LLM Watermarks	Oct 4, 2024	Benchmarking, RAG	Unverified
ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities	Oct 4, 2024	Benchmarking, counterfactual	Unverified
Towards a Benchmark for Large Language Models for Business Process Management Tasks	Oct 4, 2024	Benchmarking, Management	Code Available
Benchmarking the Fidelity and Utility of Synthetic Relational Data	Oct 4, 2024	Benchmarking, Feature Importance	Unverified
Lightning UQ Box: A Comprehensive Framework for Uncertainty Quantification in Deep Learning	Oct 4, 2024	Benchmarking, Uncertainty Quantification	Unverified
Understanding Large Language Models in Your Pockets: Performance Study on COTS Mobile Devices	Oct 4, 2024	Benchmarking, Language Modeling	Unverified
IoT-LLM: Enhancing Real-World IoT Task Reasoning with Large Language Models	Oct 3, 2024	Benchmarking, In-Context Learning	Unverified
MANTRA: The Manifold Triangulations Assemblage	Oct 3, 2024	Benchmarking	Code Available
Repurposing Foundation Model for Generalizable Medical Time Series Classification	Oct 3, 2024	Benchmarking, Diagnostic	Unverified
Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning	Oct 3, 2024	Benchmarking, Language Modeling	Unverified
Deep learning for action spotting in association football videos	Oct 2, 2024	Action Spotting, Benchmarking	Unverified
ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving	Oct 2, 2024	Benchmarking, Document Summarization	Unverified
CALF: Benchmarking Evaluation of LFQA Using Chinese Examinations	Oct 2, 2024	Benchmarking, Long Form Question Answering	Unverified
The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs	Oct 2, 2024	Benchmarking, Hallucination	Unverified
Emo3D: Metric and Benchmarking Dataset for 3D Facial Expression Generation from Emotion Description	Oct 2, 2024	Benchmarking, Facial expression generation	Unverified
A Real Benchmark Swell Noise Dataset for Performing Seismic Data Denoising via Deep Learning	Oct 2, 2024	Benchmarking, Denoising	Unverified
Deep Unlearn: Benchmarking Machine Unlearning	Oct 2, 2024	Benchmarking, Machine Unlearning	Unverified
CXPMRG-Bench: Pre-training and Benchmarking for X-ray Medical Report Generation on CheXpert Plus Dataset	Oct 1, 2024	Benchmarking, Contrastive Learning	Unverified
FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks	Oct 1, 2024	Benchmarking, Fairness	Unverified
Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents	Oct 1, 2024	Benchmarking, Conversational Question Answering	Unverified
Match Stereo Videos via Bidirectional Alignment	Sep 30, 2024	Benchmarking, Stereo Matching	Unverified
Benchmarking Adaptive Intelligence and Computer Vision on Human-Robot Collaboration	Sep 30, 2024	Benchmarking, Intent Detection	Unverified
ImmersePro: End-to-End Stereo Video Synthesis Via Implicit Disparity Learning	Sep 30, 2024	Benchmarking, Disparity Estimation	Code Available
Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs	Sep 30, 2024	Benchmarking, Multiple-choice	Unverified
Constrained Reinforcement Learning for Safe Heat Pump Control	Sep 29, 2024	Benchmarking, reinforcement-learning	Code Available
Tracking Everything in Robotic-Assisted Surgery	Sep 29, 2024	Benchmarking	Unverified
GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks	Sep 29, 2024	Benchmarking	Unverified
AstroMLab 2: AstroLLaMA-2-70B Model and Benchmarking Specialised LLMs for Astronomy	Sep 29, 2024	Astronomy, Benchmarking	Unverified
SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement	Sep 28, 2024	Benchmarking, Code Generation	Unverified
Data Analysis in the Era of Generative AI	Sep 27, 2024	Benchmarking	Unverified
Constructing Confidence Intervals for 'the' Generalization Error -- a Comprehensive Benchmark Study	Sep 27, 2024	Benchmarking, tabular-regression	Code Available
CLLMate: A Multimodal Benchmark for Weather and Climate Events Forecasting	Sep 27, 2024	Articles, Benchmarking	Unverified
bnRep: A repository of Bayesian networks from the academic literature	Sep 27, 2024	Benchmarking	Unverified
MCUBench: A Benchmark of Tiny Object Detectors on MCUs	Sep 27, 2024	Benchmarking, Model Selection	Unverified
EarthquakeNPP: Benchmark Datasets for Earthquake Forecasting with Neural Point Processes	Sep 27, 2024	Benchmarking, Dataset Generation	Unverified
Conformal Prediction: A Theoretical Note and Benchmarking Transductive Node Classification in Graphs	Sep 26, 2024	Benchmarking, Conformal Prediction	Code Available
Benchmarking Domain Generalization Algorithms in Computational Pathology	Sep 25, 2024	Benchmarking, Data Augmentation	Code Available
Benchmarking Deep Learning Models for Object Detection on Edge Computing Devices	Sep 25, 2024	Autonomous Vehicles, Benchmarking	Unverified
Proof of Thought : Neurosymbolic Program Synthesis allows Robust and Interpretable Reasoning	Sep 25, 2024	Benchmarking, Formal Logic	Unverified
Omnibenchmark (alpha) for continuous and open benchmarking in bioinformatics	Sep 25, 2024	Benchmarking	Unverified
SEN12-WATER: A New Dataset for Hydrological Applications and its Benchmarking	Sep 25, 2024	Benchmarking, Management	Unverified
Controlling Risk of Retrieval-augmented Generation: A Counterfactual Prompting Framework	Sep 24, 2024	Benchmarking, counterfactual	Code Available
HLB: Benchmarking LLMs' Humanlikeness in Language Use	Sep 24, 2024	Benchmarking	Unverified
Benchmarking Robustness of Endoscopic Depth Estimation with Synthetically Corrupted Data	Sep 24, 2024	Benchmarking, Depth Estimation	Code Available
Qualitative Insights Tool (QualIT): LLM Enhanced Topic Modeling	Sep 24, 2024	Articles, Benchmarking	Unverified
Ducho meets Elliot: Large-scale Benchmarks for Multimodal Recommendation	Sep 24, 2024	Benchmarking, Movie Recommendation	Code Available

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified