Benchmarking

Papers

Title	Date	Tasks	Status	Hype
Categorization of 33 computational methods to detect spatially variable genes from spatially resolved transcriptomics data	May 29, 2024	Benchmarking, Specificity	Unverified	0
MDIW-13: a New Multi-Lingual and Multi-Script Database and Benchmark for Script Identification	May 29, 2024	Benchmarking	Unverified	0
Benchmarking and Improving Detail Image Caption	May 29, 2024	Benchmarking, Image Captioning	Code Available	2
MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions	May 29, 2024	Benchmarking, Dialogue Understanding	Code Available	1
Quantitative Certification of Bias in Large Language Models	May 29, 2024	Benchmarking	Code Available	1
Exploring Thermography Technology: A Comprehensive Facial Dataset for Face Detection, Recognition, and Emotion	May 28, 2024	Benchmarking, Emotion Recognition	Unverified	0
Risk-Neutral Generative Networks	May 28, 2024	Benchmarking	Unverified	0
DTR-Bench: An in silico Environment and Benchmark Platform for Reinforcement Learning Based Dynamic Treatment Regime	May 28, 2024	Benchmarking, Reinforcement Learning (RL)	Code Available	1
Benchmarking Skeleton-based Motion Encoder Models for Clinical Applications: Estimating Parkinson's Disease Severity in Walking Sequences	May 28, 2024	Benchmarking, Feature Engineering	Code Available	1
LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters	May 27, 2024	Benchmarking, GSM8K	Code Available	2
Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving	May 27, 2024	Autonomous Driving, Benchmarking	Code Available	3
A Correlation- and Mean-Aware Loss Function and Benchmarking Framework to Improve GAN-based Tabular Data Synthesis	May 27, 2024	Benchmarking	Unverified	0
Benchmarking General-Purpose In-Context Learning	May 27, 2024	Benchmarking, Decision Making	Unverified	0
GeneAgent: Self-verification Language Agent for Gene Set Knowledge Discovery using Domain Databases	May 25, 2024	Benchmarking, Hallucination	Unverified	0
BOLD: Boolean Logic Deep Learning	May 25, 2024	Benchmarking, Deep Learning	Unverified	0
Application based Evaluation of an Efficient Spike-Encoder, "Spiketrum"	May 24, 2024	Benchmarking, Classification	Unverified	0
Free Performance Gain from Mixing Multiple Partially Labeled Samples in Multi-label Image Classification	May 24, 2024	Benchmarking, Data Augmentation	Unverified	0
NuwaTS: a Foundation Model Mending Every Incomplete Time Series	May 24, 2024	Benchmarking, Contrastive Learning	Unverified	0
Benchmarking Hierarchical Image Pyramid Transformer for the classification of colon biopsies and polyps in histopathology images	May 24, 2024	Benchmarking, Classification	Unverified	0
Harnessing Large Language Models for Software Vulnerability Detection: A Comprehensive Benchmarking Study	May 24, 2024	Benchmarking, Vulnerability Detection	Unverified	0
MCDFN: Supply Chain Demand Forecasting via an Explainable Multi-Channel Data Fusion Network Model	May 24, 2024	Benchmarking, Demand Forecasting	Unverified	0
Full-stack evaluation of Machine Learning inference workloads for RISC-V systems	May 24, 2024	Benchmarking, Deep Learning	Unverified	0
Benchmarking the Performance of Pre-trained LLMs across Urdu NLP Tasks	May 24, 2024	Benchmarking, Decoder	Unverified	0
Analog or Digital In-memory Computing? Benchmarking through Quantitative Modeling	May 23, 2024	Benchmarking	Code Available	1
S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models	May 23, 2024	Benchmarking	Code Available	2
An Empirical Study of Training State-of-the-Art LiDAR Segmentation Models	May 23, 2024	Autonomous Driving, Benchmarking	Unverified	0
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents	May 23, 2024	Benchmarking	Code Available	4
GCondenser: Benchmarking Graph Condensation	May 23, 2024	Benchmarking, Graph Representation Learning	Code Available	1
A Gap in Time: The Challenge of Processing Heterogeneous IoT Data in Digitalized Buildings	May 23, 2024	Benchmarking, Data Integration	Unverified	0
CrossCheckGPT: Universal Hallucination Ranking for Multimodal Foundation Models	May 22, 2024	Benchmarking, Hallucination	Unverified	0
Benchmarking Fish Dataset and Evaluation Metric in Keypoint Detection -- Towards Precise Fish Morphological Assessment in Aquaculture Breeding	May 21, 2024	Benchmarking, Keypoint Detection	Code Available	1
CT-Eval: Benchmarking Chinese Text-to-Table Performance in Large Language Models	May 20, 2024	Benchmarking, Diversity	Unverified	0
EXACT: Towards a platform for empirically benchmarking Machine Learning model explanation methods	May 20, 2024	Benchmarking, Explainable artificial intelligence	Unverified	0
Large-Scale Multi-Center CT and MRI Segmentation of Pancreas with Deep Learning	May 20, 2024	Benchmarking, MRI segmentation	Code Available	2
DispaRisk: Auditing Fairness Through Usable Information	May 20, 2024	Benchmarking, Bias Detection	Code Available	0
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering	May 20, 2024	Benchmarking, Question Answering	Code Available	2
EnviroExam: Benchmarking Environmental Science Knowledge of Large Language Models	May 18, 2024	Benchmarking, Specificity	Unverified	0
From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT	May 17, 2024	Benchmarking, Multiple-choice	Unverified	0
SMP Challenge: An Overview and Analysis of Social Media Prediction Challenge	May 17, 2024	Benchmarking, Social Media Popularity Prediction	Unverified	0
BraTS-Path Challenge: Assessing Heterogeneous Histopathologic Brain Tumor Sub-regions	May 17, 2024	Benchmarking, Prognosis	Unverified	0
Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset	May 17, 2024	16k, Benchmarking	Code Available	3
A Robust Autoencoder Ensemble-Based Approach for Anomaly Detection in Text	May 16, 2024	Anomaly Detection, Benchmarking	Unverified	0
Simulation-Based Benchmarking of Reinforcement Learning Agents for Personalized Retail Promotions	May 16, 2024	Benchmarking, Reinforcement Learning (RL)	Code Available	0
An Integrated Framework for Multi-Granular Explanation of Video Summarization	May 16, 2024	Benchmarking, Panoptic Segmentation	Code Available	0
DocuMint: Docstring Generation for Python using Small Language Models	May 16, 2024	Benchmarking, Code Generation	Code Available	1
PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models	May 15, 2024	Benchmarking	Code Available	2
SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation	May 14, 2024	Benchmarking, Multiple-choice	Code Available	1
SpeechVerse: A Large-scale Generalizable Audio Language Model	May 14, 2024	Automatic Speech Recognition, Benchmarking	Unverified	0
UCCIX: Irish-eXcellence Large Language Model	May 13, 2024	Benchmarking, Language Modeling	Unverified	0
Divergent Creativity in Humans and Large Language Models	May 13, 2024	Benchmarking	Code Available	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified