Benchmarking

Papers

Showing 801–850 of 5548 papers

Title	Date	Tasks	Status	Hype
Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking	Feb 14, 2024	Benchmarking, Language Modelling	Code Available	1
Explainable Global Wildfire Prediction Models using Graph Neural Networks	Feb 11, 2024	Benchmarking, Community Detection	Code Available	1
Retrieve, Merge, Predict: Augmenting Tables with Data Lakes	Feb 9, 2024	AutoML, Benchmarking	Code Available	1
Improved off-policy training of diffusion samplers	Feb 7, 2024	Benchmarking	Code Available	1
JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching	Feb 5, 2024	Benchmarking, Sentence	Code Available	1
GenFace: A Large-Scale Fine-Grained Face Forgery Benchmark and Cross Appearance-Edge Learning	Feb 3, 2024	Benchmarking, DeepFake Detection	Code Available	1
Benchmarking Transferable Adversarial Attacks	Feb 1, 2024	Adversarial Attack, Benchmarking	Code Available	1
We're Not Using Videos Effectively: An Updated Domain Adaptive Video Segmentation Baseline	Feb 1, 2024	Benchmarking, Domain Adaptation	Code Available	1
Explainable Benchmarking for Iterative Optimization Heuristics	Jan 31, 2024	Benchmarking, Evolutionary Algorithms	Code Available	1
Category-wise Fine-Tuning: Resisting Incorrect Pseudo-Labels in Multi-Label Image Classification with Partial Labels	Jan 30, 2024	Benchmarking, image-classification	Code Available	1
Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets	Jan 29, 2024	Benchmarking, Machine Translation	Code Available	1
Dataset and Benchmark: Novel Sensors for Autonomous Vehicle Perception	Jan 24, 2024	Benchmarking	Code Available	1
SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval	Jan 24, 2024	Benchmarking, Image Captioning	Code Available	1
Benchmarking Large Multimodal Models against Common Corruptions	Jan 22, 2024	Benchmarking, Image to text	Code Available	1
CheX-GPT: Harnessing Large Language Models for Enhanced Chest X-ray Report Labeling	Jan 21, 2024	Benchmarking	Code Available	1
RSUD20K: A Dataset for Road Scene Understanding In Autonomous Driving	Jan 14, 2024	Autonomous Driving, Benchmarking	Code Available	1
CAVIAR: Co-simulation of 6G Communications, 3D Scenarios and AI for Digital Twins	Jan 6, 2024	Autonomous Vehicles, Benchmarking	Code Available	1
German Text Embedding Clustering Benchmark	Jan 5, 2024	Benchmarking, Clustering	Code Available	1
FinDABench: Benchmarking Financial Data Analysis Ability of Large Language Models	Jan 1, 2024	Benchmarking	Code Available	1
Benchmarking Large Language Models on Controllable Generation under Diversified Instructions	Jan 1, 2024	Benchmarking, Instruction Following	Code Available	1
Benchmarking the CoW with the TopCoW Challenge: Topology-Aware Anatomical Segmentation of the Circle of Willis for CTA and MRA	Dec 29, 2023	Anatomy, Benchmarking	Code Available	1
APTv2: Benchmarking Animal Pose Estimation and Tracking with a Large-scale Dataset and Beyond	Dec 25, 2023	Animal Pose Estimation, Benchmarking	Code Available	1
Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models	Dec 21, 2023	Benchmarking	Code Available	1
RetailSynth: Synthetic Data Generation for Retail AI Systems Evaluation	Dec 21, 2023	Benchmarking, Product Recommendation	Code Available	1
FiFAR: A Fraud Detection Dataset for Learning to Defer	Dec 20, 2023	Benchmarking, Decision Making	Code Available	1
TAO-Amodal: A Benchmark for Tracking Any Object Amodally	Dec 19, 2023	Amodal Tracking, Autonomous Driving	Code Available	1
How to Train Neural Field Representations: A Comprehensive Study and Benchmark	Dec 16, 2023	Benchmarking	Code Available	1
Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models	Dec 15, 2023	Benchmarking, Code Summarization	Code Available	1
How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation	Dec 12, 2023	Anomaly Detection, Autonomous Driving	Code Available	1
EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning	Dec 11, 2023	Benchmarking, Human-Object Interaction Detection	Code Available	1
Benchmarking Distribution Shift in Tabular Data with TableShift	Dec 10, 2023	Benchmarking, Binary Classification	Code Available	1
STREAMLINE: An Automated Machine Learning Pipeline for Biomedicine Applied to Examine the Utility of Photography-Based Phenotypes for OSA Prediction Across International Sleep Centers	Dec 9, 2023	Anatomy, AutoML	Code Available	1
Benchmarking and Analysis of Unsupervised Object Segmentation from Real-world Single Images	Dec 8, 2023	Benchmarking, Object	Code Available	1
Can language agents be alternatives to PPO? A Preliminary Empirical Study On OpenAI Gym	Dec 6, 2023	Benchmarking, Decision Making	Code Available	1
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models	Dec 5, 2023	Benchmarking, Visual Question Answering	Code Available	1
BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks	Dec 5, 2023	Benchmarking, Minecraft	Code Available	1
Let the LLMs Talk: Simulating Human-to-Human Conversational QA via Zero-Shot LLM-to-LLM Interactions	Dec 5, 2023	Benchmarking, Conversational Question Answering	Code Available	1
Controlgym: Large-Scale Control Environments for Benchmarking Reinforcement Learning Algorithms	Nov 30, 2023	Benchmarking, OpenAI Gym	Code Available	1
Enhancing Ligand Pose Sampling for Molecular Docking	Nov 30, 2023	Benchmarking, Molecular Docking	Code Available	1
Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation	Nov 30, 2023	Benchmarking, counterfactual	Code Available	1
Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs	Nov 29, 2023	Benchmarking	Code Available	1
UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation	Nov 26, 2023	Benchmarking, Hallucination	Code Available	1
Benchmarking Robustness of Text-Image Composed Retrieval	Nov 24, 2023	Attribute, Benchmarking	Code Available	1
IMGTB: A Framework for Machine-Generated Text Detection Benchmarking	Nov 21, 2023	Benchmarking, Text Detection	Code Available	1
BEND: Benchmarking DNA Language Models on biologically meaningful tasks	Nov 21, 2023	Benchmarking, Language Modeling	Code Available	1
Towards a more inductive world for drug repurposing approaches	Nov 21, 2023	Benchmarking, Prediction	Code Available	1
LogLead -- Fast and Integrated Log Loader, Enhancer, and Anomaly Detector	Nov 20, 2023	Anomaly Detection, Benchmarking	Code Available	1
Benchmarking Pathology Feature Extractors for Whole Slide Image Classification	Nov 20, 2023	Benchmarking, image-classification	Code Available	1
TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction	Nov 16, 2023	Benchmarking, Event Extraction	Code Available	1
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization	Nov 15, 2023	Benchmarking, Instruction Following	Code Available	1

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified