Benchmarking

Papers

Title	Date	Tasks	Status	Hype
TAO-Amodal: A Benchmark for Tracking Any Object Amodally	Dec 19, 2023	Amodal Tracking, Autonomous Driving	Code Available	1
How to Train Neural Field Representations: A Comprehensive Study and Benchmark	Dec 16, 2023	Benchmarking	Code Available	1
Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models	Dec 15, 2023	Benchmarking, Code Summarization	Code Available	1
How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation	Dec 12, 2023	Anomaly Detection, Autonomous Driving	Code Available	1
EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning	Dec 11, 2023	Benchmarking, Human-Object Interaction Detection	Code Available	1
Benchmarking Distribution Shift in Tabular Data with TableShift	Dec 10, 2023	Benchmarking, Binary Classification	Code Available	1
STREAMLINE: An Automated Machine Learning Pipeline for Biomedicine Applied to Examine the Utility of Photography-Based Phenotypes for OSA Prediction Across International Sleep Centers	Dec 9, 2023	Anatomy, AutoML	Code Available	1
Benchmarking and Analysis of Unsupervised Object Segmentation from Real-world Single Images	Dec 8, 2023	Benchmarking, Object	Code Available	1
Can language agents be alternatives to PPO? A Preliminary Empirical Study On OpenAI Gym	Dec 6, 2023	Benchmarking, Decision Making	Code Available	1
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models	Dec 5, 2023	Benchmarking, Visual Question Answering	Code Available	1
BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks	Dec 5, 2023	Benchmarking, Minecraft	Code Available	1
Let the LLMs Talk: Simulating Human-to-Human Conversational QA via Zero-Shot LLM-to-LLM Interactions	Dec 5, 2023	Benchmarking, Conversational Question Answering	Code Available	1
Controlgym: Large-Scale Control Environments for Benchmarking Reinforcement Learning Algorithms	Nov 30, 2023	Benchmarking, OpenAI Gym	Code Available	1
Enhancing Ligand Pose Sampling for Molecular Docking	Nov 30, 2023	Benchmarking, Molecular Docking	Code Available	1
Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation	Nov 30, 2023	Benchmarking, counterfactual	Code Available	1
Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs	Nov 29, 2023	Benchmarking	Code Available	1
UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation	Nov 26, 2023	Benchmarking, Hallucination	Code Available	1
Benchmarking Robustness of Text-Image Composed Retrieval	Nov 24, 2023	Attribute, Benchmarking	Code Available	1
IMGTB: A Framework for Machine-Generated Text Detection Benchmarking	Nov 21, 2023	Benchmarking, Text Detection	Code Available	1
BEND: Benchmarking DNA Language Models on biologically meaningful tasks	Nov 21, 2023	Benchmarking, Language Modeling	Code Available	1
Towards a more inductive world for drug repurposing approaches	Nov 21, 2023	Benchmarking, Prediction	Code Available	1
LogLead -- Fast and Integrated Log Loader, Enhancer, and Anomaly Detector	Nov 20, 2023	Anomaly Detection, Benchmarking	Code Available	1
Benchmarking Pathology Feature Extractors for Whole Slide Image Classification	Nov 20, 2023	Benchmarking, image-classification	Code Available	1
TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction	Nov 16, 2023	Benchmarking, Event Extraction	Code Available	1
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization	Nov 15, 2023	Benchmarking, Instruction Following	Code Available	1

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified