Benchmarking

Papers

Title	Date	Tasks	Status	Hype
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization	Nov 15, 2023	Benchmarking, Instruction Following	Code Available	1
MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration	Nov 14, 2023	Benchmarking, Language Modeling	Code Available	1
Combinatorial Optimization with Policy Adaptation using Latent Space Search	Nov 13, 2023	Benchmarking, Combinatorial Optimization	Code Available	1
Benchmarking PtO and PnO Methods in the Predictive Combinatorial Optimization Regime	Nov 13, 2023	Benchmarking, Combinatorial Optimization	Code Available	1
WaterBench: Towards Holistic Evaluation of Watermarks for Large Language Models	Nov 13, 2023	Benchmarking, Instruction Following	Code Available	1
Flames: Benchmarking Value Alignment of LLMs in Chinese	Nov 12, 2023	Benchmarking, Fairness	Code Available	1
MultiIoT: Benchmarking Machine Learning for the Internet of Things	Nov 10, 2023	Benchmarking, Representation Learning	Code Available	1
CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation	Nov 10, 2023	Benchmarking, Cloud Computing	Code Available	1
TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs	Nov 9, 2023	Benchmarking, Question Answering	Code Available	1
The PetShop Dataset -- Finding Causes of Performance Issues across Microservices	Nov 8, 2023	Benchmarking	Code Available	1
The voraus-AD Dataset for Anomaly Detection in Robot Applications	Nov 8, 2023	Anomaly Detection, Benchmarking	Code Available	1
Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts	Nov 7, 2023	Benchmarking, Machine Translation	Code Available	1
Benchmarking Geospatial Question Answering Engines using the Dataset GeoQuestions1089	Nov 6, 2023	Benchmarking, Knowledge Base Question Answering	Code Available	1
Hopfield-Enhanced Deep Neural Networks for Artifact-Resilient Brain State Decoding	Nov 6, 2023	Benchmarking, Data Compression	Code Available	1
JRDB-Traj: A Dataset and Benchmark for Trajectory Forecasting in Crowds	Nov 5, 2023	Autonomous Navigation, Autonomous Vehicles	Code Available	1
Digital Typhoon: Long-term Satellite Image Dataset for the Spatio-Temporal Modeling of Tropical Cyclones	Nov 5, 2023	Benchmarking	Code Available	1
NeuroEvoBench: Benchmarking Evolutionary Optimizers for Deep Learning Applications	Nov 4, 2023	Benchmarking, Deep Learning	Code Available	1
FragXsiteDTI: Revealing Responsible Segments in Drug-Target Interaction with Transformer-Driven Interpretation	Nov 4, 2023	Benchmarking, Drug Discovery	Code Available	1
Ultra-Efficient On-Device Object Detection on AI-Integrated Smart Glasses with TinyissimoYOLO	Nov 2, 2023	Benchmarking, Edge-computing	Code Available	1
EMPOT: partial alignment of density maps and rigid body fitting using unbalanced Gromov-Wasserstein divergence	Nov 1, 2023	Benchmarking, Cryogenic Electron Microscopy (cryo-EM)	Code Available	1
In Search of Lost Online Test-time Adaptation: A Survey	Oct 31, 2023	Benchmarking, GPU	Code Available	1
Re-evaluating Retrosynthesis Algorithms with Syntheseus	Oct 30, 2023	Benchmarking, Multi-step retrosynthesis	Code Available	1
MLFMF: Data Sets for Machine Learning for Mathematical Formalization	Oct 24, 2023	Benchmarking, Recommendation Systems	Code Available	1
CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks	Oct 23, 2023	Benchmarking	Code Available	1
MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark	Oct 20, 2023	Benchmarking, de-en	Code Available	1
Fast hyperboloid decision tree algorithms	Oct 20, 2023	Benchmarking, Riemannian optimization	Code Available	1
OODRobustBench: a Benchmark and Large-Scale Analysis of Adversarial Robustness under Distribution Shift	Oct 19, 2023	Adversarial Robustness, Benchmarking	Code Available	1
To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now	Oct 18, 2023	Adversarial Robustness	Code Available	1
FactCHD: Benchmarking Fact-Conflicting Hallucination Detection	Oct 18, 2023	Benchmarking, Hallucination	Code Available	1
Object-aware Inversion and Reassembly for Image Editing	Oct 18, 2023	Benchmarking, Denoising	Code Available	1
DialogueLLM: Context and Emotion Knowledge-Tuned Large Language Models for Emotion Recognition in Conversations	Oct 17, 2023	Benchmarking, Emotion Recognition	Code Available	1
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models	Oct 17, 2023	Benchmarking, Language Modelling	Code Available	1
3DYoga90: A Hierarchical Video Dataset for Yoga Pose Understanding	Oct 16, 2023	Action Recognition, Benchmarking	Code Available	1
Welfare Diplomacy: Benchmarking Language Model Cooperation	Oct 13, 2023	Benchmarking, Language Modeling	Code Available	1
pose-format: Library for Viewing, Augmenting, and Handling .pose Files	Oct 13, 2023	Benchmarking, Management	Code Available	1
"Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters	Oct 13, 2023	Benchmarking, Fairness	Code Available	1
Towards Evaluating Generalist Agents: An Automated Benchmark in Open World	Oct 12, 2023	Benchmarking, Diversity	Code Available	1
GeSS: Benchmarking Geometric Deep Learning under Scientific Applications with Distribution Shifts	Oct 12, 2023	Benchmarking	Code Available	1
MetaBox: A Benchmark Platform for Meta-Black-Box Optimization with Reinforcement Learning	Oct 12, 2023	Benchmarking	Code Available	1
What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models	Oct 10, 2023	Benchmarking, Code Generation	Code Available	1
Benchmarking and Explaining Large Language Model-based Code Generation: A Causality-Centric Approach	Oct 10, 2023	Benchmarking, Code Generation	Code Available	1
PepMLM: Target Sequence-Conditioned Generation of Therapeutic Peptide Binders via Span Masked Language Modeling	Oct 5, 2023	Benchmarking, Language Modeling	Code Available	1
Can Language Models Employ the Socratic Method? Experiments with Code Debugging	Oct 4, 2023	Benchmarking	Code Available	1
GNNX-BENCH: Unravelling the Utility of Perturbation-based GNN Explainers through In-depth Benchmarking	Oct 3, 2023	Benchmarking, counterfactual	Code Available	1
CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery	Oct 3, 2023	Benchmarking, Causal Discovery	Code Available	1
PGDQN: Preference-Guided Deep Q-Network	Oct 3, 2023	Atari Games, Benchmarking	Code Available	1
Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench	Oct 2, 2023	Benchmarking, Safety Alignment	Code Available	1
NewsRecLib: A PyTorch-Lightning Library for Neural News Recommendation	Oct 2, 2023	Benchmarking, News Recommendation	Code Available	1
FELM: Benchmarking Factuality Evaluation of Large Language Models	Oct 1, 2023	Benchmarking, Math	Code Available	1
Benchmarking Cognitive Biases in Large Language Models as Evaluators	Sep 29, 2023	Benchmarking, In-Context Learning	Code Available	1

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified