| CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Oct 30, 2024 | Benchmarking, Passage Retrieval | Code Available | 2 |
| PC-Gym: Benchmark Environments For Process Control Problems | Oct 29, 2024 | Benchmarking, Chemical Process | Code Available | 2 |
| Image2Struct: Benchmarking Structure Extraction for Vision-Language Models | Oct 29, 2024 | Benchmarking | Unverified | 0 |
| SS3DM: Benchmarking Street-View Surface Reconstruction with a Synthetic 3D Mesh Dataset | Oct 29, 2024 | 3D Reconstruction, Autonomous Driving | Unverified | 0 |
| AI Cyber Risk Benchmark: Automated Exploitation Capabilities | Oct 29, 2024 | Benchmarking, Vulnerability Detection | Unverified | 0 |
| Benchmarking LLM Guardrails in Handling Multilingual Toxicity | Oct 29, 2024 | Benchmarking | Unverified | 0 |
| Benchmarking Human and Automated Prompting in the Segment Anything Model | Oct 29, 2024 | Benchmarking, Image Segmentation | Code Available | 0 |
| Exploring Capabilities of Time Series Foundation Models in Building Analytics | Oct 28, 2024 | Benchmarking, energy management | Unverified | 0 |
| Project MPG: towards a generalized performance benchmark for LLM capabilities | Oct 28, 2024 | Benchmarking, Chatbot | Unverified | 0 |
| LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment | Oct 28, 2024 | Benchmarking, Language Modeling | Code Available | 1 |
| ODRL: A Benchmark for Off-Dynamics Reinforcement Learning | Oct 28, 2024 | Benchmarking, reinforcement-learning | Code Available | 2 |
| NewTerm: Benchmarking Real-Time New Terms for Large Language Models with Annual Updates | Oct 28, 2024 | Benchmarking | Code Available | 0 |
| LLM-initialized Differentiable Causal Discovery | Oct 28, 2024 | Benchmarking, Causal Discovery | Unverified | 0 |
| Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training | Oct 28, 2024 | Benchmarking, Language Modeling | Unverified | 0 |
| CODES: Benchmarking Coupled ODE Surrogates | Oct 28, 2024 | Benchmarking, Uncertainty Quantification | Code Available | 0 |
| BongLLaMA: LLaMA for Bangla Language | Oct 28, 2024 | Benchmarking, Data Augmentation | Unverified | 0 |
| Hierarchical Knowledge Graph Construction from Images for Scalable E-Commerce | Oct 28, 2024 | Benchmarking, graph construction | Unverified | 0 |
| AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? | Oct 28, 2024 | Benchmarking, Question Answering | Code Available | 0 |
| CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants | Oct 28, 2024 | Benchmarking | Code Available | 0 |
| Sequential Large Language Model-Based Hyper-parameter Optimization | Oct 27, 2024 | Bayesian Optimization, Benchmarking | Code Available | 0 |
| SPICEPilot: Navigating SPICE Code Generation and Simulation with AI Guidance | Oct 27, 2024 | Benchmarking, Code Generation | Code Available | 1 |
| Multi-input Multi-output Loewner Framework for Vibration-based Damage Detection on a Trainer Jet | Oct 26, 2024 | Benchmarking, Cantilever Beam | Unverified | 0 |
| AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels | Oct 26, 2024 | Benchmarking, Information Retrieval | Code Available | 0 |
| SFTrack: A Robust Scale and Motion Adaptive Algorithm for Tracking Small and Fast Moving Objects | Oct 26, 2024 | Benchmarking, Multi-Object Tracking | Unverified | 0 |
| OGBench: Benchmarking Offline Goal-Conditioned RL | Oct 26, 2024 | Benchmarking, reinforcement-learning | Code Available | 3 |
| MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding | Oct 25, 2024 | Benchmarking, document understanding | Unverified | 0 |
| A Survey of Small Language Models | Oct 25, 2024 | Benchmarking, Model Compression | Unverified | 0 |
| OReole-FM: successes and challenges toward billion-parameter foundation models for high-resolution satellite imagery | Oct 25, 2024 | Benchmarking, image-classification | Unverified | 0 |
| FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs | Oct 25, 2024 | Benchmarking, Fairness | Unverified | 0 |
| An Auditing Test To Detect Behavioral Shift in Language Models | Oct 25, 2024 | Benchmarking, Change Detection | Code Available | 0 |
| CoqPilot, a plugin for LLM-based generation of proofs | Oct 25, 2024 | Benchmarking | Code Available | 2 |
| AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios | Oct 25, 2024 | Benchmarking, Diversity | Code Available | 1 |
| Open6DOR: Benchmarking Open-instruction 6-DoF Object Rearrangement and A VLM-based Approach | Oct 24, 2024 | Benchmarking, Instruction Following | Code Available | 2 |
| Conditional diffusions for amortized neural posterior estimation | Oct 24, 2024 | Bayesian Inference, Benchmarking | Code Available | 0 |
| Benchmarking Graph Learning for Drug-Drug Interaction Prediction | Oct 24, 2024 | Benchmarking, Graph Learning | Unverified | 0 |
| From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical Problems | Oct 24, 2024 | Benchmarking, Common Sense Reasoning | Unverified | 0 |
| Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework | Oct 24, 2024 | Benchmarking, Diversity | Code Available | 0 |
| Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances | Oct 24, 2024 | Benchmarking, Image to Video Generation | Code Available | 3 |
| Benchmarking Foundation Models on Exceptional Cases: Dataset Creation and Validation | Oct 23, 2024 | Articles, Benchmarking | Code Available | 0 |
| Benchmarking Floworks against OpenAI & Anthropic: A Novel Framework for Enhanced LLM Function Calling | Oct 23, 2024 | Benchmarking | Unverified | 0 |
| FuzzWiz -- Fuzzing Framework for Efficient Hardware Coverage | Oct 23, 2024 | Benchmarking | Unverified | 0 |
| Benchmarking Large Language Models for Image Classification of Marine Mammals | Oct 22, 2024 | Benchmarking, image-classification | Code Available | 0 |
| VoiceBench: Benchmarking LLM-Based Voice Assistants | Oct 22, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 3 |
| Benchmarking Smoothness and Reducing High-Frequency Oscillations in Continuous Control Policies | Oct 22, 2024 | Benchmarking, continuous-control | Unverified | 0 |
| Benchmarking Multi-Scene Fire and Smoke Detection | Oct 22, 2024 | Benchmarking | Code Available | 1 |
| ISImed: A Framework for Self-Supervised Learning using Intrinsic Spatial Information in Medical Images | Oct 22, 2024 | Benchmarking, Self-Supervised Learning | Code Available | 0 |
| Safe Load Balancing in Software-Defined-Networking | Oct 22, 2024 | Benchmarking, Deep Reinforcement Learning | Unverified | 0 |
| Polyp-E: Benchmarking the Robustness of Deep Segmentation Models via Polyp Editing | Oct 22, 2024 | Attribute, Benchmarking | Unverified | 0 |
| Building Conformal Prediction Intervals with Approximate Message Passing | Oct 21, 2024 | Benchmarking, Conformal Prediction | Code Available | 0 |
| Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following | Oct 21, 2024 | Benchmarking, Instruction Following | Code Available | 2 |