| WebCanvas: Benchmarking Web Agents in Online Environments | Jun 18, 2024 | AI Agent, Benchmarking | Code Available | 3 |
| Refusal in Language Models Is Mediated by a Single Direction | Jun 17, 2024 | Instruction Following | Code Available | 3 |
| HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model | Jun 17, 2024 | Computational Efficiency, Earth Observation | Code Available | 3 |
| Unveiling Encoder-Free Vision-Language Models | Jun 17, 2024 | Decoder, Inductive Bias | Code Available | 3 |
| GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement | Jun 17, 2024 | speech-recognition, Speech Recognition | Code Available | 3 |
| DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models | Jun 17, 2024 | Document Classification, Visual Grounding | Code Available | 3 |
| AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning | Jun 17, 2024 | Language Modeling, Language Modelling | Code Available | 3 |
| An Imitative Reinforcement Learning Framework for Autonomous Dogfight | Jun 17, 2024 | Imitation Learning, reinforcement-learning | Code Available | 3 |
| GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding | Jun 16, 2024 | | Code Available | 3 |
| Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference | Jun 16, 2024 | | Code Available | 3 |
| Step-level Value Preference Optimization for Mathematical Reasoning | Jun 16, 2024 | Learning-To-Rank, Math | Code Available | 3 |
| AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models | Jun 16, 2024 | Hallucination, Hallucination Evaluation | Code Available | 3 |
| CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph | Jun 16, 2024 | Drug Design, Fairness | Code Available | 3 |
| AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology | Jun 16, 2024 | Code Generation | Code Available | 3 |
| IMDL-BenCo: A Comprehensive Benchmark and Codebase for Image Manipulation Detection & Localization | Jun 15, 2024 | GPU, Image Manipulation | Code Available | 3 |
| TGB 2.0: A Benchmark for Learning on Temporal Knowledge Graphs and Heterogeneous Graphs | Jun 14, 2024 | Benchmarking, Knowledge Graphs | Code Available | 3 |
| DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning | Jun 14, 2024 | Offline RL | Code Available | 3 |
| CarLLaVA: Vision language models for camera-only closed-loop driving | Jun 14, 2024 | Autonomous Driving, Bench2Drive | Code Available | 3 |
| Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation | Jun 14, 2024 | Audio-Visual Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 3 |
| VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | Jun 13, 2024 | Dense Video Captioning, MVBench | Code Available | 3 |
| Dispelling the Mirage of Progress in Offline MARL through Standardised Baselines and Evaluation | Jun 13, 2024 | Multi-agent Reinforcement Learning | Code Available | 3 |
| DrivAerNet++: A Large-Scale Multimodal Car Dataset with Computational Fluid Dynamics Simulations and Deep Learning Benchmarks | Jun 13, 2024 | Benchmarking | Code Available | 3 |
| Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models | Jun 13, 2024 | Math, object-detection | Code Available | 3 |
| OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation | Jun 13, 2024 | Video Generation, Video Prediction | Code Available | 3 |
| MiLoRA: Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning | Jun 13, 2024 | Instruction Following, Math | Code Available | 3 |
| RobustSAM: Segment Anything Robustly on Degraded Images | Jun 13, 2024 | Deblurring, Image Dehazing | Code Available | 3 |
| Is Value Learning Really the Main Bottleneck in Offline RL? | Jun 13, 2024 | Imitation Learning, Offline RL | Code Available | 3 |
| AdaRevD: Adaptive Patch Exiting Reversible Decoder Pushes the Limit of Image Deblurring | Jun 13, 2024 | Deblurring, Decoder | Code Available | 3 |
| Multimodal Table Understanding | Jun 12, 2024 | Language Modeling, Language Modelling | Code Available | 3 |
| OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text | Jun 12, 2024 | In-Context Learning | Code Available | 3 |
| RVT-2: Learning Precise Manipulation from Few Demonstrations | Jun 12, 2024 | Robot Manipulation, Robot Manipulation Generalization | Code Available | 3 |
| Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams | Jun 12, 2024 | cross-modal alignment, Language Modelling | Code Available | 3 |
| Enhancing End-to-End Autonomous Driving with Latent World Model | Jun 12, 2024 | Autonomous Driving, NavSim | Code Available | 3 |
| Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks | Jun 12, 2024 | Benchmarking, Chatbot | Code Available | 3 |
| Image and Video Tokenization with Binary Spherical Quantization | Jun 11, 2024 | Decoder, Image Generation | Code Available | 3 |
| Evolving from Single-modal to Multi-modal Facial Deepfake Detection: A Survey | Jun 11, 2024 | DeepFake Detection, Face Swapping | Code Available | 3 |
| An Image is Worth 32 Tokens for Reconstruction and Generation | Jun 11, 2024 | Image Generation, Image Reconstruction | Code Available | 3 |
| MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance | Jun 11, 2024 | Image Generation, Text to Image Generation | Code Available | 3 |
| Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation | Jun 11, 2024 | Decoder, Knowledge Distillation | Code Available | 3 |
| EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark | Jun 11, 2024 | Cross-corpus, Emotion Recognition | Code Available | 3 |
| Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis | Jun 10, 2024 | 2k, 3DGS | Code Available | 3 |
| GaussianCity: Generative Gaussian Splatting for Unbounded 3D City Generation | Jun 10, 2024 | 3D Generation, NeRF | Code Available | 3 |
| DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents | Jun 10, 2024 | Benchmarking, scientific discovery | Code Available | 3 |
| GraphStorm: all-in-one graph machine learning framework for industry applications | Jun 10, 2024 | All, graph construction | Code Available | 3 |
| Merlin: A Vision Language Foundation Model for 3D Computed Tomography | Jun 10, 2024 | 3D Semantic Segmentation, Computed Tomography (CT) | Code Available | 3 |
| AutoSurvey: Large Language Models Can Automatically Write Surveys | Jun 10, 2024 | Retrieval, Survey | Code Available | 3 |
| Separate and Reconstruct: Asymmetric Encoder-Decoder for Speech Separation | Jun 10, 2024 | Chunking, Speech Separation | Code Available | 3 |
| EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation | Jun 10, 2024 | Speech Enhancement | Code Available | 3 |
| Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning | Jun 10, 2024 | Multi-hop Question Answering, Question Answering | Code Available | 3 |
| A Review of Prominent Paradigms for LLM-Based Agents: Tool Use (Including RAG), Planning, and Feedback Learning | Jun 9, 2024 | Language Modeling, Language Modelling | Code Available | 3 |