OTCE: Hybrid SSM and Attention with Cross Domain Mixture of Experts to construct Observer-Thinker-Conceiver-Expresser

2024-06-24

Jingze Shi, Ting Xie, Bingheng Wu, Chunjun Zheng, Kai Wang

Code

github.com/LoserCheems/OTCE
Official, PyTorch, ★ 1

Abstract

Recent research has shown that combining Mamba, which uses a selective state space, with the Transformer architecture, which uses quadratic self-attention, outperforms either architecture alone on language modeling tasks. The quadratic self-attention mechanism effectively alleviates the shortcomings of the selective state space in handling long-term dependencies between arbitrary elements of the sequence. We propose a position information injection method that connects the selective state space model with quadratic attention, and we integrate the two architectures through a cross-domain mixture of experts, so that the model retains the advantages of both. We design a new architecture based on a more biomimetic idea: Observer-Thinker-Conceiver-Expresser (OTCE), which at a small scale can compete with well-known medium-scale open-source language models on language modeling tasks.
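
The contrast the abstract draws between a state space and quadratic self-attention can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `causal_attention` and `toy_selective_scan`, the scalar state, and the sigmoid gate are assumptions made for illustration only. Mamba's actual parameterization, the position information injection, and the cross-domain mixture of experts are not reproduced here.

```python
import math


def causal_attention(q, k, v):
    """Single-head causal softmax attention over lists of vectors.

    Position i is scored against every position j <= i, so the number of
    score computations grows quadratically with sequence length.
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([
            sum(weights[j] * v[j][c] for j in range(i + 1)) / z
            for c in range(len(v[0]))
        ])
    return out


def toy_selective_scan(xs, w_gate=1.0):
    """Toy scalar recurrence h_t = a_t * h_{t-1} + (1 - a_t) * x_t.

    The retention factor a_t depends on the input (the "selective" part,
    simplified here to a sigmoid gate; an assumption, not Mamba's form).
    The state has a fixed size, so cost is linear in sequence length, but
    earlier inputs are only visible through the compressed state h.
    """
    h = 0.0
    ys = []
    for x in xs:
        a = 1.0 / (1.0 + math.exp(-w_gate * x))  # input-dependent, in (0, 1)
        h = a * h + (1.0 - a) * x
        ys.append(h)
    return ys


if __name__ == "__main__":
    seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(causal_attention(seq, seq, seq))
    print(toy_selective_scan([0.5, -1.0, 2.0]))
```

In the sketch, attention can read any earlier position directly, whereas the recurrence must carry everything through a single running state. That is the trade-off the abstract refers to when it says attention alleviates the state space's difficulty with long-term dependencies.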

Tasks

Language Modeling, Language Modelling, Mamba, Mixture-of-Experts, Position
