BiCrossMamba-ST: Speech Deepfake Detection with Bidirectional Mamba Spectro-Temporal Cross-Attention

Yassine El Kheir

Deepfake Engineer, Researcher @ Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI)

Dr.-Ing. Tim Polzehl

CEO & Founder of Gretchen AI,
Senior Researcher in the Speech and Language Technology (SLT) department at the Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Berlin

Prof. Dr.-Ing. Sebastian Möller

Head of the Speech and Language Technology research department at the Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI)


Voice synthesis technology has reached a point where AI can create convincing imitations of anyone's voice from just a few minutes of audio.

This poses serious risks to:

  • Banking systems that rely on voice authentication
  • Legal proceedings where audio evidence is crucial
  • Personal security against impersonation attacks

and many more.
 
Traditional detection methods are struggling to keep pace with these rapidly evolving deepfake techniques.
Yassine El Kheir, Dr.-Ing. Tim Polzehl, and Prof. Dr.-Ing. Sebastian Möller from the Speech and Language Technology department at DFKI, Germany, and the Technical University of Berlin have developed BiCrossMamba-ST, a novel detection system that outperforms existing methods by substantial margins.
The breakthrough lies in its dual-perspective analysis: by processing spectral sub-bands and temporal intervals separately and then integrating their representations through cross-attention, BiCrossMamba-ST effectively captures the subtle cues of synthetic speech.
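The dual-perspective idea can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: mean pooling stands in for the paper's Mamba encoders, the projection matrices are random placeholders for learned layers, and all token counts and dimensions below are illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Scaled dot-product attention: one branch's tokens attend over the other's."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

rng = np.random.default_rng(0)
spec = rng.standard_normal((80, 200))  # toy spectrogram: (freq_bins, time_frames)

# Spectral branch: tokens are frequency sub-bands
# (mean pooling stands in for a learned sequence encoder)
sub_bands = spec.reshape(8, 10, 200).mean(axis=1)       # (8 sub-band tokens, 200)

# Temporal branch: tokens are time intervals
intervals = spec.T.reshape(10, 20, 80).mean(axis=1)     # (10 interval tokens, 80)

# Project both token sets into a shared space (random stand-ins for linear layers)
W_s = rng.standard_normal((200, 64)) / np.sqrt(200)
W_t = rng.standard_normal((80, 64)) / np.sqrt(80)
spec_tokens = sub_bands @ W_s                           # (8, 64)
temp_tokens = intervals @ W_t                           # (10, 64)

# Bidirectional cross-attention: each branch queries the other
spec_fused = cross_attention(spec_tokens, temp_tokens)  # (8, 64)
temp_fused = cross_attention(temp_tokens, spec_tokens)  # (10, 64)

# Pool the fused representations into one utterance-level embedding
embedding = np.concatenate([spec_fused.mean(axis=0), temp_fused.mean(axis=0)])
print(embedding.shape)  # (128,)
```

The key point the sketch captures is that neither branch sees the full picture alone: spectral tokens summarize frequency structure, temporal tokens summarize dynamics, and the bidirectional cross-attention lets each view condition on the other before the final decision.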
 
The performance improvements are striking:
 
  • 28.2% fewer parameters while maintaining superior accuracy
  • Relative improvements of 67.74% and 26.3% over state-of-the-art models such as AASIST on the ASVspoof19 and ASVspoofDF21 benchmark datasets, respectively