Improving Acoustic and Bone Conduction Speech Enhancement | News


Frustrated by background noise making it difficult to hear and understand calls, Northwestern Engineering’s Stephen Xia aimed to develop a system that transmits speech clearly, even in noisy public spaces.

Unlike traditional, over-the-air microphones commonly found in earbuds or headphones, which convert variations in air pressure into electrical signals, bone conduction microphones (BCMs) pick up vocal cord vibrations through the skin and skull. Used in devices such as osseointegrated hearing aids and sports headsets, BCMs are not sensitive to changes in air pressure, making them naturally robust against ambient noise.

Bone-conduction acoustic modalities, however, have a significant drawback: they attenuate vocal frequencies, particularly in the higher range, which reduces sound quality and degrades intelligibility.

Xia and his team, which included first-year computer engineering PhD students Yueyuan Sui and Junxi Xia, set out to reconstruct the missing and attenuated frequencies, a process called super resolution or bandwidth expansion, to provide high-quality, noise-free speech for real-time mobile, wearable, and ‘earable’ (audio plus sensor application) systems.

“Existing super resolution methods can provide high quality speech but are computationally expensive and not practical for our personal mobile devices, such as smartphones, earbuds, or other wearables, while methods that have a small footprint provide significantly lower fidelity speech,” said Xia, assistant professor of electrical and computer engineering and (by courtesy) computer science at the McCormick School of Engineering. “Our goal was to bridge the gap between speed, efficiency, and performance.”

The team developed TRAMBA, a hybrid transformer and Mamba deep learning architecture for acoustic and bone conduction speech enhancement. Their method achieved higher quality speech with a memory footprint of only ~5MB compared to ~500MB for other state-of-the-art models. TRAMBA can process half a second of audio on a smartphone in ~20 milliseconds.

The super-resolution model also improves word error rate (the ratio of incorrectly perceived words to the total number of words spoken) by up to 75 percent in noisy environments compared to traditional noise-suppression approaches.
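
For readers unfamiliar with the metric, the following is a minimal sketch of how word error rate is commonly computed: the word-level edit distance (substitutions, deletions, and insertions) between what was spoken and what was recognized, divided by the number of words spoken. The function name `word_error_rate` and the sample sentences are illustrative assumptions, not the team's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name); not taken from TRAMBA.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: a spoken word was missed
                curr[j - 1] + 1,     # insertion: an extra word was heard
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word and one dropped word out of four.
spoken = "turn the volume up"
heard = "turn a volume"
print(word_error_rate(spoken, heard))
```

A lower value is better, so an improvement in word error rate means fewer words are misheard.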

Moreover, the team demonstrated that because bone conduction-based sensing modalities lack high-frequency components, both the sampling rate of the sensor and the transmission rate can be significantly reduced, which can improve the battery life of wearables by up to 160 percent.
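
The link between missing high frequencies and a lower sampling rate follows from the Nyquist criterion: a signal must be sampled at no less than twice its highest frequency. The sketch below illustrates the arithmetic only. The bandwidth figures and the 16-bit sample size are hypothetical placeholders chosen for the example, not values reported by the team, and the sketch does not attempt to model battery life.

```python
def nyquist_rate_hz(max_frequency_hz: float) -> float:
    """Minimum sampling rate needed to capture content up to max_frequency_hz."""
    return 2.0 * max_frequency_hz


def raw_bitrate_bps(sampling_rate_hz: float, bits_per_sample: int = 16) -> float:
    """Uncompressed single-channel data rate for a given sampling rate."""
    return sampling_rate_hz * bits_per_sample


# Hypothetical bandwidths, for illustration only.
AIR_MIC_MAX_HZ = 8000.0   # assumed upper frequency of an over-the-air microphone
BONE_MIC_MAX_HZ = 2000.0  # assumed upper frequency of a bone conduction signal

air_rate = nyquist_rate_hz(AIR_MIC_MAX_HZ)
bone_rate = nyquist_rate_hz(BONE_MIC_MAX_HZ)
reduction = raw_bitrate_bps(air_rate) / raw_bitrate_bps(bone_rate)

print(f"over-the-air sampling rate: {air_rate:.0f} Hz")
print(f"bone conduction sampling rate: {bone_rate:.0f} Hz")
print(f"raw data to transmit shrinks by a factor of {reduction:.1f}")
```

Under these assumed numbers the sensor samples and transmits a quarter of the data, which is the kind of saving that a super-resolution model on the receiving device is meant to make up for.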
