Image processing methods to segment speech spectrograms for word level recognition

Al-Darkazali, Mohammed

Al-Darkazali, Mohammed.pdf (12.36 MB)

Image processing methods to segment speech spectrograms for word level recognition

thesis

posted on 2023-06-09, 09:06 authored by Mohammed Al-Darkazali

The ultimate goal of automatic speech recognition (ASR) research is to allow a computer to recognize speech in real-time, with full accuracy, independent of vocabulary size, noise, speaker characteristics or accent. Today, systems are trained to learn an individual speaker's voice and larger vocabularies statistically, but accuracy is not ideal. A small gap between actual speech and acoustic speech representation in the statistical mapping causes a failure to produce a match of the acoustic speech signals by Hidden Markov Model (HMM) methods and consequently leads to classification errors. Certainly, these errors in the low level recognition stage of ASR produce unavoidable errors at the higher levels. Therefore, it seems that ASR additional research ideas to be incorporated within current speech recognition systems. This study seeks new perspective on speech recognition. It incorporates a new approach for speech recognition, supporting it with wider previous research, validating it with a lexicon of 533 words and integrating it with a current speech recognition method to overcome the existing limitations. The study focusses on applying image processing to speech spectrogram images (SSI). We, thus develop a new writing system, which we call the Speech-Image Recogniser Code (SIR-CODE). The SIR-CODE refers to the transposition of the speech signal to an artificial domain (the SSI) that allows the classification of the speech signal into segments. The SIR-CODE allows the matching of all speech features (formants, power spectrum, duration, cues of articulation places, etc.) in one process. This was made possible by adding a Realization Layer (RL) on top of the traditional speech recognition layer (based on HMM) to check all sequential phones of a word in single step matching process. The study shows that the method gives better recognition results than HMMs alone, leading to accurate and reliable ASR in noisy environments. Therefore, the addition of the RL for SSI matching is a highly promising solution to compensate for the failure of HMMs in low level recognition. In addition, the same concept of employing SSIs can be used for whole sentences to reduce classification errors in HMM based high level recognition. The SIR-CODE bridges the gap between theory and practice of phoneme recognition by matching the SSI patterns at the word level. Thus, it can be adapted for dynamic time warping on the SIR-CODE segments, which can help to achieve ASR, based on SSI matching alone.

History

File Version

Published version

Pages

219.0

Department affiliated with

Engineering and Design Theses

Qualification level

doctoral

Qualification name

phd

Language

eng

Institution

University of Sussex

Full text available

Yes

Legacy Posted Date

2017-12-19

Usage metrics

Keywords

Uncategorised value

Licence

Exports

RefWorks

BibTeX

Ref. manager

Endnote

DataCite

NLM

DC

Image processing methods to segment speech spectrograms for word level recognition

History

File Version

Pages

Department affiliated with

Qualification level

Qualification name

Language

Institution

Full text available

Legacy Posted Date

Usage metrics

Categories

Keywords

Licence

Exports