mfcc.rar_MFCC_extraction resource - CSDN Library

1 file (PDF) · 171 KB RAR · uploaded 2022-09-19 21:42:32

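MFCC (Mel Frequency Cepstral Coefficients) is a feature extraction technique widely used in signal processing, especially in speech recognition and audio analysis. It applies pre-emphasis, framing, windowing, a Fourier transform, a mel filter bank, a logarithm, and a discrete cosine transform to the raw speech signal to obtain a set of coefficients that characterize the speech. The core idea of MFCC is to mimic the perceptual properties of the human auditory system so that speech signals can be analyzed and recognized more effectively.

1. **Pre-emphasis**: to compensate for the dominance of low frequencies and boost the high-frequency components, a first-order pre-emphasis filter is usually applied, with a coefficient of about 0.97.
2. **Framing and windowing**: the continuous speech signal is split into short frames of roughly 20 to 30 ms each, and a tapered window such as the Hamming window (rather than a plain rectangular window) is applied to reduce edge effects at the frame boundaries.
3. **Fourier transform**: a fast Fourier transform (FFT) is applied to each frame, converting the time-domain signal to the frequency domain and yielding its spectrum.
4. **Mel filter bank**: a set of filters spaced on the mel scale, a frequency scale that models human auditory perception, is applied in the frequency domain. The filter bank yields the power in each filter channel.
5. **Logarithm**: the log of the filter-bank power is taken, modeling the non-linear way the human auditory system perceives sound intensity.
6. **Discrete cosine transform**: a DCT is applied to the log power values, mainly to reduce the correlation between features. Only the low-order coefficients are kept as MFCC features, typically the first 12 to 13 and sometimes more.
7. **Normalization and dynamic features**: to improve recognition performance, the MFCC features can be normalized and supplemented with dynamic features such as deltas (first differences) and accelerations (second differences), which capture how the spectrum changes over time.

MFCC features are widely used in speech recognition, emotion recognition, speaker recognition, and other fields. In a speech recognition system, for example, MFCCs serve as the input features used to train models that distinguish words or commands. In audio classification tasks, MFCCs can also help distinguish different kinds of music or environmental sounds. In this "mfcc.rar_MFCC_extraction" project, the provided "mfcc.pdf" document is most likely a detailed tutorial or research paper on MFCC feature extraction, and may cover the theory, implementation methods, and practical applications of MFCCs. Understanding how MFCCs work and where they apply matters for anyone working in speech processing, natural language processing, or machine learning, since it helps in designing and tuning speech recognition systems for accuracy and efficiency.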
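As a rough illustration of steps 1 and 2 in the summary above, the following Python sketch applies first-order pre-emphasis and then cuts the signal into overlapping Hamming-windowed frames. The function names, the synthetic 200 Hz test tone, and the 25 ms frame / 10 ms shift settings are assumptions chosen for illustration; they are not taken from mfcc.pdf.

```python
import math

def pre_emphasize(signal, alpha=0.97):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1] (first sample passed through)."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))]

def hamming(length):
    """Symmetric Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]

def windowed_frames(signal, frame_len, hop):
    """Split a signal into overlapping Hamming-windowed frames; a partial last frame is dropped."""
    window = hamming(frame_len)
    return [
        [signal[start + n] * window[n] for n in range(frame_len)]
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]

# Hypothetical input: 100 ms of a 200 Hz tone sampled at 8 kHz.
fs = 8000
tone = [math.sin(2 * math.pi * 200 * n / fs) for n in range(800)]

# 25 ms frames (200 samples) shifted by 10 ms (80 samples).
frames = windowed_frames(pre_emphasize(tone), frame_len=200, hop=80)
print(len(frames), "frames of", len(frames[0]), "samples")
```

Each frame would then go through the FFT, mel filter bank, log, and DCT stages described in steps 3 to 6.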


mfcc.rar (1 file)

mfcc.pdf 199KB

Aldebaro Klautau - 11/22/05. Page 1.

The MFCC

1- How are MFCCs used in speech recognition?

Generally speaking, a conventional automatic speech recognition (ASR) system can be organized in two blocks: the feature extraction stage and the modeling stage. In practice, the modeling stage is subdivided into acoustic and language modeling, both based on HMMs (hidden Markov models), as described in Figure 1.

Figure 1- Simple representation of a conventional ASR.

Feature extraction is usually a non-invertible (lossy) transformation, such as the MFCC computation depicted in Figure 2. Making an analogy with filter banks, such a transformation does not lead to perfect reconstruction, i.e., given only the features it is not possible to reconstruct the original speech used to generate them.

Computational complexity and robustness are the two primary reasons for accepting this loss of information. Increasing the accuracy of the parametric representation by increasing the number of parameters increases complexity and, beyond some point, does not lead to better results because of robustness issues: the greater the number of parameters in a model, the longer the training sequence needed to estimate them reliably.

Figure 2- Pictorial representation of mel-frequency cepstrum (MFCC) calculation.

Speech is usually segmented into frames of 20 to 30 ms, and the analysis window is shifted by 10 ms. Each frame is converted to 12 MFCCs plus a normalized energy parameter. The first and second derivatives (∆'s and ∆∆'s) of the MFCCs and energy are estimated, resulting in 39 numbers representing each frame. Assuming a sample rate of 8 kHz, for each 10 ms the feature extraction module delivers 39 numbers to the modeling stage. This operation with overlap among frames is equivalent to taking 80 speech samples without overlap and representing them by 39 numbers. In fact, assuming each speech sample is represented by one byte and each feature by four bytes (a float), the parametric representation increases the number of bytes needed to represent 80 bytes of speech (to 39 × 4 = 156 bytes). If a sample rate of 16 kHz is assumed, the 39 parameters represent 160 samples. For higher sample rates, it is intuitive that 39 parameters are not enough to reconstruct the speech samples. In any case, one should notice that the goal here is not speech compression but obtaining features suitable for speech recognition.
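The bookkeeping in this paragraph can be checked with a few lines of Python. This is only a sketch of the arithmetic under the text's assumptions (10 ms shift, 12 MFCCs plus energy, first and second derivatives, one byte per sample, four bytes per feature); the function name `frame_budget` is made up for the example.

```python
def frame_budget(sample_rate_hz, shift_ms=10, n_mfcc=12,
                 bytes_per_sample=1, bytes_per_feature=4):
    """Return (samples per shift, features per frame, raw bytes, feature bytes)."""
    samples_per_shift = sample_rate_hz * shift_ms // 1000
    static = n_mfcc + 1        # 12 MFCCs plus the normalized energy
    features = 3 * static      # static values, deltas, and delta-deltas
    return (samples_per_shift, features,
            samples_per_shift * bytes_per_sample,
            features * bytes_per_feature)

for fs in (8000, 16000):
    samples, feats, raw_bytes, feat_bytes = frame_budget(fs)
    print(f"{fs} Hz: {samples} samples ({raw_bytes} B) -> {feats} features ({feat_bytes} B)")
```

At 8 kHz the features occupy more bytes than the samples they describe, which is the point the paragraph makes: the representation is chosen for recognition, not for compression.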

2- What are MFCCs? How are they calculated?

The block diagram for calculating MFCCs is given below.

Figure 3- MFCC calculation.

The point is: why these blocks? What is the intuition behind each one?

In my opinion there are two ways of looking at MFCCs: (a) as filter-bank processing adapted to the specificities of speech and (b) as a modification of the conventional cepstrum, a well-known deconvolution technique based on homomorphic processing. These points of view are complementary and help in gaining insight into MFCCs. I will briefly describe each one.

2.1- Mel-scale: from auditory modeling

Before proceeding, let us take into account some characteristics of the human auditory system. Two famous experiments generated the Bark and mel scales, given below. [Describe the experiments.]

         Bark                         Mel
Filter   Frequency (Hz)   BW (Hz)     Frequency (Hz)   BW (Hz)
1        50               100         100              100
2        150              100         200              100
3        250              100         300              100
4        350              100         400              100
5        450              110         500              100
6        570              120         600              100
7        700              140         700              100
8        840              150         800              100
9        1000             160         900              100
10       1170             190         1000             124
11       1370             210         1149             160
12       1600             240         1320             184
13       1850             280         1516             211
14       2150             320         1741             242
15       2500             380         2000             278
16       2900             450         2297             320
17       3400             550         2639             367
18       4000             700         3031             422
19       4800             900         3482             484
20       5800             1100        4000             556
21       7000             1300        4595             639
22       8500             1800        5278             734
23       10500            2500        6063             843
24       13500            3500        6964             969

[Figure 3 block labels, in processing order: Speech → Pre-emphasis → Hamming window → DFT → Output energy of filters on Mel-scale → LOG → DCT → MFCC]
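Reading down the Mel column of the table above, the center frequencies appear to follow a simple rule: ten filters spaced linearly by 100 Hz up to 1000 Hz, followed by logarithmic spacing with five filters per octave (each center is 2^(1/5) times the previous one). This is an observation about the tabulated numbers, not a formula stated in the text, and the function and parameter names below are hypothetical. A short Python sketch that regenerates the column under that assumption:

```python
def mel_centers(n_linear=10, n_log=14, step_hz=100.0, steps_per_octave=5):
    """Center frequencies in Hz: linear spacing first, then a constant ratio per filter."""
    linear = [step_hz * i for i in range(1, n_linear + 1)]
    knee = linear[-1]  # last linearly spaced center (1000 Hz with the defaults)
    log_part = [knee * 2 ** (k / steps_per_octave) for k in range(1, n_log + 1)]
    return [round(f) for f in linear + log_part]

centers = mel_centers()
print(centers)
```

The Bark column does not follow this rule, and the bandwidth columns are not reproduced by the sketch.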

So, we use the mel scale to organize the filter bank used in the MFCC calculation. The plot below was generated with the function melbankm (this function returns a sparse matrix, so I used the command full to convert it to a regular matrix). Notice that the maximum weight is 2.

[Figure: MFCC filter-bank (crosses indicate center frequency of each filter). Axes: Weight versus Frequency (Hz), 0 to 8000 Hz.]

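% melbankm is assumed to be the VOICEBOX function; it returns a sparse matrix with one row per filter.
% frame_duration is the FFT length in samples (256 for the 129 bins plotted here).
filterAmplitudes = full(melbankm(NfiltersOfMelBank, frame_duration, Fs));
% Bin index of the largest weight in each filter (max returns the first index on ties).
[peak, filterCenter] = max(filterAmplitudes, [], 2);
% Frequency of each FFT bin; MATLAB indices are 1-based, so bin k sits at (k-1)*Fs/frame_duration.
nBins = size(filterAmplitudes, 2);
xaxis_in_Hz = (0:nBins-1) * Fs / frame_duration;
plot(xaxis_in_Hz, filterAmplitudes'); hold on
HzScale = (filterCenter' - 1) * Fs / frame_duration;
% Mark each center frequency with crosses at weights 1 and 0.
plot(HzScale, ones(1, length(filterCenter)), 'x');
plot(HzScale, zeros(1, length(filterCenter)), 'x');
xlabel('Frequency (Hz)');
ylabel('Weight');
title('MFCC filter-bank (crosses indicate center frequency of each filter)');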
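For readers without access to melbankm, the idea behind the plot can be sketched in plain Python. This is not a port of melbankm: it builds ordinary triangular filters whose peak weight is 1 rather than 2, takes its edge frequencies from the Mel column of the table in this section, and adds 0 Hz as the lowest edge. The 16 kHz sample rate and 256-point FFT are assumptions inferred from the 0 to 8000 Hz axis and the 129 bins used in the MATLAB snippet.

```python
def triangular_filterbank(edges_hz, n_fft, fs):
    """Return (bin_hz, bank). Filter k rises from edges_hz[k] to a peak at
    edges_hz[k+1] and falls back to zero at edges_hz[k+2]."""
    bin_hz = [b * fs / n_fft for b in range(n_fft // 2 + 1)]
    bank = []
    for lo, mid, hi in zip(edges_hz, edges_hz[1:], edges_hz[2:]):
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bin_hz, bank

# Mel-column center frequencies from the table, with 0 Hz prepended as the lowest edge.
MEL_CENTERS_HZ = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000,
                  1149, 1320, 1516, 1741, 2000, 2297, 2639, 3031, 3482,
                  4000, 4595, 5278, 6063, 6964]
edges = [0] + MEL_CENTERS_HZ

bin_hz, bank = triangular_filterbank(edges, n_fft=256, fs=16000)

# Frequency of the strongest bin of each filter, the analogue of HzScale in the MATLAB snippet.
peak_hz = [bin_hz[max(range(len(row)), key=row.__getitem__)] for row in bank]
print(len(bank), "filters; first peaks at", peak_hz[:5], "Hz")
```

Because the FFT bins are 62.5 Hz apart, a peak bin generally lands near, but not exactly on, the nominal center frequency.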

2.1 - Cepstral analysis, the historical father of the MFCCs

Homomorphic processing is well discussed by Oppenheim in his textbooks. The cepstrum is perhaps the most popular homomorphic technique because it is useful for deconvolution. To understand it, one should remember that the basic model of human speech production adopted in speech processing is a source-filter model.

→ Source: related to the air expelled from the lungs. If the sound is unvoiced, as in "s" and "f", the glottis is open and the vocal cords are relaxed. If the sound is voiced, as in "a" and "e", the vocal cords vibrate, and the frequency of this vibration is related to the pitch.

→ Filter: responsible for shaping the spectrum of the signal in order to produce different sounds. It is related to the vocal tract organs.

Roughly speaking, a good parametric representation for a speech recognition system tries to eliminate the influence of the source (the system must give the same "answer" for a high-pitched female voice and for a low-pitched male voice) and to characterize the filter. The problem is that the source e(n) and the filter impulse response h(n) are convolved, so we need deconvolution in speech recognition applications. Mathematically:

In the time domain, convolution: source * filter = speech,

e(n) * h(n) = x(n). (1)

In the frequency domain, multiplication: source x filter = speech,

E(z) H(z) = X(z). (2)

How can we perform the deconvolution? Cepstral analysis is one alternative.

→ Working in the frequency domain, use the logarithm to transform the multiplication in (2) into a summation (recall that log ab = log a + log b). It is not easy to separate (to filter) things that are multiplied as in (2), but it is easy to design filters to separate things that are terms of a sum, as below:

C(z) = log X(z) = log E(z) + log H(z). (3)

We hope that H(z) is mainly composed of low frequencies and that E(z) has most of its energy at higher frequencies, so that a simple low-pass filter could separate H(z) from E(z) if we were dealing with E(z) + H(z). In fact, let us suppose for the sake of simplicity that we have, instead of (3), the following equation:

C_o(z) = E(z) + H(z). (4)

We could use a linear filter to eliminate E(z) and then calculate the inverse Z-transform to get a time sequence c_o(n). Notice that in this case c_o(n) would have the dimension of time (seconds, for example).

Having said that, let us now face our problem: the log operation in (3). The log is a non-linear operation and it can "create" new frequencies. For example, expanding the log of a cosine in a Taylor series shows that harmonics are created. So, even if E(z) and H(z) are well separated in the frequency domain, log E(z) and log H(z) could in principle overlap considerably. Fortunately, that is not the case in practice for speech processing. The other point is that, because of the log operation, the inverse Z-transform of C(z) in (3) does NOT have the dimension of time as in (4). We call the inverse Z-transform of C(z) the cepstrum, and its dimension is quefrency (a time-domain parameter).
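The additive property in (3) can be checked numerically. The Python sketch below is a toy setup, not the author's procedure: it uses the DFT in place of the Z-transform, circular convolution in place of linear convolution, and the real cepstrum (inverse DFT of the log magnitude). The "source" is just an impulse plus a half-amplitude echo 8 samples later and the "filter" is a decaying exponential; both are hypothetical stand-ins chosen so that neither spectrum contains zeros.

```python
import cmath
import math

def dft(x):
    """Plain O(N^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def real_cepstrum(x):
    """Inverse DFT of the log magnitude spectrum of a real sequence."""
    n = len(x)
    log_mag = [math.log(abs(v)) for v in dft(x)]
    # log_mag is real and even-symmetric for real x, so a cosine sum gives the inverse DFT.
    return [sum(log_mag[k] * math.cos(2 * math.pi * k * q / n) for k in range(n)) / n
            for q in range(n)]

def circular_convolve(a, b):
    """Circular convolution, which the DFT turns into an exact product of spectra."""
    n = len(a)
    return [sum(a[m] * b[(t - m) % n] for m in range(n)) for t in range(n)]

N = 32
e = [0.0] * N
e[0], e[8] = 1.0, 0.5                  # toy source: impulse plus an echo at lag 8
h = [0.8 ** n for n in range(N)]       # toy filter: decaying exponential
x = circular_convolve(e, h)            # "speech" = source convolved with filter

ce, ch, cx = real_cepstrum(e), real_cepstrum(h), real_cepstrum(x)
max_gap = max(abs(cx[q] - (ce[q] + ch[q])) for q in range(N))
print(f"largest deviation from cepstrum(x) = cepstrum(e) + cepstrum(h): {max_gap:.2e}")
```

In this toy case the source contributes a cepstral peak at quefrency 8 (the echo lag), while the contribution of the smooth filter is concentrated at low quefrencies, which is what makes the two separable by a simple window in the cepstral domain.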

→ There are two basic types of cepstrum: the complex cepstrum and the real cepstrum. In addition, there are two ways of calculating the real cepstrum (the one used in speech processing, because phase is not important): the LPC cepstrum and the FFT cepstrum.

LPC cepstrum: the cepstral coefficients are obtained from the LPC coefficients.

FFT cepstrum: the cepstral coefficients are obtained from an FFT.

Which one is better? The most widely used parametric representation for speech recognition is the FFT cepstrum computed on a mel scale [Davis, 80].

2.2 - Filter-bank interpretation: the simplest way to look at MFCCs

We go to the frequency domain and disregard phase, working only with the power spectrum. We use the log because our ears work in decibels. To reduce dimensionality, we use a filter bank with around 20 filters, spaced according to the mel scale. We take the DCT-II because it is good at compressing information and decorrelating the vector, performing asymptotically close to the KLT (Karhunen-Loève transform). Then we disregard the high-order DCT coefficients. Much easier, is it not?
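The last two steps of this description (log of the filter-bank energies, then a truncated DCT-II) can be sketched in a few lines of Python. The 20 filter energies below are invented numbers, the DCT is left unnormalized, and keeping coefficients c0 to c11 is an assumption made for the example; real front ends differ in scaling and often replace c0 with an energy term.

```python
import math

def dct_ii(values):
    """Unnormalized DCT-II: X[k] = sum_i x[i] * cos(pi * k * (i + 0.5) / N)."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n)]

def inverse_dct_ii(coeffs, n):
    """Invert dct_ii for a length-n output; missing high-order coefficients count as zero."""
    return [(coeffs[0] + 2 * sum(c * math.cos(math.pi * k * (i + 0.5) / n)
                                 for k, c in enumerate(coeffs) if k > 0)) / n
            for i in range(n)]

def cepstral_coefficients(filter_energies, n_keep=12):
    """Log of each filter energy, DCT-II, then keep only the first n_keep coefficients."""
    log_energies = [math.log(e) for e in filter_energies]
    return dct_ii(log_energies)[:n_keep]

# Hypothetical energies at the output of a 20-filter mel bank (all strictly positive).
energies = [1.0 + 0.5 * math.sin(0.4 * i) + 0.1 * (i % 3) for i in range(20)]

coeffs = cepstral_coefficients(energies, n_keep=12)
# Zero-filling the discarded coefficients and inverting gives a smoothed log spectrum.
smoothed_log = inverse_dct_ii(coeffs, n=20)
print(len(coeffs), "coefficients kept;", len(smoothed_log), "smoothed log energies")
```

Inverting the truncated DCT is also the basic idea behind reconstructing a smoothed spectrum from the coefficients: the discarded high-order terms carry the fine detail, so what comes back is the spectral envelope across the filters.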

Reconstructing the spectrum based on MFCC

Given the MFCCs, how can the speech spectrum be reconstructed? See the appendix.

Some examples are shown below: one segment of voiced speech and one of unvoiced speech.

[Figure: Voiced speech. dB Magnitude versus Frequency (Hz), 0 to 8000 Hz. Curves: FFT (red), LPC (blue), DFT cepstra (green), MFCC (black).]

[Figure: Unvoiced speech. dB Magnitude versus Frequency (Hz), 0 to 8000 Hz. Curves: FFT (red), LPC (blue), DFT cepstra (green), MFCC (black).]

The examples below show cases where the MFCCs did not capture the formant structure, i.e., they did not do a good job.
