
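The title, “Void: A fast and light voice liveness detection system,” names a fast, lightweight voice liveness detection system called “Void.” Voice liveness detection has become an important and concrete research topic as more and more smart devices support voice interaction. Such a system aims to distinguish speech uttered live by a person from recordings played through a loudspeaker, so that voice assistants are protected against replay attacks and voice synthesis attacks.

The abstract describes Void as highly efficient: it detects voice spoofing attacks using the differences in spectral power between live-human voices and voices replayed through speakers. Unlike existing approaches that use multiple deep learning models and thousands of features, Void uses a single classification model with just 97 features. Its effectiveness is evaluated on two datasets: 255,173 voice samples generated with 120 participants, 15 playback devices and 12 recording devices, and 18,030 publicly available voice samples generated with 42 participants, 26 playback devices and 25 recording devices. Void achieves equal error rates of 0.3% and 11.6% on these two datasets, respectively. Compared with a state-of-the-art deep learning-based solution that achieves a 7.4% error rate on the public dataset, Void uses 153 times less memory and is about 8 times faster in detection. When combined with a Gaussian Mixture Model that uses Mel-frequency cepstral coefficients (MFCC), which are already widely used in speech recognition services, Void achieves an 8.7% error rate on the public dataset.

To appreciate the work, it helps to know the background of voice liveness detection. Its main purpose is to protect speech recognition systems from unauthorized access or control, in particular from attacks that fool voice assistants by replaying recordings. Because the input channels of voice assistants are open, attackers can easily record people's voice commands and replay them to impersonate a live speaker, which poses a serious threat to personal privacy and security.

Void adopts a simpler design than existing systems, which significantly reduces computational complexity and memory requirements. It does not depend on complex deep learning models; instead, it performs detection with a single classification model and a relatively small number of features. This efficient design means the system can respond faster and consume fewer resources when deployed, which makes it more practical as a security measure.

Void also shows high detection rates against a variety of attacks, including hidden voice commands, inaudible voice commands, voice synthesis, equalization manipulation, and attacks that combine replay with live-human voices. This indicates that the system is robust and adaptable in the face of diverse attack strategies.

In terms of implementation, Void focuses on detecting differences in spectral power: voices replayed through loudspeakers differ significantly from live voices in their spectral power distribution, and this difference can be identified by analyzing the spectral features of voice samples. On top of this, liveness detection can be combined with MFCC features. Because these features are already widely used in existing speech recognition systems, such an integration is likely to go smoothly in practice, as it places little additional burden on existing speech recognition infrastructure.

In summary, at a time when speech recognition and smart devices are increasingly widespread, “Void: A fast and light voice liveness detection system” offers an efficient voice liveness detection solution with high detection rates. It is more than an optimization of existing techniques: it could serve as an effective line of defense in practice, protecting users' voice data and privacy. Its simplified model design and compatibility with existing technology also give it good potential for adoption.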


Void: A fast and light voice liveness detection system

Muhammad Ejaz Ahmed

Data61, CSIRO

Sungkyunkwan University

Il-Youp Kwak∗

Chung-Ang University

Jun Ho Huh

Samsung Research

Iljoo Kim

Samsung Research

Taekkyung Oh

KAIST

Sungkyunkwan University

Hyoungshick Kim

Sungkyunkwan University

Abstract

Due to the open nature of voice assistants’ input channels, adversaries could easily record people’s use of voice commands, and replay them to spoof voice assistants. To mitigate such spoofing attacks, we present a highly efficient voice liveness detection solution called “Void.” Void detects voice spoofing attacks using the differences in spectral power between live-human voices and voices replayed through speakers. In contrast to existing approaches that use multiple deep learning models, and thousands of features, Void uses a single classification model with just 97 features.

We used two datasets to evaluate its performance: (1) 255,173 voice samples generated with 120 participants, 15 playback devices and 12 recording devices, and (2) 18,030 publicly available voice samples generated with 42 participants, 26 playback devices and 25 recording devices. Void achieves equal error rates of 0.3% and 11.6% in detecting voice replay attacks on the two datasets, respectively. Compared to a state-of-the-art deep learning-based solution that achieves a 7.4% error rate on that public dataset, Void uses 153 times less memory and is about 8 times faster in detection. When combined with a Gaussian Mixture Model that uses Mel-frequency cepstral coefficients (MFCC) as classification features (MFCC is already being extracted and used as the main feature in speech recognition services), Void achieves an 8.7% error rate on the public dataset. Moreover, Void is resilient against hidden voice command, inaudible voice command, voice synthesis, and equalization manipulation attacks, as well as attacks that combine replay with live-human voices, achieving about 99.7%, 100%, 90.2%, 86.3%, and 98.2% detection rates for those attacks, respectively.

1 Introduction

Popular voice assistants like Siri (Apple), Alexa (Amazon) and Now (Google) allow people to use voice commands to quickly shop online, make phone calls, send messages, control smart home appliances, access banking services, and so on. However, such privacy- and security-critical commands make voice assistants lucrative targets for attackers to exploit. Indeed, recent studies [11, 12, 23] demonstrated that voice assistants are vulnerable to various forms of voice presentation attacks including “voice replay attacks” (attackers simply record victims’ use of voice assistants and replay them) and “voice synthesis attacks” (attackers train victims’ voice biometric models and create new commands).

∗ Part of this work was done while Dr. Kwak was at Samsung Research.

To distinguish between live-human voices and replayed voices, several voice liveness detection techniques have been proposed. Feng et al. [11] proposed the use of wearable devices, such as eyeglasses or earbuds, to detect voice liveness. They achieved about 97% detection rate but rely on the use of additional hardware that users would have to buy, carry, and use. Deep learning-based approaches [7, 30] have also been proposed. The best known solution from an online replay attack detection competition called “2017 ASVspoof Challenge” [7] is highly accurate, achieving about 6.7% equal error rate (EER), but it is computationally expensive and complex: two deep learning models (LCNN and CNN with RNN) and one SVM-based classification model were all used together to achieve high accuracy. The second best solution achieved 12.3% EER using an ensemble of 5 different classification models and multiple classification features: Constant Q Cepstral Coefficients (CQCC), Perceptual Linear Prediction (PLP), and Mel Frequency Cepstral Coefficients (MFCC) features were all used. CQCC alone is heavy and would consist of about 14,000 features.

To reduce computational burden and maintain high detection accuracy, we present “Void” (Voice liveness detection), which is a highly efficient voice liveness detection system that relies on the analysis of cumulative power patterns in spectrograms to detect replayed voices. Void uses a single classification model with just 97 spectrogram features. In particular, Void exploits the following two distinguishing characteristics in power patterns: (1) Most loudspeakers inherently add distortions to original sounds while replaying them. In consequence, the overall power distribution over the audible frequency range often shows some uniformity and linearity. (2) With human voices, the sum of power observed across lower frequencies is relatively higher than the sum observed across higher frequencies [15, 29]. As a result, there are significant differences in the cumulative power distributions between live-human voices and those replayed through loudspeakers. Void extracts those differences as classification features to accurately detect replay attacks.
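The second characteristic above (more power at lower frequencies than at higher ones) can be illustrated with a small sketch. This is not Void's feature extractor: the function names, the non-overlapping unwindowed frames, the naive DFT (used only to keep the example dependency-free), and the 1 kHz cutoff are all assumptions made for illustration.

```python
import cmath
import math

def frame_power_spectrum(frame):
    """Power |X[k]|^2 for bins k = 0..N/2 of one frame, via a naive DFT."""
    n = len(frame)
    powers = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        powers.append(abs(acc) ** 2)
    return powers

def power_per_frequency(signal, frame_len):
    """Sum the power of every non-overlapping frame, per frequency bin."""
    totals = [0.0] * (frame_len // 2 + 1)
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        for k, p in enumerate(frame_power_spectrum(frame)):
            totals[k] += p
    return totals

def low_band_share(power, sample_rate, frame_len, cutoff_hz):
    """Fraction of the total power in bins whose frequency is below cutoff_hz."""
    bin_hz = sample_rate / frame_len
    low = sum(p for k, p in enumerate(power) if k * bin_hz < cutoff_hz)
    return low / sum(power)

# Toy signal (an assumption, not real speech): a strong 250 Hz tone
# plus a weak 3 kHz tone, sampled at 8 kHz.
sample_rate = 8000
frame_len = 64
signal = [math.sin(2 * math.pi * 250 * t / sample_rate)
          + 0.2 * math.sin(2 * math.pi * 3000 * t / sample_rate)
          for t in range(256)]

power = power_per_frequency(signal, frame_len)
share = low_band_share(power, sample_rate, frame_len, cutoff_hz=1000)
print(f"share of power below 1 kHz: {share:.3f}")
```

A real pipeline would more likely use an FFT with overlapping, windowed frames; the naive DFT here only keeps the arithmetic visible.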

Our key contributions are summarized below:

• Design of a fast and light voice replay attack detection system that uses a single classification model and just 97 classification features related to signal frequencies and cumulative power distribution characteristics. Unlike existing approaches that rely on multiple deep learning models and do not provide much insight into complex spectral features being extracted [7, 30], we explain the characteristics of key spectral power features, and why those features are effective in detecting voice spoofing attacks.

• Evaluation of voice replay attack detection accuracy using two large datasets consisting of 255,173 voice samples collected from 120 participants, 15 playback devices and 12 recording devices, and 18,030 ASVspoof competition voice samples collected from 42 participants, 26 playback speakers and 25 recording devices, respectively, demonstrating 0.3% and 11.6% EER. Based on the latter EER, Void would be ranked as the second best solution in the ASVspoof 2017 competition. Compared to the best-performing solution from that competition, Void is about 8 times faster and uses 153 times less memory in detection. Void achieves 8.7% EER on the ASVspoof dataset when combined with an MFCC-based model (MFCC is already available through speech recognition services, and would not require additional computation).

• Evaluation of Void’s performance against hidden command, inaudible voice command, voice synthesis, equalization (EQ) manipulation attacks, and combining replay attacks with live-human voices showing 99.7%, 100%, 90.2%, 86.3%, and 98.2% detection rates, respectively.

2 Threat Model

2.1 Voice replay attacks

We define a live-human audio sample as a voice utterance initiated from a human user that is directly recorded through a microphone (such that it would normally be processed by a voice assistant). In a voice replay attack, an attacker uses a recording device (e.g., a smartphone) in close proximity to a victim, and first records the victim’s utterances (spoken words) of voice commands used to interact with voice assistants [3, 11, 12]. The attacker then replays the recorded samples using an in-built speaker (e.g., available on her phone) or a standalone speaker to complete the attack (see Figure 1).

Figure 1: Steps for a voice replay attack.

A voice replay attack may be the easiest attack to perform, but it is the most difficult one to detect as the recorded voices have similar characteristics compared to the victim’s live voices. In fact, most of the existing voice biometric-based authentication (human speaker verification) systems (e.g., [31, 32]) are vulnerable to this kind of replay attack.

2.2 Adversarial attacks

We also consider more sophisticated attacks such as “hidden voice command” [24, 25], “inaudible voice command” [18–20], and “voice synthesis” [6, 12] attacks that have been discussed in recent literature. Further, EQ manipulation attacks are specifically designed to game the classification features used by Void by adjusting specific frequency bands of attack voice signals.

3 Requirements

3.1 Latency and model size requirements

Our conversations with several speech recognition engineers at a large IT company (that runs its own voice assistant services with millions of subscribed users) revealed that there are strict latency and computational power usage requirements that must be considered upon deploying any kind of machine learning-based services. This is because additional use of computational power and memory through continuous invocation of machine learning algorithms may incur (1) unacceptable costs for businesses, and (2) unacceptable latency (delays) for processing voice commands. Upon receiving a voice command, voice assistants are required to respond immediately without any noticeable delay. Hence, processing delays should be close to 0 seconds; typically, engineers do not consider solutions that add 100 or more milliseconds of delay as portable solutions. A single GPU may be expected to concurrently process 100 or more voice sessions (streaming commands), indicating that machine learning algorithms must be lightweight, simple, and fast.
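As a rough illustration of how such a latency budget might be checked, the sketch below times a detector over repeated calls and compares the mean against the 100 ms figure mentioned above. The helper names and the placeholder detector are hypothetical; only the 100 ms threshold comes from the text.

```python
import time

def mean_latency_ms(detector, sample, repeats=20):
    """Average wall-clock time of detector(sample) in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        detector(sample)
    return (time.perf_counter() - start) * 1000.0 / repeats

def fits_budget(latency_ms, budget_ms=100.0):
    """True if the added delay stays strictly below the budget."""
    return latency_ms < budget_ms

def dummy_detector(samples):
    """Placeholder standing in for a real liveness detector (assumption)."""
    return sum(abs(s) for s in samples) > 0.0

latency = mean_latency_ms(dummy_detector, [0.0] * 1024)
print(f"mean latency: {latency:.4f} ms, fits budget: {fits_budget(latency)}")
```

On a shared server, the per-request budget would also have to account for many concurrent sessions, which this single-call timing does not model.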

Further, as part of future solutions, businesses are considering on-device voice assistant implementations (that would not communicate with remote servers) to improve response latency, save server costs, and minimize privacy issues related to sharing users’ private voice data with remote servers. For such on-device solutions with limited computing resources available, the model and feature complexity and size (CPU and memory usage) requirements would be even more constraining.

Figure 2: Spectrogram of an example phrase “The Blue Lagoon is a 1980 romance and adventure film” uttered live by a human user (left), and cumulative power spectral decay of the corresponding command (right).

Figure 3: Spectrogram of the same example phrase (as in Figure 2) replayed using iPhone 6s Plus (left), and cumulative power spectral decay (right).

3.2 Detection accuracy requirements

Our main objective is to achieve competitively high accuracy while keeping the latency and resource usage requirements at acceptable levels (see above). Again, our conversations with the speech recognition engineers revealed that businesses require an EER of around 10% or below for a solution to be considered usable. For reference, the best performing solution from the ASVspoof 2017 competition achieved 6.7% EER [30], and the second best solution achieved 12.3% [7].
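Since the accuracy requirement is stated in terms of EER, a minimal sketch of how an equal error rate can be computed from classifier scores may help. It assumes a convention not stated in the text: higher scores mean "more likely live", a replayed sample scoring at or above the threshold counts as a false acceptance, and a live sample scoring below it counts as a false rejection. The function name and the toy scores are made up for illustration.

```python
def equal_error_rate(live_scores, replay_scores):
    """Approximate EER: error rate at the threshold where FAR and FRR are closest."""
    best_gap, eer = None, None
    for thr in sorted(set(live_scores) | set(replay_scores)):
        # False acceptance rate: replayed samples accepted as live.
        far = sum(s >= thr for s in replay_scores) / len(replay_scores)
        # False rejection rate: live samples rejected as replayed.
        frr = sum(s < thr for s in live_scores) / len(live_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Toy scores (assumptions): one live sample scores low, one replay scores high.
live = [0.9, 0.8, 0.7, 0.35, 0.6]
replay = [0.1, 0.2, 0.3, 0.4, 0.65]
eer = equal_error_rate(live, replay)
print(f"EER: {eer:.1%}")
```

With only a handful of scores the two error curves rarely cross exactly, which is why the sketch averages FAR and FRR at the closest threshold instead of interpolating.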

4 Key classification features

Void exploits the differences in frequency-dependent spectral power characteristics between live-human voices and voices replayed through loudspeakers. Through numerous trials and experiments, we observed three distinct features related to the power spectrum of speech signals that may distinguish live-human voices from voices replayed through loudspeakers. This section explores those features in detail.

Figure 1 shows the steps involved in replaying recorded voice signals. An attacker would first record a victim’s voice command using her own recording device. Then the attacker would use the same device (in-built speaker) to replay the recorded voice command, targeted at the victim’s device. This attack command is then processed by the voice assistant service running on the victim’s device. While performing this replay attack, some distortions may be added to the victim’s original sound while being recorded with the microphone on the attacker’s device, and also while being replayed through the in-built speaker due to hardware imperfections. The following sections explore the spectral power characteristics of replayed voices, and analyze key classification features that are used to classify voice replay attacks.

4.1 Decay patterns in spectral power

In general, low-quality loudspeakers are designed to achieve high sensitivity and volume but at the cost of compromising audio fidelity and adding unwanted distortions [35]. As a result, distortions that contribute to non-linearity may be more prevalent in low-quality loudspeakers, and less visible in high-quality loudspeakers [36, 37].

Figure 2 (left) shows the spectrogram of a sentence “The Blue Lagoon is a 1980 romance and adventure film” uttered live, and processed by an audio chipset in a laptop. Here, the audio sampling rate was 44.1 kHz, and the utterance duration was 5 seconds. In this voice sample, most of the spectral power lies in the frequency range between 20 Hz and 1 kHz. The cumulative spectral power measured for each frequency is also shown in Figure 2 (right). The power of the human voice decays exponentially at frequencies around 1 kHz.

On the other hand, the spectrogram of a phrase replayed through the iPhone 6s Plus in-built speaker (see Figure 3) shows some uniformity: the power distribution is spread across the spectrum between 1 and 5 kHz. Unlike the live-human voice trends shown in Figure 2, the cumulative spectral power does not decrease exponentially; rather, there is a relatively more linear decay between 1 and 5 kHz. To show the difference between Figures 2 and 3 quantitatively, we added quadratic fitting curves to them and computed the Root Mean Square Error (RMSE) of each fit separately.
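The quadratic fit and RMSE comparison can be sketched as follows. The text does not say how the fit was computed, so this sketch assumes an ordinary least-squares fit solved through the normal equations; the function names and the two toy decay curves (one linear, one exponential) are illustrative and are not the measured data behind Figures 2 and 3.

```python
import math

def fit_quadratic(xs, ys):
    """Least-squares coefficients (a, b, c) of y = a*x^2 + b*x + c."""
    s = [sum(x ** p for x in xs) for p in range(5)]          # s[p] = sum of x^p
    t = [sum(y * x ** p for x, y in zip(xs, ys)) for p in range(3)]
    # Augmented normal-equation matrix for the unknowns (a, b, c).
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))

def quadratic_rmse(xs, ys):
    """RMSE between ys and the best-fitting quadratic over xs."""
    a, b, c = fit_quadratic(xs, ys)
    residuals = [a * x * x + b * x + c - y for x, y in zip(xs, ys)]
    return math.sqrt(sum(r * r for r in residuals) / len(xs))

# Toy curves over 0..5 kHz (assumptions, not measured spectra).
freqs_khz = [i / 10 for i in range(51)]
linear_decay = [1.0 - 0.15 * f for f in freqs_khz]
exponential_decay = [math.exp(-3.0 * f) for f in freqs_khz]

rmse_linear = quadratic_rmse(freqs_khz, linear_decay)
rmse_exponential = quadratic_rmse(freqs_khz, exponential_decay)
print(f"RMSE (linear decay): {rmse_linear:.2e}")
print(f"RMSE (exponential decay): {rmse_exponential:.2e}")
```

Frequencies are expressed in kHz rather than Hz so that the sums of fourth powers stay small and the normal equations remain well conditioned.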

Our experimentation with 11 in-built smartphone speakers showed similar behaviors in their spectral power distributions; i.e., power decreased gradually across frequencies and did not decay exponentially. An example cumulative distribution of spectral power density is shown in Figure 4. With the human voice example, about 70% of the overall power lies in the frequency range below 1 kHz. However, in the loudspeaker case, the cumulative distribution increases almost linearly, and 70% of the total power lies in the frequency range below about 4 kHz.
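The "frequency below which 70% of the power lies" comparison can be sketched with a normalized cumulative sum. The function names and both toy power profiles below are assumptions chosen to mimic a front-loaded (live-like) and a nearly flat (replay-like) distribution; they are not the data behind Figure 4.

```python
def cumulative_power_share(power):
    """Normalized cumulative distribution of power across frequency bins."""
    total = sum(power)
    running, shares = 0.0, []
    for p in power:
        running += p
        shares.append(running / total)
    return shares

def frequency_reaching_share(freqs_hz, power, share=0.7):
    """Upper edge of the first bin at which the cumulative share is reached."""
    for f, s in zip(freqs_hz, cumulative_power_share(power)):
        if s >= share:
            return f
    return freqs_hz[-1]

# Upper edges of ten 500 Hz-wide bins (toy resolution).
freqs_hz = [500 * (i + 1) for i in range(10)]
live_like = [45, 30, 8, 5, 4, 3, 2, 1, 1, 1]          # power concentrated at low frequencies
replay_like = [12, 11, 11, 10, 10, 10, 10, 9, 9, 8]   # power spread almost evenly

live_f70 = frequency_reaching_share(freqs_hz, live_like)
replay_f70 = frequency_reaching_share(freqs_hz, replay_like)
print(f"70% of power reached by {live_f70} Hz (live-like)")
print(f"70% of power reached by {replay_f70} Hz (replay-like)")
```

A nearly flat profile pushes the 70% point far up the frequency axis, which is the qualitative gap the paragraph describes between the two cases.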

One possible explanation for this spreading-out characteristic is low-quality hardware boosting power in certain frequency ranges. Consequently, such a linear decay pattern in spectral power (over audible frequency range) could be
