AI-assisted Tagging of Deepfake Audio Calls using Challenge-Response

Govind Mittal¹, Arthur Jakobsson², Kelly O. Marshall¹, Chinmay Hegde¹, Nasir Memon¹

¹ New York University, Tandon School of Engineering, Brooklyn, NY
² Carnegie Mellon University, Pittsburgh, PA

{mittal, km3888, chinmay.h, memon}@nyu.edu, ajakobss@andrew.cmu.edu

arXiv:2402.18085v1 [cs.SD] 28 Feb 2024
Abstract

Scammers are aggressively leveraging AI voice-cloning technology for social engineering attacks, a situation significantly worsened by the advent of audio Real-time Deepfakes (RTDFs). RTDFs can clone a target's voice in real-time over phone calls, making these interactions highly interactive and thus far more convincing. Our research addresses a gap in the existing literature on deepfake detection, which has largely been ineffective against RTDF threats. We introduce a robust challenge-response-based method to detect deepfake audio calls, pioneering a comprehensive taxonomy of audio challenges. Our evaluation pits 20 prospective challenges against a leading voice-cloning system.

We have compiled a novel open-source challenge dataset with contributions from 100 smartphone and desktop users, yielding 18,600 original and 1.6 million deepfake samples. Through rigorous machine and human evaluations of this dataset, we achieved a deepfake detection rate of 86% and an 80% AUC score, respectively. Notably, utilizing a set of 11 challenges significantly enhances detection capabilities. Our findings reveal that combining human intuition with machine precision offers complementary advantages. Consequently, we have developed a human-AI collaborative system that melds human discernment with algorithmic accuracy, boosting final joint accuracy to 82.9%. This system highlights the significant advantage of AI-assisted pre-screening in call verification processes.

https://mittalgovind.github.io/autch-samples/
1 Introduction

Recent advancements in synthetic speech generation have obscured the distinction between authentic and fabricated media [14, 18, 24]. The impact of such advancements is felt particularly with the advent of voice-cloning tools that can generate speech that sounds authentic in real-time [12, 25, 26, 41, 43]. The existence of such tools presents a critical vulnerability for exploitation during social engineering and raises an urgent question: how can one ascertain the authenticity of a caller in an age where synthetic voices are indistinguishable from real ones?
Phone call-based social engineering attacks are highly plausible, especially since 2019, when online interactions became mainstream. One example of a phone scam is robocalls, whose numbers in the U.S. peaked at an astonishing 58.5 billion. Robocalls have even used the voices of U.S. President Biden [7] and the Mayor of New York City for voter suppression and outreach [8]. This concerning escalation prompted repeated warnings from the FBI [21] and the FTC [36], and an eventual prohibition against robocalls by the FCC [20].

Now consider a form of phone scam more believable and targeted than robocalls: one leveraging Real-time Deepfakes, or RTDFs. Such deepfakes make scam calls sound convincingly close to a target's voice, and being real-time, they become interactive and capable of conversing. These fake calls are riskier because they are more believable and hence more persuasive. We will shortly discuss how we used RTDFs' inherent interactivity to tackle them.
One could assume that decades of technological advancements in voice calling should be able to counter most scams. However, several incidents where scammers exploited audio deepfakes highlight the inadequacy of current telecommunications and regulatory measures to contain such threats. Such incidents include convincing a finance worker to remit $25 million by sounding like their chief financial officer over a group video call [6], demanding a $50,000 ransom by mimicking a daughter's distressed voice to deceive her mother [16], and defrauding an energy company of $243,000 by impersonating their boss's voice over the phone [22]. Moreover, a McAfee survey of 7,000 individuals worldwide found that 25% had encountered an AI voice-cloning scam or knew someone who had [29], revealing a troubling reality: end users have been left responsible for discerning a genuine caller from an imposter.
In an ideal scenario, an aware individual would approach every call with a healthy degree of skepticism, and a suspicious caller could never make any gains. In reality, receiving a call from a known caller ID [2] and the mere recognition of a familiar voice can swiftly establish trust, even amidst noise or voice distortions. Typically, an explanation from the caller, asserting they are under the weather or in an area with poor reception, is enough to dispel any doubts [42]. Moreover, sophisticated RTDFs now challenge conventional security protocols, such as speaker verification systems and liveness detection in KYC processes, making them prone to evasion [3–5].
Numerous audio deepfake detection methods have been developed [47]. However, they are primarily designed for static content and have limited efficacy against RTDFs. These methods assume a non-interactive scenario and work offline, allowing impostors ample time to refine manipulated content with sophisticated editing tools.
In this work, we depart from traditional techniques and leverage challenge-response as an alternative approach to detecting deepfake audio calls. Challenge-response mechanisms, traditionally pivotal in distinguishing bots from humans via the ubiquitous CAPTCHA, have broken new ground with the introduction of GOTCHA [32] and D-CAPTCHA [46] for unmasking video and audio deepfakes. These pioneering studies capitalize on an asymmetric advantage: the burden of maintaining high quality in real-time, under challenging situations, now rests squarely on the imposter caller. This work aims to harness the full spectrum of audio challenges during phone calls and orchestrate a collaborative dynamic between humans and AI, thereby fortifying detection accuracy and confidence.
We began by curating a detailed taxonomy of audio challenges, categorizing them into eight main types and 22 subtypes, including vocal distortions (e.g., whispering), waveform manipulations (e.g., high-pitch speaking), language-specific articulations (e.g., rolling R sounds), environmental noise (e.g., clapping while speaking), and background playbacks (e.g., talking over music). We selected and evaluated 20 prospective challenges from the taxonomy.
Our investigation revolves around three research questions:

RQ1: Do challenges enhance machine detection?
RQ2: Can human evaluators harness these challenges to sharpen their discernment?
RQ3: Does supporting humans with automated detectors further improve overall performance?
To address these questions, we collected a novel open-source dataset by engaging 100 participants across mobile and desktop interfaces to produce 18,600 original voice recordings. We subsequently generated 1.6 million samples using a state-of-the-art one-shot voice-cloning technology.
We performed machine evaluation (we use "machines" and "AI" interchangeably) by designing a non-intrusive degradation metric that assesses audio samples on compliance, naturalness, and the preservation of word information. This metric revealed that a select group of 11 challenges dramatically boosted deepfake detection performance (Area Under the Curve, or AUC) to 86.7%, compared to 56.0% for deepfakes of everyday speech, highlighting the benefit of using challenges to aid machines. Notably, challenges like whispering, cupping the mouth, and playback were among the most effective for machines.
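To make this pipeline concrete, the sketch below shows how per-sample degradation sub-scores could be fused into a suspicion score and evaluated with AUC. It is a minimal sketch under stated assumptions: the sub-score values and the equal-weight fusion are illustrative placeholders, not the paper's actual metric implementation.

```python
# Minimal sketch: fuse three hypothetical sub-scores (each in [0, 1],
# higher = better-sounding audio) into a suspicion score, then measure AUC.
from sklearn.metrics import roc_auc_score

def degradation_score(compliance: float, naturalness: float, word_info: float) -> float:
    """Higher output = more degraded response = more likely a deepfake.
    Equal weighting is an illustrative assumption."""
    return 1.0 - (compliance + naturalness + word_info) / 3.0

# Hypothetical per-sample sub-scores: (compliance, naturalness, word_info, label).
samples = [
    (0.95, 0.90, 0.97, 0),  # bonafide caller handles the challenge well
    (0.60, 0.40, 0.55, 1),  # RTDF degrades under the same challenge
    (0.92, 0.85, 0.90, 0),
    (0.70, 0.55, 0.65, 1),
]
labels = [s[3] for s in samples]
scores = [degradation_score(*s[:3]) for s in samples]
print(f"AUC: {roc_auc_score(labels, scores):.3f}")
```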
Concurrently, we performed a human evaluation on a harder subset of the whole dataset, one in which SpeechBrain's speaker-verification system [37] classified all deepfake samples as genuine. Humans assessed the compliance and quality of each audio sample and achieved an AUC of 80.1%, with challenges bringing out discernible degradation in deepfakes.
Informed by these results and the propensity of humans to produce more false positives, we envisioned a framework in which machines assist humans in making decisions. This integration makes our approach scalable while keeping the outcome interpretable with human oversight. A subsequent human evaluation with the top-performing challenges, using a balanced dataset, revealed that machine assistance increased human detection accuracy from 72.3% to 78.5%. Furthermore, machine assistance significantly boosted human confidence and rectified human errors in 43% of cases (when machines were correct) while causing misjudgments in 29% of cases (when machines were wrong).
Taking this integration a step further, in scenarios where humans were uncertain, machines were allowed to take charge and make the final call. This blend of human-AI contributions, at a ratio of 56:44, enhanced the overall detection accuracy to 82.9%, representing a 14.3% improvement over the human-only baseline (with machine accuracy standing at 85%). This approach underscores that while humans retain their primary role in decision-making, strategic machine involvement can dramatically enhance the accuracy of detecting fake calls.
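A minimal sketch of this two-stage policy as we read it: the human decides after seeing the machine's pre-screening output, and the machine's thresholded score decides only on deferral. The `HumanJudgment` structure and the 0.5 threshold are hypothetical stand-ins, not the paper's interface.

```python
from dataclasses import dataclass

@dataclass
class HumanJudgment:
    says_fake: bool   # the human's verdict, made with machine assistance
    confident: bool   # did the human commit, or do they defer?

def final_verdict(human: HumanJudgment,
                  machine_score: float,
                  threshold: float = 0.5) -> bool:
    """Return True if the call is flagged as a deepfake.

    Illustrative two-stage policy: the human retains the primary role;
    the machine takes charge only when the human is uncertain. The
    fraction of deferrals yields a human:machine decision ratio
    (56:44 in the paper's evaluation).
    """
    if human.confident:
        return human.says_fake
    return machine_score >= threshold

# Example: an unsure human defers to a suspicious machine score.
print(final_verdict(HumanJudgment(says_fake=False, confident=False), 0.81))  # True
```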
Our findings validated the instrumental role of challenge-response mechanisms in bolstering both machine and human capabilities in audio deepfake detection, showcasing a synergy between human insight and AI accuracy that mitigates the threats posed by real-time synthetic speech technologies.
Distinction from Robocalls and Liveness Detection. Robocalls are like spam: generally untargeted messages distributed en masse to a large population in an automated manner. RTDFs (the context for this work), by contrast, are targeted and possibly curated for the particular receiver.

In principle, liveness detection mechanisms are ineffective against RTDFs, as the imposter generates deepfakes in real-time; hence, they are already live. Also, there is a person behind the call who can understand instructions and converse. Thus, each interaction could unfold differently.
2 Background

Definition. An audio deepfake refers to a digital impersonation in which an imposter mimics audio characteristics to convincingly match a specific target's likeness. When such impersonations can be done live with sufficient fidelity, we call them RTDFs.
2.1 Cognitive Bias and Human Susceptibility

In the context of deepfakes, individuals are left to navigate the landscape on their own, letting scammers abuse their trust during phone calls. Two pivotal factors contribute to the phenomenon of rapid trust-building over calls. First, there exists a general human acknowledgment of the fallibility of our auditory perception. Mishearing is a shared experience, arguably more so than visual misinterpretation. We naturally integrate multi-sensory inputs into listening, such as live transcription, facial expressions, mouth movements, and the conversation context, to enhance our ability to discern speech. These cues are sufficiently potent to augment and even override auditory perception, as demonstrated by the McGurk effect [30]. Under this effect, the visual information from seeing a person's mouth speak changes how one hears the speech, leading to misheard syllables. This phenomenon highlights our innate understanding that auditory signals can be unreliable, which gets amplified over audio calls, where the channel is noisy and the visual component is absent.
Second, the technological landscape has molded our expectations of sound quality. Individuals who grew accustomed to the quirks of early mobile telephony – limited bandwidth, dropped calls, background noise, and frequent verbal confirmations to ensure mutual understanding – have been conditioned to interpret and understand communication even in subpar auditory conditions. Despite improvements in audio clarity, exposure to imperfect sound has ingrained in us a tolerance for noisy or unclear speech. Such tolerance accentuates our inherent bias to trust noisy speech.
This evolutionary and technological adaptation in auditory perception is evident in how humans rate audio quality. When deploying mean opinion scores (MOS) for speech-naturalness assessment, the scale customarily spans from 1 (incomprehensible) to 5 (akin to face-to-face). Nonetheless, the average difference between the highest and lowest MOS scores is often only two points and skewed towards the higher end [38]. This indicates a degree of leniency, or an adaptive bias, in our auditory judgments, underscoring our propensity to make sense of auditory inputs despite their quality.

This innate human tendency to trust auditory inputs rather than mistrust them offers a potent avenue for exploitation in social engineering attacks employing audio deepfakes.
2.2 Problem Description

Threat Model. Three parties are involved in this scenario: an impostor, the intended identity (the target), and a defender. The defender receives a call of a sensitive nature, answering in the expectation of interacting directly with the target. However, an impostor may employ AI-generated speech-synthesis tools to communicate with the defender, masquerading as the target. The primary objective of the defender is to authenticate the caller's identity before proceeding further.

A sensitive call refers to a remote communication initiated to acquire something from the recipient, who possesses both the capability and willingness to grant it. Such calls include job interviews, phone banking, disreputable interactions, and pranks.
Defenders. The defender does not assume any trust in the potential imposter or their devices. The defender requests the caller to perform a specific task. The nature of the possible requests is public; however, a seed is initialized to randomize the task. No identifiable target information, such as voice biometrics, is collected through an extensive enrollment process. This pits the defender against a strong threat model, since identity information would have helped them.
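Since the pool of possible requests is public, unpredictability comes entirely from the per-call seed. Below is a minimal sketch of seeded challenge selection; the abbreviated pool is drawn from Table 1, while the two-challenge draw and the function shape are illustrative assumptions.

```python
import random

# Abbreviated pool of public challenges from Table 1 (IDs in parentheses).
CHALLENGES = ["whispering (3)", "cup mouth (2)", "speak quickly (6)",
              "high pitch (9)", "foreign words (11)", "clap (17)"]

def issue_challenge(seed: int, n: int = 2) -> list[str]:
    """Draw n challenges for this call. The pool is public knowledge;
    only the per-call seed (e.g., derived from os.urandom) makes the
    request unpredictable to the imposter."""
    rng = random.Random(seed)
    picks = rng.sample(CHALLENGES, k=n)
    # A challenge may itself be parameterized (e.g., which foreign words
    # to read); the same rng can supply that randomness as well.
    return picks

print(issue_challenge(seed=0xC0FFEE))
```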
Imposters. An imposter has access to sufficient computational resources to run voice-cloning software in real-time. They can discern and respond to requests made to them. An imposter can obtain a high-quality speech sample of the target, potentially via social media or an unsolicited phone call (e.g., pre-recorded voicemail greetings). This scenario is particularly impactful, as it requires scant data and can cause widespread harm.
Hypothesis. Speech communication comprises several vital elements, including phonetics, articulation, pitch, tone, rhythm, stress, voice quality, and fluency, all of which combine to create a distinct auditory experience. An audio deepfake system should fail to maintain fidelity in real time while supporting all such elements.
3 Related Work

Voice-Cloning. Voice-cloning systems aim to separate speaker identity from content in a given target's data and replace it with the desired content at inference time. Such speech-generation systems broadly differ in their form of input: text or speech. In our analysis, text-to-speech models [12, 39] excelled in content modulation but fell short in capturing speech nuances like emotion or prosody, severely limiting them against our challenges. Our focus thus shifted to speech-to-speech systems, which convert a source speech into the targeted speaker's voice while maintaining subtle elements of speech.
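Schematically, such a speech-to-speech system factors an utterance into content and speaker identity, then re-synthesizes with the target's identity. The PyTorch sketch below shows only this information flow; the module choices and shapes are illustrative assumptions and do not correspond to any specific converter's code.

```python
import torch
import torch.nn as nn

class AnyToAnyVC(nn.Module):
    """Illustrative any-to-any voice converter: content is taken from the
    source utterance, identity from a reference clip of the target."""

    def __init__(self, n_mels: int = 80, d: int = 256):
        super().__init__()
        self.content_enc = nn.GRU(n_mels, d, batch_first=True)  # keeps content
        self.speaker_enc = nn.GRU(n_mels, d, batch_first=True)  # keeps identity
        self.decoder = nn.GRU(2 * d, n_mels, batch_first=True)  # re-synthesizes

    def forward(self, source_mel: torch.Tensor, target_ref_mel: torch.Tensor):
        content, _ = self.content_enc(source_mel)        # (B, T, d) per frame
        _, spk = self.speaker_enc(target_ref_mel)        # (1, B, d) summary
        spk = spk[-1].unsqueeze(1).expand(-1, content.size(1), -1)
        out, _ = self.decoder(torch.cat([content, spk], dim=-1))
        return out  # mel-spectrogram in the target's voice (then vocoded)

# The imposter's real-time constraint: this forward pass (plus a vocoder)
# must run faster than audio arrives, for every challenge issued.
vc = AnyToAnyVC()
fake = vc(torch.randn(1, 200, 80), torch.randn(1, 150, 80))
```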
Among the various voice converters available, such as PPG-VC [28], StarGANv2-VC [27], FREE-VC [25], StreamVC [45], and KNN-VC [9], we chose FREE-VC for our evaluation. FREE-VC is noted for its open-source (https://github.com/OlaWod/FreeVC) any-to-any voice-cloning capabilities, outperforming earlier models and matching recent advancements, and Korshunov et al. [23] consider it a high threat to speaker recognition systems; we therefore assume it will be an imposter's top choice.
Speaker Verification. Speaker verification aims to match the speaker in an unknown speech sample to a clean reference speech sample. In our work, when we test SpeechBrain [37] on our original and deepfake regular-speech dataset, it scores a near-perfect 98.5%. However, the need for a clean reference file makes such an approach impractical and incompatible with our threat model.
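For reference, the pretrained SpeechBrain verifier can be exercised in a few lines, following its model card's documented usage; the file paths below are placeholders, and newer SpeechBrain releases expose the class under `speechbrain.inference` instead.

```python
# Pairwise speaker verification with SpeechBrain's pretrained ECAPA-TDNN.
# Note: our threat model assumes no clean reference file is available,
# which is exactly why this otherwise strong approach does not apply.
from speechbrain.pretrained import SpeakerRecognition

verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)
# Placeholder wav paths: a clean reference and the call under test.
score, same_speaker = verifier.verify_files("reference.wav", "unknown_call.wav")
print(float(score), bool(same_speaker))
```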
Challenge-Response Systems. Authentication schemes utilizing challenge-response mechanisms underpin the ubiquitous CAPTCHA, which differentiates bots masquerading as humans on the internet. Audio CAPTCHAs [19] extend this approach to people with visual impairments. The GOTCHA system [32] introduced a challenge-response approach for real-time video deepfake detection, analyzing eight video challenges. That study concluded that challenges lead to consistent and measurable degradation in real-time video deepfakes, with a focus on interpretability.
D-CAPTCHA [46] developed a challenge-response system for audio calls, evaluating nine challenges and measuring detection performance based on realism, compliance, and identity against deepfake generators, with StarGANv2-VC [27] being the most potent generator tested. Their research is a vital proof-of-concept indicating the promising potential of challenge-response systems for detecting deepfake audio calls, and it motivates a more systematic and robust approach. Our work builds upon this by exploring 20 challenges and developing a pre-screener for calls that assists and collaborates with human receivers.
Human Evaluation of Deepfakes. Müller et al. [34] conducted a study in which participants competed against an AI model, RawNet2 [40], to identify audio deepfakes. Participants were shown the AI's classification after listening to an audio clip and deciding its authenticity. The study compared human performance against AI in detecting deepfakes without allowing participants to revise their choices. It concluded that humans and AI exhibit similar strengths and weaknesses across various spoofing tasks from ASVspoof 2019 [35]. Contrary to their findings, our work provides empirical evidence that human and AI performance are not strongly correlated and often complement each other's decision-making.
Table 1: Taxonomy of audio challenges, encompassing eight categories and 22 subcategories with examples. We evaluated the numbered examples as 20 prospective challenges.

| Category | Sub-category | Example(s) |
| --- | --- | --- |
| No Challenge | Read Normally | Regular Speech (0) |
| Vocal Distortions | Vocal Peripherals | Static Mouth (1), Cup Mouth (2) |
| Vocal Distortions | Whisper | Whispering (3) |
| Vocal Distortions | Vocal Cavity | Hold Nose (8), humming |
| Waveform | Frequency | High Pitch (9), Low Pitch (10), Sing (13) |
| Waveform | Amplitude | Speak Loudly (4) & Softly (5) |
| Waveform | Temporal | Speak Quickly (6) & Slowly (7) |
| Language / Articulation | Difficult Sequences | Foreign Words (11), Reverse Count |
| Language / Articulation | Mimicry | Mimic another Accent (12) |
| Language / Articulation | Phonetics | Rolled R's & Tongue Clicks |
| Language / Articulation | Deception | Sudden Interruption while Speaking |
| Tone of Voice | Emotion | Sound Happy / Sad (14) |
| Tone of Voice | Phonology | Questions (Inflection) (15) |
| Intentional Noise | Vocal | Cough / Whistle (16) |
| Intentional Noise | Non-vocal | Clap (17), Flick Microphone |
| Intentional Noise | Background Noise | Birds, Cars |
| Playback (desktop only) | Echo | Two Mics on the Same Call |
| Playback (desktop only) | Speech | Cross-talk (18) |
| Playback (desktop only) | Music | Instrumental (19), Lyrical (20) |
| Behavioral (out-of-scope) | Unique Habits | Mannerisms, Person-of-Interest |
| Behavioral (out-of-scope) | Biometric | Voice Print Detection |
| Passive Distortions (out-of-scope) | Perturbations | Adversarial Perturbations |
| Passive Distortions (out-of-scope) | Software Editing | Modulating Pitch, Noise, or Bass |

■ Lingual challenge, ■ Non-lingual challenge, ■ Replay challenge
4 Speech Challenges and their Taxonomy

Definition. An audio challenge is a task that:

• is plausible to perform during a phone call,
• induces degradation in real-time deepfakes,
• supports randomization, and
• is optionally verifiable by humans.
This section describes a taxonomy of audio tasks that challenge RTDFs to adapt to novel inputs. RTDFs facing unusual voice patterns produce speech artifacts, which later aid us in discovering their presence on calls. We delineate the taxonomy in Table 1, intending to stimulate variation in the several factors that constitute phone calls, such as linguistics, tone of voice, background, and behavioral cues.
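For experimentation, the 20 evaluated challenges (plus the regular-speech baseline) map naturally onto a lookup table keyed by the IDs in Table 1; a minimal sketch of such a table, with names taken from Table 1, usable as the pool for the seeded selection sketch in Section 2.2:

```python
# The evaluated challenges from Table 1, keyed by their numbered IDs
# (0 is the regular-speech baseline, 1-20 are the prospective challenges).
EVALUATED_CHALLENGES: dict[int, str] = {
    0: "Regular Speech (baseline)",
    1: "Static Mouth",                  2: "Cup Mouth",
    3: "Whispering",                    4: "Speak Loudly",
    5: "Speak Softly",                  6: "Speak Quickly",
    7: "Speak Slowly",                  8: "Hold Nose",
    9: "High Pitch",                    10: "Low Pitch",
    11: "Foreign Words",                12: "Mimic another Accent",
    13: "Sing",                         14: "Sound Happy / Sad",
    15: "Questions (Inflection)",       16: "Cough / Whistle",
    17: "Clap",                         18: "Cross-talk Playback",
    19: "Instrumental Music Playback",  20: "Lyrical Music Playback",
}
```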
4.1 Lingual Challenges

Lingual challenges prompt individuals to alter their vocal characteristics in distinctive ways. These challenges exploit the human vocal system's degrees of freedom, such as vocal-fold shape, peripheral movements, lung pressure, resonance, and phonation modes.

Vocal Distortions include challenges that manipulate mouth peripherals (including the tongue, lips, and teeth) and vocal-cord resonance. We sub-categorize them as follows: