AI-assisted Tagging of Deepfake Audio Calls using Challenge-Response

Govind Mittal¹, Arthur Jakobsson², Kelly O. Marshall¹, Chinmay Hegde¹, Nasir Memon¹

¹ New York University, Tandon School of Engineering, Brooklyn, NY
² Carnegie Mellon University, Pittsburgh, PA

{mittal, km3888, chinmay.h, memon}@nyu.edu, ajakobss@andrew.cmu.edu

arXiv:2402.18085v1 [cs.SD] 28 Feb 2024
Abstract

Scammers are aggressively leveraging AI voice-cloning technology for social engineering attacks, a situation significantly worsened by the advent of audio Real-time Deepfakes (RTDFs). RTDFs can clone a target's voice in real-time over phone calls, making these interactions highly interactive and thus far more convincing. Our research addresses a gap in the existing literature on deepfake detection, which has largely been ineffective against RTDF threats. We introduce a robust challenge-response-based method to detect deepfake audio calls, pioneering a comprehensive taxonomy of audio challenges. Our evaluation pits 20 prospective challenges against a leading voice-cloning system.

We have compiled a novel open-source challenge dataset with contributions from 100 smartphone and desktop users, yielding 18,600 original and 1.6 million deepfake samples. Through rigorous machine and human evaluations of this dataset, we achieved a deepfake detection rate of 86% and an 80% AUC score, respectively. Notably, utilizing a set of 11 challenges significantly enhances detection capabilities. Our findings reveal that combining human intuition with machine precision offers complementary advantages. Consequently, we have developed a human-AI collaborative system that melds human discernment with algorithmic accuracy, boosting final joint accuracy to 82.9%. This system highlights the significant advantage of AI-assisted pre-screening in call verification processes.

https://mittalgovind.github.io/autch-samples/
1 Introduction

Recent advancements in synthetic speech generation have obscured the distinction between authentic and fabricated media [14, 18, 24]. The impact of such advancements is felt particularly with the advent of voice-cloning tools that can generate speech that sounds authentic in real-time [12, 25, 26, 41, 43]. The existence of such tools presents a critical vulnerability for exploitation during social engineering and raises an urgent question: how can one ascertain the authenticity of a caller in an age where synthetic voices are indistinguishable from real ones?
Phone call-based social engineering attacks are highly plausible, especially since 2019, when online interactions became mainstream. One example of a phone scam is robocalls, whose numbers in the U.S. peaked at an astonishing 58.5 billion. Robocalls have even used the voices of U.S. President Biden [7] and the Mayor of New York City for voter suppression and outreach [8]. This concerning escalation prompted repeated warnings from the FBI [21] and the FTC [36], and an eventual prohibition against robocalls by the FCC [20].

Now consider a form of phone scam more believable and targeted than robocalls: one leveraging Real-time Deepfakes, or RTDFs. Such deepfakes make scam calls sound convincingly close to a target's voice, and being real-time, they become interactive and capable of conversing. These fake calls are riskier because they are more believable and hence more persuasive. We will shortly discuss how we used RTDFs' inherent interactivity to tackle them.
One could assume that decades of technological advancements in voice calling should be able to counter most scams. However, several incidents where scammers exploited audio deepfakes highlight the inadequacy of current telecommunications and regulatory measures to contain such threats. Such incidents include convincing a finance worker to remit $25 million by sounding like their chief financial officer over a group video call [6], demanding a $50,000 ransom by mimicking a daughter's distressed voice to deceive her mother [16], and defrauding an energy company of $243,000 by impersonating their boss's voice over the phone [22]. Moreover, a McAfee survey of 7,000 individuals worldwide found that 25% had encountered an AI voice-cloning scam or knew someone who had [29], revealing a troubling reality: end users have been left responsible for discerning a genuine caller from an imposter.
In an ideal scenario, an aware individual would approach every call with a healthy degree of skepticism, and a suspicious caller could never make any gains. In reality, receiving a call from a known caller ID [2] and the mere recognition of a familiar voice can swiftly establish trust, even amidst noise or voice distortions. Typically, an explanation from the caller, asserting they are under the weather or in an area with poor reception, is enough to dispel any doubts [42]. Moreover, sophisticated RTDFs now challenge conventional security protocols, such as speaker verification systems and liveness detection in KYC processes, making them prone to evasion [3–5].
Numerous audio deepfake detection methods have been developed [47]. However, they are primarily designed for static content and have limited efficacy against RTDFs. These methods assume a non-interactive scenario and work offline, allowing impostors ample time to refine manipulated content with sophisticated editing tools.
In this work, we depart from traditional techniques and leverage challenge-response as an alternative approach to detecting deepfake audio calls. Challenge-response mechanisms, traditionally pivotal in distinguishing bots from humans via the ubiquitous CAPTCHA, have broken new ground with the introduction of GOTCHA [32] and D-CAPTCHA [46] for unmasking video and audio deepfakes. These pioneering studies capitalize on an asymmetric advantage: the burden of maintaining high quality in real-time, under challenging situations, now rests squarely on the imposter caller. This work aims to harness the full spectrum of audio challenges during phone calls and orchestrate a collaborative dynamic between humans and AI, thereby fortifying detection accuracy and confidence.
We began by curating a detailed taxonomy of audio challenges, categorizing them into eight main types and 22 subtypes, including vocal distortions (e.g., whispering), waveform manipulations (e.g., high-pitch speaking), language-specific articulations (e.g., rolling R sounds), environmental noise (e.g., clapping while speaking), and background playbacks (e.g., talking over music). We selected and evaluated 20 prospective challenges from the taxonomy.
Our investigation revolves around three research questions:

RQ1: Do challenges enhance machine detection?
RQ2: Can human evaluators harness these challenges to sharpen their discernment?
RQ3: Does supporting humans with automated detectors further improve overall performance?
To address these questions, we collected a novel open-source dataset by engaging 100 participants across mobile and desktop interfaces to produce 18,600 original voice recordings. We subsequently generated 1.6 million samples using a state-of-the-art one-shot voice-cloning technology.
We performed machine evaluation (we use "machines" and "AI" interchangeably) by designing a non-intrusive degradation metric that assesses audio samples on compliance, naturalness, and the preservation of word information. This metric revealed that a select group of 11 challenges dramatically boosted deepfake detection performance (Area Under the Curve, or AUC) to 86.7%, compared to 56.0% for deepfakes of everyday speech, highlighting the benefit of using challenges to aid machines. Notably, challenges like whispering, cupping the mouth, and playback were among the most effective for machines.
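To make this pipeline concrete, the sketch below shows how per-sample degradation sub-scores could be fused into a suspicion score and evaluated with AUC. It is a minimal sketch under stated assumptions: the sub-score values and the equal-weight fusion are illustrative placeholders, not the paper's actual metric implementation.

```python
# Minimal sketch: fuse three hypothetical sub-scores (each in [0, 1],
# higher = better-sounding audio) into a suspicion score, then measure AUC.
from sklearn.metrics import roc_auc_score

def degradation_score(compliance: float, naturalness: float, word_info: float) -> float:
    """Higher output = more degraded response = more likely a deepfake.
    Equal weighting is an illustrative assumption."""
    return 1.0 - (compliance + naturalness + word_info) / 3.0

# Hypothetical per-sample sub-scores: (compliance, naturalness, word_info, label).
samples = [
    (0.95, 0.90, 0.97, 0),  # bonafide caller handles the challenge well
    (0.60, 0.40, 0.55, 1),  # RTDF degrades under the same challenge
    (0.92, 0.85, 0.90, 0),
    (0.70, 0.55, 0.65, 1),
]
labels = [s[3] for s in samples]
scores = [degradation_score(*s[:3]) for s in samples]
print(f"AUC: {roc_auc_score(labels, scores):.3f}")
```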
Concurrently, we performed a human evaluation on a harder subset of the whole dataset, one in which SpeechBrain's speaker-verification system [37] classified all deepfake samples as genuine. Humans assessed the compliance and quality of each audio sample and achieved an AUC of 80.1%, with challenges bringing out discernible degradation in deepfakes.
Informed by these results and the propensity of humans to produce more false positives, we envisioned a framework in which machines assist humans in making decisions. This integration makes our approach scalable while keeping the outcome interpretable with human oversight. A subsequent human evaluation with the top-performing challenges, using a balanced dataset, revealed that machine assistance increased human detection accuracy from 72.3% to 78.5%. Furthermore, machine assistance significantly boosted human confidence and rectified human errors in 43% of cases (when machines were correct) while causing misjudgments in 29% of cases (when machines were wrong).
Taking this integration a step further, in scenarios where humans were uncertain, machines were allowed to take charge and make the final call. This blend of human-AI contributions, at a ratio of 56:44, enhanced the overall detection accuracy to 82.9%, representing a 14.3% improvement over the human-only baseline (with machine accuracy standing at 85%). This approach underscores that while humans retain their primary role in decision-making, strategic machine involvement can dramatically enhance the accuracy of detecting fake calls.
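A minimal sketch of this two-stage policy as we read it: the human decides after seeing the machine's pre-screening output, and the machine's thresholded score decides only on deferral. The `HumanJudgment` structure and the 0.5 threshold are hypothetical stand-ins, not the paper's interface.

```python
from dataclasses import dataclass

@dataclass
class HumanJudgment:
    says_fake: bool   # the human's verdict, made with machine assistance
    confident: bool   # did the human commit, or do they defer?

def final_verdict(human: HumanJudgment,
                  machine_score: float,
                  threshold: float = 0.5) -> bool:
    """Return True if the call is flagged as a deepfake.

    Illustrative two-stage policy: the human retains the primary role;
    the machine takes charge only when the human is uncertain. The
    fraction of deferrals yields a human:machine decision ratio
    (56:44 in the paper's evaluation).
    """
    if human.confident:
        return human.says_fake
    return machine_score >= threshold

# Example: an unsure human defers to a suspicious machine score.
print(final_verdict(HumanJudgment(says_fake=False, confident=False), 0.81))  # True
```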
Our findings validated the instrumental role of challenge-response mechanisms in bolstering both machine and human capabilities in audio deepfake detection, showcasing a synergy between human insight and AI accuracy that mitigates the threats posed by real-time synthetic speech technologies.
Distinction from Robocalls and Liveness Detection. Robocalls are like spam: generally untargeted messages distributed en masse to a large population in an automated manner. RTDFs (the context for this work), by contrast, are targeted and possibly curated for the particular receiver.

In principle, liveness detection mechanisms are ineffective against RTDFs, as the imposter generates deepfakes in real-time; hence, they are already live. Also, there is a person behind the call who can understand instructions and converse. Thus, each interaction could unfold differently.
2 Background

Definition. An audio deepfake refers to a digital impersonation in which an imposter mimics audio characteristics to convincingly match a specific target's likeness. When such impersonations can be done live with sufficient fidelity, we call them RTDFs.
2.1 Cognitive Bias and Human Susceptibility

In the context of deepfakes, individuals are left to navigate the landscape on their own, letting scammers abuse their trust during phone calls. Two pivotal factors contribute to the phenomenon of rapid trust-building over calls. First, there exists a general human acknowledgment of the fallibility of our auditory perception. Mishearing is a shared experience, arguably more so than visual misinterpretation. We naturally integrate multi-sensory inputs into listening, such as live transcription, facial expressions, mouth movements, and the conversation context, to enhance our ability to discern speech. These cues are sufficiently potent to augment and even override auditory perception, as demonstrated by the McGurk effect [30]. Under this effect, the visual information from seeing a person's mouth speak changes how one hears the speech, leading to misheard syllables. This phenomenon highlights our innate understanding that auditory signals can be unreliable, which gets amplified over audio calls, where the channel is noisy and the visual component is absent.
Second, the technological landscape has molded our expectations of sound quality. Individuals who grew accustomed to the quirks of early mobile telephony – limited bandwidth, dropped calls, background noise, and frequent verbal confirmations to ensure mutual understanding – have been conditioned to interpret and understand communication even in subpar auditory conditions. Despite improvements in audio clarity, exposure to imperfect sound has ingrained in us a tolerance for noisy or unclear speech. Such tolerance accentuates our inherent bias to trust noisy speech.
This evolutionary and technological adaptation in auditory perception is evident in how humans rate audio quality. When deploying mean opinion scores (MOS) for speech-naturalness assessment, the scale customarily spans from 1 (incomprehensible) to 5 (akin to face-to-face). Nonetheless, the average difference between the highest and lowest MOS scores is often only two points and skewed towards the higher end [38]. This indicates a degree of leniency, or an adaptive bias, in our auditory judgments, underscoring our propensity to make sense of auditory inputs despite their quality.

This innate human tendency to trust auditory inputs rather than mistrust them offers a potent avenue for exploitation in social engineering attacks employing audio deepfakes.
2.2 Problem Description

Threat Model. Three parties are involved in this scenario: an impostor, the intended identity (the target), and a defender. The defender receives a call of a sensitive nature, answering in the expectation of interacting directly with the target. However, an impostor may employ AI-generated speech-synthesis tools to communicate with the defender, masquerading as the target. The primary objective of the defender is to authenticate the caller's identity before proceeding further.

A sensitive call refers to a remote communication initiated to acquire something from the recipient, who possesses both the capability and willingness to grant it. Such calls include job interviews, phone banking, disreputable interactions, and pranks.
Defenders. The defender does not assume any trust in the potential imposter or their devices. The defender requests the caller to perform a specific task. The nature of the possible requests is public; however, a seed is initialized to randomize the task. No identifiable target information, such as voice biometrics, is collected through an extensive enrollment process. This pits the defender against a strong threat model, since identity information would have helped them.
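Since the pool of possible requests is public, unpredictability comes entirely from the per-call seed. Below is a minimal sketch of seeded challenge selection; the abbreviated pool is drawn from Table 1, while the two-challenge draw and the function shape are illustrative assumptions.

```python
import random

# Abbreviated pool of public challenges from Table 1 (IDs in parentheses).
CHALLENGES = ["whispering (3)", "cup mouth (2)", "speak quickly (6)",
              "high pitch (9)", "foreign words (11)", "clap (17)"]

def issue_challenge(seed: int, n: int = 2) -> list[str]:
    """Draw n challenges for this call. The pool is public knowledge;
    only the per-call seed (e.g., derived from os.urandom) makes the
    request unpredictable to the imposter."""
    rng = random.Random(seed)
    picks = rng.sample(CHALLENGES, k=n)
    # A challenge may itself be parameterized (e.g., which foreign words
    # to read); the same rng can supply that randomness as well.
    return picks

print(issue_challenge(seed=0xC0FFEE))
```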
Imposters. An imposter has access to sufficient computational resources to run voice-cloning software in real-time. They can discern and respond to requests made to them. An imposter can obtain a high-quality speech sample of the target, potentially via social media or an unsolicited phone call (e.g., pre-recorded voicemail greetings). This scenario is particularly impactful, as it requires scant data and can cause widespread harm.
Hypothesis. Speech communication comprises several vital elements, including phonetics, articulation, pitch, tone, rhythm, stress, voice quality, and fluency, all of which combine to create a distinct auditory experience. An audio deepfake system should fail to maintain fidelity in real time while supporting all such elements.
3 Related Work

Voice-Cloning. Voice-cloning systems aim to separate speaker identity from content in a given target's data and replace it with the desired content at inference time. Such speech-generation systems broadly differ in their form of input: text or speech. In our analysis, text-to-speech models [12, 39] excelled in content modulation but fell short in capturing speech nuances like emotion or prosody, severely limiting them against our challenges. Our focus thus shifted to speech-to-speech systems, which convert a source speech into the targeted speaker's voice while maintaining subtle elements of speech.
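Schematically, such a speech-to-speech system factors an utterance into content and speaker identity, then re-synthesizes with the target's identity. The PyTorch sketch below shows only this information flow; the module choices and shapes are illustrative assumptions and do not correspond to any specific converter's code.

```python
import torch
import torch.nn as nn

class AnyToAnyVC(nn.Module):
    """Illustrative any-to-any voice converter: content is taken from the
    source utterance, identity from a reference clip of the target."""

    def __init__(self, n_mels: int = 80, d: int = 256):
        super().__init__()
        self.content_enc = nn.GRU(n_mels, d, batch_first=True)  # keeps content
        self.speaker_enc = nn.GRU(n_mels, d, batch_first=True)  # keeps identity
        self.decoder = nn.GRU(2 * d, n_mels, batch_first=True)  # re-synthesizes

    def forward(self, source_mel: torch.Tensor, target_ref_mel: torch.Tensor):
        content, _ = self.content_enc(source_mel)        # (B, T, d) per frame
        _, spk = self.speaker_enc(target_ref_mel)        # (1, B, d) summary
        spk = spk[-1].unsqueeze(1).expand(-1, content.size(1), -1)
        out, _ = self.decoder(torch.cat([content, spk], dim=-1))
        return out  # mel-spectrogram in the target's voice (then vocoded)

# The imposter's real-time constraint: this forward pass (plus a vocoder)
# must run faster than audio arrives, for every challenge issued.
vc = AnyToAnyVC()
fake = vc(torch.randn(1, 200, 80), torch.randn(1, 150, 80))
```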
Among the various voice converters available, such as PPG-VC [28], StarGANv2-VC [27], FREE-VC [25], StreamVC [45], and KNN-VC [9], we chose FREE-VC for our evaluation. FREE-VC is noted for its open-source (https://github.com/OlaWod/FreeVC) any-to-any voice-cloning capabilities, outperforming earlier models and matching recent advancements, and Korshunov et al. [23] consider it a high threat to speaker recognition systems; we therefore assume it will be an imposter's top choice.
Speaker Verification. Speaker verification aims to match the speaker in an unknown speech sample to a clean reference speech sample. In our work, when we test SpeechBrain [37] on our original and deepfake regular-speech dataset, it scores a near-perfect 98.5%. However, the need for a clean reference file makes such an approach impractical and incompatible with our threat model.
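For reference, the pretrained SpeechBrain verifier can be exercised in a few lines, following its model card's documented usage; the file paths below are placeholders, and newer SpeechBrain releases expose the class under `speechbrain.inference` instead.

```python
# Pairwise speaker verification with SpeechBrain's pretrained ECAPA-TDNN.
# Note: our threat model assumes no clean reference file is available,
# which is exactly why this otherwise strong approach does not apply.
from speechbrain.pretrained import SpeakerRecognition

verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)
# Placeholder wav paths: a clean reference and the call under test.
score, same_speaker = verifier.verify_files("reference.wav", "unknown_call.wav")
print(float(score), bool(same_speaker))
```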
Challenge-Response Systems. Authentication schemes utilizing challenge-response mechanisms underpin the ubiquitous CAPTCHA, which differentiates bots masquerading as humans on the internet. Audio CAPTCHAs [19] extend this approach to people with visual impairments. The GOTCHA system [32] introduced a challenge-response approach for real-time video deepfake detection, analyzing eight video challenges. That study concluded that challenges lead to consistent and measurable degradation in real-time video deepfakes, with a focus on interpretability.
D-CAPTCHA [46] developed a challenge-response system for audio calls, evaluating nine challenges and measuring detection performance based on realism, compliance, and identity against deepfake generators, with StarGANv2-VC [27] being the most potent generator tested. Their research is a vital proof-of-concept indicating the promising potential of challenge-response systems for detecting deepfake audio calls, and it motivates a more systematic and robust approach. Our work builds upon this by exploring 20 challenges and developing a pre-screener for calls that assists and collaborates with human receivers.
Human Evaluation of Deepfakes. Müller et al. [34] conducted a study in which participants competed against an AI model, RawNet2 [40], to identify audio deepfakes. Participants were shown the AI's classification after listening to an audio clip and deciding its authenticity. The study compared human performance against AI in detecting deepfakes without allowing participants to revise their choices. It concluded that humans and AI exhibit similar strengths and weaknesses across various spoofing tasks from ASVspoof 2019 [35]. Contrary to their findings, our work provides empirical evidence that human and AI performance are not strongly correlated and often complement each other's decision-making.
Table 1: Taxonomy of audio challenges, encompassing eight categories and 22 subcategories with examples. We evaluated the numbered examples as 20 prospective challenges.

| Category | Sub-category | Example(s) |
| --- | --- | --- |
| No Challenge | Read Normally | Regular Speech (0) |
| Vocal Distortions | Vocal Peripherals | Static Mouth (1), Cup Mouth (2) |
| Vocal Distortions | Whisper | Whispering (3) |
| Vocal Distortions | Vocal Cavity | Hold Nose (8), humming |
| Waveform | Frequency | High Pitch (9), Low Pitch (10), Sing (13) |
| Waveform | Amplitude | Speak Loudly (4) & Softly (5) |
| Waveform | Temporal | Speak Quickly (6) & Slowly (7) |
| Language / Articulation | Difficult Sequences | Foreign Words (11), Reverse Count |
| Language / Articulation | Mimicry | Mimic another Accent (12) |
| Language / Articulation | Phonetics | Rolled R's & Tongue Clicks |
| Language / Articulation | Deception | Sudden Interruption while Speaking |
| Tone of Voice | Emotion | Sound Happy / Sad (14) |
| Tone of Voice | Phonology | Questions (Inflection) (15) |
| Intentional Noise | Vocal | Cough / Whistle (16) |
| Intentional Noise | Non-vocal | Clap (17), Flick Microphone |
| Intentional Noise | Background Noise | Birds, Cars |
| Playback (desktop only) | Echo | Two Mics on the Same Call |
| Playback (desktop only) | Speech | Cross-talk (18) |
| Playback (desktop only) | Music | Instrumental (19), Lyrical (20) |
| Behavioral (out-of-scope) | Unique Habits | Mannerisms, Person-of-Interest |
| Behavioral (out-of-scope) | Biometric | Voice Print Detection |
| Passive Distortions (out-of-scope) | Perturbations | Adversarial Perturbations |
| Passive Distortions (out-of-scope) | Software Editing | Modulating Pitch, Noise, or Bass |

■ Lingual challenge, ■ Non-lingual challenge, ■ Replay challenge
4 Speech Challenges and their Taxonomy

Definition. An audio challenge is a task that:

• is plausible to perform during a phone call,
• induces degradation in real-time deepfakes,
• supports randomization, and
• is optionally verifiable by humans.
This section describes a taxonomy of audio tasks that challenge RTDFs to adapt to novel inputs. RTDFs facing unusual voice patterns produce speech artifacts, which later aid us in discovering their presence on calls. We delineate the taxonomy in Table 1, intending to stimulate variation in the several factors that constitute phone calls, such as linguistics, tone of voice, background, and behavioral cues.
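For experimentation, the 20 evaluated challenges (plus the regular-speech baseline) map naturally onto a lookup table keyed by the IDs in Table 1; a minimal sketch of such a table, with names taken from Table 1, usable as the pool for the seeded selection sketch in Section 2.2:

```python
# The evaluated challenges from Table 1, keyed by their numbered IDs
# (0 is the regular-speech baseline, 1-20 are the prospective challenges).
EVALUATED_CHALLENGES: dict[int, str] = {
    0: "Regular Speech (baseline)",
    1: "Static Mouth",                  2: "Cup Mouth",
    3: "Whispering",                    4: "Speak Loudly",
    5: "Speak Softly",                  6: "Speak Quickly",
    7: "Speak Slowly",                  8: "Hold Nose",
    9: "High Pitch",                    10: "Low Pitch",
    11: "Foreign Words",                12: "Mimic another Accent",
    13: "Sing",                         14: "Sound Happy / Sad",
    15: "Questions (Inflection)",       16: "Cough / Whistle",
    17: "Clap",                         18: "Cross-talk Playback",
    19: "Instrumental Music Playback",  20: "Lyrical Music Playback",
}
```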
4.1 Lingual Challenges

Lingual challenges prompt individuals to alter their vocal characteristics in distinctive ways. These challenges exploit the human vocal system's degrees of freedom, such as vocal-fold shape, peripheral movements, lung pressure, resonance, and phonation modes.

Vocal Distortions include challenges that manipulate mouth peripherals (including the tongue, lips, and teeth) and vocal-cord resonance. We sub-categorize them as follows: