neck layer as a representation used to generalize recognition
beyond the set of identities used in training. The downsides
of this approach are its indirectness and its inefficiency: one
has to hope that the bottleneck representation generalizes
well to new faces; and by using a bottleneck layer the rep-
resentation size per face is usually very large (1000s of di-
mensions). Some recent work [15] has reduced this dimen-
sionality using PCA, but this is a linear transformation that
can be easily learnt in one layer of the network.
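To make that point concrete, the sketch below shows PCA expressed as a single linear layer; the function name and dimensions are illustrative assumptions, not from the paper.

```python
import numpy as np

def pca_as_linear_layer(X, d):
    """Fit PCA to d dimensions on data X (n x D) and return it as the
    linear map y = W(x - mu), i.e., exactly what one fully connected
    layer computes. Minimal sketch for illustration only."""
    mu = X.mean(axis=0)
    # Right singular vectors of the centered data are the principal axes.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    W = Vt[:d]                       # d x D projection matrix
    return lambda x: W @ (x - mu)    # equivalent to weights W, bias -W @ mu
```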
In contrast to these approaches, FaceNet directly trains
its output to be a compact 128-D embedding using a triplet-
based loss function based on LMNN [19]. Our triplets con-
sist of two matching face thumbnails and a non-matching
face thumbnail and the loss aims to separate the positive pair
from the negative by a distance margin. The thumbnails are
tight crops of the face area; no 2D or 3D alignment, other
than scale and translation, is performed.
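As a minimal sketch of a loss of this form (the exact formulation is given in section 3.1), assuming f already maps each thumbnail to an L2-normalized 128-D embedding and using an illustrative margin value:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Hinge-style triplet loss: the anchor-positive squared distance
    should be smaller than the anchor-negative squared distance by a
    margin alpha. Inputs are embedding vectors; alpha is illustrative."""
    pos_dist = np.sum((anchor - positive) ** 2)
    neg_dist = np.sum((anchor - negative) ** 2)
    # Zero loss once the negative is pushed alpha beyond the positive.
    return max(pos_dist - neg_dist + alpha, 0.0)
```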
Choosing which triplets to use turns out to be very im-
portant for achieving good performance and, inspired by
curriculum learning [1], we present a novel online nega-
tive exemplar mining strategy which ensures consistently
increasing difficulty of triplets as the network trains. To
improve clustering accuracy, we also explore hard-positive
mining techniques which encourage spherical clusters for
the embeddings of a single person.
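The selection procedure itself is detailed in section 3.2; the sketch below shows one plausible online, in-batch form of such negative mining. The semi-hard criterion and all names here are illustrative assumptions, not the paper's exact rule.

```python
import numpy as np

def semi_hard_negative(embeddings, labels, anchor_idx, positive_idx, alpha=0.2):
    """Pick a negative for (anchor, positive) from the current mini-batch:
    farther from the anchor than the positive, but still violating the
    margin (semi-hard). Illustrative sketch only."""
    a = embeddings[anchor_idx]
    pos_dist = np.sum((a - embeddings[positive_idx]) ** 2)
    best, best_dist = None, np.inf
    for j, n in enumerate(embeddings):
        if labels[j] == labels[anchor_idx]:
            continue  # only non-matching identities can be negatives
        neg_dist = np.sum((a - n) ** 2)
        # Semi-hard: harder than the positive, yet inside the margin.
        if pos_dist < neg_dist < pos_dist + alpha and neg_dist < best_dist:
            best, best_dist = j, neg_dist
    return best  # None if no semi-hard negative exists in this batch
```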
As an illustration of the incredible variability that our
method can handle, see Figure 1. Shown are image pairs
from PIE [13] that previously were considered to be very
difficult for face verification systems.
An overview of the rest of the paper is as follows: in
section 2 we review the literature in this area; section 3.1
defines the triplet loss and section 3.2 describes our novel
triplet selection and training procedure; in section 3.3 we
describe the model architecture used. Finally, in sections 4
and 5 we present quantitative results of our embeddings and
qualitatively explore some clustering results.
2. Related Work
Similarly to other recent works which employ deep net-
works [15, 17], our approach is a purely data driven method
which learns its representation directly from the pixels of
the face. Rather than using engineered features, we use a
large dataset of labelled faces to attain the appropriate in-
variances to pose, illumination, and other variational condi-
tions.
In this paper we explore two different deep network ar-
chitectures that have recently been used with great success in
the computer vision community. Both are deep convolu-
tional networks [8, 11]. The first architecture is based on the
Zeiler & Fergus [22] model, which consists of multiple inter-
leaved layers of convolutions, non-linear activations, local
response normalizations, and max pooling layers. We addi-
tionally add several 1×1×d convolution layers inspired by
the work of [9]. The second architecture is based on the
Inception model of Szegedy et al., which was recently used
as the winning approach for ImageNet 2014 [16]. These
networks use mixed layers that run several different convo-
lutional and pooling layers in parallel and concatenate their
responses. We have found that these models can reduce the
number of parameters by up to 20 times and have the poten-
tial to reduce the number of FLOPs required for comparable
performance.
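A back-of-the-envelope calculation illustrates where such savings come from; the channel counts below are made up for illustration and are not the paper's.

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

# Direct 5x5 convolution mapping 256 channels to 128 channels.
direct = conv_params(5, 256, 128)                            # 819,200 weights
# Same output via a 1x1 bottleneck to 32 channels first,
# as in Inception-style mixed layers.
reduced = conv_params(1, 256, 32) + conv_params(5, 32, 128)  # 110,592 weights
print(direct / reduced)                                      # ~7.4x fewer weights
```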
There is a vast corpus of face verification and recognition
works. Reviewing it is beyond the scope of this paper, so we
will only briefly discuss the most relevant recent work.
The works of [15, 17, 23] all employ a complex system
of multiple stages that combines the output of a deep con-
volutional network with PCA for dimensionality reduction
and an SVM for classification.
Zhenyao et al. [23] employ a deep network to “warp”
faces into a canonical frontal view and then learn a CNN that
classifies each face as belonging to a known identity. For
face verification, PCA on the network output in conjunction
with an ensemble of SVMs is used.
Taigman et al. [17] propose a multi-stage approach that
aligns faces to a general 3D shape model. A multi-class net-
work is trained to perform the face recognition task on over
four thousand identities. The authors also experimented
with a so-called Siamese network where they directly optimize
the L1-distance between two face features. Their best
performance on LFW (97.35%) stems from an ensemble of
three networks using different alignments and color chan-
nels. The predicted distances (non-linear SVM predictions
based on the χ² kernel) of those networks are combined us-
ing a non-linear SVM.
Sun et al. [14, 15] propose a compact and therefore rel-
atively cheap-to-compute network. They use an ensemble
of 25 of these networks, each operating on a different face
patch. For their final performance on LFW (99.47% [15])
the authors combine 50 responses (regular and flipped).
Both PCA and a Joint Bayesian model [2], which effectively
correspond to a linear transform in the embedding space, are
employed. Their method does not require explicit 2D/3D
alignment. The networks are trained by using a combina-
tion of classification and verification loss. The verification
loss is similar to the triplet loss we employ [12, 19], in that it
minimizes the L2-distance between faces of the same iden-
tity and enforces a margin between the distance of faces of
different identities. The main difference is that only pairs of
images are compared, whereas the triplet loss encourages a
relative distance constraint.
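For contrast, a common contrastive (pairwise) formulation looks as follows; this is a generic sketch, not necessarily the exact loss of [14, 15], and the margin value is illustrative.

```python
import numpy as np

def pair_verification_loss(x1, x2, same_identity, margin=1.0):
    """Pairwise verification loss: pull embeddings of the same identity
    together, push different identities beyond a fixed margin. Note it
    compares only two images, with no relative (triplet) constraint."""
    dist = np.sum((x1 - x2) ** 2)
    if same_identity:
        return dist
    return max(margin - dist, 0.0)
```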
A similar loss to the one used here was explored in
Wang et al. [18] for ranking images by semantic and visual
similarity.