
Automatic Understanding of Image and Video Advertisements

Zaeem Hussain Mingda Zhang Xiaozhong Zhang Keren Ye

Christopher Thomas Zuha Agha Nathan Ong Adriana Kovashka

Department of Computer Science

University of Pittsburgh

{zaeem, mzhang, xiaozhong, yekeren, chris, zua2, nro5, kovashka}@cs.pitt.edu

Abstract

There is more to images than their objective physical content: for example, advertisements are created to persuade a viewer to take a certain action. We propose the novel problem of automatic advertisement understanding. To enable research on this problem, we create two datasets: an image dataset of 64,832 image ads, and a video dataset of 3,477 ads. Our data contains rich annotations encompassing the topic and sentiment of the ads, questions and answers describing what actions the viewer is prompted to take and the reasoning that the ad presents to persuade the viewer (“What should I do according to this ad, and why should I do it?”), and symbolic references ads make (e.g. a dove symbolizes peace). We also analyze the most common persuasive strategies ads use, and the capabilities that computer vision systems should have to understand these strategies. We present baseline classification results for several prediction tasks, including automatically answering questions about the messages of the ads.

1. Introduction

Image advertisements are quite powerful, and web companies monetize this power. In 2014, one fifth of Google’s revenue came from their AdSense product, which serves ads automatically to targeted users [1]. Further, ads are an integral part of our culture. For example, the two top-left ads in Fig. 1 have likely been seen by every American, and have been adapted and reused in countless ways. In terms of video ads, Volkswagen’s 2011 commercial “The Force” had received 8 million views before it aired on TV [25].

Ads are persuasive because they convey a certain message that appeals to the viewer. Sometimes the message is simple, and can be inferred from body language, as in the “We can do it” ad in Fig. 1. Other ads use more complex messages, such as the inference that because the eggplant and pencil form the same object, the pencil gives a very real, natural eggplant color, as in the top-right ad in Fig. 1. Decoding the message in the bottom-right ad involves even more steps, and reading the text (“Don’t buy exotic animal souvenirs”) might be helpful. The viewer has to infer that the woman went on vacation from the fact that she is carrying a suitcase, and then surmise that she is carrying dead animals from the blood trailing behind her suitcase. A human knows this because she associates blood with injury or death. In the case of the “forest lungs” image at the bottom-left, lungs symbolize breathing and by extension, life. However, a human first has to recognize the groups of trees as lungs, which might be difficult for a computer to do. These are just a few examples of how ads use different types of visual rhetoric to convey their message, namely: common-sense reasoning, symbolism, and recognition of non-photorealistic objects. Understanding advertisements automatically requires decoding this rhetoric. This is a challenging problem that goes beyond listing objects and their locations [72, 21, 61], or even producing a sentence about the image [76, 14, 33], because ads are as much about how objects are portrayed and why they are portrayed so, as about what objects are portrayed.

Figure 1: Two iconic American ads, and three that require robust visual reasoning to decode. Despite the potential applications of ad-understanding, this problem has not been tackled in computer vision before.

We propose the problem of ad-understanding, and develop two datasets to enable progress on it. We collect a dataset of over 64,000 image ads (both product ads, such as the pencil ad, and public service announcements, such as the anti-animal-souvenirs ad). Our ads cover a diverse range of subjects. We ask Amazon Mechanical Turk workers to tag each ad with its topic (e.g. what product it advertises or what the subject of the public service announcement is), what sentiment it attempts to inspire in the viewer (e.g. disturbance in the environment conservation ad), and what strategy it uses to convey its message (e.g. it requires understanding of physical processes). We also include crowdsourced answers to two questions: “What should the viewer do according to this ad?” and “Why should he/she do it?” Finally, we include any symbolism that ads use (e.g. the fact that a dove in an image might symbolically refer to the concept of “peace”). We also develop a dataset of over 3,000 video ads with similar annotations (except symbolism), and a few extra annotations (e.g. “Is the ad funny?” and “Is it exciting?”). Our data collection and annotation procedures were informed by the literature in Media Studies, a discipline which studies the messages in the mass media, in which one of the authors has formal training. Our data is available at http://www.cs.pitt.edu/~kovashka/ads/. The dataset contains the ad images, video ad URLs, and annotations we collected. We hope it will spur progress on the novel and important problem of decoding ads.

In addition to creating the first pair of datasets for understanding ad rhetoric, we propose several baselines that will help judge progress on this problem. First, we formulate decoding ads as a question-answering problem. If a computer vision system understood the rhetoric of an ad, it should be able to answer questions such as “According to this ad, why should I not bully children?” This is a very challenging task, and accuracy on it is low. Second, we formulate and provide baselines for other tasks such as topic and sentiment recognition. These tasks are more approachable and have higher baseline accuracy. Third, we show initial experiments on how symbolism can be used for question-answering.

The ability to automatically understand ads has many applications. For example, we can develop methods that predict how effective a certain ad will be. Using automatic understanding of the strategies that ads use, we can help viewers become more aware of how ads are tricking them into buying certain products. Further, if we can decode the messages of ads, we can perform better ad-targeting according to user interests. Finally, decoding ads would allow us to generate descriptions of these ads for the visually impaired, and thus give them richer access to the content shown in newspapers or on TV.

2. Related work

In this work, we demonstrate that there is an aspect of visual data that has not been tackled before, namely analyzing the visual rhetoric of images. This problem has been studied in Media Studies [79, 71, 51, 55, 54, 5, 44, 12]. Further, marketing research [83] examines how viewers react to ads and whether an ad causes them to buy a product. While decoding ads has not been studied in computer vision, the problem is related to several areas of prior work.

Beyond objects. Work on semantic visual attributes describes images beyond labeling the objects in them, e.g. with adjective-like properties such as “furry”, “smiling”, or “metallic” [39, 15, 56, 38, 68, 35, 36, 17, 77, 2, 27]. The community has also made first attempts in tackling content which requires subjective judgement or abstract analysis. For example, [59] learn to detect how well a person is performing an athletic action. [41] use the machine’s “imagination” to answer questions about images. [32, 73] study the style of artistic photographs, and [13, 40] study style in architecture and vehicles. While these works analyze potentially subjective content, none of them analyze what the image is trying to tell us. Ads constitute a new type of images, and understanding them requires new techniques.

Visual persuasion. Most related to our work is the visual persuasion work of [29] which analyzes whether images of politicians portray them in a positive or negative light. The authors use features that capture facial expressions, gestures, and image backgrounds to detect positive or negative portrayal. However, many ads do not show people, and even if they do, usually there is not an implication about the qualities of the person. Instead, ads use a number of other techniques, which we discuss in Sec. 4.

Sentiment. One of our tasks is predicting the sentiment an ad aims to evoke in the viewer. [57, 58, 6, 42, 30, 45] study the emotions shown or perceived in images, but for generic images, rather than ones purposefully created to convey an emotion. We compare to [6] and show the success of their method does not carry over to predicting emotion in ads. This again shows that ads represent a new domain of images whose decoding requires novel techniques.

Prior work on ads. We are not aware of any work in decoding the meaning of advertisements as we propose. [4, 10] predict click-through rates in ads using low-level vision features, whereas we predict what the ad is about and what message it carries. [47] predict how much human viewers will like an ad by capturing their facial expressions. [82, 48] determine the best placement of a commercial in a video stream, or of image ads in a part of an image using user affect and saliency. [64, 19] detect whether the current video shown on TV is a commercial or not, and [65] detect human trafficking advertisements. [85] extract the object being advertised from commercials (videos), by looking for recurring patterns (e.g. logos). Human facial reactions, ad placement and recognition, and detecting logos, are quite distinct from our goal of decoding the messages of ads.

Visual question-answering. One of the tasks we propose for advertisements is decoding their rhetoric, i.e. figuring out what they are trying to say. We formulate this problem in the context of visual question-answering. The latter is a recent vision-and-language joint problem [3, 62, 46, 80, 67, 81] also related to image captioning [76, 33, 14, 37, 16].

3. Image dataset

The first dataset we develop and make available is a large annotated dataset of image advertisements, such as the ones shown in Fig. 2 (more examples are shown in the supplementary file). Our dataset includes both advertisements for products, and ads that campaign for/against something, e.g. for preserving the environment and against bullying. We call the former “product ads,” and the latter “public service announcements,” or “PSAs”. We refer to the product or subject of the ad as its “topic”. We describe the image collection and annotation process below.

3.1. Collecting ad images

We first assembled a list of keywords (shown in supp) related to advertisements, focusing on possible ad topics. We developed a hierarchy of keywords that describe topics at different levels of granularity. This hierarchy included both coarse topics, e.g. “fast food”, “cosmetics”, “electronics”, etc., as well as fine topics, such as the brand names of products (e.g. “Sprite”, “Maybelline”, “Samsung”). Similarly, for PSAs we used keywords such as: “smoking”, “animal abuse”, “bullying”, etc. We used the entire hierarchy to query Google and retrieve all the images (usually between 600 and 800) returned for each query. We removed all images of size less than 256x256 pixels, and obtained an initial pool of about 220,000 noisy images.

Next, we removed duplicates from this noisy set. We computed a SIFT bag-of-words histogram per image, and used the chi-squared kernel to compute similarity between histograms. Any pair of images with a similarity greater than a threshold was marked as duplicates. After de-duplication, we ended up with about 190,000 noisy images.
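
The duplicate check described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the bag-of-words histograms have already been computed, uses the exponential form of the chi-squared kernel, and the names `chi2_similarity`, `find_duplicates`, `gamma`, and the threshold value 0.9 are hypothetical (the source does not state the threshold or the kernel parameters).

```python
import math


def chi2_similarity(h1, h2, gamma=1.0, eps=1e-10):
    """Exponential chi-squared kernel between two histograms.

    Both histograms are L1-normalized first, so images with different
    numbers of local features remain comparable. gamma is an assumed
    bandwidth parameter, not a value from the paper.
    """
    s1 = sum(h1) + eps
    s2 = sum(h2) + eps
    dist = 0.0
    for a, b in zip(h1, h2):
        a, b = a / s1, b / s2
        dist += (a - b) ** 2 / (a + b + eps)
    return math.exp(-gamma * dist)


def find_duplicates(histograms, threshold=0.9):
    """Return index pairs whose similarity exceeds the (assumed) threshold."""
    pairs = []
    for i in range(len(histograms)):
        for j in range(i + 1, len(histograms)):
            if chi2_similarity(histograms[i], histograms[j]) > threshold:
                pairs.append((i, j))
    return pairs


# Toy visual-word counts: the first two images differ only in scale.
toy_histograms = [[4, 0, 1], [8, 0, 2], [0, 5, 0]]
duplicate_pairs = find_duplicates(toy_histograms)
```

The all-pairs loop is quadratic in the number of images; for a pool of roughly 220,000 images some form of blocking or approximate nearest-neighbor search would likely be needed, which the source does not discuss.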

Finally, we removed images that are not actually advertisements, using a two-stage approach. First, we selected 21,945 images, and submitted those for annotation on MTurk, asking “Is this image an advertisement? You should answer yes if you think this image could appear as an advertisement in a magazine.” We showed plentiful examples to annotators to demonstrate what we consider to be an “ad” vs “not an ad” (examples in supp). We marked as ads those images that at least 3/4 annotators labeled as an ad, obtaining 8,348 ads and 13,597 not-ads.
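
As a small illustration of the labeling rule above, the sketch below marks an image as an ad when at least three quarters of its annotators said yes. Reading “3/4” as a fraction of annotators (rather than exactly three of four) is an assumption, and `is_ad` and `min_fraction` are hypothetical names.

```python
def is_ad(votes, min_fraction=0.75):
    """Return True if at least min_fraction of annotators voted 'ad'.

    votes is a list of booleans, one per annotator. The 0.75 default
    encodes the assumed reading of the "3/4 annotators" rule.
    """
    if not votes:
        return False
    return sum(votes) / len(votes) >= min_fraction


# Toy annotation batch: image id -> per-annotator answers.
toy_votes = {
    "img_a": [True, True, True, False],
    "img_b": [True, True, False, False],
    "img_c": [True, True, True, True],
}
ads = sorted(name for name, v in toy_votes.items() if is_ad(v))
```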

Second, we used these to train a ResNet [24] to distinguish between ads and not ads on the remaining images. We set the recall of our network to 80%, which corresponded to 85% precision evaluated on a held-out set from the human-annotated pool of 21,945 images. We ran that ResNet on our 168,000 unannotated images for clean-up, obtaining about 63,000 images labeled as ads. We allowed annotators to label ResNet-classified “ads” as “not an ad” in a subsequent stage; annotators only used this option in 10% of cases. Using the automatic classification step, we saved $1,300 in annotation costs. In total, we obtained 64,832 cleaned-up ads.

Type           Count    Example
-------------  -------  ----------------------------------
Topic          204,340  Electronics
Sentiment      102,340  Cheerful
Action/Reason  202,090  I should bike because it’s healthy
Symbol         64,131   Danger (+ bounding box)
Strategy       20,000   Contrast
Slogan         11,130   Save the planet... save you

Table 1: The annotations collected for our image dataset. The counts are before any majority-vote cleanup.
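
Fixing a classifier's operating point at a target recall, as described above, amounts to choosing a score threshold on held-out data and then reading off the precision. The sketch below shows one way to do that; the function names and the toy scores are hypothetical, and nothing here reflects how the authors actually calibrated their ResNet.

```python
import math


def threshold_for_recall(scores, labels, target_recall=0.8):
    """Highest score threshold that still recovers target_recall of positives."""
    positives = sorted((s for s, y in zip(scores, labels) if y), reverse=True)
    if not positives:
        raise ValueError("need at least one positive example")
    # Number of positives that must score at or above the threshold.
    # The small offset guards against floating-point round-up in ceil.
    k = max(1, math.ceil(target_recall * len(positives) - 1e-9))
    return positives[k - 1]


def precision_recall(scores, labels, threshold):
    """Precision and recall when predicting 'ad' for scores >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy held-out set: classifier scores and ground-truth ad labels.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 1, 0, 1]
thr = threshold_for_recall(toy_scores, toy_labels, target_recall=0.8)
prec, rec = precision_recall(toy_scores, toy_labels, thr)
```

Lowering the threshold would raise recall at the cost of precision, which is the trade-off behind the 80% recall and 85% precision figures reported in the text.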

3.2. Collecting image ad annotations

We collected the annotations in Tab. 1, explained below. Note that we describe the strategies annotations in Sec. 4.

3.2.1 Topics and sentiments

The keyword query process used for image download does not guarantee that the images returned for each keyword actually advertise that topic. Thus, we developed a taxonomy of products, and asked annotators to label the images with the topic that they advertise or campaign for. We also wanted to know how an advertisement makes the viewer feel, since the sentiment that the ad inspires is a powerful persuasion tool [47]. Thus, we also developed a taxonomy of sentiments. To get both taxonomies, we first asked annotators to write free-form topics and sentiments, on a small batch of images and videos. This is consistent with the “self report” approach used to measure emotional reactions to ads [60]. We then semi-automatically clustered them and selected a representative set of words to describe each topic and sentiment type. We arrived at a list of 38 topics and 30 sentiments. In later tasks, we asked workers to select a single topic and one or more sentiments. We collected topic annotations on all ads, and sentiments on 30,340 ads. For each image, we collected annotations from 3 to 5 different workers. Inter-annotator agreement on topic labels was 85% (more details in supp). Examples are shown in Tab. 2. The distribution of topics and sentiments is illustrated in Fig. 3 (left); we see that sports ads and human rights ads inspire activity, while domestic abuse and human and animal rights ads inspire disturbance and empathy. Interestingly, we observe that domestic abuse ads inspire disturbance more frequently than animal rights ads do.
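
With 3 to 5 topic labels per image, two routine computations are a majority-vote label and an agreement score. The sketch below illustrates both. The source does not say how the 85% agreement figure was computed, so the pairwise-agreement definition used here is an assumption, and `majority_label` and `pairwise_agreement` are hypothetical names.

```python
from collections import Counter
from itertools import combinations


def majority_label(labels):
    """Return the label chosen by more than half of the workers, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def pairwise_agreement(annotations):
    """Fraction of same-image worker pairs that chose the same label.

    This is one common agreement measure, assumed here for illustration.
    """
    agree = 0
    total = 0
    for labels in annotations:
        for a, b in combinations(labels, 2):
            agree += int(a == b)
            total += 1
    return agree / total if total else 0.0


# Toy topic annotations: one inner list of worker labels per image.
toy_annotations = [
    ["cars", "cars", "cars"],
    ["coffee", "coffee", "education"],
    ["sports", "cars", "coffee", "beauty"],
]
consensus = [majority_label(labels) for labels in toy_annotations]
agreement = pairwise_agreement(toy_annotations)
```

Images with no majority label (the third toy image) would need another round of annotation or a tie-breaking rule; the source only notes that the counts in Tab. 1 are reported before majority-vote cleanup.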

Figure 2: Examples of ads grouped by strategy or visual understanding required for decoding the ad. Panels: Straightforward/literal ads (+OCR/NLP required); Understanding physical processes; Symbolism; Atypical objects; Surprise/shock or humor/pun; Humans experience product; Transfer of qualities; Culture/memes; Contrast.

Topic                         Sentiment
----------------------------  ------------------------------
Restaurants, cafe, fast food  Active (energetic, etc.)
Coffee, tea                   Alarmed (concerned, etc.)
Sports equipment, activities  Amazed (excited, etc.)
Phone, TV and web providers   Angry (annoyed, irritated)
Education                     Cheerful (delighted, etc.)
Beauty products               Disturbed (disgusted, shocked)
Cars, automobiles             Educated (enlightened, etc.)
Political candidates          Feminine (womanly, girlish)
Animal rights, animal abuse   Persuaded (impressed, etc.)
Smoking, alcohol abuse        Sad (depressed, etc.)

Table 2: A sample from our list of topics and sentiments. See supp for the full list of 38 topics and 30 sentiments.

Figure 3: Statistics about topics and sentiments (left), and topics and strategies (right).

3.2.2 Questions and answers

Question                                Answer
--------------------------------------  -------------------------------------------------------------
What should you do, acc. to the ad?     I should buy Nike sportswear.
Why, acc. to the ad, should you do it?  Because it will give me the determination of a star athlete.
What?                                   I should buy this video game.
Why?                                    Because it is a realistic soccer experience.
What?                                   I should drink Absolut Vodka.
Why?                                    Because they support LGBT rights.
What?                                   I should look out for domestic violence.
Why?                                    Because it can hide in plain sight.
What?                                   I should not litter in the ocean.
Why?                                    Because it damages the ocean ecosystem.

Table 3: Examples of collected question-answer pairs.

What should you do?            Why should you do it?
Educat.  Travel   Smoking      Educat.  Travel     Smoking
-------  -------  ---------    -------  ---------  -------
go       go       smoke        help     fun        smoking
college  visit    cigarette    learn    beautiful  like
use      fly      buy          want     like       kill
attend   travel   stop         career   want       make
school   airline  quit         things   great      life

Table 4: Common words in responses to action and reason questions for selected topics, from the image dataset.

We collected 202,090 questions and corresponding answers, with three question-answer pairs per image. Tab. 3 shows a few examples. We asked MTurk workers “What should you do, according to this ad, and why?” The answer then describes the message of the ad, e.g. “I should buy this dress because it will make me attractive.” We re-
