limitations of AI and HI, we propose CrowdLearn, a crowd-AI hybrid system that leverages HI to troubleshoot, tune, and eventually improve the performance of AI-based DDA applications.
To acquire HI, we leverage a crowdsourcing platform (i.e., Amazon Mechanical Turk, or MTurk) that provides access to a massive pool of freelance workers at low cost. However, two critical pitfalls arise when leveraging a crowdsourcing platform: 1) the freelance workers may not be able to provide responses as accurate as those of domain experts due to their lack of experience/expertise; 2) the delay of the crowd workers can potentially be too high to be acceptable for DDA applications.
These two pitfalls are further exacerbated by the black-box challenges of both the AI and the crowdsourcing platform, which are not well addressed by the existing literature on human-AI systems [9], [10]. We elaborate on these challenges below.
Black-box AI Challenge: the first challenge in combining
HI and AI lies in the black-box nature of AI algorithms. In
particular, the lack of interpretability of the results from AI
algorithms makes it extremely hard to diagnose failure scenarios such as performance deficiency: why does the AI model fail? Is it due to a lack of training data or to the model itself?
Such questions make it hard for the crowd to effectively
improve the black-box AI model. The interpretability issue
was initially identified in [10], [11], where accountable AI solutions were proposed that leverage humans as annotators to troubleshoot and correct the outputs of AI algorithms. However, these solutions simply use humans to verify the results of AI and ignore the fact that human annotators can be both
slow and expensive. There also exist some human-AI systems
that use crowdsourcing platforms to obtain labels or features
to retrain the model [12], [13]. However, these systems do not address the case where the AI algorithm itself is problematic, i.e., where its performance will not increase no matter how many training samples are added. Given the
black-box nature of AI, the research question we address here
is: how do we accurately identify the failure scenarios of AI
that can be effectively addressed by the crowd?
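To make this question concrete, one common (though partial) diagnostic is to examine how validation accuracy changes as the training set grows: if accuracy is still improving, the failure is likely data-limited and crowd-provided labels can help; if it has plateaued well below the target, the model itself is the bottleneck and additional labels alone will not fix it. The Python sketch below illustrates this heuristic with scikit-learn; the classifier and threshold values are illustrative assumptions, not part of CrowdLearn.

    # Heuristic diagnosis of an AI failure scenario: data-limited vs. model-limited.
    # A minimal sketch using scikit-learn; the estimator and thresholds are
    # illustrative assumptions, not the CrowdLearn implementation.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import learning_curve

    def diagnose_failure(X, y, target_acc=0.85, plateau_tol=0.01):
        sizes, _, val_scores = learning_curve(
            RandomForestClassifier(n_estimators=100, random_state=0),
            X, y,
            train_sizes=np.linspace(0.2, 1.0, 5),
            cv=5, scoring="accuracy",
        )
        mean_val = val_scores.mean(axis=1)      # mean validation accuracy per size
        gain = mean_val[-1] - mean_val[-2]      # slope near the full training size

        if mean_val[-1] >= target_acc:
            return "no_failure"
        if gain > plateau_tol:
            return "data_limited"   # more (crowd-provided) labels are likely to help
        return "model_limited"      # more labels alone will not improve performance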
Black-box Crowdsourcing Platform Challenge: the second challenge lies in the black-box nature of the crowdsourcing platform, which is characterized by two unique
features. First, the requester (the DDA application that queries
the platform) often cannot directly select and manage the
workers in the crowdsourcing platform. In fact, the requester
can only submit tasks and define the incentives for each
task. This lack of control makes incentive design for the crowdsourcing platform very difficult since we cannot cherry-pick highly reliable and responsive workers to complete the tasks. For this reason, current incentive design solutions that assume full control over the crowd workers cannot be applied to our problem [14]–[18]. Second, the time and
quality of the responses from the crowd workers are highly
dynamic and unpredictable, and their relationships to incentives are not trivial to model. Existing solutions often assume that higher incentives lead to shorter response times and higher response quality [13], [19]. However, in our experiments we found that the quality of the responses from the crowd workers varies widely and does not simply depend on the level of incentive provided (e.g., the quality can be high even when the incentive is low). Similarly, we observe that the response delay of the crowd is not simply proportional to the incentive level. Given these unique features, the research question we tackle here is: how do we effectively incentivize the crowd to provide reliable and timely responses to improve AI performance?
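Concretely, a requester interacts with MTurk only through task postings: it can describe the task and set a reward, but has no API for selecting which workers will respond. The sketch below (Python with the boto3 MTurk client) illustrates this interface; the sandbox endpoint, reward value, and question XML file are illustrative assumptions.

    # Posting a DDA labeling task to MTurk: the requester controls the task
    # description and the reward, but not which workers accept the HIT.
    # A minimal sketch with boto3; endpoint, reward, and question file are
    # illustrative assumptions.
    import boto3

    mturk = boto3.client(
        "mturk",
        region_name="us-east-1",
        endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
    )

    hit = mturk.create_hit(
        Title="Assess building damage in a disaster image",
        Description="Label the damage severity shown in the image.",
        Reward="0.10",                    # the incentive (in USD), the main control knob
        MaxAssignments=3,                 # how many workers answer the same task
        AssignmentDurationInSeconds=300,
        LifetimeInSeconds=3600,
        Question=open("damage_question.xml").read(),  # placeholder QuestionForm XML
    )
    print("Posted HIT:", hit["HIT"]["HITId"])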
In this work, we design the CrowdLearn framework, which leverages human feedback from the crowdsourcing platform to troubleshoot, calibrate, and boost AI performance in DDA applications. In particular, CrowdLearn addresses the black-box challenges of AI and the crowdsourcing platform by developing four new schemes: 1) a query set selection (QSS) scheme that finds the best strategy to query the crowdsourcing platform for feedback; 2) a new incentive policy design (IPD) scheme that incentivizes the crowd to provide timely and accurate responses to the queries; 3) a crowd quality control (CQC)
scheme that refines the responses from the crowd and provides
trustworthy feedback to the AI algorithms; 4) a machine
intelligence calibration (MIC) scheme that incorporates the
feedback from the crowd to improve the AI algorithms by
alleviating various failure scenarios of AI. The four compo-
nents are integrated into a holistic closed-loop system that
allows the AI and crowd to effectively interact with each
other and eventually achieve boosted performance for the
DDA application. The CrowdLearn framework was evaluated
using Amazon Mechanical Turk (MTurk) and a real-world
DDA application. We compared CrowdLearn with state-of-the-art baselines from both AI-only algorithms and human-AI frameworks. The results show that our scheme achieves significant performance gains in terms of classification accuracy in disaster damage assessment with reasonably low response time and cost.
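For intuition, the interaction of the four schemes can be summarized by the simplified Python sketch below, where each scheme is treated as a black-box callable; this is an illustrative abstraction rather than the actual implementation.

    # Simplified closed-loop interaction between the AI model and the crowd.
    # Each of the four CrowdLearn schemes is passed in as a callable; this is
    # an illustrative abstraction, not the actual implementation.

    def crowdlearn_loop(model, unlabeled_pool, budget,
                        qss, ipd, post_to_crowd, cqc, mic, rounds=5):
        """Run `rounds` iterations of the crowd-AI feedback loop."""
        for _ in range(rounds):
            query_set = qss(model, unlabeled_pool)      # 1) query set selection
            incentives = ipd(query_set, budget)         # 2) incentive policy design,
                                                        #    assumed to return {task: reward}
            raw = post_to_crowd(query_set, incentives)  # submit tasks to MTurk
            feedback = cqc(raw)                         # 3) crowd quality control
            model = mic(model, feedback)                # 4) machine intelligence calibration
            budget -= sum(incentives.values())          # deduct incentives spent this round
        return model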
II. RELATED WORK

A. Human-AI Systems
Humans have traditionally been an integral part of artificial
intelligence systems as a means of generating labeled training
data [3], [11], [20]. Such a paradigm has been proven to be
effective in supervised learning tasks such as image classifica-
tion [21], speech recognition [22], autonomous driving [23],
social media mining [24], and virtual reality [25]. However, it
also suffers from two key limitations. First, some applications
(e.g., damage assessment) may require a large amount of
training data to achieve reasonable performance, which could
be impractical due to the labor cost [5], [9]. Second, the
AI models are often black-box systems that are difficult to diagnose in the event of failures or unsatisfactory performance. To address these limitations, a few human-AI hybrid
frameworks have been developed in recent years. For example,
Holzinger et al. proposed the notion of interactive machine learning (“iML”), where humans directly interact
with AI by identifying useful features that could be incor-
porated into the AI algorithms [26]. Branson et al. invented
a human-in-the-loop visual recognition system to accurately
classify the objects in the picture based on the descriptions