
Bias in Multimodal AI: Testbed for Fair Automatic Recruitment

Alejandro Peña, Ignacio Serna, Aythami Morales, Julian Fierrez


School of Engineering, Universidad Autonoma de Madrid, Spain
{alejandro.penna, ignacio.serna.es, aythami.morales, julian.fierrez}@uam.es
arXiv:2004.07173v1 [cs.CV] 15 Apr 2020

Abstract

The presence of decision-making algorithms in society is rapidly increasing nowadays, while concerns about their transparency and the possibility of these algorithms becoming new sources of discrimination are arising. In fact, many relevant automated systems have been shown to make decisions based on sensitive information or to discriminate against certain social groups (e.g. certain biometric systems for person recognition). With the aim of studying how current multimodal algorithms based on heterogeneous sources of information are affected by sensitive elements and inner biases in the data, we propose a fictitious automated recruitment testbed: FairCVtest. We train automatic recruitment algorithms using a set of multimodal synthetic profiles consciously scored with gender and racial biases. FairCVtest shows the capacity of the Artificial Intelligence (AI) behind such a recruitment tool to extract sensitive information from unstructured data, and to exploit it in combination with data biases in undesirable (unfair) ways. Finally, we present a list of recent works developing techniques capable of removing sensitive information from the decision-making process of deep learning architectures. We have used one of these algorithms (SensitiveNets) to experiment with discrimination-aware learning for the elimination of sensitive information in our multimodal AI framework. Our methodology and results show how to generate fairer AI-based tools in general, and in particular fairer automated recruitment systems.

1. Introduction

Over the last decades we have witnessed great advances in fields such as data mining, the Internet of Things, or Artificial Intelligence, among others, with data taking on special relevance. Paying particular attention to the field of machine learning, the large amounts of data currently available have led to a paradigm shift, with handcrafted algorithms being replaced in recent years by deep learning technologies.

Machine learning algorithms rely on data collected from society, and therefore may reflect current and historical biases [6] if appropriate measures are not taken. In this scenario, machine learning models have the capacity to replicate, or even amplify, human biases present in the data [1, 13, 26, 35]. There are relevant models based on machine learning that have been shown to make decisions largely influenced by gender or ethnicity. Google's [33] and Facebook's [2] ad delivery systems generated undesirable discrimination with disparate performance across population groups [9]. New York's insurance regulator probed UnitedHealth Group over its use of an algorithm that researchers found to be racially biased: the algorithm prioritized healthier white patients over sicker black ones [14]. More recently, Apple's credit card service granted higher credit limits to men than to women1 even though it was programmed to be blind to that variable (the biased results in this case originated from other variables [24]).

The usage of AI is also growing in human resources departments, with video- and text-based screening software becoming increasingly common in the hiring pipeline [10]. But automatic tools in this area have exhibited worrying biased behaviors in the past. For example, Amazon's recruiting tool preferred male candidates over female candidates [11]. Access to better job opportunities is crucial to overcome the disadvantages of minority groups. However, in cases such as automatic recruitment, both the models and their training data are usually kept private for corporate or legal reasons. This lack of transparency, along with the long history of bias in the hiring domain, hinders the technical evaluation of these systems in search of possible biases targeting protected groups [27].

This deployment of automatic systems has led governments to adopt regulations in this matter, placing special emphasis on personal data processing and preventing algorithmic discrimination. Among these regulations, the new European Union's General Data Protection Regulation (GDPR)2, adopted in May 2018, is specially relevant for its impact on the use of machine learning algorithms [19].

1 https://edition.cnn.com/2019/11/10/business/goldman-sachs-apple-card-discrimination/
2 https://gdpr.eu/

[Figure 1: an example resume divided into blocks (photograph, name, position, short bio, location, experience, education, skills), each annotated with the personal attributes it can reveal (ID, gender, ethnicity, age, sociocultural) and the corresponding sensitivity level.]

Figure 1: Information blocks in a resume and the personal attributes that can be derived from each one. The number of crosses represents the level of sensitive information (+++ = high, ++ = medium, + = low).

The GDPR aims to protect EU citizens' rights concerning data protection and privacy by regulating how to collect, store, and process personal data (e.g. Articles 17 and 44). This normative also regulates the "right to explanation" (e.g. Articles 13-15), by which citizens can ask for explanations about algorithmic decisions made about them, and requires measures to prevent discriminatory effects while processing sensitive data (according to Article 9, sensitive data includes "personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs").

On the other hand, one of the most active areas in machine learning is the development of new multimodal models capable of understanding and processing information from multiple heterogeneous sources of information [5]. Among such sources of information we can include structured data (e.g. in tables), and unstructured data from images, audio, and text. The implementation of these models in society must be accompanied by effective measures to prevent algorithms from becoming a source of discrimination. In this scenario, where multiple sources of both structured and unstructured data play a key role in algorithms' decisions, the task of detecting and preventing biases becomes even more relevant and difficult.

In this environment of desirable fair and trustworthy AI, the main contributions of this work are:

• We present a new public experimental framework around automated recruitment aimed to study how multimodal machine learning is influenced by biases present in the training datasets: FairCVtest3.

• We evaluate the capacity of popular neural networks to learn biased target functions from multimodal sources of information, including images and structured data from resumes.

• We develop a discrimination-aware learning method based on the elimination of sensitive information such as gender or ethnicity from the learning process of multimodal approaches, and apply it to our automatic recruitment testbed for improving fairness.

Our results demonstrate the high capacity of commonly used learning methods to expose sensitive information (e.g. gender and ethnicity) and the necessity of implementing appropriate techniques to guarantee discrimination-free decision-making processes.

The rest of the paper is structured as follows: Section 2 analyzes the information available in a typical resume and the sensitive data associated to it. Section 3 presents the general framework for our work, including the problem formulation and the dataset created in this work: FairCVdb. Section 4 reports the experiments in our testbed FairCVtest after describing the experimental methodology and the different scenarios evaluated. Finally, Section 5 summarizes the main conclusions.

2. What else does your resume data reveal? Studying multimodal biases in AI

For the purpose of studying discrimination in Artificial Intelligence at large, in this work we propose a new experimental framework inspired in a fictitious automated recruiting system: FairCVtest.

3 https://github.com/BiDAlab/FairCVtest

[Figure 2: block diagram of the automatic learning process: Raw Data (A) → Information Preprocessing (B) → Input x → Trainer/Screener with Target T (C) and Learning Strategy O(x|w) ≈ T (E) → Model w (F) → Prediction/Results O(x|w) (D).]
Figure 2: Block diagram of the automatic learning process and the 6 stages (A to F) where bias can appear.

There are many companies that have adopted predictive tools in their recruitment processes to help hiring managers find successful employees. Employers often adopt these tools in an attempt to reduce the time and cost of hiring, or to maximize the quality of the hiring process, among other reasons [8]. We chose this application because it comprises personal information of a very different nature [15].

The resume is traditionally composed of structured data including name, position, age, gender, experience, or education, among others (see Figure 1), and also includes unstructured data such as a face photo or a short biography. A face image is rich in unstructured information such as identity, gender, ethnicity, or age [17, 28]. That information can be recognized in the image, but it requires a cognitive or automatic process previously trained for that task. The text is also rich in unstructured information: the language, and the way we use that language, determine attributes related to the writer's nationality, age, or gender. Image and text represent two of the domains that have attracted the most interest from the AI research community in recent years. The Computer Vision and Natural Language Processing communities have boosted the algorithmic capabilities in image and text analysis through the usage of massive amounts of data, large computational capabilities (GPUs), and deep learning techniques.

The resumes used in the proposed FairCVtest framework include the merits of the candidate (e.g. experience, education level, languages, etc.), two demographic attributes (gender and ethnicity), and a face photograph (see Section 3.1 for all the details).

3. Problem formulation and dataset

The model, represented by its parameter vector w, is trained according to multimodal input data defined by n features x = [x_1, ..., x_n] ∈ R^n, a Target function T, and a learning strategy that minimizes the error between the output O and the Target function T. In our framework, x is data obtained from the resume and T is a score within the interval [0, 1] ranking the candidates according to their merits: a score close to 0 corresponds to the worst candidate, while the best candidate would get 1. Biases can be introduced in different stages of the learning process (see Figure 2): in the Data used to train the models (A), the Preprocessing or feature selection (B), the Target function (C), and the Learning strategy (E). As a result, a biased Model (F) will produce biased Results (D). In this work we focus on the Target function (C) and the Learning strategy (E). The Target function is critical as it could introduce cognitive biases from biased processes. The Learning strategy is traditionally based on the minimization of a loss function defined to obtain the best performance. The most popular approach for supervised learning is to train the model w by minimizing a loss function L over a set of training samples S:

\min_{\mathbf{w}} \sum_{\mathbf{x}^j \in S} L(O(\mathbf{x}^j | \mathbf{w}), T^j)        (1)

3.1. FairCVdb: research dataset for multimodal AI

We have generated 24,000 synthetic resume profiles including 12 features obtained from 5 information blocks, 2 demographic attributes (gender and ethnicity), and a face photograph. The 5 blocks are: 1) education attainment (generated from US Census Bureau 2018 Education Attainment data4, without gender or ethnicity distinction), 2) availability, 3) previous experience, 4) the existence of a recommendation letter, and 5) language proficiency in a set of 8 different and common languages (chosen from US Census Bureau Language Spoken at Home data5). Each language is encoded with an individual feature (8 features in total) that represents the level of knowledge of that language.

Each profile has been associated, according to its gender and ethnicity attributes, with an identity of the DiveFace database [25]. DiveFace contains face images (120 × 120 pixels) and annotations equitably distributed among 6 demographic classes related to gender and 3 ethnic groups (Black, Asian, and Caucasian), including 24K different identities (see Figure 3).

4 https://www.census.gov/data/tables/2018/demo/education-attainment/cps-detailed-tables.html
5 https://www.census.gov/data/tables/2013/demo/2009-2013-lang-tables.html
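To make the layout of these 12 competency features concrete, the following minimal Python sketch assembles one synthetic profile. The value ranges, distributions, and encodings are illustrative assumptions only; FairCVdb itself derives the education and language statistics from the Census tables cited above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_profile(rng):
    """Assemble one synthetic profile: 12 competency features in [0, 1].

    The exact encodings and marginal distributions of FairCVdb are not
    reproduced here; the draws below are illustrative placeholders.
    """
    education      = rng.integers(0, 5) / 4.0          # education attainment level
    availability   = rng.integers(0, 3) / 2.0          # availability
    experience     = rng.integers(0, 6) / 5.0          # previous experience
    recommendation = float(rng.random() < 0.5)         # recommendation letter (yes/no)
    languages      = rng.integers(0, 4, size=8) / 3.0  # proficiency in 8 languages
    x = np.concatenate(([education, availability, experience, recommendation], languages))
    gender    = int(rng.integers(0, 2))                # demographic attributes, kept
    ethnicity = int(rng.integers(0, 3))                # apart from the competencies x
    return x, gender, ethnicity

x, gender, ethnicity = sample_profile(rng)
assert x.shape == (12,)
```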

Figure 3: Examples of the six demographic groups included in DiveFace: male/female for 3 ethnic groups.

Therefore, each profile in FairCVdb includes information on gender and ethnicity, a face image (correlated with the gender and ethnicity attributes), and the 12 resume features described above, to which we will refer as the candidate competencies x_i.

The score T^j for a profile j is generated by linear combination of the candidate competencies x^j = [x^j_1, ..., x^j_n] as:

T^j = \beta^j + \sum_{i=1}^{n} \alpha_i x_i^j        (2)

where n = 12 is the number of features (competencies), \alpha_i are the weighting factors for each competency x_i^j (fixed manually based on consultation with a human recruitment expert), and \beta^j is a small Gaussian noise that introduces a small degree of variability (i.e. two profiles with the same competencies do not necessarily obtain the same result in all cases). Those scores T^j will serve as groundtruth in our experiments.

Note that, by not taking into account gender or ethnicity information during the score generation in Equation (2), these scores become agnostic to this information, and should be equally distributed among the different demographic groups. Thus, we will refer to this target function as the Unbiased scores T^U, from which we define two target functions that include two types of bias: Gender bias T^G and Ethnicity bias T^E. Biased scores are generated by applying a penalty factor T_δ to individuals belonging to a particular demographic group. This leads to a set of scores where, with the same competencies, certain groups have lower scores than others, simulating the case where the process is influenced by certain cognitive biases introduced by humans, protocols, or automatic systems.

4. FairCVtest: Description and experiments

4.1. FairCVtest: Scenarios and protocols

In order to evaluate how and to what extent an algorithm is influenced by biases present in the FairCVdb target function, we use the FairCVdb dataset previously introduced in Section 3 to train various recruitment systems under different scenarios. The proposed FairCVtest testbed consists of FairCVdb, the trained recruitment systems, and the related experimental protocols.

First, we present 4 different versions of the recruitment tool, with slight differences in the input data and target function, aimed at studying different scenarios concerning gender bias. After that, we will show how those scenarios can be easily extrapolated to ethnicity bias.

The 4 Scenarios included in FairCVtest were all trained using the competencies presented in Section 3, with the following particular configurations:

• Scenario 1: Training with Unbiased scores T^U, and the gender attribute as additional input.
• Scenario 2: Training with Gender-biased scores T^G, and the gender attribute as additional input.
• Scenario 3: Training with Gender-biased scores T^G, but without the gender attribute as input.
• Scenario 4: Training with Gender-biased scores T^G, and a feature embedding from the face photograph as additional input.

In all 4 cases, we designed the candidate score predictor as a feedforward neural network with two hidden layers, both composed of 10 neurons with ReLU activation, and a single neuron with sigmoid activation in the output layer, treating this task as a regression problem.

In Scenario 4, where the system also takes as input an embedding from the applicant's face image, we use the pretrained model ResNet-50 [20] as feature extractor to obtain these embeddings. ResNet-50 is a popular Convolutional Neural Network, originally proposed to perform face and image recognition, composed of 50 layers including residual or "shortcut" connections to improve accuracy as the net depth increases. ResNet-50's last convolutional layer outputs embeddings with 2048 features, and we added a fully connected layer acting as a bottleneck that compresses these embeddings to just 20 features (maintaining competitive face recognition performance). Note that this face model was trained exclusively for the task of face recognition. Gender and ethnicity information were not intentionally employed during the training process. Of course, this information is part of the face attributes.

Figure 4 summarizes the general learning architecture of FairCVtest. The experiments performed in the next section will evaluate the capacity of the recruitment AI to detect protected attributes (e.g. gender, ethnicity) without being explicitly trained for this task.
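As a concrete illustration of the target-score generation of Equation (2) and of the penalty-based biased scores described in Section 3.1, the following Python sketch is a minimal version. The weights alpha_i, the noise level, the penalty value, and the penalized group are illustrative assumptions, not the values used in FairCVdb.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12
alpha = np.full(n, 1.0 / n)   # placeholder weights; in FairCVdb they are fixed
                              # manually with a human recruitment expert

def unbiased_score(x, rng, noise_std=0.02):
    """Equation (2): T = beta + sum_i alpha_i * x_i, with beta a small Gaussian noise.
    Clipping keeps the score inside the [0, 1] interval used in the paper."""
    beta = rng.normal(0.0, noise_std)
    return float(np.clip(beta + alpha @ x, 0.0, 1.0))

def gender_biased_score(t_unbiased, gender, penalty=0.2):
    """Biased target T^G: apply a penalty factor to one demographic group.

    The penalty value and the penalized group (gender == 1) are assumptions
    made for illustration only.
    """
    return float(np.clip(t_unbiased - (penalty if gender == 1 else 0.0), 0.0, 1.0))

# Example usage with a 12-feature competency vector x and a gender label g:
# t_u = unbiased_score(x, rng); t_g = gender_biased_score(t_u, g)
```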

[Figure 4: a raw resume (face photo, short bio, experience, education, skills) is converted into structured and unstructured inputs: the face image goes through a deep architecture (ResNet-50) producing a 20-feature face representation (Scenario 4), the merits (education, experience, languages, others) give 12 features (all Scenarios), and the demographic attributes (gender and ethnicity) give 2 features (Scenarios 1 and 2); a multimodal network with an input layer, 2 hidden layers of 10 units each, and a 1-unit output layer predicts the candidate score T^j.]
Figure 4: Multimodal learning architecture composed of a Convolutional Neural Network (ResNet-50) and a fully connected network used to fuse the features from different domains (image and structured data). Note that some features are included or removed from the learning architecture depending on the scenario under evaluation.
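The following Keras sketch shows one way the candidate score predictor of Figure 4 could be wired, assuming the 20-dimensional face representation has already been extracted offline with a ResNet-50 face model plus a bottleneck layer. The layer sizes follow the paper; everything else (function names, framework choice) is an assumption rather than the authors' released code.

```python
from tensorflow.keras import layers, Model

def build_score_predictor(use_demographics=False, use_face=False, face_dim=20):
    """Candidate score predictor sketched after Figure 4: 12 merit features,
    optionally 2 demographic features (Scenarios 1-2) and a 20-d face
    representation (Scenario 4), fused by 2 hidden layers of 10 ReLU units
    and a single sigmoid output neuron (regression of the score in [0, 1])."""
    inputs = [layers.Input(shape=(12,), name="merits")]
    if use_demographics:
        inputs.append(layers.Input(shape=(2,), name="demographics"))
    if use_face:
        inputs.append(layers.Input(shape=(face_dim,), name="face_embedding"))

    x = layers.Concatenate()(inputs) if len(inputs) > 1 else inputs[0]
    x = layers.Dense(10, activation="relu")(x)
    x = layers.Dense(10, activation="relu")(x)
    score = layers.Dense(1, activation="sigmoid", name="score")(x)
    return Model(inputs=inputs, outputs=score)

model = build_score_predictor(use_face=True)   # Scenario 4: merits + face embedding
```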

Figure 5: Validation loss during the training process obtained for the different scenarios.

4.2. FairCVtest: Predicting the candidate score

The recruitment tool was trained with 80% of the synthetic profiles (19,200 CVs) described in Section 3.1, retaining the remaining 20% as validation set (4,800 CVs), each set equally distributed among gender and ethnicity, using the Adam optimizer, 10 epochs, a batch size of 128, and the mean absolute error as loss metric.

In Figure 5 we can observe the validation loss during the training process for each Scenario (see Section 4.1), which gives us an idea of the performance of each network in the main task (i.e. scoring applicants' resumes). In the first two scenarios the network is able to model the target function more precisely, because in both cases it has all the features that influenced the score generation. Note that, since a small Gaussian noise was added to include some degree of variability, see Equation (2), this loss will never converge to 0. Scenario 3 shows the worst performance, which makes sense since there is no correlation between the bias in the scores and the inputs of the network. Finally, Scenario 4 shows a validation loss between the other Scenarios. As we will see later, the network is able to find gender features in the face embeddings, even though neither the network nor the embeddings were trained for gender recognition. As we can see in Figure 5, the validation loss obtained with biased scores and sensitive features (Scenario 2) is lower than the validation losses obtained for biased scores and blind features (Scenarios 3 and 4).

In Figure 6 we can see the distributions of the scores predicted in each scenario by gender, where the presence of the bias is clearly visible in some plots. For each scenario, we compute the Kullback-Leibler divergence KL(P||Q) from the female score distribution Q to the male distribution P as a measure of the bias' impact on the classifier output. In Scenarios 1 and 3, Figures 6.a and 6.c respectively, there is no gender difference in the scores, a fact that we can corroborate with the KL divergence tending to zero (see the top label in each plot). In the first case (Scenario 1) we obtain those results because we used the unbiased scores T^U during training, so that the gender information in the input becomes irrelevant for the model; in the second one (Scenario 3), because we made sure that there was no gender information in the training data, and both classes were balanced. Despite using a biased target function, the absence of this information makes the network blind to this bias, paying for this effect with a drop of performance with respect to the gender-biased scores T^G, but obtaining a fairer model.
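The training configuration described at the beginning of this subsection (80/20 split, Adam, mean absolute error, 10 epochs, batch size 128) could look as follows, reusing the build_score_predictor sketch above. The array and file names are assumptions for illustration, not part of FairCVtest.

```python
import numpy as np

# Hypothetical arrays (names and files are assumptions, not part of FairCVtest):
# 12 merit features, a 20-d face embedding, and a target score per profile.
X_merits = np.load("faircvdb_merits.npy")       # shape (24000, 12)
X_face   = np.load("faircvdb_face_embed.npy")   # shape (24000, 20)
T        = np.load("faircvdb_scores.npy")       # shape (24000,)

# 80/20 split: 19,200 training CVs and 4,800 validation CVs. Note that the
# paper balances both sets by gender and ethnicity, which this plain slice
# does not enforce.
split = int(0.8 * len(T))

model.compile(optimizer="adam", loss="mean_absolute_error")
model.fit([X_merits[:split], X_face[:split]], T[:split],
          validation_data=([X_merits[split:], X_face[split:]], T[split:]),
          epochs=10, batch_size=128)
```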

Figure 6: Hiring score distributions by gender for each Scenario (panels (a) to (d)). The results show how multimodal learning is capable of reproducing the biases present in the training data even if the gender attribute is not explicitly available.

Scenario 2 (Figure 6.b) leads to the model with the most notorious difference between the male and female classes (note the KL divergence rising to 0.452), which makes sense because we are explicitly providing it with gender information. In Scenario 4 the network is able to detect the gender information from the face embeddings, as mentioned before, and find the correlation between them and the bias injected into the target function. Note that these embeddings were generated by a network originally trained to perform face recognition, not gender recognition. Similarly, gender information could be present in the feature embeddings generated by networks oriented to other tasks (e.g. sentiment analysis, action recognition, etc.). Therefore, despite not having explicit access to the gender attribute, the classifier is able to reproduce the gender bias, even though this attribute was not explicitly available during training (i.e. the gender was inferred from the latent features present in the face image). In this case, the KL divergence is around 0.171, a lower value than the 0.452 of Scenario 2, but still ten times higher than in the unbiased Scenarios.

Figure 7: Hiring score distributions by ethnicity group trained according to the setup of Scenario 4.

Moreover, gender is not the only sensitive information that algorithms like face recognition models can extract from unstructured data. In Figure 7 we present the distributions of the scores by ethnicity predicted by a network trained with Ethnicity-biased scores T^E in a way analogous to Scenario 4 in the gender experiment. The network is also capable of extracting the ethnicity information from the same facial feature embeddings, leading to an ethnicity-biased network when trained with skewed data. In this case, we compute the KL divergence by making 1-to-1 combinations (i.e. G1 vs G2, G1 vs G3, and G2 vs G3) and reporting the average of the three divergences.
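A minimal sketch of how these divergences could be estimated from the predicted scores is shown below. The histogram binning, the smoothing constant, and the variable names are assumptions, since the paper does not detail the estimator it uses.

```python
import numpy as np

def kl_divergence(p_scores, q_scores, bins=50, eps=1e-8):
    """Estimate KL(P||Q) between two sets of predicted scores via histograms
    on a shared binning of [0, 1]. Binning and smoothing are assumptions."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(p_scores, bins=edges)
    q, _ = np.histogram(q_scores, bins=edges)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def mean_pairwise_kl(scores, groups):
    """Average KL over the 1-to-1 group combinations (e.g. G1 vs G2,
    G1 vs G3, G2 vs G3 for the three ethnicity groups)."""
    labels = np.unique(groups)
    pairs = [(a, b) for i, a in enumerate(labels) for b in labels[i + 1:]]
    return float(np.mean([kl_divergence(scores[groups == a], scores[groups == b])
                          for a, b in pairs]))

# Gender case, with hypothetical arrays of predicted scores per group:
# kl_gender = kl_divergence(scores_male, scores_female)   # KL(P_male || Q_female)
```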

Figure 8: Hiring score distributions by gender (top) and ethnicity (bottom) after removing sensitive information from the face feature embeddings.

4.3. FairCVtest: Training fair models

As we have seen, using data with biased labels is not a big concern if we can ensure that there is no information correlated with such bias in the algorithm's input, but we cannot always ensure that. Unstructured data are a rich source of sensitive information for complex deep learning models, which can exploit the correlations in the dataset and end up generating undesired discrimination.

Removing all sensitive information from the input in a general AI setup is almost infeasible: e.g. [12] demonstrates how removing explicit gender indicators from personal biographies is not enough to remove the gender bias from an occupation classifier, as other words may serve as "proxies". On the other hand, collecting large datasets that represent broad social diversity in a balanced manner can be extremely costly. Therefore, researchers in AI and machine learning have devised various ways to prevent algorithmic discrimination when working with unbalanced datasets that include sensitive data. Some works in this line of fair AI propose methods that act on the decision rules (i.e. the algorithm's output) to combat discrimination [7, 22]. In [30] the authors develop a method to generate synthetic datasets that approximate a given original one, but are fairer with respect to certain protected attributes. Other works focus on the learning process as the key point to prevent biased models. The authors of [21] propose an adaptation of DANN [16], originally proposed to perform domain adaptation, to generate agnostic feature representations, unbiased with respect to some protected concept. In [29] the authors propose a method to mitigate bias in occupation classification without having access to protected attributes, by reducing the correlation between the classifier's output for each individual and the word embeddings of their names. A joint learning and unlearning method is proposed in [3] to simultaneously learn the main classification task while unlearning biases by applying a confusion loss, based on computing the cross entropy between the output of the best bias classifier and a uniform distribution. The authors of [23] propose a new regularization loss based on the mutual information between feature embeddings and bias, training the networks using adversarial [18] and gradient reversal [16] techniques. Finally, in [25] an extension of the triplet loss [31] is applied to remove sensitive information from feature embeddings without losing performance in the main task.

In this work we have used the method proposed in [25] to generate agnostic representations with regard to gender and ethnicity information. This method was proposed to improve privacy in face biometrics by incorporating an adversarial regularizer capable of removing the sensitive information from the learned representations; see [25] for more details. The learning strategy is defined in this case as:

\min_{\mathbf{w}} \sum_{\mathbf{x}^j \in S} \left( L(O(\mathbf{x}^j | \mathbf{w}), T^j) + \Delta^j \right)        (3)

where \Delta^j is generated with a sensitiveness detector and measures the amount of sensitive information in the learned model represented by w. We have trained the face representation used in Scenario 4 according to this method (named the Agnostic scenario in the next experiments).

In Figure 8 we present the distributions of the hiring scores predicted using the new agnostic embeddings for the face photographs instead of the previous ResNet-50 embeddings (Scenario 4, compare with Figure 6.d). As we can see, after the sensitive information removal the network cannot extract gender information from the embeddings. As a result, the two distributions are balanced despite using the gender-biased labels and facial information. In Figure 8 we can also see the results of the same experiment using the ethnicity-biased labels (compare with Figure 7). Just like in the gender case, the three distributions are also balanced after removing the sensitive information from the face feature embeddings, obtaining an ethnicity-agnostic representation. In both cases the KL divergence shows values similar to those obtained for the unbiased Scenarios.
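The regularized objective in Equation (3) can be illustrated as a single training step like the sketch below, where the sensitiveness detector is assumed to be a frozen auxiliary classifier scoring how much gender or ethnicity information remains in the face representation. This is a simplified stand-in written for illustration, not the SensitiveNets [25] implementation.

```python
import tensorflow as tf

mae = tf.keras.losses.MeanAbsoluteError()
optimizer = tf.keras.optimizers.Adam()

def train_step(embedding_model, score_head, sensitive_detector, x_face, x_merits, t):
    """One step of the regularized objective of Equation (3):
    loss = L(O(x|w), T) + Delta, where Delta penalizes the sensitive
    information that a frozen detector can still read from the learned
    face representation. A simplified stand-in for SensitiveNets [25]."""
    with tf.GradientTape() as tape:
        z = embedding_model(x_face, training=True)               # learned face representation
        o = score_head(tf.concat([x_merits, z], axis=1), training=True)
        delta = tf.reduce_mean(sensitive_detector(z, training=False))  # sensitiveness term
        loss = mae(t, o) + delta
    variables = embedding_model.trainable_variables + score_head.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss
```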

Table 1: Distribution of the top 100 candidates for each scenario in FairCVtest, by gender and ethnicity group. ∆ = maximum difference across groups. Dem = Demographic attributes (gender and ethnicity).

Scenario   Bias   Merits   Dem   Face   |   Male   Female   ∆ (gender)   |   Group 1   Group 2   Group 3   ∆ (ethnicity)
1          no     yes      yes   no     |   51%    49%      2%           |   33%       34%       33%       1%
2          yes    yes      yes   no     |   87%    13%      74%          |   90%       9%        1%        89%
3          yes    yes      no    no     |   50%    50%      0%           |   32%       34%       34%       2%
4          yes    yes      no    yes    |   77%    23%      54%          |   53%       31%       16%       37%
Agnostic   yes    yes      no    yes    |   50%    50%      0%           |   35%       30%       35%       5%
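Table 1 summarizes the screening simulation described in the text below: rank the 4,800 validation resumes, keep the 100 highest scores, and measure each group's share among them. A minimal sketch, with hypothetical array names:

```python
import numpy as np

def screening_distribution(scores, groups, top_k=100):
    """Select the top-k scored candidates and report each demographic group's
    share among them, plus the maximum difference across groups (Delta).
    `scores` and `groups` stand for the 4,800 validation CVs (hypothetical arrays)."""
    top = np.argsort(scores)[::-1][:top_k]
    labels = np.unique(groups)
    shares = {label: 100.0 * np.sum(groups[top] == label) / top_k for label in labels}
    delta = max(shares.values()) - min(shares.values())
    return shares, delta

# Example: shares, delta = screening_distribution(val_scores, val_gender)
```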

Previous results suggest the potential of sensitive information removal techniques to guarantee fair representations. In order to further evaluate these agnostic representations, we conducted another experiment simulating the outcome of a recruitment tool. We assume that the final decision in a recruitment process will be made by humans, and that the recruitment tool will be used to perform a first screening among a large list of candidates, including the 4,800 resumes used as validation set in our previous experiments. For each scenario, we simulate the candidate screening by choosing the top 100 scores among them (i.e. the scores with the highest values). We present the distribution of these selections by gender and ethnicity in Table 1, as well as the maximum difference across groups (∆). As we can observe, in Scenarios 1 and 3, where the classifier shows no demographic bias, we have almost no difference ∆ in the percentage of candidates selected from each demographic group. On the other hand, in Scenarios 2 and 4 the impact of the bias is notorious, being larger in the former with a difference of 74% in the gender case and 89% in the ethnicity case. The results show differences of 54% for the gender attribute in Scenario 4, and 37% for the ethnicity attribute. However, when the sensitive feature removal technique is applied [25], the demographic difference drops from 54% to 0% in the gender case, and from 37% to 5% in the ethnicity one, effectively correcting the bias in the dataset. These results demonstrate the potential hazards of these recruitment tools in terms of fairness, and also serve to show possible ways to solve them.

5. Conclusions

We present FairCVtest, a new experimental framework (publicly available6) on AI-based automated recruitment to study how multimodal machine learning is affected by biases present in the training data. Using FairCVtest, we have studied the capacity of common deep learning algorithms to expose and exploit sensitive information from commonly used structured and unstructured data.

The contributed experimental framework includes FairCVdb, a large set of 24,000 synthetic profiles with information typically found in job applicants' resumes. These profiles were scored introducing gender and ethnicity biases, which resulted in gender and ethnicity discrimination in the learned models targeted to generate candidate scores for hiring purposes. Discrimination was observed not only when those gender and ethnicity attributes were explicitly given to the system, but also when a face image was given instead. In this scenario, the system was able to expose sensitive information from these images (gender and ethnicity), and to model its relation to the biases in the problem at hand. This behavior is not limited to the case studied, where the bias lies in the target function. Feature selection or unbalanced data can also become sources of bias. This last case is common when datasets are collected from historical sources that fail to represent the diversity of our society.

Finally, we discussed recent methods to prevent the undesired effects of these biases, and then experimented with one of these methods (SensitiveNets) to improve fairness in this AI-based recruitment framework. Instead of removing the sensitive information at the input level, which may not be possible or practical, SensitiveNets removes sensitive information during the learning process.

The most common approach to analyzing algorithmic discrimination is through group-based bias [32]. However, recent works are now starting to investigate biased effects in AI with user-specific methods, e.g. [4, 34]. Future work will update FairCVtest with such user-specific biases in addition to the considered group-based bias.

6. Acknowledgments

This work has been supported by projects BIBECA (RTI2018-101248-B-I00 MINECO/FEDER), TRESPASS-ETN (MSCA-ITN-2019-860813), and PRIMA (MSCA-ITN-2019-860315); and by Accenture. A. Peña is supported by a research fellowship from Spanish MINECO.

6 https://github.com/BiDAlab/FairCVtest

References

[1] A. Acien, A. Morales, R. Vera-Rodriguez, I. Bartolome, and J. Fierrez. Measuring the Gender and Ethnicity Bias in Deep Models for Face Recognition. In Proc. of IbPRIA, Madrid, Spain, Nov. 2018.
[2] M. Ali, P. Sapiezynski, M. Bogen, A. Korolova, A. Mislove, and A. Rieke. Discrimination through optimization: How Facebook's ad delivery can lead to skewed outcomes. In Proc. of ACM Conf. on CHI, 2019.
[3] M. Alvi, A. Zisserman, and C. Nellaker. Turning a blind eye: Explicit removal of biases and variation from deep neural network embeddings. In Proceedings of the European Conference on Computer Vision, Sep. 2018.
[4] M. Bakker, H. Riveron Valdes, et al. Fair enough: Improving fairness in budget-constrained decision making using confidence thresholds. In AAAI Workshop on Artificial Intelligence Safety, pages 41–53, New York, NY, USA, 2020.
[5] T. Baltrušaitis, C. Ahuja, and L. Morency. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41:423–443, Feb. 2019.
[6] S. Barocas and A. D. Selbst. Big data's disparate impact. California Law Review, 2016.
[7] B. Berendt and S. Preibusch. Exploring discrimination: A user-centric evaluation of discrimination-aware data mining. In IEEE ICDM Workshops, pages 344–351, Dec. 2012.
[8] M. Bogen and A. Rieke. Help wanted: Examination of hiring algorithms, equity, and bias. Technical report, 2018.
[9] J. Buolamwini and T. Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Proc. ACM Conf. on FAccT, NY, USA, Feb. 2018.
[10] S. Chandler. The AI chatbot will hire you now. Wired, Sep. 2017.
[11] J. Dastin. Amazon scraps secret AI recruiting tool that showed bias against women. Reuters, Oct. 2018.
[12] M. De-Arteaga, R. Romanov, et al. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proc. of ACM FAccT, pages 120–128, 2019.
[13] P. Drozdowski, C. Rathgeb, A. Dantcheva, N. Damer, and C. Busch. Demographic bias in biometrics: A survey on an emerging challenge. arXiv/2003.02488, 2020.
[14] M. Evans and A. W. Mathews. New York regulator probes UnitedHealth algorithm for racial bias. The Wall Street Journal, Oct. 2019.
[15] J. Fierrez, A. Morales, R. Vera-Rodriguez, and D. Camacho. Multiple classifiers in biometrics. Part 1: Fundamentals and review. Information Fusion, 44:57–64, Nov. 2018.
[16] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(1):2096–2030, Jan. 2016.
[17] E. Gonzalez-Sosa, J. Fierrez, et al. Facial soft biometrics for recognition in the wild: Recent works, annotation and COTS evaluation. IEEE Trans. on Information Forensics and Security, 13(8):2001–2014, Aug. 2018.
[18] I. Goodfellow, J. Pouget-Abadie, et al. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680, 2014.
[19] B. Goodman and S. Flaxman. EU regulations on algorithmic decision-making and a "right to explanation". AI Magazine, 38, Jun. 2016.
[20] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conf. on CVPR, pages 770–778, Jun. 2016.
[21] S. Jia, T. Lansdall-Welfare, and N. Cristianini. Right for the right reason: Training agnostic networks. In Advances in Intelligent Data Analysis XVII, pages 164–174, 2018.
[22] T. Kehrenberg, Z. Chen, and N. Quadrianto. Interpretable fairness via target labels in Gaussian process models. arXiv/1810.05598, Oct. 2018.
[23] B. Kim, H. Kim, K. Kim, S. Kim, and J. Kim. Learning not to learn: Training deep neural networks with biased data. In Proc. IEEE Conf. on CVPR, 2019.
[24] W. Knight. The Apple Card didn't 'see' gender and that's the problem. Wired, Nov. 2019.
[25] A. Morales, J. Fierrez, and R. Vera-Rodriguez. SensitiveNets: Learning agnostic representations with application to face recognition. arXiv/1902.00334, 2019.
[26] S. Nagpal, M. Singh, R. Singh, M. Vatsa, and N. K. Ratha. Deep learning for face recognition: Pride or prejudiced? arXiv/1904.01219, 2019.
[27] M. Raghavan, S. Barocas, J. M. Kleinberg, and K. Levy. Mitigating bias in algorithmic employment screening: Evaluating claims and practices. arXiv/1906.09208, 2019.
[28] R. Ranjan, S. Sankaranarayanan, A. Bansal, N. Bodla, J. Chen, V. M. Patel, C. D. Castillo, and R. Chellappa. Deep learning for understanding faces: Machines may be just as good, or better, than humans. IEEE Signal Processing Magazine, 35(1):66–83, 2018.
[29] A. Romanov, M. De-Arteaga, et al. What's in a name? Reducing bias in bios without access to protected attributes. In Proceedings of NAACL-HLT, pages 4187–4195, 2019.
[30] P. Sattigeri, S. C. Hoffman, V. Chenthamarakshan, and K. R. Varshney. Fairness GAN: Generating datasets with fairness properties using a generative adversarial network. IBM Journal of Research and Development, 63:3:1–3:9, 2019.
[31] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In IEEE Conf. on CVPR, pages 815–823, Jun. 2015.
[32] I. Serna, A. Morales, J. Fierrez, M. Cebrian, N. Obradovich, and I. Rahwan. Algorithmic discrimination: Formulation and exploration in deep learning-based face biometrics. In Proc. of AAAI Workshop on SafeAI, Feb. 2020.
[33] L. Sweeney. Discrimination in online ad delivery. Queue, 11(3):10:10–10:29, Mar. 2013.
[34] Y. Zhang, R. Bellamy, and K. Varshney. Joint optimization of AI fairness and utility: A human-centered approach. In AAAI/ACM Conf. on AIES, NY, USA, 2020.
[35] J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K. Chang. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proc. of EMNLP, pages 2979–2989, 2017.
