Detecting Sensitive Information From Unstructured Text in A Data-Constrained Environment
Abstract—For an enterprise, it is important to handle sensitive customer data properly because any data breach or violation can lead to hefty penalties. Past work has looked at various techniques for detecting sensitive data in free-flowing text for a given regulation. However, most of them either produce many false positives or are very specific to certain types of data, for example, email addresses, account numbers, or social security numbers. Moreover, machine learning-based methods are difficult to use, as finding large amounts of labeled data for training a supervised model poses a serious challenge. In this work, we aim to address the issue of sensitive data discovery in a data-constrained environment by utilizing pre-trained models. We compare their effectiveness in the financial and health domains. Further, we improve the performance of pre-trained models by employing morphological-level features and propose a hybrid model architecture. Our experimental results show that pre-trained models in a data-constrained environment can reduce the turnaround time for sensitive data discovery, thus saving money and effort.

Index Terms—Sensitive Data, User Privacy, Named Entity Recognition, Pre-trained Language Models

I. INTRODUCTION

A data breach is a cybersecurity incident where confidential information is intentionally or unintentionally exposed to unauthorized parties. For an enterprise, a data breach incident is a serious concern, as it can result in monetary losses due to legal penalties, loss of customers, and a negative effect on its reputation [1]. It also impacts the end users whose data is lost in the breach, as it increases their chances of becoming victims of identity theft. For example, in 2017, the credit reporting company Equifax had sensitive information on approximately 140 million consumers stolen. This was an unprecedented data breach due to its size and the implications for Equifax and its customers [2]. With increased digitalization and an exponential increase in data, such incidents are frequently reported. For example, in 2020, Capital One reported that the sensitive information of over 100 million people had been compromised [3]. Similarly, in the health domain, an incident occurred at the health insurer Excellus Health Plan Inc., where the personal health information of 9.3 million customers was breached [4]. Despite extensive research to detect and prevent data breaches, they remain an active threat to enterprises and individuals alike.

For an enterprise, it is important to maintain the end user's privacy while processing the data to generate valuable insights about its services. Various data protection regulations, such as the General Data Protection Regulation (GDPR), the Payment Card Industry Data Security Standard (PCI-DSS), and the Health Insurance Portability and Accountability Act (HIPAA), have mandated enterprises to treat the personal data of their customers with great caution. Any form of data leakage or privacy violation can have serious consequences for an enterprise. For this, the enterprise needs to know the kind and nature of data residing on each device, as this helps in installing appropriate security tools and enabling the required access rights for its employees. However, within an enterprise, the personal and sensitive information of a customer can be present in unstructured, structured, and semi-structured forms. Manual analysis of these documents can be difficult, expensive, time-consuming, and often prone to error. Therefore, it becomes imperative for enterprises to strike a balance between security and operational costs.

In order to keep track of sensitive information, enterprises need to extract and identify sensitive information present in unstructured data sources. Past research has looked into content fingerprinting, locality-sensitive hashes [1], regular expressions [5], and machine learning-based approaches [6], [7]. Some of these techniques either generate too many false positives (content fingerprinting and regular expressions), take too long to be a practical solution due to pairwise comparison (locality-sensitive hashes), or require a large amount of labeled data for training machine learning models. To a certain extent, Named Entity Recognition (NER) using pre-trained models has been utilized for identifying named entities in unstructured text, for example, usernames, locations, quantities, and brands [7]. However, there is still a significant gap when it comes to sensitive data identification in unstructured text. The main reason for this is the lack of labeled datasets for training machine learning models, owing to the sensitive nature of the data.

In this work, we focus on the identification of financial as well as health-related sensitive data in unstructured text for a given context in a data-constrained environment. Context, in this regard, means the privacy regulation under which the text is analyzed. This is a typical setup in an enterprise, as the majority of the data being dealt with is in unstructured format [8]. Even though we address the issue of data scarcity for sensitive information discovery in free-flowing text, our proposed approach is equally applicable to other text-based problems in the cybersecurity domain, for example, spear-phishing, insider threat, and email analytics.
We investigate the feasibility of pre-trained language models for identifying sensitive data in unstructured text. More specifically, we evaluate the accuracy of different pre-trained language models, without any domain- or language-dependent pre-training, for the extraction of sensitive information from unstructured English-language data. Further, we improve over the current pre-trained models and propose a new model architecture with improved performance. Our contributions are as follows:

1) The paper analyzes the performance of different pre-trained models on the sensitive data discovery task in the financial as well as the health domain.
2) Experimental results show that pre-trained models in a data-constrained environment can lead to better results. Also, with better feature engineering in the financial and health domains, the performance of pre-trained models is improved.
3) We propose a new Char-BERT-CRF model to identify sensitive information, which captures morphological features along with word-level features. Experimental results show that our model achieves reasonable performance in a data-constrained environment.

The rest of the paper is organized as follows. Section II describes related work. Section III presents a detailed description of the approach followed in this work. Section IV describes the experiments and results. Section V contains the discussion. Finally, Section VI concludes the work.

II. RELATED WORK

Depending on the difficulty of the task and the desired outcome, multiple techniques have been used to find sensitive data in free-flowing text.

Pattern Matching. Some of the past work has used pattern matching for the extraction and identification of sensitive data in unstructured text. Yongyan Guo et al. [5] proposed a system based on content-based (regular expressions) and context-based (deep learning; BiLSTM-CRF) detection of sensitive information. Similarly, Mariana Dias et al. [9] used rule-based techniques to extract entities, such as email addresses, postal codes, and some dates, in the Portuguese language. Pattern matching-based systems tend to have high precision for certain kinds of data but need constant maintenance, which can be very time-consuming and expensive.

Machine Learning and Deep Learning. Silva et al. [7] used publicly available natural language processing (NLP) tools such as spaCy, NLTK, and Stanford CoreNLP to identify generic sensitive data such as names, places, organizations, etc. They experimented with the Groningen Meaning Bank dataset and manually collected and annotated US voter registration data. Their combined data consists of 170K lines, which was sufficient to train and test machine learning models. Hassan et al. [10] trained a word embedding model and analyzed the semantic relationships of sensitive entities. Their proposed method requires a large corpus to train the word embeddings, and it is limited by the choice of the window size. Truong et al. [6] focused on the extraction of sensitive information from both structured and unstructured data in financial institutions. They used a CNN with a CRF for structured and unstructured data and evaluated their model with synthetic and real-world data. While most of the aforementioned work provides good accuracy in sensitive data discovery, these approaches need a decent amount of real-world training data to perform well, which is hard to obtain because of the sensitive nature of the data. Our work focuses on the use of pre-trained models and shows the effectiveness of these models in a scenario where training data is scarce. Zhang et al. [11] improved the accuracy of clinical named entity recognition models by utilizing transfer learning. The authors of [12] made use of multi-task learning to obtain useful information from different datasets. By utilizing a global attention mechanism, Xu et al. [13] improved the performance of clinical named entity recognition models. All of these deep learning techniques, however, rely on token representations that are not contextualized. Moreover, these techniques require a good amount of data for their models to provide good accuracy in clinical named entity extraction tasks. In this work, we mainly focus on contextualized representations and pre-trained models in a data-constrained environment. Similar to our work, Zhou et al. [14] pre-trained two deep contextualized models, C-ELMo and C-Flair, using a corpus from PubMed Central, utilized the pre-trained contextualized embeddings along with static word embeddings on the MACCROBAT2018 dataset, and fed them into a BiLSTM-CRF model for the clinical named entity extraction task. However, their model fails to capture long-range dependencies within a document; in contrast, our work captures long-range dependencies better.

Few-Shot and Zero-Shot Learning. Few-shot and zero-shot NER techniques are useful in scenarios where training data is scarce. Researchers have used few-shot and zero-shot NER techniques [15] in environments where little or no training data is available. However, experimental results in these studies show that the accuracy of such a system can be very low and could lead to the loss of sensitive information. Therefore, we focus on pre-trained models and propose a hybrid deep learning-based system for sensitive data discovery.

III. APPROACH

Our objective is to extract sensitive information that is present in unstructured data. To achieve this goal, we used generic, vanilla versions of pre-trained language models and evaluated the performance of the different models on the sensitive data discovery task.

A. Data Preprocessing

1) Tokenization: Sentences were split into discrete tokens, and all non-ASCII characters as well as whitespace at the beginning and end of each sentence were removed. We used WordPiece tokenization to further partition the tokens [16].

2) Character Tokenization: We also implemented character-level tokenization to incorporate character-level features into our model so that it better captures the structure of each word. Each token is converted to a sequence of characters with a maximum sequence length of 'N' characters, a configurable parameter equal to the length of the largest sequence in the dataset. For the dataset used in this paper, the value of 'N' is 35.
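For illustration, a minimal preprocessing sketch along these lines, using a Hugging Face tokenizer for the WordPiece step and plain Python for the character sequences (the cleaning regex, padding scheme, and checkpoint name are our assumptions, not details from the paper):

```python
import re
from transformers import BertTokenizerFast

MAX_CHARS = 35  # 'N': length of the largest character sequence in our dataset

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def preprocess(sentence: str):
    # Strip non-ASCII characters and surrounding whitespace.
    sentence = re.sub(r"[^\x00-\x7F]+", " ", sentence).strip()
    # Split into discrete tokens, then partition each into WordPiece sub-tokens.
    tokens = sentence.split()
    wordpieces = [tokenizer.tokenize(t) for t in tokens]
    # Character-level view of each token, padded/truncated to MAX_CHARS.
    char_seqs = [list(t)[:MAX_CHARS] + ["<pad>"] * max(0, MAX_CHARS - len(t))
                 for t in tokens]
    return tokens, wordpieces, char_seqs

tokens, wps, chars = preprocess("Card 654212376 expires 09/25.")
print(wps[1])  # WordPiece splits the number into sub-tokens like '654', '##21', ...
```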
B. Model Enhancement

Named Entity Recognition is one of the techniques that can be used to extract sensitive data from free-flowing text. We evaluated different pre-trained models on the sensitive data discovery task in the financial and health domains. However, some entities in the dataset, such as credit card numbers, account numbers, ages, and dates, are numeric entities. Pre-trained models trained on general corpora fail to capture the features of such numeric entities.

To overcome the above-mentioned issue, we improved the feature engineering by incorporating character-level features along with word-level features. Character-level features are generated by processing the text at the character level, leading to a better representation of the text. This is beneficial when dealing with numeric entities. Hence, we developed a model that uses the embedding-layer architecture of CharacterBERT [17] for character-level embeddings and BERT [18] for word-level embeddings. We extract the sensitive entities by using a Conditional Random Field [19] as a decoder, applying it to the concatenated embeddings. Figure 1 shows the architecture of our model. We used the sequence labeling strategy commonly referred to as 'BIO' sequence labeling: the 'B' prefix indicates the beginning of a tag, the 'I' prefix indicates that the word is inside a chunk, and the 'O' tag indicates that the word does not belong to any category. Table II shows the different sensitive entities used.

[Fig. 1: Char-BERT-CRF architecture. The BERT representation (inputs: the sequence of WordPiece tokens in a sentence, e.g., 654, ##21, ##23, ##76) and the CharacterBERT representation (inputs: the sequence of characters in a word, e.g., 6, 5, 4, 2, 1, 2, 3, 7, 6) are concatenated and fed to a CRF layer, which outputs the sensitive entities.]

CharacterBERT. While CharacterBERT is identical to vanilla BERT in every other manner, it builds its initial context-independent representations differently. At its embedding layer, it uses a Character-CNN module to represent each word with the help of the characters that constitute it. Each token is converted to a sequence of characters of length 'N' and then fed to multiple 1-d CNNs with different filters. The output from each CNN is max-pooled and concatenated to produce the final representation. Following [17], each representation is then sent through Highway layers that control the flow of information with the help of gating units. We followed the same architecture for generating character-level features.

BERT. For obtaining word-level contextual representations, we adopt the BERT model architecture, more specifically the BERT-base architecture (L = 12, H = 768, A = 12). The character-level and word-level representations are then concatenated to obtain the final representations.

Conditional Random Fields (CRF). The final output representations are then fed to a CRF, which acts as a decoder and classifies each token into the different categories. CRFs are excellent at capturing dependencies between adjacent labels, which is crucial in sequence labeling tasks.

C. Training and Evaluation

For the character-level embeddings, we employed only the embedding-layer architecture of CharacterBERT [17]. Each token is divided into a sequence of characters with a maximum length of 'N' and fed into multiple 1-d CNNs. We used ReLU as the activation function, with filter sizes (width, number of filters) given as (1, 32), (2, 32), (3, 64), (4, 128), (5, 256), (6, 512), and (7, 1024). We initialize our BERT model using pre-trained weights obtained through Hugging Face [20]. We feed the final concatenated representation to the CRF decoder, which outputs each entity type, including 'O' for other entities. We used a dropout probability of 0.5 and the Adam optimizer with a learning rate of 5e-5. We use linear learning-rate warm-up for the first 10% of iterations, with a maximum sequence length of 258 and a batch size of 16. We aim to evaluate different pre-trained models and deep learning models with respect to accuracy on the financial and health datasets.
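To make the architecture concrete, the following is a minimal PyTorch sketch of such a Char-BERT-CRF stack under the settings stated above (the reported filter sizes, 'N' = 35, BERT-base, dropout 0.5). It is a simplified reconstruction, not the authors' code: the number of Highway layers, the character vocabulary size and embedding width, and the use of the pytorch-crf package are our assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # pip install pytorch-crf (assumed dependency)

FILTERS = [(1, 32), (2, 32), (3, 64), (4, 128), (5, 256), (6, 512), (7, 1024)]
N_CHARS, CHAR_VOCAB, CHAR_DIM = 35, 256, 16  # char vocab/dim are assumptions

class Highway(nn.Module):
    """Gated layer controlling information flow, as in the Character-CNN of [17]."""
    def __init__(self, dim):
        super().__init__()
        self.proj, self.gate = nn.Linear(dim, dim), nn.Linear(dim, dim)
    def forward(self, x):
        g = torch.sigmoid(self.gate(x))
        return g * torch.relu(self.proj(x)) + (1 - g) * x

class CharBertCrf(nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        self.char_emb = nn.Embedding(CHAR_VOCAB, CHAR_DIM, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(CHAR_DIM, out_ch, kernel_size=k) for k, out_ch in FILTERS])
        char_dim = sum(out for _, out in FILTERS)            # 2048
        self.highway = nn.Sequential(Highway(char_dim), Highway(char_dim))
        self.bert = BertModel.from_pretrained("bert-base-uncased")  # H = 768
        self.dropout = nn.Dropout(0.5)
        self.classifier = nn.Linear(char_dim + 768, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def char_features(self, char_ids):                       # (B, T, N_CHARS)
        B, T, N = char_ids.shape
        x = self.char_emb(char_ids).view(B * T, N, -1).transpose(1, 2)
        # One max-pool per 1-d CNN; outputs concatenated into the token vector.
        pooled = [conv(x).max(dim=-1).values for conv in self.convs]
        return self.highway(torch.cat(pooled, dim=-1)).view(B, T, -1)

    def forward(self, input_ids, attention_mask, char_ids, labels=None):
        # Assumes char_ids are aligned position-by-position with input_ids.
        word_repr = self.bert(input_ids,
                              attention_mask=attention_mask).last_hidden_state
        feats = torch.cat([word_repr, self.char_features(char_ids)], dim=-1)
        emissions = self.classifier(self.dropout(feats))
        mask = attention_mask.bool()
        if labels is not None:                               # training: NLL loss
            return -self.crf(emissions, labels, mask=mask)
        return self.crf.decode(emissions, mask=mask)         # inference: best paths
```

In this sketch, the negative CRF log-likelihood serves as the training loss, and Viterbi decoding (crf.decode) produces the label sequence at inference time.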
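A training loop matching the stated hyperparameters (batch size 16, Adam at 5e-5, linear warm-up over the first 10% of steps) might then look as follows; train_dataset, the epoch count, and the label count are placeholders, not values taken from the paper.

```python
from torch.optim import Adam
from torch.utils.data import DataLoader
from transformers import get_linear_schedule_with_warmup

BATCH_SIZE, LR, MAX_LEN, NUM_EPOCHS = 16, 5e-5, 258, 10  # NUM_EPOCHS assumed

model = CharBertCrf(num_labels=23)  # e.g. 11 entity types * B/I tags + 'O'
loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
optimizer = Adam(model.parameters(), lr=LR)
total_steps = len(loader) * NUM_EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.1 * total_steps),  # linear 10% warm-up
    num_training_steps=total_steps)

for epoch in range(NUM_EPOCHS):
    for batch in loader:  # sequences pre-truncated/padded to MAX_LEN
        optimizer.zero_grad()
        loss = model(batch["input_ids"], batch["attention_mask"],
                     batch["char_ids"], labels=batch["labels"])
        loss.backward()
        optimizer.step()
        scheduler.step()
```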
IV. EXPERIMENTS

A. Baseline Models

Considering sensitive data discovery as a sequence labeling problem, we compare and evaluate different pre-trained and deep learning models in terms of Precision, Recall, and F1-score. We implemented (1) BiLSTM-CRF [21] and (2) Char-BiLSTM-CRF [22], as well as different hybrid combinations of pre-trained models, namely (3) RoBERTa-CRF [21], (4) ALBERT-CRF [21], (5) DistilBERT-CRF [21], and (6) BERT-CRF [21]. The pre-trained models are trained on general corpora, such as books and English Wikipedia [23], and are included because they are the most widely used in sequence labeling tasks. We also compared our model with different baseline models [12], [14], [22]. Our model achieves reasonable performance in a data-constrained environment compared with these baselines.

TABLE II: Types of sensitive entities in the financial dataset

    Sensitive Information Type        Sensitive Entity
    Personal Sensitive Entities       Location
                                      Person Name
                                      Social Security Number
                                      Email ID
    Financial Sensitive Entities      Bank Name
                                      Credit Card Number
                                      CVV Number
                                      Expiry Date
                                      Permanent Identification Number
                                      Bank Account Number
                                      International Bank Account Number

To increase the number of sentences while maintaining the contextual information, we used the Pegasus model [26]. The model uses an encoder-decoder architecture for sequence-to-sequence learning tasks such as summarization, paraphrase generation, etc. In our work, it helped to increase the number of sentences by generating similar sentences for a particular entity while maintaining the contextual information. The numbers of sentences, tokens, and entities in the training and testing sets are (10349, 4758), (204291, 54213), and (11, 11), respectively. Figure 2 shows the number of entities in each set of the dataset used to train and test the various models.

[Fig. 2: Number of each sensitive entity (Account Number, IBAN, Expiry Date, Email ID, CVV, Social Security Number, Credit Card Number, Organization, ...) in the train and test sets.]
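As an illustration of this augmentation step, here is a short sketch using a paraphrasing Pegasus checkpoint from the Hugging Face hub; the specific checkpoint and generation settings are our assumptions, since the paper names only the Pegasus model [26].

```python
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

MODEL = "tuner007/pegasus_paraphrase"  # assumed paraphrasing checkpoint
tokenizer = PegasusTokenizer.from_pretrained(MODEL)
model = PegasusForConditionalGeneration.from_pretrained(MODEL)

def paraphrase(sentence: str, n: int = 5) -> list[str]:
    """Generate n similar sentences that preserve the surrounding context."""
    batch = tokenizer([sentence], truncation=True, padding="longest",
                      return_tensors="pt")
    outputs = model.generate(**batch, max_length=60, num_beams=10,
                             num_return_sequences=n)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Entity values (e.g. the account number) must be re-checked and re-annotated
# after generation, since the paraphraser may alter or drop them.
print(paraphrase("The payment was charged to account 654212376."))
```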
[Results table residue (headers not recovered): DistilBERT-CRF 0.92, 0.84, 0.88; 0.92, 0.83, 0.87; 0.42.]

[Fig. 4: Graph representing change in performance with respect to change in health training data percent.]
[Figure: Precision, Recall, and F1-score panels comparing the pre-trained models against the BiLSTM baselines.]
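The comparisons above are in terms of entity-level precision, recall, and F1-score over the BIO-tagged output. A typical way to compute these is sketched below with the seqeval library; the paper does not name its evaluation tooling, and the tag names are illustrative.

```python
from seqeval.metrics import precision_score, recall_score, f1_score

# Entity-level scoring over BIO tags: a prediction counts as correct only if
# both the entity boundary and its type match the gold annotation.
y_true = [["O", "B-CREDIT_CARD_NUMBER", "I-CREDIT_CARD_NUMBER", "O",
           "B-EXPIRY_DATE"]]
y_pred = [["O", "B-CREDIT_CARD_NUMBER", "I-CREDIT_CARD_NUMBER", "O", "O"]]

print(precision_score(y_true, y_pred))  # 1.0: the one predicted entity is right
print(recall_score(y_true, y_pred))     # 0.5: one of two gold entities found
print(f1_score(y_true, y_pred))         # ~0.67
```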
V. DISCUSSION

We compared the hybrid models and the vanilla pre-trained models on the health dataset. Overall, the hybrid models show higher accuracy than their vanilla counterparts in terms of F1-score. This is because, when computing the probability distribution of outputs, vanilla models only consider the current word in the input sequence, whereas hybrid models compute outputs by considering not only the current word but also the rest of the sequence. In other words, the hybrid design improves the models' ability to learn and generalize well on downstream tasks.

Future Work. Several avenues exist for future research. In the domain of sensitive data discovery, no real-world annotated dataset is publicly available. For any deep learning or machine learning solution to work, there is a need for a good-quality dataset. Further research may include the use of generative models such as GANs for generating realistic, good-quality synthetic datasets. In this study, we focused on general-domain pre-trained models, and domain-specific models were not explored. Future work could include exploring other models and their performance in a limited-training-data scenario. A large proportion of sensitive entities contain numeric data, and pre-trained models trained on general corpora tend to struggle with numeric data due to the poor representation of large numbers. Further work is required to improve the representation of numeric data within pre-trained models.

VI. CONCLUSION

In this paper, we address the issue of extracting sensitive information from free-flowing text. We experimented with different pre-trained models in a data-constrained environment, where these models provided good accuracy. We proposed a Char-BERT-CRF model, which combines character-level and word-level contextual representations and feeds them into a CRF decoder for better extraction of numeric sensitive entities. Our study also shows that it is possible to build a sensitive data discovery solution by employing different pre-trained models in a data-constrained environment.

REFERENCES

[1] Cheng et al., "Enterprise data breach: Causes, challenges, prevention, and future directions," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 7, no. 5, p. e1211, 2017.
[2] Zou et al., "'I've got nothing to lose': Consumers' risk perceptions and protective actions after the Equifax data breach," SOUPS, pp. 197–216, 2018.
[3] Neto et al., "A case study of the Capital One data breach," Jan. 2020.
[4] HIPAA Journal. (January 2021) Excellus Health Plan settles HIPAA violation case and pays $5.1 million penalty. [Online]. Available: https://www.hipaajournal.com/excellus-health-plan-settles-hipaa-violation-case-and-pays-5-1-million-penalty/
[5] Y. Guo et al., "ExSense: Extract sensitive information from unstructured data," Computers & Security, vol. 102, p. 102156, 2021.
[6] A. Truong et al., "Sensitive data detection with high-throughput neural network models for financial institutions," arXiv preprint arXiv:2012.09597, 2020.
[7] P. Silva et al., "Using NLP and machine learning to detect data privacy violations," IEEE INFOCOM Workshops, pp. 972–977, 2020.
[8] Allahyari et al., "A brief survey of text mining: Classification, clustering and extraction techniques," arXiv preprint arXiv:1707.02919, 2017.
[9] M. Dias et al., "Named entity recognition for sensitive data discovery in Portuguese," Applied Sciences, vol. 10, no. 7, p. 2303, 2020.
[10] Hassan et al., "Automatic anonymization of textual documents: Detecting sensitive information via word embeddings," TrustCom, pp. 358–365, 2019.
[11] Zhang et al., "Improving clinical named-entity recognition with transfer learning," Stud Health Technol Inform, vol. 252, pp. 182–187, 2018.
[12] X. Wang et al., "Cross-type biomedical named entity recognition with deep multi-task learning," Bioinformatics, vol. 35, no. 10, pp. 1745–1752, 2019.
[13] G. Xu et al., "Improving clinical named entity recognition with global neural attention," in Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data. Springer, 2018, pp. 264–279.
[14] Y. Zhou et al., "Clinical named entity recognition using contextualized token representations," arXiv preprint arXiv:2106.12608, 2021.
[15] Y. Wang et al., "Learning from language description: Low-shot named entity recognition via decomposed framework," arXiv preprint arXiv:2109.05357, 2021.
[16] Y. Wu et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv preprint arXiv:1609.08144, 2016.
[17] H. El Boukkouri et al., "CharacterBERT: Reconciling ELMo and BERT for word-level open-vocabulary representations from characters," arXiv preprint arXiv:2010.10392, 2020.
[18] J. Devlin et al., "BERT: Pre-training of deep bidirectional transformers for language understanding," Association for Computational Linguistics, 2019, pp. 4171–4186.
[19] J. Lafferty et al., "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," 2001.
[20] T. Wolf et al., "HuggingFace's Transformers: State-of-the-art natural language processing," arXiv preprint arXiv:1910.03771, 2019.
[21] J. Li et al., "A survey on deep learning for named entity recognition," IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 1, pp. 50–70, 2020.
[22] X. Ma and E. Hovy, "End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF," pp. 1064–1074, 2016.
[23] Y. Zhu et al., "Aligning books and movies: Towards story-like visual explanations by watching movies and reading books," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 19–27.
[24] Z. Wang et al., "CrossWeigh: Training named entity tagger from imperfect annotations," pp. 5154–5163, 2019.
[25] J. Alvarado et al., "Domain adaption of named entity recognition to support credit risk assessment," pp. 84–90, 2015.
[26] J. Zhang et al., "PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization," pp. 11328–11339, 2020.