
COMSNETS 2023 Cybersecurity and Privacy Workshop

Detecting Sensitive Information from Unstructured Text in a Data-Constrained Environment
Saurabh Anand, Manish Shukla, Sachin Lodha
TCS Research, India
[email protected]

2023 15th International Conference on COMmunication Systems & NETworkS (COMSNETS) | 978-1-6654-7706-2/23/$31.00 ©2023 IEEE | DOI: 10.1109/COMSNETS56262.2023.10041388

Abstract—For an enterprise, it is important to handle sensitive customer data properly, because any data breach or violation can lead to hefty penalties. Past work has looked at various techniques for detecting sensitive data in free-flowing text for a given regulation. However, most of them either produce many false positives or are very specific to certain types of data, for example, email addresses, account numbers, or social security numbers. Moreover, machine learning-based methods are difficult to use, as finding large amounts of labeled data for training a supervised model poses a serious challenge. In this work, we aim to address the issue of sensitive data discovery in a data-constrained environment by utilizing pre-trained models. We compare their effectiveness in the financial and health domains. Further, we improve the performance of pre-trained models by employing morphological-level features and propose a hybrid model architecture. Our experimental results show that pre-trained models in a data-constrained environment can reduce the turnaround time for sensitive data discovery, thus saving money and effort.

Index Terms—Sensitive Data, User Privacy, Named Entity Recognition, Pre-trained Language Models

I. INTRODUCTION

A data breach is a cybersecurity incident where confidential information is intentionally or unintentionally exposed to unauthorized parties. For an enterprise, a data breach is a serious concern, as it can result in monetary losses due to legal penalties, loss of customers, and a negative effect on its reputation [1]. It also impacts the end users whose data is lost in the breach, as it increases their chances of becoming victims of identity theft. For example, in 2017, the credit reporting company Equifax had sensitive information on approximately 140 million consumers stolen; this was an unprecedented data breach due to its size and the implications for Equifax and its customers [2]. With increased digitalization and an exponential increase in data, such incidents are frequently reported. For example, in 2020, Capital One reported that the sensitive information of over 100 million people had been compromised [3]. Similarly, in the health domain, an incident occurred at the health insurer Excellus Health Plan Inc., where the personal health information of 9.3 million customers was breached [4]. Despite extensive research on detecting and preventing data breaches, they remain an active threat to enterprises and individuals alike.

For an enterprise, it is important to maintain the end user's privacy while processing data to generate valuable insights about its services. Various data protection regulations, such as the General Data Protection Regulation (GDPR), the Payment Card Industry Data Security Standard (PCI-DSS), and the Health Insurance Portability and Accountability Act (HIPAA), have mandated that enterprises treat the personal data of their customers with great caution. Any form of data leakage or privacy violation can have serious consequences for an enterprise. The enterprise therefore needs to know the kind and nature of data residing on each device, as this helps in installing appropriate security tools and enabling the required access rights for its employees. However, within an enterprise, the personal and sensitive information of a customer can be present in unstructured, structured, and semi-structured forms. Manual analysis of these documents is difficult, expensive, time consuming, and often prone to error. It therefore becomes imperative for enterprises to strike a balance between security and operational costs.

In order to keep track of sensitive information, enterprises need to extract and identify sensitive information present in unstructured data sources. Past research has looked into content fingerprinting, locality-sensitive hashes [1], regular expressions [5], and machine learning-based approaches [6], [7]. Some of these techniques either generate too many false positives (content fingerprinting and regular expressions), take too long to be practical due to pairwise comparison (locality-sensitive hashes), or require a large amount of labeled data for training machine learning models. To a certain extent, Named Entity Recognition (NER) using pre-trained models has been utilized for identifying named entities in unstructured text, for example, usernames, locations, quantities, and brands [7]. However, there is still a significant gap when it comes to sensitive data identification in unstructured text. The main reason for this is the lack of labeled datasets for training machine learning models, owing to the sensitive nature of the data.

In this work, we focus on the identification of financial as well as health-related sensitive data in unstructured text for a given context in a data-constrained environment. Context, in this regard, means the privacy regulation under which the text is analyzed. This is a typical setup in an enterprise, as the majority of the data dealt with is in unstructured format [8]. Even though we address the issue of data scarcity for sensitive information discovery in free-flowing text, our proposed approach is equally applicable to other text-based problems in the cybersecurity domain, for example, spear-phishing, insider threat, and email analytics. We investigate the


feasibility of pre-trained language models for identifying sensitive data in unstructured text. More specifically, we evaluate the accuracy of different pre-trained language models, without any domain- or language-dependent pre-training, for the extraction of sensitive information from unstructured English-language data. Further, we improve over the current pre-trained models and propose a new model architecture with improved performance. Our contributions are as follows:

1) We analyze the performance of different pre-trained models on the sensitive data discovery task in the financial as well as the health domain.
2) Experimental results show that pre-trained models in a data-constrained environment can lead to better results. Also, with better feature engineering in the financial and health domains, the performance of pre-trained models is improved.
3) We propose a new Char-BERT-CRF model to identify sensitive information, which captures morphological features along with word-level features. Experimental results show that our model achieves reasonable performance in a data-constrained environment.

The rest of the paper is organized as follows. Section II describes related work. Section III presents a detailed description of the approach followed in this work. Section IV describes the experiments and results. Section V contains the discussion. Finally, Section VI concludes the work.

II. RELATED WORK

Depending on the difficulty of the task and the desired outcome, multiple techniques have been used to find sensitive data in free-flowing text.

Pattern Matching. Some past work has used pattern matching for the extraction and identification of sensitive data in unstructured text. Guo et al. [5] proposed a system based on content-based (regular expressions) and context-based (deep learning; BiLSTM-CRF) detection of sensitive information. Similarly, Dias et al. [9] used rule-based techniques to extract entities such as email addresses, postal codes, and some dates in the Portuguese language. Pattern matching based systems tend to have high precision for certain kinds of data, but they need constant maintenance, which can be very time-consuming and expensive.

Machine Learning and Deep Learning. Silva et al. [7] used publicly available natural language processing (NLP) tools such as spaCy, NLTK, and Stanford CoreNLP to identify generic sensitive data such as names, places, and organizations. They experimented with the Groningen Meaning Bank dataset and manually collected and annotated US voter registration data. Their combined data consists of 170K lines, which was sufficient to train and test machine learning models. Hassan et al. [10] trained a word embedding model and analyzed the semantic relationships of sensitive entities. Their proposed method requires a large corpus to train the word embeddings and is limited by the choice of window size. Truong et al. [6] focused on the extraction of sensitive information from both structured and unstructured data in financial institutions. They used a CNN with a CRF for structured and unstructured data and evaluated their model with synthetic and real-world data. While most of the aforementioned work provides good accuracy in sensitive data discovery, it needs a decent amount of real-world training data to perform well, which is hard to obtain because of the sensitive nature of the data. Our work focuses on the use of pre-trained models and shows their effectiveness in a scenario where training data is scarce. Zhang et al. [11] improved the accuracy of clinical named entity recognition models by utilizing transfer learning. The authors of [12] made use of multi-task learning to obtain useful information from different datasets. By utilizing a global attention mechanism, Xu et al. [13] improved the performance of clinical named entity recognition models. All of these deep learning techniques, however, rely on token representations that are not contextualized. Moreover, they require a good amount of data to provide good accuracy in clinical named entity extraction tasks. In this work, we mainly focus on contextualized representations and pre-trained models in a data-constrained environment. Similar to our work, Zhou et al. [14] pre-trained two deep contextualized models, C-ELMO and C-Flair, on a corpus from PubMed Central, combined the pre-trained contextualized embeddings with static word embeddings on the MACCROBAT2018 dataset, and fed them into a BiLSTM-CRF model for the clinical named entity extraction task. However, their model fails to capture long-range dependencies within a document; in contrast, our work captures long-range dependencies better.

Few-Shot and Zero-Shot Learning. Few-shot and zero-shot NER techniques are useful in scenarios where training data is scarce. Researchers have used few-shot and zero-shot NER techniques [15] in environments where little or no training data is available. However, experimental results in these studies show that the accuracy of such systems can be very low and could lead to the loss of sensitive information. Therefore, we focus on pre-trained models and propose a hybrid deep learning-based system for sensitive data discovery.

III. APPROACH

Our objective is to extract sensitive information present in unstructured data. To achieve this goal, we used generic, vanilla versions of pre-trained language models and evaluated the performance of different models on the sensitive data discovery task.

A. Data Preprocessing

1) Tokenization: Sentences were split into discrete tokens, and all non-ASCII characters and whitespace at the beginning and end of each sentence were removed. We used WordPiece tokenization to further partition the tokens [16].

2) Character Tokenization: We also implemented character-level tokenization, to incorporate character-level features into our model and better capture the structure of each word. Each token is converted to a sequence of characters with a maximum sequence length of 'N' characters, which is a configurable


parameter equal to the length of the largest token sequence in the dataset. For the dataset used in this paper, the value of 'N' is 35.

TABLE I: Example for financial data generation

Original Sentence: Bank will notify Borrower when it debits Borrower account.
Template Generated: <ORG> will notify <PER> when it debits <PER> account.
Generated Sentence: Wells Fargo will notify Jake Milton when it debits Jake Milton account.

Original Sentence: Bank will disburse such Equipment Advance in by the internal transfer to Borrower deposit account with Bank.
Template Generated: <ORG> will disburse such Equipment Advance in by the internal transfer to <PER> deposit account with <ORG>.
Generated Sentence: BARCLAYS will disburse such Equipment Advance in by the internal transfer to Ray Adam deposit account with BARCLAYS.
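To make the preprocessing concrete, the following is a minimal sketch of the two tokenization steps, assuming the Hugging Face transformers WordPiece tokenizer for bert-base-uncased; the function name and the "<pad>" symbol are illustrative.

```python
from transformers import BertTokenizerFast

MAX_CHARS = 35  # 'N': length of the longest token in our dataset

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def preprocess(sentence: str):
    # Drop non-ASCII characters and leading/trailing whitespace.
    cleaned = sentence.encode("ascii", errors="ignore").decode().strip()
    # WordPiece sub-token sequence for the word-level model.
    tokens = tokenizer.tokenize(cleaned)
    # Character-level view of each token, padded/truncated to MAX_CHARS.
    chars = [list(tok)[:MAX_CHARS] + ["<pad>"] * max(0, MAX_CHARS - len(tok))
             for tok in tokens]
    return tokens, chars

tokens, chars = preprocess("Card 6542 1234 expires 09/25")
```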
layer, it uses a Character-CNN module to represent each
B. Model Enhancement

Named Entity Recognition is one technique that can be used to extract sensitive data from free-flowing text. We evaluated different pre-trained models on the sensitive data discovery task in the financial and health domains. However, some entities in the dataset, such as credit card numbers, account numbers, ages, and dates, are numeric. Pre-trained models trained on general corpora fail to capture the features of such numeric entities.

To overcome this issue, we improved the feature engineering by incorporating character-level features along with word-level features. Character-level features are generated by processing text at the character level, which leads to a better representation of the text; this is particularly beneficial when dealing with numeric entities. Hence, we developed a model that uses the embedding layer architecture of CharacterBERT [17] for character-level embeddings and BERT [18] for word-level embeddings. We extract the sensitive entities by applying a Conditional Random Field [19] as a decoder on the concatenated embeddings. Figure 1 shows the architecture of our model.

Fig. 1: Char-BERT-CRF architecture (figure omitted). In CharacterBERT, the inputs are the sequence of characters in a word; in BERT, the inputs are the sequence of words in a sentence.

We used the sequence labeling strategy commonly referred to as 'BIO' labeling: the 'B' prefix indicates the beginning of a tag, the 'I' prefix indicates that the word is inside a chunk, and the 'O' tag indicates that the word does not belong to any category. Table II lists the different sensitive entities used; an example of the tagging scheme is shown below.
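For illustration, a hypothetical sentence tagged under this scheme (tag names are shortened for readability; the actual tag set follows Table II) looks like:

```python
# Hypothetical BIO-tagged example; the tag abbreviations are illustrative.
tokens = ["Wells", "Fargo", "issued", "card", "4556",  "7375",  "to", "Jake",  "Milton"]
tags   = ["B-ORG", "I-ORG", "O",      "O",    "B-CCN", "I-CCN", "O",  "B-PER", "I-PER"]
```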
We initialize our BERT model using pre-trained weights
CharacterBERT. While CharacterBERT is otherwise identical to vanilla BERT, it builds its initial context-independent representations differently. At its embedding layer, it uses a Character-CNN module to represent each word with the help of the characters that constitute it. Each token is converted to a sequence of characters of length 'N' and then fed to multiple 1-d CNNs with different filters. The output of each CNN is max-pooled, and the results are concatenated to produce the final representation. Following [17], each representation is then sent through Highway layers that control the flow of information with the help of gating units. We followed the same architecture for generating character-level features.

BERT. For obtaining word-level contextual representations, we adopt the BERT model architecture, more specifically the BERT-base architecture (L = 12, H = 768, A = 12). The character- and word-level representations are then concatenated to obtain the final representations.

Conditional Random Fields (CRF). The final output representations are fed to a CRF, which acts as a decoder and classifies each token into one of the categories. CRFs are excellent at capturing dependencies between adjacent labels, which is crucial in sequence labeling tasks. A condensed sketch of the resulting model follows.
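The sketch below is an illustrative PyTorch rendition of this hybrid architecture, assuming the pytorch-crf package for the CRF layer. For brevity it uses a single character CNN, whereas the actual model uses several filter widths plus Highway layers as in CharacterBERT; all dimensions here are illustrative.

```python
import torch
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # pytorch-crf package

class CharBertCrf(nn.Module):
    """Illustrative Char-BERT-CRF: char-CNN + BERT embeddings -> CRF decoder."""

    def __init__(self, num_tags: int, char_vocab: int = 128,
                 char_dim: int = 16, char_out: int = 64):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")  # word-level encoder
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        # Single 1-d CNN over characters for brevity; the actual model uses
        # multiple filter widths and Highway layers, as in CharacterBERT.
        self.char_cnn = nn.Conv1d(char_dim, char_out, kernel_size=3, padding=1)
        self.dropout = nn.Dropout(0.5)
        self.proj = nn.Linear(self.bert.config.hidden_size + char_out, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, char_ids, tags=None):
        b, s, c = char_ids.shape                      # (batch, tokens, chars)
        ch = self.char_emb(char_ids.view(b * s, c)).transpose(1, 2)
        ch = torch.relu(self.char_cnn(ch)).max(dim=-1).values.view(b, s, -1)  # max-pool
        words = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.proj(self.dropout(torch.cat([words, ch], dim=-1)))
        mask = attention_mask.bool()
        if tags is not None:                          # training: CRF negative log-likelihood
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # inference: best tag sequence
```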
C. Training and Evaluation

For character-level embeddings, we employed only the embedding layer architecture of CharacterBERT [17]. Each token is divided into a sequence of characters with a maximum length of 'N' and fed into multiple 1-d CNNs. We used ReLU as the activation function and (width, count) filter configurations of (1, 32), (2, 32), (3, 64), (4, 128), (5, 256), (6, 512), and (7, 1024).

We initialize our BERT model with pre-trained weights from Hugging Face [20]. We feed the final concatenated representation to the CRF decoder, which outputs each entity type, including 'O' for non-sensitive tokens. We used a dropout probability of 0.5 and the Adam optimizer with a learning rate of 5e-5. We use linear learning rate warm-up for the first 10% of iterations, with a maximum sequence length of 258 and a batch size of 16. We aim to evaluate different pre-trained and deep learning models with respect to accuracy on the financial and health datasets. The corresponding setup is sketched below.
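A training loop matching these settings might look as follows, reusing the CharBertCrf sketch above. The dataset object, tag count, and epoch count are assumptions, not taken from the paper.

```python
import torch
from transformers import get_linear_schedule_with_warmup

BATCH_SIZE, MAX_SEQ_LEN, NUM_EPOCHS = 16, 258, 5  # epoch count is illustrative

# train_dataset: an assumed torch Dataset yielding dicts with keys
# "input_ids", "attention_mask", "char_ids", and "tags".
train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=BATCH_SIZE, shuffle=True)

model = CharBertCrf(num_tags=23)  # assumed: 11 entity types in B-/I- form, plus 'O'
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
total_steps = len(train_loader) * NUM_EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # linear warm-up over first 10% of steps
    num_training_steps=total_steps,
)

for _ in range(NUM_EPOCHS):
    for batch in train_loader:
        loss = model(batch["input_ids"], batch["attention_mask"],
                     batch["char_ids"], tags=batch["tags"])
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```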


IV. EXPERIMENTS

A. Baseline Models

Considering sensitive data discovery as a sequence labeling problem, we compare and evaluate different pre-trained and deep learning models in terms of precision, recall, and F1-score. We implemented (1) BiLSTM-CRF [21] and (2) Char-BiLSTM-CRF [22], as well as different hybrid combinations of pre-trained models: (3) RoBERTa-CRF [21], (4) ALBERT-CRF [21], (5) DistilBERT-CRF [21], and (6) BERT-CRF [21]. The pre-trained models are trained on general corpora extracted from English Wikipedia and BookCorpus [23]. These models are included because they are the most widely used in sequence labeling tasks. We also compared our model with different baseline models [12], [14], [22]; in a data-constrained environment, our model gives better results than these baselines.

TABLE II: Types of sensitive entities in the financial dataset

Sensitive Information Type     Sensitive Entity
Personal Sensitive Entities    Location; Person Name; Social Security Number; Email ID
Financial Sensitive Entities   Bank Name; Credit Card Number; CVV Number; Expiry Date; Permanent Identification Number; Bank Account Number; International Bank Account Number
B. Datasets

1) Synthetic Dataset: For any supervised learning solution, especially a deep learning based one, data is of the utmost importance, as it determines the final quality of the model. This is especially true in scenarios where contextual information matters: the quality of the model changes with the quality of the corpus used. Since there is no publicly available annotated dataset containing sensitive entities for training models, we used synthetic data. Table II lists the financial entities considered in this paper. We included these entities because sensitive entities in the financial domain mostly comprise cardholder data such as names, account numbers, and expiry dates.

For the generation of some entities, such as Name, Organization, and Location, we followed a template-based approach¹. We collected sentences containing the above-mentioned entities and converted them to templates based on their labels. For example, if the original sentence contains an ORGANIZATION (ORG) entity, a generic template sentence is produced by replacing that entity with the placeholder '<ORG>'. In the generated template sentence, each placeholder ('<entity-type>') is then replaced with fake PII of the same entity type, as sketched below. Table I shows an example of a template generated from a public dataset. The sentences were collected from publicly available annotated datasets [24], [25].

¹ https://microsoft.github.io/presidio/
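The placeholder-filling step can be sketched as follows. The Faker library stands in here for the Presidio-style fake-PII providers referenced above; the mapping and function name are illustrative.

```python
from faker import Faker

fake = Faker()

# Hypothetical placeholder-to-generator mapping.
FILLERS = {"<ORG>": fake.company, "<PER>": fake.name}

def fill_template(template: str) -> str:
    sentence = template
    for placeholder, generator in FILLERS.items():
        # Reuse one fake value for repeated placeholders, as in Table I.
        sentence = sentence.replace(placeholder, generator())
    return sentence

print(fill_template("<ORG> will notify <PER> when it debits <PER> account."))
```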
For the generation of other entities, such as credit card numbers and account numbers, we collected random samples of sentences that maintain the contextual information. Here, contextual information helps differentiate between similar-looking values; for example, a 13-digit number can be a 'credit card number' or an 'account number'. To increase the number of sentences while maintaining the contextual information, we used the Pegasus model [26]. The model uses an encoder-decoder architecture for sequence-to-sequence learning tasks such as summarization and paraphrase generation. In our work, it helped increase the number of sentences by generating similar sentences around a particular entity while maintaining the contextual information; a sketch of this augmentation step follows. The numbers of sentences, tokens, and entities in the (training, testing) sets are (10349, 4758), (204291, 54213), and (11, 11), respectively. Figure 2 shows the number of entities of each type in each split.

Fig. 2: Sensitive entity distribution in the financial dataset (figure omitted; it plots the number of training and test instances for each of the 11 entity types).
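A minimal sketch of Pegasus-based paraphrase augmentation is shown below, using the Hugging Face transformers API. The checkpoint name is an assumption; any Pegasus model fine-tuned for paraphrasing would serve the same purpose.

```python
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

NAME = "tuner007/pegasus_paraphrase"  # assumed community paraphrasing checkpoint
tokenizer = PegasusTokenizer.from_pretrained(NAME)
model = PegasusForConditionalGeneration.from_pretrained(NAME)

def augment(sentence: str, n: int = 5):
    batch = tokenizer([sentence], truncation=True, padding="longest",
                      return_tensors="pt")
    # Beam search with several returned sequences yields paraphrases that
    # keep the context around the sensitive entity.
    outputs = model.generate(**batch, max_length=60, num_beams=max(n, 5),
                             num_return_sequences=n)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(augment("The account number 4581234567890 was debited yesterday."))
```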
2) Real-World Dataset: A credit card and bank complaint dataset² is used for evaluating the performance of the different models. The sensitive entities in the dataset are redacted with 'XXXXX' to preserve customer privacy. We manually inserted random sensitive entities in place of the redacted terms and included only those sentences that contained the financial entities used in our work, not the entire dataset. The final dataset contains about 513 sentences, with 8 unique entities overall.

² https://data.world/dataquest/bank-and-credit-card-complaints

3) MACCROBAT2018: Additionally, we used the publicly available MACCROBAT2018 dataset [14]. Its authors found content relating to medical concepts within 3100 clinical text documents, including clinical case reports and free-text components of electronic health records. In total, they annotated 200 case reports containing 3652 sentences, comprising 36 distinct entity and event types. We randomly set aside 10% of the case reports for development and 10% for testing.

C. Results

We evaluated the effectiveness of the different pre-trained and baseline models with respect to accuracy on the sensitive data discovery task; a sketch of how the reported micro- and macro-averaged scores can be computed follows at the end of this subsection. Table III shows the performance of models trained and tested on the financial data.

Figure 3a shows the performance of the different models on the health dataset. The models show lower accuracy because of the poor representation of some entities in the dataset.

Figure 5 displays the changes in the performance metrics with respect to changes in the amount of training data.
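The entity-level micro- and macro-averaged scores reported here can be computed with a standard sequence labeling toolkit. A sketch assuming the seqeval package, with toy gold and predicted BIO sequences:

```python
from seqeval.metrics import f1_score, precision_score, recall_score

# Gold and predicted BIO tag sequences, one inner list per sentence
# (toy example; real runs use the full test split).
y_true = [["B-ORG", "I-ORG", "O", "B-PER", "I-PER", "O"]]
y_pred = [["B-ORG", "I-ORG", "O", "B-PER", "O",     "O"]]

for avg in ("micro", "macro"):
    print(avg,
          precision_score(y_true, y_pred, average=avg),
          recall_score(y_true, y_pred, average=avg),
          f1_score(y_true, y_pred, average=avg))
```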


TABLE III: Precision, recall, and F1-score on the financial dataset (each cell reports P, R, F1)

Model              Micro-averaged      Macro-averaged
BiLSTM             0.88, 0.82, 0.85    0.88, 0.80, 0.84
BiLSTM-CRF         0.90, 0.84, 0.87    0.90, 0.81, 0.85
ALBERT-CRF         0.95, 0.88, 0.91    0.94, 0.87, 0.90
DistilBERT-CRF     0.92, 0.84, 0.88    0.92, 0.83, 0.87
BERT-CRF           0.93, 0.88, 0.90    0.92, 0.85, 0.89
Char-BiLSTM-CRF    0.92, 0.85, 0.88    0.90, 0.85, 0.87
Char-BERT-CRF      0.93, 0.91, 0.92    0.91, 0.90, 0.91
0.36
Fig. 3: (a) Accuracy of different pre-trained models on the health dataset; (b) performance of the hybrid and vanilla models on the health dataset (figure omitted; panel (a) compares precision, recall, and F1-score across the pre-trained models).

Fig. 4: Change in F1-score of the hybrid and vanilla models with respect to the percentage of health training data used (figure omitted).

We can observe that the accuracy of the models increases with the amount of training data. Comparatively, the pre-trained models are consistently more accurate than the BiLSTM models. Figure 4 shows the change in performance of the different models with respect to the amount of health training data. Figure 3b displays the F1-scores of the different hybrid and vanilla models. In the hybrid models, we use pre-trained models with a CRF as the decoder, whereas the vanilla models follow the original design of the pre-trained models. The hybrid models show higher accuracy than the vanilla models on the MACCROBAT2018 dataset. On the same dataset, the proposed solution of Ma et al. [22] has an accuracy of 0.61, that of Wang et al. [12] achieves 0.64, and the model of Zhou et al. [14] achieves 0.65. In comparison, our proposed model performs better, with an accuracy of 0.67 on the MACCROBAT2018 dataset.
To check the effectiveness of pre-trained models in a data-
TABLE IV: Evaluation results on the real-world financial dataset

Model              F1-score
BiLSTM             0.54
BiLSTM-CRF         0.55
ALBERT-CRF         0.60
DistilBERT-CRF     0.62
BERT-CRF           0.64
Char-BiLSTM-CRF    0.58
Char-BERT-CRF      0.68
We annotated a small portion of real-world data and evaluated the performance of the different models on real-world financial data. From Table IV, we can observe that the performance of all models decreases when tested on real-world data. This can be attributed to the annotation strategy used for the real-world data, wherein only the redacted terms are marked as sensitive entities, ignoring the other entities in the sentence.

V. DISCUSSION

In this work, we trained and tested several models for extracting sensitive information from free-flowing text in the financial and health domains. Comparing common metrics such as precision, recall, and F1-score on the financial dataset, the pre-trained models showed better accuracy than the BiLSTM-based models, as can be observed from Table III. Figure 3a shows the performance of the different pre-trained models on the health dataset. The accuracy of the pre-trained models is lower than usual when trained and tested on the health dataset. The reason is that the MACCROBAT2018 dataset lacks uniformity: some entities are poorly represented, and the context around some entities is poorly expressed, leading to lower accuracy for certain entities. Looking at the performance metrics in Table III, employing character-level representations in the BERT-CRF and BiLSTM-CRF models improved their accuracy compared with their counterparts. Also, from Figure 3a, the performance of the BERT-CRF model increased when we added the character-level representation. The reason is that pre-trained models are trained on generic corpora and therefore fail to capture the features of numeric entities such as credit card numbers, account numbers, ages, and dates.

To check the effectiveness of pre-trained models in a data-constrained environment, we trained them on different subsets of the training data and tested them on the remaining data; a sketch of this protocol follows. Figure 5 compares the performance of the pre-trained and BiLSTM-based models with respect to the percentage of financial training data used. It can be observed that the performance of the models increases with dataset size. The pre-trained models suffer less and are more robust in a data-constrained environment than the BiLSTM-based models. Figure 4 shows the F1-scores of the different hybrid and vanilla models with varying amounts of health data. As expected, the F1-scores increased with training data size. All in all, these results indicate that pre-trained models reduce the need for manually labeled data, which is a major problem in the cybersecurity domain.
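The data-constrained evaluation protocol can be sketched as below; train_model and evaluate_f1 are assumed stand-ins for the training and scoring routines described earlier.

```python
import random

random.seed(42)

def data_constrained_curve(train_sents, test_sents,
                           fractions=(0.1, 0.25, 0.5, 0.75, 1.0)):
    """Train on a growing fraction of the data; test on a fixed held-out set."""
    scores = {}
    for frac in fractions:
        subset = random.sample(train_sents, int(frac * len(train_sents)))
        model = train_model(subset)                    # assumed: trains Char-BERT-CRF
        scores[frac] = evaluate_f1(model, test_sents)  # assumed: micro-averaged F1
    return scores
```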


Fig. 5: Change in (a) precision, (b) recall, and (c) F1-score with respect to the percentage of financial training data used, for the pre-trained models versus the BiLSTM models (figure omitted).

Figure 3b shows the performance of the hybrid pre-trained models and the vanilla pre-trained models on the health dataset. Overall, the hybrid models achieve higher F1-scores than their vanilla counterparts. This is because, when computing the probability distribution over outputs, the vanilla models consider only the current word in the input sequence, whereas the hybrid models compute outputs by considering not only the current word but also the rest of the sequence. In other words, the hybrid architecture improves the models' ability to learn and generalize on downstream tasks.
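Concretely, a linear-chain CRF decoder scores the whole tag sequence jointly rather than each token independently. The standard formulation from [19], written out here for clarity (the paper itself does not spell it out), is:

```latex
% Probability of a tag sequence y for input x, given per-token emission
% scores e_t(y_t) from the encoder and learned transition scores T_{y_{t-1}, y_t}.
P(y \mid x) = \frac{\exp\Big(\sum_{t=1}^{n} e_t(y_t) + \sum_{t=2}^{n} T_{y_{t-1},\,y_t}\Big)}
                   {\sum_{y'} \exp\Big(\sum_{t=1}^{n} e_t(y'_t) + \sum_{t=2}^{n} T_{y'_{t-1},\,y'_t}\Big)}
```

A per-token softmax corresponds to dropping the transition term, which is exactly the adjacent-label information the vanilla models miss.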
Future Work. Several avenues exist for future research. In the domain of sensitive data discovery, no real-world annotated dataset is publicly available, yet any deep learning or machine learning solution needs a good-quality dataset to work well. Further research may include the use of generative models, such as GANs, for generating realistic, good-quality synthetic datasets. In this study, we focused on general-domain pre-trained models; domain-specific models were not explored. Future work could examine other models and their performance in limited-training-data scenarios. A large proportion of sensitive entities contain numeric data, and pre-trained models trained on general corpora tend to struggle with numeric data due to the poor representation of large numbers. Further work is required to improve the representation of numeric data within pre-trained models.

VI. CONCLUSION

In this paper, we address the issue of extracting sensitive information from free-flowing text. We experimented with different pre-trained models in a data-constrained environment, where these models provided good accuracy. We proposed the Char-BERT-CRF model, which combines character-level and word-level contextual representations and feeds them into a CRF decoder for better extraction of numeric sensitive entities. Our study also shows that it is possible to build a sensitive data discovery solution by employing different pre-trained models in a data-constrained environment.

REFERENCES

[1] Cheng et al., "Enterprise data breach: causes, challenges, prevention, and future directions," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 7, no. 5, p. e1211, 2017.
[2] Zou et al., "'I've got nothing to lose': Consumers' risk perceptions and protective actions after the Equifax data breach," SOUPS, pp. 197–216, 2018.
[3] N. Neto, S. Madnick, A. Moraes G. de Paula, and N. Malara Borges, "A case study of the Capital One data breach," 2020.
[4] HIPAA Journal. (January 2021) Excellus Health Plan settles HIPAA violation case and pays $5.1 million penalty. [Online]. Available: https://www.hipaajournal.com/excellus-health-plan-settles-hipaa-violation-case-and-pays-5-1-million-penalty/
[5] Y. Guo et al., "ExSense: Extract sensitive information from unstructured data," Computers & Security, vol. 102, p. 102156, 2021.
[6] A. Truong et al., "Sensitive data detection with high-throughput neural network models for financial institutions," arXiv preprint arXiv:2012.09597, 2020.
[7] P. Silva et al., "Using NLP and machine learning to detect data privacy violations," IEEE INFOCOM Workshops, pp. 972–977, 2020.
[8] Allahyari et al., "A brief survey of text mining: Classification, clustering and extraction techniques," arXiv preprint arXiv:1707.02919, 2017.
[9] M. Dias et al., "Named entity recognition for sensitive data discovery in Portuguese," Applied Sciences, vol. 10, no. 7, p. 2303, 2020.
[10] Hassan et al., "Automatic anonymization of textual documents: detecting sensitive information via word embeddings," TrustCom, pp. 358–365, 2019.
[11] Zhang et al., "Improving clinical named-entity recognition with transfer learning," Stud Health Technol Inform, vol. 252, pp. 182–187, 2018.
[12] X. Wang et al., "Cross-type biomedical named entity recognition with deep multi-task learning," Bioinformatics, vol. 35, no. 10, pp. 1745–1752, 2019.
[13] G. Xu et al., "Improving clinical named entity recognition with global neural attention," in Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data. Springer, 2018, pp. 264–279.
[14] Y. Zhou et al., "Clinical named entity recognition using contextualized token representations," arXiv preprint arXiv:2106.12608, 2021.
[15] Y. Wang et al., "Learning from language description: Low-shot named entity recognition via decomposed framework," arXiv preprint arXiv:2109.05357, 2021.
[16] Y. Wu et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv preprint arXiv:1609.08144, 2016.
[17] H. E. Boukkouri et al., "CharacterBERT: Reconciling ELMo and BERT for word-level open-vocabulary representations from characters," arXiv preprint arXiv:2010.10392, 2020.
[18] J. Devlin et al., "BERT: Pre-training of deep bidirectional transformers for language understanding," Association for Computational Linguistics, 2019, pp. 4171–4186.
[19] J. Lafferty et al., "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," 2001.
[20] T. Wolf et al., "HuggingFace's Transformers: State-of-the-art natural language processing," arXiv preprint arXiv:1910.03771, 2019.
[21] J. Li et al., "A survey on deep learning for named entity recognition," IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 1, pp. 50–70, 2020.
[22] X. Ma and E. Hovy, "End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF," pp. 1064–1074, 2016.
[23] Y. Zhu et al., "Aligning books and movies: Towards story-like visual explanations by watching movies and reading books," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 19–27.
[24] Z. Wang et al., "CrossWeigh: Training named entity tagger from imperfect annotations," pp. 5154–5163, 2019.
[25] J. Alvarado et al., "Domain adaption of named entity recognition to support credit risk assessment," pp. 84–90, 2015.
[26] J. Zhang et al., "PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization," pp. 11328–11339, 2020.

