Deep Learning Based Sensitive Data Detection
Authorized licensed use limited to: Newcastle University. Downloaded on May 23,2024 at 15:03:40 UTC from IEEE Xplore. Restrictions apply.
2. Related Works

2.1. Sensitive Information and Privacy Detection

Sensitive personal information usually refers to personal information such as biometrics, religious beliefs, specific identity, medical and health information, financial accounts, location tracks, and other information that, if leaked or illegally used, can easily result in a violation of personal dignity or harm to personal and property safety [7].

Using machine learning models (e.g., deep learning models) to detect and protect private data is a novel and challenging research area. The machine learning approach focuses on building comprehensive and complex NER systems to detect sensitive information. Silva et al. trained and evaluated models on data sets containing personally identifiable information using three well-known natural language processing tools (NLTK, Stanford, and CoreNLP) [8]. Adeyemo et al. developed an intrusion detection system with an LSTM model that achieves a detection accuracy of 80% on the two-class attack dataset [9]. Compared with traditional detection models, deep learning can automatically discover potential rules. It can also achieve high accuracy and good generalization ability. Deep learning-based detection therefore shows great promise for privacy protection.

2.2. Deep learning based named entity recognition

Extracting specific information is one of the most important applications of deep learning. In recent years, deep learning-based NER models have become the mainstream and have produced state-of-the-art results. In contrast to feature-based approaches, deep learning can automatically discover the latent representations and features that are required for classification or detection [10].

The key to deep learning is to train word vectors using various neural network structures. Collobert et al. employed a CNN (convolutional neural network) to produce local features around each word and fed them into the label decoder to compute the distribution score of potential labels, after each word in the input sequence was embedded into N × N dimension vectors [11]. Using its gate mechanism, the LSTM model can mitigate the vanishing-gradient problem of plain RNNs and perform well on long sequences. Chalapathy et al. achieved an F1-score of 85.19 (under an unofficial evaluation) on the MedLine test data by adding a CRF layer on top of an LSTM model [12].

The Transformer has been a huge success in many areas of artificial intelligence, such as natural language processing, computer vision and audio processing. In comparison to recurrent neural networks such as RNN and LSTM, the Transformer's self-attention mechanism can parallelize computation, which RNN and LSTM models cannot do. The Transformer can also yield more interpretable models because each attention head may learn to perform a different task. BERT, which stacks many Transformer encoder blocks, achieved the best results on 11 NLP tasks at the time [13].

3. Deep Learning Based Comprehensive Sensitive Data Detection Framework

This section describes how we build a comprehensive sensitive data detection system. Section 3.1 covers the definition and classification of sensitive information and private data. Section 3.2 introduces the overall comprehensive sensitive data detection framework. Section 3.3 describes how we detect structured sensitive data using regular expressions. Finally, Section 3.4 describes how we detect unstructured sensitive data using machine learning.

3.1. Sensitive Information and Private Data

Sensitive information is information that individuals or institutions do not want to be known to the outside world. In specific applications, sensitive information is mainly related to personally identifiable information, patient illness records, company financial information, etc. Desensitization mainly refers to reliably protecting sensitive data by transforming the sensitive information it contains. In order to accurately evaluate the effect of desensitization, this study further subdivides private data into structured privacy data and unstructured privacy data.

In data anonymization, the definition of personal data is not yet clear-cut, but in general the following personal data is considered 'sensitive', depending on the scenario: personal data, health records, ethnicity, religious/political/sexual opinions, etc.

3.1.1 Definition of Sensitive Information

Definition 1. Static sensitive information: sensitive information ($R_{u_i} = r_1^{u_i}, r_2^{u_i}, \dots, r_R^{u_i}$) that has a fixed structure for a particular user $u_i$. Each rule $r_k^{u_i}$ is a privacy rule with a specific structure associated with the user.

$$\otimes \leftarrow f_1 \wedge f_2 \wedge \dots \wedge f_L \qquad (1)$$

in which $\otimes$ denotes the target sensitive rule $r_i$. The right side of (1) describes a regular logical operation. Each $f_k$ is a logical expression over an instance property. $L$ indicates the length of the rule. The
expression of the equation is

$$\text{IF } f_1 \,\&\, f_2 \,\&\, \dots \,\&\, f_L, \ \text{THEN Class} = \otimes \qquad (2)$$
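Rule (2) can be read as a conjunction of boolean predicates over one instance. The following is a minimal sketch under stated assumptions, not the authors' implementation: the helper `classify`, the three predicates, the list `email_rule` and the label `EMAIL` are hypothetical names chosen for illustration, and the predicates are deliberately simplistic.

```python
import re
from typing import Callable, List, Optional

# A predicate f_k: a logical expression over one property of an instance.
Predicate = Callable[[str], bool]


def classify(instance: str, predicates: List[Predicate], label: str) -> Optional[str]:
    """IF f_1 & f_2 & ... & f_L THEN Class = label, otherwise None."""
    return label if all(f(instance) for f in predicates) else None


def has_single_at(s: str) -> bool:
    # f_1: the string contains exactly one "@"
    return s.count("@") == 1


def local_part_ok(s: str) -> bool:
    # f_2: the part before "@" is non-empty and uses only allowed characters
    return re.fullmatch(r"[A-Za-z0-9_.-]+", s.split("@")[0]) is not None


def domain_has_dot(s: str) -> bool:
    # f_3: the part after "@" contains a dot
    return "." in s.split("@")[-1]


# Hypothetical rule of length L = 3; "EMAIL" plays the role of the class.
email_rule: List[Predicate] = [has_single_at, local_part_ok, domain_has_dot]

print(classify("alice@example.com", email_rule, "EMAIL"))  # EMAIL
print(classify("not-an-email", email_rule, "EMAIL"))       # None
```

Because `all` stops at the first predicate that fails, a rule of length L costs at most L predicate evaluations per instance.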
There are several methods to detect $R_u$ from user data. For simpler rules, one can use regular expressions to extract the rules directly. For data with complex features, machine learning based methods can be used to automatically detect and identify the features.

3.1.2 Classification of Sensitive Information

As mentioned above, the data addressed by rule-based privacy detection can in general be divided into two groups: text privacy data and visual privacy data. Text privacy data can be further categorized into structured data and unstructured data.

Definition 2. Structured sensitive data: the rules of this sensitive data are regular and structured, and can be expressed easily by regular expressions.

Definition 3. Unstructured sensitive data: the rules of this sensitive data are irregular and hard to express directly. For the detection of such data, this paper adopts machine learning to mine the hidden rules or patterns.

According to the above definitions, this research lists 11 kinds of sensitive information with private rules in daily life. The sensitive information is shown in Table 1.

Table 1 Typical Sensitive Data Categories
Category            Sensitive data types
Structured Data     Email address; Phone Number; Passport; IP Address
Unstructured Data   Person; Location; Organization; Disease; Occupation

3.2. Comprehensive Sensitive Data Detection Framework

Automatic sensitive data detection and classification play a key role in data anonymization. In this section, an automatic sensitive data detection framework is developed, which is able to detect sensitive features in data consisting of structured and unstructured data. The framework also classifies the sensitive data according to its subject. The key procedure of the framework is shown in Fig.1.

Fig.1 Procedure of Detection Framework

Data Source. As defined in Section 3.1.2, unstructured data typically has flexible formats and is embedded in different sources, such as texts, documents, weblogs, images, etc. In this work, we focus on unstructured sensitive data in text, such as names, addresses and so on.

Sensitive Data Detection. In this research, a comprehensive detection module is proposed to deal with sensitive data. In this module, we first detect the structured sensitive data with the regular expressions described in Section 3.3. Then, for the unstructured sensitive data, we adopt machine learning technology. Machine learning technology that can automatically classify sensitive data/files can significantly reduce the risk of exposing sensitive data; alternatively, based on the detection result, we can alert the publisher to potentially sensitive data.

Text processing engine. Once sensitive data has been detected and identified, it is automatically tagged with sensitive tags, such as PII, commercially sensitive, etc. These tags can usually be created by specific users based on their scenarios. We also create a Sensitive Data Dictionary to store these sensitive tags and data. Using the Sensitive Data Dictionary, we can search and encrypt sensitive data more conveniently.

Anonymization and processing. The identified sensitive data is anonymized using specific algorithms. In this research, we simply encrypt the sensitive data according to its class. If necessary, stronger techniques such as k-anonymity and l-diversity can be adopted to protect the identified sensitive data.

Evaluation of the effects of detection. We analyze the detection results from three perspectives. For structured data, we analyze the advantages and disadvantages of regular expressions. For unstructured data, we analyze our detection model's parameter count and F1 score. The experiment indicates that accuracy comes at the cost of computation.

3.3. Sensitive Information Detection from Structured Data

It is very important to detect and identify sensitive data before protecting it. The key to detecting structured sensitive data is to identify its structural patterns. In this
subsection, this paper introduces the features of structured sensitive data in daily life and studies the corresponding regular expressions to detect them.

An email address is an important piece of personal data. A leaked email address may become the target of scammers and spammers or, even worse, endanger personal property and safety. To protect an email address, it needs to be identified and anonymized in some public scenarios. The regular expression ^[a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$ can detect email addresses.

The telephone number is among the most commonly used data in daily life. Once a phone number is leaked, its owner may face countless harassment messages, which cause great trouble. We use the regular expression 1(?:[358][0-9]|4[579]|66|7[0135678]|9[89])[0-9]{8}, which matches 11-digit mobile numbers beginning with 1, to detect phone numbers.

A passport is one of the necessary documents for going abroad. It is a legal document that proves the nationality and identity of a citizen. In China, passports are divided into diplomatic passports, official passports, ordinary passports and special zone passports. The regular expression ^(?:E\d{8}|E[A-Za-z]\d{7}|G\d{8}|H\d{8}|HJ\d{7}|K\d{8}|KJ\d{7}|MB\d{7})$ can detect passport numbers.

An IP address is assigned by an Internet service provider and is the network address of a device on the Internet. Keeping IP addresses private is important for safety, because attackers can conduct cyber attacks once they find the IP address of a specific target. In this study, we only detect common IPv4 addresses. IPv4 addresses are made up of four groups of numbers, each ranging from 0 to 255. We use the regular expression ((?:[01]?\d{1,2}|2(?:[0-4][0-9]|5[0-5]))(?:\.(?:[01]?\d{1,2}|2(?:[0-4][0-9]|5[0-5]))){3}) to detect IP addresses.

It is not very challenging to detect and identify structured sensitive data, since fixed patterns can be used. However, most sensitive data is embedded in unstructured data, e.g., documents, images, audio, etc., which requires more sophisticated techniques, such as machine learning, to analyze.

3.4. Unstructured Sensitive Data Detection using Machine Learning

Unstructured sensitive data usually has flexible formats, which makes it hard to detect with regular expressions. This section focuses on how to detect unstructured sensitive data with machine learning methods. In practice, an appropriate model should be selected for the comprehensive sensitive data detection framework.

3.4.1 Dataset

This paper creates a specific privacy dataset for training and testing because there are few personal privacy datasets on the Internet. The private data we needed was scattered across other datasets. Therefore, this paper extracts the sensitive data from other datasets to construct the final privacy dataset of sensitive information. From the well-known CoNLL-2003 dataset [14], we extracted 6000 pieces of data including PERSON, LOCATION and ORGANIZATION entities. From the public NCBI-Disease dataset [16], we extracted 2000 pieces of data containing disease information. Datasets about Occupation entities are scarce on the Internet, so we extracted 2000 occupation sentences from the Chinese CLUE dataset [17], translated them into English and merged them into our final sensitive dataset. Table 2 shows the contents of the final sensitive dataset.

Table 2 Contents of Sensitive DataSets
DataSet               Description
CoNLL-2003 DataSet    Sentences that contain name, location, and organization
CLUE DataSet          Sentences that contain occupation tags
NCBI DataSet          Sentences that contain common disease tags

3.4.2 BERT

The sequential computation of LSTMs greatly increases the cost of computing, because each LSTM step can only be computed after the previous time step has completed. In contrast, the BERT network uses the attention mechanism [15] instead of RNNs to make the computation parallel, thus greatly reducing the computation cost. In addition, BERT is an excellent transfer learning model. Through pre-training on a large unlabeled corpus, BERT learns deep language feature representations of contextual information. Pre-training and fine-tuning allowed BERT to achieve the best results on 11 NLP tasks [13].

In our research, we use the BERT network to detect private data. The architecture of BERT uses a series of Transformer blocks, each containing a self-attention mechanism, stacked on top of each other. Each Transformer block takes word embeddings as input, which are constructed by encoding the word vectors. Considering the data scale of our current study, we adopted the BERT-Base configuration. In our BERT model, we adopt 12 stacked Transformer blocks, each with 768 hidden units and 12 attention heads. The input of our model
takes at most 512 word vectors at a time.

After the Transformer blocks, BERT is fine-tuned for a specific task. In our research, this task is named entity recognition. We therefore add a linear layer with a SoftMax classifier that scores every token to indicate the most probable entity. After the linear layer, we also add a CRF layer to learn the label transition rules and improve the result. The whole structure of our model is shown in Fig.2.

Fig.2 The architecture of the BERT model

4. Experimental Validation

4.1. Analysis of experimental results

Our sensitive data detection system can automatically detect sensitive information and encrypt it according to its type. For structured data detection, our system can recognize the 4 classes described in Table 1.

Besides structured data, our system can also detect and encrypt unstructured data, which usually has flexible formats, such as names, locations, addresses, etc. For unstructured data detection, our system can recognize the 5 classes described in Table 1. Fig.3 shows our system's detection results for the sensitive data.

4.2 Experiments Results Analysis

4.2.1 Machine Learning results Analysis

The F1 score is a comprehensive measure of precision and recall, and summarizes the performance of the trained model. It is the harmonic mean of the model's precision and recall. The formula is as follows:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (3)$$

Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances, while recall (also known as sensitivity) is the fraction of relevant instances that were retrieved. There are 1272 sensitive data items in our test dataset. The specific values are shown in Table 3.

The number of model parameters (params) is commonly used to compare the computational complexity of models, which allows a comparison of computation independent of hardware. The larger the parameter count, the more complex the model. In our BERT model, every input character is represented as a 512-shape vector, and there are 768 hidden units for computing the weights.

Table 3 shows the exact parameter count and a comprehensive comparison between the three models.

Table 3 Performances Summary
Model                        Precision  Recall  F1 score  Params
BERT                         0.897      0.896   0.896     110M
Regular Expressions          1.000      -       -         -
BERT + Regular Expressions   0.925      0.896   0.896     110M

Table 3 shows that combining BERT with regular expressions to detect sensitive data yields good results.
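Eq. (3) can be checked directly against the precision and recall values reported in Table 3. The snippet below is a minimal sketch; `f1_score` is an illustrative helper written for this check, not the authors' evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, as in Eq. (3)."""
    if precision + recall == 0:
        return 0.0  # common convention when both values are zero
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs as reported in Table 3.
rows = {
    "BERT": (0.897, 0.896),
    "BERT + Regular Expressions": (0.925, 0.896),
}

for model, (p, r) in rows.items():
    print(f"{model}: F1 = {f1_score(p, r):.3f}")
```

For the BERT row this gives about 0.896, in line with the table. For the combined row the harmonic mean of 0.925 and 0.896 is about 0.910, which is higher than the 0.896 listed; the table values are reproduced here as reported.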
unstructured data, combining regular expressions with the BERT model can achieve high precision and good generalization capability without a very large corpus.

5. Conclusions

This work focused on the proactive identification and anonymization of sensitive data using machine learning based techniques. Specifically, this work investigated techniques for extracting sensitive data from structured and unstructured data, and proposed a machine learning based sensitive data detection framework that can automate the identification of sensitive data in real time with the deep learning model BERT. The proposed method achieves 92.5% precision and 89.6% recall without requiring a very large corpus.

Acknowledgements

I would like to thank my supervisor (Shancang Li) for his tireless guidance and dedication. Without him, this article would not have happened.

References

[1] H.-Y. Tran and J. Hu, "Privacy-preserving big data analytics a comprehensive survey," Journal of Parallel and Distributed Computing, vol. 134, pp. 207-218, 2019.
[2] J. Khan, G. A. Kan, J. P. Li, et al., "Secure smart healthcare monitoring in industrial internet of things (iiot) ecosystem with cosine function hybrid chaotic map encryption," Scientific Programming, vol. 2022, 2022.
[3] A. De Slave, P. Mori, and L. Ricci, "A survey on privacy in decentralized online social networks," Computer Science Review, vol. 27, pp. 154-176, 2018.
[4] D. K. Alferidah and N. Jhanjhi, "Cybersecurity impact over big data and iot growth," in 2020 International Conference on Computational Intelligence (ICCI). IEEE, 2020, pp. 103-108.
[5] V. Meshram, K. Patil, V. Meshram, D. Haanchate, and S. Ramkteke, "Machine learning in agriculture domain: a state-of-art survey," Artificial Intelligence in the Life Sciences, vol. 1, p. 100010, 2021.
[6] J. Li, A. Sun, J. Han, and C. Li, "A survey on deep learning for named entity recognition," IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 1, pp. 50-70, 2022.
[7] Y.-I. Liu, L. Huang, W. Yan, X. Wang, and R. Zhang, "Privacy in ai and the iot: The privacy concerns of smart speaker users and the personal information protection law in China," Telecommunications Policy, vol. 46, no. 7, p. 102334, 2022.
[8] P. Silva, C. Goncalves, C. Godinho, N. Antunes, and M. Curado, "Using nlp and machine learning to detect data privacy violations," in IEEE INFOCOM 2020 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE, 2020, pp. 972-977.
[9] V. E. Adeyemo, A. Abdullah, N. JhanJhi, M. Supramaniam, and A. O. Balogun, "Ensemble and deep-learning methods for two-class and multi-attack anomaly intrusion detection: an empirical study," International Journal of Advanced Computer Science and Applications, vol. 10, no. 9, 2019.
[10] J. Ma, J. Zhang, L. Xiao, and K. Chen, "Classification of power quality disturbances via deep learning," IETE Technical Review, vol. 34, no. 4, pp. 408-415, 2017.
[11] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 160-167.
[12] R. Chalapathy, E. Z. Borzeshi, and M. Piccardi, "An investigation of recurrent neural architectures for drug name recognition," arXiv preprint arXiv:1609.07585, 2016.
[13] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[14] E. F. Sang and F. De Meulder, "Introduction to the conll-2003 shared task: Language-independent named entity recognition," arXiv preprint cs/0306050, 2003.
[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[16] R. I. Dogan, R. Leaman, and Z. Lu, "Ncbi disease corpus: a resource for disease name recognition and concept normalization," Journal of Biomedical Informatics, vol. 47, pp. 1-10, 2014.
[17] L. Xu, X. Zhang, L. Li, et al., "CLUE: A Chinese language understanding evaluation benchmark," 2020.