Suspicious E-Mail Detection Via Decision Tree: A Data Mining Approach

The document proposes using a decision tree algorithm to detect suspicious emails related to criminal activities. It applies a model of deception which suggests deceptive writing has reduced first-person pronouns and exclusive words, and elevated negative emotion words and action verbs. The paper extracts features from emails based on this model, trains an ID3 decision tree on the data, and uses the tree to classify new emails as suspicious or not suspicious. It aims to help criminal investigators more efficiently explore large databases of email communications.


Journal of Computing and Information Technology - CIT 15, 2007, 2, 161-169 doi:10.2498/cit.1000984

Suspicious E-mail Detection via Decision Tree: A Data Mining Approach

S. Appavu alias Balamurugan and Ramasamy Rajaram
Thiagarajar College of Engineering, Madurai, India

Data mining is the quest for knowledge in databases to uncover previously unimagined relationships in the data. This paper proposes applying a decision tree to suspicious e-mail detection (e-mails about criminal activities). Deception theory suggests that deceptive writing is characterized by a reduced frequency of first-person pronouns and exclusive words and an elevated frequency of negative emotion words and action verbs. We applied this model of deception to an e-mail dataset and then applied the ID3 algorithm to generate a decision tree. The generated decision tree is used to test whether an e-mail is suspicious or not. In particular, we are interested in detecting fraudulent and possibly criminal activities from such data.

Keywords: data mining, deception theory, decision tree based classification

1. Introduction

Data mining has recently attracted considerable attention from database practitioners and researchers because of its applicability in many areas such as decision support, market strategy and financial forecasting. Combining techniques from fields such as statistics, machine learning and databases, data mining helps in extracting useful and invaluable information from databases. Detecting unusual communication patterns across various means and channels of communication represents an important class of applications directly relevant to security informatics [2]. E-mail has become one of today's standard means of communication. A large percentage of the total traffic over the Internet is e-mail. E-mail data is also growing rapidly, creating a need for automated analysis. So, to detect crime, a spectrum of techniques should be applied to discover and identify patterns and make predictions. Data mining has emerged to address the problem of understanding ever-growing volumes of structured data, finding patterns within the data that are used to develop useful knowledge.

As individuals increase their usage of electronic communication, there has been research into detecting deception in these new forms of communication. Models of deception assume that deception leaves a footprint. The work done by various researchers suggests that deceptive writing is characterized by a reduced frequency of first-person pronouns and exclusive words and an elevated frequency of negative emotion words and action verbs [8]. We apply this model of deception to an e-mail dataset and preprocess the e-mail body. To train the system, we used the ID3 (Iterative Dichotomiser 3) algorithm [6] to generate a decision tree that categorizes each e-mail as deceptive or not.

Text classification, including e-mail classification, presents challenges because of the large number and variety of features in the dataset and the large number of documents. The applicability of existing classification techniques to these datasets has been limited because the large number of features makes most documents indistinguishable. In many document datasets, only a small percentage of the total features may be useful in classifying documents, and using all the features may adversely affect performance. The quality of the training dataset decides the performance of both the text classification algorithms and the feature selection algorithms. An ideal training document dataset for each particular category will include all the important terms and their possible distribution in the category. To our knowledge, this is the first attempt to apply a decision tree to the task of suspicious e-mail detection (e-mails about criminal activities).

1.1. Motivation

Concern about national security has increased significantly since the terrorist attacks on 11 September 2001. The CIA, FBI and other federal agencies are actively collecting domestic and foreign intelligence to prevent future attacks. These efforts have in turn motivated us to collect the data and undertake this work as a challenge. Data mining is a powerful tool that enables criminal investigators, who may lack extensive training as data analysts, to explore large databases quickly and efficiently. Computers can process thousands of instructions in seconds, saving precious time. In addition, installing and running software often costs less than hiring and training personnel. Also, computers are less prone to errors than human investigators. So this system helps and supports the investigators.

1.2. Organization of the Paper

The paper is organized as follows: Section 2 defines the problem statement and reviews related work in this area. Section 3 describes the proposed work, and experimental results are presented in Section 4. Section 5 discusses the performance measure. Finally, Section 6 concludes the paper and points out some potential future work.

2. Problem Statement and Related Work

It's hard to remember what our lives were like without e-mail. E-mail ranks with the web as one of the most useful features of the Internet, and billions of messages are sent each year. Though e-mail was originally developed for sending simple text messages, it has become more robust in the last few years. So, it is one possible source of data from which potential problems can be detected. Thus the problem is to find a system that identifies deception in communication through e-mails.
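The deception cues named in the Introduction (first-person pronouns, exclusive words, negative emotion words, action verbs) can be turned into simple per-message rates. The following is a minimal sketch only: the word lists in `CUES` and the function name `cue_rates` are hypothetical placeholders chosen for illustration, not the lexicon used in the paper.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny cue lists (assumption for illustration);
# a real system would rely on a full psycholinguistic lexicon.
CUES = {
    "first_person": {"i", "me", "my", "mine", "myself"},
    "exclusive": {"but", "except", "without", "unless"},
    "negative_emotion": {"hate", "anger", "afraid", "lifeless"},
    "action_verbs": {"go", "attack", "kill", "destroy"},
}

def cue_rates(text):
    """Return, per cue category, the fraction of tokens that belong to it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty input
    counts = Counter()
    for tok in tokens:
        for name, words in CUES.items():
            if tok in words:
                counts[name] += 1
    return {name: counts[name] / total for name in CUES}

rates = cue_rates("We attack at dawn. Destroy it, but I hate waiting.")
print(rates)
```

Under the deception model, a message with low `first_person` and `exclusive` rates and high `negative_emotion` and `action_verbs` rates would be a candidate for closer inspection.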

One of the earlier automated deception detection systems, constructed from a record linkage method based on string comparators [5], was proposed by Gang Wang, Hsinchun Chen and Homa Atabakhsh. A limitation of this method is that it often requires intensive computation. Xindong Wu and Xingquan Zhu developed an impact-sensitive instance ranking method [12] to identify deception in real-world data sets. A limitation of this method is that, when attribute $A_i$ and class $C$ are switched for attribute prediction $AP_i$, the accuracy of $AP_i$ can be very low. P. S. Keila and D. B. Skillicorn developed a method based on singular value decomposition [8] to detect unusual and deceptive communication in e-mails. The problem with this approach is that it does not deal with incomplete data in an efficient and elegant way and cannot incorporate new data incrementally without reprocessing the entire matrix.

Classification is an important data mining problem. The input is a dataset of training records (also called the training database), wherein each record has several attributes. Attributes with numerical domains are numerical attributes, and attributes whose domains are non-numerical are categorical attributes. There is also a distinguished attribute called the class label. Classification aims at building a concise model that can be used to predict the class label of future, unlabeled records. Many classification models, including Naive Bayes, decision trees, support vector machines and neural networks, have been proposed. [13] reported a cross-experiment comparison of 14 classification methods, including decision tree, Naive Bayes, neural networks, linear least squares fit and Rocchio; KNN was one of the top performers and performed well in scaling up to very large and noisy classification problems.

[9] showed good performance, reducing the classification error by discovering temporal relations in an e-mail sequence in the form of temporal sequence patterns and embedding the discovered information into content-based learning methods. Approaches to anomalous e-mail detection have also been considered: [15] showed approaches to detect anomalous e-mail that involved the deployment of data mining techniques. [4] proposed a model based on a neural network to classify personal e-mails, with principal component analysis as a preprocessor of the NN to


reduce the data in terms of both dimensionality and size. [11] and [14] developed an algorithm to reduce the feature space without a notable loss of classification accuracy, but its effectiveness depended on the quality of the training dataset. In the classification experiment for spam filtering, the decision tree showed better results than the NB, NN or SVM classifiers [10].

3. Proposed Work

In this paper, we present a novel data mining based decision tree algorithm to detect e-mail concerning criminal activities. It is developed specifically for detecting deceptive communication in e-mail. The architecture of the proposed system is shown in Figure 1.

Figure 1. Proposed Suspicious e-mail detection system.

The architecture in Figure 1 is used to detect suspicious e-mails. The Connection manager provides the connection between the e-mail sender and the Processing center. The Content filter plays the central role: it applies the preprocessing and classification algorithms (such as the decision tree) to separate out the suspicious e-mails. This output is delivered to the investigator with the help of the Delivery manager. The proposed method is implemented in Java (JDK 1.5), which the authors chose as a high-performance language for technical computing. The implementation has three parts: e-mail preprocessing, building the decision tree, and validation.

3.1. Database Used in Experiment

A Microsoft Access database is used to store the e-mail messages. The dataset contains the folder information for each suspicious and normal e-mail. Each message in the folders contains the sender and receiver e-mail addresses, date and time, subject, body text and some other e-mail-specific technical details. The database contains two tables. The first table contains the information about the e-mail message: the sender, subject, text and other information. The second table contains the recipient information: the e-mail address of the recipient and the type (To, Cc, Bcc) in which the message was sent to the recipient.

3.2. Text Classification Architecture

In Figure 2 we present a simple architecture of text classification systems. There is a pool of documents representing the content at hand, which can be stored on disk or come from data streams or the web. Standard preprocessing steps are applied to this document corpus, followed by an appropriate choice of token models, representation methods and labeling systems. Classification models are chosen to operate on train-validation-test splits, and classifiers are learned and stored.

Figure 2. The standard text classification set up.
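The two-table message store described in Section 3.1 can be sketched as follows. This is an illustration under stated assumptions: SQLite (through Python's standard `sqlite3` module) stands in for the Microsoft Access database, and all table names, column names and addresses are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MS Access file (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE message (
    id      INTEGER PRIMARY KEY,
    folder  TEXT,   -- 'suspicious' or 'normal'
    sender  TEXT,
    sent_at TEXT,
    subject TEXT,
    body    TEXT
);
CREATE TABLE recipient (
    message_id INTEGER REFERENCES message(id),
    address    TEXT,
    kind       TEXT CHECK (kind IN ('To', 'Cc', 'Bcc'))
);
""")

# One message with two recipients (made-up example data).
conn.execute(
    "INSERT INTO message VALUES (?, ?, ?, ?, ?, ?)",
    (1, "normal", "y@example.org", "2007-02-07 13:39", "Hi", "Hope ur fine!"),
)
conn.executemany(
    "INSERT INTO recipient VALUES (?, ?, ?)",
    [(1, "a@example.org", "To"), (1, "b@example.org", "Cc")],
)

# Join the two tables to list who received each message and how.
rows = conn.execute(
    "SELECT m.subject, r.address, r.kind "
    "FROM message m JOIN recipient r ON r.message_id = m.id "
    "ORDER BY r.address"
).fetchall()
print(rows)
```

Keeping recipients in their own table lets one message have any number of To, Cc and Bcc addresses without repeating the message body.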


3.3. E-mail Preprocessing

E-mail preprocessing transforms the training dataset into a representation suitable for the ID3 (Iterative Dichotomiser 3) decision tree algorithm. This stage extracts the informational words from the dataset. It consists of the following two steps:

1. Removal of non-discriminative words
2. Suffix stripping

3.3.1. Removal of Non-discriminative Words

In e-mails, certain words, such as prepositions, pronouns and conjunctions, are very frequent and are not discriminative of a message's contents. Examples of such words are we, that, and, this, etc. Some widely used conversational English words, such as I'm, isn't, can't, etc., are also of less importance. These terms are eliminated in this step. Based on the theory of deception, a deceptive e-mail will have highly emotional words and action verbs. So, such words are set as keywords and extracted from the input dataset, and the most frequent but less deceptive words are eliminated. Examples of highly emotional words and action verbs are lifeless, anger, kill, attack, etc.

3.3.2. Suffix Stripping

Suffix stripping is the process of removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalization process that is usually done when setting up information retrieval (IR) systems. The Porter stemming algorithm (or Porter stemmer) is used to perform this process. Ignoring the issue of where precisely the words originate, we can say that a document is represented by a vector of words, or terms. Terms with a common stem will usually have similar meanings, for example:

Assassinate
Assassinated
Assassinating

Frequently, the performance of an IR system will be improved if term groups such as this are conflated into a single term. This may be done by removal of the various suffixes -ED, -ING, -ION, -IONS to leave the single term assassinate. In addition, the suffix stripping process reduces the total number of terms in the IR system, and hence reduces the size and complexity of the data in the system, which is always advantageous. Hence, the words extracted in the previous step are suffix-stripped to increase their efficiency.

Unfortunately, e-mails are usually very noisy, and simply applying text-mining tools to them, which are usually not designed for mining noisy data, may not bring good results. Prior to indexing and classification, a number of preprocessing steps were performed:

1. E-mails were converted to plain text from .mbox files.
2. Headers and HTML components were removed.
3. The body of the message was extracted.
4. The message body was tokenized into words, stop words were removed, and words were converted into lower case.

Figure 3 shows an example e-mail, which includes many typical noises (or errors) for text mining. Lines 1 and 2 are a header; lines 4 to 8 are a signature. All of them are supposed to be irrelevant to text mining. Only line 3 is actual text content.
1. On Wednesday February 2007 13:39:42-0500, X
2. [email protected]
3. Today there will be bomb blast in parliament house and the US consulates in India at 11.46 am. Stop it if you could. Cut relations with the U.S.A. Long live Osama Finladen Asadullah Alkalfi.
4.
5. -----------------------------------------------------------------------
6. Best regards
7. X
8. -----------------------------------------------------------------------

Figure 3. Example of e-mail message.

Figure 4 shows an ideal output of cleaning the e-mail in Figure 3: the non-text parts (header, signature and quotation) have been removed and the text has been normalized. Specifically, the extra line breaks have been eliminated.
1. 2.

bomb blast

Figure 4. Cleaned e-mail message.
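The cleaning step that takes a raw message like Figure 3 towards Figure 4 can be sketched as below. The heuristics are assumptions made for illustration (a line of dashes starts the signature, a line starting with ">" is quoted text, a line beginning "On ... hh:mm" or consisting of a bare address is header material); they are not the authors' rules, and `clean_email` is a hypothetical name.

```python
import re

def clean_email(raw):
    """Drop header, quoted and signature lines; join the rest into one line."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        if re.fullmatch(r"-{3,}", stripped):
            break  # assumption: a dashed rule opens the signature block
        if stripped.startswith(">"):
            continue  # quoted text from a replied or forwarded message
        if re.match(r"^On .+\d{1,2}:\d{2}", stripped) or re.fullmatch(r"\S+@\S+", stripped):
            continue  # attribution header or bare address line
        if stripped:
            kept.append(stripped)
    # Normalization: collapse line breaks and runs of whitespace.
    return re.sub(r"\s+", " ", " ".join(kept))

raw = """On Wednesday February 2007 13:39:42-0500, X
x@example.org

Stop it if you could.
Cut relations with the U.S.A.

--------------------
Best regards
X
--------------------"""

cleaned = clean_email(raw)
print(cleaned)
```

Only the body text survives; keyword extraction would then run on `cleaned`.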


In this paper, we formalize the e-mail-cleaning problem as that of non-text data filtering and text data normalization. By filtering of an e-mail we mean a process of removing the parts in the e-mail which are not needed for text mining, and by normalization of an e-mail we mean a process of converting the parts necessary for text mining into texts in canonical form (like a newspaper style text). Header, signature, quotation (in forwarded message or replied message), program code, and table are usually irrelevant for mining, and thus should be identified and removed (in a particular text mining application, however, we can retain some of them when necessary).
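Once a message body has been cleaned, the two preprocessing steps of Section 3.3 (removal of non-discriminative words, then suffix stripping) can be sketched as one small pipeline. Note the assumptions: the stop-word list is a short illustrative sample, and `strip_suffix` is a naive stand-in that removes a few endings; it is not the Porter stemmer named in the text.

```python
import re

# Illustrative sample of non-discriminative words (assumption, not the paper's list).
STOP_WORDS = {"we", "that", "and", "this", "the", "in", "at", "it", "if", "you",
              "there", "will", "be", "with", "i'm", "isn't", "can't"}
# Keywords taken from the attribute legend of Table 1, in stemmed form.
KEYWORDS = {"bomb", "blast", "attack", "threaten", "kidnap", "murder",
            "destroy", "weapon", "danger", "terrorist", "hijack", "disaster"}
SUFFIXES = ("ing", "ed", "s")  # naive stand-in, NOT the Porter algorithm

def strip_suffix(word):
    """Remove the first matching ending, keeping at least three letters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

def keywords_in(body):
    """Lower-case, tokenize, drop stop words, strip suffixes, keep keywords."""
    tokens = re.findall(r"[a-z']+", body.lower())
    stems = [strip_suffix(t) for t in tokens if t not in STOP_WORDS]
    found = []
    for stem in stems:
        if stem in KEYWORDS and stem not in found:
            found.append(stem)
    return found

body = ("Today there will be bomb blast in parliament house and "
        "the US consulates in India at 11.46 am.")
print(keywords_in(body))
```

For the example body this leaves the two keywords bomb and blast, which is the content shown in Figure 4.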

3.4. Building the Decision Tree

We have used the ID3 (Iterative Dichotomiser 3) decision tree algorithm to classify the records. Table 1 tabulates a few of the e-mails used in the experiment after preprocessing. Each row is one e-mail; the columns A1 to A12 show some of the attributes, and the last column represents the outcome. A value of 1 (yes) denotes the occurrence of the attribute in the e-mail and 0 (no) denotes its non-occurrence. From the tabulated values, the entropy and the information gain of each attribute are calculated.
Email  A1  A2  A3  A4  A5  A6  A7  A8  A9  A10  A11  A12  ...  Class (D/ND)
1      1   1   0   0   0   0   0   0   0   0    0    0         D
2      0   0   1   0   0   1   0   0   0   0    1    0         D
3      1   0   1   0   0   0   0   0   0   1    0    0         D
4      1   1   1   0   0   1   0   0   0   0    0    0         D
5      1   1   0   0   0   0   0   0   0   1    0    0         D
6      0   0   0   0   0   0   0   0   0   1    1    0         D
7      1   1   0   1   0   0   0   0   0   0    0    1         D
8      0   0   0   0   0   0   0   0   0   0    0    0         ND
9      0   0   1   0   0   0   0   1   0   0    0    0         D
10     0   0   0   0   0   0   0   0   0   0    0    0         ND

Figure 5. E-mail message before preprocessing.

A1 = BOMB, A2 = BLAST, A3 = ATTACK, A4 = THREATEN, A5 = KIDNAP, A6 = MURDER, A7 = DESTROY, A8 = WEAPONS, A9 = DANGER, A10 = TERRORIST, A11 = HIJACK, A12 = DISASTER. 1 = YES, 0 = NO, D = DECEPTIVE, ND = NON-DECEPTIVE.

Table 1. Sample Training dataset.
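A row of the training dataset can be produced from the keywords found in a message. The sketch below assumes the attribute order A1 to A12 given in the legend of Table 1; the function name `to_vector` and the lower-case spelling of the keywords are illustrative choices.

```python
# Attribute order A1..A12 as listed in the legend of Table 1.
ATTRIBUTES = ["bomb", "blast", "attack", "threaten", "kidnap", "murder",
              "destroy", "weapons", "danger", "terrorist", "hijack", "disaster"]

def to_vector(keywords):
    """Encode a collection of keywords as a 0/1 vector over ATTRIBUTES."""
    present = set(keywords)
    return [1 if attr in present else 0 for attr in ATTRIBUTES]

# A message containing "bomb" and "blast" sets A1 and A2 and nothing else.
row = to_vector(["bomb", "blast"])
print(row)

# A labelled training example pairs the vector with its class (D or ND).
example = (row, "D")
```

A message with none of the keywords maps to the all-zero vector, which is how the two non-deceptive rows of the sample look.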

E-mails used in the experiments: We collected over 3000 e-mails through a brainstorming session. Some of them are shown below; the first example is a real e-mail.

Example suspicious and normal e-mail.

Suspicious e-mail
Sender: X
Sub: Bomb Blast
Body: Today there will be bomb blast in parliament house and the US consulates in India at 11.46 am. Stop it if you could. Cut relations with the U.S.A. Long live Osama Finladen Asadullah Alkalfi.

Figure 6. E-mail message after preprocessing.

Normal e-mail
Sender: y
Sub: Hi
Body: Hope ur fine! How are u & family members?


3.4.1. Algorithm for Inducing a Decision Tree from Training Samples

Input: The training samples, represented by discrete-valued attributes; the set of candidate attributes, attribute-list.
Output: A decision tree.
Method:
1) The tree is constructed in a top-down recursive divide-and-conquer manner.
2) At the start, all the training examples are at the root.
3) Attributes are categorical (if continuous-valued, they are discretized in advance).
4) Examples are partitioned recursively based on selected attributes.
5) Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning:
6) All samples for a given node belong to the same class.
7) There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf.
8) There are no samples left.

3.4.2. Proof by Induction

Let S be a collection of 50 e-mails of which 38 are deceptive and 12 are non-deceptive. Then

$$\mathrm{Entropy}(S) = -\frac{38}{50}\log_2\frac{38}{50} - \frac{12}{50}\log_2\frac{12}{50} = 0.7941$$

The information gain is then calculated as follows. There are 14 occurrences of the attribute Bomb, so S splits into 36 e-mails without the word (24 deceptive, 12 non-deceptive), written $S_{\text{non-deceptive}}$ below, and 14 e-mails containing it, written $S_{\text{deceptive}}$. Since 24 + 14 = 38, all 14 e-mails containing the word are deceptive and their entropy term is zero.

$$\mathrm{Gain}(S, \mathrm{Bomb}) = \mathrm{Entropy}(S) - \frac{36}{50}\,\mathrm{Entropy}(S_{\text{non-deceptive}}) - \frac{14}{50}\,\mathrm{Entropy}(S_{\text{deceptive}}) = 0.7941 - \frac{36}{50}\,\mathrm{Entropy}\!\left(\frac{24}{36}, \frac{12}{36}\right) = 0.189$$

Similarly, there are 10 occurrences of the attribute Attack.

$$\mathrm{Gain}(S, \mathrm{Attack}) = \mathrm{Entropy}(S) - \frac{40}{50}\,\mathrm{Entropy}(S_{\text{non-deceptive}}) - \frac{10}{50}\,\mathrm{Entropy}(S_{\text{deceptive}}) = 0.7941 - \frac{40}{50}\,\mathrm{Entropy}\!\left(\frac{28}{40}, \frac{12}{40}\right) = 0.133$$

Similarly, there are 9 occurrences of the attribute Hijack.

$$\mathrm{Gain}(S, \mathrm{Hijack}) = \mathrm{Entropy}(S) - \frac{41}{50}\,\mathrm{Entropy}(S_{\text{non-deceptive}}) - \frac{9}{50}\,\mathrm{Entropy}(S_{\text{deceptive}}) = 0.7941 - \frac{41}{50}\,\mathrm{Entropy}\!\left(\frac{29}{41}, \frac{12}{41}\right) = 0.133$$

$$\mathrm{Gain}(S, \mathrm{Blast}) = 0.0567, \qquad \mathrm{Gain}(S, \mathrm{Kidnap}) = 0.0499$$

Likewise, the information gain is calculated for all the attributes. The attribute with the highest information gain becomes the root node of the tree. The attribute Bomb has the highest information gain, about 0.189, and hence becomes the root node. This process continues until all the attributes are mapped into the tree in order of information gain. Following each individual path in the tree, the rules are generated. The output of this module is the decision tree and the rules.
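The entropy and information gain calculations above can be checked with a few lines of code. This is a sketch under one stated assumption that follows from the counts in the text: every e-mail containing the keyword is deceptive, so that partition is written as (n, 0). The function names are illustrative.

```python
from math import log2

def entropy(*counts):
    """Shannon entropy (base 2) of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain(parent, partitions):
    """Information gain of a split: parent entropy minus weighted child entropy."""
    total = sum(parent)
    remainder = sum(sum(p) / total * entropy(*p) for p in partitions)
    return entropy(*parent) - remainder

S = (38, 12)  # (deceptive, non-deceptive) counts for the 50 e-mails

# Each split: (e-mails without the word), (e-mails with the word).
gains = {
    "Bomb": gain(S, [(24, 12), (14, 0)]),
    "Attack": gain(S, [(28, 12), (10, 0)]),
    "Hijack": gain(S, [(29, 12), (9, 0)]),
}

print(round(entropy(*S), 4))
for name, value in gains.items():
    print(name, round(value, 4))
```

Evaluated this way with base-2 logarithms, Entropy(S) comes out at about 0.795 and the three gains at about 0.134, 0.090 and 0.080, so the numbers differ somewhat from the printed ones while the ranking, with Bomb first, is the same.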

Figure 7. Generated Decision tree.
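The induction procedure of Section 3.4.1 can be sketched compactly for binary (0/1) attributes. This is a generic ID3 sketch, not the authors' Java implementation; the five training rows are made-up toy data, and the nested-dictionary tree representation is an illustrative choice.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def id3(rows, labels, attrs):
    """Build a tree as nested dicts {attribute: {0: subtree, 1: subtree}}."""
    if len(set(labels)) == 1:
        return labels[0]                    # all samples share one class
    majority = Counter(labels).most_common(1)[0][0]
    if not attrs:
        return majority                     # no attributes left: majority vote

    def gain(attr):
        remainder = 0.0
        for value in (0, 1):
            subset = [l for r, l in zip(rows, labels) if r[attr] == value]
            if subset:
                remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder

    best = max(attrs, key=gain)             # highest information gain
    branches = {}
    for value in (0, 1):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        if not idx:
            branches[value] = majority      # no samples left for this branch
        else:
            branches[value] = id3([rows[i] for i in idx],
                                  [labels[i] for i in idx],
                                  [a for a in attrs if a != best])
    return {best: branches}

def classify(tree, row):
    """Follow the branches until a class label (a string) is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree

# Toy training data (hypothetical): D = deceptive, ND = non-deceptive.
rows = [{"bomb": 1, "attack": 0}, {"bomb": 0, "attack": 1},
        {"bomb": 0, "attack": 0}, {"bomb": 1, "attack": 1},
        {"bomb": 0, "attack": 0}]
labels = ["D", "D", "ND", "D", "ND"]

tree = id3(rows, labels, ["bomb", "attack"])
print(tree)
print(classify(tree, {"bomb": 0, "attack": 1}))
```

Each root-to-leaf path of the resulting dictionary corresponds to one if-then rule of the kind listed in Section 4.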


4. Experimental Results

Data mining was applied to the task of automatic suspicious e-mail classification, and experiments were carried out on a small e-mail corpus. For the experiments, 3000 e-mails were used: a mixture of 1000 suspicious e-mails and 2000 normal e-mails. The system was trained with the training dataset, and the information gain and the entropy were calculated. When the training process was finished, the best quality rules were taken as the final classification rules. Some of the rules generated by the decision tree based classification are:

Rule 1: If Bomb=Yes then mail=Deceptive
Rule 2: If Bomb=No and Attack=Yes then mail=Deceptive else if Attack=No and Blast=Yes then mail=Deceptive
Rule 3: If Blast=No and Hijack=Yes then mail=Deceptive else if Hijack=No and Murder=Yes then mail=Deceptive
Rule 4: If Murder=No and Death=Yes then mail=Deceptive else if Death=No and Terrorist=Yes then mail=Deceptive else if Terrorist=No and Destroy=Yes then mail=Deceptive
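Read together, Rules 1 to 4 form a cascade: each attribute is tested only when all earlier ones are absent. A minimal sketch of that cascade follows; the fall-through to "Non-deceptive" when no rule fires is an assumption, since the listed rules only state the deceptive outcomes, and the names used are illustrative.

```python
# Attributes in the order in which Rules 1 to 4 test them.
RULE_ORDER = ["bomb", "attack", "blast", "hijack",
              "murder", "death", "terrorist", "destroy"]

def classify_by_rules(features):
    """features maps keyword -> 0/1; the first keyword present decides."""
    for word in RULE_ORDER:
        if features.get(word):
            return "Deceptive"
    return "Non-deceptive"  # assumption: default when no rule fires

print(classify_by_rules({"bomb": 0, "attack": 0, "blast": 1}))
print(classify_by_rules({"bomb": 0, "attack": 0}))
```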

Figure 8. Proposed framework for suspicious e-mail detection.

5. Performance Measure

To evaluate the classifier on testing data, we defined an accuracy measure as follows.

$$\mathrm{Accuracy}(\%) = \frac{\text{correctly classified e-mails}}{\text{total e-mails}} \times 100$$

Precision and recall were also used as metrics for evaluating the performance of each e-mail classification approach.

A. Effect of dataset on performance

An experiment measuring performance against dataset size was conducted using datasets of different sizes. For example, for a dataset of 1000 e-mails, accuracy was 96.25% using the DT classifier. The decision tree classifier achieves over 95% classification accuracy for datasets of 1000 e-mails or more.
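The accuracy measure, together with precision and recall for the deceptive class, can be computed as in the sketch below. The labels are made-up toy data, and treating "D" as the positive class is an assumption made for illustration.

```python
def evaluate(actual, predicted, positive="D"):
    """Return accuracy in percent plus precision and recall for `positive`."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    correct = sum(a == p for a, p in pairs)
    return {
        "accuracy": 100.0 * correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Toy labels (hypothetical): D = deceptive, ND = non-deceptive.
actual = ["D", "D", "D", "ND", "ND"]
predicted = ["D", "D", "ND", "ND", "D"]

scores = evaluate(actual, predicted)
print(scores)
```

Precision penalizes normal e-mails flagged as deceptive, while recall penalizes deceptive e-mails that were missed, so the two complement the single accuracy figure.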

This will be the input to the testing stage:

This is the output that is obtained during the execution:


Data size   500     1000    1500    2000
DT (%)      93.40   96.25   95.17   95.38

Table 2. Classification result based on Data size.

B. Effect of feature size on performance

The other experiment measured performance against feature size, using different numbers of features. For example, with 20 features, accuracy was 93.91% using the DT classifier. The most frequent words in suspicious e-mails were selected as features. Generally, classification results improved for all classification methods as the feature size increased. The decision tree classifier achieved over 94% classification accuracy for feature sizes of 30 or more.

Feature size   10      20      30      40      50
DT (%)         91.64   93.91   94.46   94.16   94.64

Table 3. Classification result based on Feature size.

6. Conclusions and Further Work

E-mail is an important means of communication. It is a possible source of data from which potential problems can be detected. In this paper, we have employed a decision tree-based classification approach to detect e-mails related to criminal activities. All the e-mails were classified as suspicious (1) or not (0). From this experiment, we find that a simple decision tree classifier can provide better classification results for suspicious e-mail detection. In the near future, we plan to incorporate other techniques, such as different ways of feature selection and classification using other methods. One major advantage of the decision tree-based classifier is that it doesn't assume that terms are independent, and its training is relatively fast. Furthermore, the rules are human understandable and easy to maintain. The proposed work will be helpful for identifying deceptive e-mail and will also assist investigators in getting the information in time to take effective action to reduce criminal activities. In the future, we would add the following features: automatic reply to incoming e-mails that are found deceptive, and enabling our application to work in a mobile environment.

A problem we faced when trying to test new ideas dealing with e-mail systems was an inherent limitation of the available data. Because we only have access to our own data, our results and experiments no doubt reflect some bias. Much of the work published in the e-mail classification domain also suffers from the fact that it tries to reach general conclusions using very small data sets collected on a local scale.

7. Appendix

We collected over 3000 e-mails through a brainstorming session. Some of them are shown below; the first example is a real e-mail.

Live example: Today there will be bomb blast in parliament house and the US consulates in India at 11.46 am. Stop it if you could. Cut relations with the U.S.A. Long live Osama Finladen Asadullah Alkalfi.

Sample training dataset

We are a band of patriot who has currently captured a Brahmas and a nuclear warhead. We threaten to destroy the parliament building in session if any is held. With captures technology and our expertise. We have started mass production of this ICMS.we can destroy any place, any time since we have many strategically placed base stations, throughout the planet. We, the freedom lovers of India, have planned an attack on the secretariat, Chennai on 26th January. We challenge you for the attack and stop it if you can. This is not just a threaten mail. So get ready for the attack. We have planned to attack trade center on the coming week. This attack is been planned by us to insist on the freedom of our people.
We are ready to loss our lives for the sake of Our people if possible, try your level best to save your lives.


The lists of supporters of RAM temple construction have been found region wise. These coverts are going to face drastic consequences which no one had ever seen in their life. It may include your name. So roll back your efforts otherwise????? One of the key defense research labs of the national security /national defense agent will be blown within the next 120 hours. We are going to attack the US embassy. There will be a terrorist attack or a bomb scare or anthrax spread. Beware of it. We are going to demolish the US.consultate Building beware .if possible try to stop it. We, the terrorist society have planned to do a bomb blast in the Indian airline plane today. We dont like this government. We do this act as criticism to this government. We are going to attack the parliament on 28-12006. The bomb will blast at any time .we plan to kill the leaders. The attack may be on 26-106 to demolish. The republic day parades. It is not possible to prevent the disaster if possible save them. A bomb has been placed in Tajmahal. It may explode on today .save your people if you wish to.

References
[1] S. APPAVU ALIAS BALAMURUGAN, R. RAJARAM, S. SENTHAMARAI KANNAN, A Novel Data Mining Approach to Detect Deceptive Communication in Email Text. Proceedings of the National Conference on Advanced Computing, MIT, Chennai, India, (2007), pp. 179-188.
[2] A. BADIA, M. M. KANTARDZIC, Link Analysis Tools for Intelligence and Counterterrorism. Proceedings of the IEEE International Conference on Intelligence and Security Informatics, Atlanta, GA, (2005), pp. 49-59.
[3] W. COHEN, Learning Rules that Classify Email. In Proc. of the AAAI Spring Symposium on Machine Learning in Information Access, (1996).
[4] B. CUI, A. MONDAL, J. SHEN, G. CONG, K. TAN, On Effective Email Classification via Neural Networks. In Proc. of DEXA, (2005), pp. 85-94.
[5] G. WANG, H. CHEN, H. ATABAKHSH, Automatically Detecting Deceptive Criminal Identity. Comm. ACM, (2004), pp. 70-76.
[6] J. HAN, M. KAMBER, Data Mining Concepts and Techniques. Morgan Kaufmann Publishers, 2004.
[7] J. TANG, H. LI, Y. CAO, Z. TANG, Email Data Cleaning. Proceedings of KDD, Chicago, USA, (2005).
[8] P. S. KEILA, D. B. SKILLICORN, Detecting Unusual and Deceptive Communication in Email. Technical Reports, (June 2005).
[9] S. KIRITCHENKO, S. MATWIN, S. ABU-HAKIMA, Email Classification with Temporal Features. Intelligent Information Systems, (2004), pp. 523-533.
[10] S. YOUN, D. MCLEOD, A Comparative Study for Email Classification. Proceedings of International Joint Conferences on Computer, Information, System Sciences and Engineering, Bridgeport, CT, (2006).
[11] S. SHANKAR, G. KARYPIS, Weight Adjustment Schemes for a Centroid based Classifier. Computer Science Technical Report TR00-35, (2000).
[12] X. WU, X. ZHU, Data Acquisition with Active and Impact Sensitive Instance Selection. 16th IEEE interactive conference, (2004).
[13] Y. YANG, An Evaluation of Statistical Approaches to Text Categorization. Journal of Information Retrieval, Vol. 1, No. 1/2, (1999), pp. 67-88.
[14] Y. YANG, J. PEDERSEN, A Comparative Study on Feature Selection in Text Categorization. In ICML, (1997), pp. 412-420.
[15] Z. HUANG, D. D. ZENG, A Link Prediction Approach to Anomalous Email Detection. Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, Taipei, Taiwan, (2006).
Received: November, 2006
Accepted: February, 2006

Contact addresses:

S. Appavu alias Balamurugan
Faculty, Dept. of Information Technology
Thiagarajar College of Engineering
Madurai, Tamil Nadu, India
e-mail: [email protected]

Ramasamy Rajaram
Prof. & Head, Dept. of Computer Science
Thiagarajar College of Engineering
Madurai, Tamil Nadu, India
e-mail: [email protected]

S. APPAVU ALIAS BALAMURUGAN received the B.E. degree in Electronics and Communication Engineering from Mohamed Sathak Engineering College, Kilakarai, in 2001 and the Masters in Computer Science and Engineering from the University of Madras, Chennai. He is currently pursuing Ph.D. in Information and Communication Engineering from Anna University, Chennai. He has been working in the area of databases focusing on data mining, data warehousing and cyber security.

DR. RAMASAMY RAJARAM, Head of the Department of Computer Science, Thiagarajar College of Engineering, Madurai, secured his B.E. E.E.E. from Madras University in 1966, M Tech in E.E.E. from IIT Kharagpur in 1971 and Ph.D. from Madurai Kamaraj University in 1979. He currently teaches and guides research in data mining, genetic algorithms and cyber security.



Figure 1. Proposed Suspicious e-mail detection system.

Figure 2. The standard text classification set up.


Figure 3. Example of e-mail message.

Figure 4. Cleaned e-mail message.


Figure 5. E-mail message before preprocessing.

Table 1. Sample Training dataset.


Figure 7. Generated Decision tree.
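The tree in Figure 7 is produced by ID3, which picks at each node the attribute with the highest information gain. The following is a minimal sketch of that selection step only, not the authors' implementation. The attribute names (`first_person`, `negative_emotion`), their `low`/`high` values and the four toy rows are hypothetical, chosen to echo the deception cues described in the abstract; they are not taken from the paper's dataset.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    # Weighted entropy of the subsets produced by the split.
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """ID3 split criterion: the attribute with the largest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical toy data: each row is one e-mail described by two assumed cues.
rows = [
    {"first_person": "low", "negative_emotion": "high"},
    {"first_person": "low", "negative_emotion": "low"},
    {"first_person": "high", "negative_emotion": "high"},
    {"first_person": "high", "negative_emotion": "low"},
]
labels = ["suspicious", "suspicious", "normal", "normal"]

root = best_attribute(rows, labels, ["first_person", "negative_emotion"])
print(root)
```

In this toy set `first_person` separates the two classes perfectly, so it would become the root; a full ID3 run would then repeat the same selection on each resulting subset until the subsets are pure or no attributes remain.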


Figure 8. Proposed framework for suspicious e-mail detection.

This will be the input to the testing stage:

This is the output that is obtained during the execution:


Data size    500      1000     1500     2000
DT           93.40    96.25    95.17    95.38

Table 2. Classification result based on Data size.

Table 3. Classification result based on Feature size.
