New Web Content Filtering: An Implementation: ISSN 2319 - 1953
International Journal of Scientific Research in Computer Science Applications and Management Studies
Abstract - Now-a-days, accessing the internet is a mixed blessing, and constructing a new web content filter is necessary because, in the worst case, accessing the internet may create serious problems. Content filtering comprises two different services: web filtering, the screening of websites or pages, and e-mail filtering, the screening of e-mail for spam or other objectionable content. This paper provides an implementation of a new web content filter, presenting its basic idea, background, construction, comparison, and experiments & results, and concludes with an acknowledgement & references.

Keywords – Filtering, Screening, Spam, URL based searching, String searching.

I. INTRODUCTION
Content filtering is a new subject in the area of technology that has to be studied in depth. The issue arises as a consequence of the variety of media and advertisements on internet web sites that lead to unethical use and misuse by World Wide Web users. A massive volume of Internet content is widely accessible nowadays, and one can easily view improper content at will without access control. A modern and effective web content filtering solution scans more than the domain name: it is able to break down and analyze web traffic, making it capable of accurately pinpointing the portions of a web page that should not be allowed into the internal network.

Content filtering acts as a firewall to block certain sites from being accessed. It usually works by specifying character strings that, if matched, indicate undesirable content to be screened out. Content filtering, and the products that offer this service, can be divided into web filtering, the screening of Web sites or pages, and e-mail filtering, the screening of e-mail for spam or other objectionable content.

Fig 1: New Web Content Filter

The rest of this paper is organized as follows: Section II provides the basic idea, Section III the background of this work, Section IV the construction, Section V a comparison, and Section VI the experiments & results.

The basic idea for implementing a new web content filter is derived from "An Early Decision Algorithm to Accelerate Web Content Filtering" by Ying-Dar Lin [1], Po-Ching Lin [1], Ming-Dao Liu [1] and Yuan-Cheng Lai [2], of the Department of Computer Science, National Chiao Tung University, HsinChu, Taiwan [1], and the Department of Information Management, National Taiwan University of Science and Technology, Taipei, Taiwan [2].

The background of this work is extracted from the survey paper "Web Content Filtering Techniques: A Survey" by V.K.T.Karthikeyan, School of Computer Science and Engineering, Bharathidasan University, Trichy, India.

The Construction section describes the proposed work, the implementation of a new web content filter. The Comparison section presents the differences between the early decision algorithm and the new web content filter. The next section describes the experiments and the results obtained from the new web content filter. Finally, the paper concludes with an acknowledgement and references.

II. BASIC IDEA
2.1 An Early Decision Algorithm to Accelerate Web Content Filtering
This work presents a simple but effective early decision algorithm that accelerates filtering, based on the observation that the filtering decision can be made before scanning the entire content, as soon as the content can be classified into a certain category. A fast decision is particularly important since most Web content is normally allowable and should pass the filter as soon as possible.

The philosophy behind the early decision algorithm is to make the filtering decision from the front partial Web content. The keyword position is normalized by the page length; according to this investigation, the keywords in almost all Web pages tend to be distributed uniformly throughout the content or to appear more in the front part. From the front part onwards, Web content in a banned category starts to exhibit many more keywords than content in an allowable category. In other words, keywords from the front partial content can reveal the category of the Web content and serve as clues for filtering.
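The normalized-position observation can be made concrete with a tiny sketch. The sample text and keyword list below are illustrative assumptions, not data from the paper:

```python
# Sketch: positions of keyword hits, normalized by page length, as in
# the investigation described above. The sample text is illustrative.

def normalized_keyword_positions(words, keywords):
    """Return each keyword hit's word index divided by the word count."""
    targets = {k.lower() for k in keywords}
    total = len(words)
    return [i / total for i, w in enumerate(words) if w.lower() in targets]

page = "bad words appear early bad and then mostly harmless text follows"
print(normalized_keyword_positions(page.split(), ["bad"]))
# Hits land at normalized positions 0.0 and ~0.36: front-heavy, which is
# the pattern banned pages tend to exhibit.
```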
IJSRCSAMS
Volume 3, Issue 4 (July 2014) www.ijsrcsams.com
Like Bayesian classification, the filtering engine is trained off-line from the Web content in the banned categories. The Bow library and its front-end, Rainbow, perform the training herein, extracting keywords as the features of the target categories. The keywords with information gains larger than a threshold are selected. Stop words, such as "the", "of" and so on, should be dropped because they help little in classification. The words inside the HTML tags are also ignored, so that a malicious user cannot stuff unrelated content into the tags, particularly in the front part of the Web page, to deceive the filter. If the malicious user instead fills the Web text outside the tags with irrelevant content to confuse the filter, the irrelevant content will be displayed in the browser and will spoil the layout of the Web pages, which is a great limitation on the design of the Web pages.

The score of a keyword wt that should belong to a category cj is defined to be log P(wt|cj), which can be derived in the training stage. Taking the logarithm simplifies the computation of the posterior probability P(cj|di) from multiplication operations to score accumulation, under an independence assumption between words. The scores are accumulated while the content is scanned from the front to the end.

In the filtering stage, given that n% of the content has been scanned and a score of m or less has been accumulated, the probability that the content belongs to a category c is derived from

                          P(D(n,m)|c) P(c)
P(c|D(n,m)) = --------------------------------------------
               P(D(n,m)|c) P(c) + P(D(n,m)|c') P(c')

where
1. D(n,m): the event that the filter has read n% of the content and has observed a score accumulation of m or less.
2. P(c): the estimated probability that category c appears in typical Web content.
3. P(c'): the estimated probability that category c does not appear in typical Web content; P(c') = 1 - P(c).
4. P(D(n,m)|c): the estimated probability that D(n,m) happens given that the content belongs to category c. The estimate of P(D(n,m)|c) is the number of Web pages in c for which D(n,m) happens, divided by the number of Web pages in c.
5. P(D(n,m)|c'): defined similarly to P(D(n,m)|c), except that c is replaced with c'.

In the training phase, two two-dimensional indexed tables of P(D(n,m)|ci) and P(D(n,m)|ci') are built for each n and m from the training examples, where ci ∈ C. The values of P(ci) and P(ci') can be estimated beforehand or dynamically tuned in a running environment by recording and analyzing actual Web content. Fig. 2 presents the early decision algorithm.

Two thresholds, Tbypass and Tblock, are defined to be 0.1 and 0.9 herein. PCDi is the estimate that the content should belong to category ci. If PCDi is less than Tbypass for all ci in the list of banned categories, the content is unlikely to be banned and the remaining content should be bypassed. In contrast, if there exists some ci in the list of banned categories such that PCDi is larger than Tblock, the content is likely to belong to ci and should be blocked by the filter. A minimum portion of the content should be scanned in the process, to avoid deciding too early from only a little of the front part of the content, which may render the filtering result incorrect. The algorithm is:

    Early_bypass := False;
    Early_block  := False;
    n := 0;
    Do {
        Read next keyword;  // Skip stop words and the HTML tags.
        n := the percentage of content that has been scanned;
        m := the accumulated score;
        If (n > Min_Scan)   // scan at least Min_Scan% of the document; Min_Scan = 10 herein
        {
            For (each category ci in the set of banned categories)
            {
                PDCi  := P(D(n, m)|ci) at the current scanning position;
                PDCi' := P(D(n, m)|ci') at the current scanning position;
                PCDi  := (PDCi * P(ci)) / (PDCi * P(ci) + PDCi' * P(ci'));
            } // end of For
            If (for all categories ci, PCDi < Tbypass)
            {
                Early_bypass := True;
                Exit;
            }
            If (for some category ci, PCDi > Tblock)
            {
                Early_block := True;
                Exit;
            }
        } // end of If (n > Min_Scan)
    } while (not end of content);

Fig 2: An Early Decision Algorithm

III. BACKGROUND
3.1 Web Content Filtering Techniques: A Survey
This methodology works in both online and offline content analysis. The philosophy behind this work is to make the filtering decision from the textual part of the Web content. Stop words, such as "a", "of", "an" and so on, should be dropped because they help little in
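Rendered as runnable Python, the loop of Fig. 2 looks like the sketch below for a single banned category. The probability lookups are toy stand-ins for the trained two-dimensional tables P(D(n, m)|ci) and P(D(n, m)|ci'); every name here is an assumption for illustration, not the authors' code:

```python
# Sketch of the early decision loop of Fig. 2 for one banned category.
# p_d_given_c / p_d_given_not_c stand in for lookups into the trained
# tables P(D(n,m)|ci) and P(D(n,m)|ci'); their values here are toys.

T_BYPASS, T_BLOCK, MIN_SCAN = 0.1, 0.9, 10  # thresholds from the paper

def early_decision(keyword_scores, p_d_given_c, p_d_given_not_c, p_c):
    """Scan scored keywords front to back; return 'bypass', 'block',
    or 'scan-all' if no early decision could be made."""
    total = len(keyword_scores)
    m = 0.0                                  # accumulated score
    for i, score in enumerate(keyword_scores, start=1):
        m += score
        n = 100.0 * i / total                # percentage of content scanned
        if n <= MIN_SCAN:                    # scan at least Min_Scan% first
            continue
        # Posterior PCDi = P(ci | D(n, m)) via the formula above.
        pd_c = p_d_given_c(n, m)
        pd_nc = p_d_given_not_c(n, m)
        pcd = pd_c * p_c / (pd_c * p_c + pd_nc * (1 - p_c))
        if pcd < T_BYPASS:
            return "bypass"                  # unlikely banned: pass early
        if pcd > T_BLOCK:
            return "block"                   # likely banned: block early
    return "scan-all"

# A page whose accumulated score quickly becomes banned-unlike is
# bypassed after scanning only part of the content.
decision = early_decision(
    [-1.0] * 20,
    p_d_given_c=lambda n, m: 0.05 if m < -5 else 0.5,
    p_d_given_not_c=lambda n, m: 0.95 if m < -5 else 0.5,
    p_c=0.5,
)
print(decision)  # bypass, decided after 30% of the content
```

Here the early bypass fires as soon as the posterior drops below Tbypass, matching the paper's point that most pages are allowable and should pass the filter quickly.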
classification. The words inside the HTML tags are also ignored, so that a malicious user cannot stuff unrelated content into the tags, particularly in the front part of the Web page, to deceive the filter. If the malicious user fills the Web text outside the tags with irrelevant content to confuse the filter, the irrelevant content will be displayed in the browser and will spoil the layout of the Web pages, which is a great limitation on the design of the Web pages.

First, the total number of words present in the web page (A) and the total number of keywords present in the web page (B) are counted. Next, the categorical value (C) is found by calculating C = B/A. This value indicates whether the web content belongs to the allowable category or the banned category: when the C value is greater than the boundary value, the web content comes under the banned category; otherwise it comes under the allowable category.

This work addresses the problem of content filtering in web management. A methodology is presented that decides to either block or pass the web content as soon as the decision can be made. The method is simple but effective, and the same rationale can be applied to other content filtering applications as well, such as anti-spam. The method can also be combined with features other than keywords from the text to further increase the overall accuracy of the content filter. Besides, the filtering can be further accelerated by combining the URL-based method with cached results; that is, by caching the URLs of the filtered Web pages, duplicate filtering of the same Web page can be avoided.

IV. CONSTRUCTION
This section briefly describes the implementation of the New Web Content Filter, a mechanism that makes a filtering decision on a web page's content. A massive volume of Internet content is widely accessible now-a-days, and one can easily view improper content at will without access control. For example, a student may watch social websites during laboratory hours in a school or college. Web filtering products can enforce access control, and up-to-date products have widely adopted content analysis besides the URL-based approach. This filter first analyzes the Web content into a certain category, and then makes the filtering decision, either to block or to pass the content. The filtering work presented here is simple but effective: the filtering decision is made by scanning through the entire content of the web page, as soon as the content can be analyzed into a certain category.

The content of the web page is obtained from its corresponding URL. The filtering process neglects some words, called "stop words"; they are single-letter words and prepositions. The category of the web content is analyzed from the keywords present in the content. According to this investigation, the keyword search over the web content is a synonymic search: the filter also considers the synonym words corresponding to the keywords. The keywords are additionally searched inside the HTML tags, because unwanted or illegal content may be inserted into the web pages' HTML tags by their creators, which can lead to illegal offenses.

The filter calculates the number of words in the web content (B) and the number of times that the keywords appear in the web content (A). The categorical value (CV) of the web page is derived by the simple formula CV = A/B. The filter has a default boundary value against which the web page's category is analyzed: when the categorical value is above the boundary value, the web page belongs to the banned category; otherwise it belongs to the allowable category. Here, there is no need to maintain a URL list in a database because, day by day, the contents of the web pages at a single URL may change. A fast decision is particularly important since most Web content is normally allowable and should pass the filter as soon as possible.

V. COMPARISON
The following Fig. 3 shows the differences between the early decision algorithm and the new web content filter.

| An Early Decision Algorithm | New Web Content Filter |
| Web contents are taken from directories (Yahoo). | Web contents are taken from the URL in real time (online or offline). |
| The filtering process scans only the front part of the Web content. | The filtering process scans the entire web content. |
| It searches for the desired keywords only. | It searches for the desired keywords and also their corresponding synonym words. |
| HTML words are ignored. | HTML words are considered, because unwanted or illegal content may be inserted into the web pages' HTML tags by their creators, which can lead to illegal offenses. |
| The URLs of the filtered Web pages are cached, so duplicate filtering of the same Web page can be avoided; content analysis can be skipped if the cached URL is matched. The maintenance of the URL list is also facilitated. | There is no need to maintain a URL list in a database because, day by day, the contents of the web pages at a single URL may change. |
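The construction described in Section IV amounts to a handful of lines of code. The sketch below is a minimal rendering of it: the keyword list, synonym map, stop-word set, and boundary value are illustrative assumptions, not the author's actual configuration, and a real deployment would first fetch the page text from its URL. The final loop re-applies the same CV = A/B rule to the (B, A) pairs reported in Fig. 4:

```python
import re

# Sketch of the proposed filter: CV = A / B, where B is the number of
# words in the content and A is the number of keyword (or synonym) hits.
# Words inside HTML tags are deliberately kept, unlike in the early
# decision algorithm. All concrete values here are illustrative.

STOP_WORDS = {"a", "an", "i", "in", "of", "on", "the", "to"}
BOUNDARY_VALUE = 0.1  # default BV, assumed to be set by the admin

def categorize(html, keywords, synonyms):
    # Tag names and attribute values count as words too: the filter
    # scans HTML tags instead of stripping them.
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", html)]
    words = [w for w in words if w not in STOP_WORDS]
    targets = {k.lower() for k in keywords}
    for k in keywords:  # synonymic search
        targets.update(s.lower() for s in synonyms.get(k, []))
    b = len(words)                                 # B: total words
    a = sum(1 for w in words if w in targets)      # A: keyword hits
    cv = a / b if b else 0.0
    return ("Banned" if cv > BOUNDARY_VALUE else "Allowed"), cv

verdict, cv = categorize(
    '<p class="gamble">play poker and wager now, bet big</p>',
    ["bet"],
    {"bet": ["wager", "poker", "gamble"]},
)
print(verdict, round(cv, 4))  # 4 hits out of 11 words: Banned

# The same rule reproduces the verdicts of Fig. 4
# ((B, A) pairs taken from that table):
for b, a in [(435, 19), (830, 35), (690, 78), (687, 98), (348, 24)]:
    print("Banned" if a / b > BOUNDARY_VALUE else "Allowed")
```

Note that the synonym hit in the `class` attribute is counted, illustrating why this filter scans inside HTML tags rather than discarding them.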
Fig 3: Differences between An Early Decision Algorithm and the New Web Content Filter

VI. EXPERIMENTS AND RESULTS
As a sample, a total of 5 pages were taken at random and their page content was extracted through the corresponding URLs in real time (online or offline). Some keywords were selectively supplied to the filter for the process. (Note: in this sample investigation the selected keywords are assumed to be bad keywords that should ban the pages.)

The following Fig. 4 presents the overall values for the given pages.

| S.no | No. of keywords given to filter | No. of words present in the web page (B) | No. of times the keywords appeared in the page (A) | Category value of the web page (CV = A/B) | Boundary value (BV) assumed by admin | Result (Banned / Allowed) |
| Page 1 | 5 | 435 | 19 | 0.0436 | 0.1 | Allowed |
| Page 2 | 9 | 830 | 35 | 0.0421 | 0.1 | Allowed |
| Page 3 | 12 | 690 | 78 | 0.1130 | 0.1 | Banned |
| Page 4 | 14 | 687 | 98 | 0.1426 | 0.1 | Banned |
| Page 5 | 7 | 348 | 24 | 0.0689 | 0.1 | Allowed |

Fig 4: Overall values for the filtered web pages

The following Fig. 5 shows the graphical representation of the results obtained from the above table.

Fig 5: Graphical representation of the results

As the graphical representation shows, five web pages are involved in this investigation. Pages 1, 2 & 5 are allowed to be accessed because their categorical values are below the boundary value, while Pages 3 & 4 are banned because their categorical values are above the boundary value.

VII. CONCLUSION
This paper presents a detailed description of the implementation of a new web content filter: its basic idea, drawn from "An Early Decision Algorithm"; its background, drawn from "Web Content Filtering Techniques: A Survey"; the construction of the new filter; the differences between the early decision algorithm and the new web content filter; and the experiments and their results, concluding with the acknowledgement and references of this work. The filtering performance of this filter is more efficient than that of the earlier filters.

VIII. ACKNOWLEDGEMENT
My sincere thanks to Dr. M.Thangaraj M.Tech., Ph.D., Associate Professor, Department of Computer Science, Madurai Kamaraj University, for his keen interest in encouraging this work.

REFERENCES
[1] T. Almeida, A. Yamakami, and J. Almeida, "Filtering spams using the minimum description length principle," in Proceedings of the 25th ACM Symposium On Applied Computing, Sierre, Switzerland, March 2010, pp. 1–5.
[2] T. Almeida, A. Yamakami, and J. Almeida, "Probabilistic anti-spam filtering with dimensionality reduction," in Proceedings of the 25th ACM Symposium On Applied Computing, Sierre, Switzerland, March 2010, pp. 1–5.
[3] T. Almeida, A. Yamakami, and J. Almeida, "Evaluation of approaches for dimensionality reduction applied with naive bayes anti-spam filters," in Proceedings of the 8th IEEE International Conference on Machine Learning and Applications, Miami, FL, USA, December 2009, pp. 1–6.
[4] I. Androutsopoulos, G. Paliouras, and E. Michelakis, "Learning to filter unsolicited commercial e-mail," National Centre for Scientific Research "Demokritos", Athens, Greece, Tech. Rep. 2004/2, March 2004.
[5] A. Arasu and H. Garcia-Molina, "Extracting structured data from Web pages," in Proceedings of SIGMOD 2003.
[6] A. Bratko, G. Cormack, B. Filipic, T. Lynam, and B. Zupan, "Spam filtering using statistical data compression models," Journal of Machine Learning Research, vol. 7, pp. 2673–2698, 2006.
[7] J. Cao, B. Mao, and J. Luo, "A segmentation method for web page analysis using shrinking and dividing," International Journal of Parallel, Emergent and Distributed Systems, vol. 25, no. 2, pp. 93–104, 2010.
[8] J. Carpinter and R. Hunt, "Tightening the net: A review of current and next generation spam filtering tools," Computers and Security, vol. 25, no. 8, pp. 566–578, 2006.
[9] G. Cormack, "Email spam filtering: A systematic review," Foundations and Trends in Information Retrieval, vol. 1, no. 4, pp. 335–455, 2008.
[10] G. Cormack and T. Lynam, "Online supervised spam filter evaluation," ACM Transactions on Information Systems, vol. 25, no. 3, pp. 1–11, 2007.
[11] Z. Chen, O. Wu, M. Zhu, and W. Hu, "A novel web page filtering system by combining texts and images," in WI '06: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, Washington, DC, pp. 732–735, IEEE Computer Society, 2006.
[12] M. Dontcheva, S. Drucker, D. Salesin, and M. F. Cohen, "Changes in Webpage Structure over Time," TR2007-04-02, UW CSE, 2007.
[13] R. Du, R. Safavi-Naini, and W. Susilo, "Web filtering using text classification," in The 11th IEEE International Conference on Networks (ICON 2003), pp. 325–330, 2003.
[14] T. Guzella and W. Caminhas, "A review of machine learning approaches to spam filtering," Expert Systems with Applications, 2009, in press.
[15] V.K.T.Karthikeyan, "Web Content Filtering Techniques: A Survey," IJCSET 14-05-03-038, pp. 203–208.
[16] J. K. Kim and S. H. Lee, "An empirical study of the change of Webpages," in APWeb '05, pp. 632–642, 2005.
[17] S. H. Kwon, S. H. Lee, and S. J. Kim, "Effective criteria for Webpage changes," in Proceedings of APWeb '06, pp. 837–842, 2006.
[18] D. Losada and L. Azzopardi, "Assessing multivariate bernoulli models for information retrieval," ACM Transactions on Information Systems, vol. 26, no. 3, pp. 1–46, June 2008.
[19] V. Metsis, I. Androutsopoulos, and G. Paliouras, "Spam filtering with naive bayes - which naive bayes?" in Proceedings of the 3rd International Conference on Email and Anti-Spam, Mountain View, CA, USA, July 2006, pp. 1–5.
[20] Neha Gupta and Saba Hilal, "Algorithm to filter and redirect the web content for kids," IJET 13-05-01-024.
[21] K. Schneider, "On word frequency information and negative evidence in naive bayes text classification," in Proceedings of the 4th International Conference on Advances in Natural Language Processing, Alicante, Spain, October 2004, pp. 474–485.
[22] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1–47, March 2002.
[23] A. Seewald, "An evaluation of naive bayes variants in content-based learning for spam filtering," Intelligent Data Analysis, vol. 11, no. 5, pp. 497–524, 2007.
[24] Ying-Dar Lin, Po-Ching Lin, and Yuan-Cheng Lai, "An Early Decision Algorithm to Accelerate Web Content Filtering," IEICE Transactions on Information and Systems, vol. E91-D.