8. Progress Report Presentation (Clickbait Detection System)
Introduction
NumPy
• The first library is NumPy, an open-source library that is extremely popular in the machine
learning community. It is mainly used for mathematical formulas and calculations in
machine learning applications. NumPy is also used for cleaning data sets and
handling missing (null) values in data cells.
• NumPy is the fundamental package for scientific computing with Python. It contains among
other things:
- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities
• Besides its obvious scientific uses, NumPy can also be used as an efficient multi-
dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy
to seamlessly and speedily integrate with a wide variety of databases.
• NumPy is licensed under the BSD license, enabling reuse with few restrictions.
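The points above about missing values and broadcasting can be illustrated with a minimal sketch. The feature matrix and its values are hypothetical and are not taken from the project's data set.

```python
import numpy as np

# Hypothetical feature matrix: 3 headlines x 2 numeric features,
# with one missing value encoded as NaN.
features = np.array([[12.0, 3.0],
                     [np.nan, 5.0],
                     [8.0, 1.0]])

# Replace each NaN with the mean of its column, ignoring NaNs.
col_means = np.nanmean(features, axis=0)            # shape (2,)
cleaned = np.where(np.isnan(features), col_means, features)

# Broadcasting: the (2,) vector of column means is subtracted from
# every row of the (3, 2) matrix without an explicit loop.
centered = cleaned - cleaned.mean(axis=0)
```

Mean imputation is only one possible way to fill missing cells; it is used here because it keeps the example short.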
Pandas
• The second library is Pandas, one of the most widely used tools for data munging. It contains
high-level data structures and manipulation tools designed to make data analysis fast and easy.
• It is fast, powerful, flexible, and easy to use, which makes it well suited to data analysis
and manipulation. Another benefit of Pandas is that it is open source.
• The Pandas library provides a fast and efficient way to manage and explore data. It
does that by providing Series and DataFrames, which help us not only to
represent data efficiently but also to manipulate it in various ways. These features are
exactly what make Pandas such an attractive library for data scientists.
• Labeling of data is of utmost importance. Another important factor is organization,
without which data would be impossible to read. These two needs, organization and
labeling of data, are handled by the alignment and indexing methods
found within Pandas.
• Raw data is often messy, and one of the most common problems is the
occurrence of missing values. It is therefore important to handle missing
values properly so that they do not distort the study results. Pandas
has this covered, because handling of missing values is built into the
library.
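The Pandas features described above (Series and DataFrames, index alignment, and missing-value handling) can be sketched as follows. The column names and toy rows are hypothetical and only serve to illustrate the calls.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame(
    {
        "headline": ["You won't believe this trick", None,
                     "Budget passed by parliament"],
        "clickbait": [1, 0, 0],
    },
    index=["a", "b", "c"],
)

# Missing-value handling: count missing cells, then drop incomplete rows.
missing_per_column = df.isna().sum()
cleaned = df.dropna(subset=["headline"])

# Vectorized string operations on a Series: number of words per headline.
lengths = cleaned["headline"].str.split().str.len()

# Index alignment: values are matched by label, not by position.
s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([10, 20], index=["b", "c"])
total = s1.add(s2, fill_value=0)
```

Dropping rows is the simplest policy; `fillna` would be an alternative when rows cannot be discarded.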
This technology is used in this project for the text cleaning process. The Natural Language Toolkit (NLTK) Python library
provides built-in stop-word lists that can be used to remove stop words. This can be done in several steps.
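A minimal sketch of stop-word removal is shown here. NLTK exposes its English list through `nltk.corpus.stopwords.words("english")` after a one-time `nltk.download("stopwords")`; to keep this sketch runnable without a download, a small hand-written set stands in for that list. The set, the function name `clean_headline`, and the sample headline are assumptions made for illustration.

```python
import re

# Hand-written stand-in for a real stop-word list (an assumption, so the
# sketch runs without downloading any corpus).
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "this", "you", "in", "and"}


def clean_headline(text):
    """Lowercase the text, tokenize it, and drop stop words."""
    # Tokens are runs of lowercase letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


tokens = clean_headline("You Won't Believe the Size of This Pumpkin!")
```

For clickbait detection specifically, words such as "you" and "this" may carry signal, so whether to remove them is a design choice worth checking against validation results.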
Word Embedding:
Word embedding is also used in this project, and the method used is word2vec embedding. The
first layer of the CNN embeds the words into low-dimensional vectors.
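An embedding layer is essentially a table lookup: each token is mapped to an index, and the index selects a row of an embedding matrix. The sketch below uses a hypothetical five-word vocabulary and random 8-dimensional vectors; the pretrained word2vec Google News vectors are 300-dimensional and would replace the random matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary; index 0 is padding, index 1 is "unknown word".
vocab = {"<pad>": 0, "<unk>": 1, "believe": 2, "size": 3, "pumpkin": 4}
embedding_dim = 8  # assumption for illustration

# Randomly initialized embedding matrix, one row per vocabulary entry.
embeddings = rng.normal(scale=0.1, size=(len(vocab), embedding_dim))


def embed(tokens, max_len=6):
    """Map tokens to ids, pad or truncate to max_len, and look up vectors."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens][:max_len]
    ids += [vocab["<pad>"]] * (max_len - len(ids))
    return embeddings[np.array(ids)]


matrix = embed(["believe", "size", "pumpkin", "soup"])  # shape (6, 8)
```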
Natural Language Processing:
Natural language processing (NLP) helps computers understand unstructured text and retrieve meaningful
pieces of information from it. NLP is a sub-field of artificial intelligence that deals with
the interactions between computers and human (natural) language.
MODEL USED:
CNN MODEL
We use a simple CNN with one convolution layer. The figure shows a
graphical representation of the complete model. The CNN is based
on the architecture of Kim [9]. The first layer of the CNN embeds
the words into low-dimensional vectors. For word embeddings we use two
variants: word embeddings that are learnt from scratch, and word embeddings
that are initialized from an unsupervised neural language model and keep evolving
as training occurs. This technique of initializing word vectors from an unsupervised
neural language model has been shown to improve performance. We use the word
vectors trained by Mikolov, Chen, Corrado and Dean on 100 billion words of
Google News. These vectors are publicly available as word2vec.
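The forward pass of a one-convolution-layer text CNN can be sketched in plain NumPy. This is a simplified illustration, not the project's implementation: it uses a single filter width, random untrained weights, and a sigmoid output, and all sizes and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, embed_dim = 10, 8        # hypothetical sizes
filter_width, n_filters = 3, 4    # hypothetical sizes

x = rng.normal(size=(seq_len, embed_dim))   # stand-in for an embedded headline
W = rng.normal(scale=0.1, size=(n_filters, filter_width, embed_dim))
b = np.zeros(n_filters)
w_out = rng.normal(scale=0.1, size=n_filters)
b_out = 0.0


def forward(x):
    """Convolution -> ReLU -> max-over-time pooling -> sigmoid output."""
    n_windows = x.shape[0] - filter_width + 1
    feature_maps = np.empty((n_windows, n_filters))
    for i in range(n_windows):
        window = x[i:i + filter_width]       # (filter_width, embed_dim)
        # Each filter produces one value per window of consecutive words.
        feature_maps[i] = np.tensordot(W, window, axes=([1, 2], [0, 1])) + b
    feature_maps = np.maximum(feature_maps, 0.0)   # ReLU
    pooled = feature_maps.max(axis=0)              # strongest response per filter
    logit = pooled @ w_out + b_out
    return 1.0 / (1.0 + np.exp(-logit))            # probability-like score


p_clickbait = forward(x)
```

Max-over-time pooling keeps one value per filter regardless of headline length, which is why headlines of different lengths can share the same output layer.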
CNN Model Used
MODEL DESIGN
Future Scope
1. Finding the features that the model has learnt
and identifying the most important ones.
2. Gathering more data for developing better models.
3. Building a server-backed web browser plugin that can harness the power of this model and
alert the user about clickbait on the page.
CONCLUSION
The nuisance of clickbait keeps increasing in online
media. To curb it, we collected data from multiple sources and created a new corpus of clickbait and
non-clickbait headlines. We then developed a CNN-based deep learning model that performs strongly
on the classification of headlines into clickbait and non-clickbait categories. We achieved an
accuracy of 0.90, along with a precision of 0.85 and a recall of 0.88 on the clickbait class. We aim to
make this model and the corpus available for further use.
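For reference, accuracy, precision, and recall on the clickbait class can be computed from true and predicted labels as sketched below. The label lists are made-up toy values and do not reproduce the figures reported above; the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Toy labels: 1 = clickbait, 0 = non-clickbait (hypothetical values).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]
accuracy, precision, recall = binary_metrics(y_true, y_pred)
```

Precision is the share of headlines flagged as clickbait that really are clickbait, while recall is the share of actual clickbait headlines that the model catches.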
REFERENCES
[1] K. El-Arini and J. Tang, “News feed fyi: Click-baiting,” 2014. [Online]. Available:
http://newsroom.fb.com/news/2014/08/news-feedfyi-click-baiting/
[2] J. C. dos Reis, F. Benevenuto, P. O. S. V. de Melo, R. O. Prates, H. Kwak, and J. An, “Breaking the
news: First impressions matter on online news,” in Proceedings of ICWSM 2015.
[3] G. J. Digirolamo and D. L. Hintzman, “First impressions are lasting impressions: A primacy effect in
memory for repetitions,” Psychonomic Bulletin & Review, vol. 4, no. 1,
[4] D. J. Dooling and R. Lachman, “Effects of comprehension on retention of prose,” Journal of
Experimental Psychology, vol. 88, no. 2, pp. 216–222, 1971.
[5] G. Loewenstein, “The psychology of curiosity: A review and reinterpretation,” Psychological
Bulletin, vol. 116, no. 1, pp. 75–98, July 1994.
[6] B. Gardiner, “You'll be outraged at how easy it was to
get you to click on this headline,” 2015. [Online]. Available:
http://www.wired.com/2015/12/psychology-of-clickbait/
[7] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep
convolutional networks,” in Proceedings of NIPS 2012.
[8] A. Graves, A. Mohamed, and G. E. Hinton, “Speech recognition with deep
recurrent neural networks,” in Proceedings of ICASSP 2013.
[9] Y. Kim, “Convolutional neural networks for sentence classification,” in
Proceedings of EMNLP 2014.
Project by: