
Cluster based Analysis for Google YouTube Videos Viewers

*Rahul Deo Sah1, Neelamadhab Padhy2, Raja Ram Dutta3


1 Dr. Shyama Prasad Mukherjee University, Ranchi
Email ID: [email protected]
2 GIET University, Gunupur
Email ID: [email protected]
3 Email ID: [email protected]

Abstract: In the modern scenario there are a huge number of internet users, and nearly everyone is familiar with Google's YouTube apps. Online video is especially interesting as a potential vector for social communication, and video is also able to capture social experiences. YouTube is a platform where audio and video are viewed by users: anyone who wants to watch audio and video material of interest visits YouTube, where there is no bar of age or gender and everyone searches according to their own choice. YouTube hosts large collections of videos of different types, and in this regard large numbers of viewers watch different channels and videos; a great part of YouTube's users watch their favourite channels and videos. A previous research paper, "Data Mining Techniques for Videos Subscribers Google YouTube", applied data mining classifiers and reported their results. In this research paper we analyse YouTube video views with clustering and achieve a cluster quality of 0.75.

Keywords: data mining; clustering; k-means; unsupervised feature selection

1. Introduction

Classification proceeds based on classifier selection on a dataset; this work builds on a clustering-based classifier selection method. In this method many clusters are selected for the ensemble process, then the standard performance of each classifier on the selected cluster is calculated, and the classifier with the best average performance is used. Weight values are calculated according to the distances between the given data and each selected cluster. Online video, a universal, visual, and highly shareable medium, is well suited to crossing geographic, social, and semantic barriers. Trending videos in particular, by virtue of reaching a large number of viewers in a short span of time, are powerful both as influencers and as indicators of global communication flows. But are new communication technologies really being used to share ideas globally, or are they simply reflecting pre-existing social channels? Furthermore, how do social, political, and geographic factors affect global communication? By analysing usage data from digital communication platforms, we can begin to answer these questions. In this paper, we focus on trending data from the YouTube video-sharing platform to examine the global consumption of online video.
2. Literature Survey
Rahul Deo Sah [1] proposed a system based on extracting a set of features from handwriting samples of male and female writers and training classifiers to learn to distinguish between the two. Images are segmented using the Otsu thresholding algorithm, and writing attributes such as slant, curvature, texture, and legibility are evaluated by computing local and global features. Osama Abu Abbas [2] presented a paper comparing data clustering algorithms: k-means, the self-organizing map (SOM) algorithm, a hierarchical clustering algorithm, and the expectation-maximization (EM) clustering algorithm. These four algorithms were chosen for their popularity, flexibility, and applicability to high-dimensional data, and were compared on factors including the size of the dataset, the type of dataset, the number of clusters, and the software used. The experimental results show that k-means and EM give better results than hierarchical clustering, while the self-organizing map shows better accuracy than both k-means and EM. Dr. S.K. Jayanthi et al. [3] proposed a clustering approach for the classification of research articles based on keyword search; k-means, hierarchical clustering, and fuzzy C-means clustering algorithms are used, and the experimental analysis shows that fuzzy C-means gives better results than k-means and hierarchical clustering. Bhagyashree Pathak et al. [4] presented a survey of clustering methods covering partition clustering, hierarchical clustering, density-based clustering, grid-based clustering, model-based clustering, and soft-computing clustering. The paper observes that hierarchical clustering (agglomerative or divisive) gives better results than partitioning clustering (the k-means and k-medoid algorithms) and that soft-computing techniques give better results on large datasets. Mythili S et al. [5] presented a research article on an analysis of clustering algorithms in data mining: a broad survey of different clustering techniques and their issues with respect to accuracy and complexity on large datasets. Mythili et al. [6] presented a paper giving an overview of clustering algorithms along with their advantages and disadvantages; the clustering methods studied are partitioning clustering, hierarchical clustering, and density-based clustering. Amandeep Kaur Mann [8] discussed the different data mining techniques used in cloud computing, which would help in evaluating possible software services on the cloud by using clustering; the paper concludes that the k-means algorithm is more efficient than the remaining algorithms and is suitable for large databases. Tamilkili M. [7] presented a paper on various clustering techniques, namely partitioning, density-based, hierarchical, model-based, and constraint-based techniques, along with their specialities, advantages, and disadvantages. Madura Phatak et al. [9] proposed new software using clustering-based and classification-based Knowledge Discovery in Databases (KDD), concluding that clustering-based knowledge discovery is suitable for larger datasets but that the software is more complicated. Mihika Shah et al. [10] presented a paper discussing various algorithms such as k-means, providing a broad survey of the most basic techniques such as hierarchical and partitioning algorithms. De-anonymization of Bitcoin addresses can likewise be considered a clustering problem; clustering is an important class of unsupervised learning problems [14] that focuses on splitting data into groups. Hierarchical clustering merges or splits similar data objects by constructing a hierarchy of clusters, also known as a dendrogram, forming clusters progressively [11]. Divisive clustering is a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy [12]. In density-based clustering, a cluster is a dense region of points separated from other tightly dense regions by low-density regions; such algorithms can be used when the clusters are irregular, and they find core objects, i.e. objects that have dense neighbourhoods [13].
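The bottom-up (agglomerative) hierarchical clustering surveyed above can be sketched in a few lines: every point starts as its own cluster and the two closest clusters are merged repeatedly until the desired number remains. The single-linkage merge rule and the toy points below are illustrative choices, not taken from the paper.

```python
# A minimal sketch of agglomerative hierarchical clustering: start from
# singleton clusters and merge the two closest clusters step by step.
import math

def single_linkage(points, k):
    clusters = [[p] for p in points]

    def dist(a, b):
        # single-linkage distance: closest pair of members across clusters
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > k:
        # find and merge the closest pair of clusters
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return clusters
```

Cutting the merge process at k = 2 on two well-separated groups of points recovers the two groups; a divisive ("top-down") variant would instead start from one cluster and split recursively.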
3. Research Methodology
Clustering methods are applied when there is no class to be predicted but rather the instances are to be divided into natural groups. These clusters presumably reflect some mechanism at work in the domain from which the instances are drawn, a mechanism that causes some instances to bear a stronger resemblance to each other than to the remaining instances. An algorithm that gives good results with one type of data may give poor results with a dataset of another kind. A clustering algorithm should satisfy the following requirements:
1) Scalability: the algorithm must scale to the data, otherwise we may get wrong results.
2) The clustering algorithm should be able to handle different types of instances.
3) The clustering algorithm should be able to discover clusters of arbitrary shape.
4) The clustering should be insensitive to noise and outliers.
5) Interpretability and usability: the results obtained should be interpretable and usable, so that maximum insight about the input parameters can be obtained.
6) The clustering algorithm should be able to cope with datasets of high dimensionality.
Clustering algorithms can be broadly grouped into two categories:
1) Unsupervised cubic clustering algorithms.
2) Unsupervised non-cubic clustering algorithms.
The k-means clustering algorithm is an unsupervised cubic algorithm. SimpleKMeans clusters data using k-means; the number of clusters is specified by a parameter. The user can choose between the Euclidean and Manhattan distance metrics. In the latter case the algorithm is actually k-medians instead of k-means, and the centroids are based on medians rather than means in order to minimize the within-cluster distance function. Running SimpleKMeans on the weather data with the default options uses two clusters and Euclidean distance. The result of clustering is shown as a table with rows that are attribute names and columns that correspond to the cluster centroids; an additional cluster at the beginning shows the entire dataset.
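The remark that the Manhattan-distance variant is really k-medians can be checked numerically: the median minimizes the total absolute (Manhattan) distance within a cluster, while the mean minimizes the total squared (Euclidean) distance. A small sketch with made-up values:

```python
# Why k-medians uses medians: on a 1-D cluster, compare the total
# absolute distance and the total squared distance for the mean and the
# median as candidate centres. The values are illustrative.
values = [1.0, 2.0, 3.0, 10.0]

mean = sum(values) / len(values)                       # 4.0
mid = sorted(values)[len(values) // 2 - 1:len(values) // 2 + 1]
median = sum(mid) / len(mid)                           # 2.5 (even-length list)

def total_abs(centre):
    # summed Manhattan (absolute) distance to a candidate centre
    return sum(abs(v - centre) for v in values)

def total_sq(centre):
    # summed squared Euclidean distance to a candidate centre
    return sum((v - centre) ** 2 for v in values)

# the median wins on absolute distance, the mean on squared distance
assert total_abs(median) < total_abs(mean)     # 10.0 < 12.0
assert total_sq(mean) < total_sq(median)       # 50.0 < 59.0
```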

J(V) = Σ_{i=1}^{c} Σ_{j=1}^{c_i} (||x_i − v_j||)²

where,
'||x_i − v_j||' is the Euclidean distance between x_i and v_j,
'c_i' is the number of data points in the i-th cluster, and
'c' is the number of cluster centers.

Algorithmic steps for k-means clustering

Let X = {x_1, x_2, x_3, …, x_n} be the set of data points and V = {v_1, v_2, …, v_c} be the set of cluster centers.
1) Randomly select 'c' cluster centers.
2) Calculate the distance between each data point and each cluster center.
3) Assign each data point to the cluster center whose distance from it is the minimum over all cluster centers.
4) Recalculate each new cluster center using:

V_i = (1/c_i) Σ_{j=1}^{c_i} x_j

where 'c_i' represents the number of data points in the i-th cluster.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned, stop; otherwise repeat from step 3).
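The steps above can be sketched as a minimal pure-Python k-means; the toy points and the fixed random seed are illustrative assumptions, not the paper's data.

```python
# A minimal pure-Python k-means: randomly pick 'c' centres, assign every
# point to its nearest centre, recompute each centre as the mean of its
# assigned points, and repeat until no point changes cluster.
import math
import random

def k_means(points, c, seed=0):
    rng = random.Random(seed)
    centres = rng.sample(points, c)          # randomly select c centres
    assignment = [None] * len(points)
    while True:
        # assign each point to its nearest centre (Euclidean distance)
        new_assignment = [
            min(range(c), key=lambda i: math.dist(p, centres[i]))
            for p in points
        ]
        if new_assignment == assignment:     # no point reassigned: stop
            return centres, assignment
        assignment = new_assignment
        # recompute each centre as the mean of its members
        for i in range(c):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centres[i] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )

two_blobs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
             (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centres, labels = k_means(two_blobs, 2)
```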
4. Algorithmic Data Model for the YouTube Dataset (1003 rows and 06 columns; the table contains rank, grade, channel name, video uploads, subscribers and video views)
1. Start.
2. Retrieve the data from the public domain (Kaggle).
3. Pre-process the data.
4. Replace the missing values.
5. Select the attributes.
6. Choose the different branches.
7. Generate stratified samples.
8. Handle text attributes, if present.
9. Remove useless attributes.
10. Normalize the dataset.
11. Recode the attributes.
12. Multiply with the unsupervised feature set.
13. Then apply it with these feature sets.
14. Cluster the data.
15. Multiply it again.
16. Build the model for visualization.
17. Sort and finalize the data.
18. Generate the final attributes.
19. Set the role for them.
20. Make the decision.
21. End.
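Two of the steps above, replacing missing values (step 4) and normalizing the dataset (step 10), can be sketched on a toy table shaped like the paper's dataset; the channel rows, the mean-imputation choice, and the min-max scaling are assumptions for illustration.

```python
# Sketch: mean-impute missing numeric values, then min-max normalize
# every numeric column to [0, 1]. The rows imitate the paper's columns
# (channel name, video uploads, subscribers, video views); values are
# made up.
rows = [
    {"channel": "A", "uploads": 100, "subscribers": 5000, "views": 200000},
    {"channel": "B", "uploads": None, "subscribers": 1200, "views": 90000},
    {"channel": "C", "uploads": 40, "subscribers": 800, "views": 30000},
]
numeric = ["uploads", "subscribers", "views"]

# Replace each missing value with the mean of the values that are present.
for col in numeric:
    present = [r[col] for r in rows if r[col] is not None]
    mean = sum(present) / len(present)
    for r in rows:
        if r[col] is None:
            r[col] = mean

# Min-max normalize every numeric column to the range [0, 1].
for col in numeric:
    lo = min(r[col] for r in rows)
    hi = max(r[col] for r in rows)
    for r in rows:
        r[col] = (r[col] - lo) / (hi - lo)
```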

Fig.1 Representation of Data tables

This section shows generic information which is independent of the models.

 Data: the data set after it has been transformed for modeling.
 Text: only shown if feature extraction from text data was activated. Shows the words in the text
columns which are used for the analysis as a table and as a word cloud. In addition, we can inspect
all the training and scoring documents where those words have been highlighted. Finally, if we
activated the calculation of sentiment or language, we can inspect the distribution of those values
for all our text columns as well.
Fig.2 Optimal features sets for k-Means Clustering

The plot on the left shows the result of the feature selection run. Each point represents a different
feature set, i.e. a subset of the original columns. A feature set could, for example, have a complexity
of 5 and achieve a cluster quality of 0.75. Please note that this quality measure is the Davies-Bouldin index of the clustering, and smaller values are better. Unlike for classification or
regression, the goal of clustering is to describe the data. Therefore, we want to stay as close to the
original data as possible and only remove noise in the data. Typically, the most meaningful results
can be found in the middle area of the Pareto front on the left. The original feature space is shown
as square and is typically in the top right corner. Using fewer features will improve the cluster
quality, but may no longer accurately describe the underlying patterns. We will find those features
toward the bottom left corner. The feature set which has been used to build the final model is shown
bigger.
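The cluster quality quoted above is the Davies-Bouldin index, for which smaller values are better. A minimal sketch of how it is computed: for each cluster, take the worst-case ratio of summed within-cluster scatter to between-centroid separation, then average over the clusters. The input format (a list of clusters, each a list of points) is an assumption for illustration.

```python
# Sketch of the Davies-Bouldin index (lower is better).
import math

def davies_bouldin(clusters):
    # centroid of each cluster: coordinate-wise mean of its points
    centroids = [
        tuple(sum(c) / len(pts) for c in zip(*pts)) for pts in clusters
    ]
    # s_i: mean distance of cluster i's members to its own centroid
    scatter = [
        sum(math.dist(p, centroids[i]) for p in pts) / len(pts)
        for i, pts in enumerate(clusters)
    ]
    k = len(clusters)
    # for each cluster, worst ratio against any other cluster; then average
    return sum(
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k) if j != i
        )
        for i in range(k)
    ) / k
```

Two tight, well-separated clusters score low: for example, two unit squares of points centred at (0.5, 0.5) and (10.5, 10.5) give an index of exactly 0.1.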

Fig.3 Scatter Plot for Cluster 0 and Scatter Plot for Cluster 1

Fig.4 Centroid Chart for 3 Clusters

Fig.5 Centroid Table for 3 Clusters (Cluster 0, Cluster 1, Cluster 2)

Fig.6 Decision Tree for different Clusters (Cluster 0, Cluster 1, Cluster 2)

Fig.7 Correlation Matrices for different attributes

Fig.8 Summary of the running model


Fig. 8 above shows that the 1003 records were divided into 3 clusters, together with the size of each found cluster and some information about the clusters and their quality. Cluster 0 has size 32 and average distance 6.120; its video uploads are on average 1,453.01% larger and its video views are on average 43.08% larger. Cluster 1 has size 100 and average distance 2.510; its video views are on average 312.09% larger and its video uploads are on average 20.17% smaller. Cluster 2 has size 871 and average distance 0.277; its video uploads are on average 51.07% larger and its video views are on average 37.41% smaller.
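The per-cluster summaries above (cluster size, and how far an attribute's cluster mean sits from the overall mean, reported as "x% larger/smaller") can be reproduced with a small helper; the upload counts below are illustrative, not the paper's data.

```python
# Illustrative helper for the cluster summaries: size of the cluster and
# the percentage deviation of its attribute mean from the overall mean
# (positive means "larger", negative means "smaller").
def summarize(cluster_values, all_values):
    overall_mean = sum(all_values) / len(all_values)
    cluster_mean = sum(cluster_values) / len(cluster_values)
    pct = (cluster_mean / overall_mean - 1.0) * 100.0
    return len(cluster_values), pct

uploads_all = [100, 120, 80, 500, 520]     # uploads of every channel
cluster_uploads = [500, 520]               # uploads of one cluster's channels
size, pct = summarize(cluster_uploads, uploads_all)
# size == 2; pct is about 93.2, i.e. uploads in this cluster are ~93% larger
```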

5. Conclusion

All other sections in the results menu are reserved for the cluster models. Each cluster model gets a section of its own and in general provides the following entries:
 Summary: shows the size of all found clusters together with some information about the clusters and their quality.
 Heat Map: identifies the most important attributes for each cluster.
 Cluster Tree: displays a decision tree describing the main differences between the clusters.
 Centroid Chart: shows the values for the cluster centroids in a parallel chart.
 Centroid Table: shows the values for the cluster centroids in a table.
 Scatter Plot: with a choice of the cluster, displays a scatter plot in terms of the two most important attributes.
 Clustered Data: displays a table with all the data, including the cluster label for each data point.
 Feature Sets: only shown if feature selection was activated; shows all optimal trade-offs between feature-set complexities and clustering qualities. We can select any of the points in the trade-off plot and see the specific feature sets at the bottom.
 Data: the dataset after it has been transformed for modeling.
 Text: only shown if feature extraction from text data was activated; shows the words in the text columns which are used for the analysis as a table and as a word cloud. In addition, we can inspect all the training and scoring documents where those words have been highlighted. Finally, if we activated the calculation of sentiment or language, we can inspect the distribution of those values for all our text columns as well.
 Correlations: a matrix showing the correlations.
In this work we analysed YouTube video views with clustering and achieved a cluster quality of 0.75.

6. References

[1] Rahul Deo Sah, "Review of Medical Disease Symptoms Prediction Using Data Mining Technique", IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 19, Issue 3, Ver. I (May-June 2017), pp. 59-70.
[2] Osama Abu Abbas, "Comparisons between Data Clustering Algorithms", The International Arab Journal of Information Technology, Vol. 5, No. 3, July 2008.
[3] Dr. S.K. Jayanthi, C. Kavi Priya, "Clustering Approach for Classification of Research Articles Based on Keyword Search", International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), Volume 7, Issue 1, January 2018, ISSN: 2278-1323.
[4] Bhagyashree Pathak, Nilanjan Lal, "A Survey on Clustering Methods in Data Mining", International Journal of Computer Applications (0975-8887), Volume 159, No. 2, February 2017.
[5] Mythili S et al., International Journal of Computer Science and Mobile Computing, Vol. 3, Issue 1, January 2014, pp. 335-340.
[6] Mythili S, Madhiya E, "An Analysis on Clustering Algorithms in Data Mining", International Journal of Computer Science and Mobile Computing.
[7] M. Tamilkili, "A Survey on Recent Traffic Classification Techniques Using Machine Learning Methods", Journal of Advanced Research in Computer Science and Software Engineering.
[8] Amandeep Kaur Mann (M.Tech C.S.E), "Survey Paper on Clustering Techniques", International Journal of Science, Engineering and Technology Research (IJSETR).
[9] Jasmine Irani, Nitin Pise, Madura Phatak, "Clustering Techniques and the Similarity Measures Used in Clustering: A Survey", International Journal of Computer Applications (0975-8887), Volume 134, No. 7, January 2016.
[10] Mihika Shah, Sindhu Nair, "A Survey of Data Mining Clustering Algorithms", International Journal of Computer Applications.
[11] Jun Zhang, Yang Xiang, Wanlei Zhou, Yu Wang, "Unsupervised Traffic Classification Using Flow Statistical Properties and IP Packet Payload", Journal of Computer and System Sciences 79 (2013) 573-585.
[12] Jyoti Yadav, Monika Sharma, "A Review of K-mean Algorithm", International Journal of Engineering Trends and Technology (IJETT), Volume 4, Issue 7, July 2013.
[13] G. Sathiya and P. Kavitha, "An Efficient Enhanced K-Means Approach with Improved Initial Cluster Centers", Middle-East Journal of Scientific Research 20 (4): 485-491, 2014.
[14] Ghahramani, Zoubin, "Unsupervised Learning", Advanced Lectures on Machine Learning, Springer Berlin Heidelberg, 2004, pp. 72-112.
