BIG DATA ANALYTICS (2017 REGULATION)
Insurance Fraud Detection
Machine learning plays a critical role in fraud detection, with numerous applications in automobile,
healthcare, and insurance fraud detection.
Using historical data on fraudulent claims, new claims can be isolated based on their proximity
to clusters that indicate fraudulent patterns.
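The proximity idea can be sketched in a few lines of Python. This is only an illustration: the centroids, the two features, and the distance threshold are all hypothetical values, not taken from any real claims data.

```python
import math

# Hypothetical centroids of clusters learned from past fraudulent claims.
# Assumed features, both scaled to [0, 1]: claim amount, time since policy start.
FRAUD_CENTROIDS = [(0.9, 0.1), (0.8, 0.85)]
THRESHOLD = 0.2  # assumed distance cut-off; it would need tuning on real data


def nearest_fraud_distance(claim):
    """Return the Euclidean distance from a claim to its closest fraud centroid."""
    return min(math.dist(claim, centroid) for centroid in FRAUD_CENTROIDS)


def flag_claims(claims):
    """Return the claims that lie within THRESHOLD of any fraud centroid."""
    return [c for c in claims if nearest_fraud_distance(c) <= THRESHOLD]


new_claims = [(0.88, 0.15), (0.3, 0.4), (0.75, 0.9)]
flagged = flag_claims(new_claims)
print(flagged)  # the first and third claims sit close to a fraud centroid
```

Flagged claims would then go to a human reviewer; closeness to a fraud cluster is a signal, not proof of fraud.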
Rideshare Data Analysis
The publicly available Uber ride information dataset provides a large amount of valuable data around traffic,
transit time, peak pickup localities, and more.
Cyber-Profiling Criminals
Cyber-profiling is the process of collecting data from individuals and groups to identify significant
correlations.
The idea of cyber-profiling is derived from criminal profiling, which gives the investigation division
information for classifying the types of criminals who were at the crime scene.
Call Record Detail Analysis
A call detail record (CDR) is the information captured by telecom companies during the call, SMS, and
internet activity of a customer.
This information provides greater insights about the customer’s needs when used with customer
demographics.
Automatic Clustering of IT Alerts
Components of large enterprise IT infrastructure, such as networks, storage, and databases, generate
large volumes of alert messages.
Because alert messages can point to operational issues, they must be manually screened and
prioritized for downstream processes.
Others: image segmentation, image compression, identifying cancerous data, search engines, etc.
Advantages:
It is fast
Easy to understand
Robust
Comparatively efficient
Gives the best results when the data sets are distinct (well separated)
Produces tighter clusters
Cluster membership is updated whenever the centroids are recomputed
Flexible
Easy to interpret
Better computational cost
Enhances Accuracy
Disadvantages:
Choosing the initial centroids randomly can sometimes give poor results
Needs prior specification of the number of cluster centers
If two clusters overlap heavily, k-means cannot distinguish them or tell that there are two
clusters
Different representations of the data give different results
Euclidean distance can weight the factors unequally (for example, when they are measured on different scales)
On very large data sets, the computation may exhaust the computer's resources and crash
Prediction issues
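The point about Euclidean distance weighting factors unequally can be seen with a small example. The customers below are made-up (age, income) pairs, and standardizing each column is one common remedy, used here as an assumption rather than as the only option.

```python
import math
import statistics

# Hypothetical customers: (age in years, annual income in dollars).
customers = [(25, 50_000), (60, 50_500), (27, 52_000), (45, 90_000)]


def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stdevs)) for row in rows]


a, b, c = customers[0], customers[1], customers[2]
# Raw distances: income is in the tens of thousands, so it dominates and the
# 35-year age gap between a and b barely registers.
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)

scaled = standardize(customers)
scaled_ab = math.dist(scaled[0], scaled[1])
scaled_ac = math.dist(scaled[0], scaled[2])

print(raw_ab < raw_ac)        # on raw data, a looks closer to b than to c
print(scaled_ac < scaled_ab)  # after scaling, a is closer to c (similar age and income)
```

Because k-means assigns points by distance, the same change of scale can change which cluster a point lands in.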
Determining Optimal Clusters:
When using k-means clustering, users need some way to determine whether they are using the right number
of clusters.
Methods:
1. Elbow Method
2. Average Silhouette Method
3. Gap Statistic Method
Cluster the observed data, varying the number of clusters k from 1 to kmax, and compute the corresponding
total within-cluster variation Wk for each k.
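For reference, one common way to write the total within-cluster variation is shown below. It assumes squared Euclidean distance to the cluster mean; the symbols C_r (the set of points in cluster r) and \mu_r (the mean of cluster r) are introduced here for illustration and do not appear in the notes.

```latex
W_k = \sum_{r=1}^{k} \sum_{x_i \in C_r} \lVert x_i - \mu_r \rVert^2
```

Under this definition, Wk is the same quantity as the within-cluster sum of squares (WSS) used by the elbow method.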
Elbow Method:
1. Run the clustering algorithm (e.g., k-means clustering) for different values of k, for instance by
varying k from 1 to 10 clusters.
2. For each k, calculate the total within-cluster sum of squares (WSS).
3. Plot the curve of WSS against the number of clusters k.
4. The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number
of clusters.
5. In the example plot, the bend occurs at k = 4, so 4 is the optimal number of clusters.
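The elbow steps can be sketched in plain Python as follows. This is a minimal sketch, not a reference implementation: the toy data set (four tight groups), the number of restarts, and the helper names `kmeans`, `wss`, and `best_wss` are all assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's k-means with random initial centroids; returns (centroids, labels)."""
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _step in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


def wss(points, centroids, labels):
    """Total within-cluster sum of squared distances to the assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def best_wss(points, k, restarts=10, seed=0):
    """Lowest WSS over several random restarts, to soften bad initializations."""
    rng = random.Random(seed)
    return min(wss(points, *kmeans(points, k, rng)) for _run in range(restarts))


# Toy data (assumed): four tight groups of five points each.
centers = [(0, 0), (10, 0), (0, 10), (10, 10)]
offsets = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

# Steps 1 and 2: run k-means for k = 1..8 and record the WSS for each k.
wss_by_k = {k: best_wss(points, k) for k in range(1, 9)}
for k, value in wss_by_k.items():
    print(k, round(value, 2))  # step 3 would plot these values against k
```

Plotting `wss_by_k` and looking for the bend is still a visual judgment; the code only produces the curve.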
Average Silhouette Method: (The average silhouette approach measures the quality of a clustering)
For each data point i, compute the average distance to all other data points in the same cluster (a_i).
Compute the average distance from i to all data points in the closest neighboring cluster (b_i).
Compute the coefficient:
$$ s_i = \frac{b_i - a_i}{\max(a_i, b_i)} $$
The coefficient can take values in the interval [-1, 1].
If it is 0 –> the sample is very close to the boundary between neighboring clusters.
If it is 1 –> the sample is far away from the neighboring clusters.
If it is -1 –> the sample is assigned to the wrong cluster, or the clusters overlap.
A high average silhouette width indicates a good clustering.
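The silhouette computation can be sketched in plain Python, using the standard definition s_i = (b_i - a_i) / max(a_i, b_i). The four sample points and both labelings are made up for illustration, and scoring a singleton cluster as 0 is a common convention assumed here.

```python
import math


def silhouette_scores(points, labels):
    """Return the silhouette coefficient s_i for every point (needs at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for a cluster with one member
            continue
        # a_i: mean distance to the other members of the same cluster
        # (the point's distance to itself is 0, so only the divisor changes).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b_i: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores


def average_silhouette(points, labels):
    """Average silhouette width over all points."""
    scores = silhouette_scores(points, labels)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = average_silhouette(points, [0, 0, 1, 1])  # groups nearby points together
bad = average_silhouette(points, [0, 1, 0, 1])   # pairs each point with a distant one
print(round(good, 3), round(bad, 3))
```

To choose k with this method, one would cluster the data for each candidate k and keep the k with the highest average silhouette width.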
Gap Statistic Method:
The approach can be applied to any clustering method.
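The gap statistic compares the total intra-cluster variation for different values of k with its expected value
under a null reference distribution of the data.
The gap statistic for a given k is defined as follows:
$$ \mathrm{Gap}_n(k) = E_n^{*}[\log W_k] - \log W_k $$
where E_n^{*} denotes the expectation under a sample of size n from the reference distribution.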
In the example, k = 2 is the optimal number of clusters in the data according to the gap statistic.
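A minimal sketch of the gap statistic in plain Python follows. It makes several assumptions: the reference distribution is uniform over the bounding box of the data, Wk is the within-cluster sum of squares, and k is chosen as the smallest k with Gap(k) >= Gap(k+1) - s(k+1), the selection rule usually paired with this statistic. The data, seeds, and helper names are illustrative only.

```python
import math
import random
import statistics


def kmeans_wss(points, k, rng, restarts=5, iters=20):
    """Run Lloyd's k-means from several random starts; return the lowest W_k found."""
    best = math.inf
    for _run in range(restarts):
        centroids = rng.sample(points, k)
        for _step in range(iters):
            labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
            groups = [[p for p, lab in zip(points, labels) if lab == j] for j in range(k)]
            centroids = [
                tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
                for j, g in enumerate(groups)
            ]
        best = min(best, sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels)))
    return best


def gap_statistic(points, k_max, n_refs=10, seed=0):
    """Return (gaps, errors) for k = 1..k_max using uniform reference data sets."""
    rng = random.Random(seed)
    lows = [min(c) for c in zip(*points)]
    highs = [max(c) for c in zip(*points)]
    gaps, errors = [], []
    for k in range(1, k_max + 1):
        log_w = math.log(kmeans_wss(points, k, rng))
        ref_logs = []
        for _ref in range(n_refs):
            # Reference data: same size as the data, uniform over its bounding box.
            ref = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs)) for _p in points]
            ref_logs.append(math.log(kmeans_wss(ref, k, rng)))
        gaps.append(statistics.fmean(ref_logs) - log_w)
        errors.append(statistics.pstdev(ref_logs) * math.sqrt(1 + 1 / n_refs))
    return gaps, errors


def choose_k(gaps, errors):
    """Smallest k with Gap(k) >= Gap(k+1) - s(k+1); falls back to the largest k."""
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - errors[i + 1]:
            return i + 1
    return len(gaps)


# Toy data (assumed): two noisy groups of 30 points each.
data_rng = random.Random(1)
points = [(data_rng.gauss(cx, 1.0), data_rng.gauss(cy, 1.0))
          for cx, cy in [(0, 0), (8, 8)] for _i in range(30)]

gaps, errors = gap_statistic(points, k_max=5)
best_k = choose_k(gaps, errors)
print([round(g, 3) for g in gaps], best_k)
```

The result depends on the random reference sets, so more reference sets (a larger `n_refs`) would give a more stable estimate at a higher computational cost.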