
Data Science

This document presents two case studies: one on predicting malicious URLs using machine learning and another on building a movie recommender system within a MySQL database. The first study achieves 97% accuracy in detecting malicious sites by utilizing sparse data representation and online learning, while the second study employs Locality-Sensitive Hashing and Hamming Distance to recommend movies based on rental history efficiently. Additionally, the document outlines various deep learning algorithms, including CNNs, RNNs, and GANs, highlighting their applications in different domains.


By

Rajeshwari S
2022IT35
III - B.Sc. IT
Case study 1: Predicting Malicious URLs

1. Summary :

The internet is widely used for various purposes, but some websites are malicious and pose
security threats. This case study focuses on predicting malicious URLs using machine learning
while handling large datasets efficiently.

2. Explanation :

1. Defining the Research Goal

 The goal is to determine whether a URL is safe or malicious using a large dataset while
handling memory constraints.

2. Acquiring the URL Data

 The dataset is downloaded in SVMLight format from a research project.


 Each record has 3.2 million features and is labeled 1 (safe) or -1 (malicious).

3. Handling Memory Constraints

 Problem: A single file is too large to fit in memory, causing an out-of-memory error.
 Solution:
 Use a sparse representation (only store non-zero values).
 Process compressed files instead of uncompressed data.
 Use an online learning algorithm that processes data in smaller chunks.
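The memory saving from a sparse representation can be sketched as follows (a minimal illustration using scipy.sparse; the feature count here is a stand-in for the 3.2 million features in the real dataset):

```python
# Sketch: storing a mostly-zero feature row sparsely instead of densely.
import numpy as np
from scipy.sparse import csr_matrix

n_features = 1_000_000            # stand-in for the 3.2M features per record
row = np.zeros(n_features)
row[[10, 5_000, 999_999]] = 1.0   # only a handful of non-zero entries

sparse_row = csr_matrix(row)      # stores only the 3 non-zero values + indices

dense_bytes = row.nbytes                      # 8 MB for one dense row
sparse_bytes = (sparse_row.data.nbytes
                + sparse_row.indices.nbytes
                + sparse_row.indptr.nbytes)   # a few dozen bytes

print(dense_bytes, sparse_bytes)
```

Only the non-zero values and their positions are kept, which is why a record with millions of features still fits comfortably in memory.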

4. Data Exploration

 Checking the dataset confirms that most values are zeros (sparse data).
 Storing only non-zero values saves memory.

5. Model Building

 A Stochastic Gradient Descent (SGD) Classifier is used.


 Instead of loading the entire dataset, files are read one by one, and the model is updated
using partial fitting.
 Results:
 97% accuracy in detecting malicious sites.
 Only 3% false negatives and 6% false positives.
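The partial-fitting loop can be sketched as follows (assuming scikit-learn; small synthetic sparse chunks stand in for the SVMLight files read one by one, so the printed accuracy is illustrative only, not the 97% reported above):

```python
# Sketch of online learning: the model is updated one chunk ("file") at a
# time with partial_fit, so the full dataset never sits in memory at once.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
classes = np.array([-1, 1])               # the two URL labels
model = SGDClassifier(random_state=0)

def make_chunk(n_rows=200, n_features=50):
    """Stand-in for one SVMLight file: sparse features + labels."""
    X = rng.random((n_rows, n_features))
    X[X < 0.9] = 0.0                      # mostly zeros -> sparse
    y = rng.choice(classes, size=n_rows)
    return csr_matrix(X), y

for _ in range(5):                        # one update per "file"
    X, y = make_chunk()
    model.partial_fit(X, y, classes=classes)

X_test, y_test = make_chunk()
print("accuracy:", model.score(X_test, y_test))
```

Passing `classes=` on every call lets `partial_fit` work even when an individual chunk happens to contain only one label.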

Conclusion:

By using sparse representation, compressed data, and online learning, the model efficiently
classifies URLs without exceeding memory limits.

3. Flowchart :
Case study 2 : Building a recommender system inside a
database

1. Summary :
This case study explains how to create a recommender system that suggests movies to
customers based on their rental history. The system uses a MySQL database and Python to
process large datasets efficiently. It applies Locality-Sensitive Hashing (LSH) and Hamming
Distance techniques to find customers with similar preferences and recommend movies they
haven't seen yet. The goal is to make the system memory-friendly and optimize data processing
inside the database itself.

2. Explanation :

1. Research Question

 The task is to recommend movies to customers based on their previous
rentals. The manager asks whether it is possible to suggest movies by
analyzing the rental history stored in a MySQL database.
2. Tools and Techniques Needed

 MySQL Database – Stores customer rental data.

 Python Libraries – MySQLdb, SQLAlchemy, and Pandas for connecting
and manipulating data.

 Hash Functions – Groups similar customers into buckets.

 Hamming Distance – Measures the similarity between customers' rental
patterns.
3. Data Preparation

 The dataset shows which movies each customer has rented (1 for rented,
0 for not rented).

 The data is stored in MySQL using Python's Pandas library.

 Binary rental data is compressed into bit strings for faster processing.
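The bit-string compression step can be sketched like this (the customer names, the eight-movie catalogue, and the packing helper are illustrative, not taken from the case study):

```python
# Sketch: packing 0/1 rental vectors into single integers (bit strings),
# so each customer's history is one compact value instead of a list.
rentals = {
    "alice": [1, 0, 1, 1, 0, 0, 1, 0],
    "bob":   [1, 0, 1, 0, 0, 0, 1, 1],
}

def to_bits(vector):
    """Pack a 0/1 list into one integer, one bit per movie."""
    bits = 0
    for seen in vector:
        bits = (bits << 1) | seen
    return bits

packed = {name: to_bits(v) for name, v in rentals.items()}
print(packed)  # e.g. alice -> 0b10110010 (178)
```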
4. Hash Functions

 Three hash functions select movies in groups of three.

 Customers with the same movie combinations are placed in the same
bucket.

 This reduces the amount of data to compare directly.
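The bucketing idea above can be sketched as follows (the column triples and customer data are illustrative assumptions, not the actual hash functions from the study):

```python
# Sketch of locality-sensitive bucketing: each "hash function" looks at a
# fixed triple of movie columns, and customers with identical values on
# that triple land in the same bucket.
from collections import defaultdict

rentals = {
    "alice": (1, 0, 1, 1, 0, 0),
    "bob":   (1, 0, 1, 0, 0, 1),
    "carol": (1, 0, 1, 1, 0, 1),
}

hash_columns = [(0, 1, 2), (1, 3, 5), (0, 2, 4)]  # three movie triples

buckets = defaultdict(set)
for name, row in rentals.items():
    for triple in hash_columns:
        key = (triple, tuple(row[i] for i in triple))
        buckets[key].add(name)

# Customers sharing any bucket become candidate neighbours; only these
# candidates are compared in full, not every customer pair.
print(buckets[((0, 1, 2), (1, 0, 1))])
```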


5. Model Building

 The Hamming Distance is used to measure how similar two customers are
by counting the differences in their rental patterns.

 The system first selects customers from the same bucket and then
compares them using the distance function.
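On packed bit strings, the Hamming distance reduces to XOR-ing the two integers and counting the differing bits (a minimal sketch; the helper name is my own):

```python
# Sketch: Hamming distance between two packed rental bit strings.
# XOR leaves a 1 exactly where the two customers differ.
def hamming(a: int, b: int) -> int:
    """Number of bit positions where the two rental histories differ."""
    return bin(a ^ b).count("1")

alice = 0b10110010
bob   = 0b10100011
print(hamming(alice, bob))  # differs in 2 positions
```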
6. Recommendations

 The system recommends movies that similar customers have watched but
the target customer hasn't.

 This process is automatic and memory-friendly.
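The recommendation step itself can be sketched like this (movie names and rental values are hypothetical):

```python
# Sketch: recommend movies the most similar customer has rented (1)
# but the target customer has not (0).
target    = {"movie_a": 1, "movie_b": 0, "movie_c": 1, "movie_d": 0}
neighbour = {"movie_a": 1, "movie_b": 1, "movie_c": 1, "movie_d": 1}

recommendations = [movie for movie, seen in neighbour.items()
                   if seen and not target[movie]]
print(recommendations)  # ['movie_b', 'movie_d']
```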

Conclusion:

This case study shows how to build a recommender system inside a relational database using
hashing techniques and distance measures. The system is fast, memory-efficient, and suitable
for large datasets.

3. Flow Chart:
Deep Learning Algorithms

1. Convolutional Neural Networks (CNN)

Used in image processing and computer vision tasks like image classification and object
detection. It extracts spatial features using convolution layers.

2. Recurrent Neural Networks (RNN)

Designed for sequential data processing, commonly used in speech recognition and language
modeling due to its ability to retain past information.

3. Long Short-Term Memory (LSTM)

A type of RNN that solves the vanishing gradient problem, making it effective for time-series
forecasting, chatbots, and text generation.

4. Gated Recurrent Unit (GRU)

A simplified version of LSTM with fewer parameters, used for text processing and sequential
data applications like speech recognition.

5. Transformer

A deep learning model that relies on attention mechanisms, widely used in NLP tasks like
machine translation (e.g., GPT, BERT).

6. Generative Adversarial Networks (GAN)

Consists of a generator and discriminator competing to create realistic data, applied in deepfake
generation and image synthesis.

7. Autoencoders

Used for data compression, anomaly detection, and noise reduction by encoding input data
into a lower-dimensional representation and reconstructing it.

8. Deep Belief Networks (DBN)

A stack of Restricted Boltzmann Machines used for feature learning, image recognition, and
dimensionality reduction.

9. Restricted Boltzmann Machines (RBM)


A two-layer neural network primarily used for collaborative filtering, dimensionality
reduction, and feature learning.

10. Self-Organizing Maps (SOM)

An unsupervised learning algorithm that maps high-dimensional data to a
lower-dimensional space, useful for clustering and pattern recognition.

11. Capsule Networks (CapsNet)

An alternative to CNNs that captures spatial hierarchies, improving
performance in image classification and object detection.

12. Deep Q-Networks (DQN)

A reinforcement learning algorithm that combines deep learning with Q-learning, used in game
playing and autonomous decision-making.

13. Variational Autoencoders (VAE)

A probabilistic model for generating new data similar to the training set, used in image
generation and data augmentation.

14. Spiking Neural Networks (SNN)

A biologically inspired neural network that communicates through discrete
spikes rather than continuous activations, applied in energy-efficient
neuromorphic computing.

15. Attention Mechanism

Enhances deep learning models by focusing on important input parts, crucial for NLP, speech
recognition, and computer vision.
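The core of this mechanism, the scaled dot-product attention used in Transformers, can be written as:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```

where Q, K, and V are the query, key, and value matrices and d_k is the key dimension; the softmax weights determine how strongly each input position influences the output.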

THANK YOU
