Data Science
Rajeshwari S
2022IT35
III B.Sc. IT
Case study 1: Predicting Malicious URLs
1. Summary :
The internet is widely used for various purposes, but some websites are malicious and pose
security threats. This case study focuses on predicting malicious URLs using machine learning
while handling large datasets efficiently.
2. Explanation:
The goal is to determine whether a URL is safe or malicious by training on a large dataset
without running out of memory.
Problem: Loading even a single file fully into memory causes an out-of-memory error.
Solution:
Use a sparse representation (store only the non-zero values).
Read the compressed files directly instead of decompressing the data first.
Use an online learning algorithm that is trained on the data in smaller chunks.
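The idea of reading compressed data in small chunks can be sketched as follows. This is only a minimal illustration using Python's standard gzip module; the helper name iter_chunks, the chunk size, and the toy rows are assumptions made for the example and are not taken from the case study.

```python
import gzip
import io


def iter_chunks(compressed_bytes, chunk_size):
    """Yield lists of at most chunk_size text lines from gzip-compressed bytes."""
    chunk = []
    # gzip.open decompresses lazily while lines are read,
    # so the uncompressed data is never held in memory all at once.
    with gzip.open(io.BytesIO(compressed_bytes), mode="rt", encoding="utf-8") as fh:
        for line in fh:
            chunk.append(line.rstrip("\n"))
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:  # last, possibly shorter chunk
        yield chunk


# Hypothetical toy data standing in for a large compressed file.
raw = "\n".join(f"row{i}" for i in range(5)).encode("utf-8")
compressed = gzip.compress(raw)
chunks = list(iter_chunks(compressed, chunk_size=2))
```

With a real file on disk, the same loop would work by passing the file path to gzip.open instead of an in-memory buffer.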
4. Data Exploration
Checking the dataset confirms that most values are zeros (sparse data).
Storing only non-zero values saves memory.
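A small sketch of what storing only the non-zero values means. The dictionary layout and the toy row below are assumptions chosen for illustration; a real project would more likely use a sparse matrix type from a numerical library.

```python
def to_sparse(row):
    """Return {column index: value} for the non-zero entries of a dense row."""
    return {i: v for i, v in enumerate(row) if v != 0}


def zero_fraction(row):
    """Fraction of entries in the row that are zero."""
    return sum(1 for v in row if v == 0) / len(row)


# Hypothetical feature row: mostly zeros, as in the URL dataset.
dense_row = [0, 0, 3, 0, 0, 0, 1, 0, 0, 0]
sparse_row = to_sparse(dense_row)      # only two entries are stored
share_of_zeros = zero_fraction(dense_row)
```

Here ten dense values shrink to two stored entries, and the saving grows with the share of zeros in the data.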
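5. Model Building
The model is trained with an online learning algorithm, one chunk of data at a time, so the
full dataset never has to be in memory at once.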
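As a rough sketch of online learning on chunks, the example below trains a plain-Python logistic regression with stochastic gradient descent. It is an assumption-based illustration, not the model used in the case study: the class name, the learning rate, and the toy chunks (feature 0 marks malicious rows, feature 1 marks safe rows) are all hypothetical.

```python
import math


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


class OnlineLogisticRegression:
    """Logistic regression updated one sample at a time on sparse rows."""

    def __init__(self, learning_rate=0.5):
        self.weights = {}  # feature index -> weight, created on demand
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, features):
        z = self.bias + sum(self.weights.get(i, 0.0) * v for i, v in features.items())
        return sigmoid(z)

    def partial_fit(self, chunk):
        """Update the model with one chunk of (sparse_features, label) pairs."""
        for features, label in chunk:
            error = self.predict_proba(features) - label
            for i, v in features.items():
                self.weights[i] = self.weights.get(i, 0.0) - self.learning_rate * error * v
            self.bias -= self.learning_rate * error


# Hypothetical chunks of sparse rows; label 1 = malicious, 0 = safe.
chunks = [
    [({0: 1.0}, 1), ({1: 1.0}, 0)],
    [({0: 1.0, 2: 1.0}, 1), ({1: 1.0, 2: 1.0}, 0)],
]
model = OnlineLogisticRegression()
for _ in range(20):          # several passes over the chunks
    for chunk in chunks:     # only one chunk is needed in memory at a time
        model.partial_fit(chunk)
```

After training, model.predict_proba({0: 1.0}) should be above 0.5 and model.predict_proba({1: 1.0}) below 0.5, since the two toy classes are separable.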
Conclusion:
By using sparse representation, compressed data, and online learning, the model efficiently
classifies URLs without exceeding memory limits.
3. Flowchart:
Case study 2: Building a recommender system inside a database
1. Summary:
This case study explains how to create a recommender system that suggests movies to
customers based on their rental history. The system uses a MySQL database and Python to
process large datasets efficiently. It applies Locality-Sensitive Hashing (LSH) and Hamming
Distance techniques to find customers with similar preferences and recommend movies they
haven't seen yet. The goal is to make the system memory-friendly and optimize data processing
inside the database itself.
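2. Explanation:
1. Research Question
Which movies should be recommended to a customer, given what customers with similar rental
histories have rented?
2. Data
The dataset shows which movies each customer has rented (1 for rented,
0 for not rented).
3. Bit Strings
Binary rental data is compressed into bit strings for faster processing.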
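The compression of binary rental flags into a bit string can be sketched as follows. The function name pack_bits, the bit order (first movie in the highest bit), and the toy rental list are assumptions for illustration only.

```python
def pack_bits(rentals):
    """Pack a list of 0/1 flags into one integer, first movie in the highest bit."""
    value = 0
    for flag in rentals:
        value = (value << 1) | flag
    return value


# Hypothetical customer: 8 movies, 1 = rented, 0 = not rented.
rentals = [1, 0, 1, 1, 0, 0, 0, 1]
packed = pack_bits(rentals)
bit_string = format(packed, "08b")
```

Eight separate 0/1 columns become a single small integer, which is cheap to store and to compare.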
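4. Hash Functions
Locality-sensitive hash functions assign each customer to a bucket, so that customers with
similar rental patterns are likely to share a bucket.
5. Distance Function
The Hamming distance measures how similar two customers are by counting the positions
at which their rental bit strings differ.
The system first selects customers from the same bucket and then
compares them using the distance function.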
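The two-stage lookup (same bucket first, then Hamming distance) might look like the sketch below. It uses bit sampling as the hash function, which is one common choice for Hamming distance and is an assumption here; the customer names, bit strings, and sampled positions are also hypothetical.

```python
from collections import defaultdict


def hamming_distance(a, b):
    """Number of bit positions in which two rental bit strings differ."""
    return bin(a ^ b).count("1")


def bucket_key(bits, sampled_positions):
    """Hash by bit sampling: keep only the bits at the chosen positions."""
    return tuple((bits >> p) & 1 for p in sampled_positions)


# Hypothetical customers with 8-movie rental bit strings.
customers = {
    "alice": 0b10110001,
    "bob": 0b10110011,
    "carol": 0b01001110,
    "dave": 0b10100101,
}
sampled_positions = (7, 5, 0)  # assumed choice of bits used for hashing

buckets = defaultdict(list)
for name, bits in customers.items():
    buckets[bucket_key(bits, sampled_positions)].append(name)


def nearest_in_bucket(target):
    """Customers sharing the target's bucket, closest first."""
    key = bucket_key(customers[target], sampled_positions)
    candidates = [n for n in buckets[key] if n != target]
    return sorted(
        candidates,
        key=lambda n: (hamming_distance(customers[target], customers[n]), n),
    )


similar_to_alice = nearest_in_bucket("alice")
```

Only customers in the same bucket are compared, which avoids computing the distance between every pair of customers.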
6. Recommendations
The system recommends movies that similar customers have watched but
the target customer hasn't.
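A minimal sketch of this recommendation step on bit strings: collect every movie rented by at least one similar customer, then remove the ones the target customer has already rented. The movie titles, bit strings, and the function name recommend are assumptions made for the example.

```python
def recommend(target_bits, neighbour_bits, titles):
    """Titles rented by at least one neighbour but not by the target customer."""
    seen_by_neighbours = 0
    for bits in neighbour_bits:
        seen_by_neighbours |= bits
    unseen = seen_by_neighbours & ~target_bits
    n = len(titles)
    # The first title corresponds to the highest bit.
    return [titles[i] for i in range(n) if (unseen >> (n - 1 - i)) & 1]


# Hypothetical catalogue and customers.
titles = ["movie_a", "movie_b", "movie_c", "movie_d"]
target = 0b1010                  # rented movie_a and movie_c
neighbours = [0b1110, 0b1011]    # similar customers from the same bucket
suggestions = recommend(target, neighbours, titles)
```

The same bitwise AND, OR, and NOT operations are also available inside MySQL, which is in line with doing the work in the database.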
Conclusion:
This case study shows how to build a recommender system inside a relational database using
hashing techniques and distance measures. The system is fast, memory-efficient, and suitable
for large datasets.
3. Flowchart:
Deep Learning Algorithms
1. Convolutional Neural Network (CNN)
Used in image processing and computer vision tasks such as image classification and object
detection. It extracts spatial features using convolutional layers.
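To make the idea of extracting spatial features concrete, here is a small plain-Python sketch of the sliding-window operation inside a convolutional layer (no padding, stride 1). The toy image and the edge-detecting kernel are assumptions for illustration; in a real network the kernel values are learned.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hypothetical 3x4 image: dark left half, bright right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-written kernel that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d(image, kernel)
```

The feature map is non-zero only where the window covers the edge, which is the kind of spatial feature a convolutional layer picks up.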
2. Recurrent Neural Network (RNN)
Designed for sequential data processing, commonly used in speech recognition and language
modeling because its hidden state carries information from earlier steps.
3. Long Short-Term Memory (LSTM)
A type of RNN that mitigates the vanishing gradient problem through gated memory cells,
making it effective for time-series forecasting, chatbots, and text generation.
4. Gated Recurrent Unit (GRU)
A simplified version of LSTM with fewer parameters, used for text processing and sequential
data applications like speech recognition.
5. Transformer
A deep learning model that relies on attention mechanisms, widely used in NLP tasks like
machine translation. GPT and BERT are well-known Transformer-based models.
6. Generative Adversarial Network (GAN)
Consists of a generator and a discriminator that compete with each other so that the generator
learns to create realistic data, applied in deepfake generation and image synthesis.
7. Autoencoders
Used for data compression, anomaly detection, and noise reduction by encoding input data
into a lower-dimensional representation and reconstructing it.
8. Deep Belief Network (DBN)
A stack of Restricted Boltzmann Machines used for feature learning, image recognition, and
dimensionality reduction.
9. Deep Q-Network (DQN)
A reinforcement learning algorithm that combines deep learning with Q-learning, used in game
playing and autonomous decision-making.
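The Q-learning part can be illustrated with the standard update rule on a small table. This is only a sketch: a deep Q-network replaces the table with a neural network that estimates the same values, and the states, actions, reward, and the alpha and gamma settings below are hypothetical.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Move Q(state, action) towards reward + gamma * best value of next_state."""
    best_next = max(q_table[next_state].values())
    target = reward + gamma * best_next
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]


# Hypothetical two-state problem with two actions per state.
q_table = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 2.0},
}
new_value = q_update(q_table, "s0", "right", reward=1.0, next_state="s1")
```

The target here is 1.0 + 0.9 * 2.0 = 2.8, and with alpha = 0.5 the stored value moves halfway from 0.0 towards it.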
10. Variational Autoencoder (VAE)
A probabilistic model for generating new data similar to the training set, used in image
generation and data augmentation.
11. Spiking Neural Network (SNN)
A biologically inspired neural network whose neurons communicate through discrete spikes,
which can make processing more energy-efficient; applied in neuromorphic computing.
12. Attention Mechanism
Enhances deep learning models by focusing on the most relevant parts of the input, crucial for
NLP, speech recognition, and computer vision.
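One common form of this idea is scaled dot-product attention, sketched below in plain Python for a single query. The query, keys, and values are hypothetical numbers chosen so that the weighting is easy to follow; real models learn these vectors.

```python
import math


def softmax(xs):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Weight each value by how well its key matches the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Hypothetical input with three positions.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attention([4.0, 0.0], keys, values)
```

The second position, whose key does not match the query, receives only a small weight, so the output is dominated by the first and third values.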