Data Analytics Assignment
Introduction:
In today’s digital era, data analytics has become the backbone of modern businesses,
driving informed decision-making and fostering innovation. Companies across industries
generate massive amounts of data daily, encompassing customer behaviour, operational
processes, market trends, and more. Harnessing this data effectively is crucial for
maintaining a competitive edge. This is where data modelling comes into play.
Data modelling involves transforming raw data into actionable insights through
structured methodologies and machine learning algorithms. Businesses use these models
to predict customer preferences, optimize operations, improve user experiences, and
design smarter products and services. From dividing datasets for training and testing to
scaling and imputing missing values, each step in the data pipeline ensures accuracy and
reliability of outcomes.
This document delves into how four industry leaders—Amazon, Netflix, Tesla, and
Google—utilize data analytics to model their business strategies. These companies are
pioneers in their respective domains, leveraging state-of-the-art data processing
techniques to achieve remarkable results. The analysis focuses on the following aspects:
Data Division: How datasets are split into training, validation, and test sets for
optimal performance.
Data Scaling: Techniques used to normalize or standardize data for better model
compatibility.
Model Selection: The types of machine learning models employed, such as
regression, classification, and neural networks.
Decision Trees and Graphs: The role of visual and hierarchical decision-making
tools in improving predictions.
Data Imputation: Strategies to address missing or incomplete data, ensuring robust
model performance.
1. Amazon:
1. Data Division
To ensure effective model training and testing, Amazon divides its data into training, validation, and test sets; in the recommendation example later in this section, 70% of the data trains the model, 15% tunes parameters, and 15% evaluates performance.
2. Data Scaling
Scaling ensures that all features contribute equally to the model's performance. Amazon
applies:
Min-Max Normalization:
o Normalizes features like user age, order value, and product ratings to a
range of [0, 1].
o Example:
Original data: Order Value = $10, $100, $500
Scaled data: Order Value ≈ 0, 0.18, 1
o This ensures no single feature (like order value) disproportionately affects
the model.
Standardization:
o For datasets with Gaussian distributions (e.g., delivery times), Amazon uses
z-score normalization.
o Example:
Delivery times are standardized using z = (x − μ) / σ, where x is a delivery time, μ is the mean, and σ is the standard deviation.
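Both scaling steps can be sketched in a few lines of Python. This is a minimal illustration using scikit-learn on invented order values and delivery times, not Amazon's actual data or pipeline:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature values (illustrative only)
order_values = np.array([[10.0], [100.0], [500.0]])      # dollars
delivery_times = np.array([[1.5], [2.0], [2.5], [4.0]])  # days

# Min-max normalization rescales each feature to the [0, 1] range
print(MinMaxScaler().fit_transform(order_values).ravel())   # ≈ [0, 0.18, 1]

# Z-score standardization: z = (x - mean) / std
print(StandardScaler().fit_transform(delivery_times).ravel())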
3. Models Used
Amazon relies on advanced machine learning models tailored for specific use cases:
1. Collaborative Filtering:
o Used for personalized recommendations.
o Example:
A user buys a book on data science. The system suggests related
books based on purchases by other users with similar interests.
2. Logistic Regression:
o Used for binary classification, such as detecting fraudulent transactions.
o Example:
Amazon identifies patterns of unusual purchases or location
mismatches to flag suspicious activities.
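A toy sketch of the collaborative-filtering idea described above, using item-to-item cosine similarity on a small, invented user-item rating matrix (a simplification, not Amazon's production system):
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

items = ["data_science_book", "python_book", "coffee_maker"]
# Hypothetical ratings: rows are users, columns are items (0 = no rating)
ratings = np.array([
    [5, 4, 0],
    [4, 5, 1],
    [0, 1, 5],
])

# Similarity between items, judged by how users rated them together
item_similarity = cosine_similarity(ratings.T)

# Suggest the item most similar to the one a user just bought
bought = items.index("data_science_book")
scores = item_similarity[bought].copy()
scores[bought] = -1                      # exclude the purchased item itself
print("Suggested:", items[int(scores.argmax())])   # python_book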
4. Decision Trees and Graphs
Decision trees and graph-based methods are extensively used in Amazon's operations:
Decision Trees:
o Applied in warehouse management to optimize inventory levels.
o Example:
A decision tree predicts which products need to be stocked more
heavily during holiday seasons based on historical sales data.
Graphs:
o Used in Amazon’s supply chain optimization.
o Example:
Graph-based algorithms determine the shortest delivery routes,
reducing shipping times and costs.
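The graph idea can be sketched with networkx: warehouses and delivery hubs become nodes, travel times become edge weights, and the shortest route is queried. The network below is entirely hypothetical:
import networkx as nx

# Toy road network; edge weights are travel times in minutes
G = nx.Graph()
G.add_weighted_edges_from([
    ("warehouse", "hub_a", 15),
    ("warehouse", "hub_b", 25),
    ("hub_a", "customer", 30),
    ("hub_b", "customer", 10),
])

# Shortest delivery route by total travel time (Dijkstra under the hood)
route = nx.shortest_path(G, "warehouse", "customer", weight="weight")
minutes = nx.shortest_path_length(G, "warehouse", "customer", weight="weight")
print(route, minutes)   # ['warehouse', 'hub_b', 'customer'] 35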
5. Data Imputation
Amazon addresses missing or incomplete data with several strategies:
Mean Imputation:
o For missing values in numerical features like delivery time or product
ratings.
o Example:
If a product's average rating is missing, Amazon imputes it with the
average rating from similar products in the same category.
Collaborative Imputation:
o For sparse user-item matrices in recommendation systems.
o Example:
If a user hasn’t rated a product, the system estimates a rating based
on ratings given by similar users.
Advanced Techniques:
o Amazon uses predictive imputation for real-time scenarios.
o Example:
Predicting missing delivery time for orders based on traffic, weather,
and location data.
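A minimal pandas sketch of mean imputation in the spirit of the first strategy: a missing rating is filled with the mean rating of products in the same category. The table is invented for illustration:
import pandas as pd

# Hypothetical product table with one missing rating
df = pd.DataFrame({
    "category": ["books", "books", "books", "kitchen"],
    "rating":   [4.5, None, 3.5, 4.0],
})

# Fill the missing rating with the mean of similar products (same category)
df["rating"] = df.groupby("category")["rating"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)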
End-to-End Example: Amazon’s Recommendation System
1. Data Collection:
o Collects data on a user's interactions, including search terms, product views,
and purchases.
2. Data Division:
o 70% of this data trains the recommendation engine, 15% tunes parameters,
and 15% evaluates performance.
3. Data Scaling:
o Normalizes features like product prices and user demographics.
4. Model Training:
o Collaborative Filtering:
A model predicts that if a user bought a laptop, they might also buy
accessories like a mouse or keyboard.
5. Imputation:
o For missing ratings, estimates are made using the average ratings of similar
products.
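The 70/15/15 split in step 2 of this example can be sketched with two calls to scikit-learn's train_test_split; the feature matrix here is random placeholder data:
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder interaction data: 1,000 samples, 5 features, binary labels
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# First carve out 70% for training, then split the rest 50/50 into
# validation (15%) and test (15%) sets
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=42)
print(len(X_train), len(X_val), len(X_test))   # 700 150 150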
2. Netflix:
Netflix is one of the most prominent users of data analytics, leveraging massive amounts
of user data to drive its recommendation engine, optimize content delivery, and enhance
user experiences.
1. Data Division
Netflix divides its dataset strategically to ensure robust training, validation, and testing
for its models.
2. Data Scaling
Numeric Data: Ratings (1–5 stars), viewing duration (in minutes), and search
frequency.
o Method: Z-Score Standardization. This transforms the data to have a mean
of 0 and a standard deviation of 1, ensuring features like ratings and
durations are on the same scale.
o Example:
Original Ratings: [3, 4, 2, 5, 1]
Standardized Ratings (mean 3, std ≈ 1.41): [0, 0.71, -0.71, 1.41, -1.41]
Categorical Data: Genres (Action, Comedy, Drama), device types (Mobile, TV).
o Method: One-Hot Encoding.
Example: Action = [1, 0, 0], Comedy = [0, 1, 0].
This ensures that the model treats each category equally without introducing bias.
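Both preprocessing steps can be sketched with pandas and scikit-learn; the ratings and genres below are toy values rather than Netflix data:
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "rating": [3, 4, 2, 5, 1],                                   # 1-5 stars
    "genre": ["Action", "Comedy", "Drama", "Action", "Comedy"],
})

# Z-score standardization of the numeric column (mean 0, std 1)
df["rating_std"] = StandardScaler().fit_transform(df[["rating"]]).ravel()

# One-hot encoding of the categorical column
df = pd.get_dummies(df, columns=["genre"])
print(df)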
3. Models Used
Netflix employs various machine learning models depending on the specific use case:
Content-Based Filtering:
This focuses on the content features (e.g., genre, actors, ratings) to recommend
similar movies.
o Example: If User A likes movies starring Keanu Reeves, the system might
recommend John Wick based on metadata.
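A small sketch of content-based filtering: each title is represented by its metadata text, vectorized with TF-IDF, and compared with cosine similarity. The titles and metadata strings are simplified stand-ins:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = ["The Matrix", "John Wick", "The Crown"]
# Hypothetical metadata: genre, lead actors, themes as plain text
metadata = [
    "action sci-fi keanu reeves",
    "action thriller keanu reeves",
    "drama history royal family",
]

# Vectorize the metadata and compare titles pairwise
tfidf = TfidfVectorizer().fit_transform(metadata)
sim = cosine_similarity(tfidf)

# Recommend the title most similar to the one the user liked
liked = titles.index("The Matrix")
sim[liked, liked] = -1                      # ignore the title itself
print("Recommended:", titles[sim[liked].argmax()])   # John Wick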
4. Decision Trees and Graphs
Netflix uses decision trees and graph-based models for feature selection and clustering.
5. Data Imputation
Missing data is a common challenge in large datasets. Netflix uses advanced imputation techniques to address this:
Time-Series Imputation:
For missing viewing patterns (e.g., due to network issues), Netflix uses forward-
fill or interpolation methods to estimate data.
o Example:
If a user's session logs are incomplete, Netflix estimates their viewing
duration based on prior patterns.
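The forward-fill and interpolation ideas can be sketched with pandas on a toy per-minute viewing log containing gaps; timestamps and values are invented:
import pandas as pd

# Hypothetical per-minute viewing log with a gap from a dropped connection
log = pd.Series(
    [12.0, 15.0, None, None, 24.0],
    index=pd.date_range("2024-01-01 20:00", periods=5, freq="min"),
)

print(log.ffill())                # forward-fill repeats the last known value
print(log.interpolate("time"))    # interpolation estimates the values in between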
End-to-End Example: Netflix's Recommendation System
1. Data Division:
User A's watch history is included in the training set, while their unseen
preferences (e.g., interest in Dark) are in the test set.
2. Data Scaling:
Viewing duration and ratings are scaled to ensure they’re treated uniformly by the
recommendation engine.
3. Model Application:
o Matrix factorization predicts User A’s interest in Dark based on correlations
between The Witcher and other fantasy titles.
o Content-based filtering highlights Dark due to shared themes with The
Witcher (fantasy, mystery).
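The matrix-factorization step can be illustrated with a toy rank-2 SVD of a small user-title rating matrix: the reconstruction produces a score for the title User A has not yet watched (Dark). The ratings are invented:
import numpy as np

titles = ["The Witcher", "Stranger Things", "Dark"]
# Hypothetical ratings: rows are users, 0 means not yet rated
R = np.array([
    [5.0, 4.0, 0.0],   # User A has not rated Dark
    [4.0, 4.0, 5.0],
    [5.0, 3.0, 4.0],
])

# Rank-2 truncated SVD as a stand-in for matrix factorization
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_hat = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]

# The reconstructed entry is the predicted affinity for the unseen title
print("Predicted score for Dark (User A):", round(R_hat[0, 2], 2))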
3. Tesla:
Tesla is at the forefront of innovation, using advanced data analytics to power its
autonomous driving systems, optimize manufacturing, and enhance user experiences.
Below is an in-depth look at how Tesla uses data analytics in its operations:
1. Data Division
Tesla collects an enormous amount of data from its fleet of vehicles worldwide. The data
is used to train and validate its machine learning models for autonomous driving.
Example:
For their Full Self-Driving (FSD) Beta program, Tesla collects millions of miles of
driving data and uses this to continually improve the accuracy of their autonomous
driving systems.
2. Data Scaling
Data scaling is essential to ensure consistent performance across different sensors and
environments.
Robust Scaling:
Tesla uses robust scaling techniques to handle outliers in sensor data. For
example:
o Scaling sensor data (e.g., distances measured by LiDAR) to a consistent
range.
o Normalizing camera pixel values to enhance the image quality fed into
convolutional neural networks (CNNs).
Feature Scaling:
Features like vehicle speed, road curvature, and object distances are standardized
so that no single feature disproportionately influences the model.
Example:
In training models to detect pedestrians, Tesla scales pixel intensity values from dashcam
images to ensure the neural network processes images efficiently, regardless of lighting
conditions.
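Both scaling ideas can be sketched in a few lines: robust scaling of distance readings containing an outlier, and simple normalization of pixel intensities to [0, 1]. The numbers are synthetic, not Tesla sensor data:
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical object-distance readings in metres, with one outlier
distances = np.array([[4.8], [5.1], [5.0], [4.9], [60.0]])

# Robust scaling uses the median and interquartile range, so the outlier
# barely affects how the typical readings are scaled
print(RobustScaler().fit_transform(distances).ravel())

# Pixel intensities (0-255) normalized to [0, 1] before feeding a CNN
pixels = np.array([0, 64, 128, 255], dtype=np.float32)
print(pixels / 255.0)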
3. Models Used
Tesla employs various machine learning models tailored to specific tasks within its
autonomous driving system:
Reinforcement Learning:
Helps Tesla's vehicles learn optimal driving strategies based on simulated and
real-world data.
o Example: Deciding when to change lanes to optimize speed and safety.
Logistic Regression:
Used for binary classification tasks, such as determining whether an object in the
road is an obstacle that needs avoidance.
Time-Series Models:
Predicts future driving conditions based on current sensor readings and telemetry
data.
Example:
Tesla’s system can predict the movement of pedestrians by analyzing video frames and
learning their walking patterns.
4. Decision Trees and Graphs
Tesla utilizes decision trees and graph-based methods for optimization and decision-making:
Example:
In heavy traffic, Tesla’s system uses a graph to predict whether adjacent vehicles will
change lanes and adjusts its strategy to maintain safety.
5. Data Imputation
Time-Series Imputation:
When GPS signals are temporarily lost, Tesla uses forward-fill imputation to
maintain accurate location tracking.
Interpolation:
Sensor data gaps are filled using interpolation techniques to estimate missing
values.
Synthetic Data Generation:
Tesla generates synthetic data for rare driving scenarios, such as animals crossing
the road, to improve model robustness.
Example:
If a radar sensor temporarily fails to detect an object due to interference, Tesla’s system
uses data from adjacent cameras to estimate the object’s position and maintain safe
driving behaviour.
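A minimal pandas sketch of filling short sensor gaps: forward-fill holds the last GPS fix, while linear interpolation estimates the missing radar distances. The telemetry values are invented:
import pandas as pd

# Hypothetical one-second telemetry with a brief sensor dropout
telemetry = pd.DataFrame(
    {
        "gps_lat": [37.7749, None, None, 37.7755],
        "radar_m": [42.0, None, None, 36.0],
    },
    index=pd.date_range("2024-01-01 08:00:00", periods=4, freq="s"),
)

telemetry["gps_lat"] = telemetry["gps_lat"].ffill()        # hold the last fix
telemetry["radar_m"] = telemetry["radar_m"].interpolate()  # linear fill
print(telemetry)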
6. Real-World Application Example
Autonomous Driving:
Tesla’s Autopilot system uses an end-to-end machine learning pipeline:
1. Data Collection: Fleet-wide data is continuously collected and sent to
Tesla’s servers.
2. Model Training: Neural networks are trained on diverse datasets, ensuring
they can handle various conditions.
3. Real-Time Processing: In-car systems use pre-trained models to make
split-second decisions.
Over-the-Air Updates:
Tesla updates its models regularly using insights gained from real-world data. For
example, new updates improve how the car handles stop signs or reacts to
merging lanes.
4. Google:
1. Data Division
Google collects massive amounts of data daily from search queries, web crawlers, and
user interactions. To develop its algorithms and models, the company follows a systematic approach to dividing this data into training, validation, and test sets.
2. Data Scaling
Log Transformations:
o Applied to web traffic data and ad performance metrics.
o Example: Log transformation helps in normalizing page views, which can
vary drastically from a few hundred to billions.
Standardization:
o Standardizes features like click-through rates, dwell time, and keyword
relevance.
o Ensures model inputs are mean-centered with unit variance for algorithms
like Logistic Regression.
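Both transforms can be sketched briefly: a log transform compresses a heavily skewed page-view count, and standardization centres a click-through-rate feature. The figures are illustrative only:
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical page views spanning several orders of magnitude
page_views = np.array([300, 4_200, 95_000, 1_800_000_000], dtype=float)
print(np.log1p(page_views).round(2))   # log transform tames the huge range

# Standardize click-through rates to zero mean and unit variance
ctr = np.array([[0.01], [0.03], [0.07], [0.12]])
print(StandardScaler().fit_transform(ctr).ravel().round(2))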
3. Models Used
Logistic Regression:
o Applied in Gmail for spam classification.
o Example: Analyzes email content, sender reputation, and attachment types
to classify emails as spam or important.
4. Decision Trees and Graphs
Decision Trees:
o Used in Google Ads to optimize ad placement based on user preferences and
demographics.
o Example: A decision tree predicts whether a user is likely to click on an ad
for “running shoes” based on their search and purchase history.
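A toy sketch of the decision-tree idea: a small DecisionTreeClassifier trained on made-up features (whether the user recently searched for shoes, and a coded age group) to predict a click on a running-shoes ad. None of this reflects Google's actual features or models:
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features: [searched_for_shoes (0/1), age_group (coded 0-3)]
X = [
    [1, 1], [1, 2], [0, 1], [0, 3],
    [1, 0], [0, 2], [1, 3], [0, 0],
]
y = [1, 1, 0, 0, 1, 0, 1, 0]   # 1 = clicked the running-shoes ad

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Predict for a new user who recently searched for shoes
print(tree.predict([[1, 2]]))   # [1]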
5. Data Imputation
Missing or incomplete data can arise from various sources like ad-click errors,
incomplete search terms, or network failures. Google addresses these issues with
advanced imputation techniques:
Multivariate Imputation:
o Estimates missing values based on correlated features.
o Example: If CTR data for a specific ad is missing, it is estimated using similar
ads' CTR data within the same campaign.
Predictive Imputation:
o Uses machine learning models to predict and fill gaps.
o Example: Predicts missing demographic details of users based on their
browsing behavior and historical data.
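The multivariate idea can be sketched with scikit-learn's IterativeImputer, which estimates a missing CTR from correlated features such as impressions and ad position. The table is synthetic:
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical ad metrics: [impressions, avg_position, ctr]
ads = np.array([
    [1000, 1.0, 0.12],
    [ 800, 2.0, 0.08],
    [ 600, 3.0, 0.05],
    [ 900, 1.5, np.nan],   # missing CTR to be estimated
])

imputer = IterativeImputer(random_state=0)
print(imputer.fit_transform(ads).round(3))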
End-to-End Example: Google Search Ranking
1. Problem Statement:
Improve the accuracy of Google’s search engine by delivering highly relevant
results in milliseconds.
2. Data Sources:
o Search queries from billions of users.
o Web crawling data from indexed pages.
o User behavior metrics (e.g., click-through rates, dwell time).
3. Pipeline:
o Data Collection: Collects real-time search queries and historical data.
o Preprocessing: Removes noise (e.g., typos in queries) and tokenizes text for
processing.
o Feature Engineering: Extracts features like query intent, keyword
relevance, and geolocation.
o Model Training: Trains transformer models like BERT to improve semantic
understanding of search queries.
o Evaluation: Validates the ranking algorithm using the validation set,
ensuring results are relevant and timely.
o Deployment: Deploys the trained model in production, continuously
updated with real-time user feedback.
Conclusion:
Amazon, with its vast e-commerce ecosystem, combines collaborative filtering and logistic regression with careful data division, scaling, and imputation to power personalized recommendations, fraud detection, and supply-chain optimization.
Netflix, with its focus on user engagement, leverages matrix factorization and deep
learning models like RNNs to refine content recommendations and predict user behavior.
Its meticulous data scaling and imputation techniques ensure the reliability and accuracy
of its models, ultimately driving subscriber retention and satisfaction.
Tesla, at the forefront of autonomous driving technology, utilizes sophisticated models
such as CNNs and logistic regression to process complex sensor data. Its use of decision
trees for route optimization and robust scaling techniques ensures that its vehicles can
operate safely and efficiently in diverse conditions.
Google capitalizes on its vast datasets to revolutionize search, advertising, and AI-driven
applications. Its use of transformer models like BERT and graph neural networks
exemplifies cutting-edge advancements in natural language processing and web page
ranking. Google’s meticulous data scaling and imputation techniques ensure the
consistency and effectiveness of its algorithms.
Across these companies, several commonalities emerge. The division of data into
training, validation, and test sets is a universal practice that ensures model robustness.
Data scaling techniques, such as normalization and standardization, are employed to
handle diverse feature ranges and outliers. Advanced machine learning models, including
neural networks and decision trees, form the backbone of their analytical strategies.
Additionally, imputation methods address missing data, maintaining dataset integrity and
preventing biases in model predictions.