Data Science by Internshala Trainings
Programming Skills
● Python: Most popular language for Data Science due to its simplicity and
extensive libraries like NumPy, Pandas, Scikit-learn, and TensorFlow.
● R: Another powerful language primarily used for statistical analysis and data
visualization.
● SQL: Essential for querying and managing data stored in databases.
Data Collection
● APIs: Interfaces that allow interaction with other software components, useful for
fetching data from external services.
● Web Scraping: Techniques using libraries like BeautifulSoup or Scrapy to extract
data from websites.
● Database Queries: Using SQL to pull data from relational databases like MySQL
or PostgreSQL.
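For reference, a minimal web-scraping sketch with BeautifulSoup, parsing an inline HTML snippet rather than a live website so it runs offline (the tags and class names are made up for illustration):
from bs4 import BeautifulSoup
html = "<html><body><p class='price'>10</p><p class='price'>12</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
prices = [p.text for p in soup.find_all("p", class_="price")]  # extract the matching tags
print(prices)  # ['10', '12']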
Data Cleaning
Data Transformation
Data Visualization
● Matplotlib and Seaborn: Libraries in Python for creating static, animated, and
interactive visualizations.
● Plotly: Used for creating interactive visualizations that can be embedded into
web applications.
● Visualization Techniques: Histograms, scatter plots, box plots, and heatmaps
for understanding data distributions, trends, and relationships.
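A short sketch of these techniques, assuming a small synthetic DataFrame (the 'age' and 'salary' column names are placeholders):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({'age': np.random.randint(20, 60, 100),
                   'salary': np.random.normal(50000, 10000, 100)})

df['age'].plot(kind='hist', title='Age distribution')   # histogram of one column
plt.show()
df.plot(kind='scatter', x='age', y='salary')            # scatter plot of two columns
plt.show()
sns.boxplot(x=df['salary'])                             # box plot to spot outliers
plt.show()
sns.heatmap(df.corr(), annot=True)                      # heatmap of pairwise correlations
plt.show()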
Statistical Analysis
4. Machine Learning
Supervised Learning
Unsupervised Learning
Reinforcement Learning
● Learning Models: Algorithms learn through trial and error using rewards and
punishments (e.g., Q-learning).
● Applications: Robotics, game AI, and autonomous vehicles where
decision-making in complex environments is required.
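As a toy illustration of the Q-learning update rule (the states, actions, reward, and learning parameters below are all made up):
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))   # Q-table: estimated value of each (state, action)
alpha, gamma = 0.1, 0.9               # learning rate and discount factor

def q_update(state, action, reward, next_state):
    # Nudge Q[state, action] toward the reward plus the best discounted future value
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])

q_update(state=0, action=1, reward=1.0, next_state=2)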
Model Validation
● Cross-Validation: Splitting data into multiple folds to ensure the model performs
well on unseen data.
● Confusion Matrix: Tool for understanding the performance of classification
models in terms of true positives, false positives, etc.
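A brief scikit-learn sketch of both ideas (the bundled iris dataset and the logistic regression model are placeholders):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
print(cross_val_score(model, X, y, cv=5))   # scores on 5 different train/validation folds

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
print(confusion_matrix(y_test, model.predict(X_test)))   # true/false positives and negatives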
Hyperparameter Tuning
● Grid Search and Random Search: Techniques for finding the best combination
of hyperparameters to enhance model performance.
● Automated Tuning: Using libraries like Optuna or Hyperopt for more efficient
optimization.
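For instance, a grid search sketch with scikit-learn (the parameter grid is illustrative only; RandomizedSearchCV works the same way but samples combinations randomly):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)                                # tries every combination with 5-fold CV
print(search.best_params_, search.best_score_)  # best settings and their score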
6. Deep Learning
Neural Networks
Recurrent Neural Networks (RNNs)
● Applications: Best suited for sequential data like time series, speech, and text
due to their memory capability.
● LSTM and GRU: Advanced RNN architectures designed to handle long-term
dependencies in data.
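A minimal Keras sketch of an LSTM-based sequence classifier; the vocabulary size, sequence length, and layer sizes are arbitrary placeholders, and swapping tf.keras.layers.GRU for the LSTM layer gives the GRU variant:
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),                               # sequences of 100 token ids (assumed length)
    tf.keras.layers.Embedding(input_dim=10000, output_dim=32),  # token ids -> dense vectors
    tf.keras.layers.LSTM(32),                                   # processes the sequence step by step
    tf.keras.layers.Dense(1, activation='sigmoid'),             # binary prediction (e.g., sentiment)
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()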
Text Processing
Sentiment Analysis
Topic Modeling
Big Data Technologies
● Hadoop and Spark: Frameworks used for processing large datasets across
distributed computing environments.
● NoSQL Databases: Tools like MongoDB or Cassandra used for storing
unstructured data.
Cloud Platforms
● AWS, Google Cloud, Azure: Provide services for scalable computing, storage,
and machine learning solutions, making it easier to handle large-scale data
science projects.
9. Data Science in Production
Model Deployment
MLOps
Data Privacy
● GDPR Compliance: Ensuring data handling meets regulations like the General
Data Protection Regulation, protecting individual data privacy.
Probability Theory: Detailed Explanation
Probability theory is a branch of mathematics that deals with the analysis of random
phenomena and the likelihood of occurrences of events. It provides a formal framework
for reasoning about uncertainty and is foundational in fields like Data Science, Statistics,
Machine Learning, and many real-world applications. Here’s a detailed look at its key
concepts:
1. Basic Definitions
○ Experiment: An action or process that leads to a set of possible
outcomes. For example, flipping a coin or rolling a die.
○ Outcome: The result of a single trial of an experiment. For example,
getting heads when flipping a coin.
○ Sample Space (S): The set of all possible outcomes of an experiment.
For example, for a coin flip, the sample space is
S={Heads, Tails}
2. Probability of an Event
○ Probability is a measure of the likelihood of an event occurring, expressed
as a number between 0 and 1.
○ Formula: For a finite sample space where all outcomes are equally likely,
the probability of an event A is given by:
P(A) = (Number of outcomes favorable to A) / (Total number of outcomes in S)
7. Random Variables
○ A variable that takes on different numerical values based on the outcomes
of a random experiment.
○ Types:
■ Discrete Random Variables: Take on a countable number of
values (e.g., number of heads in coin tosses).
■ Continuous Random Variables: Take on an infinite number of
values within a range (e.g., temperature, height).
8. Probability Distributions
○ Describes how probabilities are distributed over the values of a random
variable.
○ Discrete Distributions: Examples include the Binomial, Poisson, and
Geometric distributions.
○ Continuous Distributions: Examples include the Normal (Gaussian),
Uniform, and Exponential distributions.
○ Sampling Methods:
■ Simple Random Sampling: Every member of the population has
an equal chance of being selected.
■ Stratified Sampling: Population is divided into subgroups, and
random samples are taken from each subgroup.
■ Cluster Sampling: Dividing the population into clusters and then
randomly selecting entire clusters for sampling.
○ Hypothesis Testing:
■ A method for testing a claim or hypothesis about a parameter of the
population using sample data.
■ Null Hypothesis (H₀): A statement of no effect or no
difference. It is the default assumption.
■ Alternative Hypothesis (H₁): A statement that contradicts
the null hypothesis, indicating the presence of an effect or
difference.
■ p-Value: The probability of obtaining test results at least as extreme
as the observed results, assuming that the null hypothesis is true.
■ A low p-value (typically < 0.05) indicates that the null
hypothesis can be rejected.
■ Significance Level (α): A threshold set before testing,
usually 0.05 or 0.01, used to decide whether to reject the null
hypothesis.
■ Types of Errors:
■ Type I Error (False Positive): Rejecting the null hypothesis
when it is actually true.
■ Type II Error (False Negative): Failing to reject the null
hypothesis when it is actually false.
○ Confidence Intervals:
■ A range of values used to estimate a population parameter with a
certain level of confidence (e.g., 95% confidence interval).
■ Formula: Confidence Interval = Sample Mean ± Margin of Error
5. Regression Analysis
○ A statistical method used for modeling the relationship between a
dependent variable and one or more independent variables.
○ Simple Linear Regression:
■ Models the relationship between two variables by fitting a linear
equation to observed data: Y = β0 + β1X + ϵ
■ Y: Dependent variable, X: Independent variable, β0: Intercept, β1:
Slope, ϵ: Error term.
○ Logistic Regression:
■ Used when the dependent variable is categorical (e.g., binary
outcomes). It models the probability that a given input point belongs
to a particular class.
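A brief sketch of both regression types with scikit-learn on synthetic data (all values and names are illustrative):
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.arange(20).reshape(-1, 1)                        # single independent variable
y = 3.0 + 2.0 * X.ravel() + np.random.normal(0, 1, 20)  # Y = b0 + b1*X + noise
lin = LinearRegression().fit(X, y)
print(lin.intercept_, lin.coef_)                        # estimates of b0 and b1

y_binary = (y > y.mean()).astype(int)                   # turn Y into a binary outcome
log = LogisticRegression().fit(X, y_binary)
print(log.predict_proba(X[:3]))                         # probability of each class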
Statistics is a critical skill in Data Science, providing the theoretical foundation for data
analysis, experimentation, and machine learning.
Data Science
Module 1
Data science has become highly popular because of several key factors:
1. Explosion of Data: The digital age has created massive amounts of data (from
social media, IoT devices, transactions, etc.). Organizations now have access to
more data than ever before, which is often referred to as "Big Data."
2. Advances in Computing Power: Modern computing technologies (e.g., cloud
computing, GPUs) allow for large-scale data processing and analysis in
real-time, which was previously impossible.
3. AI and Machine Learning: Machine learning models (which are part of data
science) can make data-driven predictions and decisions, leading to
breakthroughs in automation and artificial intelligence.
4. Business Value: Companies see data science as essential for improving
products, optimizing processes, understanding customer behavior, and gaining a
competitive edge.
5. Personalization and User Experience: Data science enables highly
personalized experiences (e.g., recommendation systems on Netflix, Spotify,
Amazon), which are increasingly demanded by consumers.
1. Descriptive Analytics:
○ Focus: Understand what has happened based on historical data.
○ Techniques: Summarization, data aggregation, basic statistical methods.
○ Example: Sales reports, average customer ratings.
2. Diagnostic Analytics:
○ Focus: Understand why something happened by analyzing data patterns.
○ Techniques: Correlation analysis, regression analysis, and drill-down
methods.
○ Example: Why a marketing campaign succeeded or failed.
3. Predictive Analytics:
○ Focus: Use historical data to make predictions about future events.
○ Techniques: Time series analysis, machine learning models (e.g.,
regression, decision trees, neural networks).
○ Example: Predicting stock prices, customer churn, or demand forecasting.
4. Prescriptive Analytics:
○ Focus: Suggest actions based on predictions and data.
○ Techniques: Optimization algorithms, decision trees, reinforcement
learning.
○ Example: Optimizing supply chain routes, dynamic pricing models.
5. Machine Learning and Artificial Intelligence:
○ Focus: Automating the decision-making process or building models that
can "learn" from data without being explicitly programmed.
○ Techniques: Supervised learning, unsupervised learning, reinforcement
learning.
○ Example: Self-driving cars, voice recognition, fraud detection.
6. Natural Language Processing (NLP):
○ Focus: Enable machines to understand and interpret human language.
○ Techniques: Sentiment analysis, text classification, language modeling.
○ Example: Chatbots, sentiment analysis of social media posts.
7. Data Visualization:
○ Focus: Represent data visually for easier interpretation and insight
discovery.
○ Techniques: Bar charts, heatmaps, scatter plots, dashboards.
○ Example: Business dashboards, geographic maps of customer locations.
1. Healthcare:
○ Application: Predicting diseases, personalized medicine, drug discovery.
○ Example: Using AI models to predict patient outcomes and recommend
treatments.
2. Finance:
○ Application: Fraud detection, risk analysis, algorithmic trading.
○ Example: Credit card companies use data science to identify suspicious
transactions in real time.
3. Retail and E-commerce:
○ Application: Recommendation engines, customer segmentation,
inventory management.
○ Example: Amazon’s recommendation system suggests products based on
previous purchases and browsing behavior.
4. Marketing:
○ Application: Targeted advertising, customer sentiment analysis, churn
prediction.
○ Example: Personalized email marketing campaigns that increase
engagement and sales.
5. Transportation:
○ Application: Route optimization, traffic prediction, autonomous vehicles.
○ Example: Uber and Lyft use data science to optimize routes and pricing
models in real-time.
6. Sports:
○ Application: Player performance analysis, game strategy optimization,
fan engagement.
○ Example: Sports teams use data analytics to improve player recruitment
and game-day strategies.
7. Entertainment:
○ Application: Content recommendations, sentiment analysis, user
behavior analysis.
○ Example: Netflix recommends movies and TV shows based on users'
viewing history.
Conclusion
In short: the explosion of data, advances in computing and AI, and the clear business
value of data-driven decisions have made data science essential across industries, from
healthcare and finance to retail, transportation, and entertainment.
Module 2
Pandas and NumPy are essential Python libraries for data manipulation and
analysis, each with its own focus and set of key concepts.
Pandas:
Pandas is primarily used for data manipulation and analysis. It provides powerful,
flexible, and easy-to-use data structures to work with labeled or tabular data (like
spreadsheets or SQL tables).
1. DataFrame:
○ A 2-dimensional, size-mutable, and heterogeneous data structure (like a
table in Excel or SQL). It consists of rows and columns.
Example:
import pandas as pd
data = {'Name': ['Asha', 'Ravi'], 'Age': [25, 30]}  # sample data (illustrative)
df = pd.DataFrame(data)
2. Series:
○ A 1-dimensional labeled array (like a single column of a DataFrame). It
can hold any data type (integer, string, etc.).
Example:
s = pd.Series([1, 2, 3, 4])
4. Handling Missing Data:
○ Detect and fill or drop missing values (e.g., isnull(), fillna(), dropna()).
Example:
df.fillna(0)  # Replace missing values with 0
5. Data Manipulation:
○ Pandas excels at filtering, grouping, and transforming data (e.g.,
groupby(), merge(), pivot_table()).
Example:
df.groupby('Age').mean() # Group by age and calculate the mean
6. Input/Output:
○ You can read and write data from/to various file formats like CSV, Excel,
SQL, etc.
Example:
df = pd.read_csv('data.csv') # Read a CSV file
NumPy is the fundamental package for scientific computing with Python. It provides
support for large, multi-dimensional arrays and matrices, along with a variety of
mathematical functions to operate on these arrays.
1. Arrays (ndarray):
○ The core NumPy data structure: an N-dimensional array whose elements share one data type.
Example:
import numpy as np
arr = np.array([1, 2, 3, 4])  # a 1D NumPy array (ndarray)
2. Element-wise Operations:
○ NumPy allows you to perform mathematical operations on arrays
element-wise.
Example:
arr * 2 # Multiply every element by 2
3. Broadcasting:
○ NumPy can automatically broadcast smaller arrays to fit the shape of
larger arrays, enabling operations without explicit loops.
Example:
arr + np.array([[10], [20]])  # arr (shape (4,)) is broadcast against shape (2, 1) -> a 2x4 result
4. Reshaping:
○ Change the shape of an array without changing its data.
Example:
matrix = np.reshape(arr, (2, 2))  # reshape the 4-element 1D array into a 2x2 matrix
5. Linear Algebra:
○ NumPy includes functions for matrix operations, like matrix multiplication,
inversion, and decompositions.
Example:
a = np.array([[1, 2], [3, 4]])
np.dot(a, a)  # matrix multiplication (np.linalg.inv(a) gives the inverse)
6. Random Number Generation:
○ Generate arrays of random numbers for simulations or testing.
Example:
np.random.rand(3, 3)  # Create a 3x3 matrix with random numbers
7. Performance:
○ NumPy is designed for performance, leveraging low-level C and Fortran
libraries. It is much faster than native Python for numerical computations,
especially when dealing with large datasets.
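A rough timing comparison of a plain Python loop against the vectorized NumPy equivalent (exact numbers vary by machine):
import time
import numpy as np

values = list(range(1_000_000))
arr = np.arange(1_000_000)

start = time.perf_counter()
squares_py = [v * v for v in values]     # element-by-element Python loop
py_time = time.perf_counter() - start

start = time.perf_counter()
squares_np = arr * arr                   # single vectorized NumPy operation
np_time = time.perf_counter() - start

print(f"Python: {py_time:.4f}s  NumPy: {np_time:.4f}s")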
Summary:
Pandas provides labeled, tabular data structures (DataFrame, Series) for data
manipulation and analysis, while NumPy provides fast multi-dimensional arrays and the
numerical operations that Pandas itself is built on.
Module 3
Statistics - Already discussed
Module 4
Predictive Modeling:
1. Regression Models:
○ Objective: Predict a continuous outcome.
○ Example: Predicting house prices based on features like size, location,
and number of rooms.
○ Types:
■ Linear Regression: Predicts a value based on a linear relationship
between input features and output.
■ Polynomial Regression: Extends linear regression to handle
non-linear relationships.
2. Classification Models:
○ Objective: Predict a discrete outcome (categorical classes).
○ Example: Predicting whether a customer will churn or not (yes/no).
○ Types:
■ Logistic Regression: Predicts binary outcomes (e.g., yes/no,
true/false).
■ Decision Trees: Splits data into branches to predict categories.
■ Random Forest: An ensemble of decision trees to improve
accuracy.
■ Support Vector Machines (SVM): Classifies data by finding the
optimal boundary between categories.
■ Neural Networks: Models complex, non-linear relationships in
data.
3. Time Series Models:
○ Objective: Predict future values based on past data points.
○ Example: Predicting stock prices, sales forecasting, or weather patterns.
○ Types:
■ ARIMA (AutoRegressive Integrated Moving Average): Combines
autoregression and moving averages to model time series.
■ Exponential Smoothing: Forecasts based on weighted averages
of past observations.
4. Clustering Models:
○ Objective: Group similar data points into clusters (unsupervised learning).
○ Example: Segmenting customers into groups based on purchasing
behavior.
○ Types:
■ K-Means Clustering: Partitions data into k distinct clusters.
■ Hierarchical Clustering: Builds a tree of clusters.
5. Ensemble Models:
○ Objective: Combine multiple models to improve prediction accuracy.
○ Example: Using both decision trees and logistic regression to predict
customer churn.
○ Types:
■ Bagging (e.g., Random Forest): Combines multiple models by
averaging their predictions to reduce variance.
■ Boosting (e.g., XGBoost, Gradient Boosting): Sequentially
builds models where each model corrects the errors of the previous
one.
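To make two of the model types above concrete, a short scikit-learn sketch (classification with a random forest, clustering with k-means, on the bundled iris dataset):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
print(clf.predict(X[:3]))                # classification: predicted class labels

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.labels_[:10])                   # clustering: group assignments found without using y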
1. Problem Definition:
○ Clearly define the business problem or objective. Decide what needs to be
predicted (e.g., sales, customer churn, loan defaults).
○ Example: Predict customer churn within the next 6 months based on
transaction data.
2. Data Collection:
○ Gather relevant data for your model. This could be historical data,
transactional data, or data from external sources.
○ Example: Collect customer data, transaction history, and demographic
information.
5. Model Selection:
○ Choose a model type (e.g., linear regression, decision trees, random
forests) based on the problem and data characteristics.
○ Example: For a classification problem, you might choose between logistic
regression, decision trees, or SVM.
6. Model Training:
○ Split the data into training and test sets, and fit the model using the
training data.
○ Steps:
■ Train/Test Split: Typically split the data into 80% training and 20%
testing.
■ Cross-Validation: Use techniques like k-fold cross-validation to
ensure the model generalizes well.
7. Model Evaluation:
○ Evaluate model performance using appropriate metrics and the test
dataset.
○ Common Evaluation Metrics:
■ Regression: Mean Absolute Error (MAE), Mean Squared Error
(MSE), R-squared.
■ Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC
curve.
■ Time Series: Mean Absolute Percentage Error (MAPE), Root Mean
Squared Error (RMSE).
8. Model Tuning:
○ Fine-tune the model by adjusting hyperparameters to improve
performance. Techniques like Grid Search or Random Search are often
used.
○ Hyperparameters: These are model-specific settings (e.g., learning rate,
tree depth) that are tuned to optimize performance.
9. Model Deployment:
○ Once the model is tuned and validated, deploy it into production for
real-time use or batch processing.
○ Example: Deploy a predictive model that identifies potential customer
churn so that marketing teams can act preemptively.
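A condensed sketch of steps 6 and 7 above (the dataset and model are placeholders for whatever the real project uses):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)   # step 6: train on the 80% split

pred = model.predict(X_test)                                      # step 7: evaluate on the unseen 20%
print(accuracy_score(y_test, pred), precision_score(y_test, pred),
      recall_score(y_test, pred), f1_score(y_test, pred))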
Summary:
Predictive modeling is a multi-step process that involves selecting the right model,
training it on historical data, and making predictions on new data. There are various
types of predictive models (regression, classification, time series, clustering), and each
has specific stages, from data collection to model deployment and maintenance. Proper
evaluation, tuning, and monitoring ensure that the model remains reliable and accurate
in predicting future outcomes.
Data exploration and transformation are crucial steps in any data analysis or predictive
modeling workflow. These steps help ensure that the data is clean, understandable, and
ready for analysis or modeling.
Here are the typical steps for Data Exploration and Transformation:
1. Data Exploration
Data Exploration is the initial step where you investigate and understand the dataset.
This helps you identify patterns, spot anomalies, and determine the quality of the data.
The main goals are to familiarize yourself with the data, identify missing values, and
assess the distribution and relationships within the data.
1. Load the Data:
○ Read the dataset into a pandas DataFrame.
Example:
import pandas as pd
df = pd.read_csv('data.csv')
2. Data Shape:
○ Check the number of rows and columns.
Example:
df.shape # Shows the number of rows and columns
3. Summary Statistics:
○ Get a statistical summary of numerical columns (mean, median, quartiles,
standard deviation).
Example:
df.describe() # Summary of numerical data
4. Missing Values:
○ Check how many values are missing in each column.
Example:
df.isnull().sum()  # Counts missing values in each column
6. Outlier Detection:
○ Identify outliers that could skew your analysis or model performance (e.g.,
extreme values in numeric columns).
Example:
df.boxplot() # Visualize outliers using boxplots
7. Visualizing Distributions:
○ Plot histograms or density plots for numeric columns to understand their
distributions (e.g., normal, skewed).
Example:
df['age'].hist() # Plot histogram for 'age' column
8. Correlation Analysis:
○ Assess relationships between variables using correlation matrices or
scatter plots.
Example:
df.corr() # Calculate correlation matrix
2. Data Transformation
Data Transformation is the process of cleaning, adjusting, and modifying the data so it
can be effectively used in analysis or modeling. It helps standardize the data, address
missing or incorrect values, and prepare features for machine learning.
1. Handling Missing Values:
○ Techniques:
■ Imputation: Fill missing values with the mean, median, mode, or other imputed values.
df['age'].fillna(df['age'].mean(), inplace=True)  # Fill missing values with the mean
2. Outlier Treatment:
○ Handle outliers by either removing or transforming them.
○ Techniques:
■ Cap and Floor: Limit extreme values to a specific threshold.
■ Transformation: Apply log or square root transformations to
reduce the effect of outliers.
3. Feature Scaling:
○ Normalize or standardize the numerical data to ensure all features have a
comparable scale, which is important for certain algorithms (e.g., SVM,
KNN).
○ Techniques:
■ Min-Max Scaling (rescale to [0, 1]) and Standardization (zero mean, unit variance).
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['age']] = scaler.fit_transform(df[['age']])  # rescale 'age' to the [0, 1] range
4. Encoding Categorical Variables:
○ Convert categorical data into numerical format so that it can be used in
models.
○ Techniques:
■ Label Encoding (integer codes) or One-Hot Encoding (indicator columns).
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['gender'] = le.fit_transform(df['gender'])  # e.g., 'F'/'M' -> 0/1
5. Feature Engineering:
○ Create new features based on existing data to improve the model's
performance.
○ Techniques:
■ Interaction Features: Multiply or combine two features to create a
new one.
■ Aggregations: Compute new features by aggregating data (e.g.,
sum, mean, max).
■ Datetime Features: Extract useful information from datetime columns (e.g., year,
month, day, hour).
df['year'] = pd.DatetimeIndex(df['date']).year
df['month'] = pd.DatetimeIndex(df['date']).month
6. Binning:
○ Convert continuous data into discrete bins or categories.
Example:
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 60, 100],
labels=['Child', 'Young Adult', 'Adult', 'Senior'])
7. Dimensionality Reduction:
○ Reduce the number of features while retaining the most important
information.
○ Techniques:
■ Principal Component Analysis (PCA): Reduce dimensionality by
projecting data onto new axes.
■ Feature Selection: Remove irrelevant or redundant features using
techniques like variance thresholding or recursive feature
elimination (RFE).
8. Data Integration:
○ Merge or join datasets to enrich the data for analysis.
○ Techniques:
■ Merging/Joining: Combine DataFrames on a common key, e.g.,
pd.merge(df1, df2, on='id') or df1.join(df2).
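Putting a few of these transformation steps together in one hypothetical sketch (scaling, encoding, and PCA; the DataFrame and its column names are made up):
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

df = pd.DataFrame({'age': [22, 35, 58], 'income': [20000, 52000, 80000],
                   'gender': ['M', 'F', 'F']})

df[['age', 'income']] = MinMaxScaler().fit_transform(df[['age', 'income']])  # feature scaling
encoded = pd.get_dummies(df['gender'])                                       # one-hot encoding
reduced = PCA(n_components=1).fit_transform(df[['age', 'income']])           # dimensionality reduction
print(encoded)
print(reduced)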
Bivariate Analysis:
Bivariate analysis is the statistical analysis of two variables to determine the empirical
relationship between them. It helps identify whether there is an association or
correlation between the two variables and what kind of relationship (if any) exists.
Bivariate analysis is often performed to explore how one variable influences another or
to detect patterns between two variables, which could be continuous or categorical.
The type of relationship and the methods used will depend on whether the variables are
numerical (continuous) or categorical (discrete).
● The method you use for bivariate analysis depends on whether you're working
with numerical, categorical, or a mix of both types of variables.
Methods for Bivariate Analysis:
1. Numerical vs. Numerical:
Goal: Determine the strength and direction of the relationship between two continuous
variables.
Common Techniques:
● Scatter Plot:
○ A scatter plot visually shows the relationship between two continuous
variables. Each point represents an observation.
import matplotlib.pyplot as plt
plt.scatter(df['age'], df['salary'])
plt.xlabel('Age')
plt.ylabel('Salary')
plt.show()
● Correlation Coefficient:
■ Pearson’s Correlation: Measures the linear relationship between
two variables.
■ Spearman’s Rank Correlation: Measures the monotonic
relationship (used for non-linear data).
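For example, both coefficients can be computed directly (assuming the same df with 'age' and 'salary' columns used above):
from scipy.stats import pearsonr, spearmanr

print(df['age'].corr(df['salary']))        # Pearson correlation via pandas
print(pearsonr(df['age'], df['salary']))   # Pearson r together with its p-value
print(spearmanr(df['age'], df['salary']))  # Spearman rank correlation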
● Linear Regression:
○ Fits a straight line to quantify how the numerical outcome changes with the predictor.
from sklearn.linear_model import LinearRegression
X = df[['age']]   # predictor as a 2D (n_samples, 1) array
y = df['salary']
model = LinearRegression()
model.fit(X, y)
2. Numerical vs. Categorical:
Goal: Compare how a numerical variable is distributed across the categories of a
categorical variable.
Common Techniques:
● Box Plot:
○ A box plot visualizes the distribution of a numerical variable for different
categories of a categorical variable, making it easy to spot differences in
medians, ranges, and outliers.
● T-Test or ANOVA:
○ These statistical tests compare the means of the numerical variable
across different categories to see if there is a significant difference.
○ T-Test: Used when comparing two categories (e.g., male vs. female
salary).
from scipy.stats import ttest_ind
ttest_ind(male, female)  # male, female: arrays of salaries for the two groups
● Violin Plot:
○ A violin plot combines aspects of a box plot and a density plot, showing
the distribution of the data for each category.
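A sketch of both plots with seaborn, assuming a DataFrame df with a categorical 'gender' column and a numerical 'salary' column (placeholder names):
import matplotlib.pyplot as plt
import seaborn as sns

sns.boxplot(x='gender', y='salary', data=df)      # medians, ranges, and outliers per group
plt.show()
sns.violinplot(x='gender', y='salary', data=df)   # adds the density shape of each group
plt.show()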
3. Categorical vs. Categorical:
Goal: Determine whether two categorical variables are associated with each other.
Common Techniques:
● Chi-Square Test:
○ Tests whether two categorical variables are associated or independent.
from scipy.stats import chi2_contingency
contingency_table = pd.crosstab(df['gender'], df['product'])
chi2_contingency(contingency_table)
● Stacked Bar Plot:
○ A stacked bar plot can be used to visually compare the frequencies of one
categorical variable across different levels of another categorical variable.
Example: Proportion of men and women who prefer different product categories.
df.groupby(['gender', 'product']).size().unstack().plot(kind='bar', stacked=True)
Let's say you are working with a dataset that contains information on customer
demographics and their spending behavior. You want to explore the relationship
between customer age and the amount they spend.
Steps:
1. Identify Variables:
○ Age: Continuous (Numerical)
○ Spending: Continuous (Numerical)
2. Visualize:
Plot a scatter plot to see if there is a linear relationship between age and
spending.
plt.scatter(df['age'], df['spending'])
plt.xlabel('Age')
plt.ylabel('Spending')
plt.show()
3. Calculate Correlation:
correlation = df['age'].corr(df['spending'])  # Pearson correlation coefficient
print(correlation)
4. Interpret:
○ If the correlation is positive and strong (e.g., r = 0.7), you can conclude
that older customers tend to spend more.
Outlier Treatment:
Outliers are data points that differ significantly from other observations in the dataset.
They may result from variability in the data or errors in data collection. Outliers can
distort statistical analyses and reduce the accuracy of predictive models, so it is
important to detect and treat them appropriately.
There are several ways to handle outliers depending on the nature of the data and the
analysis goals.
1. Detect Outliers:
○ First, you need to identify which data points are considered outliers.
2. Decide on a Treatment Method:
○ You can either remove, cap, transform, or impute the outliers, depending
on the context of your analysis.
1. Detecting Outliers
1. Visual Methods:
○ Boxplot:
■ Boxplots are used to detect outliers visually. Data points outside the
whiskers of the boxplot are typically considered outliers.
import seaborn as sns
sns.boxplot(x=df['column_name'])
○ Scatter Plot:
■ Useful for identifying outliers in bivariate data.
plt.scatter(df['age'], df['salary'])
plt.xlabel('Age')
plt.ylabel('Salary')
plt.show()
2. Statistical Methods:
○ Z-Score Method:
■ The Z-score measures how far a data point is from the mean in
terms of standard deviations. A common threshold is to consider
data points with a Z-score greater than 3 or less than -3 as outliers.
from scipy import stats
df['zscore'] = stats.zscore(df['column_name'])
○ IQR (Interquartile Range) Method:
■ The IQR is the range between the first quartile (25th percentile) and
third quartile (75th percentile). Outliers are typically defined as data
points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
2. Treating Outliers
Once outliers are detected, the treatment method depends on whether the outliers
represent genuine anomalies or just extreme values.
1. Remove Outliers:
● If the outliers are due to errors (e.g., data entry mistakes), you can remove them
from the dataset.
Example:
# Remove outliers using IQR (Q1, Q3, and IQR as computed above)
df = df[(df['column_name'] >= Q1 - 1.5 * IQR) &
        (df['column_name'] <= Q3 + 1.5 * IQR)]
2. Cap Outliers:
● Limit extreme values to fixed lower and upper bounds instead of dropping them.
Example:
# Cap outliers using IQR
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df['column_name'] = df['column_name'].clip(lower_bound, upper_bound)
4. Impute Outliers:
● Instead of removing outliers, replace them with more appropriate values like the
mean or median.
Example:
# Replace outliers (|Z-score| > 3) with the median
median = df['column_name'].median()
df.loc[df['zscore'].abs() > 3, 'column_name'] = median
5. Use Robust Methods:
● Some statistical techniques are less sensitive to outliers (e.g., robust regression,
tree-based models like decision trees and random forests).
Example:
# Calculate Z-scores
from scipy import stats
df['zscore'] = stats.zscore(df['column_name'])
# Filter out rows with Z-scores greater than 3 or less than -3
df = df[df['zscore'].abs() <= 3]
# Compute the IQR bounds
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
# Remove outliers
df = df[(df['column_name'] >= Q1 - 1.5 * IQR) &
        (df['column_name'] <= Q3 + 1.5 * IQR)]
# Log transformation to reduce the influence of extreme values
import numpy as np
df['log_column'] = np.log(df['column_name'] + 1)