Data Science
A crash course
About the Instructor
Engi Amin
Instructor of Statistics & Data Science at Markov Data
Email: [email protected]
› M.Sc. in Socio-computing
› B.Sc. in Statistics
› Teaching Experience: 9+ years
› Researcher & Data Analyst: 9+ years (national and international projects)
Table of Contents
01 Introduction to Data Science
02 The Field of Data Science
03 Role of Data Science in Modern Society
04 The Data Science Life Cycle
05 Data Science Skill Set
06 Big Data
07 Machine Learning
08 MLOps
09 Career Paths in Data Science
10 Resources for Further Learning
Introduction to Data Science
Data
• A dataset is a collection of related sets of information, usually formatted in a table, used for analysis or processing.
• Structured data are represented in rows and columns.
• Columns represent features or variables.
• Rows represent instances or records.
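As a minimal sketch (not from the slides, with invented names and values), here is how such a structured dataset looks in Python with pandas: columns are the features, rows are the records.

```python
import pandas as pd

# A small structured dataset: columns are features, rows are records.
df = pd.DataFrame({
    "age":    [34, 29, 41],                      # feature 1
    "income": [52000, 48000, 61000],             # feature 2
    "city":   ["Cairo", "Giza", "Alexandria"],   # feature 3
})

print(df.shape)             # (3, 3): 3 records (rows), 3 features (columns)
print(df.columns.tolist())  # the variables: ['age', 'income', 'city']
```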
Data Analysis
• The process of inspecting, cleaning, and modeling data with the goal of discovering useful information.
• Stages in the cycle: Define, Collect, Clean, Analyze, Interpret, Visualize, Present.
Types of Data Analytics

Descriptive Analytics
• Descriptive analysis focuses on summarizing and describing the characteristics of a dataset without making inferences or predictions.
• Example Questions:
• What are the most common opinions regarding a topic among participants?
• Common Techniques:
• Summary Statistics: Measures of central tendency (mean, median, mode), measures of variability (range, variance, standard deviation).
• Frequency Distributions: Tabulations, histograms, bar charts.
• Data Visualization: Graphs, charts, heatmaps.
• Common Tools:
• Spreadsheet software (e.g., Microsoft Excel, Google Sheets).
• Statistical software (e.g., SPSS, SAS, R, Python with libraries like pandas, seaborn, matplotlib).
• Business intelligence tools (e.g., Tableau, Power BI).
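A small, hypothetical sketch of descriptive analysis in Python with pandas and matplotlib; the survey scores are invented for illustration:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical survey scores, invented for illustration.
scores = pd.Series([72, 85, 85, 90, 61, 78, 85])

# Summary statistics: central tendency and variability.
print("mean:", scores.mean(), "median:", scores.median(), "mode:", scores.mode().iloc[0])
print("range:", scores.max() - scores.min(), "variance:", scores.var(), "std:", scores.std())

# Frequency distribution and a simple visualization.
print(scores.value_counts())
scores.plot(kind="hist", bins=5)
plt.show()
```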
Diagnostic Analytics
• Diagnostic analysis aims to identify the causes of observed phenomena or outcomes by examining relationships within the data and diagnosing problems or anomalies.
• Common Techniques:
• Hypothesis testing: Involves formulating a hypothesis about the relationship between variables and using statistical tests to determine whether the data supports or rejects the hypothesis.
• Regression analysis: Involves understanding the relationship between a dependent variable and one or more independent variables. It can reveal how changes in the independent variables affect the dependent variable.
• Outlier Detection: Identifies data points that deviate significantly from the norm.
• Correlation analysis: Assesses the strength and direction of relationships between variables. Although it doesn't imply causation, it can indicate potential areas for further investigation.
• Mediation and moderation analysis: Allows the analysis of how a third variable affects the relationship between the independent variable (IV) and the dependent variable (DV).
• Common Tools:
• Spreadsheet software (e.g., Microsoft Excel, Google Sheets).
• Statistical software (e.g., SPSS, SAS, R, Python with libraries like statsmodels, scikit-learn).
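A brief, hypothetical sketch of diagnostic analysis with statsmodels, combining correlation analysis and regression on made-up ad-spend data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

# Hypothetical question: does ad spend relate to sales? (invented data)
ad_spend = rng.normal(100, 20, 200)
sales = 3.0 * ad_spend + rng.normal(0, 30, 200)

# Correlation analysis: strength and direction of the relationship.
print(np.corrcoef(ad_spend, sales)[0, 1])

# Regression analysis: how changes in the IV affect the DV.
X = sm.add_constant(ad_spend)    # adds the intercept term
model = sm.OLS(sales, X).fit()
print(model.summary())           # coefficients and p-values (hypothesis tests)
```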
Predictive Analytics
• Predictive analysis involves using historical data to make predictions about future outcomes or trends, typically through statistical modeling or machine learning algorithms.
• Example Questions:
• Can we predict future buying patterns based on past behavior and demographic information?
• How accurately can we forecast stock market fluctuations or economic trends using behavioral indicators and sentiment analysis?
• Can we predict retirement savings adequacy and assess the impact of behavioral interventions on retirement planning?
• Common Techniques:
• Regression Analysis: Linear regression, logistic regression, polynomial regression.
• Time Series Analysis: ARIMA models, exponential smoothing.
• Machine Learning Algorithms: Supervised ML techniques such as KNN, decision trees, random forests, support vector machines, and neural networks.
• Simulation Analysis: Monte Carlo, agent-based modeling, discrete event simulation, system dynamics modeling.
• Common Tools:
• Machine learning libraries in Python (e.g., scikit-learn, TensorFlow, Keras) and R (e.g., caret).
• Predictive analytics platforms (e.g., RapidMiner, KNIME, Weka).
• Specialized software for time series forecasting (e.g., SAS Forecast Server, Prophet by Facebook).
• Simulation software (e.g., AnyLogic, Simul8, MATLAB/Simulink, NetLogo).
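A minimal, hypothetical sketch of predictive analysis with scikit-learn: a logistic regression trained on invented customer data to predict future buying behavior:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical features: past purchase count and age; target: buys again (1) or not (0).
past_purchases = rng.poisson(3, 500)
age = rng.integers(18, 70, 500)
X = np.column_stack([past_purchases, age])
y = (past_purchases + rng.normal(0, 1, 500) > 3).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
print("predicted buying behavior:", model.predict(X_test[:5]))
```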
Prescriptive Analytics
• Prescriptive analysis goes beyond prediction by recommending actions or decisions to optimize outcomes or achieve specific objectives, often through optimization or simulation techniques.
• Example Questions:
• Which behavioral interventions (nudges) are most effective in promoting healthy behaviors, such as exercise, diet, or medication adherence?
• What prescriptive strategies can be implemented to encourage retirement savings among specific demographic groups, considering their unique behavioral tendencies?
• What incentive structures or reward programs are most effective in motivating employees to achieve performance goals, considering behavioral principles such as loss aversion or social norms?
• Common Techniques:
• Optimization Techniques: Linear programming, integer programming, dynamic programming.
• Simulation and Scenario Analysis: Monte Carlo simulation, agent-based modeling, discrete event simulation, system dynamics modeling, scenario planning.
• Machine Learning Techniques: Decision trees, neural networks, recommendation systems.
• Common Tools:
• Optimization libraries in Python (e.g., SciPy, CVXPY, Gekko).
• Optimization software (e.g., IBM CPLEX, Gurobi).
• Simulation software (e.g., AnyLogic, Arena, Simul8, MATLAB/Simulink, NetLogo).
• Decision support systems (e.g., Microsoft Decision Support, D-Sight).
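A small sketch of prescriptive optimization with SciPy's linprog; the budget-allocation problem and its numbers are invented for illustration:

```python
from scipy.optimize import linprog

# Hypothetical linear program: allocate budget between two interventions.
# Maximize 4*x1 + 3*x2 (expected impact), subject to:
#   x1 + x2 <= 100 (total budget), 0 <= x1 <= 70, 0 <= x2 <= 60.
# linprog minimizes, so we negate the objective coefficients.
res = linprog(
    c=[-4, -3],
    A_ub=[[1, 1]],
    b_ub=[100],
    bounds=[(0, 70), (0, 60)],
    method="highs",
)
print("optimal allocation:", res.x)   # [70, 30]
print("maximized impact:", -res.fun)  # 370
```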
Data Science
• Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and modeling techniques to extract knowledge and insights from structured and unstructured data.
• Data science combines math and statistics, specialized programming, advanced analytics, artificial intelligence (AI), and machine learning with specific subject matter expertise to uncover actionable insights hidden in an organization's data.
• These insights can be used to guide decision-making and strategic planning.
The Interdisciplinary Nature of Data Science
• Maths & Statistics: Fundamental tools for data analysis, hypothesis testing, and
building models.
• Visualization: Techniques for creating visual representations of data to communicate
insights effectively (e.g., charts, graphs).
• Exploratory Data Analysis (EDA): Initial data analysis to discover patterns, spot
anomalies, test hypotheses, and check assumptions using statistical graphics and
visualization tools.
• Artificial Intelligence (AI): Broad field of computer science focuses on creating
systems capable of performing tasks that typically require human intelligence. These
tasks include reasoning, learning, problem-solving, perception, and language
understanding.
• Machine Learning (ML): subset of AI that involves training algorithms to recognize
patterns in data and make predictions or decisions based on new data. It focuses on
building models that can learn from and make decisions based on data.
• Deep Learning (DL): is a specialized subset of ML that uses neural networks with
many layers (deep neural networks) to model complex patterns in large datasets. It is
particularly effective for tasks such as image and speech recognition.
• Data science leverages AI, ML, and DL techniques to analyze and interpret complex
data. The core components of data science—maths, statistics, visualization, and
EDA—are essential for building and evaluating AI/ML/DL models.
Using new and advanced
techniques from several
disciplines
The Field of Data Science
Role of Data Science in Modern Society
Transforming Industries through Data-Driven Insights
• Data science has become a critical component in various sectors, driving innovation, efficiency, and better decision-making. Its impact can be seen across multiple industries, transforming how businesses and organizations operate and deliver value.
Healthcare
Retail
Optimizing Supply Chain and Enhancing Customer Experience
• Personalized Recommendations: Analyzing customer behavior data to provide personalized product recommendations.
• Inventory Management: Predictive analytics helps in managing stock levels by forecasting demand trends.
• Customer Sentiment Analysis: Analyzing social media and customer reviews to gauge sentiment and improve products and services.
• Examples: Amazon's recommendation engine, Walmart's supply chain optimization.
Transportation and Logistics
Marketing
Targeted Campaigns and Better Customer Insights
• Customer Segmentation: Analyzing demographic, behavioral, and transactional data to segment customers.
• Campaign Effectiveness: Measuring the impact of marketing campaigns in real time and adjusting strategies based on data-driven insights.
• Sentiment Analysis: Understanding customer sentiment towards brands and products through social media analysis.
• Examples: Netflix's recommendation system, Procter & Gamble's market research.
Government and Public Policy
Data-Driven Decision Making and Public Services
• Policy Formulation: Using data to understand the impact of existing policies and predict the outcomes of proposed ones.
• Public Health: Monitoring disease outbreaks and health trends to inform public health strategies.
• Smart Governance: Implementing data-driven approaches to improve public services and resource allocation.
• Examples: Government economic tracking, public health agencies monitoring disease spread.
Data Science in Banking
• Data science has become a pivotal tool in the banking sector, driving innovations and improvements across various functions.
• Data Volume: Banks manage vast amounts of data daily, from transaction data to customer interactions, which is a prime resource for insights.
• Innovation and Competition: Staying competitive in the fintech era, where startups leverage cutting-edge technology to disrupt traditional banking models.
• Customer Expectations: Customers expect personalized experiences and services that can only be delivered effectively through data-driven insights.
Key Benefits of Data Science in Banking
2. Risk Management
• Credit Scoring: Data science improves the accuracy of credit scoring models by incorporating a wider range of data points, including non-traditional data like social media activity or mobile phone usage patterns.
• Fraud Detection: Machine learning models can detect patterns and anomalies that indicate fraudulent activities, reducing losses significantly.
• Real-time processing and predictive capabilities allow banks to respond swiftly to potential threats.
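As a hedged illustration of the fraud-detection idea (not the models any specific bank uses), here is an anomaly-detection sketch with scikit-learn's IsolationForest on synthetic transaction amounts:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)

# Hypothetical transaction amounts: mostly routine, a few extreme values.
normal = rng.normal(50, 10, (500, 1))
suspicious = rng.normal(500, 50, (5, 1))
X = np.vstack([normal, suspicious])

# Isolation Forest flags observations that are easy to isolate as anomalies.
model = IsolationForest(contamination=0.01, random_state=1).fit(X)
labels = model.predict(X)          # -1 = anomaly, 1 = normal
print(np.where(labels == -1)[0])   # indices of flagged transactions
```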
Key Benefits of Data Science in Banking
3. Operational Efficiency
• Automation: Automation of routine tasks such as data entry, compliance checks, and customer queries through intelligent algorithms frees up human resources for more complex tasks.
• Decision Making: Advanced analytics help in making data-driven decisions that optimize operational costs and improve service delivery.
Key Benefits of Data Science in Banking
4. Regulatory Compliance
• Regulatory Reporting: Data science tools can automate the extraction, processing, and reporting of data required by regulatory bodies, ensuring accuracy and timeliness.
• Anti-Money Laundering (AML): Data science techniques can enhance the monitoring of transactions by identifying suspicious patterns that human analysts might miss.
The Data Science Life Cycle

Data Collection
Key Activities
• Identify data sources: Databases, web scraping, APIs, surveys, sensors, etc.
• Collect raw data: Downloading datasets, querying databases, collecting data through surveys, etc.

Data Preprocessing
• Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a dataset.
Key Activities
• Identify and correct invalid, erroneous, or inconsistent data values, or remove them.
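A minimal pandas sketch of these cleaning activities on invented records:

```python
import pandas as pd
import numpy as np

# Hypothetical raw records with typical quality problems.
df = pd.DataFrame({
    "age":  [29, -5, 41, np.nan, 29],   # -5 is invalid, NaN is missing
    "city": ["Cairo", "cairo", "Giza", "Giza", "Cairo"],
})

df["city"] = df["city"].str.title()               # fix inconsistent casing
df.loc[df["age"] < 0, "age"] = np.nan             # treat invalid values as missing
df["age"] = df["age"].fillna(df["age"].median())  # impute missing values
df = df.drop_duplicates()                         # remove duplicate records
print(df)
```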
Ways to Handle Outliers
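The slide's detail did not survive extraction; one widely used approach (a sketch under that assumption, not the slide's original content) is the interquartile-range rule, with capping or removal as the two common remedies:

```python
import pandas as pd

s = pd.Series([12, 14, 15, 13, 14, 98, 15, 13])   # 98 is a likely outlier

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(s[(s < low) | (s > high)])          # flagged outliers
s_capped = s.clip(low, high)              # option 1: cap (winsorize)
s_dropped = s[(s >= low) & (s <= high)]   # option 2: remove
```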
Data Integration
• Data integration involves combining data from multiple sources into a single, unified dataset.
• The process of data integration can be complex and time-consuming, as it often involves identifying and resolving inconsistencies and conflicts between the data sources.
Resources:
• https://www.javatpoint.com/data-integration-in-data-mining
• https://www.geeksforgeeks.org/data-integration-in-data-mining/
• https://hevodata.com/learn/data-integration-techniques-strategies/
Data Reduction
• Data reduction is used to reduce the amount of data and thereby reduce the costs associated with data mining or data analysis.
• Although this step reduces the volume, it maintains the integrity of the original data.
• This preprocessing step is especially crucial when working with big data, as the amount of data involved would be gigantic.
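A small sketch of one common reduction technique, principal component analysis (PCA) with scikit-learn, on synthetic correlated data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Synthetic data: 50 observed features driven by 5 underlying factors.
latent = rng.normal(size=(1000, 5))
X = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(1000, 50))

# Keep enough components to retain ~95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)         # far fewer columns
print(pca.explained_variance_ratio_.sum())    # variance retained
```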
Data Transformation
• Data transformation is the process of converting data from one format to another. In essence, it involves methods for transforming data into appropriate formats that the model can learn efficiently from.
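A minimal sketch of two common transformations, standardization and one-hot encoding, using scikit-learn and pandas on invented data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"income": [30000, 52000, 61000],
                   "city": ["Cairo", "Giza", "Cairo"]})

# Numeric feature: standardize to zero mean, unit variance.
df["income_scaled"] = StandardScaler().fit_transform(df[["income"]])

# Categorical feature: one-hot encode into model-friendly columns.
df = pd.get_dummies(df, columns=["city"])
print(df)
```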
Exploratory Data Analysis
Definition
• EDA is an approach to analyzing data sets to summarize their main characteristics, often using visual methods.
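A brief EDA sketch in Python with pandas and seaborn; the tiny dataset is invented for illustration:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical dataset; in practice, load your own with pd.read_csv(...).
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip":        [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
})

print(df.describe())    # summary statistics per column
print(df.isna().sum())  # spot missing values

sns.histplot(df["total_bill"])                        # distribution of one variable
plt.show()
sns.heatmap(df.corr(numeric_only=True), annot=True)   # pairwise correlations
plt.show()
```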
Model Evaluation
Key Activities
• Validate model: Split data into training and testing sets, or use cross-validation.
• Evaluate performance: Use metrics such as accuracy, precision, recall, F1 score, ROC-AUC.
• Compare models: Evaluate multiple models to select the best-performing one.
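A minimal scikit-learn sketch of these activities on synthetic data: a train/test split, the listed metrics, and cross-validation:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)
clf = LogisticRegression().fit(X_train, y_train)
pred = clf.predict(X_test)

print("accuracy: ", clf.score(X_test, y_test))
print("precision:", precision_score(y_test, pred))
print("recall:   ", recall_score(y_test, pred))
print("F1:       ", f1_score(y_test, pred))
print("ROC-AUC:  ", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
print("5-fold CV:", cross_val_score(clf, X, y, cv=5).mean())
```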
Data Science Skill Set
• Data Visualization:
• Matplotlib and Seaborn (Python): For creating static, animated, and interactive visualizations in Python.
• ggplot2 (R): For data visualization in R, providing a powerful tool for creating complex plots.
• Tableau and Power BI: Business intelligence tools for creating interactive dashboards and visualizations.
• Problem-Solving:
• Ability to frame business problems into data science problems and find data-driven solutions.
• Communication Skills:
• Data Storytelling: The ability to present data insights in a clear and compelling way to stakeholders.
• Collaboration: Working effectively with cross-functional teams, including engineers, product managers, and business analysts.
• Critical Thinking:
• Ability to analyze complex problems, interpret data correctly, and make informed decisions based on data.
• Project Management:
• Managing data science projects efficiently, from data collection and analysis to model building and deployment.
Soft Skills
• Curiosity and Continuous Learning:
• A good data scientist is always curious and eager to learn new methods, technologies, and tools in the ever-evolving field of data science.
• Attention to Detail:
• Being meticulous in analyzing data and building models to ensure the accuracy and reliability of results.
• Adaptability:
• Ability to adapt to new challenges, tools, and techniques in a fast-paced environment.
Things to Do When Starting Data Science
BE CURIOUS
• Be curious about the data
• Be curious about the method
• Be curious about the results
• Be curious about the application
Big Data

Machine Learning
Types of Machine Learning
• Unsupervised Learning
• Here, there is no supervision. The machine is trained using an unlabeled dataset and is enabled to predict the output without any supervision.
• Semi-supervised Learning
• Semi-supervised learning combines characteristics of both supervised and unsupervised machine learning.
• It uses a combination of labeled and unlabeled datasets to train its algorithms.
• Reinforcement Learning
• Reinforcement learning is a feedback-based process.
• Here, the AI component acts on its surroundings by trial and error: it takes actions, learns from experience, and improves performance.
• The component is rewarded for each good action and penalized for every wrong move.
Supervised Machine Learning
• Supervised learning is a type of ML where the model is trained on labeled data. The training data includes input-output pairs, where the correct output is known, allowing the model to learn over time to predict the output from the input.
• The algorithm makes predictions based on the input data and is corrected when those predictions are wrong.
• Example: A bank uses historical customer data to predict who might default on a loan. The model learns from past customer profiles that are labeled as 'defaulted' or 'not defaulted'.
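A hedged sketch of the loan-default example with a scikit-learn decision tree; the features and labels are synthetic, not real bank data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)

# Hypothetical labeled history: [income, debt]; 1 = defaulted, 0 = repaid.
X = np.column_stack([rng.normal(50, 15, 400), rng.normal(20, 8, 400)])
y = (X[:, 1] / X[:, 0] + rng.normal(0, 0.1, 400) > 0.45).astype(int)

# The model learns from the labeled input-output pairs.
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(clf.predict([[45, 25], [80, 10]]))  # predictions for two new applicants
```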
Unsupervised Machine Learning

Model Parameters vs Hyperparameters
• Parameters, on the other hand, are internal to the model. That is, they are learned or estimated purely from the data during training, as the algorithm tries to learn the mapping between the input features and the labels or targets.
• Model training typically starts with parameters initialized to some values (random values or zeros). As training/learning progresses, the initial values are updated using an optimization algorithm (e.g., gradient descent).
• The learning algorithm continuously updates the parameter values as learning progresses, but hyperparameter values set by the model designer remain unchanged.
• At the end of the learning process, the model parameters are what constitute the model itself.
• Examples: The coefficients (or weights) of linear and logistic regression models, the weights and biases of a neural network, the cluster centroids in clustering.
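A small scikit-learn sketch of the distinction, using clustering: the number of clusters is a hyperparameter chosen by the designer, while the learned cluster centroids are parameters estimated from the data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# n_clusters is a hyperparameter: set by the designer, never learned.
km = KMeans(n_clusters=2, n_init=10, random_state=5).fit(X)

# cluster_centers_ are parameters: estimated from the data during training.
print(km.cluster_centers_)
```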
Bias vs Variance