
Introduction to

Data Science
A crash course
About the Instructor
Engi Amin
Instructor of Statistics & Data Science at Markov Data
Email: [email protected]

› M.Sc. in Socio-computing
› B.Sc. in Statistics
› Teaching Experience: 9+ years
› Researcher & Data Analyst: 9+ years (national and international projects)
Table of Contents
01 Introduction to Data Science
02 The Field of Data Science
03 Role of Data Science in Modern Society
04 The Data Science Life Cycle
05 Data Science Skill Set
06 Big Data
07 Machine Learning
08 MLOps
09 Career Paths in Data Science
10 Resources for Further Learning
Introduction to
Data Science
Data

• Data are units of information.


• In a more technical context, data
is a set of values of qualitative or
quantitative variables about one
or more persons or objects.
Data Types
• Structured Data: Highly organized and easily
searchable data, usually stored in databases or
spreadsheets (e.g., customer records, transaction
logs).
• Unstructured Data: Data that does not fit into a
database neatly, including text, images, and video
(e.g., customer emails, call center transcripts,
social media posts).
• Semi-structured Data: A mix of both structured
and unstructured data elements (e.g., JSON, XML
files). Example: Transaction logs that include
structured data (amount, date) along with
unstructured data (customer comments).
Dataset

• A dataset is a collection of
related sets of information,
usually formatted in a table, used
for analysis or processing.
• Structured data are represented
in rows and columns.
• Columns represent features or
variables.
• Rows represent instances or
records.
Data Analysis

• The process of inspecting, cleaning, and modeling data with the goal of discovering useful information.
• (Diagram) The analysis cycle: Define → Collect → Clean → Analyze → Interpret → Visualize → Present.
Types of Data Analytics

Descriptive Analysis
• Descriptive analysis focuses on summarizing and describing the characteristics of a dataset without making inferences or predictions.
• Example Questions:
  • What is the average income of households in a particular region?
  • How does the distribution of ages vary among different treatment groups?
  • What are the most common opinions regarding a topic among participants?
• Common Techniques:
  • Summary Statistics: Measures of central tendency (mean, median, mode), measures of variability (range, variance, standard deviation).
  • Frequency Distributions: Tabulations, histograms, bar charts.
  • Data Visualization: Graphs, charts, heatmaps.
• Common Tools:
  • Spreadsheet software (e.g., Microsoft Excel, Google Sheets).
  • Statistical software (e.g., SPSS, SAS, R, Python with libraries like pandas, seaborn, matplotlib).
  • Business intelligence tools (e.g., Tableau, Power BI).
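As a minimal illustration of descriptive analysis in Python — a sketch assuming pandas is installed and using a small made-up household dataset — the snippet below computes summary statistics and a frequency table:

    import pandas as pd

    # Hypothetical household survey data (values invented for illustration)
    df = pd.DataFrame({
        "region": ["North", "North", "South", "South", "East"],
        "income": [42000, 58000, 39000, 61000, 47000],
        "age":    [34, 45, 29, 52, 41],
    })

    # Measures of central tendency and variability
    print(df["income"].describe())                  # count, mean, std, min, quartiles, max
    print(df["income"].median())                    # median income
    print(df["income"].var(), df["income"].std())   # variance and standard deviation

    # Frequency distribution of a categorical variable
    print(df["region"].value_counts())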
Diagnostic Analysis
• Diagnostic analysis aims to identify the causes of observed phenomena or outcomes by examining relationships within the data and diagnosing problems or anomalies.
• Example Questions:
  • What factors contributed to a decrease in contributions to a public good?
  • Are there any patterns or trends in trust behavior, and what might be causing them?
  • What variables are significantly associated with cooperation, and what are the potential reasons behind it?
• Common Techniques:
  • Hypothesis testing: Formulating a hypothesis about the relationship between variables and using statistical tests to determine whether the data supports or rejects it.
  • Regression analysis: Understanding the relationship between a dependent variable and one or more independent variables; it can reveal how changes in the independent variables affect the dependent variable.
  • Outlier detection: Identifies data points that deviate significantly from the norm.
  • Correlation analysis: Assesses the strength and direction of relationships between variables. Although it doesn't imply causation, it can indicate potential areas for further investigation.
  • Mediation and moderation analysis: Analyzes how a third variable affects the relationship between the independent variable (IV) and the dependent variable (DV).
• Common Tools:
  • Spreadsheet software (e.g., Microsoft Excel, Google Sheets).
  • Statistical software (e.g., SPSS, SAS, R, Python with libraries like statsmodels, scikit-learn).
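A hedged sketch of two common diagnostic techniques — a simple hypothesis test and correlation analysis — using pandas and SciPy on made-up data:

    import pandas as pd
    from scipy import stats

    # Made-up trust scores for two experimental groups
    group_a = [6.1, 5.8, 7.2, 6.5, 6.9, 5.5]
    group_b = [4.9, 5.2, 4.7, 5.8, 5.0, 4.6]

    # Hypothesis test: do the group means differ significantly?
    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

    # Correlation analysis: strength and direction of a relationship
    df = pd.DataFrame({"contribution": [10, 8, 6, 9, 4, 3],
                       "trust_score":  [7.0, 6.5, 5.0, 6.8, 4.2, 3.9]})
    print(df["contribution"].corr(df["trust_score"]))  # Pearson correlation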
Predictive Analytics
• Predictive analysis involves using historical data to make predictions about future outcomes or trends, typically through statistical modeling or machine learning algorithms.
• Example Questions:
  • Can we predict future buying patterns based on past behavior and demographic information?
  • How accurately can we forecast stock market fluctuations or economic trends using behavioral indicators and sentiment analysis?
  • Can we predict retirement savings adequacy and assess the impact of behavioral interventions on retirement planning?
• Common Techniques:
  • Regression Analysis: Linear regression, logistic regression, polynomial regression.
  • Time Series Analysis: ARIMA models, exponential smoothing.
  • Machine Learning Algorithms: Supervised ML techniques such as KNN, decision trees, random forests, support vector machines, and neural networks.
  • Simulation Analysis: Monte Carlo, agent-based modeling, discrete event simulation, system dynamics modeling.
• Common Tools:
  • Machine learning libraries in Python (e.g., scikit-learn, TensorFlow, Keras) and R (e.g., caret).
  • Predictive analytics platforms (e.g., RapidMiner, KNIME, Weka).
  • Specialized software for time series forecasting (e.g., SAS Forecast Server, Prophet by Facebook).
  • Simulation software (e.g., AnyLogic, Simul8, MATLAB/Simulink, NetLogo).
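A minimal predictive-modeling sketch with scikit-learn, assuming a small made-up buying-pattern dataset: a random forest is trained on historical records and used to predict the outcome for a new customer.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Made-up history: features [age, monthly_spend], label = bought again (1) or not (0)
    X = np.array([[25, 40], [47, 10], [35, 80], [52, 15], [23, 60], [44, 20]])
    y = np.array([1, 0, 1, 0, 1, 0])

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)                        # learn patterns from past data

    print("Test accuracy:", model.score(X_test, y_test))
    print("Prediction for a new customer:", model.predict([[30, 55]]))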
Prescriptive Analytics
• Prescriptive analysis goes beyond prediction by recommending actions or decisions to optimize outcomes or achieve specific objectives, often through optimization or simulation techniques.
• Example Questions:
  • Which behavioral interventions (nudges) are most effective in promoting healthy behaviors, such as exercise, diet, or medication adherence?
  • What prescriptive strategies can be implemented to encourage retirement savings among specific demographic groups, considering their unique behavioral tendencies?
  • What incentive structures or reward programs are most effective in motivating employees to achieve performance goals, considering behavioral principles such as loss aversion or social norms?
• Common Techniques:
  • Optimization Techniques: Linear programming, integer programming, dynamic programming.
  • Simulation and Scenario Analysis: Monte Carlo simulation, agent-based modeling, discrete event simulation, system dynamics modeling, scenario planning.
  • Machine Learning Techniques: Decision trees, neural networks, recommendation systems.
• Common Tools:
  • Optimization libraries in Python (e.g., SciPy, CVXPY, Gekko).
  • Optimization software (e.g., IBM CPLEX, Gurobi).
  • Simulation software (e.g., AnyLogic, Arena, Simul8, MATLAB/Simulink, NetLogo).
  • Decision support systems (e.g., Microsoft Decision Support, D-Sight).
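A small prescriptive-analytics sketch using linear programming with SciPy; the budget numbers and impact coefficients are invented for illustration. It recommends how to split a budget across two incentive programs to maximize expected impact.

    from scipy.optimize import linprog

    # Decision variables: spend on program A and program B (in $1,000s).
    # Assumed expected impact per $1,000: A = 3.0, B = 2.0 (made-up coefficients).
    # linprog minimizes, so the objective is negated to maximize impact.
    c = [-3.0, -2.0]

    # Constraints: total budget <= 100, program A capped at 60
    A_ub = [[1, 1], [1, 0]]
    b_ub = [100, 60]

    result = linprog(c, A_ub=A_ub, b_ub=b_ub,
                     bounds=[(0, None), (0, None)], method="highs")
    print("Optimal spend (A, B):", result.x)
    print("Maximum expected impact:", -result.fun)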
Data Science
• Data science is an interdisciplinary field that
uses scientific methods, processes, algorithms,
and modeling techniques to extract knowledge
and insights from structured and unstructured
data.
• Data science combines math and statistics,
specialized programming, advanced analytics,
artificial intelligence (AI) and machine learning
with specific subject matter expertise to
uncover actionable insights hidden in an
organization’s data.
• These insights can be used to guide decision-
making and strategic planning.
Data Science
The Interdisciplinary Nature of DS

• Maths & Statistics: Fundamental tools for data analysis, hypothesis testing, and
building models.
• Visualization: Techniques for creating visual representations of data to communicate
insights effectively (e.g., charts, graphs).
• Exploratory Data Analysis (EDA): Initial data analysis to discover patterns, spot
anomalies, test hypotheses, and check assumptions using statistical graphics and
visualization tools.
• Artificial Intelligence (AI): Broad field of computer science that focuses on creating
systems capable of performing tasks that typically require human intelligence. These
tasks include reasoning, learning, problem-solving, perception, and language
understanding.
• Machine Learning (ML): Subset of AI that involves training algorithms to recognize
patterns in data and make predictions or decisions based on new data. It focuses on
building models that can learn from and make decisions based on data.
• Deep Learning (DL): A specialized subset of ML that uses neural networks with
many layers (deep neural networks) to model complex patterns in large datasets. It is
particularly effective for tasks such as image and speech recognition.
• Data science leverages AI, ML, and DL techniques to analyze and interpret complex
data. The core components of data science—maths, statistics, visualization, and
EDA—are essential for building and evaluating AI/ML/DL models.
Using new and advanced
techniques from several
disciplines
The Field of
Data Science
Role of Data
Science in
Modern Society
Transforming Industries through Data-Driven Insights

Data Science has become a critical component in various sectors, driving innovation, efficiency, and better decision-making. Its impact can be seen across multiple industries, transforming how businesses and organizations operate and deliver value.
Healthcare

Transforming Patient Care and Outcomes


• Predictive Analytics: Helps in predicting disease outbreaks,
patient admissions, and readmission rates, enabling early
intervention.
• Personalized Medicine: Tailoring treatments based on
individual genetic information and other personal data.
• Medical Imaging: Enhancing the accuracy of image analysis in
radiology through deep learning techniques.
• Examples: IBM Watsonx Health, COVID-19 spread prediction
models.
Finance
Enhancing Risk Management and Fraud
Detection
• Fraud Detection: Machine learning algorithms identify unusual
patterns in transaction data, flagging potentially fraudulent
activities.
• Risk Management: Assessing credit risk and loan eligibility by
analyzing customer financial behavior and history.
• Algorithmic Trading: Use of algorithms to analyze market data
and make high-frequency trading decisions.
• Examples: PayPal, hedge funds using algorithmic trading
strategies.

Retail
Optimizing Supply Chain and Enhancing
Customer Experience
• Personalized Recommendations: Analyzing customer behavior
data to provide personalized product recommendations.
• Inventory Management: Predictive analytics helps in managing
stock levels by forecasting demand trends.
• Customer Sentiment Analysis: Analyzing social media and
customer reviews to gauge sentiment and improve products
and services.
• Examples: Amazon’s recommendation engine, Walmart's
supply chain optimization.

Transportation and Logistics

Improving Efficiency and Reducing Costs


• Route Optimization: Analyzing traffic patterns and delivery
data to find the most efficient routes.
• Predictive Maintenance: Monitoring vehicle health and
predicting maintenance needs to prevent breakdowns.
• Smart Cities: Using Data Science to manage urban traffic,
reducing congestion, and improving public transportation
systems.
• Examples: UPS route optimization, ride-sharing companies like
Uber and Lyft.

Marketing
Targeted Campaigns and Better Customer
Insights
• Customer Segmentation: Analyzing demographic, behavioral,
and transactional data to segment customers.
• Campaign Effectiveness: Measuring the impact of marketing
campaigns in real-time and adjusting strategies based on data-
driven insights.
• Sentiment Analysis: Understanding customer sentiment
towards brands and products through social media analysis.
• Examples: Netflix's recommendation system, Procter &
Gamble's market research.
Government and Public Policy
Data-Driven Decision Making and Public
Services
• Policy Formulation: Using data to understand the impact of
existing policies and predict the outcomes of proposed ones.
• Public Health: Monitoring disease outbreaks and health trends
to inform public health strategies.
• Smart Governance: Implementing data-driven approaches to
improve public services and resource allocation.
• Examples: Government economic tracking, public health
agencies monitoring disease spread.

Data Science in Banking
• Data science has become a pivotal
tool in the banking sector, driving
innovations and improvements across
various functions.
• Data Volume: Banks manage vast
amounts of data daily—from
transaction data to customer
interactions—which is a prime
resource for insights.
• Innovation and Competition: Staying
competitive in the fintech era where
startups leverage cutting-edge
technology to disrupt traditional
banking models.
• Customer Expectations: Customers
expect personalized experiences and
services that can only be delivered
effectively through data-driven
insights.
Key Benefits of Data Science in Banking

1. Enhanced Customer Experience


• Personalization: Data science
enables banks to analyze customer
data and behavior, allowing for
personalized marketing, customized
product offerings, and tailored
banking experiences.
• Customer Retention: By predicting
customer needs and addressing
them proactively, banks can
improve satisfaction and loyalty.
Key Benefits of Data Science in Banking

2. Risk Management
• Credit Scoring: Data science improves the
accuracy of credit scoring models by
incorporating a wider range of data points,
including non-traditional data like social media
activity or mobile phone usage patterns.
• Fraud Detection: Machine learning models can
detect patterns and anomalies that indicate
fraudulent activities, reducing losses
significantly.
• Real-time processing and predictive capabilities
allow banks to respond swiftly to potential
threats.
Key Benefits of Data Science in Banking

3. Operational Efficiency
• Automation: Automation of routine tasks such
as data entry, compliance checks, and customer
queries through intelligent algorithms frees up
human resources for more complex tasks.
• Decision Making: Advanced analytics help in
making data-driven decisions that optimize
operational costs and improve service delivery.
Key Benefits of Data Science in Banking

4. Regulatory Compliance
• Regulatory Reporting: Data science tools can
automate the extraction, processing, and
reporting of data required by regulatory bodies,
ensuring accuracy and timeliness.
• Anti-Money Laundering (AML): Data science
techniques can enhance the monitoring of
transactions by identifying suspicious patterns
that human analysts might miss.
Key Benefits of Data Science in Banking

5. Innovative Financial Products


• Product Development: Insights
derived from data science can
lead to the creation of new
financial products that meet
changing customer needs.
• Market Analysis: Banks can use
data science to conduct market
analysis and forecast trends,
helping them to stay competitive
in a rapidly changing financial
landscape.
Data Science Life
Cycle
Data Collection

Definition
• Data collection is the process of gathering and measuring information on targeted variables to establish a data set for analysis.

Key Activities

• Identify data sources: Databases, web scraping, APIs, surveys, sensors, etc.
• Collect raw data: Downloading datasets, querying databases, collecting data
through surveys, etc.

Tools and Techniques

• SQL for database queries


• Web scraping tools (e.g., BeautifulSoup, Scrapy, Selenium)
• APIs and data ingestion frameworks (e.g., Google Maps API, Twitter API,
Apache NiFi, Kafka)
• Data integration platforms (e.g., Talend)

Example:

• Collecting customer transaction data from a retail store's database to


analyze purchasing behavior.
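A brief data-collection sketch along the lines of the example above, assuming a local SQLite database with a hypothetical transactions table; the file name and column names are illustrative only.

    import sqlite3
    import pandas as pd

    # Connect to a hypothetical retail database and pull raw transaction data
    conn = sqlite3.connect("retail_store.db")          # illustrative file name
    query = """
        SELECT customer_id, purchase_date, amount
        FROM transactions
        WHERE purchase_date >= '2024-01-01'
    """
    transactions = pd.read_sql_query(query, conn)       # raw data loaded into a DataFrame
    conn.close()

    print(transactions.head())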
Data Cleaning & Preprocessing

Definition
• Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a dataset.

Key Activities

• Handle missing values: Imputation, deletion, or interpolation.


• Remove duplicates: Identify and remove duplicate records.
• Correct errors: Fix data entry errors and inconsistencies.
• Standardize data: Convert data into a consistent format.

Tools and Techniques

• Python libraries (e.g., Pandas, NumPy)


• Data profiling tools (e.g., OpenRefine)
• Data cleaning frameworks (e.g., DataCleaner)

Example:

• Cleaning a dataset of customer information by filling missing values in


the 'Age' column with the median age and standardizing the 'Date of
Birth' format.
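A minimal pandas sketch of the cleaning example above (records are made up; column names match the example). Note that format="mixed" for date parsing requires pandas 2.0 or later.

    import pandas as pd
    import numpy as np

    customers = pd.DataFrame({
        "CustomerID":    [1, 2, 2, 3],
        "Age":           [34, np.nan, np.nan, 51],
        "Date of Birth": ["1990-03-14", "14/07/1985", "14/07/1985", "1973-01-02"],
    })

    # Remove duplicate records
    customers = customers.drop_duplicates(subset="CustomerID")

    # Impute missing ages with the median age
    customers["Age"] = customers["Age"].fillna(customers["Age"].median())

    # Standardize the 'Date of Birth' format (mixed formats parsed, then reformatted)
    customers["Date of Birth"] = (
        pd.to_datetime(customers["Date of Birth"], format="mixed")  # pandas >= 2.0
          .dt.strftime("%Y-%m-%d")
    )

    print(customers)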
Data Preprocessing
After you gather the data you need, you
must assess your data to identify any
problems in your data’s quality or
structure, and clean your data by
modifying, replacing, or removing data to
ensure that your dataset is of the highest
quality and as well-structured as possible.
Data preprocessing is an important step in the data analysis process: it involves transforming raw data into a clean, structured format that can be analyzed and used to build and deploy models.
Data Assessment
Data Assessment
• When assessing your data, you're like a
detective at work, inspecting your dataset
for two things:
1. Data quality issues (i.e., content issues)
2. Tidiness issues (i.e., structural issues)

• Assessing is the precursor to cleaning. You


can't clean something that you don't know
exists!
Data Types
When dealing with datasets, the type of data plays an important role in determining which preprocessing strategy will work for a particular dataset and which type of statistical analysis should be applied for the best results.
Data Quality
• Low quality data is also
called dirty data.
• Data in the real world is dirty: it contains lots of potentially incorrect values, e.g., from faulty instruments, human or computer error, or transmission errors.
• There are six main
dimensions to assess the
quality of data you have.
Data Quality: Completeness
• Completeness measures the degree to which
all expected records in a dataset are present
(i.e., no missing values).
• Data is not always available (e.g., many tuples
have no recorded value for attributes, such as
customer name in customer data).
• Missing data may be due to:
• equipment malfunction
• data not entered due to
misunderstanding
• certain data may not be considered
important at the time of entry
• no registered history or changes of the
data.
Data Quality: Validity
• Validity measures the degree to which the values in a data element are correct or make sense.
• Invalid data can be caused by a
variety of issues, such as data entry
errors, false recording and unreal
data.
• Example:
• CustomerBirthDate value must be a date
in the past.
• CustomerAccountType value must be
either Loan or Deposit.
• LatestAccountOpenDate value must be a
date in the past.
Data Quality: Uniqueness

• Uniqueness measures the degree to


which the records in a dataset are
not duplicated.
• Uniqueness is a critical dimension
for ensuring no duplication or
overlaps in the data.
Data Quality: Timeliness
• Timeliness refers to the
expected time of availability
and accessibility of the data.
• Timeliness can be of
importance since out-of-date
information can lead
individuals to make poor
decisions.
Data Quality: Consistency

• Consistency is a data quality dimension that


measures the degree to which data is the
same across all instances of the data (i.e.,
relating to the format or type of data. )
• Consistency issues also include finding
patterns in data that are outside the
expected behavior (i.e., outlier or anomaly
detection).
Outliers

• Outliers are data objects with characteristics that are


considerably different than most of the other data
objects in the data set.
• Case 1: Outliers are noise that interferes with data analysis
• Case 2: Outliers are the goal of our analysis
• Credit card fraud
• Intrusion detection

• How to detect outliers?


• Boxplots
• Z-score
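A small sketch of z-score-based outlier detection on made-up transaction amounts; the threshold of 3 standard deviations is a common rule of thumb, not a fixed standard.

    import numpy as np

    rng = np.random.default_rng(0)

    # Mostly ordinary transaction amounts, plus one suspicious value (made up)
    amounts = np.append(rng.normal(loc=25, scale=3, size=200), 480.0)

    # Z-score: how many standard deviations each value lies from the mean
    z_scores = (amounts - amounts.mean()) / amounts.std()

    # Flag values more than 3 standard deviations from the mean
    print("Potential outliers:", amounts[np.abs(z_scores) > 3])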
Data Quality: Accuracy

• Accuracy refers to whether the data


values stored for an object are the real
correct values.
• Accuracy means that the data is error-free
and has a reliable and consistent source
of information.
• Example: All records in the Customer
Table must have accurate Customer
Name, Customer Birthdate, and Customer
Address fields when compared to the Tax
Form.
Data Tidiness
• Untidy data is also
called Messy Data.
• Untidy data has
issues with the
structure.
Exercise: Quality vs Tidiness
Solution: Quality vs Tidiness
Data Cleaning
Ways to Handle Missing Data
Handling Duplicates
• Identify and remove duplicate records from the dataset.
Handling Inconsistent or Erroneous Data

• Identify and correct invalid, erroneous or inconsistent data values or remove them.
Ways to Handle
Outliers
Data Integration
Data Integration
• Data integration involves combining data from multiple sources into a single, unified dataset.
• The process of data integration can be complex and time-consuming, as it often involves identifying and resolving inconsistencies and conflicts between the data sources.

Resources:
• https://www.javatpoint.com/data-integration-in-data-mining
• https://www.geeksforgeeks.org/data-integration-in-data-mining/
• https://hevodata.com/learn/data-integration-techniques-strategies/
Data Reduction
Data Reduction
• Data reduction is used to reduce the amount of
data and thereby reduce the costs associated
with data mining or data analysis.
• Although this step reduces the volume, it
maintains the integrity of the original data.
• This data preprocessing step is especially crucial
when working with big data as the amount of
data involved would be gigantic.
Data Transformation
Data
Transformation
• Data transformation is
the process of converting
data from one format to
another. In essence, it
involves methods for
transforming data into
appropriate formats that
the model can learn
efficiently from.
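A short sketch of two common transformations — standardizing a numeric feature and one-hot encoding a categorical feature — using pandas and scikit-learn; the data are made up.

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.DataFrame({
        "income":  [42000, 58000, 39000, 61000],
        "segment": ["retail", "premium", "retail", "premium"],
    })

    # Standardize a numeric column to zero mean and unit variance
    df["income_scaled"] = StandardScaler().fit_transform(df[["income"]])

    # One-hot encode a categorical column into indicator variables
    df = pd.get_dummies(df, columns=["segment"])
    print(df)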
Exploratory Data Analysis

Definition
• EDA is an approach to analyzing data sets to summarize their main characteristics, often using visual methods.

Key Activities

• Summarize data: Descriptive statistics (mean, median, mode, standard


deviation).
• Identify patterns: Detect trends, correlations, and anomalies.
• Visualize data: Create plots and graphs to understand distributions and
relationships.

Tools and Techniques

• Python libraries (e.g., Pandas, Matplotlib, Seaborn)


• Statistical software (e.g., R)
• BI tools (e.g., Tableau, Power BI)

Example:

• Exploring the distribution of different variables and how they are


correlated.
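A minimal EDA sketch matching the example above — summaries, correlations, and quick visualizations — using pandas, seaborn, and matplotlib on an illustrative dataset.

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Illustrative dataset
    df = pd.DataFrame({
        "age":    [23, 35, 47, 52, 31, 44, 28, 39],
        "income": [31000, 52000, 61000, 72000, 45000, 58000, 38000, 56000],
    })

    print(df.describe())        # summary statistics per column
    print(df.corr())            # pairwise correlations

    sns.histplot(df["income"])  # distribution of a single variable
    plt.show()

    sns.scatterplot(data=df, x="age", y="income")  # relationship between two variables
    plt.show()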
Model Building

Definition
• Model building involves developing statistical or machine learning models to make predictions or classifications based on the data.

Key Activities

• Select appropriate model: Choose algorithms based on the problem


type (regression, classification, clustering).
• Train the model: Fit the model to the training data.
• Tune hyperparameters: Optimize model performance through
hyperparameter tuning.

Tools and Techniques

• Python libraries (e.g., Scikit-learn, TensorFlow, Keras)


• R packages (e.g., caret, randomForest)
• Model training platforms (e.g., Google Colab, AWS SageMaker)

Example:

• Building a logistic regression model to predict customer churn based


on historical customer data.
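A minimal sketch of the churn example above using scikit-learn's LogisticRegression; the historical data and feature names are invented for illustration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Made-up historical customers: [tenure_months, monthly_charges], churned (1) or not (0)
    X = np.array([[2, 80], [48, 30], [6, 75], [36, 40], [3, 90], [60, 25], [12, 70], [24, 50]])
    y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

    model = LogisticRegression()
    model.fit(X, y)                            # train on labeled historical data

    new_customers = np.array([[4, 85], [40, 35]])
    print(model.predict(new_customers))        # predicted churn labels
    print(model.predict_proba(new_customers))  # churn probabilities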
Model Evaluation

Definition
• Model evaluation is the process of assessing the performance of a machine learning model.

Key Activities

• Validate model: Split data into training and testing sets or use cross-
validation.
• Evaluate performance: Use metrics such as accuracy, precision, recall, F1
score, ROC-AUC.
• Compare models: Evaluate multiple models to select the best performing
one.

Tools and Techniques

• Scikit-learn for model evaluation metrics


• Cross-validation techniques (e.g., k-fold cross-validation)
• Visualization tools (e.g., ROC curves)

Example:

• Evaluating a credit scoring model by assessing its precision and recall on a


test dataset.
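A small evaluation sketch with scikit-learn metrics, assuming arrays of true labels, predicted labels, and predicted probabilities are available on a held-out test set (values are made up).

    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, roc_auc_score)

    # Made-up true labels and model outputs on a test set
    y_true  = [0, 1, 1, 0, 1, 0, 1, 0, 1, 1]
    y_pred  = [0, 1, 0, 0, 1, 0, 1, 1, 1, 1]
    y_score = [0.2, 0.9, 0.4, 0.1, 0.8, 0.3, 0.7, 0.6, 0.95, 0.85]  # predicted probabilities

    print("Accuracy: ", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall:   ", recall_score(y_true, y_pred))
    print("F1 score: ", f1_score(y_true, y_pred))
    print("ROC-AUC:  ", roc_auc_score(y_true, y_score))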
Model Deployment

Definition
• Model deployment is the process of integrating a machine learning model into a production environment where it can be used to make predictions on new data.

Key Activities

• Prepare deployment environment: Set up infrastructure (cloud, on-


premise).
• Deploy model: Use APIs, microservices, or batch processing.
• Monitor performance: Continuously track model performance and accuracy.

Tools and Techniques

• Docker for containerization


• Flask/Django for API development
• Cloud services (e.g., AWS, Azure, Google Cloud)

Example:

• Deploying a recommendation engine for an e-commerce site to provide


real-time product suggestions to customers.
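A highly simplified deployment sketch: serving a previously trained model behind a Flask API. The file names and request format are assumptions, and a production setup would typically add containerization, authentication, and monitoring on top of this.

    # app.py -- minimal prediction service (illustrative only)
    import joblib
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    model = joblib.load("model.pkl")   # a trained, saved model (assumed to exist)

    @app.route("/predict", methods=["POST"])
    def predict():
        features = request.get_json()["features"]      # e.g. {"features": [[30, 55]]}
        prediction = model.predict(features).tolist()
        return jsonify({"prediction": prediction})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)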
Data Science
Skill Set
American mathematician and computer scientist who served as the Chief Data Scientist of the United States Office of
Science and Technology Policy from 2015 to 2017.
Data Science
Skill Set
• A comprehensive understanding
of both technical and
non-technical skills is essential for
success in the field of data
science.
Technical Skills
• Programming :
• Python: Widely used for data analysis, machine learning, and general-purpose programming. Libraries include Pandas, NumPy, SciPy,
Scikit-learn, and TensorFlow.
• R: Popular for statistical analysis and data visualization. Libraries include ggplot2, dplyr, and caret.
• SQL: Essential for querying and manipulating databases. SQL skills are crucial for extracting and working with large datasets stored in
relational databases.

• Data Manipulation and Analysis:


• Pandas (Python): For data manipulation and analysis. It provides data structures and functions needed to manipulate structured data.
• NumPy (Python): For numerical operations and handling large, multi-dimensional arrays and matrices.
• dplyr (R): For data manipulation, providing a grammar of data manipulation.

• Machine Learning and Statistical Modeling:


• Scikit-learn (Python): A key library for implementing machine learning algorithms.
• TensorFlow and Keras (Python): For building and training deep learning models.
• RandomForest, XGBoost, LightGBM: Popular machine learning libraries for building robust models.

• Data Visualization:
• Matplotlib and Seaborn (Python): For creating static, animated, and interactive visualizations in Python.
• ggplot2 (R): For data visualization in R, providing a powerful tool for creating complex plots.
• Tableau and Power BI: Business intelligence tools for creating interactive dashboards and visualizations.

• Big Data Technologies:


• Hadoop: Framework for distributed storage and processing of large data sets.
• Spark: Unified analytics engine for big data processing, with built-in modules for SQL, streaming, machine learning, and graph
processing.
Non-technical Skills
• Domain Knowledge:
• Understanding the specific industry you are working in (e.g., finance, healthcare,
retail) is crucial for applying data science techniques effectively.

• Problem-Solving:
• Ability to frame business problems into data science problems and find data-driven
solutions.

• Communication Skills:
• Data Storytelling: The ability to present data insights in a clear and compelling way to
stakeholders.
• Collaboration: Working effectively with cross-functional teams, including engineers,
product managers, and business analysts.

• Critical Thinking:
• Ability to analyze complex problems, interpret data correctly, and make informed
decisions based on data.

• Project Management:
• Managing data science projects efficiently, from data collection and analysis to model
building and deployment.
Soft Skills
• Curiosity and Continuous Learning:
• A good data scientist is always curious and eager to learn new
methods, technologies, and tools in the ever-evolving field of
data science.
• Attention to Detail:
• Being meticulous in analyzing data and building models to
ensure accuracy and reliability of results.
• Adaptability:
• Ability to adapt to new challenges, tools, and techniques in a
fast-paced environment.
Things to Do When Starting Data Science

BE CURIOUS:
• Be curious about the data
• Be curious about the method
• Be curious about the results
• Be curious about the application
Big Data
Big Data

• Big Data refers to the vast


volumes of structured and
unstructured data that are
generated at high velocity from a
wide variety of sources.
• This data is so large and complex
that traditional data processing
tools and techniques are
insufficient to handle it.
Importance of Big Data
• Data-Driven Decision Making:
• Big data analytics helps organizations make informed decisions
based on data insights rather than intuition. This leads to better
strategies, improved performance, and competitive advantage.
• Innovation and Business Opportunities:
• Analyzing big data can uncover hidden patterns and trends,
leading to new products, services, and business models.
• Operational Efficiency:
• Big data allows companies to optimize their operations by
identifying inefficiencies, predicting maintenance needs, and
automating processes.
• Customer Insights:
• Understanding customer behavior and preferences through big
data analysis enables personalized marketing, enhanced
customer experiences, and improved customer retention.
Big Data Technologies
• Data Storage Technologies: focus on storing vast
amounts of data efficiently and securely.
• Hadoop, Amazon S3, Google Cloud Storage, Microsoft Azure

• Data Analysis and Mining Technologies: designed to


process, analyze, and derive insights from large
datasets.
• Apache Spark, R, Python, Weka, KNIME, RapidMiner

• Data Visualization Technologies: help create visual


representations of data to communicate insights
effectively.
• Tableau, Power BI, Python, R

• Data Ingestion and Integration Technologies: facilitate


the collection, import, combination and processing of
data from various sources.
• Apache Kafka, NiFi, AWS, IBM InfoSphere DataStage
Machine
Learning
Machine Learning
• Machine Learning is a subset of artificial intelligence
that focuses on building systems that learn from
data, identify patterns, and make decisions with
minimal human intervention.
• ML algorithms use statistical techniques to give
computers the ability to 'learn' from past data
without being explicitly programmed for each task.
• An expert system requires developers to create a
strict set of rules to imitate the decision-making
processes of experts in the field. In contrast,
machine learning models adjust their actions based
on data and outcomes.
• This learning process is what powers AI applications, from simple tasks like spam filtering to complex ones like self-driving cars.
AI & Machine Learning
• Machine learning is a subset of AI that uses
algorithms trained on data to produce
models that can perform specific tasks.
• Most AI is performed using machine
learning, so the two terms are often used
synonymously.
• but AI actually refers to the general concept of
creating human-like cognition using computer
software,
• while ML is only one method of doing so.
• Deep learning is a subset of ML, in which
artificial neural networks (ANNs) that mimic
the human brain are used to perform more
complex reasoning tasks without human
intervention.
DS & Machine Learning
• Machine Learning overlaps with Data Science.
• Data Science involves preparing the data,
selecting the right features, and choosing the
appropriate algorithms for training models.
• Machine Learning, in turn, utilizes these
prepared datasets to develop models that can
predict outcomes, classify data, and uncover
patterns.
• Professionals in both fields must be adept at
handling and processing large datasets.
• Data Scientists often use machine learning
techniques as part of their toolkit to extract
deeper insights from data.
• Machine Learning Engineers rely on data
prepared by Data Scientists to train and
validate their models.
Types of Machine Learning
• Supervised Learning
• This type of ML involves supervision, where machines are
trained on labeled datasets and enabled to predict outputs
based on the provided training.

• Unsupervised Learning
• Here, there is no supervision. The machine is trained using an
unlabeled dataset and is enabled to predict the output
without any supervision.

• Semi-supervised Learning
• Semi-supervised learning comprises characteristics of both
supervised and unsupervised machine learning.
• It uses a combination of labeled and unlabeled datasets to
train its algorithms.

• Reinforcement Learning
• Reinforcement learning is a feedback-based process.
• Here, the AI component acts on its surroundings by trial and error: it takes actions, learns from experience, and improves its performance.
• The component is rewarded for each good action and
penalized for every wrong move.
Supervised Machine Learning
• Supervised Learning is a type of ML where the
model is trained on labeled data. The training
data includes input-output pairs, where the
correct output is known, allowing the model to
learn over time to predict the output from the
input.
• The algorithm makes predictions based on the
input data and is corrected when those
predictions are wrong.
• Example: A bank uses historical customer data
to predict who might default on a loan. The
model learns from past customer profiles that
are labeled as 'defaulted' or 'not defaulted'.
Unsupervised Machine Learning

• Unsupervised learning algorithms are used


when the information used to train is
neither classified nor labeled.
• These models discover hidden patterns or
data groupings without the need for human
intervention.
• The algorithm explores the data to find any
structure or patterns. It groups the data
into clusters or associates different inputs
based on their similarities or differences.
• Example: Clustering customers based on
their purchasing behaviors or transaction
histories to target marketing more
effectively
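A tiny clustering sketch for the customer-segmentation example above, using scikit-learn's KMeans on made-up purchase data; the number of clusters is a modeler's choice, not something the algorithm discovers on its own.

    import numpy as np
    from sklearn.cluster import KMeans

    # Made-up customers: [annual_spend, purchase_frequency]
    X = np.array([[200, 2], [250, 3], [2200, 25], [2400, 30], [900, 10], [1000, 12]])

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)          # cluster assignment for each customer

    print("Cluster labels:", labels)
    print("Cluster centers:", kmeans.cluster_centers_)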
Semi-supervised Machine Learning

• Semi-supervised learning lies between


supervised and unsupervised learning.
It uses a small amount of labeled data
with a large amount of unlabeled data
during training.
• Semi-supervised learning can be a cost-
effective solution when labeling data
becomes too expensive.
• Example: A model that uses a small
amount of labeled fraud data and a
larger amount of unlabeled transactions
to detect potential fraud.
Reinforcement Machine Learning

• Reinforcement learning is a type of


machine learning where an agent learns to
behave in an environment by performing
certain actions and experiencing the results
of these actions. It seeks to maximize some
notion of cumulative reward.
• The agent receives rewards by performing
correctly and penalties for incorrect
actions.
• It learns to choose the best strategy, called
a policy, that will yield the most reward
over time.
• Example: Algorithmic trading where an AI
learns to make buying and selling decisions
based on historical price data and rewards
based on profit outcomes.
Basic Concepts
Model vs Algorithm
Algorithms are procedures that are implemented in
code and are run on data.
An algorithm is the process by which a model is
learned.
Models are output by algorithms and are comprised
of model data and a prediction algorithm.
When you train an “algorithm” with data it will
become a “model”.

Model = Training (an Algorithm + Data)


ML Algorithms
There are numerous algorithms to solve the same
type of problem in ML.

For a given problem, our job as data scientists is to


choose which algorithm(s) will learn and produce the
best-performing model.

ML algorithms can be categorized by how they learn:


• Supervised
• Unsupervised
• Semi-supervised
• Reinforcement
Training vs Test Data
• In machine learning, we divide the original dataset into training
data and test data.
• Training data is used to train the machine-learning model. The
more training data the model has, the better it can make
predictions.
• The model can learn from the training data and improve its
predictions.
• Testing Data is used to evaluate the performance of the model.
• The testing data is not exposed to the model before evaluation. This
ensures the model cannot memorize the testing data and make
perfect predictions.
• Splitting the dataset into train and test sets is one of the important
parts of data pre-processing.
• The training dataset is generally larger in size compared to the testing
dataset. The general ratios of splitting train and test datasets are
80:20, 70:30, or 90:10
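A short sketch of an 80:20 train/test split with scikit-learn, using illustrative arrays:

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(20).reshape(10, 2)   # 10 instances, 2 features
    y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

    # 80% of the data for training, 20% held out for testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    print(X_train.shape, X_test.shape)   # (8, 2) (2, 2)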
Parameters vs Hyperparameters
• Hyperparameters are parameters whose values control the learning
process.
• The prefix ‘hyper’ suggests that they are ‘top-level’ parameters that control the learning
process and the model parameters that result from it.
• Hyperparameters are used by the learning algorithm when it is learning but they are not part
of the resulting model.
• Examples: Train-test split ratio, Learning rate in optimization algorithms (e.g. gradient
descent), Choice of optimization algorithm (e.g., gradient descent, stochastic gradient
descent, or Adam optimizer), Choice of activation function in a neural network (nn) layer
(e.g. Sigmoid, ReLU, Tanh), The choice of cost or loss function the model will use, Number of
hidden layers in a nn, Number of clusters in a clustering task.

• Parameters on the other hand are internal to the model. That is, they are
learned or estimated purely from the data during training as the algorithm
used tries to learn the mapping between the input features and the labels
or targets.
• Model training typically starts with parameters being initialized to some values (random
values or set to zeros). As training/learning progresses the initial values are updated using an
optimization algorithm (e.g. gradient descent).
• The learning algorithm continuously updates the parameter values as learning progresses,
but hyperparameter values set by the model designer remain unchanged.
• At the end of the learning process, model parameters are what constitute the model itself.
• Examples: The coefficients (or weights) of linear and logistic regression models, Weights and
biases of a nn, The cluster centroids in clustering
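A brief sketch of the distinction in scikit-learn terms: hyperparameters are set by the designer before training, while parameters (here, the coefficients and intercept) are learned from the data. The values are illustrative.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
    y = np.array([0, 0, 0, 1, 1, 1])

    # Hyperparameters: chosen before training and not changed by the data
    model = LogisticRegression(C=1.0, max_iter=200)

    # Parameters: estimated from the data during training
    model.fit(X, y)
    print("Learned coefficient (parameter):", model.coef_)
    print("Learned intercept (parameter):", model.intercept_)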
Bias vs Variance

• Bias is an error introduced into the model due to overly simplistic assumptions in the learning algorithm.
• High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
• Variance is an error introduced by sensitivity to small fluctuations in the data set.
• High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).
• A high-variance model performs well on its training data but poorly on any unseen data because it has learned the noise in the training data as concepts.

(Figure: model fits on a training dataset and a test dataset across the combinations of high/low bias and low/high variance.)

Overfitting vs Underfitting
• Underfitting occurs when a model is too simple to capture the underlying
pattern of the data, and hence, fails to achieve good performance on the
training data and may or may not generalize well to new data.
• Indicators: Poor performance on both training and testing data, but especially
training dataset.
• Solutions: Increase model complexity, add more features, or decrease
regularization.
• Overfitting occurs when a model is too complex, with too many
parameters relative to the number of observations. This model learns both
the underlying pattern and the noise in the training data as if they were
concepts, which negatively affects the model's ability to generalize.
• Indicators: High performance on the training data but poor performance on
testing data.
• Solutions: Simplify the model, use regularization techniques, increase training
data, or reduce noise in the data.
Bias-Variance Tradeoff
MLOps
MLOps
• Machine Learning Operations (MLOps) is a set of practices and tools for reliably and
efficiently deploying, monitoring, and
managing machine learning models in
production.
• It integrates machine learning (ML) system
development and operations (Ops) to ensure
continuous delivery and automation of ML
models.
• MLOps bridges the gap between data science and dev
operations, ensuring that machine learning models
can be deployed and maintained in a scalable and
reliable manner.
• It enables continuous integration and continuous delivery (CI/CD) of ML models, reducing the time from model development to production.
• Ensures models are robust, reproducible, and scalable,
leading to better performance and reliability in
production environments.
MLOps Benefits
• Improved Efficiency:
MLOps removes unnecessary manual steps and automates repetitive tasks, making the whole process more efficient and reliable. This helps reduce development time and costs.
• Version Control:
MLOps provides version control for machine learning models and data. This helps organizations track changes and even reproduce a model when required.
• Automated Deployment:
Organizations can deploy machine learning models faster by implementing
MLOps and bringing down the deployment times.
• Enhanced Security:
The entire MLOps process can be safeguarded with access controls, data,
and model encryption techniques for added security.
• Improved Collaboration:
It enables seamless communication between different teams, thus leading to improved collaboration.
• Faster Time to Value:
With MLOps, organizations can deploy their machine learning projects
faster – leading to faster time to value for their customers.
Career Paths
Career Paths in
Data Science
• The field of data science offers a
variety of career paths, each with
its unique focus and skill set.
• Professionals in data science are
in high demand across industries
due to their ability to analyze data,
generate insights, and drive
decision-making.
Data Analyst
Role and Responsibilities:
• Interpret data and analyze results using
statistical techniques.
• Develop and implement data analyses, data
collection systems, and other strategies.
• Create dashboards and reports to present
data insights.
• Identify, analyze, and interpret trends or
patterns in complex data sets.
Required Skills:
• Proficiency in SQL for data querying.
• Strong analytical skills and attention to
detail.
• Experience with data visualization tools
(Tableau, Power BI).
• Knowledge of statistical software (Excel,
SAS).
Career Progression:
Junior Data Analyst → Data Analyst → Senior Data
Analyst → Data Analytics Manager
Business Intelligence Analyst
Role and Responsibilities:
• Analyze business data to provide
actionable insights.
• Develop and maintain BI dashboards
and reports.
• Identify business trends and patterns
through data analysis.
• Work with business stakeholders to
understand their data needs.
Required Skills:
• Proficiency in BI tools (Tableau, Power
BI).
• Strong analytical and problem-solving
skills.
• Knowledge of SQL for data querying.
• Understanding of business operations
and key performance indicators (KPIs).
Career Progression:
Junior BI Analyst → BI Analyst → Senior BI
Analyst → BI Manager
Data Engineer
Role and Responsibilities:
• Design, build, and maintain data pipelines
and ETL processes.
• Ensure the reliability and efficiency of
data systems.
• Work with big data technologies to
process large datasets.
• Collaborate with data scientists to
prepare data for analysis
Required Skills:
• Strong programming skills (Python, Java,
Scala).
• Experience with big data technologies
(Hadoop, Spark).
• Proficiency in SQL and NoSQL databases.
• Understanding of data warehousing
concepts
Career Progression:
Junior Data Engineer → Data Engineer →
Senior Data Engineer → Lead Data Engineer
ML Engineer
Role and Responsibilities:
• Develop and deploy machine learning models
into production.
• Optimize models for performance and
scalability.
• Implement ML algorithms and techniques for
specific use cases.
• Collaborate with data scientists and engineers to
integrate models.
Required Skills:
• Proficiency in machine learning frameworks
(TensorFlow, PyTorch).
• Strong programming skills (Python, C++).
• Knowledge of cloud platforms for ML (AWS,
GCP, Azure).
• Understanding of software engineering best
practices.
Career Progression:
Junior Machine Learning Engineer → Machine
Learning Engineer → Senior Machine Learning
Engineer → Lead Machine Learning Engineer
Data Scientist
Role and Responsibilities:
• Develop and implement machine
learning models.
• Analyze large datasets to extract
actionable insights.
• Work with stakeholders to understand
business problems and provide data-
driven solutions.
• Communicate findings through data
visualization and reports.
Required Skills:
• Strong programming skills (Python, R).
• Proficiency in machine learning libraries
(Scikit-learn, TensorFlow).
• Statistical analysis and data mining.
• Data visualization (Matplotlib, Seaborn,
Tableau).
Career Progression:
Junior Data Scientist → Data Scientist →
Senior Data Scientist → Lead Data Scientist
Resources for
Further Learning
Resources for Further Learning
• Learning Data Science and
Machine Learning requires
continuous education due to the
rapidly evolving nature of the
field.
• There are various resources
available, including online
courses, books, websites, and
certification programs, that can
help you stay updated and
improve your skills.
• Check out the courses offered by
Markov Data:
https://markovdata.com/
