
FOUNDATIONS OF DATA SCIENCE
By Mr. S. SRINIVAS REDDY, Asst. Professor
Foundations of Data Science

1. Introduction to Data Science

• Definition and scope of Data Science
• Applications in various domains (healthcare, finance, marketing, etc.)
• Difference between Data Science, Machine Learning, and AI
What is Data Science?
Data Science is an interdisciplinary field that combines statistical
analysis, machine learning, data visualization, and computational
techniques to extract insights and knowledge from structured and
unstructured data. It integrates elements of mathematics,
programming, and domain expertise to facilitate data-driven decision-
making.
Difference Between Structured, Semi-Structured, and Unstructured Data

Definition:
• Structured Data: Data that is organized in a fixed format, typically in tables with predefined schemas.
• Semi-Structured Data: Data that has some organizational properties but does not follow a strict schema.
• Unstructured Data: Data that lacks a predefined structure or organization.

Storage:
• Structured Data: Stored in relational databases (SQL databases).
• Semi-Structured Data: Stored in NoSQL databases, XML, JSON, or semi-structured repositories.
• Unstructured Data: Stored in file systems, data lakes, or object storage.

Schema:
• Structured Data: Well-defined schema (rows and columns).
• Semi-Structured Data: Partial schema or flexible structure.
• Unstructured Data: No predefined schema.

Querying:
• Structured Data: Easily queried using SQL.
• Semi-Structured Data: Requires specialized tools (e.g., NoSQL databases, JSON parsers).
• Unstructured Data: Difficult to query directly; requires NLP, AI, or data processing tools.

Examples:
• Structured Data: Customer records in an SQL database; sales transactions stored in a relational database.
• Semi-Structured Data: XML and JSON data used in APIs; log files from web servers.
• Unstructured Data: Emails, images, videos, social media posts, PDFs.
Properties comparison:

Technology:
• Structured data: Based on relational database tables.
• Semi-structured data: Based on XML/RDF (Resource Description Framework).
• Unstructured data: Based on character and binary data.

Transaction management:
• Structured data: Matured transactions and various concurrency techniques.
• Semi-structured data: Transactions adapted from the DBMS, not matured.
• Unstructured data: No transaction management and no concurrency.

Version management:
• Structured data: Versioning over tuples, rows, and tables.
• Semi-structured data: Versioning over tuples or graphs is possible.
• Unstructured data: Versioned as a whole.

Flexibility:
• Structured data: Schema-dependent and less flexible.
• Semi-structured data: More flexible than structured data but less flexible than unstructured data.
• Unstructured data: Most flexible; absence of schema.

Scalability:
• Structured data: Very difficult to scale the DB schema.
• Semi-structured data: Scaling is simpler than for structured data.
• Unstructured data: Most scalable.

Robustness:
• Structured data: Very robust.
• Semi-structured data: Newer technology, not very widespread.
• Unstructured data: —

Query performance:
• Structured data: Structured queries allow complex joins.
• Semi-structured data: Queries over anonymous nodes are possible.
• Unstructured data: Only textual queries are possible.

Big Data includes huge volume, high velocity, and an extensible variety of data. There are three types: structured data, semi-structured data, and unstructured data.

Big Data: The 3 V's
Volume (Scale)
• Data volume is increasing exponentially
• 44× increase from 2009 to 2020: from 0.8 zettabytes to 35 ZB
• Exponential increase in collected/generated data
Examples of the data deluge:
• 12+ TB of tweet data every day
• 25+ TB of log data every day
• ? TB of data every day from 4.6 billion camera phones worldwide
• 30 billion RFID tags today (1.3 billion in 2005)
• 100s of millions of GPS-enabled devices sold annually
• 2+ billion people on the Web by end of 2011
• 76 million smart meters in 2009, 200 million by 2014
• CERN's Large Hadron Collider (LHC) generates 15 PB a year
The EarthScope
The EarthScope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much more. (http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)
Variety (Complexity)
• Relational data (tables/transactions/legacy data)
• Text data (Web)
• Semi-structured data (XML)
• Graph data: social networks, Semantic Web (RDF), …
• Streaming data: you can only scan the data once
• A single application can be generating/collecting many types of data
• Big public data (online, weather, …)
To extract knowledge, all these types of data need to be linked together.
A Single View to the Customer
Combining sources such as social media, banking and finance, gaming, entertainment, purchase history, and known customer history into a single view of our customer.
Velocity (Speed)
• Data is being generated fast and needs to be processed fast
• Online data analytics
• Late decisions  missing opportunities
• Examples:
  • E-Promotions: based on your current location, your purchase history, and what you like  send promotions right now for the store next to you
  • Healthcare monitoring: sensors monitoring your activities and body  any abnormal measurements require immediate reaction
Real-time/Fast Data
• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)

Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.
Real-Time Analytics/Decision Requirement
• Product recommendations that are relevant and compelling
• Learning why customers switch to competitors and their offers, in time to counter
• Influencing behavior: friend invitations to join a game or activity that expands the business
• Improving the marketing effectiveness of a promotion while it is still in play
• Preventing fraud as it is occurring, and preventing more proactively
Some make it 4 V's (a fourth V, veracity, is often added).
Harnessing Big Data
• OLTP: Online Transaction Processing (DBMSs)
• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data architecture and technology)
Typically, a data science project goes through the following stages:
1. Problem Definition: Identifying the problem or question to answer.
2. Data Collection: Gathering the necessary data from various sources.
3. Data Cleaning: Preprocessing and cleaning the data to ensure its quality.
4. Data Exploration: Analyzing the data to understand its structure and patterns.
5. Feature Engineering: Enhancing the data with new, derived features.
6. Model Building: Creating a statistical or machine learning model based on the data.
7. Model Validation: Evaluating the model's performance using various metrics.
8. Model Deployment: Implementing the model in a production environment.
9. Monitoring: Observing the model's performance over time and refining as necessary.
Key Differences: Data Science vs. Machine Learning vs. Artificial Intelligence

Goal:
• Data Science: Extract insights from data.
• Machine Learning: Enable machines to learn from data.
• Artificial Intelligence: Enable machines to mimic human intelligence.

Scope:
• Data Science: Broad field covering data analysis, ML, and statistics.
• Machine Learning: Subset of AI that learns from data.
• Artificial Intelligence: Broadest field that includes ML and other intelligent systems.

Techniques:
• Data Science: Statistics, ML, data visualization.
• Machine Learning: Supervised, unsupervised, and reinforcement learning.
• Artificial Intelligence: Machine learning, deep learning, expert systems.

Example:
• Data Science: Analyzing customer data for trends.
• Machine Learning: Predicting house prices based on features.
• Artificial Intelligence: Self-driving cars making real-time decisions.
Where Do We See Data Science?
Data Science is applied in numerous fields, including:
•Healthcare: Disease prediction, personalized medicine, and medical
imaging analysis.
•Finance: Fraud detection, algorithmic trading, and credit scoring.
•Marketing: Customer segmentation, targeted advertising, and sentiment
analysis.
•E-commerce: Recommendation systems and inventory optimization.
•Social Media: Trend analysis, user behavior modeling, and fake news
detection.
•Sports: Performance analysis and game strategy optimization.
•Government: Policy modeling and public health analytics.
How Does Data Science Relate to Other Fields?
Data Science overlaps with several disciplines, including:
•Computer Science: Algorithms, databases, and software engineering
play a crucial role in handling and processing data.
•Statistics: Essential for data analysis, hypothesis testing, and
probability modeling.
•Machine Learning & Artificial Intelligence: Core components of
predictive analytics and automation.
•Business Intelligence: Uses data insights to support strategic decision-
making.
•Engineering & Natural Sciences: Utilizes data science for simulation,
modeling, and optimization.
The Relationship Between Data Science and Information Science
•Data Science focuses on extracting actionable insights from data using
statistical and computational techniques.
•Information Science studies how information is created, managed, and
shared, emphasizing human-computer interaction and data management.
•The two fields overlap in areas like data organization, retrieval, and
knowledge management, making Information Science a foundational
component of Data Science.
Computational Thinking in Data Science
Computational Thinking involves problem-solving skills fundamental to
Data Science, including:
•Decomposition: Breaking complex problems into manageable
components.
•Pattern Recognition: Identifying trends and structures in data.
•Abstraction: Simplifying problems by focusing on relevant information.
•Algorithm Design: Creating step-by-step instructions to solve problems
effectively.
Skills for Data Science
To succeed in Data Science, individuals need a combination of technical
and soft skills:
•Programming: Proficiency in Python, R, SQL, or Julia.
•Mathematics & Statistics: Linear algebra, probability, and inferential
statistics.
•Data Wrangling & Cleaning: Handling missing data and preprocessing
datasets.
•Machine Learning: Understanding supervised and unsupervised learning
techniques.
•Big Data Technologies: Experience with Hadoop, Spark, and cloud
computing.
•Data Visualization: Using Matplotlib, Seaborn, and Tableau for storytelling
with data.
•Domain Knowledge: Understanding the industry context for meaningful
analysis.
•Communication: Ability to explain findings to technical and non-technical
audiences.
Tools for Data Science
Data Scientists rely on various tools for analysis, visualization, and
model deployment, including:
•Programming Languages: Python, R, SQL
•Data Processing & Analysis: Pandas, NumPy, SciPy
•Machine Learning Frameworks: Scikit-learn, TensorFlow, PyTorch
•Visualization Tools: Matplotlib, Seaborn, Tableau, Power BI
•Big Data & Cloud Services: Hadoop, Apache Spark, AWS, Google
Cloud, Azure
Issues of Ethics, Bias, and Privacy in Data Science
Ethics in Data Science
•Ensuring transparency in data usage and model decisions.
•Avoiding manipulation or misrepresentation of data.
•Promoting accountability for automated decisions.

Bias in Data Science


•Data Bias: Arises from imbalanced datasets leading to inaccurate or
unfair outcomes.
•Algorithmic Bias: Unintended prejudices embedded in machine
learning models.
•Human Bias: Subjective interpretations influencing model training
and results.
Privacy Concerns in Data Science
•Data Protection: Compliance with regulations like GDPR and HIPAA.
•Anonymization Techniques: Removing personally identifiable information
(PII) before analysis.
•Security Risks: Preventing data breaches and unauthorized access.
Introduction to Data Science – Chapter 2
Data Types:
Structured Data,
Unstructured Data,
Challenges with Unstructured Data.
Data Collections:
Open Data,
Social Media Data,
Multimodal Data,
Data Storage and Presentation.
Data Pre-processing:
Data Cleaning, Data Integration,
Data Transformation, Data Reduction, Data Discretization.
Managing unstructured data presents numerous challenges, including storage costs, difficulty in organization and retrieval, security risks, and challenges in data quality and compliance.
What is the role of Data Analytics?

Gather Hidden Insights – Hidden insights from data are gathered and then analyzed with respect to business requirements.

Generate Reports – Reports are generated from the data and are passed on to the respective teams and individuals to deal with further actions for business growth.

Perform Market Analysis – Market analysis can be performed to understand the strengths and weaknesses of competitors.

Improve Business Requirements – Analysis of data allows improving business-to-customer requirements and experience.
Open Data vs. Social Media Data vs. Multimodal Data

Definition:
• Open Data: Publicly accessible datasets that can be freely used and shared.
• Social Media Data: Data generated from social media platforms through user interactions.
• Multimodal Data: Data that integrates multiple formats such as text, images, video, and audio.

Data Sources:
• Open Data: Government portals, research institutions, NGOs.
• Social Media Data: Facebook, Twitter, Instagram, YouTube, TikTok.
• Multimodal Data: Cameras, sensors, microphones, social media, medical devices.

Ownership:
• Open Data: Provided by governments, organizations, or institutions.
• Social Media Data: Owned by social media companies; subject to privacy policies.
• Multimodal Data: Generated from various sources (e.g., sensors, media, text inputs).

Characteristics:
• Open Data: Structured, machine-readable, often in CSV, JSON, or API formats.
• Social Media Data: Unstructured or semi-structured, continuously updated, platform-dependent.
• Multimodal Data: Combines different data types, requiring specialized processing.

Formats:
• Open Data: CSV, JSON, XML, API feeds.
• Social Media Data: Text, images, videos, comments, likes, shares.
• Multimodal Data: Text + Image, Video + Audio, Speech + Text.

Use Cases:
• Open Data: Research, policy-making, business intelligence.
• Social Media Data: Sentiment analysis, trend prediction, marketing.
• Multimodal Data: AI applications, medical imaging, self-driving cars.

Examples:
• Open Data: World Bank Open Data, NASA Earth Observation Data.
• Social Media Data: Twitter sentiment analysis, Instagram engagement metrics.
• Multimodal Data: Autonomous vehicles (sensor + GPS + camera), AI chatbots (voice + text).
Data Pre-processing in Data Science
(The process of transforming raw data into an understandable format.)
Four major tasks:
1. Data Cleaning
2. Data Integration
3. Data Reduction
4. Data Transformation

1. Data Cleaning – removing noisy data (incorrect, incomplete, inconsistent data) and replacing missing values. For missing values, replace with N/A, a mean value (for normally distributed data), a median value (for non-normal data), or the most probable value; this is done manually for small data sets and automatically for large data sets.
Ex1: 2, 4, 5, 8, ?, 6, 7, 9, 9 – the mean of the known values is 6.25, so the missing value can be assumed to be about 6.
Ex2: For a list of students and their marks given in random order, the marks can be sorted to find the median value for imputation.
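A minimal pandas sketch of the two imputation choices just described, using the Ex1 values (variable names are illustrative):

```python
import pandas as pd

# Ex1 series with one missing value
marks = pd.Series([2, 4, 5, 8, None, 6, 7, 9, 9])

# Mean imputation (suited to roughly normal data): mean of known values = 6.25
mean_filled = marks.fillna(marks.mean())

# Median imputation (more robust for non-normal/skewed data): median = 6.5
median_filled = marks.fillna(marks.median())

print(mean_filled.tolist())
print(median_filled.tolist())
```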
Data, Information, Analysis, Analytics:
What is Data? Why is it so crazy?
• Data is raw, unorganized, unanalysed, uninterpreted, and unrelated material used in different contexts.

• For instance, facts and stats gathered by researchers for their analysis can collectively be called data.

• Data in essence lacks informative fervour and renders itself relatively meaningless unless given a purpose or direction to acquire its significance.
Data vs Information:

Analysis vs. Analytics
Data Analytics vs. Data Analysis: are they the same, or even similar?
No, they are not the same; there are considerable differences between them. Data Analysis is actually a subset of Data Analytics.

Analysis:
• Data Analysis helps in understanding the past data and provides the insights required to understand what happened so far.
• Data Analysis is a subset of Data Analytics which helps us understand the data by questioning it and collecting useful insights from the information already available.

Analytics:
• Data Analytics is the process of exploring data from the past to make appropriate decisions in the future by using valuable insights.
• Data Analytics is a wide area involving handling data with the necessary tools to produce helpful decisions and useful predictions for a better output.
Different Types of Analytics:

Descriptive Analytics:
• Purpose: Summarizes past events to understand patterns.
• Key Question: What happened?
• Examples: Sales reports, customer service data, website traffic.

Diagnostic Analytics:
• Purpose: Explains the causes of past outcomes.
• Key Question: Why did it happen?
• Examples: Customer churn analysis, sales decline analysis, defect analysis.

Predictive Analytics:
• Purpose: Forecasts future trends and behaviors.
• Key Question: What is likely to happen?
• Examples: Stock market prediction, weather forecasting, customer behavior prediction.

Prescriptive Analytics:
• Purpose: Recommends actions and optimizations.
• Key Question: What should be done?
• Examples: Fraud detection, supply chain optimization, marketing strategies.

Cognitive Analytics:
• Purpose: Simulates human thought processes and decision-making.
• Key Question: How can we make smarter decisions?
• Examples: Healthcare diagnosis (IBM Watson), AI-powered chatbots, fraud detection.
In more detail:

Descriptive Analytics:
• Definition: Analyzes historical data to summarize what happened.
• Purpose: Understand past events and trends.
• Key Question: What happened?
• Techniques Used: Data aggregation, data visualization, summary statistics.
• Examples: Monthly sales report, web traffic analytics, financial performance review.

Predictive Analytics:
• Definition: Uses historical data and machine learning to predict future outcomes.
• Purpose: Forecast future trends and behaviors.
• Key Question: What is likely to happen?
• Techniques Used: Regression analysis, time-series forecasting, machine learning models.
• Examples: Stock market forecasting, predicting customer churn, weather predictions.

Diagnostic Analytics:
• Definition: Investigates the cause of past outcomes or behaviors.
• Purpose: Identify reasons behind events.
• Key Question: Why did it happen?
• Techniques Used: Correlation analysis, root cause analysis, data mining.
• Examples: Customer churn analysis, analyzing why sales dropped, understanding manufacturing defects.

Prescriptive Analytics:
• Definition: Provides recommendations on the best course of action.
• Purpose: Suggest actions for optimal results.
• Key Question: What should be done?
• Techniques Used: Optimization, simulation, decision analysis.
• Examples: Optimizing delivery routes, marketing campaign recommendations, inventory management.

Cognitive Analytics:
• Definition: Uses AI, machine learning, and natural language processing (NLP) to simulate human thinking and decision-making.
• Purpose: Mimics human cognitive processes for decision-making and adaptive learning.
• Key Question: How can we make smarter decisions?
• Techniques Used: NLP, machine learning, deep learning, neural networks.
• Examples: IBM Watson for healthcare diagnostics, AI chatbots for customer service, fraud detection systems.
1. Data Cleaning (Removing Errors and Inconsistencies)
•Example: Handling missing or incorrect customer details in an e-
commerce database.
•Process:
• Identify missing values in customer addresses.
• Remove duplicate entries for the same customer.
• Standardize formats (e.g., "USA" and "United States" should be
the same).
• Replace incorrect data (e.g., negative age values).
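As an illustration, here is a small pandas sketch of these cleaning steps on a hypothetical customer table (the column names and values are assumptions, not from an actual system):

```python
import pandas as pd
import numpy as np

# Hypothetical e-commerce customer records
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "country":     ["USA", "USA", "United States", "India"],
    "age":         [34, 34, -5, np.nan],
})

# Remove duplicate entries for the same customer
df = df.drop_duplicates(subset="customer_id")

# Standardize formats: "United States" and "USA" should be the same
df["country"] = df["country"].replace({"United States": "USA"})

# Replace incorrect data: negative ages become missing, then impute
df.loc[df["age"] < 0, "age"] = np.nan
df["age"] = df["age"].fillna(df["age"].median())

print(df)
```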
2. Data Integration (Combining Data from Multiple Sources)
•Example: A hospital integrates patient data from different
departments (Radiology, Lab Tests, and Billing).
•Process:
• Collect patient data from different hospital databases.
• Remove inconsistencies in data formats (e.g., date formats like
MM/DD/YYYY vs. DD/MM/YYYY).
• Merge records using a unique patient ID.
• Ensure no duplication or redundancy.
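A short pandas sketch of this integration flow, assuming hypothetical Radiology and Billing tables keyed by a shared patient ID (table and column names are invented for illustration):

```python
import pandas as pd

radiology = pd.DataFrame({"patient_id": [101, 102],
                          "scan_date": ["03/25/2024", "04/02/2024"]})  # MM/DD/YYYY
billing   = pd.DataFrame({"patient_id": [101, 102],
                          "bill_date": ["25/03/2024", "02/04/2024"]})  # DD/MM/YYYY

# Remove inconsistencies in date formats by parsing both into datetimes
radiology["scan_date"] = pd.to_datetime(radiology["scan_date"], format="%m/%d/%Y")
billing["bill_date"] = pd.to_datetime(billing["bill_date"], format="%d/%m/%Y")

# Merge records using the unique patient ID and drop any duplicated rows
merged = radiology.merge(billing, on="patient_id", how="inner").drop_duplicates()
print(merged)
```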
3. Data Transmission (Transferring Data Efficiently)
•Example: Real-time stock market data streaming to financial
applications.
•Process:
• Stock exchange generates price updates every second.
• Data is compressed before transmission to reduce bandwidth
usage.
• A secure protocol (e.g., HTTPS, MQTT) is used for real-time data
transfer.
• The financial app receives and decodes the data for display.
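A simplified Python sketch of the compress-and-decode cycle using only the standard library (the payload fields are invented; a production system would stream the bytes over a secure protocol such as HTTPS or an MQTT broker rather than a local variable):

```python
import json
import zlib

# Hypothetical price update generated by the exchange every second
update = {"symbol": "ABC", "price": 101.25, "ts": "2024-03-25T10:15:01Z"}

# Compress before transmission to reduce bandwidth usage
payload = zlib.compress(json.dumps(update).encode("utf-8"))

# ... payload would be sent here over a secure protocol (e.g., HTTPS, MQTT) ...

# The financial app receives and decodes the data for display
received = json.loads(zlib.decompress(payload).decode("utf-8"))
print(received["symbol"], received["price"])
```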
4. Data Reduction (Minimizing Storage and Processing Load)
•Example: Compressing image files in a cloud storage system.
•Process:
• Convert high-resolution images (e.g., 4K) to lower resolutions
while maintaining quality.
• Remove unnecessary metadata from image files.
• Use data sampling techniques to store only essential parts of
large datasets.
• Reduce dimensions in a dataset using Principal Component
Analysis (PCA).
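A brief scikit-learn sketch of the PCA step just mentioned, run on synthetic correlated data (the shapes and the 95% variance threshold are illustrative choices):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic dataset: 10 observed features driven by only 3 latent factors
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 10)) \
    + 0.05 * rng.normal(size=(100, 10))

# Keep the smallest number of components explaining ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)  # expect roughly (100, 10) -> (100, 3)
print(pca.explained_variance_ratio_.round(3))
```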
5. Data Discretization (Converting Continuous Data into Categories)
•Example: Categorizing customer age groups for a marketing
campaign.
•Process:
• Collect customer age data (e.g., 18, 25, 37, 45, 60).
• Define age group bins:
• 0-18: Teen
• 19-35: Young Adult
• 36-50: Middle Age
• 51+: Senior
• Assign each customer to a category.
• Use categorized data for targeted advertisements.
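A minimal pandas sketch of this binning, using pd.cut with the age-group edges defined above:

```python
import pandas as pd

ages = pd.Series([18, 25, 37, 45, 60])

# Bin edges and labels matching the groups above (right-inclusive intervals)
bins = [0, 18, 35, 50, 120]
labels = ["Teen", "Young Adult", "Middle Age", "Senior"]

age_group = pd.cut(ages, bins=bins, labels=labels)
print(pd.DataFrame({"age": ages, "group": age_group}))
```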
What are the tools used in Data Analytics?
• With the increasing demand for Data Analytics in the market, many tools have emerged with various functionalities for this purpose.
• Either open-source or user-friendly, the top tools in the data analytics market are as follows:

 R programming
 Python
 Tableau Public
 QlikView
 SAS
 Microsoft Excel
 RapidMiner
 KNIME
 OpenRefine
 Apache Spark
Data and Architecture Design:
Data architecture in Information Technology is composed of models, policies, rules, or standards that govern which data is collected, and how it is stored, arranged, integrated, and put to use in data systems and in organizations.
 A data architecture should set data standards for all its data systems.
 Data architectures address data in storage and data in motion; descriptions of data stores, data groups, and data items; and mapping of those data artifacts to data qualities, applications, locations, etc.
 Data Architecture describes how data is processed, stored, and utilized in a given system.
 The Data Architect is typically responsible for defining the target state and aligning during development to ensure enhancements are done in the spirit of the original blueprint.
Scenario: A dataset of customer information has missing values in the "Age" and "Income" columns.

•Techniques:
•Imputation:
•Replace missing "Age" values with the median age.
•Replace missing "Income" values with the mean income.

•Deletion:
•Remove rows with missing values (if the number of missing values is small).
•Example:

•If a customer's age is "NaN," calculate the median age of all customers and replace "NaN" with that value.
Outlier Detection and Removal:
•Scenario: A dataset of house prices has a few houses with extremely high prices compared to the rest.
• Techniques:
• Visual inspection: Plot a box plot or scatter plot to identify outliers.
• Statistical methods: Use the interquartile range (IQR) to identify values that are significantly different from the
rest.
• Removal: Remove the outlier data points from the dataset.
•Example:
• A house price of $10 million in a neighborhood where most houses are priced between $200,000 and $500,000
might be considered an outlier.
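A small pandas sketch of the IQR method for this scenario (the price values are made up):

```python
import pandas as pd

prices = pd.Series([200_000, 250_000, 320_000, 410_000, 480_000, 10_000_000])

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = prices[(prices < lower) | (prices > upper)]
cleaned = prices[(prices >= lower) & (prices <= upper)]
print("outliers:", outliers.tolist())  # the $10M house is flagged
```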
Noisy Data (inconsistent or incorrect/error data)

Binning: sort the data and assign the values into bins, then smooth to remove error values:
• Smoothing by bin means – each value in a bin is replaced by the bin's mean.
• Smoothing by bin medians – each value is replaced by the bin's median.
• Smoothing by bin boundaries – each value is replaced by the nearest bin boundary (min or max value).

Regression – smoothing data by fitting it to a function for numerical prediction.

Clustering – similar data items are grouped at one place; dissimilar items fall outside the cluster (outliers).
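A NumPy sketch of smoothing by bin means and by bin boundaries, using nine sorted values split into three equal-frequency bins (the data values are illustrative):

```python
import numpy as np

# Sorted data split into three equal-frequency bins of three values each
data = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))
bins = data.reshape(3, 3)

# Smoothing by bin means: every value becomes its bin's mean
by_means = np.repeat(bins.mean(axis=1), 3)  # 9,9,9, 22,22,22, 29,29,29

# Smoothing by bin boundaries: each value snaps to the nearer bin edge
lo, hi = bins[:, :1], bins[:, -1:]
by_bounds = np.where(bins - lo <= hi - bins, lo, hi).ravel()

print(by_means)
print(by_bounds)  # 4,4,15, 21,21,24, 25,25,34
```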
Data Integration –
Multiple heterogeneous sources of data are combined into a single dataset.
Two types of data integration:
1. Tight coupling – data is combined together into one physical location.
2. Loose coupling – only an interface is created, and data is combined and accessed through the interface; the data remains stored in the source databases.
Data Reduction –
The volume of data is reduced to make analysis easier. Methods for data reduction:

1. Dimensionality reduction – reduces the number of input variables in the dataset, because a large number of input variables leads to poor performance.

2. Data cube aggregation – data is combined to form a data cube, and redundant, noisy data is removed.

3. Attribute subset selection (attributes are columns) – only highly relevant attributes are used; the others are discarded (the data is reduced).

4. Numerosity reduction – store only a model (a sample) of the data rather than the entire dataset.
Data Transformation –
Data is transformed into an appropriate form suitable for the data mining process.
Four methods:
1. Normalization – scale the data values into a specified range (e.g., -1.0 to 1.0, or 0 to 1).
2. Attribute selection – new attributes are created using the older ones.
3. Discretization – raw data values are replaced by interval levels.
E.g., 10, 12, 13, 14, 21, 22, 34, 36 -> 10-20, 20-30, 30-40.
Data Discretization:
•Scenario: A dataset contains "Age" values as continuous numbers.
• Techniques:
• Binning: Divide the "Age" values into discrete intervals (e.g., "Young," "Middle-aged,"
"Senior").
•Example:
• Create age groups: 18-30 = "Young," 31-50 = "Middle-aged," 51+ = "Senior."
4. Concept hierarchy generation – converting attributes from a low-level attribute to a higher-level attribute, e.g., city -> country.
Feature Scaling:
•Scenario: A dataset contains "Income" values in the range of thousands and
"Age" values in the range of tens.
• Techniques:
• Min-max scaling: Scale the "Income" and "Age" values to a common
range (e.g., 0 to 1).
•Example:
• Using Min-max scaling, a person aged 25 from a group of people aged 18
to 70 would have their age scaled to a value within the range of 0 to 1.
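A one-line pandas sketch of min-max scaling, x' = (x - min) / (max - min), applied to hypothetical age and income columns:

```python
import pandas as pd

df = pd.DataFrame({"age": [18, 25, 44, 70],
                   "income": [20_000, 35_000, 80_000, 120_000]})

# Min-max scaling maps each column onto [0, 1]
scaled = (df - df.min()) / (df.max() - df.min())
print(scaled)  # age 25 in the 18-70 group scales to about 0.13
```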
Data Transformation:
•Scenario: A dataset contains "Temperature" values in Celsius and Fahrenheit.
• Techniques:
• Normalization/Standardization:
• Convert all "Temperature" values to a common scale (e.g., Celsius).
• Use Z-score standardization to transform numerical data to have a mean of 0 and
a standard deviation of 1.
• Encoding categorical data:
• Convert "Gender" values ("Male," "Female") into numerical values (e.g., 0 and 1).
This is often done using one hot encoding.
•Example:
• Convert all Fahrenheit temperatures to Celsius using the appropriate formula.
• Convert a "color" column containing "red", "blue", and "green" into three columns
labeled "is_red", "is_blue", "is_green", where 1 indicates that the color is present, and 0
indicates it is not.
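A compact pandas sketch of both transformations in this example: the Fahrenheit-to-Celsius conversion and one-hot encoding of a color column (the sample values are invented):

```python
import pandas as pd

df = pd.DataFrame({"temp_f": [32.0, 98.6, 212.0],
                   "color":  ["red", "blue", "green"]})

# Convert Fahrenheit to Celsius: C = (F - 32) * 5/9
df["temp_c"] = (df["temp_f"] - 32) * 5 / 9

# One-hot encode "color" into is_red / is_blue / is_green indicator columns
onehot = pd.get_dummies(df["color"], prefix="is").astype(int)
df = pd.concat([df.drop(columns="color"), onehot], axis=1)
print(df)
```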
2. Data Types and Data Structures
•Structured vs. Unstructured Data
•Data types: Numerical, Categorical, Ordinal
•Common data structures: Arrays, Lists, Dictionaries, DataFrames
3. Data Collection and Cleaning
•Sources of Data (APIs, Web Scraping, Databases, etc.)
•Handling missing values (Imputation, Dropping, etc.)
•Removing duplicates and dealing with outliers
4. Exploratory Data Analysis (EDA)
•Descriptive Statistics: Mean, Median, Mode, Variance, Standard
Deviation
•Data Visualization: Histograms, Boxplots, Scatter Plots, Heatmaps
•Feature Engineering and Selection
5. Probability and Statistics for Data Science
•Basic Probability Rules and Distributions (Normal, Binomial, Poisson)
•Hypothesis Testing and Confidence Intervals
•Correlation vs. Causation
6. Machine Learning Basics
•Supervised vs. Unsupervised Learning
•Common algorithms: Linear Regression, Decision Trees, k-NN, Clustering
•Model Evaluation Metrics (Accuracy, Precision, Recall, F1-Score)
7. Introduction to Python for Data Science
•Libraries: NumPy, Pandas, Matplotlib, Seaborn, Scikit-learn
•Data Manipulation with Pandas
•Data Visualization with Matplotlib & Seaborn
8. Big Data and Cloud Computing
•Introduction to Big Data Technologies (Hadoop, Spark)
•Cloud Computing for Data Science (AWS, Google Cloud, Azure)
•Data Storage and Processing
9. Ethical Considerations in Data Science
•Bias in Data and Algorithms
•Privacy and Data Protection (GDPR, HIPAA)
•Responsible AI and Fairness
10. Case Studies and Real-World Applications
•Predictive Analytics in Finance
•Recommender Systems in E-commerce
•Healthcare Analytics
