Foundations of Data Science PPT TEXT BOOK
By
Mr. S.SRINIVAS REDDY
ASST PROFESSOR
Foundations of Data Science
Property | Structured data | Semi-structured data | Unstructured data
Technology | It is based on relational database tables | It is based on XML/RDF (Resource Description Framework) | It is based on character and binary data
Transaction management | Matured transaction and various concurrency techniques | Transaction is adapted from DBMS, not matured | No transaction management and no concurrency
Version management | Versioning over tuples, rows, tables | Versioning over tuples or graph is possible | Versioned as a whole
Flexibility | It is schema dependent and less flexible | It is more flexible than structured data but less flexible than unstructured data | It is more flexible and there is absence of schema
Scalability | It is very difficult to scale the DB schema | Its scaling is simpler than structured data | It is more scalable
Query performance | Structured queries allow complex joins | Queries over anonymous nodes are possible | Only textual queries are possible
Big Data involves huge volume, high velocity, and an extensive variety of data. It comes in three
types: structured data, semi-structured data, and unstructured data.
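The difference between the three types can be seen in how each one is read by a program. The following is a minimal sketch using only the Python standard library; the sample records, names, and cities are hypothetical and are not taken from the slides.

```python
import csv
import io
import json
import re

# Structured: fixed columns, one record per row (hypothetical sample).
structured = "id,name,city\n1,Asha,Hyderabad\n2,Ravi,Chennai\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing keys, but fields may vary per record.
semi_structured = '[{"id": 1, "name": "Asha", "tags": ["new"]}, {"id": 2, "name": "Ravi"}]'
records = json.loads(semi_structured)

# Unstructured: free text with no schema, so only textual search is possible.
unstructured = "Asha from Hyderabad called about her order. Ravi emailed from Chennai."
cities = re.findall(r"Hyderabad|Chennai", unstructured)

print(rows[0]["city"])             # column lookup relies on the schema
print(records[1].get("tags", []))  # a missing key is handled per record
print(cities)                      # result of a plain text search
```

The structured rows can be addressed by column name, the semi-structured records need a per-record check for optional keys, and the unstructured text can only be searched.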
Big Data: 3V’s
Volume (Scale)
• Data Volume
  • 44x increase from 2009 to 2020
  • From 0.8 zettabytes to 35 ZB
• Data volume is increasing exponentially
Exponential increase in collected/generated data
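The 44x figure can be checked with a short calculation. The sketch below also derives an implied yearly growth rate; that rate assumes constant compound growth between 2009 and 2020, which is an assumption made here for illustration and is not stated on the slide.

```python
start_zb = 0.8   # zettabytes in 2009 (from the slide)
end_zb = 35.0    # zettabytes projected for 2020 (from the slide)
years = 2020 - 2009

# Overall growth factor: 35 / 0.8 = 43.75, which the slide rounds to 44x.
growth_factor = end_zb / start_zb

# Implied yearly rate under the constant-compound-growth assumption.
annual_rate = growth_factor ** (1 / years) - 1

print(f"growth factor: {growth_factor:.2f}x")
print(f"implied annual growth: {annual_rate:.1%}")
```

Under that assumption the volume grows by roughly 41% each year, which is what "increasing exponentially" means in practice.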
• 12+ TBs of tweet data every day
• 25+ TBs of log data every day
• ? TBs of data every day
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones world wide
• 100s of millions of GPS enabled devices sold annually
• 2+ billion people on the Web by end 2011
• 76 million smart meters in 2009… 200M by 2014
CERN’s Large Hadron Collider (LHC) generates 15 PB a year
Maximilien Brice, ©
The Earthscope
The Earthscope is the world's largest science project. Designed to track North
America's geological evolution, this observatory records data over 3.8 million
square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San
Andreas fault, sure, but also the plume of magma underneath Yellowstone and
much more. (http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)
Variety (Complexity)
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
  • Social Network, Semantic Web (RDF), …
• Streaming Data
  • You can only scan the data once
• A single application can be generating/collecting many types of data
• To extract knowledge, all these types of data need to be linked together
• Big Public Data (online, weather, …)
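The constraint that streaming data can be scanned only once shapes the algorithms that can be used on it. As a minimal sketch, the class below keeps a running mean and variance in a single pass using Welford's method, which is a technique chosen here for illustration; the `RunningStats` class, the `sensor_stream` generator, and the readings are all hypothetical.

```python
class RunningStats:
    """Single-pass mean and sample variance (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Each reading is seen exactly once and then discarded.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; reported as 0.0 for fewer than two readings.
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


def sensor_stream():
    # Hypothetical readings; a generator can be consumed only once.
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


stats = RunningStats()
for reading in sensor_stream():
    stats.update(reading)

print(stats.count, stats.mean, stats.variance)
```

Only three numbers are kept in memory no matter how long the stream runs, so the statistics stay available without storing or rescanning the data.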
A Single View to the Customer
[Figure: the customer at the center, linked to Social Media, Banking, Finance, Gaming, Entertain, Purchase, and Our Known History]
Velocity (Speed)
• Data is being generated fast and needs to be processed fast
• Online Data Analytics
• Late decisions mean missed opportunities
• Examples
  • E-Promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
  • Healthcare monitoring: sensors monitor your activities and body; any abnormal measurements require immediate reaction
Real-time/Fast Data
• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data
• But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
Real-Time Analytics/Decision Requirement
Influence Behavior (Customer)
• Product Recommendations that are Relevant & Compelling
• Learning why Customers Switch to competitors and their offers; in time to Counter
• Friend Invitations to join a Game or Activity that expands business
• Improving the Marketing Effectiveness of a Promotion while it is still in Play
• Preventing Fraud as it is Occurring & preventing more proactively
Some Make it 4V’s
Harnessing
Big Data
Techniques | Statistics, ML, data visualization | Machine learning, deep learning, expert systems | Supervised, unsupervised, and reinforcement learning
Where Do We See Data Science?
Data Science is applied in numerous fields, including:
•Healthcare: Disease prediction, personalized medicine, and medical
imaging analysis.
•Finance: Fraud detection, algorithmic trading, and credit scoring.
•Marketing: Customer segmentation, targeted advertising, and sentiment
analysis.
•E-commerce: Recommendation systems and inventory optimization.
•Social Media: Trend analysis, user behavior modeling, and fake news
detection.
•Sports: Performance analysis and game strategy optimization.
•Government: Policy modeling and public health analytics.
How Does Data Science Relate to Other Fields?
Data Science overlaps with several disciplines, including:
•Computer Science: Algorithms, databases, and software engineering
play a crucial role in handling and processing data.
•Statistics: Essential for data analysis, hypothesis testing, and
probability modeling.
•Machine Learning & Artificial Intelligence: Core components of
predictive analytics and automation.
•Business Intelligence: Uses data insights to support strategic decision-
making.
•Engineering & Natural Sciences: Utilizes data science for simulation,
modeling, and optimization.
The Relationship Between Data Science and Information Science
•Data Science focuses on extracting actionable insights from data using
statistical and computational techniques.
•Information Science studies how information is created, managed, and
shared, emphasizing human-computer interaction and data management.
•The two fields overlap in areas like data organization, retrieval, and
knowledge management, making Information Science a foundational
component of Data Science.
Computational Thinking in Data Science
Computational Thinking involves problem-solving skills fundamental to
Data Science, including:
•Decomposition: Breaking complex problems into manageable
components.
•Pattern Recognition: Identifying trends and structures in data.
•Abstraction: Simplifying problems by focusing on relevant information.
•Algorithm Design: Creating step-by-step instructions to solve problems
effectively.
Skills for Data Science
To succeed in Data Science, individuals need a combination of technical
and soft skills:
•Programming: Proficiency in Python, R, SQL, or Julia.
•Mathematics & Statistics: Linear algebra, probability, and inferential
statistics.
•Data Wrangling & Cleaning: Handling missing data and preprocessing
datasets.
•Machine Learning: Understanding supervised and unsupervised learning
techniques.
•Big Data Technologies: Experience with Hadoop, Spark, and cloud
computing.
•Data Visualization: Using Matplotlib, Seaborn, and Tableau for storytelling
with data.
•Domain Knowledge: Understanding the industry context for meaningful
analysis.
•Communication: Ability to explain findings to technical and non-technical
audiences.
Tools for Data Science
Data Scientists rely on various tools for analysis, visualization, and
model deployment, including:
•Programming Languages: Python, R, SQL
•Data Processing & Analysis: Pandas, NumPy, SciPy
•Machine Learning Frameworks: Scikit-learn, TensorFlow, PyTorch
•Visualization Tools: Matplotlib, Seaborn, Tableau, Power BI
•Big Data & Cloud Services: Hadoop, Apache Spark, AWS, Google
Cloud, Azure
Issues of Ethics, Bias, and Privacy in Data Science
Ethics in Data Science
•Ensuring transparency in data usage and model decisions.
•Avoiding manipulation or misrepresentation of data.
•Promoting accountability for automated decisions.
Use Cases | Research, policy-making, business intelligence | Sentiment analysis, trend prediction, marketing | AI applications, medical imaging, self-driving cars
Data Sources | Government portals, research institutions, NGOs | Facebook, Twitter, Instagram, YouTube, TikTok | Cameras, sensors, microphones, social media, medical devices
Characteristics | Structured, machine-readable, often in CSV, JSON, or API formats | Unstructured or semi-structured, continuously updated, platform-dependent | Combines different data types, requiring specialized processing
Data Preprocessing in Data Science
(The process of transforming raw data into an understandable format)
Four major tasks
1. Data Cleaning
2. Data Integration
3. Data Reduction
4. Data Transformation
1. Data cleaning - removing noisy data (incorrect, incomplete, or inconsistent
data) and replacing missing values. A missing value can be replaced with:
• N/A,
• the mean value (for normally distributed data),
  Ex1: 2, 4, 5, 8, ?, 6, 7, 9, 9. The mean of the eight known values is 50/8 = 6.25,
  so the missing value can be assumed to be 6.25.
• the median value (for non-normally distributed data), or
• the most probable value, filled in manually for small data sets and
  automatically for large data sets.
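The mean and median replacements can be worked through on the Ex1 values. This is a minimal sketch with the standard `statistics` module; using `None` to stand for the "?" entry is a convention chosen here.

```python
from statistics import mean, median

# Ex1 from the text; None marks the missing reading ("?").
values = [2, 4, 5, 8, None, 6, 7, 9, 9]
known = [v for v in values if v is not None]

fill_mean = mean(known)      # 50 / 8, suited to roughly normal data
fill_median = median(known)  # middle of the sorted values, robust to skew

filled_with_mean = [fill_mean if v is None else v for v in values]
filled_with_median = [fill_median if v is None else v for v in values]

print(fill_mean, fill_median)
```

For these eight known values the mean is 6.25 and the median is 6.5, so the two strategies give close but different replacements.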
Ex2: The List of students and their marks randomly mentioned, can make
Data, Information, Analysis, Analytics:
What is Data? Why is it so Crazy?
• Data is raw, unorganized, unanalysed, uninterpreted, and unrelated, and is used in different
contexts.
• For instance, facts and stats gathered by researchers for their analysis can collectively be
called data.
• Data in essence lacks informative value and is relatively meaningless unless it is given
a purpose or direction from which to acquire significance.
11:28 AM 44 CSE
Data vs Information:
Analysis vs Analytics
Data Analytics vs Data Analysis: Are they the Same or even Similar?
No! They are not the same. They have considerable differences between them.
But Data Analysis is actually a subset of Data Analytics.
Analysis
• Data Analysis helps in understanding the data and provides the required
insights from the past to understand what happened so far.
• Data Analysis is actually a subset of Data Analytics which helps us to
understand the data by questioning and to collect useful insights from
the information already available.
Analytics
• Data Analytics is the process of exploring the data from the past to
make appropriate decisions in the future by using valuable insights.
• Data Analytics is a wide area involving handling data with a lot of
necessary tools to produce helpful decisions with useful predictions for
a better output.
Different Types of Analytics:
Type of Analytics | Purpose | Key Question | Examples
Descriptive Analytics | Summarizes past events to understand patterns | What happened? | Sales reports, customer service data, website traffic

Descriptive Analytics
• Description: Analyzes historical data to summarize what happened.
• Purpose: Understand past events and trends.
• Key Question: What happened?
• Techniques: Data aggregation, data visualization, summary statistics
• Examples: Monthly sales report; Web traffic analytics; Financial performance review
Predictive Analytics
• Description: Uses historical data and machine learning to predict future outcomes.
• Purpose: Forecast future trends and behaviors.
• Key Question: What is likely to happen?
• Techniques: Regression analysis, time-series forecasting, machine learning models
• Examples: Stock market forecasting; Predicting customer churn; Weather predictions
Diagnostic Analytics
• Description: Investigates the cause of past outcomes or behaviors.
• Purpose: Identify reasons behind events.
• Key Question: Why did it happen?
• Techniques: Correlation analysis, root cause analysis, data mining
• Examples: Customer churn analysis; Analyzing why sales dropped; Understanding manufacturing defects
Prescriptive Analytics
• Description: Provides recommendations on the best course of action.
• Purpose: Suggest actions for optimal results.
• Key Question: What should be done?
• Techniques: Optimization, simulation, decision analysis
• Examples: Optimizing delivery routes; Marketing campaign recommendations; Inventory management
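The contrast between "What happened?" and "What is likely to happen?" can be sketched on one small series. The monthly sales figures below are hypothetical, and the straight-line trend is just one simple form of regression, chosen for illustration; diagnostic and prescriptive analytics are not covered by this sketch.

```python
from statistics import mean

# Hypothetical monthly sales for months 1 to 6.
months = [1, 2, 3, 4, 5, 6]
sales = [100, 112, 119, 131, 138, 150]

# Descriptive: summarize what happened.
summary = {"total": sum(sales), "average": mean(sales), "best": max(sales)}

# Predictive: fit a least-squares trend line and extrapolate one month ahead.
x_bar, y_bar = mean(months), mean(sales)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(months, sales)) / sum(
    (x - x_bar) ** 2 for x in months
)
intercept = y_bar - slope * x_bar
forecast_month_7 = intercept + slope * 7

print(summary)
print(round(forecast_month_7, 1))
```

The summary only restates the past, while the trend line turns the same data into an estimate for a month that has not happened yet.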
Data and Architecture Design:
Data architecture in Information Technology is composed of models, policies, rules or
standards that govern which data is collected, and how it is stored, arranged, integrated, and
put to use in data systems and in organizations.
A data architecture should set data standards for all its data systems.
Data architectures address data in storage and data in motion; descriptions of data stores,
data groups and data items; and mappings of those data artifacts to data qualities, applications,
locations etc.
Data Architecture describes how data is processed, stored, and utilized in a given system.
The Data Architect is typically responsible for defining the target state, aligning during
development to ensure enhancements are done in the spirit of the original blueprint.
Scenario: A dataset of customer information has missing values in the "Age" and "Income" columns.
•Techniques:
•Imputation:
•Replace missing "Age" values with the median age.
•Replace missing "Income" values with the mean income.
•Deletion:
•Remove rows with missing values (if the number of missing values is small).
•Example:
•If a customer's age is "NaN," calculate the median age of all customers and replace "NaN" with that value.
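The imputation and deletion techniques for this scenario can be sketched as follows. The customer records and the `column` helper are hypothetical, and `None` stands in for NaN; with pandas, the same idea is usually written with `fillna` and `dropna`.

```python
from statistics import mean, median

# Hypothetical customer records; None stands in for NaN.
customers = [
    {"name": "A", "age": 25, "income": 40000},
    {"name": "B", "age": None, "income": 52000},
    {"name": "C", "age": 41, "income": None},
    {"name": "D", "age": 33, "income": 61000},
]


def column(rows, key):
    """Return the non-missing values of one column."""
    return [r[key] for r in rows if r[key] is not None]


# Imputation: median for "age", mean for "income".
age_fill = median(column(customers, "age"))
income_fill = mean(column(customers, "income"))
imputed = [
    {
        **r,
        "age": age_fill if r["age"] is None else r["age"],
        "income": income_fill if r["income"] is None else r["income"],
    }
    for r in customers
]

# Deletion: keep only the rows that have no missing values.
complete = [r for r in customers if None not in r.values()]

print(age_fill, income_fill, len(complete))
```

Imputation keeps all four customers, while deletion keeps only the two complete rows, which is why deletion is suggested only when few values are missing.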
Outlier Detection and Removal:
•Scenario: A dataset of house prices has a few houses with extremely high prices compared to the rest.
• Techniques:
• Visual inspection: Plot a box plot or scatter plot to identify outliers.
• Statistical methods: Use the interquartile range (IQR) to identify values that are significantly different from the
rest.
• Removal: Remove the outlier data points from the dataset.
•Example:
• A house price of $10 million in a neighborhood where most houses are priced between $200,000 and $500,000
might be considered an outlier.
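The IQR method can be sketched on a small price list. The prices are hypothetical, and the 1.5 × IQR fences are a common convention assumed here; the text does not specify a multiplier.

```python
from statistics import quantiles

# Hypothetical house prices in dollars; the last one is the $10 million house.
prices = [200_000, 250_000, 300_000, 320_000, 350_000,
          400_000, 450_000, 500_000, 10_000_000]

# Quartile cut points; the middle one (the median) is not needed here.
q1, _, q3 = quantiles(prices, n=4)
iqr = q3 - q1

# Assumed convention: values beyond 1.5 * IQR from the quartiles are outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [p for p in prices if p < lower or p > upper]
cleaned = [p for p in prices if lower <= p <= upper]

print(outliers)
print(len(cleaned))
```

Only the $10 million house falls outside the fences, so removing it leaves the eight typical prices untouched.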
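Noisy Data (inconsistent or incorrect/erroneous data)
Binning means sorting the data and assigning the sorted values to bins. Smoothing
then replaces the values in each bin to remove error values:
• smoothing by bin mean: each value is replaced with the mean of its bin
• smoothing by bin median: each value is replaced with the median of its bin
• smoothing by bin boundary: each value is replaced with the closest boundary (min or max value) of its bin
Binning can also be used to fill missing values with the mean or median of the bin.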
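The three smoothing variants can be sketched with equal-size bins. The `smooth_by_bins` function and the sample values are hypothetical; sending ties to the lower boundary is an assumption made here, since the text does not say how ties are resolved.

```python
from statistics import mean, median


def smooth_by_bins(values, bin_size, method="mean"):
    """Sort values, split them into equal-size bins, and smooth each bin."""
    data = sorted(values)
    smoothed = []
    for start in range(0, len(data), bin_size):
        bin_ = data[start:start + bin_size]
        if method == "mean":
            smoothed += [mean(bin_)] * len(bin_)
        elif method == "median":
            smoothed += [median(bin_)] * len(bin_)
        elif method == "boundary":
            lo, hi = bin_[0], bin_[-1]
            # Replace each value with the closer boundary (ties go to the min).
            smoothed += [lo if v - lo <= hi - v else hi for v in bin_]
        else:
            raise ValueError(f"unknown method: {method}")
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sample

print(smooth_by_bins(prices, 3, "mean"))
print(smooth_by_bins(prices, 3, "median"))
print(smooth_by_bins(prices, 3, "boundary"))
```

With three values per bin, the first bin (4, 8, 15) becomes 9, 9, 9 under mean smoothing, 8, 8, 8 under median smoothing, and 4, 4, 15 under boundary smoothing.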