
FOUNDATIONS OF DATA SCIENCE
By Mr. S. SRINIVAS REDDY, Asst. Professor
Foundations of Data Science

1. Introduction to Data Science

• Definition and scope of Data Science
• Applications in various domains (healthcare, finance, marketing, etc.)
• Difference between Data Science, Machine Learning, and AI
What is Data Science?
Data Science is an interdisciplinary field that combines statistical
analysis, machine learning, data visualization, and computational
techniques to extract insights and knowledge from structured and
unstructured data. It integrates elements of mathematics,
programming, and domain expertise to facilitate data-driven decision-
making.
Difference Between Structured, Semi-Structured, and Unstructured Data

Definition:
• Structured Data: Data that is organized in a fixed format, typically in tables with predefined schemas.
• Semi-Structured Data: Data that has some organizational properties but does not follow a strict schema.
• Unstructured Data: Data that lacks a predefined structure or organization.

Storage:
• Structured Data: Stored in relational databases (SQL databases).
• Semi-Structured Data: Stored in NoSQL databases, XML, JSON, or semi-structured repositories.
• Unstructured Data: Stored in file systems, data lakes, or object storage.

Schema:
• Structured Data: Well-defined schema (rows and columns).
• Semi-Structured Data: Partial schema or flexible structure.
• Unstructured Data: No predefined schema.

Querying:
• Structured Data: Easily queried using SQL.
• Semi-Structured Data: Requires specialized tools (e.g., NoSQL databases, JSON parsers).
• Unstructured Data: Difficult to query directly; requires NLP, AI, or data processing tools.

Examples:
• Structured Data: Customer records in an SQL database; sales transactions stored in a relational database.
• Semi-Structured Data: XML and JSON data used in APIs; log files from web servers.
• Unstructured Data: Emails, images, videos, social media posts, PDFs.
Properties comparison:

Technology:
• Structured data: Based on relational database tables.
• Semi-structured data: Based on XML/RDF (Resource Description Framework).
• Unstructured data: Based on character and binary data.

Transaction management:
• Structured data: Matured transactions and various concurrency techniques.
• Semi-structured data: Transactions adapted from the DBMS, not matured.
• Unstructured data: No transaction management and no concurrency.

Version management:
• Structured data: Versioning over tuples, rows, and tables.
• Semi-structured data: Versioning over tuples or graphs is possible.
• Unstructured data: Versioned as a whole.

Flexibility:
• Structured data: Schema-dependent and less flexible.
• Semi-structured data: More flexible than structured data but less flexible than unstructured data.
• Unstructured data: Most flexible; absence of schema.

Scalability:
• Structured data: Very difficult to scale the DB schema.
• Semi-structured data: Scaling is simpler than for structured data.
• Unstructured data: Most scalable.

Robustness:
• Structured data: Very robust.
• Semi-structured data: Newer technology, not very widespread.
• Unstructured data: —

Query performance:
• Structured data: Structured queries allow complex joins.
• Semi-structured data: Queries over anonymous nodes are possible.
• Unstructured data: Only textual queries are possible.

Big Data includes huge volume, high velocity, and an extensible variety of data. There are three types: structured data, semi-structured data, and unstructured data.

Big Data: The 3 V's
Volume (Scale)
• Data volume is increasing exponentially
• 44× increase from 2009 to 2020: from 0.8 zettabytes to 35 ZB
• Exponential increase in collected/generated data
Examples of the data deluge:
• 12+ TB of tweet data every day
• 25+ TB of log data every day
• ? TB of data every day from 4.6 billion camera phones worldwide
• 30 billion RFID tags today (1.3 billion in 2005)
• 100s of millions of GPS-enabled devices sold annually
• 2+ billion people on the Web by end of 2011
• 76 million smart meters in 2009, 200 million by 2014
• CERN's Large Hadron Collider (LHC) generates 15 PB a year
The EarthScope
The EarthScope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much more. (http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)
Variety (Complexity)
• Relational data (tables/transactions/legacy data)
• Text data (Web)
• Semi-structured data (XML)
• Graph data: social networks, Semantic Web (RDF), …
• Streaming data: you can only scan the data once
• A single application can be generating/collecting many types of data
• Big public data (online, weather, …)
To extract knowledge, all these types of data need to be linked together.
A Single View to the Customer
Combining sources such as social media, banking and finance, gaming, entertainment, purchase history, and known customer history into a single view of our customer.
Velocity (Speed)
• Data is being generated fast and needs to be processed fast
• Online data analytics
• Late decisions  missing opportunities
• Examples:
  • E-Promotions: based on your current location, your purchase history, and what you like  send promotions right now for the store next to you
  • Healthcare monitoring: sensors monitoring your activities and body  any abnormal measurements require immediate reaction
Real-time/Fast Data
• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)

Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.
Real-Time Analytics/Decision Requirement
• Product recommendations that are relevant and compelling
• Learning why customers switch to competitors and their offers, in time to counter
• Influencing behavior: friend invitations to join a game or activity that expands the business
• Improving the marketing effectiveness of a promotion while it is still in play
• Preventing fraud as it is occurring, and preventing more proactively
Some make it 4 V's (a fourth V, veracity, is often added).
Harnessing Big Data
• OLTP: Online Transaction Processing (DBMSs)
• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data architecture and technology)
Typically, a data science project goes through the following stages:
1. Problem Definition: Identifying the problem or question to answer.
2. Data Collection: Gathering the necessary data from various sources.
3. Data Cleaning: Preprocessing and cleaning the data to ensure its quality.
4. Data Exploration: Analyzing the data to understand its structure and patterns.
5. Feature Engineering: Enhancing the data with new, derived features.
6. Model Building: Creating a statistical or machine learning model based on the data.
7. Model Validation: Evaluating the model's performance using various metrics.
8. Model Deployment: Implementing the model in a production environment.
9. Monitoring: Observing the model's performance over time and refining as necessary.
Key Differences: Data Science vs. Machine Learning vs. Artificial Intelligence

Goal:
• Data Science: Extract insights from data.
• Machine Learning: Enable machines to learn from data.
• Artificial Intelligence: Enable machines to mimic human intelligence.

Scope:
• Data Science: Broad field covering data analysis, ML, and statistics.
• Machine Learning: Subset of AI that learns from data.
• Artificial Intelligence: Broadest field that includes ML and other intelligent systems.

Techniques:
• Data Science: Statistics, ML, data visualization.
• Machine Learning: Supervised, unsupervised, and reinforcement learning.
• Artificial Intelligence: Machine learning, deep learning, expert systems.

Example:
• Data Science: Analyzing customer data for trends.
• Machine Learning: Predicting house prices based on features.
• Artificial Intelligence: Self-driving cars making real-time decisions.
Where Do We See Data Science?
Data Science is applied in numerous fields, including:
•Healthcare: Disease prediction, personalized medicine, and medical
imaging analysis.
•Finance: Fraud detection, algorithmic trading, and credit scoring.
•Marketing: Customer segmentation, targeted advertising, and sentiment
analysis.
•E-commerce: Recommendation systems and inventory optimization.
•Social Media: Trend analysis, user behavior modeling, and fake news
detection.
•Sports: Performance analysis and game strategy optimization.
•Government: Policy modeling and public health analytics.
How Does Data Science Relate to Other Fields?
Data Science overlaps with several disciplines, including:
•Computer Science: Algorithms, databases, and software engineering
play a crucial role in handling and processing data.
•Statistics: Essential for data analysis, hypothesis testing, and
probability modeling.
•Machine Learning & Artificial Intelligence: Core components of
predictive analytics and automation.
•Business Intelligence: Uses data insights to support strategic decision-
making.
•Engineering & Natural Sciences: Utilizes data science for simulation,
modeling, and optimization.
The Relationship Between Data Science and Information Science
•Data Science focuses on extracting actionable insights from data using
statistical and computational techniques.
•Information Science studies how information is created, managed, and
shared, emphasizing human-computer interaction and data management.
•The two fields overlap in areas like data organization, retrieval, and
knowledge management, making Information Science a foundational
component of Data Science.
Computational Thinking in Data Science
Computational Thinking involves problem-solving skills fundamental to
Data Science, including:
•Decomposition: Breaking complex problems into manageable
components.
•Pattern Recognition: Identifying trends and structures in data.
•Abstraction: Simplifying problems by focusing on relevant information.
•Algorithm Design: Creating step-by-step instructions to solve problems
effectively.
Skills for Data Science
To succeed in Data Science, individuals need a combination of technical
and soft skills:
•Programming: Proficiency in Python, R, SQL, or Julia.
•Mathematics & Statistics: Linear algebra, probability, and inferential
statistics.
•Data Wrangling & Cleaning: Handling missing data and preprocessing
datasets.
•Machine Learning: Understanding supervised and unsupervised learning
techniques.
•Big Data Technologies: Experience with Hadoop, Spark, and cloud
computing.
•Data Visualization: Using Matplotlib, Seaborn, and Tableau for storytelling
with data.
•Domain Knowledge: Understanding the industry context for meaningful
analysis.
•Communication: Ability to explain findings to technical and non-technical
audiences.
Tools for Data Science
Data Scientists rely on various tools for analysis, visualization, and
model deployment, including:
•Programming Languages: Python, R, SQL
•Data Processing & Analysis: Pandas, NumPy, SciPy
•Machine Learning Frameworks: Scikit-learn, TensorFlow, PyTorch
•Visualization Tools: Matplotlib, Seaborn, Tableau, Power BI
•Big Data & Cloud Services: Hadoop, Apache Spark, AWS, Google
Cloud, Azure
Issues of Ethics, Bias, and Privacy in Data Science
Ethics in Data Science
•Ensuring transparency in data usage and model decisions.
•Avoiding manipulation or misrepresentation of data.
•Promoting accountability for automated decisions.

Bias in Data Science


•Data Bias: Arises from imbalanced datasets leading to inaccurate or
unfair outcomes.
•Algorithmic Bias: Unintended prejudices embedded in machine
learning models.
•Human Bias: Subjective interpretations influencing model training
and results.
Privacy Concerns in Data Science
•Data Protection: Compliance with regulations like GDPR and HIPAA.
•Anonymization Techniques: Removing personally identifiable information
(PII) before analysis.
•Security Risks: Preventing data breaches and unauthorized access.
Introduction to Data Science – Chapter 2
Data Types:
Structured Data,
Unstructured Data,
Challenges with Unstructured Data.
Data Collections:
Open Data,
Social Media Data,
Multimodal Data,
Data Storage and Presentation.
Data Pre-processing:
Data Cleaning, Data Integration,
Data Transformation, Data Reduction, Data Discretization.
Managing unstructured data presents numerous challenges, including storage costs, difficulty in organization and retrieval, security risks, and challenges in data quality and compliance.
What is the role of Data Analytics?

Gather Hidden Insights – Hidden insights from data are gathered and then analyzed with respect to business requirements.

Generate Reports – Reports are generated from the data and are passed on to the respective teams and individuals to deal with further actions for business growth.

Perform Market Analysis – Market analysis can be performed to understand the strengths and weaknesses of competitors.

Improve Business Requirements – Analysis of data allows improving business-to-customer requirements and experience.
Open Data vs. Social Media Data vs. Multimodal Data

Definition:
• Open Data: Publicly accessible datasets that can be freely used and shared.
• Social Media Data: Data generated from social media platforms through user interactions.
• Multimodal Data: Data that integrates multiple formats such as text, images, video, and audio.

Data Sources:
• Open Data: Government portals, research institutions, NGOs.
• Social Media Data: Facebook, Twitter, Instagram, YouTube, TikTok.
• Multimodal Data: Cameras, sensors, microphones, social media, medical devices.

Ownership:
• Open Data: Provided by governments, organizations, or institutions.
• Social Media Data: Owned by social media companies; subject to privacy policies.
• Multimodal Data: Generated from various sources (e.g., sensors, media, text inputs).

Characteristics:
• Open Data: Structured, machine-readable, often in CSV, JSON, or API formats.
• Social Media Data: Unstructured or semi-structured, continuously updated, platform-dependent.
• Multimodal Data: Combines different data types, requiring specialized processing.

Formats:
• Open Data: CSV, JSON, XML, API feeds.
• Social Media Data: Text, images, videos, comments, likes, shares.
• Multimodal Data: Text + Image, Video + Audio, Speech + Text.

Use Cases:
• Open Data: Research, policy-making, business intelligence.
• Social Media Data: Sentiment analysis, trend prediction, marketing.
• Multimodal Data: AI applications, medical imaging, self-driving cars.

Examples:
• Open Data: World Bank Open Data, NASA Earth Observation Data.
• Social Media Data: Twitter sentiment analysis, Instagram engagement metrics.
• Multimodal Data: Autonomous vehicles (sensor + GPS + camera), AI chatbots (voice + text).
Data Pre-processing in Data Science
(The process of transforming raw data into an understandable format.)
Four major tasks:
1. Data Cleaning
2. Data Integration
3. Data Reduction
4. Data Transformation

1. Data Cleaning – removing noisy data (incorrect, incomplete, inconsistent data) and replacing missing values. For missing values, replace with N/A, a mean value (for normally distributed data), a median value (for non-normal data), or the most probable value; this is done manually for small data sets and automatically for large data sets.
Ex1: 2, 4, 5, 8, ?, 6, 7, 9, 9 – the mean of the known values is 6.25, so the missing value can be assumed to be about 6.
Ex2: For a list of students and their marks given in random order, the marks can be sorted to find the median value for imputation.
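A minimal pandas sketch of the two imputation choices just described, using the Ex1 values (variable names are illustrative):

```python
import pandas as pd

# Ex1 series with one missing value
marks = pd.Series([2, 4, 5, 8, None, 6, 7, 9, 9])

# Mean imputation (suited to roughly normal data): mean of known values = 6.25
mean_filled = marks.fillna(marks.mean())

# Median imputation (more robust for non-normal/skewed data): median = 6.5
median_filled = marks.fillna(marks.median())

print(mean_filled.tolist())
print(median_filled.tolist())
```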
Data, Information, Analysis, Analytics:
What is Data? Why is it so crazy?
• Data is raw, unorganized, unanalysed, uninterpreted, and unrelated material used in different contexts.

• For instance, facts and stats gathered by researchers for their analysis can collectively be called data.

• Data in essence lacks informative fervour and renders itself relatively meaningless unless given a purpose or direction to acquire its significance.
Data vs Information:

Analysis vs. Analytics
Data Analytics vs. Data Analysis: are they the same, or even similar?
No, they are not the same; there are considerable differences between them. Data Analysis is actually a subset of Data Analytics.

Analysis:
• Data Analysis helps in understanding the past data and provides the insights required to understand what happened so far.
• Data Analysis is a subset of Data Analytics which helps us understand the data by questioning it and collecting useful insights from the information already available.

Analytics:
• Data Analytics is the process of exploring data from the past to make appropriate decisions in the future by using valuable insights.
• Data Analytics is a wide area involving handling data with the necessary tools to produce helpful decisions and useful predictions for a better output.
Different Types of Analytics:

Descriptive Analytics:
• Purpose: Summarizes past events to understand patterns.
• Key Question: What happened?
• Examples: Sales reports, customer service data, website traffic.

Diagnostic Analytics:
• Purpose: Explains the causes of past outcomes.
• Key Question: Why did it happen?
• Examples: Customer churn analysis, sales decline analysis, defect analysis.

Predictive Analytics:
• Purpose: Forecasts future trends and behaviors.
• Key Question: What is likely to happen?
• Examples: Stock market prediction, weather forecasting, customer behavior prediction.

Prescriptive Analytics:
• Purpose: Recommends actions and optimizations.
• Key Question: What should be done?
• Examples: Fraud detection, supply chain optimization, marketing strategies.

Cognitive Analytics:
• Purpose: Simulates human thought processes and decision-making.
• Key Question: How can we make smarter decisions?
• Examples: Healthcare diagnosis (IBM Watson), AI-powered chatbots, fraud detection.
In more detail:

Descriptive Analytics:
• Definition: Analyzes historical data to summarize what happened.
• Purpose: Understand past events and trends.
• Key Question: What happened?
• Techniques Used: Data aggregation, data visualization, summary statistics.
• Examples: Monthly sales report, web traffic analytics, financial performance review.

Predictive Analytics:
• Definition: Uses historical data and machine learning to predict future outcomes.
• Purpose: Forecast future trends and behaviors.
• Key Question: What is likely to happen?
• Techniques Used: Regression analysis, time-series forecasting, machine learning models.
• Examples: Stock market forecasting, predicting customer churn, weather predictions.

Diagnostic Analytics:
• Definition: Investigates the cause of past outcomes or behaviors.
• Purpose: Identify reasons behind events.
• Key Question: Why did it happen?
• Techniques Used: Correlation analysis, root cause analysis, data mining.
• Examples: Customer churn analysis, analyzing why sales dropped, understanding manufacturing defects.

Prescriptive Analytics:
• Definition: Provides recommendations on the best course of action.
• Purpose: Suggest actions for optimal results.
• Key Question: What should be done?
• Techniques Used: Optimization, simulation, decision analysis.
• Examples: Optimizing delivery routes, marketing campaign recommendations, inventory management.

Cognitive Analytics:
• Definition: Uses AI, machine learning, and natural language processing (NLP) to simulate human thinking and decision-making.
• Purpose: Mimics human cognitive processes for decision-making and adaptive learning.
• Key Question: How can we make smarter decisions?
• Techniques Used: NLP, machine learning, deep learning, neural networks.
• Examples: IBM Watson for healthcare diagnostics, AI chatbots for customer service, fraud detection systems.
1. Data Cleaning (Removing Errors and Inconsistencies)
•Example: Handling missing or incorrect customer details in an e-
commerce database.
•Process:
• Identify missing values in customer addresses.
• Remove duplicate entries for the same customer.
• Standardize formats (e.g., "USA" and "United States" should be
the same).
• Replace incorrect data (e.g., negative age values).
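As an illustration, here is a small pandas sketch of these cleaning steps on a hypothetical customer table (the column names and values are assumptions, not from an actual system):

```python
import pandas as pd
import numpy as np

# Hypothetical e-commerce customer records
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "country":     ["USA", "USA", "United States", "India"],
    "age":         [34, 34, -5, np.nan],
})

# Remove duplicate entries for the same customer
df = df.drop_duplicates(subset="customer_id")

# Standardize formats: "United States" and "USA" should be the same
df["country"] = df["country"].replace({"United States": "USA"})

# Replace incorrect data: negative ages become missing, then impute
df.loc[df["age"] < 0, "age"] = np.nan
df["age"] = df["age"].fillna(df["age"].median())

print(df)
```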
2. Data Integration (Combining Data from Multiple Sources)
•Example: A hospital integrates patient data from different
departments (Radiology, Lab Tests, and Billing).
•Process:
• Collect patient data from different hospital databases.
• Remove inconsistencies in data formats (e.g., date formats like
MM/DD/YYYY vs. DD/MM/YYYY).
• Merge records using a unique patient ID.
• Ensure no duplication or redundancy.
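A short pandas sketch of this integration flow, assuming hypothetical Radiology and Billing tables keyed by a shared patient ID (table and column names are invented for illustration):

```python
import pandas as pd

radiology = pd.DataFrame({"patient_id": [101, 102],
                          "scan_date": ["03/25/2024", "04/02/2024"]})  # MM/DD/YYYY
billing   = pd.DataFrame({"patient_id": [101, 102],
                          "bill_date": ["25/03/2024", "02/04/2024"]})  # DD/MM/YYYY

# Remove inconsistencies in date formats by parsing both into datetimes
radiology["scan_date"] = pd.to_datetime(radiology["scan_date"], format="%m/%d/%Y")
billing["bill_date"] = pd.to_datetime(billing["bill_date"], format="%d/%m/%Y")

# Merge records using the unique patient ID and drop any duplicated rows
merged = radiology.merge(billing, on="patient_id", how="inner").drop_duplicates()
print(merged)
```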
3. Data Transmission (Transferring Data Efficiently)
•Example: Real-time stock market data streaming to financial
applications.
•Process:
• Stock exchange generates price updates every second.
• Data is compressed before transmission to reduce bandwidth
usage.
• A secure protocol (e.g., HTTPS, MQTT) is used for real-time data
transfer.
• The financial app receives and decodes the data for display.
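A simplified Python sketch of the compress-and-decode cycle using only the standard library (the payload fields are invented; a production system would stream the bytes over a secure protocol such as HTTPS or an MQTT broker rather than a local variable):

```python
import json
import zlib

# Hypothetical price update generated by the exchange every second
update = {"symbol": "ABC", "price": 101.25, "ts": "2024-03-25T10:15:01Z"}

# Compress before transmission to reduce bandwidth usage
payload = zlib.compress(json.dumps(update).encode("utf-8"))

# ... payload would be sent here over a secure protocol (e.g., HTTPS, MQTT) ...

# The financial app receives and decodes the data for display
received = json.loads(zlib.decompress(payload).decode("utf-8"))
print(received["symbol"], received["price"])
```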
4. Data Reduction (Minimizing Storage and Processing Load)
•Example: Compressing image files in a cloud storage system.
•Process:
• Convert high-resolution images (e.g., 4K) to lower resolutions
while maintaining quality.
• Remove unnecessary metadata from image files.
• Use data sampling techniques to store only essential parts of
large datasets.
• Reduce dimensions in a dataset using Principal Component
Analysis (PCA).
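A brief scikit-learn sketch of the PCA step just mentioned, run on synthetic correlated data (the shapes and the 95% variance threshold are illustrative choices):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic dataset: 10 observed features driven by only 3 latent factors
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 10)) \
    + 0.05 * rng.normal(size=(100, 10))

# Keep the smallest number of components explaining ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)  # expect roughly (100, 10) -> (100, 3)
print(pca.explained_variance_ratio_.round(3))
```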
5. Data Discretization (Converting Continuous Data into Categories)
•Example: Categorizing customer age groups for a marketing
campaign.
•Process:
• Collect customer age data (e.g., 18, 25, 37, 45, 60).
• Define age group bins:
• 0-18: Teen
• 19-35: Young Adult
• 36-50: Middle Age
• 51+: Senior
• Assign each customer to a category.
• Use categorized data for targeted advertisements.
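A minimal pandas sketch of this binning, using pd.cut with the age-group edges defined above:

```python
import pandas as pd

ages = pd.Series([18, 25, 37, 45, 60])

# Bin edges and labels matching the groups above (right-inclusive intervals)
bins = [0, 18, 35, 50, 120]
labels = ["Teen", "Young Adult", "Middle Age", "Senior"]

age_group = pd.cut(ages, bins=bins, labels=labels)
print(pd.DataFrame({"age": ages, "group": age_group}))
```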
What are the tools used in Data Analytics?
• With the increasing demand for Data Analytics in the market, many tools have emerged with various functionalities for this purpose.
• Either open-source or user-friendly, the top tools in the data analytics market are as follows:

 R programming
 Python
 Tableau Public
 QlikView
 SAS
 Microsoft Excel
 RapidMiner
 KNIME
 OpenRefine
 Apache Spark
Data and Architecture Design:
Data architecture in Information Technology is composed of models, policies, rules, or standards that govern which data is collected, and how it is stored, arranged, integrated, and put to use in data systems and in organizations.
 A data architecture should set data standards for all its data systems.
 Data architectures address data in storage and data in motion; descriptions of data stores, data groups, and data items; and mapping of those data artifacts to data qualities, applications, locations, etc.
 Data Architecture describes how data is processed, stored, and utilized in a given system.
 The Data Architect is typically responsible for defining the target state and aligning during development to ensure enhancements are done in the spirit of the original blueprint.
Scenario: A dataset of customer information has missing values in the "Age" and "Income" columns.

•Techniques:
•Imputation:
•Replace missing "Age" values with the median age.
•Replace missing "Income" values with the mean income.

•Deletion:
•Remove rows with missing values (if the number of missing values is small).
•Example:

•If a customer's age is "NaN," calculate the median age of all customers and replace "NaN" with that value.
Outlier Detection and Removal:
•Scenario: A dataset of house prices has a few houses with extremely high prices compared to the rest.
• Techniques:
• Visual inspection: Plot a box plot or scatter plot to identify outliers.
• Statistical methods: Use the interquartile range (IQR) to identify values that are significantly different from the
rest.
• Removal: Remove the outlier data points from the dataset.
•Example:
• A house price of $10 million in a neighborhood where most houses are priced between $200,000 and $500,000
might be considered an outlier.
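A small pandas sketch of the IQR method for this scenario (the price values are made up):

```python
import pandas as pd

prices = pd.Series([200_000, 250_000, 320_000, 410_000, 480_000, 10_000_000])

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = prices[(prices < lower) | (prices > upper)]
cleaned = prices[(prices >= lower) & (prices <= upper)]
print("outliers:", outliers.tolist())  # the $10M house is flagged
```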
Noisy Data (inconsistent or incorrect/error data)

Binning: sort the data and assign the values into bins, then smooth to remove error values:
• Smoothing by bin means – each value in a bin is replaced by the bin's mean.
• Smoothing by bin medians – each value is replaced by the bin's median.
• Smoothing by bin boundaries – each value is replaced by the nearest bin boundary (min or max value).

Regression – smoothing data by fitting it to a function for numerical prediction.

Clustering – similar data items are grouped at one place; dissimilar items fall outside the cluster (outliers).
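A NumPy sketch of smoothing by bin means and by bin boundaries, using nine sorted values split into three equal-frequency bins (the data values are illustrative):

```python
import numpy as np

# Sorted data split into three equal-frequency bins of three values each
data = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))
bins = data.reshape(3, 3)

# Smoothing by bin means: every value becomes its bin's mean
by_means = np.repeat(bins.mean(axis=1), 3)  # 9,9,9, 22,22,22, 29,29,29

# Smoothing by bin boundaries: each value snaps to the nearer bin edge
lo, hi = bins[:, :1], bins[:, -1:]
by_bounds = np.where(bins - lo <= hi - bins, lo, hi).ravel()

print(by_means)
print(by_bounds)  # 4,4,15, 21,21,24, 25,25,34
```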
Data Integration –
Multiple heterogeneous sources of data are combined into a single dataset.
Two types of data integration:
1. Tight coupling – data is combined together into one physical location.
2. Loose coupling – only an interface is created, and data is combined and accessed through the interface; the data remains stored in the source databases.
Data Reduction –
The volume of data is reduced to make analysis easier. Methods for data reduction:

1. Dimensionality reduction – reduces the number of input variables in the dataset, because a large number of input variables leads to poor performance.

2. Data cube aggregation – data is combined to form a data cube, and redundant, noisy data is removed.

3. Attribute subset selection (attributes are columns) – only highly relevant attributes are used; the others are discarded (the data is reduced).

4. Numerosity reduction – store only a model (a sample) of the data rather than the entire dataset.
Data Transformation –
Data is transformed into an appropriate form suitable for the data mining process.
Four methods:
1. Normalization – scale the data values into a specified range (e.g., -1.0 to 1.0, or 0 to 1).
2. Attribute selection – new attributes are created using the older ones.
3. Discretization – raw data values are replaced by interval levels.
E.g., 10, 12, 13, 14, 21, 22, 34, 36 -> 10-20, 20-30, 30-40.
Data Discretization:
•Scenario: A dataset contains "Age" values as continuous numbers.
• Techniques:
• Binning: Divide the "Age" values into discrete intervals (e.g., "Young," "Middle-aged,"
"Senior").
•Example:
• Create age groups: 18-30 = "Young," 31-50 = "Middle-aged," 51+ = "Senior."
4. Concept hierarchy generation – converting attributes from a low-level attribute to a higher-level attribute, e.g., city -> country.
Feature Scaling:
•Scenario: A dataset contains "Income" values in the range of thousands and
"Age" values in the range of tens.
• Techniques:
• Min-max scaling: Scale the "Income" and "Age" values to a common
range (e.g., 0 to 1).
•Example:
• Using Min-max scaling, a person aged 25 from a group of people aged 18
to 70 would have their age scaled to a value within the range of 0 to 1.
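A one-line pandas sketch of min-max scaling, x' = (x - min) / (max - min), applied to hypothetical age and income columns:

```python
import pandas as pd

df = pd.DataFrame({"age": [18, 25, 44, 70],
                   "income": [20_000, 35_000, 80_000, 120_000]})

# Min-max scaling maps each column onto [0, 1]
scaled = (df - df.min()) / (df.max() - df.min())
print(scaled)  # age 25 in the 18-70 group scales to about 0.13
```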
Data Transformation:
•Scenario: A dataset contains "Temperature" values in Celsius and Fahrenheit.
• Techniques:
• Normalization/Standardization:
• Convert all "Temperature" values to a common scale (e.g., Celsius).
• Use Z-score standardization to transform numerical data to have a mean of 0 and
a standard deviation of 1.
• Encoding categorical data:
• Convert "Gender" values ("Male," "Female") into numerical values (e.g., 0 and 1).
This is often done using one hot encoding.
•Example:
• Convert all Fahrenheit temperatures to Celsius using the appropriate formula.
• Convert a "color" column containing "red", "blue", and "green" into three columns
labeled "is_red", "is_blue", "is_green", where 1 indicates that the color is present, and 0
indicates it is not.
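A compact pandas sketch of both transformations in this example: the Fahrenheit-to-Celsius conversion and one-hot encoding of a color column (the sample values are invented):

```python
import pandas as pd

df = pd.DataFrame({"temp_f": [32.0, 98.6, 212.0],
                   "color":  ["red", "blue", "green"]})

# Convert Fahrenheit to Celsius: C = (F - 32) * 5/9
df["temp_c"] = (df["temp_f"] - 32) * 5 / 9

# One-hot encode "color" into is_red / is_blue / is_green indicator columns
onehot = pd.get_dummies(df["color"], prefix="is").astype(int)
df = pd.concat([df.drop(columns="color"), onehot], axis=1)
print(df)
```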
2. Data Types and Data Structures
•Structured vs. Unstructured Data
•Data types: Numerical, Categorical, Ordinal
•Common data structures: Arrays, Lists, Dictionaries, DataFrames
3. Data Collection and Cleaning
•Sources of Data (APIs, Web Scraping, Databases, etc.)
•Handling missing values (Imputation, Dropping, etc.)
•Removing duplicates and dealing with outliers
4. Exploratory Data Analysis (EDA)
•Descriptive Statistics: Mean, Median, Mode, Variance, Standard
Deviation
•Data Visualization: Histograms, Boxplots, Scatter Plots, Heatmaps
•Feature Engineering and Selection
5. Probability and Statistics for Data Science
•Basic Probability Rules and Distributions (Normal, Binomial, Poisson)
•Hypothesis Testing and Confidence Intervals
•Correlation vs. Causation
6. Machine Learning Basics
•Supervised vs. Unsupervised Learning
•Common algorithms: Linear Regression, Decision Trees, k-NN, Clustering
•Model Evaluation Metrics (Accuracy, Precision, Recall, F1-Score)
7. Introduction to Python for Data Science
•Libraries: NumPy, Pandas, Matplotlib, Seaborn, Scikit-learn
•Data Manipulation with Pandas
•Data Visualization with Matplotlib & Seaborn
8. Big Data and Cloud Computing
•Introduction to Big Data Technologies (Hadoop, Spark)
•Cloud Computing for Data Science (AWS, Google Cloud, Azure)
•Data Storage and Processing
9. Ethical Considerations in Data Science
•Bias in Data and Algorithms
•Privacy and Data Protection (GDPR, HIPAA)
•Responsible AI and Fairness
10. Case Studies and Real-World Applications
•Predictive Analytics in Finance
•Recommender Systems in E-commerce
•Healthcare Analytics
