Module 3
DATA SCIENCE
Presenter: Dr. Shital Bhatt
Associate Professor
School of Computational and Data Sciences
www.vidyashilpuniversity.com
Raw Data Implications and Data from Multiple Sources
Introduction
Flexibility in Analysis
- Allows custom transformations and aggregations
Richness
- Contains detailed information
Implications of Raw Data - Challenges
Volume
- Large amounts of data to process
Variety
- Different formats and structures
Veracity
- Quality and accuracy of data
Velocity
- Speed at which data is generated and needs to be processed
Implications of Raw Data - Use Cases
Machine Learning
- Training models on comprehensive datasets
Business Intelligence
- Driving strategic decisions with complete data
Processing Raw Data - Data Cleaning
Removing Duplicates
- Ensuring data uniqueness
Correcting Errors
- Fixing incorrect or inconsistent data
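For illustration, a minimal pandas sketch of both cleaning steps; the DataFrame and its 'country' column are hypothetical:

import pandas as pd

# Hypothetical raw data with a duplicated row and an inconsistent spelling
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "country": ["India", "india", "india", "Inda"],
})

# Removing duplicates: keep only the first occurrence of each identical row
df = df.drop_duplicates()

# Correcting errors: normalize case and fix a known misspelling
df["country"] = df["country"].str.title().replace({"Inda": "India"})
print(df)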
Processing Raw Data - Data Transformation
Normalization
- Scaling data to a standard range
Aggregation
- Summarizing data at different levels
Encoding
- Converting categorical data to numerical format
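A small pandas sketch tying the three transformation steps together; the sales data and column names are hypothetical:

import pandas as pd

# Hypothetical sales records
df = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "sales": [120.0, 80.0, 200.0, 150.0],
})

# Normalization: scale 'sales' to the range [0, 1]
df["sales_norm"] = (df["sales"] - df["sales"].min()) / (df["sales"].max() - df["sales"].min())

# Aggregation: summarize sales at the region level
region_totals = df.groupby("region")["sales"].sum()

# Encoding: convert the categorical 'region' column to numeric indicator columns
df = pd.get_dummies(df, columns=["region"])
print(df, region_totals, sep="\n")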
Processing Raw Data - Data Integration
Data Warehousing
- Centralized storage for integrated data
Structured Data
- Databases, spreadsheets
Semi-Structured Data
- XML, JSON
Unstructured Data
- Text, images, videos
Handling Data from Multiple Sources - Data Integration Techniques
Batch Processing
- Integrating data at scheduled intervals
Real-Time Processing
- Continuous integration as data is generated
Data Virtualization
- Accessing data without moving it
Handling Data from Multiple Sources - Data Consistency and Quality
Data Consistency
- Ensuring data remains consistent across sources
Data Quality
- Accuracy, completeness, reliability
Data Governance
- Policies and procedures for managing data
Tools and Technologies - Data Integration Tools
Apache NiFi
- Data routing and transformation
Talend
- Data integration platform
Informatica
- Enterprise data integration
Tools and Technologies - Data Cleaning Tools
OpenRefine
- Tool for cleaning messy data
Trifacta
- Data preparation platform
DataWrangler
- Interactive tool for data cleaning
Tools and Technologies - Data Transformation Tools
Apache Spark
- Unified analytics engine
Alteryx
- Data blending and advanced analytics
What is Raw Data?
Foundation for all data analysis: Provides the building blocks for insights.
Enables transparency and reproducibility: Allows others to verify and replicate findings.
Potential for uncovering hidden patterns: Raw data may contain nuances lost in processing.
Implications of Raw Data: Volume and Variety
Volume: The sheer amount of raw data can be overwhelming (Big Data).
Variety: Raw data comes in diverse formats (text, images, sensor data).
Storing, managing, and processing large, varied datasets raises significant challenges.
Quality: Raw data may contain errors, inconsistencies, and missing values.
Data Preprocessing: Cleaning, transforming, and organizing raw data for analysis.
Privacy: Protecting individual privacy when working with raw data, especially personal information.
Bias: Potential biases introduced during data collection or processing can skew results.
Ethical data collection, handling, and analysis practices are therefore essential.
The Future of Raw Data
Increased focus on real-time analysis of raw data streams (e.g., Internet of Things).
Conclusion
Raw data holds immense potential, but requires careful handling and processing.
By understanding its implications, we can unlock valuable insights and make informed decisions.
Call to action: manage and use raw data responsibly for a better future.
Handling Data from Multiple Sources
Introduction
Structured Data
- Databases, spreadsheets
Semi-Structured Data
- XML, JSON
Unstructured Data
- Text, images, videos
Challenges of Integrating Data from Multiple Sources
Data Variety
- Different formats and structures
Data Volume
- Large amounts of data to process
Data Security
- Protecting data during integration
Data Integration Techniques - Batch Processing
Definition and Process
- Integrating data at scheduled intervals
Use Cases
- Periodic data consolidation
Data Integration Techniques - Real-Time Processing
Definition and Process
- Continuous integration as data is generated
Use Cases
- Time-sensitive applications
Data Integration Techniques - Data Virtualization
Definition and Process
- Accessing data without moving it
Use Cases
- Federated data access
Data Integration Techniques - ETL (Extract, Transform, Load)
Definition and Process
- Standard method for integrating data
Use Cases
- Data warehousing
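A toy ETL run in Python under assumed inputs: a hypothetical orders.csv export and a local SQLite file standing in for the warehouse:

import sqlite3
import pandas as pd

# Extract: read raw records from a source system (hypothetical CSV export)
orders = pd.read_csv("orders.csv")  # columns assumed: order_id, amount, order_date

# Transform: clean and reshape for the warehouse schema
orders = orders.dropna(subset=["amount"])
orders["order_date"] = pd.to_datetime(orders["order_date"])
daily = orders.groupby(orders["order_date"].dt.date)["amount"].sum().reset_index(name="daily_total")

# Load: write the transformed table into the warehouse (SQLite stands in here)
with sqlite3.connect("warehouse.db") as conn:
    daily.to_sql("daily_sales", conn, if_exists="replace", index=False)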
Tools and Technologies - Data Integration Tools
Apache NiFi
- Data routing and transformation
Talend
- Data integration platform
Informatica
- Enterprise data integration
Tools and Technologies - Data Cleaning Tools
OpenRefine
- Tool for cleaning messy data
Trifacta
- Data preparation platform
DataWrangler
- Interactive tool for data cleaning
Tools and Technologies - Data Transformation Tools
Apache Spark
- Unified analytics engine
Alteryx
- Data blending and advanced analytics
Best Practices for Data Integration
- Maintain data consistency, quality, and governance across all sources
Handling Missing Values
Techniques:
1. Mean/Median/Mode Imputation
2. Forward/Backward Fill
Example:
Consider a dataset with missing values in the 'Age' column. Fill missing values using the mean of the column.
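A minimal pandas version of this example (the data values are made up):

import pandas as pd

# Hypothetical dataset with missing ages
df = pd.DataFrame({"Age": [25, None, 31, None, 42]})

# Mean imputation: fill missing 'Age' values with the column mean
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Forward fill is an alternative when rows are ordered (e.g., a time series)
# df["Age"] = df["Age"].ffill()
print(df)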
Normalization and Standardization
Definitions:
Normalization: Scaling data to a range of [0, 1].
Standardization: Scaling data to have a mean of 0 and a standard deviation of 1.
Techniques:
1. Min-Max Scaling
2. Z-score Standardization
Example:
Consider normalizing the 'Income' column to the range [0, 1].
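A pandas sketch of both scalings applied to a hypothetical 'Income' column:

import pandas as pd

df = pd.DataFrame({"Income": [30_000, 45_000, 60_000, 120_000]})

# Min-Max scaling: map 'Income' onto [0, 1]
df["Income_minmax"] = (df["Income"] - df["Income"].min()) / (df["Income"].max() - df["Income"].min())

# Z-score standardization: zero mean, unit standard deviation
df["Income_z"] = (df["Income"] - df["Income"].mean()) / df["Income"].std()
print(df)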
Binning
Purpose of Binning: Grouping numeric data into bins to reduce noise and handle outliers.
Techniques:
1. Fixed-width Binning
2. Quantile Binning
Example:
Bin the 'Age' column into intervals of 10 years each.
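A pandas sketch of fixed-width binning for this example; the ages and bin edges are assumed:

import pandas as pd

df = pd.DataFrame({"Age": [23, 37, 45, 52, 68]})

# Fixed-width binning: 10-year intervals covering the observed ages
df["Age_bin"] = pd.cut(df["Age"], bins=range(20, 81, 10), right=False)
print(df)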
Transformation Techniques
Transformations:
1. Log Transformation: Applying log function to skewed data.
2. Box-Cox Transformation: Stabilizing variance and making the data more normal distribution-like.
Example:
Apply log transformation to the 'Income' column.
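A sketch of the log transformation with NumPy; the Box-Cox alternative from SciPy is shown as a comment (both libraries are assumed choices, and the income values are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({"Income": [30_000, 45_000, 60_000, 1_200_000]})  # right-skewed

# Log transformation: compress the long right tail (log1p handles zeros safely)
df["Income_log"] = np.log1p(df["Income"])

# Box-Cox is an alternative when all values are strictly positive:
# from scipy.stats import boxcox
# df["Income_boxcox"], _ = boxcox(df["Income"])
print(df)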
Non-Numeric Data Preprocessing
Techniques:
1. Imputation with Most Frequent Value
2. Removing Rows/Columns
Example:
Fill missing values in the 'City' column with the most frequent city name.
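A pandas sketch of most-frequent-value imputation for this example (the city names are made up):

import pandas as pd

df = pd.DataFrame({"City": ["Bengaluru", None, "Mumbai", "Bengaluru", None]})

# Impute missing categories with the most frequent value (the mode)
most_frequent = df["City"].mode()[0]
df["City"] = df["City"].fillna(most_frequent)

# Alternatively, drop rows that still contain missing values:
# df = df.dropna(subset=["City"])
print(df)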
Encoding Categorical Data
Techniques:
1. One-Hot Encoding: Creating binary columns for each category.
2. Label Encoding: Converting categories to numerical labels.
Example:
One-hot encode the 'Gender' column.
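A pandas sketch showing both encodings on a hypothetical 'Gender' column:

import pandas as pd

df = pd.DataFrame({"Gender": ["Female", "Male", "Female"]})

# Label encoding: map each category to an integer code
df["Gender_code"] = df["Gender"].astype("category").cat.codes

# One-hot encoding: one binary indicator column per category
df = pd.get_dummies(df, columns=["Gender"])
print(df)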
Text Data Preprocessing
Techniques:
1. Tokenization: Splitting text into words or tokens.
2. Stemming: Reducing words to their root form.
3. Lemmatization: Reducing words to their base form.
Example:
Tokenize and lemmatize the 'Review' column.
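A sketch using NLTK, which is one common choice (the slide does not name a library); the review text is made up:

import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads of the tokenizer models and WordNet data
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
review = "The batteries were dying far too quickly"  # one value from a hypothetical 'Review' column

# Tokenization: split the review into individual word tokens
tokens = nltk.word_tokenize(review.lower())

# Lemmatization: reduce each token to its base form ("batteries" -> "battery")
lemmas = [lemmatizer.lemmatize(tok) for tok in tokens]
print(lemmas)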
Feature Extraction
Techniques:
1. Bag of Words: Creating a matrix of word counts.
2. TF-IDF: Creating a matrix of term frequency-inverse document frequency.
Example:
Apply TF-IDF to the 'Review' column.
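A scikit-learn sketch (an assumed library choice) with a few made-up review strings standing in for the 'Review' column:

from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "great phone with a great camera",
    "battery life could be better",
    "camera quality is great",
]

# TF-IDF: weight each term by how frequent it is within a review
# and how rare it is across the whole collection
# (CountVectorizer would give the plain bag-of-words counts instead)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))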
Tools and Libraries for Data Preprocessing
- Data preprocessing commonly relies on the cleaning and transformation tools introduced above (OpenRefine, Trifacta, DataWrangler, Apache Spark, Alteryx)
Numeric Data Preprocessing
An overview of techniques and methods
Introduction
Data preprocessing is a crucial step in data analysis and machine learning. It involves transforming raw data into a clean and usable format. This presentation covers the main techniques for preprocessing numeric data.
Handling Missing Values
Missing values can be imputed (with the mean, median, or most frequent value, or by forward/backward fill), or the affected rows and columns can be removed.
Normalization and Standardization
Normalization scales numeric data to a standard range, typically [0, 1]; standardization rescales it to zero mean and unit standard deviation. Common methods include:
1. Min-Max Scaling
- Formula: (X - min(X)) / (max(X) - min(X))
2. Z-Score Standardization
- Formula: (X - mean(X)) / std(X)
Handling Outliers
Outliers can distort statistical analyses and models. Techniques to handle outliers include:
1. Removing outliers
- Identify and drop outlier values.
2. Transforming data
- Use log, square root, or other transformations to reduce impact.
3. Imputation
- Replace outliers with mean, median, or a constant value.
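A pandas sketch of the removal and imputation options using the common IQR rule (the income values are made up):

import pandas as pd

df = pd.DataFrame({"Income": [32_000, 41_000, 38_000, 45_000, 900_000]})

# Flag outliers with the interquartile-range (IQR) rule
q1, q3 = df["Income"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: remove rows outside the fences
trimmed = df[df["Income"].between(lower, upper)]

# Option 2: replace outliers with the median instead of dropping them
median = df["Income"].median()
df["Income_imputed"] = df["Income"].where(df["Income"].between(lower, upper), median)
print(trimmed, df, sep="\n")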
Binning
Binning involves grouping numeric values into bins or categories. This can help reduce the
impact of minor observation errors.
1. Equal-width binning
- Divide the range of data into intervals of equal size.
2. Equal-frequency binning
- Divide the data into intervals that contain approximately the same number of observations.
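A pandas sketch contrasting the two binning approaches on made-up values:

import pandas as pd

scores = pd.Series([12, 18, 25, 31, 44, 47, 52, 69, 73, 88])  # hypothetical numeric column

# Equal-width binning: 4 bins of equal size across the value range
equal_width = pd.cut(scores, bins=4)

# Equal-frequency binning: 4 bins with roughly the same number of observations each
equal_freq = pd.qcut(scores, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())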
Feature Engineering
Feature engineering involves creating new features from existing data to improve model
performance.
1. Polynomial features
- Generate polynomial terms (e.g., X^2, X^3) from existing features.
2. Interaction features
- Create features that represent interactions between existing features (e.g., X1 * X2).
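A scikit-learn sketch (an assumed choice) generating both kinds of features from two hypothetical columns X1 and X2:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two hypothetical features X1 and X2
X = np.array([[2.0, 3.0],
              [1.0, 5.0],
              [4.0, 0.5]])

# degree=2 generates X1, X2, X1^2, X1*X2, X2^2
# (interaction_only=True would keep only the X1*X2 interaction term)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out(["X1", "X2"]))
print(X_poly)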
Conclusion
Numeric data preprocessing is essential for improving the quality of data and the
performance of machine learning models. Techniques such as handling missing values,
normalization, handling outliers, binning, and feature engineering are crucial steps in this
process. Properly preprocessed data leads to more accurate and reliable analyses.