Module 1 ML Chapter 2
CHAPTER 2
UNDERSTANDING DATA – 1
Contents
• Introduction.
• Big Data Analysis Framework.
• Descriptive Statistics.
• Univariate Data Analysis and Visualization.
What is data?
• Data are facts
• Facts can be in the form of numbers, audio, video, and images
• Data must be analyzed to support decision-making
• Organizations store vast amounts of data (GB, TB, PB, EB).
• Data can be human-interpretable or computer-readable.
• Operational and Non-Operational Data
• Operational Data: Data encountered in normal day-to-day business procedures and processes.
• Non-Operational Data: Used for decision-making.
• Processed data is meaningful and used for analysis.
Elements of Big Data
• Big data is characterized by:
• Volume: Large amounts of data (PB, EB).
• Velocity: Fast data arrival speeds.
• Variety: Different forms, functions, and sources of data.
• Veracity: Truthfulness and accuracy of data.
• Validity: Correctness for decision-making.
• Value: Importance of extracted insights for business decisions.
Types of Data
• Structured Data
• Stored in an organized manner (e.g., databases, SQL tables).
• Types include:
• Record Data: Organized as tables with rows and columns.
• Data Matrix: Record data with only numeric attributes, so each object can be viewed as a point in multidimensional space.
• Graph Data: Represents relationships between objects (e.g., web pages and hyperlinks).
• Ordered Data: Data whose attributes or records have an inherent ordering (e.g., temporal or sequence data).
• Unstructured Data
• Includes images, video, audio, blogs, and textual documents.
• It is estimated that about 80% of all data is unstructured.
• Semi-Structured Data
• Combines elements of structured and unstructured data.
• Examples: XML, JSON, RSS feeds, hierarchical data.
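To make the semi-structured idea concrete, here is a minimal Python sketch (the record contents are made up, not from the source) that parses a JSON document whose fields need not follow a rigid table schema:

```python
import json

# A hypothetical semi-structured record: nested and optional fields are allowed,
# unlike a fixed-schema SQL table.
raw = '{"id": 1, "name": "Alice", "contacts": {"email": "alice@example.com"}, "tags": ["ml", "data"]}'

record = json.loads(raw)              # parse the JSON text into Python objects
print(record["name"])                 # -> Alice
print(record.get("phone", "absent"))  # missing fields are simply absent -> absent
```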
Data Storage and Representation
• Data stored in structures for analysis.
• Types:
• Flat Files
• CSV (Comma-Separated Values)
• Values are separated by commas (","). Used in spreadsheets, databases, and data analysis tools.
• TSV (Tab-Separated Values)
• Values are separated by tabs (\t) instead of commas. Also used in spreadsheets, databases, and data exchange between applications (a pandas sketch for reading both formats follows).
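A minimal pandas sketch for loading both flat-file formats (the file names data.csv and data.tsv are hypothetical):

```python
import pandas as pd

# CSV: values separated by commas (hypothetical file name)
df_csv = pd.read_csv("data.csv")

# TSV: values separated by tabs; the separator is passed explicitly
df_tsv = pd.read_csv("data.tsv", sep="\t")

print(df_csv.head())   # inspect the first few rows
print(df_tsv.shape)    # (number of rows, number of columns)
```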
Data Storage and Representation
• A DBMS (Database Management System) manages data efficiently.
• Types of databases:
• Transactional Database
• Time-Series Database
• Spatial Database
• World Wide Web (WWW)
• XML (eXtensible Markup Language)
• Data Stream
• RSS (Really Simple Syndication)
• JSON (JavaScript Object Notation)
Big Data Analytics and Types of Analytics
• Big data analytics helps businesses make decisions by analyzing data.
• It generates useful information and insights.
• Data analytics covers data collection, preprocessing, and analysis.
• It deals with the complete cycle of data management.
• Types of Data Analytics
1. Descriptive Analytics
2. Diagnostic Analytics
3. Predictive Analytics
4. Prescriptive Analytics
Types of Analytics
• Descriptive Analytics
• Describes the main features of the data.
• Deals with collected data and quantifies it.
• Focuses on descriptive statistics rather than inference.
• Diagnostic Analytics
• Answers the question: 'Why did something happen?'
• Finds cause-and-effect relationships in data.
• Example: If a product is not selling, diagnostic analytics identifies reasons.
Types of Analytics
• Predictive Analytics
• Answers the question: 'What will happen in the future?'
• Uses algorithms to predict future trends.
• Machine learning heavily relies on predictive analytics.
• Prescriptive Analytics
• Recommends the best course of action.
• Goes beyond prediction and aids decision-making.
• Helps organizations plan for the future and mitigate risks.
Big Data Analysis Framework
• Big data frameworks use a layered architecture for flexibility and scalability.
• This architecture simplifies data processing and management.
• The framework consists of four primary layers:
1. Data Connection Layer
2. Data Management Layer
3. Data Analytics Layer
4. Presentation Layer
Big Data Analysis Framework
• Data Connection Layer
• Ingests raw data into appropriate structures.
• Supports Extract, Transform, and Load (ETL) operations.
• Connects data from various sources for analysis.
• Data Management Layer
• Preprocesses data for analysis.
• Executes read, write, and management tasks.
• Enables parallel query execution and data warehousing.
Big Data Analysis Framework
• Data Analytics Layer
• Performs statistical tests and machine learning model construction.
• Supports various analytical functions for insights.
• Validates models to ensure data integrity.
• Presentation Layer
• Displays results through dashboards and reports.
• Provides insights using machine learning models.
• Facilitates interpretation and visualization for better decision-making.
Types of Processing
• Cloud Computing
• Cloud computing provides shared resources over the internet.
• Services include:
• SaaS (Software as a Service) – Allows users to access software applications over the internet without needing to install them on their devices. Example: Google Docs, Microsoft 365.
• PaaS (Platform as a Service) – Provides a platform for developers to build, test, and deploy applications. Example: Google App Engine, Microsoft Azure.
• IaaS (Infrastructure as a Service) – Offers virtualized computing resources like servers, storage, and networking. Example: Amazon Web Services (AWS), Google Cloud Platform.
Types of Processing
• Cloud Service Deployment Models
• Public Cloud – Managed by third-party providers and accessible to the general public. Example: Google Cloud, AWS.
• Private Cloud – Used exclusively by a single organization, providing greater security and control.
• Community Cloud – Shared infrastructure owned and used by multiple organizations with common concerns (e.g., government institutions).
• Hybrid Cloud – A combination of two or more cloud models to balance security, performance, and cost.
Types of Processing
• Characteristics of Cloud Computing
• Shared Infrastructure – Computing resources are shared across multiple users.
• Dynamic Provisioning – Resources are allocated based on demand.
• Dynamic Scaling – Services can expand or shrink according to user needs.
• Network Access – Cloud resources are accessed over the internet.
• Utility-Based Metering – Users are charged based on resource consumption.
• Multitenancy – Multiple users share cloud resources securely.
• Reliability – Ensures continuous and reliable services.
Types of Processing
• Grid Computing:
• Uses distributed networks for complex tasks.
• Connects multiple computers to act as a single supercomputer.
• Distributes tasks across nodes for parallel processing.
• Ideal for high-performance, large-scale applications.
• HPC (High-Performance Computing):
• Aggregates resources to solve complex problems quickly.
• Utilizes parallel processing across compute, network, and storage components.
• Enhances performance for scientific and engineering tasks.
Data Collection
• Good Data Characteristics
• Timeliness: Relevant and up-to-date.
• Relevancy: Relevant to, and ready for, the machine learning task at hand.
• Knowledge: Understandable and interpretable.
• Data Source Types:
• 1. Open/Public Data (e.g., digital libraries, healthcare databases)
• 2. Social Media Data (e.g., Twitter, YouTube)
• 3. Multimodal Data (e.g., text, audio, video)
Data preprocessing
• In the real world, data is often 'dirty'. Dirty data includes:
• Incomplete data: Missing values in the dataset.
• Outlier data: Values that deviate markedly from the rest of the data.
• Data with inconsistent values: Contradictory or logically incorrect data entries.
• Inaccurate data: Errors in the recorded data.
• Data with missing values: Attributes or records with missing information.
• Duplicate data: Repeated entries that can skew analysis.
• Data preprocessing improves the quality of the data and, in turn, the results of data mining techniques. Raw data must be preprocessed to produce accurate results. This process involves data cleaning and wrangling to make the data usable for machine learning.
Data preprocessing
• Examples of Bad Data
• Consider the following examples of bad data:
• Missing Salary values
• Age recorded as '5' but Date of Birth indicates otherwise
• Age of '136', likely a typographical error
• Negative salary values, e.g., '-1500'
• Data Cleaning Process involves (a pandas sketch follows the list):
• Identifying and correcting errors
• Removing duplicate or irrelevant data
• Filling in missing values
• Correcting inconsistent data formats
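A hedged pandas sketch of these cleaning steps on a small made-up table (the column names Age and Salary and all values are illustrative, not from the source):

```python
import numpy as np
import pandas as pd

# Made-up records containing the kinds of bad data listed above
df = pd.DataFrame({
    "Age":    [25.0, 136.0, 5.0, 25.0],              # 136 is likely a typo
    "Salary": [50000.0, 60000.0, -1500.0, 50000.0],  # -1500 is an invalid negative salary
})

df = df.drop_duplicates()                           # remove repeated entries
df.loc[df["Salary"] < 0, "Salary"] = np.nan         # mark invalid salaries as missing
df.loc[~df["Age"].between(0, 120), "Age"] = np.nan  # mark implausible ages as missing
print(df)
```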
Missing Data Analysis
• The primary data cleaning process is missing data analysis.
• Data cleaning routines attempt to fill in missing values, smooth out noise, identify outliers, and correct data inconsistencies.
• This helps data mining models avoid overfitting to noisy data.
• Methods for Handling Missing Data
• Ignore the tuple
• Fill in values manually
• Use a global constant
• Attribute value substitution
• Class mean
• Predicted value
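A minimal pandas sketch of a few of these options (the Class and Salary columns and their values are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Class":  ["A", "A", "B", "B"],
    "Salary": [50000.0, np.nan, 40000.0, np.nan],
})

dropped  = df.dropna()                               # ignore the tuple
constant = df.fillna({"Salary": 0})                  # use a global constant
by_mean  = df["Salary"].fillna(df["Salary"].mean())  # attribute mean substitution
by_class = df.groupby("Class")["Salary"].transform(lambda s: s.fillna(s.mean()))  # class mean

print(by_class)
```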
Missing Data Analysis
• Ignore the tuple:
• Ignore records with missing data, especially class labels.
• Effective only when missing data is minimal.
• Smoothing noisy data by binning (a sketch follows below):
• Smoothing by bin means: {15, 15, 15}, {24, 24, 24}, {30.3, 30.3, 30.3}
• Smoothing by bin boundaries: {12, 12, 19}, {22, 22, 26}, {28, 28, 34}
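A hedged sketch of binning-based smoothing. The original values are not given on the slide, so the input list below is an assumption, chosen so that the output reproduces the bins shown above:

```python
# Equal-frequency binning with smoothing by bin means and by bin boundaries.
# The input values are assumed; only the smoothed bins appear on the slide.
data = [12, 14, 19, 22, 24, 26, 28, 29, 34]
bins = [data[i:i + 3] for i in range(0, len(data), 3)]

# Smoothing by means: every value in a bin is replaced by the bin mean
by_means = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]

# Smoothing by boundaries: each value is replaced by the nearer bin boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)   # [[15.0, 15.0, 15.0], [24.0, 24.0, 24.0], [30.3, 30.3, 30.3]]
print(by_bounds)  # [[12, 12, 19], [22, 22, 26], [28, 28, 34]]
```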
Data Integration and Data transformation
• Data integration merges data from multiple sources into a single source, which
may lead to redundant data.
• Detect and remove redundancies arising from data integration.
• These operations (like normalization) enhance data mining algorithm performance
by transforming data into a processable format.
• Normalization:
• A preliminary stage of data conditioning.
• Scales attribute values to a range (e.g., 0 to 1) for better algorithm performance.
• Commonly used in neural networks.
• Normalization Procedures:
• Min-Max
• z-Score
Data Normalization
• Min-Max normalization
• Transforms data to the range 0 to 1
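A hedged sketch of both normalization procedures on made-up attribute values; Min-Max uses v' = (v - min) / (max - min):

```python
import numpy as np

x = np.array([20.0, 30.0, 50.0, 100.0])   # made-up attribute values

# Min-Max: v' = (v - min) / (max - min), mapping the attribute into [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# z-score: v' = (v - mean) / std, centring the attribute at 0 with unit spread
z_score = (x - x.mean()) / x.std()

print(min_max)  # [0.    0.125 0.375 1.   ]
print(z_score)
```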