DSV-S7 Data Collection and Data Pre Processing Overview
Session - 07
AIM OF THE SESSION
The primary aim of data analytics and visualization is to transform raw data into meaningful insights.
INSTRUCTIONAL OBJECTIVES
LEARNING OUTCOMES
At the end of this session, you should be able to:
1. Understand concepts related to data analytics, such as data types, variables, and data structures.
2. Understand different types of data (e.g., structured, unstructured) and their significance in analytics.
3. Analyse data and draw meaningful insights.
4. Understand probability theory and its application in data analysis.
Data Collection Strategies
The quality and relevance of the collected data significantly impact the
insights and decisions derived from the analysis.
Effective data collection and visualization strategies are essential for extracting
valuable insights and empowering data-driven decision-making.
Data Collection Strategies in the Context of Data Analytics and Visualization
3) Data Quality Assessment:
Assess the quality of available data. Check for completeness, accuracy, consistency,
and relevance.
Cleaning and pre-processing may be necessary to address any issues.
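The checks named in this item can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed method: the `records` list, its field names, and the valid age range of 0 to 120 are all hypothetical.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "city": "Pune"},
    {"id": 2, "age": None, "city": "Delhi"},
    {"id": 2, "age": None, "city": "Delhi"},  # exact duplicate of the row above
    {"id": 3, "age": -5, "city": None},       # implausible age, missing city
]

def completeness(rows, field):
    """Fraction of rows in which `field` is present and not None."""
    filled = sum(1 for r in rows if r.get(field) is not None)
    return filled / len(rows) if rows else 0.0

def duplicate_count(rows):
    """Number of rows that exactly repeat an earlier row (a consistency check)."""
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    return sum(n - 1 for n in counts.values())

def out_of_range(rows, field, low, high):
    """Rows whose `field` is present but falls outside [low, high] (an accuracy check)."""
    return [r for r in rows
            if r.get(field) is not None and not (low <= r[field] <= high)]

print("age completeness:", completeness(records, "age"))
print("duplicate rows:", duplicate_count(records))
print("implausible ages:", out_of_range(records, "age", 0, 120))
```

A report like this only flags problems. Deciding whether to correct, impute, or drop the flagged rows belongs to the cleaning step.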
6) Data Privacy and Ethics:
Ensure compliance with data privacy regulations.
Obtain necessary permissions for data collection, especially when dealing with personal or
sensitive information.
7) Sampling Techniques:
Use sampling methods if working with large datasets.
This involves selecting a representative subset of data for analysis, which can save time and
resources.
8) Surveys and Questionnaires:
Design and deploy surveys or questionnaires to gather specific information directly from users
or relevant stakeholders.
Ensure that the questions align with your objectives.
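The sampling idea in item 7 can be illustrated with a short sketch. It shows simple random sampling and, as one possible way to keep the subset representative, stratified sampling. The `population` data, the `segment` field, and the 10% fraction are hypothetical.

```python
import random
from collections import defaultdict

def simple_random_sample(rows, k, seed=0):
    """Pick k rows uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, k)

def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every group defined by `key`,
    so that small groups are not lost by chance."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))  # at least one row per group
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 90 rows in segment A, 10 rows in segment B.
population = [{"id": i, "segment": "A" if i < 90 else "B"} for i in range(100)]

subset = stratified_sample(population, "segment", 0.1)
print(len(subset), "rows sampled")
```

With a 10% fraction, the stratified sample keeps the 9:1 ratio between the two segments, which a simple random sample of 10 rows does not guarantee.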
Data Security Issues
Data security is a critical concern in the field of data analysis and visualization. As organizations
collect and analyze large volumes of data to gain insights and make informed decisions, they also
face significant challenges related to the security and privacy of this data.
1. Data Breaches
2. Data Privacy
3. Data Access Control
4. Data Encryption
5. Data Masking and Redaction
6. Data Integrity
7. Secure Data Sharing
8. Compliance with Regulations
9. Awareness and Training
10. Data Lifecycle Management
1) Data Breaches:
One of the most significant concerns is the potential for data breaches.
If unauthorized individuals gain access to sensitive data, it can lead to financial losses,
reputational damage, and legal consequences for organizations.
2) Data Privacy:
Protecting the privacy of individuals is crucial, especially when dealing with personally
identifiable information (PII).
Analyzing and visualizing data while preserving privacy is a complex task.
Techniques such as anonymization and differential privacy are used to mitigate
these concerns.
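The two techniques named above can be sketched roughly as follows. This is an illustration under stated assumptions, not a production design. The first function replaces an identifier with a keyed hash, which is strictly pseudonymization, a weaker guarantee than full anonymization. The second adds Laplace noise to a count, the basic mechanism behind differential privacy. The secret key, the e-mail addresses, and the epsilon values are hypothetical.

```python
import hashlib
import hmac
import random

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key, kept outside the dataset

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier (e.g. an e-mail address) with a keyed hash.
    The same input always maps to the same token, so joins still work."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

def noisy_count(true_count, epsilon, seed=None):
    """Laplace mechanism for a counting query (sensitivity 1).
    The difference of two exponential draws with rate epsilon is
    Laplace-distributed with scale 1/epsilon."""
    rng = random.Random(seed)
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise

print(pseudonymize("asha@example.com")[:16], "...")
print(noisy_count(100, epsilon=0.5, seed=1))
```

A smaller epsilon adds more noise and gives stronger privacy, at the cost of less accurate counts.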
5) Data Integrity:
Ensuring the integrity of data is essential.
Data should not be tampered with during analysis or visualization processes.
Implementing checksums and digital signatures can help detect unauthorized changes
to data.
6) Secure Data Sharing:
Organizations often need to share data with external partners or third-party vendors.
Secure data sharing mechanisms, such as secure FTP, secure APIs, or blockchain
technology, can help ensure the safe transfer of data.
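The checksum idea from item 5 can be sketched with Python's standard `hashlib` module. This is a minimal example with hypothetical data. It covers checksums only; a digital signature would need an asymmetric-cryptography library and is not shown.

```python
import hashlib
import hmac

def sha256_checksum(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()

def is_unchanged(data: bytes, expected_checksum: str) -> bool:
    """Compare the current checksum with the one recorded earlier.
    compare_digest avoids leaking information through timing."""
    return hmac.compare_digest(sha256_checksum(data), expected_checksum)

# Hypothetical dataset, checksummed when it was first stored.
original = b"region,sales\nnorth,120\nsouth,95\n"
recorded = sha256_checksum(original)

# Someone later alters one value.
tampered = original.replace(b"95", b"950")

print("original intact:", is_unchanged(original, recorded))
print("tampered intact:", is_unchanged(tampered, recorded))
```

A plain checksum only helps if the recorded value is stored where the data cannot be modified along with it. Otherwise, whoever changes the data can simply recompute the checksum.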
Data Pre-Processing Overview in DAV
1) Data Cleaning:
• Data cleaning is the process of detecting and correcting (or removing) corrupt
or inaccurate records in a record set or database table.
• The cleaning step identifies incomplete, inaccurate, inconsistent, and
irrelevant data, and then modifies or deletes it.
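Two common cleaning operations can be sketched as follows: removing exact duplicate rows and filling missing values with the column median. The `rows` data and the `income` field are hypothetical, and median imputation is only one of several possible strategies.

```python
from statistics import median

# Hypothetical records; None marks a missing value.
rows = [
    {"id": 1, "income": 42000.0},
    {"id": 2, "income": None},
    {"id": 3, "income": 58000.0},
    {"id": 3, "income": 58000.0},  # exact duplicate
    {"id": 4, "income": 50000.0},
]

def drop_duplicates(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def fill_missing_with_median(rows, field):
    """Return new rows in which missing `field` values are replaced
    by the median of the observed values."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = median(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

cleaned = fill_missing_with_median(drop_duplicates(rows), "income")
for row in cleaned:
    print(row)
```

Duplicates are removed before imputation so that a repeated row does not skew the median.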
2) Data Integration:
• Data integration unifies data residing in different sources and presents a
single, unified view of that data.
• Data with different representations are brought together, and any resulting
conflicts are resolved.
• This process is vital in many scientific and commercial applications, and it
becomes more significant as data volumes grow.
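As a rough sketch of integration, the example below merges two hypothetical sources that describe the same customers in different ways: one uses integer IDs, the other uses string IDs and stores amounts in paise. All source names, field names, and values are invented for illustration.

```python
# Hypothetical source 1: a CRM export with integer customer IDs.
crm = [
    {"customer_id": 1, "name": "Asha"},
    {"customer_id": 2, "name": "Ravi"},
]

# Hypothetical source 2: a billing export with string IDs and amounts in paise.
billing = [
    {"cust": "1", "amount_paise": 125000},
    {"cust": "3", "amount_paise": 9900},
]

def integrate(crm_rows, billing_rows):
    """Build one unified view keyed by customer ID.
    Representation conflicts are resolved here: IDs become integers and
    amounts become rupees. Customers missing from a source get None."""
    unified = {
        r["customer_id"]: {"customer_id": r["customer_id"],
                           "name": r["name"],
                           "amount_rupees": None}
        for r in crm_rows
    }
    for r in billing_rows:
        cid = int(r["cust"])
        entry = unified.setdefault(
            cid, {"customer_id": cid, "name": None, "amount_rupees": None})
        entry["amount_rupees"] = r["amount_paise"] / 100
    return [unified[k] for k in sorted(unified)]

for row in integrate(crm, billing):
    print(row)
```

Keeping customers that appear in only one source (an outer join) is a design choice. An analysis that needs complete records might keep only the matches instead.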
3) Data Transformation:
• Data transformation converts unprocessed data into a form that is easier to
understand and analyse.
• It consists of data normalization, aggregation, and generalization.
• Data normalization, in the database sense, arranges the columns and tables of a
database so that redundancy is minimal, which cuts processing time and
complexity. In pre-processing, the term is also commonly used for rescaling
numeric values to a common range.
• Data aggregation creates a brief summary of the data for a faster overview.
• Data generalization, also known as rolling up data, replaces detailed values with
higher-level ones and creates successive layers of summary in an evaluation database.
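The sketch below illustrates two of these transformations on hypothetical sales data. Normalization is taken here in the pre-processing sense of rescaling numeric values to the range 0 to 1 (min-max scaling), which is an assumption; the database-design sense is not shown. Aggregation is shown as a roll-up from city level to state level.

```python
from collections import defaultdict

def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]

def aggregate_sum(rows, group_key, value_key):
    """Sum `value_key` within each group defined by `group_key`."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[group_key]] += r[value_key]
    return dict(totals)

# Hypothetical city-level sales records.
sales = [
    {"city": "Pune", "state": "MH", "amount": 100.0},
    {"city": "Mumbai", "state": "MH", "amount": 300.0},
    {"city": "Chennai", "state": "TN", "amount": 200.0},
]

print(min_max_normalize([r["amount"] for r in sales]))
print(aggregate_sum(sales, "state", "amount"))  # roll up from city to state
```

Summing by `state` instead of `city` is a small instance of rolling up: the same data is summarized at a coarser level.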
4) Data Reduction:
• Data reduction is the process of transforming digital information into an
ordered and simplified form.
• The data is generally derived through empirical and experimental means.
• It involves reducing large amounts of data into smaller, meaningful parts.
5) Data Discretization:
• Data discretization is useful when you have a large amount of numeric data
but want to classify it using nominal values.
• The continuous data is split into discrete intervals, and each interval is given
a nominal value (label). In other words, continuous attributes are converted
into a finite set of intervals with minimal loss of information.
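One simple way to discretize is equal-width binning, sketched below. The `ages` list, the choice of three bins, and the labels are hypothetical. Other schemes, such as equal-frequency binning, are equally valid.

```python
def equal_width_bins(values, n_bins):
    """Assign each value the index of its interval when the range
    [min, max] is split into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]  # all values identical
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def label_bins(indices, labels):
    """Map bin indices to nominal labels."""
    return [labels[i] for i in indices]

# Hypothetical continuous attribute.
ages = [18, 22, 35, 47, 59, 63, 78]

bins = equal_width_bins(ages, 3)  # intervals of width 20: 18-38, 38-58, 58-78
print(bins)
print(label_bins(bins, ["young", "middle", "senior"]))
```

Fewer bins give a simpler nominal attribute but lose more detail, so the number of bins is a trade-off.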
Summary
Data collection and preprocessing are critical steps in the data analysis process,
laying the foundation for accurate and reliable insights derived from the data.
These steps involve identifying, acquiring, cleaning, and transforming data to
make it suitable for analysis and modelling. Properly executed data collection and
preprocessing enhance the quality and effectiveness of downstream analyses and
machine learning tasks.
SELF-ASSESSMENT QUESTIONS
1. Have I split the dataset into training and testing sets to assess model generalization?
2. Did I consider stratified sampling, especially for imbalanced datasets?
3. Why is transactional data important for businesses, and how can it be used for analysis?
4. What measures would you take to ensure the security and privacy of transactional data?
Summary
a) Data Collection: The primary purpose of data collection is to gather raw data for analysis
and decision-making in data science and analytics.
b) Data Sources: Structured data, such as Excel spreadsheets, is organized into a predefined
format, making it suitable for analysis.
c) Data Cleaning: Techniques like replacing missing values with the median of the column
are used to handle missing data in datasets.
d) Feature Engineering: Feature engineering involves creating new features from existing
data to enhance model performance and analysis.
TERMINAL QUESTIONS
What are the main objectives of data collection in the context of data science
and analytics?
How do you differentiate between structured and unstructured data? Provide
examples of each.
Explain the importance of handling missing values in a dataset during data
preprocessing.
What is feature engineering, and why is it a crucial step in data
preprocessing?
Describe the process of data integration and its significance in data analysis.
THANK YOU
Team – DAV