
Week 2 - LAQs

Explain in detail the steps in the data preparation process.

Data preparation is the process of making raw data ready for further processing
and analysis. The key activities are collecting, cleaning, and labeling raw data in a format
suitable for machine learning (ML) algorithms, followed by data exploration and
visualization. This process of cleaning and combining raw data before using it for
machine learning and business analysis is known as data preparation, or sometimes
"pre-processing." Although it may not be the most attractive of duties, careful data
preparation is essential to the success of data analytics. Extracting clear and important insights
from raw data requires careful validation, cleaning, and enrichment. Any business analysis
or model created will only be as strong and reliable as the initial data
preparation.

Data Preparation Process


There are a few important steps in the data preparation process, and each one is
essential to making sure the data is prepared for analysis or other processing. The
following are the key stages related to data preparation:

Step 1: Describe Purpose and Requirements


Identifying the goals and requirements for the data analysis project is the first step in
the data preparation process. Consider the following:
 What is the goal of the data analysis project and how big is it?
 Which major inquiries or ideas are you planning to investigate or evaluate using
the data?
 Who are the target audience and end-users for the data analysis findings? What
positions and duties do they have?
 Which formats, types, and sources of data do you need to access and analyze?
 What requirements do you have for the data in terms of quality, accuracy,
completeness, timeliness, and relevance?

Pragalath EA2252001010013 1
 What are the limitations and ethical, legal, and regulatory issues that you must take
into account?
Answering these questions makes it simpler to define the data analysis project's goals,
parameters, and requirements, and highlights any challenges, risks, or opportunities
that may develop.

Step 2: Data Collection


This step involves collecting information from a variety of sources, including files, databases,
websites, and social media, to support a thorough analysis while ensuring the use of reliable,
high-quality data. Suitable resources and methods, such as database queries, APIs, and web
scraping, are used to obtain the data.
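As a minimal sketch of this step, the snippet below loads records from two different source types: a CSV file and a JSON API payload. The CSV text and the JSON payload here are stand-ins (in practice you would pass a file path to `pd.read_csv` and fetch the payload over HTTP); the column names are illustrative.

```python
import io
import json

import pandas as pd

# Simulated CSV export; in practice this would be pd.read_csv("orders.csv")
# or a database query.
csv_data = io.StringIO("order_id,amount\n1,120.5\n2,89.9\n")
orders = pd.read_csv(csv_data)

# Simulated JSON API payload; in practice this would come from an HTTP request.
api_payload = json.loads('[{"order_id": 1, "region": "EU"},'
                         ' {"order_id": 2, "region": "US"}]')
regions = pd.DataFrame(api_payload)

print(len(orders), len(regions))  # two records collected from each source
```

Loading each source into a DataFrame early gives every downstream step (integration, profiling, cleaning) a single uniform structure to work with.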

Step 3: Combining and Integrating Data


Data integration involves combining data from multiple sources or dimensions in order
to create a complete, logical dataset. Data integration solutions support a wide range of
operations, such as union, intersection, difference, and join, across a variety of data
schemas and architectures.
To properly combine and integrate data, it is essential to store and arrange information
in a common standard format, such as CSV, JSON, or XML, for easy access and
uniform comprehension. Organizing data management and storage using solutions
such as cloud storage, data warehouses, or data lakes improves governance, maintains
consistency, and speeds up access to data on a single platform.
Strong security procedures such as audits, backups, recovery, verification, and
encryption can be used to ensure reliable data management. Encryption protects data
during transmission and storage, while authorization and authentication control who
can access it.

Step 4: Data Profiling


Data profiling is a systematic method for assessing and analyzing a dataset to
understand its quality, structure, and content and to improve accuracy within an
organizational context. Profiling identifies inconsistencies, differences, and null values by
analyzing the source data, looking for errors and anomalies, and
examining file structure, content, and relationships. It helps to evaluate attributes
including completeness, accuracy, consistency, validity, and timeliness.
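A basic profile of the kind described above can be assembled with pandas: for each column, record its type, how many values are null, and how many distinct values it holds. The sample frame is invented for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative dataset with a missing value and a repeated value
df = pd.DataFrame({
    "age": [34, 45, np.nan, 29],
    "email": ["a@x.com", None, "c@x.com", "c@x.com"],
})

# One profile row per column: type, null count, distinct-value count
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nulls": df.isna().sum(),
    "unique": df.nunique(),
})
print(profile)
```

Even this small summary already surfaces completeness problems (the null counts) and hints at duplicates (fewer unique emails than rows).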

Step 5: Data Exploration
Data exploration means getting familiar with the data, identifying patterns, trends, outliers,
and errors in order to understand it better and evaluate the possibilities for analysis.
To explore the data, identify data types, formats, and structures, and calculate descriptive
statistics such as the mean, median, mode, and variance for each numerical variable.
Visualizations such as histograms, boxplots, and scatterplots can provide
insight into the data distribution, while more advanced techniques such as classification
can reveal hidden patterns and expose exceptions.
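The descriptive statistics and outlier inspection described above can be sketched as follows. The sales figures are invented; the interquartile-range (IQR) rule used here is one common convention for flagging outliers, matching what a boxplot shows visually.

```python
import pandas as pd

# Illustrative daily sales; the last value looks suspicious
sales = pd.Series([120, 135, 128, 140, 131, 900])

print(sales.describe())  # count, mean, std, min, quartiles, max

# IQR rule: flag values more than 1.5 * IQR outside the middle quartiles
q1, q3 = sales.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = sales[(sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)]
print(outliers.tolist())
```

Whether a flagged value is an error or a genuine extreme is a judgment call; exploration surfaces it so the cleaning step can decide.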

Step 6: Data Transformations and Enrichment


Data enrichment is the process of improving a dataset by adding new features or
columns, enhancing its accuracy and reliability, and verifying it against third-party
sources.
 The technique involves combining various data sources like CRM, financial, and
marketing to create a comprehensive dataset, incorporating third-party data like
demographics for enhanced insights.
 The process involves categorizing data into groups like customers or products
based on shared attributes, using standard variables like age and gender to describe
these entities.
 Engineer new features or fields by utilizing existing data, such as calculating
customer age based on their birthdate. Estimate missing values from available data,
such as absent sales figures, by referencing historical trends.
 The task involves identifying entities like names and addresses within unstructured
text data, thereby extracting actionable information from text without a fixed
structure.
 The process involves assigning specific categories to unstructured text data, such
as product descriptions or customer feedback, to facilitate analysis and gain
valuable insights.
 Utilize various techniques like geocoding, sentiment analysis, entity recognition,
and topic modeling to enrich your data with additional information or context.
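Two of the enrichment operations listed above, engineering a feature from existing data (age from birthdate) and estimating a missing value from available data, can be sketched like this. The records are invented, and the day-count division by 365 is a rough approximation that ignores leap days.

```python
import pandas as pd

customers = pd.DataFrame({
    "name": ["Ana", "Ben"],
    "birthdate": pd.to_datetime(["1990-06-15", "1985-01-02"]),
})

# Engineer a new feature from existing data: age in whole years as of a
# fixed date (dividing days by 365 is an approximation ignoring leap days)
as_of = pd.Timestamp("2024-01-01")
customers["age"] = (as_of - customers["birthdate"]).dt.days // 365

# Estimate a missing value from available data: fill a gap in the sales
# series with the mean of the observed values
sales = pd.Series([100.0, None, 120.0])
sales = sales.fillna(sales.mean())

print(customers["age"].tolist(), sales.tolist())
```

Mean imputation is only one option; referencing historical trends, as the text suggests, could mean interpolation or a seasonal average instead.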

Step 7: Data Cleaning

Use cleaning procedures to remove or correct flaws and inconsistencies in the data,
such as duplicates, outliers, missing values, typos, and formatting problems.
Validation techniques such as checksums, rules, constraints, and tests are used to
ensure that the data is correct and complete.
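A short sketch of these cleaning procedures with pandas: normalize formatting (case, stray whitespace), drop duplicates, and fill missing values with a placeholder. The records are illustrative.

```python
import pandas as pd

# Illustrative raw records with a formatting duplicate and a missing city
raw = pd.DataFrame({
    "name": ["Ana", "ana ", "Ben", "Ben"],
    "city": ["Paris", "Paris", None, "Berlin"],
})

# Fix formatting issues first so duplicates become detectable
raw["name"] = raw["name"].str.strip().str.title()

# Remove duplicates, then handle the remaining missing value
clean = raw.drop_duplicates(subset="name", keep="first")
clean = clean.fillna({"city": "Unknown"})
print(clean)
```

Note the ordering: "ana " only matches "Ana" after the formatting fix, which is why cleaning typos and deduplicating are done together.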

Step 8: Data Validation


Data validation is crucial for ensuring data accuracy, completeness, and consistency,
as it checks data against predefined rules and criteria that align with your
requirements, standards, and regulations.
 Analyze the data to better understand its properties, such as data types, ranges, and
distributions. Identify any potential issues, such as missing values, outliers, or
errors.
 Choose a representative sample of the dataset for validation. This technique is
useful for larger datasets because it minimizes processing effort.
 Apply planned validation rules to the collected data. Rules may contain format
checks, range validations, or cross-field validations.
 Identify records that do not fulfill the validation standards. Keep track of any flaws
or discrepancies for future analysis.
 Correct identified mistakes by cleaning, converting, or entering data as needed.
Maintaining an audit record of modifications made during this procedure is critical.
 Automate data validation activities as much as feasible to ensure consistent and
ongoing data quality maintenance.
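The rule-based checks described above, format checks and range validations applied per record, can be sketched as a set of named boolean rules. The records, the email pattern, and the age range are all illustrative choices, not fixed standards.

```python
import pandas as pd

records = pd.DataFrame({
    "email": ["a@x.com", "not-an-email", "c@x.com"],
    "age": [34, 29, -5],
})

# Named validation rules: a format check and a range validation
rules = {
    "email_format": records["email"].str.contains(r"^[^@\s]+@[^@\s]+$",
                                                  regex=True),
    "age_in_range": records["age"].between(0, 120),
}

# One boolean column per rule doubles as an audit trail of which rule failed
failures = pd.DataFrame(rules)
invalid = records[~failures.all(axis=1)]
print(invalid.index.tolist())
```

Keeping one column per rule (rather than a single pass/fail flag) records exactly which standard each rejected record violated, which the text notes is critical for later correction.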

Tools for Data Preparation


The following section outlines various tools available for data preparation, essential
for addressing quality, consistency, and usability challenges in datasets.
1. Pandas: Pandas is a powerful Python library for data manipulation and analysis.
It provides data structures like DataFrames for efficient data handling and
manipulation. Pandas is widely used for cleaning, transforming, and exploring data
in Python.
2. Trifacta Wrangler: Trifacta Wrangler is a data preparation tool that offers a
visual and interactive interface for cleaning and structuring data. It supports

various data formats and can handle large datasets.
3. KNIME: KNIME (Konstanz Information Miner) is an open-source platform for
data analytics, reporting, and integration. It provides a visual interface for
designing data workflows and includes a variety of pre-built nodes for data
preparation tasks.
4. DataWrangler by Stanford: DataWrangler is a web-based tool developed by
Stanford that allows users to explore, clean, and transform data through a series of
interactive steps. It generates transformation scripts that can be applied to the
original data.
5. RapidMiner: RapidMiner is a data science platform that includes tools for data
preparation, machine learning, and model deployment. It offers a visual workflow
designer for creating and executing data preparation processes.
6. Apache Spark: Apache Spark is a distributed computing framework that includes
libraries for data processing, including Spark SQL and Spark DataFrame. It is
particularly useful for large-scale data preparation tasks.
7. Microsoft Excel: Excel is a widely used spreadsheet software that includes a
variety of data manipulation functions. While it may not be as sophisticated as
specialized tools, it is still a popular choice for smaller-scale data preparation tasks.

Challenges in Data Preparation


As we have seen, data preparation is a critical stage in the
analytics process, yet it is fraught with numerous challenges:
1. Lack of or insufficient data profiling:
 Leads to mistakes, errors, and difficulties in data preparation.
 Contributes to poor analytics findings.
 May result in missing or incomplete data.
2. Incomplete data:
 Missing values and other issues that must be addressed from the start.
 Can lead to inaccurate analysis if not handled properly.
3. Invalid values:
 Caused by spelling problems, typos, or incorrect number input.

 Must be identified and corrected early on for analytical accuracy.
4. Lack of standardization in data sets:
 Name and address standardization is essential when combining data sets.
 Different formats and systems may impact how information is received.
5. Inconsistencies between enterprise systems:
 Arise due to differences in terminology, special identifiers, and other factors.
 Make data preparation difficult and may lead to errors in analysis.
6. Data enrichment challenges:
 Determining what additional information to add requires excellent skills and
business analytics knowledge.
7. Setting up, maintaining, and improving data preparation processes:
 Necessary to standardize processes and ensure they can be utilized repeatedly.
 Requires ongoing effort to optimize efficiency and effectiveness.
