Business Data Mining Week 2
Data preparation is the process of making raw data ready for further processing
and analysis. The key steps are to collect, clean, and label raw data in a format
suitable for machine learning (ML) algorithms, followed by data exploration and
visualization. This cleaning and combining of raw data before it is used for
machine learning and business analysis is sometimes called “pre-processing.”
Although it may not be the most attractive of duties, careful data preparation is
essential to the success of data analytics. Drawing clear and meaningful insights
from raw data requires careful validation, cleaning, and enrichment. Any business
analysis or model will only be as strong and as valid as the data preparation
behind it.
Pragalath EA2252001010013 1
What are the limitations and ethical, legal, and regulatory issues that you must take
into account?
Answering these questions makes the data analysis project’s goals, parameters, and
requirements clearer, and highlights any challenges, risks, or opportunities
that may arise.
Step 5: Data Exploration
Data exploration means getting familiar with the data and identifying patterns,
trends, outliers, and errors in order to understand it better and evaluate the
possibilities for analysis. To evaluate the data, identify its types, formats, and
structures, and calculate descriptive statistics such as mean, median, mode, and
variance for each numerical variable. Visualizations such as histograms, boxplots,
and scatterplots show how the data is distributed, while more complex techniques
such as classification can reveal hidden patterns and exceptions.
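The descriptive statistics named above can be sketched for a single numerical
variable as follows. This is a minimal illustration using only Python's standard
library; the sample values and the helper names describe and iqr_outliers are
assumptions made for the example, and the 1.5 * IQR rule is the convention
commonly used by boxplots rather than something prescribed in this report.

```python
import statistics

# Hypothetical sample: order values for one numerical variable.
order_values = [120, 135, 135, 150, 160, 175, 900]


def describe(values):
    """Return the descriptive statistics named in the text for one variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
    }


def iqr_outliers(values):
    """Flag values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]


summary = describe(order_values)
outliers = iqr_outliers(order_values)
print(summary)
print(outliers)
```

In this sample the mean is pulled far above the median by a single large value,
which is the kind of signal that exploration is meant to surface before any
modelling starts.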
Use cleaning procedures to remove or correct flaws and inconsistencies in your data,
such as duplicates, outliers, missing values, typos, and formatting problems.
Validation techniques such as checksums, rules, constraints, and tests are then used
to ensure that the data is correct and complete.
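A small sketch of this cleaning and validation step is shown below. The records,
field names, and the specific rules (an age range of 0 to 120, an "@" check for
e-mail addresses) are hypothetical and chosen only for illustration; a real
project would take its rules from the business requirements.

```python
# Hypothetical raw records; field names are illustrative only.
raw_rows = [
    {"id": "1", "email": " Ana@Example.com ", "age": "34"},
    {"id": "1", "email": "ana@example.com", "age": "34"},  # duplicate once normalized
    {"id": "2", "email": "raj@example.com", "age": ""},     # missing value
    {"id": "3", "email": "not-an-email", "age": "29"},      # breaks a validation rule
]


def clean(rows):
    """Normalize formatting, convert types, and drop exact duplicates."""
    seen, cleaned = set(), []
    for row in rows:
        record = {
            "id": row["id"].strip(),
            "email": row["email"].strip().lower(),
            "age": int(row["age"]) if row["age"].strip() else None,
        }
        key = (record["id"], record["email"], record["age"])
        if key not in seen:
            seen.add(key)
            cleaned.append(record)
    return cleaned


def validate(record):
    """Return the list of rule violations for one record (empty means valid)."""
    problems = []
    if record["age"] is None:
        problems.append("age missing")
    elif not 0 <= record["age"] <= 120:
        problems.append("age out of range")
    if "@" not in record["email"]:
        problems.append("email malformed")
    return problems


cleaned_rows = clean(raw_rows)
report = {row["id"]: validate(row) for row in cleaned_rows}
print(report)
```

Keeping cleaning and validation as separate functions means the rules can be
re-run on every new batch of data without repeating the normalization logic.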
various data formats and can handle large datasets.
3. KNIME: KNIME (Konstanz Information Miner) is an open-source platform for
data analytics, reporting, and integration. It provides a visual interface for
designing data workflows and includes a variety of pre-built nodes for data
preparation tasks.
4. DataWrangler by Stanford: DataWrangler is a web-based tool developed by
Stanford that allows users to explore, clean, and transform data through a series of
interactive steps. It generates transformation scripts that can be applied to the
original data.
5. RapidMiner: RapidMiner is a data science platform that includes tools for data
preparation, machine learning, and model deployment. It offers a visual workflow
designer for creating and executing data preparation processes.
6. Apache Spark: Apache Spark is a distributed computing framework that includes
libraries for data processing, including Spark SQL and the DataFrame API. It is
particularly useful for large-scale data preparation tasks.
7. Microsoft Excel: Excel is a widely used spreadsheet software that includes a
variety of data manipulation functions. While it may not be as sophisticated as
specialized tools, it is still a popular choice for smaller-scale data preparation tasks.
Must be identified and corrected early on for analytical accuracy.
4. Lack of standardization in data sets:
Name and address standardization is essential when combining data sets.
Different formats and source systems can change how the same information is recorded.
5. Inconsistencies between enterprise systems:
These arise from differences in terminology, unique identifiers, and other factors.
They make data preparation difficult and may lead to errors in analysis.
6. Data enrichment challenges:
Determining what additional information to add requires excellent skills and
business analytics knowledge.
7. Setting up, maintaining, and improving data preparation processes:
Necessary to standardize processes and ensure they can be utilized repeatedly.
Requires ongoing effort to optimize efficiency and effectiveness.
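The name and address standardization described in item 4 can be sketched as
follows. The abbreviation table, the two example rows, and the helper names are
hypothetical; this is a simplified illustration of the idea, not a complete
address parser.

```python
import re

# Hypothetical abbreviation table; a real project would use a fuller reference list.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}


def standardize_name(name):
    """Trim, collapse internal whitespace, and apply consistent capitalization."""
    return " ".join(name.split()).title()


def standardize_address(address):
    """Lowercase, drop periods and commas, and expand common abbreviations."""
    tokens = re.sub(r"[.,]", " ", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def match_key(row):
    """Build a comparable key from the standardized name and address."""
    return (standardize_name(row["name"]), standardize_address(row["address"]))


# The same customer as recorded in two hypothetical source systems.
crm_row = {"name": "  john   SMITH ", "address": "12 Main St."}
billing_row = {"name": "John Smith", "address": "12 main street"}

same_customer = match_key(crm_row) == match_key(billing_row)
print(same_customer)
```

Without the standardization step the two rows would not compare as equal, so
combining the data sets would count one customer twice.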