Laboratory Work 6
The purpose of this work is to learn the basics of data analysis: methods of data collection,
classification, and forecasting; decision trees; processing of large data volumes; the methods,
stages, and tasks of Data Mining; and data visualization.
Data Collection: Gathering raw data from various sources such as surveys, web
scraping, APIs, and existing databases.
Data Cleaning: Handling missing values, removing duplicates, and correcting
inconsistencies.
Data Processing: Transforming data into a structured format suitable for analysis.
Data Analysis: Applying statistical techniques and machine learning algorithms to
extract insights.
Data Interpretation: Understanding results and drawing meaningful conclusions.
Results Presentation
Graphs and statistical summaries were used to provide initial insights into data distribution,
missing values, and potential outliers. A histogram was generated to visualize the distribution of
salaries across the dataset.
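The exploratory step described above can be sketched as follows. This is a minimal illustration, not the code used in the lab: the column names, the simulated values, the injected missing entries, and the 1.5 * IQR outlier rule are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical simulated data; column names and distributions are assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(22, 60, size=200),
    "salary": rng.normal(50_000, 12_000, size=200).round(2),
})
df.loc[[3, 17, 42], "salary"] = np.nan  # inject a few missing values

summary = df.describe()    # count, mean, std, and quartiles per column
missing = df.isna().sum()  # number of missing values per column

# Flag potential outliers with the 1.5 * IQR rule (one common heuristic).
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)]

# Histogram of salaries; missing values are dropped before plotting.
fig, ax = plt.subplots()
counts, edges, _ = ax.hist(df["salary"].dropna(), bins=20)
ax.set_xlabel("Salary")
ax.set_ylabel("Number of records")
ax.set_title("Salary distribution")
```

Calling `fig.savefig(...)` with a file name of your choice, or `plt.show()` with an interactive backend, would display the resulting figure.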
Surveys and Questionnaires: Used for gathering user opinions and demographics.
Web Scraping: Extracting data from websites.
APIs and Databases: Programmatically fetching data from structured sources.
For this project, a simulated dataset was used to demonstrate analysis techniques.
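One way such a simulated dataset could be generated is sketched here. The function name `make_dataset`, the column names, and every distribution and coefficient are hypothetical choices for illustration; the report does not state how its data was produced.

```python
import numpy as np
import pandas as pd

def make_dataset(n=500, seed=42):
    """Build a hypothetical customer dataset; all distributions are assumptions."""
    rng = np.random.default_rng(seed)
    age = rng.integers(22, 61, size=n)
    # Cap experience so it never exceeds the years elapsed since age 18.
    experience = np.minimum(rng.integers(0, 40, size=n), age - 18)
    # Salary grows with experience, plus Gaussian noise.
    salary = 25_000 + 1_500 * experience + rng.normal(0, 5_000, size=n)
    # Purchase probability rises with salary through a logistic link.
    p = 1 / (1 + np.exp(-(salary - 50_000) / 10_000))
    purchased = rng.binomial(1, p)
    return pd.DataFrame({
        "age": age,
        "experience": experience,
        "salary": salary.round(2),
        "purchased": purchased,
    })

df = make_dataset()
print(df.head())
```

Fixing the seed keeps the data reproducible, so later model results can be compared across runs.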
A classification model was implemented to predict whether a customer would purchase a product
based on features like age, salary, and experience. Logistic Regression was used for this
classification task.
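A minimal sketch of this classification task with scikit-learn is given below. The simulated data, the 75/25 train/test split, the feature scaling step, and the choice of accuracy and a confusion matrix as metrics are assumptions for illustration, not details confirmed by the report.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical simulated customers; all distributions are assumptions.
rng = np.random.default_rng(0)
n = 500
age = rng.integers(22, 61, size=n)
experience = np.minimum(rng.integers(0, 40, size=n), age - 18)
salary = 25_000 + 1_500 * experience + rng.normal(0, 5_000, size=n)
p = 1 / (1 + np.exp(-(salary - 50_000) / 10_000))
purchased = rng.binomial(1, p)  # 1 = customer bought the product

X = np.column_stack([age, salary, experience])
X_train, X_test, y_train, y_test = train_test_split(
    X, purchased, test_size=0.25, random_state=0, stratify=purchased
)

# Scale features first: salary is several orders of magnitude larger than age.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted
print(f"accuracy = {accuracy:.3f}")
print(cm)
```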
Evaluation Metrics:
A regression model was applied to predict salary based on age and experience. Linear
Regression was used for this prediction.
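The regression task could look like the following sketch. The data-generating formula, the train/test split, and the use of mean squared error and R² as metrics are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical simulated data; the salary formula is an assumption.
rng = np.random.default_rng(0)
n = 500
age = rng.integers(22, 61, size=n)
experience = np.minimum(rng.integers(0, 40, size=n), age - 18)
salary = 25_000 + 1_500 * experience + rng.normal(0, 5_000, size=n)

X = np.column_stack([age, experience])
X_train, X_test, y_train, y_test = train_test_split(
    X, salary, test_size=0.25, random_state=0
)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

mse = mean_squared_error(y_test, y_pred)  # average squared prediction error
r2 = r2_score(y_test, y_pred)             # share of variance explained
print(f"coefficients (age, experience) = {reg.coef_}")
print(f"MSE = {mse:.1f}, R^2 = {r2:.3f}")
```

Because the simulated salary depends on experience only, the fitted age coefficient should come out small relative to the experience coefficient; real data would not necessarily behave this way.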
Evaluation Metrics:
A Decision Tree model was built using the CART algorithm to classify customers based on their
purchasing behavior.
The model was evaluated using cross-validation. Important attributes influencing purchasing
decisions were identified and analyzed.
A visual representation of the Decision Tree was generated to illustrate the decision-making
process.
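The decision tree workflow above can be sketched with scikit-learn, whose `DecisionTreeClassifier` uses an optimized version of CART. The simulated data, the depth limit of 3, the 5-fold setting, and the text rendering via `export_text` are assumptions for illustration; the report does not specify them.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical simulated customers; all distributions are assumptions.
rng = np.random.default_rng(0)
n = 500
age = rng.integers(22, 61, size=n)
experience = np.minimum(rng.integers(0, 40, size=n), age - 18)
salary = 25_000 + 1_500 * experience + rng.normal(0, 5_000, size=n)
p = 1 / (1 + np.exp(-(salary - 50_000) / 10_000))
purchased = rng.binomial(1, p)

feature_names = ["age", "salary", "experience"]
X = np.column_stack([age, salary, experience])

# Gini impurity is the CART splitting criterion; the shallow depth keeps
# the tree small enough to read.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)

# 5-fold cross-validation: each fold is held out once for scoring.
scores = cross_val_score(tree, X, purchased, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Refit on all data to inspect feature importances and the learned rules.
tree.fit(X, purchased)
importances = dict(zip(feature_names, tree.feature_importances_))
rules = export_text(tree, feature_names=feature_names)
print(importances)
print(rules)
```

For a graphical rendering, `sklearn.tree.plot_tree` draws the same structure with matplotlib.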
Apache Spark was recommended as a tool for processing large datasets efficiently.
The Data Mining process was applied step by step to the dataset.
Python libraries such as scikit-learn, numpy, and pandas were utilized to implement these
tasks.
Part 7: Data Visualization
Task 7.1: Data Visualization Techniques
The generated visualizations were analyzed to extract insights. Key findings included:
Conclusion
This laboratory work covered fundamental and advanced concepts in Data Analysis and Data
Mining. The project successfully demonstrated:
The findings of this study can be applied to real-world business problems, aiding in data-driven
decision-making. The results confirm the importance of structured data analysis and advanced
machine learning techniques in modern analytics.