
Home Assignment - 03

Project Report Submitted in Partial Fulfilment of the Requirements for the Degree of

Bachelor of Technology (Hons.)


in
Computer Science and Engineering

Submitted by
AKASH KUMAR (Roll No. 2021UGCS040)

Under the Supervision of


Dr. Dilip Kumar

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

National Institute of Technology Jamshedpur

SUMMARY

Part 1: Data Exploration and Visualization


Approach:

• I began by loading the dataset into a pandas DataFrame and exploring its structure. I then calculated
basic descriptive statistics to understand the distributions and central tendencies of the numerical
columns, and used visualizations such as histograms and box plots to examine the data and flag
potential outliers; a sketch of this step follows below.
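
A minimal sketch of this exploration step in Python, assuming the data lives in a CSV file (the file
name data.csv is a placeholder):

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset and inspect its structure
df = pd.read_csv("data.csv")        # placeholder file name
print(df.shape)
df.info()

# Descriptive statistics for the numerical columns
print(df.describe())

# Histograms of every numeric column to examine distributions
df.hist(figsize=(10, 8))
plt.tight_layout()
plt.show()

# Box plots to highlight potential outliers
df.boxplot(figsize=(10, 6))
plt.show()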

Challenges:

• Identifying and dealing with outliers was a significant challenge, as it required a careful balance
between removing anomalous values and preserving valuable data points.
• Another challenge was ensuring the visualizations provided clear insights, which required choosing
an appropriate plot type for each variable.

Key Learnings:

• Visualization is crucial for uncovering hidden patterns and outliers in the data.
• Descriptive statistics provide a quick summary of data, helping to identify areas that need further
cleaning or transformation.

Part 2: Data Cleaning


Approach:

• Missing values were identified and handled with methods appropriate to each column type, such as
imputation for continuous variables and mode substitution for categorical variables. Duplicate rows
were removed to ensure data integrity, and the outliers identified earlier were either removed or
transformed depending on their impact on the analysis (see the sketch below).
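
A minimal sketch of the cleaning step under the same assumptions; the median/mode imputation and
the percentile clipping shown here are illustrative choices, not the only reasonable ones:

import pandas as pd

df = pd.read_csv("data.csv")        # placeholder file name

# Impute missing values: median for numeric columns, mode for categorical ones
for col in df.columns[df.isna().any()]:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])

# Drop exact duplicate rows to preserve data integrity
df = df.drop_duplicates()

# One way to transform outliers: clip numeric columns to the 1st-99th percentiles
for col in df.select_dtypes("number").columns:
    low, high = df[col].quantile([0.01, 0.99])
    df[col] = df[col].clip(low, high)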

Challenges:

• Handling missing data effectively without introducing bias or losing important information was tricky.
• Dealing with outliers required careful consideration of the impact on the overall dataset.

Key Learnings:

• Data cleaning is a critical step in the data science process, as it directly impacts the quality and
accuracy of the analysis.
• Proper handling of missing values and outliers ensures that the dataset is reliable for further analysis.

Part 3: Data Integration
Approach:

• A secondary dataset was obtained and merged with the primary dataset on a common identifier. The
merged dataset was then checked for consistency, and any duplicates or discrepancies were resolved
to produce a unified, accurate dataset (see the sketch below).
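
A minimal sketch of the merge and consistency checks; the file names and the join key id are
placeholders:

import pandas as pd

primary = pd.read_csv("primary.csv")      # placeholder file names
secondary = pd.read_csv("secondary.csv")

# Merge on a common identifier; indicator=True records where each row came from
merged = primary.merge(secondary, on="id", how="left", indicator=True)

# Flag discrepancies: rows with no match in the secondary dataset
unmatched = (merged["_merge"] == "left_only").sum()
print(f"{unmatched} rows had no match in the secondary dataset")

# Drop the indicator column and any duplicates introduced by the merge
merged = merged.drop(columns="_merge").drop_duplicates()
print(f"primary={len(primary)}, secondary={len(secondary)}, merged={len(merged)}")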

Challenges:

• Finding a suitable secondary dataset that could be merged with the primary one was time-consuming.
• Ensuring consistency between datasets, especially when they originated from different sources,
required thorough checking and validation.

Key Learnings:

• Data integration is essential for creating comprehensive datasets that can provide deeper insights.
• Ensuring consistency across integrated datasets is crucial for maintaining the integrity of the data and
the reliability of the analysis.

Part 4: Data Storage and Retrieval


Approach:

• The cleaned dataset was saved to Google Drive for cloud storage. It was then retrieved and loaded
back into a DataFrame, confirming that the data could be stored and accessed easily in a distributed
environment (see the sketch below).
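
A minimal sketch of the storage and retrieval step, assuming the work was done in Google Colab
(where Drive can be mounted into the file system); the Drive path is a placeholder:

from google.colab import drive
import pandas as pd

drive.mount("/content/drive")

# df stands in for the cleaned dataset produced in Part 2
df = pd.read_csv("data.csv")        # placeholder file name

# Save the cleaned data to Drive
path = "/content/drive/MyDrive/cleaned_data.csv"
df.to_csv(path, index=False)

# Retrieve it later by reading back from the same path
df_restored = pd.read_csv(path)
print(df_restored.shape)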

Challenges:

• Setting up the correct permissions and paths for Google Drive storage took some time, especially
when ensuring the data could be easily accessed later.
• Understanding how to effectively use cloud storage within a Python environment required additional
learning.

Key Learnings:

• Cloud storage provides a flexible and scalable solution for storing large datasets, making it easier to
collaborate and manage data in distributed environments.
• Proper storage and retrieval mechanisms are crucial for ensuring that data can be accessed and
analyzed efficiently.
Contribution to the Overall Data Science Process

Data Exploration:

• Data exploration is the foundation of the data science process. It allows for an initial understanding of
the data, guiding the subsequent steps of cleaning, transformation, and analysis. By identifying
patterns, trends, and outliers, it informs the strategies for data cleaning and preparation.

Data Cleaning:

• Cleaning the data ensures that it is accurate, complete, and free of errors. This step is crucial for
building reliable models and making accurate predictions. Without proper cleaning, the analysis might
be flawed, leading to incorrect conclusions.

Data Integration:

• Integrating data from multiple sources enriches the dataset, providing a more holistic view of the
problem at hand. It enables the combination of different perspectives and variables, leading to more
comprehensive analysis and better-informed decisions.

Data Storage and Retrieval:

• Efficient data storage and retrieval are critical for managing large datasets, especially in collaborative
or cloud-based environments. They ensure that data is securely stored, easily accessible, and
available for sharing among team members or for future analysis.

Final Thoughts
This assignment provided a comprehensive overview of the data science process, from data exploration to
cleaning, integration, and storage. Each step is interlinked, contributing to the overall goal of extracting
meaningful insights from data. By understanding and addressing the challenges in each part, I have gained a
deeper appreciation of the importance of thorough data preparation and management in the data science
workflow.
