Lab Assignment 1 Title: Data Wrangling I: Problem Statement
PROBLEM STATEMENT:
Perform the following operations using Python on any open source dataset (e.g., data.csv)
1. Import all the required Python Libraries.
2. Locate an open-source dataset on the web (e.g., https://www.kaggle.com). Provide a clear description of the
data and its source (i.e., the URL of the web site).
3. Load the dataset into a pandas dataframe.
4. Data Preprocessing: check for missing values in the data using the pandas isnull() function, and use the
describe() function to get initial statistics. Provide variable descriptions and the types of variables, and check
the dimensions of the dataframe.
5. Data Formatting and Data Normalization: summarize the types of variables by checking the data types (i.e.,
character, numeric, integer, factor, and logical) of the variables in the dataset. If variables are not of the
correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.
THEORY:
• It makes the resulting dataset meaningful, gathering data as a preparation stage for the data
mining process.
• It helps to make concrete decisions by cleaning and structuring raw data into the required
format.
• To create a transparent and efficient system for data management, the best solution is to keep all data
in a centralized location so it can be used to improve compliance.
• Wrangling the data helps decisions be made promptly and helps the wrangler clean, enrich, and transform
the data into a clear picture.
1. DISCOVERING:
Discovering frames the entire analytic process: it is where you learn how the data can be used and explored,
and it brings out the best approach for the analysis. It is the step in which the data is understood
more deeply.
2. STRUCTURING:
Raw data arrives in no particular order and in most cases has no structure, because it comes in
many formats of different shapes and sizes. The data must be organized in such a manner that the analyst
can use it in the analysis.
3. CLEANING:
High-quality analysis depends on this step: every piece of data is checked carefully, and redundancies that
do not fit the analysis are removed. Null values are replaced, either with an empty string or with zero, and
formatting is standardized to raise the quality of the data. The goal of data cleaning, or remediation, is to
ensure that nothing can distort the final data taken forward for analysis.
4. ENRICHING:
Enriching means adding meaning to the data. In this step, new kinds of data are derived from the data that
already exists after cleaning and formatting. This is where you strategize about the data in hand and make
sure that what you have is the best-enriched data. Common ways to refine the data are to downsample it,
upscale it, and finally augment it.
5. VALIDATING:
Data quality rules are used to analyze and evaluate the quality of a specific dataset. After processing the
data, its quality and consistency are verified, which provides a solid basis for addressing security issues.
Validation should be conducted along multiple dimensions and should adhere to syntactic constraints.
6. PUBLISHING:
The final part of data wrangling is publishing, which serves the sole purpose of the entire wrangling process:
analysts prepare the wrangled data for use further down the line. The finalized data must match the format
expected by its eventual target. The processed data can now be used for analytics.
Pandas is an open-source Python library mainly used for data analysis. Data wrangling covers the following
functionalities.
• Data exploration: the data is visualized in order to analyze and understand it.
• Dealing with missing values: missing values are a common issue when dealing with large datasets,
and care must be taken to replace them. They can be replaced by the mean or mode, or simply
labelled as NaN.
• Reshaping data: the data is modified and manipulated according to the requirements, either by
changing how pre-existing data is addressed or by restructuring it.
• Filtering data: unwanted rows and columns are filtered out and removed, which brings the data into a
more compact form (see the sketch after this list).
• Others: after the raw data has been turned into an efficient dataset, it becomes useful for data
visualization, data analysis, training a model, etc.
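As a small illustration of the filtering functionality above, a sketch on a hypothetical dataframe (all column
names and values are placeholders):

import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C'],
                   'score': [55, 90, 72],
                   'unused': [1, 2, 3]})

# Keep only the rows with score above 60, then drop an unwanted column
df = df[df['score'] > 60].drop(columns=['unused'])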
Data preprocessing is carried out to remove the problems of unformatted real-world data that we discussed above.
First of all, let's explain how missing data can be handled during data preparation. Three different approaches
can be taken, given below; a code sketch of the third follows this list.
• Ignoring the missing record - This is the simplest and most efficient method for handling missing data.
However, it should not be used when the number of missing values is immense or when the pattern
of missingness is related to the unrecognized root cause of the problem statement.
• Filling the missing values manually - This is one of the best methods of the data preparation
process. Its limitation is that it is not efficient for large datasets with many missing values,
since it becomes a time-consuming task.
• Filling using computed values - The missing values can also be filled in by computing the mean, mode,
or median of the observed values. Alternatively, predicted values computed with machine learning or
deep learning tools and algorithms can be used. One drawback of this approach is that it can introduce
bias into the data, because the computed values are not exact with respect to the observed values.
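A minimal sketch of the third approach; the dataframe df and its columns are hypothetical:

import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [25, np.nan, 32, 41, np.nan],
                   'city': ['Pune', 'Mumbai', None, 'Pune', 'Delhi']})

# Count missing values per column
print(df.isnull().sum())

# Numeric column: fill with the mean; categorical column: fill with the mode
df['age'] = df['age'].fillna(df['age'].mean())
df['city'] = df['city'].fillna(df['city'].mode()[0])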
Data Formatting
We should make sure that every column is assigned the correct data type. This can be checked through the
dtypes property.
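Assuming the dataset has been loaded into a dataframe named df (the name is an assumption), the check is a
one-liner:

df.dtypes

For the Twitter dataset used in this example, it prints: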
Tweet Id object
Tweet URL object
Tweet Posted Time (UTC) object
Tweet Content object
Tweet Type object
Client object
Retweets Received int64
Likes Received int64
Tweet Location object
Tweet Language object
User Id object
Name object
Username object
User Bio object
Verified or Non-Verified object
Profile URL object
Protected or Non-protected object
User Followers int64
User Following int64
User Account Creation Date object
Impressions int64
dtype: object
We can convert the column Tweet Location to string by using the astype() function; a minimal sketch, assuming
the dataframe is named df, follows:
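# 'df' is an assumed name for the dataframe holding the Twitter dataset
df['Tweet Location'] = df['Tweet Location'].astype(str)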
Data Normalization involves adjusting values measured on different scales to a common scale.
Normalization applies only to columns containing numeric values. Two common normalization methods are:
• Min-max scaling: X' = (x − min(x)) / (max(x) − min(x)), which rescales the values to the range [0, 1].
• Z-score normalization: Z = (x − μ) / σ, where μ is the mean and σ is the standard deviation of the column.
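A minimal sketch of both methods, applied to the numeric Impressions column from the dataset above (the choice
of column is an assumption):

# Min-max scaling: rescale the values to the [0, 1] range
df['Impressions_minmax'] = (df['Impressions'] - df['Impressions'].min()) / (df['Impressions'].max() - df['Impressions'].min())

# Z-score normalization: subtract the mean, divide by the standard deviation
df['Impressions_zscore'] = (df['Impressions'] - df['Impressions'].mean()) / df['Impressions'].std()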
When we look at categorical data, the first question that arises is how to handle it, because machine learning
algorithms deal best with numeric values; predictive models cannot be built directly on raw text data. So, to
build predictive models we have to convert categorical data into numeric form. Three common methods are given below.
Method 1: Replacing values
Categorical values can be mapped to numbers directly, for example with the pandas replace() method. However,
replacing the values manually is not the most efficient way to convert them.
Method 2: One-hot encoding
One-hot encoding is the most widespread approach, and it works very well unless the categorical variable
takes on a large number of values. It creates a new binary column for each possible value in the original
data, indicating the presence of that value. Pandas provides this through the get_dummies() method, which
returns the dummy variable columns; a sketch follows.
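A minimal sketch of get_dummies(), assuming the dataframe df and using the categorical Tweet Type column from
the dataset above:

import pandas as pd

# One binary (0/1) column is created per distinct value of 'Tweet Type'
dummies = pd.get_dummies(df['Tweet Type'], prefix='Tweet Type')
df = pd.concat([df, dummies], axis=1)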
Method 3: Label encoding
Label encoding refers to converting the labels into numeric form so as to make them machine-readable.
Machine learning algorithms can then better decide how those labels should be operated on. It is an important
preprocessing step for structured datasets in supervised learning.
Example:
Suppose we have a column Height in some dataset, with the values tall, medium, and short. After applying
label encoding, the Height column is converted to numeric labels, where 0 is the label for tall, 1 is the
label for medium, and 2 is the label for short.
Example:
# Import the label encoder from scikit-learn
from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
# Encode the labels in the 'Height' column and inspect the distinct codes
df['Height'] = label_encoder.fit_transform(df['Height'])
df['Height'].unique()
PROCEDURE:
PROGRAM:
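A minimal end-to-end sketch covering the steps in the problem statement; the file name data.csv comes from the
problem statement, while the commented-out column names are hypothetical placeholders:

# Step 1: import the required libraries
import pandas as pd
import numpy as np
from sklearn import preprocessing

# Step 3: load the dataset into a pandas dataframe
df = pd.read_csv('data.csv')

# Step 4: data preprocessing - missing values, initial statistics, dimensions
print(df.isnull().sum())   # missing values per column
print(df.describe())       # initial statistics for numeric columns
print(df.dtypes)           # types of the variables
print(df.shape)            # dimensions of the dataframe (rows, columns)

# Step 5: data formatting - apply a type conversion where needed
# ('some_column' is a hypothetical placeholder)
# df['some_column'] = df['some_column'].astype(str)

# Step 6: turn categorical variables into quantitative variables
# ('category_column' is a hypothetical placeholder)
# df = pd.get_dummies(df, columns=['category_column'])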
CONCLUSION:
They will understand how important data wrangling is for data and using
different techniques optimizedresults can be obtained. Hence wrangle the data,
before processing for analysis.