Data Wrangling
■ 1. drop data
■ a. drop the whole row
■ b. drop the whole column
■ 2. replace data
■ a. replace it by the mean
■ b. replace it by the most frequent value (mode)
■ c. replace it based on other functions
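A minimal sketch of the options above, assuming the list refers to handling missing values. The DataFrame and its column names (`price`, `doors`, `notes`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values (NaN / None)
df = pd.DataFrame({
    "price": [100.0, np.nan, 300.0, 200.0],
    "doors": ["four", "two", None, "four"],
    "notes": [np.nan, np.nan, np.nan, np.nan],
})

# 1a. drop every row whose price is missing
rows_dropped = df.dropna(subset=["price"], axis=0)

# 1b. drop every column that is entirely missing
cols_dropped = df.dropna(axis=1, how="all")

# 2a. replace missing numeric values by the column mean
price_filled = df["price"].fillna(df["price"].mean())

# 2b. replace missing categorical values by the most frequent value
doors_filled = df["doors"].fillna(df["doors"].mode()[0])
```

Dropping is simplest but loses information; replacing keeps the row at the cost of a guessed value, so the choice depends on how much data is missing.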
Correct data format
■ Making sure that all data is in the correct format (int, float,
text or other).
■ In Pandas, we use
■ **.dtypes** (an attribute, not a method) to check the data type of each column
■ **.astype()** to change the data type
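As an illustration, the sketch below checks column types and converts a numeric column that was read in as text. The data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical data: price was loaded as text instead of a number
df = pd.DataFrame({"price": ["100", "250", "300"], "doors": [4, 2, 4]})

# Check the data type of every column
print(df.dtypes)

# Convert the price column from text to float
df["price"] = df["price"].astype("float")
```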
Data Standardization
■ Why normalization?
■ Normalization is the process of transforming the values of several
variables into a similar range. Typical normalizations include
scaling a variable so that its average is 0, scaling it so that its
variance is 1, or scaling it so that its values range from 0 to 1.
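Two of these scalings can be sketched as follows. The column name `length` and its values are made up; the z-score combines the "average 0" and "variance 1" cases in one step.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"length": [150.0, 175.0, 200.0]})

# z-score: subtract the mean, divide by the standard deviation
# (result has mean 0 and variance 1)
df["length_z"] = (df["length"] - df["length"].mean()) / df["length"].std()

# min-max: rescale so that values range from 0 to 1
value_range = df["length"].max() - df["length"].min()
df["length_minmax"] = (df["length"] - df["length"].min()) / value_range
```

Note that pandas `.std()` uses the sample standard deviation (ddof=1) by default.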
Binning
■ Why binning?
■ Binning is a process of transforming continuous numerical
variables into discrete categorical 'bins', for grouped analysis.
■ Normally, a histogram is used to visualize how the values are
distributed across the bins created.
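One possible way to bin a column is `pd.cut` with equal-width edges, sketched below. The column name `horsepower`, the three labels, and the choice of equal-width bins are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"horsepower": [48, 90, 120, 160, 200, 262]})

# Four equally spaced edges define three equal-width bins
edges = np.linspace(df["horsepower"].min(), df["horsepower"].max(), 4)
labels = ["low", "medium", "high"]

# include_lowest=True keeps the minimum value inside the first bin
df["horsepower_binned"] = pd.cut(
    df["horsepower"], bins=edges, labels=labels, include_lowest=True
)

# Count of values per bin: the numbers a histogram of the bins would show
counts = df["horsepower_binned"].value_counts()
```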
Indicator variable (or dummy variable)
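An indicator variable encodes a category as a 0/1 column so that it can be used in numerical analysis. A minimal sketch with `pd.get_dummies`, using a made-up `fuel` column:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"fuel": ["gas", "diesel", "gas"]})

# One 0/1 column per category value
dummies = pd.get_dummies(df["fuel"], prefix="fuel", dtype=int)

# Attach the indicator columns to the original frame
df = pd.concat([df, dummies], axis=1)
```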
df.attribute    description
dtypes          list the types of the columns
columns         list the column names
axes            list the row labels and column names
ndim            number of dimensions
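These attributes are accessed without parentheses. A short sketch on a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950.0, 16430.0]})

print(df.dtypes)   # type of each column
print(df.columns)  # column names
print(df.axes)     # [row labels, column names]
print(df.ndim)     # number of dimensions (2 for a DataFrame)
```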
df.method()                 description
head( [n] ), tail( [n] )    first/last n rows
kurt()                      kurtosis
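Unlike attributes, these are methods and need parentheses. A brief sketch with made-up values:

```python
import pandas as pd

# Hypothetical data with one large outlier
df = pd.DataFrame({"price": [5, 6, 7, 8, 9, 10, 50]})

first_rows = df.head(3)   # first 3 rows
last_rows = df.tail(2)    # last 2 rows

# Kurtosis of a single column; heavy tails or outliers raise it
price_kurtosis = df["price"].kurt()
```

Called without an argument, `head()` and `tail()` return 5 rows.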
Grouping