Pandas - get_dummies() method
Last Updated: 03 Dec, 2024
In Pandas, the get_dummies() function converts categorical variables into dummy/indicator variables (known as one-hot encoding). This method is especially useful when preparing data for machine learning algorithms that require numeric input.
Syntax: pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, drop_first=False, dtype=None)
The function returns a DataFrame where each unique category in the original data is converted into a separate column, and the values are represented as True (for presence) or False (for absence).
Encoding a Pandas DataFrame
Let's look at an example of how to use the get_dummies() method to perform one-hot encoding.
Python
import pandas as pd
data = {
'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
'Size': ['Small', 'Large', 'Medium', 'Small', 'Large']
}
df = pd.DataFrame(data)
print('Original DataFrame')
print(df)  # print() works in plain scripts; display() is only available in notebooks
# Perform one-hot encoding
df_encoded = pd.get_dummies(df)
print('\nDataFrame after performing One-hot Encoding')
print(df_encoded)
Output:
Sample DataFrame
DataFrame after performing One-Hot Encoding
In the output, each unique category in the Color and Size columns has been transformed into a separate binary (True or False) column. The new columns indicate whether the respective category is present in each row.
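The syntax line above also lists columns, prefix, prefix_sep and drop_first, which the basic example does not exercise. The following is a minimal sketch of how they affect the generated column names, reusing the same Color/Size data; the prefix 'c' and separator '-' are arbitrary values chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Size': ['Small', 'Large', 'Medium', 'Small', 'Large'],
})

# Encode only the 'Color' column; 'Size' is kept as-is.
color_only = pd.get_dummies(df, columns=['Color'])
print(color_only.columns.tolist())
# ['Size', 'Color_Blue', 'Color_Green', 'Color_Red']

# Custom prefix and separator for the generated column names
# ('c' and '-' are illustrative choices, not defaults).
custom = pd.get_dummies(df, columns=['Color'], prefix='c', prefix_sep='-')
print(custom.columns.tolist())
# ['Size', 'c-Blue', 'c-Green', 'c-Red']

# drop_first=True omits the first (alphabetically sorted) category
# of every encoded column, giving k-1 dummies for k categories.
reduced = pd.get_dummies(df, drop_first=True)
print(reduced.columns.tolist())
# ['Color_Green', 'Color_Red', 'Size_Medium', 'Size_Small']
```

Dropping the first level is commonly used with linear models, where a full set of dummy columns would be perfectly collinear with the intercept.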
To get the output as 0 and 1 instead of True and False, set the dtype parameter to int (or to float, which gives 0.0 and 1.0).
Python
# Perform one-hot encoding with integer (0/1) output
df_encoded = pd.get_dummies(df, dtype=int)
print('\nDataFrame after performing One-hot Encoding')
print(df_encoded)
Output:
Pandas DataFrame after performing One-Hot Encoding (0s and 1s)
Encoding a Pandas Series
Python
import pandas as pd
# Series with days of the week
days = pd.Series(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Monday'])
print(pd.get_dummies(days, dtype='int'))
Output:
   Friday  Monday  Thursday  Tuesday  Wednesday
0       0       1         0        0          0
1       0       0         0        1          0
2       0       0         0        0          1
3 ...
In this example, each unique day of the week is transformed into a dummy variable, where a 1 indicates the presence of that day.
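One way to check this reading of the output is to confirm that every row contains exactly one 1, and that the column holding that 1 is the original day. The sketch below uses idxmax for the reverse lookup, which is just one possible approach and assumes there are no missing values in the Series.

```python
import pandas as pd

days = pd.Series(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Monday'])
dummies = pd.get_dummies(days, dtype=int)

# Each row is one-hot: exactly one column is 1, the rest are 0.
row_sums = dummies.sum(axis=1).tolist()
print(row_sums)  # [1, 1, 1, 1, 1, 1]

# idxmax(axis=1) returns, per row, the label of the column holding the 1,
# which recovers the original category.
recovered = dummies.idxmax(axis=1).tolist()
print(recovered == days.tolist())  # True
```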
Converting NaN Values into a Dummy Variable
The dummy_na=True option can be used when dealing with missing values. It creates a separate column indicating whether the value is missing or not.
Python
import pandas as pd
import numpy as np
# List with color categories and NaN
colors = ['Red', 'Blue', 'Green', np.nan, 'Red', 'Blue']
print(pd.get_dummies(colors, dummy_na=True, dtype='int'))
Output:
   Blue  Green  Red  NaN
0     0      0    1    0
1     1      0    0    0
2     0      1    0    0
3     0      0    0    1
4     0      0    1    0
5     1      0    0    0
With dummy_na=True, an extra NaN column is added. Row 3, which held NaN in the original list, has a 1 in that column and 0 in every other column.
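For comparison, the sketch below encodes the same colors with and without dummy_na. With the default dummy_na=False, the missing entry is not dropped; its row simply contains only zeros, so the fact that it was missing is lost. The variable names without_na and with_na are illustrative.

```python
import numpy as np
import pandas as pd

colors = pd.Series(['Red', 'Blue', 'Green', np.nan, 'Red', 'Blue'])

without_na = pd.get_dummies(colors, dtype=int)               # default: dummy_na=False
with_na = pd.get_dummies(colors, dummy_na=True, dtype=int)

# Default behaviour: the NaN row (index 3) is encoded as all zeros.
print(without_na.loc[3].tolist())    # [0, 0, 0]

# dummy_na=True appends one extra column that flags the missing entries.
print(with_na.shape)                 # (6, 4)
print(with_na.iloc[:, -1].tolist())  # [0, 0, 0, 1, 0, 0]
```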