This document contains two PySpark functions: min_max_scaler, which takes a DataFrame and a list of columns and returns the DataFrame with a min-max-scaled copy of each listed column added as scaled_<column>, and column_dropper, which takes a DataFrame and a threshold and returns the DataFrame without the columns whose fraction of missing values exceeds the threshold.
Support Functions
def min_max_scaler(df, cols_to_scale):
    # Takes a dataframe and list of columns to minmax scale. Returns a dataframe.
    for col in cols_to_scale:
        # Define min and max values and collect them
        max_values = df.agg({col: 'max'}).collect()[0][0]
        min_values = df.agg({col: 'min'}).collect()[0][0]
        new_column_name = 'scaled_' + col
        # Create a new column based off the scaled data
        df = df.withColumn(new_column_name,
                           (df[col] - min_values) / (max_values - min_values))
    return df
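As a rough illustration of the arithmetic that min_max_scaler applies to each column, the plain-Python sketch below scales a small list without Spark. The helper name min_max_scale_values and the sample values are hypothetical. The guard for a constant column is an added assumption; the Spark function above has no such guard, so a column whose min equals its max divides by zero there.

```python
def min_max_scale_values(values):
    """Scale a list of numbers to the range [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption for this sketch: a constant column has no range to scale over,
        # so every value is reported as None instead of dividing by zero.
        return [None for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


sample = [10, 20, 30, 50]  # hypothetical column values
scaled = min_max_scale_values(sample)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value maps to 0 and the largest to 1, and every other value lands in between in proportion to its distance from the minimum.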
def column_dropper(df, threshold):
    # Takes a dataframe and threshold for missing values.
    # Returns a dataframe.
    total_records = df.count()
    for col in df.columns:
        # Calculate the percentage of missing values
        missing = df.where(df[col].isNull()).count()
        missing_percent = missing / total_records
        # Drop column if percent of missing is more than threshold
        if missing_percent > threshold:
            df = df.drop(col)
    return df
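The comparison in column_dropper works on a fraction between 0 and 1 (missing / total_records), not on a number out of 100, and it is strict, so a column sitting exactly at the threshold is kept. The plain-Python sketch below mirrors that decision on a small dictionary of columns so it can be checked without Spark. The helper name columns_to_drop and the sample table are hypothetical, and None stands in for a null value.

```python
def columns_to_drop(table, threshold):
    """Return names of columns whose fraction of missing (None) values exceeds threshold."""
    dropped = []
    for name, values in table.items():
        missing = sum(1 for v in values if v is None)
        missing_fraction = missing / len(values)
        # Strict comparison, as in column_dropper: equal to the threshold is kept.
        if missing_fraction > threshold:
            dropped.append(name)
    return dropped


# Hypothetical sample data: each key is a column, each list holds its rows.
table = {
    "age": [31, None, 45, 27],                  # 1 of 4 missing, fraction 0.25
    "income": [None, None, None, 52000],        # 3 of 4 missing, fraction 0.75
    "city": ["Oslo", "Lima", "Pune", "Kyiv"],   # none missing, fraction 0.0
}
print(columns_to_drop(table, 0.5))  # ['income']
```

With a threshold of 0.25 the age column is still kept, because 0.25 is not greater than 0.25. Passing a value such as 50 instead of 0.5 would never drop anything.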