Google Cluster Data Preprocessing - Updated
1. Handling Missing Values
Operation: Identify columns with missing values and assess the extent of missingness.
Python Functions:
```python
# Count missing values in each column
df.isnull().sum()

# Impute missing CPU usage with the column median
df['mean_cpu_usage_rate'] = df['mean_cpu_usage_rate'].fillna(df['mean_cpu_usage_rate'].median())
```
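As a quick check, the same calls can be run on a tiny synthetic frame (the column name matches the snippet above; the values are made up):

```python
import pandas as pd

# Toy frame with one gap in the CPU-usage column (values are illustrative)
df = pd.DataFrame({'mean_cpu_usage_rate': [0.2, None, 0.4, 0.6]})

# Count missing values per column
missing = df.isnull().sum()
print(missing['mean_cpu_usage_rate'])  # 1

# Impute with the column median (median of 0.2, 0.4, 0.6 is 0.4)
df['mean_cpu_usage_rate'] = df['mean_cpu_usage_rate'].fillna(df['mean_cpu_usage_rate'].median())
print(df['mean_cpu_usage_rate'].tolist())  # [0.2, 0.4, 0.4, 0.6]
```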
2. Removing Duplicates
Python Functions:
```python
# Identifying duplicates
duplicates = df[df.duplicated()]

# Removing duplicates
df.drop_duplicates(inplace=True)
```
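A minimal, self-contained sketch of the duplicate check on made-up data:

```python
import pandas as pd

# Toy frame containing one exact duplicate row (values are illustrative)
df = pd.DataFrame({'job_id': [1, 1, 2],
                   'mean_cpu_usage_rate': [0.3, 0.3, 0.5]})

# Rows flagged as duplicates: every occurrence after the first
duplicates = df[df.duplicated()]
print(len(duplicates))  # 1

# Drop the repeats in place
df.drop_duplicates(inplace=True)
print(len(df))  # 2
```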
3. Data Type Conversion
Python Functions:
```python
# Ensure CPU usage is stored as a float
df['mean_cpu_usage_rate'] = df['mean_cpu_usage_rate'].astype(float)

# Convert timestamp columns to datetime
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
```
4. Filtering Outliers
Python Functions:
```python
from scipy.stats import zscore

# Compute z-scores for CPU usage
df['zscore'] = zscore(df['mean_cpu_usage_rate'])

# Removing outliers: keep rows within 3 standard deviations of the mean
# (the 3-sigma cutoff is a common convention; adjust to suit the analysis)
df = df[df['zscore'].abs() <= 3].drop(columns='zscore')
```
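For a self-contained illustration, the z-score filter can be reproduced without SciPy, since `scipy.stats.zscore` computes `(x - mean) / std` with the population standard deviation (`ddof=0`); the data below are made up:

```python
import pandas as pd

# Ten typical readings plus one extreme spike (values are illustrative)
df = pd.DataFrame({'mean_cpu_usage_rate': [0.5] * 10 + [50.0]})

# Population z-score, equivalent to scipy.stats.zscore
col = df['mean_cpu_usage_rate']
df['zscore'] = (col - col.mean()) / col.std(ddof=0)

# Keep rows within 3 standard deviations of the mean
df = df[df['zscore'].abs() <= 3].drop(columns='zscore')
print(len(df))  # 10 — the spike is gone
```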
5. Normalizing Numerical Features
Python Functions:
```python
from sklearn.preprocessing import MinMaxScaler

# Scale CPU and memory usage to the [0, 1] range
scaler = MinMaxScaler()
df[['mean_cpu_usage_rate', 'assigned_memory_usage_MB']] = scaler.fit_transform(
    df[['mean_cpu_usage_rate', 'assigned_memory_usage_MB']]
)
```
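Under the hood, `MinMaxScaler` applies `(x - min) / (max - min)` column-wise; here is a dependency-free sketch of the same transform on made-up values:

```python
import pandas as pd

# Illustrative values for the two columns scaled above
df = pd.DataFrame({
    'mean_cpu_usage_rate': [0.1, 0.3, 0.5],
    'assigned_memory_usage_MB': [100.0, 200.0, 400.0],
})

cols = ['mean_cpu_usage_rate', 'assigned_memory_usage_MB']
# The same formula MinMaxScaler applies to each column
df[cols] = (df[cols] - df[cols].min()) / (df[cols].max() - df[cols].min())
print(df[cols].min().tolist(), df[cols].max().tolist())  # [0.0, 0.0] [1.0, 1.0]
```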
6. Standardizing Categorical Values
Python Functions:
```python
# Lowercase category labels and map variant spellings to one canonical form
# (the mapping below is illustrative; substitute the dataset's actual variants)
df['aggregation_type'] = df['aggregation_type'].str.lower().replace(
    {'avg': 'mean', 'average': 'mean'}
)
```
9. Consistent Handling of Zero or Negative Values
Python Functions:
```python
# Replace negative or zero values with NaN and then handle them
df['mean_cpu_usage_rate'] = df['mean_cpu_usage_rate'].mask(df['mean_cpu_usage_rate'] <= 0)
df['mean_cpu_usage_rate'] = df['mean_cpu_usage_rate'].fillna(df['mean_cpu_usage_rate'].median())
```
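A small synthetic example of the mask-then-impute pattern (the values are made up; a negative CPU-usage reading is impossible and is treated as missing):

```python
import pandas as pd

# Column with one impossible negative reading and one zero (values are illustrative)
df = pd.DataFrame({'mean_cpu_usage_rate': [0.2, -1.0, 0.6, 0.0]})

# Mask non-positive values to NaN, then impute with the median of the rest
col = df['mean_cpu_usage_rate']
df['mean_cpu_usage_rate'] = col.mask(col <= 0)
df['mean_cpu_usage_rate'] = df['mean_cpu_usage_rate'].fillna(df['mean_cpu_usage_rate'].median())
print(df['mean_cpu_usage_rate'].tolist())  # [0.2, 0.4, 0.6, 0.4]
```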
10. Time-Based Aggregation
Python Functions:
```python
# Floor start times to the hour, then aggregate per hourly bucket
df['hourly_time'] = df['start_time'].dt.floor('H')
aggregated_df = df.groupby('hourly_time').agg({
    'mean_cpu_usage_rate': 'mean',
    'assigned_memory_usage_MB': 'sum'
}).reset_index()
```
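To see hourly aggregation end to end, here is a self-contained sketch on three made-up records, two of which fall in the same hour:

```python
import pandas as pd

# Three records: two in the 10:00 hour, one in the 11:00 hour (values are illustrative)
df = pd.DataFrame({
    'start_time': pd.to_datetime(['2019-05-01 10:05',
                                  '2019-05-01 10:40',
                                  '2019-05-01 11:10']),
    'mean_cpu_usage_rate': [0.2, 0.4, 0.6],
    'assigned_memory_usage_MB': [100.0, 200.0, 300.0],
})

# Floor to the hour, then average CPU and sum memory per bucket
df['hourly_time'] = df['start_time'].dt.floor('H')
aggregated_df = df.groupby('hourly_time').agg({
    'mean_cpu_usage_rate': 'mean',
    'assigned_memory_usage_MB': 'sum',
}).reset_index()
print(len(aggregated_df))  # 2 hourly buckets
```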
By following these steps and utilizing the corresponding Python functions, you can effectively clean the
Google Cluster Dataset, preparing it for further analysis and ensuring that the insights you derive will be
reliable and accurate.