Google Cluster Data Preprocessing - Updated

1. Handling Missing Values

Operation: Identify columns with missing values and assess the extent of missingness.

Python Functions:

# Checking for missing values

df.isnull().sum()

# Fill missing values with median

df['mean_cpu_usage_rate'] = df['mean_cpu_usage_rate'].fillna(df['mean_cpu_usage_rate'].median())

# Drop rows or columns with too many missing values

df = df.dropna(axis=0, thresh=5) # Keep rows with at least 5 non-NaN values
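
To assess the extent of missingness, as the operation above calls for, a short sketch; the 50% cutoff is an illustrative choice, not a rule from the dataset:

# Report the share of missing values per column, as a percentage
missing_pct = df.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))

# Drop columns that are mostly empty (50% is an arbitrary illustrative cutoff)
df = df.loc[:, missing_pct <= 50]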

2. Removing Duplicate Entries

Operation: Check for and remove duplicate rows.

Python Functions:

# Identifying duplicates

duplicates = df[df.duplicated()]

# Removing duplicates

df.drop_duplicates(inplace=True)
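
Exact-duplicate checks can miss rows that differ only in an irrelevant column. A sketch of deduplicating on key columns instead; 'job_id' is a hypothetical identifier here, so adjust the subset to your trace's actual schema:

# Count exact duplicates before dropping anything
print(df.duplicated().sum(), "duplicate rows found")

# Deduplicate on key columns only, keeping the first occurrence
# ('job_id' is a hypothetical column name; use your trace's actual keys)
df = df.drop_duplicates(subset=['job_id', 'start_time'], keep='first')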

3. Correcting Data Types

Operation: Ensure that columns have the correct data types.

Python Functions:

# Convert column to float

df['mean_cpu_usage_rate'] = df['mean_cpu_usage_rate'].astype(float)
# Convert to datetime

df['start_time'] = pd.to_datetime(df['start_time'])

df['end_time'] = pd.to_datetime(df['end_time'])
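
If a column contains stray non-numeric strings, astype(float) raises an error. A more defensive variant coerces bad entries to NaN so step 1 can handle them; the unit='us' argument applies only if your timestamps are stored as raw microsecond offsets, which is an assumption to verify against your trace version:

# Coerce unparseable values to NaN instead of raising
df['mean_cpu_usage_rate'] = pd.to_numeric(df['mean_cpu_usage_rate'], errors='coerce')

# If timestamps are raw microsecond offsets (an assumption; check your trace),
# give pandas the unit explicitly rather than letting it guess a format
df['start_time'] = pd.to_datetime(df['start_time'], unit='us', errors='coerce')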

4. Filtering Outliers

Operation: Detect and manage outliers using statistical techniques.

Python Functions:

# Using Z-score to identify outliers

from scipy.stats import zscore

df['zscore'] = zscore(df['mean_cpu_usage_rate'])

outliers = df[(df['zscore'] < -3) | (df['zscore'] > 3)]

# Removing outliers

df = df[(df['zscore'] >= -3) & (df['zscore'] <= 3)]
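
Z-scores assume roughly normal data, which heavy-tailed usage metrics often violate. A minimal alternative using IQR fences (the 1.5 multiplier is the conventional Tukey choice):

# IQR fences make no normality assumption
q1 = df['mean_cpu_usage_rate'].quantile(0.25)
q3 = df['mean_cpu_usage_rate'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df = df[df['mean_cpu_usage_rate'].between(lower, upper)]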


5. Standardizing Units and Scales

Operation: Ensure all measurements are in consistent units and scales.

Python Functions:


# Convert bytes to megabytes

df['assigned_memory_usage_MB'] = df['assigned_memory_usage'] / (1024 * 1024)

# Normalize or scale data

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['mean_cpu_usage_rate', 'assigned_memory_usage_MB']] = scaler.fit_transform(
    df[['mean_cpu_usage_rate', 'assigned_memory_usage_MB']]
)
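
If the cleaned data will feed a machine-learning model, fit the scaler on the training split only, so test-set statistics do not leak into the transformation. A sketch assuming hypothetical train_df/test_df splits:

# Fit on train, transform both (train_df/test_df are assumed splits)
from sklearn.preprocessing import MinMaxScaler

cols = ['mean_cpu_usage_rate', 'assigned_memory_usage_MB']
scaler = MinMaxScaler()
train_df[cols] = scaler.fit_transform(train_df[cols])
test_df[cols] = scaler.transform(test_df[cols])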

6. Handling Inconsistent Entries

Operation: Clean up inconsistencies in the data.

Python Functions:

# Correct inconsistent entries

# Lowercasing already collapses 'SUM' and 'Sum' into 'sum';
# map the remaining textual variant explicitly
df['aggregation_type'] = df['aggregation_type'].str.lower().replace(
    {'summation': 'sum'}
)
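
Before mapping values, it helps to see what variants actually exist; hidden whitespace is a common culprit. A quick inspection sketch:

# List the distinct values, including NaN, to see what needs normalizing
print(df['aggregation_type'].value_counts(dropna=False))

# Strip stray whitespace that can hide behind seemingly identical categories
df['aggregation_type'] = df['aggregation_type'].str.strip()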

7. Correcting Timestamp Misalignments

Operation: Ensure proper alignment of `start_time` and `end_time`.

Python Functions:

# Find rows where end_time is before start_time

misaligned = df[df['end_time'] < df['start_time']]

# Fix or drop these rows as necessary

df = df[df['end_time'] >= df['start_time']]
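
Once timestamps are aligned, a derived duration column doubles as a sanity check; 'duration' is an illustrative column added here, not part of the original schema:

# With end_time >= start_time enforced, durations must be non-negative
df['duration'] = df['end_time'] - df['start_time']
assert (df['duration'] >= pd.Timedelta(0)).all()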

8. Removing Irrelevant Columns

Operation: Drop columns that are not needed for analysis.

Python Functions:

# Drop unnecessary columns

df.drop(['sample_portion', 'aggregation_type'], axis=1, inplace=True)
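
A defensive variant, in case the script is re-run after the columns are already gone:

# Drop only the columns that are actually present, so the step is re-runnable
cols_to_drop = ['sample_portion', 'aggregation_type']
df = df.drop(columns=[c for c in cols_to_drop if c in df.columns])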

9. Consistent Handling of Zero or Negative Values

Operation: Identify and handle zero or negative values appropriately.

Python Functions:

# Replace zero or negative values with NaN, then impute with the median
df['mean_cpu_usage_rate'] = df['mean_cpu_usage_rate'].mask(df['mean_cpu_usage_rate'] <= 0)

df['mean_cpu_usage_rate'] = df['mean_cpu_usage_rate'].fillna(df['mean_cpu_usage_rate'].median())
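
Before running the replacement above, it is worth measuring how widespread the problem is; if a large fraction of readings are non-positive, median imputation may distort the distribution:

# Quantify zero/negative readings before imputing over them
n_nonpositive = (df['mean_cpu_usage_rate'] <= 0).sum()
print(f"{n_nonpositive} of {len(df)} CPU readings are zero or negative")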

10. Data Sampling and Reduction

Operation: Reduce dataset size without losing critical information.

Python Functions:

# Random sampling of data

sampled_df = df.sample(frac=0.1, random_state=42) # Take 10% sample

# Aggregating data to hourly means

df['hourly_time'] = df['start_time'].dt.floor('H')

aggregated_df = df.groupby('hourly_time').agg({

'mean_cpu_usage_rate': 'mean',

'assigned_memory_usage_MB': 'sum'

}).reset_index()
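
Plain random sampling can under-represent quiet hours. A sketch of stratified sampling that reuses the 'hourly_time' bucket created above to preserve temporal coverage:

# Sample 10% within each hourly bucket rather than across the whole frame
stratified_df = df.groupby('hourly_time', group_keys=False).sample(
    frac=0.1, random_state=42
)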


By following these steps and using the corresponding Python functions, you can effectively clean the
Google Cluster Dataset, preparing it for further analysis and ensuring that the insights you derive are
reliable and accurate.
