ASSi2 DSBDA
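Outliers are data points that differ significantly from the majority of the other data points in a set. Here a value is treated as an outlier when its z-score is above 3 or below -3, that is, when it lies more than 3 standard deviations from the mean.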
#Identify Outliers:
outliers = (z_scores > 3) | (z_scores < -3)
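The notes use z_scores without showing how it was computed. The following is a minimal sketch of one way to do it, assuming a DataFrame named data with a numeric 'Fees' column; the sample values are made up for illustration, and the choice of the population standard deviation (ddof=0) is an assumption.

```python
import pandas as pd

# Hypothetical sample data: the column name 'Fees' follows the notes,
# the values are invented for illustration.
data = pd.DataFrame(
    {"Fees": [100, 102, 98, 101, 99, 103, 97, 100, 101, 99, 102, 98, 100, 5000]}
)

# Keep only numeric columns, since a z-score is undefined for text.
numeric = data.select_dtypes(include="number")

# z-score = (value - column mean) / column standard deviation.
# ddof=0 (population standard deviation) is an assumption of this sketch.
z_scores = (numeric - numeric.mean()) / numeric.std(ddof=0)

# Same condition as in the notes: True where a value is more than
# 3 standard deviations away from the mean.
outliers = (z_scores > 3) | (z_scores < -3)

print(outliers["Fees"].sum())  # number of flagged values in 'Fees'
```

With very small samples the threshold of 3 can never be reached: using the population standard deviation, the largest possible z-score among n values is sqrt(n - 1), so at least 11 values are needed before any point can be flagged.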
The mask method replaces every value in the DataFrame for which a condition is True and leaves all other values unchanged. Here the flagged outliers are replaced with NaN.
#Mask Outliers in the DataFrame:
data_no_outliers = data.select_dtypes(include='number').mask(outliers, np.nan)
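As a small illustration of what the mask call does, the sketch below applies it to a made-up DataFrame. The 'City' column and the hand-written outliers mask are hypothetical and only serve to show the behaviour.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one numeric column and one text column.
data = pd.DataFrame({"Fees": [100.0, 250.0, 5000.0], "City": ["A", "B", "C"]})

# Hand-written boolean mask for illustration: only the last fee is an outlier.
outliers = pd.DataFrame({"Fees": [False, False, True]})

# select_dtypes keeps only the numeric columns; mask then replaces the
# values where the condition is True with NaN.
data_no_outliers = data.select_dtypes(include="number").mask(outliers, np.nan)

print(data_no_outliers)
```

Note that select_dtypes drops the non-numeric columns, so data_no_outliers contains only 'Fees' in this sketch. If rows with a masked value should be removed entirely, data_no_outliers.dropna() can be applied afterwards.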
In this case, the skewness before the transformation was -0.5783358295678959, indicating
negative skewness, which means the distribution was already somewhat left-skewed. After applying the
logarithm transformation, the skewness became more negative (-1.0555250171550188), so the
transformation lengthened the left tail instead of reducing the skew. This is expected: a logarithm
compresses large values more than small ones, which helps right-skewed data but makes left-skewed
data more skewed. Reducing skewness is one step towards achieving a more symmetric distribution, but
achieving a perfectly normal distribution is not always necessary or possible in practice. Making the
distribution more symmetric and closer to normal can still be beneficial.
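The effect described above can be checked with a short sketch that compares skewness before and after a transformation. The values below are made up (they are not the assignment's data), and pandas' Series.skew() is used here as an assumption about how the skewness was measured.

```python
import numpy as np
import pandas as pd

# Made-up, left-skewed positive values: most are large, a few are small.
fees = pd.Series([10, 40, 60, 70, 80, 85, 90, 92, 95, 98], dtype=float)

# Sample skewness of the raw values (negative means left-skewed).
skew_before = fees.skew()

# Skewness after a logarithm and after a square root transformation.
skew_log = np.log(fees).skew()
skew_sqrt = np.sqrt(fees).skew()

print("Skewness before:", skew_before)
print("Skewness after log:", skew_log)
print("Skewness after sqrt:", skew_sqrt)
```

For this left-skewed sample the logarithm makes the skewness more negative, which matches the direction of the change reported above.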
Displaying the transformed data:
print("\nTransformed Data: ")
#Focus on the output of Fees and Fees_sqrt (data is transformed from a big number to its
#square root for easier understanding and handling)
print(data_no_outliers)
#Create a subplot grid with 1 row and 2 columns and select the first (leftmost) subplot.
#The parameters (1, 2, 1) mean 1 row, 2 columns, and plot number 1.
plt.subplot(1, 2, 1)
#Plot a histogram of the 'Fees_sqrt' column of the DataFrame data_no_outliers using
#Seaborn's histplot() function; kde=True overlays a kernel density estimate curve.
sns.histplot(data_no_outliers['Fees_sqrt'], kde=True)
plt.title('Histogram of Square Root-transformed Fees')
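plt.subplot(1, 2, 2) #Select the second (rightmost) subplot
#Q-Q plot against a normal distribution; needs: from scipy.stats import probplot
#dropna() guards against NaN values left behind by masking the outliers
probplot(data_no_outliers['Fees_sqrt'].dropna(), dist="norm", plot=plt)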
plt.title('Q-Q Plot of Square Root-transformed Fees')
plt.show()