Report
Data quality measures how well a dataset meets criteria for accuracy, completeness, validity,
consistency, uniqueness, timeliness, and fitness for purpose, and it is critical to all data
governance initiatives within an organization. Data quality standards ensure that companies
make data-driven decisions to meet their business goals. If data issues, such as duplicate
records, missing values, and outliers, aren't properly addressed, businesses increase their risk
of negative business outcomes.
Accuracy - The data must conform to actual, real-world scenarios and reflect real-world
objects and events.
Completeness - Completeness measures the data's ability to successfully deliver all the
mandatory values.
Consistency - Data consistency describes the data's uniformity as it moves across applications
and networks and when it comes from multiple sources.
Timeliness - Timely data is information that is readily available whenever it's needed. This
dimension also covers keeping the data current.
Validity - Data must be collected according to the organization’s defined business rules and
parameters.
Dataset:
There were 15,694 variables in the dataset, which roughly corresponded to the behavioral
categories of communication, app usage, music consumption, general day- and nighttime
activity (day- and nighttime dependency was treated as a distinct category in the analyses), and
mobility, along with 35 personality criteria (5 factors and 30 facets).
Gender, age, and education were solely used for descriptive statistics and were not included as
predictors in any of the models.
Data Profiling and Analysis:
We used Sweetviz to perform data profiling and data quality analysis on our dataset. To do this,
the dataset is first loaded into Python and analyzed using Sweetviz, which generates a report that
includes an overview of the data, the data types, and the number of missing values. We then used
Sweetviz to generate visualizations that show the distribution of the data, identify outliers, and
check for anomalies. For example, histograms were generated to show the distribution of each
variable, and scatter plots were used to check for correlations between variables. Additionally,
we generated summary statistics such as the mean, median, and standard deviation for each
variable, which can help identify potential issues such as outliers or skewness. By using Sweetviz
to perform data quality analysis on the dataset, we gained valuable insights into the structure and
quality of the data, which helps in building a more accurate predictive model for personality
traits from users' mobile data. Our dataset had almost 15,000 features; we removed every feature
whose share of null values exceeded a threshold percentage, which reduced the columns to 5,372.
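As a minimal sketch of this profiling step (the file name and the exact null-value threshold are illustrative assumptions, not the project's actual values):

import pandas as pd
import sweetviz as sv

# Load the mobile sensing dataset (path is illustrative).
df = pd.read_csv("mobile_sensing_data.csv")

# Generate the Sweetviz profiling report: overview, data types,
# missing values, distributions, and summary statistics.
report = sv.analyze(df)
report.show_html("sweetviz_report.html")

# Drop features whose share of null values exceeds a threshold
# (the 50% cutoff here is an assumed value).
NULL_THRESHOLD = 0.5
df = df.loc[:, df.isnull().mean() <= NULL_THRESHOLD]
print(df.shape)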
Here is the analysis of some of the features from the Sweetviz report:
Data Quality Analysis and Preprocessing:
Analysis of Skewness and its Removal:
Skewness is a statistical measure that describes the asymmetry of a distribution. It quantifies the
extent to which the data deviates from a symmetrical (normal or Gaussian) distribution.
Skewness is an essential concept in descriptive statistics and data analysis, as it provides insights
into the shape and characteristics of a dataset.
There are various ways by which we can reduce the skew. We can apply certain transformations
to a feature column to reduce its skewness; however, a transformation might also increase it, so
every candidate has to be checked.
The initial plot of one of our columns for the analysis of skewness is shown below:
We can clearly see that the plot is positively skewed, so we apply several transformations
(square root, Box-Cox, log, etc.) to check whether we can reduce it.
For this feature column, the log transformation was able to reduce the skewness significantly.
For other columns, some other transformation might work better, so we made a function to
reduce skewness, which chooses whichever transformation minimizes the skewness the most.
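A sketch of such a function is shown below; it tries the transformations named above and keeps the one with the smallest absolute skewness (the candidate set and the selection rule are assumptions about how such a helper could look):

import numpy as np
from scipy import stats

def reduce_skewness(x):
    # Return the name and values of the transformation that brings
    # the skewness of x closest to zero.
    # Assumes x is a 1-D array of non-negative values; Box-Cox
    # additionally requires strictly positive inputs.
    candidates = {"original": x,
                  "sqrt": np.sqrt(x),
                  "log": np.log1p(x)}
    if (x > 0).all():
        candidates["boxcox"] = stats.boxcox(x)[0]
    best = min(candidates, key=lambda k: abs(stats.skew(candidates[k])))
    return best, candidates[best]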
To impute the null values in the remaining columns, we used predictive modelling that captures
the relationships between columns through a correlation matrix. For each target feature, we
selected the features most highly correlated with it, trained a model on the rows where the
target is non-null, and then used that model to predict the target's null values. The heatmap
below shows the correlation matrix for a few features. For some columns we also used a KNN
imputer.
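A hedged sketch of this correlation-guided imputation (the model choice, the number of correlated predictors, and the median fallback for predictor nulls are all assumptions, not the project's exact settings):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import KNNImputer

def impute_from_correlated(df, target, k=10):
    # Pick the k features most correlated with the target column.
    corr = df.corr(numeric_only=True)[target].abs().drop(target)
    predictors = corr.nlargest(k).index.tolist()
    # Fill predictor nulls with column medians so the model can fit.
    X = df[predictors].fillna(df[predictors].median())
    known = df[target].notna()
    # Train on rows where the target is observed, predict the rest.
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[known], df.loc[known, target])
    df.loc[~known, target] = model.predict(X[~known])
    return df

# For the columns handled with KNN imputation instead
# (numeric_cols is a placeholder for the numeric feature subset):
# imputed = KNNImputer(n_neighbors=5).fit_transform(df[numeric_cols])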
Handling Outliers:
The final step was the removal of outliers from the columns. For this, we analyzed the kurtosis
and skewness of each column; if those values did not fall within a well-defined range, we
performed outlier removal using the interquartile range (IQR) and an Isolation Forest with
100 trees.
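A sketch of both techniques (the 1.5 x IQR fence is the conventional choice; the 100-tree setting mirrors the text, and the helper names are ours):

from sklearn.ensemble import IsolationForest

def remove_outliers_iqr(series):
    # `series` is a pandas Series; values outside
    # [Q1 - 1.5*IQR, Q3 + 1.5*IQR] become NaN.
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.where(series.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr))

def inlier_mask_iforest(X):
    # fit_predict returns 1 for inliers and -1 for outliers.
    iso = IsolationForest(n_estimators=100, random_state=0)
    return iso.fit_predict(X) == 1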
Linear regression is the standard algorithm for regression that assumes a linear relationship
between inputs and the target variable. An extension to linear regression involves adding penalties
to the loss function during training that encourage simpler models that have smaller coefficient
values. These extensions are referred to as regularized linear regression or penalized linear
regression.
Elastic net is a popular type of regularized linear regression that combines two popular penalties,
specifically the L1 and L2 penalty functions.
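A minimal sketch of an elastic net in scikit-learn; alpha scales the overall penalty and l1_ratio mixes the L1 and L2 terms (both values here are illustrative, not the tuned ones):

from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardizing first matters because the penalty acts on coefficient sizes.
enet = make_pipeline(StandardScaler(),
                     ElasticNet(alpha=1.0, l1_ratio=0.5))
# enet.fit(X_train, y_train); y_pred = enet.predict(X_test)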
Random Forest Regression is a supervised learning algorithm that uses an ensemble learning
method for regression. The ensemble learning method is a technique that combines predictions
from multiple machine learning algorithms to make a more accurate prediction than a single
model.
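Correspondingly, a random forest regressor averages the predictions of many decision trees, each fit on a bootstrap sample of the training data (hyperparameters here are illustrative):

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=0)
# rf.fit(X_train, y_train); y_pred = rf.predict(X_test)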
Evaluations:
We evaluated the predictive performance of the models based on the Pearson correlation (r)
between the predicted values and the person-parameter trait estimates from the
self-reported values of the respective personality trait variables. Additionally, we considered
the root mean squared error (RMSE) and the coefficient of determination (R²) as measures of
predictive performance.
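A sketch of how these three metrics can be computed for one trait (the function and variable names are ours):

import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    # Pearson r between predictions and self-report-based trait estimates.
    r, _ = pearsonr(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return {"r": r, "RMSE": rmse, "R2": r2_score(y_true, y_pred)}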
Descriptive Statistics:
These are the descriptive statistics of the demographic and personality trait variables for the
629 participants:
Results:
The results show that levels of the Big Five personality traits were successfully predicted from
records of smartphone usage for the majority of factors and facets. The results also show that
the non-linear random forest models on average outperformed the elastic net models in both
prediction performance and the number of successfully predicted criteria.
[Result figures: Openness, Conscientiousness, Extraversion, Agreeableness, and Emotional
Stability; comparison between random forest and elastic net linear regression models.]
References
Brownlee, Jason. “How to Develop Elastic Net Regression Models in Python.” Machine
Learning Mastery, 6 Oct. 2020.