
Data Quality Analysis using Mobile User Data

Team Members

Ayush Das IMT2019014
Balkaran Singh IMT2019016
Keshav Mittal IMT2019044
Pratyush Upadhyay IMT2019066
Rohan Thakar IMT2019071
Daksh Agarwal IMT2019505
Introduction:
In this project, we try to discover data quality rules using mobile user data. The project is an implementation of machine learning models that predict personality traits from mobile user data. The underlying paper focuses on understanding, quantifying, and evaluating individual differences in behavior, feelings, and thoughts. We analyzed behavioral data obtained from 743 participants over 30 consecutive days of smartphone sensing (25,347,089 logging events). This involved using data quality rules to analyze and modify the dataset. We computed 15,692 variables about individual behavior from five semantic categories (communication & social behavior, music listening behavior, app usage behavior, mobility, and general day- & nighttime activity). Using a machine learning approach (random forest, elastic net), we show how these variables can be used to predict self-assessments of the big five personality traits at the factor and facet level. The results reveal distinct behavioral patterns that proved to be differentially predictive of big five personality traits. Overall, the results show how a combination of rich behavioral data obtained with smartphone sensing and the use of machine learning techniques can help advance personality research and can inform both practitioners and researchers about the different behavioral patterns of personality.

What is Data Quality?

Data quality measures how well a dataset meets criteria for accuracy, completeness, validity,
consistency, uniqueness, timeliness, and fitness for purpose, and it is critical to all data
governance initiatives within an organization. Data quality standards ensure that companies
make data-driven decisions to meet their business goals. If data issues, such as duplicate data,
missing values, and outliers, aren’t properly addressed, businesses increase their risk for
negative business outcomes.

Data Quality Dimensions


There are six primary, or core, dimensions to data quality. These are the metrics analysts use to
determine the data’s viability and its usefulness to the people who need it.

Accuracy - The data must conform to actual, real-world scenarios and reflect real-world objects and events.

Completeness - Completeness measures whether all mandatory values are present in the data.

Consistency - Data consistency describes the data's uniformity as it moves across applications and networks and when it comes from multiple sources.

Timeliness - Timely data is information that is readily available whenever it is needed. This dimension also covers keeping the data current.

Uniqueness - Uniqueness means that no duplicate or redundant information overlaps across the datasets.

Validity - Data must be collected according to the organization's defined business rules and parameters.

Dataset:
There were 15,692 variables in the dataset, which roughly corresponded to the behavioral categories of communication, app usage, music consumption, general day- and nighttime activity (day- and nighttime dependency was treated as a distinct category in the analyses), and mobility, along with 35 (5 factors and 30 facets) personality criteria.
Gender, age, and education were solely used for descriptive statistics and were not included as predictors in any of the models.
Data Profiling and Analysis:
We used Sweetviz to perform data profiling and data quality analysis on our dataset. First, the dataset is loaded into Python and analyzed with Sweetviz, which generates a report that includes an overview of the data, the data types, and the number of missing values. We then used Sweetviz to generate visualizations that show the distribution of the data, identify outliers, and check for anomalies. For example, histograms were generated to show the data distribution for each variable, and scatter plots were used to check for correlations between variables. Additionally, we generated summary statistics such as the mean, median, and standard deviation for each variable, which help identify potential issues such as outliers or skewness. By using Sweetviz for this analysis, we gained valuable insights into the structure and quality of the data, which helps build a more accurate predictive model for personality traits from the mobile data of users. Our dataset had almost 15,000 features; we removed features whose null percentage exceeded a threshold, which reduced the number of columns to 5372.
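As an illustration, a minimal sketch of this profiling step is shown below; the file name users.csv and the 60% null cutoff are assumptions for illustration, not the exact values from our pipeline.

import pandas as pd
import sweetviz as sv

# Load the mobile user dataset (file name is hypothetical)
df = pd.read_csv("users.csv")

# Generate an HTML profiling report: data types, missing values,
# per-feature distributions, and summary statistics
report = sv.analyze(df)
report.show_html("data_quality_report.html")

# Drop features whose null percentage exceeds a chosen threshold
null_threshold = 0.6  # assumed cutoff for illustration
df = df.loc[:, df.isnull().mean() <= null_threshold]
print(df.shape)  # reduced column count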

Here is the analysis of some of the features from the Sweetviz report
Data Quality Analysis and Preprocessing:
Analysis of Skewness and its Removal:
Skewness is a statistical measure that describes the asymmetry of a distribution. It quantifies the extent to which the data deviates from a symmetrical (normal or Gaussian) distribution. Skewness is an essential concept in descriptive statistics and data analysis, as it provides insights into the shape and characteristics of a dataset.

Skewness can be categorized into three types:


1. Positive Skewness (Right-skewed): In a positively skewed distribution, the tail of the distribution
extends towards the right, indicating that the majority of the data points are concentrated on the
left side. The mean is typically greater than the median in a positively skewed distribution.
2. Negative Skewness (Left-skewed): In a negatively skewed distribution, the tail of the
distribution extends towards the left, indicating that the majority of the data points are
concentrated on the right side. The mean is typically less than the median in a negatively skewed
distribution.
3. Zero Skewness: A distribution is considered to have zero skewness (symmetrical) if the data is
evenly distributed around the mean, resulting in equal proportions on both sides of the
distribution.
Skewness can create several problems in a dataset:
1. Biased statistical analysis: Skewness affects the validity of statistical measures that assume a
normal distribution. Techniques such as hypothesis testing and confidence intervals are typically
based on the assumption of normality. If the data is significantly skewed, it can bias the estimates
of parameters and lead to incorrect inferences.
2. Impact on predictive modelling: Many machine learning algorithms assume that the predictors
follow a normal distribution or are at least approximately symmetric. Skewed variables can
negatively impact the performance of such models, resulting in suboptimal predictions. Skewness
can cause models to assign more weight to extreme values and potentially introduce errors in
prediction.
3. Violation of assumptions: Skewed variables can violate the assumptions of linear regression
models, such as the assumption of linearity, independence, and homoscedasticity. This can lead
to biased coefficient estimates, inefficient parameter estimation, and inaccurate predictions.
4. Misinterpretation of data: Skewed distributions can distort the interpretation of data and make
it challenging to understand the true patterns and relationships within the dataset. Visualizations,
such as histograms or box plots, may not accurately represent the underlying distribution when
significant skewness is present.
Addressing skewness through appropriate transformations is crucial to ensure the validity of
statistical analyses, improve the performance of predictive models, and gain accurate insights
from the data.
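As a quick illustration, a minimal sketch using numpy and scipy shows how skewness can be measured, and that in a right-skewed sample the mean typically exceeds the median:

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# A right-skewed (log-normal) sample and a roughly symmetric (normal) sample
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)
symmetric = rng.normal(loc=0.0, scale=1.0, size=10_000)

print(skew(right_skewed))  # clearly positive
print(skew(symmetric))     # close to zero
print(right_skewed.mean() > np.median(right_skewed))  # True: mean > median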

There are various ways to reduce skew: we can apply certain transformations to the feature column, though a transformation can also increase the skewness, so its effect must be checked. The initial plot of one of our columns for the analysis of skewness is shown below. We can clearly see that the plot is positively skewed, so we applied several transformations (square root, Box-Cox, log, etc.) to check whether the skewness could be reduced. For this feature column, the log transformation reduced the skewness significantly; for other columns, some other transformation might work better. We therefore wrote a function that chooses the transformation that minimizes the skewness, sketched below.
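A minimal sketch of such a function is shown next; the candidate transformations and the shift used to keep values strictly positive are assumptions about the exact implementation.

import numpy as np
import pandas as pd
from scipy.stats import skew, boxcox

def reduce_skewness(col: pd.Series) -> pd.Series:
    """Return the transformed column with the smallest absolute skewness."""
    # Shift so all values are strictly positive (log and Box-Cox require > 0);
    # assumes missing values have already been handled
    shifted = (col - col.min() + 1.0).to_numpy(dtype=float)
    candidates = {
        "original": col.to_numpy(dtype=float),
        "sqrt": np.sqrt(shifted),
        "log": np.log(shifted),
        "boxcox": boxcox(shifted)[0],
    }
    # Pick the transformation whose result is closest to symmetric
    best = min(candidates, key=lambda name: abs(skew(candidates[name])))
    return pd.Series(candidates[best], index=col.index, name=col.name)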

Imputing Null Values using Various Methods:

Certain columns had more than 60% null values; those columns were removed from the data frame. For the rest of the columns we used various techniques to deal with the null values. For columns with a small percentage of null values (less than 10%), we imputed those values with the mean or median, as sketched below. Imputing null values is a common data preprocessing step to handle missing data, and the choice between the median and the mean depends on the distribution of the data, particularly the presence of skewness. We imputed columns whose skewness exceeded a threshold with the median, and the rest with the mean.

When dealing with highly skewed data, imputing null values with the median is generally advisable. The median is a robust measure of central tendency that is less affected by extreme values or outliers than the mean. In highly skewed distributions, outliers can significantly influence the mean, leading to biased imputations; the median, being resistant to extreme values, yields more robust imputations. On the other hand, when the data is not highly skewed, imputing with the mean is a reasonable choice. The mean is appropriate when the data follows a relatively symmetric distribution or when the skewness is not substantial. However, imputing with the mean remains sensitive to outliers, as extreme values can disproportionately affect it. Therefore, if a dataset contains outliers that could significantly impact the mean, it is advisable to consider alternative imputation methods, such as imputing with the median or using more sophisticated techniques like multiple imputation or model-based imputation.
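A minimal sketch of this rule is given below, assuming a skew threshold of 1.0 and the 10% null cutoff mentioned above (the threshold value itself is illustrative).

import pandas as pd
from scipy.stats import skew

SKEW_THRESHOLD = 1.0  # assumed cutoff for "highly skewed"
NULL_CUTOFF = 0.10    # only mean/median-impute columns with < 10% nulls

def impute_mean_median(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for col in df.select_dtypes(include="number"):
        null_frac = df[col].isnull().mean()
        if 0 < null_frac < NULL_CUTOFF:
            if abs(skew(df[col].dropna())) > SKEW_THRESHOLD:
                df[col] = df[col].fillna(df[col].median())  # robust to outliers
            else:
                df[col] = df[col].fillna(df[col].mean())
    return df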

For imputing null values in the rest of the columns, we used predictive modelling that exploits the relationships between columns via a correlation matrix. For each target feature, we selected the features most highly correlated with it. For imputation, we then trained a model on the rows where the target is non-null and used that model to predict the null values of the target column. The heatmap below shows the correlation matrix for a few features. For some columns, we also used a KNN imputer.
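A minimal sketch of the model-based imputation for a single target column follows; the 0.5 correlation cutoff and the use of a random forest as the predictive model are illustrative assumptions.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import KNNImputer

def impute_with_model(df: pd.DataFrame, target: str, corr_cutoff: float = 0.5) -> pd.Series:
    """Predict the missing values of `target` from its most correlated features."""
    corr = df.corr(numeric_only=True)[target].drop(target).abs()
    predictors = corr[corr > corr_cutoff].index.tolist()
    known = df[df[target].notnull()].dropna(subset=predictors)
    missing = df[df[target].isnull()].dropna(subset=predictors)
    model = RandomForestRegressor(random_state=0)
    model.fit(known[predictors], known[target])
    filled = df[target].copy()
    filled.loc[missing.index] = model.predict(missing[predictors])
    return filled

# For some columns, a KNN imputer was used instead:
# df_filled = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(df),
#                          columns=df.columns, index=df.index)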
Handling Outliers:

The final step was the removal of outliers from the columns. For this, we analyzed the different columns and computed their kurtosis and skewness values. If those values do not fall in a well-defined range, we perform outlier removal using the interquartile range (IQR) and an isolation forest with 100 trees, as sketched below.
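A minimal sketch of both strategies, assuming the standard 1.5 x IQR fence:

import pandas as pd
from sklearn.ensemble import IsolationForest

def remove_outliers_iqr(col: pd.Series, k: float = 1.5) -> pd.Series:
    """Keep values inside the [Q1 - k*IQR, Q3 + k*IQR] fence."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return col[(col >= q1 - k * iqr) & (col <= q3 + k * iqr)]

def remove_outliers_iforest(col: pd.Series) -> pd.Series:
    """Flag outliers with an isolation forest of 100 trees."""
    clean = col.dropna()
    forest = IsolationForest(n_estimators=100, random_state=0)
    labels = forest.fit_predict(clean.to_numpy().reshape(-1, 1))
    return clean[labels == 1]  # 1 = inlier, -1 = outlier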

Before removing outliers:

After removing outliers:
We also eliminated extreme outliers by excluding data points that are unreasonably far (greater
than 100 times the median absolute deviation) from the sample median for the remaining
columns. This was done to minimize the impact of possible logging errors on the modelling
process. Finally, the dataset was merged with 35 (5 factors and 30 facets) personality criteria
along with age, gender, and education. The final dataset consists of 1852 predictor variables
along with 35 (5 factors and 30 facets) personality criteria.
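A minimal sketch of this MAD-based filter (the 100 x MAD cutoff comes from the text; scipy's median_abs_deviation is one way to compute it):

import pandas as pd
from scipy.stats import median_abs_deviation

def remove_extreme_outliers(col: pd.Series, factor: float = 100.0) -> pd.Series:
    """Drop points farther than factor * MAD from the sample median."""
    med = col.median()
    mad = median_abs_deviation(col.dropna())
    if mad == 0:
        return col  # near-constant column: nothing sensible to filter
    return col[(col - med).abs() <= factor * mad]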
Then, we removed ids with 0 unique apps (i.e., ids whose "daily_mean_num_unique_apps" is 0). Furthermore, we excluded variables with less than 2% unique values, as they would add only a little information to the modelling process. In this step, we reduced the initial number of 15,692 variables to the final dataset of 1852 variables; a sketch of both filters follows.
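A minimal sketch of these two filters; the column name daily_mean_num_unique_apps comes from the text, while computing the uniqueness ratio over non-null rows is an assumption about the exact definition.

import pandas as pd

def filter_dataset(df: pd.DataFrame, unique_frac: float = 0.02) -> pd.DataFrame:
    # Drop ids that used no apps at all
    df = df[df["daily_mean_num_unique_apps"] > 0]
    # Keep only variables with at least 2% unique values
    keep = [c for c in df.columns
            if df[c].nunique() / max(df[c].count(), 1) >= unique_frac]
    return df[keep]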
Models used:
We used elastic net regularized linear regression and non-linear tree-based random forest models, and trained machine learning models for the prediction of all personality factors. For model benchmarking, we compared the predictive performance of the elastic net regularized linear regression models with that of the non-linear tree-based random forest models. We performed additional pre-processing and hyperparameter tuning using 5-fold cross-validation.

Elastic Net Regularized Linear Regression:

Linear regression is the standard algorithm for regression that assumes a linear relationship
between inputs and the target variable. An extension to linear regression involves adding penalties
to the loss function during training that encourage simpler models that have smaller coefficient
values. These extensions are referred to as regularized linear regression or penalized linear
regression.

Elastic net is a popular type of regularized linear regression that combines two popular penalties,
specifically the L1 and L2 penalty functions.
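A minimal sketch of an elastic net with 5-fold hyperparameter tuning in scikit-learn; the penalty grid is an assumption, and X and y stand for the predictor matrix and one personality criterion.

from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize predictors, then fit the penalized regression
pipe = make_pipeline(StandardScaler(), ElasticNet(max_iter=10_000))
param_grid = {
    "elasticnet__alpha": [0.01, 0.1, 1.0],    # overall penalty strength
    "elasticnet__l1_ratio": [0.1, 0.5, 0.9],  # mix between L1 and L2
}
search = GridSearchCV(pipe, param_grid, cv=5)  # 5-fold cross-validation
# search.fit(X, y)  # X: behavioral variables, y: one personality criterion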

Random Forest Regression:

Random Forest Regression is a supervised learning algorithm that uses an ensemble learning
method for regression. The ensemble learning method is a technique that combines predictions
from multiple machine learning algorithms to make a more accurate prediction than a single
model.
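A corresponding sketch for the random forest model; the hyperparameter values are illustrative, not the tuned ones.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rf = RandomForestRegressor(random_state=0)
param_grid = {
    "n_estimators": [100, 500],      # number of trees in the ensemble
    "max_features": ["sqrt", 0.33],  # features considered at each split
}
search = GridSearchCV(rf, param_grid, cv=5)
# search.fit(X, y)  # same predictors and criterion as above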
Evaluations:
We evaluated the predictive performance of the models based on the Pearson correlation (r)
between the predicted values and the person-parameter trait estimates from the
self-reported values of the respective personality trait variables. Additionally, we considered
the root mean squared error (RMSE) and the coefficient of determination (R2) as measures of
predictive performance.
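A minimal sketch of these three measures, assuming y_true and y_pred hold the self-reported and predicted trait scores:

import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    r, _ = pearsonr(y_true, y_pred)                     # Pearson correlation
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root mean squared error
    r2 = r2_score(y_true, y_pred)                       # coefficient of determination
    return r, rmse, r2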
Descriptive Statistics:

These are descriptive statistics of the demographic and personality trait variables for the 629 participants.
Results:
The results show that levels of the big five personality traits were successfully predicted from records of smartphone usage for the majority of factors and facets. They also show that the non-linear random forest models on average outperformed the elastic net models in both prediction performance and the number of successfully predicted criteria.

Per-trait results are shown in the figures for Openness, Conscientiousness, Extraversion, Agreeableness, and Emotional Stability, followed by a comparison between the random forest and elastic net linear regression models.
References

Stachl, Clemens, et al. "Predicting Personality from Patterns of Behavior Collected with Smartphones." Proceedings of the National Academy of Sciences, vol. 117, no. 30, 28 July 2020, pp. 17680-17687.

"Repository: Predicting Personality from Patterns of Behavior Collected with Smartphones." Osf.io, 12 Oct. 2018, osf.io/kqjhr/, 10.17605/OSF.IO/KQJHR. Accessed 17 May 2022.

Brownlee, Jason. "How to Develop Elastic Net Regression Models in Python." Machine Learning Mastery, 6 Oct. 2020.
