
Data Quality Analysis using Mobile User Data

Team Members

Ayush Das IMT2019014
Balkaran Singh IMT2019016
Keshav Mittal IMT2019044
Pratyush Upadhyay IMT2019066
Rohan Thakar IMT2019071
Daksh Agarwal IMT2019505
Introduction:
In this project, we try to discover data quality rules using mobile user data. The project is an implementation of machine learning models that predict personality traits from mobile user data. The underlying paper focuses on understanding, quantifying, and evaluating individual differences in behavior, feelings, and thoughts. We analyzed behavioral data obtained from 743 participants over 30 consecutive days of smartphone sensing (25,347,089 logging events). This involved using data quality rules to analyze and modify the dataset. We computed 15,692 variables about individual behavior from five semantic categories (communication & social behavior, music listening behavior, app usage behavior, mobility, and general day- & nighttime activity). Using a machine learning approach (random forest, elastic net), we show how these variables can be used to predict self-assessments of the big five personality traits at the factor and facet level. The results reveal distinct behavioral patterns that proved to be differentially predictive of big five personality traits. Overall, the results show how a combination of rich behavioral data obtained with smartphone sensing and the use of machine learning techniques can help advance personality research and can inform both practitioners and researchers about the different behavioral patterns of personality.

What is Data Quality?

Data quality measures how well a dataset meets criteria for accuracy, completeness, validity,
consistency, uniqueness, timeliness, and fitness for purpose, and it is critical to all data
governance initiatives within an organization. Data quality standards ensure that companies
make data-driven decisions to meet their business goals. If data issues, such as duplicate data,
missing values, and outliers, aren’t properly addressed, businesses increase their risk for
negative business outcomes.

Data Quality Dimensions


There are six primary, or core, dimensions to data quality. These are the metrics analysts use to
determine the data’s viability and its usefulness to the people who need it.

Accuracy - The data must conform to actual, real-world scenarios and reflect real-world objects and events.

Completeness - Completeness measures whether all mandatory values are present in the data.

Consistency - Data consistency describes the data's uniformity as it moves across applications and networks and when it comes from multiple sources.

Timeliness - Timely data is information that is readily available whenever it is needed. This dimension also covers keeping the data current.

Uniqueness - Uniqueness means that no duplicate or redundant information overlaps across the datasets.

Validity - Data must be collected according to the organization's defined business rules and parameters.

Dataset:
There were 15,692 variables in the dataset, which roughly corresponded to the behavioral categories of communication, app usage, music consumption, general day- and nighttime activity (day- and nighttime dependency was treated as a distinct category in the analyses), and mobility, along with 35 (5 factors and 30 facets) personality criteria.
Gender, age, and education were solely used for descriptive statistics and were not included as predictors in any of the models.
Data Profiling and Analysis:
We used Sweetviz to perform data profiling and data quality analysis on our dataset. First, the dataset is loaded into Python and analyzed with Sweetviz, which generates a report that includes an overview of the data, the data types, and the number of missing values. We then used Sweetviz to generate visualizations that show the distribution of the data, identify outliers, and check for anomalies. For example, histograms were generated to show the data distribution for each variable, and scatter plots were used to check for correlations between variables. Additionally, we generated summary statistics such as the mean, median, and standard deviation for each variable, which help identify potential issues such as outliers or skewness. By using Sweetviz for this analysis, we gained valuable insights into the structure and quality of the data, which helps build a more accurate predictive model for personality traits from the mobile data of users. Our dataset had almost 15,000 features; we removed features whose null percentage exceeded a threshold, which reduced the number of columns to 5372.
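As an illustration, a minimal sketch of this profiling step is shown below; the file name users.csv and the 60% null cutoff are assumptions for illustration, not the exact values from our pipeline.

import pandas as pd
import sweetviz as sv

# Load the mobile user dataset (file name is hypothetical)
df = pd.read_csv("users.csv")

# Generate an HTML profiling report: data types, missing values,
# per-feature distributions, and summary statistics
report = sv.analyze(df)
report.show_html("data_quality_report.html")

# Drop features whose null percentage exceeds a chosen threshold
null_threshold = 0.6  # assumed cutoff for illustration
df = df.loc[:, df.isnull().mean() <= null_threshold]
print(df.shape)  # reduced column count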

Here is the analysis of some of the features from the Sweetviz report
Data Quality Analysis and Preprocessing:
Analysis of Skewness and its Removal:
Skewness is a statistical measure that describes the asymmetry of a distribution. It quantifies the extent to which the data deviates from a symmetrical (normal or Gaussian) distribution. Skewness is an essential concept in descriptive statistics and data analysis, as it provides insights into the shape and characteristics of a dataset.

Skewness can be categorized into three types:


1. Positive Skewness (Right-skewed): In a positively skewed distribution, the tail of the distribution
extends towards the right, indicating that the majority of the data points are concentrated on the
left side. The mean is typically greater than the median in a positively skewed distribution.
2. Negative Skewness (Left-skewed): In a negatively skewed distribution, the tail of the
distribution extends towards the left, indicating that the majority of the data points are
concentrated on the right side. The mean is typically less than the median in a negatively skewed
distribution.
3. Zero Skewness: A distribution is considered to have zero skewness (symmetrical) if the data is
evenly distributed around the mean, resulting in equal proportions on both sides of the
distribution.
Skewness can create several problems in a dataset:
1. Biased statistical analysis: Skewness affects the validity of statistical measures that assume a
normal distribution. Techniques such as hypothesis testing and confidence intervals are typically
based on the assumption of normality. If the data is significantly skewed, it can bias the estimates
of parameters and lead to incorrect inferences.
2. Impact on predictive modelling: Many machine learning algorithms assume that the predictors
follow a normal distribution or are at least approximately symmetric. Skewed variables can
negatively impact the performance of such models, resulting in suboptimal predictions. Skewness
can cause models to assign more weight to extreme values and potentially introduce errors in
prediction.
3. Violation of assumptions: Skewed variables can violate the assumptions of linear regression
models, such as the assumption of linearity, independence, and homoscedasticity. This can lead
to biased coefficient estimates, inefficient parameter estimation, and inaccurate predictions.
4. Misinterpretation of data: Skewed distributions can distort the interpretation of data and make
it challenging to understand the true patterns and relationships within the dataset. Visualizations,
such as histograms or box plots, may not accurately represent the underlying distribution when
significant skewness is present.
Addressing skewness through appropriate transformations is crucial to ensure the validity of
statistical analyses, improve the performance of predictive models, and gain accurate insights
from the data.
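As a quick illustration, a minimal sketch using numpy and scipy shows how skewness can be measured, and that in a right-skewed sample the mean typically exceeds the median:

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# A right-skewed (log-normal) sample and a roughly symmetric (normal) sample
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)
symmetric = rng.normal(loc=0.0, scale=1.0, size=10_000)

print(skew(right_skewed))  # clearly positive
print(skew(symmetric))     # close to zero
print(right_skewed.mean() > np.median(right_skewed))  # True: mean > median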

There are various ways to reduce skew: we can apply certain transformations to the feature column, though a transformation can also increase the skewness, so its effect must be checked. The initial plot of one of our columns for the analysis of skewness is shown below. We can clearly see that the plot is positively skewed, so we applied several transformations (square root, Box-Cox, log, etc.) to check whether the skewness could be reduced. For this feature column, the log transformation reduced the skewness significantly; for other columns, some other transformation might work better. We therefore wrote a function that chooses the transformation that minimizes the skewness, sketched below.
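A minimal sketch of such a function is shown next; the candidate transformations and the shift used to keep values strictly positive are assumptions about the exact implementation.

import numpy as np
import pandas as pd
from scipy.stats import skew, boxcox

def reduce_skewness(col: pd.Series) -> pd.Series:
    """Return the transformed column with the smallest absolute skewness."""
    # Shift so all values are strictly positive (log and Box-Cox require > 0);
    # assumes missing values have already been handled
    shifted = (col - col.min() + 1.0).to_numpy(dtype=float)
    candidates = {
        "original": col.to_numpy(dtype=float),
        "sqrt": np.sqrt(shifted),
        "log": np.log(shifted),
        "boxcox": boxcox(shifted)[0],
    }
    # Pick the transformation whose result is closest to symmetric
    best = min(candidates, key=lambda name: abs(skew(candidates[name])))
    return pd.Series(candidates[best], index=col.index, name=col.name)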

Imputing Null Values using Various Methods:

Certain columns had more than 60% null values; those columns were removed from the data frame. For the rest of the columns we used various techniques to deal with the null values. For columns with a small percentage of null values (less than 10%), we imputed those values with the mean or median, as sketched below. Imputing null values is a common data preprocessing step to handle missing data, and the choice between the median and the mean depends on the distribution of the data, particularly the presence of skewness. We imputed columns whose skewness exceeded a threshold with the median, and the rest with the mean.

When dealing with highly skewed data, imputing null values with the median is generally advisable. The median is a robust measure of central tendency that is less affected by extreme values or outliers than the mean. In highly skewed distributions, outliers can significantly influence the mean, leading to biased imputations; the median, being resistant to extreme values, yields more robust imputations. On the other hand, when the data is not highly skewed, imputing with the mean is a reasonable choice. The mean is appropriate when the data follows a relatively symmetric distribution or when the skewness is not substantial. However, imputing with the mean remains sensitive to outliers, as extreme values can disproportionately affect it. Therefore, if a dataset contains outliers that could significantly impact the mean, it is advisable to consider alternative imputation methods, such as imputing with the median or using more sophisticated techniques like multiple imputation or model-based imputation.
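A minimal sketch of this rule is given below, assuming a skew threshold of 1.0 and the 10% null cutoff mentioned above (the threshold value itself is illustrative).

import pandas as pd
from scipy.stats import skew

SKEW_THRESHOLD = 1.0  # assumed cutoff for "highly skewed"
NULL_CUTOFF = 0.10    # only mean/median-impute columns with < 10% nulls

def impute_mean_median(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for col in df.select_dtypes(include="number"):
        null_frac = df[col].isnull().mean()
        if 0 < null_frac < NULL_CUTOFF:
            if abs(skew(df[col].dropna())) > SKEW_THRESHOLD:
                df[col] = df[col].fillna(df[col].median())  # robust to outliers
            else:
                df[col] = df[col].fillna(df[col].mean())
    return df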

For imputing null values in the rest of the columns, we used predictive modelling that exploits the relationships between columns via a correlation matrix. For each target feature, we selected the features most highly correlated with it. For imputation, we then trained a model on the rows where the target is non-null and used that model to predict the null values of the target column. The heatmap below shows the correlation matrix for a few features. For some columns, we also used a KNN imputer.
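A minimal sketch of the model-based imputation for a single target column follows; the 0.5 correlation cutoff and the use of a random forest as the predictive model are illustrative assumptions.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import KNNImputer

def impute_with_model(df: pd.DataFrame, target: str, corr_cutoff: float = 0.5) -> pd.Series:
    """Predict the missing values of `target` from its most correlated features."""
    corr = df.corr(numeric_only=True)[target].drop(target).abs()
    predictors = corr[corr > corr_cutoff].index.tolist()
    known = df[df[target].notnull()].dropna(subset=predictors)
    missing = df[df[target].isnull()].dropna(subset=predictors)
    model = RandomForestRegressor(random_state=0)
    model.fit(known[predictors], known[target])
    filled = df[target].copy()
    filled.loc[missing.index] = model.predict(missing[predictors])
    return filled

# For some columns, a KNN imputer was used instead:
# df_filled = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(df),
#                          columns=df.columns, index=df.index)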
Handling Outliers:

The final step was the removal of outliers from the columns. For this, we analyzed the different columns and computed their kurtosis and skewness values. If those values do not fall in a well-defined range, we perform outlier removal using the interquartile range (IQR) and an isolation forest with 100 trees, as sketched below.
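A minimal sketch of both strategies, assuming the standard 1.5 x IQR fence:

import pandas as pd
from sklearn.ensemble import IsolationForest

def remove_outliers_iqr(col: pd.Series, k: float = 1.5) -> pd.Series:
    """Keep values inside the [Q1 - k*IQR, Q3 + k*IQR] fence."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return col[(col >= q1 - k * iqr) & (col <= q3 + k * iqr)]

def remove_outliers_iforest(col: pd.Series) -> pd.Series:
    """Flag outliers with an isolation forest of 100 trees."""
    clean = col.dropna()
    forest = IsolationForest(n_estimators=100, random_state=0)
    labels = forest.fit_predict(clean.to_numpy().reshape(-1, 1))
    return clean[labels == 1]  # 1 = inlier, -1 = outlier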

Before removing outliers:

After removing outliers:
We also eliminated extreme outliers by excluding data points that are unreasonably far (greater
than 100 times the median absolute deviation) from the sample median for the remaining
columns. This was done to minimize the impact of possible logging errors on the modelling
process. Finally, the dataset was merged with 35 (5 factors and 30 facets) personality criteria
along with age, gender, and education. The final dataset consists of 1852 predictor variables
along with 35 (5 factors and 30 facets) personality criteria.
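A minimal sketch of this MAD-based filter (the 100 x MAD cutoff comes from the text; scipy's median_abs_deviation is one way to compute it):

import pandas as pd
from scipy.stats import median_abs_deviation

def remove_extreme_outliers(col: pd.Series, factor: float = 100.0) -> pd.Series:
    """Drop points farther than factor * MAD from the sample median."""
    med = col.median()
    mad = median_abs_deviation(col.dropna())
    if mad == 0:
        return col  # near-constant column: nothing sensible to filter
    return col[(col - med).abs() <= factor * mad]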
Then, we removed ids with 0 unique apps (i.e., ids whose "daily_mean_num_unique_apps" is 0). Furthermore, we excluded variables with less than 2% unique values, as they would add only a little information to the modelling process. In this step, we reduced the initial number of 15,692 variables to the final dataset of 1852 variables; a sketch of both filters follows.
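A minimal sketch of these two filters; the column name daily_mean_num_unique_apps comes from the text, while computing the uniqueness ratio over non-null rows is an assumption about the exact definition.

import pandas as pd

def filter_dataset(df: pd.DataFrame, unique_frac: float = 0.02) -> pd.DataFrame:
    # Drop ids that used no apps at all
    df = df[df["daily_mean_num_unique_apps"] > 0]
    # Keep only variables with at least 2% unique values
    keep = [c for c in df.columns
            if df[c].nunique() / max(df[c].count(), 1) >= unique_frac]
    return df[keep]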
Models used:
We used elastic net regularized linear regression and non-linear tree-based random forest models, and trained machine learning models for the prediction of all personality factors. For model benchmarking, we compared the predictive performance of the elastic net regularized linear regression models with that of the non-linear tree-based random forest models. We performed additional pre-processing and hyperparameter tuning using 5-fold cross-validation.

Elastic Net Regularized Linear Regression:

Linear regression is the standard algorithm for regression that assumes a linear relationship
between inputs and the target variable. An extension to linear regression involves adding penalties
to the loss function during training that encourage simpler models that have smaller coefficient
values. These extensions are referred to as regularized linear regression or penalized linear
regression.

Elastic net is a popular type of regularized linear regression that combines two popular penalties,
specifically the L1 and L2 penalty functions.
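A minimal sketch of an elastic net with 5-fold hyperparameter tuning in scikit-learn; the penalty grid is an assumption, and X and y stand for the predictor matrix and one personality criterion.

from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize predictors, then fit the penalized regression
pipe = make_pipeline(StandardScaler(), ElasticNet(max_iter=10_000))
param_grid = {
    "elasticnet__alpha": [0.01, 0.1, 1.0],    # overall penalty strength
    "elasticnet__l1_ratio": [0.1, 0.5, 0.9],  # mix between L1 and L2
}
search = GridSearchCV(pipe, param_grid, cv=5)  # 5-fold cross-validation
# search.fit(X, y)  # X: behavioral variables, y: one personality criterion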

Random Forest Regression:

Random Forest Regression is a supervised learning algorithm that uses an ensemble learning
method for regression. The ensemble learning method is a technique that combines predictions
from multiple machine learning algorithms to make a more accurate prediction than a single
model.
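A corresponding sketch for the random forest model; the hyperparameter values are illustrative, not the tuned ones.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rf = RandomForestRegressor(random_state=0)
param_grid = {
    "n_estimators": [100, 500],      # number of trees in the ensemble
    "max_features": ["sqrt", 0.33],  # features considered at each split
}
search = GridSearchCV(rf, param_grid, cv=5)
# search.fit(X, y)  # same predictors and criterion as above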
Evaluations:
We evaluated the predictive performance of the models based on the Pearson correlation (r)
between the predicted values and the person-parameter trait estimates from the
self-reported values of the respective personality trait variables. Additionally, we considered
the root mean squared error (RMSE) and the coefficient of determination (R2) as measures of
predictive performance.
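A minimal sketch of these three measures, assuming y_true and y_pred hold the self-reported and predicted trait scores:

import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    r, _ = pearsonr(y_true, y_pred)                     # Pearson correlation
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root mean squared error
    r2 = r2_score(y_true, y_pred)                       # coefficient of determination
    return r, rmse, r2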
Descriptive Statistics:

These are descriptive statistics of the demographic and personality trait variables for the 629 participants.
Results:
The results show that levels of the big five personality traits were successfully predicted from records of smartphone usage for the majority of factors and facets. They also show that the non-linear random forest models on average outperformed the elastic net models in both prediction performance and the number of successfully predicted criteria.

Per-trait results are shown in the figures for Openness, Conscientiousness, Extraversion, Agreeableness, and Emotional Stability, followed by a comparison between the random forest and elastic net linear regression models.
References

Stachl, Clemens, et al. "Predicting Personality from Patterns of Behavior Collected with Smartphones." Proceedings of the National Academy of Sciences, vol. 117, no. 30, 28 July 2020, pp. 17680-17687.

"Repository: Predicting Personality from Patterns of Behavior Collected with Smartphones." Osf.io, 12 Oct. 2018, osf.io/kqjhr/, 10.17605/OSF.IO/KQJHR. Accessed 17 May 2022.

Brownlee, Jason. "How to Develop Elastic Net Regression Models in Python." Machine Learning Mastery, 6 Oct. 2020.
