Data Exploration with Python
Andrew Michelson, MD
Pulmonary/Critical Care
Institute for Informatics
Washington University School of Medicine in St. Louis
February 17, 2020
Institute for Informatics (I 2)
Disclosures
No relevant financial disclosures.
Many topics could be their own courses, so this will be a brief overview
The best techniques to analyze and clean your data will depend on the question you're
asking and the data you have
Class Structure
Objectives
1. Learn how to import data into Python
2. Discuss variable identification
3. Explore missing data and discuss its management
4. Explore univariate & bivariate analyses
5. Discuss outlier assessment and management
6. Explore data transformation
The Data
Source: MIMIC-III Demo Data
Contents:
• Vital Signs: Blood pressure, heart rate, respiratory rate, etc…
• Laboratory Values: White Blood Cell Count, Potassium, etc…
• And more, but we won’t use any of that today
The Working Environment
1. Python
2. jupyter-notebook
3. Import libraries
A. Pandas
B. Numpy
C. Seaborn
D. Datetime
E. Matplotlib
F. Scipy.stats
Importing Data Into Python
1. Python is a versatile and powerful language that can accept data from
many formats
2. In this class we import CSV documents from the MIMIC-III demo data
3. Use: dfNAME = pd.read_csv('filepath/filename', sep = ',')
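A minimal sketch of the import step. The column names and values below are illustrative assumptions, not the actual MIMIC-III schema, and an in-memory string stands in for a CSV file so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk. With a real file you would
# pass its path instead, e.g. pd.read_csv("filepath/filename", sep=",").
csv_text = io.StringIO(
    "subject_id,gender,dob\n"
    "1,F,2094-03-05\n"
    "2,M,2090-06-05\n"
)

# sep="," is the default delimiter; parse_dates converts the listed
# columns from text to datetime values while reading.
dfpatients = pd.read_csv(csv_text, sep=",", parse_dates=["dob"])
print(dfpatients)
```

Naming each DataFrame after its source file (here `dfpatients`) is only a convention, but it makes later merges easier to follow.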
Importing Data Into Python
Jupyter Notebook
• Open Jupyter-Notebook
• Run Section 2: Import Libraries for DataSet Exploration
• Fill in the blank to import the following files:
• ICUSTAYS.csv
• PATIENTS.csv
• D_ITEMS.csv
• D_LABITEMS.csv
Variable Identification
Variable Name: Variable name
Variable type:
• Continuous (ex, age)
• Categorical (ex, sex)
Data Type:
• String
• Category
• Integer
• Float
• ManyString
Independent vs Dependent:
Variable Identification
Identify your variables:
>> DataFrame.head( )
Patients dataframe
Note: you can use >> DataFrame.tail( ) to view the last rows of the data frame
By adding a number inside the parentheses you can specify how many rows to view
Variable Identification
View your data frame
ICU Stays
Variable Identification
How do we know how many rows and columns we have in total?
>> DataFrame.shape
How do we know the data type of each column?
>> DataFrame.info()
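The inspection calls from the Variable Identification slides can be tried on a toy DataFrame. The data and column names here are made up for illustration.

```python
import pandas as pd

# Toy data; column names are illustrative, not the MIMIC-III schema.
df = pd.DataFrame({
    "subject_id": [1, 2, 3, 4, 5, 6],
    "gender": ["F", "M", "M", "F", "F", "M"],
    "age": [65.0, 72.5, 58.0, 80.0, 45.5, 69.0],
})

first_rows = df.head(3)    # first 3 rows (the default is 5)
last_rows = df.tail(2)     # last 2 rows
n_rows, n_cols = df.shape  # shape is an attribute, so no parentheses
print(n_rows, n_cols)

print(df.dtypes)           # data type of each column
df.info()                  # dtypes plus non-null counts and memory usage
```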
Variable Identification
Remove extraneous information that takes up space (on screen and in memory)
>> DataFrame.drop(items, axis, inplace)
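A short sketch of dropping a column, using a made-up `row_id` column as the extraneous field. It shows both the copy-returning form and the `inplace` form.

```python
import pandas as pd

# Toy data; row_id plays the role of an extraneous bookkeeping column.
df = pd.DataFrame({
    "row_id": [10, 11, 12],
    "subject_id": [1, 2, 3],
    "gender": ["F", "M", "F"],
})

# Returns a new DataFrame without the column; df itself is unchanged.
slim = df.drop(columns=["row_id"])

# Equivalent spelling with labels and axis. inplace=True modifies df
# directly and returns None, so do not assign the result.
df.drop(["row_id"], axis=1, inplace=True)
print(df.columns.tolist())
```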
Variable identification in Python
Go to section 3.0.1 and fill in the *** to start identifying your
variables
Complete until section 3.2: Merge Patients & ICU Data to
Create a single DataFrame
Manipulating Data in Python
Often data is collected from different sources and then
merged together for analysis.
>> DataFrame1.merge(DataFrame2, how = "left/right", on = [''])
After a merge, double check the shape, to make sure you
merged correctly
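A sketch of a left merge followed by a shape check. The two toy tables and the `subject_id` key are assumptions made for illustration. The `validate` argument is optional; it asks pandas to raise an error if the key relationship is not what you expect.

```python
import pandas as pd

# Toy tables: one row per patient, and one row per ICU stay.
dfpatients = pd.DataFrame({
    "subject_id": [1, 2, 3],
    "gender": ["F", "M", "F"],
})
dficu = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "icustay_id": [100, 101, 102],
    "los": [1.5, 3.0, 2.2],
})

# Keep every ICU stay (left table) and attach patient details to each.
# validate="many_to_one" fails loudly if subject_id is duplicated in dfpatients.
merged = dficu.merge(
    dfpatients, how="left", on=["subject_id"], validate="many_to_one"
)

# A left merge on a unique right-hand key should preserve the left row count.
print(dficu.shape, merged.shape)
```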
Variable identification in Python
Go to section 3.2: Merge Patients & ICU Data to create a single
DataFrame
Check the size of the new DataFrame to confirm a successful
merge
Missing Data
Very Common in clinical data
Why is data missing?
• Data extraction
• Data collection
Missing Data Categorization
1. Missing completely at random:
• The propensity for a data point to be missing is completely
random and not dependent on observed or unobserved data
2. Missing at random:
• Systematic differences between the missing and observed values,
but these can be entirely explained by other observed variables
3. Missing not at random
• There is a relationship between the propensity of a value to be
missing and its value
Missing Data Treatment
Adapted from: https://medium.com/ibm-data-science-experience/missing-data-conundrum-exploration-and-imputation-techniques-9f40abe0fd87
Missing Data: Case Deletion
List Wise: Delete all data where any missing value is present
Pair Wise: Analyze all cases where data is available
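One way to see the difference between the two deletion strategies in pandas, using made-up vital-sign values. `dropna()` performs list-wise deletion, while `corr()` excludes missing values pair by pair.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in each of three rows.
df = pd.DataFrame({
    "heart_rate": [80.0, 95.0, np.nan, 110.0, 70.0],
    "resp_rate": [16.0, np.nan, 20.0, 24.0, 14.0],
    "temp_c": [36.8, 37.2, 38.1, np.nan, 36.5],
})

# List-wise: drop every row that has any missing value.
listwise = df.dropna()

# Pair-wise: corr() uses, for each pair of columns, all rows where
# both values are present.
pairwise_corr = df.corr()

# Number of rows that contribute to each pair-wise calculation.
present = df.notna().astype(int)
pairwise_n = present.T @ present

print(len(listwise))
print(pairwise_n)
```

Here list-wise deletion keeps only 2 of 5 rows, while each pair-wise correlation still draws on 3 rows.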
Missing Data: Imputation
Goal is to fill missing data with estimated values
Most common methods: mean/median/mode:
• Population-wide
• Cohort-wide
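A sketch of median imputation at both levels, assuming a toy cohort defined by a `gender` column. Which grouping counts as a meaningful cohort depends on your data and question.

```python
import numpy as np
import pandas as pd

# Toy data; gender is used as the (assumed) cohort variable.
df = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M"],
    "heart_rate": [70.0, np.nan, 90.0, 100.0, 110.0, np.nan],
})

# Population-wide: fill every gap with the overall median.
pop_median = df["heart_rate"].median()
df["hr_population"] = df["heart_rate"].fillna(pop_median)

# Cohort-wide: fill each gap with the median of that row's own group.
# transform("median") returns one value per row, aligned to df's index.
cohort_median = df.groupby("gender")["heart_rate"].transform("median")
df["hr_cohort"] = df["heart_rate"].fillna(cohort_median)

print(df)
```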
Missing Data: Statistical-Model Imputation
Linear Regression
• Limitations:
• Reduces variability
• Overestimates the model fit and correlation coefficient
K-nearest Neighbor Imputation
• Limitations:
• The choice of k is critical in getting desired results
• Very slow
Multiple Imputation by Chained Equations (MICE)
• Assumes data is missing at random
• Runs multiple regression models
• Each value is modeled conditionally
• Multiple data sets are made (usually at least 10)
Assessing Missing data in Python
Look for null entries
>> DataFrame.isnull( ).sum( )
Look for non-null entries
>> DataFrame.notnull( ).sum( )
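A small example of counting missing entries per column on made-up data. Note that `sum` must be called with parentheses; without them pandas returns the method object instead of the counts.

```python
import numpy as np
import pandas as pd

# Toy data; column names are illustrative.
df = pd.DataFrame({
    "subject_id": [1, 2, 3, 4],
    "dod": [np.nan, "2101-01-01", np.nan, np.nan],
    "gender": ["F", "M", None, "F"],
})

null_counts = df.isnull().sum()       # missing entries per column
nonnull_counts = df.notnull().sum()   # present entries per column
pct_missing = df.isnull().mean() * 100  # percent missing per column

print(null_counts)
print(pct_missing)
```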
Assessing Missing Data
Go to section 3.3: Assess Missing Data in NEW Patients
DataFrame and complete UP TO, but not including Import Vital
Signs
Data Mapping
Process of extracting and unifying data for further analysis
Measurements of interest could be mixed with measurements
not of interest
The same value can have different names
• Sometimes the difference in names is important, other
times it's not
Occurs in many data sets, including MIMIC-III
Data Mapping
Vital Signs:
• Blood Pressure (systolic/diastolic)
• Heart Rate
• Respiratory Rate
• Oxygen saturation (%)
• Temperature
In MIMIC-III, vital signs are mixed with other measurements in
CHARTEVENTS.csv
Data Mapping with Vital Signs
Systolic Blood Pressure Synonyms in THIS dataset:
• 'Non Invasive Blood Pressure systolic',
• 'Arterial Blood Pressure systolic',
• 'Manual Blood Pressure Systolic Left',
• 'Manual Blood Pressure Systolic Right'
Data Mapping with Vital Signs
Count variable frequency
>> DataFrame.series.value_counts( )
Data Mapping with Dictionaries
Dictionaries are data structures
that consist of collections of
key-value pairs that can be changed
(since Python 3.7 they preserve insertion order)
Dictionary = {
<key>: <value>
}
Data Mapping with Vital Signs
To accommodate synonyms, or extract items of interest from a
larger data set, you can use a dictionary
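A sketch of mapping synonyms with a dictionary. The systolic labels are the ones listed for this dataset; the toy values, the "Bed Bath" row, and the short names `sbp` and `heart_rate` are assumptions made for illustration.

```python
import pandas as pd

# Toy chart events in long format.
dfce = pd.DataFrame({
    "label": [
        "Non Invasive Blood Pressure systolic",
        "Arterial Blood Pressure systolic",
        "Heart Rate",
        "Manual Blood Pressure Systolic Left",
        "Bed Bath",  # hypothetical measurement that is not of interest
    ],
    "value": [120.0, 118.0, 82.0, 124.0, 1.0],
})

# Keys are labels as they appear in the data; values are unified names.
label_map = {
    "Non Invasive Blood Pressure systolic": "sbp",
    "Arterial Blood Pressure systolic": "sbp",
    "Manual Blood Pressure Systolic Left": "sbp",
    "Manual Blood Pressure Systolic Right": "sbp",
    "Heart Rate": "heart_rate",
}

# See how often each raw label occurs before mapping.
print(dfce["label"].value_counts())

# Labels missing from the dictionary become NaN, so dropping NaN keeps
# only the measurements of interest.
dfce["vital"] = dfce["label"].map(label_map)
dfvitals = dfce.dropna(subset=["vital"])
print(dfvitals["vital"].value_counts())
```

Whether invasive and non-invasive pressures should really share one name is a clinical judgment; the dictionary makes that choice explicit and easy to change.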
Import the remaining data and assess
missingness
Go to section 4.2: Import Vital Signs and complete up to section 5:
Univariate & Bivariate Analysis
Univariate Analysis
Explore variables individually
Basic descriptive analysis
Central Tendency    Measures of Dispersion         Visualization
Mean                Interquartile Range            Histogram
Median              Standard Deviation/Variance    Box plot
Mode                Skewness
Min                 Kurtosis
Max
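The descriptive measures in the table map onto pandas Series methods. A minimal sketch on made-up heart-rate values:

```python
import pandas as pd

# Toy heart-rate values with one high reading.
hr = pd.Series([60, 72, 75, 75, 80, 88, 140], name="heart_rate")

mean = hr.mean()
median = hr.median()
mode = hr.mode().tolist()   # mode() returns a Series; there can be ties

q1, q3 = hr.quantile([0.25, 0.75])
iqr = q3 - q1               # interquartile range
std = hr.std()              # sample standard deviation (ddof=1)
var = hr.var()              # sample variance

skewness = hr.skew()
kurt = hr.kurt()            # pandas reports excess kurtosis (normal = 0)

print(hr.describe())        # count, mean, std, min, quartiles, max
print(mode, iqr, skewness)
```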
Univariate Analysis: Skewness
Measure of the asymmetry of the probability distribution of a variable
• Positive or Right
• Negative or Left
Grading Skewness Severity
• Minimal: between -0.5 and 0.5
• Moderate: between -1 and -0.5 or between 0.5 and 1
• Severe: < -1 or > 1
https://en.wikipedia.org/wiki/Skewness
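The grading scheme can be written as a small helper. `grade_skewness` is a hypothetical function name, and treating values of exactly 0.5 and 1 as belonging to the lower grade is an assumption, since the thresholds do not say which side the boundaries fall on.

```python
import pandas as pd


def grade_skewness(skew):
    """Grade skewness severity by its absolute value (hypothetical helper)."""
    magnitude = abs(skew)
    if magnitude <= 0.5:   # assumption: boundary values take the lower grade
        return "minimal"
    if magnitude <= 1:
        return "moderate"
    return "severe"


symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 1, 2, 2, 3, 20])

for name, s in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    skew = s.skew()
    direction = "positive/right" if skew > 0 else "negative/left or none"
    print(name, round(skew, 2), direction, grade_skewness(skew))
```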
Univariate Analysis: Kurtosis
“The kurtosis parameter is a measure of the combined weight of the tails relative to the rest
of the distribution.”
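Kurtosis > 3: Positive (heavier tails than a normal distribution)
Kurtosis = 3: No excess kurtosis (normal distribution)
Kurtosis < 3: Negative (lighter tails than a normal distribution)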
https://www.spcforexcel.com/knowledge/basic-statistics/are-skewness-and-kurtosis-useful-statistics#kurtosis
https://bishalbanksonfinance.wordpress.com/tag/probabality-distribution/
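One practical detail when checking kurtosis in pandas: `Series.kurt()` returns excess kurtosis (kurtosis minus 3), so a normal distribution scores near 0 rather than 3. The simulated samples below are illustrative only.

```python
import numpy as np
import pandas as pd

# Simulated samples; the seed is fixed only to make the run repeatable.
rng = np.random.default_rng(0)
normal_sample = pd.Series(rng.normal(size=100_000))
heavy_tailed = pd.Series(rng.standard_t(df=5, size=100_000))
light_tailed = pd.Series(rng.uniform(size=100_000))

# pandas reports EXCESS kurtosis: add 3 to compare with a threshold of 3.
excess_normal = normal_sample.kurt()
kurtosis_normal = excess_normal + 3

print(round(kurtosis_normal, 2))           # close to 3
print(round(heavy_tailed.kurt() + 3, 2))   # above 3: positive kurtosis
print(round(light_tailed.kurt() + 3, 2))   # below 3: negative kurtosis
```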
Bivariate Analysis
A method to determine the relationship between 2 variables
1. Visualization: Scatter plots
2. Regression analysis: Find the equation for the line or curve that best fits the data
3. Correlation coefficients: A measure of association between two variables
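A sketch of the regression and correlation steps on made-up, nearly linear data. It fits a straight line with `np.polyfit`; other curve shapes would need a different model.

```python
import numpy as np
import pandas as pd

# Toy paired measurements; x and y are placeholder variable names.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 7.8, 10.1],
})

# Pearson correlation coefficient between the two variables.
r = df["x"].corr(df["y"])

# Least-squares straight line: y is approximately slope * x + intercept.
slope, intercept = np.polyfit(df["x"], df["y"], deg=1)

print(round(r, 4), round(slope, 2), round(intercept, 2))
```

A scatter plot of `x` against `y` with the fitted line overlaid is the usual visual check that a straight line is a reasonable description.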
Outliers
What is an outlier?
• A data point that appears far away and diverges from the overall pattern in a sample
• Can be univariate or bivariate
Outliers
How do outliers occur?
• Natural
• Sampling error
• Data entry error
• Data processing error
• Measurement error
• Intentional outlier
• Experimental error
Outliers
Why are they important?
• Alter population variance, leading to non-normal data distributions
• Alter performance of downstream analyses
• Bias results
How do you detect outliers?
• Visualization
• Bar charts
• Box plots
• Scatter plots (looking for bivariate outliers)
• There are many, many ways, but we will focus on visualization today!
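As a numeric companion to the box plot, the points a default box plot draws beyond its whiskers are those more than 1.5 times the interquartile range outside the quartiles. This is one common rule among many, shown on made-up values.

```python
import pandas as pd

# Toy heart-rate values with one high reading.
hr = pd.Series([60, 72, 75, 75, 80, 88, 140])

q1 = hr.quantile(0.25)
q3 = hr.quantile(0.75)
iqr = q3 - q1

# Fences used by the default box plot whiskers (1.5 * IQR rule).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = hr[(hr < lower) | (hr > upper)]
print(lower, upper)
print(outliers.tolist())
```

Flagging a value this way only marks it for review; whether it is an error or a natural outlier still needs domain judgment.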
Outliers: Univariate
Outliers: Bivariate
Outliers
How do you treat outliers? (Subject for an entire course!)
• Delete observations:
• Data entry error
• Data processing error
• Very few (subjective)
• Transform values
• Log conversion
• Binning
• Differential observation weights
• Impute
• Would avoid with natural outliers
• Treat outliers as a separate category
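A sketch of two of the treatments listed above: deleting an implausible value and binning. The plausibility limit of 250 and the bin edges are hypothetical cutoffs chosen for illustration, not clinical recommendations.

```python
import numpy as np
import pandas as pd

# Toy heart-rate values; 300 stands in for a likely data entry error.
hr = pd.Series([72, 80, 95, 110, 300, 65], name="heart_rate")

# Delete observations: keep only values under an assumed plausibility limit.
plausible = hr[hr <= 250]

# Binning: replace exact values with ordered categories, so an extreme
# value falls into the top bin instead of stretching the scale.
bins = [0, 60, 100, np.inf]                       # hypothetical cutoffs
labels = ["bradycardia", "normal", "tachycardia"]
hr_binned = pd.cut(hr, bins=bins, labels=labels)  # intervals are right-closed

print(plausible.tolist())
print(hr_binned.tolist())
```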
Assessing Data in Python: Pivot Tables
DataFrames must be properly structured before they can be plotted
Long format (before pivoting):
Patient       Label               Value
John Smith    Heart Rate          75
John Smith    Respiratory Rate    15
Wide format (after pivoting):
Patient       Heart Rate    Respiratory Rate
John Smith    75            15
>> DataFrame.pivot_table(values = 'value', index = ['columns'], columns = 'label')
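The John Smith example can be reproduced directly. The lowercase column names `patient`, `label`, and `value` are assumptions for this sketch.

```python
import pandas as pd

# Long format: one row per measurement.
long_df = pd.DataFrame({
    "patient": ["John Smith", "John Smith"],
    "label": ["Heart Rate", "Respiratory Rate"],
    "value": [75, 15],
})

# Wide format: one row per patient, one column per measurement label.
# pivot_table aggregates duplicates with the mean by default.
wide_df = long_df.pivot_table(
    values="value", index=["patient"], columns="label"
).reset_index()

print(wide_df)
```

Because the default aggregation is the mean, repeated measurements of the same label for one patient are averaged; pass `aggfunc` if you need a different summary.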
Visualize Data Within Python
Declare the graph properties
>> fig, ax = plt.subplots(rows,columns, figsize = (width,height))
Locate a subset of data from within the larger dataframe
>> DataFrame.loc[DataFrame.column == 'columnname', 'return column name']
Use Seaborn to make distribution and boxplots
>> sns.distplot(data, ax=ax[ X ])   (deprecated since seaborn 0.11; newer versions use sns.histplot(data, kde=True, ax=ax[ X ]))
>> sns.boxplot(x = data, ax = ax[ X ])
Pivot your dfce
>> DataFrame.pivot_table(values = 'value', index = ['columns'], columns = 'label').reset_index()
Use Seaborn to plot bivariate data
>> sns.pairplot(pivoted_table)
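A sketch of the subset-then-plot workflow on made-up chart events. To keep dependencies small it draws with matplotlib's own `hist` and `boxplot` rather than the Seaborn calls above; swapping in the Seaborn functions on the same axes follows the same pattern.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend for scripts; omit in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Toy long-format data; labels and values are illustrative.
dfce = pd.DataFrame({
    "label": ["Heart Rate", "Respiratory Rate", "Heart Rate",
              "Heart Rate", "Respiratory Rate", "Heart Rate"],
    "value": [75, 15, 82, 140, 18, 68],
})

# Locate one measurement's values inside the larger DataFrame.
hr = dfce.loc[dfce["label"] == "Heart Rate", "value"]

# One row, two columns of axes: distribution on the left, box plot on the right.
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].hist(hr.to_numpy(), bins=5)
ax[0].set_title("Heart Rate distribution")
ax[1].boxplot(hr.to_numpy())
ax[1].set_title("Heart Rate box plot")
fig.tight_layout()
```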
Visualize Data Within Python
Seaborn can make a heatmap to help you more rapidly identify correlations
>> sns.heatmap(dflabs.corr(), vmax = 1)
Univariate & Bivariate Visualization with
Vital Signs
Go to section 5: Univariate & Bivariate Analysis and complete
until section 6: Data Transformation
Data Transformation
Skewed data
• Skewed data can violate model assumptions (logistic regression)
• Skewed data can amplify a class imbalance, degrading model performance towards the tail of the
distribution
Heteroskedasticity
• The relationship between two variables shows increasing scatter (non-constant standard
error) at extremes of measurement of the dependent variable
• Two forms:
• Conditional: Unpredictable volatility
• Unconditional: Predictable volatility
Data Transformation: Heteroskedasticity
Conditional
Data Transformation: Heteroskedasticity
Unconditional
Data Transformation
One way to improve skewness and heteroskedasticity is to normalize your data
• Remove/manage outliers
• Log
• Cube Root
• Binning
• Normalization
• Sigmoid
• Hyperbolic tangent
• Etc…
Again, there are many different ways to do this and the best way will depend on your
planned analyses and the question you are answering
Data Transformation
To log-transform your data, pass a Pandas Series to np.log:
>> DataFrame.column = np.log(DataFrame.column)
To take the cube root of each value:
>> DataFrame.column = DataFrame.column**(1/3)
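A sketch comparing both transforms on a made-up, right-skewed lab value. The `wbc` column name and numbers are illustrative. It also shows `np.cbrt`, which handles negative inputs, whereas raising a negative number to the power 1/3 yields NaN in pandas.

```python
import numpy as np
import pandas as pd

# Toy right-skewed values; wbc is an illustrative column name.
df = pd.DataFrame({"wbc": [4.0, 6.5, 8.0, 11.0, 45.0, 120.0]})

# Log transform: only defined for positive values.
df["wbc_log"] = np.log(df["wbc"])

# Cube root: np.cbrt also works for zero and negative values.
df["wbc_cbrt"] = np.cbrt(df["wbc"])

# Compare skewness before and after transforming.
print(df.skew().round(2))

# Caveat: the ** (1/3) spelling returns NaN for negative inputs.
negative = pd.Series([-8.0])
print((negative ** (1 / 3)).iloc[0], np.cbrt(negative).iloc[0])
```

If the data contain zeros, `np.log1p` (the log of 1 + x) is a common alternative to a plain log.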
Data Transformation
Go to section 6: Data Transformation and go until the end!
Questions?
Thank you!
References:
1. Grus, Joel. Data Science from Scratch. O’Reilly Media; 2015.
2. Marcellino, P. Comprehensive data exploration with python.
https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python. 2/2018. Accessed: 2/12/2020.
3. Sheridan, E. Un-bottling the data.
https://towardsdatascience.com/un-bottling-the-data-2da3187fb186. 12/2/2019. Accessed: 2/12/2020.
4. Ojeda, T. Data exploration with python, part 3.
https://www.districtdatalabs.com/data-exploration-with-python-3. Accessed: 2/12/20.
5. Sunil, R. A comprehensive guide to data exploration.
https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/#two. Accessed: 2/12/2020.
6. Bratkovics, C. Exploratory data analysis tutorial in Python.
https://towardsdatascience.com/exploratory-data-analysis-tutorial-in-python-15602b417445. 6/16/19. Accessed: 2/12/20.
7. Sunil, R. Ultimate guide for data exploration in Python using Numpy, Matplotlib and Pandas.
https://www.analyticsvidhya.com/blog/2015/04/comprehensive-guide-data-exploration-sas-using-python-numpy-scipy-matplotlib-pandas/. 4/9/2015. Accessed: 2/12/2020.
8. Akinfaderin, W. Missing data conundrum: exploration and imputation techniques.
https://medium.com/ibm-data-science-experience/missing-data-conundrum-exploration-and-imputation-techniques-9f40abe0fd87. 9/11/2017. Accessed: 2/12/20.
9. Wade, C. Transforming skewed data.
https://towardsdatascience.com/transforming-skewed-data-73da4c2d0d16. 8/21/2019. Accessed: 2/20/20.
10. Chow, J. Log transformation base for data linearization does not matter.
https://towardsdatascience.com/log-transformation-base-for-data-linearization-does-not-matter-22eb3c1463d0. 6/27/2019. Accessed: 2/12/20.
11. Azur MJ, Stuart EA, Franggakis C, Leaf PJ. Multiple imputation by chained equations: what is it and how does it