
Efficient data preparation with Python

What’s Inside

• Introduction
• Data anonymization
• Data discovery and extraction
• Data exploration and profiling
• Data visualization
• Data cleansing
• Data transformation
• Automating data prep workflows
• Tools are no substitute for human intuition

Introduction
Although data scientists spend less time on data preparation now than they did a
few years ago, our 2020 State of Data Science survey found that nearly half of
their time (45%) is still spent preparing data for analysis and modeling.

Data discovery and data preparation have always been among the most
time-intensive steps of a research project and for good reason: If there are
unaddressed errors in data, or if the wrong version of data is used, there will be
errors in the resulting analysis or model. In some cases these errors could
render the analysis or model completely useless. This is why thorough data
preparation is essential for accurate analyses and accurate models.

Data preparation will likely always be a major step in the data
science process. However, data scientists can speed up the
time spent on data prep tasks with a well-documented and
curated data catalog, a repository of data cleaning functions,
and of course, Python tools and libraries created especially for
more efficient data prep.

What does data preparation include?


Data preparation is a general term that includes multiple steps
that might be needed to prepare data for analysis or machine
learning and model development. Data preparation will usually
include anonymization, data discovery and exploration, data
cleansing, and what some refer to as ETL — extracting,
transforming and loading.

Let’s take a look at some tools and tips to make each of these steps more efficient.

Data anonymization
Data must be managed ethically and securely throughout the data science process.
When working with data that involves personally identifiable information (PII)
or other sensitive data, the first step in the data preparation process must be
to anonymize the data. One way to do this is to replace real personal data with
similar but fake information. An open-source Python tool that helps with this
process is Faker. Faker’s library contains a comprehensive set of data
generators for data needed in a variety of domains.

Learn more about Faker and how to install it.

Scrubadub is another Python tool that can be used to efficiently scrub PII from
free text, including names, email addresses, phone numbers, social security
numbers, usernames, and more.

Read more about Scrubadub and how to easily install this tool.

To maintain the integrity and usefulness of the data after anonymization, the
fake data must preserve the structure and semantics of the original data.
Consider the relationships that originally existed between fields. For example,
how do people’s names and employer names relate to the structure of their email
addresses? If this is important for the analysis or model, create a mapping of
real domains to fake domains. Deduplication should also be performed after
anonymization. This is covered in the Data Cleansing section of this guide.
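
As a quick illustration of the Faker approach described above, here is a minimal
sketch that replaces real names and email addresses with generated ones; the
DataFrame and its column names are hypothetical stand-ins for your own data:

import pandas as pd
from faker import Faker

fake = Faker()
Faker.seed(42)  # seed so the fake values are reproducible

# hypothetical data containing PII
df = pd.DataFrame({'name': ['Alice Smith', 'Bob Jones'],
                   'email': ['[email protected]', '[email protected]']})

# replace each real value with a generated one
df['name'] = [fake.name() for _ in range(len(df))]
df['email'] = [fake.email() for _ in range(len(df))]
print(df)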

Data discovery and extraction
Finding and accessing the right data sets to answer research location of files and credentials required to fetch the data.
questions can be extremely time-consuming tasks without a Anaconda recommends using Intake to create data catalogs
well-documented, curated data catalog. To make data discovery that empower data scientists to search and load data for their
and extraction more efficient, data engineers should commit to projects easily from Python. Intake allows data scientists to
the development of a thorough data ingest and cataloging search for data sets in a hierarchical tree, which is more
pipeline and work with IT to ensure data scientists can access efficient than going through lists of files.
data using instructions in the catalog. This labor-intensive prep
work will be well worth the time, enabling data scientists to Intake also allows data engineers to provide descriptions of
quickly search for and access the data sets they need without how to load data so data scientists can access data sets
involving IT, or worse loading the data only to discover it isn’t without having to figure it out on their own or wait for help
right for their model. Furthermore, catalogs, when kept up to from IT. Lastly, Intake allows for the separation of the definition
date, will ensure that data scientists always get the latest version of data sources from their use and analysis. This means data
of datasets, and perhaps choose between sections of data engineers can decide and execute on where data should be
meant for different uses. stored, in what format, and in what type of package without
affecting data scientists’ work. The data scientist can then fetch
The data catalog should include searchable metadata (type of the data in a familiar container (dataframe, data, etc.) that they
data, free word text, tags, provenance, etc.) with descriptions of know how to analyze.
data sets that data scientists would find useful. Each catalog
entry should be a definition of how to read a dataset, including
the type of data, the loader and arguments required, and the
Learn more about Intake and how to get started.
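
To make this concrete, here is a minimal sketch of the data scientist’s side of
Intake; the catalog file name and the 'sales' entry are hypothetical and would
be defined by the data engineering team:

import intake

# open the catalog maintained by data engineering (hypothetical path)
cat = intake.open_catalog('catalog.yml')

print(list(cat))        # list the data sets described in the catalog
df = cat.sales.read()   # load one entry into a familiar pandas DataFrame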

Data exploration and profiling
Once you can access the data, the next step is to conduct an Fast data profiling
exploratory data analysis (EDA) to gain a better understanding of
trends and important characteristics. An EDA allows data pandas_profiling is a Python tool that generates interactive
scientists to uncover patterns and irregularities in data and to reports from pandas data frames, allowing for quick data
understand frequency counts, variance, and correlations. profiling with just a few lines of code. pandas_profiling reports
include:
For columnar data, the go-to exploration tool is pandas, where
you can explore descriptive statistics, summarize unique values • More powerful type inference than what Pandas offers
and missing values, and more. Here we’ll provide some • Essentials such as unique and missing values
additional open-source analysis and visualization tools data
• Descriptive stats (mean, mode, standard deviation, quantiles,
scientists can use to explore and understand data sets.
histograms)
• Most common values
• Correlations
• Missing values matrix, heatmaps, and dendrograms
• Text analysis
• File and image analysis

Learn more about pandas_profiling.
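
For example, a report like the one described above can be generated with just a
few lines; the input file here is a hypothetical placeholder:

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv('data.csv')                        # hypothetical input
profile = ProfileReport(df, title='Profiling report')
profile.to_file('report.html')                      # interactive HTML report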

Data visualization
Blindly running averages or other statistics on your data can easily lead to
incorrect conclusions if there are outliers or trends you did not expect. Our
brains are highly skilled at spotting patterns and inconsistencies when data is
presented visually, so it is crucial to explore the underlying data visually
before drawing conclusions or taking action. Scatter plots, cluster size
analysis, and correlation heat maps can be especially helpful in providing
insights when exploring new datasets.

Data exploration is an art, and there is no one perfect way to visualize any
data set. There are dozens of “viz” libraries available for Python, but here
we’ll focus on three that are particularly helpful for understanding the
properties of a new dataset: Seaborn, Datashader, and HoloViz.

Visit PyViz.org for a complete list of Python visualization libraries.

Seaborn

Seaborn is a library built on top of Matplotlib that provides a large variety of visualization tools for data exploration, focusing on statistical
visualizations. It is closely integrated with pandas and provides a dataset-oriented API for examining relationships between multiple
variables. Here are a few examples of using visualization to explore data sets with Seaborn:

Example plots: discovering structure in heatmap data, a paired density and
scatterplot matrix, and a scatterplot with categorical variables.
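
As a minimal sketch of this kind of exploration, the snippet below uses
Seaborn’s built-in 'tips' example dataset to draw a scatterplot matrix and a
scatterplot colored by a categorical variable:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')

sns.pairplot(tips)   # pairwise relationships between numeric variables

plt.figure()
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='day')  # categorical hue
plt.show()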

Datashader
Datashader was created to make it easier to interactively visualize large,
multidimensional datasets, with the goal of accurately rendering both the trends
and the outliers in the data fast enough to be practical for interactive
exploration. As illustrated in the figure below, plotting even 1% of a
300-million-point dataset of US Census locations in a standard plotting program
(Matplotlib in this case) takes 5 seconds, which is too slow for interactive
panning and zooming. Standard plotting tools also require adjusting multiple
parameters before they will reveal the data, which is difficult if you don’t
already know the properties of your dataset. Datashader uses server-side
aggregation techniques to accurately reveal the shape of the entire distribution
without any parameter exploration, in under 0.1 seconds for the entire dataset.

Figure: Datashader for painlessly exploring large datasets. Left: plotting 1% of
the data with Matplotlib takes 5 seconds and shows only limited information.
Center: patterns can be revealed by adjusting Matplotlib parameters, if you know
what to expect. Right: Datashader renders all 300 million points in 0.1 seconds
with no parameter tuning needed.
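
Here is a minimal sketch of Datashader’s aggregate-then-shade workflow on a
synthetic point dataset; the real Census example works the same way, just with
far more points:

import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

# synthetic stand-in for a large point dataset
df = pd.DataFrame({'x': np.random.randn(1_000_000),
                   'y': np.random.randn(1_000_000)})

canvas = ds.Canvas(plot_width=400, plot_height=400)
agg = canvas.points(df, 'x', 'y')   # aggregate points onto a grid
img = tf.shade(agg, how='log')      # map aggregate counts to colors
img.to_pil().save('points.png')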

HoloViz
HoloViz is an Anaconda project that provides tools for high-level and large-data
visualization built on top of Matplotlib, Bokeh, and Plotly. HoloViz includes
hvPlot for easy one-line plotting from Pandas, Xarray, or Dask, encouraging you
to visualize every step of your processing pipeline; Datashader, so that you can
render all your data without making any assumptions about its content; and
Panel, for creating web applications for interactive exploration in Jupyter that
can later be deployed for sharing with stakeholders. hvPlot works similarly to
Seaborn or the default plotting for Pandas, but provides interactive
visualizations that support easy exploration.
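
For instance, a minimal hvPlot sketch looks like this; df is a stand-in for your
own DataFrame:

import numpy as np
import pandas as pd
import hvplot.pandas  # registers the .hvplot accessor on DataFrames

df = pd.DataFrame({'x': np.random.randn(1000),
                   'y': np.random.randn(1000)})

# interactive Bokeh scatter plot with pan and zoom, rendered inline in Jupyter
df.hvplot.scatter(x='x', y='y')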

Data cleansing
Data cleansing or cleaning may be the most plodding
task for some data scientists, but it is essential for
accurate models and analyses. Data cleansing involves
finding and remediating missing, duplicated, irregular,
unnecessary and inconsistent data. Some of this dirty
data will likely already have been identified in the data
exploration phase.

One way to make the data cleansing process more efficient is to maintain a
library of functions or a repository of rules that data scientists can refer to
when cleaning data sets.

There are several ways to clean data using Python and common open-source
libraries such as pandas and NumPy, and common visualization tools such as
Matplotlib, Seaborn, and HoloViz.

Locating missing data

One way to facilitate the search for missing data is to create a heatmap that
shows missing fields in a contrasting color. By looking at a data set this way,
a data scientist can quickly discover which features need attention and exclude
them altogether if they are not important to the project at hand. Data
scientists can use Seaborn and other data visualization tools to quickly create
heatmaps with contrasting colors to find missing values, or histograms to
discover the frequency of missing values.

Figure: Missing values heatmap created with Seaborn
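
Such a heatmap takes only one line with Seaborn; a minimal sketch, assuming df
is a pandas DataFrame that has already been loaded:

import seaborn as sns
import matplotlib.pyplot as plt

# missing cells stand out in a contrasting color
sns.heatmap(df.isnull(), cbar=False)
plt.show()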

Alternatively, one can quickly gain an understanding of what data is missing by
using Python to create a list of the percentage of missing data for each column:

# % of values missing in each column
import numpy as np

for col in df.columns:
    pct_missing = np.mean(df[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing * 100)))

Finding outliers
When exploring a data set, it’s important to identify outliers and determine
whether they are genuine outliers or mistakes in data collection. Histograms,
bar charts, and descriptive statistics can all be used to find outliers. To see
a quick description of a data set, input the following:

df['feature_name'].describe()
This will provide total count, mean, quartiles, and max value, among other
characteristics, enabling a quick glance at the distribution between minimum
and maximum values.

Visualization tools are also great for identifying outliers, especially scatter
plots, clustering, and violin plots. Use Seaborn or HoloViz tools to easily
create these visualizations to view data distributions and identify outliers.

Figure: Violin plot showing distributions, created with HoloViz
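
As a minimal sketch, a violin plot like the one in the figure can be drawn with
Seaborn’s built-in 'tips' dataset (HoloViz’s hvPlot produces comparable
interactive distribution plots):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')
sns.violinplot(data=tips, x='day', y='total_bill')  # distribution per category
plt.show()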

Identifying duplicates
In addition to removing any features that are unrelated to the research
questions at hand, duplicates must be removed so that they do not bias the
results. Sorting to help identify duplicates and then removing them is easy with
Python and pandas. Here is an example of sorting by first name (useful if
inspecting the data manually) and dropping duplicates:

# sorting by first name
data.sort_values("First Name", inplace=True)

# dropping duplicate values
data.drop_duplicates(keep=False, inplace=True)

Identifying inconsistencies

Common inconsistencies in data sets include capitalization, data formats, and
misspelled words. There are a few simple operations in Python for remediating
these common inconsistencies. To avoid confusion from variations in
capitalization, it’s often easiest to make all data lowercase:

# make everything lower case
df['sub_area_lower'] = df['sub_area'].str.lower()
df['sub_area_lower'].value_counts(dropna=False)

To identify and remediate misspelled words, use fuzzy logic. To start, search
for a few common entries and variations within a category. This example uses
cities:

import pandas as pd
from nltk.metrics import edit_distance

df_city_ex = pd.DataFrame(data={'city': ['tokyo', 'tokoyo', 'tokiyo',
    'yokohama', 'yokohoma', 'yokahama', 'osaka', 'kobe']})

df_city_ex['city_distance_tokyo'] = df_city_ex['city'].map(
    lambda x: edit_distance(x, 'tokyo'))
df_city_ex['city_distance_yokohama'] = df_city_ex['city'].map(
    lambda x: edit_distance(x, 'yokohama'))
df_city_ex

This gives us a list of distances (how many letters are off) from the correct
spelling of Tokyo and Yokohama. You will find that misspelled words are
typically only a distance of 1 or 2 from the correct word; greater distances
likely indicate a different city. To remediate misspelled words in this
category, we can focus on those with a distance of two or less from the correct
spelling:

msk = df_city_ex['city_distance_tokyo'] <= 2
df_city_ex.loc[msk, 'city'] = 'tokyo'

msk = df_city_ex['city_distance_yokohama'] <= 2
df_city_ex.loc[msk, 'city'] = 'yokohama'

df_city_ex

Obviously, great care must be taken to preserve names that are genuinely
different but happen to be similar, keeping them as separate identifiers.

Another common formatting problem is making address entries uniform. Address
entries often contain a combination of spacing, punctuation, and capitalization
inconsistencies. To remediate this, we can use a combination of functions to
make all values lower case and standardize punctuation, as sketched below.

See detailed examples of how to use Python to remove duplicates, find and
correct misspelled words, make capitalization and punctuation uniform, find
inconsistencies, make address formatting uniform, and more in this detailed data
cleaning guide published on Towards Data Science.
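
As a hedged sketch of that address clean-up, the chained pandas string
operations below lowercase the values, strip punctuation, and collapse extra
whitespace; the 'address' column and the specific replacements are hypothetical:

import pandas as pd

df = pd.DataFrame({'address': ['123 Main St.', '123 main street ',
                               '456  OAK AVE,', '456 Oak Avenue']})

df['address_clean'] = (
    df['address']
    .str.lower()                            # uniform capitalization
    .str.replace(r'[.,]', '', regex=True)   # strip punctuation
    .str.replace(r'\s+', ' ', regex=True)   # collapse repeated spaces
    .str.strip()
)
print(df)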

Date and time formatting
Date and time formats are also often inconsistent. We can use Python to
standardize date and time formats, convert strings to dates and vice versa,
compare and change time zones, and calculate time differences. Here are some
useful Python directives to get started:

DIRECTIVE MEANING EXAMPLE

%a Weekday as abbreviated name Sun, Mon, Tue...

%A Weekday as full name Sunday, Monday...

%w Weekday as a decimal number; 0 is Sunday, 6 is Saturday 0, 1, 2...

%d Day of month as a zero-padded decimal number 01, 02, 03...

%b Month as abbreviated name Jan, Feb, Mar...

%B Month as full name January, February...

%m Month as zero-padded decimal number 01, 02, 03...

%y Year without century as zero-padded decimal number 98, 99, 00, 01...

%Y Year with century as a decimal number 1998, 1999, 2000...

Source: Towards Data Science. To learn more about how to use these directives read
How to Manipulate Date and Time in Python like a Boss.
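
As a quick, minimal example of these directives in action with Python’s standard
datetime module (pandas accepts the same format strings in pd.to_datetime):

from datetime import datetime

parsed = datetime.strptime('04/07/2020', '%d/%m/%Y')   # string to date
print(parsed.strftime('%A, %B %d, %Y'))                # 'Saturday, July 04, 2020'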

Data transformation
Data transformation is the process of modifying or converting the format or
structure of data. This can include aggregating columns or categories,
converting units, and converting times and dates.

There are a variety of Python packages available to assist with unit
conversions. Here are a few of them to note:

• Arrow - a library that simplifies date/time conversion and formatting
• QuantiPhy - a library for converting physical quantities; supports SI scale factors
• Astropy - a library for astronomy that includes comprehensive units/dimensionality handling
• Pint - a Python package to define, operate on, and manipulate physical quantities
• SymPy - a module that allows for symbolic manipulation of variables, which can represent quantities with units

There are many other Python packages out there to assist with domain-specific conversions.
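
For example, a minimal sketch of a unit conversion with Pint looks like this:

import pint

ureg = pint.UnitRegistry()
distance = 42.0 * ureg.kilometer
print(distance.to(ureg.mile))       # convert kilometers to miles
print(distance.to_base_units())     # convert back to SI base units (meters)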

Automating data prep workflows
Particularly for data that is updated frequently, data engineers and data scientists can speed up the data preparation process by
automating some of their workflows. Data professionals can accelerate the data preparation process with these Python-based tools.

Airflow

One way to do this in the open-source world is with Apache Airflow. Airflow is a
Python-based data workflow management system originally developed by Airbnb’s
data team to help speed up their data pipelines. It can be used to schedule
tasks to aggregate, cleanse, and organize data, as well as other tasks beyond
data preparation.

Learn more about Airflow.

Luigi

Luigi is a Python workflow management tool originally developed at Spotify.
Luigi is great for managing long-running batch processes, and it’s easy to build
long-running pipelines that take days or weeks to complete. Luigi’s interface is
less user-friendly than Airflow’s, but it allows for custom scheduling and has a
comprehensive library of stock tasks and target data systems.

Learn more about Luigi.

Celery

Celery is a workflow automation tool that was developed with Python, but with a
protocol that can be implemented in any language. Celery uses task queues to
distribute work across machines or threads and allows for high availability and
horizontal scaling. Celery can process millions of tasks per minute and can run
on a single machine or many; a Celery system can even run across data centers.

Learn more about Celery.

Prefect

Prefect is a workflow management platform that allows users to automate anything
they can do with Python. Designed and built with Dask in mind, Prefect schedules
workflows while allowing Dask to schedule and manage the resources for tasks
within workflows. Prefect generally allows for more flexibility and more dynamic
workflows than Airflow or Luigi. For example, Airflow’s scheduler is critical to
the execution of multiple stages in a workflow, while Prefect decouples this
logic into separate processes. Airflow’s design makes it better for managing
larger tasks, while Prefect is better for smaller, modular tasks.

Learn more about Prefect.
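
To make this concrete, here is a hedged sketch of a small DAG written against
Airflow’s classic DAG/PythonOperator API; the clean_data function and the daily
schedule are hypothetical placeholders for your own pipeline steps:

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def clean_data():
    # placeholder for a real cleansing routine (deduplicate, fix types, ...)
    pass

with DAG(dag_id='daily_data_prep',
         start_date=datetime(2020, 1, 1),
         schedule_interval='@daily') as dag:
    cleanse = PythonOperator(task_id='cleanse', python_callable=clean_data)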

Tools are no substitute
for human intuition
While there are tools that make the data preparation process
more efficient, and automation tools will become more
sophisticated over time, it’s not easy to use them to apply a
critical lens to data. Human experience and intuition are
necessary and will continue to be necessary to evaluate data sets
and determine if outliers and inconsistencies are errors or simply
unusual given the context. Human intuition is also needed to
determine whether a data set can contribute valuable insight to
solve specific problems or provide relevant answers to the
research questions at hand. Machines will not take over the task
of data preparation any time soon.

About Anaconda
With more than 20 million users, Anaconda is the world’s most popular data science
platform and the foundation of modern machine learning. We pioneered the use of
Python for data science, champion its vibrant community, and continue to steward
open-source projects that make tomorrow’s innovations possible. Our enterprise-grade
solutions enable corporate, research, and academic institutions around the world to
harness the power of open-source for competitive advantage, groundbreaking research,
and a better world.

Visit https://www.anaconda.com to learn more.

© 2020 Anaconda, Inc.
