Efficient Data Preparation: With Python
Data discovery and data preparation have always been among the most
time-intensive steps of a research project, and for good reason: if there are
unaddressed errors in the data, or if the wrong version of the data is used, there
will be errors in the resulting analysis or model. In some cases these errors can
render the analysis or model completely useless. This is why thorough data
preparation is essential for accurate analyses and accurate models.
Let’s take a look at some tools and tips to make each of these steps more efficient.
Seaborn is a library built on top of Matplotlib that provides a large variety of visualization tools for data exploration, focusing on statistical
visualizations. It is closely integrated with pandas and provides a dataset-oriented API for examining relationships between multiple
variables. Here are a few examples of using visualization to explore data sets with Seaborn:
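As a quick illustration of this dataset-oriented API, the sketch below draws a scatter plot colored by a grouping column. The DataFrame and its column names are purely illustrative toy data, not from a real dataset:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import numpy as np
import pandas as pd
import seaborn as sns

# Toy dataset standing in for real data (column names are illustrative).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "height": rng.normal(170, 10, 200),
    "weight": rng.normal(70, 12, 200),
    "group": rng.choice(["a", "b"], size=200),
})

# One line produces a scatter plot with per-group coloring and axis labels
# taken directly from the DataFrame's column names.
ax = sns.scatterplot(data=df, x="height", y="weight", hue="group")
```

Because Seaborn reads column names straight from the DataFrame, the same pattern extends to `sns.violinplot`, `sns.pairplot`, and other statistical plots with minimal changes.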
[Figure: Matplotlib vs. Datashader on a 300-million-point dataset. Plotting 1% of the data with Matplotlib takes 5 seconds and shows only limited information; patterns can be revealed by adjusting Matplotlib parameters, if you know what to expect; Datashader renders all 300 million points in 0.1 seconds with no parameter tuning needed.]
HoloViz
HoloViz is an Anaconda project that provides tools for high-level and large-data visualization built on top of Matplotlib, Bokeh, and Plotly.
HoloViz includes hvPlot, for easy one-line plotting from Pandas, Xarray, or Dask objects, encouraging you to visualize every step of your
processing pipeline; Datashader, so that you can render all of your data without making any assumptions about its content; and Panel,
for creating web applications for interactive exploration in Jupyter that can later be deployed for sharing with stakeholders. hvPlot works
similarly to Seaborn or the default plotting in Pandas, but produces interactive visualizations that support easy exploration.
# percentage of missing values in each column
# (assumes numpy imported as np and df is an existing pandas DataFrame)
for col in df.columns:
    pct_missing = np.mean(df[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing * 100)))
Visualization tools are also great for identifying outliers, especially scatter plots,
clustering, and violin plots. Use Seaborn or HoloViz tools to easily create these
visualizations to view data distributions and identify outliers.
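Alongside visual inspection, a common numeric complement is the 1.5×IQR rule. The sketch below uses synthetic data with two injected extreme points to show the idea:

```python
import numpy as np
import pandas as pd

# Synthetic data: 100 well-behaved points plus two injected outliers.
rng = np.random.default_rng(1)
values = pd.Series(np.concatenate([rng.normal(50, 5, 100), [120.0, -30.0]]))

# Flag anything beyond 1.5 * IQR from the quartiles.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers.values)  # includes the injected 120.0 and -30.0
```

Plotting the same series as a violin or box plot makes these flagged points visible at a glance, which helps decide whether they are errors to drop or genuine extremes to keep.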
Identifying duplicates
In addition to removing any features that are unrelated to answering the research questions at hand, duplicates must be removed so that they do not bias the results. Sorting to help identify and then removing duplicates is easy with Python and pandas. Here is an example of sorting by first name (useful if inspecting the data manually) and dropping duplicates:

# sorting by first name
data.sort_values("First Name", inplace=True)

# dropping duplicate values
data.drop_duplicates(keep=False, inplace=True)
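As a runnable sketch of those two calls with toy data (the column names and values are illustrative), note that `keep=False` drops every copy of a duplicated row, not just the repeats:

```python
import pandas as pd

# Toy data with one fully duplicated row ("Ann", 1).
data = pd.DataFrame({
    "First Name": ["Ann", "Bob", "Ann", "Cal"],
    "Score": [1, 2, 1, 3],
})

# Sorting groups duplicates together for manual inspection.
data.sort_values("First Name", inplace=True)

# keep=False removes BOTH copies of any duplicated row.
data.drop_duplicates(keep=False, inplace=True)
print(data)  # only the Bob and Cal rows remain
```

If you instead want to keep one representative of each duplicated row, use `keep="first"` (the default) rather than `keep=False`.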
from nltk.metrics import edit_distance

df_city_ex = pd.DataFrame(data={'city': ['tokyo', 'tokoyo', 'tokiyo', 'yokohama', 'yokohoma', 'yokahama', 'osaka', 'kobe']})
df_city_ex['city_distance_tokyo'] = df_city_ex['city'].map(lambda x: edit_distance(x, 'tokyo'))
df_city_ex['city_distance_yokohama'] = df_city_ex['city'].map(lambda x: edit_distance(x, 'yokohama'))
df_city_ex

Another common formatting problem is making address entries uniform. Address entries often contain a combination of spacing, punctuation, and capitalization inconsistencies. To remediate this, we can use a combination of functions to make all values lower case and standardize punctuation.

See detailed examples of how to use Python to remove duplicates, find and correct misspelled words, make capitalization and punctuation uniform, find inconsistencies, make address formatting uniform, and more in this detailed data cleaning guide published on Towards Data Science.
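A minimal pandas sketch of that kind of normalization is below; the sample addresses and the exact replacement rules are illustrative, and a real pipeline would tune the patterns to its own data:

```python
import pandas as pd

# Three spellings of the same address with inconsistent spacing,
# capitalization, and punctuation.
addresses = pd.Series([" 123 Main St. ", "123 MAIN ST", "123 main st."])

clean = (addresses.str.strip()                             # trim whitespace
                  .str.lower()                             # uniform case
                  .str.replace(r"[.,]", "", regex=True)    # drop punctuation
                  .str.replace(r"\s+", " ", regex=True))   # collapse spaces
print(clean.unique())  # all three collapse to '123 main st'
```

After normalization, duplicate detection and grouping treat the three variants as the single entry they really are.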
%y — Year without century as a zero-padded decimal number (98, 99, 00, 01, ...)

Source: Towards Data Science. To learn more about how to use these directives, read How to Manipulate Date and Time in Python like a Boss.
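A quick demonstration of the %y directive using only the standard library:

```python
from datetime import datetime

# Formatting: %y gives the year without century, zero-padded.
dt = datetime(1998, 7, 4)
print(dt.strftime("%y"))  # '98'

# Parsing: with %y, values 69-99 map to 1969-1999 and 00-68 to 2000-2068.
parsed = datetime.strptime("99-12-31", "%y-%m-%d")
print(parsed.year)  # 1999
```

The century pivot on parsing is a common source of subtle bugs in historical datasets, so prefer four-digit years (%Y) in stored data whenever possible.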
There are a variety of Python packages available to assist with unit conversions. Here are a few of them to note:
There are many other Python packages out there to assist with domain-specific conversions.
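To illustrate what such packages automate, here is a hand-rolled converter built on explicit conversion factors through a common base unit. The factor table and function are our own illustration, not any particular package's API:

```python
# Length conversion factors to a common base unit (meters).
FACTORS_TO_METERS = {"m": 1.0, "km": 1000.0, "mi": 1609.344, "ft": 0.3048}

def convert_length(value, from_unit, to_unit):
    """Convert by going through meters as the common base unit."""
    meters = value * FACTORS_TO_METERS[from_unit]
    return meters / FACTORS_TO_METERS[to_unit]

print(convert_length(1, "mi", "km"))   # ~1.609344
print(convert_length(5, "km", "m"))    # 5000.0
```

Dedicated packages go well beyond this sketch, handling compound units, dimensional analysis, and error checking, which is why they are preferable to ad hoc factor tables in real projects.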
Airflow
One way to do this in the open-source world is with Apache Airflow. Airflow is a Python-based data workflow management system originally developed by Airbnb’s data team to help speed up their data pipelines. It can be used to schedule tasks to aggregate, cleanse, and organize data, among other tasks unrelated to data preparation.

Luigi
Luigi is a Python workflow management tool originally developed at Spotify. Luigi is great for managing long-running batch processes, and it’s easy to build long-running pipelines that take days or weeks to complete. Luigi’s interface is less user-friendly than Airflow’s, but it allows for custom scheduling and has a comprehensive library of stock tasks and target data systems.

Celery
Celery is a workflow automation tool that was developed with Python, but with a protocol that can be implemented in any language. Celery uses task queues to distribute work across machines or threads and allows for high availability and horizontal scaling. Celery can process millions of tasks per minute and can run on single machines or multiple. A Celery system can even run across data centers.

Learn more about Celery.

Prefect
Prefect is a workflow management platform that allows users to automate anything they can do with Python. Designed and built with Dask in mind, Prefect schedules workflows while allowing Dask to schedule and manage the resources for tasks within workflows. Prefect generally allows for more flexibility and more dynamic workflows than Airflow or Luigi. For example, Airflow’s scheduler is critical to the execution of multiple stages in a workflow, while Prefect decouples this logic into separate processes. Airflow’s design makes it better for managing larger tasks while Prefect is better for smaller, modular tasks.

Learn more about Prefect.
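The common thread across all four tools is running tasks in dependency order. As a toy illustration of that core idea, using only the standard library rather than any of these tools' APIs:

```python
from graphlib import TopologicalSorter

# Toy pipeline: cleaning depends on extraction; aggregation and
# validation both depend on cleaning. (Task names are illustrative.)
dependencies = {
    "clean": {"extract"},
    "aggregate": {"clean"},
    "validate": {"clean"},
}

def run_task(name, log):
    # A real workflow engine would execute the task here; we just record it.
    log.append(name)

log = []
for name in TopologicalSorter(dependencies).static_order():
    run_task(name, log)
print(log)  # extract always precedes clean; clean precedes the rest
```

Airflow, Luigi, and Prefect each layer scheduling, retries, distributed execution, and monitoring on top of exactly this kind of dependency graph.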