
Tutorial: Feature Engineering for Recommender Systems

Chris Deotte∗, Benedikt Schifferer∗, Even Oldridge∗†
[email protected], [email protected], [email protected]
NVIDIA
San Diego, California, United States · New York City, New York, United States · Vancouver, British Columbia, Canada
ABSTRACT
The selection of features and proper preparation of data for deep learning or machine learning models play a significant role in the performance of recommender systems. To address this, we propose a tutorial highlighting best practices and optimization techniques for feature engineering and preprocessing of recommender system datasets. The tutorial will explore feature engineering using pandas and Dask, and will also cover acceleration on the GPU using open-source libraries like RAPIDS and NVTabular. The proposed length is 180 minutes. We have designed the tutorial as a combination of a lecture covering the mathematical and theoretical background and an interactive session based on Jupyter notebooks. Participants will practice the discussed techniques by writing their own implementations in Python. NVIDIA will host the tutorial on its infrastructure, providing the dataset, Jupyter notebooks, and GPUs. Participants will be able to attend the tutorial easily via their web browsers, avoiding any complicated setup. The target audience is beginner to intermediate users, who should have prior knowledge of Python programming using libraries such as pandas and NumPy. In addition, they should have a basic understanding of recommender systems, decision trees, and feed-forward neural networks.

CCS CONCEPTS
• Information systems → Recommender systems; Content analysis and feature selection; • Computer systems organization → Single instruction, multiple data.

KEYWORDS
Recommender Systems, Deep Learning, Boosting, Preprocessing, Feature Engineering, GPU Acceleration

ACM Reference Format:
Chris Deotte, Benedikt Schifferer, and Even Oldridge. 2020. Tutorial: Feature Engineering for Recommender Systems. In Fourteenth ACM Conference on Recommender Systems (RecSys '20), September 22–26, 2020, Virtual Event, Brazil. ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/3383313.3411543

∗ Authors contributed equally to this research.
† Corresponding Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
RecSys '20, September 22–26, 2020, Virtual Event, Brazil
© 2020 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-7583-2/20/09.
https://doi.org/10.1145/3383313.3411543

1 MOTIVATION
In our tutorial, we provide a general framework for feature engineering specific to recommender systems, building off our teams' collective experience creating production recommender systems and competing in data science competitions such as Kaggle and RecSys. Academic literature on recommender systems focuses mainly on the different models and model types and rarely discusses the steps for preprocessing or feature engineering. Yet feature engineering is an important component of recommender systems and can be easily integrated into an existing model. The tabular data structure of recommender systems limits the models' ability to learn the relationships between features, and adding hand-crafted features can significantly boost their performance. For example, we observed in the RecSys 2020 challenge that hand-crafted features with simple models outperformed complex model architectures.

2 IMPORTANCE FOR THE RECSYS COMMUNITY
Our goal is that participants can integrate the learned material into their own recommender systems. As mentioned above, engineering hand-crafted features can significantly improve recommendation systems. Furthermore, participants will learn to optimize their feature engineering pipelines, allowing for more exploration and iteration. The time taken to perform feature engineering, categorical encoding, and normalization of numerical variables often exceeds the time it takes to train the deep recommender model itself. Optimizing the data processing enables participants to run more iterations and try out more ideas. In our experiments, we were able to reduce the calculation time from multiple days to less than an hour. Applying our techniques allows participants to focus on their actual work of designing recommendation models instead of waiting for preprocessing calculations. Finally, reducing the data processing time makes it possible to retrain the recommendation system more frequently, keeping models in production systems up to date.

3 OUTLINE
Section 1 - Theory (40 min)
• Introduction and tutorial overview
• Short review of recommendation models, tree-based and deep learning
• Overview of different input feature types
• Preprocessing techniques
  – cleaning, imputing missing values, correcting outliers
• Feature engineering
  – see Table 1
  – Overview of all feature engineering techniques
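The preprocessing techniques in the outline (cleaning, imputing missing values, correcting outliers) can be sketched in pandas roughly as follows; the column name, toy data, and quantile thresholds are illustrative assumptions, not taken from the tutorial materials:

```python
import numpy as np
import pandas as pd

def clean_numeric(df, col, clip_quantiles=(0.01, 0.99)):
    """Impute missing values with the median, then clip outliers to quantiles."""
    out = df.copy()
    out[col] = out[col].fillna(out[col].median())
    lo, hi = out[col].quantile(clip_quantiles)
    out[col] = out[col].clip(lo, hi)
    return out

# Toy data: one missing value and one extreme outlier.
df = pd.DataFrame({"price": [1.0, 2.0, np.nan, 3.0, 1000.0]})
cleaned = clean_numeric(df, "price")
```

Clipping to quantiles rather than dropping rows keeps the training set size unchanged, which is one common way to correct outliers in tabular pipelines.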


Table 1: Overview of different feature engineering techniques by feature types

Feature Type        Feature Engineering

Categorical         Target Encoding
                    Count Encoding
                    Categorifying
Unstructured Lists  Target Encoding
                    Count Encoding
                    Categorifying
Numeric             Normalization (mean/std, min/max, log-based, Gauss Rank)
                    Power transformer
                    Binning
Timestamp           Extract Month, Day, Weekday, Weekend, Hour, Minute, Second
                    Target Encode intervals
                    Count Encode intervals
                    Normalize based on time zone
Timeseries          Time since last event
                    Differences in time (lag features)
                    # of events in the last 1 min, 5 min, 30 min, etc.
Text                Extract keywords
                    Tf–idf
                    Language embeddings (deep learning)
                    Length/Quality/Complexity
Images              Image embeddings (deep learning)
                    Resolution
                    Quality
                    Color spectrum
Social Graph        Link analysis
Geo Location        Distance to different POI
                    Characteristics in area
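Several of the most common techniques from Table 1 (Categorifying, Count Encoding, Target Encoding, timestamp extraction, and a time-since-last-event lag feature) can be sketched in pandas; the toy data and column names below are invented for illustration and the tutorial's own notebooks may differ:

```python
import pandas as pd

df = pd.DataFrame({
    "item": ["a", "b", "a", "c", "a", "b"],
    "click": [1, 0, 1, 0, 0, 1],
    "ts": pd.to_datetime([
        "2020-09-22 10:00", "2020-09-22 10:05", "2020-09-22 11:00",
        "2020-09-23 09:30", "2020-09-23 09:45", "2020-09-24 12:00",
    ]),
})

# Categorifying: map each category to a contiguous integer id.
df["item_id"] = df["item"].astype("category").cat.codes

# Count Encoding: replace a category by its frequency in the data.
df["item_count"] = df.groupby("item")["item"].transform("count")

# Target Encoding: replace a category by the mean of the target.
# (In practice this should be computed out-of-fold to avoid target leakage.)
df["item_te"] = df.groupby("item")["click"].transform("mean")

# Timestamp features: extract calendar components.
df["hour"] = df["ts"].dt.hour
df["weekday"] = df["ts"].dt.weekday

# Timeseries: seconds since the previous event of the same item (lag feature).
df["secs_since_last"] = (
    df.sort_values("ts").groupby("item")["ts"].diff().dt.total_seconds()
)
```

Note that every encoding above is a single vectorized `groupby`/`transform` rather than a Python loop over rows, which is the kind of pattern the optimization section of the tutorial targets.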

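The Gauss Rank normalization listed in the Numeric row of Table 1 can be sketched with the standard library's `NormalDist`; this is a simplified variant (rank the values, scale the ranks into (0, 1), apply the inverse normal CDF) and not necessarily the exact implementation used in the tutorial:

```python
import pandas as pd
from statistics import NormalDist

def gauss_rank(series):
    """Gauss Rank: rank values, scale ranks into (0, 1), apply inverse normal CDF."""
    ranks = series.rank(method="average")      # ranks 1 .. n
    quantiles = (ranks - 0.5) / len(series)    # strictly inside (0, 1)
    return quantiles.map(NormalDist().inv_cdf)

# Heavily skewed toy values become roughly standard-normal after the transform.
x = pd.Series([1.0, 5.0, 2.0, 100.0])
z = gauss_rank(x)
```

Because the transform depends only on ranks, it is robust to outliers and skew, which is why it is a popular alternative to mean/std normalization for neural network inputs.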
  – In-depth lecture on the most common techniques (bold)

5 min break

Section 2 - Hands-on (80 min)
• Example of different types of data (exploring the dataset)
• Hands-on contains multiple exercises; participants have to fill in blank code blocks
  – Preprocessing techniques
    ∗ cleaning, imputing missing values, correcting outliers
  – Feature engineering
    ∗ Implementation of the most common techniques (see Table 1, bold)

5 min break

Section 3 - Optimization (40 min)
• Definition of typical bottlenecks
• General optimization best practices for speed-ups
• Implementation of preprocessing and feature engineering techniques that required significant calculation time
• Participants will implement some operations
• Introduction to the NVTabular pipeline
• Rewriting the preprocessing and feature engineering pipeline with NVTabular

Section 4 - Wrap up/Summary (10 min)

ACKNOWLEDGMENTS
The authors wish to thank our colleagues on the Deep Learning Institute, RecSys, KGMON, and RAPIDS.AI teams for their support, and in particular Joshua Patterson for his vision of a GPU-accelerated data science workflow and Nicolas Koumchatzky for his guidance and recommender system expertise.

