Data Distiller Guide - Saurabh Mahapatra

The Adobe Data Distiller Guide outlines the importance of understanding the philosophical foundations of the Data Distiller product, emphasizing its role in enhancing customer experience through effective data management and engagement strategies. Data Distiller serves as a powerful data processing engine that transforms raw data into actionable insights, facilitating marketing efforts such as audience segmentation, personalization, and campaign optimization. The guide also includes a Use Case & Capability Matrix to help organizations leverage Data Distiller's features for achieving specific business objectives.

https://data-distiller.all-stuff-data.com/

Adobe Data Distiller Guide



https://data-distiller.all-stuff-data.com/unit-1-getting-started/prep-100-why-was-data-distiller-built * * *

Before you start, you need to understand the philosophical underpinnings of why Data Distiller exists as a product in
the first place. The guide contains many examples of how to use the product, and the product's capabilities are very
powerful and evolving fast.

If you treat this as just another tool to get by with, you will completely miss the point. The value is not in what the
product can do or in its list of capabilities, but in how it fits as a key component of your overall experience delivery
strategy. Here is how that argument goes.

I am going to assume that you are reading this documentation because:

1. You or your organization wants to make a positive impact in the world by changing things for the better.

2. You want to make this impact via the delivery of products or services.

3. The unit in which your users/customers/the world will experience what you offer is an “experience.”

The fundamental equation in delivering an experience is:

Experience = Product/Service/Offer + Engagement + Data

Your products or services ultimately define what you have to offer to your customers. This is the reason why you
exist in the first place. But that alone is not going to cut it, because you also need to engage with your customers,
i.e., provide a means by which they experience the product from awareness all the way to becoming a champion of it.
If you have the greatest product but cannot engage, be ready to fail.

Engagement depends on when and how you talk to them: where you engage (the channel), the style of communication,
and the content you use (text, images, video). The total packaging of these elements and how it manifests when
delivered is also critical. But if you do not understand where, who, when, and what to engage with, you are still
going to fail.

You will have to work with data to figure that out. In fact, you have to use the same data to design the product. Or
rather: you will create products and engagements that generate the data, so that you can serve your customers better.

If you cannot do this, there are not enough good reasons for you to exist as a business.

Reality Check: For as long as we have collected data as a civilization, data has been messy. The messiness is just a
projection of the complexity of how we operate as agents. As we build more complex systems, the data we need to
collect about them will increase and become messier.

Corollary: Just having a ton of data and analyzing it all day is no good either. You will know a lot about how the
world works, but to make an impact, you need a product.

Data Distiller is one of the data products in the Adobe Experience Platform that is architected to solve your data
problems so that you are empowered to deliver the best experience.

Customers experience what you offer in chunks called “experiences,” each powered by these three elements.

All components (your offering, engagement, and data) need to be managed well enough to deliver an exceptional
experience.

https://data-distiller.all-stuff-data.com/what-is-data-distiller * * *

What is Data Distiller?


Data Distiller is an advanced data processing engine designed for data engineers, data scientists, and marketing
operations teams to streamline the transformation of raw data into actionable insights for marketers. By “distilling”
large datasets, it refines, filters, and processes information, helping businesses unlock the true value hidden in their
data. Similar to a distillation process that purifies and concentrates substances, Data Distiller extracts the most relevant
and impactful information, reducing noise and enhancing data quality. With its powerful capabilities, Data Distiller
accelerates data workflows, enabling faster analysis and delivering insights that drive informed decision-making across
a range of business functions.

Data Distiller serves as the bridge between raw data and actionable marketing insights, optimizing the entire data
journey from storage to analysis. In modern marketing, data lakes and warehouse systems form the backbone, enabling
efficient data processing and insight generation. Mastering data processing techniques is crucial in this landscape for
several key reasons:

1. Data Analysis: Marketing generates extensive data, including customer profiles, sales figures, website analytics,
and campaign metrics. Data processing empowers marketers to query and analyze this information, providing
valuable insights into customer behavior, campaign performance, and overall marketing effectiveness.

2. Segmentation: Marketers can segment audiences based on demographics, location, purchase history, and
behavior. This level of segmentation enables targeted campaigns that improve conversion rates and return on
investment (ROI).

3. Personalization: Data analysis helps personalize marketing messages by allowing deep exploration of customer
data. Marketers can create personalized recommendations, email content, and advertisements tailored to
individual customers, boosting engagement and resonance.

4. Campaign Optimization: By analyzing real-time data on click-through rates, conversions, and customer
engagement, marketers can optimize campaigns. This data-driven approach ensures campaigns are fine-tuned for
the best possible results.

5. Customer Retention: Data analysis enables the identification of patterns related to customer churn. This
knowledge helps in developing strategies to retain customers, fostering loyalty, and reducing churn rates.

6. A/B Testing: Data processing is invaluable for conducting A/B tests to determine which strategies and messaging
perform best. The results can be analyzed to refine and enhance marketing approaches.

7. Data Integration: Marketing teams often use various platforms, from email marketing tools to social media
managers. Data processing integrates information from multiple sources into a centralized database, offering a
unified view of marketing performance.
8. Reporting and Dashboards: Data processing facilitates the creation of custom reports and dashboards,
delivering real-time insights to marketing teams and stakeholders. These tools help visualize key performance
indicators (KPIs) and track progress toward goals.

9. Career Advancement: In a data-driven marketing world, proficiency in data analysis is a highly sought-after
skill. Marketers who can effectively work with data and extract actionable insights are better positioned for career
growth.

10. Data Governance: Understanding data management is essential for ensuring accuracy and regulatory
compliance. Marketers need to responsibly manage customer data, and expertise in data processing aids in
maintaining data integrity.

In the sections that follow, we will dive into real-world examples, offering practical insights and considerations for
leveraging Data Distiller to elevate marketing strategies.

This book is freely available and has been crafted as a self-help resource for data leaders grappling with complex
challenges in the realm of customer data management. It’s essential to note that this book is an independent project and
is neither endorsed by nor affiliated with Adobe or any of the author’s past or current employers.

Disclaimer

This book is provided for informational purposes only and does not constitute legal, financial, or professional advice.
The author makes no representations as to the accuracy, completeness, currentness, suitability, or validity of any
information in this book and will not be liable for any errors, omissions, or delays in this information or any losses,
injuries, or damages arising from its use. The reader should consult with appropriate professionals for advice tailored
to their specific situation. Any reliance you place on information from this book is strictly at your own risk.

Copyright Notice

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or
by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of
the author, except for learning and noncommercial uses permitted by copyright law.


https://data-distiller.all-stuff-data.com/unit-1-getting-started/prep-200-data-distiller-use-case-and-capability-matrix-guide * * *

1. UNIT 1: GETTING STARTED

PREP 200: Data Distiller Use Case & Capability Matrix Guide
Navigate your data journey with precision—empower every decision with the Data Distiller Use Case & Capability
Matrix

The Data Distiller Use Case & Capability Matrix serves as a comprehensive guide to understanding how various
capabilities of Data Distiller can be leveraged to meet critical business objectives. This framework outlines key use
cases such as customer data onboarding, ETL (extract, transform, load) operations, and batch audience segmentation.
Each use case is paired with descriptions, benefits, and core functionalities that enhance the efficiency of data-driven
processes. By utilizing these capabilities, organizations can improve consistency in marketing efforts, streamline data
transformations within their data lakes, and drive large-scale audience segmentation with actionable insights. This
matrix provides a clear path to unlocking the power of data through tailored solutions and features that address specific
data challenges.
Data Distiller delivers a powerful set of features, augmenting what you can achieve with Adobe Experience Platform
(AEP) Intelligence or standalone applications. Here’s a summary of its key features and how it compares across
different scenarios:

Features in Data Distiller

1. Data Exploration

Concurrency: Support for up to 5 concurrent users, enhancing collaboration.

Approximate Analytics: Use approximate aggregate functions for fast estimates on large datasets, eliminating the
need for extensive ETL processes (see the sketch after this feature list).

2. ETL Engine

Scheduled Dataset Creation: Generate high-value datasets for Real-Time Customer Profile, Adobe
Journey Optimizer, and Customer Journey Analytics.

Unlimited Scheduled Jobs: Independent jobs tested at production scale, processing hundreds of billions of
records.

Derived Attributes: Robust capabilities for optimized Profile Storage compared to Computed Attributes.

Incremental Processing: Efficiently handle fast-changing event data.

Extensive Function Library: Includes sampling, attribution, windowing, sessionization, privacy, and
encryption functions.

3. BI Engine with Advanced Dashboarding

Warehousing Engine: Low-latency queries for dashboards, BI, and API integrations.

Star Schema Support: Simplify reporting workflows.

Dashboard Enhancements: Features like tables and CSV downloads.

Query Pro Mode: Create charts directly from SQL queries.

4. Audience Creation and Orchestration

Batch Audiences: Generate and orchestrate batch audiences using data lake insights.

Real-Time Integration: Attributes available for personalization in Real-Time Customer Profile and Adobe
Journey Optimizer.

Augmented Targeting: Combine real-time and batch audiences for precision marketing.

5. Data Activation for Ecosystem

Efficient Processing: Enhance activation workflows within AEP.

Storage Expansion: Add 1 TB for every 10,000 compute hours to store extra data (up to 24 months).

Cloud Storage Export: Export data in JSON or Parquet formats.

6. Machine Learning & Statistics


Statistical Functions: Built-in library for descriptive analytics.

Feature Engineering: Create features using SQL for advanced ML models.

Model Training: Train regression, classification, and clustering models.

Batch Inferencing: Derive attributes like propensity from scored datasets.
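
For example, the Approximate Analytics feature called out under Data Exploration can be exercised with a query like
the sketch below. The dataset and column names (web_events, visitor_id, marketing_channel, event_time) are
hypothetical; approx_count_distinct is a standard Spark-style approximate aggregate that the exploration engine is
expected to support.

```sql
-- A minimal exploration sketch over a hypothetical event dataset: estimate unique
-- visitors per channel for the last 30 days without a full ETL pass.
SELECT
  marketing_channel,
  approx_count_distinct(visitor_id) AS est_unique_visitors,
  COUNT(*)                          AS total_events
FROM web_events
WHERE event_time >= date_sub(current_date(), 30)
GROUP BY marketing_channel
ORDER BY est_unique_visitors DESC;
```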

How Data Distiller compares with AEP Intelligence (limited capacity) and standalone AEP applications:

Data Exploration: approximate functions and concurrency for 5 users in Data Distiller, versus basic exploration with
limited concurrency (1 user) in AEP Intelligence and limited exploration with per-app concurrency in standalone
applications.

ETL: advanced scheduling, visibility, and quarantine features in Data Distiller, versus limited ETL functionality
otherwise.

BI: the full warehousing engine, dashboards, and Query Pro Mode in Data Distiller.

Audiences: batch and real-time integration for personalization in Data Distiller, versus limited audience features
otherwise.

Machine Learning & Statistics: the full statistical library, model training, and batch inferencing in Data Distiller,
versus limited statistical and ML features otherwise.

Data Activation: 1 TB of additional storage per 10,000 compute hours with Data Distiller.

Top Use Cases Across Customers (As of December 2024)

1. Data Cleaning and Shaping: Prepare data for Profile (personalization) and Customer Journey Analytics
(insights).

2. BI Dashboards: Build highly customized dashboards for business intelligence use cases.

3. Derived Attributes: Use advanced computed attributes to enrich Profiles for segmentation and personalization.

4. Custom Identity Stitching: Fine-tune how data combines across channels (Profile and CJA).

5. Deep Data Analysis: Leverage OOTB functions for advanced analysis on lake data.

6. Emerging Use Cases:

Machine Learning for data analysis.

Audience management for Profile entitlement.

How to Use the Data Distiller Use Case & Capability Matrix
To effectively use the Data Distiller Use Case & Capability Matrix, start by identifying the primary business goals
your team is looking to achieve, whether it’s customer data onboarding, data transformation, or audience segmentation.
For each goal, review the corresponding use case in the matrix to understand the relevant capabilities and their
benefits. This will guide you in selecting the appropriate Data Distiller functionalities, such as exploration tools, batch
engines, or orchestration frameworks, to meet your needs. It’s important to assess how each capability aligns with your
specific marketing objectives and technical infrastructure. While the matrix simplifies decision-making, successful
implementation requires close collaboration between your marketing and data teams. The data team must be actively
involved to ensure the right data is available, properly transformed, and integrated into your workflows, enabling
marketing to make data-driven decisions that are accurate and actionable.

Data Distiller Use Case & Capability Matrix

The Key Capabilities column outlines a subset of core features you’ll likely use, but you may find yourself leveraging
many additional functionalities. Treat these as a starting point rather than an exhaustive list. For instance, it doesn’t
mention Data Distiller Query Pro Mode, the advanced SQL editor used to author SQL for all the use cases listed
below.

Check the comprehensive Data Distiller Capability Matrix below.

The following use case list represents over six years of Data Distiller implementations across various industry verticals
and organizations of all sizes.

Customer Data Onboarding & Activation

Onboard offline customer data and activate it across online platforms for more comprehensive retargeting.

Improve consistency and reach across marketing channels.

Data Distiller Data Exploration, Data Distiller Functions & Extensions, Data Distiller ETL

Perform data extraction, transformation, and loading (ETL) tasks within the AEP data lake.

Streamline data transformation directly in the AEP data lake, reducing the need for external ETL tools.

Data Distiller Data Exploration, Data Distiller Functions & Extensions, Data Distiller ETL

Periodically process customer data in batches to create audience segments based on purchase behavior, demographics,
or engagement levels. Activate these audiences in Adobe Real-Time CDP and Adobe Journey Optimizer.

Enables large-scale audience segmentation and provides marketers with up-to-date, actionable customer lists for
targeted campaigns.

Data Distiller Audiences, Data Distiller ETL

Real-Time Personalization & Offers

Deliver dynamic, personalized offers in Adobe Real-Time CDP, Adobe Target and Adobe Journey Optimizer based on
real-time customer interactions.

Increase engagement and conversion rates through timely, relevant content.

Data Distiller Enrichment, Data Distiller Orchestration

Content and Offer Recommendations at Scale


Batch-process customer interaction and purchase history data to generate personalized content or product
recommendations. For example, nightly batch jobs can update recommendation models for email campaigns, ensuring
that the right products or offers are surfaced.

Enhances customer engagement by delivering relevant recommendations at scale, personalized based on the latest
customer data.

Data Distiller Enrichment, Data Distiller ETL, Data Distiller Business Intelligence

Batch Data Integration for Customer 360 Profiles

Batch-process data from multiple sources (CRM, social media, transactional data, web analytics) to periodically
update complete customer profiles in the data lake. These profiles can be used to deliver personalized experiences and
communications across channels.

Ensures that customer profiles remain up to date and comprehensive, enhancing personalization efforts.

Data Distiller Enrichment, Data Distiller ETL, Data Distiller Business Intelligence

Customer Lifetime Value (CLV) Modeling

Model the long-term value of customers using transactional and behavioral data.

Focus marketing spend on high-value customers and optimize retention efforts.

Data Distiller Enrichment, Data Distiller ETL, Data Distiller Statistics & Machine Learning

Compliance Audits and Data Governance

Data Distiller can run batch processes to audit marketing data for compliance with regulations like GDPR, CCPA, or
other data privacy standards. This could include identifying and anonymizing sensitive data, tracking opt-ins and opt-
outs, and ensuring data usage aligns with legal requirements.

Ensures that marketing activities remain compliant with privacy regulations, reducing the risk of penalties and
enhancing customer trust.

Data Distiller Data Exploration, Data Distiller ETL

Cross-Sell and Upsell Opportunity Identification

Use batch processing to analyze customer purchase history and identify cross-sell and upsell opportunities. For
instance, weekly batch jobs can surface customers who recently purchased complementary products, allowing
marketers to target them with relevant offers.

Drives additional revenue by identifying and capitalizing on opportunities for cross-selling and upselling.

Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Business Intelligence

Long-term Customer Retention and Loyalty Program Analysis

Batch-process customer loyalty and retention data to analyze trends and the effectiveness of retention strategies. For
example, monthly batch jobs can evaluate the success of loyalty programs, discount campaigns, and re-engagement
efforts.

Helps refine retention strategies by providing regular, data-driven insights into what drives customer loyalty.
Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Business Intelligence

Customer Migration Analysis

Batch-process historical customer data to analyze patterns of customer migration between segments (e.g., frequent
buyers to inactive customers). This analysis helps identify why customers move between different value segments and
can trigger retention or re-engagement campaigns.

Reduces churn and increases customer lifetime value by identifying early signals of customer migration.

Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Business Intelligence

Competitive Benchmarking & Market Analysis

Batch-process data on competitor marketing efforts (e.g., social media activity, ad campaigns) and compare it to your
own. This data can be collected from third-party services or public sources and analyzed to understand market
positioning and identify competitive gaps.

Helps marketers adjust their campaigns based on competitor strategies and market trends.

Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Business Intelligence

Historical Campaign Performance Benchmarking

Run batch jobs to process historical campaign data and create benchmarks for marketing performance (e.g., click-
through rates, conversion rates) across various channels and periods. This allows marketers to measure current
campaigns against historical benchmarks.

Provides context for campaign performance by offering benchmarks based on past results, enabling better goal setting
and evaluation.

Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Business Intelligence

Attribution Analysis Over Long Periods

Run batch jobs to compute marketing attribution for extended periods (e.g., quarterly or yearly). This can involve
processing massive datasets from multiple campaigns, touchpoints, and channels to calculate performance metrics
using various attribution models (e.g., multi-touch, first-touch, last-touch).

Provides a holistic, long-term view of campaign effectiveness and helps allocate future marketing budgets based on
historical performance.

Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Business Intelligence

Sales & Marketing Alignment

Unify marketing and sales data to provide a complete view of the customer journey from lead to conversion.

Improve collaboration and drive revenue growth by identifying effective strategies.

Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Activation & Data Export, Data Distiller Business
Intelligence

Batch Processing for Lead Scoring


Run batch jobs to score leads based on historical interaction data (e.g., email opens, clicks, form submissions) and
assign predictive lead scores. This scoring can be refreshed daily or weekly, helping sales and marketing teams focus
on high-potential prospects.

Improves lead prioritization by automating the lead scoring process based on batch-analyzed historical data.

Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Statistics & Machine Learning, Data Distiller
Business Intelligence

Leverage Data Distiller’s SQL capabilities to analyze historical campaign performance and allocate future marketing
budgets based on the best-performing channels and segments.

Maximize ROI and ensure optimal use of marketing budgets by focusing spend on the best-performing channels,
segments, and strategies.

Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Business Intelligence

Campaign Data Cleanup and Standardization

Periodically run batch processes to clean, standardize, and enrich marketing data from disparate sources (e.g., social
media, CRM, and transactional data). This includes removing duplicates, filling in missing data, and ensuring
consistency across datasets before further analysis.

Improves data quality, leading to more accurate analytics, reporting, and decision-making.

Data Distiller Data Exploration, Data Distiller ETL

Batch Omnichannel Campaign Performance Analysis

Data Distiller enables the aggregation of data from multiple marketing channels, such as email, social media, and paid
search, to provide a holistic view of campaign performance. Through batch processing, it delivers a comprehensive
analysis of large-scale marketing efforts over time, uncovering trends and optimization opportunities across all
touchpoints that single-channel or real-time data may miss.

Deeper insights into marketing effectiveness, enabling strategy adjustments based on historical trends and outcomes.
Batch analysis with Data Distiller ensures accurate data, empowering marketers to make informed, data-driven
decisions for more effective omnichannel strategies.

Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Business Intelligence

Time-Series Analysis for Marketing Trends

Periodically process large datasets to identify long-term marketing trends using time-series analysis. For example, a
batch process could analyze customer engagement over time to spot seasonal patterns or emerging behavior trends.

Informs long-term marketing strategy by identifying shifts in customer behavior and campaign performance over
extended periods.

Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Business Intelligence

Batch jobs can process data after the completion of marketing campaigns to generate post-campaign reports, including
performance metrics, audience engagement, and ROI. This can be run at the end of each campaign cycle.

Provides detailed post-mortem insights into campaign success and areas for improvement, informing future
campaigns.
Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Business Intelligence

Periodically ingest and process marketing performance data from various platforms (Google Ads, Facebook Ads, email
platforms) into a unified marketing data warehouse. This allows for scheduled updates to marketing dashboards or
reporting systems.

Provides centralized and up-to-date marketing performance insights that are accessible across teams.

Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Activation & Data Exports

Customer Journey Stitching Across Channels

Batch jobs can be run to stitch together customer interactions from different channels (e.g., mobile, desktop, in-store).
This provides a unified view of the entire customer journey, allowing for deeper insights into how customers interact
with various touchpoints before conversion.

Allows marketers to understand how different channels contribute to the overall customer experience, helping refine
omnichannel strategies.

Data Distiller Data Exploration, Data Distiller ETL

Scheduled A/B Test Performance Analysis

Automate the analysis of A/B test results by running batch jobs that process performance data from multiple tests (e.g.,
different ad creatives or email subject lines). Batch processing allows for timely comparison of performance across test
groups.

Automates the evaluation of A/B tests at scale, allowing marketers to quickly identify winning strategies and optimize
campaigns.

Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Business Intelligence

Use batch processing to analyze large volumes of transaction data to identify which products are commonly purchased
together (market basket analysis). This data can then inform product bundling strategies or personalized offers.

Helps optimize merchandising and product recommendations by identifying patterns in customer purchase behavior.

Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Business Intelligence
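
As an illustration of the market basket idea, product pairs that co-occur in the same order can be counted with a
self-join; the order_items dataset and its columns (order_id, product_id) below are hypothetical.

```sql
-- Count how often two distinct products appear in the same order.
SELECT
  a.product_id               AS product_a,
  b.product_id               AS product_b,
  COUNT(DISTINCT a.order_id) AS orders_together
FROM order_items a
JOIN order_items b
  ON a.order_id = b.order_id
 AND a.product_id < b.product_id   -- avoid duplicate and self pairs
GROUP BY a.product_id, b.product_id
ORDER BY orders_together DESC
LIMIT 100;
```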

Lookback Windows for Event-Based Campaigns

Batch-process data to evaluate customer behavior during specific lookback windows (e.g., 7 days, 30 days). This can
be used to trigger event-based campaigns, such as re-engagement emails for customers who haven’t purchased in the
last 30 days.

Enables timely, event-based marketing campaigns that are triggered based on customer behavior over specific time
windows.

Data Distiller Data Exploration, Data Distiller Enrichment, Data Distiller ETL
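
A minimal sketch of the 30-day lookback described above, assuming hypothetical customers and orders datasets; the
resulting list could feed a re-engagement audience.

```sql
-- Customers with at least one historical purchase but none in the last 30 days.
SELECT
  c.customer_id,
  MAX(o.order_date) AS last_order_date
FROM customers c
JOIN orders o
  ON o.customer_id = c.customer_id
GROUP BY c.customer_id
HAVING MAX(o.order_date) < date_sub(current_date(), 30);
```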

Run batch jobs to analyze and optimize ad spend across channels. These jobs can look at historical performance data,
cost-per-click (CPC), return on ad spend (ROAS), and other metrics to recommend optimal budget allocation.

Maximizes marketing ROI by providing insights into where ad spend is most effective and where adjustments are
needed.
Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Business Intelligence

Customer Feedback Aggregation

Batch-process customer feedback data (e.g., surveys, product reviews) collected across multiple channels to generate
insights into customer satisfaction and sentiment. This can be done monthly or quarterly to inform product and
marketing strategies.

Helps marketers understand customer sentiment and improve messaging or product offerings based on aggregated
feedback.

Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Statistics & Machine Learning

Data Distiller can run batch jobs to analyze sales and customer engagement data over multiple years to detect seasonal
trends and predict future demand. This helps marketers adjust inventory, promotions, and campaign timing based on
historical trends.

Optimizes seasonal marketing efforts by aligning promotions with peak demand periods based on historical data.

Data Distiller Data Exploration, Data Distiller ETL

For data sharing across partners or for data collaboration in a clean room, Data Distiller can regularly anonymize large
datasets through batch processing. This could include hashing, tokenization, or other privacy-preserving techniques
before sharing data with external partners.

Enables privacy-compliant data sharing for joint marketing activities or external analysis, while protecting individual
customer data.

Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Activation & Data Exports

Batch-Powered Dynamic Pricing

Data Distiller can be used to run batch jobs that analyze pricing data in combination with competitive data, demand
trends, and customer behavior. Based on the results, dynamic pricing models can be adjusted periodically to optimize
pricing strategies for promotions or specific customer segments.

Increases revenue by optimizing prices based on real-time market conditions and customer willingness to pay.

Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Statistics & Machine Learning, Data Distiller
Business Intelligence

Post-Purchase Experience Optimization

Batch-process customer feedback, return data, and post-purchase behavior to identify friction points in the post-
purchase experience (e.g., product returns, negative feedback). This analysis can lead to improved communication
strategies, such as targeted post-purchase emails or customer support outreach.

Enhances customer satisfaction by proactively addressing post-purchase issues, leading to improved customer
retention.

Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Statistics & Machine Learning, Data Distiller
Business Intelligence

Inventory-Based Marketing Automation

Batch jobs can process inventory data to adjust marketing campaigns in real-time. If certain products are low in stock
or overstocked, marketing campaigns can be adjusted to feature promotions or highlight alternative products.

Aligns marketing efforts with current inventory levels, ensuring customers are shown relevant products and preventing
the promotion of out-of-stock items.

Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Statistics & Machine Learning, Data Distiller
Business Intelligence

Regional Campaign Analysis for Global Brands

Data Distiller can process data in batches to compare marketing performance across different regions or markets. This
could include understanding which messaging, products, or channels work best in each region, allowing global brands
to localize their campaigns more effectively.

Increases marketing effectiveness by tailoring strategies to the specific needs and behaviors of customers in different
regions.

Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Statistics & Machine Learning, Data Distiller
Business Intelligence

Behavioral Retargeting Analysis

Batch jobs can analyze browsing behavior, cart abandonment, or product interactions to power retargeting campaigns.
Data Distiller can process this data to identify customers who have interacted with certain products but haven’t
purchased, allowing for targeted remarketing campaigns.

Increases sales by identifying potential buyers based on their browsing behavior and targeting them with relevant
offers.

Data Distiller Data Exploration, Data Distiller Enrichment, Data Distiller ETL

Batch Processing for Predictive Maintenance of Marketing Campaigns

Analyze historical campaign data in batches to predict when ongoing campaigns may require updates, changes in
creative, or shifts in messaging. This could help marketing teams adjust campaigns before performance declines.

Maintains the effectiveness of long-running campaigns by proactively adjusting strategies based on predictive insights.

Data Distiller Data Exploration, Data Distiller Enrichment, Data Distiller ETL

Influencer Marketing Performance Analysis

Batch-process data from influencer campaigns (e.g., social media engagement, conversion rates) to analyze their
effectiveness. This could be used to identify which influencers drive the most conversions and engagement, allowing
marketers to refine their influencer partnerships.

Optimizes influencer marketing spend by focusing on partnerships that deliver the best ROI.

Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Business Intelligence
Personalization Attributes for Campaign Activation

Batch processing to join and manipulate data from multiple sources like analytics, product pricing, and customer
profiles to derive personalized fields.

Enables personalized emails based on customer behavior (e.g., abandoned cart).

Data Distiller Data Exploration, Data Distiller Enrichment, Data Distiller ETL

Merging & Pivoting Data from Multiple Brands for CLV (Customer Lifetime Value)

Combine and standardize sales data from different departments, clean it for inconsistencies, and calculate custom CLV
through batch processing.

Provides unified sales data for better insights and personalized customer profiles.

Data Distiller Data Exploration, Data Distiller Enrichment, Data Distiller ETL, Data Distiller Business Intelligence

Update all datasets with a master identity column for advanced attribution modeling, unifying data under a single
customer identity (ECID, AAI ID).

Enhances marketing attribution and allows cross-channel analysis (e.g., click-to-brick behavior).

Data Distiller Data Exploration, Data Distiller Functions & Extensions, Data Distiller ETL

Segment Sharing in AEP Apps (RTCDP & CJA)

Batch process out-of-the-box datasets from AEP to create experience event datasets for segment membership reporting in
CJA.

Facilitates data sharing between AEP and CJA for marketing performance reporting.

Data Distiller Data Exploration, Data Distiller Functions & Extensions, Data Distiller ETL

Customer 360 Data Model for Reporting

Combine data from multiple customer touchpoints (transactions, CRM, browsing history) to create a customer-centric
data model for BI reporting.

Enables personalized BI dashboards with detailed customer insights (e.g., frequency of visits, spend per customer).

Data Distiller Data Exploration, Data Distiller Enrichment, Data Distiller ETL, Data Distiller Business Intelligence

Batch processing to track prospects who searched for ineligible services (e.g., 5G/LTE) and retarget when services
become available in their area.

Activates prospects based on real-time changes in service availability.

Data Distiller Data Exploration, Data Distiller Functions & Extensions, Data Distiller ETL

Suppression for Adobe Journey Optimizer Segments

Extract journey history from logs and create attributes on profiles to suppress over-communication in marketing
journeys.

Helps manage communication frequency and avoid customer fatigue.


Data Distiller Data Exploration, Data Distiller Enrichment, Data Distiller ETL

Batch process browsing and transaction data to create datasets for personalized product recommendations based on
customer history and preferences.

Drives upsell and cross-sell opportunities with personalized offers.

Data Distiller Data Exploration, Data Distiller Enrichment, Data Distiller ETL

Identify bot patterns using batch processing of click and interaction data, apply machine learning models for bot
filtering, and refine data for reporting.

Increases ad spend efficiency by excluding bot-generated traffic.

Data Distiller Data Exploration, Data Distiller Statistics & Machine Learning, Data Distiller ETL

Consolidated Lookup Tables for Data Transformation

Batch processing to build a master lookup table from multiple sources, ensuring consistency across datasets used in
Customer Journey Analytics (CJA).

Improves data accuracy for downstream reporting and analysis.

Data Distiller Data Exploration, Data Distiller Functions & Extensions, Data Distiller ETL

Identity Graph Segmentation

Explode segment memberships and aggregate data for identity validation, ensuring consistent identity mapping across
datasets.

Ensures accurate segmentation and identity consistency for marketing campaigns.

Data Distiller Data Exploration, Data Distiller Functions & Extensions, Data Distiller ETL

Business Logic on Data Contracts for Campaign Optimization

Explode data from multiple sources, apply business logic (e.g., loyalty status, contract details) and use window
functions to prepare datasets for profile-based campaigns.

Optimizes campaigns with personalized messaging based on business-specific rules.

Data Distiller Data Exploration, Data Distiller Functions & Extensions, Data Distiller ETL

Next Best Offer Using Derived Attributes

Batch process browsing and purchase history to generate next best offer datasets, used for personalized email and
product recommendations.

Drives conversion by delivering timely, relevant offers to customers.

Data Distiller Data Exploration, Data Distiller Enrichment, Data Distiller ETL

Sales & Marketing Insights Reporting

Ingest data from multiple sources (3rd party services, Adobe Real Time CDP & Marketo), process through batch
operations, and generate insights dashboards for sales and marketing.
Provides real-time insights into sales and marketing performance across regions and channels.

Data Distiller Data Exploration, Data Distiller Enrichment, Data Distiller ETL, Data Distiller Business Intelligence

Flattening Nested Data for Customer Journey Analytics

Flatten highly nested retail or event data to prepare it for ingestion into analytics platforms like CJA for detailed
customer interaction analysis.

Simplifies complex data structures for better analytics and reporting.

Data Distiller Data Exploration, Data Distiller Functions & Extensions, Data Distiller ETL

Derived Attributes for Customer Churn Prediction

Batch process customer interaction data to identify churn risks and create datasets for retention campaign activation.

Reduces churn by targeting at-risk customers with proactive retention offers.

Data Distiller Data Exploration, Data Distiller Enrichment, Data Distiller ETL

Cross-Brand Affinity Scores

Data Distiller batch-processes a customer’s browsing and purchase data across multiple brands to identify cross-brand
affinities. By analyzing interactions (e.g., fashion items from Brand A, beauty products from Brand B), the system
generates personalized recommendations that span the customer’s interests across these brands, providing a
comprehensive view of their preferences.

This approach enhances cross-sell opportunities, boosts customer engagement, and fosters brand loyalty by delivering
relevant product suggestions across brands. It also drives higher revenue through personalized recommendations,
offering a unified shopping experience tailored to each customer’s cross-brand preferences.

Data Distiller Data Exploration, Data Distiller Enrichment, Data Distiller ETL, Data Distiller Business Intelligence

Derived Attributes: Engaged, Re-engaged, Active, Inactive, Return Order Counts, Preferred Brand

Data Distiller derives summary aggregates from cross-brand profiles and behavioral data (e.g., engagement status,
return order counts, preferred brands). These aggregates may vary by brand and customer level and are computed as
attributes to be ingested into the customer profile.

Provides detailed insights into customer engagement and behavior across multiple brands. Enables personalization and
targeted marketing by understanding customer preferences. Supports better decision-making with cross-brand metrics
available at both brand and person levels.

Data Distiller Data Exploration, Data Distiller Enrichment, Data Distiller ETL, Data Distiller Business Intelligence

Variance of a Derived Attribute

Data Distiller calculates time-series aggregates to capture the variance in computed attributes over time. These
aggregates are timestamped with the current date and ingested into an Experience Event Schema for tracking historical
changes.

Offers insights into how customer attributes evolve over time. Helps identify trends and patterns by comparing past
and present aggregated data. Improves forecasting and marketing strategies by leveraging time-series data for variance
analysis.
Data Distiller Data Exploration, Data Distiller Enrichment, Data Distiller ETL, Data Distiller Business Intelligence

Data Enrichment from Adobe Analytics Data

Data Distiller processes clickstream events from Adobe Analytics to derive key customer engagement metrics such as
the last viewed product, style color, cart ID, and timestamps for critical events (e.g., product views, cart additions, and
page views). These attributes are used to track the most recent customer interactions across the site, including product
browsing and cart activity.

Provides insights into customer preferences by tracking key engagement metrics, enabling targeted product
recommendations and personalized marketing. By identifying abandoned carts and tracking products viewed or added,
it supports effective retargeting campaigns to improve conversion rates. Additionally, it enhances the overall user
experience by analyzing browsing patterns to optimize product offerings and site navigation based on customer
behavior.

Data Distiller Data Exploration, Data Distiller Enrichment, Data Distiller ETL, Data Distiller Business Intelligence

Pipeline Management and Forecasting

By analyzing historical sales and opportunity data, Data Distiller can predict future revenue, identify bottlenecks in the
pipeline, and provide insights into the likelihood of deals closing.

Provides data-driven forecasting and pipeline management, helping sales teams allocate resources effectively.

Data Distiller Data Exploration, Data Distiller Statistics & Machine Learning, Data Distiller Business Intelligence

Data Distiller automates B2B data cleaning, validation, and standardization, addressing duplicates, incomplete records,
and inconsistent formats. It enriches datasets with missing information and provides continuous monitoring of data
quality metrics like accuracy and completeness.

Enhances decision-making with accurate data, improves customer segmentation for personalized marketing, boosts
efficiency by reducing manual corrections, ensures compliance, and optimizes CRM and marketing performance by
eliminating data friction.

Data Distiller Data Exploration, Data Distiller ETL, Data Distiller Business Intelligence

Custom Audience Format Export

Data Distiller allows the creation and export of custom audience segments in formats tailored for various marketing
platforms, ensuring seamless integration with external systems like CRM, ad platforms, and email tools.

This capability streamlines data sharing, enables precise targeting, and enhances campaign efficiency by delivering
well-defined audience segments ready for use across multiple channels.

Data Distiller Activation & Data Exports, Data Distiller ETL

Leverage Net Promoter Score (NPS) data to assess customer satisfaction and loyalty across different segments. This
involves transforming customer survey results into actionable insights by categorizing customers into Promoters,
Passives, and Detractors.

Analyzing customer sentiment through NPS helps businesses refine strategies, improve satisfaction, and boost
retention by focusing on Detractors and converting Passives into Promoters. It also enables more effective marketing
and product improvements based on customer feedback.

Data Distiller Data Exploration, Data Distiller Enrichment, Data Distiller ETL, Data Distiller Business Intelligence
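
For illustration, the Promoter/Passive/Detractor bucketing described above can be expressed as a short SQL sketch.
The survey dataset and columns (nps_survey_responses, customer_segment, nps_score) are hypothetical.

```sql
-- Categorize each respondent by their 0-10 score, then compute NPS per segment
-- (NPS = % Promoters - % Detractors).
WITH scored AS (
  SELECT
    customer_segment,
    CASE
      WHEN nps_score >= 9 THEN 'Promoter'
      WHEN nps_score >= 7 THEN 'Passive'
      ELSE 'Detractor'
    END AS nps_category
  FROM nps_survey_responses
)
SELECT
  customer_segment,
  100.0 * ( SUM(CASE WHEN nps_category = 'Promoter'  THEN 1 ELSE 0 END)
          - SUM(CASE WHEN nps_category = 'Detractor' THEN 1 ELSE 0 END) ) / COUNT(*) AS nps
FROM scored
GROUP BY customer_segment;
```
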
Data Distiller Capability Matrix

Data Distiller is often referred to as the “Swiss army knife” of the platform due to its extensive feature set, offering
incredible flexibility to tackle a wide array of custom use cases tailored to your organization’s unique needs. It is built
on a foundation of powerful, massive-scale data processing and analytical engines, making it a versatile and essential
tool.

The following Data Distiller capabilities are required for the above use case implementations and are not included in
Adobe Experience Platform applications. They require a separate Data Distiller license:

Data Distiller Query Pro Mode

This refers to the advanced SQL Query Editor, offering an object browser for easy exploration, a detailed query log
with search capabilities, and full visibility into orchestration jobs and schedules. It includes query-saving functionality
and is integrated with Data Distiller Insights for SQL-based chart creation. Additionally, Query Pro Mode allows you
to connect third-party editors and BI tools to Data Distiller through IP whitelisting.

This refers to the capability of querying the AEP Data Lake on massive relational or semi-structured datasets. Data
Distiller’s engine is highly optimized for querying deeply nested data. Data Distiller’s ad hoc query engine
dynamically scales (serverless) with user demand, democratizing data exploration. Additionally, it increases the limits
on concurrent query execution, ensuring smooth performance even with high system activity. The query timeout in this
mode is set to 10 minutes, providing ample time to execute the vast majority of exploratory queries efficiently.
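
As an illustration of ad hoc exploration over deeply nested data, a query might look like the sketch below. The
dataset name (luma_web_events) and field paths are hypothetical stand-ins for an XDM experience event schema; the
point is the dot notation used to reach nested fields.

```sql
-- Top pages by views over the last 7 days, read directly from nested XDM-style fields.
SELECT
  web.webPageDetails.name AS page_name,
  device.type             AS device_type,
  COUNT(*)                AS page_views
FROM luma_web_events
WHERE timestamp >= date_sub(current_date(), 7)
GROUP BY web.webPageDetails.name, device.type
ORDER BY page_views DESC
LIMIT 20;
```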

This capability allows the creation of datasets on the data lake through scheduled batch jobs, which can be chained
together, conditionally branched, and processed incrementally. These batch jobs can generate datasets that are
ingestible into the Data Distiller Warehouse (Accelerated Store), Real-Time Customer Data Platform, and Customer
Journey Analytics. Furthermore, Data Distiller provides visibility into the compute resources used for each job, down
to fractional amounts, offering greater transparency and control over resource consumption. This engine delivers
performance and scalability on par with leading market solutions, with a 24-hour timeout set for batch jobs to ensure
extensive processing capabilities.
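
A minimal sketch of such a batch job is shown below, assuming hypothetical web_events and
daily_engagement_summary datasets. CREATE TABLE AS SELECT and INSERT INTO are the usual patterns for creating and
extending derived datasets; the SNAPSHOT SINCE clause is shown as the commonly documented way to read only newly
ingested batches, but treat the exact syntax and snapshot-id handling as assumptions to verify against your
environment.

```sql
-- Initial load: materialize a derived dataset on the data lake.
CREATE TABLE daily_engagement_summary AS
SELECT
  visitor_id,
  to_date(timestamp) AS event_date,
  COUNT(*)           AS events
FROM web_events
GROUP BY visitor_id, to_date(timestamp);

-- Scheduled incremental run: append only rows from batches ingested after a given
-- snapshot (the id 12345 is a placeholder for the last processed snapshot).
INSERT INTO daily_engagement_summary
SELECT
  visitor_id,
  to_date(timestamp) AS event_date,
  COUNT(*)           AS events
FROM web_events SNAPSHOT SINCE 12345
GROUP BY visitor_id, to_date(timestamp);
```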

This capability enables the creation of batch audiences using SQL on the AEP Data Lake, which are automatically
ingested as external audiences into Adobe Real-Time Customer Data Platform and Adobe Journey Optimizer. When
combined with Data Distiller Orchestration, it supports both simple and complex audience composition tasks that go
beyond the capabilities of most segmentation and campaign tools, allowing for more sophisticated audience targeting
and optimization. External audiences in the Real-Time Customer Profile automatically expire after 30 days.
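
A hedged sketch of a SQL-defined batch audience is shown below. The CREATE AUDIENCE form and its WITH options
reflect the feature described above but should be checked against the current syntax reference; the orders dataset
and its columns are hypothetical.

```sql
-- Lapsed, high-value buyers keyed on an email identity (assumed audience syntax).
CREATE AUDIENCE lapsed_high_value_buyers
WITH (primary_identity = email, identity_namespace = Email)
AS
SELECT
  email,
  SUM(order_total) AS lifetime_spend
FROM orders
GROUP BY email
HAVING SUM(order_total) > 1000
   AND MAX(order_date) < date_sub(current_date(), 90);
```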

This capability enables the creation of SQL-based attributes on data stored in the data lake, allowing for the
development of more complex attributes that are typically challenging to author in a standard segmentation engine. It
supports extended lookback periods and intricate, chained logic with windowing functions. Additionally, these derived
attributes can be orchestrated and published to the Real-Time Customer Profile, making them available for
segmentation and personalization across various destinations, including Adobe Journey Optimizer.
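
A minimal sketch of a derived attribute with an extended lookback and window logic, assuming a hypothetical orders
dataset (customer_id, order_date, order_total); the resulting dataset would then be published to the Real-Time
Customer Profile through a profile-enabled dataset.

```sql
-- Per-customer recency and 365-day spend, computed with window functions.
SELECT DISTINCT
  customer_id,
  datediff(current_date(), MAX(order_date) OVER (PARTITION BY customer_id)) AS days_since_last_order,
  SUM(CASE WHEN order_date >= date_sub(current_date(), 365) THEN order_total ELSE 0 END)
      OVER (PARTITION BY customer_id)                                       AS spend_last_365_days
FROM orders;
```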

This capability allows you to create SQL fragments and parameters that can be reused and executed multiple times
with different values. Inline templates help modularize your code by enabling the use of SQL code blocks throughout
the program.

Data Distiller Data Models

This capability allows you to structure datasets and views into a star schema format, both in the AEP Data Lake and
the Data Distiller Warehouse. It also provides mechanisms to define primary and secondary key relationships between
columns, facilitating efficient data organization and querying.
This capability, also known as the Accelerated Store, is an interactive engine designed for low-latency queries, ideal for
dashboarding. It enables the creation of reporting star schemas, called Data Distiller Data Models, customized to meet
your organization’s specific requirements.
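
As a rough illustration of the star schema idea, the sketch below creates a small fact/dimension pair and declares
the relationship between them. Table and column names are hypothetical, and the constraint statement is shown in a
generic, assumed form rather than exact Data Distiller DDL.

```sql
-- A customer dimension and an orders fact table for low-latency reporting.
CREATE TABLE dim_customer AS
SELECT DISTINCT customer_id, loyalty_tier, home_region
FROM customer_profiles;

CREATE TABLE fact_orders AS
SELECT order_id, customer_id, order_date, order_total
FROM orders;

-- Declare the key relationship used by the reporting star schema (assumed form).
ALTER TABLE fact_orders
  ADD CONSTRAINT FOREIGN KEY (customer_id) REFERENCES dim_customer (customer_id);
```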

This capability integrates with the Data Distiller Warehouse to build reporting star schemas and features Data Distiller
Dashboards, offering charting capabilities, global filters, date pickers, and CSV downloads. It also allows for SQL-
based authoring of complex charts and filters, surpassing the limitations of standard BI tools. Additionally, the
warehouse seamlessly integrates with external BI tools for advanced data analysis. The Data Distiller Warehouse
supports up to 4 concurrent queries and provides 500GB of data storage.

This capability enables the activation and export of Data Distiller datasets from Adobe Experience Platform to
supported cloud storage destinations. Users can set batch schedules and export data in JSON or Parquet formats.
Additionally, the system ensures that all data is activated incrementally, streamlining the process and optimizing
resource use for efficient data handling. The activation size (GB) limits for the year are determined by your entitlement.

Data Distiller provides a robust set of tools for advanced data analysis and machine learning. It includes statistical
functions such as MEAN, MEDIAN, VARIANCE, STANDARD DEVIATION, SKEWNESS, and KURTOSIS for
summarizing data and measuring distribution properties, as well as correlation metrics like PEARSON and SPEARMAN
to assess relationships between variables. Approximate queries (SUM, COUNT) enable efficient calculations on large
datasets, while column statistics offer detailed insights into data distributions, including distinct counts, null values,
and min/max metrics. Sampling techniques facilitate quick exploratory analysis, and hypercubes support efficient data
aggregation across dimensions, enabling incremental unique counts. It also includes tools for feature engineering and
supports machine learning algorithms such as linear regression, logistic regression, decision trees, and k-means
clustering, helping users transform data and build optimized models for actionable insights.
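
For example, a model-training and batch-inference flow could look like the sketch below. The CREATE MODEL and
model_predict style follows the statistics and machine learning capability described above, but the exact option
names, transformer syntax, and all dataset/column names (churn_training_data, churn_features, is_churned) are
assumptions to verify against the current documentation.

```sql
-- Assumed sketch: train a logistic-regression churn model on engineered features...
CREATE MODEL churn_model
TRANSFORM (vector_assembler(array(recency_days, frequency_90d, monetary_365d)) AS features)
OPTIONS (MODEL_TYPE = 'logistic_reg', LABEL = 'is_churned')
AS SELECT recency_days, frequency_90d, monetary_365d, is_churned
   FROM churn_training_data;

-- ...then score a fresh feature set in batch to derive a propensity-style attribute.
SELECT *
FROM model_predict(churn_model, 1,
       'SELECT customer_id, recency_days, frequency_90d, monetary_365d FROM churn_features');
```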

This capability enables the use of specialized functions for ETL transformations, including row-level operations such
as array, string, math, and date manipulations, as well as anonymization functions. It also includes tools for attribution,
sessionization, and pathing analysis on Adobe Analytics data. Additionally, lambda functions are available for
performing more complex, custom operations. Data Distiller extensions, which are enhancements to the SQL syntax,
allow for automation of tasks such as enabling datasets for profiles, schema authoring, and creating star schemas (data
models) within the Data Distiller Warehouse (Accelerated Store).
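
The sessionization and attribution helpers mentioned above are typically used as window functions. The sketch below
runs over a hypothetical clickstream dataset; SESS_TIMEOUT and ATTRIBUTION_FIRST_TOUCH are the Adobe-defined function
names as generally documented for Query Service, but confirm their signatures in the current reference.

```sql
-- Sessionize with a 30-minute inactivity timeout, then attribute each event to the
-- first marketing channel touched by the visitor.
SELECT
  visitor_id,
  timestamp,
  SESS_TIMEOUT(timestamp, 60 * 30)
    OVER (PARTITION BY visitor_id ORDER BY timestamp
          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session,
  ATTRIBUTION_FIRST_TOUCH(timestamp, 'channel', marketing_channel)
    OVER (PARTITION BY visitor_id ORDER BY timestamp
          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS first_touch
FROM web_events;
```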

Data Distiller Accelerators

This capability allows users to configure and execute common Data Distiller tasks and use cases by simply inputting
the required parameters.

Data Distiller Lake Storage

Data Distiller users get additional AEP Data Lake storage based on their entitlement.

NPS (Net Promoter Score) Calculation and Customer Satisfaction Analysis

The product comes equipped with powerful foundational capabilities that take your data to the next level, featuring a
Data Distiller-powered dashboard seamlessly integrated into Looker.

https://data-distiller.all-stuff-data.com/unit-1-getting-started/prep-300-adobe-experience-platform-and-data-distiller-primers * * *

Adobe Experience Platform Primer

In this section, we’ll delve into the fundamental concepts of Adobe Experience Platform. Data enters the Adobe
Experience Platform through edge, streaming, and batch methods. Regardless of the ingestion mode, all data must find
its place in a dataset within the Platform Data Lake. The ingestible data falls into three categories: attributes, events,
and lookups. The Real-Time Customer Profile operates with two concurrent stores – a Profile Store and an Identity
Store. The Profile Store takes in and partitions data based on the storage key, which is the primary identity. Meanwhile,
the Identity Store continuously seeks associations among identities, including the primary one, within the ingested
record, utilizing this information to construct the identity graph. These two stores, accessing historical data in the
Platform Data Lake, open avenues for advanced modeling, such as RFM, among other techniques.

Adobe Experience Platform excels in ingesting data from diverse sources. However, marketers face a significant
challenge in extracting actionable insights to enhance their understanding of customers. Data Distiller addresses this
challenge by providing the flexibility to query data using standard SQL in the Query Editor.

A valuable addition to this capability is the Data Distiller package, which encompasses a subset of functionalities
available in Adobe Experience Platform. Specifically designed to facilitate post-ingestion data preparation, Data
Distiller tackles key tasks such as cleaning, shaping, and manipulation. It executes batch queries in the Query Service,
preparing data for use in Real-Time Customer Profile and other applications.

Utilizing Data Distiller, you gain the capability to join any dataset within the data lake and capture query results as a
new dataset. This newly created dataset proves versatile, serving various purposes such as reporting, machine learning,
or ingestion into Adobe Experience Platform-based applications like Real-Time Customer Profile Data, Adobe Journey
Optimizer, and Customer Journey Analytics.

There are three primary use cases for Data Distiller, and the list continues to expand every few releases.

Next, let us get familiar with a few key terms that will be used throughout this book:

Adobe Experience Platform (AEP): Throughout this book, AEP is used as shorthand for Adobe Experience Platform.

Adobe Experience Platform Data Lake: This denotes the data lake store housed within the Adobe Experience
Platform governance boundary. Irrespective of the ingestion mode, all data is directed to the Adobe Experience
Platform Data Lake. Currently, Data Distiller interacts with this lake, both reading and writing datasets.
Additionally, Data Distiller possesses its own accelerated store designed for business intelligence reporting,
allowing seamless reading and writing of datasets. The Adobe Experience Platform Data Lake contains datasets
which can be either attributes, events, or lookups. Each of these datasets must have an associated schema with
them.

Query Service: This is a broad set of SQL capabilities in the Adobe Experience Platform. Some of these
capabilities may be included in the packaging of various Apps such as Adobe Journey Optimizer but most of it is
packaged in Data Distiller. It is referred to as a service as the entire foundation is built on service-oriented
architecture.

Derived Attributes: In Data Distiller, derived attributes are calculated or derived from other attributes within a dataset or table, and they are stored in a customized dataset called a Derived Dataset. These attributes are computed using expressions or mathematical functions applied to existing attributes or events within the same table, or through joins with other tables. For example, calculating Customer Lifetime Value (CLTV) based on the last 5 years of transactions for each customer (a minimal sketch follows this list of terms).

Audiences: Audiences are constructed on top of attributes, events and derived attributes which include logic for
metrics such as Customer Lifetime Value (CLTV) or the count of transactions. Audiences can encompass 1st,
2nd, or 3rd party data and may combine data from multiple sources associated with the same person.

Ad hoc queries: Ad hoc queries refer to SQL queries utilized for exploring ingested datasets, primarily for
verification, validation, experimentation, etc. These queries, crucially, DO NOT write data back into the Adobe
Experience Platform Data Lake.

Batch queries: Batch queries are SQL queries employed for post-ingestion processing of ingested datasets.
These queries undertake tasks like cleaning, shaping, manipulating, and enriching data, with the results written
back to the Platform Data Lake. Batch queries can be scheduled, managed, and monitored as batch jobs.

Accelerated Store: SQL queries executed against this reporting layer support interactive dashboards and BI
workflows. The results are cached for faster response time. Within the Data Distiller offering, customers can
utilize an accelerated store to create insights data models efficiently, including the one employed for RFM
analysis in this lab. Directly within our user interface, users can employ a lightweight BI-type dashboard to
visualize key performance indicators (KPIs). Additionally, there is the option to seamlessly connect external BI
tools, such as Power BI, enhancing flexibility in data visualization and analysis.

Derived Datasets: The Derived Datasets feature can be leveraged for cleaning, shaping, and manipulating
specific data from the Adobe Experience Platform Data Lake to generate custom datasets. These datasets can be
refreshed at a regular cadence to enrich the Real-Time Customer Profile. By leveraging derived datasets, you can create complex calculations involving distributions such as deciles, percentiles, or quartiles, or simpler ones such as maximum value, counts, and mean value. These datasets can be tailored to individual users or business entities,
associating directly with identities such as email addresses, device IDs, and phone numbers, or indirectly with
user or business profiles.
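To make the Derived Attributes, Batch queries, and Derived Datasets terms concrete, here is a minimal sketch of a batch query that computes a Customer Lifetime Value (CLTV) attribute and captures it as a new derived dataset. The table and column names (purchase_events, customer_id, order_total, purchase_date) are hypothetical, and the Postgres-style date arithmetic is shown purely for illustration.

```sql
-- Hypothetical derived dataset: CLTV per customer over the last 5 years.
-- CREATE TABLE AS is a batch query, so the result is written back as a new dataset.
CREATE TABLE cltv_derived_dataset AS
SELECT
    customer_id,
    SUM(order_total) AS cltv_5yr,        -- the derived attribute
    COUNT(*)         AS transaction_count
FROM purchase_events
WHERE purchase_date >= CURRENT_DATE - INTERVAL '5 years'
GROUP BY customer_id;
```

An ad hoc query, by contrast, would be a plain SELECT against purchase_events with no CREATE TABLE AS, so nothing would be written back to the data lake.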

Why use Derived Datasets?

Derived Datasets play a crucial role in various data analysis and enrichment scenarios, especially when analyzing data
on the Adobe Experience Platform Data Lake. Furthermore, they can be marked for use in the Real-Time Customer
Profile and applied in downstream use cases such as audience targeting. Potential use cases include:

Identifying the bottom 10% of subscribers based on channel viewership to target specific audience segments for new subscription packages (see the sketch after this list).

Identifying the top 10% of flyers based on total miles traveled and “Flyer” status to target them for new credit card offers.

Analyzing subscription churn rates.

Identifying the top 1% of household income in a region and tracking the number of individuals moving out of
that income bracket over a specified period.
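The first use case in the list above (bottom 10% of subscribers) could be implemented as a decile-style derived dataset roughly as follows; subscriber_viewership and its columns are assumed names used purely for illustration.

```sql
-- Hypothetical sketch: rank subscribers into viewership deciles and keep the bottom 10%.
CREATE TABLE bottom_decile_subscribers AS
WITH ranked AS (
    SELECT
        subscriber_id,
        total_viewing_minutes,
        NTILE(10) OVER (ORDER BY total_viewing_minutes) AS viewership_decile
    FROM subscriber_viewership
)
SELECT subscriber_id, total_viewing_minutes
FROM ranked
WHERE viewership_decile = 1;   -- decile 1 = lowest 10% of viewership
```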

Why use Data Distiller Customizable Insights Dashboards?

Dashboards provide a dynamic and interactive way to review RFM (Recency, Frequency, Monetization) marketing
analysis, offering insights and trends at a glance. This approach enables businesses to quickly identify valuable
customer audiences and adjust their marketing strategies accordingly, maximizing both engagement and ROI.

Basic Architecture of the Adobe Experience Platform.

Figure 1: Data Distiller Use Cases Marketecture Diagram.

https://data-distiller.all-stuff-data.com/unit-1-getting-started/prep-301-leveraging-data-loops-for-real-time-
personalization * * *

1. UNIT 1: GETTING STARTED

PREP 301: Leveraging Data Loops for Real-Time Personalization


Real-time personalization isn’t just about having the best tools—it’s about creating efficient data loops that allow you
to respond instantly to customer needs and provide exceptional service.

Last updated 6 months ago


Yesterday, I was in a customer meeting where the data architects walked me through their real-time personalization
data architecture. The presentation was impressive—a mosaic of ten different tools, each with its own color scheme
and architecture. Some were legacy systems, others were roadmapped for future implementation, and there were even
boxes planted for Adobe Experience Platform components (perhaps to please me). They were designing the ultimate
data model, the perfect data dictionary where everything would work seamlessly end-to-end. Every actor in this
elaborate play was poised to execute their role so perfectly that there was no doubt in their minds that this would be a
hit with their audience—the marketing team. Governance, security, and privacy concerns? All addressed seamlessly in
this utopian vision.

And then they asked me this question: “If we get the data foundation right, what could possibly go wrong even if the
marketing team threw new use cases at us?”

I just did not know how to respond.

Here is the thing - Technology can be so blinding that we can easily miss the point. It’s never about having the best
technology because, honestly, you can shop around for that. The key to personalization is data. By now, that should be
clear. But there’s one extra thing—creating effective data loops. But even that does not cut it.

Consider your customer for a moment. Even if the marketing team hasn’t presented specific use cases, take a moment
to imagine how the data you have can be used to better serve your customers.

Let me paint a picture for you. Imagine a customer standing at your doorstep—what’s the most relevant information
you need to serve them effectively in that moment? Should you waste time calling customer service to ask about their
recent return experience? Or do you quickly check your computer to see that she’s been buying gifts for her family
every week before visiting home? Perhaps she needs luggage to carry all those items—should you ask her about that?
Personalization isn’t about guessing; it’s about having a meaningful conversation focused on how you can best serve
your customer, using the data you have right at your fingertips. The whole point of leveraging their data is to make this
conversation as efficient and impactful as possible.

In today’s rapidly evolving data landscape, “composable data architecture” has become a buzzword. It emphasizes the
use of top technologies, modular components, and the ability to adapt to changing data needs. However, beyond the
hype around new tools, the true value of data architecture lies in its ability to transform data into actionable insights
that facilitate meaningful conversations and exceptional customer service. Regardless of whether your architecture is
composable or which vendor you choose, your primary focus should be on effective personalization data loops.

The Heart of Personalization Data Architectures: Data Must Drive Action and Reflection

Data and Action are the Yin-Yang of Personalization.

Personalization data architectures aren’t just about assembling the most advanced tools; they’re about enabling your
organization to swiftly turn data into actionable insights. Whether you choose a centralized or decentralized approach,
the end goal is the same: leveraging data to drive both real-time decisions and long-term strategic outcomes.

In real-time personalization, speed is key. Customers expect immediate responses and personalized experiences in
every interaction. To achieve this, organizations need to establish a fast data loop—a system where data is quickly
ingested, processed, and acted upon. This fast loop is crucial for turning raw data into personalized actions, delivering
value right when it’s needed.

However, balancing speed and quality presents a challenge: quick decision-making often leaves little room for
reflection on past experiences. The urgency of the situation requires immediate action, while quality decisions
typically involve more thoughtful consideration of past data. This is where it’s essential to design data loops that
effectively support both fast and informed decision-making.

The Need for Speed: Fast Data Loops for Real-Time Personalization

Real-time personalization depends on the quick turnaround of data and insights. Picture a customer interacting with
your platform—every click, scroll, and purchase generates valuable data that, if processed rapidly, can instantly
enhance their experience. The faster you can bridge the gap between data collection and action, the more relevant and
personalized the experience you can deliver.

In the Adobe Experience Platform architecture, we made a deliberate choice to enable this fast loop by incorporating
technologies designed for low-latency processing. This includes leveraging in-memory databases, stream processing,
and real-time edge technologies. To drive a data loop that closely aligns with personalization, we developed the
Experience Data Hub, where events can be activated within minutes in Adobe Journey Optimizer. Additionally,
Customer Journey Analytics allows us to analyze patterns within 15 minutes. Working alongside these is Data Distiller,
equipped with powerful data processing engines that can compute new attributes for personalization within an hour.
Together, these components ensure that data flows seamlessly from source to action, allowing you to reach your
customers with the right message at the right time.

Now, consider this: we could have bypassed many of these elements and focused solely on building a single product,
like an exceptional email sender. But personalization requires more than just the best technology for one task. As a
solutions provider, I must think beyond that and build a comprehensive system where all these elements work together.
This is what’s needed to drive the personalization revolution that’s still missing from our experiences as customers.

The Power of Reflection: Slow Data Loops for Deep Insights

While fast loops are essential for real-time actions, not all insights need to be immediate. Some of the most valuable
insights come from deep, sophisticated analysis and reflection that takes time to develop. These slower loops involve
aggregating large datasets, building complex models, and uncovering trends that inform long-term strategies.

In personalization data architectures, slow loops often require moving or accessing data across different systems. You
might need to aggregate data from multiple sources, apply machine learning models, or run advanced analytics to
generate insights. This process is not about speed but about depth and accuracy. The insights generated in these slow
loops help you understand customer behavior, optimize business processes, and make informed decisions that drive
future growth.

Bridging the Fast and Slow Data Loops: A Balanced Approach

The beauty of personalization data architectures lies in their ability to support both fast and slow loops effectively. By
modularizing your data architecture, you can optimize for both real-time and deep insights without compromising on
either. This balanced approach ensures that you’re not just reacting to data but also learning from it, evolving your
strategies, and continuously delivering value to your customers.

It’s About Data Loops, Not the Technology

In the end, the success of a personalization data architecture isn’t measured by the technologies you use or the
complexity of your systems. It’s measured by how well you can turn data into action—how quickly you can respond to
customer needs in real-time, and how deeply you can understand and anticipate those needs over time.

As you build and refine your data architecture, remember that the real goal is to create a system that enables both fast
and slow loops of insight, each serving its unique purpose. Whether you are activating real-time personalization or
developing sophisticated data models, what matters most is that you’re consistently turning data into meaningful,
actionable insights for your customers.

Fast data loops are like reflexes, quickly responding to stimuli.


Slow data loops are similar to reflection that involves deliberate and thoughtful consideration.

Inner personalization data loops run faster because they are either reacting to fresh behavioral data or drawing on precomputed historical behaviors encapsulated in attributes.

https://data-distiller.all-stuff-data.com/unit-1-getting-started/prep-303-what-is-data-distiller-business-intelligence * * *

1. UNIT 1: GETTING STARTED

PREP 303: What is Data Distiller Business Intelligence?


Unleash the Power of BI with Speed, Flexibility, and Precision

What is Business Intelligence?

Business Intelligence (BI) is the process of turning raw data into actionable insights that drive better decision-making
within an organization. BI involves collecting, integrating, analyzing, and visualizing data to uncover trends, identify
opportunities, and solve business challenges. It empowers businesses to make data-driven decisions, ensuring they stay
competitive in a rapidly evolving market.

Modern BI tools provide interactive dashboards, reports, and data visualizations, enabling users to explore data in real-
time. These tools are designed to be user-friendly, making it easier for non-technical stakeholders to interpret complex
datasets. BI is used across industries for tasks like sales forecasting, customer segmentation, operational efficiency, and
financial planning, ensuring every decision is backed by evidence and insights.

The technology stack supporting Business Intelligence typically includes the following components:

Data Sources: BI starts with collecting data from various sources such as transactional systems, CRM platforms, marketing tools, IoT devices, or third-party APIs. These data sources can be structured (e.g., databases), semi-structured (e.g., JSON files), or unstructured (e.g., social media posts).

Data Integration and ETL: Extract, Transform, and Load (ETL) tools gather data from multiple sources, transform it into a consistent format, and load it into a centralized repository. This step ensures data quality, consistency, and readiness for analysis.

Data Warehousing: A data warehouse serves as the central hub where cleaned and organized data is stored. It is optimized for analytical queries rather than transactional operations, enabling users to access historical and aggregated data efficiently. Popular data models like star and snowflake schemas organize data for easy querying.

Data Transformation and Modeling: In this step, data is further refined and modeled to create relationships between different entities, dimensions, and measures. Techniques like star schemas provide a user-friendly structure for analysts to query and visualize data effectively.

Query and Analysis Tools: The query layer allows users to interact with data using SQL or other query languages. This layer often includes a query engine optimized for speed, enabling real-time or near-real-time analysis.

Visualization and Dashboarding Tools: BI platforms provide visual interfaces for creating dashboards, charts, and reports. These tools help users interact with data through intuitive visuals, uncovering trends and patterns quickly.

Advanced Analytics and AI: Modern BI stacks incorporate machine learning and AI for predictive analytics, anomaly detection, and natural language queries. This layer helps organizations go beyond descriptive analytics to answer “what will happen next?” and “what should we do?”

Collaboration and Sharing: BI platforms support collaboration by enabling users to share dashboards, reports, and insights across teams. This ensures alignment and drives organization-wide data literacy.

Key Benefits of the BI Stack

By leveraging this comprehensive technology stack, organizations can unlock data’s full potential—delivering
actionable insights faster, scaling data usage, and empowering users at all technical levels. With the right BI stack,
businesses can respond to market trends with agility, optimize operations, and achieve a competitive edge.

Unlock the Future of Business Intelligence with Data Distiller

Data Distiller Business Intelligence revolutionizes the way you analyze and visualize data, offering a uniquely
powerful platform tailored for businesses that demand flexibility, precision, and speed. With seamless SQL-driven
chart creation, advanced filter logic, and high-performance data access, Data Distiller empowers you to transform raw
data into actionable insights with unmatched efficiency.

A Next-Generation SQL Engine for Actionable Insights

At the heart of Data Distiller is a high-performance SQL engine purpose-built for Business Intelligence. Unlike
traditional data warehousing systems that prioritize storage and batch processing, Data Distiller’s engine is optimized
for real-time queries and advanced analytics. This design allows for lightning-fast responses, even when working with
massive datasets, ensuring your dashboards and reports deliver insights at the speed of your business.

Flexible Data Modeling for Deeper Insights

Data Distiller embraces the flexibility of star schemas and custom data models, enabling you to design your data
architecture for optimal performance and usability. Star schemas simplify complex relationships into intuitive
structures, making it easier to query, visualize, and understand your data. This approach enhances both speed and
scalability while empowering analysts to answer even the most intricate business questions without unnecessary
complexity. Whether you need to adapt your model to support new metrics, dimensions, or hierarchies, Data Distiller
ensures your data model evolves with your business.
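As a hedged illustration of what querying a star schema looks like in practice, the sketch below joins a hypothetical fact table to two hypothetical dimension tables; none of these table or column names exist out of the box.

```sql
-- Hypothetical star schema: one fact table joined to two dimensions.
SELECT
    d.calendar_month,
    p.product_category,
    SUM(f.revenue) AS monthly_revenue
FROM fact_orders f
JOIN dim_date    d ON f.date_key    = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.calendar_month, p.product_category
ORDER BY d.calendar_month, p.product_category;
```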

Empower Decision-Makers with Contextual Insights

Gone are the days of static dashboards. Data Distiller allows you to drill through from high-level metrics to granular
data effortlessly. Whether you’re exploring regional trends or investigating anomalies, every interaction is backed by
real-time contextual filters that ensure consistency and relevance across all visualizations.

Unleash Flexibility with SQL Chart Authoring

Why settle for rigid interfaces when you can have complete control? Data Distiller brings the full power of SQL
directly into the chart authoring process, enabling you to craft complex metrics—like rolling averages or custom
aggregations—right where you need them. No need to reprocess metrics at the backend; just write, visualize, and act.
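For example, a rolling average of the kind mentioned above can be expressed directly in a chart's SQL with a window function; daily_revenue and its columns are assumed names used only for illustration.

```sql
-- Hypothetical sketch: 7-day rolling average of revenue, computed at chart-authoring time.
SELECT
    order_date,
    revenue,
    AVG(revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7_day_avg
FROM daily_revenue
ORDER BY order_date;
```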

Next-Level Filter Precision

With Data Distiller, filters are smarter. Create global filters that cascade seamlessly across your dashboards or define
local filters for specific charts—offering unparalleled customization. Advanced date filters provide intuitive options for
both fixed ranges and presets, enabling faster, more precise temporal analysis.

Optimized Performance Meets Seamless Integration

Leverage high-performance, optimized data models with effortless connectivity to your preferred analytics tools. Data
Distiller ensures you get the speed and efficiency of an advanced query engine while working in a familiar BI
environment, so you can focus on insights—not technical constraints.

A Solution Built for Business Agility

Whether you’re crafting dashboards, diving into complex queries, or refining filters, Data Distiller is designed to grow
with your needs. It bridges the gap between power users and business teams, making even the most complex data
accessible and actionable.

How Data Distiller Addresses the BI Stack

Data Sources: Data Distiller seamlessly integrates with diverse data sources, from transactional systems to APIs, supporting structured and semi-structured data.

Data Integration and ETL: Simplifies ETL with direct SQL-based transformations, reducing the complexity of traditional ETL pipelines while maintaining data quality.

Data Warehousing: Combines optimized storage in the Accelerated Store with a high-performance query engine tailored for real-time analytics.

Data Transformation and Modeling: Supports flexible data modeling, including star schemas, allowing businesses to easily define relationships and create scalable data structures.

Query and Analysis Tools: Offers a SQL-driven approach for advanced metric calculations, enabling real-time queries and unmatched analytical flexibility.

Visualization and Dashboarding Tools: Provides intuitive dashboards with SQL-powered chart creation, advanced global filters, drillthroughs, and interactive visualizations.

Advanced Analytics and AI: Enhances analytics with SQL capabilities for custom logic, and an integration-ready architecture for AI/ML workflows on high-performance data.

Collaboration and Sharing: Enables easy sharing of dashboards and insights, ensuring alignment across teams with customizable access and filter configurations.

Comparing Data Distiller and Customer Journey Analytics: A Comprehensive Analysis

In the rapidly evolving world of data-driven decision-making, tools that address distinct needs in data processing,
analytics, and activation are critical. Data Distiller and Customer Journey Analytics (CJA) represent two powerful
platforms that cater to complementary aspects of an organization’s analytics strategy. While Data Distiller excels in
foundational data processing, complex modeling, and advanced machine learning (ML) capabilities, CJA shines in
providing real-time, multi-channel insights into customer journeys, extending beyond traditional web analytics. This
analysis explores how these platforms differ and how they can work together to create a unified analytics ecosystem.

Core Purpose and Use Cases

Data Distiller serves as a general-purpose data platform, combining powerful ETL capabilities, scalable data
processing, and integrated machine learning. It is designed to process raw and aggregated data, enabling businesses to
create robust data pipelines, build custom metrics, and deploy advanced models. Use cases range from segmentation
and predictive analytics to batch ETL and real-time data transformation.

Customer Journey Analytics is purpose-built for tracking and analyzing customer interactions across channels in real
time. It extends traditional analytics capabilities by stitching together data from multiple sources, enabling a unified
view of the customer journey. This platform is ideal for analyzing cross-channel behavior, monitoring campaigns, and
delivering personalized customer experiences.

Data Handling and Data Modeling: Flexibility vs. Optimization

Data Distiller offers robust capabilities for handling both raw and aggregated data, giving organizations unparalleled
flexibility in managing their data workflows. It can process raw data at full granularity, enabling complex joins,
advanced metric calculations, and exploratory analysis. This flexibility allows businesses to adapt their data processing
to a wide range of use cases, from ad hoc deep dives to creating materialized views and pre-computed metrics for
efficient reporting. Data Distiller’s dual capability to handle both raw and aggregated data ensures that it is not limited
to any one approach, making it versatile for foundational data preparation and analysis.

Customer Journey Analytics (CJA), on the other hand, is optimized for ingesting and stitching raw event data from
multiple channels to create a unified view of customer journeys. While it focuses on handling raw interaction data, its
architecture is designed to aggregate and unify this data across touchpoints, resulting in highly efficient, real-time
insights. This makes CJA exceptionally fast for tracking customer behavior and calculating key performance indicators
(KPIs), but it is less suited for exploratory data modeling or detailed transformations. Its emphasis is on delivering
actionable insights from stitched, event-level data.

Data Distiller supports a wide range of data modeling options, offering the flexibility to design schemas that best suit
specific business needs. This includes support for star schemas, normalized structures, and custom relational models
that can adapt to evolving analytical requirements. Analysts and engineers can build models that align with their
business logic, enabling deep exploration and customization for complex queries or unique business scenarios. This
flexibility makes Data Distiller an excellent choice for businesses looking to develop sophisticated metrics,
segmentation strategies, or predictive models.

In contrast, Customer Journey Analytics relies on a predefined, denormalized schema optimized for speed and
simplicity. The data is highly indexed and tailored for real-time queries, ensuring low latency and high efficiency when
analyzing customer journeys. While this design is perfect for delivering fast, actionable insights, it sacrifices the ability
to customize data models extensively. The predefined structure streamlines operations but limits flexibility, making it
more suitable for standardized reporting and real-time use cases than for exploratory or customized analytics.

The distinction between Data Distiller and CJA lies in their approach to balancing flexibility and performance. Data
Distiller prioritizes adaptability, allowing businesses to model their data as needed and enabling a wide array of
analytical use cases. CJA, by contrast, is purpose-built for optimized performance in tracking and analyzing customer
journeys, leveraging its predefined schema and indexing to deliver immediate insights.

Real-Time vs. Batch Processing

Data Distiller offers strong batch processing capabilities, making it ideal for large-scale data preparation, such as ETL
workflows for creating comprehensive data models or refining customer segments. It also supports real-time ingestion
pipelines, enabling near-real-time analytics when required. This balance of batch and real-time processing makes it a
versatile platform for foundational analytics.

Customer Journey Analytics, however, is natively designed for real-time data processing. Its ability to ingest and
analyze event streams instantaneously makes it a critical tool for time-sensitive applications. Businesses can monitor
live customer interactions, respond to trends as they happen, and deliver real-time personalization across multiple
channels, ensuring they stay agile in a competitive landscape.

Analytics and Query Complexity

One of Data Distiller’s strengths is its SQL-driven approach, which allows users to build custom metrics and perform
advanced calculations with unparalleled flexibility. It supports complex queries, advanced relational modeling like star
schemas, and even integrates machine learning for predictive analytics and clustering. This makes it a powerful tool
for exploratory analysis and hypothesis testing.

Customer Journey Analytics, by contrast, is optimized for speed and simplicity. Its flat, denormalized data structures
enable lightning-fast query performance but are less suited for highly complex, ad hoc analyses. Instead, it focuses on
descriptive and diagnostic analytics, providing rapid insights into customer journeys and enabling segmentation and
activation in real time.

Data Modeling and Schema Design

Data Distiller supports flexible data modeling, allowing businesses to design star schemas and other relational
structures that enable deep analytical queries. This flexibility makes it an excellent choice for scenarios where
understanding relationships and hierarchies in the data is crucial, such as building customer propensity models or
analyzing multi-dimensional sales performance.

Customer Journey Analytics focuses on stitching cross-channel data into unified, denormalized schemas. This
approach simplifies data representation, ensuring that customer journeys are seamlessly integrated and easy to query. It
excels in creating a single source of truth for customer interactions, enabling businesses to monitor and act on insights
across web, mobile, email, and other channels.

Performance and Scalability

Data Distiller is built to scale, separating storage and compute to handle massive datasets efficiently. Its architecture
supports high-throughput batch processing and real-time data flows, making it versatile for both foundational data
preparation and insights generation. However, its focus on flexibility can sometimes result in slower query
performance for pre-aggregated metrics compared to platforms optimized for real-time analytics.

Customer Journey Analytics is designed for high-speed, low-latency operations, with an architecture built to handle
real-time event ingestion at scale. This makes it ideal for analyzing interaction-heavy datasets, such as customer
behavioral data, where immediate insights are critical. Its scalability ensures that even as data volumes grow, query
performance remains consistent.

Integration with Business Use Cases

Data Distiller’s versatility makes it the backbone of foundational analytics. Its ETL capabilities and machine learning
integration enable businesses to explore and refine insights, create predictive models, and prepare datasets for
downstream use cases. It is particularly valuable in scenarios where businesses need to define and test metrics or
analyze historical trends.

Customer Journey Analytics, on the other hand, excels in real-time environments, where timely insights and activation
are paramount. By stitching together data from multiple channels, it provides a unified view of the customer journey,
enabling businesses to act on insights as they happen. This makes it an essential tool for campaign optimization,
personalization, and cross-channel performance monitoring.

Complementary Roles in a Unified Analytics Strategy: Why Data Distiller and Customer Journey Analytics
Excel

When used together, Data Distiller and Customer Journey Analytics (CJA) form a powerful, integrated analytics
framework that bridges strategic data preparation, advanced business intelligence (BI), and real-time customer
engagement. This unified approach provides businesses with the flexibility and agility to drive both long-term strategic
decisions and immediate, actionable insights—all without the complexity of piecing together multiple tools.

While Data Distiller excels in data processing, SQL-driven ad hoc exploration, business intelligence, and advanced
analytics, CJA delivers real-time, cross-channel customer insights optimized for activation. Together, these platforms
outshine traditional systems like Snowflake or Databricks, which often require extensive customization to achieve the
same level of integration and performance.

Data Distiller: The Engine for BI and Advanced Analytics

Data Distiller is more than just a data processing platform—it serves as the backbone for business intelligence. By
allowing users to write SQL queries directly against the data lake, it offers unparalleled flexibility for exploring and
analyzing raw and aggregated data. This capability enables analysts to perform ad hoc exploration without needing to
predefine complex pipelines or move data into a separate BI tool. Users can drill into raw data, create complex joins,
and generate insights on the fly, all while leveraging the familiarity and power of SQL.

SQL remains a cornerstone of modern analytics because of its simplicity, expressiveness, and versatility. Data Distiller
takes SQL to the next level by integrating it directly into the data lake environment, eliminating the need for data
extraction or movement. Analysts can create materialized views, calculate advanced metrics, and query massive
datasets in seconds, bridging the gap between raw data exploration and actionable business intelligence. This ad hoc
SQL capability transforms the data lake into an interactive analytical playground—something traditional platforms like
Snowflake or Databricks often struggle to achieve without additional layers of tooling.

Customer Journey Analytics: Real-Time Analysis Across Channels

In contrast to Data Distiller’s focus on foundational analytics, CJA is purpose-built for real-time customer journey
insights. By ingesting raw event data across multiple channels and stitching it together in real time, CJA provides a
unified, cross-channel view of customer behavior. Its predefined, denormalized schema is highly indexed and
optimized for speed, ensuring ultra-fast query performance for monitoring customer interactions and delivering
actionable insights.

For instance, if a customer interacts with a campaign on social media and visits a website, CJA can dynamically update
their journey in real time, triggering personalized responses like tailored offers or targeted messages. While platforms
like Snowflake or Databricks can ingest and store similar event data, they lack CJA’s native stitching capabilities and
real-time activation tools, often requiring custom engineering and external systems to achieve similar outcomes.

Why This Combination Outperforms Other Solutions

Unified Data Exploration and Activation: Data Distiller enables ad hoc data exploration with SQL, allowing
analysts to uncover deep insights directly within the data lake. These insights feed seamlessly into CJA, which
activates them in real time to enhance customer engagement across channels. In contrast, Snowflake and Databricks
often require multiple tools to bridge this gap, introducing complexity and latency.

Business Intelligence Meets Real-Time Analytics: With Data Distiller, organizations can build comprehensive
dashboards, perform BI reporting, and run exploratory queries, all while leveraging the scalability of the data lake.
CJA complements this by translating insights into immediate, actionable outcomes, such as personalizing a customer’s
journey in real time. Snowflake excels in data warehousing but lacks native BI capabilities, while Databricks focuses
more on data engineering and machine learning workflows.

Event Stitching and Low-Latency Insights: CJA’s real-time stitching of raw event data provides a level of
immediacy that competitors cannot match. It eliminates the need for external systems to unify customer interactions,
ensuring that businesses can act instantly on insights—whether it’s sending a personalized email or optimizing a web
experience. Snowflake and Databricks lack this real-time stitching capability, making them less effective for activation
use cases.

End-to-End Integration: Data Distiller and CJA operate as a cohesive system, reducing the need for custom
integrations and external tools. Together, they cover the full spectrum of analytics, from strategic exploration and BI to
real-time engagement. Competitors often require stitching together separate solutions, increasing costs and complexity.

A Unified Analytics Framework for Modern Business

For example, an organization could use Data Distiller to analyze historical purchase data, create predictive models for
customer churn, and define detailed customer segments using SQL queries. These insights can then feed directly into
CJA, which tracks real-time customer interactions and dynamically tailors campaigns or experiences based on
behavioral triggers. This synergy ensures businesses can seamlessly transition from raw data exploration to actionable
insights, enabling both strategic planning and agile decision-making.

In contrast, achieving this with Snowflake or Databricks would involve exporting data into external systems for BI,
custom engineering for event stitching, and integrating real-time activation tools—adding complexity, latency, and
costs.

Data Distiller and Customer Journey Analytics excel because they bring together the best of BI, advanced data
exploration, and real-time analytics in a unified ecosystem. Data Distiller’s SQL-powered ad hoc exploration, business
intelligence capabilities, and machine learning integrations make it a powerhouse for data preparation and insight
generation. CJA complements this with its optimized, real-time stitching and activation capabilities, delivering
immediate value across customer touchpoints.

Together, they provide a comprehensive solution that outperforms traditional platforms like Snowflake and Databricks,
offering businesses the speed, flexibility, and agility needed to stay ahead in today’s data-driven world. By uniting
strategic and real-time analytics, Data Distiller and CJA empower organizations to transform their data into decisions
and actions with unparalleled efficiency.

Last updated 3 months ago

https://data-distiller.all-stuff-data.com/unit-1-getting-started/prep-302-key-topics-overview-architecture-mdm-personas
* * *

To understand the architecture of Data Distiller, it is important to understand a few things:

1. Adobe Experience Platform is built on a Service-Oriented Architecture foundation. What that means is that every
component is a separate service and can talk to others and be talked to.

2. Query Service is the service name of the SQL capabilities in the Adobe Experience Platform.

3. Data Distiller is the packaging of these capabilities that is sold to customers. There are some Data Distiller
capabilities that are given as part of the Apps themselves. To understand what comes with the app and what
comes in the Data Distiller standalone, you will need to talk to an Adobe rep.

4. If you have the Data Distiller product, you have all of these capabilities in one place. For this book, we will
assume that you indeed do.

For the rest of this discussion, we will be talking about Query Service architecture so that you know what pieces are
involved and why the query execution behaves the way it does.

There are three query engine implementations in Data Distiller, each tuned for a specific set of use cases, giving you a lot of flexibility to address a wide spectrum of customer data processing and insights use cases.

The query engine implementations are:

1. Ad Hoc Query Engine: This query engine implementation enables users to type SELECT queries on the
structured and unstructured data in the data lake. The scale of data being queried is way larger than what you
would query in your warehouse. Queries time out after 10 minutes of execution (waiting time is not included).
The system auto-scales as more users come into the system so that they are not waiting for cluster initialization
time. If you use TEMP tables for data exploration, the data and the results can be cached.

2. Batch Query Engine: This is a batch processing engine implementation that creates or adds new data to the data lake. In this case, depending on the query and the size of the data to be processed, we spin up a separate cluster with the required resources for the execution of the query. The SQL queries CREATE TABLE AS and INSERT INTO will use this engine (a sketch of both patterns follows this list). This is very similar to the “T” step you will see in state-of-the-art ETL engines. Queries can execute for a maximum of 24 hours with no limits on the concurrency of jobs (scheduled or otherwise).

3. Accelerated Query Engine: This is an MPP engine designed specifically to address BI dashboard-style queries, and it has its own accelerated store. The query engine along with the store is called the Data Distiller
Warehouse. This is very similar to what you would see in state-of-the-art warehousing engines. Results do get
cached and reused across other similar queries. User concurrency is not limited, but there are limits today on query concurrency (4) and on the size of the data (1 TB).
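The two write patterns referenced in item 2 above look roughly like the following; raw_web_events and clean_web_events are hypothetical dataset names used only for illustration.

```sql
-- Routed to the Batch Query Engine: creates a new dataset on the data lake.
CREATE TABLE clean_web_events AS
SELECT *
FROM raw_web_events
WHERE user_id IS NOT NULL;

-- Also routed to the Batch Query Engine: appends a new batch to an existing dataset.
INSERT INTO clean_web_events
SELECT *
FROM raw_web_events
WHERE user_id IS NOT NULL
  AND event_date = CURRENT_DATE;
```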

Let us now look at the routing of the various kinds of queries.

Data Exploration & Dashboarding

All queries whose main statement is a SELECT are essentially “read from” queries, whether or not they execute subqueries or complex conditions.

If you look at the above diagram, it means that you can either read large datasets from the Data Lake via the Ad Hoc
Query engine path or you could read compact aggregated datasets from the Accelerated Store. Here is how you would
differentiate between the queries:

1. All datasets across Data Lake and Accelerated Store are treated as if they belong to the same storage layer. This
means that the dataset names are unique across these data layers. It also means that by looking at a dataset or
table name, you cannot make out where it is located. You don’t need to, as the Data Distiller engine routes the
query automatically.

2. All datasets in the Accelerated Store have to be created with the following declaration clause:

CREATE DATABASE testexample WITH (TYPE=QSACCEL, ACCOUNT=acp_query_batch);

3. If you want to know which dataset is where, simply type:
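A minimal way to do this, assuming the standard Data Distiller dataset-listing command (which, in current releases, returns a description column), is:

```sql
-- Lists the datasets visible to Data Distiller across the Data Lake and the Accelerated Store.
SHOW TABLES;
```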

The results will look like this in DBVisualizer:

If the description says “Data Warehouse” table, it means that it is in the Accelerated Store. If it says “null”, it means
that it is on the Data Lake. Accelerated Store tables will be queried via the Query Accelerated Engine. Data Lake
tables will be queried via the Ad Hoc Query Engine.
Hint: Another way to detect if a table is on the Data Lake or Accelerated Store is to see if it is a flat table or not. If it is
a nested or complex table, then it is on the Data Lake. Accelerated Store requires that datasets or tables be flat as it
supports only relational structures.

Federated Data Processing

Any SQL statement that contains “CREATE TABLE AS” or “INSERT INTO” will be routed to the Batch Query
Engine.

The batch query engine can write to the Data Lake or the Accelerated Store. The data layer it writes to follows the same routing logic used for reads: if the table being written to exists on the Accelerated Store, the batch engine writes there; otherwise it writes to the Data Lake.

Note: Data Distiller allows you to mix and match tables in your query across the Data Lake and Accelerated Store.
This means you can reuse the results of your work in the Accelerated Store to create richer datasets.
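A minimal sketch of this mix-and-match pattern, assuming campaign_engagement and dim_campaign already exist in the Accelerated Store and profile_events is a Data Lake dataset (all three names are hypothetical):

```sql
-- Hypothetical federated query: profile_events is read from the Data Lake,
-- dim_campaign from the Accelerated Store, and the result is written back to
-- campaign_engagement, which lives in the Accelerated Store.
INSERT INTO campaign_engagement (campaign_name, event_count)
SELECT
    c.campaign_name,
    COUNT(*) AS event_count
FROM profile_events e
JOIN dim_campaign   c ON e.campaign_id = c.campaign_id
GROUP BY c.campaign_name;
```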

Data Distiller SQL conforms to PostgreSQL syntax. PostgreSQL is compliant with ANSI SQL standards; it is compatible with ANSI SQL:2008 and supports most of the major features of SQL:2016. However, the syntax accepted by PostgreSQL is slightly different from that of other commercial engines. SQL is a popular relational database language that was
first standardized in 1986 by the American National Standards Institute (ANSI). In 1987, the International
Organization for Standardization (ISO) adopted SQL as an international standard.

A Note on Master Data Management

Master Data Management (MDM) is a method and a set of tools used to manage an organization’s critical data. MDM
focuses on ensuring that essential data is consistently defined, shared, and used throughout an organization, which can
help improve data quality, streamline data integration, and enable more accurate reporting and analytics. Data Distiller
is not an MDM tool but it has features that can replicate MDM-like features on datasets in the data lake in the Adobe
Experience Platform.

Data Scope: Note that MDM covers the entire enterprise data while the scope of data that can be covered by Data
Distiller is only the data brought into the Adobe Experience Platform. Hence, the MDM-like functionality is restricted
to the data that is available.

Data Distiller Implementation

The MDM capabilities below are each paired with how the equivalent functionality can be implemented in Data Distiller.

Data Governance: MDM involves establishing data governance policies and procedures to ensure that data is
accurate, consistent, and secure. MDM helps organizations comply with data privacy regulations, such as GDPR or
HIPAA, by ensuring that sensitive data is properly managed and protected.

Data Governance in Data Distiller is always within the context of the Data Lake, Accelerated Store, and the Apps
(Adobe Real-Time CDP, etc.). Compliance with GDPR and HIPAA are supported.

Data Quality: MDM aims to improve data quality by cleansing and standardizing data.

You will need to implement this per dataset. You can templatize the logic and reuse it for multiple datasets.

Data Matching and Deduplication: MDM tools use algorithms to identify and merge duplicate records

You will need to implement this per dataset. You can templatize the logic and reuse it for multiple datasets.
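A minimal sketch of the per-dataset deduplication logic described above, keeping only the most recent record per identity with ROW_NUMBER; crm_contacts, email, and updated_at are assumed names used only for illustration.

```sql
-- Hypothetical deduplication: keep only the most recent record per email address.
CREATE TABLE crm_contacts_deduped AS
SELECT email, first_name, last_name, updated_at
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY email
            ORDER BY updated_at DESC
        ) AS rn
    FROM crm_contacts
) ranked
WHERE rn = 1;
```

The same pattern can be templatized and reused across datasets by swapping in a different table name and key columns.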

Data Enrichment: MDM can involve enriching data with additional information. For example, appending
geographical coordinates to customer addresses to enable location-based analytics.
You will need to implement this per dataset. You can templatize the logic and reuse it for multiple datasets.

Data Integration: MDM helps integrate data from various sources, making it accessible and usable across the
organization.

This is covered by the Sources functionality in Adobe Experience Platform. When you get a license to an App, you get
access to the same set of sources. Data Distiller can leverage the same input data sources.

Hierarchical Data Management: MDM can manage hierarchical relationships, such as product categories and
subcategories.

XDM modeling gives you the flexibility to model a wide range of relationships on the data lake. The closest Data
Distiller gets is with star or snowflake schema modeling with primary and secondary key relationships between
datasets.

Customer 360: One common example is building a “Customer 360” view, where all relevant customer information,
including demographics, purchase history, and support interactions, is consolidated into a single, unified profile.

This is supported by the Real-Time Customer Profile and hence Data Distiller has access to the same data.

Product Information Management (PIM): In e-commerce and retail, MDM is used to manage product data,
ensuring consistent and complete product information across various sales channels.

Data Distiller’s functionality is closer to that of an OLAP database than an OLTP database: you cannot UPDATE records.

Supplier Data Management: In supply chain management, MDM can be used to maintain accurate and up-to-date
information about suppliers, including contact details, certifications, and performance metrics.

Data Distiller’s functionality is closer to that of an OLAP database than an OLTP database: you cannot UPDATE records.

Financial Data Management: MDM can be applied to financial data, ensuring that financial reports and statements
are based on accurate and consistent data from various sources.

Data Distiller’s functionality is closer to that of an OLAP database than an OLTP database: you cannot UPDATE records.

Centralized User Experience for Master Data Management use cases: Data Distiller is still a data processing and analytics tool; it does not provide a centralized user experience for MDM.

CRUD Operations in Data Distiller

Create: Supported. You can replace a dataset or add new batches of data.

Update: Not supported, as the unit of update is a “batch of records” in the data lake. You will need to replay the data.

Delete: Record-level delete is not supported; dataset-level delete is supported. You will need to replay the data in order to delete the records you do not want (a sketch of this replay pattern follows).
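Because record-level deletes are not supported, the replay mentioned above usually amounts to rewriting the dataset without the unwanted records. A minimal sketch, assuming a hypothetical orders dataset and a hypothetical predicate for the rows to drop:

```sql
-- Hypothetical replay: recreate the dataset without the records you no longer want,
-- then use (or swap in) the new dataset going forward.
CREATE TABLE orders_replayed AS
SELECT *
FROM orders
WHERE order_status <> 'test';  -- predicate identifying the records to keep
```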

Key Data Distiller Personas


One of the patterns you will see in the world of data is the convergence of multiple domains of expertise into one. The overlaps are strong, and the traditional thinking that a single area of expertise is the future (for example, that AI engineers will be the future or that data science will replace analysis) is misguided. You can give your team all the fancy titles you want, but you will need a whole team to pull off these tasks. Focus on the expertise each person brings rather than their persona. Your team will be lacking some of these skills, and that should be an area of investment for you.

Conceptual diagram of various query engine flows.

Pay attention to the description column

A popular diagram showing areas of expertise and overlap. Coverage area does not indicate importance.

https://data-distiller.all-stuff-data.com/unit-1-getting-started/prep-304-the-human-element-in-customer-experience-
management * * *

1. UNIT 1: GETTING STARTED

PREP 304: The Human Element in Customer Experience Management


Where data meets humanity: elevating customer experience with insight and empathy

Years ago, I was a Product Manager at MathWorks, working on Simulink—a tool that allowed engineers to design
embedded algorithms without needing to write a single line of C code. I remember vividly being in a meeting with
General Motors, presenting to some of the brightest engineers working on the next generation of hybrid vehicles. I was
the nerdy kid in the room, passionately explaining how they could simulate smarter engine designs right on a canvas,
bypassing countless hours of manual coding.

After my presentation, one of their senior engineers walked to the podium. He smiled at me and said, “Saurabh, have
you ever lifted and felt what an engine is like? We’re going to put this in someone’s car—it’s deeply personal. As
much as we love your algorithms, we expect that you won’t take away the human element of testing whatever folks
build on a computer out on the shop floor (their factory testbeds). We’ll test more, double-check ourselves, and ensure
that it’s the absolute best for our customers.” The uber point he was making was that integrating new, sophisticated
algorithms necessitates thorough human testing and judgment.

That moment stayed with me. It taught me a lesson I’ve carried ever since: tools—no matter how powerful—cannot
replace the human touch. Products that convey the image of a brand are more than the sum of their algorithms and
designs; they carry the weight of human judgment, care, and responsibility.

Fast forward to today, as the Product Manager for Data Distiller, I see parallels in the world of customer experience
management. Data Distiller empowers businesses with cutting-edge tools to process customer experience data and
drive decision-making at scale. But just as with those engineers at General Motors, I believe that no tool—no matter
how advanced—should ever replace the human element in crafting customer experiences.

The Limits of Data-Driven Decision-Making

Data-driven approaches excel at analyzing operational metrics, identifying trends, and predicting customer behaviors.
They provide businesses with a powerful foundation for decision-making. However, they often fail to account for the
nuances, emotions, and human experiences that shape customer interactions and loyalty.

Recently, I wrote a technical article on Net Promoter Score (NPS) and how Data Distiller can enhance its
effectiveness. NPS is widely regarded as a key metric for measuring customer loyalty, yet it often overlooks the
cultural nuances that influence how customers perceive and interact with a brand. For instance, in cultures where
modesty is highly valued, customers may avoid giving extreme scores, even when highly satisfied. Conversely,
cultures that encourage overt enthusiasm might yield higher ratings, even if loyalty is fleeting. Additionally, the
concept of “recommending” a product may hold different levels of significance—some cultures value individual
recommendations highly, while others prioritize collective decision-making or peer-reviewed advice. These subtle
differences can skew NPS insights, leading businesses to draw conclusions that may not align with the diverse realities
of their global customer base.

Data Distiller’s ability to create robust customer propensity models is a significant step forward in predicting behaviors
like purchase likelihood, churn, or engagement. However, these models often stop short of capturing the deeper
emotions or motivations driving these actions. For example, a model might predict that a customer is likely to make a
purchase but cannot explain why—whether it’s due to genuine preference, the allure of a discount, or external peer
influence. Similarly, churn predictions might identify at-risk customers but fail to highlight the exact frustrations or
unmet expectations causing dissatisfaction. These limitations underscore the need for businesses to go beyond
predictions and pair their findings with qualitative research and human insight to fully understand the emotional
underpinnings of customer behavior.

While data can highlight operational wins, such as increased sales or improved response times, it often misses the
broader story of brand perception. Take, for example, a fashion retailer that sees a spike in sales following a new
campaign. On the surface, the numbers suggest success. However, the campaign’s imagery unintentionally perpetuates
cultural stereotypes, leading to widespread criticism on social media. While sales data might reflect short-term success,
the long-term impact—negative press, reduced customer trust, and a tarnished brand reputation—remains hidden in
the data. Months later, the retailer may see reduced engagement and loyalty without fully realizing the cause. This
scenario highlights how data can provide an incomplete picture, focusing on immediate outcomes while overlooking
the nuanced, enduring effects on brand perception.

By relying solely on data-driven decision-making, businesses risk creating strategies that appear efficient but alienate
customers in subtle, lasting ways.

The Role of Human Insight

No algorithm—no matter how advanced—can replicate the creativity, empathy, and contextual understanding that
humans bring to customer experience management. To build truly impactful experiences, businesses must integrate
human insight alongside data-driven approaches.

Deep Business Understanding: A comprehensive understanding of your business’s values, market position, and
long-term goals is essential for interpreting data within the right context.

Empathy and Human Judgment: Customer feedback, even when captured quantitatively, must be understood
emotionally. Human judgment ensures that responses are thoughtful, genuine, and aligned with customer needs.

Cultural Sensitivity: Data often struggles to quantify the cultural subtleties that influence customer interactions.
Humans can bridge this gap, ensuring that strategies resonate with diverse audiences across geographies and
demographics.

Data is a powerful enabler, but it is not a replacement for the human element. When balanced effectively, data and
human insight can complement each other to create customer experiences that are both efficient and deeply
meaningful.

Data Distiller: A Catalyst, Not a Replacement

Data Distiller is designed to propel businesses forward in their customer experience journey. With its ability to process
vast datasets, uncover actionable insights, and power personalization, it is a transformative tool in today’s AI-driven
world. Its integration with artificial intelligence (AI) and generative AI (GenAI) adds even greater capabilities,
enabling the analysis of complex patterns, the prediction of customer behaviors, and the generation of tailored content
at scale. Yet, as advanced as these technologies are, the essence of exceptional customer experiences still lies in the
human element.

Consider a clothing retailer using Data Distiller’s AI-powered algorithms to identify that customers in a specific region
prefer vibrant colors. AI might suggest this trend based on purchasing patterns or social sentiment, and GenAI could
even draft campaign ideas. However, understanding why those preferences exist—whether tied to local festivals,
cultural traditions, or seasonal styles—requires the intuition, empathy, and expertise of a human marketer. Without
this, even the most advanced AI-driven strategies risk missing the emotional and cultural nuances that foster deeper
connections with customers.

The integration of advanced algorithms (AI, GenAI) into Data Distiller will redefine the role of data in customer
experience management. These new-age algorithms will amplify what businesses can achieve with data, offering
unprecedented speed, scalability, and precision. However, the goal isn’t to rely solely on algorithms and automation—
it’s to harmonize them with human judgment to create truly impactful customer experiences.

Use Data Distiller as the foundation: Its algorithms empower businesses to uncover trends, predict behaviors,
and generate actionable solutions at scale. This serves as the bedrock for informed decision-making.

Enrich insights with human expertise: The outputs of Data Distiller’s algorithms must be contextualized with
human understanding—aligning them with your brand’s identity, customer emotions, and cultural nuances to
ensure they resonate meaningfully.

Adapt continuously with human oversight: Data-driven strategies are powerful but require ongoing evaluation
and refinement by humans. Real-world feedback and emotional intelligence ensure that strategies stay aligned
with customer expectations and brand integrity.

At Data Distiller, we often say this to ourselves: “Data and algorithms can illuminate the path, but it’s the human
touch that ensures the customer journey is meaningful.”


https://data-distiller.all-stuff-data.com/unit-1-getting-started/prep-305-driving-transformation-in-customer-experience-
leadership-lessons-inspired-by-lee-iacocca * * *

1. UNIT 1: GETTING STARTED

PREP 305: Driving Transformation in Customer Experience: Leadership Lessons Inspired by Lee Iacocca

Why Leadership Lessons Matter in Customer Experience Management

Customer experience management isn’t just about data, algorithms, or tools—it’s about creating connections that
matter. We will be taking a close look at Lee Iacocca’s leadership lessons. His principles—ranging from persistence
and execution to the importance of focus and motivation—resonate profoundly with the challenges of managing
customer experiences in today’s fast-paced, data-driven world.

Customer experience leaders often find themselves navigating a complex landscape of technology, analytics, and
strategy, all while trying to maintain the human touch that drives real loyalty. Iacocca’s story, along with reflections
from my own professional journey, serves as a reminder that behind every dashboard, algorithm, or automation is a
customer who values trust, empathy, and authenticity.

As we dive into the lessons shared, the goal is to highlight the timeless qualities of leadership—focus, resilience, and
clarity—and how they can be applied to elevate customer experiences. Whether it’s crafting a vision, executing
effectively, or overcoming challenges, these lessons are a roadmap for creating impact in a field where technology is
only half the equation.
Let’s explore how these principles intersect with customer experience management and why they’re more relevant
than ever in an age of AI, GenAI, and data-driven insights.

For those unfamiliar, Lee Iacocca stands as one of the most iconic figures in the history of American business. A
visionary leader, he was the driving force behind two legendary automotive successes: the launch of the Ford Mustang
in the 1960s and Chrysler’s miraculous turnaround from the brink of bankruptcy in the 1980s. His story is one of
audacity, resilience, and unwavering commitment to innovation.

Iacocca’s career began at Ford, where he quickly climbed the ranks, becoming one of the youngest vice presidents in
the company’s history. His crowning achievement during his time at Ford was the creation of the Mustang, a car that
not only defined an era but also became an enduring symbol of American automotive ingenuity. The Mustang was
revolutionary—a sleek, affordable sports car that resonated with a generation hungry for freedom and self-expression.
It wasn’t just a product; it was a movement, and Iacocca was its architect.

But in a dramatic turn of events, Iacocca’s tenure at Ford came to an abrupt and public end. In what can only be
described as a “Hollywood-style firing,” Henry Ford II, then chairman and CEO, dismissed Iacocca despite his
monumental contributions to the company. Personal differences and internal politics overshadowed his achievements,
leaving Iacocca to face an uncertain future.

Yet, it was this moment of adversity that would define Iacocca’s legacy. Instead of fading into obscurity, he engineered
one of the greatest comebacks in business history. Joining Chrysler—a struggling automaker on the verge of collapse
—Iacocca not only saved the company but transformed it into a thriving enterprise. His bold leadership decisions,
including securing a controversial government loan guarantee, streamlining operations, and revitalizing the product
line with hits like the minivan, turned Chrysler into a symbol of resilience and innovation.

Iacocca’s journey is a masterclass in leadership, innovation, and perseverance. He taught us that great leaders don’t
just manage through success; they thrive in the face of failure. His ability to inspire teams, challenge norms, and
deliver results is a playbook for anyone navigating challenges, whether in business or life.

The ability to concentrate and to use your time well is everything.

In an age of endless notifications and digital noise, focus has become a superpower—and it’s just as critical in
customer experience management as it is in personal productivity. Time management isn’t just about juggling meetings
or calendars; it’s about cutting through the noise to zero in on what truly matters: delivering value to your customers.
When teams and leaders enter this zone of concentrated effort, the distractions of fleeting trends, data overload, and
surface-level metrics lose their grip, allowing meaningful strategies to emerge.

Just as I’ve chosen to replace passive consumption with active creation—swapping TV for blogging, listening to music
for playing it, or driving for walking—customer experience leaders can transform their approach by prioritizing the
essential. This might mean setting aside time to deeply understand customer feedback, reflecting on long-term goals,
or stepping back from reactive tasks to focus on proactive, value-driven actions. These habits not only improve mental
clarity but also foster innovative solutions that resonate with customers on a deeper level.

Mastering this focus can be transformative in customer experience management. It enables teams to cut through the
noise, make impactful decisions, and deliver experiences that truly connect with customers. Much like the way
personal reflection and discipline fuel productivity, focused time dedicated to understanding and enhancing the
customer journey can multiply the impact of every effort, making each interaction more meaningful and rewarding for
both customers and teams.

Management is nothing more than motivating other people. Start with good people, lay out the rules,
communicate with your employees, motivate them, and reward them. If you do all those things effectively, you
can’t miss.
Customer experience management is ultimately a team effort, and effective leadership is the cornerstone of success. At
its heart, management is nothing more than motivating people—finding the right talent, setting clear expectations, and
giving them the tools and encouragement they need to excel. Start with good people, lay out the rules, communicate
openly, inspire belief, and reward their contributions. Do these things effectively, and your team won’t just meet
expectations—they’ll exceed them.

Leadership in this space isn’t about barking orders or controlling every detail; it’s about inspiring belief in a shared
vision and having your team’s back. The best leaders I’ve seen don’t crave the spotlight—they shine it on others.
When the team wins, the leader wins. This principle is particularly relevant in customer experience, where success
depends on a seamless collaboration between data analysts, marketers, product managers, and frontline teams.

There’s no magic formula for managing people, but simplicity is powerful. Hire smart, communicate clearly, recognize
effort, and celebrate success. Skip the politics and unnecessary complexity. A motivated, engaged team will always
create better customer experiences, because they feel supported, valued, and connected to the larger mission.
Leadership, when done right, creates an environment where both the team and the customers thrive.

People want direction from a leader. It’s not a question of being bossy or autocratic; it’s about being clear and
firm.

In customer experience management initiatives, people look to their leaders for clear direction—not to be bossy or
autocratic, but to provide confidence and focus. Leadership without clarity is a road to failure; I’ve seen it firsthand. A
team thrives when its leader is decisive and clear about the goals. Uncertainty from the top creates confusion
throughout the ranks. Leadership demands humility, but it also requires the confidence to set a firm direction.

That direction must be anchored in a compelling vision of a better future—one that answers the fundamental question:
Why are we doing this? This isn’t about painting a picture of unattainable perfection or indulging in abstract ideas. A
great vision is both inspiring and practical, rooted in a tangible understanding of how it will improve the lives of
customers, stakeholders, and team members alike.

But a vision alone won’t take a team far without a path forward. The real power of leadership lies in translating that
vision into actionable steps. What does success look like? What challenges will we face, and how will we overcome
them together? When leaders can articulate both the “what” and the “how,” they turn aspirations into momentum.

In customer experience, this balance of big-picture thinking and concrete execution ensures that every effort
contributes to a shared goal. Clear direction empowers teams to innovate, collaborate, and deliver the experiences that
define a brand’s success. Leadership rooted in clarity and purpose transforms teams from just following orders to
passionately driving toward a meaningful mission.

We are continually faced with great opportunities brilliantly disguised as insoluble problems.

The greatest opportunities often come disguised as impossible problems. What may initially seem like an
insurmountable challenge—whether it’s a dissatisfied customer base, a fragmented data landscape, or declining
engagement—can be the seed of transformative innovation. These moments force teams to think differently, push
boundaries, and uncover solutions that redefine what’s possible.

For example, integrating real-time personalization or building seamless customer journeys might feel overwhelming at
first, especially with legacy systems or siloed data. But tackling these “insoluble” problems often leads to
breakthroughs—streamlined processes, enhanced tools, or entirely new ways of connecting with customers. The key
lies in embracing the challenge, reframing it as an opportunity, and approaching it with creativity and persistence.

Great leaders and teams thrive in these moments, not by avoiding the challenges but by seeing them for what they are:
stepping stones to something greater. In the world of customer experience, the problems you solve today often become
the competitive advantages you’ll celebrate tomorrow.
If you set a good example, you need not worry about setting rules.

Passion and dedication start at the top. If you deeply care about creating meaningful interactions for customers, that
passion will resonate through your entire team. As a leader, your actions—not your words—set the tone for execution,
strategy, and values. Culture isn’t built on policies or rulebooks; it’s built on example.

If you show up every day with a focus on the customer, a commitment to excellence, and an eye for detail, your team
will follow suit. If you’re deeply invested in solving customer problems, refining their journeys, and making every
interaction special, that energy becomes contagious. But if you’re inconsistent, disengaged, or sloppy in your
approach, those behaviors will inevitably trickle down.

Customer experience is a reflection of the values your team embodies. The way you lead sets an unspoken standard—
one that shapes not only how your team works but also how customers feel when interacting with your brand. Passion
is the foundation of great customer experiences, and when leaders lead with it, it creates a culture where going above
and beyond for the customer isn’t just an expectation—it’s a way of life.

“No deal” is better than a “bad deal”.

Saying “no” to a bad deal can be just as important as saying “yes” to the right opportunity. Whether it’s negotiating
partnerships, adopting new tools, or making strategic trade-offs, not every deal is worth taking. A bad decision today
can snowball into larger issues tomorrow—misaligned expectations, resource drains, or initiatives that fail to deliver
value to customers. Knowing when to walk away is a hallmark of strategic leadership.

Think of negotiation as a chess game. Every move you make—whether it’s agreeing to a partnership, prioritizing a
project, or aligning with a vendor—impacts not just the immediate outcome but the future of your customer experience
strategy. A skilled leader doesn’t just focus on the current exchange; they consider how the decision will affect the
brand, the customer journey, and the team’s ability to execute over time.

Saying “no” is sometimes the most strategic move. It preserves your leverage, maintains your focus, and ensures you
don’t compromise on what truly matters: delivering exceptional customer experiences. Like a chess master, knowing
when to hold back or pivot keeps you in control, empowering you to create long-term wins that align with your vision
and values.

Apply yourself. Get all the education you can, but then, by God, do something.

Knowledge without action is wasted potential. It’s not enough to attend workshops, earn certifications, or analyze
endless datasets—what matters is how you use that knowledge to create value for customers and make a real impact.
As the ancient Hindu text Panchatantra wisely teaches, knowledge only becomes meaningful when it is applied for
the benefit of others.

Learning is the foundation, but action is the structure you build on it. Whether it’s crafting seamless customer
journeys, solving complex pain points, or innovating new ways to connect with your audience, the key lies in applying
what you know. It’s about taking insights from data and using them to personalize experiences, improve satisfaction,
and strengthen relationships.

Customer experience leaders don’t just study problems—they act on them. They use their expertise to solve real-world
challenges, drive meaningful change, and leave a lasting impact. So don’t just learn for the sake of learning. Apply
yourself, make your work count, and ensure your efforts lead to better outcomes for your customers and your brand.
Knowledge is the spark, but action is what builds the fire.

You can’t go through life quitting everything. If you’re going to achieve anything, you’ve got to stick with
something.

Persistence is everything. You can’t create meaningful, lasting change by hopping between strategies, chasing trends,
or abandoning efforts at the first sign of difficulty. To achieve excellence, you have to commit—to your vision, your
customers, and your long-term goals. Dabbling in everything and mastering nothing gets you nowhere. Success
demands focus and follow-through.

Consider Olympic athletes who train relentlessly for years, often for a single defining moment. They don’t switch
sports or lose focus midway through. Every decision, every sacrifice, is aligned with their ultimate goal: standing on
the podium. Similarly, building exceptional customer experiences requires that same unwavering commitment.
Whether you’re overhauling a customer journey, integrating new tools, or personalizing interactions at scale,
persistence is what transforms effort into results.

Customer experience isn’t built in a day—it’s an ongoing journey of learning, refining, and improving. Choose your
path, dedicate yourself to it, and pour everything you have into seeing it through. The rewards—loyal customers, a
strong brand, and lasting impact—are worth every ounce of effort.

Even a correct decision is wrong when it is taken too late.

Even the right decision can be wrong if it’s made too late. Timing is everything. A great idea or strategy loses its
impact when delayed, and hesitation often costs far more than action ever will. In an industry where customer
expectations evolve rapidly, speed is not just an advantage—it’s a necessity.

Consider a retail brand facing mounting complaints about a clunky online shopping experience. Customers struggled
with slow-loading pages and a confusing checkout process. While the company eventually revamped its website with a
streamlined interface and faster performance, the delay came at a steep cost. Frustrated customers had already turned
to competitors with smoother experiences, and winning back their loyalty required far more effort than addressing the
issue earlier would have.

The same principle applies to all aspects of customer experience. Whether it’s adopting new tools, addressing
feedback, or seizing an emerging trend, timely action is critical. Delays can result in lost opportunities, eroded trust, or
falling behind competitors who acted decisively. In a fast-paced world, the ability to make the right call at the right
time defines the difference between leading the market and playing catch-up.

Be creative, but make sure what you create is practical.

Creativity is the heartbeat of great marketing, but it must be rooted in practicality to truly resonate. Innovation without
feasibility is just fantasy. Marketing creativity should solve real customer problems, connect with audiences
meaningfully, and drive measurable results. Without this grounding, even the most imaginative ideas risk falling flat.

Consider marketing campaigns that prioritize flash over substance—complex promotions, convoluted messaging, or
high-budget stunts that fail to address what customers actually need. These efforts may generate temporary buzz, but
without a clear connection to customer value, they often fade into irrelevance. In contrast, some of the most
memorable campaigns are both creative and practical—whether it’s a clever social media strategy that simplifies
customer engagement or a personalized email that solves a customer’s specific pain point.

In marketing, creativity isn’t about being the loudest or the flashiest—it’s about being relevant and impactful. Practical
creativity ensures your innovation serves a purpose, resonates with your audience, and drives tangible outcomes,
transforming good ideas into great customer experiences.

You can have brilliant ideas, but if you can’t get them across, your ideas won’t get you anywhere.

Having brilliant ideas isn’t enough—they must be communicated effectively to create impact. Your message to
customers needs to be clear, compelling, and aligned with their needs and values. Even the most innovative strategies
or exciting offers will fall flat if they aren’t presented in a way that resonates.

As Zig Ziglar famously said, “You can have everything in life you want if you will just help enough other people get
what they want.” The essence of customer communication lies in this principle: it’s not about selling a product or an
idea—it’s about demonstrating how it solves a customer’s problem, fulfills a need, or makes their life better.
Whether it’s a campaign launch, a product update, or a simple email, the message must be crafted with clarity and
conviction. Customers should immediately understand the value you’re offering and feel compelled to take action.
Selling an idea to a customer is no different from selling it to a team—it requires understanding their perspective,
addressing their concerns, and inspiring them to believe in the future you see.

The art of customer communication is a skill worth mastering because the best ideas are only as impactful as your
ability to bring others along for the journey. Effective communication isn’t just about words; it’s about building trust,
sparking interest, and creating a connection that turns an idea into a shared vision. When done right, it’s the bridge
between great ideas and exceptional customer experiences.

In times of great stress or adversity, it’s always best to keep busy, to plow your anger and your energy into
something positive.

In times of great stress or adversity, success often comes to those who channel their frustration into meaningful
progress. A powerful example of this is LEGO’s remarkable comeback in the early 2000s.

By 2003, LEGO was on the brink of bankruptcy. Years of over-expansion, poorly received products, and declining
interest in traditional toys had left the company in a financial crisis. It seemed like the iconic brand might crumble
under the pressure of a changing market dominated by video games and tech-driven entertainment.

But instead of folding, LEGO doubled down on its core strengths: creativity, simplicity, and customer connection.
They cut back on non-essential product lines, focused on their signature brick-based sets, and invested in
collaborations with beloved franchises like Star Wars and Harry Potter. At the same time, LEGO embraced digital
innovation, launching products like LEGO Mindstorms to combine physical play with programming, and fostering a
community-driven approach with initiatives like LEGO Ideas, which brought fan-created designs to market.

By refocusing on their vision and rebuilding from the ground up, LEGO not only recovered but became a global
powerhouse in the toy industry. Today, the brand is celebrated for its resilience and its ability to adapt while staying
true to its identity.

LEGO’s comeback reminds us that challenges aren’t endpoints—they’re opportunities. By showing up, adapting, and
staying true to their vision, LEGO turned what could have been their downfall into a historic triumph. That’s the power
of persistence and progress, even in the face of overwhelming odds.

In business, the real advantage isn’t knowing what your competitor or the market is doing—it’s executing
better than they ever could.

In business, the true advantage isn’t simply knowing what your competitors are doing—it’s executing better than they
ever could. Success isn’t just about crafting strategies; it’s about delivering on them, day after day, with precision and
consistency. Execution is the grind. It’s solving new problems, refining old processes, and showing up every day to
make incremental progress.

In customer experience management, plans and ideas can be easily copied, but execution can’t. It’s the way your team
interacts with customers, the attention to detail in every campaign, and the seamless delivery of personalized
experiences that set your brand apart. Customers don’t just remember what you promised—they remember how you
delivered.

This isn’t glamorous work. It’s not about titles, accolades, or flashy initiatives. Execution is about persistence, focus,
and the discipline of showing up, every single day, with a commitment to excellence. Leadership in customer
experience isn’t about commanding from the top—it’s about being present, solving challenges, and consistently
delivering value to your customers.

At its core, execution is where the magic happens. Plans inspire action, but execution is what builds trust, loyalty, and
success. It’s not about perfection—it’s about progress, made possible by leaders and teams who are dedicated to
showing up as their best selves, every single day.
Want to Be Inspired More?

You can read the book, available on Amazon.


https://data-distiller.all-stuff-data.com/prep-500-ingesting-csv-data-into-adobe-experience-platform * * *

PREP 500: Ingesting CSV Data into Adobe Experience Platform



You need to set up DBVisualizer:

The goal of this exercise is to ingest test data into the Adobe Experience Platform so that you can do the modules. Note
that the CSV file upload approach as shown here only works for smaller-sized datasets (1GB or less). If you need
larger-sized test data, you will need to use a dedicated connector or the Data Landing Zone. To see how to use the Data
Landing Zone, check this out:

Download the following file locally to your machine.

Ingesting CSV Files into the Adobe Experience Platform

1. Navigate to Adobe Experience Platform UI->Workflows->Create Dataset from CSV File.

2. Configure the name of the dataset as Movie data

3. Drag and drop the CSV file into the Add data box. You can also navigate to the file by using the “Choose File”
button as well.

4. Once the data is loaded, you will see a data preview.

5. Click Finish to complete the upload.

6. Navigate to AEP UI->Datasets to locate the dataset Movie data. You will notice that your manual upload of the CSV file has been ingested as a batch with a Batch ID, and that 1,000 records were ingested. On the right-side panel, observe the table name, which shows as movie_data. The SQL engine in Data Distiller uses this table name, not the dataset name, to query the data.

7. Preview the dataset by clicking on the Preview dataset button in the top right corner. You will get a dataset
preview that looks like this:

Execute the following code:
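
A simple query against the newly created table (its table name, movie_data, is shown in step 6) is:

select * from movie_data;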

The result you will get will look like this:

https://data-distiller.all-stuff-data.com/unit-1-getting-started/prep-400-dbvisualizer-sql-editor-setup-for-data-distiller * * *

1. UNIT 1: GETTING STARTED


PREP 400: DBVisualizer SQL Editor Setup for Data Distiller

I use DBVisualizer for my example creation and prototyping. This SQL editor has a free version that will meet most of your needs except the ability to download JSON data. But if you can work with flat data (after selectively choosing and denormalizing it), which is what you will do most of the time, this limitation is not an issue.

Note: I will be using AEP and Adobe Experience Platform interchangeably in this tutorial.

Warning: Make sure you have the requisite permissions from your IT team before installing DBVisualizer. If you are working in a regulated industry, you definitely need to find out what is allowed and what is not.

The Data Distiller Query Editor is a basic SQL editor that is perfect for fast data exploration and also for query
operationalization. However, it has limitations as far as query development is concerned that are addressed by
DBVisualizer. With DBVisualizer, you can:

1. Query multiple databases from within a single UI.

2. Reuse the SQL code developed in one environment in another. If you were executing queries on a table in a
warehouse and you migrated that table (its creation) over to AEP, you can just reuse the same SQL as long as it is
Postgres compliant which is mostly the case.

3. The Data Distiller Query Editor previews only up to 100 rows per query. In DBVisualizer, you can set the row limit yourself: a SELECT query can return up to 50,000 rows as long as the query finishes within 10 minutes of starting execution.

4. If you have a set of deeply nested SQL subqueries, you can highlight the subquery and execute it.

5. You can execute a sequence of SQL commands separated by semicolons.

6. Easy access to SQL scripts and tables within the editor.

7. In the Free version, you can download the results locally as a CSV.

DBVisualizer is perfect for prototyping and developing complex SQL queries. The Data Distiller editor is evolving fast, and you can expect these features to become available in the near future.

Why I Do Not Like DBVisualizer

There are some areas where DBVisualizer is not ideal for Data Distiller and where the Data Distiller Query Editor excels:

1. You have to be comfortable with setting up database connections. But once set up, DBVisualizer gives you unparalleled power and control over your query development.

2. Each query execution forces a re-connect.

3. The tables you see in DBVisualizer are a snapshot of the tables at the start of the session. If you create new tables, you need to refresh the connection metadata, i.e., disconnect and reconnect.

4. Some metadata commands will cause subsequent SELECT queries to not retrieve results. You have to disconnect and reconnect.

5. Scheduling of queries is not possible. You can use the REST API to schedule these queries but it is better done in
the Data Distiller UI.
6. Every scheduled job requires a Data Distiller Template, which must be created inside the Data Distiller UI.

7. Monitoring and alerting setup for scheduled queries are best done within the Data Distiller UI.

8. Data Distiller Editor also has a Dashboards component where you can build Business Intelligence (BI) style
dashboards with visualizations powered by star schemas in the Data Distiller Accelerated Store.

Tip: Prototyping, development, and validation are best done in a DBVisualizer-like tool. But operationalization of the SQL queries developed in DBVisualizer is best done in the Data Distiller UI.

Download & Install DBVisualizer

Go to this link and download/install the appropriate version based on your OS

If you are on a Mac and do not know whether you have an Intel processor or Apple silicon, click the Apple icon in the top-left corner and open About This Mac to see the processor specification.
Setup Connection to Adobe Experience Platform

All of the AEP datasets that you will need to work with are called tables in the world of SQL. All of these tables reside under a database, which acts as a namespace or scope for this collection of tables. We need to log into this database over the web, so we will need a public IP address. As we send SQL queries as requests, the server needs to listen for them, so we need a port number. Last, but not least, we need a username and a password. Because this communication happens between the client and the server, AEP requires it to be secure and expects DBVisualizer to have SSL mode enabled. If you do not enable this, AEP will refuse your connection even if you got everything else right.
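
Under the hood, these pieces come together as a standard PostgreSQL JDBC connection URL. As a rough sketch (the host, port, and database values here are placeholders; the real values come from the Data Distiller Credentials page shown later):

jdbc:postgresql://<host>:<port>/<dbname>?sslmode=require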

1. Click on Database->Create Database Connection

On the left-hand side pane, you can also see a list of Connections. Alternatively, you can right-click on Connections
and create a database connection there as well.

1. You may be prompted to use the Wizard. Do not use it. We could have used it, but it does not add value since we will be copy-pasting values from Data Distiller.

2. You will see a screen that looks like this. Name this connection as you will use it with a SQL script to send your
queries into AEP.

3. The SQL dialect that Data Distiller speaks is PostgreSQL. There are two important pieces of information that we
need to provide DBVisualizer so that it can interact and talk the same language.

Tip: If you do not have access to AEP but still want to learn the SQL concepts I showcase, you can just download a free local version of PostgreSQL. For smaller-sized datasets, as is the case with the examples I am using, it should mostly work just fine. Combine that with the free JupyterLab for data science and you have all that you need:

PostgreSQL: https://www.postgresql.org/download/

JupyterLab: https://jupyter.org/install

1. Under the section of Database, set the Database Type to Postgres. You can also leave it as Autodetect.

2. Under the Database section, set the driver to the PostgreSQL JDBC driver. This is the driver DBVisualizer uses to talk to Data Distiller. Each time we execute a SQL query in DBVisualizer (a Java application), the query is packaged into API calls. We need a middle layer in between to translate these API calls into PostgreSQL database calls. Think of the driver as a translator for everything that comes in from DBVisualizer. Since PostgreSQL is universally popular as a SQL dialect, this should give you a hint as to why finding a tool that can talk to Data Distiller is easy.

Note: You can guess that DBVisualizer is a Java-based app because it says outright that it needs a JDBC driver to
connect to the database.

Tip: If you have to choose which languages to learn, note that PostgreSQL is a very popular dialect of SQL, and Python is the most popular language for data science.

Your setup screen so far should look like this:

1. Let us now go into Adobe Experience Platform and access the credentials page. You need to navigate to Data
Management->Queries->Credentials

Tip: Make sure you note the IMSOrg and the Sandbox. The credentials are generated per IMSOrg and the sandbox.
Rename your connection accordingly: YourString_{IMSOrgName}_{SandboxName}. For example, in the above
picture, I could rename the connection as Saurabh_DeveloperEnablement_depadmin001

Tip: Click the overlapping squares icon to copy the entire string rather than highlighting it and copying via a keyboard shortcut.

8. Click and copy the information as shown by the arrows:

1. Click and copy the information from the Data Distiller Credentials UI to the Database screen.

Warning: The password expires every 24 hours for security reasons. Data Distiller supports non-expiring passwords for BI dashboard use cases, where such an expiry would hurt the user experience.

1. Now click on the Properties tab and navigate into Driver Properties - these are the JDBC driver configuration settings:

2. Navigate to the sslmode property, click on the Value box, and type require


1. Go back to the Connections tab, click on Connect and you should see a successful connection that looks like
this:

Tip: It is also possible to connect to a specific database in Data Distiller by using the following syntax for dbname=
<sandbox_name>:<database_name>.<schema_name>.all. This is very helpful when you want to restrict access to the
tables to those within a database.
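
For example, with hypothetical names, if your sandbox is prod and you created a database testexample with a schema lookups (the same names used later in this tutorial), the database name in the connection settings would be:

prod:testexample.lookups.all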

Create a SQL Script in DBVisualizer

1. Click on the Scripts tab and right-click on it to Create File. When you do so, it will appear as unnamed.sql under Bookmarks.

2. Name the SQL script and hit Enter. Double-click on the script name and it will show the following screen:

3. Choose the Connection from the dropdown. This feature is cool because it means that I can switch between
development and production sandboxes within AEP.

Executing Test SQL Queries


1. You can start typing the following code:
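
A query that lists all the available tables along with their metadata (which is where the result columns described below come from) is:

show tables;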

The result you will see will look like the following:

Tip: You can write multiple SQL queries on multiple lines as long as you separate them with semicolons.

Tip: You can highlight any SQL query or even a subquery and execute it. This is extremely useful to debug deeply
nested queries.

Some of the columns contain very useful information:

1. datasetID is the unique ID used by the platform to keep track of the datasets.

2. dataset column contains interesting information that can help you differentiate between tables in the Data Lake
vis-a-vis those in the Query Acceleration Layer:

qsaccel.XXX: The qsaccel namespace (it is a database) indicates all the tables that are contained in the
warehousing engine called the Query Accelerated Store. This is a separate storage layer for storing tables
that need to power BI Dashboards. This namespace restricts modifying any of the tables as these are
system-generated tables for Real-Time CDP reporting dashboards. If you have Adobe Real-Time CDP, you
will see these tables pop up. XXX is the table name.

cjm_qsaccel.XXX: The cjm_qsaccel namespace (it is a database) indicates all the tables that are contained
in the warehousing engine called the Query Accelerated Store. This namespace restricts modifying any of
the tables as these are system-generated tables for Adobe Journey Optimizer reporting dashboards. If you
have Adobe Journey Optimizer, you will see these tables pop up. XXX is the table name

XXX.YYY.ZZZ: If you see a name that looks, say, like testexample.lookups.country_lookup, then this table was created as a custom table in the Query Accelerated Store via Data Distiller. testexample (XXX) is the custom database that you created, lookups (YYY) is the schema created underneath the database, and country_lookup (ZZZ) is the table name.

XXX i.e. Names without a dot notation: These are tables in the Data Lake.

Tip: You do not need to use the dot notation when executing queries against these tables. Data Distiller treats all tables uniformly across the storage layers, which means that all table names are unique regardless of where they are stored. Knowing the namespace simply helps you track which tables are in the Data Lake and which are in the Query Acceleration Layer.

1. Copy any of the table names from this list and just execute a test query. Highlight the statement and press the
play button.

select * from adwh_dim_segments;

The results will look like this:

Helpful Configurations and Features

1. You can export the results as a CSV by clicking the export icon. The steps to download the results are self-
explanatory. Note that JSON export is not supported in the free version of DBVisualizer.

2. You can access the SQL execution history by clicking the Display the SQL History icon

3. You can set the number of rows that you want to get back in your results or exports by setting the Max Rows
parameter in the UI shown below:
Tip: By default, DBVisualizer returns 1,000 results. Data Distiller, like most query engines, places a limit of 50,000 rows.
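
Independently of the Max Rows setting, you can also cap the result set in the query itself with a standard LIMIT clause, for example:

select * from adwh_dim_segments limit 10;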

1. If you need more ideas on how to be more productive with DBVisualizer, refer to the following documentation:

Debugging 0801 Errors When Logging In Remotely from a Different Geographic Location

I have encountered errors in the past when trying to connect from Europe to a Data Distiller IMS Org based out of the US.

This is because of this limitation in DBVisualizer:

The solution was to use the IP address of the database server and port 5432.

https://data-distiller.all-stuff-data.com/prep-501-ingesting-json-test-data-into-adobe-experience-platform * * *

PREP 501: Ingesting JSON Test Data into Adobe Experience Platform

In this tutorial, we will learn how to ingest test data, especially nested data, into the Platform. You will need this in order to do your Data Distiller modules.

You need to set up DBVisualizer:

You will need to download this JSON file. Extract the zip and copy the JSON file over:

We are going to ingest LUMA data into our test environment. LUMA is a fictitious online store created by Adobe.

The fastest way to understand what is happening on the website is to check the Products tab. There are 3 categories of products for different (and all) personas. You can browse them, authenticate yourself, and add items to a cart. The data that we are ingesting into the Platform is test website traffic data that conforms to the Adobe Analytics schema.

Unlike the Movie Genre Targeting example, where we simply dropped in a CSV file and the data popped out as a dataset, we cannot do the same with JSON files: we need to specify the nested schema so that the system can understand the structure of the data.

Setup Azure Storage Explorer

1. We will be using an interesting technique to ingest this data which will also form the basis of simulating batch
ingestion. Download the Azure Storage Explorer from this link. Make sure you download the right version based
on your OS and install it.

2. We will be using Azure Storage Explorer as a local file browser to upload files into AEP’s Landing Zone: Azure-based blob storage that stays outside AEP’s governance boundary. The Landing Zone holds data with a 7-day TTL and serves as a mechanism for teams to push data asynchronously into this staging zone prior to ingestion. It is also a fantastic tool for testing the ingestion of test data into AEP.

3. In the Azure Storage Explorer, open up the Connect Dialog by clicking the plug icon and then click on the
ADLSGen2 container or directory option:

4. Choose the connection type as Shared SAS URL. What this means is that if there are multiple users who have
access to the Landing Zone URL, they could all write over each other. If you are seeking isolation, it is only
available at the sandbox level. There is one Landing Zone per sandbox.

5. Name the container and then add the credentials by going into Adobe Experience Platform->Sources->Data
Landing Zone.

6. Now go into Adobe Experience Platform UI->Sources->Catalog->Cloud Storage->Data Landing Zone and
View Credentials:

7. If you click on View Credentials, you should get this screen. Click to copy the SAS URI.

8. Copy the SAS URI into the Storage Explorer Account setup:

9. Click next to complete the setup:


10. The screen will look like the following. Either drag and drop the JSON file or Upload:

11. Navigate to Adobe Experience Platform UI->Sources->Catalog->Cloud Storage->Data Landing Zone. You will see either an Add Data or a Setup button on the card itself. Click it to access the Data Landing Zone.

12. Voila! You should now see the JSON file you uploaded. You will also be able to preview the first 8 to 10 records (the top of the file). These records will be used for validating our ingestion pipeline later.

13. Create an XDM Schema by going to Adobe Experience Platform UI->Schemas->Create XDM Experience Event

14. On the Schema screen, click on the pane for Field groups->Add

15. Search for “Adobe Analytics” as a term for Field Groups:

16. Add the Adobe Analytics ExperienceEvent Template field group. This is a comprehensive field group, but we will be using only a portion of the fields.

17. Save the schema as Luma Web Data.

Ingest Data from Data Landing Zone

1. Click on the XDM Compliant dropdown and change it to Yes:

2. Go to the next screen and fill out the details exactly as shown in the screen below. Name the dataset luma_web_data, choose the Luma Web Data schema, and enable Partial Ingestion.

3. Configure the scheduling frequency to Minute, recurring every 15 minutes.

4. Click Next and Finish. Your dataflow should execute and you should see the dataset luma_web_data in Adobe
Experience Platform UI->Datasets. Click on the dataset luma_web_data. You should see about 733K records
ingested.

Note: By marking the dataset as XDM compatible in the dataflow step, we avoided having to go through a mapping
process. I was able to choose XDM compatible because the Adobe Analytics schema I chose was a superset of the
Luma schema. There is no point in me doing a manual mapping. If you are bringing in Adobe Analytics data in
practice, you may not be this lucky as you will need to account for eVars and will need to do the mapping. That is
beyond the scope of this guide.

1. The first query that you can type is:

select * from luma_web_data;

2. To get 50,000 results, you need to configure the Max Rows setting in DBVisualizer.

3. If you need to query a complex object, say the web object, use the to_json construct:

select to_json(web) from luma_web_data;
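
You can also reach into individual nested fields with dot notation. For example, assuming the standard web field group paths are populated in your data, the following should return the page names:

select web.webPageDetails.name from luma_web_data limit 10;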


https://data-distiller.all-stuff-data.com/prep-600-rules-vs.-ai-with-data-distiller-when-to-apply-when-to-rely-let-roi-decide * * *
PREP 600: Rules vs. AI with Data Distiller: When to Apply, When to Rely, Let ROI Decide

Author’s preface: Despite all the advanced capabilities in Data Distiller, no algorithm can replace the creativity at the
heart of great marketing. The most impactful campaigns—the ones that resonate deeply, evoke emotion, and build
lasting brand loyalty—come from human intuition, cultural awareness, and an instinctual understanding of customers
that Data Distiller simply cannot replicate. Data Distiller can optimize, but it cannot originate the kind of storytelling
that turns a brand into a movement. The most profound marketing messages don’t come from data alone; they come
from a deep, human connection to what customers truly want, fear, and aspire to be. Data Distiller can help scale
personalization and efficiency, but the soul of marketing remains a human art, where creativity, empathy, and
experience will always be irreplaceable.

AI is transforming marketing, but does that mean we should always use it? Not necessarily. Rule-based approaches
still work well in many situations, sometimes even better than AI! The key is knowing when to stick with rules and
when to switch to AI-driven systems.

Rule-based systems offer simplicity, transparency, and quick implementation, making them ideal when customer
behavior is predictable and marketing logic remains stable over time. However, as marketing strategies become more
complex and dynamic, manually maintaining rules becomes unmanageable. This is where AI steps in, enabling
personalization at scale by automating decision-making, uncovering hidden customer insights, and adapting to real-
time behaviors.

Even within AI-driven marketing, not all AI is created equal. Statistics and machine learning (ML) models in Data
Distiller play a critical role in extracting deep behavioral patterns that traditional rule-based systems miss. Rather than
relying on predefined logic, ML models detect trends, correlations, and anomalies—helping marketers segment
audiences more effectively, predict purchase intent, and optimize ad spend with greater precision.

So, how do you decide when to use rule-based marketing and when to switch to AI? In this article, I’ll break down the
trade-offs, showing real-world examples of when traditional marketing automation is enough and when AI-driven
personalization becomes the better choice.

The ROI of AI/ML in Marketing: Is It Worth the Investment?

Investing in AI and Machine Learning (ML) for marketing isn’t just about leveraging new technology—it’s about
delivering measurable business impact. But does AI truly provide a better return on investment (ROI) than traditional
rule-based approaches? The answer depends on factors such as scale, complexity, and adaptability in your marketing
strategy.

Rule-based marketing systems are cost-effective and easy to implement, making them ideal for predictable customer
behaviors and straightforward automation. They require low upfront investment and work well in static environments
where personalization needs are simple. However, as marketing complexity grows, rule-based systems fail to scale
efficiently, leading to increased manual effort, inconsistent customer experiences, and missed opportunities for deeper
engagement.

AI-driven marketing, on the other hand, excels in dynamic, high-volume environments where customer behavior is
constantly evolving. AI and ML models can optimize campaigns in real-time, increase conversion rates, and improve
customer retention—all leading to higher marketing efficiency and revenue growth. While AI implementation requires
investment in infrastructure, data, and expertise, the long-term benefits—such as reduced customer acquisition costs,
improved lifetime value, and higher engagement rates—can significantly outweigh the initial expenses.

Investing in AI using Data Distiller’s capabilities sounds promising, but how do you actually know if it’s delivering
value? Many companies rush to adopt AI without clear success metrics, assuming that more automation = better
results. The hard truth? AI is not always worth it—and in some cases, it can be an expensive distraction.
The first sign that AI is delivering value is measurable lift in key performance metrics. If AI-powered
recommendations or predictive models are driving higher engagement rates, improved conversion rates, lower
customer acquisition costs, or better return on ad spend (ROAS), then you have a clear, quantifiable impact.
However, if your AI-driven campaigns perform only slightly better (or worse) than rule-based approaches, you have to
ask: Is the complexity worth it?

Another reality check is whether AI is actually reducing workload or just adding technical debt. AI should simplify
marketing decision-making, not create more confusion. If your team is spending too much time interpreting AI
models, constantly retraining data, or troubleshooting unpredictable AI-driven decisions, it might be costing more
than it’s saving. A rule-based system—though less sophisticated—may deliver 80% of the value with 20% of the
effort.

The biggest AI myth is that once implemented, it will continuously improve on its own. In reality, AI models decay
over time if they are not monitored, retrained, and optimized. If your AI models are still using last year’s data to
predict customer behavior, they may be making the wrong decisions entirely. AI needs constant iteration and high-
quality data—without that, it can make worse decisions than simple rules.

Ultimately, AI only delivers value when it is applied strategically. If your marketing automation runs smoothly with
rules, don’t introduce AI just because it’s trendy. But if your marketing needs real-time decision-making, complex
pattern recognition, or large-scale personalization, AI can generate significant ROI—as long as you measure, monitor,
and optimize it continuously.

Traditional Rule-Based Marketing – When Rules Are Enough

Before AI, marketers used rules and knowledge graphs to automate personalization. And guess what? They still work
—sometimes even better than AI! In fact, with tools like Data Distiller, marketers can take rule-based personalization
even further by leveraging enriched attributes. These attributes can be applied at the profile level to create deeper
insights into customer behavior or used for segmentation and personalization, enabling more granular and targeted
marketing strategies. By incorporating rich customer data—such as lifetime value, engagement scores, or propensity to
purchase—rule-based systems can deliver highly effective personalization without requiring complex AI models.

Rule-Based Email Personalization (Best for Simple, Predictable Workflows)

This approach is best used when customer behavior follows clear, predictable patterns, allowing marketers to define
straightforward rules for engagement. It is particularly effective when the underlying logic remains stable over time,
meaning there is little need for frequent adjustments or complex modeling. Additionally, it is ideal when quick
implementation is required without data science expertise, as rule-based systems can be easily set up using existing
marketing tools without the need for advanced AI or machine learning capabilities.

Example: E-Commerce Re-engagement Campaign

A retail brand wants to bring back customers who abandoned their carts.

Rule-Based Approach:

If (cart abandoned) → Send discount email

If (user ignores email) → Send reminder after 3 days

This approach works because it is quick to set up in any email marketing tool like Adobe Journey Optimizer or Adobe
Campaign, allowing marketers to automate engagement without the need for complex AI models. The simple if-then
logic makes it easy to implement and manage. However, the main limitation is that it is not adaptive—every customer
receives the same response, regardless of individual preferences or behaviors. As a result, this method may not be
effective for all customer types, since it lacks real-time personalization and dynamic adjustments based on user
interactions.
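
As a rough illustration, here is a minimal SQL sketch of how such a rule could be computed as an enriched attribute in Data Distiller (the cart_events and purchase_events tables and their columns are hypothetical, not part of any standard schema):

-- Flag profiles that abandoned a cart and have not purchased since (hypothetical tables and columns)
SELECT
  c.profile_id,
  CASE
    WHEN p.profile_id IS NULL THEN 'send_discount_email'  -- rule: cart abandoned, no later purchase
    ELSE 'no_action'
  END AS next_action
FROM cart_events c
LEFT JOIN purchase_events p
  ON c.profile_id = p.profile_id
  AND p.purchase_timestamp > c.cart_timestamp
WHERE c.event_type = 'cart_abandoned';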

Knowledge Graphs for Product & Customer Relationships (Great for SEO & Content Structuring)

This approach is best used when organizing products, services, or customer preferences in a structured way, making it
easier for users to navigate and find relevant information. It is particularly effective for optimizing search engine
results and content recommendations, ensuring that related products or topics are properly linked and categorized.
Additionally, it works well when AI-powered personalization is not necessary, such as in basic website search or static
filtering, where predefined relationships between items provide sufficient accuracy without the complexity of machine
learning.

A knowledge graph structures relationships by connecting entities (such as products, customer attributes, and
behaviors) in a semantic, flexible manner, allowing AI and marketing systems to infer meaningful connections. Unlike
primary and secondary key relationships in traditional databases, which establish rigid, one-to-one or one-to-many
relationships based on unique identifiers, knowledge graphs create contextual, many-to-many connections that mimic
human understanding. For example, in a relational database, a product table might have a primary key (Product ID)
and a foreign key (Category ID) to indicate that a moisturizer belongs to the “Skincare” category. However, in a
knowledge graph, “Moisturizer” is not just linked to “Skincare” as a category but also to concepts like “Dry Skin,”
“Hydration,” “Winter Care,” and even “Luxury Brands.” This graph-based approach enables flexible, real-time
discovery of relationships rather than relying on predefined table joins and static relationships. It’s especially useful in
personalization, where customers don’t just fit into rigid database categories but have complex, evolving behaviors and
preferences that knowledge graphs can adapt to and leverage dynamically.
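
To make the contrast concrete, here is a minimal SQL sketch of the two representations (all table and column names are hypothetical): the relational model pins a product to a single category through a foreign key, while a simple edge table can hold many labeled relationships per product.

-- Relational model: one rigid category per product via a foreign key
CREATE TABLE categories (
  category_id   INT PRIMARY KEY,
  category_name VARCHAR(100)
);

CREATE TABLE products (
  product_id   INT PRIMARY KEY,
  product_name VARCHAR(100),
  category_id  INT REFERENCES categories(category_id)
);

-- Knowledge-graph style: many labeled edges per entity
CREATE TABLE product_edges (
  source_entity VARCHAR(100),  -- e.g. 'Moisturizer'
  relationship  VARCHAR(50),   -- e.g. 'used_for'
  target_entity VARCHAR(100)   -- e.g. 'Dry Skin', 'Hydration', 'Winter Care'
);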

Example: Google’s Knowledge Graph for E-commerce

A skincare brand wants to improve product recommendations based on skin type.

A knowledge graph structures relationships

“Moisturizer” → Used for → “Dry Skin”

“Vitamin C Serum” → Best for → “Anti-aging”

“Sunscreen” → Needed for → “Sensitive Skin”

This approach works well because it improves search and navigation on websites by providing structured filtering for
products and content, making it easier for users to find what they need. It is particularly effective for static information,
as it doesn’t require real-time updates or complex data processing. However, its main limitation is that it cannot predict
user behavior, as it relies on pre-structured relationships rather than learning from interactions. Unlike AI-driven
recommendations, it does not dynamically adapt to changing user preferences, which can make personalization less
effective over time.

In fact, the schema modeling done in XDM for Unified Customer Profile follows a similar principle, embedding these
kinds of relationships directly into the data model. This structured approach is at the heart of data modeling, ensuring
that different attributes—such as customer preferences, demographics, and behavioral data—are organized in a way
that enhances segmentation and personalization.

Knowledge graphs are highly useful in marketing for structuring and leveraging customer data to enhance
personalization and automation. They enable customer profiles and personalization by linking attributes such as
purchase history, demographics, and browsing behavior to predict future actions and tailor marketing efforts
accordingly.
For product discovery and recommendations, knowledge graphs establish relationships between products, allowing AI
to suggest relevant items (e.g., “Customers who buy X also like Y”).

In intent-based AI chatbots, they provide contextual understanding, enabling chatbots to query structured data and
deliver more accurate responses. Additionally, knowledge graphs play a crucial role in SEO and content optimization,
where search engines use them to enhance search relevance and generate knowledge panels, improving content
visibility and discoverability.

When to Introduce AI for More Scalability

At a certain point, rule-based systems become unmanageable, as manually defining and maintaining rules for every
possible customer behavior does not scale. This is where AI becomes essential, enabling personalization that adapts
dynamically to customer preferences in real time. However, simply switching to AI isn’t enough—to truly understand
customer behavior, Statistics and ML models in Data Distiller play a critical role in uncovering hidden patterns that
rule-based logic would miss.

Unlike predefined rules that operate on explicit conditions, statistical models and ML algorithms detect trends,
correlations, and outliers in large datasets. For example, clustering algorithms in Data Distiller can group customers
based on subtle behavioral similarities, while predictive models can estimate purchase intent, churn likelihood, or
product affinity—insights that rule-based systems cannot infer on their own. These models extract meaningful signals
from raw data, allowing for deeper segmentation, more precise recommendations, and automated decision-making at a
scale that manual rule-setting could never achieve.
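As a small, concrete taste of the statistical side (the actual clustering and predictive model training have their own dedicated workflow, covered in the Statistics and Machine Learning units of this guide), even plain aggregate statistics in Data Distiller SQL can surface correlations and outliers that no hand-written rule would encode. The table customer_features and its columns below are hypothetical stand-ins for whatever feature view you have derived.

-- Hypothetical per-customer feature table; substitute your own derived dataset.
-- A z-score flags customers whose order value is an outlier relative to the population.
SELECT customer_id,
       avg_order_value,
       (avg_order_value - AVG(avg_order_value) OVER ()) / STDDEV(avg_order_value) OVER () AS order_value_zscore
FROM customer_features;

-- corr() quantifies a relationship (e.g., between visit frequency and order value)
-- that a rule author would otherwise have to guess at.
SELECT corr(site_visits, avg_order_value) AS visits_vs_order_value_correlation
FROM customer_features;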

AI for Dynamic Personalization (Best for Large-Scale User Interactions)

This approach is ideal when customer preferences change frequently, requiring a system that can continuously learn
and adapt without manual intervention. It becomes especially useful when manually setting up and maintaining rules
becomes too complex, as AI can identify patterns and make adjustments automatically. Additionally, it is the best
choice when marketing campaigns demand real-time adaptation, ensuring that personalized content, recommendations,
and engagement strategies evolve dynamically based on user behavior and interactions.

Example: AI-Powered Email Personalization

A fashion brand wants to personalize promotional emails based on user behavior.

Rule-Based Approach:

If (customer browses sneakers) → Send email about sneakers

If (customer buys sneakers) → Send email about socks
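For contrast, a rule set like this maps directly onto a CASE expression in Data Distiller. This is purely illustrative; the table web_behavior_summary and its flag columns are hypothetical.

-- Hypothetical behavioral summary view; the table and column names are assumptions.
SELECT customer_id,
       CASE
         WHEN bought_sneakers  = 1 THEN 'socks_promo'      -- "buys sneakers" rule
         WHEN browsed_sneakers = 1 THEN 'sneakers_promo'   -- "browses sneakers" rule
         ELSE 'generic_newsletter'
       END AS email_variant
FROM web_behavior_summary;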

AI-Powered Approach

The AI-powered approach enhances email personalization by learning hidden patterns and predicting what the user is
likely to buy next, going beyond static rules. Instead of relying on predefined triggers, AI dynamically adjusts email
content based on a customer’s browsing habits, past purchases, and engagement with previous emails, ensuring highly
relevant and timely messaging. This works particularly well because AI automatically adapts to different customer
types, eliminating the need for marketers to manually define every rule. However, the approach does have some
limitations—it requires historical data to train models effectively, and its implementation is more complex, as it
demands a robust ML infrastructure to process and analyze large-scale behavioral data in real time.

Advanced AI-Driven Marketing – When AI is the Best Option


Now, let’s explore when AI-powered marketing truly outperforms rules.

AI-Powered Lead Scoring (Best When Rules Fail to Capture Complexity)

This approach is ideal when manually scoring leads becomes too simplistic, as traditional methods may not capture the
full complexity of customer behavior. It is particularly useful when customer intent is influenced by subtle behavioral
signals, such as time spent on a pricing page, repeated interactions with product demos, or engagement patterns that go
beyond basic actions like email opens and clicks.

Example: Predicting High-Value Customers

A B2B software company wants to prioritize leads who are most likely to buy.

Traditional Rule-Based Approach:

If (email opened + 3+ website visits) → High-value lead

If (email unopened + no engagement) → Low-value lead

AI-Powered Approach

The AI-powered approach enhances lead scoring by analyzing past successful conversions to identify patterns that
indicate high purchase intent. Instead of relying on predefined criteria, AI uncovers deep behavioral insights, such as
time spent on a pricing page or repeated engagement with key content, to predict the likelihood of conversion more
accurately. This results in better lead prioritization, allowing sales teams to focus on prospects with the highest
potential. Additionally, AI identifies hidden trends that rule-based logic might overlook, improving overall targeting
efficiency. However, this approach has some limitations—it requires labeled training data, meaning historical
conversion data must be available for the model to learn effectively. Additionally, AI-generated scores can be harder to
interpret than simple rule-based lead rankings, making transparency and explainability important considerations.

AI for Ad Spend Optimization (Best When A/B Testing is Too Slow)

This approach is ideal when manual A/B testing becomes too time-consuming, as traditional methods require running
experiments over extended periods to gather meaningful insights. It is particularly beneficial when there is a need to
optimize ad budgets automatically, ensuring that spending is dynamically adjusted based on real-time performance.
Instead of relying on fixed allocations, AI continuously analyzes engagement, conversions, and audience behavior to
shift budgets toward the most effective campaigns, maximizing return on investment without constant manual
intervention.

Example: AI for Facebook Ad Targeting

A travel company runs ads for different customer segments.

Traditional A/B Testing Approach:

Marketers manually split audiences and test different ad creatives.

They analyze performance after weeks of running ads.

AI-Powered Approach

The AI-powered approach optimizes ad spend by dynamically adjusting bids based on real-time user engagement,
ensuring that marketing budgets are allocated efficiently. AI predicts which ad creatives will perform best even before
testing, allowing brands to launch high-impact campaigns faster. Additionally, it automatically redistributes budget to
the most effective campaigns, maximizing return on investment without requiring manual intervention. This approach
works particularly well because it eliminates guesswork in budget allocation and continuously optimizes performance
using fresh data. However, one key limitation is that it requires high-quality real-time data to make accurate
predictions and adjustments, making data consistency and accuracy essential for success.

https://data-distiller.all-stuff-data.com/unit-2-data-distiller-data-exploration/explore-100-data-lake-overview * * *

1. Unit 2: DATA DISTILLER DATA EXPLORATION

EXPLORE 100: Data Lake Overview


The data lake in Adobe Experience Platform centralizes and manages diverse data types, enabling organizations to
harness their data’s full potential for personalized customer experiences.

Adobe Experience Platform includes a data lake as one of its core components. The data lake in Adobe Experience
Platform is a centralized and scalable repository that stores vast amounts of raw, structured, semi-structured, and
unstructured data from various sources. Here’s a brief overview of what the data lake in Adobe Experience Platform
represents:

1. Data Storage: The data lake is designed to store diverse types of data, including customer data, event data,
transaction data, and more. It can handle data in its raw, native format, which makes it highly flexible for
accommodating different data sources.

2. Scalability: Adobe’s data lake is built to scale horizontally, allowing it to handle large volumes of data
efficiently. It can accommodate data from multiple channels, devices, and touchpoints, making it suitable for
enterprises with substantial data needs.

3. Data Ingestion: The platform provides tools and connectors for ingesting data from various sources, such as
CRM systems, web interactions, mobile apps, and IoT devices. This data ingestion can be both batch and real-
time, ensuring that data is continuously updated.

4. Data Processing: Data within the data lake can be processed using Adobe Experience Platform’s data processing
capabilities. This includes data cleansing, transformation, enrichment, and normalization to prepare data for
analytics and other use cases.

5. Data Governance: Adobe Experience Platform includes features for data governance and compliance, allowing
organizations to manage data access, security, and privacy in accordance with regulations like GDPR and CCPA.

6. Data Activation: Data stored in the data lake can be activated for various purposes, such as creating personalized
customer experiences, running marketing campaigns, generating insights, and more.

7. Unified Customer Profiles: The data lake plays a crucial role in building unified customer profiles by
consolidating data from different sources. This enables a 360-degree view of the customer and helps in delivering
personalized experiences.

8. Machine Learning and AI: Adobe Experience Platform integrates machine learning and artificial intelligence
capabilities, allowing organizations to apply advanced analytics and AI models to data within the data lake.

9. API Access: Developers can access and work with data in the data lake using APIs and SDKs, enabling custom
application development and integration with other systems.

Last updated 6 months ago

https://data-distiller.all-stuff-data.com/unit-2-data-distiller-data-exploration/explore-101-exploring-ingested-batches-
in-a-dataset-with-data-distiller * * *
1. Unit 2: DATA DISTILLER DATA EXPLORATION

EXPLORE 101: Exploring Ingested Batches in a Dataset with Data Distiller
It is important for you to understand how the data ingestion process works and why interrogating the records ingested
in a batch may be an important tool in your arsenal to address downstream issues.

Last updated 6 months ago

One of the key questions that you will need to answer at some point is verifying and validating the records within a
batch that has been successfully ingested into the Adobe Experience Platform.

Remember that a “batch” is a data ingestion concept: a collection of records, whether it arrives as a file or via batch
or streaming ingestion, is materialized as a single “unit” on the data lake. In essence, it is a materialization construct
used by AEP.

Records that are ingested have to pass through several checks before such materialization can take place. This is
handled during the mapping part of the data ingestion process. There are several categories of issues that can arise, and
you need to be aware of them. They will manifest themselves as error codes if you peek into a dataset:

1. Navigate to the Datasets pane and if you are unlucky, click on a batch that has failed:

2. You will see a bunch of errors that look like this perhaps:

Some bad things have happened to our data ingestion. Let us understand the error codes:

1. ERROR: These are the most egregious errors possible, where the data was corrupted or did not conform to the
expected format. Such failures are serious and the entire batch will fail.

2. DCVS: Not seen in the above example, but these are less serious than data corruption issues, such as a missing
required field. All of these rows are simply skipped. A separate dataset containing such records is NOT available on
the data lake; these records are kept in a separate location and are accessible through the error diagnostics
tools (UI or API). The reality of dealing with such situations is that if those skipped records are critical for your
use case, you will need to surgically identify them in the source system and re-ingest the data.
3. MAPPER: These appear to be the least harmful of the three, but you need to pay attention to them because these
are rows that make it into the final dataset BUT the data may have been altered in the process. The mapping
process tries to convert the string data at its input into the output data type. When it cannot do so because of a
malformed string, it will write NULLs into the result. If you were not paying attention, you now have a field full
of NULLs that possibly could have been rectified by you. Thus batches with MAPPER warnings become good
candidates for some data exploration to see what is going on.

Accessing Dataset Batch Metadata

In order to see what system fields are available in the dataset, set the following in a session:

set drop_system_columns=false;

By doing so, you will see two new columns appear to the far right: _acp_system_metadata and
_ACP_BATCHID.
As data gets ingested into the platform, a logical partition is assigned to the data based on what is coming in at the
input. _acp_system_metadata.sourceBatchId is a logical partition, and _ACP_BATCHID is a physical partition assigned
after the data has been mastered onto the data lake in Adobe Experience Platform.

Let us execute the following query:

select _acp_system_metadata, count(distinct _ACP_BATCHID) from movie_data
group by _acp_system_metadata

The results are:

This means that the number of batches at the input need not correspond to the number of batches written. In fact, the
system decides the most efficient way to batch and master the data onto the data lake. Let me explain this through an
example below.

Let’s run this on a different dataset below. For those of you who are motivated, you need to ingest this data using
XDM mapping into the Adobe Experience Platform.

This file is a deeply nested set of 35,000 records and they look like this:

select * from drug_orders

Let us generate some batch-based statistics on this dataset:

select _acp_system_metadata, count(distinct _ACP_BATCHID) as numoutputbatches,
       count(_ACP_BATCHID) as recordcount
from drug_orders
group by _acp_system_metadata

The answers look like this:

The above shows that I created 3 input batches where I ingested 2000, 24000, and 9000 records each time. However,
when they got mastered, there was only one unique batch each time.

Remember that all records visible within a dataset are the ones that successfully got ingested. That does not mean that
all the records that were sent at the source input are present. You will need to look at the data ingestion failures to find
the batches/records that did not make it in.

Querying a Batch in a Dataset

1. If you want to simulate the creation of a batch go to Movie Genre Targeting Example and complete the section
on ingesting CSV files.

2. If you open up the dataset pane, you will see this:

3. Copy the batch ID by going to the panel on the right:

4. Now use the following query to retrieve all the records that made it into the dataset as part of that batch:

select * from movie_data where _ACP_BATCHID='01H00BKCTCADYRFACAAKJTVQ8P' LIMIT 1;

_ACP_BATCHID is the keyword we use to filter on the batch ID. The LIMIT clause is useful if you want
to restrict the number of rows displayed; a filter condition is usually more desirable.
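Because MAPPER warnings can silently NULL out fields (see the error codes above), it is often worth profiling a suspect batch for NULL counts before trusting it downstream. In the sketch below, genre and runtime are hypothetical column names; substitute fields from your own movie_data schema.

-- Profile a suspect batch for NULLed fields (genre and runtime are hypothetical columns).
SELECT count(*) AS total_rows,
       sum(CASE WHEN genre   IS NULL THEN 1 ELSE 0 END) AS null_genre,
       sum(CASE WHEN runtime IS NULL THEN 1 ELSE 0 END) AS null_runtime
FROM movie_data
WHERE _ACP_BATCHID = '01H00BKCTCADYRFACAAKJTVQ8P';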

1. If you executed this query in the Query Editor in the Adobe Experience Platform, the results will be truncated at
100 rows. The editor was designed as a quick preview tool. To get up to 50,000 rows, you need to use a third-
party tool like DBVisualizer (my favorite). DBeaver is another widely used option. Keep in mind that these
editor tools are quite capable and mostly free.

The movie_data table now has the metadata columns available.

A GROUP BY on the source batches shows the number of output batches.

Preview of the first set of records in the JSON-based drug_orders dataset.

Distribution of how input batches were mastered, with record counts.

https://data-distiller.all-stuff-data.com/unit-2-data-distiller-data-exploration/explore-201-exploring-web-analytics-data-
with-data-distiller * * *

1. Unit 2: DATA DISTILLER DATA EXPLORATION

EXPLORE 201: Exploring Web Analytics Data with Data Distiller


Web analytics refers to the measurement, collection, analysis, and reporting of data related to website or web
application usage.

Last updated 6 months ago

You need to make sure you complete this module and its prerequisites:

We are going to ingest LUMA data into our test environment. This is a fictitious online store created by Adobe

The fastest way to understand what is happening on the website is to check the Products tab. There are 3 categories of
products for different (and all) personas. You can browse them. You authenticate yourself and also can add items to a
cart. The data that we are ingesting into the Platform is the test website traffic data that conforms to the Adobe
Analytics schema.

We need to run some analytical queries on this dataset.

Count the Number of Events in the AA Dataset

SELECT count(event_id) FROM Adobe_Analytics_View

The answer should be 733,265. This is also the web traffic volume.

Count of Visitors and Authenticated Visitors

SELECT COUNT(DISTINCT mcid_id) AS Cookie_Visitors,
       COUNT(DISTINCT email_id) AS Authenticated_Visitors
FROM Adobe_Analytics_View

The answer you should get for both is 30,000. This means that every cookie is associated with an email, which
at first should come across as strange. But this is demo data, and we can assume that someone has done the ID
resolution for us for ALL mcids.

Time Range of the Dataset

SELECT min(TimeStamp), max(TimeStamp) FROM Adobe_Analytics_View

The time range should come out as 2020-06-30 22:04:47 to 2021-01-29 23:47:04.

Top Web Pages by Page View Counts

SELECT WebPageName, count(WebPageName) AS WebPageCounts FROM Adobe_Analytics_View
GROUP BY WebPageName
ORDER BY WebPageCounts DESC

Count the Number of Visits/Sessions

One of the foundational concepts of web analytics is the idea of a session or a visit. When you visit a website, a timer
starts ticking and all the pages that you visited, say in the next 30 minutes are part of that session. Sessions are great
because they are the atomic unit of a journey. Customers interact with a channel or a medium as part of a session.
What they do in the session has some intent or goal - if we can study what happens in these sessions, then we can get a
solid understanding of the users.

SELECT mcid_id, `Timestamp`,
    to_json(SESS_TIMEOUT(`Timestamp`, 60 * 30)
    OVER (PARTITION BY mcid_id
    ORDER BY `Timestamp`
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW))
    AS session
FROM Adobe_Analytics_View
ORDER BY `Timestamp` ASC

Let us understand the code first:

1. to_json(SESS_TIMEOUT(`Timestamp`, 60 * 30)): Here, the SESS_TIMEOUT function is applied to
the Timestamp column with an expiration window of 30 minutes (60 * 30 seconds). It sessionizes the events: a new
session starts whenever the gap between consecutive events for the same visitor exceeds 30 minutes. The resulting
session structure is then converted to a JSON string using the to_json function.

2. OVER (PARTITION BY mcid_id ORDER BY Timestamp ROWS BETWEEN UNBOUNDED


PRECEDING AND CURRENT ROW): This is a window function that operates on partitions of data defined by
the mcid_id column. It orders the rows within each partition based on the Timestamp column in ascending
order. The ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW clause specifies that the
window includes all rows from the beginning of the partition up to the current row.

The result is the following:

Let us now parse the results in the session object:

1. If you look at the mcid_id column, all of those rows belong to the same person. The sessionization always
operates within a single mcid_id

2. timestamp_diff: The difference in time, in seconds, between the current record and the prior record. It
starts with “0” for the first record and increases for the other records within the same session as indicated by
depth.

3. num: A unique session number, starting at 1 for each mcid_id. isnew is just a flag as to whether the record is
the start of a new session or not.

I can now extract the session number at a visitor level and also assign it a unique session number across all visitors by
doing the following:

SELECT mcid_id, `Timestamp`, concat(mcid_id, '-', `session`.num) AS unique_session_number,
    `session`.num AS session_number_per_mcid
FROM
(
    SELECT mcid_id, `Timestamp`, SESS_TIMEOUT(`Timestamp`, 60 * 30)
    OVER (PARTITION BY mcid_id
    ORDER BY `Timestamp`
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    AS session
    FROM Adobe_Analytics_View
    ORDER BY `Timestamp` ASC
)

Warning: I have removed to_json in the code here as I need to access the fields within the session object. If I use
to_json, it will create a string and the fields cannot be extracted.

The results are the following:

Let us compute the number of visits overall:

SELECT COUNT(DISTINCT unique_session_number) FROM (
    SELECT mcid_id, `Timestamp`, concat(mcid_id, '-', `session`.num) AS unique_session_number,
        `session`.num AS session_number_per_mcid
    FROM
    (
        SELECT mcid_id, `Timestamp`, SESS_TIMEOUT(`Timestamp`, 60 * 30)
        OVER (PARTITION BY mcid_id
        ORDER BY `Timestamp`
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
        AS session
        FROM Adobe_Analytics_View
        ORDER BY `Timestamp` ASC))

The result should be 104,721.

The average number of pages visited per visit is 733,265/104,721 ≈ 7. This agrees with what we see when we
inspect the results.
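Rather than dividing the two counts by hand, you can also compute the average number of events per session directly by reusing the sessionized subquery from above:

SELECT COUNT(*) / COUNT(DISTINCT unique_session_number) AS avg_events_per_session
FROM (
    SELECT mcid_id, `Timestamp`, concat(mcid_id, '-', `session`.num) AS unique_session_number
    FROM (
        SELECT mcid_id, `Timestamp`, SESS_TIMEOUT(`Timestamp`, 60 * 30)
        OVER (PARTITION BY mcid_id
        ORDER BY `Timestamp`
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session
        FROM Adobe_Analytics_View
    )
)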

The top web pages by counts from June 30, 2020 to Jan 29, 2021.

Sessionization on the event data

Session number assigned at the visitor level and across all visitors.

https://data-distiller.all-stuff-data.com/unit-2-data-distiller-data-exploration/explore-202-exploring-product-analytics-
with-data-distiller * * *

1. Unit 2: DATA DISTILLER DATA EXPLORATION

EXPLORE 202: Exploring Product Analytics with Data Distiller


Product analytics is the process of collecting, analyzing, and interpreting data related to a product’s usage and
performance.

Last updated 5 months ago

You need to make sure you complete this module and its prerequisites:

We are going to ingest LUMA data into our test environment. This is a fictitious online store created by Adobe
The fastest way to understand what is happening on the website is to check the Products tab. There are 3 categories of
products for different (and all) personas. You can browse them. You authenticate yourself and also can add items to a
cart. The data that we are ingesting into the Platform is the test website traffic data that conforms to the Adobe
Analytics schema.

Most Popular Products by Web Page Traffic Volume

SELECT Product.`name` AS ProductName, WebPageName, count(WebPageName) AS WebPageCounts
FROM (SELECT WebPageName, explode(productListItems) AS Product FROM Adobe_Analytics_View)
GROUP BY WebPageName, Product.`name`
ORDER BY WebPageCounts DESC

We just exploded productListItems, i.e. created a row for each item in the array, grouped by web page and
product name, and then aggregated the web page counts.

The results are:

Most Popular Products by Revenue

First, let us find the most popular products by price totals for all possible commerce event types:

SELECT Product.`name` AS ProductName, SUM(Product.priceTotal) AS ProductRevenue,
       WebPageName, count(WebPageName), commerce_event_type
FROM (SELECT WebPageName, explode(productListItems) AS Product, commerce_event_type
      FROM Adobe_Analytics_View)
GROUP BY WebPageName, Product.`name`, commerce_event_type
ORDER BY ProductRevenue DESC

Here are the results:

If you inspect the WebPageName or commerce_event_type columns, you will observe that “order” is the event type we
are looking for.

SELECT Product.`name` AS ProductName, round(SUM(Product.priceTotal)) AS ProductRevenue,
       WebPageName, count(WebPageName), commerce_event_type
FROM (SELECT WebPageName, explode(productListItems) AS Product, commerce_event_type
      FROM Adobe_Analytics_View)
WHERE commerce_event_type='order'
GROUP BY WebPageName, Product.`name`, commerce_event_type
ORDER BY ProductRevenue DESC

We used round to round off the decimals and filtered by the order commerce event type.

I am now curious as to what are the different stages that my customers are going through on my website:

SELECT commerce_event_type AS Customer_Stages, COUNT(commerce_event_type)
FROM Adobe_Analytics_View
GROUP BY commerce_event_type

We get the following:

The decrease in the counts across the stages is what we would have expected. Notice some odd things
about the data: Luma customers seem very eager to add items to their wishlist (at least a 33% conversion from
viewing a page), at least 50% of those who add to a wishlist seem to check out, and 50% of those place an order. If
there were one thing I would fix, it would be the checkout-to-order conversion rate.
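If you would rather have those stage-to-stage conversion percentages computed for you instead of eyeballing them, a window function over the grouped counts does the job (the NULL in the first row simply means the top stage has no previous stage):

SELECT commerce_event_type AS funnel_stage,
       COUNT(*) AS stage_count,
       ROUND(100.0 * COUNT(*) /
             LAG(COUNT(*)) OVER (ORDER BY COUNT(*) DESC), 1) AS pct_of_previous_stage
FROM Adobe_Analytics_View
WHERE commerce_event_type IS NOT NULL
GROUP BY commerce_event_type
ORDER BY stage_count DESC;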

But wait, how can someone checkout without adding items to a cart?

And that information is there in the WebPageName column:

SELECT WebPageName, COUNT(WebPageName) AS WebPageCounts
FROM Adobe_Analytics_View
WHERE WebPageName IN ('order', 'checkout', 'addToCart')
GROUP BY WebPageName
ORDER BY WebPageCounts DESC;

The results are:

I chose order, checkout, and addToCart because all the other web pages are just product pages. Note that the
numbers for checkout and order match perfectly with our commerce query. The web page column does not have
information about ProductListAdds. As an analyst, you may assume that the data is to be trusted, but in
this example it did not make sense that an add-to-cart step was missing.

Let us put these funnel stages together in a query:

SELECT commerce_event_type AS Funnel_Stage, COUNT(commerce_event_type) AS Count
FROM Adobe_Analytics_View
GROUP BY commerce_event_type

UNION ALL

SELECT WebPageName AS Funnel_Stage, COUNT(WebPageName) AS Count
FROM Adobe_Analytics_View
WHERE WebPageName IN ('order', 'checkout', 'addToCart')
GROUP BY WebPageName

ORDER BY Count DESC;

The results will be:

The results show that ProductListAdds is indeed equivalent to “addToCart”. ProductListAdds is not the addition to the
product wish list as we had assumed. Our analysis is helping us reconcile the differences in the data modeling present
in the data.

Most popular products by web page traffic volume

Product revenue across all commerce event types

Most popular products are not necessarily the most popular web pages.

Funnel stages as indicated by commerce event types.

The WebPageName query gives information about addToCart.

Unioning of two datasets gets us all the stages possible.

https://data-distiller.all-stuff-data.com/unit-2-data-distiller-data-exploration/explore-200-exploring-behavioral-data-
with-data-distiller-a-case-study-with-adobe-analytics-data * * *
1. Unit 2: DATA DISTILLER DATA EXPLORATION

EXPLORE 200: Exploring Behavioral Data with Data Distiller - A Case Study with Adobe Analytics Data

You need to make sure you complete this module that ingests Adobe Analytics web data into the Platform:

And of course, you should have:

We are going to ingest LUMA data into our test environment. This is a fictitious online store created by Adobe

The fastest way to understand what is happening on the website is to check the Products tab. There are 3 categories of
products for different (and all) personas. You can browse them. You authenticate yourself and also can add items to a
cart. The data that we are ingesting into the Platform is the test website traffic data that conforms to the Adobe
Analytics schema.

We need to run some analytical queries on this dataset.

Exploratory 1-Dimensional Queries

The goal of this exercise is to explore every column of the dataset individually so that we get a deep understanding of
the columns. Once we understand each column, we can then build 2-dimensional and even n-dimensional queries.

Let us first retrieve all the results:

SELECT * FROM luma_web_data;

You can see that there are complex nested objects. Instead of going into the XDM schemas, we can query the data in
place by using to_json.

Let us dig into the web JSON object (or XDM field group):

SELECT to_json(web) FROM luma_web_data;

Let us dig one level deeper into webPageDetails. We will use the dot notation to access any field in the hierarchy.

SELECT web.webPageDetails FROM luma_web_data;

We can apply to_json again:

SELECT to_json(web.webPageDetails) FROM luma_web_data;

pageViews is itself an object. Let us access what is inside it:

SELECT to_json(web.webPageDetails.pageViews) FROM luma_web_data;

You will get the following:

We can access the value by:

SELECT web.webPageDetails.pageViews.value FROM luma_web_data

And you will get:

Let us work on the marketing object:

SELECT to_json(marketing) FROM luma_web_data;

The results show information about campaigns:

If you execute the following code:


SELECT to_json(marketing), to_json(channel) FROM luma_web_data;

You will observe that there is duplication of data across these fields. marketing object truly has a campaign name
while the other fields are present in the channel object.

Let us extract the channel type that is in the type field of the channel object as it has values such as search, email, and
social.

The code for this will be:

SELECT channel._id AS tracking_code,
       regexp_extract(channel._type, '[^/]+$', 0) AS channel_type,
       channel.mediaType AS channel_category
FROM luma_web_data

The result will be:

Note the usage of the regular expression that extracts the last word in the _type field, which looks like
"_type": "https://ns.adobe.com/xdm/channel-types/XXX"

regexp_extract(channel._type, '[^/]+$', 0): This is the main part of the query where you use
the regexp_extract function to perform regular expression extraction.

channel._type: This specifies the JSON field "_type" inside the channel JSON object.

'[^/]+$': This is a regular expression pattern. Let’s break it down:

[^/]: This part matches any character except a forward slash (”/”).

+: This indicates that the previous pattern ([^/]) should occur one or more times consecutively.

$: This anchors the pattern to the end of the string.

0: This argument specifies the group index to return. In this case, 0 means that the entire match (the
matched string) will be returned.

Explore ProductListItems Array Object

Let us access the ProductListItems array:

SELECT to_json(productListItems) FROM luma_web_data;

Hint: A single page view for Add to Cart event will have multiple product items.

To access the first elements of this array, use the following:

SELECT productListItems[0] FROM luma_web_data;

Arrays offer themselves to even more interesting SQL queries. Arrays can be exploded i.e. each element of the array
can be put into a separate row of a new table and other columns/fields will be duplicated:

SELECT explode(productListItems) FROM luma_web_data;

Hint: You can also use the **unnest** function instead of **explode**.

Let us now explore the commerce object:

SELECT to_json(commerce) FROM luma_web_data;


The commerce object shows some commerce-related actions, such as checkouts, that the webPageDetails object does
not have.

Let us reformat this object so that we can extract the commerce event types such as productViews,
productListAdds and checkouts as strings. I want to do this because I want to use GROUP BY on these event
types later on. The fact that some of them are populated while some are not indicates that this is a nested structure and
we will have no choice but to look at the commerce object itself in the XDM schema.

First, let us extract these fields as strings:

SELECT (CASE
    WHEN commerce.checkouts.`value`==1 THEN 'checkouts'
    WHEN commerce.productViews.`value`==1 THEN 'productViews'
    WHEN commerce.productListAdds.`value`==1 THEN 'productListAdds'
END) AS commerce_event_type
FROM luma_web_data

The results are:

Note the syntax of commerce.checkouts.`value`==1. Here value is wrapped in backticks to avoid a
conflict with value being a RESERVED keyword. The same applies to commerce.`order`.* as well.

But our string-based approach has a serious flaw. If you check the field group commerce, you will see a lot of
commerce event types. There is no guarantee that we will only see the 3 event types that we identified above:

To extract an arbitrary field name of the structs present in the commerce object, we will use:

SELECT commerce_event_type[0] AS commerce_event_type
FROM (SELECT json_object_keys(to_json(commerce)) AS commerce_event_type FROM luma_web_data);

The result will be:

Note the following:

1. json_object_keys extracts the top-level keys of the JSON object present in commerce.

2. to_json converts the JSON object to a string.

3. commerce_event_type[0] extracts the first and only element of this array.

4. Note that different structs in the commerce object have different values. Page view type structs will have a value
equal to 1 while purchase type structs will have purchase totals. This extraction only works for extracting the
commerce event types but does not extrapolate to the metadata of those events.

Alternatively, we could have simplified this query by avoiding the outer SELECT query by simply doing the following
which will help us later:

SELECT json_object_keys(to_json(commerce))[0] AS commerce_event_type FROM luma_web_data
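As noted in point 4 above, json_object_keys gives you the event type but not the metric stored inside it. For the three page-view-style event types we have identified so far, one way to pull both out in a single pass is to coalesce their value fields (purchase-type structs such as order carry totals instead and are not covered here):

SELECT json_object_keys(to_json(commerce))[0] AS commerce_event_type,
       coalesce(commerce.checkouts.`value`,
                commerce.productViews.`value`,
                commerce.productListAdds.`value`) AS event_value
FROM luma_web_data;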

Explore endUserIDs Object

Let us also check the endUserIDs

SELECT to_json(endUserIDs) FROM luma_web_data;


We can extract the email addresses by using:

SELECT endUserIDs._experience.emailid.id FROM luma_web_data;

We can extract the mcids by using

SELECT endUserIDs._experience.mcid.id FROM luma_web_data;

The results are:

Create a Semi-Flat View of the Adobe Analytics Data

Let us take the queries that we built and put them all together to create a SQL query that creates a somewhat flat
structure of the data i.e. we will not expand ProductListItems.

CREATE TEMP TABLE Adobe_Analytics_View AS
SELECT _id AS event_id,
    `timestamp` AS `TimeStamp`,
    endUserIDs._experience.mcid.id AS mcid_id,
    endUserIDs._experience.emailid.id AS email_id,
    web.webPageDetails.`name` AS WebPageName,
    json_object_keys(to_json(commerce))[0] AS commerce_event_type,
    productListItems AS productListItems,
    marketing.campaignName AS campaignName,
    channel._id AS campaign_tracking_code,
    regexp_extract(channel._type, '[^/]+$', 0) AS channel_type,
    channel.mediaType AS channel_category
FROM luma_web_data;

SELECT * FROM Adobe_Analytics_View

Note the following:

1. We have assembled all the 1-dimensional SELECT queries into a view.

2. The view is semi-flat because ProductListItems is not flattened i.e. put into separate rows or columns.

3. We use a CREATE TEMP TABLE to store this view instead of materializing the view immediately because we
want TEMP tables to be cached in Data Distiller for fast exploration.

Tip: If you want fast exploration of data in the ad hoc query engine, just create a TEMP TABLE with the data that you
want to explore. Remember that these temp tables are wiped after the user session ends, as the cache is ephemeral.

Warning: If your connection is intermittent, DBVisualizer will disconnect from Data Distiller. In that case, it will
complain that the temp table does not exist because your session needs to be re-established. In situations where you
cannot maintain connectivity for long periods of time, you are better off using CREATE TABLE, which will
materialize the data onto the data lake.

If you decide to use CREATE TABLE:

CREATE TABLE Adobe_Analytics_View AS
SELECT _id AS event_id,
    `timestamp` AS `TimeStamp`,
    endUserIDs._experience.mcid.id AS mcid_id,
    endUserIDs._experience.emailid.id AS email_id,
    web.webPageDetails.`name` AS WebPageName,
    json_object_keys(to_json(commerce))[0] AS commerce_event_type,
    productListItems AS productListItems,
    marketing.campaignName AS campaignName,
    channel._id AS campaign_tracking_code,
    regexp_extract(channel._type, '[^/]+$', 0) AS channel_type,
    channel.mediaType AS channel_category
FROM luma_web_data;

SELECT * FROM Adobe_Analytics_View

Regardless of what you use, the results of the query look like:

With this view, you are now set to do any kind of analysis. The methodology shown above can be applied to any
schemas we get from the Adobe Apps.

Appendix: Adobe App Schemas to Explore

The skills that you have learned in this module should set you up for success with any complex dataset that you will
come across in the Adobe ecosystem.

If you are interested in Adobe Journey Optimizer, you should explore this module:

You should also explore Adobe Commerce with Adobe Experience Platform integration. Specifically, you need to be
aware of this:

2. There are some field groups that are unique to Adobe Commerce because of the nature of the storefront setup.

3. There can always be custom events that are unique to an industry or implementation.

You can bring Campaign V8 delivery logs into the Adobe Experience Platform.

You can bring in all of these datasets from Marketo:

You can also bring in custom activity data as well.

Appendix: Array Operations

These are the array functions supported in Data Distiller:

size() to determine the number of elements in a list (array)

The bracket [] notation to access specific elements in arrays

transform() to apply a transformation to all elements in an array

explode() to transform elements in a list into single rows

posexplode() to transform elements in a list into single rows along with a column for the index the element
had in the original list

array_contains() to determine if an array contains a specific element

array_distinct() to remove duplicates from an array


array_except() to subtract two arrays

array_intersect() to determine the intersection (overlapping elements) of two arrays

array_union() to determine the union of two arrays without duplicates

array_join() to concatenate the elements of an array using a delimiter

array_max() to get the largest element from an array

array_min() to get the smallest element from an array

array_position() to get the position of a specific element in an array, counting starting with 1

array_remove() to remove a specific element from an array

array_repeat() to repeat the elements of an array a specific number of times

array_sort() to sort an array

arrays_overlap() to check if two arrays have at least one common element
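As a quick illustration of a few of these functions on the productListItems array we explored earlier in this lesson:

-- size() for item counts and bracket access for a specific element of the array.
SELECT size(productListItems) AS num_items,
       productListItems[0].`name` AS first_item_name
FROM luma_web_data
WHERE size(productListItems) > 1;

-- posexplode() keeps the original array index (pos) alongside each exploded item (col).
SELECT posexplode(productListItems) FROM luma_web_data;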

to_json is able to get information about the various fields within the web object.

Digging into webPageDetails

to_json can be applied at any level of the hierarchy.

Accessing pageViews details

value=1 indicates that there was a single page view.

marketing object gives information about the campaign association.

Duplication of fields in marketing and channel objects.

Extraction of the channel fields.

ProductListItems captures product information about the items added to cart or even browsed.

Accessing the first element of an array.

EXPLODE on the ProductListItems object. Array elements are put in separate rows.

commerce object details.

Extracting field names by using CASE logic.

commerce object contains a lot of commerce event types.

Results of using the json_object_keys function to retrieve all the possible field names of the structs in the commerce
object.

endUserIDs contains email and mcid as the identities of the person.

Emails extracted from the endUserIDs object.

mcids extracted from the endUserIDs object.

Semi-flat view of Adobe Analytics data.

https://data-distiller.all-stuff-data.com/unit-2-data-distiller-data-exploration/explore-500-incremental-data-extraction-
with-data-distiller-cursors * * *

1. Unit 2: DATA DISTILLER DATA EXPLORATION

EXPLORE 500: Incremental Data Extraction with Data Distiller Cursors


Learn to Navigate Data Efficiently with Incremental Extraction Using Data Distiller Cursors

Last updated 4 months ago

Load the sample dataset using the following tutorial:

The dataset is very simple and has 5000 records.

id: Unique identifier for each entry.

name: Name of the individual or entity.

value: Numeric value associated with each entry.

We’ll be using the Data Distiller Query Pro Mode Editor, a web-based editor that lets you view the results of a query
with a limited set of row-count options: 50, 100, 150, 300, or 500. While this may seem limiting, it’s important to
remember that the editor operates within your browser, which has memory constraints for displaying results.

If you use a dedicated client installed locally on your machine, such as DBVisualizer, you can handle much larger
datasets—up to 100,000 rows or more. However, even with local clients, you’ll eventually hit application memory
limits.

This brings us to an interesting challenge: how can we efficiently paginate through the result set when the client editor
imposes a limit on how much data can be displayed at once?

The answer is Data Distiller Cursors.

What is a Data Distiller Cursor?

A cursor in Data Distiller is a database object used to retrieve, manipulate, and traverse through a set of rows returned
by a query one row at a time. Cursors are particularly useful when you need to process each row individually, allowing
for row-by-row processing that can be controlled programmatically.

1. Sequential Processing: Cursors allow you to sequentially access rows in a result set, which is helpful when you
need to handle each row individually.

2. Row-by-Row Operations: Unlike standard SQL, which typically processes whole result sets, a cursor can fetch
a limited number of rows at a time, letting you work with rows individually in a controlled manner.

3. Memory Efficiency: When working with very large datasets, fetching rows in smaller chunks (instead of loading
all at once) can help manage memory usage and improve performance.

Batch Processing: When you need to process rows in smaller batches, especially with large datasets.

Row-by-Row Operations: For complex operations that require checking or modifying each row one at a time.

Data Migration or Transformation: Cursors can be helpful when copying data from one table to another while
applying transformations.

Procedural Logic: Used in stored procedures or scripts where specific row-based logic or conditions need to be
applied iteratively.

How a Data Distiller Cursor Works

DECLARE: Defines the cursor data_cursor with a specific SQL query.

FETCH: Retrieves a specified number of rows (e.g., 5) from the cursor.

CLOSE: Releases the cursor when no longer needed.

This process is especially valuable when working with large datasets, as it helps control memory usage by processing
smaller chunks of data at a time.

Declare a Data Distiller Cursor

Before you start, you need to open the Data Distiller Query Pro Mode Editor by navigating to

AEP UI->Queries->Create Query

Make sure you set the Show Results option to 500, as we will use this limit to paginate through the results.

We will declare a cursor to select all rows from the sample_dataset table. This cursor will allow us to retrieve a
limited number of rows at a time.

-- Declare the cursor
DECLARE data_cursor CURSOR FOR
SELECT id, name, value FROM sample_dataset;

Once you’ve declared the cursor, open it to prepare it for row fetching.

Fetch Rows Using the Cursor

You can now fetch rows in batches of 500 rows. This is particularly useful if you’re working with large datasets and
want to process data in smaller chunks.

-- Fetch the first 500 rows
FETCH 500 FROM data_cursor;

You should see the first 500 rows; observe the **ID** column.

Now let us fetch the next 500 rows and observe the **ID** column again:

-- To fetch the next 500 rows, repeat this command
FETCH 500 FROM data_cursor;

After you’ve fetched all the rows you need, close the cursor to free up resources.
-- Close the cursor when done
CLOSE data_cursor;

The entire code, for your reference and for use as a template:

-- Declare the cursor
DECLARE data_cursor CURSOR FOR
SELECT id, name, value FROM sample_dataset;

-- Fetch the first 500 rows
FETCH 500 FROM data_cursor;

-- To fetch the next 500 rows, repeat this command
FETCH 500 FROM data_cursor;

-- Close the cursor when done
CLOSE data_cursor;
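One practical refinement: if you want every FETCH to return a deterministic, repeatable page, declare the cursor over an ordered query, since otherwise the row order is not guaranteed across runs.

-- Same pattern as above, with an explicit ORDER BY so each page is deterministic.
DECLARE ordered_cursor CURSOR FOR
SELECT id, name, value FROM sample_dataset ORDER BY id;

FETCH 500 FROM ordered_cursor;

CLOSE ordered_cursor;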

Appendix: Cursors in Python

The example in the tutorial below allows you to extract a chunk of rows at a time:

If you wanted to persist the results in that tutorial incrementally:

import psycopg2
import csv

# Establish a connection to the database
conn = psycopg2.connect('''sslmode=require host=ZZZZ port=80 dbname=prod:all
user=YYYYY@AdobeOrg password=XXXXX''')

# Create a cursor object for executing SQL commands
cursor = conn.cursor()

# Example query
query = "SELECT * FROM movie_data;"
cursor.execute(query)

# File to save the data
output_file = "movie_data.csv"

# Open the file in write mode
with open(output_file, mode='w', newline='', encoding='utf-8') as file:
    csv_writer = None  # Initialize the CSV writer variable
    chunk_size = 50

    while True:
        # Fetch the results in chunks
        chunk = cursor.fetchmany(chunk_size)

        # Break the while loop if there are no rows to be fetched
        if not chunk:
            break

        # Write the header row only once, when processing the first chunk
        if csv_writer is None:
            column_names = [desc[0] for desc in cursor.description]
            csv_writer = csv.writer(file)
            csv_writer.writerow(column_names)  # Write the header row

        # Write each row of the chunk to the CSV file
        csv_writer.writerows(chunk)

# Close the cursor and connection
cursor.close()
conn.close()

print(f"Data has been successfully written to {output_file}.")

Show results has limited options for the number of rows that you see in the final result.

Extract the next 500 rows.

https://data-distiller.all-stuff-data.com/unit-2-data-distiller-data-exploration/explore-400-exploring-offer-decisioning-
datasets-with-data-distiller * * *

1. Unit 2: DATA DISTILLER DATA EXPLORATION

EXPLORE 400: Exploring Offer Decisioning Datasets with Data Distiller


Unleashing Insights from Offer Decisioning Datasets with Data Distiller

Last updated 5 months ago

You need a basic understanding of how to write nested queries and working with nested data.

You should get familiar with navigating around with web data:

You should also familiarize yourself with AJO System Datasets:

Offer Decisioning Framework

The journey begins with activities, which are broad tasks or campaigns defining when and where offers will be shown.
Within each activity, placements define the specific locations (e.g., web banners or emails) where offers will be
delivered. The decisioning engine then uses eligibility rules, ranking algorithms, and profile constraints to
determine which offer—either a personalized offer or a fallback offer—is most appropriate for each user in a specific
placement. When a decision is made, it generates a decision event, which captures the result of that interaction,
including the offer proposition and user engagement with the offer. All these components work together to ensure that
users receive the most relevant and timely offers during their journey.

At the core of Adobe Journey Optimizer’s offer delivery system is the decision-making process. Decisions are the
rules and criteria that determine which offers are presented to a user. Every decision is influenced by a variety of
factors, including profile constraints, contextual data, and business rules. Decisions can be thought of as the
“brains” behind which offer gets presented at any point in the customer journey. They involve multiple steps:

Contextual data is real-time information about the user’s current environment, such as time, location, device,
and session activity. It helps tailor offers based on what’s happening at the moment. For example, users near a
store might receive location-based promotions, or users on a mobile device could see mobile-optimized offers.
Contextual data ensures offers are timely and suited to the user’s immediate situation.
Eligibility: Decides whether a user qualifies for certain offers based on their profile.

Ranking: Determines the priority and relevance of offers using scoring and/or rules.

Constraints: Factors such as time, placement, and profile attributes that limit when and how offers can be shown.

Profile constraints are rules based on a user’s demographics, behavior, preferences, and audience segments
that determine offer eligibility. These include factors like age, location, past purchases, and membership in
loyalty programs. For example, a luxury car promotion might only be shown to high-income users or
frequent shoppers. By using profile constraints, brands ensure that offers are highly relevant to each
individual.

Decisions drive the selection process for offers, taking into account activities and placements to determine the best
offer for a user in a given context.

An offer is the actual content or proposition presented to users. Offers could be discounts, product recommendations,
promotions, or other types of personalized content that a brand wants to deliver. Offers are stored in the Offer Library
and can be dynamically selected based on the decision criteria. Offers contain:

Content: The actual message or media delivered to users (e.g., banners, emails).

Metadata: Details like offer name, description, and associated rules or tags.

There are different types of offers based on how they are chosen and delivered, which brings us to personalized offers
and fallback offers.

Personalized offers are a special type of offer tailored specifically to individual users. These offers are selected based
on detailed user profiles, contextual data, and behavior. The Personalized Offers Dataset provides data about the
content and customization of these offers, including the rules that will be applied to personalize the offer to a specific
user.

A fallback offer is presented when no personalized offer meets the eligibility or decisioning criteria. In cases where
primary offers fail (due to constraints like timing, audience mismatch, or other criteria), fallback offers ensure that
some content is still delivered to the user. The Fallback Offers Dataset captures data about the fallback logic and the
conditions under which these offers are shown. While fallback offers are secondary to personalized offers, they help
maintain engagement when personalization fails.

Placements are the designated spaces or contexts where offers are shown to users. A placement could be a web page
banner, an email slot, an in-app message, or any other digital location where an offer might appear. Placements are
critical in determining where and how an offer is delivered. Each placement has:

Channel information: Where the content will be displayed (e.g., web, email, mobile).

Media type constraints: Ensures the content format (e.g., image, text, video) matches the requirements of the
placement.

Description and names: Describes the function and role of the placement (e.g., “homepage banner”).

The Placements Dataset stores data about these locations, ensuring that the right offer is rendered in the right place at
the right time.

Activities are the overarching campaigns or tasks that determine when and how offers are presented within a customer
journey. An activity could be an email campaign, an ad shown during a promotion, or a banner placed on a website.
The activity serves as the container for offers and is tied to specific placements and decisions.

Activities can have multiple properties:


Start and End Time: Determines the timeframe during which the activity is active.

Ranking and Eligibility: Tied to the decisioning rules that determine which offers are shown during the activity.

Fallback and Constraints: Includes rules for fallback offers if no primary offers are eligible.

The Activities Dataset captures much of the logic behind activities, including ranking and placement constraints.

A decision event is a time-stamped interaction that records what happened when a decision was made. It is
essentially the event log that shows which offers were presented, accepted, or rejected by users. The ODE (Offer
Decision Events) Dataset records these events, providing detailed information about each decision that occurred
during a user’s interaction. Each decision event captures:

Timestamp: When the event occurred.

Proposition details: The offer that was proposed.

Interaction outcome: Whether the user accepted, clicked, or ignored the offer.

Placement and activity context: Where the offer was placed and within which activity the decision was made.

Decision events allow marketers to track the effectiveness of their offers and adjust their decisioning strategies based
on user engagement and outcomes.

High Level Overview of Offer Decisions

Before diving into the datasets, it’s crucial to first understand the specifics of the business process—specifically, the
steps a user takes to configure the system that generates the datasets. This understanding lays the foundation for
meaningful analysis, and without it, grasping the context behind the data becomes much more challenging.

1. Navigate to Decision Management->Offers->Offers->Create Offer

2. Offers have a time-to-live and include attributes referred to as Characteristics within the datasets.

3. You can apply constraints at the offer level to control who can view it and limit the frequency of how many times
the offer is shown within a specific time period.

4. You can add a decision rule as well:

5. The representation is where you define the placement, assets, and the channel through which the offer will be
displayed.

6. Offers have to be part of an offers collection on which decision rules will be applied. Navigate to Decision
Management->Offers->Collections->Create Collection

7. You can add offers to this collection

8. Navigate to Decision Management->Offers->Collections->Create Decision. Decisions have a time-to-live.

9. You will need to add a decision scope, which is essentially a grouped set of rules, and specify a placement.

10. You will need to add an offer collection.

11. With multiple offers available, you can select the audience, algorithm, and other criteria to determine the winning
offer. Some offers will be eliminated at this stage if they do not meet the specified criteria.
12. Every decision rule requires a fallback offer:

13. The decision rule on the offer collection can now be activated.

Decisions Object Repository - Activities Dataset

The Decisions Object Repository - Activities Dataset contains additional information that is more focused on the
decision-making logic and criteria behind how offer selection is done.

Criteria and Constraints: The Activities Dataset provides detailed information about the criteria used to make
decisions, such as the constraints that are applied based on profile information, context, and eligibility rules.
Fields like **_experience.decisioning.criteria**,
**_experience.decisioning.criteria.profileConstraints**, and
**_experience.decisioning.criteria.placements** describe the rules, constraints, and filters
applied during decision-making.

Ranking and Prioritization: The Activities Dataset contains detailed fields about how offers are ranked and
prioritized, including scoring functions and ranking strategies. Fields like
**_experience.decisioning.criteria.ranking**,
**_experience.decisioning.criteria.ranking.order**, and
**_experience.decisioning.criteria.ranking.priority** describe how offers are ranked
based on scores or priorities.

Fallback Option Logic: The Activities Dataset contains fields related to fallback options and detailed logic
about how and why fallback options are selected if regular options do not qualify. Fields like
**_experience.decisioning.criteria.fallback** explain the conditions under which fallback
options are selected, including the logic behind their use.

Process: The Activities Dataset provides additional metadata related to the decision-making process, such as
workflow identifiers (**_experience.decisioning.batchID**) and revision tracking (ETags). It includes
fields like **_experience.decisioning.batchID**, **_repo.etag**, and
**_experience.decisioning.criteria.propositionContentKey**, which help track the
versioning and batch processing behind the decision events.

Profile and Audience Constraints: The Activities Dataset includes detailed profile constraints and how
segments or rules are applied to profiles to determine the eligibility of an offer. Fields like
**_experience.decisioning.criteria.profileConstraints**,
**_experience.decisioning.criteria.profileConstraintType**, and
**_experience.decisioning.criteria.segmentIdentities** are used to track the audiences
and segments that influence decisions.

Ranking Details: The Decisions Object Repository - Activities Dataset has specific fields that explain how the
best option is determined, including ranking orders and scoring functions. It includes fields like
**_experience.decisioning.criteria.ranking.orderEvaluationType**, which specify
how options are evaluated and ranked.

Understand the Structure of the Activities Dataset

SELECT
table_name,
column_name,
data_type
FROM
information_schema.columns
where table_name = 'decision_object_repository_activities'

The result is:

Explore the Structure of the Activities Dataset

SELECT to_json(p._experience.decisioning)
FROM decision_object_repository_activities p
LIMIT 10;

Retrieve Records from the Activities Dataset

SELECT to_json(_experience) FROM decision_object_repository_activities;

The result is:

Retrieve Decisioning Criteria for Offers

This query will show you the decisioning criteria (the rules or algorithms) applied for each activity. This might
include complex decisioning logic, filters, and algorithms.

SELECT
p._id AS activityId,
p._experience.decisioning.name AS activityName,
p._experience.decisioning.criteria AS decisioningCriteria
FROM
decision_object_repository_activities p
WHERE
p._experience.decisioning.criteria IS NOT NULL;

The results will be:

To understand this result, let us navigate to Offers->Decisions->BP Luma Offers in the AEP UI.

Let us correlate the result of the query for BP Luma Offers (the first line of the result) with the screenshot above:

Activity ID Match: Both the query and the screenshot reference the same activity ID (xcore:offer-activity:15fec9f63011bd8), meaning they are referring to the same decision-making process.

Placements:

The query returns specific placements where offers are shown, such as "xcore:offer-placement:15fdf378e188bb6e", which likely corresponds to one of the placements like Luma - Home Banner.

Multiple placements are involved in the same activity, just as in the screenshot where offers are placed in
banners, cards, and emails.

This would require us to pull metadata about the placements from the Placements Dataset.

Decisioning Criteria and Filters:

The query result shows the decision filters applied (e.g., "xcore:offer-filter:15fdf474893c3ef0"), which control which offers are shown based on the user’s profile, context, and placement.

The eligibility criteria shown in the query match the audience eligibility shown in the above screenshot (e.g., "allSegments" in the query vs. “1 audience” in the screenshot).

Ranking Methods:

Note that the query result doesn’t explicitly show the ranking method, but we know from the screenshot that
the ranking method for certain placements is based on a personalized model (e.g., “Luma Personalized
Model” for the Home Banner). In other placements, it is based on offer priority.

Fallback Offer:

The fallback offer shown in the query (xcore:fallback-offer:15fec32dffc546a0) matches the fallback offer in the screenshot (“BP Luma - Fallback”). This confirms that the system will show the fallback offer if none of the primary offers qualify.

Decisions Object Repository - Personalized Offers Dataset

The Personalized Offers Dataset represents personalized offers that are created and prepared to be served to users
based on various decision-making logic. This dataset includes extensive metadata on offer content, audience
segmentation, eligibility rules, and decision criteria, allowing you to tailor offers based on user profiles, behaviors, and
contextual data. It also captures the ranking, scoring, and prioritization mechanisms used to determine which
personalized offers are presented to users in different scenarios.

Key Features in Personalized Offers Dataset

Profile Constraints: The Personalized Offers Dataset provides detailed rules and constraints regarding which
offers are eligible for certain user profiles, ensuring that offers are customized to meet individual needs. Fields
like **_experience.decisioning.profileConstraints**,
**_experience.decisioning.profileConstraintType**, and
**_experience.decisioning.segmentIdentities** detail the rules applied based on user profiles
and segments.

Content Components: The Personalized Offers Dataset captures granular details about the content associated
with personalized offers, including various language variants, formats, and delivery methods. Fields like
**_experience.decisioning.contents**,
**_experience.decisioning.contents.components.language**, and
**_experience.decisioning.contents.components.format** provide detailed metadata
about the structure of personalized offer content.

Ranking and Prioritization: The Personalized Offers Dataset contains fields related to ranking strategies,
scoring functions, and order evaluation, allowing for complex decision-making regarding which offers are
prioritized for users. Fields like **_experience.decisioning.ranking**,
**_experience.decisioning.orderEvaluationType**, and
**_experience.decisioning.rankingStrategy** provide detailed ranking logic.

Lifecycle Management: The Personalized Offers Dataset tracks the lifecycle status of each offer, allowing for
better workflow management by indicating whether an offer is in draft, approved, live, or archived state. Fields
like lifecycleStatus track the status of offers, ensuring proper management of their visibility and usage in
campaigns.

Understand Structure of Personalized Offers Dataset

SELECT
table_name,
column_name,
data_type
FROM
information_schema.columns
where table_name = 'decision_object_repository_personalized_offers'

You should get:

Retrieve Records from Personalized Offers Dataset

SELECT to_json(_experience) FROM decision_object_repository_personalized_offers

The results you will get will look like this in JSON:

{
"decisioning": {
"ranking": {
"priority": 0
},
"name": "BP Luma - Loyalty Membership",
"contents": [
{
"placement": "xcore:offer-placement:15fdf228c3fec9eb",
"components": [
{
"_dc": {
"format": "image/png"
},
"_type": "https://ns.adobe.com/experience/offer-management/content-
component-imagelink",
"deliveryURL": "https://dpqy7l2qgw0r3.cloudfront.net/0aa64df0-e3ce-
11e9-ace4-cb8a25ba725b/urn:aaid:aem:8b68c634-151e-4059-a626-
a95fdc4e1833/oak:1.0::ci:b7e14744a2dde9486e0a9a45cb9a9e28/93b54966-7c78-3b23-
8afb-649f0e8acff8",
"linkURL":
"https://luma.enablementadobe.com/content/luma/us/en/community/members.html",
"_repo": {
"name": "Loyalty Banner.png",
"resolveURL": "https://author-p28416-
e87881.adobeaemcloud.com/content/dam/BP/Luma/Loyalty%20Banner.png/jcr%3Acontent/
cacheinfo=653eb618fef5c459aed4b796501437a5",
"id": "urn:aaid:aem:8b68c634-151e-4059-a626-a95fdc4e1833"
}
}
]
},
{
"placement": "xcore:offer-placement:15fdf378e188bb6e",
"components": [
{
"_dc": {
"format": "image/png"
},
"_type": "https://ns.adobe.com/experience/offer-management/content-
component-imagelink",
"deliveryURL": "https://dpqy7l2qgw0r3.cloudfront.net/0aa64df0-e3ce-
11e9-ace4-cb8a25ba725b/urn:aaid:aem:d4e92f28-38b5-4e14-a7ab-
9f6bf6cd7dc1/oak:1.0::ci:3cf0cda086124eae041430323016d94b/de10b283-7648-3f0e-
a9c9-bfbe59b01b30",
"linkURL":
"https://luma.enablementadobe.com/content/luma/us/en/community/members.html",
"_repo": {
"name": "Loyalty Card.png",
"resolveURL": "https://author-p28416-
e87881.adobeaemcloud.com/content/dam/BP/Luma/Loyalty%20Card.png/jcr%3Acontent/re
cacheinfo=a3bd18f557beb74edd958adfb0a1cc17",
"id": "urn:aaid:aem:d4e92f28-38b5-4e14-a7ab-9f6bf6cd7dc1"
}
}
]
},
{
"placement": "xcore:offer-placement:15fdf24e2efadcdf",
"components": [
{
"_dc": {
"format": "image/png"
},
"_type": "https://ns.adobe.com/experience/offer-management/content-
component-imagelink",
"deliveryURL": "https://dpqy7l2qgw0r3.cloudfront.net/0aa64df0-e3ce-
11e9-ace4-cb8a25ba725b/urn:aaid:aem:8b68c634-151e-4059-a626-
a95fdc4e1833/oak:1.0::ci:b7e14744a2dde9486e0a9a45cb9a9e28/93b54966-7c78-3b23-
8afb-649f0e8acff8",
"linkURL":
"https://luma.enablementadobe.com/content/luma/us/en/community/members.html",
"_repo": {
"name": "Loyalty Banner.png",
"resolveURL": "https://author-p28416-
e87881.adobeaemcloud.com/content/dam/BP/Luma/Loyalty%20Banner.png/jcr%3Acontent/
cacheinfo=653eb618fef5c459aed4b796501437a5",
"id": "urn:aaid:aem:8b68c634-151e-4059-a626-a95fdc4e1833"
}
}
]
}
],
"calendarConstraints": {
"startDate": "2022-10-25T06:00:00.000Z",
"endDate": "2050-05-31T06:00:00.000Z"
},
"profileConstraints": {
"profileConstraintType": "none"
},
"lifecycleStatus": "approved",
"tags": [
"xcore:tag:1771ac5a22abb9f7",
"xcore:tag:15fdf3abddd39b68"
]
}
}
1. **decisioning**: This is the top-level object that encapsulates all decisioning details related to this offer.

**ranking**:

**priority**: The ranking priority of this offer. A value of 0 typically indicates the highest
priority.

**name**: The name of the offer, here labeled as “BP Luma - Loyalty Membership”, which may
indicate that this is an offer targeted at customers in a loyalty membership program.

2. **contents**: This array holds multiple offer placements. Each object within the contents array
represents one placement of the offer in a specific location or context (e.g., on a website, in an app).

**placement**: This is a unique identifier for where the offer will appear (e.g., a banner on a webpage
or in-app placement).

**components**:

Each component describes the content used in that placement (e.g., an image, text, or link).

**_dc.format**: The format of the content (e.g., “image/png” for PNG image).

**_type**: The type of content component, here it’s an image link, pointing to an external resource.

**deliveryURL**: The URL where the content (image) is hosted.

**linkURL**: The URL the user is directed to when they interact with the content (e.g., a banner
leading to a loyalty program page).

**_repo**: Contains metadata about the image asset.

**name**: The name of the asset (e.g., “Loyalty Banner.png”).

**resolveURL**: A direct link to a thumbnail of the image.

**id**: A unique identifier for the asset.

3. **calendarConstraints**: These fields define when the offer is valid.

**startDate**: The start date of the offer (in ISO 8601 format), meaning this offer becomes active on
October 25, 2022.

**endDate**: The end date of the offer, meaning it will expire on May 31, 2050.

4. **profileConstraints**: These fields define which user profiles are eligible for the offer.

**profileConstraintType**: The type of profile constraint applied. In this case, "none" means
that no specific profile constraints are applied, making the offer available to all users.

5. **lifecycleStatus**: The current status of the offer.

**approved**: This indicates that the offer has been approved and is ready to be displayed to users.

6. **tags**: These are tags associated with the offer, typically used for categorization, filtering, or reporting
purposes.
Examples of tag identifiers: "xcore:tag:1771ac5a22abb9f7",
"xcore:tag:15fdf3abddd39b68".

Flatten the Personalized Offers Table

SELECT
p._id AS offerId,
p._repo.etag AS repo_etag,
p._experience.decisioning.ranking.priority AS priority,
p._experience.decisioning.name AS offerName,
p._experience.decisioning.contents[0].placement AS placement,
p._experience.decisioning.contents[0].components[0]._dc.format AS contentFormat,
p._experience.decisioning.contents[0].components[0]._dc.language[0] AS contentLanguage,
p._experience.decisioning.contents[0].components[0].content AS contentData,
p._experience.decisioning.calendarConstraints.startDate AS startDate,
p._experience.decisioning.calendarConstraints.endDate AS endDate,
p._experience.decisioning.profileConstraints.profileConstraintType AS profileConstraintType,
p._experience.decisioning.profileConstraints.segmentIdentities[0]._id AS segmentId,
p._experience.decisioning.characteristics['Offer ID'] AS offerIdCharacteristic,
p._experience.decisioning.characteristics.domain AS offerDomain,
p._experience.decisioning.characteristics.type AS offerType,
p._experience.decisioning.characteristics.saleType AS saleType,
p._experience.decisioning.lifecycleStatus AS lifecycleStatus
FROM
decision_object_repository_personalized_offers p;

The field **p._experience.decisioning.characteristics** refers to a sub-object within the decisioning structure of an offer, which stores specific characteristics or attributes related to that offer. In Adobe Journey Optimizer, characteristics can be thought of as metadata or additional properties that define key details or behavior for an offer. These characteristics are typically used to differentiate offers, apply business rules, or drive personalization and optimization in decision-making.

The results are the following:
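Because characteristics behaves like a map of offer attributes, you can also filter directly on a characteristic value. A minimal sketch follows; the promotionType value 'discount' is a hypothetical example, so substitute a value that actually exists in your data:

SELECT
p._id AS offerId,
p._experience.decisioning.characteristics['Offer ID'] AS c_offerId,
p._experience.decisioning.characteristics.promotionType AS c_promotionType
FROM
decision_object_repository_personalized_offers p
WHERE
-- 'discount' is a hypothetical value used for illustration
p._experience.decisioning.characteristics.promotionType = 'discount';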

Retrieve Latest Version of Each Offer

SELECT
p._id AS offerId,
p._repo.etag AS repo_etag,
p._experience.decisioning.ranking.priority AS priority,
p._experience.decisioning.characteristics.customerLoyalty AS c_customerLoyalty,
p._experience.decisioning.characteristics['Offer ID'] AS c_offerId,
p._experience.decisioning.characteristics.productCategory AS c_productCategory,
p._experience.decisioning.characteristics.discountAmount AS c_discountAmount,
p._experience.decisioning.characteristics.expiryDate AS c_expiryDate,
p._experience.decisioning.characteristics.promotionType AS c_promotionType,
EXPLODE(p._experience.decisioning.contents.placement) AS placementId
FROM
decision_object_repository_personalized_offers p
JOIN
(SELECT
m._id AS offerId,
MAX(m._repo.etag) AS latest_repo_etag
FROM
decision_object_repository_personalized_offers m
GROUP BY
m._id
) mx
ON p._id = mx.offerId
AND p._repo.etag = mx.latest_repo_etag;

The results will be:

Observe the following:

**m._id AS offerId**: Retrieves each offer’s unique ID.

**MAX(m._repo.etag) AS latest_repo_etag**: Finds the highest (latest) _repo.etag (which represents the version) for each offer.

**GROUP BY m._id**: Ensures that the subquery groups the offers by their _id, so that it returns the latest version for each offer.
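An alternative way to express the same logic, if you prefer window functions, is a ROW_NUMBER() sketch that keeps only the latest version of each offer. This is shown as an option rather than the approach used above:

SELECT
offerId,
repo_etag,
priority
FROM (
SELECT
p._id AS offerId,
p._repo.etag AS repo_etag,
p._experience.decisioning.ranking.priority AS priority,
-- rank versions of the same offer by etag, newest first
ROW_NUMBER() OVER (PARTITION BY p._id ORDER BY p._repo.etag DESC) AS version_rank
FROM
decision_object_repository_personalized_offers p
) versions
WHERE
version_rank = 1;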

Retrieve Personalized Offers Greater than Specific Priority

SELECT
p._id AS offerId,
p._repo.etag AS repo_etag,
p._experience.decisioning.ranking.priority AS priority,
p._experience.decisioning.characteristics['Offer ID'] AS c_offerId
FROM
decision_object_repository_personalized_offers p
WHERE
p._experience.decisioning.ranking.priority >1;

Filter Personalized Offers by Date Range

SELECT
p._id AS offerId,
p._repo.etag AS repo_etag,
p._experience.decisioning.calendarConstraints.startDate AS startDate,
p._experience.decisioning.calendarConstraints.endDate AS endDate,
p._experience.decisioning.characteristics['Offer ID'] AS c_offerId
FROM
decision_object_repository_personalized_offers p
WHERE
p._experience.decisioning.calendarConstraints.startDate <= CURRENT_DATE
AND p._experience.decisioning.calendarConstraints.endDate >= CURRENT_DATE;

Group Personalized Offers by Product Category

SELECT
p._experience.decisioning.characteristics.productCategory AS productCategory,
COUNT(p._id) AS offerCount
FROM
decision_object_repository_personalized_offers p
GROUP BY
p._experience.decisioning.characteristics.productCategory;

The results will look like the following:

Retrieve Personalized Offers by Placement

SELECT
p._id AS offerId,
p._experience.decisioning.characteristics['Offer ID'] AS c_offerId,
EXPLODE(p._experience.decisioning.contents.placement) AS placementId
FROM
decision_object_repository_personalized_offers p
WHERE
ARRAY_CONTAINS(p._experience.decisioning.contents.placement, 'xcore:offer-placement:15fdf228c3fec9eb');

Sort Personalized Offers by Priority

SELECT
p._id AS offerId,
p._experience.decisioning.ranking.priority AS priority,
p._experience.decisioning.characteristics['Offer ID'] AS c_offerId
FROM
decision_object_repository_personalized_offers p
ORDER BY
p._experience.decisioning.ranking.priority ASC;

Retrieve Profile Constraints and Segment Identities

SELECT
p._id AS offerId,
p._experience.decisioning.profileConstraints.profileConstraintType AS profileConstraintType,
EXPLODE(p._experience.decisioning.profileConstraints.segmentIdentities) AS segmentIdentity
FROM
decision_object_repository_personalized_offers p;

The **OfferID** and **PlacementID** combination can act like a composite key, joining this metadata with the data in the Offer Decision Events table where the actual offer was delivered.
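Here is a minimal sketch of such a join. The decision events dataset name is a placeholder, and the assumption that each proposition item exposes the offer ID as items.id (alongside items.scopeDetails.placement.id) should be verified against your Offer Decisioning Events schema:

WITH delivered AS (
SELECT
item.id AS offerId, -- assumption: the proposition item id carries the offer id
item.scopeDetails.placement.id AS placementId
FROM ode_decisionevents_your_environment_key -- placeholder: use your ode_decisionevents_{...} dataset name
LATERAL VIEW EXPLODE(_experience.decisioning.propositions) props AS prop
LATERAL VIEW EXPLODE(prop.items) prop_items AS item
)
SELECT
d.offerId,
d.placementId,
p._experience.decisioning.name AS offerName,
p._experience.decisioning.ranking.priority AS priority
FROM delivered d
JOIN decision_object_repository_personalized_offers p
ON d.offerId = p._id;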

Decisions Object Repository - Fallback Offers

The Fallback Offers Dataset is very similar to the Personalized Offers Dataset except that it provides a detailed
record of fallback offers that should be presented when primary decisioning options do not qualify. This dataset captures
rich metadata about the fallback offer content, including components, formats, delivery URLs, and asset repository
details, ensuring that the offer is accurately rendered across various digital experiences. Additionally, it tracks the
lifecycle status of offers, allowing for effective management of the offer’s state, whether it’s in draft, live, or archived
mode. Each fallback offer is further enriched with characteristics such as tags for categorization, and placement details
that specify where the offer is deployed.

SELECT to_json(_experience) FROM decision_object_repository_fallback_offers

You can take the same queries that we used on the Personalized Offers Dataset and apply them here, as the fields are identical.
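For example, a quick sketch that lists fallback offers along with their lifecycle status, using the same field paths as the Personalized Offers Dataset:

SELECT
p._id AS offerId,
p._experience.decisioning.name AS offerName,
p._experience.decisioning.lifecycleStatus AS lifecycleStatus
FROM
decision_object_repository_fallback_offers p;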

The **OfferID** and **PlacementID** combination can likewise act as a composite key, joining this metadata with the data in the Offer Decision Events table where the actual offer was delivered.

Decisions Object Repository - Placements Dataset

The Placements Dataset tracks the various contexts or “placements” where offers are to be delivered to users. A
placement is a defined location, such as a banner on a web page, an email slot, or an in-app area, where personalized
offers or dynamic content can be presented. This dataset captures metadata about each placement, including the
associated content types, media formats, channels, and descriptions that help manage and optimize where and how
offers appear to the target audience.

Placement Descriptions and Names: The Placements Dataset contains detailed metadata describing each
placement’s function and purpose, such as a web banner or email slot, and provides a human-readable name for
each placement. Fields like **name** and **description** provide contextual information on where the
content will be rendered.

Content Channels and MIME Types: The Placements Dataset tracks the specific channels (e.g., web, mobile,
email) where the placement occurs, as well as the supported MIME media types (e.g., image formats) for
content rendered in each placement. Fields like **channelID** and **contentTypes.MIME Media Type** capture the constraints on media formats and channels for each placement.

Content Representation: The Placements Dataset defines the types of content components allowed in each
placement. This helps ensure that the right type of content (e.g., image, text, video) is displayed correctly in the
right context. Fields like **componentType** specify the content component types, ensuring compatibility
between the content and the placement.

Placement ETags: The Placements Dataset tracks the revision history of each placement, providing an ETag
that helps manage and track changes to the placement over time. Fields like **etag** capture revision
metadata, helping maintain version control of placements.

Understand Structure of the Placements Dataset

Execute the following:

SELECT
table_name,
column_name,
data_type
FROM
information_schema.columns
where table_name = 'decision_object_repository_placements';

Explore the Structure of the Placements Dataset

Execute the following:

SELECT to_json(_experience)
FROM decision_object_repository_placements;

The result will be:


You can group the placements by their channelID to see how many placements exist for each channel.

SELECT _experience.decisioning.channelID, COUNT(*) AS total_placements
FROM decision_object_repository_placements
GROUP BY _experience.decisioning.channelID;

The result will be:

Count of Placements by Component Type

To get an overview of how many placements there are for each componentType, you can run this query

SELECT _experience.decisioning.componentType, COUNT(*) AS total_placements
FROM decision_object_repository_placements
GROUP BY _experience.decisioning.componentType;

The result helps in identifying the distribution of placements across different content types (e.g., HTML, image, text,
JSON).
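You can also combine the two dimensions to see which component types are used on each channel. A minimal sketch:

SELECT
_experience.decisioning.channelID AS channelID,
_experience.decisioning.componentType AS componentType,
COUNT(*) AS total_placements
FROM decision_object_repository_placements
GROUP BY
_experience.decisioning.channelID,
_experience.decisioning.componentType;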


https://data-distiller.all-stuff-data.com/unit-2-data-distiller-data-exploration/explore-300-exploring-adobe-journey-
optimizer-system-datasets-with-data-distiller * * *

1. Unit 2: DATA DISTILLER DATA EXPLORATION

EXPLORE 300: Exploring Adobe Journey Optimizer System Datasets


with Data Distiller
Unleashing Insights from Adobe Journey Optimizer Datasets with Data Distiller

Last updated 5 months ago

You need a basic understanding of how to write nested queries and how to work with nested data.

You should also get familiar with navigating web data:

To generate a holistic view of how different datasets contribute to serving an experience via Adobe Journey
Optimizer (AJO), I will now walk through each dataset in the correct order of importance and process flow. These
datasets include:

1. AJO Entity Record Schema Dataset with the dataset name: **ajo_entity_dataset**

2. Journey Step Events Dataset with the dataset name:**journey_step_events**

3. AJO Message Feedback Events Dataset with the dataset name: **ajo_message_feedback_event_dataset**

4. BCC Feedback Events Dataset with the dataset name: **ajo_bcc_feedback_event_dataset**

5. AJO Email & Push Tracking Datasets with the dataset names:
**ajo_email_tracking_experience_event_dataset,
ajo_push_tracking_experience_event_dataset**

6. Offer Decisioning Events Dataset with the dataset name: **ode_decisionevents_{key specific to your environment}**

The chapter covers four additional Offer Decisioning datasets that give you deeper information about the offers and the decisioning logic.

How the Datasets Work Together in Adobe Journey Optimizer (AJO)

In Adobe Journey Optimizer, each dataset serves a specific role in orchestrating, delivering, and optimizing customer
experiences. When combined, these datasets provide a comprehensive understanding of how customer journeys are
executed, how messages are delivered and engaged with, and how offers are decided and optimized.

Here’s how each dataset is related, presented in the correct order of importance and process flow:

1. AJO Entity Record Schema Dataset: The Core Foundation

Purpose: The AJO Entity Record Schema Dataset is the central dataset that logs and tracks the metadata for all
journeys. It captures crucial information about the campaign, messages, journey actions, and message
triggers. It forms the basis for connecting all other datasets in the system.

Role in the Process:

Journey Orchestration: This dataset logs the entire structure of the journey, including message triggers,
campaign actions, journey steps, and decisions.

It includes identifiers like Message IDs, Campaign IDs, and Journey Action IDs, which link to the
Message Feedback, Tracking, and ODE datasets.

Without this dataset, none of the other datasets would have the necessary context to operate. It establishes
the backbone of the journey and ensures that all steps are executed as per the designed journey.

2. Journey Step Events Dataset: Tracking Journey Progression

Purpose: The Journey Step Events Dataset provides detailed insights into each step within the journey. It logs
step-level events, including step completions, errors, timeouts, and transitions. This dataset ensures visibility
into how users progress through the journey and helps diagnose any issues.

Role in the Process:

Step-Level Monitoring: This dataset records each step a user takes, whether that step is completed
successfully, if there are errors, or if a journey action times out.

Action Execution: It tracks the execution of actions (such as sending an email or showing an offer) and
logs the results of those actions.

Error Handling: Any errors encountered during journey execution are logged, helping you resolve issues at
specific steps.

Relation to Other Datasets: The Journey Step Events Dataset links to the AJO Entity Record Schema
and the ODE Dataset to ensure that each decision or action triggered within the journey is properly tracked
and logged.

3. AJO Message Feedback Events Dataset: Delivery Tracking

Purpose: The Message Feedback Events Dataset focuses on delivery feedback for emails, SMS, and push
notifications. It logs the delivery status, including whether the message was delivered, bounced, or required
retries.
Role in the Process:

Delivery Status Monitoring: After a message is triggered by a journey step (as logged in the Journey Step
Events Dataset), the Message Feedback Events Dataset tracks whether the message was delivered
successfully or encountered a failure.

Bounce & Failure Tracking: It logs details such as bounce reasons, invalid emails, or retries, providing
insight into delivery issues and helping you troubleshoot any problems with sending.

Relation to Other Datasets: The Message Feedback Dataset ties back to the AJO Entity Record
Schema via the Message ID, ensuring that the status of every message triggered by the journey is
accounted for.

4. BCC Feedback Events Dataset: Tracking Secondary Recipients

Purpose: The BCC Feedback Events Dataset tracks the delivery status of emails sent to BCC (Blind Carbon
Copy) or CC recipients. This dataset is important for ensuring compliance and tracking delivery to these
secondary recipients.

Role in the Process:

Secondary Delivery Monitoring: For messages sent to BCC or CC recipients (often for compliance or
archiving purposes), this dataset logs the delivery status and captures whether these secondary emails were
successfully delivered or excluded.

Exclusion Handling: It tracks exclusions due to compliance rules or typology filters and provides insight
into why certain emails were excluded.

Relation to Other Datasets: Like the Message Feedback Events Dataset, it ties back to the AJO Entity
Record Schema to track secondary recipients, ensuring full coverage of all recipients in the system.

5. AJO Email & Push Tracking Datasets: User Engagement

Purpose: The Tracking Datasets for email and push notifications log user engagement with delivered
messages, including metrics such as opens, clicks, and unsubscribes. This dataset helps measure the
effectiveness of the messages after they are successfully delivered.

Role in the Process:

Engagement Monitoring: Once a message is delivered (tracked via the Message Feedback Dataset), the
Tracking Datasets log how users interact with that message—whether they open it, click on a link, or
unsubscribe.

Performance Reporting: These datasets provide insights into how well messages perform in terms of user
engagement and can be used to optimize future campaigns based on click-through rates and engagement
metrics.

Relation to Other Datasets: The Tracking Datasets link back to the Message Feedback Dataset and the
AJO Entity Record Schema via the Message ID, ensuring that you have a full picture of the message’s
journey from delivery to engagement.

6. Offer Decisioning Events Dataset: Optimizing Decision-Making

Purpose: The Offer Decisioning Events Dataset tracks decision points within the journey where offers are
presented to users. It logs which offers were shown, how users interacted with them (e.g., clicks or
conversions), and the decisions made during the journey based on rules, algorithms, or fallback options.

Role in the Process:

Decision Tracking: When a decision point in the journey is reached, this dataset logs which offer was
selected and whether the user engaged with it.

Optimization of Decision Strategies: By tracking offer performance, you can analyze which offers
perform best, optimize decision strategies, and refine the algorithms used to present offers.

Relation to Other Datasets: The Offer Decisioning Events Dataset connects with the Journey Step
Events Dataset to log when a decision point was triggered and which offer was selected. It is also tied to
the AJO Entity Record Schema to ensure that decisions made within the journey are fully tracked.

Bringing It All Together: End-to-End Experience Monitoring in AJO

1. Journey Setup and Execution (AJO Entity Record Schema Dataset & Journey Step Events Dataset):

The AJO Entity Record Schema Dataset forms the foundation for the entire journey, logging messages,
actions, and decisions taken within the journey.

The Journey Step Events Dataset tracks each step in the journey, ensuring that actions like sending a
message or making a decision are logged and monitored for performance and errors.

2. Message Delivery (Message Feedback Events Dataset & BCC Feedback Events Dataset):

After a message is triggered in the journey, the Message Feedback Events Dataset tracks whether the
message was successfully delivered or bounced.

The BCC Feedback Events Dataset tracks the status of BCC and CC recipients, ensuring that secondary
recipients are handled properly and that compliance requirements are met.

3. User Engagement (AJO Email & Push Tracking Datasets):

Once a message is delivered, the Tracking Datasets capture user engagement, including opens, clicks,
and unsubscribes. This data provides insights into the effectiveness of messages in driving user behavior.

4. Offer Decisioning and Optimization (Offer Decisioning Events Dataset):

Throughout the journey, decisions are made regarding which offers to present to users. The Offer
Decisioning Events Dataset logs these decisions, tracks offer engagement, and helps you optimize your
decision-making strategies.

How to Use the Datasets Together:

Monitor Journey Progress: Use the AJO Entity Record Schema Dataset and Journey Step Events Dataset
to monitor the overall progress and structure of the customer journey. These datasets help you track which steps
were taken and whether any issues occurred.

Ensure Message Delivery: Leverage the Message Feedback Events Dataset and BCC Feedback Events
Dataset to track whether messages triggered by the journey were successfully delivered, and identify any
bounces or failures.

Analyze Engagement: After messages are delivered, use the Tracking Datasets to analyze user engagement
and optimize future campaigns based on how users interacted with the message.
Optimize Offer Decisions: Use the Offer Decisioning Events Dataset to analyze which offers were presented
to users, how users engaged with them, and where your decisioning strategies can be refined.

Schema Dictionary for AJO System Datasets

You can find the exhaustive list here.

AJO Entity Record Schema Dataset

First, execute the following query in the Data Distiller Query Pro Mode Editor:

SELECT
table_name,
column_name,
data_type
FROM
information_schema.columns
where table_name = 'ajo_entity_dataset'

The result will be:

Now execute:

SELECT to_json(_experience) FROM ajo_entity_dataset LIMIT 500;

The result will be:

The AJO Entity Record Schema is designed to store metadata related to messages sent to end-users within Adobe
Journey Optimizer (AJO). It captures essential data related to campaigns, journeys, channels (email, SMS, push
notifications), and experiments. This schema is integral for tracking and analyzing campaign performance,
engagement, conversions, and message delivery across various channels. Think of this dataset as a timestamped
lookup dataset for all the other datasets that contain tracking and feedback information on the messages that were sent
out. The lookup data is timestamped because the metadata can change over time as users make changes to the
various configurations.

You cannot use event-specific identifiers like _id and timestamp, as they are tied to the logging of individual
events. Therefore, your best option is to link the message IDs together. The **messageID** attribute in every
record in this dataset is absolutely critical because it helps to stitch various datasets such as Message Feedback
Dataset and Experience Event Tracking Datasets to get details of a message delivery from sending to tracking at a
profile level. An entry for a message is created only after journey or campaign is published. You may see the
entry/update 30 minutes after the publication of the campaign/journey.

Since the AJO Entity Record Schema is the central lookup for all the other datasets, these are the key fields used to link each dataset back to it:

AJO Entity Record Schema Dataset (**ajo_entity_dataset**): **_experience.customerJourneyManagement.entities.channelDetails.messageID**

All Tracking & Feedback Datasets: **_experience.customerJourneyManagement.messageExecution.messageID**

Journey Step Events Dataset: **_experience.journeyOrchestration.stepEvents.actionID**

Offer Decisioning Events Dataset: **_experience.decisioning.propositions.items.scopeDetails.placement.id**
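For example, here is a minimal sketch that joins the entity lookup to the Message Feedback Events Dataset on the message ID. The field paths are the ones documented in this chapter; verify them in your sandbox before relying on the query:

SELECT
e._experience.customerJourneyManagement.entities.channelDetails.messageID AS messageID,
e._experience.customerJourneyManagement.entities.campaign.name AS campaignName,
e._experience.customerJourneyManagement.entities.channelDetails.channel AS channel,
f._experience.customerJourneyManagement.messageDeliveryfeedback.feedbackStatus AS feedbackStatus
FROM ajo_entity_dataset e
JOIN ajo_message_feedback_event_dataset f
ON f._experience.customerJourneyManagement.messageExecution.messageID
= e._experience.customerJourneyManagement.entities.channelDetails.messageID;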

Here are the key fields that you need to be aware of:

_experience.customerJourneyManagement.entities.campaign.campaignID

Unique identifier for the campaign that triggered the message execution.

Used to track campaign-level performance and engagement.

_experience.customerJourneyManagement.entities.campaign.campaignActionID

Action ID of the campaign that triggered this message execution.

Used to trace specific actions within campaigns and optimize messaging strategies.

_experience.customerJourneyManagement.entities.campaign.campaignVersionID

Immutable version of the campaign, representing a specific version after republishing.

Supports A/B testing and performance tracking across different versions of campaigns.

_experience.customerJourneyManagement.entities.campaign.name

Name of the campaign that sent the message.

Useful for campaign reporting and analyzing which campaigns perform best.

_experience.customerJourneyManagement.entities.channelDetails.channel

Defines the experience channel for the message (email, push, etc.).

Used to differentiate between messages sent across different channels.

_experience.customerJourneyManagement.entities.channelDetails.email.subject

Subject of the email message (non-personalized).

Useful for tracking subject line performance and testing variants.

_experience.customerJourneyManagement.entities.channelDetails.messageID

Unique ID representing the message sent to the end user.

Allows message-level tracking for performance and engagement reporting.

_experience.customerJourneyManagement.entities.channelDetails.messagePublicationID

ID representing a frozen/published version of the message.

Supports message version control and tracking over time.

_experience.customerJourneyManagement.entities.channelDetails.push.title

Title of the push notification (non-personalized).


Used for performance reporting on push notifications, especially when testing different push titles.

_experience.customerJourneyManagement.entities.experiment.experimentId

ID used to track a specific experiment or A/B test.

Helps analyze which message variants perform better during A/B testing.

_experience.customerJourneyManagement.entities.journey.journeyActionID

Represents the action within a journey that triggered the message.

Important for journey-based reporting and understanding which actions drive the most engagement.

_experience.customerJourneyManagement.entities.journey.journeyName

Name of the journey that the message is part of.

Helps in journey-level reporting and identifying high-performing journeys.

_experience.customerJourneyManagement.entities.journey.journeyNodeName

Represents the name of the specific node in the journey canvas where the message was triggered.

Supports granular reporting within journeys, allowing insights into specific journey nodes.

_experience.customerJourneyManagement.entities.journey.journeyVersionID

Frozen version of the journey for tracking historical journey changes.

Useful for comparing the performance of different journey versions.

_experience.customerJourneyManagement.entities.experiment.treatmentName

Name of the treatment or variant in an A/B test.

Supports A/B testing analysis by tracking the performance of different variants.

_experience.customerJourneyManagement.entities.channelDetails.messagePublishedAt

The timestamp of when the message was published.

Important for time-based reporting and determining the impact of send times on engagement.

_experience.customerJourneyManagement.entities.channelDetails.baseMessageID

Represents the base message ID from which the published message is derived.

Used to track the origin of derived messages in case of cloning or re-publishing.

_experience.customerJourneyManagement.entities.tags.values

Array of tags corresponding to the message, journey, or campaign.

Useful for categorization and filtering in reporting based on campaign attributes or tags.

_experience.customerJourneyManagement.emailChannelContext.namespace
Namespace associated with the email address in consent preferences.

Tracks preferences and compliance based on email namespaces.

_experience.customerJourneyManagement.emailChannelContext.outboundIP

Outbound IP address used to deliver the message.

Helps diagnose delivery issues by tracking the outbound IP address.

_experience.customerJourneyManagement.messageInteraction.landingpage.landingPageURL

URL of the landing page associated with the message interaction.

Tracks the effectiveness of landing pages associated with message interactions.

_experience.customerJourneyManagement.messageInteraction.openCount

Count of times the email was opened by the recipient.

Tracks user engagement with the message by counting the number of opens.

_experience.customerJourneyManagement.messageInteraction.clickCount

Count of times links within the message were clicked.

Measures click-through rates by tracking link clicks within the message.

_experience.customerJourneyManagement.messageInteraction.offers.offerName

Name of the offer presented in the email or message.

Tracks engagement with specific offers included in the message.

_experience.customerJourneyManagement.messageInteraction.deliveryStatus

Indicates the delivery status (delivered, failed).

Tracks the delivery status to analyze delivery success or failure rates.

_experience.customerJourneyManagement.messageInteraction.bounceType

Type of email bounce (soft, hard).

Helps understand the reason for delivery failure through bounce type analysis.

_experience.customerJourneyManagement.messageInteraction.interactionType

Type of user interaction with the message (open, click, etc.).

Tracks the type of interaction the user had with the message (e.g., clicks, opens).

_experience.customerJourneyManagement.messageInteraction.label

Human-readable label for the URL or link in the message.

Provides insights into which specific URLs or links drove engagement.


_experience.customerJourneyManagement.messageInteraction.offers.propositionID

ID of the proposition or offer made to the user in the message.

Tracks the effectiveness of specific propositions or offers.

_experience.decisioning.propositions.items.interactionOutcome

Tracks the outcome of the interaction (purchase, sign-up, etc.).

Helps measure the outcome of message interactions and conversions.

_experience.customerJourneyManagement.messageProfile.isTestExecution

Indicates whether the message was sent as a test execution.

Filters test messages out of reporting to avoid skewing performance data.

_experience.customerJourneyManagement.messageProfile.isSendTimeOptimized

Indicates whether send-time optimization was applied to the message.

Tracks the effectiveness of send-time optimization strategies.

identityMap.additionalProperties.items.id

Unique identifier for the user’s identity.

Links the message to the user’s identity for personalized insights.

identityMap.additionalProperties.items.type

Type of identity (email, phone, etc.).

Identifies the type of identity associated with the user.

_experience.customerJourneyManagement.messageInteraction.profileID

Unique identifier for the user profile associated with the interaction.

Links the interaction to a specific user profile for personalized tracking.

identityMap.additionalProperties.items.primary

Indicates whether this is the primary identity for the user.

Identifies whether the tracked identity is the user’s primary identifier.

timestamp

Timestamp of when the message interaction occurred.

Helps track when specific interactions with the message occurred.

Journey Step Event Dataset

You should be able to execute the following code:


SELECT * FROM journey_step_events LIMIT 500;

The Journey Step Event Dataset in Adobe Journey Optimizer captures and logs all journey step experience events as
part of Journey Orchestration. These events are essential for reporting and analytics in systems like Customer
Journey Analytics. The dataset helps track each step within a journey and its performance, providing insights into
how users progress through their customer journey, how actions are executed, and what the results of those actions are.
This dataset is especially useful for understanding step-level events within journeys, such as errors, transitions, and
completions.

Key Use Cases for the Journey Step Event Dataset:

1. Journey Reporting and Analysis: Provides visibility into the execution and performance of individual steps
within journeys, such as transitions between steps, completion rates, and timeouts.

2. Error Tracking and Resolution: Logs errors and failure codes associated with journey steps, helping diagnose
and resolve issues that affect customer experience.

3. Journey Optimization: Tracks how users move through the journey, allowing marketers to optimize step
transitions, messaging timing, and action results for better engagement.

4. Profile Segmentation and Interaction: Captures profile identifiers and segment qualifications, which are
essential for targeting and personalizing the user journey.
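As a starting point for step-level reporting, here is a small sketch that counts step events by journey and step status, using a few of the fields described below:

SELECT
_experience.journeyOrchestration.journey.ID AS journeyID,
_experience.journeyOrchestration.stepEvents.stepName AS stepName,
_experience.journeyOrchestration.stepEvents.stepStatus AS stepStatus,
COUNT(*) AS step_events
FROM journey_step_events
GROUP BY
_experience.journeyOrchestration.journey.ID,
_experience.journeyOrchestration.stepEvents.stepName,
_experience.journeyOrchestration.stepEvents.stepStatus;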

Here are the key fields and the unique ones are in orange:

Field Path (Dot Notation)

Use Case for Reporting and Analysis

_experience.journeyOrchestration.stepEvents.stepID

Unique identifier for each journey step event.

Used for tracking individual steps within journeys.

_experience.journeyOrchestration.journey.ID

Identifier for the overall journey.

Useful for tracking the performance of specific journeys.

_experience.journeyOrchestration.journey.name

Name of the journey.

Provides context about which journey is being executed and reported.

_experience.journeyOrchestration.journey.versionID

Version identifier of the journey.

Allows tracking of different versions of the same journey for A/B testing or optimization.

_experience.journeyOrchestration.stepEvents.stepID

Unique identifier for the step within the journey.

Important for understanding which steps users are progressing through or encountering issues with.
_experience.journeyOrchestration.stepEvents.stepName

Name of the step as defined in the Journey Canvas.

Used to identify the specific step for reporting and debugging.

_experience.journeyOrchestration.stepEvents.stepStatus

Current status of the step (e.g., error, completed, timed out).

Helps in analyzing step outcomes and identifying bottlenecks in the journey.

_experience.journeyOrchestration.stepEvents.processingTime

Time taken to process the step in milliseconds.

Useful for optimizing journey performance by tracking how long each step takes to complete.

_experience.journeyOrchestration.stepEvents.profileID

Identifier for the profile involved in the journey.

Key for reporting on the profile-level engagement and personalization within the journey.

Segment Qualification Status

_experience.journeyOrchestration.stepEvents.segmentQualificationStatus

Indicates whether the profile is qualified for the segment (e.g., in-segment or exited).

Helps in segment-based journey analysis and targeting.

_experience.journeyOrchestration.stepEvents.interactionType

Type of interaction (e.g., marketing, transactional).

Critical for differentiating between types of interactions and analyzing their effectiveness.

_experience.journeyOrchestration.stepEvents.actionType

Type of action triggered (e.g., email, SMS, custom HTTP).

Important for reporting which channel or action was invoked during the journey.

_experience.journeyOrchestration.stepEvents.reactionActionID

Identifier of the action to which the user reacted (e.g., click, open).

Helps track and analyze user interactions with journey actions.

_experience.journeyOrchestration.stepEvents.actionExecutionTime

Time taken to execute the action during the step.

Useful for optimizing the execution time of actions within steps.

eventType

The primary event type associated with this record.

Used for categorizing the type of event (e.g., error, step completion).

timestamp

The time when the step event occurred.

Important for time-based reporting and understanding journey progress over time.

The Segment ID field is found in the Journey Step Events Dataset.

The field path for Segment ID is _experience.journeyOrchestration.stepEvents.segmentExportJob.exportSegmentID.

This field captures the segment identifier when a segment export job is triggered during the journey
orchestration process.

This is critical for understanding which segment was used during a particular step of the journey, especially in
journeys that are triggered by audience segments. This information allows you to link specific segment behaviors with
journey events, providing detailed insights into how segment membership affects journey progression and outcomes.
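A minimal sketch that surfaces which exported segments triggered journey steps (the LIMIT is only there to keep exploration fast):

SELECT
_experience.journeyOrchestration.journey.ID AS journeyID,
_experience.journeyOrchestration.stepEvents.stepName AS stepName,
_experience.journeyOrchestration.stepEvents.segmentExportJob.exportSegmentID AS exportSegmentID
FROM journey_step_events
WHERE
_experience.journeyOrchestration.stepEvents.segmentExportJob.exportSegmentID IS NOT NULL
LIMIT 500;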

AJO Message Feedback Datasets

Focus: Primarily focused on feedback from ISPs or service providers after an attempt to deliver a message (email,
SMS, or push).

First, go ahead and execute this:

SELECT * FROM ajo_message_feedback_event_dataset LIMIT 500;

The AJO Message Feedback Event Dataset is a dataset designed to log and track the delivery of messages within
Adobe Journey Optimizer (AJO). It provides comprehensive feedback on message delivery attempts across multiple
channels such as email, push notifications, and SMS:

Logs detailed delivery information, including bounces, retry attempts, failure reasons, and status (delivered,
failed, etc.).

Provides diagnostic feedback on why a message succeeded or failed, helping improve deliverability.

Focuses on the message journey from the system to the recipient’s inbox or device.

Captures feedback regarding message delivery failure (e.g., async bounce, sync bounce, invalid email address).

Key Use Cases:

Delivery Status Reporting: Detailed insights into delivery success and failure.

Bounce and Retry Analysis: Helps diagnose why messages failed and how many retry attempts were made.

Compliance and Monitoring: Tracks outbound IP addresses, bounce types, and reasons for failures.

Here is how the Message Feedback Datasets compare to the Tracking Datasets:

Message Feedback Datasets:

Feedback from delivery systems (bounce, retry, failure reasons)

Captures whether the message was delivered or bounced

Provides detailed reasons for delivery failures (e.g., hard bounce)

No engagement data; focused only on delivery

Bounce and retry analysis: provides insights into delivery retries and reasons for failures

Does not track unsubscription events

Used to improve deliverability, reduce bounces, and troubleshoot issues

Tracking Datasets:

User engagement after message delivery (opens, clicks, interactions)

Does not focus on delivery status; assumes the message was delivered

Captures user interactions such as opens, clicks, and conversions

Tracks when users unsubscribe from future communications

Logs user interactions with message content and calls-to-action

Used to optimize content based on engagement and user behavior

Here are the fields that are most critical here. Note that the unique fields are in orange:

Field Path (Dot Notation)

Use Case for Reporting and Analysis

_experience.customerJourneyManagement.messageExecution.messageID

Unique identifier for the message.

Essential for tracking individual messages for performance and issue diagnosis.

_experience.customerJourneyManagement.emailChannelContext.address

The email address or phone number to which the message was sent.

Used to identify the recipient and track message delivery for personalized reporting.

_experience.customerJourneyManagement.emailChannelContext.outboundIP

Outbound IP address used for message delivery.

Helps in monitoring compliance and diagnosing deliverability issues based on IP reputation.

_experience.customerJourneyManagement.messageDeliveryfeedback.feedbackStatus

Status of the message delivery attempt (e.g., delivered, failed, pending).

Key for understanding overall delivery performance and diagnosing issues with undelivered messages.

_experience.customerJourneyManagement.messageDeliveryfeedback.messageFailure.category

Classifies whether the failure was a sync or async bounce (email-specific).


Useful for categorizing bounce types and diagnosing whether failures were temporary or permanent.

_experience.customerJourneyManagement.messageDeliveryfeedback.messageFailure.reason

Provides the detailed reason for the failure (e.g., invalid email, mailbox full).

Important for identifying common delivery issues and improving deliverability in future campaigns.

_experience.customerJourneyManagement.messageDeliveryfeedback.retryCount

Number of retry attempts before success or permanent failure.

Helps in analyzing retry behavior and determining the efficiency of retry policies in case of failed deliveries.

_experience.customerJourneyManagement.messageExecution.campaignID

Unique ID of the campaign that triggered the message.

Critical for linking the message back to its originating campaign for performance comparison and reporting.

_experience.customerJourneyManagement.messageExecution.journeyActionID

The action in the journey that triggered the message.

Tracks which journey actions led to message delivery, useful for journey-based reporting and optimization.

_experience.customerJourneyManagement.messageExecution.messageType

Type of the message (e.g., transactional, marketing).

Enables segmentation and reporting based on message type for targeted performance analysis.

_experience.customerJourneyManagement.messageProfile.isSendTimeOptimized

Indicates whether the message was optimized for the best send time.

Key for measuring the effectiveness of send-time optimization strategies in improving delivery rates.

_experience.customerJourneyManagement.messageProfile.isTestExecution

Indicates whether the message was part of a test execution.

Helps to filter test messages from production messages to avoid skewing performance data.

_experience.customerJourneyManagement.messageDeliveryfeedback.offers.offerID

Unique identifier for the offer presented in the message.

Used to track the success of specific offers by analyzing engagement and conversion rates.

_experience.customerJourneyManagement.messageDeliveryfeedback.offers.propositionTime

Time when the offer proposition was generated.

Useful for analyzing the timing of offers and how it affects engagement or conversions.

_experience.decisioning.propositions.items.interactionOutcome
Tracks the result of interactions with the message (e.g., clicked a link, made a purchase).

Measures the success of a message in driving user behavior, critical for ROI and conversion analysis.

Why These Fields Are Important:

Delivery Status & Failure Reason: These fields are crucial for understanding message delivery success and
failure, as well as diagnosing the reasons behind message bounces and undelivered emails.

Retry Count: Helps analyze retry behavior and can reveal patterns in which retry attempts are successful and
which are not.

Offer & Proposition Data: Offer engagement tracking is essential to understanding how users interact with
promotional content, enabling teams to optimize future campaigns based on conversion data.

Journey Action ID: This links the message feedback back to the customer journey, providing insights into the
effectiveness of different journey steps in triggering user engagement.

Interaction Outcome: This field provides key insights into how recipients are interacting with the message,
allowing for better tracking of conversion rates and user behavior following message delivery.
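To turn these fields into a quick delivery health report, you can aggregate the feedback status together with the failure category and reason. A minimal sketch:

SELECT
_experience.customerJourneyManagement.messageDeliveryfeedback.feedbackStatus AS feedbackStatus,
_experience.customerJourneyManagement.messageDeliveryfeedback.messageFailure.category AS failureCategory,
_experience.customerJourneyManagement.messageDeliveryfeedback.messageFailure.reason AS failureReason,
COUNT(*) AS messages
FROM ajo_message_feedback_event_dataset
GROUP BY
_experience.customerJourneyManagement.messageDeliveryfeedback.feedbackStatus,
_experience.customerJourneyManagement.messageDeliveryfeedback.messageFailure.category,
_experience.customerJourneyManagement.messageDeliveryfeedback.messageFailure.reason
ORDER BY messages DESC;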

AJO Email BCC Feedback Event Dataset

First, execute the query:

SELECT * FROM ajo_bcc_feedback_event_dataset LIMIT 500;

The AJO Email BCC Feedback Event Dataset is specifically designed to track and log the delivery status of BCC
(Blind Carbon Copy) emails. It is used primarily for reporting purposes to understand how BCC emails are handled,
delivered, and processed, focusing on feedback such as exclusions, failures, and delivery outcomes.

Key Differences Between the BCC Feedback Event Dataset and the Message Feedback Event Dataset

1. BCC-specific Tracking: The BCC dataset is specifically focused on BCC and CC recipients, whereas the
Message Feedback dataset logs information for all messages across email, SMS, and push channels. It includes
fields for tracking the original recipient and the secondary recipient type (e.g., BCC, CC, Archival).

2. Exclusion Data: The BCC dataset contains fields like Exclusion Code and Exclusion Reason, which provide
specific reasons for message exclusions, such as compliance or typology rules, which may not be as granular in
the Message Feedback dataset.

3. Field Overlap: Both datasets share fields related to message delivery feedback, such as Delivery Status, Failure
Category, Failure Reason, and Offer Information.

4. Use Case: The BCC Feedback Dataset is more narrowly focused on tracking BCC and CC email handling and
is highly specialized for reporting purposes about those secondary recipients. The Message Feedback Dataset
offers a broader scope, focusing on all message types across multiple channels (email, SMS, push), providing a
wider range of delivery feedback, retries, and engagement.

Here are the key fields. Unique fields are marked in orange

Field Path (Dot Notation)

_experience.customerJourneyManagement.messageExecution.messageID
Unique identifier for the message.

Used to track individual messages for performance and issue diagnosis.

_experience.customerJourneyManagement.emailChannelContext.address

Email address of the original recipient.

Tracks the recipient of the message, useful for reporting and personalization.

_experience.customerJourneyManagement.emailChannelContext.outboundIP

Outbound IP address used for message delivery.

Helps monitor compliance and deliverability issues.

_experience.customerJourneyManagement.messageDeliveryfeedback.feedbackStatus

Status of the message delivery (e.g., delivered, failed).

Used to understand delivery performance and detect failures.

_experience.customerJourneyManagement.messageDeliveryfeedback.messageExclusion.code

Top-level exclusion reason (e.g., typology rule, mandatory parameter missing).

Critical for compliance reporting and understanding why messages were excluded.

_experience.customerJourneyManagement.messageDeliveryfeedback.messageExclusion.reason

Detailed exclusion reason (e.g., specific typology rule ID).

Helps in diagnosing specific reasons why a message was excluded.

_experience.customerJourneyManagement.messageDeliveryfeedback.messageFailure.category

Classifies whether the failure was a sync or async bounce.

Provides a detailed breakdown of bounce types for diagnostic purposes.

_experience.customerJourneyManagement.messageDeliveryfeedback.messageFailure.reason

Specific reason for message failure (e.g., invalid email).

Helps improve deliverability by identifying common failure reasons.

_experience.customerJourneyManagement.messageDeliveryfeedback.offers.offerID

Unique ID of the offer in the message.

Tracks the success of specific offers sent via BCC emails.

_experience.customerJourneyManagement.messageDeliveryfeedback.retryCount

Number of retry attempts made before the message was delivered or failed.

Useful for analyzing retries and delivery success rates.


Original Recipient Address

_experience.customerJourneyManagement.secondaryRecipientDetail.originalRecipientAddress

Address of the original recipient for whom the BCC or CC copy was sent.

Essential for tracking how secondary recipients receive the message.

_experience.customerJourneyManagement.secondaryRecipientDetail.type

Type of secondary recipient (e.g., BCC, CC, Archival).

Important for distinguishing between BCC, CC, and archival recipients.

Why These Fields Are Important:

Delivery Status & Exclusion Data: These fields are key for understanding delivery performance and exclusion
reasons, particularly when messages are filtered out by typology rules or compliance filters.

Secondary Recipient Data: Unique to the BCC dataset, fields like Original Recipient Address and Secondary
Recipient Type help track how secondary recipients (BCC, CC) are handled, which is critical for understanding
email distribution and compliance.

Offer & Proposition Data: These fields help measure the effectiveness of offers and promotions sent to BCC
recipients, providing insights into engagement and offer performance.
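
Putting the exclusion fields above to work, here is a minimal sketch of an exclusion-reporting query. The dataset name ajo_bcc_feedback_event_dataset is an assumption; substitute the name of the BCC Feedback dataset in your sandbox. The field paths are taken from the list above and may need adjusting to your schema.

SELECT _experience.customerJourneyManagement.messageDeliveryfeedback.messageExclusion.code AS exclusion_code,
       _experience.customerJourneyManagement.messageDeliveryfeedback.messageExclusion.reason AS exclusion_reason,
       COUNT(*) AS excluded_messages
FROM ajo_bcc_feedback_event_dataset -- hypothetical name; locate the BCC Feedback dataset in your environment
GROUP BY 1, 2
ORDER BY excluded_messages DESC;

Grouping by both the code and the detailed reason shows which typology rules or compliance filters are excluding the most messages.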

AJO Email Tracking Experience Event Dataset

Focus: Concentrates on user interactions with delivered email messages

Just type this query in the Data Distiller Query Pro Mode Editor:

SELECT * FROM ajo_email_tracking_experience_event_dataset LIMIT 500;

The results from above should be a great starting point for you to dig deeper into this dataset. The AJO Email
Tracking Experience Event Dataset is designed to capture and log detailed interaction data related to email
campaigns sent via the Adobe Journey Optimizer (AJO). This dataset tracks various user actions upon receiving
emails, providing essential insights for performance reporting, segmentation, and optimization of email marketing
campaigns:

1. Capturing User Interactions: The dataset records detailed information about how users interact with email
campaigns, including:

Opens: Whether and how many times a recipient opened an email.

Clicks: Whether the recipient clicked on any links within the email.

Unsubscribes: Whether the user unsubscribed from future emails.

Bounces: Whether the email failed to be delivered (soft or hard bounce).

Deliveries: Logs whether the email was successfully delivered.

2. Email Performance Metrics: The dataset supports analysis of email performance with the following key
metrics:
Open Rates: The percentage of recipients who opened the email, useful for assessing the effectiveness of
subject lines.

Click-Through Rates (CTR): The percentage of recipients who clicked on links within the email,
indicating the relevance of the content or call-to-action (CTA).

Unsubscribe Rates: Tracks how many users opted out of future emails, helping to manage list hygiene and
content relevance.

Bounce Rates: Identifies emails that were not delivered due to issues like invalid email addresses (hard
bounces) or temporary issues (soft bounces).

3. Link and Offer Tracking: The dataset allows for detailed reporting on link and offer engagement, capturing:

Tracker URLs: Tracks the specific URLs that users clicked within the email.

Offer Interactions: Logs interactions with special offers or promotions included in the email, helping to
measure the effectiveness of discounts, sales, or calls-to-action.

Landing Pages: Tracks if users landed on specific pages after clicking links, allowing for detailed
conversion analysis.

4. Campaign and Journey Metadata: The dataset contains critical metadata regarding the email campaigns and
journeys, including:

Campaign IDs: Unique identifiers for each campaign, enabling tracking of email performance across
different campaigns.

Journey Action IDs: Tracks which specific journey actions triggered the email, useful for analyzing the
effectiveness of different touchpoints.

Campaign Versioning: Enables the comparison of different versions of a campaign or journey to identify
which versions are more effective.

5. Segmentation and Personalization: The dataset is enabled for profile integration, meaning it can be used for
segmentation and personalized marketing:

Segment Creation: Build segments based on user behavior, such as frequent openers, non-clickers, or users
who unsubscribed.

Personalization Insights: Analyze how different audience segments interact with emails, helping to tailor
future campaigns for improved engagement.

6. Detailed Reporting for Compliance and Preference Management: The dataset helps track consent and
compliance-related interactions, such as:

Email Preferences: Tracks user consent and opt-in preferences (e.g., GDPR compliance).

Unsubscribes: Provides information about users who opted out of future communications, ensuring
adherence to privacy regulations.

7. A/B Testing and Optimization: The dataset supports A/B testing by tracking different email variants (e.g.,
subject lines, content, offers), allowing you to:

Test different variants: Measure how different content versions, send times, or calls-to-action perform to
optimize future emails.
Send Time Optimization: Track whether send-time optimization strategies were applied, helping you to
analyze the performance impact of different send times.

Typical reporting use cases for this dataset include:

Performance Monitoring: Gain insight into how well email campaigns perform based on metrics such as opens,
clicks, and conversions.

Engagement Insights: Analyze how recipients interact with emails, including the most clicked links, offers, and
CTAs.

Conversion Tracking: Measure how well emails drive conversions, such as sales, sign-ups, or engagement with
landing pages.

A/B Testing: Compare the performance of different email versions to identify the most effective strategies.

Deliverability and Bounce Analysis: Understand which emails failed to deliver and why, to optimize delivery
rates and maintain list hygiene.

Unsubscribe Management: Track and reduce unsubscribe rates by improving content relevance and targeting
strategies.

Here are the fields that you will need. Fields marked in orange are unique to email tracking:

_experience.customerJourneyManagement.emailChannelContext.address

The email address of the recipient.

Key for identifying recipients of emails; useful for segmentation and reporting.

_experience.customerJourneyManagement.emailChannelContext.namespace

Namespace associated with the email address (e.g., domain or region).

Useful for tracking compliance and preferences related to email domains or regions.

_experience.customerJourneyManagement.messageInteraction.deliveryStatus

The status of the email delivery (e.g., delivered, failed).

Measures delivery success and failures, providing insights into deliverability and list hygiene.

_experience.customerJourneyManagement.messageInteraction.bounceType

Type of email bounce (e.g., soft, hard).

Helps identify and diagnose reasons for delivery failures (e.g., permanent vs temporary).

_experience.customerJourneyManagement.messageInteraction.openCount

The number of times the recipient opened the email.

Measures user engagement by tracking how many times the email was opened.

_experience.customerJourneyManagement.messageInteraction.clickCount

Number of times the recipient clicked on links within the email.

Tracks user engagement with links in the email, critical for reporting on conversions.
_experience.customerJourneyManagement.messageInteraction.unsubscribe

Indicates whether the recipient unsubscribed from future emails.

Measures opt-out behavior to optimize future email content and targeting strategies.

_experience.customerJourneyManagement.messageInteraction.urlID

The unique URL included in the email and clicked by the user.

Tracks which specific URLs were clicked within the email for engagement analysis.

_experience.customerJourneyManagement.messageInteraction.offers.offerID

The unique identifier for any offer or promotion included in the email.

Tracks engagement with specific offers and promotions included in the email.

_experience.customerJourneyManagement.messageInteraction.landingPage.landingPageID

Unique identifier for the landing page visited after clicking a link in the email.

Tracks conversions by following email-driven traffic to landing pages.

_experience.customerJourneyManagement.messageInteraction.landingPage.landingPageName

The name of the landing page associated with the email link.

Provides insights into which landing pages perform best in driving conversions from email campaigns.

_experience.customerJourneyManagement.messageExecution.campaignID

Unique identifier for the campaign responsible for sending the email.

Useful for tracking campaign performance across different email versions and audience segments.

_experience.customerJourneyManagement.messageExecution.messageID

Unique identifier for the email message sent to the recipient.

Tracks the individual performance of each email sent as part of the campaign.

_experience.customerJourneyManagement.messageExecution.journeyActionID

Unique identifier for the journey action that triggered the email message.

Tracks which journey action led to the email being sent, supporting journey optimization.

_experience.customerJourneyManagement.messageExecution.messageType

The type of email message (e.g., promotional, transactional).

Allows segmentation and reporting based on email message types (promotional vs transactional).

timestamp

The time when the email was sent or delivered.

Enables time-based reporting for analyzing trends and performance by time of day.
_experience.customerJourneyManagement.messageProfile.isTestExecution

Indicates whether the email was part of a test execution.

Helps exclude test emails from reporting to ensure accuracy in performance metrics.

_experience.customerJourneyManagement.messageProfile.isSendTimeOptimized

Indicates whether send-time optimization was applied for the email.

Tracks whether send-time optimization improved engagement and conversion rates.

_experience.decisioning.propositions.items.interactionOutcome

Tracks the interaction outcome following email engagement (e.g., conversion, purchase).

Measures conversion rates and other outcomes after engagement with the email.

_experience.customerJourneyManagement.messageInteraction.propositionTime

The timestamp when an offer or proposition was generated for the email.

Tracks the timing of offers and their effectiveness in driving user engagement.

In summary, the key interaction signals map to fields as follows (an example query follows this list):

Opens: Tracked through openCount and eventType for open events.

Clicks: Measured using clickCount, trackerURL, and trackerURLLabel to see which links were
clicked.

Unsubscribes: The unsubscribed field records if a user opts out after receiving an email.

Bounces: Captured through deliveryStatus and bounceType, detailing whether emails were delivered or
bounced.

Landing Page Engagement: landingPageID and landingPageName track which landing pages users
visited after clicking links.
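
Putting these interaction fields together, here is a minimal sketch of a per-campaign engagement summary against the ajo_email_tracking_experience_event_dataset queried earlier. The field paths come from the list above; treating unsubscribe as a boolean flag is an assumption that may need adjusting for your schema.

SELECT _experience.customerJourneyManagement.messageExecution.campaignID AS campaign_id,
       SUM(_experience.customerJourneyManagement.messageInteraction.openCount) AS total_opens,
       SUM(_experience.customerJourneyManagement.messageInteraction.clickCount) AS total_clicks,
       SUM(CASE WHEN _experience.customerJourneyManagement.messageInteraction.unsubscribe THEN 1 ELSE 0 END) AS unsubscribes -- assumes a boolean flag
FROM ajo_email_tracking_experience_event_dataset
GROUP BY 1
ORDER BY total_opens DESC;

Dividing total_clicks by total_opens then gives a rough click-to-open rate per campaign.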

AJO Push Tracking Experience Event Dataset

Focus: Concentrates on user interactions with delivered push notifications and SMS messages

To explore this dataset, just type and execute this in the Data Distiller Query Pro Mode Editor:

SELECT * FROM ajo_push_tracking_experience_event_dataset LIMIT 500;

The AJO Push Tracking Experience Event Dataset is designed to capture and log interaction events related to push
notifications (including SMS) sent via the Adobe Journey Optimizer (AJO). This dataset stores detailed
information about user interactions with push notifications, providing essential insights for reporting, segmentation,
and performance analysis:

1. Capturing User Interactions: The dataset records various actions users take in response to push notifications,
such as:

Receives: Whether the push notification was delivered to the user’s device.

Opens: Whether the user opened the app or interacted with the notification.
Clicks: Whether the user clicked any custom actions within the notification.

Dismisses: Whether the user dismissed the notification without engaging.

Launches: Whether the push notification successfully launched the app.

2. Push Notification Metadata: The dataset contains metadata about the push notifications, including:

Push Provider Information: Identifies which push provider (e.g., APNS for iOS, FCM for Android) was
used to deliver the notification.

Push Provider Message ID: Unique identifier assigned to the notification by the provider.

Custom Actions: Logs any custom actions (e.g., buttons) included in the push notification and records user
interactions with them.

3. Tracking User Engagement: Information in the dataset supports the measurement of key performance indicators
such as:

Open rates: The percentage of users who open or interact with push notifications.

Engagement rates: Based on custom action clicks or other interactions within the notification.

Conversion: If push notifications prompt specific user actions, such as purchases or sign-ups within the
app.

4. Segmentation and Profiling: The dataset is enabled for profile integration, meaning it can be used to build
audience segments based on user interaction data. For example:

Segment users who frequently open push notifications.

Target users who never engage with notifications.

Measure user engagement with specific campaigns to refine marketing strategies.

5. Supporting Campaign Analysis: It includes detailed information about the campaigns and journeys that trigger
push notifications, such as:

Campaign IDs: Track push notification performance by campaign.

Journey Action IDs: Helps identify which journey action led to the notification being sent.

Journey Versioning: Enables performance comparison between different versions of journeys or campaigns.

6. Geolocation and Contextual Data: For use cases involving location-based push notifications, the dataset can
capture contextual data such as:

Geo-location data: Logs when notifications are triggered by location-based events (e.g., entering a specific
geographical area).

Points of Interest (POIs): Logs interaction with POIs when they are used to trigger notifications.

Typical reporting use cases for this dataset include:

Performance Monitoring: Understand how different push notifications perform across various campaigns and
journeys.
Engagement Insights: Track how users interact with notifications, including opens, custom action clicks, and
app launches.

Conversion Tracking: Measure how effective push notifications are at driving conversions, such as app launches
or purchases.

A/B Testing: Compare different versions of push notifications to see which variants (message types, delivery
times, custom actions) perform better.

Here are the fields that you will need. Fields marked in orange are unique to push notifications:

_experience.customerJourneyManagement.pushChannelContext.deviceToken

The unique token or ID of the recipient’s device.

Key for targeting push notifications to specific devices and tracking device-level engagement.

_experience.customerJourneyManagement.pushChannelContext.pushProvider

The service provider used to deliver the notification (e.g., APNS, FCM).

Useful for reporting performance by provider and diagnosing delivery issues related to specific push services.

_experience.customerJourneyManagement.pushNotificationTracking.pushProviderMessageID

Unique ID assigned to the message by the push provider.

Helps in troubleshooting issues with message delivery and correlating logs with the provider’s system.

_experience.customerJourneyManagement.messageInteraction.deliveryStatus

The status of the push notification delivery (e.g., delivered, failed).

Measures delivery success and failures, providing insights into message reachability.

_experience.customerJourneyManagement.messageInteraction.bounceType

Type of push notification bounce (e.g., hard, soft).

Helps identify and diagnose reasons for delivery failures (e.g., permanent vs temporary).

_experience.customerJourneyManagement.messageInteraction.openCount

The number of times the recipient opened the push notification.

Measures engagement by tracking how many times a user opens the push notification.

_experience.customerJourneyManagement.messageInteraction.clickCount

Number of times the recipient clicked on any URLs or buttons within the notification.

Tracks user engagement with links or buttons in the notification, critical for conversion reporting.

_experience.customerJourneyManagement.messageInteraction.unsubscribe

Indicates whether the recipient unsubscribed from push notifications.


Measures opt-out behavior, aiding in list hygiene and content relevance optimization.

_experience.customerJourneyManagement.messageInteraction.urlID

The unique URL included in the push notification and clicked by the user.

Tracks user interaction with specific URLs in push notifications, supporting engagement analysis.

_experience.customerJourneyManagement.pushNotificationTracking.richMedia

Contains data on any rich media (e.g., images, videos) included in the push notification.

Tracks engagement with rich media content (images, videos), helping assess the effectiveness of multimedia
notifications.

_experience.customerJourneyManagement.pushNotificationTracking.customAction

Details of any custom actions (e.g., buttons) included in the notification.

Allows tracking of specific in-notification interactions, helping assess user engagement with interactive content.

_experience.customerJourneyManagement.pushNotificationTracking.customAction.actionID

The unique ID of the custom action (e.g., button) clicked by the recipient.

Enables detailed reporting on user interaction with different actions presented within the push notification.

_experience.customerJourneyManagement.pushNotificationTracking.isLaunch

Indicates whether the push notification successfully launched the app.

Critical for measuring how effective notifications are at driving app usage.

_experience.customerJourneyManagement.messageExecution.campaignID

Unique identifier of the campaign responsible for sending the push notification.

Useful for tracking overall campaign performance and engagement metrics.

_experience.customerJourneyManagement.messageExecution.messageID

Unique identifier for the push notification message sent to the recipient.

Allows for detailed tracking and reporting of individual push notification performance.

_experience.customerJourneyManagement.messageExecution.journeyActionID

Unique identifier for the journey action that triggered the push notification.

Tracks performance of specific journey actions that triggered the push notification, for journey optimization.

_experience.customerJourneyManagement.messageExecution.messageType

The type of push notification message (e.g., promotional, transactional).

Allows segmentation and reporting based on push message types (promotional vs transactional).

timestamp

The time when the push notification was sent or delivered.

Enables time-based reporting, identifying trends over time and performance by time of day.

_experience.customerJourneyManagement.messageProfile.isTestExecution

Indicates whether the push notification was part of a test execution.

Helps exclude test notifications from reporting, ensuring accurate performance metrics.

_experience.decisioning.propositions.items.interactionOutcome

Tracks the interaction outcome following push notification engagement (e.g., conversion, purchase).

Measures the effectiveness of push notifications in driving conversions or other outcomes.

_experience.customerJourneyManagement.messageInteraction.offers.offerID

The unique identifier for any offer or promotion included in the push notification.

Tracks engagement with specific offers, helping optimize promotions within push notification campaigns.

_experience.customerJourneyManagement.messageInteraction.landingPage.landingPageID

Unique identifier for the landing page visited after clicking a link in the push notification.

Measures the effectiveness of push-driven traffic to landing pages, supporting conversion analysis.

_experience.customerJourneyManagement.messageInteraction.landingPage.landingPageName

The name of the landing page associated with the push notification link.

Provides insights into which landing pages perform best in driving conversions from push notifications.

_experience.customerJourneyManagement.messageInteraction.propositionTime

The timestamp when an offer or proposition was generated for the push notification.

Helps analyze the timing of offers and their effectiveness in driving engagement or conversion.
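
As a quick illustration, here is a minimal sketch of a per-campaign push engagement summary using the fields above against the ajo_push_tracking_experience_event_dataset queried earlier. Treating isLaunch and isTestExecution as boolean flags is an assumption that may need adjusting for your schema.

SELECT _experience.customerJourneyManagement.messageExecution.campaignID AS campaign_id,
       COUNT(*) AS tracked_events,
       SUM(_experience.customerJourneyManagement.messageInteraction.openCount) AS total_opens,
       SUM(CASE WHEN _experience.customerJourneyManagement.pushNotificationTracking.isLaunch THEN 1 ELSE 0 END) AS app_launches -- assumes a boolean flag
FROM ajo_push_tracking_experience_event_dataset
WHERE _experience.customerJourneyManagement.messageProfile.isTestExecution = false -- exclude test sends
GROUP BY 1
ORDER BY app_launches DESC;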

For SMS messages, which are also tracked in this dataset, here are the fields that you will need. Fields marked in orange are unique to SMS notifications:

_experience.customerJourneyManagement.smsChannelContext.address

The phone number to which the SMS was sent.

Identifies the recipient of the SMS for segmentation and reporting on message effectiveness.

_experience.customerJourneyManagement.smsChannelContext.namespace

Namespace associated with the recipient’s phone number (e.g., carrier or region).

Tracks the region or carrier for compliance and performance analysis across carriers.

_experience.customerJourneyManagement.messageInteraction.deliveryStatus

The status of the SMS delivery (e.g., delivered, failed).


Measures delivery success and failures, providing insights into message reachability.

_experience.customerJourneyManagement.messageInteraction.bounceType

Type of SMS bounce (e.g., hard, soft).

Helps identify and diagnose reasons for delivery failures (e.g., permanent vs temporary).

_experience.customerJourneyManagement.messageInteraction.openCount

The number of times the recipient opened an SMS (if trackable).

Measures engagement by tracking SMS openings (if applicable with smart messaging).

_experience.customerJourneyManagement.messageInteraction.clickCount

Number of times the recipient clicked on any URLs within the SMS message.

Tracks user engagement with links in SMS, critical for conversion and interaction reporting.

_experience.customerJourneyManagement.messageInteraction.unsubscribe

Indicates whether the recipient unsubscribed from SMS communications.

Measures opt-out behavior, aiding in list hygiene and content relevance optimization.

_experience.customerJourneyManagement.messageInteraction.urlID

The unique URL that was included in the SMS message and clicked by the user.

Tracks user interaction with specific URLs in SMS, supporting engagement and conversion analysis.

_experience.customerJourneyManagement.messageExecution.messageContent

The content of the SMS message that was sent.

Analyzes the effectiveness of different SMS content in driving user engagement.

_experience.customerJourneyManagement.smsChannelContext.shortCode

The short code or long code from which the SMS was sent.

Allows reporting on performance across different SMS short codes, useful in multi-code campaigns.

_experience.customerJourneyManagement.smsChannelContext.carrier

The cellular carrier associated with the recipient’s phone number.

Helps analyze SMS delivery performance across different carriers.

_experience.customerJourneyManagement.messageExecution.journeyActionID

Unique identifier for the journey action that triggered the SMS message.

Tracks performance of specific journey actions that triggered the SMS, for journey optimization.

_experience.customerJourneyManagement.messageExecution.messageID
Unique identifier for the SMS message sent to the recipient.

Allows for detailed tracking and reporting of individual SMS message performance.

_experience.customerJourneyManagement.messageExecution.campaignID

Unique identifier of the campaign responsible for sending the SMS message.

Provides insights into overall campaign effectiveness and engagement metrics.

_experience.customerJourneyManagement.messageExecution.messageType

The type of SMS message (e.g., promotional, transactional).

Allows segmentation and reporting based on SMS message types (promotional vs transactional).

timestamp

The time when the SMS was sent or delivered.

Enables time-based reporting, identifying trends over time and performance by time of day.

_experience.customerJourneyManagement.messageProfile.isTestExecution

Indicates whether the SMS was part of a test execution.

Helps exclude test messages from reporting, ensuring accurate performance metrics.

_experience.decisioning.propositions.items.interactionOutcome

Tracks the interaction outcome following SMS engagement.

Measures the effectiveness of SMS in driving conversions or other outcomes.

_experience.customerJourneyManagement.messageInteraction.offers.offerID

The unique identifier for any offer or promotion included in the SMS message.

Tracks engagement with specific offers, helping optimize promotions within SMS campaigns.

_experience.customerJourneyManagement.messageInteraction.landingPage.landingPageID

Unique identifier for the landing page that the recipient visited after clicking a link in the SMS.

Measures the effectiveness of SMS-driven traffic to landing pages, supporting conversion analysis.

_experience.customerJourneyManagement.messageInteraction.landingPage.landingPageName

The name of the landing page associated with the SMS link.

Provides insights into which landing pages perform best in driving conversions from SMS campaigns.

_experience.customerJourneyManagement.messageInteraction.propositionTime

The timestamp when an offer or proposition was generated for the SMS.

Helps analyze the timing of offers and their effectiveness in driving engagement or conversion.

_experience.customerJourneyManagement.messageInteraction.optOutKeywords
Keywords used by the recipient to opt out of future SMS messages (e.g., STOP, UNSUBSCRIBE).

Tracks user-initiated opt-outs to manage compliance and improve future targeting.
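
To tie these fields together, here is a minimal sketch that summarizes SMS clicks and opt-outs by short code and carrier from the same ajo_push_tracking_experience_event_dataset. Treating unsubscribe as a boolean flag is an assumption that may need adjusting for your schema.

SELECT _experience.customerJourneyManagement.smsChannelContext.shortCode AS short_code,
       _experience.customerJourneyManagement.smsChannelContext.carrier AS carrier,
       SUM(_experience.customerJourneyManagement.messageInteraction.clickCount) AS sms_clicks,
       SUM(CASE WHEN _experience.customerJourneyManagement.messageInteraction.unsubscribe THEN 1 ELSE 0 END) AS opt_outs -- assumes a boolean flag
FROM ajo_push_tracking_experience_event_dataset
GROUP BY 1, 2
ORDER BY opt_outs DESC;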

Offer Decisions Events Dataset

First, you need to execute the following by locating the dataset that has **ode_decisionevents** in its name:

SELECT * FROM ode_decisionevents_{key specific to your environment};

A proposition offer is a specific type of personalized offer or recommendation presented to a customer during their
journey in Adobe Journey Optimizer (AJO) or Adobe Experience Platform (AEP). It can be anything from a product
recommendation, discount, or special promotion that is generated based on a user’s behavior, preferences, or profile
data. The proposition offer is intended to drive engagement, conversion, or retention by aligning with the user’s
interests and needs. A decision is the process by which the system determines what action or offer to present to a user
based on a set of rules, algorithms, or predefined criteria. It is a critical part of personalized customer experiences,
ensuring that the right content, offers, or communications are delivered to the user at the most opportune moment in
their journey.

The ODE DecisionEvents Dataset tracks decision events and proposition outcomes in Adobe Journey Optimizer. It
focuses on offer propositions made to users, tracking how decisions are made within the system and how users
interact with those propositions. This dataset is used to understand the performance of decisions, offers, and how users
respond to them. It is crucial for reporting and analysis around decision-making processes, offer performance, and
user engagement with propositions.

Key Use Cases for the ODE DecisionEvents Dataset:

1. Offer Performance Tracking: Track how users engage with offers, including clicks, views, and conversions, to
optimize offer strategies.

2. Decision-Making Analysis: Analyze how decisions are made based on rules, algorithms, or strategies, and
measure the performance of decision options.

3. Customer Experience Personalization: Monitor how personalized offers and experiences are delivered based
on user profiles and journey interactions.

4. Optimization of Decision Strategies: Improve decision-making processes by analyzing the performance of
proposition strategies, algorithms, and fallback options.

5. Experience Outcome Measurement: Capture outcomes based on decision events, including success, failure, or
other actions that reflect user engagement with propositions.

Here are the key fields. Each entry below gives the field path (dot notation), followed by a description and its use case for reporting and analysis.

_experience.decisioning.propositionDetails.items.id

Unique identifier for the decision event or offer proposition.

Used to track individual decision events and propositions.

_experience.decisioning.batchID

Identifier for batch-mode decision events.


Useful for tracking decisions made in batch processing versus individual profiles.

_experience.decisioning.experienceID

Identifier for the proposition’s content experience.

Important for tracking and comparing content experiences across propositions.

_experience.decisioning.propositionDetails

Details about the proposition decision, including all offers presented to the user.

Captures the full context of the decision made and the offers shown to the user.

_experience.decisioning.propositionDetails.items.activity.id

Unique identifier for the decision activity.

Helps track the specific decision activity that led to the offer proposition.

_experience.decisioning.propositionDetails.items.id

Unique identifier for the specific offer presented.

Key for tracking offer performance and engagement rates.

_experience.decisioning.propositionDetails.items.fallback

Fallback option used when no other regular options qualified.

Tracks when fallback strategies are used, indicating potential gaps in targeting or personalization strategies.

Decision Option Characteristics

_experience.decisioning.propositionDetails.items.characteristics

Additional properties or attributes related to the decision option.

Used to optimize the performance of different options and measure their impact.

Selected Experience Option

_experience.decisioning.propositions.items.scopeDetails.experience

The experience selected as part of the decision scope.

Tracks which experience option was ultimately selected for the user.

_experience.decisioning.propositionDetails.items.placement.id

Unique identifier for the decision placement (e.g., where the offer was shown).

Critical for measuring performance based on where the proposition was presented (e.g., email, web).

_experience.decisioning.propositions.items.scopeDetails.strategies.items.algorithmID

Identifier of the algorithm used to make the decision, if applicable.


Important for measuring the effectiveness of different decision-making algorithms.

_experience.decisioning.propositions.items.scopeDetails.interactionMeasurements.items.outcome

Outcome of the decision-making event (e.g., user clicked, user converted).

Key for measuring the effectiveness of decisions and offers based on user engagement.

timestamp

Time when the decision or offer proposition event occurred.

Essential for time-based reporting and analyzing trends over time.

eventType

The primary event type for this time-series record.

Useful for categorizing different types of events in decision-making, such as offers presented, clicks, or views.

identityMap.additionalProperties.items.id

Identity of the consumer in the related namespace.

Used to identify and link user-specific events, providing a unified view of user interactions across channels.

identityMap.additionalProperties.items.primary

Indicates this identity is the preferred identity. Used to help systems better organize how identities are queried.

Helps in prioritizing primary identities for reporting, ensuring consistency in user-based tracking and attribution.

Relationship Between ODE and AJO Entity Dataset:

1. Linking via Journey Structure:

The AJO Entity Dataset tracks the entire structure of a journey, including journey steps, messages, and
decision points.

Decision points in the journey are where the ODE Dataset comes into play. When a decision needs to be
made, such as which offer to present to the user, the decision event is logged in the ODE Dataset.

The AJO Entity Dataset would include references to these decision events, ensuring that every decision
made in the journey is tracked.

2. Offer Propositions and Decision Tracking:

Offer decisions made during a journey are recorded in the ODE Dataset, which tracks proposition offers
and their outcomes (e.g., which offer was selected and how the user interacted with it).

These decisions are triggered as part of a journey step in the AJO Entity Dataset, where a decision point is
encountered. The AJO Entity Dataset logs the context around why a decision was needed, such as user
segment data or behavior during the journey.

3. Common Identifiers:

Both datasets share common identifiers such as Journey IDs, Message IDs, and Decision IDs that link the
decision events in the ODE Dataset back to the specific journey steps in the AJO Entity Dataset.

For example, a Journey ID in the AJO Entity Dataset would link to a decision event in the ODE Dataset,
showing how a decision was made within that journey and what offer was presented to the user.
4. Decision Outcomes and Journey Actions:

Once an offer decision is made (logged in the ODE Dataset), the outcome of that decision (e.g., user
accepts or ignores the offer) is tracked as part of the user’s journey.

The AJO Entity Dataset would log the overall journey progress, while the ODE Dataset provides the
specific outcome of the offer decision and whether the user engaged with it. This provides a full picture of
how decisions affect the user’s journey.

5. Optimization and Personalization:

The ODE Dataset feeds back into the AJO Entity Dataset by providing insights into which offers work
best for certain segments of users. This data can be used to optimize future decisions within the journey.

For example, if the ODE Dataset shows that certain offers are leading to high engagement rates for a
specific segment, the AJO Entity Dataset can trigger those offers more frequently during similar journey
steps.

Why Track All Proposition Offers and the Algorithm Used?

1. Optimize Offer Strategy and Personalization

Offer propositions are often personalized based on a user’s profile, behavior, or journey step. Tracking all
proposition offers allows marketers to analyze which offers resonate most with specific segments of their
audience.

Algorithms play a central role in deciding which offer or experience is presented to the user. By tracking the
algorithms used, you can evaluate how effective each decision-making method is in delivering the right offers.

Example: If you are running a personalized journey with different product recommendations, tracking which
offers are being presented (and the underlying decision logic) lets you fine-tune those recommendations based on
engagement outcomes.

2. Measure Offer Performance and User Engagement

Tracking all offer propositions allows you to measure how well different offers perform in terms of
engagement. For example, tracking metrics like click-through rates (CTRs), conversions, or acceptances of
offers provides insights into which offers are driving desired behaviors.

By monitoring the proposition outcomes, you gain insight into how different types of users respond to various
offers. This helps in identifying trends, such as which offers lead to higher engagement with a certain
demographic or segment.

Example: Suppose you are running a campaign with multiple offers (e.g., discount codes, product
recommendations). Tracking which offer users engage with (e.g., accepting a discount code vs. ignoring a
recommendation) helps you adjust the future decision-making process to favor more successful offers.

3. Test and Improve Decision-Making Algorithms

Algorithms determine which offers are presented to a user. Different algorithms may prioritize different factors
(e.g., recency of interaction, likelihood of conversion). Tracking which algorithm was used for each decision
allows you to evaluate the effectiveness of various decision-making strategies.

Why it matters: Not all algorithms will work equally well for all users. For example, an algorithm based on past
behavior might work better for returning customers, while a rules-based algorithm might perform better for
new users. By tracking the algorithm’s performance, you can refine the decision-making process and tailor it to
specific contexts.

Example: You may use one algorithm to optimize for maximizing engagement and another for driving
conversions. By tracking how each algorithm performs under different conditions, you can choose the best one
for each scenario.

4. Understand Fallbacks and Avoid Missed Opportunities

Sometimes, none of the primary offers may meet the decision criteria, so a fallback offer is presented to avoid
presenting no offer at all. Tracking the fallback mechanism ensures you understand when your primary offers
are insufficient and that you don’t miss opportunities to engage users.

Example: If you find that fallback offers are being used frequently, it may indicate that your decision-making
process needs optimization. Maybe your primary offers aren’t relevant enough, or the targeting rules are too
restrictive. By tracking the use of fallback options, you can adjust your strategy to improve primary offer
performance (a query that measures fallback usage follows this list).

5. Support A/B Testing and Iteration

Tracking all offer propositions and the algorithm used allows for A/B testing of different decision strategies. By
analyzing which offers (and which decision algorithms) yield the best engagement or conversion results, you can
iteratively refine and improve the customer journey.

Example: Suppose you’re testing two different algorithms—one that prioritizes discounts and another that
prioritizes recommendations. By tracking the propositions and outcomes, you can determine which approach
leads to better results for specific segments, then optimize your future campaigns accordingly.

6. Improve Customer Experience

By tracking proposition outcomes, you ensure that users receive the most relevant and timely offers. This helps
maintain a consistent and personalized customer experience, leading to higher satisfaction and loyalty.

Why it matters: Presenting irrelevant offers or poorly timed propositions can degrade the customer experience.
Tracking helps prevent this by ensuring you present the best possible offer or take corrective actions when
engagement is low.

Example: If a user consistently ignores product recommendations but engages with discount offers, tracking the
decision events allows you to tailor future offers to align with their preferences, improving the overall
experience.
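
For example, to quantify how often fallback offers are served (point 4 above), here is a minimal sketch against the ode_decisionevents_example_decisioning dataset used in the queries that follow. Treating the fallback field as a boolean flag is an assumption that may need adjusting for your schema.

SELECT p.proposition.activity.name AS activity_name,
       SUM(CASE WHEN p.proposition.fallback THEN 1 ELSE 0 END) AS fallback_propositions, -- assumes a boolean flag
       COUNT(*) AS total_propositions
FROM (
    SELECT EXPLODE(_experience.decisioning.propositionDetails) AS proposition
    FROM ode_decisionevents_example_decisioning
) p
GROUP BY 1
ORDER BY fallback_propositions DESC;

A high ratio of fallback_propositions to total_propositions for an activity is a signal that its targeting rules or offer eligibility criteria need revisiting.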

Retrieve User Information Along with Proposition Offers

SELECT to_json(p.identityMap),
       to_json(p.proposition)
FROM (
    SELECT identityMap,
           EXPLODE(_experience.decisioning.propositionDetails) AS proposition
    FROM ode_decisionevents_example_decisioning
) p

This query extracts user identity information and proposition details from the
ode_decisionevents_example_decisioning dataset. It works by first selecting the **identityMap** (which
contains user identity data) and exploding the **propositionDetails** array (which holds details of
propositions made to users) so that each proposition is returned as a separate row. The outer query then converts both
the user identity and the proposition details into JSON format, making them easier to work with for further analysis or
integration into other systems. This approach is typically used to track the specific offers or decisions made for each
user during their journey.

Extracting Decision Event Details by Year and Month

SELECT s.timestamp,
       STRING(YEAR(s.timestamp)) AS year,
       STRING(MONTH(s.timestamp)) AS month,
       STRING(YEAR(s.timestamp) * 100 + MONTH(s.timestamp)) AS yearmonth,
       s.propositionId,
       s.eventType,
       s.customerId,
       s.activityName,
       s.activityId,
       s.selection.name AS offerName,
       s.selection.id AS offerId,
       s.placementName,
       s.placementId
FROM (
    SELECT p.timestamp,
           p.propositionId,
           p.eventType,
           p.identityMap.customerid[0].id AS customerId,
           p.proposition.activity.name AS activityName,
           p.proposition.activity.id AS activityId,
           EXPLODE(p.proposition.selections) AS selection,
           p.proposition.placement.name AS placementName,
           p.proposition.placement.id AS placementId
    FROM (
        SELECT timestamp,
               _id AS propositionId,
               eventType,
               identityMap,
               EXPLODE(_experience.decisioning.propositionDetails) AS proposition
        FROM ode_decisionevents_example_decisioning
    ) p
) s

This query extracts detailed information from the ode_decisionevents_example_decisioning dataset, focusing on
propositions (offers) presented to users. It retrieves fields such as the event **timestamp**,
**propositionId**, **eventType**, **customerId**, **activityName**, **activityId**,
offer name and ID, and placement details. Additionally, it formats the timestamp to generate year, month, and a
concatenated **yearmonth** field for temporal analysis. The query uses the **explode** function to break
down the array of selections (offers) into individual rows, ensuring that each offer is captured separately. This structure
allows for a granular view of the decision events, tracking when specific offers were made and linking them to the
customer, activity, and placement involved.

Activity Count by Decision Type

This chart shows the count of activities grouped by decision types.

SELECT
proposition.activity.name AS activityName,
COUNT(*) AS numOfActivities
FROM
ode_decisionevents_example_decisioning
GROUP BY
proposition.activity.name
ORDER BY
numOfActivities DESC;

Number of Offers Per Placement

This chart shows the number of offers per placement.

SELECT
proposition.placement.name AS placementName,
COUNT(*) AS numOfOffers
FROM
ode_decisionevents_example_decisioning
GROUP BY
proposition.placement.name
ORDER BY
numOfOffers DESC;

Number of Offers Served Each Month

This chart tracks the number of offers served each month.

SELECT
STRING(YEAR(timestamp)) AS year,
STRING(MONTH(timestamp)) AS month,
COUNT(*) AS numOfOffers
FROM
ode_decisionevents_example_decisioning
GROUP BY
STRING(YEAR(timestamp)),
STRING(MONTH(timestamp))
ORDER BY
year, month;

Unique Customers With an Offer Proposition Per Month

This chart shows the number of unique customers who received an offer each month.

SELECT
STRING(YEAR(timestamp)) AS year,
STRING(MONTH(timestamp)) AS month,
COUNT(DISTINCT identityMap.customerid[0].id) AS numOfUniqueCustomers
FROM
ode_decisionevents_example_decisioning
GROUP BY
STRING(YEAR(timestamp)),
STRING(MONTH(timestamp))
ORDER BY
year, month;

These queries assume that the dataset follows the structure shown in the previous example. You can adjust column
names or logic based on your specific schema or dataset requirements.


https://data-distiller.all-stuff-data.com/unit-3-data-distiller-etl-extract-transform-load/etl-300-incremental-processing-using-checkpoint-tables-in-data-distiller * * *

Incremental Processing Use Case Overview

Imagine a large e-commerce platform managing millions of transactions daily. To keep their analytics up to date
without processing vast amounts of data repeatedly, they rely on incremental processing to efficiently build and
update fact tables. Instead of recalculating totals from scratch, incremental processing allows the platform to
seamlessly update critical metrics like total sales (SUM) and number of transactions (COUNT) by processing only
the new data. This approach drastically reduces the time and resources needed to maintain accurate business insights.
For more complex operations like window functions, the system can focus on small, relevant data windows, ensuring
insights like customer lifetime value or purchasing trends remain timely and precise—all while avoiding the
computational strain of reprocessing the entire dataset.
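
A minimal sketch of this pattern, using hypothetical orders, daily_sales_fact, and etl_checkpoint tables (none of these names come from the product), looks like this: each run aggregates only the rows that arrived after the last checkpoint and then advances the checkpoint.

-- 1. Aggregate only the rows that arrived after the last processed watermark (all table names hypothetical)
INSERT INTO daily_sales_fact
SELECT order_date,
       SUM(order_amount) AS total_sales,
       COUNT(*) AS num_transactions
FROM orders
WHERE ingest_ts > (SELECT MAX(last_processed_ts) FROM etl_checkpoint)
GROUP BY order_date;

-- 2. Record the new watermark so the next run only picks up data beyond this point
INSERT INTO etl_checkpoint
SELECT CURRENT_TIMESTAMP AS last_processed_ts;

Because each run appends a per-run increment, downstream reports should SUM total_sales and num_transactions per order_date, or the fact table should be compacted periodically.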

Imagine a marketing team running a large-scale email campaign. Every minute, new engagement metrics like email
opens and click-through rates are pouring in. With incremental processing, the team can seamlessly update these
metrics in real-time without recalculating data for the entire email list. This means that as new engagement data flows
in, the marketing platform automatically updates reports and dashboards, allowing the team to monitor campaign
performance live, make timely adjustments, and deliver more targeted follow-up emails. The result? Efficient, up-to-
the-minute insights without the overhead of processing millions of records from scratch.

Window Functions and Incremental Processing

Consider a financial services company that tracks customer transactions to rank their top clients. Using window
functions like RANK and ROW_NUMBER, the company can create insights by analyzing the entire transaction
history. These functions, however, are more complex because they depend on the order of transactions and require
access to the entire dataset. For example, to determine the top-spending clients or calculate a running total of
transactions over time, the model must account for both previous and following rows. This makes window functions
powerful for gaining deep insights, but they often require full dataset access rather than incremental updates, ensuring
accurate and consistent results in critical areas like client ranking, loyalty programs, and financial forecasting.
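
To make that dependency concrete, here is a minimal sketch over a hypothetical transactions table: the running total and row sequence both depend on every earlier row in the partition, so a single late-arriving transaction forces the whole partition to be recomputed.

SELECT customer_id,
       transaction_ts,
       amount,
       SUM(amount) OVER (PARTITION BY customer_id ORDER BY transaction_ts) AS running_total,
       ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY transaction_ts) AS txn_sequence
FROM transactions;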

For these reasons, window functions do not lend themselves well to incremental processing:

1. Reordering: If new rows are inserted or deleted, the order of rows might change, which affects the window
function result.

2. Dependencies: Window functions depend on multiple rows, so adding new data might require recomputing the
results for previously processed rows.
3. Complex Calculations: Calculations like moving averages or running totals can’t easily be split between old and
new data, as each new row could change the result for previous rows.

In machine learning (ML) use cases, incremental processing can also play a critical role in efficiently handling large
datasets, especially when it comes to building and maintaining models. Let us look at some examples:

1. Feature Engineering: Imagine an online retailer using machine learning to personalize customer experiences. One
key feature the model relies on is the number of times each customer has purchased a product. Instead of recalculating
the total from scratch every time a new transaction occurs, incremental processing allows the system to seamlessly
update this feature with each purchase. The result? A dynamic and real-time count of customer purchases, feeding into
personalized recommendations and marketing efforts—without the computational overhead of reprocessing all
historical data. Whether it’s tracking purchase value or customer interactions, incremental processing ensures the
features stay fresh and relevant, driving smarter personalization at scale (see the sketch after this list).

2. Incremental Model Training: Picture a global financial institution using a fraud detection system powered by
machine learning. Every second, new transactions are flowing in. Instead of retraining the entire model from scratch
with each new batch of data, algorithms like stochastic gradient descent (SGD) and decision trees allow the model to
incrementally learn from each new transaction. This means the fraud detection system can continuously adapt to
evolving fraud patterns—whether it’s a new scam technique or a shift in customer behavior—on the fly. With
incremental learning, the model stays one step ahead, identifying fraudulent activity in real-time without the heavy
computational cost of full retraining.

3. Model Deployment and Scoring (Inference): Consider an e-commerce platform with a recommendation engine
powered by machine learning. Each hour, new product interactions—like clicks, views, and purchases—are added to
the system. With incremental processing, the platform’s model only needs to score the fresh batch of user data, instead
of reprocessing the entire dataset. This approach not only boosts efficiency but also enables real-time responses. For
example, when a customer clicks on a product, the recommendation engine immediately updates their personalized
suggestions without retraining the entire model. Incremental processing ensures that the system stays agile, responsive,
and efficient, even as new data flows in constantly.

4. Handling Time-series Data: Imagine a retail forecasting engine that adapts to your business in real time: as each
day’s sales roll in, the model instantly adjusts future demand predictions—no need to reprocess months of historical
data. With Data Distiller’s incremental processing, your forecasts stay accurate and up-to-date, ensuring you’re always
stocked for tomorrow’s trends without the heavy computational cost.

5. Updating Model Metrics: Imagine a retail company deploying a product recommendation model in production. As
customer behavior shifts over time, it’s crucial to ensure the model remains accurate. Using Data Distiller’s
incremental processing, the company can continuously track key performance metrics like accuracy, precision, and
recall as new customer interactions are processed. For example, if the model starts suggesting irrelevant products due
to seasonal changes or shifts in customer preferences, incremental checks for concept drift will flag the issue in real-
time. This enables the company to adjust the model quickly, maintaining the relevance and effectiveness of their
recommendations without needing to fully retrain the model or recalibrate metrics manually.
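
Returning to the feature-engineering example in point 1, here is a minimal sketch of an incremental count update using hypothetical customer_purchase_counts, transactions, and feature_checkpoint tables: only transactions newer than the checkpoint are counted, and the deltas are folded into the existing totals.

CREATE TABLE customer_purchase_counts_updated AS
SELECT COALESCE(prev.customer_id, delta.customer_id) AS customer_id,
       COALESCE(prev.purchase_count, 0) + COALESCE(delta.purchase_count, 0) AS purchase_count
FROM customer_purchase_counts prev
FULL OUTER JOIN (
    SELECT customer_id,
           COUNT(*) AS purchase_count
    FROM transactions
    WHERE ingest_ts > (SELECT MAX(last_processed_ts) FROM feature_checkpoint) -- only new rows
    GROUP BY customer_id
) delta
ON prev.customer_id = delta.customer_id;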

While incremental processing provides several benefits, it comes with challenges, especially for more complex models
or use cases:

1. Non-incremental models: Consider a financial institution using an XGBoost-based model to predict loan
defaults. While highly accurate, this tree-based model does not support incremental updates natively. When new
loan applications or repayment data arrive, the model must be retrained on the entire dataset to incorporate the
latest information. Although this can be computationally expensive, the retraining ensures that the model captures
the full complexity of interactions in the data, maintaining its high performance. For businesses relying on
models like XGBoost or ensemble methods, the investment in periodic retraining delivers more accurate, up-to-
date insights, critical for making informed decisions in high-stakes industries like finance.
2. Complex Feature Engineering: Imagine a healthcare analytics company using machine learning to predict
patient outcomes based on clinical data. Some features, such as the median patient recovery time or percentile
rankings of treatment effectiveness, depend on complex global patterns within the entire dataset. These features
can’t be updated incrementally because they require recalculating based on the full range of historical data. When
new patient data arrives, the model must access the entire dataset to accurately reflect shifts in the overall
distribution. While this process may be resource-intensive, it ensures that models continue to deliver precise and
reliable predictions by accounting for broader trends and patterns, critical in high-accuracy fields like healthcare.

3. Concept Drift: Imagine an online retail platform using machine learning to recommend products to customers.
Over time, customer preferences shift—new trends emerge, and old favorites fade away. This phenomenon,
known as concept drift, can cause the recommendation model to lose accuracy as the data patterns it was trained
on change. While incremental processing helps the model adapt to new data, it might not fully capture these
evolving trends. To prevent performance degradation, the platform employs continuous monitoring of the model,
tracking key metrics like accuracy and customer engagement. When concept drift is detected, the system triggers
a full retraining of the model, ensuring it stays aligned with the latest customer behaviors and keeps
recommendations relevant. This proactive approach maximizes both customer satisfaction and business
outcomes.

Data Sync in Dataset Activation

Picture a large enterprise using multiple data systems for sales, marketing, and customer support. Keeping these
systems in sync is critical for seamless operations, but transferring massive datasets repeatedly is inefficient. With Data
Distiller’s incremental processing, only the changes—new sales, updated customer profiles—are sent out on a
scheduled basis. This means the systems always stay up-to-date without the need for full data refreshes, ensuring
consistency across departments. By transferring only the relevant updates, Data Distiller optimizes data syncing,
reducing bandwidth usage and speeding up the flow of critical business information across platforms.

Case Study: Stock Price Monthly Data Analysis

The goal of our case study is to read the stock prices that were generated in the lab outlined in an earlier section. If you
have not done that lab, you can create the stock_price_table dataset by executing the following code. It will take about
20-30 minutes for the code to finish executing, so please be patient. We have to execute this code to simulate the
creation of snapshots.

BEGIN

-- Drop the table if it exists
DROP TABLE IF EXISTS stock_price_table;

-- Create an empty dataset via a contradiction
CREATE TABLE stock_price_table AS
SELECT CAST(NULL AS DATE) AS date,
       CAST(NULL AS DECIMAL(5,2)) AS stock_price
WHERE FALSE;

-- Insert for January 2025
INSERT INTO stock_price_table
SELECT date_add('2025-01-01', seq.i) AS date,
       CAST(30 + (RAND() * 30) AS DECIMAL(5,2)) AS stock_price
FROM (SELECT explode(sequence(0, 30)) AS i) seq
ORDER BY date;

-- Insert data for February 2025
INSERT INTO stock_price_table
SELECT date_add('2025-02-01', seq.i) AS date,
       CAST(30 + (RAND() * 30) AS DECIMAL(5,2)) AS stock_price
FROM (SELECT explode(sequence(0, 27)) AS i) seq -- February has 28 days in 2025
ORDER BY date;

-- Insert data
<mi>f</mi><mi>o</mi><mi>r</mi><mi>M</mi><mi>a</mi><mi>r</mi><mi>c</mi>
<mi>h</mi><mn>2025</mn><mi>I</mi><mi>N</mi><mi>S</mi><mi>E</mi><mi>R</mi>
<mi>T</mi><mi>I</mi><mi>N</mi><mi>T</mi><mi>O</mi><mi>s</mi><mi>t</mi>
<mi>o</mi><mi>c</mi><msub><mi>k</mi><mi>p</mi></msub><mi>r</mi><mi>i</mi>
<mi>c</mi><msub><mi>e</mi><mi>t</mi></msub><mi>a</mi><mi>b</mi><mi>l</mi>
<mi>e</mi><mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi><mi>T</mi>
<mi>d</mi><mi>a</mi><mi>t</mi><msub><mi>e</mi><mi>a</mi></msub><mi>d</mi>
<mi>d</mi><msup><mo stretchy="false">(</mo><mo mathvariant="normal"
lspace="0em" rspace="0em">′</mo></msup><mn>2025</mn><mo>−</mo><mn>03</mn>
<mo>−</mo><msup><mn>01</mn><mo mathvariant="normal" lspace="0em"
rspace="0em">′</mo></msup><mo separator="true">,</mo><mi>s</mi><mi>e</mi>
<mi>q</mi><mi mathvariant="normal">.</mi><mi>i</mi><mo stretchy="false">)</mo>
<mi>A</mi><mi>S</mi><mi>d</mi><mi>a</mi><mi>t</mi><mi>e</mi><mo
separator="true">,</mo><mi>C</mi><mi>A</mi><mi>S</mi><mi>T</mi><mo
stretchy="false">(</mo><mn>30</mn><mo>+</mo><mo stretchy="false">(</mo>
<mi>R</mi><mi>A</mi><mi>N</mi><mi>D</mi><mo stretchy="false">(</mo><mo
stretchy="false">)</mo><mo>∗</mo><mn>30</mn><mo stretchy="false">)</mo>
<mi>A</mi><mi>S</mi><mi>D</mi><mi>E</mi><mi>C</mi><mi>I</mi><mi>M</mi>
<mi>A</mi><mi>L</mi><mo stretchy="false">(</mo><mn>5</mn><mo separator="true">,
</mo><mn>2</mn><mo stretchy="false">)</mo><mo stretchy="false">)</mo><mi>A</mi>
<mi>S</mi><mi>s</mi><mi>t</mi><mi>o</mi><mi>c</mi><msub><mi>k</mi><mi>p</mi>
</msub><mi>r</mi><mi>i</mi><mi>c</mi><mi>e</mi><mi>F</mi><mi>R</mi><mi>O</mi>
<mi>M</mi><mo stretchy="false">(</mo><mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi>
<mi>C</mi><mi>T</mi><mi>e</mi><mi>x</mi><mi>p</mi><mi>l</mi><mi>o</mi>
<mi>d</mi><mi>e</mi><mo stretchy="false">(</mo><mi>s</mi><mi>e</mi><mi>q</mi>
<mi>u</mi><mi>e</mi><mi>n</mi><mi>c</mi><mi>e</mi><mo stretchy="false">(</mo>
<mn>0</mn><mo separator="true">,</mo><mn>30</mn><mo stretchy="false">)</mo><mo
stretchy="false">)</mo><mi>A</mi><mi>S</mi><mi>i</mi><mo stretchy="false">)
</mo><mi>s</mi><mi>e</mi><mi>q</mi><mo>−</mo><mo>−</mo><mi>M</mi><mi>a</mi>
<mi>r</mi><mi>c</mi><mi>h</mi><mi>h</mi><mi>a</mi><mi>s</mi><mn>31</mn>
<mi>d</mi><mi>a</mi><mi>y</mi><mi>s</mi><mi>O</mi><mi>R</mi><mi>D</mi>
<mi>E</mi><mi>R</mi><mi>B</mi><mi>Y</mi><mi>d</mi><mi>a</mi><mi>t</mi>
<mi>e</mi><mo separator="true">;</mo><mi>E</mi><mi>N</mi><mi>D</mi></mrow>
<annotation encoding="application/x-tex">
BEGIN
--Drop the table if it exists
DROP TABLE IF EXISTS stock_price_table;

--Create an empty dataset via a contradiction


CREATE TABLE stock_price_table AS
SELECT
CAST(NULL AS DATE) AS date,
CAST(NULL AS DECIMAL(5, 2)) AS stock_price
WHERE FALSE;

-- Insert for January 2025


INSERT INTO stock_price_table
SELECT
date_add(&#x27;2025-01-01&#x27;, seq.i) AS date,
CAST(30 + (RAND() * 30) AS DECIMAL(5, 2)) AS stock_price
FROM
(SELECT explode(sequence(0, 30)) AS i) seq
ORDER BY date;

--Insert data for February 2025


INSERT INTO stock_price_table
SELECT
date_add(&#x27;2025-02-01&#x27;, seq.i) AS date,
CAST(30 + (RAND() * 30) AS DECIMAL(5, 2)) AS stock_price
FROM
(SELECT explode(sequence(0, 27)) AS i) seq -- February has 28 days in 2025
ORDER BY date;

--Insert data for March 2025


INSERT INTO stock_price_table
SELECT
date_add(&#x27;2025-03-01&#x27;, seq.i) AS date,
CAST(30 + (RAND() * 30) AS DECIMAL(5, 2)) AS stock_price
FROM
(SELECT explode(sequence(0, 30)) AS i) seq -- March has 31 days
ORDER BY date;
END
</annotation></semantics></math></span><span class="katex-html" aria-
hidden="true"><span class="base"><span class="strut"
style="height:0.7667em;vertical-align:-0.0833em;"></span><span class="mord
mathnormal">BEG</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:0.9805em;vertical-align:-0.2861em;"></span><span
class="mord">−</span><span class="mord mathnormal">Dro</span><span class="mord
mathnormal">pt</span><span class="mord mathnormal">h</span><span class="mord
mathnormal">e</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">ab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">i</span><span class="mord mathnormal" style="margin-
right:0.10764em;">f</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">x</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">t</span><span
class="mord mathnormal" style="margin-right:0.02778em;">sD</span><span
class="mord mathnormal" style="margin-right:0.13889em;">ROPT</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05017em;">B</span><span class="mord mathnormal">L</span><span
class="mord mathnormal" style="margin-right:0.05764em;">E</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.07847em;">FEX</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.05764em;">STS</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">oc</span><span class="mord"><span class="mord
mathnormal" style="margin-right:0.03148em;">k</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.0315em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">p</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span>
</span></span></span></span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">c</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">ab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">e</span><span
class="mpunct">;</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord">−</span><span class="mspace" style="margin-
right:0.2222em;"></span><span class="mbin">−</span><span class="mspace"
style="margin-right:0.2222em;"></span></span><span class="base"><span
class="strut" style="height:1.0361em;vertical-align:-0.2861em;"></span><span
class="mord mathnormal" style="margin-right:0.07153em;">C</span><span
class="mord mathnormal">re</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">an</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">m</span><span class="mord mathnormal">pt</span><span
class="mord mathnormal" style="margin-right:0.03588em;">y</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">se</span><span class="mord mathnormal">t</span><span
class="mord mathnormal" style="margin-right:0.03588em;">v</span><span
class="mord mathnormal">iaa</span><span class="mord mathnormal">co</span><span
class="mord mathnormal">n</span><span class="mord mathnormal">t</span><span
class="mord mathnormal" style="margin-right:0.02778em;">r</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">c</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">n</span><span
class="mord mathnormal" style="margin-right:0.05764em;">CRE</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.13889em;">TET</span><span class="mord mathnormal">A</span><span
class="mord mathnormal" style="margin-right:0.05017em;">B</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.05764em;">E</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">oc</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.03148em;">k</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:-0.0315em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">p</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.2861em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">c</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">ab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">SSE</span><span class="mord mathnormal">L</span><span
class="mord mathnormal" style="margin-right:0.07153em;">ECTC</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ST</span><span class="mopen">(</span><span class="mord
mathnormal" style="margin-right:0.10903em;">N</span><span class="mord
mathnormal">ULL</span><span class="mord mathnormal">A</span><span class="mord
mathnormal" style="margin-right:0.05764em;">S</span><span class="mord
mathnormal" style="margin-right:0.02778em;">D</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">TE</span><span class="mclose">)</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">e</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal" style="margin-right:0.07153em;">C</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ST</span><span class="mopen">(</span><span class="mord
mathnormal" style="margin-right:0.10903em;">N</span><span class="mord
mathnormal">ULL</span><span class="mord mathnormal">A</span><span class="mord
mathnormal" style="margin-right:0.05764em;">S</span><span class="mord
mathnormal" style="margin-right:0.02778em;">D</span><span class="mord
mathnormal" style="margin-right:0.07153em;">EC</span><span class="mord
mathnormal" style="margin-right:0.07847em;">I</span><span class="mord
mathnormal" style="margin-right:0.10903em;">M</span><span class="mord
mathnormal">A</span><span class="mord mathnormal">L</span><span class="mopen">
(</span><span class="mord">5</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span
class="mord">2</span><span class="mclose">))</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">oc</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.03148em;">k</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:-0.0315em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">p</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.2861em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">ce</span><span class="mord mathnormal" style="margin-
right:0.13889em;">W</span><span class="mord mathnormal" style="margin-
right:0.08125em;">H</span><span class="mord mathnormal" style="margin-
right:0.13889em;">EREF</span><span class="mord mathnormal">A</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.05764em;">SE</span><span class="mpunct">;</span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord">−</span><span
class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:1.088em;vertical-align:-0.2861em;"></span><span class="mord
mathnormal" style="margin-right:0.07847em;">I</span><span class="mord
mathnormal">n</span><span class="mord mathnormal" style="margin-
right:0.02778em;">ser</span><span class="mord mathnormal">t</span><span
class="mord mathnormal" style="margin-right:0.10764em;">f</span><span
class="mord mathnormal" style="margin-right:0.02778em;">or</span><span
class="mord mathnormal" style="margin-right:0.09618em;">J</span><span
class="mord mathnormal">an</span><span class="mord mathnormal">u</span><span
class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-
right:0.03588em;">ry</span><span class="mord">2025</span><span class="mord
mathnormal" style="margin-right:0.07847em;">I</span><span class="mord
mathnormal" style="margin-right:0.13889em;">NSERT</span><span class="mord
mathnormal" style="margin-right:0.07847em;">I</span><span class="mord
mathnormal" style="margin-right:0.02778em;">NTO</span><span class="mord
mathnormal">s</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">oc</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.03148em;">k</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.0315em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">p</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span>
</span></span></span></span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">c</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">ab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">e</span><span
class="mord mathnormal" style="margin-right:0.05764em;">SE</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ECT</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord"><span class="mord mathnormal">e</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">a</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">dd</span><span
class="mopen"><span class="mopen">(</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span
class="mord">2025</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:0.7278em;vertical-align:-0.0833em;"></span><span
class="mord">01</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:1.0519em;vertical-align:-0.25em;"></span><span
class="mord">0</span><span class="mord"><span class="mord">1</span><span
class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mpunct">,
</span><span class="mspace" style="margin-right:0.1667em;"></span><span
class="mord mathnormal">se</span><span class="mord mathnormal" style="margin-
right:0.03588em;">q</span><span class="mord">.</span><span class="mord
mathnormal">i</span><span class="mclose">)</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">e</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal" style="margin-right:0.07153em;">C</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ST</span><span class="mopen">(</span><span
class="mord">30</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">+</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:1em;vertical-align:-0.25em;"></span><span class="mopen">(</span>
<span class="mord mathnormal" style="margin-right:0.00773em;">R</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mopen">(</span><span class="mclose">)
</span><span class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">∗</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:1.0361em;vertical-align:-0.2861em;"></span><span
class="mord">30</span><span class="mclose">)</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.07153em;">EC</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.10903em;">M</span><span class="mord mathnormal">A</span><span
class="mord mathnormal">L</span><span class="mopen">(</span><span
class="mord">5</span><span class="mpunct">,</span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord">2</span><span
class="mclose">))</span><span class="mord mathnormal">A</span><span class="mord
mathnormal" style="margin-right:0.05764em;">S</span><span class="mord
mathnormal">s</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">oc</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.03148em;">k</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.0315em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">p</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span>
</span></span></span></span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">ce</span><span class="mord mathnormal" style="margin-
right:0.10903em;">FROM</span><span class="mopen">(</span><span class="mord
mathnormal" style="margin-right:0.05764em;">SE</span><span class="mord
mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ECT</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">x</span><span class="mord mathnormal" style="margin-
right:0.01968em;">pl</span><span class="mord mathnormal">o</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">e</span><span
class="mopen">(</span><span class="mord mathnormal">se</span><span class="mord
mathnormal" style="margin-right:0.03588em;">q</span><span class="mord
mathnormal">u</span><span class="mord mathnormal">e</span><span class="mord
mathnormal">n</span><span class="mord mathnormal">ce</span><span class="mopen">
(</span><span class="mord">0</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span
class="mord">30</span><span class="mclose">))</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">i</span><span
class="mclose">)</span><span class="mord mathnormal">se</span><span class="mord
mathnormal" style="margin-right:0.00773em;">qOR</span><span class="mord
mathnormal" style="margin-right:0.02778em;">D</span><span class="mord
mathnormal" style="margin-right:0.05017em;">ERB</span><span class="mord
mathnormal" style="margin-right:0.22222em;">Y</span><span class="mord
mathnormal">d</span><span class="mord mathnormal">a</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">e</span><span
class="mpunct">;</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord">−</span><span class="mspace" style="margin-
right:0.2222em;"></span><span class="mbin">−</span><span class="mspace"
style="margin-right:0.2222em;"></span></span><span class="base"><span
class="strut" style="height:1.088em;vertical-align:-0.2861em;"></span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal">n</span><span class="mord mathnormal" style="margin-
right:0.02778em;">ser</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.10764em;">f</span><span
class="mord mathnormal" style="margin-right:0.02778em;">or</span><span
class="mord mathnormal" style="margin-right:0.13889em;">F</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">b</span><span
class="mord mathnormal" style="margin-right:0.02778em;">r</span><span
class="mord mathnormal">u</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.03588em;">ry</span><span
class="mord">2025</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.13889em;">NSERT</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.02778em;">NTO</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">oc</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.03148em;">k</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:-0.0315em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">p</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.2861em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">c</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">ab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">e</span><span
class="mord mathnormal" style="margin-right:0.05764em;">SE</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ECT</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord"><span class="mord mathnormal">e</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">a</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">dd</span><span
class="mopen"><span class="mopen">(</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span
class="mord">2025</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:0.7278em;vertical-align:-0.0833em;"></span><span
class="mord">02</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:1.0519em;vertical-align:-0.25em;"></span><span
class="mord">0</span><span class="mord"><span class="mord">1</span><span
class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mpunct">,
</span><span class="mspace" style="margin-right:0.1667em;"></span><span
class="mord mathnormal">se</span><span class="mord mathnormal" style="margin-
right:0.03588em;">q</span><span class="mord">.</span><span class="mord
mathnormal">i</span><span class="mclose">)</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">e</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal" style="margin-right:0.07153em;">C</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ST</span><span class="mopen">(</span><span
class="mord">30</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">+</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:1em;vertical-align:-0.25em;"></span><span class="mopen">(</span>
<span class="mord mathnormal" style="margin-right:0.00773em;">R</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mopen">(</span><span class="mclose">)
</span><span class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">∗</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:1.0361em;vertical-align:-0.2861em;"></span><span
class="mord">30</span><span class="mclose">)</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.07153em;">EC</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.10903em;">M</span><span class="mord mathnormal">A</span><span
class="mord mathnormal">L</span><span class="mopen">(</span><span
class="mord">5</span><span class="mpunct">,</span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord">2</span><span
class="mclose">))</span><span class="mord mathnormal">A</span><span class="mord
mathnormal" style="margin-right:0.05764em;">S</span><span class="mord
mathnormal">s</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">oc</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.03148em;">k</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.0315em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">p</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span>
</span></span></span></span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">ce</span><span class="mord mathnormal" style="margin-
right:0.10903em;">FROM</span><span class="mopen">(</span><span class="mord
mathnormal" style="margin-right:0.05764em;">SE</span><span class="mord
mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ECT</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">x</span><span class="mord mathnormal" style="margin-
right:0.01968em;">pl</span><span class="mord mathnormal">o</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">e</span><span
class="mopen">(</span><span class="mord mathnormal">se</span><span class="mord
mathnormal" style="margin-right:0.03588em;">q</span><span class="mord
mathnormal">u</span><span class="mord mathnormal">e</span><span class="mord
mathnormal">n</span><span class="mord mathnormal">ce</span><span class="mopen">
(</span><span class="mord">0</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span
class="mord">27</span><span class="mclose">))</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">i</span><span
class="mclose">)</span><span class="mord mathnormal">se</span><span class="mord
mathnormal" style="margin-right:0.03588em;">q</span><span class="mspace"
style="margin-right:0.2222em;"></span><span class="mbin">−</span><span
class="mspace" style="margin-right:0.2222em;"></span></span><span class="base">
<span class="strut" style="height:0.8889em;vertical-align:-0.1944em;"></span>
<span class="mord">−</span><span class="mord mathnormal" style="margin-
right:0.13889em;">F</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">b</span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">u</span><span
class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-
right:0.03588em;">ry</span><span class="mord mathnormal">ha</span><span
class="mord mathnormal">s</span><span class="mord">28</span><span class="mord
mathnormal">d</span><span class="mord mathnormal">a</span><span class="mord
mathnormal">ys</span><span class="mord mathnormal">in</span><span
class="mord">2025</span><span class="mord mathnormal" style="margin-
right:0.00773em;">OR</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.05017em;">ERB</span><span class="mord mathnormal" style="margin-
right:0.22222em;">Y</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">e</span><span class="mpunct">;</span><span
class="mspace" style="margin-right:0.1667em;"></span><span
class="mord">−</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:1.088em;vertical-align:-0.2861em;"></span><span class="mord
mathnormal" style="margin-right:0.07847em;">I</span><span class="mord
mathnormal">n</span><span class="mord mathnormal" style="margin-
right:0.02778em;">ser</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.10764em;">f</span><span
class="mord mathnormal" style="margin-right:0.02778em;">or</span><span
class="mord mathnormal" style="margin-right:0.10903em;">M</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">rc</span><span
class="mord mathnormal">h</span><span class="mord">2025</span><span class="mord
mathnormal" style="margin-right:0.07847em;">I</span><span class="mord
mathnormal" style="margin-right:0.13889em;">NSERT</span><span class="mord
mathnormal" style="margin-right:0.07847em;">I</span><span class="mord
mathnormal" style="margin-right:0.02778em;">NTO</span><span class="mord
mathnormal">s</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">oc</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.03148em;">k</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.0315em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">p</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span>
</span></span></span></span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">c</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">ab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">e</span><span
class="mord mathnormal" style="margin-right:0.05764em;">SE</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ECT</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord"><span class="mord mathnormal">e</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">a</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">dd</span><span
class="mopen"><span class="mopen">(</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span
class="mord">2025</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:0.7278em;vertical-align:-0.0833em;"></span><span
class="mord">03</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:1.0519em;vertical-align:-0.25em;"></span><span
class="mord">0</span><span class="mord"><span class="mord">1</span><span
class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mpunct">,
</span><span class="mspace" style="margin-right:0.1667em;"></span><span
class="mord mathnormal">se</span><span class="mord mathnormal" style="margin-
right:0.03588em;">q</span><span class="mord">.</span><span class="mord
mathnormal">i</span><span class="mclose">)</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">e</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal" style="margin-right:0.07153em;">C</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ST</span><span class="mopen">(</span><span
class="mord">30</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">+</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:1em;vertical-align:-0.25em;"></span><span class="mopen">(</span>
<span class="mord mathnormal" style="margin-right:0.00773em;">R</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mopen">(</span><span class="mclose">)
</span><span class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">∗</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:1.0361em;vertical-align:-0.2861em;"></span><span
class="mord">30</span><span class="mclose">)</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.07153em;">EC</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.10903em;">M</span><span class="mord mathnormal">A</span><span
class="mord mathnormal">L</span><span class="mopen">(</span><span
class="mord">5</span><span class="mpunct">,</span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord">2</span><span
class="mclose">))</span><span class="mord mathnormal">A</span><span class="mord
mathnormal" style="margin-right:0.05764em;">S</span><span class="mord
mathnormal">s</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">oc</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.03148em;">k</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.0315em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">p</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span>
</span></span></span></span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">ce</span><span class="mord mathnormal" style="margin-
right:0.10903em;">FROM</span><span class="mopen">(</span><span class="mord
mathnormal" style="margin-right:0.05764em;">SE</span><span class="mord
mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ECT</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">x</span><span class="mord mathnormal" style="margin-
right:0.01968em;">pl</span><span class="mord mathnormal">o</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">e</span><span
class="mopen">(</span><span class="mord mathnormal">se</span><span class="mord
mathnormal" style="margin-right:0.03588em;">q</span><span class="mord
mathnormal">u</span><span class="mord mathnormal">e</span><span class="mord
mathnormal">n</span><span class="mord mathnormal">ce</span><span class="mopen">
(</span><span class="mord">0</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span
class="mord">30</span><span class="mclose">))</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">i</span><span
class="mclose">)</span><span class="mord mathnormal">se</span><span class="mord
mathnormal" style="margin-right:0.03588em;">q</span><span class="mspace"
style="margin-right:0.2222em;"></span><span class="mbin">−</span><span
class="mspace" style="margin-right:0.2222em;"></span></span><span class="base">
<span class="strut" style="height:0.8889em;vertical-align:-0.1944em;"></span>
<span class="mord">−</span><span class="mord mathnormal" style="margin-
right:0.10903em;">M</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">rc</span><span class="mord mathnormal">hha</span><span
class="mord mathnormal">s</span><span class="mord">31</span><span class="mord
mathnormal">d</span><span class="mord mathnormal">a</span><span class="mord
mathnormal">ys</span><span class="mord mathnormal" style="margin-
right:0.00773em;">OR</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.05017em;">ERB</span><span class="mord mathnormal" style="margin-
right:0.22222em;">Y</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">e</span><span class="mpunct">;</span><span
class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal" style="margin-right:0.10903em;">EN</span><span class="mord
mathnormal" style="margin-right:0.02778em;">D</span></span></span></span>
</span>;

Our goal in today’s tutorial is to read ONE SNAPSHOT at a time from the stock price table, compute the sum of the
stock prices and the number of records for that snapshot, log them, and then come back next time to read the next
snapshot. If we find no new snapshots, we simply skip the execution. Ultimately, the per-snapshot sums in our fact
table will be rolled up into an overall average, along with the total record count across all snapshots.
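
To make the target concrete, the per-snapshot aggregate is just a sum and a count. Below is a minimal sketch of that aggregation, run here against the whole table since snapshot selection is introduced later in the tutorial:

-- Sum of stock prices and number of records; in the incremental job this will
-- be computed per snapshot rather than over the full table.
SELECT
SUM(stock_price) AS sum_stock_price,
COUNT(*) AS record_count
FROM stock_price_table;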

Our approach to incremental processing involves using checkpoint tables to systematically track and manage data
snapshots. This method ensures efficient and reliable processing of data updates while minimizing potential issues.
Here’s a more detailed explanation of our strategy:

1. Tracking Processed Snapshots: We will keep a comprehensive log of all snapshots that have already been
processed. This ensures that we only process new or unprocessed snapshots, avoiding redundant work and
allowing us to resume from the last known state in the event of a failure.

2. Processing Snapshots or Collections of Snapshots: Our logic will be designed to handle either a single
snapshot or a group of snapshots in each run. This flexibility allows us to adapt to varying data volumes and
processing needs, ensuring that all relevant data is processed, whether it arrives incrementally or in bulk.

3. Maintaining a Watermark: We will establish a watermark system to track the most recent snapshot that has
been successfully processed. By updating this watermark after each successful run, we ensure that we can resume
from the correct point in future runs, always starting from the next unprocessed snapshot.
4. Advantages of this Approach:

Resilience to Upstream Delays: One of the key benefits of this strategy is that we do not need to worry
about delays in upstream systems, such as those responsible for hydrating the stock_price_table.
Our checkpointing system will allow us to pick up where we left off, regardless of when new snapshots are
generated.

Error Handling and Recovery: If any errors occur during the processing of a snapshot, the job will
gracefully handle them. The subsequent runs will automatically pick up the failed or missed snapshot and
process it without requiring manual intervention, ensuring smooth recovery from failures.

By implementing this incremental processing strategy with checkpoint tables, we can ensure that our system is both
robust and adaptable, capable of handling upstream delays and job errors while maintaining data integrity and
minimizing reprocessing.
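
Conceptually, each run of the job follows the loop sketched below. This is an illustrative sketch only: the table and variable names anticipate the objects built in the rest of this tutorial, and the concrete, working SQL is developed step by step in the sections that follow.

BEGIN
-- 1. Read the watermark: the last snapshot this job processed successfully.
SET @last_snapshot_id = SELECT MAX(last_snapshot_id) FROM checkpoint_table WHERE job_name = 'Stock_Updates';

-- 2. Find snapshots of stock_price_table newer than the watermark; if there
--    are none, skip this run.

-- 3. Aggregate the new snapshot(s) and append the results to the fact table.

-- 4. On success, log the newest snapshot id back into checkpoint_table so the
--    next run resumes from there.
END;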

Define Canonical Schema for Checkpoint Table

The Checkpoint Table serves as a centralized logging mechanism to track the snapshots processed by various jobs. It
ensures that each job knows which snapshot it has processed and what the status of that processing was, allowing jobs
to resume processing from the correct snapshot during subsequent executions. This table is essential for managing job
checkpoints and ensuring the continuity of snapshot processing.

An important assumption with snapshots is that the history_meta function only provides snapshots for the past 7
days. If we need to retain this data for a longer period, we would need to set up a Data Distiller job that inserts
snapshots into the table every week. For the purpose of this tutorial, we’ll assume our job processes within the 7-day
window.

Although you can design your own checkpoint table based on your specific requirements, let’s explore a common
design pattern that is widely used in Data Distiller workflows. This pattern ensures efficient tracking and management
of job execution and snapshot processing, helping to maintain data consistency and allowing jobs to resume from the
correct state.

DROP TABLE IF EXISTS checkpoint_table;


CREATE TABLE checkpoint_table AS
SELECT
cast(NULL AS string) job_name,
cast(NULL AS string) job_status,
cast(NULL AS int) last_snapshot_id,
cast(NULL AS TIMESTAMP) job_timestamp
WHERE FALSE;

1. **job_name** (STRING, NOT NULL): Represents the name of the job that is processing the snapshot. Each
job can be identified uniquely by its name.

Example: snapshot_ingest_job, data_cleaning_job

Constraint: This field is part of the composite primary key, ensuring that each job’s checkpoint is uniquely
tracked.

2. **job_status** (STRING, NOT NULL): Stores the current status of the job, indicating whether the job
completed successfully or encountered an error.

Possible Values: 'SUCCESS', 'FAILED', 'RUNNING', 'PENDING'

Example: If the job completed successfully, the value would be 'SUCCESS'.


3. **last_snapshot_id** (INT, NOT NULL): The ID of the most recent snapshot processed by the job.
This allows the job to pick up from the correct snapshot in the next execution.

Constraint: This is part of the composite primary key, ensuring that each job can only log one record for
each snapshot.

4. **job_timestamp** (TIMESTAMP, NOT NULL): Captures the exact date and time when the job was last
run and processed the snapshot. This helps track the job’s execution over time.

Example: 2024-09-25 14:35:22

Use: Useful for monitoring and debugging, especially when tracking when the job processed specific
snapshots.
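
As a quick illustration of how this schema is used, here is a hedged sketch of the lookup a job would run at the start of each execution to find its current watermark (the job name is taken from the examples above):

-- Find the most recent snapshot this job has processed successfully; the next
-- run starts from the snapshot after this one.
SELECT MAX(last_snapshot_id) AS current_watermark
FROM checkpoint_table
WHERE job_name = 'snapshot_ingest_job'
AND job_status = 'SUCCESS';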

Create an Empty Output Table

We will be creating the output table into which we will be writing the processed data:

DROP TABLE IF EXISTS Stock_Updates_Incremental;


CREATE TABLE Stock_Updates_Incremental AS
SELECT
cast(NULL AS int) snapshot_id,
cast(NULL AS double) sum_stock_price,
cast(NULL AS int) record_count,
cast(NULL AS TIMESTAMP) fact_table_timestamp
WHERE FALSE;

The fields are described in the following way:

**snapshot_id**: Stores the ID of the snapshot currently being processed. We are storing it as an integer to
allow for arithmetic operations. However, you could also store it as a string and typecast when needed to perform
mathematical operations, if required.

**sum_stock_price**: Stores the sum of the stock prices from the snapshot, which is of type double (or
float depending on your system).

**record_count**: Stores the count of records processed for that snapshot, which is an integer.

**fact_table_timestamp**: Stores the timestamp when the processing of the snapshot occurred, which
is of type TIMESTAMP.
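
For illustration only, a single incremental run might append a row shaped like this (the values here are hypothetical):

INSERT INTO Stock_Updates_Incremental
SELECT
    cast(5 AS int) snapshot_id,
    cast(10432.75 AS double) sum_stock_price,
    cast(250 AS int) record_count,
    CURRENT_TIMESTAMP fact_table_timestamp;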

Initialize the Checkpoint Table

We insert an initial entry to initialize job metadata. The first row acts as the start of the job’s history in the log, which
can be referenced in future job executions.

INSERT INTO
    checkpoint_table
SELECT
    'Stock_Updates' job_name,
    'SUCCESS' job_status,
    cast(0 AS int) last_snapshot_id,
    CURRENT_TIMESTAMP job_timestamp;

Note that this table serves as a historical log of all job executions, which makes it useful for auditing. By inserting this
record, you start capturing each job run’s status, including its timestamp and, eventually, the snapshot ID it
processed; this row is the first log entry for the job. Also note the cast applied to
**last_snapshot_id**. We initialize it to 0 as the starting watermark, but you could instead query
history_meta to determine the appropriate starting snapshot explicitly, as sketched below.
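
For example, a minimal sketch of that inspection (the dataset name is illustrative) lists the snapshots currently available so you can choose a starting watermark explicitly:

SELECT *
FROM (SELECT history_meta('stock_price_table'))
ORDER BY snapshot_id;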

Variables in Anonymous Block

We will be utilizing several variables in this section of the incremental processing code. Variables are always declared
with an @ sign and are defined within an Anonymous Block, as their scope is limited to the lifetime of that block.
Here’s an example:

<span class="katex-display"><span class="katex"><span class="katex-mathml">


<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics>
<mrow><mi>B</mi><mi>E</mi><mi>G</mi><mi>I</mi><mi>N</mi><mi>S</mi><mi>E</mi>
<mi>T</mi><mi mathvariant="normal">@</mi><mi>M</mi><mi>A</mi><msub><mi>X</mi>
<mi>S</mi></msub><mi>T</mi><mi>O</mi><mi>C</mi><msub><mi>K</mi><mi>P</mi>
</msub><mi>R</mi><mi>I</mi><mi>C</mi><mi>E</mi><mo>=</mo><mi>S</mi><mi>E</mi>
<mi>L</mi><mi>E</mi><mi>C</mi><mi>T</mi><mi>M</mi><mi>A</mi><mi>X</mi><mo
stretchy="false">(</mo><mi>s</mi><mi>t</mi><mi>o</mi><mi>c</mi><msub><mi>k</mi>
<mi>p</mi></msub><mi>r</mi><mi>i</mi><mi>c</mi><mi>e</mi><mo stretchy="false">)
</mo><mi>F</mi><mi>R</mi><mi>O</mi><mi>M</mi><mi>s</mi><mi>t</mi><mi>o</mi>
<mi>c</mi><msub><mi>k</mi><mi>p</mi></msub><mi>r</mi><mi>i</mi><mi>c</mi><msub>
<mi>e</mi><mi>t</mi></msub><mi>a</mi><mi>b</mi><mi>l</mi><mi>e</mi><mo
separator="true">;</mo><mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi>
<mi>T</mi><mi mathvariant="normal">@</mi><mi>M</mi><mi>A</mi><msub><mi>X</mi>
<mi>S</mi></msub><mi>T</mi><mi>O</mi><mi>C</mi><msub><mi>K</mi><mi>P</mi>
</msub><mi>R</mi><mi>I</mi><mi>C</mi><mi>E</mi><mo separator="true">;</mo>
<mi>E</mi><mi>N</mi><mi>D</mi></mrow><annotation encoding="application/x-tex">
BEGIN
SET @MAX_STOCK_PRICE = SELECT MAX(stock_price) FROM stock_price_table;
SELECT @MAX_STOCK_PRICE;
END
</annotation></semantics></math></span><span class="katex-html" aria-
hidden="true"><span class="base"><span class="strut"
style="height:0.8444em;vertical-align:-0.15em;"></span><span class="mord
mathnormal">BEG</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.13889em;">NSET</span><span class="mord">@</span><span class="mord
mathnormal" style="margin-right:0.10903em;">M</span><span class="mord
mathnormal">A</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.07847em;">X</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3283em;"><span style="top:-2.55em;margin-left:-0.0785em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight"
style="margin-right:0.05764em;">S</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal" style="margin-right:0.07153em;">TOC</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.07153em;">K</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.3283em;"><span
style="top:-2.55em;margin-left:-0.0715em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight" style="margin-
right:0.13889em;">P</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord mathnormal"
style="margin-right:0.00773em;">R</span><span class="mord mathnormal"
style="margin-right:0.07847em;">I</span><span class="mord mathnormal"
style="margin-right:0.05764em;">CE</span><span class="mspace" style="margin-
right:0.2778em;"></span><span class="mrel">=</span><span class="mspace"
style="margin-right:0.2778em;"></span></span><span class="base"><span
class="strut" style="height:1.0361em;vertical-align:-0.2861em;"></span><span
class="mord mathnormal" style="margin-right:0.05764em;">SE</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.10903em;">ECTM</span><span class="mord mathnormal">A</span><span
class="mord mathnormal" style="margin-right:0.07847em;">X</span><span
class="mopen">(</span><span class="mord mathnormal">s</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">oc</span><span class="mord">
<span class="mord mathnormal" style="margin-right:0.03148em;">k</span><span
class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span
class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-
left:-0.0315em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">p</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.2861em;"><span></span></span></span></span></span></span><span
class="mord mathnormal" style="margin-right:0.02778em;">r</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">ce</span><span
class="mclose">)</span><span class="mord mathnormal" style="margin-
right:0.10903em;">FROM</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">oc</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.03148em;">k</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:-0.0315em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">p</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.2861em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">c</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">ab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">e</span><span
class="mpunct">;</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord mathnormal" style="margin-right:0.05764em;">SE</span>
<span class="mord mathnormal">L</span><span class="mord mathnormal"
style="margin-right:0.13889em;">ECT</span><span class="mord">@</span><span
class="mord mathnormal" style="margin-right:0.10903em;">M</span><span
class="mord mathnormal">A</span><span class="mord"><span class="mord
mathnormal" style="margin-right:0.07847em;">X</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3283em;"><span style="top:-2.55em;margin-left:-0.0785em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight"
style="margin-right:0.05764em;">S</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal" style="margin-right:0.07153em;">TOC</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.07153em;">K</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.3283em;"><span
style="top:-2.55em;margin-left:-0.0715em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight" style="margin-
right:0.13889em;">P</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord mathnormal"
style="margin-right:0.00773em;">R</span><span class="mord mathnormal"
style="margin-right:0.07847em;">I</span><span class="mord mathnormal"
style="margin-right:0.05764em;">CE</span><span class="mpunct">;</span><span
class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal" style="margin-right:0.10903em;">EN</span><span class="mord
mathnormal" style="margin-right:0.02778em;">D</span></span></span></span>
</span>;

In the block above, even though the result of the SET statement is not streamed to the UI, the variable **@MAX_STOCK_PRICE** is
accessible in any condition or parameter of the queries within that block.
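
To make this concrete, here is a minimal sketch (the expensive_stocks table is hypothetical) in which the block-scoped variable is reused as a parameter in a later statement of the same block:

BEGIN
    SET @MAX_STOCK_PRICE = SELECT MAX(stock_price) FROM stock_price_table;

    -- The variable set above parameterizes the predicate of this query.
    CREATE TABLE expensive_stocks AS
    SELECT *
    FROM stock_price_table
    WHERE stock_price >= 0.9 * @MAX_STOCK_PRICE;
END;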

Conditional Branching in Anonymous Block

In Data Distiller, you can use **IF THEN ELSE** semantics for conditional branching to control the flow of logic
based on variables or conditions. The key idea is that variables define the branching conditions, while each branch can
execute its own SQL code block.

Here’s a more structured example demonstrating how to implement **IF THEN ELSEIF ELSE END IF** logic with
SQL code blocks in each branch and how to use variables effectively in Data Distiller:

<span class="katex-display"><span class="katex"><span class="katex-mathml">


<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics>
<mrow><mi>B</mi><mi>E</mi><mi>G</mi><mi>I</mi><mi>N</mi><mo>−</mo><mo>−</mo>
<mi>S</mi><mi>t</mi><mi>e</mi><mi>p</mi><mn>1</mn><mo>:</mo><mi>S</mi>
<mi>e</mi><mi>t</mi><mi>i</mi><mi>n</mi><mi>i</mi><mi>t</mi><mi>i</mi>
<mi>a</mi><mi>l</mi><mi>v</mi><mi>a</mi><mi>r</mi><mi>i</mi><mi>a</mi>
<mi>b</mi><mi>l</mi><mi>e</mi><mi>s</mi><mi>S</mi><mi>E</mi><mi>T</mi><mi
mathvariant="normal">@</mi><mi>v</mi><mi>a</mi><mi>r</mi><mi>i</mi><mi>a</mi>
<mi>b</mi><mi>l</mi><msub><mi>e</mi><mi>A</mi></msub><mo>=</mo><mi>S</mi>
<mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi><mi>T</mi><mi>C</mi><mi>O</mi>
<mi>U</mi><mi>N</mi><mi>T</mi><mo stretchy="false">(</mo><mo>∗</mo><mo
stretchy="false">)</mo><mi>F</mi><mi>R</mi><mi>O</mi><mi>M</mi><mi>s</mi>
<mi>t</mi><mi>o</mi><mi>c</mi><msub><mi>k</mi><mi>p</mi></msub><mi>r</mi>
<mi>i</mi><mi>c</mi><msub><mi>e</mi><mi>t</mi></msub><mi>a</mi><mi>b</mi>
<mi>l</mi><mi>e</mi><mi>W</mi><mi>H</mi><mi>E</mi><mi>R</mi><mi>E</mi>
<mi>s</mi><mi>t</mi><mi>o</mi><mi>c</mi><msub><mi>k</mi><mi>p</mi></msub>
<mi>r</mi><mi>i</mi><mi>c</mi><mi>e</mi><mo>&gt;</mo><mn>10</mn><mo
separator="true">;</mo><mi>S</mi><mi>E</mi><mi>T</mi><mi
mathvariant="normal">@</mi><mi>v</mi><mi>a</mi><mi>r</mi><mi>i</mi><mi>a</mi>
<mi>b</mi><mi>l</mi><msub><mi>e</mi><mi>B</mi></msub><mo>=</mo><mi>S</mi>
<mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi><mi>T</mi><mi>A</mi><mi>V</mi>
<mi>G</mi><mo stretchy="false">(</mo><mi>s</mi><mi>t</mi><mi>o</mi><mi>c</mi>
<msub><mi>k</mi><mi>p</mi></msub><mi>r</mi><mi>i</mi><mi>c</mi><mi>e</mi><mo
stretchy="false">)</mo><mi>F</mi><mi>R</mi><mi>O</mi><mi>M</mi><mi>s</mi>
<mi>t</mi><mi>o</mi><mi>c</mi><msub><mi>k</mi><mi>p</mi></msub><mi>r</mi>
<mi>i</mi><mi>c</mi><msub><mi>e</mi><mi>t</mi></msub><mi>a</mi><mi>b</mi>
<mi>l</mi><mi>e</mi><mo separator="true">;</mo><mo>−</mo><mo>−</mo><mi>S</mi>
<mi>t</mi><mi>e</mi><mi>p</mi><mn>2</mn><mo>:</mo><mi>C</mi><mi>o</mi>
<mi>n</mi><mi>d</mi><mi>i</mi><mi>t</mi><mi>i</mi><mi>o</mi><mi>n</mi>
<mi>a</mi><mi>l</mi><mi>b</mi><mi>r</mi><mi>a</mi><mi>n</mi><mi>c</mi>
<mi>h</mi><mi>i</mi><mi>n</mi><mi>g</mi><mi>u</mi><mi>s</mi><mi>i</mi>
<mi>n</mi><mi>g</mi><mi>v</mi><mi>a</mi><mi>r</mi><mi>i</mi><mi>a</mi>
<mi>b</mi><mi>l</mi><mi>e</mi><mi>s</mi><mi>I</mi><mi>F</mi><mi
mathvariant="normal">@</mi><mi>v</mi><mi>a</mi><mi>r</mi><mi>i</mi><mi>a</mi>
<mi>b</mi><mi>l</mi><msub><mi>e</mi><mi>A</mi></msub><mo>&gt;</mo><mn>50</mn>
<mi>T</mi><mi>H</mi><mi>E</mi><mi>N</mi><mo>−</mo><mo>−</mo><mi>I</mi>
<mi>f</mi><mi>t</mi><mi>h</mi><mi>e</mi><mi>r</mi><mi>e</mi><mi>a</mi>
<mi>r</mi><mi>e</mi><mi>m</mi><mi>o</mi><mi>r</mi><mi>e</mi><mi>t</mi>
<mi>h</mi><mi>a</mi><mi>n</mi><mn>50</mn><mi>s</mi><mi>t</mi><mi>o</mi>
<mi>c</mi><mi>k</mi><mi>s</mi><mi>w</mi><mi>i</mi><mi>t</mi><mi>h</mi>
<mi>a</mi><mi>p</mi><mi>r</mi><mi>i</mi><mi>c</mi><mi>e</mi><mi>g</mi>
<mi>r</mi><mi>e</mi><mi>a</mi><mi>t</mi><mi>e</mi><mi>r</mi><mi>t</mi>
<mi>h</mi><mi>a</mi><mi>n</mi><mn>100</mn><mo>−</mo><mo>−</mo><mi>S</mi>
<mi>Q</mi><mi>L</mi><mi>c</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>b</mi>
<mi>l</mi><mi>o</mi><mi>c</mi><mi>k</mi><mi>e</mi><mi>x</mi><mi>e</mi>
<mi>c</mi><mi>u</mi><mi>t</mi><mi>e</mi><mi>d</mi><mi>w</mi><mi>h</mi>
<mi>e</mi><mi>n</mi><mi>c</mi><mi>o</mi><mi>n</mi><mi>d</mi><mi>i</mi>
<mi>t</mi><mi>i</mi><mi>o</mi><mi>n</mi><mi>i</mi><mi>s</mi><mi>t</mi>
<mi>r</mi><mi>u</mi><mi>e</mi><mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi>
<mi>C</mi><msup><mi>T</mi><mo mathvariant="normal" lspace="0em"
rspace="0em">′</mo></msup><mi>H</mi><mi>i</mi><mi>g</mi><mi>h</mi><mi>n</mi>
<mi>u</mi><mi>m</mi><mi>b</mi><mi>e</mi><mi>r</mi><mi>o</mi><mi>f</mi>
<mi>e</mi><mi>x</mi><mi>p</mi><mi>e</mi><mi>n</mi><mi>s</mi><mi>i</mi>
<mi>v</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>o</mi><mi>c</mi><mi>k</mi><msup>
<mi>s</mi><mo mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup>
<mi>A</mi><mi>S</mi><mi>c</mi><mi>o</mi><mi>n</mi><mi>d</mi><mi>i</mi>
<mi>t</mi><mi>i</mi><mi>o</mi><msub><mi>n</mi><mi>m</mi></msub><mi>e</mi>
<mi>t</mi><mo separator="true">,</mo><mi mathvariant="normal">@</mi><mi>v</mi>
<mi>a</mi><mi>r</mi><mi>i</mi><mi>a</mi><mi>b</mi><mi>l</mi><msub><mi>e</mi>
<mi>A</mi></msub><mi>A</mi><mi>S</mi><mi>s</mi><mi>t</mi><mi>o</mi><mi>c</mi>
<msub><mi>k</mi><mi>c</mi></msub><mi>o</mi><mi>u</mi><mi>n</mi><mi>t</mi><mo
separator="true">;</mo><mi>E</mi><mi>L</mi><mi>S</mi><mi>E</mi><mi>I</mi>
<mi>F</mi><mi mathvariant="normal">@</mi><mi>v</mi><mi>a</mi><mi>r</mi>
<mi>i</mi><mi>a</mi><mi>b</mi><mi>l</mi><msub><mi>e</mi><mi>B</mi></msub>
<mo>&gt;</mo><mn>150</mn><mi>T</mi><mi>H</mi><mi>E</mi><mi>N</mi><mo>−</mo>
<mo>−</mo><mi>I</mi><mi>f</mi><mi>t</mi><mi>h</mi><mi>e</mi><mi>a</mi>
<mi>v</mi><mi>e</mi><mi>r</mi><mi>a</mi><mi>g</mi><mi>e</mi><mi>s</mi>
<mi>t</mi><mi>o</mi><mi>c</mi><mi>k</mi><mi>p</mi><mi>r</mi><mi>i</mi>
<mi>c</mi><mi>e</mi><mi>i</mi><mi>s</mi><mi>g</mi><mi>r</mi><mi>e</mi>
<mi>a</mi><mi>t</mi><mi>e</mi><mi>r</mi><mi>t</mi><mi>h</mi><mi>a</mi>
<mi>n</mi><mn>150</mn><mo>−</mo><mo>−</mo><mi>S</mi><mi>Q</mi><mi>L</mi>
<mi>c</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>b</mi><mi>l</mi><mi>o</mi>
<mi>c</mi><mi>k</mi><mi>e</mi><mi>x</mi><mi>e</mi><mi>c</mi><mi>u</mi>
<mi>t</mi><mi>e</mi><mi>d</mi><mi>w</mi><mi>h</mi><mi>e</mi><mi>n</mi>
<mi>c</mi><mi>o</mi><mi>n</mi><mi>d</mi><mi>i</mi><mi>t</mi><mi>i</mi>
<mi>o</mi><mi>n</mi><mi>i</mi><mi>s</mi><mi>t</mi><mi>r</mi><mi>u</mi>
<mi>e</mi><mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi><msup><mi>T</mi><mo
mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mi>A</mi>
<mi>v</mi><mi>e</mi><mi>r</mi><mi>a</mi><mi>g</mi><mi>e</mi><mi>s</mi>
<mi>t</mi><mi>o</mi><mi>c</mi><mi>k</mi><mi>p</mi><mi>r</mi><mi>i</mi>
<mi>c</mi><mi>e</mi><mi>i</mi><mi>s</mi><mi>h</mi><mi>i</mi><mi>g</mi><msup>
<mi>h</mi><mo mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup>
<mi>A</mi><mi>S</mi><mi>c</mi><mi>o</mi><mi>n</mi><mi>d</mi><mi>i</mi>
<mi>t</mi><mi>i</mi><mi>o</mi><msub><mi>n</mi><mi>m</mi></msub><mi>e</mi>
<mi>t</mi><mo separator="true">,</mo><mi mathvariant="normal">@</mi><mi>v</mi>
<mi>a</mi><mi>r</mi><mi>i</mi><mi>a</mi><mi>b</mi><mi>l</mi><msub><mi>e</mi>
<mi>B</mi></msub><mi>A</mi><mi>S</mi><mi>a</mi><mi>v</mi><mi>e</mi><mi>r</mi>
<mi>a</mi><mi>g</mi><msub><mi>e</mi><mi>p</mi></msub><mi>r</mi><mi>i</mi>
<mi>c</mi><mi>e</mi><mo separator="true">;</mo><mi>E</mi><mi>L</mi><mi>S</mi>
<mi>E</mi><mo>−</mo><mo>−</mo><mi>D</mi><mi>e</mi><mi>f</mi><mi>a</mi>
<mi>u</mi><mi>l</mi><mi>t</mi><mi>c</mi><mi>a</mi><mi>s</mi><mi>e</mi>
<mi>w</mi><mi>h</mi><mi>e</mi><mi>n</mi><mi>n</mi><mi>o</mi><mi>n</mi>
<mi>e</mi><mi>o</mi><mi>f</mi><mi>t</mi><mi>h</mi><mi>e</mi><mi>p</mi>
<mi>r</mi><mi>e</mi><mi>v</mi><mi>i</mi><mi>o</mi><mi>u</mi><mi>s</mi>
<mi>c</mi><mi>o</mi><mi>n</mi><mi>d</mi><mi>i</mi><mi>t</mi><mi>i</mi>
<mi>o</mi><mi>n</mi><mi>s</mi><mi>a</mi><mi>r</mi><mi>e</mi><mi>m</mi>
<mi>e</mi><mi>t</mi><mo>−</mo><mo>−</mo><mi>S</mi><mi>Q</mi><mi>L</mi>
<mi>c</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>b</mi><mi>l</mi><mi>o</mi>
<mi>c</mi><mi>k</mi><mi>e</mi><mi>x</mi><mi>e</mi><mi>c</mi><mi>u</mi>
<mi>t</mi><mi>e</mi><mi>d</mi><mi>w</mi><mi>h</mi><mi>e</mi><mi>n</mi>
<mi>a</mi><mi>l</mi><mi>l</mi><mi>c</mi><mi>o</mi><mi>n</mi><mi>d</mi>
<mi>i</mi><mi>t</mi><mi>i</mi><mi>o</mi><mi>n</mi><mi>s</mi><mi>a</mi>
<mi>r</mi><mi>e</mi><mi>f</mi><mi>a</mi><mi>l</mi><mi>s</mi><mi>e</mi>
<mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi><mi>T</mi><mi>r</mi>
<mi>a</mi><mi>i</mi><mi>s</mi><msub><mi>e</mi><mi>e</mi></msub><mi>r</mi>
<mi>r</mi><mi>o</mi><mi>r</mi><msup><mo stretchy="false">(</mo><mo
mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mi>N</mi>
<mi>e</mi><mi>i</mi><mi>t</mi><mi>h</mi><mi>e</mi><mi>r</mi><mi>c</mi>
<mi>o</mi><mi>n</mi><mi>d</mi><mi>i</mi><mi>t</mi><mi>i</mi><mi>o</mi>
<mi>n</mi><mi>m</mi><mi>e</mi><mi>t</mi><mo separator="true">,</mo><mi>c</mi>
<mi>h</mi><mi>e</mi><mi>c</mi><mi>k</mi><mi>t</mi><mi>h</mi><mi>e</mi>
<mi>d</mi><mi>a</mi><mi>t</mi><msup><mi>a</mi><mo mathvariant="normal"
lspace="0em" rspace="0em">′</mo></msup><mo stretchy="false">)</mo><mo
separator="true">;</mo><mi>E</mi><mi>N</mi><mi>D</mi><mi>I</mi><mi>F</mi><mo
separator="true">;</mo><mi>E</mi><mi>N</mi><mi>D</mi></mrow><annotation
encoding="application/x-tex">
BEGIN
-- Step 1: Set initial variables
SET @variable_A = SELECT COUNT(*) FROM stock_price_table WHERE stock_price
&gt; 10;
SET @variable_B = SELECT AVG(stock_price) FROM stock_price_table;

-- Step 2: Conditional branching using variables


IF @variable_A &gt; 50 THEN
-- If there are more than 50 stocks with a price greater than 100
-- SQL code block executed when condition is true
SELECT &#x27;High number of expensive stocks&#x27; AS condition_met,
@variable_A AS stock_count;

ELSEIF @variable_B &gt; 150 THEN


-- If the average stock price is greater than 150
-- SQL code block executed when condition is true
SELECT &#x27;Average stock price is high&#x27; AS condition_met,
@variable_B AS average_price;

ELSE
-- Default case when none of the previous conditions are met
-- SQL code block executed when all conditions are false
SELECT raise_error(&#x27;Neither condition met, check the data&#x27;);

END IF;
END
</annotation></semantics></math></span><span class="katex-html" aria-
hidden="true"><span class="base"><span class="strut"
style="height:0.7667em;vertical-align:-0.0833em;"></span><span class="mord
mathnormal">BEG</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:0.8778em;vertical-align:-0.1944em;"></span><span
class="mord">−</span><span class="mord mathnormal">St</span><span class="mord
mathnormal">e</span><span class="mord mathnormal">p</span><span
class="mord">1</span><span class="mspace" style="margin-right:0.2778em;">
</span><span class="mrel">:</span><span class="mspace" style="margin-
right:0.2778em;"></span></span><span class="base"><span class="strut"
style="height:0.8444em;vertical-align:-0.15em;"></span><span class="mord
mathnormal" style="margin-right:0.05764em;">S</span><span class="mord
mathnormal">e</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">ini</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">ia</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal" style="margin-
right:0.03588em;">v</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.02778em;">r</span><span
class="mord mathnormal">iab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">es</span><span
class="mord mathnormal" style="margin-right:0.13889em;">SET</span><span
class="mord">@</span><span class="mord mathnormal" style="margin-
right:0.03588em;">v</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.02778em;">r</span><span
class="mord mathnormal">iab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3283em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">A</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=
</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span
class="base"><span class="strut" style="height:1.0361em;vertical-
align:-0.2861em;"></span><span class="mord mathnormal" style="margin-
right:0.05764em;">SE</span><span class="mord mathnormal">L</span><span
class="mord mathnormal" style="margin-right:0.02778em;">ECTCO</span><span
class="mord mathnormal" style="margin-right:0.10903em;">U</span><span
class="mord mathnormal" style="margin-right:0.13889em;">NT</span><span
class="mopen">(</span><span class="mord">∗</span><span class="mclose">)</span>
<span class="mord mathnormal" style="margin-right:0.10903em;">FROM</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">oc</span><span class="mord"><span class="mord
mathnormal" style="margin-right:0.03148em;">k</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.0315em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">p</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span>
</span></span></span></span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">c</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">ab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">e</span><span
class="mord mathnormal" style="margin-right:0.13889em;">W</span><span
class="mord mathnormal" style="margin-right:0.08125em;">H</span><span
class="mord mathnormal" style="margin-right:0.05764em;">ERE</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">oc</span><span class="mord"><span class="mord
mathnormal" style="margin-right:0.03148em;">k</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.0315em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">p</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span>
</span></span></span></span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">ce</span><span class="mspace" style="margin-
right:0.2778em;"></span><span class="mrel">&gt;</span><span class="mspace"
style="margin-right:0.2778em;"></span></span><span class="base"><span
class="strut" style="height:0.8889em;vertical-align:-0.1944em;"></span><span
class="mord">10</span><span class="mpunct">;</span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord mathnormal"
style="margin-right:0.13889em;">SET</span><span class="mord">@</span><span
class="mord mathnormal" style="margin-right:0.03588em;">v</span><span
class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">iab</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord"><span class="mord mathnormal">e</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3283em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight"
style="margin-right:0.05017em;">B</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=
</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span
class="base"><span class="strut" style="height:1.0361em;vertical-
align:-0.2861em;"></span><span class="mord mathnormal" style="margin-
right:0.05764em;">SE</span><span class="mord mathnormal">L</span><span
class="mord mathnormal" style="margin-right:0.13889em;">ECT</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.22222em;">V</span><span class="mord mathnormal">G</span><span
class="mopen">(</span><span class="mord mathnormal">s</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">oc</span><span class="mord">
<span class="mord mathnormal" style="margin-right:0.03148em;">k</span><span
class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span
class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-
left:-0.0315em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">p</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.2861em;"><span></span></span></span></span></span></span><span
class="mord mathnormal" style="margin-right:0.02778em;">r</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">ce</span><span
class="mclose">)</span><span class="mord mathnormal" style="margin-
right:0.10903em;">FROM</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">oc</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.03148em;">k</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:-0.0315em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">p</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.2861em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">c</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">ab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">e</span><span
class="mpunct">;</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord">−</span><span class="mspace" style="margin-
right:0.2222em;"></span><span class="mbin">−</span><span class="mspace"
style="margin-right:0.2222em;"></span></span><span class="base"><span
class="strut" style="height:0.8778em;vertical-align:-0.1944em;"></span><span
class="mord mathnormal">St</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">p</span><span class="mord">2</span><span class="mspace"
style="margin-right:0.2778em;"></span><span class="mrel">:</span><span
class="mspace" style="margin-right:0.2778em;"></span></span><span class="base">
<span class="strut" style="height:0.8889em;vertical-align:-0.1944em;"></span>
<span class="mord mathnormal" style="margin-right:0.07153em;">C</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">n</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">na</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord mathnormal">b</span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">an</span><span
class="mord mathnormal">c</span><span class="mord mathnormal">hin</span><span
class="mord mathnormal">gu</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">in</span><span class="mord mathnormal" style="margin-
right:0.03588em;">gv</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.02778em;">r</span><span
class="mord mathnormal">iab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">es</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.13889em;">F</span><span
class="mord">@</span><span class="mord mathnormal" style="margin-
right:0.03588em;">v</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.02778em;">r</span><span
class="mord mathnormal">iab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3283em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">A</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">&gt;
</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span
class="base"><span class="strut" style="height:0.7667em;vertical-
align:-0.0833em;"></span><span class="mord">50</span><span class="mord
mathnormal" style="margin-right:0.13889em;">T</span><span class="mord
mathnormal" style="margin-right:0.08125em;">H</span><span class="mord
mathnormal" style="margin-right:0.10903em;">EN</span><span class="mspace"
style="margin-right:0.2222em;"></span><span class="mbin">−</span><span
class="mspace" style="margin-right:0.2222em;"></span></span><span class="base">
<span class="strut" style="height:0.8889em;vertical-align:-0.1944em;"></span>
<span class="mord">−</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.10764em;">f</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">h</span><span class="mord mathnormal">ere</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">re</span><span
class="mord mathnormal">m</span><span class="mord mathnormal">ore</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">han</span><span
class="mord">50</span><span class="mord mathnormal">s</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">oc</span><span class="mord
mathnormal" style="margin-right:0.03148em;">k</span><span class="mord
mathnormal">s</span><span class="mord mathnormal" style="margin-
right:0.02691em;">w</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">ha</span><span
class="mord mathnormal">p</span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">ce</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord mathnormal">re</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal" style="margin-right:0.02778em;">er</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">han</span><span
class="mord">100</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:0.9963em;vertical-align:-0.1944em;"></span><span
class="mord">−</span><span class="mord mathnormal">SQ</span><span class="mord
mathnormal">L</span><span class="mord mathnormal">co</span><span class="mord
mathnormal">d</span><span class="mord mathnormal">e</span><span class="mord
mathnormal">b</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">oc</span><span
class="mord mathnormal" style="margin-right:0.03148em;">k</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">x</span><span
class="mord mathnormal">ec</span><span class="mord mathnormal">u</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">d</span><span class="mord mathnormal" style="margin-
right:0.02691em;">w</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">n</span><span
class="mord mathnormal">co</span><span class="mord mathnormal">n</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">ni</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">t</span><span
class="mord mathnormal" style="margin-right:0.02778em;">r</span><span
class="mord mathnormal">u</span><span class="mord mathnormal">e</span><span
class="mord mathnormal" style="margin-right:0.05764em;">SE</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.07153em;">EC</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.13889em;">T</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mord
mathnormal" style="margin-right:0.08125em;">H</span><span class="mord
mathnormal">i</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord mathnormal">hn</span><span
class="mord mathnormal">u</span><span class="mord mathnormal">mb</span><span
class="mord mathnormal">ero</span><span class="mord mathnormal" style="margin-
right:0.10764em;">f</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">x</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">n</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">i</span><span
class="mord mathnormal" style="margin-right:0.03588em;">v</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">oc</span><span class="mord mathnormal" style="margin-
right:0.03148em;">k</span><span class="mord"><span class="mord
mathnormal">s</span><span class="msupsub"><span class="vlist-t"><span
class="vlist-r"><span class="vlist" style="height:0.8019em;"><span
style="top:-3.113em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mtight"><span class="mord mtight">′</span></span></span>
</span></span></span></span></span></span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">co</span><span
class="mord mathnormal">n</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">o</span><span
class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">m</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">e</span><span
class="mord mathnormal">t</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span
class="mord">@</span><span class="mord mathnormal" style="margin-
right:0.03588em;">v</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.02778em;">r</span><span
class="mord mathnormal">iab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3283em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">A</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">oc</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.03148em;">k</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:-0.0315em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">c</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.15em;"><span></span></span></span></span></span>
</span><span class="mord mathnormal">o</span><span class="mord
mathnormal">u</span><span class="mord mathnormal">n</span><span class="mord
mathnormal">t</span><span class="mpunct">;</span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord mathnormal"
style="margin-right:0.05764em;">E</span><span class="mord mathnormal">L</span>
<span class="mord mathnormal" style="margin-right:0.05764em;">SE</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.13889em;">F</span><span
class="mord">@</span><span class="mord mathnormal" style="margin-
right:0.03588em;">v</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.02778em;">r</span><span
class="mord mathnormal">iab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3283em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight" style="margin-right:0.05017em;">B</span>
</span></span></span><span class="vlist-s">​
</span></span><span class="vlist-r">
<span class="vlist" style="height:0.15em;"><span></span></span></span></span>
</span></span><span class="mspace" style="margin-right:0.2778em;"></span><span
class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em;">
</span></span><span class="base"><span class="strut"
style="height:0.7667em;vertical-align:-0.0833em;"></span><span
class="mord">150</span><span class="mord mathnormal" style="margin-
right:0.13889em;">T</span><span class="mord mathnormal" style="margin-
right:0.08125em;">H</span><span class="mord mathnormal" style="margin-
right:0.10903em;">EN</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:0.8889em;vertical-align:-0.1944em;"></span><span
class="mord">−</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.10764em;">f</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">h</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-
right:0.03588em;">v</span><span class="mord mathnormal" style="margin-
right:0.02778em;">er</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.03588em;">g</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">oc</span><span class="mord mathnormal" style="margin-
right:0.03148em;">k</span><span class="mord mathnormal">p</span><span
class="mord mathnormal" style="margin-right:0.02778em;">r</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">ce</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">s</span><span
class="mord mathnormal" style="margin-right:0.03588em;">g</span><span
class="mord mathnormal">re</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord mathnormal" style="margin-
right:0.02778em;">er</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">han</span><span class="mord">150</span><span
class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:1.088em;vertical-align:-0.2861em;"></span><span
class="mord">−</span><span class="mord mathnormal">SQ</span><span class="mord
mathnormal">L</span><span class="mord mathnormal">co</span><span class="mord
mathnormal">d</span><span class="mord mathnormal">e</span><span class="mord
mathnormal">b</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">oc</span><span
class="mord mathnormal" style="margin-right:0.03148em;">k</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">x</span><span
class="mord mathnormal">ec</span><span class="mord mathnormal">u</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">d</span><span class="mord mathnormal" style="margin-
right:0.02691em;">w</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">n</span><span
class="mord mathnormal">co</span><span class="mord mathnormal">n</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">ni</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">t</span><span
class="mord mathnormal" style="margin-right:0.02778em;">r</span><span
class="mord mathnormal">u</span><span class="mord mathnormal">e</span><span
class="mord mathnormal" style="margin-right:0.05764em;">SE</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.07153em;">EC</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.13889em;">T</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.03588em;">v</span><span class="mord mathnormal" style="margin-
right:0.02778em;">er</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.03588em;">g</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">oc</span><span class="mord mathnormal" style="margin-
right:0.03148em;">k</span><span class="mord mathnormal">p</span><span
class="mord mathnormal" style="margin-right:0.02778em;">r</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">ce</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">hi</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord"><span class="mord
mathnormal">h</span><span class="msupsub"><span class="vlist-t"><span
class="vlist-r"><span class="vlist" style="height:0.8019em;"><span
style="top:-3.113em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mtight"><span class="mord mtight">′</span></span></span>
</span></span></span></span></span></span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">co</span><span
class="mord mathnormal">n</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">o</span><span
class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">m</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">e</span><span
class="mord mathnormal">t</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span
class="mord">@</span><span class="mord mathnormal" style="margin-
right:0.03588em;">v</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.02778em;">r</span><span
class="mord mathnormal">iab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3283em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight" style="margin-right:0.05017em;">B</span>
</span></span></span><span class="vlist-s">​
</span></span><span class="vlist-r">
<span class="vlist" style="height:0.15em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal">A</span><span class="mord
mathnormal" style="margin-right:0.05764em;">S</span><span class="mord
mathnormal">a</span><span class="mord mathnormal" style="margin-
right:0.03588em;">v</span><span class="mord mathnormal" style="margin-
right:0.02778em;">er</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.03588em;">g</span><span
class="mord"><span class="mord mathnormal">e</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">p</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span>
</span></span></span></span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">ce</span><span class="mpunct">;</span><span
class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal" style="margin-right:0.05764em;">E</span><span class="mord
mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.05764em;">SE</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:0.8889em;vertical-align:-0.1944em;"></span><span
class="mord">−</span><span class="mord mathnormal">De</span><span class="mord
mathnormal" style="margin-right:0.10764em;">f</span><span class="mord
mathnormal">a</span><span class="mord mathnormal">u</span><span class="mord
mathnormal">lt</span><span class="mord mathnormal">c</span><span class="mord
mathnormal">a</span><span class="mord mathnormal">se</span><span class="mord
mathnormal" style="margin-right:0.02691em;">w</span><span class="mord
mathnormal">h</span><span class="mord mathnormal">e</span><span class="mord
mathnormal">nn</span><span class="mord mathnormal">o</span><span class="mord
mathnormal">n</span><span class="mord mathnormal">eo</span><span class="mord
mathnormal" style="margin-right:0.10764em;">f</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">h</span><span class="mord
mathnormal">e</span><span class="mord mathnormal">p</span><span class="mord
mathnormal">re</span><span class="mord mathnormal" style="margin-
right:0.03588em;">v</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">u</span><span
class="mord mathnormal">sco</span><span class="mord mathnormal">n</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">n</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">re</span><span class="mord mathnormal">m</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">t</span><span
class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:1.0519em;vertical-align:-0.25em;"></span><span
class="mord">−</span><span class="mord mathnormal">SQ</span><span class="mord
mathnormal">L</span><span class="mord mathnormal">co</span><span class="mord
mathnormal">d</span><span class="mord mathnormal">e</span><span class="mord
mathnormal">b</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">oc</span><span
class="mord mathnormal" style="margin-right:0.03148em;">k</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">x</span><span
class="mord mathnormal">ec</span><span class="mord mathnormal">u</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">d</span><span class="mord mathnormal" style="margin-
right:0.02691em;">w</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">na</span><span
class="mord mathnormal" style="margin-right:0.01968em;">ll</span><span
class="mord mathnormal">co</span><span class="mord mathnormal">n</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">n</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">re</span><span class="mord mathnormal" style="margin-
right:0.10764em;">f</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord mathnormal">se</span><span class="mord mathnormal" style="margin-
right:0.05764em;">SE</span><span class="mord mathnormal">L</span><span
class="mord mathnormal" style="margin-right:0.13889em;">ECT</span><span
class="mord mathnormal" style="margin-right:0.02778em;">r</span><span
class="mord mathnormal">ai</span><span class="mord mathnormal">s</span><span
class="mord"><span class="mord mathnormal">e</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">e</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">rror</span><span class="mopen"><span class="mopen">(</span>
<span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span
class="vlist" style="height:0.8019em;"><span style="top:-3.113em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span
class="mord mtight">′</span></span></span></span></span></span></span></span>
</span><span class="mord mathnormal" style="margin-right:0.10903em;">N</span>
<span class="mord mathnormal">e</span><span class="mord mathnormal">i</span>
<span class="mord mathnormal">t</span><span class="mord mathnormal">h</span>
<span class="mord mathnormal">erco</span><span class="mord mathnormal">n</span>
<span class="mord mathnormal">d</span><span class="mord mathnormal">i</span>
<span class="mord mathnormal">t</span><span class="mord mathnormal">i</span>
<span class="mord mathnormal">o</span><span class="mord mathnormal">nm</span>
<span class="mord mathnormal">e</span><span class="mord mathnormal">t</span>
<span class="mpunct">,</span><span class="mspace" style="margin-
right:0.1667em;"></span><span class="mord mathnormal">c</span><span class="mord
mathnormal">h</span><span class="mord mathnormal">ec</span><span class="mord
mathnormal" style="margin-right:0.03148em;">k</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">h</span><span class="mord
mathnormal">e</span><span class="mord mathnormal">d</span><span class="mord
mathnormal">a</span><span class="mord mathnormal">t</span><span class="mord">
<span class="mord mathnormal">a</span><span class="msupsub"><span class="vlist-
t"><span class="vlist-r"><span class="vlist" style="height:0.8019em;"><span
style="top:-3.113em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mtight"><span class="mord mtight">′</span></span></span>
</span></span></span></span></span></span><span class="mclose">)</span><span
class="mpunct">;</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord mathnormal" style="margin-right:0.10903em;">EN</span>
<span class="mord mathnormal" style="margin-right:0.02778em;">D</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.13889em;">F</span><span
class="mpunct">;</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord mathnormal" style="margin-right:0.10903em;">EN</span>
<span class="mord mathnormal" style="margin-right:0.02778em;">D</span></span>
</span></span></span>;
Prototyping Anonymous Block Code

Before diving into creating SQL code blocks within an Anonymous Block, it’s essential to prototype each individual
block to ensure they are functionally correct. Once you place the code inside an Anonymous Block, debugging
becomes a tedious process. You’ll constantly need to check **Queries -> Log** to sift through each query log
and find which one failed and what the error was. Since errors can cause a cascading effect, debugging becomes even
more challenging.

Keep in mind that the queries below will eventually need to be parameterized using variables, but since variables are
only supported in Anonymous Blocks, you won’t be able to use them directly here. The same applies to any
conditional branching code we’ve covered earlier. In these cases, you’ll need to manually simulate the logic by
assuming fixed values.
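
For example, here is a minimal sketch of this simulation, borrowing the stock_meta table and the @from_snapshot_id variable that appear in the steps below; while prototyping, the variable is simply replaced with a fixed value such as 0:

-- Inside the Anonymous Block (later), the value comes from a variable:
--   SET @from_snapshot_id = SELECT last_snapshot_id FROM checkpoint_table ...;
-- While prototyping outside the block, simulate it with a fixed value:
SELECT snapshot_id
FROM stock_meta
WHERE snapshot_id > 0; -- 0 stands in for @from_snapshot_id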

1. Let us retrieve the last successfully processed snapshot ID from the checkpoint table:

SELECT last_snapshot_id
FROM checkpoint_table
WHERE job_name = 'Stock_Updates'
  AND job_status = 'SUCCESSFUL'
ORDER BY job_timestamp DESC
LIMIT 1;

The result should be:
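
If the checkpoint table is empty at this point, the query simply returns no rows. Here is a hedged sketch for seeding a test row so the prototype has something to return; the column names are taken from the checkpoint inserts later in this lesson, and the value 0 is just a placeholder:

INSERT INTO checkpoint_table
SELECT 'Stock_Updates' AS job_name,
       'SUCCESSFUL' AS job_status,
       0 AS last_snapshot_id,                              -- placeholder seed value
       CAST(CURRENT_TIMESTAMP AS TIMESTAMP) AS job_timestamp;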

2. Let’s retrieve the snapshot from the input table that has not yet been processed. Our goal is to select the snapshot
ID that comes right after the one we processed last time. Execute the following SQL code blocks one at a time.
You can experiment with different values for **from_snapshot_id**, which is set to 0 in the example
below. For each value, such as 1, 2, or 3, it will return the next snapshot in the sequence. Notice that we are
creating a temporary table using the **TEMP** command to make the table easily accessible. Keep in mind that
this temporary table will only exist for the duration of the Anonymous Block, unlike regular temp tables which
persist for the duration of the session.

CREATE TABLE IF NOT EXISTS stock_meta AS SELECT * FROM (SELECT history_meta('stock_price_table'));

SELECT snapshot_id
FROM (SELECT snapshot_id
      FROM stock_meta
      WHERE snapshot_id > 0                 -- from_snapshot_id
      ORDER BY ABS(snapshot_id - 0) ASC     -- from_snapshot_id
      LIMIT 1);

The result will be:

3. Execute the following function:

SELECT CURRENT_TIMESTAMP;

This is used to timestamp the output of the fact table we are creating and is also recorded in the checkpoint log table as a proxy for the time the job was processed. The function returns a string that must be cast to the TIMESTAMP data type. Keep in mind that this timestamp is only a proxy: the actual job finishes later, after the results are written and the cluster is shut down, and since we don't have access to the exact timestamps of those internal processes, this proxy is a reasonable substitute.
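
As a minimal sketch of that cast (the job_timestamp alias is only illustrative):

SELECT CAST(CURRENT_TIMESTAMP AS TIMESTAMP) AS job_timestamp;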

4. Let us prototype the query that will give us the aggregations we are looking for:

SELECT SUM(stock_price) AS sum_stock_price,
       COUNT(*) AS record_count
FROM stock_price_table SNAPSHOT BETWEEN 1 AND 2;

The answer will be the following:
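
If you are unsure which snapshot IDs are valid bounds for SNAPSHOT BETWEEN, a quick check against the stock_meta table created in the earlier step can help (this is just a prototyping aid, not part of the final block):

SELECT snapshot_id FROM stock_meta ORDER BY snapshot_id;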

Incremental Processing with Checkpoint Tables


Now that we have verified all the elements, we can put it all together. Observe the use of the various variables and the conditional branching logic. There are several INSERTs happening, i.e., into both the checkpoint table and the output table.

<span class="katex-display"><span class="katex"><span class="katex-mathml">


<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics>
<mrow><mi>B</mi><mi>E</mi><mi>G</mi><mi>I</mi><mi>N</mi><mo>−</mo><mo>−</mo>
<mi>G</mi><mi>e</mi><mi>t</mi><mi>t</mi><mi>h</mi><mi>e</mi><mi>l</mi>
<mi>a</mi><mi>s</mi><mi>t</mi><mi>p</mi><mi>r</mi><mi>o</mi><mi>c</mi>
<mi>e</mi><mi>s</mi><mi>s</mi><mi>e</mi><mi>d</mi><mi>s</mi><mi>n</mi>
<mi>a</mi><mi>p</mi><mi>s</mi><mi>h</mi><mi>o</mi><mi>t</mi><mi>I</mi>
<mi>D</mi><mi>f</mi><mi>r</mi><mi>o</mi><mi>m</mi><mi>t</mi><mi>h</mi>
<mi>e</mi><mi>c</mi><mi>h</mi><mi>e</mi><mi>c</mi><mi>k</mi><mi>p</mi>
<mi>o</mi><mi>i</mi><mi>n</mi><mi>t</mi><mi>t</mi><mi>a</mi><mi>b</mi>
<mi>l</mi><mi>e</mi><mo separator="true">,</mo><mi>o</mi><mi>r</mi><mi>u</mi>
<mi>s</mi><msup><mi>e</mi><mo mathvariant="normal" lspace="0em"
rspace="0em">′</mo></msup><mi>H</mi><mi>E</mi><mi>A</mi><msup><mi>D</mi><mo
mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mi>i</mi>
<mi>f</mi><mi>n</mi><mi>o</mi><mi>n</mi><mi>e</mi><mi>e</mi><mi>x</mi>
<mi>i</mi><mi>s</mi><mi>t</mi><mi>s</mi><mi>S</mi><mi>E</mi><mi>T</mi><mi
mathvariant="normal">@</mi><mi>f</mi><mi>r</mi><mi>o</mi><msub><mi>m</mi>
<mi>s</mi></msub><mi>n</mi><mi>a</mi><mi>p</mi><mi>s</mi><mi>h</mi><mi>o</mi>
<msub><mi>t</mi><mi>i</mi></msub><mi>d</mi><mo>=</mo><mi>S</mi><mi>E</mi>
<mi>L</mi><mi>E</mi><mi>C</mi><mi>T</mi><mi>l</mi><mi>a</mi><mi>s</mi><msub>
<mi>t</mi><mi>s</mi></msub><mi>n</mi><mi>a</mi><mi>p</mi><mi>s</mi><mi>h</mi>
<mi>o</mi><msub><mi>t</mi><mi>i</mi></msub><mi>d</mi><mi>F</mi><mi>R</mi>
<mi>O</mi><mi>M</mi><mi>c</mi><mi>h</mi><mi>e</mi><mi>c</mi><mi>k</mi>
<mi>p</mi><mi>o</mi><mi>i</mi><mi>n</mi><msub><mi>t</mi><mi>t</mi></msub>
<mi>a</mi><mi>b</mi><mi>l</mi><mi>e</mi><mi>W</mi><mi>H</mi><mi>E</mi>
<mi>R</mi><mi>E</mi><mi>j</mi><mi>o</mi><msub><mi>b</mi><mi>n</mi></msub>
<mi>a</mi><mi>m</mi><mi>e</mi><msup><mo>=</mo><mo mathvariant="normal"
lspace="0em" rspace="0em">′</mo></msup><mi>S</mi><mi>t</mi><mi>o</mi><mi>c</mi>
<msub><mi>k</mi><mi>U</mi></msub><mi>p</mi><mi>d</mi><mi>a</mi><mi>t</mi>
<mi>e</mi><msup><mi>s</mi><mo mathvariant="normal" lspace="0em"
rspace="0em">′</mo></msup><mi>A</mi><mi>N</mi><mi>D</mi><mi>j</mi><mi>o</mi>
<msub><mi>b</mi><mi>s</mi></msub><mi>t</mi><mi>a</mi><mi>t</mi><mi>u</mi>
<mi>s</mi><msup><mo>=</mo><mo mathvariant="normal" lspace="0em"
rspace="0em">′</mo></msup><mi>S</mi><mi>U</mi><mi>C</mi><mi>C</mi><mi>E</mi>
<mi>S</mi><mi>S</mi><mi>F</mi><mi>U</mi><msup><mi>L</mi><mo
mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mi>O</mi>
<mi>R</mi><mi>D</mi><mi>E</mi><mi>R</mi><mi>B</mi><mi>Y</mi><mi>j</mi>
<mi>o</mi><msub><mi>b</mi><mi>t</mi></msub><mi>i</mi><mi>m</mi><mi>e</mi>
<mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi><mi>D</mi><mi>E</mi>
<mi>S</mi><mi>C</mi><mi>L</mi><mi>I</mi><mi>M</mi><mi>I</mi><mi>T</mi>
<mn>1</mn><mo separator="true">;</mo><mi>C</mi><mi>R</mi><mi>E</mi><mi>A</mi>
<mi>T</mi><mi>E</mi><mi>T</mi><mi>A</mi><mi>B</mi><mi>L</mi><mi>E</mi>
<mi>I</mi><mi>F</mi><mi>N</mi><mi>O</mi><mi>T</mi><mi>E</mi><mi>X</mi>
<mi>I</mi><mi>S</mi><mi>T</mi><mi>S</mi><mi>s</mi><mi>t</mi><mi>o</mi>
<mi>c</mi><msub><mi>k</mi><mi>m</mi></msub><mi>e</mi><mi>t</mi><mi>a</mi>
<mi>A</mi><mi>S</mi><mi>s</mi><mi>e</mi><mi>l</mi><mi>e</mi><mi>c</mi>
<mi>t</mi><mo>∗</mo><mi>f</mi><mi>r</mi><mi>o</mi><mi>m</mi><mo
stretchy="false">(</mo><mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi>
<mi>T</mi><mi>h</mi><mi>i</mi><mi>s</mi><mi>t</mi><mi>o</mi><mi>r</mi><msub>
<mi>y</mi><mi>m</mi></msub><mi>e</mi><mi>t</mi><mi>a</mi><msup><mo
stretchy="false">(</mo><mo mathvariant="normal" lspace="0em"
rspace="0em">′</mo></msup><mi>s</mi><mi>t</mi><mi>o</mi><mi>c</mi><msub>
<mi>k</mi><mi>p</mi></msub><mi>r</mi><mi>i</mi><mi>c</mi><msub><mi>e</mi>
<mi>t</mi></msub><mi>a</mi><mi>b</mi><mi>l</mi><msup><mi>e</mi><mo
mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mo
stretchy="false">)</mo><mo stretchy="false">)</mo><mo separator="true">;</mo>
<mo>−</mo><mo>−</mo><mi>G</mi><mi>e</mi><mi>t</mi><mi>t</mi><mi>h</mi>
<mi>e</mi><mi>n</mi><mi>e</mi><mi>x</mi><mi>t</mi><mi>s</mi><mi>n</mi>
<mi>a</mi><mi>p</mi><mi>s</mi><mi>h</mi><mi>o</mi><mi>t</mi><mi>I</mi>
<mi>D</mi><mi>i</mi><mi>n</mi><mi>s</mi><mi>e</mi><mi>q</mi><mi>u</mi>
<mi>e</mi><mi>n</mi><mi>c</mi><mi>e</mi><mo separator="true">,</mo><mi>a</mi>
<mi>f</mi><mi>t</mi><mi>e</mi><mi>r</mi><mi>t</mi><mi>h</mi><mi>e</mi>
<mi>l</mi><mi>a</mi><mi>s</mi><mi>t</mi><mi>p</mi><mi>r</mi><mi>o</mi>
<mi>c</mi><mi>e</mi><mi>s</mi><mi>s</mi><mi>e</mi><mi>d</mi><mi>s</mi>
<mi>n</mi><mi>a</mi><mi>p</mi><mi>s</mi><mi>h</mi><mi>o</mi><mi>t</mi>
<mi>S</mi><mi>E</mi><mi>T</mi><mi mathvariant="normal">@</mi><mi>t</mi><msub>
<mi>o</mi><mi>s</mi></msub><mi>n</mi><mi>a</mi><mi>p</mi><mi>s</mi><mi>h</mi>
<mi>o</mi><msub><mi>t</mi><mi>i</mi></msub><mi>d</mi><mo>=</mo><mi>S</mi>
<mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi><mi>T</mi><mi>s</mi><mi>n</mi>
<mi>a</mi><mi>p</mi><mi>s</mi><mi>h</mi><mi>o</mi><msub><mi>t</mi><mi>i</mi>
</msub><mi>d</mi><mi>F</mi><mi>R</mi><mi>O</mi><mi>M</mi><mo stretchy="false">
(</mo><mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi><mi>T</mi><mi>s</mi>
<mi>n</mi><mi>a</mi><mi>p</mi><mi>s</mi><mi>h</mi><mi>o</mi><msub><mi>t</mi>
<mi>i</mi></msub><mi>d</mi><mi>F</mi><mi>R</mi><mi>O</mi><mi>M</mi><mi>s</mi>
<mi>t</mi><mi>o</mi><mi>c</mi><msub><mi>k</mi><mi>m</mi></msub><mi>e</mi>
<mi>t</mi><mi>a</mi><mi>W</mi><mi>H</mi><mi>E</mi><mi>R</mi><mi>E</mi>
<mi>s</mi><mi>n</mi><mi>a</mi><mi>p</mi><mi>s</mi><mi>h</mi><mi>o</mi><msub>
<mi>t</mi><mi>i</mi></msub><mi>d</mi><mo>&gt;</mo><mi
mathvariant="normal">@</mi><mi>f</mi><mi>r</mi><mi>o</mi><msub><mi>m</mi>
<mi>s</mi></msub><mi>n</mi><mi>a</mi><mi>p</mi><mi>s</mi><mi>h</mi><mi>o</mi>
<msub><mi>t</mi><mi>i</mi></msub><mi>d</mi><mi>O</mi><mi>R</mi><mi>D</mi>
<mi>E</mi><mi>R</mi><mi>B</mi><mi>Y</mi><mi>A</mi><mi>B</mi><mi>S</mi><mo
stretchy="false">(</mo><mi>s</mi><mi>n</mi><mi>a</mi><mi>p</mi><mi>s</mi>
<mi>h</mi><mi>o</mi><msub><mi>t</mi><mi>i</mi></msub><mi>d</mi><mo>−</mo><mi
mathvariant="normal">@</mi><mi>f</mi><mi>r</mi><mi>o</mi><msub><mi>m</mi>
<mi>s</mi></msub><mi>n</mi><mi>a</mi><mi>p</mi><mi>s</mi><mi>h</mi><mi>o</mi>
<msub><mi>t</mi><mi>i</mi></msub><mi>d</mi><mo stretchy="false">)</mo>
<mi>A</mi><mi>S</mi><mi>C</mi><mi>L</mi><mi>I</mi><mi>M</mi><mi>I</mi>
<mi>T</mi><mn>1</mn><mo stretchy="false">)</mo><mo separator="true">;</mo>
<mo>−</mo><mo>−</mo><mi>I</mi><mi>f</mi><mi>a</mi><mi>n</mi><mi>e</mi>
<mi>w</mi><mi>s</mi><mi>n</mi><mi>a</mi><mi>p</mi><mi>s</mi><mi>h</mi>
<mi>o</mi><mi>t</mi><mi>i</mi><mi>s</mi><mi>a</mi><mi>v</mi><mi>a</mi>
<mi>i</mi><mi>l</mi><mi>a</mi><mi>b</mi><mi>l</mi><mi>e</mi><mo
stretchy="false">(</mo><mi>i</mi><mi mathvariant="normal">.</mi><mi>e</mi><mi
mathvariant="normal">.</mi><mo separator="true">,</mo><mi
mathvariant="normal">@</mi><mi>t</mi><msub><mi>o</mi><mi>s</mi></msub>
<mi>n</mi><mi>a</mi><mi>p</mi><mi>s</mi><mi>h</mi><mi>o</mi><msub><mi>t</mi>
<mi>i</mi></msub><mi>d</mi><mo>&gt;</mo><mn>0</mn><mo stretchy="false">)</mo>
<mo separator="true">,</mo><mi>r</mi><mi>e</mi><mi>c</mi><mi>o</mi><mi>r</mi>
<mi>d</mi><mi>t</mi><mi>h</mi><mi>e</mi><mi>c</mi><mi>u</mi><mi>r</mi>
<mi>r</mi><mi>e</mi><mi>n</mi><mi>t</mi><mi>t</mi><mi>i</mi><mi>m</mi>
<mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi><mi>I</mi>
<mi>F</mi><mi mathvariant="normal">@</mi><mi>t</mi><msub><mi>o</mi><mi>s</mi>
</msub><mi>n</mi><mi>a</mi><mi>p</mi><mi>s</mi><mi>h</mi><mi>o</mi><msub>
<mi>t</mi><mi>i</mi></msub><mi>d</mi><mo>&gt;</mo><mi
mathvariant="normal">@</mi><mi>f</mi><mi>r</mi><mi>o</mi><msub><mi>m</mi>
<mi>s</mi></msub><mi>n</mi><mi>a</mi><mi>p</mi><mi>s</mi><mi>h</mi><mi>o</mi>
<msub><mi>t</mi><mi>i</mi></msub><mi>d</mi><mi>T</mi><mi>H</mi><mi>E</mi>
<mi>N</mi><mi>S</mi><mi>E</mi><mi>T</mi><mi mathvariant="normal">@</mi>
<mi>l</mi><mi>a</mi><mi>s</mi><msub><mi>t</mi><mi>u</mi></msub><mi>p</mi>
<mi>d</mi><mi>a</mi><mi>t</mi><mi>e</mi><msub><mi>d</mi><mi>t</mi></msub>
<mi>i</mi><mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi>
<mi>p</mi><mo>=</mo><mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi>
<mi>T</mi><mi>C</mi><mi>U</mi><mi>R</mi><mi>R</mi><mi>E</mi><mi>N</mi><msub>
<mi>T</mi><mi>T</mi></msub><mi>I</mi><mi>M</mi><mi>E</mi><mi>S</mi><mi>T</mi>
<mi>A</mi><mi>M</mi><mi>P</mi><mo separator="true">;</mo><mo>−</mo><mo>−</mo>
<mi>I</mi><mi>n</mi><mi>s</mi><mi>e</mi><mi>r</mi><mi>t</mi><mi>t</mi>
<mi>h</mi><mi>e</mi><mi>s</mi><mi>u</mi><mi>m</mi><mi>o</mi><mi>f</mi>
<mi>s</mi><mi>t</mi><mi>o</mi><mi>c</mi><mi>k</mi><mi>p</mi><mi>r</mi>
<mi>i</mi><mi>c</mi><mi>e</mi><mi>s</mi><mi>a</mi><mi>n</mi><mi>d</mi>
<mi>c</mi><mi>o</mi><mi>u</mi><mi>n</mi><mi>t</mi><mi>o</mi><mi>f</mi>
<mi>r</mi><mi>e</mi><mi>c</mi><mi>o</mi><mi>r</mi><mi>d</mi><mi>s</mi>
<mi>f</mi><mi>o</mi><mi>r</mi><mi>t</mi><mi>h</mi><mi>e</mi><mi>n</mi>
<mi>e</mi><mi>x</mi><mi>t</mi><mi>s</mi><mi>n</mi><mi>a</mi><mi>p</mi>
<mi>s</mi><mi>h</mi><mi>o</mi><mi>t</mi><mi>i</mi><mi>n</mi><mi>t</mi>
<mi>o</mi><mi>t</mi><mi>h</mi><mi>e</mi><mi>i</mi><mi>n</mi><mi>c</mi>
<mi>r</mi><mi>e</mi><mi>m</mi><mi>e</mi><mi>n</mi><mi>t</mi><mi>a</mi>
<mi>l</mi><mi>t</mi><mi>a</mi><mi>b</mi><mi>l</mi><mi>e</mi><mi>I</mi>
<mi>N</mi><mi>S</mi><mi>E</mi><mi>R</mi><mi>T</mi><mi>I</mi><mi>N</mi>
<mi>T</mi><mi>O</mi><mi>S</mi><mi>t</mi><mi>o</mi><mi>c</mi><msub><mi>k</mi>
<mi>U</mi></msub><mi>p</mi><mi>d</mi><mi>a</mi><mi>t</mi><mi>e</mi><msub>
<mi>s</mi><mi>I</mi></msub><mi>n</mi><mi>c</mi><mi>r</mi><mi>e</mi><mi>m</mi>
<mi>e</mi><mi>n</mi><mi>t</mi><mi>a</mi><mi>l</mi><mi>S</mi><mi>E</mi>
<mi>L</mi><mi>E</mi><mi>C</mi><mi>T</mi><mi mathvariant="normal">@</mi>
<mi>t</mi><msub><mi>o</mi><mi>s</mi></msub><mi>n</mi><mi>a</mi><mi>p</mi>
<mi>s</mi><mi>h</mi><mi>o</mi><msub><mi>t</mi><mi>i</mi></msub><mi>d</mi>
<mi>A</mi><mi>S</mi><mi>s</mi><mi>n</mi><mi>a</mi><mi>p</mi><mi>s</mi>
<mi>h</mi><mi>o</mi><msub><mi>t</mi><mi>i</mi></msub><mi>d</mi><mo
separator="true">,</mo><mi>S</mi><mi>U</mi><mi>M</mi><mo stretchy="false">
(</mo><mi>s</mi><mi>t</mi><mi>o</mi><mi>c</mi><msub><mi>k</mi><mi>p</mi></msub>
<mi>r</mi><mi>i</mi><mi>c</mi><mi>e</mi><mo stretchy="false">)</mo><mi>A</mi>
<mi>S</mi><mi>s</mi><mi>u</mi><msub><mi>m</mi><mi>s</mi></msub><mi>t</mi>
<mi>o</mi><mi>c</mi><msub><mi>k</mi><mi>p</mi></msub><mi>r</mi><mi>i</mi>
<mi>c</mi><mi>e</mi><mo separator="true">,</mo><mi>C</mi><mi>O</mi><mi>U</mi>
<mi>N</mi><mi>T</mi><mo stretchy="false">(</mo><mo>∗</mo><mo stretchy="false">)
</mo><mi>A</mi><mi>S</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi>o</mi><mi>r</mi>
<msub><mi>d</mi><mi>c</mi></msub><mi>o</mi><mi>u</mi><mi>n</mi><mi>t</mi><mo
separator="true">,</mo><mi>c</mi><mi>a</mi><mi>s</mi><mi>t</mi><mo
stretchy="false">(</mo><mi mathvariant="normal">@</mi><mi>l</mi><mi>a</mi>
<mi>s</mi><msub><mi>t</mi><mi>u</mi></msub><mi>p</mi><mi>d</mi><mi>a</mi>
<mi>t</mi><mi>e</mi><msub><mi>d</mi><mi>t</mi></msub><mi>i</mi><mi>m</mi>
<mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi><mi>A</mi>
<mi>S</mi><mi>T</mi><mi>I</mi><mi>M</mi><mi>E</mi><mi>S</mi><mi>T</mi>
<mi>A</mi><mi>M</mi><mi>P</mi><mo stretchy="false">)</mo><mi>A</mi><mi>S</mi>
<mi>f</mi><mi>a</mi><mi>c</mi><msub><mi>t</mi><mi>t</mi></msub><mi>a</mi>
<mi>b</mi><mi>l</mi><msub><mi>e</mi><mi>t</mi></msub><mi>i</mi><mi>m</mi>
<mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi><mi>F</mi>
<mi>R</mi><mi>O</mi><mi>M</mi><mi>s</mi><mi>t</mi><mi>o</mi><mi>c</mi><msub>
<mi>k</mi><mi>p</mi></msub><mi>r</mi><mi>i</mi><mi>c</mi><msub><mi>e</mi>
<mi>t</mi></msub><mi>a</mi><mi>b</mi><mi>l</mi><mi>e</mi><mi>S</mi><mi>N</mi>
<mi>A</mi><mi>P</mi><mi>S</mi><mi>H</mi><mi>O</mi><mi>T</mi><mi>B</mi>
<mi>E</mi><mi>T</mi><mi>W</mi><mi>E</mi><mi>E</mi><mi>N</mi><mi
mathvariant="normal">@</mi><mi>f</mi><mi>r</mi><mi>o</mi><msub><mi>m</mi>
<mi>s</mi></msub><mi>n</mi><mi>a</mi><mi>p</mi><mi>s</mi><mi>h</mi><mi>o</mi>
<msub><mi>t</mi><mi>i</mi></msub><mi>d</mi><mi>A</mi><mi>N</mi><mi>D</mi><mi
mathvariant="normal">@</mi><mi>t</mi><msub><mi>o</mi><mi>s</mi></msub>
<mi>n</mi><mi>a</mi><mi>p</mi><mi>s</mi><mi>h</mi><mi>o</mi><msub><mi>t</mi>
<mi>i</mi></msub><mi>d</mi><mo separator="true">;</mo><mo>−</mo><mo>−</mo>
<mi>L</mi><mi>o</mi><mi>g</mi><mi>t</mi><mi>h</mi><mi>e</mi><mi>s</mi>
<mi>u</mi><mi>c</mi><mi>c</mi><mi>e</mi><mi>s</mi><mi>s</mi><mi>f</mi>
<mi>u</mi><mi>l</mi><mi>j</mi><mi>o</mi><mi>b</mi><mi>e</mi><mi>x</mi>
<mi>e</mi><mi>c</mi><mi>u</mi><mi>t</mi><mi>i</mi><mi>o</mi><mi>n</mi>
<mi>i</mi><mi>n</mi><mi>t</mi><mi>h</mi><mi>e</mi><mi>c</mi><mi>h</mi>
<mi>e</mi><mi>c</mi><mi>k</mi><mi>p</mi><mi>o</mi><mi>i</mi><mi>n</mi>
<mi>t</mi><mi>t</mi><mi>a</mi><mi>b</mi><mi>l</mi><mi>e</mi><mi>I</mi>
<mi>N</mi><mi>S</mi><mi>E</mi><mi>R</mi><mi>T</mi><mi>I</mi><mi>N</mi>
<mi>T</mi><mi>O</mi><mi>c</mi><mi>h</mi><mi>e</mi><mi>c</mi><mi>k</mi>
<mi>p</mi><mi>o</mi><mi>i</mi><mi>n</mi><msub><mi>t</mi><mi>t</mi></msub>
<mi>a</mi><mi>b</mi><mi>l</mi><mi>e</mi><mi>S</mi><mi>E</mi><mi>L</mi>
<mi>E</mi><mi>C</mi><msup><mi>T</mi><mo mathvariant="normal" lspace="0em"
rspace="0em">′</mo></msup><mi>S</mi><mi>t</mi><mi>o</mi><mi>c</mi><msub>
<mi>k</mi><mi>U</mi></msub><mi>p</mi><mi>d</mi><mi>a</mi><mi>t</mi><mi>e</mi>
<msup><mi>s</mi><mo mathvariant="normal" lspace="0em" rspace="0em">′</mo>
</msup><mi>A</mi><mi>S</mi><mi>j</mi><mi>o</mi><msub><mi>b</mi><mi>n</mi>
</msub><mi>a</mi><mi>m</mi><mi>e</mi><msup><mo separator="true">,</mo><mo
mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mi>S</mi>
<mi>U</mi><mi>C</mi><mi>C</mi><mi>E</mi><mi>S</mi><mi>S</mi><mi>F</mi>
<mi>U</mi><msup><mi>L</mi><mo mathvariant="normal" lspace="0em"
rspace="0em">′</mo></msup><mi>A</mi><mi>S</mi><mi>j</mi><mi>o</mi><msub>
<mi>b</mi><mi>s</mi></msub><mi>t</mi><mi>a</mi><mi>t</mi><mi>u</mi><mi>s</mi>
<mo separator="true">,</mo><mi mathvariant="normal">@</mi><mi>t</mi><msub>
<mi>o</mi><mi>s</mi></msub><mi>n</mi><mi>a</mi><mi>p</mi><mi>s</mi><mi>h</mi>
<mi>o</mi><msub><mi>t</mi><mi>i</mi></msub><mi>d</mi><mi>A</mi><mi>S</mi>
<mi>l</mi><mi>a</mi><mi>s</mi><msub><mi>t</mi><mi>s</mi></msub><mi>n</mi>
<mi>a</mi><mi>p</mi><mi>s</mi><mi>h</mi><mi>o</mi><msub><mi>t</mi><mi>i</mi>
</msub><mi>d</mi><mo separator="true">,</mo><mi>c</mi><mi>a</mi><mi>s</mi>
<mi>t</mi><mo stretchy="false">(</mo><mi mathvariant="normal">@</mi><mi>l</mi>
<mi>a</mi><mi>s</mi><msub><mi>t</mi><mi>u</mi></msub><mi>p</mi><mi>d</mi>
<mi>a</mi><mi>t</mi><mi>e</mi><msub><mi>d</mi><mi>t</mi></msub><mi>i</mi>
<mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi>
<mi>A</mi><mi>S</mi><mi>T</mi><mi>I</mi><mi>M</mi><mi>E</mi><mi>S</mi>
<mi>T</mi><mi>A</mi><mi>M</mi><mi>P</mi><mo stretchy="false">)</mo><mi>A</mi>
<mi>S</mi><mi>j</mi><mi>o</mi><msub><mi>b</mi><mi>t</mi></msub><mi>i</mi>
<mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi><mo
separator="true">;</mo><mi>E</mi><mi>N</mi><mi>D</mi><mi>I</mi><mi>F</mi><mo
separator="true">;</mo><mi>E</mi><mi>N</mi><mi>D</mi></mrow><annotation
encoding="application/x-tex">
BEGIN
-- Get the last processed snapshot ID from the checkpoint table, or use
&#x27;HEAD&#x27; if none exists
SET @from_snapshot_id =
SELECT last_snapshot_id
FROM checkpoint_table
WHERE job_name = &#x27;Stock_Updates&#x27;
AND job_status = &#x27;SUCCESSFUL&#x27;
ORDER BY job_timestamp DESC
LIMIT 1;

CREATE TABLE IF NOT EXISTS stock_meta AS select * from (SELECT


history_meta(&#x27;stock_price_table&#x27;));
-- Get the next snapshot ID in sequence, after the last processed snapshot
SET @to_snapshot_id =
SELECT snapshot_id
FROM
(SELECT snapshot_id
FROM stock_meta
WHERE snapshot_id &gt; @from_snapshot_id
ORDER BY ABS(snapshot_id - @from_snapshot_id) ASC
LIMIT 1);

-- If a new snapshot is available (i.e., @to_snapshot_id &gt; 0), record


the current timestamp
IF @to_snapshot_id &gt; @from_snapshot_id THEN
SET @last_updated_timestamp = SELECT CURRENT_TIMESTAMP;
-- Insert the sum of stock prices and count of records for the next snapshot
into the incremental table
INSERT INTO Stock_Updates_Incremental
SELECT @to_snapshot_id AS snapshot_id,
SUM(stock_price) AS sum_stock_price,
COUNT(*) AS record_count,
cast(@last_updated_timestamp AS TIMESTAMP) AS
fact_table_timestamp
FROM stock_price_table SNAPSHOT BETWEEN @from_snapshot_id AND
@to_snapshot_id;

-- Log the successful job execution in the checkpoint table


INSERT INTO checkpoint_table
SELECT
&#x27;Stock_Updates&#x27; AS job_name,
&#x27;SUCCESSFUL&#x27; AS job_status,
@to_snapshot_id AS last_snapshot_id,
cast(@last_updated_timestamp AS TIMESTAMP) AS job_timestamp;
END IF;

END
</annotation></semantics></math></span><span class="katex-html" aria-
hidden="true"><span class="base"><span class="strut"
style="height:0.7667em;vertical-align:-0.0833em;"></span><span class="mord
mathnormal">BEG</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:0.9963em;vertical-align:-0.1944em;"></span><span
class="mord">−</span><span class="mord mathnormal">G</span><span class="mord
mathnormal">e</span><span class="mord mathnormal">tt</span><span class="mord
mathnormal">h</span><span class="mord mathnormal">e</span><span class="mord
mathnormal" style="margin-right:0.01968em;">l</span><span class="mord
mathnormal">a</span><span class="mord mathnormal">s</span><span class="mord
mathnormal">tp</span><span class="mord mathnormal">rocesse</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">na</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">t</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.02778em;">D</span><span
class="mord mathnormal" style="margin-right:0.10764em;">f</span><span
class="mord mathnormal">ro</span><span class="mord mathnormal">m</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">ec</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">ec</span><span class="mord mathnormal" style="margin-
right:0.03148em;">k</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">in</span><span
class="mord mathnormal">tt</span><span class="mord mathnormal">ab</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord mathnormal">e</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal" style="margin-right:0.02778em;">or</span><span class="mord
mathnormal">u</span><span class="mord mathnormal">s</span><span class="mord">
<span class="mord mathnormal">e</span><span class="msupsub"><span class="vlist-
t"><span class="vlist-r"><span class="vlist" style="height:0.8019em;"><span
style="top:-3.113em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mtight"><span class="mord mtight">′</span></span></span>
</span></span></span></span></span></span><span class="mord mathnormal"
style="margin-right:0.08125em;">H</span><span class="mord mathnormal"
style="margin-right:0.05764em;">E</span><span class="mord mathnormal">A</span>
<span class="mord"><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="msupsub"><span class="vlist-t"><span
class="vlist-r"><span class="vlist" style="height:0.8019em;"><span
style="top:-3.113em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mtight"><span class="mord mtight">′</span></span></span>
</span></span></span></span></span></span><span class="mord
mathnormal">i</span><span class="mord mathnormal" style="margin-
right:0.10764em;">f</span><span class="mord mathnormal">n</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">n</span><span
class="mord mathnormal">ee</span><span class="mord mathnormal">x</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">s</span><span
class="mord mathnormal" style="margin-right:0.13889em;">SET</span><span
class="mord">@</span><span class="mord mathnormal" style="margin-
right:0.10764em;">f</span><span class="mord mathnormal">ro</span><span
class="mord"><span class="mord mathnormal">m</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">s</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">na</span><span
class="mord mathnormal">p</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">h</span><span class="mord mathnormal">o</span><span
class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">i</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">d</span><span
class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=
</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span
class="base"><span class="strut" style="height:0.9963em;vertical-
align:-0.1944em;"></span><span class="mord mathnormal" style="margin-
right:0.05764em;">SE</span><span class="mord mathnormal">L</span><span
class="mord mathnormal" style="margin-right:0.01968em;">ECTl</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">s</span><span
class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">s</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">na</span><span
class="mord mathnormal">p</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">h</span><span class="mord mathnormal">o</span><span
class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">i</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">d</span><span
class="mord mathnormal" style="margin-right:0.10903em;">FROM</span><span
class="mord mathnormal">c</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">ec</span><span class="mord mathnormal" style="margin-
right:0.03148em;">k</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">in</span><span
class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.2806em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">t</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">ab</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord mathnormal">e</span><span class="mord mathnormal" style="margin-
right:0.13889em;">W</span><span class="mord mathnormal" style="margin-
right:0.08125em;">H</span><span class="mord mathnormal" style="margin-
right:0.05764em;">ERE</span><span class="mord mathnormal" style="margin-
right:0.05724em;">j</span><span class="mord mathnormal">o</span><span
class="mord"><span class="mord mathnormal">b</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">n</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">am</span><span
class="mord mathnormal">e</span><span class="mspace" style="margin-
right:0.2778em;"></span><span class="mrel"><span class="mrel">=</span><span
class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mspace"
style="margin-right:0.2778em;"></span></span><span class="base"><span
class="strut" style="height:0.9963em;vertical-align:-0.1944em;"></span><span
class="mord mathnormal">St</span><span class="mord mathnormal">oc</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.03148em;">k</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.3283em;"><span
style="top:-2.55em;margin-left:-0.0315em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight" style="margin-
right:0.10903em;">U</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">p</span><span class="mord mathnormal">d</span><span class="mord
mathnormal">a</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">e</span><span class="mord"><span class="mord mathnormal">s</span>
<span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span
class="vlist" style="height:0.8019em;"><span style="top:-3.113em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span
class="mord mtight">′</span></span></span></span></span></span></span></span>
</span><span class="mord mathnormal">A</span><span class="mord mathnormal"
style="margin-right:0.10903em;">N</span><span class="mord mathnormal"
style="margin-right:0.02778em;">D</span><span class="mord mathnormal"
style="margin-right:0.05724em;">j</span><span class="mord mathnormal">o</span>
<span class="mord"><span class="mord mathnormal">b</span><span class="msupsub">
<span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">s</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">t</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">u</span><span class="mord mathnormal">s</span><span
class="mspace" style="margin-right:0.2778em;"></span><span class="mrel"><span
class="mrel">=</span><span class="msupsub"><span class="vlist-t"><span
class="vlist-r"><span class="vlist" style="height:0.8019em;"><span
style="top:-3.113em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mtight"><span class="mord mtight">′</span></span></span>
</span></span></span></span></span></span><span class="mspace" style="margin-
right:0.2778em;"></span></span><span class="base"><span class="strut"
style="height:0.9963em;vertical-align:-0.1944em;"></span><span class="mord
mathnormal" style="margin-right:0.05764em;">S</span><span class="mord
mathnormal" style="margin-right:0.10903em;">U</span><span class="mord
mathnormal" style="margin-right:0.13889em;">CCESSF</span><span class="mord
mathnormal" style="margin-right:0.10903em;">U</span><span class="mord"><span
class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t">
<span class="vlist-r"><span class="vlist" style="height:0.8019em;"><span
style="top:-3.113em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mtight"><span class="mord mtight">′</span></span></span>
</span></span></span></span></span></span><span class="mord mathnormal"
style="margin-right:0.00773em;">OR</span><span class="mord mathnormal"
style="margin-right:0.02778em;">D</span><span class="mord mathnormal"
style="margin-right:0.05017em;">ERB</span><span class="mord mathnormal"
style="margin-right:0.05724em;">Yj</span><span class="mord mathnormal">o</span>
<span class="mord"><span class="mord mathnormal">b</span><span class="msupsub">
<span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.2806em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">t</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">im</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">am</span><span class="mord mathnormal">p</span><span
class="mord mathnormal" style="margin-right:0.02778em;">D</span><span
class="mord mathnormal" style="margin-right:0.07153em;">ESC</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.10903em;">M</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.13889em;">T</span><span class="mord">1</span><span class="mpunct">;
</span><span class="mspace" style="margin-right:0.1667em;"></span><span
class="mord mathnormal" style="margin-right:0.05764em;">CRE</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.13889em;">TET</span><span class="mord mathnormal">A</span><span
class="mord mathnormal" style="margin-right:0.05017em;">B</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.05764em;">E</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.07847em;">FNOTEX</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.05764em;">STS</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">oc</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.03148em;">k</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:-0.0315em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">m</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.15em;"><span></span></span></span></span></span>
</span><span class="mord mathnormal">e</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">a</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">se</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord mathnormal">ec</span><span class="mord mathnormal">t</span><span
class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">∗</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:1.088em;vertical-align:-0.2861em;"></span><span class="mord
mathnormal" style="margin-right:0.10764em;">f</span><span class="mord
mathnormal">ro</span><span class="mord mathnormal">m</span><span class="mopen">
(</span><span class="mord mathnormal" style="margin-right:0.05764em;">SE</span>
<span class="mord mathnormal">L</span><span class="mord mathnormal"
style="margin-right:0.13889em;">ECT</span><span class="mord
mathnormal">hi</span><span class="mord mathnormal">s</span><span class="mord
mathnormal">t</span><span class="mord mathnormal" style="margin-
right:0.02778em;">or</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.03588em;">y</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.0359em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">m</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">e</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">a</span><span
class="mopen"><span class="mopen">(</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mord
mathnormal">s</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">oc</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.03148em;">k</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.0315em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">p</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span>
</span></span></span></span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">c</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">ab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t"><span
class="vlist-r"><span class="vlist" style="height:0.8019em;"><span
style="top:-3.113em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mtight"><span class="mord mtight">′</span></span></span>
</span></span></span></span></span></span><span class="mclose">))</span><span
class="mpunct">;</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord">−</span><span class="mspace" style="margin-
right:0.2222em;"></span><span class="mbin">−</span><span class="mspace"
style="margin-right:0.2222em;"></span></span><span class="base"><span
class="strut" style="height:0.8889em;vertical-align:-0.1944em;"></span><span
class="mord mathnormal">G</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">tt</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">n</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">x</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">na</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">t</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.02778em;">D</span><span
class="mord mathnormal">in</span><span class="mord mathnormal">se</span><span
class="mord mathnormal" style="margin-right:0.03588em;">q</span><span
class="mord mathnormal">u</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">n</span><span class="mord mathnormal">ce</span><span
class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord mathnormal">a</span><span class="mord mathnormal"
style="margin-right:0.10764em;">f</span><span class="mord mathnormal">t</span>
<span class="mord mathnormal" style="margin-right:0.02778em;">er</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">e</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">tp</span><span
class="mord mathnormal">rocesse</span><span class="mord mathnormal">d</span>
<span class="mord mathnormal">s</span><span class="mord mathnormal">na</span>
<span class="mord mathnormal">p</span><span class="mord mathnormal">s</span>
<span class="mord mathnormal">h</span><span class="mord mathnormal">o</span>
<span class="mord mathnormal" style="margin-right:0.13889em;">tSET</span><span
class="mord">@</span><span class="mord mathnormal">t</span><span class="mord">
<span class="mord mathnormal">o</span><span class="msupsub"><span class="vlist-
t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;">
<span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">s</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.15em;"><span></span></span></span></span></span>
</span><span class="mord mathnormal">na</span><span class="mord
mathnormal">p</span><span class="mord mathnormal">s</span><span class="mord
mathnormal">h</span><span class="mord mathnormal">o</span><span class="mord">
<span class="mord mathnormal">t</span><span class="msupsub"><span class="vlist-
t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;">
<span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">i</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.15em;"><span></span></span></span></span></span>
</span><span class="mord mathnormal">d</span><span class="mspace"
style="margin-right:0.2778em;"></span><span class="mrel">=</span><span
class="mspace" style="margin-right:0.2778em;"></span></span><span class="base">
<span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span
class="mord mathnormal" style="margin-right:0.05764em;">SE</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ECT</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">na</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">o</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3117em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">i</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">d</span><span class="mord mathnormal" style="margin-
right:0.10903em;">FROM</span><span class="mopen">(</span><span class="mord
mathnormal" style="margin-right:0.05764em;">SE</span><span class="mord
mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ECT</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">na</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">o</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3117em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">i</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">d</span><span class="mord mathnormal" style="margin-
right:0.10903em;">FROM</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">oc</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.03148em;">k</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:-0.0315em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">m</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.15em;"><span></span></span></span></span></span>
</span><span class="mord mathnormal">e</span><span class="mord
mathnormal">t</span><span class="mord mathnormal" style="margin-
right:0.13889em;">aW</span><span class="mord mathnormal" style="margin-
right:0.08125em;">H</span><span class="mord mathnormal" style="margin-
right:0.05764em;">ERE</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">na</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">o</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3117em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">i</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">d</span><span class="mspace" style="margin-
right:0.2778em;"></span><span class="mrel">&gt;</span><span class="mspace"
style="margin-right:0.2778em;"></span></span><span class="base"><span
class="strut" style="height:1em;vertical-align:-0.25em;"></span><span
class="mord">@</span><span class="mord mathnormal" style="margin-
right:0.10764em;">f</span><span class="mord mathnormal">ro</span><span
class="mord"><span class="mord mathnormal">m</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">s</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">na</span><span
class="mord mathnormal">p</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">h</span><span class="mord mathnormal">o</span><span
class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">i</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">d</span><span
class="mord mathnormal" style="margin-right:0.00773em;">OR</span><span
class="mord mathnormal" style="margin-right:0.02778em;">D</span><span
class="mord mathnormal" style="margin-right:0.05017em;">ERB</span><span
class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">BS</span><span class="mopen">(</span><span class="mord
mathnormal">s</span><span class="mord mathnormal">na</span><span class="mord
mathnormal">p</span><span class="mord mathnormal">s</span><span class="mord
mathnormal">h</span><span class="mord mathnormal">o</span><span class="mord">
<span class="mord mathnormal">t</span><span class="msupsub"><span class="vlist-
t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;">
<span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">i</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.15em;"><span></span></span></span></span></span>
</span><span class="mord mathnormal">d</span><span class="mspace"
style="margin-right:0.2222em;"></span><span class="mbin">−</span><span
class="mspace" style="margin-right:0.2222em;"></span></span><span class="base">
<span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span
class="mord">@</span><span class="mord mathnormal" style="margin-
right:0.10764em;">f</span><span class="mord mathnormal">ro</span><span
class="mord"><span class="mord mathnormal">m</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">s</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">na</span><span
class="mord mathnormal">p</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">h</span><span class="mord mathnormal">o</span><span
class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">i</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">d</span><span
class="mclose">)</span><span class="mord mathnormal">A</span><span class="mord
mathnormal" style="margin-right:0.07153em;">SC</span><span class="mord
mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.10903em;">M</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.13889em;">T</span><span class="mord">1</span><span class="mclose">)
</span><span class="mpunct">;</span><span class="mspace" style="margin-
right:0.1667em;"></span><span class="mord">−</span><span class="mspace"
style="margin-right:0.2222em;"></span><span class="mbin">−</span><span
class="mspace" style="margin-right:0.2222em;"></span></span><span class="base">
<span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.10764em;">f</span><span
class="mord mathnormal">an</span><span class="mord mathnormal">e</span><span
class="mord mathnormal" style="margin-right:0.02691em;">w</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">na</span><span
class="mord mathnormal">p</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">h</span><span class="mord mathnormal">o</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.03588em;">v</span><span
class="mord mathnormal">ai</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">ab</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord mathnormal">e</span><span class="mopen">(</span><span class="mord
mathnormal">i</span><span class="mord">.</span><span class="mord
mathnormal">e</span><span class="mord">.</span><span class="mpunct">,</span>
<span class="mspace" style="margin-right:0.1667em;"></span><span
class="mord">@</span><span class="mord mathnormal">t</span><span class="mord">
<span class="mord mathnormal">o</span><span class="msupsub"><span class="vlist-
t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;">
<span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">s</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.15em;"><span></span></span></span></span></span>
</span><span class="mord mathnormal">na</span><span class="mord
mathnormal">p</span><span class="mord mathnormal">s</span><span class="mord
mathnormal">h</span><span class="mord mathnormal">o</span><span class="mord">
<span class="mord mathnormal">t</span><span class="msupsub"><span class="vlist-
t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;">
<span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">i</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.15em;"><span></span></span></span></span></span>
</span><span class="mord mathnormal">d</span><span class="mspace"
style="margin-right:0.2778em;"></span><span class="mrel">&gt;</span><span
class="mspace" style="margin-right:0.2778em;"></span></span><span class="base">
<span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span
class="mord">0</span><span class="mclose">)</span><span class="mpunct">,</span>
<span class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal" style="margin-right:0.02778em;">recor</span><span class="mord
mathnormal">d</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">h</span><span class="mord mathnormal">ec</span><span class="mord
mathnormal">u</span><span class="mord mathnormal">rre</span><span class="mord
mathnormal">n</span><span class="mord mathnormal">tt</span><span class="mord
mathnormal">im</span><span class="mord mathnormal">es</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">am</span><span class="mord
mathnormal">p</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.13889em;">F</span><span class="mord">@</span><span class="mord
mathnormal">t</span><span class="mord"><span class="mord mathnormal">o</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight">s</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">na</span><span class="mord mathnormal">p</span><span class="mord
mathnormal">s</span><span class="mord mathnormal">h</span><span class="mord
mathnormal">o</span><span class="mord"><span class="mord mathnormal">t</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight">i</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">d</span><span class="mspace" style="margin-right:0.2778em;"></span>
<span class="mrel">&gt;</span><span class="mspace" style="margin-
right:0.2778em;"></span></span><span class="base"><span class="strut"
style="height:0.8889em;vertical-align:-0.1944em;"></span><span
class="mord">@</span><span class="mord mathnormal" style="margin-
right:0.10764em;">f</span><span class="mord mathnormal">ro</span><span
class="mord"><span class="mord mathnormal">m</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">s</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">na</span><span
class="mord mathnormal">p</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">h</span><span class="mord mathnormal">o</span><span
class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">i</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">d</span><span
class="mord mathnormal" style="margin-right:0.13889em;">T</span><span
class="mord mathnormal" style="margin-right:0.08125em;">H</span><span
class="mord mathnormal" style="margin-right:0.13889em;">ENSET</span><span
class="mord">@</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">s</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">u</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">p</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">e</span><span class="mord"><span class="mord
mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">im</span><span class="mord mathnormal">es</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">am</span><span
class="mord mathnormal">p</span><span class="mspace" style="margin-
right:0.2778em;"></span><span class="mrel">=</span><span class="mspace"
style="margin-right:0.2778em;"></span></span><span class="base"><span
class="strut" style="height:0.8778em;vertical-align:-0.1944em;"></span><span
class="mord mathnormal" style="margin-right:0.05764em;">SE</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.07153em;">ECTC</span><span class="mord mathnormal" style="margin-
right:0.10903em;">U</span><span class="mord mathnormal" style="margin-
right:0.10903em;">RREN</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.13889em;">T</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3283em;"><span style="top:-2.55em;margin-left:-0.1389em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight"
style="margin-right:0.13889em;">T</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.13889em;">MEST</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.13889em;">MP</span><span class="mpunct">;</span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord">−</span><span
class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:1.0361em;vertical-align:-0.2861em;"></span><span class="mord
mathnormal" style="margin-right:0.07847em;">I</span><span class="mord
mathnormal">n</span><span class="mord mathnormal" style="margin-
right:0.02778em;">ser</span><span class="mord mathnormal">tt</span><span
class="mord mathnormal">h</span><span class="mord mathnormal">es</span><span
class="mord mathnormal">u</span><span class="mord mathnormal">m</span><span
class="mord mathnormal">o</span><span class="mord mathnormal" style="margin-
right:0.10764em;">f</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">oc</span><span
class="mord mathnormal" style="margin-right:0.03148em;">k</span><span
class="mord mathnormal">p</span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">ces</span><span class="mord mathnormal">an</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">co</span><span
class="mord mathnormal">u</span><span class="mord mathnormal">n</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">o</span><span
class="mord mathnormal" style="margin-right:0.10764em;">f</span><span
class="mord mathnormal" style="margin-right:0.02778em;">recor</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">s</span><span
class="mord mathnormal" style="margin-right:0.10764em;">f</span><span
class="mord mathnormal" style="margin-right:0.02778em;">or</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">n</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">x</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">na</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">in</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">h</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">in</span><span class="mord mathnormal">cre</span><span
class="mord mathnormal">m</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">n</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">lt</span><span
class="mord mathnormal">ab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">e</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.13889em;">NSERT</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal">NTOSt</span><span class="mord mathnormal">oc</span>
<span class="mord"><span class="mord mathnormal" style="margin-
right:0.03148em;">k</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.3283em;"><span
style="top:-2.55em;margin-left:-0.0315em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight" style="margin-
right:0.10903em;">U</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">p</span><span class="mord mathnormal">d</span><span class="mord
mathnormal">a</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">e</span><span class="mord"><span class="mord mathnormal">s</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.3283em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight" style="margin-right:0.07847em;">I</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.15em;"><span></span></span></span></span></span>
</span><span class="mord mathnormal">n</span><span class="mord
mathnormal">cre</span><span class="mord mathnormal">m</span><span class="mord
mathnormal">e</span><span class="mord mathnormal">n</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">a</span><span class="mord
mathnormal" style="margin-right:0.05764em;">lSE</span><span class="mord
mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ECT</span><span class="mord">@</span><span class="mord
mathnormal">t</span><span class="mord"><span class="mord mathnormal">o</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight">s</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">na</span><span class="mord mathnormal">p</span><span class="mord
mathnormal">s</span><span class="mord mathnormal">h</span><span class="mord
mathnormal">o</span><span class="mord"><span class="mord mathnormal">t</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight">i</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">d</span><span class="mord mathnormal">A</span><span class="mord
mathnormal" style="margin-right:0.05764em;">S</span><span class="mord
mathnormal">s</span><span class="mord mathnormal">na</span><span class="mord
mathnormal">p</span><span class="mord mathnormal">s</span><span class="mord
mathnormal">h</span><span class="mord mathnormal">o</span><span class="mord">
<span class="mord mathnormal">t</span><span class="msupsub"><span class="vlist-
t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;">
<span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">i</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.15em;"><span></span></span></span></span></span>
</span><span class="mord mathnormal">d</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal" style="margin-right:0.05764em;">S</span><span class="mord
mathnormal" style="margin-right:0.10903em;">U</span><span class="mord
mathnormal" style="margin-right:0.10903em;">M</span><span class="mopen">
(</span><span class="mord mathnormal">s</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">oc</span><span class="mord">
<span class="mord mathnormal" style="margin-right:0.03148em;">k</span><span
class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span
class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-
left:-0.0315em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">p</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.2861em;"><span></span></span></span></span></span></span><span
class="mord mathnormal" style="margin-right:0.02778em;">r</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">ce</span><span
class="mclose">)</span><span class="mord mathnormal">A</span><span class="mord
mathnormal" style="margin-right:0.05764em;">S</span><span class="mord
mathnormal">s</span><span class="mord mathnormal">u</span><span class="mord">
<span class="mord mathnormal">m</span><span class="msupsub"><span class="vlist-
t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;">
<span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">s</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.15em;"><span></span></span></span></span></span>
</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">oc</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.03148em;">k</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.0315em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">p</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span>
</span></span></span></span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">ce</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal" style="margin-right:0.02778em;">CO</span><span class="mord
mathnormal" style="margin-right:0.10903em;">U</span><span class="mord
mathnormal" style="margin-right:0.13889em;">NT</span><span class="mopen">
(</span><span class="mord">∗</span><span class="mclose">)</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal" style="margin-
right:0.02778em;">recor</span><span class="mord"><span class="mord
mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">c</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">o</span><span class="mord mathnormal">u</span><span
class="mord mathnormal">n</span><span class="mord mathnormal">t</span><span
class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord mathnormal">c</span><span class="mord
mathnormal">a</span><span class="mord mathnormal">s</span><span class="mord
mathnormal">t</span><span class="mopen">(</span><span class="mord">@</span>
<span class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">s</span><span
class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">u</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">p</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">e</span><span
class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.2806em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">t</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">im</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">am</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ST</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.13889em;">MEST</span><span class="mord mathnormal">A</span><span
class="mord mathnormal" style="margin-right:0.13889em;">MP</span><span
class="mclose">)</span><span class="mord mathnormal">A</span><span class="mord
mathnormal" style="margin-right:0.05764em;">S</span><span class="mord
mathnormal" style="margin-right:0.10764em;">f</span><span class="mord
mathnormal">a</span><span class="mord mathnormal">c</span><span class="mord">
<span class="mord mathnormal">t</span><span class="msupsub"><span class="vlist-
t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.2806em;">
<span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">t</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.15em;"><span></span></span></span></span></span>
</span><span class="mord mathnormal">ab</span><span class="mord mathnormal"
style="margin-right:0.01968em;">l</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">im</span><span class="mord mathnormal">es</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">am</span><span
class="mord mathnormal" style="margin-right:0.10903em;">pFROM</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">oc</span><span class="mord"><span class="mord
mathnormal" style="margin-right:0.03148em;">k</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.0315em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">p</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span>
</span></span></span></span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">c</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">ab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">e</span><span
class="mord mathnormal" style="margin-right:0.10903em;">SN</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">PS</span><span class="mord mathnormal" style="margin-
right:0.08125em;">H</span><span class="mord mathnormal" style="margin-
right:0.13889em;">OTBET</span><span class="mord mathnormal" style="margin-
right:0.13889em;">W</span><span class="mord mathnormal" style="margin-
right:0.10903em;">EEN</span><span class="mord">@</span><span class="mord
mathnormal" style="margin-right:0.10764em;">f</span><span class="mord
mathnormal">ro</span><span class="mord"><span class="mord mathnormal">m</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight">s</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">na</span><span class="mord mathnormal">p</span><span class="mord
mathnormal">s</span><span class="mord mathnormal">h</span><span class="mord
mathnormal">o</span><span class="mord"><span class="mord mathnormal">t</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight">i</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">d</span><span class="mord mathnormal">A</span><span class="mord
mathnormal" style="margin-right:0.10903em;">N</span><span class="mord
mathnormal" style="margin-right:0.02778em;">D</span><span class="mord">@</span>
<span class="mord mathnormal">t</span><span class="mord"><span class="mord
mathnormal">o</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">s</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">na</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">o</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3117em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">i</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">d</span><span class="mpunct">;</span><span
class="mspace" style="margin-right:0.1667em;"></span><span
class="mord">−</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:1.0519em;vertical-align:-0.25em;"></span><span class="mord
mathnormal">L</span><span class="mord mathnormal">o</span><span class="mord
mathnormal" style="margin-right:0.03588em;">g</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">h</span><span class="mord
mathnormal">es</span><span class="mord mathnormal">u</span><span class="mord
mathnormal">ccess</span><span class="mord mathnormal" style="margin-
right:0.10764em;">f</span><span class="mord mathnormal">u</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord mathnormal" style="margin-right:0.05724em;">j</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">b</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">x</span><span
class="mord mathnormal">ec</span><span class="mord mathnormal">u</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">nin</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">ec</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">ec</span><span class="mord mathnormal" style="margin-
right:0.03148em;">k</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">in</span><span
class="mord mathnormal">tt</span><span class="mord mathnormal">ab</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord mathnormal">e</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.13889em;">NSERT</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.02778em;">NTO</span><span class="mord mathnormal">c</span><span
class="mord mathnormal">h</span><span class="mord mathnormal">ec</span><span
class="mord mathnormal" style="margin-right:0.03148em;">k</span><span
class="mord mathnormal">p</span><span class="mord mathnormal">o</span><span
class="mord mathnormal">in</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">ab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">e</span><span
class="mord mathnormal" style="margin-right:0.05764em;">SE</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.07153em;">EC</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.13889em;">T</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mord
mathnormal">St</span><span class="mord mathnormal">oc</span><span class="mord">
<span class="mord mathnormal" style="margin-right:0.03148em;">k</span><span
class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span
class="vlist" style="height:0.3283em;"><span style="top:-2.55em;margin-
left:-0.0315em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight" style="margin-right:0.10903em;">U</span>
</span></span></span><span class="vlist-s">​
</span></span><span class="vlist-r">
<span class="vlist" style="height:0.15em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal">p</span><span class="mord
mathnormal">d</span><span class="mord mathnormal">a</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">e</span><span class="mord">
<span class="mord mathnormal">s</span><span class="msupsub"><span class="vlist-
t"><span class="vlist-r"><span class="vlist" style="height:0.8019em;"><span
style="top:-3.113em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mtight"><span class="mord mtight">′</span></span></span>
</span></span></span></span></span></span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal" style="margin-
right:0.05724em;">j</span><span class="mord mathnormal">o</span><span
class="mord"><span class="mord mathnormal">b</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">n</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">am</span><span
class="mord mathnormal">e</span><span class="mpunct"><span class="mpunct">,
</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span
class="vlist" style="height:0.8019em;"><span style="top:-3.113em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span
class="mord mtight">′</span></span></span></span></span></span></span></span>
</span><span class="mspace" style="margin-right:0.1667em;"></span><span
class="mord mathnormal" style="margin-right:0.05764em;">S</span><span
class="mord mathnormal" style="margin-right:0.10903em;">U</span><span
class="mord mathnormal" style="margin-right:0.13889em;">CCESSF</span><span
class="mord mathnormal" style="margin-right:0.10903em;">U</span><span
class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal" style="margin-
right:0.05724em;">j</span><span class="mord mathnormal">o</span><span
class="mord"><span class="mord mathnormal">b</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">s</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">t</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">u</span><span class="mord mathnormal">s</span><span
class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord">@</span><span class="mord mathnormal">t</span><span
class="mord"><span class="mord mathnormal">o</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">s</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">na</span><span
class="mord mathnormal">p</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">h</span><span class="mord mathnormal">o</span><span
class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">i</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">d</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.01968em;">Sl</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">s</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">s</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">na</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">o</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3117em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">i</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">d</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal">c</span><span class="mord mathnormal">a</span><span class="mord
mathnormal">s</span><span class="mord mathnormal">t</span><span class="mopen">
(</span><span class="mord">@</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">s</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">u</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">p</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">e</span><span class="mord"><span class="mord
mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">im</span><span class="mord mathnormal">es</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">am</span><span
class="mord mathnormal">p</span><span class="mord mathnormal">A</span><span
class="mord mathnormal" style="margin-right:0.13889em;">ST</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.13889em;">MEST</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.13889em;">MP</span><span class="mclose">)</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal" style="margin-
right:0.05724em;">j</span><span class="mord mathnormal">o</span><span
class="mord"><span class="mord mathnormal">b</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.2806em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">t</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">im</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">am</span><span class="mord mathnormal">p</span><span
class="mpunct">;</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord mathnormal" style="margin-right:0.10903em;">EN</span>
<span class="mord mathnormal" style="margin-right:0.02778em;">D</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.13889em;">F</span><span
class="mpunct">;</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord mathnormal" style="margin-right:0.10903em;">EN</span>
<span class="mord mathnormal" style="margin-right:0.02778em;">D</span></span>
</span></span></span>;

Here is the explanation of the code blocks:

1. Resolve Fallback on Snapshot Failure: The query starts by setting the system to handle snapshot failures
gracefully.

2. Fetch the Last Processed Snapshot (**@from_snapshot_id**): The query fetches the last processed
snapshot ID from the checkpoint_table for the Stock_Updates job that has a status of
'SUCCESSFUL'. If no such snapshot exists, it defaults to 'HEAD'.

3. Fetch the Next Snapshot (**@to_snapshot_id**): The query then looks for the next available snapshot
ID greater than @from_snapshot_id by selecting the minimum snapshot ID (MIN(snapshot_id)). This
ensures that snapshots are processed in sequential order.

4. Error Handling for Missing Snapshots: If no new snapshot is available (@to_snapshot_id IS NULL), it
raises an error with the message 'No new snapshot available on this check'.

5. Record the Timestamp for Processing: If a new snapshot exists, the current timestamp is recorded in
@last_updated_timestamp.

6. Insert the Sum of Stock Prices and Count of Records into the Incremental Table: For the identified snapshot
(@to_snapshot_id), the query calculates the sum of stock_price values and count of records. It inserts
this along with the snapshot ID and timestamp into Stock_Updates_Incremental.

7. Log the Successful Execution: After processing the snapshot, the query logs the successful execution by
inserting a record into the checkpoint_table with the job name, status (SUCCESSFUL), snapshot ID, and
timestamp.

8. Exception Handling: If any errors occur during execution, the query raises a custom error message 'An
unexpected error occurred'. You may not find these error messages very useful in the Data Distiller Editor
console. It is highly recommended that you either look at the Query Log, where you will need to find the queries
of interest, or, better, go to the Scheduled Queries tab.

Each time you execute the script above, it will insert new rows into both the fact table and the checkpoint table.

Go ahead and schedule this Anonymous Block following the steps in the tutorial here. Make sure you set the schedule
to hourly so that it keeps executing each hour and you can test the branching logic, i.e., it should stop executing
after processing March.

Execute the following query to see the contents of the fact table **Stock_Updates_Incremental:**
SELECT * FROM Stock_Updates_Incremental
ORDER BY fact_table_timestamp;

The results should look like this:

Let us interrogate the **checkpoint_table**
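
A query such as the following (using the table and column names defined in the anonymous block above) shows its contents:

SELECT * FROM checkpoint_table
ORDER BY job_timestamp;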

We just saved resources by doing an incremental load. Let us write a query on this fact table to compute the average
stock price. Because the fact table stores the per-snapshot sum and record count, dividing the total sum by the total
count yields the true average across all records rather than an average of per-snapshot averages.

SELECT ROUND(SUM(sum_stock_price)/SUM(record_count), 2) AS AVG_STOCK_PRICE
FROM Stock_Updates_Incremental;

Prototype the retrieval of the from_snapshot_id variable.

If we assume that from_snapshot_id = 0, then we get to_snapshot_id = 1.
Aggregates are returned for the snapshots.

Insertion of new records stopped automatically after the month of March.

The checkpoint table shows all the timestamps.

The end user still gets the result they are looking for.

https://data-distiller.all-stuff-data.com/unit-3-data-distiller-etl-extract-transform-load/etl-200-chaining-of-data-distiller-jobs * * *

1. UNIT 3: DATA DISTILLER ETL (EXTRACT, TRANSFORM, LOAD)

ETL 200: Chaining of Data Distiller Jobs


Unleash the power of seamless insights with Data Distiller’s chained queries—connect your data, step by step, to drive
better decisions

The goal of this case study is to perform incremental processing on a dataset to create a new derived dataset.

Why Chain Data Distiller Jobs?

Chaining Data Distiller SQL jobs in marketing workflows can be extremely useful for managing sequential processes
where each step depends on the output of the previous one. Most high-value Data Distiller use cases end up using
chaining in some form or another.
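
Before walking through the examples, here is a minimal sketch of what a two-job chain can look like in SQL. All dataset and column names below are hypothetical: the first job materializes a derived dataset, and the second job reads that dataset and writes an enriched one. The statements can be sequenced inside a single anonymous block or scheduled as separate queries that run one after the other.

-- Job 1: build a behavioral segment as a derived dataset (hypothetical names).
CREATE TABLE high_intent_customers AS
SELECT customer_id,
       COUNT(*) AS product_views,
       MAX(event_timestamp) AS last_seen
FROM web_events
GROUP BY customer_id
HAVING COUNT(*) >= 5;

-- Job 2: enrich the segment produced by Job 1 (assumes the target table already exists).
INSERT INTO enriched_high_intent_customers
SELECT h.customer_id,
       h.product_views,
       c.loyalty_tier
FROM high_intent_customers h
JOIN crm_profiles c
  ON h.customer_id = c.customer_id;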

Here are some examples of how Data Distiller is currently used across a wide-ranging set of use cases:

First Job: A SQL job extracts and segments customers based on behavior (e.g., browsing history, purchase
frequency, or demographic data).

Second Job: Another job enriches these segments with external data (e.g., cost of living, product preferences, or
past purchase history).

Third Job: A job further enriches the segments by adding real-time engagement metrics (e.g., recent interactions
like clicks, views, or cart additions).

Fourth Job: The next job generates personalized content (e.g., product recommendations, targeted offers) based
on enriched segments.

Fifth Job: A final job structures the personalized datasets for campaign automation tools (e.g., email systems, ad
platforms).

New Feature Alert: Data Distiller can create SQL audiences from AEP Data Lake that can be published as External
Audiences in Real-Time Customer Profile for activation.

Adobe Journey Optimizer Performance Reporting

First Job: The first job collects raw engagement data (e.g., email opens, clicks, or social media interactions) from
various marketing channels.

Second Job: A second job calculates key metrics such as click-through rates (CTR), conversion rates, and ROI
for each campaign (a sketch of this step is shown after this list).

Third Job: The final job aggregates these metrics into daily, weekly, or monthly reports and sends the insights to
a BI tool or dashboard.
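
As a minimal sketch of the second job above, assuming hypothetical table and column names for the raw engagement data collected by the first job, the per-campaign metric calculation might look like this:

-- Read the raw engagement dataset produced by the first job and compute
-- per-campaign click-through and conversion rates (hypothetical names).
CREATE TABLE campaign_kpis AS
SELECT campaign_id,
       SUM(impressions) AS total_impressions,
       SUM(clicks) AS total_clicks,
       SUM(conversions) AS total_conversions,
       ROUND(SUM(clicks) * 100.0 / NULLIF(SUM(impressions), 0), 2) AS ctr_pct,
       ROUND(SUM(conversions) * 100.0 / NULLIF(SUM(clicks), 0), 2) AS conversion_rate_pct
FROM raw_engagement_events
GROUP BY campaign_id;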

Customer Journey Touchpoint Mapping

First Job: A SQL job pulls data on customer interactions across touchpoints (e.g., website visits, email
engagement, and social media clicks).

Second Job: A second job sequences these interactions in chronological order to map each customer’s journey
over time.

Third Job: Another job enriches the data by associating interactions with specific campaigns, offers, or
promotions the customer encountered.

Fourth Job: A job groups interactions by channel (e.g., social media, email, website) to analyze the effectiveness
of each channel on customer engagement.

Fifth Job: This job generates insights about customer behavior patterns (e.g., when they tend to convert or drop
off) and flags high-value customers for retargeting.

Sixth Job: Another job calculates the time spent at each stage of the customer journey (e.g., from first interaction
to purchase) to identify bottlenecks or areas for improvement.

Seventh Job: The final job outputs a comprehensive customer journey report, which helps marketers fine-tune
messaging and timing across different channels for optimal engagement.

Lead Scoring Automation in Adobe B2B CDP and/or AJO B2B

First Job: A SQL job collects lead behavior data (e.g., content downloads, webinar attendance, or email
responses) from multiple sources.

Second Job: A second job cleans and standardizes the data to ensure consistent formatting and structure for
accurate scoring.

Third Job: The next job assigns scores to each lead based on predefined criteria (e.g., activity levels,
engagement frequency, or demographic fit).

Fourth Job: A job segments leads based on their scores into categories such as “hot leads,” “warm leads,” or
“cold leads,” facilitating targeted follow-ups.

Fifth Job: This job enriches the lead data with additional insights, such as firmographic data or lead readiness
indicators (e.g., industry, company size, or budget).

Sixth Job: Another job updates the CRM or marketing automation platform with the latest lead scores, triggering
personalized follow-up actions and workflows.

Seventh Job: The final job generates a lead scoring performance report, tracking metrics like conversion rates
and lead quality to refine and improve scoring criteria over time.

Product Recommendations in Adobe Commerce

First Job: A SQL job captures and processes customer interaction data, such as product views or add-to-cart
actions.

Second Job: The next job identifies relevant product recommendations based on this behavior using algorithms
or predefined business rules.

Third Job: The final job sends these product recommendations to an email marketing system or personalization
engine for delivery to the customer.

Real-Time Customer Data Platform Activation: Ad Spend Optimization

First Job: A SQL job pulls data from various advertising platforms (e.g., Google Ads, Facebook Ads) about
spend, impressions, and conversions for different campaigns.

Second Job: A second job standardizes and normalizes the data from different platforms to ensure consistency
across metrics (e.g., converting currencies, time zones, or impression formats).

Third Job: This job calculates key performance indicators (KPIs) such as cost per acquisition (CPA), return on
ad spend (ROAS), and conversion rate for each campaign.

Fourth Job: A job aggregates the KPIs by channel (e.g., Google Ads vs. Facebook Ads) to provide a
comprehensive view of performance at both the channel and campaign levels.

Fifth Job: Another job compares these KPIs across channels and campaigns, identifying top-performing
campaigns and those underperforming based on the defined thresholds (e.g., ROAS or CPA benchmarks).

Sixth Job: This job identifies campaigns with significant variations over time (e.g., sudden spikes in cost or drops
in conversion rates) and flags them for deeper analysis.

Seventh Job: A job suggests budget reallocation, shifting spend from underperforming campaigns to high-
performing campaigns or channels based on the calculated KPIs.

Eighth Job: The next job forecasts future performance and ROI for the reallocated budget using predictive
analytics based on past campaign performance trends.

Ninth Job: This job sends the budget reallocation suggestions to the marketing platform or ad management tool
for implementation, ensuring real-time adjustments.

Tenth Job: The final job generates a performance report that tracks the effectiveness of the reallocation
decisions, highlighting any improvements in ROAS, CPA, and overall campaign performance.

Most ad spend reporting in the industry relies on custom-built solutions to collect data from various platforms.
FunnelIO is a prime example of a product that offers this capability out of the box, providing connectors that cover a
wide range of systems.

Standard Attribution Analysis

First Job: A SQL job collects data from various touchpoints (e.g., paid ads, email campaigns, social media)
where customers interact with the brand, including impressions, clicks, and conversions.

Second Job: A second job links these interactions to individual customer journeys, identifying which touchpoints
contributed to each conversion (e.g., first-click, last-click, or multi-touch).

Third Job: A job assigns a basic attribution model (e.g., first-click, last-click, linear) to measure the contribution
of each touchpoint towards the conversion.

Fourth Job: This job enriches the attribution model by incorporating customer demographic data and behavior to
better understand how different customer segments respond to various channels.

Fifth Job: A job calculates key metrics for each touchpoint and channel, such as conversion rate, time-to-
conversion, and cost per conversion, allowing for a detailed breakdown of performance.

Sixth Job: This job applies multi-touch attribution models (e.g., time decay, U-shaped, W-shaped) to give weight
to each interaction in the customer journey based on its influence on the final conversion.

Seventh Job: A job aggregates attribution results by channel, campaign, and customer segment to identify which
touchpoints are driving the most valuable conversions.

Eighth Job: This job compares attribution models (e.g., first-click vs. linear vs. time decay) to evaluate which
model gives the most accurate representation of customer behavior and conversion paths.

Ninth Job: A job suggests optimization strategies for future campaigns by identifying underperforming channels
and reallocating budget towards high-performing touchpoints based on the chosen attribution model.

Tenth Job: The final job generates an attribution performance report that tracks each channel’s contribution to
conversions over time, helping marketing teams optimize campaigns for better ROI.

Data Distiller includes built-in functions for first-touch and last-touch attribution. You can further customize these
(time decay, linear, U-shaped, W-shaped, non-linear, weighted) using Window functions to suit your specific needs.
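
A minimal sketch of the window-function approach, deriving first- and last-touch channels per customer; the table touchpoints and its columns (customer_id, channel, event_time) are assumed names:

SELECT DISTINCT
    customer_id,
    FIRST_VALUE(channel) OVER (PARTITION BY customer_id ORDER BY event_time) AS first_touch_channel,
    LAST_VALUE(channel) OVER (
        PARTITION BY customer_id ORDER BY event_time
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS last_touch_channel
FROM touchpoints;

Custom models (time decay, U-shaped, and so on) follow the same pattern, replacing the FIRST_VALUE/LAST_VALUE expressions with weights computed over the ordered touchpoints.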

Media Mix Modeling

First Job: A SQL job pulls historical data on marketing spend and performance across different channels (e.g.,
TV, radio, digital, print), including impressions, clicks, and conversions.

Second Job: A second job standardizes the data by normalizing spend, reach, and engagement metrics across
channels to create a unified dataset for analysis.

Third Job: A job calculates the contribution of each channel to overall sales or conversions using statistical
methods like regression analysis, which allows for the identification of relationships between media spend and
outcomes.

Fourth Job: This job enriches the model by incorporating external factors such as seasonality, economic
conditions, or competitive activity, to adjust for their impact on marketing effectiveness.

Fifth Job: A job applies time-series analysis to examine how media spend over time influences sales trends and
how different channels may have long-term or short-term effects.

Sixth Job: This job calculates diminishing returns for each channel, identifying the point where additional spend
yields less incremental benefit, helping to optimize budget allocation.

Seventh Job: A job assigns weight to each media channel based on its effectiveness, creating a model that can
forecast the likely outcomes of different budget scenarios (e.g., increasing TV ad spend vs. digital).

Eighth Job: This job runs simulations to test different media mix scenarios, forecasting outcomes such as
expected sales growth or ROI for various spend allocations across channels.

Ninth Job: A job suggests an optimized media mix, reallocating budgets to high-performing channels and
reducing spend on channels with lower returns, based on the model’s output.

Tenth Job: The final job generates a media mix performance report, showing how changes in media spending
influence sales or conversions, and provides recommendations for future marketing strategies based on the
analysis.
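
A hedged sketch of the standardization step (Second Job): normalizing spend and engagement onto a common scale before modeling. The table media_spend and its columns (channel, week_start, spend, impressions) are assumed names:

SELECT
    channel,
    week_start,
    spend / SUM(spend) OVER (PARTITION BY week_start) AS spend_share,
    impressions / NULLIF(MAX(impressions) OVER (PARTITION BY channel), 0) AS impressions_scaled
FROM media_spend;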

New Feature Alert: New statistical models, such as regression analysis, are now available in Data Distiller.

Media Mix Modeling faces similar challenges to those encountered in collecting data from various campaign reporting
sources. First, the definitions and interpretations of metrics differ significantly across systems. Second, when
standardizing these metrics and dimensions, certain assumptions must inevitably be made. Lastly, the granularity of
data is often inconsistent or insufficient across these platforms.

Machine Learning Feature Engineering

First Job: A SQL job collects raw customer data (e.g., purchase history, website interactions, and demographics).

Second Job: Another job creates Recency, Frequency, Monetary (RFM) features based on customer transactions
to quantify customer engagement.

Third Job: A job computes average session duration and product views per session, transforming raw website
data into features that capture customer browsing behavior.

Fourth Job: This job generates time-based features, such as time since the last purchase and frequency of
interactions over the last 90 days.

Fifth Job: Another job enriches the feature set by calculating discount sensitivity—whether a customer purchases
more frequently when discounts are offered.

Sixth Job: The job then applies clustering algorithms (e.g., k-means) to group customers into segments like
“high-value” or “at-risk” based on their features.

Seventh Job: A job normalizes and scales the features to ensure they are ready for model training.

Eighth Job: The next job performs feature selection, identifying the most predictive features for churn modeling.

Ninth Job: A job updates the dataset with new interaction data, allowing the features to be incrementally updated
for real-time predictions.

Tenth Job: A final job exports the engineered feature set for training machine learning models, such as predicting
customer churn or recommending products.
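
As a hedged sketch of the RFM step (Second Job): the table transactions and its columns (customer_id, order_date, order_value) are assumed names:

SELECT
    customer_id,
    DATEDIFF(CURRENT_DATE, MAX(order_date)) AS recency_days,
    COUNT(*) AS frequency,
    SUM(order_value) AS monetary
FROM transactions
GROUP BY customer_id;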

Today there is no integration between the Destination Scheduler and Data Distiller Anonymous Block. For Dataset
Activation, read this tutorial.

Clean Room Data Collaboration through a Third-Party Identity Provider

First Job (Company A’s Environment): A SQL job within Data Distiller collects and anonymizes Company A’s
customer data (e.g., purchase history, demographic information) from internal systems, ensuring all PII
(Personally Identifiable Information) is removed using hashing or tokenization techniques.

Second Job (Company B’s Environment): A SQL job within Data Distiller collects and anonymizes
complementary data from Company B’s dataset (e.g., external browsing behavior or interests), ensuring all data
adheres to privacy standards by applying similar anonymization techniques.

Third Job: Each of Company A and Company B uploads their respective anonymized datasets through Data
Distiller’s dataset activation feature to the third-party identity provider (IDP), enabling secure matching and
analysis within the clean room environment.

Fourth Job: The third-party IDP runs a Data Distiller job to match customer records from both datasets using
the anonymized identifiers (e.g., hashed email addresses), identifying shared customers between the two datasets.

Fifth Job: A SQL job within the IDP’s clean room combines Data Distiller’s anonymized internal data (e.g.,
purchase history from Company A) with Company B’s anonymized data (e.g., browsing behavior) to create a
shared dataset of overlapping customers.

Sixth Job: Another Data Distiller job enriches the shared dataset by adding third-party external data (e.g.,
demographic or geographic information) for additional insights.

Seventh Job: A job runs privacy-preserving computations using methods like differential privacy, where
noise is added to the data to protect individual identities. This ensures that insights on customer behaviors (e.g.,
purchase trends, engagement patterns) are generated without revealing personal information. The noise addition
process ensures that individual data points remain indistinguishable, even in aggregated results, ensuring
compliance with privacy regulations such as GDPR and CCPA.

Eighth Job: The clean room generates aggregated marketing insights from the combined dataset, such as
cross-company customer behavior patterns and conversion rates.

Ninth Job: Another Data Distiller job runs predictive analytics to identify high-value customer segments or
behaviors, helping both Company A and Company B optimize their marketing strategies.

Tenth Job: A final Data Distiller job outputs anonymized, aggregated reports for both companies, providing
actionable insights (e.g., channel attribution, cross-platform behaviors) without compromising customer privacy.
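
A hedged sketch of the anonymization and matching steps (First and Fourth Jobs). The table and column names (company_a_customers, company_a_anonymized, company_b_anonymized, email, purchase_total, browsing_segment) are illustrative assumptions:

-- Anonymize identifiers before upload (First Job)
SELECT sha2(lower(trim(email)), 256) AS hashed_email, purchase_total
FROM company_a_customers;

-- Match records on the hashed identifier inside the clean room (Fourth Job)
SELECT a.hashed_email, a.purchase_total, b.browsing_segment
FROM company_a_anonymized a
JOIN company_b_anonymized b
  ON a.hashed_email = b.hashed_email;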

There are a variety of clean room technologies available, including LiveRamp’s Safe Haven, InfoSum, Snowflake Clean
Room, AWS Clean Rooms, ADH, and Merkle Merkury. If you’re working with one of these vendors, you can skip Jobs
4 through 10. However, if you’re a vendor planning to implement this as a custom solution using Data Distiller, where
you control the IP of the algorithms and the reporting, the steps outlined above are the key ones to consider.

Whenever new data is materialized onto the AEP Data Lake—whether through ingestion, upload, or a Data Distiller
job—a new batch is created. If you examine the dataset, you’ll notice it has multiple batch IDs linked to it. However,
batches can often be too granular, requiring a higher level of abstraction. This is where the concept of a snapshot
comes in—a snapshot represents a collection of new batches grouped together and assigned a snapshot ID. The reason
multiple batches can end up in a single snapshot is that if the data volume is large and exceeds the internal maximum
threshold for a batch, it will be split into additional batches. Data Distiller can read and process these snapshots,
enabling incremental processing and making it a core capability for managing updates efficiently. But first, let us learn
how to create these snapshots efficiently.
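
For later reference, once snapshots exist, incremental reads can look like this hedged sketch; the history_meta() helper and the SNAPSHOT SINCE clause follow the Query Service syntax for incremental loads and should be verified against current documentation, and the snapshot ID 123 is a placeholder:

-- List the snapshots (and their IDs) recorded for a dataset
SELECT history_meta('stock_price_table');

-- Read only the batches added after a given snapshot ID
SELECT * FROM stock_price_table SNAPSHOT SINCE 123;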

Our goal is to simulate a fictional stock price for the first 3 months of next year.

You will need to access the Data Distiller Query Pro Mode Editor or use your own favorite editor:

Navigate to Queries->Overview->Create Query

Sequential Execution Challenges


Let us say that we want to generate a randomized stock price between $30 and $60 for each day in the first 3 months
of 2025. The query below shows the pattern for a single month (January):

SELECT
    date_add('2025-01-01', seq.i) AS date,
    CAST(30 + (RAND() * 30) AS DECIMAL(5, 2)) AS stock_price
FROM
    (SELECT explode(sequence(0, 30)) AS i) seq
ORDER BY date;

Do not execute the code below, but observe the pattern for creating an empty dataset. We create an empty table by
using a **WHERE** condition that is always false (a contradiction).

CREATE TABLE stock_price_table AS
SELECT
    CAST(NULL AS DATE) AS date,
    CAST(NULL AS DECIMAL(5, 2)) AS stock_price
WHERE FALSE;

Do not execute the code below, but observe the pattern for January 2025:

INSERT INTO stock_price_table
SELECT
    date_add('2025-01-01', seq.i) AS date,
    CAST(30 + (RAND() * 30) AS DECIMAL(5, 2)) AS stock_price
FROM
    (SELECT explode(sequence(0, 30)) AS i) seq -- January has 31 days
ORDER BY date;

Do not execute the code below, but observe the pattern for February 2025:

INSERT INTO stock_price_table
SELECT
    date_add('2025-02-01', seq.i) AS date,
    CAST(30 + (RAND() * 30) AS DECIMAL(5, 2)) AS stock_price
FROM
    (SELECT explode(sequence(0, 27)) AS i) seq -- February has 28 days in 2025
ORDER BY date;

Do not execute the code below, but observe the pattern for March 2025:

INSERT INTO stock_price_table
SELECT
    date_add('2025-03-01', seq.i) AS date,
    CAST(30 + (RAND() * 30) AS DECIMAL(5, 2)) AS stock_price
FROM
    (SELECT explode(sequence(0, 30)) AS i) seq -- March has 31 days
ORDER BY date;

If you were to run each of the above queries individually, the process would be very time-consuming because both the
**CREATE TABLE AS** and **INSERT INTO** statements write data to the data lake. This triggers the batch
processing service in Data Distiller, which starts a cluster, runs the job, and then shuts the cluster down. This cycle of
spinning the cluster up and down for each query causes unnecessary delays, as you wait through both the
startup and shutdown phases with every execution. On average, cluster spin-up and spin-down each take about 5
minutes. Since we have 4 queries, that overhead alone is at least 40 minutes.

An Anonymous Block in Data Distiller refers to a block of SQL code that is executed without being explicitly named
or stored in the database for future reuse. It typically includes procedural logic such as control-flow statements,
variable declarations, and exception handling, all enclosed within a **BEGIN...END** block. The great thing about
an anonymous block is that it runs all the SQL code within a single cluster session, eliminating the need to repeatedly
spin up and down multiple clusters. This helps save both time and compute resources.

Observe the syntax for **BEGIN** and **END**. A pair of $ signs ($$) is placed before **BEGIN** and
after **END**. Each SQL statement inside the block ends with a semicolon to separate it from the next.

<span class="katex-display"><span class="katex"><span class="katex-mathml">


<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics>
<mrow><mi>B</mi><mi>E</mi><mi>G</mi><mi>I</mi><mi>N</mi><mi>C</mi><mi>R</mi>
<mi>E</mi><mi>A</mi><mi>T</mi><mi>E</mi><mi>T</mi><mi>A</mi><mi>B</mi>
<mi>L</mi><mi>E</mi><mi>T</mi><mi>A</mi><mi>B</mi><mi>L</mi><msub><mi>E</mi>
<mi>A</mi></msub><mi>A</mi><mi>S</mi><mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi>
<mi>C</mi><mi>T</mi><mo>∗</mo><mi>F</mi><mi>R</mi><mi>O</mi><mi>M</mi>
<mi>T</mi><mi>A</mi><mi>B</mi><mi>L</mi><msub><mi>E</mi><mn>1</mn></msub><mo
separator="true">;</mo><mi>C</mi><mi>R</mi><mi>E</mi><mi>A</mi><mi>T</mi>
<mi>E</mi><mi>T</mi><mi>A</mi><mi>B</mi><mi>L</mi><mi>E</mi><mi>T</mi>
<mi>A</mi><mi>B</mi><mi>L</mi><msub><mi>E</mi><mi>B</mi></msub><mi>A</mi>
<mi>S</mi><mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi><mi>T</mi>
<mo>∗</mo><mi>F</mi><mi>R</mi><mi>O</mi><mi>M</mi><mi>T</mi><mi>A</mi>
<mi>B</mi><mi>L</mi><msub><mi>E</mi><mn>2</mn></msub><mo separator="true">;
</mo><mi>E</mi><mi>X</mi><mi>C</mi><mi>E</mi><mi>P</mi><mi>T</mi><mi>I</mi>
<mi>O</mi><mi>N</mi><mi>W</mi><mi>H</mi><mi>E</mi><mi>N</mi><mi>O</mi>
<mi>T</mi><mi>H</mi><mi>E</mi><mi>R</mi><mi>T</mi><mi>H</mi><mi>E</mi>
<mi>N</mi><mi>S</mi><mi>E</mi><mi>T</mi><mi mathvariant="normal">@</mi>
<mi>r</mi><mi>e</mi><mi>t</mi><mo>=</mo><mi>S</mi><mi>E</mi><mi>L</mi>
<mi>E</mi><mi>C</mi><msup><mi>T</mi><mo mathvariant="normal" lspace="0em"
rspace="0em">′</mo></msup><mi>T</mi><mi>A</mi><mi>B</mi><mi>L</mi><mi>E</mi>
<mn>2</mn><mi>E</mi><mi>R</mi><mi>R</mi><mi>O</mi><msup><mi>R</mi><mo
mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mo
separator="true">;</mo><mi>E</mi><mi>N</mi><mi>D</mi></mrow><annotation
encoding="application/x-tex">
BEGIN
CREATE TABLE TABLE_A AS SELECT * FROM TABLE_1;
CREATE TABLE TABLE_B AS SELECT * FROM TABLE_2;
EXCEPTION WHEN OTHER THEN SET @ret = SELECT &#x27;TABLE 2 ERROR&#x27;;
END
</annotation></semantics></math></span><span class="katex-html" aria-
hidden="true"><span class="base"><span class="strut"
style="height:0.8333em;vertical-align:-0.15em;"></span><span class="mord
mathnormal">BEG</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.05764em;">NCRE</span><span class="mord mathnormal">A</span><span
class="mord mathnormal" style="margin-right:0.13889em;">TET</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05017em;">B</span><span class="mord mathnormal">L</span><span
class="mord mathnormal" style="margin-right:0.13889em;">ET</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05017em;">B</span><span class="mord mathnormal">L</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.05764em;">E</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.3283em;"><span
style="top:-2.55em;margin-left:-0.0576em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">A</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.15em;"><span></span></span></span></span></span>
</span><span class="mord mathnormal">A</span><span class="mord mathnormal"
style="margin-right:0.05764em;">SSE</span><span class="mord
mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ECT</span><span class="mspace" style="margin-
right:0.2222em;"></span><span class="mbin">∗</span><span class="mspace"
style="margin-right:0.2222em;"></span></span><span class="base"><span
class="strut" style="height:0.8778em;vertical-align:-0.1944em;"></span><span
class="mord mathnormal" style="margin-right:0.13889em;">FROMT</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05017em;">B</span><span class="mord mathnormal">L</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.05764em;">E</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span
style="top:-2.55em;margin-left:-0.0576em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight">1</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mpunct">;</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord mathnormal" style="margin-right:0.05764em;">CRE</span>
<span class="mord mathnormal">A</span><span class="mord mathnormal"
style="margin-right:0.13889em;">TET</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05017em;">B</span><span class="mord mathnormal">L</span><span
class="mord mathnormal" style="margin-right:0.13889em;">ET</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05017em;">B</span><span class="mord mathnormal">L</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.05764em;">E</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.3283em;"><span
style="top:-2.55em;margin-left:-0.0576em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight" style="margin-
right:0.05017em;">B</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">SSE</span><span class="mord mathnormal">L</span><span
class="mord mathnormal" style="margin-right:0.13889em;">ECT</span><span
class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">∗</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:0.8889em;vertical-align:-0.1944em;"></span><span class="mord
mathnormal" style="margin-right:0.13889em;">FROMT</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05017em;">B</span><span class="mord mathnormal">L</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.05764em;">E</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span
style="top:-2.55em;margin-left:-0.0576em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight">2</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mpunct">;</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord mathnormal" style="margin-
right:0.13889em;">EXCEPT</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.10903em;">ON</span><span class="mord mathnormal" style="margin-
right:0.13889em;">W</span><span class="mord mathnormal" style="margin-
right:0.08125em;">H</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ENOT</span><span class="mord mathnormal" style="margin-
right:0.08125em;">H</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ERT</span><span class="mord mathnormal" style="margin-
right:0.08125em;">H</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ENSET</span><span class="mord">@</span><span class="mord
mathnormal">re</span><span class="mord mathnormal">t</span><span class="mspace"
style="margin-right:0.2778em;"></span><span class="mrel">=</span><span
class="mspace" style="margin-right:0.2778em;"></span></span><span class="base">
<span class="strut" style="height:0.9963em;vertical-align:-0.1944em;"></span>
<span class="mord mathnormal" style="margin-right:0.05764em;">SE</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.07153em;">EC</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.13889em;">T</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mord
mathnormal" style="margin-right:0.13889em;">T</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05017em;">B</span><span class="mord mathnormal">L</span><span
class="mord mathnormal" style="margin-right:0.05764em;">E</span><span
class="mord">2</span><span class="mord mathnormal" style="margin-
right:0.02778em;">ERRO</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.00773em;">R</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mpunct">;
</span><span class="mspace" style="margin-right:0.1667em;"></span><span
class="mord mathnormal" style="margin-right:0.10903em;">EN</span><span
class="mord mathnormal" style="margin-right:0.02778em;">D</span></span></span>
</span></span>;

Let us dissect the above query:

1. **BEGIN ... END** Block: The BEGIN and END keywords group a series of statements that need to be
executed as a single unit.

2. **EXCEPTION WHEN OTHER THEN SET @ret = SELECT 'TABLE 2 ERROR'**

This block handles any errors that occur during the execution of the BEGIN ... END block:

**EXCEPTION** is used to define error-handling logic. Syntax errors are caught at compile time, whereas
**EXCEPTION** errors are runtime errors related to the data or the tables themselves.

**WHEN OTHER THEN** catches any error or exception raised by the preceding statements.

**SET @ret = SELECT 'TABLE 2 ERROR'** assigns the value 'TABLE 2 ERROR' to the variable @ret, signaling
that an error occurred during the execution.

Keep in mind that any variables declared within an Anonymous Block exist only for the duration of that
block’s execution. However, the @ret variable in the example above is unique because it’s used in the
EXCEPTION handling clause, allowing it to persist beyond the session.

If an exception is raised in any of the chained queries, execution of the block stops at that point.

Do not attempt to use SELECT queries within a BEGIN...END block expecting interactive results to stream to your
editor. Although the code will execute, no results will be streamed and you will encounter errors. You can still declare
variables, use conditions, and handle exceptions, but these features are intended for use within the context of a Data
Distiller job, such as creating and deleting datasets, including temporary tables.

Remember that Anonymous Blocks are primarily used for procedural logic (e.g., variable assignments, loops, error
handling, DML operations) and do not support interactive result streaming.

The query below is expected to take about 20-30 minutes to complete, with around 10 minutes spent on spinning up
and down resources, and an additional 10-20 minutes writing the data to the data lake. Keep in mind that data
mastering might be delayed by other processes writing to the data lake.

Do not execute the query just yet, as you’ll end up waiting a long time for it to finish. Instead, you can comment out
the BEGIN and END lines and change TABLE to TEMP TABLE to bypass the batch processing engine and run the query
in ad hoc mode (see the sketch below). TEMP TABLES are cached for the session. Once you’ve verified the results, you can then execute the
full query.
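
A minimal sketch of that dry run, assuming you validate a single month first; the name stock_price_table_temp is illustrative:

-- TEMP TABLE runs in ad hoc mode and is cached only for the session
CREATE TEMP TABLE stock_price_table_temp AS
SELECT
    date_add('2025-01-01', seq.i) AS date,
    CAST(30 + (RAND() * 30) AS DECIMAL(5, 2)) AS stock_price
FROM
    (SELECT explode(sequence(0, 30)) AS i) seq
ORDER BY date;

-- Inspect the generated data before committing to the full BEGIN...END job
SELECT * FROM stock_price_table_temp LIMIT 10;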

Ideally, you should schedule this query to run in the background, as your time is valuable, and it’s essential to use the
most efficient query techniques for deployment.

<span class="katex-display"><span class="katex"><span class="katex-mathml">


<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics>
<mrow><mi>B</mi><mi>E</mi><mi>G</mi><mi>I</mi><mi>N</mi><mo>−</mo><mo>−</mo>
<mi>D</mi><mi>r</mi><mi>o</mi><mi>p</mi><mi>t</mi><mi>h</mi><mi>e</mi>
<mi>t</mi><mi>a</mi><mi>b</mi><mi>l</mi><mi>e</mi><mi>i</mi><mi>f</mi>
<mi>i</mi><mi>t</mi><mi>e</mi><mi>x</mi><mi>i</mi><mi>s</mi><mi>t</mi>
<mi>s</mi><mi>D</mi><mi>R</mi><mi>O</mi><mi>P</mi><mi>T</mi><mi>A</mi>
<mi>B</mi><mi>L</mi><mi>E</mi><mi>I</mi><mi>F</mi><mi>E</mi><mi>X</mi>
<mi>I</mi><mi>S</mi><mi>T</mi><mi>S</mi><mi>s</mi><mi>t</mi><mi>o</mi>
<mi>c</mi><msub><mi>k</mi><mi>p</mi></msub><mi>r</mi><mi>i</mi><mi>c</mi><msub>
<mi>e</mi><mi>t</mi></msub><mi>a</mi><mi>b</mi><mi>l</mi><mi>e</mi><mo
separator="true">;</mo><mo>−</mo><mo>−</mo><mi>C</mi><mi>r</mi><mi>e</mi>
<mi>a</mi><mi>t</mi><mi>e</mi><mi>a</mi><mi>n</mi><mi>e</mi><mi>m</mi>
<mi>p</mi><mi>t</mi><mi>y</mi><mi>d</mi><mi>a</mi><mi>t</mi><mi>a</mi>
<mi>s</mi><mi>e</mi><mi>t</mi><mi>v</mi><mi>i</mi><mi>a</mi><mi>a</mi>
<mi>c</mi><mi>o</mi><mi>n</mi><mi>t</mi><mi>r</mi><mi>a</mi><mi>d</mi>
<mi>i</mi><mi>c</mi><mi>t</mi><mi>i</mi><mi>o</mi><mi>n</mi><mi>C</mi>
<mi>R</mi><mi>E</mi><mi>A</mi><mi>T</mi><mi>E</mi><mi>T</mi><mi>A</mi>
<mi>B</mi><mi>L</mi><mi>E</mi><mi>s</mi><mi>t</mi><mi>o</mi><mi>c</mi><msub>
<mi>k</mi><mi>p</mi></msub><mi>r</mi><mi>i</mi><mi>c</mi><msub><mi>e</mi>
<mi>t</mi></msub><mi>a</mi><mi>b</mi><mi>l</mi><mi>e</mi><mi>A</mi><mi>S</mi>
<mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi><mi>T</mi><mi>C</mi>
<mi>A</mi><mi>S</mi><mi>T</mi><mo stretchy="false">(</mo><mi>N</mi><mi>U</mi>
<mi>L</mi><mi>L</mi><mi>A</mi><mi>S</mi><mi>D</mi><mi>A</mi><mi>T</mi>
<mi>E</mi><mo stretchy="false">)</mo><mi>A</mi><mi>S</mi><mi>d</mi><mi>a</mi>
<mi>t</mi><mi>e</mi><mo separator="true">,</mo><mi>C</mi><mi>A</mi><mi>S</mi>
<mi>T</mi><mo stretchy="false">(</mo><mi>N</mi><mi>U</mi><mi>L</mi><mi>L</mi>
<mi>A</mi><mi>S</mi><mi>D</mi><mi>E</mi><mi>C</mi><mi>I</mi><mi>M</mi>
<mi>A</mi><mi>L</mi><mo stretchy="false">(</mo><mn>5</mn><mo separator="true">,
</mo><mn>2</mn><mo stretchy="false">)</mo><mo stretchy="false">)</mo><mi>A</mi>
<mi>S</mi><mi>s</mi><mi>t</mi><mi>o</mi><mi>c</mi><msub><mi>k</mi><mi>p</mi>
</msub><mi>r</mi><mi>i</mi><mi>c</mi><mi>e</mi><mi>W</mi><mi>H</mi><mi>E</mi>
<mi>R</mi><mi>E</mi><mi>F</mi><mi>A</mi><mi>L</mi><mi>S</mi><mi>E</mi><mo
separator="true">;</mo><mo>−</mo><mo>−</mo><mi>I</mi><mi>n</mi><mi>s</mi>
<mi>e</mi><mi>r</mi><mi>t</mi><mi>f</mi><mi>o</mi><mi>r</mi><mi>J</mi>
<mi>a</mi><mi>n</mi><mi>u</mi><mi>a</mi><mi>r</mi><mi>y</mi><mn>2025</mn>
<mi>I</mi><mi>N</mi><mi>S</mi><mi>E</mi><mi>R</mi><mi>T</mi><mi>I</mi>
<mi>N</mi><mi>T</mi><mi>O</mi><mi>s</mi><mi>t</mi><mi>o</mi><mi>c</mi><msub>
<mi>k</mi><mi>p</mi></msub><mi>r</mi><mi>i</mi><mi>c</mi><msub><mi>e</mi>
<mi>t</mi></msub><mi>a</mi><mi>b</mi><mi>l</mi><mi>e</mi><mi>S</mi><mi>E</mi>
<mi>L</mi><mi>E</mi><mi>C</mi><mi>T</mi><mi>d</mi><mi>a</mi><mi>t</mi><msub>
<mi>e</mi><mi>a</mi></msub><mi>d</mi><mi>d</mi><msup><mo stretchy="false">
(</mo><mo mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup>
<mn>2025</mn><mo>−</mo><mn>01</mn><mo>−</mo><msup><mn>01</mn><mo
mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mo
separator="true">,</mo><mi>s</mi><mi>e</mi><mi>q</mi><mi mathvariant="normal">.
</mi><mi>i</mi><mo stretchy="false">)</mo><mi>A</mi><mi>S</mi><mi>d</mi>
<mi>a</mi><mi>t</mi><mi>e</mi><mo separator="true">,</mo><mi>C</mi><mi>A</mi>
<mi>S</mi><mi>T</mi><mo stretchy="false">(</mo><mn>30</mn><mo>+</mo><mo
stretchy="false">(</mo><mi>R</mi><mi>A</mi><mi>N</mi><mi>D</mi><mo
stretchy="false">(</mo><mo stretchy="false">)</mo><mo>∗</mo><mn>30</mn><mo
stretchy="false">)</mo><mi>A</mi><mi>S</mi><mi>D</mi><mi>E</mi><mi>C</mi>
<mi>I</mi><mi>M</mi><mi>A</mi><mi>L</mi><mo stretchy="false">(</mo><mn>5</mn>
<mo separator="true">,</mo><mn>2</mn><mo stretchy="false">)</mo><mo
stretchy="false">)</mo><mi>A</mi><mi>S</mi><mi>s</mi><mi>t</mi><mi>o</mi>
<mi>c</mi><msub><mi>k</mi><mi>p</mi></msub><mi>r</mi><mi>i</mi><mi>c</mi>
<mi>e</mi><mi>F</mi><mi>R</mi><mi>O</mi><mi>M</mi><mo stretchy="false">(</mo>
<mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi><mi>T</mi><mi>e</mi>
<mi>x</mi><mi>p</mi><mi>l</mi><mi>o</mi><mi>d</mi><mi>e</mi><mo
stretchy="false">(</mo><mi>s</mi><mi>e</mi><mi>q</mi><mi>u</mi><mi>e</mi>
<mi>n</mi><mi>c</mi><mi>e</mi><mo stretchy="false">(</mo><mn>0</mn><mo
separator="true">,</mo><mn>30</mn><mo stretchy="false">)</mo><mo
stretchy="false">)</mo><mi>A</mi><mi>S</mi><mi>i</mi><mo stretchy="false">)
</mo><mi>s</mi><mi>e</mi><mi>q</mi><mi>O</mi><mi>R</mi><mi>D</mi><mi>E</mi>
<mi>R</mi><mi>B</mi><mi>Y</mi><mi>d</mi><mi>a</mi><mi>t</mi><mi>e</mi><mo
separator="true">;</mo><mo>−</mo><mo>−</mo><mi>I</mi><mi>n</mi><mi>s</mi>
<mi>e</mi><mi>r</mi><mi>t</mi><mi>d</mi><mi>a</mi><mi>t</mi><mi>a</mi>
<mi>f</mi><mi>o</mi><mi>r</mi><mi>F</mi><mi>e</mi><mi>b</mi><mi>r</mi>
<mi>u</mi><mi>a</mi><mi>r</mi><mi>y</mi><mn>2025</mn><mi>I</mi><mi>N</mi>
<mi>S</mi><mi>E</mi><mi>R</mi><mi>T</mi><mi>I</mi><mi>N</mi><mi>T</mi>
<mi>O</mi><mi>s</mi><mi>t</mi><mi>o</mi><mi>c</mi><msub><mi>k</mi><mi>p</mi>
</msub><mi>r</mi><mi>i</mi><mi>c</mi><msub><mi>e</mi><mi>t</mi></msub>
<mi>a</mi><mi>b</mi><mi>l</mi><mi>e</mi><mi>S</mi><mi>E</mi><mi>L</mi>
<mi>E</mi><mi>C</mi><mi>T</mi><mi>d</mi><mi>a</mi><mi>t</mi><msub><mi>e</mi>
<mi>a</mi></msub><mi>d</mi><mi>d</mi><msup><mo stretchy="false">(</mo><mo
mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mn>2025</mn>
<mo>−</mo><mn>02</mn><mo>−</mo><msup><mn>01</mn><mo mathvariant="normal"
lspace="0em" rspace="0em">′</mo></msup><mo separator="true">,</mo><mi>s</mi>
<mi>e</mi><mi>q</mi><mi mathvariant="normal">.</mi><mi>i</mi><mo
stretchy="false">)</mo><mi>A</mi><mi>S</mi><mi>d</mi><mi>a</mi><mi>t</mi>
<mi>e</mi><mo separator="true">,</mo><mi>C</mi><mi>A</mi><mi>S</mi><mi>T</mi>
<mo stretchy="false">(</mo><mn>30</mn><mo>+</mo><mo stretchy="false">(</mo>
<mi>R</mi><mi>A</mi><mi>N</mi><mi>D</mi><mo stretchy="false">(</mo><mo
stretchy="false">)</mo><mo>∗</mo><mn>30</mn><mo stretchy="false">)</mo>
<mi>A</mi><mi>S</mi><mi>D</mi><mi>E</mi><mi>C</mi><mi>I</mi><mi>M</mi>
<mi>A</mi><mi>L</mi><mo stretchy="false">(</mo><mn>5</mn><mo separator="true">,
</mo><mn>2</mn><mo stretchy="false">)</mo><mo stretchy="false">)</mo><mi>A</mi>
<mi>S</mi><mi>s</mi><mi>t</mi><mi>o</mi><mi>c</mi><msub><mi>k</mi><mi>p</mi>
</msub><mi>r</mi><mi>i</mi><mi>c</mi><mi>e</mi><mi>F</mi><mi>R</mi><mi>O</mi>
<mi>M</mi><mo stretchy="false">(</mo><mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi>
<mi>C</mi><mi>T</mi><mi>e</mi><mi>x</mi><mi>p</mi><mi>l</mi><mi>o</mi>
<mi>d</mi><mi>e</mi><mo stretchy="false">(</mo><mi>s</mi><mi>e</mi><mi>q</mi>
<mi>u</mi><mi>e</mi><mi>n</mi><mi>c</mi><mi>e</mi><mo stretchy="false">(</mo>
<mn>0</mn><mo separator="true">,</mo><mn>27</mn><mo stretchy="false">)</mo><mo
stretchy="false">)</mo><mi>A</mi><mi>S</mi><mi>i</mi><mo stretchy="false">)
</mo><mi>s</mi><mi>e</mi><mi>q</mi><mo>−</mo><mo>−</mo><mi>F</mi><mi>e</mi>
<mi>b</mi><mi>r</mi><mi>u</mi><mi>a</mi><mi>r</mi><mi>y</mi><mi>h</mi>
<mi>a</mi><mi>s</mi><mn>28</mn><mi>d</mi><mi>a</mi><mi>y</mi><mi>s</mi>
<mi>i</mi><mi>n</mi><mn>2025</mn><mi>O</mi><mi>R</mi><mi>D</mi><mi>E</mi>
<mi>R</mi><mi>B</mi><mi>Y</mi><mi>d</mi><mi>a</mi><mi>t</mi><mi>e</mi><mo
separator="true">;</mo><mo>−</mo><mo>−</mo><mi>I</mi><mi>n</mi><mi>s</mi>
<mi>e</mi><mi>r</mi><mi>t</mi><mi>d</mi><mi>a</mi><mi>t</mi><mi>a</mi>
<mi>f</mi><mi>o</mi><mi>r</mi><mi>M</mi><mi>a</mi><mi>r</mi><mi>c</mi>
<mi>h</mi><mn>2025</mn><mi>I</mi><mi>N</mi><mi>S</mi><mi>E</mi><mi>R</mi>
<mi>T</mi><mi>I</mi><mi>N</mi><mi>T</mi><mi>O</mi><mi>s</mi><mi>t</mi>
<mi>o</mi><mi>c</mi><msub><mi>k</mi><mi>p</mi></msub><mi>r</mi><mi>i</mi>
<mi>c</mi><msub><mi>e</mi><mi>t</mi></msub><mi>a</mi><mi>b</mi><mi>l</mi>
<mi>e</mi><mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi><mi>T</mi>
<mi>d</mi><mi>a</mi><mi>t</mi><msub><mi>e</mi><mi>a</mi></msub><mi>d</mi>
<mi>d</mi><msup><mo stretchy="false">(</mo><mo mathvariant="normal"
lspace="0em" rspace="0em">′</mo></msup><mn>2025</mn><mo>−</mo><mn>03</mn>
<mo>−</mo><msup><mn>01</mn><mo mathvariant="normal" lspace="0em"
rspace="0em">′</mo></msup><mo separator="true">,</mo><mi>s</mi><mi>e</mi>
<mi>q</mi><mi mathvariant="normal">.</mi><mi>i</mi><mo stretchy="false">)</mo>
<mi>A</mi><mi>S</mi><mi>d</mi><mi>a</mi><mi>t</mi><mi>e</mi><mo
separator="true">,</mo><mi>C</mi><mi>A</mi><mi>S</mi><mi>T</mi><mo
stretchy="false">(</mo><mn>30</mn><mo>+</mo><mo stretchy="false">(</mo>
<mi>R</mi><mi>A</mi><mi>N</mi><mi>D</mi><mo stretchy="false">(</mo><mo
stretchy="false">)</mo><mo>∗</mo><mn>30</mn><mo stretchy="false">)</mo>
<mi>A</mi><mi>S</mi><mi>D</mi><mi>E</mi><mi>C</mi><mi>I</mi><mi>M</mi>
<mi>A</mi><mi>L</mi><mo stretchy="false">(</mo><mn>5</mn><mo separator="true">,
</mo><mn>2</mn><mo stretchy="false">)</mo><mo stretchy="false">)</mo><mi>A</mi>
<mi>S</mi><mi>s</mi><mi>t</mi><mi>o</mi><mi>c</mi><msub><mi>k</mi><mi>p</mi>
</msub><mi>r</mi><mi>i</mi><mi>c</mi><mi>e</mi><mi>F</mi><mi>R</mi><mi>O</mi>
<mi>M</mi><mo stretchy="false">(</mo><mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi>
<mi>C</mi><mi>T</mi><mi>e</mi><mi>x</mi><mi>p</mi><mi>l</mi><mi>o</mi>
<mi>d</mi><mi>e</mi><mo stretchy="false">(</mo><mi>s</mi><mi>e</mi><mi>q</mi>
<mi>u</mi><mi>e</mi><mi>n</mi><mi>c</mi><mi>e</mi><mo stretchy="false">(</mo>
<mn>0</mn><mo separator="true">,</mo><mn>30</mn><mo stretchy="false">)</mo><mo
stretchy="false">)</mo><mi>A</mi><mi>S</mi><mi>i</mi><mo stretchy="false">)
</mo><mi>s</mi><mi>e</mi><mi>q</mi><mo>−</mo><mo>−</mo><mi>M</mi><mi>a</mi>
<mi>r</mi><mi>c</mi><mi>h</mi><mi>h</mi><mi>a</mi><mi>s</mi><mn>31</mn>
<mi>d</mi><mi>a</mi><mi>y</mi><mi>s</mi><mi>O</mi><mi>R</mi><mi>D</mi>
<mi>E</mi><mi>R</mi><mi>B</mi><mi>Y</mi><mi>d</mi><mi>a</mi><mi>t</mi>
<mi>e</mi><mo separator="true">;</mo><mi>E</mi><mi>N</mi><mi>D</mi></mrow>
<annotation encoding="application/x-tex">
BEGIN
--Drop the table if it exists
DROP TABLE IF EXISTS stock_price_table;

--Create an empty dataset via a contradiction


CREATE TABLE stock_price_table AS
SELECT
CAST(NULL AS DATE) AS date,
CAST(NULL AS DECIMAL(5, 2)) AS stock_price
WHERE FALSE;

-- Insert for January 2025


INSERT INTO stock_price_table
SELECT
date_add(&#x27;2025-01-01&#x27;, seq.i) AS date,
CAST(30 + (RAND() * 30) AS DECIMAL(5, 2)) AS stock_price
FROM
(SELECT explode(sequence(0, 30)) AS i) seq
ORDER BY date;

--Insert data for February 2025


INSERT INTO stock_price_table
SELECT
date_add(&#x27;2025-02-01&#x27;, seq.i) AS date,
CAST(30 + (RAND() * 30) AS DECIMAL(5, 2)) AS stock_price
FROM
(SELECT explode(sequence(0, 27)) AS i) seq -- February has 28 days in 2025
ORDER BY date;

--Insert data for March 2025


INSERT INTO stock_price_table
SELECT
date_add(&#x27;2025-03-01&#x27;, seq.i) AS date,
CAST(30 + (RAND() * 30) AS DECIMAL(5, 2)) AS stock_price
FROM
(SELECT explode(sequence(0, 30)) AS i) seq -- March has 31 days
ORDER BY date;

END
</annotation></semantics></math></span><span class="katex-html" aria-
hidden="true"><span class="base"><span class="strut"
style="height:0.7667em;vertical-align:-0.0833em;"></span><span class="mord
mathnormal">BEG</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:0.9805em;vertical-align:-0.2861em;"></span><span
class="mord">−</span><span class="mord mathnormal">Dro</span><span class="mord
mathnormal">pt</span><span class="mord mathnormal">h</span><span class="mord
mathnormal">e</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">ab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">i</span><span class="mord mathnormal" style="margin-
right:0.10764em;">f</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">x</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">t</span><span
class="mord mathnormal" style="margin-right:0.02778em;">sD</span><span
class="mord mathnormal" style="margin-right:0.13889em;">ROPT</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05017em;">B</span><span class="mord mathnormal">L</span><span
class="mord mathnormal" style="margin-right:0.05764em;">E</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.07847em;">FEX</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.05764em;">STS</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">oc</span><span class="mord"><span class="mord
mathnormal" style="margin-right:0.03148em;">k</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.0315em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">p</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span>
</span></span></span></span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">c</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">ab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">e</span><span
class="mpunct">;</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord">−</span><span class="mspace" style="margin-
right:0.2222em;"></span><span class="mbin">−</span><span class="mspace"
style="margin-right:0.2222em;"></span></span><span class="base"><span
class="strut" style="height:1.0361em;vertical-align:-0.2861em;"></span><span
class="mord mathnormal" style="margin-right:0.07153em;">C</span><span
class="mord mathnormal">re</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">an</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">m</span><span class="mord mathnormal">pt</span><span
class="mord mathnormal" style="margin-right:0.03588em;">y</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">se</span><span class="mord mathnormal">t</span><span
class="mord mathnormal" style="margin-right:0.03588em;">v</span><span
class="mord mathnormal">iaa</span><span class="mord mathnormal">co</span><span
class="mord mathnormal">n</span><span class="mord mathnormal">t</span><span
class="mord mathnormal" style="margin-right:0.02778em;">r</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">c</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">n</span><span
class="mord mathnormal" style="margin-right:0.05764em;">CRE</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.13889em;">TET</span><span class="mord mathnormal">A</span><span
class="mord mathnormal" style="margin-right:0.05017em;">B</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.05764em;">E</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">oc</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.03148em;">k</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:-0.0315em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">p</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.2861em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">c</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">ab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">SSE</span><span class="mord mathnormal">L</span><span
class="mord mathnormal" style="margin-right:0.07153em;">ECTC</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ST</span><span class="mopen">(</span><span class="mord
mathnormal" style="margin-right:0.10903em;">N</span><span class="mord
mathnormal">ULL</span><span class="mord mathnormal">A</span><span class="mord
mathnormal" style="margin-right:0.05764em;">S</span><span class="mord
mathnormal" style="margin-right:0.02778em;">D</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">TE</span><span class="mclose">)</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">e</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal" style="margin-right:0.07153em;">C</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ST</span><span class="mopen">(</span><span class="mord
mathnormal" style="margin-right:0.10903em;">N</span><span class="mord
mathnormal">ULL</span><span class="mord mathnormal">A</span><span class="mord
mathnormal" style="margin-right:0.05764em;">S</span><span class="mord
mathnormal" style="margin-right:0.02778em;">D</span><span class="mord
mathnormal" style="margin-right:0.07153em;">EC</span><span class="mord
mathnormal" style="margin-right:0.07847em;">I</span><span class="mord
mathnormal" style="margin-right:0.10903em;">M</span><span class="mord
mathnormal">A</span><span class="mord mathnormal">L</span><span class="mopen">
(</span><span class="mord">5</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span
class="mord">2</span><span class="mclose">))</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">oc</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.03148em;">k</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:-0.0315em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">p</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.2861em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">ce</span><span class="mord mathnormal" style="margin-
right:0.13889em;">W</span><span class="mord mathnormal" style="margin-
right:0.08125em;">H</span><span class="mord mathnormal" style="margin-
right:0.13889em;">EREF</span><span class="mord mathnormal">A</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.05764em;">SE</span><span class="mpunct">;</span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord">−</span><span
class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:1.088em;vertical-align:-0.2861em;"></span><span class="mord
mathnormal" style="margin-right:0.07847em;">I</span><span class="mord
mathnormal">n</span><span class="mord mathnormal" style="margin-
right:0.02778em;">ser</span><span class="mord mathnormal">t</span><span
class="mord mathnormal" style="margin-right:0.10764em;">f</span><span
class="mord mathnormal" style="margin-right:0.02778em;">or</span><span
class="mord mathnormal" style="margin-right:0.09618em;">J</span><span
class="mord mathnormal">an</span><span class="mord mathnormal">u</span><span
class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-
right:0.03588em;">ry</span><span class="mord">2025</span><span class="mord
mathnormal" style="margin-right:0.07847em;">I</span><span class="mord
mathnormal" style="margin-right:0.13889em;">NSERT</span><span class="mord
mathnormal" style="margin-right:0.07847em;">I</span><span class="mord
mathnormal" style="margin-right:0.02778em;">NTO</span><span class="mord
mathnormal">s</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">oc</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.03148em;">k</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.0315em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">p</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span>
</span></span></span></span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">c</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">ab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">e</span><span
class="mord mathnormal" style="margin-right:0.05764em;">SE</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ECT</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord"><span class="mord mathnormal">e</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">a</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">dd</span><span
class="mopen"><span class="mopen">(</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span
class="mord">2025</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:0.7278em;vertical-align:-0.0833em;"></span><span
class="mord">01</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:1.0519em;vertical-align:-0.25em;"></span><span
class="mord">0</span><span class="mord"><span class="mord">1</span><span
class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mpunct">,
</span><span class="mspace" style="margin-right:0.1667em;"></span><span
class="mord mathnormal">se</span><span class="mord mathnormal" style="margin-
right:0.03588em;">q</span><span class="mord">.</span><span class="mord
mathnormal">i</span><span class="mclose">)</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">e</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal" style="margin-right:0.07153em;">C</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ST</span><span class="mopen">(</span><span
class="mord">30</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">+</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:1em;vertical-align:-0.25em;"></span><span class="mopen">(</span>
<span class="mord mathnormal" style="margin-right:0.00773em;">R</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mopen">(</span><span class="mclose">)
</span><span class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">∗</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:1.0361em;vertical-align:-0.2861em;"></span><span
class="mord">30</span><span class="mclose">)</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.07153em;">EC</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.10903em;">M</span><span class="mord mathnormal">A</span><span
class="mord mathnormal">L</span><span class="mopen">(</span><span
class="mord">5</span><span class="mpunct">,</span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord">2</span><span
class="mclose">))</span><span class="mord mathnormal">A</span><span class="mord
mathnormal" style="margin-right:0.05764em;">S</span><span class="mord
mathnormal">s</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">oc</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.03148em;">k</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.0315em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">p</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span>
</span></span></span></span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">ce</span><span class="mord mathnormal" style="margin-
right:0.10903em;">FROM</span><span class="mopen">(</span><span class="mord
mathnormal" style="margin-right:0.05764em;">SE</span><span class="mord
mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ECT</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">x</span><span class="mord mathnormal" style="margin-
right:0.01968em;">pl</span><span class="mord mathnormal">o</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">e</span><span
class="mopen">(</span><span class="mord mathnormal">se</span><span class="mord
mathnormal" style="margin-right:0.03588em;">q</span><span class="mord
mathnormal">u</span><span class="mord mathnormal">e</span><span class="mord
mathnormal">n</span><span class="mord mathnormal">ce</span><span class="mopen">
(</span><span class="mord">0</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span
class="mord">30</span><span class="mclose">))</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">i</span><span
class="mclose">)</span><span class="mord mathnormal">se</span><span class="mord
mathnormal" style="margin-right:0.00773em;">qOR</span><span class="mord
mathnormal" style="margin-right:0.02778em;">D</span><span class="mord
mathnormal" style="margin-right:0.05017em;">ERB</span><span class="mord
mathnormal" style="margin-right:0.22222em;">Y</span><span class="mord
mathnormal">d</span><span class="mord mathnormal">a</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">e</span><span
class="mpunct">;</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord">−</span><span class="mspace" style="margin-
right:0.2222em;"></span><span class="mbin">−</span><span class="mspace"
style="margin-right:0.2222em;"></span></span><span class="base"><span
class="strut" style="height:1.088em;vertical-align:-0.2861em;"></span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal">n</span><span class="mord mathnormal" style="margin-
right:0.02778em;">ser</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.10764em;">f</span><span
class="mord mathnormal" style="margin-right:0.02778em;">or</span><span
class="mord mathnormal" style="margin-right:0.13889em;">F</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">b</span><span
class="mord mathnormal" style="margin-right:0.02778em;">r</span><span
class="mord mathnormal">u</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.03588em;">ry</span><span
class="mord">2025</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.13889em;">NSERT</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.02778em;">NTO</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">oc</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.03148em;">k</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:-0.0315em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">p</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.2861em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">c</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">ab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">e</span><span
class="mord mathnormal" style="margin-right:0.05764em;">SE</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ECT</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord"><span class="mord mathnormal">e</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">a</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">dd</span><span
class="mopen"><span class="mopen">(</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span
class="mord">2025</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:0.7278em;vertical-align:-0.0833em;"></span><span
class="mord">02</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:1.0519em;vertical-align:-0.25em;"></span><span
class="mord">0</span><span class="mord"><span class="mord">1</span><span
class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mpunct">,
</span><span class="mspace" style="margin-right:0.1667em;"></span><span
class="mord mathnormal">se</span><span class="mord mathnormal" style="margin-
right:0.03588em;">q</span><span class="mord">.</span><span class="mord
mathnormal">i</span><span class="mclose">)</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">e</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal" style="margin-right:0.07153em;">C</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ST</span><span class="mopen">(</span><span
class="mord">30</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">+</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:1em;vertical-align:-0.25em;"></span><span class="mopen">(</span>
<span class="mord mathnormal" style="margin-right:0.00773em;">R</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mopen">(</span><span class="mclose">)
</span><span class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">∗</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:1.0361em;vertical-align:-0.2861em;"></span><span
class="mord">30</span><span class="mclose">)</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.07153em;">EC</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.10903em;">M</span><span class="mord mathnormal">A</span><span
class="mord mathnormal">L</span><span class="mopen">(</span><span
class="mord">5</span><span class="mpunct">,</span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord">2</span><span
class="mclose">))</span><span class="mord mathnormal">A</span><span class="mord
mathnormal" style="margin-right:0.05764em;">S</span><span class="mord
mathnormal">s</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">oc</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.03148em;">k</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.0315em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">p</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span>
</span></span></span></span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">ce</span><span class="mord mathnormal" style="margin-
right:0.10903em;">FROM</span><span class="mopen">(</span><span class="mord
mathnormal" style="margin-right:0.05764em;">SE</span><span class="mord
mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ECT</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">x</span><span class="mord mathnormal" style="margin-
right:0.01968em;">pl</span><span class="mord mathnormal">o</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">e</span><span
class="mopen">(</span><span class="mord mathnormal">se</span><span class="mord
mathnormal" style="margin-right:0.03588em;">q</span><span class="mord
mathnormal">u</span><span class="mord mathnormal">e</span><span class="mord
mathnormal">n</span><span class="mord mathnormal">ce</span><span class="mopen">
(</span><span class="mord">0</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span
class="mord">27</span><span class="mclose">))</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">i</span><span
class="mclose">)</span><span class="mord mathnormal">se</span><span class="mord
mathnormal" style="margin-right:0.03588em;">q</span><span class="mspace"
style="margin-right:0.2222em;"></span><span class="mbin">−</span><span
class="mspace" style="margin-right:0.2222em;"></span></span><span class="base">
<span class="strut" style="height:0.8889em;vertical-align:-0.1944em;"></span>
<span class="mord">−</span><span class="mord mathnormal" style="margin-
right:0.13889em;">F</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">b</span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">u</span><span
class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-
right:0.03588em;">ry</span><span class="mord mathnormal">ha</span><span
class="mord mathnormal">s</span><span class="mord">28</span><span class="mord
mathnormal">d</span><span class="mord mathnormal">a</span><span class="mord
mathnormal">ys</span><span class="mord mathnormal">in</span><span
class="mord">2025</span><span class="mord mathnormal" style="margin-
right:0.00773em;">OR</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.05017em;">ERB</span><span class="mord mathnormal" style="margin-
right:0.22222em;">Y</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">e</span><span class="mpunct">;</span><span
class="mspace" style="margin-right:0.1667em;"></span><span
class="mord">−</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:1.088em;vertical-align:-0.2861em;"></span><span class="mord
mathnormal" style="margin-right:0.07847em;">I</span><span class="mord
mathnormal">n</span><span class="mord mathnormal" style="margin-
right:0.02778em;">ser</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.10764em;">f</span><span
class="mord mathnormal" style="margin-right:0.02778em;">or</span><span
class="mord mathnormal" style="margin-right:0.10903em;">M</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">rc</span><span
class="mord mathnormal">h</span><span class="mord">2025</span><span class="mord
mathnormal" style="margin-right:0.07847em;">I</span><span class="mord
mathnormal" style="margin-right:0.13889em;">NSERT</span><span class="mord
mathnormal" style="margin-right:0.07847em;">I</span><span class="mord
mathnormal" style="margin-right:0.02778em;">NTO</span><span class="mord
mathnormal">s</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">oc</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.03148em;">k</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.0315em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">p</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span>
</span></span></span></span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">c</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">ab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">e</span><span
class="mord mathnormal" style="margin-right:0.05764em;">SE</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ECT</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord"><span class="mord mathnormal">e</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">a</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">dd</span><span
class="mopen"><span class="mopen">(</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span
class="mord">2025</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:0.7278em;vertical-align:-0.0833em;"></span><span
class="mord">03</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:1.0519em;vertical-align:-0.25em;"></span><span
class="mord">0</span><span class="mord"><span class="mord">1</span><span
class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mpunct">,
</span><span class="mspace" style="margin-right:0.1667em;"></span><span
class="mord mathnormal">se</span><span class="mord mathnormal" style="margin-
right:0.03588em;">q</span><span class="mord">.</span><span class="mord
mathnormal">i</span><span class="mclose">)</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">e</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal" style="margin-right:0.07153em;">C</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ST</span><span class="mopen">(</span><span
class="mord">30</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">+</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:1em;vertical-align:-0.25em;"></span><span class="mopen">(</span>
<span class="mord mathnormal" style="margin-right:0.00773em;">R</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mopen">(</span><span class="mclose">)
</span><span class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">∗</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:1.0361em;vertical-align:-0.2861em;"></span><span
class="mord">30</span><span class="mclose">)</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.07153em;">EC</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.10903em;">M</span><span class="mord mathnormal">A</span><span
class="mord mathnormal">L</span><span class="mopen">(</span><span
class="mord">5</span><span class="mpunct">,</span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord">2</span><span
class="mclose">))</span><span class="mord mathnormal">A</span><span class="mord
mathnormal" style="margin-right:0.05764em;">S</span><span class="mord
mathnormal">s</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">oc</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.03148em;">k</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.0315em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">p</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span>
</span></span></span></span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">ce</span><span class="mord mathnormal" style="margin-
right:0.10903em;">FROM</span><span class="mopen">(</span><span class="mord
mathnormal" style="margin-right:0.05764em;">SE</span><span class="mord
mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ECT</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">x</span><span class="mord mathnormal" style="margin-
right:0.01968em;">pl</span><span class="mord mathnormal">o</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">e</span><span
class="mopen">(</span><span class="mord mathnormal">se</span><span class="mord
mathnormal" style="margin-right:0.03588em;">q</span><span class="mord
mathnormal">u</span><span class="mord mathnormal">e</span><span class="mord
mathnormal">n</span><span class="mord mathnormal">ce</span><span class="mopen">
(</span><span class="mord">0</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span
class="mord">30</span><span class="mclose">))</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">i</span><span
class="mclose">)</span><span class="mord mathnormal">se</span><span class="mord
mathnormal" style="margin-right:0.03588em;">q</span><span class="mspace"
style="margin-right:0.2222em;"></span><span class="mbin">−</span><span
class="mspace" style="margin-right:0.2222em;"></span></span><span class="base">
<span class="strut" style="height:0.8889em;vertical-align:-0.1944em;"></span>
<span class="mord">−</span><span class="mord mathnormal" style="margin-
right:0.10903em;">M</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">rc</span><span class="mord mathnormal">hha</span><span
class="mord mathnormal">s</span><span class="mord">31</span><span class="mord
mathnormal">d</span><span class="mord mathnormal">a</span><span class="mord
mathnormal">ys</span><span class="mord mathnormal" style="margin-
right:0.00773em;">OR</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.05017em;">ERB</span><span class="mord mathnormal" style="margin-
right:0.22222em;">Y</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">e</span><span class="mpunct">;</span><span
class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal" style="margin-right:0.10903em;">EN</span><span class="mord
mathnormal" style="margin-right:0.02778em;">D</span></span></span></span>
</span>;

Let us verify the results of the query:

SELECT * FROM stock_price_table
ORDER BY date;

If you are using DBVisualizer, you have to use the backslash to make the code work:

--/

**<span class="katex-display"><span class="katex"><span class="katex-mathml">


<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics>
<mrow><mi>B</mi><mi>E</mi><mi>G</mi><mi>I</mi><mi>N</mi><mo>∗</mo><mo>∗</mo><mi
mathvariant="normal">‘</mi><mi mathvariant="normal">‘</mi><mo>∗</mo><mo>∗</mo>
<mi>C</mi><mi>R</mi><mi>E</mi><mi>A</mi><mi>T</mi><mi>E</mi><mi>T</mi>
<mi>A</mi><mi>B</mi><mi>L</mi><mi>E</mi><mi>t</mi><mi>a</mi><mi>b</mi>
<mi>l</mi><msub><mi>e</mi><mn>1</mn></msub><mi>A</mi><mi>S</mi><mi>S</mi>
<mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi><mi>T</mi><mo>∗</mo><mi>F</mi>
<mi>R</mi><mi>O</mi><mi>M</mi><mi>T</mi><mi>A</mi><mi>B</mi><mi>L</mi><msub>
<mi>E</mi><mn>1</mn></msub><mo separator="true">;</mo><mo>∗</mo><mo>∗</mo><mi
mathvariant="normal">‘</mi><mi mathvariant="normal">‘</mi><mo>∗</mo><mo>∗</mo>
<mi>E</mi><mi>X</mi><mi>C</mi><mi>E</mi><mi>P</mi><mi>T</mi><mi>I</mi>
<mi>O</mi><mi>N</mi><mi>W</mi><mi>H</mi><mi>E</mi><mi>N</mi><mi>O</mi>
<mi>T</mi><mi>H</mi><mi>E</mi><mi>R</mi><mi>T</mi><mi>H</mi><mi>E</mi>
<mi>N</mi><mi>S</mi><mi>E</mi><mi>T</mi><mi mathvariant="normal">@</mi>
<mi>r</mi><mi>e</mi><mi>t</mi><mo>=</mo><mi>S</mi><mi>E</mi><mi>L</mi>
<mi>E</mi><mi>C</mi><msup><mi>T</mi><mo mathvariant="normal" lspace="0em"
rspace="0em">′</mo></msup><mi>E</mi><mi>R</mi><mi>R</mi><mi>O</mi><msup>
<mi>R</mi><mo mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mo
separator="true">;</mo><mo>∗</mo><mo>∗</mo><mi mathvariant="normal">‘</mi><mi
mathvariant="normal">‘</mi><mo>∗</mo><mo>∗</mo><mi>E</mi><mi>N</mi><mi>D</mi>
<mo>∗</mo><mo>∗</mo><mi mathvariant="normal">‘</mi><mi
mathvariant="normal">‘</mi><mo>∗</mo><mo>∗</mo></mrow><annotation
encoding="application/x-tex"> BEGIN**

**CREATE TABLE table_1 AS SELECT * FROM TABLE_1;**

**EXCEPTION WHEN OTHER THEN SET @ret = SELECT &#x27;ERROR&#x27;;**

**END**

**</annotation></semantics></math></span><span class="katex-html" aria-


hidden="true"><span class="base"><span class="strut" style="height:0.6833em;">
</span><span class="mord mathnormal">BEG</span><span class="mord mathnormal"
style="margin-right:0.07847em;">I</span><span class="mord mathnormal"
style="margin-right:0.10903em;">N</span><span class="mspace" style="margin-
right:0.2222em;"></span><span class="mbin">∗</span><span class="mspace"
style="margin-right:0.2222em;"></span></span><span class="base"><span
class="strut" style="height:0.6944em;"></span><span class="mord">∗</span><span
class="mord">‘‘</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">∗</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:0.8444em;vertical-align:-0.15em;"></span><span
class="mord">∗</span><span class="mord mathnormal" style="margin-
right:0.05764em;">CRE</span><span class="mord mathnormal">A</span><span
class="mord mathnormal" style="margin-right:0.13889em;">TET</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05017em;">B</span><span class="mord mathnormal">L</span><span
class="mord mathnormal">Et</span><span class="mord mathnormal">ab</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord"><span class="mord mathnormal">e</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3011em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span>
</span></span></span><span class="vlist-s">​
</span></span><span class="vlist-r">
<span class="vlist" style="height:0.15em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal">A</span><span class="mord
mathnormal" style="margin-right:0.05764em;">SSE</span><span class="mord
mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ECT</span><span class="mspace" style="margin-
right:0.2222em;"></span><span class="mbin">∗</span><span class="mspace"
style="margin-right:0.2222em;"></span></span><span class="base"><span
class="strut" style="height:0.8778em;vertical-align:-0.1944em;"></span><span
class="mord mathnormal" style="margin-right:0.13889em;">FROMT</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05017em;">B</span><span class="mord mathnormal">L</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.05764em;">E</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span
style="top:-2.55em;margin-left:-0.0576em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight">1</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mpunct">;</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord">∗</span><span class="mspace" style="margin-
right:0.2222em;"></span><span class="mbin">∗</span><span class="mspace"
style="margin-right:0.2222em;"></span></span><span class="base"><span
class="strut" style="height:0.6944em;"></span><span class="mord">‘‘</span><span
class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">∗</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut" style="height:0.6944em;">
</span><span class="mord">∗</span><span class="mord mathnormal" style="margin-
right:0.13889em;">EXCEPT</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.10903em;">ON</span><span class="mord mathnormal" style="margin-
right:0.13889em;">W</span><span class="mord mathnormal" style="margin-
right:0.08125em;">H</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ENOT</span><span class="mord mathnormal" style="margin-
right:0.08125em;">H</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ERT</span><span class="mord mathnormal" style="margin-
right:0.08125em;">H</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ENSET</span><span class="mord">@</span><span class="mord
mathnormal">re</span><span class="mord mathnormal">t</span><span class="mspace"
style="margin-right:0.2778em;"></span><span class="mrel">=</span><span
class="mspace" style="margin-right:0.2778em;"></span></span><span class="base">
<span class="strut" style="height:0.9963em;vertical-align:-0.1944em;"></span>
<span class="mord mathnormal" style="margin-right:0.05764em;">SE</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.07153em;">EC</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.13889em;">T</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mord
mathnormal" style="margin-right:0.02778em;">ERRO</span><span class="mord"><span
class="mord mathnormal" style="margin-right:0.00773em;">R</span><span
class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mpunct">;
</span><span class="mspace" style="margin-right:0.1667em;"></span><span
class="mord">∗</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">∗</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:0.6944em;"></span><span class="mord">‘‘</span><span
class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">∗</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut" style="height:0.6833em;">
</span><span class="mord">∗</span><span class="mord mathnormal" style="margin-
right:0.10903em;">EN</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">∗</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:0.6944em;"></span><span class="mord">∗</span><span
class="mord">‘‘</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">∗</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:0.4653em;"></span><span class="mord">∗</span></span></span>
</span></span>;**

Show all the SNAPSHOTS in a Dataset

A snapshot ID is a checkpoint marker, represented as a Long-type number, applied to a data lake table each time new
data is written. The **SNAPSHOT** clause is used in conjunction with the table relation it is associated with.

Let us first try and see all the snapshots that are there in the table:

SELECT history_meta('stock_price_table')

The results will look like this in the Data Distiller Query Pro Editor. There are 5 snapshot IDs; the first one is just the creation of an empty dataset. Each **INSERT INTO** led to a new snapshot: January data is in Snapshot ID=2, February data is in Snapshot ID=3, and March data is in Snapshot ID=4.

Remember that **history_meta** will only give you the rolling 7 days worth of snapshot data. If you want to
retain the history, you will need to create a Data Distiller job to insert this data periodically into a new table.

1. snapshot_generation: This indicates the generation or version of the snapshot. Each time data is written or
updated, a new snapshot is created with an incremented generation number.

2. made_current_at: This column represents the timestamp of when the snapshot was made current, showing when
that particular snapshot was applied or written to the table.

3. snapshot_id: This is the unique identifier for each snapshot. It’s typically a Long-type number used to refer to
specific snapshots of the data.

4. parent_id: This field shows the parent snapshot ID, which means the snapshot from which the current snapshot
evolved. It reflects the relationship between snapshots where one might have been derived or evolved from
another.

5. is_current_ancestor: This is a Boolean column indicating whether this snapshot is an ancestor of the current
snapshot. If true, it means that this snapshot is part of the lineage leading up to the most recent snapshot.

6. is_current: This Boolean flag indicates whether this snapshot is the most current one. If true, it marks the latest
version of the table as of that snapshot.

7. output_record_count: This shows the number of records (rows) in the snapshot when it was created.

8. output_byte_size: This represents the size of the snapshot in bytes, indicating how much data was stored in that
snapshot.

Note that **snapshot_ids** will be monotonic, i.e. always increasing, but they will not necessarily be sequential (0, 1, 2, 3, 4) as they are generated and used by other datasets as well. They could well look like (0, 1, 2, 32, 43).
Keep in mind that summing the **output_byte_size** column provides a good approximation of the total dataset size, though it doesn't include metadata. The same approach applies to counting the total number of records in the dataset. Additionally, you can compute the richness of the records for each snapshot by dividing the size of the snapshot by the number of records in that snapshot.

CREATE TEMP TABLE stock_meta_table AS SELECT history_meta('stock_price_table');

SELECT * FROM stock_meta_table;

It is recommended to create a **TEMP TABLE** instead of a permanent table, as materializing the dataset can take
several minutes. Keep in mind that the history_meta function only provides the last 7 days of snapshot data, which is
sufficient for most use cases like incremental processing. If you need to persist all snapshot information beyond this
period, you will need to set up a Data Distiller job to read new snapshots and regularly persist them to a table in the
data lake.
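If you do decide to persist this history, a minimal sketch of such a job is shown below. It assumes a hypothetical target table named snapshot_history_table that you have already created with the same columns as the history_meta output; you would schedule it to run periodically and deduplicate on snapshot_id downstream:

-- Append the current 7-day window of snapshot metadata to a persistent table
-- (snapshot_history_table is a hypothetical table created beforehand)
INSERT INTO snapshot_history_table
SELECT * FROM history_meta('stock_price_table');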

The number of records across all snapshots logged in the last 7 days is:

SELECT SUM(output_record_count) FROM stock_meta_table;

The approximate size of this dataset in GB based on the record sizes in the snapshots is:

SELECT SUM(output_byte_size) / 1073741824 AS total_size_gb
FROM stock_meta_table;
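As a small illustration of the richness computation described above (a sketch only, using the metadata columns listed earlier), the average bytes per record for each snapshot can be computed as:

-- Approximate bytes per record for each snapshot
SELECT snapshot_id,
       output_byte_size / output_record_count AS bytes_per_record
FROM stock_meta_table
WHERE output_record_count > 0
ORDER BY snapshot_id;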

Execute SNAPSHOT Clause-Based Queries

SELECT Data from a SNAPSHOT SINCE a Start SNAPSHOT ID

SELECT *
FROM stock_price_table
SNAPSHOT SINCE 2 -- Replace '2' with your desired start_snapshot_id
ORDER BY date;

This query retrieves data written after the snapshot with ID 2, so all dates in February and March are returned, inclusive.

**SNAPSHOT** with a **SINCE** clause excludes the snapshot in its clause but includes all snapshots after it.

SELECTData from AS OF Snapshot ID

SELECT *
FROM stock_price_table
SNAPSHOT AS OF 3 -- Replace '3' with your desired snapshot_id
ORDER BY date;

This query retrieves data as it existed at the time of snapshot ID 3. This will show the data for both January and February, all dates inclusive.

**SNAPSHOT** with an **AS OF** clause includes the snapshot in its clause as well as all snapshots before it.

SELECT Data Between Two SNAPSHOT IDs

SELECT *
FROM stock_price_table
SNAPSHOT BETWEEN 2 AND 4 -- Replace '2' and '4' with your desired start and end snapshot IDs
ORDER BY date;

This retrieves data changes that occurred between snapshot IDs 2 and 4. This will get you all the results for February
and March. The starting Snapshot ID=2 is excluded but all the other snapshot IDs 3 and 4 are included.

**SNAPSHOT** with a **BETWEEN** clause will always exclude the first snapshot but include the last one.

SELECT Data Between the Earliest SNAPSHOT (HEAD) and a Specific SNAPSHOT

SELECT *
FROM stock_price_table
SNAPSHOT BETWEEN 'HEAD' AND 2 -- Replace '2' with your desired end_snapshot_id
ORDER BY date;

**HEAD** in the SNAPSHOT clause represents the earliest **SNAPSHOT** ID, i.e. 0. This retrieves data between the earliest snapshot (HEAD), which is excluded, and SNAPSHOT ID=2, which is included, so you will only see the month of January.

SELECT Data Between a Specific SNAPSHOT and the Latest SNAPSHOT (TAIL)

SELECT *
FROM stock_price_table
SNAPSHOT BETWEEN 2 AND 'TAIL' -- Replace '2' with your desired start_snapshot_id
ORDER BY date;

This retrieves data between snapshot ID=2, which is excluded, and the very last snapshot (TAIL, i.e. 4), which is included. You will only see the months of February and March.

Trapping Errors via Exception Handling

In our sequential chaining of SQL queries within the Anonymous Block, there’s a significant flaw: what if a syntax
error causes a data insertion to fail, but the next block contains a DROP command? As it stands, the Anonymous Block
will continue executing each SQL block, regardless of whether the previous ones succeeded or failed. This is
problematic because a small error could trigger a domino effect, potentially causing further damage to the system. To
avoid this, we need a way to stop execution when an error occurs and trap the error for debugging purposes.
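To make the pattern concrete, here is a minimal sketch of an Anonymous Block with an exception handler. The table names TABLE_A and TABLE_B are hypothetical placeholders, not tables from this tutorial:

BEGIN
-- Step 1: a data insertion that might fail at runtime
INSERT INTO TABLE_A SELECT * FROM TABLE_B;
-- Step 2: a destructive statement that should not run if Step 1 fails
DROP TABLE IF EXISTS TABLE_B;
EXCEPTION WHEN OTHER THEN SET @ret = SELECT 'ERROR';
END;

If the insert fails, the EXCEPTION handler traps the error, @ret is set to 'ERROR', and execution stops before the DROP runs.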

1. Let us first execute a query that has a deliberate syntax error ('ASA' instead of 'AS'). You should see the error in an instant; **EXCEPTION** handling does not kick in because syntax errors are caught at compile time:

BEGIN
DROP TABLE IF EXISTS TABLE_A;
CREATE TABLE TABLE_A ASA SELECT ...

Remember that:

**EXCEPTION WHEN OTHER THEN SET @ret = SELECT 'ERROR'**

This block handles any errors that occur during the execution of the BEGIN ... END block:

**EXCEPTION** is used to define error-handling logic. Syntax errors are caught at compile time, whereas **EXCEPTION** handles runtime errors related to the data or the tables themselves.

**WHEN OTHER THEN** catches any error or exception that happens in the preceding statements.
**SET @ret = SELECT 'ERROR'** assigns the value 'ERROR' to the variable @ret, signaling that an
error occurred during the execution.

Keep in mind that any variables declared within an Anonymous Block exist only for the duration of that block's execution. However, the @ret variable in the example above is unique because it's used in the EXCEPTION handling clause, allowing it to persist beyond the block's execution.

If an EXCEPTION in any of the chained queries is met, the query execution stops.

8:41:13 PM > Query failed in 0.484 seconds. 8:41:13 PM > ErrorCode: 42601 queryId: 3690c93f-270e-4b72-
8605-94003b131cc3 Syntax error encountered. Reason: [line 2:26: mismatched input ‘ASA’ expecting {‘.’, ‘(’,
‘;’, ‘COMMENT’, ‘WITH’}]

1. Let us execute the query trying to select a column that does not exist:

BEGIN
DROP TABLE IF EXISTS TABLE_A;
CREATE TABLE TABLE_A AS SELECT ...

2. The job will start executing and even report success, because the outer Anonymous Block executed successfully. However, if you go into Queries->Log, after some searching you will see:

3. The problem with searching in Queries->Log is that all of the queries inside the Anonymous Block have been disaggregated and logged separately. If we want to see all of the queries and their status, we need to take a different approach.

4. Navigate to Queries->Scheduled Queries and locate your failed query:

5. Click on the query and you should see the query run within the Anonymous Block listed in the left panel

6. You will see the status in the left panel per query. You will see the Overview that lists the entire query:

Scheduling of Anonymous Block

1. Copy and paste the following query into the Data Distiller Query Pro Mode Editor. All this query does is drop the table and recreate it.

BEGIN
DROP TABLE IF EXISTS TABLE_A;
CREATE TABLE TABLE_A AS SELECT ...

2. Name the template: Anonymous_test

3. Launch the template again from the Templates pane.

4. You should see the following:

5. Data Distiller Scheduler screen looks like the following:

Here are the parameters of the scheduler:

1. Frequency: Hourly, Daily, Weekly, Monthly, Yearly.

2. Every: When the schedule is supposed to execute. For example, if you choose the weekly option, you can choose which day(s) of the week you want this schedule to run.
3. Scheduled Start Time: Specified in UTC which can be extracted using the code:

**SELECT from_unixtime(unix_timestamp()) AS utc_time;**

4. Query Quarantine: Stops the schedule from wasting your resources if it fails 10 times in a row.

5. Standard alerts are available. The exception is Query Run Delay, where an alert is sent out if the running time of the query exceeds the Delay Time you have set. So if, for example, the Delay Time is 150 minutes and a query goes past the 150th minute, an alert will be sent. The query will still continue to execute until it succeeds or fails.

If you want anything custom, such as a frequency of every 15 minutes, you can use the Data Distiller APIs.

Last updated 5 months ago

Access the Data Distiller Query Pro Mode Editor

Query shows that data has been written to the dataset

The snapshot table giving us information about the sizes

You should get the same result as before.

The data from February and March are shown

AS OF means that all data that existed up to and including SNAPSHOT ID=3, i.e. January and February, is returned. Note that March 1 onwards (Snapshot ID=4) is excluded.

January month data is excluded

Results will show the month of January.

Results will be shown for February and March.

Errors were caught due to EXCEPTION handling.

Locate your failed query.

Query runs correspond to the execution of the query as per schedule

All queries within Anonymous Block will be listed in the left panel.

Add Schedule option becomes visible.

Data Distiller Scheduler screen

https://data-distiller.all-stuff-data.com/unit-4-data-distiller-data-enrichment/enrich-100-real-time-customer-profile-
overview * * *

1. UNIT 4: DATA DISTILLER DATA ENRICHMENT

ENRICH 100: Real-Time Customer Profile Overview


Learn how Data Distiller can power the Real-time Customer Profile that offers a comprehensive, real-time view of
individual customers.
The Real-time Customer Profile in Adobe Experience Platform is a centralized and unified customer data platform
(CDP) that provides a 360-degree view of individual customers in real time. It collects and combines data from various
sources, both online and offline, to create a comprehensive and up-to-date profile for each customer.

Key features and capabilities of the Real-time Customer Profile include:

1. Data Integration: It connects and integrates data from multiple sources such as websites, mobile apps, CRM
systems, email marketing platforms, and offline channels. This data includes customer interactions, behaviors,
preferences, and transactional data.

2. Real-time Data: The profile is updated in real time, ensuring that marketers and other teams have access to the
latest customer information as soon as it becomes available.

3. 360-Degree Customer View: It creates a holistic view of each customer by stitching together data fragments
from different touchpoints. This view includes demographic information, purchase history, engagement history,
product interests, and more.

4. Segmentation: Users can segment customers based on various criteria, such as location, behavior, demographics,
and preferences. These segments can be used for targeted marketing campaigns and personalized experiences.

5. Personalization: Marketers can leverage the Real-time Customer Profile to deliver highly personalized and
relevant content and offers to customers across various channels, including websites, emails, and mobile apps.

6. Real-time Activation: It allows for real-time activation of customer data, enabling marketers to trigger
personalized experiences and campaigns instantly based on customer behavior or actions.

7. Machine Learning and AI: The platform often incorporates machine learning and artificial intelligence (AI)
capabilities to analyze customer data, predict behavior, and recommend actions to optimize marketing efforts.

8. Privacy and Compliance: Adobe Experience Platform places a strong emphasis on data privacy and compliance.
It provides tools to manage customer consent and data governance, ensuring that businesses adhere to regulatory
requirements.

9. Cross-Channel Integration: The Real-time Customer Profile seamlessly integrates with other Adobe Experience
Cloud solutions, enabling businesses to deliver consistent and coordinated customer experiences across channels.

In summary, the Real-time Customer Profile in Adobe Experience Platform empowers businesses to understand their
customers deeply, engage them with personalized experiences, and drive better marketing outcomes by harnessing
real-time data and insights. It plays a crucial role in enhancing customer engagement, loyalty, and overall brand
success.

Last updated 6 months ago

https://data-distiller.all-stuff-data.com/unit-4-data-distiller-data-enrichment/enrich-101-behavior-based-
personalization-with-data-distiller-a-movie-genre-case-study * * *

1. UNIT 4: DATA DISTILLER DATA ENRICHMENT

ENRICH 101: Behavior-Based Personalization with Data Distiller: A


Movie Genre Case Study
Here's a basic tutorial that demonstrates the essential components of filtering, shaping, and data manipulation with Data Distiller.
Last updated 6 months ago

The story starts with a US company called GitFlix, a new startup, that has been able to identify its list of users and
their favorite movie genres. As a GitFlix marketer, your goal is to figure out the top genres that are popular by State
and for each such combination, create a list of emails to run a campaign against.

One of the key learnings I want you to take away from this tutorial is that more than any tool or any concept such as
segmentation or targeting, your understanding of data is key to unlocking value. Audiences are fluid because trends are
ever-changing. How you track the world and its tastes is through data. How that data is collected, managed, curated,
and deployed responsibly is the ultimate act of providing great customer experience and service.

Download and set up DBVisualizer by following the instructions here:

Download the following file locally to your machine.

You need to also ingest CSV Files into Adobe Experience Platform by following the instructions here:

1. Let us write the simplest query to understand what the data looks like:

select * from movie_data;

2. Let us count the number of records in the dataset. id is a key that is unique and non-repeating that can be used to count the number of records. You should get 1000 in the result.

select count(distinct id) from movie_data;

3. Since email is the primary identifier for the customers in the list, let us now find if the distinct values of the
emails match the record number.

select count(distinct email) from movie_data;

The result you should get is 976. This means a couple of things:

1. There are records that have emails as NULLs that need to be removed as they simply cannot be targeted. Note
that COUNT with DISTINCT clause will not count all the NULLs as one unique value. This can happen if there
were data quality issues upstream or the fact that such a record was created without requiring an email address at
some point in time. We do not really know the cause of that issue.

2. There are records that have the same email associated with them. This could happen if we allow our system to
register multiple users on the same email address. If that is so, we could simply aggregate the movie genre
information across these records i.e. give them all equal weight.

3. There is another way to extract the same information using a relatively new feature in Data Distiller:

DROP TABLE IF EXISTS movie_stats;
ANALYZE TABLE movie_data COMPUTE STATISTICS AS movie_stats;
SELECT * FROM movie_stats;

The results look like this:

Note that movie_stats is a TEMP table that is generated for the session per user. If you DROP this temp table in
DBVisualizer, you have to reconnect to fetch the metadata from Data Distiller that this table has indeed been dropped.
If you do not refresh, you will get an error that “movie_stats” exists. This limitation does not exist with the Data
Distiller UI.

Most of the mathematical statistics do not show up because the datatype is of string type. But take a look at the approximate uniques. It gives you a sense of the cardinality of the various dimensions. The nullCount of 24 for email shows that there are 24 records that do not have this ID. As an exercise, I still do this manually by writing SQL below, but just be aware that this approach also exists. And if you are wondering why I had to write two commands to get the statistics, this is because Data Distiller conforms to the PostgreSQL syntax.

Note that PostgreSQL is compliant with ANSI SQL standards. It is compatible with ANSI-SQL2008 and supports
most of the major features of SQL:2016. However, the syntax accepted by PostgreSQL is slightly different from
commercial engines. SQL is a popular relational database language that was first standardized in 1986 by the American
National Standards Institute (ANSI). In 1987, the International Organization for Standardization (ISO) adopted SQL as
an international standard.

Warning: The statistics feature is not yet supported on Accelerated Store tables. It is supported only on datasets/tables
on Data Lake.

Count and Filter out NULL Identity Records

1. Let us count the number of records that have the email field as NULL

select count(COALESCE(email, 'unknown')) - count(distinct email) AS number_null_values from movie_data;

COALESCE takes all the records that have NULL email values and converts them into the specified string, i.e. 'unknown'. COUNT on this coalesced field therefore counts every record, 1000 in total. Subtracting the count of distinct non-null emails (976) from this number gives 24.
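As a quick cross-check (not part of the original steps), you can count the missing emails directly; assuming the missing values are stored as NULL (or empty strings), this should line up with the number computed above:

-- Count records with a missing email identity
SELECT COUNT(*) AS number_null_values
FROM movie_data
WHERE email IS NULL OR email = '';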

1. To filter out the records with email values as NULL, we have:

select * from movie_data WHERE email != '';

Identify if Duplicate Identity Records Exist

1. Let us count the number of records that have a non-NULL email field but have duplicate emails

select COUNT(DISTINCT id) - COUNT(DISTINCT email) AS Duplicate_Values
FROM (select * from movie_data WHERE email != '');

First, we filter the dataset of all the NULLs and then we run COUNT DISTINCT on the id and the email fields to see
if they are in line. The answer you should get here is 0 meaning that they are indeed unique.

Movie Genre Popularity by State

1. We first group by State and movie genres without splitting the movie genres apart

select State, movie_genres, COUNT(DISTINCT email) AS CUSTOMER_COUNT
from movie_data
WHERE email != ''
GROUP BY State, movie_genres
ORDER BY CUSTOMER_COUNT DESC

The results should look like this:

1. We still have results such as Comedy|Drama that are counted separately from Comedy and Drama. We need to
be able to add customers that have these joint movie genres to the audiences by state and movie genre. For that, I
need to be able to use a regular expression function to turn the movie_genres field into an array and then use the
EXPLODE command to make a row for every genre value.

First, we will split at the pipe separator and then explode the strings:
SELECT State, email, explode(split(movie_genres, '\\|', -1)) AS movie_genres
from movie_data
WHERE email!= '';

The results look like this:

1. Remember, that we are giving equal credit to a customer for every genre that they are associated with. With that
assumption, let us do a count by state for all the genres and we should see that the numbers are accurate for state
and movie genre.

SELECT State, movie_genres, COUNT(email) AS CUSTOMER_COUNT
FROM (SELECT State, email, explode(split(movie_genres, '\\|', -1)) AS movie_genres
      from movie_data WHERE email != '')
GROUP BY State, movie_genres
ORDER BY CUSTOMER_COUNT DESC

The results look like this:

Email List for State by Movie Genre Targeting

1. Let us create an array of emails for each of these combinations:

SELECT State, movie_genres, COUNT(email) AS CUSTOMER_COUNT, array_agg(email) AS email_list
FROM (SELECT State, email, explode(split(movie_genres, '\\|', -1)) AS movie_genres
      from movie_data WHERE email != '')
GROUP BY State, movie_genres
ORDER BY CUSTOMER_COUNT DESC

The results look like this:

1. Since the campaigns have to be run by State and by movie genre, we need to re-sort this by the State column:
SELECT State, movie_genres, COUNT(email) AS CUSTOMER_COUNT, array_agg(email) AS email_list
FROM (SELECT State, email, explode(split(movie_genres, '\\|', -1)) AS movie_genres
      from movie_data WHERE email != '')
GROUP BY State, movie_genres
ORDER BY State, CUSTOMER_COUNT DESC

Statistics computation on numerical columns.

https://data-distiller.all-stuff-data.com/unit-4-data-distiller-data-enrichment/enrich-200-decile-based-audiences-with-
data-distiller * * *

1. UNIT 4: DATA DISTILLER DATA ENRICHMENT

ENRICH 200: Decile-Based Audiences with Data Distiller


Bucketing is a technique used by marketers to split their audience along a dimension and use that to fine-tune the
targeting.

Why Enrich the Real-Time Customer Profile?

Let us take a step back and understand the building blocks of being able to personalize or even deliver a plain vanilla
experience:

1. Data gathering: You need a mechanism to collect the data about the customer from as many channels as
possible.
2. Identity resolution: You will need to resolve the identities across the channel data so that you can make sense of
the Profile.

3. Segmentation: Queries that group profiles based on various conditions.

4. Activation: Send the qualified profiles out as soon as possible with the appropriate metadata for personalization
whenever applicable.

The data that you gather will contain attributes, time-stamped behaviors, and pre-existing segment memberships
(possibly from another system). Raw behavioral data typically constitutes up to 99% of all the data that you will gather. If you pump this into any database, whether it be a warehouse or a NoSQL database, your segmentation queries will overwhelm the system. If they do not overwhelm the system, be ready for a fat bill from the vendor.

To address this, we need a strategy to architect a trade-off: real-time computation vs. cost.

1. Real-Time Segmentation: The Real-Time Customer Profile store is a NoSQL database that is optimized for near
real-time segmentation on attributes, behaviors, and segment memberships. Real-time segmentation implies that
the conditions required for grouping the profiles are simple enough to be evaluated fast. Most of these conditions
are with short time frames involving counts of events that occurred and attributes. At the minimum, for the real-
time segmentation path to work, we need to make sure that those events are available within the database.

2. Batch Segmentation: For more complex queries, most real-time systems will compute these offline or in batch.
Batch segmentation happens in the Real-Time Customer Profile on a daily basis. The same applies to most
warehouse implementations as well. We could pre-compute the micro-conditions in the batch segmentation logic
as SQL-based attributes and just feed these attributes to the batch-processing segmentation engine. By doing so,
we have reduced the size of the data that we are pumping into the database thereby lowering our costs.

1. Batch Processing on Database: This technique is very common in the industry with vendors using terms
such as computed attributes, calculated traits, SQL traits, etc. However, most vendors require the
computation of these traits on the database itself thereby increasing the costs. Warehousing engines or even
NoSQL databases are just not built for batch processing scale on the behavioral data that you will encounter
in the domain of a CDP.

2. Batch Processing on Data Lake: Instead of using the compute resources of the database (warehouse or
otherwise) which are expensive for complex queries, we are using the compute resources of the Data
Distiller’s batch processing engine on a data lake to reduce the cost by an order of magnitude. We can use
our savings to compute newer kinds of attributes that can further give us more options for segmentation. As
we are developing these newer attributes, we can work closely with the data science team to design profile
features as well.

We will be using Data Distiller to generate yearly purchase values for profiles in the Real-Time Customer Profile. We
will use that information to segment this population into ten buckets and then bring that information back into the
Profile for segmentation and targeting. Also, by creating such computational attributes or SQL traits, you are
compressing the pattern of behavior into a derived characteristic of the individual thus reducing the need to store all of
the behavioral data in Real-Time Customer Profile. The complex computation encapsulates the essence of the behavior
which is also easy for a marketer to grasp and use.

You need to have Adobe Real-Time CDP set up and operating so that you can execute and access the example. The
example relies on data generated by the Real-Time Customer Profile.

You will also need to make sure you have completed this section before proceeding further. At the very least, you
should be familiar with how Profile Attribute snapshot datasets work.

Generate a Randomized Yearly Purchases Dataset


1. First, we will extract the email and CRM identities from all of the identity maps. We will be using this as the key
for our random dataset:

SELECT identitymap.email.id[0] AS email, identitymap.crmid.id[0] AS crmid
FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903

You will get a result that looks like the following:

If you do not have access to a profile snapshot dataset, you can use the dummy data in the CSV file below as a
substitute for the table above:

Results of the above query are available as a table for the rest of the exercise.

Your queries will change and look simpler. You just need to replace the code fragment that we did above with:

SELECT email, crmid FROM identity_data

If you want a tutorial on how to ingest CSV data, please consult this example:

1. Let us generate the randomized yearly purchase values

SELECT email, crmid, round(10000*abs(randn(1))) AS YearlyPurchase_Dollars
FROM (SELECT identitymap.email.id[0] AS email, identitymap.crmid.id[0] AS crmid
      FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903);

The results will be:

Let us carry out some basic cleaning operations to remove null identities (email and crmid) in the dataset.

SELECT * FROM (SELECT email, crmid, round(10000*abs(randn(1))) AS


YearlyPurchase_Dollars
FROM (SELECT identitymap.email.id[0] AS email, identitymap.crmid.id[0]
AS crmid from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903)
WHERE email !='')
WHERE crmid !='';

The results will be:

Create the Decile Buckets

We need to use the NTILE window function, which sorts the yearly purchases attribute and splits it into 10
equal-sized buckets, adding the bucket number as a new attribute. I can change this to any number of buckets I
want.

SELECT *, NTILE(10)
OVER (ORDER BY YearlyPurchase_Dollars ASC) AS decile_bucket FROM
(SELECT * FROM (SELECT email, crmid, round(10000*abs(randn(1))) AS
YearlyPurchase_Dollars
FROM (SELECT identitymap.email.id[0] AS email, identitymap.crmid.id[0]
AS crmid from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903)
WHERE email !='')
WHERE crmid !='');

Note that we did not use the PARTITION BY clause, which is essentially a grouping dimension used to split/partition the
dataset so that the NTILE logic is applied within each partition. In our case, we have a single dataset and no grouping
dimensions such as location. If we had used a partitioning dimension such as location, the decile computation would
be done separately for each partition, as sketched below.
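
Here is a minimal sketch of what that would look like; the location column and the cleaned_yearly_purchases view are hypothetical placeholders standing in for the cleaned result set built above:

-- Hypothetical sketch: deciles are computed separately within each location partition.
-- "location" and "cleaned_yearly_purchases" are placeholder names, not part of this tutorial's data.
SELECT *,
       NTILE(10) OVER (PARTITION BY location
                       ORDER BY YearlyPurchase_Dollars ASC) AS decile_bucket
FROM cleaned_yearly_purchases;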

The results are:

Let us verify that the decile bucket logic is working as designed. Let us first find the total number of records:

SELECT DISTINCT COUNT(*) FROM


(SELECT *, NTILE(10)
OVER (ORDER BY YearlyPurchase_Dollars ASC) AS decile_bucket FROM
(SELECT * FROM (SELECT email, crmid, round(10000*abs(randn(1))) AS
YearlyPurchase_Dollars
FROM (SELECT identitymap.email.id[0] AS email, identitymap.crmid.id[0]
AS crmid from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903)
WHERE email !='')
WHERE crmid !='')
);

The result will be 6000 records. Let me count the number of records per decile bucket and also find the minimum and
maximum values for the yearly purchase data for each of the buckets.

SELECT decile_bucket, COUNT(decile_bucket),


MIN(YearlyPurchase_Dollars),MAX(YearlyPurchase_Dollars) FROM
(SELECT *, NTILE(10)
OVER (ORDER BY YearlyPurchase_Dollars ASC) AS decile_bucket FROM
(SELECT * FROM (SELECT email, crmid, round(10000*abs(randn(1))) AS
YearlyPurchase_Dollars
FROM (SELECT identitymap.email.id[0] AS email, identitymap.crmid.id[0]
AS crmid from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903)
WHERE email !='')
WHERE crmid !='')
)
GROUP BY decile_bucket ORDER BY decile_bucket ASC ;

Here are the results of that query:

If I could figure out a way to ingest this attribute data into Real-Time Customer Profile, the minimum and maximum
threshold values give me enough flexibility to define an audience that stacks up to a maximum size of 6000
members. If I use the yearly purchase conditions from 4.0 to 6820.0, I should get 5x600=3000 members. So, by using
this decile technique, I have full control over reach while keeping the targeting focused via the yearly
purchase dimension. In other words, with a single attribute dimension, the focus and reach of your campaign are inversely
proportional. Also, note that the decile buckets are labeled as numbers. It pays to sit down with marketing and define
more intuitive names for these buckets that everyone can rally around.
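
As a quick illustration of that reach calculation, here is a minimal sketch; decile_results is a hypothetical name standing in for the NTILE query shown earlier:

-- Sketch only: "decile_results" is a placeholder for the NTILE query above.
-- Buckets 1 through 5 cover the yearly purchase range from roughly 4.0 to 6820.0,
-- giving about 5 x 600 = 3000 members.
SELECT email, crmid, YearlyPurchase_Dollars, decile_bucket
FROM decile_results
WHERE decile_bucket BETWEEN 1 AND 5;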

Enrich Real-Time Customer Profile with Derived Fields

A Data Distiller Derived Attribute is a field in a dataset that is not directly observed or collected but is created or
computed from existing data. The derived attribute values are typically generated through transformations,
calculations, or by combining multiple existing field values to offer new insights or improve model performance. In
some cases, derived attributes are simple pass-throughs of existing fields, where no transformation or calculation is
needed. These fields retain the original values but are reorganized to fit specific analytical or modeling purposes.

The dataset we want to create is an attribute dataset for Real-Time Customer Profiles. To make this happen, we will
need to create a custom schema of Individual Profile Schema type and add the following custom fields as shown
below. Plus, we will also need to at least specify a primary identity and mark the schema for Profile. Marking the
schema in this specific way notifies the database of the layout of the data. Alternatively, we can create a schema that
mimics the schema on the fly (called ad hoc schema in Data Distiller) that gives you the flexibility to define these
schemas in SQL code within the SQL editor.

The action of marking a schema for Real-Time Customer Profile cannot be undone. What this means is that if you
are not careful about how you go about creating schemas and adding them, you will end up with a lot of "deadwood"
schemas that clutter up the Union view. With this risk in mind, we should use the UI or API to create the definitive
schemas, populate datasets, and then mark them for Profile. Creating ad hoc schemas is useful for quick prototyping
or creating intermediate datasets but remember that with great power comes great responsibility. In any situation
where you are creating a final set of datasets for an app within the Adobe ecosystem or elsewhere, pay attention to your
schema design. At the very least, have them defined well.

There is more flexibility with datasets as they can be marked and unmarked for Profile. Marking a dataset for Profile
means that the database monitors for new data from that point onwards. If you delete an attribute dataset, Real-Time
Customer Profile will delete the attributes. The same is true for event data, and also when a TTL or dataset
expiration is applied to these Profile-marked datasets. These actions have different consequences for the Identity Store:
deletion of datasets results in the cleaning of the identity graph on a daily basis, while a TTL on Profile-marked datasets
does not propagate to the Identity Graph.

Observe the data types of the various fields - the Yearly_Purchase_Dollars field is of integer type.

Please check the guardrails for Real-Time Customer Profile based on the entitlement you have. There are
recommendations provided for the number of attribute (20) and event (20) datasets that can be attached to it. Also,
there is a limit on the number of batches of data that can be ingested per day (90) into the Profile database as well.
These constraints can be addressed by using a pipeline architecture that consolidates datasets and running them on the
same schedule to create fewer batches of data.

There are two ways for me to create a dataset and mark it for Real-Time Customer Profile:

1. Create a brand new dataset with every single update of the yearly purchases data: If our use case was to
accommodate rolling 365-day purchases with more weightage to the recent purchases, then we have no choice
but to create a new table with every run i.e. daily. In this case, you would DROP the table every day and automate
the addition of this data to the profile.

2. Insert and append into an existing dataset with every run for new updates of the yearly purchase data. If we
want to retire the old or updated data, it will require some new data techniques (timestamping and snapshotting)
that we will not cover in this example.

In both cases, as long as the attribute dataset has been marked for Profile, the Real-Time Customer Profile will keep
monitoring for new batches of attribute data from the data lake. Marking a dataset for Profile has to be done
manually in the dataset UI. If we drop or delete the dataset, we would need to repeat this manual step every single
time. This leaves us with a 3-part strategy:

1. Create a one-time empty dataset or do this on a periodic basis so that we can manage dataset size.

2. Mark the empty dataset for Profile.

3. Append new attribute data into this dataset. New attribute data for the same profile will be overwritten even
though multiple records of the same data now exist in the data lake. As a reminder, Adobe Experience Platform
today supports append-only semantics. Update operations are not yet supported.

Create an Empty Dataset for Real-Time Customer Profile


We need to create the empty dataset first because the Profile store only monitors new batches of data after we mark the
dataset. If we inserted data into the dataset and then marked it for Profile store, the batches of data will not be ingested
into Profile.

Warning: The reason why we are not creating a dataset from the UI by going into Workflows->Create dataset from
Schema is because of the limitation in Adobe Experience Platform that these datasets cannot be dropped (using DROP
TABLE) in Data Distiller.

Here is the code for creating an empty dataset:

DROP TABLE IF EXISTS decile_attribute_dataset_example;


CREATE TABLE decile_attribute_dataset_example WITH
(schema='Derived_Attributes_Deciles', Label='PROFILE') AS
(SELECT struct(email AS email,
crmid AS crmid,
YearlyPurchase_Dollars AS Yearly_Purchase_Dollars,
decile_bucket decile_bucket) AS _pfreportingonprod
FROM (
SELECT *, NTILE(10)
OVER (ORDER BY YearlyPurchase_Dollars ASC) AS decile_bucket FROM
(SELECT * FROM (SELECT email, crmid, CAST(round(10000*abs(randn(1))) AS INT) AS
YearlyPurchase_Dollars
FROM (SELECT identitymap.email.id[0] AS email, identitymap.crmid.id[0]
AS crmid from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903)
WHERE email !='') WHERE crmid !=''))
WHERE FALSE
);

This code will execute successfully and you will see:

Let us analyze the code:

1. DROP TABLE: Since we are creating a brand new dataset for every single run i.e. daily or weekly, we should
delete the previous dataset and create a new one. Note the limitation that DROP TABLE will not work if you
create an empty dataset by using Workflows->Create dataset from Schema. You should only use Data Distiller
to create the empty dataset if you want to be able to drop it.

2. struct: The structure maps the input fields such as YearlyPurchase_Dollars from the select query below to the
schema field Yearly_Purchase_Dollars. You could create any hierarchy of objects by using this mapping. For
example, we could have also created a custom schema that had two objects in it, such as a purchases object and an
identity_fields object. In that case, the code would have been:

........ SELECT STRUCT(
             STRUCT(YearlyPurchase_Dollars AS Yearly_Purchase_Dollars) AS purchases,
             STRUCT(email AS email, crmid AS crmid) AS identity_fields
         ) AS _pfreportingonprod, .......

3. It is imperative during the prototyping stage that you double-check that the dataset was created:

SELECT * FROM decile_attribute_dataset_example;

4. schema='Derived_Attributes_Deciles' specifies that the data layout must conform to our created XDM schema.

5. CAST(round(10000*abs(randn(1))) AS INT): We added this in the core code to match the integer data type of
Yearly_Purchase_Dollars in the schema.

6. WHERE FALSE: Note the code in the last line where we use a condition that can never be true (a contradiction such as WHERE 1=2 would work equally well) so that the dataset is created empty.
Make sure you execute the following command:
SELECT * FROM decile_attribute_dataset_example;

Mark the Empty Dataset for Real-Time Customer Profile

Go into the Dataset UI and mark the dataset for Profile:

Append Data to an Existing Dataset for Real-Time Customer Profile

Let us now insert data into the empty table:

INSERT INTO decile_attribute_dataset_example


(SELECT struct(email AS email,
crmid AS crmid,
YearlyPurchase_Dollars AS Yearly_Purchase_Dollars,
decile_bucket decile_bucket) AS _pfreportingonprod
FROM (
SELECT *, NTILE(10)
OVER (ORDER BY YearlyPurchase_Dollars ASC) AS decile_bucket FROM
(SELECT * FROM (SELECT email, crmid, CAST(round(10000*abs(randn(1))) AS INT) AS
YearlyPurchase_Dollars
FROM (SELECT identitymap.email.id[0] AS email, identitymap.crmid.id[0]
AS crmid from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903)
WHERE email !='') WHERE crmid !='')));

Verify that Data is Loaded into Real-Time Customer Profile

Let us retrieve the first row of the dataset we created for Profile Store:

SELECT to_json(_pfreportingonprod) FROM decile_attribute_dataset_example LIMIT 1;

The results are:

The primary identity for this attribute record is [email protected]. The decile bucket is 1 and
Yearly_Purchase_Dollars is 1.

If we interrogate the Profile store by doing the following:

The results are the following:

Now that you uploaded the entire dataset into Real-Time Customer Profile, which dimension can you use to split this
dataset into multiple audiences based on the decile bucketing?

Appendix: Reconciling Identities Linked to the Same Profile in Multiple Purchase Datasets

The above modeling assumes that each row is unique and that a yearly purchase can be assigned to each of them. This
will be the case when you have an email address or the CRM ID acting as a source of truth for reconciled data from
the backend systems. Or you have used the Identity Lookup table as mentioned in the appendix section below to
reconcile the identities of the same customer across multiple datasets.

If that data is not reconciled, then you would need to reconcile that data. The more fragmented this information, the
worse it gets for you and your company. Imagine an e-commerce system that tracks online transactions with an email
address as the primary identifier and a CRM system that centralizes all of the transactions including those in the e-
commerce system and those which are offline. You need to be careful to ensure you are not counting a purchase
transaction twice. From a sales reporting standpoint, this would get even worse.

If you are reconciling purchase data from multiple datasets that have different identities, then you have to generate the
identity lookup table from the Profile snapshot attribute data for any merge policy. As long as the merge policy
selected has identity stitching enabled, the identity graph will be the same for all such merge policies as the Real-Time
Customer Profile has a single identity graph in the system.

You will need to do multiple joins across these datasets with the identity lookup table while grouping the results by the
unique profile ID that was generated. In fact, you can use custom logic to prioritize which values you want to ingest
from the datasets as the source of truth. Please read the documentation on the creation of the identity lookup table:
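
To make the idea concrete, here is an illustrative sketch; the table and column names (identity_lookup, ecommerce_orders, crm_orders, profile_id, order_total) are hypothetical placeholders for your own datasets, and de-duplication of orders that appear in both systems is not shown:

-- Illustrative sketch only: roll up purchases captured under different identities
-- to the unique profile ID from the identity lookup table.
-- All table and column names here are hypothetical placeholders.
SELECT l.profile_id,
       COALESCE(e.ecom_total, 0) + COALESCE(c.crm_total, 0) AS yearly_purchase_dollars
FROM identity_lookup l
LEFT JOIN (SELECT email, SUM(order_total) AS ecom_total
           FROM ecommerce_orders GROUP BY email) e ON e.email = l.email
LEFT JOIN (SELECT crmid, SUM(order_total) AS crm_total
           FROM crm_orders GROUP BY crmid) c ON c.crmid = l.crmid;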

Results of randomized yearly purchases data in a single dataset

Cleaned up dataset with yearly purchases in dollars.

Decile buckets autogenerated using the NTILE function

The count of records in each of the buckets is evenly split between the 10 buckets.

Custom schema for the decile bucket data.

Result of a successful CREATE TABLE command with the parameters

Empty dataset. The dataset UI will show batches processed but none were written.

Mark the empty dataset for Profile.

Interrogating the Real-Time Customer Profile.

The same data is now present within the Profile Store.

Prerequisite for this section.

https://data-distiller.all-stuff-data.com/unit-4-data-distiller-data-enrichment/enrich-300-recency-frequency-monetary-
rfm-modeling-for-personalization-with-data-distiller * * *

1. UNIT 4: DATA DISTILLER DATA ENRICHMENT

ENRICH 300: Recency, Frequency, Monetary (RFM) Modeling for


Personalization with Data Distiller
Learn how to leverage RFM modeling to enhance real-time customer personalization and drive targeted marketing
strategies.

Understanding customer behavior is crucial for optimizing marketing strategies, and a variety of models exist to help
businesses do just that. One of the most well-known is RFM (Recency, Frequency, Monetary), which segments
customers based on their purchasing patterns, but it’s just the beginning. Other models, such as Customer Lifetime
Value (CLV), and Propensity Models provide deeper insights into customer value, loyalty, and engagement. These
models, along with tools like Customer Satisfaction (CSAT) and Behavioral Segmentation, allow businesses to
tailor marketing strategies, whether in B2C or B2B contexts. By leveraging these analytical frameworks, companies
can focus on the most relevant customer groups, improve personalization, and drive sustainable growth through data-
driven decision-making.
Marketers use these models to gain deeper insights into customer behavior, segment audiences effectively, and
optimize marketing strategies. These models help in several key areas:

1. Personalization: Models like RFM allow marketers to target the right customers with tailored messages based on
their purchase history, engagement, and value to the business.

2. Resource Allocation: By identifying high-value customers, marketers can prioritize resources and efforts on the
most profitable segments or those needing retention strategies.

3. Improved Customer Experience: Models like RFE (a variation of RFM) help marketers understand how
engaged customers are and how likely they are to recommend the brand, guiding improvements in customer
experience.

4. Data-Driven Decision Making: These models turn complex data into actionable insights, enabling marketers to
make informed decisions, such as which segments to focus on, which campaigns to run, and how to optimize
customer journeys.

5. Maximizing ROI: By using models to focus on the most promising customer groups, marketers can enhance the
efficiency of their campaigns, leading to better returns on marketing investments.

RFM, shorthand for Recency (R), Frequency (F), and Monetary (M), represents a data-driven approach to customer
segmentation and analysis. This methodology delves into three pivotal dimensions of customer behavior: the recency
of purchase, the frequency of engagement, and the monetary value spent. Through the quantification of these
parameters, businesses attain valuable insights into distinct customer segments, empowering the formulation of
customized marketing strategies that effectively cater to individual customer needs.

RFE (Recency, Frequency, Engagement) is similar to RFM but emphasizes how recently and frequently a customer
engages with the brand or product, without focusing on monetary value. It is commonly used in subscription or
engagement-driven models where customer interaction is a key metric. The main factors it measures are user activity,
interactions, and time spent with the brand.

The key aspects of RFM Modeling

Business Understanding of RFM Model

The RFM model classifies customers based on their transactional behaviors, utilizing three key parameters:

Recency gauges the time elapsed since a customer’s last purchase, providing insights into engagement levels and
future transaction potential.

Frequency assesses the frequency of customer interactions, serving as an indicator of loyalty and sustained
engagement.

Monetary value measures the total spending of customers, emphasizing their value to the business.

The combination of these factors enables businesses to assign numerical scores to each customer, typically on a scale
from 1 to 4, where lower scores signify more favorable outcomes in our specific use case. For instance, a customer
scoring 1 in all categories is deemed the “best,” showcasing recent activity, high engagement, and substantial
spending.

Derived from research in direct mail marketing, RFM analysis aligns with the Pareto Principle, suggesting that 80% of
sales emanate from 20% of customers. Employing the RFM model allows businesses to adeptly segment their
customer base, predict future purchasing behaviors, and tailor marketing initiatives to optimize engagement and
profitability.
While RFM is often associated with B2C marketing due to its focus on customer behavior and purchasing patterns,
RFM can also be highly valuable in B2B (Business-to-Business) contexts.

In B2B, RFM can be adapted to measure the activity of business clients based on things like:

Recency: How recently a client engaged with your company, whether through a purchase, inquiry, or other forms
of communication.

Frequency: How often a client engages with your business, attends meetings, or makes purchases.

Monetary: The financial value of the client’s transactions or deals over time.

For example, B2B use cases can use RFM to segment clients based on their purchasing behavior or engagement levels,
helping to inform account management, upsell opportunities, and personalized marketing strategies. The core
principles of RFM are flexible enough to apply to both B2B and B2C environments.

RFM proves invaluable for comprehending customer dynamics and refining marketing strategies, with key advantages
including:

1. Enhanced Revenue through Precision Targeting

1. Tailoring messages and offers to specific customer segments optimizes revenue by boosting response rates,
retention, satisfaction, and Customer Lifetime Value (CLTV).

2. Effectively predicts future customer behavior by leveraging recency, frequency, and monetary metrics.

3. Allows precise messaging alignment, optimizing recommendations for frequent high-spenders and fostering
loyalty among smaller spenders.

2. Objective Customer Segmentation and Decision Support

1. Provides an objective, numerical depiction of customers, simplifying segmentation without necessitating


advanced expertise or software.

2. Assigns rankings on a scale, with lower rankings indicating a higher likelihood of future transactions.

3. Facilitates easy interpretation of intuitive outputs, supporting decision-making and strategy formulation.

3. Insights into Revenue Sources and Customer Dynamics

1. Offers insights into revenue sources, underscoring the significance of repeat customers and guiding efforts
to enhance customer satisfaction and retention.

2. Emphasizes the need for balancing customer engagement, ensuring top customers are not over-solicited
while nurturing lower-ranking customers through targeted marketing efforts.

Like any other approach, RFM also has limitations:

1. Simplicity and Generalization: RFM provides a straightforward framework but may oversimplify customer
behavior, assuming uniformity within segments based on recency, frequency, and monetary values

2. Equal Weighting of Factors: The model assigns equal importance to recency, frequency, and monetary values,
potentially misrepresenting customer value as one factor might be more critical than another in certain cases.

3. Limitations in Contextual Understanding: RFM lacks consideration for context, failing to account for product-
specific characteristics or nuances in customer preferences, resulting in potential misinterpretations of purchasing
behaviors.

RFM and Real-Time Personalization

RFM (Recency, Frequency, Monetary) segments can be dynamically integrated into real-time personalization
strategies by leveraging customer behaviors to tailor interactions instantly. As customer data is updated in real time,
businesses can adjust their personalization efforts based on the latest RFM scores. For example, a customer who
recently made a high-value purchase might see personalized product recommendations or loyalty rewards immediately
upon their next visit, while a less engaged customer could receive a targeted offer or incentive to re-engage. This real-
time adaptation ensures that customers receive highly relevant and timely content, enhancing their overall experience
and increasing the likelihood of conversions.

Once these attributes or base segments are created in Real-Time Customer Profile, they become available for
personalization at the Edge (e.g., Adobe Target, Offer Decisioning) and for Streaming Activation through platforms
like Adobe Journey Optimizer and Streaming Destinations.

Case Study: Luma Entering a New Market

Luma has recently opened a new website in a new country, selling only 7 products. The prices are shown below.

4. Aspire Fitness Jacket: $80

5. Push It Messenger Bag: $45

Users explore the website to browse various products and have the option to log in with their email address at any
time. As they navigate, they can add items to their cart, proceed to checkout, place an order, and receive a web order
confirmation. Some users may also choose to call the toll-free number to cancel their order. Additionally, users often
manage their cookies, frequently clearing them. A portion of these users participate in the loyalty program. To add an
extra layer of privacy, all identifying information has been anonymized using Data Distiller.

As the Marketing Manager at “The Luma Store,” your aim is to target customers based on their past behavior using
RFM segmentation. This involves ranking customers by their recency, frequency, and monetary value scores on a scale
of 1 to 4. The RFM model assigns each customer a score for these three factors, with 1 being the highest and 4 the
lowest. Your goal is to construct an effective marketing strategy by creating customer segments. You have been
assigned some requirements:

1. A customer can only belong to one of the 6 segments. This is not a hard requirement in practice, but the
marketing department wants to tailor a consistent message to their customer by ensuring that they belong to a
single segment.

2. Customers should be bucketed into the following 6 segments in the following priority order:

1. Core - Your Best Customers

1. Highly ranked in every category, these customers respond well to loyalty programs

2. They transact frequently, spend generously, and exhibit brand loyalty.

3. On a scale of 1 to 4, these would rank the highest among all the dimensions i.e. Recency=1,
Frequency=1 and Monetary=1.

2. Loyal - Your Loyal Customers

1. Customers with top scores for frequency, indicating frequent transactions


2. Although they may not be the highest spenders, they exhibit consistent loyalty

3. On a scale of 1 to 4, these would rank the highest along the Frequency dimension i.e. Frequency=1 for
all values of Recency and Monetary.

3. Whales - Your Highest-Paying Customers

1. Customers with top marks for monetary value, signifying high spending.

2. On a scale of 1 to 4, these would rank the highest along the Monetary dimension i.e. Monetary=1 for
all values of Recency and Frequency.

4. Promising - Your Faithful Customers

1. Customers who transact frequently but spend less compared to other segments.

2. In this case, we will assume that they are frequent i.e. Frequency is (1,2,3) and spend not so much i.e.
Monetization is (2,3,4).

5. Rookies - Your Newest Customers

1. Newest customers who have recently transacted but have low frequency scores.

2. In this case, we will assume that they are very recent i.e. Recency is 1 with lowest frequency i.e.
Frequency is 4.

6. Slipping - Once Loyal, Now Gone

1. Formerly loyal customers who have become inactive or less frequent.

2. Presents an opportunity for retention efforts, such as discount pricing and exclusive offers, to win them
back.

3. In our case, we will assume Recency is (2,3,4) and Frequency is lowest equal to 4.

While these requirements might seem like a simple assignment in this tutorial, this is exactly the type of analysis and
requirements generation your marketing team should be doing.

First, you’ll need to establish an RFM scale and determine the level of granularity for each dimension—how many
categories will be used for Recency, Frequency, and Monetary value.

Next, you'll define how customers are categorized into these segments.

In our example, the criteria are structured to ensure that customer segments don't overlap. This was done deliberately to
prevent conflicts in personalization strategies.

Additionally, pay attention to the taxonomy—the naming of segments plays a key role in aligning your team around
these well-recognized foundational segments. Clear and consistent segment names help foster a shared understanding
and focus, ensuring that everyone is on the same page when strategizing and executing marketing efforts.

Dear Marketer: You Should Not Worry About SQL

As a marketer, you’re not expected to be writing or understanding SQL all day. The whole purpose of RFM (Recency,
Frequency, Monetary) analysis is to have these attributes prepared so you can use them for audience analysis,
activation, and personalization. Typically, data engineers, architects, or your marketing ops team will handle the
technical work, while you’ll focus on consuming and applying the results. That’s even more reason to be kind to your
data teams!

But if you’re curious about SQL, don’t worry—it’s not as hard as it seems. SQL operates on similar principles to
working with Excel. The main limitation of Excel is that it struggles with large, complex datasets and can’t handle
high volumes of events. That’s why tools like Data Distiller exist, designed to process trillions of records in one go.

Keep in mind that all the RFM attributes created in Data Distiller are automatically added to the Real-Time Customer
Profile. Once they’re in there, they become available for audience creation and activation across social media and paid
media channels. They’re also ready to use as audiences in Adobe Journey Optimizer. And here’s the real advantage:
these attributes are available for edge personalization through Adobe Target or even Offer Decisioning.

Also, RFM attributes are calculated for each individual customer. You can also add this data as a lookup table in
Customer Journey Analytics, allowing you to analyze every journey within the context of RFM attributes.

Lastly, the same RFM attributes can be used to enrich the B2B Real-Time Customer Profile, which enables account
segmentation and personalization of buying groups in Adobe Journey Optimizer’s B2B edition. Essentially, this means
that the entire Adobe DX (Digital Experience) portfolio can be activated using these attributes. Whether it’s for precise
account-based marketing, personalized experiences, or optimizing journeys for B2B audiences, these RFM attributes
play a crucial role in driving effective personalization and engagement across Adobe’s ecosystem.

So, the big question you should be asking your data team isn’t how to build the RFM attributes, but rather how to gain
access to them. Specifically, you should ask what data they are calculated on, how frequently they are updated, and
how fresh the data is. Understanding these factors will help ensure that your audience analysis, segmentation, and
personalization strategies are based on up-to-date and relevant insights.

But just in case, you want to know how SQL works. Look below.

High-Level Overview of Steps to Follow in Data Distiller

Here are the steps we will follow:

1. We will start by exploring the web transaction data to gain insights into essential fields such as customer ID,
timestamps, and order totals.

2. Once the data is fully understood, we will calculate RFM metrics for each customer: Monetary (M), representing
the total amount spent; Frequency (F), counting the number of purchases; and Recency (R), measuring the days
since the most recent purchase. Each RFM dimension will be divided into quartiles, resulting in 64 distinct
segments in this three-dimensional space.

3. We’ll then visualize the distribution of these segments using dashboards to ensure accuracy.

4. Once verified, we will automate the process of updating the Real-Time Customer Profile or Customer Journey
Analytics. This segmentation will enable the creation of audience profiles based on marketing requirements,
enhancing the Real-Time Customer Profile with RFM attributes for more personalized marketing and
engagement strategies.

Before you Start: Prerequisites

If you are unfamiliar with certain concepts in Adobe Experience Platform, it is recommended that you review the
tutorial provided below:

Load Data for Luma Case Study


1. The data has been generated in CSV format to capture the essence of the use case. In practice, you would
typically source this data from Adobe Analytics, Adobe Commerce, or Adobe Web/Mobile SDK. The key
takeaway is that you’ll need to apply the techniques outlined in this tutorial to extract the relevant events and
fields into a canonical CSV format using Data Distiller. The main goal is to work with only the necessary fields
and keep the data as flat as possible, while maintaining practicality.

2. Download the above data locally.

Load the CSV Data into Adobe Experience Platform

1. Name the dataset as luma_web_dataset and follow the steps outlined here:

2. Since we are loading the CSV file directly, there is no need to create an XDM schema (whether it’s record, event,
or other B2B styles). Instead, we will be working with an Ad Hoc schema. While Data Distiller can work with
any schema, when we prepare the final dataset for hydration into the Real-Time Customer Profile, we will use a
Record XDM schema.

Data verification and exploration involve executing **SELECT** to inspect, validate, and analyze data to ensure that
it has been accurately translated during the ingestion process. This process helps identify any discrepancies,
inconsistencies, or missing information in the data.

The Most Basic Exploration Query

Let us access the Data Distiller Query Pro Mode Editor and execute the following query:

1. Navigate to Queries->Create Query

2. Paste and execute the following query:

SELECT * FROM luma_web_dataset

Observe the following in the results:

1. The products column is the list of products associated with the event type.

2. The first 9 records from the top of the result set actually map out a typical customer journey that started with
some browsing and ended in an eventual purchase.

3. Observe how a Purchase ID gets attached at the order step as purchase_id

4. If you scroll further down, you will see some of the customers have a loyalty ID associated with them.

5. The list of products is provided as a comma-separated list. While this isn’t relevant for the RFM tutorial, if we
were conducting a product affinity analysis, flattening this data would be a key step.

Cleaning the Data: Focus on Orders Only But Exclude All Cancelled Orders

Remember, our RFM model only focuses on the recency, frequency, and monetary value of all purchases made. We are
not so concerned with engagement (page views) or the checkout process. We must also exclude all cancelled orders,
as they do not contribute to a valid calculation - we would need to deal with cancellations differently.

1. First we will create a Data Distiller View. Copy and execute the following SQL in the Data Distiller Query Pro
Mode Editor:

CREATE OR REPLACE VIEW orders_cancelled AS
SELECT purchase_id FROM luma_web_dataset
WHERE event_type IN ('order', 'cancellation') AND purchase_id IS NOT NULL
GROUP BY purchase_id
HAVING COUNT(DISTINCT event_type) = 2;

Remember, we are selecting all the non-null purchase IDs that had a cancellation associated with them and aggregating
them with a GROUP BY. The purchase IDs that we get as a result set need to be excluded from our dataset.

VIEWs behave like virtual datasets and so naming them helps in reusing them throughout the code.

1. Then we will select the purchase IDs that are not in the view and retain them

SELECT * FROM luma_web_dataset WHERE purchase_id NOT IN (SELECT purchase_id FROM


orders_cancelled) OR purchase_id IS NULL;

As you type multiple queries into the Data Distiller Query Pro Mode Editor, make sure you highlight and execute the
query of interest:

1. Let us now exclude all events that are not orders:

SELECT * FROM luma_web_dataset WHERE event_type = 'order' AND purchase_id NOT IN (SELECT
purchase_id FROM orders_cancelled);

2. You should now have the result set on which we will create the RFM model.

3. At this point in time, it is a good idea to name the query as a template RFM_{YourName}. Just click the arrow
button at the bottom right to create a Data Distiller Template. You can also click the menu icon at the top left
corner to make more space for the editor.

If you leave the Data Distiller Editor inactive for more than 30 minutes, you’ll encounter a notification that the
database connection has been lost when you try to use it again. This happens because the system requires you to
refresh the page to re-establish the connection. To avoid losing any work, be sure to save your template before
refreshing the page. Remember to execute all the SQL code that has temp tables as those are only persisted for the
session.

If you want to delete a view then use the following syntax: DROP VIEW IF EXISTS order_data; But
remember that **VIEW**s have dependencies - if a view is used within other views, then you will need to drop
those dependent views first. For this, you will need to manually examine the code or follow the hints from the error
message itself, i.e. it will list the dependent views. A sketch is shown below.
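
For example, using the views that are created later in this tutorial, the drops would run from the most dependent view down to the base view (a sketch, assuming all of them exist):

-- Sketch: drop dependent views before the views they depend on.
DROP VIEW IF EXISTS RFM_MODEL_SEGMENT;  -- depends on RFM_Scores
DROP VIEW IF EXISTS RFM_Scores;         -- depends on RFM_Values
DROP VIEW IF EXISTS RFM_Values;         -- depends on order_data
DROP VIEW IF EXISTS order_data;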

To start the development of an RFM model, the first step is to calculate three scores for each customer: Recency,
Frequency, and Monetary value. These scores are derived from raw data collected through customer interactions and
past purchase transactions. Just as a recap:

Recency reflects the time elapsed since the customer’s last purchase, considering their entire history with us.

Frequency denotes the total number of purchases made by the customer over their entire history.

Monetary represents the overall amount of money spent by the customer across all transactions during their
entire tenure with us.

Calculate RFM Score for Each Unique User ID

Let’s delve into how we can leverage the raw data to compute these essential scores.

Extract the Fields with Field Filtering


1. We are augmenting the query developed in the previous section by choosing the email address as our userid, as every
order requires an email login. We also use the **TO_DATE** row-level function in Data Distiller to convert the
timestamp to a date. The **total_revenue** currently reflects the price for each individual transaction. Later,
we will aggregate this value by summing it up for each email ID.

SELECT email AS userid, purchase_id AS purchaseid, price_total AS total_revenue, TO_DATE(timestamp) AS purchase_date
FROM luma_web_dataset
WHERE event_type = 'order' AND purchase_id NOT IN (SELECT purchase_id FROM orders_cancelled) AND email IS NOT NULL;

2. The results should look like this:

1. Next, we will create a TEMP TABLE (temporary table) to cache the results of the previous query for the
duration of our session. Unlike VIEWs, which execute the underlying query each time they are called, TEMP
TABLEs store the data in memory, similar to how tables are persisted in the AEP Data Lake. Utilizing TEMP
TABLEs and VIEWs enhances the modularity and readability of your code.

Remember that TEMP TABLEs (a feature of Data Distiller) use the Ad Hoc Query Engine and hence do not use up
the Batch Query Engine. This means all of the above data exploration can happen without using the Batch Query
Engine as long as the query is within reason i.e. does not time out within 10 minutes. If you have a very large dataset,
you should explore the **ANALYZE TABLE** command to create dataset samples. The only problem with **TEMP
TABLE**s is that they cannot be used as part of materializing the data in the data lake, which makes them well suited
for data exploration tasks only.

1. Copy paste the following command to create a TEMP TABLE

CREATE TEMP TABLE order_data AS SELECT email AS userid, purchase_id AS purchaseid, price_total AS
total_revenue, TO_DATE(timestamp) AS purchase_date FROM luma_web_dataset WHERE event_type =
'order' AND purchase_id NOT IN (SELECT purchase_id FROM orders_cancelled) AND email IS NOT NULL;
SELECT * FROM order_data;

5. The result will be the following:

1. Since we will be materializing the results later, we will be using **VIEW**s instead of **TEMP TABLE**s

CREATE OR REPLACE VIEW order_data AS SELECT email AS userid, purchase_id AS purchaseid,
price_total AS total_revenue, TO_DATE(timestamp) AS purchase_date FROM luma_web_dataset WHERE
event_type = 'order' AND purchase_id NOT IN (SELECT purchase_id FROM orders_cancelled) AND email IS
NOT NULL;
SELECT * FROM order_data;

Aggregate the Transactions to Generate the RFM Values

1. Copy paste the following query and execute

SELECT userid, DATEDIFF(CURRENT_DATE, MAX(purchase_date)) AS days_since_last_purchase,


COUNT(purchaseid) AS orders, SUM(total_revenue) AS total_revenue FROM order_data GROUP BY userid;

2. The results will be

1. **DATEDIFF(CURRENT_DATE, MAX(purchase_date)) AS days_since_last_purchase**


calculates the number of days between two dates.

2. Create a **VIEW** to simplify the code

CREATE OR REPLACE VIEW RFM_Values AS SELECT userid, DATEDIFF(CURRENT_DATE,


MAX(purchase_date)) AS days_since_last_purchase, COUNT(purchaseid) AS orders, SUM(total_revenue) AS
total_revenue FROM order_data GROUP BY userid; SELECT * FROM RFM_Values;

Generate RFM Multi-Dimensional Cube

We have 4 slots for each dimension, and we need to arrange all the values into 4 bins from highest to lowest.

1. Copy paste and execute the following SQL code:

SELECT userid, days_since_last_purchase, orders, total_revenue, 5-NTILE(4) OVER (ORDER BY


days_since_last_purchase DESC) AS recency, NTILE(4) OVER (ORDER BY orders DESC) AS frequency,
NTILE(4) OVER (ORDER BY total_revenue DESC) AS monetization FROM RFM_Values;

2. The **NTILE** window function is a way to divide data into equal-sized groups, or "buckets." In our query, it
helps categorize customers into 4 equal groups (quartiles) based on their recency, frequency, and monetization
values:

Frequency: Customers are ranked based on how many purchases they've made i.e. **orders**. The
ones with the most orders are placed in group 1, and those with the fewest orders are in group 4.

Monetization: This column ranks customers by how much total revenue they've generated i.e.
**total_revenue**. The highest spenders are placed in group 1, and the lowest spenders are in group
4.

Recency: The query ranks all customers based on how long it's been since their last purchase
(**days_since_last_purchase**). It divides them into 4 groups, where the customers who
purchased most recently are in group 1, and the ones who haven't purchased for the longest time are in
group 4.

3. The results should look like this:

4. Let us make sure we create the **VIEW** for this as well:

CREATE OR REPLACE VIEW RFM_Scores AS SELECT userid, days_since_last_purchase, orders,


total_revenue, 5-NTILE(4) OVER (ORDER BY days_since_last_purchase DESC) AS recency, NTILE(4) OVER
(ORDER BY orders DESC) AS frequency, NTILE(4) OVER (ORDER BY total_revenue DESC) AS
monetization FROM RFM_Values;

5. Since we have the RFM scores, we can slot them into different segments as per the requirements listed in the
case study section

SELECT userid, days_since_last_purchase, orders, total_revenue, recency, frequency, monetization,
CASE when Recency=1 and Frequency=1 and Monetization=1
     then '1. Core - Your Best Customers'
     when Recency in (1,2,3,4) and Frequency=1 and Monetization in (1,2,3,4)
     then '2. Loyal - Your Most Loyal Customers'
     when Recency in (1,2,3,4) and Frequency in (1,2,3,4) and Monetization=1
     then '3. Whales - Your Highest Paying Customers'
     when Recency in (1,2,3,4) and Frequency in (1,2,3) and Monetization in (2,3,4)
     then '4. Promising - Faithful customers'
     when Recency=1 and Frequency=4 and Monetization in (1,2,3,4)
     then '5. Rookies - Your Newest Customers'
     when Recency in (2,3,4) and Frequency=4 and Monetization in (1,2,3,4)
     then '6. Slipping - Once Loyal, Now Gone'
end RFM_Model FROM RFM_Scores;

6. Observe the use of **CASE** statements with logical conditions that can be used to set the value of the
**RFM_Model** variable
7. The results are shown below:

4. Create a **VIEW** to save the RFM segments, scores and values:

CREATE OR REPLACE VIEW RFM_MODEL_SEGMENT AS


SELECT userid,
days_since_last_purchase,
orders,
total_revenue,
recency,
frequency,
monetization,
CASE
when Recency=1 and Frequency=1 and Monetization =1
then '1. Core - Your Best Customers'
when Recency in(1,2,3,4) and Frequency=1 and Monetization in (1,2,3,4)
then '2. Loyal - Your Most Loyal Customers'
when Recency in(1,2,3,4) and Frequency in (1,2,3,4) and Monetization =1
then '3. Whales - Your Highest Paying Customers'
when Recency in(1,2,3,4) and Frequency in (1,2,3) and Monetization
in(2,3,4)
then '4. Promising - Faithful customers'
when Recency=1 and Frequency=4 and Monetization in (1,2,3,4)
then '5. Rookies - Your Newest Customers'
when Recency in (2,3,4) and Frequency=4 and Monetization in (1,2,3,4)
then '6. Slipping - Once Loyal, Now Gone'
end RFM_Model
FROM RFM_Scores;
SELECT * FROM RFM_MODEL_SEGMENT;

Analysis of RFM Model with Dashboards

An important task at this point is to start visualizing the slices of the RFM cube so that we can get a sense of what the
distribution of customers looks like.

RFM Insights Data Model Creation

1. First, you need to complete the following prerequisite:

It is recommended that you also read through this as well:

1. Let us create a data model so that the Dashboards can recognize the data and allow us to build charts. Copy paste
and execute the following piece of code

CREATE DATABASE lumainsights WITH (TYPE=QSACCEL, ACCOUNT=acp_query_batch);
CREATE SCHEMA lumainsights.lumakpimodel;
ALTER MODEL lumainsights.lumakpimodel RENAME TO luma_dash;

2. Let us make sure we understand the above code

1. **CREATE DATABASE lumainsights**: This creates a new database named


**lumainsights** that will store and organize the data for insights.

2. **WITH (TYPE=QSACCEL)**: The TYPE=QSACCEL indicates that the database is optimized for
query acceleration. This is used to improve the speed of dashboard queries, which is crucial for dashboards
and analytics use cases where performance is key.
3. **ACCOUNT=acp_query_batch**: This specifies the Data Distiller account used for batch query
processing. If you do not have the Data Distiller license, this account will not exist.

4. **WITH (TYPE=QSACCEL, ACCOUNT=acp_query_batch)** specify that the database should be


created in the Accelerated Store specifically and not on the AEP Data Lake. AEP Dashboards can only work
on datasets in the Accelerated Store.

5. **CREATE SCHEMA lumainsights.lumakpimodel**: This creates a schema named


lumakpimodel under the lumainsights database. A schema is a logical container for organizing
database objects like tables and views.

6. **lumainsights.lumakpimodel** is the data model and, using the **ALTER MODEL**


command, it is renamed to **luma_dash** for easy readability in dashboards.

Hydrate RFM Insights Model

1. We need to first create an empty table. Observe the **WHERE** condition, where a contradiction results in no
rows being returned and hence an empty table is created.

CREATE TABLE IF NOT EXISTS lumainsights.fact_rfm_model AS SELECT * FROM


RFM_MODEL_SEGMENT WHERE FALSE;

2. Insert the **RFM_MODEL_SEGMENT** data into this table:

INSERT INTO lumainsights.fact_rfm_model SELECT * FROM RFM_MODEL_SEGMENT;

3. Let us retrieve the results of the query. Observe that we just use the name of the table because this table name is
unique across the data lake and the accelerated store. If you fully qualify the table name with the dot notation i.e.
**lumainsights.lumakpimodel.fact_rfm_model**, you will get the same result.

SELECT * FROM fact_rfm_model;

4. The results of the query will be the same as the **VIEW** on the data lake:

Create a Dashboard using Data Distiller Query Pro Mode

We will be using SQL to build charts for our dashboard:

1. Navigate to the AEP left sidebar and click on Dashboards->Create Dashboard

2. Name the dashboard as RFM_Dashboard. Click on Query Pro Mode. This will open up the Data Distiller Editor
within the context of Dashboard workflows. Click on Enter SQL.

Note that this feature of using SQL to author charts in Query Pro Mode is only available in Data Distiller.

1. In the Data Distiller Editor that opens, please make sure you choose **luma_dash** as the data model from
the dropdown and execute the following query:

SELECT * FROM fact_rfm_model

2. The results will look like this. Click Select.

3. Choose Marks->Table. Then click on the + and add Header 1. Add Column and keep adding all the attributes.
Name the table as RFM by User. You should get a preview that looks like this with 5 columns (instead of all the
attributes shown). This is expected as the View More feature in the table will show all the columns and all the
rows.
4. Click on Save and Close. Resize the table widget so that it covers the width of the dashboard. Then click
Save. After saving, click Cancel to exit the Edit mode.

5. Click on the ellipsis and then click View More

6. You will get all the records that you can scroll through or even paginate through the various pages. Click on
Download CSV on the top right corner to download up to 500 rows of data per page. If you page yourself to the
next page, you can download that data as well.

7. As an exercise, create bar charts titled Users by RFM Segment. Click Edit->Add Widget->Enter SQL. Make
sure that **luma_dash** is chosen as the data model from the dropdown.

Use the following code:

SELECT RFM_MODEL, COUNT(userid) AS user_count


FROM fact_rfm_model
GROUP BY RFM_MODEL
ORDER BY RFM_MODEL ASC

1. The bar chart can be built like this. This is pretty easy to do and you should try this on your own.

2. If you click the Export button on the top right corner of the dashboard, you will have the option to print or save
the dashboard as a PDF. This is how your dashboard as a PDF should look like:

These dashboards are highly beneficial because the Data Distiller Scheduling feature allows us to automatically
generate fresh fact tables as soon as new data is available. For the end marketer, this means they can simply view the
dashboards without needing to write any code or perform manual data analysis.

Hydrating the Real-Time Customer Profile

We are now ready to hydrate the Real-Time Customer Profile. First, we will create a new dataset on the data lake and
then mark it for Profile.

You can also read up more about the theory behind this here:

Creating Derived Dataset to Store RFM Attributes

1. Create the empty dataset first. We will need a primary identity as this dataset will be ingested into the Profile
Store that needs a partition key.

CREATE TABLE IF NOT EXISTS adls_rfm_profile (
    userId text PRIMARY IDENTITY NAMESPACE 'Email',
    days_since_last_purchase integer,
    orders integer,
    total_revenue decimal(18, 2),
    recency integer,
    frequency integer,
    monetization integer,
    rfm_model text
) WITH (LABEL = 'PROFILE');

2. Make sure that you have Email available as an identity namespace. You can check this here:

1. Once the dataset is created, you should be able to go to Datasets->Browse->adls_rfm_profile and see that the
dataset is empty.

2. You will also see that it creates a proper XDM Individual Profile Schema with custom fieldgroups if you
browse to Schemas->Browse->adls_rfm_profile. You need to copy the tenant name which is
**_pfreportingonprod** (in my case) at the very top of the schema.

3. Here is some explanation on what is happening with the code


1. **userId text**: Defines a column named userId of data type text. This column will store the
user identifiers. The datatype is string.

2. **PRIMARY IDENTITY NAMESPACE 'Email'**: This specifies that userId is the primary
identity for the records in this table and belongs to the identity namespace 'Email'.

3. Primary Identity: In Adobe Experience Platform, the primary identity is the unique identifier used to
merge customer data across different datasets for the Real-Time Customer Profile.

4. Identity Namespace ‘Email’: Indicates that the values in userId are email addresses and belong to the
predefined identity namespace for emails. This helps in unifying profiles based on email addresses.

5. **days_since_last_purchase integer** Stores the number of days since the user’s last
purchase and the datatype is a whole number. The same applies to **orders integer, recency
integer, frequency integer,** and **monetization integer**

6. **total_revenue decimal(18, 2)** has a precision of up to 18 digits in total and a scale of 2 digits
after the decimal point.

7. **rfm_model text:** Holds additional information about the RFM segment applied to the user. The
data type is string.

8. The clause **WITH (LABEL = 'PROFILE')** indicates that the table is marked as a Profile dataset
in Adobe Experience Platform (AEP). Datasets labeled with **'PROFILE'** are enabled for Real-Time
Customer Profile, meaning that data ingested into these datasets contributes to building unified customer
profiles. Additionally, while the Identity Graph/Store processes all records, it will skip reading them if no
additional identities (beyond the primary identity) are present. The Identity Graph is designed to identify
and associate two or more identities within each attribute or event record, and without such associations, no
further action is taken on these records.

Insert Data into the Newly Created Derived Dataset

1. We will now insert the data from RFM_MODEL_SEGMENT View into the **adls_rfm_profile** that
has been marked for Real-Time Customer Profile.

INSERT INTO adls_rfm_profile
SELECT Struct(userId, days_since_last_purchase, orders, total_revenue,
              recency, frequency, monetization, rfm_model) _pfreportingonprod
FROM RFM_MODEL_SEGMENT;

This code takes some time to run because it operates in Batch Mode, which involves spinning up a cluster to execute
the query. The process includes reading data from the data lake into the cluster, performing the necessary processing,
and then writing the results back to the data lake. The cluster spin-up and shutdown process can take several minutes,
contributing to the overall execution time. This is typical for batch processing workloads where resources are
provisioned dynamically for each job.

1. Observe that the order of the fields in the **SELECT** query of the **INSERT** statement mirrors exactly
one-to-one with the order of the fields in **RFM_MODEL_SEGMENT**. This ensures that the values from
**RFM_MODEL_SEGMENT** are inserted correctly into the corresponding fields in the target structure or table.
Maintaining this strict alignment is crucial to avoid mismatches between the source and target fields during data
insertion.

2. The keyword **Struct** is used because **_pfreportingonprod** is treated as an object or
structured data type that encapsulates multiple fields. By using **Struct**, you are grouping the data for the
fields (such as **userId**, **days_since_last_purchase**, **orders**, etc.) into a single
object, which allows these fields to be handled together as a unit. This is useful when you need to insert or
manage multiple fields as a single entity within an object, such as **_pfreportingonprod**.

Do not worry about having added data to this dataset for Profile. You can simply delete the dataset or use the DROP
TABLE command. Deleting the dataset will remove all corresponding data from the Real-Time Customer Profile,
including the Identity Store. This means any graph links or identity associations created from the dataset will also be
deleted. It is the fastest and most efficient way to remove data from the Real-Time Customer Profile and ensure that no
related data remains in the Identity Graph.

1. Once the dataset has data you should be able to go to Datasets->Browse->adls_rfm_profile and see that the
dataset has data. It should have 2000 rows of data.
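
Alternatively, you can sanity-check the row count with a quick query instead of the UI. A minimal sketch against the adls_rfm_profile dataset created above:

SELECT COUNT(*) AS row_count
FROM adls_rfm_profile;

If the batch insert completed successfully, this should return 2000.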

Access the RFM Derived Attributes For Audience Creation

1. To see if the data has been loaded into Profile, navigate to Customer->Profile->Browse. Choose the Identity
Namespace as Email and put in the value of [email protected]

2. Navigate to Customer->Audiences->Create Audience->Build Rule

3. Click on Attributes->XDM Individual Profile

4. Click on the folder that has the same name as the tenant namespace **Pfreportingonprod.** Custom
attributes created in Data Distiller can be found in this folder.

5. You can easily drag and drop the **Rfm_Model** attribute to begin building an audience. Keep in mind that
these attributes can be utilized for Edge, Streaming, and Batch Audiences.

Even though the Profile has been populated, the Rule Builder may not display the attributes. To resolve this, click the
settings icon on the Fields sidebar to the left and select the option labeled “Show all XDM Fields.”

Note: Hydrating Customer Journey Analytics

RFM data can be used as a lookup table in Adobe’s Customer Journey Analytics (CJA) to enhance the analysis of
customer behavior. To do this, you would first upload the RFM dataset as a lookup table into CJA. This dataset
typically includes key metrics such as Recency (how recently a customer made a purchase), Frequency (how often
they purchase), and Monetary (how much they spend). The lookup table should include a common identifier, such as
email or customer ID, which will be used to connect the RFM data to other journey datasets in CJA.

Once uploaded, you would configure the lookup relationship by mapping the RFM attributes (e.g., Recency,
Frequency, and Monetary scores) to the corresponding customer profile data in CJA. This enables the RFM scores to
enrich the event-level journey data, allowing for more granular and targeted analysis. For example, you could analyze
how customers with high-frequency scores interact with different touchpoints in their journey, or track conversion
rates for high-value customers across different campaigns.

By integrating RFM data as a lookup, you unlock the ability to create segments based on behavioral insights and
incorporate them into dashboards, reports, and personalized marketing efforts. Additionally, RFM-enriched data can be
utilized in real-time to power dynamic journey flows, enabling personalized experiences based on past behaviors. This
method ensures you can continually refine and enhance customer experiences across all channels by leveraging both
historical RFM data and real-time journey events in Customer Journey Analytics.

Demystifying SQL in terms of Excel spreadsheet analysis

Our strategy for the tutorial

Data Distiller Query Pro Mode Editor

Use the highlight-and-execute feature to run pieces of code as you sequence them.

Restricting the analysis to only the orders dataset

Click the menu item on the top left to minimize the AEP sidebar, then expand the templates sidebar by using the arrow
icon at the bottom right corner. Then name your template and save it.

All order transactions have the data we need for our analysis

SELECT * and other queries operate on the temporary table as if it were materialized in the data lake, but it is actually
cached in memory.

Data Distiller Data Model explained

The same query editor can be used to query the tables in the Accelerated Store.

Data Distiller Query Pro Mode for chart authoring

Choose luma_dash as the data model

Table chart on the dashboard

View More and View SQL options in the table chart

Raw data exploration in an Excel-like interface

The View SQL feature gives you the SQL behind the chart.

Bar chart showing the audience size for RFM segments

Dashboard PDF available for printing and sharing

Email exists as an identity namespace.

adls_rfm_profile is an empty dataset.

A proper XDM schema is created with the same name as the dataset.

Data has been inserted into the dataset marked for Real-Time Customer Profile.

Profiles now exist for the data that has been loaded.

RFM attributes have hydrated the Real-Time Customer Profile.

Choose the rule builder to access the rules for building audiences.

Navigating to the attributes

Custom attributes created in Data Distiller can be found in Pfreportingonprod.

Audience authoring using RFM attributes in Rule Builder.

https://data-distiller.all-stuff-data.com/unit-5-data-distiller-identity-resolution/idr-100-identity-graph-overview * * *

1. Unit 5: DATA DISTILLER IDENTITY RESOLUTION


IDR 100: Identity Graph Overview
In Adobe’s Real-Time Customer Profile, an identity graph is a core component that maps various identifiers associated
with individual customers across multiple devices, touchpoints, and interactions.

The Identity Graph in Adobe’s Real-Time Customer Profile is a foundational element that enables you to connect and
consolidate data from various touchpoints and interactions with a customer. It’s essentially a dynamic network of
customer identities, such as email addresses, mobile numbers, social media profiles, and more. This graph helps create
a unified, 360-degree view of each customer by associating all the identifiers and attributes associated with that
individual.

Creating an identity graph involves several steps and technologies to consolidate and map various identifiers
associated with individuals or entities across different touchpoints and devices. Here’s a general overview of how an
identity graph is typically created:

1. Data Collection: The process begins with collecting data from various sources, such as websites, mobile apps,
social media, CRM systems, and more. This data includes identifiers like email addresses, phone numbers, device
IDs, and cookies.

2. Identity Resolution: Identity resolution algorithms are employed to link or match different identifiers that
belong to the same individual or entity. These algorithms consider factors like data accuracy, timestamps, and
probabilistic matching to create identity links.

3. Graph Database: The identity graph is stored in a specialized database known as a graph database. Graph
databases are well-suited for representing and querying interconnected data, making them ideal for identity graph
management.

4. Creating Identity Profiles: As identities are resolved and linked, individual or entity profiles are created within
the graph database. These profiles consolidate all known identifiers and associated attributes for each entity.

5. Updating in Real-Time: The identity graph should be updated in real time as new data becomes available. This
ensures that the graph reflects the latest interactions and identifiers associated with individuals or entities.

Last updated 6 months ago

https://data-distiller.all-stuff-data.com/unit-4-data-distiller-data-enrichment/enrich-400-net-promoter-scores-nps-for-
enhanced-customer-satisfaction-with-data-distiller * * *

1. UNIT 4: DATA DISTILLER DATA ENRICHMENT

ENRICH 400: Net Promoter Scores (NPS) for Enhanced Customer Satisfaction with Data Distiller
Unlock the power of NPS to measure and improve customer loyalty and satisfaction

Last updated 4 months ago

Here’s the structure of the dataset. It has 1,000 responses to an NPS survey that has been enriched with
RFM (Recency, Frequency, Monetary) and RFE (Recency, Frequency, Engagement) style attributes.

**customer_id**: Unique identifier for the customer.

**nps_score**: The raw NPS score (0-10 scale).

**promoter_flag**: A binary flag indicating if the customer is a promoter (1 for NPS scores of 9-10).

**passive_flag**: A binary flag indicating if the customer is passive (1 for NPS scores of 7-8).

**detractor_flag**: A binary flag indicating if the customer is a detractor (1 for NPS scores of 0-6).

**purchase_frequency**: The number of purchases the customer has made in the last 12 months.

**avg_order_value**: The average amount spent by the customer per order.

**total_spent**: The total amount spent by the customer.

**customer_support_interactions**: The number of times the customer interacted with support.

**marketing_emails_clicked**: Number of marketing emails clicked by the customer.

**account_age_in_days**: The number of days since the customer created their account.

**churn_flag**: A binary flag for whether the customer churned or not (0 for not churned, 1 for churned).

Tip: No matter the structure of your data, as long as you transform it into the flat, canonical schema via Data Distiller,
you can apply all of the queries provided below. Alternatively, you can template the queries to suit your specific needs.

Net Promoter Score (NPS) is a metric used by organizations to measure customer loyalty and satisfaction. It’s derived
from a single survey question:

“On a scale of 0 to 10, how likely are you to recommend our product or service to a friend or colleague?”

Based on their response, customers are categorized into three groups:

Promoters (9-10): Enthusiastic, loyal customers who are likely to recommend your product or service.

Passives (7-8): Satisfied but unenthusiastic customers who are vulnerable to competitive offerings.

Detractors (0-6): Unhappy customers who could damage your brand through negative word-of-mouth.

The percentage of Promoters (%Promoters) refers to the proportion of customers classified as Promoters out of the
total number of respondents, which includes Promoters, Passives, and Detractors. The same applies to the
%Detractors. This results in a score ranging from -100 to +100, where:

Positive NPS indicates that more customers are promoters than detractors.

Negative NPS signals that more customers are detractors, a warning sign of poor customer satisfaction.

In traditional NPS calculations, Passives are excluded from the final score, with only Promoters and Detractors
contributing to the outcome. Passives have no direct influence on the NPS result except for their inclusion in the total
used to compute the Promoter and Detractor percentages. A simpler way to view the formula is that Promoters are
assigned +1 point, Passives receive 0 points, and Detractors are assigned -1 point.

NPS Use Cases in Adobe Experience Platform

Segment Customers Based on NPS Categories


By categorizing customers as Promoters, Passives, or Detractors, businesses can create enriched customer segments in
Adobe Experience Platform (AEP). Each NPS group reflects different customer sentiments and behaviors, which can
then drive personalized marketing or support strategies:

Promoters: Can be targeted with loyalty programs, exclusive offers, or referral incentives to amplify their
positive impact.

Passives: Can be nudged toward becoming Promoters with tailored offers or incentives to increase their
engagement and satisfaction.

Detractors: Require attention with special customer service offers, surveys for deeper feedback, or even product
improvements to mitigate negative sentiment.

Predictive Models for Churn and Retention

NPS can be used as a key indicator in churn prediction models. Customers categorized as Detractors may be more
likely to churn, while Promoters are often more loyal.

Detractors can trigger workflows for retention efforts, such as sending out discounts or personalized support.

Promoters might trigger marketing campaigns focused on advocacy, encouraging them to leave reviews or
promote the brand on social media.

Personalized Engagements and Cross-Channel Journeys

You can tailor personalized marketing engagements based on a customer’s NPS score across multiple touchpoints.

Promoters: Can receive real-time in-app rewards, loyalty program invitations, or be nudged toward higher-tier
memberships.

Detractors: Might receive customer service interactions or problem-resolution emails right after a low NPS score
is recorded.

Using Adobe Journey Optimizer, NPS data can also trigger different customer journeys, ensuring that each customer
gets the right message or experience based on their satisfaction levels.

Real-Time Feedback Loops with Data Distiller Derived Attributes

The Real-Time Customer Profile can be updated with each interaction or survey response. By integrating NPS surveys
into Data Distiller Derived Attributes, you can ensure that customer sentiment data is always fresh and up-to-date.
This allows:

Immediate action: When a detractor gives a poor NPS score, this can trigger a workflow for the customer
service team to reach out.

Continual monitoring: As customer satisfaction improves, so does their NPS, and these updates can be fed back
into customer profiles for more refined future engagements.

Data Distiller Audience Enrichment with Behavioral Data

In AEP, NPS data can be combined with other behavioral, transactional, or demographic data to build a fuller customer
profile. For example, a Detractor who also has high interaction rates with support may indicate deeper customer
service issues. On the other hand, a Promoter who purchases frequently could be offered a loyalty tier upgrade to
deepen brand engagement.
Sample Size Considerations for NPS Surveys

In traditional NPS calculations, although we collect responses from three categories—Promoters, Passives, and
Detractors—the NPS score itself simplifies the calculation to a binomial structure. This is because the NPS formula
only considers Promoters and Detractors, while Passives are excluded from the final calculation (they have a weight
of zero). Essentially, the multinomial distribution (with three categories) is approximated as a binomial distribution
by treating the survey responses as either Promoters (success) or Detractors (failure), while ignoring Passives.
However, note that Passives are still included in the overall sample size, which impacts the precision of the
calculation and the confidence intervals.

The binomial distribution describes the probability of achieving a certain number of successes (e.g., Promoters in
your survey) in a fixed number of independent trials (e.g., survey responses), where each trial has only two possible
outcomes (e.g., Promoter or not Promoter). In this context:

Success corresponds to a customer being a Promoter (NPS score of 9 or 10),

Failure corresponds to a customer being a Detractor (NPS score of 0 to 6).

The traditional NPS calculation, therefore, simplifies the multinomial survey into a binomial process, focusing on the
difference between the proportions of Promoters and Detractors.

To ensure that your NPS surveys are reliable and represent your customer base, you need a statistically significant
sample size. The key factors affecting this include:

1. Confidence Level: Typically set at 95%.

2. Margin of Error: Often chosen as ±5%.

3. Customer Base Size: The larger your base, the more responses you need to ensure accuracy. For large bases,
around 400-500 responses are generally sufficient.

4. Segment Diversity: If your customer base includes diverse segments (e.g., regional or demographic groups), it
may be necessary to oversample to ensure all groups are represented.

In a large-sample situation, the binomial distribution, which describes the probability of a given number of
successes in a fixed number of independent trials, can be approximated by a normal distribution, thanks to the Central
Limit Theorem. The confidence interval for a proportion **p**, with margin of error **E**, is given by:

Where:

**p** is the sample proportion,

**Z** is the Z-score associated with the desired confidence level,

This formula ensures that the sample size is large enough to estimate the population proportion with a specified margin
of error and confidence level.

Rearranging this to solve for the required sample size **n** gives the formula:

_**n**_ is the required sample size

**Z** is the Z-score (1.96 for a 95% confidence level)

**p** is the estimated proportion of Promoters (typically 0.5 if unknown)

**E** is the margin of error (0.05 for ±5%)

Using this formula, we can calculate that approximately 384 responses would be required for a 95% confidence level
and a ±5% margin of error.
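
If you want to verify this number inside Data Distiller itself, a minimal sketch of the arithmetic (using the Z = 1.96, p = 0.5, and E = 0.05 values assumed above) is:

SELECT ROUND(POWER(1.96, 2) * 0.5 * (1 - 0.5) / POWER(0.05, 2)) AS required_sample_size;
-- Z = 1.96 (95% confidence), p = 0.5, E = 0.05; returns approximately 384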

Tip: The calculation of 384 responses applies to any survey where you’re trying to estimate a proportion (such as
customer satisfaction, NPS, or any binary outcome like “yes/no” responses).

Practical Considerations for a Smaller Customer Base

For a smaller customer base, you can use the finite population correction (FPC) to adjust the sample size:

Where the sample size has been adjusted from the **n** we computed above and **N** is the number of customers
in your database. For a population of 1,000 customers, the adjusted sample size using the finite population correction
is approximately 278 responses. This would still provide a 95% confidence level with a ±5% margin of error, but
requires fewer responses than the unadjusted sample size due to the smaller population.
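
The same correction can be checked with a quick query. A sketch assuming n = 384.16 from the previous calculation and a population of N = 1,000 customers:

SELECT ROUND(384.16 / (1 + (384.16 - 1) / 1000.0)) AS adjusted_sample_size;
-- n = 384.16 (unadjusted), N = 1,000; returns approximately 278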

Large-scale surveys can be expensive, so consider how many responses are feasible while still achieving statistically
reliable results.

Traditional NPS Calculation in Data Distiller

The NPS formula is:

Let us now compute the NPS for the sample:

WITH nps_categories AS (
SELECT
CASE
WHEN nps_score >= 9 THEN 'Promoter'
WHEN nps_score BETWEEN 7 AND 8 THEN 'Passive'
ELSE 'Detractor'
END AS nps_category
FROM nps_survey_dataset
)
SELECT
(COUNT(CASE WHEN nps_category = 'Promoter' THEN 1 END) * 100.0 / COUNT(*))
-
(COUNT(CASE WHEN nps_category = 'Detractor' THEN 1 END) * 100.0 / COUNT(*))
AS nps_score
FROM nps_categories;

The result will be:

Generalize to the Population with Binomial Distribution

Let us now generalize this to the entire population. When we generalize the NPS from a sample to the entire
population, you are estimating the NPS for the population based on the sample. However, because you are using only a
sample of the population, you need to account for uncertainty. This is where the confidence interval comes into play.

Calculate the Proportions:

**Pp**: Proportion of Promoters in the sample, i.e. #Promoters / n

**Pd**: Proportion of Detractors in the sample, i.e. #Detractors / n

Calculate the Standard Error (SE): The standard error (SE) is a measure of how much uncertainty there is
in your estimate of a value, in this case, the difference between the proportion of Promoters and Detractors
in your NPS calculation. It helps quantify how much your sample results might vary if you were to take different
samples from the same population. The formula for SE takes into account both:

How much variability there is in the Promoter percentage, as shown by the first term **Pp(1-Pp)**

How much variability there is in the Detractor percentage, as shown by the second term **Pd(1-Pd)**

and divides each by the sample size n to reflect that larger samples tend to produce more stable (less variable)
estimates. Then, it adds them together and takes the square root.

Choose Confidence Level: For a 95% confidence level, the Z-score is 1.96. For other confidence levels, use the
corresponding Z-score (e.g., 1.64 for 90% confidence).

Calculate the Confidence Interval: The confidence interval for NPS is:

Where:

**NPS** is your sample NPS score.

**Z** is the Z-score for your chosen confidence level.

**SE** is the standard error.

Copy and execute the following piece of SQL code:

WITH nps_calculation AS (
SELECT
COUNT(*) AS total_responses,
SUM(CASE WHEN nps_score >= 9 THEN 1 ELSE 0 END) AS promoters,
SUM(CASE WHEN nps_score BETWEEN 0 AND 6 THEN 1 ELSE 0 END) AS
detractors
FROM nps_survey_dataset
),
proportions AS (
SELECT
total_responses,
promoters,
detractors,
CAST(promoters AS FLOAT) / total_responses AS proportion_promoters,
CAST(detractors AS FLOAT) / total_responses AS proportion_detractors
FROM nps_calculation
),
standard_error_calculation AS (
SELECT
total_responses,
proportion_promoters,
proportion_detractors,
(SQRT(
(proportion_promoters * (1 - proportion_promoters) /
total_responses) +
(proportion_detractors * (1 - proportion_detractors) /
total_responses)
)) AS standard_error
FROM proportions
)
SELECT
(proportion_promoters * 100 - proportion_detractors * 100) AS nps,
standard_error,
-- Z-score for 95% confidence level is 1.96
(proportion_promoters * 100 - proportion_detractors * 100) - 1.96 *
(standard_error * 100) AS lower_bound_ci,
(proportion_promoters * 100 - proportion_detractors * 100) + 1.96 *
(standard_error * 100) AS upper_bound_ci
FROM standard_error_calculation;

The result will be:

This shows that the NPS score for the population will range from -49 to -41 with 95% confidence. The 95%
confidence interval means that if we were to repeat this survey multiple times, in 95 out of 100 cases, the true NPS
score for the entire population would fall somewhere within this range. In other words, we’re pretty certain that the
population’s NPS score is somewhere between -49 and -41, but it could vary slightly if we surveyed everyone.

However, this does not mean that there won’t be outliers or individual scores that fall outside this range. The
confidence interval reflects the overall population’s average NPS score, not individual customer responses. It’s
possible to have a few extreme responses (either very positive or very negative) that are not captured by this interval,
but these outliers won’t significantly shift the average NPS score for the entire population.

The term **Pp(1-Pp)** is used to represent the variance in a binomial distribution, which is the distribution for
yes/no outcomes (e.g., Promoter or not Promoter). This term quantifies how much variability there is in the proportion
of Promoters in your sample.

High variability: If there’s a lot of variation between Promoters and non-Promoters in your sample (e.g., the
proportions are more balanced, like 50% Promoters and 50% non-Promoters), the value of **Pp(1-Pp)** will
be larger (0.5 × 0.5 = 0.25).

Low variability: If almost everyone in your sample is either a Promoter or not a Promoter (e.g., 90% Promoters
and only 10% non-Promoters), there’s less variability, and the value of **Pp(1-Pp)** will be smaller (0.9 × 0.1 = 0.09).

Generalize to the Population with Normal Distribution Approximation

The normal distribution approximation to the binomial distribution works well when:

1. The sample size is large enough - a common rule of thumb is that a sample size of 30 or more is often considered
enough for the normal distribution, but this is under ideal conditions (symmetrically distributed data). For NPS,
where the data can be skewed, you often need larger samples.

2. The probability of success is not too close to 0 or 1.

Specifically, the approximation is typically considered valid when both:

**n × Pp ≥ 5** and **n × (1−Pp) ≥ 5**

where:
n is the number of trials (or sample size).

**Pp** is the probability of success (e.g., the proportion of Promoters in NPS).

**1-Pp** is the probability of failure (e.g., the proportion of non-Promoters).

WITH promoter_calculation AS (
    SELECT
        COUNT(*) AS total_responses,
        SUM(CASE WHEN nps_score >= 9 THEN 1 ELSE 0 END) AS promoter_count
    FROM nps_survey_dataset
),
proportion_calculation AS (
    SELECT
        total_responses,
        promoter_count,
        CAST(promoter_count AS FLOAT) / total_responses AS p,
        CAST(total_responses - promoter_count AS FLOAT) / total_responses AS non_promoter_p
    FROM promoter_calculation
)
SELECT
    total_responses,
    p,
    non_promoter_p,
    total_responses * p AS n_times_p,
    total_responses * (1 - p) AS n_times_1_minus_p,
    CASE
        WHEN total_responses * p >= 5 AND total_responses * (1 - p) >= 5 THEN 'Conditions Met'
        ELSE 'Conditions Not Met'
    END AS condition_check
FROM proportion_calculation;

The results show that the condition is met:

If the NPS distribution is normally distributed, the variance or standard deviation of the NPS scores treated as a
continuous variable can be used to calculate the confidence interval. Instead of calculating the proportions of
Promoters and Detractors, you would use the sample mean and sample variance of the NPS scores directly.

1. Calculate the Mean (NPS): The mean NPS score from your sample:

Where **X** represents the individual NPS scores and **n** is the total number of responses.

1. Calculate the Standard Error (SE) using the variance:

2. Calculate the Confidence Interval (CI):

Where:

**Z** is the Z-score for the desired confidence level (e.g., 1.96 for 95% confidence).

We will now write a SQL query for the above situation where the NPS scores are weighted as follows:

Promoters (9–10): 1 point,

Passives (7–8): 0 points,

Detractors (0–6): -1 point,

we will apply the normal approximation to calculate the NPS and the confidence interval for the given population.

WITH nps_transformation AS (
SELECT
-- Assign weights based on NPS score
CASE
WHEN nps_score >= 9 THEN 1 -- Promoters (9-10)
WHEN nps_score BETWEEN 7 AND 8 THEN 0 -- Passives (7-8)
WHEN nps_score <= 6 THEN -1 -- Detractors (0-6)
END AS transformed_nps_score
FROM nps_survey_dataset
),
nps_statistics AS (
SELECT
-- Calculate the mean of the transformed NPS scores
AVG(transformed_nps_score) AS mean_transformed_nps,
-- Calculate the variance of the transformed NPS scores
VARIANCE(transformed_nps_score) AS variance_transformed_nps,
-- Count total responses
COUNT(*) AS total_responses
FROM nps_transformation
),
standard_error_calculation AS (
SELECT
mean_transformed_nps,
-- Standard error = sqrt(variance / n)
SQRT(variance_transformed_nps / total_responses) AS standard_error
FROM nps_statistics
)
SELECT
-- Transformed NPS mean, scaled to match traditional binomial NPS (-100 to
100)
mean_transformed_nps * 100 AS transformed_nps,
standard_error,
-- Calculate the 95% confidence interval (Z = 1.96)
(mean_transformed_nps * 100) - 1.96 * (standard_error * 100) AS
lower_bound_ci,
(mean_transformed_nps * 100) + 1.96 * (standard_error * 100) AS
upper_bound_ci
FROM standard_error_calculation;

You can see that the results are pretty close to our calculations using binomial distribution:

What are NPS Ranges Across Industries?

Here is a snapshot of what the NPS scores mean (each entry pairs the interpretation with the typical customer sentiment):

More Detractors than Promoters. Room for serious concern (unhappy, likely to leave or speak negatively)

Equal number of Promoters and Detractors (neutral, could go either way)

More Promoters than Detractors. Good, but needs improvement (generally satisfied, but potential vulnerabilities)

Excellent. Strong customer loyalty (very satisfied, likely to recommend)

Outstanding. Almost all Promoters, very few Detractors (extremely satisfied, strong brand advocacy)

NPS benchmarks vary by industry (for example, Financial Services/Banking and Television/Internet Providers), and you
can find plenty of such benchmark data on the web. Here is a summary of NPS scores across industries:

NPS > 50 is considered excellent across most industries.

NPS between 30 and 50 is good, indicating satisfied customers with potential areas for improvement.

NPS below 30 signals that there’s significant room for improvement, and a negative NPS indicates customer
dissatisfaction.

Tip: Compare your NPS to industry averages to get a clearer picture of how you’re performing relative to
competitors.

Customer Expectations: In industries like telecommunications or utilities, customers generally have lower
expectations for service and satisfaction, which results in lower average NPS scores. In contrast, tech and retail
sectors often have higher customer expectations, and companies must work harder to earn high NPS scores.

Competition and Product Nature: Some industries, such as e-commerce or SaaS, can easily provide a high-quality,
personalized customer experience, leading to higher NPS scores. In contrast, industries like insurance or telecom,
which are often seen as commoditized or have more rigid service structures, tend to see lower NPS scores.

Customer Interaction Complexity: Companies in industries that have complex customer interactions, like
healthcare or financial services, often have lower NPS scores, since these industries deal with more intricate services
that are harder to standardize in terms of customer experience.

A weighted NPS is used in situations where an organization wants to emphasize certain customer segments or give
different levels of importance to customer feedback. The traditional NPS equally balances Promoters and Detractors,
while ignoring Passives, but some business contexts might justify a weighted approach. Some companies do adopt
custom variations of NPS for their internal metrics, especially in B2B, enterprise-level organizations, or premium
service sectors, where certain customers are significantly more valuable than others. These variations often remain
proprietary, tailored to the company’s business model and customer engagement strategy.

Here are some ways in which weighted NPS could be used:

1. When Certain Groups Are More Critical to the Business:

Promoters could be given a higher weight if the business wants to strongly emphasize the importance of
customer advocacy and referrals. For example, in industries where word-of-mouth marketing is crucial,
the impact of Promoters could be magnified.

Detractors could be downweighted if their negative feedback is less concerning for certain business models
(e.g., highly niche markets where negative feedback from outliers is less relevant).

2. When Passives Play a Significant Role:

Passives typically do not impact NPS, but in certain industries, satisfied but unenthusiastic customers
might still provide value (e.g., they are long-term customers who continue to purchase but don’t actively
promote). A weighted NPS could include Passives to account for their steady contribution to revenue.

3. Customizing NPS for Specific Business Goals:

Companies might want to assign different weights to customer segments based on profitability, brand
loyalty, or customer lifetime value (CLV). For instance, a high-value segment of Promoters could be
weighted more heavily to reflect their overall business impact.
A weighted NPS could be used to focus more on customer satisfaction in high-margin products or
premium services where Passives may still contribute significantly to profit.

4. B2B vs. B2C Contexts:

In B2B (business-to-business) environments, where relationships with clients tend to be deeper and longer-
lasting, a weighted NPS might be useful. For example, Passives (clients who continue using the service
without actively recommending it) might be more valuable in a B2B context than in B2C (business-to-
consumer), where immediate action from Promoters or Detractors is more critical.

5. Long-Term Strategy vs. Short-Term Tactics:

In some cases, companies may want to emphasize long-term relationships with customers over short-term
sales. A weighted NPS could assign more points to Passives or Promoters who may not actively advocate
but continue to make purchases, supporting a long-term retention strategy.

6. Customized NPS in Specific Sectors:

Some industries might use a weighted NPS to tailor the metric to the realities of their customer dynamics:

Healthcare: The stakes are high, and dissatisfied customers (Detractors) could have outsized impacts,
so Promoters might be weighted higher to emphasize positive patient experiences.

Luxury Brands: Here, Promoters are especially valuable, so their feedback might be assigned more
weight.

Let’s assume the following weights:

The adjusted NPS formula would then become:

Where:

**Pp** is the proportion of Promoters,

**Ppassive** is the proportion of Passives,

**Pd** is the proportion of Detractors.

The general formula for the standard error of a weighted sum of proportions is:

Where:

**n** is the total number of survey responses and the other parameters are as defined above.

Let us assume the scenario where Promoters get +2 points, Passives get +1 point, and Detractors get -3 points:

WITH nps_transformation AS (
SELECT
-- Assign weights based on NPS score
CASE
WHEN nps_score >= 9 THEN 2 -- Promoters (9-10 get +2 points)
WHEN nps_score BETWEEN 7 AND 8 THEN 1 -- Passives (7-8 get +1
point)
WHEN nps_score <= 6 THEN -3 -- Detractors (0-6 get -3 points)
END AS transformed_nps_score
FROM nps_survey_dataset
),
nps_statistics AS (
SELECT
-- Calculate the mean of the transformed NPS scores
AVG(transformed_nps_score) AS mean_transformed_nps,
-- Calculate the variance of the transformed NPS scores
VARIANCE(transformed_nps_score) AS variance_transformed_nps,
-- Count total responses
COUNT(*) AS total_responses
FROM nps_transformation
),
standard_error_calculation AS (
SELECT
mean_transformed_nps,
-- Standard error = sqrt(variance / n)
SQRT(variance_transformed_nps / total_responses) AS standard_error
FROM nps_statistics
)
SELECT
-- Transformed NPS mean, scaled to match traditional binomial NPS (-100 to
100)
mean_transformed_nps * 100 AS transformed_nps,
standard_error,
-- Calculate the 95% confidence interval (Z = 1.96)
(mean_transformed_nps * 100) - 1.96 * (standard_error * 100) AS
lower_bound_ci,
(mean_transformed_nps * 100) + 1.96 * (standard_error * 100) AS
upper_bound_ci
FROM standard_error_calculation;

The results will be:

Transformed NPS: -136

This indicates that, after applying the custom weights (+2 for Promoters, +1 for Passives, and -3 for
Detractors), the overall NPS score is significantly negative, reflecting a majority of Detractors compared to
Promoters.

Standard Error: 0.0692

This represents the uncertainty or variability in the transformed NPS score. A relatively small standard error
suggests that the data is not very spread out and the NPS score is stable within the dataset.

95% Confidence Interval:

This confidence interval shows that the true value of the transformed NPS is likely to fall between -149.57 and
-122.43 with 95% confidence.
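
As a cross-check, you could also compute the standard error directly from the weighted-proportion formula given earlier, rather than from the variance of the transformed scores. A minimal sketch, assuming the same nps_survey_dataset and the +2/+1/-3 weights used above:

WITH counts AS (
    SELECT
        COUNT(*) AS n,
        SUM(CASE WHEN nps_score >= 9 THEN 1 ELSE 0 END) / CAST(COUNT(*) AS FLOAT) AS p_promoter,
        SUM(CASE WHEN nps_score BETWEEN 7 AND 8 THEN 1 ELSE 0 END) / CAST(COUNT(*) AS FLOAT) AS p_passive,
        SUM(CASE WHEN nps_score <= 6 THEN 1 ELSE 0 END) / CAST(COUNT(*) AS FLOAT) AS p_detractor
    FROM nps_survey_dataset
)
SELECT
    -- SE = sqrt( Wp^2*Pp(1-Pp)/n + Wpassive^2*Ppassive(1-Ppassive)/n + Wd^2*Pd(1-Pd)/n )
    SQRT(
        (POWER(2, 2) * p_promoter * (1 - p_promoter) / n) +
        (POWER(1, 2) * p_passive * (1 - p_passive) / n) +
        (POWER(3, 2) * p_detractor * (1 - p_detractor) / n)
    ) AS weighted_standard_error
FROM counts;

The two estimates should be in the same ballpark; small differences are expected because the variance of the transformed scores accounts for the relationship among the three categories, while the formula above treats each proportion independently.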

We are going to do a pairwise computation of the correlation between traditional NPS scores and the various attributes
we have:

Calculate the Pearson correlation between the transformed NPS score and each attribute using the formula:

This gives the pairwise correlation for each attribute against the transformed NPS score. The SQL code will be:
WITH nps_transformation AS (
SELECT
-- Assign numerical values to NPS score categories
CASE
WHEN nps_score >= 9 THEN 1 -- Promoters (9-10 get +1)
WHEN nps_score BETWEEN 7 AND 8 THEN 0 -- Passives (7-8 get 0)
WHEN nps_score <= 6 THEN -1 -- Detractors (0-6 get -1)
END AS transformed_nps_score,
purchase_frequency,
avg_order_value,
total_spent,
customer_support_interactions,
marketing_emails_clicked
FROM nps_survey_dataset
),
correlation_calculation AS (
SELECT
-- Calculate correlation for each attribute using Pearson's formula
(SUM((transformed_nps_score - (SELECT AVG(transformed_nps_score) FROM
nps_transformation)) * (purchase_frequency - (SELECT AVG(purchase_frequency)
FROM nps_transformation))) /
(SQRT(SUM(POWER(transformed_nps_score - (SELECT
AVG(transformed_nps_score) FROM nps_transformation), 2))) *
SQRT(SUM(POWER(purchase_frequency - (SELECT AVG(purchase_frequency)
FROM nps_transformation), 2))))) AS corr_purchase_frequency,

(SUM((transformed_nps_score - (SELECT AVG(transformed_nps_score) FROM


nps_transformation)) * (avg_order_value - (SELECT AVG(avg_order_value) FROM
nps_transformation))) /
(SQRT(SUM(POWER(transformed_nps_score - (SELECT
AVG(transformed_nps_score) FROM nps_transformation), 2))) *
SQRT(SUM(POWER(avg_order_value - (SELECT AVG(avg_order_value) FROM
nps_transformation), 2))))) AS corr_avg_order_value,

(SUM((transformed_nps_score - (SELECT AVG(transformed_nps_score) FROM


nps_transformation)) * (total_spent - (SELECT AVG(total_spent) FROM
nps_transformation))) /
(SQRT(SUM(POWER(transformed_nps_score - (SELECT
AVG(transformed_nps_score) FROM nps_transformation), 2))) *
SQRT(SUM(POWER(total_spent - (SELECT AVG(total_spent) FROM
nps_transformation), 2))))) AS corr_total_spent,

(SUM((transformed_nps_score - (SELECT AVG(transformed_nps_score) FROM


nps_transformation)) * (customer_support_interactions - (SELECT
AVG(customer_support_interactions) FROM nps_transformation))) /
(SQRT(SUM(POWER(transformed_nps_score - (SELECT
AVG(transformed_nps_score) FROM nps_transformation), 2))) *
SQRT(SUM(POWER(customer_support_interactions - (SELECT
AVG(customer_support_interactions) FROM nps_transformation), 2))))) AS
corr_customer_support_interactions,

(SUM((transformed_nps_score - (SELECT AVG(transformed_nps_score) FROM


nps_transformation)) * (marketing_emails_clicked - (SELECT
AVG(marketing_emails_clicked) FROM nps_transformation))) /
(SQRT(SUM(POWER(transformed_nps_score - (SELECT
AVG(transformed_nps_score) FROM nps_transformation), 2))) *
SQRT(SUM(POWER(marketing_emails_clicked - (SELECT
AVG(marketing_emails_clicked) FROM nps_transformation), 2))))) AS
corr_marketing_emails_clicked

FROM nps_transformation
)
SELECT
corr_purchase_frequency,
corr_avg_order_value,
corr_total_spent,
corr_customer_support_interactions,
corr_marketing_emails_clicked
FROM correlation_calculation;

The result will be:

The above results are quite depressing:

corr_purchase_frequency: 0.00248 shows a very weak positive correlation between the NPS score and the
purchase frequency. This means that customers with higher purchase frequency are very slightly more likely to
be Promoters, but the relationship is nearly negligible.

corr_avg_order_value: 0.02237 indicates a weak positive correlation between the average order value and
the NPS score. Customers who spend more on average are marginally more likely to be Promoters, but again, the
relationship is very weak.

corr_total_spent: 0.0105 suggests that the correlation between total spending and the NPS score is also very
weakly positive. This suggests that customers who spend more overall are only slightly more likely to be
Promoters.

corr_customer_support_interactions: 0.02748 shows a slightly stronger positive correlation between
customer support interactions and the NPS score compared to other attributes, but it is still very weak.
Customers who have more interactions with customer support are slightly more likely to give a higher NPS score.

corr_marketing_emails_clicked: -0.0183 indicates a weak negative correlation between the number of
marketing emails clicked and the NPS score. Customers who clicked on more marketing emails are marginally
less likely to be Promoters or are more likely to be Detractors. However, the relationship is still quite weak.

These weak correlations imply that none of the customer attributes included in the analysis are strong predictors of
whether a customer will be a Promoter, Passive, or Detractor.

You may want to explore other customer attributes or use more sophisticated techniques (like feature importance in
machine learning models) to better understand what drives NPS.
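
As an aside, if the engine you are running this on supports a built-in corr() aggregate (both Spark SQL and PostgreSQL do), the same pairwise Pearson correlations can be computed far more compactly. A sketch under that assumption, reusing the nps_transformation step from the query above:

WITH nps_transformation AS (
    SELECT
        -- Assign the same +1/0/-1 weights as in the explicit Pearson query above
        CASE
            WHEN nps_score >= 9 THEN 1
            WHEN nps_score BETWEEN 7 AND 8 THEN 0
            WHEN nps_score <= 6 THEN -1
        END AS transformed_nps_score,
        purchase_frequency,
        avg_order_value,
        total_spent,
        customer_support_interactions,
        marketing_emails_clicked
    FROM nps_survey_dataset
)
SELECT
    corr(transformed_nps_score, purchase_frequency) AS corr_purchase_frequency,
    corr(transformed_nps_score, avg_order_value) AS corr_avg_order_value,
    corr(transformed_nps_score, total_spent) AS corr_total_spent,
    corr(transformed_nps_score, customer_support_interactions) AS corr_customer_support_interactions,
    corr(transformed_nps_score, marketing_emails_clicked) AS corr_marketing_emails_clicked
FROM nps_transformation;

If corr() is not available in your environment, the explicit Pearson formula shown earlier remains the portable fallback.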

One objective is to determine whether we can use the available customer attributes to predict the NPS score for
customers who did not respond to the survey. However, based on the weak correlations observed in the analysis,
building a linear regression model (to generate a continuous NPS value) or a classification model (to categorize
customers into NPS groups using linear methods) with the current set of attributes would likely result in a model with
low predictive power. Here’s why:

The weak correlations suggest that these attributes do not explain much of the variance in the NPS score, indicating
that the relationships between the variables and the NPS score are not strong enough for accurate prediction. To
improve the model’s performance, it may be necessary to create new features or transform existing ones that better
capture the relationships between the data and the NPS. For example:

Combining total spent and average order value to create a new feature representing customer value.

Investigating interaction effects between variables, such as combining purchase frequency and customer
support interactions to discover hidden patterns.

Furthermore, the weak correlations indicate that the relationships between the attributes and NPS might not be linear.
Therefore, using non-linear models such as decision trees, random forests, or gradient boosting machines (GBM)
could be more effective. These models are capable of capturing more complex interactions between variables, which
could lead to better predictions.

Additionally, the current set of attributes may not provide sufficient information to predict NPS accurately. You may
need to incorporate additional or more informative customer attributes that are better predictors of customer
satisfaction and loyalty. Attributes such as customer satisfaction scores, net purchase frequency over time, or social
media interactions might provide deeper insights into a customer’s likelihood of being a Promoter, Passive, or
Detractor.

In summary, improving the feature set, exploring non-linear models, and incorporating additional relevant attributes
may significantly enhance the model’s ability to predict NPS for customers who did not respond to the survey.

The formulas referenced in this section are, in order:

$$\text{NPS} = \%\,\text{Promoters} - \%\,\text{Detractors}$$

$$p \pm Z \times \sqrt{\frac{p \times (1 - p)}{n}}$$

$$n = \frac{Z^2 \times p \times (1 - p)}{E^2}$$

$$n_{\text{adjusted}} = \frac{n}{1 + \frac{n - 1}{N}}$$

$$SE = \sqrt{\frac{P_p (1 - P_p)}{n} + \frac{P_d (1 - P_d)}{n}}$$

$$\text{NPS} \pm Z \times SE$$

$$\bar{X} = \frac{\sum X}{n}$$

$$\sigma^2 = \frac{1}{n} \sum (X - \bar{X})^2$$

$$SE = \frac{\sigma}{\sqrt{n}}$$

$$\text{CI} = \bar{X} \pm Z \times SE$$

$$\text{Adjusted NPS} = W_p \times P_p + W_{\text{passive}} \times P_{\text{passive}} - W_d \times P_d$$

$$SE = \sqrt{\frac{W_p^2 \times P_p (1 - P_p)}{n} + \frac{W_{\text{passive}}^2 \times P_{\text{passive}} (1 - P_{\text{passive}})}{n} + \frac{W_d^2 \times P_d (1 - P_d)}{n}}$$

$$\text{Correlation}(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2} \times \sqrt{\sum (Y - \bar{Y})^2}}$$

Net Promoter Score is -45.


Conditions are met for normal distribution approximation

Results are very similar with normal distribution approximation.

https://data-distiller.all-stuff-data.com/unit-5-data-distiller-identity-resolution/idr-200-extracting-identity-graph-from-
profile-attribute-snapshot-data-with-data-distiller * * *

1. Unit 5: DATA DISTILLER IDENTITY RESOLUTION

IDR 200: Extracting Identity Graph from Profile Attribute Snapshot Data
with Data Distiller
An identity lookup table is a database table used to store identities associated with various identity namespaces in the
Real-Time Customer Profile.

Last updated 6 months ago

You need to get familiar with Profile attribute snapshot dataset explorations. Please complete or browse the following
sections before proceeding:

The profile attribute snapshot dataset is a daily export from the Real-Time Customer Profile. This dataset also contains
the identities for the profile data. The identities are encapsulated in a map data structure with identity types (called
namespaces in AEP parlance) and identity values. There can be multiple identity types (email, cookie, cross-device
such as CRM, etc.) and there can be multiple values within each. You are tasked with transforming this data in the map
structure to a relational structure that can function as a lookup table. This lookup table will serve as the foundation of
all analytics you will do on Real-Time Customer Profile data and beyond. Whether it is Customer 360, SQL Traits, or
ML workflows, this lookup table will be invaluable for those use cases.

Exploring the Map Structure using Data Distiller: The Identity Map

Before you extract the identity information, you need to understand the map structure. Use the to_json feature in Data
Distiller to get in-place information about the schema structure without having to go to the Schemas UI screen. This is
an extremely important feature when working with deeply nested data structures and maps.

SELECT to_json(identityMap)
FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;

If you execute this query, you will see the following:

What you see above is the identity map structure. The map has the identity namespace (email, crmid) as the
unique key to index the values. Note that this is an array of values. If you need to extract the identities, you will need
to use the following code:

SELECT identityMap.email.id
FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;

The results look like the following:

Typically, we would explode this array:

SELECT explode(identityMap.email.id)
FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;

identityMap.email.id gives you the array of email identity values, not just a single email identity value, and
these arrays are great for explode functions.

But there is a problem - plain vanilla “explode” will get rid of rows that do not have any identity values. That is a
problem because the absence of an identity is a signal and is possibly associated with other identities that the person
may have. Let us fix the issue by using explode_outer from Data Distiller:

SELECT explode_outer(identityMap.email.id) AS email
FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;

Extract Identities from the Map Structure to Create an Identity Lookup Table

But just doing an explode_outer on a single identity namespace is of no use. We have destroyed the association
between identities within that same namespace. Let us generate a synthetic UUID for each profile row to keep track of
our identities as we explode them. If we do this right, we will generate a table of UUIDs and related identities as a
columnar table that gives us leverage to use it for a variety of operations.

If we take the first element of the email namespace and concatenate that with the first element of the crmid
namespace, we are guaranteed a unique identifier. If not, it would mean that two rows in the profile export dataset have
that identity in common. The identity stitching in Real-Time Customer Profile should have taken care of it and merged
it into a single row.

Let us now generate the unique UUIDs along with the schema-friendly identity map structure

SELECT concat(identityMap.email.id[0], identityMap.crmid.id[0]) AS unique_id,
       to_json(identityMap)
FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903

The results will look like:

But the concat string function handles NULL values poorly: if any argument is NULL, the whole result is NULL. We have
to remedy that with COALESCE:

SELECT COALESCE(identityMap.email.id[0], identityMap.crmid.id[0]) AS unique_id,
       to_json(identityMap)
FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903

This works and you will get:

If there are two identity namespaces, then the explode_outer operator works on each, one at a time. Make sure
you remove the to_json as we do not need it anymore:

SELECT unique_id, explode_outer(identityMap.email.id) AS email,
       identityMap.crmid.id AS crmid
FROM (
    SELECT coalesce(identityMap.email.id[0], identityMap.crmid.id[0]) AS unique_id, identityMap
    FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903
)

You will get:

We need to explode this one more time for crmid and we should get:

SELECT unique_id, email, explode_outer(crmid) AS crmid
FROM (
    SELECT unique_id, explode_outer(identityMap.email.id) AS email,
           identityMap.crmid.id AS crmid
    FROM (
        SELECT coalesce(identityMap.email.id[0], identityMap.crmid.id[0]) AS unique_id, identityMap
        FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903
    )
)

The results would be:

Using this table, we can do some very interesting analysis. For example, we can look at the histogram of email IDs in
the system:

SELECT bucket, count(unique_id)
FROM (
    SELECT unique_id, count(CASE WHEN email IS NULL THEN 0 ELSE email END) AS bucket
    FROM (
        SELECT unique_id, email, explode_outer(crmid) AS crmid
        FROM (
            SELECT unique_id, explode_outer(identityMap.email.id) AS email,
                   identityMap.crmid.id AS crmid
            FROM (
                SELECT coalesce(identityMap.email.id[0], identityMap.crmid.id[0]) AS unique_id, identityMap
                FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903
            )
        )
    )
    GROUP BY unique_id
    ORDER BY bucket ASC
)
GROUP BY bucket

The answer looks like this:

Why are we seeing 976 as a bucket and 0 counts for unique_ids? The sum of 8225 and 976 adds up to the number of
profiles. Why did we even get this result in the first place? Is there something wrong with the way the SQL is written,
or is there something more profound happening in the identity graph?

Still cannot figure it out? What if the email and crmid identities were non-existent? You have null rows in the table
that you should clean before you do any analytics. For the identity lookup use case, which is the goal of this section, it
would not matter.
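
If you do want to clean those rows for analytics, a minimal sketch is to filter out records where neither identity namespace has a value. Here, identity_lookup is a hypothetical name for the final relational result produced by the query above (for example, saved as a temporary view or a derived dataset):

SELECT unique_id, email, crmid
FROM identity_lookup  -- hypothetical name for the lookup table built above
WHERE email IS NOT NULL
   OR crmid IS NOT NULL;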

Exploding the array helps you extract the identities.

explode_outer retains the nulls for the identity namespaces.

Generate the unique ID without losing identity associations.

The COALESCE function retrieves the first non-null value in a list and is a good choice for a unique ID for the profile in
our example.

Email identities have been separated into separate rows without breaking the profile association.

Identity lookup table in relational form

https://data-distiller.all-stuff-data.com/unit-5-data-distiller-identity-resolution/idr-300-understanding-and-mitigating-
profile-collapse-in-identity-resolution-with-data-distiller * * *

1. Unit 5: DATA DISTILLER IDENTITY RESOLUTION


IDR 300: Understanding and Mitigating Profile Collapse in Identity
Resolution with Data Distiller
Mastering profile cleanup transforms data chaos into clarity, enabling accurate, unified real-time customer profiles with
15+ algorithms.

Last updated 4 months ago

For this tutorial, you will need to ingest the following dataset:

using the steps outlined in:

Understanding Profile Collapse

Adobe Experience Platform Hub

Here is a recap of the fundamental concepts of Adobe Experience Platform. Data enters the Adobe Experience
Platform through edge, streaming, and batch methods. Regardless of the ingestion mode, all data must find its place in
a dataset within the Platform Data Lake. The ingestible data falls into three categories: attributes, events, and lookups.
The Real-Time Customer Profile operates with two concurrent stores – a Profile Store and an Identity Store. The
Profile Store takes in and partitions data based on the storage key, which is the primary identity. Meanwhile, the
Identity Store continuously seeks associations among identities, including the primary one, within the ingested
record, utilizing this information to construct the identity graph. These two stores, accessing historical data in the
Platform Data Lake, open avenues for advanced modeling, such as RFM, among other techniques.

Deterministic and probabilistic identity resolution are two methods used to match and merge customer identities across
data sources, each with its own trade-offs.

Deterministic identity resolution relies on exact matches of unique identifiers, such as email addresses, phone
numbers, or customer IDs. This approach is highly accurate when the data is consistent, ensuring a precise link
between records. However, its limitation lies in the need for identical identifiers across systems, which can reduce
match rates when variations in data exist. Adobe Experience Platform uses deterministic identity resolution.

On the other hand, probabilistic identity resolution uses statistical algorithms and data attributes (e.g., name,
location, browsing behavior) to calculate the likelihood that different records belong to the same individual. While this
approach increases the chances of finding connections even when identifiers differ, it introduces some uncertainty, as
matches are based on probability rather than certainty.

Profile collapse in Adobe Experience Platform occurs when data from different individuals is mistakenly merged into a
single customer profile. This typically happens due to identity management issues or data inconsistencies during the
process of data ingestion and identity resolution.

Data can enter Adobe Experience Platform (AEP) via edge, streaming, or batch ingestion methods. Regardless of how
it arrives, all data is ultimately stored within the Platform Data Lake and categorized as attributes, events, or lookups.
To control how the identity algorithm processes data, you need to prepare the data before it enters the Identity Store.

There are two scenarios for data handling:

1. Batch Ingestion: In this case, data should be processed by Data Distiller and then ingested into both stores. This
approach works seamlessly, as Data Distiller can handle these jobs in either full or incremental processing modes.

2. Real-Time Ingestion (Edge or Streaming): In this scenario, processing cannot occur immediately because you
need to act on data in real time. However, all real-time data is stored in the Data Lake, allowing for periodic
identity operations when system downtime is possible, preferably during low-traffic periods. To perform these
operations, follow these steps:

Disable the existing dataflow for Dataset A, where real-time data is currently being ingested.

Unmark Dataset A for Profile.

Apply the identity algorithm operations to Dataset A, creating a new Dataset B and marking it for Profile.
Ensure Dataset B is empty and marked for Profile before hydrating it.

Set up a new dataflow from the same source to Dataset B. This setup will ensure that both historical and
new streaming data are ingested seamlessly.
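
The identity-algorithm step referenced above is simply a Data Distiller job over the source dataset. Below is a minimal, hypothetical sketch using the Dataset A / Dataset B naming from scenario 2; the table names dataset_a and dataset_b are placeholders, the columns mirror the example_dataset used later in this section, and the rule shown (keep only the latest record per ECID) stands in for whatever identity algorithm you actually apply.

-- Hypothetical sketch: hydrate Dataset B from Dataset A with an identity rule applied
-- (here, keep only the latest record per ECID; substitute your own algorithm)
CREATE TABLE dataset_b AS
SELECT ECID, CRM_ID, Device_Type, Login_Timestamp
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ECID ORDER BY Login_Timestamp DESC) AS rn
    FROM dataset_a
) ranked
WHERE rn = 1;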

Shared Devices or Browsers

Example: When multiple users share the same device or browser, the same Experience Cloud ID (ECID) is used
for all interactions.

How It Causes Profile Collapse: Since the ECID remains constant across different users, AEP may mistakenly
associate all interactions with a single profile, merging data from multiple individuals. For example, if two
customers log in using the same device, their distinct CRM IDs may both get linked to the same ECID, leading to
a merged profile in the Profile Store.

Identity Fragmentation (Multiple Identities for the Same Individual)

Example: A single user may appear under different identifiers across data sources or channels, such as different
CRM IDs for B2B and B2C interactions, different emails, or multiple login credentials.

How It Causes Profile Collapse: When AEP tries to link these identities, it may merge data from different
identifiers into a single profile. If the linking is done inconsistently or incorrectly, it can lead to a collapsed profile
where data from different roles or personas of the same individual gets combined inappropriately.

Duplicate Records from Poor Data Quality

Example: Poor data quality can lead to duplicate records where the same person is listed multiple times with slight variations in the data.

How It Causes Profile Collapse: If these duplicate records are not accurately distinguished, AEP’s identity
resolution may treat them as separate initially but later merge them into one profile, resulting in collapsed data.

Cross-Device or Cross-Channel Tracking

Example: Users interact across multiple devices or channels, generating different identifiers (e.g., cookies,
mobile IDs, ECIDs).

How It Causes Profile Collapse: If identity resolution does not accurately map these identifiers back to the same
person, data from different users who share similar device or channel characteristics may be incorrectly merged,
causing profile collapse.

Shared Network Environments (e.g., Public Wi-Fi)

Example: Multiple users on the same network (e.g., public Wi-Fi) can appear to share similar network-based
identifiers, such as IP addresses.

How It Causes Profile Collapse: If network-based identifiers are used as part of the identity resolution, this can
result in incorrect associations between different users on the same network, leading to merged profiles.

Using IP addresses as identifiers in profile resolution can be problematic because they do not uniquely represent
individual users. Here are several reasons why IP addresses can lead to inaccurate identity resolution and profile
collapse:

1. Shared Networks: Multiple users often share the same IP address when they are on the same network, such as
public Wi-Fi in a coffee shop, office network, or a household router. In these scenarios, an IP-based identifier
may incorrectly group different users as one, resulting in profile collapse.

2. Dynamic IP Addresses: Many internet service providers (ISPs) assign dynamic IP addresses, meaning a user’s
IP can change over time. For instance, when a user reconnects to the internet or moves between networks, they
might receive a new IP address. This variability can lead to fragmented profiles for the same user or incorrect
matches with other users.

3. Proxy Servers and VPNs: Users who access the internet via a proxy server or a virtual private network (VPN)
can share the same IP address across multiple devices and locations. This introduces further ambiguity, as users
from entirely different networks can appear to have identical IP addresses, complicating identity resolution.

4. Geographic Misinterpretations: IP addresses can sometimes be misleading in terms of location, especially with
mobile carriers or large ISPs that route traffic through different hubs. Users may appear to be connecting from
one location even if they are in a completely different one, which can skew profile data and misrepresent user
behavior.

5. IP Address Rotation and NAT: Many large networks, especially in enterprise or cellular environments, use
Network Address Translation (NAT), allowing multiple devices to share a single public IP address. In these cases,
hundreds or thousands of users may appear to be connecting from the same IP, making it an unreliable identifier
for individual profiles.

Third-Party Data with Inconsistent Identity Keys

Example: Integrating third-party data with different identity keys (e.g., hashed emails, mobile IDs) can
complicate the identity resolution process.

How It Causes Profile Collapse: If the third-party identity keys are not consistently linked to primary identifiers
in AEP, data from different individuals may be mistakenly merged.

Keeping prospect data separate from existing customer data in Adobe Experience Platform is crucial for several
reasons, primarily around data accuracy, compliance, and targeted engagement. Here’s a breakdown of why this
separation is important:

1. Different Data Quality and Attributes: Prospect data often lacks the depth and reliability of existing customer
data, as it may come from third-party sources, form fills, or inferred behavior. Mixing it with verified customer
data can dilute the accuracy of customer profiles and lead to mistaken assumptions or merges.

2. Targeted Engagement Strategies: Prospects and existing customers are at different stages in the customer
journey and require distinct engagement strategies. Keeping them separate allows businesses to personalize
messaging based on the user’s status—whether they are a potential customer needing nurturing or an existing
customer who may benefit from cross-selling or loyalty programs.

3. Compliance and Privacy Requirements: Privacy regulations often impose stricter handling requirements for
prospect data, as it may include inferred interests or demographics without direct consent. By isolating prospect
data, you can manage it according to specific data handling policies, reducing the risk of inadvertently using
unconsented data in customer-specific actions or analyses.

4. Avoiding Profile Collapse and Data Pollution: If prospect data is integrated with customer data prematurely, it
increases the likelihood of profile collapse, where profiles may merge incorrectly due to weak or partial
identifiers. This mixing can lead to inaccurate profiling, mistaken identity resolution, and ineffective targeting.
Keeping the data separate helps maintain clean, verified customer records.

5. Flexible and Scalable Identity Resolution: By separating prospects, organizations can apply tailored identity
resolution processes, especially if the prospect data has a different set of identifiers or is less frequently updated.
This approach ensures that identity resolution can be scaled and adjusted for prospects without impacting
customer data accuracy.

Different Data Sources with Varying Update Frequencies

Example: Data sources may update at different intervals, such as real-time data streams versus weekly CRM
updates.

How It Causes Profile Collapse: If identities are linked or unlinked based on out-of-sync information, the
Identity Store may merge profiles incorrectly.

Incorrectly Implemented Identity Resolution Logic in Ingested Data

Example: Errors in the data ingestion process can send incorrect identity mapping information to AEP.

How It Causes Profile Collapse: These mistakes can lead to improper associations between identifiers, causing
unrelated profiles to be merged.

Why Profile Collapse Matters

Profile collapse impacts the accuracy and effectiveness of customer data by:

Incorrect Personalization: Merging unrelated data results in irrelevant or misleading personalization, causing
customers to receive content or offers that don’t apply to them.

Data Quality and Analysis Issues: Aggregated metrics may reflect combined behaviors of different individuals,
leading to skewed analysis and inaccurate insights.

Privacy Risks: Mixing data from multiple users can unintentionally expose personal information to the wrong
individual.

To minimize the risk of profile collapse, consider:

Refining Identity Matching Rules: Use more precise matching criteria and multiple attributes to accurately
resolve identities.

Improving Data Quality: Address duplicate records and inconsistencies at the source before ingestion.

Configuring Identity Graphs Carefully: Ensure identity resolution rules and graph configurations account for
different identity sources and their unique characteristics.

Regular Monitoring and Auditing: Implement checks to detect and rectify potential profile collapses.

In the context of profile collapse, the trade-off between deterministic and probabilistic identity resolution revolves
around balancing accuracy and coverage.

Deterministic methods offer high accuracy by linking profiles only when there are exact matches on unique
identifiers, reducing the risk of mistakenly merging different individuals but potentially leaving some profiles
fragmented if identifiers are inconsistent or missing.
On the other hand, probabilistic methods expand coverage by using statistical algorithms to match records based on
similarities in behaviors, patterns, or non-unique identifiers. While this approach can unify more profiles even when
exact matches are unavailable, it also increases the likelihood of false positives, leading to profile collapse where data
from different individuals is mistakenly combined.

To mitigate this, a hybrid approach is typically used: start with deterministic matches, then leverage probabilistic methods to fill in the gaps, balancing the need for comprehensive identity resolution against the risk of inaccurate profile merging.
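
A minimal sketch of this hybrid pattern, written against the example_dataset used in the queries that follow (columns ECID and CRM_ID), is shown below. The first pass links records deterministically on an exact CRM_ID match; the second pass uses a fuzzy-match rule (Levenshtein distance, assuming a levenshtein() function is available, as in Spark SQL) as a simple stand-in for a probabilistic scoring model and only considers records left unmatched by the first pass.

-- Pass 1: deterministic links (exact CRM_ID match across different ECIDs)
CREATE TEMP TABLE deterministic_links AS
SELECT a.ECID AS ECID1, b.ECID AS ECID2
FROM example_dataset a
JOIN example_dataset b
  ON a.CRM_ID = b.CRM_ID
 AND a.ECID < b.ECID;

-- Pass 2: fuzzy-match fallback for ECIDs not linked deterministically
-- (used here as a stand-in for a probabilistic scoring model)
CREATE TEMP TABLE probabilistic_links AS
SELECT a.ECID AS ECID1, b.ECID AS ECID2
FROM example_dataset a
JOIN example_dataset b
  ON a.ECID < b.ECID
WHERE levenshtein(a.CRM_ID, b.CRM_ID) <= 2
  AND a.CRM_ID <> b.CRM_ID
  AND a.ECID NOT IN (SELECT ECID1 FROM deterministic_links)
  AND b.ECID NOT IN (SELECT ECID2 FROM deterministic_links);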

Querying the Data to Identify Collapsed Profiles

First, we will identify the profiles that are collapsed, based on the presence of different CRM_IDs for the same ECID.

SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count


FROM example_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

This query identifies the ECID values associated with multiple CRM_IDs, indicating potential collapsed profiles.

Mitigating Profile Collapse Strategies

When cleaning datasets to resolve collapsed profiles caused by multiple CRM_IDs linked to the same ECID, there are
various rule-based strategies that can be employed. The choice of strategy depends on business requirements and the
characteristics of the data. Here are some possible rule-based strategies to resolve such profile collapses:

Keeping the Latest Record

Description: This strategy focuses on retaining the most recent record for each ECID by using the
Login_Timestamp to identify the latest activity.

Use Case: This approach is valuable when the latest activity is considered the most accurate or relevant representation
of a user profile. By keeping only the most recent data, you ensure that the profile reflects the most current user
information, excluding outdated or redundant records that may no longer be valid.

Implementation Example: Execute the following code blocks sequentially to implement this method, making sure to
run each step individually to maintain data integrity:

DROP TEMP TABLE IF EXISTS cleaned_dataset;

-- Latest Record Algorithm


CREATE TEMP TABLE cleaned_dataset AS
SELECT *
FROM example_dataset a
WHERE Login_Timestamp = (
SELECT MAX(Login_Timestamp)
FROM example_dataset b
WHERE a.ECID = b.ECID
);

--Test to see if there is profile collapse


SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

The result is not shown, as the query should return no rows.

For the algorithms below, we won't display the screen with empty rows; if an algorithm functions correctly, a query that returns no rows is the expected outcome.

Keeping the Earliest Record

Description: This strategy retains only the earliest record for each ECID, determined by the
Login_Timestamp.

Use Case: This approach is useful when the initial identification or first interaction with a profile is considered
the most reliable, ensuring that any subsequent, potentially inconsistent updates do not affect the profile’s original
data.

Implementation Example: Execute each code block one at a time to avoid conflicts and ensure data integrity:

DROP TEMP TABLE IF EXISTS cleaned_dataset;

-- Earliest Record Algorithm
CREATE TEMP TABLE cleaned_dataset AS
SELECT *
FROM example_dataset a
WHERE Login_Timestamp = (
    SELECT MIN(Login_Timestamp)
    FROM example_dataset b
    WHERE a.ECID = b.ECID
);

-- Test to see if there is profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

Prioritizing a Specific CRM_ID Type

Description: This strategy retains records based on a preferred CRM_ID type, giving priority to a specific type
(e.g., always keeping B2C over B2B).

Use Case: This approach is helpful when certain customer relationship types are more important for analysis or
operations. For example, retail customers (B2C) may be prioritized over business customers (B2B) to focus on
individual consumer behavior in retail environments.

Implementation Example: Execute each code block sequentially to ensure the correct data is retained without
conflicts:

DROP TEMP TABLE IF EXISTS cleaned_dataset;

-- Prioritizing a Specific Identity Type
CREATE TEMP TABLE cleaned_dataset AS
SELECT *
FROM example_dataset
WHERE CRM_ID LIKE 'B2C_%';

-- Test to see if there is profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

Resolving Based on a Scoring System

Description: Apply a scoring system to rank profiles by various attributes (such as login recency, interaction
count, or device type), then retain the highest-scoring record for each ECID.

Use Case: Ideal when determining the “best” record requires multiple criteria, offering a more refined method for
resolving collapsed profiles.

Implementation Example: Execute each code block in sequence to ensure accurate and conflict-free data
processing:

DROP TEMP TABLE IF EXISTS cleaned_dataset;

CREATE TEMP TABLE cleaned_dataset AS
SELECT ECID, CRM_ID, Device_Type, Login_Timestamp
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ECID ORDER BY Score DESC) AS rn
    FROM (
        SELECT *,
               -- Scoring logic: more recent logins and 'Desktop' device type score higher
               CASE WHEN Device_Type = 'Desktop' THEN 10 ELSE 5 END
               + DATEDIFF(day, '2024-10-01', Login_Timestamp) AS Score
        FROM example_dataset
    ) scored_data
) ranked_data
WHERE rn = 1;

-- Test to see if there is profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

Merging Attributes from Collapsed Records

Description: Instead of discarding any records, merge the attributes from all records associated with the same
ECID. This could involve creating lists of values, aggregating metrics, or applying other transformation rules.

Use Case: Suitable when all available information is valuable and should be preserved, such as combining
multiple phone numbers or email addresses associated with a profile.

Implementation Example: Make sure you execute the code blocks one by one

DROP TEMP TABLE IF EXISTS cleaned_dataset;

CREATE TEMP TABLE cleaned_dataset AS


SELECT ECID,
STRING_AGG(CRM_ID, ', ') AS CRM_ID,
MAX(Login_Timestamp) AS Last_Login,
COUNT(DISTINCT Device_Type) AS Unique_Device_Count
FROM example_dataset
GROUP BY ECID;

--Test to see if there is profile collapse


SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

Keeping a Combination of the Latest Record by Type

Description: If an ECID is associated with multiple CRM_ID types, keep the latest entry for each type.

Use Case: Useful in scenarios where having both B2B and B2C relationships is important for the same user, and
the latest activity for each type is relevant.

Implementation Example: Make sure you execute the code blocks one by one

DROP TEMP TABLE IF EXISTS cleaned_dataset;

CREATE TEMP TABLE cleaned_dataset AS


SELECT ECID, CRM_ID, Device_Type, Login_Timestamp
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY ECID ORDER BY
CASE
WHEN SUBSTRING(CRM_ID, 1, 3) = 'B2C' THEN 1
ELSE 2
END,
Login_Timestamp DESC) AS rn
FROM example_dataset
) filtered_data
WHERE rn = 1;

-- Test to see if there is profile collapse


SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

Removing Profiles with Ambiguous Identity Mapping

Description: If an ECID is linked to different CRM_ID types that cannot be resolved through any of the above
rules, these profiles can be flagged or removed for manual review.

Use Case: Suitable for cases where automation cannot confidently resolve the conflicts, requiring human
oversight.

Implementation Example: Make sure you execute the code blocks one by one

DROP TEMP TABLE IF EXISTS cleaned_dataset;

-- Step 1: Identify ambiguous profiles (those with multiple CRM_IDs per ECID)
CREATE TEMP TABLE ambiguous_profiles AS
SELECT ECID
FROM example_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

-- Step 2: Create the cleaned dataset by excluding ambiguous profiles


CREATE TEMP TABLE cleaned_dataset AS
SELECT *
FROM example_dataset
WHERE ECID NOT IN (SELECT ECID FROM ambiguous_profiles);

-- Test to see if there is profile collapse


SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

Utilizing External Reference Data for Resolution

Description: Reference an external dataset, such as a master customer list, to resolve which CRM_ID should be
considered valid.

Use Case: This approach is useful when a trusted external source can guide the resolution, especially when there
are well-maintained external records.

Implementation Example: Make sure you execute the code blocks one by one
DROP TEMP TABLE IF EXISTS cleaned_dataset;

CREATE TEMP TABLE cleaned_dataset AS


SELECT a.*
FROM example_dataset a
JOIN master_customer_list b ON a.CRM_ID = b.Valid_CRM_ID;

--Test to see if there is profile collapse


SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

Using Confidence Scores for Identity Resolution

Description: Assign confidence scores to each CRM_ID association based on various factors, such as the number
of interactions, consistency of login information, or verification level. Higher confidence scores indicate stronger
associations.

Use Case: Useful when multiple identifiers are present, and some are more reliable than others. This strategy
helps prioritize the most trusted associations.

Implementation Example: Make sure you execute the code blocks one by one

DROP TEMP TABLE IF EXISTS cleaned_dataset;

CREATE TEMP TABLE cleaned_dataset AS
SELECT ECID, CRM_ID, Device_Type, Login_Timestamp
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ECID ORDER BY Confidence_Score DESC) AS rn
    FROM (
        SELECT *,
               -- Example confidence scoring based on frequency and consistency
               COUNT(*) *
               CASE WHEN Device_Type = 'Desktop' THEN 1.5 ELSE 1 END AS Confidence_Score
        FROM example_dataset
        GROUP BY ECID, CRM_ID, Device_Type, Login_Timestamp
    ) scored_data
) ranked_data
WHERE rn = 1;

--Test to see if there is profile collapse


SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

This query creates a cleaned dataset by selecting the highest-confidence record for each unique **ECID**,
effectively reducing profile overlap and potential data duplication. It starts by calculating a
**Confidence_Score** for each record in the **example_dataset** based on the frequency of
occurrences and device type, with desktop interactions receiving a higher weight. Next, it ranks each record within
each **ECID** group according to this confidence score, assigning the top rank to the highest-scoring entry. Finally,
the query filters out only the top-ranked record for each **ECID**, ensuring that each profile is represented by a
single, most reliable entry. This approach helps to minimize profile collapse by excluding lower-confidence records
and prioritizing the most relevant data for each unique customer.

Retaining Records Within a Time Window

Description: Retain records based on a time-based rule, such as keeping data within a specific time window (e.g., the last 12 months, as in the example below). Older data can be archived or discarded.

Use Case: Suitable when the most recent data is more relevant than historical data, or when you want to reduce
data volume while keeping the latest information.

Implementation Example: Make sure you execute the code blocks one by one

DROP TEMP TABLE IF EXISTS cleaned_dataset;

CREATE TEMP TABLE cleaned_dataset AS
SELECT ECID, CRM_ID, Device_Type, Login_Timestamp
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ECID ORDER BY Login_Timestamp DESC) AS rn
    FROM example_dataset
    WHERE Login_Timestamp >= DATEADD(month, -12, CURRENT_DATE)
) ranked_data
WHERE rn = 1;

-- Test to see if there is profile collapse


SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

Combining Data Using Aggregation Rules

Description: Aggregate certain fields for profiles that share the same ECID, combining data using aggregation
functions (e.g., summing transaction counts, taking the latest email address).

Use Case: Useful for combining behavioral data while still allowing for unique identifiers to coexist within a
merged profile.

Implementation Example: Make sure you execute the code blocks one by one

DROP TEMP TABLE IF EXISTS cleaned_dataset;

CREATE TEMP TABLE cleaned_dataset AS
SELECT ECID,
       -- Aggregate the CRM_IDs so each ECID keeps a single combined identifier
       STRING_AGG(DISTINCT CRM_ID, ', ') AS CRM_ID,
       MAX(Login_Timestamp) AS Last_Login,
       COUNT(*) AS Interaction_Count,
       STRING_AGG(DISTINCT Device_Type, ', ') AS Device_Types
FROM example_dataset
GROUP BY ECID;

--Test to see if there is profile collapse


SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

Applying Domain-Specific Rules

Description: Use business-specific rules to decide which records to keep. For example, prioritize records from
specific CRM systems (e.g., always keep records from a particular system if multiple sources are integrated).

Use Case: Effective in organizations with a well-defined hierarchy of data sources, where some data sources are
known to be more reliable than others.

Implementation Example: Make sure you execute the code blocks one by one

DROP TEMP TABLE IF EXISTS cleaned_dataset;

CREATE TEMP TABLE cleaned_dataset AS
SELECT *
FROM example_dataset
WHERE CRM_ID IN (
    SELECT CRM_ID FROM trusted_sources  -- make sure you have a trusted source table
);

--Test to see if there is profile collapse


SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

Creating Composite Identifiers for Resolution

Description: Combine multiple fields (e.g., ECID + Device_Type) to create a composite identifier for
deduplication. This strategy adds additional granularity to the process.

Use Case: Helpful in distinguishing between profiles that share the same ECID but differ in other aspects like
device type or location.

Implementation Example: Make sure you execute the code blocks one by one

DROP TEMP TABLE IF EXISTS cleaned_dataset;

CREATE TEMP TABLE cleaned_dataset AS
SELECT ECID, CRM_ID, Device_Type, Login_Timestamp
FROM (
    SELECT ECID, CRM_ID, Device_Type, Login_Timestamp,
           ROW_NUMBER() OVER (PARTITION BY ECID ORDER BY Login_Timestamp DESC) AS final_rn
    FROM (
        -- First deduplication by ECID and Device_Type
        SELECT ECID, CRM_ID, Device_Type, Login_Timestamp,
               ROW_NUMBER() OVER (PARTITION BY ECID, Device_Type ORDER BY Login_Timestamp DESC) AS device_rn
        FROM example_dataset
    ) deduped_by_device
    WHERE device_rn = 1
) final_deduped
WHERE final_rn = 1;

-- Test to see if there is profile collapse


SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

The above query aims to clean up a dataset to prevent profile collapse by ensuring that each **ECID** (a unique
customer identifier) is represented by only one record, even if there are multiple entries across different devices or
CRM IDs. This is achieved through a two-step deduplication process. In the first step, we use a subquery to partition
the data by both **ECID** and **Device_Type**, ordering by **Login_Timestamp** in descending order
to keep only the most recent record for each unique combination of **ECID** and **Device_Type**. This
intermediate result ensures that for each **ECID**, only the latest activity per device type is retained. In the second
step, an outer query applies another **ROW_NUMBER()** partitioned solely by ECID, ordering again by
**Login_Timestamp** in descending order. By selecting only the top-ranked record (**final_rn = 1**),
the query retains only the latest entry per **ECID** across all device types. This two-layered approach effectively
removes duplicate entries and prevents profile collapse by consolidating each **ECID** into a single, most recent
record, providing a clean and unified dataset.

Clustering Techniques for Grouping Similar Profiles

Description: Use clustering algorithms (e.g., k-means, hierarchical clustering) to group similar profiles and
resolve which ones are likely to represent the same individual.

Use Case: Useful when there are subtle differences in data attributes across profiles, and statistical methods can
help identify groups that should be merged.

Implementation Example: This approach would be executed inside Data Distiller Advanced Statistics and
Machine Learning features.

To address profile collapse, our clustering should identify profiles that are likely to belong to the same individual
based on shared or similar identifiers. Here’s an approach focused on preventing profile collapse:

1. Device ID Consistency:

Count the number of unique Device_Type values associated with each ECID and CRM_ID.

Profiles with a wide variety of device types may indicate multiple users on shared devices, which could
contribute to profile collapse.

**device_type_variety**: Counts the distinct Device_Types used by each ECID. High variety
here may indicate a shared device or cross-identity usage.

2. Shared CRM Identifiers:

Calculate the frequency of each CRM_ID per ECID. High frequencies might indicate cases where multiple
profiles with the same CRM_ID are collapsed into one.

A high count of distinct CRM_IDs per ECID suggests possible identity collisions (e.g., a B2B and a B2C
CRM_ID on the same ECID).

**crm_id_count**: Counts distinct CRM_IDs per ECID, helping to detect cases where multiple identities are merged into one, indicating a potential profile collapse.

3. Login Recency:

While not as direct, recency can still help detect anomalies where profile collapse might have caused
inactive or mismatched data.

**login_recency**: While not a direct indicator, it can provide context for activity levels, which may
be useful if certain collapsed profiles appear inactive compared to expected activity.

By clustering based on these features, you’re more likely to identify clusters where ECIDs are artificially merged due
to overlapping CRM_IDs or device types. Profiles within the same cluster that have multiple CRM_IDs could be
flagged for further review or disambiguation, effectively isolating cases where profile collapse is likely occurring.

-- Upper threshold on the number of clusters
SELECT COUNT(DISTINCT ECID) FROM example_dataset;

-- Create the K-Means Model
CREATE OR REPLACE MODEL profile_collapse_clustering
TRANSFORM(vector_assembler(array(device_type_variety, crm_id_count, login_recency)) AS features)
OPTIONS (
    MODEL_TYPE = 'KMEANS',
    NUM_CLUSTERS = 1000,  -- Adjust the number of clusters based on dataset size
    MAX_ITER = 20         -- Set maximum iterations
)
AS
SELECT
    device_type_variety,
    crm_id_count,
    login_recency
FROM
    (SELECT
        ECID,
        COUNT(DISTINCT Device_Type) AS device_type_variety,
        COUNT(DISTINCT CRM_ID) AS crm_id_count,
        DATEDIFF(day, MAX(Login_Timestamp), CURRENT_DATE) AS login_recency
    FROM example_dataset
    GROUP BY ECID);

-- Create a table to store the clusters and predictions for detecting profile collapse
CREATE TABLE IF NOT EXISTS ecid_cluster AS
SELECT *
FROM MODEL_PREDICT(profile_collapse_clustering, 1, (
    SELECT
        ECID,
        COUNT(DISTINCT Device_Type) AS device_type_variety,                 -- Number of unique devices per ECID
        COUNT(DISTINCT CRM_ID) AS crm_id_count,                             -- Number of unique CRM_IDs per ECID
        DATEDIFF(day, MAX(Login_Timestamp), CURRENT_DATE) AS login_recency  -- Recency of last login
    FROM example_dataset
    GROUP BY ECID
));

-- Read the values from ecid_cluster
SELECT * FROM ecid_cluster;
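
To act on the clustering output, you can flag ECIDs within each cluster that carry more than one CRM_ID for further review, as described above. The query below is a minimal sketch; it assumes the MODEL_PREDICT output in ecid_cluster exposes the assigned cluster in a column named prediction, which may be named differently in your environment.

-- Flag profiles within each cluster that carry multiple CRM_IDs
-- Assumption: the MODEL_PREDICT output exposes the assigned cluster in a column named 'prediction'
CREATE TEMP TABLE flagged_cluster_profiles AS
SELECT prediction AS cluster_id,
       ECID,
       crm_id_count,
       device_type_variety
FROM ecid_cluster
WHERE crm_id_count > 1;  -- more than one CRM_ID linked to the same ECID

-- Review the flagged profiles, largest collisions first
SELECT *
FROM flagged_cluster_profiles
ORDER BY crm_id_count DESC;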

Machine Learning-Based Resolution

Description: Train a machine learning model (e.g., logistic regression, decision trees) to predict whether two
profiles should be merged based on features like similarity of CRM_IDs, login patterns, or device types.

Use Case: Ideal for complex datasets where simple rules don’t capture the nuances, and historical labeled data is
available to train a model.

Implementation Example: Use a machine learning framework to train the model, then apply the model
predictions in SQL to clean up the dataset.

Assume that:

We have historical data with labeled pairs of ECIDs and their features (CRM_ID similarity, device
type overlap, login frequency similarity, etc.).

We use logistic regression for binary classification to predict “merge” or “do not merge” for each profile
pair.

Let us first generate the feature engineering dataset:

-- Step 1: Calculate login counts per ECID


CREATE TEMP TABLE profile_logins AS
SELECT
ECID,
CRM_ID,
Device_Type,
Login_Timestamp,
COUNT(*) OVER (PARTITION BY ECID) AS login_count
FROM example_dataset;

-- Step 2: Generate labeled pairs by self-joining on ECID pairs


CREATE TEMP TABLE labeled_pairs AS
SELECT
    p1.ECID AS ECID1,
    p2.ECID AS ECID2,

    -- Feature: CRM_ID Similarity (1 if similar, 0 if not)
    CASE WHEN p1.CRM_ID = p2.CRM_ID THEN 1 ELSE 0 END AS crm_id_similarity,

    -- Feature: Device Type Overlap (1 if similar, 0 if not)
    CASE WHEN p1.Device_Type = p2.Device_Type THEN 1 ELSE 0 END AS device_type_overlap,

    -- Feature: Login Frequency Similarity (absolute difference)
    ABS(p1.login_count - p2.login_count) AS login_frequency_diff,

    -- Example Merge Label (1 if CRM_IDs match, 0 otherwise)
    CASE WHEN p1.CRM_ID = p2.CRM_ID THEN 1 ELSE 0 END AS merge_label

FROM profile_logins p1
JOIN profile_logins p2
    ON p1.ECID < p2.ECID  -- Ensure unique pairs and avoid self-pairing
ORDER BY p1.ECID, p2.ECID;

-- Step 3: Check the result


SELECT * FROM labeled_pairs LIMIT 10;

Train the logistic regression model:

CREATE OR REPLACE MODEL profile_merge_model
TRANSFORM(vector_assembler(array(crm_id_similarity, device_type_overlap, login_frequency_diff)) AS features)
OPTIONS (
    MODEL_TYPE = 'LOGISTIC_REG',
    LABEL = 'merge_label'  -- Logistic Regression for binary classification
) AS
SELECT
    crm_id_similarity,
    device_type_overlap,
    login_frequency_diff,
    merge_label
FROM labeled_pairs;
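
Once the model is trained, the predicted merge decisions can be applied back to the candidate pairs. The sketch below mirrors the MODEL_PREDICT pattern used for clustering earlier in this section; it assumes that non-feature columns (ECID1, ECID2) pass through and that the output exposes a column named prediction (1 = merge, 0 = do not merge), both of which should be verified in your environment.

-- Sketch: score candidate pairs with the trained model
CREATE TEMP TABLE predicted_merges AS
SELECT *
FROM MODEL_PREDICT(profile_merge_model, 1, (
    SELECT
        ECID1,
        ECID2,
        crm_id_similarity,
        device_type_overlap,
        login_frequency_diff
    FROM labeled_pairs
));

-- Review the pairs the model recommends merging
SELECT ECID1, ECID2
FROM predicted_merges
WHERE prediction = 1;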

Manual Review for High-Risk Collapses

Description: Identify high-risk profile collapses (e.g., large discrepancies in profile attributes) and flag them for
manual review.

Use Case: When automated processes cannot resolve all cases with high confidence, manual intervention may be
required for certain profiles.

Implementation Example: Make sure you execute the code blocks one by one

DROP TEMP TABLE IF EXISTS collapsed_profiles;
DROP TEMP TABLE IF EXISTS manual_review;

-- Step 1: Test for profile collapse in the source data
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM example_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

-- Step 2: If profile collapse is detected, create a table of high-risk ECIDs
-- (more than two distinct CRM_IDs on the same ECID)
CREATE TEMP TABLE collapsed_profiles AS
SELECT ECID
FROM example_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 2;

-- Step 3: Compile records needing manual review based on identified ECIDs
CREATE TEMP TABLE manual_review AS
SELECT *
FROM example_dataset
WHERE ECID IN (SELECT ECID FROM collapsed_profiles);

Confidence-Based Merging of Multiple Strategies


Description: Combine multiple strategies by assigning confidence levels to each and merging profiles based on
the highest combined confidence.

Use Case: Ideal when no single rule can address all cases effectively, allowing a multi-strategy approach to
improve resolution accuracy.

Implementation Example: Create multiple confidence scores based on different rules, then aggregate these to
determine the final outcome.
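
One way to sketch this is to compute one score per rule, weight and sum the scores, and keep the highest-scoring record per ECID. In the example below, the weights (0.6 for recency, 0.4 for device consistency) and the reference date are illustrative assumptions, not recommendations.

DROP TEMP TABLE IF EXISTS cleaned_dataset;

-- Combine two rule-based confidence scores with illustrative weights
CREATE TEMP TABLE cleaned_dataset AS
SELECT ECID, CRM_ID, Device_Type, Login_Timestamp
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ECID ORDER BY combined_confidence DESC) AS rn
    FROM (
        SELECT *,
               -- Rule 1: recency score ('2024-01-01' is an arbitrary reference date)
               -- Rule 2: device-consistency score (Desktop weighted higher, as in earlier examples)
               0.6 * DATEDIFF(day, '2024-01-01', Login_Timestamp)
             + 0.4 * CASE WHEN Device_Type = 'Desktop' THEN 10 ELSE 5 END AS combined_confidence
        FROM example_dataset
    ) scored
) ranked
WHERE rn = 1;

-- Test to see if there is profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;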

Fuzzy Matching Techniques for Inexact Data

Description: Apply fuzzy matching on attributes like CRM_ID or name fields to identify records that are close
matches but not exact. This can help clean up cases where variations in identifiers exist.

Use Case: Helpful when there are data entry errors or slight differences in CRM_ID formats that contribute to
profile collapses.

Implementation Example: Use Data Distiller to preprocess the data with fuzzy matching techniques and then apply the cleaned values in SQL. There is a detailed tutorial for this located here.

Audit-Based Reconciliation

Description: Implement auditing rules where profiles with certain characteristics (e.g., frequent logins from
different devices) are automatically flagged for reconciliation checks.

Use Case: Ensures ongoing monitoring of data integrity by routinely checking for signs of potential profile
collapses.

Implementation Example: Make sure you execute the code blocks one by one

DROP TEMP TABLE IF EXISTS flagged_for_audit;

CREATE TEMP TABLE flagged_for_audit AS
SELECT ECID, COUNT(DISTINCT Device_Type) AS Distinct_Device_Count
FROM example_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT Device_Type) > 3;

Choosing the Right Strategy

The appropriate strategy will depend on factors such as:

Data Quality: How consistent and accurate is the CRM_ID data?

Business Rules: Are there clear rules for prioritizing certain records over others?

Data Usage: Will all the attributes be needed, or can some be safely discarded?

Combining multiple strategies may also be necessary, such as merging attributes for certain profiles while prioritizing
the latest records for others. Testing different approaches on sample datasets can help in choosing the optimal strategy
for your specific scenario.

To ensure the cleanup worked, always run a check to see if any ECID still has multiple CRM_IDs:

SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

If this query returns no results, the cleanup was successful.

Finalizing the Derived Dataset

If the cleaned dataset meets your criteria, you can replace the original dataset with the cleaned version or store it as a
new dataset for further processing.

There is profile collapse across the board.

k-means model is created.

Results of the ML prediction for clustering.

https://data-distiller.all-stuff-data.com/unit-5-data-distiller-identity-resolution/idr-301-using-levenshtein-distance-for-fuzzy-matching-in-identity-resolution-with-data-distiller * * *

1. Unit 5: DATA DISTILLER IDENTITY RESOLUTION

IDR 301: Using Levenshtein Distance for Fuzzy Matching in Identity Resolution with Data Distiller

Learn how to apply fuzzy matching with Data Distiller to improve accuracy in identity resolution and profile management.

Last updated 4 months ago

For this tutorial, you will need to ingest the following dataset:

using the steps outlined in:

We will also be using DBVisualizer to extract large volumes of data directly onto our machine from the Data Distiller
backend:

Fuzzy matching is a technique used to identify and link records that are similar but not exact matches, such as
CRM_IDs or name fields with slight variations. By applying fuzzy matching, you can address discrepancies arising
from data entry errors or inconsistent formatting, which are common causes of profile collapse in customer databases.
When records are mistakenly treated as separate due to small variations, fuzzy matching can help consolidate them,
leading to more accurate identity resolution.

Fuzzy matching is particularly valuable when dealing with data entry errors or subtle differences in identifier formats
that cause profile fragmentation or collapse. For instance, CRM_IDs that differ by one or two characters may represent
the same individual, but these small discrepancies prevent a system from recognizing them as such. By applying fuzzy
matching, you can detect and merge these near-duplicates, improving the quality and continuity of the customer
profiles.

The implementation involves preprocessing the data with fuzzy matching techniques using Data Distiller. This initial
processing phase standardizes and identifies close matches within CRM_IDs or name fields. The cleaned and
standardized values are then applied in SQL to create a consolidated dataset, resulting in more accurate profile records
and reducing instances of incorrect merges or data fragmentation. This multi-step approach combines fuzzy matching
capabilities with SQL’s power to efficiently handle large datasets, enhancing overall data accuracy and reliability.

The Levenshtein distance is a way of measuring how different two words (or strings) are from each other. Imagine
you have two words, and you want to see how many small changes it would take to turn one word into the other.

Each change could be:

1. Inserting a new letter (e.g., turning “cat” into “cart” by adding “r”).

2. Deleting a letter (e.g., turning “cart” into “cat” by removing “r”).

3. Replacing one letter with another (e.g., turning “cat” into “bat” by changing “c” to “b”).

The Levenshtein distance counts the minimum number of these small changes needed to make the two words
exactly the same. So, if two words are very similar, like “cat” and “bat,” the distance is small (only one change). If
they’re quite different, like “cat” and “elephant,” the distance is much larger because you’d need many changes.

In essence, Levenshtein distance gives us a simple way to measure how “close” or “far” two words are from each other
based on the number of changes required. It’s often used in spell-checkers or in finding similar records in databases to
help match up entries that might be slightly different due to typos or inconsistencies.
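
If you want to see the measure in action before applying it to identifiers, you can call the levenshtein() string function directly (this sketch assumes Data Distiller exposes it as Spark SQL does; verify in your environment):

-- Quick check of the edit-distance examples above
SELECT levenshtein('cat', 'bat')      AS cat_to_bat,      -- one substitution
       levenshtein('cat', 'cart')     AS cat_to_cart,     -- one insertion
       levenshtein('cat', 'elephant') AS cat_to_elephant; -- many edits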

Identify Fuzzy Matches in CRM_ID using LEVENSHTEIN distance

In the SQL example below, fuzzy matching is applied to identify records with CRM_IDs that are nearly identical, even
if they differ slightly due to minor data entry errors or formatting inconsistencies. The query leverages the
Levenshtein distance function to calculate the edit distance between pairs of **CRM_ID**s, which measures the
minimum number of single-character edits (insertions, deletions, or substitutions) needed to make one **CRM_ID**
identical to another. By setting a threshold of 2, the query identifies CRM_ID pairs that have minor variations—
indicating that they may refer to the same entity but were inconsistently recorded. This approach is particularly useful
in cases where exact matches fail to capture all duplicates due to slight discrepancies. By storing these potential
matches in a temporary table, **fuzzy_matches**, the process allows for a detailed review or automated cleanup
to merge or consolidate profiles, ultimately improving the accuracy and integrity of the dataset.
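
A minimal sketch of such a Step 1 query, assuming a levenshtein() function is available (as in Spark SQL), is shown below; the column names (ECID1, ECID2, CRM_ID1, CRM_ID2, crm_id_similarity_score) match what the later steps in this tutorial expect, and you should adjust the query to your environment before running it.

-- Step 1 (sketch): identify fuzzy CRM_ID matches using Levenshtein distance
CREATE TEMP TABLE fuzzy_matches AS
SELECT
    a.ECID AS ECID1,
    b.ECID AS ECID2,
    a.CRM_ID AS CRM_ID1,
    b.CRM_ID AS CRM_ID2,
    levenshtein(a.CRM_ID, b.CRM_ID) AS crm_id_similarity_score
FROM example_dataset a
JOIN example_dataset b
    ON a.ECID < b.ECID                       -- avoid self-pairs and mirrored duplicates
WHERE a.CRM_ID <> b.CRM_ID
  AND levenshtein(a.CRM_ID, b.CRM_ID) <= 2;  -- threshold of 2 edits, as described above

SELECT * FROM fuzzy_matches;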

Select Best Match Based on the Highest Similarity Score

In the query below, the goal is to identify the best match for each **ECID** based on the highest similarity score
between **CRM_ID**s, using our fuzzy matching algorithm. The query operates on the principle of minimizing the
Levenshtein distance (called edit distance) between **CRM_ID** pairs within each **ECID** group. By finding
the smallest possible **crm_id_similarity_score**, we capture the closest match—meaning the pair with
the least number of character edits needed to make the **CRM_ID**s identical.

The subquery **(SELECT MIN(crm_id_similarity_score)...)** determines this closest match by selecting the smallest **crm_id_similarity_score** for each **ECID1**, representing the record with the
highest similarity. The primary query then filters **fuzzy_matches** to include only those pairs whose similarity
score is equal to this minimum value, effectively creating **best_matches**. This temporary table stores each
**ECID** and its closest matching record, allowing for precise consolidation based on the closest possible
**CRM_ID** values. By focusing on the minimum edit distance, the query ensures that only the best match is
selected for each **ECID**, thus refining identity resolution and reducing the chance of incorrect profile merges.

-- Step 2: Select the best match for each ECID pair based on the highest similarity score
CREATE TEMP TABLE best_matches AS
SELECT ECID1, ECID2, CRM_ID1, CRM_ID2
FROM fuzzy_matches
WHERE crm_id_similarity_score = (
    SELECT MIN(crm_id_similarity_score)
    FROM fuzzy_matches f
    WHERE f.ECID1 = fuzzy_matches.ECID1
);

SELECT * FROM best_matches;

To create the cleaned dataset, use the following:

-- Step 3: Create a new cleaned dataset with the preferred CRM_ID for each ECID
CREATE TEMP TABLE cleaned_dataset AS
SELECT
    a.ECID,
    COALESCE(b.CRM_ID1, a.CRM_ID) AS CRM_ID,  -- Use preferred CRM_ID from best_matches if available
    a.Device_Type,
    a.Login_Timestamp
FROM example_dataset a
LEFT JOIN best_matches b ON a.ECID = b.ECID2;

SELECT * FROM cleaned_dataset;

In the query above, the goal is to create a cleaned dataset where each **ECID** is associated with a preferred
**CRM_ID**, selected based on the closest match identified in the **best_matches** table.

The query works on the principle of data standardization and preference selection—it uses fuzzy matching results
to replace potentially inconsistent or duplicate **CRM_ID**s with the most representative one for each **ECID**.
Here’s how it achieves this:

1. COALESCE Selection: The query applies the **COALESCE(b.CRM_ID1, a.CRM_ID)** function, which
takes the **CRM_ID** from **best_matches** (if available) as the preferred identifier. **COALESCE**
ensures that if there is no match in **best_matches**, the original **CRM_ID** from
**example_dataset** (**a.CRM_ID**) is retained. This means that for each **ECID**, the system
first looks for a refined CRM_ID and defaults to the original one if no match exists.

2. LEFT JOIN: By performing a **LEFT JOIN** between **example_dataset** (**a**) and **best_matches** (**b**) on **ECID**, the query ensures that all records in
**example_dataset** are preserved. Only records with a corresponding **ECID** in
**best_matches** will have the **CRM_ID** replaced, making the cleaned dataset comprehensive while
preserving unmatched entries.

3. Resulting Cleaned Dataset: The **cleaned_dataset** now contains records where each **ECID** is
linked to the best possible **CRM_ID**, improving data consistency by standardizing identifiers based on the
closest match.

The results are:

To remove duplicate records in the **cleaned_dataset** based on **ECID** while retaining only one record
per **ECID**, you can use a **ROW_NUMBER()** function to rank the records within each **ECID** group,
then select only the top-ranked record. This method ensures that duplicates are filtered out, leaving only one preferred
**CRM_ID** for each **ECID**.
-- Step 2a: Create a dataset with preferred CRM_IDs and rank duplicates for removal
CREATE TEMP TABLE ranked_cleaned_dataset AS
SELECT
    a.ECID,
    COALESCE(b.CRM_ID1, a.CRM_ID) AS CRM_ID,  -- Use preferred CRM_ID from best_matches if available
    a.Device_Type,
    a.Login_Timestamp,
    ROW_NUMBER() OVER (PARTITION BY a.ECID ORDER BY a.Login_Timestamp DESC) AS rn  -- Rank records within each ECID by Login_Timestamp
FROM example_dataset a
LEFT JOIN best_matches b ON a.ECID = b.ECID2;

-- Step 2b: Select only the top-ranked record for each ECID to remove duplicates
-- (drop the earlier cleaned_dataset first so this step can recreate it)
DROP TEMP TABLE IF EXISTS cleaned_dataset;
CREATE TEMP TABLE cleaned_dataset AS
SELECT ECID, CRM_ID, Device_Type, Login_Timestamp
FROM ranked_cleaned_dataset
WHERE rn = 1;

Verify the Cleaned Dataset

You can preview the results especially for this small dataset:

The reduced row count in the result set (9458 instead of the original 10,000) indicates that some records were filtered
out as duplicates when we applied the **ROW_NUMBER()** function. This reduction occurred because multiple
records with the same **ECID** were consolidated, keeping only the top-ranked (most recent) record for each
**ECID**. In other words, many ECIDs in the original **example_dataset** had multiple records with
different **CRM_ID**s or duplicate entries based on **Login_Timestamp**. When applying
**ROW_NUMBER()** and keeping only **rn = 1**, we retained only one unique record per **ECID** based
on the most recent **Login_Timestamp**. Therefore, if there were originally 10,000 rows but many duplicate
**ECID**s, filtering down to the top-ranked **rn = 1** record for each unique **ECID** results in only 9458
unique ECIDs in **cleaned_dataset**.

-- Step 3: Verify the result by checking for profile collapse


SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

This shows that there are no duplicates:

Results of the fuzzy match using the Levenshtein distance.

Highest similarity scoring matches.

Cleaned dataset shows duplicate records.

If you scroll down the resultset, you will only see 9458 rows.

There are no duplicates in the 9458 identity matches.

https://data-distiller.all-stuff-data.com/unit-5-data-distiller-identity-resolution/idr-302-algorithmic-approaches-to-b2b-contacts-unifying-and-standardizing-across-sales-orgs * * *

1. Unit 5: DATA DISTILLER IDENTITY RESOLUTION

IDR 302: Algorithmic Approaches to B2B Contacts - Unifying and Standardizing Across Sales Orgs

Learn algorithmic techniques for merging, deduplicating, and enriching B2B contact data to create unified, accurate profiles using Data Distiller

Last updated 3 months ago

Download the following file:

Ingest the file by following this tutorial:

Also we will be using:

The case study is about a financial company that is facing significant challenges with contact records across various
systems, particularly multiple instances of Salesforce, Adobe CDP, and Marketo. The challenges arise from data
fragmentation, lack of standardization, duplication, and governance issues, impacting both marketing and sales teams.
Here’s an in-depth look at each challenge and how it affects operations, followed by potential solutions for addressing
them using SQL on a contact list dataset:

1. Fragmented Data Across Multiple Salesforce Instances: Contacts are stored in multiple Salesforce instances.
Some users only have access to a subset of instances, resulting in incomplete data visibility. This limits the ability
of users (especially in sales and customer service) to see the full history or context of a contact, impacting
customer interactions and leading to missed opportunities.

2. Duplicate Contacts: Duplicate contacts exist within and across multiple instances, creating conflicting data
points. Duplicates lead to confusion as different versions of the same contact may contain conflicting details,
such as job titles, roles, and companies, which affect marketing and sales targeting.

3. Data Footprint in Adobe CDP: Only a small subset of users has access to the digital footprint data captured in
Adobe CDP. Sales and customer-facing teams lack critical insights derived from digital interactions, reducing
their ability to personalize engagements effectively.

4. Multiple Email Addresses Per User: Adobe CDP has identified multiple email addresses for some users,
merging them into a single user record, but this data is not synchronized back to sales systems. Inconsistent email
addresses create a fragmented view in sales systems, which may lead to duplicated outreach or incomplete
activity history.

5. Disconnected Marketing Technology Stacks: Multiple marketing stacks (e.g., associated with different
instances) are not integrated, preventing a cohesive enterprise-wide campaign view. Marketing messages may
become redundant or disconnected as customers are targeted by individual product campaigns instead of a unified
brand campaign.

6. Lack of Governance for Contact Creation: Users can create new contacts in core Salesforce, as long as the
email is different, even if the contact exists in another instance. This results in scattered records for the same
contact across systems, making it challenging to maintain an accurate, centralized customer profile.

7. Multiple Roles for Contacts: Contacts may hold multiple roles across organizations, which impacts the type of
messaging they should receive. Inconsistent role-based communication strategies can cause confusion, as the
same contact might be messaged for multiple roles within different campaigns, leading to mixed messaging.

8. Tracking Contact Promotions and Role Changes: There’s no enterprise-wide tracking of contacts’ role
changes or promotions. When a contact transitions to a new role (e.g., promotion or job change), users may
inadvertently lose the contact’s activity history, limiting the personalization potential for future interactions.

9. Difficulty with Attribution in Marketing: Contacts spread across multiple Salesforce instances make it hard to
track and attribute marketing activities accurately. Inaccurate attribution data impacts budget allocation decisions,
making it challenging for marketing teams to understand the effectiveness of their campaigns or optimize
spending.

Dataset Strategy for Supporting Sales Team-Specific Rules and Marketing-Level Cohesion

When working with diverse sales teams, each with its unique business rules and priorities, a robust dataset strategy
must balance the need for individualized algorithms with the overarching goal of enabling marketing to look across all
sales organizations cohesively. This strategy ensures that the data remains harmonized at the schema level but allows
for flexibility in processing and prioritizing information to suit both localized needs and enterprise-wide insights.
Here’s how such a strategy can be designed:

Custom Algorithms for Each Sales Organization

Each sales team operates with specific business rules and requirements. To support this, we implement custom
algorithms for processing the harmonized dataset for each sales organization. These algorithms allow for:

Tailoring how data is aggregated, deduplicated, and prioritized based on the sales team’s operational focus.

Generating attributes unique to that sales organization, such as region-specific metrics or custom scoring models
for leads.

Aligning with local sales strategies while ensuring the outputs conform to a standard schema for cross-
organizational interoperability.

For example, Sales Org A might prioritize customer engagement metrics, while Sales Org B focuses on product
affinity scores. These differences are captured in their respective datasets.

Harmonization with a Single Schema

Although the datasets for each sales organization are processed with different algorithms, the outputs adhere to a
common schema. This standardized schema ensures that attributes across datasets are aligned and comparable. For
instance:

Attributes like email, first_name, and purchase_history remain consistent across all datasets.

Unique business rules are applied at the processing level but do not compromise the schema’s integrity.

This harmonization enables the Profile Store to ingest and manage all datasets seamlessly while retaining each sales
organization’s specific details.

Ingesting Datasets into the Profile Store

The datasets for each sales organization are ingested into the Profile Store, which serves as the central repository for
customer data. Each dataset is preserved as an independent layer within the Profile Store, ensuring that:

Marketing and sales teams can reference the datasets individually or collectively.

Data lineage is maintained, enabling traceability of attributes back to their originating sales organization.

Dynamic Data Selection Using Merge Policies

The Profile Store enables dynamic selection of datasets through merge policies, which define how datasets are
prioritized and combined:

Merge Policies by Sales Organization: Individual sales teams can specify merge policies that prioritize their
dataset when creating audiences or running segmentation.

Cross-Organization Merge Policies: Marketing can define merge policies that aggregate and harmonize datasets
using Data Distiller across sales organizations, providing a unified view of customer data.

For example, one merge policy might prioritize the most recent dataset from a specific sales org, while another
aggregates the highest-value attributes across all sales orgs.

Flexible Segmentation and Personalization Contexts

Merge policies allow for dynamic person audience creation at a granular level. This ensures:

Sales-Specific Views: Sales teams can work within their own datasets, creating audiences and personalization
rules that align with their business priorities.

Marketing-Level Insights: Marketing can look across all sales organizations by selecting merge policies that
harmonize datasets, enabling segmentation and personalization at the enterprise level.

By switching merge policies, teams can seamlessly transition between localized and global perspectives without
altering the underlying datasets.

Profile Snapshots for Merge Policies

To ensure operational flexibility and avoid conflicts between teams, Profile Snapshots are created for each merge
policy. These snapshots capture:

The state of the unified profile dataset under a specific merge policy at a given time.

A static, reproducible dataset for segmentation, analysis, or experimentation without disrupting live datasets.

For example, marketing can generate a snapshot using a cross-organization merge policy for a campaign, while a sales
org uses a snapshot of its own dataset for a regional initiative.

A Unified Yet Flexible Dataset Strategy

This strategy allows each sales organization to operate within its unique business rules while maintaining a
harmonized data foundation. The use of custom algorithms ensures localized relevance, while the standardized schema
and Profile Store enable enterprise-wide cohesion. Merge policies and Profile Snapshots provide the flexibility needed
for segmentation and personalization at both the sales and marketing levels. This approach empowers the organization
to balance tailored sales strategies with holistic marketing insights, ensuring consistent data integrity, adaptability, and
alignment across the business.

Harmonization of data is critical in the above case study because fragmented and inconsistent datasets hinder the
ability to gain a unified view of their customers, leading to inefficiencies, missed opportunities, and poor decision-
making. When data is spread across multiple systems, such as separate Salesforce instances, discrepancies like
duplicate records, conflicting contact details, and incomplete histories arise. This fragmentation not only causes
confusion among teams but also impacts the accuracy of customer insights, making it challenging to deliver
personalized experiences or coordinate cohesive campaigns. By harmonizing data into a single schema, organizations
can resolve inconsistencies, preserve valuable historical context, and ensure that every team—from marketing to sales
and customer service—has access to a complete and accurate profile for each customer. This unified approach enables
better customer engagement, more targeted marketing strategies, and improved operational efficiency, ensuring data
integrity and business relevance at every level.

The diagram illustrates a simple example where fragmented data from multiple Salesforce instances is integrated into a
unified view by harmonizing it into a single schema, referred to as the Profile Person Dataset. Each Salesforce
instance, such as Salesforce Instance 1 and Salesforce Instance 2, operates independently and often contains
overlapping or duplicate records with inconsistent attributes like contact details, roles, or activity history. This
fragmentation creates challenges for teams such as marketing, sales, and customer service, which require complete and
accurate data to deliver personalized customer experiences.

The first step in the process involves exporting data from each Salesforce instance into corresponding datasets, labeled
Dataset 1 and Dataset 2 in the diagram. These datasets reflect the unique structures and attributes of their respective
source systems. However, the extracted datasets are often misaligned in terms of format, schema, and content,
necessitating a transformation process to create a consistent view.

The transformation and harmonization occur in the Process stage, which is powered by Data Distiller’s ETL
(Extract, Transform, Load) capabilities. During this stage, the datasets are extracted, deduplicated, and standardized
into a unified schema. Advanced techniques such as fuzzy matching (e.g., using Levenshtein distance) are employed to
group similar records and resolve duplicates. Business rules are applied to prioritize data, with recent and strategically
significant records given higher weight. This ensures that the most relevant and accurate information is retained for
each contact. The harmonized data is then loaded into a unified dataset.

The final output of this process is the Profile Person Dataset, represented by the large red box in the diagram. This
dataset consolidates all contact records into a single, comprehensive schema, combining historical context, role
changes, and marketing relevance. By harmonizing the data, organizations eliminate silos, ensure data integrity, and
provide a complete 360-degree view of each contact. This unified dataset enables teams to engage in more effective
marketing campaigns, optimize sales strategies, and improve customer service interactions.

The harmonized dataset schema, which has been supplied as the CSV file for our example, looks like the following:

contact_id: Unique identifier for each contact record.

instance_id: Identifies the Salesforce instance (e.g., Salesforce_A, Salesforce_B, Salesforce_C) from which the contact originates, suggesting that the same contact may appear across multiple instances.

first_name and last_name: Contact’s names, useful for identifying potential duplicates along with the email
field.

email: Assumed to be the primary identifier for deduplication, as it is likely unique for each contact in real-world
scenarios.

role: Job role or position of the contact; contacts may have multiple roles across different instances.

company: The company associated with each contact; may vary if the contact changes jobs.

marketing_stack: Represents the marketing technology stack; certain stacks may be prioritized for marketing
purposes.

product: Products associated with the contact, potentially reflecting products they interact with or oversee.

data_source: Source system providing the contact data (e.g., Salesforce, Adobe CDP, Marketo).

last_updated: The last date the contact record was updated, useful for identifying the most recent information.

contact_status: Indicates whether the contact is Active or Inactive.
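
For reference, the same schema can be written out as a generic SQL table definition. This is a sketch only: the column names come from the list above, but the data types are assumptions, and in practice the CSV is simply ingested as the contact_list dataset rather than created through DDL.

-- Sketch of the harmonized contact schema (types are assumed, not prescribed)
CREATE TABLE contact_list (
    contact_id      STRING,
    instance_id     STRING,
    first_name      STRING,
    last_name       STRING,
    email           STRING,
    role            STRING,
    company         STRING,
    marketing_stack STRING,
    product         STRING,
    data_source     STRING,
    last_updated    DATE,
    contact_status  STRING
);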

The data looks like the following after harmonization:

Assumptions for Deduplication and Prioritization

Email as Primary Identifier: email will be treated as the primary key for deduplication. If multiple records
share the same email, they represent the same contact, albeit with potential differences in roles, stacks, or
companies.

Recent Updates Preferred: The most recent information is more likely to be accurate, so records with a more
recent last_updated date will be prioritized.

Stack Priority:

Suppose we assign weights to stacks: Stack_A > Stack_B > Stack_C > Stack_D. This weighting
could reflect strategic priorities in marketing or sales, such as favoring contacts associated with certain
technologies or markets.

Role Relevance: The role field is important for targeting, especially if the contact holds a decision-making
role. When a contact has multiple roles across records, we can aggregate these roles in a historical sequence.

Active vs. Inactive Status: Records marked as Active are prioritized for deduplication and analysis, while
Inactive contacts can be retained for reference or historical purposes but not for active engagement.

Let us first execute a basic exploratory query:

SELECT * FROM contact_list;

The above shows some patterns as we would expect:

Potential Duplicates: Since contacts can exist across multiple instances (instance_id) and may have similar
first_name, last_name, or email, there is a possibility of duplicate entries that need to be merged.

Multiple Roles and History: Contacts may have different roles or companies over time, which necessitates
tracking role and company history for a complete profile.

Prioritization Opportunity: With the marketing_stack, last_updated, and contact_status fields, prioritization can be applied to keep the most relevant and recent records for each contact.

Identify and Remove Duplicates Based on Prioritization and Ranking with Email as Deduplication Dimension

The prioritization and ranking process for contact deduplication leverages a composite scoring method to ensure that
the most relevant and recent information is retained for each unique email. This method assigns a priority_score
to each record by combining two key factors: stack weight and recency. Marketing stacks are assigned weights based
on importance, with higher values for preferred stacks (e.g., **Stack_A** > **Stack_B** > others), indicating
their strategic relevance. Additionally, records are prioritized by **last_updated** date, where more recent
records receive a higher score. By calculating a **priority_score** as a function of these weights, each
contact’s records are ranked within their email group. The record with the highest score is selected as the primary entry
for each email, thus retaining the most critical and timely data while removing redundant or lower-priority entries.
This process ensures that data integrity and business relevance are maximized across the contact list.

-- Prioritize and deduplicate contacts based on stack weight and recency


WITH scored_contacts AS (
SELECT *,
-- Assign weights based on marketing_stack
CASE
WHEN marketing_stack = 'Stack_A' THEN 4
WHEN marketing_stack = 'Stack_B' THEN 3
WHEN marketing_stack = 'Stack_C' THEN 2
WHEN marketing_stack = 'Stack_D' THEN 1
ELSE 0 -- Default weight for unknown stacks
END AS stack_weight,

-- Calculate recency weight (inverted so that more recent dates are higher)
DATEDIFF(CURRENT_DATE, last_updated) AS recency_weight,

-- Calculate the priority score: combine stack weight and recency
(CASE
WHEN marketing_stack = 'Stack_A' THEN 4
WHEN marketing_stack = 'Stack_B' THEN 3
WHEN marketing_stack = 'Stack_C' THEN 2
WHEN marketing_stack = 'Stack_D' THEN 1
ELSE 0
END * 10) - DATEDIFF(CURRENT_DATE, last_updated) AS priority_score

FROM contact_list
),
ranked_contacts AS (
-- Rank records by priority score within each email group
SELECT *,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY priority_score DESC)
AS rank
FROM scored_contacts
)
-- Select the top-ranked record for each email
SELECT *
FROM ranked_contacts
WHERE rank = 1;

Contacts may have multiple roles in different companies or contexts. We can aggregate the roles associated with each
contact’s email and then apply the deduplication algorithm from the previous section.

-- Step 1: Aggregate roles for each email address


WITH role_aggregated_contacts AS (
SELECT email,
ARRAY_AGG(role) AS role_history -- Aggregate roles in an array for each email
FROM contact_list
GROUP BY email
),

-- Step 2: Join aggregated roles back to original contacts and calculate priority score
scored_contacts AS (
SELECT c.*,
ra.role_history,
-- Assign weights based on marketing_stack
CASE
WHEN c.marketing_stack = 'Stack_A' THEN 4
WHEN c.marketing_stack = 'Stack_B' THEN 3
WHEN c.marketing_stack = 'Stack_C' THEN 2
WHEN c.marketing_stack = 'Stack_D' THEN 1
ELSE 0 -- Default weight for unknown stacks
END AS stack_weight,

-- Calculate recency weight (inverted so that more recent dates are higher)
DATEDIFF(CURRENT_DATE, c.last_updated) AS recency_weight,

-- Calculate the priority score: combine stack weight and recency
(CASE
WHEN c.marketing_stack = 'Stack_A' THEN 4
WHEN c.marketing_stack = 'Stack_B' THEN 3
WHEN c.marketing_stack = 'Stack_C' THEN 2
WHEN c.marketing_stack = 'Stack_D' THEN 1
ELSE 0
END * 10) - DATEDIFF(CURRENT_DATE, c.last_updated) AS priority_score
FROM contact_list AS c
JOIN role_aggregated_contacts AS ra
ON c.email = ra.email
),

-- Step 3: Rank and select top-ranked record for each email


ranked_contacts AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY priority_score DESC)
AS rank
FROM scored_contacts
)

-- Select the top-ranked record for each email


SELECT *
FROM ranked_contacts
WHERE rank = 1;

The **ARRAY_AGG** function consolidates all roles associated with each unique email into a single array
(**role_history**) before deduplication, ensuring that no role information is lost. This aggregated array
captures a comprehensive role history, allowing us to keep all roles linked to a contact, even if they appeared in
different records. By using **ARRAY_AGG** early in the process, we retain the full context of each contact’s various
roles in one place, enabling deeper insights and use cases like historical tracking or personalization, while streamlining
the dataset to one deduplicated record per email with all relevant roles intact.

Prioritize Active Contacts

To prioritize active contact records while retaining role history and deprioritizing inactive records, we can adjust the
algorithm by modifying the **priority_score** calculation to give preference to active records. This ensures
that active records are ranked higher, while inactive ones are only retained if there are no active records available for
the same email.

-- Step 1: Aggregate roles for each email address


WITH role_aggregated_contacts AS (
SELECT email,
ARRAY_AGG(role) AS role_history -- Aggregate roles in an array for each email
FROM contact_list
GROUP BY email
),

-- Step 2: Join aggregated roles back to original contacts and calculate priority score with active status prioritized
scored_contacts AS (
SELECT c.*,
ra.role_history,
-- Assign weights based on marketing_stack
CASE
WHEN c.marketing_stack = 'Stack_A' THEN 4
WHEN c.marketing_stack = 'Stack_B' THEN 3
WHEN c.marketing_stack = 'Stack_C' THEN 2
WHEN c.marketing_stack = 'Stack_D' THEN 1
ELSE 0 -- Default weight for unknown stacks
END AS stack_weight,

-- Calculate recency weight (inverted so that more recent dates are higher)
DATEDIFF(CURRENT_DATE, c.last_updated) AS recency_weight,

-- Calculate the priority score with additional weight for active status
((CASE
WHEN c.marketing_stack = 'Stack_A' THEN 4
WHEN c.marketing_stack = 'Stack_B' THEN 3
WHEN c.marketing_stack = 'Stack_C' THEN 2
WHEN c.marketing_stack = 'Stack_D' THEN 1
ELSE 0
END * 10) - DATEDIFF(CURRENT_DATE, c.last_updated)) +
CASE
WHEN c.contact_status = 'Active' THEN 100 -- Add a high weight for active contacts
ELSE 0
END AS priority_score
FROM contact_list AS c
JOIN role_aggregated_contacts AS ra
ON c.email = ra.email
),

-- Step 3: Rank and select top-ranked record for each email


ranked_contacts AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY priority_score DESC)
AS rank
FROM scored_contacts
)

-- Select the top-ranked record for each email


SELECT *
FROM ranked_contacts
WHERE rank = 1;

Handle Contact Promotion and Track History


To track contact promotions and role changes before deduplication, we can create a history array that captures both the
role and the timestamp (**last_updated**) for each change. This allows us to maintain a historical view of each
contact’s roles and companies over time. We’ll use **ARRAY_AGG** to create this array with a combination of
**role**, **company**, and last_updated for each unique email.

-- Identify contact promotions with role history in chronological order


WITH ordered_contacts AS (
SELECT email, first_name, last_name, company, role, last_updated
FROM contact_list
ORDER BY email, last_updated ASC
)
SELECT email, first_name, last_name,
company,
MIN(last_updated) AS first_seen,
MAX(last_updated) AS last_seen,
ARRAY_AGG(role) AS role_history -- Collect roles in already sorted order
FROM ordered_contacts
GROUP BY email, first_name, last_name, company;

Fuzzy Match on email, First Name and Last Name

There is a detailed tutorial on this method here.

The query below first performs fuzzy matching using the Levenshtein distance on **first_name**,
**last_name**, and **email** to group similar contact records. It then aggregates **role** and
**company** history, preserving a timeline of roles and company affiliations. Priority is assigned based on active
status and marketing stack, with the most recent and strategically significant stacks weighted higher. The query
captures the most recent **first_name** and **last_name** by joining back on the latest
**last_updated** date, ensuring that the deduplicated contact retains up-to-date personal information. Finally, it
ranks each grouped entry by priority score and selects only the top-ranked record for each unique email, providing a
single, comprehensive, and prioritized profile for each contact. This solution consolidates contact records efficiently
while retaining valuable historical context and ensuring data integrity for accurate customer engagement.

-- Step 1: Identify similar records based on fuzzy matching for first_name, last_name, and email
WITH fuzzy_matched_contacts AS (
SELECT a.contact_id AS id_a, b.contact_id AS id_b,
a.email AS email_a, b.email AS email_b,
a.first_name, a.last_name,
-- Calculate Levenshtein distances for fuzzy matching
LEVENSHTEIN(a.first_name, b.first_name) AS first_name_distance,
LEVENSHTEIN(a.last_name, b.last_name) AS last_name_distance,
LEVENSHTEIN(a.email, b.email) AS email_distance
FROM contact_list a
JOIN contact_list b
ON a.contact_id < b.contact_id -- Avoid self-joins and duplicates
WHERE LEVENSHTEIN(a.first_name, b.first_name) < 3 -- Example threshold
AND LEVENSHTEIN(a.last_name, b.last_name) < 3
AND LEVENSHTEIN(a.email, b.email) < 3
),

-- Step 2: Aggregate roles, company history, and calculate priority score


aggregated_contacts AS (
SELECT a.email AS email,
ARRAY_AGG(DISTINCT a.role ORDER BY a.last_updated) AS role_history,
-- Aggregate unique roles over time
ARRAY_AGG(STRUCT(a.company, a.last_updated) ORDER BY a.last_updated)
AS company_history, -- Track company changes over time

-- Prioritize active status, assign stack weight, and calculate recency
MAX(CASE WHEN a.contact_status = 'Active' THEN 1 ELSE 0 END) AS
is_active,
MAX(
CASE
WHEN a.marketing_stack = 'Stack_A' THEN 4
WHEN a.marketing_stack = 'Stack_B' THEN 3
WHEN a.marketing_stack = 'Stack_C' THEN 2
WHEN a.marketing_stack = 'Stack_D' THEN 1
ELSE 0
END * 10 - DATEDIFF(CURRENT_DATE, a.last_updated)
) AS priority_score,

-- Use the latest `last_updated` to identify the most recent names


MAX(a.last_updated) AS latest_last_updated
FROM contact_list a
JOIN fuzzy_matched_contacts fmc
ON a.contact_id = fmc.id_a OR a.contact_id = fmc.id_b
GROUP BY a.email
),

-- Step 3: Join the original contact list again to get the most recent names based on `latest_last_updated`
final_contacts AS (
SELECT ac.email,
ac.role_history,
ac.company_history,
ac.is_active,
ac.priority_score,
c.first_name AS final_first_name,
c.last_name AS final_last_name
FROM aggregated_contacts ac
JOIN contact_list c ON ac.email = c.email AND ac.latest_last_updated =
c.last_updated
),

-- Step 4: Select the highest-priority merged record for each email group
ranked_merged_contacts AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY priority_score DESC)
AS rank
FROM final_contacts
)

-- Final selection of deduplicated, merged records


SELECT
email,
final_first_name AS first_name,
final_last_name AS last_name,
role_history,
company_history,
is_active,
priority_score
FROM ranked_merged_contacts
WHERE rank = 1;

Data Harmonization in Data Distiller

Excel view of the CSV file above is the harmonized dataset.

The results of the exploratory query.

Deduplication based on prioritization.

Active records are prioritized.

Promotion and company history

Merged contacts with fuzzy match.

https://data-distiller.all-stuff-data.com/unit-6-data-distiller-audiences/dda-100-audiences-overview * * *

1. Unit 6: DATA DISTILLER AUDIENCES

DDA 100: Audiences Overview


Segmentation matters because it enables businesses to understand and cater to the diverse needs and preferences of
their customer base, leading to more effective marketing and product strategies.

Segmentation is a marketing and data analysis technique that involves dividing a larger target audience or customer
base into smaller, more homogenous groups or segments based on specific criteria. The goal of segmentation is to
better understand and cater to the distinct needs, preferences, and behaviors of each segment. Here’s a concise
summary:

1. Audience Division: Segmentation involves dividing a broader audience into smaller, more manageable groups or
segments. These segments can be based on various factors, including demographics, psychographics, behavior, or
geographic location.

2. Personalization: By identifying and understanding the unique characteristics of each segment, businesses can
tailor their marketing strategies, products, and services to better meet the specific needs and interests of each
group.

3. Improved Targeting: Segmentation helps companies target their marketing efforts more effectively. Instead of
using a one-size-fits-all approach, they can focus resources on the segments most likely to respond positively to
their offerings.

4. Enhanced Customer Experience: Segmentation contributes to a more personalized and relevant customer
experience. It allows businesses to deliver content, promotions, and messages that resonate with each segment,
increasing customer satisfaction.

5. Data-Driven Decision-Making: Segmentation relies on data analysis to identify meaningful patterns and
groupings. This data-driven approach helps businesses make informed decisions, allocate resources efficiently,
and measure the effectiveness of their marketing campaigns.

6. Market Expansion: Segmentation can also uncover opportunities in underserved or overlooked segments of the
market. By targeting these segments, companies can potentially expand their customer base.

Last updated 6 months ago

https://data-distiller.all-stuff-data.com/unit-6-data-distiller-audiences/dda-300-audience-overlaps-with-data-distiller * *
*

1. Unit 6: DATA DISTILLER AUDIENCES

DDA 300: Audience Overlaps with Data Distiller


Learn how to leverage snapshot of profile attributes, identities and segment memberships to build exotic queries such
as 3 or 4 segment overlaps

Last updated 6 months ago

Every day, a snapshot of the Profile attributes for every merge policy is exported to the data lake. These system
datasets are hidden by default but are accessible by toggling these datasets in the data catalog.

These datasets contain information about the profile attributes, the identity map, and the segment membership as
reflected at the time of creating the snapshot. The examples below show how you can explore this dataset to
understand identity composition and even create exotic segment overlaps.

To access the Profile Snapshot datasets, navigate to Datasets and click on the filter to show system datasets.
Turning this filter on will show system datasets along with others. You need to search for “profile attribute” to filter
down the list. Profile attribute snapshot datasets are exported for every merge policy in the system. In the example
below, there are 3 merge policies and hence there are 3 datasets.

These datasets will typically look like the following unless their name has been changed:
profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903

Even if the name has been changed, you should be able to query the dataset:

SELECT * FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;

The columns you will see in the result will look like the following. If you see identityMap and segmentMembership
fields, then you are on the right track.

There are some other interesting datasets that are generated on a daily basis by the system. Explore some of these on
your own.

select * from br_segment_destination;

This will give you the list of destinations that have been mapped to a segment. Note that each row contains one such
mapping and that a single destination can have multiple segments and vice versa. Also, note that segmentnamespace
refers to the source of the segment i.e. whether it was created in AEP or elsewhere.

You may also see a dataset that contains the segment information. Search for a dataset that has segment definition in
the schema type and you should be able to locate this dataset.

select * from profile_segment_definition_6591ba8f_1422_499d_822a_543b2f7613a3

The results of the query look like the following:


And finally the destinations ID to account name mapping is available in this system dataset:

select * from dim_destination;

Let us explore another dataset that gives us information about what identities are available for mapping to a
destination.

select * from br_namespace_destination


order by destinationID

The result is:

The result shows that for a given target identity (namespace) available in the destination, there are multiple source
identity options, many of which are not used (isMapped = false). These identities were sourced by the profiles that
were part of the audience which in turn was powered by the identity graph.

Tip: If a source identity that is mapped to a destination field is missing, that identity is dropped and not sent. Other identities of the same profile that do have values will still be sent to the destination. Activation to a destination is essentially a process of identity disaggregation.

Warning: The Profile attribute export dataset does not contain profiles that have more than 50 identities. In order to
avoid losing these profiles, you need to contact Adobe to configure the identity graph so that it retains the most recent
identities belonging to a cookie namespace.

Just for fun, let us see what fields are being used in our destinations in my test environment:

select * from br_namespace_destination


where isMapped='true'
order by destinationID

The results are:

Find Merge Policies in the Snapshot

The following tables are created from the Profile Attribute Snapshot in the reporting star schema for the Real-Time Customer Profile. We will write queries against these tables to get the answers we need.

SELECT * FROM adwh_dim_merge_policies;

Note that there is no table-name-to-merge-policy mapping in this star schema. You need to run the first query to determine the merge policy and identify the dataset_id. Use that to search for the dataset name in Universal Search at the top, not in the local dataset search.

You can resolve the dataset_id to the dataset name by typing it into the dataset search and browsing down the list of results:

Another approach would be to copy and paste that into the search bar at the top in the AEP UI.

The dataset name will reveal itself. Click on the dataset to get the table name which you will need as the table name
for all your queries.

You should be able to start exploring the profile snapshot datasets immediately.

SELECT * FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;


SELECT person.name.lastName FROM
profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903

Retrieve Segment Information

SELECT * FROM adwh_dim_segments;

Count Profiles by Merge Policy

Merge policy is an “MDM-like” feature that lets you prioritize attribute records whenever there is a conflict i.e. either
use one dataset over the other or use the timestamp to resolve that conflict. Mostly, you will see that timestamp
precedence is used as the source of truth is typically a CRM system that is evolving over time, and new attribute
records with updates get added to the Real-Time Customer Profile. The new records need to take precedence.

SELECT COUNT(identityMap) FROM


profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;

If you navigate to the Profiles->Overview page, for the same merge policy, you will see the same count. Note that we
had chosen Default Time-based merge policy in this example in our environment.

Use EXPLODE to Separate Identities in Separate Rows

SELECT Segmentmembership, explode(identitymap) from


profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;

This splitting into rows destroys the identity map association.

Before you extract the identity information, you need to understand the map structure. Use the to_json feature in Data
Distiller to get in-place information about the schema structure without having to go to the Schemas UI screen. This is
an extremely important feature with deeply nested data structures and maps.

SELECT to_json(identityMap) FROM


profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;

If you execute this query, you will see the following:

What you see above is the identity map structure. The map has the identity namespace (email, crmid) as the unique key
to index the values. Note that this is an array of values. If you need to extract the identities, you will need to use the
following code:

SELECT identityMap.email.id FROM


profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;

The results look like the following:

Typically, we would explode this array:

SELECT explode(identityMap.email.id) FROM


profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;

But there is a problem - plain vanilla “explode” will get rid of rows that do not have any identity values. That is a
problem because the absence of an identity is a signal and is possibly associated with other identities that the person
may have. Let us fix the issue by using explode_outer from Data Distiller:
SELECT explode_outer(identityMap.email.id) AS email FROM
profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903

But just doing an explode_outer on a single identity namespace is of little use by itself: we have destroyed the association between the exploded identities and the profile they belong to. Let us generate a synthetic UUID for each profile row to keep track of our identities as we explode them. If we do this right, we will end up with a columnar table of UUIDs and related identities that we can reuse for a variety of operations.

If we take the first element of the email namespace and concatenate that with the first element of the crmid namespace,
we are guaranteed a unique identifier. If not, it would mean that two rows in the profile export dataset have that
identity in common. The identity stitching in Real-Time Customer Profile should have taken care of it and merged it
into a single row.

Let us now generate the unique UUIDs along with the schema-friendly identity map structure

SELECT concat(identityMap.email.id[0],identityMap.crmid.id[0]) AS
unique_id,to_json(identityMap) from
profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903

The results will look like:

But the concat string function does a poor job with NULL values: if any argument is NULL, the result is NULL. We can remedy that with COALESCE:

SELECT COALESCE(identityMap.email.id[0],identityMap.crmid.id[0]) as
unique_id,to_json(identityMap) from
profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903

This works and you will get:

If there are two identity namespaces, then the explode_outer operator works on each, one at a time. Make sure you
remove the to_json as we do not need it anymore:

SELECT unique_id, explode_outer(identityMap.email.id) AS email,


identityMap.crmid.id AS crmid FROM
(SELECT coalesce(identityMap.email.id[0],identityMap.crmid.id[0]) AS
unique_id,identityMap from
profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903)

You will get:

We need to explode this one more time for crmid and we should get:

SELECT unique_id, email, explode_outer(crmid) as crmid FROM (


SELECT unique_id, explode_outer(identityMap.email.id) AS email,
identityMap.crmid.id AS crmid FROM
(SELECT coalesce(identityMap.email.id[0],identityMap.crmid.id[0]) AS
unique_id,identityMap from
profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903)
)

The results would be:

Using this table, I can do some very interesting analysis. For example, we can look at the histogram of email IDs in the
system:
SELECT bucket, count(unique_id) FROM(
SELECT unique_id, count(CASE WHEN email IS NULL THEN 0 ELSE email END) as
bucket FROM(
SELECT unique_id, email, explode_outer(crmid) as crmid FROM (
SELECT unique_id, explode_outer(identityMap.email.id) AS email,
identityMap.crmid.id AS crmid FROM
(SELECT coalesce(identityMap.email.id[0],identityMap.crmid.id[0]) AS
unique_id,identityMap from
profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903)
))
GROUP BY unique_id
ORDER BY bucket ASC)
GROUP BY bucket

The answer looks like this:

Why are we seeing 976 as a bucket and 0 counts for unique_ids? The sum of 8225 and 976 adds up to the number of
profiles. Why did we even get this result in the first place? Is there something wrong with the way the SQL is written
or is there something more profound happening in the identity graph?

Still cannot figure it out? What if the email and crmid identities were non-existent? You have null rows in the table that you should clean before you do any analytics. For the identity lookup use case, which is the goal of this section, it would not matter.
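
One way to confirm this explanation is to count the rows where neither the email nor the crmid namespace carries an identity. This is a sketch that simply reuses the snapshot table and the array-element access pattern from above:

SELECT COUNT(*) AS profiles_without_identities
FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903
WHERE identityMap.email.id[0] IS NULL
AND identityMap.crmid.id[0] IS NULL;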

Extract Identities Without Breaking Segment Membership Association

In our example, we will simplify the use case since we do not have multiple identities. We will just use the first element to create a simple map.

SELECT identitymap.email.id[0] AS email, identitymap.crmid.id[0] AS crmid from


profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;

SELECT Segmentmembership, identitymap.email.id[0] AS email,


identitymap.crmid.id[0] AS crmid
from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;

Expand Segment Membership

SELECT explode(Segmentmembership), email, crmid


FROM
(SELECT Segmentmembership, identitymap.email.id[0] AS email,
identitymap.crmid.id[0] AS crmid
FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903)sq

Create Segment Membership, Email, and CRMID Triples

SELECT key AS segment_id, email, crmid FROM (


SELECT explode(value), email, crmid FROM (
SELECT explode(Segmentmembership), email, crmid
FROM
(SELECT Segmentmembership, identitymap.email.id[0] AS email,
identitymap.crmid.id[0] AS crmid
from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903)
)
);

Resolve Segment ID to Segment Name

The segment_id needs to be resolved to a segment name. Note that I have to name the subqueries in order to reference
them in the INNER JOIN clause. I am also creating a unique UID by concatenating the ID values.

SELECT segment_name, email, crmid, concat(email, crmid) AS UID FROM


(
SELECT key AS segment_id, email, crmid FROM (
SELECT explode(value), email, crmid FROM (
SELECT explode(Segmentmembership), email, crmid
FROM
(SELECT Segmentmembership, identitymap.email.id[0] AS email,
identitymap.crmid.id[0] AS crmid
from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903)
)
) ) AS A
INNER JOIN adwh_dim_segments AS B
ON A.segment_id=B.segment;

Write Query Result to Data Lake

Materialize this table to the data lake. Use the DROP TABLE clause to wipe out any existing table first.

DROP TABLE IF EXISTS segment_data;


CREATE TABLE segment_data AS (SELECT segment_name, email, crmid, concat(email,
crmid) AS UID FROM
(
SELECT key AS segment_id, email, crmid FROM (
SELECT explode(value), email, crmid FROM (
SELECT explode(Segmentmembership), email, crmid
FROM
(SELECT Segmentmembership, identitymap.email.id[0] AS email,
identitymap.crmid.id[0] AS crmid
from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903)
)
) ) AS A
INNER JOIN adwh_dim_segments AS B
ON A.segment_id=B.segment);

Explore what is contained in the dataset. We should see 13K rows:

SELECT * FROM segment_data;

Identify records that have NULL crmid values

SELECT * FROM segment_data


WHERE crmid IS NULL;

Audience Size Calculation


Here we will count the number of people per segment and compare it with what we see in the UI. Some segment counts come out as zero, which does not agree with the UI. What do you think happened?

SELECT segment_name, count(UID)


FROM segment_data
GROUP BY segment_name

Do not materialize the dataset yet. Instead, let us fix this with COALESCE:

SELECT segment_name, count(DISTINCT UID) FROM (SELECT segment_name, email,


crmid, concat(COALESCE(email, ''), COALESCE(crmid, ''))AS UID FROM
(
SELECT key AS segment_id, email, crmid FROM (
SELECT explode(value), email, crmid FROM (
SELECT explode(Segmentmembership), email, crmid
FROM
(SELECT Segmentmembership, identitymap.email.id[0] AS email,
identitymap.crmid.id[0] AS crmid
from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903)
)
) ) AS A
INNER JOIN adwh_dim_segments AS B
ON A.segment_id=B.segment)
GROUP BY segment_name;

We still do not materialize because we need to check the count. It should be 8,225 in our case.

SELECT count(DISTINCT UID) FROM (SELECT segment_name, email, crmid,


concat(COALESCE(email, ''), COALESCE(crmid, '')) AS UID FROM
(
SELECT key AS segment_id, email, crmid FROM (
SELECT explode(value), email, crmid FROM (
SELECT explode(Segmentmembership), email, crmid
FROM
(SELECT Segmentmembership, identitymap.email.id[0] AS email,
identitymap.crmid.id[0] AS crmid
from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903)
)
) ) AS A
INNER JOIN adwh_dim_segments AS B
ON A.segment_id=B.segment);

The above query result does not agree with what we see below. Where is the difference?

Exploding segmentMembership eliminated unsegmented profiles.

SELECT COUNT(identityMap) FROM


profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;

SELECT COUNT(identityMap) FROM
profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903
WHERE Segmentmembership IS NULL;

If you add 6,500 (profiles attached to at least one segment) + 1,725 (profiles not attached to any segment), you get 8,225, which is our count.
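
The two counts can also be cross-checked in a single query. This sketch simply combines the COUNT expressions already used in this section:

SELECT COUNT(identityMap) AS total_profiles,
COUNT(Segmentmembership) AS profiles_in_at_least_one_segment,
SUM(CASE WHEN Segmentmembership IS NULL THEN 1 ELSE 0 END) AS profiles_in_no_segment
FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;
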
Generate Segment Name to Identities Table

DROP TABLE IF EXISTS segment_data;


CREATE TABLE segment_data AS (
SELECT segment_name, email, crmid, concat(COALESCE(email, ''), COALESCE(crmid,
'')) AS UID FROM
(
SELECT key AS segment_id, email, crmid FROM (
SELECT explode(value), email, crmid FROM (
SELECT explode(Segmentmembership), email, crmid
FROM
(SELECT Segmentmembership, identitymap.email.id[0] AS email,
identitymap.crmid.id[0] AS crmid
from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903)
)
) ) AS A
INNER JOIN adwh_dim_segments AS B
ON A.segment_id=B.segment);

Let us create some analytical queries:

SELECT segment_name, count(DISTINCT UID), COUNT(DISTINCT email) AS count_email,


COUNT(DISTINCT crmid) AS count_crmid
FROM segment_data
GROUP BY segment_name;

Double-check:

SELECT * FROM segment_data;

SELECT COUNT(DISTINCT UID) FROM segment_data;

Generate Emails Associated with a Segment Name

SELECT segment_name, array_agg(email) as email_list, array_agg(crmid) as


crm_list
FROM segment_data
GROUP BY segment_name;

Generate Segments Associated with an Email Address

SELECT email, array_agg(segment_name)


FROM segment_data
GROUP BY email

SELECT * FROM (SELECT UID, array_agg(segment_name) AS seg_array FROM


segment_data GROUP BY UID) WHERE array_contains(seg_array, 'United States') AND
array_contains(seg_array, 'Saurabh New') AND array_contains(seg_array, 'Winter
wear')

SELECT * FROM (SELECT UID, array_agg(segment_name) AS seg_array FROM


segment_data GROUP BY UID)
WHERE array_contains(seg_array, 'United States') AND array_contains(seg_array,
'Saurabh New') AND array_contains(seg_array, 'Winter wear')
AND array_contains(seg_array, 'Male Gender Test Segment')

Even before we typed this query, we should have been able to predict the kind of answer we would get: the smallest audience gives you the upper limit on the overlap. Can you figure out which of these segments that is?

However, if we execute the query, we see the not-so-surprising result.

Scrolling to the right, you will see the identityMap field and then the segmentMembership field.

Destination to segment mapping

Segment ID to segment mapping raw table.

Destination ID to Destination Account Name mapping

Source identity namespaces that could map to the destination identity namespace.

Destination identity fields being used in my environment.

Copy the dataset_id from the query result

Dataset name is available for the dataset ID typed into Universal Search.

The Profiles Overview page has a merge policy filter that gives you the same count.

Exploding the array helps you extract the identities.

explode_outer retains the nulls for the identity namespaces

Generate the unique ID without losing identity associations.

The COALESCE function retrieves the first non-NULL value in a list and is a good choice for a unique ID for the profile in our example.

email identities have been separated into separate rows without breaking profile association.

Identity lookup table in relational form

https://data-distiller.all-stuff-data.com/unit-7-data-distiller-business-intelligence/bi-100-data-distiller-business-
intelligence-a-complete-feature-overview * * *

1. Unit 7: DATA DISTILLER BUSINESS INTELLIGENCE

BI 100: Data Distiller Business Intelligence: A Complete Feature Overview


Unlock insights with Data Distiller dashboards featuring advanced queries, customizable filters, drillthroughs, built-in
SQL, and accelerated querying, all integrated seamlessly with BI tools.

Last updated 3 months ago

Data Distiller enables you to create powerful, enterprise-grade dashboards that rival the best-in-class dashboards found in traditional business intelligence (BI) tools. You don’t need to rely on expensive BI vendors or invest extra resources for users to build and consume insights effectively. By leveraging SQL as the foundation for creating dashboards, you can unlock the full potential of your data without the need for additional software or tools.

Data Distiller dashboards are designed with two key audiences in mind: the data team and the business users. The data team will focus on building foundational charts, ensuring that they include the necessary filters and dimensions to provide flexibility. This allows business users to perform deep, drill-down analyses and gain actionable insights directly from the dashboards. This approach streamlines the workflow, enabling teams to quickly create interactive and insightful dashboards without having to leave the Data Distiller environment.

Data Distiller Dashboards UI and backend have undergone a major revamp.

Data Distiller Accelerated Store

Instead of writing queries directly against the AEP Data Lake, you can use the accelerated store, which features a high-
performance engine designed for faster dashboard queries. This engine significantly enhances the speed and efficiency
of data retrieval, ensuring that dashboards load quickly without compromising data accuracy. Best of all, this
capability is included as part of the Data Distiller license, making it a seamless, cost-effective solution for boosting
query performance in your dashboards.

CRUD Support in Accelerated Store

**MERGE INTO** and **UPDATE/DELETE** syntax enables you to update or delete records between your source table and the target table in the Accelerated Store, allowing you to perform low-volume, surgical deletes.
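
As an illustration of the syntax only (the table and column names below are hypothetical, not from this guide), a MERGE INTO statement against an Accelerated Store table might look like this:

MERGE INTO accelerated_sales_summary AS target
USING staged_sales_updates AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN
UPDATE SET target.order_status = source.order_status
WHEN NOT MATCHED THEN
INSERT (order_id, order_status) VALUES (source.order_id, source.order_status);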

REST API Support in Accelerated Store

You can leverage REST APIs to post queries from any application and retrieve the results as JSON. This allows for
seamless integration of Data Distiller’s querying capabilities into your own applications, providing flexible and
automated access to data insights for reporting, analysis, or further processing.

Data Distiller Data Models

A Data Distiller Data Model, much like a reporting star schema, organizes customer and campaign data for
efficient insights. The fact table can store key metrics like campaign impressions, clicks, or purchases, while
dimension tables provide additional context, such as customer demographics, product categories, or marketing
channels. This structure enables marketers to quickly analyze performance across various dimensions, track campaign
effectiveness, and identify trends, offering a clear view of customer behavior and campaign ROI. The star schema of
the data model enhances query performance, making it ideal for marketing dashboards and ad hoc reporting.
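
As a conceptual sketch of such a model (the table and column names are illustrative, not a prescribed Data Distiller schema), the fact and dimension tables described above could look like this:

-- Fact table: one row per campaign interaction rollup
CREATE TABLE fact_campaign_performance (
    campaign_key  BIGINT,
    customer_key  BIGINT,
    channel_key   BIGINT,
    event_date    DATE,
    impressions   BIGINT,
    clicks        BIGINT,
    purchases     BIGINT
);

-- Dimension tables: descriptive context for the facts
CREATE TABLE dim_customer (customer_key BIGINT, region STRING, age_band STRING, loyalty_tier STRING);
CREATE TABLE dim_channel  (channel_key  BIGINT, channel_name STRING);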

A normalized relational data model is optimized for representing data in a compact and efficient way by minimizing
redundancy and organizing information across multiple related tables. Each piece of data is stored only once, reducing
storage needs and avoiding duplication. This design allows for updates, like changes in a dimension (e.g., customer
data), to be made in one place without reprocessing the entire dataset. As a result, updates are more efficient, saving
processing time and ensuring consistency across related data in different tables.

Data Distiller Data Views

The Data Distiller Data Views capability addresses the usability challenges of normalized data models by enabling data engineers to
create flat views tailored for marketing users. These flat views present the underlying data in a simplified format,
making it easy for users to build dashboards or perform ad hoc analysis without dealing with the complexity of the
normalized model. This approach ensures that marketing teams can work with familiar, user-friendly data structures,
streamlining their analysis while maintaining the benefits of the underlying model’s efficiency and flexibility.
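
Conceptually, a flat data view plays the same role as a SQL view over the model. The sketch below uses the illustrative star-schema tables from the previous section and is not the exact Data Distiller Data Views syntax:

CREATE VIEW campaign_performance_flat AS
SELECT d.channel_name,
       c.region,
       f.event_date,
       f.impressions,
       f.clicks,
       f.purchases
FROM fact_campaign_performance f
JOIN dim_customer c ON f.customer_key = c.customer_key
JOIN dim_channel  d ON f.channel_key  = d.channel_key;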

A Data Distiller Dashboard typically includes several key visualizations: Big Numbers (KPIs) to highlight critical metrics like total web traffic, Line Charts to track trends over time (e.g., traffic or sales), Bar Charts to compare categories (like products or regions), Donut Charts to show proportions of a total, and Tables to present raw, detailed data for deeper analysis. These elements together provide a comprehensive view of performance and insights, enabling quick decision-making and further investigation.

You can easily download the entire dashboard as a single-page PDF, making it convenient to share with stakeholders.
This feature is especially useful for presenting key insights in a visually organized format, allowing stakeholders to
review and understand the data without needing access to the dashboard itself. It ensures consistency in reporting and
facilitates clear communication, whether for meetings, presentations, or email sharing.

In a Data Distiller Dashboard, each chart can be expanded to view the underlying data, displayed in a table format
with pagination, making it easy to explore like an Excel file. You can also download the data as a CSV for further
analysis. The ViewSQL feature reveals the SQL query used to generate the chart, allowing users to reuse the query in
other charts or add it to their private LLM for advanced personalization and modeling. This flexibility enhances both
data exploration and customization.

Data Distiller Drillthroughs

Data Distiller Drillthroughs are a feature that allows users to explore data more deeply by clicking on a chart within
a dashboard and being directed to another detailed report or dashboard. The purpose of a drillthrough is to provide
context and further insights without overwhelming the primary dashboard with excessive details.

For example, if a user clicks on a marketing leads figure in a regional dashboard, a drillthrough might show detailed
transaction records or performance metrics specific to that region, helping users investigate trends or anomalies
efficiently.

One of the main challenges with using drag-and-drop interfaces to create charts is the lack of flexibility when it comes
to defining custom metrics. While these interfaces are convenient, they often fall short in handling more complex
calculations. On the other hand, SQL provides unmatched flexibility in metric definition. For instance, if you want to
visualize a trailing 30-day average using a rolling window for each date, achieving this in a typical drag-and-drop
dashboard interface would be nearly impossible. You would likely need to recompute the entire metric at the ETL layer
instead.
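
For example, a trailing 30-day average can be expressed directly in SQL with a window frame. This is a sketch that assumes one row per date in an illustrative daily_campaign_metrics table:

SELECT activity_date,
       daily_clicks,
       AVG(daily_clicks) OVER (
           ORDER BY activity_date
           ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
       ) AS trailing_30_day_avg_clicks
FROM daily_campaign_metrics;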

Our goal in building this feature was to unleash the flexibility of SQL at the chart authoring layer.

Data Distiller Global Filters

We’ve adopted an innovative approach to filter design by allowing filters to be created at the dashboard level while
giving you full control over how they’re applied within individual charts. This flexibility enables sophisticated filter
logic, where both local filters and the chart’s context work together to determine how the filter impacts the data
displayed. This advanced filter design ensures greater customization and precision, allowing you to tailor the behavior
of each chart based on specific business needs.

Data Distiller Date Filters

An advanced date picker-style filter, offering both date range selection and preset options, can be applied to charts
within the dashboard. This feature allows users to quickly customize date-based filters, enhancing flexibility and
precision in data analysis.

Pushdown Filters in Drillthroughs


When a drillthrough is applied on a dashboard, the global filter can also be applied to the child elements, provided they
are connected to the same filter. This ensures that the global filter will influence all related child charts and reports
within the dashboard, even when navigating deeper into the data through drillthrough actions. This setup maintains
consistency across visualizations and enhances the interactivity of the dashboard.

BI Connectivity to Accelerated Store

You can integrate your preferred BI tool with the data models stored in the Accelerated Store, enabling seamless access
to high-performance, optimized data. This allows users to leverage the power of the Accelerated Store’s query engine
while continuing to work within familiar BI environments for dashboard creation, reporting, and analysis. The
flexibility of this integration ensures that businesses can take advantage of both their existing BI tools and the
advanced capabilities of the Accelerated Store without sacrificing speed or efficiency.

Audience analysis including overlaps and identity composition

Conceptual mock of a Data Distiller Data Model

BI Dashboards built with Data Distiller

https://data-distiller.all-stuff-data.com/unit-6-data-distiller-audiences/dda-200-build-data-distiller-audiences-on-data-
lake-using-sql * * *

1. Unit 6: DATA DISTILLER AUDIENCES

DDA 200: Build Data Distiller Audiences on Data Lake Using SQL
Unleash the full potential of your data with Data Distiller—where advanced audience creation meets real-time
insights, scalability, and unmatched personalization.

The Adobe Experience Platform (AEP) Data Lake is a comprehensive data hub that brings together datasets from a
wide array of Adobe applications, including Adobe Analytics, Adobe Campaign, Adobe Audience Manager, Marketo,
and Adobe Commerce. When you combine this wealth of data with that from Adobe Real-Time Customer Data
Platform, Adobe Journey Optimizer, Customer Journey Analytics, Adobe Mix Modeler, Marketo Measure, and Adobe
GenStudio, you have the entire Adobe ecosystem at your fingertips—empowering you to personalize at scale like
never before. These diverse datasets provide a rich foundation for creating highly targeted and dynamic segments,
helping you better understand and engage your customers.

Beyond raw customer data, each of these systems generates system datasets that contain critical information about
personalization and customer interactions. Whether it’s journey insights from Adobe Journey Optimizer or behavioral
data from Adobe Analytics, the AEP Data Lake allows you to access all of this in one place. This unified data source
enables you to build Data Distiller Audiences, allowing you to craft audiences based on highly granular customer
behaviors and interactions across all channels. With all this data readily available, you are not just building audiences
—you are orchestrating personalized, omnichannel experiences that meet customers wherever they are, powered by
the full strength of Adobe’s ecosystem.

Benefits of Data Distiller Audiences

Data Distiller stands as the go-to solution for modern audience authoring, offering unmatched flexibility, scalability,
and analytical power. When it comes to building personalized, data-driven marketing strategies, Data Distiller brings
numerous benefits that make it superior to other audience creation tools.
At the core of Data Distiller’s advantage is its ability to leverage SQL, a universally recognized language for database
marketing. SQL’s expressiveness provides flexibility and control, allowing you to create detailed audience segments
with precision. Unlike restrictive point-and-click tools, SQL gives you the freedom to write complex queries that target
specific behaviors, characteristics, or actions. Whether you’re segmenting based on purchase behavior, demographics,
or engagement history, SQL enables you to fine-tune audience definitions, making it easier to engage the right
customers at the right time.

True Behavior-Based Audiences

Modern audience engines often struggle to process large volumes of raw event data efficiently. Data Distiller
eliminates this bottleneck. With its ability to seamlessly handle massive amounts of event data and deeply nested
structures, Data Distiller goes beyond the capabilities of most campaign tools. This allows you to create highly
nuanced audiences based on complex behaviors that would otherwise be difficult to define. Data Distiller also provides
powerful data transformation capabilities, making it possible to extract, manipulate, and process data in ways that
other systems cannot. Custom joins across datasets let you combine diverse sources of identity information, resulting
in a level of audience precision and customization that is unparalleled.

One of the most impressive aspects of Data Distiller is its ability to operate at scale. Whether you’re working with
millions or billions of records, Data Distiller can handle these datasets effortlessly. This scalability enables the creation
of high-value audiences that can drive sophisticated, personalized marketing campaigns without sacrificing
performance or speed. Large-scale personalization initiatives, which once seemed daunting, can now be executed
efficiently, giving brands the ability to target and engage vast customer bases with ease.

Real-Time or Batch Personalization

You can seamlessly activate audiences in real-time or batch across Adobe’s Real-Time Customer Data Platform
destinations and Adobe Journey Optimizer. This flexibility allows businesses to shift between batch-based audience
creation for long-term insights and real-time activation for immediate, contextually relevant customer engagement. In
addition to real-time activations, Data Distiller Audiences can also be enriched with personalization attributes and
activated to batch or file-based destinations.

Advanced Audience Orchestration

Data Distiller is more than just an audience segmentation tool; it offers a flexible audience orchestration system. With
the ability to create conditional branching and mix-and-match criteria, you can build complex audience structures
based on multiple conditions. This allows for the orchestration of highly intricate customer experiences, taking into
account diverse data points like behavior, identity, and engagement history. By combining these conditions, you can
seamlessly manage customer journeys across touchpoints, delivering truly personalized experiences.

Leverage Audiences from Real-Time Customer Profile

Another key advantage of Data Distiller is its ability to integrate with Profile Snapshot datasets. You can mix and
match audience memberships from profiles natively authored in Real-Time Customer Profile, along with identity
compositions. This enhances audience precision by merging identity data with real-time audience insights, resulting in
more accurate segmentation and better personalization. This flexibility allows for the seamless combination of
historical data with real-time insights, unlocking richer and more actionable audience segments.

Cross-Platform Integration with External Audiences

Data Distiller isn’t limited to the audiences on AEP Data Lake alone. You can also integrate it with external audiences
created by Adobe Audience Manager and Federated Audience Composition to extend your segmentation strategies.
This cross-audience capability allows you to blend SQL-based audiences with those created in external platforms,
offering even greater flexibility in personalization and reach. By leveraging multiple segmentation strategies, you can
ensure that your audiences are fully optimized for different campaigns and marketing goals.
Advanced Analytics and Insights with Data Distiller

Data Distiller does more than just audience segmentation; it opens up a world of advanced analytics. With its ability to
generate deep insights, statistical models, and hypercubes, Data Distiller provides marketers with the tools they need
for advanced audience analysis. You can process vast amounts of data from multiple sources to uncover key trends,
build multidimensional views of customer segments, and extract actionable insights. Whether you need to explore
patterns in customer behavior, run advanced statistical models, or dive deep into the data, Data Distiller is equipped to
fuel your marketing strategies with powerful insights.

Superior Privacy and Governance

One of the standout advantages of Data Distiller Audiences built on the AEP Data Lake is the enhanced governance
and privacy controls. Since the AEP Data Lake is natively integrated with Adobe Experience Platform’s Trust
Framework, it benefits from robust data governance, ensuring that every dataset adheres to the strictest compliance,
security, and privacy standards. This level of control is critical when dealing with customer data, especially in an
environment where regulations like GDPR, CCPA, and other data protection laws are paramount.

You will need this to upload the test data:

You may encounter situations, such as retrieving external audience information, that require you to work with complex data structures like maps as well as with Profile Snapshot datasets. You should read this tutorial to familiarize yourself with these:

We will be using the following data to create segments:

You should also familiarize yourself with some statistics, as we take a different approach to understanding the data in this tutorial:

Retail Case Study: Optimizing Email Marketing Campaigns with Audience Segmentation and A/B Testing

A retail brand is running a series of email marketing campaigns for its Spring Sale, Holiday Offers, and New
Arrivals. The marketing team wants to:

1. Decide on Audience Strategy: They need to determine a segmentation strategy: should they prioritize loyalty models to guide their campaigns, or shift focus back to engagement? Segment the audience based on engagement or loyalty by creating targeted messaging. For example, craft specific messages for those who have opened emails but haven’t clicked (warm leads) versus those who haven’t engaged at all (cold leads); see the sketch after this list.

2. [Not in this tutorial] A/B Test Subject Lines: Compare two email subject lines for the same campaign to see
which one drives more engagement (open and click rates).

3. Monitor and Reduce Email Bounces: Track and reduce email bounces by analyzing hard and soft bounces to
refine the email list and improve targeting.

4. Personalized Marketing: Use engagement metrics (like open and click counts) to create personalized follow-up
campaigns, offering exclusive deals or reminders based on customer interaction behavior.
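
To make the warm-versus-cold distinction in point 1 concrete, here is a minimal sketch. It reuses the open_count and click_count columns of the sample dataset used later in this tutorial, and the segment labels are illustrative:

SELECT customer_id,
       email,
       campaign_name,
       CASE
           WHEN open_count > 0 AND click_count = 0 THEN 'warm_lead' -- opened but never clicked
           WHEN open_count = 0 AND click_count = 0 THEN 'cold_lead' -- no engagement at all
           ELSE 'engaged'                                           -- opened and clicked
       END AS lead_segment
FROM email_campaign_dataset_20241001_050033_012;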

In our scenario, we need to send audiences based on campaign performance within the Adobe ecosystem to a third-
party activation system. While this approach is not ideal, it is a necessity driven by our architecture, a practice that is
common in many organizations. The objective is to leverage system datasets in the Adobe Experience Platform to
create new audiences that can be sent to other systems in batch, without significantly increasing the data volume in the
Real-Time Customer Profile.
In this case study, the data is simulated to reflect the performance metrics typically seen in campaign tools like Adobe
Journey Optimizer or Adobe Campaign. The test data has been simplified into a canonical form, allowing us to focus
more on audience design and activation rather than data transformation. For a comprehensive guide on how to utilize
similar data from Adobe Journey Optimizer, refer to the tutorial available here.

Expected Outcomes

Higher Engagement: By tracking open and click rates, the marketing team can focus on the most effective
content, leading to higher engagement and ultimately increased sales.

Improved Targeting: Customer segmentation based on interaction helps in tailoring future messages, leading to
better personalization and increased likelihood of conversion.

[Not in this tutorial] Optimized Content: A/B testing results will provide insights into what content or subject
lines resonate most with the audience, enabling the brand to optimize its messaging.

Reduced Bounce Rates: Understanding bounce types (hard or soft) will allow the marketing team to clean up
the email list, ensuring better deliverability and engagement metrics.

There are separate tutorials that will cover topics on advanced audience orchestration (for optimized content) and
activation:

Follow the steps outlined in the tutorial to ingest the CSV file listed above.

High Level Summary of Data Distiller Audience Commands

**CREATE AUDIENCE**

CREATE AUDIENCE highly_engaged_audience
WITH (primary_identity = email, identity_namespace = Email)
AS (
SELECT
customer_id,
email,
campaign_name,
open_count,
click_count,
CAST(CURRENT_TIMESTAMP AS timestamp) AS event_timestamp
FROM email_campaign_dataset_20241001_050033_012
WHERE open_count > 0 AND click_count > 0
);

**INSERT INTO**

INSERT INTO highly_engaged_audience
(SELECT
customer_id,
email,
campaign_name,
open_count + 1 AS open_count,
click_count + 2 AS click_count,
CAST(CURRENT_TIMESTAMP AS timestamp) AS event_timestamp
FROM email_campaign_dataset_20241001_050033_012
WHERE open_count > 0 AND click_count > 0);

**DROP AUDIENCE**

DROP AUDIENCE highly_engaged_audience;

Exploratory Analysis to Define Audience Strategy

Before you run the queries, double-check that you have the right dataset. Navigate to Datasets->Browse to locate
the dataset, click on it, and retrieve the table name. This is covered in the tutorial here. Be aware that the system may
append a suffix to the table name, as shown in the query below.
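
If you prefer SQL, you can also list the tables visible to Data Distiller from your editor or a third-party client and
scan the output for the suffixed name. This is a minimal sketch and assumes the SHOW TABLES command is available in
your environment:

-- List all tables visible to Data Distiller; look for the one starting with email_campaign_dataset_
SHOW TABLES;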

SELECT
campaign_name,
COUNT(*) AS total_emails_sent,
SUM(open_count) AS total_opens,
SUM(click_count) AS total_clicks,
ROUND(SUM(open_count) / COUNT(*), 2) AS open_rate,
ROUND(SUM(click_count) / SUM(open_count), 2) AS click_through_rate
FROM email_campaign_dataset_20241001_050033_012
GROUP BY campaign_name
ORDER BY open_rate DESC;

The result will be:

As an analyst, it’s essential to document or summarize the insights derived from the data. In the example above, you’re
interpreting raw data to draw meaningful conclusions. As you move through this section, notice how advanced
statistical analysis can make it easier to describe the data and support your observations. Even if you’re not deeply
involved in statistics, having that analysis to back up your findings is always valuable.

Flash Discounts has the highest open rate (5.17), meaning that the email subject or content was particularly
engaging. However, the click-through rate is on the lower end compared to others. This suggests that while many
customers opened the email, fewer were motivated to click, possibly indicating that the content or call-to-action
inside the email could be optimized.

Holiday Offers performs well overall, with both a strong open rate (5.08) and a fairly high click-through rate
(0.49). This suggests that the campaign is well-targeted, and the messaging inside the email effectively
encourages customer engagement. This campaign is performing relatively well and might serve as a benchmark
for future campaigns.

New Arrivals has a similar open rate to Holiday Offers, at 5.01, and matches its click-through rate at 0.49. This
suggests that the customers are equally interested in this content, but again, while opens are strong, there’s still
room for improving the conversion rate from clicks to action.

Spring Sale has the lowest open rate (4.98) among the four campaigns, though the difference is small. However,
it compensates with the highest click-through rate (0.50), suggesting that while fewer people opened the email,
those who did were highly engaged and more likely to click. This indicates that the content was persuasive for
the subset of users who opened the email, but perhaps the subject line could be improved to increase opens.

Best Performing Email Subject Lines

Identify which email subject lines drive the highest engagement.

SELECT
email_subject,
campaign_name,
COUNT(*) AS total_emails_sent,
SUM(open_count) AS total_opens,
SUM(click_count) AS total_clicks,
ROUND(SUM(open_count) / COUNT(*), 2) AS open_rate,
ROUND(SUM(click_count) / SUM(open_count), 2) AS click_through_rate
FROM email_campaign_dataset_20241001_050033_012
GROUP BY email_subject, campaign_name
ORDER BY click_through_rate DESC;

This query allows you to compare how different email subject lines perform across campaigns in terms of engagement.

The results will look like this:

Most Popular Subject Lines by Campaign

You’ll need to use a window function to identify the top result within each campaign, and then select the highest-
ranking element from each partition.

SELECT
campaign_name,
email_subject,
total_opens
FROM (
SELECT
campaign_name,
email_subject,
SUM(open_count) AS total_opens,
ROW_NUMBER() OVER (PARTITION BY campaign_name ORDER BY SUM(open_count) DESC) AS rank
FROM email_campaign_dataset_20241001_050033_012
GROUP BY campaign_name, email_subject
) AS ranked_subjects
WHERE rank = 1;

The result looks like:

The most popular subject lines across the campaigns indicate that customers respond strongly to messages that
emphasize urgency and discounts. For the Flash Discounts campaign, the top subject line, “Special Offer Just for
You!”, suggests that personalized and exclusive offers resonate well with the audience, driving 3,394 opens. Similarly,
Holiday Offers and New Arrivals both found success with the subject line “Don’t Miss Out on Big Discounts,” which
highlights that customers are motivated by the prospect of large savings. The Spring Sale campaign’s most successful
subject line, “Limited Time Deal!”, reinforces the effectiveness of urgency in encouraging engagement. Overall, these
subject lines show that messaging focused on exclusive deals and time-sensitive offers performs well across various
campaigns.

Data Distiller Statistics: Basic Analysis

Descriptive statistics give you a summary of the central tendency and dispersion of your key metrics (e.g., open count,
click count, purchase value, etc.).

SELECT
ROUND(AVG(open_count), 2) AS avg_open_count,
ROUND(STDDEV(open_count), 2) AS stddev_open_count,

ROUND(AVG(click_count), 2) AS avg_click_count,
ROUND(STDDEV(click_count), 2) AS stddev_click_count,

ROUND(AVG(avg_purchase_value), 2) AS avg_purchase_value,
ROUND(STDDEV(avg_purchase_value), 2) AS stddev_purchase_value,

ROUND(AVG(purchase_frequency), 2) AS avg_purchase_frequency,
ROUND(STDDEV(purchase_frequency), 2) AS stddev_purchase_frequency,

ROUND(MIN(open_count), 2) AS min_open_count,
ROUND(MAX(open_count), 2) AS max_open_count,

ROUND(MIN(click_count), 2) AS min_click_count,
ROUND(MAX(click_count), 2) AS max_click_count
FROM email_campaign_dataset_20241001_050033_012;

The results show the following:

The results show that, on average, customers open emails about 5 times and click on links around 2.46 times. Purchase
behavior shows an average purchase value of $254.78, with customers making approximately 25 purchases on
average. However, there’s noticeable variability, especially in purchase value (standard deviation of $140.88) and
purchase frequency (standard deviation of 14.58), indicating that customer spending and purchase habits differ widely.
Some customers don’t engage with emails at all (0 opens or clicks), while others engage frequently, with a maximum
of 9 opens and clicks.

Customer Purchase and Engagement Correlation

Identify whether customer loyalty and purchase behavior correlate with email engagement (opens and clicks).

SELECT
customer_loyalty_score,
AVG(open_count) AS avg_open_count,
AVG(click_count) AS avg_click_count,
AVG(avg_purchase_value) AS avg_purchase_value,
AVG(purchase_frequency) AS avg_purchase_frequency
FROM email_campaign_dataset_20241001_050033_012
GROUP BY customer_loyalty_score
ORDER BY customer_loyalty_score DESC;

This query helps identify trends between customer engagement in email campaigns and their purchase behavior,
providing insights into which customer segments are most valuable.

Let us now compute the correlation between loyalty score and the other metrics:

SELECT
'Open Count' AS metric,
CORR(customer_loyalty_score, open_count) AS correlation
FROM email_campaign_dataset_20241001_050033_012

UNION ALL

SELECT
'Click Count' AS metric,
CORR(customer_loyalty_score, click_count) AS correlation
FROM email_campaign_dataset_20241001_050033_012

UNION ALL
SELECT
'Purchase Value' AS metric,
CORR(customer_loyalty_score, avg_purchase_value) AS correlation
FROM email_campaign_dataset_20241001_050033_012

UNION ALL

SELECT
'Purchase Frequency' AS metric,
CORR(customer_loyalty_score, purchase_frequency) AS correlation
FROM email_campaign_dataset_20241001_050033_012;

The results indicate that there is no significant correlation between customer loyalty score and key engagement
metrics such as open count, click count, purchase value, and purchase frequency. The Pearson correlation coefficients
for all these variables are close to zero, suggesting that higher loyalty scores do not predict increased email
engagement or purchase behavior. Whether customers have a high or low loyalty score does not appear to impact how
often they open or click emails, or how much and how frequently they make purchases. This means that in this dataset,
loyalty score is not a strong indicator of customer actions, and other factors may need to be explored to better target
and engage customers.

The **CORR()** function in SQL calculates the Pearson correlation coefficient between two numeric columns. It
measures the linear relationship between two variables, producing a value between -1 and 1:

1: A perfect positive correlation, meaning as one variable increases, the other also increases.

0: No correlation, meaning there’s no linear relationship between the two variables.

-1: A perfect negative correlation, meaning as one variable increases, the other decreases.

Data Distiller Statistics: Advanced Analysis

SELECT
    -- Median calculations
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY open_count) AS median_open_count,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY click_count) AS median_click_count,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY avg_purchase_value) AS median_purchase_value,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY purchase_frequency) AS median_purchase_frequency,

    -- Mode calculations for open_count
    (SELECT open_count
     FROM email_campaign_dataset_20241001_050033_012
     GROUP BY open_count
     ORDER BY COUNT(*) DESC
     LIMIT 1) AS mode_open_count,

    -- Mode calculations for click_count
    (SELECT click_count
     FROM email_campaign_dataset_20241001_050033_012
     GROUP BY click_count
     ORDER BY COUNT(*) DESC
     LIMIT 1) AS mode_click_count,

    -- Mode calculations for avg_purchase_value
    (SELECT avg_purchase_value
     FROM email_campaign_dataset_20241001_050033_012
     GROUP BY avg_purchase_value
     ORDER BY COUNT(*) DESC
     LIMIT 1) AS mode_purchase_value,

    -- Mode calculations for purchase_frequency
    (SELECT purchase_frequency
     FROM email_campaign_dataset_20241001_050033_012
     GROUP BY purchase_frequency
     ORDER BY COUNT(*) DESC
     LIMIT 1) AS mode_purchase_frequency,

    -- Kurtosis calculations
    KURTOSIS(open_count) AS kurtosis_open_count,
    KURTOSIS(click_count) AS kurtosis_click_count,
    KURTOSIS(avg_purchase_value) AS kurtosis_purchase_value,
    KURTOSIS(purchase_frequency) AS kurtosis_purchase_frequency,

    -- Skewness calculations
    SKEWNESS(open_count) AS skewness_open_count,
    SKEWNESS(click_count) AS skewness_click_count,
    SKEWNESS(avg_purchase_value) AS skewness_purchase_value,
    SKEWNESS(purchase_frequency) AS skewness_purchase_frequency

FROM email_campaign_dataset_20241001_050033_012;

Observe the functions used to compute the median and mode in Data Distiller.

Median represents the middle value in a dataset and helps to provide a clear picture of the typical values, unaffected
by extreme outliers. In this case, the median open count is 5, indicating that half of the customers opened emails 5
times or fewer, while the other half opened emails more than 5 times. Similarly, the median click count is 4,
suggesting that customers typically clicked on email links 4 times or fewer. For purchase behavior, the median average
purchase value is 275.0, meaning that half of the customers made average purchases of $275 or less, while the other
half made purchases of more than $275. Finally, the median purchase frequency is 25, indicating that half of the
customers made 25 or fewer purchases, reflecting a balanced distribution across the dataset.

Mode reflects the most frequently occurring value within a dataset, providing insights into the most common
behaviors. In this analysis, the mode for the open count is 5, meaning that most customers opened emails 5 times,
making this the most frequent behavior. Interestingly, the mode for click count is 0, suggesting that the most
common behavior was for customers to open emails without clicking on any links. The mode for average purchase
value is 275.0, which indicates that $275 was the most frequent purchase amount among customers. The most frequent
purchase frequency is 46, showing that many customers made 46 purchases, highlighting a common engagement level
within the dataset.

The reason the median and mode seem so similar is likely because, in our dataset, the values for median and mode
happen to be close in magnitude for some metrics, such as open count (median 5, mode 5) and average purchase value
(median 275, mode 275). However, while the values are close, median and mode describe different characteristics of
the dataset:

Median represents the middle point of the data distribution, meaning half the values are below and half are
above.
Mode represents the most frequently occurring value in the dataset, which shows the most common behavior.

Skewness measures the asymmetry of a data distribution, indicating whether the values tend to cluster more on one
side of the mean. In this dataset, the skewness for open count is -0.53, showing a slight negative skew, meaning more
customers have higher open counts, but the skewness isn’t extreme. For click count, the skewness is 0.96, which
indicates a positive skew, where most customers have lower click counts, but a few customers click significantly
more often. The skewness for average purchase value is very close to zero (0.03), suggesting a nearly symmetrical
distribution, with purchase values evenly spread around the mean. Similarly, the purchase frequency has a skewness
of -0.009, also indicating a nearly symmetrical distribution, meaning customers’ purchase frequencies are balanced on
both sides of the average, with no significant bias toward higher or lower frequencies.

Kurtosis is a measure that describes how much of the data is concentrated in the tails of a distribution, indicating how
extreme the values are. In this dataset, the negative kurtosis values for open count (-4.252), purchase value (-4.193),
and purchase frequency (-4.217) suggest that the distributions are flatter than a normal bell-shaped curve, meaning
there are fewer extreme behaviors, such as very high or very low values, and most customers’ behaviors are closer to
the average. The click count, with a kurtosis of 0.073, is close to a normal distribution, indicating a more typical
spread of values with a balanced number of extremes. Overall, the negative kurtosis values reflect that most
customers behave moderately, and extreme behaviors, such as very high engagement or large purchase amounts,
are less common.

Approach on Audience Strategy

Based on the above analysis, options for us to explore are:

Rather than relying on loyalty score, segment customers based on their past engagement metrics (e.g., high
openers vs. low openers, frequent purchasers vs. occasional buyers). This is the approach we will take in this
tutorial.

[Not in this tutorial] Perform A/B testing to experiment with subject lines across various customer segments.
The most popular subject lines across the campaigns indicate that customers respond strongly to messages that
emphasize urgency and discounts.

Focus on personalization by using customer purchase history, browsing behavior, or other real-time data.
Personalized offers and messaging can improve engagement and conversions more effectively than broad
targeting based on loyalty score. An example of this is doing RFM or FRE modeling as shown in this tutorial
here.

[Not in this tutorial] Consider if timing affects customer engagement. For example, sending campaigns based
on previous purchase dates or behavioral signals (e.g., abandoned carts) may improve response rates more than
relying on static scores like loyalty.

Identify Highly Engaged Customers

Create a list of customers who have opened and clicked on emails, marking them as highly engaged.

SELECT
customer_id,
email,
campaign_name,
open_count,
click_count,
CASE
WHEN open_count > 0 AND click_count > 0 THEN 'Highly Engaged'
WHEN open_count > 0 AND click_count = 0 THEN 'Opened, No Click'
ELSE 'No Engagement'
END AS engagement_level
FROM email_campaign_dataset_20241001_050033_012
ORDER BY engagement_level;

This query segments customers into “Highly Engaged,” “Opened, No Click,” and “No Engagement.”

It is recommended to perform audience estimation and debugging before creating Data Distiller Audiences.

Let us execute the following audience estimation query:

SELECT
engagement_level,
COUNT(DISTINCT customer_id) AS audience_size
FROM (
SELECT
customer_id,
email,
campaign_name,
open_count,
click_count,
CASE
WHEN open_count > 0 AND click_count > 0 THEN 'Highly Engaged'
WHEN open_count > 0 AND click_count = 0 THEN 'Opened, No Click'
ELSE 'No Engagement'
END AS engagement_level
FROM email_campaign_dataset_20241001_050033_012
) AS engagement_data
GROUP BY engagement_level
ORDER BY audience_size DESC;

Create a Data Distiller Audience

Before executing the query below, keep in mind that this will use the Batch Compute Engine in Data Distiller. It will
create a new dataset and update an existing one marked for Real-Time Customer Profile. Batch jobs typically take
about ~5 minutes for the cluster to spin up and down, so expect the query to take around ~10 minutes to complete.

It’s recommended to use SQL queries with limits while prototyping your audience queries to avoid timeouts. Once
your audience queries are finalized, you can use the Data Distiller Orchestration Anonymous Block feature to wrap all
audience creation steps into a single statement, significantly saving time. The **CREATE AUDIENCE** command
works similarly to **CREATE TABLE**, but it can also write segment membership and enriched attributes to
external audiences in Real-Time Customer Profile.
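
As a rough illustration of that wrapping, the sketch below chains the audience creation into a single anonymous block
submission. Treat it as a hedged example: the exact anonymous block syntax is covered in the orchestration tutorial
referenced above, and the statements inside are simply the ones used in this tutorial:

$$ BEGIN
    -- Create the audience as one orchestrated step
    CREATE AUDIENCE highly_engaged_audience
    WITH (primary_identity = email, identity_namespace = Email)
    AS (
        SELECT customer_id, email, campaign_name, open_count, click_count,
               CAST(CURRENT_TIMESTAMP AS timestamp) AS event_timestamp
        FROM email_campaign_dataset_20241001_050033_012
        WHERE open_count > 0 AND click_count > 0
    );
    -- Any follow-up INSERT INTO refresh statements would go here, all within the same block
END $$;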

We will now create the highly engaged customer audience which will include customers who have both opened and
clicked emails.

CREATE AUDIENCE highly_engaged_audience
WITH (primary_identity = email, identity_namespace = Email)
AS (
SELECT
customer_id,
email,
campaign_name,
open_count,
click_count,
CAST(CURRENT_TIMESTAMP AS timestamp) AS event_timestamp
FROM email_campaign_dataset_20241001_050033_012
WHERE open_count > 0 AND click_count > 0
);

Let’s take a closer look at the columns we are selecting and their purpose:

**customer_id**: This is likely a CRM-style identifier, essential for activating systems across the
enterprise. It is associated with additional information like the customer’s first and last name, which can be pulled
from a central system.

**email:** This serves as the primary identity in the Adobe Experience Platform, used for personalization and
identity stitching. This will also be reused in the target system for email campaigns.

**campaign_name:** Represents the name of the campaign running in Adobe Journey Optimizer. This can
provide descriptive feedback to the target activation system.

**open_count/click_count:** These are key engagement metrics that can be leveraged for further
segmentation in the downstream email campaign system.

**event_timestamp:** This field is considered a best practice because it helps track when the audience
record was materialized across different parts of the platform. As you progress through this tutorial, you’ll
understand the importance of this choice.

Observe the syntax above:

WITH (primary_identity=email, identity_namespace=Email)

Note that neither the column name we are using for the primary identity (email) nor the identity namespace
(Email) is wrapped in quotes.

Keep in mind that the primary identity is used by Real-Time Customer Profile to partition and store data without
affecting the identity graph. Data Distiller Audiences do not impact the identity graph, allowing you to create ad hoc
audiences without worrying about altering the behavior of existing audiences or personalization created in Real-Time
Customer Profile.

Additionally, the derived attributes we’ve added, such as **customer_id**, **campaign_name**,
**open_count**, and **click_count**, are not available for segmentation within Real-Time Customer
Profile. These attributes are used exclusively for personalization or activation in the supported destinations, which are
batch or file-based destinations.

Remember, Data Distiller audiences are evaluated by Profile only at the time of creation. Since the associated derived
attributes for these imported audiences are non-durable and not stored in the Profile store, the audience will only be
updated if it is manually refreshed.

Data Distiller Audiences (much like any other external audience) only support flat, relational tables for audience
creation, meaning nested data structures like arrays, maps, or other types of nesting are not allowed—the audience
must remain flat. If you’re working with a nested dataset, you can extract the necessary fields to create a flat table by
using the **SELECT** statement with dot notation, which accesses specific fields within nested structures.
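
As a minimal sketch of that flattening step, the query below uses dot notation to pull individual fields out of nested
structures. The dataset name and field paths are hypothetical placeholders; substitute the actual paths from your own
schema:

SELECT
    _tenant.crm.customerId    AS customer_id,   -- hypothetical nested path
    personalEmail.address     AS email,         -- hypothetical nested path
    commerce.order.purchaseID AS purchase_id    -- hypothetical nested path
FROM my_nested_dataset;                         -- hypothetical dataset name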

Additionally, when activating a Data Distiller Audience, regardless of the export format (CSV, JSON, Parquet), the
data will always maintain a flat structure.

If your destination requires custom audience nested schemas, you’ll need to take the dataset activation route. For use
cases that involve identity graphs or profile data from Real-Time Customer Profile, the best approach is to combine
Profile snapshot data with these datasets to create metadata-rich datasets tailored for specific destination needs.
What is an Identity Namespace?

Note that the identity namespace refers to the type of identity, allowing you to differentiate between, for example, an
identity based on email and one based on a cookie. This distinction helps you define which types of identities you’re
willing to accept and provides insights into the omnichannel nature of your customers’ identities.

To retrieve the Identity Namespace for email, navigate to Identities->Browse and search for email. The Type is the
identity namespace that we need.

If you navigate to Datasets->Browse->highly_engaged_audience, you will see that a dataset is created for this
audience. Also, observe that it is not enabled for Profile (the radio button for Profile is disabled).

Real-Time Customer Profile and Data Distiller Audiences

During the import process for a Data Distiller audience, you need to specify which column corresponds to the Primary
Identity, such as an email address, ECID, or a custom identity specific to your organization. The data associated with
this Primary Identity is the only information linked directly to the profile. If no existing profiles match the data in the
Primary Identity column, a new profile is created. However, this new profile remains isolated, with no associated
attributes or experience events.

The remaining data in the Data Distiller audience is considered payload attributes. These attributes can be used for
personalization and enrichment during activation but are not directly attached to the profile itself. Instead, they are
stored in the data lake.

While a Data Distiller audience can be referenced when building new audiences with the Segment Builder, individual
profile attributes within the audience cannot be utilized independently.

Data Distiller Audience Datasets

When you create a Data Distiller audience, a dataset is created and appears in the dataset inventory. The dataset’s name
will match the name of the Data Distiller audience you created.

Let us execute the query:

SELECT * FROM highly_engaged_audience;

The structure of the dataset mirrors the **SELECT** query. As you add records to this dataset, you will see them
appending. The results are:

The purpose of this dataset is to provide derived attributes (**customer_id**, **campaign_name**,
**open_count**, **click_count**, and **event_timestamp**) that will be used for personalization or
activation in a destination. During activation, the destination retrieves the necessary values from this dataset.

If this dataset wasn’t marked for Profile, how did the data get into the Profile?

The secret is the Audience Portal Dataset for UPS Ingestion dataset, whose table name is
**audience_portal_dataset_for_ups_ingestion**.

If we preview the dataset by:

SELECT * FROM audience_portal_dataset_for_ups_ingestion;

The results will be the following:

Each record from our audience is ingested into Real-Time Customer Profile and is a row within this dataset. Note that
it does not contain the derived attributes (**customer_id**, **campaign_name**, **open_count**, and
**click_count**) in the audience.

Execute the following query with the **to_json** command:

SELECT to_json(segmentMembership) FROM audience_portal_dataset_for_ups_ingestion;

**segmentMembership**

The segment ID (**DDA** -> **943aa6a9-d43a-4399-8d6a-782eeb4524f4**) uniquely identifies this as a Data
Distiller Audience (DDA) with audience ID **943aa6a9-d43a-4399-8d6a-782eeb4524f4** for the
**highly_engaged_audience** that we created. If you are using other external audiences, they will have a
different acronym, such as AAM for Adobe Audience Manager.

The status “realized” indicates that the profile has qualified for the audience, and the timestamp reflects the most
recent qualification time. If the profile qualifies again on the following day, a new record will be added with an
updated timestamp.

**lastQualificationTime** tells you when that profile qualified for the segment. If you were to publish
the same profile back into the audience again, it will create a new entry in this dataset where the
**lastQualificationTime** would be updated.

**identityMap**

This is the primary identity listing for the profile in that audience.

Note that there are two filters you will need to apply to extract the members of an external audience from the
**audience_portal_dataset_for_ups_ingestion** dataset: the **audience ID** and the
**realized** timestamp. To learn how to extract this data from maps, you can read the tutorial here.
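
A hedged sketch of such a filter is shown below. The map access path assumes the segmentMembership structure we saw
with to_json above (the DDA namespace keyed by the audience ID); confirm the exact nesting in your own dataset before
relying on it:

-- Members of the highly_engaged_audience whose membership status is 'realized'
SELECT
    to_json(identityMap)       AS identities,
    to_json(segmentMembership) AS memberships
FROM audience_portal_dataset_for_ups_ingestion
WHERE segmentMembership['DDA']['943aa6a9-d43a-4399-8d6a-782eeb4524f4'].status = 'realized';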

Exploring Audience Portal in Adobe Real-Time Customer Data Platform

Let us now navigate to the audiences screen:

If you are struggling to find Data Distiller audiences, you can click on the Filter icon and then choose the Data
Distiller option:

Let us go ahead and click on the audience:

Also, observe the audience size which directly corresponds to 6.47K records.

Keep in mind that Data Distiller audiences, like any other external audience, will follow the default merge policy,
which can affect entitlements. They will increase the number of profiles you are consuming, because profiles that came
in from the dataset and do not already exist in Profile are created as new profiles. From a Profile storage standpoint, if
you are not using these audiences, you should delete them via the **DROP** command.

If you scroll down to the sample profiles at the bottom and select one, you will see:

If we click on the attributes, we won’t find any of the derived attributes that were present in the
**highly_engaged_audience** dataset in the AEP Data Lake.

If we click on **Audience membership**:

Observe the details on the Audience ID and the Last Modified timestamp (local time), which is 12:15 AM PST,
equivalent to 7:15 PM UTC. You will see that it is close to the UTC timestamp (7:22 PM) on the records in the Audience
Portal Dataset for UPS Ingestion dataset. This means that the ingestion into Real-Time Customer Profile started at
7:15 PM UTC and the realized status was marked around 7 minutes later. If we check, the
**highly_engaged_audience** dataset was created and updated around the same time.

Data Distiller Audiences are published to the Real-Time Customer Profile in batches and become available quickly
once the Data Distiller job is complete. Typical latency can vary from minutes to about an hour depending on the data
ingestion load on the Real-Time Customer Profile system. The low latency is ideal for ad hoc batch segmentation,
especially since the current support for batch segmentation in Real-Time Customer Profile has a 24-hour processing
window.

When managing audiences in Data Distiller, you can delete or **DROP** an audience as long as it isn’t actively
referenced or utilized in another audience, destination, or Adobe Journey Optimizer. Ensuring the audience isn’t linked
elsewhere prevents any unintended disruptions to workflows or segmentations that might rely on that audience. If an
audience is still in use, it should first be unlinked or replaced in those areas before deletion can proceed.

Adding or Updating a Data Distiller Audience

Before we begin, it’s important to note that all data added to datasets in the AEP Data Lake is append-only. Any new
data, whether it’s a fresh row or an update, will be added as new rows to the dataset.

A day passes, and let’s assume that every customer in the highly engaged audience has both opened and clicked once
more. To simulate this scenario, we will insert the same batch of individuals who re-qualified today but with a different
timestamp.

INSERT INTO highly_engaged_audience
(SELECT
customer_id,
email,
campaign_name,
open_count + 1 AS open_count,
click_count + 2 AS click_count,
CAST(CURRENT_TIMESTAMP AS timestamp) AS event_timestamp
FROM email_campaign_dataset_20241001_050033_012
WHERE open_count > 0 AND click_count > 0)

Keep in mind that there’s no need to specify a keyword like **AUDIENCE** or indicate which column is the primary
identity. The assumption is that you’ll follow the defined schema. If you don’t, it could result in incorrect data being
assigned to the primary identity field.

If we execute this query, we will get:

SELECT COUNT(*) FROM highly_engaged_audience

You will see a doubling of the number of records:

Let us access the duplicate records:

SELECT * FROM highly_engaged_audience
ORDER BY customer_id;

Observe CUST00003 customer_id value- it has two records:

If you look up the audience by going to Audiences->Browse->highly_engaged_audience, you will see something
like this:
Don’t wait for the segment ingestion status to update. Instead, click on the sample profiles, and you’ll see that the
update may have already been applied.

The **INSERT INTO** Data Distiller audience workflow is used when the audience is already part of
another audience, a destination, or Adobe Journey Optimizer, since you cannot drop or delete an audience that has these
downstream dependencies.

When inserting the same profile through multiple INSERT INTO statements, such as on a scheduled basis across
different days, there are two key considerations. First, the **lastQualificationTime** for that profile will be
continuously updated, and the profile will be activated again if it’s linked to a destination. Second, if you’re using
personalization attributes in a file-based destination, all attributes associated with that same profile over all time
will be exported, rather than just the most recent version. This is the typical behavior for external audiences in Adobe
Experience Platform. In such cases, you will need to use the **event_timestamp** attribute that we used in the
audience to choose the most recent value of the attribute within the destination system.
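
As a sketch of that pattern, the query below keeps only the most recent record per email by ranking on
event_timestamp; the same logic can be replicated in the destination system if it supports SQL:

SELECT customer_id, email, campaign_name, open_count, click_count, event_timestamp
FROM (
    SELECT *,
           -- Rank each profile's records from newest to oldest
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY event_timestamp DESC) AS row_rank
    FROM highly_engaged_audience
) ranked
WHERE row_rank = 1;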

Building an Edge Audience with a Data Distiller Audience as the Foundation

In this section, we will use the highly_engaged_audience as the foundation for real-time personalization at the edge.
When members of this audience visit the website and authenticate via email, we can identify them through the identity
graph as part of this audience, enabling us to deliver personalized experiences instantly using tools like Adobe Target
or Offer Decisioning. However, it’s important to note that members of this audience who do not have an associated
cookie ID on the edge will not be eligible for personalization.

Additionally, every edge evaluation triggers both a streaming evaluation on the Hub and a batch evaluation, which
serves as a reconciliation process to ensure consistency in segment definitions for the next day. This means that edge
audiences are not only available in real-time for activation on the edge but are also synced as streaming audiences for
activation on the Hub. The Hub projects identity graph and segment memberships of profiles it has encountered on the
edge, maintaining alignment between the edge and central systems for seamless personalization experiences across
channels.

1. We need to make sure that our Default Merge Policy is edge-enabled before we can create an edge segment.
Navigate to Profiles->Merge Policies and select the default merge policy. Remember that Data Distiller
audiences are associated with default merge policy only.

2. Turn the slider to the right to turn on the Active on Edge Merge Policy.

3. Continue navigating through the next few screens until you reach the final one. Click Finish.

4. Navigate to Audiences->Create Audience

5. Choose Build Rule. Data Distiller Audiences are not supported in Compose Audience option.

Compose Audience accepts CSV files (up to 1 GB) as external audiences, similar to the Upload CSV workflow
mentioned earlier. You can also use third-party tools like DBVisualizer to run your Data Distiller SQL queries,
download the entire audience as a CSV, and then manually upload it as a CSV file.

1. The Audience Builder pane consists of an Attributes pane and an Events pane. Navigate to Fields→ Experience
Platform → highly_engaged_audience, and then drag and drop it onto the Attributes pane.

2. Click on the Convert to Rules icon, and you’ll notice that it is grayed out. Typically, it would have given you
access to all the rules including those on the Profile attributes. This indicates that Data Distiller audience
attributes cannot be used for rule creation in the Rule Builder. This is expected, as mentioned earlier, because the
audience does not add attributes to the Profile’s attribute set. In fact, when we inspected the Profile earlier, we
observed that these attributes were not present.
If you need to use attributes, you’ll need to utilize Data Distiller Derived Attributes, which will populate the Real-
Time Customer Profile with the necessary attributes in a separate dataset. You can read about it in detail here.

1. To create a real-time segment, such as an edge or streaming segment, we need an event trigger. Navigate to
Fields → Events, scroll down to the Any Event option, and drag it into the Events pane in the Rule Builder.
Next, adjust the time-based clause to In the last 2 hours by selecting the appropriate settings from the dropdown
menus.

2. Name the audience as Highly Engaged Audience - Edge and choose the Evaluation Method as Edge.

3. Click the side button for Evaluation Eligibility Popup.

Real-time audiences (edge or streaming) should rely on precomputed attributes and simple conditions on events within
short timespans. If the timespan exceeds 24 hours, the segment will no longer qualify as an edge audience, and
anything beyond 7 days makes it impossible to evaluate as a streaming segment. While batch evaluation is the easiest
method, batch segments cannot be triggered by events and are only evaluated every 24 hours. This limitation makes
real-time engagement challenging. By leveraging Data Distiller-derived attributes or Data Distiller Audiences, you
gain far more flexibility in creating real-time audiences. Remember, Adobe Journey segments can be either streaming
or batch, depending on the requirements.

1. Close the popup and Click Publish.

Dropping or Deleting an Audience

Dropping an audience is similar to the **DROP TABLE**syntax:

DROP AUDIENCE highly_engaged_audience;

A Data Distiller Audience cannot be deleted if it is being used in another audience, a destination, or Adobe Journey
Optimizer, similar to audiences defined within the Real-Time Customer Profile. In such cases, you will receive a
message similar to what appears when attempting the same action from the UI:

11:47:06 AM > Query failed in 4.332 seconds.


11:47:06 AM > ErrorCode: 58000 Internal System Error [Failed to delete audience
due to: {"requestId":"456aa723-8b42-4fe1-94ec-16131b8ccc0a","errors":{"403":
[{"code":"UPAPI-113432-403","message":"The audience cannot be deleted, as it
has dependents in PES."}]},"type":"https://ns.adobe.com/aep/errors/UPAPI-
113432-403","title":"The audience cannot be deleted, as it has dependents in
PES.","status":403}]

In our case, we have created an edge segment in the previous section that uses this. You will need to go to Audiences-
>Browse and delete Highly Engaged Audience - Edge. Additionally, if you are using this audience in any destination,
you will need to remove it from that destination.

You can remove an audience from a Destination by navigating to the Destination Account, choosing the Audience,
selecting Activation Data, and then Remove Audience, as shown in the figure below:

Automatic Audience Drop in Real-Time Customer Profile & Override Option

There is a 30-day TTL on segment membership. Any segment membership whose lastQualificationTime is more than
30 days old will be cleaned up.

Data Distiller Audiences in Profile Snapshot Datasets


Every day, a snapshot of Profile attributes for each merge policy is exported to the AEP Data Lake. These system
datasets are hidden by default but can be accessed by toggling them on in the data catalog.

These datasets contain information about profile attributes, the identity map, and segment membership as they existed
at the time of the snapshot. The examples below show how to explore this data to better understand identity
composition and even create advanced segment overlaps. Data Distiller Audiences are also included in these
snapshots, providing the opportunity to build even more complex audiences that span multiple datasets and can be
evaluated in batch within the Real-Time Customer Profile.
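
For example, a segment overlap count could be sketched against the snapshot as shown below. The audience IDs are
placeholders, the snapshot table name is found in the steps that follow, and the namespace keys ('DDA' for Data
Distiller Audiences, 'ups' for Profile-defined segments) are assumptions based on the segmentMembership structure
seen earlier; inspect your snapshot with to_json(segmentMembership) first:

-- Count profiles that belong to both a Data Distiller Audience and a Profile-defined segment
SELECT COUNT(*) AS overlap_size
FROM profile_attribute_<default_merge_policy_snapshot>   -- placeholder; use your snapshot table name
WHERE segmentMembership['DDA']['<data-distiller-audience-id>'].status = 'realized'
  AND segmentMembership['ups']['<profile-segment-id>'].status = 'realized';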

Data Distiller Audiences are always associated with the default merge policy.

First, you will need to find the default merge policy and find the profile snapshot dataset corresponding to that merge
policy. More details are here.

1. Execute the following query:

SELECT * FROM adwh_dim_merge_policies;

The result will be:

1. Copy and paste the dataset ID into the Search bar.

2. Click the dataset and navigate to the right panel to copy the table name.

3. Execute the following query:

SELECT * FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;

The results will be the following. Scroll to the right and you will see the DDA segment memberships:

Notice once again that the Data Distiller Audience-derived attributes are absent, as expected. However, they are
present in the Data Distiller Audience dataset, which Destinations will use to populate its enriched attributes. You can
also check the Audience Overview page:

Appendix: Audiences on Data Lake vs. Composable Audiences

For any Customer Data Platform (CDP), the choice between access to raw data versus pre-defined audiences depends
on the use cases, flexibility, and control requirements of the CDP. Each approach has its advantages:

Data Lake Approach: Access to Raw Data

Flexibility and Granularity: Raw data provides full granularity and allows more flexibility for creating custom
segments, running complex analytics, and deriving unique insights.

Advanced Personalization: With raw data, you can build personalized customer profiles and define complex
audience rules tailored to specific marketing strategies.

Historical Data Analysis: Access to raw data enables richer historical analysis, trend forecasting, and a deeper
understanding of customer behavior across touchpoints.

Dynamic Audience Creation: Teams can build audiences dynamically, experiment, and adjust targeting based on
evolving marketing goals.

Cost and Complexity: Processing raw data often requires more storage, processing power, and expertise to
manage data pipelines, making it potentially more resource-intensive.
Composable Approach: Access to Audiences Only

Simplicity and Speed: Using pre-defined audiences can streamline activation by reducing the need for heavy
data processing. Teams can focus on deploying campaigns quickly.

Less Overhead: Relying on audience data reduces the burden on storage, processing, and potentially
compliance, as only specific attributes of segmented audiences are used.

Efficiency in Activation: Direct access to audience segments supports rapid campaign activation, especially if
audiences are already aligned with common use cases.

Limitations in Customization: Pre-defined audiences limit flexibility. You’re restricted to existing segment
definitions, which may not cover all desired targeting needs.

Dependence on CDP’s Audience Quality: The effectiveness depends on the granularity and accuracy of the pre-
built audiences provided to the CDP.

Use Cases for Raw Data: Ideal when the CDP use cases involve complex segmentation, personalized content,
AI/ML model training, or require insights beyond standard marketing segments.

Use Cases for Audience Access: Best for rapid deployment, simpler activation use cases, or environments where
high customization is unnecessary.

Most CDP implementations use a hybrid model: access to both raw data and pre-defined audiences, enabling the
flexibility to leverage existing audience data for straightforward campaigns and access raw data for more sophisticated
needs. This balance is often most effective, supporting both quick activations and deep, data-driven personalization.

Last updated 3 months ago

Campaign metrics analysis.

Most popular subject lines in each campaign

The results of the descriptive statistics query.

Engagement and purchase exploration

Data Distiller statistical functions give new results.

Successful creation of the audience has a user experience similar to CREATE TABLE.

Identity namespace is available in the Identities section in AEP UI.

Dataset is created for the Data Distiller audience.

Structure of the dataset mimics the SELECT query.

The dataset used by Real-Time Customer Profile to ingest the Data Distiller audience records.

Results of the query against the Audience Portal dataset.

Peek into audience membership.

Data Distiller audience is visible in the audience portal.

Data Distiller filter to get to the audiences.


Data Distiller audience in Real-Time Customer Profile shown always with default merge policy.

Sample profile in the Data Distiller audience

Attributes pane for a sample profile.

Audience membership pane.

You can see that the count of records has increased.

New records are added to the dataset as they come in.

Segment ingestion status may be misleading.

Sample profile shows that the audience has updated.

Choose the default merge policy.

Turn on Active on Edge Merge Policy

Click Finish to make the changes.

Use the Build Rule option.

Drag and drop the Data Distiller Audience onto the rule builder canvas.

Data Distiller audience gives you the members but not their attributes for segmentation rules within Rule Builder.

Drag an event to the canvas.

Choose the Edge as the Evaluation Method.

You can select any option you prefer, but the system will evaluate whether you truly qualify.

Audience dropped via SQL.

Retrieve the dataset ID for the dataset associated with default merge policy.

The dataset name will pop up.

Retrieve the table name for the Profile Snapshot dataset.

Data Distiller Audiences are part of the Profile Snapshot Dataset associated with Default Merge Policy.

Data Distiller Audiences are visible in the Audience Overview page.

https://data-distiller.all-stuff-data.com/unit-7-data-distiller-business-intelligence/bi-200-create-your-first-data-model-
in-the-data-distiller-warehouse-for-dashboarding * * *

1. Unit 7: DATA DISTILLER BUSINESS INTELLIGENCE

BI 200: Create Your First Data Model in the Data Distiller Warehouse for
Dashboarding
Creating your first table in the Accelerated Store involves defining and setting up a star schema containing tables to
store and manage data.
Last updated 4 months ago

You should read the Architecture Overview to understand the various query engines supported by Data Distiller:

What is the Accelerated Store?

In September 2023, Data Distiller added a new SQL compute engine (called Query Accelerated Engine or the Data
Distiller Warehousing Engine) specifically geared toward business intelligence, dashboarding, and reporting use cases.
The storage and compute layer sits parallel to the data lake and its associated SQL engines (Ad Hoc Query and Batch).
As a Data Distiller user, you will see all of the tables i.e. AEP datasets in a single catalog view.

A portion of the Accelerated Store is used by AEP to power the out-of-the-box dashboards seen in Adobe Real-Time
CDP (RTCDP) Overviews pages of Profile, Audiences, and Destinations. You will also see similar dashboards show up
on Adobe Journey Optimizer (AJO) pages for reporting on journeys and campaigns. The tables used in creating the
star schemas (or datagroups as Data Distiller calls them) are available for SQL querying with those apps. There is no
customization available at the data layer for these models.

In this example, we will visit the use case of fully building our own star schema as we would have done in our
enterprise warehouse except that we will be doing it within AEP.

There are a few things you need to be aware of so that you can implement a robust design for your end users:

1. Data Storage: The Query Accelerated layer has a limitation of 1 TB of data that you can keep in the star
schemas. This excludes all of the star schemas that were created for out-of-the-box dashboards for RTCDP or
AJO. If you copy raw data into this layer, be aware of this size limitation.

2. Query Concurrency: There can be a maximum of 4 concurrent queries in the system. These queries include
read, write, and update operations. If you do long-running ETL jobs that are computing and writing data into the
accelerated layer, it can impact the BI dashboard performance. This is not uncommon in such systems and you
should make sure you have designed this well.

3. User Concurrency: User concurrency is unlimited, unlike the Ad Hoc Query engine. Make sure you create a
Technical User Account for each of the dashboards you publish.

4. Caching: The Query Accelerated layer has a caching layer that caches the queries across all users so you should
see a performance boost for the same dashboard used across multiple users. The cold-start problem where the
first user in the system may have the longest loading time is something to be aware of.

5. Flat Tables: You cannot create nested data structures within the Query Accelerated Layer. The whole idea is to
keep the data in fact and dimension tables much like what is seen in data warehousing practice.

6. Built-in BI Dashboards: Data Distiller also integrates with (User-Defined) Dashboards which you can see in the
AEP UI. All the star schemas you build along with the ones that are shipped by AEP are accessible here. If your
use case demands that access to the insights be accessible within AEP or you have resource constraints to get a BI
license, then this is the tool for you. It is not a fully-fledged BI dashboarding capability but has enough to get you
the visualization you are looking for.

7. More Capacity: If you are looking for more capacity (storage & query concurrency) than what ships today, then
you need to contact an Adobe rep.

Our scenario is pretty straightforward as we will be showing the mechanics of creating a fact table with aggregated
snapshot data of purchases by country code. There is a lookup table that has the country ID mapped to the country
code. This lookup table will function as a dimension table. The goal is to create a star schema in the Accelerated Store.
The files that you will need are the following

Follow the steps for importing smaller-sized CSV files (1GB or less) as datasets outlined here:

Create a Database and a Schema

Tables contain the data. Think of a database as a logical grouping of tables and schemas as the next-level logical
grouping underneath it. The database grouping is useful when you want to keep the data separated for various
reporting use cases. From a BI tool perspective, this means that you should not be able to answer questions only
specific to this database grouping.

Note: Irrespective of the database and schema groupings that you create, all of these tables are accessible just like the
datasets in the data lake for querying and joining. The logical grouping makes the most sense when you access these
databases within a BI tool like Tableau or PowerBI.

1. Navigate to the AEP navigation bar. Click on Queries->Create Query

2. Make sure you choose the prod: all database in the database dropdown

3. Let us first create a database called testexample

CREATE DATABASE testexample WITH (TYPE=QSACCEL, ACCOUNT=acp_query_batch);

Let us take a look at the definition in the SQL metadata commands:

1. **TYPE=QSACCEL:** This specifies that the database and the schemas are custom-built and reside in the
customizable layer of the Accelerated Store

2. **ACCOUNT=acp_query_batch:** This specifies that you have the Data Distiller license in order to create
the tables.

Note: The customization feature works only if you have the Data Distiller license.

Best Practice: As you start thinking of putting the data model in production, consider prototyping on the AEP Data
Lake and use SQL as a means to make sure that all the queries work including the ones in the charts. All of the steps in
this tutorial apply. The only difference is that instead of using CREATE DATABASE testexample WITH
(TYPE=QSACCEL, ACCOUNT=acp_query_batch)

you will use CREATE DATABASE testexample

This will create the database, the schema, and the data model on the data lake. When going to production, drop all the
databases, schemas, and tables, and recreate them in the Accelerated Store using the QSACCEL syntax above.

We will create two logical partitions (just because we want to) to separate this database into fact and lookup schemas

CREATE SCHEMA testexample.lookups;


CREATE SCHEMA testexample.facttables;

Each combination of database and schema, i.e. **testexample.lookups** and
**testexample.facttables**, is a separate data model.

1. Let us now rename these data models to friendly names for access within dashboards:

ALTER MODEL testexample.facttables RENAME TO test_purchase;
ALTER MODEL testexample.lookups RENAME TO test_country;

Also, note that if you made a mistake in creating a database or schemas, you can do the following:

First, you need to identify the schemas in the database **testexample**.

For each schema, delete it:

DROP SCHEMA IF EXISTS testexample.lookups CASCADE;


DROP SCHEMA IF EXISTS testexample.facttables CASCADE;

After all the schemas are dropped, you can drop the database:

DROP DATABASE IF EXISTS testexample;

The CASCADE keyword is helpful because it deletes not just the schemas but also the underlying
tables.

Tables actually carry the data we will need for our dashboarding/reporting use cases. Unlike the data lake tables, you
will need to predefine the data types for these tables.

1. Let us define the lookup table first:

CREATE TABLE testexample.lookups.country_lookup AS
SELECT cast(null as string) country_code, cast(null as string) country_name
WHERE false;

2. Next, let us define the fact table:

CREATE TABLE testexample.facttables.crm_table AS
SELECT cast(null as int) purchase, cast(null as string) country_code
WHERE false;

If you ever need to surgically drop a table, you can use the following:

DROP TABLE IF EXISTS testexample.lookups.country_lookup;

DROP TABLE IF EXISTS testexample.facttables.crm_table;

Tip: PostgreSQL treats the names of objects (databases, schemas, and tables) in a case-insensitive manner. If you use
uppercase letters in a name, it will be treated as if it were lowercase.
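
For instance, both of the following statements resolve to the same table created earlier (a small illustrative sketch):

SELECT * FROM TestExample.FactTables.CRM_Table;
SELECT * FROM testexample.facttables.crm_table;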

Build Relationships Between Fact and Dimension Tables

This relationship building is required for the BI tool to make sense of this star schema. The User-Defined Dashboards
also use these associations in “joining” these lookups on the fly which is helpful when building visualizations.

1. The code is specified in the following way:

ALTER TABLE crm_table ADD CONSTRAINT FOREIGN KEY (country_code) REFERENCES
country_lookup(country_code) NOT enforced;

Warning: If you use the full path specification such as testexample.facttables.crm_table and
testexample.lookups.country_lookup in the ALTER TABLE command, you will get errors.

The key points to note are the following:

1. crm_table is the core fact table that has a key relationship on country_code
2. REFERENCES means that crm_table is joining in country_lookup on the country_code key.

3. NOT enforced means that the JOIN can be one-to-many on the country_lookup table. If there are two
country_name values for a single country_code, the JOIN will match both. It is your responsibility to make sure that
country_lookup does not have duplicates for a single country_code value (a quick duplicate check is sketched below).
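
A quick way to run that duplicate check before relying on the relationship is sketched below:

-- Country codes that map to more than one country name indicate duplicates in the lookup
SELECT country_code, COUNT(DISTINCT country_name) AS name_count
FROM testexample.lookups.country_lookup
GROUP BY country_code
HAVING COUNT(DISTINCT country_name) > 1;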

Tip: In DBVisualizer, if you use metadata commands like ALTER TABLE, executing a subsequent SELECT
command will not return any results. You will need to disconnect and reconnect to the Data Distiller database.
Metadata commands make changes to the Data Distiller database and these changes are not pushed out into the client
i.e. DBVisualizer. The client has to make the calls to get these updates.

You could execute all of this code query by query, by highlighting each query and then using the Run Selected Query
feature in the Data Distiller Query Pro Mode Editor:

Cheat code:

-- Create the Database in Accelerated Store


CREATE DATABASE testexample WITH (TYPE=QSACCEL, ACCOUNT=acp_query_batch);

-- Create the Schemas on the Database


CREATE schema testexample.lookups;
CREATE schema testexample.facttables;

-- Create the Tables


CREATE TABLE testexample.facttables.crm_table AS
SELECT
cast(null as int) purchase,
cast(null as string) country_code
WHERE false;

CREATE TABLE testexample.lookups.country_lookup AS


SELECT
cast(null as string) country_code,
cast(null as string) country_name
WHERE false;

-- Define a key relationship between the country code and the lookup
ALTER TABLE crm_table ADD CONSTRAINT FOREIGN KEY (country_code) REFERENCES
country_lookup(country_code) NOT enforced;

-- Rename the data models to make them user friendly names in Query Pro Mode
ALTER MODEL testexample.facttables RENAME TO test_purchase;
ALTER MODEL testexample.lookups RENAME TO test_country;

-- Hydrate the Tables


INSERT INTO testexample.facttables.crm_table (country_code, purchase)
(SELECT country, purchase_amount FROM purchases_dataset_crm_data);

INSERT INTO testexample.lookups.country_lookup(country_code, country_name)


(select country_code, country from(
select count(*), country_code, country from country_codes
group by country_code, country));

Always ensure you are in the correct database when using the Data Distiller Query Pro Mode Editor. If you are
querying tables from the AEP Data Lake or creating tables in the Accelerated Store, make sure you are in the
**prod;all** database. When querying tables in the Accelerated Store, you should be in the specific database that
corresponds to the data model containing the tables you’re querying.

Federation Access, Reads and Writes Across AEP Data Lake and Accelerated Store

Data Distiller enables you to read and write datasets across both the AEP Data Lake and the Accelerated Store. From
the query engine’s perspective, when using the Data Distiller Query Pro Mode Editor (or a third party client), you
can seamlessly join tables from both stores and persist the results across them. To ensure the query engine writes to the
Accelerated Store rather than the AEP Data Lake, you need to qualify the data model using specific parameters, such
as **WITH (TYPE=QSACCEL, ACCOUNT=acp_query_batch)**. This syntax signals to the system that the
target is the Accelerated Store.
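
For instance, a sketch reusing the tables from this tutorial: a single query in the main editor can join the raw dataset sitting in the AEP Data Lake with the lookup table in the Accelerated Store.

SELECT p.purchase_amount, l.country_name
FROM purchases_dataset_crm_data p                  -- dataset in the AEP Data Lake
INNER JOIN testexample.lookups.country_lookup l    -- table in the Accelerated Store
  ON p.country = l.country_code;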

The Data Distiller Query Pro Mode Editor used for chart creation cannot access tables in the AEP Data Lake due to
a 60-second timeout limit for chart queries. If a query takes longer to run because of the latency in reading from the
data lake, it will time out. This is one of the reasons why the Accelerated Store was built—to mitigate these latencies
caused by the data infrastructure layer. While data lakes are excellent for storing large volumes of data, running
exploration queries that can take minutes, and executing batch jobs to create new datasets, they are not optimized for
dashboarding. In contrast, the Accelerated Store’s data layer is designed to efficiently serve queries for dashboards.

However, in the main Data Distiller Query Pro Mode Editor within the AEP UI (i.e., Queries → Create Query), queries
can be executed across both the AEP Data Lake and the Accelerated Store in a federated manner. These queries can
read from and write to both stores without the 60-second timeout restriction, although they may time out after 10
minutes. This flexibility allows for longer-running queries that can return results in minutes, avoiding the limitations
present in chart creation.

1. Let us hydrate the tables

INSERT INTO testexample.facttables.crm_table (country_code, purchase)
(SELECT country, purchase_amount FROM purchases_dataset_crm_data);

INSERT INTO testexample.lookups.country_lookup (country_code, country_name)
(SELECT country_code, country FROM (
  SELECT count(*), country_code, country
  FROM country_codes
  GROUP BY country_code, country));

Observe that the data is being read from the AEP Data Lake but it is being inserted into the Accelerated Store. The Data Distiller Batch Query Engine is reading from one store (the AEP Data Lake) and writing in a federated fashion into the other (the Accelerated Store).
Note the following:

1. The order of the specification of the columns (country_code, purchase) in the crm_table destination table needs
to match the same order of the columns (country, purchase_amount) in the source table
purchases_dataset_crm_data.

2. The order of the specification of the columns (country_code, country_name) in the country_lookup destination
table needs to match the same order of the columns (country_code, country) in the source table country_codes.

3. Let us test some queries. Since these tables are in the Accelerated Store, we should now choose the database as
**testexample**

SELECT * FROM testexample.facttables.crm_table;

The results are the following:

1. Execute the following query as well:


SELECT * FROM testexample.lookups.country_lookup;

The results are the following:

Note that the tables created here do not need unique names across the Data Lake and the Accelerated Store; only the full path must be unique. Hence, you need to specify the full path in the **CREATE TABLE** metadata command.

Federation Across Data Models in Accelerated Store

A Note on Federation Across Data Models

Since data models are logical partitions, all tables that are part of data models within the Accelerated Store are
accessible in Data Distiller Query Pro Mode for mixing and matching. These tables can also be used for dashboarding
purposes.

The full path of a table, i.e. **database_name.schema_name.table_name**, has to be unique across the AEP Data Lake and Accelerated Store. The system will not let you create duplicate tables underneath the same data model (**database_name.schema_name**).

Let us do a join query between the fact and the dimension tables. Our goal is to combine records from two tables
whenever there are matching values in a field common to both tables which is equivalent to an INNER JOIN.

SELECT a.purchase, a.country_code, b.country_name
FROM crm_table a
INNER JOIN country_lookup b ON a.country_code = b.country_code

The results will be:

Let us run the aggregation query. Our goal again is to combine records from two tables whenever there are matching
values in a field common to both tables which is equivalent to an INNER JOIN.

SELECT sum(purchase) AS purchase, country_name
FROM testexample.facttables.crm_table
INNER JOIN testexample.lookups.country_lookup
  ON testexample.facttables.crm_table.country_code =
     testexample.lookups.country_lookup.country_code
GROUP BY country_name;

Tip: In DBVisualizer, note that you need to use the ALIAS or AS construct to name the column. For example, sum(purchase) should be aliased as purchase. This limitation does not exist in the Data Distiller SQL Query Editor.

The results will be:

Choose prod:all when creating data models in the Accelerated Store.

Highlight and execute a query

Querying the testexample.facttables.crm_table in the Accelerated Store

Querying the testexample.lookups.country_lookup in the Accelerated Store

Results of a join operation in the Accelerated Store

Results of an aggregation query on the Accelerated Store


https://data-distiller.all-stuff-data.com/unit-7-data-distiller-business-intelligence/bi-300-dashboard-authoring-with-
data-distiller-query-pro-mode * * *

1. Unit 7: DATA DISTILLER BUSINESS INTELLIGENCE

BI 300: Dashboard Authoring with Data Distiller Query Pro Mode


This tutorial goes through the steps of building a dashboard using SQL Chart Authoring, Drillthroughs and Global
Filters.

In this tutorial, we will explore how to create powerful, enterprise-grade dashboards that rival the best-in-class
dashboards found in traditional business intelligence (BI) tools. The goal is to demonstrate that you don’t need to rely
on expensive BI vendors or invest extra resources for users to build and consume insights effectively. By leveraging
SQL as the foundation for creating these charts, we can unlock the full potential of your data without the need for
additional software or tools.

The tutorial is designed with two key audiences in mind: the data team and the business users. The data team will
focus on building foundational charts, ensuring that they include the necessary filters and dimensions to provide
flexibility. This allows business users to perform deep, drill-down analyses and gain actionable insights directly from
the dashboards. This approach streamlines the workflow, enabling teams to quickly create interactive and insightful
dashboards without having to leave the Data Distiller environment.

By the end of the tutorial, you’ll have the skills to create sophisticated dashboards that empower business users to
make data-driven decisions—without the complexity and cost typically associated with external BI platforms.

Why SQL for Chart Authoring?

Data Distiller natively supports SQL which continues to be a powerful tool for marketing analysts because it allows
them to access and manipulate large datasets directly, without relying on pre-built reports or static dashboards. With
SQL, marketing analysts can:

1. Create Custom Metrics: Marketing data often requires complex calculations, such as lifetime value, churn rates, or multi-touch attribution. With almost every customer we have worked with, we have seen SQL used to define these metrics precisely, tailor them to specific campaigns, and adjust them as needed.

2. Data Exploration: SQL allows marketing analysts to dig into granular data, performing detailed segmentation,
cohort analysis, and cross-channel performance reviews, giving a more complete picture of customer behavior.

3. Faster Decision-Making: By querying data in Adobe Experience Platform Data Lake directly, marketing
analysts can quickly respond to changing trends and real-time campaign performance, reducing the dependency
on other teams or fixed reporting schedules. So, this is a good skill to have on your team.

4. Flexibility: SQL provides the flexibility to create dynamic, customizable reports that can be adapted for different
stakeholders, from top-level summaries for executives to detailed insights for campaign managers. You are not
restricted by how you build the data layer or the authoring layer.

Overall, SQL empowers marketing analysts to go beyond basic analytics, driving more data-driven decisions and
optimizing marketing strategies with precision. And with LLM-powered chat assistants, learning SQL or debugging
SQL is no longer a challenge!

Pro Tip: While that creative companion (yes, your trusty chat assistant) is here to help, make sure you fully understand
what it suggests before using it. And remember, never send any of your company’s sensitive data or queries to the bot!
Instead, get savvy by crafting examples using dummy data. Stay sharp and secure!

Metrics are the lifeblood of a marketing team’s campaigns, and no two companies are alike. Neither are their metrics.

One of the main challenges with using drag-and-drop interfaces to create charts is the lack of flexibility when it comes
to defining custom metrics. While these interfaces are convenient, they often fall short in handling more complex
calculations. On the other hand, SQL provides unmatched flexibility in metric definition. For instance, if you want to
visualize a trailing 30-day average using a rolling window for each date, achieving this in a typical drag-and-drop
dashboard interface would be nearly impossible. You would likely need to recompute the entire metric at the ETL layer
instead.
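
As an illustration, this kind of metric is a one-liner with a SQL window frame. The table and column names below (a hypothetical daily_revenue table with date and revenue columns) are placeholders used only to show the pattern, and the frame assumes one row per day:

SELECT
  date,
  revenue,
  -- average of the current row and the 29 preceding daily rows
  AVG(revenue) OVER (
    ORDER BY date
    ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
  ) AS trailing_30_day_avg
FROM daily_revenue;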

Follow the steps to create your first table in the accelerated store here:

1. Once you have completed the Prerequisites, make sure you have executed the following queries in sequence:

-- Create the Database in the Accelerated Store
CREATE DATABASE testexample WITH (TYPE=QSACCEL, ACCOUNT=acp_query_batch);

-- Create the Schemas on the Database
CREATE SCHEMA testexample.lookups;
CREATE SCHEMA testexample.facttables;

-- Create the Tables
CREATE TABLE testexample.facttables.crm_table AS
SELECT cast(null as int) purchase, cast(null as string) country_code
WHERE false;

CREATE TABLE testexample.lookups.country_lookup AS
SELECT cast(null as string) country_code, cast(null as string) country_name
WHERE false;

-- Define a key relationship between the country code and the lookup
ALTER TABLE crm_table ADD CONSTRAINT FOREIGN KEY (country_code)
REFERENCES country_lookup(country_code) NOT enforced;

-- Rename the data models to give them user-friendly names in Query Pro Mode
ALTER MODEL testexample.facttables RENAME TO test_purchase;
ALTER MODEL testexample.lookups RENAME TO test_country;

-- Hydrate the Tables
INSERT INTO testexample.facttables.crm_table (country_code, purchase)
(SELECT country, purchase_amount FROM purchases_dataset_crm_data);

INSERT INTO testexample.lookups.country_lookup (country_code, country_name)
(SELECT country_code, country FROM (
  SELECT count(*), country_code, country
  FROM country_codes
  GROUP BY country_code, country));

-- Data Exploration Queries
SELECT * FROM crm_table;
SELECT * FROM country_lookup;

2. Make sure that the Database dropdown is set to **prod:all** in the top right corner and save it as a Query
Template

Data Distiller Charts Primer

1. Navigate to Dashboards->Create Dashboard->Add Widget to explore what charts do.

2. Here are the key concepts to keep in mind when building a chart in Data Distiller Dashboards:

1. There are 5 key chart types available under the Marks section that cover most of the visualization needs for
business users:

1. Line Chart (aka Trend Chart): Ideal for visualizing data trends over time, helping users track
changes, growth, or patterns.
2. Donut Chart (aka Pie Chart): Useful for showing parts of a whole, making it easy to visualize
percentage breakdowns or categorical distributions.

3. Bar Chart: Effective for comparing quantities across different categories, making it versatile for a
variety of data comparisons.

4. Table: Allows users to display raw data in a structured format, perfect for detailed reports or numeric
summaries.

5. Big Number Chart (aka KPI Chart): Highlights a single key metric, providing an at-a-glance view
of important figures like revenue, user growth, or performance metrics.

These chart types cover a wide range of use cases, from tracking trends and showing proportions to
comparing values and summarizing key performance indicators (KPIs).

2. Most charts feature an X and Y axis where you can assign attributes from the Attributes panel. These
attributes are typically categorized into three types:

Metrics (123): These represent numerical values, such as sales or revenue, and are often used for
calculations or aggregations.

Dimensions (ABC): These are categorical fields, such as names, categories, or locations, that
represent groupings or categories of data.

Dates (calendar icon): These are the date fields that you have defined as columns in your tables in the
Accelerated Store.

The X and Y axes allow you to plot data using these attributes, with metrics usually plotted on the Y-axis
(vertical) and dimensions/dates on the X-axis (horizontal), enabling you to visualize patterns, trends, or
distributions across different categories.

3. Both the X and Y axes in Data Distiller have built-in aggregation functions like MIN, MAX, SUM, and
AVG to easily summarize data for visualization. However, when using Data Distiller Query Pro Mode
with SQL authoring, these built-in functions become less necessary because you would typically handle
the aggregation directly in your SQL query. This gives you greater control over how data is processed and
presented, allowing you to fine-tune your calculations and ensure the desired outcome without relying on
the built-in functions provided by the visualization tool.

4. There is a third visualization dimension called Color that can enhance the richness of the chart. Adding an
attribute to the Color dimension in a bar chart will create a stacked bar graph, where the bars are
segmented or “sliced” by that attribute. This allows you to break down the values within each bar into
subcategories, providing more detailed insights within the same chart. The use of color helps to visually
distinguish between different segments, making it easier to interpret the data and see how different groups
or categories contribute to the total value.

3. On the right-hand side, you’ll find the chart properties panel, where you can customize various aspects of the
chart. This includes:

Naming the Chart: You can assign a title or label to the chart for easy identification.

Legend Placement: You can control the position of the chart’s legend (e.g., top, bottom, left, or right) to
improve readability.

Naming Visualization Dimensions: You can rename the labels for the X and Y axes to better represent the
data being displayed.
These customization options help tailor the chart to your specific needs, ensuring clarity and improving the
overall presentation of the data.

4. Hierarchical Design: Each dashboard supports a maximum of 10 charts to ensure performance and encourages
users to design their dashboards in a more hierarchical manner. If there are charts that provide detailed
information, they should be placed in separate dashboards that can be accessed through Data Distiller
Drillthroughs. This approach promotes clarity and efficient organization by allowing users to navigate from
high-level summaries to more detailed views without overcrowding the main dashboard.

5. All charts have the following when deployed on a dashboard:

1. ViewMore: Shows you the tabular data with pagination along with the chart. You can download the results
as a CSV file.

2. ViewSQL: Shows the underlying SQL behind the chart. You can copy the SQL or execute it in the Data
Distiller Query Pro Mode Editor.

3. Drillthrough: If a dashboard is attached as a Drillthrough dashboard for a chart, it becomes accessible for
users. They can drill down into the linked dashboard directly from the chart, allowing for deeper exploration
of the data.

6. Export as PDF: All dashboards also include an Export to PDF feature, which generates a single-page PDF
version of the dashboard for manual email distribution or archival purposes.

7. Chart Summary: The table above provides a helpful guide to understanding different types of visualizations
available in dashboards, along with when and how to use them. Here’s a more detailed description for each type
of chart to make it easier to understand:

1. Big Number (KPI Chart): This chart is used to display a single, important number. It’s ideal for showing
key performance indicators (KPIs), such as total sales or total website traffic. It gives you a high-level, easy-
to-read snapshot of a key metric.

1. X and Y Axes: Since this chart only shows one big number, there’s no need for X or Y axes. The
metric is placed on Value dimension.

1. Color: No color is needed because it’s just one number.

2. Example: Think of this as showing the total web traffic for your site in big, bold numbers.

2. Line Chart (Trend Chart): This chart is great for showing trends over time. For example, if you want to
see how web traffic changes week by week, a line chart will plot those changes over time.

1. X Axis: The X axis represents time, like days, months, or years. This allows you to see how things
change over a period.

1. Y Axis: The Y axis shows a metric, like the number of visitors or revenue, so you can track how
it changes over time.

2. Color: You can add color to represent different categories (e.g., showing different lines for
various traffic sources like social media, email campaigns, etc.).

3. Example: A line chart could show web traffic trends for the top 5 marketing channels over the
last year.

3. Bar Chart: Bar charts are used for comparing categories side by side. For example, you might want to
compare the web traffic from different countries or the sales of different products.
1. X Axis: The X axis represents categories, such as different countries or product names.

1. Y Axis: The Y axis shows the value of a metric, such as the number of visitors or total sales.

2. Color: Adding a color dimension allows you to break down each bar further. For example, you
could split the traffic from each country into different sources, like social media, direct traffic,
and search engines.

3. Example: A bar chart could show how much traffic each country is generating, with colors
dividing the traffic sources for each country.

4. Donut Chart (Pie Chart): Donut or pie charts are used for showing parts of a whole. They allow you to
visualize how much each segment contributes to the total.

1. X and Y Axes: There are no axes on a donut chart because the chart itself shows proportions. The
metric is placed on Value dimension.

1. Color: Each segment of the donut is colored differently to represent a category, making it easy to
see how much each category contributes.

2. Example: A donut chart could show the percentage of traffic that comes from different marketing
channels, with each channel represented by a different color.

5. Table: Tables display detailed data in a grid format, allowing you to see raw numbers in a structured way.
This is great when you need to present multiple data points for comparison.

1. X and Y Axes: Tables don’t use traditional X and Y axes. Instead, they have headers at the top, which
label each column (e.g., date, traffic source, number of visitors).

1. Color: Tables don't use color for individual cells.

2. Special Note: A table will only display the first 5 columns when placed in a dashboard. It will display up to 20 columns when viewed in View More mode. Each page in View More mode will have 500 rows. There is a Download CSV option that lets you download the 500 rows on that page with all the underlying dimensions present in the table (no restriction to the 5 or 20 dimensions being visualized).

3. Example: A table could display detailed web traffic numbers for each channel and date, allowing
users to drill down into the data.

Big Number (KPI): A simple, large number showing an important metric, like total web traffic.

Line Chart (Trend): Tracks changes over time, showing trends in metrics like traffic or sales.

Bar Chart: Compares different categories (e.g., countries, products) using bars.

Donut Chart (Pie): Shows how different categories contribute to a total, using a donut-shaped chart.

Table: Presents raw, detailed data in a grid for in-depth analysis.

Build Bar Chart: Country By Revenue

1. In the AEP UI, navigate to Dashboards->Create Dashboard


2. Click Create Dashboard. Name the dashboard as QueryProModeDemo. Choose Query Pro Mode. Click
Save.

3. Minimize the left navigation bar and click on Add Widget

4. In the Query Pro Mode Editor, choose the Data Model titled test_purchase and execute the following query:

SELECT * FROM testexample.facttables.crm_table

5. Click Select to select the results of this query, on which we will build the chart. On the next screen, add the attribute **purchase** to the X axis and **country_code** to the Y axis. Choose the bar chart and do the labeling as shown in the diagram below.

1. Name of the chart: Country by Revenue

2. X Axis Label: Revenue (US Dollars)

6. Save and Close the chart. Now resize the chart on the dashboard by dragging the bottom right corner.

7. Resize the chart on the dashboard. Then click Save. Then Click Cancel to exit the Edit Mode.

8. Click on Edit and then Add Widget

Build a Table: Federated Queries Across Data Models

1. Let us now execute a query across the two data models. We have lookups in the **test_country** data model and the raw purchase data in **test_purchase**.

The Data Distiller Query Pro Mode Editor you are using for chart creation cannot access tables in the AEP Data
Lake due to a 60-second timeout limit for chart queries. If a query takes longer to run because of the latency in reading
from the data lake, it will time out. This is one of the reasons why the Accelerated Store was built—to mitigate these
latencies caused by the data infrastructure layer. While data lakes are excellent for storing large volumes of data,
running exploration queries that can take minutes, and executing batch jobs to create new datasets, they are not
optimized for dashboarding. In contrast, the Accelerated Store’s data layer is designed to efficiently serve queries for
dashboards.

However, in the main Data Distiller Query Pro Mode Editor within the AEP UI (i.e., Queries → Create Query),
queries can be executed across both the AEP Data Lake and the Accelerated Store in a federated manner. These queries
can read from and write to both stores without the 60-second timeout restriction, although they may time out after 10
minutes. This flexibility allows for longer-running queries that can return results in minutes, avoiding the limitations
present in chart creation.

1. Copy, paste, and execute the following query with the data model chosen as **test_country**:

SELECT sum(purchase) AS purchase, country_name
FROM testexample.facttables.crm_table
INNER JOIN testexample.lookups.country_lookup
  ON testexample.facttables.crm_table.country_code =
     testexample.lookups.country_lookup.country_code
GROUP BY country_name;

It is required that you list the full path for chart authoring, as was done above, i.e. **testexample.lookups.country_lookup** and **testexample.facttables.crm_table.country_code**.

1. The editor should look like this:

2. Click Select to select the results of this query, on which we will build the chart. Choose Table under Marks.

3. Click on the **purchase** attribute and add it to Header 1. Rename the header to Revenue and choose Sortable, which allows you to sort the rows by that column in View More mode.

4. Click Add Header to create Header 2 and add the next attribute **country_name** to Header 2.

5. Name Header 2 as Country.

6. Change the order of the columns by using the Up Arrow. Make minor adjustments to the Revenue label: change it to Revenue (US Dollars). Also, rename the table.

7. Resize the chart using the bottom right arrow:

8. Click on Save. Then Click Cancel.

9. Click on View More mode in the table by clicking on the ellipsis. You can sort the columns and even resize them. There is a Download CSV option that lets you download 500 rows per page. You cannot download the whole table as a CSV.

A table will only display the first 5 columns when placed in a dashboard. It will display up to 20 columns when viewed in View More mode. Each page in View More mode will have 500 rows. There is a Download CSV option that lets you download the 500 rows on that page with all the underlying dimensions present in the table (no restriction to the 5 or 20 dimensions being visualized).

1. As an exercise, go through the same steps as outlined before to create a Big Number Chart using the SQL queries shown below.

Be careful to choose the **test_purchase** data model in the dropdown each time you build a chart below; otherwise you will get errors.

1. Total Revenue (US Dollars)

SELECT SUM(purchase) AS Total_Revenue FROM testexample.facttables.crm_table;

2. Number of Countries

SELECT COUNT(DISTINCT country_code) AS Number_of_Countries FROM testexample.facttables.crm_table;

3. Transaction Volume

SELECT COUNT(purchase) AS Transaction_Volume FROM testexample.facttables.crm_table;

2. You should be able to drag and resize the Big Number charts. Your final dashboard should look like this:

Improve Bar Chart Readability

1. Our bar chart automatically handles certain aspects of the visualization, but it doesn’t sort the data and will
truncate the results - which is all the more reason to use a table as we did. As a dashboard designer, it’s essential
to monitor both the data and the visualization closely to ensure that what is being displayed accurately reflects the
underlying data.

2. Our chart has bars that are not sorted. This creates issues for our business users who would like to see things
sorted.

3. Click Edit the Dashboard. Click Edit the bar chart. And make sure you click on the pencil icon to access the
Data Distiller Query Pro Mode Editor

4. Make the smallest changes to the SQL. Let us do the aggregation ahead of time and do the sorting in the result set
that is feeding the bar chart itself.
SELECT SUM(purchase) AS purchase, country_code
FROM testexample.facttables.crm_table
GROUP BY country_code
ORDER BY purchase DESC

5. You should see the following preview of the chart, which is hard to parse because the bars are so close together:

6. So we now have to make a tradeoff for readability. We have the detailed numbers in the table chart we created, which can show the long tail. We are now going to truncate the results to show just the right amount of the long tail.

SELECT SUM(purchase) AS purchase, country_code
FROM testexample.facttables.crm_table
GROUP BY country_code
ORDER BY purchase DESC
LIMIT 20

7. Now the chart, renamed Top 20 Countries by Revenue, will look like the following:

8. Just make sure that you have Aggregation set to None on both the X and the Y axes:

Improve Dashboard Readability and Navigation

1. Let us rearrange the dashboard with charts following a pattern - summaries at top via Big Number charts, then
more detail through Bar Charts and finally the detailed data via Tables.

2. The dashboard should look like this:

3. We need to make some more adjustments to make this dashboard appealing and interesting

1. We need to create a new summary donut chart that compares the percentage contribution of the Top 5
Countries By Revenue against the rest.

2. The detailed table that we have now should be made available as a Data Distiller Drill Through on the bar
chart so that a user who wants to truly dig into the data can see the raw data.

3. Notice how I adjust my charting requirements and seamlessly adapt to the new ones without needing to
perform ETL or overhaul the entire project. Additionally, the metric mentioned in (1) is somewhat arbitrary.

Revenue Contribution Donut Chart

1. Top 5 Countries Contribution vs. Rest: First, let us go back to the main Data Distiller Query Pro Mode Editor
and prototype the query there:

SELECT
  (top_5_purchase / total_revenue) * 100 AS top_5_percentage,
  (rest_purchase / total_revenue) * 100 AS rest_percentage
FROM (
  SELECT
    SUM(CASE WHEN rank <= 5 THEN purchase ELSE 0 END) AS top_5_purchase,
    SUM(CASE WHEN rank > 5 THEN purchase ELSE 0 END) AS rest_purchase
  FROM (
    SELECT
      country,
      SUM(purchase_amount) AS purchase,
      RANK() OVER (ORDER BY SUM(purchase_amount) DESC) AS rank
    FROM purchases_dataset_crm_data
    GROUP BY country
  ) AS ranked_countries
) AS purchases_data
CROSS JOIN (
  SELECT SUM(purchase_amount) AS total_revenue
  FROM purchases_dataset_crm_data
) AS total_data

2. You should be in this editor and prototyping your results against the raw dataset in the AEP Data Lake. Go to
AEP’s Left Navigation bar and navigate to Queries->Create Query. You can see that the top 5 countries
contribute around 44% of total revenue.

3. Important: If you look at the structure of the query, there is a window function (**RANK**), several **SUM** aggregations, and even a **CROSS JOIN** happening at the same time. When I execute this query, it takes about 20 seconds. Remember that the Accelerated Store can process 4 queries at a time. If I start overloading the system with a heavy-duty query like this one, it can cause one or many charts to time out. Let us go ahead and create a table in the Accelerated Store using similar SQL code, which should be easy. Open up the Data Distiller main editor, add the following queries, and execute them in sequence (highlight the query and use the Run Selected Query button next to the Run button).

4. Create the table underneath the fact tables as this is also a fact. Make sure that you choose **Decimal(10,2)** as the data type:

CREATE TABLE testexample.facttables.contribution_table AS
SELECT
  cast(null as Decimal(10,2)) top_5_countries,
  cast(null as Decimal(10,2)) rest_countries
WHERE false;

**DECIMAL(10,2)** refers to a fixed-point or exact numeric data type with a precision of 10 digits and a scale
of 2 digits. Here’s what that means:

Precision (10): This is the total number of significant digits that can be stored. In this case, the total number of
digits is 10.

Scale (2): This indicates the number of digits to the right of the decimal point. In this case, 2 digits are reserved for the decimal portion. For example, the largest value a DECIMAL(10,2) column can hold is 99999999.99.

1. Insert the data from the AEP Data lake query into this table:

INSERT INTO testexample.facttables.contribution_table (top_5_countries, rest_countries)
(SELECT
  (top_5_purchase / total_revenue) * 100 AS top_5_percentage,
  (rest_purchase / total_revenue) * 100 AS rest_percentage
FROM (
  SELECT
    SUM(CASE WHEN rank <= 5 THEN purchase ELSE 0 END) AS top_5_purchase,
    SUM(CASE WHEN rank > 5 THEN purchase ELSE 0 END) AS rest_purchase
  FROM (
    SELECT
      country,
      SUM(purchase_amount) AS purchase,
      RANK() OVER (ORDER BY SUM(purchase_amount) DESC) AS rank
    FROM purchases_dataset_crm_data
    GROUP BY country
  ) AS ranked_countries
) AS purchases_data
CROSS JOIN (
  SELECT SUM(purchase_amount) AS total_revenue
  FROM purchases_dataset_crm_data
) AS total_data);

2. Verify that the table has been hydrated:

SELECT * FROM testexample.facttables.contribution_table

The results should be returned very fast:

Notice that the dimensions are displayed as column names, which isn’t ideal for creating charts. This is a common
issue you may face. In the next section, we’ll work on resolving this. After all, Query Pro Mode allows us to quickly
reshape data on the fly. However, keep in mind that this may come with a slight performance impact.

1. Now go into the Data Distiller Query Pro Mode in Dashboards and make sure you choose
**test_purchase** as the data model and execute the following query:

SELECT 'Top 5' AS category, top_5_countries AS value
FROM testexample.facttables.contribution_table
UNION ALL
SELECT 'Rest 5' AS category, rest_countries AS value
FROM testexample.facttables.contribution_table;

2. The donut chart authoring should look like this:

3. The dashboard after you have saved it should look like this:

Enable Data Distiller Drillthroughs

1. Rearrange the dashboard so that the bar and the donut charts are side by side. Save the dashboard.
2. Open up a new AEP tab and navigate to Dashboards. Create a new dashboard called Detailed Revenue by Country. In Query Pro Mode, type the following query. You should be able to retrieve the query from ViewSQL in the previous QueryProModeDemo dashboard.

Here is the code to copy and use. Make sure that the data model chosen is **test_purchase** in the data model
dropdown:

SELECT sum(purchase) AS purchase, country_name
FROM testexample.facttables.crm_table
INNER JOIN testexample.lookups.country_lookup
  ON testexample.facttables.crm_table.country_code =
     testexample.lookups.country_lookup.country_code
GROUP BY country_name;

1. Configure the table as you did before (this should be fast)

2. Rearrange the dashboard and save it

3. Go back to the QueryProMode dashboard and edit the bar chart titled Top 20 Countries by Revenue. Enable
Drillthrough and choose the dashboard Detailed Revenue by Country from the alphabetical list

4. Click on the ellipsis for the Top 20 Countries by Revenue bar chart.

5. You should see the drill-down dashboard (which is pretty cool). If you click on the parent dashboard, you should be able to go back up. You can create as many drillthroughs as you like and decide which direction you want to navigate.

6. Go back to the QueryProModeDemo dashboard and delete the table there. At this point, you have a high-level
summary dashboard with drilldowns.

A new request from the business users is that they want to apply filters to three specific Big Number chart widgets
shown below, while leaving the others unaffected.

1. In the QueryProModeDemo dashboard, click on Add Filter->Global Filter

2. Conceptually, a global filter applies a user-selected value to a chart query, assuming the relevant column exists in
the underlying table. For instance, if we want to filter the queries by country name, we would need to translate
the **country_name** to **country_code** and apply it to the three chart queries that use the
**testexample.facttables.crm_table** table. However, displaying **country_code** on the
dashboard is not user-friendly. To improve the user experience, we can allow users to select a country name, and
internally, this selection would be mapped to the corresponding country_code in the
**testexample.facttables.crm_table** table.

It turns out that the answer to our problem is in **testexample.lookups.country_lookup**, which has the mapping.

SELECT concat('''', country_code, '''') AS Id,
       country_name AS value
FROM testexample.lookups.country_lookup

Observe the format of the table that is being created: it has an **id** column, whose values will be passed as strings internally to the **testexample.facttables.crm_table** table, and country name values in the **value** column that the business user will select on the dashboard.

1. This is how the results should look like:

2. Click Next. Name the filter as **Country_Code** and do not choose any default values:
3. Click Select. You should now see the Global Filter icon appear in the dashboards.

4. Click on the Global Filter icon to open a pop-up. Select any values, and you’ll notice that none of the charts
update. This is because the global filter hasn’t been linked to the charts yet. We still need to specify which charts
should have the global filter applied. Additionally, for the filter to work, the relevant column must exist in the
table that the chart’s query is based on.

5. Let us open the Big Number chart for Total Revenue (US Dollars). Open the editor and add the following code:

SELECT SUM(purchase) AS Total_Revenue
FROM testexample.facttables.crm_table
WHERE country_code IN ($country_code_filter);

We have just added **$country_code_filter** as a runtime parameter.

1. Press the play button and you should see a screen that looks like this. The query should fail and you should
navigate to Query Parameters

2. Choose the value of ‘AF’ to test if the filter works:

3. The result should look like:

4. Copy the following filter code:

WHERE country_code IN ($country_code_filter)

Tip: If there are multiple parameters, they can be combined in any filter condition within the chart. The global filter
simply passes the column values into the chart, acting as runtime parameters for the query. The WHERE clause can then
use these values in any combination that SQL allows. As a result, each global filter can be interpreted uniquely
depending on the chart’s context.
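
For instance, a minimal sketch: a chart query could combine the country filter from this tutorial with a second, hypothetical parameter (the $min_purchase_filter name below is illustrative and is not created anywhere in this tutorial):

SELECT SUM(purchase) AS Total_Revenue
FROM testexample.facttables.crm_table
WHERE country_code IN ($country_code_filter)   -- bound to the Country_Code Global Filter
  AND purchase >= $min_purchase_filter;        -- hypothetical second runtime parameter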

1. Click Select. Then enable the Global Filter and bind the Country_Code Global Filter (which was a collection of **country_code** id and **country_name** value pairs) to the **country_code_filter** parameter in the chart query. Thus, values defined in the Country_Code Global Filter can now be chosen by the business user. When they are chosen, the values are sent as **country_code_filter** to the chart query.

2. Click on Save and Close. Do the same operation on all the other charts. Make sure you add the following to their
chart query and remove any semicolons:

WHERE country_code IN ($country_code_filter)

3. Now apply the Global Filter with the value of Japan

4. You should get the following dashboard with the filtered charts at the top:

5. If you want to change the Global Filter, click the Global Filter Edit icon.

Recap: We started by creating an **id**-based table for the Global Filter. The **values** in this table
represent the options users can select from the filter dropdown. The filter will only apply to tables that contain the
corresponding **id** values in one of their columns. For the charts, the query needs to be parameterized by adding
a filter condition on that column (using the **IN** clause for multiple values). This parameter must be linked to the
Global Filter's **id** in the chart configuration. When a user selects a **value** from the Global Filter, the
corresponding **id** is passed to all the charts in the dashboard. Only charts that can accept that **id** as a
parameter will apply the filter condition to the query, refining the results accordingly.

Pushdown Filter in Drillthroughs


1. The children of a dashboard, on which a Drillthrough is applied, can have the Global Filter applied to them as long as they are bound to the same filter. This means that the Global Filter will affect all child elements in the dashboard if they are properly connected to the filter, even when Drillthrough actions are involved.

2. We want to be able to apply the Global Filter in Total Revenue (US Dollars) chart below to be pushed down
(propagated down) to the dashboard below:

3. Let us examine the query behind Detailed Revenue by Country dashboard:

SELECT
sum(purchase) AS purchase, country_name
FROM
testexample.facttables.crm_table
INNER JOIN
testexample.lookups.country_lookup
ON testexample.facttables.crm_table.country_code =
testexample.lookups.country_lookup.country_code
GROUP BY
country_name

4. Observe that **country_code** is already present in the tables being joined; we just have not used it in the query. Let us add it along with our **$country_code_filter** parameter:

SELECT
SUM(purchase) AS purchase, country_name
FROM
testexample.facttables.crm_table
INNER JOIN
testexample.lookups.country_lookup
ON testexample.facttables.crm_table.country_code =
testexample.lookups.country_lookup.country_code
WHERE
testexample.facttables.crm_table.country_code IN ($country_code_filter)
GROUP BY
country_name;

1. You need to create a brand new filter, as there is currently no option to select from a library of Global Filters.
Make sure to give it the exact same name as the one previously used in the Global Filters section

2. You should be able to see that the filters are propagated down:

3. The Date Filter is very similar to the Global Filter except that the table of filter values is predefined. It is in the **DATE** format, i.e. **YYYY-MM-DD**. There are also presets defined.

4. The Date Filter will use the following format in SQL fragment:

WHERE registration_date BETWEEN CAST('$START_DATE' AS DATE) AND CAST('$END_DATE' AS DATE)

What the above means is that the query is filtering results to include only rows where a specific date column falls
within the range of two dates, **$START_DATE** and **$END_DATE**. Here’s a breakdown:

**BETWEEN**: This keyword is used to check if a value falls within a specified range, inclusive of both the
start and end values.
**CAST('$START_DATE' AS DATE)**: Converts the value of the variable **$START_DATE** into a
**DATE** data type.

**CAST('$END_DATE' AS DATE)**: Similarly, this converts the value of the **$END_DATE** variable into a **DATE** data type.

Thus, the SQL fragment is checking if a particular date field in the query lies within the inclusive range of **$START_DATE** and **$END_DATE** once they are converted to the **DATE** type. Like Global Filters, we have two parameters, **$START_DATE** and **$END_DATE**, that we will bind to the Date Filter.

For example, if **$START_DATE** is **'2021-01-01'** and **$END_DATE** is **'2021-12-31'**, the query would return rows where the date column is between January 1, 2021, and December 31, 2021, inclusive.
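
With those example values substituted, the fragment resolves to a plain date-range predicate:

WHERE registration_date BETWEEN CAST('2021-01-01' AS DATE) AND CAST('2021-12-31' AS DATE)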

1. Let us first create a table that has the names of customers who registered as loyalty members. Make sure you execute each of the SQL queries by selecting the query in the editor and using the Run Selected Query option:

-- Explore the Registration Table
SELECT first_name, last_name, registration_date FROM purchases_dataset_crm_data;

-- Create a New Table in the Accelerated Store
CREATE TABLE testexample.facttables.registration_table AS
SELECT
  cast(null as string) first_name,
  cast(null as string) last_name,
  cast(null as DATE) registration_date
WHERE false;

-- Insert data into this new table
INSERT INTO testexample.facttables.registration_table (first_name, last_name, registration_date)
(SELECT first_name, last_name, TO_DATE(registration_date, 'MM/dd/yyyy') AS DATE_KEY
FROM purchases_dataset_crm_data);

You should see the following:

1. Create a table with the following SQL in the QueryProMode dashboard:

SELECT * FROM testexample.facttables.registration_table ORDER BY registration_date DESC

Observe that if you keep the **ORDER BY** and flip between **DESC** and **ASC**, you can find the date range: 2020-01-01 to 2021-12-30.
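
Alternatively, a small sketch using the same table: a single aggregation returns both ends of the range without flipping the sort order.

SELECT MIN(registration_date) AS first_registration,
       MAX(registration_date) AS last_registration
FROM testexample.facttables.registration_table;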

1. Resize and rearrange the dashboard to show the table:

2. Click on Add Filter and choose Date Filter:

3. Configure the Date Filter to choose Dec 31, 2021 as the default date as we have data for 2 years i.e. 2020 and
2021

4. The dashboard should look like this. Note that the date is displayed in a format that is the reverse of the DATE format.

5. If you open up the Registration Table chart, you will see two new parameters:

6. Add this SQL fragment for the Date Filter. Input the parameters **$START_DATE** as 1/1/2020 and **$END_DATE** as 12/31/2021.

SELECT *
FROM testexample.facttables.registration_table
WHERE registration_date BETWEEN CAST('$START_DATE' AS DATE) AND CAST('$END_DATE' AS DATE)
ORDER BY registration_date ASC;

7. Bind the Date Filters as shown in the figure below:


8. The Date Filter should look like this. You can type the dates directly if you need to:

9. Results of the Date Filter query are shown below:

In conclusion, this tutorial shows that you can create powerful, interactive dashboards using SQL without relying on
expensive BI tools. By empowering both data teams and business users to collaborate within the Data Distiller
environment, you can streamline the process of building insightful, data-driven dashboards. This approach eliminates
the need for external BI platforms, reducing complexity and cost while still delivering high-quality, actionable insights
for decision-making.

Last updated 5 months ago

Make sure that you choose prod:all in the database dropdown.

Anatomy of a Data Distiller Chart

Chart summary - when to use what and why

Navigate to dashboards inventory page

Chart added to the dashboard

Click Cancel to exit the Edit Mode

Final result after joining the two tables.

Choose Table as the Option

Choose the columns in the table.

Choosing the next column.

Make adjustments to the table you created.

Adjust the length and width of the table to make it look visually appealing.

View More gives you more columns and data to view than the table in the dashboard.

Data Distiller Query Pro Mode access is through the pencil icon

The long tail that we typically see in data.

Top 20 countries by revenue

Aggregations should be marked as none

Dashboard readability has improved

Go to Left Navigation bar and navigate to Queries->Create Query

Chart authoring for the SQL result set

The dashboard should look like this.

Summary metrics are side by side

ViewSQL gives you the Query Pro Mode SQL that you can reuse across charts.
Query copy pasted with the test_purchase as the data model

The detailed dashboard that has the most information

Enabling Drillthroughs for drill-down dashboards

Accessing the Drillthrough hyperlink

Parent dashboard is available in the top left corner. Clicking on it will take you one level up.

Global filters should apply to these widgets only.

Generating a global filter on testexample.lookups.country_lookup table.

Name the filter but do not choose default values.

Global filter icon appears in the dashboard.

Global Filter dropdown in the dashboard

Charts are unaffected because the global filter has not been attached to the 3 charts.

Query parameters are not set

Choosing a parameter for testing the query to see if it will give the correct result when the Global Filter is applied.

Results returned as part of passing a parameter to a query.

Enable the Global Filter for this chart and bind it to the country_code_filter parameter

Apply the global filter with a value of Japan

The three charts at the top that are attached to the Global Filter update.

You can modify the Global Filter you have created.

Total Revenue (US Dollars) has a Drillthrough dashboard

The Global Filter is not applied to the underlying dashboard underneath.

Filters are propagated down

Create the registration table

SQL query for the table authoring

Registration table is added to the dashboard.

Choose the options in Date Filter

Choose Dec 31, 2021 as the default date.

Date Filter added to the dashboards.

$START_DATE and $END_DATE are the two parameters

Choose the date parameters


Parameters in the SQL query are available for binding

Date Filter keeps changing the results based on the filter values.

https://data-distiller.all-stuff-data.com/unit-7-data-distiller-business-intelligence/bi-500-optimizing-omnichannel-
marketing-spend-using-marginal-return-analysis * * *

In this tutorial, we explore how to optimize marketing spend across various channels by leveraging Data Distiller to
analyze and visualize marketing effectiveness. The dataset includes about a million records of marketing activities
spanning multiple channels—paid search, social media, email marketing, and display ads—across different dates,
promotional activities, and economic conditions. Each record captures key data points such as marketing spend,
revenue generated, promotional activity status, and economic condition for a given day. The assumption is that you
have acquired this data from the various channels and have harmonized the data using Data Distiller.

Download the sample data:

Then load it into AEP using the following tutorial:

The primary objective of this case study is to help a fictional company, “RetailHub Inc.,” determine the marginal
returns of each marketing channel. The analysis will enable RetailHub to:

1. Identify which marketing channels yield the highest returns relative to investment.

2. Detect the break-even point where additional marketing investment no longer yields profitable returns.

3. Understand how external factors like promotional activities and economic conditions influence channel
effectiveness.

Note that the data has been standardized and harmonized using Data Distiller. Even without loading the data into AEP,
you can open the CSV file in Excel to see the standardization:

Data harmonization is an assumed step.

Data harmonization is a foundational use case in Data Distiller, with the primary goal of bringing omni-channel data into a consistent format within the system. Harmonization is a crucial preprocessing step for analyzing datasets that may come from various sources and formats. For this dataset, harmonization ensures that all data entries, across different marketing channels, economic conditions, and promotional activities, are standardized, consistent, and ready for effective analysis in our use case (a brief SQL sketch follows this list):

1. Date Standardization:

Ensuring that dates are in a consistent format (e.g., **YYYY-MM-DD** or **MM/DD/YYYY**) allows
for accurate time-based filtering and sorting.

Any time components, if not relevant, might be normalized to midnight or removed for simplicity, focusing
on daily summaries.

2. Consistent Channel Naming: Marketing channels like “paid_search,” “social_media,” “display_ads,” and
“email_marketing” appear with consistent naming. Harmonizing ensures that channels are not duplicated with
different spellings or formats (e.g., “Paid Search” vs. “paid_search”).

3. Spend and Revenue Harmonization:

The spend and revenue columns need to be in a consistent currency and format (e.g., USD, with two
decimal places).
Handling missing or zero values to avoid skewed marginal return calculations. For example, filling missing
spend or revenue values with a logical default or calculating them based on available data if appropriate.

4. Categorization of Promotional Activity:

The **promo_activity** column uses binary values (0 and 1) to indicate whether a promotion was
active or not. Harmonizing categorical data in this way enables grouping and filtering without ambiguity.

This could involve mapping different promotional statuses (like “active,” “on hold,” etc.) to a binary format,
ensuring simplicity in downstream analysis.

5. Economic Condition Standardization:

The **economic_condition** column uses consistent categories like “Good,” “Average,” and
“Poor” to represent economic conditions. Harmonizing these categories ensures that they are standardized
and can be used effectively in groupings.

If the data sources used different terms for economic conditions, harmonization would involve mapping
these variations to standard categories for consistency.

6. Dealing with Outliers:

Identifying and handling outliers in spend and revenue ensures that extreme values don’t distort marginal
return calculations.

Harmonization may involve detecting unusual spikes or drops in spend/revenue and determining whether
they’re genuine or need adjustment or removal.

7. Null Handling:

Ensuring there are no null values in critical columns like **spend**, **revenue**, and
**channel** to prevent errors in calculations.

Null values might be imputed or flagged for exclusion in the analysis if they cannot be filled logically.

8. Data Type Harmonization: Ensuring that numeric columns like spend and revenue are in a numeric data
type (float or integer), while categorical fields like **channel** and **economic_condition** are in
string format.
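
A minimal sketch of two of these steps (date standardization and channel-name normalization), assuming a hypothetical raw staging table named raw_marketing_data with a raw MM/dd/yyyy date column; the table and column names here are illustrative only:

SELECT
  TO_DATE(raw_date, 'MM/dd/yyyy') AS date,       -- standardize dates (assumed source format)
  LOWER(REPLACE(channel, ' ', '_')) AS channel,   -- e.g. "Paid Search" becomes "paid_search"
  CAST(spend AS DECIMAL(10,2)) AS spend,          -- consistent currency precision
  CAST(revenue AS DECIMAL(10,2)) AS revenue
FROM raw_marketing_data
WHERE spend IS NOT NULL
  AND revenue IS NOT NULL;                        -- drop rows with nulls in critical columns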

Remember that the revenue in our dataset is attributed revenue by channel: Each row’s revenue value is tied to the
channel listed in the **channel** column, indicating that it represents the outcome (revenue) generated as a result
of marketing efforts on that specific channel. For example, if the **channel** is “social_media,” then the revenue
reflects only the income generated through social media activities on that date.

1. Calculating Marginal Return per Channel: Using SQL window functions, calculate marginal returns to
compare incremental revenue with incremental spend on each channel. This helps identify which marketing
channels provide the best returns over time.

2. Identifying the Break-even Point: Define the break-even point where the marginal return becomes less than 1
(indicating diminishing returns). Highlight this point in the data to guide marketing budget allocation decisions.

3. Dynamic Visualization by Date Range: Learn how to adjust date ranges to dynamically view changes in
marginal return over specific timeframes. This allows RetailHub to analyze the effectiveness of marketing spend
during promotional periods or under varying economic conditions.
4. Analyzing External Influences: Examine how factors like promotions and economic conditions impact marginal
returns across channels, allowing for more informed, context-aware budgeting decisions.

Calculate Marginal Returns per Channel

A key question to ask is: What is the additional revenue generated from every additional dollar spent? This
analysis helps answer that by calculating the marginal return, which represents the incremental revenue generated for
each incremental spend. Analyzing marginal returns across channels and time allows companies to optimize their
marketing budgets by identifying the most effective channels and understanding when additional spend starts yielding
diminishing returns.

For example, if a channel's marginal return drops below a certain threshold (e.g., less than 1, meaning incremental revenue is less than incremental spend), it may indicate that additional spending on that channel is no longer cost-effective. Concretely, for each channel the marginal return on a given date is (revenue - previous revenue) / (spend - previous spend), where "previous" refers to the prior date for that channel; the LAG window function in the query below supplies those previous values.

WITH marginal_return AS (
SELECT
date,
channel,
spend,
revenue,
LAG(revenue) OVER (PARTITION BY channel ORDER BY date) AS prev_revenue,
LAG(spend) OVER (PARTITION BY channel ORDER BY date) AS prev_spend
FROM marketing_data
)
SELECT
date,
channel,
spend,
revenue,
-- NULLIF avoids division by zero when spend is unchanged; COALESCE maps that case to 0
COALESCE((revenue - prev_revenue) / NULLIF(spend - prev_spend, 0), 0) AS marginal_return
FROM marginal_return
WHERE prev_spend IS NOT NULL;

Calculate the Break-Even Point for Each Channel

We are now going to calculate the break-even point for marketing spend on each channel for each date. The
break-even point represents the moment when additional spend on a given marketing channel no longer yields
proportional increases in revenue—essentially, when the marginal return on investment drops below 1. The reason
why we are computing this on a daily basis is:

Evaluate Daily Channel Efficiency: By calculating the break-even point on a daily basis, we can assess how
efficiently each marketing channel is performing each day. Marketing effectiveness can fluctuate due to various
factors, such as promotions, seasonal changes, or economic conditions. Calculating this daily allows us to capture
these variations in real time.

Identify Diminishing Returns: The marginal return calculation (revenue generated by the incremental spend)
tells us if each additional dollar spent on a channel still brings in more than a dollar in revenue. When marginal
return falls below 1 on a specific date, it indicates diminishing returns, meaning the channel’s current spend
level may no longer be cost-effective.
Optimize Budget Allocation: Knowing the daily break-even point helps make dynamic decisions on where to
allocate marketing budgets. For instance, if a specific channel’s marginal return is consistently below 1, it might
signal that budgets could be more effective if reallocated to other channels with higher marginal returns.

Context-Aware Decision-Making: Since this dataset includes external factors (like promo_activity and
economic_condition), we can interpret the break-even point in the context of these factors. For example, if
marginal returns are higher during promotional periods, it may make sense to increase spend during promotions
and decrease it when promotions are inactive.

The break-even point is determined by calculating the marginal return for each channel on each date, using the
following approach:

If marginal return < 1, the additional spend on that date and channel yielded less revenue than the spend itself,
indicating that the break-even point has been reached or exceeded.

To identify the break-even point where incremental return is less than incremental spend, we can calculate the
marginal return and then filter for instances where this value drops below 1. This approach helps pinpoint the point
where additional investment in a marketing channel does not yield a proportionate return, marking the point of
diminishing returns. Here’s a more detailed breakdown of this process:

WITH marginal_return AS (
SELECT
date,
channel,
spend,
revenue,
LAG(revenue) OVER (PARTITION BY channel ORDER BY date) AS prev_revenue,
LAG(spend) OVER (PARTITION BY channel ORDER BY date) AS prev_spend
FROM marketing_data
),
calculated_margins AS (
SELECT
date,
channel,
spend,
revenue,
prev_revenue,
prev_spend,
COALESCE((revenue - prev_revenue) / NULLIF(spend - prev_spend, 0), 0)
AS marginal_return
FROM marginal_return
WHERE prev_spend IS NOT NULL
)
SELECT
date,
channel,
spend,
revenue,
marginal_return,
CASE WHEN marginal_return < 1 THEN 'Break-even' ELSE 'Above break-even' END
AS status
FROM calculated_margins;

Overall Breakeven Analysis Across Channels


Conducting a break-even analysis across the entire time period for all channels provides essential insights for strategic
budgeting and resource allocation. By calculating the average marginal return over the full dataset, we gain a high-
level view of each channel’s effectiveness, revealing whether additional spending generally yields proportional
revenue. This analysis is valuable for several reasons:

Long-term Channel Effectiveness: Averages over time help identify channels that consistently underperform
(marginal return below 1), indicating that these channels may not justify continued investment.

High-level Budgeting and Simplified Decision-making: For executives and budget owners, a channel-level
view simplifies decisions, making it easier to identify where to increase or decrease budgets. This approach
avoids overreacting to short-term fluctuations and focuses on sustainable channel performance.

Baseline for Future Comparisons: Establishing a long-term average serves as a baseline to assess the impact of
future strategic adjustments, allowing teams to track if new tactics improve returns over time.

Cross-channel Optimization: With insights on which channels consistently deliver value, marketing resources
can be allocated to synergistic, high-performing channels, enhancing overall ROI and aligning spend with
business goals.

WITH marginal_return AS (
    SELECT
        date,
        channel,
        spend,
        revenue,
        LAG(revenue) OVER (PARTITION BY channel ORDER BY date) AS prev_revenue,
        LAG(spend) OVER (PARTITION BY channel ORDER BY date) AS prev_spend
    FROM marketing_data
),
calculated_margins AS (
    SELECT
        channel,
        COALESCE((revenue - prev_revenue) / NULLIF(spend - prev_spend, 0), 0) AS marginal_return
    FROM marginal_return
    WHERE prev_spend IS NOT NULL
)
SELECT
    channel,
    AVG(marginal_return) AS avg_marginal_return,
    CASE WHEN AVG(marginal_return) < 1 THEN 'Break-even' ELSE 'Above break-even' END AS status
FROM calculated_margins
GROUP BY channel
ORDER BY channel;

This yields the following results:

The analysis in the screenshot shows the average marginal return for each marketing channel (display_ads,
email_marketing, paid_search, and social_media) over the entire time period. Here’s what the results indicate:

1. Above Break-even: All channels have an average marginal return greater than 1, which means that, on
average, each channel generates more revenue than the amount spent. In other words, for every dollar invested,
each channel yields more than a dollar in revenue. The status column confirms that each channel is “Above
break-even,” implying that, over the entire time period, none of the channels is losing money on average.

2. Comparing Channel Effectiveness: Among the four channels: Email marketing has the highest average
marginal return (approximately 1.76), suggesting it’s the most effective channel in terms of return on investment.
Social media follows, with an average marginal return of about 1.62. Display ads and paid search are slightly
lower but still above 1, with average marginal returns of approximately 1.27 and 1.49, respectively. This suggests
that, although all channels are profitable, email marketing and social media may be the most efficient in terms
of generating revenue relative to spend.

There are some implications of the above findings:

Budget Allocation: Since email marketing and social media offer the highest returns, it may be wise to prioritize
spending on these channels. Conversely, while display ads and paid search are still profitable, they might be
considered for optimized or reduced spending if budget reallocation is an option.

Channel Strategy: This analysis provides a baseline understanding of long-term performance. Channels with the
highest average marginal returns could be candidates for further investment, while those with lower returns could
be evaluated for strategy adjustments to improve their effectiveness.

Dynamic Visualization Using Date Filtering


It’s often helpful to examine performance metrics (like marginal return) over particular date ranges to understand how
certain periods affect marketing channel effectiveness. For example:

During holiday seasons, promotional periods, or economic shifts, specific channels might perform better or worse.

SELECT
date,
channel,
spend,
revenue,
COALESCE((revenue - prev_revenue) / NULLIF(spend - prev_spend, 0), 0) AS
marginal_return,
CASE WHEN COALESCE((revenue - prev_revenue) / NULLIF(spend - prev_spend,
0), 0) < 1 THEN 'Break-even' ELSE 'Above break-even' END AS status
FROM (
SELECT
date,
channel,
spend,
revenue,
LAG(revenue) OVER (PARTITION BY channel ORDER BY date) AS prev_revenue,
LAG(spend) OVER (PARTITION BY channel ORDER BY date) AS prev_spend
FROM marketing_data
WHERE date BETWEEN '2024-12-01' AND '2025-01-15' -- Holiday season date range
) AS seasonal_analysis
WHERE prev_spend IS NOT NULL
ORDER BY channel, date;

Analyzing a date range around specific campaigns or product launches helps reveal the immediate impact of these
events on returns.

SELECT
date,
channel,
spend,
revenue,
COALESCE((revenue - prev_revenue) / NULLIF(spend - prev_spend, 0), 0) AS
marginal_return,
CASE WHEN COALESCE((revenue - prev_revenue) / NULLIF(spend - prev_spend,
0), 0) < 1 THEN 'Break-even' ELSE 'Above break-even' END AS status
FROM (
SELECT
date,
channel,
spend,
revenue,
LAG(revenue) OVER (PARTITION BY channel ORDER BY date) AS prev_revenue,
LAG(spend) OVER (PARTITION BY channel ORDER BY date) AS prev_spend
FROM marketing_data
WHERE date BETWEEN '2024-04-01' AND '2024-04-15' -- Campaign launch date range
) AS event_analysis
WHERE prev_spend IS NOT NULL
ORDER BY channel, date;

Analyzing External Influences on Marginal Returns

External factors, like active promotions or varying economic conditions, often play a critical role in marketing
effectiveness. For example:

Promotional Impact: Promotional periods may lead to a higher return on marketing spend, justifying increased
spending during these times.

Economic Sensitivity: Different channels may perform better or worse depending on the economic conditions,
helping organizations make strategic adjustments in spend allocation during economic downturns or booms.

By grouping marginal returns by **promo_activity** and **economic_condition**, analysts can discover which conditions are most favorable for each channel, enabling them to plan marketing budgets with these factors in mind.

WITH marginal_return AS (
SELECT
date,
channel,
spend,
revenue,
promo_activity,
economic_condition,
LAG(revenue) OVER (PARTITION BY channel ORDER BY date) AS prev_revenue,
LAG(spend) OVER (PARTITION BY channel ORDER BY date) AS prev_spend
FROM marketing_data
),
calculated_margins AS (
SELECT
date,
channel,
spend,
revenue,
promo_activity,
economic_condition,
COALESCE((revenue - prev_revenue) / NULLIF(spend - prev_spend, 0), 0)
AS marginal_return
FROM marginal_return
WHERE prev_spend IS NOT NULL
)
SELECT
channel,
promo_activity,
economic_condition,
AVG(marginal_return) AS avg_marginal_return,
COUNT(*) AS observations
FROM calculated_margins
GROUP BY
channel,
promo_activity,
economic_condition
ORDER BY
channel,
promo_activity DESC,
economic_condition;

1. Display Ads:

When promotions are active (**promo_activity = 1**), **display_ads** performs best under
“Good” economic conditions, with an average marginal return of 2.1446, indicating that each additional
dollar spent on display ads generates more than double in revenue.

Under “Poor” economic conditions, even with promotions, **display_ads** still show a strong
marginal return (1.7376), suggesting resilience in lower economic conditions.

When there is no promotion (**promo_activity = 0**), **display_ads** marginal returns drop significantly under “Average” conditions, showing a negative marginal return (-0.0211). This means that additional spend actually correlates with a decline in revenue, suggesting that display ads may be inefficient without promotional support in average economic times.

2. Email Marketing:

Email marketing shows high marginal returns when promotions are active across all economic conditions,
with values ranging from 1.6464 under “Average” conditions to 2.2325 under “Poor” conditions.

This channel appears to be especially effective during promotions, showing consistently high marginal
returns, even in challenging economic conditions.

Without promotions, email marketing still maintains decent marginal returns across all economic
conditions, with values around 1.5 or higher. This stability suggests that email marketing is a robust channel
that performs well with or without promotions.

3. Paid Search:

Paid search shows strong performance under “Good” economic conditions, both with promotions (average
marginal return of 2.1201) and without (average marginal return of 2.0134).

During “Poor” economic conditions with no promotions, paid search’s marginal return drops significantly,
even turning negative (-0.1821), suggesting that paid search might be ineffective in low economic periods
without promotional support.

Overall, paid search performs best in “Good” economic times but shows vulnerability in weaker economies
if promotions are not running.

4. Social Media:

Social media shows varied marginal returns and is the only channel with fewer records (observations).
Under “Average” economic conditions with promotions active, the average marginal return is 0.8451,
suggesting that this channel is not as efficient as others during promotions.

There is insufficient data in the displayed portion to analyze performance under “Good” or “Poor” economic
conditions, indicating that further data collection might be needed to fully evaluate social media’s
performance.

Effectiveness of Promotions by Channel:

Display ads and email marketing show strong positive responses to promotions across all economic
conditions, suggesting they should be prioritized for promotional campaigns. Display ads are highly
effective in “Good” conditions, while email marketing is consistent even in poor economies.

Paid search performs well during “Good” economic conditions with or without promotions, but its
performance drops significantly without promotions in weaker economies. It may be more cost-effective to
support paid search with promotions during challenging economic periods.

Economic Sensitivity:

Display ads and paid search appear sensitive to economic conditions, performing significantly better in
“Good” economic times. Without promotions, they may even produce negative returns in average or poor
economies, indicating the importance of aligning spend on these channels with favorable economic periods.

Email marketing is the least sensitive to economic conditions, maintaining positive returns across all
conditions. This robustness suggests that it could be a reliable channel for steady investment regardless of
economic shifts.

Channel-Specific Budget Allocation:

Given the marginal returns observed, email marketing should receive a stable or even increased allocation
due to its resilience and effectiveness across conditions.

Display ads should be prioritized for promotional periods and particularly leveraged when economic
conditions are favorable.

Paid search should be monitored closely and potentially reduced during economic downturns, especially if
promotions are not planned, as it shows poor performance under these conditions.

Social media might require further evaluation or data collection due to its limited observations, but it
appears less effective during promotional periods compared to other channels.

This analysis highlights that promotions and economic conditions significantly impact the effectiveness of each
marketing channel. Email marketing is a strong performer in both good and poor economies, making it a reliable
investment, while display ads and paid search show high returns in favorable conditions or during promotions. Social
media may require further data for a full assessment but appears less effective during promotions in average economic
conditions.

Channel Segmentation Based on Performance Metrics

The goal of clustering for channel segmentation based on performance is to group similar marketing channels or time
periods according to key performance metrics, such as marginal return, spend, and revenue. By identifying patterns
and grouping channels with similar characteristics, we gain insights into high-performing and low-performing
segments, which help guide strategic budget decisions.

To achieve this, we first prepare the data by ensuring that each record includes essential metrics—marginal return,
spend, and revenue—along with additional contextual features, like promotional activity or economic conditions if
available. It’s important to normalize or scale these metrics so that each one contributes equally to the clustering
process.

Once we define the features—such as [avg_spend, avg_revenue, avg_marginal_return]—we combine them into a single vector for each channel or time period. Running the chosen clustering algorithm then
assigns each data point (channel or date) a cluster label, grouping similar records together. Finally, we analyze and
interpret each cluster to understand its performance characteristics. High-performing clusters might demonstrate high
returns relative to spend, making them strong candidates for increased budget allocation. Conversely, low-performing
clusters may indicate poor returns, suggesting areas where budget cuts or optimizations are needed.
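
As a concrete illustration of the scaling step mentioned above, here is a minimal sketch (not part of the original walkthrough) of how the three per-channel features could be min-max scaled in SQL so that each contributes equally to the distance calculation. It assumes the calculated_margins table created in the next step.

-- Min-max scale the per-channel features before clustering (illustrative sketch)
WITH channel_features AS (
    SELECT
        channel,
        AVG(spend) AS avg_spend,
        AVG(revenue) AS avg_revenue,
        AVG(marginal_return) AS avg_marginal_return
    FROM calculated_margins
    GROUP BY channel
),
feature_bounds AS (
    SELECT
        MIN(avg_spend) AS min_spend, MAX(avg_spend) AS max_spend,
        MIN(avg_revenue) AS min_revenue, MAX(avg_revenue) AS max_revenue,
        MIN(avg_marginal_return) AS min_mr, MAX(avg_marginal_return) AS max_mr
    FROM channel_features
)
SELECT
    f.channel,
    (f.avg_spend - b.min_spend) / NULLIF(b.max_spend - b.min_spend, 0) AS spend_scaled,
    (f.avg_revenue - b.min_revenue) / NULLIF(b.max_revenue - b.min_revenue, 0) AS revenue_scaled,
    (f.avg_marginal_return - b.min_mr) / NULLIF(b.max_mr - b.min_mr, 0) AS marginal_return_scaled
FROM channel_features f
CROSS JOIN feature_bounds b;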

CREATE TABLE IF NOT EXISTS calculated_margins AS
WITH marginal_return AS (
SELECT
date,
channel,
spend,
revenue,
promo_activity,
economic_condition,
LAG(revenue) OVER (PARTITION BY channel ORDER BY date) AS prev_revenue,
LAG(spend) OVER (PARTITION BY channel ORDER BY date) AS prev_spend
FROM marketing_data
)
SELECT
date,
channel,
spend,
revenue,
promo_activity,
economic_condition,
COALESCE((revenue - prev_revenue) / NULLIF(spend - prev_spend, 0), 0) AS
marginal_return
FROM marginal_return
WHERE prev_spend IS NOT NULL;

-- Step 0: Split Data into Training (80%) and Testing (20%)

-- Assuming calculated_margins is the table with data including marginal returns per channel
CREATE TABLE channel_data_split AS
SELECT *,
    CASE WHEN RAND() < 0.8 THEN 'train' ELSE 'test' END AS data_split
FROM calculated_margins;

-- Step 1: Create the K-Means Model on the Training Data Only


CREATE OR REPLACE MODEL channel_performance_clustering
TRANSFORM(vector_assembler(array(avg_spend, avg_revenue, avg_marginal_return))
AS features)
OPTIONS (
MODEL_TYPE = 'KMEANS',
NUM_CLUSTERS = 4, -- Adjust the number of clusters based on analysis needs
MAX_ITER = 20 -- Set maximum iterations for convergence
)
AS
SELECT
channel,
AVG(spend) AS avg_spend,
AVG(revenue) AS avg_revenue,
AVG(marginal_return) AS avg_marginal_return
FROM
channel_data_split
WHERE data_split = 'train'
GROUP BY
channel;

-- Step 2: Store Clustering Results for Training Data

-- Create a table to store the clusters and predictions for each channel based on performance
-- (uncomment the CREATE TABLE line below to persist these results as a table)
--CREATE TABLE IF NOT EXISTS channel_cluster_train AS
SELECT *
FROM MODEL_PREDICT(channel_performance_clustering, 1,
SELECT
channel,
AVG(spend) AS avg_spend,
AVG(revenue) AS avg_revenue,
AVG(marginal_return) AS avg_marginal_return
FROM channel_data_split
WHERE data_split = 'train'
GROUP BY channel
);

-- Step 3: Store Clustering Results for Testing Data

-- Create a table to store predictions for the test data, which will be used for evaluating the model
-- (uncomment the CREATE TABLE line below to persist these results as a table)
--CREATE TABLE IF NOT EXISTS channel_cluster_test AS
SELECT *
FROM MODEL_PREDICT(channel_performance_clustering, 1,
SELECT
channel,
AVG(spend) AS avg_spend,
AVG(revenue) AS avg_revenue,
AVG(marginal_return) AS avg_marginal_return
FROM channel_data_split
WHERE data_split = 'test'
GROUP BY channel
);

The results for the training data are:

The results for the test data are:

The clustering results from the training and test data provide insights into how different marketing channels are
segmented based on their performance metrics (average spend, average revenue, and average marginal return). Let’s
interpret the clustering assignments and what they suggest about each channel’s performance characteristics.

Training Data Clustering Results

1. Cluster Assignments:

Paid Search: Assigned to cluster 1 with an average spend of approximately 5003.84, average revenue of
8757.87, and an average marginal return of 1.43.

Display Ads: Assigned to cluster 2 with an average spend of around 5001.57, average revenue of 8753.31,
and an average marginal return of 0.96.

Social Media: Assigned to cluster 0 with an average spend of 4998.77, average revenue of 8749.15, and an
average marginal return of 1.67.

Email Marketing: Assigned to cluster 3 with an average spend of 5003.85, average revenue of 8755.74,
and an average marginal return of 1.80.
2. Cluster Interpretations:

Cluster 1 (e.g., Paid Search): Channels in this cluster have moderate marginal returns and generate a
reasonable balance between spend and revenue. They are effective but may not maximize revenue as
efficiently as higher-return clusters.

Cluster 2 (e.g., Display Ads): This cluster has the lowest marginal return (0.96), indicating that these
channels might not be cost-effective, as they generate less than a dollar in revenue for each dollar spent.
Channels in this cluster may need optimization or reduced spending.

Cluster 0 (e.g., Social Media): Channels in this cluster show good performance, with a high marginal return
(1.67), indicating that these channels are more effective at converting spend into revenue.

Cluster 3 (e.g., Email Marketing): This cluster has the highest marginal return (1.80), suggesting that
email marketing is the most efficient channel in terms of revenue generation. Channels in this cluster should
be prioritized for budget allocation due to their high ROI.

Test Data Clustering Results

1. Cluster Assignments:

Paid Search: Assigned to cluster 1 with an average spend of 5009.28, average revenue of 8773.25, and an
average marginal return of 1.39.

Display Ads: Assigned to cluster 0 with an average spend of 4999.25, average revenue of 8745.20, and an
average marginal return of 2.35.

Social Media: Assigned to cluster 1 with an average spend of 5014.96, average revenue of 8781.21, and an
average marginal return of 1.44.

Email Marketing: Assigned to cluster 1 with an average spend of 5012.91, average revenue of 8775.92,
and an average marginal return of 1.59.

2. Cluster Interpretations:

In the test data, Display Ads is uniquely assigned to cluster 0 with a significantly higher marginal return
(2.35). This suggests that display ads might perform better under certain conditions or periods, achieving a
much higher return than observed in the training data. This might be due to factors like specific promotional
activity or seasonal influences during the test period.

Cluster 1 for the test data includes Paid Search, Social Media, and Email Marketing, all with moderate-
to-high marginal returns ranging from 1.39 to 1.59. This indicates that, for this test period, these
channels perform efficiently and consistently, converting each dollar spent into more than a dollar in
revenue.

Performance Consistency Across Periods: Email Marketing and Social Media generally remain in high-
performing clusters across both training and test periods, suggesting stable effectiveness. These channels should
be considered for higher budget allocations.

Cluster Variability for Display Ads: Display Ads shows significant variability, shifting from a low-performing
cluster (2 in training) to a high-performing cluster (0 in test). This might indicate that Display Ads’ effectiveness
is highly influenced by external factors (e.g., timing, promotional campaigns). Further investigation could help
pinpoint specific conditions that make this channel more effective.

Budget Allocation Strategies


Channels in high-performing clusters with high marginal returns (such as Email Marketing and Social
Media in the training period, and Display Ads in the test period) should be prioritized for budget increases.

Channels in lower-performing clusters with marginal returns below 1 (such as Display Ads in the training
period) should either undergo optimization or receive reduced budgets unless specific conditions that
improve their performance can be identified.

Optimal Budget Optimization Across Channels: A Simple Approach

Our goal now is to create an optimization model to maximize overall marginal return across all channels within a
given budget.

To create an optimization model for budget allocation across marketing channels, we’ll design a model that maximizes
the overall marginal return within a set budget constraint. Here, we’ll use linear programming techniques (since
marginal return can typically be approximated as a linear function of budget) to distribute the budget across channels
in an optimal way. This model will suggest how much budget to allocate to each channel to maximize total return,
subject to budget limits and any specified minimum or maximum spend requirements.
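
In general form (notation introduced here only for illustration; it does not appear in the original text), the problem is the linear program

$$\max_{x_1,\dots,x_n} \sum_{i=1}^{n} r_i\,x_i \quad \text{subject to} \quad \sum_{i=1}^{n} x_i \le B, \qquad l_i \le x_i \le u_i,$$

where $x_i$ is the budget allocated to channel $i$, $r_i$ is its expected marginal return, $B$ is the total budget, and $l_i$ and $u_i$ are the minimum and maximum spend allowed for that channel.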

Currently, Data Distiller does not support linear programming models natively. However, budget allocation across
omni-channel campaigns can still be achieved using combination techniques, which are effective in this context.

To set up a budget optimization, we’ll generate combinations of budget allocations across channels within a defined
range and filter out combinations that meet the total budget constraint and other conditions. This approach allows us to
approximate an optimal allocation across four channels (email_marketing, paid_search, social_media,
and display_ads) to maximize returns based on each channel’s marginal return.

Total Budget Constraint: The combined spend across all channels cannot exceed a specified total budget.

Minimum and Maximum Spend per Channel: Each channel has a predefined minimum and maximum budget
allocation.

Marginal Return for Each Channel: Each channel has an expected marginal return, which allows us to
calculate the total return for each combination.

The objective function that we maximize is the total return for each allocation: a weighted sum of the channel budgets using their expected marginal returns (1.8 for email marketing, 1.4 for paid search, 1.5 for social media, and 1.0 for display ads), as encoded in the query below.

WITH email_marketing_values AS (
SELECT explode(sequence(1000, 5000, 500)) AS email_marketing
),
paid_search_values AS (
SELECT explode(sequence(1500, 6000, 500)) AS paid_search
),
social_media_values AS (
SELECT explode(sequence(1200, 4500, 500)) AS social_media
),
display_ads_values AS (
SELECT explode(sequence(800, 3000, 500)) AS display_ads
),
PossibleAllocations AS (
SELECT
em.email_marketing,
ps.paid_search,
sm.social_media,
da.display_ads,
(em.email_marketing * 1.8 + ps.paid_search * 1.4 + sm.social_media *
1.5 + da.display_ads * 1.0) AS total_return,
(em.email_marketing + ps.paid_search + sm.social_media +
da.display_ads) AS total_spend
FROM
email_marketing_values AS em
CROSS JOIN
paid_search_values AS ps
CROSS JOIN
social_media_values AS sm
CROSS JOIN
display_ads_values AS da
)
SELECT
email_marketing,
paid_search,
social_media,
display_ads,
total_spend,
total_return
FROM
PossibleAllocations
WHERE
total_spend <= 15000 -- Total budget constraint
ORDER BY
total_return DESC
LIMIT 100; -- Select the allocation with the highest return

Remember that although we get many combinations that satisfy the budget constraint, you ultimately want the one with the highest return, i.e., the top result, which you can obtain by changing **LIMIT 100** to **LIMIT 1**.

The results are the following:

The additional rows display the top 100 budget allocations sorted by total return in descending order, with returns
ranging from 22,500 to 23,100. This range suggests there are multiple budget allocation combinations that yield
returns close to the maximum. In some cases, slightly adjusting the budget between channels—such as increasing
allocation for **paid_search** while reducing spend on **social_media** or **display_ads**—still
meets the budget constraint and achieves a high return, though not as high as the optimal allocation. Notably,
**email_marketing** and **paid_search** consistently receive high allocations (around 4,500 to 5,000)
across the top results, indicating their higher marginal returns and prioritization in the budget. In contrast,
**social_media** and **display_ads** exhibit more variability in their allocations, suggesting these
channels are adjusted as needed to meet the total budget constraint while maximizing total return.

There are multiple combinations and we can do a tradeoff here. However, if we seek the optimized solution, we get:

The optimal allocation with the highest return shows the budget distribution across the four channels—
**email_marketing**, **paid_search**, **social_media**, and **display_ads**—that
maximizes the total return within the specified budget constraint. In this allocation, **email_marketing** and
**paid_search** each receive a budget of 5,000, **social_media** receives 4,200, and
**display_ads** is allocated 800. The total spend is exactly 15,000, which meets the budget constraint, and the
total return achieved is 23,100, representing the maximum achievable return within these constraints.

Prioritization of Channels:

Email Marketing and Paid Search consistently receive high allocations across the top results, implying
that they likely have higher marginal returns relative to other channels. It would be beneficial to prioritize
these channels when allocating future budgets.

Social Media and Display Ads should be treated as flexible channels, where spending can be adjusted to
optimize within the remaining budget after prioritizing the higher-return channels.

Budget Flexibility:

For budgets slightly below 15,000, you might still achieve close to optimal returns by slightly reducing
spend on lower-return channels like Display Ads or Social Media.

This flexibility can be useful for scenarios where the budget fluctuates or where strict budget adherence is
essential.

Maximizing Returns within Constraints:

The results provide a clear strategy for allocating the budget across multiple channels to achieve the highest
possible return, given the budget constraint and marginal return values for each channel.

This approach can be repeated with updated marginal returns or adjusted budget limits to refine allocation
strategies based on real-time data or changing economic conditions.

Marginal returns calculation:

$$\text{Marginal Return} = \frac{\text{Revenue} - \text{Previous Revenue}}{\text{Spend} - \text{Previous Spend}}$$

Overall breakeven analysis (screenshot)

Date filtering to the holidays (screenshot)

Campaign launch date range (screenshot)

Total return (the objective function for the budget optimization):

$$\text{Total Return} = (\text{email\_marketing} \times 1.8) + (\text{paid\_search} \times 1.4) + (\text{social\_media} \times 1.5) + (\text{display\_ads} \times 1.0)$$

Optimal budget allocation (screenshot)

https://data-distiller.all-stuff-data.com/unit-7-data-distiller-business-intelligence/bi-400-subscription-analytics-for-
growth-focused-products-using-data-distiller * * *

1. Unit 7: DATA DISTILLER BUSINESS INTELLIGENCE

BI 400: Subscription Analytics for Growth-Focused Products using Data


Distiller
Unlocking Key Subscription Metrics to Drive Growth and Retention with Powerful Visualizations

Subscription analytics involves tracking and analyzing key metrics that measure the health and growth potential of a
subscription-based business. These metrics provide insights into customer behavior, revenue trends, and overall
product engagement, which are crucial for making informed business decisions. Common metrics in subscription
analytics include churn rate, Monthly Recurring Revenue (MRR), Average Revenue Per User (ARPU), Customer
Lifetime Value (CLTV), Daily Active Users (DAU), and Monthly Active Users (MAU). Monitoring these metrics
helps companies understand how well they are retaining customers, growing revenue, and encouraging product usage
over time.

Subscription analytics and metrics like the above are widely used across many industries:

1. Software as a Service (SaaS): SaaS companies use subscription metrics to monitor user adoption, gauge
engagement, adjust pricing, and reduce churn. Key metrics like MRR, ARPU, and CLTV help them assess
profitability and the long-term value of customers.

2. Media and Entertainment: Streaming services like Netflix and Spotify, along with digital content platforms, use
subscription analytics to track subscriber growth, engagement, and retention. DAU and MAU help measure
content consumption frequency, while churn rate indicates the stability of the subscriber base.

3. Telecommunications: Telecom companies use subscription metrics to manage services like mobile plans,
internet packages, and TV subscriptions. These metrics help track customer satisfaction, prevent service
cancellations, and identify upselling opportunities.

4. Health and Fitness: Subscription-based fitness apps, online coaching, and gym memberships use these metrics
to measure user engagement and assess the effectiveness of their services. By tracking user activity and
cancellations, companies can identify ways to improve customer experiences and reduce churn.

5. E-Learning and EdTech: Online learning platforms and educational subscription services use subscription
analytics to track student engagement, course completion rates, and subscription renewals. Metrics like MRR and
CLTV help determine the financial impact of customer acquisition efforts and retention strategies.

6. E-Commerce and Subscription Boxes: Companies offering subscription boxes (e.g., beauty products, meal kits)
or membership programs track these metrics to monitor growth, manage customer retention, and optimize
recurring revenue. Understanding churn and CLTV helps improve product offerings and marketing strategies.

7. Financial Services: Subscription-based financial services (e.g., investment platforms, credit monitoring services)
use these metrics to evaluate customer usage patterns and assess the impact of retention initiatives on overall
revenue.

Data Distiller excels in visualizing key product metrics, offering a comprehensive and intuitive experience compared
to other tools. With its powerful data processing capabilities, Data Distiller allows for seamless aggregation and
visualization of important metrics such as churn rate, MRR, ARPU, CLTV, DAU, and MAU. The product’s flexibility
in handling different data sources and transforming raw data into actionable insights makes it particularly effective for
tracking user engagement trends and customer lifetime metrics.

Compared to other vendors that provide pre-defined metric calculations and dashboards, Data Distiller enables deeper
customization and more granular analysis. This allows teams to identify underlying issues, such as differences between
DAU and MAU, and drill down into specific user behaviors that contribute to churn. The ability to build tailored
visualizations helps in making more data-driven decisions for growth strategies, empowering companies to proactively
address problems, optimize customer retention, and drive sustainable growth.

Download the following dataset:

Use the following tutorial to onboard this CSV data:

The dataset above contains customer journey events for 10,000 users across a 12-month period. It is designed to capture key actions taken by users as they interact with a SaaS application, providing a timeline of events such as signups, logins, conversions to paid plans, monthly payments, and churn:

1. Visitor_ID: A unique identifier for each user.


2. Month: The month in which a particular event occurred. It covers a 12-month period from January to December.
Each user has a timeline of events that can span multiple months, depending on their activities.

3. Event: The type of event or action taken by the user. This could be one of the following:

browse: When the user is just browsing the website but has not signed up.

signup: When the user initially signs up for the app. When they do, they get access to the free version but
need to pay to get access to the premium features.

login: Each time the user logs into the app. Users log in with varying frequencies.

convert_to_paid: If the user upgrades from the free plan to a paid subscription.

payment: For users who have converted to paid, this represents the monthly payment they make (assumed
to be $5.99/month).

churn: If the user cancels their subscription after converting to paid.

4. Amount (only for payment events): The dollar amount for the subscription payment. It is set to $5.99 for users
who are paying for the app, applicable only for rows where the Event is ‘payment.’
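
Once the CSV is onboarded, a quick profiling query (a sketch, assuming the dataset is registered under the name customer_journey_saas) can confirm the event mix before any dashboards are built:

SELECT
    Event,
    COUNT(*) AS event_rows,
    COUNT(DISTINCT Visitor_ID) AS unique_visitors
FROM customer_journey_saas
GROUP BY Event
ORDER BY event_rows DESC;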

You will also need the following prerequisites for the dashboards we will build:

Modeling the Customer Journey

When you’re analyzing customer journeys, you’re essentially mapping a series of interactions that a user has with a
product, platform, or service. These interactions aren’t linear; they vary widely depending on user behavior,
preferences, and decisions. Think of it like a decision branch where each action (signup, login, upgrade to paid, churn,
etc.) creates a new path in the user’s journey, and each path has its own implications for metrics like revenue,
engagement, and retention. However, no matter how intricate the customer journey is—filled with multiple branching
decisions, complex behaviors, and varied touchpoints—you can still bring that data into a simplified, flat schema. This
allows you to standardize and analyze the different paths users take.

Once your data is in a canonical schema, you can build templates for SQL queries that accommodate multiple
outcomes. This is crucial when modeling customer journeys because you can’t predict every individual path a user will
take. Instead of writing one-off queries for each possible decision branch (e.g., “users who signed up in January and
converted in February but churned in April”), you can generalize the query to cover all possible scenarios.

Because the schema is flat, the query structure stays consistent, and only the parameters (like dates, actions, or
conditions) need to change. This ensures that the analysis remains scalable as the user base grows, and different
decision branches are taken.
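
As an example of such a template, the sketch below (not from the original text; the chosen events are hypothetical parameters) counts users who performed one event and later performed another, with month names mapped to numbers so that “later” is evaluated chronologically. Only the two event names need to change from question to question.

-- Generalized journey-path query: only the event parameters change per question
WITH ordered_events AS (
    SELECT
        Visitor_ID,
        Event,
        CASE Month
            WHEN 'Jan' THEN 1 WHEN 'Feb' THEN 2 WHEN 'Mar' THEN 3 WHEN 'Apr' THEN 4
            WHEN 'May' THEN 5 WHEN 'Jun' THEN 6 WHEN 'Jul' THEN 7 WHEN 'Aug' THEN 8
            WHEN 'Sep' THEN 9 WHEN 'Oct' THEN 10 WHEN 'Nov' THEN 11 WHEN 'Dec' THEN 12
        END AS month_num
    FROM customer_journey_saas
)
SELECT COUNT(DISTINCT a.Visitor_ID) AS users_on_path
FROM ordered_events a
JOIN ordered_events b
    ON a.Visitor_ID = b.Visitor_ID
    AND b.month_num > a.month_num
WHERE a.Event = 'signup'            -- first event (parameter)
  AND b.Event = 'convert_to_paid';  -- later event (parameter)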

For this tutorial, I have modeled the customer journey as an example, but as a product manager, defining this journey
is crucial for your specific use case. The key is to pinpoint the critical touchpoints that matter most in the user
experience. While my example is simplified, your journey may include additional events, such as feature launches or
user engagement with specific features.

1. Visits/Browse Events: These are visitors browsing the SaaS app without signing up. If they don’t sign up, they
won’t generate any further events such as logins, paid conversions, or churn.

2. Signups: Each user has a signup event recorded for the month they initially join the app. This event is crucial for
tracking when a user becomes part of the system. Upon signing up, users gain access to the app’s free features.
It’s important to note that even if a user churns, they can still continue to use the free features.

3. Logins: Users log in multiple times after signing up. The login frequency varies across users and is randomly
generated for the months following their signup. This helps calculate metrics like DAU (Daily Active Users) and
MAU (Monthly Active Users).

4. Paid Conversions: Some users convert to a paid plan after signing up. The conversion happens after the signup
event, at some random time within the 12 months. This event is crucial for tracking conversion rates and
revenue generation.

5. Payments: Users who have converted to a paid plan are charged $5.99 per month. The payment event occurs
monthly for each user as long as they remain subscribed and have not churned.

6. Churn: Users who have converted to a paid plan can churn (cancel their subscription). The churn event occurs
randomly after the conversion, with a 15% churn rate. Once a user churns, no further payment events are
recorded for that user.

I am also going to define DAU, MAU and other SaaS style metrics for the purposes of this analysis:

DAU (Daily Active Users) is defined as at least one login per day per user. This is a time-series metric because it
tracks user activity over time, specifically on a daily basis. Each day’s DAU represents the number of unique
users who logged in at least once that day.

MAU (Monthly Active Users) is defined as at least 5 logins per month per user. This is a time-series metric
because it tracks user activity over time, specifically on a monthly basis. Each month’s MAU represents the
number of unique users who logged in at least 5 times during that month

Churn Rate represents the percentage of our users who cancel their subscription (indicated by a churn event)
or have stopped using the product (no login after a signup event) after upgrading to a paid plan within a given
period. It helps measure how quickly we are losing paying customers. The Retention Rate is 1 - Churn Rate.

MRR (Monthly Recurring Revenue): This measures the predictable revenue our app generates from paid users
on a monthly basis. In our case, every user who converts to a paid plan pays $5.99 per month. MRR allows us to
track how much recurring revenue is generated each month as users upgrade from the free version to the paid
plan.

ARR (Annual Recurring Revenue) projects the yearly value of our monthly subscriptions. In this example,
ARR is calculated by taking the Monthly Recurring Revenue (MRR) from the last month and multiplying it by
12. This represents a full year’s worth of payments from all paying users, based on the MRR of the most recent
month.

Average Revenue Per User (ARPU) is an essential metric that represents the average revenue generated per user
over a specific period— in this case, on a monthly basis. It provides insight into how much revenue, on average,
each user (including both free and paid users) contributes to your business.

Customer Lifetime Value (CLTV) is a metric that estimates the total revenue a business can expect to generate
from a customer over the entire duration of their relationship. It helps businesses understand the long-term value
of their customers and can guide decisions on customer acquisition and retention strategies. A high CLTV
indicates that customers are valuable and tend to stick around for longer periods, contributing more revenue over time. The formula, CLTV = ARPU / Churn Rate, is derived in the Appendix.

Signup to Paid Conversion Rate: The percentage of users who convert from a free trial or a free plan to a paid plan.

Cohort Analysis (Excluded from this Chapter)


Cohort Retention: Track how different cohorts (groups of users who signed up in the same period) behave over
time in terms of engagement, retention, and conversion.

Cohort Conversion: Analyzing which signup cohorts are converting into paid users, and whether certain months
or marketing strategies perform better.

Create the Table in the Data Distiller Warehouse

1. Execute each of the following statements, one at a time, using the Run Selected Query feature:

-- Create the Database in the Warehouse
CREATE DATABASE saas WITH (TYPE=QSACCEL, ACCOUNT=acp_query_batch);

-- Create the data model
CREATE SCHEMA saas.facttable;

-- Create an empty dataset
CREATE TABLE saas.facttable.customer_journey_saas AS
SELECT
    cast(null as string) AS Visitor_ID,
    cast(null as string) AS Month,
    cast(null as string) AS Event,
    cast(null as decimal(18,2)) AS Amount
WHERE false;

-- Rename the data model for readability in Data Distiller Dashboards
ALTER MODEL saas.facttable RENAME TO saas_analysis;

-- Hydrate the table
INSERT INTO saas.facttable.customer_journey_saas
SELECT * FROM customer_journey_saas;

Note that there are two datasets named **customer_journey_saas**. One is stored in the data lake, ingested
via a CSV upload, and the other is the one we created in the Warehouse. When dealing with datasets that share the
same name, you must specify the full path for the dataset in the Warehouse to ensure the Data Distiller engine correctly
identifies which one to use.
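
For example, the two forms below refer to different copies of the data (a sketch of the distinction; the unqualified name is assumed to resolve to the data lake dataset in this session):

-- Warehouse copy, addressed by its full path
SELECT COUNT(*) FROM saas.facttable.customer_journey_saas;

-- Data lake copy, addressed by its unqualified name (assumption: this is how the name resolves here)
SELECT COUNT(*) FROM customer_journey_saas;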

1. Launch the Data Distiller Query Pro Mode Editor by navigating to Dashboards->Create Dashboard->Name
of dashboard: saas->Choose Query Pro Mode->Add new Widget->Enter SQL

2. Before you run the SQL query, make sure you choose **saas_analysis** in the data model dropdown:

SELECT * FROM saas.facttable.customer_journey_saas;

3. You will follow the process of creating your SQL query with the **saas_analysis** data model and then
selecting the appropriate visualization. This should be fairly intuitive, but if you need assistance, you can refer to
the detailed steps in the tutorial here.

Total Unique Visitors (Yearly)

SQL code that you will paste in the Query Pro Mode is:

SELECT
COUNT(DISTINCT Visitor_ID) AS Total_Unique_Visitors
FROM customer_journey_saas;

Total Unique Signups (Yearly)

SQL code that you will paste in the Query Pro Mode is:

SELECT COUNT(DISTINCT Visitor_ID) AS Total_Signups


FROM customer_journey_saas
WHERE Event = 'signup';
The visualization that you should be able to get to is this one:

Total Paid Unique Customers (Yearly)

SELECT
COUNT(DISTINCT Visitor_ID) AS Total_Unique_Paid_Customers
FROM customer_journey_saas
WHERE Event = 'convert_to_paid';

Paid Conversion Rate (Over the Year)

SQL code that you will paste in the Query Pro Mode Editor is:

WITH Total_Signups AS (
SELECT COUNT(DISTINCT Visitor_ID) AS Total_Signups
FROM customer_journey_saas
WHERE Event = 'signup'
),
Paid_Conversions AS (
SELECT COUNT(DISTINCT Visitor_ID) AS Total_Paid
FROM customer_journey_saas
WHERE Event = 'convert_to_paid'
)
SELECT
(Paid_Conversions.Total_Paid * 100.0) / Total_Signups.Total_Signups AS
Paid_Conversion_Rate
FROM Total_Signups, Paid_Conversions;

The visualization that you should be able to get to is this one. In fact, this is the ratio of the signup to paid customers
from the big number charts above.

Total Churned Unique Customers (Yearly)

The query would be:

SELECT
COUNT(DISTINCT Visitor_ID) AS Total_Unique_Churned_Customers
FROM customer_journey_saas
WHERE Event = 'churn';

You might be led to believe that the churn rate is approximately 133/838 = 15% on a yearly basis, based on the numbers above, i.e., that it is simply the ratio of unique churned customers to unique paid customers:

WITH Total_Paid_Users AS (
SELECT COUNT(DISTINCT Visitor_ID) AS Total_Paid
FROM customer_journey_saas
WHERE Event = 'convert_to_paid'
),
Churned_Users AS (
SELECT COUNT(DISTINCT Visitor_ID) AS Churned
FROM customer_journey_saas
WHERE Event = 'churn'
)
SELECT
(Churned_Users.Churned * 100.0) / Total_Paid_Users.Total_Paid AS Churn_Rate
FROM Total_Paid_Users, Churned_Users;
But our definition requires that we also consider users who are either paying but not actively using the product, or
those who have access to the free version but are not using it.

WITH Paid_Customers AS (
-- Total unique paid customers who converted to a paid plan
SELECT DISTINCT Visitor_ID
FROM customer_journey_saas
WHERE Event = 'convert_to_paid'
),
Churned_Customers AS (
-- Users with an explicit churn event
SELECT DISTINCT Visitor_ID
FROM customer_journey_saas
WHERE Event = 'churn'
),
Inactive_Customers AS (
-- Users who have stopped using the product (no login after signup)
SELECT DISTINCT s.Visitor_ID
FROM customer_journey_saas s
LEFT JOIN customer_journey_saas l
ON s.Visitor_ID = l.Visitor_ID
AND l.Event = 'login'
AND l.Month > s.Month
WHERE s.Event = 'signup'
AND l.Visitor_ID IS NULL -- No login after signup
)
-- Calculate the Churn Rate based on both churned and inactive customers
SELECT
(COUNT(DISTINCT c.Visitor_ID) + COUNT(DISTINCT i.Visitor_ID)) * 100.0 /
COUNT(DISTINCT p.Visitor_ID) AS Churn_Rate
FROM Paid_Customers p
LEFT JOIN Churned_Customers c ON p.Visitor_ID = c.Visitor_ID
LEFT JOIN Inactive_Customers i ON p.Visitor_ID = i.Visitor_ID;

The code above requires some explanation. Consider a hypothetical example in which **Visitor_ID=1** signs up and then logs in during a later month, while **Visitor_ID=2** signs up (and even converts to paid) but never logs in again. You can clearly see that **Visitor_ID=2** has churned as an inactive user over that timespan.

If we examine the following code fragment, we see:

SELECT DISTINCT s.Visitor_ID


FROM customer_journey_saas s
LEFT JOIN customer_journey_saas l
ON s.Visitor_ID = l.Visitor_ID
AND l.Event = 'login'
AND l.Month > s.Month
WHERE s.Event = 'signup'
AND l.Visitor_ID IS NULL -- No login after signup

**LEFT JOIN** means that rows on the left-hand side that cannot be matched with any rows from the right-hand side will still remain in the result. Let us also explore the **ON** condition, which is part of the **LEFT JOIN** clause and specifies how the two tables (**s** and **l**) should be joined. Let’s break down what each part of this condition means:

1. **s.Visitor_ID = l.Visitor_ID**: This condition joins the records from the two tables (s and l)
based on matching Visitor_ID. It ensures that we are comparing events for the same user in both tables.
2. **AND l.Event = 'login'**: This condition filters the joined records to include only rows where the
event in the l table is a 'login' event. This means we are only interested in login activities when joining the
tables.

3. **AND l.Month > s.Month**: This condition checks that the login event (l.Month) occurred after the signup event (s.Month). It ensures that we are only considering login activities that happened after the user signed up. (Note that because Month stores three-letter month names, this string comparison only approximates chronological order; mapping month names to numbers, as the MRR query later in this chapter does, would make the comparison exact.)

Now, it turns out that rows in table **s** that cannot be joined with table **l** will have all of the columns from the **l** table appear as NULLs, as is the case for **Visitor_ID=2**. We need exactly these rows, so the following filter condition extracts them:

WHERE s.Event = 'signup'
    AND l.Visitor_ID IS NULL
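
To make this concrete, here is a tiny self-contained sketch (hypothetical rows, not the actual dataset; it assumes the engine supports inline VALUES) in which Visitor_ID 1 logs in after signing up while Visitor_ID 2 never does, so only Visitor_ID 2 is returned by the anti-join:

WITH toy AS (
    SELECT * FROM VALUES
        ('1', 'Apr', 'signup'),
        ('1', 'Jun', 'login'),
        ('2', 'Apr', 'signup'),
        ('2', 'May', 'convert_to_paid')
    AS t(Visitor_ID, Month, Event)
)
SELECT DISTINCT s.Visitor_ID
FROM toy s
LEFT JOIN toy l
    ON s.Visitor_ID = l.Visitor_ID
    AND l.Event = 'login'
    AND l.Month > s.Month
WHERE s.Event = 'signup'
    AND l.Visitor_ID IS NULL;  -- only Visitor_ID 2 remains: no login after signup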

The chart should look like this, which is rather depressing:

Monthly Recurring Revenue

We will convert the month names to numerical values to ensure the time axis on the chart is properly ordered. The line
chart cannot automatically order string values since it lacks the inherent ordering information from the data.
Additionally, note that if there are around 1,000 paid users, each paying $6 per month, the graph should not exceed
$60,000. Here is the SQL query:

SELECT
Month,
SUM(Amount) AS MRR
FROM customer_journey_saas
WHERE Event = 'payment'
GROUP BY Month
ORDER BY
CASE
WHEN Month = 'Jan' THEN 1
WHEN Month = 'Feb' THEN 2
WHEN Month = 'Mar' THEN 3
WHEN Month = 'Apr' THEN 4
WHEN Month = 'May' THEN 5
WHEN Month = 'Jun' THEN 6
WHEN Month = 'Jul' THEN 7
WHEN Month = 'Aug' THEN 8
WHEN Month = 'Sep' THEN 9
WHEN Month = 'Oct' THEN 10
WHEN Month = 'Nov' THEN 11
WHEN Month = 'Dec' THEN 12
ELSE NULL
END;

The **ORDER BY** clause with the CASE statement in the query is used to sort the results in chronological order
based on the month names (e.g., ‘Jan’, ‘Feb’, etc.). The CASE statement assigns a numeric value to each month
name:

'Jan' is assigned the value 1.

'Feb' is assigned the value 2.


'Mar' is assigned the value 3.

And so on, up to 'Dec', which is assigned the value 12.

This mapping converts the month names into a numeric order, representing the natural sequence of months.

Annual Recurring Revenue (Yearly)

This query takes the MRR for the last month (around $4,200 for December) and multiplies it by 12 to compute the Annual Recurring Revenue (ARR).

SELECT
SUM(MRR) * 12 AS ARR
FROM (
SELECT
SUM(Amount) AS MRR
FROM customer_journey_saas
WHERE Event = 'payment'
-- Find the most recent month (you can adjust this depending on your date
format)
GROUP BY Month
ORDER BY CASE
WHEN Month = 'Jan' THEN '01'
WHEN Month = 'Feb' THEN '02'
WHEN Month = 'Mar' THEN '03'
WHEN Month = 'Apr' THEN '04'
WHEN Month = 'May' THEN '05'
WHEN Month = 'Jun' THEN '06'
WHEN Month = 'Jul' THEN '07'
WHEN Month = 'Aug' THEN '08'
WHEN Month = 'Sep' THEN '09'
WHEN Month = 'Oct' THEN '10'
WHEN Month = 'Nov' THEN '11'
WHEN Month = 'Dec' THEN '12'
ELSE NULL
END DESC
-- Limit to the last month only
LIMIT 1
) AS Last_Month_MRR;

Average Revenue Per User (Monthly)

To calculate the Average Revenue Per User (ARPU) by dividing the total revenue by the total number of unique
users (both free and paid, excluding non-signups) for each month, you can use the following SQL query:

WITH Active_Signups AS (
-- Get all users who signed up (excluding non-signups)
SELECT DISTINCT
Month,
Visitor_ID
FROM customer_journey_saas
WHERE Event IN ('signup', 'login')
),
Monthly_Revenue AS (
-- Calculate the total revenue generated from payments in each month
SELECT
Month,
SUM(Amount) AS Total_Revenue
FROM customer_journey_saas
WHERE Event = 'payment'
GROUP BY Month
)
-- Calculate ARPU by dividing total revenue by the number of active signups
SELECT
a.Month,
COALESCE(mr.Total_Revenue, 0) / COUNT(DISTINCT a.Visitor_ID) AS ARPU
FROM Active_Signups a
LEFT JOIN Monthly_Revenue mr ON a.Month = mr.Month
GROUP BY a.Month, mr.Total_Revenue
ORDER BY
CASE
WHEN a.Month = 'Jan' THEN '01'
WHEN a.Month = 'Feb' THEN '02'
WHEN a.Month = 'Mar' THEN '03'
WHEN a.Month = 'Apr' THEN '04'
WHEN a.Month = 'May' THEN '05'
WHEN a.Month = 'Jun' THEN '06'
WHEN a.Month = 'Jul' THEN '07'
WHEN a.Month = 'Aug' THEN '08'
WHEN a.Month = 'Sep' THEN '09'
WHEN a.Month = 'Oct' THEN '10'
WHEN a.Month = 'Nov' THEN '11'
WHEN a.Month = 'Dec' THEN '12'
ELSE NULL
END;

Customer Lifetime Value (Yearly)

This simply takes the ARPU for the final month (December) and the churn rate for the entire year and divides the former by the latter, per the CLTV formula.

WITH Active_Signups AS (
-- Get all users who signed up or logged in (excluding non-signups)
SELECT DISTINCT
Month,
Visitor_ID
FROM customer_journey_saas
WHERE Event IN ('signup', 'login')
),
Monthly_Revenue AS (
-- Calculate the total revenue generated from payments in each month
SELECT
Month,
SUM(Amount) AS Total_Revenue
FROM customer_journey_saas
WHERE Event = 'payment'
GROUP BY Month
),
ARPU_Calculation AS (
-- Calculate ARPU for December
SELECT
a.Month,
COALESCE(mr.Total_Revenue, 0) / COUNT(DISTINCT a.Visitor_ID) AS ARPU
FROM Active_Signups a
LEFT JOIN Monthly_Revenue mr ON a.Month = mr.Month
WHERE a.Month = 'Dec'
GROUP BY a.Month, mr.Total_Revenue
),
Paid_Customers AS (
-- Total unique paid customers who converted to a paid plan during the year
SELECT DISTINCT Visitor_ID
FROM customer_journey_saas
WHERE Event = 'convert_to_paid'
),
Churned_Customers AS (
-- Users with an explicit churn event during the year
SELECT DISTINCT Visitor_ID
FROM customer_journey_saas
WHERE Event = 'churn'
),
Inactive_Customers AS (
-- Users who have stopped using the product (no login after signup
throughout the year)
SELECT DISTINCT s.Visitor_ID
FROM customer_journey_saas s
LEFT JOIN customer_journey_saas l
ON s.Visitor_ID = l.Visitor_ID
AND l.Event = 'login'
AND l.Month > s.Month
WHERE s.Event = 'signup'
AND l.Visitor_ID IS NULL -- No login after signup
),
Churn_Rate_Calculation AS (
-- Calculate the Churn Rate for the entire year based on both churned and
inactive customers
SELECT
(COUNT(DISTINCT c.Visitor_ID) + COUNT(DISTINCT i.Visitor_ID)) * 100.0 /
COUNT(DISTINCT p.Visitor_ID) AS Churn_Rate
FROM Paid_Customers p
LEFT JOIN Churned_Customers c ON p.Visitor_ID = c.Visitor_ID
LEFT JOIN Inactive_Customers i ON p.Visitor_ID = i.Visitor_ID
)
-- Calculate the CLTV at the end of the year (using ARPU for December and Churn
Rate for the entire year)
SELECT
'Dec' AS Period,
ac.ARPU,
cr.Churn_Rate,
COALESCE(ac.ARPU, 0) / NULLIF(cr.Churn_Rate / 100.0, 0) AS CLTV -- Convert
Churn Rate to decimal
FROM ARPU_Calculation ac
CROSS JOIN Churn_Rate_Calculation cr;

It’s crucial for the company to recognize that a 15% cancellation rate might initially seem manageable, leading them
to estimate the CLTV (Customer Lifetime Value) at around $11.2 per customer, assuming a steady revenue stream and
ignoring the usage statistics. However, a deeper look reveals a concerning trend: many customers are not actively
using the product. This inactivity is a strong indication that these customers may eventually cancel their subscriptions.
Relying solely on the churn rate without considering engagement metrics could create a false sense of security,
masking the risk of future cancellations and revenue loss. It’s a signal that the company needs to address product
adoption and engagement to truly understand and improve customer retention.

Monthly Active Users (MAU)

Based on our definition earlier, a user counts toward MAU if they have logged in at least 5 times in a month.

WITH Monthly_Logins AS (
-- Count the number of logins per user per month
SELECT
Month,
Visitor_ID,
COUNT(*) AS Login_Count
FROM customer_journey_saas
WHERE Event = 'login'
GROUP BY Month, Visitor_ID
)
-- Calculate MAU by counting users with at least 5 logins in the month
SELECT
Month,
COUNT(DISTINCT Visitor_ID) AS MAU
FROM Monthly_Logins
WHERE Login_Count >= 5
GROUP BY Month
ORDER BY
CASE
WHEN Month = 'Jan' THEN 1
WHEN Month = 'Feb' THEN 2
WHEN Month = 'Mar' THEN 3
WHEN Month = 'Apr' THEN 4
WHEN Month = 'May' THEN 5
WHEN Month = 'Jun' THEN 6
WHEN Month = 'Jul' THEN 7
WHEN Month = 'Aug' THEN 8
WHEN Month = 'Sep' THEN 9
WHEN Month = 'Oct' THEN 10
WHEN Month = 'Nov' THEN 11
WHEN Month = 'Dec' THEN 12
ELSE NULL
END;

The chart should look like this:

Since we don’t have timestamps to identify daily login activity, we need to make assumptions to estimate the Daily
Active Users (DAU). One common approach is to assume that user activity is evenly distributed across the days in
the month. In other words, we assume that the number of active users is approximately the same for each day.

If we know the Monthly Active Users (MAU)—the number of unique users who logged in at least 5 times during the
month—we can distribute this number across the days in the month to estimate the average number of daily active
users.

The formula for this estimation is:

$$\text{Estimated DAU} = \frac{\text{MAU}}{\text{Number of Days in the Month}}$$

The SQL code will need to be adapted for this approximation:

WITH Monthly_Logins AS (
-- Count the number of unique users with at least one login in the month
-- (note: unlike the MAU query above, this version does not apply the 5-login threshold)
SELECT
Month,
COUNT(DISTINCT Visitor_ID) AS MAU
FROM customer_journey_saas
WHERE Event = 'login'
GROUP BY Month
)
-- Estimate DAU based on uniform distribution
SELECT
Month,
MAU / CASE
WHEN Month IN ('Jan', 'Mar', 'May', 'Jul', 'Aug', 'Oct', 'Dec')
THEN 31
WHEN Month IN ('Apr', 'Jun', 'Sep', 'Nov') THEN 30
WHEN Month = 'Feb' THEN 28 -- or 29 for leap years
ELSE NULL
END AS Estimated_DAU
FROM Monthly_Logins
ORDER BY
CASE
WHEN Month = 'Jan' THEN 1
WHEN Month = 'Feb' THEN 2
WHEN Month = 'Mar' THEN 3
WHEN Month = 'Apr' THEN 4
WHEN Month = 'May' THEN 5
WHEN Month = 'Jun' THEN 6
WHEN Month = 'Jul' THEN 7
WHEN Month = 'Aug' THEN 8
WHEN Month = 'Sep' THEN 9
WHEN Month = 'Oct' THEN 10
WHEN Month = 'Nov' THEN 11
WHEN Month = 'Dec' THEN 12
ELSE NULL
END;

The chart will look like the following:

Analysis & Recommendations

Summary of key metrics:

Unique Visitors: 9,279 yearly.

Total Signups: 2,887 yearly.

Unique Paid Customers: 838 yearly.

Churned Customers: 133 yearly

Paid Conversion Rate: 29.03% yearly

Churn Rate: 54.42% annually.

Monthly Recurring Revenue (MRR): $4200 in December

Annual Recurring Revenue (ARR): $50,675


Average Revenue Per User (ARPU): $2.98 yearly

Customer Lifetime Value (CLTV): $2.98 yearly

High Churn Rate (54.42%): Over half of the paying customers are leaving within a year, which significantly
impacts revenue and Customer Lifetime Value (CLTV).

Low CLTV ($2.98) Relative to Subscription Fee ($5.99): The CLTV is much lower than the monthly
subscription cost, indicating that many customers do not stay subscribed long enough to cover even one full
month.

Low Average Revenue Per User (ARPU) ($2.98): The ARPU is also quite low, suggesting that the overall
revenue generated per user is not sufficient to compensate for high churn.

Disparity Between MAU (2,200) and DAU (83): The significant difference between Monthly Active Users
(MAU) and Daily Active Users (DAU) indicates that many users are not engaging with the product frequently,
potentially leading to cancellations.

Paid Conversion Rate (29.03%): Nearly a third of the users who sign up for the free version convert to a paid
plan, which shows that the product does provide perceived value to a portion of the users.

Growth Potential: With 9,279 unique visitors and 2,887 total signups annually, there is a sizable number of
potential customers. Increasing the conversion rate or re-engaging inactive users could lead to significant growth.

Monthly Recurring Revenue (MRR) ($4,200 in December): The MRR for December indicates some revenue
stability, which can be used as a foundation to build upon by reducing churn and boosting engagement.

Reduce Churn: Addressing churn should be the top priority. Strategies may include offering discounts for
longer-term commitments, improving customer support, and proactively engaging customers who show signs of
potential churn.

Boost Engagement: Focus on increasing DAU by enhancing user engagement. Improve onboarding to highlight
the product’s benefits, introduce gamification or loyalty rewards, and release new features to incentivize daily
usage.

Increase CLTV and ARPU: Introduce premium tiers or add-ons to increase ARPU. Additionally, emphasize
upselling and cross-selling to current customers to improve CLTV.

Leverage Conversion Opportunities: With a conversion rate of 29.03%, there is room for improvement.
Implementing targeted marketing campaigns or incentivizing trials could help increase the number of users
converting from free to paid.

Appendix: Derivation of the CLTV Formula

In a retention decay model, the probability that a customer remains subscribed at any time **t** (where **t** is in months, for example) is given by an exponential decay function:

$$R(t) = e^{-\text{Churn Rate} \times t}$$

This describes how many customers remain at time **t**, given a constant churn rate.

Let's assume that the business earns a fixed Average Revenue per User (ARPU) per month. The expected revenue from a customer at time **t** (ARPU weighted by the probability that the customer is still subscribed) is

$$\text{Revenue at time } t = \text{ARPU} \times e^{-\text{Churn Rate} \times t}$$

To calculate the total expected revenue (CLTV) over a customer's lifetime, we need to sum (integrate) the revenue contribution from time **t=0** to **infinity** (as we assume the time horizon for the customer relationship to be infinite in this model). Thus, we integrate the revenue flow over time:

$$\text{CLTV} = \int_0^\infty \text{ARPU} \times e^{-\text{Churn Rate} \times t} \, dt$$

which gives

$$\text{CLTV} = \text{ARPU} \times \frac{1}{\text{Churn Rate}}$$
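
As a quick check on that last integration step, here is a minimal sketch (not part of the original guide) that evaluates the integral symbolically with the sympy library:

import sympy as sp

# Symbols: ARPU (revenue per month) and c (constant monthly churn rate), both positive
ARPU, c, t = sp.symbols('ARPU c t', positive=True)

# Integrate the decaying revenue flow ARPU * e^(-c*t) from 0 to infinity
cltv = sp.integrate(ARPU * sp.exp(-c * t), (t, 0, sp.oo))

print(cltv)  # -> ARPU/c, i.e. CLTV = ARPU x (1 / Churn Rate)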

Last updated 4 months ago

Key formulas used in this chapter:

$$\text{Churn Rate} = \left( \frac{\text{Number of Churned Customers} + \text{Number of Inactive Customers}}{\text{Total Number of Paid Customers}} \right) \times 100$$

$$\text{MRR} = \text{Number of Paying Users} \times \text{Subscription Fee Per User}$$

$$\text{ARPU} = \frac{\text{Total Monthly Recurring Revenue (MRR)}}{\text{Total Number of Active Users (Free and Paid)}}$$

$$\text{CLTV} = \frac{\text{Average Revenue per User (ARPU)}}{\text{Churn Rate}}$$

$$\text{Conversion Rate} = \left( \frac{\text{Number of Users Who Convert to Paid}}{\text{Total Signups}} \right) \times 100$$

Figure captions from this chapter: dataset load commands; query verifying that the raw data has been loaded; big-number visualizations of the key metrics; churned customers for the year; trend showing an exponential growth curve; annual recurring revenue; CLTV (not looking great); MAU trend (free and paid customers).


Estimated DAU on a monthly basis

Basic Metrics in first dashboard

Basic Metrics in second dashboard.


https://data-distiller.all-stuff-data.com/unit-8-data-distiller-statistics-and-machine-learning/statsml-100-python-and-
jupyterlab-setup-for-data-distiller * * *

Even if you are a seasoned data scientist, it is recommended that you install DBVisualizer so that you can execute your queries directly against the Data Distiller database rather than having to query through Python. The query strings you refine in DBVisualizer can then be pasted back into Python, making data access simpler.

Also, to execute the sample code, you will need a dataset in Adobe Experience Platform. Follow the steps to upload
the dataset here:

Install Python: Anaconda Distribution

Download the free version of Python and its associated libraries for your operating system here:

There is a wizard that will walk you through the steps of the installation here:

Remember that you will be using pip to download packages for your Python distribution as the use cases require. It is not necessary to have all the packages installed up front; you can add them at any time. pip (Package Installer for Python) is already included in the Anaconda distribution:

Install JupyterLab as the Data Science Editor

Follow the instructions here:

Open a Terminal window on Mac or Command Prompt on Windows, navigate to the directory where you want to start JupyterLab, and then type jupyter lab:

It should look like this:

Warning: When you launch a JupyterLab notebook, make sure you are in a directory where you can read and write files. Change your directory to that location first and then launch JupyterLab from there.

Connect to Data Distiller

In your notebook, you need to establish a connection. This is done using the following code:

import psycopg2;
conn = psycopg2.connect('''sslmode=require host=<YOUR_HOST_CREDENTIAL> port=80
dbname=prod:all user=<USERNAME> password=<YOUR_PASSWORD>''')

1. The information for these fields will be available by navigating to Data Management->Queries->Credentials.

2. Click and copy the fields over to your connection strings.


Tip: When you copy and paste the password, it will be long - so do not panic!

Warning: Remember that each placeholder above should be replaced by its actual string value. Do not keep the < and >.

Execute the following code:

import psycopg2;

# Establish a connection to the database


conn = psycopg2.connect('''sslmode=require host=ZZZZ port=80 dbname=prod:all
user=YYYYY@AdobeOrg password=XXXXX''')

# Create a cursor object for executing SQL commands


cursor = conn.cursor()

# Example query
query = "SELECT * FROM movie_data;"
cursor.execute(query)

# Fetch all the results


# results = cursor.fetchall()

# Fetch the results in chunks


chunk_size = 50
while True:
    # chunk is a list of up to 50 rows pulled from the cursor's result set
    chunk = cursor.fetchmany(chunk_size);
    # Break the while loop if there are no rows left to be fetched
    if not chunk:
        break
    # Print each row. print() adds a newline by default.
    for row in chunk:
        print(row);

# Close the cursor and connection


cursor.close()
conn.close()

These are the results that you should see:

Note the following:

1. psycopg2 module is a popular PostgreSQL adapter for Python. It allows Python programs to interact with
PostgreSQL databases, enabling you to perform various database operations such as connecting to a database,
executing SQL queries, and more.

2. ZZZZ is the name of the host

3. YYYY is the username which is the same as the IMSOrg name

4. In the context of database programming, a “cursor” is an object that allows you to interact with the database and
manage the execution of SQL queries. The line of code cursor = conn.cursor() is used to create a
cursor object associated with a database connection (conn).
5. **query = "SELECT * FROM movie_data;"**: This line prepares the SQL query to select all
columns from the movie_data table.

6. **cursor.execute(query)**: This line executes the SQL query using the cursor. It sends the query to
Data Distiller which begins processing the query and fetching the results.

7. Fetching results in chunks:

Note that the query is executed once at the beginning to retrieve all the results, and then you’re fetching
those results in chunks using fetchmany() in the loop.

Keep in mind that fetchmany() retrieves already-fetched rows from the cursor’s internal buffer, so
you’re not re-executing the query with each call to _**fetchmany()**_. You’re simply fetching
chunks of data that have already been retrieved from the database.

The loop starts, and chunk = cursor.fetchmany(chunk_size) fetches a chunk of rows from the
result set. The number of rows fetched is determined by the chunk_size parameter.

If there are still rows left in the result set, the loop continues to the next iteration, fetching the next chunk.

If there are no more rows left to fetch, cursor.fetchmany() returns an empty list, which leads to the
condition if not chunk: being satisfied.

The loop breaks and the fetching process stops.

JupyterLab initial screen.

Access the Data Distiller credentials UI

Map the fields as shown above.

https://data-distiller.all-stuff-data.com/unit-8-data-distiller-statistics-and-machine-learning/statsml-101-learn-basic-
python-online * * *

Remember the goal of the connectivity to the Data Distiller from Python is to extract a table for analysis. This table is
typically a sample that will be stored as a “DataFrame” (a table within Python) and subsequent operations will operate
on this local DataFrame. This is no different from downloading the results of a SQL query in Data Distiller as a local
file and then reading that into Python. For our training, we will assume that you have extracted this table as a CSV file
locally.
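
For example, a minimal sketch of that extraction step, assuming the psycopg2 connection parameters from the previous chapter and the movie_data table used there (the local CSV file name is just an illustration):

import pandas as pd
import psycopg2

# Connection placeholders: use the values from Data Management -> Queries -> Credentials
conn = psycopg2.connect('''sslmode=require host=<YOUR_HOST_CREDENTIAL> port=80
dbname=prod:all user=<USERNAME> password=<YOUR_PASSWORD>''')

# Pull the query result into a local DataFrame, then persist it as a CSV sample
df = pd.read_sql("SELECT * FROM movie_data", conn)
df.to_csv('movie_data_sample.csv', index=False)  # hypothetical file name for the local sample
conn.close()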

Create an Account in Kaggle

To learn Python on the go, we will be using the notebook editor at Kaggle. All you need is an email address to login.

Warning: Do not use Kaggle for uploading any client or customer data. Even if it means that you are sampling the
data for prototyping the algorithm. Kaggle is owned by Google and the data is kept on the cloud. You should use
Kaggle for learning Python with example data. If you want to prototype with customer data, your best option is a local
installation of Python with Jupyterlab as the frontend UI. But make sure you know what the governance policies are in
your organization or department.

To upload the data, click on the ”+” on the Homepage and “New Dataset”. Upload the dataset from your local machine
and name it.

Make sure you name the notebook and also add the dataset you uploaded.
To find the path to this file, you will need to click through on the datasets and see the path:

Warning: As you go through this tutorial, you will need to copy and paste the code line by line into a notebook so that
the code works correctly. I have intentionally made it this way so that you do not skip key learning points.

The word “pandas” originated from “panel data”, a term used for describing multi-dimensional data that vary over
time.

Pandas is perhaps the most important library as far as we are concerned as it allows for manipulation and analysis in
the Python programming language.

The key features that we will be using:

1. Data Structures: Pandas introduces two main data structures, the Series and the DataFrame. A Series is
essentially a one-dimensional array with labeled indices, while a DataFrame is a two-dimensional tabular data
structure, similar to a spreadsheet or a SQL table.

2. SQL-Like Operations:

1. Much like SQL engines, Pandas provides powerful tools for manipulating and transforming data. You can filter, sort, aggregate, and reshape data easily (see the short sketch after this list).

2. It has functions to handle missing data, duplicate values, and data type conversions, making data-cleaning
tasks more manageable.

3. You can combine multiple data sources through operations like merging, joining, and concatenating
DataFrames.
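
Here is a minimal sketch (with made-up data, not from the tutorial dataset) of the two data structures and a SQL-like operation on them:

import pandas as pd

# A Series: one-dimensional, with labeled indices
ages = pd.Series([25, 32, 47], name='age')

# A DataFrame: two-dimensional, like a SQL table or a spreadsheet
people = pd.DataFrame({'name': ['Ann', 'Bo', 'Cy'], 'age': [25, 32, 47]})

# SQL-like filter and aggregation, roughly: SELECT AVG(age) FROM people WHERE age > 30
print(people[people['age'] > 30]['age'].mean())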

If you need to use Pandas locally, you'll need to install it first. You can install it by running pip install pandas in your Python environment:

Tip: SQLAlchemy is similar in spirit to Pandas but is a library geared more toward SQL users with object-centric thinking, where SQL constructs like TABLE are first-class objects. Even if you love SQL, Pandas is important for you to learn.

Execute the following piece of code and ensure that the CSV file path is specified correctly as mentioned earlier

import pandas as pd;

# Create a variable by reading the CSV file


data = pd.read_csv('Python101Data');
# Create a DataFrame
df = pd.DataFrame(data);

# Print the full contents of the DataFrame


print(df);

You should see the results as:

Note the following:

import pandas: This part indicates that you want to use the functionality provided by the Pandas library in
your program.

as pd: This part gives an alias to the imported library. By using pd, you’re saying that instead of typing out
“pandas” every time you want to use a Pandas function, you can use the shorter alias “pd.”
DataFrame: As mentioned earlier, this is a class provided by the Pandas library that represents a tabular data
structure, similar to a table in a database or a spreadsheet.

data: This is the variable that holds the data used to create the DataFrame. pd.DataFrame() also accepts a dictionary where the keys are the column names and the values are the data for each column; here, data already holds the table read from the CSV file, so the constructor simply wraps it.

print(df)will display the entire DataFrame by default if it’s not too large. However, if the DataFrame is large,
Pandas will display a summarized view, showing only a subset of rows and columns with an ellipsis (...) to
indicate that there’s more data not shown.

Let us now execute df.describe():

The only column that will have statistics is the id column.

df.describe() will generate statistics on the numerical columns of the DataFrame. This is very similar to
ANALYZE TABLE command for computing statistics in Data Distiller.

**count**: The number of non-null values in each column.

**mean**: The mean (average) value of each column.

**std**: The standard deviation, which measures the dispersion of values around the mean.

**min**: The minimum value in each column.

**25%**: The 25th percentile (also known as the first quartile).

**50%**: The 50th percentile (also known as the median).

**75%**: The 75th percentile (also known as the third quartile).

**max**: The maximum value in each column.

Let us try to preview the first 10 rows by executing df.head(10):

Let us count the number of each gender type in the population

grouped_gender = df.groupby('gender').count();
print(grouped_gender);

Remember that grouped_gender is a DataFrame. When you use the groupby() function and then apply an
aggregation function like count(), it returns a DataFrame with the counts of occurrences for each gender type. The
above code is very similar to an aggregation COUNT with GROUP BY in SQL.

The answer that you will get should look like this:

Other functions that you could have used in place of count() are sum(), mean(), std(), var(), min(),
max(), and median().

Let us create a function that computes the percentage of total for all the gender types

# Define the function
def percent_of_total(column):
    return 100*column/column.sum();

# Apply the function to the 'gender' column
percent_of_total_gender = percent_of_total(grouped_gender);
print(percent_of_total_gender);

Note the following:

1. The hash sign # is used to create comments.

2. Note that the def line ends with a colon.

3. return should be indented properly.

4. The function percent_of_total divides each individual element in the column by the column total.

5. percent_of_total_gender is also a DataFrame, as will be obvious from the answers below.

The answers you will get will look like this:

To just retrieve a single column, let us use:

print(percent_of_total_gender['id']);

This gives

Alternatively, we could have also created a Series object instead of a DataFrame for percent_of_total_gender

percent_of_total_gender = percent_of_total(grouped_gender['id']);
print(percent_of_total_gender);

And that would give us the exact same answer.

Let us persist these results, which are a Series object, into a new DataFrame:

percent_of_total_df = percent_of_total_gender.to_frame(name='Percentage');
print(percent_of_total_df);

Results are

Generate a Randomized Yearly Purchase Column

We are going to emulate the random number generation as in the example here:

Also, let us take this new column and add it to the DataFrame. Let us execute this code:

import random;

# Function to generate random purchases
def generate_random_purchases(column):
    return random.randint(1000, 10000)  # Modify the range as needed

# Apply the function to generate purchases for each row
df['YearlyPurchases'] = df['id'].apply(generate_random_purchases)

print(df)

The results show that a new column was added:


To learn more about the random library, read this.

Let us make our first foray into visualizing the histogram:

import matplotlib.pyplot as viz;


viz.hist(df['gender'], bins=10, edgecolor='black');
viz.xlabel('Gender');
viz.ylabel('Frequency');
viz.title('Histogram of Gender');
viz.show();

matplotlib.pyplot is a visualization library in Python whose plotting commands are intentionally similar to MATLAB's. To plot a chart like the histogram, you can use this site as a reference.

The code is no different from what we used for creating a DataFrame. You first initialize a handle on a library and then
access the functions within that library. The function names should be self-explanatory as to what they do.

The results look like this:

The histogram looks messy so let us clean this up:

import matplotlib.pyplot as viz;


viz.hist(df['gender'], bins=10, edgecolor='black');
viz.xlabel('Gender');
viz.ylabel('Frequency');
viz.title('Histogram of Gender');
viz.tick_params(axis='x', rotation=45)
viz.tight_layout()
viz.show();

We added two extra functions

The viz.tick_params(axis='x', rotation=45) rotates the x-labels by 45 degrees

The viz.tight_layout() improves the spacing and layout of the plot elements to avoid overlapping.

(Extra Credit) Advanced Visualizations

There is one last thing we want to do here. What if we wanted to plot a histogram and a bar graph together at the same
time?

The answer is that if you have ever used MATLAB, the following code will seem similar:

# Create a 1x2 grid of plots


fig, axes = viz.subplots(1, 2, figsize=(12, 5))

# Histogram Plot
axes[0].hist(df['gender'], bins=10, edgecolor='black', color='skyblue')
axes[0].set_xlabel('Gender')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Histogram of Gender')
axes[0].tick_params(axis='x', rotation=45)

# Bar Plot
gender_counts = df['gender'].value_counts()
axes[1].bar(gender_counts.index, gender_counts.values, color='salmon')
axes[1].set_xlabel('Gender')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Bar Plot of Gender')
axes[1].tick_params(axis='x', rotation=45)

# Adjust layout for better spacing


fig.tight_layout()

#Display
viz.show()

The results will look like:

Note the following in the code:

1. The heart of this code is fig, axes = viz.subplots(1, 2, figsize=(12, 5)) Much like
MATLAB, this function call creates a grid of subplots in a single figure.

1: The number of rows in the grid.

2: The number of columns in the grid

figsize=(12, 5): This specifies the size of the entire figure in inches. (12, 5) means the figure will
be 12 inches wide and 5 inches tall.

The function returns two objects:

fig: The figure object, which represents the entire figure.

axes: An array of subplot axes. In this case it is a one-row, two-column layout, so axes[0] and axes[1] refer to the left and right plots.

2. fig.tight_layout() is called at the entire figure level rather than on individual charts. That is how this library has been designed.

(Extra Credit) Exploring the random library

Generating random data is a good skill to acquire especially in the world of data science. The random library in
Python is a built-in module that provides functions for generating random numbers and performing various random
operations.

Here are some of the key functions provided by the random library:

1. Generating Random Numbers:

random.random(): Generates a random float between 0 and 1.

random.randint(a, b): Generates a random integer between a and b (inclusive).

random.uniform(a, b): Generates a random float between a and b.

2. Generating Random Sequences:

random.choice(sequence): Returns a random element from the given sequence


random.sample(sequence, k): Returns a list of k unique random elements from the sequence.

random.shuffle(sequence): Shuffles the elements in the sequence randomly.

3. Random Selection:

random.choices(population, weights=None, k=1): Returns a list of k elements randomly


selected from the population, possibly with specified weights.

4. Randomness Simulation:

random.seed(a=None): Initializes the random number generator with a seed. Providing the same seed
will produce the same sequence of random numbers.

Functions like random.random() generate pseudo-random numbers, which appear random but are actually determined by the initial state (seed) of the random number generator.

Here is some example code to try out:

import random

libraries = ["NumPy", "Pandas", "Matplotlib", "TensorFlow", "Scikit-learn",


"PyTorch"]

# Choose a random library from the list


random_library_choice = random.choice(libraries)

# Choose 2 random libraries without replacement (no duplicates)


random_library_sequence = random.sample(libraries, 2)

# Shuffle the list of libraries in place


random.shuffle(libraries)
random_library_shuffle = libraries

print("Randomly selected library:", random_library_choice)


print("Randomly selected sequence of libraries:", random_library_sequence)
print("Shuffled list of libraries:", random_library_shuffle)

# Set the seed for reproducibility


seed_value = 23
# Initialize the random number generator
random.seed(seed_value)
# Generate random float between 0 and 1
random_numbers = [random.random() for i in range(1,10,1)]

# print the values


print("Random numbers:", random_numbers)
print("Random numbers:", random_numbers)
print("Random numbers generated with seed", random_numbers)

Remember the syntax for for loop

for element in iterable:
    # Code block to be executed for each element
    # Indentation is crucial in Python to define the block of code inside the loop
You can also use a for loop with the range() function to iterate over a sequence of numbers:

for number in range(1, 10, 1):
    print(number)
    # Code to be executed in each iteration
    # The starting value of the sequence is 1 and it will be included
    # The ending value of the sequence is 10 and it will be excluded
    # The step size between each number is optional; the default step is 1

Other Important Python Libraries

1. Scientific

1. NumPy: Similar to MATLAB. A fundamental package for scientific computing with Python. It provides
support for arrays and matrices, along with mathematical functions to operate on these structures efficiently.

2. Machine Learning

Scikit-learn: A machine learning library that provides simple and efficient tools for data mining and data
analysis. It includes a wide variety of machine-learning algorithms and tools for tasks like classification,
regression, clustering, and more.

TensorFlow: An open-source machine learning framework developed by Google. It’s widely used for
building and training deep learning models, especially neural networks.

PyTorch: Another popular open-source machine learning framework, developed by Facebook’s AI


Research lab. It’s known for its dynamic computation graph and ease of use in building neural networks. It
is very popular in the research community.

3. SQLAlchemy: A SQL toolkit and Object-Relational Mapping (ORM) library for Python. It simplifies database
interactions and allows you to work with databases in a more Pythonic way. This is required for Data Distiller.

4. Requests: A simple and elegant HTTP library for making HTTP requests to interact with web services and APIs.
This is useful for working with Adobe Experience Platform APIs.

Create a New Notebook. You can also add a New Dataset

Add Python101Data data source along with the CSV data to your notebook.

Make sure you click on the data source and access the CSV file to get the full path.

print(df) summarizes the results much like a SELECT * on a table.

Descriptive statistics of the DataFrame

Preview the first 10 rows. We can change the parameter from 10 to higher or a lower number.

Various types of gender present in the dataset.

Percentage of total computation.

New Dataframe object created from the Series object

Adding a column to a DataFrame.

Histogram of gender type.


Plotting two different visualizations side by side.

https://data-distiller.all-stuff-data.com/unit-8-data-distiller-statistics-and-machine-learning/statsml-200-unlock-dataset-
metadata-insights-via-adobe-experience-platform-apis-and-python * * *

1. Unit 8: DATA DISTILLER STATISTICS & MACHINE LEARNING

STATSML 200: Unlock Dataset Metadata Insights via Adobe Experience Platform APIs and Python

This chapter covers the essential steps for installing necessary libraries, generating access tokens, and making authenticated API requests.

Last updated 5 months ago

Our objective is to create a snapshot of the datasets available in the Adobe Experience Platform (AEP) catalog and
enrich this snapshot with additional dimensions such as the number of rows, dataset size, and whether the dataset is
managed by the customer or the system. While some of this information is available through the AEP UI under the
Datasets tab, our goal is to extract this data programmatically, store it in a data lake, and leverage Data Distiller for
advanced slice-and-dice analysis.

By doing so, we will be able to track the growth of datasets over time, with a particular focus on those marked for the
Profile Store and Identity Store. This use case falls under the category of metadata analytics, involving potentially
hundreds or thousands of datasets. Although AEP exposes some of this metadata via system datasets, our requirements
demand a customized view of the base dataset. In such scenarios, a Python-based solution to extract, format, and
ingest the data into the platform becomes a highly effective approach.

REST APIs in the Platform

A fundamental aspect of Adobe Experience Platform’s architecture is its use of modular services that interact through
REST APIs to perform tasks. This service-oriented approach forms the backbone of the platform. So, if you’re unable
to find specific data or information in the UI or product documentation, the REST APIs should be your go-to resource.
In essence, every UI feature and function is powered by underlying API calls.

Adobe Experience Platform APIs: This is the page that you need to keep an eye on.

My approach to working with APIs begins by creating a Developer Project, obtaining the necessary credentials and
IMS organization details. I then use Python to extract, process, and format the data. Finally, I leverage Python’s
Postgres SQL connector to write the data back into Adobe Experience Platform. I could use Postman as well, but
having spent a significant amount of time working at MathWorks, I have a deep appreciation for both MATLAB and
Python. This is why I tend to favor Python for my workflow—it’s just a personal preference!

Python vs Postman: Choosing the Right Tool for REST API Interactions

Both Python and Postman offer powerful ways to interact with REST APIs, but they serve different purposes
depending on the context. Postman is ideal for quick testing, debugging, and prototyping of API calls. It provides an
intuitive user interface for crafting requests, viewing responses, and managing collections of API calls without needing
to write code. This makes it particularly useful for developers during the early stages of API development or testing, as
well as for non-developers who need to work with APIs. However, Python excels when you need to integrate API
calls into a larger workflow, automate tasks, or manipulate the response data programmatically. With libraries like
requests, Python enables greater flexibility and customization, allowing for complex logic, error handling, and
scheduled tasks. The trade-off is that Python requires more setup and knowledge of coding, making it more suitable for
repeatable, automated processes rather than ad-hoc testing. In summary, Postman shines in quick, manual interactions
with APIs, while Python is the go-to for automation and advanced API workflows.

Before You Create a Project

Reach out to an Admin Console administrator within your organization and request to be added as a developer to an Experience Platform product profile through the Admin Console. Also make sure that you are added as a user to a product profile. Without these permissions, you will not be able to create a project or move any further with the APIs.

Once you are done creating a project, you will be generating the following that you will use to gain access:

1. **{ACCESS_TOKEN}**: This is a short-lived token used to authenticate API requests. It is obtained through an OAuth flow or Adobe's IMS (Identity Management System) and is required for authorization to interact with AEP services. The {ACCESS_TOKEN} must be refreshed every 24 hours, as it is a short-lived token. This means that each day you need to request a new access token using your API credentials to continue making authenticated API requests to Adobe Experience Platform.

2. **{API_KEY}**: Also known as the {CLIENT_ID}, this is a unique identifier assigned to your application when you create a Developer Project in Adobe Developer Console. It is used to authenticate and identify the application making the API requests.

3. **{CLIENT_SECRET}**: The Client Secret is used along with the Client ID to authenticate your application when requesting an access token. It acts as a password for your app and should be kept secure. When using OAuth Server-to-Server authentication, you need both the Client ID and Client Secret to securely generate the access token.

4. **{SCOPES}**: Scopes define the permissions your application is requesting. They specify which Adobe services or APIs your access token can interact with. Without defining the appropriate scopes, your token may not have the necessary access to certain resources in the platform.

5. The access token is generated after successful authentication using the Client ID, Client Secret, and Scopes. It is then used in each API request to authorize the interaction.

6. **{ORG_ID}**: The Organization ID (IMS Organization ID) uniquely identifies your Adobe Experience Platform organization. This ID is required in API requests to specify which organization's data you are interacting with.

7. **{SANDBOX_NAME}**: Some APIs may require the sandbox name to identify the specific sandbox environment you are working in within Adobe Experience Platform. Sandboxes allow you to segment and manage different environments (such as development, staging, and production) to safely test, develop, and deploy features without affecting live data.

8. **{TECHNICAL_ACCOUNT_ID}**: This belongs to a machine account (not a human user); it is used when your application needs to perform automated tasks, such as fetching data or executing processes in Adobe Experience Platform, without any user interaction.

Stuff I Wish They Told Me

What is an access token really?

It grants the bearer permission to interact with the AEP services for a limited time (typically 24 hours). The token
includes encoded information about the user or application, the permissions granted, and its expiration time.
In simple terms, the access token serves as a “proof of identity” that you or your application can present when calling
AEP APIs, ensuring that only authorized users can perform actions or retrieve data from the platform.

Why do I need an API key if I have the access token?

When we are making API calls from Python (or any other client), both the API key and the access token play different
roles:

When you make API calls, the API key (also known as the Client ID) is sent along with your requests. This helps
Adobe track which application is making the requests. The API key is typically passed in the headers or query
parameters of each request. It allows Adobe to monitor usage, enforce rate limits, and tie your requests to the specific
developer project associated with your API key.

1. Access Token for Every API Request:

The access token is used in every single API request to authenticate and authorize your actions. It’s sent in the
headers of your API calls, typically in the form of a “Bearer” token (a kind of token used in HTTP authentication).
Without a valid access token, your API calls will be rejected because the platform needs to confirm your identity and
the scope of permissions granted to you.
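
Put together, the two credentials typically travel in the request headers like this (a sketch using placeholder values and the header names that appear later in this chapter):

# Placeholders: substitute the values from your Developer Project
access_token = "<ACCESS_TOKEN>"   # short-lived proof of identity and permissions
client_id = "<API_KEY>"           # identifies the calling application
org_id = "<ORG_ID>"               # the IMS Organization the request targets

headers = {
    "Authorization": f"Bearer {access_token}",
    "x-api-key": client_id,
    "x-gw-ims-org-id": org_id,
}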

Wait, where are these API calls going in the Adobe system?

When you make API requests to Adobe Experience Platform (AEP) or even other Adobe services, those requests are
routed through Adobe I/O, Adobe’s central API gateway.

Adobe I/O is not a traditional web server; it functions as an API gateway and integration platform, providing a
centralized access point. It works with Adobe’s Identity Management System (IMS) to validate your API keys and
access tokens, ensuring you have the proper permissions to access the service. Once authenticated, Adobe I/O directs
your requests to the correct Adobe service, such as AEP or Analytics.

Additionally, Adobe I/O manages traffic by enforcing rate limits to ensure fair usage and protect Adobe’s
infrastructure.

What is this OAuth Server-to-Server?

OAuth Server-to-Server is an authorization flow that lets you grant an app (the Python client we will use) limited access to your data or services without giving away your personal login details. Instead of sharing your password, you authorize the app to obtain an access token, which the app (our Python client) then uses to make secure API requests. OAuth generates access tokens that are short-lived and easily revoked, reducing the risk if a token is compromised.

Create a Developer Project

Here are the steps to create a project

2. You will be greeted by a screen that looks like this. Click on Create Project

3. The new Project screen will look like this:

4. Add API and choose Adobe Experience Platform and Experience Platform API

5. Click on Next and you will see the following screen. Choose OAuth Server-to-Server

6. Now you need to choose the product profiles. Product profile lets you define what products you have access to.
The product profile is typically created by the administrator at https://adminconsole.adobe.com/.
Previously, permissions for role-based access control on AEP objects, such as dataset access, were managed within the
product profile. However, this has now changed with Permissions being introduced as a separate feature in the left
navigation panel of the AEP UI. It’s important to confirm with your admin that you’ve selected a product profile that
grants you developer access to Adobe Experience Platform.

1. Click Save for the configured API and you will see the following. There will be an API Key that you can copy. You will also get an option to Generate access token.

2. You can also edit the project if you would like to change the name or description

3. Also, if you clicked on OAuth Server-to-Server, you will be able to see new information such as CLIENT ID, CLIENT SECRET and SCOPES. Copy them. You will use this information to generate the access token every 24 hours to connect to the Platform APIs. Alternatively, you can log into this project every day and generate the access token. Check the SCOPES to make sure you have the right scope for the service that you are querying.

4. The API requests will be sent to an endpoint. To access that endpoint, click on the View cURL command.

5. If you scroll down, you will get additional details on **ORG_ID** and **TECHNICAL_ACCOUNT_ID**
which you should copy. Note the Credential name. You will need this later in the Permissions UI.

11. Go to the AEP UI or have the admin add a role with the right permissions to your Technical Account ID. Click on
Permissions->Roles->Create Role

1. Add the All permission in Data Management and make sure you choose the right Sandboxes. Also add all
permissions for Sandbox Administration, Data Ingestion and Query Service

2. Scroll to API Credentials. The credential name is the same one that we saw in the project. Click on the
credential name from your project and go to Roles and add the role you created.

3. Let us write some Python code to do so. Copy paste the following in JupyterLab and make sure you have all the
parameters from the previous section.

!pip install requests

import requests

# Replace with your Adobe credentials
client_id = 'your_client_id'
client_secret = 'your_client_secret'
org_id = 'your_org_id'
tech_account_id = 'your_tech_account_id'
scope = 'scope'

auth_endpoint = 'https://ims-na1.adobelogin.com/ims/token/v2'

# Prepare the data for the access token request
data = {
    'grant_type': 'client_credentials',
    'client_id': client_id,
    'client_secret': client_secret,
    'scope': scope  # Specify the scope relevant to your API usage
}

# Request the access token from Adobe IMS
response = requests.post(auth_endpoint, data=data)

if response.status_code == 200:
    access_token = response.json()['access_token']
    print(f"Access Token: {access_token}")
else:
    print(f"Failed to obtain access token. Status Code: {response.status_code}, Error: {response.text}")

4. You should see a response that looks like this:

Retrieve Datasets Information Across All Sandboxes

1. We will use the Sandbox Management APIs to retrieve the list of sandboxes. Then we will loop through these
sandboxes and make calls to the Catalog API to get the dataset information.

2. In a separate cell, copy and paste the following, and make sure the cell from the previous section containing the credential variables has been executed

import json
import requests
from datetime import datetime

# Set headers for API requests
headers = {
    "Authorization": f"Bearer {access_token}",
    "x-api-key": client_id,
    "x-gw-ims-org-id": org_id,
    "Content-Type": "application/json"
}

# Dictionary to store all datasets across sandboxes
all_datasets = {}

# Initial sandbox endpoint
sandbox_endpoint = 'https://platform.adobe.io/data/foundation/sandbox-management/sandboxes'
url = sandbox_endpoint

# Iterate over pages of sandboxes
while url:
    sandbox_response = requests.get(url, headers=headers)
    # Check if the sandboxes were retrieved successfully
    if sandbox_response.status_code == 200:
        sandbox_data = sandbox_response.json()
        sandboxes = sandbox_data.get('sandboxes', [])
        # Iterate over each sandbox on the current page
        for sandbox in sandboxes:
            sandbox_id = sandbox.get('id')
            sandbox_name = sandbox.get('name')
            print(f"Processing sandbox: {sandbox_name}...")
            datasets = []
            dataset_count = 0  # Initialize the dataset count
            start = 0
            limit = 50
            dataset_url = 'https://platform.adobe.io/data/foundation/catalog/datasets'
            # Set headers specific to the current sandbox
            headers.update({
                'x-sandbox-id': sandbox_id,
                'x-sandbox-name': sandbox_name
            })
            # Pagination for datasets in each sandbox
            while True:
                params = {
                    'start': start,
                    'limit': limit
                }
                # Get the dataset info for the current sandbox with pagination
                dataset_response = requests.get(dataset_url, headers=headers, params=params)
                if dataset_response.status_code == 200:
                    data = dataset_response.json()
                    if not data:
                        # If the returned JSON is empty, we've reached the end
                        break
                    # Append the entire dataset info from this page
                    for dataset_id, dataset_info in data.items():
                        dataset_info_entry = dataset_info  # Store the entire dataset information
                        dataset_info_entry['dataset_id'] = dataset_id  # Add dataset ID explicitly
                        datasets.append(dataset_info_entry)
                        dataset_count += 1  # Increment the dataset count
                    # Update start for the next page of datasets
                    start += limit
                else:
                    print(f"Failed to retrieve datasets for sandbox {sandbox_name}. Status Code: {dataset_response.status_code}, Response: {dataset_response.text}")
                    break
            # Save the entire dataset info for the current sandbox
            all_datasets[sandbox_name] = datasets
            # Print the dataset count for the current sandbox
            print(f"Sandbox: {sandbox_name}, Dataset Count: {dataset_count}")
            # If the sandbox is "testingqs-for-computed-attributes", list all datasets with name, ID, and managed_by
            if sandbox_name == "testingqs-for-computed-attributes":
                print(f"Listing datasets for sandbox: {sandbox_name}")
                for dataset in datasets:
                    print(f"Dataset Name: {dataset.get('name', 'UNKNOWN')}, Dataset ID: {dataset.get('dataset_id', 'UNKNOWN')}, Managed by: {dataset.get('classification', {}).get('managedBy', 'UNKNOWN')}")
        # Check if there is a next page for sandboxes
        url = sandbox_data.get('_links', {}).get('next', {}).get('href')
        if not url:
            break
    else:
        print(f"Failed to retrieve sandboxes. Status Code: {sandbox_response.status_code}, Response: {sandbox_response.text}")
        url = None

# Save the retrieved datasets (full info) to a file called 'datasets.json'
with open('datasets.json', 'w') as outfile:
    json.dump(all_datasets, outfile, indent=4)

print("Full datasets information saved to 'datasets.json'.")

3. You will see output in your Python environment that looks like this:

Processing sandbox: prod...
Sandbox: prod, Dataset Count: 225
Processing sandbox: for-anksharm...
Sandbox: for-anksharm, Dataset Count: 12
Processing sandbox: for-testing-dag-instantiation...
Sandbox: for-testing-dag-instantiation, Dataset Count: 4
Processing sandbox: testingqs-for-computed-attributes...
Sandbox: testingqs-for-computed-attributes, Dataset Count: 34
Listing datasets for sandbox: testingqs-for-computed-attributes
Dataset Name: test_attributs, Dataset ID: 610c42d50219081949bb7657, Managed by: CUSTOMER
Dataset Name: map_collection, Dataset ID: 611edf69f405ee194863e323, Managed by: CUSTOMER
Dataset Name: kevin_map_test, Dataset ID: 6124fb414817aa1948910ea1, Managed by: CUSTOMER
Dataset Name: kevin_profile_map_test, Dataset ID: 612519e340ff6c19484b20fe, Managed by: CUSTOMER
Dataset Name: snil_map_test2, Dataset ID: 6125c12d2858791948907b27, Managed by: CUSTOMER
Dataset Name: kevintest, Dataset ID: 612d0cd754537419486fd9fc, Managed by: CUSTOMER
Dataset Name: Segment Ingestion Dataset, Dataset ID: 612e4cfa7c31be19494306b9, Managed by: SYSTEM
Dataset Name: Segmentdefinition-Snapshot-Export-acf28952-2b6c-47ed-8f7f-016ac3c6b4e7, Dataset ID: 612e4d007c31be19494306bd, Managed by: SYSTEM
Dataset Name: profile_dim_date, Dataset ID: 612e4e3be09fff1948cf39ef, Managed by: SYSTEM
Dataset Name: Kevin raw events data, Dataset ID: 612e598aae879f194866bb2a, Managed by: CUSTOMER
Dataset Name: kevin cdome test, Dataset ID: 612ffa2a7aed981949ed309c, Managed by: CUSTOMER
Dataset Name: aniltest1, Dataset ID: 62414d49cc5ebf1949abe1f8, Managed by: CUSTOMER
Dataset Name: abc, Dataset ID: 628bb11138d0c9194a9caed7, Managed by: CUSTOMER
Dataset Name: AnilTestAgain, Dataset ID: 62a3e16467ec671c084c9bdb, Managed by: CUSTOMER
Dataset Name: Summit Product, Dataset ID: 63e40fb069250e1bd0a27842, Managed by: CUSTOMER
Dataset Name: Summit Customer Profile, Dataset ID: 63e41c1cccf9aa1bd06ce1ef, Managed by: CUSTOMER
Dataset Name: Summit Analytics dataset, Dataset ID: 63eb22cc8de7911bd0d2263c, Managed by: CUSTOMER
Dataset Name: Export segments, Dataset ID: 63ee6c0ee8d2521bd06985b1, Managed by: CUSTOMER
Dataset Name: Email Events, Dataset ID: 63ee6c8ef74ca41bd079e60f, Managed by: CUSTOMER
Dataset Name: test_extend_model, Dataset ID: 63f692630898171bd28405e1, Managed by: CUSTOMER
Dataset Name: Derived Attributes, Dataset ID: 63f6f6d4a9089c1bd1f3bdac, Managed by: CUSTOMER
Dataset Name: Derived Attributes2, Dataset ID: 63f6f6ef3818f91bd06ae638, Managed by: CUSTOMER
Dataset Name: Profile Export Dataset, Dataset ID: 63f7ec92a780dc1bd0fc9826, Managed by: CUSTOMER
Dataset Name: Profile-Snapshot-Export-bb7e3be2-7d36-40d6-8749-aa59892b07da, Dataset ID: 63f85e52d243821bd05c67e0, Managed by: SYSTEM
Dataset Name: Summit profile attributes, Dataset ID: 640182aa379c881bd032506a, Managed by: CUSTOMER
Dataset Name: Profile Export For Destination - Merge Policy - c9194b0d-2fc3-4a12-b6d3-d847c726e511, Dataset ID: 64082ccf52eedc1bd06b5ee1, Managed by: SYSTEM
Dataset Name: DIM_Destination, Dataset ID: 640937a40d52171bd111a423, Managed by: SYSTEM
Dataset Name: BR_Segment_Destination, Dataset ID: 640937bb22929e1bd050e4da, Managed by: SYSTEM
Dataset Name: BR_Namespace_Destination, Dataset ID: 640937ce141de11bd0b1a88e, Managed by: SYSTEM
Dataset Name: AOOutputForUPSDataset, Dataset ID: 64b1f3c86476cb1ca9879d81, Managed by: SYSTEM
Dataset Name: Audience Orchestration Profile Dataset, Dataset ID: 64b1f3c8896f871ca8ace685, Managed by: SYSTEM
Dataset Name: testfact, Dataset ID: 651c51e655563128d37f6fba, Managed by: CUSTOMER
Dataset Name: loyalty_historial_data, Dataset ID: 654c362bed6a4a28d36df2b7, Managed by: CUSTOMER
Dataset Name: untitled, Dataset ID: 6553ede23811da28d2bf1266, Managed by: CUSTOMER
Processing sandbox: rrussell-test...
Sandbox: rrussell-test, Dataset Count: 14
Full datasets information saved to 'datasets_full_info.json'.

4. Open up datasets.json in a notepad-like application and you should see something similar to this. Use the JSON
file to get a sense of the various fields and their values. We do not need all of these values.

The most important part of the above code is the pagination code for getting the sandboxes and the datasets (in sets of
50). If you miss the pagination code, your answers will be wrong. It also helps to print the datasets in a sandbox to
compare your results.
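
In isolation, the pagination pattern that warning refers to looks roughly like this (a sketch with a hypothetical fetch_page function standing in for the Catalog call):

def fetch_page(start, limit):
    """Hypothetical stand-in for one Catalog API call; returns a dict of datasets (empty when exhausted)."""
    return {}  # in the real code: requests.get(dataset_url, headers=headers, params={'start': start, 'limit': limit}).json()

all_items = {}
start, limit = 0, 50
while True:
    page = fetch_page(start, limit)
    if not page:            # an empty page means there is nothing left to fetch
        break
    all_items.update(page)  # accumulate this page of results
    start += limit          # advance the offset by the page size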

Data Processing of JSON into a Flat File

1. Copy paste and execute the following code:

import json
import pandas as pd

# Load the JSON file (replace with your correct file path)
file_path = 'datasets.json'
with open(file_path) as json_file:
    data = json.load(json_file)

extracted_rows = []

# Loop through each sandbox, which is the top-level key
for sandbox_name, sandbox_datasets in data.items():
    # Loop through each dataset within the current sandbox (sandbox_datasets is a list)
    for dataset_info in sandbox_datasets:
        # Clean up values in unifiedProfile and unifiedIdentity
        unifiedProfile = dataset_info.get('tags', {}).get('unifiedProfile', [None])[0]
        unifiedProfile_enabledAt = dataset_info.get('tags', {}).get('unifiedProfile', [None, None])[1]
        unifiedIdentity = dataset_info.get('tags', {}).get('unifiedIdentity', [None])[0]

        # If the value contains "enabled", strip it, otherwise use None
        unifiedProfile_clean = unifiedProfile.replace("enabled:", "") if unifiedProfile else None
        unifiedProfile_enabledAt_clean = unifiedProfile_enabledAt.replace("enabledAt:", "") if unifiedProfile_enabledAt and "enabledAt" in unifiedProfile_enabledAt else None
        unifiedIdentity_clean = unifiedIdentity.replace("enabled:", "") if unifiedIdentity else None

        # Get sandbox name and ID, assuming sandboxId exists in the dataset info
        sandbox_id = dataset_info.get('sandboxId', None)

        # Append the cleaned row to extracted_rows
        row = {
            'sandbox_name': sandbox_name,                          # Added Sandbox Name
            'dataset_id': dataset_info.get('dataset_id', None),    # Added Dataset ID
            'dataset_name': dataset_info.get('name', None),
            'dataset_ownership': dataset_info.get('classification', {}).get('managedBy', None),
            'dataset_type': dataset_info.get('classification', {}).get('dataBehavior', None),
            'imsOrg_id': dataset_info.get('imsOrg', None),
            'sandbox_id': sandbox_id,
            'profile_enabled': unifiedProfile_clean,
            'date_enabled_profile': unifiedProfile_enabledAt_clean,
            'identity_enabled': unifiedIdentity_clean
        }
        extracted_rows.append(row)

# Convert the extracted rows into a DataFrame
extracted_df = pd.DataFrame(extracted_rows)

# Display the DataFrame (print the first few rows)
print(extracted_df.head())

# Save the DataFrame to a CSV file
extracted_df.to_csv('cleaned_extracted_data.csv', index=False)

2. The result should look something like this in the editor:

sandbox_name dataset_id dataset_name


0 prod 5ff3a3870e8e54194a1bcf2d Profile Import
1 prod 5ff3a8bd29e35b194cdd6e0b Streaming Profile Import
2 prod 5ff3a9013ce5d2194b7a7f91 Product Definition
3 prod 5ff3aaf22810141955d546ea Favorite Products
4 prod 5ff58db1131575194bfebd96 Segment Ingestion Dataset

dataset_ownership dataset_type imsOrg_id


0 CUSTOMER record FCBD04245FCEC73F0A495FC9@AdobeOrg
1 CUSTOMER record FCBD04245FCEC73F0A495FC9@AdobeOrg
2 CUSTOMER record FCBD04245FCEC73F0A495FC9@AdobeOrg
3 CUSTOMER record FCBD04245FCEC73F0A495FC9@AdobeOrg
4 SYSTEM record FCBD04245FCEC73F0A495FC9@AdobeOrg

sandbox_id profile_enabled date_enabled_profile \

0 51c9d32c-7654-468a-89d3-2c7654768ab1 true 2021-01-04 23:23:54


1 51c9d32c-7654-468a-89d3-2c7654768ab1 true 2021-01-04 23:46:11
2 51c9d32c-7654-468a-89d3-2c7654768ab1 true 2021-01-04 23:47:15
3 51c9d32c-7654-468a-89d3-2c7654768ab1 true 2021-01-04 23:55:32
4 51c9d32c-7654-468a-89d3-2c7654768ab1 true None

identity_enabled
0 true
1 true
2 true
3 true
4 true

Tip: If you try to ingest this CSV file, you will need to use a source schema that has no spaces in its column names. If you use the manual CSV upload workflow, you will need to reformat the column names to exclude the spaces. But this is what real life looks like - dirty column names. That is why we are being extra cautious with our naming.

Get Row and Count Statistics on the Datasets


1. To generate the statistics of the dataset, you can use the Statistics endpoint:

import json
import pandas as pd
import requests
from datetime import datetime

# Get the current timestamp (will be the same for all rows)
current_timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

# Load the JSON file (replace with your correct file path)
file_path = 'datasets.json'  # Update this to the correct path
with open(file_path) as json_file:
    data = json.load(json_file)

extracted_rows = []

# Loop through each sandbox, which is the top-level key
for sandbox_name, sandbox_datasets in data.items():
    # Loop through each dataset within the current sandbox (assuming sandbox_datasets is a list)
    for dataset_info in sandbox_datasets:
        # Clean up values in unifiedProfile and unifiedIdentity
        unifiedProfile = dataset_info.get('tags', {}).get('unifiedProfile', [None])[0]
        unifiedProfile_enabledAt = dataset_info.get('tags', {}).get('unifiedProfile', [None, None])[1]
        unifiedIdentity = dataset_info.get('tags', {}).get('unifiedIdentity', [None])[0]

        # If the value contains "enabled", strip it, otherwise use None
        unifiedProfile_clean = unifiedProfile.replace("enabled:", "") if unifiedProfile else None
        unifiedProfile_enabledAt_clean = unifiedProfile_enabledAt.replace("enabledAt:", "") if unifiedProfile_enabledAt and "enabledAt" in unifiedProfile_enabledAt else None
        unifiedIdentity_clean = unifiedIdentity.replace("enabled:", "") if unifiedIdentity else None

        # Get dataset ID and sandbox ID
        dataset_id = dataset_info.get('dataset_id', None)
        sandbox_id = dataset_info.get('sandbox_id', None)

        # Statistics API call logic (replacing function)
        stats_endpoint = f"https://platform.adobe.io/data/foundation/statistics/statistics?statisticType=table&dataSet={dataset_id}"
        headers = {
            "Authorization": f"Bearer {access_token}",
            "x-api-key": "acp_foundation_statistics",
            "x-gw-ims-org-id": org_id,
            "x-sandbox-id": sandbox_id,
            "x-sandbox-name": sandbox_name
        }

        # Initialize statistics
        rows, size = 0, 0

        # Make the request for statistics
        response = requests.get(stats_endpoint, headers=headers)
        if response.status_code == 200:
            stats_data = response.json()
            # Iterate over statistics to find row_count and total_size
            for stat in stats_data.get("statistics", []):
                if stat["name"] == "row_count" and stat["value"] != "0":
                    rows = stat["value"]
                elif stat["name"] == "total_size" and stat["value"] != "0":
                    size = stat["value"]

        # Append the cleaned row to extracted_rows, using the updated format
        row = {
            'timestamp': current_timestamp,      # Added Timestamp
            'sandbox_name': sandbox_name,        # Added Sandbox Name
            'dataset_id': dataset_id,            # Added Dataset ID
            'dataset_name': dataset_info.get('name', None),
            'dataset_ownership': dataset_info.get('classification', {}).get('managedBy', None),
            'dataset_type': dataset_info.get('classification', {}).get('dataBehavior', None),
            'imsOrg_id': dataset_info.get('imsOrg', None),
            'sandbox_id': sandbox_id,
            'profile_enabled': unifiedProfile_clean,
            'date_enabled_profile': unifiedProfile_enabledAt_clean,
            'identity_enabled': unifiedIdentity_clean,
            'row_count': rows,                   # Added row count
            'total_size': size                   # Added total size
        }
        extracted_rows.append(row)

# Convert the extracted rows into a DataFrame
extracted_df = pd.DataFrame(extracted_rows)

# Display the DataFrame (print the first few rows)
print(extracted_df.head())

# Save the DataFrame to a CSV file
extracted_df.to_csv('cleaned_extracted_data_with_statistics.csv', index=False)

print("Data saved to 'cleaned_extracted_data_with_statistics.csv'.")

1. The results will be the following:

timestamp sandbox_name dataset_id \

0 2024-09-06 23:43:11 prod 5ff3a3870e8e54194a1bcf2d


1 2024-09-06 23:43:11 prod 5ff3a8bd29e35b194cdd6e0b
2 2024-09-06 23:43:11 prod 5ff3a9013ce5d2194b7a7f91
3 2024-09-06 23:43:11 prod 5ff3aaf22810141955d546ea
4 2024-09-06 23:43:11 prod 5ff58db1131575194bfebd96

dataset_name dataset_ownership dataset_type \


0 Profile Import CUSTOMER record
1 Streaming Profile Import CUSTOMER record
2 Product Definition CUSTOMER record
3 Favorite Products CUSTOMER record
4 Segment Ingestion Dataset SYSTEM record

imsOrg_id sandbox_id profile_enabled \

0 FCBD04245FCEC73F0A495FC9@AdobeOrg None true


1 FCBD04245FCEC73F0A495FC9@AdobeOrg None true
2 FCBD04245FCEC73F0A495FC9@AdobeOrg None true
3 FCBD04245FCEC73F0A495FC9@AdobeOrg None true
4 FCBD04245FCEC73F0A495FC9@AdobeOrg None true

date_enabled_profile identity_enabled row_count total_size


0 2021-01-04 23:23:54 true 6000 1384852
1 2021-01-04 23:46:11 true 500 171497
2 2021-01-04 23:47:15 true 14 31548
3 2021-01-04 23:55:32 true 5000 121660
4 None true 7710 36696301
Data saved to ‘cleaned_extracted_data_with_statistics.csv’.

Upload the CSV into Data Landing Zone

We will now upload this CSV into the Data Landing Zone with the expectation that a schema and a data flow was
created as per the following prerequisite tutorial.

1. You need to extract the SAS URI from the AEP UI. Go to Sources->Catalog->Data Landing Zone. Click on the
card gently and a right panel will appear. Scroll down to get the SAS URI. Note that the SAS URI already
contains the SAS Token

2. Just copy and paste the SAS URI into the following code as the value of the full_sas_url variable. Also observe the csv_file_path variable and how it is injected into the SAS URI in the sas_url_with_file variable

import requests

# Path to the CSV file you want to upload
csv_file_path = 'cleaned_extracted_data_with_statistics.csv'  # Replace with your file path

# The full SAS URL with both URI and Token (you already have this)
full_sas_url = 'your_SAS_URI'

# Split the URL into the base URI (before the '?') and the SAS token (starting from '?')
sas_uri, sas_token = full_sas_url.split('?')

# Inject the file name into the SAS URI
file_name = 'cleaned_extracted_data_with_statistics.csv'  # The file name you are uploading
sas_url_with_file = f'{sas_uri}/{file_name}?{sas_token}'  # Recombine URI, file name, and SAS token

# Open the CSV file in binary mode
with open(csv_file_path, 'rb') as file_data:
    # Set the required headers
    headers = {
        'x-ms-blob-type': 'BlockBlob',  # Required for Azure Blob Storage uploads
    }
    # Make a PUT request to upload the file using the constructed SAS URL
    response = requests.put(sas_url_with_file, headers=headers, data=file_data)

    # Check if the upload was successful
    if response.status_code == 201:
        print(f"File '{csv_file_path}' uploaded successfully.")
    else:
        print(f"Failed to upload file. Status Code: {response.status_code}, Response: {response.text}")

3. The response you will see is the following:

File 'cleaned_extracted_data_with_statistics.csv' uploaded successfully.

4. If you click into the Data Landing Zone i.e. Sources->Catalog->Data Landing Zone->Add data, you will see:

Tip: The great thing about this approach is that if we keep running this Python notebook on schedule, then the files
dropped into Data Landing Zone will be picked up by the dataflow runs.
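
For instance, if you wrap the extraction and upload cells above into a single function, a minimal (and deliberately naive) sketch of running it on a schedule from Python could look like the following; the function body is just a placeholder, and in practice cron, Airflow, or your notebook platform's scheduler is the more robust choice.

import time

def refresh_and_upload():
    # Placeholder: call the extraction code above to regenerate the CSV, then the
    # SAS upload step to drop the new file into the Data Landing Zone.
    pass

# Very simple scheduler sketch: refresh once every 24 hours. In practice you would
# more likely use cron, Airflow, or your notebook platform's scheduler.
while True:
    refresh_and_upload()
    time.sleep(24 * 60 * 60)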

Number of Datasets by Sandbox and Ownership

1. Execute the following piece of code:

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
file_path = 'cleaned_extracted_data_with_statistics.csv'
df = pd.read_csv(file_path)

# Group the data by sandbox_name and dataset_ownership, counting the number of datasets
grouped_df = df.groupby(['sandbox_name', 'dataset_ownership']).size().unstack(fill_value=0)

# Plotting the stacked bar chart
ax = grouped_df.plot(kind='bar', stacked=True, figsize=(10, 6))

# Set the labels and title
ax.set_title('Number of Datasets by Sandbox and Ownership')
ax.set_xlabel('Sandbox Name')
ax.set_ylabel('Number of Datasets')

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

# Display the plot
plt.tight_layout()
plt.show()

Total Volume Used in GB Across All Sandboxes Split By Ownership

1. Copy paste and execute the following piece of code:

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
file_path = 'cleaned_extracted_data_with_statistics.csv'
df = pd.read_csv(file_path)

# Convert the 'total_size' column to numeric (in case there are any non-numeric entries)
df['total_size'] = pd.to_numeric(df['total_size'], errors='coerce')

# Drop any rows with missing sizes
df_clean = df.dropna(subset=['total_size'])

# Convert bytes to gigabytes (1 GB = 1e9 bytes)
df_clean['total_size_gb'] = df_clean['total_size'] / 1e9

# Group by dataset_ownership (assuming it contains 'customer' vs 'system') and sum the total_size_gb for each group
ownership_grouped = df_clean.groupby('dataset_ownership')['total_size_gb'].sum()

# Calculate the percentage of the total size for each group
total_size = ownership_grouped.sum()
ownership_percent = (ownership_grouped / total_size) * 100

# Print the total volume in GB and the percentage
for ownership, size_gb in ownership_grouped.items():
    percentage = ownership_percent[ownership]
    print(f"Ownership: {ownership}, Size: {size_gb:.2f} GB, Percentage: {percentage:.2f}%")

# Plotting the pie chart
plt.figure(figsize=(8, 8))
plt.pie(ownership_grouped, labels=ownership_grouped.index, autopct='%1.1f%%',
        startangle=140, colors=plt.cm.Paired.colors)

# Set the title
plt.title('Total Volume Used by Customer vs System in GB')

# Display the pie chart
plt.tight_layout()
plt.show()

Top 10 Datasets by Size in GB

1. Remember the size is in bytes and needs to be converted to GB or TB. Copy paste the following piece of code and execute:

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
file_path = 'cleaned_extracted_data_with_statistics.csv'
df = pd.read_csv(file_path)

# Convert the 'total_size' column to numeric (in case there are any non-numeric entries)
df['total_size'] = pd.to_numeric(df['total_size'], errors='coerce')

# Drop any rows with missing sizes
df_clean = df.dropna(subset=['total_size'])

# Convert bytes to gigabytes (1 GB = 1e9 bytes)
df_clean['total_size_gb'] = df_clean['total_size'] / 1e9

# Sort the DataFrame by 'total_size_gb' in descending order
top_datasets_by_size = df_clean.sort_values(by='total_size_gb', ascending=False)

# Select the top 10 datasets by size
top_10_datasets = top_datasets_by_size.head(10)

# Plotting the top 10 datasets by size in GB
plt.figure(figsize=(10, 6))
plt.barh(top_10_datasets['dataset_name'], top_10_datasets['total_size_gb'], color='skyblue')
plt.xlabel('Total Size (GB)')
plt.ylabel('Dataset Name')
plt.title('Top 10 Datasets by Size (in GB)')

# Display the plot
plt.tight_layout()
plt.gca().invert_yaxis()  # To display the largest dataset on top
plt.show()

Retrieve a List of Dataset Sizes and Rows by Sandbox

1. Copy paste and execute the following piece of code:

import pandas as pd

# Load the dataset
file_path = 'cleaned_extracted_data_with_statistics.csv'
df = pd.read_csv(file_path)

# Filter for the "for-anksharm" sandbox
filtered_df = df[df['sandbox_name'] == 'for-anksharm']

# Convert the 'total_size' and 'row_count' columns to numeric (in case there are any non-numeric entries)
filtered_df['total_size'] = pd.to_numeric(filtered_df['total_size'], errors='coerce')
filtered_df['row_count'] = pd.to_numeric(filtered_df['row_count'], errors='coerce')

# Drop any rows with missing sizes or row counts
filtered_df_clean = filtered_df.dropna(subset=['total_size', 'row_count'])

# Convert total_size from bytes to gigabytes
filtered_df_clean['total_size_gb'] = filtered_df_clean['total_size'] / 1e9

# Group by dataset name and sum the total size (in GB) and row counts
sandbox_grouped = filtered_df_clean.groupby('dataset_name')[['total_size_gb', 'row_count']].sum()

# Display the grouped result
print(sandbox_grouped)

# Optionally, save the result to a CSV file
sandbox_grouped.to_csv('for_anksharm_dataset_sizes_and_rows_gb.csv')

2. The result will look like:

                                                    total_size_gb  row_count
dataset_name
AOOutputForUPSDataset                                    0.000000          0
Audience Orchestration Profile Dataset                   0.000000          0
Profile Attribute 3abe79fd-491d-4264-b61a-bb837...       0.001702         35
Profile Segment Definition 38831571-fb0e-4db1-a...       0.000170          2
Segment Ingestion Dataset                                0.032348        243
TestDataset                                              0.000138         16
TestEE                                                   0.000063          3
TestProfile                                              0.000038          1
akhatri - Demo System - Event Dataset for Website        0.000000          0
akhatri - Demo System - Profile Dataset for Web...       0.000000          0
akhatri Demo System - Profile Dataset for CRM            0.000000          0
profile_dim_date                                         0.000642       7305

Average Record Richness by Sandbox (bytes per record)

1. Copy paste and execute the following piece of code:

import pandas as pd

# Load the dataset
file_path = 'cleaned_extracted_data_with_statistics.csv'
df = pd.read_csv(file_path)

# Convert the 'total_size' and 'row_count' columns to numeric (in case there are any non-numeric entries)
df['total_size'] = pd.to_numeric(df['total_size'], errors='coerce')
df['row_count'] = pd.to_numeric(df['row_count'], errors='coerce')

# Drop any rows with missing sizes or row counts
df_clean = df.dropna(subset=['total_size', 'row_count'])

# Group by sandbox_name and sum the total_size and row_count
sandbox_grouped = df_clean.groupby('sandbox_name')[['total_size', 'row_count']].sum()

# Calculate the average record richness (bytes per record)
sandbox_grouped['avg_record_richness'] = sandbox_grouped['total_size'] / sandbox_grouped['row_count']

# Display the grouped result with average record richness
print(sandbox_grouped[['total_size', 'row_count', 'avg_record_richness']])

# Optionally, save the result to a CSV file
sandbox_grouped.to_csv('sandbox_avg_record_richness.csv')

2. The result will look like:

                                   total_size  row_count  avg_record_richness
sandbox_name
for-anksharm                         35101767       7605          4615.616963
for-testing-dag-instantiation          708512       7306            96.976731
prod                                802206064    5725404           140.113442
rrussell-test                       111728896     102888          1085.927377
testingqs-for-computed-attributes    48124523     137704           349.478033

Histogram of Dataset Sizes Across All Sandboxes

1. Copy paste and execute the following code:

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
file_path = 'cleaned_extracted_data_with_statistics.csv'
df = pd.read_csv(file_path)

# Convert the 'total_size' column to numeric (in case there are any non-numeric entries)
df['total_size'] = pd.to_numeric(df['total_size'], errors='coerce')

# Drop any rows with missing sizes
df_clean = df.dropna(subset=['total_size'])

# Convert total_size from bytes to gigabytes (1 GB = 1e9 bytes)
df_clean['total_size_gb'] = df_clean['total_size'] / 1e9

# Plot the histogram of dataset sizes in GB
plt.figure(figsize=(10, 6))
plt.hist(df_clean['total_size_gb'], bins=20, color='skyblue', edgecolor='black')

# Set the labels and title
plt.title('Histogram of Dataset Sizes (in GB) Across All Sandboxes')
plt.xlabel('Dataset Size (GB)')
plt.ylabel('Frequency')

# Display the plot
plt.tight_layout()
plt.show()

2. The result will look like:

Profile-Enabled Datasets (GB) By Sandbox

1. Copy paste and execute the following code:

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
file_path = 'cleaned_extracted_data_with_statistics.csv'
df = pd.read_csv(file_path)

# Filter for the "prod" sandbox and datasets that are profile enabled
df_filtered = df[(df['sandbox_name'] == 'prod') & (df['profile_enabled'].notnull())]

# Convert the 'total_size' column to numeric (in case there are any non-numeric entries)
df_filtered['total_size'] = pd.to_numeric(df_filtered['total_size'], errors='coerce')

# Drop any rows with missing sizes
df_filtered = df_filtered.dropna(subset=['total_size'])

# Convert the total size from bytes to gigabytes (1 GB = 1e9 bytes)
df_filtered['total_size_gb'] = df_filtered['total_size'] / 1e9

# Sort the DataFrame by total size in descending order
df_filtered = df_filtered.sort_values(by='total_size_gb', ascending=False)

# Define colors based on the dataset type (assuming 'dataset_type' column exists with values like 'record' or 'event')
colors = df_filtered['dataset_type'].map({'record': 'blue', 'event': 'green'}).fillna('gray')

# Plot the bar chart for profile-enabled datasets in the prod sandbox
plt.figure(figsize=(10, 6))
plt.barh(df_filtered['dataset_name'], df_filtered['total_size_gb'], color=colors)

# Set the title and labels
plt.title('Profile-Enabled Datasets in Prod Sandbox (Sorted by Size in GB)', fontsize=14)
plt.xlabel('Size in GB')
plt.ylabel('Dataset Name')

# Invert y-axis to have the largest dataset on top
plt.gca().invert_yaxis()

# Add a legend to indicate which color represents record vs event datasets
from matplotlib.lines import Line2D
legend_elements = [Line2D([0], [0], color='blue', lw=4, label='Record'),
                   Line2D([0], [0], color='green', lw=4, label='Event')]
plt.legend(handles=legend_elements, title="Dataset Type")

# Display the chart
plt.tight_layout()
plt.show()

2. The result will look like:

Note: The dataset sizes reported here reflect the sizes in the Data Lake, where data is stored in the Parquet file
format. Parquet provides significant compression, typically ranging from 5-10x compared to JSON data. However, the
Profile Store uses a different file format and compression mechanism, so the volume comparison between Data Lake
and Profile Store is not 1:1. Instead, these sizes should be viewed in terms of relative proportions, rather than exact
volumes.
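
To put that note into numbers, here is a rough, illustrative back-of-the-envelope estimate only; the 5-10x factor is a typical range rather than a guarantee, and the byte count below simply reuses the Segment Ingestion Dataset value from the results earlier.

# Rough illustration: estimate the uncompressed (JSON-equivalent) footprint of a dataset
# from its Parquet size in the Data Lake, using the 5-10x compression range as bounds.
parquet_size_bytes = 36_696_301         # example value taken from the results above
low_factor, high_factor = 5, 10         # typical Parquet-vs-JSON compression range

low_estimate_gb = parquet_size_bytes * low_factor / 1e9
high_estimate_gb = parquet_size_bytes * high_factor / 1e9
print(f"Estimated raw size: {low_estimate_gb:.2f} GB to {high_estimate_gb:.2f} GB")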

User Contributions & Improvements

Many thanks to David Teko Kangni, who modularized the code and also fixed the warnings that I had intentionally left unfixed:

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

You will first need to generate a token every 24 hours by using the Client ID, Secret, and Scopes.

Subsequent calls will use the access token and the other parameters along with the API call to get the results. They no longer require the technical account or the client secret.

[Screenshot captions from the original page: Choose your authentication protocol; Choosing a product profile; Generate the API key and the access token; Change the name or description if you need to; On the project screen, you will get extra details on client ID, client secret and scopes; Getting the endpoint where we will send the API requests to; More details available on scroll; Name the role just to track it in the system; Add all the permissions and the sandboxes; Choose the role you created; JSON file opened in Visual Studio Code; SAS URI and SAS Token from the AEP UI; CSV file has been uploaded into Data Landing Zone; Number of datasets broken down by sandbox and ownership; Data volume split by ownership; The chart reveals a long tail indicating fragmentation; Histogram is showing typical distribution for a demo environment]

https://data-distiller.all-stuff-data.com/unit-8-data-distiller-statistics-and-machine-learning/statsml-201-securing-data-distiller-access-with-robust-ip-whitelisting * * *

1. Unit 8: DATA DISTILLER STATISTICS & MACHINE LEARNING

STATSML 201: Securing Data Distiller Access with Robust IP Whitelisting

Secure Access, Simplified: Protect Data Distiller with IP Whitelisting

This tutorial provides a comprehensive guide to implementing IP whitelisting to enhance the security of Data Distiller.
IP whitelisting is a crucial security feature that allows you to define and manage specific IP ranges that are permitted to
interact with your data, ensuring that only authorized networks and devices have access. This is particularly important
when using tools like BI dashboards, Apache Superset, and JupyterLab, which are often installed on local or
organizational machines.

How IP Whitelisting Secures Analytical Tools

IP whitelisting creates a secure perimeter around Data Distiller, ensuring that only devices within approved networks
or with specific IP addresses can connect. Here’s how this benefits tools like Apache Superset, JupyterLab, and other
BI platforms:

1. Controlled Access for BI Tools: Business intelligence (BI) tools like Apache Superset, Tableau, and Power BI
frequently query data from platforms like Data Distiller for visualization and reporting. With IP whitelisting, you
can restrict access to these tools based on their hosting environment’s IP addresses, ensuring that only known
installations can connect.

2. Enhanced Security for JupyterLab: Data scientists using JupyterLab often interact with Data Distiller for
advanced analytics and model training. By enforcing IP whitelisting, you ensure that only authorized JupyterLab
instances—installed on approved devices or in secure environments—can access the data lake, reducing the risk
of data leakage.

3. Restricting Access to Organizational Machines: IP whitelisting ensures that access is limited to machines
within the organization’s network or specific cloud environments. This is especially useful for:
On-premises setups: Ensuring that only devices within your local office network can access Data Distiller.

Cloud deployments: Restricting access to specific VMs or instances running on platforms like AWS or
Azure.

4. Minimizing Unauthorized Access: By defining allowed IP ranges, unauthorized users or devices attempting to
connect from outside the whitelisted range are automatically denied access. This provides an additional layer of
security beyond user credentials and API keys.

5. Simplifying Monitoring and Auditing: With IP whitelisting in place, all access attempts are limited to approved
sources. This makes it easier to monitor and audit access logs for suspicious activities or anomalies, ensuring
compliance with data governance policies.

The Role of IP Whitelisting in Enhancing Security of Data Distiller

Here’s why IP whitelisting is a cornerstone of robust data security:

Enhanced Access Control: IP whitelisting ensures that only devices or networks operating within an approved IP range can access Data Distiller. This prevents external, unapproved networks or malicious actors from attempting to infiltrate sensitive data resources. Tools like Apache Superset and JupyterLab installed on approved machines are securely confined within the authorized IP range, blocking access from any other source.

Alignment with Company Policies: Organizations can align their IP whitelisting configurations with corporate
security policies by defining ranges for office locations, corporate VPNs, or other controlled environments. This
ensures that only personnel using organization-sanctioned networks can query and process data, providing consistency
and control over access permissions.

Minimized Risk of Credential Misuse: Even if user credentials are compromised, IP whitelisting adds a second layer
of defense by denying login attempts from non-whitelisted IP addresses. This prevents attackers from exploiting stolen
credentials if their devices are outside the approved network.

Integration with Monitoring and Auditing: Coupling IP whitelisting with Adobe Experience Platform’s Audit Service amplifies security capabilities by providing detailed visibility into data access:

Detecting Anomalous Behavior: Unauthorized attempts from IPs outside the whitelist can be flagged for
investigation, helping identify potential security breaches or misconfigurations.

Auditing Access Patterns: Logs of access requests from whitelisted IPs enable organizations to monitor usage
patterns, detect anomalies, and address insider threats or credential misuse.

Customizable and Scalable: IP whitelisting is highly adaptable, allowing organizations to dynamically update
approved IP ranges as infrastructure changes occur. Whether adding new office locations, onboarding cloud resources,
or updating partner access, companies can ensure that their security measures evolve alongside their operational needs
without compromising control.

Strengthened Regulatory Compliance: Many industries demand stringent controls over access to sensitive data to
meet regulatory standards like GDPR, HIPAA, and financial security mandates. IP whitelisting supports these
requirements by ensuring that data access is limited to explicitly authorized environments, reducing the risk of
compliance violations.

Computer Monitoring: A Complementary Measure

While IP whitelisting provides perimeter-level security, monitoring access behavior within the whitelist is crucial. By
implementing comprehensive monitoring systems:
Administrators can track user activities, query logs, and data access patterns.

Security teams can identify potential misuse or unauthorized queries within approved IP ranges.

Real-time alerts can help mitigate threats faster than periodic audits.

Together, IP whitelisting and computer monitoring form a multi-layered defense strategy for securing access to Data
Distiller. This approach ensures that access is limited to company-approved networks while maintaining visibility into
how services are being utilized, empowering organizations to proactively protect their data assets.

Understanding the Scope of Data Distiller IP Whitelisting

Data Distiller’s IP whitelisting is a robust mechanism designed specifically to secure access to its Postgres
connectors, which are the backbone of integrations with BI tools such as Tableau, Power BI, Apache Superset, and
analytical environments like JupyterLab. By allowing only predefined IP ranges to connect via these tools,
organizations can ensure secure and controlled access to Query Service. However, it’s important to note the limitations
of this feature, particularly when it comes to API access.

Postgres Connectors: IP whitelisting is implemented exclusively for Postgres connectors. These connectors are
commonly used by tools like JupyterLab and Apache Superset to interact with the Data Distiller environment.
This ensures that only trusted networks or devices within approved IP ranges can execute queries or retrieve data
through these tools.

APIs Not Covered: The IP whitelisting feature does not extend to APIs. APIs in Data Distiller can be used to:

Create and access Data Distiller jobs/schedules on the Data Lake.

Access data stored in the Data Distiller Warehouse, also known as Accel Store.

Implications of API Access

Since APIs bypass the IP whitelisting rules, their misuse could expose the system to potential security risks. If access
tokens are compromised or misused, unauthorized entities could exploit API endpoints to manipulate jobs, access data,
or extract information from the warehouse.

Recommended Security Measures for APIs

To mitigate risks and secure API access:

1. Disable Token Access:

Restrict access to the APIs by ensuring that no access tokens are issued to users or applications that do not
require API functionality.

This effectively blocks unauthorized API requests, as tokens are a prerequisite for authenticating API calls.

2. Network-Level Restrictions: Implement additional network-level security measures such as firewalls or VPNs
to limit access to the API endpoints from unauthorized environments.

3. API Gateway and Monitoring:

Deploy an API gateway to enforce stricter access controls and logging.

Monitor API usage patterns for anomalies, such as unexpected data extraction or job creation requests, to
proactively identify potential misuse.

Obtain the necessary authentication credentials, including **ACCESS_TOKEN**, **API_KEY**, **ORG_ID**, and **SANDBOX_NAME**, by following these instructions:

Set the IP Allow List Permissions

Ensure you have the necessary permissions to manage allowed IP ranges. The **Manage Allowed List**
permission is required.

To enable permissions, you need to do the following:

1. Navigate to AEP UI by going to Permissions->Roles->[Your_Role]->Edit

2. Add the Manage Allow List and then Save the permission

Generate the Access Token

Make sure you have followed all the steps to get the access token here:

You should be executing the following piece of code:

!pip install requests

import requests

# Replace with your Adobe credentials


client_id = 'your_client_id'
client_secret = 'your_client_secret'
org_id = 'your_org_id'
tech_account_id = 'your_tech_account_id'
scope = 'scope'
auth_endpoint = 'https://ims-na1.adobelogin.com/ims/token/v2'

# Prepare the data for the access token request


data = {
'grant_type': 'client_credentials',
'client_id': client_id,
'client_secret': client_secret,
'scope': scope # Specify the scope relevant to your API usage
}

# Request the access token from Adobe IMS


response = requests.post(auth_endpoint, data=data)

if response.status_code == 200:
    access_token = response.json()['access_token']
    print(f"Access Token: {access_token}")
else:
    print(f"Failed to obtain access token. Status Code: {response.status_code}, Error: {response.text}")

To access the sandbox name, you need to navigate to the AEP UI for Sandboxes->Browse. If you click on the sandbox
name, you will get the name of the sandbox in the right panel.

Enable Debugging in Python


This code below helps you see everything happening behind the scenes when your program talks to a server over the
internet. It’s like turning on a flashlight to see exactly what your program sends and receives, such as the details of the
messages it sends to the server (requests) and what the server sends back (responses). It also shows any errors or
unexpected behavior in this process. This is especially helpful when you’re trying to figure out why something isn’t
working, like if your program isn’t sending the right information or the server isn’t responding as expected. It’s a tool
for catching mistakes and making sure everything is working as planned.

We set up HTTP debugging in Python to monitor and log detailed information about HTTP requests and responses. By
enabling the **debuglevel** attribute of **HTTPConnection** from the **http.client** module, the
code allows low-level HTTP communications, such as request headers, response headers, and the raw data being
transmitted, to be outputted to the console. The **logging.basicConfig(level=logging.DEBUG)**
command configures the Python logging module to capture and display debug-level logs. This setup is particularly
useful for troubleshooting API integrations, as it provides visibility into the request and response lifecycle, helping
developers identify and resolve issues like incorrect headers, payload formatting, or unexpected server responses.

import logging
import http.client as http_client

http_client.HTTPConnection.debuglevel = 1
logging.basicConfig(level=logging.DEBUG)

IP Range Formats: The **allowedIpRanges** field can include two types of IP specifications:

CIDR: Standard CIDR notation (e.g., **"136.23.110.0/23"**) to define IP ranges.

Fixed IP: Single IPs for individual access permissions (e.g., **"101.10.1.1"**).
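
If you want to sanity-check locally which addresses a CIDR block actually covers before adding it to the allow list, Python's standard ipaddress module can help. This is a small local sketch and is not part of the IP Access API itself; the client IP below is made up.

import ipaddress

# CIDR range from the example above and a hypothetical client IP to test
allowed_range = ipaddress.ip_network("136.23.110.0/23")
client_ip = ipaddress.ip_address("136.23.111.42")

print(client_ip in allowed_range)   # True: the /23 covers 136.23.110.0 - 136.23.111.255
print(allowed_range.num_addresses)  # 512 addresses in a /23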

To manage allowed IP ranges for Data Distiller using Python, you can utilize the **requests** library to interact
with the IP Access API. This API enables you to fetch, set, and delete IP ranges associated with your organization’s ID.

You can retrieve the list of all IP ranges configured for your sandbox. If no IP ranges are set, all IPs are allowed by
default.

import requests

# Define your credentials and headers


headers = {
    'Authorization': f'Bearer {access_token}',  # Replace with your actual access token
    'x-gw-ims-org-id': org_id,                  # Replace with your actual Org ID
    'x-api-key': 'acp_queryService_auth',       # DO NOT REPLACE THIS
    'x-sandbox-name': 'prod',                   # Replace with your sandbox name
    'Content-Type': 'application/json',         # Indicates JSON content in the request
    'Accept': 'application/json'                # Indicates that JSON is expected in the response
}

# Define the API endpoint
url = 'https://platform.adobe.io/data/foundation/queryauth/security/ip-access'

# Make the GET request
response = requests.get(url, headers=headers)

# Handle the response
if response.status_code == 200:
    print("Request was successful!")
    ip_ranges = response.json().get('allowedIpRanges', [])
    print("Allowed IP Ranges:", ip_ranges)
else:
    print(f"Request failed with status code {response.status_code}")
    print(f"Error message: {response.text}")

All requests are made to the **/queryauth/security/ip-access** endpoint of the Adobe Experience Platform.

Remember that you should only use the following hard-coded value for the API key:

'x-api-key': 'acp_queryService_auth', # DO NOT REPLACE THIS

The result will look like the following:

You can overwrite existing IP ranges by setting a new list for the sandbox. This operation requires a complete list of IP
ranges, including any that remain unchanged.

import requests

# Define your credentials and headers


headers = {
    'Authorization': f'Bearer {access_token}',  # Replace with your actual access token
    'x-api-key': 'acp_queryService_auth',       # DO NOT REPLACE THIS
    'x-gw-ims-org-id': org_id,                  # Replace with your actual Org ID
    'x-sandbox-name': 'prod',                   # Replace with your sandbox name
    'Content-Type': 'application/json',         # Indicates JSON content in the request
    'Accept': 'application/json'                # Indicates that JSON is expected in the response
}

# Define the new IP ranges to set with the correct key 'allowedIpRanges'
ip_ranges = {
    "allowedIpRanges": [
        {"ipRange": "136.23.110.0/23", "description": "VPN-1 gateway IPs"},
        {"ipRange": "17.102.17.0/23", "description": "VPN-2 gateway IPs"},
        {"ipRange": "101.10.1.1"},                                    # Single IP address
        {"ipRange": "163.77.30.9", "description": "Test server IP"}   # Single IP address with a description
    ]
}

# Define the API endpoint for IP range configuration
url = 'https://platform.adobe.io/data/foundation/queryauth/security/ip-access'

# Make the PUT request
response = requests.put(url, headers=headers, json=ip_ranges)

# Handle the response
if response.status_code == 200:
    print("Successfully set new IP ranges.")
    print("Response:", response.json())
else:
    print(f"Failed to set IP ranges: {response.status_code}")
    print(f"Error message: {response.text}")

The result should be

You can replace the old list of IP ranges with the new list provided in the **updated_ip_ranges** payload. The
**PUT** request is designed to overwrite the current configuration of **allowedIpRanges** in Data Distiller
with the payload specified in the request. Any existing IP ranges that are not included in the new payload will be
removed from the configuration. The new list (**updated_ip_ranges**) will become the complete set of
allowed IP ranges.

import requests

# Define your credentials and headers


headers = {
    'Authorization': f'Bearer {access_token}',  # Replace with your actual access token
    'x-api-key': 'acp_queryService_auth',       # DO NOT REPLACE THIS
    'x-gw-ims-org-id': org_id,                  # Replace with your actual Org ID
    'x-sandbox-name': 'prod',                   # Replace with your sandbox name
    'Content-Type': 'application/json',         # Indicates JSON content in the request
    'Accept': 'application/json'                # Indicates that JSON is expected in the response
}

# Define the updated IP ranges to exclude the top two entries
updated_ip_ranges = {
    "allowedIpRanges": [
        {"ipRange": "101.10.1.1"},                                    # Single IP address
        {"ipRange": "163.77.30.9", "description": "Test server IP"}   # Single IP address with a description
    ]
}

# Define the API endpoint for IP range configuration
url = 'https://platform.adobe.io/data/foundation/queryauth/security/ip-access'

# Make the PUT request to update the IP ranges
response_put = requests.put(url, headers=headers, json=updated_ip_ranges)

# Handle the response for the PUT request
if response_put.status_code == 200:
    print("Successfully updated IP ranges.")
    print("Response:", response_put.json())
else:
    print(f"Failed to update IP ranges: {response_put.status_code}")
    print(f"Error message: {response_put.text}")

The response will be:

You can use the IP Validation API endpoint to determine whether a specific IP address is authorized to access a
designated sandbox for Data Distiller usage. This ensures clarity on whether access restrictions are in place and if the
IP address has the necessary permissions to interact with data within the sandbox.
import requests

# Define your credentials and headers


headers = {
    'Authorization': f'Bearer {access_token}',  # Replace with your actual access token
    'x-gw-ims-org-id': org_id,                  # Replace with your actual Org ID
    'x-api-key': 'acp_queryService_auth',       # DO NOT REPLACE THIS
    'x-sandbox-name': 'prod',                   # Replace with your sandbox name
    'Content-Type': 'application/json',         # Indicates JSON content in the request
    'Accept': 'application/json'                # Indicates that JSON is expected in the response
}

# Define the Validate API endpoint
url = 'https://platform.adobe.io/data/foundation/queryauth/security/validate/ip-access'

# Define the request body
payload = {
    "ipAddress": "197.2.0.2"  # Replace with the IP address to validate
}

# Make the POST request
response = requests.post(url, headers=headers, json=payload)

# Handle the response
if response.status_code == 200:
    print("Validation was successful!")
    print("Response:", response.json())
elif response.status_code == 404:
    print("Endpoint not found. Verify the URL and ensure your organization has access to this API.")
else:
    print(f"Validation failed with status code {response.status_code}")
    print(f"Error message: {response.text}")

The result will look like the following:

You can remove all configured IP ranges for the sandbox. This action deletes the IP ranges and returns the deleted IP
list.

import requests

# Define your credentials and headers


headers = {
    'Authorization': f'Bearer {access_token}',  # Replace with your actual access token
    'x-api-key': 'acp_queryService_auth',       # DO NOT REPLACE THIS
    'x-gw-ims-org-id': org_id,                  # Replace with your actual Org ID
    'x-sandbox-name': 'prod',                   # Replace with your sandbox name
    'Content-Type': 'application/json'          # Indicates JSON content in the request
}

# Define the API endpoint for IP range deletion
url = 'https://platform.adobe.io/data/foundation/queryauth/security/ip-access'

# Make the DELETE request
response_delete = requests.delete(url, headers=headers)

# Handle the response
if response_delete.status_code == 200:
    print("Successfully deleted all IP ranges.")
    deleted_ip_ranges = response_delete.json().get('deletedIpRanges', [])
    print("Deleted IP Ranges:", deleted_ip_ranges)
else:
    print(f"Failed to delete IP ranges: {response_delete.status_code}")
    print(f"Error message: {response_delete.text}")

The result should look like the following:

Last updated 3 months ago

[Screenshot captions from the original page: Access the role permissions; Sandbox name can be accessed in Adobe Experience Platform UI; Return of the IP address ranges; IP ranges are now set for accepting queries into Data Distiller; Response of the updated IP range code; You can see that this IP address does not lie in the list allowed and hence the result is false; All IP ranges are deleted]

https://data-distiller.all-stuff-data.com/unit-8-data-distiller-statistics-and-machine-learning/statsml-300-ai-and-machine-learning-basic-concepts-for-data-distiller-users * * *

1. Unit 8: DATA DISTILLER STATISTICS & MACHINE LEARNING

STATSML 300: AI & Machine Learning: Basic Concepts for Data Distiller Users

Unlock the power of AI and machine learning in this course—equipping you with the basic concepts to make a real-world impact

In the past, machine learning (ML) and artificial intelligence (AI) were primarily in the realm of data scientists, who
specialized in building predictive models, tuning algorithms, and applying statistical techniques. Data engineers played
a critical role in preparing, managing, and ensuring the quality of data for these models, but the actual development of
AI systems was seen as the domain of data science. However, with recent breakthroughs in deep learning and the
emergence of Large Language Models (LLMs), the landscape has shifted, offering data engineers a unique opportunity
to dive into machine learning and AI with minimal barriers.

The Shift in Roles: From Data Science to Data Engineering


Traditional machine learning relied heavily on data scientists for tasks like feature engineering, model selection, and
hyperparameter tuning. Data engineers provided the foundational support by managing data infrastructure, pipelines,
and cleaning datasets, but the bulk of AI work was considered highly specialized and out of reach for most engineers.

As deep learning becomes more prevalent, the demand for robust, scalable data systems has never been higher. LLMs,
in particular, need vast datasets to perform at their best, and this is where data engineers can truly excel. The skills that
data engineers already possess—building scalable data pipelines, managing large datasets, and ensuring data integrity
—are now at the forefront of what makes AI and machine learning successful.

With traditional machine learning becoming more automated, data engineers are no longer confined to backend roles.
They are now stepping into the realm of machine learning and AI, playing an active role in deploying models and
ensuring that they can process real-world data in real-time.

Here’s how you can now foray into AI and machine learning:

1. LLMs and Deep Learning Require Data Expertise: LLMs like GPTs rely heavily on data volume and quality.
As a data engineer, your expertise in data processing, ETL pipelines, and data warehousing is critical to making
sure these models have the resources they need to function optimally. You can now contribute directly to model
performance by ensuring clean, well-structured data flows into these AI systems.

2. AI Systems are Data-Hungry: Deep learning models thrive on large datasets, and data pipelines are needed to
continually feed these models. Data engineers, with their experience managing large-scale data pipelines, are
perfectly positioned to take on roles previously occupied by data scientists. By optimizing these pipelines, you’re
directly contributing to the model’s effectiveness, making you an essential part of the AI development cycle.

3. Traditional ML Becoming Accessible: As machine learning has matured, many of the previously complex tasks
(such as feature engineering and model tuning) have become more accessible through automated platforms.
These platforms allow data engineers to apply ML models to their pipelines without needing to dive into the deep
statistical theory behind the algorithms. Now, engineers can build models and predictions using tools that were
once exclusive to data scientists, lowering the entry barrier into AI.

4. Data-Centric AI: Modern AI systems are becoming more data-centric, meaning that the quality, diversity, and
volume of the data often matter more than the complexity of the model itself. Data engineers are the ones with
the expertise to ensure that AI models get the best data possible. In this data-centric era, you’re no longer simply
feeding data to models—you’re shaping the quality of the insights they produce.

Who is This Concept Course For?

If you’re a Data Distiller user exploring the world of data, this course is for you. As you work through the material,
you’ll find that many of the concepts are universal, and you’ve likely already encountered them in some form. The
goal is to break down the jargon and technical terms you’ll come across, helping you see through complex arguments
and understand them clearly.

If you ever find yourself confused or lost, always return to your data. Ask the simple questions—remember, data
doesn’t lie. Don’t be swayed by flashy results or “intelligent” models without first understanding what the algorithm or
data is really doing. Be skeptical of claims of “intelligence.” Embrace the insights it offers, but learn how to manage
and evaluate it. Know what it does well and when it falls short. In the end, be the one in control of the tools, not the
other way around.

This chapter draws from my research in robotics, computer vision and neuroscience, where I observed the convergence
of mathematics and various branches of engineering, shaping the evolution of machine learning. Traditionally, many
machine learning algorithms are direct applications of statistical methods, relying on mathematical principles to
perform tasks like classification, regression, and clustering. These methods have served as foundational tools for
analyzing data and making predictions. However, with the advent of modern computation, deep learning has emerged
as a transformative force, pushing the boundaries of what machine learning and AI can achieve. Unlike traditional
algorithms that depend on predefined statistical models, deep learning uses neural networks to automatically learn and
uncover intricate patterns in massive datasets. This intersection of computational power, mathematics, and engineering
has paved the way for groundbreaking advances, particularly in fields like computer vision and natural language
processing, making deep learning the new frontier in machine learning and AI.

There was a time when terms like machine learning, artificial intelligence, and machine vision were confined to
academic circles, seen as esoteric concepts understood only by a select few. Neural networks, once popular, eventually
fell out of favor, largely due to the technical challenges that arose. While the lack of computational power to simulate
large neural networks was a significant hurdle, there were also deep theoretical limitations. By the late 1970s, interest
in the field had waned. Today, however, deep learning is a hot topic, fueled by affordable computing and the rise of big
data.

Regardless of the trends that come and go, the core principles of how we learn—and how we expect machines, robots,
or agents (whatever we choose to call them) to learn—remain fundamentally similar. In many respects, what we create
in these systems is a reflection of how we, as biological beings, are wired.

The purpose of these notes is to equip you with the knowledge to ask insightful questions about machine learning that
go beyond just the math or the algorithms. Don’t be swayed by those who throw complex equations at you or present
flashy results. If you ask the right questions and seek clear explanations, you’ll be able to discern whether what’s being
presented is legitimate or not. And that’s important—it saves you both time and money.

A learning machine, broadly defined, is any device whose actions are influenced by past experiences. A subclass of
learning machines is those that learn to discover hidden relationships in data. This also means that they can be trained
to recognize patterns.

Brains are also learning machines that condition, parse, and store data. Neural networks are non-linear dynamical
systems with many degrees of freedom that can be used to solve computational problems. The mathematical
foundations for learning machines were laid down by a group of researchers in the 1940s and 1950s. There are related
concepts of pattern, pattern classification, discriminant functions, and decision surfaces.
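
As a toy illustration of those last ideas (not specific to Data Distiller), a linear discriminant function g(x) = w·x + b assigns a pattern to one of two classes depending on which side of the decision surface g(x) = 0 it falls on; the weights and bias below are arbitrary illustrative values.

import numpy as np

# Arbitrary weights and bias for a two-feature linear discriminant g(x) = w.x + b
w = np.array([1.0, -2.0])
b = 0.5

def classify(x):
    """Return +1 or -1 depending on which side of the decision surface x lies."""
    return 1 if np.dot(w, x) + b >= 0 else -1

print(classify(np.array([3.0, 1.0])))   # g = 3 - 2 + 0.5 = 1.5  -> +1
print(classify(np.array([0.0, 2.0])))   # g = 0 - 4 + 0.5 = -3.5 -> -1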

Machine Learning (ML) vs. Artificial Intelligence (AI)

Many people tend to conflate terms like “AI/ML” without recognizing the key differences. However, “learning” and
“intelligence” are not the same. You might learn less than others but still demonstrate greater intelligence when
tackling a decision-making problem.

Setting aside the technicalities, it’s important to note that the fields have evolved differently. Think of machine learning
as focusing on specialized, functional expertise, while artificial intelligence aims for broader generalization across
multiple domains.

Machine Learning (ML)

Learn from Data: Learn patterns and make predictions from data. It involves training models on large datasets to improve performance on specific use cases. Learning from data is a fundamental aspect of ML.

Functional Learning: You could design machine learning algorithms for different tasks.

Artificial Intelligence (AI)

Build an Intelligent System: These can be rule-based or symbolic, or use a knowledge base to reason when faced with a context. They may not always involve learning from data. Some AI systems are rule-driven and do not adapt or learn from new information. You will see these in robotics, where certain aspects are entirely rule-based while individual functions run decentralized algorithms that are all ML-based.

System-Level Decisions: Based on what you can learn from what you perceive, make decisions, i.e., should I take the chance and attack?

Generative Language Models are fundamentally machine learning models. However, they have learned such an
extensive range of patterns that they now display emergent capabilities that resemble aspects of human intelligence.

What is the Essence of Learning?

Learning is not just confined to machines.

The fundamental goal of any learning task is to be able to generalize i.e. ability to apply principles and concepts
learned from examples to a new problem. This is pretty much in line with our educational experience in school or
college.

In the most general form, what this means is that you should be able to generate a solution “y” for any new problem “x” thrown at you, i.e., y = f(x). Here “f” denotes the algorithm you would use, based on the examples that have been taught to you, i.e., pairs (x1, y1), (x2, y2), and so on.

x1 and x2 could themselves be vectors of values (latitude and longitude, for example). We may choose only one of these values in our modeling because we find that it is a “feature” that seems to have a lot of influence on the output. We call the values we choose “features” and the vectors “feature vectors”. They are no different from the attributes or dimensions that you encounter in the relational world, except that you are being smart about how many you actually need.

Remember that f is an algorithm or a technique that you are using to establish the relationship between the inputs x
and outputs y. It is not that you are discovering the actual mathematical relationship between the two in the real world.
Your function may be an approximation of that real-world function because you observed data that was perhaps
confined to a smaller subset. Your ability to establish relationships is only a function of the data you have. Plus, you
will also need to make some assumptions about the nature of the data x and the output y so that you can solve the
problem in a cost-effective way. Hence, if you do not state your assumptions, you are fooling yourself.
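
A minimal sketch of this idea, using scikit-learn purely for illustration with made-up example pairs: the fitted model below plays the role of f, learned from a handful of (x, y) examples and then asked to generalize to a new x.

import numpy as np
from sklearn.linear_model import LinearRegression

# Training examples: pairs (x_i, y_i). Here x is a single feature and the
# underlying relationship is roughly y = 2x + 1 with a little noise (made-up data).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

# The fitted model plays the role of "f": an approximation of the real-world
# relationship, trustworthy only over the region the examples cover.
f = LinearRegression().fit(X, y)

# Generalization: produce a solution y for a new problem x that was never seen
x_new = np.array([[6.0]])
print(f.predict(x_new))   # should come out close to 13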

Some comments can be made about the way you are taught these examples:

1. Quality of the training set: These examples should have variety so that you are able to learn the key concept
from multiple points of view.

2. Reducing Redundant Information: Imagine being given a textbook with paragraphs that are repeated multiple times throughout the text. This repetition can slow you down or, worse, waste your time, which is the one resource you have. You would take a marker and strike out these duplicate paragraphs. You just reduced the size of the data within a chapter without losing the essence of what the chapter was about. You will hear machine learning folks call out things such as “dimensionality reduction”, which is similar to this idea.

3. Overfitting: You could spend all your time mastering specific examples to the point where you ace every test
based on them. Your accuracy with these examples becomes exceptional—almost like you’ve memorized them.
In machine learning terms, this is called “overfitting”—when you’ve learned the examples so well that you’ve
essentially memorized them. The downside is that your ability to generalize and handle new, unseen questions
(those outside the “syllabus”) becomes compromised.

4. Underfitting: However, the test may not include those exact examples. There will be some problems similar to
what you were taught, and you’ll likely do well on those. But there will also be problems that are quite different,
requiring you to apply what you’ve learned in new ways. Your success with these unfamiliar challenges depends
on how much practice you’ve had with similar types of problems. In machine learning terms, if you haven’t
learned the examples well enough (you can generalize but struggle with accuracy), this is known as
“underfitting.”
5. Cost: You have a finite time to finish your preparation for the test. Hence to maximize your chances of a high
score, you need a strategy.

6. Test: Your learning strategy depends on the nature of the test:

If most questions are similar to what you were taught, it makes sense to focus on revising those examples.
This is why rule-based or rote-learning strategies are popular with many students—they work well when the
test closely mirrors the material.

However, if the majority of questions are different from what you’ve learned, you’re better off practicing a
wider variety of problems to improve your ability to adapt to new situations.

In reality, finding the right balance between these strategies is key. In many ways, you’re making a bet on the
future—there’s no certainty about what will happen, but you must take calculated risks. Life, like learning, is
about navigating uncertainty.

7. Feedback: Suppose you have the chance to retake the test. It would be incredibly helpful to analyze the pattern
of questions that were asked, understand which ones you answered correctly (and why), and identify those you
got wrong (and the reasons behind it).

Overfitting is commonly associated with low bias and high variance, which indicate how well a model captures the
complexity of the task. Low bias means the model makes minimal errors by accurately reflecting the complexity of the
data without oversimplifying. However, as the model becomes more complex, it also starts learning irrelevant details
or “noise,” which increases variance. Variance refers to the model’s sensitivity to changes in the training data, making
it less capable of generalizing to new, unseen data.

Underfitting is typically associated with high bias and low variance, which indicate how poorly a model captures the
complexity of the task. High bias means the model makes significant errors because it oversimplifies the data, failing
to learn important patterns. In this case, the model is too simplistic and cannot capture the underlying structure of the
data. While the model has low variance, meaning it is not sensitive to fluctuations in the training data, it struggles to
perform well even on the training set, let alone on new, unseen data. This results in poor overall performance and lack
of generalization.
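
One compact way to see both failure modes is to fit polynomials of increasing degree to the same small, noisy sample (all numbers below are made up): the degree-1 fit underfits the curved relationship, while the degree-15 fit tends to chase noise and overfit, which typically shows up as a low training error paired with a worse error on fresh data.

import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a quadratic relationship (all values are illustrative)
x = np.linspace(-3, 3, 20)
y = x**2 + rng.normal(scale=1.0, size=x.size)

# Fresh data from the same process, used to gauge generalization
x_test = np.linspace(-3, 3, 50)
y_test = x_test**2 + rng.normal(scale=1.0, size=x_test.size)

for degree in (1, 2, 15):
    # Degree 1 underfits the curve; degree 15 tends to overfit (and may emit a
    # RankWarning, itself a hint that the fit is ill-conditioned).
    coeffs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:7.3f}  test MSE={test_mse:7.3f}")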

Reinforcement learning is a concept associated with psychologists like B.F. Skinner. Reinforcement learning focuses
on how behaviors are learned and modified through the consequences that follow those behaviors, such as rewards and
punishments. That is why having a “reward” manifest via some form of appreciation encourages us to “adapt” and
hence learn.

Self-motivation is very powerful when you are searching for “training examples” to “learn” new concepts. Unlike
robots that cannot be self-directed like us, this quality is what leads to us coming up with newer and faster ways of
doing things.

Machines and Organisms: Overview of Neural Networks

We often hear about deep learning because these neural networks can have many layers with billions of parameters.
But what are these neurons, and how do they manage to approximate real-world relationships so well? The answer lies
in how we humans do it. By mimicking how our brains work, we bring that ability into the machines and agents we
build.

The building block of all the wiring in our brain is the neuron (or nerve cell). A neuron can either fire (transmit a
signal, represented as 1) or remain inactive (transmit a 0). At their core, both neurons and electronic logic gates exhibit
binary behavior—they either produce an output or don’t, depending on certain conditions or thresholds. By connecting
these neurons in a network, we can create any logical system.
The key takeaway here is that we can use simple, non-linear units (neurons or gates) to form multiple layers of
complex behavior. The challenge is in figuring out the structure, pathways, and priorities needed to generate the
desired outputs.
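
As a tiny illustration of that point, a single threshold "neuron" with hand-picked weights can already behave like the basic logic gates; the weights and thresholds below are just one possible choice, not the only one.

# A single threshold neuron: fire (output 1) if the weighted sum of inputs
# crosses the threshold, otherwise stay inactive (output 0).
def neuron(inputs, weights, threshold):
    return 1 if sum(i * w for i, w in zip(inputs, weights)) >= threshold else 0

# Hand-picked weights/thresholds implementing AND and OR gates
for a in (0, 1):
    for b in (0, 1):
        and_out = neuron((a, b), weights=(1, 1), threshold=2)
        or_out = neuron((a, b), weights=(1, 1), threshold=1)
        print(f"a={a} b={b}  AND={and_out}  OR={or_out}")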

As long as we can represent input and output data in binary form, it’s possible to wire up circuits in the brain that map
inputs to outputs. Learning is the process by which the brain builds that mapping. It’s often said, “If I can think it, I can
learn it without physically doing it.” While this is partially true—because you can mentally prepare the brain for
learning—you still need physical interaction and stimuli to make the learning truly effective.

Did you know that in basic mathematics, the Weierstrass approximation theorem states that any continuous function
defined on a closed interval [a, b] can be approximated as closely as desired by a polynomial function? Continuous
functions are everywhere in the physical world.

Interestingly, neural networks can approximate polynomial functions too. In fact, neural networks are universal
function approximators, meaning they can approximate a wide range of functions, including polynomials. The key to
this capability lies in their architecture and the activation functions used within the network.

That’s quite remarkable—it sheds light on how we are able to learn and understand physics-based functions, which are
often continuous by nature. It’s also important to note that this ability to learn implies we can do so within a finite
amount of time.

Here is what we would need to consider:

1. Architecture: To approximate polynomial functions, you can use a feedforward neural network with one or more
hidden layers. The number of neurons in each hidden layer and the overall depth of the network can be adjusted
depending on the complexity of the polynomial you’re trying to approximate.

2. Activation Functions: Common activation functions like the sigmoid can be applied in the hidden layers. These
activation functions introduce non-linearity, enabling the network to model more complex relationships and
behaviors.

3. Training: The network is trained using a dataset of input-output pairs, where the inputs are the values of the
independent variable (e.g., x) and the outputs are the corresponding values of the polynomial function (e.g., f(x)).
During training, the neural network adjusts its weights and biases to minimize the difference between its
predicted outputs and the actual values in the dataset.

4. Performance Evaluation with Loss Functions: A typical loss function for regression tasks, such as
approximating a polynomial, is mean squared error (MSE). MSE measures the average squared difference
between the predicted and actual values, similar to how Root Mean Square (RMS) values are used in electrical
engineering. Just as RMS captures the power intensity of a signal without considering its peaks or signed values,
MSE provides a way to assess the overall error without focusing on outliers.

5. Optimization: Gradient-based optimization algorithms are employed to minimize the loss function and iteratively update the network’s parameters, improving the model’s performance with each iteration. (A minimal sketch that puts these steps together follows this list.)
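
Putting the five considerations above together, here is a minimal, illustrative sketch using scikit-learn's MLPRegressor; the target polynomial, network size, activation, and iteration count are all arbitrary choices for demonstration, not a recommended configuration.

import numpy as np
from sklearn.neural_network import MLPRegressor

# Target: approximate the polynomial f(x) = x^3 - 2x on [-2, 2].
rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, size=(500, 1))
y = X[:, 0] ** 3 - 2 * X[:, 0]

model = MLPRegressor(hidden_layer_sizes=(32, 32),  # two hidden layers (step 1)
                     activation="tanh",            # smooth non-linearity (step 2)
                     max_iter=5000,                # training iterations (steps 3 and 5)
                     random_state=0)
model.fit(X, y)

# Mean squared error on fresh points, the loss notion from step 4
X_test = np.linspace(-2, 2, 50).reshape(-1, 1)
y_test = X_test[:, 0] ** 3 - 2 * X_test[:, 0]
mse = np.mean((model.predict(X_test) - y_test) ** 2)
print(f"Test MSE of the neural approximation: {mse:.4f}")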

Do I Need More Neurons? The number of neurons in the human brain is relatively consistent across individuals. However, what can differ significantly among individuals is the efficiency and organization of their neural networks. Highly intelligent people often exhibit more efficient and optimized neural connections, allowing for quicker and more effective processing of information. It’s not about having more neurons, but rather how those neurons are connected and function.

Our cognitive abilities are shaped by our experiences. In fact, evolutionary biologists suggest that human intelligence
evolved in response to the physical body’s interactions with the environment. For instance, the ability to reach out and
pluck a fruit—a seemingly simple task—requires a level of intelligence not found in many other species. This
interaction between physical tasks and mental development highlights how intelligence may have evolved as a
practical adaptation to our surroundings.

Modeling Human Memory: One way to think about human memory is as an implicit “lookup table.” However,
unlike a computer, the human brain doesn’t have dedicated “storage cells” to hold individual bits of information.
Instead, we rely on neurons, and the information is encoded within neural networks. These “lookups” are essential
components of the knowledge we use to navigate the real world. But building true intelligence requires much more
than just machine learning.

Structural Layout of Neural Networks

One of the key findings about the brain as to what we perceive as different forms of “intelligence” has to do with the
size and number of neurons, connections, and layers. The reason why each of us is “intelligent” in a different way is
because our brains have figured out efficient ways of organizing the neurons and the flow of information. In fact,
“insight” or “aha moments” are supposed to be conscious moments of us discovering these pathways.

Many of our abilities—such as perception, movement (which we begin learning as toddlers), language, thought, and
memory—are made possible by the interconnected network of both serial and parallel processors in distinct regions of
the brain, each responsible for specific functions. If one area of the brain is damaged, you are not entirely impaired; the
brain has the remarkable ability to reorganize its processing units to recover lost functions. However, this
reorganization takes time and requires training, which is closely tied to motivation and encouragement.

As you add neural layers from the input to the output, you are creating higher-level abstractions. Each stage can be
thought of as the input to the next stage. The last stage prior to the final output should have the highest abstraction.
Layers add abstraction and refinement to the learning.

In the human brain, over time, without training or inferencing, these links can become weak and you experience forgetting what you have learned. This may not be a bad thing, as we all know: overcoming a bad experience can be addressed by giving it time, engaging in different activities, or even changing our environment.

Creativity and Gaps in Learning: One of the greatest challenges for renowned guitarists like David Gilmour, or even
entire bands, is making each song and album sound distinct. The music is often tied to a specific period in the band’s
history, and the songs tend to shape the guitar playing accordingly. Breaking free from what the band has already
produced is extremely difficult. This likely explains why Pink Floyd took extended breaks between albums—they
needed time to “reset” and allow their creative links to fade. In contrast, a band like U2, later in their career, began
releasing albums almost annually, but the music started to blend together, leading to diminishing returns. Creativity
thrives on long breaks, and it’s challenging to continuously generate fresh, groundbreaking ideas without giving the
mind time to recharge.

Marketer-Machine Learning Analogies

If we consider how marketers learn, we can find many parallels with machine learning. While the fundamental process
of learning is similar, the mechanisms used by humans are more adaptable due to evolution.

These techniques are not only relevant to marketing but also commonly applied in education, training, and personal
development for acquiring new skills. Below are some common learning methods used by marketers and their
corresponding machine learning techniques:

Currently, most machine learning algorithms require human involvement at various stages, such as data labeling,
model selection, and evaluation. However, in the future, it’s possible that these algorithms could bootstrap themselves,
autonomously optimizing their learning processes through techniques like self-supervised learning or reinforcement
learning, potentially reducing or eliminating the need for human intervention.
In marketing, participative learning involves engaging with customer insights actively, using feedback sessions,
brainstorming, and A/B testing to refine strategies. Discussions and debates within teams lead to creative solutions and
improvements in campaigns.

Active Learning refers to algorithms identifying the most valuable customer segments or touchpoints to focus on for
better engagement, using minimal data points for maximum insight. These models prioritize which customer data to
further analyze, improving targeting efficiency.

Visual/Auditory/Lab Learning

Marketers absorb information through visual aids like customer journey maps, heatmaps, and campaign performance
charts. Listening to customer feedback through interviews, surveys, and social media also drives understanding.
Hands-on learning comes from testing campaign strategies in the market.

Computer Vision/NLP: Machine learning models process customer data from various media—analyzing images (e.g.,
product photos), speech (e.g., call center data), and text (e.g., reviews) to extract insights on customer preferences,
sentiment, and behavior for optimized marketing strategies.

Marketers use mnemonics or frameworks like the “4 Ps of Marketing” (Product, Price, Place, Promotion) to organize
their strategies and recall best practices. These frameworks help marketers maintain consistency and effectiveness
across campaigns.

Feature Engineering: In machine learning, marketers work with engineers to create features that make data more
actionable. For example, segmenting customers based on engagement patterns or purchasing behavior allows for more
targeted marketing efforts and better customer insights.

Marketers use mind mapping to visually organize campaign ideas, segment customers, and brainstorm strategies for
product launches or promotions.

Topic Modeling (LDA): Machine learning can analyze large sets of customer text data (e.g., social media or reviews)
to identify key topics and themes, providing insights that help marketers organize strategies and content.

Marketers use key statistics or insights (like flashcards) to quickly memorize important facts about customer segments,
brand values, or trends that can be applied in campaigns.

Labeling: In machine learning, labeling data points allows models to understand what they represent (e.g., identifying
customer behaviors or preferences), helping models learn effectively from labeled data.

In marketing, collaboration between teams (e.g., creative and analytics) helps solve problems, share insights, and
improve strategies through diverse perspectives.

Ensemble Learning: Just as teams work together, ensemble learning combines multiple models to improve
predictions in marketing, such as combining models for customer segmentation, churn prediction, or ad targeting for
better performance.

In marketing, gamification involves using game-like elements (e.g., points, competitions, or rewards) in customer
engagement strategies to increase motivation and interaction. Loyalty programs or contests are examples.

Reinforcement Learning: This can be used to optimize customer engagement by rewarding behaviors that lead to
desired actions (e.g., purchases or loyalty). Models learn by experimenting with various tactics and adjusting based on
results.

Marketers analyze and evaluate data from campaigns to make informed decisions. They synthesize customer insights, feedback, and market trends to refine strategies and solve problems. The machine counterpart here is not so much ML as classical, rule-based AI.

Expert Systems: AI systems in marketing can simulate expert problem-solving by analyzing customer data and
making decisions on the best campaign strategies, content, or offers for specific audiences.

Marketers learn by addressing real-world challenges such as optimizing campaigns or solving customer pain points.
This hands-on learning helps marketers enhance problem-solving skills.

Training Dataset: Machine learning models in marketing are trained on real-world customer data, learning from past
behaviors to predict future actions and help optimize marketing strategies.

In marketing, asking open-ended questions can stimulate creative thinking and drive deeper exploration of customer
needs, leading to more innovative solutions.

Data Exploration: In machine learning, exploring customer data involves asking open-ended questions to uncover
trends, patterns, or insights that can guide strategic decisions and future campaigns.

Marketing strategies are constantly refined based on customer feedback and assessment of campaign performance.
This process helps identify areas for improvement.

Evaluation Metrics: In machine learning, you would assess model performance using metrics such as accuracy,
recall, and precision. Feedback from the model’s performance helps refine marketing campaigns.

How do I Implement a Machine Learning (ML) Algorithm?

There are essentially the key steps in this process and some of them are iterative:

1. Use Case Analysis: Truth be told, if you do not understand the goal of the algorithm you are implementing and what exactly you are looking for, you will design something that no one understands and that will not be useful.

Budget: You should ask questions about the budget. Remember that every prediction you make is a cost
that has to be allocated to some business process. Budget will decide the complexity of the model, the size
of the data, and how frequently you want to train.

2. Data Exploration: You need to understand the volume and variety of the data. More importantly, you have to get a feel for the structure of the data. Analysis is an absolute requirement, so improve your SQL and Python skills if you have not already.

3. Feature Engineering: You will have to make choices on the attributes you care about and the dimensionality of
your data. You will also need to represent the data in a different topological space as required by the algorithm or
the nature of the data itself.

4. Test Drive Multiple ML Algorithms: You will need to sample and test multiple ML algorithms. You may need
to start architecting a strategy as to whether you want to mix and match them.

Evaluation of performance metrics becomes absolutely critical at this step. It is standard practice now to
split the data into training and test data.

Parameter Tuning: Your algorithm will be using parameters and you want to be able to tune them so that
you get the best performance or get the desired behavior for your use case for the same performance.

Modeling Assumptions: Every algorithm you choose assumes certain facts are true about the input/output
data and the model. You need to be able to articulate what those assumptions are.
Explainability: At this point, you should be able to articulate what the algorithm is doing, where it fails,
and what it needs in order to become better.

5. Train and Optimize the Models in Production: Choose a state-of-the-art production system to do the training.

6. Model Inferencing: Use the trained model on the live data to make predictions.

7. Retrain the Model: Remember the heart of the algorithm is the data. You are always looking for data to train on
that has the most impact on the performance in the future. The decision to retrain on the full data or the new data
depends on many factors:

1. Drifts: The structure of the data or even the relationship between the features and the output may change
over time. You will see this as a degradation in the model inferencing. If this happens, I would give more
weight to the new data and decide when I want to retrain.

2. Volume of Data: Sometimes the volume of data is so large that it may not be worth the spend to retrain on all of it. In that case, I may want to sample my way through it.

3. Regulatory: Sometimes regulatory requirements such as data retention laws will require you to retrain.
Here I do not have a choice.

4. Nature of the Algorithm: If the algorithm makes predictions by relying on interrelationships between data
(in time and space), then you do not have a choice but to retrain.

Reality Check: Steps 3, 4, and 5 will take up about 80% of your time. It’s often joked in machine learning circles that
data scientists spend most of their time on BI and dashboards, leaving little room for “real” data science. The reality is
that many algorithms and deployment tools are readily available off-the-shelf. What truly distinguishes you is your
ability to apply these tools effectively, starting with a deep understanding of the data and developing a strong
algorithmic strategy.
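To make these steps concrete, here is a minimal sketch in Python using pandas and scikit-learn (assuming both are installed). The file name and column names — customer_data.csv, monthly_spend, churned, and so on — are hypothetical placeholders; the point is the shape of the workflow (explore, split, tune, train, evaluate), not the specific model.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Hypothetical dataset: a few feature columns plus a binary target ("churned")
df = pd.read_csv("customer_data.csv")                             # Step 2: data exploration starts here
X = df[["monthly_spend", "tenure_months", "support_tickets"]]     # Step 3: chosen features
y = df["churned"]

# Step 4: hold out a test set so evaluation is honest
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 4 (continued): parameter tuning via cross-validated grid search
param_grid = {"n_estimators": [100, 300], "max_depth": [4, 8, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)                                      # Step 5: training

# Step 6: inference on unseen data, plus evaluation metrics
print(search.best_params_)
print(classification_report(y_test, search.predict(X_test)))
```

Retraining (step 7) is then just a matter of re-running this pipeline on refreshed data once drift, data volume, or regulatory triggers call for it.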

Key Machine Learning Algorithms

Remember that any form of prediction is ultimately inductive reasoning meaning that we are trying to generalize or
predict based on specific observations or data/examples. Generalization is what we are looking for. If we cannot induce
the prediction, then our learning has failed. Deductive reasoning goes the other way around - you start with the first
principles, rules, and assumptions and build your way to the conclusion (which could be a generalization). It is very
akin to proving a theorem in geometry.

A Comment on Supervision: You’ll often come across terms like supervised and unsupervised learning. Supervised
learning means that a human (like you or me) is involved in labeling the data for training. In contrast, unsupervised
learning occurs when the algorithm identifies patterns or labels on its own, as seen in clustering. Many techniques are
actually semi-supervised, blending elements of both supervised and unsupervised approaches.

Most machine learning algorithms typically focus on either predicting numerical values (regression) or classifying data
into categories (classification). However, some algorithms are designed for tasks like clustering, dimensionality
reduction, or generating new data (generative models).

If you read machine learning books, you will come across some of these algorithms:

Linear and Nonlinear Regression

Establishes a linear or nonlinear relationship between the inputs and outputs.

Typical uses: average spend modeled as a function of income, family size, and geographic location.

ARIMA (Time Series Forecasting)

Very popular in time series forecasting. It captures the relationship between a time series and its lagged values (Auto-Regressive). The "Integrated" component is the differencing needed to make the time series stationary. The "Moving Average" component models the current value on past forecast errors (residuals).

Typical uses: segment trends, revenue trends, web traffic trends.

K-Means Clustering

The key idea is to start with 'K' centroids, iteratively assign points to these K clusters, and keep updating the centroids based on the points assigned to each.

Typical uses: customer segmentation, anomaly detection.

Support Vector Machines (SVM)

SVMs seek to find the optimal hyperplane that separates data points into different classes. SVMs are also capable of fitting nonlinear data by using a clever technique: the kernel trick. A kernel is a mathematical construct that can "warp" the space where the data lives. The algorithm can then find a linear boundary in this warped space, making the boundary nonlinear in the original space.

Typical uses: customer segmentation, anomaly detection.

K-Nearest Neighbors (KNN)

Predicts the class of a new data point based on the majority class of its nearest neighbors (for classification) or the average of nearby data points (for regression).

Naive Bayes

The key idea is to assume that all features are independent (the "naive" assumption) and then use the simplified conditional probabilities to classify a new event.

Logistic Regression

This is very similar to numerical regression except that the output variable is discrete (a class rather than a number).

Old-Fashioned Decision Trees

The key idea is to partition the feature space so that you can localize the output possibilities. Very similar to nested IF-THEN-ELSE semantics.

Typical uses: customer segmentation, lead scoring, market basket analysis, recommendations.

Random Forests

The idea is to combine the predictions of multiple (an ensemble of) decision trees to make better predictions than a single decision tree.

Typical uses: customer segmentation, lead scoring, market basket analysis, recommendations.

So Many Options, So Little Time: All of the machine learning algorithms mentioned are simply mathematical
models representing real-world phenomena. As a result, many of them can be used to tackle the same set of problems.
For instance, you could apply a deep learning algorithm to solve a basic linear regression task. However, the general
rule of thumb is to opt for the simplest algorithm and prioritize gathering as much data as possible. In the long run,
data volume trumps algorithmic complexity. In the world of AI, the real battle is always about the data, not the
algorithm.

Heuristics: No matter which algorithm you use to generate results, applying heuristics or practical judgment is
essential. This is crucial in machine learning, whether you’re doing feature engineering or choosing an algorithm
strategy. The core idea is to find solutions efficiently by using simple rules of thumb or shortcuts that draw on prior
knowledge and experience. Keep in mind that this knowledge is often domain-specific and can vary based on the
specific problem you’re trying to solve.
In the early days of machine learning, people would often advocate for their preferred algorithm. In new scenarios,
they would compare multiple algorithms to find the best one. Nowadays, the standard approach is a “combo”
technique known as “ensembles.” The idea is similar to a team-based effort, where each model in the ensemble works
together—either by covering each other’s weaknesses or combining their outputs to achieve better results.
Interestingly, this teamwork strategy, even with less individually powerful models, often outperforms using the best
individual model alone. However, the challenge lies in determining, from a strategic perspective, which model should
handle which part of the task. Two key concepts come into play here:

1. Bagging or Bootstrap Aggregating: In bagging, multiple instances of the same base model are trained in
parallel on different subsets of the data and their predictions are combined. The base models can be of the same
type but trained on different subsets of the data with randomness introduced through resampling. As mentioned
above, the Random Forest is a bagging ensemble of decision trees. Each decision tree is different and trained
independently, and their predictions are aggregated to make the final prediction.

2. Boosting: In boosting, base models are trained sequentially, and each subsequent model is trained to correct the
errors made by the previous ones. The base models in boosting can be different types of models, and they are
weighted based on their performance. AdaBoost (Adaptive Boosting) can use a variety of base models, such as
decision trees, linear models, or other classifiers, in a sequential manner.
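As a rough illustration of the two strategies, the sketch below uses scikit-learn's stock BaggingClassifier and AdaBoostClassifier on a synthetic dataset (both default to decision-tree base learners). The data is generated, not real, and the parameters are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Bagging: many full decision trees trained in parallel on bootstrapped samples,
# with their votes averaged at the end
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: weak learners (shallow trees by default) trained sequentially,
# each one focusing on the examples the previous ones got wrong
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean().round(3))
```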

Bootstrapping: You’ll hear this term frequently in machine learning. The concept involves random sampling with
replacement, meaning that each sample can be selected multiple times, while others may not be selected at all.

For example, suppose you have a list of numbers: [55, 56, 57, 58, 59, 60]. Your task is to create three
samples of size 3 using the bootstrap method:

Sample 1 could be: [55, 55, 55]

Sample 2 could be: [55, 56, 57]

Sample 3 could be: [55, 59, 59]

In this case, some numbers, like 55, appeared multiple times, while others weren’t selected at all. This happens
because we used the “with replacement” approach—after each selection, the chosen number (e.g., 55) is placed back
into the pool, making it available for selection in subsequent draws.

Bootstrapping is particularly useful for small datasets since it allows you to generate numerous samples as needed. It
also eliminates the need to make assumptions about the underlying distribution of the data (assuming independence).
However, this assumption may not hold true for time series data, where dependencies like seasonal fluctuations could
exist.
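The sampling mechanics above are easy to reproduce with NumPy; the snippet below is only a sketch of drawing samples with replacement and of one common use, estimating the variability of a statistic.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
data = np.array([55, 56, 57, 58, 59, 60])

# Draw three bootstrap samples of size 3, with replacement:
# the same value can appear more than once, and some values may never appear.
for i in range(3):
    sample = rng.choice(data, size=3, replace=True)
    print(f"Sample {i + 1}: {sample}")

# A common use: estimate how much the sample mean would vary across resamples
boot_means = [rng.choice(data, size=len(data), replace=True).mean() for _ in range(1000)]
print("Bootstrap estimate of the mean's standard error:", np.std(boot_means).round(3))
```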

Mix and Match Algorithms: In ensemble machine learning, the models you combine don’t need to be of the same
type. One of the key benefits of ensemble methods is their ability to improve predictive performance by leveraging
diverse models. Ensembles work by aggregating the predictions of multiple base models (often called “base learners”
or “weak learners”) to make a final prediction. These base models can vary in type and even use entirely different
algorithms, allowing for a more robust and versatile approach to problem-solving.

The choice of performance metrics is determined by the goals of the specific machine learning task and the preferences
of the data scientist. Here are the typical steps involved in selecting performance metrics for a machine learning
algorithm:

1. Business Objectives: First and foremost, the metrics should align with the broader business objectives. For example, in an e-commerce recommendation system, the goal might be to maximize conversion rates. This must be weighed against the cost of developing and maintaining the algorithm and data.
2. Type of Problem: The second step is to define the ML problem. Different problems require different metrics. For
example:

Regression: For regression tasks (e.g., predicting house prices), common metrics include mean squared
error (MSE) and R-squared (R²)

Classification: If you are solving a classification problem (e.g., spam detection), you might use metrics like
accuracy, precision, recall, F1-score, and ROC AUC.

Clustering: Clustering algorithms might use metrics like silhouette score or Davies-Bouldin index.

3. Industry Specific: Consider industry-specific knowledge and constraints. Some metrics may be more relevant or
meaningful in certain domains. For instance, in medical diagnosis, sensitivity and specificity may be crucial.

4. Data Characteristics: The nature of your data can influence metric selection. Noisy data might need robust
metrics.

Be cautious of imbalanced data. This occurs when one class is significantly underrepresented compared to another,
which can create challenges during model training and evaluation due to skewed class distributions. In contrast,
balanced data provides a more even distribution across classes, making it easier to train models and evaluate their
performance accurately. If avoiding imbalanced data isn’t possible, you may need to use alternative performance
metrics to properly assess your model’s effectiveness.

Linear/Nonlinear Numerical Regression

Some common metrics used are Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error
(RMSE), and R-squared (R²).

1. Mean Absolute Error (MAE): MAE (Mean Absolute Error) shows how much the model’s predictions differ
from the actual values, on average. A smaller MAE means the predictions are generally closer to the real values.
It gives an idea of the “average” error the model makes, without worrying whether it predicted too high or too
low.

2. Mean Squared Error (MSE): MSE (Mean Squared Error) measures how far the predicted values are from the
actual values, but it squares the differences, so bigger mistakes have a much larger impact. This makes MSE
especially useful when you want to place more importance on larger errors, making sure the model doesn’t make
any big mistakes.

3. R-squared (R²) also known as the Coefficient of Determination, shows how well a regression model fits the
actual data. It tells you how much of the variation in the outcome (the thing you’re trying to predict) is explained
by the input variables (the things you’re using to make the prediction). In simple terms, it helps you understand
how well your model matches the real data points.

4. Root Mean Squared Error (RMSE): RMSE is the square root of MSE and is expressed in the same units as the target variable, which makes it easier to interpret than MSE while still penalizing larger errors more heavily.
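For reference, all four metrics can be computed directly with scikit-learn and NumPy. The actual and predicted values below are made-up numbers used purely to show the calls.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([220.0, 310.0, 150.0, 480.0])   # actual values (e.g., customer spend)
y_pred = np.array([200.0, 330.0, 170.0, 450.0])   # model predictions

mae = mean_absolute_error(y_true, y_pred)    # average absolute miss
mse = mean_squared_error(y_true, y_pred)     # squared misses, punishes large errors
rmse = np.sqrt(mse)                          # back in the original units
r2 = r2_score(y_true, y_pred)                # share of variance explained

print(f"MAE={mae:.1f}  MSE={mse:.1f}  RMSE={rmse:.1f}  R2={r2:.3f}")
```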

Classification Algorithms

Let’s frame this section in terms of taking a test with a mix of easy and hard questions. As the test taker, you don’t
know in advance which questions are easy and which are hard—you have to make educated guesses and decide where
to focus your time. You can’t answer every question, so you need to choose a strategy. Let’s say your strategy is to go
after the hard questions because that’s where your strengths lie. The key metrics you’ll use to evaluate your
performance on this test are similar to how machine learning models are assessed: Accuracy, Precision, Recall, F1-
Score, and ROC AUC.
Just as strategy plays a vital role in balancing speed versus stability for something like a self-balancing robot, your
approach here balances several factors: how well you find the hard questions (recall), avoid unnecessary mistakes
(precision), and how close your overall performance is to reality (accuracy). These metrics help you evaluate whether
your test-taking approach is effective, much like they help measure the performance of machine learning models.

Accuracy: Accuracy represents your overall performance on the test. It’s like getting a percentage score based on
how many questions you answered correctly out of the total attempted. However, accuracy can be misleading if
the test is imbalanced—for instance, if there are mostly easy questions and just a few difficult ones. Even if you
get all the easy ones right, your accuracy might still look high, but you may have missed the more challenging,
critical questions. Remember, this metric doesn’t penalize you for questions that you left unattempted.

Precision: Precision is about avoiding mistakes, or false positives. In this test analogy, it’s like ensuring you only
answer the hard questions and don’t waste time on the easy ones. Precision measures how well you stayed
focused on answering the difficult questions correctly without mistakenly tackling the easy ones. If you end up
answering questions you shouldn’t have, your precision drops. If you didn’t attempt certain questions, they aren’t
factored into precision.

Recall (Sensitivity) or the True Positive Rate: In terms of our test analogy, recall is like your ability to find and
answer all the difficult questions (true positives). If you miss some of these tough questions (false negatives),
your recall suffers. High recall means you’ve successfully identified and attempted most of the hard questions,
which is key if your goal is to tackle the challenging ones. Note that this metric does capture the unattempted
ones as your failure to identify them is a recall failure.

Fall-Out (False Positive Rate): This refers to how often you mistakenly answered easy questions (false
positives) when your goal was to focus on the hard ones. It measures the proportion of easy questions you
incorrectly attempted out of all the easy questions on the test.

ROC AUC (Receiver Operating Characteristic - Area Under the Curve) helps evaluate how well you balance
answering the difficult questions (high recall) while avoiding the easy ones (low fall-out) in our test analogy. The
ROC curve plots Recall (True Positive Rate) against Fall-Out (False Positive Rate) at different decision
points, or thresholds, representing how you performed under varying strategies. At the end of the day, a higher
AUC combined with a high true positive rate (recall) means you are good at answering the important, difficult
questions while staying focused and avoiding unnecessary errors with the easier ones. This makes AUC a
comprehensive measure of how well your test-taking strategy worked.

Unwanted/Easy Questions as False Positives: The “easy” questions that you mistakenly answered are your false
positives. If you answer many easy questions that you weren’t supposed to, especially if you get them wrong, your
precision decreases because you’ve deviated from your focus on the difficult ones.

In multi-class classification, precision is typically calculated independently for each class, and then you might
calculate an average precision across all classes if needed.
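Sticking with the test-taking analogy, the sketch below computes these metrics with scikit-learn on a tiny, made-up set of labels, where 1 means "a hard question I should attempt" and 0 means "an easy question I should skip."

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]            # which questions were truly hard
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]            # which questions the strategy attempted
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]   # confidence that a question is hard

print("Accuracy :", accuracy_score(y_true, y_pred))    # overall hit rate
print("Precision:", precision_score(y_true, y_pred))   # of the attempted, how many were truly hard
print("Recall   :", recall_score(y_true, y_pred))      # of the truly hard, how many were found
print("F1-score :", f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print("ROC AUC  :", roc_auc_score(y_true, y_score))    # ranking quality across all thresholds
```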

Last updated 4 months ago

A basic AI example with some machine learning: Consider a self-balancing robot placed on a test track where the
classic trade-off is between speed and stability. The stability control parameters were learned through statistical
experiments, but a finite state machine is used to implement rules that protect against falls or react when the camera
sensors detect sharp curves. By recognizing the context of its environment, the robot can switch strategies accordingly
to maintain balance and navigate effectively.

Neural networks in the brain can be way more complex than the 2-D pictures of neatly arranged layers you see in
machine learning literature.

Layers inside a neural network add abstraction and refinement to learning.


So many options, so little time.

https://data-distiller.all-stuff-data.com/unit-8-data-distiller-statistics-and-machine-learning/statsml-301-a-concept-
course-on-language-models * * *

1. Unit 8: DATA DISTILLER STATISTICS & MACHINE LEARNING

STATSML 301: A Concept Course on Language Models


Learn the key ideas behind language models

As a Data Distiller user, understanding the power and potential of Large Language Models (LLMs) and deep
learning breakthroughs is essential because these technologies represent the future of AI. LLMs, powered by
advanced neural networks, have revolutionized the way AI models process vast amounts of data, enabling them to
generate human-like text, learn from complex patterns, and adapt across various domains. These capabilities are
reshaping industries—from marketing and customer engagement to product development and beyond.

Knowledge Representation Holds the Key to Artificial Intelligence

To address the complex problems encountered in artificial intelligence, a large amount of knowledge and effective
mechanisms for manipulating that knowledge are essential to create solutions for new challenges.

Human memory stores an immense amount of knowledge about the world and serves as the foundation for higher
forms of learning. Systems that cannot learn, in practical terms, only exhibit basic common sense (providing
straightforward answers to simple questions). While we haven’t yet developed a complete theory of human memory,
neural networks—such as the Hopfield network—offer a close analogy to how neural memory might function.

Psychological research highlights several distinctions in human memory. One key distinction is between short-term
memory (STM) and long-term memory (LTM). LTM is relatively permanent, while STM, or working memory,
holds perceptual information temporarily. In LTM, production rules—stored as knowledge—match themselves
against items in STM, firing to modify STM and repeating this process. LTM is further divided into episodic memory
and semantic memory. Episodic memory stores personal experiences from an autobiographical perspective, while
semantic memory holds facts like “birds fly,” which are not linked to personal experiences.

In the context of neural networks and AI, “memory” typically refers to a model’s ability to store and retrieve
information from past experiences or training data. This is achieved through various mechanisms, such as recurrent
neural networks (RNNs) or long short-term memory networks (LSTMs). These networks include memory cells
that can capture and retain information over sequences, allowing the model to learn from and apply knowledge across
time and data.

Facts & Representation Mappings

In AI, different methods are used to represent knowledge (facts) within a program.

Facts: These are the truths about the world that we want to capture, such as “birds fly.”

Representation: This is how we encode those facts in a way that the AI program can work with.

There are two key levels involved in representing knowledge:

1. Knowledge level: This is where we describe the actual facts and behaviors, including the goals of an agent.

2. Symbolic level: At this level, we take those facts and represent them with symbols that the AI can manipulate.

There are two types of mappings that happen in AI:

Forward mapping: This is where we map facts from the real world into the representation the AI uses.

Backward mapping: This goes in the opposite direction, mapping the representation back to the real-world
facts.

However, these mappings aren’t always perfect or one-to-one. A single fact might have several possible
representations, and multiple facts might share the same representation.

What an AI program does is manipulate these internal representations. The goal is for the AI to create new structures
from the information it has, which can also be interpreted as solutions to the problem it’s trying to solve. In other
words, the AI uses the facts it knows, manipulates them, and generates new facts or answers.

Sometimes, finding the right representation makes solving a problem much easier, even for humans. Think about how
changing the way you approach a problem can make it much simpler to solve. The same is true for AI—finding a good
representation can turn a complex problem into a trivial one.

If there isn’t a good way to represent a problem, no matter how advanced the AI program is, it won’t be able to come
up with the right solution. In some cases, it may not be possible to find a perfect representation, so we have to settle for
something less ideal.

In AI, we haven’t found a single system that works perfectly for every type of knowledge. As a result, multiple
methods of knowledge representation are used, each with its strengths and weaknesses depending on the situation.

Building a Knowledge Representation

A knowledge representation should answer the following questions:

1. How should sets of objects be represented?

2. Are there any attributes so basic that they occur in almost every problem domain?

3. Are there any important relationships that exist among attributes of objects?

4. Given a large amount of knowledge stored in a database, how can relevant parts be accessed when they are
needed?

5. At what level should knowledge be represented? Is there a good set of primitives into which all knowledge can
be broken down? Is it helpful to use such primitives?

6. At what level of detail should the world be represented? Another way this question is often phrased is, "What should be our primitives?" Should there be a small number of low-level ones, or should there be a larger number covering a range of granularities?

7. What knowledge structure should we choose so that we consume less resources?

Kinds of Knowledge Representation

Here are the various knowledge representations:

1. Simple Relational Knowledge: Represent declarative facts as a set of relations of the same sort used in database
systems. This representation is simple but provides very weak inferential capabilities. However, knowledge
represented in this form may serve as the input to more powerful inference engines. Providing support for
relational knowledge is what database systems are designed to do.
2. Inheritable Knowledge: One of the most useful forms of inference is property inheritance, in which elements of specific classes inherit attributes and values from the more general classes in which they are included. In order to support property inheritance, objects must be organized into classes and classes must be arranged in a generalization hierarchy.

3. Inferential Knowledge: The power of traditional logic and sometimes even more than that is necessary to
describe the inferences that are needed. There are many procedures, some of which reason forward on the
knowledge present in the system. One of the most useful procedures is resolution, which exploits a proof-by-
contradiction strategy.

4. Procedural Knowledge: The most commonly used technique for representing procedural knowledge in AI
programs is the use of production rules.

Knowledge in Large Language Models (LLMs) is typically represented in the form of pre-trained language models.
These models are trained on vast amounts of text data from the internet, which helps them capture a broad spectrum of
human knowledge. Here’s how knowledge is represented in LLMs:

1. Word Embeddings: LLMs represent words as dense vectors in high-dimensional spaces, where similar words
are closer in the vector space. These word embeddings capture semantic relationships between words, helping the
model understand word meanings.

2. Contextual Embeddings: LLMs go beyond simple word embeddings by considering the context in which words
appear. They generate contextual embeddings that change based on the surrounding words. This allows the model
to understand how word meanings shift depending on context.

3. Structured Knowledge: LLMs may include structured knowledge in their training data, such as facts, entities,
and relationships. This information can be used to answer factual questions and generate coherent responses.

4. Commonsense Knowledge: LLMs are trained on a diverse range of texts, enabling them to capture common
knowledge about the world. They can answer general knowledge questions and make predictions based on this
information.

5. Attention Mechanisms: LLMs employ attention mechanisms that highlight relevant parts of the input text when
generating responses. This helps them focus on the most informative parts of the text.

6. External Knowledge Sources: LLMs can access external knowledge sources, such as databases or knowledge
graphs, to retrieve information during inference. This allows them to provide up-to-date and accurate answers.

The good news: There is structure to human languages and patterns that thankfully can be learned.

What makes "language" unique is that a finite set of sounds can be combined in infinite ways. This set of sounds is called
phonemes. For example, in English, the words “cat” and “bat” differ by one phoneme (the sound ‘k’ vs. the sound ‘b’),
and changing this phoneme changes the meaning of the word.

Morphemes are composed of phonemes and are the building blocks of words. There are a variety of ways you can spot these in the English language:

1. cat and bat are examples of morphemes where the word is the morpheme itself.

2. The word telephone is composed of two morphemes: tele and phone.

3. The word unhappy is composed of two morphemes: un- and happy.

Each language in the world has a distinctive set of phonemes and rules for combining them into morphemes and
morphemes into words. Words will need to follow the rules of grammar to create any number of sentences.
In order to learn any language (you should try this with a foreign language), I need to learn the following patterns:

1. Phonemes: the basic sounds of the words themselves. I need this to be able to pronounce and understand spoken
language.

2. Morphemes: How phonemes combine to create morphemes

3. Vocabulary of Words: How morphemes combine to create words. In practice, most of the time we start with whole words and work backward to the phonemes and morphemes, because it is much faster that way.

4. Combining words to create sentences that have meaning: I need to know the rules of grammar. 'Meaning' has different levels of abstraction, i.e., the words you choose and the order in which you present them decide whether the result comes across as poetry or prose.

The pattern of learning a language is universal for all infants across the world. It is the exact same pattern that every
infant goes through until they start specializing in a specific language. Infants possess the remarkable ability to discern
subtle sound differences that signify distinctions in the languages spoken around them. In a very short time, they
undergo a rapid learning process that enables them to detect statistical patterns in the language they are exposed to.
This allows them to establish phonetic categories, recognize words within the continuous flow of speech, and grasp the
structural patterns of their native language, all before they reach the age of 10 months.

A parallel journey unfolds in speech production, with infants exhibiting universal speech patterns during their early
months, followed by increasing differentiation by around the age of 10 months. By the end of their first year, when
they begin uttering their initial words, the process of language acquisition transitions from universal speech perception
and production patterns to language-specific patterns. At the age of 10 months, if you expose the infant to a different
language, they can pick it up easily.

In the research around LLMs (Large Language Models), the key is to provide enough examples so that the model
can learn the structure of language, including word relationships, sentence syntax, and implicit meanings. When you
train on vast amounts of text data and use a truly large model with numerous neural layers and parameters, the model
begins to exhibit human-like abilities in generating sentences. However, this training process requires significant
computational resources and is costly due to the sheer volume of data and model complexity.

A Note on Emergent Abilities

One of the most fascinating aspects of learning is the idea that latent abilities can emerge from mastering simple tasks.
This phenomenon is similar to how we perceive certain “geniuses” as being exceptionally creative. Their unexpected
or unplanned abilities often arise from interactions of simpler patterns within their brain, which lead to remarkable
outcomes that leave us in awe.

When we examine their creative instincts, a few key characteristics stand out:

Complexity and Non-Linearity: Creativity often emerges from non-linear interactions, where small changes or
combinations at one level can lead to significant and sometimes unpredictable outcomes at a higher level.
Knowledge from one domain can suddenly resonate and apply to another in unexpected ways.

Self-Organization: Creative individuals often cannot fully explain their creativity. Their brains exhibit self-
organizing properties, where new patterns or behaviors spontaneously arise without direct external influence or
control.

Unpredictability: Many geniuses are known for their unpredictability. Their creative output is often inconsistent and hard to forecast, making it difficult to predict when or how their next breakthrough will occur.

These emergent abilities arise from the interaction of simpler processes, much like how creative genius in humans can spring from complex, self-organizing brain functions.

Understanding Large Language Models

Training large language models (LLMs) is like teaching a computer to understand and generate language, just like
humans do when they learn to speak or write. These models use vast amounts of text data to learn patterns in language
and can then generate responses, predict what comes next in a sentence, or even hold conversations. This chapter
breaks down how these models are trained, evaluated, and optimized, making complex concepts accessible while
diving deeper into the key steps.

What is Language Modeling?

The Basic Idea Language modeling involves predicting the next word in a sequence, similar to guessing the next
line in a song based on the lyrics you’ve heard so far. For example, if a sentence starts with “The sun rises in the
___,” the model is likely to predict “east” because it has seen many similar sentences during training.

How Do Language Models Work? LLMs analyze massive amounts of text data to learn patterns, like which
words often appear together or in certain sequences. They use these patterns to predict the most likely next word
in a sentence. It’s not just about memorizing; it’s about understanding the probabilities of word combinations. For
instance, “cat” is more likely to follow “The black” than “sky” would be.

The Model’s Brain: The Transformer

Why Transformers? Transformers revolutionized language modeling by introducing a way to look at relationships between all the words in a sentence simultaneously, rather than processing them one by one. Think of it as reading an entire paragraph at once to understand its meaning, rather than going word by word.

Key Components of the Transformer

Attention Mechanism: This mechanism allows the model to focus on the important words in a sentence,
similar to how we focus on key details when reading. For example, in the sentence “The cat, which was
small and fluffy, climbed the tree,” the attention mechanism helps the model focus more on “cat” and
“climbed the tree” than the details about the cat’s appearance.

Positional Encoding: Since word order matters in language (e.g., “John hit the ball” is different from “The
ball hit John”), the model needs to understand the positions of words. Positional encoding helps the model
recognize word order in sentences.

How Do We Teach the Model?

Step 1: Pre-training: Pre-training involves feeding the model a vast amount of text from books, articles, and
websites. It learns general language rules, just like how a person learns basic grammar by reading. The goal is for
the model to get good at predicting the next word across various contexts. During pre-training, the model might
see a sentence like “The dog ___” and learn to guess “barked” or “ran” based on the context in similar sentences
it has encountered.

Step 2: Fine-Tuning: Fine-tuning is like specialized training. After learning general language rules, the model is
further trained on specific tasks, such as answering questions or writing code. This is done using smaller, more
focused datasets that are relevant to the task. Fine-tuning helps the model adapt to particular types of content or
writing styles.

Breaking Down Text: Tokenization


What is Tokenization? Tokenization splits text into smaller pieces called tokens, which can be words, parts of
words, or even individual characters. For example, “reading” might be split into “read” and “ing” or just treated
as a single token.

Why Tokenization Matters: The model processes text more efficiently when it works with tokens. It also allows
the model to handle typos, slang, and compound words better. For instance, tokenizing “unhappiness” into “un,”
“happy,” and “ness” helps the model understand the components of the word.
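If you want to see tokenization in action, the Hugging Face transformers library (assuming it is installed and can download the GPT-2 vocabulary) exposes the tokenizer directly. Note that the actual sub-word splits come from the tokenizer's learned byte-pair-encoding vocabulary, so they may not match the intuitive "un / happy / ness" split described above.

```python
from transformers import AutoTokenizer

# Load the byte-pair-encoding tokenizer that GPT-2 was trained with
tokenizer = AutoTokenizer.from_pretrained("gpt2")

for text in ["reading", "unhappiness", "The sun rises in the east"]:
    tokens = tokenizer.tokenize(text)   # sub-word token strings
    ids = tokenizer.encode(text)        # the integer IDs the model actually sees
    print(f"{text!r} -> tokens={tokens} ids={ids}")
```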

How Do We Know It’s Working? Evaluation

Perplexity: Perplexity measures how well a model predicts a sample of text. Think of it as the number of different choices the model hesitates between when guessing the next word. Lower perplexity indicates the model is making more confident predictions (a small numeric sketch follows this list).

Human Preference Ratings: Human evaluators review the model’s output for tasks such as summarizing an
article or writing an essay. They rate the responses based on criteria like coherence, relevance, and accuracy. This
helps improve the model by giving feedback on what it did well and where it struggled.

Aggregated Benchmarks: LLMs are also tested against standardized benchmarks, which consist of various
language tasks. This helps compare the model’s performance against other models or previous versions.
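Here is the small numeric sketch of perplexity referenced above: it is just the exponential of the average negative log-probability that the model assigned to the words that actually occurred. The probabilities below are invented for illustration.

```python
import numpy as np

# Probabilities the model assigned to the word that actually came next,
# at each position in a hypothetical evaluation text
next_word_probs = np.array([0.60, 0.25, 0.90, 0.05, 0.40])

# Perplexity = exp(mean negative log-probability).
# A model that always assigned probability 1.0 would score 1 (no hesitation);
# uniform guessing over a 50,000-word vocabulary would score 50,000.
perplexity = np.exp(-np.mean(np.log(next_word_probs)))
print(round(float(perplexity), 2))   # lower is better
```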

Making the Model Better: Scaling Up

More Data, Better Results: The more data the model is trained on, the better it can learn language patterns. It’s
similar to how a person who reads a lot can improve their vocabulary and understanding. Larger models with
more data can generate more accurate and nuanced responses.

Optimizing Resources: Training LLMs requires significant computing power, often using thousands of GPUs
(graphics processing units). Techniques like mixed precision training (using smaller numbers to speed up
calculations) and parallel processing (splitting the training across many GPUs) help make training more efficient.

Challenges and Future Directions

Handling Mistakes and Hallucinations: LLMs can sometimes generate incorrect or made-up information,
known as “hallucinations.” Researchers are working on ways to reduce these errors by improving how models
are fine-tuned and evaluated.

Multimodal Language Models: The future of LLMs may involve combining text with other forms of data, like
images or audio, to create models that understand and generate content across different types of media.

Ethical Considerations: Issues like data privacy, bias in training data, and the ethical use of AI are significant
challenges. As models get smarter, addressing these concerns will be crucial to ensure responsible development.

Emergent Abilities in LLMs and Human Analogies

The key breakthrough in LLMs (Large Language Models) is that, as the models were scaled up to a large number of
parameters, they began to display emergent abilities—behaviors that weren’t observed at smaller scales. These
emergent properties become apparent only when the models surpass a certain size threshold. The combination of the
model’s architecture, extensive training data, and fine-tuning enables these abilities. The parallels to human cognitive
abilities are remarkable, and this scaling has unlocked behaviors that smaller models cannot exhibit.

Complex Neural Architecture: Large Language Models (LLMs) are often built using sophisticated neural
architectures like Transformers, which excel at learning complex patterns and relationships from vast amounts of data.
Analogy: Just as we each have our own creative strengths, we’re able to recognize patterns and make sense of
sequences and their interconnections. More importantly, we know how to focus on the most relevant details, which
enhances our ability to learn and create.

Large and Diverse Training Corpus: LLMs are trained on enormous and diverse datasets, pulling from a wide array
of sources, topics, writing styles, and languages. This variety helps the models develop a broad linguistic and factual
understanding. Analogy: Before mastering a field, we all have to immerse ourselves in the works that came before us.
It’s a rite of passage that prepares us to eventually become experts.

Unsupervised Learning: LLMs primarily rely on unsupervised learning, meaning they predict the next word in a
sentence without needing explicit labels. This method allows the models to discover complex structures in language on
their own. Analogy: Many creative geniuses are self-taught, and in the same way, you’ll find yourself teaching and
learning independently in whatever field you want to master.

Transfer Learning: LLMs use transfer learning, which allows them to apply knowledge learned in one domain to
other areas. Analogy: There are universal patterns in the world that enable us to apply skills across different domains,
just like how expertise in one area can translate to another.

Few-Shot and Zero-Shot Learning: LLMs, such as GPT-3, have the remarkable ability to perform tasks with few
examples (few-shot learning) or even without examples (zero-shot learning), leveraging their prior knowledge and
language understanding. Analogy: Have you ever heard of “improvisation”? It’s the ability to create something new on
the spot with minimal or no prior preparation, much like how LLMs tackle tasks with little guidance.

Last updated 4 months ago

https://data-distiller.all-stuff-data.com/unit-8-data-distiller-statistics-and-machine-learning/statsml-302-a-concept-
course-on-feature-engineering-techniques-for-machine-learning * * *

1. Unit 8: DATA DISTILLER STATISTICS & MACHINE LEARNING

STATSML 302: A Concept Course on Feature Engineering Techniques for Machine Learning
Transform raw data into predictive power with essential feature engineering techniques.

Feature engineering is a vital process that transforms raw data into insights that machine learning models can harness.
By selecting, modifying, and creating the right features, you can significantly enhance model performance. In this
guide, we’ll dive into essential feature engineering techniques, each illustrated with unique, real-world examples to
give you practical insights.

Imputation: Filling Missing Values

In real-world datasets, missing values are commonplace. Imputation replaces these gaps with estimates, ensuring
models can make full use of the available data.

Mean/Median Imputation: Numerical missing values are replaced with the feature’s mean or median. This
method is simple yet effective in preventing the model from focusing too heavily on gaps in data.

Mode Imputation: For categorical data, replacing missing values with the most frequent category (mode)
preserves data consistency.

Example: Imputation in Marketing Campaign Data


In a marketing dataset for predicting customer lifetime value (CLV), essential attributes like “monthly spend” or
“engagement score” may sometimes be missing, perhaps due to data gaps in recent purchases or website activity
tracking. By imputing missing “monthly spend” values with the median spend of similar customers, we avoid data loss
and ensure this feature’s predictive power is retained. This imputation helps the model accurately gauge customer
value, even when recent spend data is incomplete.

Alternative Imputation with K-Nearest Neighbors (KNN): For a more tailored imputation, KNN can be applied to
fill missing values based on customer similarity. For instance, to estimate a missing “engagement score” for a
customer, KNN identifies the most similar customers (based on features like demographics or purchase history) and
averages their engagement scores. This approach ensures that imputed values reflect patterns in customer behavior,
increasing the accuracy of customer segmentation and lifetime value predictions.
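A minimal sketch of both approaches with pandas and scikit-learn is shown below; the column names (monthly_spend, engagement_score, tenure_months) are hypothetical, and KNNImputer stands in for the customer-similarity idea described above.

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical CLV dataset with gaps in spend and engagement
df = pd.DataFrame({
    "monthly_spend":    [120.0, None, 80.0, 200.0, None],
    "engagement_score": [0.7, 0.4, None, 0.9, 0.5],
    "tenure_months":    [12, 3, 24, 36, 6],
})

# Simple median imputation for a single numeric column
df["monthly_spend_median_imputed"] = df["monthly_spend"].fillna(df["monthly_spend"].median())

# KNN imputation: fill gaps using the most similar rows across all numeric columns
cols = ["monthly_spend", "engagement_score", "tenure_months"]
imputer = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(imputer.fit_transform(df[cols]), columns=cols)
print(df_knn)
```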

Scaling: Aligning Feature Ranges

Data often includes features measured on different scales, which can mislead models. Scaling aligns features to ensure
that one feature’s magnitude doesn’t dominate another’s importance.

Normalization: Transforms feature values to a 0-1 range, a common choice for algorithms sensitive to input
scales, like neural networks.

Standardization: Centers data around a mean of 0 and variance of 1, useful for algorithms like linear regression,
which are sensitive to scale differences.

Example: Standardization in Marketing Spend Optimization

In a marketing campaign model predicting ROI, features like “advertising spend” (in thousands of dollars) and
“campaign duration” (in days) may vary widely in scale. Advertising spend is typically much larger numerically
compared to campaign duration, which could skew the model toward weighting spend more heavily than duration. By
standardizing these features to have a mean of 0 and variance of 1, both features contribute evenly to the model’s
predictions.

This standardization ensures that campaign ROI is influenced appropriately by both factors, allowing for more
balanced insights. For example, the model can better identify high-performing campaigns where a shorter duration
achieved a higher ROI, regardless of spend size.
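A quick sketch of both scalers using scikit-learn, with made-up campaign numbers; the column names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical campaign data with very different numeric ranges
campaigns = pd.DataFrame({
    "ad_spend_thousands": [5.0, 120.0, 45.0, 300.0],
    "duration_days":      [7, 30, 14, 90],
})

standardized = StandardScaler().fit_transform(campaigns)   # mean 0, variance 1 per column
normalized = MinMaxScaler().fit_transform(campaigns)       # each column rescaled to [0, 1]

print(pd.DataFrame(standardized, columns=campaigns.columns))
print(pd.DataFrame(normalized, columns=campaigns.columns))
```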

Encoding Categorical Variables: From Words to Numbers

Machine learning models work with numbers, so categorical features need transformation. Various encoding methods
can represent these categories effectively.

One-Hot Encoding: Creates binary columns for each category in a feature, marking 1 for presence and 0 for
absence.

Label Encoding: Assigns each category a unique numerical label, though this approach can introduce unintended
ordinal relationships.

Target Encoding (Mean Encoding): Replaces each category with the mean of the target variable within that
category, useful for models where categories have a strong predictive relationship with the target.

Example: Target Encoding for Predicting Customer Response Rates

In a marketing dataset used to predict customer response rates to email campaigns, product categories like “beauty,”
“electronics,” and “fitness” often influence customer engagement. Rather than expanding the dataset with one-hot
encoding, which would add a new column for each category, target encoding assigns each category a numeric value
based on average past response rates.

For instance, if customers engaging with “fitness” products historically show a 20% email response rate, “fitness” can
be target-encoded with the value 0.2. This encoding approach allows the model to capture each product category’s
unique influence on engagement, without creating multiple columns. By understanding that customers in the “fitness”
category are more responsive, the model can better predict overall campaign success and guide targeted marketing
strategies.
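The sketch below shows target encoding with plain pandas on an invented response-history table; the category names and response flags are purely illustrative.

```python
import pandas as pd

# Hypothetical email-campaign history: product category and whether the customer responded
df = pd.DataFrame({
    "category":  ["fitness", "beauty", "fitness", "electronics", "beauty", "fitness"],
    "responded": [1, 0, 1, 0, 1, 0],
})

# Target encoding: replace each category with its average historical response rate
category_means = df.groupby("category")["responded"].mean()
df["category_encoded"] = df["category"].map(category_means)
print(df)

# In practice, compute the means on the training split only (or use cross-fold
# encoding) so the target does not leak into the features.
```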

Polynomial Features: Creating Interaction Terms

Polynomial features create new features by combining existing ones, which can capture relationships that simple
features might miss.

Example: Interaction Terms for Predicting Customer Conversion

In a model predicting customer conversion likelihood, features like “ad frequency” (number of times an ad is shown)
and “time spent on site” may not individually correlate linearly with conversions. However, their interaction—such as
“ad frequency * time spent on site”—could reveal compounded effects, where high ad exposure combined with longer
site engagement significantly boosts conversion probability.

For example, multiplying “ad frequency” by “time spent on site” captures cases where users exposed to ads more often
and who spend longer on the site are more likely to convert. Similarly, squaring “time spent on site” could reveal that
engagement time alone has an exponential impact on conversion likelihood. By adding these interaction terms, the
model gains a deeper understanding of how multiple behaviors drive conversions, helping marketers optimize
campaign strategies.
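The sketch below builds these interaction and squared terms both by hand and with scikit-learn's PolynomialFeatures (the get_feature_names_out call assumes scikit-learn 1.x). The columns and values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical conversion dataset
df = pd.DataFrame({
    "ad_frequency":      [2, 10, 5, 8],
    "time_on_site_mins": [1.5, 12.0, 4.0, 9.5],
})

# Manual interaction and squared terms
df["freq_x_time"] = df["ad_frequency"] * df["time_on_site_mins"]
df["time_squared"] = df["time_on_site_mins"] ** 2

# Or generate all degree-2 combinations automatically
poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(df[["ad_frequency", "time_on_site_mins"]])
print(poly.get_feature_names_out())   # includes the interaction and squared terms
```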

Feature Binning: Grouping Continuous Values

Binning groups continuous variables into discrete bins, creating categories that simplify patterns and reduce noise.

Example: Binning Monthly Usage Data for Subscription Renewal Prediction

To predict subscription renewal, monthly usage data can be simplified by binning it into “Low,” “Medium,” and
“High” categories. This categorization reduces variability and highlights distinct usage patterns, making it easier for
the model to detect trends in renewal likelihood based on customer engagement levels.
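A small sketch with pandas, using invented usage numbers: pd.cut applies fixed business thresholds, while pd.qcut creates quantile-based buckets with roughly equal counts.

```python
import pandas as pd

usage_hours = pd.Series([1.2, 3.5, 8.0, 15.4, 22.1, 40.0], name="monthly_usage_hours")

# Fixed-width bins with explicit business thresholds
fixed_bins = pd.cut(usage_hours, bins=[0, 5, 20, float("inf")],
                    labels=["Low", "Medium", "High"])

# Quantile bins: roughly equal numbers of customers per bucket
quantile_bins = pd.qcut(usage_hours, q=3, labels=["Low", "Medium", "High"])

print(pd.DataFrame({"usage": usage_hours, "fixed": fixed_bins, "quantile": quantile_bins}))
```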

Feature Extraction with Domain Knowledge

Feature extraction is the process of creating new variables by combining existing data, often leveraging domain
expertise.

Example: Feature Extraction for Predicting Customer Upsell Potential

In a marketing model aimed at predicting a customer’s likelihood to purchase an upsell product, transaction data might
include details like “last purchase amount,” “category of product,” and “location.” By creating additional features—
such as “time since last purchase” or “average spend per visit”—the model can capture valuable patterns in buying
behavior that indicate upsell potential.

For example, a customer who frequently makes large purchases in quick succession might be more open to upsell
offers, while a customer with longer gaps between purchases might require different marketing strategies. Similarly,
extracting features like “total spend in the last 30 days” can help identify high-engagement customers who are prime
targets for upsell opportunities. This enhanced dataset enables the model to detect nuanced behavior patterns and refine
upsell targeting for greater campaign effectiveness.
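A minimal sketch of deriving such features with pandas; the transaction table, the as-of date, and the 30-day window are all hypothetical.

```python
import pandas as pd

tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount":      [120.0, 60.0, 200.0, 35.0, 40.0],
    "ts": pd.to_datetime(["2024-05-01", "2024-05-20", "2024-06-02",
                          "2024-04-10", "2024-06-05"]),
})
as_of = pd.Timestamp("2024-06-10")

# Per-customer behavioral features derived from raw transactions
features = tx.groupby("customer_id").agg(
    avg_spend_per_visit=("amount", "mean"),
    last_purchase=("ts", "max"),
)
features["days_since_last_purchase"] = (as_of - features["last_purchase"]).dt.days

# Total spend in the trailing 30 days relative to the as-of date
recent = tx[tx["ts"] >= as_of - pd.Timedelta(days=30)]
features["spend_last_30d"] = recent.groupby("customer_id")["amount"].sum()
features["spend_last_30d"] = features["spend_last_30d"].fillna(0.0)
print(features)
```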

Handling Outliers: Managing Data Extremes

Outliers can distort predictions, especially in linear models. Handling them carefully is essential to prevent models
from overfitting to extremes.

Capping/Flooring: Set outliers above or below certain thresholds to specific boundary values, preserving the
data range while removing extreme values.

Winsorization: Similar to capping but more controlled, it adjusts extreme values to a specified percentile range.

Example: Capping Outliers for Customer Lifetime Value Prediction

In a marketing model predicting Customer Lifetime Value (CLV), certain customers might have exceptionally high
purchase values due to one-time events like seasonal promotions or bulk orders. To prevent these outliers from
skewing the model, spending values above the 95th percentile can be capped, ensuring the model isn’t misled into
overestimating typical customer value.

For instance, a handful of customers may make abnormally large purchases around holiday sales, but this spending
doesn’t reflect their usual behavior. By capping these extreme values, the model can focus on predicting realistic
lifetime value, providing a more accurate basis for customer segmentation and personalized marketing strategies.
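A quick sketch of percentile capping with pandas; the spend values (including the extreme one) are invented.

```python
import pandas as pd

spend = pd.Series([40, 55, 60, 75, 80, 95, 110, 2500.0], name="purchase_value")

# Cap (winsorize) everything above the 95th percentile at that percentile
upper = spend.quantile(0.95)
spend_capped = spend.clip(upper=upper)

print("95th percentile cap:", round(upper, 1))
print(spend_capped.tolist())
```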

Log Transformation: Reducing Data Skewness

Log transformation helps manage skewness in features where a few values are disproportionately large, enabling
models to better capture underlying patterns.

Example: Log Transformation for Predicting Customer Spend Behavior

In a marketing model predicting customer spend behavior, transaction amounts often show a right-skewed distribution,
with a small number of high-value purchases. Applying a log transformation to these values reduces skewness,
enabling the model to better distinguish between typical spending patterns and higher purchase brackets.

For example, while most customers make moderate purchases, a few make significant, infrequent purchases that can
heavily skew the data. The log transformation smooths these variations, allowing the model to identify spending trends
more accurately across different customer segments and create more effective targeting strategies based on spend
behavior.
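A minimal sketch of this transformation, assuming a hypothetical customer_transactions table with a purchase_amount column; ln(1 + x) is used so that zero amounts remain valid:

-- Sketch: log-transform a right-skewed transaction amount
-- (table and column names are hypothetical)
SELECT
    customer_id,
    purchase_amount,
    ln(1 + purchase_amount) AS log_purchase_amount
FROM customer_transactions;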

Dimensionality Reduction: Principal Component Analysis (PCA)

Dimensionality reduction reduces the number of features, focusing on those with the highest impact. PCA transforms
features into a smaller set of principal components, retaining variance while simplifying the dataset.

Example: Principal Component Analysis for Customer Segmentation

In a marketing model designed for customer segmentation, datasets often contain a wide array of features, like
frequency of purchases, average transaction value, preferred product categories, and engagement metrics. Using
Principal Component Analysis (PCA), these numerous purchasing behaviors can be condensed into a few principal
components that capture the core patterns in customer behavior.
For example, instead of analyzing dozens of detailed metrics, PCA reduces them to a handful of components—like
“spending intensity” or “category diversity”—which represent essential spending and engagement patterns. This
simplification allows the model to segment customers effectively, grouping them based on meaningful behavioral
patterns rather than a cluttered set of individual features.

Time-Based Feature Engineering: Adding Temporal Insight

Time-series data often benefits from time-based features that reveal seasonality, trends, or changes over time.

Example: Leveraging Temporal Features for Customer Purchase Prediction

In a marketing model predicting customer purchase behavior, temporal features like “day of the week” and “days since
last interaction” provide valuable insights into buying patterns. By aggregating transaction data by week, month, or
quarter, the model captures recurring trends, such as customers making purchases more frequently on weekends or
during certain seasonal periods.

For instance, if data shows that customers are more likely to buy during the first week of the month, this pattern can
guide when to launch campaigns or promotions. Aggregating and analyzing these time-based behaviors helps the
model better predict future purchases, enabling marketers to optimize campaign timing for maximum impact.
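A minimal sketch of such temporal features, assuming a hypothetical customer_transactions table with transaction_date and last_interaction_date columns:

-- Sketch: time-based features for purchase prediction
-- (table and column names are hypothetical)
SELECT
    customer_id,
    transaction_date,
    dayofweek(transaction_date) AS day_of_week, -- 1 = Sunday ... 7 = Saturday
    month(transaction_date) AS purchase_month,
    datediff(transaction_date, last_interaction_date) AS days_since_last_interaction
FROM customer_transactions;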

Target Transformation: Adjusting the Outcome Variable

Transforming the target variable can also improve predictions, especially when it has skewed distributions.

Log Transformation: Commonly applied to positive-skewed targets, enabling models to detect patterns across a
wide range of values.

Box-Cox Transformation: A more flexible approach that handles various types of skew, useful when log
transformation isn’t sufficient.

Example: Log Transformation for Predicting Campaign ROI

In a marketing model aimed at predicting campaign ROI, revenue data can be highly skewed, with a few high-
performing campaigns generating outsized returns. Applying a log transformation to the revenue feature ensures the
model focuses on relative differences between campaigns, rather than being overwhelmed by large revenue values
from a few outliers.

For example, instead of letting a few big campaigns dominate the dataset, the log transformation smooths the revenue
distribution. This allows the model to capture meaningful trends across all campaigns, making it easier to identify
which marketing strategies are consistently effective, regardless of their scale.
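A minimal sketch of transforming the target, assuming a hypothetical campaign_results table with a campaign_revenue column:

-- Sketch: log-transform a skewed revenue target before model training
-- (table and column names are hypothetical)
SELECT
    campaign_id,
    campaign_revenue,
    ln(1 + campaign_revenue) AS log_campaign_revenue
FROM campaign_results;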

Feature Selection: Choosing the Most Relevant Features

Selecting relevant features and discarding others prevents model overfitting and simplifies interpretations.

Filter Methods: Select features based on statistical properties like correlation with the target variable.

Wrapper Methods: Evaluate subsets of features based on model performance.

Embedded Methods: Automatically select features during the training process (e.g., Lasso regression).

Example: Feature Selection for Customer Churn Prediction


In a marketing model predicting customer churn, relevant features like “subscription length,” “monthly engagement,”
and “number of support interactions” are selected based on their correlation with churn likelihood. Irrelevant features,
such as “customer ID” or “signup source,” are dropped to reduce noise, allowing the model to focus on the data that
directly impacts churn prediction.

By retaining only the most impactful features, the model improves its ability to accurately identify high-risk
customers. This streamlined approach enables marketers to proactively address potential churn by tailoring retention
strategies to customers who display patterns linked to leaving.
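A minimal sketch of a filter method, assuming a hypothetical customer_churn_data table where churned is a 0/1 numeric flag; features whose correlation with the label is near zero are candidates for removal.

-- Sketch: rank candidate features by correlation with the churn label
-- (table and column names are hypothetical; churned is assumed to be 0/1)
SELECT
    corr(subscription_length, churned) AS corr_subscription_length,
    corr(monthly_engagement, churned) AS corr_monthly_engagement,
    corr(support_interactions, churned) AS corr_support_interactions
FROM customer_churn_data;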

Feature engineering is a powerful tool in machine learning, transforming raw data into meaningful predictors. By
using techniques like imputation, scaling, encoding, outlier handling, and transformation, data scientists can craft
features that bring out the best in models. Each dataset and problem is unique, but applying these techniques
thoughtfully can lead to models that offer accurate, reliable insights.

Last updated 4 months ago

https://data-distiller.all-stuff-data.com/unit-8-data-distiller-statistics-and-machine-learning/statsml-400-data-distiller-
basic-statistics-functions * * *

You need to ingest the CSV file below using the following tutorial:

To demonstrate the use of statistical functions in a marketing domain dataset, let’s generate a dataset representing
customer transactions and campaign performance. The dataset includes information about customer purchases,
campaign engagement, and customer demographics. Then, I’ll provide SQL examples for each statistical function
along with a suitable use case.

When you ingest this dataset, make sure you name it as **marketing_campaign_data**

The dataset is related to a marketing campaign and contains the following columns:

1. customer_id: A unique identifier for each customer.

2. campaign_id: The identifier for the marketing campaign the customer participated in.

3. purchase_amount: The amount of money the customer spent during the campaign.

4. engagement_score: A score indicating the level of customer engagement in the campaign.

5. age: The age of the customer.

6. clv (Customer Lifetime Value): An estimated value of the customer’s future spending.

“Average” typically refers to the mean, which is the sum of all values in a dataset divided by the number of values. It
represents a measure of central tendency, indicating the central point of a dataset. The mean provides a summary of the
data by finding a single value that represents the overall level of all observations.

To calculate the mean, you add up all the data points and divide by the total number of points. For example, if you
have a dataset of five numbers: 4, 8, 6, 5, and 7, the mean would be (4+8+6+5+7)/5=6.

The mean is useful for understanding the overall trend of numerical data, but it can be sensitive to outliers, which are
values significantly higher or lower than the others. Unlike the median (middle value) or the mode (most frequent
value), the mean takes into account all data points when summarizing the dataset.

Let us calculate the average purchase amount to assess the overall customer spend.
SELECT avg(purchase_amount) AS avg_purchase_amount
FROM marketing_campaign_data;

Let us calculate the total customer lifetime value (CLV) for all customers engaged in a specific campaign.

SELECT campaign_id, sum(clv) AS total_clv


FROM marketing_campaign_data
GROUP BY campaign_id;

Let us identify the minimum and maximum customer engagement scores for a campaign to gauge campaign
effectiveness.

SELECT campaign_id, min(engagement_score) AS min_engagement,
       max(engagement_score) AS max_engagement
FROM marketing_campaign_data
GROUP BY campaign_id;

Standard Deviation (**stddev/stddev_pop/stddev_samp**)

In statistics, “standard deviation” is a measure of the dispersion or spread of a set of values around the mean (average).
It indicates how much individual data points deviate, on average, from the mean. A low standard deviation means that
the data points tend to be close to the mean, while a high standard deviation indicates that the data points are spread
out over a wider range.

Standard deviation is calculated as the square root of the variance, which is the average of the squared deviations from
the mean. It is commonly used in various fields to assess the variability or consistency of data. Unlike the mean
(central value) and the median (middle value), the standard deviation focuses on the extent of variation or dispersion in
the dataset.

Note that the **stddev** function is an alias for **stddev_samp**. It calculates the sample standard
deviation, using **N−1** as the divisor (where **N** is the total number of data points). This adjustment is known
as Bessel’s correction, and it accounts for the bias in estimating the population standard deviation from a sample.
**stddev_pop** computes the population standard deviation. It uses **N** as the divisor, treating the data as the
entire population.

The **stddev/stddev_samp** is computed as

\text{stddev\_samp} = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}

Let us measure the variability in customer age to assess the diversity of your customer base:

SELECT stddev(age) AS age_stddev


FROM marketing_campaign_data;

The sample standard deviation computed above is very useful when you need to construct a confidence interval for the mean of a dataset. To build the interval, combine it with the following statistical elements:

1. Sample Mean: The average of the sample data.

\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i

2. Standard Error of the Mean (SE): Calculated as

SE = \frac{\text{stddev\_samp}}{\sqrt{N}}

3. Critical Value (z-score or t-score): Depends on the desired confidence level (e.g., 1.96 for 95% confidence if using the normal distribution).

The confidence interval is then calculated as:

\bar{x} \pm (\text{Critical Value} \times SE)

This gives the range in which the true population mean is likely to fall with the specified level of confidence.
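As a sketch, a 95% confidence interval for the mean purchase amount in the marketing_campaign_data dataset can be computed directly in SQL (1.96 is the z critical value for 95% confidence):

-- Sketch: 95% confidence interval for the mean purchase amount
SELECT
    avg(purchase_amount) AS mean_purchase,
    avg(purchase_amount) - 1.96 * stddev_samp(purchase_amount) / sqrt(count(purchase_amount)) AS ci_lower_95,
    avg(purchase_amount) + 1.96 * stddev_samp(purchase_amount) / sqrt(count(purchase_amount)) AS ci_upper_95
FROM marketing_campaign_data;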

Population Standard Deviation (**stddev_pop**)

Whether a dataset is considered a population or a sample depends on the context and the scope of the analysis. When
the dataset includes all possible observations relevant to the study, it is considered a population. For example, if you
have the entire customer base of a company and want to analyze their spending habits, that dataset would be treated as
the population. In this case, the population standard deviation (**stddev_pop**) is used because the data
represents the entire group, and no adjustments are necessary.

On the other hand, a dataset is considered a sample when it is a subset of the population, meant to represent a larger
group. For instance, if you survey 1,000 customers out of a larger group of 100,000 to understand general customer
preferences, this dataset would be considered a sample. In such cases, the sample standard deviation (stddev_samp)
is used because an adjustment is needed to account for the fact that the data is only a subset. This adjustment, known
as Bessel’s correction, compensates for potential bias when estimating the population characteristics from the sample.

The distinction between population and sample also depends on the context in which the data was collected. If the data
was gathered through a survey, experiment, or sampling process, it is generally treated as a sample. Additionally, if the
goal of the analysis is to make inferences about a larger group beyond the dataset itself, it should be considered a
sample. Even if the dataset is large, it may still be a sample if it does not cover all possible observations. Conversely,
small datasets can be populations if they include every relevant case. In practice, data is most often treated as a sample,
as it is rare to have data for the entire population.


The formula is:

\text{stddev\_pop} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}

SELECT stddev_pop(age) AS age_stddev_pop


FROM marketing_campaign_data;

For age, the population and sample standard deviations are nearly identical because the dataset has enough data points, but you can see small differences in the least significant digits.

Variance (**variance/var_pop/var_samp**)

The same principles apply to variance as they do for standard deviation, since variance is essentially the square of the standard deviation. **variance** is the same as **var_samp**. The formulas and assumptions remain the same as previously explained. In most cases, you will be using **variance** (or **var_samp**).

Our use case will be to determine the variance in customer engagement scores to see how consistently customers
interact with campaigns.

SELECT variance(engagement_score) AS engagement_variance


FROM marketing_campaign_data;
“Median” refers to the middle value of a dataset when the numbers are arranged in ascending or descending order. It
represents the point at which half of the data falls below and half falls above. If the dataset has an odd number of
observations, the median is the middle value. If the dataset has an even number of observations, the median is the
average of the two middle values.

The median is particularly useful for numerical data, especially when the data is skewed or contains outliers, as it is
less affected by extreme values than the mean (average). In contrast to the mode (most frequent value) and the mean,
the median provides a measure of central tendency that indicates the midpoint of the dataset.

Let us calculate the median purchase amount to understand the central spending tendency of customers.

SELECT percentile_approx(purchase_amount, 0.50) AS median_purchase_amount


FROM marketing_campaign_data;

“Mod” typically refers to the mode, which is the value that appears most frequently in a dataset. It represents the data
point or category that occurs with the highest frequency. The mode can be used for both numerical and categorical
data. For example, in a dataset of people’s favorite ice cream flavors, the mode would be the flavor that the largest
number of people prefer. In a numerical dataset, it would be the number that appears most often.

In contrast to the mean (average) and median (middle value), the mode focuses on the most common value in the
dataset.

Note that the SQL **mod()** function computes the modulo (the remainder after division), which is different from the statistical mode described above. Here we use it to distribute customers evenly into 3 marketing groups for campaign analysis.

-- Assign customers to 3 random marketing groups based on their customer_id


SELECT
customer_id,
campaign_id,
mod(customer_id, 3) AS marketing_group
FROM marketing_campaign_data;

Correlation measures the strength and direction of a linear relationship between two variables. It is expressed as a
correlation coefficient, denoted by r, which ranges from -1 to 1. A correlation coefficient close to 1 indicates a strong
positive linear relationship, meaning that as one variable increases, the other tends to increase as well. Conversely, a
correlation coefficient close to -1 suggests a strong negative linear relationship, where an increase in one variable
corresponds to a decrease in the other.

When the correlation coefficient is close to 0, it indicates little to no linear relationship between the variables; changes
in one variable do not reliably predict changes in the other. Correlation does not imply causation, meaning that even if
two variables are correlated, it does not necessarily mean that one variable causes the other to change. Correlation is
useful for identifying relationships in data, and it is commonly used in fields like finance, psychology, and social
sciences to uncover patterns and make predictions based on observed trends.

Let us check if there is a correlation between customer age and their engagement score with campaigns.

SELECT corr(age, engagement_score) AS age_engagement_correlation


FROM marketing_campaign_data;

Here is how to interpret the correlation coefficient (r):

Perfect Negative Correlation (r = -1): There is a perfect inverse linear relationship; as one variable increases, the other decreases.

Strong Negative Correlation: The variables have a strong inverse relationship, with one tending to decrease as the other increases.

Moderate Negative Correlation: There is a moderate inverse relationship, with some predictability in the variables’ opposite movements.

Weak Negative Correlation: The relationship is weak, with a slight tendency for the variables to move in opposite directions.

No Correlation (r ≈ 0): There is no linear relationship; changes in one variable do not predict changes in the other.

Weak Positive Correlation: There is a weak tendency for both variables to increase together, but the relationship is not strong.

Moderate Positive Correlation: There is a moderate positive relationship, with some predictability in the variables’ simultaneous increases.

Strong Positive Correlation: The variables have a strong tendency to increase together in a predictable manner.

Perfect Positive Correlation (r = +1): There is a perfect direct linear relationship; as one variable increases, the other increases as well.

For our use case, the result of approximately **r = 0.0067** falls into the “No Correlation” category, indicating that there is essentially no linear relationship between age and engagement score in our dataset.

Correlation is more commonly used than covariance because it standardizes the relationship between variables,
making comparisons easier. However, covariance is a key component in the calculation of correlation and provides
valuable directional insights into how two variables move together.

Covariance (**covar_pop/covar_samp**)

Covariance measures the degree to which two variables change together. It indicates the direction of the linear
relationship between the variables. If the covariance is positive, it means that as one variable increases, the other tends
to increase as well, indicating a positive relationship. Conversely, a negative covariance suggests that as one variable
increases, the other tends to decrease, indicating an inverse relationship.

The magnitude of the covariance value indicates the strength of the relationship; however, unlike correlation, it does
not provide a standardized measure. This means that the actual value of covariance can be difficult to interpret because
it depends on the scale of the variables. Covariance is used in various fields, such as finance, where it helps in
understanding how different assets move together, which is useful for portfolio diversification. While it indicates the
direction of a relationship, it does not measure the strength or causality between the variables.

Covariance becomes a correlation when it is standardized. The correlation coefficient is essentially a scaled version of
covariance, which adjusts for the variability (standard deviation) of each variable, making it a unitless measure. This
allows for a direct comparison of relationships regardless of the original scales of the variables.

r = \frac{\text{cov}(X, Y)}{\sigma_X \sigma_Y}


By dividing the covariance by the product of the standard deviations of **X** and **Y** variables, you normalize
the value, bringing it into the range of -1 to 1.

Let us compute the covariance between purchase amount and engagement score to see if higher engagement leads to
higher spending.

Just like the way we used functions for the population and sample, the formulas are the following:

Population Covariance

\text{covar\_pop} = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu_X)(Y_i - \mu_Y)

where the population means of X and Y are subtracted from each value

Sample Covariance

\text{covar\_samp} = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})
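For the earlier use case, the sample covariance between purchase amount and engagement score can be computed as follows:

SELECT covar_samp(purchase_amount, engagement_score) AS purchase_engagement_covariance
FROM marketing_campaign_data;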

Let us calculate the population covariance between customer age and lifetime value (CLV) to understand if older
customers tend to have higher value.

SELECT covar_pop(age, clv) AS age_clv_covariance


FROM marketing_campaign_data;

Positive Covariance (> 0)

As one variable increases, the other tends to increase as well. Similarly, as one decreases, the other tends to decrease.

Negative Covariance (< 0)

As one variable increases, the other tends to decrease, indicating an inverse relationship.

Zero or Near-Zero Covariance

There is no consistent pattern of changes between the two variables. Changes in one do not predict changes in the
other.

Remember:

The magnitude of covariance is influenced by the units of the variables, so the absolute value is not directly
indicative of the strength of the relationship.

Unlike correlation, covariance is not standardized, meaning it is not constrained within a fixed range (such as -1
to 1), making direct comparisons across datasets less meaningful without normalization.

Skewness measures the asymmetry of a dataset’s distribution. It indicates whether the data points are spread more
towards one side of the mean, resulting in a non-symmetric shape. Skewness can be positive, negative, or zero,
depending on the direction of the asymmetry:

1. Positive Skewness (Right-Skewed): When skewness is greater than zero, the distribution has a long tail on the
right side. This means that there are more values concentrated on the left, with a few larger values stretching the
distribution to the right.
2. Negative Skewness (Left-Skewed): When skewness is less than zero, the distribution has a long tail on the left
side. In this case, more values are concentrated on the right, with a few smaller values stretching the distribution
to the left.

3. Zero Skewness (Symmetrical Distribution): When skewness is approximately zero, the distribution is
symmetric, with data points evenly distributed on both sides of the mean. A perfectly symmetric distribution,
such as a normal distribution, has zero skewness.

Skewness helps to identify the extent and direction of deviation from a normal distribution, and it is useful for
understanding the nature of the data, particularly in fields like finance, economics, and quality control.

Let us determine if the distribution of purchase amounts is skewed towards lower or higher values.

SELECT skewness(purchase_amount) AS skewness_purchase


FROM marketing_campaign_data;

Interpretation for purchase_amount:

Right-Skewed Distribution (skewness > 0): The distribution has a long tail on the right, indicating that most customers make smaller purchases, while a few make significantly larger purchases.

Left-Skewed Distribution (skewness < 0): The distribution has a long tail on the left, suggesting that most customers make larger purchases, with a few making much smaller purchases.

Symmetric Distribution (skewness ≈ 0): The distribution is symmetric, with purchases evenly spread around the mean, suggesting a balanced number of smaller and larger purchases.

The result of the skewness calculation for purchase_amount is approximately -0.00015. This value is very close to
zero, which indicates that the distribution of purchase_amount is nearly symmetric.

Kurtosis measures the “tailedness” or the sharpness of the peak of a dataset’s distribution. It indicates how much of the data is concentrated in the tails and the peak compared to a normal distribution. Kurtosis helps to understand the distribution’s shape, particularly the presence of outliers.

Mesokurtic: The distribution resembles a normal distribution, with a moderate level of peak height and tail weight. In other words, the distribution does not have an unusually high or low number of data points far from the mean.

Leptokurtic: The distribution has a sharper peak and heavier tails than a normal distribution, indicating more frequent extreme values (outliers).

Platykurtic: The distribution has a flatter peak and lighter tails than a normal distribution, suggesting fewer outliers and a broader spread of data points.

Let us assess the “peakedness” of customer engagement scores to understand if most scores are concentrated around
the mean.

SELECT kurtosis(engagement_score) AS kurtosis_engagement


FROM marketing_campaign_data;

The result of the kurtosis calculation for engagement_score is approximately -1.1989. This negative value indicates that the distribution is platykurtic: the engagement_score distribution has fewer extreme values (outliers) than a normal distribution, and the data points are more spread out across the range, with less concentration around the peak.

In a normal distribution, the data is symmetrically spread, with most values clustering around the mean, and the
frequency of extreme values decreases as you move away from the mean. When a distribution has no significant
excess in outliers, it means that the occurrence of data points far from the center is what you would expect based on a
normal distribution, with no additional concentration of extreme values in the tails.

Let us count the number of customers engaged in each marketing campaign to understand campaign reach.

SELECT campaign_id, count(customer_id) AS customer_count


FROM marketing_campaign_data
GROUP BY campaign_id;

Let us count how many customers have spent more than $200 in each campaign to identify high spenders.

SELECT campaign_id, count_if(purchase_amount > 200) AS high_spenders_count


FROM marketing_campaign_data
GROUP BY campaign_id;

Approximate Count Distinct (**approx_count_distinct**)

The Approximate Count Distinct (**approx_count_distinct**) function offers significant advantages over
the traditional Count Distinct (**count distinct**) function, especially when working with large datasets. It
employs algorithms like HyperLogLog to estimate the number of distinct values, providing a high degree of accuracy
while being much faster and more efficient than **count distinct**. This speed is achieved because
**approx_count_distinct** does not need to store and sort all unique values, making it particularly useful in
big data environments where datasets may be too large to fit into memory. Additionally, the function consumes less
memory by using probabilistic methods, enabling distinct counting on massive datasets without overwhelming system
resources. As a result, **approx_count_distinct** scales well with increasing data size, making it an ideal
choice for distributed computing platforms where performance and scalability are critical.

Let us estimate the number of unique customers engaged with a specific marketing campaign.

SELECT campaign_id, approx_count_distinct(customer_id) AS unique_customer_count


FROM marketing_campaign_data
GROUP BY campaign_id;

Generate Random Number from a Uniform Distribution **(rand/random)**

The **rand()** function is a mathematical function used to generate a random floating-point number between 0
(inclusive) and 1 (exclusive) from a uniform distribution. Each time **rand()** is called, it produces a different
pseudo-random number, simulating randomness. However, because it is based on an algorithm rather than true
randomness, the sequence of numbers generated is actually deterministic if the initial starting point (seed) is known.

Suppose you want to randomly assign customers to different marketing test groups for A/B testing.

-- Assign customers to random groups for A/B testing
-- (rand() is evaluated once in the subquery so the displayed random_value
-- is the same value used to assign the test group)
SELECT
    customer_id,
    campaign_id,
    purchase_amount,
    random_value,
    CASE
        WHEN random_value < 0.5 THEN 'Group A'
        ELSE 'Group B'
    END AS test_group
FROM (
    SELECT customer_id, campaign_id, purchase_amount, rand() AS random_value
    FROM marketing_campaign_data
);

In this example, customers are assigned randomly to Group A or Group B based on the random value generated by
rand().

If you want to use a seed for predictability, try this:

-- Assign customers to random groups for A/B testing with a seed for reproducibility
SELECT
    customer_id,
    campaign_id,
    purchase_amount,
    rand(12345) AS random_value, -- Using a seed value of 12345
    CASE
        WHEN rand(12345) < 0.5 THEN 'Group A'
        ELSE 'Group B'
    END AS test_group
FROM marketing_campaign_data;

**random()** is the same as **rand()**. Both functions generate random numbers uniformly distributed
between 0 (inclusive) and 1 (exclusive). They are interchangeable and serve the same purpose for creating random
values in this range.

Generate Random Number from a Normal/Gaussian Distribution (**randn)**

The **randn()** function generates random numbers following a normal (Gaussian) distribution, with a mean of 0
and a standard deviation of 1. Unlike **rand()**, the values produced by **randn()** are not limited to a
specific range and can be any real number, though most values will fall within three standard deviations of the mean.
This function is particularly useful for modeling data that follows a bell-curve shape, where most observations cluster
around the central value, such as natural phenomena, measurement errors, or financial returns.

Let us simulate customer engagement scores or create noise in the data to make the model more robust for training.

-- Simulate random variations in engagement scores (normal distribution noise)


SELECT
customer_id,
campaign_id,
engagement_score,
engagement_score + randn() * 5 AS engagement_score_with_noise
FROM marketing_campaign_data;

In this case, the randn() function adds normally distributed noise to the customer engagement scores, simulating
potential fluctuations in real-world data.

If you want to use a seed for predictability:

-- Simulate random variations in engagement scores (normal distribution noise)
-- with a seed for reproducibility
SELECT
    customer_id,
    campaign_id,
    engagement_score,
    engagement_score + randn(12345) * 5 AS engagement_score_with_noise -- Using a seed value of 12345
FROM marketing_campaign_data;

Let us rank customers by their purchase amount within each campaign to identify top spenders.

SELECT customer_id, campaign_id, purchase_amount,
       rank() OVER (PARTITION BY campaign_id ORDER BY purchase_amount DESC) AS rank
FROM marketing_campaign_data;

The query retrieves data from the **marketing_campaign_data** table, selecting the **customer_id**,
**campaign_id**, and **purchase_amount** columns, along with a calculated rank. The **rank()**
function is used to assign a ranking to each row within each **campaign_id** group (using **PARTITION BY
campaign_id**). The rows are ordered by purchase_amount in descending order (**ORDER BY
purchase_amount DESC**), meaning the highest **purchase_amount** within each campaign gets a rank
of 1, the second highest gets a rank of 2, and so on. This approach allows for ranking customers based on their
purchase amounts within each specific campaign, enabling comparisons and analysis of customer spending behavior
across different marketing campaigns.

Find the first customer by engagement score in each campaign to track early adopters.

SELECT campaign_id, first(customer_id) AS first_engaged_customer


FROM marketing_campaign_data
GROUP BY campaign_id;

Identify the last customer to make a purchase in each campaign to track lagging engagement.

SELECT campaign_id, last(customer_id) AS last_purchase_customer


FROM marketing_campaign_data
GROUP BY campaign_id;

Percent Rank (**percent_rank**)

Calculate the percent rank of customers based on their purchase amount within each campaign to categorize customer
spending.

SELECT customer_id, campaign_id, purchase_amount,
       percent_rank() OVER (PARTITION BY campaign_id ORDER BY purchase_amount) AS purchase_percent_rank
FROM marketing_campaign_data;

Percentile (**percentile or percentile_approx**)

A percentile is a measure that indicates the value below which a given percentage of observations in a dataset falls.
For example, the 25th percentile is the value below which 25% of the data points lie, while the 90th percentile is the
value below which 90% of the data points fall. Percentiles help in understanding the distribution of data by dividing it
into 100 equal parts.

Percentiles are commonly used in data analysis to assess the relative standing of individual observations within a
dataset. They are particularly useful for identifying outliers, comparing different data sets, or summarizing large
amounts of data. In educational testing, for example, if a student’s score is in the 85th percentile, it means they scored
higher than 85% of the other students. Percentiles provide a way to interpret data in terms of rank and position rather
than exact values.
Both **percentile** and **percentile_approx** compute approximate percentiles, which provides a significant performance advantage in a query, especially when working with large datasets. Unlike exact percentile calculations, both estimate the value using algorithms that avoid sorting all the data. This approach results in faster execution and lower memory usage, making these functions highly suitable for big data environments where datasets can be massive. They also scale efficiently, allowing them to handle very large datasets seamlessly. Although they provide an approximate value rather than an exact percentile, the trade-off is often worthwhile for the speed and resource efficiency they offer.

Let us calculate the 90th percentile of customer engagement scores to identify top-performing customers who are
highly engaged with a marketing campaign.

-- Calculate the 90th percentile of engagement scores


SELECT percentile(engagement_score, 0.90) AS p90_engagement_score
FROM marketing_campaign_data;

Percentile Approximation (**percentile_approx**)

Let us calculate the approximate 90th percentile of customer CLV to understand high-value customer thresholds.

SELECT percentile_approx(clv, 0.90) AS p90_clv
FROM marketing_campaign_data;

Continuous Percentile (**percentile_cont**)

A continuous percentile is a measure used to determine the value below which a certain percentage of the data falls,
based on a continuous interpolation of the data points. In cases where the specified percentile does not correspond
exactly to a data point in the dataset, the continuous percentile calculates an interpolated value between the two nearest
data points. This provides a more precise estimate of the percentile, especially when dealing with small datasets or
when the data distribution is not uniform.

For example, if the 75th percentile falls between two data points, the continuous percentile will estimate a value that
represents a weighted average between these points, rather than just picking the closest one. This approach gives a
more accurate representation of the distribution, as it takes into account the relative positions of data points rather than
simply using discrete ranks. Continuous percentiles are often used in statistical analysis to better understand the
distribution of data, especially in situations where the exact percentile may lie between observed values.

The continuous percentile function calculates the exact percentile value by interpolating between the two nearest
data points if the specified percentile falls between them. It gives a precise answer by determining a value that may
not be in the original dataset but represents a point within the ordered range. This function is used when an exact,
interpolated percentile value is needed.

Let us calculate the 75th percentile of Customer Lifetime Value (CLV) to understand the top 25% most valuable
customers.

-- Calculate the continuous 75th percentile of CLV


SELECT percentile_cont(0.75) WITHIN GROUP (ORDER BY clv) AS p75_clv
FROM marketing_campaign_data;

Discrete Percentile (**percentile_disc**)

A discrete percentile is a measure used to determine the value below which a specified percentage of the data falls,
based on actual data points in the dataset. In contrast to a continuous percentile, which interpolates between data
points, a discrete percentile selects the closest actual data value that corresponds to the given percentile rank.
For example, if you want to find the 75th percentile in a discrete approach, the function will choose the value at or just
above the rank where 75% of the data points lie, without performing any interpolation. This means that the output will
always be one of the actual values from the dataset, making it a straightforward representation of the distribution based
on the observed data. Discrete percentiles are useful when the goal is to work with specific data values rather than
estimated positions, such as in ranking scenarios or when dealing with ordinal data where interpolation might not be
meaningful.

The discrete percentile function calculates the exact percentile value based on the actual data points, without any
interpolation. It selects the closest actual data value corresponding to the specified percentile, ensuring that the result is
one of the observed values in the dataset. This function is suitable for cases where only actual data values are
meaningful, such as ordinal data.

Let us calculate the 90th percentile of engagement scores to find the actual score that separates the top 10% of most
engaged customers.

-- Calculate the discrete 90th percentile of engagement scores


SELECT percentile_disc(0.90) WITHIN GROUP (ORDER BY engagement_score) AS
p90_engagement_score
FROM marketing_campaign_data;

Numeric Histograms (**histogram_numeric**)

Create a histogram of customer purchase amounts to analyze spending patterns.

-- Create a histogram for purchase amounts (divided into 5 buckets)


SELECT to_json(histogram_numeric(purchase_amount, 5)) AS purchase_histogram
FROM marketing_campaign_data;

This function would return the distribution of customer purchases across 5 buckets, which can be used to create
visualizations or perform further analysis.

In the histogram data returned by the query, the x and y values represent the following:

**x** (Bucket Range): The midpoint or representative value of each bucket in the histogram. In this case, the
purchase amounts have been divided into five buckets, so each x value represents the center of a range of
purchase amounts.

**y** (Frequency): The number of occurrences (or count) of purchase amounts that fall within each
corresponding bucket. This tells you how many purchase transactions fall within the range represented by the
**x** value.

So, each data point in the JSON array indicates how many purchases (y) are within a specific range of amounts
centered around x. Together, these values create a histogram showing the distribution of purchase amounts across five
intervals.

In the above algorithm, the buckets are determined based on the distribution of the data, not just evenly dividing the
range of values. This means that if certain ranges of purchase amounts have more data points, the bucket widths may
be adjusted to capture the distribution more accurately, resulting in non-equidistant **x** values.


https://data-distiller.all-stuff-data.com/unit-8-data-distiller-statistics-and-machine-learning/statsml-500-generative-sql-
with-microsoft-github-copilot-visual-studio-code-and-data-distiller * * *

1. Unit 8: DATA DISTILLER STATISTICS & MACHINE LEARNING

STATSML 500: Generative SQL with Microsoft GitHub Copilot, Visual


Studio Code and Data Distiller
Streamline your development workflow with Visual Studio Code and Github Copilot—fast, lightweight, and
customizable for all your coding needs, from generating SQL queries to managing projects.

Last updated 4 months ago

Visual Studio Code (VS Code) is one of the most popular, lightweight, and versatile code editors available today.
Developed by Microsoft, it’s designed to work with a wide range of programming languages and tools, offering
support for extensions that make it even more powerful. When paired with SQL, VS Code becomes a powerful
environment for querying databases, managing data, and writing efficient SQL scripts.

Why Visual Studio Code for SQL?

Download JSON: DBVisualizer and many other tools charge for downloading JSON data as part of their
premium offerings, which can be a significant limitation, especially when working with Data Distiller. This
editor, however, provides much more flexibility without those restrictions.

Lightweight and Fast: VS Code is designed to be lightweight, allowing you to run it efficiently on various
systems, including machines with limited resources. Despite its small footprint, it packs powerful features.

Run Python Notebooks: Visual Studio Code supports Jupyter-style Python notebooks through the Jupyter extension. This extension allows you to create, edit, and run Jupyter notebooks (.ipynb files) directly within the VS Code interface, offering a similar experience to JupyterLab.

IntelliSense for SQL: With SQL extensions installed, you get features like IntelliSense (auto-completion, syntax
highlighting, and error checking), making writing complex SQL queries faster and less error-prone.

Integration with GitHub Copilot: With GitHub Copilot integrated into VS Code, SQL developers can benefit
from AI-powered code suggestions while writing queries. Copilot can assist by auto-completing SQL statements,
suggesting optimized queries, or even generating complex SQL code based on comments or partial queries,
dramatically speeding up query development and reducing errors. This is particularly useful for beginners or
when working with complex SQL joins, aggregations, and subqueries.

Enhance Secure Access of Visual Studio Code with IP Whitelisting in Data Distiller

Since this client is installed on your local machine, if your administrator has concerns about accessing data in AEP
from a different machine, you should request that they enable the Data Distiller whitelisting feature. This ensures that
only IP addresses from the corporate network are allowed access. Even if someone attempts to spoof an IP and send a
query, the responses would still be restricted to the corporate network, preventing unauthorized access.

Download Visual Studio Code

You should download the version that matches your operating system.

This is not required but if you have not done so, download the JSON data

Connect to Data Distiller

1. Launch the Visual Studio Code editor

2. Click on Extensions to the left:

3. First you need to install the Postgres driver. To enable the Microsoft Visual Studio Code editor to communicate
with the Data Distiller engine, you need a PostgreSQL driver. This driver allows the editor to interact with Data
Distiller using PostgreSQL’s protocol and syntax, which Data Distiller supports. Search for “postgres
cockroach”. Link is here on the web.

4. It will also automatically install an extension called SQL Tools that gives us the tools to connect to databases,
write and manage queries. Link is here on web.

Keep in mind that Data Distiller adheres to Postgres SQL syntax, a widely used standard due to its robustness,
compatibility, and support for advanced features, making it a popular choice for many database applications.

1. Click on the icon on the UI, then click Add Connection and choose PostgreSQL

2. You will need Data Distiller Credential parameters for the above by logging into the AEP UI and navigating to
Queries->Credentials. You will need to copy the following:

The PSQL command box will look something like this:

psql "sslmode=require host=demo.platform-query.adobe.io port=80 dbname=prod:all user=Demo@AdobeOrg password=****"

**host**: The host is the endpoint where you send the requests, representing the server or service (in this
case, Data Distiller) that you’re trying to connect to. It’s actually a URL/IP address.

**dbname**: The dbname refers to the specific database within Data Distiller that you’re connecting to. Each
database holds different sets of data, and the dbname helps you point to the correct one.

**port**: The port specifies the communication channel used to connect to the host. For Data Distiller, this can be port 80 (typically used for HTTP) or 5432 (the default port for PostgreSQL). Depending on your environment or security setup, one of these ports will be used to facilitate the connection.

**sslmode**: This parameter ensures that the connection between your machine and the Data Distiller
database is secured using SSL (Secure Sockets Layer), protecting the data in transit from unauthorized access.

1. Fill out the options on the connection screen with the following parameters:

1. Connection name:

2. Connection group: Leave blank for now as this is for a logical group of multiple connections.
3. Connect using: Server and port

4. Server address: Copy **host** from Data Distiller Credentials UI shown above i.e. demo.platform-
query.adobe.io in the example above

5. Port: You can use 5432 or 80.

6. Database: Copy **dbname** or Database from Data Distiller Credentials UI shown above i.e. prod:all in the example above

7. Username: Copy **user** or Username from Data Distiller Credentials UI shown above i.e.
Demo@AdobeOrg in the example above

8. Use password: Select Save as plaintext in settings, though other options are available depending on your
intended use.

9. Password: Copy the password from the Data Distiller Credentials UI as shown above.

10. SSL: Change from Disabled to Enabled

11. If you scroll down, you will see an option Show Records Default Limit. By default, it shows 50 records, but you can go much higher, as much as the memory on your machine allows. I chose 10,000.

2. Scroll to the bottom and click on Connect Now

Remember that Data Distiller passwords expire in 24 hours. Frequent password expiration reduces the likelihood of
unauthorized access. Even if a password is compromised, the short validity window minimizes the risk by ensuring
that the credentials cannot be used for long. Many industries and organizations follow strict regulatory requirements
(e.g., HIPAA) that enforce regular password changes to protect sensitive data. Short password expiration cycles help
meet these compliance standards.

1. Navigate to Connections pane and click on the icon for Create New SQL File

2. Type the following queries. Make sure you separate each query with a semicolon. It will execute all the queries and show the results in multiple tabs. You can also right-click and choose the Run selected query option.

SHOW TABLES;

SELECT * FROM luma_web_data;

3. Let us now export this result as JSON. Click on Export and choose Save results as JSON

4. Once you save the file to your desktop or any other location, you will see the JSON file appear in Preview:

GitHub is part of Microsoft. Microsoft acquired GitHub in 2018. While GitHub operates as a subsidiary and
maintains its own identity and platform, the acquisition has led to closer integration between GitHub and other
Microsoft tools and services, such as Azure and Visual Studio Code. This integration is particularly evident in the
development of tools like GitHub Copilot, which leverages OpenAI’s technology (another Microsoft partner) and
works seamlessly with Microsoft products like Visual Studio Code. Through the acquisition, Microsoft has been able
to strengthen its offerings for developers while supporting GitHub’s open-source community.

GitHub Copilot, an AI-powered coding assistant, has revolutionized the way developers write code by providing
intelligent code suggestions directly within the development environment. When integrated with Visual Studio Code,
Copilot becomes an invaluable tool for generating code snippets, speeding up development workflows, and improving
productivity.
For SQL developers, GitHub Copilot offers the ability to generate SQL queries based on the context of the project,
making it easier to write and refine complex queries with just a few keystrokes.

If you’re someone who isn’t interested in understanding the SQL being generated, there are clear limitations to what
coding assistants—or any assistants—can do for you. Without grasping the nuances, you’ll likely struggle with
suggestions that may be incorrect or overly complex, making the process more difficult. This is why conversations
about developer productivity assume that developers have the expertise to use these assistants to automate parts of the
code they already understand. These tools are not a substitute for a skilled SQL developer.

The same principle applies to any AI assistants or task agents. If you don’t understand what’s happening and why,
there’s a high risk that a simple mistake could lead to system-wide issues. AI assistants can be very convincing, but
they can also provide incorrect answers without you realizing it. This highlights a broader challenge in communication
with AI—if you’re not fact-checking or fully understanding the context, you could find yourself in serious trouble.

When working with coding assistants like GitHub Copilot, it’s crucial to manage expectations, especially when
dealing with complex coding tasks. While Copilot excels at speeding up coding workflows and providing useful
suggestions for standard operations, it encounters limitations in handling deeply nested data structures, such as
arrays and maps. These structures, which are common in Data Distiller queries and data models, often require a
nuanced understanding of context and relationships that current language models struggle to fully grasp. In our
experiments with SQL, especially in scenarios involving complex subqueries, joins, and deeply nested data, we’ve
found that Copilot may struggle to generate accurate or optimal code.

Retrieval-Augmented Generation (RAG) is a technique that can help alleviate some of these limitations. In a RAG-
based system, the model augments its generation by retrieving relevant information from a knowledge base or external
documents. This approach can improve accuracy in tasks like querying complex datasets because it combines
generated content with factual information retrieval, making the model more context-aware.

However, there are trade-offs with using RAG-based approaches:

Restrictiveness: While RAG can help improve the precision of the generated SQL queries by providing more
context, it also makes the query generation process more restrictive. The model’s output becomes heavily
dependent on the retrieved information, meaning it is less likely to generate creative or flexible queries. This can
be beneficial when accuracy is critical, but it can also limit the flexibility needed for more exploratory or dynamic
queries.

Dependency on Knowledge Base: RAG systems rely on a well-curated knowledge base or set of documents for
retrieval. If the knowledge base is incomplete, outdated, or lacks detail on specific database schemas, the quality
of the suggestions can be limited. This can still result in gaps when dealing with custom or less-documented data
models.

Performance: RAG-based models require additional steps to retrieve and process information, which increases the latency of generating suggestions or requires additional resources. This might not be an issue for smaller tasks, but for larger, complex queries, it could impact the overall efficiency and cost.

Copilot’s training data, sourced from existing code repositories, and the context gathered by its large language model
may introduce biases and errors that can be reflected in its suggestions. Additionally, Copilot Chat may favor SQL
coding styles, potentially leading to suboptimal or incomplete code recommendations.

How GitHub Copilot Works in Visual Studio Code

When you install GitHub Copilot extension in Visual Studio Code, it seamlessly integrates into your development
environment, offering code suggestions as you type. Here’s how it works:

1. Context-Aware Code Generation: Copilot analyzes the code you’re writing, the comments you provide, and the overall project context. For SQL, this means that it can generate queries based on your data structures, table names, and existing code patterns. For instance, if you begin writing a query or even a comment like -- Fetch top 10 sales records, Copilot can suggest an appropriate SQL statement to achieve that goal, as shown in the sketch after this list.

2. Query Auto-Completion: Copilot helps autocomplete SQL syntax, suggesting relevant SQL commands, such as
SELECT, JOIN, WHERE, and GROUP BY, based on the schema you’re working with. This reduces the need to
constantly reference documentation or remember complex syntax.

3. Adaptive Learning: Over time, Copilot adapts to your coding style and project context, improving the relevance
and accuracy of its suggestions. Whether you’re managing a simple query or a complex data operation, Copilot
aims to reduce errors and save you time by offering optimized code solutions.
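For instance, the comment-driven prompt from the first point above might yield a suggestion along these lines (illustrative only; the sales table and the exact query are assumptions, and actual Copilot suggestions vary with your schema and context):

-- Fetch top 10 sales records
SELECT *
FROM sales
ORDER BY sale_amount DESC
LIMIT 10;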

Check this page

GitHub Copilot isn’t free, but it’s highly affordable considering the significant time it can save during development.
What is cool is that you can get a free trial for a month to try and see if it meets your needs.

You can read about it here.

GitHub Copilot takes user privacy and security seriously. Here’s how your data is handled when using the tool based
on the information available

Contextual Suggestions Only: Copilot uses the code in your current file and project to make context-aware
suggestions. However, it does not access or share the broader content of your private repositories or any code
outside the active session to provide its suggestions.

No Use of Private Code for Training: While GitHub Copilot was trained on public code repositories, it does not
use your private repository data for training its underlying model. Private code remains private, and Copilot only
leverages the context within your active session for generating recommendations.

Data Collection for Feedback: Copilot collects limited data, such as whether you accept or reject a suggestion,
to help improve its performance. However, it does not collect or store the specific code you’re working on unless
you explicitly share it through feedback mechanisms.

Install the GitHub Copilot Extension

1. Go back to Extensions and search for Github Copilot. Install it.

2. After the extension is installed, it will request you to sign in to make sure that you have the right license.

3. It will also ask to link the GitHub Copilot to your Visual Studio Code environment

4. Go back to the editor and you will see the GitHub icon in the editor itself. Start typing what you want it to do. In my case, I am trying to get to the Profile snapshot data that contains valuable information about the Profile Store. Also, observe the chat icon that has now appeared in the left-hand panel.

5. Open up the chat and start writing queries. You can see that I copy paste the values of maps and arrays from the
results and give additional context:

6. You can also right click on a query and have it explained, fix it for syntax issues, generate docs or write tests.

7. Now copy paste the following cryptic Data Distiller error (every engine has a set of cryptic errors:))

ErrorCode: 08P01 queryId: c4556695-e655-49ac-be22-bd8adf6f60b8 Unknown error encountered. Reason:


[Table not provisioned for dataset: 64109683dca32d1bd12960a9 at
abfss://[email protected]/platform/64109683dca32d1bd12960a9. To
prevent this error either add data to the dataset or specify withAutoProvisionTableEnabled(true) when calling this
operation]

The error occurs because there are no rows of data in the dataset. This dataset was not generated in Data Distiller but
was created using a different service within Adobe Experience Platform.

GitHub CoPilot has no knowledge of AEP or the documentation and so it tries to get creative. The heart of the answer
is right at the center.

90% of the text above is incorrect, but the correct answer is still present. If you’re a Data Distiller user familiar with
how the system works, you’d be able to identify the mistake. This highlights one of the biggest challenges with AI
assistants: they tend to benefit power users who can discern what the AI is saying correctly and where it’s making
errors.

Screenshot captions from this walkthrough:

- Launch screen for Visual Studio Code.
- Click on Extensions to the left.
- Search for the PostgreSQL/CockroachDB extension.
- Options that need to be filled out.
- Make sure you have the SSL option enabled.
- The "show records" default limit can be set to a higher value.
- You will see the entire connection screen.
- Choose to save results as JSON.
- Link Visual Studio Code with GitHub Copilot.
- Simple queries that boost productivity.
- As the data becomes more nested, you will need to give more context.
- A variety of options are available.

https://data-distiller.all-stuff-data.com/unit-8-data-distiller-statistics-and-machine-learning/statsml-601-building-a-
period-to-period-customer-retention-model-using-logistics-regression * * *

1. Unit 8: DATA DISTILLER STATISTICS & MACHINE LEARNING

STATSML 601: Building a Period-to-Period Customer Retention Model Using Logistic Regression

Unlocking Future Engagement: Data-Driven Retention Predictions for Smarter Personalization Strategies

Last updated 3 months ago

You need to download these datasets:

To ingest the above datasets, you need to be familiar with this tutorial:

We will be using DBVisualizer for this tutorial, but you can use any SQL client you like:

A major app company aimed to understand and improve user retention by forecasting future retention rates based on
historical app usage patterns. By leveraging data-driven insights, the company sought to proactively address churn
risks, optimize user engagement strategies, and ultimately increase long-term loyalty. Using SQL and logistic
regression, a model was developed to predict which users would likely remain active in the coming period, enabling
the company to take targeted actions to retain valuable users.

Customer retention is crucial for sustainable growth, especially in highly competitive markets. By accurately
predicting whether customers will remain engaged from one period to the next, you can optimize marketing
campaigns, personalize customer experiences, and focus retention efforts where they are most effective. This model
provides such insights by examining customer activity patterns and creating retention forecasts for the next period,
aligning with the company’s goal of boosting loyalty. By focusing on customers predicted to have a lower probability
of retention, the company can take proactive actions—like sending personalized messages, offering discounts, or
recommending relevant content—to improve engagement and loyalty.

You will use historical activity data to create a dataset that captures customer activity over weekly periods. Using Data
Distiller, you will transform this data to identify active customers for each period and calculate key features:

Current Customers (active in the current period)

Previous Customers (active in the prior period)

Retention Rate (ratio of retained customers to current customers)

You will use these features as inputs for a logistic regression model, which will predict retention as a probabilistic
outcome, offering a forecast of whether each customer would stay active in the next period.

Time Span: 3 years (from January 1, 2023, to December 28, 2025)

Frequency: Weekly activity records

Initial Customers: 20,000 customers at the start of the simulation

Total Weeks: 156 weeks (3 years * 52 weeks per year)

Total Records: The dataset contains approximately 2 to 3 million records, depending on retention and churn
rates.

The dataset is stored in a CSV file with the following columns:

**timestamp**: A datetime object representing the start date of the week during which the customer
was active.

Example: "2023-01-01", "2023-01-08"

**customer_id**: A unique identifier for each customer.

Format: Original customers: "CUST_<number>"

New customers added: "NEW_CUST_<week_number>_<number>"

Example: "CUST_1", "NEW_CUST_0_1", "NEW_CUST_52_150"

Logistic Regression and Its Connection to Probabilities


Imagine you want to predict whether a customer will stay with a company or leave next month. This is a yes-or-no
question, so you’re looking for a result that tells you one of two things: “stay” or “leave.” This is where logistic
regression comes in. It’s a statistical method that helps us predict outcomes like this by calculating the probability of
each option.

Logistic regression is a technique used to predict binary outcomes — situations with only two possible results, like
“yes” or “no,” “win” or “lose,” “stay” or “leave.” Instead of giving you a number or a continuous outcome (like
predicting a future sales amount), logistic regression tells you the likelihood (probability) of an outcome.

Why Not Use Regular Math or Linear Regression?

Here’s where logistic regression is clever. It doesn’t just use a straight line (like you would in simple math or linear
regression). Instead, it applies a special mathematical function called the sigmoid function. This function takes any
input, positive or negative, and compresses it into a range between 0 and 1. This transformation gives us
probabilities, making the predictions easier to interpret as likelihoods.

You can frame logistic regression as a numerical (or continuous optimization) problem, but it fundamentally differs
from standard numerical regression methods like linear regression because of its underlying assumptions and the type
of outcome variable it predicts.

Difference Between Logistic and Numerical Regression

1. Target Variable:

Logistic regression is used for categorical outcomes (e.g., 0 or 1).

Numerical regression (e.g., linear regression) predicts continuous outcomes (e.g., price, temperature).

2. Model Assumptions:

Logistic regression assumes the log-odds of the target variable are linearly related to the features (see the formula sketch after this list).

Linear regression assumes the dependent variable itself is linearly related to the features.

3. Output:

Logistic regression outputs probabilities that are thresholded to classify outcomes.

Numerical regression outputs continuous predictions directly.
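To make the log-odds assumption concrete, here it is in equation form, using the same notation as the loss function below (this is the standard textbook formulation, not notation specific to Data Distiller):

$$\log\frac{\hat{p}_i}{1-\hat{p}_i} = \mathbf{x}_i \cdot \beta$$

The sigmoid function shown later in this section is simply the inverse of this log-odds (logit) transformation.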

Framing Logistic Regression as Numerical Optimization

The core mechanism of logistic regression differs significantly from numerical regression because logistic regression
optimizes a likelihood function, whereas numerical regression minimizes a root mean square error (RMSE) function.
The objective function and the predictions are done in the following way:

1. Continuous Loss Function: Logistic regression minimizes a loss function called the negative log-likelihood (or
equivalently, maximizes the likelihood of the observed data). This is a continuous optimization problem, typically
solved using gradient-based methods like gradient descent. The loss function for binary logistic regression is:

$$L(\beta) = -\sum_{i=1}^{N}\left[\, y_i \log(\hat{p}_i) + (1 - y_i)\log(1 - \hat{p}_i)\,\right]$$

where:

**L(β)**: The total negative log-likelihood we aim to minimize.

**N**: Number of observations (samples) in the dataset.

**y_i**: Binary target variable for the i-th observation.

**p̂_i**: Predicted probability that y_i = 1, based on the sigmoid function.

**x_i**: Feature vector for the i-th observation.

**β**: Coefficient vector (the weights of the model).

2. Predicting Probabilities: Logistic regression predicts probabilities using the sigmoid (logistic) function, which
maps the linear combination of features to a value between 0 and 1:

$$\hat{p}_i = \frac{1}{1 + e^{-\mathbf{x}_i \cdot \beta}}$$

While the coefficients are fit numerically, the predicted outcomes themselves are probabilities, not continuous values.

Active Customers are customers who had some activity in the app, in a specific time period (e.g., weekly). This
calculation helps us determine the number of unique customers interacting with the app in each period.

WITH activity_periods AS (
SELECT
customer_id,
DATE_TRUNC('week', timestamp) AS period
FROM customer_activities
GROUP BY customer_id, DATE_TRUNC('week', timestamp)
),
active_customers AS (
SELECT period, customer_id
FROM activity_periods
)
SELECT
period,
COUNT(DISTINCT customer_id) AS active_customers
FROM
active_customers
GROUP BY
period
ORDER BY
period;

**activity_periods** groups activity by each customer and weekly period.

**active_customers** then selects unique **customer_id**s for each period, representing those
customers who were active during that week.

If you execute this query, you will get:

Active Customers Lag introduces the prior period data for each customer. Using Data Distiller’s LAG function, we
access each customer’s previous active period, allowing us to see which customers were active in both the current and
prior periods.

WITH activity_periods AS (
SELECT
customer_id,
DATE_TRUNC('week', timestamp) AS period
FROM customer_activities
GROUP BY customer_id, DATE_TRUNC('week', timestamp)
),
active_customers AS (
SELECT period, customer_id
FROM activity_periods
),
active_customers_lag AS (
SELECT
customer_id,
period,
LAG(period) OVER (PARTITION BY customer_id ORDER BY period) AS
previous_period
FROM active_customers
)
SELECT
customer_id,
period,
previous_period
FROM
active_customers_lag
ORDER BY
customer_id, period;

**LAG(period)** retrieves the prior active period for each customer, enabling comparisons across
consecutive periods.

If **previous_period** is not null, it indicates the customer was active in both the current and previous
periods, meaning they were “retained.”

If you execute this query, you should get:

Customer retention generally indicates that a customer has remained active or engaged with the business over
multiple time periods. For our use case, retention is defined more specifically as a customer who is active in both the
current period and the previous period. Here’s a closer look at how this can be interpreted and extended:

1. Period-to-Period Retention (Binary Retention): In our case study, a customer is considered retained if they
were active both in the current period and the immediately preceding period. This is calculated by checking if a
customer’s last activity occurred both in the current period and the previous one, often using the SQL LAG()
function to compare periods.

2. Alternative Multi-Period Definitions:

Consecutive Period Retention: Some businesses define retention over multiple consecutive periods. For
example, a customer might be retained only if they were active for three consecutive periods (see the sketch after this list).

Flexible Retention Periods: A customer may also be considered retained if they were active in any two out
of the last three periods. This approach adds flexibility and can be beneficial for businesses where customer
activity is less frequent.

3. Cohort-Based Retention: Instead of looking at immediate consecutive periods, retention can be defined within
monthly or quarterly cohorts. For instance, you may consider a customer retained if they were active in the
current quarter after making at least one activity in the previous quarter, which can be valuable for tracking
longer-term engagement.

4. Retention Rate is typically defined as the ratio of customers retained (those active in consecutive periods) to the
total number of customers in the previous period. This provides an overall measure of how well the app is
retaining its customer base.
5. Churn Rate is often defined as the inverse of retention, i.e., customers who were active in the previous period
but not in the current one. This allows businesses to identify the rate at which customers stop engaging.
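As an illustration of the consecutive-period variant mentioned above, here is a minimal sketch that flags a customer as retained only if they were also active in each of the two preceding weeks. It reuses the activity_periods CTE from this tutorial; the extra LAG offsets and the DATE_ADD arithmetic are the only additions, so validate them against your own data before relying on the output.

WITH activity_periods AS (
    SELECT
        customer_id,
        DATE_TRUNC('week', timestamp) AS period
    FROM customer_activities
    GROUP BY customer_id, DATE_TRUNC('week', timestamp)
)
SELECT
    customer_id,
    period,
    -- 1 only when the two prior active weeks fall exactly 7 and 14 days earlier
    -- (DATE_SUB(period, 7) is an equivalent way to express DATE_ADD(period, -7))
    CASE
        WHEN LAG(period, 1) OVER (PARTITION BY customer_id ORDER BY period) = DATE_ADD(period, -7)
         AND LAG(period, 2) OVER (PARTITION BY customer_id ORDER BY period) = DATE_ADD(period, -14)
        THEN 1 ELSE 0
    END AS retained_three_consecutive_weeks
FROM activity_periods
ORDER BY customer_id, period;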

We need to combine the data on active customers and their previous periods to calculate:

Current Customers: Customers active in the current period.

Previous Customers: Customers active in the previous period.

Retained Customers: Customers active in both the current and previous periods.

These calculations yield metrics like churn rate (customers who did not return) and retention rate (customers who
stayed from one period to the next):

Churn Rate: Calculated as **1.0 - (current_customers / previous_customers)** to capture the percentage of customers lost in the current period.

Retention Rate: Calculated as **retained_customers / current_customers** to understand how many customers from the previous period were retained.

Note that Retention Rate and Churn Rate are not the same, although they are related metrics. They measure different
aspects of customer behavior, and their calculations reflect this.

Churn Rate measures the percentage of customers lost from one period to the next. This metric focuses on the loss of
customers. If many customers from the previous period are not active in the current period, the churn rate will be high.
A churn rate of 0 indicates no customers were lost, while a churn rate of 1 indicates all customers were lost.

Retention Rate measures the percentage of customers from the previous period who are still active in the current
period. This metric focuses on the retention of customers. It shows how well the business is keeping its customers
engaged over time. A retention rate of 1 means all customers from the previous period stayed active, while a retention
rate closer to 0 means most previous customers were lost.

Suppose:

previous_customers = 100 (customers who were active in the previous period)

current_customers = 80 (customers who are active in the current period)

retained_customers = 60 (these are customers who were active in both periods)

Churn Rate = 1 - (80/100) = 20%

Retention Rate = 60/80 = 75%

In other words, 60 customers remained active from the previous period to the current one. Starting with 100 customers,
this number decreased to 80 in the current period, meaning we lost 40 customers but also gained 20 new ones. Churn is
the net loss of customers, calculated as the difference between the customers lost (40) and new customers gained (20),
which is 20. Expressed as a percentage of the starting count (100), this gives us the churn rate of 20%. This calculation
focuses on the overall change in the customer base, rather than specifically measuring customers retained from one
period to the next.

WITH activity_periods AS (
SELECT
customer_id,
DATE_TRUNC('week', timestamp) AS period
FROM customer_activities
GROUP BY customer_id, DATE_TRUNC('week', timestamp)
),
active_customers AS (
SELECT period, customer_id
FROM activity_periods
),
active_customers_lag AS (
SELECT
customer_id,
period,
LAG(period) OVER (PARTITION BY customer_id ORDER BY period) AS
previous_period
FROM active_customers
),
churn_retention AS (
SELECT
period,
COUNT(DISTINCT customer_id) AS current_customers,
COUNT(DISTINCT CASE WHEN previous_period IS NULL THEN NULL ELSE
customer_id END) AS previous_customers,
COUNT(DISTINCT CASE WHEN previous_period IS NOT NULL THEN customer_id
END) AS retained_customers
FROM active_customers_lag
GROUP BY period
)
SELECT
period,
current_customers,
previous_customers,
-- Calculate churn_rate as a float, bounded within [0, 1]
CAST(CASE WHEN previous_customers = 0 THEN 0
ELSE LEAST(1.0 - (current_customers * 1.0 / previous_customers),
1.0) END AS FLOAT) AS churn_rate,
-- Calculate retention_rate as a float, bounded within [0, 1]
CAST(CASE WHEN current_customers = 0 THEN 0
ELSE LEAST(retained_customers * 1.0 / current_customers, 1.0) END
AS FLOAT) AS retention_rate
FROM churn_retention
ORDER BY period;

current_customers counts customers in each period.

previous_customers counts those who were active in the previous period (customers with non-null
previous_period).

retained_customers counts those who appear in both the current and previous periods, allowing us to
compute the retention rate and churn rate.

Our goal is to develop features at the individual customer level and monitor their metrics on a weekly basis. Designing
these features requires some creativity, but with a solid understanding of customer analytics, you should be able to
make insightful and effective choices.

1. Tenure Weeks (**tenure_weeks**): The total number of weeks a customer has been active up to the
current period: It reflects customer loyalty and experience with the service. Customers with longer tenures may
have established habits or preferences influencing retention.

2. Activity Frequency (**activity_frequency**): The number of activities or transactions the customer


performed in the current week. It indicates the customer’s engagement level during the current period. Higher
activity may correlate with increased satisfaction and likelihood of retention.
3. Average Engagement (**average_engagement**): The average number of activities per week up to the
current period. It provides context for the customer’s activity frequency. It helps identify deviations from typical
behavior (e.g., a sudden drop in engagement).

4. Time Since Last Activity (**time_since_last_activity_days**): The number of days since the
customer’s previous activity period. It measures recency of engagement. Longer durations may signal
disengagement or risk of churn.

5. Retained Next Week (**retained_next_week**): A binary label indicating whether the customer was
active in the following week (1 for yes, 0 for no). It serves as the target variable for the classification model. It
helps the model learn patterns associated with customer retention.

The main goal of feature engineering in this context is to transform raw activity data into meaningful features that can
help predict customer retention. By capturing various aspects of customer behavior, we aim to:

Understand Engagement Patterns: Features like activity_frequency and average_engagement


reflect how customers interact with the service over time.

Identify Risk of Churn: Features like time_since_last_activity_days can indicate when a customer
is becoming less engaged.

Capture Customer Loyalty: tenure_weeks highlights how long a customer has been with the service, which
may influence their retention behavior.

Provide Predictive Insights: These features allow the machine learning model to recognize patterns that precede
retention or churn.

All the features are given below in a single query but you can highlight and execute the **SELECT** queries within
each CTE.

Also observe how the features that we computed above are being cast into numeric representations, as the logistic
regression model in Data Distiller can only accept numerical values and we have to make that explicit:

CAST(tenure_weeks AS DOUBLE) AS tenure_weeks,


CAST(activity_frequency AS DOUBLE) AS activity_frequency,
CAST(average_engagement AS DOUBLE) AS average_engagement,
CAST(time_since_last_activity_days AS DOUBLE) AS
time_since_last_activity_days,
CAST(retained_next_week AS INT) AS retained_next_week

The features that we will extract are the following:

WITH activity_periods AS (
SELECT
customer_id,
DATE_TRUNC('week', timestamp) AS period,
COUNT(*) AS activity_count
FROM customer_activities
GROUP BY customer_id, DATE_TRUNC('week', timestamp)
),
customer_activity AS (
SELECT
customer_id,
period,
activity_count,
LEAD(period) OVER (PARTITION BY customer_id ORDER BY period) AS
next_period,
LAG(period) OVER (PARTITION BY customer_id ORDER BY period) AS
previous_period
FROM activity_periods
),
customer_retention AS (
SELECT
customer_id,
period,
activity_count,
CASE WHEN next_period = DATE_ADD(period, 7) THEN 1 ELSE 0 END AS
retained_next_week,
previous_period
FROM customer_activity
),
customer_features AS (
SELECT
cr.customer_id,
cr.period,
cr.retained_next_week,
-- Tenure Weeks
COUNT(*) OVER (
PARTITION BY cr.customer_id
ORDER BY cr.period
) AS tenure_weeks,
-- Activity Frequency
cr.activity_count AS activity_frequency,
-- Average Engagement
AVG(cr.activity_count) OVER (
PARTITION BY cr.customer_id
ORDER BY cr.period
) AS average_engagement,
-- Time Since Last Activity
COALESCE(DATEDIFF(cr.period, cr.previous_period), 0) AS
time_since_last_activity_days
FROM customer_retention cr
)
SELECT
customer_id,
period,
CAST(tenure_weeks AS DOUBLE) AS tenure_weeks,
CAST(activity_frequency AS DOUBLE) AS activity_frequency,
CAST(average_engagement AS DOUBLE) AS average_engagement,
CAST(time_since_last_activity_days AS DOUBLE) AS
time_since_last_activity_days,
CAST(retained_next_week AS INT) AS retained_next_week
FROM customer_features
WHERE retained_next_week IS NOT NULL;

Logistic regression is a supervised machine learning algorithm used for binary or multi-class classification tasks.
Its primary goal is to predict the probability of an instance belonging to a specific class, typically modeled using the
sigmoid function to map inputs to probabilities.

MAX_ITER:

Maximum number of iterations for the optimization algorithm.


REGPARAM:

Regularization parameter to prevent overfitting by penalizing large coefficients.

ELASTICNETPARAM:

ElasticNet mixing parameter (α):

α=0: Applies L2 regularization (Ridge).

α=1: Applies L1 regularization (Lasso).
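As a minimal sketch of how these hyper-parameters could be supplied, here is the retention model from this tutorial with the options set explicitly. It assumes the feature query has been materialized into a hypothetical table named customer_retention_features, and the option spellings follow this section (other parts of this guide write them as REG_PARAM and ELASTICNET_PARAM, so check the parameter reference for your environment):

DROP MODEL IF EXISTS retention_model_tuned;

CREATE MODEL retention_model_tuned
TRANSFORM(
    vector_assembler(array(
        tenure_weeks,
        activity_frequency,
        average_engagement,
        time_since_last_activity_days
    )) AS features
)
OPTIONS(
    MODEL_TYPE='logistic_reg',
    LABEL='retained_next_week',
    MAX_ITER=200,         -- cap on optimizer iterations
    REGPARAM=0.01,        -- regularization strength
    ELASTICNETPARAM=0.5   -- 0 = pure L2 (Ridge), 1 = pure L1 (Lasso)
)
AS
SELECT
    tenure_weeks,
    activity_frequency,
    average_engagement,
    time_since_last_activity_days,
    retained_next_week
FROM customer_retention_features;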

Logistic regression offers several benefits compared to other classifiers, making it a popular choice for many
classification tasks. It is straightforward and highly interpretable, as it provides a clear relationship between input
features and the predicted probability through its coefficients, which represent the impact of each feature on the log-
odds of the outcome. Unlike some classifiers, such as decision trees, logistic regression outputs probabilities, allowing
for flexible thresholding for classification or ranking tasks. It is particularly efficient on small to moderately sized
datasets where linear separability is a reasonable assumption. Built-in support for L1 (Lasso) and L2 (Ridge)
regularization makes logistic regression robust to overfitting, especially in high-dimensional datasets, while its
computational cost is low compared to more complex models like random forests or neural networks. Often used as a
baseline model, logistic regression offers simplicity and reasonable performance across a wide range of problems.

Let’s take the code we previously created to extract the features and label, and simply add the **CREATE MODEL**,
**TRANSFORM**, and **OPTIONS** clauses.

The code below creates a logistic regression model named **Retention_model_logistic_reg** designed
to predict customer retention. Here’s a breakdown of each part:

DROP MODEL IF EXISTS Retention_model_logistic_reg;

CREATE MODEL Retention_model_logistic_reg


TRANSFORM(
vector_assembler(array(
tenure_weeks,
activity_frequency,
average_engagement,
time_since_last_activity_days
)) AS features
)
OPTIONS(
MODEL_TYPE='logistic_reg',
LABEL='retained_next_week'
)
AS
WITH activity_periods AS (
SELECT
customer_id,
DATE_TRUNC('week', timestamp) AS period,
COUNT(*) AS activity_count
FROM customer_activities
GROUP BY customer_id, DATE_TRUNC('week', timestamp)
),
customer_activity AS (
SELECT
customer_id,
period,
activity_count,
LEAD(period) OVER (PARTITION BY customer_id ORDER BY period) AS
next_period,
LAG(period) OVER (PARTITION BY customer_id ORDER BY period) AS
previous_period
FROM activity_periods
),
customer_retention AS (
SELECT
customer_id,
period,
activity_count,
CASE WHEN next_period = DATE_ADD(period, 7) THEN 1 ELSE 0 END AS
retained_next_week,
previous_period
FROM customer_activity
),
customer_features AS (
SELECT
cr.customer_id,
cr.period,
cr.retained_next_week,
-- Tenure Weeks
COUNT(*) OVER (
PARTITION BY cr.customer_id
ORDER BY cr.period
) AS tenure_weeks,
-- Activity Frequency
cr.activity_count AS activity_frequency,
-- Average Engagement
AVG(cr.activity_count) OVER (
PARTITION BY cr.customer_id
ORDER BY cr.period
) AS average_engagement,
-- Time Since Last Activity
COALESCE(DATEDIFF(cr.period, cr.previous_period), 0) AS
time_since_last_activity_days
FROM customer_retention cr
)
SELECT
customer_id,
period,
CAST(tenure_weeks AS DOUBLE) AS tenure_weeks,
CAST(activity_frequency AS DOUBLE) AS activity_frequency,
CAST(average_engagement AS DOUBLE) AS average_engagement,
CAST(time_since_last_activity_days AS DOUBLE) AS
time_since_last_activity_days,
CAST(retained_next_week AS INT) AS retained_next_week
FROM customer_features
WHERE retained_next_week IS NOT NULL;

**1. CREATE MODEL Retention_model_logistic_reg**

This statement creates a new machine learning model named Retention_model_logistic_reg. The model is
intended for predicting customer retention, specifically whether a customer will be active in the next week. Naming the
model allows you to reference it later for evaluation, prediction, or deployment tasks.

**2. TRANSFORM(vector_assembler(array(tenure_weeks, activity_frequency,


average_engagement, time_since_last_activity_days)) AS features)**

This line defines how the input data is prepared before being fed into the logistic regression model. Here’s a
breakdown:

**vector_assembler**:

A function that combines multiple columns into a single feature vector.

Essential for machine learning algorithms that require input features in vector form.

**array(tenure_weeks, activity_frequency, average_engagement,


time_since_last_activity_days)**:

Specifies the columns to be combined into the feature vector. The columns are:

1. **tenure_weeks**:

Represents the number of weeks the customer has been active up to the current period.

Reflects customer loyalty and tenure.

2. **activity_frequency**:

The number of activities or transactions the customer had in the current period.

Indicates the current engagement level.

3. **average_engagement**:

The average number of activities per period up to the current period.

Provides insight into the customer’s typical engagement over time.

4. **time_since_last_activity_days**:

The number of days since the customer’s last activity period.

Measures recency of engagement, which can be a predictor of churn.

If you execute the query, you will get the following:

Let’s run the model evaluation on the **test_customer_activities** dataset (make sure you have uploaded
this dataset onto AEP first):

SELECT *
FROM model_evaluate(Retention_model_logistic_reg, 1,
WITH activity_periods AS (
SELECT
customer_id,
DATE_TRUNC('week', timestamp) AS period,
COUNT(*) AS activity_count
FROM test_customer_activities
GROUP BY customer_id, DATE_TRUNC('week', timestamp)
),
customer_activity AS (
SELECT
customer_id,
period,
activity_count,
LEAD(period) OVER (PARTITION BY customer_id ORDER BY period) AS
next_period,
LAG(period) OVER (PARTITION BY customer_id ORDER BY period) AS
previous_period
FROM activity_periods
),
customer_retention AS (
SELECT
customer_id,
period,
activity_count,
CASE WHEN next_period = DATE_ADD(period, 7) THEN 1 ELSE 0 END AS
retained_next_week,
previous_period
FROM customer_activity
),
customer_features AS (
SELECT
cr.customer_id,
cr.period,
cr.retained_next_week,
-- Tenure Weeks
COUNT(*) OVER (
PARTITION BY cr.customer_id
ORDER BY cr.period
) AS tenure_weeks,
-- Activity Frequency
cr.activity_count AS activity_frequency,
-- Average Engagement
AVG(cr.activity_count) OVER (
PARTITION BY cr.customer_id
ORDER BY cr.period
) AS average_engagement,
-- Time Since Last Activity
COALESCE(DATEDIFF(cr.period, cr.previous_period), 0) AS
time_since_last_activity_days
FROM customer_retention cr
)
SELECT
customer_id,
period,
CAST(tenure_weeks AS DOUBLE) AS tenure_weeks,
CAST(activity_frequency AS DOUBLE) AS activity_frequency,
CAST(average_engagement AS DOUBLE) AS average_engagement,
CAST(time_since_last_activity_days AS DOUBLE) AS
time_since_last_activity_days,
CAST(retained_next_week AS INT) AS retained_next_week
FROM customer_features
WHERE retained_next_week IS NOT NULL);
In the above screenshot, the model_evaluate function has returned four evaluation metrics for the logistic
regression model:

1. AUC_ROC (Area Under the Receiver Operating Characteristic Curve): AUC-ROC is a measure of the
model’s ability to distinguish between classes (in this case, “retained” vs. “not retained”). This AUC-ROC score
of 0.787 suggests that the model has a reasonably good ability to distinguish between the “retained” and “not
retained” classes. An AUC-ROC of 0.5 would mean random guessing, so a score of 0.787 indicates the model is
performing better than random but still has room for improvement in separating the two classes.

2. Accuracy: Accuracy is the proportion of correct predictions out of the total predictions made by the model. With
an accuracy of 85.61%, the model is correctly classifying a high percentage of instances overall. However, it’s
worth noting that accuracy alone doesn’t capture the balance between false positives and false negatives, which is
why precision and recall are also important.

3. Precision: Precision (or Positive Predictive Value) measures the accuracy of the positive predictions. The
precision score of 0.7283 indicates that when the model predicts a customer will be “retained,” it is correct
72.83% of the time. This score is helpful in understanding the model’s reliability for positive predictions (i.e.,
predicting retention). It is calculated as:

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

4. Recall: Recall (or Sensitivity) measures how well the model captures all actual positive instances. A recall score
of 0.8627 means that the model correctly identifies 86.27% of all actual retained customers. This high recall
indicates that the model is good at capturing most of the retained customers but might be allowing some false
positives. It is calculated as:

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

This model has a decent balance between precision and recall, with a solid AUC-ROC score suggesting effective
classification. Fine-tuning (adjusting the various parameters in the options of this model) could further improve the
balance between precision and recall, depending on whether it’s more critical to avoid false positives or false negatives
in the retention context.
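As a quick illustrative check of how these metrics relate, suppose a hypothetical test set produced 860 true positives, 140 false negatives, 320 false positives, and 2,680 true negatives (these counts are made up for illustration and are not the ones behind the screenshot above). Then:

$$\text{Precision} = \frac{860}{860 + 320} \approx 0.73,\qquad \text{Recall} = \frac{860}{860 + 140} = 0.86,\qquad \text{Accuracy} = \frac{860 + 2680}{4000} \approx 0.885$$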

Once the model has been trained and evaluated, it can be used to make predictions on new data in
**customer_inference_dataset**

SELECT *
FROM model_predict(Retention_model_logistic_reg, 1,
SELECT
customer_id,
CAST(tenure_weeks AS DOUBLE) AS tenure_weeks,
CAST(activity_frequency AS DOUBLE) AS activity_frequency,
CAST(average_engagement AS DOUBLE) AS average_engagement,
CAST(time_since_last_activity_days AS DOUBLE) AS
time_since_last_activity_days
FROM customer_inference_dataset);

The results would be:

You can go ahead and materialize the dataset if you like:

CREATE TABLE retention_prediction AS


SELECT *
FROM model_predict(Retention_model_logistic_reg, 1,
SELECT
customer_id,
CAST(tenure_weeks AS DOUBLE) AS tenure_weeks,
CAST(activity_frequency AS DOUBLE) AS activity_frequency,
CAST(average_engagement AS DOUBLE) AS average_engagement,
CAST(time_since_last_activity_days AS DOUBLE) AS
time_since_last_activity_days
FROM customer_inference_dataset);
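Once the predictions are materialized, you can query them like any other dataset. The sketch below assumes the output of **model_predict** includes a column named **prediction** holding the predicted class (1 = retained, 0 = not retained); check the actual output schema of **retention_prediction** before relying on that column name.

-- List customers the model expects to churn so they can be targeted
SELECT customer_id
FROM retention_prediction
WHERE prediction = 0
ORDER BY customer_id;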

Screenshot captions from this tutorial:

- Retention rate along with the current and past customer periods.
- Feature engineering at each customer level.
- ML model has been created.
- Model evaluation on test data.
- Inferencing on the logistic regression model.

https://data-distiller.all-stuff-data.com/unit-8-data-distiller-statistics-and-machine-learning/statsml-600-data-distiller-
advanced-statistics-and-machine-learning-models * * *

Use case tutorials are here:

Data Distiller users need a convenient way to generate data insights to predict the best strategies for targeting users
across various use cases. They want the ability to predict a user’s likelihood of buying a specific product, estimate the
quantity they may purchase, and identify which products are most likely to be bought. Currently, there is no option to
leverage machine learning algorithms directly through SQL to produce predictive insights from the data.

With the introduction of statistical functions such as CREATE MODEL, MODEL_EVALUATE, and MODEL_PREDICT,
Data Distiller users will gain the capability to create predictive insights from data stored in the lake. This three-step
querying process enables them to easily generate actionable insights from their data.

Augmenting Fully Featured Machine Learning Platform Use Cases

Data Distiller’s statistics and ML capabilities can play a crucial role in augmenting full-scale ML platforms like
Databricks, Google Cloud AI Platform, Azure Machine Learning and Amazon SageMaker, providing valuable support
for the end-to-end machine learning workflow. Here’s how these features could be leveraged:

Prototyping and Rapid Experimentation

Quick Prototyping: The ability to use SQL-based ML models and transformations allows data scientists and
engineers to quickly prototype models and test different features without setting up complex ML pipelines. This
rapid iteration is particularly valuable in the early stages of feature engineering and model development.

Feature Validation: By experimenting with various feature transformations and basic models within Data
Distiller, users can validate the quality and impact of different features. This ensures that only the most relevant
features are sent for training in full-scale ML platforms, thereby optimizing model performance.

Preprocessing and Feature Engineering

Efficient Feature Processing: Data Distiller’s built-in transformers (e.g., vector assemblers, scalers, and
encoders) can be used for feature engineering and data preprocessing steps. This enables seamless integration
with platforms by preparing the data in a format that is ready for advanced model training.

Automated Feature Selection: With basic statistical and machine learning capabilities, Data Distiller can help
automate feature selection by running simple models to identify the most predictive features before moving to a
full-scale ML environment.

Reducing Development Time and Cost

Cost-Effective Experimentation: By using Data Distiller to conduct initial model experiments and
transformations, teams can avoid the high costs associated with running large-scale ML jobs on platforms. This is
particularly useful when working with large datasets or conducting frequent iterations.

Integrated Workflow: Once features and models are validated in Data Distiller, the results can be easily
transferred to the machine learning platform for full-scale training. This integrated approach streamlines the
development process, reducing the time needed for data preparation and experimentation.

Feature Prototyping: Data Distiller can serve as a testing ground for new features and transformations. For
example, users can build basic predictive models or clustering algorithms to understand the potential of different
features before moving to more complex models on Databricks or SageMaker.

Model Evaluation and Validation: Basic model evaluation (e.g., classification accuracy, regression metrics)
within Data Distiller can help identify promising feature sets. These insights can guide further tuning and training
in full-scale ML environments, reducing the need for costly experiments.

Best Practices for Integration

Modular Approach: Design Data Distiller processes to produce well-defined outputs that can be easily
integrated into downstream ML workflows. For instance, transformed features and initial model insights can be
exported as data artifacts for further training.

Continuous Learning Loop: Use the insights from Data Distiller to inform feature engineering strategies. This
iterative loop ensures that the models trained on full-scale platforms are built on well-curated and optimized data.

Advanced Statistics & Machine Learning Functions in Data Distiller

Data Distiller supports various advanced statistics and machine learning operations through SQL commands, enabling
users to train, evaluate, and apply models directly on data in the lake. The end-to-end workflow consists of the following steps:

Source Data: The process begins with the available source data, which serves as the input for training the
machine learning model.

**CREATE MODEL** Using Training Data: A predictive model is created using the training data. This step
involves selecting the appropriate machine learning algorithm and training it to learn patterns from the data.

**MODEL_EVALUATE** to Check the Accuracy of the Model: The trained model is then evaluated to
measure its accuracy and ensure it performs well on unseen data. This step helps validate the model’s
effectiveness.

**MODEL_PREDICT** to Make Predictions on New Data: Once the model’s accuracy is verified, it is used to
make predictions on new, unseen data, generating predictive insights.
Output Prediction Data: Finally, the predictions are outputted, providing actionable insights based on the
processed data.
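The three statements can be strung together as a minimal end-to-end sketch (the table and column names below are illustrative placeholders, and the second argument to **model_evaluate**/**model_predict** follows the version-number form used in the STATSML 601 tutorial above):

-- 1. Train
CREATE MODEL churn_model
OPTIONS (MODEL_TYPE='logistic_reg', LABEL='churned')
AS
SELECT visits, purchases, churned
FROM training_dataset;

-- 2. Evaluate on held-out data
SELECT *
FROM model_evaluate(churn_model, 1,
SELECT visits, purchases, churned
FROM holdout_dataset);

-- 3. Predict on new data
SELECT *
FROM model_predict(churn_model, 1,
SELECT visits, purchases
FROM scoring_dataset);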

Supported Advanced Statistics & Machine Learning Algorithms

Regression (Supervised)

Linear Regression: Fits a linear relationship between features and a target variable.

Decision Tree Regression: Uses a tree structure to model and predict continuous values.

Random Forest Regression: An ensemble of decision trees that predicts the average output.

Gradient Boosted Tree Regression: Uses an ensemble of trees to minimize prediction error iteratively.

Generalized Linear Regression: Extends linear regression to model non-normal target distributions.

Isotonic Regression: Fits a non-decreasing or non-increasing sequence to the data.

Survival Regression: Models time-to-event data based on the Weibull distribution.

Factorization Machines Regression: Models interactions between features, making it suitable for sparse
datasets and high-dimensional data.

Classification (Supervised)

Logistic Regression: Predicts probabilities for binary or multiclass classification problems.

Decision Tree Classifier: Uses a tree structure to classify data into distinct categories.

Random Forest Classifier: An ensemble of decision trees that classifies data based on majority voting.

Naive Bayes Classifier: Uses Bayes’ theorem with strong independence assumptions between features.

Factorization Machines Classifier: Models interactions between features for classification, making it suitable
for sparse and high-dimensional data.

Linear Support Vector Classifier (LinearSVC): Constructs a hyperplane for binary classification tasks,
maximizing the margin between classes.

Multilayer Perceptron Classifier: A neural network classifier with multiple layers for mapping inputs to outputs
using an activation function.

Clustering (Unsupervised)

K-Means: Partitions data into k clusters based on distance to cluster centroids.

Bisecting K-Means: Uses a hierarchical divisive approach for clustering.

Gaussian Mixture: Models data as a mixture of multiple Gaussian distributions.

Latent Dirichlet Allocation (LDA): Identifies topics in a collection of text documents.

Summary Table of Available Algorithms

| Category | Algorithm | Description |
| --- | --- | --- |
| Regression (Supervised) | Linear Regression | Fits a linear relationship between features and a target variable. |
| Regression (Supervised) | Decision Tree Regression | Uses a tree structure to model and predict continuous values. |
| Regression (Supervised) | Random Forest Regression | An ensemble of decision trees that predicts the average output. |
| Regression (Supervised) | Gradient Boosted Tree Regression | Uses an ensemble of trees to minimize prediction error iteratively. |
| Regression (Supervised) | Generalized Linear Regression | Extends linear regression to model non-normal target distributions. |
| Regression (Supervised) | Isotonic Regression | Fits a non-decreasing or non-increasing sequence to the data. |
| Regression (Supervised) | Survival Regression | Models time-to-event data based on the Weibull distribution. |
| Regression (Supervised) | Factorization Machines Regression | Models interactions between features, making it suitable for sparse datasets and high-dimensional data. |
| Classification (Supervised) | Logistic Regression | Predicts probabilities for binary or multiclass classification problems. |
| Classification (Supervised) | Decision Tree Classifier | Uses a tree structure to classify data into distinct categories. |
| Classification (Supervised) | Random Forest Classifier | An ensemble of decision trees that classifies data based on majority voting. |
| Classification (Supervised) | Naive Bayes Classifier | Uses Bayes' theorem with strong independence assumptions between features. |
| Classification (Supervised) | Factorization Machines Classifier | Models interactions between features for classification, suitable for sparse and high-dimensional data. |
| Classification (Supervised) | Linear Support Vector Classifier (LinearSVC) | Constructs a hyperplane for binary classification tasks, maximizing the margin between classes. |
| Classification (Supervised) | Multilayer Perceptron Classifier | A neural network classifier with multiple layers for mapping inputs to outputs using an activation function. |
| Clustering (Unsupervised) | K-Means | Partitions data into k clusters based on distance to cluster centroids. |
| Clustering (Unsupervised) | Bisecting K-Means | Uses a hierarchical divisive approach for clustering. |
| Clustering (Unsupervised) | Gaussian Mixture | Models data as a mixture of multiple Gaussian distributions. |
| Clustering (Unsupervised) | Latent Dirichlet Allocation (LDA) | Identifies topics in a collection of text documents. |

SQL Syntax for Advanced Statistics & Machine Learning Functions

Use the **CREATE MODEL** command to define a new machine learning model.

CREATE MODEL IF NOT EXISTS my_linear_model


OPTIONS (MODEL_TYPE='linear_reg', MAX_ITER=100, REG_PARAM=0.1)
AS
SELECT feature1, feature2, target_variable
FROM training_dataset;

In this example:

MODEL_TYPE specifies the algorithm.

MAX_ITER sets the number of iterations.

REG_PARAM is the regularization parameter.

Note that the syntax does not support reading from a **TEMP** table and does not allow wrapping the training query in parentheses, such as:

(SELECT feature1, feature2, target_variable


FROM training_dataset);

Creating a Model with Preprocessing

The **TRANSFORM** clause allows you to preprocess features before training.

CREATE MODEL my_classification_model


TRANSFORM (
binarizer(numeric_feature, 50) as binarized_feature,
string_indexer(categorical_feature) as indexed_feature,
vector_assembler(array(binarized_feature, indexed_feature)) as features
)
OPTIONS (MODEL_TYPE='logistic_reg', LABEL='label_column')
AS
SELECT numeric_feature, categorical_feature, label_column
FROM training_data;

This example demonstrates:

Binarizing a numeric feature.

Indexing a categorical feature.

Assembling multiple features into a vector.

Feature Transformation Functions

Feature transformation is the process of extracting meaningful features from raw data to enhance the accuracy of
downstream statistical models. The Data Distiller feature engineering SQL extension provides a comprehensive suite
of techniques that streamline and automate data preprocessing. These functions allow for seamless, efficient data
preparation and enable easy experimentation with various feature engineering methods. Designed for distributed
computing, the SQL extension supports feature engineering on large datasets in a parallel and scalable manner,
significantly reducing the time needed for preprocessing.

Feature transformation is broadly used for the following purposes:

1. Extraction: Extracts important information from data columns, helping models to identify key signals. For
example, in textual data, long sentences may contain irrelevant words that need to be removed to improve model
performance.
2. Transformation: Converts raw data into a format that machine learning models can consume. Since models
understand numbers but not text, transformers are used to convert non-numerical data into numerical features.

Define custom preprocessing steps using the TRANSFORM clause.

CREATE MODEL custom_transform_model


TRANSFORM (
numeric_imputer(missing_numeric_column, 'mean') as imputed_column,
binarizer(imputed_column, 0.0) as binarized_column
)
OPTIONS (MODEL_TYPE='logistic_reg', LABEL='outcome')
AS
SELECT missing_numeric_column, outcome
FROM raw_dataset;

If the TRANSFORM clause is omitted, Data Distiller performs basic preprocessing.

Data Distiller Transformers

Several transformers can be used for feature engineering:

Numeric Imputer

Description: Fills missing numeric values using a specified strategy such as “mean,” “median,” or “mode.”

Example:

TRANSFORM (numeric_imputer(age, 'median') as age_imputed)

String Imputer

Description: Replaces missing string values with a specified string.

Example:

TRANSFORM (string_imputer(city, 'unknown') as city_imputed)

Boolean Imputer

Description: Completes missing values in a boolean column using a specified boolean value.

Example:

TRANSFORM (boolean_imputer(has_account, true) as account_imputed)

Vector Assembler

Description: Combines multiple columns into a single vector column. Useful for creating feature vectors from
multiple features.

Example:

TRANSFORM (vector_assembler(array(col1, col2)) as feature_vector)

Binarizer

Description: Converts a numeric column to a binary value (0 or 1) based on a specified threshold.

Example:

TRANSFORM (binarizer(rating, 10.0) as binarized_rating)

Bucketizer

Description: Splits a continuous numeric column into discrete bins based on specified thresholds.

Example:
TRANSFORM (bucketizer(age, array(18, 30, 50)) as age_group)

String Indexer

Description: Converts a column of strings into a column of indexed numerical values, typically used for
categorical features.

Example:

TRANSFORM (string_indexer(category) as indexed_category)

One-Hot Encoder

Description: Converts categorical features represented as indices into a one-hot encoded vector.

Example:

TRANSFORM (one_hot_encoder(indexed_category) as encoded_category)

Standard Scaler

Description: Standardizes a numeric column by removing the mean and scaling to unit variance.

Example:

TRANSFORM (standard_scaler(income) as scaled_income)

Min-Max Scaler

Description: Scales a numeric column to a specified range, typically [0, 1].

Example:

TRANSFORM (min_max_scaler(income, 0, 1) as scaled_income)

Max-Abs Scaler

Description: Scales a numeric column by dividing each value by the maximum absolute value in that column.

Example:

TRANSFORM (max_abs_scaler(weight) as scaled_weight)

Normalizer

Description: Normalizes a vector to have unit norm, typically used for scaling individual samples.

Example:

TRANSFORM (normalizer(feature_vector) as normalized_features)

Polynomial Expansion

Description: Expands a vector of features into a polynomial feature space.

Example:

TRANSFORM (polynomial_expansion(features, 2) as poly_features)

Chi-Square Selector

Description: Selects the top features based on the Chi-Square test of independence.

Example:

TRANSFORM (chi_square_selector(features, 3) as selected_features)

PCA (Principal Component Analysis)

Description: Reduces the dimensionality of the data by projecting it onto a lower-dimensional subspace.

Example:

TRANSFORM (pca(features, 5) as pca_features)


Feature Hasher

Description: Converts categorical features into numerical features using the hashing trick, resulting in a fixed-
length feature vector.

Example:

TRANSFORM (feature_hasher(array(col1, col2), 100) as hashed_features)

Stop Words Remover

Description: Removes common stop words from a column of text data.

Example:

TRANSFORM (stop_words_remover(text_column) as cleaned_text)

N-gram

Description: Converts a column of text data into a sequence of n-grams.

Example:

TRANSFORM (ngram(words, 2) as bigrams)

Tokenizer

Description: Splits a string column into a list of words.

Example:

TRANSFORM (tokenizer(sentence) as words)

TF-IDF (Term Frequency-Inverse Document Frequency)

Description: TF-IDF is a statistic that reflects how important a word is to a document within a collection or
corpus. It is widely used in text mining and natural language processing to transform text data into numerical
features. Given a term t, a document d, and a corpus D:

Term Frequency (TF) measures how often a term appears in a document: it is the number of times term t
appears in document d.

Document Frequency (DF) counts how many documents contain the term: it is the number of documents in
the corpus D that include the term t.

Using only term frequency can overemphasize terms that appear frequently but carry little meaningful
information (e.g., “a,” “the,” “of”). TF-IDF addresses this by weighting terms inversely proportional to their
frequency across the corpus, thus highlighting terms that are more informative for a particular document.

Example:

TRANSFORM (tf_idf(tokenized_text) as tfidf_features)

TF-IDF helps in converting a collection of text documents into a matrix of numerical features that can be used as input
for machine learning models. It is particularly useful for feature extraction in text classification tasks, sentiment
analysis, and information retrieval.
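For reference, a common smoothed formulation of the statistic (for example, the one used by Spark MLlib; the exact variant Data Distiller applies is not spelled out here) combines the two counts as:

$$\text{IDF}(t, D) = \log\frac{|D| + 1}{\text{DF}(t, D) + 1},\qquad \text{TFIDF}(t, d, D) = \text{TF}(t, d)\cdot \text{IDF}(t, D)$$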

Word2Vec

Description: **Word2Vec** is an estimator that takes sequences of words representing documents and trains a
**Word2VecModel**. The model maps each word to a unique fixed-size vector in a continuous vector space.
The Word2VecModel then transforms each document into a vector by averaging the vectors of all the words in
the document. This technique is widely used in natural language processing (NLP) tasks to capture the semantic
meaning of words and represent them in a numerical format suitable for machine learning models.

Example:
TRANSFORM (
tokenizer(review) as tokenized,
word2vec(tokenized, 10, 1) as word2vec_features
)

In this example:

The tokenizer transformer splits the input text into individual words.

The word2vec transformer generates a fixed-size vector (with a specified size of 10) for each word in the
sequence and computes the average vector for all words in the document.

**Word2Vec** is commonly used to convert text data into numerical features, allowing machine learning
algorithms to process textual information while capturing semantic relationships between words.

CountVectorizer

Description: The CountVectorizer is used to convert a collection of text documents into vectors of token counts.
It generates sparse representations for the documents based on the vocabulary, allowing further processing by
algorithms such as Latent Dirichlet Allocation (LDA) and other text analysis techniques. The output is a sparse
vector where the value of each element represents the count of a term in the document.

Input Data Type: array[string]

Output Data Type: Sparse vector

Parameters:

**VOCAB_SIZE**: The maximum size of the vocabulary. The CountVectorizer will build a vocabulary
that considers only the top vocabSize terms, ordered by term frequency across the corpus.

**MIN_DOC_FREQ**: Specifies the minimum number of different documents a term must appear in to be
included in the vocabulary. If set as an integer, it indicates the number of documents; if a double in [0,1),
it indicates a fraction of documents.

**MAX_DOC_FREQ**: Specifies the maximum number of different documents a term could appear in to
be included in the vocabulary. Terms appearing more than the threshold are ignored. If set as an integer, it
indicates the maximum number of documents; if a double in [0,1), it indicates the maximum fraction of
documents.

**MIN_TERM_FREQ**: Filters out rare words in a document. Terms with a frequency lower than the
threshold in a document are ignored. If an integer, it specifies the count; if a double in [0,1), it specifies a
fraction.

Example:

TRANSFORM (
count_vectorizer(texts) as cv_output
)

Summary Table of Transformers

| Transformer | Description | Example |
| --- | --- | --- |
| numeric_imputer | Fills missing numeric values using “mean,” “median,” or “mode.” | TRANSFORM (numeric_imputer(age, 'median') as age_imputed) |
| string_imputer | Replaces missing string values with a specified string. | TRANSFORM (string_imputer(city, 'unknown') as city_imputed) |
| boolean_imputer | Completes missing values in a boolean column using a specified boolean value. | TRANSFORM (boolean_imputer(has_account, true) as account_imputed) |
| vector_assembler | Combines multiple columns into a single vector column. | TRANSFORM (vector_assembler(array(col1, col2)) as feature_vector) |
| binarizer | Converts a numeric column to a binary value (0 or 1) based on a specified threshold. | TRANSFORM (binarizer(rating, 10.0) as binarized_rating) |
| bucketizer | Splits a continuous numeric column into discrete bins based on specified thresholds. | TRANSFORM (bucketizer(age, array(18, 30, 50)) as age_group) |
| string_indexer | Converts a column of strings into indexed numerical values. | TRANSFORM (string_indexer(category) as indexed_category) |
| one_hot_encoder | Converts categorical features represented as indices into a one-hot encoded vector. | TRANSFORM (one_hot_encoder(indexed_category) as encoded_category) |
| standard_scaler | Standardizes a numeric column by removing the mean and scaling to unit variance. | TRANSFORM (standard_scaler(income) as scaled_income) |
| min_max_scaler | Scales a numeric column to a specified range, typically [0, 1]. | TRANSFORM (min_max_scaler(income, 0, 1) as scaled_income) |
| max_abs_scaler | Scales a numeric column by dividing each value by the maximum absolute value in the column. | TRANSFORM (max_abs_scaler(weight) as scaled_weight) |
| normalizer | Normalizes a vector to have unit norm. | TRANSFORM (normalizer(feature_vector) as normalized_features) |
| polynomial_expansion | Expands a vector of features into a polynomial feature space. | TRANSFORM (polynomial_expansion(features, 2) as poly_features) |
| chi_square_selector | Selects top features based on the Chi-Square test of independence. | TRANSFORM (chi_square_selector(features, 3) as selected_features) |
| pca (Principal Component Analysis) | Reduces data dimensionality by projecting onto a lower-dimensional subspace. | TRANSFORM (pca(features, 5) as pca_features) |
| feature_hasher | Converts categorical features into numerical features using the hashing trick. | TRANSFORM (feature_hasher(array(col1, col2), 100) as hashed_features) |
| stop_words_remover | Removes common stop words from a text data column. | TRANSFORM (stop_words_remover(text_column) as cleaned_text) |
| ngram | Converts text data into a sequence of n-grams. | TRANSFORM (ngram(words, 2) as bigrams) |
| tokenizer | Splits a string column into a list of words. | TRANSFORM (tokenizer(sentence) as words) |
| tf_idf | Converts a collection of text documents to a matrix of numerical features. | TRANSFORM (tf_idf(tokenized_text) as tfidf_features) |
| word2vec | Maps words to a vector space and averages vectors for each document. | TRANSFORM (word2vec(tokenized, 10, 1) as word2vec_features) |
| count_vectorizer | Converts text documents to vectors of token counts. | TRANSFORM (count_vectorizer(texts) as cv_output) |

Hyper-parameter Tuning and Model Configuration

Set hyper-parameters using the OPTIONS clause to optimize model performance.

Example:

CREATE MODEL tuned_random_forest


OPTIONS (
MODEL_TYPE='random_forest_regression',
NUM_TREES=50,
MAX_DEPTH=10
)
AS
SELECT feature1, feature2, target
FROM training_data;

Predicting Customer Churn Using Logistic Regression

CREATE MODEL customer_churn_model


OPTIONS (MODEL_TYPE='logistic_reg', LABEL='churn')
AS
SELECT age, income, num_purchases, churn
FROM customer_data;

Clustering Customers Based on Purchase Behavior

CREATE MODEL customer_clusters


OPTIONS (MODEL_TYPE='kmeans', NUM_CLUSTERS=5, MAX_ITER=20)
AS
SELECT purchase_amount, num_items_bought
FROM transaction_data;

CREATE MODEL topic_model


OPTIONS (MODEL_TYPE='lda', NUM_CLUSTERS=10)
AS
SELECT document_text
FROM text_data;

Model Evaluation and Prediction

SELECT * FROM MODEL_EVALUATE(customer_churn_model, SELECT age, income,


num_purchases FROM new_data);

SELECT * FROM MODEL_PREDICT(customer_churn_model, SELECT age, income,


num_purchases FROM new_data);

Best Practices and Recommendations

Use vector assemblers to combine related features.

Perform feature scaling (e.g., normalization) where applicable.

Choose models based on the problem type (e.g., classification vs. regression).
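Putting the first two recommendations together, here is a minimal sketch that scales two numeric columns and assembles them into a feature vector before training. It reuses the customer_data columns from the churn example above; referencing an alias defined earlier in the same TRANSFORM clause follows the pattern of the preprocessing example shown earlier in this section.

CREATE MODEL scaled_churn_model
TRANSFORM (
    standard_scaler(age) as age_scaled,
    standard_scaler(income) as income_scaled,
    vector_assembler(array(age_scaled, income_scaled, num_purchases)) as features
)
OPTIONS (MODEL_TYPE='logistic_reg', LABEL='churn')
AS
SELECT age, income, num_purchases, churn
FROM customer_data;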

Most Common Regression Algorithm Parameters

The detailed list is here

Linear Regression 'linear_reg'

Maximum number of iterations for optimization.

Regularization parameter for controlling model complexity.

Mixing parameter for ElasticNet regularization (L1 vs. L2 penalty).

Decision Tree Regression 'decision_tree_regression'

Maximum number of bins for discretizing continuous features.

Whether to cache node IDs for training deeper trees.

How often to checkpoint cached node IDs during training.

Criterion for information gain calculation (“variance” used for regression).

Maximum depth of the tree.

Random Forest Regression

'random_forest_regression'

Number of trees in the forest.


Maximum depth of each tree in the forest.

Fraction of data used to train each tree.

Strategy for selecting features for each split.

“auto”, “all”, “sqrt”, “log2”

Criterion for information gain calculation (“variance” used for regression).

Gradient Boosted Tree Regression 'gradient_boosted_tree_regression'

Maximum number of iterations (equivalent to the number of trees).

Step size (learning rate) for scaling the contribution of each tree.

Loss function to be minimized during training.

Generalized Linear Regression 'generalized_linear_reg'

Maximum number of iterations for optimization.

Regularization parameter for controlling model complexity.

Family of distributions for the response variable (e.g., Gaussian, Poisson).

“gaussian”, “binomial”, “poisson”, “gamma”, “tweedie”

Isotonic Regression 'isotonic_regression'

Whether the output sequence should be isotonic (increasing) or antitonic (decreasing).

Survival Regression 'survival_regression'

Maximum number of iterations for optimization.

Convergence tolerance for optimization.

Factorization Machines Regression 'factorization_machines_regression'

Convergence tolerance for optimization.

Dimensionality of the factors.

Whether to fit an intercept term.

Whether to fit linear terms (1-way interactions).

Standard deviation of initial coefficients.

Number of iterations for the algorithm.

Fraction of data used in each mini-batch.

Regularization parameter.

Random seed for reproducibility.


Solver algorithm used for optimization.

Initial step size for the first step.

Name of the column for prediction output.

Most Common Classification Algorithm Parameters

The detailed list is here

Logistic Regression 'logistic_reg'

Maximum number of iterations for optimization.

Regularization parameter for controlling model complexity.

Mixing parameter for ElasticNet regularization (L1 vs. L2 penalty).

Whether to fit an intercept term in the model.

Convergence tolerance for optimization.

Column name for predicted class probabilities.

Column name for raw prediction output (confidence scores).

Thresholds for binary or multiclass classification.

Decision Tree Classifier 'decision_tree_classifier'

Maximum number of bins for discretizing continuous features.

Whether to cache node IDs for training deeper trees.

How often to checkpoint cached node IDs during training.

Criterion for information gain calculation.

Maximum depth of the tree.

Minimum information gain required for a split at a node.

Minimum number of instances required in each child after a split.

Random seed for reproducibility.

Column name for sample weights.

Random Forest Classifier 'random_forest_classifier'

Number of trees in the forest.

Maximum number of bins for discretizing continuous features.

Maximum depth of each tree in the forest.


Criterion for information gain calculation.

Fraction of data used to train each tree.

Strategy for selecting features for each split.

“auto”, “all”, “sqrt”, “log2”

Whether to use bootstrap sampling when building trees.

Random seed for reproducibility.

Column name for sample weights.

Column name for predicted class probabilities.

Column name for raw prediction output (confidence scores).

Naive Bayes Classifier 'naive_bayes_classifier'

Type of Naive Bayes model used (e.g., multinomial, bernoulli).

“multinomial”, “bernoulli”, “gaussian”

Smoothing parameter to prevent zero probabilities.

Column name for predicted class probabilities.

Column name for raw prediction output (confidence scores).

Column name for sample weights.

Factorization Machines Classifier 'factorization_machines_classifier'

Convergence tolerance for optimization.

Dimensionality of the factors.

Whether to fit an intercept term.

Whether to fit linear terms (1-way interactions).

Standard deviation of initial coefficients.

Number of iterations for the algorithm.

Fraction of data used in each mini-batch.

Regularization parameter.

Random seed for reproducibility.

Solver algorithm used for optimization.

Initial step size for the first step.

Column name for predicted class conditional probabilities.


Name of the column for prediction output.

Column name for raw prediction (confidence scores).

Whether to enable one-vs-rest classification.

Linear Support Vector Classifier 'linear_svc_classifier'

Number of iterations for optimization.

Suggested depth for tree aggregation.

Whether to fit an intercept term.

Convergence tolerance for optimization.

Maximum memory in MB for stacking input data into blocks.

Regularization parameter.

Whether to standardize the training features.

Name of the column for prediction output.

Column name for raw prediction (confidence scores).

Whether to enable one-vs-rest classification.

Multilayer Perceptron Classifier 'multilayer_perceptron_classifier'

Number of iterations for the algorithm.

Block size for stacking input data in matrices.

Step size for each iteration of optimization.

Convergence tolerance for optimization.

Name of the column for prediction output.

Random seed for reproducibility.

Column name for predicted class conditional probabilities.

Column name for raw prediction (confidence scores).

Whether to enable one-vs-rest classification.

Gradient Boosted Tree Classifier 'gradient_boosted_tree_classifier'

Maximum number of bins used for discretizing continuous features and choosing how to split on features at each node.
More bins give higher granularity.

Must be >= 2 and >= number of categories in any categorical feature

If false, the algorithm passes trees to executors to match instances with nodes. If true, node IDs for each instance are
cached to speed up training of deeper trees.
Specifies how often to checkpoint the cached node IDs (e.g., 10 means checkpoint every 10 iterations). This is used
only if cacheNodeIds is true and the checkpoint directory is set in SparkContext.

Maximum depth of the tree. For example, depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
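To make this concrete, here is a minimal sketch of a gradient boosted tree classifier built on the customer_data example from earlier in this section. MODEL_TYPE, MAX_DEPTH, and LABEL appear in the examples above; MAX_BINS is assumed here to be the option key for the bin count described above, so verify it against the detailed parameter list before relying on it.

CREATE MODEL gbt_churn_classifier
OPTIONS (
    MODEL_TYPE = 'gradient_boosted_tree_classifier',
    MAX_DEPTH = 5,    -- shallow trees to limit overfitting
    MAX_BINS = 32,    -- assumed option key for the maximum number of bins described above
    LABEL = 'churn'
)
AS
SELECT age, income, num_purchases, churn
FROM customer_data;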

Most Common Unsupervised Algorithm Parameters

The detailed list is here

K-Means 'kmeans'

Maximum number of iterations for the clustering algorithm.

Convergence tolerance for the iterative algorithm.

Number of clusters to form.

Distance measure used for clustering.

Initialization algorithm for cluster centers.

Number of steps for the k-means|| initialization algorithm.

Column name for the predicted cluster.

Random seed for reproducibility.

Column name for sample weights.

Bisecting K-Means 'bisecting_kmeans'

Maximum number of iterations for the clustering algorithm.

Number of leaf clusters to form.

Distance measure used for clustering.

MIN_DIVISIBLE_CLUSTER_SIZE

Minimum number of points for a divisible cluster.

Column name for the predicted cluster.

Random seed for reproducibility.

Column name for sample weights.

Gaussian Mixture 'gaussian_mixture'

Maximum number of iterations for the EM algorithm.

Number of Gaussian distributions in the mixture model.

Convergence tolerance for iterative algorithms.

Depth for tree aggregation during the EM algorithm.

Column name for predicted class conditional probabilities.


Column name for the predicted cluster.

Random seed for reproducibility.

Column name for sample weights.

Latent Dirichlet Allocation (LDA) 'lda'

Maximum number of iterations for the algorithm.

Optimizer used to estimate the LDA model.

Number of topics to identify.

Concentration parameter for the prior placed on documents’ distributions over topics.

Concentration parameter for the prior placed on topics’ distributions over terms.

Learning rate for the online optimizer.

Learning parameter that downweights early iterations for the online optimizer.

Fraction of the corpus used for each iteration of mini-batch gradient descent.

OPTIMIZE_DOC_CONCENTRATION

Whether to optimize the doc concentration during training.

Frequency of checkpointing the cached node IDs.

Random seed for reproducibility.

Output column with estimates of the topic mixture distribution for each document.

Steps in a predictive flow.

https://data-distiller.all-stuff-data.com/unit-8-data-distiller-statistics-and-machine-learning/statsml-603-predicting-
customer-conversion-scores-using-random-forest-in-data-distiller * * *

Download the following datasets

Ingest the above datasets using:

Make sure you have read

Businesses aim to optimize marketing efforts by identifying customer behaviors that lead to conversions (e.g.,
purchases). Using SQL-based feature engineering and a Random Forest model, we can analyze user interactions,
extract actionable insights, and predict the likelihood of conversions.

A retail company tracks website activity, including page views, purchases, and campaign interactions. They want to:

1. Understand Customer Behavior: Analyze aggregated session data such as visit frequency, page views, and
campaign participation.

2. Predict Conversions: Use historical data to predict whether a specific user interaction will result in a purchase.
3. Optimize Engagement: Focus marketing campaigns and resources on high-conversion-probability customers to
maximize ROI.

Random Forest Regression Model

A Random Forest is an ensemble machine learning algorithm that uses multiple decision trees to make predictions. It
is a type of supervised learning algorithm widely employed for both classification and regression tasks. By combining
the predictions of several decision trees, Random Forest enhances accuracy and reduces the risk of overfitting, making
it a robust and reliable choice for a variety of machine learning problems.

The algorithm works by constructing multiple decision trees during training. Each tree is trained on a random subset of
the data and features, a technique known as bagging. For classification problems, Random Forest aggregates the
predictions of individual trees using majority voting. In regression problems, it averages the predictions across trees to
determine the final output. By selecting random subsets of features for training, Random Forest reduces the correlation
between individual trees, leading to improved overall prediction accuracy.

In this use case, the goal is to predict the score of user conversion based on web event data. Random Forest is
particularly well-suited to this scenario for several reasons. First, it handles mixed data types seamlessly. The dataset
contains both categorical features, such as browser and campaign IDs, and numerical features, like page views and
purchases. Random Forest accommodates these variations without requiring extensive preprocessing.

Additionally, Random Forest is robust against noise and overfitting. Web activity data often contains irrelevant features
or noisy observations. By averaging predictions across trees, the algorithm reduces the influence of noisy data and
avoids overfitting, ensuring more reliable predictions. Furthermore, Random Forest provides valuable insights into
feature importance, helping to identify which factors, such as page views or campaign IDs, contribute most
significantly to user conversions.

Another advantage of Random Forest is its ability to model non-linear relationships. User conversion likelihood is
often influenced by complex interactions between features. Random Forest captures these relationships effectively
without requiring explicit feature engineering. The algorithm is also scalable, capable of handling large datasets with
millions of user sessions, thanks to its parallel computation capabilities.

Random Forest is flexible for regression tasks, which is crucial for this use case where the target variable is a
conversion score between 0 and 1. Its inherent design makes it ideal for predicting continuous outcomes. In contrast, a
single decision tree, while simpler, is prone to overfitting, especially in datasets with many features and potential noise.
Random Forest mitigates this limitation by averaging the predictions of multiple trees, delivering more generalizable
and robust results.

Rule-Based Labeling for Conversion Scoring: Automating Data Annotation with Data Distiller

Using SQL transformations to encode features and prepare the dataset:

-- Create a transformed dataset
CREATE TABLE transformed_webevents AS
SELECT
visit_id,
UPPER(country_cd) AS country_encode,
campaign_id,
browser_id,
operating_system_id,
COUNT(*) AS visits,
SUM(pageviews) AS total_pageviews,
SUM(purchases) AS total_purchases,
CASE
WHEN SUM(purchases) > 0 THEN 1
ELSE 0
END AS converted
FROM webevents_train
GROUP BY visit_id, country_cd, campaign_id, browser_id, operating_system_id;

Note that in the model definition below:

**string_indexer** encodes categorical features (**visit_id**, **country_cd**, **campaign_id**, **browser_id**, **operating_system_id**).

**vector_assembler** combines encoded categorical and numerical features (**visits**, **pageviews**, **purchases**) into a single feature vector.

**standard_scaler** scales this feature vector to normalize values for training and enhance model performance.

Note that we are using a simple CASE statement to assign a binary conversion label in our data. This choice has a few implications:

Loss of Nuance: By converting the target variable to a binary 0 or 1, we may lose information about the magnitude of
purchases. For instance, a user with one purchase is treated the same as a user with multiple purchases. In cases where
we want to predict the extent of engagement or the volume of purchases, this binary target may not capture the full
range of user behavior.

Suitability for Regression: Since we are using random forest regression, which is typically better suited for
continuous targets, applying it to a binary target might not be ideal. Random forest regression will still function, but it
may not fully leverage the model’s strengths in predicting continuous outcomes. If our primary goal is to predict
conversion likelihood (0 or 1), a classifier like random forest classification might be more appropriate.

Alternatives: If we have access to more granular data on the number of purchases, we could consider using a different
target variable that reflects this information, such as the count of purchases or the monetary value of purchases. Using
a continuous target with random forest regression could enable the model to capture the full range of behaviors, giving
us insights into not just who is likely to convert but also to what extent they engage in purchases. Alternatively, if our
primary objective is binary conversion prediction, we could use a random forest classifier to better align with the
binary nature of our target.
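If granular purchase data is available, one way to act on the alternative described above is to label with a continuous engagement score instead of a binary flag. The sketch below reuses the webevents_train table from this tutorial; the capped score of SUM(purchases) / 5.0 is an arbitrary illustration, not a recommended formula.

-- Hypothetical continuous label: scale purchase counts into a 0-1 score
CREATE TABLE transformed_webevents_scored AS
SELECT
    visit_id,
    UPPER(country_cd) AS country_encode,
    campaign_id,
    browser_id,
    operating_system_id,
    COUNT(*) AS visits,
    SUM(pageviews) AS total_pageviews,
    SUM(purchases) AS total_purchases,
    LEAST(SUM(purchases) / 5.0, 1.0) AS conversion_score  -- cap at 1.0; divisor is illustrative
FROM webevents_train
GROUP BY visit_id, country_cd, campaign_id, browser_id, operating_system_id;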

Build the Random Forest Model

CREATE MODEL random_forest_model
TRANSFORM (
string_indexer(visit_id) AS si_id,
string_indexer(country_encode) AS country_code,
string_indexer(campaign_id) AS campaign_encode,
string_indexer(browser_id) AS browser_encode,
string_indexer(operating_system_id) AS os_encode,
vector_assembler(array(si_id, country_code, campaign_encode,
browser_encode, os_encode, visits, total_pageviews, total_purchases)) AS
features,
standard_scaler(features) AS scaled_features
)
OPTIONS (
MODEL_TYPE = 'random_forest_regression',
NUM_TREES = 20,
MAX_DEPTH = 5,
LABEL = 'converted'
)
AS
SELECT *
FROM transformed_webevents;

The result will be:

Evaluate the model using test data:

SELECT *
FROM model_evaluate(
random_forest_model,
1, -- Validation split percentage (1 for 100% evaluation on provided data)
SELECT
visit_id,
country_cd AS country_encode,
campaign_id,
browser_id,
operating_system_id,
COUNT(*) AS visits,
SUM(pageviews) AS total_pageviews,
SUM(purchases) AS total_purchases,
CASE
WHEN SUM(purchases) > 0 THEN 1
ELSE 0
END AS converted
FROM webevents_test
GROUP BY
visit_id,
country_cd,
campaign_id,
browser_id,
operating_system_id
);

The results are:

Here’s what each metric means in the context of your Random Forest model evaluation:

Root Mean Squared Error (RMSE): RMSE is a metric that measures the average magnitude of the errors between
the predicted values and the actual values in your test dataset. It is the square root of the average squared differences
between predictions and actuals. In this case, an RMSE of 0.048 indicates that the model’s predictions are, on
average, about 0.048 away from the actual conversion likelihood values. Since RMSE is on the same scale as the
target variable (in this case, a probability score between 0 and 1 for conversion likelihood), a lower RMSE suggests
that the model’s predictions are relatively accurate.

R-squared (R²): R², or the coefficient of determination, measures the proportion of variance in the dependent variable
(conversion likelihood) that is predictable from the independent variables (features). An R² value of 0.9907 indicates
that the model explains approximately 99.07% of the variance in the conversion likelihoods. This is a high R² value,
which suggests that the model fits the data very well and that the features used in the model account for almost all of
the variability in conversion outcomes.
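For reference, with actuals $y_i$, predictions $\hat{y}_i$, and mean actual $\bar{y}$ over $n$ test rows, the two metrics are defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$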

Model Accuracy: The combination of a low RMSE and a high R² value suggests that your Random Forest model is performing exceptionally well in predicting conversion likelihood.

Suitability for Use: These results indicate that the model is reliable for predicting conversions based on the test dataset, and it is likely capturing meaningful patterns in the data.

If this performance holds across additional data (e.g., an inference dataset or real-world data), the model can be a
valuable tool for predicting user conversions and guiding targeted marketing efforts. However, it’s essential to validate
the model with real-world data periodically, as models trained on historical data may degrade in accuracy over time.

Use the model for prediction on new data:

SELECT *
FROM model_predict(
random_forest_model,
1, -- Validation split percentage (1 for 100% evaluation on provided data)
SELECT
visit_id,
country_cd AS country_encode,
campaign_id,
browser_id,
operating_system_id,
COUNT(*) AS visits,
SUM(pageviews) AS total_pageviews,
SUM(purchases) AS total_purchases,
CASE
WHEN SUM(purchases) > 0 THEN 1
ELSE 0
END AS converted
FROM webevents_inference
GROUP BY
visit_id,
country_cd,
campaign_id,
browser_id,
operating_system_id
);

Creating the feature set.

Results of the evaluation.
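To make the predicted scores usable downstream (for example, for audience building), the prediction output can be written back to a dataset. This is a minimal sketch that assumes model_predict can be wrapped in a CREATE TABLE AS statement like any other SELECT; scored_webevents is a hypothetical output table name.

CREATE TABLE scored_webevents AS
SELECT *
FROM model_predict(
    random_forest_model,
    1,
    SELECT
        visit_id,
        country_cd AS country_encode,
        campaign_id,
        browser_id,
        operating_system_id,
        COUNT(*) AS visits,
        SUM(pageviews) AS total_pageviews,
        SUM(purchases) AS total_purchases,
        CASE WHEN SUM(purchases) > 0 THEN 1 ELSE 0 END AS converted
    FROM webevents_inference
    GROUP BY visit_id, country_cd, campaign_id, browser_id, operating_system_id
);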

https://data-distiller.all-stuff-data.com/unit-8-data-distiller-statistics-and-machine-learning/statsml-602-techniques-for-
bot-detection-in-data-distiller * * *

Download the following datasets

Ingest them by following the tutorial for each:

Make sure you have read:

Bots are software applications designed to perform automated tasks over the internet, often at a high frequency and
with minimal human intervention. They can be used for a variety of purposes, ranging from beneficial tasks like
indexing websites for search engines to malicious activities such as spamming, scraping content, or launching denial-
of-service attacks. Bots are typically programmed to mimic human behavior and can be controlled remotely, allowing
them to interact with websites, applications, and services just like a human user would, albeit at a much faster and
more repetitive pace.
Bots are implemented using scripts or programs that automate specific actions, often through APIs (Application
Programming Interfaces) or web automation frameworks like Selenium. Developers use programming languages such
as Python, JavaScript, or Java to write bot scripts that simulate clicks, form submissions, or page requests. For
complex tasks, bots may incorporate machine learning algorithms to enhance their ability to mimic human-like
interactions, avoiding detection by bot-filtering systems. Bot networks, or “botnets,” are collections of bots controlled
from a central server, enabling large-scale automated activity. While bots are essential for applications like search
engines and customer service chatbots, their misuse necessitates robust detection and filtering mechanisms to protect
the integrity of online platforms and data.

Why Bot Filtering Matters: Protecting Data Quality and Driving Accurate Insights

Bots often produce high-frequency, repetitive actions, while normal users generally produce fewer actions at irregular
intervals.

Bot filtering is essential to ensure the integrity and quality of web traffic data. Bots, or non-human interactions, can
inflate metrics like page views, clicks, and sessions, leading to inaccurate analytics and poor decision-making. In
Adobe Experience Platform, bot filtering can be implemented using SQL within the Query Service, enabling
automated detection and filtering of bot-like activity from clickstream data.

Allowing bot activity to infiltrate the Real-Time Customer Data Platform (CDP) or Customer Journey Analytics can
significantly degrade the quality and reliability of insights. Bots can generate large volumes of fake interactions,
diluting the data used to segment audiences, personalize experiences, and trigger automated actions. This
contamination can lead to inaccurate customer profiles, where bots are mistakenly treated as real customers, impacting
everything from marketing spend to product recommendations.

Moreover, inflated metrics from bot traffic can lead to incorrect entitlement calculations, potentially resulting in over-
licensing issues, which affects cost efficiency. In environments where businesses are charged based on active users or
usage volume, bot-induced data can escalate costs, consuming resources allocated for real customers. Overall, bot
contamination in a CDP undermines the platform’s ability to deliver accurate, actionable insights, compromising the
effectiveness of customer engagement strategies and reducing return on investment in marketing and analytics
platforms.

However, keeping a copy of bot data on the data lake can be beneficial for several reasons. First, retaining bot data
enables teams to continuously refine and improve bot-detection algorithms. By analyzing historical bot behavior, data
scientists and engineers can identify evolving patterns and adapt filtering rules, which can enhance future bot filtering
and maintain data integrity in real-time analytics environments. Additionally, bot data can serve as a valuable training
dataset for machine learning models, which can distinguish between bot and human behavior more accurately over
time. For security and compliance teams, archived bot data can provide insights into potential malicious activities,
allowing for faster responses to threats and better protection measures. Storing bot data on the data lake also supports
compliance, enabling organizations to audit and track how they manage non-human interactions if required. Therefore,
while it’s important to filter bot data from production datasets to maintain accurate customer insights, keeping an
archived copy on the data lake provides value across analytics, security, and compliance domains.

Bot filtering, anomaly detection, and fraud detection share the common goal of identifying unusual patterns in data,
but each serves a distinct purpose. Bot filtering focuses on distinguishing and removing non-human, automated
interactions from datasets to ensure that analytics accurately reflect real user behavior. Anomaly detection is a broader
process aimed at identifying any unusual or unexpected data points or trends, which may indicate system issues, data
errors, or emerging trends. Fraud detection is a specialized type of anomaly detection, specifically designed to identify
suspicious and potentially harmful behaviors, such as fraudulent transactions or malicious activities, by detecting
complex patterns that are often subtle and well-hidden. While bot filtering primarily relies on rules and thresholds to
detect high-frequency, repetitive behaviors typical of bots, anomaly and fraud detection increasingly leverage machine
learning models and sophisticated pattern recognition techniques to uncover irregularities. Each method is essential in
maintaining data integrity, safeguarding against threats, and enabling more reliable insights across various domains.

Decision Tree Classifier and Bot Detection

A decision tree is a supervised machine learning algorithm used for classification and regression tasks. It operates by
recursively splitting data into subsets based on the feature values that provide the best separation. Each internal node
represents a decision on a feature, each branch represents the outcome of the decision, and each leaf node represents a
final class label or prediction. The algorithm aims to find the most informative features to split the data, maximizing the
purity (homogeneity) of the resulting subsets. Popular metrics for these splits include Gini Impurity, Entropy, and
Information Gain.

Key Characteristics of Decision Trees:

Simple and Intuitive: Easy to visualize and interpret.

Handles Nonlinear Data: Captures complex relationships between features and labels without requiring feature
scaling.

Rule-Based: The hierarchical structure maps directly to logical rules, making them interpretable for domain-
specific tasks.

Bot detection typically involves identifying patterns of behavior that distinguish bots from real users. Decision trees
are well-suited for this task for several reasons:

1. Ability to Handle Mixed Data: Bot detection often involves both numerical features (e.g., counts of actions per
interval) and categorical features (e.g., action types). Decision trees can natively handle both types of data
without requiring feature transformations.

2. Explainability: A decision tree provides clear, rule-based decisions that can be interpreted easily. For example, a
rule like “If actions in 1 minute > 60 AND actions in 30 minutes < 500, then it’s a bot” aligns with how bots
exhibit distinct patterns in clickstream data.

3. Effective Feature Selection: In bot detection, not all features are equally important. Decision trees prioritize the
most informative features, such as the frequency and intensity of actions. This makes them efficient for
identifying bots based on behavioral thresholds.

4. Handles Nonlinear Relationships: Bots often exhibit nonlinear patterns in their behavior, such as a sudden spike
in activity over a short interval. Decision trees can effectively model such relationships, unlike linear models that
assume a straight-line relationship.

5. Adaptability to Imbalanced Data: While imbalanced data is a challenge for most algorithms, decision trees can
mitigate this by prioritizing splits that maximize purity (e.g., separating bots from non-bots).

6. Suitability for Rule-Based Domains: In contexts like bot detection, domain experts often have predefined rules
or thresholds. Decision trees align naturally with such rule-based systems, allowing experts to validate or refine
the model.

Example in the Context of Bot Detection

For a dataset with features like:

**count_1_min**: Actions in 1-minute intervals.

**count_5_mins**: Actions in 5-minute intervals.

**count_30_mins**: Actions in 30-minute intervals.


A decision tree might generate rules like:

1. If **count_1_min** > 60 and **count_5_mins** > 200 → Bot.

2. If **count_1_min** < 20 and **count_30_mins** > 700 → Bot.

Such thresholds are highly interpretable and directly actionable, making decision trees an ideal choice for detecting
anomalous bot-like behavior in user activity logs.

Designing Features to Detect Bot Activity

The feature strategy for bot detection involves aggregating click activity across different time intervals to capture
patterns indicative of non-human behavior. Specifically, the data is grouped and counted based on one-minute, five-
minute, and thirty-minute intervals, which helps identify high-frequency click patterns over both short and extended
durations. In this approach, users whose click counts exceed interval-specific thresholds—on the order of 60 clicks in one
minute, 300 clicks in five minutes, or 1,800 clicks in 30 minutes—are flagged as potential bots. By structuring the
data this way, we can detect bursts of activity that exceed typical human behavior, regardless of the interval length.
The results are stored in a nested dataframe format, with each user’s activity count grouped by timestamp, user ID, and
webpage name, providing a rich dataset for training and evaluating machine learning models. This multi-interval
aggregation allows us to capture nuanced bot activity patterns that may be missed by a single static threshold, making
bot detection more accurate and adaptable.

First, we’ll write a simple query that filters out all IDs that have generated more than 50 events within any 60-second interval, or one minute.

SELECT *
FROM training_web_data
WHERE id NOT IN (
SELECT id
FROM bot_web_data
GROUP BY
UNIX_TIMESTAMP(timestamp) / 60,
id
HAVING COUNT(*) > 50
);

The results will be:

If you have ingested Adobe Analytics Data as in the tutorial here - the above query would be very similar to what you
would execute. Here is the query that you would have run:

SELECT *
FROM luma_web_data
WHERE enduserids._experience.mcid NOT IN (
SELECT enduserids._experience.mcid
FROM luma_web_data
GROUP BY
Unix_timestamp(timestamp) / 60,
enduserids._experience.mcid
HAVING COUNT(*) > 50);

The result would be:

The 1-minute, 5-minute, and 30-minute count features provide valuable insights into short-term, mid-term, and longer-
term activity patterns, which are useful for identifying bot-like behavior. Bots often exhibit high-frequency actions in
short periods, while genuine users are likely to have lower and more varied activity over time. However, these time-
based counts alone might not fully capture the nuances of bot behavior. Here are some additional features that could
enhance the model’s ability to detect bots:

1. Unique Action Types per Interval: Count the unique actions (e.g., clicks, page views, add-to-cart) performed in
each interval. Bots may perform repetitive actions, so a low number of unique actions per interval could be a
strong bot indicator.

2. Average Time Between Actions: Calculate the average time gap between consecutive actions for each user. Bots
tend to have very consistent or minimal time gaps between actions, while human users have more variability.

3. Standard Deviation of Action Counts Across Intervals: Instead of just using the maximum counts, analyze the
standard deviation of action counts within each interval type (1-minute, 5-minute, 30-minute). Low variability
may indicate bot behavior, as bots often have more uniform activity patterns.

4. Session Duration: Measure the time between the first and last action within a session. Bots may have unusually
long or short sessions compared to typical user sessions.

5. Action Sequence Patterns: Look for specific sequences of actions, like “pageView -> addToCart -> purchase” or
repetitive patterns (e.g., repeated “click” actions). Certain sequences or repetitions can be strong indicators of
scripted bot behavior.

6. Frequency of Rare Actions: Identify rare actions (e.g., “logout” or “purchase”) and check if the frequency of
these actions is unusually high. Bots might disproportionately use or avoid certain actions that are less frequent
among typical users.

7. Clickstream Entropy: Calculate entropy on the sequence of actions for each user. High entropy (more
randomness) could indicate a human user, while low entropy (predictable patterns) might suggest automated
behavior.

8. Time of Day Patterns: Track actions by time of day. Bots might operate at times when human activity is
typically lower, such as very late at night or early morning.

9. Location or IP Address: If the dataset includes location or IP data, unusual patterns like multiple user IDs with
the same IP or multiple sessions from the same location could be signs of bot activity.

10. Number of Sessions per User: If available, the number of separate sessions per user within a day or week could
indicate bots, as bots might operate continuously or have unusually high session counts.

Integrating these features into the model could improve its ability to distinguish bots from genuine users by adding
context around activity patterns, user behavior, and usage variations. They would also help address any blind spots in
the current model, especially where bot behavior is more complex than just high frequency within short time intervals.
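As one concrete example, the second feature above (average time between actions) can be computed with a window function. This is a minimal sketch against the training_web_data table used in this section, assuming each row represents a single action with an id and a timestamp.

-- Average gap (in seconds) between consecutive actions per id
WITH gaps AS (
    SELECT
        id,
        UNIX_TIMESTAMP(timestamp)
          - LAG(UNIX_TIMESTAMP(timestamp)) OVER (PARTITION BY id ORDER BY timestamp) AS gap_seconds
    FROM training_web_data
)
SELECT
    id,
    AVG(gap_seconds) AS avg_gap_seconds,
    STDDEV(gap_seconds) AS stddev_gap_seconds  -- low variance often hints at automation
FROM gaps
WHERE gap_seconds IS NOT NULL
GROUP BY id;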

Rule-Based Annotation for Training Data Labeling with Data Distiller

Let us use a combination of patterns and thresholds across the three different time intervals (**count_1_min**,
**count_5_mins**, and **count_30_mins**). Here are complex rules we will implement:

Multi-Interval Threshold Combinations

Burst Pattern: A bot-like burst pattern that has high activity over shorter intervals and moderate activity over
longer intervals.

CASE
WHEN MAX(count_1_min) > 60 AND MAX(count_5_mins) BETWEEN 100 AND 200 AND
MAX(count_30_mins) < 500 THEN 1
ELSE 0
END AS isBot

Sustained High Activity: Bots that sustain high activity across all intervals.

CASE
WHEN MAX(count_1_min) > 50 AND MAX(count_5_mins) > 200 AND
MAX(count_30_mins) > 800 THEN 1
ELSE 0
END AS isBot

Short-Term Peaks with Long-Term Low Activity: Bots that peak within short intervals but have lower overall
long-term activity, indicating possible bursty or periodic automation.

CASE
WHEN MAX(count_1_min) > 70 AND MAX(count_5_mins) < 150 AND
MAX(count_30_mins) < 300 THEN 1
ELSE 0
END AS isBot

Patterned Activity with Anomalous Long-Term Spikes

Short and Medium Bursts with Occasional High Long-Term Activity: Users with moderate short- and
medium-term activity but extreme spikes over longer intervals, which could indicate periodic scripted
automation.

CASE
WHEN MAX(count_1_min) BETWEEN 30 AND 60
AND MAX(count_5_mins) BETWEEN 150 AND 250
AND MAX(count_30_mins) > 1000 THEN 1
ELSE 0
END AS isBot

Inconsistent High Activity Over Varying Intervals

Fluctuating Activity: Bots that exhibit very high activity in one interval but comparatively low activity in others.
This can capture erratic or adaptive bots.

CASE
WHEN (MAX(count_1_min) > 80 AND MAX(count_5_mins) < 100 AND
MAX(count_30_mins) > 500)
OR (MAX(count_1_min) < 50 AND MAX(count_5_mins) > 200 AND
MAX(count_30_mins) < 400) THEN 1
ELSE 0
END AS isBot

Periodic Low-Frequency Bots

Regular Intervals with Low Intensity: Bots that perform fewer actions but consistently over set intervals,
indicating periodic scraping or data polling.

CASE
WHEN MAX(count_1_min) BETWEEN 10 AND 30
AND MAX(count_5_mins) BETWEEN 50 AND 100
AND MAX(count_30_mins) BETWEEN 150 AND 300 THEN 1
ELSE 0
END AS isBot

High Long-Term Activity with Low Short-Term Activity

Continuous Background Activity: Bots that run continuously but without peaks in short bursts, which might
indicate a less aggressive but consistent bot process.

CASE
WHEN MAX(count_1_min) < 20
AND MAX(count_5_mins) < 100
AND MAX(count_30_mins) > 700 THEN 1
ELSE 0
END AS isBot

Now let us create the feature set:

-- Step 1: Count actions in each interval and calculate max counts
WITH count_1_min AS (
SELECT
id,
FLOOR(UNIX_TIMESTAMP(timestamp) / 60) AS interval,
COUNT(*) AS count_1_min
FROM training_web_data
GROUP BY id, interval
),

count_5_mins AS (
SELECT
id,
FLOOR(UNIX_TIMESTAMP(timestamp) / 300) AS interval,
COUNT(*) AS count_5_mins
FROM training_web_data
GROUP BY id, interval
),

count_30_mins AS (
SELECT
id,
FLOOR(UNIX_TIMESTAMP(timestamp) / 1800) AS interval,
COUNT(*) AS count_30_mins
FROM training_web_data
GROUP BY id, interval
),

-- Step 2: Consolidate counts for each user by merging the counts from each interval
consolidated_counts AS (
SELECT
COALESCE(c1.id, c5.id, c30.id) AS id,
COALESCE(c1.count_1_min, 0) AS count_1_min,
COALESCE(c5.count_5_mins, 0) AS count_5_mins,
COALESCE(c30.count_30_mins, 0) AS count_30_mins
FROM count_1_min c1
FULL OUTER JOIN count_5_mins c5 ON c1.id = c5.id AND c1.interval =
c5.interval
FULL OUTER JOIN count_30_mins c30 ON c1.id = c30.id AND c1.interval =
c30.interval
),

-- Step 3: Calculate max counts per interval per user with complex bot detection rules
final_features AS (
SELECT
id,
MAX(count_1_min) AS max_count_1_min,
MAX(count_5_mins) AS max_count_5_mins,
MAX(count_30_mins) AS max_count_30_mins,
CASE
-- Complex bot detection rules
WHEN (MAX(count_1_min) > 60 AND MAX(count_5_mins) BETWEEN 100 AND
200 AND MAX(count_30_mins) < 500)
OR (MAX(count_1_min) > 50 AND MAX(count_5_mins) > 200 AND
MAX(count_30_mins) > 800)
OR (MAX(count_1_min) BETWEEN 30 AND 60 AND MAX(count_5_mins)
BETWEEN 150 AND 250 AND MAX(count_30_mins) > 1000)
OR ((MAX(count_1_min) > 80 AND MAX(count_5_mins) < 100 AND
MAX(count_30_mins) > 500)
OR (MAX(count_1_min) < 50 AND MAX(count_5_mins) > 200 AND
MAX(count_30_mins) < 400))
OR (MAX(count_1_min) BETWEEN 10 AND 30 AND MAX(count_5_mins)
BETWEEN 50 AND 100 AND MAX(count_30_mins) BETWEEN 150 AND 300)
OR (MAX(count_1_min) < 20 AND MAX(count_5_mins) < 100 AND
MAX(count_30_mins) > 700)
THEN 1
ELSE 0
END AS isBot
FROM consolidated_counts
GROUP BY id
)

-- Step 4: Select the final feature set with bot labels
SELECT
id,
max_count_1_min,
max_count_5_mins,
max_count_30_mins,
isBot
FROM final_features;

This produces the result:

The three time-based aggregation features used in this bot detection query—**max_count_1_min**,
**max_count_5_mins**, and **max_count_30_mins**—each serve a unique purpose in capturing
different patterns of potential bot behavior:

1. 1-Minute Count (**max_count_1_min**): This feature reflects the highest count of actions a user performs
within any single 1-minute interval. High action counts in this short timeframe often indicate rapid, automated
interactions that exceed typical human behavior. Bots that operate in quick bursts will tend to show elevated
values here, helping to detect sudden spikes in activity.
2. 5-Minute Count (**max_count_5_mins**): This feature captures mid-term activity by aggregating user
actions over a 5-minute period. Bots may not always maintain extreme activity levels in short intervals, but they
may show persistent, above-average activity across mid-term intervals. The **max_count_5_mins** feature
helps detect bots that modulate their activity, slowing down slightly to mimic human behavior but still
maintaining an overall high rate of interaction compared to genuine users.

3. 30-Minute Count (**max_count_30_mins**): The 30-minute interval allows for detecting long-term
activity patterns. Bots, especially those performing continuous or background tasks, may exhibit sustained
interaction levels over longer periods. This feature helps to identify scripts or automated processes that maintain a
steady, high frequency of activity over time, which would be uncommon for human users.

Each of these features—1-minute, 5-minute, and 30-minute action counts—provides a view into distinct time-based
behavioral patterns that help distinguish bots from human users. By combining these features and applying complex
detection rules, the model can capture a wider variety of bot-like behaviors, from rapid bursts to prolonged
engagement, making it more robust against different types of automated interactions.

Bot vs. Non-Bots in Training Data

To compute the ratio of bots to non-bots in the above result, you can use a simple SQL query that calculates the count
of bots and non-bots, then computes their ratio. Here’s how to do it:

1. Count Bots and Non-Bots: Use a CASE statement to classify each user as a bot or non-bot based on the isBot
flag.

2. Calculate the Ratio: Use the bot and non-bot counts to calculate the bot-to-non-bot ratio.

-- Step 1: Count actions in each interval and calculate max counts
WITH count_1_min AS (
    SELECT
        id,
        FLOOR(UNIX_TIMESTAMP(timestamp) / 60) AS interval,
        COUNT(*) AS count_1_min
    FROM training_web_data
    GROUP BY id, interval
),

count_5_mins AS (
    SELECT
        id,
        FLOOR(UNIX_TIMESTAMP(timestamp) / 300) AS interval,
        COUNT(*) AS count_5_mins
    FROM training_web_data
    GROUP BY id, interval
),

count_30_mins AS (
    SELECT
        id,
        FLOOR(UNIX_TIMESTAMP(timestamp) / 1800) AS interval,
        COUNT(*) AS count_30_mins
    FROM training_web_data
    GROUP BY id, interval
),

-- Step 2: Consolidate counts for each user by merging the counts from each interval
consolidated_counts AS (
    SELECT
        COALESCE(c1.id, c5.id, c30.id) AS id,
        COALESCE(c1.count_1_min, 0) AS count_1_min,
        COALESCE(c5.count_5_mins, 0) AS count_5_mins,
        COALESCE(c30.count_30_mins, 0) AS count_30_mins
    FROM count_1_min c1
    FULL OUTER JOIN count_5_mins c5 ON c1.id = c5.id AND c1.interval = c5.interval
    FULL OUTER JOIN count_30_mins c30 ON c1.id = c30.id AND c1.interval = c30.interval
),

-- Step 3: Calculate max counts per interval per user with complex bot detection rules
final_features AS (
    SELECT
        id,
        MAX(count_1_min) AS max_count_1_min,
        MAX(count_5_mins) AS max_count_5_mins,
        MAX(count_30_mins) AS max_count_30_mins,
        CASE
            -- Complex bot detection rules
            WHEN (MAX(count_1_min) > 60 AND MAX(count_5_mins) BETWEEN 100 AND 200 AND MAX(count_30_mins) < 500)
                OR (MAX(count_1_min) > 50 AND MAX(count_5_mins) > 200 AND MAX(count_30_mins) > 800)
                OR (MAX(count_1_min) BETWEEN 30 AND 60 AND MAX(count_5_mins) BETWEEN 150 AND 250 AND MAX(count_30_mins) > 1000)
                OR ((MAX(count_1_min) > 80 AND MAX(count_5_mins) < 100 AND MAX(count_30_mins) > 500)
                    OR (MAX(count_1_min) < 50 AND MAX(count_5_mins) > 200 AND MAX(count_30_mins) < 400))
                OR (MAX(count_1_min) BETWEEN 10 AND 30 AND MAX(count_5_mins) BETWEEN 50 AND 100 AND MAX(count_30_mins) BETWEEN 150 AND 300)
                OR (MAX(count_1_min) < 20 AND MAX(count_5_mins) < 100 AND MAX(count_30_mins) > 700)
            THEN 1
            ELSE 0
        END AS isBot
    FROM consolidated_counts
    GROUP BY id
),

-- Step 4: Aggregate bot and non-bot counts
bot_counts AS (
    SELECT
        SUM(CASE WHEN isBot = 1 THEN 1 ELSE 0 END) AS bot_count,
        SUM(CASE WHEN isBot = 0 THEN 1 ELSE 0 END) AS non_bot_count
    FROM final_features
)

-- Step 5: Calculate the bot-to-non-bot ratio and display counts
SELECT
    bot_count,
    non_bot_count,
    bot_count * 1.0 / NULLIF(non_bot_count, 0) AS bot_to_non_bot_ratio
FROM bot_counts;

The result will be:

In bot detection, the distribution of bots versus non-bots in the dataset plays a critical role in the model’s effectiveness.
If the dataset is imbalanced like above— where non-bot data far outweighs bot data — the model may struggle to
recognize bot-like behavior accurately, leading to a bias toward labeling most activity as non-bot. Conversely, a
balanced dataset — where both bots and non-bots are equally represented — can help the model learn the distinct
patterns of bot behavior more effectively.

Imbalanced Data in Bot Detection

In real-world data, bots typically represent a small fraction of total interactions, resulting in an imbalanced dataset.
This imbalance can lead to several challenges:

Bias Toward Non-Bot Predictions: The model may default to labeling most users as non-bots, as it has far more
examples of non-bot behavior. This can result in a high number of false negatives, where bots are misclassified as
non-bots.

Misleading Metrics: Accuracy alone can be misleading in an imbalanced dataset. For instance, if bots make up
only 5% of the data, a model could achieve 95% accuracy by predicting “non-bot” every time. This accuracy
doesn’t reflect the model’s ability to actually detect bots.

Reduced Sensitivity for Bots: Imbalance reduces the model’s exposure to bot patterns, making it harder to
achieve strong recall for bot detection. In this context, recall is crucial, as we want the model to correctly identify
as many bots as possible.

To address imbalanced data in bot detection, various strategies can be employed:

Resampling: Increasing the representation of bot data by oversampling bots or undersampling non-bots can help
balance the dataset.

Synthetic Data Generation: Techniques like SMOTE (Synthetic Minority Over-sampling Technique) can be
used to create synthetic examples of bot behavior, enriching the model’s understanding of bot patterns.

In an ideal setting, having a balanced dataset with equal representation of bots and non-bots enables the model to
recognize both classes well. This balance helps the model capture both bot and non-bot behavior accurately, leading to
better performance across precision, recall, and overall accuracy. However, achieving a balanced dataset in bot
detection can be challenging due to the naturally low prevalence of bots in most datasets.

For our bot detection use case, balancing the dataset or addressing the imbalance is essential to improve the model’s
recall and precision in identifying bot behavior. Without handling imbalance, the model may fail to detect bots
effectively, resulting in contaminated data insights that impact customer segmentation, personalization, and analytics.
By using techniques to balance or adjust for the imbalance in bot and non-bot data, the model becomes better equipped
to accurately classify bot activity, thus enhancing data quality and ensuring more reliable insights for business
decisions.
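As a rough illustration of the resampling idea, the sketch below keeps every bot row and randomly samples the non-bots so the two classes are closer in size before training. It assumes the labeled output of the feature query above has been saved to a hypothetical table named labeled_features; the 0.05 sampling fraction is illustrative and should be derived from the actual bot-to-non-bot ratio.

-- Keep all bots; randomly keep ~5% of non-bots (fraction is illustrative)
SELECT id, max_count_1_min, max_count_5_mins, max_count_30_mins, isBot
FROM labeled_features
WHERE isBot = 1

UNION ALL

SELECT id, max_count_1_min, max_count_5_mins, max_count_30_mins, isBot
FROM labeled_features
WHERE isBot = 0
  AND rand() < 0.05;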

Train a Decision Tree Classifier Model


A decision tree learns boundaries from training data that represent various patterns of bot versus non-bot activity.
Unlike a strict threshold rule, the tree can accommodate complex patterns and combinations of high/low activity across
different time intervals that are more predictive of bot behavior.

DROP MODEL IF EXISTS bot_filtering_model;


-- Define the model with transformations and options
CREATE MODEL bot_filtering_model
TRANSFORM (
    numeric_imputer(max_count_1_min, 'mean') imputed_one_minute,    -- Impute missing values in 1-minute count with mean
    numeric_imputer(max_count_5_mins, 'mode') imputed_five_minute,  -- Impute missing values in 5-minute count with mode
    numeric_imputer(max_count_30_mins) imputed_thirty_minute,       -- Impute missing values in 30-minute count
    string_imputer(id, 'unknown') imputed_id,                       -- Impute missing user IDs as 'unknown'
    string_indexer(imputed_id) si_id,                               -- Index the ID as a numeric feature
    quantile_discretizer(imputed_five_minute) buckets_five,         -- Discretize the 5-minute feature using quantiles
    quantile_discretizer(imputed_thirty_minute) buckets_thirty,     -- Discretize the 30-minute feature using quantiles
    vector_assembler(array(si_id, imputed_one_minute, buckets_five, buckets_thirty)) features, -- Assemble all features into a single vector
    min_max_scaler(features) scaled_features                        -- Scale features to be within a range of 0 to 1
)
OPTIONS (
MODEL_TYPE = 'decision_tree_classifier',
MAX_DEPTH = 4,
LABEL = 'isBot'
) AS

-- Feature Engineering for Training Data
WITH count_1_min AS (
SELECT
id,
FLOOR(UNIX_TIMESTAMP(timestamp) / 60) AS interval,
COUNT(*) AS count_1_min
FROM training_web_data
GROUP BY id, interval
),

count_5_mins AS (
SELECT
id,
FLOOR(UNIX_TIMESTAMP(timestamp) / 300) AS interval,
COUNT(*) AS count_5_mins
FROM training_web_data
GROUP BY id, interval
),

count_30_mins AS (
SELECT
id,
FLOOR(UNIX_TIMESTAMP(timestamp) / 1800) AS interval,
COUNT(*) AS count_30_mins
FROM training_web_data
GROUP BY id, interval
),

-- Consolidate counts across different intervals
consolidated_counts AS (
SELECT
COALESCE(c1.id, c5.id, c30.id) AS id,
COALESCE(c1.count_1_min, 0) AS count_1_min,
COALESCE(c5.count_5_mins, 0) AS count_5_mins,
COALESCE(c30.count_30_mins, 0) AS count_30_mins
FROM count_1_min c1
FULL OUTER JOIN count_5_mins c5 ON c1.id = c5.id AND c1.interval =
c5.interval
FULL OUTER JOIN count_30_mins c30 ON c1.id = c30.id AND c1.interval =
c30.interval
),

-- Calculate max counts per interval per user and apply complex bot detection rules
final_features AS (
SELECT
id,
MAX(count_1_min) AS max_count_1_min,
MAX(count_5_mins) AS max_count_5_mins,
MAX(count_30_mins) AS max_count_30_mins,
CASE
WHEN (MAX(count_1_min) > 60 AND MAX(count_5_mins) BETWEEN 100 AND
200 AND MAX(count_30_mins) < 500)
OR (MAX(count_1_min) > 50 AND MAX(count_5_mins) > 200 AND
MAX(count_30_mins) > 800)
OR (MAX(count_1_min) BETWEEN 30 AND 60 AND MAX(count_5_mins)
BETWEEN 150 AND 250 AND MAX(count_30_mins) > 1000)
OR ((MAX(count_1_min) > 80 AND MAX(count_5_mins) < 100 AND
MAX(count_30_mins) > 500)
OR (MAX(count_1_min) < 50 AND MAX(count_5_mins) > 200 AND
MAX(count_30_mins) < 400))
OR (MAX(count_1_min) BETWEEN 10 AND 30 AND MAX(count_5_mins)
BETWEEN 50 AND 100 AND MAX(count_30_mins) BETWEEN 150 AND 300)
OR (MAX(count_1_min) < 20 AND MAX(count_5_mins) < 100 AND
MAX(count_30_mins) > 700)
THEN 1
ELSE 0
END AS isBot
FROM consolidated_counts
GROUP BY id
)

-- Select features and label for training
SELECT
max_count_1_min,
max_count_5_mins,
max_count_30_mins,
isBot,
id
FROM final_features;

The result will be:

Feature Transformers Used for Bot Detection

The SQL **TRANSFORM** clause enables streamlined feature engineering and preprocessing for machine learning.

The numeric_imputer transformer handles missing values in numerical features, ensuring that no data points are
lost due to null values. By imputing missing values, this step maintains data integrity and ensures robust model
training.

Example:

**max_count_1_min** is imputed using the mean value of the column.

**max_count_5_mins** is imputed using the mode (most frequent value).

**max_count_30_mins** is imputed using the mean.

The **string_imputer** replaces missing values in categorical features with a default value, such as
'**unknown**', to ensure consistency in the dataset. This step avoids dropping records due to missing
categories, a common occurrence in user identifiers or other text-based features.

Example: **id** (user identifier) is imputed with '**unknown**'.

The string_indexer encodes categorical features into numeric indices, making them compatible with machine
learning algorithms. This transformation is crucial for models like decision trees, which do not natively handle
categorical data.

Example: The imputed id feature is converted into a numeric index as si_id.

The quantile_discretizer converts continuous numerical features into discrete buckets based on quantiles.
This allows the model to better capture non-linear patterns and handle a wider range of value distributions in the data.

Example:

max_count_5_mins is discretized into buckets (buckets_five).

max_count_30_mins is discretized into buckets (buckets_thirty).

The **vector_assembler** combines all preprocessed features, including encoded categorical features and
imputed/discretized numerical features, into a single feature vector. This unified representation is used as input for the
decision tree model.

Example: The transformer combines **si_id**, **imputed_one_minute**, **buckets_five**, and **buckets_thirty** into a single vector called features.

The **min_max_scaler** scales the combined feature vector to a normalized range, typically 0 to 1. This
standardization ensures that all features contribute equally to the model training process, avoiding bias caused by
differing feature scales.

Example: The **features** vector is transformed into scaled_features to enhance model performance.

These feature transformers work together to preprocess the raw data into a structured and normalized format suitable
for training a Decision Tree Classifier. By effectively handling both categorical and numerical features, these
transformations improve model accuracy and interpretability, making them an essential step in the pipeline for
detecting bot activity.

Evaluate the Decision Tree Classifier Model

When evaluating this model, the primary goal is to test its ability to classify users as bots or non-bots based on their
activity patterns. Specifically, check if the model correctly predicts the isBot label (1 for bots, 0 for non-bots) based
on the time-based aggregation features. You’re looking for the model to generalize well – meaning it should identify
bot-like behavior in new, unseen data, not just replicate rules.

Overfitting is common when working with synthetic data, especially in scenarios where the data generation process is
simplified and highly structured. In synthetic datasets, patterns can often be overly consistent or lack the nuanced
variability found in real-world data. For instance, if synthetic data strictly follows fixed rules or thresholds without
incorporating randomness or exceptions, the model can easily “memorize” these patterns, resulting in high accuracy on
the synthetic data but poor generalization on real data.

This overfitting happens because machine learning models are sensitive to the underlying distribution of the training
data. When synthetic data doesn’t capture the full diversity of real-world behaviors, models may learn to recognize
only the specific patterns present in the training set, rather than generalize to similar yet slightly different patterns. In
the context of bot detection, synthetic data might include very clear thresholds for bot-like behavior (such as high click
counts in short intervals), which may not represent the subtleties of real bot or human interactions online.

To mitigate this, introducing noise, variability, and probabilistic elements into the synthetic dataset can help mimic the
diversity of real-world data, reducing the likelihood of overfitting and making the model evaluation metrics more
realistic. By adding controlled randomness and probabilistic labeling, we create a training and testing environment that
encourages the model to generalize rather than memorize specific rules.
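A minimal sketch of that idea, applied to the labeled features: jitter the counts with a small random offset and flip a small fraction of labels so the model cannot simply memorize exact thresholds. The labeled_features table name, the 0-4 offset range, and the 5% flip probability are illustrative assumptions, not values from this tutorial.

-- Add controlled randomness to counts and make the label slightly probabilistic
SELECT
    id,
    max_count_1_min + CAST(FLOOR(rand() * 5) AS INT) AS noisy_count_1_min,  -- random offset of 0-4 actions
    max_count_5_mins,
    max_count_30_mins,
    CASE
        WHEN isBot = 1 AND rand() < 0.95 THEN 1   -- keep ~95% of bot labels
        WHEN isBot = 0 AND rand() < 0.05 THEN 1   -- flip ~5% of non-bot labels to add label noise
        ELSE 0
    END AS noisy_isBot
FROM labeled_features;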

Let us evaluate the model against test data:

-- Model evaluation query using strict rule-based bot detection
SELECT *
FROM model_evaluate(
bot_filtering_model,
1,

WITH count_1_min AS (
SELECT
id,
FLOOR(UNIX_TIMESTAMP(timestamp) / 60) AS interval,
COUNT(*) AS count_1_min -- Strict count without random offset for 1-minute interval
FROM test_web_data
GROUP BY id, interval
),

count_5_mins AS (
SELECT
id,
FLOOR(UNIX_TIMESTAMP(timestamp) / 300) AS interval,
COUNT(*) AS count_5_mins -- Strict count without random offset for 5-minute interval
FROM test_web_data
GROUP BY id, interval
),

count_30_mins AS (
SELECT
id,
FLOOR(UNIX_TIMESTAMP(timestamp) / 1800) AS interval,
COUNT(*) AS count_30_mins -- Strict count without random offset for 30-minute interval
FROM test_web_data
GROUP BY id, interval
),

-- Step 1: Consolidate counts for each user by merging the counts from each interval
consolidated_counts AS (
SELECT
COALESCE(c1.id, c5.id, c30.id) AS id,
COALESCE(c1.count_1_min, 0) AS count_1_min,
COALESCE(c5.count_5_mins, 0) AS count_5_mins,
COALESCE(c30.count_30_mins, 0) AS count_30_mins
FROM count_1_min c1
FULL OUTER JOIN count_5_mins c5 ON c1.id = c5.id AND c1.interval =
c5.interval
FULL OUTER JOIN count_30_mins c30 ON c1.id = c30.id AND c1.interval =
c30.interval
),

-- Step 2: Calculate max counts per interval per user with strict rule-based bot detection
final_features AS (
SELECT
id,
MAX(count_1_min) AS max_count_1_min,
MAX(count_5_mins) AS max_count_5_mins,
MAX(count_30_mins) AS max_count_30_mins,
CASE
-- Strict bot detection rules without probabilistic elements
WHEN (MAX(count_1_min) > 60 AND MAX(count_5_mins) BETWEEN 100
AND 200 AND MAX(count_30_mins) < 500)
OR (MAX(count_1_min) > 50 AND MAX(count_5_mins) > 200 AND
MAX(count_30_mins) > 800)
OR (MAX(count_1_min) BETWEEN 30 AND 60 AND
MAX(count_5_mins) BETWEEN 150 AND 250 AND MAX(count_30_mins) > 1000)
OR ((MAX(count_1_min) > 80 AND MAX(count_5_mins) < 100 AND
MAX(count_30_mins) > 500)
OR (MAX(count_1_min) < 50 AND MAX(count_5_mins) > 200
AND MAX(count_30_mins) < 400))
OR (MAX(count_1_min) BETWEEN 10 AND 30 AND
MAX(count_5_mins) BETWEEN 50 AND 100 AND MAX(count_30_mins) BETWEEN 150 AND
300)
OR (MAX(count_1_min) < 20 AND MAX(count_5_mins) < 100 AND
MAX(count_30_mins) > 700)
THEN 1
ELSE 0
END AS isBot
FROM consolidated_counts
GROUP BY id
)

-- Step 3: Select the columns with expected names for model evaluation
SELECT
max_count_1_min,
max_count_5_mins,
max_count_30_mins,
isBot,
id
FROM final_features
);

The result will be:

This perfect score suggests that the synthetic nature of our test data is likely the main cause.

Predict Using the Decision Tree Classifier Model

-- Model prediction query with more lenient bot-detection thresholds, without added randomness
SELECT *
FROM model_predict(
bot_filtering_model,
1,

WITH count_1_min AS (
SELECT
id,
FLOOR(UNIX_TIMESTAMP(timestamp) / 60) AS interval,
COUNT(*) AS count_1_min
FROM inference_web_data
GROUP BY id, interval
),

count_5_mins AS (
SELECT
id,
FLOOR(UNIX_TIMESTAMP(timestamp) / 300) AS interval,
COUNT(*) AS count_5_mins
FROM inference_web_data
GROUP BY id, interval
),

count_30_mins AS (
SELECT
id,
FLOOR(UNIX_TIMESTAMP(timestamp) / 1800) AS interval,
COUNT(*) AS count_30_mins
FROM inference_web_data
GROUP BY id, interval
),

-- Step 1: Consolidate counts for each user by merging the counts from each interval
consolidated_counts AS (
SELECT
COALESCE(c1.id, c5.id, c30.id) AS id,
COALESCE(c1.count_1_min, 0) AS count_1_min,
COALESCE(c5.count_5_mins, 0) AS count_5_mins,
COALESCE(c30.count_30_mins, 0) AS count_30_mins
FROM count_1_min c1
FULL OUTER JOIN count_5_mins c5 ON c1.id = c5.id AND c1.interval =
c5.interval
FULL OUTER JOIN count_30_mins c30 ON c1.id = c30.id AND c1.interval =
c30.interval
),

-- Step 2: Calculate max counts per interval per user with more lenient bot detection rules
final_features AS (
SELECT
id,
MAX(count_1_min) AS max_count_1_min,
MAX(count_5_mins) AS max_count_5_mins,
MAX(count_30_mins) AS max_count_30_mins,
CASE
-- Modified bot detection rules to be more lenient
WHEN (MAX(count_1_min) > 40 AND MAX(count_5_mins) BETWEEN 80
AND 150 AND MAX(count_30_mins) < 300)
OR (MAX(count_1_min) > 35 AND MAX(count_5_mins) > 180 AND
MAX(count_30_mins) > 600)
OR (MAX(count_1_min) BETWEEN 25 AND 40 AND
MAX(count_5_mins) BETWEEN 120 AND 200 AND MAX(count_30_mins) > 700)
OR ((MAX(count_1_min) > 60 AND MAX(count_5_mins) < 90 AND
MAX(count_30_mins) > 400)
OR (MAX(count_1_min) < 30 AND MAX(count_5_mins) > 150
AND MAX(count_30_mins) < 300))
OR (MAX(count_1_min) BETWEEN 15 AND 30 AND
MAX(count_5_mins) BETWEEN 40 AND 80 AND MAX(count_30_mins) BETWEEN 100 AND 200)
OR (MAX(count_1_min) < 15 AND MAX(count_5_mins) < 80 AND
MAX(count_30_mins) > 500)
THEN 1
ELSE 0
END AS isBot
FROM consolidated_counts
GROUP BY id
)

-- Step 3: Select the columns with expected names for model prediction
SELECT
max_count_1_min,
max_count_5_mins,
max_count_30_mins,
isBot,
id
FROM final_features
);

The result will be:


The **rawPrediction** and **probability** columns are NULL by design and will be enhanced in the future.

Diagnosing Issues in Production

There are numerous instances of bot mislabeling throughout. When we evaluate the model (just change
**model_predict** to **model_evaluate** in the SQL code above) on this dataset, the results will
reflect the following:

The evaluation results here indicate a relatively low area under the ROC curve (AUC-ROC) of 0.47, with an accuracy
of 0.586, precision of approximately 0.76, and recall of 0.586. These values suggest that the model has some capability
to identify bots but lacks robustness and generalization.

The imbalanced bot-to-non-bot ratio in the training data, at 26 bots to 774 non-bots, is likely a significant factor
contributing to this outcome. In cases where the dataset is highly skewed towards one class, like non-bots, models tend
to struggle to learn effective patterns to identify the minority class—in this case, bots. As a result:

AUC-ROC being close to 0.5 suggests the model’s classification performance is close to random, which is
typical when a model is trained on imbalanced data.

Precision at 0.76 shows that when the model predicts a bot, it’s correct 76% of the time. This might reflect that
the model is somewhat conservative in predicting bots, potentially due to the overwhelming majority of non-bots
in the training data.

Recall of 0.586 indicates that the model only captures about 58.6% of actual bots, likely missing many due to
insufficient learning from the minority class.

To improve performance, especially for recall, it might be necessary to either oversample the bot instances or
undersample the non-bots in the training data.

SQL Approximation of SMOTE (Synthetic Minority Oversampling Technique)

SMOTE (Synthetic Minority Oversampling Technique) is a widely used method in machine learning to address the
problem of imbalanced datasets. In imbalanced datasets, one class (often the minority class) has significantly fewer
examples than the other class (majority class). This imbalance can lead to biased models that perform poorly on the
minority class, as the model tends to favor the majority class.

SMOTE generates synthetic samples for the minority class by interpolating between existing data points. Instead of
merely duplicating existing data, SMOTE creates new samples along the line segments joining neighboring minority
class examples in feature space. This approach enhances the model’s ability to generalize by introducing variability
and richness to the minority class.
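
For reference, the interpolation step can be written compactly. For a minority-class sample $x_i$ and a randomly chosen neighbor $x_{zi}$ among its k nearest minority-class neighbors, SMOTE generates:

$$x_{\text{new}} = x_i + \lambda \cdot (x_{zi} - x_i), \quad \lambda \sim U(0, 1)$$

Because $\lambda$ is drawn uniformly between 0 and 1, every synthetic point lies somewhere on the line segment between the original sample and its neighbor in feature space.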

SMOTE is inherently a geometric algorithm that operates in high-dimensional feature space. Its core steps involve:

1. Identifying nearest neighbors: For each minority class sample, find k-nearest neighbors in feature space.

2. Generating synthetic samples: Randomly interpolate between the original sample and one of its neighbors.

These steps pose significant challenges in SQL, which is optimized for relational data processing and not for complex
geometric operations. Specific difficulties include:

Nearest Neighbor Calculations: SQL does not natively support efficient operations like distance computations
(e.g., Euclidean distance) required to identify neighbors.
Interpolation in High Dimensions: Generating synthetic samples requires linear algebra operations, which are
not inherently supported in SQL.

Scalability: SMOTE’s complexity increases with the dimensionality of the data and the size of the minority
class. Implementing these operations in SQL can result in performance bottlenecks.

Although exact SMOTE is challenging in SQL, an approximation can be effective for certain types of data, especially
when:

Features are structured: If the dataset has well-defined features with clear bounds (e.g., counts or categories),
random noise-based interpolation can mimic SMOTE’s synthetic generation.

Minority class is clearly defined: By focusing on generating variations of minority samples using domain-
specific rules, we can approximate synthetic oversampling.

Use case involves low-dimensional data: In cases where the feature space is low-dimensional (e.g., 3-5
features), simpler interpolation techniques can achieve similar results.

An SQL-based approximation typically involves:

Duplicating minority samples: This ensures the minority class is represented adequately in the training data.

Adding controlled random noise: Slight variations in the feature values simulate interpolation while remaining
computationally feasible in SQL.

CREATE TABLE new_training_data AS
WITH count_1_min AS (
    SELECT
        id,
        FLOOR(UNIX_TIMESTAMP(timestamp) / 60) AS interval,
        COUNT(*) AS count_1_min
    FROM training_web_data
    GROUP BY id, interval
),

count_5_mins AS (
    SELECT
        id,
        FLOOR(UNIX_TIMESTAMP(timestamp) / 300) AS interval,
        COUNT(*) AS count_5_mins
    FROM training_web_data
    GROUP BY id, interval
),

count_30_mins AS (
    SELECT
        id,
        FLOOR(UNIX_TIMESTAMP(timestamp) / 1800) AS interval,
        COUNT(*) AS count_30_mins
    FROM training_web_data
    GROUP BY id, interval
),

-- Consolidate counts across different intervals
consolidated_counts AS (
    SELECT
        COALESCE(c1.id, c5.id, c30.id) AS id,
        COALESCE(c1.count_1_min, 0) AS count_1_min,
        COALESCE(c5.count_5_mins, 0) AS count_5_mins,
        COALESCE(c30.count_30_mins, 0) AS count_30_mins
    FROM count_1_min c1
    FULL OUTER JOIN count_5_mins c5 ON c1.id = c5.id AND c1.interval = c5.interval
    FULL OUTER JOIN count_30_mins c30 ON c1.id = c30.id AND c1.interval = c30.interval
),

-- Calculate max counts per interval per user with bot detection rules
final_features AS (
    SELECT
        id,
        MAX(count_1_min) AS max_count_1_min,
        MAX(count_5_mins) AS max_count_5_mins,
        MAX(count_30_mins) AS max_count_30_mins,
        CASE
            WHEN (MAX(count_1_min) > 60 AND MAX(count_5_mins) BETWEEN 100 AND 200 AND MAX(count_30_mins) < 500)
                OR (MAX(count_1_min) > 50 AND MAX(count_5_mins) > 200 AND MAX(count_30_mins) > 800)
                OR (MAX(count_1_min) BETWEEN 30 AND 60 AND MAX(count_5_mins) BETWEEN 150 AND 250 AND MAX(count_30_mins) > 1000)
                OR ((MAX(count_1_min) > 80 AND MAX(count_5_mins) < 100 AND MAX(count_30_mins) > 500)
                    OR (MAX(count_1_min) < 50 AND MAX(count_5_mins) > 200 AND MAX(count_30_mins) < 400))
                OR (MAX(count_1_min) BETWEEN 10 AND 30 AND MAX(count_5_mins) BETWEEN 50 AND 100 AND MAX(count_30_mins) BETWEEN 150 AND 300)
                OR (MAX(count_1_min) < 20 AND MAX(count_5_mins) < 100 AND MAX(count_30_mins) > 700)
            THEN 1
            ELSE 0
        END AS isBot
    FROM consolidated_counts
    GROUP BY id
),

-- Step 2: Extract minority class (isBot = 1)
bot_records AS (
    SELECT * FROM final_features WHERE isBot = 1
),

-- Step 3: Generate synthetic samples for the minority class
synthetic_bot_samples AS (
    SELECT
        ROW_NUMBER() OVER (ORDER BY id) + FLOOR(RAND() * 1000) AS id,  -- Generate new synthetic IDs
        max_count_1_min + (RAND() * 10 - 5) AS max_count_1_min,        -- Add random noise within ±5
        max_count_5_mins + (RAND() * 20 - 10) AS max_count_5_mins,     -- Add random noise within ±10
        max_count_30_mins + (RAND() * 30 - 15) AS max_count_30_mins,   -- Add random noise within ±15
        1 AS isBot                                                     -- Keep the bot label
    FROM bot_records
),

-- Step 4: Combine original data with synthetic samples
balanced_training_data AS (
    SELECT * FROM final_features
    UNION ALL
    SELECT * FROM synthetic_bot_samples
)

SELECT * FROM balanced_training_data;

The result of the **SELECT** query above is:

Execute the following to train the model on the feature dataset we generated above:

DROP MODEL IF EXISTS bot_filtering_model;

-- Define the model with transformations and options


CREATE MODEL bot_filtering_model
TRANSFORM (
    numeric_imputer(max_count_1_min, 'mean') imputed_one_minute,   -- Impute missing values in 1-minute count with mean
    numeric_imputer(max_count_5_mins, 'mode') imputed_five_minute, -- Impute missing values in 5-minute count with mode
    numeric_imputer(max_count_30_mins) imputed_thirty_minute,      -- Impute missing values in 30-minute count
    string_imputer(id, 'unknown') imputed_id,                      -- Impute missing user IDs as 'unknown'
    string_indexer(imputed_id) si_id,                              -- Index the ID as a numeric feature
    quantile_discretizer(imputed_five_minute) buckets_five,        -- Discretize the 5-minute feature using quantiles
    quantile_discretizer(imputed_thirty_minute) buckets_thirty,    -- Discretize the 30-minute feature using quantiles
    vector_assembler(array(si_id, imputed_one_minute, buckets_five, buckets_thirty)) features, -- Assemble all features into a single vector
    min_max_scaler(features) scaled_features                       -- Scale features to be within a range of 0 to 1
)
OPTIONS (
MODEL_TYPE = 'decision_tree_classifier',
MAX_DEPTH = 4,
LABEL = 'isBot'
) AS
SELECT
max_count_1_min,
max_count_5_mins,
max_count_30_mins,
isBot,
id
FROM new_training_data;
Now, if we run an evaluation on the inference data:

-- Model evaluation query with more lenient bot-detection thresholds, without added randomness
SELECT *
FROM model_evaluate(
bot_filtering_model,
1,

WITH count_1_min AS (
SELECT
id,
FLOOR(UNIX_TIMESTAMP(timestamp) / 60) AS interval,
COUNT(*) AS count_1_min
FROM inference_web_data
GROUP BY id, interval
),

count_5_mins AS (
SELECT
id,
FLOOR(UNIX_TIMESTAMP(timestamp) / 300) AS interval,
COUNT(*) AS count_5_mins
FROM inference_web_data
GROUP BY id, interval
),

count_30_mins AS (
SELECT
id,
FLOOR(UNIX_TIMESTAMP(timestamp) / 1800) AS interval,
COUNT(*) AS count_30_mins
FROM inference_web_data
GROUP BY id, interval
),

-- Step 1: Consolidate counts for each user by merging the counts from each interval
consolidated_counts AS (
SELECT
COALESCE(c1.id, c5.id, c30.id) AS id,
COALESCE(c1.count_1_min, 0) AS count_1_min,
COALESCE(c5.count_5_mins, 0) AS count_5_mins,
COALESCE(c30.count_30_mins, 0) AS count_30_mins
FROM count_1_min c1
FULL OUTER JOIN count_5_mins c5 ON c1.id = c5.id AND c1.interval =
c5.interval
FULL OUTER JOIN count_30_mins c30 ON c1.id = c30.id AND c1.interval =
c30.interval
),

-- Step 2: Calculate max counts per interval per user with more lenient bot detection rules
final_features AS (
SELECT
id,
MAX(count_1_min) AS max_count_1_min,
MAX(count_5_mins) AS max_count_5_mins,
MAX(count_30_mins) AS max_count_30_mins,
CASE
-- Modified bot detection rules to be more lenient
WHEN (MAX(count_1_min) > 40 AND MAX(count_5_mins) BETWEEN 80
AND 150 AND MAX(count_30_mins) < 300)
OR (MAX(count_1_min) > 35 AND MAX(count_5_mins) > 180 AND
MAX(count_30_mins) > 600)
OR (MAX(count_1_min) BETWEEN 25 AND 40 AND
MAX(count_5_mins) BETWEEN 120 AND 200 AND MAX(count_30_mins) > 700)
OR ((MAX(count_1_min) > 60 AND MAX(count_5_mins) < 90 AND
MAX(count_30_mins) > 400)
OR (MAX(count_1_min) < 30 AND MAX(count_5_mins) > 150
AND MAX(count_30_mins) < 300))
OR (MAX(count_1_min) BETWEEN 15 AND 30 AND
MAX(count_5_mins) BETWEEN 40 AND 80 AND MAX(count_30_mins) BETWEEN 100 AND 200)
OR (MAX(count_1_min) < 15 AND MAX(count_5_mins) < 80 AND
MAX(count_30_mins) > 500)
THEN 1
ELSE 0
END AS isBot
FROM consolidated_counts
GROUP BY id
)

-- Step 3: Select the columns with expected names for model prediction
SELECT
max_count_1_min,
max_count_5_mins,
max_count_30_mins,
isBot,
id
FROM final_features
);

The result is:

Before SMOTE approximation (previous image without SMOTE):

After SMOTE approximation (current image with SMOTE):

Analysis of Before and After SMOTE Changes

1. AUC ROC: The AUC ROC increased slightly from 0.4686 to 0.4860. This indicates a modest improvement in
the model’s ability to distinguish between classes after balancing the dataset with SMOTE.

2. Accuracy: The accuracy also improved slightly, moving from 0.586 to 0.595. This suggests the model has
become somewhat more reliable overall with balanced data.

3. Precision: Precision remains nearly the same, with a minor increase from 0.764 to 0.767. This indicates that the
model’s ability to correctly identify actual bot cases (positive predictive value) was maintained after balancing.

4. Recall: The recall increased slightly from 0.586 to 0.595, indicating that the model is now slightly better at
capturing more of the actual bot cases.

The metrics show a slight improvement across all areas, especially in AUC ROC and recall. Applying SMOTE has
likely helped the model generalize better on the minority class (bot cases) by reducing the imbalance. However, the
improvement is modest, suggesting that other strategies, like tuning the model further or experimenting with additional
features, may be necessary to achieve substantial gains in performance.

Random Forest Classifier Algorithm

Make sure you use the **new_training_data** dataset created in the SMOTE section above.

A Random Forest model can generally improve performance compared to a single Decision Tree, especially in
contexts like bot detection or other classification problems. Here’s why and how it works:

1. Reduction in Overfitting

Decision Tree: A single decision tree tends to overfit the training data, especially if it is allowed to grow deep
and learn every detail of the data. This can make the tree highly sensitive to small fluctuations in the data, leading
to high variance and poor generalization on new data.

Random Forest: Random forests build multiple decision trees (typically hundreds or thousands) on random
subsets of the data and aggregate their predictions. This ensemble approach reduces the risk of overfitting as the
“averaging” process smooths out the noise from individual trees, making the model more robust and stable.

2. Improved Accuracy: By combining the outputs of many trees, Random Forest often achieves higher accuracy than
a single Decision Tree. Each tree learns different patterns and features, and when their predictions are combined
(usually by majority vote for classification or average for regression), the model produces more accurate and reliable
predictions. This improvement is especially noticeable in complex datasets with many features or noisy data, where
individual trees might struggle to capture all patterns.

3. Reduction in Variance: Random forests reduce variance by averaging the results of multiple decision trees trained
on different subsets of the data. This results in a more generalized model, which tends to be more consistent and less
sensitive to small changes in the input data.

4. Feature Importance Insights: Random forests also provide more reliable estimates of feature importance
compared to a single decision tree. This can be valuable in understanding which features (e.g., specific counts,
intervals, or thresholds) are most influential in distinguishing bots from non-bots.

5. Handling Imbalanced Data: Because our bot detection dataset is imbalanced, Random Forest is generally more capable
than a single Decision Tree in handling it, especially if combined with techniques like SMOTE or weighted classes.
Random Forest’s ensemble approach provides a more balanced perspective, making it a good choice for imbalanced
data.

Let us create the model using the same feature set:

DROP MODEL IF EXISTS bot_filtering_model;

-- Define the model with transformations and options


CREATE MODEL bot_filtering_model
TRANSFORM (
    numeric_imputer(max_count_1_min, 'mean') imputed_one_minute,   -- Impute missing values in 1-minute count with mean
    numeric_imputer(max_count_5_mins, 'mode') imputed_five_minute, -- Impute missing values in 5-minute count with mode
    numeric_imputer(max_count_30_mins) imputed_thirty_minute,      -- Impute missing values in 30-minute count
    string_imputer(id, 'unknown') imputed_id,                      -- Impute missing user IDs as 'unknown'
    string_indexer(imputed_id) si_id,                              -- Index the ID as a numeric feature
    quantile_discretizer(imputed_five_minute) buckets_five,        -- Discretize the 5-minute feature using quantiles
    quantile_discretizer(imputed_thirty_minute) buckets_thirty,    -- Discretize the 30-minute feature using quantiles
    vector_assembler(array(si_id, imputed_one_minute, buckets_five, buckets_thirty)) features, -- Assemble all features into a single vector
    min_max_scaler(features) scaled_features                       -- Scale features to be within a range of 0 to 1
)
OPTIONS (
    MODEL_TYPE = 'random_forest_classifier', -- Change model type to random forest classifier
NUM_TREES = 20, -- Set the number of trees
MAX_DEPTH = 5, -- Set the maximum depth of trees
LABEL = 'isBot'
) AS
SELECT
max_count_1_min,
max_count_5_mins,
max_count_30_mins,
isBot,
id
FROM new_training_data;

If you do the same evaluation:

-- Model evaluation query with more lenient bot-detection thresholds, without added randomness
SELECT *
FROM model_evaluate(
bot_filtering_model,
1,

WITH count_1_min AS (
SELECT
id,
FLOOR(UNIX_TIMESTAMP(timestamp) / 60) AS interval,
COUNT(*) AS count_1_min
FROM inference_web_data
GROUP BY id, interval
),

count_5_mins AS (
SELECT
id,
FLOOR(UNIX_TIMESTAMP(timestamp) / 300) AS interval,
COUNT(*) AS count_5_mins
FROM inference_web_data
GROUP BY id, interval
),

count_30_mins AS (
SELECT
id,
FLOOR(UNIX_TIMESTAMP(timestamp) / 1800) AS interval,
COUNT(*) AS count_30_mins
FROM inference_web_data
GROUP BY id, interval
),

-- Step 1: Consolidate counts for each user by merging the counts from each interval
consolidated_counts AS (
SELECT
COALESCE(c1.id, c5.id, c30.id) AS id,
COALESCE(c1.count_1_min, 0) AS count_1_min,
COALESCE(c5.count_5_mins, 0) AS count_5_mins,
COALESCE(c30.count_30_mins, 0) AS count_30_mins
FROM count_1_min c1
FULL OUTER JOIN count_5_mins c5 ON c1.id = c5.id AND c1.interval =
c5.interval
FULL OUTER JOIN count_30_mins c30 ON c1.id = c30.id AND c1.interval =
c30.interval
),

-- Step 2: Calculate max counts per interval per user with more lenient bot detection rules
final_features AS (
SELECT
id,
MAX(count_1_min) AS max_count_1_min,
MAX(count_5_mins) AS max_count_5_mins,
MAX(count_30_mins) AS max_count_30_mins,
CASE
-- Modified bot detection rules to be more lenient
WHEN (MAX(count_1_min) > 40 AND MAX(count_5_mins) BETWEEN 80
AND 150 AND MAX(count_30_mins) < 300)
OR (MAX(count_1_min) > 35 AND MAX(count_5_mins) > 180 AND
MAX(count_30_mins) > 600)
OR (MAX(count_1_min) BETWEEN 25 AND 40 AND
MAX(count_5_mins) BETWEEN 120 AND 200 AND MAX(count_30_mins) > 700)
OR ((MAX(count_1_min) > 60 AND MAX(count_5_mins) < 90 AND
MAX(count_30_mins) > 400)
OR (MAX(count_1_min) < 30 AND MAX(count_5_mins) > 150
AND MAX(count_30_mins) < 300))
OR (MAX(count_1_min) BETWEEN 15 AND 30 AND
MAX(count_5_mins) BETWEEN 40 AND 80 AND MAX(count_30_mins) BETWEEN 100 AND 200)
OR (MAX(count_1_min) < 15 AND MAX(count_5_mins) < 80 AND
MAX(count_30_mins) > 500)
THEN 1
ELSE 0
END AS isBot
FROM consolidated_counts
GROUP BY id
)

-- Step 3: Select the columns with expected names for model prediction
SELECT
max_count_1_min,
max_count_5_mins,
max_count_30_mins,
isBot,
id
FROM final_features
);

The results are:

Before Results (After SMOTE and Decision Tree Classifier):

After Results (After SMOTE and Random Forest Implementation):

Insights and Recommendations

The random forest model slightly outperforms the decision tree classifier in terms of AUC ROC, accuracy, and recall, while
precision remains identical between the two models. The improvement, although minor, indicates that the random
forest leverages its ensemble nature to better capture patterns in the data. The use of SMOTE for synthetic data
generation likely contributed to balancing the dataset, enabling both models to achieve reasonable precision and recall.
However, the AUC ROC values (~0.487) indicate that the models are struggling to effectively distinguish
between bots and non-bots, suggesting that the current features may not capture enough meaningful
differences.

To improve performance, consider enhancing feature engineering to include more discriminative features that better
separate bots from non-bots. Additionally, hyperparameter tuning for the random forest (e.g., increasing NUM_TREES
or MAX_DEPTH) could yield further improvements. Exploring alternative models like gradient-boosting algorithms
may also prove beneficial, as they tend to perform better on imbalanced datasets.

Appendix: Generating Balanced Synthetic Data in Data Distiller

Here is the code:

-- Generate balanced synthetic dataset with bot-like and non-bot behavior


SELECT
-- Generate unique synthetic ID
FLOOR(RAND() * 10000000000) AS id,

    -- Generate random timestamps within the last year by subtracting random seconds from current timestamp
    TIMESTAMPADD(SECOND, -FLOOR(RAND() * 31536000), CURRENT_TIMESTAMP) AS timestamp,

    -- High count for 1-minute interval to simulate bot-like rapid activity for bots, lower for non-bots
CASE
WHEN ROW_NUMBER() OVER (ORDER BY RAND()) % 2 = 0 THEN FLOOR(RAND() *
70) + 50 -- Bot-like high count
ELSE FLOOR(RAND() * 30) -- Non-bot lower count
END AS count_1_min,

    -- Moderate to high count for 5-minute interval to capture mid-level bot behavior for bots, lower for non-bots
CASE
WHEN ROW_NUMBER() OVER (ORDER BY RAND()) % 2 = 0 THEN FLOOR(RAND() *
150) + 100 -- Bot-like moderate to high count
ELSE FLOOR(RAND() * 80) -- Non-bot moderate count
END AS count_5_mins,

    -- High count for 30-minute interval to capture long-duration bot-like activity for bots, lower for non-bots
CASE
WHEN ROW_NUMBER() OVER (ORDER BY RAND()) % 2 = 0 THEN FLOOR(RAND() *
400) + 500 -- Bot-like high count
ELSE FLOOR(RAND() * 200) + 50 -- Non-bot lower count
END AS count_30_mins,

    -- Label half as bots and half as non-bots by using row numbers to alternate bot and non-bot labels
CASE
        WHEN ROW_NUMBER() OVER (ORDER BY RAND()) % 2 = 0 THEN 1 -- Mark as bot for even rows
ELSE 0 -- Mark as non-bot for odd rows
END AS isBot

-- Generate multiple records for a balanced dataset


FROM RANGE(10000);

This query creates a synthetic, balanced dataset to model bot-like behavior versus non-bot (human) behavior. It’s
designed to produce realistic variations in activity counts within specific time intervals to simulate patterns that might
help distinguish bots from humans.

The query generates a dataset where:

1. User IDs are randomized: Unique IDs represent individual users.

2. Timestamps are recent and varied: Random timestamps within the last year simulate user activity over time.

3. Activity Counts Simulate Bot-like and Non-bot Patterns: The query produces high-frequency activity counts
for bots and lower counts for non-bots within 1-minute, 5-minute, and 30-minute intervals.

4. Balanced Labels: The query labels 50% of the records as bots and the other 50% as non-bots to ensure a
balanced dataset, which helps prevent bias when training a classifier.

Let us dig into the code itself:

1. Generating Unique User IDs:

FLOOR(RAND() * 10000000000) AS id,

This line creates a unique ID for each record by generating a random number in the range of 0 to 10 billion. Each
ID acts as a synthetic user identifier.

2. Random Timestamps Within the Last Year:

TIMESTAMPADD(SECOND, -FLOOR(RAND() * 31536000), CURRENT_TIMESTAMP) AS timestamp,

By subtracting a random number of seconds (up to approximately one year) from the current timestamp, this line
generates random timestamps within the past year. This simulates varying activity times across users.
3. Simulating 1-Minute Interval Counts:

CASE
WHEN ROW_NUMBER() OVER (ORDER BY RAND()) % 2 = 0 THEN FLOOR(RAND() *
70) + 50 -- Bot-like high count
ELSE FLOOR(RAND() * 30) -- Non-bot lower count
END AS count_1_min,

Here, the query uses a CASE statement to assign different activity counts for bots and non-bots:

Bot-Like Users: Even-numbered rows (simulated as bots) receive a high count (between 50 and 120) to
reflect frequent actions within one minute.

Non-Bot Users: Odd-numbered rows (simulated as non-bots) receive a lower count (up to 30), which
reflects less frequent actions within one minute.

This pattern is applied by alternating the output of the ROW_NUMBER() function, where even rows are bots and
odd rows are non-bots.

4. Simulating 5-Minute Interval Counts:

CASE
WHEN ROW_NUMBER() OVER (ORDER BY RAND()) % 2 = 0 THEN FLOOR(RAND() *
150) + 100 -- Bot-like moderate to high count
ELSE FLOOR(RAND() * 80) -- Non-bot moderate count
END AS count_5_mins,

Similarly, this section simulates activity over a 5-minute interval. Bots get a higher range of activity counts
(between 100 and 250) to capture moderate-to-high activity. Non-bots receive lower values (up to 80), reflecting
normal usage patterns.

5. Simulating 30-Minute Interval Counts:

CASE
WHEN ROW_NUMBER() OVER (ORDER BY RAND()) % 2 = 0 THEN FLOOR(RAND() *
400) + 500 -- Bot-like high count
ELSE FLOOR(RAND() * 200) + 50 -- Non-bot lower count
END AS count_30_mins,

For a 30-minute interval, bots show consistently high counts (from 500 to 900), reflecting sustained high-
frequency activity, while non-bots show lower values (up to 250).

6. Assigning Bot Labels:

CASE
    WHEN ROW_NUMBER() OVER (ORDER BY RAND()) % 2 = 0 THEN 1 -- Mark as bot for even rows
ELSE 0 -- Mark as non-bot for odd rows
END AS isBot

By alternating between bots and non-bots with the ROW_NUMBER() function, the query ensures an even
distribution, which is critical for training a classifier. This balanced labeling helps the model learn to differentiate
bot-like behavior from normal human behavior without becoming biased toward one class.

7. Generating 10,000 Rows:


Finally, the RANGE(10000) clause creates 10,000 rows of synthetic data, each with its own combination of id,
timestamp, activity counts, and bot label.

Complex business rules for labeling datasets

You can inspect the actual label and prediction.

Poor performance in production.

Random forest algorithm results on SMOTE data

Synthetic data generation in Data Distiller.

https://data-distiller.all-stuff-data.com/unit-8-data-distiller-statistics-and-machine-learning/statsml-604-data-
exploration-for-customer-ai-in-real-time-customer-data-platform * * *

Before we proceed, this is how the demo_data_intelligent_services_demo_midvalues dataset looks from an ingestion point
of view:

Note that this dataset has not been enabled for Real-Time Customer Profile, meaning it is not being ingested into the
profile.

This is how the schema looks:

Run the following on the Experience Event dataset, ensuring it adheres to either the Adobe Analytics Schema or the
Consumer Event Schema. Keep in mind that Customer AI automatically generates features using standard field groups.

SELECT * FROM demo_data_intelligent_services_demo_midvalues;

Your result should look something like this:

To get the JSON structure of the output use:

SELECT to_json(STRUCT(*)) AS json_output


FROM demo_data_intelligent_services_demo_midvalues
LIMIT 1;

Accessing the Standard Field Groups Used in Customer AI

Customer AI uses standard field groups to automatically generate features such as recency, frequency, and engagement
metrics without manual intervention. In addition to these standard field groups, custom events can be incorporated for
advanced customization, allowing for more tailored insights. While it is not necessary for the data to include all field
groups, having relevant ones significantly enhances model performance.

| Activity Captured | Example Standard Fields | Description |
| --- | --- | --- |
| Purchases, Product Views, Checkouts | productListItems.SKU, commerce.order.purchaseID, commerce.purchases.value | Captures transaction-related data for commerce activities. |
| Web Visits, Page Views, Link Clicks | web.webPageDetails.name, web.webInteraction.linkClicks.value | Tracks website interactions and user behaviors online. |
| App Installs, Launches, Feature Usage | application.name, application.installs.value, application.featureUsages.value | Focuses on mobile or desktop application interactions. |
| Searches | search.keywords | Logs search behavior and keywords used by customers. |
| Customer Demographics, Preferences | person.name, person.gender, person.birthDate | Provides demographic and user profile information. |
| Device Details | device.type, device.operatingSystem.name | Identifies devices used by the customer during interactions. |
| Identities | identityMap.ECID.id, identityMap.AAID.id | Links different identifiers for a unified customer view. |
| Experience Event Metadata | timestamp, channel, environment | Provides contextual metadata about customer events. |

You can access the standard fields by executing the following:

SELECT to_json(web) AS web_json


FROM demo_data_intelligent_services_demo_midvalues
LIMIT 1;

SELECT to_json(productListItems) AS productListItems_json


FROM demo_data_intelligent_services_demo_midvalues
LIMIT 1;

SELECT to_json(commerce) AS commerce_json


FROM demo_data_intelligent_services_demo_midvalues
LIMIT 1;

SELECT to_json(application) AS application_json


FROM demo_data_intelligent_services_demo_midvalues
LIMIT 1;

SELECT to_json(search) AS search_json


FROM demo_data_intelligent_services_demo_midvalues
LIMIT 1;

Flattening Standard Fields

First try running something like this, a template that has all the fields:

SELECT
-- Web Interaction Details
web.webPageDetails.name AS page_name,
web.webInteraction.linkClicks.value AS link_clicks,

-- Commerce Details
commerce.purchases.value AS purchase_value,
commerce.order.purchaseID AS purchase_id,
commerce.checkouts.value AS checkout_value,
commerce.productListViews.value AS product_list_views,
commerce.productListOpens.value AS product_list_opens,
commerce.productListRemovals.value AS product_list_removals,
commerce.productViews.value AS product_views,
productListItems.SKU AS product_sku,

-- Application Details
application.name AS application_name,
application.applicationCloses.value AS app_closes,
application.crashes.value AS app_crashes,
application.featureUsages.value AS feature_usages,
application.firstLaunches.value AS first_launches,

-- Search Information
search.keywords AS search_keywords,

-- Event Metadata
meta.intendedToExtend AS intended_to_extend,

-- Time Period
startDate,
endDate

FROM
demo_data_intelligent_services_demo_midvalues;

In my case, I will get an error message that says:

ErrorCode: 42601 ... no viable alternative at input 'commerce.order'

This suggests that **commerce.order** does not exist in this dataset. The key part to notice is **no viable
alternative at input 'commerce.order'**.

Another error message that you will get which is also indicative of the same error is:

ErrorCode: 08P01 queryId: 83370942-ffd7-4aa3-9f54-22b1edd06c56 Unknown error


encountered. Reason: [[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function
parameter with name `meta`.`intendedtoextend` cannot be resolved. Did you mean
one of the following? [`demo_data_intelligent_services_demo_midvalues`.`_id`,
`demo_data_intelligent_services_demo_midvalues`.`web`,
`demo_data_intelligent_services_demo_midvalues`.`media`,
`demo_data_intelligent_services_demo_midvalues`.`device`,
`demo_data_intelligent_services_demo_midvalues`.`search`].; line 12 pos 2;
'GlobalLimit 50000 +- 'LocalLimit 50000 +- 'Project [web#503298.webpagedetails

This pattern will repeat for each missing column if you keep removing or commenting them out manually.

Let us execute the query after commenting out the missing columns:

SELECT
-- Web Interaction Details
web.webPageDetails.name AS page_name,
web.webInteraction.linkClicks.value AS link_clicks,

-- Commerce Details
commerce.purchases.value AS purchase_value,
-- commerce.order.purchaseID AS purchase_id, -- COMMENTED OUT (missing
column)
commerce.checkouts.value AS checkout_value,
commerce.productListViews.value AS product_list_views,
commerce.productListOpens.value AS product_list_opens,
commerce.productListRemovals.value AS product_list_removals,
commerce.productViews.value AS product_views,
productListItems.SKU AS product_sku,

-- Application Details
-- application.name AS application_name,
-- application.applicationCloses.value AS app_closes,
-- application.crashes.value AS app_crashes,
-- application.featureUsages.value AS feature_usages,
-- application.firstLaunches.value AS first_launches,

-- Search Information
search.keywords AS search_keywords

-- Event Metadata
-- meta.intendedToExtend AS intended_to_extend,

-- Time Period
-- startDate,
-- endDate

FROM demo_data_intelligent_services_demo_midvalues;

Observe that the comma after **search_keywords** was removed as it is the last column.

The Data Quality Score (DQS) is a composite metric designed to measure how reliable, complete, and consistent data
is within a dataset. The goal is to quantify data quality so that issues can be identified and improvements can be
tracked over time.

We evaluated data quality based on three core dimensions:

**Completeness**: The proportion of non-null (non-missing) values in the dataset. Missing data can skew analyses, leading to biased insights.

$$\text{Completeness (\%)} = \left(1 - \frac{\text{Null Values}}{\text{Total Records}}\right) \times 100$$

**Uniqueness**: The proportion of distinct (unique) values relative to the total number of records. Ensures data is free from duplicates, which can distort aggregations or counts.

$$\text{Uniqueness (\%)} = \frac{\text{Distinct Values}}{\text{Total Records}} \times 100$$

**Validity**: Measures if the data conforms to expected formats, ranges, or patterns. Invalid data (e.g., negative prices, malformed dates) can break business rules.

$$\text{Validity (\%)} = \frac{\text{Valid Records}}{\text{Total Records}} \times 100$$

We average the data quality metrics to provide a balanced view, ensuring that no single metric dominates the overall
score unless explicitly weighted. This approach maintains fairness across different dimensions of data quality.
However, flexible weighting can be applied when necessary. In certain contexts, such as financial data, specific
dimensions like validity might carry more weight due to their critical importance in ensuring data accuracy and
compliance.
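
As a concrete illustration of the equal-weighting approach, the per-column score used in the query below is simply the arithmetic mean of whichever dimensions are measured for that column:

$$\text{DQS} = \frac{\text{Completeness} + \text{Uniqueness} + \text{Validity}}{3}$$

For columns where validity is not measured (for example, page_name), the score reduces to the average of completeness and uniqueness.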

Here is the query that you should execute:

WITH data_quality AS (
SELECT
-- Web Interaction Details (Completeness & Uniqueness)
(1 - (COUNT(CASE WHEN web.webPageDetails.name IS NULL THEN 1 END) /
COUNT(*))) * 100 AS page_name_completeness,
(COUNT(DISTINCT web.webPageDetails.name) / COUNT(*)) * 100 AS
page_name_uniqueness,

(1 - (COUNT(CASE WHEN web.webInteraction.linkClicks.value IS NULL THEN


1 END) / COUNT(*))) * 100 AS link_clicks_completeness,
(COUNT(DISTINCT web.webInteraction.linkClicks.value) / COUNT(*)) * 100
AS link_clicks_uniqueness,

-- Commerce Details (Completeness, Uniqueness, Validity)


(1 - (COUNT(CASE WHEN commerce.purchases.value IS NULL THEN 1 END) /
COUNT(*))) * 100 AS purchase_value_completeness,
(COUNT(DISTINCT commerce.purchases.value) / COUNT(*)) * 100 AS
purchase_value_uniqueness,
(COUNT(CASE WHEN commerce.purchases.value >= 0 THEN 1 END) / COUNT(*))
* 100 AS purchase_value_validity,

-- Commented Section for commerce.order


-- (1 - (COUNT(CASE WHEN commerce.order.purchaseID IS NULL THEN 1 END)
/ COUNT(*))) * 100 AS purchase_id_completeness,
-- (COUNT(DISTINCT commerce.order.purchaseID) / COUNT(*)) * 100 AS
purchase_id_uniqueness,

(1 - (COUNT(CASE WHEN commerce.checkouts.value IS NULL THEN 1 END) /


COUNT(*))) * 100 AS checkout_value_completeness,
(COUNT(DISTINCT commerce.checkouts.value) / COUNT(*)) * 100 AS
checkout_value_uniqueness,

(1 - (COUNT(CASE WHEN commerce.productListViews.value IS NULL THEN 1


END) / COUNT(*))) * 100 AS product_list_views_completeness,
(COUNT(DISTINCT commerce.productListViews.value) / COUNT(*)) * 100 AS
product_list_views_uniqueness,

(1 - (COUNT(CASE WHEN commerce.productListOpens.value IS NULL THEN 1


END) / COUNT(*))) * 100 AS product_list_opens_completeness,
(COUNT(DISTINCT commerce.productListOpens.value) / COUNT(*)) * 100 AS
product_list_opens_uniqueness,
(1 - (COUNT(CASE WHEN commerce.productListRemovals.value IS NULL THEN 1
END) / COUNT(*))) * 100 AS product_list_removals_completeness,
(COUNT(DISTINCT commerce.productListRemovals.value) / COUNT(*)) * 100
AS product_list_removals_uniqueness,

(1 - (COUNT(CASE WHEN commerce.productViews.value IS NULL THEN 1 END) /


COUNT(*))) * 100 AS product_views_completeness,
(COUNT(DISTINCT commerce.productViews.value) / COUNT(*)) * 100 AS
product_views_uniqueness,

(1 - (COUNT(CASE WHEN productListItems.SKU IS NULL THEN 1 END) /


COUNT(*))) * 100 AS product_sku_completeness,
(COUNT(DISTINCT productListItems.SKU) / COUNT(*)) * 100 AS
product_sku_uniqueness,
(COUNT(CASE WHEN SIZE(productListItems.SKU) > 0 THEN 1 END) / COUNT(*))
* 100 AS product_sku_validity,

-- Search Information
(1 - (COUNT(CASE WHEN search.keywords IS NULL THEN 1 END) / COUNT(*)))
* 100 AS search_keywords_completeness,
(COUNT(DISTINCT search.keywords) / COUNT(*)) * 100 AS
search_keywords_uniqueness
FROM demo_data_intelligent_services_demo_midvalues
)

SELECT 'page_name' AS column_name, (page_name_completeness +


page_name_uniqueness) / 2 AS data_quality_score FROM data_quality
UNION ALL
SELECT 'link_clicks', (link_clicks_completeness + link_clicks_uniqueness) / 2
FROM data_quality
UNION ALL
SELECT 'purchase_value', (purchase_value_completeness +
purchase_value_uniqueness + purchase_value_validity) / 3 FROM data_quality
UNION ALL
SELECT 'checkout_value', (checkout_value_completeness +
checkout_value_uniqueness) / 2 FROM data_quality
UNION ALL
SELECT 'product_list_views', (product_list_views_completeness +
product_list_views_uniqueness) / 2 FROM data_quality
UNION ALL
SELECT 'product_list_opens', (product_list_opens_completeness +
product_list_opens_uniqueness) / 2 FROM data_quality
UNION ALL
SELECT 'product_list_removals', (product_list_removals_completeness +
product_list_removals_uniqueness) / 2 FROM data_quality
UNION ALL
SELECT 'product_views', (product_views_completeness + product_views_uniqueness)
/ 2 FROM data_quality
UNION ALL
SELECT 'product_sku', (product_sku_completeness + product_sku_uniqueness +
product_sku_validity) / 3 FROM data_quality
UNION ALL
SELECT 'search_keywords', (search_keywords_completeness +
search_keywords_uniqueness) / 2 FROM data_quality;

The results are:


Recency, Frequency & Monetary Modeling

RFM modeling is a powerful customer segmentation technique that helps businesses understand and predict customer
behavior based on three key metrics:

Recency (R): How recently a customer performed an action (e.g., last purchase, last visit).

Frequency (F): How often the customer performs the action within a specific timeframe.

Monetary (M): How much the customer has spent over a period of time.

In traditional marketing and customer analytics, these metrics help identify high-value customers, predict churn, and
personalize marketing strategies.
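
As a rough illustration of how these three metrics could be derived with Data Distiller SQL, here is a minimal sketch. It reuses the purchase-related field paths that appear later in this section; the column aliases and the choice of purchase events as the action of interest are illustrative assumptions, not a prescribed recipe.

SELECT
    identityMap.ECID.id AS ecid,
    DATEDIFF(CURRENT_DATE, MAX(TO_DATE(timestamp))) AS recency_days,  -- Recency (R): days since the most recent purchase
    COUNT(commerce.order.purchaseID) AS frequency,                    -- Frequency (F): number of purchase events
    SUM(COALESCE(commerce.purchases.value, 0)) AS monetary_value      -- Monetary (M): total purchase value
FROM demo_data_intelligent_services_demo_midvalues
WHERE commerce.order.purchaseID IS NOT NULL
GROUP BY identityMap.ECID.id;

Each row of the result is one customer with their R, F, and M values, which can then be bucketed into segments (for example, quintiles) for targeting.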

Survival Analysis Principles and Propensity Modeling

In Customer AI, we’re tasked with predicting the propensity of an event occurring within the next N days, such as a
customer making a purchase or engaging with a product. At first glance, this might seem like a straightforward
classification problem—did the customer convert or not? However, the underlying mechanics of how we compute this
propensity are deeply influenced by survival analysis principles, even if we’re not explicitly running survival models.

Survival analysis is fundamentally about estimating the probability that an event has not occurred yet by a certain time,
represented by the survival function S(t). In the context of Customer AI, when we calculate the propensity to convert
in the next N days, we’re essentially working with 1−S(N)—the probability that the customer will convert within that
time frame. This is where the illusion comes into play: while we might not explicitly model S(t), the features we
engineer, such as Recency (R) and Frequency (F), are designed to behave as proxies that capture the dynamics of time-
to-event data, just like survival analysis would.

Recency (R) acts as an implicit measure of the time since the last event, closely tied to the hazard function h(t) in
survival analysis, which represents the instantaneous risk of an event occurring at time t. The more recent the
engagement, the higher the implied hazard or conversion risk. Similarly, Frequency (F) reflects the accumulated risk
over time, akin to the cumulative hazard function H(t). Customers with frequent engagements are treated as having a
higher cumulative risk of conversion because their repeated actions signal strong intent.

By feeding R and F into machine learning models like XGBoost, we are essentially embedding these survival-based
risk factors into the model’s decision-making process. The model learns to associate recent, frequent behaviors with
higher propensities to convert, mimicking the effects of survival functions without explicitly modeling them. This
approach allows us to handle large-scale behavioral data efficiently while still leveraging the time-sensitive nature of
customer actions, which is the core strength of survival analysis. In essence, we’re creating an illusion of survival
modeling—using its principles to shape our features and predictions, even though we’re technically solving a
classification problem.
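
To make the analogy explicit, these are the standard survival-analysis relationships that Recency and Frequency act as proxies for:

$$S(t) = \Pr(T > t), \qquad h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}, \qquad H(t) = \int_0^t h(u)\,du, \qquad S(t) = e^{-H(t)}$$

so the propensity reported for a horizon of N days corresponds to

$$\Pr(\text{convert within } N \text{ days}) = 1 - S(N).$$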

A Note on Monetary Value (M)

While Monetary (M) is a critical component of traditional RFM (Recency, Frequency, Monetary) modeling, it is
not used natively in Customer AI. This is because Customer AI is designed to predict future customer behavior,
such as conversions or churn, with a strong emphasis on engagement patterns rather than historical spending.
Behavioral signals like Recency (R) and Frequency (F) are more dynamic and time-sensitive, making them better
aligned with predictive models that rely on survival analysis principles. Additionally, monetary data often suffers from
inconsistency across platforms, especially when customers engage through multiple channels, making it less reliable
for direct inclusion in propensity models.

However, if businesses wish to incorporate Monetary (M) into Customer AI for advanced segmentation, it can be
added as a Profile Attribute. This approach is particularly useful for use cases like lifetime value (LTV) prediction or
revenue-based customer segmentation, where understanding the financial impact of customer behavior is critical. By
complementing the existing propensity models with monetary data, organizations can gain deeper insights into not just
who is likely to convert, but also which customers are likely to bring the most value. This dual-layer analysis helps in
optimizing marketing strategies, resource allocation, and personalized customer engagement.

Attribute Assessment for RFM

A look at the table below shows which of the attributes are suitable and which are not:

✅ Strong (Tracks last page viewed)


✅ Strong (Counts page visits)
✅ Strong (Last product interaction)
✅ Strong (Product interaction counts)
⚠️ Moderate (Recent view may be missing)
⚠️ Moderate (Some views might be missed)
⚠️ Moderate (Recent clicks tracked inconsistently)
⚠️ Moderate (Click counts may be incomplete)
❌ Weak (Incomplete last view tracking)
❌ Weak (Inconsistent counts)
❌ Weak (Missing last open data)
❌ Weak (Sparse event tracking)
❌ Poor (Sparse events, weak recency)
❌ Poor (Low frequency, missing events)
✅ Applicable (Tracks transaction amounts)
❌ Weak (Limited search tracking)
❌ Weak (Few search event records)
❌ Very Poor (Critical purchase data missing)
❌ Very Poor (Few transactions captured)
✅ Applicable (Key for monetary analysis)
❌ Extremely Poor (Unreliable recency data)
❌ Extremely Poor (Event counts unreliable)
Monetary Value as a Profile Attribute

To calculate the Monetary (M) value and add it to the Profile, we do the following.

Based on the JSON structure:

**commerce.productListViews.value**

**commerce.productListRemovals.value**

**commerce.order.purchaseID**

For Monetary (M), we will consider the **commerce.order** section, focusing on:

**purchaseID** (to identify transactions)

**productListItems.SKU** (to track purchased items)

**commerce.purchases.value** (if available) or aggregate values from transactions.

-- Step 1: Extract relevant transaction data
CREATE OR REPLACE VIEW order_data AS
SELECT
    identityMap.ECID.id AS ecid,
    commerce.order.purchaseID AS purchase_id,
    productListItems.SKU AS sku,
    commerce.purchases.value AS purchase_value,
    TO_DATE(_acp_system_metadata.timestamp) AS purchase_date
FROM demo_data_intelligent_services_demo_midvalues
WHERE commerce.order.purchaseID IS NOT NULL;

-- Step 2: Aggregate the total monetary value per user
CREATE OR REPLACE VIEW monetary_aggregation AS
SELECT
    ecid,
    SUM(CASE WHEN purchase_value IS NOT NULL THEN purchase_value ELSE 0 END) AS total_monetary_value
FROM order_data
GROUP BY ecid;

-- Step 3: Create the profile table to store monetary value
CREATE TABLE IF NOT EXISTS adls_profile_monetary (
    ecid TEXT PRIMARY IDENTITY NAMESPACE 'ECID',
    total_monetary_value DECIMAL(18, 2)
)
WITH (LABEL = 'PROFILE');

-- Step 4: Insert aggregated data into the profile table
INSERT INTO adls_profile_monetary
SELECT
    STRUCT(
        ecid,
        total_monetary_value
    ) AS profile_data
FROM monetary_aggregation;

Download the following file:

by following the steps here:

-- Step 1: Extract relevant transaction data


CREATE OR REPLACE VIEW order_data AS
SELECT
ECID AS ecid,
purchaseID AS purchase_id,
SKU AS sku,
purchase_value AS purchase_value,
TO_DATE(timestamp) AS purchase_date
FROM commerce_data
WHERE purchaseID IS NOT NULL;

-- Step 2: Aggregate the total monetary value per user


CREATE OR REPLACE VIEW monetary_aggregation AS
SELECT
ecid,
SUM(CASE
WHEN purchase_value IS NOT NULL THEN purchase_value
ELSE 0
END) AS total_monetary_value
FROM order_data
GROUP BY ecid;

-- Step 3: Create the profile table to store monetary value


CREATE TABLE IF NOT EXISTS adls_profile_monetary (
ecid TEXT PRIMARY IDENTITY NAMESPACE 'ECID',
total_monetary_value DECIMAL(18, 2)
)
WITH (LABEL = 'PROFILE');

-- Step 4: Insert aggregated data into the profile table


INSERT INTO adls_profile_monetary
SELECT
STRUCT(
ecid,
total_monetary_value
) AS profile_data
FROM monetary_aggregation;

In Step 3, the SQL code creates a table named adls_profile_monetary to store the aggregated monetary values
for each customer. The **ecid** (Experience Cloud ID) serves as the primary identifier, ensuring each customer’s
data remains unique within the ‘ECID’ namespace. This is critical where identity resolution and profile unification
rely on consistent identifiers. The **total_monetary_value** column captures the cumulative spending of
each customer, formatted as a decimal to handle currency values accurately. The WITH (LABEL = 'PROFILE')
clause designates the table as part of the Real-Time Customer Profile, enabling seamless integration with audience
segmentation, personalization, and activation workflows.

In Step 4, the aggregated data from the monetary_aggregation view is inserted into the newly created profile
table. The **STRUCT** function packages the ecid and its corresponding total_monetary_value into a
structured format compatible with profile-based systems. This approach ensures that monetary values are not just
stored but are readily available for real-time analytics and targeting. By centralizing this data at the profile level,
marketers can effortlessly identify high-value customers, create personalized offers, and drive data-driven marketing
strategies based on customers’ historical spending behavior.

Schema has standard field groups that resemble those in Adobe Analytics schema. Some of these standard field groups
will be used by Customer AI.

A simple SELECT query does not reveal much.

commerce fieldgroup details

Last error code that says no viable alternative at input

There could be data quality issues that we need to investigate

Data Quality Score of the Fields

https://data-distiller.all-stuff-data.com/unit-8-data-distiller-statistics-and-machine-learning/statsml-800-turbocharging-
insights-with-data-distiller-a-hypercube-approach-to-big-data-analytics * * *
Ingest the following CSV files

by following the steps in the tutorial below

E-commerce platforms generate an overwhelming amount of interaction data daily, capturing every click, view, and
purchase. Understanding user behavior across product categories is essential for tailoring promotions, uncovering
preferences, and ultimately driving revenue. However, the sheer scale of this data creates significant challenges,
particularly when attempting to process it efficiently for actionable insights.

Data Distiller offers a solution by simplifying big data analysis through its use of hypercubes and sketches. These
advanced tools enable efficient aggregation and processing of data, drastically reducing computational overhead. In
this case study, we leverage Data Distiller to achieve three key objectives: counting distinct users per category,
analyzing user behavior across multiple categories, and merging data efficiently without the need to reprocess
historical datasets.

Analyzing e-commerce data requires addressing fundamental questions: How many unique users interacted with each
category? What patterns emerge in cross-category behaviors? And how can insights be delivered without repeatedly
recalculating metrics? Traditional systems fall short in this regard, often requiring the re-reading of raw data and
recalculating metrics, which is both time-intensive and resource-heavy.

By utilizing hypercubes, Data Distiller overcomes these inefficiencies. It employs probabilistic data structures, such as
sketches, to create compact, efficient summaries of datasets. This approach not only accelerates processing but also
ensures scalability, allowing organizations to focus on driving insights and delivering value to their users.

Understanding the Dataset

The dataset represents simulated user interactions on an e-commerce platform, capturing a broad range of activity from
100 unique users over the course of November 2024. Each user, identified by a unique **user_id** (e.g., U1, U2),
engages with the platform through multiple actions, interacting with products across various categories. These
categories, including Electronics, Apparel, Home Goods, Books, and Beauty, reflect common e-commerce offerings
and provide a foundation for analyzing user preferences and behaviors.

Each interaction is tied to a specific **product_id** (e.g., **P101**, **P102**), enabling detailed tracking
of user-product engagements. The **interaction_time** field, recorded as a timestamp, offers insights into
when these interactions occur, revealing temporal patterns such as peak shopping hours or specific dates of increased
activity. The dataset spans the entire month, providing a comprehensive view of user activity over time.

User actions are categorized into three **interaction_type**s: view, purchase, and cart_add. These types
represent the customer journey, from initial product exploration to the decision to buy or save an item for later. By
capturing these diverse actions, the dataset enables a deeper understanding of customer intent and conversion rates
across different product categories.

This rich dataset is ideal for exploring questions such as: How many unique users interacted with each category?
Which products or categories drive the most purchases? Are there patterns in user behavior across different times of
the day or days of the week? It provides a solid foundation for analytics, segmentation, and predictive modeling,
making it a valuable resource for developing strategies to enhance customer engagement and drive revenue.

Schema looks like the following:

user_id: Unique identifier for users.

product_id: Identifier for products.

category: Product category.


interaction_time: Timestamp of interaction.

interaction_type: Type of interaction (e.g., view, purchase).

HyperLogLog Sketches: The Key to Scalable and Efficient Big Data Insights on Unique Counts

Cardinality-based insights are critical for understanding true audience reach, optimizing resource allocation, and
driving personalized user engagement. However, deriving these insights from traditional methods can be prohibitively
expensive in terms of both computation and storage. This is where HyperLogLog (HLL) sketches come into play,
revolutionizing how businesses calculate cardinality by offering a fast, scalable, and cost-efficient solution.

Traditional methods for computing cardinality involve storing and processing raw data to identify unique elements.
For example, counting unique users interacting with our e-commerce platform over multiple campaigns requires
repeatedly scanning through massive datasets, de-duplicating entries, and aggregating results. This approach demands
substantial computational power and storage resources, which scale poorly as datasets grow. As a result, businesses
face escalating infrastructure costs, slower query execution times, and significant delays in delivering actionable
insights.

Additionally, traditional systems struggle with real-time analytics. To answer a simple question like, “How many
unique users engaged with Campaign A over the last 30 days?” businesses must process historical data alongside new
interactions, often leading to inefficiencies and delays.

HyperLogLog (HLL) sketches are a probabilistic data structure designed to estimate the cardinality, or the number of
unique elements, within a dataset. Unlike traditional methods that store and process every individual element to
compute distinct counts, HLL sketches use a compact representation that drastically reduces memory requirements.
They achieve this efficiency by using hash functions to map dataset elements to binary values and then analyzing the
patterns of trailing zeroes in these hashed values. The longer the sequence of trailing zeroes, the rarer the element,
which provides a statistical basis for estimating the overall cardinality. The resulting HLL sketch is a small, fixed-size
data object that can represent billions of unique items with a high degree of accuracy.
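
In rough terms (a simplified sketch of the underlying math, not Data Distiller's exact implementation): if the hash values are uniformly distributed and the longest run of trailing zeroes observed in a register is $k$, then roughly $2^k$ distinct elements have passed through that register. HyperLogLog maintains $m$ such registers $M_1, \dots, M_m$ and combines them with a bias-corrected harmonic mean:

$$E = \alpha_m \, m^2 \left( \sum_{j=1}^{m} 2^{-M_j} \right)^{-1}$$

where $\alpha_m$ is a small bias-correction constant. This is what keeps the estimate accurate while the sketch itself stays only a few kilobytes in size.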

One of the key benefits of HLL sketches is their remarkable efficiency in handling large-scale datasets. Because the
size of the sketch remains constant regardless of the dataset’s size, they are highly scalable and suitable for big data
applications. This efficiency makes them particularly valuable for systems that need to process streaming data or
perform real-time analytics, as they can quickly update the sketch with minimal computational overhead.

Another significant advantage of HLL sketches is their ability to support operations like merging. By combining two
or more sketches, it is possible to estimate the unique count of a union of datasets without accessing the original data.
This property is incredibly useful in distributed systems where data is processed in parallel across multiple nodes. HLL
sketches enable these systems to efficiently consolidate results and provide global insights with minimal
communication overhead.

Use Case: Campaign Uniques Across Date Ranges

In marketing, one of the fundamental metrics is understanding how many unique users engage with a campaign over
specific date ranges. Traditional methods of calculating unique users require processing raw data repeatedly, which
becomes computationally expensive and slow as data scales. HyperLogLog (HLL) sketches provide a solution by
offering compact and efficient cardinality estimation.

For example, consider a scenario where a campaign spans multiple weeks, and the goal is to understand unique user
engagement week-by-week or across the entire campaign period. By leveraging HLL sketches, a sketch is created for
each week’s user interactions. These sketches, which represent the unique users for each week, can be stored and later
merged to estimate the total number of unique users for the entire campaign without requiring access to the original
data. This capability is particularly valuable for real-time reporting, as it eliminates the need to reprocess historical
data whenever new information is added.
Furthermore, HLL sketches can be used to compare user engagement across date ranges. For instance, you might want
to see how many users who interacted with the campaign in the first week returned in subsequent weeks. This overlap
analysis becomes seamless with sketches, as you can compute intersections and unions of sketches across different
periods to reveal trends, retention rates, and campaign effectiveness. These insights allow marketers to fine-tune their
strategies, optimize engagement, and measure campaign ROI efficiently.
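
A hedged sketch of what this week-by-week pattern could look like in Data Distiller SQL, assuming a weekly sketch table and a merge aggregate along the lines of **hll_merge_agg** (the table names, the week bucketing, and the exact merge/estimate function names are assumptions for illustration, not confirmed syntax):

-- Build one sketch of unique users per campaign week (assumed table and column names)
CREATE TABLE weekly_campaign_sketches AS
SELECT
    campaign_id,
    DATE_TRUNC('week', interaction_time) AS campaign_week,
    hll_build_agg(user_id, 10) AS weekly_user_sketch
FROM campaign_interactions
GROUP BY campaign_id, DATE_TRUNC('week', interaction_time);

-- Merge the weekly sketches to estimate uniques across the whole campaign
-- without re-reading the raw interaction data (merge/estimate function names assumed)
SELECT
    campaign_id,
    hll_estimate(hll_merge_agg(weekly_user_sketch)) AS estimated_unique_users
FROM weekly_campaign_sketches
GROUP BY campaign_id;

The key point is that only the compact weekly sketches are touched at query time; the raw event data never needs to be rescanned.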

Use Case: Microsegments Along Various Dimensions

Segmentation is critical in personalized marketing, where campaigns are tailored to specific subsets of users based on
their characteristics or behaviors. Microsegmentation takes this concept further, dividing users into highly granular
groups based on multiple dimensions such as location, product preferences, device type, and interaction type.
Calculating metrics like unique users for these microsegments can quickly become unmanageable as the number of
dimensions and their combinations increase.

HyperLogLog sketches enable efficient microsegmentation by allowing unique counts to be computed along multiple
dimensions without recalculating from raw data. For example, an e-commerce platform might create HLL sketches for
users who viewed products, added them to the cart, or made a purchase, segmented by categories like “Electronics,”
“Apparel,” or “Books.” These sketches can then be further segmented by other dimensions such as geographical
regions or device types. Marketers can instantly estimate the number of unique users in any segment or combination of
segments without additional processing.

In practice, this allows businesses to identify high-value microsegments, such as users in a specific region who
frequently purchase a particular product category. Additionally, HLL sketches can help track microsegment growth
over time or analyze overlaps between segments, such as users who interact with multiple categories. By unlocking
insights at this granular level, businesses can deliver hyper-targeted campaigns, enhance user experiences, and
maximize conversion rates while maintaining scalability and efficiency in their data operations.
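
As a rough illustration, assuming a **user_interactions** table that also carries **region** and **device_type**
columns (assumptions for this example), sketches can be built at the finest grain and then rolled up along any subset
of dimensions with the merge functions covered later in this section:

CREATE TABLE microsegment_sketches AS
SELECT
category,
region,
device_type,
hll_build_agg(user_id, 10) AS user_sketch
FROM
user_interactions
GROUP BY
category,
region,
device_type;

-- Roll up to unique users per category and region, across all device types
SELECT
category,
region,
hll_merge_count_agg(user_sketch) AS unique_users
FROM
microsegment_sketches
GROUP BY
category,
region;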

Use Case: Understanding True Audience Reach

In marketing, knowing the total number of unique users engaging with a campaign provides a clear picture of its actual
reach. Without cardinality, repeated interactions from the same users might inflate metrics, leading to an
overestimation of success. By accurately measuring unique engagements, businesses can assess the effectiveness of
their campaigns, allocate resources more effectively, and ensure they are reaching the intended audience.

For instance, a campaign may generate 1 million clicks, but if only 100,000 unique users are responsible for those
clicks, it indicates a concentration of activity among a small audience. This insight might prompt marketers to expand
their targeting strategies to reach a broader demographic.
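
A minimal sketch of this comparison, assuming a hypothetical **campaign_clicks** table with one row per click and a
**user_id** column, and assuming **hll_build_agg** can be nested inside **hll_estimate** the same way
**hll_merge_agg** is later in this section:

SELECT
COUNT(*) AS total_clicks,
hll_estimate(hll_build_agg(user_id, 10)) AS unique_users
FROM
campaign_clicks;

Dividing total_clicks by unique_users gives a quick concentration ratio; in the example above it would be
1,000,000 / 100,000 = 10 clicks per unique user.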

Create HyperLogLog (HLL) Sketches

To calculate distinct users for each category, we’ll aggregate interactions using the **hll_build_agg** function.
This function creates a compact sketch for estimating unique users.

CREATE TABLE category_sketches AS


SELECT
category,
hll_build_agg(user_id, 10) AS user_sketch
FROM
user_interactions
GROUP BY
category;

SELECT * FROM category_sketches;


This SQL query creates a new table named **category_sketches** to store compact representations of unique
user interactions with different product categories. It groups the data from the existing **user_interactions**
table by the category column and applies the **hll_build_agg** function to the user_id column within
each category. Additionally, the query specifies a parameter for the **hll_build_agg** function, which defines
the precision of the HyperLogLog (HLL) sketch by setting the number of buckets used in the estimation. The HLL
sketch, a probabilistic data structure, efficiently estimates the number of unique users (cardinality) in each category
without storing or scanning all individual user IDs.

The resulting table, **category_sketches**, contains two columns: **category**, which identifies the
product category, and user_sketch, which holds the HLL sketch for that category, configured with the specified
precision level. By adjusting the parameter, the query balances accuracy and memory efficiency, making it adaptable
for different use cases. This approach reduces data size and enables scalable, cost-effective cardinality calculations for
insights such as audience reach or engagement patterns across categories.

Creation of the HLL sketch column in the table looks like the following in DBVisualizer:

This is what the result looks like after executing a**SELECT** query on the resulting dataset:

In this query result, the column labeled USER_SKETCH contains HyperLogLog (HLL) sketches, which are compact
probabilistic representations of the unique users interacting within each category. These sketches are generated by the
**hll_build_agg** function applied to the **user_id** column during the query.

Each sketch encodes the distinct user IDs for the corresponding CATEGORY (e.g., “Home Goods,” “Apparel”). The
encoded string in the **USER_SKETCH** column is not raw data but a fixed-size structure that estimates the
cardinality (number of unique user IDs) efficiently. This enables large-scale datasets to be summarized in a memory-
efficient manner, as the size of each sketch remains small regardless of the number of users in the category.

These sketches can be used in subsequent queries to quickly calculate the estimated unique user counts
(**hll_estimate**), combine sketches from different categories (**hll_merge_agg**), or analyze overlaps
between categories. This approach avoids repeatedly processing raw data, reducing computational cost and time while
maintaining accuracy for decision-making.

All Data Distiller SQL queries for creation, merging, and estimating unique counts are fully functional across both the
Data Lake and the Data Distiller Warehouse, also known as the Accelerated Store.

At present, sketch columns are immutable and cannot be updated after creation. However, future updates are expected
to introduce functionality that allows for updating existing sketch columns. This enhancement will enable more
effective handling of scenarios such as missed processing runs or late-arriving data, ensuring greater flexibility and
accuracy in data management workflows.

Sometimes, you want to build a single HLL sketch that combines multiple unique identifiers from the same dataset.
For example:

In a multi-channel marketing context, you might want to track a user’s unique interactions across email, app, and
web by combining **email_id**, **app_user_id**, and **web_cookie_id** into a single sketch.

In Adobe Real-Time Customer Data Platforms, where users have multiple identifiers, combining these into a single
sketch ensures accurate cardinality estimation across different data sources.

If our dataset includes **email_id**, **app_user_id**, and **webcookie_id** instead of a guaranteed
**user_id**, you can use the **COALESCE** function to ensure that at least one non-null identifier is used for
generating the HLL sketch:

CREATE TABLE category_sketches AS
SELECT
category,
hll_build_agg(COALESCE(email_id, app_user_id, webcookie_id), 10) AS user_sketch
FROM
user_interactions
GROUP BY
category;

Configuration Parameters in HLL Sketches

If you look at the code for **hll_build_agg** above, you will observe that it has a configuration parameter of
10. If you do not specify this value, the default value of 12 is chosen.

hll_build_agg(user_id, 10)

The configuration parameter specifies the log-base-2 of the number of buckets (K) used in the HLL sketch. Buckets are
the internal data structures within the sketch used to estimate cardinality. Increasing the parameter increases the
number of buckets, improving the precision of the cardinality estimate but also requiring more memory to store the
sketch. The total number of buckets **K** is calculated as

K = 2^{\text{parameter}}

The valid range of the parameter is from 4 to 12.

Minimum Value: 4 (16 buckets, low precision, very memory efficient). Lower values are sufficient for
exploratory analysis.

Maximum Value: 12 (4096 buckets, high precision, higher memory usage). A high value may be required for
highly sensitive financial or compliance reporting.

Confidence Intervals in HLL

In HLL sketches, the confidence interval is the range within which the true cardinality is expected to fall, given the
estimated value. The size of this range is inversely proportional to K, the number of buckets. In simpler terms:

As K increases, the confidence interval becomes narrower, meaning the estimate is more precise.

A smaller K results in a wider confidence interval, meaning the estimate is less precise but requires less memory.

The confidence interval for HLL typically follows a standard format, such as:

\text{Relative Error} \approx \frac{1.04}{\sqrt{K}}

Implications of K for Confidence Intervals:

1. Higher value of K (e.g. parameter value of 12):

K=4096 implies that the relative error is 0.016 (or 1.6%).

The estimate will have a tight confidence interval, making it highly reliable.

This configuration is useful for scenarios requiring high precision, such as compliance reporting or sensitive
financial analytics.

2. Lower value of K (e.g. parameter value of 10):

K=1024 implies that the relative error increases to 0.032 (or 3.2%).

The confidence interval is slightly wider, making the estimate less precise but still sufficient for general
analytics.
This setup is memory-efficient and suitable for exploratory or real-time analytics where speed is prioritized
over absolute precision.
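
As a small worked example of the formula above (the estimate of 500,000 users is an illustrative number, not taken
from this tutorial's dataset):

\hat{N} = 500{,}000, \quad K = 2^{10} = 1024, \quad \text{Relative Error} \approx \frac{1.04}{\sqrt{1024}} \approx 0.0325

\hat{N} \pm 0.0325\,\hat{N} \approx 500{,}000 \pm 16{,}250 \text{ (roughly one standard error)}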

Estimate Distinct User Counts

The **hll_estimate** function calculates the estimated number of unique users for each category.

SELECT
category,
hll_estimate(user_sketch) AS distinct_users
FROM
category_sketches;

The result is:

If we had executed the above query in the old-fashioned way:

SELECT
category,
COUNT(DISTINCT user_id) AS distinct_users
FROM
user_interactions
GROUP BY
category;

The results are nearly identical here because the dataset is small; the efficiency advantage of sketches becomes far
more pronounced as the scale of the dataset increases.

Merge Sketches for Cross-Dimensional Analysis

Our use case is to calculate the total unique users across all categories. Instead of recomputing the distinct counts from
raw data, we can use a merge function like **hll_merge_agg**, which deduplicates the unique IDs across each
of these dimensions efficiently.

This query is specifically designed to merge the HyperLogLog (HLL) sketches from all the categories (e.g., “Home
Goods,” “Apparel,” “Books,” “Beauty,” and “Electronics”) into a single, compact sketch. This merged sketch
represents the estimated total unique users across all categories combined, ensuring that users appearing in multiple
categories are only counted once.

To analyze behavior across categories, **hll_merge_agg** allows us to combine individual category-level
sketches into a single sketch that maintains cardinality estimates without requiring access to the raw data. This
approach is computationally efficient and scalable, making it ideal for handling large datasets or performing cross-
category audience analysis.

SELECT
hll_merge_agg(user_sketch) AS merged_sketch
FROM
category_sketches;

The result looks like the following:

Estimate Overall Distinct Users


Our use case is to calculate the total number of distinct users across all categories while directly deriving the final
estimated count. Instead of merging sketches and performing an additional estimation step, we can use the
**hll_merge_count_agg** function, which not only combines the HyperLogLog (HLL) sketches from each
category but also calculates the estimated total number of unique users in one step.

This query efficiently aggregates the HLL sketches from all categories (e.g., “Home Goods,” “Apparel,” “Books,”
“Beauty,” and “Electronics”), deduplicating unique IDs across these categories and directly returning the estimated
count of distinct users. By using **hll_merge_count_agg**, we streamline the process of combining category-
level sketches while avoiding overcounting users who interacted with multiple categories.

The function simplifies cross-category analysis by eliminating the need for a separate **hll_estimate** step
after merging. This makes it ideal for scenarios where the primary objective is to retrieve the final count of unique
users across all dimensions with minimal processing overhead, ensuring accuracy and scalability for large datasets.

SELECT
hll_merge_count_agg(user_sketch) AS total_distinct_users
FROM
category_sketches;

The result looks like the following:

Two Approaches and Their Tradeoffs

Approach 1: **hll_merge_agg** + **hll_estimate**

SELECT
hll_estimate(
hll_merge_agg(user_sketch)
) AS total_distinct_users
FROM
category_sketches;

The result will be:

When to use this approach

This approach is more flexible because the merged sketch can be reused for additional operations (e.g., further
aggregations, intersections, or unions with other sketches) beyond just estimating the cardinality.

It is ideal if you need both the merged sketch for downstream use and the estimated count.

Approach 2: **hll_merge_count_agg**

SELECT
hll_merge_count_agg(user_sketch) AS total_distinct_users
FROM
category_sketches;

The result will be:

When to use this approach:

This approach is more streamlined and efficient when the goal is solely to get the final estimated count of distinct
users.

It avoids creating an intermediate merged sketch, saving processing time and memory if the merged sketch is not
needed for further analysis.

Flexibility: The **hll_merge_agg** + **hll_estimate** approach provides an intermediate sketch
(merged_sketch) that can be reused, offering more flexibility for additional operations. In contrast,
**hll_merge_count_agg** is a one-step solution that calculates the count without producing a reusable
sketch.

Efficiency: If your goal is just the final distinct count, **hll_merge_count_agg** is more efficient
because it combines merging and estimation in a single operation.

Reusability: If further operations (e.g., intersections, unions, or additional merges) are needed with the combined
data, **hll_merge_agg** is preferred because it generates a reusable merged sketch.

Both approaches yield the same estimated result when the goal is only to calculate the total number of distinct users.
However, **hll_merge_agg** is more versatile, while **hll_merge_count_agg** is optimized for
simplicity and efficiency when no further operations are required. Your choice depends on whether you need the
intermediate sketch for additional analysis.
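
One such further operation is overlap analysis. HLL sketches do not expose an intersection directly, but the overlap
between two categories can be approximated with the inclusion-exclusion principle, |A ∩ B| ≈ |A| + |B| − |A ∪ B|,
using only the functions shown above. Treat the following as a hedged sketch rather than a dedicated intersection
API, and note that the error of the difference can be large when the true overlap is small:

SELECT
(SELECT hll_estimate(user_sketch) FROM category_sketches WHERE category = 'Apparel')
+ (SELECT hll_estimate(user_sketch) FROM category_sketches WHERE category = 'Books')
- (SELECT hll_merge_count_agg(user_sketch) FROM category_sketches WHERE category IN ('Apparel', 'Books'))
AS estimated_overlap;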

Incremental Additions to the Dataset

As previously mentioned, sketch columns are immutable and cannot be modified after they are created. However, new
rows containing sketch columns can be added, and aggregations can be performed on these rows to incorporate the
new data into the analysis.

-- Insert new rows


INSERT INTO category_sketches
SELECT
category,
hll_build_agg(user_id, 10) AS user_sketch
FROM
new_interactions
GROUP BY
category;

-- Examine the dataset


SELECT * FROM category_sketches;

-- Now execute the merge


SELECT
category,
hll_merge_count_agg(user_sketch) AS updated_distinct_users
FROM
category_sketches
GROUP BY
category;

The **SELECT** query will show multiple rows:

The aggregate count query shows the following. Ensure that you use the **GROUP BY** clause, since you now have
multiple rows with the same category name:

Ensure that the configuration parameter for bucketing, i.e., **K**, remains consistent across all **INSERT** and
**CREATE** queries. This is crucial because the **merge** and **aggregation** functions require all
sketches to have the same number of buckets in order to work correctly. Inconsistent bucketing configurations will
result in errors during these operations.

Best Practice with Incremental Additions

To effectively manage and track data updates when creating new rows with aggregates, it’s important to include a
timestamp column that records the day of processing. This timestamp ensures that each new block of data can be tied
to its processing date, enabling better traceability, data auditing, and incremental updates. By recording the processing
date, you can differentiate between historical and newly added data, making it easier to debug, analyze trends, and
optimize queries. This approach is especially useful in scenarios where data arrives in batches or where late-arriving
data needs to be incorporated incrementally.

You will need to rewrite the query the following way and execute it block by block:

-- Create the empty dataset first


CREATE TABLE category_sketches AS
SELECT
CAST(NULL AS STRING) AS category,
CAST(NULL AS STRING) AS user_sketch,
CAST(NULL AS TIMESTAMP) AS processing_date
WHERE FALSE;

-- Insert backfill data with a processing timestamp


INSERT INTO category_sketches
SELECT
category,
hll_build_agg(user_id) AS user_sketch,
CAST(NOW() AS TIMESTAMP) AS processing_date
FROM
user_interactions
GROUP BY
category;

-- Insert new rows with a processing timestamp


INSERT INTO category_sketches
SELECT
category,
hll_build_agg(user_id) AS user_sketch,
CAST(NOW() AS TIMESTAMP) AS processing_date
FROM
new_user_interactions
GROUP BY
category;

-- Examine the dataset


SELECT * FROM category_sketches;

-- Now execute the merge


SELECT
category,
hll_merge_count_agg(user_sketch) AS updated_distinct_users
FROM
category_sketches
GROUP BY
category;

The results of the **SELECT** will be:

The aggregation will yield the same result:


https://data-distiller.all-stuff-data.com/unit-8-data-distiller-statistics-and-machine-learning/statsml-700-sentiment-
aware-product-review-search-with-retrieval-augmented-generation-rag * * *

1. Unit 8: DATA DISTILLER STATISTICS & MACHINE LEARNING

STATSML 700: Sentiment-Aware Product Review Search with Retrieval-Augmented Generation (RAG)

This tutorial demonstrates how to implement a Retrieval-Augmented Generation (RAG) architecture using Python,
LangChain and Hugging Face Transformers.

This tutorial illustrates how to prototype advanced AI systems locally using Hugging Face Transformers, FAISS, and
Python, creating a structured framework for building, testing, and iterating on solutions that integrate retrieval-
augmented generation (RAG) and sentiment analysis capabilities. By shifting to local processing, this approach
significantly reduces costs, ensures privacy, and removes reliance on external APIs. Hugging Face’s open-source
models enable Data Distiller users to overcome complex implementation challenges and develop functional
prototypes efficiently, all while keeping sensitive data within their infrastructure. This approach is particularly
valuable for privacy-conscious organizations and cost-sensitive projects.

By leveraging Hugging Face’s modular tools and pretrained models, you can refine specific components of the system,
such as document retrieval accuracy or sentiment-aware response generation, without starting from scratch. This
accelerates the validation process, enabling iterative improvements and rapid feedback loops. Local prototyping with
Hugging Face not only reduces reliance on external APIs, which often incur ongoing costs, but also provides greater
control over data flow, ensuring compliance with privacy regulations.

The sentiment-aware RAG tutorial showcases how Python’s ecosystem and Hugging Face Transformers enable
seamless integration of sentiment metadata into retrieval and response pipelines. This local-first solution fosters
innovative applications across domains, from financial sentiment analysis to product reviews and customer feedback
categorization. Hugging Face’s pretrained models make it easy to extend this framework to specific industries,
unlocking new possibilities without significant investment in computational resources. With Hugging Face’s accessible
tools and Python’s versatility, businesses can rapidly visualize, test, and deploy solutions that provide actionable
insights while maintaining cost efficiency and data security.

In the e-commerce industry, providing an intuitive and engaging product search experience is critical for customer
satisfaction and conversion rates. Customers often rely on product reviews to make informed purchasing decisions but
are overwhelmed by the volume of unstructured feedback. This case study demonstrates how a sentiment-aware
Retrieval-Augmented Generation (RAG) system can transform the product search experience by enabling
conversational, sentiment-driven insights directly on the website.

Customers exploring a product catalog often have specific questions that require dynamic and detailed answers.
Traditional search solutions, like keyword-based search bars, fail to provide nuanced responses and leave users
frustrated. For example:

A customer might ask, “What do customers think about the durability of this product?” but only receive a list of
generic reviews without context.

Another user searching for negative reviews about battery life may struggle to filter out irrelevant or overly
positive results.

Beginners looking for summarized feedback might find the sheer number of reviews overwhelming.

To address these pain points, we need a solution that can:

1. Retrieve relevant reviews quickly and efficiently.

2. Analyze and incorporate sentiment to prioritize or filter feedback.

3. Provide conversational, natural language responses that summarize customer insights.

RAG Setup and Architecture

Setup Phase (Steps 1-4): Preparing the Data

1. Generate Embeddings for Reviews: The reviews (text data) are passed through a pre-trained embedding model,
such as **all-MiniLM-L6-v2**. This model converts the reviews into numerical vector representations,
known as embeddings. These embeddings capture the meaning of the reviews in a way that enables comparison
and similarity detection.

2. Store Embeddings in a FAISS (Facebook AI Similarity Search) Vector Database: The generated
embeddings are stored in a FAISS vector database. FAISS indexes these embeddings to enable efficient similarity
searches. Each embedding represents a review and is indexed by its unique ID.

3. Include Metadata for Reviews: Metadata, such as sentiment or an ID for each review, is paired with the review
content to form documents. These documents are stored in an in-memory data store. This step ensures that each
embedding in the FAISS database is linked to the corresponding review details.

4. Set Up a Link Between Embeddings and Metadata: A mapping is created between the FAISS vector index and
the document store, ensuring that the vector representation (embeddings) can be matched with the original review
content and metadata. This mapping enables retrieval of relevant context during a search.

RAG Phase (Steps 5-9): Processing a Query

1. Generate Embeddings for the Query: When a question (query) is asked, it is converted into an embedding
using the same model (**all-MiniLM-L6-v2**). This step ensures the query is represented in the same
vector space as the reviews, enabling effective comparison.

2. Find Similar Reviews: The query embedding is compared against the embeddings in the FAISS vector database.
FAISS uses Euclidean distance to identify the most similar reviews. This step narrows down the search to the
most relevant matches.
3. Retrieve Review Content: The IDs of the top matches from FAISS are used to fetch the corresponding
documents (review content and metadata) from the InMemoryDocstore. This step ensures that the retrieved
results include both the vectorized data and the human-readable review content.

4. Use an LLM to Generate an Answer: The retrieved reviews are passed to a language model (LLM) for
contextual understanding. The LLM processes these documents, understands their content, and generates a
response based on the question.

5. Deliver the Final Answer: The LLM outputs the final answer to the query. This answer is grounded in the
context of the retrieved reviews, ensuring it is relevant and informed.

6. Download the dataset and ensure it is located in the same working directory where your Python script is running.

7. Make sure Python is installed on your machine.

8. Install Hugging Face Transformers from the Terminal.
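
The standard pip command for this step (assuming a typical pip-based Python environment) is:

pip install transformers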

If you have JupyterLab running, you will need to restart it so that it can recognize these libraries. Go to the Terminal
window, press Ctrl+C to kill the process, and relaunch by typing **jupyter lab** at the command prompt.

Hugging Face provides a robust ecosystem for working with machine learning models, particularly for natural
language processing (NLP). It offers:

1. Pre-trained Models: Hugging Face hosts thousands of models (e.g., GPT-2, BERT, T5) for tasks like text
generation, translation, sentiment analysis, and more.

2. Transformers Library: The **transformers** library simplifies loading and using these models with pre-
built **pipeline** functions, so you can perform tasks with minimal code.

3. Flexibility: You can fine-tune models for specific use cases or use them as-is.

4. Make sure you have installed the following as well from the Terminal

pip install -U langchain faiss-cpu vaderSentiment langchain-community sentence-transformers pandas numpy

LangChain is a framework designed for integrating language models into complex, multi-step workflows. It enables:

1. Chains: Sequences of tasks, such as retrieving documents, processing context, and generating responses.

2. Vector Stores: Storing and searching through text embeddings for efficient document retrieval.

3. Retrieval-Augmented Generation (RAG): Combining retrieval and generation, so models can answer queries
using both context and generation capabilities.

4. Interoperability: LangChain wraps external tools (like Hugging Face models) into its ecosystem for seamless
integration.

FAISS (Facebook AI Similarity Search) is a library designed to efficiently handle vector similarity searches and
clustering of large datasets. When used with Hugging Face and LangChain, FAISS acts as the retrieval backbone for
managing and searching through vector embeddings.

Hugging Face Transformers Library

Hugging Face Transformers is an open-source library that provides access to a wide variety of pretrained transformer
models, including BERT, GPT, and T5, among others. It is a versatile tool for tasks such as text generation,
classification, question answering, and embeddings, making it a powerful alternative to OpenAI’s closed ecosystem.

One key advantage of Hugging Face Transformers is its cost-effectiveness; since models can be run locally without
relying on APIs, businesses save on recurring cloud costs and avoid rate limits. Additionally, using Hugging Face
Transformers locally ensures data privacy, as no sensitive information needs to leave the organization’s
infrastructure. This feature is especially valuable for industries with strict compliance requirements, such as
healthcare or finance.

Here are the key Hugging Face models ideal for marketing applications, such as customer sentiment analysis,
personalized recommendations, and content creation:

GPT-2: Suited for text generation tasks.

BERT: Ideal for understanding tasks like question answering, sentiment analysis, and classification.

T5: Versatile for both text generation and understanding tasks, following a text-to-text framework.

GPT-2 (Generative Pre-trained Transformer 2)

Content Generation: Generate engaging ad copy, product descriptions, and blog posts.

Chatbots: Power conversational AI for customer service and lead nurturing.

Personalized Messaging: Craft tailored email content or social media posts.

We will be using the basic GPT-2 (117M parameter model) in this tutorial.

Give this a try:

from transformers import pipeline

# Load a local LLM model
generator = pipeline("text-generation", model="gpt2")
result = generator("What is the capital of France?", max_length=50, num_return_sequences=1)
print(result[0]["generated_text"])

You should get the following:

GPT-2 is a powerful language model that excels in generating coherent text but has several limitations. It is
computationally intensive, especially in larger versions, requiring significant memory and processing power, which
can hinder deployment on resource-constrained devices. GPT-2 struggles with understanding long-term dependencies
in extended texts, limiting its effectiveness with very long documents. Without proper fine-tuning, it may
underperform in domain-specific tasks due to a lack of specialized vocabulary understanding. Additionally, GPT-2 can
produce grammatically correct but factually incorrect or nonsensical outputs because it lacks true reasoning
capabilities, and it may reflect biases present in its training data, necessitating careful evaluation and post-processing
in sensitive applications.

BERT (Bidirectional Encoder Representations from Transformers)

Sentiment Analysis: Analyze customer reviews, social media sentiment, or survey responses.

Search Optimization: Improve product search by understanding query intent and context.

Customer Segmentation: Classify and cluster customers based on behavior or preferences.


Give this a try and see how the answer is different:

from transformers import pipeline

# Load a BERT model for question answering
qa_pipeline = pipeline("question-answering", model="bert-base-uncased")

context = "Paris is the capital and most populous city of France."
question = "What is the capital of France?"

result = qa_pipeline(question=question, context=context)
print(result['answer'])

BERT is not designed for open-ended text generation. It excels in understanding and processing existing text. For
BERT to answer questions, it needs a context passage to extract the answer from.

T5 (Text-to-Text Transfer Transformer)

Versatility: Converts any NLP problem into a text-to-text task, enabling tasks like summarization, translation,
and text generation.

Automated Summaries: Create concise summaries of customer feedback or lengthy reports.

Multi-lingual Content: Generate marketing content or summaries in different languages.

The T5 model in the snippet below requires more setup compared to GPT-2 in the snippet above because T5 is a task-
specific sequence-to-sequence model designed to handle multiple NLP tasks, such as translation, summarization, and
question answering. It requires a task-specific prefix like question: or translate: to specify the context, which
is necessary for the model to understand the desired output format.

T5 also uses the SentencePiece tokenizer, which must encode the input text into token IDs compatible with its
architecture, ensuring accurate processing of subword units. Additionally, T5 allows fine-grained control over text
generation with parameters like **temperature**, **top_k**, and **top_p**, which determine randomness
and diversity in output. In contrast, GPT-2, as shown in the second snippet, is a simpler autoregressive model that
doesn’t require a prefix or task-specific setup. GPT-2 is quicker and easier to implement, though less flexible for
structured multi-task scenarios like T5.

Try first installing the tokenizer library **SentencePiece**, widely used for models like T5 and Flan-T5:

pip install sentencepiece

T5 uses SentencePiece as its subword tokenizer. **SentencePiece** allows the tokenizer to handle a variety of
languages and create subword representations effectively. The tokenizer models included with Hugging Face T5
checkpoints (like **t5-base**) depend on SentencePiece to load the tokenizer model.

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Step 1: Load the T5 model and tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-base")  # Use T5-Base model
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Step 2: Prepare the input text with task-specific prefix
# T5 requires task-specific prefixes like "translate English to French:" or "answer the question:"
input_text = "question: What is the capital of France? context: "  # Using question-answering prefix
input_ids = tokenizer.encode(input_text, return_tensors="pt")  # Proper tokenization

# Step 3: Generate a response
outputs = model.generate(
    input_ids,
    max_length=50,            # Limit output length
    num_return_sequences=1,   # Number of responses
    do_sample=True,           # Enable sampling so the parameters below take effect
    temperature=0.7,          # Controls randomness
    top_k=50,                 # Limits sampling to top k tokens
    top_p=0.95,               # Nucleus sampling for diversity
)

# Step 4: Decode and print the output
response = tokenizer.decode(outputs[0], skip_special_tokens=True)  # Decode generated output
print("Response:", response)

The T5 (Text-to-Text Transfer Transformer) model is computationally intensive, especially in larger versions like T5-
Large or T5-3B, requiring substantial memory and processing power, which can make deployment on resource-
constrained devices challenging. The model’s fixed input and output lengths limit its ability to handle very long texts
or generate extended outputs, affecting tasks that involve lengthy sequences.

Without proper fine-tuning, T5 may underperform in domain-specific applications, failing to capture specialized
vocabulary or nuances inherent to specific fields.

Additionally, like other large language models, it can produce outputs that are grammatically correct but factually
incorrect or nonsensical, especially in complex reasoning scenarios. Lastly, T5 may inadvertently incorporate biases
present in its training data, leading to biased or unfair outputs, necessitating careful evaluation and potential post-
processing when deployed in sensitive applications.

Model parameters in the context of machine learning models like GPT-2, BERT, and T5 refer to the internal variables
or “knobs” that the model adjusts during training to learn from data.

Imagine a machine learning model as a complex musical instrument with millions or even billions of adjustable dials
and switches (the parameters). Each dial controls a tiny aspect of the sound produced. When all the dials are set
correctly, the instrument plays beautiful music (produces accurate predictions or generates coherent text).

During the training process, the model “listens” to a lot of example music (training data) and learns how to adjust its
dials to reproduce similar sounds. Each parameter is adjusted slightly to reduce errors and improve performance. The
more parameters a model has, the more finely it can tune its performance, allowing it to capture intricate patterns and
nuances in the data.

Here’s how it relates to the models we mentioned before:

GPT-2: This model has variants with different numbers of parameters, ranging from 117 million to 1.5 billion.
More parameters allow the model to generate more coherent and contextually relevant text because it can model
more complex language patterns.

BERT: With versions like BERT-base (110 million parameters) and BERT-large (340 million parameters), BERT
uses its parameters to understand and process language, enabling tasks like answering questions and
understanding context.

T5: This model treats all tasks as text-to-text transformations and comes in sizes from 60 million to 11 billion
parameters. The larger models can perform a wide variety of language tasks with greater accuracy due to their
increased capacity.

Perform Sentiment Analysis

The raw review data looks like:

VADER (Valence Aware Dictionary and sEntiment Reasoner)

The VADER (Valence Aware Dictionary and sEntiment Reasoner) Sentiment Analyzer is a tool designed to determine
the sentiment expressed in text. It is particularly good at analyzing text that includes opinions, emotions, or casual
language like product reviews, tweets, or comments.

At its core, VADER uses a pre-built dictionary of words and phrases, where each word is assigned a sentiment score
based on its emotional intensity. For example:

Positive words like “amazing” or “great” have high positive scores.

Negative words like “terrible” or “awful” have high negative scores.

Neutral words like “book” or “laptop” have little to no sentiment score.

When analyzing a sentence, VADER looks at each word, sums up the sentiment scores, and adjusts for factors like
punctuation, capitalization, and special phrases. For example:

Words in ALL CAPS (e.g., “AWESOME!”) are treated as having stronger sentiment.

Punctuation like exclamation marks (!) also boosts emotional intensity.

It also accounts for:

Negation: Words like “not” or “never” can flip the sentiment of a phrase. For instance, “not great” is identified as
negative.

Intensity Modifiers: Words like “very” or “extremely” amplify sentiment, while words like “slightly” or
“barely” reduce it. For example, “very bad” is more negative than just “bad.”

Emoticons and Slang: VADER recognizes common emoticons (e.g., ":)", ":("), slang (e.g., "lol"), and
abbreviations, making it ideal for social media or casual text.

Building a sentiment analyzer, like VADER, is achievable in Data Distiller using its integrated machine learning
models and pipelines. Data Distiller allows you to create an end-to-end workflow for sentiment analysis by leveraging
labeled sentiment data and custom ML models. Using transformers, you can preprocess text data by tokenizing,
normalizing, and extracting features such as word embeddings or term frequencies. These features can be fed into
machine learning models like Logistic Regression for sentiment classification.

Assign Sentiment Metadata

Analyze the sentiment of each review using the VADER sentiment analyzer and attach sentiment metadata.

import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Step 1: Load the CSV file into a DataFrame
file_path = "Product_Reviews.csv"  # Replace with your file path
review_df = pd.read_csv(file_path)

# Step 2: Initialize VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Step 3: Analyze sentiment and attach metadata
# Ensure the column containing reviews is correctly identified (e.g., "review")
if "review" in review_df.columns:
    review_df["sentiment"] = review_df["review"].apply(
        lambda review: (
            "Positive" if analyzer.polarity_scores(review)["compound"] > 0.05
            else "Negative" if analyzer.polarity_scores(review)["compound"] < -0.05
            else "Neutral"
        )
    )
else:
    raise KeyError("The 'review' column is not found in the CSV file.")

# Step 4: Save the updated DataFrame back to a new CSV file
output_file_path = "Updated_Product_Reviews_with_Sentiment.csv"
review_df.to_csv(output_file_path, index=False)

The output CSV file should look like this:

Introduction to Vector Embeddings

A vector embedding is a way of converting text (like product reviews) into a numerical representation (a vector) that
computers can process and analyze. These embeddings capture the meaning and relationships between words in a
mathematically useful format. For example, sentences like “This laptop is amazing!” and “Great laptop performance!”
convey similar meanings. Embeddings convert these sentences into vectors that are close to each other in a
mathematical space, facilitating tasks like sentiment analysis, clustering, and similarity comparisons.

We are using Hugging Face Sentence Transformers in this setup; embeddings are generated locally using a pre-trained
transformer model like **all-MiniLM-L6-v2**. These embeddings play a central role in structuring and
enabling efficient similarity searches for customer reviews:

Pretrained Knowledge: The Hugging Face model is trained on extensive datasets, allowing it to understand
nuanced meanings. This enables handling domain-specific or complex queries effectively.

Contextual Understanding: The model produces embeddings that are context-aware, meaning it captures
relationships between words. For instance, “battery” in “battery life” has a distinct embedding from “battery of
tests.”

Privacy and Cost Efficiency: Unlike cloud-based embeddings (e.g., OpenAI models), Hugging Face models run
locally. This ensures data privacy and eliminates reliance on paid external APIs.

Customizability: The model can be fine-tuned with domain-specific data to improve accuracy and adaptability
for tailored applications.

We will be using the Hugging Face model **all-MiniLM-L6-v2** that creates high-quality embeddings for
product reviews. These embeddings are stored in a FAISS vector database, enabling efficient similarity searches.
Here’s how the workflow comes together:

1. Load Reviews: Customer reviews and their metadata (e.g., sentiment) are loaded from a CSV file.
2. Generate Embeddings: The reviews are transformed into numerical embeddings using the Hugging Face model.

The most common embedding models, based on performance and community adoption, are:

1. **all-MiniLM-L6-v2**: This model is perfect for marketing tasks that require a balance between speed
and accuracy. Whether you’re conducting semantic search to match user queries with the most relevant product
descriptions or performing customer review clustering to identify common themes, this model delivers reliable
results. With its 384-dimensional embeddings, it’s lightweight and efficient, making it ideal for real-time
marketing applications in resource-constrained environments, such as on-device personalization.

2. **all-mpnet-base-v2**: For high-precision marketing tasks, this model excels at capturing semantic
nuances. Its 768-dimensional embeddings make it the go-to choice for applications like paraphrase
identification, ensuring consistent messaging across campaigns, or textual entailment, which helps determine
whether user-generated content aligns with your brand’s values. This precision is invaluable for tasks such as
refining campaign strategies based on nuanced customer feedback.

3. **multi-qa-MiniLM-L6-cos-v1**: Designed for multilingual marketing, this model shines in global
campaigns. Supporting multiple languages, it is optimized for question-answering tasks, enabling businesses to
create smart search tools that instantly connect users to the right information. Its 384-dimensional embeddings
make it highly effective in cross-lingual semantic search, allowing marketers to target diverse audiences with
personalized and contextually accurate content, bridging language barriers seamlessly.

Different vector embeddings produce distinct representations because they are tailored to specific use cases. These
variations stem from differences in model architecture, training data, and the intended application. For example,
traditional embeddings like Word2Vec and GloVe emphasize word relationships through co-occurrence, while modern
models like BERT or Hugging Face Sentence Transformers take context into account, generating richer and more
nuanced representations.

The choice of training data significantly impacts the embedding’s performance. Models trained on general-purpose
datasets provide broad applicability across tasks, whereas domain-specific embeddings, such as those trained on legal,
medical, or financial texts, excel in specialized applications. Furthermore, embeddings can be optimized for diverse
goals, including semantic similarity, sentiment analysis, or intent recognition. This adaptability ensures that the
selected embedding model aligns precisely with the requirements of a given use case, offering the flexibility to tackle a
wide range of tasks effectively.

The dimensionality of vector embeddings, that is, the number of components in each embedding vector, significantly
impacts how well these embeddings capture the underlying characteristics of the data. Higher-dimensional embeddings
have the capacity to represent more nuanced and complex relationships because they can encode more features and
patterns present in the data. This can lead to better performance in tasks like semantic similarity, classification, or
clustering. However, increasing the dimensionality isn’t always beneficial; it can introduce challenges such as higher
computational costs and the risk of overfitting, where the model learns noise instead of meaningful patterns.
Conversely, embeddings with too few dimensions might oversimplify the data, failing to capture important details and
leading to poorer performance. Therefore, the choice of embedding dimensions is a balance: enough to encapsulate the
necessary information without becoming inefficient or prone to overfitting. The optimal dimensionality often depends
on the complexity of the data and the specific requirements of the task at hand.

Introduction to FAISS (Facebook AI Similarity Search)

FAISS (Facebook AI Similarity Search) is a lightweight and efficient vector database optimized for local use, making
it an excellent choice for fast and scalable similarity searches. Unlike cloud-native alternatives, FAISS is designed to
run entirely on local hardware, making it a cost-effective solution for developers who prioritize privacy and control
over their data. For marketing applications, FAISS enables real-time retrieval of semantically similar data, such as
analyzing customer reviews to identify sentiments or finding related products based on specific customer preferences,
such as “affordable smartphones with excellent camera quality.”

FAISS is particularly well-suited for scenarios where lightweight and local infrastructure is needed. Its design
minimizes resource consumption while maintaining high performance, allowing teams to run advanced similarity
searches without the need for expensive cloud services. For example, marketers can store and search vector
embeddings locally, ensuring data privacy and avoiding latency issues often associated with cloud solutions.

Unlike cloud-based solutions such as Pinecone, FAISS provides unparalleled control over indexing and searching,
giving developers the flexibility to tune their workflows for specific needs. However, it lacks built-in support for
metadata filtering, which requires manual integration with external tools like pandas or JSON files. For teams that
require complete data ownership and are comfortable with some additional setup, FAISS is an excellent choice for
building recommendation engines, designing targeted ad campaigns, and conducting in-depth sentiment analysis. With
its simplicity and local-first architecture, FAISS empowers marketing teams to prototype and deploy sophisticated AI-
driven applications efficiently and privately.

The choice of vector database matters significantly, as it impacts the performance, scalability, and functionality of our
system. Vector databases are specifically designed to handle high-dimensional numerical data (embeddings), enabling
tasks like similarity search and nearest neighbor retrieval. Different vector databases, such as FAISS, Weaviate,
Pinecone, or Milvus, offer distinct features and optimizations that may suit specific use cases.

FAISS is optimized for speed and efficiency in handling very large datasets, making it ideal for applications where
real-time similarity searches are critical.

Weaviate and Pinecone provide additional functionality, like metadata filtering and integrations with external systems,
making them suitable for production environments where complex queries are needed.

The choice also depends on whether you prioritize on-premises solutions (e.g., FAISS) or managed cloud services
(e.g., Pinecone). Moreover, the vector database’s support for various indexing techniques, scalability, and ease of
integration with your embedding generation pipeline can significantly influence the system’s overall effectiveness.
Thus, the vector database complements the embeddings and ensures that your application can efficiently retrieve the
most relevant results based on similarity.

A vector database is fundamentally different from a traditional database in how it stores and retrieves data.
Traditional databases are optimized for structured data, like rows and columns, where queries are based on exact
matches or straightforward filtering (e.g., finding all products under $50). In contrast, a vector database is designed to
handle unstructured data, such as text, images, or audio, by storing high-dimensional numerical representations called
embeddings. Instead of exact matches, queries in a vector database focus on finding similar data based on proximity
in a mathematical space. For example, in a product review system, a vector database can retrieve reviews similar in
meaning to a user’s query, even if they don’t share the exact words. This capability makes vector databases ideal for
applications like recommendation systems, natural language processing, and image recognition, where similarity and
contextual understanding are more important than precise matches.

Besides FAISS (**faiss**), there are several other popular Python packages you could use for local vector
similarity search and indexing. One notable alternative is Annoy (**annoy**), developed by Spotify, which is
efficient in memory usage and provides fast approximate nearest neighbor searches, making it suitable for static
datasets where the index doesn’t require frequent updates. Another option is HNSWlib (**hnswlib**), which
implements Hierarchical Navigable Small World graphs and excels in high-performance approximate nearest neighbor
searches with dynamic updates, ideal for real-time applications that demand both speed and accuracy. NMSLIB
(**nmslib**) is also widely used and offers flexibility by supporting various distance metrics and algorithms for
fast approximate nearest neighbor search. While FAISS is highly regarded for its performance on large-scale, high-
dimensional data and remains one of the most popular choices in the machine learning community, these alternatives
like Annoy and HNSWlib are also popular and might be preferred depending on your specific project requirements,
such as data size, need for dynamic updates, computational resources, and ease of integration.

Store in Vector Database FAISS for Retrieval

In this part of the tutorial, we’re setting up a system that helps us find similar product reviews based on their meaning.
Think of each review as being converted into a list of numbers (as an “embedding”) that captures its essence. To
organize and search through these numerical representations efficiently, we create something called an index using
FAISS, a library designed for this purpose.

We start by telling the system how long each list of numbers is—this is the dimension of our embeddings (in this case,
384 numbers per review). Then, we initialize the index with a method called **IndexFlatL2**. The term "flat"
means that the index will store all our embeddings in a simple, straightforward way without any complex structures.
The "L2" refers to using the standard way of measuring distance between two points in space (like measuring the
straight-line distance between two spots on a map).

By setting up the index this way, we’re preparing a tool that can compare any new review to all existing ones by
calculating how “far apart” their embeddings are. Reviews that are closer together in this numerical space are more
similar in content. The variable index now holds this prepared system, and we’re ready to add our embeddings to it.
Once added, we can quickly search through our reviews to find ones that are most similar to any given piece of text.
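
For reference, the L2 (Euclidean) distance that **IndexFlatL2** computes between two embeddings u and v of
dimension d is the usual straight-line distance:

d(u, v) = \sqrt{\sum_{i=1}^{d} (u_i - v_i)^2}

Smaller distances mean two reviews sit closer together in the embedding space and are therefore more similar in
meaning.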

import faiss
# Import necessary libraries
import pandas as pd  # To work with data files
from transformers import pipeline  # To use AI models for understanding text
from langchain.vectorstores import FAISS  # To create a searchable database
from langchain.embeddings.huggingface import HuggingFaceEmbeddings  # To generate review summaries
from langchain.schema import Document  # To structure reviews with extra details
from langchain.docstore.in_memory import InMemoryDocstore  # To store reviews temporarily

# Step 1: Load customer reviews from a file
file_path = "Updated_Product_Reviews_with_Sentiment.csv"  # Replace with your file path
review_data = pd.read_csv(file_path)  # Load the data file into the program
reviews = review_data["review"].tolist()  # Get the list of all reviews
metadata = [{"id": idx, "sentiment": row["sentiment"]} for idx, row in review_data.iterrows()]  # Add details like sentiment

# Step 2: Summarize reviews using AI
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")  # Load a pre-trained AI model
embeddings = embedding_model.embed_documents(reviews)  # Generate AI summaries (embeddings) for each review

# Step 3: Set up a searchable database
import faiss  # A tool to search for similar items
import numpy as np  # To handle numerical data and arrays

# Convert the embeddings list to a NumPy array (float32, as expected by FAISS)
embeddings_array = np.array(embeddings, dtype="float32")

# Get the size (number of dimensions) of each AI summary
dimension = embeddings_array.shape[1]

# Create a FAISS index to store these summaries, using Euclidean distance
search_index = faiss.IndexFlatL2(dimension)

# Add the AI summaries (embeddings) into the index for similarity searches
search_index.add(embeddings_array)

# Step 4: Connect reviews and their details
documents = [
    Document(page_content=reviews[i], metadata=metadata[i]) for i in range(len(reviews))
]  # Create documents combining reviews and their details
docstore = InMemoryDocstore({str(i): doc for i, doc in enumerate(documents)})  # Store documents temporarily
index_to_docstore_id = {i: str(i) for i in range(len(reviews))}  # Keep track of each review's ID

# Step 5: Combine everything into a simple tool
vector_store = FAISS(
    embedding_function=embedding_model,  # Use the AI model to summarize new queries
    index=search_index,  # Use the database to find similar reviews
    docstore=docstore,  # Include the original reviews and their details
    index_to_docstore_id=index_to_docstore_id,  # Match reviews to their IDs
)

print("Searchable database is ready!")

It is important for us to understand some of the key parts of the code above:

# Step 2: Summarize reviews using AI
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")  # Load a pre-trained AI model
embeddings = embedding_model.embed_documents(reviews)  # Generate AI summaries (embeddings) for each review

# Step 3: Set up a searchable database
import faiss  # A tool to search for similar items
import numpy as np  # To handle numerical data and arrays

# Convert the embeddings list to a NumPy array (float32, as expected by FAISS)
embeddings_array = np.array(embeddings, dtype="float32")

# Get the size (number of dimensions) of each AI summary
dimension = embeddings_array.shape[1]

# Create a FAISS index to store these summaries, using Euclidean distance
search_index = faiss.IndexFlatL2(dimension)

# Add the AI summaries (embeddings) into the index for similarity searches
search_index.add(embeddings_array)

**HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")**: This loads a pre-trained AI
model called all-MiniLM-L6-v2. This model is specifically designed to transform text (customer reviews)
into vector representations, i.e., embeddings.

**embedding_model.embed_documents(reviews)**: Here, the model processes each review in the
reviews list and converts it into a numerical representation called an embedding. Each embedding captures the
essence or "summary" of the review in a format that AI systems can easily compare and analyze.

Convert Embeddings to a NumPy Array:

**np.array(embeddings)**: The embeddings generated in Step 2 are stored as a Python list. To
work with them efficiently, we convert this list into a NumPy array (embeddings_array). NumPy
arrays are faster and support advanced operations like getting dimensions.

Get the Size of Each Embedding:

**embeddings_array.shape[1]**: The shape attribute of the NumPy array tells us its structure.
Here, .shape[1] retrieves the number of dimensions in each embedding (e.g., 384 for the
all-MiniLM-L6-v2 model).

Create a FAISS Index:

**faiss.IndexFlatL2(dimension)**: FAISS is a tool to efficiently search for similar
embeddings. The IndexFlatL2 creates a flat database that uses Euclidean distance to measure similarity
between embeddings.

Add Embeddings to the Index:

**search_index.add(embeddings_array)**: This step adds the embeddings (as vectors) into
the FAISS index, making it ready to perform similarity searches. For example, you can now search for
reviews that are similar to a given review or query.

Data Preparation for Vector Store

The strategy for data preparation in the vector store involves restructuring the data to ensure all relevant elements—
reviews, their metadata, and embeddings—are readily accessible and interlinked. Each review is paired with its
associated metadata (such as sentiment or ID) to create Document objects, which provide context-rich units of
information. These documents are stored in an InMemoryDocstore for quick retrieval, and a mapping is created
between FAISS index IDs and the corresponding document IDs in the docstore. This approach integrates the raw text,
structured metadata, and vector representations into a unified system, enabling efficient similarity searches while
preserving the ability to trace results back to their original details. By organizing the data in this way, the vector store
becomes a powerful tool for querying and retrieving meaningful insights.

Remember that one is a document representation and the other is a vector representation. Here’s the explanation of
the modular architecture:

Document Representation (**InMemoryDocstore**): The **InMemoryDocstore** stores the actual
content of the documents (reviews in this case) along with their metadata, such as sentiment or any other
associated details. It's essentially a structured repository that holds the human-readable information and
contextual details.

Vector Representation (FAISS Index): FAISS stores the numerical embeddings (vector representations) of the
reviews. These embeddings are mathematical representations of the textual content, capturing their semantic
meaning. FAISS uses these vectors for similarity searches.

When you use **docstore** in FAISS, it doesn’t mean that the document content itself is stored in FAISS.
Instead, it provides a way to link the vector representations in FAISS to their corresponding documents in the
**InMemoryDocstore:**
1. Mapping with **index_to_docstore_id**: Each vector in the FAISS index is assigned an ID. The
**index_to_docstore_id** below connects these FAISS vector IDs to the IDs of the documents in the
**docstore**.

2. Pointer Mechanism: When a similarity search is performed in FAISS, it retrieves vector IDs for the closest
matches. These IDs are then used to look up the associated **Document** objects in the
**InMemoryDocstore**.

This setup keeps FAISS optimized for fast numerical computations (vector searches) while delegating the task of
managing document content and metadata to the **docstore**. It’s a division of responsibilities:

FAISS handles efficient retrieval of relevant vectors.

The **InMemoryDocstore** enriches the retrieval process by adding contextual information from the
original documents.

This approach ensures the system remains modular and efficient while providing comprehensive query responses.

Let us now understand this code:

# Step 4: Connect reviews and their details


documents = [
Document(page_content=reviews[i], metadata=metadata[i]) for i in
range(len(reviews))
] # Create documents combining reviews and their details
docstore = InMemoryDocstore({str(i): doc for i, doc in enumerate(documents)})
# Store documents temporarily
index_to_docstore_id = {i: str(i) for i in range(len(reviews))} # Keep track
of each review's ID

# Step 5: Combine everything into a simple tool


vector_store = FAISS(
embedding_function=embedding_model, # Use the AI model to summarize new
queries
index=search_index, # Use the database to find similar reviews
docstore=docstore, # Include the original reviews and their details
index_to_docstore_id=index_to_docstore_id, # Match reviews to their IDs
)

Create documents combining reviews and metadata:

A list of Document objects is created, where each Document contains a review (**page_content**)
and its associated metadata (**metadata[i]**), such as sentiment or ID.

This links each review to its additional details for better context during retrieval.

Set up an in-memory storage:

The documents are stored in an **InMemoryDocstore**, a temporary storage solution, where each
document is assigned a unique string key (its index as a string).

This allows for easy retrieval of the original reviews and their metadata during searches.

Create a mapping between index IDs and document IDs:


A dictionary called **index_to_docstore_id** is created, mapping each numerical index in the
FAISS vector store to the corresponding document ID in the docstore.

This ensures that when a match is found in the FAISS index, the correct document can be retrieved.

Combine everything into a unified vector store:

A FAISS object is created to integrate the embedding function (for summarizing new queries), the FAISS
search index (for similarity searches), the docstore (for original reviews and details), and the index-to-
docstore mapping.

This unified tool simplifies the workflow, allowing queries to be processed, matched, and linked to their
original content seamlessly.
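
As a quick check of the unified store, you can run a similarity search directly on it. This is a minimal sketch using LangChain's similarity_search method on the vector_store built above; the query string is illustrative.

# Embeds the query, searches FAISS, and maps the hits back to Document objects
results = vector_store.similarity_search("What do customers say about battery life?", k=3)

for doc in results:
    print(doc.metadata, "->", doc.page_content)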

Create a Retrieval-Augmented Generation (RAG) System

The Retrieval-Augmented Generation (RAG) concept addresses the limitations of standalone language models (LLMs)
by incorporating external context to improve response relevance and accuracy. When an LLM is asked a question
without context, it generates answers based solely on its pre-trained knowledge, which can result in randomness or
hallucinations—plausible-sounding but incorrect responses.

RAG mitigates this by integrating a retriever mechanism that fetches relevant context (e.g., documents or specific
knowledge) from a database or vector store based on the query. This retrieved context is then provided to the LLM
alongside the query, grounding the generation process in more accurate, up-to-date, or domain-specific information.

Remember that RAG improves how LLM answers questions by giving it helpful context to work with, such as related
documents or information from a database. This makes the responses more accurate and relevant. However, mistakes
will still happen if the retrieved documents don’t have enough useful information or if the AI misunderstands the
content. Even with these limitations, RAG is a powerful approach for getting more reliable and context-based answers,
especially in areas where accuracy and relevance are important.

# Step 1: Configure retriever


retriever = vector_store.as_retriever(search_type="similarity", search_kwargs=
{"k": 3})

# Step 2: Set up text generation pipeline


text_generator = pipeline("text-generation", model="gpt2", max_new_tokens=50)
llm = HuggingFacePipeline(pipeline=text_generator)

# Step 3: Create RetrievalQA pipeline


rag_chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=retriever,
chain_type="stuff",
)

# Step 4: Test the pipeline


query = "What do customers say about battery life?"
response = rag_chain.run(query)

print(f"Query: {query}")
print(f"Response: {response}")

A retriever is a core component in information retrieval systems, designed to find and return relevant pieces of
information based on a query. Conceptually, it acts as a bridge between a user’s query and a large knowledge base,
enabling efficient and targeted searches. Retrievers work by comparing the query to the stored representations of data,
such as vector embeddings or indexed documents, to identify the most similar or relevant items.

Retrieving Similar Vectors:

The **retriever** uses the FAISS vector store for similarity search.

When a query is made (e.g., “What do customers say about battery life?”), the query text is transformed into
a vector embedding using the same **embedding_function** used during setup.

FAISS searches the stored vectors in the **search_index** to find the **k** most similar vectors to
the query embedding, based on Euclidean distance or another similarity metric.

Connecting to the **docstore**:

FAISS returns the IDs of the top **k** closest vector embeddings.

These IDs are mapped to their corresponding document IDs using the **index_to_docstore_id**
dictionary.

Fetching Documents:

The **docstore** is then queried using these document IDs.

It retrieves the actual document content (e.g., the original review) and metadata (e.g., sentiment, ID)
associated with each retrieved vector.

Returning Results:

The **retriever** compiles the matching documents, including their metadata, into a format that can
be used by downstream components (e.g., question-answering pipelines like RAG).
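
For example, here is a minimal sketch of querying the retriever on its own, using the retriever configured in Step 1 above and an illustrative question:

# Fetch the k most similar reviews for a free-form question
docs = retriever.get_relevant_documents("What do customers say about battery life?")

for doc in docs:
    print(f"id={doc.metadata['id']} sentiment={doc.metadata['sentiment']}: {doc.page_content}")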

Retrieval with Generation - The RAG Pipeline

Let us now analyze the rest of the code:

# Step 2: Set up text generation pipeline


text_generator = pipeline("text-generation", model="gpt2", max_new_tokens=50)
llm = HuggingFacePipeline(pipeline=text_generator)

# Step 3: Create RetrievalQA pipeline


rag_chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=retriever,
chain_type="stuff",
)

# Step 4: Test the pipeline


query = "What do customers say about battery life?"
response = rag_chain.run(query)

print(f"Query: {query}")
print(f"Response: {response}")

The code sets up a Retrieval-Augmented Generation (RAG) pipeline that combines a retriever with a language
model to generate context-aware responses.
1. Text Generation Setup:

The **pipeline("text-generation", model="gpt2", max_new_tokens=50)** creates


a text generation model (GPT-2) capable of generating text based on the input. The **pipeline**
function is not a LangChain function but comes from Hugging Face’s **transformers** library.

The **HuggingFacePipeline** is a function from LangChain. It acts as a bridge to integrate


Hugging Face’s models into LangChain’s ecosystem, allowing Hugging Face models to be used seamlessly
in LangChain workflows, like RetrievalQA or other chain-based pipelines.

We wrap **text_generator** with **HuggingFacePipeline** to make it work with


LangChain. Hugging Face’s pipeline generates text on its own, but LangChain needs models to follow its
format to work well with tools like retrievers and chains. The **HuggingFacePipeline** acts like a
translator, connecting the text generator to LangChain, so everything works together smoothly in the
retrieval and question-answering process.

2. Retriever Role:

The **retriever** is already connected to the **vector_store** created earlier, which maps
query vectors to relevant documents.

When a query is provided to the RAG pipeline, the retriever first identifies the most relevant documents (or
text chunks) from the vector database by comparing the query’s embedding with stored embeddings.

3. Combining Retrieval with Generation:

The **RetrievalQA.from_chain_type** method combines the retriever and the LLM (llm) into
a unified pipeline.

The **retriever** fetches the most relevant context (e.g., product reviews or document snippets)
based on the query.

This retrieved context is then fed to the language model, which uses it to generate a more informed and
contextually accurate response.

4. Chain Type: The **chain_type="stuff"** specifies how the retrieved documents are handled. In this
case, all retrieved context is concatenated (“stuffed”) into a single input for the language model.

**RetrievalQA** is a class in the LangChain framework designed to enable Retrieval-Augmented Generation


(RAG) workflows. Its primary purpose is to combine a retriever (for finding relevant documents or data) with a
language model (LLM) to produce accurate and context-aware responses to user queries. The retrieved documents are
prepared (e.g., concatenated or summarized) based on the **chain_type.**

In LangChain’s **RetrievalQA**, the **chain_type** determines how retrieved documents are processed
and presented to the language model (LLM) to generate a response, offering flexibility for various use cases.

The **stuff** chain type as mentioned earlier concatenates all retrieved documents into a single input and sends it
to the LLM, making it simple and efficient for small sets of concise documents, though it may exceed token limits for
larger contexts.

The **map_reduce** chain processes each document independently to generate partial responses in the “map”
step and combines them into a final answer in the “reduce” step, ideal for contexts too large to fit into a single call.

The **refine** chain handles documents iteratively, refining the answer with each additional document, ensuring
thorough consideration of all retrieved data, which is useful for in-depth analyses.
Lastly, the **map_rerank** chain scores each document for relevance during the “map” step and selects the most
relevant one to generate the response, making it effective for scenarios with numerous retrieved documents requiring
prioritization.
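
To experiment with these strategies, only the chain_type argument needs to change. A minimal sketch, reusing the llm and retriever defined earlier and an illustrative query:

# Same retriever and LLM, different document-combination strategy
rag_chain_map_reduce = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="map_reduce",  # or "refine" / "map_rerank" / "stuff"
)

response = rag_chain_map_reduce.run("What do customers say about battery life?")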

This setup ensures that the model’s responses are grounded in the most relevant information retrieved by the
**retriever**, reducing hallucination and making the output more reliable and context-aware. The retriever
ensures that the LLM works with targeted, high-quality data rather than relying solely on its pre-trained knowledge.

The results are quite disappointing.

The incoherence in the response is likely due to the combination of several factors:

1. Model Choice (GPT-2): The **gpt2** model is a general-purpose language model and is not specifically fine-
tuned for tasks like summarization or retrieval-augmented question answering. It might struggle to provide
coherent responses when fed raw retrieved contexts without fine-tuning or adaptation for the task.

2. “Stuff” Chain Type: The **chain_type="stuff"** concatenates all retrieved contexts into a single input
before passing it to the LLM. If the retrieved documents contain repetitive or slightly mismatched information,
the model might not handle this well and generate confusing responses. For example, repeated statements like
“Battery drains quickly” can confuse the LLM’s summarization process.

3. Quality of Retrieved Context: If the documents retrieved by FAISS contain irrelevant or overly similar content,
the LLM’s ability to generate a cohesive answer diminishes. This happens because the model is trying to
summarize redundant or poorly aligned input.

4. Token Limit and Truncation: If the combined context exceeds the model’s token limit, parts of the context may
be truncated. This can lead to partial or incomplete information being passed to the model, resulting in
incoherence.

5. Absence of Explicit Instruction to the LLM: Without explicit prompts or instructions on how to format the
response, the LLM might generate an answer that mixes context with the response, as seen in the output. GPT-2
works better when given very clear prompts.

6. Data Quality Issues in Retrieved Contexts: If the retrieved documents themselves contain incomplete,
repetitive, or poorly structured text, the final response will reflect those issues. The model can only work as well
as the data it is provided with.

We make the following changes:

1. Better Model (**flan-t5-base**): Replace **gpt2** with **flan-t5-base**, which is fine-tuned


for tasks like summarization and QA. This ensures more accurate and coherent answers.

2. Improved Chain Type (**refine**): Switch from **stuff** to **refine**. This chain type ensures
that each retrieved document is processed iteratively, allowing the model to refine its answer with each step.

3. Cleaner and Clearer Prompt: Update the query to explicitly ask for a summary: "**Summarize what
customers say about battery life in the reviews.**"

4. Maximum Token Limit Increased: Increased **max_new_tokens** to 100 to give the model more
flexibility in generating coherent answers.

Let us run this code:

# Step 1: Import necessary libraries


import faiss
import numpy as np
from langchain.vectorstores import FAISS
from langchain.schema import Document
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.docstore.in_memory import InMemoryDocstore
from langchain.chains import RetrievalQA
from langchain_huggingface import HuggingFacePipeline
from transformers import pipeline

# Step 2: Load customer reviews from a file


import pandas as pd
file_path = "Updated_Product_Reviews_with_Sentiment.csv" # Replace with your
file path
review_data = pd.read_csv(file_path)
reviews = review_data["review"].tolist()
metadata = [{"id": idx, "sentiment": row["sentiment"]} for idx, row in
review_data.iterrows()]

# Step 3: Generate embeddings using a better pre-trained AI model


embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
embeddings = embedding_model.embed_documents(reviews)

# Convert embeddings list to a NumPy array


embeddings_array = np.array(embeddings)

# Step 4: Set up FAISS index


dimension = embeddings_array.shape[1]
search_index = faiss.IndexFlatL2(dimension)
search_index.add(embeddings_array)

# Step 5: Connect reviews and metadata


documents = [
Document(page_content=reviews[i], metadata=metadata[i]) for i in
range(len(reviews))
]
docstore = InMemoryDocstore({str(i): doc for i, doc in enumerate(documents)})
index_to_docstore_id = {i: str(i) for i in range(len(reviews))}

# Combine everything into a FAISS vector store


vector_store = FAISS(
embedding_function=embedding_model,
index=search_index,
docstore=docstore,
index_to_docstore_id=index_to_docstore_id,
)

# Step 6: Configure the retriever with better retrieval quality


retriever = vector_store.as_retriever(search_type="similarity", search_kwargs=
{"k": 3})

# Step 7: Set up a better text generation pipeline


text_generator = pipeline("text2text-generation", model="google/flan-t5-base",
max_new_tokens=100) # A fine-tuned model for QA
llm = HuggingFacePipeline(pipeline=text_generator)

# Step 8: Create RetrievalQA pipeline with `refine` chain type


rag_chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=retriever,
chain_type="refine", # Iteratively refines answers based on retrieved
context
)

# Step 9: Test the pipeline with a better query and prompt


query = "Summarize what customers say about battery life in the reviews."
response = rag_chain.run(query)

print(f"Query: {query}")
print(f"Response: {response}")

The response is:

Dynamic Sentiment Filtering

In this section, the goal is to enhance the context provided to the Language Model (LLM) by enriching it with
additional metadata extracted from the relevant documents. This process involves gathering all the documents that are
related to the query and compiling their content, along with their metadata, to create a richer, more detailed context.
The metadata can include supplementary information such as sentiment, review IDs, or other attributes that add depth
and specificity to the query. By combining these documents and their associated metadata, the input sent to the LLM
becomes more comprehensive, enabling it to generate more accurate, informed, and contextually relevant responses to
the user’s question. This step ensures that the LLM has access to all the necessary details to answer the query
effectively.

# Step 1: Import necessary libraries


import faiss
import numpy as np
from langchain.vectorstores import FAISS
from langchain.schema import Document
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.docstore.in_memory import InMemoryDocstore
from langchain_huggingface import HuggingFacePipeline
from transformers import pipeline
import pandas as pd

# Step 2: Load customer reviews from a file


file_path = "Updated_Product_Reviews_with_Sentiment.csv" # Replace with your
file path
review_data = pd.read_csv(file_path)
reviews = review_data["review"].tolist()
metadata = [{"id": idx, "sentiment": row["sentiment"]} for idx, row in
review_data.iterrows()]

# Step 3: Generate embeddings using a better pre-trained AI model


embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
embeddings = embedding_model.embed_documents(reviews)

# Convert embeddings list to a NumPy array


embeddings_array = np.array(embeddings)

# Step 4: Set up FAISS index


dimension = embeddings_array.shape[1]
search_index = faiss.IndexFlatL2(dimension)
search_index.add(embeddings_array)

# Step 5: Connect reviews and metadata


documents = [
Document(page_content=reviews[i], metadata=metadata[i]) for i in
range(len(reviews))
]
docstore = InMemoryDocstore({str(i): doc for i, doc in enumerate(documents)})
index_to_docstore_id = {i: str(i) for i in range(len(reviews))}

# Combine everything into a FAISS vector store


vector_store = FAISS(
embedding_function=embedding_model,
index=search_index,
docstore=docstore,
index_to_docstore_id=index_to_docstore_id,
)

# Step 6: Configure the retriever with better retrieval quality


retriever = vector_store.as_retriever(search_type="similarity", search_kwargs=
{"k": 10})

# Step 7: Set up a better text generation pipeline


text_generator = pipeline("text2text-generation", model="google/flan-t5-base",
max_new_tokens=100) # A fine-tuned model for QA
llm = HuggingFacePipeline(pipeline=text_generator)

# Step 8: Test the pipeline with dynamic sentiment filtering


queries = [
("Summarize what customers say about battery life in the reviews.", None),
# No sentiment filtering
("What do customers say about the durability of the product?", "Positive"),
# Positive sentiment
("What are the negative reviews about shipping?", "Negative"), # Negative
sentiment
]

for query, sentiment in queries:


# Retrieve relevant documents
retrieved_docs = retriever.get_relevant_documents(query)

# Apply sentiment filtering if specified


if sentiment:
filtered_docs = [doc for doc in retrieved_docs if
doc.metadata.get("sentiment") == sentiment]
if len(filtered_docs) == 0:
print(f"No documents found with sentiment '{sentiment}'. Using all
documents instead.")
filtered_docs = retrieved_docs
else:
filtered_docs = retrieved_docs

# Prepare the context from the filtered documents


context = "\n\n".join([doc.page_content for doc in filtered_docs])

# Create a prompt for the LLM


prompt = f"""You are a helpful assistant.

Based on the following customer reviews:

{context}

Answer the following question:

{query}
"""

# Generate response using the LLM


response = llm(prompt)

print(f"Query: {query}")
print(f"Response: {response}\n")

The response will be printed for each query. Here is how the code works, step by step:

1. Loop Through Questions: There are a few questions (like “What do customers say about battery life?”) and an
optional sentiment filter (e.g., “Positive” or “Negative”). The loop goes through each question one by one.

for query, sentiment in queries:

2. Find Relevant Reviews: For each question, the program looks for reviews that are related to the question using a
“retriever.” Think of this as finding the most relevant reviews from a library.

retrieved_docs = retriever.get_relevant_documents(query)

3. Filter by Sentiment: If you’re only interested in reviews with a specific sentiment (e.g., only positive reviews),
it will filter the results to include only those matching your preference. If no matching reviews are found, it will
print a message saying "No documents found with sentiment 'Positive'" and fall back to using all the reviews.

if sentiment:
filtered_docs = [doc for doc in retrieved_docs if
doc.metadata.get("sentiment") == sentiment]
if len(filtered_docs) == 0:
print(f"No documents found with sentiment '{sentiment}'. Using all
documents instead.")
filtered_docs = retrieved_docs
else:
filtered_docs = retrieved_docs

4. Combine Relevant Reviews: Once the relevant reviews (filtered or unfiltered) are ready, it combines their
content into a single block of text. This is like creating a summarized “cheat sheet” of what customers are saying.

context = "\n\n".join([doc.page_content for doc in filtered_docs])

5. Ask the LLM to Generate an Answer: Using the combined reviews, the program creates a “prompt” (a detailed
question) for the AI. It says: “Here are the customer reviews.”, “Based on these reviews, answer the following
question.” The question (like “What do customers say about battery life?”) is included in the prompt.

prompt = f"""You are a helpful assistant.

Based on the following customer reviews:


{context}

Answer the following question:

{query}
"""

6. Generate the Response: The AI reads the prompt, processes the reviews, and writes an answer to the question.

7. Display the Answer: Finally, it prints the question and the AI’s response.

print(f"Query: {query}")
print(f"Response: {response}\n")

Model hosting is a critical component of deploying machine learning models like Hugging Face Transformers in
production. You have two main options: hosting the model locally or using managed services. Managed hosting
solutions, such as Hugging Face Inference API, AWS SageMaker, or Google Cloud AI Platform, simplify
infrastructure management by providing pre-configured environments and scalable endpoints for inference. For
example, AWS SageMaker allows you to deploy pre-trained models with minimal effort, enabling your backend to call
these endpoints for generating responses. If you host the model locally, it can run alongside a FAISS index for efficient
similarity searches, but this approach requires managing server resources and scaling manually. Managed services, on
the other hand, ensure consistent performance during high traffic by leveraging cloud infrastructure, making them ideal
for applications with fluctuating demand.

Local Hosting (document store): The **InMemoryDocstore** used in development can be directly hosted on your server alongside
the application. It is suitable for small-scale use cases or prototyping, but not ideal for production where
persistence and scalability are needed.

Managed Databases: Migrate the doc store to cloud-hosted NoSQL databases like MongoDB Atlas, AWS
DynamoDB, or Firestore. These services allow you to persist metadata (e.g., review details and sentiment) and
ensure scalability and durability.

Local Hosting (FAISS index): Host the FAISS index on the same machine as the model and application backend.

This works well if your index size is manageable and you do not expect high traffic or scalability issues.

Cloud Hosting:

Custom VM Instances: Deploy FAISS on cloud services like AWS EC2, Google Cloud Compute
Engine, or Azure VMs.

These instances can handle larger datasets and high query throughput.

Serverless Functions: For smaller FAISS indexes, services like AWS Lambda or Google Cloud
Functions can be configured to load and query the FAISS index on-demand.

Docker/Kubernetes: Containerize the FAISS index with tools like Docker and deploy it on Kubernetes
clusters (e.g., AWS EKS, Google Kubernetes Engine).

FAISS on Managed Services: Tools like Pinecone or Weaviate offer vector search as a managed service,
abstracting the infrastructure for FAISS-like functionality. These services handle indexing, scaling, and querying
vectors, removing the need for manual FAISS management.

The pipeline is primarily implemented in a backend service that handles:

Query Processing:
Vectorizing the user query with Hugging Face Embeddings.

Searching the FAISS index for relevant documents.

Optional Filtering:

Filtering retrieved documents based on metadata, such as sentiment.

Context Creation:

Preparing the context (e.g., concatenating retrieved reviews) for the LLM.

Response Generation:

Using an LLM (e.g., Hugging Face Transformers) to generate enriched responses.

This backend service can be built using frameworks like FastAPI, Flask, or Django for Python, which allows for easy
integration with the vector search and the LLM.
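
As a rough illustration, here is a minimal FastAPI sketch of such a backend. It assumes the retriever and llm objects built in the earlier steps are loaded at startup; the /answer route and QueryRequest model are illustrative names rather than part of any Adobe or LangChain API.

from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    query: str
    sentiment: Optional[str] = None  # optional metadata filter

@app.post("/answer")
def answer(req: QueryRequest):
    # 1. Query processing: the retriever embeds the query and searches the FAISS index
    docs = retriever.get_relevant_documents(req.query)

    # 2. Optional filtering on metadata (e.g., sentiment); fall back to all docs if nothing matches
    if req.sentiment:
        docs = [d for d in docs if d.metadata.get("sentiment") == req.sentiment] or docs

    # 3. Context creation: concatenate the retrieved reviews
    context = "\n\n".join(d.page_content for d in docs)

    # 4. Response generation with the LLM (same pattern as the dynamic filtering example above)
    prompt = f"Based on the following customer reviews:\n\n{context}\n\nAnswer the following question:\n\n{req.query}"
    return {"query": req.query, "response": llm(prompt)}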

Last updated 3 months ago

Load the local LLM and execute the questions

Raw data for reviews tagged by category

https://data-distiller.all-stuff-data.com/unit-9-data-distiller-activation-and-data-export/act-100-dataset-activation-with-
data-distiller * * *

1. UNIT 9: DATA DISTILLER ACTIVATION & DATA EXPORT

ACT 100: Dataset Activation with Data Distiller


Shipping your datasets to distant destinations for maximizing enterprise ROI

In today’s data-driven enterprises, activating datasets from a Customer Data Platform (CDP) plays a critical role in
maximizing the value of AI/ML models, enterprise reporting, and Customer 360 initiatives. Dataset activation enables
AI/ML algorithms to predict customer behavior, delivering highly personalized interactions across channels. In
enterprise reporting, activated data provides real-time insights for performance tracking. For Customer 360, it unifies
customer profiles, giving businesses a comprehensive view of their customers, ultimately driving better decision-
making, precise targeting, and improved customer experiences across the organization.

Data Distiller offers a variety of cloud storage options accessible through the Destination UI:

1. Cloud Storage Destinations (File-Based): Accessible via Destination UI, supporting 6 cloud storage options:

Azure Data Lake Storage Gen 2

2. Batch Export Options:

Incremental and first-time full export

Export frequencies: 3, 6, 8, 12, 24 hours

3. Output Formats: JSON, Parquet. CSV is not supported.

4. Data Job Export Limits:


Event data

Datasets conforming to the Experience Event Schema that have an _id and a timestamp: maximum of
365 days

You can work around this by using a Profile Record Schema

Volume: 10 billion records across all datasets in a single job

5. DULE Enforcement: Ensures derived datasets are manually labeled in Data Distiller for compliance.

Data Distiller Derived Datasets vs. Raw Dataset Export

Data Distiller provides additional lake storage with a flexible and generous data retention policy, ensuring that large
volumes of data can be stored and accessed over extended periods to meet enterprise requirements (check license
entitlements for details). It converts raw datasets into optimized, derived datasets tailored for enterprise use cases like
reporting, AI/ML model training, and business insights. This transformation ensures the data is structured, relevant,
and analysis-ready, eliminating the need for complex processing and simplifying the extraction of actionable insights.

Ensure that the batch export schedule is set to run after the completion of Data Distiller jobs. Currently, there is no
functionality to trigger a batch export immediately after a Data Distiller job finishes, so careful scheduling is required
to prevent overlap or incomplete data export.

Exporting a derived dataset, also referred to as a feature dataset in an AI/ML context, offers significant benefits
compared to working with raw data, particularly in scenarios involving data analysis, reporting, or model training.
Derived datasets consist of pre-processed, structured, and often enriched information that is ready for immediate
use. This structured nature provides several critical advantages:

1. Pre-Processed and Ready for Use: Derived datasets have undergone pre-processing to clean, transform, and
enhance the raw data. This involves steps such as data normalization, outlier removal, handling missing values,
and applying relevant transformations. By performing these tasks ahead of time, the dataset is ready for analysis
or AI/ML model training without requiring additional preparation. This significantly reduces the time and effort
needed for data cleaning and preprocessing, allowing teams to focus directly on extracting insights or building
models.

2. Feature Engineering: One of the key components of a derived dataset is the inclusion of engineered features.
These features are specifically designed to capture important insights, trends, or patterns that may not be apparent
in the raw data. For example, features could include customer behavior patterns, time-based aggregates (like
rolling averages), or calculated metrics (like customer lifetime value). By incorporating these meaningful
features, derived datasets eliminate the need for analysts or data scientists to manually engineer features from raw
data, thereby streamlining the analytical process.

3. Reduced Processing Time: Since the heavy lifting of data transformation has already been done, using a derived
dataset greatly reduces the processing time for queries, model training, or reports. Raw data often requires
multiple rounds of cleaning, joining, and transforming before it can be used effectively, which can be resource-
intensive. Derived datasets provide all of the necessary transformations in advance, allowing business users and
data scientists to bypass these steps and focus on the final analysis or model optimization.

4. Consistency Across Analyses: Derived datasets ensure that all users are working with the same set of pre-
calculated features and metrics, promoting consistency across different analyses and reports. By exporting a
dataset that includes standard features and attributes, organizations can avoid discrepancies that often arise when
different teams calculate metrics or derive features independently from raw data. This consistency not only
reduces errors but also enhances collaboration by ensuring everyone is working with the same version of the data.
5. Improved Performance for AI/ML Models: In machine learning workflows, derived datasets often lead to
better model performance. This is because the features included in the dataset have been carefully engineered to
highlight relevant patterns and relationships that are crucial for model training. Pre-processed data is cleaner,
more relevant, and typically optimized for specific use cases. By providing models with high-quality features,
organizations can improve prediction accuracy, reduce training time, and streamline hyperparameter tuning.

6. Cleaner and More Relevant Data: Derived datasets are typically cleaner and more relevant to specific
business problems. Raw data may contain irrelevant information, missing values, or noise that can skew results.
Derived datasets, on the other hand, focus on key attributes and features that have been filtered and processed for
accuracy and relevance. This results in datasets that are more aligned with business objectives, providing
decision-makers with higher-quality information for driving insights and making decisions.

7. Streamlined Decision-Making for Business Users: By delivering datasets that are pre-processed and enriched
with meaningful features, business users can more easily extract insights without requiring in-depth technical
knowledge of data processing. The simplified structure and curated features of a derived dataset allow for faster
and more accurate decision-making, whether the data is used for creating dashboards, running reports, or feeding
predictive models. This enables business teams to act quickly on data-driven insights without having to navigate
the complexities of raw data transformation.

In enterprise reporting, exporting a derived dataset offers significant advantages over working with raw data. Derived
datasets are the result of pre-processed data that integrates pre-calculated facts and meaningful attributes from a
well-structured data model, such as a star schema. This structure, which combines fact tables (like sales, revenue, or
transaction data) with enriched lookup tables (such as customer demographics or product categories), provides several
key benefits:

1. Simplified Data Structure: Derived datasets come pre-joined and pre-aggregated, meaning that the complex
relationships between fact tables and dimension tables have already been resolved. This eliminates the need for
additional joins or transformations during query time, reducing the complexity of data retrieval for reporting.
Users and analysts can immediately work with the data without needing to understand its underlying relational
structure, leading to faster time to insight.

2. Enhanced Performance: Because the dataset is already enriched and pre-calculated, query execution is
significantly faster. Raw data often requires multiple joins and real-time transformations, which can be time-
consuming, especially with large datasets. By exporting a derived dataset that includes pre-aggregated metrics
(such as total sales, revenue, or customer segments), enterprises can ensure that reporting dashboards, queries,
and analytics tools perform optimally, even under heavy workloads or high concurrency.

3. Consistency and Accuracy: Exporting derived datasets ensures that the same business logic and calculation
methods are applied consistently across all use cases. Whether generating dashboards, building reports, or
performing ad hoc analyses, the data remains consistent because the underlying facts and metrics have been
calculated and validated ahead of time. This reduces the risk of discrepancies or inconsistencies that can arise
when multiple teams perform their own calculations on raw data.

4. Pre-Integrated Lookups for Richer Insights: Derived datasets can also include enriched lookups, such as
customer demographics, product categories, and other contextual attributes. These lookup tables are already
linked to fact tables, providing a richer, more meaningful view of the data. For example, sales data is not only
presented as raw numbers but can also be segmented by customer age, location, or product type, which enables
more granular and insightful analysis without requiring additional processing steps.

5. Improved Dashboard Creation and Decision-Making: With pre-processed data that includes both metrics and
contextual information, creating dashboards and performing real-time analytics becomes more straightforward.
Decision-makers can rely on the fact that the data is immediately usable, accurate, and up-to-date, allowing them
to focus on interpreting insights rather than preparing or cleaning data. This helps accelerate decision-making
processes and ensures that the insights derived are trustworthy and actionable.
6. Reduced Operational Overhead: Exporting derived datasets reduces the operational burden on data teams. By
doing the heavy lifting of data transformation and enrichment upfront, enterprises can minimize the number of
transformations required during reporting. This leads to fewer mistakes, reduces the need for frequent
reprocessing, and frees up resources to focus on more strategic tasks like data governance or advanced analytics.

Adobe Analytics Batch Data Feed

Adobe Analytics data that has been imported into the Adobe Experience Platform (AEP) Data Lake can be further
processed through Data Distiller and then exported in batches for more granular analysis. This processing involves
several key steps that refine the raw data to provide more meaningful insights.

1. Sessionization: One of the core processing steps is sessionization, which groups user activities into defined
sessions. This can be achieved through a window function or a specialized Data Distiller function that
segments interactions into time-bound sessions. For example, all user activities within a 30-minute window can
be grouped as one session. Sessionization is crucial for understanding user journeys, behavior within defined
periods, and the continuity of interactions.

2. Attribution Functions: After sessionizing the data, attribution functions are applied. These functions help
assign credit for specific conversions or events to the appropriate marketing channels, touchpoints, or user
actions. By applying attribution models (such as first-touch, last-touch, or multi-touch attribution), businesses can
understand which marketing efforts led to conversions or significant customer actions.

3. Deep Insights into Behavior and Attribution Patterns: Processing the data through sessionization and
attribution enables businesses to gain a deeper understanding of customer behavior and how different channels,
campaigns, or touchpoints contribute to desired outcomes (such as purchases, sign-ups, or other conversions).
This detailed insight helps to uncover trends and patterns that might be missed with raw, unprocessed data.

4. Batch Export for Further Analysis: Once the data has been refined through sessionization and attribution, it can
be exported in batches. The batch export allows businesses to perform additional analysis, reporting, or
integration with other systems. This refined data is now enriched with session-based insights and attribution
details, making it more actionable for decision-making and performance tracking.

You can see these ideas in action in this special note here.
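
For reference, a minimal sketch of the sessionization step using Data Distiller's SESS_TIMEOUT window function is shown below; the dataset name analytics_events and the 30-minute (1800 second) timeout are illustrative assumptions, and the XDM paths follow the Adobe Analytics field mappings discussed later in this unit.

SELECT
    endUserIDs._experience.aaid.id AS visitor_id,
    timestamp,
    web.webPageDetails.URL AS page_url,
    SESS_TIMEOUT(timestamp, 60 * 30)
        OVER (PARTITION BY endUserIDs._experience.aaid.id
              ORDER BY timestamp
              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session
FROM analytics_events -- illustrative dataset name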

Special Export Formats for Audiences

There are limitations in Profile or Audience activation exports within Adobe Real-Time Customer Data Platform
regarding the structure of the output segment. Output segments are required to follow the essential structure of identity
and attributes, mirroring what is present in the Real-Time Customer Profile. Any other custom audience formatting use
cases fall under the domain of Data Distiller activation.

In certain cases, you may need to export audiences in a special format as required by a destination. These formats may
be unique to the destination’s data integration needs and cannot be handled by the standard Adobe Experience Platform
(AEP) Destination Framework.

In such scenarios, an audience format serves as a contract between AEP and the destination. This contract defines the
structure and rules for how the dataset (audience) should be exported. Essentially, these formats represent custom ways
of structuring audiences that are necessary for some destinations. While audiences are typically handled as part of
AEP’s Data Distiller Derived Datasets, there are special cases where the export format of an audience becomes a
more tailored requirement.

Key Benefits of Using the Destination Framework:


1. Access to Non-Cloud Storage Locations: The Destination Framework allows the export of data to various types
of storage systems, including on-premises, hybrid environments, or specialized non-cloud destinations.

2. Audience Definition Integration: The framework enables the integration of audience definitions within the
Real-Time Customer Profile, ensuring that audience segmentation aligns with the required format for
destinations.

Data Landing Zone Destination

The Data Landing Zone Source is a staging area on the source side where external data sources can push their data,
effectively mirroring the AEP data lake but outside the governance boundary. Each sandbox has its own Source Data
Landing Zone, with datasets having a 7-day time-to-live before deletion. Similarly, on the destination side, there is a
Data Landing Zone Destination where data can be picked up by external systems. This setup allows you to verify
dataset exports and even segment data, making it a fast and reliable method for confirming what data is being
exported, which we’ll utilize in our tutorial.

We will create a Developer Project and use Python to access credentials for the Data Landing Zone. After that, we’ll
use Azure Storage Explorer to retrieve and examine the exported data.

Access Data Landing Zone Destination Credentials

1. Setup the Developer Project based on the instructions in this section

2. Generate the Access Token in Python based on the instructions in this section

3. Access the Data Landing Zone Destination credentials by executing the following code:

import requests

# Replace this with your sandbox name
sandbox_name = 'prod'

# The URL to access the Data Landing Zone Destination credentials
url = 'https://platform.adobe.io/data/foundation/connectors/landingzone/credentials?type=dlz_destination'

# Set the headers
headers = {
    "Authorization": f"Bearer {access_token}",
    "x-api-key": client_id,
    "x-gw-ims-org-id": org_id,
    "x-sandbox-name": sandbox_name,
    "Content-Type": "application/json"
}

# Send the GET request to access the Data Landing Zone
response = requests.get(url, headers=headers)

# Check the response status and output the result
if response.status_code == 200:
    # Successful, get the Data Landing Zone info
    data_landing_zone = response.json()
    print("Data Landing Zone Info:", data_landing_zone)
else:
    # Handle errors
    print(f"Failed to get Data Landing Zone. Status Code: {response.status_code}, Response: {response.text}")

# Send the GET request to retrieve the SAS URL and related credentials
response = requests.get(url, headers=headers)

if response.status_code == 200:
    credentials = response.json()
    print("Container Name:", credentials['containerName'])
    print("SAS Token:", credentials['SASToken'])
    print("Storage Account Name:", credentials['storageAccountName'])
    print("SAS URI:", credentials['SASUri'])
else:
    print(f"Failed to get credentials: {response.status_code}")

If you want to get the Data Landing Zone Source credentials, you can get the same by just replacing the **url** in
the above code as:

url = 'https://platform.adobe.io/data/foundation/connectors/
landingzone/credentials?type=user_drop_zone'
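
If you prefer to access the Data Landing Zone from code rather than (or in addition to) Azure Storage Explorer, the SAS URI returned above can be plugged into the azure-storage-blob package. A minimal sketch, assuming the credentials dictionary from the previous step and that the SAS URI is container-scoped:

from azure.storage.blob import ContainerClient

# The SAS URI points at the Data Landing Zone container and already embeds the SAS token
container = ContainerClient.from_container_url(credentials["SASUri"])

# List the files currently in the container
for blob in container.list_blobs():
    print(blob.name, blob.size)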

Setup Azure Storage Explorer

1. Download the Azure Storage Explorer based on the instructions in this section

2. Setup the Azure Storage Explorer by following the pictures in sequence

Upload the Data Distiller Derived Dataset

We are going to use the derived dataset that we created in the following tutorial:

The CSV file is generated from the **RFM_MODEL_SEGMENT** View:

If you have not completed the tutorial, then follow the steps here to upload the CSV file. Name the dataset
**RFM_MODEL_SEGMENT**. It looks like this:

1. Navigate to Connections->Destinations->Catalog->Cloud Storage->Data Landing Zone. Click Activate.

2. Choose Datasets and Configure Destinations

3. Configure the destination with the following parameters:

1. Datatype: Choose Datasets

2. Description: Be descriptive or just use DLZ_Data_Distiller

3. Compressed Format: GZIP. Gzip (GNU zip) is a popular file compression and decompression tool used to
reduce the size of files, making them easier to store and transmit. You can use any unzip facility in the
destination system to retrieve the raw contents.

4. Include the Manifest file. Details about Manifest files for debugging are here.
4. Choose Data Export Marketing Action. A more detailed discussion is there in DULE section.

5. Click on the Destination Flow created:

6. Choose RFM_MODEL_SEGMENT dataset to export

7. Configure the Batch Schedule

1. Frequency: Change it from the default Daily setting to Hourly.

2. Scheduled start time: It will automatically select the closest available time for you—please do not modify
it. Keep in mind, all times are in UTC.

3. Date: The current date will automatically be set to today’s date.

4. Incremental Export: Keep in mind that the data export is processed incrementally, with the first batch job
uploading the complete file.

Adobe Experience Platform schedule times are always set in UTC (Coordinated Universal Time) which has
tradeoffs. UTC has these advantages:

Global Consistency: UTC provides a single, consistent reference point for time across all regions. This
eliminates confusion when dealing with your users operating in different time zones.

Simplified Scheduling: Having a unified time standard simplifies scheduling, particularly for global teams, as
you avoid needing to adjust for daylight saving time or other regional time changes.

Accurate Execution: Since UTC is not affected by time zone shifts, setting schedules in UTC ensures that
processes, like data ingestion or activation, run accurately and consistently.

Easier Debugging: Using a single time zone for all scheduled events makes tracking, logging, and debugging
system events much simpler, as all timestamps are directly comparable.

Disadvantages of using UTC include the need for time zone conversions, potential confusion for non-technical users,
manual adjustments for Daylight Saving Time, and a higher risk of human error in scheduling.

By executing the following command in Data Distiller, your local time will be converted to UTC, giving you a clear
idea of when the schedule will run:

**SELECT from_unixtime(unix_timestamp()) AS utc_time;**

The above query converts the current Unix timestamp to UTC, which is not affected by daylight saving time. UTC
remains constant throughout the year, so this query will always return the time in UTC regardless of local time zone
changes.
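
If you want to see what a specific local schedule time corresponds to in UTC, a variant such as the following can be used, assuming the Spark SQL **to_utc_timestamp** function is available in your Data Distiller environment; the date and time zone here are illustrative:

**SELECT to_utc_timestamp('2024-09-22 14:00:00', 'America/Los_Angeles') AS utc_time;**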

1. Click Finish to complete the setup.

Monitor the Destination Flow

1. Click on Destinations->Browse->DLZ_Data_Distiller flow

2. You should see the following status:

Download the Data from Azure Storage Explorer


1. If the data export confirmation in the Adobe Experience Platform (AEP) UI is successful but the data doesn’t
appear in Azure Storage Explorer, try refreshing your session first. If the data still isn’t visible, attempt to
reconnect to your Azure Storage account. If issues persist, simply relaunch the Azure Storage Explorer
application to resolve any session problems and display the newly exported data.

2. Your Storage Explorer UI should look like this:

3. Navigate down into the folders:

4. Download the files locally by selecting the files and clicking Download:

5. Manifest file looks like this:

1. To open the other file on a Mac, simply double-click it or unzip it if it’s compressed. This should result in a JSON
file:

The manifest file will look like the following:

{"flowRunId":"a28e30b1-07eb-4219-8d50-317ee82a5b38","scheduledTime":"2024-09-
22T21:00:00Z","exportResults":
[{"sinkPath":"/66f0631a95cb962aee9454aa/exportTime=20240922210000","name":"part-
00000-tid-2828907778374757437-317df0e1-c96d-4951-8b52-0ececf1ddafd-4508827-1-
c000.json.gz","size":21207}]}

The manifest file for the destination export provides key information about the data export process, which is highly
useful for the end user in several ways:

Audit & Monitoring

**flowRunId**: This is a unique internal identifier for the export process or flow within the Adobe
Experience Platform. It allows you to track and trace a specific export job. In case of issues or questions
dealing with Adobe Support, the **flowRunId** can be used by them to find logs, retry the export, or
analyze the performance of the flow.

**scheduledTime**: This field shows when the export was scheduled to occur (2024-09-
22T21:00:00Z in this case). This is useful for auditing purposes, verifying that the export happened at
the correct time, or ensuring the scheduling of exports aligns with your needs (e.g., daily or hourly exports).

**Data Integrity & Validation**

**exportResults**:

**sinkPath**: This is the destination path where the exported data has been stored. This helps the
user quickly locate the data for further processing or analysis.

**name**: This is the name of the exported file. It often contains details like the file name which
help the user identify the data contents and time of export.

**size**: This specifies the size of the exported file. Knowing the file size helps the user
understand the volume of data being exported and can be useful for managing storage costs, transfer
speeds, or estimating the data load. If a file size is unexpectedly small or large, you might want to
investigate further to ensure no data was lost or duplicated.
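
If you want to script these checks instead of inspecting files by hand, a minimal Python sketch along the following lines can parse the manifest and peek into the compressed export. The local file names are illustrative, and it assumes the exported JSON is line-delimited:

import gzip
import json

# Read the downloaded manifest (illustrative local file name)
with open("manifest.json") as f:
    manifest = json.load(f)

print("Flow run ID:", manifest["flowRunId"])
print("Scheduled time:", manifest["scheduledTime"])

# Validate each exported file listed in the manifest
for result in manifest["exportResults"]:
    print(result["sinkPath"], result["name"], result["size"], "bytes")

    # Peek at the first record of the gzipped export (assumes line-delimited JSON)
    with gzip.open(result["name"], "rt") as gz:
        print(json.loads(gz.readline()))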

DULE: Data Export Marketing Action


DULE (Data Usage Labeling and Enforcement) is a data governance system in Adobe Experience Platform (AEP)
that enables you to assign specific usage labels to datasets or individual fields within a dataset. You can create and
apply rules, known as policies, which link these usage labels to actions—usually allowing or disallowing certain uses.
Most prebuilt policies focus on controlling access to audiences or datasets, either restricting the entire audience or
dataset, or specific fields within them.

Let us explore the Data Export Marketing Action in depth

1. Browse to Privacy->Policies->Marketing Actions->Data Export

The Data Export action (authored by Adobe) involves exporting data to any location or destination outside of Adobe
products and services. Examples include downloading data to your local machine, copying data from the screen,
scheduling data delivery to an external location, Customer Journey Analytics scheduled projects, downloading reports,
using the Reporting API, and similar activities.

Whether this action will allow the data to be processed depends on its association with specific labels, as defined by a
governing policy that determines whether the action is approved or restricted. Therefore, the marketing action itself
has no inherent meaning other than enforcing a rule (in this case, an export) or not.

1. Browse to Privacy->Policies->Labels->Data Export

The C2 contract label specifies that any marketing action associated with this usage label in a policy will result in the
export of data being disallowed. This ensures that data governed by the C2 label cannot be exported for marketing
purposes under any circumstances.

The C2 label is more restrictive than the C1 label, which only permits export in aggregated, anonymous form. Browse
through these labels and they give you a sense of the kinds of governance policies you can impose on the data you
have.

1. Click on Privacy->Policies->Browse->3rd Party Export Restriction

It’s clear that the Data Export Marketing Action (under the Associated Marketing Action column) has been
preconfigured by Adobe to automatically prevent the export of any dataset containing fields marked with the C2
contract label (under associated labels). This ensures that fields tagged with the C2 label are restricted from being
exported to comply with contractual and governance rules. Thus, it is the association of labels with the Marketing
Action that defines the policy here.

1. Browse to Datasets->Browse->RFM_MODEL_SEGMENT dataset. Click on Manage Data & Access Labels

2. This brings us into the Data Governance tab. Click on the pencil icon

3. Choose C2 contract label.

4. You will notice that all the fields in this dataset are now associated with the C2 contract label, meaning none of
the fields can be exported to a third-party destination if the marketing action is enabled for that flow. However, a
different dataset with the same schema could still be exported, as the labeling is applied at the dataset level. This
allows for dataset-specific control, giving you the flexibility to manage export permissions on a per-dataset basis.

5. If you want to restrict the export of certain fields across all datasets that share the same schema, you can apply
data usage labels at the field level rather than the dataset level. By labeling specific fields (such as those
containing sensitive or personal information) with restrictive labels like the C2 contract label, these fields will
be blocked from being exported across any dataset using the same schema. Click on Schemas. Turn on Show
adhoc schemas. Search for rfm. Click on the ad hoc schema
The Create Dataset from CSV workflow generates both a dataset and a schema dynamically. Since there is no
predefined knowledge about whether the schema represents an attribute schema (such as XDM Individual Profile), an
event schema (such as XDM Experience Event), or any other standard schema, the resulting schema is referred to as
an ad hoc schema. This is because the system does not automatically categorize the dataset under any specific schema
type that would typically allow it to be used in predefined workflows or processes.

1. Click on the schema and then click on Labels. You will see a screen where you can choose an
individual field and apply a C2 contract label to it. Click on the pencil icon

2. Choose the C2 Contract Label and click Save

3. You will see the labels applied

In general, all datasets that use this schema will have this field blocked from being exported out.

A Note on Adobe Analytics Data Feeds

Please read this tutorial on extracting the fields:

Please read the following tutorial on extracting data from nested structures like arrays and maps such as the Identities
from an **identityMap:**

Hit-Level Data and Identification in Adobe Experience Platform

In Adobe Experience Platform, hit-level data, traditionally collected by Adobe Analytics, is stored as timestamped
event data. This section provides guidance on how to map specific Adobe Analytics Data Feed columns to XDM fields
in the Experience Platform. It also shows how hits, visits, and visitors are identified using XDM fields.

In the Experience Platform, each “hit” represents an event triggered by a user action, such as a page view or link click,
and is identified by a combination of **hitid_high** and **hitid_low**. These fields are essential for
tracking each unique event or interaction.

**hitid_high** and **hitid_low**: Together these form a unique identifier for each hit; each is used with the other
for unique identification.

Hit timestamps: The timestamp of the hit is recorded in UNIX® time; a separate timestamp field is used in
timestamp-enabled datasets.

Visit and Visitor Identification

Visits and visitors are identified using various identity fields in the Experience Platform. The combination of
**visid_high** and **visid_low** forms a unique identifier for each visit. Additionally, customer-specific
visitor IDs (e.g., **cust_visid**) and geolocation data are stored in the **identityMap** structure.

**visid_high** and **visid_low**: Together these form a unique identifier for a visit. The Analytics visitor ID is
carried in **endUserIDs._experience.aaid.id**, with **endUserIDs._experience.aaid.primary** and
**endUserIDs._experience.aaid.namespace.code** used alongside it to uniquely identify the visit.

**cust_visid**: The customer visitor ID, carried in **endUserIDs._experience.aacustomid.id**, with
**endUserIDs._experience.aacustomid.primary** and **endUserIDs._experience.aacustomid.namespace.code**
(the customer visitor ID namespace code) used to uniquely identify it.

Geolocation data, such as country, region, or city, is also captured for each hit.

Commerce and Event Tracking

Commerce events such as purchases, product views, and checkouts are critical for e-commerce use cases. These events
are represented in XDM fields such as commerce.purchases, and custom events can be tracked using
_experience.analytics fields.

- **commerce.purchases**, **commerce.productViews**, and related commerce fields: Standard commerce and custom events triggered on the hit (custom events are captured under **_experience.analytics**).
- web.webInteraction.type: The type of hit (e.g., standard hit, download link, exit link, or custom link clicked).
- web.webInteraction.URL: A variable used in link tracking image requests; contains the URL of the clicked link.
- web.webInteraction.name: A variable used in link tracking image requests; lists the custom name of the link.
- search.isPaid: A flag that indicates whether the hit matches paid search detection.
- The type of referral for the hit, represented as a numeric ID.
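
For instance, a daily purchase trend could be sketched as below. The dataset name `events_dataset` is an assumption, and the query assumes purchase counts are carried in the standard `commerce.purchases.value` measure.

```sql
-- Hypothetical sketch: daily purchase counts from commerce events.
SELECT
    date_format(timestamp, 'yyyy-MM-dd') AS day,
    SUM(commerce.purchases.value)        AS purchases
FROM events_dataset
WHERE commerce.purchases.value IS NOT NULL
GROUP BY 1
ORDER BY 1;
```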

Important: Post-Processing Columns

Adobe Analytics uses columns prefixed with post_ to represent data after processing. However, in the Experience
Platform, there is no concept of post-processing fields for datasets collected through the Experience Platform Edge
Network (Web SDK, Mobile SDK, Server API). Consequently, both pre- and post-processed data feed columns map to
the same XDM field.

For example, both page_url and post_page_url map to web.webPageDetails.URL.

Performing transformations like sessionization, attribution, and deduplication in your queries requires leveraging Data
Distiller functions.
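
For example, deduplication can be expressed with a window function, as in the sketch below; the dataset name `events_dataset` is an assumption, and it assumes duplicate events share the same `_id`.

```sql
-- Hypothetical sketch: keep only the earliest record per _id.
SELECT *
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY _id ORDER BY timestamp ASC) AS row_num
    FROM events_dataset
) ranked
WHERE row_num = 1;
```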

Sessionization is used to group individual hits into logical sessions based on user interactions within a given time
frame. The key ideas that we shall use are the following:
1. **Sessionization**: The **SESS_TIMEOUT** function groups user events into sessions based on a timeout period (30 minutes in this case). A new session starts if no activity occurs within the timeout window.

2. **Ingestion Time Tracking**: The script tracks the start and end times of batch ingestion and uses this information to process only new data and to update the checkpoint logs for future reference.

3. **Checkpoint Logs**: The process logs the status of each batch in **checkpoint_log**, making it easy to track the state of data processing.

Sample code for Sessionization

<span class="katex-display"><span class="katex"><span class="katex-mathml">


<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics>
<mrow><mi>B</mi><mi>E</mi><mi>G</mi><mi>I</mi><mi>N</mi><mo>−</mo><mo>−</mo>
<mi>D</mi><mi>i</mi><mi>s</mi><mi>a</mi><mi>b</mi><mi>l</mi><mi>e</mi>
<mi>d</mi><mi>r</mi><mi>o</mi><mi>p</mi><mi>p</mi><mi>i</mi><mi>n</mi>
<mi>g</mi><mi>s</mi><mi>y</mi><mi>s</mi><mi>t</mi><mi>e</mi><mi>m</mi>
<mi>c</mi><mi>o</mi><mi>l</mi><mi>u</mi><mi>m</mi><mi>n</mi><mi>s</mi>
<mi>s</mi><mi>e</mi><mi>t</mi><mi>d</mi><mi>r</mi><mi>o</mi><msub><mi>p</mi>
<mi>s</mi></msub><mi>y</mi><mi>s</mi><mi>t</mi><mi>e</mi><msub><mi>m</mi>
<mi>c</mi></msub><mi>o</mi><mi>l</mi><mi>u</mi><mi>m</mi><mi>n</mi><mi>s</mi>
<mo>=</mo><mi>f</mi><mi>a</mi><mi>l</mi><mi>s</mi><mi>e</mi><mo
separator="true">;</mo><mo>−</mo><mo>−</mo><mi>I</mi><mi>n</mi><mi>i</mi>
<mi>t</mi><mi>i</mi><mi>a</mi><mi>l</mi><mi>i</mi><mi>z</mi><mi>e</mi>
<mi>v</mi><mi>a</mi><mi>r</mi><mi>i</mi><mi>a</mi><mi>b</mi><mi>l</mi>
<mi>e</mi><mi>s</mi><mi>S</mi><mi>E</mi><mi>T</mi><mi
mathvariant="normal">@</mi><mi>l</mi><mi>a</mi><mi>s</mi><msub><mi>t</mi>
<mi>u</mi></msub><mi>p</mi><mi>d</mi><mi>a</mi><mi>t</mi><mi>e</mi><msub>
<mi>d</mi><mi>t</mi></msub><mi>i</mi><mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi>
<mi>a</mi><mi>m</mi><mi>p</mi><mo>=</mo><mi>S</mi><mi>E</mi><mi>L</mi>
<mi>E</mi><mi>C</mi><mi>T</mi><mi>C</mi><mi>U</mi><mi>R</mi><mi>R</mi>
<mi>E</mi><mi>N</mi><msub><mi>T</mi><mi>T</mi></msub><mi>I</mi><mi>M</mi>
<mi>E</mi><mi>S</mi><mi>T</mi><mi>A</mi><mi>M</mi><mi>P</mi><mo
separator="true">;</mo><mo>−</mo><mo>−</mo><mi>G</mi><mi>e</mi><mi>t</mi>
<mi>t</mi><mi>h</mi><mi>e</mi><mi>l</mi><mi>a</mi><mi>s</mi><mi>t</mi>
<mi>p</mi><mi>r</mi><mi>o</mi><mi>c</mi><mi>e</mi><mi>s</mi><mi>s</mi>
<mi>e</mi><mi>d</mi><mi>b</mi><mi>a</mi><mi>t</mi><mi>c</mi><mi>h</mi>
<mi>i</mi><mi>n</mi><mi>g</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>i</mi>
<mi>o</mi><mi>n</mi><mi>t</mi><mi>i</mi><mi>m</mi><mi>e</mi><mi>S</mi>
<mi>E</mi><mi>T</mi><mi mathvariant="normal">@</mi><mi>f</mi><mi>r</mi>
<mi>o</mi><msub><mi>m</mi><mi>b</mi></msub><mi>a</mi><mi>t</mi><mi>c</mi><msub>
<mi>h</mi><mi>i</mi></msub><mi>n</mi><mi>g</mi><mi>e</mi><mi>s</mi><mi>t</mi>
<mi>i</mi><mi>o</mi><msub><mi>n</mi><mi>t</mi></msub><mi>i</mi><mi>m</mi>
<mi>e</mi><mo>=</mo><mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi>
<mi>T</mi><mi>c</mi><mi>o</mi><mi>a</mi><mi>l</mi><mi>e</mi><mi>s</mi>
<mi>c</mi><mi>e</mi><mo stretchy="false">(</mo><mi>l</mi><mi>a</mi><mi>s</mi>
<msub><mi>t</mi><mi>b</mi></msub><mi>a</mi><mi>t</mi><mi>c</mi><msub><mi>h</mi>
<mi>i</mi></msub><mi>n</mi><mi>g</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>i</mi>
<mi>o</mi><msub><mi>n</mi><mi>t</mi></msub><mi>i</mi><mi>m</mi><mi>e</mi><msup>
<mo separator="true">,</mo><mo mathvariant="normal" lspace="0em"
rspace="0em">′</mo></msup><mi>H</mi><mi>E</mi><mi>A</mi><msup><mi>D</mi><mo
mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mo
stretchy="false">)</mo><mi>F</mi><mi>R</mi><mi>O</mi><mi>M</mi><mi>c</mi>
<mi>h</mi><mi>e</mi><mi>c</mi><mi>k</mi><mi>p</mi><mi>o</mi><mi>i</mi>
<mi>n</mi><msub><mi>t</mi><mi>l</mi></msub><mi>o</mi><mi>g</mi><mi>a</mi>
<mi>J</mi><mi>O</mi><mi>I</mi><mi>N</mi><mo stretchy="false">(</mo><mi>S</mi>
<mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi><mi>T</mi><mi>M</mi><mi>A</mi>
<mi>X</mi><mo stretchy="false">(</mo><mi>p</mi><mi>r</mi><mi>o</mi><mi>c</mi>
<mi>e</mi><mi>s</mi><msub><mi>s</mi><mi>t</mi></msub><mi>i</mi><mi>m</mi>
<mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi><mo
stretchy="false">)</mo><mi>A</mi><mi>S</mi><mi>p</mi><mi>r</mi><mi>o</mi>
<mi>c</mi><mi>e</mi><mi>s</mi><msub><mi>s</mi><mi>t</mi></msub><mi>i</mi>
<mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi>
<mi>F</mi><mi>R</mi><mi>O</mi><mi>M</mi><mi>c</mi><mi>h</mi><mi>e</mi>
<mi>c</mi><mi>k</mi><mi>p</mi><mi>o</mi><mi>i</mi><mi>n</mi><msub><mi>t</mi>
<mi>l</mi></msub><mi>o</mi><mi>g</mi><mi>W</mi><mi>H</mi><mi>E</mi><mi>R</mi>
<mi>E</mi><mi>p</mi><mi>r</mi><mi>o</mi><mi>c</mi><mi>e</mi><mi>s</mi><msub>
<mi>s</mi><mi>n</mi></msub><mi>a</mi><mi>m</mi><mi>e</mi><msup><mo>=</mo><mo
mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mi>d</mi>
<mi>a</mi><mi>t</mi><msub><mi>a</mi><mi>f</mi></msub><mi>e</mi><mi>e</mi><msup>
<mi>d</mi><mo mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup>
<mi>A</mi><mi>N</mi><mi>D</mi><mi>p</mi><mi>r</mi><mi>o</mi><mi>c</mi>
<mi>e</mi><mi>s</mi><msub><mi>s</mi><mi>s</mi></msub><mi>t</mi><mi>a</mi>
<mi>t</mi><mi>u</mi><mi>s</mi><msup><mo>=</mo><mo mathvariant="normal"
lspace="0em" rspace="0em">′</mo></msup><mi>S</mi><mi>U</mi><mi>C</mi><mi>C</mi>
<mi>E</mi><mi>S</mi><mi>S</mi><mi>F</mi><mi>U</mi><msup><mi>L</mi><mo
mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mo
stretchy="false">)</mo><mi>b</mi><mi>O</mi><mi>N</mi><mi>a</mi><mi
mathvariant="normal">.</mi><mi>p</mi><mi>r</mi><mi>o</mi><mi>c</mi><mi>e</mi>
<mi>s</mi><msub><mi>s</mi><mi>t</mi></msub><mi>i</mi><mi>m</mi><mi>e</mi>
<mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi><mo>=</mo><mi>b</mi><mi
mathvariant="normal">.</mi><mi>p</mi><mi>r</mi><mi>o</mi><mi>c</mi><mi>e</mi>
<mi>s</mi><msub><mi>s</mi><mi>t</mi></msub><mi>i</mi><mi>m</mi><mi>e</mi>
<mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi><mo separator="true">;</mo>
<mo>−</mo><mo>−</mo><mi>G</mi><mi>e</mi><mi>t</mi><mi>t</mi><mi>h</mi>
<mi>e</mi><mi>l</mi><mi>a</mi><mi>s</mi><mi>t</mi><mi>b</mi><mi>a</mi>
<mi>t</mi><mi>c</mi><mi>h</mi><mi>i</mi><mi>n</mi><mi>g</mi><mi>e</mi>
<mi>s</mi><mi>t</mi><mi>i</mi><mi>o</mi><mi>n</mi><mi>t</mi><mi>i</mi>
<mi>m</mi><mi>e</mi><mi>S</mi><mi>E</mi><mi>T</mi><mi
mathvariant="normal">@</mi><mi>t</mi><msub><mi>o</mi><mi>b</mi></msub>
<mi>a</mi><mi>t</mi><mi>c</mi><msub><mi>h</mi><mi>i</mi></msub><mi>n</mi>
<mi>g</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>i</mi><mi>o</mi><msub><mi>n</mi>
<mi>t</mi></msub><mi>i</mi><mi>m</mi><mi>e</mi><mo>=</mo><mi>S</mi><mi>E</mi>
<mi>L</mi><mi>E</mi><mi>C</mi><mi>T</mi><mi>M</mi><mi>A</mi><mi>X</mi><msub><mo
stretchy="false">(</mo><mi>a</mi></msub><mi>c</mi><msub><mi>p</mi><mi>s</mi>
</msub><mi>y</mi><mi>s</mi><mi>t</mi><mi>e</mi><msub><mi>m</mi><mi>m</mi>
</msub><mi>e</mi><mi>t</mi><mi>a</mi><mi>d</mi><mi>a</mi><mi>t</mi><mi>a</mi>
<mi mathvariant="normal">.</mi><mi>i</mi><mi>n</mi><mi>g</mi><mi>e</mi>
<mi>s</mi><mi>t</mi><mi>T</mi><mi>i</mi><mi>m</mi><mi>e</mi><mo
stretchy="false">)</mo><mi>F</mi><mi>R</mi><mi>O</mi><mi>M</mi><mi>e</mi>
<mi>v</mi><mi>e</mi><mi>n</mi><mi>t</mi><msub><mi>s</mi><mi>d</mi></msub>
<mi>a</mi><mi>t</mi><mi>a</mi><mi>s</mi><mi>e</mi><mi>t</mi><mo
separator="true">;</mo><mo>−</mo><mo>−</mo><mi>S</mi><mi>e</mi><mi>s</mi>
<mi>s</mi><mi>i</mi><mi>o</mi><mi>n</mi><mi>i</mi><mi>z</mi><mi>e</mi>
<mi>t</mi><mi>h</mi><mi>e</mi><mi>d</mi><mi>a</mi><mi>t</mi><mi>a</mi>
<mi>a</mi><mi>n</mi><mi>d</mi><mi>i</mi><mi>n</mi><mi>s</mi><mi>e</mi>
<mi>r</mi><mi>t</mi><mi>i</mi><mi>n</mi><mi>t</mi><mi>o</mi><mi>d</mi>
<mi>a</mi><mi>t</mi><msub><mi>a</mi><mi>f</mi></msub><mi>e</mi><mi>e</mi>
<mi>d</mi><mi mathvariant="normal">.</mi><mi>I</mi><mi>N</mi><mi>S</mi>
<mi>E</mi><mi>R</mi><mi>T</mi><mi>I</mi><mi>N</mi><mi>T</mi><mi>O</mi>
<mi>d</mi><mi>a</mi><mi>t</mi><msub><mi>a</mi><mi>f</mi></msub><mi>e</mi>
<mi>e</mi><mi>d</mi><mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi>
<mi>T</mi><mo>∗</mo><mi>F</mi><mi>R</mi><mi>O</mi><mi>M</mi><mo
stretchy="false">(</mo><mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi>
<mi>T</mi><mi>u</mi><mi>s</mi><mi>e</mi><mi>r</mi><mi>I</mi><mi>d</mi>
<mi>e</mi><mi>n</mi><mi>t</mi><mi>i</mi><mi>t</mi><mi>y</mi><mo
separator="true">,</mo><mi>t</mi><mi>i</mi><mi>m</mi><mi>e</mi><mi>s</mi>
<mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi><mo separator="true">,</mo><mi>S</mi>
<mi>E</mi><mi>S</mi><msub><mi>S</mi><mi>T</mi></msub><mi>I</mi><mi>M</mi>
<mi>E</mi><mi>O</mi><mi>U</mi><mi>T</mi><mo stretchy="false">(</mo><mi>t</mi>
<mi>i</mi><mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi>
<mi>p</mi><mo separator="true">,</mo><mn>60</mn><mo>∗</mo><mn>30</mn><mo
stretchy="false">)</mo><mi>O</mi><mi>V</mi><mi>E</mi><mi>R</mi><mo
stretchy="false">(</mo><mi>P</mi><mi>A</mi><mi>R</mi><mi>T</mi><mi>I</mi>
<mi>T</mi><mi>I</mi><mi>O</mi><mi>N</mi><mi>B</mi><mi>Y</mi><mi>u</mi>
<mi>s</mi><mi>e</mi><mi>r</mi><mi>I</mi><mi>d</mi><mi>e</mi><mi>n</mi>
<mi>t</mi><mi>i</mi><mi>t</mi><mi>y</mi><mi>O</mi><mi>R</mi><mi>D</mi>
<mi>E</mi><mi>R</mi><mi>B</mi><mi>Y</mi><mi>t</mi><mi>i</mi><mi>m</mi>
<mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi><mi>R</mi>
<mi>O</mi><mi>W</mi><mi>S</mi><mi>B</mi><mi>E</mi><mi>T</mi><mi>W</mi>
<mi>E</mi><mi>E</mi><mi>N</mi><mi>U</mi><mi>N</mi><mi>B</mi><mi>O</mi>
<mi>U</mi><mi>N</mi><mi>D</mi><mi>E</mi><mi>D</mi><mi>P</mi><mi>R</mi>
<mi>E</mi><mi>C</mi><mi>E</mi><mi>D</mi><mi>I</mi><mi>N</mi><mi>G</mi>
<mi>A</mi><mi>N</mi><mi>D</mi><mi>C</mi><mi>U</mi><mi>R</mi><mi>R</mi>
<mi>E</mi><mi>N</mi><mi>T</mi><mi>R</mi><mi>O</mi><mi>W</mi><mo
stretchy="false">)</mo><mi>A</mi><mi>S</mi><mi>s</mi><mi>e</mi><mi>s</mi>
<mi>s</mi><mi>i</mi><mi>o</mi><msub><mi>n</mi><mi>d</mi></msub><mi>a</mi>
<mi>t</mi><mi>a</mi><mo separator="true">,</mo><mi>p</mi><mi>a</mi><mi>g</mi>
<msub><mi>e</mi><mi>n</mi></msub><mi>a</mi><mi>m</mi><mi>e</mi><mo
separator="true">,</mo><mi>i</mi><mi>n</mi><mi>g</mi><mi>e</mi><mi>s</mi><msub>
<mi>t</mi><mi>t</mi></msub><mi>i</mi><mi>m</mi><mi>e</mi><mi>F</mi><mi>R</mi>
<mi>O</mi><mi>M</mi><mo stretchy="false">(</mo><mi>S</mi><mi>E</mi><mi>L</mi>
<mi>E</mi><mi>C</mi><mi>T</mi><mi>u</mi><mi>s</mi><mi>e</mi><mi>r</mi>
<mi>I</mi><mi>d</mi><mi>e</mi><mi>n</mi><mi>t</mi><mi>i</mi><mi>t</mi>
<mi>y</mi><mo separator="true">,</mo><mi>t</mi><mi>i</mi><mi>m</mi><mi>e</mi>
<mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi><mo separator="true">,</mo>
<mi>w</mi><mi>e</mi><mi>b</mi><mi mathvariant="normal">.</mi><mi>w</mi>
<mi>e</mi><mi>b</mi><mi>P</mi><mi>a</mi><mi>g</mi><mi>e</mi><mi>D</mi>
<mi>e</mi><mi>t</mi><mi>a</mi><mi>i</mi><mi>l</mi><mi>s</mi><mi
mathvariant="normal">.</mi><mi>n</mi><mi>a</mi><mi>m</mi><mi>e</mi><mi>A</mi>
<mi>S</mi><mi>p</mi><mi>a</mi><mi>g</mi><msub><mi>e</mi><mi>n</mi></msub>
<mi>a</mi><mi>m</mi><mi>e</mi><msub><mo separator="true">,</mo><mi>a</mi>
</msub><mi>c</mi><msub><mi>p</mi><mi>s</mi></msub><mi>y</mi><mi>s</mi>
<mi>t</mi><mi>e</mi><msub><mi>m</mi><mi>m</mi></msub><mi>e</mi><mi>t</mi>
<mi>a</mi><mi>d</mi><mi>a</mi><mi>t</mi><mi>a</mi><mi mathvariant="normal">.
</mi><mi>i</mi><mi>n</mi><mi>g</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>T</mi>
<mi>i</mi><mi>m</mi><mi>e</mi><mi>A</mi><mi>S</mi><mi>i</mi><mi>n</mi>
<mi>g</mi><mi>e</mi><mi>s</mi><msub><mi>t</mi><mi>t</mi></msub><mi>i</mi>
<mi>m</mi><mi>e</mi><mi>F</mi><mi>R</mi><mi>O</mi><mi>M</mi><mi>e</mi>
<mi>v</mi><mi>e</mi><mi>n</mi><mi>t</mi><msub><mi>s</mi><mi>d</mi></msub>
<mi>a</mi><mi>t</mi><mi>a</mi><mi>s</mi><mi>e</mi><mi>t</mi><mi>W</mi>
<mi>H</mi><mi>E</mi><mi>R</mi><mi>E</mi><mi>t</mi><mi>i</mi><mi>m</mi>
<mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi><mo>&gt;</mo><mo>=
</mo><mi>c</mi><mi>u</mi><mi>r</mi><mi>r</mi><mi>e</mi><mi>n</mi><msub>
<mi>t</mi><mi>d</mi></msub><mi>a</mi><mi>t</mi><mi>e</mi><mo>−</mo><mn>90</mn>
<mo stretchy="false">)</mo><mi>A</mi><mi>S</mi><mi>a</mi><mi>O</mi><mi>R</mi>
<mi>D</mi><mi>E</mi><mi>R</mi><mi>B</mi><mi>Y</mi><mi>u</mi><mi>s</mi>
<mi>e</mi><mi>r</mi><mi>I</mi><mi>d</mi><mi>e</mi><mi>n</mi><mi>t</mi>
<mi>i</mi><mi>t</mi><mi>y</mi><mo separator="true">,</mo><mi>t</mi><mi>i</mi>
<mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi>
<mi>A</mi><mi>S</mi><mi>C</mi><mo stretchy="false">)</mo><mi>A</mi><mi>S</mi>
<mi>b</mi><mi>W</mi><mi>H</mi><mi>E</mi><mi>R</mi><mi>E</mi><mi>b</mi><mi
mathvariant="normal">.</mi><mi>i</mi><mi>n</mi><mi>g</mi><mi>e</mi><mi>s</mi>
<msub><mi>t</mi><mi>t</mi></msub><mi>i</mi><mi>m</mi><mi>e</mi><mo>&gt;</mo>
<mo>=</mo><mi mathvariant="normal">@</mi><mi>f</mi><mi>r</mi><mi>o</mi><msub>
<mi>m</mi><mi>b</mi></msub><mi>a</mi><mi>t</mi><mi>c</mi><msub><mi>h</mi>
<mi>i</mi></msub><mi>n</mi><mi>g</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>i</mi>
<mi>o</mi><msub><mi>n</mi><mi>t</mi></msub><mi>i</mi><mi>m</mi><mi>e</mi><mo
separator="true">;</mo><mo>−</mo><mo>−</mo><mi>U</mi><mi>p</mi><mi>d</mi>
<mi>a</mi><mi>t</mi><mi>e</mi><mi>t</mi><mi>h</mi><mi>e</mi><mi>c</mi>
<mi>h</mi><mi>e</mi><mi>c</mi><mi>k</mi><mi>p</mi><mi>o</mi><mi>i</mi>
<mi>n</mi><msub><mi>t</mi><mi>l</mi></msub><mi>o</mi><mi>g</mi><mi>t</mi>
<mi>a</mi><mi>b</mi><mi>l</mi><mi>e</mi><mi>I</mi><mi>N</mi><mi>S</mi>
<mi>E</mi><mi>R</mi><mi>T</mi><mi>I</mi><mi>N</mi><mi>T</mi><mi>O</mi>
<mi>c</mi><mi>h</mi><mi>e</mi><mi>c</mi><mi>k</mi><mi>p</mi><mi>o</mi>
<mi>i</mi><mi>n</mi><msub><mi>t</mi><mi>l</mi></msub><mi>o</mi><mi>g</mi>
<mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi><msup><mi>T</mi><mo
mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mi>d</mi>
<mi>a</mi><mi>t</mi><msub><mi>a</mi><mi>f</mi></msub><mi>e</mi><mi>e</mi><msup>
<mi>d</mi><mo mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup>
<mi>p</mi><mi>r</mi><mi>o</mi><mi>c</mi><mi>e</mi><mi>s</mi><msub><mi>s</mi>
<mi>n</mi></msub><mi>a</mi><mi>m</mi><mi>e</mi><msup><mo separator="true">,
</mo><mo mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mi>S</mi>
<mi>U</mi><mi>C</mi><mi>C</mi><mi>E</mi><mi>S</mi><mi>S</mi><mi>F</mi>
<mi>U</mi><msup><mi>L</mi><mo mathvariant="normal" lspace="0em"
rspace="0em">′</mo></msup><mi>p</mi><mi>r</mi><mi>o</mi><mi>c</mi><mi>e</mi>
<mi>s</mi><msub><mi>s</mi><mi>s</mi></msub><mi>t</mi><mi>a</mi><mi>t</mi>
<mi>u</mi><mi>s</mi><mo separator="true">,</mo><mi>c</mi><mi>a</mi><mi>s</mi>
<mi>t</mi><mo stretchy="false">(</mo><mi mathvariant="normal">@</mi><mi>t</mi>
<msub><mi>o</mi><mi>b</mi></msub><mi>a</mi><mi>t</mi><mi>c</mi><msub><mi>h</mi>
<mi>i</mi></msub><mi>n</mi><mi>g</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>i</mi>
<mi>o</mi><msub><mi>n</mi><mi>t</mi></msub><mi>i</mi><mi>m</mi><mi>e</mi>
<mi>A</mi><mi>S</mi><mi>s</mi><mi>t</mi><mi>r</mi><mi>i</mi><mi>n</mi>
<mi>g</mi><mo stretchy="false">)</mo><mi>l</mi><mi>a</mi><mi>s</mi><msub>
<mi>t</mi><mi>b</mi></msub><mi>a</mi><mi>t</mi><mi>c</mi><msub><mi>h</mi>
<mi>i</mi></msub><mi>n</mi><mi>g</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>i</mi>
<mi>o</mi><msub><mi>n</mi><mi>t</mi></msub><mi>i</mi><mi>m</mi><mi>e</mi><mo
separator="true">,</mo><mi>c</mi><mi>a</mi><mi>s</mi><mi>t</mi><mo
stretchy="false">(</mo><mi mathvariant="normal">@</mi><mi>l</mi><mi>a</mi>
<mi>s</mi><msub><mi>t</mi><mi>u</mi></msub><mi>p</mi><mi>d</mi><mi>a</mi>
<mi>t</mi><mi>e</mi><msub><mi>d</mi><mi>t</mi></msub><mi>i</mi><mi>m</mi>
<mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi><mi>A</mi>
<mi>S</mi><mi>T</mi><mi>I</mi><mi>M</mi><mi>E</mi><mi>S</mi><mi>T</mi>
<mi>A</mi><mi>M</mi><mi>P</mi><mo stretchy="false">)</mo><mi>p</mi><mi>r</mi>
<mi>o</mi><mi>c</mi><mi>e</mi><mi>s</mi><msub><mi>s</mi><mi>t</mi></msub>
<mi>i</mi><mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi>
<mi>p</mi><mi>E</mi><mi>N</mi><mi>D</mi></mrow><annotation
encoding="application/x-tex"> BEGIN
-- Disable dropping system columns
set drop_system_columns=false;

-- Initialize variables
SET @last_updated_timestamp = SELECT CURRENT_TIMESTAMP;
-- Get the last processed batch ingestion time
SET @from_batch_ingestion_time = SELECT coalesce(last_batch_ingestion_time,
&#x27;HEAD&#x27;)
FROM checkpoint_log a
JOIN (
SELECT MAX(process_timestamp) AS process_timestamp
FROM checkpoint_log
WHERE process_name = &#x27;data_feed&#x27;
AND process_status = &#x27;SUCCESSFUL&#x27;
) b
ON a.process_timestamp = b.process_timestamp;

-- Get the last batch ingestion time


SET @to_batch_ingestion_time = SELECT MAX(_acp_system_metadata.ingestTime)
FROM events_dataset;

-- Sessionize the data and insert into data_feed.


INSERT INTO data_feed
SELECT *
FROM (
SELECT
userIdentity,
timestamp,
SESS_TIMEOUT(timestamp, 60 * 30) OVER (
PARTITION BY userIdentity
ORDER BY timestamp
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS session_data,
page_name,
ingest_time
FROM (
SELECT
userIdentity,
timestamp,
web.webPageDetails.name AS page_name,
_acp_system_metadata.ingestTime AS ingest_time
FROM events_dataset
WHERE timestamp &gt;= current_date - 90
) AS a
ORDER BY userIdentity, timestamp ASC
) AS b
WHERE b.ingest_time &gt;= @from_batch_ingestion_time;

-- Update the checkpoint_log table


INSERT INTO checkpoint_log
SELECT
&#x27;data_feed&#x27; process_name,
&#x27;SUCCESSFUL&#x27; process_status,
cast(@to_batch_ingestion_time AS string) last_batch_ingestion_time,
cast(@last_updated_timestamp AS TIMESTAMP) process_timestamp
END
</annotation></semantics></math></span><span class="katex-html" aria-
hidden="true"><span class="base"><span class="strut"
style="height:0.7667em;vertical-align:-0.0833em;"></span><span class="mord
mathnormal">BEG</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:0.8889em;vertical-align:-0.1944em;"></span><span
class="mord">−</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">ab</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">ro</span><span class="mord mathnormal">pp</span><span
class="mord mathnormal">in</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord mathnormal">sys</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">m</span><span class="mord mathnormal">co</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord mathnormal">u</span><span class="mord mathnormal">mn</span><span
class="mord mathnormal">sse</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">ro</span><span
class="mord"><span class="mord mathnormal">p</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">s</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">ys</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">e</span><span
class="mord"><span class="mord mathnormal">m</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">c</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">o</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord mathnormal">u</span><span class="mord mathnormal">mn</span><span
class="mord mathnormal">s</span><span class="mspace" style="margin-
right:0.2778em;"></span><span class="mrel">=</span><span class="mspace"
style="margin-right:0.2778em;"></span></span><span class="base"><span
class="strut" style="height:0.8889em;vertical-align:-0.1944em;"></span><span
class="mord mathnormal" style="margin-right:0.10764em;">f</span><span
class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">se</span><span
class="mpunct">;</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord">−</span><span class="mspace" style="margin-
right:0.2222em;"></span><span class="mbin">−</span><span class="mspace"
style="margin-right:0.2222em;"></span></span><span class="base"><span
class="strut" style="height:0.8889em;vertical-align:-0.1944em;"></span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal">ni</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">ia</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">ze</span><span class="mord mathnormal" style="margin-
right:0.03588em;">v</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.02778em;">r</span><span
class="mord mathnormal">iab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">es</span><span
class="mord mathnormal" style="margin-right:0.13889em;">SET</span><span
class="mord">@</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">s</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">u</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">p</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">e</span><span class="mord"><span class="mord
mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">im</span><span class="mord mathnormal">es</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">am</span><span
class="mord mathnormal">p</span><span class="mspace" style="margin-
right:0.2778em;"></span><span class="mrel">=</span><span class="mspace"
style="margin-right:0.2778em;"></span></span><span class="base"><span
class="strut" style="height:0.8778em;vertical-align:-0.1944em;"></span><span
class="mord mathnormal" style="margin-right:0.05764em;">SE</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.07153em;">ECTC</span><span class="mord mathnormal" style="margin-
right:0.10903em;">U</span><span class="mord mathnormal" style="margin-
right:0.10903em;">RREN</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.13889em;">T</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3283em;"><span style="top:-2.55em;margin-left:-0.1389em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight"
style="margin-right:0.13889em;">T</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.13889em;">MEST</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.13889em;">MP</span><span class="mpunct">;</span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord">−</span><span
class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:0.8889em;vertical-align:-0.1944em;"></span><span class="mord
mathnormal">G</span><span class="mord mathnormal">e</span><span class="mord
mathnormal">tt</span><span class="mord mathnormal">h</span><span class="mord
mathnormal">e</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">tp</span><span
class="mord mathnormal">rocesse</span><span class="mord mathnormal">d</span>
<span class="mord mathnormal">ba</span><span class="mord mathnormal">t</span>
<span class="mord mathnormal">c</span><span class="mord mathnormal">hin</span>
<span class="mord mathnormal" style="margin-right:0.03588em;">g</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">o</span><span
class="mord mathnormal">n</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">im</span><span class="mord mathnormal">e</span><span
class="mord mathnormal" style="margin-right:0.13889em;">SET</span><span
class="mord">@</span><span class="mord mathnormal" style="margin-
right:0.10764em;">f</span><span class="mord mathnormal">ro</span><span
class="mord"><span class="mord mathnormal">m</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3361em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">b</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">c</span><span
class="mord"><span class="mord mathnormal">h</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">i</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">n</span><span
class="mord mathnormal" style="margin-right:0.03588em;">g</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">o</span><span
class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.2806em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">t</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">im</span><span
class="mord mathnormal">e</span><span class="mspace" style="margin-
right:0.2778em;"></span><span class="mrel">=</span><span class="mspace"
style="margin-right:0.2778em;"></span></span><span class="base"><span
class="strut" style="height:1.0519em;vertical-align:-0.25em;"></span><span
class="mord mathnormal" style="margin-right:0.05764em;">SE</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ECT</span><span class="mord mathnormal">co</span><span
class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">esce</span><span
class="mopen">(</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">s</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3361em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">b</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">c</span><span class="mord"><span class="mord
mathnormal">h</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3117em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">i</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">n</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord mathnormal">es</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">o</span><span class="mord"><span class="mord
mathnormal">n</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">im</span><span class="mord mathnormal">e</span><span
class="mpunct"><span class="mpunct">,</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord mathnormal"
style="margin-right:0.08125em;">H</span><span class="mord mathnormal"
style="margin-right:0.05764em;">E</span><span class="mord mathnormal">A</span>
<span class="mord"><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="msupsub"><span class="vlist-t"><span
class="vlist-r"><span class="vlist" style="height:0.8019em;"><span
style="top:-3.113em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mtight"><span class="mord mtight">′</span></span></span>
</span></span></span></span></span></span><span class="mclose">)</span><span
class="mord mathnormal" style="margin-right:0.10903em;">FROM</span><span
class="mord mathnormal">c</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">ec</span><span class="mord mathnormal" style="margin-
right:0.03148em;">k</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">in</span><span
class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3361em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight"
style="margin-right:0.01968em;">l</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">o</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.09618em;">J</span><span
class="mord mathnormal" style="margin-right:0.02778em;">O</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.10903em;">N</span><span
class="mopen">(</span><span class="mord mathnormal" style="margin-
right:0.05764em;">SE</span><span class="mord mathnormal">L</span><span
class="mord mathnormal" style="margin-right:0.10903em;">ECTM</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.07847em;">X</span><span class="mopen">(</span><span class="mord
mathnormal">p</span><span class="mord mathnormal">roces</span><span
class="mord"><span class="mord mathnormal">s</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.2806em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">t</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">im</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">am</span><span class="mord mathnormal">p</span><span
class="mclose">)</span><span class="mord mathnormal">A</span><span class="mord
mathnormal">Sp</span><span class="mord mathnormal">roces</span><span
class="mord"><span class="mord mathnormal">s</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.2806em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">t</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">im</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">am</span><span class="mord mathnormal" style="margin-
right:0.10903em;">pFROM</span><span class="mord mathnormal">c</span><span
class="mord mathnormal">h</span><span class="mord mathnormal">ec</span><span
class="mord mathnormal" style="margin-right:0.03148em;">k</span><span
class="mord mathnormal">p</span><span class="mord mathnormal">o</span><span
class="mord mathnormal">in</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3361em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight" style="margin-right:0.01968em;">l</span>
</span></span></span><span class="vlist-s">​
</span></span><span class="vlist-r">
<span class="vlist" style="height:0.15em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal">o</span><span class="mord
mathnormal" style="margin-right:0.03588em;">g</span><span class="mord
mathnormal" style="margin-right:0.13889em;">W</span><span class="mord
mathnormal" style="margin-right:0.08125em;">H</span><span class="mord
mathnormal">EREp</span><span class="mord mathnormal">roces</span><span
class="mord"><span class="mord mathnormal">s</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">n</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">am</span><span
class="mord mathnormal">e</span><span class="mspace" style="margin-
right:0.2778em;"></span><span class="mrel"><span class="mrel">=</span><span
class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mspace"
style="margin-right:0.2778em;"></span></span><span class="base"><span
class="strut" style="height:1.088em;vertical-align:-0.2861em;"></span><span
class="mord mathnormal">d</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord"><span class="mord
mathnormal">a</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3361em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight" style="margin-right:0.10764em;">f</span>
</span></span></span><span class="vlist-s">​
</span></span><span class="vlist-r">
<span class="vlist" style="height:0.2861em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal">ee</span><span class="mord"><span
class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t">
<span class="vlist-r"><span class="vlist" style="height:0.8019em;"><span
style="top:-3.113em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mtight"><span class="mord mtight">′</span></span></span>
</span></span></span></span></span></span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">roces</span><span class="mord"><span class="mord
mathnormal">s</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">s</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">t</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">u</span><span
class="mord mathnormal">s</span><span class="mspace" style="margin-
right:0.2778em;"></span><span class="mrel"><span class="mrel">=</span><span
class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mspace"
style="margin-right:0.2778em;"></span></span><span class="base"><span
class="strut" style="height:1.0519em;vertical-align:-0.25em;"></span><span
class="mord mathnormal" style="margin-right:0.05764em;">S</span><span
class="mord mathnormal" style="margin-right:0.10903em;">U</span><span
class="mord mathnormal" style="margin-right:0.13889em;">CCESSF</span><span
class="mord mathnormal" style="margin-right:0.10903em;">U</span><span
class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mclose">)
</span><span class="mord mathnormal">b</span><span class="mord mathnormal"
style="margin-right:0.10903em;">ON</span><span class="mord mathnormal">a</span>
<span class="mord">.</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">roces</span><span class="mord"><span class="mord
mathnormal">s</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">im</span><span class="mord mathnormal">es</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">am</span><span
class="mord mathnormal">p</span><span class="mspace" style="margin-
right:0.2778em;"></span><span class="mrel">=</span><span class="mspace"
style="margin-right:0.2778em;"></span></span><span class="base"><span
class="strut" style="height:0.8889em;vertical-align:-0.1944em;"></span><span
class="mord mathnormal">b</span><span class="mord">.</span><span class="mord
mathnormal">p</span><span class="mord mathnormal">roces</span><span
class="mord"><span class="mord mathnormal">s</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.2806em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">t</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">im</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">am</span><span class="mord mathnormal">p</span><span
class="mpunct">;</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord">−</span><span class="mspace" style="margin-
right:0.2222em;"></span><span class="mbin">−</span><span class="mspace"
style="margin-right:0.2222em;"></span></span><span class="base"><span
class="strut" style="height:0.8889em;vertical-align:-0.1944em;"></span><span
class="mord mathnormal">G</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">tt</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">e</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">ba</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">c</span><span class="mord mathnormal">hin</span><span
class="mord mathnormal" style="margin-right:0.03588em;">g</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">o</span><span
class="mord mathnormal">n</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">im</span><span class="mord mathnormal">e</span><span
class="mord mathnormal" style="margin-right:0.13889em;">SET</span><span
class="mord">@</span><span class="mord mathnormal">t</span><span class="mord">
<span class="mord mathnormal">o</span><span class="msupsub"><span class="vlist-
t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em;">
<span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">b</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.15em;"><span></span></span></span></span></span>
</span><span class="mord mathnormal">a</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">c</span><span class="mord">
<span class="mord mathnormal">h</span><span class="msupsub"><span class="vlist-
t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;">
<span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">i</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.15em;"><span></span></span></span></span></span>
</span><span class="mord mathnormal">n</span><span class="mord mathnormal"
style="margin-right:0.03588em;">g</span><span class="mord mathnormal">es</span>
<span class="mord mathnormal">t</span><span class="mord mathnormal">i</span>
<span class="mord mathnormal">o</span><span class="mord"><span class="mord
mathnormal">n</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">im</span><span class="mord mathnormal">e</span><span
class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=
</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span
class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;">
</span><span class="mord mathnormal" style="margin-right:0.05764em;">SE</span>
<span class="mord mathnormal">L</span><span class="mord mathnormal"
style="margin-right:0.10903em;">ECTM</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.07847em;">X</span><span class="mopen"><span class="mopen">(</span><span
class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span
class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight">a</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">c</span><span class="mord"><span class="mord mathnormal">p</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight">s</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">ys</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">e</span><span class="mord"><span class="mord mathnormal">m</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight">m</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">e</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">a</span><span class="mord mathnormal">d</span><span class="mord
mathnormal">a</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">a</span><span class="mord">.</span><span class="mord
mathnormal">in</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord mathnormal">es</span><span
class="mord mathnormal" style="margin-right:0.13889em;">tT</span><span
class="mord mathnormal">im</span><span class="mord mathnormal">e</span><span
class="mclose">)</span><span class="mord mathnormal" style="margin-
right:0.10903em;">FROM</span><span class="mord mathnormal">e</span><span
class="mord mathnormal" style="margin-right:0.03588em;">v</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">n</span><span
class="mord mathnormal">t</span><span class="mord"><span class="mord
mathnormal">s</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3361em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">d</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">se</span><span
class="mord mathnormal">t</span><span class="mpunct">;</span><span
class="mspace" style="margin-right:0.1667em;"></span><span
class="mord">−</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:0.9805em;vertical-align:-0.2861em;"></span><span class="mord
mathnormal" style="margin-right:0.05764em;">S</span><span class="mord
mathnormal">ess</span><span class="mord mathnormal">i</span><span class="mord
mathnormal">o</span><span class="mord mathnormal">ni</span><span class="mord
mathnormal">ze</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">h</span><span class="mord mathnormal">e</span><span class="mord
mathnormal">d</span><span class="mord mathnormal">a</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">aan</span><span class="mord
mathnormal">d</span><span class="mord mathnormal">in</span><span class="mord
mathnormal" style="margin-right:0.02778em;">ser</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">in</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">o</span><span class="mord
mathnormal">d</span><span class="mord mathnormal">a</span><span class="mord
mathnormal">t</span><span class="mord"><span class="mord mathnormal">a</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.3361em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight" style="margin-right:0.10764em;">f</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.2861em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal">ee</span><span class="mord
mathnormal">d</span><span class="mord">.</span><span class="mord mathnormal"
style="margin-right:0.07847em;">I</span><span class="mord mathnormal"
style="margin-right:0.13889em;">NSERT</span><span class="mord mathnormal"
style="margin-right:0.07847em;">I</span><span class="mord mathnormal"
style="margin-right:0.02778em;">NTO</span><span class="mord
mathnormal">d</span><span class="mord mathnormal">a</span><span class="mord
mathnormal">t</span><span class="mord"><span class="mord mathnormal">a</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.3361em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight" style="margin-right:0.10764em;">f</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.2861em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal">ee</span><span class="mord
mathnormal">d</span><span class="mord mathnormal" style="margin-
right:0.05764em;">SE</span><span class="mord mathnormal">L</span><span
class="mord mathnormal" style="margin-right:0.13889em;">ECT</span><span
class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">∗</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal"
style="margin-right:0.10903em;">FROM</span><span class="mopen">(</span><span
class="mord mathnormal" style="margin-right:0.05764em;">SE</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ECT</span><span class="mord mathnormal">u</span><span
class="mord mathnormal" style="margin-right:0.02778em;">ser</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">n</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">t</span><span
class="mord mathnormal" style="margin-right:0.03588em;">y</span><span
class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">im</span><span class="mord mathnormal">es</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">am</span><span class="mord
mathnormal">p</span><span class="mpunct">,</span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord mathnormal"
style="margin-right:0.05764em;">SES</span><span class="mord"><span class="mord
mathnormal" style="margin-right:0.05764em;">S</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3283em;"><span style="top:-2.55em;margin-left:-0.0576em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight"
style="margin-right:0.13889em;">T</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.02778em;">MEO</span><span
class="mord mathnormal" style="margin-right:0.10903em;">U</span><span
class="mord mathnormal" style="margin-right:0.13889em;">T</span><span
class="mopen">(</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">im</span><span class="mord mathnormal">es</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">am</span><span class="mord
mathnormal">p</span><span class="mpunct">,</span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord">60</span><span
class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">∗</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:1em;vertical-align:-0.25em;"></span><span class="mord">30</span>
<span class="mclose">)</span><span class="mord mathnormal" style="margin-
right:0.02778em;">O</span><span class="mord mathnormal" style="margin-
right:0.22222em;">V</span><span class="mord mathnormal" style="margin-
right:0.00773em;">ER</span><span class="mopen">(</span><span class="mord
mathnormal" style="margin-right:0.13889em;">P</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.13889em;">RT</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.13889em;">T</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.05017em;">ONB</span><span class="mord mathnormal" style="margin-
right:0.22222em;">Y</span><span class="mord mathnormal">u</span><span
class="mord mathnormal" style="margin-right:0.02778em;">ser</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">n</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">t</span><span
class="mord mathnormal" style="margin-right:0.03588em;">y</span><span
class="mord mathnormal" style="margin-right:0.00773em;">OR</span><span
class="mord mathnormal" style="margin-right:0.02778em;">D</span><span
class="mord mathnormal" style="margin-right:0.05017em;">ERB</span><span
class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">im</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">am</span><span class="mord mathnormal" style="margin-
right:0.02778em;">pRO</span><span class="mord mathnormal" style="margin-
right:0.13889em;">W</span><span class="mord mathnormal" style="margin-
right:0.13889em;">SBET</span><span class="mord mathnormal" style="margin-
right:0.13889em;">W</span><span class="mord mathnormal" style="margin-
right:0.10903em;">EEN</span><span class="mord mathnormal" style="margin-
right:0.10903em;">U</span><span class="mord mathnormal" style="margin-
right:0.02778em;">NBO</span><span class="mord mathnormal" style="margin-
right:0.10903em;">U</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.05764em;">E</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.05764em;">PRECE</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal">NG</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.07153em;">C</span><span class="mord mathnormal" style="margin-
right:0.10903em;">U</span><span class="mord mathnormal" style="margin-
right:0.02778em;">RRENTRO</span><span class="mord mathnormal" style="margin-
right:0.13889em;">W</span><span class="mclose">)</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">sess</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">o</span><span
class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3361em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">d</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">a</span><span
class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord mathnormal">p</span><span class="mord
mathnormal">a</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">n</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">am</span><span class="mord mathnormal">e</span><span
class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord mathnormal">in</span><span class="mord mathnormal"
style="margin-right:0.03588em;">g</span><span class="mord mathnormal">es</span>
<span class="mord"><span class="mord mathnormal">t</span><span class="msupsub">
<span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.2806em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">t</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">im</span><span
class="mord mathnormal">e</span><span class="mord mathnormal" style="margin-
right:0.10903em;">FROM</span><span class="mopen">(</span><span class="mord
mathnormal" style="margin-right:0.05764em;">SE</span><span class="mord
mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ECT</span><span class="mord mathnormal">u</span><span
class="mord mathnormal" style="margin-right:0.02778em;">ser</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">n</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">t</span><span
class="mord mathnormal" style="margin-right:0.03588em;">y</span><span
class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">im</span><span class="mord mathnormal">es</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">am</span><span class="mord
mathnormal">p</span><span class="mpunct">,</span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord mathnormal"
style="margin-right:0.02691em;">w</span><span class="mord mathnormal">e</span>
<span class="mord mathnormal">b</span><span class="mord">.</span><span
class="mord mathnormal" style="margin-right:0.02691em;">w</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">b</span><span
class="mord mathnormal" style="margin-right:0.13889em;">P</span><span
class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord mathnormal">eDe</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">ai</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord mathnormal">s</span><span class="mord">.</span><span class="mord
mathnormal">nam</span><span class="mord mathnormal">e</span><span class="mord
mathnormal">A</span><span class="mord mathnormal">Sp</span><span class="mord
mathnormal">a</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord"><span class="mord
mathnormal">e</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">n</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">am</span><span class="mord mathnormal">e</span><span
class="mpunct"><span class="mpunct">,</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">a</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mspace" style="margin-
right:0.1667em;"></span><span class="mord mathnormal">c</span><span
class="mord"><span class="mord mathnormal">p</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">s</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">ys</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">e</span><span
class="mord"><span class="mord mathnormal">m</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">m</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">e</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">a</span><span
class="mord">.</span><span class="mord mathnormal">in</span><span class="mord
mathnormal" style="margin-right:0.03588em;">g</span><span class="mord
mathnormal">es</span><span class="mord mathnormal" style="margin-
right:0.13889em;">tT</span><span class="mord mathnormal">im</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">A</span><span
class="mord mathnormal" style="margin-right:0.05764em;">S</span><span
class="mord mathnormal">in</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord mathnormal">es</span><span
class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.2806em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">t</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">im</span><span
class="mord mathnormal">e</span><span class="mord mathnormal" style="margin-
right:0.10903em;">FROM</span><span class="mord mathnormal">e</span><span
class="mord mathnormal" style="margin-right:0.03588em;">v</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">n</span><span
class="mord mathnormal">t</span><span class="mord"><span class="mord
mathnormal">s</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3361em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">d</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">se</span><span
class="mord mathnormal">t</span><span class="mord mathnormal" style="margin-
right:0.13889em;">W</span><span class="mord mathnormal" style="margin-
right:0.08125em;">H</span><span class="mord mathnormal">EREt</span><span
class="mord mathnormal">im</span><span class="mord mathnormal">es</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">am</span><span
class="mord mathnormal">p</span><span class="mspace" style="margin-
right:0.2778em;"></span><span class="mrel">&gt;=</span><span class="mspace"
style="margin-right:0.2778em;"></span></span><span class="base"><span
class="strut" style="height:0.7651em;vertical-align:-0.15em;"></span><span
class="mord mathnormal">c</span><span class="mord mathnormal">u</span><span
class="mord mathnormal">rre</span><span class="mord mathnormal">n</span><span
class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3361em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">d</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">e</span><span
class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:1em;vertical-align:-0.25em;"></span><span class="mord">90</span>
<span class="mclose">)</span><span class="mord mathnormal">A</span><span
class="mord mathnormal" style="margin-right:0.05764em;">S</span><span
class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-
right:0.00773em;">OR</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.05017em;">ERB</span><span class="mord mathnormal" style="margin-
right:0.22222em;">Y</span><span class="mord mathnormal">u</span><span
class="mord mathnormal" style="margin-right:0.02778em;">ser</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">n</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">t</span><span
class="mord mathnormal" style="margin-right:0.03588em;">y</span><span
class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">im</span><span class="mord mathnormal">es</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">am</span><span class="mord
mathnormal">p</span><span class="mord mathnormal">A</span><span class="mord
mathnormal" style="margin-right:0.07153em;">SC</span><span class="mclose">)
</span><span class="mord mathnormal">A</span><span class="mord mathnormal"
style="margin-right:0.05764em;">S</span><span class="mord mathnormal"
style="margin-right:0.13889em;">bW</span><span class="mord mathnormal"
style="margin-right:0.08125em;">H</span><span class="mord mathnormal"
style="margin-right:0.05764em;">ERE</span><span class="mord
mathnormal">b</span><span class="mord">.</span><span class="mord
mathnormal">in</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord mathnormal">es</span><span
class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.2806em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">t</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">im</span><span
class="mord mathnormal">e</span><span class="mspace" style="margin-
right:0.2778em;"></span><span class="mrel">&gt;=</span><span class="mspace"
style="margin-right:0.2778em;"></span></span><span class="base"><span
class="strut" style="height:0.8889em;vertical-align:-0.1944em;"></span><span
class="mord">@</span><span class="mord mathnormal" style="margin-
right:0.10764em;">f</span><span class="mord mathnormal">ro</span><span
class="mord"><span class="mord mathnormal">m</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3361em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">b</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">c</span><span
class="mord"><span class="mord mathnormal">h</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">i</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">n</span><span
class="mord mathnormal" style="margin-right:0.03588em;">g</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">o</span><span
class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.2806em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">t</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">im</span><span
class="mord mathnormal">e</span><span class="mpunct">;</span><span
class="mspace" style="margin-right:0.1667em;"></span><span
class="mord">−</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:1.088em;vertical-align:-0.2861em;"></span><span class="mord
mathnormal" style="margin-right:0.10903em;">U</span><span class="mord
mathnormal">p</span><span class="mord mathnormal">d</span><span class="mord
mathnormal">a</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">e</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">h</span><span class="mord mathnormal">ec</span><span class="mord
mathnormal">h</span><span class="mord mathnormal">ec</span><span class="mord
mathnormal" style="margin-right:0.03148em;">k</span><span class="mord
mathnormal">p</span><span class="mord mathnormal">o</span><span class="mord
mathnormal">in</span><span class="mord"><span class="mord mathnormal">t</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.3361em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight" style="margin-right:0.01968em;">l</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.15em;"><span></span></span></span></span></span>
</span><span class="mord mathnormal">o</span><span class="mord mathnormal"
style="margin-right:0.03588em;">g</span><span class="mord mathnormal">t</span>
<span class="mord mathnormal">ab</span><span class="mord mathnormal"
style="margin-right:0.01968em;">l</span><span class="mord mathnormal">e</span>
<span class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.13889em;">NSERT</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.02778em;">NTO</span><span
class="mord mathnormal">c</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">ec</span><span class="mord mathnormal" style="margin-
right:0.03148em;">k</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">in</span><span
class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3361em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight"
style="margin-right:0.01968em;">l</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">o</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord mathnormal" style="margin-
right:0.05764em;">SE</span><span class="mord mathnormal">L</span><span
class="mord mathnormal" style="margin-right:0.07153em;">EC</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.13889em;">T</span><span class="msupsub"><span class="vlist-t"><span
class="vlist-r"><span class="vlist" style="height:0.8019em;"><span
style="top:-3.113em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mtight"><span class="mord mtight">′</span></span></span>
</span></span></span></span></span></span><span class="mord
mathnormal">d</span><span class="mord mathnormal">a</span><span class="mord
mathnormal">t</span><span class="mord"><span class="mord mathnormal">a</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.3361em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight" style="margin-right:0.10764em;">f</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.2861em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal">ee</span><span class="mord"><span
class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t">
<span class="vlist-r"><span class="vlist" style="height:0.8019em;"><span
style="top:-3.113em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mtight"><span class="mord mtight">′</span></span></span>
</span></span></span></span></span></span><span class="mord
mathnormal">p</span><span class="mord mathnormal">roces</span><span
class="mord"><span class="mord mathnormal">s</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">n</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">am</span><span
class="mord mathnormal">e</span><span class="mpunct"><span class="mpunct">,
</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span
class="vlist" style="height:0.8019em;"><span style="top:-3.113em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span
class="mord mtight">′</span></span></span></span></span></span></span></span>
</span><span class="mspace" style="margin-right:0.1667em;"></span><span
class="mord mathnormal" style="margin-right:0.05764em;">S</span><span
class="mord mathnormal" style="margin-right:0.10903em;">U</span><span
class="mord mathnormal" style="margin-right:0.13889em;">CCESSF</span><span
class="mord mathnormal" style="margin-right:0.10903em;">U</span><span
class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mord
mathnormal">p</span><span class="mord mathnormal">roces</span><span
class="mord"><span class="mord mathnormal">s</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">s</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">t</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">u</span><span class="mord mathnormal">s</span><span
class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord mathnormal">c</span><span class="mord
mathnormal">a</span><span class="mord mathnormal">s</span><span class="mord
mathnormal">t</span><span class="mopen">(</span><span class="mord">@</span>
<span class="mord mathnormal">t</span><span class="mord"><span class="mord
mathnormal">o</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3361em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">b</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">c</span><span class="mord"><span class="mord
mathnormal">h</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3117em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">i</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">n</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord mathnormal">es</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">o</span><span class="mord"><span class="mord
mathnormal">n</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">im</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">t</span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">in</span><span
class="mord mathnormal" style="margin-right:0.03588em;">g</span><span
class="mclose">)</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">s</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3361em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">b</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">c</span><span class="mord"><span class="mord
mathnormal">h</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3117em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">i</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">n</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord mathnormal">es</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">o</span><span class="mord"><span class="mord
mathnormal">n</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">im</span><span class="mord mathnormal">e</span><span
class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord mathnormal">c</span><span class="mord
mathnormal">a</span><span class="mord mathnormal">s</span><span class="mord
mathnormal">t</span><span class="mopen">(</span><span class="mord">@</span>
<span class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">s</span><span
class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">u</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">p</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">e</span><span
class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.2806em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">t</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">im</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">am</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.13889em;">ST</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.13889em;">MEST</span><span class="mord mathnormal">A</span><span
class="mord mathnormal" style="margin-right:0.13889em;">MP</span><span
class="mclose">)</span><span class="mord mathnormal">p</span><span class="mord
mathnormal">roces</span><span class="mord"><span class="mord
mathnormal">s</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">im</span><span class="mord mathnormal">es</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">am</span><span
class="mord mathnormal" style="margin-right:0.10903em;">pEN</span><span
class="mord mathnormal" style="margin-right:0.02778em;">D</span></span></span>
</span></span>;

Let us dive into the query in detail:

1. Disable Dropping System Columns

set drop_system_columns=false;

This command disables the automatic removal of system columns. It ensures that system metadata columns such as **_acp_system_metadata.ingestTime** are retained and can be referenced later in the query.
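
As a quick sanity check, once system columns are retained you can project the ingestion metadata directly. This is only a sketch; it assumes the same **events_dataset** used in the walkthrough below:

SELECT userIdentity, timestamp, _acp_system_metadata.ingestTime
FROM events_dataset
LIMIT 5;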

2. Initialize the Last Updated Timestamp

SET @last_updated_timestamp = SELECT CURRENT_TIMESTAMP;

This statement initializes the variable **@last_updated_timestamp** with the current timestamp. This
timestamp is later used to record when the batch process is completed.

3. Get the Last Processed Batch Ingestion Time


SET @from_batch_ingestion_time = SELECT coalesce(last_batch_ingestion_time, 'HEAD')
FROM checkpoint_log a
JOIN (
    SELECT MAX(process_timestamp) AS process_timestamp
    FROM checkpoint_log
    WHERE process_name = 'data_feed'
    AND process_status = 'SUCCESSFUL'
) b
ON a.process_timestamp = b.process_timestamp;

This block determines the time of the last successful batch ingestion by:

Looking at the checkpoint_log table for entries where **process_name** is **data_feed** and **process_status** is **SUCCESSFUL**.

The **MAX(process_timestamp)** subquery fetches the timestamp of the most recent successful run, which is joined back to retrieve that run's batch ingestion time.

The **coalesce** function ensures that if there is no previous ingestion time (first run), the default value **'HEAD'** is used.
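
If the **checkpoint_log** table does not exist yet, it is just a small bookkeeping dataset that you create once. A minimal sketch of one way to define it (the column names follow this walkthrough; the types are an assumption and may differ in your setup):

CREATE TABLE checkpoint_log AS
SELECT
    cast(NULL AS string) process_name,
    cast(NULL AS string) process_status,
    cast(NULL AS string) last_batch_ingestion_time,
    cast(NULL AS timestamp) process_timestamp
WHERE false;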

4. Get the Last Batch Ingestion Time

SET @to_batch_ingestion_time = SELECT MAX(_acp_system_metadata.ingestTime) FROM events_dataset;

This fetches the maximum ingestion time (**_acp_system_metadata.ingestTime**) from **events_dataset** to determine when the most recent batch of data was ingested.

5. Sessionize the Data and Insert it into **data_feed**

INSERT INTO data_feed
SELECT *
FROM (
    SELECT
        userIdentity,
        timestamp,
        SESS_TIMEOUT(timestamp, 60 * 30)
            OVER (PARTITION BY userIdentity
                  ORDER BY timestamp
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_data,
        page_name,
        ingest_time
    FROM (
        SELECT
            userIdentity,
            timestamp,
            web.webPageDetails.name AS page_name,
            _acp_system_metadata.ingestTime AS ingest_time
        FROM events_dataset
        WHERE timestamp >= current_date - 90
    ) AS a
    ORDER BY userIdentity, timestamp ASC
) AS b
WHERE b.ingest_time >= @from_batch_ingestion_time;

This section performs sessionization and data insertion:

1. Inner Query (Alias: a):

1. Extracts relevant fields like **userIdentity**, **timestamp**, **page_name**, and **ingest_time** from **events_dataset**.

2. Filters records to only include those within the past 90 days (**timestamp >= current_date - 90**).

2. Sessionization (**SESS_TIMEOUT**):

1. Uses the **SESS_TIMEOUT** function to create session boundaries with a 30-minute (1800 seconds)
timeout.

2. The **OVER** clause applies sessionization logic by partitioning the data by userIdentity and
ordering by timestamp.

3. Each row is assigned a session identifier based on user activity; a new session starts after a 30-minute window of inactivity (see the example query after this list).

3. Outer Query (Alias: b):

1. Selects and orders data based on **userIdentity** and **timestamp**.

2. Filters the result set to only include data that has been ingested since the last batch ingestion
(**b.ingest_time >= @from_batch_ingestion_time**).

4. Insert:

1. Inserts the sessionized data into the **data_feed** table.
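
Once the sessionized rows land in **data_feed**, a quick way to validate the output is to count page views per user session. This is only a sketch: it assumes the struct returned by **SESS_TIMEOUT** exposes a numeric session counter as **session_data.num**; adjust the field name to match the struct in your dataset.

SELECT
    userIdentity,
    session_data.num AS session_number,
    COUNT(*) AS page_views,
    MIN(timestamp) AS session_start,
    MAX(timestamp) AS session_end
FROM data_feed
GROUP BY userIdentity, session_data.num
ORDER BY userIdentity, session_number;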

6. Update the Checkpoint Log

INSERT INTO checkpoint_log
SELECT
    'data_feed' process_name,
    'SUCCESSFUL' process_status,
    cast(@to_batch_ingestion_time AS string) last_batch_ingestion_time,
    cast(@last_updated_timestamp AS TIMESTAMP) process_timestamp;

This inserts a new entry into the **checkpoint_log** table with:

process_name: **'data_feed'**.

process_status: **'SUCCESSFUL'**, indicating that the batch was successfully processed.

last_batch_ingestion_time: The most recent ingestion time (from **@to_batch_ingestion_time**).

process_timestamp: The timestamp when the process was completed (from **@last_updated_timestamp**).
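
To verify that the run was recorded, you can read back the most recent entry for this process (a simple check using the same table and columns as above):

SELECT process_name, process_status, last_batch_ingestion_time, process_timestamp
FROM checkpoint_log
WHERE process_name = 'data_feed'
ORDER BY process_timestamp DESC
LIMIT 1;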

Attribution functions can be used to assign credit to different touchpoints in a user’s journey based on predefined rules (e.g., first-touch or last-touch attribution). The query below extends the sessionization pattern above with first-touch attribution for the referrer URL and the channel type.
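
Before reading the full job below, here is the attribution call pattern in isolation. It mirrors the usage in the query that follows; the dataset name and field paths are taken from that example and may differ in your schema:

SELECT
    timestamp,
    attribution_first_touch(timestamp, '', channel.typeAtSource)
        OVER (PARTITION BY ENDUSERIDS._EXPERIENCE.MCID.ID
              ORDER BY timestamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).value AS first_channel_type
FROM demo_data_trey_mcintyre_midvalues;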

<span class="katex-display"><span class="katex"><span class="katex-mathml">


<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics>
<mrow><mi>B</mi><mi>E</mi><mi>G</mi><mi>I</mi><mi>N</mi><mi>S</mi><mi>E</mi>
<mi>T</mi><mi>d</mi><mi>r</mi><mi>o</mi><msub><mi>p</mi><mi>s</mi></msub>
<mi>y</mi><mi>s</mi><mi>t</mi><mi>e</mi><msub><mi>m</mi><mi>c</mi></msub>
<mi>o</mi><mi>l</mi><mi>u</mi><mi>m</mi><mi>n</mi><mi>s</mi><mo>=</mo>
<mi>f</mi><mi>a</mi><mi>l</mi><mi>s</mi><mi>e</mi><mo separator="true">;</mo>
<mo>−</mo><mo>−</mo><mi>I</mi><mi>n</mi><mi>i</mi><mi>t</mi><mi>i</mi>
<mi>a</mi><mi>l</mi><mi>i</mi><mi>z</mi><mi>e</mi><mi>v</mi><mi>a</mi>
<mi>r</mi><mi>i</mi><mi>a</mi><mi>b</mi><mi>l</mi><mi>e</mi><mi>s</mi>
<mi>S</mi><mi>E</mi><mi>T</mi><mi mathvariant="normal">@</mi><mi>l</mi>
<mi>a</mi><mi>s</mi><msub><mi>t</mi><mi>u</mi></msub><mi>p</mi><mi>d</mi>
<mi>a</mi><mi>t</mi><mi>e</mi><msub><mi>d</mi><mi>t</mi></msub><mi>i</mi>
<mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi><mo>=
</mo><mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi><mi>T</mi><mi>C</mi>
<mi>U</mi><mi>R</mi><mi>R</mi><mi>E</mi><mi>N</mi><msub><mi>T</mi><mi>T</mi>
</msub><mi>I</mi><mi>M</mi><mi>E</mi><mi>S</mi><mi>T</mi><mi>A</mi><mi>M</mi>
<mi>P</mi><mo separator="true">;</mo><mo>−</mo><mo>−</mo><mi>G</mi><mi>e</mi>
<mi>t</mi><mi>t</mi><mi>h</mi><mi>e</mi><mi>l</mi><mi>a</mi><mi>s</mi>
<mi>t</mi><mi>p</mi><mi>r</mi><mi>o</mi><mi>c</mi><mi>e</mi><mi>s</mi>
<mi>s</mi><mi>e</mi><mi>d</mi><mi>b</mi><mi>a</mi><mi>t</mi><mi>c</mi>
<mi>h</mi><mi>i</mi><mi>n</mi><mi>g</mi><mi>e</mi><mi>s</mi><mi>t</mi>
<mi>i</mi><mi>o</mi><mi>n</mi><mi>t</mi><mi>i</mi><mi>m</mi><mi>e</mi>
<mn>1718755872325</mn><mi>S</mi><mi>E</mi><mi>T</mi><mi
mathvariant="normal">@</mi><mi>f</mi><mi>r</mi><mi>o</mi><msub><mi>m</mi>
<mi>b</mi></msub><mi>a</mi><mi>t</mi><mi>c</mi><msub><mi>h</mi><mi>i</mi>
</msub><mi>n</mi><mi>g</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>i</mi><mi>o</mi>
<msub><mi>n</mi><mi>t</mi></msub><mi>i</mi><mi>m</mi><mi>e</mi><mo>=</mo>
<mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi><mi>T</mi><mi>c</mi>
<mi>o</mi><mi>a</mi><mi>l</mi><mi>e</mi><mi>s</mi><mi>c</mi><mi>e</mi><mo
stretchy="false">(</mo><mi>l</mi><mi>a</mi><mi>s</mi><msub><mi>t</mi><mi>s</mi>
</msub><mi>n</mi><mi>a</mi><mi>p</mi><mi>s</mi><mi>h</mi><mi>o</mi><msub>
<mi>t</mi><mi>i</mi></msub><mi>d</mi><msup><mo separator="true">,</mo><mo
mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mi>H</mi>
<mi>E</mi><mi>A</mi><msup><mi>D</mi><mo mathvariant="normal" lspace="0em"
rspace="0em">′</mo></msup><mo stretchy="false">)</mo><mi>F</mi><mi>R</mi>
<mi>O</mi><mi>M</mi><mi>c</mi><mi>h</mi><mi>e</mi><mi>c</mi><mi>k</mi>
<mi>p</mi><mi>o</mi><mi>i</mi><mi>n</mi><msub><mi>t</mi><mi>l</mi></msub>
<mi>o</mi><mi>g</mi><mi>a</mi><mi>J</mi><mi>O</mi><mi>I</mi><mi>N</mi><mo
stretchy="false">(</mo><mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi>
<mi>T</mi><mi>M</mi><mi>A</mi><mi>X</mi><mo stretchy="false">(</mo><mi>p</mi>
<mi>r</mi><mi>o</mi><mi>c</mi><mi>e</mi><mi>s</mi><msub><mi>s</mi><mi>t</mi>
</msub><mi>i</mi><mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi>
<mi>p</mi><mo stretchy="false">)</mo><mi>A</mi><mi>S</mi><mi>p</mi><mi>r</mi>
<mi>o</mi><mi>c</mi><mi>e</mi><mi>s</mi><msub><mi>s</mi><mi>t</mi></msub>
<mi>i</mi><mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi>
<mi>p</mi><mi>F</mi><mi>R</mi><mi>O</mi><mi>M</mi><mi>c</mi><mi>h</mi>
<mi>e</mi><mi>c</mi><mi>k</mi><mi>p</mi><mi>o</mi><mi>i</mi><mi>n</mi><msub>
<mi>t</mi><mi>l</mi></msub><mi>o</mi><mi>g</mi><mi>W</mi><mi>H</mi><mi>E</mi>
<mi>R</mi><mi>E</mi><mi>p</mi><mi>r</mi><mi>o</mi><mi>c</mi><mi>e</mi>
<mi>s</mi><msub><mi>s</mi><mi>n</mi></msub><mi>a</mi><mi>m</mi><mi>e</mi><msup>
<mo>=</mo><mo mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup>
<mi>d</mi><mi>a</mi><mi>t</mi><msub><mi>a</mi><mi>f</mi></msub><mi>e</mi>
<mi>e</mi><msup><mi>d</mi><mo mathvariant="normal" lspace="0em"
rspace="0em">′</mo></msup><mi>A</mi><mi>N</mi><mi>D</mi><mi>p</mi><mi>r</mi>
<mi>o</mi><mi>c</mi><mi>e</mi><mi>s</mi><msub><mi>s</mi><mi>s</mi></msub>
<mi>t</mi><mi>a</mi><mi>t</mi><mi>u</mi><mi>s</mi><msup><mo>=</mo><mo
mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mi>S</mi>
<mi>U</mi><mi>C</mi><mi>C</mi><mi>E</mi><mi>S</mi><mi>S</mi><mi>F</mi>
<mi>U</mi><msup><mi>L</mi><mo mathvariant="normal" lspace="0em"
rspace="0em">′</mo></msup><mo stretchy="false">)</mo><mi>b</mi><mi>O</mi>
<mi>N</mi><mi>a</mi><mi mathvariant="normal">.</mi><mi>p</mi><mi>r</mi>
<mi>o</mi><mi>c</mi><mi>e</mi><mi>s</mi><msub><mi>s</mi><mi>t</mi></msub>
<mi>i</mi><mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi>
<mi>p</mi><mo>=</mo><mi>b</mi><mi mathvariant="normal">.</mi><mi>p</mi>
<mi>r</mi><mi>o</mi><mi>c</mi><mi>e</mi><mi>s</mi><msub><mi>s</mi><mi>t</mi>
</msub><mi>i</mi><mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi>
<mi>p</mi><mo separator="true">;</mo><mo>−</mo><mo>−</mo><mi>G</mi><mi>e</mi>
<mi>t</mi><mi>t</mi><mi>h</mi><mi>e</mi><mi>l</mi><mi>a</mi><mi>s</mi>
<mi>t</mi><mi>b</mi><mi>a</mi><mi>t</mi><mi>c</mi><mi>h</mi><mi>i</mi>
<mi>n</mi><mi>g</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>i</mi><mi>o</mi>
<mi>n</mi><mi>t</mi><mi>i</mi><mi>m</mi><mi>e</mi><mn>1718758687865</mn>
<mi>S</mi><mi>E</mi><mi>T</mi><mi mathvariant="normal">@</mi><mi>t</mi><msub>
<mi>o</mi><mi>b</mi></msub><mi>a</mi><mi>t</mi><mi>c</mi><msub><mi>h</mi>
<mi>i</mi></msub><mi>n</mi><mi>g</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>i</mi>
<mi>o</mi><msub><mi>n</mi><mi>t</mi></msub><mi>i</mi><mi>m</mi><mi>e</mi><mo>=
</mo><mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi><mi>C</mi><mi>T</mi><mi>M</mi>
<mi>A</mi><mi>X</mi><msub><mo stretchy="false">(</mo><mi>a</mi></msub>
<mi>c</mi><msub><mi>p</mi><mi>s</mi></msub><mi>y</mi><mi>s</mi><mi>t</mi>
<mi>e</mi><msub><mi>m</mi><mi>m</mi></msub><mi>e</mi><mi>t</mi><mi>a</mi>
<mi>d</mi><mi>a</mi><mi>t</mi><mi>a</mi><mi mathvariant="normal">.</mi>
<mi>i</mi><mi>n</mi><mi>g</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>T</mi>
<mi>i</mi><mi>m</mi><mi>e</mi><mo stretchy="false">)</mo><mi>F</mi><mi>R</mi>
<mi>O</mi><mi>M</mi><mi>d</mi><mi>e</mi><mi>m</mi><msub><mi>o</mi><mi>d</mi>
</msub><mi>a</mi><mi>t</mi><msub><mi>a</mi><mi>t</mi></msub><mi>r</mi>
<mi>e</mi><msub><mi>y</mi><mi>m</mi></msub><mi>c</mi><mi>i</mi><mi>n</mi>
<mi>t</mi><mi>y</mi><mi>r</mi><msub><mi>e</mi><mi>m</mi></msub><mi>i</mi>
<mi>d</mi><mi>v</mi><mi>a</mi><mi>l</mi><mi>u</mi><mi>e</mi><mi>s</mi><mo
separator="true">;</mo><mo>−</mo><mo>−</mo><mi>S</mi><mi>e</mi><mi>s</mi>
<mi>s</mi><mi>i</mi><mi>o</mi><mi>n</mi><mi>i</mi><mi>z</mi><mi>e</mi>
<mi>t</mi><mi>h</mi><mi>e</mi><mi>d</mi><mi>a</mi><mi>t</mi><mi>a</mi>
<mi>a</mi><mi>n</mi><mi>d</mi><mi>i</mi><mi>n</mi><mi>s</mi><mi>e</mi>
<mi>r</mi><mi>t</mi><mi>i</mi><mi>n</mi><mi>t</mi><mi>o</mi><mi>n</mi>
<mi>e</mi><msub><mi>w</mi><mi>s</mi></msub><mi>e</mi><mi>s</mi><mi>s</mi>
<mi>i</mi><mi>o</mi><mi>n</mi><mi>i</mi><mi>z</mi><mi>e</mi><msub><mi>d</mi>
<mi>d</mi></msub><mi>a</mi><mi>t</mi><mi>a</mi><mi>I</mi><mi>N</mi><mi>S</mi>
<mi>E</mi><mi>R</mi><mi>T</mi><mi>I</mi><mi>N</mi><mi>T</mi><mi>O</mi>
<mi>n</mi><mi>e</mi><msub><mi>w</mi><mi>s</mi></msub><mi>e</mi><mi>s</mi>
<mi>s</mi><mi>i</mi><mi>o</mi><mi>n</mi><mi>i</mi><mi>z</mi><mi>e</mi><msub>
<mi>d</mi><mi>d</mi></msub><mi>a</mi><mi>t</mi><mi>a</mi><mi>S</mi><mi>E</mi>
<mi>L</mi><mi>E</mi><mi>C</mi><mi>T</mi><mo>∗</mo><mi>F</mi><mi>R</mi>
<mi>O</mi><mi>M</mi><mo stretchy="false">(</mo><mi>S</mi><mi>E</mi><mi>L</mi>
<mi>E</mi><mi>C</mi><msub><mi>T</mi><mi>i</mi></msub><mi>d</mi><mo
separator="true">,</mo><mi>t</mi><mi>i</mi><mi>m</mi><mi>e</mi><mi>s</mi>
<mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi><mo separator="true">,</mo><mi>s</mi>
<mi>t</mi><mi>r</mi><mi>u</mi><mi>c</mi><mi>t</mi><mo stretchy="false">(</mo>
<mi>U</mi><mi>s</mi><mi>e</mi><msub><mi>r</mi><mi>I</mi></msub><mi>d</mi>
<mi>e</mi><mi>n</mi><mi>t</mi><mi>i</mi><mi>t</mi><mi>y</mi><mo
separator="true">,</mo><mi>c</mi><mi>a</mi><mi>s</mi><mi>t</mi><mo
stretchy="false">(</mo><mi>S</mi><mi>E</mi><mi>S</mi><msub><mi>S</mi><mi>T</mi>
</msub><mi>I</mi><mi>M</mi><mi>E</mi><mi>O</mi><mi>U</mi><mi>T</mi><mo
stretchy="false">(</mo><mi>t</mi><mi>i</mi><mi>m</mi><mi>e</mi><mi>s</mi>
<mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi><mo separator="true">,</mo><mn>60</mn>
<mo>∗</mo><mn>30</mn><mo stretchy="false">)</mo><mi>O</mi><mi>V</mi><mi>E</mi>
<mi>R</mi><mo stretchy="false">(</mo><mi>P</mi><mi>A</mi><mi>R</mi><mi>T</mi>
<mi>I</mi><mi>T</mi><mi>I</mi><mi>O</mi><mi>N</mi><mi>B</mi><mi>Y</mi>
<mi>U</mi><mi>s</mi><mi>e</mi><msub><mi>r</mi><mi>I</mi></msub><mi>d</mi>
<mi>e</mi><mi>n</mi><mi>t</mi><mi>i</mi><mi>t</mi><mi>y</mi><mi>O</mi>
<mi>R</mi><mi>D</mi><mi>E</mi><mi>R</mi><mi>B</mi><mi>Y</mi><mi>t</mi>
<mi>i</mi><mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi>
<mi>p</mi><mi>R</mi><mi>O</mi><mi>W</mi><mi>S</mi><mi>B</mi><mi>E</mi>
<mi>T</mi><mi>W</mi><mi>E</mi><mi>E</mi><mi>N</mi><mi>U</mi><mi>N</mi>
<mi>B</mi><mi>O</mi><mi>U</mi><mi>N</mi><mi>D</mi><mi>E</mi><mi>D</mi>
<mi>P</mi><mi>R</mi><mi>E</mi><mi>C</mi><mi>E</mi><mi>D</mi><mi>I</mi>
<mi>N</mi><mi>G</mi><mi>A</mi><mi>N</mi><mi>D</mi><mi>C</mi><mi>U</mi>
<mi>R</mi><mi>R</mi><mi>E</mi><mi>N</mi><mi>T</mi><mi>R</mi><mi>O</mi>
<mi>W</mi><mo stretchy="false">)</mo><mi>a</mi><mi>s</mi><mi>s</mi><mi>t</mi>
<mi>r</mi><mi>i</mi><mi>n</mi><mi>g</mi><mo stretchy="false">)</mo><mi>A</mi>
<mi>S</mi><mi>S</mi><mi>e</mi><mi>s</mi><mi>s</mi><mi>i</mi><mi>o</mi>
<mi>n</mi><mi>D</mi><mi>a</mi><mi>t</mi><mi>a</mi><mo separator="true">,</mo>
<mi>t</mi><msub><mi>o</mi><mi>t</mi></msub><mi>i</mi><mi>m</mi><mi>e</mi>
<mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi><mo stretchy="false">(</mo>
<mi>f</mi><mi>r</mi><mi>o</mi><msub><mi>m</mi><mi>u</mi></msub><mi>n</mi>
<mi>i</mi><mi>x</mi><mi>t</mi><mi>i</mi><mi>m</mi><mi>e</mi><mo
stretchy="false">(</mo><mi>i</mi><mi>n</mi><mi>g</mi><mi>e</mi><mi>s</mi><msub>
<mi>t</mi><mi>t</mi></msub><mi>i</mi><mi>m</mi><mi>e</mi><mi
mathvariant="normal">/</mi><mn>1000</mn><msup><mo separator="true">,</mo><mo
mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mi>y</mi>
<mi>y</mi><mi>y</mi><mi>y</mi><mo>−</mo><mi>M</mi><mi>M</mi><mo>−</mo>
<mi>d</mi><mi>d</mi><mi>H</mi><mi>H</mi><mo>:</mo><mi>m</mi><mi>m</mi><mo>:
</mo><mi>s</mi><msup><mi>s</mi><mo mathvariant="normal" lspace="0em"
rspace="0em">′</mo></msup><mo stretchy="false">)</mo><mo stretchy="false">)
</mo><mi>A</mi><mi>S</mi><mi>I</mi><mi>n</mi><mi>g</mi><mi>e</mi><mi>s</mi>
<mi>t</mi><mi>T</mi><mi>i</mi><mi>m</mi><mi>e</mi><mo separator="true">,</mo>
<mi>P</mi><mi>a</mi><mi>g</mi><mi>e</mi><mi>N</mi><mi>a</mi><mi>m</mi>
<mi>e</mi><mo separator="true">,</mo><mi>f</mi><mi>i</mi><mi>r</mi><mi>s</mi>
<msub><mi>t</mi><mi>u</mi></msub><mi>r</mi><mi>l</mi><mo separator="true">,
</mo><mi>f</mi><mi>i</mi><mi>r</mi><mi>s</mi><msub><mi>t</mi><mi>c</mi></msub>
<mi>h</mi><mi>a</mi><mi>n</mi><mi>n</mi><mi>e</mi><msub><mi>l</mi><mi>t</mi>
</msub><mi>y</mi><mi>p</mi><mi>e</mi><mo stretchy="false">)</mo><mi>a</mi>
<msub><mi>s</mi><mi>d</mi></msub><mi>e</mi><mi>m</mi><mi>o</mi><mi>s</mi>
<mi>y</mi><mi>s</mi><mi>t</mi><mi>e</mi><mi>m</mi><mn>5</mn><mi>F</mi>
<mi>R</mi><mi>O</mi><mi>M</mi><mo stretchy="false">(</mo><mi>S</mi><mi>E</mi>
<mi>L</mi><mi>E</mi><mi>C</mi><msub><mi>T</mi><mi>i</mi></msub><mi>d</mi><mo
separator="true">,</mo><mi>E</mi><mi>N</mi><mi>D</mi><mi>U</mi><mi>S</mi>
<mi>E</mi><mi>R</mi><mi>I</mi><mi>D</mi><mi>S</mi><msub><mi
mathvariant="normal">.</mi><mi>E</mi></msub><mi>X</mi><mi>P</mi><mi>E</mi>
<mi>R</mi><mi>I</mi><mi>E</mi><mi>N</mi><mi>C</mi><mi>E</mi><mi
mathvariant="normal">.</mi><mi>M</mi><mi>C</mi><mi>I</mi><mi>D</mi><mi
mathvariant="normal">.</mi><mi>I</mi><mi>D</mi><mi>a</mi><mi>s</mi><mi>U</mi>
<mi>s</mi><mi>e</mi><msub><mi>r</mi><mi>I</mi></msub><mi>d</mi><mi>e</mi>
<mi>n</mi><mi>t</mi><mi>i</mi><mi>t</mi><mi>y</mi><mo separator="true">,</mo>
<mi>t</mi><mi>i</mi><mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi>
<mi>m</mi><mi>p</mi><mo separator="true">,</mo><mi>w</mi><mi>e</mi><mi>b</mi>
<mi mathvariant="normal">.</mi><mi>w</mi><mi>e</mi><mi>b</mi><mi>P</mi>
<mi>a</mi><mi>g</mi><mi>e</mi><mi>D</mi><mi>e</mi><mi>t</mi><mi>a</mi>
<mi>i</mi><mi>l</mi><mi>s</mi><mi mathvariant="normal">.</mi><mi>n</mi>
<mi>a</mi><mi>m</mi><mi>e</mi><mi>A</mi><mi>S</mi><mi>P</mi><mi>a</mi>
<mi>g</mi><mi>e</mi><mi>N</mi><mi>a</mi><mi>m</mi><mi>e</mi><mo
separator="true">,</mo><mi>a</mi><mi>t</mi><mi>t</mi><mi>r</mi><mi>i</mi>
<mi>b</mi><mi>u</mi><mi>t</mi><mi>i</mi><mi>o</mi><msub><mi>n</mi><mi>f</mi>
</msub><mi>i</mi><mi>r</mi><mi>s</mi><msub><mi>t</mi><mi>t</mi></msub>
<mi>o</mi><mi>u</mi><mi>c</mi><mi>h</mi><mo stretchy="false">(</mo><mi>t</mi>
<mi>i</mi><mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi>
<mi>p</mi><msup><mo separator="true">,</mo><mrow><mo
mathvariant="normal">′</mo><mo mathvariant="normal">′</mo></mrow></msup><mo
separator="true">,</mo><mi>w</mi><mi>e</mi><mi>b</mi><mi mathvariant="normal">.
</mi><mi>w</mi><mi>e</mi><mi>b</mi><mi>R</mi><mi>e</mi><mi>f</mi><mi>e</mi>
<mi>r</mi><mi>r</mi><mi>e</mi><mi>r</mi><mi mathvariant="normal">.</mi>
<mi>u</mi><mi>r</mi><mi>l</mi><mo stretchy="false">)</mo><mi>O</mi><mi>V</mi>
<mi>E</mi><mi>R</mi><mo stretchy="false">(</mo><mi>P</mi><mi>A</mi><mi>R</mi>
<mi>T</mi><mi>I</mi><mi>T</mi><mi>I</mi><mi>O</mi><mi>N</mi><mi>B</mi>
<mi>Y</mi><mi>E</mi><mi>N</mi><mi>D</mi><mi>U</mi><mi>S</mi><mi>E</mi>
<mi>R</mi><mi>I</mi><mi>D</mi><mi>S</mi><msub><mi mathvariant="normal">.</mi>
<mi>E</mi></msub><mi>X</mi><mi>P</mi><mi>E</mi><mi>R</mi><mi>I</mi><mi>E</mi>
<mi>N</mi><mi>C</mi><mi>E</mi><mi mathvariant="normal">.</mi><mi>M</mi>
<mi>C</mi><mi>I</mi><mi>D</mi><mi mathvariant="normal">.</mi><mi>I</mi>
<mi>D</mi><mi>O</mi><mi>R</mi><mi>D</mi><mi>E</mi><mi>R</mi><mi>B</mi>
<mi>Y</mi><mi>t</mi><mi>i</mi><mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi>
<mi>a</mi><mi>m</mi><mi>p</mi><mi>A</mi><mi>S</mi><mi>C</mi><mi>R</mi>
<mi>O</mi><mi>W</mi><mi>S</mi><mi>B</mi><mi>E</mi><mi>T</mi><mi>W</mi>
<mi>E</mi><mi>E</mi><mi>N</mi><mi>U</mi><mi>N</mi><mi>B</mi><mi>O</mi>
<mi>U</mi><mi>N</mi><mi>D</mi><mi>E</mi><mi>D</mi><mi>P</mi><mi>R</mi>
<mi>E</mi><mi>C</mi><mi>E</mi><mi>D</mi><mi>I</mi><mi>N</mi><mi>G</mi>
<mi>A</mi><mi>N</mi><mi>D</mi><mi>U</mi><mi>N</mi><mi>B</mi><mi>O</mi>
<mi>U</mi><mi>N</mi><mi>D</mi><mi>E</mi><mi>D</mi><mi>F</mi><mi>O</mi>
<mi>L</mi><mi>L</mi><mi>O</mi><mi>W</mi><mi>I</mi><mi>N</mi><mi>G</mi><mo
stretchy="false">)</mo><mi mathvariant="normal">.</mi><mi>v</mi><mi>a</mi>
<mi>l</mi><mi>u</mi><mi>e</mi><mi>A</mi><mi>S</mi><mi>f</mi><mi>i</mi>
<mi>r</mi><mi>s</mi><msub><mi>t</mi><mi>u</mi></msub><mi>r</mi><mi>l</mi><mo
separator="true">,</mo><mi>a</mi><mi>t</mi><mi>t</mi><mi>r</mi><mi>i</mi>
<mi>b</mi><mi>u</mi><mi>t</mi><mi>i</mi><mi>o</mi><msub><mi>n</mi><mi>f</mi>
</msub><mi>i</mi><mi>r</mi><mi>s</mi><msub><mi>t</mi><mi>t</mi></msub>
<mi>o</mi><mi>u</mi><mi>c</mi><mi>h</mi><mo stretchy="false">(</mo><mi>t</mi>
<mi>i</mi><mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi>
<mi>p</mi><msup><mo separator="true">,</mo><mrow><mo
mathvariant="normal">′</mo><mo mathvariant="normal">′</mo></mrow></msup><mo
separator="true">,</mo><mi>c</mi><mi>h</mi><mi>a</mi><mi>n</mi><mi>n</mi>
<mi>e</mi><mi>l</mi><mi mathvariant="normal">.</mi><mi>t</mi><mi>y</mi>
<mi>p</mi><mi>e</mi><mi>A</mi><mi>t</mi><mi>S</mi><mi>o</mi><mi>u</mi>
<mi>r</mi><mi>c</mi><mi>e</mi><mo stretchy="false">)</mo><mi>O</mi><mi>V</mi>
<mi>E</mi><mi>R</mi><mo stretchy="false">(</mo><mi>P</mi><mi>A</mi><mi>R</mi>
<mi>T</mi><mi>I</mi><mi>T</mi><mi>I</mi><mi>O</mi><mi>N</mi><mi>B</mi>
<mi>Y</mi><mi>E</mi><mi>N</mi><mi>D</mi><mi>U</mi><mi>S</mi><mi>E</mi>
<mi>R</mi><mi>I</mi><mi>D</mi><mi>S</mi><msub><mi mathvariant="normal">.</mi>
<mi>E</mi></msub><mi>X</mi><mi>P</mi><mi>E</mi><mi>R</mi><mi>I</mi><mi>E</mi>
<mi>N</mi><mi>C</mi><mi>E</mi><mi mathvariant="normal">.</mi><mi>M</mi>
<mi>C</mi><mi>I</mi><mi>D</mi><mi mathvariant="normal">.</mi><mi>I</mi>
<mi>D</mi><mi>O</mi><mi>R</mi><mi>D</mi><mi>E</mi><mi>R</mi><mi>B</mi>
<mi>Y</mi><mi>t</mi><mi>i</mi><mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi>
<mi>a</mi><mi>m</mi><mi>p</mi><mi>A</mi><mi>S</mi><mi>C</mi><mi>R</mi>
<mi>O</mi><mi>W</mi><mi>S</mi><mi>B</mi><mi>E</mi><mi>T</mi><mi>W</mi>
<mi>E</mi><mi>E</mi><mi>N</mi><mi>U</mi><mi>N</mi><mi>B</mi><mi>O</mi>
<mi>U</mi><mi>N</mi><mi>D</mi><mi>E</mi><mi>D</mi><mi>P</mi><mi>R</mi>
<mi>E</mi><mi>C</mi><mi>E</mi><mi>D</mi><mi>I</mi><mi>N</mi><mi>G</mi>
<mi>A</mi><mi>N</mi><mi>D</mi><mi>U</mi><mi>N</mi><mi>B</mi><mi>O</mi>
<mi>U</mi><mi>N</mi><mi>D</mi><mi>E</mi><mi>D</mi><mi>F</mi><mi>O</mi>
<mi>L</mi><mi>L</mi><mi>O</mi><mi>W</mi><mi>I</mi><mi>N</mi><mi>G</mi><mo
stretchy="false">)</mo><mi mathvariant="normal">.</mi><mi>v</mi><mi>a</mi>
<mi>l</mi><mi>u</mi><mi>e</mi><mi>A</mi><mi>S</mi><mi>f</mi><mi>i</mi>
<mi>r</mi><mi>s</mi><msub><mi>t</mi><mi>c</mi></msub><mi>h</mi><mi>a</mi>
<mi>n</mi><mi>n</mi><mi>e</mi><msub><mi>l</mi><mi>t</mi></msub><mi>y</mi>
<mi>p</mi><mi>e</mi><msub><mo separator="true">,</mo><mi>a</mi></msub>
<mi>c</mi><msub><mi>p</mi><mi>s</mi></msub><mi>y</mi><mi>s</mi><mi>t</mi>
<mi>e</mi><msub><mi>m</mi><mi>m</mi></msub><mi>e</mi><mi>t</mi><mi>a</mi>
<mi>d</mi><mi>a</mi><mi>t</mi><mi>a</mi><mi mathvariant="normal">.</mi>
<mi>i</mi><mi>n</mi><mi>g</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>T</mi>
<mi>i</mi><mi>m</mi><mi>e</mi><mi>A</mi><mi>S</mi><mi>i</mi><mi>n</mi>
<mi>g</mi><mi>e</mi><mi>s</mi><msub><mi>t</mi><mi>t</mi></msub><mi>i</mi>
<mi>m</mi><mi>e</mi><mi>F</mi><mi>R</mi><mi>O</mi><mi>M</mi><mi>d</mi>
<mi>e</mi><mi>m</mi><msub><mi>o</mi><mi>d</mi></msub><mi>a</mi><mi>t</mi><msub>
<mi>a</mi><mi>t</mi></msub><mi>r</mi><mi>e</mi><msub><mi>y</mi><mi>m</mi>
</msub><mi>c</mi><mi>i</mi><mi>n</mi><mi>t</mi><mi>y</mi><mi>r</mi><msub>
<mi>e</mi><mi>m</mi></msub><mi>i</mi><mi>d</mi><mi>v</mi><mi>a</mi><mi>l</mi>
<mi>u</mi><mi>e</mi><mi>s</mi><mi>W</mi><mi>H</mi><mi>E</mi><mi>R</mi>
<mi>E</mi><mi>t</mi><mi>i</mi><mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi>
<mi>a</mi><mi>m</mi><mi>p</mi><mo>&gt;</mo><mo>=</mo><mi>c</mi><mi>u</mi>
<mi>r</mi><mi>r</mi><mi>e</mi><mi>n</mi><msub><mi>t</mi><mi>d</mi></msub>
<mi>a</mi><mi>t</mi><mi>e</mi><mo>−</mo><mn>90</mn><mo stretchy="false">)</mo>
<mi>O</mi><mi>R</mi><mi>D</mi><mi>E</mi><mi>R</mi><mi>B</mi><mi>Y</mi>
<mi>U</mi><mi>s</mi><mi>e</mi><msub><mi>r</mi><mi>I</mi></msub><mi>d</mi>
<mi>e</mi><mi>n</mi><mi>t</mi><mi>i</mi><mi>t</mi><mi>y</mi><mo
separator="true">,</mo><mi>t</mi><mi>i</mi><mi>m</mi><mi>e</mi><mi>s</mi>
<mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi><mi>A</mi><mi>S</mi><mi>C</mi><mo
stretchy="false">)</mo><mi>W</mi><mi>H</mi><mi>E</mi><mi>R</mi><msub><mi>E</mi>
<mi>d</mi></msub><mi>e</mi><mi>m</mi><mi>o</mi><mi>s</mi><mi>y</mi><mi>s</mi>
<mi>t</mi><mi>e</mi><mi>m</mi><mn>5.</mn><mi>I</mi><mi>n</mi><mi>g</mi>
<mi>e</mi><mi>s</mi><mi>t</mi><mi>T</mi><mi>i</mi><mi>m</mi><mi>e</mi><mo>&gt;
</mo><mo>=</mo><mi>t</mi><msub><mi>o</mi><mi>t</mi></msub><mi>i</mi><mi>m</mi>
<mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi><mo
stretchy="false">(</mo><mi>f</mi><mi>r</mi><mi>o</mi><msub><mi>m</mi><mi>u</mi>
</msub><mi>n</mi><mi>i</mi><mi>x</mi><mi>t</mi><mi>i</mi><mi>m</mi><mi>e</mi>
<mo stretchy="false">(</mo><mi mathvariant="normal">@</mi><mi>f</mi><mi>r</mi>
<mi>o</mi><msub><mi>m</mi><mi>b</mi></msub><mi>a</mi><mi>t</mi><mi>c</mi><msub>
<mi>h</mi><mi>i</mi></msub><mi>n</mi><mi>g</mi><mi>e</mi><mi>s</mi><mi>t</mi>
<mi>i</mi><mi>o</mi><msub><mi>n</mi><mi>t</mi></msub><mi>i</mi><mi>m</mi>
<mi>e</mi><mi mathvariant="normal">/</mi><mn>1000</mn><msup><mo
separator="true">,</mo><mo mathvariant="normal" lspace="0em"
rspace="0em">′</mo></msup><mi>y</mi><mi>y</mi><mi>y</mi><mi>y</mi><mo>−</mo>
<mi>M</mi><mi>M</mi><mo>−</mo><mi>d</mi><mi>d</mi><mi>H</mi><mi>H</mi><mo>:
</mo><mi>m</mi><mi>m</mi><mo>:</mo><mi>s</mi><msup><mi>s</mi><mo
mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mo
stretchy="false">)</mo><mo stretchy="false">)</mo><mo separator="true">;</mo>
<mo>−</mo><mo>−</mo><mi>U</mi><mi>p</mi><mi>d</mi><mi>a</mi><mi>t</mi>
<mi>e</mi><mi>t</mi><mi>h</mi><mi>e</mi><mi>c</mi><mi>h</mi><mi>e</mi>
<mi>c</mi><mi>k</mi><mi>p</mi><mi>o</mi><mi>i</mi><mi>n</mi><msub><mi>t</mi>
<mi>l</mi></msub><mi>o</mi><mi>g</mi><mi>t</mi><mi>a</mi><mi>b</mi><mi>l</mi>
<mi>e</mi><mi>I</mi><mi>N</mi><mi>S</mi><mi>E</mi><mi>R</mi><mi>T</mi>
<mi>I</mi><mi>N</mi><mi>T</mi><mi>O</mi><mi>c</mi><mi>h</mi><mi>e</mi>
<mi>c</mi><mi>k</mi><mi>p</mi><mi>o</mi><mi>i</mi><mi>n</mi><msub><mi>t</mi>
<mi>l</mi></msub><mi>o</mi><mi>g</mi><mi>S</mi><mi>E</mi><mi>L</mi><mi>E</mi>
<mi>C</mi><msup><mi>T</mi><mo mathvariant="normal" lspace="0em"
rspace="0em">′</mo></msup><mi>d</mi><mi>a</mi><mi>t</mi><msub><mi>a</mi>
<mi>f</mi></msub><mi>e</mi><mi>e</mi><msup><mi>d</mi><mo mathvariant="normal"
lspace="0em" rspace="0em">′</mo></msup><mi>a</mi><mi>s</mi><mi>p</mi><mi>r</mi>
<mi>o</mi><mi>c</mi><mi>e</mi><mi>s</mi><msub><mi>s</mi><mi>n</mi></msub>
<mi>a</mi><mi>m</mi><mi>e</mi><msup><mo separator="true">,</mo><mo
mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mi>S</mi>
<mi>U</mi><mi>C</mi><mi>C</mi><mi>E</mi><mi>S</mi><mi>S</mi><mi>F</mi>
<mi>U</mi><msup><mi>L</mi><mo mathvariant="normal" lspace="0em"
rspace="0em">′</mo></msup><mi>a</mi><mi>s</mi><mi>p</mi><mi>r</mi><mi>o</mi>
<mi>c</mi><mi>e</mi><mi>s</mi><msub><mi>s</mi><mi>s</mi></msub><mi>t</mi>
<mi>a</mi><mi>t</mi><mi>u</mi><mi>s</mi><mo separator="true">,</mo><mi>c</mi>
<mi>a</mi><mi>s</mi><mi>t</mi><mo stretchy="false">(</mo><mi
mathvariant="normal">@</mi><mi>t</mi><msub><mi>o</mi><mi>b</mi></msub>
<mi>a</mi><mi>t</mi><mi>c</mi><msub><mi>h</mi><mi>i</mi></msub><mi>n</mi>
<mi>g</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>i</mi><mi>o</mi><msub><mi>n</mi>
<mi>t</mi></msub><mi>i</mi><mi>m</mi><mi>e</mi><mi>A</mi><mi>S</mi><mi>s</mi>
<mi>t</mi><mi>r</mi><mi>i</mi><mi>n</mi><mi>g</mi><mo stretchy="false">)</mo>
<mi>a</mi><mi>s</mi><mi>l</mi><mi>a</mi><mi>s</mi><msub><mi>t</mi><mi>s</mi>
</msub><mi>n</mi><mi>a</mi><mi>p</mi><mi>s</mi><mi>h</mi><mi>o</mi><msub>
<mi>t</mi><mi>i</mi></msub><mi>d</mi><mo separator="true">,</mo><mi>c</mi>
<mi>a</mi><mi>s</mi><mi>t</mi><mo stretchy="false">(</mo><mi
mathvariant="normal">@</mi><mi>l</mi><mi>a</mi><mi>s</mi><msub><mi>t</mi>
<mi>u</mi></msub><mi>p</mi><mi>d</mi><mi>a</mi><mi>t</mi><mi>e</mi><msub>
<mi>d</mi><mi>t</mi></msub><mi>i</mi><mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi>
<mi>a</mi><mi>m</mi><mi>p</mi><mi>A</mi><mi>S</mi><mi>t</mi><mi>i</mi>
<mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi><mo
stretchy="false">)</mo><mi>a</mi><mi>s</mi><mi>p</mi><mi>r</mi><mi>o</mi>
<mi>c</mi><mi>e</mi><mi>s</mi><msub><mi>s</mi><mi>t</mi></msub><mi>i</mi>
<mi>m</mi><mi>e</mi><mi>s</mi><mi>t</mi><mi>a</mi><mi>m</mi><mi>p</mi><mo
separator="true">;</mo><mi>E</mi><mi>N</mi><mi>D</mi></mrow><annotation
encoding="application/x-tex"> BEGIN
SET drop_system_columns=false;

-- Initialize variables
SET @last_updated_timestamp = SELECT CURRENT_TIMESTAMP;

-- Get the last processed batch ingestion time 1718755872325


SET @from_batch_ingestion_time =
SELECT coalesce(last_snapshot_id, &#x27;HEAD&#x27;)
FROM checkpoint_log a
JOIN (
SELECT MAX(process_timestamp) AS process_timestamp
FROM checkpoint_log
WHERE process_name = &#x27;data_feed&#x27;
AND process_status = &#x27;SUCCESSFUL&#x27;
) b
ON a.process_timestamp = b.process_timestamp;

-- Get the last batch ingestion time 1718758687865


SET @to_batch_ingestion_time =
SELECT MAX(_acp_system_metadata.ingestTime)
FROM demo_data_trey_mcintyre_midvalues;

-- Sessionize the data and insert into new_sessionized_data


INSERT INTO new_sessionized_data
SELECT *
FROM (
SELECT
_id,
timestamp,
struct(User_Identity,
cast(SESS_TIMEOUT(timestamp, 60 * 30) OVER (
PARTITION BY User_Identity
ORDER BY timestamp
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as string) AS SessionData,
to_timestamp(from_unixtime(ingest_time/1000, &#x27;yyyy-MM-dd
HH:mm:ss&#x27;)) AS IngestTime,
PageName,
first_url,
first_channel_type
) as _demosystem5
FROM (
SELECT
_id,
ENDUSERIDS._EXPERIENCE.MCID.ID as User_Identity,
timestamp,
web.webPageDetails.name AS PageName,
attribution_first_touch(timestamp, &#x27;&#x27;,
web.webReferrer.url) OVER (PARTITION BY ENDUSERIDS._EXPERIENCE.MCID.ID ORDER BY
timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).value
AS first_url,
attribution_first_touch(timestamp,
&#x27;&#x27;,channel.typeAtSource) OVER (PARTITION BY
ENDUSERIDS._EXPERIENCE.MCID.ID ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED
PRECEDING AND UNBOUNDED FOLLOWING).value AS first_channel_type,
_acp_system_metadata.ingestTime AS ingest_time
FROM demo_data_trey_mcintyre_midvalues
WHERE timestamp &gt;= current_date - 90
)
ORDER BY User_Identity, timestamp ASC
)
WHERE _demosystem5.IngestTime &gt;=
to_timestamp(from_unixtime(@from_batch_ingestion_time/1000, &#x27;yyyy-MM-dd
HH:mm:ss&#x27;));

-- Update the checkpoint_log table


INSERT INTO checkpoint_log
SELECT
&#x27;data_feed&#x27; as process_name,
&#x27;SUCCESSFUL&#x27; as process_status,
cast(@to_batch_ingestion_time AS string) as last_snapshot_id,
cast(@last_updated_timestamp AS timestamp) as process_timestamp;

END
</annotation></semantics></math></span><span class="katex-html" aria-
hidden="true"><span class="base"><span class="strut"
style="height:0.8889em;vertical-align:-0.1944em;"></span><span class="mord
mathnormal">BEG</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.13889em;">NSET</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">ro</span><span class="mord"><span class="mord
mathnormal">p</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">s</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">ys</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">e</span><span class="mord"><span class="mord
mathnormal">m</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">c</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">o</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">u</span><span
class="mord mathnormal">mn</span><span class="mord mathnormal">s</span><span
class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=
</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span
class="base"><span class="strut" style="height:0.8889em;vertical-
align:-0.1944em;"></span><span class="mord mathnormal" style="margin-
right:0.10764em;">f</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord mathnormal">se</span><span class="mpunct">;</span><span
class="mspace" style="margin-right:0.1667em;"></span><span
class="mord">−</span><span class="mspace" style="margin-right:0.2222em;">
</span><span class="mbin">−</span><span class="mspace" style="margin-
right:0.2222em;"></span></span><span class="base"><span class="strut"
style="height:0.8889em;vertical-align:-0.1944em;"></span><span class="mord
mathnormal" style="margin-right:0.07847em;">I</span><span class="mord
mathnormal">ni</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">ia</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">ze</span><span class="mord mathnormal" style="margin-
right:0.03588em;">v</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.02778em;">r</span><span
class="mord mathnormal">iab</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">es</span><span
class="mord mathnormal" style="margin-right:0.13889em;">SET</span><span
class="mord">@</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">s</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">u</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">p</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">e</span><span class="mord"><span class="mord
mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">im</span><span class="mord mathnormal">es</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">am</span><span
class="mord mathnormal">p</span><span class="mspace" style="margin-
right:0.2778em;"></span><span class="mrel">=</span><span class="mspace"
style="margin-right:0.2778em;"></span></span><span class="base"><span
class="strut" style="height:0.8778em;vertical-align:-0.1944em;"></span><span
class="mord mathnormal" style="margin-right:0.05764em;">SE</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.07153em;">ECTC</span><span class="mord mathnormal" style="margin-
right:0.10903em;">U</span><span class="mord mathnormal" style="margin-
right:0.10903em;">RREN</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.13889em;">T</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3283em;"><span style="top:-2.55em;margin-left:-0.1389em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight"
style="margin-right:0.13889em;">T</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.13889em;">MEST</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.13889em;">MP</span><span class="mpunct">;</span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord">−</span><span
class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:0.8889em;vertical-align:-0.1944em;"></span><span class="mord
mathnormal">G</span><span class="mord mathnormal">e</span><span class="mord
mathnormal">tt</span><span class="mord mathnormal">h</span><span class="mord
mathnormal">e</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">tp</span><span
class="mord mathnormal">rocesse</span><span class="mord mathnormal">d</span>
<span class="mord mathnormal">ba</span><span class="mord mathnormal">t</span>
<span class="mord mathnormal">c</span><span class="mord mathnormal">hin</span>
<span class="mord mathnormal" style="margin-right:0.03588em;">g</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">o</span><span
class="mord mathnormal">n</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">im</span><span class="mord mathnormal">e</span><span
class="mord">1718755872325</span><span class="mord mathnormal" style="margin-
right:0.13889em;">SET</span><span class="mord">@</span><span class="mord
mathnormal" style="margin-right:0.10764em;">f</span><span class="mord
mathnormal">ro</span><span class="mord"><span class="mord mathnormal">m</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.3361em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight">b</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">a</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">c</span><span class="mord"><span class="mord mathnormal">h</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight">i</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">n</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord mathnormal">es</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">o</span><span class="mord"><span class="mord
mathnormal">n</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">im</span><span class="mord mathnormal">e</span><span
class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=
</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span
class="base"><span class="strut" style="height:1.0519em;vertical-
align:-0.25em;"></span><span class="mord mathnormal" style="margin-
right:0.05764em;">SE</span><span class="mord mathnormal">L</span><span
class="mord mathnormal" style="margin-right:0.13889em;">ECT</span><span
class="mord mathnormal">co</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord mathnormal">esce</span><span class="mopen">(</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">s</span><span
class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">s</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">na</span><span
class="mord mathnormal">p</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">h</span><span class="mord mathnormal">o</span><span
class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">i</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">d</span><span
class="mpunct"><span class="mpunct">,</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord mathnormal"
style="margin-right:0.08125em;">H</span><span class="mord mathnormal"
style="margin-right:0.05764em;">E</span><span class="mord mathnormal">A</span>
<span class="mord"><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="msupsub"><span class="vlist-t"><span
class="vlist-r"><span class="vlist" style="height:0.8019em;"><span
style="top:-3.113em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mtight"><span class="mord mtight">′</span></span></span>
</span></span></span></span></span></span><span class="mclose">)</span><span
class="mord mathnormal" style="margin-right:0.10903em;">FROM</span><span
class="mord mathnormal">c</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">ec</span><span class="mord mathnormal" style="margin-
right:0.03148em;">k</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">in</span><span
class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3361em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight"
style="margin-right:0.01968em;">l</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">o</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.09618em;">J</span><span
class="mord mathnormal" style="margin-right:0.02778em;">O</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.10903em;">N</span><span
class="mopen">(</span><span class="mord mathnormal" style="margin-
right:0.05764em;">SE</span><span class="mord mathnormal">L</span><span
class="mord mathnormal" style="margin-right:0.10903em;">ECTM</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.07847em;">X</span><span class="mopen">(</span><span class="mord
mathnormal">p</span><span class="mord mathnormal">roces</span><span
class="mord"><span class="mord mathnormal">s</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.2806em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">t</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">im</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">am</span><span class="mord mathnormal">p</span><span
class="mclose">)</span><span class="mord mathnormal">A</span><span class="mord
mathnormal">Sp</span><span class="mord mathnormal">roces</span><span
class="mord"><span class="mord mathnormal">s</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.2806em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">t</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">im</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">am</span><span class="mord mathnormal" style="margin-
right:0.10903em;">pFROM</span><span class="mord mathnormal">c</span><span
class="mord mathnormal">h</span><span class="mord mathnormal">ec</span><span
class="mord mathnormal" style="margin-right:0.03148em;">k</span><span
class="mord mathnormal">p</span><span class="mord mathnormal">o</span><span
class="mord mathnormal">in</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3361em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight" style="margin-right:0.01968em;">l</span>
</span></span></span><span class="vlist-s">​
</span></span><span class="vlist-r">
<span class="vlist" style="height:0.15em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal">o</span><span class="mord
mathnormal" style="margin-right:0.03588em;">g</span><span class="mord
mathnormal" style="margin-right:0.13889em;">W</span><span class="mord
mathnormal" style="margin-right:0.08125em;">H</span><span class="mord
mathnormal">EREp</span><span class="mord mathnormal">roces</span><span
class="mord"><span class="mord mathnormal">s</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">n</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">am</span><span
class="mord mathnormal">e</span><span class="mspace" style="margin-
right:0.2778em;"></span><span class="mrel"><span class="mrel">=</span><span
class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mspace"
style="margin-right:0.2778em;"></span></span><span class="base"><span
class="strut" style="height:1.088em;vertical-align:-0.2861em;"></span><span
class="mord mathnormal">d</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord"><span class="mord
mathnormal">a</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3361em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight" style="margin-right:0.10764em;">f</span>
</span></span></span><span class="vlist-s">​
</span></span><span class="vlist-r">
<span class="vlist" style="height:0.2861em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal">ee</span><span class="mord"><span
class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t">
<span class="vlist-r"><span class="vlist" style="height:0.8019em;"><span
style="top:-3.113em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mtight"><span class="mord mtight">′</span></span></span>
</span></span></span></span></span></span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">roces</span><span class="mord"><span class="mord
mathnormal">s</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">s</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">t</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">u</span><span
class="mord mathnormal">s</span><span class="mspace" style="margin-
right:0.2778em;"></span><span class="mrel"><span class="mrel">=</span><span
class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mspace"
style="margin-right:0.2778em;"></span></span><span class="base"><span
class="strut" style="height:1.0519em;vertical-align:-0.25em;"></span><span
class="mord mathnormal" style="margin-right:0.05764em;">S</span><span
class="mord mathnormal" style="margin-right:0.10903em;">U</span><span
class="mord mathnormal" style="margin-right:0.13889em;">CCESSF</span><span
class="mord mathnormal" style="margin-right:0.10903em;">U</span><span
class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mclose">)
</span><span class="mord mathnormal">b</span><span class="mord mathnormal"
style="margin-right:0.10903em;">ON</span><span class="mord mathnormal">a</span>
<span class="mord">.</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">roces</span><span class="mord"><span class="mord
mathnormal">s</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">im</span><span class="mord mathnormal">es</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">am</span><span
class="mord mathnormal">p</span><span class="mspace" style="margin-
right:0.2778em;"></span><span class="mrel">=</span><span class="mspace"
style="margin-right:0.2778em;"></span></span><span class="base"><span
class="strut" style="height:0.8889em;vertical-align:-0.1944em;"></span><span
class="mord mathnormal">b</span><span class="mord">.</span><span class="mord
mathnormal">p</span><span class="mord mathnormal">roces</span><span
class="mord"><span class="mord mathnormal">s</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.2806em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">t</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">im</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">am</span><span class="mord mathnormal">p</span><span
class="mpunct">;</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord">−</span><span class="mspace" style="margin-
right:0.2222em;"></span><span class="mbin">−</span><span class="mspace"
style="margin-right:0.2222em;"></span></span><span class="base"><span
class="strut" style="height:0.8889em;vertical-align:-0.1944em;"></span><span
class="mord mathnormal">G</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">tt</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">e</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">ba</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">c</span><span class="mord mathnormal">hin</span><span
class="mord mathnormal" style="margin-right:0.03588em;">g</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">o</span><span
class="mord mathnormal">n</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">im</span><span class="mord mathnormal">e</span><span
class="mord">1718758687865</span><span class="mord mathnormal" style="margin-
right:0.13889em;">SET</span><span class="mord">@</span><span class="mord
mathnormal">t</span><span class="mord"><span class="mord mathnormal">o</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.3361em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight">b</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">a</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">c</span><span class="mord"><span class="mord mathnormal">h</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight">i</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">n</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord mathnormal">es</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">o</span><span class="mord"><span class="mord
mathnormal">n</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">im</span><span class="mord mathnormal">e</span><span
class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=
</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span
class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;">
</span><span class="mord mathnormal" style="margin-right:0.05764em;">SE</span>
<span class="mord mathnormal">L</span><span class="mord mathnormal"
style="margin-right:0.10903em;">ECTM</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.07847em;">X</span><span class="mopen"><span class="mopen">(</span><span
class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span
class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight">a</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">c</span><span class="mord"><span class="mord mathnormal">p</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight">s</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">ys</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">e</span><span class="mord"><span class="mord mathnormal">m</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight">m</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">e</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">a</span><span class="mord mathnormal">d</span><span class="mord
mathnormal">a</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">a</span><span class="mord">.</span><span class="mord
mathnormal">in</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord mathnormal">es</span><span
class="mord mathnormal" style="margin-right:0.13889em;">tT</span><span
class="mord mathnormal">im</span><span class="mord mathnormal">e</span><span
class="mclose">)</span><span class="mord mathnormal" style="margin-
right:0.10903em;">FROM</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">m</span><span
class="mord"><span class="mord mathnormal">o</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3361em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">d</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord"><span class="mord
mathnormal">a</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">re</span><span class="mord"><span class="mord
mathnormal" style="margin-right:0.03588em;">y</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.0359em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">m</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">c</span><span
class="mord mathnormal">in</span><span class="mord mathnormal">t</span><span
class="mord mathnormal" style="margin-right:0.02778em;">yr</span><span
class="mord"><span class="mord mathnormal">e</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">m</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">i</span><span
class="mord mathnormal">d</span><span class="mord mathnormal" style="margin-
right:0.03588em;">v</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord mathnormal">u</span><span class="mord mathnormal">es</span><span
class="mpunct">;</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord">−</span><span class="mspace" style="margin-
right:0.2222em;"></span><span class="mbin">−</span><span class="mspace"
style="margin-right:0.2222em;"></span></span><span class="base"><span
class="strut" style="height:0.8444em;vertical-align:-0.15em;"></span><span
class="mord mathnormal" style="margin-right:0.05764em;">S</span><span
class="mord mathnormal">ess</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">ni</span><span
class="mord mathnormal">ze</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">h</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">aan</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">in</span><span
class="mord mathnormal" style="margin-right:0.02778em;">ser</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">in</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">o</span><span
class="mord mathnormal">n</span><span class="mord mathnormal">e</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.02691em;">w</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:-0.0269em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">s</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.15em;"><span></span></span></span></span></span>
</span><span class="mord mathnormal">ess</span><span class="mord
mathnormal">i</span><span class="mord mathnormal">o</span><span class="mord
mathnormal">ni</span><span class="mord mathnormal">ze</span><span class="mord">
<span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-
t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em;">
<span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">d</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.15em;"><span></span></span></span></span></span>
</span><span class="mord mathnormal">a</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">a</span><span class="mord
mathnormal" style="margin-right:0.07847em;">I</span><span class="mord
mathnormal" style="margin-right:0.13889em;">NSERT</span><span class="mord
mathnormal" style="margin-right:0.07847em;">I</span><span class="mord
mathnormal" style="margin-right:0.02778em;">NTO</span><span class="mord
mathnormal">n</span><span class="mord mathnormal">e</span><span class="mord">
<span class="mord mathnormal" style="margin-right:0.02691em;">w</span><span
class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span
class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-
left:-0.0269em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">s</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">ess</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">ni</span><span
class="mord mathnormal">ze</span><span class="mord"><span class="mord
mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3361em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">d</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-
right:0.05764em;">SE</span><span class="mord mathnormal">L</span><span
class="mord mathnormal" style="margin-right:0.13889em;">ECT</span><span
class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">∗</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal"
style="margin-right:0.10903em;">FROM</span><span class="mopen">(</span><span
class="mord mathnormal" style="margin-right:0.05764em;">SE</span><span
class="mord mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.07153em;">EC</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.13889em;">T</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3117em;"><span style="top:-2.55em;margin-left:-0.1389em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">i</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">d</span><span
class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">im</span><span class="mord mathnormal">es</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">am</span><span class="mord
mathnormal">p</span><span class="mpunct">,</span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord mathnormal">s</span>
<span class="mord mathnormal">t</span><span class="mord mathnormal"
style="margin-right:0.02778em;">r</span><span class="mord mathnormal">u</span>
<span class="mord mathnormal">c</span><span class="mord mathnormal">t</span>
<span class="mopen">(</span><span class="mord mathnormal" style="margin-
right:0.10903em;">U</span><span class="mord mathnormal">se</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.3283em;"><span
style="top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight" style="margin-
right:0.07847em;">I</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">d</span><span class="mord mathnormal">e</span><span class="mord
mathnormal">n</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">i</span><span class="mord mathnormal">t</span><span class="mord
mathnormal" style="margin-right:0.03588em;">y</span><span class="mpunct">,
</span><span class="mspace" style="margin-right:0.1667em;"></span><span
class="mord mathnormal">c</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">t</span><span
class="mopen">(</span><span class="mord mathnormal" style="margin-
right:0.05764em;">SES</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.05764em;">S</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3283em;"><span style="top:-2.55em;margin-left:-0.0576em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight"
style="margin-right:0.13889em;">T</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.02778em;">MEO</span><span
class="mord mathnormal" style="margin-right:0.10903em;">U</span><span
class="mord mathnormal" style="margin-right:0.13889em;">T</span><span
class="mopen">(</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">im</span><span class="mord mathnormal">es</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">am</span><span class="mord
mathnormal">p</span><span class="mpunct">,</span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord">60</span><span
class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">∗</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:1.0519em;vertical-align:-0.25em;"></span><span
class="mord">30</span><span class="mclose">)</span><span class="mord
mathnormal" style="margin-right:0.02778em;">O</span><span class="mord
mathnormal" style="margin-right:0.22222em;">V</span><span class="mord
mathnormal" style="margin-right:0.00773em;">ER</span><span class="mopen">
(</span><span class="mord mathnormal" style="margin-right:0.13889em;">P</span>
<span class="mord mathnormal">A</span><span class="mord mathnormal"
style="margin-right:0.13889em;">RT</span><span class="mord mathnormal"
style="margin-right:0.07847em;">I</span><span class="mord mathnormal"
style="margin-right:0.13889em;">T</span><span class="mord mathnormal"
style="margin-right:0.07847em;">I</span><span class="mord mathnormal"
style="margin-right:0.05017em;">ONB</span><span class="mord mathnormal"
style="margin-right:0.22222em;">Y</span><span class="mord mathnormal"
style="margin-right:0.10903em;">U</span><span class="mord mathnormal">se</span>
<span class="mord"><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.3283em;"><span
style="top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight" style="margin-
right:0.07847em;">I</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">d</span><span class="mord mathnormal">e</span><span class="mord
mathnormal">n</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">i</span><span class="mord mathnormal">t</span><span class="mord
mathnormal" style="margin-right:0.03588em;">y</span><span class="mord
mathnormal" style="margin-right:0.00773em;">OR</span><span class="mord
mathnormal" style="margin-right:0.02778em;">D</span><span class="mord
mathnormal" style="margin-right:0.05017em;">ERB</span><span class="mord
mathnormal" style="margin-right:0.22222em;">Y</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">im</span><span class="mord
mathnormal">es</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">am</span><span class="mord mathnormal" style="margin-
right:0.02778em;">pRO</span><span class="mord mathnormal" style="margin-
right:0.13889em;">W</span><span class="mord mathnormal" style="margin-
right:0.13889em;">SBET</span><span class="mord mathnormal" style="margin-
right:0.13889em;">W</span><span class="mord mathnormal" style="margin-
right:0.10903em;">EEN</span><span class="mord mathnormal" style="margin-
right:0.10903em;">U</span><span class="mord mathnormal" style="margin-
right:0.02778em;">NBO</span><span class="mord mathnormal" style="margin-
right:0.10903em;">U</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.05764em;">E</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.05764em;">PRECE</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal">NG</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.07153em;">C</span><span class="mord mathnormal" style="margin-
right:0.10903em;">U</span><span class="mord mathnormal" style="margin-
right:0.02778em;">RRENTRO</span><span class="mord mathnormal" style="margin-
right:0.13889em;">W</span><span class="mclose">)</span><span class="mord
mathnormal">a</span><span class="mord mathnormal">ss</span><span class="mord
mathnormal">t</span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">in</span><span
class="mord mathnormal" style="margin-right:0.03588em;">g</span><span
class="mclose">)</span><span class="mord mathnormal">A</span><span class="mord
mathnormal" style="margin-right:0.05764em;">SS</span><span class="mord
mathnormal">ess</span><span class="mord mathnormal">i</span><span class="mord
mathnormal">o</span><span class="mord mathnormal">n</span><span class="mord
mathnormal" style="margin-right:0.02778em;">D</span><span class="mord
mathnormal">a</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">a</span><span class="mpunct">,</span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord mathnormal">t</span>
<span class="mord"><span class="mord mathnormal">o</span><span class="msupsub">
<span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.2806em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">t</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">im</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">am</span><span class="mord mathnormal">p</span><span
class="mopen">(</span><span class="mord mathnormal" style="margin-
right:0.10764em;">f</span><span class="mord mathnormal">ro</span><span
class="mord"><span class="mord mathnormal">m</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">u</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">ni</span><span
class="mord mathnormal">x</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">im</span><span class="mord mathnormal">e</span><span
class="mopen">(</span><span class="mord mathnormal">in</span><span class="mord
mathnormal" style="margin-right:0.03588em;">g</span><span class="mord
mathnormal">es</span><span class="mord"><span class="mord mathnormal">t</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.2806em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight">t</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">im</span><span class="mord mathnormal">e</span><span
class="mord">/1000</span><span class="mpunct"><span class="mpunct">,</span>
<span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span
class="vlist" style="height:0.8019em;"><span style="top:-3.113em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span
class="mord mtight">′</span></span></span></span></span></span></span></span>
</span><span class="mspace" style="margin-right:0.1667em;"></span><span
class="mord mathnormal" style="margin-right:0.03588em;">yyyy</span><span
class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:0.7667em;vertical-align:-0.0833em;"></span><span class="mord
mathnormal" style="margin-right:0.10903em;">MM</span><span class="mspace"
style="margin-right:0.2222em;"></span><span class="mbin">−</span><span
class="mspace" style="margin-right:0.2222em;"></span></span><span class="base">
<span class="strut" style="height:0.6944em;"></span><span class="mord
mathnormal">dd</span><span class="mord mathnormal" style="margin-
right:0.08125em;">HH</span><span class="mspace" style="margin-right:0.2778em;">
</span><span class="mrel">:</span><span class="mspace" style="margin-
right:0.2778em;"></span></span><span class="base"><span class="strut"
style="height:0.4306em;"></span><span class="mord mathnormal">mm</span><span
class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">:
</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span
class="base"><span class="strut" style="height:1.088em;vertical-
align:-0.2861em;"></span><span class="mord mathnormal">s</span><span
class="mord"><span class="mord mathnormal">s</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mclose">))
</span><span class="mord mathnormal">A</span><span class="mord mathnormal"
style="margin-right:0.05764em;">S</span><span class="mord mathnormal"
style="margin-right:0.07847em;">I</span><span class="mord mathnormal">n</span>
<span class="mord mathnormal" style="margin-right:0.03588em;">g</span><span
class="mord mathnormal">es</span><span class="mord mathnormal" style="margin-
right:0.13889em;">tT</span><span class="mord mathnormal">im</span><span
class="mord mathnormal">e</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal" style="margin-right:0.13889em;">P</span><span class="mord
mathnormal">a</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord mathnormal">e</span><span
class="mord mathnormal" style="margin-right:0.10903em;">N</span><span
class="mord mathnormal">am</span><span class="mord mathnormal">e</span><span
class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord mathnormal" style="margin-right:0.10764em;">f</span>
<span class="mord mathnormal">i</span><span class="mord mathnormal">rs</span>
<span class="mord"><span class="mord mathnormal">t</span><span class="msupsub">
<span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">u</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mpunct">,</span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord mathnormal"
style="margin-right:0.10764em;">f</span><span class="mord mathnormal">i</span>
<span class="mord mathnormal">rs</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">c</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">hann</span><span class="mord mathnormal">e</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:-0.0197em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">t</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.15em;"><span></span></span></span></span></span>
</span><span class="mord mathnormal" style="margin-right:0.03588em;">y</span>
<span class="mord mathnormal">p</span><span class="mord mathnormal">e</span>
<span class="mclose">)</span><span class="mord mathnormal">a</span><span
class="mord"><span class="mord mathnormal">s</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3361em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">d</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">e</span><span
class="mord mathnormal">m</span><span class="mord mathnormal">osys</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">m</span><span class="mord">5</span><span class="mord
mathnormal" style="margin-right:0.10903em;">FROM</span><span class="mopen">
(</span><span class="mord mathnormal" style="margin-right:0.05764em;">SE</span>
<span class="mord mathnormal">L</span><span class="mord mathnormal"
style="margin-right:0.07153em;">EC</span><span class="mord"><span class="mord
mathnormal" style="margin-right:0.13889em;">T</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3117em;"><span style="top:-2.55em;margin-left:-0.1389em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">i</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">d</span><span
class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord mathnormal" style="margin-right:0.10903em;">EN</span>
<span class="mord mathnormal" style="margin-right:0.02778em;">D</span><span
class="mord mathnormal" style="margin-right:0.10903em;">U</span><span
class="mord mathnormal" style="margin-right:0.00773em;">SER</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.02778em;">D</span><span
class="mord mathnormal" style="margin-right:0.05764em;">S</span><span
class="mord"><span class="mord">.</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3283em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight"
style="margin-right:0.05764em;">E</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal" style="margin-right:0.00773em;">XPER</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.05764em;">ENCE</span><span
class="mord">.</span><span class="mord mathnormal" style="margin-
right:0.07153em;">MC</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord">.</span><span class="mord
mathnormal" style="margin-right:0.07847em;">I</span><span class="mord
mathnormal" style="margin-right:0.02778em;">D</span><span class="mord
mathnormal">a</span><span class="mord mathnormal">s</span><span class="mord
mathnormal" style="margin-right:0.10903em;">U</span><span class="mord
mathnormal">se</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.02778em;">r</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3283em;"><span style="top:-2.55em;margin-left:-0.0278em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight"
style="margin-right:0.07847em;">I</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">d</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">n</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">t</span><span
class="mord mathnormal" style="margin-right:0.03588em;">y</span><span
class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">im</span><span class="mord mathnormal">es</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">am</span><span class="mord
mathnormal">p</span><span class="mpunct">,</span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mord mathnormal"
style="margin-right:0.02691em;">w</span><span class="mord mathnormal">e</span>
<span class="mord mathnormal">b</span><span class="mord">.</span><span
class="mord mathnormal" style="margin-right:0.02691em;">w</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">b</span><span
class="mord mathnormal" style="margin-right:0.13889em;">P</span><span
class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord mathnormal">eDe</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">ai</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord mathnormal">s</span><span class="mord">.</span><span class="mord
mathnormal">nam</span><span class="mord mathnormal">e</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.13889em;">SP</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.03588em;">g</span><span
class="mord mathnormal">e</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mord mathnormal">am</span><span
class="mord mathnormal">e</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal">a</span><span class="mord mathnormal">tt</span><span class="mord
mathnormal" style="margin-right:0.02778em;">r</span><span class="mord
mathnormal">ib</span><span class="mord mathnormal">u</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">i</span><span class="mord
mathnormal">o</span><span class="mord"><span class="mord mathnormal">n</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.3361em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight" style="margin-right:0.10764em;">f</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.2861em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal">i</span><span class="mord
mathnormal">rs</span><span class="mord"><span class="mord mathnormal">t</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.2806em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight">t</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">o</span><span class="mord mathnormal">u</span><span class="mord
mathnormal">c</span><span class="mord mathnormal">h</span><span class="mopen">
(</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">im</span><span class="mord mathnormal">es</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">am</span><span class="mord
mathnormal">p</span><span class="mpunct"><span class="mpunct">,</span><span
class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′′</span>
</span></span></span></span></span></span></span></span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal" style="margin-right:0.02691em;">w</span><span class="mord
mathnormal">e</span><span class="mord mathnormal">b</span><span class="mord">.
</span><span class="mord mathnormal" style="margin-right:0.02691em;">w</span>
<span class="mord mathnormal">e</span><span class="mord mathnormal">b</span>
<span class="mord mathnormal" style="margin-right:0.00773em;">R</span><span
class="mord mathnormal">e</span><span class="mord mathnormal" style="margin-
right:0.10764em;">f</span><span class="mord mathnormal" style="margin-
right:0.02778em;">errer</span><span class="mord">.</span><span class="mord
mathnormal">u</span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mclose">)</span><span class="mord
mathnormal" style="margin-right:0.02778em;">O</span><span class="mord
mathnormal" style="margin-right:0.22222em;">V</span><span class="mord
mathnormal" style="margin-right:0.00773em;">ER</span><span class="mopen">
(</span><span class="mord mathnormal" style="margin-right:0.13889em;">P</span>
<span class="mord mathnormal">A</span><span class="mord mathnormal"
style="margin-right:0.13889em;">RT</span><span class="mord mathnormal"
style="margin-right:0.07847em;">I</span><span class="mord mathnormal"
style="margin-right:0.13889em;">T</span><span class="mord mathnormal"
style="margin-right:0.07847em;">I</span><span class="mord mathnormal"
style="margin-right:0.05017em;">ONB</span><span class="mord mathnormal"
style="margin-right:0.22222em;">Y</span><span class="mord mathnormal"
style="margin-right:0.10903em;">EN</span><span class="mord mathnormal"
style="margin-right:0.02778em;">D</span><span class="mord mathnormal"
style="margin-right:0.10903em;">U</span><span class="mord mathnormal"
style="margin-right:0.00773em;">SER</span><span class="mord mathnormal"
style="margin-right:0.07847em;">I</span><span class="mord mathnormal"
style="margin-right:0.02778em;">D</span><span class="mord mathnormal"
style="margin-right:0.05764em;">S</span><span class="mord"><span class="mord">.
</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-
r"><span class="vlist" style="height:0.3283em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight" style="margin-right:0.05764em;">E</span>
</span></span></span><span class="vlist-s">​
</span></span><span class="vlist-r">
<span class="vlist" style="height:0.15em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal" style="margin-
right:0.00773em;">XPER</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.05764em;">ENCE</span><span class="mord">.</span><span class="mord
mathnormal" style="margin-right:0.07153em;">MC</span><span class="mord
mathnormal" style="margin-right:0.07847em;">I</span><span class="mord
mathnormal" style="margin-right:0.02778em;">D</span><span class="mord">.</span>
<span class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.02778em;">D</span><span
class="mord mathnormal" style="margin-right:0.00773em;">OR</span><span
class="mord mathnormal" style="margin-right:0.02778em;">D</span><span
class="mord mathnormal" style="margin-right:0.05017em;">ERB</span><span
class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">im</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">am</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.02778em;">SCRO</span><span class="mord mathnormal" style="margin-
right:0.13889em;">W</span><span class="mord mathnormal" style="margin-
right:0.13889em;">SBET</span><span class="mord mathnormal" style="margin-
right:0.13889em;">W</span><span class="mord mathnormal" style="margin-
right:0.10903em;">EEN</span><span class="mord mathnormal" style="margin-
right:0.10903em;">U</span><span class="mord mathnormal" style="margin-
right:0.02778em;">NBO</span><span class="mord mathnormal" style="margin-
right:0.10903em;">U</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.05764em;">E</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.05764em;">PRECE</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal">NG</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.10903em;">U</span><span class="mord mathnormal" style="margin-
right:0.02778em;">NBO</span><span class="mord mathnormal" style="margin-
right:0.10903em;">U</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.05764em;">E</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.02778em;">FO</span><span class="mord mathnormal">LL</span><span
class="mord mathnormal" style="margin-right:0.02778em;">O</span><span
class="mord mathnormal" style="margin-right:0.13889em;">W</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal">NG</span><span class="mclose">)</span><span
class="mord">.</span><span class="mord mathnormal" style="margin-
right:0.03588em;">v</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord mathnormal">u</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal" style="margin-
right:0.10764em;">f</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">rs</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">u</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal" style="margin-right:0.02778em;">r</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord mathnormal">a</span><span class="mord
mathnormal">tt</span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">ib</span><span
class="mord mathnormal">u</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">i</span><span class="mord mathnormal">o</span><span
class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3361em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight"
style="margin-right:0.10764em;">f</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.2861em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">i</span><span class="mord mathnormal">rs</span><span
class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.2806em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">t</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">o</span><span
class="mord mathnormal">u</span><span class="mord mathnormal">c</span><span
class="mord mathnormal">h</span><span class="mopen">(</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">im</span><span class="mord
mathnormal">es</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">am</span><span class="mord mathnormal">p</span><span
class="mpunct"><span class="mpunct">,</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′′</span>
</span></span></span></span></span></span></span></span><span class="mspace"
style="margin-right:0.1667em;"></span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal">c</span><span class="mord mathnormal">hann</span><span class="mord
mathnormal">e</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord">.</span><span class="mord
mathnormal">t</span><span class="mord mathnormal" style="margin-
right:0.03588em;">y</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">A</span><span
class="mord mathnormal" style="margin-right:0.05764em;">tS</span><span
class="mord mathnormal">o</span><span class="mord mathnormal">u</span><span
class="mord mathnormal">rce</span><span class="mclose">)</span><span
class="mord mathnormal" style="margin-right:0.02778em;">O</span><span
class="mord mathnormal" style="margin-right:0.22222em;">V</span><span
class="mord mathnormal" style="margin-right:0.00773em;">ER</span><span
class="mopen">(</span><span class="mord mathnormal" style="margin-
right:0.13889em;">P</span><span class="mord mathnormal">A</span><span
class="mord mathnormal" style="margin-right:0.13889em;">RT</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.13889em;">T</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.05017em;">ONB</span><span
class="mord mathnormal" style="margin-right:0.22222em;">Y</span><span
class="mord mathnormal" style="margin-right:0.10903em;">EN</span><span
class="mord mathnormal" style="margin-right:0.02778em;">D</span><span
class="mord mathnormal" style="margin-right:0.10903em;">U</span><span
class="mord mathnormal" style="margin-right:0.00773em;">SER</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.02778em;">D</span><span
class="mord mathnormal" style="margin-right:0.05764em;">S</span><span
class="mord"><span class="mord">.</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3283em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight"
style="margin-right:0.05764em;">E</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal" style="margin-right:0.00773em;">XPER</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal" style="margin-right:0.05764em;">ENCE</span><span
class="mord">.</span><span class="mord mathnormal" style="margin-
right:0.07153em;">MC</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord">.</span><span class="mord
mathnormal" style="margin-right:0.07847em;">I</span><span class="mord
mathnormal" style="margin-right:0.02778em;">D</span><span class="mord
mathnormal" style="margin-right:0.00773em;">OR</span><span class="mord
mathnormal" style="margin-right:0.02778em;">D</span><span class="mord
mathnormal" style="margin-right:0.05017em;">ERB</span><span class="mord
mathnormal" style="margin-right:0.22222em;">Y</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">im</span><span class="mord
mathnormal">es</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">am</span><span class="mord mathnormal">p</span><span class="mord
mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.02778em;">SCRO</span><span class="mord mathnormal" style="margin-
right:0.13889em;">W</span><span class="mord mathnormal" style="margin-
right:0.13889em;">SBET</span><span class="mord mathnormal" style="margin-
right:0.13889em;">W</span><span class="mord mathnormal" style="margin-
right:0.10903em;">EEN</span><span class="mord mathnormal" style="margin-
right:0.10903em;">U</span><span class="mord mathnormal" style="margin-
right:0.02778em;">NBO</span><span class="mord mathnormal" style="margin-
right:0.10903em;">U</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.05764em;">E</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.05764em;">PRECE</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal">NG</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.10903em;">U</span><span class="mord mathnormal" style="margin-
right:0.02778em;">NBO</span><span class="mord mathnormal" style="margin-
right:0.10903em;">U</span><span class="mord mathnormal" style="margin-
right:0.10903em;">N</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.05764em;">E</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.02778em;">FO</span><span class="mord mathnormal">LL</span><span
class="mord mathnormal" style="margin-right:0.02778em;">O</span><span
class="mord mathnormal" style="margin-right:0.13889em;">W</span><span
class="mord mathnormal" style="margin-right:0.07847em;">I</span><span
class="mord mathnormal">NG</span><span class="mclose">)</span><span
class="mord">.</span><span class="mord mathnormal" style="margin-
right:0.03588em;">v</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord mathnormal">u</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal" style="margin-
right:0.10764em;">f</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">rs</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">c</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">hann</span><span class="mord mathnormal">e</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:-0.0197em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight">t</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.15em;"><span></span></span></span></span></span>
</span><span class="mord mathnormal" style="margin-right:0.03588em;">y</span>
<span class="mord mathnormal">p</span><span class="mord mathnormal">e</span>
<span class="mpunct"><span class="mpunct">,</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">a</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mspace" style="margin-
right:0.1667em;"></span><span class="mord mathnormal">c</span><span
class="mord"><span class="mord mathnormal">p</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">s</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">ys</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">e</span><span
class="mord"><span class="mord mathnormal">m</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">m</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">e</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">d</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">a</span><span
class="mord">.</span><span class="mord mathnormal">in</span><span class="mord
mathnormal" style="margin-right:0.03588em;">g</span><span class="mord
mathnormal">es</span><span class="mord mathnormal" style="margin-
right:0.13889em;">tT</span><span class="mord mathnormal">im</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">A</span><span
class="mord mathnormal" style="margin-right:0.05764em;">S</span><span
class="mord mathnormal">in</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord mathnormal">es</span><span
class="mord"><span class="mord mathnormal">t</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.2806em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">t</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">im</span><span
class="mord mathnormal">e</span><span class="mord mathnormal" style="margin-
right:0.10903em;">FROM</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">m</span><span
class="mord"><span class="mord mathnormal">o</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.3361em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">d</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">a</span><span
class="mord mathnormal">t</span><span class="mord"><span class="mord
mathnormal">a</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">re</span><span class="mord"><span class="mord
mathnormal" style="margin-right:0.03588em;">y</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.0359em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">m</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">c</span><span
class="mord mathnormal">in</span><span class="mord mathnormal">t</span><span
class="mord mathnormal" style="margin-right:0.02778em;">yr</span><span
class="mord"><span class="mord mathnormal">e</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">m</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">i</span><span
class="mord mathnormal">d</span><span class="mord mathnormal" style="margin-
right:0.03588em;">v</span><span class="mord mathnormal">a</span><span
class="mord mathnormal" style="margin-right:0.01968em;">l</span><span
class="mord mathnormal">u</span><span class="mord mathnormal">es</span><span
class="mord mathnormal" style="margin-right:0.13889em;">W</span><span
class="mord mathnormal" style="margin-right:0.08125em;">H</span><span
class="mord mathnormal">EREt</span><span class="mord mathnormal">im</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">am</span><span class="mord mathnormal">p</span><span
class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">&gt;=
</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span
class="base"><span class="strut" style="height:0.7651em;vertical-
align:-0.15em;"></span><span class="mord mathnormal">c</span><span class="mord
mathnormal">u</span><span class="mord mathnormal">rre</span><span class="mord
mathnormal">n</span><span class="mord"><span class="mord mathnormal">t</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.3361em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight">d</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">a</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">e</span><span class="mspace" style="margin-right:0.2222em;"></span>
<span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:1em;vertical-align:-0.25em;"></span><span class="mord">90</span>
<span class="mclose">)</span><span class="mord mathnormal" style="margin-
right:0.00773em;">OR</span><span class="mord mathnormal" style="margin-
right:0.02778em;">D</span><span class="mord mathnormal" style="margin-
right:0.05017em;">ERB</span><span class="mord mathnormal" style="margin-
right:0.22222em;">Y</span><span class="mord mathnormal" style="margin-
right:0.10903em;">U</span><span class="mord mathnormal">se</span><span
class="mord"><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="msupsub"><span class="vlist-t vlist-t2">
<span class="vlist-r"><span class="vlist" style="height:0.3283em;"><span
style="top:-2.55em;margin-left:-0.0278em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mathnormal mtight" style="margin-
right:0.07847em;">I</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">d</span><span class="mord mathnormal">e</span><span class="mord
mathnormal">n</span><span class="mord mathnormal">t</span><span class="mord
mathnormal">i</span><span class="mord mathnormal">t</span><span class="mord
mathnormal" style="margin-right:0.03588em;">y</span><span class="mpunct">,
</span><span class="mspace" style="margin-right:0.1667em;"></span><span
class="mord mathnormal">t</span><span class="mord mathnormal">im</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">am</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.07153em;">SC</span><span class="mclose">)</span><span class="mord
mathnormal" style="margin-right:0.13889em;">W</span><span class="mord
mathnormal" style="margin-right:0.08125em;">H</span><span class="mord
mathnormal" style="margin-right:0.00773em;">ER</span><span class="mord"><span
class="mord mathnormal" style="margin-right:0.05764em;">E</span><span
class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span
class="vlist" style="height:0.3361em;"><span style="top:-2.55em;margin-
left:-0.0576em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">d</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">e</span><span class="mord mathnormal">m</span><span
class="mord mathnormal">osys</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">m</span><span
class="mord">5.</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal">n</span><span
class="mord mathnormal" style="margin-right:0.03588em;">g</span><span
class="mord mathnormal">es</span><span class="mord mathnormal" style="margin-
right:0.13889em;">tT</span><span class="mord mathnormal">im</span><span
class="mord mathnormal">e</span><span class="mspace" style="margin-
right:0.2778em;"></span><span class="mrel">&gt;=</span><span class="mspace"
style="margin-right:0.2778em;"></span></span><span class="base"><span
class="strut" style="height:1.0519em;vertical-align:-0.25em;"></span><span
class="mord mathnormal">t</span><span class="mord"><span class="mord
mathnormal">o</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">im</span><span class="mord mathnormal">es</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">am</span><span
class="mord mathnormal">p</span><span class="mopen">(</span><span class="mord
mathnormal" style="margin-right:0.10764em;">f</span><span class="mord
mathnormal">ro</span><span class="mord"><span class="mord mathnormal">m</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight">u</span></span></span></span><span class="vlist-s">​
</span>
</span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span>
</span></span></span></span></span></span><span class="mord
mathnormal">ni</span><span class="mord mathnormal">x</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">im</span><span class="mord
mathnormal">e</span><span class="mopen">(</span><span class="mord">@</span>
<span class="mord mathnormal" style="margin-right:0.10764em;">f</span><span
class="mord mathnormal">ro</span><span class="mord"><span class="mord
mathnormal">m</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3361em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">b</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">c</span><span class="mord"><span class="mord
mathnormal">h</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3117em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">i</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">n</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord mathnormal">es</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">o</span><span class="mord"><span class="mord
mathnormal">n</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">im</span><span class="mord mathnormal">e</span><span
class="mord">/1000</span><span class="mpunct"><span class="mpunct">,</span>
<span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span
class="vlist" style="height:0.8019em;"><span style="top:-3.113em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span
class="mord mtight">′</span></span></span></span></span></span></span></span>
</span><span class="mspace" style="margin-right:0.1667em;"></span><span
class="mord mathnormal" style="margin-right:0.03588em;">yyyy</span><span
class="mspace" style="margin-right:0.2222em;"></span><span
class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;">
</span></span><span class="base"><span class="strut"
style="height:0.7667em;vertical-align:-0.0833em;"></span><span class="mord
mathnormal" style="margin-right:0.10903em;">MM</span><span class="mspace"
style="margin-right:0.2222em;"></span><span class="mbin">−</span><span
class="mspace" style="margin-right:0.2222em;"></span></span><span class="base">
<span class="strut" style="height:0.6944em;"></span><span class="mord
mathnormal">dd</span><span class="mord mathnormal" style="margin-
right:0.08125em;">HH</span><span class="mspace" style="margin-right:0.2778em;">
</span><span class="mrel">:</span><span class="mspace" style="margin-
right:0.2778em;"></span></span><span class="base"><span class="strut"
style="height:0.4306em;"></span><span class="mord mathnormal">mm</span><span
class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">:
</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span
class="base"><span class="strut" style="height:1.0519em;vertical-
align:-0.25em;"></span><span class="mord mathnormal">s</span><span
class="mord"><span class="mord mathnormal">s</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mclose">))
</span><span class="mpunct">;</span><span class="mspace" style="margin-
right:0.1667em;"></span><span class="mord">−</span><span class="mspace"
style="margin-right:0.2222em;"></span><span class="mbin">−</span><span
class="mspace" style="margin-right:0.2222em;"></span></span><span class="base">
<span class="strut" style="height:1.088em;vertical-align:-0.2861em;"></span>
<span class="mord mathnormal" style="margin-right:0.10903em;">U</span><span
class="mord mathnormal">p</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">e</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">h</span><span class="mord mathnormal">ec</span><span
class="mord mathnormal">h</span><span class="mord mathnormal">ec</span><span
class="mord mathnormal" style="margin-right:0.03148em;">k</span><span
class="mord mathnormal">p</span><span class="mord mathnormal">o</span><span
class="mord mathnormal">in</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3361em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight" style="margin-right:0.01968em;">l</span>
</span></span></span><span class="vlist-s">​
</span></span><span class="vlist-r">
<span class="vlist" style="height:0.15em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal">o</span><span class="mord
mathnormal" style="margin-right:0.03588em;">g</span><span class="mord
mathnormal">t</span><span class="mord mathnormal">ab</span><span class="mord
mathnormal" style="margin-right:0.01968em;">l</span><span class="mord
mathnormal">e</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.13889em;">NSERT</span><span class="mord mathnormal" style="margin-
right:0.07847em;">I</span><span class="mord mathnormal" style="margin-
right:0.02778em;">NTO</span><span class="mord mathnormal">c</span><span
class="mord mathnormal">h</span><span class="mord mathnormal">ec</span><span
class="mord mathnormal" style="margin-right:0.03148em;">k</span><span
class="mord mathnormal">p</span><span class="mord mathnormal">o</span><span
class="mord mathnormal">in</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3361em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight" style="margin-right:0.01968em;">l</span>
</span></span></span><span class="vlist-s">​
</span></span><span class="vlist-r">
<span class="vlist" style="height:0.15em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal">o</span><span class="mord
mathnormal" style="margin-right:0.03588em;">g</span><span class="mord
mathnormal" style="margin-right:0.05764em;">SE</span><span class="mord
mathnormal">L</span><span class="mord mathnormal" style="margin-
right:0.07153em;">EC</span><span class="mord"><span class="mord mathnormal"
style="margin-right:0.13889em;">T</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mord
mathnormal">d</span><span class="mord mathnormal">a</span><span class="mord
mathnormal">t</span><span class="mord"><span class="mord mathnormal">a</span>
<span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r">
<span class="vlist" style="height:0.3361em;"><span style="top:-2.55em;margin-
left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;">
</span><span class="sizing reset-size6 size3 mtight"><span class="mord
mathnormal mtight" style="margin-right:0.10764em;">f</span></span></span>
</span><span class="vlist-s">​
</span></span><span class="vlist-r"><span
class="vlist" style="height:0.2861em;"><span></span></span></span></span>
</span></span><span class="mord mathnormal">ee</span><span class="mord"><span
class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t">
<span class="vlist-r"><span class="vlist" style="height:0.8019em;"><span
style="top:-3.113em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mtight"><span class="mord mtight">′</span></span></span>
</span></span></span></span></span></span><span class="mord
mathnormal">a</span><span class="mord mathnormal">s</span><span class="mord
mathnormal">p</span><span class="mord mathnormal">roces</span><span
class="mord"><span class="mord mathnormal">s</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">n</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">am</span><span
class="mord mathnormal">e</span><span class="mpunct"><span class="mpunct">,
</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span
class="vlist" style="height:0.8019em;"><span style="top:-3.113em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span
class="mord mtight">′</span></span></span></span></span></span></span></span>
</span><span class="mspace" style="margin-right:0.1667em;"></span><span
class="mord mathnormal" style="margin-right:0.05764em;">S</span><span
class="mord mathnormal" style="margin-right:0.10903em;">U</span><span
class="mord mathnormal" style="margin-right:0.13889em;">CCESSF</span><span
class="mord mathnormal" style="margin-right:0.10903em;">U</span><span
class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span
class="vlist-t"><span class="vlist-r"><span class="vlist"
style="height:0.8019em;"><span style="top:-3.113em;margin-right:0.05em;"><span
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6
size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span>
</span></span></span></span></span></span></span></span><span class="mord
mathnormal">a</span><span class="mord mathnormal">s</span><span class="mord
mathnormal">p</span><span class="mord mathnormal">roces</span><span
class="mord"><span class="mord mathnormal">s</span><span class="msupsub"><span
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"
style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-
right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal
mtight">s</span></span></span></span><span class="vlist-s">​
</span></span><span
class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span>
</span></span></span></span><span class="mord mathnormal">t</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">u</span><span class="mord mathnormal">s</span><span
class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;">
</span><span class="mord mathnormal">c</span><span class="mord
mathnormal">a</span><span class="mord mathnormal">s</span><span class="mord
mathnormal">t</span><span class="mopen">(</span><span class="mord">@</span>
<span class="mord mathnormal">t</span><span class="mord"><span class="mord
mathnormal">o</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3361em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">b</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">c</span><span class="mord"><span class="mord
mathnormal">h</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3117em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">i</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">n</span><span class="mord mathnormal" style="margin-
right:0.03588em;">g</span><span class="mord mathnormal">es</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">i</span><span
class="mord mathnormal">o</span><span class="mord"><span class="mord
mathnormal">n</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">im</span><span class="mord mathnormal">e</span><span
class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-
right:0.05764em;">S</span><span class="mord mathnormal">s</span><span
class="mord mathnormal">t</span><span class="mord mathnormal" style="margin-
right:0.02778em;">r</span><span class="mord mathnormal">in</span><span
class="mord mathnormal" style="margin-right:0.03588em;">g</span><span
class="mclose">)</span><span class="mord mathnormal">a</span><span class="mord
mathnormal">s</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">s</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">s</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">na</span><span class="mord mathnormal">p</span><span
class="mord mathnormal">s</span><span class="mord mathnormal">h</span><span
class="mord mathnormal">o</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.3117em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">i</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">d</span><span class="mpunct">,</span><span
class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal">c</span><span class="mord mathnormal">a</span><span class="mord
mathnormal">s</span><span class="mord mathnormal">t</span><span class="mopen">
(</span><span class="mord">@</span><span class="mord mathnormal" style="margin-
right:0.01968em;">l</span><span class="mord mathnormal">a</span><span
class="mord mathnormal">s</span><span class="mord"><span class="mord
mathnormal">t</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.1514em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">u</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">p</span><span class="mord mathnormal">d</span><span
class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">e</span><span class="mord"><span class="mord
mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">im</span><span class="mord mathnormal">es</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">am</span><span
class="mord mathnormal">p</span><span class="mord mathnormal">A</span><span
class="mord mathnormal">St</span><span class="mord mathnormal">im</span><span
class="mord mathnormal">es</span><span class="mord mathnormal">t</span><span
class="mord mathnormal">am</span><span class="mord mathnormal">p</span><span
class="mclose">)</span><span class="mord mathnormal">a</span><span class="mord
mathnormal">s</span><span class="mord mathnormal">p</span><span class="mord
mathnormal">roces</span><span class="mord"><span class="mord
mathnormal">s</span><span class="msupsub"><span class="vlist-t vlist-t2"><span
class="vlist-r"><span class="vlist" style="height:0.2806em;"><span
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut"
style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight">
<span class="mord mathnormal mtight">t</span></span></span></span><span
class="vlist-s">​
</span></span><span class="vlist-r"><span class="vlist"
style="height:0.15em;"><span></span></span></span></span></span></span><span
class="mord mathnormal">im</span><span class="mord mathnormal">es</span><span
class="mord mathnormal">t</span><span class="mord mathnormal">am</span><span
class="mord mathnormal">p</span><span class="mpunct">;</span><span
class="mspace" style="margin-right:0.1667em;"></span><span class="mord
mathnormal" style="margin-right:0.10903em;">EN</span><span class="mord
mathnormal" style="margin-right:0.02778em;">D</span></span></span></span>
</span>;

This query adds attribution logic to capture the first touch point for URLs and channel types.

In the second SQL query, two attribution functions are used:

attribution_first_touch(timestamp, '', web.webReferrer.url) OVER (
    PARTITION BY ENDUSERIDS._EXPERIENCE.MCID.ID
    ORDER BY timestamp ASC
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
).value AS first_url,

attribution_first_touch(timestamp, '', channel.typeAtSource) OVER (
    PARTITION BY ENDUSERIDS._EXPERIENCE.MCID.ID
    ORDER BY timestamp ASC
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
).value AS first_channel_type,

These functions introduce first-touch attribution, which differs from the simple sessionization logic in the first query.
The above query uses the **attribution_first_touch()** Data Distiller function to identify the first event
(or touchpoint) for each session or user, capturing the first URL visited and the first channel type used during a session.


Cloud storage destinations marked in green are supported.

Click on ADLS Gen2 container or directory

Copy the SAS URI and choose a name for the DLZ container. Note that the SAS URI is copied from the results of the
execution of the Python code above.

Connection is complete. You should see the files exported here.

RFM data with anonymized email

Navigate to Connections->Destinations->Catalog->Cloud Storage->Data Landing Zone. Click Activate.

Choose Datasets instead of audiences

Click on the Destination Account created.

Choose RFM_MODEL_SEGMENT dataset to export

Click Finish to complete the setup

Click on Destinations->Browse->DLZ_Data_Distiller flow

You should see the raw files.

Manifest files in TextEdit application on Mac

3rd Party Export Restriction


Click on the ellipsis to access the Data Labels

All fields are available as a flat list

Apply a label on an entire dataset

C2 contract labels are now applied to all of the fields

Locate the ad hoc schema to apply the DULE labels on

Apply the labels on a field

Choose the labels for the field

Label applied to an individual field

https://data-distiller.all-stuff-data.com/unit-9-data-distiller-activation-and-data-export/act-200-dataset-activation-
anonymization-masking-and-differential-privacy-techniques * * *

1. UNIT 9: DATA DISTILLER ACTIVATION & DATA EXPORT

ACT 200: Dataset Activation: Anonymization, Masking & Differential Privacy Techniques
Explore advanced differential privacy techniques to securely activate data while balancing valuable insights and individual privacy protection.


Download the file:

Ingest the data as **healthcare_customers** dataset using this:

The Privacy Dilemma in Personalization

One of the biggest challenges in personalization is determining how far a company should go in leveraging customer
data to create a highly tailored experience that wows the customer. The question arises: What machine learning
algorithm will deliver the perfect offer at the right time? However, pursuing this goal comes with significant risks.
Over-personalization can make customers uncomfortable, and even after removing personally identifiable information
(PII), those designing the algorithms or offers can still infer personal details, such as recognizing a neighbor who shops
with the company. This raises a crucial ethical dilemma—how far should we go to enhance the customer experience
while also safeguarding their privacy?

The solution lies in recognizing that there’s a trade-off. To respect customer privacy, companies must be willing to
sacrifice some degree of accuracy, and possibly some profit, to ensure customers feel secure when interacting with a
brand. By embracing differential privacy techniques, like adding noise to datasets, we can protect individual identities
while still gaining valuable insights. In doing so, companies demonstrate that they prioritize not only profits but also
the privacy and trust of their customers.

Data Distiller enables a wide variety of use cases, such as data activation for enterprise reporting, feature engineering
for machine learning, enriching the enterprise identity graph, and creating custom audiences in specialized formats.
However, dataset activation requires responsible consideration of the data being shared. While techniques like
stripping sensitive information, masking, and anonymization are all viable options, you still need enough behavioral
data for meaningful downstream analysis. The challenge is ensuring that the data you activate is not so raw or
transparent that someone within your company could reverse-engineer the identities of individuals. How do you
balance utility with privacy to protect individuals while maintaining valuable insights?

Here are a few use cases from the Capability Matrix that you might consider approaching differently when activating
datasets with Data Distiller:

1. Data Distiller Audiences with Privacy: When activating audiences from Data Distiller, you can use noisy
datasets to segment customers based on behavior, demographics, or purchase history without exposing precise
individual data. This approach safeguards sensitive customer information while still enabling effective
segmentation for marketing campaigns.

2. A/B Testing with Privacy Enhancements: Use noisy data to perform A/B testing on customer interactions with
different marketing strategies. Noise can help ensure that individual customers’ data points are less identifiable
while still allowing you to measure the success of each strategy effectively.

3. Predictive Modeling with Protected Data: Develop models to predict customer behavior (e.g., churn prediction,
purchase likelihood) where individual customer records are perturbed to protect privacy. You can still identify
trends and make predictions for your marketing efforts.

4. Lookalike Modeling for Ad Targeting: Create lookalike audiences by training models on noisy data, which can
help marketers find potential new customers who exhibit similar behaviors to existing high-value customers.
Adding noise preserves privacy while still providing valuable insights for targeting.

5. Personalized Recommendations with Privacy: Generate privacy-preserving personalized product or content recommendations. By adding noise, you ensure that individual preferences are obscured, but trends can still drive relevant recommendations.

6. Customer Lifetime Value (CLV) Estimation with Noise: Calculate customer lifetime value using noisy
datasets to avoid exposing sensitive financial or transactional details of individuals while still identifying trends
and high-value customer segments for personalized marketing.

7. Privacy-Protected Attribution Modeling: You can analyze marketing attribution (which channels lead to
conversions) using noisy data to protect user interactions while maintaining the overall effectiveness of
attribution models to optimize campaign spend.

8. Cross-Device Tracking without Exact Data Matching: In marketing campaigns that track user journeys across
devices, noise can help reduce the precision of cross-device matching, maintaining privacy while still enabling
marketers to understand multi-touch attribution paths.

What is Differential Privacy?

The key idea behind differential privacy is to ensure that the results of any analysis or query on a dataset remain
almost identical, whether or not an individual’s data is included. This means that no end user can “difference” two
snapshots of the dataset and deduce who the individuals are. By maintaining this consistency, differential privacy
prevents anyone from inferring significant details about any specific person, even if they are aware that person’s data is
part of the dataset.

Consider a database that tracks whether people have a particular medical condition. A simple query might ask, “How
many people in the dataset have the condition?” Suppose the true count is 100. Now, imagine that a new person with
the condition is added, increasing the count to 101. As the data scientist, you know that your neighbor has been very ill
and that there is only one medical care provider nearby. Without differential privacy, this information could allow you
to deduce that your neighbor is included in the dataset.
To prevent this, we can add a small amount of random noise before revealing the count. Instead of reporting exactly
100, we might reveal 102 or 99. If someone joins or leaves the dataset, the count could shift to 103 or 100, for
instance. This noise ensures that the presence or absence of any individual doesn’t significantly impact the result.

In this way, you, as the data scientist, cannot confidently determine whether a specific person is part of the dataset
based on the output. And that is a good thing - the individual’s privacy is protected, as their contribution is “hidden”
within the noise.

The Privacy vs. Utility Tradeoff Dilemma

The key idea in adding noise to ensure differential privacy is to balance two competing objectives:

1. Privacy: Protecting individuals’ data by making it difficult to infer whether a particular individual is in the
dataset.

2. Utility: Ensuring that the analysis results remain useful and accurate for personalization despite the noise.

The tradeoffs are:

High Privacy → Lower Utility: When you add a lot of noise to data to protect privacy, the accuracy and reliability of the data, and hence of your personalization, decrease.

High Utility → Lower Privacy: On the other hand, if you reduce the noise to increase the accuracy (utility) of the data, i.e. the quality of personalization, the dataset becomes more representative of the actual individuals, which increases the risk of identifying someone.

Two Key Variables for Privacy: Sensitivity and Noise

In differential privacy, sensitivity (denoted as Δf) refers to how much the result of a query could change if a single
individual’s data is added or removed. It’s not about the variability of the data itself, but about the potential impact
any individual’s presence can have on the output. The higher the sensitivity, the greater the change an individual’s data
can introduce to the result.

Let’s revisit the example of the medical condition dataset. If the condition can only have one of two values (e.g., “has
the condition” or “does not”), it means the data has low sensitivity—since adding or removing one person will change
the count by at most 1. However, this low sensitivity makes it easier for someone, like a data scientist, to start guessing
which of their neighbors is in the dataset by correlating other fields, like treatments or appointment times.

Even though the sensitivity is low (since the result can only change by a small amount), the signal is strong because
there is limited variation in the data. This means the individual’s presence becomes easier to detect, which can
compromise privacy. To protect against this, we need to compensate by adding carefully calibrated noise. The
amount of noise depends on the sensitivity: low sensitivity may require less noise, but it’s still essential to add enough
to prevent any inference about specific individuals based on the dataset’s output. The amount of noise added is
determined by a key privacy parameter known as epsilon (𝜀).

This balance between sensitivity and noise ensures that the final result provides useful insights while protecting the
privacy of individuals.

In practice, you must choose an appropriate value for epsilon (𝜀) based on your specific needs and risk tolerance.
Higher epsilon values might be suitable when the accuracy of data is critical (e.g., scientific research use cases), while
lower epsilon values would be more appropriate in sensitive applications where privacy is the top priority (e.g., health
data).
How to Add Noise: The Laplace Mechanism

Laplacian noise refers to random noise drawn from a Laplace distribution, which looks like a pointy curve centered
at 0. This noise is used to obscure or mask the precise value of a result so that it’s difficult for an attacker to infer
whether a specific individual’s data is present or absent.

In most systems (like SQL or programming languages), random numbers are typically generated from a uniform
distribution, meaning the random values are equally likely to be anywhere within a certain range, such as between
-0.5 and 0.5. This uniform distribution is very different from the Laplace distribution, which is concentrated around
0. So, we need a way to convert uniform random numbers into Laplacian-distributed numbers. This conversion is done
using a transformation involving the logarithm function. The transformation converts the uniform random number
into a value that follows a Laplace distribution.

The Laplace noise generation requires converting uniformly distributed random numbers (generated using
RAND()) into Laplace-distributed values, and this conversion relies on the inverse of the cumulative distribution
function (CDF) of the Laplace distribution. This inverse transformation involves the logarithm function
(**LOG()**).

To generate Laplace noise for a random variable, we need to:

1. Generate a uniformly distributed random number **U** in the range [−0.5,0.5]

2. Apply the transformation:

L = -b \cdot \text{sign}(U) \cdot \log(1 - 2|U|)

This transformation is necessary to convert the uniform distribution to a Laplace distribution.

Where:

b = \frac{\text{sensitivity}}{\epsilon} \quad \text{(the scale parameter)}

**U** is a random value between -0.5 and 0.5.

**sign(U)** ensures the noise is symmetrically distributed around 0 (positive or negative)

The transformation is necessary because uniform random numbers are not naturally spread out like a Laplace
distribution. Most values from a Laplace distribution cluster around 0, and fewer values are far from 0. By using the
logarithm, we adjust the uniform distribution so that it has this same characteristic: most values are close to 0, but
there is still some chance of larger positive or negative values.
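
Putting the pieces together, here is a minimal SQL sketch of the transformation. The table name some_table and column some_value are illustrative, and a sensitivity of 1.0 with ε = 0.5 is assumed, so b = 2:

SELECT
    some_value,
    some_value + (-LOG(1 - 2 * ABS(u)) * SIGN(u)) * (1.0 / 0.5) AS noisy_value  -- b = sensitivity / epsilon
FROM (
    SELECT some_value, RAND() - 0.5 AS u  -- U uniform in [-0.5, 0.5]
    FROM some_table
);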

How to Decide Sensitivity

Deciding on the sensitivity of a set of Data Distiller Derived Attributes (combining numerical, Boolean, and
categorical attributes) when applying differential privacy requires understanding how much the output of your query
or function can change when a single individual’s data is modified. The sensitivity will depend on the type of Derived
Attribute and the function or model you are using.

The most common practice for finding the sensitivity of a derived attribute in a dataset is to examine the distribution of
values that the derived attribute can take. This involves identifying the maximum change that can occur in the output
of a query when a single individual’s data is added, removed, or modified. The sensitivity of the derived attribute is
essentially the largest possible difference in the query result due to the presence or absence of one individual.
Let’s say you have a dataset of customers and you’re calculating a derived attribute called “total purchase amount”
for each customer. This derived attribute is the sum of all purchases made by the customer over a specific period.

Step 1: Examine the distribution of the “purchase amount” attribute.

Suppose the purchase amounts range from $0 to $1,000.

Step 2: Determine the sensitivity by finding the maximum possible change in the derived attribute when one
customer’s data is added or removed.

In this case, if a customer’s purchases are removed, the maximum change in the “total purchase amount” is
$1,000 (if the customer made the maximum possible purchase of $1,000).

Thus, the sensitivity of the “total purchase amount” derived attribute is $1,000, because removing or adding a single
customer could change the sum by that amount.
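
Before fixing a sensitivity value, it can help to profile the derived attribute directly. A minimal sketch, assuming a hypothetical purchases table with customer_id and purchase_amount columns:

-- Bound how much a single customer can move the aggregate
SELECT
    MIN(customer_total) AS min_total,
    MAX(customer_total) AS max_total  -- a conservative sensitivity estimate
FROM (
    SELECT customer_id, SUM(purchase_amount) AS customer_total
    FROM purchases
    GROUP BY customer_id
);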

How to Decide Epsilon (𝜀)

Once you’ve determined the sensitivity of your derived attributes, the next step in applying differential privacy is to
decide on the privacy parameter known as epsilon (𝜀). Epsilon controls the trade-off between privacy and utility: it
dictates how much noise needs to be added to your query results based on the sensitivity, ensuring that individual data
points are protected.

Let’s continue with the example from earlier where you are calculating the “total purchase amount” for each
customer. You’ve determined that the sensitivity of this derived attribute is $1,000, meaning that the maximum change
in the query result due to one individual’s data is $1,000.

If you choose 𝜀 = 0.1, the noise added to your total purchase amount query will be significant, ensuring strong
privacy. For instance, a query result of $10,000 might be distorted to something like $9,000 or $11,000 due to the
noise.

If you choose 𝜀 = 1.0, the noise added will be much smaller, possibly resulting in the total purchase amount being
reported as $9,950 or $10,050, providing more accuracy but slightly weaker privacy protection.

A value of ε = 0.5 is a solid starting point because it provides a moderate balance between privacy and utility.
It introduces enough noise to protect privacy in many use cases without overly distorting the data. From there, you can
iterate by adjusting the value of epsilon, testing how it impacts both the privacy protection and the accuracy of your
use cases. By gradually increasing or decreasing ε, you can find the optimal balance between privacy needs and the
utility required for your specific analysis.
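
To make this concrete, plug the $1,000 sensitivity from the example above into the scale formula b = sensitivity / ε; a larger scale means wider noise:

b = \frac{1000}{0.1} = 10{,}000 \quad (\epsilon = 0.1), \qquad b = \frac{1000}{0.5} = 2{,}000 \quad (\epsilon = 0.5), \qquad b = \frac{1000}{1.0} = 1{,}000 \quad (\epsilon = 1.0)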

Case Study: Data Distiller Audience Activation

Imagine you’re analyzing healthcare customer data to segment patients based on age, total healthcare spending, and
subscription status to healthcare services. These attributes are essential for tailoring healthcare plans, optimizing
resource allocation, or delivering personalized healthcare recommendations. However, this data involves sensitive
personal and health-related information, which requires a robust privacy-preserving approach.

The columns we’ll include are:

1. PII Columns (to be dropped or anonymized):

**customer_id**: Unique identifier (anonymized).


**name**: Customer’s full name (dropped).

**phone_number**: Contact information (anonymized).

**email**: Email address (anonymized).

**address**: Physical address (dropped).

2. Non-PII Columns (used for marketing/healthcare segmentation):

**age**: Numerical value representing customer age.

**total_spent**: Total healthcare spending by the customer (numerical).

**subscription_status**: Whether the customer has a healthcare subscription plan (boolean).

**gender**: Categorical data.

**country**: Categorical data representing the customer’s location.

**diagnosis_code**: A code representing the medical condition (requires anonymization to protect patient data).

**prescription**: The name of the prescription medicine (requires anonymization).

Let us execute the following query:

SELECT * FROM healthcare_customers;

The result will be:

The PII columns (Personally Identifiable Information) such as name and address are typically dropped in differential
privacy and anonymization processes for the following reasons.

Direct Identifiability:

Name: This is a direct identifier. Names can easily be linked back to specific individuals, making it impossible to
protect privacy if they are included. Simply adding noise or anonymizing other attributes would not protect a
person’s identity if their full name is still present.

Address: Similarly, addresses are highly specific to individuals and can be easily used to trace back to a person.
Even partial addresses or zip codes can be cross-referenced with public records or other data sources to identify
someone.

Many privacy laws and regulations, such as GDPR in Europe and HIPAA in the United States, require the removal of
identifiable data like names and addresses in datasets before sharing or using them for analytics. Keeping such
columns in the dataset would violate these privacy regulations.

Even when other information is anonymized, attackers can perform linkage attacks by combining multiple datasets.
For example, if an attacker knows a person’s address from another dataset, they could link that information with your
dataset if the address is still present.

SELECT
customer_id,
age,
total_spent,
subscription_status,
gender,
country,
diagnosis_code,
prescription
FROM healthcare_customers;

The result is:

Anonymize PII: Hashing & Masking Techniques

Here are some decisions we will make on the remaining columns:

1. Customer ID (Anonymized via Hashing): The customer_id is often used as a key for uniquely identifying
records, linking data across systems, or performing analysis without needing personal details like names. It is
important for analytics purposes to track individuals in a dataset, but it should be anonymized to protect their
identity.

2. Phone Number (Masked to Show Only the Last 4 Digits): The phone number can still provide some valuable
information, such as area code for regional analysis, or the last 4 digits for certain use cases (e.g., verification of
identity, identifying duplicate entries). Masking helps retain partial information for specific analyses.

3. Email (Anonymized via Hashing): Emails are often used for customer communication and identifying duplicates
or tracking interactions. However, email addresses are highly sensitive because they can be linked to an
individual, both within and outside the organization.

Hashing transforms the customer ID and Email into a unique but irreversible code, ensuring that the original ID/Email
cannot be retrieved or linked back to the person. This allows the dataset to retain its uniqueness and analytic power
while ensuring privacy.

By masking most of the digits and revealing only the last 4 digits, we ensure that the phone number is no longer
personally identifiable. The last 4 digits alone are not sufficient to identify someone but may be useful for business
logic purposes (e.g., verifying uniqueness).

Let us execute the following query:

SELECT
SHA2(CAST(customer_id AS STRING), 256) AS anonymized_customer_id,
CONCAT('XXX-XXX-', SUBSTRING(phone_number, -4)) AS masked_phone_number,
SHA2(email, 256) AS anonymized_email,
age,
total_spent,
subscription_status,
gender,
country,
diagnosis_code,
prescription
FROM healthcare_customers;

Observe the results very carefully:

SHA-256 (Secure Hash Algorithm 256-bit) is part of the SHA-2 family of cryptographic hash functions. It generates
a 256-bit (32-byte) hash value, typically represented as a 64-character hexadecimal number.

**SUBSTRING(phone_number, -4)** extracts the last 4 characters of the phone number. The -4 index
indicates that the function should start 4 characters from the end of the string.
We will leave some of these values untouched:

1. Diagnosis Code and Prescription: These columns are critical for certain types of healthcare segmentation (e.g.,
segmenting patients based on medical conditions or treatments).

2. Gender is often used for segmentation (e.g., marketing or healthcare demographic analysis).

3. Leave subscription_status unhashed because it is useful for segmentation and doesn’t reveal personal identity

4. country is typically used for geographic segmentation, which is important for understanding customer behavior
or demographics in different regions.

Data Distiller Statistics: Applying Differential Privacy

Let us now compute the sensitivity and the epsilon for these two variables Age and Total Spent.

Epsilon for **age**: The formula uses ε = 0.5 for the age field.

Sensitivity for **age**: The sensitivity for age is assumed to be 1.0 as the maximum variation in age can be
1.0 across two snapshots of data.

Total Spent

Epsilon for **total_spent**: The same ε = 0.5 is used for the total_spent field.

Sensitivity for **total_spent**: The sensitivity for total_spent is 500, reflecting the assumption that one
individual’s spending could change the total by as much as $500.
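
Plugging these choices into the scale formula gives the constants that appear directly in the noise query below:

b_{\text{age}} = \frac{1.0}{0.5} = 2, \qquad b_{\text{total\_spent}} = \frac{500}{0.5} = 1000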

Let us execute the following to generate the uniform random values for each column and each row:

SELECT
    customer_id, age, total_spent,
    RAND() AS age_random,
    RAND() AS total_spent_random
FROM healthcare_customers;

If you execute:

SELECT
    customer_id,
    ROUND(age + (-LOG(1 - 2 * ABS(age_random - 0.5)) * SIGN(age_random - 0.5)) * (1.0 / 0.5), 0) AS noisy_age,
    ROUND((-LOG(1 - 2 * ABS(age_random - 0.5)) * SIGN(age_random - 0.5)) * (1.0 / 0.5), 0) AS age_diff,
    ROUND(total_spent + (-LOG(1 - 2 * ABS(total_spent_random - 0.5)) * SIGN(total_spent_random - 0.5)) * (500.0 / 0.5), 2) AS noisy_total_spent,
    ROUND((-LOG(1 - 2 * ABS(total_spent_random - 0.5)) * SIGN(total_spent_random - 0.5)) * (500.0 / 0.5), 2) AS total_spent_diff
FROM (
    SELECT
        customer_id, age, total_spent,
        RAND() AS age_random,
        RAND() AS total_spent_random
    FROM healthcare_customers
);

You will get:

Note on Categorical Variables


When dealing with categorical variables in the context of differential privacy, it’s important to consider both the
sensitivity and the cardinality (i.e., the number of unique categories) of the variable. For high-cardinality categorical
features, such as customer locations or product names, applying privacy techniques like one-hot encoding or feature
hashing is common in machine learning tasks. One-hot encoding transforms each category into a binary vector,
where each unique category becomes its own column, making the data more interpretable for machine learning
models. However, this approach can lead to a large number of columns if the cardinality is high, potentially affecting
performance and privacy.

In contrast, feature hashing (also known as the hashing trick) compresses high-cardinality categorical data by
mapping categories to a fixed number of buckets using a hash function. While this reduces the number of columns and
makes the dataset more manageable, it introduces collisions where different categories can be hashed into the same
bucket. When applying differential privacy to categorical variables, it’s important to consider the sensitivity, which
could be influenced by the number of possible categories. High-cardinality variables might require more noise to
ensure privacy, or you could aggregate categories to reduce the cardinality and thus the required sensitivity.

A best practice in hashing is that the number of buckets should be at least equal to the number of distinct category values (the cardinality).

For categorical variables, it is generally safe to assume that the sensitivity is 1 in the context of differential privacy.
This assumption is commonly used when the query involves counting or querying the frequency of categories
because the sensitivity reflects the maximum possible change in the query result when a single individual’s data is
added or removed.
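
For instance, a minimal sketch that releases noisy per-country counts from healthcare_customers, reusing the Laplace expression from earlier with sensitivity 1 and ε = 0.5 (so b = 2):

SELECT
    country,
    ROUND(cnt + (-LOG(1 - 2 * ABS(u)) * SIGN(u)) * (1.0 / 0.5), 0) AS noisy_count
FROM (
    SELECT country, cnt, RAND() - 0.5 AS u  -- one uniform draw per category
    FROM (
        SELECT country, COUNT(*) AS cnt
        FROM healthcare_customers
        GROUP BY country
    )
);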

One-Hot Encoding Example for country column:

In one-hot encoding, each unique country will become its own binary column. For simplicity, let’s assume we have
three countries: USA, Canada, and Germany.

SELECT
customer_id,
age,
total_spent,
subscription_status,
-- One-hot encode the country column
CASE WHEN country = 'USA' THEN 1 ELSE 0 END AS country_usa,
CASE WHEN country = 'Canada' THEN 1 ELSE 0 END AS country_canada,
CASE WHEN country = 'Germany' THEN 1 ELSE 0 END AS country_germany
FROM healthcare_customers;

The result would be:

Feature Hashing Example for **country** column:

In feature hashing, we map the country values to a fixed number of hash buckets. Let’s assume we want to map the
country column to 3 hash buckets.

SELECT
customer_id,
MOD(ABS(HASH(country)), 3) AS hashed_country_bucket
FROM healthcare_customers;

The result would be:

The raw data that we have.

Drop the PII data by not choosing the columns in the SELECT query
See how the values have been hashed and masked

See how the age and total amount spent have added noise.

A low cardinality one hot encoding example

Hashing can be used for high cardinality situations

https://data-distiller.all-stuff-data.com/unit-9-data-distiller-activation-and-data-export/act-300-functions-and-
techniques-for-handling-sensitive-data-with-data-distiller * * *

Download the file:

Ingest the data as **healthcare_customers** dataset using this:

In this tutorial, we’ll demonstrate how to handle sensitive healthcare data by applying various anonymization and
pseudonymization techniques, ensuring compliance with data privacy regulations like GDPR. We’ll use SQL-based
techniques to randomize, mask, and inject noise into the data, using the strategies described in the sections below.

The dataset contains sensitive columns:

customer_id: Unique identifier for each customer.

customer_name: Name of the customer.

phone_number: Customer’s phone number.

email: Customer’s email address.

address: Physical address.

total_spent: Total healthcare spending.

subscription_status: Whether the customer has a subscription plan.

gender: Gender of the customer.

country: Country of the customer.

diagnosis_code: Medical condition code (e.g., ICD-10).

prescription: Prescription given to the customer.

Randomization replaces sensitive data with random values. In this case, customer names will be randomized.

SELECT
customer_id,
CONCAT('User', CAST(FLOOR(RAND() * 10000) AS STRING)) AS randomized_name
FROM
healthcare_customers;

This query replaces customer names with random identifiers like User1234, ensuring names are obfuscated.

Partial Masking of Phone Numbers and Emails


Partial masking hides sensitive information while retaining some of the original content, making it difficult to re-
identify individuals. This technique was used in the tutorial here.

SELECT
customer_id,
CONCAT('XXX-XXX-', SUBSTRING(phone_number, -4)) AS masked_phone,
CONCAT(SUBSTRING(email, 1, 4), '****@domain.com') AS masked_email
FROM
healthcare_customers;

This query partially masks the phone number by displaying only the last 4 digits and obscures part of the email address
while keeping the domain intact.

Pseudonymization (Hashing) of Email Addresses

Pseudonymization is a data protection technique that replaces identifiable information in a dataset with artificial
identifiers or “pseudonyms,” ensuring that the data can no longer be attributed to a specific individual without
additional information. The pseudonymized data can still be analyzed and processed, but the link between the data and
the original identity is severed unless the key to reverse the process (often called a “re-identification key”) is available.

Pseudonymization transforms sensitive data into a hashed format using a cryptographic hash function, making it
irreversible.

SELECT
customer_id,
SHA2(email, 256) AS hashed_email
FROM
healthcare_customers;

When you apply SHA2(email, 256), the email address is transformed into a unique, fixed-length string of characters
using the SHA-256 cryptographic hash function. This process is one-way, meaning once the email is hashed, it’s
virtually impossible to reverse the process and recover the original email. The output will always be 64 characters
long, no matter the size of the input. This is like turning the email into a “digital fingerprint”—each email will have a
distinct hash, but similar-looking emails (like [email protected] and [email protected]) will
have completely different outputs. Hashing is used to protect sensitive information because it hides the original data
while still allowing comparisons between hashed values.

SHA-256 would still work efficiently even if every person on Earth (say 10 billion people) had 1,000 email addresses, resulting in 10 trillion emails. This is because SHA-256 generates a fixed-length 64-character (256-bit) hash for any input, regardless of how many emails exist. The key strength of SHA-256 is that it provides an enormous number of possible hash values (about 2^256, or approximately 10^77), far more than the total number of possible email addresses. This vast range minimizes the chance of collisions (two emails producing the same hash), making it highly reliable for even massive datasets like this.

Data Distiller also supports the MD5 function, which generates a 128-bit hash. For example:

**SELECT customer_id, email, MD5(email) AS hashed_email FROM healthcare_customers;** This function is useful for use cases such as data integrity checks, quickly comparing large datasets, anonymizing data for non-security purposes, and creating partitioning or bucketing keys for efficient data distribution.

**MD5** offers faster performance compared to stronger hashing algorithms like **SHA-256,** making it
suitable for non-sensitive tasks where speed is a priority. However, **MD5** should not be used for cryptographic
purposes or to store sensitive data, as it is vulnerable to hash collisions and security attacks. For security-related
applications, stronger algorithms such as **SHA-256** are recommended.

In K-anonymity, a privacy protection technique, the goal is to ensure that no individual is isolated in a group or
bucket. Each record in the dataset must be indistinguishable from at least K-1 other individuals, based on a
combination of generalized attributes (such as age, region, etc.). 2-anonymity means that the dataset we create should have at least 2 individuals who are identical with respect to the attributes being considered; 3-anonymity means it should have at least 3 such identical individuals, i.e., each bucket we form should contain at least 3 individuals.

The grouping dimensions require careful data exploration and experimentation to identify the right set of attributes that
meet such strict requirements. Once the minimum conditions are met, any new data added will only reinforce the
criteria. As a best practice, you should always double-check the conditions with every dataset activation using this
technique to ensure compliance.

Let us explore a few dimensions and see if our strategy for grouping satisfies 2-anonymity:

WITH GeneralizedHealthcare AS (
SELECT
customer_id,
-- Generalize age into broader age ranges (20-year groups)
CASE
WHEN age BETWEEN 0 AND 19 THEN '0-19'
WHEN age BETWEEN 20 AND 39 THEN '20-39'
WHEN age BETWEEN 40 AND 59 THEN '40-59'
WHEN age BETWEEN 60 AND 79 THEN '60-79'
ELSE '80+'
END AS generalized_age,

-- Generalize country into fewer, broader regions


CASE
WHEN country IN ('Japan', 'China', 'Korea', 'India') THEN 'Asia'
WHEN country IN ('Australia', 'New Zealand') THEN 'Oceania'
WHEN country IN ('France', 'Germany', 'Italy', 'UK') THEN 'Europe'
ELSE 'Other'
END AS region,

diagnosis_code,

-- Generalize prescription into broader categories (example categories)


CASE
    WHEN prescription IN ('Aspirin', 'Ibuprofen') THEN 'Painkillers'
    WHEN prescription IN ('Amoxicillin', 'Azithromycin') THEN 'Antibiotics'
    WHEN prescription IN ('Lisinopril', 'Amlodipine') THEN 'Blood Pressure Meds'
    ELSE 'Other Medications'
END AS generalized_prescription
FROM healthcare_customers
)
SELECT
generalized_age,
region,
diagnosis_code,
generalized_prescription
FROM
GeneralizedHealthcare
GROUP BY
generalized_age, region, generalized_prescription, diagnosis_code
HAVING COUNT(*) == 1;

The query generalizes sensitive healthcare data to ensure privacy by grouping records based on broad categories. First,
it generalizes the age into 20-year ranges (e.g., 0-19, 20-39), and the country is grouped into broad regions
(e.g., Asia, Europe). The prescription field is also generalized into broader categories like Painkillers,
Antibiotics, and Blood Pressure Meds, with any unlisted medications categorized as Other Medications. The dataset
is then grouped by these generalized dimensions, including diagnosis_code. Our hope is that the HAVING COUNT(*) == 1 clause will return no results, since no bucket formed by these grouping dimensions should contain only a single individual.

The execution will show the following:

The current generalization of the **prescription** dimension hasn’t provided sufficient anonymity. Since we
have already bucketed all the other dimensions, the **diagnosis_code** remains as the only ungrouped attribute
and we may decide not to group it. If so, we may need to further generalize the existing dimensions (such as **age,
country**) to better capture larger groups. This highlights an important tradeoff: you’ll need to determine which
dimension is least critical to the use case and can be generalized further, allowing it to include more individuals while
still maintaining a balance between utility and privacy.

Let us try this:

WITH GeneralizedHealthcare AS (
SELECT
customer_id,
-- Generalize age into broader age ranges (three age buckets)
CASE
WHEN age BETWEEN 0 AND 29 THEN '0-29'
WHEN age BETWEEN 30 AND 59 THEN '30-59'
ELSE '60+'
END AS generalized_age,

-- Generalize country into larger, broader regions


CASE
WHEN country IN ('Japan', 'China', 'Korea', 'India', 'Australia',
'New Zealand') THEN 'Asia-Pacific'
WHEN country IN ('France', 'Germany', 'Italy', 'UK', 'Spain') THEN
'Europe'
WHEN country IN ('USA', 'Canada', 'Brazil') THEN 'Americas'
ELSE 'Other Regions'
END AS region,

diagnosis_code,

-- Generalize prescription into broader categories (example categories)


CASE
    WHEN prescription IN ('Aspirin', 'Ibuprofen') THEN 'Painkillers'
    WHEN prescription IN ('Amoxicillin', 'Azithromycin') THEN 'Antibiotics'
    WHEN prescription IN ('Lisinopril', 'Amlodipine') THEN 'Blood Pressure Meds'
    ELSE 'Other Medications'
END AS generalized_prescription
FROM healthcare_customers
)
SELECT
generalized_age,
region,
diagnosis_code,
generalized_prescription
FROM
GeneralizedHealthcare
GROUP BY
generalized_age, region, diagnosis_code, generalized_prescription
HAVING COUNT(*) == 1;

This returns:

Exercise: What other techniques from the previous sections could you apply to solve this? Remember, you can often
achieve better results by combining multiple techniques.

Noise Injection for age and total_spent

Noise injection adds random noise to numeric data to obscure exact values while retaining overall trends. Please
explore the techniques for differential privacy in the tutorial here.

Substitution of Sensitive Data

Substitution replaces specific sensitive values with predefined values consistently across the entire dataset.

WITH RandomizedDiagnosis AS (
SELECT
diagnosis_code,
CONCAT(
CHAR(FLOOR(RAND() * 26) + 65), -- First random letter (A-Z)
CHAR(FLOOR(RAND() * 26) + 65), -- Second random letter (A-Z)
CHAR(FLOOR(RAND() * 26) + 65) -- Third random letter (A-Z)
) AS random_code
FROM
(SELECT DISTINCT diagnosis_code FROM healthcare_customers) AS
distinct_codes
)
SELECT
hc.customer_id,
rd.random_code AS substituted_diagnosis_code
FROM
healthcare_customers hc
JOIN
RandomizedDiagnosis rd
ON
hc.diagnosis_code = rd.diagnosis_code;

The result will be the following:

This query consistently replaces each unique **diagnosis_code** in the **healthcare_customers** table with a randomly generated three-letter code. It uses a Common Table Expression (CTE),
**RandomizedDiagnosis**, to generate a unique mapping for each distinct diagnosis code by creating a
random three-letter string (using ASCII values for letters A-Z). The **DISTINCT** clause ensures that each
diagnosis code only gets one random substitute. In the main query, the original table is joined with the CTE on the
diagnosis_code, ensuring that every instance of the same diagnosis code across the dataset is consistently
replaced with the same random string. This approach provides a secure and consistent substitution of sensitive
diagnosis codes, allowing for privacy while maintaining consistency for analysis.

Note that we did not use a subquery but instead used a CTE (Common Table Expression). The key benefit of a CTE over a subquery approach is readability and reusability. CTEs allow you to define a temporary result set that can be referenced multiple times within the same query, making the SQL easier to read and maintain, especially when dealing with complex queries.

For example, in the query provided, the CTE **RandomizedDiagnosis** allows the distinct diagnosis codes
and their randomized substitutions to be computed once and then reused in the main query. This makes the code
cleaner and separates the logic of generating random substitutions from the actual join operation. If you were to use a
subquery, you’d potentially have to repeat the subquery each time it’s needed, making the SQL harder to understand
and more error-prone if changes are required in multiple places.

Full Masking of Address and Prescription Data

In some cases, you may want to fully mask sensitive data fields to ensure privacy.

SELECT
customer_id,
REPEAT('*', LENGTH(address)) AS masked_address,
REPEAT('*', LENGTH(prescription)) AS masked_prescription
FROM
healthcare_customers;

This query masks the address and prescription fields entirely by replacing each character with an asterisk (*),
making the data unreadable.

Shuffling data between records makes it harder to link records to specific individuals while maintaining the overall
data distribution.

WITH ShuffledData AS (
    SELECT
        customer_id,
        total_spent,
        ROW_NUMBER() OVER (ORDER BY customer_id) AS original_row,  -- original row order
        ROW_NUMBER() OVER (ORDER BY RAND()) AS shuffled_row        -- shuffled row order
    FROM healthcare_customers
)
SELECT
    original.customer_id,
    original.total_spent AS original_total_spent,
    shuffled.total_spent AS shuffled_total_spent
FROM ShuffledData original
JOIN ShuffledData shuffled
  ON original.shuffled_row = shuffled.original_row;  -- Match original row to shuffled row

This query performs shuffling of the **total_spent** values in the **healthcare_customers** dataset
while maintaining a clear tracking of how the values have been shuffled. It uses a Common Table Expression (CTE)
to assign two row numbers: one based on the original order of the customers (**original_row**) and another
based on a random order (**shuffled_row**).

By joining the CTE on these row numbers, the query reassigns the **total_spent** values according to the
shuffled row while preserving the original values.

This query may take some time to execute, so be prepared for a possible timeout in Ad Hoc Query Mode (Data Distiller Exploration). It is recommended to use Batch Query Mode by employing the **CREATE TABLE AS** command instead.
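
For example, here is a minimal sketch of running the same shuffle as a batch job with CREATE TABLE AS (the target dataset name shuffled_spend is just an illustrative placeholder):

CREATE TABLE shuffled_spend AS
WITH ShuffledData AS (
    SELECT
        customer_id,
        total_spent,
        ROW_NUMBER() OVER (ORDER BY customer_id) AS original_row,
        ROW_NUMBER() OVER (ORDER BY RAND()) AS shuffled_row
    FROM
        healthcare_customers
)
SELECT
    original.customer_id,
    shuffled.total_spent AS shuffled_total_spent
FROM
    ShuffledData original
JOIN
    ShuffledData shuffled
ON
    original.shuffled_row = shuffled.original_row;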

You can use encryption to obfuscate sensitive data (e.g., customer names, email addresses) to make it unreadable to
unauthorized users. This helps protect data from being exposed in logs, intermediate processing, or unauthorized
queries.

The **aes_encrypt** and **aes_decrypt** functions in Data Distiller are used for encryption and decryption of data using the Advanced Encryption Standard (AES). These functions can certainly be useful in obfuscation use cases, but they serve a broader purpose beyond just obfuscation. They are particularly useful for ensuring compliance with data security regulations such as GDPR or HIPAA, where data (e.g., PII, financial data, or medical records) needs to be encrypted when stored at rest or in transit.

Unlike hashing, which is a one-way process, AES encryption is reversible, allowing the data to be decrypted when
needed by authorized users or systems. This is essential in use cases where you need to retrieve the original data later.

For example, encrypted customer records can be decrypted by an authorized user or system to retrieve the original
information for processing.

You can read about this in the tutorial here.

There are 70 records that are single records in each of these 70 buckets that do not preserve 2-anonymity.

Our generalization has reduced the number from 70 to 1.

The codes are randomized but consistent

Full masking of some data fields.

https://data-distiller.all-stuff-data.com/unit-9-data-distiller-activation-and-data-export/act-400-aes-data-encryption-
and-decryption-with-data-distiller * * *

1. UNIT 9: DATA DISTILLER ACTIVATION & DATA EXPORT

ACT 400: AES Data Encryption & Decryption with Data Distiller
Secure your sensitive data with AES encryption - a robust, industry-standard way to protect customer information,
while easily decrypting it when needed.

Last updated 5 months ago

Download the file:


Ingest the data as **healthcare_customers** dataset using this:

Also recommended

Why Support AES (Advanced Encryption Standard)?

AES (Advanced Encryption Standard) support in Data Distiller enhances data security and aligns with industry
standards. AES is the most popular symmetric encryption algorithm, widely trusted for its speed, efficiency, and strong
security across industries like finance, healthcare, and cloud services. Its ability to encrypt large volumes of data
efficiently makes it a superior choice over asymmetric algorithms like RSA, which, while highly secure, is slower and
typically used for specific tasks like key exchanges and digital signatures rather than large-scale encryption.

Data Distiller includes support for encryption modes like GCM (Galois/Counter Mode), which is the most favored
mode due to its dual ability to provide both encryption and data integrity. This makes it ideal for protecting sensitive
data in secure communications, cloud storage, and large-scale enterprise operations.

In comparison to asymmetric encryption like RSA, which requires different keys for encryption and decryption, AES
uses a single key, making it not only faster but also easier to manage in environments where large amounts of data
need to be securely processed and stored. While RSA is excellent for securing small, highly sensitive pieces of data
and key exchanges, AES is the gold standard for encrypting bulk data efficiently and securely.

AES support in Data Distiller ensures fast, scalable, secure, and robust data protection needed to meet regulatory
standards like GDPR and HIPAA, while also offering high performance for enterprise use cases.

AES and Its Encryption Modes in Data Distiller

AES (Advanced Encryption Standard) is one of the most widely used and trusted methods for encrypting data. It’s
employed globally to secure sensitive information, from financial transactions to personal communications. AES
works by converting plain text data into an unreadable format, known as ciphertext, using a secret key. Only someone
with the correct key can decrypt the data back into its original form.

AES in Data Distiller comes in two key sizes: 128-bit and 256-bit, with the larger 256-bit key providing stronger security. AES-256 is the most widely used; it offers the highest level of security, making it ideal for safeguarding sensitive data in industries like finance, healthcare, and government. AES-256 strikes a balance between security and performance, making it the preferred choice for robust encryption needs, especially where long-term data protection is critical.

However, AES doesn't work alone: it uses different modes to encrypt and process data. These modes define how data is broken down and transformed, offering varying levels of security and performance. The most common modes are GCM (Galois/Counter Mode), ECB (Electronic Codebook Mode), and CBC (Cipher Block Chaining), each serving different purposes.

GCM (Galois/Counter Mode) is highly regarded for its speed and security. It not only encrypts data but also ensures
that it hasn’t been tampered with, making it ideal for secure communications. GCM is especially useful in scenarios
where both confidentiality and data integrity are important.

ECB (Electronic Codebook Mode) is the simplest and fastest mode, but also the least secure. In ECB, each block of
data is encrypted independently, meaning identical pieces of input will result in identical encrypted output. While this
makes ECB efficient, it can expose patterns in the data, making it less suitable for sensitive information.

Along with these modes, AES often relies on padding to ensure that data fits perfectly into the blocks required for
encryption. For example, PKCS padding is commonly used to fill gaps when data doesn’t perfectly match the block
size. In some modes, like GCM, padding isn’t required, making the encryption process more efficient.
The most popular mode of operation for AES encryption is GCM (Galois/Counter Mode). GCM is widely favored
because it provides both data confidentiality (encryption) and data integrity (authentication) in a highly efficient
manner. Its ability to ensure that data hasn’t been tampered with while being transmitted, combined with its speed and
performance, makes it ideal for modern applications, including secure communications, cloud services, and network
encryption. GCM’s versatility and security features have made it the go-to mode in many industry-standard
implementations.

Together, AES and its modes offer a versatile set of tools for protecting data in a wide range of scenarios, from high-
security communications to everyday data protection. Whether you need speed, security, or flexibility, AES provides
the foundation for keeping sensitive information safe.

CBC (Cipher Block Chaining) offers strong security by linking each block of data with the previous one. This
chaining makes it difficult for an attacker to spot patterns in the encrypted data, even if the input has repeated
elements. CBC is slower than GCM due to its sequential nature but is still widely used for its robustness. This feature
is yet to be released in Data Distiller.

Data Distiller does not currently support asymmetric encryption natively. Asymmetric encryption (which uses a pair
of keys: a public key for encryption and a private key for decryption) is not provided as part of the built-in functions in
Data Distiller.

Data Distiller primarily supports symmetric encryption functions with AES (Advanced Encryption Standard) for
data encryption and decryption.

If you need asymmetric encryption (e.g., RSA), you would typically need to implement this outside of Data Distiller
using external libraries in Python or Java, or through integration with a third-party encryption service.

Since Data Distiller supports AES for symmetric encryption, a single secret key is used for both encrypting and
decrypting data. This means that the same key must be securely shared between the parties involved in exchanging
information. The key is the critical element: anyone who has access to it can decrypt the encrypted data. Therefore,
protecting the key itself is essential to maintaining the security of the data.

Symmetric encryption, like AES, is typically faster than asymmetric encryption, making it ideal for efficiently securing
large volumes of data. However, this approach requires careful key management to ensure that unauthorized
individuals cannot access or compromise the key, as this would undermine the entire encryption process.

The generalized syntax is:

aes_encrypt(expr, key, mode [, padding])

**expr**: The data to be encrypted.

**key**: The binary key (use UNHEX() for hexadecimal key).

**mode**: Encryption mode (case-insensitive).

'ECB': Electronic CodeBook mode.

'GCM': Galois/Counter Mode (default mode).

**padding** (optional): Padding scheme (case-insensitive).

'NONE': No padding (for 'GCM' mode only).

'PKCS': Public Key Cryptography Standards padding (for 'ECB' mode).

'DEFAULT': Uses 'NONE' for 'GCM' and 'PKCS' for 'ECB'.


The generalized syntax is:

aes_decrypt(expr, key, mode [, padding])

**expr**: The binary data to be decrypted (typically stored as hex, so use UNHEX()).

**key**: The binary key (use UNHEX() for hexadecimal key).

**mode**: Decryption mode (must match the encryption mode).

'ECB': Electronic CodeBook mode.

'GCM': Galois/Counter Mode (default mode).

**padding** (optional): Padding scheme (must match the encryption padding).

'NONE': No padding (for 'GCM' mode only).

'PKCS': Public Key Cryptography Standards padding (for 'ECB' mode).

'DEFAULT': Uses 'NONE' for 'GCM' and 'PKCS' for 'ECB'.
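
Before applying these functions to a dataset, a quick round-trip sanity check on a string literal can be helpful. Here is a minimal sketch using the default GCM mode and the same illustrative 256-bit hex key used in the examples below:

SELECT
    CAST(
        AES_DECRYPT(
            AES_ENCRYPT('test-value',
                UNHEX('6BB8E32DB365D1953C95377C547330B52FAF9C35C9350A2BA1FC5CB4651D28E9')),
            UNHEX('6BB8E32DB365D1953C95377C547330B52FAF9C35C9350A2BA1FC5CB4651D28E9')
        ) AS STRING
    ) AS round_trip_value; -- should return 'test-value'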

Understanding GCM and ECB Modes

GCM and ECB are different methods (or modes) of encrypting data. GCM (Galois/Counter Mode) is like locking
your data with a secure padlock, but with an additional layer of protection to ensure that no one has tampered with it.
This mode not only encrypts the data but also verifies its integrity, making it highly secure and fast. It is often used for
secure communication, where speed and data integrity are critical.

ECB (Electronic Codebook Mode) treats each chunk of data the same way, without any chaining. It’s like putting
each letter of a message in the same type of envelope, without considering the surrounding letters. This makes ECB
fast but predictable, as identical chunks of data will produce identical encrypted output. Because of this, ECB is
considered less secure than GCM since it can reveal patterns in the data.
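
To see this predictability concretely, here is a small sketch (using the same illustrative key as the examples later in this tutorial): encrypting the same value twice under ECB yields identical ciphertext, whereas GCM, which uses a random IV, typically produces different ciphertext on every call.

SELECT
    HEX(AES_ENCRYPT('repeat-me',
        UNHEX('6BB8E32DB365D1953C95377C547330B52FAF9C35C9350A2BA1FC5CB4651D28E9'),
        'ECB', 'PKCS')) AS ecb_first,
    HEX(AES_ENCRYPT('repeat-me',
        UNHEX('6BB8E32DB365D1953C95377C547330B52FAF9C35C9350A2BA1FC5CB4651D28E9'),
        'ECB', 'PKCS')) AS ecb_second; -- ecb_first and ecb_second are identical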

In encryption, padding refers to filling in extra spaces when the data doesn’t perfectly fit the required block size
(usually 16 bytes). Imagine you have a box that fits exactly 16 letters, but your message is only 13 letters long.
Padding is like adding extra filler to make the message fit perfectly.

PKCS (Public Key Cryptography Standards) is a widely used method for padding. It adds extra characters to fill the
gaps, making sure the data fits the block size. When the data is decrypted, the system knows how to remove the
padding. In contrast, NONE means no padding is added, which only works if the data already fits the block size
perfectly. This is commonly used in GCM mode, where padding isn’t required.

AAD (Additional Authenticated Data) is a feature in GCM mode that allows you to include extra information (such
as metadata) alongside your encrypted data. This extra information isn’t encrypted, but it is part of the secure process
and helps ensure that the message hasn’t been tampered with. Think of it as adding an extra label on a package,
indicating who sent it or when it was sent. While the label itself isn’t hidden, it’s essential to verify that the
information hasn’t been altered. AAD is useful in situations where the integrity of this additional information is
important for verifying the authenticity of the message.

This feature is yet to be released in Data Distiller.

AES is a type of symmetric encryption. In symmetric encryption, the same key is used for both encrypting and
decrypting data. This means that the person or system encrypting the data and the one decrypting it must both have
access to the same secret key. Since AES is symmetric, the security of the system depends on keeping the key
confidential. If someone gains access to the key, they can both encrypt and decrypt the data. Before using these
functions, you will need to generate a key, securely track it, and store it in a secure vault.

The key should be kept in a secure key management system (KMS) or a hardware security module (HSM). These
systems are designed to securely store, manage, and control access to encryption keys, preventing unauthorized access.
Popular cloud providers like AWS, Google Cloud, and Azure offer managed KMS services, which automate the secure
storage and handling of keys. By using a KMS or HSM, you can ensure that the key is protected, access is tightly
controlled, and audit logs are maintained for compliance with security standards.

-- Generate a random 16-byte key (32 hexadecimal characters)


SELECT
UPPER(SUBSTRING(SHA2(CAST(RAND() AS STRING), 256), 1, 32)) AS
generated_16_byte_key;

The query above generates hexadecimal characters, but the **aes_encrypt** and **aes_decrypt** functions require binary values. Therefore, you need to use the **unhex(generated_16_byte_key)** function in Data Distiller to convert the hexadecimal key into the required binary format.

-- Generate a random 32-byte key (64 hexadecimal characters)


SELECT
UPPER(SHA2(CAST(RAND() AS STRING), 256)) AS generated_32_byte_key;

The query above generates hexadecimal characters, but the **aes_encrypt** and **aes_decrypt** functions require binary values. Therefore, you need to use the **unhex(generated_32_byte_key)** function in Data Distiller to convert the hexadecimal key into the required binary format.

AES-256 Encryption & Decryption with GCM (Default Mode, No Padding)

Let us demonstrate how the encryption and decryption work. Note that we are using the HEX and CAST functions purely for displaying the results, since binary values cannot be displayed in the Data Distiller Query Pro Mode Editor. You should remove them when using these two functions:

WITH EncryptedData AS (
-- Step 1: Encrypt the email and convert the encrypted binary data into a
readable hex string
SELECT
customer_id,
HEX(AES_ENCRYPT(email,
UNHEX('6BB8E32DB365D1953C95377C547330B52FAF9C35C9350A2BA1FC5CB4651D28E9'))) AS
encrypted_email_hex
FROM
healthcare_customers
)
-- Step 2: Decrypt the encrypted email and cast it back to STRING
SELECT
customer_id,
encrypted_email_hex, -- Display encrypted email as hex string
CAST(AES_DECRYPT(UNHEX(encrypted_email_hex),
UNHEX('6BB8E32DB365D1953C95377C547330B52FAF9C35C9350A2BA1FC5CB4651D28E9')) AS
STRING) AS decrypted_email
FROM
EncryptedData;

The result should be:


AES-256 Encryption & Decryption with ECB Mode and PKCS Padding

WITH EncryptedData AS (
-- Step 1: Encrypt email using AES-256 with ECB mode and PKCS padding
SELECT
customer_id,
HEX(AES_ENCRYPT(email,
UNHEX('6BB8E32DB365D1953C95377C547330B52FAF9C35C9350A2BA1FC5CB4651D28E9'),
'ECB', 'PKCS')) AS encrypted_email_hex
FROM
healthcare_customers
)
-- Step 2: Decrypt the encrypted email using the same key, mode, and padding
SELECT
customer_id,
encrypted_email_hex,
CAST(AES_DECRYPT(UNHEX(encrypted_email_hex),
UNHEX('6BB8E32DB365D1953C95377C547330B52FAF9C35C9350A2BA1FC5CB4651D28E9'),
'ECB', 'PKCS') AS STRING) AS decrypted_email
FROM
EncryptedData;

The Genius of Galois: His Math Powers Modern Encryption

GCM (Galois/Counter Mode) is a mode of operation for encryption that ties back to the innovative work of
mathematician Évariste Galois, whose contributions to abstract algebra, specifically Galois fields, play a pivotal role
in how GCM operates.

What makes GCM special—and really cool—is that it combines both encryption and authentication in a highly
efficient way, ensuring not only that data is protected, but also that it hasn’t been tampered with during transmission.
This dual capability is crucial for modern data security.

At the heart of GCM’s strength is its use of Galois fields, a concept developed by Galois in the 19th century, which
involves operations on finite sets of numbers. In GCM, these fields enable fast and secure mathematical operations that
verify data integrity while keeping the encryption itself highly efficient.

What’s particularly cool about this is that Galois, who tragically died young, couldn’t have foreseen how his abstract
work in algebra would one day become foundational in securing digital communications in the 21st century. By
leveraging the power of Galois fields, GCM mode manages to be both faster and more secure than many other
encryption modes, making it a go-to solution for protecting sensitive data, especially in high-performance
environments like cloud computing and secure messaging.

So, when using AES with GCM mode, you’re benefiting from the mathematical genius of Galois—applying 19th-
century mathematics to cutting-edge digital encryption!

Demonstration of AES encryption and decryption in Data Distiller

https://data-distiller.all-stuff-data.com/unit-9-data-distiller-functions-and-extensions/func-300-privacy-functions-in-
data-distiller [

Adobe Data Distiller Guide


](https://data-distiller.all-stuff-data.com/)

1. UNIT 9: DATA DISTILLER FUNCTIONS & EXTENSIONS

FUNC 300: Privacy Functions in Data Distiller


Tutorials from other sections that cover this topic in detail

Last updated 5 months ago

ACT 200: Dataset Activation: Anonymization, Masking & Differential Privacy Techniques

ACT 300: Functions and Techniques for Handling Sensitive Data with Data Distiller

ACT 400: AES Data Encryption & Decryption with Data Distiller

https://data-distiller.all-stuff-data.com/unit-9-data-distiller-functions-and-extensions/func-400-statistics-functions-in-
data-distiller [

Adobe Data Distiller Guide

](https://data-distiller.all-stuff-data.com/)

1. UNIT 9: DATA DISTILLER FUNCTIONS & EXTENSIONS

FUNC 400: Statistics Functions in Data Distiller


Last updated 4 months ago

STATSML 400: Data Distiller Basic Statistics Functions

https://data-distiller.all-stuff-data.com/unit-9-data-distiller-functions-and-extensions/func-600-advanced-statistics-and-
machine-learning-functions [

Adobe Data Distiller Guide

](https://data-distiller.all-stuff-data.com/)

1. UNIT 9: DATA DISTILLER FUNCTIONS & EXTENSIONS

FUNC 600: Advanced Statistics & Machine Learning Functions


Last updated 4 months ago

STATSML 600: Data Distiller Advanced Statistics & Machine Learning Models


https://data-distiller.all-stuff-data.com/about-the-author * * *

About the Author


Last updated 4 months ago

This site is maintained by Saurabh Mahapatra, who has experience in computer vision, robotics, virtual reality,
systems engineering, simulation, and big data analytics. He has also worked extensively in research, focusing on
dynamical systems, neural networks, neuroscience, and medical devices, including efforts to help improve tendon
surgeries for stroke-impaired patients. His research has also covered machine learning techniques for image
recognition and AI-driven vision interpretation in robotics.

Currently, Saurabh applies his knowledge to solve practical problems in HR and marketing. Previously, he worked at
MathWorks, where he was known as the “Simulink Dude” for his collection of Simulink examples.

This website is not affiliated with Adobe or any other company Saurabh has worked with. For feedback or suggestions,
feel free to reach out on LinkedIn.

https://data-distiller.all-stuff-data.com/unit-9-data-distiller-functions-and-extensions/func-500-lambda-functions-in-
data-distiller-exploring-similarity-joins * * *

1. UNIT 9: DATA DISTILLER FUNCTIONS & EXTENSIONS

FUNC 500: Lambda Functions in Data Distiller: Exploring Similarity Joins
The goal of similarity join is to identify and retrieve similar or related records from one or more datasets based on a
similarity metric.

Here are some common use cases

1. Data Deduplication: In data cleansing tasks, similarity join can help identify and remove duplicate records from
a dataset.

2. Record Linkage: Similarity join is used in record linkage or identity resolution to identify and link records that
represent the same real-world identities across multiple datasets.

3. Recommendation Systems: In collaborative filtering-based recommendation systems, similarity join is used to find users or items with similar preferences.

4. Information Retrieval: In information retrieval and text search, similarity join is used to retrieve documents,
articles, or web pages that are similar to a given query or document.

5. Text Analytics: In natural language processing (NLP) and text analysis, similarity join is used to compare and
group similar text documents, sentences, or phrases. It’s applied in document clustering and topic modeling.

What is a Similarity Join?

A similarity join is an operation that identifies and retrieves pairs of records from one or more tables based on a
measure of similarity between the records.
Key requirements for a similarity join:

1. Similarity Metric: A similarity join relies on a predefined similarity metric or measure, such as Jaccard
similarity, cosine similarity, edit distance, or others, depending on the nature of the data and the use case. This
metric quantifies how similar or dissimilar two records are.

2. Threshold: A similarity threshold is often defined to determine when two records are considered similar enough
to be included in the join result. Records with a similarity score above the threshold are considered matches.

Jaccard Similarity Measure

The Jaccard similarity measure is popular in many applications because of its simplicity, effectiveness, and
applicability to a wide range of problems. It determines the similarity between two sets by measuring the ratio of the
size of their intersection to the size of their union. It can be applied to a wide range of data types, including text data,
categorical data, and binary data. Calculating Jaccard similarity can be computationally efficient for large datasets,
making it suitable for real-time or batch processing.

The Jaccard similarity coefficient, often denoted as J(A, B), is defined as:

J(A, B) = |A ∩ B| / |A ∪ B|

Where:

|A ∩ B| represents the size (number of elements) of the intersection of sets A and B.

|A ∪ B| represents the size of the union of sets A and B.

The Jaccard similarity coefficient ranges from 0 to 1:

A Jaccard similarity of 0 indicates no similarity between the sets (completely dissimilar).

A Jaccard similarity of 1 indicates that the sets are identical (completely similar).

Here’s a simple example to illustrate Jaccard similarity:

Suppose we have two product sets, A and B, representing the words in two documents:

Product Set A: {iPhone, iPad, iWatch, iPad Mini}

Product Set B: {iPhone, iPad, Macbook Pro}

To calculate the Jaccard similarity between product sets A and B:

1. Find the intersection of product sets A and B (common elements): {iPhone, iPad}

2. Find the union of product sets A and B (all unique elements): {iPhone, iPad, iWatch, iPad Mini, Macbook Pro}

Now, use the Jaccard similarity formula:

J(A, B) = |A ∩ B| / |A ∪ B| = 2/5 = 0.4

So, the Jaccard similarity between product sets A and B is 0.4, indicating a moderate degree of similarity between the
words used in the two documents.
This is the similarity between the two sets that will become the two columns of our join. But we need the pairwise similarity between each element in Set A and each element in Set B.
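
Before moving to the bigram level, you can sanity-check this set-level arithmetic directly in Data Distiller with array literals (a minimal sketch; the product names are lowercased with spaces removed, as we do later in this tutorial):

SELECT
    ROUND(
        CAST(size(array_intersect(
            array('iphone', 'ipad', 'iwatch', 'ipadmini'),
            array('iphone', 'ipad', 'macbookpro'))) AS DOUBLE)
        /
        size(array_union(
            array('iphone', 'ipad', 'iwatch', 'ipadmini'),
            array('iphone', 'ipad', 'macbookpro'))),
        2
    ) AS jaccard_similarity; -- returns 0.4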

Pairwise Jaccard Computation with String Similarity

We want to be able to compare a similarity match between the text strings of the products in Set A and Set B.

Let’s assume we’re using character bigrams (2-grams) for this calculation. A 2-gram, also known as a bigram, is a
consecutive sequence of two items or elements in a given sequence or text. And you can generalize this to n-grams.
Assume that the case does not matter and that spaces will not be accounted for. With these assumptions, we have:

Product Set A can be split into these 2-grams:

iPhone (5): “ip”, “ph”, “ho”, “on”, “ne”

iPad (3): “ip”, “pa”, “ad”

iWatch (5): “iw”, “wa”, “at”, “tc”, “ch”

iPad Mini (7): “ip”, “pa”, “ad”, “dm”, “mi”, “in”, “ni”

Product Set B:

iPhone (5): “ip”, “ph”, “ho”, “on”, “ne”

iPad (3): “ip”, “pa”, “ad”

Macbook Pro (9): “ma”, “ac”, “cb”, “bo”, “oo”, “ok”, “kp”, “pr”, “ro”

Now, calculate the Jaccard similarity coefficient for each pair:

1. iPhone (Set A) with iPhone (Set B):

Jaccard Similarity Index: (Intersection: 5, Union: 5) = 5 / 5 = 1

2. iPhone (Set A) with iPad (Set B):

Jaccard Similarity Index: (Intersection: 1, Union: 7) = 1 / 7 ≈ 0.14

3. iPhone (Set A) with Macbook Pro (Set B):

Jaccard Similarity Index: (Intersection: 0, Union: 14) = 0 / 14 = 0

4. iPad (Set A) with iPhone (Set B):

Jaccard Similarity Index: (Intersection: 1, Union: 7) = 1 / 7 ≈ 0.14

5. iPad (Set A) with iPad (Set B):

Jaccard Similarity Index: (Intersection: 3, Union: 3) = 3 / 3 = 1

6. iPad (Set A) with Macbook Pro (Set B):

Jaccard Similarity Index: (Intersection: 0, Union: 12) = 0 / 12 = 0

7. iWatch (Set A) with iPhone (Set B):


Jaccard Similarity Index: (Intersection: 0, Union: 10) = 0 / 10 = 0

8. iWatch (Set A) with iPad (Set B):

Jaccard Similarity Index: (Intersection: 0, Union: 8) = 0 / 8 = 0

9. iWatch (Set A) with Macbook Pro (Set B):

Jaccard Similarity Index: (Intersection: 0, Union: 14) = 0 / 14 = 0

10. iPad Mini (Set A) with iPhone (Set B):

* Jaccard Similarity Index: (Intersection: 1, Union: 11) = 1 / 11 ≈ 0.09

11. iPad Mini (Set A) with iPad (Set B):

* Jaccard Similarity Index: (Intersection: 3, Union: 7) = 3 / 7 ≈ 0.43

12. iPad Mini (Set A) with Macbook Pro (Set B):

* Jaccard Similarity Index: (Intersection: 0, Union: 16) = 0 / 16 = 0

We just need a threshold to identify which pairs are truly good matches; the right threshold depends on the dataset itself.

Let us create a test table out of the example values above manually:

CREATE TABLE featurevector1 AS
SELECT * FROM (
    SELECT 'iPad' AS ProductName
    UNION ALL
    SELECT 'iPhone'
    UNION ALL
    SELECT 'iWatch'
    UNION ALL
    SELECT 'iPad Mini'
);
SELECT * FROM featurevector1;

Just to make sure we understand the SQL code:

CREATE TABLE featurevector1 AS: This statement creates a table named featurevector1 from the subquery that follows. (If you use CREATE TEMP TABLE instead, the table is only accessible within the current session and is automatically dropped at the end of the session.)

SELECT * FROM (...): This part of the code is a subquery used to generate the data that will be inserted
into the featurevector1 table.

Inside the subquery, there are multiple SELECT statements combined using UNION ALL. Each SELECT
statement generates one row of data with the specified values for the ‘ProductName’ column.

SELECT 'iPad' AS ProductName: This generates a row with the value ‘iPad’ in the ‘ProductName’
column.

SELECT 'iPhone': This generates a row with the value ‘iPhone’ in the ‘ProductName’ column.
The result will be:

Similarly, we can also create the second feature vector that looks like the following:

CREATE TABLE featurevector2 AS
SELECT * FROM (
    SELECT 'iPad' AS ProductName
    UNION ALL
    SELECT 'iPhone'
    UNION ALL
    SELECT 'Macbook Pro'
);
SELECT * FROM featurevector2;

Old Fashioned Tokenization

Tokenization or text splitting is the process of taking text (such as a sentence) and breaking it into individual terms
(usually words).

In our case, we need to do several things:

1. We will assume that whitespaces do not contribute to the similarity measure and we will get rid of them in our
feature vectors.

2. If there are duplicates present in the feature vector, they waste computation. We should get rid of them.

3. We will need to extract tokens of 2 characters, also called 2-grams or bigrams. In our case, we will assume that they are overlapping.

In each of the steps, we will keep adding the processed columns right next to the feature vector for illustration
purposes only.

We will use the DISTINCT clause to remove duplicates

SELECT DISTINCT(ProductName) AS featurevector1_distinct FROM featurevector1

SELECT DISTINCT(ProductName) AS featurevector2_distinct FROM featurevector2

In our example, this is trivial as there are no duplicates.

To remove whitespaces that we have in our example, use the following:

SELECT DISTINCT(ProductName) AS featurevector1_distinct, replace(ProductName, ' ', '') AS featurevector1_nospaces FROM featurevector1

replace(ProductName, ' ', '') AS featurevector1_nospaces: In this part of the query, it takes the "ProductName" column from the "featurevector1" table and uses the REPLACE function. The REPLACE function replaces all occurrences of a space (' ') with an empty string (''). This effectively removes all spaces from the "ProductName" values. The result is aliased as "featurevector1_nospaces."

The results are:

SELECT DISTINCT(ProductName) AS featurevector2_distinct, replace(ProductName, ' ', '') AS featurevector2_nospaces FROM featurevector2
Use the following code:

SELECT DISTINCT(ProductName) AS featurevector1_distinct,
    lower(replace(ProductName, ' ', '')) AS featurevector1_transform
FROM featurevector1;

lower(...): The lower function is applied to the result of the REPLACE function. The lower function is used to
convert all characters in the modified “ProductName” values to lowercase. This ensures that the values are in
lowercase regardless of their original casing.

The result will be:

The same would go for the other feature vector:

SELECT DISTINCT(ProductName) AS featurevector2_distinct,
    lower(replace(ProductName, ' ', '')) AS featurevector2_transform
FROM featurevector2;

The result will be:

To create the tokens, we will use regexp_extract_all

SELECT DISTINCT(ProductName) AS featurevector1_distinct,
    lower(replace(ProductName, ' ', '')) AS featurevector1_transform,
    regexp_extract_all(lower(replace(ProductName, ' ', '')), '.{2}', 0) AS tokens
FROM featurevector1;

Some code explanation:

1. regexp_extract_all(lower(replace(ProductName, ' ', '')), '.{2}', 0) AS tokens: This part of the query further processes the modified "ProductName" values created in the previous step. It uses the regexp_extract_all function to extract all non-overlapping two-character substrings from the modified, lowercase "ProductName" values. The '.{2}' regular expression pattern matches substrings of exactly 2 characters in length.

2. regexp_extract_all(..., '.{2}', 0): This function extracts all matching substrings from the input text.

The results will be:

We have a problem - we need to create overlapping tokens. For example, the “iPad” string above is missing “pa”.

Let us fix that by shifting the lookahead operator (using substring) by one step and generating the bigrams:

SELECT DISTINCT(ProductName) AS featurevector1_distinct,
    lower(replace(ProductName, ' ', '')) AS featurevector1_transform,
    array_union(
        regexp_extract_all(lower(replace(ProductName, ' ', '')), '.{2}', 0),
        regexp_extract_all(lower(replace(substring(ProductName, 2), ' ', '')), '.{2}', 0)
    ) AS tokens
FROM featurevector1;

1. regexp_extract_all(lower(replace(substring(ProductName, 2), ' ', '')), '.{2}', 0): Similar to the previous step, this extracts two-character sequences from the modified product name, but it starts from the second character (substring) to create overlapping tokens.

2. array_union(...) AS tokens: The array_union function combines the arrays of two-character sequences obtained in the two regular expression extracts. This ensures that the result contains unique tokens from both non-overlapping and overlapping sequences.

The results are:

But.

This does not cut it for us.

If we decide to use the substring approach, then for 3-grams we will need two substrings, i.e. essentially doing a lookahead two times to get the shifts we need. For 10-grams, we will need 9 substring expressions. That will make our code bloated and untenable.

Our approach of using plain old regular expressions is failing.

We need a new approach.

Exploring a Solution Using Data Distiller Lambda Functions

First, let us execute the following code

SELECT
DISTINCT(ProductName) AS featurevector1_distinct,
transform(
sequence(1, length(lower(replace(ProductName, ' ', ''))) - 1),
i -> substring(lower(replace(ProductName, ' ', '')), i, 2)
) AS tokens
FROM
featurevector1;

The result will be:

What about 3-grams? Let us execute the following:

SELECT
DISTINCT(ProductName) AS featurevector1_distinct,
transform(
sequence(1, length(lower(replace(ProductName, ' ', ''))) - 2),
i -> substring(lower(replace(ProductName, ' ', '')), i, 3)
) AS tokens
FROM
featurevector1

Observe the parameters in the length functions i.e. 2 and 3.

The results will be:

Well, what about 4-grams?

SELECT
DISTINCT(ProductName) AS featurevector1_distinct,
transform(
sequence(1, length(lower(replace(ProductName, ' ', ''))) - 3),
i -> substring(lower(replace(ProductName, ' ', '')), i, 4)
) AS tokens
FROM
featurevector1;

The results are:

And what about 5-grams?

SELECT
DISTINCT(ProductName) AS featurevector1_distinct,
transform(
sequence(1, length(lower(replace(ProductName, ' ', ''))) - 4),
i -> substring(lower(replace(ProductName, ' ', '')), i, 5)
) AS tokens
FROM
featurevector1;

The results are:

Since the 5-gram query gives us 4-grams as well (for strings that are shorter than 5 characters), we try:

SELECT
DISTINCT(ProductName) AS featurevector1_distinct,
transform(
filter(
sequence(1, length(lower(replace(ProductName, ' ', ''))) - 4),
i -> i + 4 <= length(lower(replace(ProductName, ' ', '')))
),
i -> CASE WHEN length(substring(lower(replace(ProductName, ' ', '')), i,
5)) = 5
THEN substring(lower(replace(ProductName, ' ', '')), i, 5)
ELSE NULL
END
) AS tokens
FROM
featurevector1;

This gives:

Try:

SELECT
DISTINCT(ProductName) AS featurevector1_distinct,
transform(
filter(
sequence(1, length(lower(replace(ProductName, ' ', ''))) - 5),
i -> i + 5 <= length(lower(replace(ProductName, ' ', '')))
),
i -> CASE WHEN length(substring(lower(replace(ProductName, ' ', '')), i,
6)) = 6
THEN substring(lower(replace(ProductName, ' ', '')), i, 6)
ELSE NULL
END
) AS tokens
FROM
featurevector1;
The result is:

Try:

SELECT
DISTINCT(ProductName) AS featurevector1_distinct,
transform(
filter(
sequence(1, length(lower(replace(ProductName, ' ', ''))) - 6),
i -> i + 6 <= length(lower(replace(ProductName, ' ', '')))
),
i -> CASE WHEN length(substring(lower(replace(ProductName, ' ', '')), i,
7)) = 7
THEN substring(lower(replace(ProductName, ' ', '')), i, 7)
ELSE NULL
END
) AS tokens
FROM
featurevector1;

The result is:

Lambda functions, also known as anonymous functions or lambda expressions, are a concept commonly found in
functional programming languages. Lambda functions enable you to define small, inline, and anonymous functions
without explicitly naming them. They are typically used for short, simple operations and are often used in functional
programming constructs like mapping, filtering, and reducing data.

Here are some examples where they are used:

1. Functional Programming: In functional programming languages like Lisp and Haskell, and in languages with strong functional support such as Python (with built-in constructs like map, filter, and reduce) and JavaScript (with arrow functions), lambda functions play a significant role. They are used to define functions on the fly and can be passed as arguments to other functions.

2. Higher-Order Functions: Lambda functions are often used with higher-order functions, which are functions that
can accept other functions as arguments or return functions as results. Higher-order functions are a fundamental
concept in functional programming.

3. Inline Function Definitions: Lambda functions are useful when you need a small, throwaway function that you
don’t want to define separately in your code. They can make code more concise and readable.

4. Data Transformation: Lambda functions are commonly used for data transformation tasks like mapping values
from one format to another or filtering data based on specific criteria.

Let us understand all the above points in the context of Data Distiller.

Data Distiller Lambda Functions

A lambda (higher-order) function in Data Distiller is an anonymous inline function that can be defined and used within SQL statements. Think of them as programming constructs that you can use to iterate a function over multiple values in an array. Philosophically, they are very similar to what you find in LISP or other functional languages. Lambda expressions (used by higher-order functions such as transform, filter, array_sort, etc.) are written inline as input parameters followed by an arrow and an expression (e.g., i -> i + 1). For instance, transform(expr, func) applies the function func to every element of the array expr.
The same goes for the following:


**filter:** Applies the predicate func to every element of the array expr and keeps only the elements for which it returns true.

**forall:** Applies the test condition defined by func to all elements in expr and returns true only if it holds for every element. A similar function, exists, returns true if the condition holds for at least one element.

**reduce:** Aggregates elements in an array using a custom aggregator. See the example below to see how you can simulate a for loop.

Let us look at an example where we want to create partial sums of all integers from 1 to 5, i.e. 1, 1+2, 1+2+3, 1+2+3+4, 1+2+3+4+5:

SELECT transform(
sequence(1, 5),
x -> reduce(
sequence(1, x),
0, -- Initial accumulator value
(acc, y) -> acc + y -- Lambda function to add numbers
)
) AS sum_result;

Let us analyze the code above:

1. transform will apply the function x -> reduce on each element generated in sequence.

2. sequence creates 5 integers 1, 2, 3, 4, and 5. Each element of this is an x.

3. reduce itself is using a subset of integers from 1 to x.

4. The 0 denotes the accumulator value denoted by acc.

5. y is the element in sequence(1,x)

6. Accumulator acc stores the results and returns them.

The results will be:
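
(For reference, the expected output is the array of partial sums [1, 3, 6, 10, 15].)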

What we are learning is that lambda functions are extremely powerful constructs when we want to implement
“programming” like syntax in Data Distiller.

Based on what we learned above, let us apply the same to our example. Let us take a slimmed-down version of 3-
grams and analyze the code:

SELECT
transform(
sequence(1, length(lower(replace(ProductName, ' ', ''))) - 2),
i -> substring(lower(replace(ProductName, ' ', '')), i, 3)
)
FROM
featurevector1

1. transform as mentioned earlier will apply a lambda function to each integer in the sequence.

2. sequence(1, length(lower(replace(ProductName, ' ', ''))) - 2): This part generates a sequence of integers. Let's break it down further:
length(lower(replace(ProductName, ' ', ''))): This calculates the length of the
ProductName after making it lowercase and removing spaces.

- 2: It subtracts 2 from the length to ensure that the sequence generates valid starting positions for 3-
character substrings. Subtracting 2 ensures that you have enough characters following each starting position
to extract a 3-character substring. Note that the substring function will operate like a lookahead
operator.

3. i -> substring(lower(replace(ProductName, ' ', '')), i, 3): This is a lambda function that operates on each integer i in the sequence generated in step 2. Here's what it does:

substring(...): It uses the substring function to extract a 3-character substring from the
ProductName column.

lower(replace(ProductName, ' ', '')): Before extracting the substring, it converts the
ProductName to lowercase and removes spaces to ensure consistency.

Let us understand the function of filter in:

SELECT
transform(
filter(
sequence(1, length(lower(replace(ProductName, ' ', ''))) - 6),
i -> i + 6 <= length(lower(replace(ProductName, ' ', '')))
),
i -> CASE WHEN length(substring(lower(replace(ProductName, ' ', '')), i,
7)) = 7
THEN substring(lower(replace(ProductName, ' ', '')), i, 7)
ELSE NULL
END
)
FROM
featurevector1;

filter takes this sequence and applies a condition to filter out only those starting positions that allow for
extracting a 7-character substring without going beyond the length of the modified ProductName. The
condition i -> i + 6 <= length(lower(replace(ProductName, ' ', ''))) ensures that the
starting position i plus 6 (the length of the desired 7-character substring minus one) does not exceed the length
of the modified ProductName.

The CASE statement is used to conditionally include or exclude substrings based on their length. Only 7-
character substrings are included; others are replaced with NULL.

Hint: When you build general-purpose utility functions such as the one we built for tokenizing strings, you can use Data Distiller parameterized templates where the number of characters is a parameter. The reuse and abstraction make this feature extremely powerful.

Compute the Cross Join of Unique Elements Across the Two Feature Vectors

Suppose we want to extract the elements in featurevector2 that are not in featurevector1:

SELECT lower(replace(ProductName, ' ', '')) FROM featurevector2
EXCEPT
SELECT lower(replace(ProductName, ' ', '')) FROM featurevector1;
Hint: Besides **EXCEPT**, you could also use **UNION** and **INTERSECT**. Experiment with the **ALL** and **DISTINCT** clauses.

The results will be:

Let us create the tokenized vector:

CREATE TABLE featurevector1tokenized AS SELECT


DISTINCT(ProductName) AS featurevector1_distinct,
transform(
filter(
sequence(1, length(lower(replace(ProductName, ' ', ''))) - 1),
i -> i + 1 <= length(lower(replace(ProductName, ' ', '')))
),
i -> CASE WHEN length(substring(lower(replace(ProductName, ' ', '')), i,
2)) = 2
THEN substring(lower(replace(ProductName, ' ', '')), i, 2)
ELSE NULL
END
) AS tokens
FROM
(SELECT lower(replace(ProductName, ' ', '')) AS ProductName FROM
featurevector1);
SELECT * FROM featurevector1tokenized;

Remember that if you are using DbVisualizer, once you create or delete a table, you have to refresh the database connection so that the table's metadata cache is refreshed. Data Distiller does not push out metadata updates.

The result will be:

Do the same for featurevector2:

CREATE TABLE featurevector2tokenized AS


SELECT
DISTINCT(ProductName) AS featurevector2_distinct,
transform(
filter(
sequence(1, length(lower(replace(ProductName, ' ', ''))) - 1),
i -> i + 1 <= length(lower(replace(ProductName, ' ', '')))
),
i -> CASE WHEN length(substring(lower(replace(ProductName, ' ', '')), i,
2)) = 2
THEN substring(lower(replace(ProductName, ' ', '')), i, 2)
ELSE NULL
END
) AS tokens
FROM
(SELECT lower(replace(ProductName, ' ', '')) AS ProductName FROM featurevector2
);
SELECT * FROM featurevector2tokenized;

The result will be:

Let us do the cross-join:


SELECT
A.featurevector1_distinct AS SetA_ProductNames,
B.featurevector2_distinct AS SetB_ProductNames,
A.tokens AS SetA_tokens1,
B.tokens AS SetB_tokens2
FROM
featurevector1tokenized A
CROSS JOIN
featurevector2tokenized B;

Let us recap the SQL:

1. A.featurevector1_distinct AS SetA_ProductNames: This part selects the featurevector1_distinct column from table A and assigns it an alias SetA_ProductNames. The result of this part will be a list of distinct product names from the first dataset.

2. A.tokens AS SetA_tokens1: This part selects the tokens column from the table or subquery A and assigns it an alias SetA_tokens1. The result will be a list of tokenized values associated with the product names from the first dataset.

3. The CROSS JOIN operation combines all possible combinations of rows from the two datasets. In other words, it pairs each product name and its associated tokens from the first table (A) with each product name and its associated tokens from the second table (B). This results in a Cartesian product of the two datasets, where each row in the output represents a combination of a product name and its associated tokens from both datasets.

The results are:

Compute the Jaccard Similarity Measure

Computing the similarity measure should be very straightforward:

SELECT
SetA_ProductNames,
SetB_ProductNames,
SetA_tokens1,
SetB_tokens2,
size(array_intersect(SetA_tokens1, SetB_tokens2)) AS token_intersect_count,
size(array_union(SetA_tokens1, SetB_tokens2)) AS token_union_count,
ROUND(
CAST(size(array_intersect(SetA_tokens1, SetB_tokens2)) AS DOUBLE) /
size(array_union(SetA_tokens1, SetB_tokens2)), 2) AS jaccard_similarity
FROM
(SELECT
A.featurevector1_distinct AS SetA_ProductNames,
B.featurevector2_distinct AS SetB_ProductNames,
A.tokens AS SetA_tokens1,
B.tokens AS SetB_tokens2
FROM
featurevector1tokenized A
CROSS JOIN
featurevector2tokenized B
);

Let us understand the code:


1. size(array_intersect(SetA_tokens1, SetB_tokens2)) AS token_intersect_count: This part calculates the number of tokens that are common to both SetA_tokens1 and SetB_tokens2. It does so by computing the size of the intersection of the two arrays of tokens.

2. size(array_union(SetA_tokens1, SetB_tokens2)) AS token_union_count: This part calculates the total number of unique tokens across both SetA_tokens1 and SetB_tokens2. It computes the size of the union of the two arrays of tokens.

3. ROUND(CAST(size(array_intersect(SetA_tokens1, SetB_tokens2)) AS DOUBLE) / size(array_union(SetA_tokens1, SetB_tokens2)), 2) AS jaccard_similarity: This part calculates the Jaccard similarity between the token sets. It divides the size of the token intersection by the size of the token union and rounds the result to two decimal places. The Jaccard similarity is a measure of how similar two sets are, with a value between 0 and 1, where 1 indicates complete similarity.

The results are:

Thresholding on Jaccard Similarity Measure

Let us use a threshold of 0.4 to keep only the pairs that qualify for our similarity join:

SELECT
SetA_ProductNames,
SetB_ProductNames
FROM
(SELECT
SetA_ProductNames,
SetB_ProductNames,
SetA_tokens1,
SetB_tokens2,
size(array_intersect(SetA_tokens1, SetB_tokens2)) AS token_intersect_count,
size(array_union(SetA_tokens1, SetB_tokens2)) AS token_union_count,
ROUND(
CAST(size(array_intersect(SetA_tokens1, SetB_tokens2)) AS DOUBLE) /
size(array_union(SetA_tokens1, SetB_tokens2)),
2
) AS jaccard_similarity
FROM
(SELECT
A.featurevector1_distinct AS SetA_ProductNames,
B.featurevector2_distinct AS SetB_ProductNames,
A.tokens AS SetA_tokens1,
B.tokens AS SetB_tokens2
FROM
featurevector1tokenized A
CROSS JOIN
featurevector2tokenized B
)
)
WHERE jaccard_similarity>=0.4

This gives the matching pairs for the similarity join:

Last updated 5 months ago

Jaccard Similarity Measure.


Manual test data creation using SELECTs and UNION ALL

iPadMini has whitespaces removed

MacbookPro has whitespaces removed

Convert all to lowercase.

Convert all to lowercase.

Non-overlapping tokens are created.

Getting all possible bigram sequences with overlapping tokens.

Using lambda functions to extract overlapping bigrams.

Extracting an overlapping trigram.

Extracting overlapping 4-grams

5-grams gives us 4-grams as well

Summing partial sums in a loop.

The only unique element in featurevector2.

Materialized view of featurevector1 after it has been tokenized.

Materialized view of featurevector2 after it has been tokenized.

Cross join with the tokens

Jaccard Similarity Measure across two feature vectors

Similarity join between featurevector1 and featurevector2.


https://data-distiller.all-stuff-data.com/unit-9-data-distiller-functions-and-extensions/draft-func-100-date-and-time-
functions * * *

1. UNIT 9: DATA DISTILLER FUNCTIONS & EXTENSIONS

[DRAFT]FUNC 100: Date and Time Functions


The hour function is used when you want to extract the hour component from a **timestamp** or
**datetime** column. It’s particularly useful for time-based analysis, such as:

1. Aggregating Data by Hour: When you need to analyze events or actions (like clicks, sales, or logins) based on
the hour of the day. For example, identifying peak activity hours in a campaign.

2. Time-of-Day Patterns: When looking for trends or patterns in data based on the time of day. For instance,
understanding what hours are most effective for sending marketing emails.

3. Comparing Hourly Performance: When comparing the performance of different hours within a day across
multiple campaigns, as shown in the query.

SELECT campaign_id, hour(click_timestamp) AS hour_of_day, COUNT(*) AS total_clicks FROM


campaign_clicks GROUP BY campaign_id, hour_of_day;

The **date_trunc** function is used when you want to aggregate data by a specific time interval, such as day,
week, month, or year. In the provided query, **date_trunc('month', transaction_date)** is used to
round the **transaction_date** down to the first day of the month, allowing you to analyze data at the
monthly level. Here are some use cases for using the **date_trunc** function:

1. Aggregating by Time Intervals: When you need to summarize data over consistent time periods, such as
months, quarters, or years. This is useful for time series analysis, trend detection, or reporting.

2. Monthly or Periodic Reporting: When generating monthly reports to summarize key metrics (e.g., total
revenue, number of transactions) for each month.

3. Smoothing Time-Series Data: When you want to eliminate daily fluctuations by summarizing data into larger
time buckets, such as weeks or months, to better understand long-term trends.

4. Comparing Performance Across Periods: When comparing metrics across different time intervals, like
comparing revenue month-over-month.

The syntax for the **date_trunc** function is as follows:

date_trunc(unit, date)

**unit**: This specifies the level of truncation and can be values like 'year', 'quarter', 'month', 'week', 'day', 'hour', 'minute', or 'second'.

**date**: The date or timestamp expression that you want to truncate.

SELECT date_trunc('month', transaction_date) AS month, SUM(revenue) AS total_revenue
FROM transactions
GROUP BY date_trunc('month', transaction_date)
ORDER BY month;

The **year** function in this query extracts the year from the **signup_date** field, allowing you to group
and analyze data on an annual basis. Here are some situations where using the year function is beneficial:
1. Yearly Aggregation: Useful for grouping data by year to summarize activities or events that occurred within
each year. In the example below, it counts the number of customer signups per year.

2. Cohort Analysis: Helps in tracking groups of customers who signed up in the same year, providing insights into
customer behavior, growth trends, or retention over time.

3. Year-over-Year Comparisons: Facilitates comparisons across different years, such as assessing revenue growth,
user acquisition, or other key metrics.

4. Trend Analysis: Useful for identifying patterns or trends over multiple years, such as determining which years
had peak or low signup activity.

SELECT year(signup_date) AS signup_year, COUNT(customer_id) AS cohort_size
FROM customers
GROUP BY year(signup_date)
ORDER BY signup_year;

The **dayofweek** function is useful for:

1. Grouping Data by Day of the Week: It allows you to analyze trends or patterns based on the day, such as
identifying which days have higher sales or more website traffic.

2. Classifying Days as Weekend or Weekday: As shown in the example, you can use dayofweek to categorize days as "Weekend" or "Weekday" for analysis.

3. Scheduling and Planning: When analyzing tasks or events based on the day of the week, this function helps in
scheduling resources more efficiently.

SELECT CASE WHEN dayofweek(transaction_date) IN (1, 7) THEN 'Weekend' ELSE 'Weekday' END AS day_type,
       SUM(revenue) AS total_revenue
FROM transactions
GROUP BY day_type;

The **datediff** function is used to calculate the difference between two dates, typically returning the result as
the number of days between them. In the context of the provided query, **datediff** is being used to determine
the number of days between consecutive purchase dates for each customer.

-- Compute the gap between consecutive purchases per customer first, then aggregate
SELECT customer_id,
       avg(days_between_purchases) AS avg_days_between_purchases
FROM (
    SELECT customer_id,
           datediff(purchase_date,
                    lag(purchase_date) OVER (PARTITION BY customer_id ORDER BY purchase_date)
           ) AS days_between_purchases
    FROM purchases
)
GROUP BY customer_id;

Here’s a breakdown of the query above and the use of **datediff**:

1. Calculating Differences Between Consecutive Dates: The **datediff** function computes the difference
in days between a **purchase_date** and the previous **purchase_date** for the same customer, as
determined by the **lag** function.

2. Using **lag** Function: The **lag(purchase_date)** function retrieves the previous purchase date
for each **customer_id**, allowing you to compare it with the current **purchase_date**.

3. Grouping by Customer: The **PARTITION BY customer_id** clause ensures that the calculations are
performed separately for each customer, allowing you to analyze individual purchasing patterns.

4. Averaging the Day Differences: The **avg** function calculates the average number of days between
purchases for each customer, providing insight into their purchase frequency.

Here’s a breakdown of the usage:


1. Filtering Data for Today’s Date: The query retrieves all customers who signed up on the current date by
comparing the **signup_date** to **current_date()**. This helps identify new signups that
occurred today.

2. Use Cases for **current_date()**:

Daily Reports: Generating reports that focus on today’s activities, such as new signups, sales, or customer
interactions.

Real-Time Monitoring: Tracking metrics that need to be updated continuously, like daily active users or
same-day transactions.

Scheduled Queries: Running automated tasks or queries that process data based on the current date.

The **current_date()** function is used to get the current date (today’s date) in SQL. In the given query, it is
used to filter records where the **signup_date** matches today’s date.

SELECT customer_id, signup_date


FROM customer_activity_data
WHERE signup_date = current_date();

**current_timestamp** function

Here’s a breakdown of its use:

1. Capturing the Exact Interaction Time: By using current_timestamp(), you record the precise moment
when the interaction took place. This is useful for time-sensitive data tracking, such as logging user actions or
events.

2. Use Cases for **current_timestamp()**:

Event Logging: Recording the exact time of events, such as user interactions, system events, or changes in
status.

Audit Trails: Keeping a detailed log of activities for compliance, debugging, or tracking user behavior over
time.

Real-Time Analytics: Analyzing data based on the exact time of occurrence, which is helpful for real-time
dashboards or time-series analysis.

The **current_timestamp()** function is used below to get the current date and time (timestamp) at the
moment the query is executed. In the given **INSERT** statement, it adds a record to the
campaign_interactions table with the exact time when the insertion occurs.

INSERT INTO campaign_interactions (customer_id, campaign_id, interaction_time)


VALUES (1234, 5678, current_timestamp());

**current_timezone** function

Here are the use cases:

Tracking Data Entry Timezone: This could be used to log the timezone in which the data entry occurred,
particularly useful in multi-regional systems where data might be inserted from various geographical locations.
Localization of Campaign Analytics: When analyzing campaign interactions, knowing the timezone helps
localize data for regional reports. It would enable the conversion of timestamps to the local time of the
interaction, giving a more accurate representation of when customers interacted with campaigns.

Timezone-Based Personalization: If the system’s timezone reflects the user’s local time, you could use this data
for personalized marketing. For example, sending notifications at specific times based on each user’s local
timezone.

Debugging and Audit Trails: In systems where data ingestion and interaction logs come from various regions,
capturing the current timezone during data entry could help troubleshoot issues, understand latency, or provide
insights into data processing across time zones.

Data Synchronization Across Regions: In distributed systems, knowing the current timezone for data entries
could aid in synchronizing data across servers or applications located in different time zones.

SELECT customer_id, current_timezone() AS customer_timezone FROM campaign_interactions;

SELECT customer_id, date(click_timestamp) AS click_date FROM customer_activity_data;

SELECT customer_id, last_interaction_date, date_add(last_interaction_date, 7) AS predicted_next_interaction
FROM customer_activity_data;

SELECT customer_id, date_diff(current_date(), last_purchase_date) AS inactivity_days
FROM customer_activity_data;

SELECT customer_id, date_format(transaction_date, 'MMMM yyyy') AS transaction_month
FROM customer_activity_data;

**date_from_unix_date** function

SELECT customer_id, date_from_unix_date(unix_timestamp) AS readable_date
FROM customer_activity_data;

SELECT customer_id, hour(click_timestamp) AS hour_of_day, COUNT(*) AS total_clicks
FROM customer_activity_data
GROUP BY customer_id, hour_of_day;

SELECT customer_id, last_day(subscription_start_date) AS subscription_end_date
FROM customer_activity_data;

SELECT make_date(2024, 12, 25) AS campaign_start_date;

SELECT month(transaction_date) AS transaction_month, SUM(revenue) AS total_revenue
FROM customer_activity_data
GROUP BY transaction_month;

SELECT customer_id, months_between(last_purchase_date, signup_date) AS months_between_purchases
FROM customer_activity_data;

SELECT customer_id, next_day(last_interaction_date, 'Monday') AS follow_up_date
FROM customer_activity_data;

SELECT customer_id, minute(click_timestamp) AS minute_of_interaction, COUNT(*) AS total_clicks
FROM customer_activity_data
GROUP BY customer_id, minute_of_interaction;

SELECT customer_id, second(click_timestamp) AS second_of_interaction
FROM customer_activity_data;

SELECT customer_id, timediff(last_interaction_date, first_interaction_date) AS time_spent
FROM customer_activity_data;

SELECT timestamp('2024-12-31 23:59:59') AS campaign_end_timestamp;

**timestamp_micros** function

SELECT timestamp_micros(1696843573000000) AS event_timestamp;

**timestamp_millis** function

SELECT timestamp_millis(1696843573000) AS event_timestamp;

**timestamp_seconds** function

SELECT timestamp_seconds(1696843573) AS event_timestamp;

SELECT customer_id, timestampadd(MINUTE, 30, click_timestamp) AS predicted_purchase_time
FROM customer_activity_data;

SELECT customer_id, timestampdiff(HOUR, first_interaction_date, last_interaction_date) AS hours_between_interactions
FROM customer_activity_data;

SELECT customer_id, date_part('day', transaction_date) AS purchase_day
FROM customer_activity_data;

SELECT to_date('2024-12-31', 'yyyy-MM-dd') AS campaign_launch_date;

SELECT to_timestamp('2024-12-31 23:59:59', 'yyyy-MM-dd HH:mm:ss') AS campaign_end_timestamp;

**to_unix_timestamp** function

SELECT to_unix_timestamp('2024-12-31 23:59:59', 'yyyy-MM-dd HH:mm:ss') AS unix_timestamp;

**to_utc_timestamp** function

SELECT to_utc_timestamp(click_timestamp, 'America/Los_Angeles') AS utc_click_time
FROM customer_activity_data;

SELECT year(transaction_date) AS transaction_year, SUM(revenue) AS total_revenue
FROM customer_activity_data
GROUP BY transaction_year;

SELECT customer_id, date_sub(event_date, 7) AS reminder_date
FROM customer_activity_data;

SELECT date_trunc('month', transaction_date) AS transaction_month, SUM(revenue) AS total_revenue
FROM customer_activity_data
GROUP BY transaction_month;

SELECT customer_id, dateadd(MONTH, 1, subscription_start_date) AS next_billing_date
FROM customer_activity_data;

SELECT customer_id, datediff(current_date(), last_interaction_date) AS inactivity_days
FROM customer_activity_data;

SELECT day(transaction_date) AS transaction_day, COUNT(*) AS total_transactions
FROM customer_activity_data
GROUP BY transaction_day;

SELECT dayofmonth(transaction_date) AS transaction_day_of_month, COUNT(*) AS total_transactions
FROM customer_activity_data
GROUP BY transaction_day_of_month;

SELECT dayofweek(click_timestamp) AS engagement_day, COUNT(*) AS total_engagements
FROM customer_activity_data
GROUP BY engagement_day;

SELECT dayofyear(transaction_date) AS transaction_day_of_year, COUNT(*) AS total_transactions
FROM customer_activity_data
GROUP BY transaction_day_of_year;

https://data-distiller.all-stuff-data.com/unit-6-data-distiller-audiences/draft-dda-202-data-distiller-audience-
orchestration * * *

You can skip the remaining prerequisites if you’ve already followed the steps in the tutorial below

If you have not done the above tutorial, you will need this to upload the test data:

We will be using the following data to create segments:

Retail Case Study: Optimizing Email Marketing Campaigns with Audience Segmentation and A/B Testing

In this use case, we aim to simulate and optimize an email marketing campaign by leveraging audience segmentation,
performance tracking, and A/B testing. The primary goal is to improve customer engagement, maximize
conversions, and refine campaign strategies based on real-time customer interactions.

Key Marketing Objectives:

1. Campaign Performance Tracking: Track and analyze key metrics such as email open rates, click-through rates, and bounce rates to assess campaign success (a sketch of such a rollup query follows this list).

2. Customer Segmentation: Segment the customer base into various categories, such as highly engaged customers,
moderately engaged customers, and unengaged customers. This allows marketers to target their messaging more
effectively.

3. A/B Testing: Perform A/B tests by splitting the audience into two groups and testing different versions of the
email content (e.g., subject lines, calls to action). This helps identify which version performs better in terms of
engagement and conversion.

4. Improve Email Deliverability: Track failed email deliveries and understand bounce reasons (soft or hard
bounces) to optimize email lists and improve overall deliverability rates.
5. Personalized Marketing: Use engagement metrics (like open and click counts) to create personalized follow-up
campaigns, offering exclusive deals or reminders based on customer interaction behavior.
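
To ground the first objective before diving into the use case, here is a hedged sketch of a per-campaign KPI rollup; it assumes the adobe_campaign_dataset table and the open_count, click_count, and bounce_type columns that the later queries in this tutorial rely on:

-- One row per campaign with the headline email KPIs
SELECT
    campaign_name,
    COUNT(*) AS total_emails_sent,
    ROUND(SUM(open_count) / COUNT(*), 2) AS open_rate,
    ROUND(SUM(click_count) / NULLIF(SUM(open_count), 0), 2) AS click_through_rate,
    ROUND(SUM(CASE WHEN bounce_type != 'None' THEN 1 ELSE 0 END) / COUNT(*), 2) AS bounce_rate
FROM adobe_campaign_dataset
GROUP BY campaign_name
ORDER BY open_rate DESC;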

Specific Use Case:

A retail brand is running a series of email marketing campaigns for its Spring Sale, Holiday Offers, and New
Arrivals. The marketing team wants to:

1. Identify High-Value Customers: Focus on customers who have a high purchase frequency and loyalty score, engaging them with personalized offers (a hedged audience sketch follows this list).

2. Segment the Audience Based on Engagement: Create tailored messaging for those who have opened emails but
haven’t clicked (i.e., warm leads) vs. those who haven’t engaged at all (cold leads).

3. A/B Test Subject Lines: Compare two email subject lines for the same campaign to see which one drives more
engagement (open and click rates).

4. Monitor and Reduce Email Bounces: Track and reduce email bounces by analyzing hard and soft bounces to
refine the email list and improve targeting.
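
To illustrate the first point, a high-value-customer audience could be sketched as below; note that the purchase_frequency and loyalty_score columns and both thresholds are hypothetical and would need to be swapped for whatever fields your dataset actually carries:

CREATE AUDIENCE high_value_customers_audience
WITH (primary_identity=email, identity_namespace=Email)
AS SELECT
    customer_id,
    email,
    purchase_frequency,  -- hypothetical column
    loyalty_score        -- hypothetical column
FROM email_campaign_dataset_20241001_050033_012
WHERE purchase_frequency >= 5   -- example threshold, an assumption
  AND loyalty_score >= 80;      -- example threshold, an assumption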

Expected Outcome:

Higher Engagement: By tracking open and click rates, the marketing team can focus on the most effective
content, leading to higher engagement and ultimately increased sales.

Improved Targeting: Customer segmentation based on interaction helps in tailoring future messages, leading to
better personalization and increased likelihood of conversion.

Optimized Content: A/B testing results will provide insights into what content or subject lines resonate most
with the audience, enabling the brand to optimize its messaging.

Reduced Bounce Rates: Understanding bounce types (hard or soft) will allow the marketing team to clean up
the email list, ensuring better deliverability and engagement metrics.

We will focus on the third objective:

A/B Test Subject Lines: Compare two email subject lines for the same campaign to see which one drives more
engagement (open and click rates).

Opened but No Click Audience

This audience includes customers who have opened emails but did not click on any links.

CREATE AUDIENCE opened_no_click_audience
WITH (primary_identity=email, identity_namespace=Email)
AS SELECT
customer_id,
email,
campaign_name,
open_count,
click_count
FROM email_campaign_dataset_20241001_050033_012
WHERE open_count > 0 AND click_count = 0;

The result is a new audience containing every customer who opened at least one email but never clicked through.

No Engagement Audience

This audience includes customers who neither opened nor clicked on the emails.

CREATE AUDIENCE no_engagement_audience
WITH (primary_identity=email, identity_namespace=Email)
AS SELECT
customer_id,
email,
campaign_name,
open_count,
click_count
FROM email_campaign_dataset_20241001_050033_012
WHERE open_count = 0;

Split Testing by Subject Line

Compare the engagement metrics between two different groups in an A/B test (using subject lines as the test variable).

WITH ab_testing_split AS (
SELECT
customer_id,
email,
email_subject,
campaign_name,
open_count,
click_count,
CASE
WHEN MOD(ROW_NUMBER() OVER (PARTITION BY campaign_name ORDER BY
customer_id), 2) = 0 THEN 'Group A'
ELSE 'Group B'
END AS test_group
FROM adobe_campaign_dataset
)
SELECT
test_group,
email_subject,
campaign_name,
COUNT(*) AS total_emails_sent,
SUM(open_count) AS total_opens,
SUM(click_count) AS total_clicks,
ROUND(SUM(open_count) / COUNT(*), 2) AS open_rate,
ROUND(SUM(click_count) / SUM(open_count), 2) AS click_through_rate
FROM ab_testing_split
GROUP BY test_group, email_subject, campaign_name
ORDER BY campaign_name, test_group;

This query allows you to compare the performance between Group A and Group B for an A/B test.
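
If you also want to activate each test cell separately, the same deterministic split can be wrapped in a CREATE AUDIENCE statement, mirroring the audiences created earlier. The audience name below is just an example, and whether your environment allows CREATE AUDIENCE to read from a derived table like this, rather than a pre-materialized dataset, is an assumption to verify:

CREATE AUDIENCE ab_test_group_a_audience
WITH (primary_identity=email, identity_namespace=Email)
AS SELECT customer_id, email, campaign_name
FROM (
    SELECT
        customer_id,
        email,
        campaign_name,
        MOD(ROW_NUMBER() OVER (PARTITION BY campaign_name ORDER BY customer_id), 2) AS split_bucket
    FROM adobe_campaign_dataset
) split
WHERE split_bucket = 0;  -- bucket 0 corresponds to 'Group A' above; use 1 for 'Group B'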

4. Email Delivery and Bounce Queries

a) Track Email Delivery Success

Track how well the emails are being delivered across campaigns by monitoring delivery status.

SELECT
campaign_name,
COUNT(*) AS total_emails_sent,
SUM(CASE WHEN delivery_status = 'Delivered' THEN 1 ELSE 0 END) AS
emails_delivered,
SUM(CASE WHEN delivery_status = 'Failed' THEN 1 ELSE 0 END) AS
emails_failed,
ROUND(SUM(CASE WHEN delivery_status = 'Delivered' THEN 1 ELSE 0 END) /
COUNT(*), 2) AS delivery_rate
FROM adobe_campaign_dataset
GROUP BY campaign_name
ORDER BY delivery_rate DESC;

This query helps you monitor the delivery success rate and identify potential issues in campaigns with high failure
rates.
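
A small variation on the query above surfaces only the campaigns that fall below a chosen delivery-rate threshold; the 0.95 cutoff is an arbitrary example:

-- Show only campaigns whose delivery rate drops below 95%
SELECT
    campaign_name,
    COUNT(*) AS total_emails_sent,
    ROUND(SUM(CASE WHEN delivery_status = 'Delivered' THEN 1 ELSE 0 END) / COUNT(*), 2) AS delivery_rate
FROM adobe_campaign_dataset
GROUP BY campaign_name
HAVING SUM(CASE WHEN delivery_status = 'Delivered' THEN 1 ELSE 0 END) / COUNT(*) < 0.95
ORDER BY delivery_rate ASC;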

b) Analyze Bounce Rates

Identify campaigns with high bounce rates and distinguish between hard and soft bounces.

SELECT
campaign_name,
COUNT(*) AS total_emails_sent,
SUM(CASE WHEN bounce_type = 'Hard Bounce' THEN 1 ELSE 0 END) AS
hard_bounces,
SUM(CASE WHEN bounce_type = 'Soft Bounce' THEN 1 ELSE 0 END) AS
soft_bounces,
ROUND(SUM(CASE WHEN bounce_type != 'None' THEN 1 ELSE 0 END) / COUNT(*), 2)
AS bounce_rate
FROM adobe_campaign_dataset
GROUP BY campaign_name
ORDER BY bounce_rate DESC;

This query will show you which campaigns have high bounce rates and whether those bounces are hard or soft,
helping you clean up email lists and improve deliverability.
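
One way to act on hard bounces is to build a suppression list that future sends exclude; this sketch assumes the email column is populated for bounced sends, as it is elsewhere in this dataset:

-- Addresses that hard-bounced and should be removed from future sends
SELECT DISTINCT
    customer_id,
    email
FROM adobe_campaign_dataset
WHERE bounce_type = 'Hard Bounce';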

5. General Engagement Trends

a) Engagement Over Time

Analyze how customer engagement changes over time by tracking the number of days since the customer’s last
purchase.

SELECT
last_purchase_days_ago,
AVG(open_count) AS avg_open_count,
AVG(click_count) AS avg_click_count
FROM adobe_campaign_dataset
GROUP BY last_purchase_days_ago
ORDER BY last_purchase_days_ago;

This query shows if there’s a correlation between how recently a customer made a purchase and their engagement with
email campaigns.
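
To put a single number on that relationship, the corr aggregate can compute a Pearson correlation between purchase recency and engagement; a value near zero suggests little linear relationship, while a negative value suggests engagement falls as the time since the last purchase grows. Treat this as a rough signal rather than a definitive test:

-- Pearson correlation between days since last purchase and engagement counts
SELECT
    corr(last_purchase_days_ago, open_count)  AS corr_recency_opens,
    corr(last_purchase_days_ago, click_count) AS corr_recency_clicks
FROM adobe_campaign_dataset;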
