
Greenplum: MAD Analytics in Practice

MMDS June 16th, 2010


Warehousing Today


In the Days of Kings and Priests

Computers and Data: Crown Jewels
Executives depend on computers, but cannot work with them directly

The DBA Priesthood
And their acronymia: EDW, BI, OLAP, 3NF
Secret functions and techniques, expensive tools

The Architected EDW

Slow-moving models

Non-standard, in-memory analytics

Shadow systems

Slow-moving data

Shallow Business Intelligence

Departmental warehouses

Static schemas accrete over time

New Realities

Welcome to the Petabyte Age


TB disks < $100
Everything is data
Rise of data-driven culture
Very publicly espoused by Google, Netflix, etc.; Terraserver, USAspending.gov

The New Practitioners

Aggressively datavorous
Statistically savvy
Diverse in training and tools

Greenplum Overview


Greenplum Database Architecture


PL/R, PL/Perl, PL/Python
SQL, MapReduce

MPP (Massively Parallel Processing) Shared-Nothing Architecture

Master Servers
Query planning & dispatch

Network Interconnect

Segment Servers
Query processing & data storage

External Sources
Loading, streaming, etc.

Key Technical Innovations

Scatter-Gather Data Streaming
Industry-leading data loading capabilities (see the external-table sketch below)

Online Expansion
Dynamically provision new servers with no downtime

MapReduce Support
Parallel programming on data for advanced analytics

Polymorphic Storage
Support for both row- and column-oriented storage
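As a concrete illustration of the loading path, here is a minimal sketch of a gpfdist-backed external table; the host, port, file pattern, and table names (including the target table misc.clicks) are illustrative assumptions, not taken from the deck:

CREATE EXTERNAL TABLE misc.ext_clicks (
     ts       timestamp
    ,user_id  bigint
    ,url      text
)
LOCATION ('gpfdist://etl1:8081/clicks*.txt')
FORMAT 'TEXT' (DELIMITER '|');

-- all segments pull from the gpfdist file server in parallel (scatter-gather)
INSERT INTO misc.clicks SELECT * FROM misc.ext_clicks;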

Benefits of the Greenplum Database Architecture


Simplicity
Parallelism is automatic: no manual partitioning required
No complex tuning required: just load and query

Scalability
Linear scalability to 1000s of cores, on commodity hardware
Each node adds storage, query performance, and loading performance
Example: 6.5 petabytes on 96 nodes, with 17 trillion records

Flexibility
Full parallelism for SQL92, SQL99, SQL2003 OLAP, and MapReduce
Any schema (star, snowflake, 3NF, hybrid, etc.)
Rich extensibility and language support (Perl, Python, R, C, etc.)

Customer Example: eBay Petabyte-Scale


Business Problem
Fraud detection and click-stream analytics

Data Size
6.5 petabytes of user data
Loading 18 TB every day (130 billion rows)
Every click on eBay's website, 365 days a year
20-trillion-row fact table

Hardware
96-node Sun Data Warehouse Appliance
Expansion planned to 192 nodes

Benefit
Scalability and price/performance
Cost-effective complement to Teradata

MAD Analytics


Magnetic

Simple linear models, trend analysis
[Figure: Analytics and Data drawn into the EDC Platform (Chorus)]

Agile

Get data into the cloud
Analyze and model in the cloud
Push results back into the cloud

Deep

[Figure: quadrant of analytics questions, Facts vs. Interpretation and Past vs. Future. Past + Facts: What happened, where and when? Past + Interpretation: How and why did it happen? Future + Facts: What will happen? Future + Interpretation: How can we do better?]

Dolan's Vocabulary of Statistics

Data mining focused on individuals; statistical analysis needs more: a focus on density methods. We need to be able to utter statistical sentences, and run them massively parallel, on Big Data!

1. (Scalar) arithmetic
2. Vector arithmetic (i.e., linear algebra)
3. Functions (e.g., probability densities)
4. Functionals (i.e., functions on functions; e.g., A/B testing: a functional over densities; see the SQL sketch below)
5. Misc. statistical methods (e.g., resampling)
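A hedged illustration of item 4 (not from the talk): an A/B-test style "statistical sentence" in plain SQL, computing a two-sample (Welch) t statistic with ordinary aggregates over a hypothetical table obs(grp, x):

-- difference in group means, in units of its estimated standard error
SELECT ( avg(CASE WHEN grp = 'A' THEN x END)
       - avg(CASE WHEN grp = 'B' THEN x END) )
     / sqrt( var_samp(CASE WHEN grp = 'A' THEN x END)
             / sum(CASE WHEN grp = 'A' THEN 1 ELSE 0 END)
           + var_samp(CASE WHEN grp = 'B' THEN x END)
             / sum(CASE WHEN grp = 'B' THEN 1 ELSE 0 END) ) AS t_statistic
FROM obs;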

MAD Skills Whitepaper

The paper includes parallelizable, stat-like SQL for:

Linear algebra (vectors/matrices)
Ordinary Least Squares (multiple linear regression; see the one-variable sketch below)
Conjugate Gradient (iterative optimization, e.g. for SVM classifiers)
Functionals, including the Mann-Whitney U test and log-likelihood ratios
Resampling techniques, e.g. bootstrapping

Encapsulated as stored procedures or UDFs, these significantly enhance the vocabulary of the DBMS! Related work appears in NIPS '06, using MapReduce syntax.

These are examples; plenty of research to do here!
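To give the flavor of "stat-like SQL", here is a minimal one-variable Ordinary Least Squares sketch written as plain aggregates over a hypothetical table misc.xy(x numeric, y numeric); it is not code from the paper, whose UDFs generalize the same sums to full matrices:

-- slope = Sxy / Sxx, intercept = avg(y) - slope * avg(x)
SELECT (sum(x*y) - sum(x) * sum(y) / count(*))
     / (sum(x*x) - sum(x) * sum(x) / count(*)) AS slope
      ,avg(y)
       - (sum(x*y) - sum(x) * sum(y) / count(*))
       / (sum(x*x) - sum(x) * sum(x) / count(*)) * avg(x) AS intercept
FROM misc.xy;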

MAD Analytics Examples


Example
What's the right price for my products?

What's the right price for my products?

Get the raw data:

DROP TABLE IF EXISTS misc.price_promo;
CREATE TABLE misc.price_promo (
     dt                      date
    ,base_price              numeric
    ,display_price           numeric
    ,feature_price           numeric
    ,feature_display_price   numeric
    ,tpr                     numeric
    ,volume                  numeric
) DISTRIBUTED BY (dt);

\copy misc.price_promo from data.csv with delimiter ','

Sample of the loaded data:

Date        Base Price  Display Price  Feature Price  Feature/Display Price  TPR   Volume
2009-02-24  7.33        6.67           7.33           7.33                   7.20  20484.52
2009-03-10  7.47        5.94           5.72           7.00                   5.72  34313.94
2009-03-24  7.75        6.74           5.74           7.26                   5.82  25477.33
2009-04-07  7.40        7.19           7.40           7.40                   7.23  18772.57
2009-04-21  7.75        7.36           6.74           7.75                   6.22  20743.68
2009-05-05  7.43        6.56           6.98           7.43                   5.70  28244.82
2009-05-19  7.70        6.57           7.70           7.70                   6.23  20234.74
2009-06-02  6.87        6.67           6.87           6.87                   6.64  23262.60
2009-06-16  7.36        7.00           7.36           7.36                   7.44  19290.87
2009-06-30  6.92        6.72           6.92           6.92                   6.73  23617.61
2009-07-14  7.49        7.32           7.49           7.49                   7.58  18017.58
2009-07-28  7.69        7.44           5.69           7.69                   5.70  29193.44
2009-08-11  7.19        6.24           7.19           7.19                   6.72  23863.13
2009-08-25  7.72        6.74           7.72           7.72                   5.72  25138.34
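A quick sanity check on the load (not shown in the deck):

SELECT count(*) AS rows_loaded
      ,min(dt)  AS first_week
      ,max(dt)  AS last_week
FROM misc.price_promo;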

What's the right price for my products?

Train the model:

CREATE TABLE misc.price_promo_coefs AS
SELECT coefs[1] AS intercept_beta
      ,coefs[2] AS base_price_beta
      ,coefs[3] AS display_price_beta
      ,coefs[4] AS feature_display_price_beta
      ,coefs[5] AS tpr_beta
      ,r2
FROM (
    SELECT mregr_coef(volume, array[1::int, base_price, display_price,
                      feature_display_price, tpr]) AS coefs
          ,mregr_r2(volume, array[1::int, base_price, display_price,
                    feature_display_price, tpr]) AS r2
    FROM misc.price_promo
) AS a
DISTRIBUTED RANDOMLY;

Resulting coefficients:

intercept_beta                72804.48332
base_price_beta                5049.03841
display_price_beta            -1388.842417
feature_display_price_beta    -6203.882026
tpr_beta                      -4801.114351
r2                                0.883172235

What's the right price for my products?

Evaluate the model:

CREATE OR REPLACE VIEW misc.v_price_promo_fitted AS
SELECT volume
      ,volume_fitted
      ,100 * abs(volume - volume_fitted)::numeric / volume AS ape
FROM (
    SELECT p.volume
          ,c.intercept_beta
           + p.base_price * c.base_price_beta
           + p.display_price * c.display_price_beta
           + p.feature_display_price * c.feature_display_price_beta
           + p.tpr * c.tpr_beta AS volume_fitted
    FROM misc.price_promo_coefs c
        ,misc.price_promo p
) AS a;

   volume    | volume_fitted |   ape
-------------+---------------+---------
   20484.52  |    20507.88   |  0.1140
   34313.94  |    31381.52   |  8.5458
   25477.33  |    29591.06   | 16.1466
   18772.57  |    19560.80   |  4.1988
   20743.68  |    23769.63   | 14.5873
   28244.82  |    27746.83   |  1.7630
   20234.74  |    24876.55   | 22.9398
   23262.60  |    23727.72   |  1.9994
   19290.87  |    18862.64   |  2.2198
   23617.61  |    23168.44   |  1.9018
   18017.58  |    17595.93   |  2.3402
   29193.44  |    26224.39   | 10.1702
   23863.13  |    23571.29   |  1.2229
   25138.34  |    27065.91   |  7.6678
   24307.88  |    23945.45   |  1.4909
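With the coefficients in hand, one hypothetical way to answer the slide's question (not part of the deck) is to score a grid of candidate base prices and rank by predicted revenue; the fixed values used for the other drivers below are illustrative only:

-- candidate base prices from $5.50 to $8.00 in 5-cent steps,
-- other drivers held at illustrative values
SELECT g.p / 100.0 AS base_price
      ,(g.p / 100.0) * ( c.intercept_beta
                       + (g.p / 100.0) * c.base_price_beta
                       + 7.00 * c.display_price_beta
                       + 7.00 * c.feature_display_price_beta
                       + 6.50 * c.tpr_beta ) AS revenue_pred
FROM generate_series(550, 800, 5) AS g(p)
    ,misc.price_promo_coefs AS c
ORDER BY revenue_pred DESC
LIMIT 5;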

Example
What are our customers saying about us?


What are our customers saying about us?


How do you discern trends and categories within thousands of on-line conversations?
Search for relevant blogs
Construct a fingerprint for each document based on word frequencies
Use this to define what it means for documents to be similar, or close
Identify clusters of documents

Accessing the data


Build the directory list into a set of files that we will access:

- INPUT:
    NAME: filelist
    FILE:
      - maple:/Users/demo/blogsplog/filelist1
      - maple:/Users/demo/blogsplog/filelist2
    COLUMNS:
      - path text

For each record in the list, "open()" the file and read it in its entirety:

- MAP:
    NAME: read_data
    PARAMETERS: [path text]
    RETURNS: [id int, path text, body text]
    LANGUAGE: python
    FUNCTION: |
      (_, fname) = path.rsplit('/', 1)
      (id, _) = fname.split('.')
      body = open(path).read()
      # emit one output row per file
      yield (int(id), path, body)

  id  | path                                   | body
------+----------------------------------------+----------------------------------
 2482 | /Users/demo/blogsplog/model/2482.html  | <!DOCTYPE html PUBLIC "...
    1 | /Users/demo/blogsplog/model/1.html     | <!DOCTYPE html PUBLIC "...
   10 | /Users/demo/blogsplog/model/1000.html  | <!DOCTYPE html PUBLIC "...
 2484 | /Users/demo/blogsplog/model/2484.html  | <!DOCTYPE html PUBLIC "...
 ...

Parse the documents into word lists


Convert HTML documents into parsed, tokenized, stemmed term lists with stop-word removal:

- MAP:
    NAME: extract_terms
    PARAMETERS: [id integer, body text]
    RETURNS: [id int, title text, doc _text]
    LANGUAGE: python
    FUNCTION: |
      if 'parser' not in SD:
          import ...
          class MyHTMLParser(HTMLParser):
              ...
          SD['parser'] = MyHTMLParser()
      parser = SD['parser']
      parser.reset()
      parser.feed(body)
      yield (id, parser.title, '{"' + '","'.join(parser.doc) + '"}')

Parse the documents into word lists


Use the HTMLParser library to parse the HTML documents and extract titles and body contents:

if 'parser' not in SD:
    from HTMLParser import HTMLParser
    ...
    class MyHTMLParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            ...
        def handle_data(self, data):
            data = data.strip()
            if self.inhead:
                if self.tag == 'title':
                    self.title = data
            if self.inbody:
                ...
parser = SD['parser']
parser.reset()
...

Parse the documents into word lists


Use nltk to tokenize, stem, and remove common terms:
if 'parser' not in SD:
    from nltk import WordTokenizer, PorterStemmer, corpus
    ...
    class MyHTMLParser(HTMLParser):
        def __init__(self):
            ...
            self.tokenizer = WordTokenizer()
            self.stemmer = PorterStemmer()
            self.stopwords = dict(map(lambda x: (x, True), corpus.stopwords.words()))
            ...
        def handle_data(self, data):
            ...
            if self.inbody:
                tokens = self.tokenizer.tokenize(data)
                stems = map(self.stemmer.stem, tokens)
                for x in stems:
                    if len(x) < 4: continue
                    x = x.lower()
                    if x in self.stopwords: continue
                    self.doc.append(x)
parser = SD['parser']
parser.reset()
...

Parse the documents into word lists


Run the MapReduce job and inspect the extracted term lists:

shell$ gpmapreduce -f blog-terms.yml
mapreduce_75643_run_1
DONE

sql# SELECT id, title, doc FROM blog_terms LIMIT 5;

  id  | title          | doc
------+----------------+-----------------------------------------------------------------
 2482 | noodlepie      | {noodlepi,from,gutter,grub,gourmet,tabl,noodlepi,blog,scoff,...
    1 | Bhootakannadi  | {bhootakannadi,2005,unifi,feed,gener,comment,final,integr,...
   10 | Tea Set        | {novelti,dish,goldilock,bear,bowl,lide,contain,august,...
 ...

Create histograms of word frequencies


Extract a term dictionary of terms that show up in more than ten blogs:

sql# SELECT term, sum(c) AS freq, count(*) AS num_blogs
     FROM (
         SELECT id, term, count(*) AS c
         FROM (SELECT id, unnest(doc) AS term FROM blog_terms) term_unnest
         GROUP BY id, term
     ) doc_terms
     WHERE term IS NOT NULL
     GROUP BY term
     HAVING count(*) > 10;

   term   | freq | num_blogs
----------+------+-----------
 sturdi   |   19 |        13
 canon    |   97 |        40
 group    |   48 |        17
 skin     |  510 |       152
 linger   |   19 |        17
 blunt    |   20 |        17
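The next step selects from a blog_term_freq table, so the frequency query above was presumably materialized along the way; a sketch of that step (assumed, not shown in the deck):

CREATE TABLE blog_term_freq AS
SELECT term, sum(c) AS freq, count(*) AS num_blogs
FROM (
    SELECT id, term, count(*) AS c
    FROM (SELECT id, unnest(doc) AS term FROM blog_terms) term_unnest
    GROUP BY id, term
) doc_terms
WHERE term IS NOT NULL
GROUP BY term
HAVING count(*) > 10
DISTRIBUTED BY (term);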

Create histograms of word frequencies


Use the term frequencies to construct the term dictionary
sql# SELECT array(SELECT term FROM blog_term_freq) dictionary;

dictionary --------------------------------------------------------------------{sturdi,canon,group,skin,linger,blunt,detect,giver,annoy,telephon,...

Then use the term dictionary to construct feature vectors for every document, mapping document terms to the features in the dictionary:
sql# SELECT id, gp_extract_feature_histogram(dictionary, doc) FROM blog_terms, blog_features;

id | term_count -----+---------------------------------------------------------------2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,2,0,4,0,1,0,1,0,1,...} 1 | {41,1,34,1,22,1,125,1,387,...}:{0,9,0,1,0,1,0,1,0,3,0,2,...} 10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2,0,6,0,12,0,3,0,1,0,1,...} ...
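Later steps read from blog_features and blog_histogram, so the dictionary array and the histogram output are presumably materialized as well; a sketch (assumed, not shown in the deck):

CREATE TABLE blog_features AS
SELECT array(SELECT term FROM blog_term_freq) AS dictionary;

CREATE TABLE blog_histogram AS
SELECT id, gp_extract_feature_histogram(dictionary, doc) AS term_count
FROM blog_terms, blog_features
DISTRIBUTED BY (id);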


Create histograms of word frequencies


Format of a sparse vector: the first array holds run lengths and the second the corresponding values, so {3,1,40,...}:{0,2,0,...} expands to three 0s, one 2, forty 0s, and so on.

 id  | term_count
-----+----------------------------------------------------------------------
 ...
 10  | {3,1,40,...}:{0,2,0,...}
 ...

Dense representation of the same vector:

 id  | term_count
-----+----------------------------------------------------------------------
 ...
 10  | {0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...}
 ...

 dictionary
------------------------------------------------------------------------------
 {sturdi,canon,group,skin,linger,blunt,detect,giver,...}

The document it represents (the fourth dictionary term, skin, appears twice):

 {skin, skin, ...}

Transform the blog terms into statistically useful measures


Use the feature vectors to construct TFxIDF vectors. These measure the importance of terms: tfxidf(t, d) = tf(t, d) * log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t.

sql# SELECT id, (term_count * logidf) tfxidf
     FROM blog_histogram,
          ( SELECT log(count(*) / count_vec(term_count)) logidf
            FROM blog_histogram ) blog_logidf;

  id  | tfxidf
------+--------------------------------------------------------------------
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.34311110...}
    1 | {41,1,34,1,22,1,125,1,387,...}:{0,0.771999985977529,0,1.999427...}
   10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2.95439664949608,0,3.2006935...}
 ...

Create document clusters around iteratively defined centroids


Now that we have TFxIDF vectors, we have a statistically meaningful representation of each document, which enables all sorts of real analytics. The current example is k-means clustering, which requires two operations.

First, we compute a distance metric between the documents and a random selection of centroids, for instance a distance based on cosine similarity:

sql# SELECT id, tfxidf, cid,
            ACOS( (tfxidf %*% centroid)
                  / (svec_l2norm(tfxidf) * svec_l2norm(centroid)) ) AS distance
     FROM blog_tfxidf, blog_centroids;

  id  | tfxidf                                                             | cid | distance
------+--------------------------------------------------------------------+-----+------------
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...}  |   1 | 1.53672977
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...}  |   2 | 1.55720354
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...}  |   3 | 1.55040145
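The query above assumes a blog_centroids table holding the "random selection of centroids" (and a blog_tfxidf table materialized from the TFxIDF query). A minimal way to seed k = 3 starting centroids might look like this; it is a sketch, not code from the deck:

CREATE TABLE blog_centroids AS
SELECT row_number() OVER (ORDER BY random()) AS cid
      ,tfxidf AS centroid
FROM (SELECT tfxidf FROM blog_tfxidf ORDER BY random() LIMIT 3) seed
DISTRIBUTED BY (cid);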

Create document clusters around iteratively defined centroids


Next, use an averaging metric to re-center the mean of a cluster:
sql# SELECT cid, sum(tfxidf) / count(*) AS centroid
     FROM (
         SELECT id, tfxidf, cid,
                row_number() OVER (PARTITION BY id ORDER BY distance, cid) AS rank
         FROM blog_distance
     ) blog_rank
     WHERE rank = 1
     GROUP BY cid;

 cid | centroid
-----+--------------------------------------------------------------------------
   3 | {1,1,1,1,1,1,1,1,1,...}:{0.157556041103536,0.0635233900749665,0.050...}
   2 | {1,1,1,1,1,1,3,1,1,...}:{0.0671131209568817,0.332220028552986,0,0.0...}
   1 | {1,1,1,1,1,1,1,1,1,...}:{0.103874521481016,0.158213686890834,0.0540...}

Repeat the previous two operations until the centroids converge, and you have k-means clustering.

MAD Analytics in Practice


MAD Skills in practice

Extracted data from the EDW and other source systems into a new analytic sandbox
Generated a social graph from call detail records and subscriber data
Within 2 weeks, uncovered behavior where connected subscribers were seven times more likely to churn than the average user
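A sketch of the social-graph step, assuming a hypothetical call-detail table cdr(caller, callee, call_ts, duration); the deck does not show its actual schema or SQL:

CREATE TABLE social_graph AS
SELECT caller, callee, count(*) AS n_calls, sum(duration) AS total_minutes
FROM cdr
GROUP BY caller, callee
DISTRIBUTED BY (caller);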

Retention models

Customer Retention
Identify those at risk of abandoning their accounts, using logistic regression models or SAS scoring models.

[Figure: sample scoring table with columns Credit Card Number, SSN, Probability of Churn, and Probability of Fraud]

Also used to predict:
fraud in on-line and financial transactions
hospital return visits
etc.

Segmentation
Using segments
Create clusters of customers based on profiles, product usage, etc.
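A sketch of what the clustering input could look like, rolling the same hypothetical cdr table up to one feature row per customer (assumed, not shown in the deck):

CREATE TABLE customer_profile AS
SELECT caller                   AS customer_id
      ,count(*)                 AS n_calls
      ,count(DISTINCT callee)   AS n_contacts
      ,avg(duration)            AS avg_call_minutes
FROM cdr
GROUP BY caller
DISTRIBUTED BY (customer_id);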


Association Rules
Using segments
For low- or medium-value customers, compute possible new products using association rules

[Figure: Products A and B associated with Products X, Y, and Z]

Segmentation and Association Rules


Using segments
Filter down to products associated with high-value customers in the same segment (a sketch of the underlying association query follows below).

[Figure: Products A and B associated with Products X, Y, and Z]
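A sketch of the association-rules computation, assuming a hypothetical purchases(customer_id, product) table with one row per customer-product pair; support and confidence below are the usual pairwise definitions, not code from the deck:

SELECT a.product AS antecedent
      ,b.product AS consequent
      ,count(*)::float
       / (SELECT count(DISTINCT customer_id) FROM purchases)        AS support
      ,count(*)::float
       / (SELECT count(*) FROM purchases WHERE product = a.product) AS confidence
FROM purchases a
JOIN purchases b
  ON a.customer_id = b.customer_id
 AND a.product <> b.product
GROUP BY a.product, b.product
ORDER BY confidence DESC;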

Questions
