Greenplum: MAD Analytics in Practice (MMDS, June 16, 2010)
Warehousing Today
Slow-moving models
Shadow systems
Slow-moving data
Departmental warehouses
New Realities
Greenplum Overview
[Architecture diagram: SQL and MapReduce interfaces; Network Interconnect; Segment Servers for query processing & data storage; External Sources for loading, streaming, etc.]
Online Expansion: dynamically provision new servers with no downtime
MapReduce Support: parallel programming on data for advanced analytics
Polymorphic Storage: support for both row- and column-oriented storage
Data Size
6.5 petabytes of user data; loading 18 TB every day (130 billion rows); every click on eBay's website, 365 days a year; a 20-trillion-row fact table

Hardware
96-node Sun Data Warehouse Appliance, with expansion to 192 nodes

Benefit
Scalability and price/performance; a cost-effective complement to Teradata
MAD Analytics
Magnetic
[Diagram: analytics data attracted into the EDC Platform]
Agile (Chorus)
[Diagram: an iterative cycle of getting data into the cloud, analyzing and modeling in the cloud, and pushing results back into the cloud]
Deep
[Diagram: quadrants spanning Past vs. Future and Facts vs. Interpretation]
Functions: e.g., probability densities
Functionals (i.e., functions on functions): e.g., A/B testing, a functional over densities
These significantly enhance the vocabulary of the DBMS! Related work appeared at NIPS 2006 using MapReduce syntax.
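For instance, a two-sample A/B test reduces two empirical distributions to a single number. A minimal sketch in Python (welch_t is our own name for illustration; numpy stands in here for the in-database version):

import numpy as np

def welch_t(a, b):
    # A/B test as a functional: map two empirical densities (here, raw
    # samples from variants A and B) to a single test statistic.
    a, b = np.asarray(a, float), np.asarray(b, float)
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

rng = np.random.default_rng(0)
t = welch_t(rng.normal(7.4, 0.5, 200), rng.normal(7.1, 0.5, 200))
print(t)  # a large |t| suggests the two variants really differ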
Example
What's the right price for my products?
Sample rows (dt, two of the price columns, volume):

dt          price  price  volume
2009-02-24  7.33   7.20   20484.52
2009-03-10  7.47   5.72   34313.94
2009-03-24  7.75   5.82   25477.33
2009-04-07  7.40   7.23   18772.57
2009-04-21  7.75   6.22   20743.68
2009-05-05  7.43   5.70   28244.82
2009-05-19  7.70   6.23   20234.74
2009-06-02  6.87   6.64   23262.60
2009-06-16  7.36   7.44   19290.87
2009-06-30  6.92   6.73   23617.61
2009-07-14  7.49   7.58   18017.58
2009-07-28  7.69   5.70   29193.44
2009-08-11  7.19   6.72   23863.13
2009-08-…   7.72   5.72   25138.34
CREATE TABLE misc.price_promo (
     dt date
    ,base_price numeric
    ,display_price numeric
    ,feature_price numeric
    ,feature_display_price numeric
    ,tpr numeric
    ,volume numeric
) DISTRIBUTED BY (dt);

\copy misc.price_promo from data.csv with delimiter ','
CREATE TABLE misc.price_model AS
SELECT * FROM (
    -- mregr_coef() reconstructed here to match the coefficient output below
    SELECT mregr_coef(volume, array[1::int, base_price, display_price,
                                    feature_display_price, tpr]) AS coef
          ,mregr_r2(volume, array[1::int, base_price, display_price,
                                  feature_display_price, tpr]) AS r2
    FROM misc.price_promo
) AS a DISTRIBUTED RANDOMLY;

Output excerpt: a coefficient of -4801.114351 and r2 = 0.883172235.
Fitted volumes and absolute percentage error:

volume_fitted  ape
20507.88       0.1140
31381.52       8.5458
29591.06       16.1466
19560.80       4.1988
23769.63       14.5873
27746.83       1.7630
24876.55       22.9398
23727.72       1.9994
18862.64       2.2198
23168.44       1.9018
17595.93       2.3402
26224.39       10.1702
23571.29       1.2229
27065.91       7.6678
23945.45       1.4909
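The same fit can be sketched outside the database with ordinary least squares in numpy (the design matrix mirrors the array[...] expression above; the data rows here are made-up placeholders, not the slide's data):

import numpy as np

def fit_volume(X, volume):
    # Least-squares fit of volume on [1, base_price, display_price,
    # feature_display_price, tpr], plus R^2 and absolute percentage error.
    coef, *_ = np.linalg.lstsq(X, volume, rcond=None)
    fitted = X @ coef
    resid = volume - fitted
    tss = ((volume - volume.mean()) ** 2).sum()
    r2 = 1 - (resid ** 2).sum() / tss
    ape = 100 * np.abs(resid) / volume
    return coef, r2, fitted, ape

X = np.array([[1, 7.33, 7.20, 6.10, 0.0],
              [1, 7.47, 5.72, 6.00, 0.5],
              [1, 7.75, 5.82, 6.20, 0.0],
              [1, 7.40, 7.23, 6.15, 0.3],
              [1, 7.75, 6.22, 6.05, 0.0],
              [1, 7.43, 5.70, 6.00, 0.4]])
volume = np.array([20484.52, 34313.94, 25477.33, 18772.57, 20743.68, 28244.82])
coef, r2, fitted, ape = fit_volume(X, volume)
print(coef, r2)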
Example
What are our customers saying about us?
For each record in the list, "open()" the file and read it in its entirety:
- MAP:
    NAME: read_data
    PARAMETERS: [path text]
    RETURNS: [id int, path text, body text]
    LANGUAGE: python
    FUNCTION: |
        # Take the document id from the file name, then read the file whole.
        (_, fname) = path.rsplit('/', 1)
        (id, _) = fname.split('.')
        body = open(path).read()
        yield (int(id), path, body)  # emit one output row per input path
  id  | path                                   | body
------+----------------------------------------+-------------------------------
 2482 | /Users/demo/blogsplog/model/2482.html  | <!DOCTYPE html PUBLIC "...
    1 | /Users/demo/blogsplog/model/1.html     | <!DOCTYPE html PUBLIC "...
   10 | /Users/demo/blogsplog/model/1000.html  | <!DOCTYPE html PUBLIC "...
 2484 | /Users/demo/blogsplog/model/2484.html  | <!DOCTYPE html PUBLIC "...
...
dictionary
---------------------------------------------------------------------
{sturdi,canon,group,skin,linger,blunt,detect,giver,annoy,telephon,...
Then use the term dictionary to construct feature vectors for every document, mapping document terms to the features in the dictionary:
sql# SELECT id, gp_extract_feature_histogram(dictionary, doc) FROM blog_terms, blog_features;
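The idea behind gp_extract_feature_histogram can be sketched in plain Python (the function below is our own stand-in; the real UDF returns Greenplum sparse vectors rather than Python lists):

from collections import Counter

def extract_feature_histogram(dictionary, doc_terms):
    # Count each document term and place the count at the term's
    # position in the dictionary; terms outside the dictionary are dropped.
    index = {term: i for i, term in enumerate(dictionary)}
    counts = [0] * len(dictionary)
    for term, n in Counter(doc_terms).items():
        if term in index:
            counts[index[term]] = n
    return counts

dictionary = ['sturdi', 'canon', 'group', 'skin', 'linger', 'blunt']
print(extract_feature_histogram(dictionary, ['canon', 'skin', 'canon']))
# -> [0, 2, 0, 1, 0, 0]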
sql# SELECT id, (term_count*logidf) tfxidf
     FROM blog_histogram,
          ( SELECT log(count(*)/count_vec(term_count)) logidf
            FROM blog_histogram ) blog_logidf;

  id  | tfxidf
------+------------------------------------------------------------------
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.34311110...}
    1 | {41,1,34,1,22,1,125,1,387,...}:{0,0.771999985977529,0,1.999427...}
   10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2.95439664949608,0,3.2006935...}
...
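The same tf×idf computation over a dense docs-by-terms count matrix, as a numpy sketch (log(count(*)/count_vec(term_count)) above is the per-term log inverse document frequency):

import numpy as np

def tfxidf(term_counts):
    # term_counts: one row per document, one column per dictionary term.
    n_docs = term_counts.shape[0]
    df = np.count_nonzero(term_counts, axis=0)   # document frequency per term
    logidf = np.log(n_docs / np.maximum(df, 1))  # guard terms appearing nowhere
    return term_counts * logidf

counts = np.array([[3, 0, 1],
                   [0, 2, 1],
                   [1, 0, 4]])
print(tfxidf(counts))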
First, we compute a distance metric between the documents and a random selection of centroids, for instance cosine similarity:
sql# SELECT id, tfxidf, cid,
            ACOS((tfxidf %*% centroid) /
                 (svec_l2norm(tfxidf) * svec_l2norm(centroid))) AS distance
     FROM blog_tfxidf, blog_centroids;

  id  | tfxidf                                                             | cid | distance
------+--------------------------------------------------------------------+-----+------------
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...}  |   1 | 1.53672977
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...}  |   2 | 1.55720354
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...}  |   3 | 1.55040145
Repeat the previous two operations until the centroids converge, and you have k-means clustering.
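Put together outside the database, the loop looks like this: a minimal numpy sketch over dense vectors, using the same ACOS-of-cosine distance as the query above (the SQL version keeps everything in tables and sparse vectors):

import numpy as np

def kmeans_cosine(docs, k, iters=20, seed=0):
    # Assign each doc to the nearest centroid by angular distance,
    # recompute the centroids, and repeat until they stop moving.
    rng = np.random.default_rng(seed)
    centroids = docs[rng.choice(len(docs), k, replace=False)]
    for _ in range(iters):
        norms = np.linalg.norm(docs, axis=1)[:, None] \
              * np.linalg.norm(centroids, axis=1)
        cos = np.clip(docs @ centroids.T / norms, -1.0, 1.0)
        assign = np.argmin(np.arccos(cos), axis=1)
        new = np.array([docs[assign == c].mean(axis=0) if (assign == c).any()
                        else centroids[c] for c in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return assign, centroids

docs = np.abs(np.random.default_rng(1).normal(size=(100, 6)))  # stand-in tf-idf rows
assign, centroids = kmeans_cosine(docs, k=3)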
Extracted data from the EDW and other source systems into a new analytic sandbox
Generated a social graph from call detail records and subscriber data
Within 2 weeks, uncovered behavior where connected subscribers were seven times more likely to churn than the average user
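A sketch of the graph-building step in Python (the (caller, callee) schema is our assumption; real call detail records carry many more fields):

from collections import Counter

def build_social_graph(cdrs):
    # Collapse call detail records into undirected, call-count-weighted
    # edges between subscribers.
    edges = Counter()
    for caller, callee in cdrs:
        edges[tuple(sorted((caller, callee)))] += 1
    return edges

cdrs = [('alice', 'bob'), ('bob', 'alice'), ('alice', 'carol'), ('bob', 'dave')]
print(build_social_graph(cdrs))
# Counter({('alice', 'bob'): 2, ('alice', 'carol'): 1, ('bob', 'dave'): 1})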
Customer Retention (retention models)
Identify customers at risk of abandoning their accounts, using logistic regression or SAS scoring models (a sketch follows below).
[Illustration: sample credit-card account numbers with a 15% call-out]
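A minimal logistic-regression scorer in Python (plain gradient ascent on the log-likelihood; the features and data are hypothetical, and in practice this would run in-database or through SAS):

import numpy as np

def fit_logistic(X, y, lr=0.1, iters=500):
    # Returns weights w such that P(churn) = sigmoid(X @ w).
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w += lr * X.T @ (y - p) / len(y)  # log-likelihood gradient step
    return w

# Hypothetical features: [bias, days_since_last_login, support_calls]
X = np.array([[1, 2, 0], [1, 40, 3], [1, 5, 1],
              [1, 60, 4], [1, 1, 0], [1, 30, 2]], dtype=float)
y = np.array([0, 1, 0, 1, 0, 1])          # 1 = churned
w = fit_logistic(X, y)
risk = 1.0 / (1.0 + np.exp(-(X @ w)))     # churn score per customer
print(risk.round(2))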
Segmentation
Using segments
Create clusters of customers based on profiles, product usage, etc.
Association Rules
Using segments
For low- or medium-value customers, compute possible new products using association rules
[Diagram: products A and B associated with products X, Y, and Z]
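A toy support/confidence computation in Python (brute-force pair counting rather than a full Apriori pass; baskets and thresholds are made up):

from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.2, min_conf=0.5):
    # Yield rules lhs -> rhs whose support and confidence clear the thresholds.
    n = len(baskets)
    item_ct, pair_ct = Counter(), Counter()
    for basket in baskets:
        items = set(basket)
        item_ct.update(items)
        pair_ct.update(combinations(sorted(items), 2))
    for (a, b), ct in pair_ct.items():
        support = ct / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            if ct / item_ct[lhs] >= min_conf:
                yield lhs, rhs, support, ct / item_ct[lhs]

baskets = [{'A', 'B', 'X'}, {'A', 'B', 'Y'}, {'A', 'Z'}, {'B', 'X'}, {'A', 'B', 'X'}]
for lhs, rhs, s, c in pair_rules(baskets):
    print(f"{lhs} -> {rhs}  support={s:.2f}  confidence={c:.2f}")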
Questions