
Greenplum: MAD Analytics in Practice

MMDS June 16th, 2010


Warehousing Today


In the Days of Kings and Priests

Computers and Data: Crown Jewels
Executives depend on computers, but cannot work with them directly

The DBA Priesthood
And their acronymia: EDW, BI, OLAP, 3NF
Secret functions and techniques, expensive tools

The Architected EDW

Slow-moving models

Non-standard, in-memory analytics

Shadow systems

Slow-moving data

Shallow Business Intelligence

Departmental warehouses

Static schemas accrete over time

New Realities

Welcome to the Petabyte Age


TB disks < $100
Everything is data
Rise of data-driven culture
Very publicly espoused by Google, Netflix, etc.; Terraserver, USAspending.gov

The New Practitioners

Aggressively datavorous
Statistically savvy
Diverse in training and tools

Greenplum Overview


Greenplum Database Architecture


PL/R, PL/Perl, PL/Python
SQL, MapReduce

MPP (Massively Parallel Processing) Shared-Nothing Architecture

Master Servers
Query planning & dispatch

Network Interconnect

Segment Servers
Query processing & data storage

External Sources
Loading, streaming, etc.

Key Technical Innovations

Scatter-Gather Data Streaming
Industry-leading data loading capabilities (see the external-table sketch below)

Online Expansion
Dynamically provision new servers with no downtime

MapReduce Support
Parallel programming on data for advanced analytics

Polymorphic Storage
Support for both row- and column-oriented storage
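As a concrete illustration of the loading path, here is a minimal sketch of a gpfdist-backed external table; the host, port, file pattern, and table names (including the target table misc.clicks) are illustrative assumptions, not taken from the deck:

CREATE EXTERNAL TABLE misc.ext_clicks (
     ts       timestamp
    ,user_id  bigint
    ,url      text
)
LOCATION ('gpfdist://etl1:8081/clicks*.txt')
FORMAT 'TEXT' (DELIMITER '|');

-- all segments pull from the gpfdist file server in parallel (scatter-gather)
INSERT INTO misc.clicks SELECT * FROM misc.ext_clicks;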

Benefits of the Greenplum Database Architecture


Simplicity
Parallelism is automatic: no manual partitioning required
No complex tuning required: just load and query

Scalability
Linear scalability to 1000s of cores, on commodity hardware
Each node adds storage, query performance, and loading performance
Example: 6.5 petabytes on 96 nodes, with 17 trillion records

Flexibility
Full parallelism for SQL92, SQL99, SQL2003 OLAP, and MapReduce
Any schema (star, snowflake, 3NF, hybrid, etc.)
Rich extensibility and language support (Perl, Python, R, C, etc.)

Customer Example: eBay Petabyte-Scale


Business Problem
Fraud detection and click-stream analytics

Data Size
6.5 petabytes of user data
Loading 18 TB every day (130 billion rows)
Every click on eBay's website, 365 days a year
20-trillion-row fact table

Hardware
96-node Sun Data Warehouse Appliance
Expansion planned to 192 nodes

Benefit
Scalability and price/performance
Cost-effective complement to Teradata

MAD Analytics


Magnetic

Simple linear models, trend analysis
[Figure: Analytics and Data drawn into the EDC Platform (Chorus)]

Agile

Get data into the cloud
Analyze and model in the cloud
Push results back into the cloud

Deep

[Figure: quadrant of analytics questions, Facts vs. Interpretation and Past vs. Future. Past + Facts: What happened, where and when? Past + Interpretation: How and why did it happen? Future + Facts: What will happen? Future + Interpretation: How can we do better?]

Dolan's Vocabulary of Statistics

Data mining focused on individuals; statistical analysis needs more: a focus on density methods. We need to be able to utter statistical sentences, and run them massively parallel, on Big Data!

1. (Scalar) arithmetic
2. Vector arithmetic (i.e., linear algebra)
3. Functions (e.g., probability densities)
4. Functionals (i.e., functions on functions; e.g., A/B testing: a functional over densities; see the SQL sketch below)
5. Misc. statistical methods (e.g., resampling)
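A hedged illustration of item 4 (not from the talk): an A/B-test style "statistical sentence" in plain SQL, computing a two-sample (Welch) t statistic with ordinary aggregates over a hypothetical table obs(grp, x):

-- difference in group means, in units of its estimated standard error
SELECT ( avg(CASE WHEN grp = 'A' THEN x END)
       - avg(CASE WHEN grp = 'B' THEN x END) )
     / sqrt( var_samp(CASE WHEN grp = 'A' THEN x END)
             / sum(CASE WHEN grp = 'A' THEN 1 ELSE 0 END)
           + var_samp(CASE WHEN grp = 'B' THEN x END)
             / sum(CASE WHEN grp = 'B' THEN 1 ELSE 0 END) ) AS t_statistic
FROM obs;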

MAD Skills Whitepaper

The paper includes parallelizable, stat-like SQL for:

Linear algebra (vectors/matrices)
Ordinary Least Squares (multiple linear regression; see the one-variable sketch below)
Conjugate Gradient (iterative optimization, e.g. for SVM classifiers)
Functionals, including the Mann-Whitney U test and log-likelihood ratios
Resampling techniques, e.g. bootstrapping

Encapsulated as stored procedures or UDFs, these significantly enhance the vocabulary of the DBMS! Related work appears in NIPS '06, using MapReduce syntax.

These are examples; plenty of research to do here!
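To give the flavor of "stat-like SQL", here is a minimal one-variable Ordinary Least Squares sketch written as plain aggregates over a hypothetical table misc.xy(x numeric, y numeric); it is not code from the paper, whose UDFs generalize the same sums to full matrices:

-- slope = Sxy / Sxx, intercept = avg(y) - slope * avg(x)
SELECT (sum(x*y) - sum(x) * sum(y) / count(*))
     / (sum(x*x) - sum(x) * sum(x) / count(*)) AS slope
      ,avg(y)
       - (sum(x*y) - sum(x) * sum(y) / count(*))
       / (sum(x*x) - sum(x) * sum(x) / count(*)) * avg(x) AS intercept
FROM misc.xy;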

MAD Analytics Examples


Example
What's the right price for my products?

What's the right price for my products?

Get the raw data:

DROP TABLE IF EXISTS misc.price_promo;
CREATE TABLE misc.price_promo (
     dt                      date
    ,base_price              numeric
    ,display_price           numeric
    ,feature_price           numeric
    ,feature_display_price   numeric
    ,tpr                     numeric
    ,volume                  numeric
) DISTRIBUTED BY (dt);

\copy misc.price_promo from data.csv with delimiter ','

Sample of the loaded data:

Date        Base Price  Display Price  Feature Price  Feature/Display Price  TPR   Volume
2009-02-24  7.33        6.67           7.33           7.33                   7.20  20484.52
2009-03-10  7.47        5.94           5.72           7.00                   5.72  34313.94
2009-03-24  7.75        6.74           5.74           7.26                   5.82  25477.33
2009-04-07  7.40        7.19           7.40           7.40                   7.23  18772.57
2009-04-21  7.75        7.36           6.74           7.75                   6.22  20743.68
2009-05-05  7.43        6.56           6.98           7.43                   5.70  28244.82
2009-05-19  7.70        6.57           7.70           7.70                   6.23  20234.74
2009-06-02  6.87        6.67           6.87           6.87                   6.64  23262.60
2009-06-16  7.36        7.00           7.36           7.36                   7.44  19290.87
2009-06-30  6.92        6.72           6.92           6.92                   6.73  23617.61
2009-07-14  7.49        7.32           7.49           7.49                   7.58  18017.58
2009-07-28  7.69        7.44           5.69           7.69                   5.70  29193.44
2009-08-11  7.19        6.24           7.19           7.19                   6.72  23863.13
2009-08-25  7.72        6.74           7.72           7.72                   5.72  25138.34
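A quick sanity check on the load (not shown in the deck):

SELECT count(*) AS rows_loaded
      ,min(dt)  AS first_week
      ,max(dt)  AS last_week
FROM misc.price_promo;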

What's the right price for my products?

Train the model:

CREATE TABLE misc.price_promo_coefs AS
SELECT coefs[1] AS intercept_beta
      ,coefs[2] AS base_price_beta
      ,coefs[3] AS display_price_beta
      ,coefs[4] AS feature_display_price_beta
      ,coefs[5] AS tpr_beta
      ,r2
FROM (
    SELECT mregr_coef(volume, array[1::int, base_price, display_price,
                      feature_display_price, tpr]) AS coefs
          ,mregr_r2(volume, array[1::int, base_price, display_price,
                    feature_display_price, tpr]) AS r2
    FROM misc.price_promo
) AS a
DISTRIBUTED RANDOMLY;

Resulting coefficients:

intercept_beta                72804.48332
base_price_beta                5049.03841
display_price_beta            -1388.842417
feature_display_price_beta    -6203.882026
tpr_beta                      -4801.114351
r2                                0.883172235

What's the right price for my products?

Evaluate the model:

CREATE OR REPLACE VIEW misc.v_price_promo_fitted AS
SELECT volume
      ,volume_fitted
      ,100 * abs(volume - volume_fitted)::numeric / volume AS ape
FROM (
    SELECT p.volume
          ,c.intercept_beta
           + p.base_price * c.base_price_beta
           + p.display_price * c.display_price_beta
           + p.feature_display_price * c.feature_display_price_beta
           + p.tpr * c.tpr_beta AS volume_fitted
    FROM misc.price_promo_coefs c
        ,misc.price_promo p
) AS a;

   volume    | volume_fitted |   ape
-------------+---------------+---------
   20484.52  |    20507.88   |  0.1140
   34313.94  |    31381.52   |  8.5458
   25477.33  |    29591.06   | 16.1466
   18772.57  |    19560.80   |  4.1988
   20743.68  |    23769.63   | 14.5873
   28244.82  |    27746.83   |  1.7630
   20234.74  |    24876.55   | 22.9398
   23262.60  |    23727.72   |  1.9994
   19290.87  |    18862.64   |  2.2198
   23617.61  |    23168.44   |  1.9018
   18017.58  |    17595.93   |  2.3402
   29193.44  |    26224.39   | 10.1702
   23863.13  |    23571.29   |  1.2229
   25138.34  |    27065.91   |  7.6678
   24307.88  |    23945.45   |  1.4909
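With the coefficients in hand, one hypothetical way to answer the slide's question (not part of the deck) is to score a grid of candidate base prices and rank by predicted revenue; the fixed values used for the other drivers below are illustrative only:

-- candidate base prices from $5.50 to $8.00 in 5-cent steps,
-- other drivers held at illustrative values
SELECT g.p / 100.0 AS base_price
      ,(g.p / 100.0) * ( c.intercept_beta
                       + (g.p / 100.0) * c.base_price_beta
                       + 7.00 * c.display_price_beta
                       + 7.00 * c.feature_display_price_beta
                       + 6.50 * c.tpr_beta ) AS revenue_pred
FROM generate_series(550, 800, 5) AS g(p)
    ,misc.price_promo_coefs AS c
ORDER BY revenue_pred DESC
LIMIT 5;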

Example
What are our customers saying about us?


What are our customers saying about us?


How do you discern trends and categories within thousands of on-line conversations?
Search for relevant blogs
Construct a fingerprint for each document based on word frequencies
Use this to define what it means for documents to be similar, or close
Identify clusters of documents

Accessing the data


Build the directory list into a set of files that we will access:

- INPUT:
    NAME: filelist
    FILE:
      - maple:/Users/demo/blogsplog/filelist1
      - maple:/Users/demo/blogsplog/filelist2
    COLUMNS:
      - path text

For each record in the list, "open()" the file and read it in its entirety:

- MAP:
    NAME: read_data
    PARAMETERS: [path text]
    RETURNS: [id int, path text, body text]
    LANGUAGE: python
    FUNCTION: |
      (_, fname) = path.rsplit('/', 1)
      (id, _) = fname.split('.')
      body = open(path).read()
      # emit one output row per file
      yield (int(id), path, body)

  id  | path                                   | body
------+----------------------------------------+----------------------------------
 2482 | /Users/demo/blogsplog/model/2482.html  | <!DOCTYPE html PUBLIC "...
    1 | /Users/demo/blogsplog/model/1.html     | <!DOCTYPE html PUBLIC "...
   10 | /Users/demo/blogsplog/model/1000.html  | <!DOCTYPE html PUBLIC "...
 2484 | /Users/demo/blogsplog/model/2484.html  | <!DOCTYPE html PUBLIC "...
 ...

Parse the documents into word lists


Convert HTML documents into parsed, tokenized, stemmed term lists with stop-word removal:

- MAP:
    NAME: extract_terms
    PARAMETERS: [id integer, body text]
    RETURNS: [id int, title text, doc _text]
    LANGUAGE: python
    FUNCTION: |
      if 'parser' not in SD:
          import ...
          class MyHTMLParser(HTMLParser):
              ...
          SD['parser'] = MyHTMLParser()
      parser = SD['parser']
      parser.reset()
      parser.feed(body)
      yield (id, parser.title, '{"' + '","'.join(parser.doc) + '"}')

Parse the documents into word lists


Use the HTMLParser library to parse the HTML documents and extract titles and body contents:

if 'parser' not in SD:
    from HTMLParser import HTMLParser
    ...
    class MyHTMLParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            ...
        def handle_data(self, data):
            data = data.strip()
            if self.inhead:
                if self.tag == 'title':
                    self.title = data
            if self.inbody:
                ...
parser = SD['parser']
parser.reset()
...

Parse the documents into word lists


Use nltk to tokenize, stem, and remove common terms:
if 'parser' not in SD:
    from nltk import WordTokenizer, PorterStemmer, corpus
    ...
    class MyHTMLParser(HTMLParser):
        def __init__(self):
            ...
            self.tokenizer = WordTokenizer()
            self.stemmer = PorterStemmer()
            self.stopwords = dict(map(lambda x: (x, True), corpus.stopwords.words()))
            ...
        def handle_data(self, data):
            ...
            if self.inbody:
                tokens = self.tokenizer.tokenize(data)
                stems = map(self.stemmer.stem, tokens)
                for x in stems:
                    if len(x) < 4: continue
                    x = x.lower()
                    if x in self.stopwords: continue
                    self.doc.append(x)
parser = SD['parser']
parser.reset()
...

Parse the documents into word lists


Run the MapReduce job and inspect the extracted term lists:

shell$ gpmapreduce -f blog-terms.yml
mapreduce_75643_run_1
DONE

sql# SELECT id, title, doc FROM blog_terms LIMIT 5;

  id  | title          | doc
------+----------------+-----------------------------------------------------------------
 2482 | noodlepie      | {noodlepi,from,gutter,grub,gourmet,tabl,noodlepi,blog,scoff,...
    1 | Bhootakannadi  | {bhootakannadi,2005,unifi,feed,gener,comment,final,integr,...
   10 | Tea Set        | {novelti,dish,goldilock,bear,bowl,lide,contain,august,...
 ...

Create histograms of word frequencies


Extract a term dictionary of terms that show up in more than ten blogs:

sql# SELECT term, sum(c) AS freq, count(*) AS num_blogs
     FROM (
         SELECT id, term, count(*) AS c
         FROM (SELECT id, unnest(doc) AS term FROM blog_terms) term_unnest
         GROUP BY id, term
     ) doc_terms
     WHERE term IS NOT NULL
     GROUP BY term
     HAVING count(*) > 10;

   term   | freq | num_blogs
----------+------+-----------
 sturdi   |   19 |        13
 canon    |   97 |        40
 group    |   48 |        17
 skin     |  510 |       152
 linger   |   19 |        17
 blunt    |   20 |        17
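The next step selects from a blog_term_freq table, so the frequency query above was presumably materialized along the way; a sketch of that step (assumed, not shown in the deck):

CREATE TABLE blog_term_freq AS
SELECT term, sum(c) AS freq, count(*) AS num_blogs
FROM (
    SELECT id, term, count(*) AS c
    FROM (SELECT id, unnest(doc) AS term FROM blog_terms) term_unnest
    GROUP BY id, term
) doc_terms
WHERE term IS NOT NULL
GROUP BY term
HAVING count(*) > 10
DISTRIBUTED BY (term);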

Create histograms of word frequencies


Use the term frequencies to construct the term dictionary
sql# SELECT array(SELECT term FROM blog_term_freq) dictionary;

dictionary --------------------------------------------------------------------{sturdi,canon,group,skin,linger,blunt,detect,giver,annoy,telephon,...

Then use the term dictionary to construct feature vectors for every document, mapping document terms to the features in the dictionary:
sql# SELECT id, gp_extract_feature_histogram(dictionary, doc) FROM blog_terms, blog_features;

id | term_count -----+---------------------------------------------------------------2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,2,0,4,0,1,0,1,0,1,...} 1 | {41,1,34,1,22,1,125,1,387,...}:{0,9,0,1,0,1,0,1,0,3,0,2,...} 10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2,0,6,0,12,0,3,0,1,0,1,...} ...
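Later steps read from blog_features and blog_histogram, so the dictionary array and the histogram output are presumably materialized as well; a sketch (assumed, not shown in the deck):

CREATE TABLE blog_features AS
SELECT array(SELECT term FROM blog_term_freq) AS dictionary;

CREATE TABLE blog_histogram AS
SELECT id, gp_extract_feature_histogram(dictionary, doc) AS term_count
FROM blog_terms, blog_features
DISTRIBUTED BY (id);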


Create histograms of word frequencies


Format of a sparse vector: the first array holds run lengths and the second the corresponding values, so {3,1,40,...}:{0,2,0,...} expands to three 0s, one 2, forty 0s, and so on.

 id  | term_count
-----+----------------------------------------------------------------------
 ...
 10  | {3,1,40,...}:{0,2,0,...}
 ...

Dense representation of the same vector:

 id  | term_count
-----+----------------------------------------------------------------------
 ...
 10  | {0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...}
 ...

 dictionary
------------------------------------------------------------------------------
 {sturdi,canon,group,skin,linger,blunt,detect,giver,...}

The document it represents (the fourth dictionary term, skin, appears twice):

 {skin, skin, ...}

Transform the blog terms into statistically useful measures


Use the feature vectors to construct TFxIDF vectors. These measure the importance of terms: tfxidf(t, d) = tf(t, d) * log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t.

sql# SELECT id, (term_count * logidf) tfxidf
     FROM blog_histogram,
          ( SELECT log(count(*) / count_vec(term_count)) logidf
            FROM blog_histogram ) blog_logidf;

  id  | tfxidf
------+--------------------------------------------------------------------
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.34311110...}
    1 | {41,1,34,1,22,1,125,1,387,...}:{0,0.771999985977529,0,1.999427...}
   10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2.95439664949608,0,3.2006935...}
 ...

Create document clusters around iteratively defined centroids


Now that we have TFxIDF vectors, we have a statistically meaningful representation of each document, which enables all sorts of real analytics. The current example is k-means clustering, which requires two operations.

First, we compute a distance metric between the documents and a random selection of centroids, for instance a distance based on cosine similarity:

sql# SELECT id, tfxidf, cid,
            ACOS( (tfxidf %*% centroid)
                  / (svec_l2norm(tfxidf) * svec_l2norm(centroid)) ) AS distance
     FROM blog_tfxidf, blog_centroids;

  id  | tfxidf                                                             | cid | distance
------+--------------------------------------------------------------------+-----+------------
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...}  |   1 | 1.53672977
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...}  |   2 | 1.55720354
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...}  |   3 | 1.55040145
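The query above assumes a blog_centroids table holding the "random selection of centroids" (and a blog_tfxidf table materialized from the TFxIDF query). A minimal way to seed k = 3 starting centroids might look like this; it is a sketch, not code from the deck:

CREATE TABLE blog_centroids AS
SELECT row_number() OVER (ORDER BY random()) AS cid
      ,tfxidf AS centroid
FROM (SELECT tfxidf FROM blog_tfxidf ORDER BY random() LIMIT 3) seed
DISTRIBUTED BY (cid);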

Create document clusters around iteratively defined centroids


Next, use an averaging metric to re-center the mean of a cluster:
sql# SELECT cid, sum(tfxidf) / count(*) AS centroid
     FROM (
         SELECT id, tfxidf, cid,
                row_number() OVER (PARTITION BY id ORDER BY distance, cid) AS rank
         FROM blog_distance
     ) blog_rank
     WHERE rank = 1
     GROUP BY cid;

 cid | centroid
-----+--------------------------------------------------------------------------
   3 | {1,1,1,1,1,1,1,1,1,...}:{0.157556041103536,0.0635233900749665,0.050...}
   2 | {1,1,1,1,1,1,3,1,1,...}:{0.0671131209568817,0.332220028552986,0,0.0...}
   1 | {1,1,1,1,1,1,1,1,1,...}:{0.103874521481016,0.158213686890834,0.0540...}

Repeat the previous two operations until the centroids converge, and you have k-means clustering.

MAD Analytics in Practice


MAD Skills in practice

Extracted data from the EDW and other source systems into a new analytic sandbox
Generated a social graph from call detail records and subscriber data
Within 2 weeks, uncovered behavior where connected subscribers were seven times more likely to churn than the average user
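A sketch of the social-graph step, assuming a hypothetical call-detail table cdr(caller, callee, call_ts, duration); the deck does not show its actual schema or SQL:

CREATE TABLE social_graph AS
SELECT caller, callee, count(*) AS n_calls, sum(duration) AS total_minutes
FROM cdr
GROUP BY caller, callee
DISTRIBUTED BY (caller);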

Retention models

Customer Retention
Identify those at risk of abandoning their accounts, using logistic regression models or SAS scoring models.

[Figure: sample scoring table with columns Credit Card Number, SSN, Probability of Churn, and Probability of Fraud]

Also used to predict:
fraud in on-line and financial transactions
hospital return visits
etc.

Segmentation
Using segments
Create clusters of customers based on profiles, product usage, etc.
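A sketch of what the clustering input could look like, rolling the same hypothetical cdr table up to one feature row per customer (assumed, not shown in the deck):

CREATE TABLE customer_profile AS
SELECT caller                   AS customer_id
      ,count(*)                 AS n_calls
      ,count(DISTINCT callee)   AS n_contacts
      ,avg(duration)            AS avg_call_minutes
FROM cdr
GROUP BY caller
DISTRIBUTED BY (customer_id);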


Association Rules
Using segments
For low- or medium-value customers, compute possible new products using association rules

[Figure: Products A and B associated with Products X, Y, and Z]

Segmentation and Association Rules


Using segments
Filter down to products associated with high-value customers in the same segment (a sketch of the underlying association query follows below).

[Figure: Products A and B associated with Products X, Y, and Z]
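A sketch of the association-rules computation, assuming a hypothetical purchases(customer_id, product) table with one row per customer-product pair; support and confidence below are the usual pairwise definitions, not code from the deck:

SELECT a.product AS antecedent
      ,b.product AS consequent
      ,count(*)::float
       / (SELECT count(DISTINCT customer_id) FROM purchases)        AS support
      ,count(*)::float
       / (SELECT count(*) FROM purchases WHERE product = a.product) AS confidence
FROM purchases a
JOIN purchases b
  ON a.customer_id = b.customer_id
 AND a.product <> b.product
GROUP BY a.product, b.product
ORDER BY confidence DESC;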

Questions
