PPT6-W6-Big Data Integration

This document discusses big data integration and analytics techniques for integrating data from insurance companies and banks. It describes partial schemas for insurance policies, policy sales, claims, and bank accounts. It then shows how to identify similar attributes across the schemas to create a mediated schema for integrating the data. Dictionary encoding is also discussed to compress test codes and results. Finally, techniques like ontology queries and graph queries are mentioned for analyzing the integrated data.


Course : COMP8025 – Big Data Analytics

Big Data Integration


Session 06

Dr. Sani M. Isa


This presentation is adapted from
Ilkay Altintas and Amarnath Gupta, UC San Diego




Insurance Company’s Partial Schema
Policies(PolicyKey, PolicyTypeKey, Agent, Conditions)
PolicySales(PolicyKey, PolicyholderKey, StartDate, TransactKey, Premium, CoveragePeriod, CoverageLimit)
Transactions(TransactKey, Date, Time, Amount, Balance)
Policyholders(PolicyHolderKey, Name, Address, City, State, ZIP)
Claims(PolicyKey, ClaimKey, TransactKey, ClaimAmount)
ClaimDescription(ClaimKey, TypeKey, ClaimantKey, ProcCode, Description)
Claimants(ClaimantKey, Name, Address, City, State, ZIP)
ClaimTypes(TypeKey, Description)
PolicyTypes(PolicyTypeKey, Name, Description)

Bank’s Partial Schema
Accounts(AcctNumber, AcctType, MemberID, MemberType, TypeID, StartDate, EndDate, InterestRate, CreditLimit)
Individuals(MemberID, FName, MI, LName, SSN, Nationality, DoB, LegalStatus, FullAddress, Phone, PhoneType, Email)
Corporations(MemberID, Name, RegisteredAddress, CorporationType, Signatory1, Signatory2, DNBNumber, Phone, Email)
Transactions(TrID, AcctNum, Date, Time, TransactionType, Description, TransactionAmount, Debit/Credit, Balance, Payoff)
AccountType(TypeID, Name, Description)
TransactionTypes(Ttype, Name, Description)
Disputes(AccntNumber, DisputeID, TrID, Date, DisputeAmt, Explanation, Valid, ValidatorID)

PolicySales(PolicyKey, PolicyholderKey, StartDate, TransactKey, Premium, CoveragePeriod, CoverageLimit)
Policyholders(PolicyHolderKey, Name, Address, City, State, ZIP)
Accounts(AcctNumber, AcctType, MemberID, MemberType, TypeID, StartDate, EndDate, InterestRate, CreditLimit)
Individuals(MemberID, FName, MI, LName, SSN, Nationality, DoB, LegalStatus, FullAddress, Phone, PhoneType, Email)

discountCandidates(custID, address, policyKey, AcctNumber)
Policyholders(PolicyHolderKey, Name, Address, City, State, ZIP)
Individuals(MemberID, FName, MI, LName, SSN, Nationality, DoB, LegalStatus, FullAddress, Phone, PhoneType, Email)

discountCandidates(custID, address, policyKey, AcctNumber)

Accounts(AcctNumber, AcctType, MemberID, MemberType, TypeID, StartDate, EndDate, InterestRate, CreditLimit)
PolicySales(PolicyKey, PolicyholderKey, StartDate, TransactKey, Premium, CoveragePeriod, CoverageLimit)
Individuals(MemberID, FName, MI, LName, SSN, Nationality, DoB, LegalStatus, FullAddress, Phone, PhoneType, Email)
Policyholders(PolicyHolderKey, Name, Address, City, State, ZIP)
[Slide figure: the relations above annotated with the variables X, Y, Z and the sample key value ‘4-937528734’.]
Individuals(101, Stephen, C., Jones, 123-45-6789, US,
10/02/1983, citizen, “231 Cedar St. LA, CA 90005”, 661-266-9374,
landline, [email protected])
Individuals(102, Elizabeth, , McFarlane, 123-54-6789, US,
06/18/1978, citizen, “4157 Elm St. LA, CA 90005”, 213-266-9374,
mobile, [email protected])
Individuals(103, Liz, P., McFarlane-Gray, 123-92-2318, US,
06/18/1978, citizen, “231 Cedar St. LA, CA 90005”, 213-702-4343,
landline, [email protected])
Individuals(104, Lisa, M., Brady, 423-45-6209, US, 08/09/1975,
foreign-student, “231 Cedar St. LA, CA 90005”, 302-266-9374,
landline, [email protected])
Policyholders(3-764528104, Liz, P., McFarlane-Gray, 4157 Elm
St. LA, CA, 90005)
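
The sample tuples above show why matching records across the two sources is hard: no bank individual agrees with the policyholder on every field. The following is a minimal sketch, not the method used in the slides, that counts field-by-field agreement between the policyholder and each individual. The split of the policyholder's name into FName, MI and LName, the `agreement` helper, and the choice of fields are assumptions made for illustration.

```python
# Bank individuals, restricted to the fields used for matching (values from the slide).
individuals = [
    {"MemberID": 101, "FName": "Stephen", "MI": "C.", "LName": "Jones",
     "FullAddress": "231 Cedar St. LA, CA 90005"},
    {"MemberID": 102, "FName": "Elizabeth", "MI": "", "LName": "McFarlane",
     "FullAddress": "4157 Elm St. LA, CA 90005"},
    {"MemberID": 103, "FName": "Liz", "MI": "P.", "LName": "McFarlane-Gray",
     "FullAddress": "231 Cedar St. LA, CA 90005"},
    {"MemberID": 104, "FName": "Lisa", "MI": "M.", "LName": "Brady",
     "FullAddress": "231 Cedar St. LA, CA 90005"},
]

# Insurance policyholder; splitting the name into three parts is an assumption.
policyholder = {"PolicyHolderKey": "3-764528104", "FName": "Liz", "MI": "P.",
                "LName": "McFarlane-Gray", "Address": "4157 Elm St."}


def agreement(person, holder):
    """Hypothetical score: number of fields on which the two records agree (0 to 4)."""
    score = sum(person[field] == holder[field] for field in ("FName", "MI", "LName"))
    # The bank stores one address string, so compare only its street prefix.
    score += person["FullAddress"].startswith(holder["Address"])
    return score


scores = {p["MemberID"]: agreement(p, policyholder) for p in individuals}
print(scores)
```

Individual 103 agrees on all three name fields but not on the address, while individual 102 agrees only on the address, so an exact-match join would find no candidate at all.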













BankTransactions(TransactionID (TID), TransactionBeginTime (TBT), TransactionEndTime (TET), TransactionAmount (TA), Credit-Debit (CD), TransactionParty (TP), TransactionDescription (TD), Balance (B), Payoff (P))
InsuranceTransactions(TransactionID (TID), TransactionDateTime (TDT), TransactionType (TT), Amount (A), TransactionDetails (TDT))

• Compute pairwise attribute similarity and, using a threshold plus or minus an error, put similar attributes in the same cluster
• For every subset of uncertain pairs, create a mediated schema

Med1({TID}, {TBT, TET, TDT}, {TA+CD, A}, {TP, TD, TDT}, {TT}, {B}, {P})
Med2({TID}, {TBT}, {TET}, {TDT}, {TA+CD, A}, {TP}, {TD}, {TDT}, {TT}, {B}, {P})
Med3({TID}, {TBT, TDT}, {TET, TDT}, {TA+CD, A}, {TP}, {TD, TDT}, {TT}, {B}, {P})
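
The two steps above (threshold-based clustering, then one mediated schema per subset of uncertain pairs) can be sketched as follows. This is an illustrative sketch under stated assumptions: the similarity scores in `SIM`, the threshold `tau` and the error `eps` are hypothetical values, and the clusters are disjoint because they are built with union-find, so the sketch cannot produce an overlapping schema such as Med3.

```python
from itertools import combinations

# A subset of the attributes, spelled out in full to avoid the ambiguous TDT abbreviation.
ATTRS = ["TransactionBeginTime", "TransactionEndTime", "TransactionDateTime",
         "TransactionAmount", "Amount"]

# Hypothetical pairwise similarity scores; pairs not listed default to 0.0.
SIM = {
    frozenset(("TransactionAmount", "Amount")): 0.9,
    frozenset(("TransactionBeginTime", "TransactionDateTime")): 0.6,
    frozenset(("TransactionEndTime", "TransactionDateTime")): 0.6,
}


def cluster(attrs, edges):
    """Group attributes connected by the given pairs (union-find)."""
    parent = {a: a for a in attrs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for a in attrs:
        groups.setdefault(find(a), set()).add(a)
    return frozenset(frozenset(g) for g in groups.values())


def mediated_schemas(attrs, sim, tau, eps):
    """Return one clustering per subset of the uncertain attribute pairs."""
    certain, uncertain = [], []
    for a, b in combinations(attrs, 2):
        s = sim.get(frozenset((a, b)), 0.0)
        if s >= tau + eps:
            certain.append((a, b))      # clearly similar: always merged
        elif s >= tau - eps:
            uncertain.append((a, b))    # within the error band: may or may not merge
    schemas = set()
    for k in range(len(uncertain) + 1):
        for chosen in combinations(uncertain, k):
            schemas.add(cluster(attrs, certain + list(chosen)))
    return schemas


schemas = mediated_schemas(ATTRS, SIM, tau=0.6, eps=0.1)
for schema in schemas:
    print(sorted(sorted(c) for c in schema))
```

With two uncertain pairs there are four subsets, so the sketch yields four candidate schemas: the time attributes kept apart, TransactionDateTime merged with either the begin or the end time, and all three merged.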

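• Med3({TID}, {TBT, TDT}, {TET, TDT}, {TA+CD, A}, {TP}, {TD, TDT}, {TT}, {B}, {P}) is better than
• Med1({TID}, {TBT, TET, TDT}, {TA+CD, A}, {TP, TD, TDT}, {TT}, {B}, {P}) with respect to BankTransactions: Med1 puts TBT and TET (and likewise TP and TD) in one cluster although they are distinct attributes of BankTransactions, while Med3 keeps them apart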














• SELECT doctor, chronicDisease
  FROM TreatsPatient T, HasChronicDisease H
  WHERE T.Patient = H.Patient

S1.Treats(d, s) → TreatsPatient(d, p) AND HasChronicDisease(p, s)
S2.Discharges(d, p, c) → DischargesPatientFromClinic(d, p, c)
S3.Treats(d, s) → TreatsPatient(d, p) AND HasChronicDisease(p, s) AND Doctors(d)
S4.Surgeons(d) → Surgeons(d)
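
As a small illustration of what the query above computes, the sketch below evaluates the same join over in-memory tuples. The rows and the `doctor_disease_pairs` helper are hypothetical and not taken from the slides; only the relation names and the join condition come from the query.

```python
# Hypothetical tuples for the two mediated-schema relations.
treats_patient = [("Dr. Lee", "p1"), ("Dr. Kim", "p2")]         # (doctor, patient)
has_chronic_disease = [("p1", "diabetes"), ("p3", "asthma")]    # (patient, chronicDisease)


def doctor_disease_pairs(treats, chronic):
    """Join the two relations on patient and project (doctor, chronicDisease)."""
    diseases_by_patient = {}
    for patient, disease in chronic:
        diseases_by_patient.setdefault(patient, []).append(disease)
    return [(doctor, disease)
            for doctor, patient in treats
            for disease in diseases_by_patient.get(patient, [])]


pairs = doctor_disease_pairs(treats_patient, has_chronic_disease)
print(pairs)
```

The join in the query has the same shape as the right-hand side of the S1 mapping, which suggests that tuples of S1.Treats(d, s) are candidate answers to the query.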










Washington DC Disease Surveillance System (WADDS)

Reference Information Model


• Health Level-7 or HL7 refers to a set of international standards for transfer of clinical and administrative data between software applications used by various healthcare providers.


HL-7

Some Attribute Domains are hierarchical





Record #   Patient ID   Date        Test Code   Test Result
1          100          1/1/2012    SE-AC       14.5
2          502          1/1/2012    BP-S        123
3          301          1/2/2012    HAC         5.8
4          502          1/1/2012    BP-D        91
…          …            …           …           …
…          …            …           …           …
10M        1274         7/20/2016   SE-AC       13.8


• Data compression is an important technology for big data.

Record #   Patient ID   Date        Test Code   Test Result
1          100          1/1/2012    32          14.5
2          502          1/1/2012    125         123
3          301          1/2/2012    174         5.8
4          502          1/1/2012    126         91
…          …            …           …           …
…          …            …           …           …
10M        1274         7/20/2016   32          13.8

Dictionary
Orig. Test Code   Encoded Test Code
SE-AC             32
BP-S              125
HAC               174
BP-D              126
…                 …
…                 …
SE-AC             32
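
The dictionary encoding shown in the tables above can be sketched in a few lines. The records and the code dictionary are the four rows visible on the slide; the `encode` and `decode` helpers are illustrative names, not part of the original material.

```python
# The four fully visible records: (record #, patient ID, date, test code, test result).
records = [
    (1, 100, "1/1/2012", "SE-AC", 14.5),
    (2, 502, "1/1/2012", "BP-S", 123),
    (3, 301, "1/2/2012", "HAC", 5.8),
    (4, 502, "1/1/2012", "BP-D", 91),
]

# Dictionary from the slide: original test code -> encoded integer code.
CODE_DICT = {"SE-AC": 32, "BP-S": 125, "HAC": 174, "BP-D": 126}


def encode(rows, dictionary):
    """Replace each string test code with its integer code."""
    return [(rec, pid, date, dictionary[code], result)
            for rec, pid, date, code, result in rows]


def decode(rows, dictionary):
    """Invert the dictionary to recover the original test codes."""
    reverse = {value: key for key, value in dictionary.items()}
    return [(rec, pid, date, reverse[code], result)
            for rec, pid, date, code, result in rows]


encoded = encode(records, CODE_DICT)
print(encoded)
```

The saving comes from storing each repeated string once in the dictionary and a small integer in every row, which matters when a code such as SE-AC recurs across millions of records.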
Ontology queries are graph queries




































