
BANK DATA ANALYSIS

You are provided with two datasets (structured & semi-structured) containing information about bank deposit details.

The high-level activities to be performed are:

1. Load all the structured data into MySQL
2. Ingest the structured data into the HDFS environment using SQOOP (Extract)
3. Ingest the semi-structured data into the HDFS environment (Extract)
4. Perform ETL operations on the JSON data using PIG scripts (Transform)
5. Perform ETL and load the Sqoop output data into HIVE (Load)
6. Analyse the data using HQL (Analyse)

INPUT FILES

The bank deposit input files are available in the “~/Desktop/Project/wingst2-banking-challenge/” folder.

Structured data :- Chase_Bank.csv
Semi-structured data :- Chase_Bank_1.json


Important Instructions to be followed:-

HDFS directories to store output files:

a. SQOOP output should be stored in the hdfs:/user/labuser/sqoop_bank directory.
b. PIG output should be stored in the hdfs:/user/labuser/bank1 directory.

Output files of your assessment (***.txt) should be present in the local challenge folder (/home/labuser/Desktop/Project/wings-xx-challenge).

Follow the steps below to complete the assessment:-

Step 1:- Loading Data to MYSQL

Login to MySQL:

Username: root
Password: labuserbdh

Create a new database and table using MySQL commands to load the structured data.
DB Name:- bank_db
Table:- bank

The create table and load scripts are given below for your reference.

create table bank (Id int, Institution_Name varchar(2000), Branch_Name varchar(2000), Branch_Number int, City varchar(2000),
County varchar(2000), State varchar(2000), Zipcode int, 2010_Deposits int,
2011_Deposits int, 2012_Deposits int, 2013_Deposits int, 2014_Deposits int,
2015_Deposits int, 2016_Deposits int);

Load Chase_Bank.csv data into the table bank:

load data local infile '/home/labuser/Desktop/Project/wingst2-banking-challenge/Chase_Bank.csv'
into table bank fields terminated by ',' lines terminated by '\n' ignore 1 rows
(Id, Institution_Name, Branch_Name, Branch_Number, City, County, State, Zipcode, 2010_Deposits, 2011_Deposits, 2012_Deposits, 2013_Deposits, 2014_Deposits, 2015_Deposits, 2016_Deposits);
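
As a quick optional sanity check after the load (a sketch using the table created above):

-- Row count should match the number of data rows in Chase_Bank.csv
select count(*) from bank;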

Step 2:- Import Structured data to HDFS using SQOOP

Load data from the MySQL table bank to HDFS (/user/labuser/sqoop_bank) using Sqoop, based on the following conditions (a sample command is sketched after this list):

Columns to be imported:
Id, City, County, State, Zipcode, 2010_Deposits, 2011_Deposits, 2012_Deposits, 2013_Deposits, 2014_Deposits, 2015_Deposits, 2016_Deposits

Import only the records whose City is NOT IN the below-mentioned cities:
"Rochester", "Austin", "Chicago", "Indianapolis"

Number of mappers should be 1
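
A Sqoop import along the lines below should satisfy these conditions. This is a sketch only: the JDBC URL assumes MySQL runs on localhost, and the credentials are the lab values from Step 1; adjust them if your environment differs.

# Sketch only: adjust the JDBC URL (assumed localhost) to your MySQL host if needed.
sqoop import \
  --connect jdbc:mysql://localhost/bank_db \
  --username root \
  --password labuserbdh \
  --table bank \
  --columns "Id,City,County,State,Zipcode,2010_Deposits,2011_Deposits,2012_Deposits,2013_Deposits,2014_Deposits,2015_Deposits,2016_Deposits" \
  --where "City NOT IN ('Rochester','Austin','Chicago','Indianapolis')" \
  --target-dir /user/labuser/sqoop_bank \
  --num-mappers 1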

Run the below command to copy the Sqoop output from HDFS into the sqoop_output.txt file:

hdfs dfs -cat /user/labuser/sqoop_bank/* > sqoop_output.txt

Note:- Make sure that your output files are available in the challenge folder.

You will be loading and analysing the Sqoop output data in Hive using HQL in the further steps.

Step 3:- Cleansing Semi-structured (Json) data Using PIG

You will be cleansing the JSON data (Chase_Bank_1.json) which is available in the challenge input folder.

Load this JSON data into PIG using Pig Latin scripts.

Note:- You can either load this data directly from the challenge input folder, or use the required commands to copy it to HDFS and then load it into PIG.
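
For example (a sketch only; the local path is the input folder named above, and the HDFS destination directory is an assumption), the file can be copied to HDFS with:

hdfs dfs -put ~/Desktop/Project/wingst2-banking-challenge/Chase_Bank_1.json /user/labuser/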

Use Pig Latin scripts to find the minimum number of deposits in 2016 (i.e., MIN(Deposits_2016)) for each county. Assign “minimum_dep” as the column name.

Sort the output in descending order based on “minimum_dep”, and read only the first 50 records. (A sample script is sketched at the end of this step.)


Sample output format:-

{"group":"Gillespie","minimum_dep":212776}{"group":"Imperial","minimum_dep":148284}

STORE the result in the HDFS (/user/labuser/pigoutput) directory.

Use the required commands to copy the PIG output to the challenge folder in the file pig_output.txt.
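
A minimal Pig Latin sketch of this step is given below for reference. It is a sketch only: the field names (County, Deposits_2016), the use of the built-in JsonLoader/JsonStorage, and the HDFS path of the JSON file are assumptions based on the description above; adapt them to the actual schema of Chase_Bank_1.json.

-- Sketch only: loader, field names and input path are assumptions.
bank_json = LOAD '/user/labuser/Chase_Bank_1.json'
            USING JsonLoader('County:chararray, Deposits_2016:int');
by_county = GROUP bank_json BY County;
min_dep   = FOREACH by_county GENERATE group, MIN(bank_json.Deposits_2016) AS minimum_dep;
sorted    = ORDER min_dep BY minimum_dep DESC;
top50     = LIMIT sorted 50;
-- Store per the STORE instruction above
STORE top50 INTO '/user/labuser/pigoutput' USING JsonStorage();

The stored part files can then be copied back to the challenge folder, for example with:

hdfs dfs -cat /user/labuser/pigoutput/part-* > pig_output.txt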

Step 4:- Loading and Analysing Data in Hive

Now, you have to load the Sqoop output into Hive tables.

Create the below database & tables in Hive:

Database : hive_db
Partition Table : bank_part
Columns : Id, City, County, Zipcode, 2010_Deposits, 2011_Deposits, 2012_Deposits, 2013_Deposits, 2014_Deposits, 2015_Deposits, 2016_Deposits

Partition should be based on the column State.

Read the records which satisfy the below conditions & load them to the bank_part table:
City in Bronx, NewYorkCity, Dallas, Houston, Columbus
State in "NY", "OH", "TX"

Hint:- Create a temporary table to load the Sqoop output, and then load the data into the partitioned table with the necessary filters.
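
A minimal HiveQL sketch of this hinted approach is given below. It is a sketch only: the staging table name bank_temp, the use of an external table over the Sqoop directory, and the comma delimiter (Sqoop's default for text imports) are assumptions; the deposit columns are backtick-quoted because their names start with digits.

-- Sketch only: bank_temp is a hypothetical staging table over the Sqoop output.
CREATE DATABASE IF NOT EXISTS hive_db;
USE hive_db;

CREATE EXTERNAL TABLE bank_temp (
  Id INT, City STRING, County STRING, State STRING, Zipcode INT,
  `2010_Deposits` INT, `2011_Deposits` INT, `2012_Deposits` INT, `2013_Deposits` INT,
  `2014_Deposits` INT, `2015_Deposits` INT, `2016_Deposits` INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/labuser/sqoop_bank';

CREATE TABLE bank_part (
  Id INT, City STRING, County STRING, Zipcode INT,
  `2010_Deposits` INT, `2011_Deposits` INT, `2012_Deposits` INT, `2013_Deposits` INT,
  `2014_Deposits` INT, `2015_Deposits` INT, `2016_Deposits` INT)
PARTITIONED BY (State STRING);

-- Dynamic partitioning so each State value lands in its own partition
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE bank_part PARTITION (State)
SELECT Id, City, County, Zipcode,
       `2010_Deposits`, `2011_Deposits`, `2012_Deposits`, `2013_Deposits`,
       `2014_Deposits`, `2015_Deposits`, `2016_Deposits`,
       State
FROM bank_temp
WHERE City IN ('Bronx', 'NewYorkCity', 'Dallas', 'Houston', 'Columbus')
  AND State IN ('NY', 'OH', 'TX');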

Analysing Data using HQL

Use the following command before executing Hive queries, to remove the WARNING messages from the Hive output:

export HIVE_SKIP_SPARK_ASSEMBLY=true

Write an HQL query to fetch the records that satisfy the below criteria: 2014_Deposits is greater than 50000, 2015_Deposits is greater than 60000, 2016_Deposits is greater than 70000, and City is in NewYorkCity, Dallas, Houston.

Columns required :- City, County, State, 2014_Deposits, 2015_Deposits, 2016_Deposits.
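
A query along the lines below (a sketch, assuming the hive_db.bank_part table sketched in the previous step) should satisfy these criteria:

-- Sketch only: column quoting follows the staging sketch above.
USE hive_db;
SELECT City, County, State, `2014_Deposits`, `2015_Deposits`, `2016_Deposits`
FROM bank_part
WHERE `2014_Deposits` > 50000
  AND `2015_Deposits` > 60000
  AND `2016_Deposits` > 70000
  AND City IN ('NewYorkCity', 'Dallas', 'Houston');

The query can be placed in a file (e.g. a hypothetical query.hql) and run with hive -S -f query.hql > hive_output.txt, or passed with -e as in the sample format shown below.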


Save the output in the file hive_output.txt.

Note:- Given below is the sample format to copy output to a file from the terminal:

hive -S -e "use hive_db; select count(1) from bank_part;" > output.txt

VALIDATION :

Before closing the environment, ensure that all the output files are available in the local directory “Desktop/Project/wingst2-banking-challenge/”:

sqoop_output.txt
hive_output.txt
pig_output.txt
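
For example, the presence of the three files can be checked from a terminal with:

ls ~/Desktop/Project/wingst2-banking-challenge/*.txt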

Click on the SUBMIT button & validation will take place at the backend.
