
BANK DATA ANALYSIS

You are provided with two datasets (structured & semi-structured) containing information about bank deposit details.

The high-level activities to be performed are:

1. Load all the structured data into MySQL
2. Ingest the structured data into the HDFS environment using SQOOP (Extract)
3. Ingest the semi-structured data into the HDFS environment (Extract)
4. Perform ETL operations on the JSON data using PIG scripts (Transform)
5. Perform ETL and load the Sqoop output data into HIVE (Load)
6. Analyse the data using HQL (Analyse)

INPUT FILES

The bank deposit input files are available in the “~/Desktop/Project/wingst2-banking-challenge/” folder.

Structured data :- Chase_Bank.csv
Semi-structured data :- Chase_Bank_1.json


Important Instructions to be followed:-

HDFS directories to store output files:

a. SQOOP output should be stored in the hdfs:/user/labuser/sqoop_bank directory.
b. PIG output should be stored in the hdfs:/user/labuser/bank1 directory.

Output files of your assessment (***.txt) should be present in the local challenge folder (/home/labuser/Desktop/Project/wings-xx-challenge).

Follow the steps below to complete the assessment:-

Step 1:- Loading Data to MYSQL

Login to MySQL:

Username: root
Password: labuserbdh

Create a new database and table using MySQL commands to load the structured data.
DB Name:- bank_db
Table:- bank

The create table and load scripts are given below for your reference.

create table bank (Id int, Institution_Name varchar(2000), Branch_Name varchar(2000), Branch_Number int, City varchar(2000),
County varchar(2000), State varchar(2000), Zipcode int, 2010_Deposits int,
2011_Deposits int, 2012_Deposits int, 2013_Deposits int, 2014_Deposits int,
2015_Deposits int, 2016_Deposits int);

Load Chase_Bank.csv data into the table bank:

load data local infile '/home/labuser/Desktop/Project/wingst2-banking-challenge/Chase_Bank.csv'
into table bank fields terminated by ',' lines terminated by '\n' ignore 1 rows
(Id, Institution_Name, Branch_Name, Branch_Number, City, County, State, Zipcode, 2010_Deposits, 2011_Deposits, 2012_Deposits, 2013_Deposits, 2014_Deposits, 2015_Deposits, 2016_Deposits);
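
As a quick optional sanity check after the load (a sketch using the table created above):

-- Row count should match the number of data rows in Chase_Bank.csv
select count(*) from bank;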

Step 2:- Import Structured data to HDFS using SQOOP

Load data from the MySQL table bank to HDFS (/user/labuser/sqoop_bank) using Sqoop, based on the following conditions (a sample command is sketched after this list):

Columns to be imported:
Id, City, County, State, Zipcode, 2010_Deposits, 2011_Deposits, 2012_Deposits, 2013_Deposits, 2014_Deposits, 2015_Deposits, 2016_Deposits

Import only the records whose City is NOT IN the below-mentioned cities:
"Rochester", "Austin", "Chicago", "Indianapolis"

Number of mappers should be 1
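
A Sqoop import along the lines below should satisfy these conditions. This is a sketch only: the JDBC URL assumes MySQL runs on localhost, and the credentials are the lab values from Step 1; adjust them if your environment differs.

# Sketch only: adjust the JDBC URL (assumed localhost) to your MySQL host if needed.
sqoop import \
  --connect jdbc:mysql://localhost/bank_db \
  --username root \
  --password labuserbdh \
  --table bank \
  --columns "Id,City,County,State,Zipcode,2010_Deposits,2011_Deposits,2012_Deposits,2013_Deposits,2014_Deposits,2015_Deposits,2016_Deposits" \
  --where "City NOT IN ('Rochester','Austin','Chicago','Indianapolis')" \
  --target-dir /user/labuser/sqoop_bank \
  --num-mappers 1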

Run the below command to copy the Sqoop output from HDFS into the sqoop_output.txt file:

hdfs dfs -cat /user/labuser/sqoop_bank/* > sqoop_output.txt

Note:- Make sure that your output files are available in the challenge folder.

You will be loading and analysing the Sqoop output data in Hive using HQL in the further steps.

Step 3:- Cleansing Semi-structured (Json) data Using PIG

You will be cleansing the JSON data (Chase_Bank_1.json) which is available in the challenge input folder.

Load this JSON data into PIG using Pig Latin scripts.

Note:- You can either load this data directly from the challenge input folder, or use the required commands to copy it to HDFS and then load it into PIG.
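
For example (a sketch only; the local path is the input folder named above, and the HDFS destination directory is an assumption), the file can be copied to HDFS with:

hdfs dfs -put ~/Desktop/Project/wingst2-banking-challenge/Chase_Bank_1.json /user/labuser/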

Use Pig Latin scripts to find the minimum number of deposits in 2016 (i.e., MIN(Deposits_2016)) for each county. Assign “minimum_dep” as the column name.

Sort the output in descending order based on “minimum_dep”, and read only the first 50 records. (A sample script is sketched at the end of this step.)


Sample output format:-

{"group":"Gillespie","minimum_dep":212776}{"group":"Imperial","minimum_dep":148284}

STORE the result in the HDFS (/user/labuser/pigoutput) directory.

Use the required commands to copy the PIG output to the challenge folder in the file pig_output.txt.
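
A minimal Pig Latin sketch of this step is given below for reference. It is a sketch only: the field names (County, Deposits_2016), the use of the built-in JsonLoader/JsonStorage, and the HDFS path of the JSON file are assumptions based on the description above; adapt them to the actual schema of Chase_Bank_1.json.

-- Sketch only: loader, field names and input path are assumptions.
bank_json = LOAD '/user/labuser/Chase_Bank_1.json'
            USING JsonLoader('County:chararray, Deposits_2016:int');
by_county = GROUP bank_json BY County;
min_dep   = FOREACH by_county GENERATE group, MIN(bank_json.Deposits_2016) AS minimum_dep;
sorted    = ORDER min_dep BY minimum_dep DESC;
top50     = LIMIT sorted 50;
-- Store per the STORE instruction above
STORE top50 INTO '/user/labuser/pigoutput' USING JsonStorage();

The stored part files can then be copied back to the challenge folder, for example with:

hdfs dfs -cat /user/labuser/pigoutput/part-* > pig_output.txt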

Step 4:- Loading and Analysing Data in Hive

Now, you have to load the Sqoop output into Hive tables.

Create the below database & tables in Hive:

Database : hive_db
Partition Table : bank_part
Columns : Id, City, County, Zipcode, 2010_Deposits, 2011_Deposits, 2012_Deposits, 2013_Deposits, 2014_Deposits, 2015_Deposits, 2016_Deposits

Partition should be based on the column State.

Read the records which satisfy the below conditions & load them to the bank_part table:
City in Bronx, NewYorkCity, Dallas, Houston, Columbus
State in "NY", "OH", "TX"

Hint:- Create a temporary table to load the Sqoop output, and then load the data into the partitioned table with the necessary filters.
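
A minimal HiveQL sketch of this hinted approach is given below. It is a sketch only: the staging table name bank_temp, the use of an external table over the Sqoop directory, and the comma delimiter (Sqoop's default for text imports) are assumptions; the deposit columns are backtick-quoted because their names start with digits.

-- Sketch only: bank_temp is a hypothetical staging table over the Sqoop output.
CREATE DATABASE IF NOT EXISTS hive_db;
USE hive_db;

CREATE EXTERNAL TABLE bank_temp (
  Id INT, City STRING, County STRING, State STRING, Zipcode INT,
  `2010_Deposits` INT, `2011_Deposits` INT, `2012_Deposits` INT, `2013_Deposits` INT,
  `2014_Deposits` INT, `2015_Deposits` INT, `2016_Deposits` INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/labuser/sqoop_bank';

CREATE TABLE bank_part (
  Id INT, City STRING, County STRING, Zipcode INT,
  `2010_Deposits` INT, `2011_Deposits` INT, `2012_Deposits` INT, `2013_Deposits` INT,
  `2014_Deposits` INT, `2015_Deposits` INT, `2016_Deposits` INT)
PARTITIONED BY (State STRING);

-- Dynamic partitioning so each State value lands in its own partition
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE bank_part PARTITION (State)
SELECT Id, City, County, Zipcode,
       `2010_Deposits`, `2011_Deposits`, `2012_Deposits`, `2013_Deposits`,
       `2014_Deposits`, `2015_Deposits`, `2016_Deposits`,
       State
FROM bank_temp
WHERE City IN ('Bronx', 'NewYorkCity', 'Dallas', 'Houston', 'Columbus')
  AND State IN ('NY', 'OH', 'TX');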

Analysing Data using HQL

Use the following command before executing Hive queries, to remove the WARNING messages from the Hive output:

export HIVE_SKIP_SPARK_ASSEMBLY=true

Write an HQL query to fetch the records that satisfy the below criteria: 2014_Deposits is greater than 50000, 2015_Deposits is greater than 60000, 2016_Deposits is greater than 70000, and City is in NewYorkCity, Dallas, Houston.

Columns required :- City, County, State, 2014_Deposits, 2015_Deposits, 2016_Deposits.
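
A query along the lines below (a sketch, assuming the hive_db.bank_part table sketched in the previous step) should satisfy these criteria:

-- Sketch only: column quoting follows the staging sketch above.
USE hive_db;
SELECT City, County, State, `2014_Deposits`, `2015_Deposits`, `2016_Deposits`
FROM bank_part
WHERE `2014_Deposits` > 50000
  AND `2015_Deposits` > 60000
  AND `2016_Deposits` > 70000
  AND City IN ('NewYorkCity', 'Dallas', 'Houston');

The query can be placed in a file (e.g. a hypothetical query.hql) and run with hive -S -f query.hql > hive_output.txt, or passed with -e as in the sample format shown below.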


Save the output in the file hive_output.txt.

Note:- Given below is the sample format to copy output to a file from the terminal:

hive -S -e "use hive_db; select count(1) from bank_part;" > output.txt

VALIDATION :

Before closing the environment, ensure that all the output files are available in the local directory “Desktop/Project/wingst2-banking-challenge/”:

sqoop_output.txt
hive_output.txt
pig_output.txt
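
For example, the presence of the three files can be checked from a terminal with:

ls ~/Desktop/Project/wingst2-banking-challenge/*.txt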

Click on the SUBMIT button & validation will take place at the backend.
