Session 9 - Data Ingestion - SQOOP

PHASE-1

DATA LAKE --> holds all kinds of data --> structured (SD), semi-structured (SSD), unstructured (USD)

Storage layer --> HDFS (on-prem), S3, ADLS

HADOOP CLUSTER --> set up and managed by Hadoop admins

DEVELOPERS --> work from the EDGE NODE --> Linux terminal

LINUX --> ~20 commands covered
HDFS --> ~15 commands covered (a few samples below)
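
A few of the HDFS commands from the earlier sessions, as a quick refresher (paths are placeholders):

hdfs dfs -ls /user/cloudera                       # list a directory
hdfs dfs -mkdir /user/cloudera/demo               # create a directory
hdfs dfs -put localfile.txt /user/cloudera/demo   # copy a local file into HDFS
hdfs dfs -cat /user/cloudera/demo/localfile.txt   # print a file's contents
hdfs dfs -rm -r /user/cloudera/demo               # remove a directory recursively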

===========================================
ETL or ELT (in Big Data projects)

E --> EXTRACT --> fairly simple --> mostly SQL logic against the source
L --> LOAD --> fairly simple --> copy the extracted data into the data lake
T --> TRANSFORM --> where the real struggle is --> PySpark, Hive

ETL or ELT --> same three steps, only the order of Load and Transform differs (a small Hive sketch of the T step follows)
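
As an illustration of the T step only, a minimal Hive sketch; the raw_customers and curated_customers table names and columns are hypothetical, and the raw data is assumed to be already loaded into the lake:

-- T step: clean the raw loaded data into a curated table
CREATE TABLE curated_customers AS
SELECT customer_id,
       TRIM(UPPER(customer_city)) AS customer_city,
       customer_state
FROM raw_customers
WHERE customer_id IS NOT NULL;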

============================================
EXTRACT TERMINOLOGY -->

Typical sources (e.g. in a banking project):

RDBMS --> MySQL, Oracle, PostgreSQL
SOCIAL MEDIA --> LinkedIn, Facebook, WhatsApp
MAINFRAMES
SENSORS
REST APIs

INGESTION PHASE --> taking the data from the source systems and keeping it in the DATA LAKE

COMMON TOOLS AND FRAMEWORKS for the ingestion phase:

TALEND
INFORMATICA
SQOOP --> the one used here --> simple --> common in BD projects --> data ingestion tool (practised on CLOUDERA, CLOUDXLAB)
SPARK SQL (pull-based ingestion)
KAFKA (BD and real-time)
SSIS
AWS GLUE
AZURE ADF
AZURE SYNAPSE ANALYTICS
APACHE FLINK

============================

SQOOP --> SQL + HADOOP --> tool that takes the data from an RDBMS to HDFS

Two options for analysing data spread across systems:
1) Write queries directly on each SQL RDBMS
2) Take the data from each and every system, dump it into a different (central) store, then do the analysis

Quiz: Sqoop pulls from the RDBMS and keeps the data in HDFS -- where does it run?
a) EDGE NODE
b) HDFS
c) BOTH
d) NONE

Can Sqoop go from an RDBMS to S3? YES ...

MySQL server --> retail_db --> customers table --> HDFS

Details needed for the import (mapped to the sqoop options below):

1) Host ID
2) Username
3) Password
4) Database name
5) Table name
6) Target name --> target HDFS directory name
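
A minimal sketch of how those six details map onto sqoop import options (all values in angle brackets are placeholders):

# 1) host ID + 4) database name --> --connect
# 2) username                   --> --username
# 3) password                   --> --password
# 5) table name                 --> --table
# 6) target HDFS directory      --> --target-dir
sqoop import --connect jdbc:mysql://<host>/<database> --username <username> --password <password> --table <table_name> --target-dir <hdfs_target_dir>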

MySQL --> S3 via Sqoop (sample command below):

1) Host ID
2) Username
3) Password
4) Database
5) Table name
6) Access key
7) S3 bucket name
Example: customers table (12435 rows) --> HDFS

1) Host ID
2) Username
3) Password
4) Database: retail_db
5) Table: customers
6) Target HDFS directory

SQOOP -->
1) IMPORT --> RDBMS to HDFS
2) EXPORT --> HDFS to RDBMS (export sketch below)
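
The session only runs imports; for completeness, a minimal sqoop export sketch (customers_copy is a placeholder target table that must already exist in MySQL):

sqoop export --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table customers_copy --export-dir /user/cloudera/Sqoop_B17_SAMPLE -m 1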

CLOUDERA -->

sqoop import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table customers -m 1 --target-dir /user/cloudera/Sqoop_B17_SAMPLE

================= CLOUDERA ===============================

1) Open your Cloudera VM in PuTTY
2) Check whether MySQL is present on your Cloudera VM:
   mysql -u root -pcloudera
3) show databases;   (you should see retail_db)
4) use retail_db;
5) show tables;   (check that customers is there)
6) select count(*) from customers;
   (the full check is sketched below)
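
Put together, the MySQL check looks roughly like this:

mysql -u root -pcloudera
mysql> show databases;                     -- retail_db should be listed
mysql> use retail_db;
mysql> show tables;                        -- customers should be listed
mysql> select count(*) from customers;     -- a non-zero row count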
============================================
Open a new terminal in PuTTY.

HIT THE BELOW COMMAND:

sqoop import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table customers -m 1 --target-dir /user/cloudera/Sqoop_B17_SAMPLE

Then list the target directory:

hdfs dfs -ls /user/cloudera/Sqoop_B17_SAMPLE

You need to see the part file ...
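
To peek at the imported rows, cat the part file; with -m 1 there is a single part-m-00000 (the exact name comes from the -ls listing above):

hdfs dfs -cat /user/cloudera/Sqoop_B17_SAMPLE/part-m-00000 | head -5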

============================================

CLOUDXLAB -->
1) Open MySQL:
   mysql -h cxln2.c.thelab-240901.internal -u sqoopuser -pNHkkP876rp
2) Go inside retail_db:
   use retail_db;
3) select count(*) from customers;
   --> 128
4) Run the import (note the double dashes on --username and --table):
   sqoop import --connect jdbc:mysql://cxln2.c.thelab-240901.internal/retail_db --username sqoopuser --password NHkkP876rp --table customers -m 1 --target-dir /user/gadirajumidhun2082/Sqoop_B17_MIDHUN
   (then verify the target directory as shown below)
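
As on Cloudera, verify by listing the target directory (your own HDFS user path will differ):

hdfs dfs -ls /user/gadirajumidhun2082/Sqoop_B17_MIDHUN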

==============================================

NUMBER OF MAPPERS --> default is 4

--split-by

customers (has a PK) --> HDFS --> no mappers specified --> runs fine with 4 mappers
customer_Midhun (created by batch12, no PK) --> HDFS --> no mappers specified --> error (workaround sketched below)
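
For a table without a primary key, either force a single mapper or give Sqoop an explicit split column (a sketch; customer_Midhun and its cust_id column are example names):

# option 1: single mapper, no split needed
sqoop import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table customer_Midhun -m 1 --target-dir /user/cloudera/customer_Midhun

# option 2: keep 4 mappers, but name the split column explicitly
sqoop import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table customer_Midhun --split-by cust_id -m 4 --target-dir /user/cloudera/customer_Midhun_split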

==============================================

1) SQOOP COMMAND
2) SQOOP IMPORT ARCHITECTURE

=============================================

1) SQOOP --> RDBMS to HDFS (customers)

RDBMS --> SQL
HDFS / Hadoop --> Java

Sqoop first generates a RECORD CONTAINER CLASS --> a per-table Java class that handles the SQL-to-Java DATA TYPE MATCHING (codegen sketch below)
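
You can look at this generated class yourself with sqoop codegen, which produces the Java record class without running an import (a sketch on the Cloudera VM):

sqoop codegen --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --table customers
# the output directory printed at the end contains customers.java, the record container class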

2) BOUNDARY QUERY -->

Say the customer table has 100000 rows --> Sqoop runs select min(custid), max(custid) on the table and splits that range across the mappers (m = 4):

m1 --> 1 to 25000      --> part-m-00000
m2 --> 25001 to 50000  --> part-m-00001
m3 --> 50001 to 75000  --> part-m-00002
m4 --> 75001 to 100000 --> part-m-00003
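
A sketch of the boundary query and the split arithmetic (custid is the split column from the example above):

-- what Sqoop issues against the source table (the "boundary query")
SELECT MIN(custid), MAX(custid) FROM customer;
-- with min = 1, max = 100000 and 4 mappers, each mapper gets a range of
-- roughly (100000 - 1) / 4 = 25000 ids, matching m1..m4 above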

3) DATA IMPORT -->

Each mapper (m1, m2, m3, m4) pulls its own range of rows in parallel and writes its part file to HDFS (listing sketch below)
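
Reusing the earlier Cloudera directory name for illustration (that run used -m 1, so it produced only one part file), a 4-mapper import would leave the target directory looking like this:

hdfs dfs -ls /user/cloudera/Sqoop_B17_SAMPLE
# _SUCCESS
# part-m-00000
# part-m-00001
# part-m-00002
# part-m-00003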

============================================
