Data Science

The document discusses data science and AI topics like data engineering, data analysis, machine learning, Hadoop, HDFS, and SQL. It provides information on data types, MapReduce, YARN, and Hive and includes example queries and commands.

Uploaded by

ZAHID MOHD


Data science & AI

Data Engineer:-
design
build, arrange data
programming
system
Data Scientist:-
test
analyse, translate data
analytics
Both:- curious, judgemental, debate WALSN
Statistics: inference about the whole from a single sample
Data today
Structured:- enterprise and office data (number, date, string)
Unstructured:- videos, audio, emails; very large storage; high low data will make data
collection
HADOOP AND OOP BASIC
An enormous data set (100110000101010) is cut into slices; each slice is a unique individual data set
given to an individual computer
Single program: mapper system -> primary results -> results are sorted -> secondary process,
the reduce process (map-reduce)
SCRAPING
for data no
method
Python with Beautiful Soup

Hadoop core
HDFS:
1. Storage unit -> HDFS
space storage
Block A   Block B   Block C   Block D   Block E
128 MB    128 MB    128 MB    128 MB    88 MB
Total 600 MB
Copies of each block are stored on multiple systems
Data is not lost when a single node fails
Data replication method
If one of the nodes is lost, its blocks can be recovered from the copies on other nodes
Even if one data block crashes the data survives, making HDFS fault-tolerant
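The block arithmetic above can be sketched in a few lines of Python. This is only an illustration of the sizes in these notes; the function name `split_into_blocks` is made up here and is not part of Hadoop.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the list of block sizes (in MB) for a file of the given size."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        blocks.append(remainder)  # the last block only holds what is left over
    return blocks


print(split_into_blocks(600))      # [128, 128, 128, 128, 88]
print(split_into_blocks(160, 64))  # [64, 64, 32]
```

The last block is smaller than the block size because a block only occupies the space of the data it actually holds.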
2. MapReduce
Moves from the traditional processing method to a parallel processing method
Advantages
- load balancing
- reduced processing time
3. YARN (resource manager)
Metadata:- data about data
HDFS features
- distributed
- scalable
Scaling
horizontal scaling
vertical scaling
- cost effective
- fault-tolerant
- high throughput
Latency: time to get the first record
Throughput: number of records processed per unit of time

HDFS MASTER/slave topology

master node :name node
|
|
V

slave node :data node

DataNodes send a signal (heartbeat) to the NameNode to show they are alive
All DataNodes report the data they hold to the NameNode; replication factor 3x

Secondary NameNode (buffer): Hadoop 1.x, 64 MB

Standby NameNode: Hadoop 2.x, 128 MB
Edit log and FsImage are used to reconstruct the file system metadata

Replication factor
applies to each block that is stored

Block (160 MB)
64
64
32
MapReduce executes the program

interview questions
What are the main features of HDFS?
Fault tolerance, high throughput, suitability for handling large data sets, etc.

What is meant by Data node?

DataNodes are the actual storage locations; they serve read and write requests from clients.

What is daemon?
A daemon is a process that runs in the background.

What is meant by heartbeat in HDFS?

DataNodes send heartbeat signals to the NameNode at regular intervals to inform it that they are
alive.
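As a rough sketch of the heartbeat idea, the toy class below records when each DataNode last reported in and lists the ones that have been silent for too long. The class name, the node ids and the 30-second timeout are assumptions made for this example; this is not Hadoop's own code or its default setting.

```python
class HeartbeatMonitor:
    """Toy model of a NameNode tracking DataNode heartbeats (not Hadoop's code)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # DataNode id -> time of its last heartbeat

    def heartbeat(self, node_id, now):
        """Record that node_id reported in at time `now` (in seconds)."""
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=30)  # 30 s is an assumed value
monitor.heartbeat("datanode-1", now=0)
monitor.heartbeat("datanode-2", now=0)
monitor.heartbeat("datanode-1", now=25)  # datanode-2 stays silent
print(monitor.dead_nodes(now=40))        # ['datanode-2']
```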

What is meant by 'block' in HDFS?

A block in HDFS is the minimum quantum of data for reading or writing.

Default block size?

Default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later.

What type of data is processed by Hadoop? digital data

What is a rack in HDFS?

A rack is a storage location where a group of DataNodes is put together.

Why is HDFS fault-tolerant?

HDFS is fault-tolerant because it replicates data on different DataNodes.

Explain the architecture of HDFS?

NameNode, DataNode, Secondary NameNode

What is checkpointing in Hadoop?

Checkpointing is the process of combining the Edit Logs with the FsImage (File system
Image). It is performed by the Secondary NameNode.
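Checkpointing can be pictured as replaying a log on top of a snapshot. In the sketch below the FsImage is modelled as a plain dictionary and the edit log as a list of tuples; both formats, the two operations and the function names are invented for this example and do not match Hadoop's on-disk formats.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to a namespace dict (path -> metadata)."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = {"size": edit[2]}
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edit_log):
    """Merge the edit log into a copy of the image; return (new image, empty log)."""
    new_image = {path: dict(meta) for path, meta in fsimage.items()}
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


fsimage = {"/data/a.csv": {"size": 600}}
edit_log = [("create", "/data/b.csv", 160), ("delete", "/data/a.csv")]
fsimage, edit_log = checkpoint(fsimage, edit_log)
print(fsimage)   # {'/data/b.csv': {'size': 160}}
print(edit_log)  # []
```

After a checkpoint the edit log starts empty again, so a restart has less log to replay.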

What is a NameNode in Hadoop?

The NameNode is the master node that manages all the DataNodes (slave nodes). It
records the metadata information regarding all the files stored in the cluster (on the
DataNodes), e.g. the location of the blocks stored, the size of the files, permissions,
hierarchy, etc.

Difference between traditional RDBMS and Hadoop?

Data Types
Processing
Schema on Read Vs. Write
Read/Write Speed
Cost
Best Fit Use Case
YARN

Features
- multi-tenancy
- cluster utilisation
- compatibility
- scalability
Master daemon - ResourceManager - maximises the cluster utilisation
Slave daemon - NodeManager
Application Manager:- gets available containers for the process by asking the RM
Scheduler:- fair scheduler
- capacity scheduler
knowledge
Is YARN a replacement of Hadoop MapReduce?
YARN is not a replacement of Hadoop, but it is a more powerful and efficient technology
that supports MapReduce and other distributed computing frameworks. It is also
referred to as MapReduce version 2.
Why YARN?
With older versions of Hadoop, you were limited to executing MapReduce jobs only.
What are the YARN responsibilities?
Creating containers, monitoring containers that are running, etc.
What are the key components of YARN?
ResourceManager, NodeManager, ApplicationMaster, Container
What is ResourceManager in YARN?
Scheduler - The scheduler is responsible for allocating resources.
ApplicationManager - Accepting job submissions, executing the application, etc.
What is ApplicationMaster in YARN?
Performs any application-specific work.
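To make the container idea concrete, here is a toy model in which a ResourceManager hands out containers from the free memory of its NodeManagers. It uses a plain first-fit rule, which is an assumption for this sketch and is simpler than the fair and capacity schedulers named in these notes; the class and field names only mirror the YARN terms and are not YARN's API.

```python
class NodeManager:
    """Toy NodeManager that only tracks the free memory on one node."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb


class ResourceManager:
    """Toy ResourceManager whose scheduler hands out containers first-fit."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, app_id, memory_mb):
        """Return a container description, or None if no node has room."""
        for nm in self.node_managers:
            if nm.free_mb >= memory_mb:
                nm.free_mb -= memory_mb
                return {"app": app_id, "node": nm.name, "memory_mb": memory_mb}
        return None


rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 2048)])
first = rm.allocate("app-1", 3072)   # fits on node1
second = rm.allocate("app-2", 2048)  # node1 has 1024 MB left, so node2 is used
third = rm.allocate("app-3", 2048)   # no node has 2048 MB free any more
print(first["node"], second["node"], third)  # node1 node2 None
```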

MapReduce
Map: you count shelf #1, I count shelf #2; with more people the counting goes faster
Key-value pairs: intermediate map output
key, value

Reducer: we all get together and add up our individual counts

Shuffle and sort
Splitting: the input is split into 128 MB blocks
Mapping: just collecting
Shuffling and sorting: group by key
Mapper stage to reducer stage -> key-value pairs
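The shelf-counting picture can be written as a tiny single-machine program. This is only a sketch of the map, shuffle-and-sort and reduce stages in plain Python; it does not use the Hadoop API, and the shelf contents are made-up sample data.

```python
from collections import defaultdict


def map_phase(shelf):
    """Mapper: emit a (title, 1) pair for every book on one shelf."""
    return [(title, 1) for title in shelf]


def shuffle_and_sort(mapped_outputs):
    """Group all intermediate values by key, with the keys in sorted order."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reducer: add up the individual counts for each key."""
    return {key: sum(values) for key, values in groups.items()}


shelves = [["hive", "hdfs", "hive"], ["yarn", "hdfs"]]  # two input splits
mapped = [map_phase(shelf) for shelf in shelves]         # one mapper per split
counts = reduce_phase(shuffle_and_sort(mapped))
print(counts)  # {'hdfs': 2, 'hive': 2, 'yarn': 1}
```

On a real cluster each mapper runs on a different node, and the shuffle moves the intermediate pairs across the network to the reducers.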

knowledge test
What is Hadoop Map Reduce?
Hadoop MapReduce is the framework used for processing large data sets in parallel across a
Hadoop cluster.

How does Hadoop MapReduce work?

Explain the Map and Reduce phases

Explain what is shuffling in MapReduce?

The process that transfers the map outputs to the reducers as inputs is known as the shuffle.

Distributed Cache in MapReduce Framework?

When you want to share some files across all nodes in a Hadoop cluster, the Distributed Cache
is used. The files could be executable JAR files or simple properties files.

Explain what is Speculative Execution?

If a particular node is taking a long time to complete a task, Hadoop will create a
duplicate task on another node.

Input Split and HDFS Block?

The logical division of data is known as a Split, while the physical division of data is known
as an HDFS Block.

What do you mean by data locality?

Moving the computation to the data rather than the data to the computation.

Hadoop start
format NameNode:        hdfs namenode -format
start Hadoop:           start-all.sh
stop Hadoop:            stop-all.sh
open Hadoop web UI:     localhost:9001, localhost:9870, localhost:8088 (cluster)
load sample data:       hadoop fs -put dir to dir (or: hdfs dfs -put dir to dir)
exit safe mode:         hdfs dfsadmin -safemode leave
delete data from HDFS:  hdfs dfs -rm -r file
Make dir (Hadoop)
Operation
hdfs dfs -put /Users/mohammedzahidk/desktop/test.csv /user/zahi
hdfs dfs -mkdir
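The commands above can also be assembled from a script. The sketch below only builds and prints the command lines as a dry run; it does not run Hadoop. The helper name `hdfs_commands` and the paths `test.csv` and `/user/example` are placeholders invented for this example.

```python
import shlex


def hdfs_commands(local_file, hdfs_dir):
    """Build (but do not run) the argument lists for a typical upload sequence."""
    return [
        ["hdfs", "dfsadmin", "-safemode", "leave"],     # exit safe mode
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],      # create the target directory
        ["hdfs", "dfs", "-put", local_file, hdfs_dir],  # copy local file into HDFS
        ["hdfs", "dfs", "-ls", hdfs_dir],               # list the directory to confirm
    ]


for argv in hdfs_commands("test.csv", "/user/example"):
    print(shlex.join(argv))
    # To execute for real on a machine with Hadoop installed:
    # subprocess.run(argv, check=True)
```

Passing the command as a list of arguments avoids shell quoting problems when a path contains spaces.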

captsl
zahid

MAPREDUCE

SQL (query)
Standard language for communicating with a database
Create database
Use

Data type
Numeric data type
- INT
- FLOAT
- DECIMAL
Non-numeric data type
- CHAR
- VARCHAR
- ENUM
- BOOLEAN

Date and time

- DATE (YYYY-MM-DD)
- DATETIME (YYYY-MM-DD HH:MM:SS)
- TIME (HH:MM:SS)
- YEAR (YYYY)
FOREIGN KEY
It is used to link two tables together
The table with the primary key is called the referenced or parent table
The table with the foreign key is called the child table

Eg
-- ORDER is a reserved word in SQL, so the table is named orders
CREATE TABLE orders (
  id INT AUTO_INCREMENT PRIMARY KEY,
  product_id INT,
  customer_id INT,
  ordertime DATETIME,
  FOREIGN KEY (product_id) REFERENCES product(id),
  FOREIGN KEY (customer_id) REFERENCES customer(id)
);
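To see a foreign key doing its job, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. SQLite is an assumption chosen so the example runs without a server; its syntax differs a little from the MySQL-style notes above (for example `INTEGER PRIMARY KEY` instead of `AUTO_INCREMENT`), and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE product  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,          -- SQLite's auto-assigned key
        product_id INTEGER,
        customer_id INTEGER,
        ordertime TEXT,
        FOREIGN KEY (product_id) REFERENCES product(id),
        FOREIGN KEY (customer_id) REFERENCES customer(id)
    );
    INSERT INTO product  VALUES (1, 'keyboard');
    INSERT INTO customer VALUES (1, 'Asha');
""")

# A child row that points at existing parent rows is accepted.
conn.execute("INSERT INTO orders (product_id, customer_id, ordertime) "
             "VALUES (1, 1, '2024-01-15 10:30:00')")

# A child row that points at a missing parent row is rejected.
try:
    conn.execute("INSERT INTO orders (product_id, customer_id, ordertime) "
                 "VALUES (99, 1, '2024-01-15 10:31:00')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True
```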

QUERY
1. SELECT
2. SHOW
3. DROP TABLE, DELETE
4. ALTER: ADD, DROP, CHANGE, MODIFY
5. DESCRIBE
6. INSERT INTO table_name (column, ...) VALUES ('value', ...)
7. WHERE
8. AND/OR
9. BETWEEN
10. IS NULL
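A few of the clauses in the list above (WHERE, AND, BETWEEN, IS NULL) can be tried out with Python's built-in `sqlite3` module. The `staff` table, its columns and its rows are invented sample data, and the `names` helper is just a convenience for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, email TEXT);
    INSERT INTO staff (name, age, email) VALUES
        ('Asha', 25, 'asha@example.com'),
        ('Ben',  35, NULL),
        ('Chen', 45, 'chen@example.com');
""")


def names(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]


older = names("SELECT name FROM staff WHERE age > 30 ORDER BY id")
older_with_email = names(
    "SELECT name FROM staff WHERE age > 30 AND email IS NOT NULL ORDER BY id")
mid_range = names("SELECT name FROM staff WHERE age BETWEEN 25 AND 35 ORDER BY id")
no_email = names("SELECT name FROM staff WHERE email IS NULL ORDER BY id")

print(older)             # ['Ben', 'Chen']
print(older_with_email)  # ['Chen']
print(mid_range)         # ['Asha', 'Ben']
print(no_email)          # ['Ben']
```

BETWEEN includes both end values, which is why the row with age 35 appears in the third result.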
Add foreign key
ALTER TABLE name
ADD CONSTRAINT FRKY_id
FOREIGN KEY (FRKY_id) REFERENCES name(id);

SQL JOINS
- Inner join
- Left join
- Right join
- Full outer join
Inner join
SELECT customers.customer_id, orders.order_id, orders.order_date
FROM customers
INNER JOIN orders
ON customers.customer_id = orders.customer_id
ORDER BY customers.customer_id;
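The difference between an inner join and a left join is easiest to see on a few rows. The sketch below runs the inner join query from these notes, plus a left join for comparison, using Python's built-in `sqlite3` module. The table contents are made-up sample data, and SQLite is an assumption chosen so the example runs without a database server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
    INSERT INTO orders VALUES
        (10, 1, '2024-01-15'), (11, 3, '2024-01-16'), (12, 3, '2024-01-17');
""")

# Inner join: only customers that have at least one matching order.
inner = conn.execute("""
    SELECT customers.customer_id, orders.order_id, orders.order_date
    FROM customers
    INNER JOIN orders ON customers.customer_id = orders.customer_id
    ORDER BY customers.customer_id, orders.order_id
""").fetchall()

# Left join: every customer, with NULL (None) where no order matches.
left = conn.execute("""
    SELECT customers.customer_id, orders.order_id
    FROM customers
    LEFT JOIN orders ON customers.customer_id = orders.customer_id
    ORDER BY customers.customer_id, orders.order_id
""").fetchall()

print(inner)  # [(1, 10, '2024-01-15'), (3, 11, '2024-01-16'), (3, 12, '2024-01-17')]
print(left)   # [(1, 10), (2, None), (3, 11), (3, 12)]
```

Customer 2 has no orders, so it is missing from the inner join but appears in the left join with an empty order id.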

APACHE HIVE
