
02 HadoopIntroEcosystem


Introduction to Hadoop and the Hadoop Ecosystem
Chapter 2

201509
Course Chapters

Course Introduction
1 Introduction

Introduction to Hadoop
2 Introduction to Hadoop and the Hadoop Ecosystem
3 Hadoop Architecture and HDFS

Importing and Modeling Structured Data
4 Importing Relational Data with Apache Sqoop
5 Introduction to Impala and Hive
6 Modeling and Managing Data with Impala and Hive
7 Data Formats
8 Data File Partitioning

Ingesting Streaming Data
9 Capturing Data with Apache Flume

Distributed Data Processing with Spark
10 Spark Basics
11 Working with RDDs in Spark
12 Aggregating Data with Pair RDDs
13 Writing and Deploying Spark Applications
14 Parallel Processing in Spark
15 Spark RDD Persistence
16 Common Patterns in Spark Data Processing
17 Spark SQL and DataFrames

Course Conclusion
18 Conclusion

© Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.
Introduction to Hadoop and the Hadoop Ecosystem

In this chapter you will learn

§ What Hadoop is and how it addresses big data challenges
§ The guiding principles behind Hadoop
§ The major components of the Hadoop Ecosystem
§ The tools you will be using in the Homework Labs
Chapter Topics

Introduction to Hadoop and the Hadoop Ecosystem
(Course section: Introduction to Hadoop)

§ Problems with Traditional Large-scale Systems
§ Hadoop!
§ Data Storage and Ingest
§ Data Processing
§ Data Analysis and Exploration
§ Other Ecosystem Tools
§ Introduction to Homework Labs
§ Conclusion
Traditional Large-Scale Computation

§ Traditionally, computation has been processor-bound
– Relatively small amounts of data
– Lots of complex processing

§ The early solution: bigger computers
– Faster processor, more memory
– But even this couldn't keep up
Distributed Systems

§ The better solution: more computers
– Distributed systems use multiple machines for a single job

"In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers."
– Grace Hopper
Challenges with Distributed Systems

§ Challenges with distributed systems
– Programming complexity
– Keeping data and processes in sync
– Finite bandwidth
– Partial failures

§ The solution?
– Hadoop!
What is Apache Hadoop?

§ Scalable and economical data storage, processing and analysis
– Distributed and fault-tolerant
– Harnesses the power of industry-standard hardware
§ Heavily inspired by technical documents published by Google

[Figure: Hadoop platform layers. Applications (Batch Processing, Search Engine, Analytic SQL, Machine Learning, Stream Processing, Other Applications); Workload Management; Data Storage (Filesystem, Online NoSQL); Data Integration]
Common Hadoop Use Cases

§ Extract/Transform/Load (ETL)
§ Text mining
§ Index building
§ Graph creation and analysis
§ Pattern recognition
§ Collaborative filtering
§ Prediction models
§ Sentiment analysis
§ Risk assessment

§ What do these workloads have in common? Nature of the data…
– Volume
– Velocity
– Variety
Distributed Systems: The Data Bottleneck (1)

§ Traditionally, data is stored in a central location
§ Data is copied to processors at runtime
§ Fine for limited amounts of data
Distributed Systems: The Data Bottleneck (2)

§ Modern systems have much more data
– Terabytes or more per day
– Petabytes or more in total
§ We need a new approach…
Big Data Processing with Hadoop

§ Hadoop introduced a radical new approach:
– Bring the program to the data rather than the data to the program
§ Based on two key concepts
– Distribute data when the data is stored
– Run computation where the data resides

[Figure: A Hadoop Cluster]
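The two key concepts above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Hadoop is implemented: the node names, the tiny block size, and the round-robin placement are all hypothetical, and everything runs in a single process.

```python
from collections import defaultdict

BLOCK_SIZE = 4                          # records per block (hypothetical; real HDFS blocks are far larger)
NODES = ["node1", "node2", "node3"]     # hypothetical worker names


def distribute(records, nodes, block_size):
    """Split records into blocks at storage time and place each block on a node."""
    placement = defaultdict(list)
    for i in range(0, len(records), block_size):
        block = records[i:i + block_size]
        node = nodes[(i // block_size) % len(nodes)]   # simple round-robin placement
        placement[node].append(block)
    return placement


def run_where_data_resides(placement, task):
    """Apply the task to each block on the node holding it; only small results move."""
    partials = []
    for node, blocks in placement.items():
        for block in blocks:
            partials.append(task(block))   # stands in for node-local computation
    return partials


records = list(range(1, 11))
placement = distribute(records, NODES, BLOCK_SIZE)
partial_sums = run_where_data_resides(placement, sum)
total = sum(partial_sums)               # combine the small per-block results
print(total)
```

The point of the sketch is that only the per-block partial sums travel between nodes, while the records themselves stay where they were stored.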
Core Hadoop

Processing
• Spark
• MapReduce

Resource Management
• YARN

Storage
• HDFS

[Figure: A Hadoop Cluster]
Big Data Processing

[Figure: the data pipeline in four stages: 1. Ingest, 2. Process, 3. Analyze, 4. Access]

§ Data Sources
§ Data Storage
– Hadoop Distributed File System (HDFS)
– HBase
§ Data Processing
– Spark
– Hadoop MapReduce
– Pig
§ Data Analysis and Exploration
– Impala
– Hive
– Search
Data Ingest and Storage

§ Hadoop typically ingests data from many sources and in many formats
– Traditional data management systems, e.g. databases
– Logs and other machine-generated data (event data)
– Imported files

[Figure: 1. Ingest, from Data Sources into Data Storage (HDFS, HBase)]
Data Storage

§ Hadoop Distributed File System (HDFS)
– HDFS is the storage layer for Hadoop
– Provides inexpensive, reliable storage for massive amounts of data on industry-standard hardware
– Data is distributed when stored
– Covered later in this course
§ Apache HBase: The Hadoop Database
– A NoSQL distributed database built on HDFS
– Scales to support very large amounts of data and high throughput
– A table can have thousands of columns
– Covered in depth in Cloudera Training for Apache HBase
Data Ingest Tools (1)

§ HDFS
– Direct file transfer
§ Apache Sqoop
– High-speed import to HDFS from relational databases (and vice versa)
– Supports many data storage systems
– e.g. Netezza, MongoDB, MySQL, Teradata, Oracle
– Covered later in this course
Data Ingest Tools (2)

§ Apache Flume
– Distributed service for ingesting streaming data
– Ideally suited for event data from multiple systems
– For example, log files
– Covered later in this course

§ Apache Kafka
– A high-throughput, scalable messaging system
– Distributed, reliable publish-subscribe system
– Integrates with Flume and Spark Streaming
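To make the publish-subscribe idea concrete, here is a minimal in-memory sketch in Python. It is not Kafka's API: the `Broker` class, the topic name, and the subscriber names are all hypothetical, and the sketch leaves out distribution, persistence, and partitioning. It only shows that publishers append to a topic while each subscriber reads at its own position.

```python
from collections import defaultdict


class Broker:
    """Toy publish-subscribe broker (hypothetical; not Kafka's API)."""

    def __init__(self):
        self._topics = defaultdict(list)    # topic -> append-only message log
        self._offsets = defaultdict(int)    # (topic, subscriber) -> next index to read

    def publish(self, topic, message):
        """Append a message to the topic's log."""
        self._topics[topic].append(message)

    def poll(self, topic, subscriber):
        """Return messages this subscriber has not yet seen, then advance its offset."""
        log = self._topics[topic]
        start = self._offsets[(topic, subscriber)]
        self._offsets[(topic, subscriber)] = len(log)
        return log[start:]


broker = Broker()
broker.publish("device_status", "device 1 ok")
broker.publish("device_status", "device 2 overheating")

first_a = broker.poll("device_status", "subscriber_a")    # sees both messages
first_b = broker.poll("device_status", "subscriber_b")    # independently sees both
broker.publish("device_status", "device 2 ok")
second_a = broker.poll("device_status", "subscriber_a")   # sees only the new message
```

Because each subscriber tracks its own offset, several consumers can read the same stream without interfering with one another.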
Apache Spark: An Engine For Large-scale Data Processing

§ Spark is a large-scale data processing engine
– General purpose
– Runs on Hadoop clusters and processes data in HDFS
§ Supports a wide range of workloads
– Machine learning
– Business intelligence
– Streaming
– Batch processing
§ This course uses Spark for data processing
Hadoop MapReduce: The Original Hadoop Processing Engine

§ Hadoop MapReduce is the original Hadoop framework
– Primarily Java based
§ Based on the MapReduce programming model
§ The core Hadoop processing engine before Spark was introduced
§ Still the dominant technology
– But losing ground to Spark fast
§ Many existing tools are still built using MapReduce code
§ Has extensive and mature fault tolerance built into the framework
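The MapReduce programming model mentioned above can be sketched with a word count. This is a single-process Python illustration of the map, shuffle, and reduce phases, not the Hadoop MapReduce Java API; the function names and input lines are chosen here for illustration only.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key so each key reaches a single reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["the cat sat", "the dog sat down"])
print(counts)
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the shuffle moves data across the network; here all three phases run in one process.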
Apache Pig: Scripting for MapReduce

§ Apache Pig builds on Hadoop to offer high-level data processing
– This is an alternative to writing low-level MapReduce code
– Pig is especially good at joining and transforming data
§ The Pig interpreter runs on the client machine
– Turns Pig Latin scripts into MapReduce or Spark jobs
– Submits those jobs to a Hadoop cluster
– Covered in Cloudera Data Analyst Training
 
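-- Load customer and order data
people = LOAD '/user/training/customers' AS (cust_id, name);
orders = LOAD '/user/training/orders' AS (ord_id, cust_id, cost);
-- Group orders by customer and total the cost per customer
groups = GROUP orders BY cust_id;
totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
-- Join each total to its customer record and print the result
result = JOIN totals BY group, people BY cust_id;
DUMP result;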
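For readers new to Pig Latin, the logic of the script above (group orders by customer, total the cost, then join to the customer names) can be sketched in plain Python. The sample rows are invented for illustration, and this runs in one process rather than as jobs on a cluster.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the two loaded files
people = [(1, "Alice"), (2, "Bob")]                        # (cust_id, name)
orders = [(100, 1, 25.0), (101, 2, 10.0), (102, 1, 5.0)]   # (ord_id, cust_id, cost)

# GROUP orders BY cust_id, then SUM(orders.cost) per group
totals = defaultdict(float)
for _ord_id, cust_id, cost in orders:
    totals[cust_id] += cost

# JOIN totals BY group, people BY cust_id (inner join on the customer ID)
names = dict(people)
result = [
    (cust_id, total, cust_id, names[cust_id])
    for cust_id, total in sorted(totals.items())
    if cust_id in names
]

for row in result:
    print(row)
```

Each output row has the same shape as the joined Pig relation: the group key, the total, and then the fields of the matching customer record.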

Cloudera Impala: High Performance SQL

§ Impala is a high-performance SQL engine
– Runs on Hadoop clusters
– Data stored in HDFS files
– Inspired by Google's Dremel project
– Very low latency, measured in milliseconds
– Ideal for interactive analysis
§ Impala supports a dialect of SQL (Impala SQL)
– Data in HDFS modeled as database tables
§ Impala was developed by Cloudera
– 100% open source, released under the Apache software license
§ Impala is used for data analysis in this course
Apache Hive: SQL on MapReduce

§ Hive is an abstraction layer on top of Hadoop
– Hive uses a SQL-like language called HiveQL
– Similar to Impala SQL
– Useful for data processing and ETL
– Impala is preferred for ad hoc analytics
§ Hive executes queries using MapReduce
– Hive on Spark is available for early adopters; not yet recommended for production
§ Hive can optionally be used for data analysis in this course
Cloudera Search: A Platform For Data Exploration

§ Interactive full-text search for data in a Hadoop cluster
§ Allows non-technical users to access your data
– Nearly everyone can use a search engine
§ Cloudera Search enhances Apache Solr
– Integrates Solr with HDFS, MapReduce, HBase, and Flume
– Supports file formats widely used with Hadoop
– Dynamic Web-based dashboard interface with Hue
– Apache Sentry based security
§ Cloudera Search is 100% open source
Hue: The UI for Hadoop

§ Hue = Hadoop User Experience
§ Hue provides a Web front-end to Hadoop
– Upload and browse data
– Query tables in Impala and Hive
– Run Spark and Pig jobs and workflows
– Search
– And much more
§ Makes Hadoop easier to use
§ Created by Cloudera
– 100% open source, released under the Apache license
§ Hue is used throughout this course
Apache Oozie: Workflow Management

§ Oozie
– Workflow engine for Hadoop jobs
– Defines dependencies between jobs
§ The Oozie server submits the jobs to the cluster in the correct sequence
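Running jobs "in the correct sequence" amounts to ordering them so that every job runs after the jobs it depends on. The Python sketch below illustrates that idea with the standard library's `graphlib` (Python 3.9 or later). It is not how Oozie workflows are written or executed, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on
dependencies = {
    "import_accounts": set(),
    "load_weblogs": set(),
    "join_data": {"import_accounts", "load_weblogs"},
    "build_report": {"join_data"},
}

# static_order() yields every job after all of its dependencies
order = list(TopologicalSorter(dependencies).static_order())

for job in order:
    print("submit", job)
```

The two independent jobs may appear in either order, but `join_data` always follows both of them and `build_report` always comes last.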
Apache Sentry: Hadoop Security

§ Sentry provides fine-grained access control (authorization) to various Hadoop ecosystem components
– Impala
– Hive
– Cloudera Search
– HDFS
§ In conjunction with Kerberos authentication, Sentry authorization provides a complete cluster security solution
§ Created by Cloudera
– Now an open-source Apache project
Introduction to the Homework Labs

§ The best way to learn is to do!
§ Most topics in this course have a corresponding lab for practicing the skills you have learned in lecture
Scenario Explanation (1)

§ The Homework Labs are based on a hypothetical scenario
– However, the concepts apply to nearly any organization
§ Loudacre Mobile is a (fictional) fast-growing wireless carrier
– Provides mobile service to customers throughout the western USA
Scenario Explanation (2)

§ Loudacre needs to migrate their existing infrastructure to Hadoop
– The size and velocity of their data have exceeded their ability to process and analyze it
§ Loudacre data sources
– MySQL database: customer account data (name, address, phone numbers, devices)
– Apache web server logs from the Customer Service site
– HTML files: Knowledge Base articles
– XML files: device activation records
– Real-time device status logs
– Base stations: cell tower locations
Introduction to Homework Labs: Getting Started

§ Instructions are in the Homework Labs
§ Start with
– General Notes
– Setting Up
– Run setup script for the course
Introduction to Homework Labs: Classroom Virtual Machine

§ Your virtual machine
– Log in as user training (password training)
– Pre-installed and configured with Spark and CDH (Cloudera's Distribution, including Apache Hadoop)
– Also includes various tools such as Firefox, gedit, Emacs, Eclipse, and Maven
§ Training materials: ~/training_materials/dev1 folder on the VM
– exercises: one folder per homework
– scripts: course setup scripts
§ Course data: ~/training_materials/data
Essential Points

§ Hadoop is a framework for distributed storage and processing
§ Core Hadoop includes HDFS for storage and YARN for cluster resource management
§ The Hadoop ecosystem includes many components for
– Ingesting data (Flume, Sqoop, Kafka)
– Storing data (HDFS, HBase)
– Processing data (Spark, Hadoop MapReduce, Pig)
– Modeling data as tables for SQL access (Impala, Hive)
– Exploring data (Hue, Search)
– Protecting data (Sentry)
§ This course introduces most of the key Hadoop infrastructure
§ Homework Labs let you practice and refine your Hadoop skills!
Bibliography

The following offer more information on topics discussed in this chapter
§ Hadoop: The Definitive Guide (published by O'Reilly)
– http://tiny.cloudera.com/hadooptdg
§ Cloudera Essentials for Apache Hadoop, free online training
– http://tiny.cloudera.com/esscourse
