
00 HadoopWelcome Transcript

Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Hadoop is commonly used for applications involving large data sets such as web search indexing, data mining, log analysis and bioinformatics.


Transcript name: What is Hadoop?


Hello everyone, and welcome! My name is Akmal Chaudhri. In this video we will explain what Hadoop and Big Data are.

Imagine this scenario: you have 1GB of data that you need to process. The data are stored in a relational database on your desktop computer, and this computer has no problem handling the load. Then your company starts growing very quickly, and that data grows to 10GB, and then 100GB, and you start to reach the limits of your desktop machine. So you scale up by investing in a larger computer, and you are OK for a few more months. Then your data grows to 10TB, and then 100TB, and you are fast approaching the limits of that computer, too. Moreover, you are now asked to feed your application with unstructured data coming from sources like Facebook, Twitter, RFID readers, sensors, and so on. Your management wants to derive information from both the relational data and the unstructured data, and wants this information as soon as possible. What should you do? Hadoop may be the answer!

Hadoop is an open source project of the Apache Software Foundation. It is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. Hadoop uses Google's MapReduce and Google File System technologies as its foundation; a short word-count sketch below illustrates the programming model. It is optimized to handle massive quantities of data, which could be structured, unstructured, or semi-structured, using commodity hardware, that is, relatively inexpensive computers. This massively parallel processing is done with great performance. However, it is a batch operation handling massive quantities of data, so the response time is not immediate. As of Hadoop version 0.20.2, updates to stored files are not possible, but appends will be possible starting in version 0.21.

Hadoop replicates its data across different computers, so that if one goes down, the data are processed on one of the replicated computers. Hadoop is not suitable for OnLine Transaction Processing (OLTP) workloads, where data are randomly accessed on structured data, as in a relational database. Nor is it suitable for OnLine Analytical Processing (OLAP) or Decision Support System workloads, where data are sequentially accessed on structured data to generate reports that provide business intelligence. Hadoop is used for Big Data. It complements OLTP and OLAP; it is NOT a replacement for a relational database system.

So, what is Big Data? With all the devices available today to collect data, such as RFID readers, microphones, cameras, sensors, and so on, we are seeing an explosion in data being collected worldwide. Big Data is a term used to describe large collections of data (also known as datasets) that may be unstructured and grow so large and so quickly that they are difficult to manage with regular database or statistics tools. Some interesting statistics illustrating this data explosion: there are more than 2 billion internet users in the world today, and 4.6 billion mobile phones in 2011; 7TB of data are processed by Twitter every day, and 10TB by Facebook. Interestingly, approximately 80% of these data are unstructured. With this massive quantity of data, businesses need fast, reliable, deeper data insight. Therefore, Big Data solutions based on Hadoop and other analytics software are becoming more and more relevant.

Here is a list of other open source projects related to Hadoop: Eclipse is a popular IDE donated by IBM to the open source community. Lucene is a text search engine library written in Java. HBase is the Hadoop database. Hive provides data warehousing tools to extract, transform, and load data, and to query data stored in Hadoop files. Pig is a platform for analyzing large data sets; it provides a high-level language for expressing data analysis programs. Jaql (pronounced "jackal") is a query language for JavaScript Object Notation (JSON). ZooKeeper is a centralized configuration service and naming registry for large distributed systems. Avro is a data serialization system. UIMA is an architecture for the development, discovery, composition, and deployment of components for the analysis of unstructured data.
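To make the MapReduce programming model mentioned above concrete, here is a minimal word-count sketch using the standard Hadoop Java API. This example is not part of the original transcript: the input and output directories are hypothetical command-line arguments, and the code assumes a Hadoop 2.x or later client library on the classpath.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: for each line of input, emit (word, 1) for every token.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each distinct word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework handles splitting the input across the cluster, shuffling each word to a single reducer, and re-running failed tasks, which is why the user code can stay this simple.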

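The transcript also notes that Hadoop replicates data across machines. As an illustrative sketch of that idea (again, not from the original transcript), the snippet below writes a small file to HDFS with an explicit replication factor using the standard FileSystem API. The NameNode address hdfs://namenode:8020 and the file path are placeholders for a real cluster's values.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder address: point this at your cluster's NameNode.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    Path file = new Path("/user/demo/hello.txt"); // hypothetical path
    short replication = 3; // HDFS keeps 3 copies, so losing one machine loses no data

    try (FSDataOutputStream out = fs.create(
        file, true /* overwrite */, 4096 /* buffer size */,
        replication, fs.getDefaultBlockSize(file))) {
      out.writeUTF("Hello, HDFS!");
    }
    fs.close();
  }
}
```

If a DataNode holding one copy fails, the NameNode notices the under-replicated block and schedules a new copy elsewhere, which is the mechanism behind the fault tolerance described above.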
Let's now talk about examples of Hadoop in action. Early in 2011, Watson, a supercomputer developed by IBM, competed on the popular question-and-answer show Jeopardy!. Watson succeeded in beating the show's two most successful players. It was fed approximately 200 million pages of text, using Hadoop to distribute the workload of loading this information into memory. Once the information was loaded, Watson used other technologies for advanced search and analysis.

In the telecommunications industry we have China Mobile, a company that built a Hadoop cluster to perform data mining on call data records. China Mobile was producing 5-8TB of these records daily. By using a Hadoop-based system, they were able to process 10 times as much data as with their old system, and at one fifth of the cost.

In the media we have The New York Times, which wanted to host on its website all public domain articles from 1851 to 1922. It converted the articles from 11 million image files into 1.5TB of PDF documents. This was implemented by one employee, who ran a job in 24 hours on a 100-instance Amazon EC2 Hadoop cluster, at a very low cost.

In the technology field we again have IBM, with IBM ES2, an enterprise search technology based on Hadoop, Lucene, and Jaql. ES2 is designed to address unique challenges of enterprise search, such as the use of enterprise-specific vocabulary, abbreviations, and acronyms. ES2 can perform mining tasks to build acronym libraries, regular expression patterns, and geo-classification rules. There are also many internet and social network companies using Hadoop, such as Yahoo, Facebook, Amazon, eBay, Twitter, StumbleUpon, Rackspace, Ning, AOL, and so on. Yahoo is, of course, the largest production user, with an application running on a Hadoop cluster consisting of approximately 10,000 Linux machines. Yahoo is also the largest contributor to the Hadoop open source project.

Now, Hadoop is not a magic bullet that solves all kinds of problems. Hadoop is not good for processing transactions, because they require random access to data. It is not good when the work cannot be parallelized, not good for low-latency data access, not good for processing lots of small files, and not good for intensive calculations on little data.

Now let's move on and talk about Big Data solutions. Big Data solutions are more than just Hadoop. They can add analytics solutions to the mix to derive valuable information that combines structured legacy data with new unstructured data. Big Data solutions may also be used to derive information from data in motion. For example, IBM has a product called InfoSphere Streams that can be used to quickly determine customer sentiment for a new product based on Facebook or Twitter comments.

Finally, let's end this presentation with one final thought: cloud computing has gained tremendous traction in the past few years, and it is a perfect fit for Big Data solutions. Using the cloud, a Hadoop cluster can be set up in minutes, on demand, and it can run for as long as needed without your having to pay for more than what is used. This is the end of this video. Thank you for watching. To learn more, visit BigDataUniversity.com.
