
Introduction to Data Science, Big Data, Data Analytics and their Applications

I. Big Data Concept & Definition


The quantitative explosion of digital data has forced researchers to find new ways of seeing and
analyzing the world. It is about discovering new orders of magnitude for capturing, searching,
sharing, storing, analyzing and presenting data. Thus was born Big Data. Big data is shaking up
our way of doing business. The concept, as currently defined, encompasses a set of technologies
and practices designed to store very large amounts of data and analyze them very quickly. The
term has been popularized since 2012 to reflect the fact that companies are confronted with
ever larger volumes of data to be processed, carrying strong commercial and marketing stakes.

www.inchtechs.com 1 Dr-Eng. Aurelle Tchagna


II. Big Data, Data Science and Data Analytics
Data analytics is the science of analyzing raw data in order to draw conclusions from that
information. Data analytics (DA) is the process of examining data sets in order to draw
conclusions about the information they contain, increasingly with the aid of specialized systems
and software. Data analytics technologies and techniques are widely used in commercial industries
to enable organizations to make more informed business decisions, and by scientists and
researchers to verify or disprove scientific models, theories and hypotheses. Data science is the
field that comprises everything related to data cleansing, preparation and analysis. Big data work
involves automating insights over a given dataset and relies on queries and data aggregation
procedures.
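As a minimal illustration of the kind of data aggregation procedure just mentioned, the plain-Python sketch below groups records by a key and sums a value; the record fields (`region`, `sales`) are invented for the example and stand in for whatever attributes a real dataset would carry.

```python
# A minimal sketch of a group-and-aggregate procedure in plain Python.
# The record fields ("region", "sales") are hypothetical examples.
from collections import defaultdict

records = [
    {"region": "north", "sales": 120},
    {"region": "south", "sales": 80},
    {"region": "north", "sales": 50},
]

totals = defaultdict(int)
for rec in records:
    totals[rec["region"]] += rec["sales"]  # group by region, sum sales

print(dict(totals))  # → {'north': 170, 'south': 80}
```

The same group-by-key-then-combine pattern reappears, distributed across machines, in the MapReduce paradigm discussed below.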

Figure: Skills required to become a Data Scientist, Big Data Specialist and Data Analyst.



III. MapReduce Paradigm
MapReduce is a programming paradigm that was designed to allow parallel distributed processing
of large sets of data, converting them to sets of tuples, and then combining and reducing those
tuples into smaller sets of tuples. In layman’s terms, MapReduce was designed to take big data
and use parallel distributed computing to turn big data into little- or regular-sized data. Parallel
distributed processing refers to a powerful framework where mass volumes of data are processed
very quickly by distributing processing tasks across clusters of commodity servers.
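The paradigm described above can be sketched in a few lines of plain Python: a map step emits (key, value) tuples, a shuffle step groups them by key, and a reduce step combines each group into a smaller set of tuples. This is only a single-machine illustration of the model, not Hadoop itself, which would spread the same steps across a cluster.

```python
# A toy word count illustrating the MapReduce flow described above:
# map to (key, value) tuples, shuffle by key, reduce each group.
from itertools import groupby
from operator import itemgetter

documents = ["big data", "big ideas", "data science"]

# Map: emit a (word, 1) tuple for every word in every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group tuples by key (Hadoop does this across the cluster)
mapped.sort(key=itemgetter(0))

# Reduce: combine each group of tuples into a smaller set of tuples
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}

print(counts)  # → {'big': 2, 'data': 2, 'ideas': 1, 'science': 1}
```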



IV. Big Data Framework
No framework is ubiquitous, but a few stand out. Spark is the best big data framework
according to the techrepublic.com website, while Hadoop is one of the first frameworks used to
work with big data. In this book, we are going to work with the Hadoop and Spark frameworks. With
multiple big data frameworks available on the market, choosing the right one is a challenge. A
classic comparison of each platform's pros and cons is unlikely to help, as businesses
should consider each framework from the perspective of their particular needs. Facing multiple
Hadoop MapReduce vs. Apache Spark requests, our big data consulting practitioners compare the two
leading frameworks to answer a burning question: which option to choose, Hadoop MapReduce
or Spark?
Both Hadoop and Spark are open-source projects of the Apache Software Foundation, and both are
flagship products in big data analytics. Hadoop has been leading the big data market for more than
5 years. According to our recent market research, Hadoop's installed base amounts to 50,000+
customers, while Spark boasts only 10,000+ installations. However, Spark's popularity
skyrocketed in 2013, surpassing Hadoop in only a year. New installation growth rates
(2016/2017) show that the trend is still ongoing: Spark is outperforming Hadoop with 47% vs.
14% respectively. In fact, the key difference between Hadoop MapReduce and Spark lies in
the approach to processing: Spark can do it in-memory, while Hadoop MapReduce has to read
from and write to a disk. As a result, the speed of processing differs significantly – Spark may be


up to 100 times faster. However, the volume of data processed also differs: Hadoop MapReduce
is able to work with far larger data sets than Spark.
Tasks Hadoop MapReduce is good for:
• Linear processing of huge data sets. Hadoop MapReduce allows parallel processing of
huge amounts of data. It breaks a large chunk into smaller ones to be processed separately
on different data nodes and automatically gathers the results across the multiple nodes to
return a single result. In case the resulting dataset is larger than available RAM, Hadoop
MapReduce may outperform Spark.
• Economical solution, if no immediate results are expected. Our Hadoop team considers
MapReduce a good solution if the speed of processing is not critical. For instance, if data
processing can be done during night hours, it makes sense to consider using Hadoop
MapReduce.
Tasks Spark is good for:

• Fast data processing. In-memory processing makes Spark faster than Hadoop MapReduce
– up to 100 times for data in RAM and up to 10 times for data in storage.
• Iterative processing. If the task is to process data again and again – Spark defeats Hadoop
MapReduce. Spark’s Resilient Distributed Datasets (RDDs) enable multiple map
operations in memory, while Hadoop MapReduce has to write interim results to a disk.
• Near real-time processing. If a business needs immediate insights, then they should opt
for Spark and its in-memory processing.
• Graph processing. Spark’s computational model is good for iterative computations that
are typical in graph processing. And Apache Spark has GraphX – an API for graph
computation.
• Machine learning. Spark has MLlib – a built-in machine learning library, while Hadoop
needs a third party to provide one. MLlib has out-of-the-box algorithms that also run in
memory. But if required, our Spark specialists will tune and adjust them to tailor to your
needs.
• Joining datasets. Due to its speed, Spark can create all combinations faster, though
Hadoop may be better if joining very large data sets requires a lot of shuffling and
sorting.
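The iterative-processing point above can be illustrated without Spark itself. The plain-Python sketch below mimics the effect of caching an interim result in memory, as Spark's RDD `cache()` does: without caching, the transform is recomputed on every iteration (much as Hadoop MapReduce rereads interim results from disk); with caching, it is computed once and reused.

```python
# Plain-Python sketch (no Spark required) of why keeping interim
# results in memory, as Spark's RDD cache() does, helps iteration.
calls = {"n": 0}

def transform(data):
    # stand-in for a map stage that Hadoop would re-read from disk
    calls["n"] += 1
    return [x * 2 for x in data]

data = list(range(5))

# Without caching: the transform is recomputed on every iteration
no_cache = [sum(transform(data)) for _ in range(3)]   # 3 transform calls

# With caching: compute once, reuse the in-memory result
cached = transform(data)                              # 1 transform call
with_cache = [sum(cached) for _ in range(3)]

assert no_cache == with_cache  # same answers, fewer recomputations
```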



V. Machine Learning (ML)

VI. NoSQL Data Base



VII. Big Data Applications

• Customer segmentation. Analyzing customer behavior and identifying segments of
customers that demonstrate similar behavior patterns will help businesses to understand
customer preferences and create a unique customer experience.
• Risk management. Forecasting different possible scenarios can help managers make the
right decisions by choosing less risky options.
• Real-time fraud detection. After the system is trained on historical data with the help of
machine-learning algorithms, it can use these findings to identify or predict, in real time,
an anomaly that may signal possible fraud.



• Industrial big data analysis. It’s also about detecting and predicting anomalies, but in this
case, these anomalies are related to machinery breakdowns. A properly configured system
collects the data from sensors to detect pre-failure conditions.
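As a minimal sketch of the anomaly-detection idea behind both the fraud and the industrial examples, the snippet below "trains" on historical readings (invented sensor values, not real data) and flags new readings that fall far outside the learned range; a production system would of course use richer machine-learning models than a simple threshold.

```python
# A minimal sketch of anomaly detection of the kind described above:
# learn normal behavior from historical readings, then flag values
# that deviate too far. The data and threshold here are made up.
import statistics

history = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2]   # normal sensor readings
mean = statistics.mean(history)
stdev = statistics.stdev(history)

def is_anomaly(reading, k=3.0):
    # flag readings more than k standard deviations from the mean
    return abs(reading - mean) > k * stdev

print(is_anomaly(20.1))  # normal reading → False
print(is_anomaly(35.0))  # pre-failure spike → True
```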



VIII. Exercise Note
1. According to analysts, for what can traditional IT systems provide a foundation when they’re
integrated with big data technologies like Hadoop?

A. Big data management and data mining B. Data warehousing and business intelligence C.
Management of Hadoop clusters D. Collecting and storing unstructured data

2. All of the following accurately describe Hadoop, EXCEPT:

A. Open source B. Real-time C. Java-based D. Distributed computing approach

3. __________ has the world’s largest Hadoop cluster.

A. Apple B. Datamatics C. Facebook D. None of the mentioned



4. What are the five V’s of Big Data?

A. Volume B. Velocity C. Variety D. All the above

5. _________ hides the limitations of Java behind a powerful and concise Clojure API for
Cascading.

A. Scalding B. Cascalog C. Hcatalog D. Hcalding

6. What are the main components of Big Data?

A. MapReduce B. HDFS C. YARN D. All of these

7. What are the different features of Big Data Analytics?

A. Open-Source B. Scalability C. Data Recovery D. All the above

8. Define the Port Numbers for NameNode, Task Tracker and Job Tracker.

A. NameNode B. Task Tracker C. Job Tracker D. All of the above

9. Facebook Tackles Big Data With _______ based on Hadoop

A. Project Prism B. Prism C. Project Data D. Project Bid

10. What is a unit of data that flows through a Flume agent?

A. Record B. Event C. Row D. Log

11. As companies move past the experimental phase with Hadoop, many cite the need for
additional capabilities, including:

a) Improved data storage and information retrieval

b) Improved extract, transform and load features for data integration

c) Improved data warehousing functionality

d) Improved security, workload management and SQL support

12. According to analysts, for what can traditional IT systems provide a foundation when they’re
integrated with big data technologies like Hadoop?



a) Big data management and data mining

b) Data warehousing and business intelligence

c) Management of Hadoop clusters

d) Collecting and storing unstructured data

15. Which HDFS command is used to check for various inconsistencies?

a) fsk

b) fsck

c) fetchdt

d) none of the mentioned

16. Point out the correct statement:

a) All hadoop commands are invoked by the bin/hadoop script

b) Hadoop has an option parsing framework that employs only parsing generic options

c) Archive command creates a hadoop archive

d) All of the mentioned

17. HDFS supports the ____________ command to fetch Delegation Token and store it in a file
on the local system.

a) fetdt

b) fetchdt

c) fsk

d) rec

18. In ___________ mode, the NameNode will interactively prompt you at the command line
about possible courses of action you can take to recover your data.

a) full



b) partial

c) recovery

d) commit

19. Point out the wrong statement:

a) classNAME displays the class name needed to get the Hadoop jar

b) Balancer Runs a cluster balancing utility

c) An administrator can simply press Ctrl-C to stop the rebalancing process

d) None of the mentioned

20. _________ command is used to copy file or directories recursively.

a) dtcp

b) distcp

c) dcp

d) distc

21. __________ mode is a Namenode state in which it does not accept changes to the name space.

a) Recover

b) Safe

c) Rollback

d) None of the mentioned

22. __________ command is used to interact and view Job Queue information in HDFS.

a) queue

b) priority

c) dist



d) all of the mentioned

23. Which of the following commands runs the HDFS secondary namenode?

a) secondary namenode

b) secondarynamenode

c) secondary_namenode

d) none of the mentioned

25. Point out the correct statement:

a) MapReduce tries to place the data and the compute as close as possible

b) Map Task in MapReduce is performed using the Mapper() function

c) Reduce Task in MapReduce is performed using the Map() function

d) All of the mentioned

26. ___________ part of the MapReduce is responsible for processing one or more chunks of data
and producing the output results.

a) Maptask

b) Mapper

c) Task execution

d) All of the mentioned

27. _________ function is responsible for consolidating the results produced by each of the Map()
functions/tasks.

a) Reduce

b) Map

c) Reducer

d) All of the mentioned



28. Point out the wrong statement:

a) A MapReduce job usually splits the input data-set into independent chunks which are processed
by the map tasks in a completely parallel manner

b) The MapReduce framework operates exclusively on pairs

c) Applications typically implement the Mapper and Reducer interfaces to provide the map and
reduce methods

d) None of the mentioned

30. ________ is a utility which allows users to create and run jobs with any executable as the
mapper and/or the reducer.

a) Hadoop Strdata

b) Hadoop Streaming

c) Hadoop Stream

d) None of the mentioned

31. __________ maps input key/value pairs to a set of intermediate key/value pairs.

a) Mapper

b) Reducer

c) Both Mapper and Reducer

d) None of the mentioned

32. The number of maps is usually driven by the total size of:

a) inputs

b) outputs

c) tasks

d) none of the mentioned



33. Running a ___________ program involves running mapping tasks on many or all of the nodes
in our cluster.

a) MapReduce

b) Map

c) Reducer

d) All of the mentioned

34. ___________ is the world’s most complete, tested, and popular distribution of Apache Hadoop
and related projects.

a) MDH

b) CDH

c) ADH

d) BDH

35. Point out the correct statement:

a) Cloudera is also a sponsor of the Apache Software Foundation

b) CDH is 100% Apache-licensed open source and is the only Hadoop solution to offer unified
batch processing, interactive SQL, and interactive search, and role-based access controls

c) More enterprises have downloaded CDH than all other such distributions combined

d) All of the mentioned

36. Cloudera ___________ includes CDH and an annual subscription license (per node) to
Cloudera Manager and technical support.

a) Enterprise

b) Express

c) Standard

d) All of the mentioned



37. Cloudera Express includes CDH and a version of Cloudera ___________ lacking enterprise
features such as rolling upgrades and backup/disaster recovery.

a) Enterprise

b) Express

c) Standard

d) Manager

38. Point out the wrong statement:

a) CDH contains the main, core elements of Hadoop

b) In October 2012, Cloudera announced the Cloudera Impala project

c) CDH may be downloaded from Cloudera’s website at no charge

d) None of the mentioned

39. __________ is an online NoSQL database developed by Cloudera.

a) HCatalog

b) Hbase

c) Imphala

d) Oozie

40. CDH processes and controls sensitive data and facilitates:

a) multi-tenancy

b) flexibility

c) scalability

d) all of the mentioned
