
Introduction to Data Science, Big Data, Data Analytics and their Applications

I. Big Data Concept & Definition


The quantitative explosion of digital data has forced researchers to find new ways of seeing and
analyzing the world. It is about discovering new orders of magnitude for capturing, searching,
sharing, storing, analyzing and presenting data. Thus was born Big Data. Big data is shaking up
our way of doing business. The concept, as currently defined, encompasses a set of technologies
and practices designed to store very large amounts of data and analyze them very quickly. The
term has been popularized since 2012 to reflect the fact that companies are confronted with
ever larger volumes of data to be processed, carrying strong commercial and marketing stakes.

www.inchtechs.com 1 Dr-Eng. Aurelle Tchagna


II. Big Data, Data Science and Data Analytics
Data analytics is the science of analyzing raw data in order to draw conclusions from that
information. Data analytics (DA) is the process of examining data sets in order to draw
conclusions about the information they contain, increasingly with the aid of specialized systems
and software. Data analytics technologies and techniques are widely used in commercial industries
to enable organizations to make more informed business decisions, and by scientists and
researchers to verify or disprove scientific models, theories and hypotheses. Data science is the
field that comprises everything related to data cleansing, preparation and analysis. Big data work
involves automating insights over a given dataset and relies on queries and data aggregation
procedures.
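As a minimal illustration of the kind of data aggregation procedure just mentioned, the plain-Python sketch below groups records by a key and sums a value; the record fields (`region`, `sales`) are invented for the example and stand in for whatever attributes a real dataset would carry.

```python
# A minimal sketch of a group-and-aggregate procedure in plain Python.
# The record fields ("region", "sales") are hypothetical examples.
from collections import defaultdict

records = [
    {"region": "north", "sales": 120},
    {"region": "south", "sales": 80},
    {"region": "north", "sales": 50},
]

totals = defaultdict(int)
for rec in records:
    totals[rec["region"]] += rec["sales"]  # group by region, sum sales

print(dict(totals))  # → {'north': 170, 'south': 80}
```

The same group-by-key-then-combine pattern reappears, distributed across machines, in the MapReduce paradigm discussed below.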

Figure: Skills required to become a Data Scientist, Big Data Specialist and Data Analyst.



III. MapReduce Paradigm
MapReduce is a programming paradigm that was designed to allow parallel distributed processing
of large sets of data, converting them to sets of tuples, and then combining and reducing those
tuples into smaller sets of tuples. In layman’s terms, MapReduce was designed to take big data
and use parallel distributed computing to turn big data into little- or regular-sized data. Parallel
distributed processing refers to a powerful framework where mass volumes of data are processed
very quickly by distributing processing tasks across clusters of commodity servers.
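The paradigm described above can be sketched in a few lines of plain Python: a map step emits (key, value) tuples, a shuffle step groups them by key, and a reduce step combines each group into a smaller set of tuples. This is only a single-machine illustration of the model, not Hadoop itself, which would spread the same steps across a cluster.

```python
# A toy word count illustrating the MapReduce flow described above:
# map to (key, value) tuples, shuffle by key, reduce each group.
from itertools import groupby
from operator import itemgetter

documents = ["big data", "big ideas", "data science"]

# Map: emit a (word, 1) tuple for every word in every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group tuples by key (Hadoop does this across the cluster)
mapped.sort(key=itemgetter(0))

# Reduce: combine each group of tuples into a smaller set of tuples
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}

print(counts)  # → {'big': 2, 'data': 2, 'ideas': 1, 'science': 1}
```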



IV. Big Data Framework
No framework is ubiquitous, but a few stand out. Spark is the best big data framework
according to the techrepublic.com website, while Hadoop is one of the first frameworks used to
work with big data. In this book, we are going to work with the Hadoop and Spark frameworks. With
multiple big data frameworks available on the market, choosing the right one is a challenge. A
classic comparison of each platform's pros and cons is unlikely to help, as businesses
should consider each framework from the perspective of their particular needs. Facing multiple
Hadoop MapReduce vs. Apache Spark requests, our big data consulting practitioners compare the two
leading frameworks to answer a burning question: which option to choose, Hadoop MapReduce
or Spark?
Both Hadoop and Spark are open-source projects of the Apache Software Foundation, and both are
flagship products in big data analytics. Hadoop has been leading the big data market for more than
5 years. According to our recent market research, Hadoop's installed base amounts to 50,000+
customers, while Spark boasts only 10,000+ installations. However, Spark's popularity
skyrocketed in 2013, surpassing Hadoop in only a year. New installation growth rates
(2016/2017) show that the trend is still ongoing: Spark is outperforming Hadoop with 47% vs.
14% respectively. In fact, the key difference between Hadoop MapReduce and Spark lies in
the approach to processing: Spark can do it in-memory, while Hadoop MapReduce has to read
from and write to a disk. As a result, the speed of processing differs significantly – Spark may be


up to 100 times faster. However, the volume of data processed also differs: Hadoop MapReduce
is able to work with far larger data sets than Spark.
Tasks Hadoop MapReduce is good for:
• Linear processing of huge data sets. Hadoop MapReduce allows parallel processing of
huge amounts of data. It breaks a large chunk into smaller ones to be processed separately
on different data nodes and automatically gathers the results across the multiple nodes to
return a single result. In case the resulting dataset is larger than available RAM, Hadoop
MapReduce may outperform Spark.
• Economical solution, if no immediate results are expected. Our Hadoop team considers
MapReduce a good solution if the speed of processing is not critical. For instance, if data
processing can be done during night hours, it makes sense to consider using Hadoop
MapReduce.
Tasks Spark is good for:

• Fast data processing. In-memory processing makes Spark faster than Hadoop MapReduce
– up to 100 times for data in RAM and up to 10 times for data in storage.
• Iterative processing. If the task is to process data again and again – Spark defeats Hadoop
MapReduce. Spark’s Resilient Distributed Datasets (RDDs) enable multiple map
operations in memory, while Hadoop MapReduce has to write interim results to a disk.
• Near real-time processing. If a business needs immediate insights, then they should opt
for Spark and its in-memory processing.
• Graph processing. Spark’s computational model is good for iterative computations that
are typical in graph processing. And Apache Spark has GraphX – an API for graph
computation.
• Machine learning. Spark has MLlib – a built-in machine learning library, while Hadoop
needs a third party to provide one. MLlib has out-of-the-box algorithms that also run in
memory. But if required, our Spark specialists will tune and adjust them to tailor to your
needs.
• Joining datasets. Due to its speed, Spark can create all combinations faster, though
Hadoop may be better if joining very large data sets requires a lot of shuffling and
sorting.
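The iterative-processing point above can be illustrated without Spark itself. The plain-Python sketch below mimics the effect of caching an interim result in memory, as Spark's RDD `cache()` does: without caching, the transform is recomputed on every iteration (much as Hadoop MapReduce rereads interim results from disk); with caching, it is computed once and reused.

```python
# Plain-Python sketch (no Spark required) of why keeping interim
# results in memory, as Spark's RDD cache() does, helps iteration.
calls = {"n": 0}

def transform(data):
    # stand-in for a map stage that Hadoop would re-read from disk
    calls["n"] += 1
    return [x * 2 for x in data]

data = list(range(5))

# Without caching: the transform is recomputed on every iteration
no_cache = [sum(transform(data)) for _ in range(3)]   # 3 transform calls

# With caching: compute once, reuse the in-memory result
cached = transform(data)                              # 1 transform call
with_cache = [sum(cached) for _ in range(3)]

assert no_cache == with_cache  # same answers, fewer recomputations
```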



V. Machine Learning (ML)

VI. NoSQL Data Base



VII. Big Data Applications

• Customer segmentation. Analyzing customer behavior and identifying segments of
customers that demonstrate similar behavior patterns will help businesses to understand
customer preferences and create a unique customer experience.
• Risk management. Forecasting different possible scenarios can help managers make the
right decisions by choosing less risky options.
• Real-time fraud detection. After the system is trained on historical data with the help of
machine-learning algorithms, it can use these findings to identify or predict, in real time,
an anomaly that may signal possible fraud.



• Industrial big data analysis. It’s also about detecting and predicting anomalies, but in this
case, these anomalies are related to machinery breakdowns. A properly configured system
collects the data from sensors to detect pre-failure conditions.
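As a minimal sketch of the anomaly-detection idea behind both the fraud and the industrial examples, the snippet below "trains" on historical readings (invented sensor values, not real data) and flags new readings that fall far outside the learned range; a production system would of course use richer machine-learning models than a simple threshold.

```python
# A minimal sketch of anomaly detection of the kind described above:
# learn normal behavior from historical readings, then flag values
# that deviate too far. The data and threshold here are made up.
import statistics

history = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2]   # normal sensor readings
mean = statistics.mean(history)
stdev = statistics.stdev(history)

def is_anomaly(reading, k=3.0):
    # flag readings more than k standard deviations from the mean
    return abs(reading - mean) > k * stdev

print(is_anomaly(20.1))  # normal reading → False
print(is_anomaly(35.0))  # pre-failure spike → True
```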



VIII. Exercise Note
1. According to analysts, for what can traditional IT systems provide a foundation when they’re
integrated with big data technologies like Hadoop?

A. Big data management and data mining B. Data warehousing and business intelligence C.
Management of Hadoop clusters D. Collecting and storing unstructured data

2. All of the following accurately describe Hadoop, EXCEPT:

A. Open source B. Real-time C. Java-based D. Distributed computing approach

3. __________ has the world’s largest Hadoop cluster.

A. Apple B. Datamatics C. Facebook D. None of the mentioned



4. What are the five V’s of Big Data?

A. Volume B. Velocity C. Variety D. All the above

5. _________ hides the limitations of Java behind a powerful and concise Clojure API for
Cascading.

A. Scalding B. Cascalog C. Hcatalog D. Hcalding

6. What are the main components of Big Data?

A. MapReduce B. HDFS C. YARN D. All of these

7. What are the different features of Big Data Analytics?

A. Open-Source B. Scalability C. Data Recovery D. All the above

8. Define the Port Numbers for NameNode, Task Tracker and Job Tracker.

A. NameNode B. Task Tracker C. Job Tracker D. All of the above

9. Facebook Tackles Big Data With _______ based on Hadoop

A. Project Prism B. Prism C. Project Data D. Project Bid

10. What is a unit of data that flows through a Flume agent?

A. Record B. Event C. Row D. Log

11. As companies move past the experimental phase with Hadoop, many cite the need for
additional capabilities, including:

a) Improved data storage and information retrieval

b) Improved extract, transform and load features for data integration

c) Improved data warehousing functionality

d) Improved security, workload management and SQL support

12. According to analysts, for what can traditional IT systems provide a foundation when they’re
integrated with big data technologies like Hadoop?



a) Big data management and data mining

b) Data warehousing and business intelligence

c) Management of Hadoop clusters

d) Collecting and storing unstructured data

15. Which HDFS command is used to check for various inconsistencies?

a) fsk

b) fsck

c) fetchdt

d) none of the mentioned

16. Point out the correct statement:

a) All hadoop commands are invoked by the bin/hadoop script

b) Hadoop has an option parsing framework that employs only parsing generic options

c) Archive command creates a hadoop archive

d) All of the mentioned

17. HDFS supports the ____________ command to fetch Delegation Token and store it in a file
on the local system.

a) fetdt

b) fetchdt

c) fsk

d) rec

18. In ___________ mode, the NameNode will interactively prompt you at the command line
about possible courses of action you can take to recover your data.

a) full



b) partial

c) recovery

d) commit

19. Point out the wrong statement:

a) classNAME displays the class name needed to get the Hadoop jar

b) Balancer Runs a cluster balancing utility

c) An administrator can simply press Ctrl-C to stop the rebalancing process

d) None of the mentioned

20. _________ command is used to copy file or directories recursively.

a) dtcp

b) distcp

c) dcp

d) distc

21. __________ mode is a Namenode state in which it does not accept changes to the name space.

a) Recover

b) Safe

c) Rollback

d) None of the mentioned

22. __________ command is used to interact and view Job Queue information in HDFS.

a) queue

b) priority

c) dist



d) all of the mentioned

23. Which of the following commands runs the HDFS secondary namenode?

a) secondary namenode

b) secondarynamenode

c) secondary_namenode

d) none of the mentioned

25. Point out the correct statement:

a) MapReduce tries to place the data and the compute as close as possible

b) Map Task in MapReduce is performed using the Mapper() function

c) Reduce Task in MapReduce is performed using the Map() function

d) All of the mentioned

26. ___________ part of the MapReduce is responsible for processing one or more chunks of data
and producing the output results.

a) Maptask

b) Mapper

c) Task execution

d) All of the mentioned

27. _________ function is responsible for consolidating the results produced by each of the Map()
functions/tasks.

a) Reduce

b) Map

c) Reducer

d) All of the mentioned



28. Point out the wrong statement:

a) A MapReduce job usually splits the input data-set into independent chunks which are processed
by the map tasks in a completely parallel manner

b) The MapReduce framework operates exclusively on pairs

c) Applications typically implement the Mapper and Reducer interfaces to provide the map and
reduce methods

d) None of the mentioned

30. ________ is a utility which allows users to create and run jobs with any executable as the
mapper and/or the reducer.

a) Hadoop Strdata

b) Hadoop Streaming

c) Hadoop Stream

d) None of the mentioned

31. __________ maps input key/value pairs to a set of intermediate key/value pairs.

a) Mapper

b) Reducer

c) Both Mapper and Reducer

d) None of the mentioned

32. The number of maps is usually driven by the total size of:

a) inputs

b) outputs

c) tasks

d) none of the mentioned



33. Running a ___________ program involves running mapping tasks on many or all of the nodes
in our cluster.

a) MapReduce

b) Map

c) Reducer

d) All of the mentioned

34. ___________ is the world’s most complete, tested, and popular distribution of Apache Hadoop
and related projects.

a) MDH

b) CDH

c) ADH

d) BDH

35. Point out the correct statement:

a) Cloudera is also a sponsor of the Apache Software Foundation

b) CDH is 100% Apache-licensed open source and is the only Hadoop solution to offer unified
batch processing, interactive SQL, and interactive search, and role-based access controls

c) More enterprises have downloaded CDH than all other such distributions combined

d) All of the mentioned

36. Cloudera ___________ includes CDH and an annual subscription license (per node) to
Cloudera Manager and technical support.

a) Enterprise

b) Express

c) Standard

d) All of the mentioned



37. Cloudera Express includes CDH and a version of Cloudera ___________ lacking enterprise
features such as rolling upgrades and backup/disaster recovery.

a) Enterprise

b) Express

c) Standard

d) Manager

38. Point out the wrong statement:

a) CDH contains the main, core elements of Hadoop

b) In October 2012, Cloudera announced the Cloudera Impala project

c) CDH may be downloaded from Cloudera’s website at no charge

d) None of the mentioned

39. __________ is an online NoSQL database developed by Cloudera.

a) HCatalog

b) Hbase

c) Imphala

d) Oozie

40. CDH processes and controls sensitive data and facilitates:

a) multi-tenancy

b) flexibility

c) scalability

d) all of the mentioned
