BDA 20CS41001 Course File
Geethanjali College of Engineering and Technology
(Autonomous)
Cheeryal (V), Keesara (M), Medchal District, Telangana State– 501 301
DEPARTMENT OF
COMPUTER SCIENCE & ENGINEERING
(2023-2024)
Contents
Geethanjali College of Engineering and Technology
(Autonomous)
Cheeryal (V), Keesara (M), Medchal District, Telangana State– 501 301
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
Name of the Course: BIG DATA ANALYTICS
Subject code: 20CS41001
Programme: UG
Branch: CSE
Version No: 1
Year: IV
Document Number: GCET/CSE/BDA/01
Semester: I
No. of Pages: 143
Section: IV - CSE - A, B, C, D, E
Classification status (Unrestricted/Restricted): Restricted
Distribution List: Department, Library
2. Vision of the Institute
Geethanjali visualizes dissemination of knowledge and skills to students, who would eventually contribute to the wellbeing of the people of the nation and the global community.
3. Mission of the Institute
1. To impart adequate fundamental knowledge in all basic sciences and engineering, and technical and inter-personal skills, to students.
2. To bring out creativity in students that would promote innovation, research, and entrepreneurship.
3. To preserve and promote cultural heritage, and humanistic and spiritual values, promoting peace and harmony in society.
4. Vision of the Department
To produce globally competent and socially responsible computer science engineers contributing to the advancement of engineering and technology through creativity and innovation, by providing an excellent learning environment with world-class facilities.
5. Mission of the Department
1. To be a centre of excellence in instruction, innovation in research and scholarship, and service to the stakeholders, the profession, and the public.
2. To prepare graduates to enter a rapidly changing field as competent computer science engineers.
3. To prepare graduates capable in all phases of software development, possessing a firm understanding of hardware technologies, having the strong mathematical background necessary for scientific computing, and being sufficiently well versed in general theory to allow growth within the discipline as it advances.
4. To prepare graduates to assume leadership roles by possessing good communication skills, the ability to work effectively as team members, and an appreciation for their social and ethical responsibility in a global setting.
6. Program Educational Objectives (PEOs)
1. To provide graduates with a good foundation in mathematics, sciences and engineering
fundamentals required to solve engineering problems that will facilitate them to find
employment in industry and / or to pursue postgraduate studies with an appreciation for lifelong
learning.
2. To provide graduates with analytical and problem-solving skills to design algorithms and other hardware/software systems, and to inculcate professional ethics and inter-personal skills to work in a multi-cultural team.
3. To facilitate graduates to get familiarized with state-of-the-art software/hardware tools, imbibing creativity and innovation that would enable them to develop cutting-edge technologies of a multi-disciplinary nature for societal development.
Program Outcomes (CSE)
1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering
problems.
2. Problem analysis: Identify, formulate, review research literature, and analyse complex
engineering problems reaching substantiated conclusions using first principles of
mathematics, natural sciences, and engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and
research methods including design of experiments, analysis and interpretation of data, and
synthesis of the information to provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modelling to complex
engineering activities with an understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to
assess societal, health, safety, legal and cultural issues, and the consequent responsibilities
relevant to the professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of,
and need for sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
9. Individual and teamwork: Function effectively as an individual, and as a member or
leader in diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and
write effective reports and design documentation, make effective presentations, and give
and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a
member and leader in a team, to manage projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological
change.
PSO (Program Specific Outcome):
PSO 1: Demonstrate competency in Programming and problem-solving skills and apply these
skills in solving real world problems.
PSO 2: Select appropriate programming languages, Data structures and algorithms in
combination with modern technologies and tools, apply them in developing creative and
innovative solutions.
PSO 3: Demonstrate adequate knowledge in emerging technologies.
7. Syllabus
8. Course Objectives and Outcomes
Course Objectives:
Develop ability to
1. Know the basic elements of Big Data and Data science to handle huge amounts of data.
2. Gain knowledge about various technologies for handling Big Data.
9. Brief notes on the importance of the course and how it fits into the curriculum
f. When students complete this course, what do they need to know or be able to do?
Big data analytics is an emerging field that offers many opportunities to impact industries such as finance, government, and healthcare. Data analysts can help change lives by providing valuable insights and real-time solutions. This course helps students build vital expertise in the various Big Data technologies and how they work, along with exposure to analytical and visualization tools.
g. Is there specific knowledge that the students will need to know in the future?
Analytical Thinking
SQL Database
Decision Analysis
Mathematical and Statistical Skills
Software Analytics
Programming Skills
Functions and Formulas
Data Cleaning and Preparation
Quantitative Skills
Data Visualization Skills
Query Languages
Problem Solving
Domain Knowledge
h. Five years from now, what do you hope students will remember from this course?
Graduates have the opportunity to specialize further in this field and opt for a business analyst role, or join industry in a data scientist role.
i. What is it about this course that makes it unique or special?
Data analytics is important because it helps businesses optimize their performances. A company
can also use data analytics to make better business decisions and help analyze customer trends
and satisfaction, which can lead to new—and better—products and services.
j. Why does the program offer this course?
It helps us to understand the importance of data and how it can be turned into business revenue if we can handle, process, and analyse it in an appropriate way.
l. Why can't this course be "covered" as a sub-section of another course?
The course is itself offered as a specialization at the master's level.
m. What unique contributions to students’ learning experience does this course make?
It helps students to understand the value of data and how data has become the new oil of the industry.
n. What is the value of taking this course? How exactly does it enrich the program?
Big Data analytics is important because it helps businesses optimize their performances.
Implementing it into the business model means companies can help reduce costs by
identifying more efficient ways of doing business and by storing large amounts of data.
p. What are the major career options that require this course?
Machine learning engineer.
Data architect.
Statistician.
Data analyst.
Chief technology officer (CTO).
Chief data officer (CDO).
Application architect.
Project manager.
CO4 (Unit IV - Big Data Analytics): Apply each learning model for different datasets.
CO5 (Unit V - Big Data Visualization): Analyse needs, challenges, and techniques for big data visualization.
12. Course Mapping with POs
Course: BIG DATA ANALYTICS. Course Outcomes are mapped against Program Outcomes PO1-PO12 and Program Specific Outcomes PSO1-PSO3.
CO1: Describe Big Data elements and Architectures (mapping strengths: 2, 1, 1, 1, 1, 1).
Timetable
Year/Sem/Sec: IV-B.Tech I-Semester B-Section A.Y : 2023 -24 W.E.F. 18-07-2023 Version : 01
Class Teacher: P Lalitha Room No : 220
Time 09.00-09.50 09.50-10.40 10.40-11.30 11.30-12.20 12.20-1.00 01.00-01.50 01.50-02.40 02.40-03.30
Period 1 2 3 4 5 6 7
Monday BDA BDA & CC LAB WIPRO TRAINING
Tuesday MAD STM BDA ML WIPRO TRAINING
LUNCH
Wednesday CC ML LAB WIPRO TRAINING
Thursday STM CC ML STM WIPRO TRAINING
Friday ML BDA STM CC MAD PROJECT SEMINAR
Saturday
Year/Sem/Sec: IV-B.Tech I-Semester C-Section A.Y : 2023 -24 W.E.F. 18-07-2023 Version : 01
Class Teacher: S Radha Room No : 221
Time 09.00-09.50 09.50-10.40 10.40-11.30 11.30-12.20 12.20-1.00 01.00-01.50 01.50-02.40 02.40-03.30
Period 1 2 3 4 5 6 7
Monday MAD BDA ML CC WIPRO TRAINING
Tuesday CC ML MAD STM WIPRO TRAINING
LUNCH
Wednesday STM BDA MAD CC WIPRO TRAINING
Thursday BDA BDA & CC LAB WIPRO TRAINING
Friday ML ML LAB STM PROJECT SEMINAR
Saturday
Year/Sem/Sec: IV-B.Tech I-Semester D-Section A.Y : 2023 -24 W.E.F. 18-07-2023 Version : 01
Class Teacher: T Neelima Room No : 325
Time 09.00-09.50 09.50-10.40 10.40-11.30 11.30-12.20 12.20-1.00 01.00-01.50 01.50-02.40 02.40-03.30
Period 1 2 3 4 5 6 7
Monday STM CC ML BDA WIPRO TRAINING
Tuesday ML ML LAB WIPRO TRAINING
LUNCH
Year/Sem/Sec: IV-B.Tech I-Semester E-Section A.Y : 2023 -24 W.E.F. 18-07-2023 Version : 01
Class Teacher: P Deeplaxmi Room No : 326
Time 09.00-09.50 09.50-10.40 10.40-11.30 11.30-12.20 12.20-1.00 01.00-01.50 01.50-02.40 02.40-03.30
Period 1 2 3 4 5 6 7
Monday CC MAD BDA STM WIPRO TRAINING
Tuesday BDA BDA & CC LAB WIPRO TRAINING
LUNCH
Individual Time Table
Year/Sem/Sec: IV-B.Tech I-Semester A-Section A.Y : 2023 -24 W.E.F. 18-07-2023 Version : 01
Class Teacher: Y Siva Room No : 213
Time 09.00-09.50 09.50-10.40 10.40-11.30 11.30-12.20 12.20-1.00 01.00-01.50 01.50-02.40 02.40-03.30
Period 1 2 3 4 5 6 7
Monday
LUNCH
Tuesday
Wednesday BDA BDA & CC LAB
Thursday BDA
Friday BDA
Saturday
Year/Sem/Sec: IV-B.Tech I-Semester B-Section A.Y : 2023 -24 W.E.F. 18-07-2023 Version : 01
Class Teacher: P Lalitha Room No : 220
Time 09.00-09.50 09.50-10.40 10.40-11.30 11.30-12.20 12.20-1.00 01.00-01.50 01.50-02.40 02.40-03.30
Period 1 2 3 4 5 6 7
Monday BDA BDA & CC LAB
Tuesday BDA
LUNCH
Wednesday
Thursday
Friday BDA
Saturday
Year/Sem/Sec: IV-B.Tech I-Semester C-Section A.Y : 2023 -24 W.E.F. 18-07-2023 Version : 01
Class Teacher: S Radha Room No : 221
Time 09.00-09.50 09.50-10.40 10.40-11.30 11.30-12.20 12.20-1.00 01.00-01.50 01.50-02.40 02.40-03.30
Period 1 2 3 4 5 6 7
Monday BDA
Tuesday
LUNCH
Wednesday BDA
Thursday BDA BDA & CC LAB
Friday
Saturday
Year/Sem/Sec: IV-B.Tech I-Semester D-Section A.Y : 2023 -24 W.E.F. 18-07-2023 Version : 01
Class Teacher: T Neelima Room No : 325
Time 09.00-09.50 09.50-10.40 10.40-11.30 11.30-12.20 12.20-1.00 01.00-01.50 01.50-02.40 02.40-03.30
Period 1 2 3 4 5 6 7
Monday BDA
Tuesday
LUNCH
Wednesday BDA
Thursday
Friday BDA & CC LAB BDA
Saturday
Year/Sem/Sec: IV-B.Tech I-Semester E-Section A.Y : 2023 -24 W.E.F. 18-07-2023 Version : 01
Class Teacher: P Deeplaxmi Room No : 326
Time 09.00-09.50 09.50-10.40 10.40-11.30 11.30-12.20 12.20-1.00 01.00-01.50 01.50-02.40 02.40-03.30
Period 1 2 3 4 5 6 7
Monday BDA
Tuesday BDA BDA & CC LAB
LUNCH
Wednesday
Thursday
Friday BDA
Saturday
13. Lecture plan with methodology being used/adopted
Lesson Schedule
S.No, Date, No. of periods, Topics to be covered, Regular/Additional, Teaching aids used (LCD/OHP/BB)
Unit - I
1. Day 1 1 Course objectives and Course Outcomes, Introduction to Big Data Regular LCD
5. Day 21,22 2 Name Node, Secondary Name Node, Data Node Regular LCD
6. Day 23 1 Hadoop MapReduce paradigm Regular LCD
7. Day 24 1 Map Reduce tasks, jobs Regular LCD
8. Day 25 1 Job tracker & Task trackers Regular LCD
9. Day 26,27 2 Introduction to NOSQL Regular LCD
10. Day 28 1 Textual ETL processing Regular LCD
UNIT-IV
1. Day 29,30 2 Data analytics life cycle, Data cleaning, Data transformation Regular LCD
UNIT-V
1. Day 38,39 2 Introduction to Data visualization, Challenges to Big data visualization Regular LCD
2. Day 40,41,42 3 Types of data visualization, Visualizing Big Data, Tools used in data visualization Regular LCD
3. Day 43,44,45 3 Proprietary Data Visualization tools, Open-source data visualization tools Regular LCD
4. Day 46,47,48 3 Data visualization with Tableau Regular LCD
5. Day 49 1 Introduction to Spark Additional LCD
14. Assignment Questions
Unit 1
1. Flipkart announces its Big Billion Day sale for this festival week. What are the possible challenges that could be faced by the retailer, and how could they be handled?
2. Data has been growing at an exponential rate since the advent of computers and the internet. Categorize the various kinds of data contributing to this growth, with an example.
3. Elucidate various characteristics of big data with an example.
4. A bookstore owner wants his business to be spread out at 5 locations in India and 5 locations abroad. He wants all information to be available online. Answer the following:
i. What are the top challenges he would face initially?
ii. Is there any way of putting all the details on a web portal? Justify.
iii. What are the top three reasons he should consider leveraging big data?
5. Illustrate with relevant examples and diagrams wherever applicable to discuss the classification
of analytics.
Unit 2
1. Identify the difference between a parallel processing system and a distributed system. Discuss how they help to store and process big data.
2. List and discuss various features of cloud computing.
3. What assumptions were made in the design of HDFS? Do these assumptions make sense? Discuss why.
4. Explicate various cloud deployment models.
5. Explain the Hadoop ecosystem.
Unit 3
1. What are the two main phases of MapReduce programming? Explain through an example.
2. Illustrate the functions of the MapReduce component of Hadoop, with an explanation of its daemons.
3. How is a file stored in HDFS so as to support parallel processing as well as fault tolerance?
4. What is the role of the NameNode, Secondary NameNode, and DataNode in Hadoop?
5. What are the tasks performed by MapReduce in Hadoop?
Units 4 & 5: assignments planned as case studies.
15. Tutorial problems : NA
16. Question Bank
UNIT-1
1. List and discuss the four elements of big data.
2. As an HR manager of a company providing Big Data solutions to clients, what characteristics would you look for while recruiting a potential candidate for the position of data analyst?
3. You are planning the market strategy for a new product in your company. Identify and list some limitations of structured data related to this work.
4. Discuss in detail the different types of analytics.
UNIT-2
5. What is distributed computing? Explain the working of a distributed computing environment.
6. List the differences between parallel and distributed systems.
7. Discuss the techniques of parallel computing.
8. List the important features of Hadoop.
9. Discuss the features of cloud computing that can be used to handle big data.
10. Discuss the role cloud services play in handling big data.
11. Write a short note on In-Memory Computing technology.
12. Write a short note on the Hadoop ecosystem.
13. How does HDFS ensure data integrity in a Hadoop cluster?
UNIT-3
14. Discuss the working of MapReduce.
15. What is the role of the NameNode in an HDFS cluster?
16. List the main features of MapReduce.
17. Describe the working of the MapReduce algorithm.
18. Discuss some techniques to optimize MapReduce jobs.
19. Discuss the points you need to consider while designing a file system in MapReduce.
20. Write a short note on InputSplit.
21. Explain the types of MapReduce applications.
22. Write a short note on the FileInputFormat class.
23. Compare relational databases with NoSQL databases.
24. What is the purpose of sharding? What is the difference between replication and sharding?
25. Explain the ways in which data can be distributed.
UNIT-4
26. Explain reporting.
27. Explain the following:
a. Basic Analytics
b. Advanced Analytics
c. Operational Analytics
d. Monetized Analytics
28. Explain the working of the c() command in R.
29. Explain the working of the scan() command in R.
30. Compare the write.table() and write.csv() commands.
31. List some types of data structures available in R.
32. Name some operators used to form data subsets in R.
33. List the functions provided by R to combine different sets of data.
34. List some common characteristics of the R language.
35. Write a script to generate the following list of numerical values with spaces.
36. Write a script to concatenate text and numerical values in R.
37. Define text data analysis.
38. What are analytics point solutions?
39. Explain the analytical process.
40. What are the two main advantages of using functions over scripts?
41. Discuss the syntax of defining a function in R.
42. What do you understand by plotting in R?
43. List the techniques used to draw graphs in R.
44. Write a short note on time series plots.
UNIT-5
45. Discuss some applications of data visualization.
46. Name a few data visualization tools.
47. List and discuss various types of data visualizations.
48. Write a note on Tableau products.
49. List and discuss the icons present on the Tableau toolbar.
50. What is Tableau Server?
17. Quiz questions
UNIT I
Getting an Overview of Big Data: What is Big Data?, History of Data Management –
Evolution of Big Data, Structuring Big Data, Elements of Big Data, Big Data Analytics,
Careers in Big Data, Future of Big Data.
tapes, CDs, and various digital formats), video (DVDs and other digital formats), and merchandise across multiple channels.
When reporting sales revenue, population demographics, sentiments, reviews, and feedback
were not often reported or at least were not considered as a visible part of decision making in a
traditional computing environment. The reasons for this included the rigidity of traditional computing architectures and their associated models, which could not integrate unstructured, semi-structured, or other forms of data, even though these artifacts were used in analysis and internal organizational reporting for revenue activities from a movie. Looking at these examples in medicine and entertainment business
management, we realize that decision support has always been an aid to the decision-making
process and not the end state itself, as is often confused.
If one were to consider all the data, the associated processes, and the metrics used in any
decision making situation within any organization, we realize that we have used information
(volumes of data) in a variety of formats and varying degrees of complexity and derived decisions
with the data in non-traditional software processes. Before we get to Big Data, let us look at a few
important events in computing history. In the late 1980s, we were introduced to the concept of
decision support and data warehousing. This wave of being able to create trends, perform historical
analysis, and provide predictive analytics and highly scalable metrics created a series of solutions,
companies, and an industry in itself.
All these entities have contributed to the consumerization of data, from data creation,
acquisition, and consumption perspectives. The business models and opportunities that came with
the large-scale growth of data drove the need to create powerful metrics to tap from the knowledge
of the crowd that was driving them, and in return offer personalized services to address the need of
the moment.
Here are some examples:
● Weather data—there is a lot of weather data reported by governmental agencies around the
world, scientific organizations, and consumers like farmers. What we hear on television or radio is
an analytic key performance indicator (KPI) of temperature and forecasted conditions based on
several factors.
● Contract data—there are many types of contracts that an organization executes every year, and
there are multiple liabilities associated with each of them.
● Labor data—elastic labor brings a set of problems that organizations need to solve.
● Maintenance data—records from maintenance of facilities, machines, non-computer-related
systems, and more.
● Financial reporting data—corporate performance reports and annual filing to Wall Street.
● Compliance data—financial, healthcare, life sciences, hospitals, and many other agencies that file
compliance data for their corporations.
● Clinical trials data—pharmaceutical companies have wanted to minimize the life cycle of
processing for clinical trials data and manage the same with rules-based processing; this is an
opportunity for Big Data.
● Processing doctors’ notes on diagnosis and treatments—another key area of hidden insights and
value for disease state management and proactive diagnosis; a key machine learning opportunity.
● Contracts—every organization writes many types of contracts every year, and must process and
mine the content in the contracts along with metrics to measure the risks and penalties.
Define Big Data
The term “big data” refers to data that is so large, fast or complex that it's difficult or
impossible to process using traditional methods. The act of accessing and storing large amounts of
information for analytics has been around a long time.
The History of Big Data
Although the concept of big data itself is relatively new, the origins of large data sets go back to
the 1960s and '70s when the world of data was just getting started with the first data centers and
the development of the relational database.
Around 2005, people began to realize just how much data users generated through Facebook,
YouTube, and other online services. Hadoop (an open-source framework created specifically to
store and analyze big data sets) was developed that same year. NoSQL also began to gain
popularity during this time.
The development of open-source frameworks, such as Hadoop (and more recently, Spark) was
essential for the growth of big data because they make big data easier to work with and cheaper to
store. In the years since then, the volume of big data has skyrocketed. Users are still generating
huge amounts of data—but it’s not just humans who are doing it.
With the advent of the Internet of Things (IoT), more objects and devices are connected to the
internet, gathering data on customer usage patterns and product performance. The emergence of
machine learning has produced still more data.
While big data has come far, its usefulness is only just beginning. Cloud computing has expanded
big data possibilities even further. The cloud offers truly elastic scalability, where developers can
simply spin up ad hoc clusters to test a subset of data.
media platforms, business processes, machines, networks, human interactions, and so on. Such a large amount of data is stored in data warehouses. This concludes the characteristics of big data.
Classification of analytics:
Descriptive analytics: Descriptive analytics is a statistical method that is used to search and
summarize historical data to identify patterns or meaning.
Predictive analytics: Predictive Analytics is a statistical method that utilizes algorithms and
machine learning to identify trends in data and predict future behaviours.
Prescriptive analytics: Prescriptive analytics is a statistical method used to generate
recommendations and make decisions based on the computational findings of algorithmic models.
Advantages of Big Data
1. Cost optimization
One of the major advantages of Big Data technologies is that they reduce the cost of storing,
processing, and analysing enormous volumes of data for enterprises. Not only that, but Big Data
technologies may help find cost-effective and efficient company practices. The logistics business
serves as a good illustration of Big Data's cost-cutting potential. In most cases, the cost of goods
returned is 1.5 times the cost of delivery.
2. Helps in Understanding the Market Conditions
By examining Big data, it is possible to have a better knowledge of current market situations. Let's
take an example: a corporation can determine the most popular goods by studying a customer's
purchase behaviour. It aids in the analysis of trends and client desires. A company can use this to
get an advantage over its competition.
3. Improve efficiency
Big Data techniques can dramatically enhance operational efficiency. Big Data technologies may
collect vast volumes of usable customer data by connecting with customers/clients and getting their
important input.
4. Better Decision Making
The main benefit of using Big Data Analytics is that it has boosted the decision-making process to a
great extent. Rather than making decisions blindly, companies now consider Big Data Analytics before finalizing any decision.
5. Improving Customer Service and Customer Experience
Big data, machine learning (ML), and artificial intelligence (AI)-powered technical support and
helpline services may considerably increase the quality of response and follow-up that firms can
provide to their customers.
6. Fraud and Anomaly Detection
It's just as vital to know what's going wrong in businesses like financial services or healthcare as it
is to know what's going right. With big data, AI and machine learning algorithms can quickly
discover erroneous transactions, fraudulent activity signs, and abnormalities in data sets that might
indicate a variety of current or prospective problems.
7. Focused And Targeted Campaigns
Big data may be used by businesses to offer customized products to their target market, instead of wasting money on ineffective advertising strategies. Big data enables businesses to do in-depth analyses of
customer behavior. Monitoring online purchases and watching point-of-sale transactions are
common parts of this investigation.
8. Innovative Products
Big data continues to assist businesses in both updating existing goods and developing new ones.
Companies can discern what matches their consumer base by gathering enormous volumes of
data.
9. Agile supply chain management
This is surprising, because we usually don't notice our supply networks until they've been severely
disrupted. Big data, which includes predictive analytics and is typically done in near real-time, aids
in keeping our worldwide network of demand, production, and distribution running smoothly.
10. Improved operations
Big data analytics may be used to enhance a variety of business activities, but one of the most
exciting and gratifying has been using big data analytics to improve physical operations. For
example, using big data and data science to create predictive maintenance plans might help
important systems avoid costly repairs and downtime. Start by looking at the age, condition,
location, warranty, and servicing information.
Big data analytics applications are not restricted to one field only. The following are the applications of big data analytics:
1. Banking and Securities
2. Communications, Media and Entertainment
3. Healthcare Providers
4. Education
5. Manufacturing and Natural Resources
6. Government
7. Insurance
8. Retail and Wholesale trade
9. Transportation
10. Energy and Utilities
11. Big Data & Auto Driving Car
12. Big Data in IoT
13. Big Data in Marketing
14. Big Data in Business Insights
Career in big data analytics:
1. Soaring Demand for Analytics Professionals
2. Huge Job Opportunities & Meeting the Skill Gap
3. Salary Aspects
4. Big Data Analytics: A Top Priority in a lot of Organizations
5. Adoption of Big Data Analytics is Growing
6. Analytics: A Key Factor in Decision Making
7. The Rise of Unstructured and Semi structured Data Analytics
8. Big Data Analytics is Used Everywhere
9. Surpassing Market Forecast / Predictions for Big Data Analytics
Future of Big Data :
As the amount of big data continues to grow, so does the need for trained data analysts to dive in
and extract insights. Big data analytics provides exciting opportunities for creating change in
industries such as finance, government, and healthcare. Data analysts can help change lives by
preventing fraud, allocating resources in a natural disaster, or improving the quality of healthcare.
Big Data examples
1. Fraud detection
For businesses whose operations involve any type of claims or transaction processing, fraud
detection is one of the most compelling Big Data application examples. Historically, fraud
detection on the fly has proven an elusive goal. In most cases, fraud is discovered long after the
fact, at which point the damage has been done and all that's left is to minimize the harm and adjust
policies to prevent it from happening again. Big Data platforms that can analyze claims and
transactions in real time, identifying large-scale patterns across many transactions or detecting
anomalous behavior from an individual user, can change the fraud detection game.
2. IT log analytics
IT solutions and IT departments generate an enormous quantity of logs and trace data. In the
absence of a Big Data solution, much of this data must go unexamined: organizations simply don't have the manpower or resources to churn through all that information by hand, let alone in real time.
With a Big Data solution in place, however, those logs and trace data can be put to good use.
Within this list of Big Data application examples, IT log analytics is the most broadly applicable.
Any organization with a large IT department will benefit from the ability to quickly identify large-
scale patterns to help in diagnosing and preventing problems. Similarly, any organization with a
large IT department will appreciate the ability to identify incremental performance optimization
opportunities.
3. Call center analytics
Now we turn to the customer-facing Big Data application examples, of which call center analytics is particularly powerful. What's going on in a customer's call center is often a great
barometer and influencer of market sentiment, but without a Big Data solution, much of the insight
that a call center can provide will be overlooked or discovered too late. Big Data solutions can help
identify recurring problems or customer and staff behavior patterns on the fly not only by making
sense of time/quality resolution metrics, but also by capturing and processing call content itself.
4. Social media analysis
A Big Data solution built to harvest and analyze social media activity, like IBM's Cognos
Consumer Insights, a point solution running on IBM's BigInsights Big Data platform, can make
sense of the chatter. Social media can provide real-time insights into how the market is responding
to products and campaigns. With those insights, companies can adjust their pricing, promotion, and
campaign placement on the fly for optimal results.
UNIT II
Introducing Technologies for Handling Big Data: Distributed and Parallel Computing for Big
Data, Introducing Hadoop, Cloud Computing and Big Data, In‐Memory Computing
Technology for Big Data, Understanding Hadoop Ecosystem.
Parallel Computing: Also improves the processing capability of a computer system by adding additional computational resources to it. Complex computations are divided into subtasks, which are handled individually by processing units running in parallel. The underlying concept is that processing capability increases with the level of parallelism.
PARALLEL COMPUTING TECHNIQUES:
1) Cluster or Grid Computing
• primarily used in Hadoop.
• based on a connection of multiple servers in a network (clusters)
• servers share the workload among them.
• overall cost may be very high.
2) Massively Parallel Processing (MPP)
• used in data warehouses. A single machine working as a grid is used in the MPP platform.
• capable of handling storage, memory and computing activities.
• software written specifically for the MPP platform is used for optimization.
• MPP platforms such as EMC Greenplum and ParAccel are suited for high-value use cases.
3) High Performance Computing (HPC): Offers high performance and scalability by using IMC (in-memory computing).
Suitable for processing floating-point data at high speeds.
Used in research and business organizations where the result is more valuable than the cost, or where the strategic importance of the project is of high priority.
HADOOP
Hadoop is a distributed system, like a distributed database.
Hadoop is a 'software library' that allows its users to process large datasets across distributed clusters of computers, thereby enabling them to gather, store and analyze huge sets of data.
It provides various tools and technologies, collectively termed the Hadoop Ecosystem.
A Hadoop cluster consists of a single Master Node and multiple Worker Nodes.
FEATURES OF CLOUD COMPUTING:
• Scalability – addition of new resources to an existing infrastructure. An increase in the amount of data requires an organization to improve its h/w components. The new h/w may not provide complete support to the s/w that used to run properly on the earlier set of h/w. The solution to this problem is using cloud services, which employ the distributed computing technique to provide scalability.
• Elasticity – hiring certain resources as and when required, and paying only for those resources. No extra payment is required for acquiring specific cloud services. A cloud does not require customers to declare their resource requirements in advance.
• Resource Pooling – multiple organizations, which use similar kinds of resources to carry out computing practices, have no need to individually hire all the resources.
• Self Service – cloud computing involves a simple user interface that helps customers to directly access the cloud services they want.
• Low Cost – the cloud offers customized solutions, especially to organizations that cannot afford too much initial investment. The cloud provides a pay-as-you-use option, in which organizations need to sign up only for those resources that are essential.
• Fault Tolerance – offering uninterrupted services to customers.
CLOUD DEPLOYMENT MODELS:
Depending upon the architecture used in forming the n/w, services and applications used,
and the target consumers, cloud services form various deployment models. They are,
Public Cloud
Private Cloud
Community Cloud
Hybrid Cloud
• Public Cloud (End-User Level Cloud)
- Owned and managed by a company other than the one using it.
- Administered by a third party.
- E.g.: Verizon, Amazon Web Services, and Rackspace.
- As the workload is categorized by service category, h/w customization is possible to provide optimized performance.
- The process of computing becomes very flexible and scalable through customized h/w resources.
- The primary concerns with a public cloud include security and latency.
• Private Cloud (Enterprise Level Cloud)
- Remains entirely in the ownership of the organization using it.
- Infrastructure is solely designed for a single organization.
- Can automate several processes and operations that require manual handling in a public cloud.
- Can also provide firewall protection to the cloud, solving latency and security concerns.
- A private cloud can be either on-premises or hosted externally. On-premises: the service is exclusively used and hosted by a single organization. Hosted externally: the cloud is used by a single organization and is not shared with other organizations.
• Community Cloud
-Type of cloud that is shared among various organizations with a common tie.
-Managed by third party cloud services.
-Available on or off premises.
E.g.: In any state, a community cloud can be provided so that almost all government organizations of that state can share the resources available on the cloud. Because of this sharing of resources on the community cloud, the data of all citizens of that state can be easily managed by the government organizations.
• Hybrid Cloud
- Various internal or external service providers offer services to many organizations.
- In hybrid clouds, an organization can use both types of cloud, i.e. public and private together, in situations such as cloud bursting.
- In cloud bursting, an organization normally uses its own computing infrastructure but accesses the cloud when there is a high load requirement. The organization using the hybrid cloud can manage an internal private cloud for general use and migrate the entire application, or part of it, to the public cloud during peak periods.
CLOUD SERVICES FOR BIG DATA
In big data, IaaS, PaaS, and SaaS clouds are used in the following manner:
• IaaS: The huge storage and computational power requirements of big data are fulfilled by the limitless storage space and computing ability provided by an IaaS cloud.
• PaaS: PaaS offerings of various vendors have started adding popular big data platforms such as MapReduce and Hadoop. These offerings save organizations from a lot of hassles that occur in managing individual hardware components and software applications.
• SaaS: Various organizations need to identify and analyze the voice of customers, particularly on social media. Social media data and platforms are provided by SaaS vendors. In addition, a private cloud facilitates access to enterprise data, which enables these analyses.
IN MEMORY COMPUTING TECHNOLOGY
Another way to improve the speed and processing power of data handling.
• IMC is used to facilitate high-speed data processing, e.g. IMC can help in tracking and monitoring consumers' activities and behaviours, which allows organizations to take timely actions to improve customer service and hence customer satisfaction.
• Traditionally, data was stored on external devices known as secondary storage. This data had to be accessed from an external source.
• In IMC technology, the RAM or primary storage space is used for analyzing data. RAM helps to increase computing speed.
• Also, the reduction in the cost of primary memory has made it feasible to store data in primary memory.
UNDERSTANDING HADOOP ECOSYSTEM
Hadoop HDFS - 2007 - A distributed file system for reliably storing huge amounts of unstructured, semi-structured and structured data in the form of files.
Hadoop MapReduce - 2007 - A distributed algorithm framework for the parallel processing of large
datasets on HDFS filesystem. It runs on Hadoop cluster but also supports other database formats
like Cassandra and HBase.
Cassandra - 2008 - A key-value pair NoSQL database, with column family data representation and
asynchronous masterless replication.
HBase - 2008 - A key-value pair NoSQL database, with column family data representation, with
master-slave replication. It uses HDFS as underlying storage.
Zookeeper - 2008 - A distributed coordination service for distributed applications. It is based on
Paxos algorithm variant called Zab.
Pig - 2009 - Pig is a scripting interface over MapReduce for developers who prefer scripting
interface over native Java MapReduce programming.
Hive - 2009 - Hive is a SQL interface over MapReduce for developers and analysts who prefer SQL
interface over native Java MapReduce programming.
Mahout - 2009 - A library of machine learning algorithms, implemented on top of MapReduce, for
finding meaningful patterns in HDFS datasets.
Sqoop - 2010 - A tool to import data from RDBMS/DataWarehouse into HDFS/HBase and export
back.
YARN - 2011 - A system to schedule applications and services on an HDFS cluster and manage the
cluster resources like memory and CPU.
Flume - 2011 - A tool to collect, aggregate, reliably move and ingest large amounts of data into
HDFS.
Storm - 2011 - A system to process high-velocity streaming data with 'at least once' message
semantics.
Spark - 2012 - An in-memory data processing engine that can run a DAG of operations. It provides
libraries for Machine Learning, SQL interface and near real-time Stream Processing.
Kafka - 2012 - A distributed messaging system with partitioned topics for very high scalability.
Solr Cloud - 2012 - A distributed search engine with a REST-like interface for full-text search. It
uses Lucene library for data indexing.
Hadoop Environment
Hadoop is changing the perception of handling Big Data, especially unstructured data. Let us see how the Apache Hadoop software library, which is a framework, plays a vital role in handling Big Data. Apache Hadoop enables surplus data to be streamlined for any distributed processing system across clusters of computers using simple programming models. It is designed to scale up from single servers to many machines, each offering local computation and storage. Instead of depending on hardware to provide high availability, the library itself is built to detect and handle breakdowns at the application layer, thereby providing a highly available service on top of a cluster of computers, each of which may be vulnerable to failures.
Hadoop Community Package Consists of
• File system and OS level abstractions
• A MapReduce engine (either MapReduce or YARN)
• The Hadoop Distributed File System (HDFS)
• Java ARchive (JAR) files
• Scripts needed to start Hadoop
• Source code, documentation and a contribution section
UNIT III
Big Data processing:
Big Data technologies, Introduction to Google file system, Hadoop Architecture, Hadoop Storage:
HDFS, Common Hadoop Shell commands, NameNode, Secondary NameNode, and DataNode,
Hadoop MapReduce paradigm, Map Reduce tasks, Job, Task trackers, Introduction to NOSQL,
Textual ETL processing.
The Google file system (GFS) is a distributed file system (DFS) for data-centric applications with
robustness, scalability, and reliability. GFS can be implemented in commodity servers to support
large-scale file applications with high performance and high reliability.
Google File System (GFS) is a scalable distributed file system (DFS) created by Google Inc. and
developed to accommodate Google's expanding data processing requirements. GFS provides fault
tolerance, reliability, scalability, availability, and performance to large networks and connected
nodes.
GFS is enhanced for Google's core data storage and usage needs (primarily the search engine),
which can generate enormous amounts of data that must be retained; Google File System grew out
of an earlier Google effort, "BigFiles", developed by Larry Page and Sergey Brin in the early days
of Google, while it was still located in Stanford. Files are divided into fixed-size chunks of
64 megabytes, similar to clusters or sectors in regular file systems, which are only extremely rarely
overwritten, or shrunk; files are usually appended to or read.
A GFS cluster consists of multiple nodes. These nodes are divided into two types: one Master node
and multiple Chunk servers. Each file is divided into fixed-size chunks. Chunk servers store these
chunks. Each chunk is assigned a globally unique 64-bit label by the master node at the time of
creation, and logical mappings of files to constituent chunks are maintained. Each chunk is
replicated several times throughout the network. By default, a chunk is replicated three times, but this is configurable. Files which are in high demand may have a higher replication factor, while files for which the application client uses strict storage optimizations may be replicated fewer than three times, in order to cope with quick garbage-cleaning policies.
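As a brief worked illustration of this chunking scheme (an assumed example, not taken from the text): a 1 GB file stored in GFS would be split into 1024 MB / 64 MB = 16 chunks; with the default replication factor of 3, the cluster would hold 16 x 3 = 48 chunk replicas, i.e. roughly 3 GB of raw storage for 1 GB of logical data, in exchange for fault tolerance.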
HDFS architecture
HDFS can be presented as a master/slave architecture. The HDFS master is named the NameNode, whereas the slave is the DataNode. The NameNode is a server that manages the filesystem namespace and regulates access (open, close, rename, and more) to files by the client. It divides the input data into blocks and announces which data block will be stored in which DataNode. A DataNode is a slave machine that stores the replicas of the partitioned dataset and serves the data as requests come. It also performs block creation and deletion.
The internal mechanism of HDFS divides a file into one or more blocks; these blocks are stored in a set of DataNodes. Under normal circumstances, with a replication factor of three, the HDFS strategy is to place the first copy on the local node, the second copy on a different node in the local rack, and the third copy on a different node in a different rack. As HDFS is designed to support large files, the HDFS block size is defined as 64 MB. If required, this can be increased.
Understanding HDFS components
HDFS is managed with a master-slave architecture that includes the following components:
• Name Node: This is the master of the HDFS system. It maintains the directories, files, and
manages the blocks that are present on the Data Nodes.
• Data Node: These are slaves that are deployed on each machine and provide actual storage. They
are responsible for serving read-and-write data requests for the clients.
• Secondary Name Node: This is responsible for performing periodic checkpoints. So, if the Name
Node fails at any time, it can be replaced with a snapshot image stored by the secondary Name
Node checkpoints.
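To make this division of responsibilities concrete, here is a minimal sketch of how a client program might interact with HDFS through the standard org.apache.hadoop.fs.FileSystem Java API: the client only names a path, the NameNode resolves it to blocks, and the DataNodes serve the actual bytes. The NameNode address (hdfs://namenode-host:9000) and the file path used below are illustrative assumptions, not values from this course file.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in a real cluster this normally comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file: the NameNode records the metadata and decides block
        // placement, while the DataNodes store the actual block replicas.
        Path file = new Path("/user/student/sample.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hdfs\n");
        }

        // Read the file back: the client streams block data directly from the DataNodes.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(reader.readLine());
        }
        fs.close();
    }
}

The shell commands listed later in this unit (ls, mkdir, copyFromLocal, and so on) perform the same kinds of operations from the command line.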
MapReduce architecture
MapReduce is also implemented over master-slave architectures. Classic MapReduce contains job
submission, job initialization, task assignment, task execution, progress and status update, and job
completion-related activities, which are mainly managed by the Job Tracker node and executed by
Task Tracker. The client application submits a job to the Job Tracker. Then the input is divided across the cluster. The Job Tracker then calculates the number of map and reduce tasks to be processed. It commands the Task Trackers to start executing the job. Now, each Task Tracker copies the resources to its local machine and launches a JVM to run the map or reduce program over the data. Along with this, the Task Tracker periodically sends updates to the Job Tracker; these can be considered as heartbeats that help to update the job ID, job status, and usage of resources.
Hadoop is a top-level Apache project and is a very complicated Java framework. To avoid technical complications, the Hadoop community has developed a number of Java frameworks that add extra value to Hadoop's features. They are considered Hadoop subprojects. Here, we will discuss several Hadoop components that can be considered as abstractions over HDFS or MapReduce.
Hadoop MapReduce fundamentals
Basically, the MapReduce model can be implemented in several languages, but apart from that, Hadoop MapReduce is a popular Java framework for easily writing applications. It processes
vast amounts of data (multiterabyte datasets) in parallel on large clusters (thousands of nodes) of
commodity hardware in a reliable and fault tolerant manner. This MapReduce paradigm is divided
into two phases, Map and Reduce, that mainly deal with key-value pairs of data. The Map and
Reduce tasks run sequentially in a cluster, and the output of the Map phase becomes the input of
the Reduce phase.
All data input elements in MapReduce are immutable, meaning they cannot be updated. If the input (key, value) pairs for mapping tasks are changed, the changes will not be reflected in the input files. The Mapper output will be piped to the appropriate Reducer, grouped by the key attribute, as input. This sequential data process will be carried out in a parallel manner with the help of Hadoop MapReduce algorithms as well as Hadoop clusters.
MapReduce programs transform the input dataset present in the list format into output data
that will also be in the list format. This logical list translation process is mostly repeated twice in
the Map and Reduce phases. We can also handle these repetitions by fixing the number of Mappers
and Reducers.
The following are the components of Hadoop that are responsible for performing analytics over Big
Data:
• Client: This initializes the job
• JobTracker: This monitors the job
• TaskTracker: This executes the job
• HDFS: This stores the input and output data
The four main stages of Hadoop MapReduce data processing are as follows:
• The loading of data into HDFS
• The execution of the Map phase
• Shuffling and sorting
• The execution of the Reduce phase
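As a rough illustration of these four stages (based on the electrical-consumption example discussed later in this unit, with simplified values), a few input records flow through the pipeline as follows:
• Loading: input lines such as "1979 ... 25" and "1981 ... 34" are split into HDFS blocks and distributed across DataNodes.
• Map: each line is parsed and a (year, annual average) pair is emitted, e.g. (1979, 25) and (1981, 34).
• Shuffle and sort: the pairs are sorted and grouped by key, giving (1979, [25]) and (1981, [34]).
• Reduce: the custom logic (in the later example, keep only the years whose value exceeds 30) emits (1981, 34), discards (1979, 25), and writes the result back to HDFS.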
Loading data into HDFS
The input dataset needs to be uploaded to the Hadoop directory so it can be used by MapReduce
nodes. Then, Hadoop Distributed File System (HDFS) will divide the input dataset into data splits
and store them to Data Nodes in a cluster by taking care of the replication factor for fault tolerance.
All the data splits will be processed by Task Tracker for the Map and Reduce tasks in a parallel
manner.
The list of (key, value) pairs is generated such that the key attribute will be repeated many times.
So, its key attribute will be re-used in the Reducer for aggregating values in MapReduce. As far as
format is concerned, Mapper output format values and Reducer input values must be the same.
After the completion of this Map operation, the Task Tracker will keep the result in its buffer
storage and local disk space (if the output data size is more than the threshold). For example,
suppose we have a Map function that converts the input text into lowercase. This will convert the
list of input strings into a list of lowercase strings.
Reducing phase execution
As soon as the Mapper output is available, the Task Tracker in the Reducer node will retrieve the available partitioned Map output data, which will be grouped together and merged into one large file. This file will then be assigned to a process with a Reducer method, and it will be sorted before the data is provided to the Reducer method. The Reducer method receives a list of input values from an input (key, list(value)), aggregates them based on custom logic, and produces output (key, value) pairs.
The output of the Reducer method of the Reduce phase will directly be written into HDFS as per
the format specified by the MapReduce job configuration class.
Hadoop MapReduce fundamentals:
• Understand MapReduce objects
• Learn how to decide the number of Maps in MapReduce
• Learn how to decide the number of Reduces in MapReduce
• Understand MapReduce dataflow
• Take a closer look at Hadoop MapReduce terminologies
MapReduce objects
MapReduce operations in Hadoop are carried out mainly by three objects: Mapper, Reducer, and
Driver.
• Mapper:
This is designed for the Map phase of MapReduce, which starts MapReduce operations by
carrying input files and splitting them into several pieces. For each piece, it will emit a key-value
data pair as the output value.
• Reducer:
This is designed for the Reduce phase of a MapReduce job; it accepts key-based grouped
data from the Mapper output, reduces it by aggregation logic, and emits the (key, value) pair for the
group of values.
• Driver:
This is the main file that drives the MapReduce process. It starts the execution of
MapReduce tasks after getting a request from the client application with parameters. The Driver file
is responsible for building the configuration of a job and submitting it to the Hadoop cluster. The
Driver code will contain the main() method that accepts arguments from the command line. The
input and output directory of the Hadoop MapReduce job will be accepted by this program. Driver
is the main file for defining job configuration details, such as the job name, job input format, job
output format, and the Mapper, Combiner, Partitioner, and Reducer classes. MapReduce is
initialized by calling this main() function of the Driver class.
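As an illustration of how the three objects fit together, below is a minimal word-count sketch written against the newer org.apache.hadoop.mapreduce API (the example program given later in this unit uses the older org.apache.hadoop.mapred API instead). The class names and the command-line input/output paths are illustrative assumptions, not part of the prescribed coursework.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSketch {

    // Mapper: emits (word, 1) for every token in each input line.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts that were grouped by word during shuffle and sort.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: builds the job configuration and submits it to the cluster.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count sketch");
        job.setJarByClass(WordCountSketch.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The Driver here plays exactly the role described above: it names the Mapper, Combiner, and Reducer classes, fixes the output key/value types, and submits the job to the Hadoop cluster.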
MapReduce dataflow
Now that we have seen the components that make a basic MapReduce job possible, we will see how everything works together at a higher level. The following diagram shows the MapReduce dataflow with multiple nodes in a Hadoop cluster:
• Data transformation
• Data integration
Common Hadoop Shell commands
ls: This command is used to list all the files. Use lsr for recursive approach. It is useful when we
want a hierarchy of a folder.
bin/hdfs dfs -ls <path>
mkdir:
To create a directory. In Hadoop dfs there is no home directory by default. So let’s first create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>
Creating the home directory:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/username    (write the username of your computer)
copyFromLocal (or) put: To copy files/folders from local file system to hdfs store. This is the
most important command. Local filesystem means the files present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
copyToLocal (or) get: To copy files/folders from hdfs store to local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
moveFromLocal: This command will move file from local to hdfs.
Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
cp: This command is used to copy files within hdfs. Let's copy the folder geeks to geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
mv: This command is used to move files within hdfs. Let's cut-paste a file myfile.txt from the geeks folder to geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
rmr: This command deletes a file from HDFS recursively. It is a very useful command when you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
du: It will give the size of each file in the directory.
Syntax:
bin/hdfs dfs -du <dirName>
Name Node
NameNode:
This is the master of the HDFS system. It maintains the directories, files, and manages the blocks
that are present on the DataNodes.
• DataNode: These are slaves that are deployed on each machine and provide actual storage. They
are responsible for serving read-and-write data requests for the clients.
• Secondary NameNode: This is responsible for performing periodic checkpoints. So, if the
NameNode fails at any time, it can be replaced with a snapshot image stored by the secondary
NameNode checkpoints.
down into tuples (key/value pairs). Secondly, reduce task, which takes the output from a map as an
input and combines those data tuples into a smaller set of tuples. As the sequence of the name
MapReduce implies, the reduce task is always performed after the map job.
The Algorithm
Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
Map stage − The map or mapper’s job is to process the input data. Generally the input data is in the
form of file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to
the mapper function line by line. The mapper processes the data and creates several small chunks of
data.
Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The
Reducer’s job is to process the data that comes from the mapper. After processing, it produces a
new set of output, which will be stored in the HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the
cluster.
The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks that reduces the network
traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an appropriate
result, and sends it back to the Hadoop server.
Input and Output: in general, the Map function takes <k1, v1> pairs and produces a list of <k2, v2> pairs, while the Reduce function takes <k2, list(v2)> and produces a list of <k3, v3> pairs.
Terminology
PayLoad − Applications implement the Map and the Reduce functions, and form the core of the
job.
Mapper − Mapper maps the input key/value pairs to a set of intermediate key/value pair.
NamedNode − Node that manages the Hadoop Distributed File System (HDFS).
DataNode − Node where data is presented in advance before any processing takes place.
MasterNode − Node where JobTracker runs and which accepts job requests from clients.
SlaveNode − Node where Map and Reduce program runs.
JobTracker − Schedules jobs and tracks the assigned jobs to the Task Tracker.
Task Tracker − Tracks the task and reports status to JobTracker.
Job − A program is an execution of a Mapper and Reducer across a dataset.
Task − An execution of a Mapper or a Reducer on a slice of data.
Task Attempt − A particular instance of an attempt to execute a task on a SlaveNode.
Example Scenario
Given below is the data regarding the electrical consumption of an organization. It contains the
monthly electrical consumption and the annual average for various years.
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Avg
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
If the above data is given as input, we have to write applications to process it and produce
results such as finding the year of maximum usage, the year of minimum usage, and so on. This is a
walkover for programmers with a finite number of records: they will simply write the logic to
produce the required output and pass the data to the application.
But think of the data representing the electrical consumption of all the large-scale industries of a
particular state since its formation. When we write applications to process such bulk data, they will
take a lot of time to execute, and there will be heavy network traffic when we move data from the
source to the network server. To solve these problems, we have the MapReduce framework.
Input Data
The above data is saved as sample.txt and given as input. The input file looks as shown below.
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
Example Program
Given below is the program to process the sample data using the MapReduce framework.
package hadoop;

import java.util.*;
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class ProcessUnits {

   // Mapper class
   public static class E_EMapper extends MapReduceBase implements
      Mapper<LongWritable,   /* Input key type    */
             Text,           /* Input value type  */
             Text,           /* Output key type   */
             IntWritable> {  /* Output value type */

      // Map function: emits (year, annual average) for each input line
      public void map(LongWritable key, Text value,
         OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
         String line = value.toString();
         String lasttoken = null;
         StringTokenizer s = new StringTokenizer(line, "\t");
         String year = s.nextToken();
         while (s.hasMoreTokens()) {
            lasttoken = s.nextToken();
         }
         int avgprice = Integer.parseInt(lasttoken);
         output.collect(new Text(year), new IntWritable(avgprice));
      }
   }

   // Reducer class
   public static class E_EReduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

      // Reduce function: keeps only the years whose value exceeds 30
      public void reduce(Text key, Iterator<IntWritable> values,
         OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
         int maxavg = 30;
         int val = Integer.MIN_VALUE;
         while (values.hasNext()) {
            if ((val = values.next().get()) > maxavg) {
               output.collect(key, new IntWritable(val));
            }
         }
      }
   }

   // Main function
   public static void main(String args[]) throws Exception {
      JobConf conf = new JobConf(ProcessUnits.class);
      conf.setJobName("max_electricityunits");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(E_EMapper.class);
      conf.setCombinerClass(E_EReduce.class);
      conf.setReducerClass(E_EReduce.class);
      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
   }
}
Save the above program as ProcessUnits.java. The compilation and execution of the program are
explained below.
Compilation and Execution of Process Units Program
Let us assume we are in the home directory of a Hadoop user (e.g. /home/hadoop).
Follow the steps given below to compile and execute the above program.
Step 1
The following command is to create a directory to store the compiled java classes.
$ mkdir units
Step 2
Download Hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce program.
Visit the following link mvnrepository.com to download the jar. Let us assume the downloaded
folder is /home/hadoop/.
Step 3
The following commands are used for compiling the ProcessUnits.java program and creating a jar
for the program.
$ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java
$ jar -cvf units.jar -C units/ .
Step 4
The following command is used to create an input directory in HDFS.
$HADOOP_HOME/bin/hadoop fs -mkdir input_dir
Step 5
The following command is used to copy the input file named sample.txt into the input directory of
HDFS.
$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir
Step 6
The following command is used to verify the files in the input directory.
$HADOOP_HOME/bin/hadoop fs -ls input_dir/
Step 7
The following command is used to run the ProcessUnits application by taking the input files from
the input directory.
$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir
Wait for a while until the job has finished executing. After execution, as shown below, the output will
contain the number of input splits, the number of Map tasks, the number of Reducer tasks, etc.
INFO mapreduce.Job: Job job_1414748220717_0002
completed successfully
INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read = 61
FILE: Number of bytes written = 279400
FILE: Number of read operations = 0
FILE: Number of large read operations = 0
FILE: Number of write operations = 0
HDFS: Number of bytes read = 546
HDFS: Number of bytes written = 40
HDFS: Number of read operations = 9
HDFS: Number of large read operations = 0
HDFS: Number of write operations = 2
Job Counters
Launched map tasks = 2
Launched reduce tasks = 1
Data-local map tasks = 2
Total time spent by all maps in occupied slots (ms) = 146137
Total time spent by all reduces in occupied slots (ms) = 441
Total time spent by all map tasks (ms) = 14613
Total time spent by all reduce tasks (ms) = 44120
Total vcore-seconds taken by all map tasks = 146137
Total vcore-seconds taken by all reduce tasks = 44120
Total megabyte-seconds taken by all map tasks = 149644288
Total megabyte-seconds taken by all reduce tasks = 45178880
Map-Reduce Framework
Map input records = 5
Map output records = 5
Map output bytes = 45
Map output materialized bytes = 67
Input split bytes = 208
Combine input records = 5
Combine output records = 5
Reduce input groups = 5
Reduce shuffle bytes = 6
Reduce input records = 5
Reduce output records = 5
Spilled Records = 10
Shuffled Maps = 2
Failed Shuffles = 0
Merged Map outputs = 2
GC time elapsed (ms) = 948
CPU time spent (ms) = 5160
Physical memory (bytes) snapshot = 47749120
Virtual memory (bytes) snapshot = 2899349504
Total committed heap usage (bytes) = 277684224
Step 8
The following command is used to verify the resultant files in the output folder.
$HADOOP_HOME/bin/hadoop fs -ls output_dir/
Step 9
The following command is used to see the output in the part-00000 file. This file is generated by
HDFS.
$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000
Below is the output generated by the MapReduce program.
1981 34
1984 40
1985 45
Step 10
The following command is used to copy the output folder from HDFS to the local file system for
analysis.
$HADOOP_HOME/bin/hadoop fs -get output_dir /home/hadoop
Important Commands
All Hadoop commands are invoked by the $HADOOP_HOME/bin/hadoop command. Running
the Hadoop script without any arguments prints the description for all commands.
Usage − hadoop [--config confdir] COMMAND
The following list shows the available options and their descriptions.
1 namenode -format
Formats the DFS filesystem.
2 secondarynamenode
Runs the DFS secondary namenode.
3 namenode
Runs the DFS namenode.
4 datanode
Runs a DFS datanode.
5 dfsadmin
Runs a DFS admin client.
6 mradmin
Runs a Map-Reduce admin client.
7 fsck
Runs a DFS filesystem checking utility.
8 fs
Runs a generic filesystem user client.
9 balancer
Runs a cluster balancing utility.
10 oiv
Applies the offline fsimage viewer to an fsimage.
11 fetchdt
Fetches a delegation token from the NameNode.
12 jobtracker
Runs the MapReduce job Tracker node.
13 pipes
Runs a Pipes job.
14 tasktracker
Runs a MapReduce task Tracker node.
15 historyserver
Runs job history servers as a standalone daemon.
16 job
Manipulates the MapReduce jobs.
17 queue
Gets information regarding JobQueues.
18 version
Prints the version.
19 jar <jar>
Runs a jar file.
23 classpath
Prints the class path needed to get the Hadoop jar and the required libraries.
24 daemonlog
Get/Set the log level for each daemon
The following options are available with the hadoop job command:
1 -submit <job-file>
Submits the job.
2 -status <job-id>
Prints the map and reduce completion percentage and all job counters.
4 -kill <job-id>
Kills the job.
7 -list[all]
Displays all jobs. -list displays only jobs which are yet to complete.
8 -kill-task <task-id>
Kills the task. Killed tasks are NOT counted against failed attempts.
9 -fail-task <task-id>
Fails the task. Failed tasks are counted against failed attempts.
Introduction to NoSQL
NoSQL, originally referring to "non-SQL" or "non-relational", is a database that provides a
mechanism for the storage and retrieval of data. This data is modeled in means other than the tabular
relations used in relational databases. Such databases came into existence in the late 1960s, but did
not obtain the NoSQL moniker until a surge of popularity in the early twenty-first century.
The suitability of a given NoSQL database depends on the problem it is meant to solve. The data
structures used by NoSQL databases are sometimes also viewed as more flexible than relational
database tables. Most NoSQL databases offer a concept of eventual consistency, in which database
changes are propagated to all nodes eventually, so queries for data might not return updated data
immediately or might return data that is not accurate, a problem known as stale reads.
Advantages of NoSQL:
There are many advantages of working with NoSQL databases such as MongoDB and Cassandra.
The main advantages are high scalability and high availability.
High scalability –
NoSQL databases use sharding for horizontal scaling. Sharding means partitioning the data and placing it on
multiple machines in such a way that the order of the data is preserved. Vertical scaling
means adding more resources to the existing machine, whereas horizontal scaling means adding
more machines to handle the data. Vertical scaling is not that easy to implement, but horizontal scaling
is easy to implement. Examples of horizontally scaling databases are MongoDB, Cassandra, etc. NoSQL can
handle a huge amount of data because of this scalability; as the data grows, NoSQL scales itself to handle
that data in an efficient manner.
High availability –
Auto replication feature in NoSQL databases makes it highly available because in case of
any failure data replicates itself to the previous consistent state.
Disadvantages of NoSQL:
NoSQL has the following disadvantages.
Narrow focus –
NoSQL databases have a very narrow focus: they are mainly designed for storage and provide very
little functionality beyond it. Relational databases are a better choice in the field of transaction
management than NoSQL.
Open-source –
NoSQL databases are open-source, and there is no reliable standard for NoSQL yet. In other words, two
database systems are likely to be incompatible.
Management challenge –
The purpose of big data tools is to make management of a large amount of data as simple as
possible. But it is not so easy. Data management in NoSQL is much more complex than a relational
database. NoSQL, in particular, has a reputation for being challenging to install and even more
hectic to manage on a daily basis.
GUI is not available –
GUI tools to access NoSQL databases are not widely available in the market.
Backup –
Backup is a great weak point for some NoSQL databases like MongoDB. MongoDB has no
approach for the backup of data in a consistent manner.
Large document size –
Some database systems like MongoDB and CouchDB store data in JSON format, which means
that documents are quite large (big data, network bandwidth, speed), and having descriptive key
names actually hurts, since they increase the document size.
Types of NoSQL database:
Types of NoSQL databases and the names of the database systems that fall in each category are:
Key-value store: Memcached, Redis, Coherence
Tabular: HBase, Big Table, Accumulo
Document based: MongoDB, CouchDB, Cloudant
For example, MongoDB falls in the category of document-based NoSQL databases.
When should NoSQL be used:
• A huge amount of data needs to be stored and retrieved.
• The relationship between the data you store is not that important.
• The data changes over time and is not structured.
• Support for constraints and joins is not required at the database level.
• The data is growing continuously and you need to scale the database regularly to handle it.
NoSQL Data Architecture Patterns
An architecture pattern is a logical way of categorising the data that will be stored in the
database. NoSQL is a type of database which helps to perform operations on big data and store it in
a valid format. It is widely used because of its flexibility and the wide variety of services it offers.
Architecture Patterns of NoSQL:
The data is stored in NoSQL in any of the following four data architecture patterns.
1. Key-Value Store Database
2. Column Store Database
3. Document Database
4. Graph Database
These are explained as following below.
1. Key-Value Store Database:
This model is one of the most basic models of NoSQL databases. As the name suggests, the
data is stored in form of Key-Value Pairs. The key is usually a sequence of strings, integers or
characters but can also be a more advanced data type. The value is typically linked or co-related to
the key. The key-value pair storage databases generally store data as a hash table where each key is
unique. The value can be of any type (JSON, BLOB(Binary Large Object), strings, etc). This type
of pattern is usually used in shopping websites or e-commerce applications.
Advantages:
Can handle large amounts of data and heavy load.
Easy retrieval of data by keys.
Limitations:
Complex queries may involve multiple key-value pairs, which can degrade performance.
Data involving many-to-many relationships is difficult to model.
Examples:
DynamoDB
Berkeley DB
3. Document Database:
The document database fetches and accumulates data in the form of key-value pairs, but here
the values are called documents. A document can be stated as a complex data structure; a document
here can be in the form of text, arrays, strings, JSON, XML, or any such format. The use of nested
documents is also very common. It is very effective, as most of the data created is usually in the form
of JSON and is unstructured.
Advantages:
This type of format is very useful and apt for semi-structured data.
Storage retrieval and managing of documents is easy.
Limitations:
Handling multiple documents is challenging
Aggregation operations may not work accurately.
Examples:
MongoDB
CouchDB
4. Graph Database:
In this model, the data is stored as nodes, and the relationships between the nodes are stored as edges,
so connected data can be traversed directly.
Advantages:
Fastest traversal because of connections.
Spatial data can be easily handled.
Limitations:
Wrong connections may lead to infinite loops.
Examples:
Neo4J
FlockDB( Used by Twitter)
Extraction:
The first step of the ETL process is extraction. In this step, data is extracted from various source
systems, which may hold it in various formats such as relational databases, NoSQL, XML, and flat files,
into the staging area. It is important to extract the data from the various source systems and store it in
the staging area first, and not directly in the data warehouse, because the extracted data is in
various formats and can also be corrupted. Loading it directly into the data warehouse may therefore
damage it, and rollback will be much more difficult. This makes extraction one of the most important
steps of the ETL process.
Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or functions is
applied to the extracted data to convert it into a single standard format. It may involve the following
processes/tasks (a small R sketch of these operations is given after the list):
Filtering – loading only certain attributes into the data warehouse.
Cleaning – filling up the NULL values with some default values, mapping U.S.A, United States,
and America into USA, etc.
Joining – joining multiple attributes into one.
Splitting – splitting a single attribute into multiple attributes.
Sorting – sorting tuples on the basis of some attribute (generally a key attribute).
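For illustration, here is a minimal R sketch of these transformation steps on a made-up staging table (all object and column names are hypothetical, not part of the original material):
# made-up extracted data
staging <- data.frame(id = c(3, 1, 4, 2),
                      name = c("Asha Rao", "Ravi Kumar", "Meena Iyer", "John Paul"),
                      country = c("U.S.A", "United States", "India", NA),
                      sales = c(1200, NA, 950, 700),
                      stringsAsFactors = FALSE)
# Filtering: keep only the attributes needed in the warehouse
t1 <- staging[, c("id", "name", "country", "sales")]
# Cleaning: fill NULL/NA values with defaults and standardise country names
t1$sales[is.na(t1$sales)] <- 0
t1$country[is.na(t1$country)] <- "Unknown"
t1$country[t1$country %in% c("U.S.A", "United States", "America")] <- "USA"
# Splitting: split one attribute (name) into two
parts <- strsplit(t1$name, " ")
t1$first_name <- sapply(parts, `[`, 1)
t1$last_name  <- sapply(parts, `[`, 2)
# Joining: join multiple attributes into one
t1$label <- paste(t1$first_name, t1$country, sep = "-")
# Sorting: sort tuples on the key attribute
transformed <- t1[order(t1$id), ]
transformed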
Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is finally
loaded into the data warehouse. Sometimes the data is updated by loading into the data warehouse
very frequently and sometimes it is done after longer but regular intervals. The rate and period of
loading solely depends on the requirements and varies from system to system.
The ETL process can also use the pipelining concept, i.e., as soon as some data is extracted, it can be
transformed, and during that period some new data can be extracted. And while the transformed data
is being loaded into the data warehouse, the already extracted data can be transformed. The block
diagram of the pipelining of the ETL process is shown below:
ETL Tools: Most commonly used ETL tools are Sybase, Oracle Warehouse builder, CloverETL,
and MarkLogic.
Textual ETL is the process of reading text and producing a relational database suitable for
analytical processing. The text can come from any electronic source and the results can go into
any relational database. ... Email typically contains spam and personal email which does not
belong in a corporate database.
UNIT IV
Big Data analytics:
Data analytics life cycle, Data cleaning , Data transformation, Comparing reporting and analysis,
Types of analysis, Analytical approaches, Data analytics using R, Exploring basic features of R,
Exploring R GUI, Reading data sets, Manipulating and processing data in R, Functions and
packages in R, Performing graphical analysis.
UNIT 4
Big Data Analytics
Types of analysis
Analysis of data is a vital part of running a successful business. There are four types of data
analysis that are in use across all industries. While we separate these into categories, they are all
linked together and build upon each other. As you begin moving from the simplest type of analytics
to more complex ones, the degree of difficulty and the resources required increase. At the same time,
the level of added insight and value also increases.
Descriptive Analysis
The first type of data analysis is descriptive analysis. It is at the foundation of all data insight. It is
the simplest and most common use of data in business today. Descriptive analysis answers the
“what happened” by summarizing past data, usually in the form of dashboards.
The biggest use of descriptive analysis in business is to track Key Performance Indicators (KPIs).
KPIs describe how a business is performing based on chosen benchmarks.
• KPI dashboards
• Monthly revenue reports
• Sales leads overview
Diagnostic Analysis
Diagnostic analysis takes the insights found from descriptive analytics and drills down to
find the causes of those outcomes. Organizations make use of this type of analytics as it creates
more connections between data and identifies patterns of behavior.
A critical aspect of diagnostic analysis is creating detailed information. When new problems
arise, it is possible you have already collected certain data pertaining to the issue. Having
the data already at your disposal ends the need to repeat work and makes all problems
interconnected.
Predictive Analysis
This type of analysis is another step up from the descriptive and diagnostic analyses. Predictive
analysis uses the data we have summarized to make logical predictions of the outcomes of events.
This analysis relies on statistical modeling, which requires added technology and manpower to
forecast. It is also important to understand that forecasting is only an estimate; the accuracy of
predictions relies on quality and detailed data.
• Risk Assessment
• Sales Forecasting
• Using customer segmentation to determine which leads have the best chance of converting
• Predictive analytics in customer success teams
There’s no defined structure of the phases in the life cycle of Data Analytics; thus, there may not be
uniformity in these steps. There can be some data professionals that follow additional steps, while
there may be some who skip some stages altogether or work on different phases simultaneously.
Phase 1: Data Discovery and Formation
Phase 2: Data Preparation and Processing
Phase 3: Model Planning
Phase 4: Model Building
Phase 5: Result Communication and Publication
Phase 6: Measuring of Effectiveness
Consider an example of a retail store chain that wants to optimize its products’ prices to boost its
revenue. The store chain has thousands of products over hundreds of outlets, making it a highly
complex scenario. Once you identify the store chain’s objective, you find the data you need,
prepare it, and go through the Data Analytics lifecycle process. Observe different types of
customers, such as ordinary customers and customers like contractors who buy in bulk. According
to you, treating various types of customers differently can give us the solution.
In this case, we need to get the definition, find data, and conduct hypothesis testing to check
whether various customer types impact the model results and get the right output. Once we are
convinced with the model results, we can deploy the model, and integrate it into the business, and
we are all set to deploy the prices you think are the most optimal across the outlets of the store.
Data analytics using R
R is a programming language that provides more flexibility than Excel. R is not bound by
a spreadsheet, where the data need to be entered in cells. For more complex analyses, Excel’s
spreadsheet format is too restrictive. And R is freely available on-line with new content being
uploaded regularly.
RStudio
RStudio is an interface that provides you with a greater ability to conduct your analyses in R.
You can think of RStudio as an overlay on the software R that allows you to visually group together
in one interface the input window, the output window, the objects in your workspace, and plots.
Benefits of R Analytics
Business analytics in R allows users to analyze business data more efficiently. The following are
some of the main benefits realized by companies employing R in their analytics programs:
Democratizing Analytics across the Organization: R can help democratize analytics by enabling
business users with interactive data visualization and reporting tools. R can be used for data science
by non data scientists so that business users and citizen data scientists can make better business
decisions. R analytics can also reduce time spent on data preparation and data wrangling,
allowing data scientists to focus on more complex data science initiatives.
Providing Deeper, More Accurate Insights: R can help create powerful models to analyze large
amounts of data. With more precise data collection and storage through R analytics, companies can
deliver more valuable insights to users. Analytics and statistical engines using R provide deeper,
more accurate insights for the business. R can be used to develop very specific, in-depth analyses.
Leveraging Big Data: R can handle big datasets and is arguably as easy if not easier for most
analysts to use as any of the other analytics tools available today.
Creating Interactive Data Visualizations: R is also helpful for data visualization and data
exploration because it supports the creation of graphs and diagrams. It includes the ability to create
interactive visualizations and 3D charts and graphs that are helpful for communicating with
business users.
While R programming was originally designed for statisticians, it can be implemented for a variety
of uses including predictive analytics, data modeling, and data mining. Businesses can implement R
to create custom models for data collection, clustering, and analytics. R analytics can provide a
valuable way to quickly develop models targeted at understanding specific areas of the business and
delivering tailored insights on day-to-day needs.
• Statistical testing
• Prescriptive analytics
• Predictive analytics
• Time-series analysis
• What-if analysis
• Regression models
• Data exploration
• Forecasting
• Text mining
• Data mining
• Visual analytics
• Web analytics
• Social media analytics
• Sentiment analysis
Exploring basic Features of R:
The R programming language is versatile and can be used as a software development environment
for statistical analysis, graphics representation, and reporting. The below mentioned are
the significant features of the R language:
• R is a simple and effective programming language that has been well developed, and it is also
data analysis software.
• R has a large, consistent, and integrated set of tools used for data analysis.
• R contains a suite of operators for different types of calculations on arrays, lists, and
vectors.
• The graphical techniques in R for data analysis can display output directly on the screen or
print it on paper.
• The R language can also be used with several other programming languages such as Python, Perl,
Ruby, F#, and Julia.
Exploring R GUI
R GUI is the standard GUI platform for working in R. The R Console Window forms an essential
part of the R GUI. In this window, we input various instructions, scripts and several other important
operations. This console window has several tools embedded in it to facilitate ease of operations.
This console appears whenever we access the R GUI.
In the main panel of the R GUI, go to the ‘File‘ menu and select the ‘New Script‘ option. This will
create a new script in R. In order to quit the active R session, you can type the following command
after the R prompt ‘>’:
> q()
Manipulating and processing data in R
Data structures provide the way to represent data in data analytics. We can manipulate data
in R for analysis and visualization. One of the most important aspects of computing with data in R
is its ability to manipulate data and enable its subsequent analysis and visualization. Let us see few
basic data structures in R:
a. Vectors in R
These are ordered container of primitive elements and are used for 1-dimensional data.
b. Matrices in R
These are Rectangular collections of elements and are useful when all data is of a single class that is
numeric or characters.
c. Lists in R
These are ordered container for arbitrary elements and are used for higher dimension data, like
customer data information of an organization. When data cannot be represented as an array or a
data frame, list is the best choice. This is so because lists can contain all kinds of other objects,
including other lists or data frames, and in that sense, they are very flexible.
d. Data frames
These are two-dimensional containers for records and variables and are used for representing data
from spreadsheets etc. It is similar to a single table in the database.
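A small, hypothetical sketch creating each of these structures (the values are made up):
# vector: one-dimensional, ordered, single type
v <- c(10, 20, 30)
# matrix: rectangular, all elements of one class
m <- matrix(1:6, nrow = 2, ncol = 3)
# list: ordered container for arbitrary, possibly nested, elements
cust <- list(name = "Asha", age = 31, orders = c(120, 85))
# data frame: rows are records, columns are variables, like a database table
df <- data.frame(name = c("Asha", "Ravi"), age = c(31, 28))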
a. $
The dollar sign operator selects a single element of data. When you use this operator with a data
frame, the result is always a vector.
b. [[
Similar to $ in R, the double square brackets operator in R also returns a single element, but it
offers the flexibility of referring to the elements by position rather than by name. It can be used for
data frames and lists.
c. [
The single square bracket operator in R returns multiple elements of data. The index within the
square brackets can be a numeric vector, a logical vector, or a character vector.
For example, to retrieve the first 5 rows and all columns of the built-in data set iris, the below
command is used:
> iris[1:5, ]
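As a further illustration of the three operators on the built-in iris data set (these particular calls are only examples, not part of the original notes):
iris$Species[1:3]            # $  : a single column selected by name, returned as a vector
iris[["Sepal.Length"]][1:3]  # [[ : a single element, by name or by position (e.g. iris[[1]])
iris[1:5, c(1, 5)]           # [  : multiple elements -- rows 1 to 5 of columns 1 and 5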
Creating subsets in vectors
To create subsets of data stored in a vector, the subset( ) function or the [ ] brackets are used.
Example: using the subset( ) function and the [ ] brackets.
# A sample vector
v <- c(1, 5, 6, 3, 2, 4, 2)
subset(v, v < 4)   # using the subset() function: keep elements smaller than 4
v[v < 4]           # using square brackets
# another vector
t <- c("one", "one", "two", "three", "four", "two")
# remove "one" entries
subset(t, t != "one")
t[t != "one"]
Creating subsets in Data Frames
Example 1:
# A sample data frame
Data <- read.table(header = TRUE, text = '
Subject class marks
1 1 99
2 2 84
3 1 89
4 2 79
')
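Two illustrative ways of subsetting the data frame created above (the conditions are only examples):
subset(Data, marks >= 85)    # rows where marks is at least 85
Data[Data$class == 1, ]      # rows belonging to class 1, using square brackets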
a. Natural join in R
To keep only the rows that match between two data frames, use the merge( ) function and specify the
argument all = FALSE.
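A minimal sketch (the data frames here are made up):
df1 <- data.frame(id = c(1, 2, 3), name = c("Asha", "Ravi", "Meena"))
df2 <- data.frame(id = c(2, 3, 4), marks = c(84, 91, 77))
merge(df1, df2, by = "id", all = FALSE)   # natural join: only ids 2 and 3 are kept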
cbind(): The column bind function is used to combine vectors, matrices, or data frames by columns.
Parameters:
x1, x2: vector, matrix, data frames
deparse.level: This value determines how the column names are generated. The default value of
deparse.level is 1.
Example 1:
x <- 2:7
y <- c(2, 5)
cbind(x, y)
Output:
     x y
[1,] 2 2
[2,] 3 5
[3,] 4 2
[4,] 5 5
[5,] 6 2
[6,] 7 5
rbind(): The rbind or row bind function is used to bind or combine multiple rows together.
rbind(my_data, new_row)
The easiest way of using rbind in R is to combine a vector and a data frame. First, let's
create an example data frame and append a row to it.
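A sketch consistent with the output shown below (the values are inferred from that output and are otherwise illustrative):
my_data <- data.frame(x1 = c(7, 4, 4, 9),
                      x2 = c(5, 2, 8, 9),
                      x3 = 1:4)
new_row <- c(9, 8, 7)      # the row to append
rbind(my_data, new_row)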
Output:
##   x1 x2 x3
## 1  7  5  1
## 2  4  2  2
## 3  4  8  3
## 4  9  9  4
## 5  9  8  7
Sorting data:
Example:
vec1 <- c(23, 45, 10, 34, 89, 20, 67, 99)
# sorting of a vector
sort(vec1)
# reverse sorting
sort(vec1, decreasing = TRUE)
Ordering data:
Example:
sampleDataFrame[ order(sampleDataFrame$weight), ]
Transposing Data
t( ) function is used to transpose a matrix or a data frame. This function transposes rows into
columns and columns into rows.
Example:
sampleDataFrame
t(sampleDataFrame)
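For concreteness, a hypothetical sampleDataFrame (its columns are assumptions, since the original data is not shown here) used with both the ordering and the transposing operations:
sampleDataFrame <- data.frame(name = c("Asha", "Ravi", "Meena"),
                              weight = c(62, 75, 58))
sampleDataFrame[order(sampleDataFrame$weight), ]   # rows ordered by weight
t(sampleDataFrame)                                 # rows and columns swapped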
R provides the following functions of the reshape2 package to convert data into wide or long
formats:
o use the melt( ) function to convert wide data into the long format
o use the dcast( ) function to convert long data into the wide format
melt( )
Syntax:
melt(data, na.rm = FALSE, value.name = "value")
dcast( )
Syntax:
dcast(data, formula, fun.aggregate)
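A small sketch using the reshape2 package (assuming it is installed; the data is made up):
library(reshape2)
wide <- data.frame(student = c("A", "B"), maths = c(90, 75), physics = c(85, 80))
long <- melt(wide, id.vars = "student",
             variable.name = "subject", value.name = "marks")   # wide -> long
long
dcast(long, student ~ subject, value.var = "marks")             # long -> wide again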
Managing Data in R using Matrices
A matrix is a collection of data elements arranged in a two-dimensional rectangular layout. In R,
the elements that make up a matrix must be of a consistent mode (i.e. all elements must be numeric,
or character, etc.). Therefore, a matrix can be thought of as an atomic vector with a dimension
attribute. Furthermore, all rows of a matrix must be of same length.
Creating Matrices
Matrices are constructed column-wise, so entries can be thought of starting in the “upper left”
corner and running down the columns. We can create a matrix using the matrix() function and
specifying the values to fill in the matrix and the number of rows and columns to make the matrix.
# numeric matrix
m1 <- matrix(1:6, nrow = 2, ncol = 3)
m1
output :
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
The underlying structure of this matrix is simply an integer vector with an added 2x3 dimension
attribute.
Matrices can also contain character values. Whether a matrix contains data that are of numeric or
character type, all the elements must be of the same class.
# a character matrix
m2 <- matrix(letters[1:6], nrow = 2, ncol = 3)
m2
## [,1] [,2] [,3]
## [1,] "a" "c" "e"
## [2,] "b" "d" "f"
Matrices can also be created using the column-bind cbind() and row-bind rbind() functions.
However, keep in mind that the vectors that are being binded must be of equal length and mode.
v1 <- 1:4
v2 <- 5:8
cbind(v1, v2)
## v1 v2
## [1,] 1 5
## [2,] 2 6
## [3,] 3 7
## [4,] 4 8
rbind(v1, v2)
## [,1] [,2] [,3] [,4]
## v1 1 2 3 4
## v2 5 6 7 8
Adding on to Matrices
We can leverage the cbind() and rbind() functions for adding onto matrices as well. Again, it's
important to keep in mind that the vectors being bound must be of equal length and mode
to the pre-existing matrix; a small sketch is given below.
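A minimal sketch, reusing the matrix m1 created earlier:
m1 <- matrix(1:6, nrow = 2, ncol = 3)   # as created above
cbind(m1, c(7, 8))                      # add a new column of length 2
rbind(m1, c(10, 20, 30))                # add a new row of length 3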
Subsetting Matrices
To subset matrices we use the [ operator; however, since matrices have two dimensions, we need to
incorporate subsetting arguments for both the row and column dimensions. A generic form of matrix
subsetting looks like matrix[rows, columns]. We can illustrate with the matrix m2, here redefined as a
4 x 3 numeric matrix with named rows and columns:
m2
## col_1 col_2 col_3
## row_1 1 5 9
## row_2 2 6 10
## row_3 3 7 11
## row_4 4 8 12
By using different values in the rows and columns argument of m2[rows, columns], we can
subset m2 in multiple ways.
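A few illustrative subsets, assuming m2 is the 4 x 3 numeric matrix printed above:
m2 <- matrix(1:12, nrow = 4,
             dimnames = list(paste0("row_", 1:4), paste0("col_", 1:3)))  # reproduces the matrix shown
m2[1:2, ]       # first two rows, all columns
m2[, "col_2"]   # a single column, selected by name
m2[3, 2]        # a single element: row 3, column 2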
A data frame is a table or a two-dimensional array-like structure in which each column contains
values of one variable and each row contains one set of values from each column.
Following are the characteristics of a data frame.
• The data stored in a data frame can be of numeric, factor or character type.
df_base <- data.frame(name = c("amy", "bob", "carl", "dan"),   # the first column's name and values are illustrative
                      occupation = c("engineer", "designer", "architect", "engineer"))
Note that you simply use the data.frame function, provide names for your columns, and populate
the contents of the columns (using the c() vector functionality).
Output:
Note that a data frame can hold a number of data types (i.e., the columns can be characters,
integers, dates, factors, etc). Above you'll notice that the data types are displayed below the column
name; in this case, our two columns were coded as factors.
We not only see the values of each row in the second column printed but also the
corresponding levels. The syntax is the same when selecting a row from a tibble, except that the
levels aren't included, because columns with characters aren't automatically coded as factors and
only factors have levels (don't get hung up if you don't understand levels for now). Note that the
tibble column prints a little more nicely (and is a chr, or character, column).
df_tidy[,2]
Note that in R, when locating a cell, [1,2] refers to the first row and second column, so
that [,2] grabs the entire second column.
To actually do something more interesting with this, and count the number of unique jobs, you use
the same syntax inside a function:
unique(df_tidy[,2])
Selecting a Column by Name
Note that we keep the , syntax since we want the entire column, and select the column by name in
quotes.
df_base[,"occupation"]
This is the same for a tibble (this is the last time we'll make the comparison). Again, notice the
levels are gone because the tidyverse defaults to characters instead of factors.
df_tidy[,"occupation"]
Selecting Rows
Here is the syntax to grab a single row, in this case the first row:
df_tidy[1,]
And here's the syntax to grab multiple rows. Note that you span inclusively from the first to last row
of interest.
df_tidy[1:2,]
Packages in R
search()
When the search() command is executed, you see an overview of the packages that are loaded and are
ready for use. You will see, for example, the graphics package, which carries out routines to create graphs.
There are many packages that are installed but not loaded automatically.
For example, the splines package, which contains routines for smoothing curves, is installed, but
it is not loaded by itself.
To see what packages are available, you need to type the following command:
installed.packages()
Installing R Packages for Windows
In Windows, you get the Packages menu and an Install option, which makes installation very easy.
After selecting a local mirror site, a list of available binary packages is shown and you can
choose the ones you need. Once you have selected the packages you need, click
the OK button to download and install them into R.
If you download the package files from the internet (as .zip files), you need to use the install
package(s) from local files option in the Packages menu. It allows you to select the files you need;
the packages are then unzipped and installed into R.
To install R packages on the Linux system, you need to perform the below steps:
• Download the required packages as compressed files from the link: Available packages by
name
• Run the following command to install packages:
R CMD INSTALL [options] [-l lib] pkgs
• Use the following command to load the installed package:
library(package)
Installing by the Name of a Package
In Linux, you can install a package directly if you know its name:
install.packages('ade4')
R Packages List
The list below gives two widely used packages in R with their usage:
ANN – used for building and analyzing Artificial Neural Networks (ANNs)
lattice – used for creating lattice graphics for panel plots or trellis graphs
We need to load the package in R after installing them to make them usable.
To load the R language Package, you can use the library() command, as follows:
library(package)
In R, you can unload a package by using detach() command, as follows:
detach(package:name)
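Putting these commands together for a real package such as lattice (mentioned above):
install.packages("lattice")   # download and install the package from CRAN
library(lattice)              # load it for use in the current session
detach(package:lattice)       # unload it again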
Working with Functions
Moving from Scripts to R Function:
• Functions can work with any input. You can provide diverse input data to the functions.
• The output of the function is an object that allows you to work with the result.
How to Create a Script in R?
R supports several editors, so a script can be created in any editor like Notepad, MS Word,
or WordPad and saved with the .R extension in the current working directory.
For example, if we want to run the sample.R script in R, we need to provide the below command:
source("sample.R")
In order to create a script, first open a script file in the editor mode and type the required code.
We will create a script that takes input in the form of fractions and converts it into percentages,
rounding them to one decimal digit.
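A minimal sketch of such a script (the input values are only examples):
# pastePercent.R
values <- c(0.8223, 0.02487, 1.62, 0.4)          # example fractions
percentage <- round(values * 100, digits = 1)    # convert to percentages, one decimal place
paste(percentage, "%", sep = "")                 # paste the % sign onto each value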
Save the above script as a script file with any name, for example pastePercent.R.
Now you can call this script on the console with the help of the source command, which we have
already seen:
source('pastePercent.R')
Output:
Define a function with a name so that it becomes easier to call the R function and pass arguments to
it as input.
The R function name should be followed by parentheses that act as a front gate for your function;
between the parentheses, the arguments of the function are provided.
Use the return() statement, which acts as the back gate of your function: it provides the final result
of the function, which is returned to your workspace.
Let us now see how we can convert the script that we had written earlier, to convert values into
percentages and round them off, into an R function.
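A sketch of the resulting function, consistent with the later versions shown in this section:
Percent_add <- function(fac){
  percentage <- round(fac * 100, digits = 1)
  result <- paste(percentage, "%", sep = "")
  return(result)
}
Percent_add(c(0.8223, 0.02487, 1.62, 0.4))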
Output:
The keyword function defines the start of a function. The parentheses after the keyword
form the front gate, or argument list, of the function. Between the parentheses are the arguments to
the function; in this case, there is only one argument.
The return statement defines the end of the function and returns the result. The object put
between its parentheses is returned from inside the function to the workspace; only one object can
be placed between the parentheses.
The braces {} are the walls of the function. Everything between the braces is part of the assembly
line, or the body, of the function. This is how functions are created in R.
Using R Function
After transforming the script into an R function, you need to save it, and you can then use the function
in R again whenever required.
R does not tell you by itself that it has loaded the function, but it is present in the workspace; if
you want, you can check this by using the ls() command.
Now we know which functions are present in memory, and we can use them when required.
For example, if you want to create percentages from values again, you can use the Percent_add
function as below:
#Author DataFlair
ls()
new.vector <- c(0.8223, 0.02487, 1.62, 0.4)
Percent_add(new.vector)
Output:
In R, a function is also an object, and you can manipulate it as you do other objects.
You can assign a function to a new object using a command such as the one below:
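For example (the new name is arbitrary):
Percent_copy <- Percent_add    # the function is copied like any other object
Percent_copy(new.vector)       # behaves exactly like Percent_add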
The output of the above code is as follows:
So far, in all the above code, we have used the return() statement to return output. But in R this can
be skipped, as by default R returns the value of the last line of code in the R function body.
Now the above code becomes:
#Author DataFlair
Percent_add <- function(fac){
  percentage <- round(fac * 100, digits = 1)
  paste(percentage, "%", sep = "")
}
Output:
You need return() if you want to exit the function before the end of the code in the body.
For example, you could add a line to the Percent_add function that checks whether fac is numeric,
and if not, returns NULL, as shown below:
#Author DataFlair
Percent_add <- function(frac){
if( !is.numeric(frac) ) return(NULL)
percentage <- round(frac * 100, digits = 1)
paste(percentage, "%", sep = "")}
Output:
2. Dropping the {}
You can drop the braces in some cases, even though they form a proverbial wall around the function.
If a function consists of only one line of code, you can just add that line after the argument list
without enclosing it in braces. R will see the code after the argument list as the body of the
function.
Suppose you want to calculate the odds from a proportion. You can write such a function without
using braces, as shown below:
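A one-line sketch (the function name is illustrative):
odds <- function(p) p / (1 - p)   # the body follows the argument list, no braces needed
odds(0.25)                        # returns 0.3333...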
Scope of R Function
Every object you create ends up in this environment, which is also called the global environment.
The workspace or global environment is the universe of the R user where everything happens.
1. External R Function
If you use an R function, the function first creates a temporary local environment. This local
environment is nested within the global environment, which means that, from that local
environment, you also can access any object from the global environment. As soon as the function
ends, the local environment is destroyed along with all the objects in it.
If R sees any object name, it first searches the local environment. If it finds the object there, it uses
that one else it searches in the global environment for that object.
2. Internal R Function
Using global variables in an R function is not considered a good practice. Writing your functions in
such a way that they need objects in the global environment is not efficient, because you use
functions to avoid dependency on objects in the global environment in the first place.
The whole concept behind R strongly opposes using global variables in different functions. As
a functional programming language, one of the main ideas of R is that the outcome of a function
should not be dependent on anything but the values of the arguments of that function. If you give
the arguments the same values, you will always get the same results.
#Author DataFlair
calculate_func <- function(data1, data2, data3){
base_min <- function(z) z - mean(data3)
base_min(data1) / base_min(data2)
}
Output:
A closer look at the R function definition of base_min() shows that it uses the object data3 from the
enclosing function but does not have an argument with that name.
It is easy to inspect the functions you use in R. You can just look at the function code of print() by
typing its name at the command line.
In order to display the code of the print() function, we proceed as follows:
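Typing the name without parentheses prints the definition; for print() this reveals that it is a generic function built on UseMethod():
print
# function (x, ...)
# UseMethod("print")
# (followed by bytecode and environment information)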
Output:
The UseMethod() function is the central function in R's generic function system.
The UseMethod() function looks for a function that can deal with the type of
object that is given as the argument x.
Suppose you have a data frame that you want to print. R looks for a method whose name is the generic
name followed by a dot and the class of the object, so the object that was passed as an argument
will be printed using the print.data.frame() function, which R looks up first. Other methods are
searched for in the same way: the search starts with print, followed by the dot and the object type.
R also provides default methods, written with the default keyword after the generic name and the dot
(for example, print.default()). If you use the default keyword with the name of an object, R will
ignore the type of the object and just use the default method.
Performing Graphical Analysis in R
Graphs are useful for non-numerical data, such as colours, flavours, brand names, and more. When
numerical measures are difficult or impossible to compute, graphs play an important role.
• Plots with Single Variable – You can plot a graph for a single variable.
• Plots with Two Variables – You can plot a graph with two variables.
• Plots with Multiple Variables – You can plot a graph with multiple variables.
• Special Plots – R has low and high-level graphics facilities.
You may need to plot a single variable in graphical data analysis with R programming.
For example, a plot showing daily sales values of a particular product over a period of time. You
can also plot the time series of month-by-month sales.
The choice of plots is more restricted when you have just one variable to plot. There are various
plotting functions for single variables in R:
• Histograms – Used to display the mode, spread, and symmetry of a set of data.
• Index Plots – Here, the plot takes a single argument. This kind of plot is especially useful
for error checking.
• Time Series Plots – When a period of time is complete, the time series plot can be used to
join the dots in an ordered set of y values.
• Pie Charts – Useful to illustrate the proportional makeup of a sample in presentations.
Histograms have the response variable on the x-axis, and the y-axis shows the frequency of
different values of the response. In contrast, a bar chart has the response variable on the y-axis and
a categorical explanatory variable on the x-axis.
1.Histograms
Histograms display the mode, the spread, and the symmetry of a set of data. The R function hist() is
used to plot histograms.
The x-axis is divided into intervals, called bins, into which the values of the response variable are
distributed and then counted. Histograms are tricky because the graph you end up looking at depends
on subjective judgments about exactly where to put the bin margins. Wide bins produce
one picture, narrow bins produce a different picture, and unequal bins produce confusion.
Small bins produce multimodality (several peaks in the histogram), whereas broad bins produce
unimodality (a single peak). When there are different bin widths, the default in R is to convert the
counts into densities.
The convention adopted in R for showing bin boundaries is to employ square and round brackets,
so that:
• [a,b) means ‘greater than or equal to a but less than b’ [square than round).
• (a,b] means ‘greater than a but less than or equal to b’ (round than square].
You need to take care that the bins can accommodate both your minimum and maximum values.
The cut() function takes a continuous vector and cuts it up into bins that can then be used for
counting.
The hist() function in R does not always take your advice about the number of bars or the width of
bars; it chooses 'pretty' break points, which helps when viewing multiple histograms with a similar
range simultaneously. For small integer data, you can have one bin for each value.
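A minimal sketch with made-up data:
values <- rnorm(1000, mean = 50, sd = 10)   # hypothetical measurements
hist(values, breaks = 20,
     main = "Histogram of values", xlab = "Value")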
In R, the parameter k of the negative binomial distribution is known as size and the mean is known
as mu.
Drawing histograms of continuous variables is a more challenging task than for discrete data.
The problem depends on density estimation, which is an important issue for statisticians. To deal
with this problem, you can approximately transform the continuous model into a discrete one, using a
linear approximation to evaluate the density at the specified points.
The choice of bandwidth is a compromise between smoothing away insignificant bumps and retaining
real peaks.
2 Index Plots
For plotting single samples, index plots can be used. The plot function takes a single
argument, a continuous variable, and plots its values on the y-axis, with the x coordinate
determined by the position of each number in the vector. Index plots are especially useful for error
checking.
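A small sketch; the obviously wrong value stands out immediately in the plot:
response <- c(2, 3, 5, 4, 50, 6, 5, 4)   # the 50 is a made-up data-entry error
plot(response)                           # index plot: each value against its position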
3 Time Series Plot
The time series plot can be used to join the dots in an ordered set of y values when they cover a period
of time. Issues arise when there are missing values in the time series (e.g., if sales values
for two months are missing during the last five years), particularly groups of missing values (e.g., if
sales values for two quarters are missing during the last five years), because during those periods we
typically know nothing about the behaviour of the time series.
ts.plot and plot.ts are the two functions for plotting time series data in R.
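A hedged sketch with invented monthly sales figures:
sales <- ts(c(12, 15, 14, 18, 21, 19, 24, 26, 25, 30, 29, 33),
            start = c(2020, 1), frequency = 12)   # monthly data starting January 2020
plot.ts(sales)    # or, equivalently, ts.plot(sales)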
4 Pie Chart
You can use pie charts to illustrate the proportional makeup of a sample in presentations. Here the
function pie takes a vector of numbers and turns them into proportions. It then divides the circle on
the basis of those proportions.
To indicate each segment of the pie, it is essential to use a label. The label is provided as a vector of
character strings, here called data$names.
If the names list contains blank spaces, then you cannot use read.table with a tab-delimited text file to
enter the data. Instead, you can save the file, called piedata, as a comma-delimited file with a ".csv"
extension, and input the data to R using read.csv in place of read.table.
#Author DataFlair
data <- read.csv("/home/dataflair/data/piedata.csv")
data
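Assuming piedata.csv has the columns names and amounts (the column names here are an assumption), the pie itself is drawn with:
pie(data$amounts, labels = as.character(data$names),
    main = "Proportional makeup of the sample")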
Output:
The two types of variables used in the graphical data analysis with R:
• Response variable
• Explanatory variable
The response variable is represented on the y-axis and the explanatory variable is represented on
the x-axis.
When an explanatory variable is categorical, like genotype or colour or gender, the appropriate plot
is either a box-and-whisker plot or a barplot.
Scatterplots
Scatterplots show a graphical representation of the relationship between two numeric variables. The
plot function draws the axes and adds a scatterplot of points. You can also add extra points or lines
to an existing plot by using the functions points and lines.
The points and line functions can be specified in the following two ways:
• Cartesian plot (x, y) – A Cartesian coordinate specifies the location of a point in a two-
dimensional plane with the help of two perpendicular vectors that are known as an axis. The
origin of the Cartesian coordinate system is the point where two axes cut each other and the
location of this point is the (0,0).
• Formula plot (y, x) – The formula based plot refers to representing the relationship
between variables in the graphical form. For example – The equation, y=mx+c, shows a
straight line in the Cartesian coordinate system.
The advantage of the formula-based plot is that the plot function and the model fit look and feel the
same. The Cartesian plots build plots using “x then y” while the model fit uses “y then x”.
The plot function uses the following arguments:
The best way to identify multiple individuals in scatterplots is to use a combination of colours and
symbols. A useful tip is to use as.numeric to convert a grouping factor into colour and/or a symbol.
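A short sketch showing both calling styles with made-up data:
x <- runif(50)
y <- 2 * x + rnorm(50, sd = 0.2)
plot(x, y, col = "blue", pch = 16)            # Cartesian style: x then y
plot(y ~ x, col = "red", pch = 17)            # formula style: y then x (same picture)
points(0.5, 1, col = "darkgreen", pch = 3)    # add an extra point to the existing plot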
Stepped Lines
Stepped lines can be plotted as a graphical display in R. These plots show the data
distinctly and also provide a clear view of the differences between the figures.
While plotting square edges between two points, you need to decide whether to go across and then
up, or up and then across. Let's assume that we have two vectors from 0 to 10. We plot these points
as follows:
x = 0:10
y = 0:10
plot(x,y)
Output:
Also, generate a line by using the upper case “S” as shown below:
> lines(x,y,col="green",type='S')
Output:
A box-and-whisker plot is a graphical means of representing sets of numeric data using quartiles. It
is based on the minimum and maximum values and the upper and lower quartiles, and it summarizes the
information available compactly. The vertical dashed lines are called the 'whiskers'. Boxplots
are also excellent for spotting errors in data, which usually show up as extreme outliers.
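A quick sketch with the built-in InsectSprays data set:
boxplot(count ~ spray, data = InsectSprays,
        xlab = "Spray", ylab = "Insect count")   # one box per level of the factor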
Barplot
A barplot is an alternative to the boxplot for showing the heights of the mean values from different
treatments. The function tapply computes the heights of the bars; it works out the mean values for
each level of the categorical explanatory variable.
Let us create a toy dataset of temperatures in a week. Then, we will plot a barplot that will have
labels.
temperature <- c(28, 35, 31, 40, 29, 41, 42)
days <- c("Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat")
barplot(temperature,
        main = "Maximum Temperatures in a Week",
        xlab = "Days",
        ylab = "Degree in Celsius",
        names.arg = days,
        col = "darkred")
Output:
Initial data inspection using plots is even more important when there are many variables, any one of
which might have mistakes or omissions. The principal plot functions that represent multiple
variables are:
• The Pairs Function – For a matrix of scatterplots of every variable against every other.
• The Coplot Function – For conditioning plots where y is plotted against x for different
values of z.
It is better to use more specialized commands when dealing with the rows and columns of data
frames.
For two or more continuous explanatory variables, it is valuable to check for subtle
dependencies between the explanatory variables. Rows represent the response variables and
columns represent the explanatory variables.
Every variable in the data frame is on the y-axis against every other variable on the x-axis using the
pairs function plots. The pairs function needs only the name of the whole data frame as its first
argument.
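For example, with the four numeric columns of the built-in iris data set:
pairs(iris[, 1:4])   # a scatterplot of every variable against every other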
The Coplot Function
In multivariate data, the relationship between two variables may be obscured by the effects of other
processes. When you draw a two-dimensional plot of y against x, all the effects of the other
explanatory variables are squashed onto the plane of the paper. In the simplest case, we have one
response variable and just two explanatory variables.
The coplot panels are ordered from lower left to upper right, associated with the values of the
conditioning variable shown in the upper panel from left to right.
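A minimal sketch with the built-in mtcars data set, conditioning on the number of cylinders:
coplot(mpg ~ disp | as.factor(cyl), data = mtcars,
       xlab = "Displacement", ylab = "Miles per gallon")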
R has extensive facilities for producing graphs. It has both low-level and high-level graphics
facilities, as per the requirement.
The low-level graphics facilities provide the basic building blocks that can be used to build up graphs
step by step, while the high-level facilities provide a variety of pre-assembled graphical displays.
Apart from the various kinds of graphical plots discussed, R supports the following special plots:
• Design Plots – Effective sizes in designed experiments can be visualized using design
plots. One can plot the design plots using the plot.design function –
plot.design(Growth.rate~Water*Detergent*Daphnia)
• Bubble Plots – Useful for illustrating the variation in the third variable across different
locations in the x–y.
• Plots with many Identical Values – Sometimes, two or more points with count data fall in
exactly the same location in a scatterplot. As a result, the repeated values of y are hidden, one
beneath the other.
Using the following functions, we can add the extra graphical objects in plots:
You are likely to want to save each of your plots as a PDF or PostScript file for publication-quality
graphics. This is done by specifying the 'device' before plotting, then turning the device off once
finished.
The computer screen is the default device, on which we obtain a rough working copy of the graph.
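To send a plot to a file instead, the usual pattern is the following sketch (the file name is arbitrary):
pdf("myplot.pdf")                  # open a PDF device; subsequent plots go to this file
plot(1:10, (1:10)^2, type = "b")
dev.off()                          # turn the device off to finish writing the file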
Selecting an Appropriate Graph in R
• Line Graph – It displays over the time period. It generally keeps the track of records for
both, long-time period and short-time period according to requirements. In the case of small
change, the line graph is more common than the bar graph. In some cases, the line graphs also
compare the changes among different groups in the same time period.
• Pie Chart – It displays comparison within a group.
For example – You can compare students in a college on the basis of their streams, such as
arts, science, and commerce using a pie chart. One cannot use a pie chart to show changes over
the time period.
• Bar Graph – Similar to a line graph, the bar graph generally compares different groups or
tracking changes over a defined period of time. Thus the difference between the two graphs is
that the line graph tracks small changes while a bar graph tracks large changes.
• Area Graph – The area graph tracks the changes over the specific time period for one or
more groups related to a similar category.
• X-Y Plot – The X-Y plot displays a certain relationship between two variables. In this type
of variable, the X-axis measures one variable and Y-axis measures another variable. On the
one hand, if the values of both variables increase at the same time, a positive relationship exists
between variables. On the other hand, if the value of one variable decreases at the time of the
increasing value of another variable, a negative relationship exists between variables. It could
be also possible that the two variables don’t have any relationship. In this case, plotting graph
has no meaning.
UNIT – V
Big Data Visualization:
Introduction to Data visualization, Challenges to Big data visualization, Types of data visualization,
Visualizing Big Data, Tools used in data visualization, Proprietary Data Visualization tools, Open-
source data visualization tools, Data visualization with Tableau.
Data visualization
Data visualization is the graphical representation of data or information. Visual elements such as
charts, graphs, and maps give viewers an easy and accessible way of understanding the represented
information.
Data visualization tools and technologies are essential to analyze massive amounts of
information and make data-driven decisions.
Profit and loss – For example, businesses often use pie charts or bar graphs to show their annual
profit or loss margins.
Big data often has unstructured formats. Due to bandwidth limitations and power requirements,
visualization should move closer to the data to extract meaningful information efficiently.
Effective data visualization is a key part of the discovery process in the era of big data. Different
dimensionality-reduction methods exist to address the high complexity and high dimensionality of
big data.
Big data visualization also faces the following problems:
• Visual noise: Most of the objects in a dataset are too close or too similar to one another, so
users cannot distinguish them as separate objects on the screen.
• Information loss: Reducing the visible data set is possible, but it leads to information loss.
• Large image perception: Data visualization methods are limited not only by the aspect ratio
and resolution of the device, but also by the physical limits of human perception.
• High rate of image change: The data on the display changes so quickly, or so extensively,
that users cannot react to it.
• High performance requirements: Dynamic visualization imposes high performance
requirements that are hardly noticeable in static visualization, where speed requirements are lower.
In Big Data applications, it is difficult to conduct data visualization because of the large size
and high dimensionality of big data. Most current Big Data visualization tools perform poorly in
terms of scalability, functionality, and response time. Uncertainty, which can arise at any stage of a
visual analytics process, is also a great challenge to effective uncertainty-aware visualization.
Potential solutions to some challenges or problems about visualization and big data were
presented:
• Meeting the need for speed: One possible solution is hardware: increased memory and
powerful parallel processing can be used. Another method is to keep the data in memory
using a grid computing approach, in which many machines are used.
• Understanding the data: One solution is to have the proper domain expertise in place.
• Addressing data quality: It is necessary to ensure the data is clean through the process of
data governance or information management.
• Displaying meaningful results: One way is to cluster data into a higher-level view where
smaller groups of data are visible and the data can be effectively visualized.
• Dealing with outliers: Possible solutions are to remove the outliers from the data or create
a separate chart for the outliers.
Data can be visualized in many ways, such as in the form of 1D, 2D, or 3D structures. The table
below briefly describes the different types of data visualization:
2D/Planar – Examples: choropleth, cartogram, dot distribution map, and proportional symbol map.
Tools: GeoCommons, Google Fusion Tables, Google Maps API, Polymaps, Many Eyes, Google
Charts, and Tableau Public.
As shown in the table, the simplest type of data visualization is 1D representation and the most
complex data visualization is the network representation. The following is a brief description of
each of these data visualizations:
▪ 1D (Linear) Data Visualization – In the linear data visualization, data is presented in the
form of lists. Hence, we cannot term it as visualization. It is rather a data organization
technique. Therefore, no tool is required to visualize data in a linear manner.
▪ 2D (Planar) Data Visualization – This technique presents data in the form of images,
diagrams, or charts on a plane surface.
▪ 3D (Volumetric) Data Visualization – In this method, data presentation uses exactly
three dimensions to show simulations, surface rendering, volume rendering, etc. It is
generally used in scientific studies. Today, many organizations also use 3D computer modeling
and volume rendering in advertisements to give users a better feel for their products.
▪ Network data visualization – It is used to represent data relations that are too complex to
be represented in the form of hierarchies.
Data visualization is a great way to reduce the turn-around time consumed in interpreting big data.
Traditional analytical techniques are not enough to capture or interpret the information that big data
possesses. Traditional tools are built on relational models that work best with static interaction.
Big data is highly dynamic, and therefore most traditional tools are not able to generate quality
results from it. The response time of traditional tools is also quite high, making them unfit for
quality interaction.
Nowadays, IT companies that use Big Data face several such challenges. Considering these factors,
IT companies are focusing more on the research and development of robust algorithms, software,
and tools to analyze the data that is scattered across the internet.
Visualization of data often produces cluttered images, which are filtered with the help of clutter-
reduction techniques. Uniform sampling and dimension reduction are two commonly used clutter-
reduction techniques.
The visual data reduction process involves automated data analysis to measure density, outliers, and
their differences. These measures are then used as quality metrics, such as size metrics, to evaluate
the data-reduction activity.
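A hedged R sketch of the two clutter-reduction techniques just mentioned, using simulated data; the choice of sample() for uniform sampling and prcomp() for dimension reduction is illustrative, not prescribed by the course text:
# Uniform sampling: keep a random subset of rows to reduce over-plotting
big <- data.frame(matrix(rnorm(10000 * 10), ncol = 10))
sampled <- big[sample(nrow(big), 1000), ]
# Dimension reduction: project the sample onto its first two principal components
pcs <- prcomp(sampled, scale. = TRUE)
plot(pcs$x[, 1], pcs$x[, 2], pch = ".", xlab = "PC1", ylab = "PC2")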
Apart from representing data, a visualization tool must be interactive, able to connect with different
sources of data, establish links between different data values, restore missing data, and polish data
for further analysis.
Excel – A widely used tool for data analytics. It helps you track and visualize data to derive better
insights, and provides various ways to share data and analytical conclusions within and across
organizations.
LastForward – Open-source software provided by last.fm for analyzing and visualizing social
music networks.
Pics – A tool used to track the activity of images on a website.
D3 – D3 allows you to bind arbitrary data to the Document Object Model (DOM) and then apply
data-driven transformations to the document. For example, you can use D3 to generate an
HTML table from an array of numbers, or use the same data to create an interactive SVG bar
chart with smooth transitions and interactions.
Rootzmap (Mapping the Internet) – A tool used to generate a series of maps on the basis of
datasets provided by the National Aeronautics and Space Administration (NASA).
Open source data visualization tools
• They deliver good performance and are compliant with web as well as mobile web
security requirements.
Analytical techniques are used to analyze complex relationships among variables. The following
are some commonly used analytical techniques for big data solutions:
Regression analysis – A statistical tool used for prediction. Regression analysis is used to
predict a continuous dependent variable from one or more independent variables (see the R sketch
after this list).
Grouping methods – The technique of categorizing observations into significant or
purposeful blocks is called grouping; the recognition of features that create a distinction
between groups is called discriminant analysis.
Path analysis – Examines the directed relationships among a set of variables.
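A minimal R sketch of regression analysis using the built-in cars data set (illustrative only; not an example from the course text):
# Fit a continuous dependent variable (stopping distance) on an independent variable (speed)
data(cars)
model <- lm(dist ~ speed, data = cars)
summary(model)                                             # coefficients, R-squared, p-values
predict(model, newdata = data.frame(speed = c(10, 20)))    # predictions for new values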
Introduction to Tableau:
Tableau is data visualization software that allows developers to build interactive dashboards
that are easily updated with new data and can be shared with a wider audience. Various Tableau
products are available in the market; some of the commonly known ones are Tableau Desktop,
Tableau Server, Tableau Online, Tableau Reader, and Tableau Public.
The important features of Tableau Software include the following:
➢ Single-click data analytics in visual form
➢ In-depth statistical analysis
➢ Management of metadata
➢ In-built, top-class data analytic practices
➢ In-built data engine
➢ Big data analytics
➢ Quick and accurate data discovery
➢ Business dashboards creation
➢ Various types of data visualization
➢ Social media analytics, including Facebook and Twitter
➢ Easy and quick integration of R
➢ Business intelligence through mobile
➢ Analysis of time series data
➢ Analysis of data from surveys
After installation, click the Continue button; the opening page of the Tableau tool appears.
Toolbar Icons:
Tableau is a GUI-oriented drag-and-drop tool. The following figure shows the icons present on the
Tableau toolbar.
o Undo/Redo – Moves backward or forward through your actions on the screen. You can
retrieve any step by clicking the Undo/Redo button.
o File Save – Saves your work. You need to click this button frequently,
as Tableau does not have an automated save function.
o Group – Allows you to group data by selecting more than one header
in a table or more than one value in a legend.
o Fit Menu – Allows different views of the Tableau screen. You can fit the
screen either horizontally or vertically.
o Fit Axis – Fixes the axis of the view. You can zoom charts in and out with this
button.
Main menu
File – Contains general functions such as Open, Save, and Save As. Other functions include Print to
PDF and the Repository Location option, used to review and change the default location of saved files.
Data – Helps to analyze tabular data on the Tableau website. The Edit Relationships option is used
to blend data when the field names in two data sources are not identical.
Worksheet – Provides options such as Export, Excel Crosstab, and Duplicate as Crosstab.
Dashboard – Provides the Actions menu, which is the most important option on the Dashboard menu
because all the actions related to Tableau worksheets and dashboards are defined within it. The
Actions menu is present under the Worksheet menu as well.
Story – Provides the New Story option, which is used for explaining the relationships among facts,
providing context to certain events, showing the dependency between decisions and outcomes, etc.
Analysis – Provides the Aggregate Measures and Stack Marks options. To create new measures or
dimensions, use Create Calculated Field or Edit Calculated Field.
Map – Provides options to change the color scheme and replace the default maps.
Window – Provides the Bookmark menu, which is used to create .tbm files that can be shared with
other users.
Help – Provides options to access Tableau's online manual, training videos, and sample workbooks.
Tableau Server
With Tableau Server, users can interact with dashboards on the server without installing anything on
their machines. Tableau Online is Tableau Server hosted by Tableau on a cloud platform.
Tableau Server also provides robust security for dashboards. The Tableau Server web-edit feature
allows authorized users to download and edit dashboards. Tableau Server allows users to
publish and share their data sources as live connections or extracts. It is highly secure for
visualizing data and leverages fast databases through live connections.
Depending on their utility and the amount of information they contain, Tableau saves and shares
files as:
• Tableau workbook – The default save type when you save your work on the desktop; such
files have the extension .twb. Packaged workbooks, with the extension .twbx, can be shared
with people who do not have a Tableau Desktop license or who cannot access the data source.
• Tableau bookmark – If you want to share a specific worksheet with others, use a Tableau
bookmark.
Tableau charts
Tableau can create different types of univariate, bivariate, and multivariate charts.
The following are some of the common chart types that Tableau can create:
• Trend lines – Trend lines are used to analyze the relationship between variables as well as
to predict future outcomes.
• Bullet graph – A bullet graph looks like a bar graph and is generally used to compare a
measure against a target or reference value.
• Box plot – A box plot represents the distribution of data and is used to compare multiple
sets of data. It effectively shows the median, the upper and lower quartiles, and outliers.
• Word cloud – Similar to bubble charts, the words in a word cloud are sized according to
the frequency with which they appear in the content.
18. Additional Topics
On the left we have a Driver Program that runs on the master node. The master node is the one that
is responsible for the entire flow of data and transformations across the multiple worker nodes.
Usually, when we write our Spark code, the machine to which we deploy acts as the master node.
After the Driver Program, the very first thing that we need to do is to initiate a SparkContext.
The SparkContext can be considered as a session using which you can use all the features available
in Spark. For example, you can consider the SparkContext as a database connection within your
application. Using that database connection you can interact with the database, similarly, using the
SparkContext you can interact with the other functionalities of Spark.
There is also a Cluster Manager installed that is used to control multiple worker nodes.
The SparkContext that we have generated in the previous step works in conjunction with
the Cluster Manager to manage and control various jobs across all the worker nodes. Whenever a
job has to be executed by the Cluster Manager, it splits up the entire job into multiple individual
tasks and then these tasks are distributed over the worker nodes. This is taken care of by the Driver
Program and the SparkContext. As soon as an RDD is created, it is distributed by the Cluster
Manager across the multiple worker nodes and cached there.
On the right-hand side of the architecture diagram, you can see that we have two worker nodes. In
practice, this can range from two to multiple worker nodes depending on the workload of the
system. The worker nodes actually act as the slave nodes that execute the tasks distributed to them
by the Cluster Manager. These worker nodes return the execution result of the tasks to
the SparkContext. A key point to mention here is that you can increase the number of worker nodes
so that the jobs are distributed across all of them and the tasks can be performed in parallel, which
greatly increases the speed of data processing.
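A hedged sketch of this flow using the SparkR package (assuming Spark is installed and SPARK_HOME is set); in SparkR, the session created by sparkR.session() plays the role the text describes for the SparkContext, and the application name and local master used here are illustrative:
library(SparkR)
# Start the entry point that the driver program uses to reach the cluster manager and workers
sparkR.session(master = "local[*]", appName = "BDA-demo")
# Distribute a small local data frame across the workers and run a grouped count on it
df <- as.DataFrame(faithful)
head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))
# Stop the session when finished
sparkR.session.stop()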
19. Known Gaps: Nil
Apache Pig is a platform for analysing large data sets that consists of a high-level language for
expressing data analysis programs, coupled with infrastructure for evaluating these programs. The
salient property of Pig programs is that their structure is amenable to substantial parallelization,
which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of
Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the
Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin,
which has the following key properties:
• Optimization opportunities. The way in which tasks are encoded permits the system to
optimize their execution automatically, allowing the user to focus on semantics rather than
efficiency.
21. UNIVERSITY QUESTION PAPERS OF PREVIOUS YEAR
22. References, Journals, websites, and E-links
Websites
1. https://azure.microsoft.com/en-in/resources/cloud-computing-dictionary/what-is-big-data-
analytics
2. https://www.tibco.com/reference-center/what-is-big-data-analytics
3. https://www.databricks.com/glossary/big-data-analytics
REFERENCES
Textbook(s)
1. Big Data, Black Book, DT Editorial Services, ISBN: 9789351197577, 2016 Edition
Reference Book(s)
1. The Journal of Big Data publishes open-access original research on data science and data
analytics.
2. Frontiers in Big Data is an innovative journal that focuses on the power of big data: its role in
machine learning, AI, and data mining, and its practical applications from cybersecurity to
climate science and public health.
3. The Big Data Research journal aims to promote and communicate advances in big data
research by providing a fast and high-quality forum for researchers, practitioners, and policy
makers.
4. Big Data and Cognitive Computing is an international, scientific, peer-reviewed, open
access journal of big data and cognitive computing published quarterly online by MDPI.
5. Journal on Big Data was launched at a time when the engineering features of big data are
setting off waves of exploration in algorithms, raising new challenges for big data, and driving
integration with industrial development.
6. Big Data Mining and Analytics (Published by Tsinghua University Press) discovers hidden
patterns, correlations, insights and knowledge through mining and analysing large amounts
of data obtained from various applications.
Class & Branch : B.Tech (CSE) IV Year I Sem Section: B Batch : 2020
Sl.No. Admn No. Student Name Sl.No. Admn No. Student Name
1 20R11A0549 BHASHETTY SAKETH 28 20R11A0576 Miss G PHALGUNI
2 20R11A0550 Miss SAMALA VENNELA 29 20R11A0577 G TEJAS
3 20R11A0551 SANDILA MALLESH 30 20R11A0578 Miss GANGIREDDY SAI ASRITHA
4 20R11A0552 Mr. SHRIKERAA BERAAR 31 20R11A0579 Mr. GANGU DATTHA KRISHNA
5 20R11A0553 Mr. T KRANTHI KIRAN 32 20R11A0580 Miss GOPISETTY BHANU SRI
6 20R11A0554 T S SAI TEJA 33 20R11A0581 GUDUR LIKITHA
7 20R11A0555 TALLAPRAGADA ROHITH 34 20R11A0582 GUTHIKONDA DHRUTI
8 20R11A0556 TANNERU YAKSHITHA 35 20R11A0583 HEMA HARSHITHA JAYANTHI
9 20R11A0557 Mr. THADURU UDAY KIRAN 36 20R11A0584 HRISHITHA ANUMANDLA
10 20R11A0558 THATIKONDA NIKHIL 37 20R11A0585 Miss ILAPANDA RENUKA
11 20R11A0559 Miss VARALA SINDHU 38 20R11A0586 JEEDIPELLI SANJAY KUMAR REDDY
12 20R11A0560 VEPADINNA VAISHNAVI REDDY 39 20R11A0587 Miss K GREESHMA
13 20R11A0561 Mr. A K SRI HARSHA 40 20R11A0588 Miss K ROHITHA
14 20R11A0562 Mr. ADIMULAM VAMSHI KRISHNA 41 20R11A0589 KANDULA SAI PRANAV RAJ
15 20R11A0563 Miss ADURTHI NAGA HARSHINI 42 20R11A0590 Miss KOLETI PAVANI PRIYA
16 20R11A0564 Mr. ANNADANAM ABHIRAM 43 20R11A0591 KOMPELLA SYAMALA PAVANI
17 20R11A0565 Mr. ANUMULA RAGHAVENDRA 44 20R11A0592 Miss KOSGI AAKANKSHA REDDY
18 20R11A0566 APOORVA GITTI 45 20R11A0593 Miss KHUSHI JHA
19 20R11A0567 Mr. B SWABHAN RAO 46 20R11A0594 LANKA BHANU PRASAD
20 20R11A0568 Mr. BARI SHARATH KUMAR 47 20R11A0595 M GIRIDHAR SAI KRISHNA
21 20R11A0569 Mr. BHARTIPUDI SAKETH RAM 48 20R11A0596 Miss MADHI SUKRITA
22 20R11A0570 Miss BITIKONDA BHAGYA SREE 49 21R15A0506 Miss SANIKE SREE VARSHA
23 20R11A0571 CHERUKU SHYLI 50 21R15A0507 Mr. BOSU KARTHIK
24 20R11A0572 Miss CHITTURI ARUNA 51 21R15A0508 Mr. GUDE NITHIN KUMAR
25 20R11A0573 Miss D SREE SAI SOUJANYA 52 21R15A0509 Miss ORUGANTI RAJINI
26 20R11A0574 Mr. DAIDA MANIDEEP 53 21R15A0510 Miss NAGARAJU SAI NAGA BHARGAVI
27 20R11A0575 Miss ENUGULA POOJITHA
Class & Branch : B.Tech (CSE) IV Year I Sem Section: C Batch : 2020
Sl.No. Admn No. Student Name Sl.No. Admn No. Student Name
1 19R11A05M5 Miss KATEPALLY SATHVIKA 27 20R11A05C2 Miss AEMIREDDY CHANDANA
2 20R11A0597 Mr. MADIRAJU SRIRAM 28 20R11A05C3 VELATI AKHIL
3 20R11A0598 Mr. MARAMAMULA SAI SRI VISHNU 29 20R11A05C4 Miss AKULA PRATHUSHA
4 20R11A0599 Miss MENDU PRIYA SPANDANA 30 20R11A05C5 Mr. ANAMALLA AKSHITH
5 20R11A05A0 Mr. N AKASH REDDY 31 20R11A05C6 Mr. ANIKETH KAMLEKAR
6 20R11A05A1 Mr. PASUNURU NITHIN SAI 32 20R11A05C7 Miss ARKATALA KEERTHANA
7 20R11A05A2 Mr. R JAYANTH 33 20R11A05C8 Mr. AVULA AJAY
8 20R11A05A3 Mr. RAMA RAHUL 34 20R11A05D0 Miss C S DHRITHI RAO
9 20R11A05A4 Miss RAMADUGU GAYATHRI SUSMITHA 35 20R11A05D1 CHINTHAMANI SRUJAN
10 20R11A05A5 Miss RAMINENI BHAVYA 36 20R11A05D2 Miss CHITAKULA VIVEKA
11 20R11A05A6 Miss S JYOTHI 37 20R11A05D4 ELEMELA MADHUSUDHAN YADAV
12 20R11A05A7 SAI ASHRITH SARRAF 38 20R11A05D5 Miss ERUKULLA ANISHA
13 20R11A05A8 Mr. SAMUDRALA AKSHAY 39 20R11A05D6 Mr. GARJALA MEGHANSH RAO
14 20R11A05A9 SAMULA RAJASHEKAR REDDY 40 20R11A05D7 Mr. GEEDI SANJAY KUMAR
15 20R11A05B0 Miss SANKULA SREEYA 41 20R11A05D8 Mr. GIRIMAPURAM CHANDRALOKESH CHARY
16 20R11A05B1 Miss SIGHAKOLLI GAYATHRI SRI LASYA 42 20R11A05D9 Miss GOUDA SURYANI
17 20R11A05B2 Miss SRINIVASULA NEHA NIKHITHA 43 20R11A05E0 Mr. GUDIPELLY MUDITH REDDY
18 20R11A05B3 T ABHINAV MITHRA 44 20R11A05E1 Miss GUMPULA BHUMIKA
19 20R11A05B4 Mr. TATIPARTHI SUNNY 45 20R11A05E2 Mr. GUNJA VIJAY KUMAR
20 20R11A05B5 TEJASWI GUDAPATI 46 20R11A05E3 Mr. JOSHUA DANIEL PASUMARTHI
21 20R11A05B6 V LAXMANA VYAAS 47 20R11A05E4 Mr. KANAKAPUDI SATISH
22 20R11A05B7 VADLAPATLA BHAVANA 48 21R15A0511 Miss NERAVATLA LIKHITHA
23 20R11A05B8 Miss VEMULAPALLI PRAGNA SRI 49 21R15A0512 Miss DANTHURI NAVYASREE GOUD
24 20R11A05B9 YATA MAHESH 50 21R15A0513 Miss MANDA SOWMYA
25 20R11A05C0 YELLAM NITHIN ADITYA 51 21R15A0514 Mr. MD FARAZUDDIN
26 20R11A05C1 Mr. A SRI KARANCHANDRA 52 21R15A0515 Miss PUTTA PAVANI
Class & Branch : B.Tech (CSE) IV Year I Sem Section: D Batch : 2020
Sl.No. Admn No. Student Name Sl.No. Admn No. Student Name
1 20R11A05E5 Mr. KAPAKA SRI KRISHNA KOUSHIK 26 20R11A05H1 Mr. SUNKARI SUMANTH
2 20R11A05E6 Miss KASTURI ANUSHA 27 20R11A05H2 SURYA SAMEER
3 20R11A05E7 KOMATIREDDY SATHWIK PRAYANI 28 20R11A05H3 Miss TEAK SHREYA
4 20R11A05E8 Mr. KOPPULA SHARATH 29 20R11A05H4 THOGARU SAI KARTHIK
5 20R11A05F0 KOSUNURU SURYA RAJ 30 20R11A05H5 Mr. THOTA SAI KUMAR
6 20R11A05F1 LAMBU MEDHA REDDY 31 20R11A05H6 Miss USHASWI SAMAYAM
7 20R11A05F2 MARKA SAI TEJA GOUD 32 20R11A05H7 Mr. VARAGANTI NITHIN
8 20R11A05F3 Mr. MASKAM MANOJ KUMAR 33 20R11A05H8 Miss VEMULA MANASA
9 20R11A05F4 PEDAPROLU RITHIKA 34 20R11A05H9 VUDHANTHI VENKATA SESHA SAI ESHWAR
10 20R11A05F5 Mr. NAGAPURI ROHITH 35 20R11A05J0 YARLAGADDA VISHNU VARDHAN BABU
11 20R11A05F6 Mr. OMPRAKASH THADURU 36 20R11A05J2 Mr. AEMUNURI PAVAN KUMAR
12 20R11A05F7 PABBA SHREYA 37 20R11A05J3 Miss ANKAM SAAHITHI
13 20R11A05F8 PAKANATI SRINDHITHA 38 20R11A05J4 Mr. ASHRITH GADEELA
14 20R11A05F9 PANAGATLA AKHIL GOUD 39 20R11A05J5 BARLA SUMA REDDY
15 20R11A05G0 Mr. PERUMANDLA NITHIN 40 20R11A05J6 BAMANDLA NIKHILESHWAR
16 20R11A05G1 PONNADA NIKHILESH KUMAR 41 20R11A05J7 Miss BATHULA KHUSHI
17 20R11A05G2 POTTABATHINI OMKAR 42 20R11A05J8 BOBBA SRESHTA
18 20R11A05G3 PULA TARUN KUMAR 43 20R11A05J9 Mr. C MONISH
19 20R11A05G4 PULIPAKA SAI KARTHIKEYA 44 20R11A05K1 Miss DEVALLA GOWTHAMI
20 20R11A05G5 RAMINI POOJITHA 45 20R11A05K2 Mr. DHARAVATH ROUNITH
21 20R11A05G6 Miss S MADHURI 46 21R15A0516 Mr. MEKALA KIRAN KUMAR
22 20R11A05G7 SURAMPALLY UPENDRANATH YUKTHA YUKTHA 47 21R15A0517 Miss RAMATENKI HIMABINDU
23 20R11A05G8 SAIKETHAN ADHIKAM 48 21R15A0518 Miss VANKUDOTH NIKITHA
24 20R11A05G9 Miss SHAGUFTHA 49 21R15A0519 Miss INDU SRAVANI
25 20R11A05H0 Miss SUFIA 50 21R15A0520 Mr. DHARAVATH THARUN
Class & Branch : B.Tech (CSE) IV Year I Sem Section: E Batch : 2020
Sl.No. Admn No. Student Name Sl.No. Admn No. Student Name
1 20R11A05K3 Miss EPPALAPALLI HIRANMAYA SHARVANI 26 20R11A05M8 Mr. NANGUNURI SANDEEP KUMAR
2 20R11A05K4 Miss GANJIMALA DEVASENA 27 20R11A05M9 Miss P SHARANI
3 20R11A05K5 Miss GAVVALA ANUSHA 28 20R11A05N0 Miss P SUBHA SREE
4 20R11A05K6 Mr. GOGINENI KARTHIK 29 20R11A05N1 Miss PAGADALA DEEKSHITHA
5 20R11A05K7 Miss GURRAM SRI VAISHNAVI 30 20R11A05N2 Mr. PALLE VIKAS GOUD
6 20R11A05K8 Miss HANIFA TABASSUM 31 20R11A05N3 Miss PATNAM SANJANA
7 20R11A05K9 HARI SARADA 32 20R11A05N4 Miss POORVI JATOTH
8 20R11A05L0 HARIPRIYA GURLA 33 20R11A05N5 Mr. PULIKASHI VISHORE
9 20R11A05L1 Mr. K M VARSHEETH 34 20R11A05N7 Mr. SAI ADARSH MOKSHAGUNDAM
10 20R11A05L2 Miss KANDIPATI BHANUSRI 35 20R11A05N9 Mr. SATHYAVADA PAVAN KIRITY
11 20R11A05L3 KANNEGANTI SRI DIVYA 36 20R11A05P0 Mr. SINGAM SAI KIRAN
12 20R11A05L4 Mr. KOTA RUTHVIK RAJ 37 20R11A05P1 Miss SRUTHI SINGH
13 20R11A05L5 KUSUNAM MOHIT REDDY 38 20R11A05P2 Mr. SULAKE VISHAL KUMAR
14 20R11A05L6 LAXMISETTI KEERTHAN 39 20R11A05P3 Mr. SUNDAR SAI SRINIVAS A
15 20R11A05L7 Mr. M VANSH 40 20R11A05P4 Mr. VARIKOLU MADHAVAN
16 20R11A05L8 Miss MALLAIAHGARI AKSHITHA 41 20R11A05P6 Mr. VEESAM GANAPATHI
17 20R11A05L9 Miss MAMIDYALA DEEPIKA 42 20R11A05P7 Mr. VENKATA SURYAKARTIK EDARA
18 20R11A05M0 Miss MOGILI SATHVIKA REDDY 43 20R11A05P8 NISHIKANTH VIDEM
19 20R11A05M1 Mr. MOHAMMAD SHOAIB 44 20R11A05P9 Miss VUTHUNOORI LAHARI
20 20R11A05M2 Mr. MOHAMMED HAJI ALI KHAN 45 20R11A05Q0 Mr. YELLAGATTU SHIVA KUMAR
21 20R11A05M3 POLU NITHISH REDDY 46 21R15A0521 Miss THANNERU DEEPIKA
22 20R11A05M4 Miss MUKKAMALA LAXMI RIDDHI APARAJITA 47 21R15A0522 Miss ALAKANTI VICTORIA
23 20R11A05M5 Mr. NAINAVARAPU MURALI KRISHNA 48 21R15A0523 Miss VUTLAPALLY LAHARI
24 20R11A05M6 Mr. NALLA UDAY KUMAR 49 21R15A0524 Mr. YEGAMATI BIXAM REDDY
25 20R11A05M7 Miss NAMITHA JOANNE NETHALA