
ETHIOPIAN TECHNICAL UNIVERSITY

DEPARTMENT OF ICT
COURSE TITLE: BIG DATA ANALYSIS (IT 562)

PREPARED BY: Kidest Techane

MTR 619/13

SUBMITTED TO: Dr. Vasu

SUBMISSION DATE: 20/6/2021


1. List the main characteristics of big data architecture with a neat schematic diagram.

The term “big data” refers to data that is so large, fast or complex that it's difficult or
impossible to process using traditional methods. The act of accessing and storing large
amounts of information for analytics has been around a long time. Big data is a great
quantity of diverse information that arrives in increasing volumes and with ever-higher
velocity. Big data can be structured (often numeric, easily formatted and stored) or
unstructured (more free-form, less quantifiable). Nearly every department in a company can
utilize findings from big data analysis, but handling its clutter and noise can pose problems.
Big data can be collected from publicly shared comments on social networks and websites,
voluntarily gathered from personal electronics and apps, through questionnaires, product
purchases, and electronic check-ins. Big data is most often stored in computer databases and
is analyzed using software specifically designed to handle large, complex data sets.

 Volume: Volume refers to the sheer size of the ever-exploding data of the computing
world. It raises the question about the quantity of data.
 Velocity: Velocity refers to the processing speed. It raises the question of at what
speed the data is processed.
 Variety: Variety refers to the types of data. It raises the question of how disparate
the data formats are.
 Veracity: By veracity, we mean the truthfulness of data. In other words, how certain are
we about this data, or to what extent is the data of the kind it claims to be? In practice, this
concerns how meaningful the results drawn from the data are for the problem space we are
working on or exploring.
 Validity: Validity of data may sound similar to veracity of data, but they are related
rather than identical concepts. By validity, we mean the correctness and accuracy of data
with regard to its intended usage. In other words, data may have no veracity issues and yet
not be valid if it is not properly understood. Critically speaking, the same set of data may be
valid for one application or usage and invalid for another. Even when we are dealing with
data whose internal relationships are not easily defined in the initial stages, it is important to
verify, as far as possible, the relationships between the data elements so that the data can be
validated against its intended consumption.
 Volatility: Speaking of the volatility of big data, we can easily recall the retention
policies for structured data that we implement every day in our businesses. Once the
retention period expires, the data can be destroyed. For example, an online e-commerce
company may not want to keep a customer's purchase history beyond one year, because after
one year the default warranty on the product expires and there is no longer any need to
restore such data. Big data is no exception to retention rules and policies in real-world data
storage, but the issue is greatly magnified in the big data world and is not as easy to handle
as in the traditional data world: retention periods may be long, and the storage and security
required may become expensive to implement. In fact, volatility becomes significant
because of the volume, variety and velocity of the data.
 Value: Unlike the other V's of big data discussed in the earlier sections, this V is the
desired outcome of big data processing. We are always interested in extracting maximum
value from any big data set we are given to work with; here we must look for the true value
of the data. In other words, the value of the data must exceed its cost of ownership and
management. One must also pay attention to the investment in storage: storage may be
cost-effective and relatively cheap at the time of purchase, but such underinvestment may
hurt highly valuable data. For example, storing clinical trial data for a new drug on cheap,
unreliable storage may save money today but can put the data at risk tomorrow.

Big Data Characteristics, Elucidation and Description

 Volume (Size of Data): Quantity of collected and stored data; data sizes are typically in terabytes.
 Velocity (Speed of Data): The transfer rate of data between source and destination.
 Value (Importance of Data): The business value to be derived from big data.
 Variety (Type of Data): Different types of data, such as pictures, videos and audio, arrive at the receiving end.
 Veracity (Data Quality): Analysis of captured data is virtually worthless if the data is not accurate.
 Validity (Data Authenticity): Correctness or accuracy of the data used to extract results in the form of information.
Big Data technologies, such as Apache Spark and Cassandra, are in high demand. Companies are
looking for professionals who are skilled in using them to make the most of the data
generated within the organization.

These data tools help in handling huge data sets and identifying patterns and trends within them.
So, if you are planning to get into the Big Data industry, you have to equip yourself with these
tools. 

The Big Data technologies discussed here will help any company to increase its profits,
understand its customers better and develop quality solutions. And the best part is, you can start
learning these technologies from the tutorials and resources available on the Internet.

1. B) How would you show your understanding of the tools, trends and technology in big
data?

Big Data Technologies are broadly classified into two categories.

1. Operational Big Data Technologies


Operational Big Data Technologies cover the volume of data generated every day, such as
online transactions, social media activity or any information from a particular company that is
used for analysis by software based on big data technology. This data acts as the raw input for
big data analysis technology. A few examples of Operational Big Data Technologies include
information on multinational corporation (MNC) management, Amazon, Flipkart, Walmart, and
online ticketing for movies, flights, railways and more.

2. Analytical Big Data Technologies

Analytical Big Data Technologies concern the advanced adaptation of Big Data Technologies
and are rather more complex than Operational Big Data. This category includes the actual
analysis of Big Data that is essential to business decisions. Some examples in this area include
stock market analysis, weather forecasting, time series analysis and medical records analysis.

Let’s take a look at the top 5 Big Data technologies being used in IT Industries.

3) Top 5 Big Data technologies

1. Hadoop Ecosystem

The Hadoop framework was developed to store and process data with a simple programming
model in a distributed data processing environment. Data residing on many high-speed,
low-cost machines can be stored and analyzed. Enterprises have widely adopted Hadoop as a
Big Data technology for their data warehouse needs in the past year, and the trend seems set to
continue and grow in the coming year as well. Companies that have not explored Hadoop so far
will most likely see its advantages and applications.
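To make this concrete, below is a minimal sketch of the classic word-count job written as a Hadoop Streaming mapper and reducer in Python. This sketch is not taken from the course material; the script names mapper.py and reducer.py, and all paths in the usage line, are illustrative assumptions.

#!/usr/bin/env python3
# mapper.py - emits "word<TAB>1" for every word read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - sums the counts for each word (streaming input arrives sorted by key).
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Such scripts would typically be submitted with the Hadoop Streaming jar, for example: hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <hdfs input dir> -output <hdfs output dir> (the jar path and directories depend on the installation).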

2. Artificial Intelligence

Artificial Intelligence is a broad branch of computer science that deals with the development of
intelligent machines capable of carrying out tasks that typically require human intelligence. AI
is developing fast, from Apple’s Siri to self-driving cars. As an interdisciplinary branch of
science, it draws on a number of approaches, such as Machine Learning and Deep Learning, to
bring about a remarkable shift in most tech industries. AI is revolutionizing the existing Big
Data Technologies.

3. NoSQL Database

NoSQL covers a wide variety of Big Data database technologies that were developed to design
modern applications. The term denotes a non-SQL, or non-relational, database that provides a
mechanism for storing and retrieving data. Such databases are used in real-time web and Big
Data analytics. They store unstructured data and offer faster performance and flexibility while
handling various data types; examples include MongoDB, Redis and Cassandra. NoSQL
provides design integrity, easier horizontal scaling and control over data across a range of
devices. It uses data structures that differ from the defaults of relational databases, which
speeds up NoSQL computations. Facebook, Google, Twitter and similar companies store
terabytes of user data daily.
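As a concrete illustration, here is a minimal sketch using the official pymongo driver; the MongoDB address, the "shop" database and the "clicks" collection are assumptions made for the example, not details from this document.

# Minimal MongoDB example with the official pymongo driver.
# Assumes a MongoDB server is reachable at localhost:27017;
# the "shop" database and "clicks" collection are hypothetical names.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
clicks = client["shop"]["clicks"]

# NoSQL stores are schema-flexible: documents in one collection may differ in shape.
clicks.insert_one({"user": "u123", "page": "/checkout", "device": "mobile"})
clicks.insert_one({"user": "u456", "page": "/home"})  # no "device" field, still accepted

# Query by any field and iterate over the matching documents.
for doc in clicks.find({"page": "/checkout"}):
    print(doc["user"], doc.get("device", "unknown"))

This flexibility (no predefined schema, documents of varying shape in one collection) is what the paragraph above means by NoSQL handling unstructured and varied data types.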

4. R Programming

R is one of the open-source Big Data Technologies and programming languages. This free
software is widely used for statistical computing and visualization, and it is supported by
unified development environments such as Eclipse and Visual Studio. According to experts, it
is among the world’s leading languages for statistics. The system is also widely used by data
miners and statisticians to develop statistical software and, above all, for data analysis.

5. Data Lakes

A Data Lake is a consolidated repository for storing data of all formats and at all levels, both
structured and unstructured.

Data can be saved as-is during accumulation, without first being transformed into structured
data. A data lake enables numerous types of data analysis, from dashboards and data
visualization to real-time big data transformation, for better business inference.
Businesses that use Data Lakes stay ahead of their competitors and can carry out new kinds of
analytics, such as Machine Learning over new log file sources, data from social media and
click-streams.

This Big Data technology helps enterprises respond to business growth opportunities by
understanding and engaging clients, sustaining productivity, maintaining devices proactively,
and making informed decisions.

What are the best Big Data Tools?

Here is the list of top 10 big data tools –

 Apache Hadoop
 Apache Spark
 Flink
 Apache Storm
 Apache Cassandra
 MongoDB
 Kafka
 Tableau
 RapidMiner
 R Programming

Trends that are set to be the norm of Big Data Analytics

1. Inclusion of all aspects of Artificial Intelligence

As stated earlier, Big Data is a concept that comes under the purview of the much broader AI
spectrum, and the two actually work together to build smart systems. Data makes AI more
effective by enabling better predictions, and AI in return helps transform data into more useful,
actionable forms. Together they form a virtuous circle of interdependence. Since smarter
machines are the need of the hour for all businesses, we can expect to see a great deal of work
on, and applicability of, both these technologies in the coming times.

2. More utilization of Data as a Service

Have you noticed the embedded COVID-19 death and patient counts on various websites
during the pandemic?

This is an example of businesses developing and offering data as a service for other businesses
to embed and utilize within their own work. Though some users may consider it detrimental to
user privacy and security, the technology has been helping businesses move data easily from
one platform to another in a frictionless way, without vendor lock-in or data accessibility,
administration and collaboration issues. DaaS is thus expected to see its own share of glory in
the times to come.

3. Things tend to get faster with Quantum Computing

Technologies are evolving by the day, and most of them require data as food. But it is mainly
the speed at which they can ingest and digest this ‘food’ that separates them in functionality and
efficiency.

Google has already developed a quantum-computing-based processor named Sycamore (in
quantum computing, decisions of yes or no are not taken by the binary digits 1 and 0 but by
much faster qubits, or quantum bits), which is claimed to have solved in 200 seconds a problem
that a state-of-the-art classical supercomputer would take more than 10,000 years to resolve.
Machine Learning algorithms have so far been limited by the computational speed and
capability of classical computers. Quantum computing is a nascent development that promises
to process large data sets at much higher speeds, analyzing data efficiently to identify patterns
and anomalies in real time, which makes it much more attractive for businesses worldwide.
Quantum computing can also help integrate data by running comparisons to quickly analyze
and understand the relationship between two or more predictive models, or the effectiveness of
algorithms.
4. Edge Computing for better problem solving

There are more than 30 billion connected devices out there, with the number soon expected to
touch the 50 billion mark. These IoT devices are the new normal, and businesses are on the
lookout for ways to better utilize the enormous amount of data that they generate all the time.
Edge computing is a new development framework in this regard, in which processing happens
closer to the source or destination of the data rather than all data going directly to the cloud. As
businesses become more protective of the data they generate and the value it holds, the trend is
sure to see much wider use and scope in the future.

5. Hybrid Clouds in Big Data Analytics

With the rise of cyber-attacks and concerns about the privacy and security of data in the cloud,
businesses are opting for hybrid clouds. This single-infrastructure cloud model enables one or
more public clouds to work in synchronization with one or more private clouds, leading to a
more comprehensive environment in which security, including mobile app security, remains a
major concern. To develop this cloud topology, an organization must have a private cloud in
place so that it can adapt to and integrate with the desired public cloud.

2. What are the best practices in Big Data analytics? Explain the techniques used in Big
Data Analytics.
Now, with the knowledge of what is big data and what it offers, organizations must know how
analytics must be practiced to make the most of their data. The list below shows five of the best
practices for big data:  
1. UNDERSTAND THE BUSINESS REQUIREMENTS
Analyzing and understanding the business requirements and organizational goals is the first and
foremost step, and it must be carried out even before introducing big data analytics into your
projects. The business users must understand which projects in their company should use big
data analytics to make the maximum profit.
2. DETERMINE THE COLLECTED DIGITAL ASSETS
The second best practice is to identify the type of data pouring into the organization, as well as
the data generated in-house. Usually, the data collected is disorganized and in varying formats.
Moreover, some data is never even exploited (so-called dark data), and it is essential that
organizations identify this data too.
3. IDENTIFY WHAT IS MISSING
The third practice is analyzing and understanding what is missing. Once you have collected the
data needed for a project, identify the additional information that might be required for that
particular project and where it can come from. For instance, if you want to leverage big data
analytics in your organization to understand your employees' well-being, then along with
information such as login and logout times, medical reports, and email reports, you need some
additional information about the employees', say, stress levels. This information can be
provided by co-workers or leaders.
4. COMPREHEND WHICH BIG DATA ANALYTICS MUST BE LEVERAGED
After analyzing and collecting data from different sources, it is time for the organization to
understand which big data technologies, such as predictive analytics, stream analytics, data
preparation, fraud detection, sentiment analysis, and so on, can best serve the current business
requirements. For instance, big data analytics helps the HR team in the recruitment process to
identify the right talent faster by combining data from social media and job portals using
predictive and sentiment analysis.
5. ANALYZE DATA CONTINUOUSLY
This is the final best practice that an organization must follow when it comes to big data. You
must always be aware of what data your organization holds and what is being done with it.
Check the health of your data periodically so that you never miss any important but hidden
signals in the data. Before implementing any new technology in your organization, it is vital to
have a strategy that helps you get the most out of it. With adequate and accurate data at their
disposal, companies must also follow the above-mentioned big data practices to extract value
from this data.

2. B) How can you identify the companies that are using Big Data Analytics in
Ethiopia?

Some of the applications of BDA include market segmentation, sales forecasting, weather
forecasting, payment fraud detection, crop disease detection, e-commerce analysis, purchasing
recommendations for users, and others. The application of BDA is not limited to economically
developed regions; it is also important for resource-constrained environments. The challenges
of identifying and utilizing big data analytics in such a resource-constrained environment have
been explored for the case of Ethiopia using case studies. To identify the companies that are
using big data analytics in Ethiopia, we first have to focus on the potential industries that can
generate big data in Ethiopia, such as the Ethiopian Telecommunication Corporation, the
Agricultural Transformation Agency, payment systems like Hello Cash, and Ethiopian
educational networks. Then we have to collect data using a semi-structured interview approach.
There may be challenges in identifying BDA in a resource-constrained environment, some of
which are: lack of BDA awareness, data integration challenges, lack of skilled experts in the
area, lack of data correctness and completeness, lack of a standardized data registry, lack of
leadership and management skills, data privacy issues, and infrastructure challenges including
constraints on the large volume of storage devices required.

3. A) What is the difference between Hadoop and a traditional RDBMS?

Hadoop is fundamentally an open-source infrastructure software framework that allows
distributed storage and processing of huge amounts of data, i.e. Big Data. It is a cluster system
that works with a master-slave architecture; with such an architecture, large data sets can be
stored and processed in parallel. Different types of data can be analyzed: structured (tables),
unstructured (logs, email bodies, blog text) and semi-structured (media file metadata, XML,
HTML).

RDBMS stands for relational database management system. It is a database system based on
the relational model specified by Edgar F. Codd in 1970. Database management software such
as Oracle Server, MySQL and IBM DB2 is based on the relational database management
system.

Data in an RDBMS is represented in the form of rows, or tuples, organized into tables. A table
is basically a collection of related data objects and consists of columns and rows. Normalization
plays a crucial role in an RDBMS. The database contains a group of tables, and each table has a
primary key.

The key difference between an RDBMS and Hadoop is that the RDBMS stores structured data,
while Hadoop stores structured, semi-structured, and unstructured data. The RDBMS is a
database management system based on the relational model, whereas Hadoop is software for
storing data and running applications on clusters of commodity hardware.

Hadoop vs. RDBMS

 Hadoop: stores data in its native format / RDBMS: stores data in a structured format
 Hadoop: scales horizontally as well as vertically / RDBMS: scales up vertically
 Hadoop: key-value pairs / RDBMS: relational tables
 Hadoop: functional programming / RDBMS: declarative queries
 Hadoop: offline batch processing / RDBMS: online transaction processing
 Hadoop: dynamic schema / RDBMS: predefined schema
 Hadoop: schema-less (model-less) approach / RDBMS: relational model
 Hadoop: preferred for very large data sets / RDBMS: not preferred for very large data sets and not the best fit for hierarchical data
 Hadoop: open source / RDBMS: open source as well as closed source

B) Highlight the features of Hadoop and explain the functionalities of a Hadoop cluster.
Open Source
 Hadoop is an open source framework which means it is available free of cost.
 Also, the users are allowed to change the source code as per their requirements.
Distributed Processing
 Hadoop supports distributed processing of data i.e. faster processing.
 The data in Hadoop HDFS is stored in a distributed manner, and MapReduce is
responsible for the parallel processing of that data.
Fault Tolerance
 Hadoop is highly fault-tolerant. It creates three replicas for each block at different nodes,
by default.
 This number can be changed according to the requirement (a short sketch after this list
shows how), so we can recover the data from another node if one node fails.
 The detection of node failure and recovery of data is done automatically.
Reliability
 Hadoop stores data on the cluster in a reliable manner that is independent of any single machine.
 So, the data stored in the Hadoop environment is not affected by the failure of a machine.

Scalability

 Another important feature of Hadoop is scalability. It is compatible with commodity
hardware, and we can easily add new hardware to the nodes.
 Hadoop supports the storage and processing of big data and is among the best solutions
for handling big data challenges.
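As a small illustration of the fault-tolerance and distributed-storage features above, here is a hedged sketch using the third-party Python "hdfs" (WebHDFS) client. The NameNode address (http://localhost:9870 is the usual Hadoop 3.x WebHDFS port; Hadoop 2.x uses 50070), the user name and the file paths are assumptions made for the example.

# Minimal HDFS sketch using the third-party "hdfs" (WebHDFS) Python client.
# Assumptions: a NameNode WebHDFS endpoint at http://localhost:9870, an HDFS
# user named "hadoop", and example paths; adapt these to your own cluster.
from hdfs import InsecureClient

client = InsecureClient("http://localhost:9870", user="hadoop")

# Upload a local file into HDFS; its blocks are distributed across DataNodes.
client.upload("/data/sales.csv", "sales.csv", overwrite=True)

# Each block is kept in 3 replicas by default; raise the factor for critical data.
client.set_replication("/data/sales.csv", replication=5)

# List the directory to confirm the file landed where expected.
print(client.list("/data"))

Changing the replication factor per file (or cluster-wide in the configuration) is how the default of three replicas "can be changed according to the requirement", as noted in the fault-tolerance feature above.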

C. The best configuration for executing Hadoop jobs is dual-core machines or dual processors
with 4 GB or 8 GB of RAM that use ECC memory. Hadoop benefits greatly from ECC memory,
even though it is not low-end hardware. ECC memory is recommended for running Hadoop
because many Hadoop users have experienced various checksum errors when using non-ECC
memory.

Software Requirement: The main software requirement for Hadoop is Java, since the Hadoop
framework is mostly written in the Java programming language. Java™ must be installed; the
recommended Java versions are described on the Hadoop Java Versions page. In addition, ssh
must be installed and sshd must be running in order to use the Hadoop scripts that manage
remote Hadoop daemons, if the optional start and stop scripts are to be used.
Prerequisite: To install Hadoop, you should have Java version 1.8 on your system. Check your
Java version with the command java -version at the command prompt, and install Java if it is
not already present.

4. A) Explain the significance of the Hadoop Distributed File System and its applications.

Why is Hadoop important?


1. Ability to store and process huge amounts of any kind of data, quickly: With data volumes and
varieties constantly increasing, especially from social media and the Internet of Things (IoT),
that's a key consideration.

2. Computing power: Hadoop's distributed computing model processes big data fast. The more
computing nodes you use, the more processing power you have.

3. Fault tolerance. Data and application processing are protected against hardware failure. If a
node goes down, jobs are automatically redirected to other nodes to make sure the distributed
computing does not fail. Multiple copies of all data are stored automatically.

4. Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before
storing it. You can store as much data as you want and decide how to use it later. That includes
unstructured data like text, images and videos.

5. Low cost. The open-source framework is free and uses commodity hardware to store large
quantities of data.

6. Scalability. You can easily grow your system to handle more data simply by adding nodes.
Little administration is required.

Hadoop Applications:

 Making Hadoop Applications More Widely Accessible.


 A Graphical Abstraction Layer on Top of Hadoop Applications.
 With its Eclipse-based graphical workspace, Talend Open Studio for Big Data enables
developers and data scientists to leverage Hadoop loading and processing technologies
such as HDFS, HBase, Hive, and Pig without having to write Hadoop application code.
By simply selecting graphical components from a palette and arranging and configuring
them, you can create Hadoop jobs that, for example:
 Load data into HDFS (Hadoop Distributed File System)
 Use Hadoop Pig to transform data in HDFS.

B. Difference between Name Node, Checkpoint Name Node and Backup Node

The Name Node is the core of HDFS; it manages the metadata, that is, the information about
which file maps to which block locations and which blocks are stored on which Data Node. In
simple terms, it is the data about the data being stored. The Name Node supports a directory
tree-like structure consisting of all the files present in HDFS on a Hadoop cluster. It uses the
following files for the namespace:
fsimage file: it keeps track of the latest checkpoint of the namespace.
edits file: it is a log of the changes that have been made to the namespace since the last checkpoint.
The Checkpoint Name Node has the same directory structure as the Name Node and creates
checkpoints for the namespace at regular intervals by downloading the fsimage and edits files
and merging them in a local directory. The new image produced by the merge is then uploaded
back to the Name Node.
There is a similar node, commonly known as the Secondary Name Node, but it does not
support the ‘upload to Name Node’ functionality.
The Backup Node provides functionality similar to the Checkpoint Name Node while
enforcing synchronization with the Name Node. It maintains an up-to-date, in-memory copy of
the file system namespace and therefore does not need to fetch changes at regular intervals. To
create a new checkpoint, the Backup Node only needs to save its current in-memory state to an
image file.
5. A) What is commodity hardware?

Commodity hardware, sometimes known as off-the-shelf hardware, is a computer device or IT
component that is relatively inexpensive, widely available and basically interchangeable with
other hardware of its type. Unlike purpose-built hardware designed for a specific IT function,
commodity hardware can perform many different functions.

5. B) How big data analysis helps businesses increase their revenue? Give example
and Name some companies that use Hadoop.

 Better targeted customer marketing


 Improved product analytics
 Improved business planning
 Improved supply chain management
 Improved analysis for fraud, waste, and abuse
 Business intelligence, querying, reporting and searching, including many implementations
of searching, filtering, indexing, speeding up aggregation for reporting and report
generation, trend analysis, search optimization, and general information retrieval.
 Improved performance for common data management operations, with the majority
focusing on log storage, data storage and archiving, followed by sorting, running joins,
extraction/transformation/loading (ETL) processing, other types of data conversion, and
duplicate analysis and elimination.
 Non-database applications, such as image processing, text processing in preparation for
publishing, genome sequencing, protein sequencing and structure prediction, web
crawling, and monitoring workflow processes.
 Data mining and analytical applications, including social network analysis, facial
recognition, profile matching, other types of text analytics, web mining, machine
learning, information extraction, personalization and recommendation analysis, ad
optimization, and behavior analysis.
