ETHIOPIAN TECHNICAL UNIVERSITY
DEPARTMENT OF ICT
COURSE TITLE: BIG DATA ANALYSIS (IT 562)
Mtr 619/13
The term “big data” refers to data that is so large, fast or complex that it's difficult or
impossible to process using traditional methods. The act of accessing and storing large
amounts of information for analytics has been around a long time. Big data is a great
quantity of diverse information that arrives in increasing volumes and with ever-higher
velocity. Big data can be structured (often numeric, easily formatted and stored) or
unstructured (more free-form, less quantifiable). Nearly every department in a company can
utilize findings from big data analysis, but handling its clutter and noise can pose problems.
Big data can be collected from publicly shared comments on social networks and websites,
voluntarily gathered from personal electronics and apps, through questionnaires, product
purchases, and electronic check-ins. Big data is most often stored in computer databases and
is analyzed using software specifically designed to handle large, complex data sets.
Volume: Volume refers to the sheer size of the ever-exploding data of the computing
world. It raises the question about the quantity of data.
Velocity: Velocity refers to the processing speed. It raises the question of at what
speed the data is processed.
Variety: Variety refers to the types of data. It raises the question of how disparate
the data formats are.
Veracity: By veracity, we mean the truthfulness of data. In other words, how certain are we
about this data, and to what extent is the data actually of the kind it claims to be? In practice,
this is mostly about how meaningful the results drawn from the data are for the problem space
we are working on or exploring.
Validity: Validity of data may sound similar to veracity of data; the two concepts are related
but not the same. By validity, we mean the correctness and accuracy of data with regard to its
intended use. In other words, data may have no veracity issues and yet still be invalid if it is
not properly understood. Critically speaking, the same set of data may be valid for one
application or use and invalid for another. Even though we are often dealing with data whose
relationships are not easy to define in the initial stages, it is important to verify, as far as
possible, the relationships between the data elements so that the data can be validated against
its intended consumption.
Volatility: Speaking of the volatility of big data, we can easily recall the retention policies for
structured data that we implement every day in our businesses. Once the retention period
expires, we can destroy the data. As an example, an online e-commerce company may not
want to keep customer purchase history beyond one year, because after one year the default
warranty on its products expires and there is no longer any need to restore such data. Big data
is no exception to this kind of rule and policy in real-world data storage, but the issue is greatly
magnified in the big data world and is not as easy to handle as in the traditional data world.
Big data retention periods may be longer, and storage and security may become expensive to
implement. In fact, volatility becomes significant because of the volume, variety and velocity
of the data.
Value: Unlike the other V's of big data discussed above, this V is the desired outcome of big
data processing. We are always interested in extracting the maximum value from any big data
set we are given to work with, and we must look for the true value of that data. In other words,
the value of the data must exceed its cost of ownership and management. One must pay
attention to the investment in storage: storage may be cost-effective and relatively cheap at the
time of purchase, but such underinvestment can hurt highly valuable data. For example, storing
clinical trial data for a new drug on cheap and unreliable storage may save money today but
can put the data at risk tomorrow.
The six V's can be summarized as follows:
Volume (Size of Data): the quantity of collected and stored data; data sizes are in terabytes.
Velocity (Speed of Data): the transfer rate of data between source and destination.
Value (Importance of Data): the business value to be derived from big data.
Variety (Type of Data): the different types of data, such as pictures, videos and audio, arriving at the receiving end.
Veracity (Data Quality): analysis of captured data is virtually worthless if the data is not accurate.
Validity (Data Authenticity): the correctness or accuracy of the data used to extract results in the form of information.
Big Data technologies such as Apache Spark and Cassandra are in high demand. Companies are
looking for professionals who are skilled in using them to make the most of the data generated
within the organization.
These data tools help in handling huge data sets and identifying patterns and trends within them.
So, if you are planning to get into the Big Data industry, you have to equip yourself with these
tools.
The Big Data technologies discussed here will help any company to increase its profits,
understand its customers better and develop quality solutions. And the best part is, you can start
learning these technologies from the tutorials and resources available on the Internet.
1. B) How would you show your understanding of the tools, trends and technology in big
data?
Analytical Big Data technologies cover the advanced, analysis-oriented side of Big Data, which
is more complex than operational Big Data. This category includes the actual analysis of Big
Data that is essential to business decisions. Some examples in this area include stock market
analysis, weather forecasting, time-series analysis and medical records analysis.
Let’s take a look at the top 5 Big Data technologies being used in the IT industry.
1. Hadoop Ecosystem
The Hadoop framework was developed to store and process data with a simple programming
model in a distributed data processing environment. Data residing on many high-speed,
low-cost machines can be stored and analyzed. Enterprises have widely adopted Hadoop as a
Big Data technology for their data warehouse needs over the past year, and the trend seems set
to continue and grow in the coming year as well. Companies that have not explored Hadoop so
far will most likely come to see its advantages and applications.
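As a small illustration of the MapReduce programming model at the heart of Hadoop, the sketch below simulates a word count locally in Python. This is only a toy, single-machine version of the idea; on a real cluster the map and reduce phases would run distributed across nodes (for example as Hadoop Streaming or MapReduce jobs), and the sample input lines are invented for the example.

# Minimal local simulation of the MapReduce word-count idea (illustrative only).
from collections import defaultdict

def map_phase(lines):
    # Map step: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # Reduce step: sum the counts for each word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

if __name__ == "__main__":
    sample = ["big data needs big storage", "hadoop stores big data"]
    print(reduce_phase(map_phase(sample)))
    # e.g. {'big': 3, 'data': 2, 'needs': 1, 'storage': 1, 'hadoop': 1, 'stores': 1}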
2. Artificial Intelligence
Artificial Intelligence is a broad area of computer science that deals with the development of
intelligent machines capable of carrying out tasks that typically require human intelligence. AI
is developing fast, from Apple’s Siri to self-driving cars. As an interdisciplinary branch of
science, it draws on a number of approaches, such as Machine Learning and Deep Learning,
that are making a remarkable shift in most tech industries. AI is also revolutionizing existing
Big Data technologies.
3. NoSQL Database
NoSQL covers a wide variety of Big Data database technologies developed for building modern
applications. The term denotes non-SQL, or non-relational, databases that provide a mechanism
for storing and retrieving data. They are used in real-time web and Big Data analytics. A NoSQL
database stores unstructured data and offers faster performance and flexibility while handling
various data types; examples include MongoDB, Redis and Cassandra. It offers simpler design,
easier horizontal scaling and finer control over availability across a range of devices. Because
it uses data structures different from those relational databases use by default, many operations
are faster in NoSQL. Companies such as Facebook, Google and Twitter store terabytes of user
data daily in such systems.
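As a brief sketch of the document-oriented, schema-flexible style these databases use, the example below uses the pymongo driver against MongoDB. It assumes a MongoDB instance is running locally, and the database, collection and field names are made up for the illustration.

# Minimal sketch of storing and querying schema-flexible documents in MongoDB.
# Assumes MongoDB is running locally and the pymongo driver is installed.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["demo_db"]["user_events"]   # hypothetical database/collection names

# Documents in the same collection do not need identical fields (dynamic schema).
events.insert_one({"user": "alice", "action": "click", "page": "/home"})
events.insert_one({"user": "bob", "action": "purchase", "amount": 19.99, "items": ["book"]})

# Query by field value; no table schema or JOIN is required.
for doc in events.find({"action": "purchase"}):
    print(doc["user"], doc.get("amount"))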
4. R Programming
R is an open-source Big Data technology and programming language. The free software is
widely used for statistical computing and visualization, and it is supported by development
environments such as Eclipse and Visual Studio. According to experts, it is among the world's
leading languages for statistical work. It is widely used by data miners and statisticians to
develop statistical software and, above all, for data analysis.
5. Data Lakes
A data lake is a consolidated repository for storing data in every format, both structured and
unstructured, at any scale.
Data can be saved during ingestion as-is, without first being transformed into structured data.
This makes it possible to run many types of data analysis, from dashboards and data
visualization to real-time Big Data transformation, for better business decisions.
Businesses that use data lakes stay ahead of their competitors and can carry out new kinds of
analytics, such as Machine Learning over new log-file sources, social media data and
click-streams.
This Big Data technology helps enterprises respond to business growth opportunities by
understanding and engaging customers, sustaining productivity, maintaining devices
proactively and making well-informed decisions.
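As a simplified sketch of the "store raw, analyze later" idea behind data lakes: raw JSON events are landed in a lake directory exactly as they arrive, and structure is applied only at analysis time. The directory layout and field names below are illustrative assumptions, and pandas stands in for a heavier analytics engine such as Spark.

# Minimal data lake sketch: land raw events untransformed, apply structure when reading.
import json
import pathlib
import pandas as pd

lake = pathlib.Path("lake/raw/clickstream")     # hypothetical lake path
lake.mkdir(parents=True, exist_ok=True)

# Ingestion: write events exactly as they arrive (no upfront schema).
raw_events = [
    {"user": "u1", "event": "page_view", "page": "/pricing"},
    {"user": "u2", "event": "signup", "plan": "pro"},
]
with open(lake / "events_2024_01_01.jsonl", "w") as f:
    for event in raw_events:
        f.write(json.dumps(event) + "\n")

# Analysis: read the raw files later and impose structure only now ("schema on read").
frames = [pd.read_json(p, lines=True) for p in lake.glob("*.jsonl")]
df = pd.concat(frames, ignore_index=True)
print(df.groupby("event").size())

Beyond the five technologies described above, other widely used Big Data tools and technologies include: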
Apache Hadoop
Apache Spark
Flink
Apache Storm
Apache Cassandra
MongoDB
Kafka
Tableau
RapidMiner
R Programming
As stated earlier, Big Data is a concept that falls within the much broader AI spectrum, and the
two actually work together to build smart systems. Big data makes AI more effective by
enabling better predictions, and AI in return helps transform data into a more useful, actionable
form. Together they form a virtuous circle of interdependence. Since smarter machines are the
need of the hour for all businesses, we should expect to see a great deal of work on, and
applications of, both these technologies in the coming times.
Have you noticed the embedded COVID-19 death and patient counts displayed on various
websites since the pandemic?
This is an example of businesses developing and offering Data as a Service (DaaS) for other
businesses to embed and use within their own work. Although some users see it as detrimental
to user privacy and security, the technology has been helping businesses move data easily from
one platform to another without vendor lock-in or problems of data accessibility,
administration and collaboration. DaaS is therefore expected to see its own share of glory in
the times to come.
Technologies are evolving by the day and most of them require data as food. But it is mainly the
speed at which they can ingest and digest this ‘food’ that separates them in functionality and
efficiency.
Did you know that Google has already developed a quantum-computing-based processor named
Sycamore (in quantum computing, decisions of yes or no are not taken by the binary digits 1
and 0 but by much faster qubits, or quantum bits), which is claimed to have solved in 200
seconds a problem that a state-of-the-art supercomputer would take more than 10,000 years to
resolve? Machine Learning algorithms have so far been limited by the slow computational
speeds and capabilities of classical computers. Quantum computing is a nascent trend that
promises to handle large data sets at much higher speeds, analyzing data efficiently to identify
patterns and anomalies in real time, which makes it much more attractive for businesses
worldwide. Quantum computing can also integrate data by running comparisons to quickly
analyze and understand the relationship between two or more predictive models, or the
effectiveness of algorithms.
4. Edge Computing for better problem solving
There are more than 30 billion connected devices out there, with the number expected to reach
the 50 billion mark soon. These IoT devices are the new normal, and businesses are therefore
looking for ways to make better use of the enormous amounts of data they generate all the time.
Edge computing is a development framework in this regard in which processing happens close
to the source or destination of the data, rather than everything going directly to the cloud. As
businesses become more protective of the data they generate and the value it holds, the trend is
sure to see much wider use and scope in the future.
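A toy sketch of the edge-computing idea described above: readings are summarized close to the device, so only a small aggregate, rather than every raw sample, needs to travel to the cloud. The sensor values and the "send" step are simulated purely for illustration.

# Toy edge-computing sketch: aggregate sensor data locally, ship only the summary upstream.
import random
import statistics

def read_sensor_samples(n=1000):
    # Stand-in for raw readings produced by an IoT device.
    return [20.0 + random.random() for _ in range(n)]

def summarize_at_edge(samples):
    # Runs on (or near) the device: reduce thousands of raw points to a few numbers.
    return {
        "count": len(samples),
        "mean": round(statistics.mean(samples), 3),
        "max": round(max(samples), 3),
    }

samples = read_sensor_samples()
summary = summarize_at_edge(samples)
print("sending to cloud:", summary)   # a few fields instead of 1000 raw readings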
With the rise of cyber-attacks and of privacy and security issues for data in the cloud, businesses
are opting for hybrid clouds. This cloud model lets one or more public clouds work in
synchronization with one or more private clouds under a single infrastructure, leading to a more
comprehensive environment in which application security remains a major concern. To develop
this cloud topology, an organization must have a private cloud that can achieve compatibility
with the desired public cloud.
2. What are the best practices in Big Data analytics? Explain the techniques used in Big
Data Analytics.
Now, with the knowledge of what big data is and what it offers, organizations must know how
analytics should be practiced to make the most of their data. The list below shows five of the
best practices for big data:
1. UNDERSTAND THE BUSINESS REQUIREMENTS
Analyzing and understanding the business requirements and organizational goals is the first and
foremost step, and it must be carried out even before leveraging big data analytics in your
projects. Business users must understand which projects in their company should use big data
analytics to make maximum profit.
2. DETERMINE THE COLLECTED DIGITAL ASSETS
The second big data best practice is to identify the type of data pouring into the organization, as
well as the data generated in-house. Usually the collected data is disorganized and comes in
varying formats. Moreover, some data is never even exploited (so-called dark data), and it is
essential that organizations identify this data too.
3. IDENTIFY WHAT IS MISSING
The third practice is analyzing and understanding what is missing. Once you have collected the
data needed for a project, identify the additional information that might be required for that
particular project and where it can come from. For instance, if you want to leverage big data
analytics in your organization to understand your employees' well-being, then along with
information such as login and logout times, medical reports and email records, you need some
additional information about, say, the employees' stress levels. This information can be
provided by co-workers or team leaders.
4. COMPREHEND WHICH BIG DATA ANALYTICS MUST BE LEVERAGED
After analyzing and collecting data from different sources, it is time for the organization to
understand which big data analytics techniques, such as predictive analytics, stream analytics,
data preparation, fraud detection, sentiment analysis and so on, can best serve the current
business requirements. For instance, big data analytics helps a company's HR team identify the
right talent faster during recruitment by combining social media and job-portal data using
predictive and sentiment analysis.
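To make the recruitment example slightly more concrete, here is a deliberately simple, self-contained sentiment-scoring sketch over hypothetical candidate feedback. Real systems would use trained NLP models; the tiny word lists and comments below are invented assumptions for illustration only.

# Minimal lexicon-based sentiment scoring over hypothetical candidate feedback.
POSITIVE = {"excellent", "reliable", "skilled", "great", "strong"}
NEGATIVE = {"poor", "late", "weak", "unreliable", "bad"}

def sentiment_score(text):
    # Positive words add 1, negative words subtract 1; the sign gives the overall sentiment.
    words = text.lower().split()
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

comments = [
    "excellent communicator and reliable team player",
    "strong technical skills but often late on deliverables",
]
for c in comments:
    print(sentiment_score(c), "-", c)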
5. ANALYZE DATA CONTINUOUSLY
This is the final best practice that an organization must follow when it comes to big data. You
must always be aware of what data your organization holds and what is being done with it.
Check the health of your data periodically so that you never miss important but hidden signals
in it. Before implementing any new technology in your organization, it is vital to have a
strategy that helps you get the most out of it. With adequate and accurate data at their disposal,
companies must also follow the above-mentioned big data practices to extract value from that
data.
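As a small sketch of what "checking the health of your data periodically" can look like in practice, the checks and thresholds below (missing-value ratio, duplicate rows, data freshness) are illustrative assumptions implemented with pandas; a real pipeline would tune and extend them.

# Illustrative periodic data-health check: missing values, duplicates and freshness.
import pandas as pd

def data_health_report(df, timestamp_col, max_age_days=7, max_missing_ratio=0.05):
    report = {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_ratio": float(df.isna().mean().mean()),
    }
    age_days = (pd.Timestamp.now() - df[timestamp_col].max()).days
    report["stale"] = age_days > max_age_days
    report["too_many_missing"] = report["missing_ratio"] > max_missing_ratio
    return report

# Hypothetical sample data for the check.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount": [10.0, 25.5, 25.5, None],
    "created_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03"]),
})
print(data_health_report(df, "created_at"))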
2. B) How can you identify the companies that are using Big Data Analytics in
Ethiopia?
Some of the applications of BDA include market segmentation, sales forecasting, weather
forecasting, payment fraud detection, crop disease detection, e-commerce analysis and
purchasing recommendations for users, among others. The application of BDA is not limited to
economically developed regions; it is also important for resource-constrained environments,
and the challenges of identifying and utilizing big data analytics in such an environment have
been explored for the case of Ethiopia through case studies. To identify the companies that are
using big data analysis in Ethiopia, we first have to focus on the industries that can generate
big data in Ethiopia, such as the Ethiopian Telecommunication Corporation, the Agricultural
Transformation Agency, payment systems like Hello Cash, and the Ethiopian Educational
Networks. Then we have to collect data using a semi-structured interview approach. There may
be challenges in identifying BDA in a resource-constrained environment. Some of these are:
lack of BDA awareness, data integration challenges, lack of skilled experts in the area, lack of
data correctness and completeness, lack of a standardized data registry, lack of leadership and
management skills, data privacy issues, and infrastructure challenges, including constraints on
large-volume storage devices.
RDBMS stands for relational database management system. It is a database system based on
the relational model specified by Edgar F. Codd in 1970. Database management software such
as Oracle, MySQL and IBM DB2 is based on the relational database management system.
Data in an RDBMS is represented in the form of rows, or tuples, stored in tables. A table is
basically a collection of related data objects and consists of columns and rows. Normalization
plays a crucial role in RDBMS. The database contains a group of tables, and each table has a
primary key.
The key difference between an RDBMS and Hadoop is that an RDBMS stores structured data,
while Hadoop stores structured, semi-structured and unstructured data. An RDBMS is a
database management system based on the relational model, whereas Hadoop is software for
storing data and running applications on clusters of commodity hardware.
Hadoop                                      RDBMS
Stores data in its native format            Stores data in a structured format
Scales horizontally as well as vertically   Scales up vertically
Key-value pairs                             Relational tables
Functional programming                      Declarative queries
Offline batch processing                    Online transaction processing
Dynamic schema                              Predefined schema
Model-less approach                         Relational model
Preferred for very large databases          Not preferred for very large databases
Open source                                 Available as open source as well as closed source
                                            Not the best fit for hierarchical data
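To make the "predefined schema vs. native format" contrast in the table above concrete, the short sketch below uses Python's built-in sqlite3 module as a stand-in for an RDBMS and plain JSON-lines files for the Hadoop-style "store in native format" approach. The table, field and file names are invented for the example.

# RDBMS side: data must fit a schema defined up front.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (1, "alice", 42.5))
print(conn.execute("SELECT customer, amount FROM orders").fetchall())

# Hadoop-style side: records are kept in their native format; no schema is imposed at write time.
with open("orders_raw.jsonl", "w") as f:
    f.write(json.dumps({"id": 1, "customer": "alice", "amount": 42.5, "coupon": "NEW10"}) + "\n")
    f.write(json.dumps({"id": 2, "customer": "bob", "items": ["book", "pen"]}) + "\n")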
B) Highlight the features of Hadoop and explain the functionalities of a Hadoop cluster.
Open Source
Hadoop is an open source framework which means it is available free of cost.
Also, the users are allowed to change the source code as per their requirements.
Distributed Processing
Hadoop supports distributed processing of data, i.e. faster processing.
The data in Hadoop HDFS is stored in a distributed manner, and MapReduce is
responsible for the parallel processing of that data.
Fault Tolerance
Hadoop is highly fault-tolerant. By default, it creates three replicas of each block on
different nodes.
This number can be changed according to requirements. So if one node fails, the data
can be recovered from another node.
The detection of node failure and the recovery of data are done automatically.
Reliability
Hadoop stores data on the cluster in a reliable manner that does not depend on any single machine.
So the data stored in the Hadoop environment is not affected by the failure of a machine.
Scalability
Another important feature of Hadoop is scalability. It is compatible with a wide range of
hardware, and new hardware can easily be added as nodes.
In summary, Hadoop supports the storage and processing of big data and is one of the best
solutions for handling big data challenges; the points above are among its most important features.
4. A) Explain the significance of the Hadoop Distributed File System (HDFS) and its applications.
1. Storage capacity: Hadoop can store and process huge amounts of any kind of data quickly,
which matters as data volumes and variety keep growing.
2. Computing power: Hadoop's distributed computing model processes big data fast. The more
computing nodes you use, the more processing power you have.
3. Fault tolerance. Data and application processing are protected against hardware failure. If a
node goes down, jobs are automatically redirected to other nodes to make sure the distributed
computing does not fail. Multiple copies of all data are stored automatically.
4. Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before
storing it. You can store as much data as you want and decide how to use it later. That includes
unstructured data like text, images and videos.
5. Low cost. The open-source framework is free and uses commodity hardware to store large
quantities of data.
6. Scalability. You can easily grow your system to handle more data simply by adding nodes.
Little administration is required.
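As a brief sketch of how applications typically read from and write to HDFS in practice, the example below uses PySpark; the NameNode address, paths and the event_type field are placeholders assumed for the illustration, not values from the original text.

# Illustrative PySpark job reading from and writing to HDFS (host/paths are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-demo").getOrCreate()

# Read raw, line-delimited JSON stored on HDFS.
events = spark.read.json("hdfs://namenode:9000/data/raw/events")

# Aggregate and write the result back to HDFS as Parquet.
(events.groupBy("event_type")
       .count()
       .write.mode("overwrite")
       .parquet("hdfs://namenode:9000/data/curated/event_counts"))

spark.stop()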
Hadoop Applications:
B. Difference between Name Node, Checkpoint Name Node and Backup Node
Name Node is the core of HDFS that manages the metadata – the information about which file
maps to which block locations and which blocks are stored on which Data Node. In simple
terms, it is the data about the data being stored. Name Node maintains a directory-tree-like
structure of all the files present in HDFS on a Hadoop cluster. It uses the following files for the
namespace:
fsimage file – it keeps track of the latest checkpoint of the namespace.
edits file – it is a log of the changes that have been made to the namespace since the last checkpoint.
Checkpoint Name Node has the same directory structure as the Name Node and creates
checkpoints for the namespace at regular intervals by downloading the fsimage and edits files
and merging them in its local directory. The new image produced by the merge is then uploaded
back to the Name Node.
There is a similar node, commonly known as the Secondary Name Node, but it does not
support the ‘upload to Name Node’ functionality.
Backup Node provides functionality similar to the Checkpoint Name Node while also staying
synchronized with the Name Node. It maintains an up-to-date in-memory copy of the file
system namespace and therefore does not need to fetch changes at regular intervals. To create a
new checkpoint, the Backup Node only needs to save its current in-memory state to an image file.
5. A) What is commodity hardware? Commodity hardware, sometimes known as off-the-shelf
hardware, is a computer device or IT component that is relatively inexpensive, widely
available and basically interchangeable with other hardware of its type. Unlike purpose-built
hardware designed for a specific IT function, commodity hardware can perform many
different functions.
5. B) How does big data analysis help businesses increase their revenue? Give an example
and name some companies that use Hadoop.