
UNIT I - UNDERSTANDING BIG DATA

Introduction to big data – convergence of key trends – unstructured data – industry examples of
big data – web analytics – big data applications– big data technologies – introduction to Hadoop
– open source technologies – cloud and big data – mobile business intelligence – Crowd sourcing
analytics – inter and trans firewall analytics.

1. INTRODUCTION TO BIG DATA

What is Big Data

Data that is very large in size is called big data. Normally we work with data on the order of
megabytes (Word documents, Excel sheets) or at most gigabytes (movies, code), but data whose
size runs into petabytes (10^15 bytes) is called big data. It is often stated that almost 90% of
today's data has been generated in the past three years.

Sources of Big Data

This data comes from many sources, such as:

o Social networking sites: Facebook, Google and LinkedIn generate huge amounts of data
every day because they have billions of users worldwide.
o E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge volumes of logs
from which users' buying trends can be traced.
o Weather stations: Weather stations and satellites produce very large amounts of data, which
is stored and processed to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish
their plans accordingly, and to do this they store the data of millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data through
their daily transactions.

3V's of Big Data


1. Velocity: Data is being generated at a very fast rate; it is estimated that the volume of data
doubles roughly every two years.
2. Variety: Nowadays data is not stored only in rows and columns. Data is structured as well as
unstructured. Log files and CCTV footage are unstructured data; data that can be saved in
tables, such as bank transaction data, is structured data.
3. Volume: The amount of data we deal with is very large, on the order of petabytes.
Why is big data important?

Companies use big data in their systems to improve operations, provide better customer service,
create personalized marketing campaigns and take other actions that, ultimately, can increase
revenue and profits. Businesses that use it effectively hold a potential competitive advantage
over those that don't because they're able to make faster and more informed business decisions.

For example, big data provides valuable insights into customers that companies can use to refine
their marketing, advertising and promotions in order to increase customer engagement and
conversion rates. Both historical and real-time data can be analyzed to assess the evolving
preferences of consumers or corporate buyers, enabling businesses to become more responsive to
customer wants and needs.

Big data is also used by medical researchers to identify disease signs and risk factors and by
doctors to help diagnose illnesses and medical conditions in patients. In addition, a combination
of data from electronic health records, social media sites, the web and other sources gives
healthcare organizations and government agencies up-to-date information on infectious disease
threats or outbreaks.

Here are some more examples of how big data is used by organizations:

 In the energy industry, big data helps oil and gas companies identify potential drilling
locations and monitor pipeline operations; likewise, utilities use it to track electrical grids.

 Financial services firms use big data systems for risk management and real-time analysis of
market data.

 Manufacturers and transportation companies rely on big data to manage their supply chains
and optimize delivery routes.

 Other government uses include emergency response, crime prevention and smart city
initiatives.
What are examples of big data?

Big data comes from myriad sources -- some examples are transaction processing systems,
customer databases, documents, emails, medical records, internet clickstream logs, mobile apps
and social networks. It also includes machine-generated data, such as network and server log
files and data from sensors on manufacturing machines, industrial equipment and internet of
things devices.

In addition to data from internal systems, big data environments often incorporate external data
on consumers, financial markets, weather and traffic conditions, geographic information,
scientific research and more. Images, videos and audio files are forms of big data, too, and many
big data applications involve streaming data that is processed and collected on a continual basis.

Breaking down the V's of big data

Volume is the most commonly cited characteristic of big data. A big data environment doesn't
have to contain a large amount of data, but most do because of the nature of the data being
collected and stored in them. Clickstreams, system logs and stream processing systems are
among the sources that typically produce massive volumes of data on an ongoing basis.

Big data also encompasses a wide variety of data types, including the following:

 structured data, such as transactions and financial records;

 unstructured data, such as text, documents and multimedia files; and

 semistructured data, such as web server logs and streaming data from sensors.

Various data types may need to be stored and managed together in big data systems. In addition,
big data applications often include multiple data sets that may not be integrated upfront. For
example, a big data analytics project may attempt to forecast sales of a product by correlating
data on past sales, returns, online reviews and customer service calls.

Velocity refers to the speed at which data is generated and must be processed and analyzed. In
many cases, sets of big data are updated on a real- or near-real-time basis, instead of the daily,
weekly or monthly updates made in many traditional data warehouses. Managing data velocity is
also important as big data analysis further expands into machine learning and artificial
intelligence (AI), where analytical processes automatically find patterns in data and use them to
generate insights.

More characteristics of big data

Looking beyond the original three V's, here are details on some of the other ones that are now
often associated with big data:

 Veracity refers to the degree of accuracy in data sets and how trustworthy they are. Raw data
collected from various sources can cause data quality issues that may be difficult to pinpoint.
If they aren't fixed through data cleansing processes, bad data leads to analysis errors that can
undermine the value of business analytics initiatives. Data management and analytics teams
also need to ensure that they have enough accurate data available to produce valid results.

 Some data scientists and consultants also add value to the list of big data's characteristics.
Not all the data that's collected has real business value or benefits. As a result, organizations
need to confirm that data relates to relevant business issues before it's used in big data
analytics projects.

 Variability also often applies to sets of big data, which may have multiple meanings or be
formatted differently in separate data sources -- factors that further complicate big data
management and analytics.

Some people ascribe even more V's to big data; various lists have been created with between
seven and 10.
How is big data stored and processed?

Big data is often stored in a data lake. While data warehouses are commonly built on relational
databases and contain structured data only, data lakes can support various data types and
typically are based on Hadoop clusters, cloud object storage services, NoSQL databases or other
big data platforms.

Many big data environments combine multiple systems in a distributed architecture; for example,
a central data lake might be integrated with other platforms, including relational databases or a
data warehouse. The data in big data systems may be left in its raw form and then filtered and
organized as needed for particular analytics uses. In other cases, it's preprocessed using data
mining tools and data preparation software so it's ready for applications that are run regularly.

Big data processing places heavy demands on the underlying compute infrastructure. The
required computing power often is provided by clustered systems that distribute processing
workloads across hundreds or thousands of commodity servers, using technologies like Hadoop
and the Spark processing engine.

Getting that kind of processing capacity in a cost-effective way is a challenge. As a result, the
cloud is a popular location for big data systems. Organizations can deploy their own cloud-based
systems or use managed big-data-as-a-service offerings from cloud providers. Cloud users can
scale up the required number of servers just long enough to complete big data analytics projects.
The business only pays for the storage and compute time it uses, and the cloud instances can be
turned off until they're needed again.
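
To make the processing model concrete, here is a minimal PySpark sketch (an assumption for illustration, not tied to any particular deployment) that reads raw JSON event files left in a data lake, filters them for one analytics use, and aggregates the result; the storage path and the field names event_type and country are invented for the example.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-lake-sketch").getOrCreate()

# Read raw, semi-structured events straight from the lake in their original JSON form
events = spark.read.json("s3a://example-data-lake/raw/events/")  # hypothetical path

# Filter and organize the raw data for one particular analytics use
purchases_per_country = (
    events
    .filter(F.col("event_type") == "purchase")
    .groupBy("country")
    .count()
    .orderBy(F.col("count").desc())
)

purchases_per_country.show(10)
spark.stop()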

Structured Data

 Structured data can be crudely defined as the data that resides in a fixed field within a
record.
 It is the type of data most familiar from everyday life, for example a birthday or an address.
 A certain schema binds it, so all the data has the same set of properties. Structured data is
also called relational data. It is split into multiple tables to enhance the integrity of the data
by creating a single record to depict an entity. Relationships are enforced by the application
of table constraints.
 The business value of structured data lies within how well an organization can utilize its
existing systems and processes for analysis purposes.
Semi-Structured Data

 Semi-structured data is not bound by any rigid schema for data storage and handling. The
data is not in the relational format and is not neatly organized into rows and columns like
that in a spreadsheet. However, there are some features like key-value pairs that help in
discerning the different entities from each other.
 Because semi-structured data does not need a structured query language (SQL) for access, it
is commonly called NoSQL data (a small example follows this list).
 A data serialization language is used to exchange semi-structured data across systems that
may even have varied underlying infrastructure.
 Semi-structured content is often used to store metadata about a business process but it can
also include files containing machine instructions for computer programs.
 This type of information typically comes from external sources such as social media
platforms or other web-based data feeds.
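
To make the contrast concrete, the small Python sketch below (referenced in the list above) shows the same customer first as a flat, fixed-field structured record and then as a nested, key-value semi-structured JSON document; all field names and values are invented for the example.

import json

# Structured view: a fixed set of columns, as one row of a relational table
structured_row = {"customer_id": 101, "name": "Asha", "city": "Chennai", "birthday": "1995-06-01"}

# Semi-structured view: nested key-value pairs with optional fields and no rigid schema
semi_structured_doc = {
    "customer_id": 101,
    "name": "Asha",
    "contact": {"city": "Chennai", "email": "asha@example.com"},
    "orders": [
        {"order_id": "A1", "amount": 499.0},
        {"order_id": "A2", "amount": 1299.0, "coupon": "DIWALI10"},  # extra field only here
    ],
}

# A serialization format such as JSON is used to exchange the semi-structured record
print(json.dumps(semi_structured_doc, indent=2))
print(structured_row)
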
2. Unstructured Data

 Unstructured data is the kind of data that doesn’t adhere to any definite schema or set of
rules. Its arrangement is unplanned and haphazard.
 Photos, videos, text documents, and log files can be generally considered unstructured data.
Even though the metadata accompanying an image or a video may be semi-structured, the
actual data being dealt with is unstructured.
 Additionally, Unstructured data is also known as “dark data” because it cannot be analyzed
without the proper software tools.
1. Images, video and audio media content: The media and entertainment industry,
surveillance systems, professional publishers and even individuals are constantly creating
image, video and audio content. These media files are often stored in structured
databases, but such databases do not process or understand the actual contents of the
media files, which are in the form of unstructured data.

The ability to interpret and understand media, often in real-time, has far-reaching
implications for governance, business, and healthcare. Some examples are: an analysis of
911 call records could aid criminal investigations, CCTV camera footage could help to
prevent or detect incidents, identifying the persons in a video could be useful in news
reporting, and videos of shoppers could help retailers to understand their movements and
shopping patterns.

Considering the huge volume of data involved, analyzing the content of media files
manually is a daunting task, which is why automation solutions are currently being
developed. For example, natural-language processing can extract text out of audio files
using speech-to-text technology, and the text can be analyzed to perform sentiment
analysis. Metatags are also helpful to classify media files and perform search operations.

2. Communications – live chat, messaging and web meetings: Today, professional as well as
personal discussions take place across a variety of communications platforms.
Popular apps such as WhatsApp, web conferencing platforms such as Zoom or Skype,
and collaboration tools such as Slack are some of the places where data is being created
in the form of unstructured audio and text. Consider an organization where employees are
speaking with customers and vendors across multiple communication platforms. In order
to get a unified view of a particular customer, there is a need not only to integrate
unstructured data created on different platforms, but to standardize and interpret it.

Customer sales or service calls can be stored, categorized, transcribed and analyzed to
find meaning. A speech recognition program converts voice to text, and emotion
detection capabilities observe the tone during the call through changes in the customers’
speed, pitch, and volume. Natural language processing helps to identify key themes,
products, and sentiments, equipping the organization to improve the customer experience,
retain customers and enhance sales (a toy sentiment-scoring sketch follows these examples).

An increasing number of websites and apps are offering visitors a live-chat functionality.
Chat conversation transcripts are a treasure trove of market intelligence if analyzed
correctly. This is where data visualization tools can play a role in helping discover key
themes. Chat data gathered over time helps to understand trends — i.e. whether a topic is
becoming hotter or cooler by the day. This knowledge can go a long way in building
deeper relationships with customers.
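
As a toy illustration of the sentiment analysis mentioned in the examples above, the Python sketch below scores text that has already been transcribed from a call by speech-to-text; the keyword lists and sample transcripts are invented, and a real system would use a trained NLP model rather than fixed word lists.

# Toy sentiment scoring of transcribed call text (illustrative only)
POSITIVE = {"great", "happy", "thanks", "resolved", "excellent"}
NEGATIVE = {"angry", "refund", "broken", "cancel", "terrible"}

def sentiment_score(transcript: str) -> int:
    words = transcript.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

calls = [
    "thanks the issue was resolved quickly great service",
    "the product arrived broken i want a refund and will cancel",
]
for text in calls:
    print(sentiment_score(text), "->", text)
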
What is unstructured data?

Unstructured data is often categorized as qualitative and cannot be processed and analyzed using
conventional data tools and methods. It is also known as "schema independent" or "schema on
read" data.

Examples of unstructured data include text, video files, audio files, mobile activity, social
media posts, satellite imagery, surveillance imagery – the list goes on and on.

Unstructured data is difficult to deconstruct because it has no predefined data model, meaning it
cannot be organized in relational databases. Instead, non-relational or NoSQL databases are the
best fit for managing unstructured data.

Another way to manage unstructured data is to have it flow into a data lake or pool, allowing it
to be in its raw, unstructured format.

Finding the insight buried within unstructured data isn’t an easy task. It requires advanced
analytics and high technical expertise to make a difference. Data analysis can be an expensive
shift for many companies.
More examples of unstructured data:

Unstructured data includes any event or alert sent or received within an organization that has no
fixed file format or direct mapping to a business record.

 Rich media: Social media, entertainment, surveillance, satellite information, geospatial
data, weather forecasting, podcasts
 Documents: Invoices, records, web history, emails, productivity applications
 Media and entertainment data, surveillance data, geospatial data, audio, weather data
 Internet of things: sensor data, ticker data
 Analytics: Machine learning, artificial intelligence (AI)

Benefits of unstructured data

Unstructured data, also known as big data nowadays, is free-flowing and native to each specific
company. It is schema independent and is known as "schema on read." Customizing this data per
your business strategies can give you a competitive edge over competitors still stuck in
traditional decision-making. And here is why.

 Unstructured data is easily available and has enough insights businesses can collect to
learn about their product response.
 Unstructured data is schema-independent. Hence minor alterations to the database do not
impact cost, time, or resources.
 Unstructured data can be stored on shared or hybrid cloud servers with minimal
expenditure on database management.
 Unstructured data is in its native format, so data scientists or engineers do not define it
until needed. It opens the expandability of file formats, as it is available in different
formats like .mp3, .opus, .pdf, .png, and so on.
 Data lakes come with "pay-as-you-use" pricing, which helps businesses cut their costs
and resource consumption.
Challenges of unstructured data

Unstructured data is the fastest-growing form of data being collected and used today. Many
businesses are switching to more "customer-centric" business models and banking on consumer
data. However, working with unstructured data brings the following challenges.

 Unstructured data is not the easiest to understand. Users require a proficient background
in data science and machine learning to prepare, analyze and integrate it with machine
learning algorithms.
 Unstructured data often rests on shared servers with weaker authentication and encryption,
which are more prone to ransomware and cyber attacks.
 Currently, there aren't many tools that can manipulate unstructured data apart from cloud
commodity servers and open-source NoSQL DBMS.

3. Industry Examples of Big Data

6 big data examples by industry

 Media and entertainment


 Finance
 Healthcare
 Education
 Retail
 Manufacturing

How big data is used in media and entertainment

Analyzing big data is crucial to generating more revenue and providing personalized experiences
in this digitally-driven industry. Here are a few ways big data is being applied in media and
entertainment today:

 Companies like Hulu and Netflix work with an abundance of big data daily to analyze
user tendencies, preferred content, trends in consumption, and much more. As a matter of
fact, Netflix used predictive data analysis to craft its show House of Cards since the
data validated that it’d be a hit with consumers.
 Ever wonder why so many streaming services are coming out? That’s because big data is
unveiling new ways to monetize digital content, creating new revenue sources for media
and entertainment companies.
 Ads are targeted more strategically thanks to big data analytics software, helping
companies understand the performance of ads more clearly based on certain types of
consumers.
Finance

Big data has fundamentally changed the finance industry, particularly stock trading. The
introduction of quantitative analysis models has marked a shift from manual trading to trading
backed by technology.

The first adopters of this technology were large financial institutions and hedge funds. Now,
quantitative models have become the standard.

These models analyze big data to predict outcomes of certain events in the financial world, make
accurate enter/exit trade decisions, minimize risk using machine learning, and even gauge
market sentiment using opinion mining.

Healthcare

The ability to improve quality of life, provide hyper-personalized patient treatment, and discover
medical breakthroughs makes the healthcare industry a perfect candidate for big data. As a
matter of fact, the healthcare industry is one of the largest recent adopters of big data analytics.

How big data is used in healthcare

In healthcare, it’s not about increasing profits or finding new product opportunities, it’s about
analyzing and applying big data in a patient-centric way. There are already many great examples
of this today:

 In our roundup of predictive analytics examples, we discussed how AlayaCare analyzed
big data to predict negative health events that seniors receiving home care could experience.
The analysis reduced hospitalizations and ER visits by 73 percent overall, and by 64 percent
among chronically ill patients.
 Historical big data from healthcare providers can be used to identify and analyze certain
risk factors in patients. This is useful for earlier detection of diseases, allowing doctors
and their patients to take action sooner.
 Big data can identify disease trends as a whole based on demographics, geographics,
socio-economics, and other factors.

Education

Modern learning supported by technology is moving away from what we “think” works and
more toward what we “know” works. Through big data, educators are able to craft more
personalized learning models instead of relying on standardized, one-size-fits-all frameworks.

Big data is helping schools understand the unique needs of students by blending traditional
learning environments with online environments. This allows educators to track the progress of
their students and identify gaps in the learning process.
As a matter of fact, big data is already being used on some college campuses to reduce dropout
rates by identifying risk factors in students who are falling behind in their classes.

Retail

The retail industry has gone digital, and customers expect a seamless experience from online to
brick and mortar. Big data analytics allows retail companies to provide a variety of services and
understand more about their customers.

How big data is used in retail

You’ll find that some of the use cases of big data in retail closely mimic those of media and
entertainment. But in retail, it’s a bit more focused on the full customer lifecycle.

 Amazon has set the gold standard when it comes to applying big data for product
recommendations based on past searches on its platform. Using predictive analytics,
Amazon and other retailers are able to accurately predict what you’re likely to purchase
next.
 Demand forecasting is another application of big data. For example, retailers like
Walmart and Walgreens regularly analyze changes in weather to see any patterns in
product demand.
 Big data is useful for crisis control. For example, in product recalls, big data helps
retailers identify who purchased the product and allows them to reach out accordingly.

Manufacturing

Supply chain management and big data go hand-in-hand, which is why manufacturing is one of
the top industries to benefit from the use of big data.

Monitoring the performance of production sites is more efficient with big data analytics. The use
of analytics is also extremely useful for quality control, especially in large-scale manufacturing
projects.

Big data analytics plays a key role in tracking and managing overhead and logistics across
multiple sites. For example, being able to accurately measure the cost of shop floor tasks can
help reduce labor costs.

Then there’s predictive analytics software, which uses big data from sensors attached to
manufacturing equipment. Early detection of equipment malfunctions can save sites from costly
repairs capable of paralyzing production.

4. Web Analytics
Web analytics is the collection, reporting, and analysis of website data. The focus is on
identifying measures based on your organizational and user goals and using the website data to
determine the success or failure of those goals and to drive strategy and improve the user’s
experience.
Measuring Content
Critical to developing relevant and effective web analysis is creating objectives and calls-to-
action from your organizational and site-visitor goals, and identifying key performance
indicators (KPIs) to measure the success or failure of those objectives and calls-to-action.
The process of web analytics involves:

 Setting business goals: Defining the key metrics that will determine the success of your
business and website
 Collecting data: Gathering information, statistics, and data on website visitors using
analytics tools
 Processing data: Converting the raw data you’ve gathered into meaningful ratios, KPIs,
and other information that tells a story (a small sketch of this step follows this list)
 Reporting data: Displaying the processed data in an easy-to-read format
 Developing an online strategy: Creating a plan to optimize the website experience to
meet business goals
 Experimenting: Doing A/B tests to determine the best way to optimize website
performance
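
As a small illustration of the processing step referenced above, the Python sketch below turns a handful of raw session records into two common KPIs, conversion rate and bounce rate; the records and field names are invented for the example.

# Turning raw session records into KPIs (conversion rate and bounce rate); data is invented
sessions = [
    {"user": "a", "pages_viewed": 5, "converted": True},
    {"user": "b", "pages_viewed": 1, "converted": False},   # single-page visit = bounce
    {"user": "c", "pages_viewed": 3, "converted": False},
    {"user": "d", "pages_viewed": 1, "converted": True},
]

total = len(sessions)
conversions = sum(s["converted"] for s in sessions)
bounces = sum(s["pages_viewed"] == 1 for s in sessions)

conversion_rate = conversions / total * 100   # conversions divided by sessions
bounce_rate = bounces / total * 100           # one-page visits divided by sessions

print(f"Conversion rate: {conversion_rate:.1f}%")  # 50.0%
print(f"Bounce rate: {bounce_rate:.1f}%")          # 50.0%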

The importance of web analytics

Your company’s website is probably the first place users go to learn more about your
product. In fact, your website is itself a product. That’s why the data you collect on your
website visitors can tell you a lot about them and their expectations of your website and
product.

Here are a few reasons why web analytics are important:

Understand your website visitors

Web analytics tools reveal key details about your site visitors—including their average
time spent on page and whether they’re a new or returning user—and which content
draws in the most traffic. With this information, you’ll learn more about what parts of
your website and product interest users and potential customers the most.
For instance, an analytics tool might show you that a majority of your website visitors are
landing on your German site. You could use this information to ensure you have a
German version of your product that’s well translated to meet the needs of these users.

Analyze website conversions

Conversions could mean real purchases, signing up for your newsletter, or filling out a
contact form on your website. Web analytics can give you information about the total
number of these conversions, how much you earned from the conversions, the percentage
of conversions (number of conversions divided by the number of website sessions), and
the abandonment rate. You can also see the “conversion path,” which shows you how
your users moved through your site before they converted.

By looking at the above data, you can do conversion rate optimization (CRO). CRO will
help you design your website to achieve the optimum quantity and quality of conversions.

Web analytics tools can also show you important metrics that help you boost purchases
on your site. Some tools offer an enhanced ecommerce tracking feature to help you figure
out which are the top-selling products on your website. Once you know this, you can
refine your focus on your top-sellers and boost your product sales.

Boost your search engine optimization (SEO)

By connecting your web analytics tool with Google Search Console, it’s possible to track
which search queries are generating the most traffic for your site. With this data, you’ll
know what type of content to create to answer those queries and boost your site’s search
rankings.

It’s also possible to set up onsite search tracking to know what users are searching for on
your site. This search data can further help you generate content ideas for your site,
especially if you have a blog.

Understand top performing content

Web analytics tools will also help you learn which content is performing the best on your
site, so you can focus on the types of content that work and also use that information to
make product improvements. For instance, you may notice blog articles that talk about
design are the most popular on your website. This might signal that your users care about
the design feature of your product (if you offer design as a product feature), so you can
invest more resources into the design feature. The popular content pieces on your website
could spark ideas for new product features, too.

Understand and optimize referral sources

Web analytics will tell you who your top referral sources are, so you know which
channels to focus on. If you’re getting 80% of your traffic from Instagram, your
company’s marketers will know that they should invest in ads on that platform.

Web analytics also shows you which outbound links on your site people are clicking on.
Your company’s marketing team might discover a mutually beneficial relationship with
these external websites, so you can reach out to them to explore partnership or cross-
referral opportunities.

Example metrics to track with web analytics

Website performance metrics vary from company to company based on their goals for
their site. Here are some example KPIs that businesses should consider tracking as a part
of their web analytics practice.

Page visits / Sessions


Page visits and sessions refer to the traffic to a webpage over a specific period of time.
The more visits, the more your website is getting noticed.

Keep in mind traffic is a relative success metric. If you’re seeing 200 visits a month to a
blog post, that might not seem like great traffic. But if those 200 visits represent high-
intent views—views from prospects considering purchasing your product—that traffic
could make the blog post much more valuable than a high-volume, low-intent piece.

Source of traffic

Web analytics tools allow you to easily monitor your traffic sources and adjust your
marketing strategy accordingly. For example, if you’re seeing lots of traffic from email
campaigns, you can send out more email campaigns to boost traffic.
Total website conversion rate

Total website conversion rate refers to the percentage of people who complete a critically
important action or goal on your website. A conversion could be a purchase or when
someone signs up for your email list, depending on what you define as a conversion for
your website.

Bounce rate

Bounce rate refers to how many people visit just one page on your website and then leave
your site.

Interpreting bounce rates is an art. A high bounce rate could be both negative and positive
for your business. It’s a negative sign since it shows people are not interacting with other
pages on your site, which might signal low engagement among your site visitors. On the
other hand, if they spend quality time on a single page, it might indicate that users are
getting all the information they need, which could be a positive sign. That’s why you
need to investigate bounce rates further to understand what they might mean.

Repeat visit rate

Repeat visit rate tells you how many people are visiting your website regularly or
repeatedly. This is your core audience since it consists of the website visitors you’ve
managed to retain. Usually, a repeat visit rate of 30% is good. Anything below 20%
shows your website is not engaging enough.

Monthly unique visitors

Monthly unique visitors refers to the number of visitors who visit your site for the first
time each month.

This metric shows how effective your site is at attracting new visitors each month, which
is important for your growth. Ideally, a healthy website will show a steady flow of new
visitors to the site.

Unique ecommerce metrics

Along with tracking these basic metrics, an ecommerce company’s team might also track
additional KPIs to understand how to boost sales:
Shopping cart abandonment rate shows how many people leave their shopping carts
without actually making a purchase. This number should be as low as possible.
Other relevant ecommerce metrics include average order value and the average number
of products per sale. You need to boost these metrics if you want to increase sales.
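
A minimal Python sketch of how these ecommerce metrics might be computed is shown below; the counts and revenue figures are invented for the example.

# Illustrative ecommerce KPI calculations (numbers invented for the example)
carts_created = 200
orders_completed = 140
revenue = 98_000.0
items_sold = 350

cart_abandonment_rate = (carts_created - orders_completed) / carts_created * 100
average_order_value = revenue / orders_completed
products_per_sale = items_sold / orders_completed

print(f"Cart abandonment rate: {cart_abandonment_rate:.1f}%")   # 30.0%
print(f"Average order value: {average_order_value:.2f}")        # 700.00
print(f"Average products per sale: {products_per_sale:.2f}")    # 2.50
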
Web analytics tools

There is a whole range of tools you can use for web analytics, including tools that traditionally
specialize in product analytics or experience analytics. Some of these include:

 Adobe Analytics
 Amplitude
 Contentsquare
 Crazy Egg
 fullstory
 Glassbox
 Google Analytics
 Heap
 Hotjar
 Mixpanel
 Pendo
Don’t just take our word for it, though. Check out review sites like G2 for a roundup of
the best web analytics tools.

Common issues with web analytics


While web analytics can be extremely useful for optimizing the website experience, there
are some drawbacks to it. Some of these include:

Keeping track of too many metrics


There are so many data points available to track. It can be overwhelming to combine web
analytics, product analytics, customer experience tools, heatmaps, and other business
intelligence analytics to make sense of things.

As a general rule, only measure the metrics that are important to your business goals, and
ignore the rest. For example, if your primary goal is to increase sales in a certain location,
you don’t need metrics about anything outside of that location.

Data is not always accurate


The data collected by analytics tools is not always accurate. Many users may opt-out of
analytics services, preventing web analytics tools from collecting information on them.
They may also block cookies, further preventing the collection of their data and leading
to a lot of missing information in the data reported by analytics tools. As we move
towards a cookieless world, you’ll need to consider analytics solutions that track first-
party data, rather than relying on third-party data.

Your web analytics tool may also be using incorrect data filters, which may skew the
information it collects, making the data inaccurate and unreliable. And there’s not much
you can do with unreliable data.

Data privacy is at risk


Untracked or overly exposed data can cause privacy or security vulnerabilities. People
could reveal all sorts of personal information about themselves on your website,
including credit card details and their address. Any breach to an analytics service
provider that compromises your user data can be devastating for your business’
reputation. Since privacy laws have become more stringent over the last decade globally,
it’s important you pay attention to cyber security.

Website data is particularly sensitive. Make sure your web analytics tools have proper
monitoring procedures and security testing in place. Take steps to protect your website
against any potential threats.

Data doesn’t tell the whole story


While web analytics are useful to learn how users are interacting with your website, they
only scratch the surface when it comes to understanding user behavior. Web analytics can
tell you what users are doing, but not why they do it. To understand behaviors, you need
to go beyond web analytics and leverage a behavioral analytics solution like Amplitude
Analytics. By looking at behavioral product data, you’ll see which actions drive higher
engagement, retention, and lifetime value.
5. Applications of Big Data
5.1. Tracking Customer Spending Habits and Shopping Behavior: In big retail stores (like
Amazon, Walmart, Big Bazaar, etc.) the management team has to keep data on customers'
spending habits (which products they spend on, which brands they prefer, how frequently
they spend), shopping behavior and most-liked products (so that those products can be kept
in stock). Based on data about which products are searched for or sold the most, the
production or procurement rate of those products is fixed. The banking sector uses
customers' spending-behavior data to offer a particular customer a discount or cashback on
a product they like when it is bought with the bank's credit or debit card. In this way, the
right offer can be sent to the right person at the right time.
5.2 Recommendation: By tracking customers' spending habits and shopping behavior, big
retail stores provide recommendations to the customer. E-commerce sites like Amazon,
Walmart and Flipkart do product recommendation: they track which products a customer
searches for and, based on that data, recommend similar products to that customer.
As an example, suppose a customer searches for bed covers on Amazon. Amazon then has
data suggesting that the customer may be interested in buying a bed cover, so the next time
that customer visits a Google page, advertisements for various bed covers are shown. Thus,
an advertisement for the right product can be sent to the right customer.

YouTube also recommends videos based on the types of videos a user has previously liked
or watched. Based on the content of the video a user is watching, relevant advertisements
are shown while the video plays. As an example, suppose someone is watching a tutorial
video on big data; an advertisement for some other big data course may then be shown
during that video.
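
A very small Python sketch of the idea behind such recommendations is shown below: it counts which products are browsed together across customers and suggests the most frequent companions of a given product. The browsing histories are invented, and real recommenders use far more sophisticated models.

# Toy item-to-item recommendation based on co-occurrence in browsing histories
from collections import Counter
from itertools import combinations

histories = [
    ["bed cover", "pillow", "curtains"],
    ["bed cover", "pillow"],
    ["laptop", "mouse", "bed cover"],
]

co_counts = Counter()
for items in histories:
    for a, b in combinations(set(items), 2):
        co_counts[frozenset((a, b))] += 1

def recommend(product, k=2):
    scores = Counter()
    for pair, n in co_counts.items():
        if product in pair:
            (other,) = pair - {product}
            scores[other] += n
    return [item for item, _ in scores.most_common(k)]

print(recommend("bed cover"))   # e.g. ['pillow', 'curtains']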

5.3 Smart Traffic System: Data about traffic conditions on different roads is collected
through cameras placed beside the roads and at the entry and exit points of the city, and
from GPS devices placed in vehicles (Ola and Uber cabs, etc.). All this data is analyzed,
and jam-free or less congested, faster routes are recommended. In this way a smart traffic
system can be built in the city using big data analysis. An additional benefit is that fuel
consumption can be reduced.

5.4. Secure Air Traffic System: Sensors are present at various places in an aircraft (such as
on the propellers). These sensors capture data like flight speed, moisture, temperature and
other environmental conditions. Based on analysis of this data, environmental parameters
inside the aircraft are set and adjusted. By analyzing the aircraft's machine-generated data, it
can be estimated how long the machine can operate flawlessly and when it needs to be
replaced or repaired.

5.5 Auto Driving Car: Big data analysis helps drive a car without human intervention. Cameras
and sensors placed at various spots on the car gather data such as the size of surrounding
vehicles and obstacles and the distance from them. This data is analyzed, and various
calculations are carried out, such as how many degrees to turn, what the speed should be, and
when to stop. These calculations help the car take action automatically.
5.6 Virtual Personal Assistant Tool: Big data analysis helps virtual personal assistant tools (like
Siri on Apple devices, Cortana on Windows, and Google Assistant on Android) answer the
various questions asked by users. The tool tracks the user's location, local time, season, and
other data related to the question asked; by analyzing all of this data, it provides an answer.
As an example, suppose a user asks "Do I need to take an umbrella?" The tool collects data
such as the user's location and the season and weather conditions at that location, analyzes
this data to determine whether there is a chance of rain, and then provides the answer.

5.7. IoT:

Manufacturing companies install IoT sensors in machines to collect operational data. By
analyzing this data, it can be predicted how long a machine will work without problems and
when it will require repair, so the company can act before the machine develops serious issues
or breaks down completely. Thus, the cost of replacing the whole machine can be saved.
In the healthcare field, big data is making a significant contribution. Using big data tools, data
about the patient experience is collected and used by doctors to give better treatment. IoT
devices can sense the symptoms of a probable upcoming disease in the human body and help
prevent it by enabling treatment in advance. IoT sensors placed near a patient or a newborn
baby constantly track health parameters such as heart rate and blood pressure. Whenever any
parameter crosses the safe limit, an alarm is sent to a doctor so that they can take steps
remotely very quickly.
5.8 Education Sector: Organizations that run online educational courses use big data to find
candidates interested in those courses. If someone searches for YouTube tutorial videos on a
subject, online or offline course providers for that subject send online advertisements about
their courses to that person.

5.9. Energy Sector: Smart electric meters read the power consumed every 15 minutes and send this
data to a server, where it is analyzed to estimate the times of day when the power load is lowest
across the city. Using this system, manufacturing units and households can be advised to run their
heavy machines at night, when the power load is low, and thereby enjoy a lower electricity bill.
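
A minimal Python sketch of the kind of analysis described in 5.9 is shown below: it averages 15-minute smart-meter readings by hour of day to find the hours with the lowest load; the readings are invented for illustration.

# Aggregate 15-minute smart-meter readings by hour of day to find low-load hours
from collections import defaultdict

# (timestamp "HH:MM", kilowatts drawn across the city at that instant) -- invented values
readings = [
    ("01:15", 180.0), ("01:30", 175.0), ("02:00", 160.0),
    ("09:15", 420.0), ("09:30", 440.0), ("19:00", 510.0), ("19:15", 530.0),
]

load_by_hour = defaultdict(list)
for ts, kw in readings:
    hour = int(ts.split(":")[0])
    load_by_hour[hour].append(kw)

avg_by_hour = {h: sum(v) / len(v) for h, v in load_by_hour.items()}
quietest = sorted(avg_by_hour, key=avg_by_hour.get)[:3]
print("Hours with the lowest average load:", quietest)   # e.g. [2, 1, 9]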

5.10. Media and Entertainment Sector: Media and entertainment service providers like Netflix,
Amazon Prime and Spotify analyze data collected from their users. Data such as which types of
videos or music users watch or listen to most, and how long users spend on the site, is collected
and analyzed to set the next business strategy.

6. Convergence of Key Trends in Big Data

1. TinyML

TinyML is a technique of ML that runs on small, low-powered devices such as
microcontrollers. The best part about TinyML is that it runs with low latency at the edge, on
the device itself. It consumes only microwatts or milliwatts, roughly 1000x less than a
standard GPU, and this low power consumption lets devices run for long periods of time, in
some cases years. Because processing happens locally, little raw data needs to be sent off the
device or stored elsewhere, which helps with safety and privacy concerns.
2. AutoML

AutoML is often described as the modern face of ML. It is used to reduce human involvement
by automating the tasks needed to solve real-life problems, covering the whole pipeline from
raw data to a final ML model. The aim of AutoML is to offer powerful learning techniques and
models to non-experts in ML. That said, although AutoML reduces the need for human
interaction, it is not going to replace it completely.

3. Data Fabric

Data Fabric has been in trend for a while now and will continue its dominance in the coming
years. It is an architecture and set of data services that span cloud environments, and it has
been highlighted by Gartner as a top data and analytics trend, though it still has to spread
across the enterprise at scale. It brings together key data management technologies, including
data pipelining, data integration and data governance. Enterprises have adopted it readily
because it reduces the time needed to extract business insights, which helps in making
impactful business decisions.

4. Cloud Migration

In today’s world of technology, businesses are shifting towards cloud technology. Cloud
migration has been a trend for a while now and it is where technology is heading next. Moving
to the cloud has several benefits, and not only businesses but also individuals now rely heavily
on cloud services. Cloud migration is very helpful in terms of performance, as it improves the
speed and scalability of operations, especially during heavy traffic.

5. Data Regulation

As industries change their working patterns and base business decisions on data, managing
their operations becomes easier. However, big data has yet to make its full impact on regulated
sectors; some have started adopting big data architectures, but there is a long way to go.
Handling data at such a large scale brings a lot of responsibility, and in specific industries such
as healthcare and law, data cannot be compromised; patient data, for example, cannot be left to
automated AI methods alone. So, better data regulation is going to play a major role in 2022.

6. IoT

With the growing pace of technology, we are becoming more dependent on technology. IoT
has played a great role in this for the last few years and we believe it will play an even more
interesting role in the near future. Today, advanced data technologies and architectures add
value on top of IoT by monitoring devices and collecting data in different forms. We believe
IoT will operate on a larger scale, storing and processing data in real time to solve problems in
areas such as traffic management, manufacturing and healthcare.

7. NLP

Natural Language Processing (NLP) is a branch of AI that helps machines interpret text or
voice input provided by humans. In short, it is used today to understand what is being said, and
it works remarkably well. It is a significant technological advance, and you can already find
examples where you can ask a machine to read text aloud for you. NLP uses a range of
methodologies to resolve ambiguity in speech and give responses a natural touch. The best
examples are Apple's Siri and Google Assistant, where you speak to the AI and it gives you
useful information according to your need.

8. Data Quality

Data quality became one of the most pressing concerns for companies in 2021, although
relatively few companies have acknowledged that data quality is becoming an issue for them.
To date, many companies have not focused on the quality of the data coming from their
various mining tools, which has resulted in poor data management. If data is the decision-
maker and plays a crucial role, poor-quality data can lead a business to set the wrong targets or
target the wrong group. That is where filtering and cleansing are required to achieve real
milestones.

9. Cyber Security

With the rise of the COVID-19 pandemic, when the world was forced to shut down and
companies were left with no option other than work-from-home, things began to change. Even
after many months, people continue to prefer remote work. Everything has its pros and cons,
and remote work brings many challenges, including cyber attacks: the employee is outside the
company's security perimeter, and this becomes a concern for organizations. As more people
work remotely, cyber attackers become more active and find new ways to breach systems.
Taking this into consideration, XDR (Extended Detection and Response) and SOAR (Security
Orchestration, Automation and Response) have been introduced, which help detect cyber
attacks by applying advanced security analytics across the network. Therefore, this is and will
remain one of the major trends in big data and analytics for 2022.

10. Predictive Analytics

Predictive analytics helps identify future trends and make forecasts with the help of statistical
tools. It analyzes patterns in data in a meaningful way and is used, for example, in weather
forecasting. Its techniques are not limited to this; it can be applied to almost any data,
analyzing the statistics based on the patterns found.
Examples include the share market and product research. Based on the data provided,
predictive analytics can report in advance whether a market share is dipping; or, if you want to
launch a product, it can collect data from different regions and, based on their interests, help
you analyze your business decision. In today's heavily competitive world it is becoming even
more in demand and will remain a trend in the coming years.

7. Big Data Technologies


7.1 What is Big Data Technology?

Big data technology is defined as a software utility primarily designed to analyze, process and
extract information from extremely large data sets with highly complex structures, which
traditional data processing software cannot handle.

Big data technologies are closely associated with other widely adopted technologies such as
deep learning, machine learning, artificial intelligence (AI) and the Internet of Things (IoT),
which they greatly augment. In combination with these technologies, big data technologies
focus on analyzing and handling large amounts of real-time and batch data.

7.2 Types of Big Data Technology

Before we start with the list of big data technologies, let us first discuss their broad
classification. Big data technology is primarily classified into the following two types:

Operational Big Data Technologies

This type of big data technology mainly covers the basic day-to-day data that people routinely
process. Typically, operational big data includes daily data such as online transactions, social
media activity, and the data of a particular organization or firm, which is then analyzed using
software based on big data technologies. This data can also be regarded as the raw data that
serves as input to analytical big data technologies.

Some specific examples that include the Operational Big Data Technologies can be listed as
below:

o Online ticket booking system, e.g., buses, trains, flights, and movies, etc.
o Online trading or shopping from e-commerce websites like Amazon, Flipkart, Walmart,
etc.
o Online data on social media sites, such as Facebook, Instagram, Whatsapp, etc.
o The employees' data or executives' particulars in multinational companies.
Analytical Big Data Technologies

Analytical big data is commonly referred to as the advanced version of big data technologies.
It is somewhat more complex than operational big data. Analytical big data is mainly used
when performance criteria matter and important real-time business decisions are made based
on reports created by analyzing operational (real) data. This means that the actual investigation
of big data that is important for business decisions falls under this type of big data technology.

Some common examples that involve the Analytical Big Data Technologies can be listed as
below:

o Stock marketing data


o Weather forecasting data and the time series analysis
o Medical health records where doctors can personally monitor the health status of an
individual
o Carrying out the space mission databases where every information of a mission is very
important

Top Big Data Technologies

We can categorize the leading big data technologies into the following four sections:

o Data Storage
o Data Mining
o Data Analytics
o Data Visualization
Data Storage

Let us first discuss leading Big Data Technologies that come under Data Storage:

o Hadoop: When it comes to handling big data, Hadoop is one of the leading technologies
that come into play. This technology is based on the MapReduce architecture and is
mainly used to process data in batches. The Hadoop framework was introduced to store
and process data in a distributed processing environment built on commodity hardware,
using a simple programming and execution model (a small MapReduce word-count
sketch in Python appears after this list).
Apart from this, Hadoop is also well suited to storing and analyzing data from various
machines at high speed and low cost, which is why it is regarded as one of the core
components of big data technologies. The Apache Software Foundation released Hadoop
1.0 in December 2011. Hadoop is written in the Java programming language.
o MongoDB: MongoDB is another important big data technology for storage. Relational
and RDBMS properties do not apply to MongoDB because it is a NoSQL database; unlike
traditional RDBMS databases, it does not use a structured query language and instead
stores schema-flexible documents.
The structure of data storage in MongoDB also differs from traditional RDBMS
databases, which enables MongoDB to hold massive amounts of data. It is based on a
simple, cross-platform, document-oriented design: the database stores JSON-like
documents with flexible schemas. This makes it a good option for operational data
storage, as seen in many financial organizations. As a result, MongoDB is replacing
traditional mainframes and offering the flexibility to handle a wide range of high-volume
data types in distributed architectures.
MongoDB Inc. introduced MongoDB in February 2009. It is written with a combination of
C++, Python, JavaScript, and Go.
o RainStor: RainStor is a popular database management system designed to manage and
analyze organizations' Big Data requirements. It uses deduplication strategies that help
manage storing and handling vast amounts of data for reference.
RainStor was developed in 2004 by the RainStor software company. It operates much like
SQL. Companies such as Barclays and Credit Suisse use RainStor for their big data
needs.
o Hunk: Hunk is mainly helpful when data needs to be accessed in remote Hadoop clusters
using virtual indexes. It lets us use the Splunk Search Processing Language (SPL) to
analyze the data. Also, Hunk allows us to report on and visualize vast amounts of data
from Hadoop and NoSQL data sources.
Hunk was introduced in 2013 by Splunk Inc. It is based on the Java programming
language.
o Cassandra: Cassandra is one of the leading big data technologies among the list of top
NoSQL databases. It is open-source, distributed and has extensive column storage
options. It is freely available and provides high availability without fail. This ultimately
helps in the process of handling data efficiently on large commodity groups. Cassandra's
essential features include fault-tolerant mechanisms, scalability, MapReduce support,
distributed nature, eventual consistency, query language property, tunable consistency,
and multi-datacenter replication, etc.
Cassandra was originally developed at Facebook for its inbox search feature and was
released as open source in 2008; it is now an Apache Software Foundation project. It is
written in the Java programming language.
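
As referenced in the Hadoop entry above, below is a classic MapReduce word-count sketch written as a Hadoop Streaming mapper and reducer in Python. The file names are illustrative; on a real cluster the pair would be submitted with the hadoop-streaming jar, for example: hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input <hdfs input> -output <hdfs output>.

# mapper.py -- emits "word<TAB>1" for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word.lower()}\t1")

# reducer.py -- Hadoop sorts the mapper output by key, so each word's counts arrive together
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")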

Data Mining

Let us now discuss leading Big Data Technologies that come under Data Mining:

o Presto: Presto is an open-source and a distributed SQL query engine developed to run
interactive analytical queries against huge-sized data sources. The size of data sources
can vary from gigabytes to petabytes. Presto helps in querying the data in Cassandra,
Hive, relational databases and proprietary data storage systems.
Presto is a Java-based query engine that was developed at Facebook and released as open
source in 2013. Companies like Repro, Netflix, Airbnb, Facebook and Checkr are using this
big data technology and making good use of it.
o RapidMiner: RapidMiner is defined as the data science software that offers us a very
robust and powerful graphical user interface to create, deliver, manage, and maintain
predictive analytics. Using RapidMiner, we can create advanced workflows and scripting
support in a variety of programming languages.
RapidMiner is a Java-based centralized solution developed in 2001 by Ralf
Klinkenberg, Ingo Mierswa, and Simon Fischer at the Technical University of
Dortmund's AI unit. It was initially named YALE (Yet Another Learning Environment).
A few sets of companies that are making good use of the RapidMiner tool are Boston
Consulting Group, InFocus, Domino's, Slalom, and Vivint.SmartHome.
o ElasticSearch: When it comes to finding information, Elasticsearch is an essential tool. It
is typically combined with the other main components of the ELK stack, Logstash and
Kibana. In simple terms, Elasticsearch is a search engine based on the Lucene library and
works similarly to Solr. It provides a distributed, multi-tenant-capable, full-text search
engine over schema-free JSON documents with an HTTP web interface (a small indexing
and search sketch appears after this list).
ElasticSearch is primarily written in a Java programming language and was developed in
2010 by Shay Banon. Now, it has been handled by Elastic NV since 2012. ElasticSearch
is used by many top companies, such as LinkedIn, Netflix, Facebook, Google, Accenture,
StackOverflow, etc.
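
As mentioned in the ElasticSearch entry above, here is a minimal Python sketch that stores one JSON document and runs a full-text match query using Elasticsearch's HTTP interface via the requests library; the host, index name and document fields are assumptions for the example, and authentication is omitted.

import requests

ES = "http://localhost:9200"  # assumed local Elasticsearch node

# Index (store) a schema-free JSON document; refresh=true makes it searchable immediately
doc = {"title": "Introduction to Big Data", "body": "Hadoop and Spark process large data sets."}
requests.put(f"{ES}/articles/_doc/1", params={"refresh": "true"}, json=doc).raise_for_status()

# Full-text search: find documents whose body matches "spark"
query = {"query": {"match": {"body": "spark"}}}
resp = requests.get(f"{ES}/articles/_search", json=query)
resp.raise_for_status()

for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])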

Data Analytics

Now, let us discuss leading Big Data Technologies that come under Data Analytics:

o Apache Kafka: Apache Kafka is a popular streaming platform, primarily known for its
three core capabilities: publishing and subscribing to streams of records, storing those
streams durably, and processing them as they arrive. It is referred to as a distributed
streaming platform and acts as an asynchronous messaging broker that can ingest and
perform data processing on real-time streaming data. This platform is very similar to an
enterprise messaging system or messaging queue.
Besides, Kafka also provides a retention period, and data can be transmitted through a
producer-consumer mechanism. Kafka has received many enhancements to date and
includes additional capabilities such as the schema registry, KTables, and KSQL. It is
written in Java and Scala and was originally developed at LinkedIn before being
open-sourced through the Apache Software Foundation in 2011. Some top companies using
the Apache Kafka platform include Twitter, Spotify, Netflix, Yahoo, and LinkedIn.
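The producer-consumer mechanism described above can be sketched with the kafka-python package; the broker address and topic name are assumptions for illustration only:

from kafka import KafkaProducer, KafkaConsumer

# Producer: publish a few events to a topic on an assumed local broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("clickstream", value=f"event-{i}".encode("utf-8"))
producer.flush()

# Consumer: subscribe to the same topic and read the stream from the beginning.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.offset, message.value.decode("utf-8"))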
o Splunk: Splunk is known as one of the popular software platforms for capturing,
correlating, and indexing real-time streaming data in searchable repositories. Splunk can
also produce graphs, alerts, summarized reports, data visualizations, and dashboards, etc.,
using related data. It is mainly beneficial for generating business insights and web
analytics. Besides, Splunk is also used for security purposes, compliance, application
management and control.
Splunk Inc. first released Splunk in 2004. It is written using a combination of AJAX,
Python, C++ and XML. Companies such as Trustwave, QRadar, and 1Labs are making
good use of Splunk for their analytical and security needs.
o KNIME: KNIME is used to draw visual data flows, execute specific steps and analyze
the obtained models, results, and interactive views. It also allows us to execute all the
analysis steps altogether. It consists of an extension mechanism that can add more
plugins, giving additional features and functionalities.
KNIME is based on Eclipse and written in a Java programming language. It was
developed in 2008 by KNIME Company. A list of companies that are making use of
KNIME includes Harnham, Tyler, and Paloalto.
o Spark: Apache Spark is one of the core technologies in the list of big data technologies.
It is one of those essential technologies which are widely used by top companies. Spark is
known for offering In-memory computing capabilities that help enhance the overall speed
of the operational process. It also provides a generalized execution model to support more
applications. Besides, it includes top-level APIs (e.g., Java, Scala, and Python) to ease the
development process.
Also, Spark allows users to process and handle real-time streaming data using batching
and windowing techniques, and it builds Datasets and DataFrames on top of the RDDs
provided by Spark Core. Libraries such as Spark MLlib, GraphX, and SparkR help analyze
and process machine learning and data science workloads. Spark is written using Java,
Scala, Python and R. It was originally developed at UC Berkeley's AMPLab in 2009 and
later became an Apache Software Foundation project. Companies like Amazon, ORACLE,
CISCO, VerizonWireless, and Hortonworks are using this big data technology and making
good use of it.
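A minimal PySpark sketch of the DataFrame API and in-memory caching described above (the data and column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("BigDataDemo").getOrCreate()

# Build a DataFrame (backed by an RDD under the hood).
df = spark.createDataFrame(
    [("electronics", 120.0), ("books", 35.5), ("electronics", 80.0)],
    ["category", "amount"],
)

# Cache in memory so that repeated queries over the same data stay fast.
df.cache()
df.groupBy("category").agg(F.sum("amount").alias("total")).show()

spark.stop()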
o R-Language: R is defined as the programming language, mainly used in statistical
computing and graphics. It is a free software environment used by leading data miners,
practitioners and statisticians. The language is primarily beneficial in the development of
statistics-based software and data analytics.
R 1.0.0 was released in February 2000, and the language is maintained by the R Foundation.
Its implementation is written primarily in C, Fortran, and R itself.
Companies like Barclays, American Express, and Bank of America use R-Language for
their data analytics needs.
o Blockchain: Blockchain is a technology that can be used in several applications related
to different industries, such as finance, supply chain, manufacturing, etc. It is primarily
used in processing operations like payments and escrow. This helps in reducing the risks
of fraud. Besides, it enhances overall transaction processing speed, increases financial
privacy, and helps internationalize markets. Additionally, it is also used to fulfill the needs
of shared ledgers, smart contracts, privacy, and consensus in any business network
environment.
Blockchain technology was first introduced in 1991 by two researchers, Stuart
Haber and W. Scott Stornetta. However, blockchain has its first real-world application
in Jan 2009 when Bitcoin was launched. A blockchain is, in essence, a specific type of
distributed database, and implementations are commonly written in languages such as C++,
Python, and JavaScript. ORACLE, Facebook, and MetLife are a few of those top companies
using Blockchain technology.
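The core idea of hash-linked blocks can be illustrated with a few lines of Python; this is only a toy sketch of the data structure, not a real distributed ledger:

import hashlib
import json
import time

def make_block(data, previous_hash):
    # Each block records its data plus the hash of the previous block.
    block = {"timestamp": time.time(), "data": data, "previous_hash": previous_hash}
    serialized = json.dumps(block, sort_keys=True).encode("utf-8")
    block["hash"] = hashlib.sha256(serialized).hexdigest()
    return block

# Build a tiny chain of payment-like records (values are illustrative).
chain = [make_block("genesis", previous_hash="0" * 64)]
chain.append(make_block({"from": "A", "to": "B", "amount": 10}, chain[-1]["hash"]))
chain.append(make_block({"from": "B", "to": "C", "amount": 4}, chain[-1]["hash"]))

# Verify the chain: every block must reference the hash of its predecessor,
# so tampering with any earlier record breaks all later links.
for prev, curr in zip(chain, chain[1:]):
    assert curr["previous_hash"] == prev["hash"]
print("chain is consistent, length =", len(chain))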

Data Visualization

Let us discuss leading Big Data Technologies that come under Data Visualization:

o Tableau: Tableau is one of the fastest and most powerful data visualization tools used by
leading business intelligence industries. It helps in analyzing the data at a very faster
speed. Tableau helps in creating the visualizations and insights in the form of dashboards
and worksheets.
Tableau is developed and maintained by Tableau Software, a company founded in 2003 and
now part of Salesforce. It is written using multiple languages, such as Python, C, C++, and Java.
Competing BI platforms in this space include Cognos, Qlik, and ORACLE Hyperion.
o Plotly: As the name suggests, Plotly is best suited for plotting or creating graphs and
relevant components at a faster speed in an efficient way. It consists of several rich
libraries and APIs, such as MATLAB, Python, Julia, REST API, Arduino, R, Node.js,
etc. These make it easy to build and style interactive graphs from Jupyter Notebook and PyCharm.
Plotly was introduced in 2012 by the Plotly company. It is based on JavaScript. Paladins and
Bitbank are some of those companies that are making good use of Plotly.
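A minimal sketch of Plotly's Python API (plotly.express); the data values are made up purely for illustration:

import plotly.express as px

fig = px.bar(
    x=["2019", "2020", "2021", "2022"],
    y=[1.2, 2.3, 3.1, 4.8],
    labels={"x": "Year", "y": "Data volume (ZB)"},
    title="Illustrative growth of stored data",
)
fig.show()  # renders an interactive chart in a notebook or browser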

Emerging Big Data Technologies

Apart from the above mentioned big data technologies, there are several other emerging big data
technologies. The following are some essential technologies among them:

o TensorFlow: TensorFlow combines multiple comprehensive libraries, flexible ecosystem
tools, and community resources that help researchers implement the state of the art in
Machine Learning. Besides, this ultimately allows developers to build and deploy
machine learning-powered applications in specific environments.
TensorFlow was introduced in 2015 by the Google Brain Team. It is mainly based on C++,
CUDA, and Python. Companies like Google, eBay, Intel, and Airbnb are using this
technology for their business requirements.
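A minimal Keras sketch of building and training a model with TensorFlow; the tiny synthetic dataset (learning y = 2x + 1) is illustrative only:

import numpy as np
import tensorflow as tf

# Synthetic data: a handful of (x, y) points on the line y = 2x + 1.
x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]], dtype=np.float32)
y = 2.0 * x + 1.0

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(optimizer="sgd", loss="mse")
model.fit(x, y, epochs=200, verbose=0)

print(model.predict(np.array([[5.0]], dtype=np.float32)))  # should be close to 11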
o Beam: Apache Beam consists of a portable API layer that helps build and maintain
sophisticated parallel-data processing pipelines. Apart from this, it also allows the
execution of built pipelines across a diversity of execution engines or runners.
Apache Beam was introduced in June 2016 by the Apache Software Foundation. It is
written in Python and Java. Some leading companies like Amazon, ORACLE, Cisco, and
VerizonWireless are using this technology.
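A minimal Beam pipeline sketch using the Python SDK and the local DirectRunner; the same pipeline code can be handed to other runners unchanged, which is the portability point made above:

import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["big data", "big ideas", "data pipelines"])
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "CountPerWord" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )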
o Docker: Docker is defined as the special tool purposely developed to create, deploy, and
execute applications easier by using containers. Containers usually help developers pack
up applications properly, including all the required components like libraries and
dependencies. Typically, containers bind all components and ship them all together as a
package.
Docker was introduced in March 2013 by Docker Inc. It is based on the Go language.
Companies like Business Insider, Quora, Paypal, and Splunk are using this technology.
o Airflow: Airflow is a technology that is defined as a workflow automation and
scheduling system. This technology is mainly used to control, and maintain data
pipelines. It contains workflows designed using the DAGs (Directed Acyclic Graphs)
mechanism and consisting of different tasks. The developers can also define workflows
in codes that help in easy testing, maintenance, and versioning.
Airflow was originally created at Airbnb and became a top-level Apache Software
Foundation project in 2019. It is based on the Python language. Companies like Checkr and Airbnb are using this leading
technology.
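A minimal DAG sketch written against the Airflow 2.x Python API; the dag_id, schedule and task bodies are illustrative assumptions:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling raw data")

def load():
    print("writing cleaned data")

with DAG(
    dag_id="daily_etl_demo",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # the DAG edge: extract runs before load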
o Kubernetes: Kubernetes is defined as a vendor-agnostic cluster and container
management tool made open-source in 2014 by Google. It provides a platform for
automation, deployment, scaling, and application container operations in the host
clusters.
Kubernetes reached its 1.0 release in July 2015, when it was donated to the newly formed
Cloud Native Computing Foundation. It is written in the Go language. Companies like American Express, Pear Deck,
PeopleSource, and Northwestern Mutual are making good use of this technology.

These are some of the emerging technologies. The list is not exhaustive, however, because the
ecosystem of big data is constantly evolving. That is why new technologies appear at a very fast
pace based on the demand and requirements of IT industries.

8. Introduction to Hadoop
Hadoop is an Apache open-source framework written in Java that allows distributed processing
of large datasets across clusters of computers using simple programming models. A Hadoop
framework application works in an environment that provides distributed storage and
computation across clusters of computers. Hadoop is designed to scale up from a single server
to thousands of machines, each offering local computation and storage.
Hadoop Architecture
At its core, Hadoop has two major layers, namely −
 Processing/Computation layer (MapReduce), and
 Storage layer (Hadoop Distributed File System, HDFS).
How Does Hadoop Work?
It is quite expensive to build bigger servers with heavy configurations that handle large-scale
processing. As an alternative, you can tie together many commodity computers, each with a
single CPU, into a single functional distributed system; practically, the clustered machines can
read the dataset in parallel and provide much higher throughput. Moreover, this is cheaper than
one high-end server. So the first motivational factor behind Hadoop is that it runs across
clustered, low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the following core tasks
that Hadoop performs (a minimal word-count sketch in the MapReduce style follows the list below) −
 Data is initially divided into directories and files. Files are divided into uniformly sized
blocks of 128 MB or 64 MB (preferably 128 MB).
 These files are then distributed across various cluster nodes for further processing.
 HDFS, being on top of the local file system, supervises the processing.
 Blocks are replicated for handling hardware failure.
 Checking that the code was executed successfully.
 Performing the sort that takes place between the map and reduce stages.
 Sending the sorted data to a certain computer.
 Writing the debugging logs for each job.
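The map-sort-reduce flow listed above can be sketched in the Hadoop Streaming style, where the mapper and reducer are ordinary scripts reading from standard input. The sketch below puts both phases in one Python file and simulates the shuffle/sort step locally; the file names and the hadoop command in the comment are illustrative assumptions:

from itertools import groupby

def mapper(lines):
    # Map phase: emit (word, 1) for every word in the input blocks.
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    # Reduce phase: pairs arrive sorted by key, because Hadoop sorts
    # the map output before handing it to the reducers.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # On a real cluster the mapper and reducer would be separate scripts run via
    # Hadoop Streaming, for example:
    #   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
    #       -input /data/in -output /data/out
    lines = ["big data needs big clusters", "hadoop stores big data"]
    sorted_pairs = sorted(mapper(lines))  # stands in for Hadoop's shuffle/sort
    for word, total in reducer(sorted_pairs):
        print(f"{word}\t{total}")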
Advantages of Hadoop
 Hadoop framework allows the user to quickly write and test distributed systems. It is
efficient, and it automatically distributes the data and work across the machines, in turn
utilizing the underlying parallelism of the CPU cores.
 Hadoop does not rely on hardware to provide fault-tolerance and high availability
(FTHA), rather Hadoop library itself has been designed to detect and handle failures at the
application layer.
 Servers can be added or removed from the cluster dynamically and Hadoop continues to
operate without interruption.
 Another big advantage of Hadoop is that apart from being open source, it is compatible
with all platforms since it is Java based.
9. Open Source Technologies
Open source is a kind of licensing agreement that allows users to freely inspect, copy, modify
and redistribute a software, and have full access to the underlying source code. Simply put, open
source refers to software created by a developer who pledges to make the entire software source
code available to users. On the other hand, proprietary or “closed source” software has source
code that only the creator can legally copy, inspect, and modify. Top Open source technologies
celebrate the principle of open exchange, shared participation, transparency, and community-
oriented development.
1. Mozilla Firefox
Mozilla Firefox is a free and open-source internet browser that offers numerous plugins which
can be accessed with a single mouse click. Available for Android, iOS, Linux, and Windows,
Mozilla is free to use, modify and redistribute. Mozilla was born about 20 years ago out of the
open-source software movement, and over the years, it reshaped the technology industry and the
way social networks and operating systems operate. Today, open-source is mainstream, and it
powers tech giants like Google, Facebook, and even Microsoft.
2. LibreOffice
LibreOffice is a free and open-source alternative to Microsoft Office. It is a complete office suite
like MS Office, with which you can offer presentations, documents, spreadsheets, and databases.
LibreOffice is used by millions of people all over the world. The clean interface and feature-rich
tools of this open-source software allow users to use their creativity and enhance productivity.
Unlike Microsoft Office, which is available at a price, LibreOffice is accessible totally free. This
makes it the most versatile open-source office suite available in the market. The software is
available in Mac, Linux, and Windows and is compatible with all kinds of document formats,
including Microsoft Word, Excel, PowerPoint, and Publisher.
3. GIMP
The photo editing tool GIMP is one of the most popular and best maintained open-source
software available. It offers image editing, filters, effects, and flexibility features like some of the
expensive image editing tools, yet it is completely free. With GIMP, you can use layers, filters,
automatic photo enhancement features, and create new graphic design elements easily. Available
across different operating systems, including Windows, Mac, and Linux, GIMP has a fully
customizable interface and allows users to download plug-ins created by the GIMP community.
It is a favored photo editor of photographers, graphic designers, and illustrators.
4. VLC Media Player
The VLC multimedia player is a free open source software used for video, audio, and media
files. Most users use VLC to play discs, webcams, streams, and devices. VLC media player
allows optimizing multimedia files for specific hardware configuration and offers numerous
extensions for users to create customized designs. On top of it, the software runs on different
platforms like Android, Mac OS X, Linux, Windows, iOS.
5. Shotcut
This free, open-source video editing software offers advanced editing features like premium
editors. Preferred for its ability to edit every format of audio, video, or photo media, Shotcut also
offers HDMI preview and capture, a plethora of codecs, and non-destructive audio and video
editing. This allows users to compile effects without any loss in video quality.
6. Brave
The open-source Brave web browser is designed to keep browsing activity private by
automatically disabling website trackers and blocking ads. Brave offers a faster and more secure
browsing experience than with Google Chrome, plus users can access most Google Chrome
extensions.
7. Linux
Linux is the most in-demand open-source operating system available in the market. Most
commonly used on desktops and Android devices, Linux comes absolutely free and is extremely
customizable. The reasons for the immense popularity of Linux are its user-friendliness, strong
security features, and excellent community support.
8. Python
One of the most popular programming and scripting languages used by software
developers, Python, is an open-source software that is free to use and distribute. It is powerful
and fast, easy to learn and use, and runs everywhere, which is why developers choose Python as
one of the top open-source technologies available.
9. PHP
PHP is an open-source scripting language used for creating dynamic and interactive web pages
and various digital platforms. PHP can be embedded into HTML. What sets PHP apart from other
languages is that it is extremely simple for a beginner to learn but offers many advanced features
for professional developers. Some of the most powerful websites like Slack and Spotify have
been powered by PHP.
10. GNU Compiler Collection
The GNU Compiler Collection is a set of compilation and development tools used for software
development in C, C++, Ada, Fortran, and other programming languages. This free software
provides regular, high-quality releases which work well with native and cross targets. The
sources of the GNU Compiler collection are available for free via Git and weekly snapshots. The
collection is available on Linux, Windows, and other operating systems.
10. Cloud and Big Data
1. Big Data :
Big data refers to data which is huge in size and also growing rapidly over time. Big data
includes structured data, unstructured data, as well as semi-structured data. Big data cannot be
stored and processed with traditional data management tools; it needs specialized big data
management tools. It refers to complex and large information assets characterized by the 5 V's:
Volume, Velocity, Veracity, Value and Variety. It includes data storage, data analysis, data
mining and data visualization.
Examples of the sources where big data is generated include social media data, e-commerce
data, weather station data, IoT sensor data, etc.
Characteristics of Big Data :
 Variety of Big data – Structured, unstructured, and semi structured data
 Velocity of Big data – Speed of data generation
 Volume of Big data – Huge volumes of data that is being generated
 Value of Big data – Extracting useful information and making it valuable
 Variability of Big data – Inconsistency which can be shown by the data at times.
Advantages of Big Data :
 Cost Savings
 Better decision-making
 Better Sales insights
 Increased Productivity
 Improved customer service.
Disadvantages of Big Data :
 Incompatible tools
 Security and Privacy Concerns
 Need for cultural change
 Rapid change in technology
 Specific hardware needs.
2. Cloud Computing :
Cloud computing refers to the on-demand availability of computing resources over the internet.
These resources include servers, storage, databases, software, analytics, networking and
intelligence, and all of them can be used as per the requirement of the customer. In cloud
computing, customers pay as per use. It is very flexible, and resources can be scaled easily
depending upon the requirement. Instead of buying any IT resources physically, all resources
can be availed on demand from cloud vendors. Cloud computing has three service models, i.e.
Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS).
Examples of vendors who provide cloud computing services are Amazon Web Services (AWS),
Microsoft Azure, Google Cloud Platform, IBM Cloud Services, etc.
Characteristics of Cloud Computing :
 On-Demand availability
 Accessible through a network
 Elastic Scalability
 Pay as you go model
 Multi-tenancy and resource pooling.
Advantages of Cloud Computing :
 Back-up and restore data
 Improved collaboration
 Excellent accessibility
 Low maintenance cost
 On-Demand Self-service.
Disadvantages of Cloud Computing :
 Vendor lock-in
 Limited Control
 Security Concern
 Downtime due to various reason
 Requires good Internet connectivity.
Difference between Big Data and Cloud Computing :
 Big data refers to the large, complex data sets themselves and the tools used to store, process and analyze them, whereas cloud computing refers to the on-demand delivery of computing resources (servers, storage, databases, analytics, etc.) over the internet.
 Big data is characterized by the 5 V's (Volume, Velocity, Variety, Veracity, Value); cloud computing is characterized by on-demand availability, elastic scalability, multi-tenancy and a pay-as-you-go model.
 Typical big data concerns are storage, analysis, mining and visualization of data; typical cloud computing concerns are vendor lock-in, limited control, downtime and the need for good internet connectivity.
 The two complement each other: cloud platforms such as AWS, Microsoft Azure and Google Cloud are commonly used to store and process big data.
11. Mobile Business Intelligence
What is mobile Business Intelligence (BI)?
You’re probably familiar with the term business intelligence (BI), but if not, business
intelligence refers to the technology-driven process of analyzing data and providing actionable
insights that can be used to make intelligent business decisions. In other words, BI is about
having the right data at the right time to make the right decision.

Consequently, Mobile Business Intelligence (BI) refers to the ability to access and perform BI-
related data analysis on mobile devices and tablets. It makes it easier to display KPIs, business
metrics, and dashboards. Mobile BI empowers users to use and receive the same features and
capabilities as those in the desktop / app-based BI software.
Why is mobile BI important?
Businesses these days possess an abundance of data. In this fast-paced environment, everyone
needs real-time data access to make data-driven decisions anytime and anywhere.

The number of organizations using mobile apps like SaaS applications for critical
business processes is increasing every day. Whether you are a CEO, a sales person, a digital
marketer, a department manager, or another employee, mobile BI can help you increase
productivity, improve the decision-making process, and overall boost your business.

Advantages of mobile BI
Accessibility
Having company insights at your fingertips is the most valuable advantage of mobile BI. You are
not limited to one computer in one location, but instead, you can access important data
information on your mobile device at any time and from any location. Having real-time data
insights always available helps improve the overall productivity of your daily operations.

Improved decision–making
Mobile BI apps speed up the decision-making process. When decisions must be made, or when
actions must be taken on the spot and at the moment, mobile BI provides up-to-the-minute
insights based on data to help users when they need it the most.

Stay ahead of your competitors


Access to real-time data helps in seeing business opportunities sooner, reacting to changes in
market conditions in a timely manner, as well as increasing the opportunity to up-sell and cross-
sell. Deploying a mobile BI solution makes you more flexible and more adaptable to business
shifts.

Mobile BI solutions
Every mobile device can be used to display and work with data. But of course, there are some
significant differences – the size of the screen is obviously difference #1. For example,
some data visualizations require more screen space, while others can easily fit in the width of a
smartphone or a tablet.

There are a couple of ways to implement content on mobile devices with those being the most
common:

Web page – Every mobile device includes a web browser that can access almost any web page.
Depending on some factors, the quality of the accessed page could be terrible, acceptable, or
great. In this scenario, BI developers need to check how their content renders on mobile devices
when creating reports and other data visualizations. This means that they should design the
application specifically for mobile use.

HTML5 site – Same as the web page, but with some improvements. HTML5 enables RIA (Rich
Internet Application) content to be displayed across all types of mobile devices without relying
on proprietary standards. The advantages of HTML5 are functions such as zooming, double-
tapping, and pinching, and the user doesn't have to install an app on their device to be able to
use it.

Native app – These are applications that are downloaded and installed directly on mobile
devices. The application's software is tailored to the OS of the mobile device. It is the most
difficult and most expensive solution for manufacturers to support mobile BI, but it enables
interactive and enhanced use of analytics content.

Why is a Native Mobile BI Solution So Important?


When you are creating a dashboard you shouldn’t have to worry if this dashboard will look good
from your computer to your phone to your tablet. It should just work and look beautiful on all
platforms! That is the beauty of working with a native solution. You as the creator don’t have to
think about designing your dashboards for mobile – it just happens with the same look and feel
across any device.

Mobile markets are forever changing their devices, operating systems, and screen sizes – so why
should you have to think about designing for so many types and always updating them?

These are the top 3 advantages of mobile native apps:

Best Performance
With native mobile app development, the app is created and optimized for a specific platform.
The result is that the app has a higher level of performance. Native apps are very fast and
responsive because they are built for that specific platform and are compiled using a platform’s
core programming language and APIs. This also creates greater efficiency. The device stores
the app and this allows the software to leverage the device’s processing speed. So when users
navigate through a native mobile app, load times are faster because the content and visual
elements are already stored on their phone.

In contrast, a web app operates as a series of calls to and from remote web pages, and its speed is
constrained by all those Internet connections.

More Secure and Reliable


Web apps rely on a variety of browsers and underlying technologies such as JavaScript, HTML5,
and CSS. You may find more security and performance holes using web apps because of their
non-standard nature. Native apps, on the other hand, benefit from the more proactive security
and performance upgrades of the platform itself. Native mobile apps also allow companies to
take advantage of mobile device management solutions, even providing remote management
controls of apps on individual devices. Developing a native mobile app is a great way to
guarantee your users reliable data protection.

More Interactive and Intuitive


The most advantageous benefit to native mobile apps is the superior user experience. Native apps
are created specifically for an operating system. They adhere to the guidelines that ultimately
enhance and align the user experience with the specific operating system. As a result, the flow of
the app is more natural, as there are specific UI standards for each platform. This allows the user
to learn the app quickly, for example how to delete an element. Adhering to specific guidelines
reduces the learning curve and allows users to interact with apps using actions and gestures
they're already familiar with.


Another benefit of a native app is that, because it is developed for a particular platform, it can
fully utilize the software and the operating systems’ features. The app can directly access the
hardware of the device such as the GPS, camera, microphone, etc. so they are faster in execution,
which ultimately results in a better user experience.

12. Crowd Sourcing Analytics

Crowdsourcing is a sourcing model in which an individual or organization obtains assistance
from a large, relatively open, and rapidly evolving group of people in the form of ideas,
micro-tasks, finances, and so on. Crowdsourcing is most commonly associated with the use of
the internet to attract a large group of people to divide tasks or achieve a goal. Jeff Howe and
Mark Robinson coined the term in 2005.

Crowdsourcing can assist various types of organizations in obtaining new ideas and solutions,
deeper consumer engagement, task optimization, and a variety of other benefits.

Crowdsourcing entails a large number of dispersed participants contributing or producing goods
or services (such as ideas, votes, microtasks, and finances) for payment or as volunteers. In
today's crowdsourcing, digital platforms are frequently used to attract and divide work among
participants in order to achieve a cumulative result.

Crowdsourcing, however, is not limited to online activity, and there are numerous historical
examples of crowdsourcing. The term "crowdsourcing" is a combination of the words "crowd"
and "outsourcing."

Crowdsourcing allows companies to tap into a world of ideas and lets many people work
through a rapid design process. You can outsource to a large group of people to ensure that your
products and services are delivered correctly.

Crowdsourcing is very powerful because it allows for a wide range of participation from people
at low or no cost. Suggestions are provided by experienced professionals and volunteers who are
compensated only if their ideas are implemented; the model relies on creativity that people are
willing to share. All they require is an opportunity to participate. This is especially true when
people use the internet for crowdsourcing. Many people, for example, create and post videos on
YouTube.

There are numerous avenues for crowdsourcing, such as enlisting volunteers, blogs, hotlines,
distribution incentives, free products, and so on. Companies such as IdeaScale and InnoCentive
specialize in delivering the crowd, allowing you to directly tap into a predefined group of people
willing to help you solve your problem or design your product.

When it comes to how crowdsourcing works, it has a very low cost. Everyone should invest in
crowdsourcing so that they can tap into a global pool of creativity. This also assists the company
in driving motivation, mass collaboration, and innovation while keeping pace with the
competition.
Step 1: Recognize "The Crowd":

Before you can effectively crowdsource, you must first identify the crowd. Most of the time, it
is not as simple as "every employee." While you may want to send out broad-based surveys
(like the Boeing initiative mentioned above), you may also want to narrow your focus for
information gathering.

To determine which topics are relevant, you may want to survey only engineers, or just
salespeople to see how they can improve their numbers the most. When crowdsourcing content,
it's critical to limit your "crowd" to Subject Matter Experts - you don't want every Tom, Dick,
and Harriet telling you how to perform a delicate engineering operation! You will be able to do
so in many cases.

Step 2: Collect Information:

Anyone interested in crowdsourcing training should acquire or develop an internal platform that
allows for easy submission of crowdsourced content, as well as efficient content management
and ease of access for those seeking information from the crowd. Improved technology allows
for greater connectivity and efficiency within organizations. Training has enormous potential
for use in a variety of business functions.

So, the first step is to make it simple for people to provide you with the information you require.
Once you've cleared that (not insignificant) hurdle, you'll need to determine what type of
information you're looking for.

When crowdsourcing topics, it's best to keep things simple. In the Boeing example, employees
were given four options with the option to add additional ideas. A simpler approach not only
makes it easier for employees to respond, but it also keeps everything in perspective - for both
them and you. They'll know they're not being given free rein to re-invent the company's training
department after four questions. When it comes time to prioritize topics, it also allows you to
highlight your expertise.

Step 3: Validate and Curate:

No matter how good the information you gather is, it is always your responsibility to ensure that
your audience receives the right information in the right way. That is why, as a training
professional, crowdsourcing will never be able to replace you. YOU are the true source of
effective crowdsourcing when you:

 Sort and organize information.
 Align group preferences with business objectives.
 Follow up on content - confirm with management, other SMEs, and so on.
 Edit and shape the content.
 Create a solid educational plan.

Advantages of Crowdsourcing:

As previously stated, crowdsourcing greatly aids business growth. Crowdsourcing, whether for
a business idea or a logo design, directly engages people while saving money and energy.
Crowdsourced marketing will undoubtedly gain traction in the coming years as the world
embraces technology more quickly.

Companies can gain access to amazing suggestions for a new product or service, or for a new
solution to a difficult problem, by posing a question to a diverse talent pool. This not only aids in
problem-solving but also allows groups to feel connected to businesses and organizations.
Building this contributor community can have significant marketing, brand visibility, and
customer loyalty benefits.

Crowdsourcing has numerous other advantages:

1. Lower costs: While winning ideas should be rewarded, providing these rewards is usually
much less expensive than formally hiring people to solve problems.
2. Greater speed: Using a larger pool of people can accelerate the problem-solving process,
particularly when completing a large number of small tasks in real time.
3. More diversity: Some businesses (particularly smaller businesses) may lack internal
diversity. They can benefit from others with different backgrounds, values, and life
experiences by crowdsourcing ideas.
4. Marketing and media coverage: As demonstrated by My Starbucks Idea and Lego Ideas,
crowdsourcing can be an excellent and cost-effective source of marketing and media coverage.
5. Evolving innovation: Innovation is required everywhere, and in this rapidly changing world,
innovation plays a significant role. Crowdsourcing facilitates the collection of innovative ideas
from people in various fields, thereby helping businesses in all fields to grow.
6. Increased efficiency: Crowdsourcing has increased the efficiency of business models by
funding several expert ideas.

Crowdsourcing can be used to find solutions to a wide range of problems. This can range from
something as simple as a band asking its fans which cities they should visit on their next tour to
more ambitious projects like genetic researchers seeking assistance in sequencing the human
genome.

The breadth and diversity of social media also offer enormous potential for crowdsourcing, as
demonstrated by the Obama administration's use of Twitter to solicit questions for town hall
debates and football clubs asking fans to vote for the starting lineup ahead of each match.

Crowdsourcing can also take the form of idea competitions, such as Ideas for Action, which
provides a platform for students and young professionals to submit solutions to global innovation
challenges.
13. Inter and Trans Firewall Analytics

What is Firewall?

A firewall is a sort of network security hardware or software application that monitors and filters
incoming and outgoing network traffic according to a set of security rules. It serves as a barrier
between internal private networks and public networks (such as the public Internet).

A firewall’s principal goal is to allow non-threatening communication while blocking dangerous
or undesirable data transmission in order to protect the computer from viruses and attacks. A
firewall is a cybersecurity solution that filters network traffic and helps users block malicious
traffic from the Internet before it can compromise their machines.

Firewall is Hardware or Software?

A common question is whether a firewall is hardware or software. As previously noted, a
firewall can be a network security appliance or a computer software programme. This means
that the firewall is available at both the hardware and software levels, although it is preferable
to have both.

Need of Firewall:

Firewalls are used to protect against malware and network-based threats. They can also aid in the
prevention of application-layer assaults. These firewalls serve as a barrier or a gatekeeper. They
keep track of every connection our machine makes to another network. They only let data
packets pass through if the data is coming from or going to a trusted source designated by
the user.
Firewalls are built in such a way that they can quickly identify and counter threats throughout the
network. They can work with rules that have been set up to defend the network, as
well as conduct fast inspections to look for any unusual activity. In other words, the firewall can
be used as a traffic controller.

The following are some of the major dangers of not having a firewall:

Open to the public:

If a computer is not protected by a firewall, it allows unrestricted access from other networks.
This means it accepts any type of connection that anyone initiates. It is impossible to detect
threats or attacks going over our network in this circumstance. We render our devices
vulnerable to malicious users and other undesired sources if we don’t employ a firewall.

Data that has been lost or mutilated:

We’re leaving our devices open to everyone if we don’t use a firewall. This means that anyone
on the network can gain access to our device and take complete control over it.
Cybercriminals can easily destroy our data or use our personal information for their own gain
in this situation.

Network Crashes:

Anyone could gain access to our network and shut it down if we didn’t have a firewall. It may
prompt us to devote precious time and resources to restoring our network’s functionality.

As a result, it is critical to employ firewalls to protect our network, computer, and data from
unauthorized access.

What exactly is the work of a firewall?

A firewall system examines network traffic according to pre-set rules. The traffic is subsequently
filtered, and any traffic coming from untrustworthy or suspect sources is blocked. It only accepts
traffic that it has been configured to accept. Firewalls often intercept network traffic at a
computer’s port, or entry point. According to pre-defined security criteria, firewalls allow or
block particular data packets (units of communication carried over a digital network). Only
trusted IP addresses or sources are allowed to send traffic in.
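The rule-based filtering described above can be illustrated with a small Python sketch; the Packet type, rule list, and addresses are purely hypothetical and not a real firewall API:

from dataclasses import dataclass
import ipaddress

@dataclass
class Packet:
    src_ip: str
    dst_port: int
    protocol: str  # "tcp" or "udp"

# Ordered rules: the first matching rule wins; the last rule is a default deny.
RULES = [
    {"action": "allow", "src_net": "10.0.0.0/8", "dst_port": 443, "protocol": "tcp"},
    {"action": "allow", "src_net": "192.168.1.0/24", "dst_port": 22, "protocol": "tcp"},
    {"action": "deny", "src_net": "0.0.0.0/0", "dst_port": None, "protocol": None},
]

def filter_packet(packet):
    # Check the packet against each rule in order and apply the first match.
    src = ipaddress.ip_address(packet.src_ip)
    for rule in RULES:
        in_net = src in ipaddress.ip_network(rule["src_net"])
        port_ok = rule["dst_port"] is None or rule["dst_port"] == packet.dst_port
        proto_ok = rule["protocol"] is None or rule["protocol"] == packet.protocol
        if in_net and port_ok and proto_ok:
            return rule["action"]
    return "deny"

print(filter_packet(Packet("10.1.2.3", 443, "tcp")))     # allow: trusted source
print(filter_packet(Packet("203.0.113.9", 443, "tcp")))  # deny: untrusted source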
Firewall’s Functions

As previously established, the firewall acts as a gatekeeper. It examines all attempts to obtain
access to our operating system and blocks traffic from unidentified or unknown sources.

We can think of the firewall as a traffic controller since it operates as a barrier or filter between
the computer system and external networks (such as the public Internet). As a result, the major
function of a firewall is to protect our network and information by managing network traffic,
prohibiting unwanted incoming network traffic, and validating access by scanning network
traffic for dangerous things like hackers and viruses. Most operating systems (for example,
Windows OS) and security applications provide firewall capability by default. As a result, it’s a
good idea to make sure those options are enabled. We can also adjust the system’s security
settings to update automatically whenever new information becomes available.

Firewalls have grown in power, and now encompass a number of built-in functions and
capabilities:

 Preventing Network Threats
 Control based on the application and the user’s identity.
 Support for Hybrid Cloud.
 Performance that scales.
 Control and management of network traffic.
 Validation of access.
 Keep track of what happens and report on it.
Firewall’s Limitations:

Firewalls are the first line of defence when it comes to network security. However, the question
remains as to whether these firewalls are powerful enough to protect our gadgets from cyber-
attacks. It’s possible that the answer is “no.” When utilising the Internet, it is best to employ a
firewall system. Other defence mechanisms, on the other hand, should be used to assist safeguard
the network and data saved on the computer. Because cyber dangers are always changing, a
firewall should not be the only thing you think about when it comes to protecting your home
network.
Firewalls are obviously important as a security solution; nonetheless, they have significant
limitations:

 Firewalls are unable to prevent users from accessing dangerous websites, leaving the
organisation open to internal threats and attacks.
 Firewalls cannot prevent virus-infected data or software from being transmitted.
 Passwords can’t be protected by firewalls.
 If security rules are misconfigured, firewalls will not be able to protect you.
 Non-technical security vulnerabilities, such as social engineering, are not protected by firewalls.
 Firewalls cannot prevent or stop attackers using modems from dialling in to or out of the
internal network.
 Firewalls can’t protect a machine that has already been infected.

As a result, it’s a good idea to maintain all Internet-connected gadgets up to date. This includes
the most recent versions of operating systems, web browsers, apps, and other security software
(such as anti-virus). Furthermore, wireless router security should be a standard practice.
Changing the router’s name and password on a regular basis, evaluating security settings, and
setting up a guest network for visitors are all possibilities for protecting a router.

Types of Firewalls:

Different types of firewalls exist, each with its own structure and functionality. The following is
a list of some of the most prevalent firewall types:

 Proxy firewalls.
 Packet-filtering firewalls.
 Stateful Multi-layer Inspection (SMLI) firewalls.
 Unified Threat Management (UTM) firewalls.
