
BIG DATA ANALYTICS

UNIT-I
Big data:
What is Big Data?
Big Data is a collection of large datasets that cannot be processed using traditional computing
techniques. For example, the volume of data that Facebook or YouTube must collect and manage on a daily
basis falls under the category of Big Data. However, Big Data is not only about scale and volume; it also
involves one or more of the following aspects − Velocity, Variety, Volume, and Complexity.

Characteristics of Big Data

The 3 ‘V’s of Big Data are:

● Variety
● Velocity
● Volume

1) Variety

Variety of Big Data refers to the structured, unstructured, and semi-structured data that is gathered from
multiple sources. While in the past data could only be collected from spreadsheets and databases, today data
comes in an array of forms such as emails, PDFs, photos, videos, audio files, social media posts, and much more.

2) Velocity

Velocity essentially refers to the speed at which data is being created in real time. In a broader perspective, it
comprises the rate of change, the linking of incoming data sets arriving at varying speeds, and bursts of activity.

3) Volume

We already know that Big Data indicates huge ‘volumes’ of data being generated on a daily basis
from various sources such as social media platforms, business processes, machines, networks, human
interactions, etc. Such large amounts of data are stored in data warehouses.

Types of Big Data

Structured

By structured data, we mean data that can be processed, stored, and retrieved in a fixed format. It
refers to highly organized information that can be readily and seamlessly stored in and accessed from a
database by simple search engine algorithms. For instance, the employee table in a company database
is structured: the employee details, job positions, salaries, etc., are all present in an organized manner.
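As a small illustrative sketch (the table and column names here are invented, not taken from the text above), Python's standard sqlite3 module can hold and query such a fixed-schema employee table:

```python
import sqlite3

# A fixed schema: every row has exactly these columns in this format.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "emp_id INTEGER PRIMARY KEY, name TEXT, job_position TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(1, "Asha", "Analyst", 52000.0), (2, "Ravi", "Engineer", 61000.0)],
)

# Because the structure is fixed, a simple query retrieves data predictably.
for name, salary in conn.execute(
    "SELECT name, salary FROM employee WHERE salary > 55000"
):
    print(name, salary)   # Ravi 61000.0
```
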
Unstructured

Unstructured data refers to the data that lacks any specific form or structure whatsoever. This makes
it very difficult and time-consuming to process and analyze unstructured data. Email is an example of
unstructured data.

Semi-structured

Semi-structured data contains both of the formats mentioned above, that is, structured and
unstructured data. To be precise, it refers to data that has not been classified under a particular
repository (database), yet contains vital information or tags that segregate individual elements
within the data.
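A minimal sketch of semi-structured data, using invented field names and Python's json module, is shown below: each record carries tags (keys) that identify its elements, yet the records do not follow one fixed schema:

```python
import json

# Self-describing tags, but no fixed schema shared by all records: semi-structured.
records = [
    {"id": 1, "name": "Asha", "email": "asha@example.com"},
    {"id": 2, "name": "Ravi", "phones": ["+91-0000000000"], "city": "Kochi"},
]

for rec in records:
    print(json.dumps(rec))
    # The tags let us segregate individual elements even though the shape varies.
    print("email:", rec.get("email", "not provided"))
```
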

Sources of big data


The voluminous nature of big data makes it crucial for businesses to differentiate between the disparate big
data sources available, so that each can be used effectively.

Media as a big data source


Media is the most popular source of big data, as it provides valuable insights into consumer preferences and
changing trends. Since it is self-broadcast and crosses all physical and demographic barriers, it is the
fastest way for businesses to get an in-depth overview of their target audience, draw patterns and
conclusions, and enhance their decision-making. Media includes social media and interactive platforms, like
Google, Facebook, Twitter, YouTube, and Instagram, as well as generic media like images, videos, audio, and
podcasts that provide quantitative and qualitative insights on every aspect of user interaction.

Cloud as a big data source


Today, companies have moved beyond traditional data sources by shifting their data to the cloud. Cloud
storage accommodates structured and unstructured data and provides businesses with real-time information
and on-demand insights. The main attributes of cloud computing are its flexibility and scalability. As big data
can be stored and sourced on public or private clouds, via networks and servers, the cloud makes for an
efficient and economical data source.

The web as a big data source


The public web constitutes big data that is widespread and easily accessible. Data on the Web, or ‘Internet’,
is commonly available to individuals and companies alike. Moreover, web services such as Wikipedia
provide free and quick informational insights to everyone. The enormity of the Web ensures its diverse
usability and is especially beneficial to start-ups and SMEs, as they don’t have to wait to develop their own
big data infrastructure and repositories before they can leverage big data.

IoT as a big data source


Machine-generated content, or data created by IoT devices, constitutes a valuable source of big data. This data
is usually generated by sensors connected to electronic devices. The sourcing capacity depends
on the ability of the sensors to provide real-time, accurate information. IoT is now gaining momentum and
includes big data generated not only from computers and smartphones, but potentially from every device
that can emit data. With IoT, data can now be sourced from medical devices, vehicular processes, video
games, meters, cameras, household appliances, and the like.

Databases as a big data source


Businesses today prefer to use an amalgamation of traditional and modern databases to acquire relevant big
data. This integration paves the way for a hybrid data model while keeping investment and IT
infrastructure costs low. Furthermore, these databases are deployed for several business intelligence purposes
as well, and they enable the extraction of insights that are used to drive business profits.
Popular databases include a variety of data sources, such as MS Access, DB2, Oracle, SQL, and Amazon
Simple, among others.

Working with unstructured data

Extracting and analyzing data from extensive big data sources is a complex process
that can be frustrating and time-consuming. These complications can be resolved if organizations
take into account all the necessary considerations of big data, select the relevant data sources, and deploy
them in a manner that is well tuned to their organizational goals.

Before the modern day ubiquity of online and mobile applications, databases processed
straightforward, structured data. Data models were relatively simple and described a set of relationships
between different data types in the database.

Unstructured data, in contrast, refers to data that doesn’t fit neatly into the traditional row and
column structure of relational databases. Examples of unstructured data include: emails, videos, audio files,
web pages, and social media messages. In today’s world of Big Data, most of the data that is created is
unstructured with some estimates of it being more than 95% of all data generated.
As a result, enterprises are looking to this new generation of databases, known as NoSQL, to address
unstructured data. MongoDB stands as a leader in this movement with over 10 million downloads and
hundreds of thousands of deployments. As a document database with flexible schema, MongoDB was built
specifically to handle unstructured data. MongoDB’s flexible data model allows for development without a
predefined schema which resonates particularly when most of the data in your system is unstructured.
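A minimal sketch of this flexible-schema idea using the pymongo driver follows; it assumes a MongoDB instance running locally on the default port, and the database, collection, and field names are invented for illustration:

```python
from pymongo import MongoClient

# Assumption: a local mongod process is listening on localhost:27017.
client = MongoClient("mongodb://localhost:27017")
messages = client["demo_db"]["messages"]

# No predefined schema: documents in the same collection may have different shapes.
messages.insert_one({"type": "email", "subject": "Quarterly report", "body": "..."})
messages.insert_one({"type": "tweet", "text": "Big data!", "hashtags": ["#bigdata"]})

# Query on whichever fields a document happens to contain.
for doc in messages.find({"type": "tweet"}):
    print(doc["text"])
```
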

The Evolution of Big Data


To truly understand the implications of Big Data analytics, one has to reach back into the annals of
computing history, specifically business intelligence (BI) and scientific computing. The ideology behind Big
Data can most likely be tracked back to the days before the age of computers, when unstructured data were
the norm (paper records) and analytics was in its infancy. Perhaps the first Big Data challenge came in the
form of the 1880 U.S. census, when the information concerning approximately 50 million people had to be
gathered, classified, and reported on.

With the 1880 census, just counting people was not enough information for the U.S. government to
work with—particular elements, such as age, sex, occupation, education level, and even the “number of
insane people in household,” had to be accounted for. That information had intrinsic value to the process,
but only if it could be tallied, tabulated, analyzed, and presented. New methods of relating the data to other
data collected came into being, such as associating occupations with geographic areas, birth rates with
education levels, and countries of origin with skill sets.

The 1880 census truly yielded a mountain of data to deal with, yet only severely limited technology
was available to do any of the analytics. The problem of Big Data could not be solved for the 1880 census,
so it took over seven years to manually tabulate and report on the data.

With the 1890 census, things began to change, ...

Challenges of Big Data


It must be clear by now that, when talking about big data, one cannot ignore the fact that there are some
obvious challenges associated with it. Let us address some of those challenges.

● Quick Data Growth

Data growing at such a quick rate makes it a challenge to extract insights from it. More and more
data is generated every second, and the data that is actually relevant and useful has to be picked out for
further analysis.

● Storage

Such large amounts of data are difficult for organizations to store and manage without appropriate tools and
technologies.

● Syncing Across Data Sources

When organisations import data from different sources, the data from one source might not
be up to date compared with the data from another source.

● Security

The huge amount of data held by organisations can easily become a target for advanced persistent threats, so
another challenge for organisations is to keep their data secure through proper authentication, data encryption, etc.

● Unreliable Data

We can’t deny the fact that big data can’t be 100 percent accurate. It might contain redundant or incomplete
data, along with contradictions.

● Miscellaneous Challenges

Some other challenges that come up while dealing with big data include the integration of
data, the availability of skills and talent, the cost of solutions, and processing a large amount of data in time and
with accuracy so that the data is available to data consumers whenever they need it.

Data Environment versus Big Data Environment


The following points describe the comparison between Small Data and Big Data.

Definition
  Small Data: Data that is ‘small’ enough for human comprehension, in a volume and format that makes it accessible, informative and actionable.
  Big Data: Data sets that are so large or complex that traditional data processing applications cannot deal with them.

Data Source
  Small Data: Data from traditional enterprise systems such as Enterprise Resource Planning (ERP) and Customer Relationship Management (CRM); financial data such as general ledger data; payment transaction data from websites.
  Big Data: Purchase data from point-of-sale systems; clickstream data from websites; GPS stream data (mobility data sent to a server); social media (Facebook, Twitter).

Volume
  Small Data: Most cases in a range of tens or hundreds of GB; some cases a few TB (1 TB = 1000 GB).
  Big Data: More than a few terabytes (TB).

Velocity
  Small Data: Controlled and steady data flow; data accumulation is slow.
  Big Data: Data can arrive at very fast speeds; enormous amounts of data can accumulate within very short periods of time.

Variety
  Small Data: Structured data in tabular format with a fixed schema, and semi-structured data in JSON or XML format.
  Big Data: High-variety data sets which include tabular data, text files, images, video, audio, XML, JSON, logs, sensor data, etc.

Veracity (quality of data)
  Small Data: Contains less noise, as data is collected in a controlled manner.
  Big Data: Usually the quality of data is not guaranteed; rigorous data validation is required before processing.

Value
  Small Data: Business intelligence, analysis, and reporting.
  Big Data: Complex data mining for prediction, recommendation, pattern finding, etc.

Time Variance
  Small Data: Historical data is equally valid, as it represents solid business interactions.
  Big Data: In some cases data gets older soon (e.g. fraud detection).

Data Location
  Small Data: Databases within an enterprise, local servers, etc.
  Big Data: Mostly in distributed storage on the cloud or in external file systems.

Infrastructure
  Small Data: Predictable resource allocation; mostly vertically scalable hardware.
  Big Data: More agile infrastructure with a horizontally scalable architecture; load on the system varies a lot.

UNIT-II
Big Data Analytics:
Overview of Business Intelligence:
Business Intelligence (BI) applications are decision support tools that enable real-time, interactive
access to and analysis of mission-critical corporate information. BI applications bridge the gaps between
information silos in an organization. Their sophisticated analytical capabilities have access to corporate
information resources such as data warehouses, transaction processing applications, and enterprise applications
like Enterprise Resource Planning (ERP). BI enables users to access and leverage vast amounts of data,
providing valuable insight into potential opportunities and areas for business process refinement.

BI applications can be classified as follows:

● Personalized Dashboards for Process Monitoring and Highlighting Exceptions
● Decision Support with Drill-Down and “What-If” Analysis
● Data-Mining to Understand and Discover Patterns and Behaviors
● Automated Agents to Drive Rule-Based Business Strategy via Integrated Processes

Investments made in an EPM (Enterprise Process Management) implementation can be very expensive, so it
is imperative that every asset is leveraged. Youngsoft’s team of experienced professionals provides a wide
range of services across the suite of EPM products, consistently delivering solutions that reduce costs,
increase profits, and improve overall efficiency.

Benefits:

● Cost Reduction during the Implementation Process
● Retained Knowledge after Completion of the Implementation
● End User Step by Step Training during the Implementation Process
● Providing Tier-One Production Support during Post-Implementation Process
● Shadowing the Implementation

What is Data Science?

Data Science is the combination of statistics, mathematics, programming, problem-solving, capturing data in
ingenious ways, the ability to look at things differently, and the activity of cleansing, preparing, and aligning
the data.

In simple terms, it is the umbrella of techniques used when trying to extract insights and information from
data.
Applications of Data Science

● Internet Search

Search engines make use of data science algorithms to deliver the best results for search queries in a
fraction of a second.

● Digital Advertisements

The entire digital marketing spectrum uses data science algorithms – from display banners to
digital billboards. This is the main reason digital ads get a higher click-through rate (CTR) than traditional
advertisements.

● Recommender Systems

Recommender systems not only make it easy to find relevant products among the billions of products
available but also add a lot to the user experience. A lot of companies use these systems to promote their
products and suggestions in accordance with the user’s demands and the relevance of information. The
recommendations are based on the user’s previous search results.
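As a deliberately tiny, hypothetical sketch of that idea (the interaction matrix below is invented), a recommender can score unseen products for a user by comparing their past behaviour with that of similar users:

```python
import numpy as np

# Rows = users, columns = products; values = past interactions (e.g. ratings).
ratings = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 5, 4],
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

target = 0  # recommend something for the first user
sims = np.array([cosine(ratings[target], r) for r in ratings])
sims[target] = 0.0                      # ignore the user's similarity to themselves

scores = sims @ ratings                 # similarity-weighted interest of other users
scores[ratings[target] > 0] = -np.inf   # don't re-recommend already-seen products
print("Recommended product index:", int(np.argmax(scores)))
```
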

Need for Big Data Analytics


Why is big data analytics important?

Big data analytics helps organizations harness their data and use it to identify new opportunities. That, in
turn, leads to smarter business moves, more efficient operations, higher profits and happier customers. In his
report Big Data in Big Companies, IIA Director of Research Tom Davenport interviewed more than 50
businesses to understand how they used big data. He found they got value in the following ways:

1. Cost reduction. Big data technologies such as Hadoop and cloud-based analytics bring significant
cost advantages when it comes to storing large amounts of data – plus they can identify more
efficient ways of doing business.
2. Faster, better decision making. With the speed of Hadoop and in-memory analytics, combined
with the ability to analyze new sources of data, businesses are able to analyze information
immediately – and make decisions based on what they’ve learned.
3. New products and services. With the ability to gauge customer needs and satisfaction through
analytics comes the power to give customers what they want. Davenport points out that with big data
analytics, more companies are creating new products to meet customers’ needs.
Types of data analytics
There are 4 different types of analytics.

Descriptive analytics

Descriptive analytics answers the question of what happened. Let us bring an example from ScienceSoft’s
practice: having analyzed monthly revenue and income per product group, and the total quantity of metal
parts produced per month, a manufacturer was able to answer a series of ‘what happened’ questions and
decide on focus product categories.

Descriptive analytics juggles raw data from multiple data sources to give valuable insights into the past.
However, these findings simply signal that something is wrong or right, without explaining why. For this
reason, our data consultants don’t recommend that highly data-driven companies settle for descriptive
analytics only; they would rather combine it with other types of data analytics.
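A minimal sketch of such a ‘what happened’ summary, using pandas with invented column names and toy numbers, might look like this:

```python
import pandas as pd

# Hypothetical sales records; in practice these come from multiple source systems.
sales = pd.DataFrame({
    "month": ["2023-01", "2023-01", "2023-02", "2023-02"],
    "product_group": ["valves", "fittings", "valves", "fittings"],
    "revenue": [120000, 80000, 135000, 70000],
    "units_produced": [3400, 5100, 3700, 4800],
})

# Descriptive analytics: aggregate the past to see what happened, per month and group.
summary = sales.groupby(["month", "product_group"]).agg(
    total_revenue=("revenue", "sum"),
    total_units=("units_produced", "sum"),
)
print(summary)
```
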

Diagnostic analytics

At this stage, historical data can be measured against other data to answer the question of why something
happened. For example, you can check ScienceSoft’s BI demo to see how a retailer can drill sales and
gross profit down to categories to find out why they missed their net profit target. Another flashback to our
data analytics projects: in the healthcare industry, customer segmentation coupled with several filters
(like diagnoses and prescribed medications) made it possible to identify the influence of medications.

Diagnostic analytics gives in-depth insights into a particular problem. At the same time, a company should
have detailed information at their disposal, otherwise, data collection may turn out to be individual for every
issue and time-consuming.
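Continuing the same hypothetical data, a simple drill-down toward the ‘why’ question can be sketched by comparing each category against its target (names and numbers invented):

```python
import pandas as pd

# Hypothetical category-level results versus targets.
perf = pd.DataFrame({
    "category": ["valves", "fittings", "pumps"],
    "gross_profit": [42000, 18000, 9000],
    "profit_target": [40000, 25000, 15000],
})

# Diagnostic analytics: drill down to see which categories caused the overall miss.
perf["gap"] = perf["gross_profit"] - perf["profit_target"]
print(perf.sort_values("gap"))   # the most negative gaps explain the shortfall
```
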

Predictive analytics

Predictive analytics tells what is likely to happen. It uses the findings of descriptive and diagnostic analytics
to detect clusters and exceptions, and to predict future trends, which makes it a valuable tool for forecasting.
Check ScienceSoft’s case study to get details on how advanced data analytics allowed a leading FMCG
company to predict what they could expect after changing brand positioning.

Predictive analytics belongs to the advanced analytics types and brings many advantages, like sophisticated
analysis based on machine or deep learning and the proactive approach that predictions enable. However, our
data consultants state it clearly: a forecast is just an estimate, the accuracy of which depends heavily on data
quality and the stability of the situation, so it requires careful treatment and continuous optimization.
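A deliberately tiny forecasting sketch (toy numbers, not a real model) using scikit-learn's linear regression shows the basic idea of projecting a trend forward:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy monthly sales history: month index -> sales volume.
months = np.arange(1, 13).reshape(-1, 1)
sales = np.array([100, 104, 108, 115, 118, 124, 130, 133, 140, 147, 151, 158])

model = LinearRegression().fit(months, sales)

# Predict the next three months; a real predictive model would use far richer features.
future = np.arange(13, 16).reshape(-1, 1)
print(model.predict(future).round(1))
```
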

Prescriptive analytics

The purpose of prescriptive analytics is to literally prescribe what action to take to eliminate a future
problem or take full advantage of a promising trend. An example of prescriptive analytics from our project
portfolio: a multinational company was able to identify opportunities for repeat purchases based on
customer analytics and sales history.

Prescriptive analytics uses advanced tools and technologies, like machine learning, business rules and
algorithms, which makes it sophisticated to implement and manage. Besides, this state-of-the-art type of
data analytics requires not only historical internal data but also external information due to the nature of
algorithms it’s based on. That is why, before deciding to adopt prescriptive analytics, ScienceSoft strongly
recommends weighing the required efforts against an expected added value.
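As a toy illustration of turning a prediction into a prescribed action (all thresholds and actions below are invented, not rules from the text):

```python
# Prescriptive analytics in miniature: map a predicted outcome to a recommended action.
def recommend_action(predicted_repeat_purchase_prob: float) -> str:
    # In a real system these rules would come from domain experts and be refined
    # by machine learning, not hard-coded like this.
    if predicted_repeat_purchase_prob > 0.7:
        return "offer a loyalty discount"
    if predicted_repeat_purchase_prob > 0.4:
        return "send a reminder email"
    return "no action"

print(recommend_action(0.82))   # offer a loyalty discount
```
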

Big Data Analytics Challenges

1. Need For Synchronization Across Disparate Data Sources

As data sets are becoming bigger and more diverse, there is a big challenge to incorporate them into an
analytical platform. If this is overlooked, it will create gaps and lead to wrong messages and insights.

2. Acute Shortage Of Professionals Who Understand Big Data Analysis

The analysis of data is important in order to make the voluminous amount of data being produced every minute
useful. With the exponential rise of data, a huge demand for big data scientists and Big Data analysts has
been created in the market. It is important for business organizations to hire data scientists with varied skills,
as the job of a data scientist is multidisciplinary. Another major challenge faced by businesses is
the shortage of professionals who understand Big Data analysis. There is a sharp shortage of data scientists
in comparison to the massive amount of data being produced.

3. Getting Meaningful Insights Through The Use Of Big Data Analytics

It is imperative for business organizations to gain important insights from Big Data analytics, and it is also
important that only the relevant department has access to this information. A big challenge faced by
companies in Big Data analytics is bridging this gap in an effective manner.

4. Getting Voluminous Data Into The Big Data Platform

It is hardly surprising that data is growing with every passing day. This simply means that business
organizations need to handle a large amount of data on a daily basis. The amount and variety of data available
these days can overwhelm any data engineer, which is why it is considered vital to make data accessibility
easy and convenient for brand owners and managers.

5. Uncertainty Of Data Management Landscape

With the rise of Big Data, new technologies and companies are being developed every day. However, a big
challenge faced by the companies in the Big Data analytics is to find out which technology will be best
suited to them without the introduction of new problems and potential risks.

6. Data Storage And Quality

Business organizations are growing at a rapid pace. As companies and large business organizations grow,
the amount of data they produce increases. Storing this massive amount of data is
becoming a real challenge for everyone. Popular data storage options like data lakes and warehouses are
commonly used to gather and store large quantities of unstructured and structured data in their native format.
The real problem arises when a data lake or warehouse tries to combine unstructured and inconsistent data
from diverse sources; it then encounters errors. Missing data, inconsistent data, logic conflicts, and duplicate
data all result in data quality challenges.

7. Security And Privacy Of Data

Once business enterprises discover how to use Big Data, it brings them a wide range of possibilities and
opportunities. However, it also involves potential risks when it comes to the
privacy and the security of the data. The Big Data tools used for analysis and storage utilize data from
disparate sources. This eventually leads to a high risk of exposure of the data, making it vulnerable. Thus,
the rise in the volume of data increases privacy and security concerns.
The Importance of Big Data Analytics

Driven by specialized analytics systems and software, as well as high-powered computing systems, big data
analytics offers various business benefits, including:

● New revenue opportunities

● More effective marketing

● Better customer service

● Improved operational efficiency

● Competitive advantages over rivals

Big data analytics applications enable big data analysts, data scientists, predictive modelers, statisticians and
other analytics professionals to analyze growing volumes of structured transaction data, plus other forms of
data that are often left untapped by conventional BI and analytics programs. This encompasses a mix of
semi-structured and unstructured data -- for example, internet clickstream data, web server logs, social
media content, text from customer emails and survey responses, mobile phone records, and machine data
captured by sensors connected to the internet of things (IoT).

Basic Terminologies in big data environment

In this section we will discuss the terminology related to the Big Data ecosystem. This will give you a complete
understanding of Big Data and its terms.

Over time, Hadoop has become the nucleus of the Big Data ecosystem, where many new technologies have
emerged and have got integrated with Hadoop. So it’s important that, first, we understand and appreciate the
nucleus of modern Big Data architecture.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data
sets across clusters of computers, using simple programming models. It is designed to scale up from single
servers to thousands of machines, each offering local computation and storage.
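To make the "simple programming models" remark concrete, here is a hedged sketch of the classic word-count example written for Hadoop Streaming in Python; in practice the mapper and reducer live in two separate files, and the job is submitted with the Hadoop streaming jar (the file names below are placeholders):

```python
#!/usr/bin/env python3
# mapper.py -- emit one (word, 1) pair per word read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sum the counts per word; Hadoop delivers the input sorted by key.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

The same pair can also be tested locally, without a cluster, with a shell pipeline such as `cat input.txt | python3 mapper.py | sort | python3 reducer.py`.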
