BDA Unit 1
INTRODUCTION
Gartner definition:
"Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."
Big data refers to complex and large data sets that have to be processed and analyzed to
uncover valuable information that can benefit businesses and organizations.
Big Data refers to massive amounts of data produced by different sources like social media
platforms, web logs, sensors, IoT devices, and many more. It can be structured (like tables in a
DBMS), semi-structured (like XML files), or unstructured (like audio, video, and images).
It helps companies to generate valuable insights.
Big data does not equate to any specific data volume; big data deployments can involve terabytes,
petabytes, and even exabytes of data captured over time.
The first part "Big data is high-volume, high-velocity, and high-variety information assets"
talks about voluminous data that may have great mixture of structured, semi-structured and
unstructured data and will require a good speed/pace for storage, preparation, processing and
analysis.
The second part "cost effective, innovative forms of information processing" talks about
embracing new techniques and technologies to capture, store, process, persist, integrate and visualize
the high-volume, high-velocity and high-variety data.
The third part "enhanced insight and decision making" talks about deriving deeper, richer
and meaningful insights and then using these insights to make faster and better decisions to gain
business value and thus a competitive edge.
Data—>Information—>Actionable intelligence—>Better decisions—>Enhanced business value
Companies use big data to learn what their customers want, who their best customers are, and
why people choose different products. The more a company knows about its customers, the more
competitive it becomes.
Big data can be combined with machine learning to create market strategies based on
predictions about customers. Leveraging big data makes companies customer-centric.
Companies can use historical and real-time data to assess evolving consumer preferences.
This enables businesses to improve and update their marketing strategies, which makes them more
responsive to customer needs.
Challenges with Big Data
Storage
Curation
Analysis
Transfer
Visualization
Privacy Violations
a) Structured
This is the data which is in an organized form, that is, it conforms to a data model and can be stored
in rows and columns (for example, tables in an RDBMS) and used easily by a computer program.
About 10% of an organization's data is in this format.
b) Unstructured
This is data which does not conform to a data model or is not in a form that can be used easily
by a computer program; it lacks any specific form or structure whatsoever. This makes unstructured
data very difficult and time-consuming to process and analyze. About 80% of an organization's data
is in this format; for example, memos, chat rooms, PowerPoint presentations, images, videos, letters,
research papers, white papers, the body of an email, etc.
c) Semi-structured
Semi-structured data is also referred to as self-describing data. It does not conform to a data
model but has some structure, combining characteristics of the two formats mentioned above. To be
precise, although such data has not been classified under a particular repository (database), it
contains vital information or tags that segregate individual elements within the data. About 10% of
an organization's data is in this format; for example, HTML, XML, JSON, email data, etc.
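The "self-describing" nature of semi-structured data is easy to see with a small JSON example: the tags travel with the data, so a program can address individual elements without an external schema (the record itself is made up for illustration):

```python
import json

# A hypothetical semi-structured record: the field names (tags) are embedded
# in the data itself, so no external schema is needed to interpret it.
record = '{"name": "Asha", "email": "asha@example.com", "orders": [{"id": 1, "total": 250.0}]}'

doc = json.loads(record)          # parse the self-describing structure
print(doc["name"])                # tags let us address individual elements
print(doc["orders"][0]["total"])  # nested elements are equally reachable
```

The same idea applies to XML and HTML: the markup segregates individual elements even though the data sits outside any database.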
Digital data is thus classified into unstructured, semi-structured and structured data.
Condition
The condition of data deals with the state of data, that is, "Can one use this data as is for
analysis?" or "Does it require cleansing for further enhancement and enrichment?"
Context
The context of data deals with "Where has this data been generated?", "Why was this data
generated?", "How sensitive is this data?" and "What are the events associated with this data?".
Small data (data as it existed prior to the big data revolution) is about certainty. It is about
known data sources; it is about no major changes to the composition or context of data.
Composition
The composition of data deals with the structure of data, that is, the sources of data, the
granularity, the types, and the nature of data (static or real-time streaming).
Data is thus characterized by its composition, condition, and context.
[Figure: The 5 V's of Big Data]
Volume: terabytes, records, tables, files, distributed
Velocity: batch, real time, processes, stream
Variety: structured, unstructured, probabilistic, linked
Veracity: authenticity, reputation, availability, accountability
Value: statistical, events, correlations, hypothetical
Variability: changing data, changing model, linkage
Volume
Volume refers to the huge amount of data. The size of data plays a crucial role in determining
its value: only when the volume of data is very large is it actually considered 'Big Data'.
Bits—>Bytes—>Kilobytes—>Megabytes—>Gigabytes—>Terabytes—>Petabytes—>
Exabytes—>Zettabytes—>Yottabytes
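Each step up the scale above multiplies by 1024 (in binary units), which makes the gap between units easy to underestimate; a quick sketch of the arithmetic:

```python
# Binary (IEC) units: each step up the scale multiplies by 1024.
units = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value, unit):
    """Convert a value in the given unit to raw bytes."""
    return value * 1024 ** units.index(unit)

# One petabyte expressed in terabytes:
print(to_bytes(1, "PB") // to_bytes(1, "TB"))  # 1024
```

So a petabyte-scale deployment holds over a thousand times the data of a terabyte-scale one.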
Media
Sensor Data
Business apps
Public web
Social Media
Velocity
Velocity refers to the high speed at which data accumulates.
In Big Data, data flows in continuously from sources like machines, networks, social media,
mobile phones, etc.
Batch—>Periodic—>Near real time—>Real-time processing
There is a massive and continuous flow of data. Velocity determines the potential of data:
how fast data is generated and processed to meet demands.
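The spectrum above, from batch to real-time processing, can be illustrated with a toy event feed (the events here are synthetic):

```python
def event_stream():
    """A stand-in for a continuous feed (sensors, clickstreams, logs)."""
    for value in [3, 7, 2, 9, 4, 6]:
        yield value

# Batch: collect everything first, then process in one pass.
batch = list(event_stream())
print("batch average:", sum(batch) / len(batch))

# Stream (real time): maintain a running result as each event arrives,
# so an answer is available at any moment without storing the whole feed.
count, total = 0, 0
for value in event_stream():
    count += 1
    total += value
print("streaming average so far:", total / count)
```

Both approaches reach the same answer here; the difference is that the streaming version never needs the full dataset in memory and can report partial results while data is still arriving.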
Variety
Variety refers to the nature of data (structured, semi-structured and unstructured) and to the
heterogeneous sources from which it arrives, both inside and outside of an enterprise.
o Structured data: From traditional transaction processing systems and RDBMS etc.
o Semi-Structured data: HTML, XML
o Unstructured data: Unstructured text documents, audio, video, email, photos, PDFs, social
media, etc
Veracity
The “truth” or accuracy of data and information assets, which often determines executive-level
confidence.
It refers to inconsistencies and uncertainty in data, whose quality and accuracy are difficult
to control.
Big Data is also variable because of the multitude of data dimensions resulting from multiple
disparate data types and sources.
Value
The value of big data usually comes from insight discovery and pattern recognition that lead
to more effective operations, stronger customer relationships and other clear and quantifiable
business benefits.
Variability
The changing nature of the data companies seek to capture, manage and analyze, for example
in sentiment or text analytics, where the meaning of keywords or phrases changes over time.
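The keyword-drift problem can be made concrete: the same word may score differently depending on context, so a fixed sentiment lexicon misreads it. A sketch with a deliberately tiny, hypothetical lexicon:

```python
# A hypothetical sentiment lexicon: the word "sick" is negative by default,
# but in slang contexts it is a compliment -- its meaning is variable.
base_lexicon = {"great": 1, "terrible": -1, "sick": -1}
slang_override = {"sick": 1}

def score(text, context="formal"):
    """Sum word scores; in a slang context, variable words flip polarity."""
    lexicon = dict(base_lexicon)
    if context == "slang":
        lexicon.update(slang_override)
    return sum(lexicon.get(word, 0) for word in text.lower().split())

print(score("that show was sick"))            # negative under the formal reading
print(score("that show was sick", "slang"))   # positive under the slang reading
```

A system that ignores variability would assign the first, negative score in both cases.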
Automation
Interactive Voice Response (IVR), kiosks, mobile devices, email, chat, corporate websites,
third-party applications, and social networks have generated a fair amount of event information about
the customers.
Product
As products become increasingly electronic, they provide a lot of valuable data to the supplier
regarding product use and product quality. In many cases, suppliers can also collect information
about the context in which a product was used. Products can also supply information related to
frequency of use, interruptions, usage skipping, and other related aspects.
Electronic touch points
A fair amount of data can be collected from the touch points used for product shopping,
purchase, use, or payment.
Components
Sometimes, components may provide additional information. This information could include
data about component failures, use, or lack thereof.
Monetization
A data bazaar is the biggest enabler to create an external marketplace, where we collect,
exchange, and sell customer information. We are seeing a new trend in the marketplace, in which
customer experience from one industry is anonymized, packaged, and sold to other industries.
Location
Location data is increasingly available to suppliers. Assuming a product is consumed in conjunction
with a mobile device, the location of the consumer becomes an important piece of information that
may be available to the supplier.
Cookies
Web browsers carry enormous information using web cookies. Some of this may be directly
associated with touch points.
Usage data
A number of data providers have started to collect, synthesize, categorize, and package
information for reuse. This includes credit-rating agencies that rate consumers, social networks with
published blogs, and cable companies with audience information.
[Figure: Big Data Analytics]
Working with datasets whose volume and variety are beyond the storage and processing capability
of typical database software, to gain richer, deeper insights into customers, partners and the
business, and thereby a competitive advantage.
Classification of Analytics
There are basically two schools of thought:
Classify analytics into basic, operationalized, advanced and monetized.
Classify analytics into analytics 1.0, analytics 2.0, and analytics 3.0.
First School of Thought
Basic analytics: This primarily is slicing and dicing of data to help with basic business insights.
This is about reporting on historical data, basic visualization, etc.
Operationalized analytics: It is operationalized analytics if it gets woven into the enterprises
business processes.
Advanced analytics: This largely is about forecasting for the future by way of predictive and
prescriptive modelling.
Monetized analytics: This is analytics in use to derive direct business revenue.
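The first of these, basic analytics (slicing and dicing historical data for simple reports), can be as small as a group-by aggregation; a sketch on hypothetical sales records:

```python
from collections import defaultdict

# Hypothetical historical sales records, invented for illustration.
sales = [
    {"region": "North", "amount": 120},
    {"region": "South", "amount": 80},
    {"region": "North", "amount": 200},
    {"region": "South", "amount": 150},
]

# "Slice" the history by region and report the total per slice.
totals = defaultdict(int)
for row in sales:
    totals[row["region"]] += row["amount"]

for region, total in sorted(totals.items()):
    print(region, total)
```

Operationalized analytics would embed a computation like this inside a business process; advanced analytics would instead fit a model to forecast next period's totals.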
[Figure: Types of analytics]
Descriptive Analytics: What happened?
Diagnostic Analytics: Why did it happen?
Predictive Analytics: What will happen?
Prescriptive Analytics: How can we make it happen?
Data Science
Data science is the science of extracting knowledge from data. In other words, it is a science
of drawing out hidden patterns amongst data using statistical and mathematical techniques.
It employs techniques and theories drawn from many fields from the broad areas of
mathematics, statistics, information technology including machine learning, data engineering,
probability models, statistical learning, pattern recognition and learning, etc.
Data Scientist works on massive datasets for weather predictions, oil drillings, earthquake
prediction, financial frauds, terrorist network and activities, global economic impacts, sensor logs,
social media analytics, customer churn, collaborative filtering, regression analysis, etc. Data science
is multi-disciplinary.
Business Acumen Skills
A data scientist should have the following business acumen skills:
Understanding of domain
Business strategy
Problem solving
Communication
Presentation
Keenness
Technology Expertise
The following skills are required as far as technical expertise is concerned:
Good database knowledge such as RDBMS
Good NoSQL database knowledge such as MongoDB, Cassandra, HBase
Programming languages such as Java, Python, C++
Open-source tools such as Hadoop
Data warehousing
Data mining
Visualization such as Tableau, Flare, Google visualization APIs
Mathematics Expertise
The following are the key skills that a data scientist must have to comprehend, interpret and
analyze data:
Mathematics
Statistics
Artificial Intelligence
Algorithms
Machine learning
Pattern recognition
Natural Language Processing
To sum it up, the data science process is
Collecting raw data from multiple different data sources
Processing the data
Integrating the data and preparing clean datasets
Engaging in explorative data analysis using model and algorithms
Preparing presentations using data visualizations
Communicating the findings to all stakeholders
Making faster and better decisions
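The process above can be sketched end-to-end as a tiny pipeline: collect raw records, clean them, then explore and summarize (all values here are invented for illustration):

```python
# 1. Collect: raw data from multiple sources, messy as usual.
raw = [" 23 ", "19", "", "42", "n/a", "35"]

# 2. Process / clean: drop unusable values, normalize the rest.
clean = []
for item in raw:
    item = item.strip()
    if item.isdigit():
        clean.append(int(item))

# 3. Explore: a simple descriptive summary of the clean dataset.
summary = {
    "count": len(clean),
    "min": min(clean),
    "max": max(clean),
    "mean": sum(clean) / len(clean),
}

# 4. Communicate: report the findings to stakeholders.
print(summary)
```

Real pipelines replace each step with far heavier machinery (distributed storage, model fitting, dashboards), but the shape of the process stays the same.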
Responsibilities
Data Management: A data scientist employs several approaches to develop the relevant datasets for
analysis. Raw data is just "raw", unsuitable for analysis. The data scientist works on it so that it
reflects the relevant relationships and contexts; this data then becomes useful for processing and
further analysis.
Analytical Techniques: Depending on the business questions which we are trying to find answers
to and the type of data available at hand, the data scientist employs a blend of analytical techniques
to develop models and algorithms to understand the data, interpret relationships, spot trends, and
reveal patterns.
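One simple technique in that blend, used for spotting trends, is an ordinary least-squares line fit; a self-contained sketch on made-up monthly figures, where a positive slope indicates an upward trend:

```python
def linear_trend(ys):
    """Fit y = a + b*x by ordinary least squares; return the slope b."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Hypothetical monthly sales figures: a positive slope reveals growth.
monthly_sales = [100, 104, 110, 113, 121, 125]
print(linear_trend(monthly_sales))
```

The slope here is roughly 5 units per month; in practice a data scientist would reach for a statistics library, but the underlying technique is the same.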
Business Analysis: A data scientist is a business analyst who distinguishes cool facts from insights
and is able to apply his business expertise and domain knowledge to see the results in the business
context.
Communicator: A data scientist is a good presenter and communicator, able to communicate the
findings in a language that is understood by the different business stakeholders.
[Figure: The data scientist]
Models and analyzes data to comprehend and interpret relationships, unveil patterns and spot
trends; applies business/domain knowledge to provide context.
Healthcare Providers
The healthcare sector has access to huge amounts of data but has been plagued by failures in
utilizing the data to curb the cost of rising healthcare and by inefficient systems that stifle faster and
better healthcare benefits across the board.
Some hospitals are using data collected from a cell phone app, from millions of patients, to
allow doctors to use evidence-based medicine as opposed to administering several medical/lab tests
to all patients who go to the hospital. A battery of tests can be efficient, but it can also be expensive
and usually ineffective.
Free public health data and Google Maps have been used to create visual data that allows for
faster identification and efficient analysis of healthcare information, used in tracking the spread of
chronic disease.
Education
Big data is used quite significantly in higher education. In another use case, big data is used
to measure teachers' effectiveness to ensure a pleasant experience for both students and teachers.
A teacher's performance can be fine-tuned and measured against student numbers, subject
matter, student demographics, student aspirations, behavioral classification, and several other
variables.
Government
In public services, Big Data has an extensive range of applications, including energy
exploration, financial market analysis, fraud detection, health-related research, and environmental
protection.
Big data is being used in the analysis of large amounts of social disability claims made to the
Social Security Administration (SSA) that arrive in the form of unstructured data. The analytics are
used to process medical information rapidly and efficiently for faster decision making and to detect
suspicious or fraudulent claims.
The Food and Drug Administration (FDA) is using Big Data to detect and study patterns in
food-related illnesses and diseases. This allows for a faster response, which has led to more rapid
treatment and fewer deaths.
Insurance
Big data has been used in the industry to provide customer insights for transparent and
simpler products, by analyzing and predicting customer behavior through data derived from social
media, GPS-enabled devices, and CCTV footage. Big data also enables better customer retention
for insurance companies.
When it comes to claims management, predictive analytics from Big Data has been used to
offer faster service since massive amounts of data can be analyzed mainly in the underwriting stage.
Fraud detection has also been enhanced.
Through massive data from digital channels and social media, real-time monitoring of claims
throughout the claims cycle has been used to provide insights.
Transportation
Government use of Big Data: traffic control, route planning, intelligent transport systems,
congestion management
Private-sector use of Big Data: revenue management, technological enhancements, logistics
and for competitive advantage
Individual use of Big Data: includes route planning to save on fuel and time, for travel
arrangements in tourism, etc.