Emerging - 2021 - Module 2 PDF
DEPARTMENT OF MBA
• You will also have to look at the aggregates of the rows and columns
in the file and check whether the values you obtain make sense. If
they don’t, you will have to remove or replace the data that doesn’t
make sense. Once you have completed the data cleaning process,
your data will be ready for exploratory data analysis (EDA).
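The aggregate check described above can be sketched in a few lines of Python. This is only an illustration: the sample rows, the rule that negative unit counts are invalid, and the choice of the median as the replacement value are all assumptions, not part of the course material.

```python
from statistics import median

# Hypothetical sales rows: (product, units_sold, unit_price)
rows = [
    ("tea", 12, 2.5),
    ("coffee", -4, 3.0),   # negative units: does not make sense
    ("juice", 7, None),    # missing price
    ("water", 20, 1.0),
]


def clean(rows):
    """Drop rows with impossible unit counts; fill missing prices with the median."""
    known_prices = [price for _, _, price in rows if price is not None]
    fallback = median(known_prices)
    cleaned = []
    for product, units, price in rows:
        if units < 0:
            continue  # remove the value that does not make sense
        cleaned.append((product, units, fallback if price is None else price))
    return cleaned


def column_totals(rows):
    """Aggregate the numeric columns so the totals can be sanity-checked."""
    total_units = sum(units for _, units, _ in rows)
    total_revenue = sum(units * price for _, units, price in rows)
    return total_units, total_revenue


cleaned = clean(rows)
print(column_totals(cleaned))
```

Whether to remove or replace a suspicious value depends on the data set; here one row is removed and one value is replaced purely to show both options.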
STEP 4: EXPLORING THE DATA
• In this step, you will have to develop
ideas that can help identify hidden
patterns and insights.
• You will have to find more interesting
patterns in the data, such as why sales
of a particular product or service have
gone up or down.
• You must analyze this kind of data
more thoroughly. This is one of the
most crucial steps in a data science
process.
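As a rough illustration of looking for patterns such as sales going up or down, the sketch below computes month-over-month percentage changes and picks out the largest move. The product names and figures are invented for the example, and the helper names are hypothetical.

```python
# Hypothetical monthly unit sales per product (oldest month first).
sales = {
    "tea": [100, 110, 90],
    "coffee": [80, 80, 120],
}


def month_over_month(series):
    """Return the percentage change between consecutive months."""
    return [
        round((curr - prev) / prev * 100, 1)
        for prev, curr in zip(series, series[1:])
    ]


def biggest_move(sales):
    """Find the (product, month index, change) with the largest absolute change."""
    best = None
    for product, series in sales.items():
        for i, change in enumerate(month_over_month(series), start=1):
            if best is None or abs(change) > abs(best[2]):
                best = (product, i, change)
    return best


print(biggest_move(sales))
```

A result like this only tells you where to look; explaining why the change happened still requires digging into the underlying records.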
STEP 5: PERFORMING IN-DEPTH ANALYSIS
➢The data analytics lifecycle is designed for big data problems and data
science projects. The cycle is iterative, to reflect how real projects proceed.
➢To address the distinct requirements of performing analysis on big
data, a step-by-step methodology is needed to organize the
activities and tasks involved in acquiring, processing, analyzing, and
repurposing data.
PHASE 1: DISCOVERY
• The data science team learns about and
investigates the problem.
• It develops context and understanding.
• It identifies the data sources
needed and available for the project.
• The team formulates initial hypotheses
that can later be tested with data.
PHASE 2: DATA PREPARATION
• This phase covers the steps to explore, preprocess, and condition data
prior to modeling and analysis.
• It requires an analytic sandbox; the team performs extract, load, and
transform (ELT) to get data into the sandbox.
• Data preparation tasks are likely to be performed multiple
times and not in a predefined order.
• Tools commonly used for this phase include Hadoop,
Alpine Miner, and OpenRefine.
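To make the idea of loading data into a sandbox and transforming it there more concrete, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for the analytic sandbox. The CSV content, table names, and cleaning rules are assumptions chosen for illustration; a real project would use the tools its team has settled on.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a source system.
RAW_CSV = """order_id,amount,region
1,19.90,north
2,,south
3,5.00,North
"""


def load_into_sandbox(raw_csv):
    """Extract rows from CSV text and load them unchanged into the sandbox."""
    sandbox = sqlite3.connect(":memory:")
    sandbox.execute(
        "CREATE TABLE raw_orders (order_id TEXT, amount TEXT, region TEXT)"
    )
    reader = csv.DictReader(io.StringIO(raw_csv))
    sandbox.executemany(
        "INSERT INTO raw_orders VALUES (:order_id, :amount, :region)",
        list(reader),
    )
    return sandbox


def transform(sandbox):
    """Transform inside the sandbox: cast types, normalise case, skip blanks."""
    sandbox.execute(
        """
        CREATE TABLE orders AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               CAST(amount AS REAL) AS amount,
               LOWER(region) AS region
        FROM raw_orders
        WHERE amount <> ''
        """
    )
    query = "SELECT order_id, amount, region FROM orders ORDER BY order_id"
    return sandbox.execute(query).fetchall()


print(transform(load_into_sandbox(RAW_CSV)))
```

Keeping the raw table untouched means the transform step can be rerun with different rules, which fits the point that preparation tasks are repeated and not done in a fixed order.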
MODEL PLANNING
• The team explores the data to learn about relationships
between variables, and subsequently selects key
variables and the most suitable models.
MODEL BUILDING
• The team builds and executes models based on the work
done in the model planning phase.
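One simple way to explore relationships between variables during model planning is to compute a correlation coefficient between each candidate variable and the target. The sketch below uses Pearson correlation and a cut-off of 0.5; both the technique and the threshold are assumptions made for this example, and the data is invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def select_key_variables(features, target, threshold=0.5):
    """Keep features whose absolute correlation with the target meets the threshold."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(name for name, r in scores.items() if abs(r) >= threshold)


# Hypothetical data: weekly ad spend, shelf position, and units sold.
features = {
    "ad_spend": [1, 2, 3, 4, 5],
    "shelf_position": [3, 1, 2, 3, 1],
}
units_sold = [12, 15, 19, 24, 30]

print(select_key_variables(features, units_sold))
```

Correlation only captures linear relationships, so a variable that fails this filter may still matter; treat the result as a starting point for choosing models, not a final answer.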
4. Impact
This final phase of the data value chain process involves three activities:
using data to comprehend a problem and take a decision, altering the
consequences of a project or improving a situation, and reusing the data by
combining it with other data and sharing it freely.
• Data acquisition
Gathering, filtering, and cleaning data before it is put in a data warehouse or any other
storage solution.
• Data analysis
Exploring, transforming, and modelling data with the goal of highlighting relevant data, and
synthesizing and extracting useful hidden information with high potential from a business
point of view.
• Data curation
Data curation is the organization and integration of data collected from various sources. It
involves annotation (adding explanation), publication, and presentation of the data such
that the value of the data is maintained over time, and the data remains available for
reuse and preservation.
• Data storage:
✓Persistence and management of data in a scalable way.
✓Provides fast access to the data.
✓Parallel data storage solutions such as the Hadoop framework, NoSQL, and cloud
storage are some well-known big data storage options.
• Data usage:
✓Using data in business decision-making can enhance competitiveness through
reduction of costs, increased added value, or any other parameter that
can be measured against existing performance criteria.
• McDonald's mission is to provide customers with low-priced
food items.
• McDonald's value chain is a component of the industry's
value system. The value system is composed of various
other value chains of the business units of all
organizations involved, such as the company's beverage
suppliers and the rest of the supply chain.
• With the advent of the Internet of Things (IoT), more
objects and devices are connected to the internet,
gathering data on customer usage patterns and
product performance. The emergence of machine
learning has produced still more data.
• While big data has come far, its usefulness is only
just beginning. Cloud computing has expanded big
data possibilities even further.
• The cloud offers truly elastic scalability, where
developers can simply spin up ad hoc clusters to
test a subset of data.
• Graph databases are becoming increasingly
important as well, with their ability to represent
massive amounts of data in a way that makes
analytics fast and comprehensive.
BIG DATA
Big data can help you address a range of business
activities, from customer experience to analytics.
Here are just a few.
• Product development
• Predictive maintenance
• Customer experience
• Fraud and compliance
• Machine learning
• Operational efficiency
• Drive innovation
BIG DATA EXAMPLES
• Personalized e-commerce shopping experiences
• Financial market modelling
• Compiling trillions of data points to speed up cancer research
• Media recommendations from streaming services like Spotify, Hulu and Netflix
• Predicting crop yields for farmers
• Analyzing traffic patterns to lessen congestion in cities
• Data tools recognizing retail shopping habits and optimal product placement
• Big data helping sports teams maximize their efficiency and value
• Recognizing trends in education habits from individual students, schools and districts
BIG DATA STATISTICS
❑ Google gets over 3.5 billion searches daily.
❑ Data interactions went up by 5000%
between 2011 and 2021.
❑ In 2020, every person generated 1.7
megabytes of data per second.
❑ WhatsApp users exchange up to 65 billion
messages daily.
❑ 95% of businesses cite the need to manage
unstructured data as a problem for their business.
❑ Twitter users send over half a million tweets
CHARACTERISTICS OF BIG DATA
1. Volume
• Volume, the most characteristic property of big data, refers to the amount of data that passes through the
business day in and day out, and to the need to capture each data item in order to form a holistic view
of the business and derive value from it.
• Data here refers to anything that can be captured: structured, unstructured, or semi-structured, arriving in
batches or in real time.
• Without huge volume, it would not be fair to call the data ecosystem big data, even if it captures every
business aspect. The whole premise of value in big data is based on this first V, volume.
• To give a small example of how big the data should be to classify as big data, consider the following.
• The year 2016 saw global mobile traffic of 6.2 exabytes per month. It is estimated that by the year 2020,
this will easily top 40,000 exabytes.
2. Velocity
• Velocity refers to the crucial characteristic of capturing data coming in at any speed. In
today’s world, data is almost stateless.
• It enters and leaves the business ecosystem at great speed.
• Big data systems are equipped to capture this data at the rate at which it arrives.
If the capture rate does not keep up with the rate at which data is coming in, there will be
frequent backlogs, ultimately choking the system.
• Big data systems are designed to handle a massive and continuous flow of data.
Methods like sampling help in dealing with velocity issues in a big data system.
• As an example of the velocity that a big data system has to endure, more than 3.5 billion
searches per day are made through the Google search engine. With an ever-increasing
number of active accounts on Facebook, the number of likes, updates, shares, and
comments coming into Facebook increases by 22% every year.
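The slides mention sampling as a way to cope with velocity but do not name a method. One common choice for streams of unknown length is reservoir sampling, sketched below as an assumption about what such a method could look like; the function name and parameters are illustrative.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                sample[j] = item     # item is kept with probability k / (i + 1)
    return sample


# Usage: sample 10 events from a simulated stream of 1000 events.
print(reservoir_sample(range(1000), 10, seed=1))
```

The sample uses a fixed amount of memory no matter how fast or how long the stream runs, which is what makes it useful when the full flow cannot be stored or processed in time.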
3. Variety
• It is characteristic of big data to capture anything and everything of value in the business
ecosystem.
• This includes data that has no immediate value but can be processed further with advanced
tools to gain insights and build intelligence into the system.
• Apart from the structured data that a business is used to, this covers unstructured data such as images,
videos, sounds, flat files, email bodies, log files, and more.
• These contain data that can be mined with advanced tools.
• A big data system is designed to capture the unstructured and semi-structured data passing
through the business in a timely and efficient manner.
• This also means that apart from storing a variety of heterogeneous data, the big data
system should hook into these various data types and data sources efficiently without
compromising on speed.
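To illustrate what handling variety can mean in practice, the sketch below reads one structured source (CSV), one semi-structured source (JSON), and one unstructured source (free-form log lines), and maps all three to the same record shape. The field names and the log format are hypothetical and chosen only for the example.

```python
import csv
import io
import json
import re

# Hypothetical log format: "<timestamp> user=<id> action=<name>"
LOG_PATTERN = re.compile(r"^(?P<ts>\S+) user=(?P<user>\S+) action=(?P<action>\S+)$")


def from_csv(text):
    """Structured input: rows with a header line."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def from_json(text):
    """Semi-structured input: keep the fields we need, tolerate missing ones."""
    doc = json.loads(text)
    return [{"ts": doc.get("ts"), "user": doc.get("user"), "action": doc.get("action")}]


def from_log(text):
    """Unstructured input: pull fields out of log lines that match the pattern."""
    records = []
    for line in text.splitlines():
        match = LOG_PATTERN.match(line.strip())
        if match:
            records.append(match.groupdict())
    return records


records = (
    from_csv("ts,user,action\n2021-01-01,u1,view\n")
    + from_json('{"ts": "2021-01-02", "user": "u2"}')
    + from_log("2021-01-03 user=u3 action=buy\nnoise line")
)
print(records)
```

Once every source produces the same record shape, downstream analysis does not need to know where a record came from.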
4. Veracity
• With the volume, variety, and velocity that big data allows for, the models built on the data will not be of
true value without this characteristic.
• Veracity is the trustworthiness of the source data and the quality of the data derived after processing.
• The system should allow mitigation against data biases, abnormalities or inconsistencies, volatility, and
duplication, among other factors.
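Two of the veracity problems listed above, duplication and abnormal values, can be checked with very little code. The sketch below is one possible approach: it keeps the first record per key and flags values that sit far from the median. The median-absolute-deviation rule and the tolerance of 3 are assumptions, not something the slides prescribe.

```python
from statistics import median


def deduplicate(records, key):
    """Keep the first record seen for each key value."""
    seen, unique = set(), []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


def flag_abnormal(values, tolerance=3.0):
    """Flag values more than `tolerance` median absolute deviations from the median."""
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        return [False] * len(values)
    return [abs(v - centre) / mad > tolerance for v in values]


# Hypothetical sensor readings with one suspicious value.
readings = [10, 11, 9, 10, 95]
print(flag_abnormal(readings))
```

Flagged values are candidates for review, not automatic deletions; an unusual reading can be a genuine event as easily as an error.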
5. Value
▪ Value is one of the first properties discussed in business, and a certain degree of value will be projected at
the outset of a big data project.
▪ Big data helps to build the infrastructure that machine learning and artificial intelligence can be based on.
▪ Businesses that start with big data today can easily transition to machine learning and artificial
intelligence tomorrow to augment their decision-making processes.
BIG DATA LIFECYCLE
TOP 5 SOURCES OF BIG DATA