Data For Business Analytics Unit 2

The document discusses data management concepts including data collection, data quality, data security, big data characteristics, structured and unstructured data, business intelligence, and techniques for dealing with missing data such as imputation and removing data. Common data sources both internal and external to an organization are also covered.

Uploaded by

iemhardik

Data For Business Analytics

Unit 2
• Data: Numbers, text, and figures that have
been collected through some type of
measurement process.

• Information: The result of analyzing data, that is,
extracting meaning from data to support
evaluation and decision making.
Data source – Internal/External
Internal
These types of data can easily be found within the organization, such as
market records, sales records, transactions, customer data, accounting
resources, etc. Obtaining data from internal sources costs less and takes
less time.

• Financial Statements
• Sales Reports
• Retailer/Distributor/Dealer Feedback
• Customer Personal Information (e.g., name, address, age, contact info)
External
Data that can't be found within the organization and is obtained
through external third-party resources is external source data, such as
government publications, news publications, the Registrar General of India,
and the Planning Commission.

• Business Journals
• Government Records (e.g., census, tax records, Social Security info)
• Trade/Business Magazines
• The internet
• Sensor data: With the advancement of IoT devices, the sensors of
these devices collect data which can be used for sensor data
analytics to track the performance and usage of products.
• Satellite data: Satellites collect terabytes of images and data
every day through surveillance cameras, from which useful
information can be extracted.
• Web traffic: Because internet access is fast and cheap, users upload
data in many formats to different platforms, and this data can be
collected with their permission for data analysis.
Search engines also provide data on the keywords and
queries that are searched most often.
Types of Data
• Quantitative/Qualitative Data
• Discrete/Continuous Data
• Nominal
• Ordinal
• Interval
• Ratio
Data Collection
• In big data analysis, data collection is the process of
acquiring, collecting, extracting, and storing
voluminous amounts of data, which may be in
structured or unstructured form, such as text,
video, audio, XML files, records, or
image files, for use in later stages of data
analysis.
• Types of Data Collection: Primary and
Secondary.
Data Management
• Data management refers to the professional
practice of constructing and maintaining a
framework for ingesting, storing, mining, and
archiving the data integral to a modern
business.
• Examples: competitive exam data; organization
data from different departments; purchase history
data used to segment customers for future
marketing.
• ERP (enterprise resource planning) systems
Benefits of Data Management System
• Data management provides businesses with a way of measuring
the amount of data in play.
• Data management gives managers a big-picture look at business
processes, which helps with both perspective and planning.
• Once data is under management, it can be mined for
informational gold: business intelligence. This helps business users
across the organization in a variety of ways, including the
following:
• Smart advertising that targets customers according to their
interests and interaction.
• Holistic security that safeguards critical information
• Alignment with relevant compliance standards, saving time and
money
Data Management Challenges
• The amount of data can be (at least temporarily) overwhelming.
• The development team may work from one data set, the sales team
from another, operations from another, finance from yet another, and so on.
• The journey from unstructured data to structured data can be steep.
• By making team members aware of the benefits of data management
(and the potential pitfalls of ignoring it) and fostering the skills of
using data correctly, managers engage team members as essential
pieces of the information process.
Data Management
• Master Data Management: Master data management (MDM) is the process of
ensuring the organization is always working with — and making business decisions
based on — a single version of current, reliable information.
• Data quality management: Quality management is responsible for combing
through collected data for underlying problems like duplicate records, inconsistent
versions, and more. Data quality managers support the defined data management
system.
• Data security: One of the most important aspects of data management today is
security. Though emergent practices like DevSecOps incorporate security
considerations at every level of application development and data exchange,
security specialists are still tasked with encryption management, preventing
unauthorized access, guarding against accidental movement or deletion, and other
frontline concerns.
Data Quality/Security
• Data governance sets the law for an enterprise’s state of
information. A data governance framework is like a
constitution that clearly outlines policies for the intake, flow,
and protection of institutional information.
• Data governors oversee their network of stewards, quality
management professionals, security teams, and other people
and data management processes in pursuit of a governance
policy that serves a master data management approach.
• Data stewardship: A data steward does not develop
information management policies but rather deploys and
enforces them across the enterprise.
Big Data
• Big data consists of huge amounts of information
that cannot be stored or processed using traditional
data storage mechanisms or processing techniques.

• Big data refers to massive amounts of business data
from a wide variety of sources, much of which is
available in real time, and much of which is uncertain
or unpredictable. IBM calls these characteristics
volume, variety, velocity, and veracity.
Characteristics of Big Data Management
• Volume: This trait refers to the immense amounts of
information generated every second via social media, cell
phones, cars, transactions, connected sensors, images, video,
and text. In petabytes, terabytes, or even zettabytes, these
volumes can only be managed by big data technologies.
• Variety: To the existing landscape of transactional and
demographic data such as phone numbers and addresses,
information in the form of photographs, audio streams, video,
and a host of other formats now contributes to a multiplicity of
data types — about 80% of which are completely unstructured.
• Velocity: Refers to the speed with which big data can be
processed and analyzed to extract the insights and patterns it
contains. These days, that speed is often real-time.
• Veracity: This is the degree of reliability and truth that
big data has to offer in terms of its relevance,
cleanliness, and accuracy.
• Value: Since the primary aim of big data gathering and
analysis is to discover insights that can inform
decision-making and other processes, this
characteristic explores the benefit or otherwise that
information and analytics can ultimately produce.
• Structured data (as its name suggests) has a well-defined
structure and follows a consistent order. This kind of information is
designed so that it can be easily accessed and used by a person or
computer. Structured data is usually stored in the well-defined
rows and columns of a table (such as a spreadsheet) and
databases — particularly relational database management
systems, or RDBMS.

• Semi-structured data exhibits a few of the same properties as
structured data, but for the most part, this kind of information has
no definite structure and cannot conform to the formal rules of
data models such as an RDBMS.

• Unstructured data possesses no consistent structure across its
various forms and does not obey conventional data models' formal
structural rules. In very few instances, it may have information
related to date and time.
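
The three categories above can be illustrated with a small Python sketch. The sample records, field names, and values are invented for illustration; only the standard library is used.

```python
import csv
import io
import json

# Structured: fixed columns, every row follows the same schema (like an RDBMS table).
structured_text = "customer_id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n"
structured_rows = list(csv.DictReader(io.StringIO(structured_text)))

# Semi-structured: self-describing keys, but records may differ in shape.
semi_structured_text = (
    '[{"id": 1, "name": "Asha", "tags": ["loyal"]},'
    ' {"id": 2, "name": "Ravi", "phone": "555-0100"}]'
)
semi_structured_records = json.loads(semi_structured_text)

# Unstructured: free text with no schema at all.
unstructured_text = "Called support on Monday, unhappy about late delivery."

# Every structured row has exactly the same set of columns.
columns = set(structured_rows[0].keys())
same_schema = all(set(row.keys()) == columns for row in structured_rows)

# Semi-structured records can carry different keys from one record to the next.
keys_per_record = [sorted(record.keys()) for record in semi_structured_records]

print("structured rows share one schema:", same_schema)
print("semi-structured keys per record:", keys_per_record)
print("unstructured text length:", len(unstructured_text))
```

The structured rows can be queried by column name directly, while the semi-structured records need a per-record check for which keys exist, and the free text needs further processing before it can be analyzed at all.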
How is Big Data Collected
1. Asking for it: The majority of firms prefer asking
users directly to share their personal information,
such as a username and email address.
2. Cookies: They provide basic statistics about how a
website is used.
3. Email tracking: An email tracker detects when
an email was opened. Both Google and Yahoo use
this method to learn their users' behavioural
patterns and provide personalized advertising.
Business Intelligence
• It uses business process data to create charts
and tables that summarize business
performance.
• Its main purpose is to analyze business data to
create summarized information periodically.
• Techniques involved: summarization,
visualization, charting, etc.
• Software used: Business Objects, SAP/BI,
Pentaho.
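
As a rough illustration of the summarization and charting techniques listed above, the following Python sketch totals sales by region and prints a text bar chart. The sales records and the helper names summarize_by and text_bar_chart are hypothetical; real BI tools such as those named above do this at far larger scale.

```python
from collections import defaultdict

# Hypothetical transaction-level business process data.
sales = [
    {"region": "North", "month": "Jan", "amount": 1200.0},
    {"region": "North", "month": "Feb", "amount": 800.0},
    {"region": "South", "month": "Jan", "amount": 500.0},
    {"region": "South", "month": "Feb", "amount": 700.0},
]

def summarize_by(records, key):
    """Summarization: total the 'amount' field for each value of `key`."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["amount"]
    return dict(totals)

def text_bar_chart(totals, unit=100.0):
    """Charting: one line per label, one '#' per `unit` of the total."""
    lines = []
    for label, total in sorted(totals.items()):
        bar = "#" * int(total // unit)
        lines.append(f"{label:<6} {bar} {total:.0f}")
    return lines

region_totals = summarize_by(sales, "region")
for line in text_bar_chart(region_totals):
    print(line)
```

Passing "month" instead of "region" to summarize_by gives the periodic summary view of the same data.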
Dealing with missing data or incomplete data
• Missing data is data that is not captured for a variable
for the observation in question. Missing data reduces
the statistical power of the analysis, which can
distort the validity of the results.
Techniques for missing data
• The imputation method develops reasonable guesses
for missing data. It's most useful when the percentage
of missing data is low. If the portion of missing data is
too high, the imputed results lack the natural variation
needed to produce an effective model.
• The other option is to remove data. When dealing with
data that is missing at random, related data can be
deleted to reduce bias. Removing data may not be the
best option if there are not enough observations to
result in a reliable analysis. In some situations,
observation of specific events or factors may be
required.
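
The two options above can be compared with a minimal Python sketch. It assumes mean imputation as the "reasonable guess" (the text does not name a specific imputation method), and the ages list is invented sample data.

```python
from statistics import mean, pvariance

# Hypothetical variable with two missing observations (None).
ages = [34, None, 29, 41, None, 38]

def impute_mean(values):
    """Imputation: replace each missing value with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def drop_missing(values):
    """Removal: keep only the observations that were actually captured."""
    return [v for v in values if v is not None]

imputed = impute_mean(ages)
dropped = drop_missing(ages)

print("imputed:", imputed)
print("dropped:", dropped)
# Mean imputation keeps the sample size but shrinks the spread of the data.
print("variance after imputation:", pvariance(imputed))
print("variance after removal:   ", pvariance(dropped))
```

The imputed series keeps all six observations but has a smaller variance than the observed values alone, which is the loss of natural variation described above. Removal keeps the variation but leaves only four observations.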
Reason for Missing Data
• Missing at Random (MAR): The data is not missing
across all observations but only within
sub-samples of the data. The missing data can be
predicted based on the complete observed data.
• Missing Completely at Random (MCAR): In the MCAR
situation, the data is missing across all
observations regardless of the expected value or
other variables. Data scientists can compare two
sets of data, one with missing observations and
one without. If a t-test shows no significant
difference between the two data sets, the data is
characterized as MCAR.
• Missing Not at Random (MNAR): The MNAR category
applies when the missing data has
a structure to it. In other words, there appear to be
reasons the data is missing. In a survey, perhaps a
specific group of people, say women ages 45 to 55,
did not answer a question. Unlike MAR, the missing values
cannot be determined from the observed data, because the
missing information is unknown. Data scientists must
model the missing data to develop an unbiased
estimate. Simply removing observations with missing
data could result in a model with bias.
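
The t-test comparison described for MCAR can be sketched in Python as follows. This is a simplified illustration: it computes a Welch t statistic by hand on a fully observed variable (age), split by whether another variable (income) is missing. The records, the function names, and the cutoff of 2.0 are assumptions. A real analysis would use a p-value from the t distribution rather than a fixed cutoff.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical survey records: age is always observed, income is sometimes missing.
records = [
    {"age": 30, "income": None},
    {"age": 31, "income": 42000},
    {"age": 40, "income": None},
    {"age": 39, "income": 51000},
    {"age": 50, "income": None},
    {"age": 50, "income": 63000},
    {"age": 40, "income": 48000},
]

def split_by_missing(rows, target, observed):
    """Split the observed variable by whether the target variable is missing."""
    missing = [r[observed] for r in rows if r[target] is None]
    present = [r[observed] for r in rows if r[target] is not None]
    return missing, present

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error

ROUGH_CUTOFF = 2.0  # illustrative assumption, not a proper critical value

ages_missing, ages_present = split_by_missing(records, "income", "age")
t_stat = welch_t(ages_missing, ages_present)
looks_mcar = abs(t_stat) < ROUGH_CUTOFF

print("t statistic:", t_stat)
print("consistent with MCAR on this check:", looks_mcar)
```

A small t statistic only means this one observed variable does not differ between the two groups. It does not prove the data is MCAR, and it cannot detect MNAR, where the missingness depends on the unobserved values themselves.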
• Listwise deletion: In this method, all data for an observation that has one
or more missing values are deleted. The analysis is run only on observations
that have a complete set of data. If only a small number of cases have missing
values, eliminating those cases from the analysis may be the most efficient
method. However, in most cases, the data are not missing completely at random
(MCAR). Deleting the instances with missing observations can result in biased
parameters and estimates and reduce the statistical power of the analysis.
• Pairwise deletion: Pairwise deletion also assumes data are missing completely
at random (MCAR), but every case is used in each calculation for which it has
the required values, even if it is missing data for other variables. Pairwise
deletion allows data scientists to use more of the data. However, the resulting
statistics may vary because they are based on different subsets of the data.
The results may be impossible to duplicate with a complete set of data.
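
The difference between the two deletion methods can be seen in a short Python sketch. The rows and the function names listwise_delete and pairwise_select are hypothetical, chosen only to show how many cases each method keeps.

```python
# Hypothetical data set: None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "score": 7},
    {"age": None, "income": 61000, "score": 8},
    {"age": 29, "income": None, "score": 6},
    {"age": 41, "income": 58000, "score": None},
    {"age": 38, "income": 60000, "score": 9},
]

def listwise_delete(data):
    """Keep only the rows in which every variable is observed."""
    return [r for r in data if all(v is not None for v in r.values())]

def pairwise_select(data, var_a, var_b):
    """Keep the (var_a, var_b) pairs from rows where both variables are observed."""
    return [
        (r[var_a], r[var_b])
        for r in data
        if r[var_a] is not None and r[var_b] is not None
    ]

complete_cases = listwise_delete(rows)
age_income = pairwise_select(rows, "age", "income")
age_score = pairwise_select(rows, "age", "score")

print("listwise cases:", len(complete_cases))
print("pairwise age/income cases:", len(age_income))
print("pairwise age/score cases:", len(age_score))
```

Listwise deletion leaves two complete cases, while each pairwise analysis keeps three. The age/income and age/score pairs come from different subsets of rows, which is why pairwise statistics can be inconsistent with one another.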
