Unit - 1 Notes - Introduction to Data Analytics
Categorical data can be divided into two types:
1. Nominal
2. Ordinal
Nominal Data Type
• These are sets of values that don’t possess a natural ordering.
• For example, the color of a smartphone can be considered a nominal data type, as we can’t compare one color with another.
• It is not possible to state that ‘Red’ is greater than ‘Blue’.
• Another example is the gender of a person, where we can’t rank male, female, or others.
Ordinal Data Type
• These types of values have a natural ordering while maintaining their class of values.
• If we consider the sizes offered by a clothing brand, we can easily sort them according to their name tags in the order small < medium < large.
• The grading system used to mark candidates in a test can also be considered an ordinal data type, where A+ is definitely better than a B grade.
Encoding Categorical Data
• These categories help us decide which encoding strategy can be applied to which type of data.
• For the ordinal data type, label encoding can be applied, which is a form of integer encoding.
One-Hot Encoding Scheme
• In one-hot encoding, we create a new variable for each level of a categorical feature.
• Each category is mapped to a binary variable containing either 0 or 1, as in the sketch below.
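A minimal sketch of one-hot encoding using pandas (the `color` column and its values are made-up examples; `pd.get_dummies` is a standard pandas function):

```python
import pandas as pd

# Made-up nominal feature: smartphone colors (no natural ordering).
df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Red"]})

# Each level of 'color' becomes its own 0/1 column.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)
print(one_hot)
#    color_Blue  color_Green  color_Red
# 0           0            0          1
# 1           1            0          0
# 2           0            1          0
# 3           0            0          1
```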
Ordinal Encoding
• An ordinal encoding involves mapping each unique label to an integer value (see the sketch below).
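A minimal sketch of ordinal encoding using the clothing-size example from above; the hand-written mapping preserves the natural order of the labels:

```python
sizes = ["small", "medium", "large", "medium"]

# Map each unique label to an integer that respects small < medium < large.
order = {"small": 0, "medium": 1, "large": 2}
encoded = [order[s] for s in sizes]
print(encoded)  # [0, 1, 2, 1]
```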
Numerical data can be divided into two types:
1. Discrete
2. Continuous
Discrete Data Type
• Numerical values that are integers or whole numbers are placed under this category.
Data can be collected from two types of sources:
1. Internal Source
2. External Source
Internal Source
• When data are collected from reports and
records of the organization itself, it is known
as the internal source.
Based on structure, data can be classified as:
• Structured data
• Semi-structured data
• Quasi-structured data
• Unstructured data
Structured Data
• It is the data containing a defined data type,
format, and structure.
• Many Big Data solutions and tools have the ability to ‘read’
and process either JSON or XML.
• And its volumes are growing rapidly, many times faster than the rate of growth of structured databases.
This shows that this kind of data analytics application can make cities safer without police officers putting their lives at risk.
Applications of Data Analytics
Transportation
• A few years back, at the London Olympics, there was a need to handle over 18 million journeys made by fans in the city of London, and fortunately, it was sorted out.
• How was this feat achieved? TfL (Transport for London) and the train operators made use of data analytics to ensure the large number of journeys went smoothly.
• They were able to input data from events that took place and forecast the number of people who were going to travel; transport was run efficiently and effectively so that athletes and spectators could be transported to and from the respective stadiums.
Applications of Data Analytics
• Managing risk in the insurance industry.
• Web Provisioning: the key component of this is being able to shift bandwidth to the right time and location. This can only be achieved by the use of data.
• A study carried out recently showed that a lack of investment in technology was the cause of customer dissatisfaction among the present generation of insurance customers, because they prefer using mobile and online channels, social media, and other recent mediums to interact with their agents.
• However, the older generation still prefers the use of the telephone.
• The need to sort, organize, analyze, and present this critical data in a systematic manner has led to the rise of Big Data.
What is Big Data?
• The term Big Data refers to all the data that is being
generated across the globe at an unprecedented
rate.
• Big data platforms are also delivered through the cloud, where the provider offers all-inclusive big data solutions and services.
What is Hadoop?
• Apache Hadoop is an open source framework that is used to
efficiently store and process large datasets ranging in size
from gigabytes to petabytes of data.
• Velocity: the rate at which new data is being generated, thanks to our dependence on the internet, sensors, and machine-to-machine data; it is also important to parse Big Data in a timely manner.
Approaches for processing Big Data include:
– Cloud Computing
– Grid Computing
– MapReduce
Massively Parallel Processing Systems (MPP)
• MPP (massively parallel processing) is the
coordinated processing of a program by multiple
processors.
• These processors work on different parts of the program, with each processor using its own operating system and memory.
• Typically, MPP processors communicate using
some messaging interface.
• In some implementations, up to 200 or more
processors can work on the same application.
Handling of Big Data using MPP
• Big data is split into many parts, and the processors work in parallel on each part of the data, as in the sketch below.
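A minimal sketch of the MPP idea using Python’s multiprocessing module: the data is split into parts, each worker process (with its own memory) handles one part, and the partial results are then combined. The data and the summing task are made up for illustration:

```python
from multiprocessing import Pool

def process_part(part):
    # Each worker handles its own slice of the data independently.
    return sum(part)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_parts = 4
    chunk = len(data) // n_parts
    parts = [data[i * chunk:(i + 1) * chunk] for i in range(n_parts)]

    # Four processes work on the four parts in parallel.
    with Pool(n_parts) as pool:
        partial_sums = pool.map(process_part, parts)

    print(sum(partial_sums))  # same answer as sum(data)
```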
Grid computing
• Uses multiple computers in distributed networks.
• This type of architecture uses resources opportunistically, based on their availability.
• This architecture reduces costs for server space, but also limits bandwidth and capacity at peak times or when there are too many requests.
Computer clustering
• Links the available computing power into nodes that can connect with each other to handle multiple tasks at once.
MapReduce
• MapReduce is now the most widely used general-purpose computing model and run-time system for distributed data analytics (a word-count sketch follows below).
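A minimal sketch of the MapReduce model in plain Python: the classic word count. The map phase emits (word, 1) pairs, the pairs are grouped by key, and the reduce phase sums the counts; a real MapReduce run-time such as Hadoop would distribute these phases across many machines:

```python
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Group the pairs by key and sum the counts per word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big ideas", "data analytics"]
pairs = [p for d in docs for p in map_phase(d)]
print(reduce_phase(pairs))
# {'big': 2, 'data': 2, 'ideas': 1, 'analytics': 1}
```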
Tableau
• Users can easily create interactive graphs, maps, and live dashboards in minutes.
• No coding is required.
• Tableau’s Big Data capabilities make it important, and one can analyze and visualize data better than with any other data visualization software in the market.
A few examples of Tableau output (figures omitted).
Python
• Python is an object-oriented scripting language which is easy to read, write, and maintain, and it is a free, open-source tool.
• It was developed by Guido van Rossum in the late 1980s and supports both functional and structured programming methods.
• Python is easy to learn as it is very similar to JavaScript, Ruby, and PHP.
• Also, Python has very good machine learning libraries, viz. Scikit-learn, Theano, TensorFlow, and Keras (a small sketch follows below).
• Another important feature of Python is that it can be used on almost any platform and with data sources such as an SQL server, a MongoDB database, or JSON.
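A minimal sketch of machine learning in Python with Scikit-learn (mentioned above); the iris dataset and the logistic-regression model are standard library examples chosen for illustration, not specific to these notes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load a built-in example dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a simple classifier and report accuracy on unseen data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))
```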
SAS (Statistical Analysis System)
• SAS is a programming environment and language for data
manipulation and a leader in analytics.
• SAS was first developed in 1966 and was further developed in the 1980s and 1990s.
• SAS is easily accessible and manageable and can analyze data from any source.
• SAS introduced a large set of products in 2011 for customer intelligence, along with numerous SAS modules for web, social media, and marketing analytics that are widely used for profiling customers and prospects.
• It can also predict their behaviors, manage, and optimize
communications.
KNIME (Konstanz Information Miner)
• KNIME is a free and open-source data analytics, reporting and
integration platform.
• KNIME allows you to analyze and model data through visual programming.
Apache Spark
• The University of California, Berkeley’s AMPLab developed Apache Spark in 2009.
• Apache Spark is a data processing engine that executes applications in Hadoop clusters 100 times faster in memory and 10 times faster on disk.
• Spark makes concepts of data science effortless.
• Spark is also popular for building data pipelines and developing machine learning models.
• Spark also includes a library, MLlib, that provides a progressive set of machine learning algorithms for repetitive data science techniques like classification, regression, collaborative filtering, clustering, etc. (a small sketch follows below).
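A minimal PySpark sketch of the in-memory processing style described above (it assumes pyspark is installed and a local Spark is available; the data and the squaring task are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sketch").getOrCreate()

# Distribute some data and cache the transformed result in memory;
# keeping intermediate data in memory across repeated actions is
# where Spark's speed advantage over disk-based MapReduce comes from.
rdd = spark.sparkContext.parallelize(range(1, 101))
squares = rdd.map(lambda x: x * x).cache()

print(squares.sum())    # first action: computes and caches
print(squares.count())  # second action: served from memory

spark.stop()
```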
RapidMiner
• RapidMiner is a powerful integrated data science platform.
• It is developed to perform predictive analysis and other
advanced analytics like data mining, text analytics, machine
learning and visual analytics without any programming.
• RapidMiner can integrate with many data source types, including Access, Excel, Microsoft SQL, Teradata, Oracle, Sybase, IBM DB2, Ingres, MySQL, IBM SPSS, dBase, etc.
• The tool is very powerful and can generate analytics based on real-life data transformation settings, i.e., you can control the formats and data sets for predictive analysis.
QlikView
• QlikView has many unique features, like in-memory data processing, which delivers results to end users very fast and stores the data in the report itself.