
UNIT – I

Introduction to Data Analytics


Subject Code: KCS-051 / KIT-601
Dr. Monika Sainger
Course Outcome (CO) & Bloom’s Knowledge Level (KL)

CO 1: Discuss various concepts of the data analytics pipeline (K1, K2)
CO 2: Apply classification and regression techniques (K3)
CO 3: Explain and apply mining techniques on streaming data (K2, K3)
CO 4: Compare different clustering and frequent pattern mining algorithms (K4)
CO 5: Describe the concept of R programming and implement analytics on Big Data using R (K2, K3)
Bloom's Taxonomy
• One of the most recognized learning theories in the field of education.
• Bloom's Taxonomy is used to create learning outcomes that target not only the subject matter but also the depth of learning students should achieve.
• It is then used to create assessments that accurately report on students' progress towards these outcomes.
What is Data Analytics ?
• It refers to the process of examining datasets to draw conclusions about the information they contain.

• Data analytic techniques enable you to take raw data and uncover patterns to extract valuable insights from it.
Why Data Analytics ?
• Data is the fuel that can drive a business along the right path.
• Data can help businesses better understand their customers, improve their advertising campaigns, personalize their content and improve their bottom lines.
• The advantages of data are many, but you can't access these benefits without the proper data analytics tools and processes.
• While raw data has a lot of potential, you need data analytics to unlock the power to grow your business.
What is Data?
• Data are a set of values of qualitative or quantitative variables about one or more persons or objects.

• When data is processed, organized, structured or presented in a given context so as to make it useful, it is called information.
Types of Data
• Since data has so much importance in our lives, it becomes important to store and process it properly, without any error.
• When dealing with datasets, the category of data plays an important role in determining which preprocessing strategy will work for a particular set, or which type of statistical analysis should be applied for the best results.
• Let's dive into some of the commonly used categories of data.
Types of Data

• Qualitative Data Type

• Quantitative Data Type


Qualitative Data Type
• Qualitative or Categorical Data describes the object under consideration using a finite set of discrete classes.
• It means that this type of data can't be counted or measured easily using numbers, and it is therefore divided into categories.
• The gender of a person (male, female, or other) is a good example of this data type.
• These data are usually extracted from audio, images, or text. Another example is a smartphone brand that provides information about the current rating, the color of the phone, the category of the phone, and so on.
• All this information can be categorized as qualitative data.
Qualitative Data Type
• There are two subcategories under this:

1. Nominal

2. Ordinal
Nominal Data Type
• These are sets of values that don't possess a natural ordering.
• For example, the color of a smartphone can be considered a nominal data type, as we can't compare one color with another.
• It is not possible to state that 'Red' is greater than 'Blue'.
• Another example is the gender of a person, where there is no natural order between male, female, or other.
Ordinal Data Type
• These types of values have a natural ordering while maintaining their class of values.
• If we consider the sizes offered by a clothing brand, we can easily sort them according to their name tags in the order small < medium < large.
• The grading system used while marking candidates in a test can also be considered an ordinal data type, where an A+ is definitely better than a B grade.
Encoding Qualitative Data
• These categories help us decide which encoding strategy can be applied to which type of data.

• Data encoding is especially important for qualitative data.

• This is because machine learning models can't handle these values directly; they need to be converted to numerical types, as the models are mathematical in nature.
Encoding Qualitative Data
• For the nominal data type, where there is no comparison among the categories, one-hot encoding can be applied; this is similar to binary coding and works well when the number of categories is small.

• For the ordinal data type, label encoding can be applied, which is a form of integer encoding.
One-Hot Encoding Scheme
• In one-hot encoding, for each level of a categorical feature,
we create a new variable.
• Each category is mapped with a binary variable containing
either 0 or 1.
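
A minimal sketch of one-hot encoding in Python (assuming the pandas library; the "color" column and its values are made up for illustration):

    import pandas as pd

    # Hypothetical nominal feature: smartphone color (no natural ordering)
    df = pd.DataFrame({"color": ["Red", "Blue", "Red", "Black"]})

    # One new binary indicator column is created per category level
    encoded = pd.get_dummies(df, columns=["color"])
    print(encoded)

Each row gets a 1/True in the column for its own color and 0/False elsewhere, so no artificial order is imposed on the categories.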
Ordinal Encoding
• An ordinal encoding involves mapping each unique label to an integer value.

• This type of encoding is really only appropriate if there is a known relationship between the categories.

• For example, “S” is 38, “L” is 40, and “XL” is 42...
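
A minimal sketch of ordinal (label) encoding in plain Python, assuming hypothetical clothing-size labels with a known order:

    # Known ordering of the categories: S < M < L < XL
    order = ["S", "M", "L", "XL"]
    size_to_int = {size: i for i, size in enumerate(order)}  # S->0, M->1, ...

    labels = ["M", "S", "XL", "L"]
    encoded = [size_to_int[s] for s in labels]
    print(encoded)  # [1, 0, 3, 2]

Because the integers preserve the ordering, a model can meaningfully compare the encoded values.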


Quantitative Data Type
• This data type tries to quantify things, and it does so by considering numerical values that make it countable in nature.

• The price of a smartphone, the discount offered, the number of ratings on a product, the frequency of a smartphone's processor, or the RAM of that particular phone: all of these fall under the category of quantitative data types.

• The key thing is that there can be an infinite number of values a feature can take.

• For instance, the price of a smartphone can vary from x amount to any value, and it can be further broken down based on fractional values.
Quantitative Data Type
• The two subcategories which describe them
clearly are:

1. Discrete
2. Continuous
Discrete Data Type
• Numerical values that are integers or whole numbers are placed under this category.

• The number of speakers in the phone, the number of cameras, the number of cores in the processor, and the number of SIMs supported are some examples of the discrete data type.
Continuous Data Type
• Continuous data can be divided up as much as you want and measured to many decimal places.

• Fractional numbers are considered continuous values.

• These can take the form of the operating frequency of the processor, the Wi-Fi frequency, the temperature of the cores, and so on.
Understanding Data Analytics
• Data analytics is a broad term that encompasses many diverse types of data analysis.

• Any type of information can be subjected to data analytics techniques to get insight that can be used to improve things.

• For example, manufacturing companies often record the runtime, downtime, and work queue for various machines and then analyze the data to better plan the workloads so the machines operate closer to peak capacity.
Understanding Data Analytics
• Data analytics can do much more than point out bottlenecks in production.

• Gaming companies use data analytics to set reward schedules that keep the majority of players active in the game.

• Content companies use many of the same data analytics techniques to keep you clicking and watching, re-organizing content to get another view or another click.
Data Analysis Process
• The data analysis process involves several different
steps:

1. Determining the data requirements.

2. The process of collection of data.

3. Organization of data so it can be analyzed.

4. Cleaning of data before analysis.


Determining the Data Requirements
• The first step is to determine the data requirements, or how the data is grouped.

• Data may be separated by age, demographic, income, or gender.

• Data values may be numerical or be divided by category.
The Process of Collection of Data
• The second step in data analytics is the process of collecting it.

• This can be done through a variety of sources such as computers, online sources, cameras, environmental sources, or through personnel.
Organization of data
• Once the data is collected, it must be organized so it can be analyzed.

• Organization may take place in a spreadsheet or another form of software that can handle statistical data.
Cleaning of data
• The data is then cleaned up before analysis.

• This means it is scrubbed and checked to ensure there is no duplication or error, and that it is not incomplete.

• This step helps correct any errors before the data goes on to a data analyst to be analyzed.
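
As a minimal sketch (using pandas; the records and column names are hypothetical), scrubbing duplicates and incomplete rows might look like this:

    import pandas as pd

    # Hypothetical collected data with a duplicated record and missing values
    df = pd.DataFrame({
        "age":    [25, 25, None, 42],
        "income": [50000, 50000, 61000, None],
    })

    clean = df.drop_duplicates().dropna()  # remove duplicates and incomplete rows
    print(clean)

In practice, incomplete rows may be imputed rather than dropped; dropping is shown here only for brevity.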
Why Data Analytics Matters?
• Data analytics is important because it helps businesses optimize their performance.

• With it, companies can reduce costs by identifying more efficient ways of doing business and by storing large amounts of data.

• A company can also use data analytics to make better business decisions and to help analyze customer trends and satisfaction, which can lead to new and better products and services.
Types of Data Analytics
• Data analytics is broken down into four basic
types:
1. Descriptive analytics
2. Diagnostic analytics
3. Predictive analytics
4. Prescriptive analytics
Descriptive analytics
• Descriptive statistics is the type of statistics used to summarize and describe a dataset.

• It is used to describe the characteristics of the data.
Descriptive analytics
• It describes what has happened over a given
period of time.

• Have the number of views gone up?

• Are sales stronger this month than last?


Descriptive analytics
• Descriptive statistics can be useful for two purposes:

1) To provide basic information about variables in a dataset, and

2) To highlight potential relationships between variables.
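
As a minimal sketch of the first purpose (assuming pandas and a made-up series of monthly sales), basic summary statistics can be produced like this:

    import pandas as pd

    # Hypothetical monthly sales figures
    sales = pd.Series([120, 135, 128, 150, 142])

    print(sales.describe())  # count, mean, std, min, quartiles, max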
Diagnostic analytics
• Diagnostic analysis is a special type of analytical technique.

• In it, data is interpreted and analyzed properly to find out what happened, or what caused a particular outcome (for example, a cyber breach).

• Diagnostic analytics helps you understand why something happened in the past.
Diagnostic analytics
• It focuses more on why something happened.

• This involves more diverse data inputs and a bit of hypothesizing.

• Did the weather affect ice cream sales?

• Did that latest marketing campaign impact sales?
Predictive analytics
• Predictive analytics is the branch of advanced analytics used to make predictions about unknown future events.

• It is the use of data, statistical algorithms and machine learning techniques to identify the likelihood of future outcomes based on historical data.
Predictive analytics
• The goal is to go beyond knowing what has happened to providing the best assessment of what will happen in the future.

• It serves as a decision-making tool.
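
A minimal sketch of the idea (using scikit-learn with made-up historical data; a toy illustration, not a production model): fit a model on past observations, then assess a future outcome:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical historical data: advertising spend vs. units sold
    spend = np.array([[10.0], [20.0], [30.0], [40.0]])  # feature, in $1000s
    sales = np.array([25.0, 45.0, 62.0, 85.0])          # observed target

    model = LinearRegression().fit(spend, sales)
    print(model.predict(np.array([[50.0]])))  # likely sales at a $50k spend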


Prescriptive analytics
• Prescriptive analytics focuses on finding the
best course of action in a scenario, given the
available data.
• It is related to both descriptive analytics and
predictive analytics.
• But it emphasizes actionable insights instead
of data monitoring.
Prescriptive analytics
• Prescriptive analytics is the final step of
business analytics.

• It suggests a course of action.


Data Source
• A data source, in the context of computer science and computer applications, is the location where the data being used comes from.

• In a database management system, the primary data source is the database, which can be located on a disk or a remote server.

• The data source for a computer program can be a file, a data sheet, a spreadsheet, an XML file or even hard-coded data within the program.
What are the different sources of data?
• Following are the two sources of data:

1. Internal Source

2. External Source
Internal Source
• When data are collected from reports and
records of the organization itself, it is known
as the internal source.

• For example, a company publishes its ‘Annual


Report’ on Profit and Loss, Total Sales, Loans,
Wages, etc.
External Source

• When data are collected from outside the organization, it is known as an external source.

• For example, if a Tour and Travels Company obtains information on ‘Karnataka Tourism’ from the Karnataka Transport Corporation, this would be known as an external source of data.
Data Classification
• Structured Data

• Semi-structured Data

• Quasi-structured Data

• Unstructured data
Structured Data
• It is data containing a defined data type, format, and structure.

• Examples include transaction data, traditional RDBMS tables, CSV files, and even simple spreadsheets.
• Structured data adheres to a pre-defined data
model.

• It is, therefore, straightforward to analyze.

• Structured data conforms to a tabular format, with relationships between the different rows and columns.
• Structured data depends on the existence of a data model, that is, a model of how data can be stored, processed and accessed.

• Because of the data model, each field is discrete and can be accessed separately or jointly along with data from other fields.
• This makes structured data extremely powerful.

• It is possible to quickly aggregate data from various locations in the database.

• Structured data is considered the most ‘traditional’ form of data storage, since the earliest versions of database management systems (DBMS) were able to store, process and access structured data.
• An RDBMS is a typical example of structured data storage.

• For example, an RDBMS may store the characteristics of support calls as typical structured data, with attributes such as time stamps, machine type, problem type, and operating system.
Semi-structured Data
• Semi-structured data is data that does not conform to a data model but has some structure.

• It lacks a fixed or rigid schema.

• It is data that does not reside in a relational database but that has some organizational properties that make it easier to analyze.

• With some processing, we can store it in a relational database.
Semi-structured Data
• Semi-structured data is a form of structured data that contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.

• Therefore, it is also known as a self-describing structure.

• JSON and XML are common forms of semi-structured data.
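
A minimal sketch of why such data is called self-describing (using Python's standard json module; the record itself is made up):

    import json

    # Tags name each field, so the structure travels with the data
    record = '{"phone": {"brand": "Acme", "color": "Red", "rating": 4.3}}'

    data = json.loads(record)
    print(data["phone"]["color"])  # fields are addressable without a rigid schema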
Semi-structured Data
• This third category exists (between structured and unstructured data) because semi-structured data is considerably easier to analyze than unstructured data.

• Many Big Data solutions and tools have the ability to ‘read’ and process either JSON or XML.

• This reduces the complexity of analyzing semi-structured data, compared to unstructured data.
Quasi-structured Data
• This type of data consists of textual content with an erratic data format.

• Formatting this type of data requires effort, software tools and time.

• An example is data about which web pages a user visited and in what order.
Unstructured Data
• Unstructured data is information that is not arranged according to a pre-set data model or schema.

• Therefore it cannot be stored in a traditional relational database or RDBMS.

• Text and multimedia are two common types of unstructured content.

• Many business documents are unstructured, as are email messages, videos, photos, webpages, and audio files.
Unstructured Data
• An estimated 80 to 90 percent of the data generated and collected by organizations is unstructured.

• Its volume is growing rapidly, many times faster than the rate of growth of structured databases.

• Unstructured data stores contain a wealth of information that can be used to guide business decisions.

• However, unstructured data has historically been very difficult to analyze.

• With the help of AI and machine learning, it is possible to uncover beneficial and actionable business intelligence from this type of data.
Applications of Data Analytics
Healthcare Services

 Improving the quality of medical care.

 Coping with the pressure to serve a large number of patients.

 The use of machine and instrument data has risen dramatically in order to optimize and track treatment.
Applications of Data Analytics
Accounting
• Various EMI schemes are direct products of data analytics.
• Bookkeepers use data analytics to gain significant insights into finances, to identify the kinds of improvements that can be made to build productivity, and to deal with the risks that might be confronted.
• In doing so, they form strong relationships with numerous business leaders.
• Accordingly, data analytics helps in gaining the knowledge needed to improve finances.
• With better analysis of gains and profits, bookkeepers can reshape their thinking on financing.
Applications of Data Analytics
Extortion and Risk Detection
 This was one of the earliest applications of data science, and it emerged from the field of finance.

 It helps banks divide and conquer the data from their clients' profiles, recent expenditures and other critical data made accessible to them.

 This made it simple for banks to examine the data and deduce whether there was any likelihood of a client defaulting.
Applications of Data Analytics
Policing/Security

 Several cities all over the world have employed predictive analysis to predict areas that are likely to witness a surge in crime, using geographical and historical data.

 This has seemed to work in major cities such as Chicago, London, Los Angeles, etc.

 This kind of data analytics application can make cities safer without police putting their lives at risk.
Applications of Data Analytics
Transportation
• A few years back, at the London Olympics, there was a need to handle over 18 million journeys made by fans in the city of London, and fortunately it was managed successfully.

• How was this feat achieved? TfL and the train operators made use of data analytics to ensure the large number of journeys went smoothly.

• They were able to input data from the events taking place and forecast the number of people who were going to travel; transport was run efficiently and effectively so that athletes and spectators could be transported to and from the respective stadiums.
Applications of Data Analytics
• Managing risk in the insurance industry.

• Web provisioning: the key component of this is being able to shift bandwidth to the right time and location. This can only be achieved by the use of data.

• A recent study showed that a lack of investment in technology was the cause of dissatisfaction among the present generation of insurance customers, because they prefer using mobile and online channels, social media and other recent mediums to interact with their agents.

• However, the older generation still prefers the use of the telephone.

• To improve the overall customer experience, it is best for insurance companies to provide a wide range of communication methods for their customers.
Reporting
• Reporting is a process in which data is organized and summarized in an easy-to-understand format.

• Reports enable organizations to monitor various performance parameters and improve customer satisfaction.

• Reporting is used as a tool to monitor the day-to-day business operations of an organization.
Analysis
• Analysis is a process in which data and reports are examined to get insight from them.

• These insights help an organization to perform important tasks in a timely manner.

• These tasks may be planning a new strategy, taking important business decisions, introducing a new product, and improving customer satisfaction.
Analysis Vs Reporting
1. In simple words, reporting can be considered a process in which raw data is transformed into useful information, and analysis a process that transforms information into insights.

2. A good report enables an organization to ask the most relevant questions about the business and its customers, whereas analysis enables the organization to answer those questions and implement the insights.
Analysis Vs Reporting
3. Reports enable an organization to understand what is happening, but analysis helps an organization understand why it is happening and what action can be taken.

4. An analysis can result in reports, and reports can result in analysis.
What is Big Data?
Think of the following:
• Every second, there are around 822 tweets on Twitter.
• Every minute, nearly 510 comments are posted, 293,000 statuses are updated, and 136,000 photos are uploaded on Facebook.
• Every hour, Walmart, a department store chain, handles more than 1 million customer transactions.
• Every day, around 11.5 million payments are made on PayPal alone.
What is Big Data?
• That means data is coming from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, cell phone GPS signals, and so on.

• Collecting such a huge amount of data would just be a waste of time, effort and storage space if it were not being used for some purpose.

• For analytics, however, there is a need to sort, organize, analyze and deliver this critical data in a systematic manner, and this need has led to the rise of Big Data.
What is Big Data?
• The term Big Data refers to all the data that is being generated across the globe at an unprecedented rate.

• This data could be either structured or unstructured.

• Today's business enterprises owe a huge part of their success to an economy that is firmly knowledge (data) oriented.
Big Data
• There is a need to convert Big Data into business intelligence that enterprises can readily deploy.

• Better data leads to better decision making.

• It provides an improved way for organizations to strategize, regardless of their size, geography, market share, customer segmentation and other such categorizations.

• Hadoop is the platform of choice for working with extremely large volumes of data.
Introduction to Big Data Platform
• A big data platform is a type of IT solution.

• It combines the features and capabilities of several big data applications and utilities within a single solution.

• It is an enterprise-class IT platform that enables an organization to develop, deploy, operate and manage a big data infrastructure/environment.
Big Data Platform
• A big data platform generally consists of big data storage, servers, databases, big data management, business intelligence and other big data management utilities.

• It also supports custom development, querying and integration with other systems.

• The primary benefit of a big data platform is that it reduces the complexity of multiple vendors/solutions into one cohesive solution.

• Big data platforms are also delivered through the cloud, where the provider offers an all-inclusive big data solution and services.
What is Hadoop?
• Apache Hadoop is an open source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data.

• Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets in parallel more quickly.
Characteristics of Big Data
Big Data has certain characteristics and hence is defined using
4Vs namely:
 Volume
 Velocity
 Variety
 Veracity
• Volume: the amount of data that businesses can collect is really enormous, and hence the volume of the data becomes a critical factor in Big Data analytics.

• Velocity: the rate at which new data is being generated, thanks to our dependence on the internet, sensors, and machine-to-machine data, makes it important to parse Big Data in a timely manner.

• Variety: the data that is generated is completely heterogeneous, in the sense that it could be in various formats like video, text, database, numeric, sensor data and so on; hence understanding the type of Big Data is a key factor in unlocking its value.

• Veracity: knowing whether the available data comes from a credible source is of utmost importance before deciphering and implementing Big Data for business needs.
The Evolution of Analytic Scalability
• The world of big data requires new levels of scalability.

• As the amount of data organizations process continues to increase, the same old methods for handling data just won't work anymore.

• Organizations that don't scale their technologies to handle large data will quite simply choke on big data.
The Evolution of Analytic Scalability
• Luckily, there are multiple technologies available that address different aspects of the process of taming big data and making use of it in analytic processes.

• Some of these advances are quite new, and organizations need to keep up with the times.
The Evolution of Analytic Scalability
• A few of these technologies are:

– Massively Parallel Processing (MPP) architectures,

– Cloud Computing

– Grid Computing

– MapReduce.
Massively Parallel Processing Systems (MPP)
• MPP (massively parallel processing) is the coordinated processing of a program by multiple processors.
• These processors work on different parts of the program, with each processor using its own operating system and memory.
• Typically, MPP processors communicate using some messaging interface.
• In some implementations, up to 200 or more processors can work on the same application.
Handling of Big Data using MPP
• Big data is split into many parts, and the processors work in parallel on each part of the data.

• A divide and conquer strategy is implemented, as the sketch below illustrates.
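
A toy single-machine illustration of the divide-and-conquer idea (Python's multiprocessing module standing in for real MPP hardware; the workload is made up):

    from multiprocessing import Pool

    def process_part(part):
        # stand-in for the real work done on one partition of the data
        return sum(part)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        parts = [data[i::4] for i in range(4)]        # split the data into 4 parts
        with Pool(processes=4) as pool:
            partials = pool.map(process_part, parts)  # work on the parts in parallel
        print(sum(partials))                          # combine the partial results

Real MPP systems apply the same pattern across many machines, each with its own operating system and memory.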


There are several types of MPP database architectures, each with its own benefits:

Grid computing –
 uses multiple computers in distributed networks.
 This type of architecture uses resources opportunistically, based on their availability.
 This reduces costs for server space, but also limits bandwidth and capacity at peak times or when there are too many requests.

Computer clustering – links the available power into nodes that can connect with each other to handle multiple tasks at once.
MapReduce
• MapReduce is now the most widely used general-purpose computing model and run-time system for distributed data analytics.

• It provides a flexible and scalable foundation for analytics, from traditional reporting to leading-edge machine learning algorithms.
MapReduce
• In this model, a compute “job” is decomposed into smaller “tasks”.
• These tasks correspond to separate Java Virtual Machine (JVM) processes in the Hadoop implementation.
• The tasks are distributed around the cluster to parallelize and balance the load as much as possible.
MapReduce
• The MapReduce runtime infrastructure coordinates the tasks, re-running any that fail or hang.

• Users of MapReduce do not need to implement parallelism or reliability features themselves.

• Instead, they focus on the data problem they are trying to solve.
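
A minimal single-process sketch of the MapReduce model (the classic word-count example in plain Python; Hadoop would run the map and reduce tasks as separate processes distributed across a cluster):

    from collections import defaultdict

    def map_phase(document):
        # map task: emit a (word, 1) pair for each word in the document
        for word in document.split():
            yield word.lower(), 1

    def reduce_phase(pairs):
        # shuffle + reduce: group the values by key and aggregate them
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    docs = ["big data needs big tools", "data tools scale"]
    pairs = [p for doc in docs for p in map_phase(doc)]
    print(reduce_phase(pairs))  # {'big': 2, 'data': 2, 'needs': 1, ...}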
Modern Data Analytic Tools
• There are many open source tools for data analytics.

• These tools require little or no coding and often manage to deliver better results than paid alternatives.
R Programming
• R is the leading analytics tool in the industry and is widely used for statistics and data modeling.
• It can easily manipulate your data and present it in different ways.
• R compiles and runs on a wide variety of platforms, viz. UNIX, Windows and macOS.
• It has 11,556 packages and allows you to browse packages by category.
• R also provides tools to install all packages automatically as per user requirements; these can also work well with Big Data.
Tableau Public
• Tableau Public is free software that connects to any data source, be it a corporate Data Warehouse, Microsoft Excel or web-based data, in a variety of formats.

• It creates data visualizations, maps, dashboards etc. with real-time updates, presented on the web.

• Users can easily create interactive graphs, maps, and live dashboards in minutes.

• No coding is required.

• Tableau's Big Data capabilities make it important, and one can analyze and visualize data better than with any other data visualization software on the market.
Python
• Python is an object-oriented scripting language which is easy to read, write and maintain, and is a free open source tool.
• It was developed by Guido van Rossum in the late 1980s and supports both functional and structured programming methods.
• Python is easy to learn as it is very similar to JavaScript, Ruby, and PHP.
• Also, Python has very good machine learning libraries, viz. scikit-learn, Theano, TensorFlow and Keras.
• Another important feature of Python is that it can work with data from almost any source, such as a SQL server, a MongoDB database or JSON data.
SAS (Statistical Analysis System)
• SAS is a programming environment and language for data manipulation, and a leader in analytics.
• Development of SAS began in 1966, and it was further developed in the 1980s and 1990s.
• SAS is easily accessible and manageable, and can analyze data from any source.
• SAS introduced a large set of products in 2011 for customer intelligence, and numerous SAS modules for web, social media and marketing analytics that are widely used for profiling customers and prospects.
• It can also predict their behaviors, and manage and optimize communications.
KNIME (Konstanz Information Miner)
• KNIME is a free and open-source data analytics, reporting and integration platform.

• It integrates various components for machine learning and data mining through its modular data pipelining ("Lego of Analytics") concept.

• KNIME was developed in January 2004 by a team of software engineers at the University of Konstanz.

• KNIME allows you to analyze and model data through visual programming.
Apache Spark
• The University of California, Berkeley's AMP Lab developed Apache Spark in 2009.
• Apache Spark is a data processing engine that executes applications in Hadoop clusters 100 times faster in memory and 10 times faster on disk.
• Spark makes the concepts of data science effortless.
• Spark is also popular for data pipelines and the development of machine learning models.
• Spark also includes a library, MLlib, that provides a progressive set of machine learning algorithms for repetitive data science techniques like classification, regression, collaborative filtering, clustering, etc.
RapidMiner
• RapidMiner is a powerful integrated data science platform.
• It is developed to perform predictive analysis and other advanced analytics like data mining, text analytics, machine learning and visual analytics without any programming.
• RapidMiner can incorporate many data source types, including Access, Excel, Microsoft SQL, Teradata, Oracle, Sybase, IBM DB2, Ingres, MySQL, IBM SPSS, dBase, etc.
• The tool is so powerful that it can generate analytics based on real-life data transformation settings, i.e. you can control the formats and data sets for predictive analysis.
QlikView
• QlikView has many unique features, like in-memory data processing, which delivers results to end users very quickly and stores the data in the report itself.

• Data associations in QlikView are maintained automatically, and the data can be compressed to almost 10% of its original size.

• Data relationships are visualized using colors: a specific color is given to related data and another color to non-related data.
What is the Data Analysis Process?
• The Data Analysis Process consists of the following phases, which are iterative in nature:

Data Requirements Specification
Data Collection
Data Processing
Data Cleaning
Data Analysis
Communication
Data Requirements Specification
• The data required for analysis is based on a question or an experiment.
• Based on the requirements of those directing the analysis, the data necessary as inputs to the analysis is identified (e.g., the population of people).
• Specific variables regarding the population (e.g., age and income) may be specified and obtained.
• Data may be numerical or categorical.
Data Collection
• Data Collection is the process of gathering information on
targeted variables identified as data requirements.
• Data Collection ensures that data gathered is accurate such
that the related decisions are valid.
• Data Collection provides both a baseline to measure and a
target to improve.
• Data is collected from various sources ranging from
organizational databases to the information in web pages.
• The data thus obtained, may not be structured and may
contain irrelevant information.
• Hence, the collected data is required to be subjected to
Data Processing and Data Cleaning.
Data Processing
• The data that is collected must be processed or
organized for analysis.
• This includes structuring the data as required for
the relevant Analysis Tools.
• For example, the data might have to be placed
into rows and columns in a table within a
Spreadsheet or Statistical Application.
• A Data Model might have to be created.
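
A minimal sketch of this structuring step (using pandas; the records and fields are made up):

    import pandas as pd

    # Hypothetical collected records structured into rows and columns
    records = [
        {"age": 25, "income": 50000},
        {"age": 42, "income": 61000},
    ]
    df = pd.DataFrame(records)  # the tabular form expected by analysis tools
    print(df)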
Data Cleaning
• The processed and organized data may be incomplete,
contain duplicates, or contain errors.
• Data Cleaning is the process of preventing and correcting
these errors.
• There are several types of Data Cleaning that depend on
the type of data.
• For example, while cleaning the financial data, certain
totals might be compared against reliable published
numbers or defined thresholds.
• Likewise, quantitative methods can be used to detect outliers, which would subsequently be excluded from the analysis, as in the sketch below.
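
A minimal sketch of one common quantitative method, the interquartile-range (IQR) rule (using pandas; the values are made up):

    import pandas as pd

    # Hypothetical quantitative data with one suspicious value
    s = pd.Series([52, 49, 51, 48, 50, 250])

    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
    print(outliers)  # candidates to review or exclude before analysis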
Data Analysis
• Data that has been processed, organized and cleaned is ready for analysis.
• Various data analysis techniques are available to understand, interpret, and derive conclusions based on the requirements.
• Data visualization may also be used to examine the data in graphical format, to obtain additional insight into the messages within the data.
• Statistical data models such as correlation and regression analysis can be used to identify the relations among the data variables (see the sketch below).
• These models, which are descriptive of the data, are helpful in simplifying analysis and communicating results.
• The process might require additional Data Cleaning or additional Data Collection, and hence these activities are iterative in nature.
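
As a minimal sketch of a correlation check (using pandas with made-up variables):

    import pandas as pd

    # Hypothetical cleaned dataset
    df = pd.DataFrame({
        "age":    [23, 35, 47, 59, 64],
        "income": [30000, 45000, 60000, 72000, 70000],
    })

    print(df.corr())  # pairwise correlations between the variables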
Communication
• The results of the data analysis are to be reported in a format required by the users, to support their decisions and further action.
• Feedback from the users might result in additional analysis.
• Data analysts can choose data visualization techniques, such as tables and charts, which help in communicating the message clearly and efficiently to the users.
• The analysis tools provide facilities to highlight the required information with color codes and formatting in tables and charts.
