
Data Analytics

UNIT – II
INTRODUCTION TO ANALYTICS
2.1 Introduction to Analytics

As an enormous amount of data gets generated, the need to extract useful insights is a
must for a business enterprise. Data Analytics has a key role in improving your business.
Here are 4 main factors which signify the need for Data Analytics:

• Gather Hidden Insights – Hidden insights are gathered from the data and then analyzed with respect to business requirements.
• Generate Reports – Reports are generated from the data and passed on to the respective teams and individuals for further action to grow the business.
• Perform Market Analysis – Market analysis can be performed to understand the strengths and weaknesses of competitors.
• Improve Business Requirements – Analysis of data helps improve the business to meet customer requirements and improve customer experience.

Data Analytics refers to the techniques used to analyze data to enhance productivity and business gain. Data is extracted from various sources and is cleaned and categorized to analyze different behavioral patterns. The techniques and tools used vary according to the organization or individual.

Data analysts translate numbers into plain English. A Data Analyst delivers value to their
companies by taking information about specific topics and then interpreting, analyzing,
and presenting findings in comprehensive reports. So, if you have the capability to collect
data from various sources, analyze the data, gather hidden insights and generate reports, then
you can become a Data Analyst. Refer to the image below:

Fig 2.1 Data Analytics



In general, data analytics also involves a degree of human judgment, as shown below in figure 2.2: each type of analytics requires a different amount of human input for prediction. Descriptive analytics requires the most human input, predictive analytics requires less, and prescriptive analytics requires little or no human input, since the recommended actions are derived directly from the data.

Fig 2.2: Data and Human Work

Fig 2.3 Data Analytics



2.2 Introduction to Tools and Environment

In general, data analytics deals with three main parts: subject (domain) knowledge, statistics, and a person with computing skills who works on a tool to give insight into the business. The most commonly used tools are R and Python, as shown in figure 2.3.

With the increasing demand for data analytics in the market, many tools have emerged with various functionalities for this purpose. Ranging from open-source platforms to user-friendly commercial products, the top tools in the data analytics market are as follows.

• R programming – This is the leading analytics tool used for statistics and data modeling. R compiles and runs on various platforms such as UNIX, Windows, and macOS. It also provides tools to automatically install packages as per user requirements.
• Python – Python is an open-source, object-oriented programming language which is easy to read, write and maintain. It provides various machine learning and visualization libraries such as Scikit-learn, TensorFlow, Matplotlib, Pandas and Keras. It can also connect to different data platforms such as a SQL Server instance, a MongoDB database or JSON data. (A short example workflow is sketched after this list.)
• Tableau Public – This is free software that connects to any data source such as Excel, a corporate data warehouse, etc. It then creates visualizations, maps, dashboards, etc. with real-time updates on the web.
• QlikView – This tool offers in-memory data processing, with the results delivered to end-users quickly. It also offers data association and data visualization, with data compressed to almost 10% of its original size.
• SAS – A programming language and environment for data manipulation and analytics, this tool is easily accessible and can analyze data from different sources.
• Microsoft Excel – This is one of the most widely used tools for data analytics. Mostly used for clients' internal data, it summarizes data with features such as pivot tables.
• RapidMiner – A powerful, integrated platform that can connect to many data source types such as Access, Excel, Microsoft SQL, Teradata, Oracle, Sybase, etc. This tool is mostly used for predictive analytics tasks such as data mining, text analytics and machine learning.
• KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics platform which allows you to analyze and model data. With the benefit of visual programming, KNIME provides a platform for reporting and integration through its modular data pipeline concept.
• OpenRefine – Also known as Google Refine, this data cleaning software helps you clean up data for analysis. It is used for cleaning messy data, transforming data and parsing data from websites.
• Apache Spark – One of the largest large-scale data processing engines, this tool executes applications in Hadoop clusters up to 100 times faster in memory and 10 times faster on disk. It is also popular for data pipelines and machine learning model development.
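As a small illustration of how such tools are used in practice, the following sketch shows a typical Python workflow (a sketch only, assuming pandas and scikit-learn are installed and a hypothetical sales.csv file with ad_spend and revenue columns exists):

```python
# A minimal sketch of a typical Python analytics workflow. The file name and
# column names are hypothetical, for illustration only.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("sales.csv")            # extract data from a source
df = df.dropna()                         # basic cleaning: drop rows with missing values

X = df[["ad_spend"]]                     # predictor variable
y = df["revenue"]                        # target variable

model = LinearRegression().fit(X, y)     # fit a simple model
print("Estimated effect of ad spend on revenue:", model.coef_[0])
```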

Apart from the above-mentioned capabilities, a Data Analyst should also possess skills
such as Statistics, Data Cleaning, Exploratory Data Analysis, and Data Visualization. Also, if
you have knowledge of Machine Learning, then that would make you stand out from the
crowd.

2.3 Application of modelling a business & Need for business Modelling

Data analytics is used in business across various domains for the purposes discussed below in detail; the exact use varies according to business needs. Nowadays the majority of businesses work with prediction over large amounts of data.

Using big data as a fundamental factor in decision making requires new capabilities, and most firms are still far from being able to access all of their data resources. Companies in various sectors have acquired crucial insights from the structured data collected from different enterprise systems and analyzed by commercial database management systems. Eg:

1.) Facebook and Twitter are used to gauge the instantaneous impact of campaigns and to examine consumer opinion about products.
2.) Some companies, such as Amazon, eBay, and Google, regarded as early leaders, examine the factors that control performance to determine what raises sales revenue and user interactivity.

2.3.1 Utilizing Hadoop in Big Data Analytics.

Hadoop is an open-source software platform that enables the processing of large data sets in a distributed computing environment. Work in this area discusses concepts related to big data and the rules for building, organizing and analyzing huge data sets in the business environment; it proposes three architecture layers and points to graphical tools for exploring and representing unstructured data, and it describes how well-known companies can improve their business this way. Eg: Google, Twitter and Facebook focus on processing big data within cloud environments.

Fig 2.4: Working of Hadoop – With Map Reduce Concept



The Map() step: Each worker node applies the Map() function to the local data and writes the output to a temporary storage space. The Map() code is run exactly once for each K1 key value, generating output that is organized by key values K2. A master node arranges it so that for redundant copies of input data only one is processed.

The Shuffle() step: The map output is sent to the reduce processors, which assign the K2 key value that each processor should work on, and provide that processor with all of the map-generated data associated with that key value, such that all data belonging to one key are located on the same worker node.

The Reduce() step: Worker nodes process each group of output data (per key) in parallel, executing the user-provided Reduce() code; each function is run exactly once for each K2 key value produced by the map step.

Produce the final output: The MapReduce system collects all of the reduce outputs and sorts them by K2 to produce the final outcome.

Fig. 2.4 shows the classical “word count problem” using the MapReduce paradigm. As shown in Fig. 2.4, initially a process splits the data into a set of chunks that will later be processed by the mappers. Once the key/value pairs are generated by the mappers, a shuffling process is used to mix (combine) these key values, placing the same keys on the same worker node. Finally, the reduce functions count the words and generate a common output as the result of the algorithm. As a result of executing the mappers and reducers, the output is a sorted list of word counts from the original text input.
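A minimal in-memory sketch of the same word count idea in Python (this is only an illustration of the map, shuffle and reduce phases, not Hadoop itself):

```python
# A minimal in-memory illustration of the MapReduce word count idea.
from collections import defaultdict

def map_phase(chunk):
    # Emit a (word, 1) pair for every word in the input chunk (key K2 = word).
    return [(word, 1) for word in chunk.split()]

def shuffle_phase(pairs):
    # Group all values belonging to the same key, as if on the same worker node.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word.
    return {key: sum(values) for key, values in groups.items()}

chunks = ["the quick brown fox", "the lazy dog", "the quick dog"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(shuffle_phase(mapped)))
# {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}
```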

2.3.2 The Employment of Big Data Analytics on IBM.

IBM and Microsoft are prominent representatives. IBM offers many big data options that enable users to store, manage, and analyze data through various resources; it has a strong presence in business intelligence as well as in healthcare. Compared with IBM, Microsoft has shown powerful work in the area of cloud computing activities and techniques. Another example is Facebook and Twitter, who collect various data from users' profiles and use it to increase their revenue.

2.3.3 The Performance of Data Driven Companies.

Big data analytics and business intelligence are closely related fields which have become widely significant in both business and academia. Companies are constantly trying to extract insight from the ever-expanding three V's (variety, volume and velocity) to support decision making.

2.4 Databases
Database is an organized collection of structured information, or data, typically stored electronically in a computer system. A database is usually controlled by a database management system (DBMS).



Databases can be divided into various categories such as text databases, desktop database programs, relational database management systems (RDBMS), NoSQL databases and object-oriented databases.

A text database is a system that maintains a (usually large) text collection and provides fast and accurate access to it. Eg: text books, magazines, journals, manuals, etc.

A desktop database is a database system that is made to run on a single computer or PC. These simpler solutions for data storage are much more limited and constrained than larger data center or data warehouse systems, where primitive database software is replaced by sophisticated hardware and networking setups. Eg: Microsoft Excel, Microsoft Access, etc.

A relational database (RDB) is a collective set of multiple data sets organized by tables, records and columns. RDBs establish a well-defined relationship between database tables. Tables communicate and share information, which facilitates data searchability, organization and reporting. Eg: SQL Server, Oracle, DB2, DBaaS, etc.

NoSQL databases are non-tabular and store data differently than relational tables. NoSQL databases come in a variety of types based on their data model. The main types are document, key-value, wide-column, and graph. Eg: MongoDB and CouchDB (document stores holding JSON-like documents), etc.

Object-oriented databases (OODB) are databases that represent data in the form of objects and classes. In object-oriented terminology, an object is a real-world entity and a class is a collection of objects. Object-oriented databases follow the fundamental principles of object-oriented programming (OOP) and are typically used with languages such as C++, Java, C#, Smalltalk, LISP, etc.

2.5 Types of Data and variables

In any database we work with data to perform analysis and prediction. In a relational database management system we normally use rows to represent records and columns to represent attributes.

In big data terms we treat a column from an RDBMS as an attribute or a variable. A variable can be of two types: categorical (qualitative) data, and discrete or continuous (quantitative) data, as shown below in figure 2.5.

Qualitative or categorical data is normally represented as a variable that holds character values, and it is divided into two types: nominal data and ordinal data.

In nominal data there is no natural ordering of the values of the attribute. Eg: colour, gender, nouns (name, place, animal, thing). These categories cannot be given a predefined order; for example, there is no specific way to arrange the gender of 50 students in a class. The first student can be male or female, and similarly for all 50 students, so ordering is not meaningful.

In ordinal data there is a natural ordering of the values of the attribute. Eg: size (S, M, L, XL, XXL), rating (excellent, good, average, poor). In this case, ordering the values lets us quantify and compare them, which gives valuable insight into the data.

Fig 2.5: Types of Data Variables

Quantitative (discrete or continuous) data can be further divided into two types: discrete attributes and continuous attributes.

A discrete attribute takes only a finite number of numerical values (integers). Eg: number of buttons, number of days for product delivery, etc. Such data can be recorded at specific intervals in time series data mining, or as ratio-based entries.

A continuous attribute takes fractional (real-valued) values. Eg: price, discount, height, weight, length, temperature, speed, etc. Such data can be recorded at specific intervals in time series data mining, or as ratio-based entries.
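A small sketch of how these variable types can be represented in Python with pandas (the column names and values below are made up for illustration):

```python
# A minimal sketch of categorical vs. quantitative variables using pandas.
import pandas as pd

df = pd.DataFrame({
    "gender":  ["M", "F", "F", "M"],          # nominal: no natural order
    "size":    ["S", "XL", "M", "L"],         # ordinal: has a natural order
    "buttons": [2, 4, 3, 2],                  # discrete: whole numbers
    "price":   [10.5, 22.0, 13.75, 9.99],     # continuous: fractional values
})

df["gender"] = df["gender"].astype("category")
df["size"] = pd.Categorical(df["size"],
                            categories=["S", "M", "L", "XL", "XXL"],
                            ordered=True)

print(df.dtypes)
print(df.sort_values("size"))   # sorting is meaningful only for ordinal data
```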

2.6 Data Modelling Techniques

Data modelling is the process through which data is structured and stored in a database. Data modelling is important because it enables organizations to make data-driven decisions and meet varied business goals.

The entire process of data modelling is not as easy as it seems, though. You are required to have a deep understanding of the structure of an organization and then propose a solution that aligns with its end goals and helps it achieve the desired objectives.

Types of Data Models

Data modeling can be achieved in various ways. However, the basic concept of each
of them remains the same. Let’s have a look at the commonly used data modeling methods:

Hierarchical model

As the name indicates, this data model makes use of hierarchy to structure the data in
a tree-like format as shown in figure 2.6. However, retrieving and accessing data is difficult
in a hierarchical database. This is why it is rarely used now.

Fig 2.6: Hierarchical Model Structure

Relational model

Proposed as an alternative to the hierarchical model by an IBM researcher (E. F. Codd), this model represents data in the form of tables. It reduces complexity and provides a clear overview of the data, as shown below in figure 2.7.

Fig 2.7: Relational Model Structure


Network model

The network model is inspired by the hierarchical model. However, unlike the
hierarchical model, this model makes it easier to convey complex relationships as each record
can be linked with multiple parent records as shown in figure 2.8. In this model data can be
shared easily and the computation becomes easier.

Fig 2.8: Network Model Structure


Object-oriented model

This database model consists of a collection of objects, each with its own features and methods. This type of database model is also called the post-relational database model, as shown in figure 2.9.

Fig 2.9: Object-Oriented Model Structure

Entity-relationship model

Entity-relationship model, also known as ER model, represents entities and their relationships in a graphical format. An entity could be anything – a concept, a piece of data, or an object.

Fig 2.10: Entity Relationship Diagram

The entity relationship diagram shows the relations between entities through their primary and foreign keys, as shown in figure 2.10. Along with this, it also shows the cardinality of the relationships between tables (how many instances of one table relate to another).
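A minimal sketch of such a primary key / foreign key relationship, using Python's built-in sqlite3 module (the table and column names are hypothetical):

```python
# A minimal sketch of a primary key / foreign key (one-to-many) relationship.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),  -- foreign key
        amount      REAL
    );
""")
conn.execute("INSERT INTO customer VALUES (1, 'Asha')")
conn.execute("INSERT INTO orders VALUES (101, 1, 450.0)")
conn.execute("INSERT INTO orders VALUES (102, 1, 125.0)")

# One customer can have many orders: a one-to-many relationship.
for row in conn.execute("""SELECT c.name, o.order_id, o.amount
                           FROM customer c JOIN orders o
                           ON c.customer_id = o.customer_id"""):
    print(row)
```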

Now that we have a basic understanding of data modeling, let’s see why it is important.

Importance of Data Modeling


• A clear representation of data makes it easier to analyze the data properly. It provides a quick overview of the data which can then be used by the developers in varied applications.
• Data modeling represents the data properly in a model. It rules out any chances of data redundancy and omission. This helps in clear analysis and processing.
• Data modeling improves data quality and enables the concerned stakeholders to make data-driven decisions.

Since a lot of business processes depend on successful data modeling, it is necessary to adopt the right data modeling techniques for the best results.

Best Data Modeling Practices to Drive Your Key Business Decisions

Have a clear understanding of your end-goals and results



You will agree with us that the main goal behind data modeling is to equip your business and
contribute to its functioning. As a data modeler, you can achieve this objective only when
you know the needs of your enterprise correctly.
It is essential to make yourself familiar with the varied needs of your business so that you can
prioritize and discard the data depending on the situation.

Key takeaway: Have a clear understanding of your organization’s requirements and organize
your data properly.

Keep it sweet and simple and scale as you grow

Things will be sweet initially, but they can become complex in no time. This is why it is
highly recommended to keep your data models small and simple, to begin with.

Once you are sure of your initial models in terms of accuracy, you can gradually introduce
more datasets. This helps you in two ways. First, you are able to spot any inconsistencies in
the initial stages. Second, you can eliminate them on the go.

Key takeaway: Keep your data models simple. The best data modeling practice here is to use
a tool which can start small and scale up as needed.
Organize your data based on facts, dimensions, filters, and order

You can find answers to most business questions by organizing your data in terms of four
elements – facts, dimensions, filters, and order.

Let’s understand this better with the help of an example. Let’s assume that you run four e-
commerce stores in four different locations of the world. It is the year-end, and you want to
analyze which e-commerce store made the most sales.

In such a scenario, you can organize your data over the last year. Facts will be the overall
sales data of last 1 year, the dimensions will be store location, the filter will be last 12
months, and the order will be the top stores in decreasing order.

This way, you can organize all your data properly and position yourself to answer an array
of business intelligence questions without breaking a sweat.

Key takeaway: It is highly recommended to organize your data properly using individual
tables for facts and dimensions to enable quick analysis.
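As a rough illustration, the same facts / dimensions / filter / order idea can be expressed with pandas (the store names and sales figures below are made up):

```python
# A rough sketch of organizing data by facts, dimensions, filters and order with pandas.
import pandas as pd

sales = pd.DataFrame({
    "store":   ["Delhi", "London", "Tokyo", "New York", "Delhi", "Tokyo"],
    "month":   ["2023-01", "2023-03", "2023-06", "2023-09", "2023-11", "2023-12"],
    "revenue": [120, 95, 180, 150, 130, 170],        # the facts
})

last_12_months = sales["month"] >= "2023-01"          # the filter
by_store = (sales[last_12_months]
            .groupby("store")["revenue"]              # the dimension
            .sum()
            .sort_values(ascending=False))            # the order
print(by_store)   # top stores by total revenue, in decreasing order
```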

Keep as much as is needed

While you might be tempted to keep all the data with you, do not ever fall for this trap! Although storage is not a problem in this digital age, you might end up taking a toll on your machines' performance.

More often than not, just a small yet useful amount of data is enough to answer all the business-related questions. Spending heavily on hosting enormous amounts of data only leads to performance issues, sooner or later.

Key takeaway: Have a clear opinion on how much data you want to keep. Maintaining more than what is actually required wastes your data modeling effort and leads to performance issues.

Keep crosschecking before continuing

Data modeling is a big project, especially when you are dealing with huge amounts of data.
Thus, you need to be cautious enough. Keep checking your data model before continuing to
the next step.

For example, if you need to choose a primary key to identify each record in the dataset properly, make sure that you are picking the right attribute. Product ID could be one such attribute: even if two records otherwise match, their product IDs can help you distinguish them. Keep checking that you are on the right track. Are the product IDs the same too? In those cases, you will need to look for another dataset to establish the relationship.
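A quick way to crosscheck a candidate key before relying on it, sketched here with pandas (the file name and column name are hypothetical):

```python
# A small sketch of crosschecking a candidate primary key with pandas.
import pandas as pd

df = pd.read_csv("products.csv")   # hypothetical dataset

# A good primary key has no missing values and no duplicates.
print("missing keys:  ", df["product_id"].isna().sum())
print("duplicate keys:", df["product_id"].duplicated().sum())
print("usable as primary key:", df["product_id"].is_unique)
```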
Key takeaway: It is the best practice to maintain one-to-one or one-to-many relationships.
The many-to-many relationship only introduces complexity in the system.

Let them evolve


Data models are never written in stone. As your business evolves, it is essential to customize your data modelling accordingly, so keep updating your models over time. The best practice here is to store your data models in an easy-to-manage repository so that you can make adjustments on the go.

Key takeaway: Data models become outdated quicker than you expect. It is necessary that
you keep them updated from time to time.

The Wrap Up

Data modeling plays a crucial role in the growth of businesses, especially when organizations base their decisions on facts and figures. To achieve the varied business intelligence insights and goals, it is recommended to model your data correctly and use appropriate tools to keep the system simple.

2.7 Missing Imputations

In statistics, imputation is the process of replacing missing data with substituted values. Because missing data can create problems for analyzing data, imputation is seen as a way to avoid the pitfalls involved with list-wise deletion of cases that have missing values.

I. Do nothing about the missing data.

II. Fill the missing values in the dataset using the mean or median.

Eg: for the sample dataset given below

SNo   Column 1   Column 2   Column 3
1     3          6          NAN
2     5          10         12
3     6          11         15
4     NAN        12         14
5     6          NAN        NAN
6     10         13         16

The missing values can be replaced using the column means (computed over the observed, non-missing values: 6 for Column 1, 10.4 for Column 2 and 14.25 for Column 3) as follows:

SNo   Column 1   Column 2   Column 3
1     3          6          14.25
2     5          10         12
3     6          11         15
4     6          12         14
5     6          10.4       14.25
6     10         13         16
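A small sketch of this mean imputation in Python with pandas (assuming pandas is installed):

```python
# A minimal sketch of mean imputation with pandas, using the sample dataset above.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Column 1": [3, 5, 6, np.nan, 6, 10],
    "Column 2": [6, 10, 11, 12, np.nan, 13],
    "Column 3": [np.nan, 12, 15, 14, np.nan, 16],
})

# fillna with each column's mean; pandas computes the mean over observed values only.
imputed = df.fillna(df.mean())
print(imputed)
```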

Advantages:
• Works well with numerical datasets.
• Very fast and reliable.

Disadvantages:
• Does not work with categorical attributes.
• Does not account for correlations between columns.
• Not very accurate.
• Does not account for any uncertainty in the data.

III. Imputation using the most frequent value, or a zero / constant value

This can be used for categorical attributes.

Disadvantages:
• Does not account for correlations between columns.
• Can introduce bias into the data.
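A short sketch of most-frequent and constant-value imputation with pandas (the column names are hypothetical):

```python
# A short sketch of most-frequent (mode) and constant-value imputation with pandas.
import pandas as pd
import numpy as np

df = pd.DataFrame({"colour": ["red", "blue", np.nan, "red"],
                   "stock":  [4, np.nan, 7, 2]})

df["colour"] = df["colour"].fillna(df["colour"].mode()[0])   # most frequent value
df["stock"] = df["stock"].fillna(0)                          # zero / constant value
print(df)
```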

IV. Imputation using KNN

KNN imputation first creates a basic mean impute and then uses the resulting complete data to construct a KDTree. It then uses the KDTree to find the nearest neighbours (NN). After it finds the k nearest neighbours, it takes their weighted average.

The k nearest neighbours algorithm is used for simple classification. It uses 'feature similarity' to predict the values of new data points: a new point is assigned a value based on how closely it resembles the points in the training set. This is very useful for predicting missing values: we find the k closest neighbours of the observation with missing data and impute the missing entries based on the non-missing values in the neighbourhood.
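A minimal sketch using scikit-learn's KNNImputer on the sample dataset from above (assuming scikit-learn is installed):

```python
# A minimal sketch of KNN imputation using scikit-learn's KNNImputer.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[3, 6, np.nan],
              [5, 10, 12],
              [6, 11, 15],
              [np.nan, 12, 14],
              [6, np.nan, np.nan],
              [10, 13, 16]])

imputer = KNNImputer(n_neighbors=2, weights="distance")
print(imputer.fit_transform(X))   # missing entries filled from the 2 most similar rows
```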
Advantage:
• This method is more accurate than mean, median or mode imputation.

Disadvantage:
• Sensitive to outliers

UNIT – III
BLUE PROPERTY ASSUMPTIONS

• The Gauss-Markov theorem tells us that if a certain set of assumptions is met, the ordinary least squares (OLS) estimate for regression coefficients gives you the Best Linear Unbiased Estimate (BLUE) possible.

• There are five Gauss-Markov assumptions (also called conditions):

• Linearity:
  o The parameters we are estimating using the OLS method must themselves be linear.
• Random:
  o Our data must have been randomly sampled from the population.
• Non-Collinearity:
  o The regressors being calculated aren't perfectly correlated with each other.
• Exogeneity:
  o The regressors aren't correlated with the error term.
• Homoscedasticity:
  o No matter what the values of our regressors might be, the variance of the error is constant.

Purpose of the Assumptions

• The Gauss-Markov assumptions guarantee the validity of ordinary least squares for estimating regression coefficients.

• Checking how well our data matches these assumptions is an important part of estimating regression coefficients.
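As a small illustration (a sketch only, with made-up data), ordinary least squares coefficients can be estimated with NumPy:

```python
# A small sketch of ordinary least squares estimation with NumPy, on made-up data
# that satisfies the Gauss-Markov assumptions (linear, random sample, constant-variance noise).
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 + 3.0 * x + rng.normal(0, 1, n)       # true intercept 2, slope 3

X = np.column_stack([np.ones(n), x])           # design matrix with an intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS estimate
print("estimated intercept and slope:", beta)  # close to [2, 3] when the assumptions hold
```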
