Notes of Unit-I: Data Analytics (KCA-034)
Introduction to Data Analytics: Data has been the buzzword for ages now.
Whether the data is generated by a large-scale enterprise or by an individual, every
aspect of it needs to be analyzed in order to benefit from it. But how do we do that?
Well, that is where the term ‘Data Analytics’ comes in. Data analytics is used to:
● Gather Hidden Insights – Hidden insights from data are gathered and then analyzed
with respect to business requirements.
● Generate Reports – Reports are generated from the data and are passed on to the
respective teams and individuals for further action to grow the business.
● Perform Market Analysis – Market Analysis can be performed to understand the
strengths and weaknesses of competitors.
● Improve Business Requirements – Analysis of data makes it possible to improve the
business according to customer requirements and experience.
So, in short, if you understand business administration and can perform exploratory
data analysis to gather the required information, then you are ready for a career in
data analytics.
Tools used in Data Analytics:
● R programming – This tool is the leading analytics tool used for statistics and data
modeling. R compiles and runs on various platforms such as UNIX, Windows, and Mac
OS. It also provides tools to automatically install packages as per user requirements.
● Python – Python is an open-source, object-oriented programming language that is easy
to read, write, and maintain. It provides various machine learning and visualization
libraries such as Scikit-learn, TensorFlow, Matplotlib, Pandas, Keras, etc. It can also
connect to almost any data platform, such as a SQL Server database, a MongoDB
database, or JSON files.
● Tableau Public – This is a free software that connects to any data source such as
Excel, corporate Data Warehouse, etc. It then creates visualizations, maps, dashboards
etc with real-time updates on the web.
● QlikView – This tool offers in-memory data processing with the results delivered to the
end-users quickly. It also offers data association and data visualization with data being
compressed to almost 10% of its original size.
● SAS – A programming language and environment for data manipulation and analytics,
this tool is easily accessible and can analyze data from different sources.
● Microsoft Excel – This tool is one of the most widely used tools for data analytics.
Mostly used for clients’ internal data, it offers analysis features such as pivot tables
that summarize the data with a preview.
● RapidMiner – A powerful, integrated platform that can integrate with many data source
types such as Access, Excel, Microsoft SQL, Teradata, Oracle, Sybase, etc. This tool is
mostly used for predictive analytics, such as data mining, text analytics, and machine
learning.
● KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics
platform, which allows you to analyze and model data. With the benefit of visual
programming, KNIME provides a platform for reporting and integration through its
modular data pipeline concept.
● OpenRefine – Also known as GoogleRefine, this data cleaning software will help you
clean up data for analysis. It is used for cleaning messy data, the transformation of data
and parsing data from websites.
● Apache Spark – A widely used large-scale data processing engine, this tool
executes applications in Hadoop clusters up to 100 times faster in memory and 10 times
faster on disk. This tool is also popular for data pipelines and machine learning model
development.[ https://www.edureka.co/blog/what-is-data-analytics/ ]
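As a minimal illustration of the kind of summary these tools automate, here is a sketch in plain Python using only the standard library; the product names and sales figures are entirely hypothetical:

```python
import statistics

# Hypothetical monthly sales figures for two products
sales = {
    "Product A": [120, 135, 128, 150, 162, 171],
    "Product B": [200, 195, 188, 176, 170, 158],
}

# Summarize each series: central tendency, spread, and trend direction
for name, series in sales.items():
    trend = "rising" if series[-1] > series[0] else "falling"
    print(f"{name}: mean={statistics.mean(series):.1f}, "
          f"stdev={statistics.stdev(series):.1f}, trend={trend}")
```

Tools such as Excel pivot tables, R, or Python's Pandas perform this kind of aggregation at much larger scale and with far richer visualization.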
Data collection is the process of acquiring, extracting, and storing voluminous
amounts of data, which may be in structured or unstructured form, such as text,
video, audio, XML files, records, or image files, for use in later stages of data
analysis.
In the process of big data analysis, “Data collection” is the initial step before
starting to analyze the patterns or useful information in data. The data which is to
be analyzed must be collected from different valid sources.
The collected data is known as raw data, which is not useful on its own; cleaning
out the impurities and using the data for further analysis produces information,
and the insight obtained from that information is known as “knowledge”. Knowledge
can take many forms, such as business knowledge about enterprise product sales,
disease treatment, etc. The main goal of data collection is to collect
information-rich data.
The actual data is then further divided mainly into two types known as:
1. Primary data
2. Secondary data
1. Primary data:
The data which is raw, original, and extracted directly from official sources is
known as primary data. This type of data is collected directly using techniques
such as questionnaires, interviews, and surveys. The data collected must match
the demands and requirements of the target audience on which the analysis is
performed; otherwise it becomes a burden in data processing.
Few methods of collecting primary data:
1. Interview method:
In this method, data is collected by interviewing the target audience: the person
asking the questions is called the interviewer, and the person answering is known
as the interviewee. Basic business- or product-related questions are asked and
recorded as notes, audio, or video, and this data is stored for processing.
Interviews can be either structured or unstructured, e.g. personal interviews or
formal interviews conducted by telephone, face to face, email, etc.
2. Survey method:
The survey method is a research process in which a list of relevant questions
is asked and the answers are noted down in the form of text, audio, or video.
Surveys can be conducted in both online and offline modes, such as through
website forms and email, and the answers are then stored for data analysis.
Examples are online surveys or surveys through social media polls.
3. Observation method:
The observation method is a method of data collection in which the researcher
keenly observes the behavior and practices of the target audience using some
data collection tool and stores the observed data in the form of text, audio, video,
or other raw formats. In this method, the data is collected directly by observing
participants rather than posing questions to them. For example, observing a group
of customers and their behavior towards the products. The data obtained is then
sent for processing.
4. Experimental method:
The experimental method is the process of collecting data through performing
experiments, research, and investigation. The most frequently used experiment
methods are CRD, RBD, LSD, FD.
● CRD – Completely Randomized Design is a simple experimental design used
in data analytics which is based on randomization and replication. It is mostly
used for comparing treatments in an experiment.
● RBD – Randomized Block Design is an experimental design in which the
experiment is divided into small units called blocks. Random experiments are
performed on each of the blocks and results are drawn using a technique
known as analysis of variance (ANOVA). RBD originated in the agriculture
sector.
● LSD – Latin Square Design is an experimental design similar to CRD and
RBD, but it arranges treatments in rows and columns. It is an N×N square
arrangement with an equal number of rows and columns, in which each
symbol occurs exactly once in every row and every column. Hence differences
can be found easily, with fewer errors in the experiment. A Sudoku puzzle is
an example of a Latin square design.
● FD – Factorial Design is an experimental design that studies two or more
factors, each with several possible values (levels); trials are performed for
every combination of factor levels.
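The two simplest of these designs can be sketched in Python. The helper names `crd` and `latin_square` are hypothetical, and the cyclic construction shown is only one of many valid Latin squares:

```python
import random

def crd(treatments, replications, seed=42):
    """Completely Randomized Design: replicate each treatment,
    then assign the full set of runs in a random order."""
    runs = [t for t in treatments for _ in range(replications)]
    rng = random.Random(seed)  # fixed seed so the design is reproducible
    rng.shuffle(runs)
    return runs

def latin_square(n):
    """A basic n x n Latin square: each symbol appears exactly once
    in every row and every column (cyclic construction)."""
    return [[(r + c) % n for c in range(n)] for r in range(n)]

print(crd(["A", "B", "C"], replications=2))
for row in latin_square(3):
    print(row)
```

In practice the responses measured under such a design would then be compared with ANOVA, as the RBD bullet above describes.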
2. Secondary data:
Secondary data is data that has already been collected and is reused for some
other valid purpose. This type of data is derived from previously recorded primary
data, and it has two types of sources: internal and external.
Internal source:
These types of data can easily be found within the organization such as market
record, a sales record, transactions, customer data, accounting resources, etc.
The cost and time consumption is less in obtaining internal sources.
External source:
The data which cannot be found within the organization and is obtained through
external third-party resources is external source data. The cost and time
consumption are higher because external sources contain a huge amount of data.
Examples of external sources are government publications, news publications, the
Registrar General of India, the Planning Commission, the International Labour
Bureau, syndicate services, and other non-governmental publications.
Other sources:
● Sensor data: With the advancement of IoT devices, the sensors of these
devices collect data which can be used for sensor data analytics to track the
performance and usage of products.
● Satellite data: Satellites collect a large number of images and terabytes of
data on a daily basis through their surveillance cameras, which can be used
to extract useful information.
● Web traffic: Thanks to fast and cheap internet access, many formats of data
uploaded by users on different platforms can be collected, with their
permission, for data analysis. Search engines also provide data on the
keywords and queries searched most often.
Classification of data
We can classify data as structured data, semi-structured data, or unstructured
data. Structured data resides in predefined formats and models, unstructured
data is stored in its natural format until it is extracted for analysis, and
semi-structured data is essentially a mix of the two.
1) Structured Data
● Structured data is generally tabular data that is represented by columns and rows in a
database.
● Databases that hold tables in this form are called relational databases.
● The mathematical term “relation” refers to a formed set of data held as a table.
● In structured data, every row in a table has the same set of columns.
● SQL (Structured Query Language) is the programming language used for structured data.
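A minimal sketch of structured data using Python's built-in sqlite3 module; the `customers` table and its rows are hypothetical, but they illustrate that every row shares the same fixed columns and can be queried with SQL:

```python
import sqlite3

# In-memory relational database: every row has the same fixed columns
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Asha", "Delhi"), ("Ravi", "Mumbai"), ("Meena", "Delhi")])

# A SQL query over the structured table
rows = conn.execute(
    "SELECT city, COUNT(*) FROM customers GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('Delhi', 2), ('Mumbai', 1)]
conn.close()
```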
2) Semi-structured Data
● Semi-structured data does not reside in a relational database, but it has some
organizational properties, such as tags or markers, that make it easier to analyze.
● XML and JSON documents are common examples of semi-structured data.
3) Unstructured Data
● Unstructured data is information that is not organized in a pre-defined manner or does
not have a pre-defined data model.
● Unstructured information is typically text-heavy but may contain data such as numbers,
dates, and facts as well.
● Videos, audio, and binary data files might not have a specific structure. They are
referred to as unstructured data.
Relational Data
● Relational databases provide undoubtedly the most well-understood model for holding data.
● The simple structure of columns and tables makes them very easy to use initially, but the
inflexible structure can cause some problems.
● We can communicate with relational databases using Structured Query Language (SQL).
● SQL allows the joining of tables with a few lines of code, using a structure that most
beginner employees can learn very quickly.
Non-Relational Data
● Non-relational databases permit us to store data in a format that more closely meets the
original structure.
● A non-relational database is a database that does not use the tabular schema of columns
and rows found in most traditional database systems.
● It uses a storage model that is enhanced for the specific requirements of the type of data
being stored.
● In a non-relational database the data may be stored as JSON documents, as
simple key/value pairs, or as a graph consisting of edges and vertices.
● Examples of non-relational data stores:
● Redis
● JanusGraph
● MongoDB
● RabbitMQ (a message broker, often grouped with non-relational data services)
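The storage models mentioned above, key/value pairs and JSON documents, can be sketched with plain Python structures; the keys and field names here are hypothetical:

```python
import json

# Key/value pair (as in Redis): an opaque value looked up by its key
kv_store = {"session:42": "user=alice;ttl=3600"}

# Document (as in MongoDB): a self-describing JSON object; documents in
# the same collection do not have to share an identical schema
doc = {"_id": 1, "name": "alice", "orders": [{"sku": "X1", "qty": 2}]}
print(json.dumps(doc))
```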
● A document data store manages a set of named string fields and object data values in an
entity referred to as a document.
● These data stores generally store data in the form of JSON documents.
● A columnar or column-family data store organizes data into rows and columns. The columns
are divided into groups known as column families.
● Each column family consists of a set of columns that are logically related and are generally
retrieved or manipulated as a unit.
● Within a column family, rows can be sparse and new columns can be added dynamically.
● A graph data store manages two types of information: nodes and edges.
● Nodes represent entities, and edges specify the relationships between those entities.
● The aim of a graph data store is to allow an application to efficiently perform queries that
traverse the network of nodes and edges, and to analyze the relationships between entities.
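A toy sketch of such a traversal in Python, assuming a hypothetical follower graph stored as an adjacency list:

```python
from collections import deque

# Tiny graph: nodes are entities, edges are (from, to) relationships
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("alice", "eve")]
adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, []).append(b)

def reachable(start):
    """Breadth-first traversal over the edge network from one node."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(reachable("alice")))
```

A dedicated graph database such as JanusGraph executes this kind of query natively, without the application having to materialize the whole graph.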
● Time series data is a set of values organized by time, and a time series data store is
optimized for this type of data.
● Time series data stores must support a very large number of writes, as they generally collect
large amounts of data in real-time from a huge number of sources.
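A minimal sketch of a typical time-series operation, downsampling readings to per-minute averages, using only the Python standard library; the timestamps and sensor values are hypothetical:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sensor readings: (timestamp, value) pairs arriving over time
readings = [
    ("2024-01-01 10:00:05", 21.0),
    ("2024-01-01 10:00:40", 21.4),
    ("2024-01-01 10:01:10", 22.1),
    ("2024-01-01 10:01:50", 21.9),
]

# Downsample to one average value per minute, a common time-series query
buckets = defaultdict(list)
for ts, value in readings:
    minute = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(second=0)
    buckets[minute].append(value)

for minute, values in sorted(buckets.items()):
    print(minute.strftime("%H:%M"), round(sum(values) / len(values), 2))
```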
● Object data stores are well suited to storing and retrieving large binary objects or blobs such
as audio and video streams, images, text files, large application documents and data objects,
and virtual machine disk images.
● An object consists of some metadata, stored data, and a unique ID for access to the object.
● External index data stores provide the ability to search for information held in other data
services and stores.
● An external index acts as a secondary index for any data store. It can provide real-time
access to indexes and can be used to index massive volumes of data.
Characteristics Of Data
● Data should be precise, which means it should contain accurate information.
● Data should be relevant and according to the requirements of the user.
● Data should be consistent and reliable.
● Relevance of data is necessary in order for it to be of good quality and useful.
The growth of data has always been at a pace that strains the most
scalable options available at any point in time. The traditional ways
of performing advanced analytics are already reaching their limits
before big data. Now, traditional approaches just won't do. This
chapter discusses the convergence of the analytic and data
environments, massively parallel processing (MPP) architectures,
the cloud, grid computing, and MapReduce. Each of these
paradigms enables greater scalability and plays a role in the analysis
of big data. The most important lesson one can take away from the
chapter is that analytic and data management environments are
converging. In-database processing is replacing much of the
traditional offline analytic processing used to support advanced
analytics.
Reporting ≠ Analytics
Google Analytics is one of the most commonly used applications in business
today. As useful as it is, it’s not actually an analytical tool - it’s a reporting
tool. Here’s the difference between reporting and analytics.
● Reporting is about taking existing information and presenting it in a
way that is user friendly and digestible. This often involves pulling
data from different places, like in Google Analytics, or presenting the
data in a new way. Reporting is always defined and specified - it’s
about getting reconciliation and making it accurate, because the
business depends on the accuracy of those numbers to then make a
decision.
● Analytics is about adding value or creating new data to help inform a
decision, whether through an automated process or a manual
analysis. Unlike reporting, analytics is about uncertainty - you use it
when you don’t know exactly how to come to a good answer. This
could be because it’s a complicated problem, or because it’s a
challenge that isn’t well-defined, or because it’s a situation that
changes frequently so the answer you got yesterday is unlikely to
help you today.
1. Transportation
Data analytics can be applied to improve transportation systems and the
intelligence around them. The predictive method of analysis helps find
transport problems such as traffic or network congestion. It helps synchronize
the vast amount of data and use it to build and design plans and strategies for
alternative routes, reducing congestion and traffic, which in turn reduces
the number of accidents and mishaps. Data analytics can also help
optimize the buyer’s travel experience by recording information from social
media. It also helps travel companies fix their packages and boost
personalized travel experiences based on the data collected.
For example, during the wedding season or the holiday season, transport
providers prepare to accommodate the heavy number of passengers traveling
from one place to another using prediction tools and techniques.
3. Web Search
The searched data is treated as a keyword, and all the related pieces of
information are presented in a sorted manner that one can easily understand.
For example, when you search for a product on Amazon, it keeps appearing on
your social media profiles, or you are shown details of the product to
convince you to buy it.
4. Manufacturing
Data analytics helps manufacturing industries maintain their overall
operations through tools such as predictive analysis, regression analysis,
budgeting, etc. A unit can figure out the number of products that need to be
manufactured based on data collected and analyzed from demand samples,
and likewise in many other operations, increasing operating capacity as well
as profitability.
5. Security
Data analytics provides security to organizations. Security analytics is an
approach to cybersecurity focused on the analysis of data to produce
proactive security measures. No business can foresee the future, particularly
where security threats are concerned, but by deploying security analytics
tools that can analyze security events, it is possible to detect a threat before
it gets a chance to affect your systems and bottom line.
6. Education
Data analytics applications in education are among the most needed in the
current scenario. They are mostly used for adaptive learning, new innovations,
adaptive content, etc. Learning analytics is the measurement, collection,
analysis, and reporting of data about learners and their contexts, for the
purposes of understanding and optimizing learning and the environments in
which it occurs.
7. Healthcare
Applications of data analytics in healthcare can be used to sift through
enormous amounts of data in seconds to discover treatment options or
solutions for various illnesses. This not only provides accurate solutions
based on historical data but may also provide precise answers for the unique
concerns of specific patients.
8. Military
Military applications of data analytics bring together an assortment of
technical and application-oriented use cases. They enable decision-makers and
technologists to make connections between data analysis and fields such as
augmented reality and cognitive science that are driving military
organizations around the globe forward.