
Notes of Unit – I (Data Analytics KCA-034)
Introduction to Data Analytics: Data has been a buzzword for ages now.
Whether it is generated by large-scale enterprises or by an individual, every aspect of data needs to be analyzed so that value can be extracted from it. But how do we do that? That is where the term ‘Data Analytics’ comes in.

Why is Data Analytics important?


Data Analytics plays a key role in improving a business, as it is used to gather hidden
insights, generate reports, perform market analysis, and improve business
requirements.
Role of Data Analytics

● Gather Hidden Insights – Hidden insights are gathered from the data and then analyzed with respect to business requirements.
● Generate Reports – Reports are generated from the data and passed on to the respective teams and individuals so they can take further action to grow the business.
● Perform Market Analysis – Market analysis can be performed to understand the strengths and weaknesses of competitors.
● Improve Business Requirements – Analysis of data helps improve business-to-customer requirements and experience.

Data Analytics for Beginners


Data Analytics refers to the techniques used to analyze data to enhance productivity
and business gain. Data is extracted from various sources and is cleaned and
categorized to analyze various behavioral patterns. The techniques and the tools used
vary according to the organization or individual.

So, in short, if you understand business administration and have the ability to perform
Exploratory Data Analysis to gather the required information, then you are well placed
for a career in Data Analytics.

What are the tools used in Data Analytics?


With the increasing demand for Data Analytics in the market, many tools with various
functionalities have emerged for this purpose. Ranging from open-source platforms to
user-friendly commercial products, the top tools in the data analytics market are as follows.

● R programming – R is the leading analytics tool for statistics and data modeling. It compiles and runs on various platforms such as UNIX, Windows, and macOS, and provides tools to automatically install packages as per user requirements.
● Python – Python is an open-source, object-oriented programming language that is easy to read, write, and maintain. It provides various machine learning and visualization libraries such as Scikit-learn, TensorFlow, Matplotlib, Pandas, and Keras. It can also work with data from platforms such as SQL Server, a MongoDB database, or JSON files.
● Tableau Public – This is free software that connects to any data source, such as Excel or a corporate data warehouse. It then creates visualizations, maps, dashboards, etc., with real-time updates on the web.
●​ QlikView – This tool offers in-memory data processing with the results delivered to the
end-users quickly. It also offers data association and data visualization with data being
compressed to almost 10% of its original size.
●​ SAS – A programming language and environment for data manipulation and analytics,
this tool is easily accessible and can analyze data from different sources.
● Microsoft Excel – This is one of the most widely used tools for data analytics. Mostly used for clients’ internal data, it analyzes and summarizes data with a preview of pivot tables.
● RapidMiner – A powerful, integrated platform that can connect to many data source types such as Access, Excel, Microsoft SQL Server, Teradata, Oracle, and Sybase. This tool is mostly used for predictive analytics, such as data mining, text analytics, and machine learning.
●​ KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics
platform, which allows you to analyze and model data. With the benefit of visual
programming, KNIME provides a platform for reporting and integration through its
modular data pipeline concept.
● OpenRefine – Also known as Google Refine, this data cleaning software helps you clean up data for analysis. It is used for cleaning messy data, transforming data, and parsing data from websites.
● Apache Spark – One of the most widely used large-scale data processing engines, this tool executes applications in Hadoop clusters up to 100 times faster in memory and 10 times faster on disk. It is also popular for data pipelines and machine learning model development. [https://www.edureka.co/blog/what-is-data-analytics/]
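As a minimal, illustrative sketch (not part of the original notes), the Python libraries named above can be combined in a short analysis script. The file name sales.csv and its columns ("region", "revenue") are assumptions made purely for this example.

# Minimal sketch: a typical Python analytics workflow using Pandas and Matplotlib.
# The file name "sales.csv" and its columns ("region", "revenue") are assumed
# purely for illustration.
import pandas as pd
import matplotlib.pyplot as plt

# Extract: load raw data from a CSV source.
df = pd.read_csv("sales.csv")

# Clean: drop duplicate rows and records with missing revenue values.
df = df.drop_duplicates().dropna(subset=["revenue"])

# Analyze: summarize revenue per region to reveal a hidden pattern.
summary = df.groupby("region")["revenue"].sum().sort_values(ascending=False)
print(summary)

# Report: visualize the summary so it can be shared with business teams.
summary.plot(kind="bar", title="Revenue by region")
plt.tight_layout()
plt.show()

The same extract–clean–analyze–report flow applies whichever tool from the list above is chosen; only the syntax changes.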

Different Sources of Data for Data Analysis

Data collection is the process of acquiring, collecting, extracting, and storing the
voluminous amount of data which may be in the structured or unstructured form
like text, video, audio, XML files, records, or other image files used in later stages
of data analysis.​
In the process of big data analysis, “Data collection” is the initial step before
starting to analyze the patterns or useful information in data. The data which is to
be analyzed must be collected from different valid sources.

The data that is collected is known as raw data, which is not useful by itself; after
cleaning out the impurities and using that data for further analysis it becomes
information, and the insight obtained from this information is known as “knowledge”.
Knowledge has many uses, such as business knowledge about the sales of enterprise
products, disease treatment, etc. The main goal of data collection is to collect information-rich data.
The actual data is then further divided mainly into two types known as:
1.​ Primary data
2.​ Secondary data

1.Primary data:
Data that is raw, original, and extracted directly from official sources is
known as primary data. This type of data is collected directly using
techniques such as questionnaires, interviews, and surveys. The data collected
must match the demands and requirements of the target audience on
which the analysis is performed; otherwise it becomes a burden during data
processing.
Few methods of collecting primary data:
1. Interview method:
In this method, data is collected by interviewing the target audience; the person
conducting the interview is called the interviewer and the person who answers is
known as the interviewee. Some basic business or product related questions are
asked and recorded in the form of notes, audio, or video, and this data is
stored for processing. Interviews can be both structured and unstructured, such as
personal or formal interviews conducted by telephone, face to face, email,
etc.
2. Survey method:
The survey method is a research process in which a list of relevant questions
is asked and the answers are noted down in the form of text, audio, or video.
Surveys can be conducted both online and offline, for example through
website forms and email, and the answers are then stored for data analysis.
Examples are online surveys or surveys through social media polls.
3. Observation method:
The observation method is a method of data collection in which the researcher
keenly observes the behavior and practices of the target audience using some
data collection tool and stores the observed data in the form of text, audio, video,
or other raw formats. In this method, data may also be collected directly by posing
a few questions to the participants. For example, observing a group of customers
and their behavior towards certain products; the data obtained is then sent for
processing.
4. Experimental method:
The experimental method is the process of collecting data by performing
experiments, research, and investigation. The most frequently used experimental
designs are CRD, RBD, LSD, and FD.
● CRD – Completely Randomized Design is a simple experimental design used in data analytics which is based on randomization and replication. It is mostly used for comparing treatments in an experiment.
● RBD – Randomized Block Design is an experimental design in which the experiment is divided into small units called blocks. Random experiments are performed on each of the blocks and results are drawn using a technique known as analysis of variance (ANOVA); a minimal ANOVA sketch for an RBD is shown after this list. RBD originated in the agricultural sector.
● LSD – Latin Square Design is an experimental design that is similar to CRD and RBD but arranges the blocks in rows and columns. It is an NxN arrangement with an equal number of rows and columns, in which each letter occurs only once in each row and each column. Hence differences can be found easily with fewer errors in the experiment. A Sudoku puzzle is an example of a Latin square design.
● FD – Factorial Design is an experimental design in which each experiment has two or more factors, each with several possible values (levels), and trials are performed on combinations of these factor levels to study their combined effects.
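As a hedged illustration (not from the original notes), the ANOVA mentioned for RBD can be run in Python with the statsmodels library; the column names "treatment", "block", "yield_" and the data values below are invented for the example.

# Illustrative sketch: ANOVA for a Randomized Block Design (RBD) using statsmodels.
# The data values and column names ("treatment", "block", "yield_") are invented
# for demonstration only.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    "treatment": ["A", "B", "C"] * 4,                                # 3 treatments
    "block": ["b1"] * 3 + ["b2"] * 3 + ["b3"] * 3 + ["b4"] * 3,      # 4 blocks
    "yield_": [20, 24, 23, 19, 25, 22, 21, 26, 24, 18, 23, 22],
})

# Model the response as a function of treatment and block (both categorical).
model = ols("yield_ ~ C(treatment) + C(block)", data=data).fit()

# Classic ANOVA table: partitions variance between treatments and blocks.
print(sm.stats.anova_lm(model, typ=2))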

2. Secondary data:
Secondary data is data that has already been collected and is reused again
for some valid purpose. This type of data was previously recorded from primary
data, and it has two types of sources: internal sources and external sources.
Internal source:
This type of data can easily be found within the organization, such as market
records, sales records, transactions, customer data, accounting resources, etc.
The cost and time consumption of obtaining data from internal sources is low.
External source:
Data which cannot be found within the organization and is obtained through
external third-party resources is external source data. The cost and time
consumption is higher because this involves a huge amount of data. Examples of
external sources are government publications, news publications, the Registrar
General of India, the Planning Commission, the International Labour Bureau, syndicate
services, and other non-governmental publications.
Other sources:
● Sensor data: With the advancement of IoT devices, the sensors of these devices collect data which can be used for sensor data analytics to track the performance and usage of products.
● Satellite data: Satellites collect a large amount of images and data, in terabytes, on a daily basis through their surveillance cameras; this can be processed to extract useful information.
● Web traffic: Due to fast and cheap internet access, data in many formats uploaded by users on different platforms can be collected, with their permission, for data analysis. Search engines also provide data about the keywords and queries that are searched most often.
Classification of data
We can classify data as structured, semi-structured, or unstructured. Structured
data resides in predefined formats and models, unstructured data is stored in its
natural format until it is extracted for analysis, and semi-structured data is
basically a mix of both structured and unstructured data.
1) Structured Data

● Structured data is generally tabular data that is represented by columns and rows in a database.
● Databases that hold tables in this form are called relational databases.
● The mathematical term “relation” refers to a formed set of data held as a table.
● In structured data, every row in a table has the same set of columns.
● SQL (Structured Query Language) is the programming language used to work with structured data.
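A minimal sketch (assumed example, not from the notes) of structured data: every row has the same fixed set of columns in a relational table, queried with SQL via Python's built-in sqlite3 module. The table and column names are invented.

# Sketch: structured (relational) data where every row has the same columns,
# queried with SQL. Table and column names are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Asha", "Delhi"), (2, "Ravi", "Mumbai"), (3, "Meena", "Delhi")],
)

# SQL query over the structured rows.
for row in conn.execute("SELECT city, COUNT(*) FROM customers GROUP BY city"):
    print(row)                               # e.g. ('Delhi', 2)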

2) Semi-structured Data

● Semi-structured data is information that does not reside in a structured (relational) database but still has some structure to it.
● Semi-structured data consists of documents held in JavaScript Object Notation (JSON) format. It also includes key-value stores and graph databases.
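As a small illustration (field names and values assumed, not from the notes), a JSON document is semi-structured: it has named fields and nesting, but different records need not share the same set of fields.

# Sketch: semi-structured data as a JSON document. Field names are invented;
# note that the two records do not share an identical set of fields.
import json

doc = '''
[
  {"id": 1, "name": "Asha", "orders": [{"item": "book", "qty": 2}]},
  {"id": 2, "name": "Ravi", "email": "ravi@example.com"}
]
'''

records = json.loads(doc)            # parse the JSON text into Python objects
for rec in records:
    # Keys vary per record, so access them defensively.
    print(rec["name"], "->", rec.get("email", "no email on file"))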

3) Unstructured Data

● Unstructured data is information that either is not organized in a pre-defined manner or does not have a pre-defined data model.
● Unstructured information is typically text-heavy, but may also contain data such as numbers, dates, and facts.
● Videos, audio, and binary data files might not have a specific structure; they are referred to as unstructured data.

Characteristics of Structured (Relational) and Unstructured (Non-Relational) Data [https://k21academy.com/microsoft-azure/dp-900/structured-data-vs-unstructured-data-vs-semi-structured-data/]
Relational Data

● Relational databases provide undoubtedly the most well-understood model for holding data.
● The simple structure of columns and tables makes them very easy to use initially, but the inflexible structure can cause some problems.
● We can communicate with relational databases using Structured Query Language (SQL).
● SQL allows the joining of tables using a few lines of code, with a structure most beginner employees can learn very quickly.
Non-Relational Data

●​ Non-relational databases permit us to store data in a format that more closely meets the
original structure.
●​ A non-relational database is a database that does not use the tabular schema of columns
and rows found in most traditional database systems.
● It uses a storage model that is optimized for the specific requirements of the type of data being stored.
●​ In a non-relational database the data may be stored as JSON documents, as
simple key/value pairs, or as a graph consisting of edges and vertices.
●​ Examples of non-relational databases:

●​ Redis
●​ JanusGraph
●​ MongoDB
●​ RabbitMQ

Document Data Stores

● A document data store manages a set of named string fields and object data values in an entity referred to as a document.
● These data stores generally store data in the form of JSON documents.

Columnar Data Stores

● A columnar or column-family data store organizes data into rows and columns. The columns are divided into groups known as column families.
● Each column family consists of a set of columns that are logically related and are generally retrieved or manipulated as a unit.
● Within a column family, rows can be sparse and new columns can be added dynamically.

Key/Value Data Stores

●​ A key/value store is actually a large hash table.


● We associate each data value with a unique key, and the key/value store uses this key to store the data by using an appropriate hashing function.
● The hashing function is chosen to provide an even distribution of hashed keys across the data storage.
● Key/value stores are highly suitable for applications performing simple lookups using the value of the key, or by a range of keys.
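Conceptually, a key/value store behaves like the hash-table sketch below (plain Python, an assumed example); production systems such as Redis expose the same set/get idea over the network.

# Sketch: the hash-table idea behind a key/value store. A real store distributes
# hashed keys across storage nodes; here we just bucket them locally.
class TinyKeyValueStore:
    def __init__(self, buckets=8):
        # Each bucket holds the key/value pairs whose keys hash to that slot.
        self.buckets = [{} for _ in range(buckets)]

    def _bucket(self, key):
        # The hash function spreads keys evenly across the buckets.
        return self.buckets[hash(key) % len(self.buckets)]

    def set(self, key, value):
        self._bucket(key)[key] = value

    def get(self, key):
        return self._bucket(key).get(key)

store = TinyKeyValueStore()
store.set("user:42:name", "Asha")
print(store.get("user:42:name"))     # simple lookup by key -> 'Asha'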

Graph Data Stores

● A graph data store handles two types of information: nodes and edges.
● Nodes represent entities, and edges specify the relationships between these entities.
● The aim of a graph data store is to allow an application to efficiently perform queries that traverse the network of nodes and edges, and to analyze the relationships between entities.
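As an assumed illustration (the networkx library and the names below are not part of the notes), nodes, edges, and a traversal query can be modeled in Python like this.

# Sketch: nodes, edges, and a traversal query, using networkx for illustration.
# Entity names and relationships are invented.
import networkx as nx

g = nx.Graph()
g.add_edge("Asha", "Ravi", rel="colleague")
g.add_edge("Ravi", "Meena", rel="friend")
g.add_edge("Meena", "Acme Corp", rel="works_at")

# Traverse the network of edges to inspect how two entities are related.
print(nx.shortest_path(g, "Asha", "Acme Corp"))
# ['Asha', 'Ravi', 'Meena', 'Acme Corp']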

Time series data stores

● Time series data is a set of values organized by time, and a time series data store is optimized for this type of data.
● Time series data stores must support a very large number of writes, as they generally collect large amounts of data in real time from a huge number of sources.
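A hedged sketch (the sensor readings are synthetic, not from the notes) of time series data handled with pandas: timestamped writes that are later down-sampled for analysis.

# Sketch: time series values indexed by timestamps, then resampled.
# The sensor readings below are synthetic and used only for illustration.
import pandas as pd
import numpy as np

idx = pd.date_range("2024-01-01 00:00", periods=60, freq="min")   # one reading per minute
readings = pd.Series(20 + np.random.randn(60).cumsum() * 0.1, index=idx)

# Down-sample the high-frequency writes into 15-minute averages.
print(readings.resample("15min").mean())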

​ Object data stores

● Object data stores are suited for storing and retrieving large binary objects or blobs such as audio and video streams, images, text files, large application documents and data objects, and virtual machine disk images.
● An object consists of some metadata, the stored data, and a unique ID used to access the object.

External index data stores

● External index data stores provide the ability to search for information held in other data stores and services.
● An external index acts as a secondary index for any data store. It can provide real-time access to indexes and can be used to index massive volumes of data.

Characteristics Of Data

● Data should be precise, which means it should contain accurate information.
● Data should be relevant and according to the requirements of the user.
● Data should be consistent and reliable.
● Relevance of data is necessary for it to be of good quality and useful.

Introduction to Big Data Platform


A Big Data Platform refers to IT solutions that combine several Big Data
tools and utilities into one packaged solution, which is then used for
managing as well as analyzing Big Data. Why this is needed is covered
later in these notes, but consider how much data is created every day:
if this Big Data is not managed well, enterprises are bound to lose out
on customers. Let's get started with the basics.
ElixirData: This platform provides flexibility, security, and stability for enterprise
applications and Big Data infrastructure, deployable on-premises and on
public cloud, with cognitive insights using ML and AI. (Taken from the
article: Big Data Integration and Management Platform)
Why do we need a Big Data Platform?
This solution combines all the capabilities and features of many
big data applications into a single solution. It generally consists of big
data servers, management, storage, databases, management utilities,
and business intelligence.
It also focuses on providing its users with efficient analytics tools for
massive datasets. These platforms are often used by data engineers
to aggregate, clean, and prepare data for business analysis. Data
scientists use these platforms to discover relationships and patterns in
large data sets using machine learning algorithms. Users of such
platforms can also custom-build applications for their own use case,
for example to calculate customer loyalty (an e-commerce use case);
there are countless use cases.
What are the best Big Data Platforms?
Selection revolves around four letters, S, A, P, and S, which stand for
Scalability, Availability, Performance, and Security. There are various
tools responsible for managing the hybrid data of IT systems. Some of them
are listed below:

1.​ Hadoop Delta Lake Migration Platform


2.​ Data Catalog Platform
3.​ Data Ingestion Platform
4.​ IoT Analytics Platform
5.​ Data Integration and Management Platform
6.​ ETL Data Transformation Platform

The Evolution of Analytic Scalability :-

The growth of data has always been at a pace that strains the most
scalable options available at any point in time. The traditional ways
of performing advanced analytics were already reaching their limits
before big data; now, traditional approaches just won't do. This
section discusses the convergence of the analytic and data
environments, massively parallel processing (MPP) architectures,
the cloud, grid computing, and MapReduce. Each of these
paradigms enables greater scalability and plays a role in the analysis
of big data. The most important lesson to take away is that analytic
and data management environments are converging: in-database
processing is replacing much of the traditional offline analytic
processing used to support advanced analytics.
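One of the paradigms named above, MapReduce, can be illustrated with the classic word-count example. The plain-Python sketch below only mimics the map, shuffle, and reduce steps in a single process; it is a conceptual illustration, not an actual Hadoop or Spark job.

# Conceptual sketch of MapReduce (word count). In a real cluster the map and
# reduce steps run in parallel across many nodes; here they run in one process.
from collections import defaultdict

documents = ["big data needs scalable analytics", "big data platforms scale analytics"]

# Map: emit (word, 1) pairs from each input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the pairs by key (the word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the counts for each word.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)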
Reporting ≠ Analytics
Google Analytics is one of the most commonly used applications in business
today. As useful as it is, it’s not actually an analytical tool - it’s a reporting
tool. Here’s the difference between reporting and analytics.
●​ Reporting is about taking existing information and presenting it in a
way that is user friendly and digestible. This often involves pulling
data from different places, like in Google Analytics, or presenting the
data in a new way. Reporting is always defined and specified - it’s
about getting reconciliation and making it accurate, because the
business depends on the accuracy of those numbers to then make a
decision.
●​ Analytics is about adding value or creating new data to help inform a
decision, whether through an automated process or a manual
analysis. Unlike reporting, analytics is about uncertainty - you use it
when you don’t know exactly how to come to a good answer. This
could be because it’s a complicated problem, or because it’s a
challenge that isn’t well-defined, or because it’s a situation that
changes frequently so the answer you got yesterday is unlikely to
help you today.

APPLICATION OF ANALYTICS IN DIFFERENT FIELDS


Not just in one or two areas, data analytics is used in every field you can see around you.
Be it online shopping, hi-tech industries, or government, everyone uses data analytics
to help with decision making, budgeting, planning, etc.
Data analytics is employed in areas such as:

1. Transportation
Data analytics can be applied to improve transportation systems and the
intelligence around them. Predictive analysis helps find transport problems
like traffic or network congestion. It helps synchronize the vast amount of
data and use it to build and design plans and strategies for alternative
routes, reducing congestion and traffic, which in turn reduces the number
of accidents and mishaps. Data analytics can also help optimize the
buyer's experience during travel by recording information from social
media. It also helps travel companies tailor their packages and boost the
personalized travel experience based on the data collected.

For example, during the wedding season or the holiday season, transport
providers prepare to accommodate the heavy number of passengers
traveling from one place to another using prediction tools and techniques.

2. Logistics and Delivery


Logistics companies like DHL, FedEx, etc. use data analytics to manage
their overall operations. Using data analytics, they can figure out the best
shipping routes and approximate delivery times, and can also track the
real-time status of goods dispatched using GPS trackers. Data analytics
has made online shopping easier and more in demand.
Example of the use of data analytics in logistics and delivery:
From the moment a shipment is dispatched from its origin until it reaches
the buyer, every position is tracked, which minimizes the loss of goods.

3. Web Search or Internet Web Results


Web search engines like Yahoo, Bing, DuckDuckGo, and Google use large
sets of data to return results when you search for something. Whenever you
hit the search button, the search engine uses data analytics algorithms to
deliver the best search results within a limited time frame. The set of results
that appears whenever we search for any information is obtained through
data analytics.

The searched text is treated as a keyword, and all the related pieces of
information are presented in a sorted manner that one can easily understand.
For example, when you search for a product on Amazon, it keeps appearing
on your social media profiles, along with details of the product, to convince
you to buy it.

4. Manufacturing
Data analytics helps manufacturing industries manage their overall
operations through tools like predictive analysis, regression analysis,
budgeting, etc. A production unit can figure out the number of products it
needs to manufacture based on data collected and analyzed from demand
samples, and can likewise improve many other operations, increasing
operating capacity as well as profitability.

5. Security
Data analytics also helps provide security to an organization. Security
analytics is an approach to cybersecurity focused on the analysis of data
to deliver proactive security measures. No business can foresee the
future, particularly where security threats are concerned, but by deploying
security analytics tools that can analyze security events it is possible to
detect a threat before it gets a chance to affect your systems and bottom
line.

6. Education
Data analytics applications in education are among the most needed in the
current scenario. Analytics is mostly used in adaptive learning, new innovations,
adaptive content, etc. It is the measurement, collection, analysis, and reporting
of data about learners and their context, for the purpose of understanding and
optimizing learning and the environments in which it occurs.
7. Healthcare
Applications of data analytics in healthcare can be used to sift through enormous
amounts of data in seconds to discover treatment options or solutions for
various illnesses. This not only provides accurate solutions based on
historical data but may also provide accurate answers to the unique concerns
of specific patients.

8. Military
Military applications of data analytics bring together a variety of
technical and application-oriented use cases. They enable leaders and
technologists to make connections between data analysis and such
fields as augmented reality and cognitive science that are driving military
organizations around the globe forward.

Key Roles for Data Analytics project


There are certain key roles that are required for the complete and successful
functioning of a data science team executing analytics projects. The key roles
are seven in number.
Each role plays a crucial part in developing a successful analytics project. There
is no hard and fast rule for the listed seven roles; fewer or more people may fill
them depending on the scope of the project, the skills of the participants,
and the organizational structure.
Example –
For a small, versatile team, the seven roles listed here may be fulfilled by only
three or four people, whereas a large project, on the contrary, may require 20
or more people to fulfill them.
Key Roles for a Data analytics project :
1.​ Business User :
● The business user is someone who understands the domain area of the project and generally benefits from the results.
● This user advises and consults the team working on the project about the value of the results obtained and how the outputs will be operationalized.
● A business manager, line manager, or deep subject matter expert in the project domain usually fulfills this role.
2.​ Project Sponsor :
● The project sponsor is the person responsible for initiating the project. The project sponsor provides the actual requirements for the project and presents the core business problem.
● The sponsor generally provides the funding and measures the degree of value from the final outputs of the team working on the project.
● This person sets the priorities for the project and clarifies the desired outputs.

3.​ Project Manager :


● This person ensures that key milestones and objectives of the project are met on time and with the expected quality.

4.​ Business Intelligence Analyst :


● The business intelligence analyst provides business domain expertise based on a detailed and deep understanding of the data, key performance indicators (KPIs), key metrics, and business intelligence from a reporting point of view.
● This person generally creates dashboards and reports and knows about the data feeds and sources.

5.​ Database Administrator (DBA) :


● The DBA provisions and configures the database environment to support the analytics needs of the team working on the project.
● These responsibilities may include providing access to key databases or tables and making sure the appropriate security levels are in place for the data repositories.

6.​ Data Engineer :


● The data engineer brings deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data ingestion into the analytic sandbox.
● The data engineer works closely with the data scientist to help shape data into the right form for analysis.

7.​ Data Scientist :


● The data scientist provides subject matter expertise for analytical techniques and data modelling, and applies the correct analytical techniques to the given business problems.
● The data scientist ensures that the overall analytical objectives are met.
● Data scientists design and execute analytical methods and approaches using the data available to the project.
Data Analytics Lifecycle:
The data analytics lifecycle is designed for Big Data problems and data science
projects. The cycle is iterative, to represent a real project. To address the distinct
requirements of performing analysis on Big Data, a step-by-step methodology
is needed to organize the activities and tasks involved with acquiring, processing,
analyzing, and repurposing data.
Phase 1: Discovery –
● The data science team learns about and investigates the problem.
● The team develops context and understanding.
● The team comes to know which data sources are needed and available for the project.
● The team formulates initial hypotheses that can later be tested with data.
Phase 2: Data Preparation –
● Steps to explore, preprocess, and condition data prior to modeling and analysis.
● It requires the presence of an analytic sandbox, in which the team performs extract, load, and transform (ELT) operations to get data into the sandbox.
● Data preparation tasks are likely to be performed multiple times and not in a predefined order.
● Tools commonly used for this phase include Hadoop, Alpine Miner, OpenRefine, etc.
Phase 3: Model Planning –
● The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.
● In this phase, the data science team develops data sets for training, testing, and production purposes.
● The team builds and executes models based on the work done in the model planning phase.
● Tools commonly used for this phase include MATLAB and STATISTICA.
Phase 4: Model Building –
● The team develops datasets for testing, training, and production purposes.
● The team also considers whether its existing tools will suffice for running the models or whether it needs a more robust environment for executing the models.
● Free or open-source tools – R and PL/R, Octave, WEKA.
● Commercial tools – MATLAB, STATISTICA.
Phase 5: Communicate Results –
● After executing the model, the team needs to compare the outcomes of the modeling to the criteria established for success and failure.
● The team considers how best to articulate the findings and outcomes to the various team members and stakeholders, taking into account caveats and assumptions.
● The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.
Phase 6: Operationalize –
● The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening the work to the full enterprise of users.
● This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale, and to make adjustments before full deployment.
● The team delivers final reports, briefings, and code.
● Free or open-source tools – Octave, WEKA, SQL, MADlib.
