Data Analytics - Unit - 1
Data Collection
Data collection is the process of acquiring, extracting, and storing voluminous amounts of data, which may be in structured or unstructured form such as text, video, audio, XML files, records, or image files, for use in later stages of data analysis.
In big data analysis, data collection is the initial step, carried out before the patterns or useful information in the data are analyzed. The data to be analyzed must be collected from different valid sources.
The data collected at this stage is known as raw data. Raw data is not immediately useful; cleaning out the impurities and using the data for further analysis produces information, and the information obtained in turn becomes knowledge. Knowledge can take many forms, such as business knowledge about the sales of enterprise products, knowledge of disease treatment, and so on. The main goal of data collection is therefore to collect information-rich data.
The collected data is mainly divided into two types:
1. Primary data: Raw, original data extracted directly from official sources is known as primary data. This type of data is collected directly through techniques such as questionnaires, interviews, and surveys. The data collected must match the demands and requirements of the target audience on which the analysis is performed; otherwise it becomes a burden during data processing. A few methods of collecting primary data:
1. Interview method:
In this method, data is collected by interviewing the target audience: the person conducting the interview is the interviewer and the person who answers is the interviewee. Basic business- or product-related questions are asked and recorded in the form of notes, audio, or video, and this data is stored for processing. Interviews can be structured or unstructured, and may be personal or formal interviews conducted over the telephone, face to face, by email, etc.
2. Survey method:
The survey method is a research process in which a list of relevant questions is asked and the answers are recorded in the form of text, audio, or video. Surveys can be conducted in both online and offline modes, for example through website forms and email. The survey answers are then stored for data analysis. Examples are online surveys or surveys conducted through social media polls.
3. Observation method:
The observation method is a method of data collection in which the researcher keenly observes the behavior and practices of the target audience using a data collection tool and stores the observed data as text, audio, video, or other raw formats. In this method, data is gathered by observation rather than by directly posing questions to the participants, for example by observing a group of customers and their behavior towards a product. The data obtained is then sent for processing.
4. Experimental method:
The experimental method is the process of collecting data by performing experiments, research, and investigation. The most frequently used experimental designs are CRD, RBD, LSD, and FD.
CRD – Completely Randomized Design is a simple experimental design used in data analytics that is based on randomization and replication. It is mostly used for comparing experiments.
RBD – Randomized Block Design is an experimental design in which the experiment is divided into small units called blocks. Random experiments are performed on each of the blocks and results are drawn using a technique known as analysis of variance (ANOVA). RBD originated in the agriculture sector.
LSD – Latin Square Design is an experimental design similar to CRD and RBD but arranged in rows and columns. It is an N×N square arrangement with an equal number of rows and columns, in which each letter occurs only once in each row and each column. Hence differences can be found easily, with fewer errors in the experiment. A Sudoku puzzle is an example of a Latin square design.
FD – Factorial Design is an experimental design in which each experiment has two or more factors, each with several possible values (levels), and trials are performed across the combinations of these factors. A short sketch of the LSD and RBD ideas follows this list.
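As a rough illustration of two of these designs, here is a minimal Python sketch (all numbers are made up) that builds a small Latin square by cyclically shifting treatment labels, so each treatment occurs exactly once per row and per column as in LSD, and then runs a simplified one-way ANOVA with scipy.stats.f_oneway on hypothetical block measurements, the kind of comparison an RBD analysis relies on (a full RBD analysis would separate block and treatment effects).

    # Minimal sketch of the LSD and RBD ideas; all numbers are hypothetical.
    import numpy as np
    from scipy.stats import f_oneway

    # Latin Square Design: each treatment appears once per row and once per column.
    treatments = np.array(list("ABCD"))
    n = len(treatments)
    latin_square = np.array([np.roll(treatments, -i) for i in range(n)])  # cyclic shifts
    print(latin_square)
    for i in range(n):
        assert len(set(latin_square[i, :])) == n   # no repeats in row i
        assert len(set(latin_square[:, i])) == n   # no repeats in column i

    # Randomized Block Design: compare blocks using analysis of variance (ANOVA).
    block_1 = [20.1, 19.8, 21.0, 20.5]   # made-up measurements for block 1
    block_2 = [22.3, 21.9, 22.8, 22.1]   # made-up measurements for block 2
    block_3 = [19.5, 20.0, 19.2, 19.9]   # made-up measurements for block 3
    f_stat, p_value = f_oneway(block_1, block_2, block_3)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")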
2. Secondary data: Secondary data is data that has already been collected and is reused for some valid purpose. This type of data is derived from previously recorded primary data, and it comes from two types of sources: internal sources and external sources.
Internal source:
This type of data can easily be found within the organization, such as market records, sales records, transactions, customer data, accounting resources, etc. The cost and time required to obtain data from internal sources are low.
External source:
Data that cannot be found within the organization and must be obtained through external third-party resources is external source data. The cost and time required are higher because such sources contain huge amounts of data. Examples of external sources are government publications, news publications, the Registrar General of India, the Planning Commission, the International Labour Bureau, syndicate services, and other non-governmental publications.
Other sources:
Sensor data: With the advancement of IoT devices, the sensors in these devices collect data that can be used for sensor data analytics to track the performance and usage of products.
Satellite data: Satellites collect terabytes of images and data on a daily basis through surveillance cameras, which can be used to extract useful information.
Web traffic: Thanks to fast and cheap internet, data in many formats uploaded by users to different platforms can be collected, with their permission, for data analysis. Search engines also provide data on the most frequently searched keywords and queries.
Structured and Unstructured Data
Structured data is data that has been predefined and formatted to a set
structure before being placed in data storage, which is often referred to as
schema-on-write. The best example of structured data is the relational
database: the data has been formatted into precisely defined fields, such as
credit card numbers or addresses, in order to be easily queried with SQL.
Structured data is an old, familiar friend: it is the basis for inventory control systems and ATMs, and it can be human- or machine-generated. The main drawback of structured data is its lack of flexibility, since the structure must be defined before the data is stored.
Unstructured data is data stored in its native format and not processed
until it is used, which is known as schema-on-read. It comes in a myriad of file
formats, including email, social media posts, presentations, chats, IoT sensor
data, and satellite imagery.
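To make the schema-on-write versus schema-on-read distinction concrete, the following minimal Python sketch (with made-up records) stores structured rows in a predefined SQLite table, while an unstructured JSON document is kept as raw text and only interpreted when it is read.

    # Sketch: schema-on-write (structured) vs. schema-on-read (unstructured); data is invented.
    import json
    import sqlite3

    # Schema-on-write: the table structure is defined before any data is stored.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
    conn.execute("INSERT INTO customers VALUES (?, ?, ?)", (1, "Asha", "Pune"))
    conn.commit()
    for row in conn.execute("SELECT name, city FROM customers"):  # queries rely on fixed fields
        print(row)

    # Schema-on-read: the raw document is stored as-is and parsed only when used.
    raw_document = '{"user": "Asha", "post": "Great product!", "tags": ["review"]}'
    parsed = json.loads(raw_document)   # the "schema" is decided here, at read time
    print(parsed["user"], parsed["tags"])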
As with structured data, unstructured data has strengths and weaknesses for specific business needs. Its main benefit is flexibility, since no structure is imposed on the data until it is read. There are also cons to using unstructured data: it requires specific expertise and specialized tools in order to be used to its fullest potential.
1. Requires data science expertise: The largest drawback to unstructured
data is that data science expertise is required to prepare and analyze the
data. A standard business user cannot use unstructured data as it is, due to
its undefined/non-formatted nature. Using unstructured data requires understanding not only the topic or area of the data but also how the pieces of data relate to each other to make it useful.
2. Specialized tools: In addition to the required expertise, unstructured data requires specialized tools to manipulate it. Standard tools are intended for use with structured data, which leaves a data manager with limited choices of products for unstructured data, some of which are still in their infancy.
Examples of unstructured data
Common examples of unstructured data include email, social media posts, presentations, chats, IoT sensor data, and satellite imagery.
Structured data vs. unstructured data comes down to the data types that can be used, the level of data expertise required to use them, and schema-on-write versus schema-on-read.
Currently, the marketplace is flooded with numerous open-source and commercially available big data platforms. They boast different features and capabilities for use in a big data environment.
5 V’s of Big Data
Volume
The name Big Data itself refers to enormous size. Big Data is the vast volume of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and many more.
For example, Facebook generates approximately a billion messages, records around 4.5 billion "Like" clicks, and receives more than 350 million new posts each day. Big data technologies can handle such large amounts of data.
Variety
Big Data can be structured, unstructured, or semi-structured, and is collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data comes in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.
The data is categorized as below; a short sketch of loading structured and semi-structured data follows the list:
a. Structured data: Structured data has a defined schema with all the required columns. It is in tabular form and is stored in a relational database management system.
b. Semi-structured data: In semi-structured data, the schema is not rigidly defined; examples are JSON, XML, CSV, TSV, and email. Unlike the structured data that OLTP (Online Transaction Processing) systems are built to work with, it is not stored in relations, i.e., tables.
c. Unstructured data: All unstructured files, such as log files, audio files, and image files, are included in unstructured data. Some organizations have a great deal of data available, but they do not know how to derive value from it because the data is raw.
d. Quasi-structured data: This format contains textual data with inconsistent formats that can be structured only with some effort, time, and tools.
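The short Python sketch below (with invented file contents) shows what handling this variety can look like in practice: a structured CSV-style table is read into a fixed set of columns, while a semi-structured JSON record with nested fields is flattened with pandas.

    # Sketch: loading structured (tabular) and semi-structured (JSON) data; content is invented.
    import io
    import json
    import pandas as pd

    # Structured data: fixed columns in tabular form.
    csv_text = "order_id,amount,region\n1,250,North\n2,410,South\n"
    orders = pd.read_csv(io.StringIO(csv_text))
    print(orders)

    # Semi-structured data: nested fields without a rigid table schema.
    json_text = '{"order_id": 3, "customer": {"name": "Ravi", "city": "Delhi"}, "items": ["pen", "book"]}'
    record = json.loads(json_text)
    flat = pd.json_normalize(record)     # flattens nested keys such as customer.name
    print(flat.columns.tolist())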
Veracity
Veracity means how reliable the data is. Because incoming data is often noisy or inconsistent, there are many ways to filter or translate it; veracity is about being able to handle and manage such data efficiently, which is also essential in business development. An example of data with uncertain veracity is Facebook posts with hashtags.
Value
Value is an essential characteristic of big data. It is not the raw data that we process or store that matters; it is the valuable and reliable data that we store, process, and analyze.
Velocity
Velocity plays an important role compared to the others. Velocity refers to the speed at which data is created in real time. It covers the rate at which incoming data sets arrive, their rate of change, and bursts of activity. A primary aspect of Big Data is providing the demanded data rapidly.
Big data velocity deals with the speed at which data flows from sources such as application logs, business processes, networks, social media sites, sensors, mobile devices, etc.
Big Data Platforms vs. Data Lake vs. Data Warehouse
Big Data, at its core, refers to technologies that handle large volumes of data too
complex to be processed by traditional databases. However, it is a very broad term,
functioning as an umbrella term for more specific solutions such as Data Lake and Data
Warehouse.
A Data Lake is a scalable storage repository that not only holds large volumes of raw data in its native format but also enables organizations to prepare that data for further usage. Data coming into a Data Lake does not have to be collected for a specific purpose from the beginning; the purpose can be defined later. Because the data does not need to undergo an initial transformation process, it can be loaded faster.
In Data Lakes, data is gathered in its native formats, which provides more opportunities for exploration, analysis, and further operations, as all data requirements can be tailored on a case-by-case basis; once a schema has been developed, it can be kept for future use or discarded.
Compared to Data Lakes, Data Warehouses represent a more traditional and restrictive approach.
A Data Warehouse is a scalable data repository holding large volumes of data, but its environment is far more structured than that of a Data Lake. Data collected in a Data Warehouse is already pre-processed, which means it is no longer in its native format. Data requirements must be known and set up front to make sure the models and schemas produce usable data for all users.
Key differences between Data Lake and Data Warehouse
Analytic processes and tools
Data Analytics Process
The process of data analysis, or alternatively the data analysis steps, involves gathering all the information, processing it, exploring the data, and using it to find patterns and other insights.
The process of data analysis consists of:
Data Requirement Gathering: Ask yourself why you’re doing this analysis, what type
of data you want to use, and what data you plan to analyze.
Data Collection: Guided by your identified requirements, it’s time to collect the data
from your sources. Sources include case studies, surveys, interviews, questionnaires,
direct observation, and focus groups. Make sure to organize the collected data for
analysis.
Data Cleaning: Not all of the data you collect will be useful, so it's time to clean it up. This is where you remove white spaces, duplicate records, and basic errors; a short pandas sketch of this step follows the list. Data cleaning is mandatory before sending the information on for analysis.
Data Analysis: Here is where you use data analysis software and other tools to help you
interpret and understand the data and arrive at conclusions. Data analysis
tools include Excel, Python, R, Looker, Rapid Miner, Chartio, Metabase, Redash,
and Microsoft Power BI.
Data Interpretation: Now that you have your results, you need to interpret them and
come up with the best courses of action based on your findings.
Data Visualization: Data visualization is a fancy way of saying, “graphically show your
information in a way that people can read and understand it.” You can use charts, graphs,
maps, bullet points, or a host of other methods. Visualization helps you derive valuable
insights by helping you compare datasets and observe relationships.
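As a minimal, hypothetical example of the data cleaning step mentioned above, the pandas sketch below trims white space, fixes a basic type error, and removes duplicate records from a few invented survey rows before a first simple analysis.

    # Sketch of basic data cleaning with pandas; the rows are invented.
    import pandas as pd

    raw = pd.DataFrame({
        "name": ["  Asha ", "Ravi", "Ravi", "Meena"],
        "age":  ["34", "41", "41", "29"],        # ages arrived as text
        "city": ["Pune", "Delhi ", "Delhi ", "Chennai"],
    })

    clean = raw.copy()
    clean["name"] = clean["name"].str.strip()    # remove stray white space
    clean["city"] = clean["city"].str.strip()
    clean["age"] = clean["age"].astype(int)      # fix the basic type error
    clean = clean.drop_duplicates()              # remove duplicate records

    print(clean)
    print("Average age:", clean["age"].mean())   # a first, simple analysis step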
Reporting Vs Analysis
When you have a set of data stored somewhere and you are happy with the structure
(e.g. you have already cleaned or enhanced your dataset), it is time to make something
out of it. So here comes the “reporting” part. Data reporting is about taking the
available information (e.g. your dataset), organizing it, and displaying it in a well-
structured and digestible format we call “reports”. You can present data from various
sources, making it available for anyone to analyze it.
Reporting is a great way to help the internal teams and experts answer the question
of what is happening.
Analytics is a much wider term. It encompasses reporting, as you cannot talk about analytics without proper reporting. Having said that, for proper decision-making you will need much more than that.
Analytics is about diving deeper into your data and reports in order to look for insights.
It’s actually an attempt to answer why something is happening. Analytics powers up
decision-making as the main goal is to make sense of the data explaining the reason
behind the reported numbers. Last but not least, in the context of reporting vs. analytics,
you will find that analytics includes recommendations as well. After you analyze your
data and know why something is happening, your aim is to determine a course of action
to either improve something or provide a solution.
As discussed, to do a proper analysis you will need well-designed reports. As opposed
to reporting where data is just grouped up and presented, analytics rests on dashboards
that allow you to dive deeper into existing numbers and look for insights.
Value: reporting transforms your data into information, while analytics transforms that information into insights and recommendations.
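The contrast can be sketched in a few lines of pandas with made-up sales rows: the "report" simply groups and presents totals (what happened), while the "analytics" step drills into one month to look for the reason behind the number (why it happened).

    # Sketch: reporting (what happened) vs. an analytic drill-down (why); data is made up.
    import pandas as pd

    sales = pd.DataFrame({
        "month":  ["Jan", "Jan", "Feb", "Feb", "Feb"],
        "region": ["North", "South", "North", "South", "South"],
        "amount": [500, 450, 520, 200, 180],
    })

    # Reporting: organize and present the information.
    report = sales.groupby("month", sort=False)["amount"].sum()
    print(report)                                   # Jan 950, Feb 900

    # Analytics: drill deeper to explain the February dip.
    feb_breakdown = sales[sales["month"] == "Feb"].groupby("region")["amount"].sum()
    print(feb_breakdown)                            # the South region accounts for the decline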
Applications of Data Analytics
1. Transportation
One can use data analytics to reduce traffic congestion and improve travel by improving transportation systems and their intelligence. It works by analyzing enormous volumes of data to build alternative routes that relieve traffic congestion, which in turn reduces road accidents. Likewise, travel companies can obtain buyers' preferences from social media and other sources to improve their packages. This improves the travel experience of buyers and grows companies' customer bases. For example, data analytics was used to solve the transportation problem of 18 million people in London during the 2012 Olympics.
2. Education
Policymakers can use data analytics to improve learning curricula and management
decisions. These applications would improve both learning experiences and
administrative management. To improve the curriculum, we can collect preference data
from each student to build curricula. This would create a better system where students
use different ways to learn the same content. Also, quality data obtained from students can support better resource allocation and more sustainable management decisions. For example, data analytics can let administrators know which facilities students use less or which subjects they are barely interested in.
3. Internet web search results
Search engines like Google, Amazon e-commerce search, Bing, etc., use analytics to
arrange data and deliver the best search results. This implies that data analytics is used
in most search engine operations. When storing web data, data analytics gathers
massive volumes of data submitted by different pages and groups them according to
keywords. In each group, analytics also helps rank web pages according to relevance.
Likewise, every word the searcher enters is a keyword in delivering search results. Data
analytics is again used to search a particular group of web pages to provide the one that
matches the keyword intent best.
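A toy Python version of this keyword grouping and ranking might look like the sketch below: pages (with invented contents) are grouped into an inverted index by keyword, and results for a query term are ranked by a very crude relevance score, the term's frequency on the page.

    # Toy sketch of keyword grouping (inverted index) and relevance ranking; pages are invented.
    from collections import defaultdict

    pages = {
        "page1": "data analytics improves business decisions with data",
        "page2": "travel packages and holiday deals",
        "page3": "big data analytics for transportation and traffic data data",
    }

    # Group pages by keyword.
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.split():
            index[word].add(page_id)

    def search(keyword):
        """Rank matching pages by a simple relevance score: keyword frequency."""
        matches = index.get(keyword, set())
        scored = [(pages[p].split().count(keyword), p) for p in matches]
        return [p for _, p in sorted(scored, reverse=True)]

    print(search("data"))   # page3 (3 occurrences) ranks above page1 (2 occurrences)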
6. Security
Security personnel use data analytics (especially predictive analytics) to anticipate future crimes or security breaches. They can also investigate past or ongoing attacks. Analytics makes it possible to analyze how IT systems were breached during an attack, other plausible weaknesses, and the behavior of end users or devices involved in a security breach.
Some cities use data analytics to monitor areas with high crime rates. They monitor
crime patterns and predict future crime possibilities from these patterns. This helps
maintain a safe city without risking police officers’ lives.
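As a rough, hypothetical illustration of predictive analytics in this security setting, the sketch below fits a logistic regression on made-up login features (number of failed attempts and an off-hours flag) and scores a new event for breach risk; real security models use far richer features and data.

    # Hypothetical sketch: scoring login events for breach risk with logistic regression.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Made-up training data: [failed_attempts, off_hours (0/1)] per login session.
    X = np.array([[0, 0], [1, 0], [0, 1], [7, 1], [9, 1], [6, 0], [8, 1], [1, 1]])
    y = np.array([0, 0, 0, 1, 1, 1, 1, 0])   # 1 = past breach, 0 = normal activity

    model = LogisticRegression().fit(X, y)

    # Score a new event: many failed attempts, outside working hours.
    new_event = np.array([[10, 1]])
    risk = model.predict_proba(new_event)[0, 1]
    print(f"Estimated breach risk: {risk:.2f}")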
7. Fraud detection
Phase 1: Discovery –
Phase 6: Operationalize
The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening it to the full enterprise of users.
This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and to make adjustments before full deployment.
The team delivers the final reports, briefings, and code.
Free or open source tools – Octave, WEKA, SQL, MADlib.
Key Roles of a Successful Data Analytics Project
Certain key roles are required for a data science team to function fully and execute analytics projects successfully. The key roles are seven in number.
Each role is crucial to developing a successful analytics project. There is no hard and fast rule for the listed seven roles; fewer or more people may fill them depending on the scope of the project, the skills of the participants, and the organizational structure.
Example –
For a small, versatile team, the listed seven roles may be fulfilled by only three to four people, whereas a large project may, on the contrary, require 20 or more people to fill them.
Key Roles for a Data Analytics Project:
1. Business User :
The business user is the one who understands the main domain area of the project and typically benefits from its results.
This user advises and consults the team working on the project about the value of the results obtained and how the outputs will be put to use.
A business manager, line manager, or deep subject matter expert in the project domain usually fulfills this role.
2. Project Sponsor :
The Project Sponsor is the person responsible for initiating the project.
The Project Sponsor provides the actual requirements for the project and presents the core business issue.
He or she generally provides the funding and gauges the degree of value from the final outputs of the team working on the project.
This person frames the prime concerns and shapes the desired outputs.
3. Project Manager :
This person ensures that the key milestones and purpose of the project are met on time and with the expected quality.
6. Data Engineer :
The data engineer has deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data intake into the analytic sandbox.
The data engineer works closely with the data scientist to help shape the data into the correct form for analysis.
7. Data Scientist :
The data scientist provides subject matter expertise for analytical techniques and data modelling, and applies the correct analytical techniques to a given business issue.
He or she ensures that the overall analytical objectives are met.
Data scientists design and apply analytical methods and work with the data available for the concerned project.