
DATA ANALYTICS UNIT–I

DATA ANALYTICS
NOTES

Faculty: KHUSHBU DOULANI


ASSISTANT PROFESSOR
CSE
GNITC

Basics Terminology

Data

Datasets

Information

Datasets Example

Business data Example



INTRODUCTION:

In the early days of computers and the Internet, far less data was generated than today; it could easily be
stored and managed by users and business enterprises on a single computer, because the total volume never
exceeded about 19 exabytes. In the current era, however, roughly 2.5 quintillion bytes of data are generated
per day.

Most of this data is generated from social media sites like Facebook, Instagram, Twitter, etc., and other sources
include e-business and e-commerce transactions, hospital, school, and bank data, etc. This data is impossible to
manage with traditional data storage techniques. Whether the data is generated by a large-scale enterprise or by
an individual, every aspect of it needs to be analysed to benefit from it. But how do we do that? That is where the
term ‘Data Analytics’ comes in.

Why is Data Analytics important?


Data Analytics plays a key role in improving a business: it is used to uncover hidden insights and interesting patterns
in data, generate reports, perform market analysis, and refine business requirements.

What is the role of Data Analytics?

Gather Hidden Insights – Hidden insights are gathered from data and then analyzed with respect to
business requirements.

Generate Reports – Reports are generated from the data and passed on to the respective teams and
individuals so they can take further action to grow the business.
Perform Market Analysis – Market analysis can be performed to understand the strengths and
weaknesses of competitors.
Improve Business Requirements – Analysis of data allows the business to improve customer
requirements and experience.

What are the tools used in Data Analytics?


R programming
Python
Tableau Public
QlikView
SAS
Microsoft Excel
RapidMiner
KNIME
OpenRefine
Apache Spark

Data and architecture design:

Data architecture in Information Technology is composed of the models, policies, rules, or standards that govern
which data is collected and how it is stored, arranged, integrated, and put to use in data systems and in
organizations.

A data architecture should set data standards for all of its data systems, as a vision or a model of the eventual
interactions between those data systems.
Data architectures address data in storage and data in motion; descriptions of data stores, data groups, and
data items; and mappings of those data artifacts to data qualities, applications, locations, etc.
Essential to realizing the target state, data architecture describes how data is processed, stored, and utilized
in a given system. It provides criteria for data processing operations that make it possible to design data
flows and also to control the flow of data in the system.
The data architect is typically responsible for defining the target state, aligning during development, and
then following up to ensure enhancements are done in the spirit of the original blueprint.

During the definition of the target state, the data architecture breaks a subject down to the atomic level and
then builds it back up to the desired form.

The data architect breaks the subject down through three traditional architectural processes:

Conceptual model: A business-level model that uses the Entity-Relationship (ER) model to describe entities, their
attributes, and the relationships between them.
Logical model: A model in which the problem is represented in logical structures such as rows and columns of
data, classes, XML tags, and other DBMS constructs.
Physical model: A model that holds the database design, such as which type of database technology will be
suitable for the architecture.

Layer / View / Data (What) / Stakeholder:

1. Scope/Contextual – List of things and architectural standards important to the business – Planner
2. Business Model/Conceptual – Semantic model or Conceptual/Enterprise Data Model – Owner
3. System Model/Logical – Enterprise/Logical Data Model – Designer
4. Technology Model/Physical – Physical Data Model – Builder
5. Detailed Representations – Actual databases – Subcontractor

The data architecture is formed by dividing it into these three essential models, which are then combined.

Factors that influence Data Architecture:


Various constraints and influences will have an effect on data architecture design. These include enterprise
requirements, technology drivers, economics, business policies, and data processing needs.
Enterprise requirements:
These will generally include such elements as economical and effective system expansion, acceptable
performance levels (especially system access speed), transaction reliability, and transparent data
management.
In addition, the conversion of raw data such as transaction records and image files into more useful
information forms through features such as data warehouses is also a common organizational requirement,
since this enables managerial decision making and other organizational processes.
One architecture technique is the split between managing transaction data and (master) reference
data. Another is splitting data capture systems from data retrieval systems (as done in a data
warehouse).
Technology drivers:
These are usually suggested by the completed data architecture and database architecture designs.
In addition, some technology drivers will derive from existing organizational integration frameworks and
standards, organizational economics, and existing site resources (e.g. previously purchased software
licensing).
Economics:
These are also important factors that must be considered during the data architecture phase. It is possible
that some solutions, while optimal in principle, may not be potential candidates due to their cost.
External factors such as the business cycle, interest rates, market conditions, and legal considerations
could all have an effect on decisions relevant to data architecture.
Business policies:
Business policies that drive data architecture design include internal organizational policies, rules of
regulatory bodies, professional standards, and applicable governmental laws, which can vary by agency.
These policies and rules help describe the manner in which the enterprise wishes to process its data.
Data processing needs:
These include accurate and reproducible transactions performed in high volumes, data warehousing for
the support of management information systems (and potential data mining), repetitive periodic reporting,
ad hoc reporting, and support of various organizational initiatives as required (e.g. annual budgets, new
product development).
The General Approach is based on designing the Architecture at three Levels of Specification.
➢ The Logical Level
➢ The Physical Level
➢ The Implementation Level
Understand various sources of the Data:
Data can be generated from two types of sources, namely primary and secondary sources.
Data collection is the process of acquiring, collecting, extracting, and storing the voluminous amount of
data which may be in the structured or unstructured form like text, video, audio, XML files, records, or
other image files used in later stages of data analysis.
In the process of big data analysis, “data collection” is the initial step before starting to analyse the patterns
or useful information in data. The data which is to be analysed must be collected from different valid
sources.
The data which is collected is known as raw data. Raw data is not directly useful, but once the impure data is
cleaned out and the rest is used for further analysis it becomes information, and the insight obtained from that
information is known as “knowledge”. Knowledge takes many forms, such as business knowledge about the
sales of enterprise products, disease treatment, etc.
The main goal of data collection is to collect information-rich data.
Data collection starts with asking some questions such as what type of data is to be collected
and what is the source of collection.
Most of the data collected is of two types: qualitative data, which is non-numerical data such as words and
sentences and mostly focuses on the behaviour and actions of a group, and quantitative data, which is in
numerical form and can be calculated using different scientific tools and sampling methods.
The actual data is then further divided mainly into two types known as:
1. Primary data
2. Secondary data

1. Primary data:
• The data which is raw, original, and extracted directly from official sources is known as primary data.
This type of data is collected directly by performing techniques such as questionnaires, interviews, and
surveys. The data collected must match the demands and requirements of the target audience on
which the analysis is performed; otherwise it would be a burden in data processing.
Few methods of collecting primary data:
1. Interview method:
• The data collected during this process is obtained by interviewing the target audience; the person who
conducts the interview is called the interviewer, and the person who answers is the interviewee.
• Some basic business- or product-related questions are asked and noted down in the form of notes,
audio, or video, and this data is stored for processing.
• Interviews can be both structured and unstructured, such as personal interviews or formal interviews
conducted over telephone, face to face, by email, etc.

2. Survey method:
• The survey method is a research process in which a list of relevant questions is asked and the answers are
noted down in the form of text, audio, or video.
• Surveys can be conducted both online and offline, for example through website forms and email, and the
survey answers are then stored for analysing the data. Examples are online surveys or surveys through
social media polls.
3. Observation method:
• The observation method is a method of data collection in which the researcher keenly observes the
behaviour and practices of the target audience using some data collecting tool and stores the observed
data in the form of text, audio, video, or any raw formats.
• In this method, data may also be collected by posing a few questions to the participants. For example,
observing a group of customers and their behaviour towards the products. The data obtained is then sent
for processing.
4. Experimental method:
• The experimental method is the process of collecting data through performing experiments, research, and
investigation.

• The most frequently used experimental designs are CRD (Completely Randomized Design), RBD (Randomized Block Design), LSD (Latin Square Design), and FD (Factorial Design).

2. Secondary data:
Secondary data is data which has already been collected and is reused for some valid purpose. This type of
data is derived from previously recorded primary data, and it has two types of sources, named internal
sources and external sources.
Internal source:
These types of data can easily be found within the organization, such as market records, sales records, transactions,
customer data, accounting resources, etc. The cost and time consumed in obtaining internal sources is low.
▪ Accounting resources- These give a great deal of information which can be used by the marketing researcher.
They give information about internal factors.
▪ Sales Force Report- It gives information about the sales of a product. The information provided is from
outside the organization.
▪ Internal Experts- These are the people heading the various departments. They can give an idea of
how a particular thing is working.
▪ Miscellaneous Reports- This is the information you get from operational reports. If the data
available within the organization is unsuitable or inadequate, the marketer should extend the search to
external secondary data sources.
External source:
The data which can’t be found at internal organizations and can be gained through external third-partyresources is
external source data. The cost and time consumption are more because this contains a huge amount of data.
Examples of external sources are Government publications, news publications, Registrar General of India,
planning commission, international labour bureau, syndicate services, and other non-governmental publications.
1. Government Publications-
▪ Government sources provide an extremely rich pool of data for researchers. In addition, much of this
data is available free of cost on internet websites. There are a number of government agencies
generating data.
These include: Registrar General of India- an office which generates demographic data, including
details of gender, age, occupation, etc.
2. Central Statistical Organization-
▪ This organization publishes the national accounts statistics. These contain estimates of national income for
several years, growth rates, and the rates of major economic activities. The Annual Survey of Industries is also
published by the CSO.
▪ It gives information about the total number of workers employed, production units, material used and
value added by the manufacturer.
3. Director General of Commercial Intelligence-
▪ This office operates from Kolkata. It gives information about foreign trade i.e. import and export. These
figures are provided region-wise and country-wise.
4. Ministry of Commerce and Industries-
▪ This ministry, through the office of the Economic Advisor, provides information on the wholesale price index.
These indices may relate to a number of sectors like food, fuel, power, food grains, etc.
▪ It also generates the All India Consumer Price Index numbers for industrial workers, urban non-manual
employees, and agricultural labourers.
5. Planning Commission-
▪ It provides the basic statistics of Indian Economy.
6. Reserve Bank of India-
▪ This provides information on Banking Savings and investment. RBI also prepares currency and finance
reports.
7. Labour Bureau-
▪ It provides information on skilled, unskilled, white collared jobs etc.
8. National Sample Survey-
▪ This is done by the Ministry of Planning and it provides social, economic, demographic, industrial and
agricultural statistics.
9. Department of Economic Affairs-
▪ It conducts economic survey and it also generates information on income, consumption, expenditure,
investment, savings and foreign trade.
10. State Statistical Abstract-
▪ This gives information on various types of activities related to the state like - commercial activities,
education, occupation etc.
11. Non-Government Publications-
▪ These include publications of various industrial and trade associations, such as The Indian Cotton Mill
Association and various chambers of commerce.
12. The Bombay Stock Exchange-
▪ It publishes a directory containing financial accounts, key profitability figures, and other relevant matter.
▪ Various associations of press media.
• Export Promotion Council.
• Confederation of Indian Industries (CII)
• Small Industries Development Board of India
• Different mills like woollen mills, textile mills, etc.
▪ The only disadvantage of the above sources is that the data may be biased, since publishers are likely to
gloss over their negative points.
13. Syndicate Services-
▪ These services are provided by certain organizations which collect and tabulate the marketing
information on a regular basis for a number of clients who are the subscribers to these services.
▪ These services are useful in television viewing, movement of consumer goods etc.
▪ These syndicate services provide information and data from both households as well as institutions.

In collecting data from households, they use three approaches:

Survey- They conduct surveys regarding lifestyle, sociographics, and general topics.
Mail Diary Panel- It may be related to two fields: purchases and media.
Electronic Scanner Services- These are used to generate data on volume.

They collect data for institutions from:
• Wholesalers
• Retailers, and
• Industrial firms
▪ Various syndicate services are the Operations Research Group (ORG) and The Indian Marketing
Research Bureau (IMRB).
Importance of Syndicate Services:
• Syndicate services are becoming popular since the constraints on decision making are changing and we
need more specific decision-making in the light of a changing environment. Also, syndicate services are
able to provide information to industries at a low unit cost.
Disadvantages of Syndicate Services:
• The information provided is not exclusive; by contrast, a number of research agencies provide customized
services which suit the requirements of each individual organization.
International Organization-
These includes
• The International Labour Organization (ILO):
• It publishes data on the total and active population, employment, unemployment, wages and
consumer prices.
• The Organization for Economic Co-operation and development (OECD):
• It publishes data on foreign trade, industry, food, transport, and science and technology.
• The International Monetary Fund (IMF):
• It publishes reports on national and international foreign exchange regulations.
Other sources:
Sensors’ data: With the advancement of IoT devices, the sensors in these devices collect data which can be used
for sensor data analytics to track the performance and usage of products.
Satellite data: Satellites collect a large volume of images and data, in terabytes per day, through surveillance cameras,
which can be used to extract useful information.
Web traffic: Due to fast and cheap internet facilities, many formats of data uploaded by users on different
platforms can be collected, with their permission, for data analysis. Search engines also provide data through the
keywords and queries searched most often.
Export all the data onto the cloud, e.g. Amazon Web Services S3:
We usually export our data to the cloud for purposes such as safety, multiple access, and real-time simultaneous analysis.
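As a rough illustration, the short Python sketch below uploads a local file to S3 with the boto3 library. The bucket name and file paths are placeholders, not values taken from these notes, and AWS credentials are assumed to already be configured in the environment.

import boto3

s3 = boto3.client("s3")  # credentials are read from the AWS configuration/environment

# Upload a local CSV so it can be shared and analysed from the cloud
s3.upload_file(
    Filename="sales_data.csv",      # local file (hypothetical)
    Bucket="my-analytics-bucket",   # S3 bucket (hypothetical)
    Key="raw/sales_data.csv",       # object key inside the bucket
)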
GPS data : GPS (Global Positioning System) data refers to geographic information derived from satellites that
determine the location of a device using coordinates (latitude, longitude, and altitude), along with timestamps,
speed, and direction. This data is widely used in navigation for personal devices and vehicles, tracking logistics
and fleet management, creating geofences for location-based alerts, mapping and surveying for geographic
information systems (GIS), and optimizing agricultural practices through precision farming. By collecting and
processing GPS data, organizations can analyze movement patterns, visualize routes on maps, and integrate it with
other datasets for enhanced insights and decision-making across various industries.

Data Management:
Data management is the practice of collecting, keeping, and using data securely, efficiently, and cost-effectively.
The goal of data management is to help people, organizations, and connected things optimize the use of data
within the bounds of policy and regulation so that they can make decisions and take actions that maximize the
benefit to the organization.
Managing digital data in an organization involves a broad range of tasks, policies, procedures, and practices. The
work of data management has a wide scope, covering factors such as how to:
• Create, access, and update data across a diverse data tier
• Store data across multiple clouds and on premises
• Provide high availability and disaster recovery
• Use data in a growing variety of apps, analytics, and algorithms
• Ensure data privacy and security
• Archive and destroy data in accordance with retention schedules and compliance requirements
What is Cloud Computing?
Cloud computing is a term that refers to storing and accessing data over the internet. It does not store any data on
the hard disk of your personal computer; in cloud computing, you access data from a remote server.
Service models of cloud computing are the reference models on which cloud computing is based.
These can be categorized into three basic service models as listed below:
1. INFRASTRUCTURE as a SERVICE (IaaS)
IaaS provides access to fundamental resources such as physical machines, virtual machines, virtual storage,
etc.
2. PLATFORM as a SERVICE (PaaS)
PaaS provides the runtime environment for applications, development and deployment tools, etc.
3. SOFTWARE as a SERVICE (SaaS)
The SaaS model allows end users to use software applications as a service.

AWS is one of the popular platforms providing the above service models; Amazon Web Services is a widely used
service platform for data management.

Data Quality:
What is Data Quality?
There are many definitions of data quality. In general, data quality is the assessment of how usable the data is
and how well it fits its serving context.

Why Data Quality is Important?


Enhancing data quality is a critical concern, as data is considered the core of all activities within organizations;
poor data quality leads to inaccurate reporting, which results in inaccurate decisions and, ultimately, economic damage.

Many factors help measure data quality, such as:


Data Accuracy: Data is accurate when the data values stored in the database correspond to real-world values.
Data Uniqueness: A measure of unwanted duplication existing within or across systems for a particular field,
record, or data set.
Data Consistency: The degree to which the data does not violate semantic rules defined over the dataset.
Data Completeness: The degree to which values are present in a data collection.
Data Timeliness: The extent to which the age of the data is appropriate for the task at hand.
Other factors can also be taken into consideration, such as availability, ease of manipulation, and believability.
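As a small illustration, the sketch below (assuming pandas and a hypothetical patients.csv file with a blood_pressure column) computes simple scores for some of these factors: completeness, uniqueness, and a basic range check as a rough proxy for accuracy and consistency.

import pandas as pd

df = pd.read_csv("patients.csv")  # hypothetical dataset

completeness = df.notna().mean()           # share of non-missing values per column
uniqueness = 1 - df.duplicated().mean()    # share of non-duplicated records
# a simple semantic rule: blood pressure readings must lie in a plausible range
plausible_bp = df["blood_pressure"].between(40, 300).mean()

print(completeness, uniqueness, plausible_bp, sep="\n")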

Data Pre-processing:

Data preprocessing is a crucial step in machine learning and data analysis, where raw data is prepared for further
processing. It involves cleaning the data by handling missing values, removing duplicates, and correcting errors. The
data may also need to be transformed by scaling, normalizing, or encoding categorical variables. Additionally, feature
selection or extraction is often performed to reduce the dataset's dimensionality and improve the model's efficiency.
Preprocessing ensures that the data is in a suitable format, which helps in building more accurate and reliable machine
learning models.

1. Missing Values: These occur when some data points are absent from the dataset. For example, in a patient
health record, if the blood pressure column is empty for some patients, that's a missing value. Techniques
like filling with the mean value or using algorithms to predict the missing data can address this.
2. Duplicate Values: These are repeated entries in the dataset. For instance, if the same patient’s data is
accidentally recorded twice, this is considered a duplicate. Duplicates can skew the analysis and are typically
removed to maintain data integrity.
3. Noise: Noise refers to irrelevant or random errors in data. For example, in sensor data, sudden spikes or
fluctuations in readings, like an unusually high temperature reading, can be considered noise. Smoothing
techniques can help reduce noise in the data.
4. Outliers: These are extreme values that significantly differ from the rest of the data. For instance, if most
patients have a blood pressure reading around 120-130, but one patient has a reading of 250, this would be
an outlier. Outliers can distort the results of data analysis and may need to be handled or removed based on
the context.
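The pandas sketch below walks through the four issues just listed; the file and column names are illustrative only, not taken from a real dataset.

import pandas as pd

df = pd.read_csv("health_records.csv")   # hypothetical raw data

# 1. Missing values: fill gaps in blood pressure with the column mean
df["blood_pressure"] = df["blood_pressure"].fillna(df["blood_pressure"].mean())

# 2. Duplicate values: drop repeated rows
df = df.drop_duplicates()

# 3. Noise: smooth a noisy temperature sensor column with a rolling average
df["temperature_smooth"] = df["temperature"].rolling(window=5, min_periods=1).mean()

# 4. Outliers: keep blood pressure readings within 3 standard deviations of the mean
mean, std = df["blood_pressure"].mean(), df["blood_pressure"].std()
df = df[(df["blood_pressure"] - mean).abs() <= 3 * std]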

Data Processing:
Data processing occurs when data is collected and translated into usable information. Usually performed by a
data scientist or a team of data scientists, it is important for data processing to be done correctly so as not to
negatively affect the end product, or data output.
Data processing starts with data in its raw form and converts it into a more readable format (graphs, documents,
etc.), giving it the form and context necessary to be interpreted by computers and utilized by employees throughout
an organization.

Six stages of data processing:


1. Data collection
Collecting data is the first step in data processing. Data is pulled from available sources, including data lakes and
data warehouses. It is important that the available data sources are trustworthy and well built so that the data collected
(and later used as information) is of the highest possible quality.

2. Data preparation
Once the data is collected, it then enters the data preparation stage. Data preparation, often referred to as
“pre-processing”, is the stage at which raw data is cleaned up and organized for the following stage of data processing.
During preparation, raw data is diligently checked for any errors. The purpose of this step is to eliminate bad data
(redundant, incomplete, or incorrect data) and begin to create high-quality data for the best business intelligence.

3. Data input
The clean data is then entered into its destination (perhaps a CRM like Salesforce or a data warehouse like
Redshift), and translated into a language that it can understand. Data input is the first stage in which raw data
begins to take the form of usable information.

4. Processing
During this stage, the data input to the computer in the previous stage is actually processed for interpretation.
Processing is done using machine learning algorithms, though the process itself may vary slightly depending on
the source of the data being processed (data lakes, social networks, connected devices, etc.) and its intended use
(examining advertising patterns, medical diagnosis from connected devices, determining customer needs, etc.).

5. Data output/interpretation
The output/interpretation stage is the stage at which data is finally usable by non-data scientists. It is translated,
readable, and often in the form of graphs, videos, images, plain text, etc.

6. Data storage
The final stage of data processing is storage. After all of the data is processed, it is then stored for future use.
While some information may be put to use immediately, much of it will serve a purpose later on. When data is
properly stored, it can be quickly and easily accessed by members of the organization when needed.

Data Processing Chain

Steps for Data Analytics

Data analytics involves several key steps to extract insights from data. Here's a simplified breakdown:
1. Data Collection: Gathering raw data from various sources such as databases, surveys, or sensors. This data
could be structured (e.g., spreadsheets) or unstructured (e.g., text, images).
2. Data Cleaning: Ensuring data quality by handling missing values, removing duplicates, correcting errors,
and dealing with inconsistencies. This step prepares the data for accurate analysis.
3. Data Exploration (EDA): Analyzing the data to understand its basic structure and identify patterns or
trends. Techniques like data visualization, summary statistics, and correlation analysis are used to get
insights into the relationships within the data.
4. Data Transformation: Preprocessing the data to make it suitable for analysis. This includes normalizing,
scaling, encoding categorical variables, and selecting or creating new features (feature engineering).
5. Data Modeling: Applying statistical models or machine learning algorithms to analyze the data. This step
includes training models, making predictions, and testing hypotheses to draw conclusions from the data.
6. Data Interpretation: Interpreting the results of the analysis by evaluating model performance,
understanding the outputs, and drawing actionable insights from the data.
7. Data Visualization: Presenting the results through charts, graphs, and dashboards to communicate findings
clearly and effectively to stakeholders.
8. Decision Making: Using the insights derived from the analysis to make data-driven decisions and improve
processes, strategies, or outcomes.
Each step is crucial in transforming raw data into valuable insights for making informed decisions.
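As a rough sketch of how steps 2 to 5 might look in code (assuming scikit-learn and a hypothetical customers.csv with a binary churned column; not a prescribed implementation):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Cleaning: drop duplicates and rows with missing values
df = pd.read_csv("customers.csv").drop_duplicates().dropna()

# Transformation: encode categorical variables and separate features from the target
X = pd.get_dummies(df.drop(columns=["churned"]))
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # scaling
X_test = scaler.transform(X_test)

# Modeling and interpretation: fit a model and evaluate it
model = LogisticRegression().fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))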

Data types and Variable

Data types and variables are fundamental concepts in programming and data analysis. Here's a detailed explanation
of both:
Data Types
Data types define the kind of data a variable can hold, determining how the data is stored, processed, and
interpreted by programs. Different data types are used based on the nature of the information being stored.
1. Numerical Data Types
• Integer (int): Whole numbers without decimal points. For example, 5, -10, 100.
• Floating Point (float): Numbers with decimals. For example, 5.67, -3.14, 0.001.
• Double: Similar to float, but with double precision (more accuracy). Used when greater precision is needed
in calculations.
2. Categorical Data Types
• Boolean (bool): Represents two values, either True or False. Commonly used in logical operations.
• Character (char): A single character, such as 'A', '1', or '#'. In some programming languages, this data type
stores one letter or symbol.
• String (str): A sequence of characters (letters, numbers, symbols). For example, "Hello", "12345", or
"A@12". It is used for textual data.
3. Complex Data Types
• Array/List: A collection of items (can be numbers, strings, etc.), typically of the same type, stored in an
ordered sequence. For example, [1, 2, 3] or ['apple', 'banana'].
• Tuple: Similar to a list but immutable (cannot be changed once created). For example, (1, 2, 3) or ('A', 'B',
'C').
• Set: A collection of unique items, unordered and without duplicates. For example, {1, 2, 3} or {'apple',
'orange'}.
• Dictionary: A collection of key-value pairs, where each key is associated with a specific value. For
example, {'name': 'Alice', 'age': 25}.
4. Specialized Data Types
• Date/Time: Stores date and time values. For example, 2024-10-17 or 12:30:00. Used for temporal data in
various applications.
• Binary: Stores binary data like images, audio files, or any non-textual data in digital form (usually in
bytes).
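For quick reference, here is how these data types look in Python syntax (the values are arbitrary examples):

from datetime import datetime

age = 25                                     # int
price = 5.67                                 # float
is_active = True                             # bool
initial = "A"                                # single character (Python has no separate char type)
greeting = "Hello"                           # str
scores = [1, 2, 3]                           # list (ordered, mutable)
point = (1, 2, 3)                            # tuple (ordered, immutable)
fruits = {"apple", "orange"}                 # set (unique items, unordered)
person = {"name": "Alice", "age": 25}        # dictionary (key-value pairs)
created_at = datetime(2024, 10, 17, 12, 30)  # date/time value
raw_bytes = b"\x89PNG"                       # binary data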
Variables
A variable is a symbolic name that holds data. Variables are assigned values and act as containers that store
information during program execution. The value assigned to a variable can change, allowing the program to
manipulate data dynamically.
1. Variable Declaration and Assignment
• Declaration: Creating a variable without assigning a value. (In some languages, this is optional, and
variables are implicitly declared when assigned.)
• Assignment: Assigning a value to a variable using the = operator. For example, in Python:
x = 5             # Integer assignment
y = 3.14          # Float assignment
name = "Alice"    # String assignment
is_active = True  # Boolean assignment
2. Types of Variables
• Local Variable: Declared inside a function and accessible only within that function. Local variables exist
only during the function execution.
o Example:
def example():
    local_var = 10  # Local variable
• Global Variable: Declared outside of all functions and accessible throughout the program. Global variables
can be accessed and modified by any function.
o Example:
global_var = 100  # Global variable

def example():
    print(global_var)
• Instance Variable: Variables declared inside a class and associated with an instance of that class. Each
object of the class can have different values for these variables.
o Example:
class Car:
    def __init__(self, model):
        self.model = model  # Instance variable

car1 = Car("Toyota")
car2 = Car("Honda")
• Class Variable: Declared inside a class but outside any instance method. They are shared across all
instances of the class.
o Example:
class Car:
    wheels = 4  # Class variable (same for all instances)

Categorization of Variables (In Data Analysis)


In data analysis, variables are also categorized based on the type of data they hold and their role in analysis:

1. Quantitative (Numerical) Variables


• Continuous Variables: Can take any value within a range. For example, height (e.g., 172.5 cm),
temperature, and weight. These variables can have decimal values.
• Discrete Variables: Can only take specific values, often integers. For example, the number of children in a
family (1, 2, 3), or the number of cars sold.
2. Qualitative (Categorical) Variables
• Nominal Variables: Categorical variables without any inherent order. For example, gender (Male/Female),
colors (Red, Blue, Green), or blood types (A, B, O).
• Ordinal Variables: Categorical variables that have a defined order or ranking. For example, customer
satisfaction levels (Low, Medium, High) or education levels (High School, Bachelor’s, Master’s).
3. Dependent and Independent Variables
• Independent Variables (Predictors): Variables that are manipulated or observed to see their effect on the
dependent variable. In an experiment, they are the inputs.
o Example: In studying the effect of study hours on exam scores, "study hours" is the independent
variable.
• Dependent Variables (Response): The variable that is measured and is expected to change as a result of
changes in the independent variable.
o Example: In the same study, "exam score" is the dependent variable, as it depends on the study
hours.
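A tiny pandas sketch (with made-up numbers) of framing study hours as the independent variable and exam score as the dependent variable:

import pandas as pd

data = pd.DataFrame({
    "study_hours": [1, 2, 3, 4, 5],       # independent variable (predictor)
    "exam_score": [52, 58, 65, 70, 78],   # dependent variable (response)
})
# strength and direction of the relationship between the two variables
print(data["study_hours"].corr(data["exam_score"]))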

Data Analytics is the process of examining datasets to draw meaningful insights and conclusions. It
involves using statistical techniques, algorithms, and software to analyze raw data and uncover patterns,
correlations, trends, and predictions that help in decision-making. Data analytics is crucial for various
fields, including healthcare, where it turns large amounts of medical data into actionable insights for
improving patient care, operational efficiency, and treatment outcomes.
There are four main types of data analytics, each serving a distinct purpose, with applications in
healthcare:
1. Descriptive Analytics: This type focuses on summarizing historical data to understand what
has happened over a certain period. In healthcare, descriptive analytics can help hospitals track
patient admission rates, treatment outcomes, or the spread of diseases over time. For example, a
hospital might analyze data to see how many patients were admitted with respiratory infections
during the past year to evaluate trends and allocate resources more effectively.

2. Diagnostic Analytics: Going a step further from descriptive analytics, diagnostic analytics helps
explain why something happened by identifying relationships, anomalies, and patterns within the
data. In healthcare, it can help physicians and healthcare providers understand why certain
medical treatments work better for some patients than others. For instance, diagnostic analytics
could be used to analyze why certain patient groups have a higher rate of post-surgery
complications by looking into patient history, surgical procedures, and post-operative care.

3. Predictive Analytics: This type of analytics forecasts future outcomes based on historical data.
In healthcare, predictive analytics is often used to predict disease outbreaks, patient readmission
rates, or potential complications in high-risk patients. For example, predictive models might be
employed to forecast which patients are likely to develop chronic conditions like diabetes based
on their medical history, lifestyle, and genetic data, allowing healthcare providers to offer
preventive care.

4. Prescriptive Analytics: The most advanced form, prescriptive analytics, not only predicts future
outcomes but also provides recommendations on the best course of action. In healthcare,
prescriptive analytics can help optimize treatment plans by recommending the most effective
therapies or interventions for individual patients. For instance, a hospital might use prescriptive
analytics to determine the best staffing levels during flu season based on predicted patient influx,
ensuring the right number of healthcare professionals are available to handle increased demand.

In summary, these types of data analytics play a vital role in transforming raw healthcare data into
insights that improve patient care, enhance operational efficiency, and drive better decision-making
across healthcare organizations.

Business Data Analytics is the practice of using data analysis techniques to examine business data
and extract actionable insights that help organizations make informed decisions. It involves collecting,
processing, and analyzing vast amounts of data generated by businesses in areas such as sales,
marketing, operations, finance, and customer behavior. By leveraging business data analytics, companies
can identify trends, assess performance, optimize processes, and gain a competitive advantage.
Business data analytics can be categorized into several types, each serving a different purpose:
1. Descriptive Analytics: This type helps businesses understand what has happened in the past.
It focuses on summarizing historical data to provide insights into key business metrics like sales
figures, customer demographics, or product performance. For example, a retail company may use
descriptive analytics to examine last quarter's sales data and determine which products were the
best sellers or which regions had the highest sales.

2. Diagnostic Analytics: This type digs deeper into the data to understand the reasons behind
past outcomes. For businesses, diagnostic analytics can help identify the causes of
underperformance or success. For instance, a company might use diagnostic analytics to
understand why sales of a particular product dropped by analyzing factors such as customer
reviews, competitor activity, and marketing efforts. By identifying these causes, businesses can
adjust strategies accordingly.

3. Predictive Analytics: Predictive analytics helps businesses anticipate future trends based on
historical data. It uses machine learning algorithms, statistical models, and data mining
techniques to forecast what might happen next. For example, an e-commerce company might
predict future sales trends based on previous customer behavior, market conditions, and seasonal
patterns. This allows the company to stock inventory, plan marketing campaigns, and set sales
targets more effectively.

4. Prescriptive Analytics: This advanced type of analytics goes beyond prediction to suggest
actionable recommendations. Businesses use prescriptive analytics to determine the best course
of action in various scenarios. For instance, a logistics company could use prescriptive analytics to
optimize its delivery routes, considering factors like traffic, weather, and fuel costs, to reduce
delivery times and operational costs.

Through business data analytics, organizations can improve decision-making, enhance customer
satisfaction, streamline operations, and drive revenue growth. For example, a marketing team can use
analytics to tailor campaigns to specific customer segments based on their behavior and preferences,
increasing conversion rates. Similarly, financial teams can use analytics to assess risk and make better
investment decisions. Overall, business data analytics plays a crucial role in helping companies harness
the power of data to achieve their goals.

Data modelling
Data is changing the way the world functions. It can be a study about disease cures, a company’s
revenue strategy, efficient building construction, or those targeted ads on your social media page; it is all
due to data.
This data refers to information that is machine-readable as opposed to human-readable. For example,
customer data is meaningless to a product team if it does not point to specific product purchases.
Similarly, a marketing team will have no use for that same data if the IDs do not relate to specific price
points during buying.
This is where Data Modeling comes in. It is the process that assigns relational rules to data. A Data
Model un-complicates data into useful information that organizations can then use for decision-making
and strategy. Before getting started with what is data modelling, let’s understand what is a Data Model
in detail.

What is a Data Model?


Good data allows organizations to establish baselines, benchmarks, and goals to keep moving forward.
In order for data to allow this measuring, it has to be organized through data description, data
semantics, and consistency constraints of data. A Data Model is this abstract model that allows the
further building of conceptual models and to set relationships between data items.
An organization may have a huge data repository; however, if there is no standard to ensure the basic
accuracy and interpretability of that data, then it is of no use. A proper data model certifies actionable
downstream results, knowledge of best practices regarding the data, and the best tools to access it.

What is Data Modeling?


Data Modeling is the process of simplifying the diagram or data model of a software system by applying
certain formal techniques. It involves expressing data and information through text and symbols. The
data model provides the blueprint for building a new database or reengineering legacy applications.
In the light of the above, it is the first critical step in defining the structure of available data. Data
Modeling is the process of creating data models by which data associations and constraints are described
and eventually coded to reuse. It conceptually represents data with diagrams, symbols, or text to
visualize the interrelation.
Data Modeling thus helps to increase consistency in naming, rules, semantics, and security. This, in turn,
improves data analytics. The emphasis is on the need for availability and organization of data,
independent of the manner of its application.
Data Modeling Process
Data modeling is a process of creating a conceptual representation of data objects and their relationships
to one another. The process of data modeling typically involves several steps, including requirements
gathering, conceptual design, logical design, physical design, and implementation. During each step of
the process, data modelers work with stakeholders to understand the data requirements, define the
entities and attributes, establish the relationships between the data objects, and create a model that
accurately represents the data in a way that can be used by application developers, database
administrators, and other stakeholders.
Levels Of Data Abstraction
Data modeling typically involves several levels of abstraction, including:
• Conceptual level: The conceptual level involves defining the high-level entities and relationships
in the data model, often using diagrams or other visual representations.

• Logical level: The logical level involves defining the relationships and constraints between the
data objects in more detail, often using data modeling languages such as SQL or ER diagrams.

• Physical level: The physical level involves defining the specific details of how the data will be
stored, including data types, indexes, and other technical details.

Data Modeling Examples


The best way to picture a data model is to think about a building plan of an architect. An architectural
building plan assists in putting up all subsequent conceptual models, and so does a data model.
These data modeling examples will clarify how data models and the process of data modeling highlights
essential data and the way to arrange it.
1. ER (Entity-Relationship) Model
This model is based on the notion of real-world entities and relationships among them. It creates an
entity set, relationship set, general attributes, and constraints.
Here, an entity is a real-world object; for instance, an employee is an entity in an employee database. An
attribute is a property with value, and entity sets share attributes of identical value. Finally, there is the
relationship between entities.
2. Hierarchical Model
This data model arranges the data in the form of a tree with one root, to which other data is connected.
The hierarchy begins with the root and extends like a tree. This model effectively explains several
real-time relationships with a single one-to-many relationship between two different kinds of data.
For example, one supermarket can have different departments and many aisles. Thus, the ‘root’ node
supermarket will have two ‘child’ nodes of (1) Pantry, (2) Packaged Food.
3. Network Model
This model enables many-to-many relationships among the connected nodes. The data is arranged in a
graph-like structure, and here ‘child’ nodes can have multiple ‘parent’ nodes. The parent nodes are
known as owners, and the child nodes are called members.


4. Relational Model
This popular data model example arranges the data into tables. The tables have columns and rows, each
cataloging an attribute present in the entity. It makes relationships between data points easy to identify.
For example, e-commerce websites can process purchases and track inventory using the relational
model.
5. Object-Oriented Database Model
This data model defines a database as an object collection, or recyclable software components, with
related methods and features.

6. Object-Relational Model
This model is a combination of an object-oriented database model and a relational database model.
Therefore, it blends the advanced functionalities of the object-oriented model with the ease of the
relational data model.
The data modeling process helps organizations to become more data-driven. This starts with cleaning
and modeling data. Let us look at how data modeling occurs at different levels.
……………………………………………………………………………………………………………………

Database modeling is an essential process in the development and management of databases. It
involves designing the logical structure of a database, which defines how data is stored, accessed, and
managed. Different types of database models are used depending on the specific needs of the system,
and these models determine the relationships between data, the structure of the data, and how the
database operates. The most widely used types of database models are the hierarchical model, network
model, relational model, object-oriented model, and entity-relationship (ER) model, each offering unique
advantages and characteristics.

The hierarchical model organizes data in a tree-like structure, where each record has a single parent
but can have multiple children, forming a parent-child hierarchy. This model is useful in situations where
data follows a one-to-many relationship, like in organizational structures, file directories, or categories.
For instance, in a company's employee structure, one department head may have several employees
working under them. The main advantage of this model is its efficiency in handling large volumes of
hierarchical data and read-heavy operations. However, its rigid structure makes it difficult to manage
complex relationships, as it does not support many-to-many relationships, limiting flexibility.

The network model is an extension of the hierarchical model that allows more complex relationships,
particularly many-to-many relationships. Instead of a strict parent-child hierarchy, each record (node) in
the network model can have multiple parent and child nodes. This allows greater flexibility in
representing real-world relationships between data, such as one supplier providing multiple products to
various stores, and each store receiving products from several suppliers. While more flexible, the
network model is also more complex to design and maintain. Its use is typically suited to situations
where data relationships are complex, but it requires specialized knowledge for effective implementation
and management.

The relational model is the most popular and widely used database model. It organizes data into
tables (also called relations) where each row represents a record, and each column represents an
attribute of that record. The strength of the relational model lies in its simplicity and flexibility, as data in
one table can be linked to data in another table through the use of primary and foreign keys. This model
is particularly well-suited for applications that need to perform a variety of operations on data, including
querying, updating, and deleting records. The relational model allows for the use of Structured Query
Language (SQL) to manage and manipulate data efficiently. It also enforces data integrity and
consistency through rules like normalization and constraints. While highly efficient for most applications,
relational databases can become slow when dealing with extremely large datasets or highly complex
queries, making scalability a challenge in certain contexts.
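To make the key mechanism concrete, here is a minimal sqlite3 sketch of two tables linked by a primary key / foreign key pair and queried with SQL; the table and column names are invented for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),  -- foreign key
    amount      REAL
);
INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 99.5), (12, 2, 40.0);
""")

# Join the two relations through the key relationship and aggregate per customer
for row in conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.name
"""):
    print(row)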

The object-oriented model integrates the concepts of object-oriented programming (OOP) into
databases, allowing data to be stored in the form of objects, as seen in OOP languages like Java and
C++. In this model, data is not only represented by attributes but also includes methods or behaviors
that can manipulate the data. Each object is an instance of a class, and the relationships between
objects are based on inheritance and other OOP principles. This model is particularly well-suited for
applications that require complex data and operations, such as computer-aided design (CAD) systems,
multimedia databases, and simulations. While the object-oriented model provides a natural mapping
between application code and data storage, it is not as commonly used as the relational model due to its
complexity and the need for specialized tools.

Finally, the entity-relationship (ER) model is more of a conceptual framework used during the design
phase of a database. It represents data as entities (objects) and relationships between those entities.
For example, in a university database, entities might include students, courses, and instructors, while the
relationships might represent the enrollment of students in courses or the assignment of instructors to
teach certain courses. The ER model is visualized through ER diagrams, which help database designers
conceptualize the structure of the database before implementing it using a physical model like the
relational model. The ER model is valuable for planning and designing databases in a clear and organized
way, but it is not directly implemented in a database management system.

Each database modeling type serves specific needs, with the relational model being the most versatile
and widely adopted, while hierarchical and network models cater to more specialized data structures.
The object-oriented model is useful for applications that deal with complex objects, and the ER model
provides a strong foundation for database design. The choice of database model depends on the
complexity of the data, the relationships between data elements, and the requirements of the application
or system being developed.

What is data modeling?


The process of creating a visual representation of either part of a system or the entire system to
communicate connections between structures and data points using elements, texts, and symbols.
Q2. What are the types of data models?
There are three types of data models: dimensional, relational, and entity-relationship. These models follow
three approaches: conceptual, logical, and physical. Other data models also exist; however, they are largely
obsolete, such as network, hierarchical, object-oriented, and multi-value.
Q3. What are the types of data modeling techniques?
The following are the types of data modeling techniques: hierarchical, network, relational, object-oriented,
entity-relationship, dimensional, and graph.
Q4. What is the data modeling process?
The first step in the data modeling process is identifying the use cases and logical data models. Then
create a preliminary cost estimation. Identify the data access patterns and technical requirements.
Create DynamoDB data model and queries. Validate the model and review the cost estimation.
Q5. How can AWS help with data modeling?
You can use Amazon RDS (relational database service) to implement relational data models, Amazon
Neptune to implement graph data models, and AWS Amplify DataStore for faster and easier data
modeling to build web and mobile applications.
Q6. What are data modeling concepts?
Data modeling concepts answer the question of WHAT the system contains. A conceptual model helps to
organize, scope, and define business concepts and rules. These concepts are created by data architects
and business stakeholders.
Q7. Why is data modeling important?
Organized and comprehensive data modeling is crucial to creating simplified logical and physical
database designs. It helps reduce storage requirements, eliminate redundancy, and enable efficient data
retrieval.
Q8. What are the types of data modeling?
The predominant data modeling types are hierarchical, network, relational, and entity-relationship. These
models help teams to manage data and convert them into valuable business information.

Q9. What are the three levels of data abstraction?


Three levels of data abstraction are physical or internal, logical or conceptual, and view or external. The
lowest form is physical, and the highest is the view. On a logical level, the information is stored in the
database in the form of tables.

Regression
Regression is a type of statistical method used in machine learning and data analysis to model and analyze the
relationship between a dependent variable (target) and one or more independent variables (predictors). The main
goal of regression is to predict the value of the dependent variable based on the independent variables.
Regression helps to understand:
• How the dependent variable changes when the independent variables are varied.
• The strength and nature (positive or negative) of the relationship between variables.
There are different types of regression, with linear and logistic regression being the most common:
• Linear Regression: Used for predicting continuous outcomes (e.g., temperature, sales).
• Logistic Regression: Used for predicting categorical outcomes (e.g., yes/no, true/false).
In simple terms, regression finds the "best-fit" relationship between variables so that future predictions can be made
more accurately.

1. Linear Regression:
o Purpose: Predicts a continuous output (e.g., predicting house prices based on size).
o Model: Establishes a relationship between the independent variables (input) and a continuous
dependent variable (output) using a straight line (linear relationship).
o Equation: y = mx + b, where:
▪ y is the predicted output,
▪ m is the slope (the change in y for each unit change in x),
▪ x is the input variable,
▪ b is the intercept (the value of y when x = 0).
o Usage: Best used when the relationship between the variables is linear (straight line).
2. Logistic Regression:
o Purpose: Predicts a binary outcome (e.g., yes/no, true/false, or success/failure).
o Model: Unlike linear regression, it predicts probabilities that an outcome belongs to a particular
class (usually 0 or 1).
o Equation: It uses the logistic function to limit the output between 0 and 1:
P(y = 1) = 1 / (1 + e^(-(mx + b))). Here, e is the base of the natural logarithm, and the
equation models the probability of the outcome being 1.
o Usage: Used when the output is categorical, usually binary.
Both are widely used in predictive modeling, with linear regression focusing on continuous outcomes and logistic
regression on categorical outcomes.
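A short scikit-learn sketch of both regression types, using small made-up datasets (house size vs. price for linear regression, study hours vs. pass/fail for logistic regression):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Linear regression: predict a continuous value (price) from an input (size in square metres)
sizes = np.array([[50], [80], [120], [150]])   # independent variable x
prices = np.array([150, 220, 310, 390])        # dependent variable y (in thousands)
lin = LinearRegression().fit(sizes, prices)
print("slope m:", lin.coef_[0], "intercept b:", lin.intercept_)
print("predicted price for 100 sqm:", lin.predict([[100]])[0])

# Logistic regression: predict a binary outcome (pass/fail) from study hours
hours = np.array([[1], [2], [3], [4], [5], [6]])
passed = np.array([0, 0, 0, 1, 1, 1])
log = LogisticRegression().fit(hours, passed)
print("P(pass | 3.5 hours):", log.predict_proba([[3.5]])[0, 1])

The fitted slope and intercept correspond to m and b in the linear equation above, while predict_proba returns the logistic probability P(y = 1) for a new input.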
