Unit 1 R

Uploaded by

Madhav Gupta

DATA SCIENCE USING R (NOTES)

UNIT-I
Structured data and unstructured data are two primary types of data, distinguished by their
format, organization, and how easily they can be used by algorithms, databases, and other
analytical tools.

Structured Data:

 Format: Structured data is organized into a defined schema or structure, usually in
rows and columns, making it easily searchable and analyzable. Examples include
spreadsheets and relational databases.

 Examples:

o Databases: SQL databases, where data is organized in tables with predefined
columns and data types.

o Spreadsheets: Excel files, where data is stored in cells organized by rows and
columns.

o CSV Files: Data stored as comma-separated values, which can be easily
imported into databases.

o Sensor Data: Data from IoT devices, where readings are taken at regular
intervals and stored in a structured format.

 Characteristics:

o Easily searchable: Due to its organization, querying structured data with tools
like SQL is straightforward.

o Predefined format: The schema is defined before data is input, making it more
rigid but easier to manage.

o Storage: Typically stored in relational databases or data warehouses.

Unstructured Data:

 Format: Unstructured data does not follow a specific format or structure, making it
more flexible but also more challenging to manage and analyze. It often includes text-
heavy data but can also contain multimedia elements.

 Examples:

o Text Documents: Emails, Word documents, and PDFs, where data is presented
in natural language.

o Multimedia: Images, audio files, and videos, which do not have a predefined
data model.
o Social Media Content: Tweets, Facebook posts, and other user-generated
content.

o Web Pages: HTML content that includes text, images, and embedded media.

 Characteristics:

o Complex processing: Requires advanced techniques, like natural language
processing (NLP) or image recognition, to extract meaningful information.

o Flexible format: Data can be stored in various forms without a predefined
structure.

o Storage: Often stored in non-relational databases (NoSQL), data lakes, or file
systems.

Key Differences:

1. Organization:

o Structured data is neatly organized in predefined formats like tables.

o Unstructured data lacks this organization and can come in various forms.
2. Ease of Analysis:

o Structured data can be easily analysed using standard database tools.

o Unstructured data requires more advanced tools and algorithms for analysis.

3. Flexibility:

o Structured data is less flexible due to its predefined schema.

o Unstructured data is highly flexible but harder to manage and analyse.

Use Cases:
 Structured Data: Ideal for use in scenarios where the data format is consistent and
predefined, such as in banking, inventory management, and CRM systems.

 Unstructured Data: Suitable for data-rich applications such as social media
monitoring, customer review analysis, and multimedia content analysis.

Understanding the differences between structured and unstructured data is crucial for
determining the appropriate tools and methodologies for data processing, storage, and analysis.
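
The contrast above can be made concrete with a minimal sketch. It is written in Python using only the standard library, and the sample table and review text are hypothetical. The structured CSV can be queried directly by column name, while the free text has to be tokenized before anything can be counted.

```python
import csv
import io
import re
from collections import Counter

# Structured: a small CSV table with a fixed schema (hypothetical sample data).
csv_text = "id,product,units\n1,pen,10\n2,notebook,4\n3,pen,6\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Every row has the same columns, so a query is a simple filter and sum.
pen_units = sum(int(r["units"]) for r in rows if r["product"] == "pen")

# Unstructured: free text has no columns, so structure must be imposed first,
# here by splitting the text into lowercase words and counting them.
review = "Great pen. The pen writes smoothly, but the notebook paper is thin."
words = re.findall(r"[a-z]+", review.lower())
word_counts = Counter(words)

print(pen_units)           # total units sold for the product "pen"
print(word_counts["pen"])  # how often the word "pen" appears in the review
```

Word counting is only the simplest possible way to impose structure on text; real projects would normally use dedicated NLP tooling.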

Quantitative and qualitative data are two primary types of data used in research, each serving
different purposes and providing different insights.

Quantitative Data:
 Definition: Quantitative data refers to information that can be measured, counted, or
expressed numerically. It quantifies variables and typically involves statistical analysis
to identify patterns, relationships, or trends.

 Examples:

o Height and Weight: Measurements of a person's height in centimetres or
weight in kilograms.

o Age: Number of years since birth.

o Sales Figures: The number of products sold, revenue generated, or profits
earned.

o Temperature: Degrees in Celsius or Fahrenheit.

 Characteristics:

o Numerical: Expressed in numbers, making it easy to quantify and compare.

o Objective: Based on measurable, observable phenomena, making it less prone
to subjective interpretation.

o Statistical Analysis: Often analysed using statistical methods, such as averages,
correlations, regressions, and hypothesis testing.

o Scales of Measurement: Typically measured on interval or ratio scales. Nominal
and ordinal scales describe categorical data, although ordinal ratings are often
coded as numbers.

 Use Cases:

o Surveys: Measuring customer satisfaction on a scale of 1 to 10.

o Experiments: Comparing the effectiveness of two treatments by measuring
outcomes in patients.
o Market Research: Analysing demographic data to understand customer
segments.

Qualitative Data:
 Definition: Qualitative data refers to non-numerical information that describes
qualities, characteristics, or concepts. It is often used to explore complex phenomena
and gain deeper insights into people's experiences, behaviours, or perceptions.

 Examples:

o Interviews: Transcripts of interviews with participants discussing their
experiences.

o Observations: Notes from observing behaviours in a natural setting.

o Open-Ended Survey Responses: Participants' detailed written responses to
questions.

o Textual Content: Books, articles, and social media posts analysed for themes
or patterns.

 Characteristics:

o Descriptive: Captures the richness and complexity of the subject matter, often
in the form of words, images, or objects.

o Subjective: Interpretation is influenced by the researcher's perspective, context,
and the participant's viewpoint.

o Thematic Analysis: Often analysed by identifying themes, patterns, or
narratives within the data.

o Contextual: Provides context and depth to understand the "why" and "how"
behind certain phenomena.

 Use Cases:

o Case Studies: Exploring an individual's or group's experiences in depth.

o Focus Groups: Understanding consumer preferences through group
discussions.

o Content Analysis: Analysing media content for themes or trends.

o Ethnography: Studying cultural practices and behaviours within a community.

Key Differences:

1. Nature of Data:
o Quantitative Data: Numerical, objective, and measurable.

o Qualitative Data: Descriptive, subjective, and interpretative.

2. Analysis Methods:

o Quantitative Data: Statistical methods are used to analyse and interpret the
data.

o Qualitative Data: Thematic analysis, coding, and narrative analysis are
common.

3. Purpose:

o Quantitative Data: Used to quantify variables, test hypotheses, and generalize
findings.

o Qualitative Data: Used to explore underlying reasons, opinions, and
motivations.

4. Outcome:

o Quantitative Data: Results are often presented as graphs, tables, or charts.

o Qualitative Data: Results are presented as themes, narratives, or descriptive
summaries.

Integration:

In many research projects, mixed methods are used, combining both quantitative and
qualitative data to provide a more comprehensive understanding of the research problem.
Quantitative data provides the "what," while qualitative data offers insights into the "why" and
"how."
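
As a rough illustration of the two analysis styles, the Python sketch below summarises numeric scores with a mean and codes open-ended responses into themes. The scores, the responses, and the keyword-to-theme mapping are all hypothetical, and keyword matching is a crude stand-in for the manual coding a researcher would actually do.

```python
from collections import Counter
from statistics import mean

# Quantitative: satisfaction scores (hypothetical) summarised numerically.
scores = [7, 9, 6, 8, 10]
average_score = mean(scores)

# Qualitative: open-ended responses (hypothetical) coded into themes.
responses = [
    "Delivery was slow but the staff were friendly",
    "Friendly support, fair price",
    "Too slow to arrive",
]

# Illustrative assumption: each theme is detected by a few keywords.
theme_keywords = {
    "speed": ["slow", "fast"],
    "service": ["friendly", "support"],
    "cost": ["price"],
}

# Count how many responses mention each theme at least once.
theme_counts = Counter()
for text in responses:
    lowered = text.lower()
    for theme, keywords in theme_keywords.items():
        if any(keyword in lowered for keyword in keywords):
            theme_counts[theme] += 1

print(average_score)
print(dict(theme_counts))
```

The mean answers "what" (how satisfied customers are on average), while the theme counts hint at "why" (speed and service come up most often).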

In data science, understanding the different levels of measurement (nominal, ordinal,
interval, and ratio) is essential for selecting the appropriate statistical methods and analyses.
These levels determine how data can be categorized, ordered, and quantified.

1. Nominal Level:
 Definition: Nominal data represents categories or groups without any inherent order or
ranking. It is the most basic level of measurement, where numbers or labels are used
solely for classification.

 Characteristics:
o Categories: Data is grouped into mutually exclusive categories.

o No Order: There is no logical order or ranking among categories.

o No Mathematical Operations: You can count the frequency of categories, but
you cannot perform mathematical operations like addition or subtraction.

 Examples:

o Gender: Categories such as "Male," "Female," and "Non-binary."

o Eye Colour: Categories such as "Blue," "Brown," and "Green."

o Nationality: Categories such as "American," "French," and "Chinese."

 Usage in Data Science:

o Encoding: Nominal data is often encoded using techniques like one-hot
encoding or label encoding for machine learning models. Label encoding
assigns arbitrary integers, which some models may misread as an order.

o Analysis: Frequency counts, mode, and chi-square tests are typical analyses.
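
A minimal Python sketch of the nominal-level operations named above: frequency counts, the mode, and one-hot encoding. The eye-colour sample is hypothetical.

```python
from collections import Counter

# Hypothetical sample of a nominal variable.
eye_colours = ["Brown", "Blue", "Brown", "Green", "Brown", "Blue"]

# Frequency counts and the mode are meaningful for nominal data.
counts = Counter(eye_colours)
mode_colour = counts.most_common(1)[0][0]

# One-hot encoding: one 0/1 column per category, so no order is implied.
categories = sorted(set(eye_colours))  # alphabetical order, for column layout only
one_hot = [[1 if value == c else 0 for c in categories] for value in eye_colours]

print(counts)
print(mode_colour)
print(categories)
print(one_hot[0])  # encoding of the first observation
```

Each encoded row contains exactly one 1, which is what distinguishes one-hot encoding from label encoding.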

2. Ordinal Level:
 Definition: Ordinal data represents categories with a meaningful order or ranking, but
the intervals between categories are not necessarily equal or known.
 Characteristics:

o Ordered Categories: Data is categorized in a specific order or rank.

o No Equal Intervals: The difference between ranks is not uniform or
measurable.

o Limited Mathematical Operations: You can determine the order of categories
but not perform arithmetic operations on them.

 Examples:

o Survey Ratings: "Poor," "Fair," "Good," "Very Good," "Excellent."

o Education Level: "High School," "Bachelor's," "Master's," "PhD."

o Socioeconomic Status: "Low," "Middle," "High."

 Usage in Data Science:

o Encoding: Ordinal data can be encoded using ordinal encoding or other
techniques.

o Analysis: Median, percentiles, and non-parametric tests like the Mann-Whitney
U test are common for analysing ordinal data.
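
The sketch below shows one possible way to handle ordinal data in Python: ordinal encoding that preserves rank, followed by a median. The ratings are hypothetical. `median_low` is used because it always returns a value that occurs in the data, so the result maps back to a real category.

```python
from statistics import median_low

# The category order is part of the data definition for an ordinal variable.
rating_order = ["Poor", "Fair", "Good", "Very Good", "Excellent"]
rank = {label: position for position, label in enumerate(rating_order)}

# Hypothetical survey responses.
ratings = ["Good", "Excellent", "Fair", "Good", "Very Good"]

# Ordinal encoding: integers preserve the order, not the distance between ranks.
encoded = [rank[r] for r in ratings]

# The median depends only on order, so it is valid at the ordinal level.
median_label = rating_order[median_low(encoded)]

print(encoded)
print(median_label)
```

Taking the mean of `encoded` would run without error, but the result would assume equal spacing between ranks, which ordinal data does not guarantee.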

3. Interval Level:
 Definition: Interval data is numeric data with equal intervals between values, but it
lacks a true zero point. This means that while you can measure differences between
values, you cannot make statements about how many times greater one value is
compared to another.

 Characteristics:

o Equal Intervals: The difference between any two consecutive values is the
same.

o No True Zero: Zero is arbitrary and does not indicate the absence of the
quantity being measured.

o Addition/Subtraction: You can add and subtract values, but multiplication and
division are not meaningful in terms of ratios.

 Examples:

o Temperature (Celsius or Fahrenheit): The difference between 20°C and 30°C
is the same as between 30°C and 40°C, but 0°C does not represent "no
temperature."

o IQ Scores: The difference between scores is consistent, but a score of zero is
not possible or meaningful.

o Calendar Years: The difference between the years 2000 and 2010 is the same
as between 2010 and 2020, but the starting point of the calendar is an arbitrary
reference and does not represent the "beginning" in a numerical sense.
 Usage in Data Science:

o Analysis: Mean, standard deviation, correlation, and ANOVA can be applied.

o Transformation: Sometimes converted to ratio data if a true zero point can be
established (for example, converting Celsius to Kelvin).
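
A short Python sketch of why ratios are not meaningful at the interval level. The two temperatures are hypothetical. Differences are valid in Celsius, but the "twice as hot" ratio disappears once the values are moved to the Kelvin scale, which has a true zero.

```python
# Two hypothetical temperatures on the Celsius (interval) scale.
celsius_a, celsius_b = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference = celsius_b - celsius_a

# This ratio can be computed, but it depends on the arbitrary zero of Celsius.
naive_ratio = celsius_b / celsius_a

# Kelvin has a true zero, so the same temperatures form a meaningful ratio.
kelvin_a = celsius_a + 273.15
kelvin_b = celsius_b + 273.15
true_ratio = kelvin_b / kelvin_a

print(difference)
print(naive_ratio)
print(round(true_ratio, 4))
```

In Celsius, 20 degrees looks like double 10 degrees, while in Kelvin the same two temperatures differ by only about 3.5 percent.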

4. Ratio Level:

 Definition: Ratio data is the highest level of measurement, featuring all the
characteristics of interval data, but with a meaningful zero point. This allows for the
full range of mathematical operations, including meaningful ratios.

 Characteristics:

o True Zero: Zero indicates the absence of the quantity being measured, making
statements about "twice as much" meaningful.

o Equal Intervals: Like interval data, the difference between values is consistent.

o All Mathematical Operations: Addition, subtraction, multiplication, and
division can all be meaningfully applied.

 Examples:

o Height and Weight: 0 kg means no weight, and 60 kg is twice as heavy as 30
kg.

o Distance: 0 meters means no distance, and 10 meters is twice as long as 5
meters.

o Income: 0 dollars means no income, and $100,000 is twice as much as $50,000.

 Usage in Data Science:

o Analysis: Full range of statistical tests and operations, including geometric
mean, ratio comparisons, and more complex statistical models.

o Machine Learning: Ratio data is often the most informative type of data for
training models, as it allows for a wide variety of transformations and analyses.
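
For ratio data, both ratio comparisons and the geometric mean are valid. A minimal Python sketch with hypothetical incomes follows. `geometric_mean` comes from the standard `statistics` module, available from Python 3.8.

```python
from statistics import geometric_mean

# Hypothetical annual incomes in dollars (ratio scale: 0 means no income).
incomes = [25_000, 50_000, 100_000]

# Ratio comparison: meaningful because the scale has a true zero.
ratio = incomes[2] / incomes[1]

# Geometric mean: the n-th root of the product, defined for positive ratio data.
gmean = geometric_mean(incomes)

print(ratio)
print(round(gmean, 2))
```

Here each income is double the previous one, so the geometric mean lands on the middle value.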

Summary:
 Nominal: Categorization without order (e.g., eye colour, gender).

 Ordinal: Ordered categories without equal intervals (e.g., survey ratings, education
levels).
 Interval: Equal intervals, no true zero (e.g., temperature in Celsius, IQ scores).

 Ratio: Equal intervals with a true zero, allowing for all mathematical operations (e.g.,
height, weight, income).
Understanding these levels is crucial for selecting the right methods in data preprocessing,
statistical analysis, and machine learning.

The data science process typically involves five core steps, plus a sixth deployment and
communication step that is sometimes treated as optional. Together they guide the
transformation of raw data into actionable insights and form a structured approach to
problem-solving in data-driven projects.

1. Understanding the Problem (Problem Definition)

 Objective: Clearly define the problem you are trying to solve and understand the
business or research objectives.

 Activities:

o Identify the specific question or problem the analysis should address.

o Understand the context, including the stakeholders, goals, and constraints.

o Translate business objectives into data science objectives.

 Output: A well-defined problem statement and a plan that outlines the approach, scope,
and expected outcomes.

2. Data Collection and Acquisition

 Objective: Gather the relevant data needed to address the problem.

 Activities:

o Identify data sources (e.g., databases, APIs, sensors, public datasets).

o Collect raw data through extraction, querying, or accessing existing databases.

o Consider both structured and unstructured data sources, depending on the
problem.

 Output: A dataset or collection of datasets that are relevant, accessible, and sufficient
for analysis.

3. Data Cleaning and Preprocessing

 Objective: Prepare the data for analysis by cleaning and transforming it to a usable
format.

 Activities:

o Handle missing data (e.g., imputation, removal).

o Correct or remove errors and inconsistencies in the data.

o Transform and normalize data (e.g., scaling, encoding categorical variables).

o Split data into training, validation, and test sets, if applicable.

 Output: A clean, structured, and well-organized dataset ready for analysis and
modelling.
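
The cleaning activities listed above can be sketched in a few lines of Python. The data is hypothetical, and mean imputation, min-max scaling, and a two-thirds training split are illustrative choices rather than recommendations from these notes.

```python
import random
from statistics import mean

# Hypothetical raw column with missing values represented as None.
raw_ages = [25, None, 40, 35, None, 20]

# Imputation: replace missing values with the mean of the observed values.
observed = [age for age in raw_ages if age is not None]
fill_value = mean(observed)
ages = [age if age is not None else fill_value for age in raw_ages]

# Min-max scaling: map the column onto the range [0, 1].
lowest, highest = min(ages), max(ages)
scaled = [(age - lowest) / (highest - lowest) for age in ages]

# Train/test split: shuffle row indices with a fixed seed, then cut at two thirds.
indices = list(range(len(scaled)))
random.Random(0).shuffle(indices)
cut = len(indices) * 2 // 3
train_idx, test_idx = indices[:cut], indices[cut:]

print(ages)
print(scaled)
print(len(train_idx), len(test_idx))
```

In a real project the imputation value and scaling limits would usually be computed on the training set only, so that no information leaks from the test set.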

4. Data Exploration and Analysis

 Objective: Explore the data to uncover patterns, trends, and relationships that will
inform the modelling phase.

 Activities:

o Conduct exploratory data analysis (EDA) using statistical methods and
visualization techniques.

o Identify key variables, correlations, distributions, and anomalies.

o Generate hypotheses and insights that can guide the model-building process.

 Output: A set of findings, visualizations, and potential features that provide insights
into the data and inform the next steps.
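
Two of the exploration activities, checking a correlation and spotting an anomaly, can be sketched in Python as follows. All numbers are hypothetical, and the "more than 2 standard deviations from the mean" rule is one common heuristic, not the only way to flag outliers.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical data: weekly advertising spend and sales (arbitrary units).
ad_spend = [10, 20, 30, 40, 50]
sales = [12, 24, 33, 46, 55]


def pearson(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sum((x - mx) ** 2 for x in xs)
    spread_y = sum((y - my) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


r = pearson(ad_spend, sales)

# Hypothetical daily order counts containing one suspicious value.
orders = [20, 22, 19, 21, 20, 23, 18, 95]
centre, spread = mean(orders), stdev(orders)

# Flag values more than 2 sample standard deviations away from the mean.
anomalies = [x for x in orders if abs(x - centre) / spread > 2]

print(round(r, 3))
print(anomalies)
```

A strong correlation like this one would suggest advertising spend as a candidate feature, and the flagged order count would be checked before modelling.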

5. Modelling and Evaluation

 Objective: Build and evaluate models that solve the problem defined in the first step.
 Activities:

o Select appropriate algorithms and models based on the type of data and problem.

o Train models using the prepared data, and tune parameters to optimize
performance.

o Evaluate models using metrics such as accuracy, precision, recall, and F1-score
for classification, or RMSE for regression.

o Compare different models and select the best-performing one.

 Output: A validated model that provides predictions, classifications, or insights, along
with an evaluation report.
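
The evaluation metrics mentioned above can be computed from their definitions, as in the Python sketch below. The labels and predictions are hypothetical and are not the output of any real model.

```python
from math import sqrt

# Hypothetical binary classification results (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion-matrix counts.
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Hypothetical regression results for RMSE.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
rmse = sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

print(accuracy, precision, recall, f1)
print(round(rmse, 3))
```

Accuracy, precision, recall, and F1 apply to classification, while RMSE applies to regression, so a given project would normally report only one of the two groups.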

6. Deployment and Communication (Optional but Essential)

 Objective: Implement the model in a production environment and communicate results
to stakeholders.

 Activities:

o Deploy the model using tools and platforms suited for production.

o Monitor the model’s performance and make adjustments as necessary.

o Communicate findings and recommendations to stakeholders through reports,
dashboards, or presentations.
 Output: A deployed model in a live environment and a comprehensive report that
stakeholders can use to make informed decisions.

These steps may be iterative, especially the data exploration, modelling, and evaluation phases,
where you may need to revisit earlier steps based on new findings or feedback. This structured
approach ensures a thorough and methodical process, leading to reliable and actionable
outcomes in data science projects.
