Unit 1 R

Uploaded by

Madhav Gupta
DATA SCIENCE USING R (NOTES)

UNIT-I
Structured data and unstructured data are two primary types of data, distinguished by their
format, organization, and how easily they can be used by algorithms, databases, and other
analytical tools.

Structured Data:

 Format: Structured data is organized into a defined schema or structure, usually in
rows and columns, making it easily searchable and analyzable. Examples include
spreadsheets and relational databases.

 Examples:

o Databases: SQL databases, where data is organized in tables with predefined
columns and data types.

o Spreadsheets: Excel files, where data is stored in cells organized by rows and
columns.

o CSV Files: Data stored in comma-separated values, which can be easily
imported into databases.

o Sensor Data: Data from IoT devices, where readings are taken at regular
intervals and stored in a structured format.

 Characteristics:

o Easily searchable: Due to its organization, querying structured data with tools
like SQL is straightforward.

o Predefined format: The schema is defined before data is input, making it more
rigid but easier to manage.

o Storage: Typically stored in relational databases or data warehouses.
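
The "easily searchable" point can be illustrated with a minimal sketch in Python, using the standard-library sqlite3 module and an in-memory database. The table name, column names, and rows are invented for illustration only.

```python
import sqlite3

# In-memory database; the schema and the rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (product TEXT NOT NULL, units INTEGER NOT NULL, price REAL NOT NULL)"
)
rows = [("pen", 10, 1.5), ("notebook", 4, 3.0), ("pen", 6, 1.5)]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Because every row follows the same predefined schema,
# an aggregation over the whole table is a single declarative query.
query = (
    "SELECT product, SUM(units), SUM(units * price) "
    "FROM sales GROUP BY product ORDER BY product"
)
totals = conn.execute(query).fetchall()
conn.close()

for product, units, revenue in totals:
    print(product, units, revenue)
```

The schema is declared before any row is inserted, which is the rigidity described above: a row that does not fit the columns is rejected rather than stored.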

Unstructured Data:

 Format: Unstructured data does not follow a specific format or structure, making it
more flexible but also more challenging to manage and analyze. It often includes text-
heavy data but can also contain multimedia elements.

 Examples:

o Text Documents: Emails, Word documents, and PDFs, where data is presented
in natural language.

o Multimedia: Images, audio files, and videos, which do not have a predefined
data model.
o Social Media Content: Tweets, Facebook posts, and other user-generated
content.

o Web Pages: HTML content that includes text, images, and embedded media.

 Characteristics:

o Complex processing: Requires advanced techniques, like natural language
processing (NLP) or image recognition, to extract meaningful information.

o Flexible format: Data can be stored in various forms without a predefined
structure.

o Storage: Often stored in non-relational databases (NoSQL), data lakes, or file
systems.
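
As a rough illustration of the extra processing unstructured data needs, the sketch below turns free text into word counts. It is a deliberately simple stand-in for real NLP, and the review texts and the tokenize helper are hypothetical.

```python
import re
from collections import Counter

# Hypothetical customer reviews: free text with no schema.
reviews = [
    "Great battery life, but the screen is dim.",
    "Battery drains fast. Screen is great though!",
]


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


# Structure has to be imposed first (here: a bag of words)
# before anything can be counted or compared.
word_counts = Counter(word for review in reviews for word in tokenize(review))
print(word_counts["battery"], word_counts["screen"])
```

Even this small example needs a tokenization step before any counting can happen, whereas a structured table could be queried directly.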

Key Differences:

1. Organization:

o Structured data is neatly organized in predefined formats like tables.

o Unstructured data lacks this organization and can come in various forms.
2. Ease of Analysis:

o Structured data can be easily analysed using standard database tools.

o Unstructured data requires more advanced tools and algorithms for analysis.

3. Flexibility:

o Structured data is less flexible due to its predefined schema.

o Unstructured data is highly flexible but harder to manage and analyse.

Use Cases:
 Structured Data: Ideal for use in scenarios where the data format is consistent and
predefined, such as in banking, inventory management, and CRM systems.

 Unstructured Data: Suitable for analysing data-rich environments like social media
monitoring, customer reviews, and multimedia content analysis.

Understanding the differences between structured and unstructured data is crucial for
determining the appropriate tools and methodologies for data processing, storage, and analysis.

Quantitative and qualitative data are two primary types of data used in research, each serving
different purposes and providing different insights.

Quantitative Data:
 Definition: Quantitative data refers to information that can be measured, counted, or
expressed numerically. It quantifies variables and typically involves statistical analysis
to identify patterns, relationships, or trends.

 Examples:

o Height and Weight: Measurements of a person's height in centimetres or
weight in kilograms.

o Age: Number of years since birth.

o Sales Figures: The number of products sold, revenue generated, or profits
earned.

o Temperature: Degrees in Celsius or Fahrenheit.

 Characteristics:

o Numerical: Expressed in numbers, making it easy to quantify and compare.

o Objective: Based on measurable, observable phenomena, making it less prone
to subjective interpretation.

o Statistical Analysis: Often analysed using statistical methods, such as averages,
correlations, regressions, and hypothesis testing.

o Scales of Measurement: Measured on interval or ratio scales; nominal and ordinal
scales describe categorical data, even when the categories are coded as numbers.

 Use Cases:

o Surveys: Measuring customer satisfaction on a scale of 1 to 10.


o Experiments: Comparing the effectiveness of two treatments by measuring
outcomes in patients.
o Market Research: Analysing demographic data to understand customer
segments.
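
A minimal sketch of the kind of numerical summary quantitative data allows, using Python's standard statistics module. The height values are made up for illustration.

```python
import statistics

# Hypothetical heights in centimetres.
heights_cm = [158.0, 162.5, 171.0, 175.5, 183.0]

mean_height = statistics.mean(heights_cm)      # arithmetic average
median_height = statistics.median(heights_cm)  # middle value
spread = statistics.stdev(heights_cm)          # sample standard deviation

print(mean_height, median_height, round(spread, 2))
```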

Qualitative Data:
 Definition: Qualitative data refers to non-numerical information that describes
qualities, characteristics, or concepts. It is often used to explore complex phenomena
and gain deeper insights into people's experiences, behaviours, or perceptions.

 Examples:

o Interviews: Transcripts of interviews with participants discussing their
experiences.

o Observations: Notes from observing behaviours in a natural setting.

o Open-Ended Survey Responses: Participants' detailed written responses to
questions.

o Textual Content: Books, articles, and social media posts analysed for themes
or patterns.

 Characteristics:

o Descriptive: Captures the richness and complexity of the subject matter, often
in the form of words, images, or objects.

o Subjective: Interpretation is influenced by the researcher's perspective, context,
and the participant's viewpoint.

o Thematic Analysis: Often analysed by identifying themes, patterns, or
narratives within the data.

o Contextual: Provides context and depth to understand the "why" and "how"
behind certain phenomena.

 Use Cases:

o Case Studies: Exploring an individual's or group's experiences in depth.

o Focus Groups: Understanding consumer preferences through group
discussions.

o Content Analysis: Analysing media content for themes or trends.

o Ethnography: Studying cultural practices and behaviours within a community.
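
Thematic analysis is normally done by a researcher reading and coding the material. The sketch below only approximates the idea with keyword matching; the codebook, the responses, and the code_response helper are all hypothetical.

```python
from collections import Counter

# Hypothetical codebook: theme -> keywords that signal it.
codebook = {
    "price": {"expensive", "cheap", "cost"},
    "service": {"staff", "support", "helpful"},
}

# Hypothetical open-ended survey responses.
responses = [
    "The staff were helpful but the product is expensive",
    "Support never replied to my email",
    "Good value, not expensive at all",
]


def code_response(text):
    """Return the set of themes whose keywords appear in the text."""
    words = set(text.lower().replace(",", " ").split())
    return {theme for theme, keywords in codebook.items() if words & keywords}


# Count how many responses touch each theme.
theme_counts = Counter(theme for r in responses for theme in code_response(r))
print(theme_counts)
```

Keyword matching records that a theme is mentioned, not what is said about it: the third response is coded under "price" even though it is positive. That gap is why qualitative analysis relies on human interpretation.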

Key Differences:

1. Nature of Data:
o Quantitative Data: Numerical, objective, and measurable.

o Qualitative Data: Descriptive, subjective, and interpretative.

2. Analysis Methods:

o Quantitative Data: Statistical methods are used to analyse and interpret the
data.

o Qualitative Data: Thematic analysis, coding, and narrative analysis are
common.

3. Purpose:

o Quantitative Data: Used to quantify variables, test hypotheses, and generalize
findings.

o Qualitative Data: Used to explore underlying reasons, opinions, and
motivations.

4. Outcome:

o Quantitative Data: Results are often presented as graphs, tables, or charts.

o Qualitative Data: Results are presented as themes, narratives, or descriptive
summaries.

Integration:

In many research projects, mixed methods are used, combining both quantitative and
qualitative data to provide a more comprehensive understanding of the research problem.
Quantitative data provides the "what," while qualitative data offers insights into the "why" and
"how."

In data science, understanding the different levels of measurement (nominal, ordinal,
interval, and ratio) is essential for selecting the appropriate statistical methods and analyses.
These levels determine how data can be categorized, ordered, and quantified.

1. Nominal Level:
 Definition: Nominal data represents categories or groups without any inherent order or
ranking. It is the most basic level of measurement, where numbers or labels are used
solely for classification.

 Characteristics:
o Categories: Data is grouped into mutually exclusive categories.

o No Order: There is no logical order or ranking among categories.

o No Mathematical Operations: You can count the frequency of categories, but
you cannot perform mathematical operations like addition or subtraction.

 Examples:

o Gender: Categories such as "Male," "Female," and "Non-binary."

o Eye Colour: Categories such as "Blue," "Brown," "Green."

o Nationality: Categories such as "American," "French," "Chinese."

 Usage in Data Science:

o Encoding: Nominal data is usually one-hot encoded for machine learning models.
Label encoding (mapping each category to an integer) is also used, but it implies an
order that nominal data does not have, so it suits only models that do not treat the
integers as magnitudes.

o Analysis: Frequency counts, mode, and chi-square tests are typical analyses.
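
A minimal sketch of frequency counts, the mode, and one-hot encoding for a nominal variable, written with the Python standard library. The eye-colour values and the one_hot helper are illustrative assumptions.

```python
from collections import Counter

# Hypothetical nominal observations.
eye_colours = ["Blue", "Brown", "Green", "Brown", "Blue", "Brown"]

# Counting is the only arithmetic that makes sense for nominal data.
frequencies = Counter(eye_colours)
mode = frequencies.most_common(1)[0][0]

# One column per category, in a fixed (alphabetical) order.
categories = sorted(set(eye_colours))


def one_hot(value):
    """Return a 0/1 vector with a single 1 in the position of the category."""
    return [1 if value == category else 0 for category in categories]


encoded = [one_hot(value) for value in eye_colours]
print(mode, categories, encoded[0])
```

One-hot encoding gives every category its own column, so no category is treated as larger or smaller than another.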

2. Ordinal Level:
 Definition: Ordinal data represents categories with a meaningful order or ranking, but
the intervals between categories are not necessarily equal or known.
 Characteristics:

o Ordered Categories: Data is categorized in a specific order or rank.


o No Equal Intervals: The difference between ranks is not uniform or
measurable.

o Limited Mathematical Operations: You can determine the order of categories
but not perform arithmetic operations on them.

 Examples:

o Survey Ratings: "Poor," "Fair," "Good," "Very Good," "Excellent."

o Education Level: "High School," "Bachelor's," "Master's," "PhD."

o Socioeconomic Status: "Low," "Middle," "High."

 Usage in Data Science:

o Encoding: Ordinal data can be encoded using ordinal encoding or other
techniques.

o Analysis: Median, percentiles, and non-parametric tests like the Mann-Whitney
U test are common for analysing ordinal data.
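
Ordinal encoding and the median can be sketched as follows. The rating scale is the one from the examples above; the list of responses is made up.

```python
import statistics

# The order of this list defines the ranking.
levels = ["Poor", "Fair", "Good", "Very Good", "Excellent"]
rank = {label: position for position, label in enumerate(levels, start=1)}

# Hypothetical survey responses.
ratings = ["Good", "Poor", "Excellent", "Good", "Fair"]
encoded = [rank[r] for r in ratings]

# median_low always returns an observed value, so it maps back to a real category.
median_rank = statistics.median_low(encoded)
median_label = levels[median_rank - 1]
print(encoded, median_label)
```

The integers only preserve order. Averaging them would assume that the gap between "Poor" and "Fair" equals the gap between "Good" and "Very Good", which ordinal data does not guarantee.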

3. Interval Level:
 Definition: Interval data is numeric data with equal intervals between values, but it
lacks a true zero point. This means that while you can measure differences between
values, you cannot make statements about how many times greater one value is
compared to another.

 Characteristics:

o Equal Intervals: The difference between any two consecutive values is the
same.

o No True Zero: Zero is arbitrary and does not indicate the absence of the
quantity being measured.

o Addition/Subtraction: You can add and subtract values, but multiplication and
division are not meaningful in terms of ratios.

 Examples:

o Temperature (Celsius or Fahrenheit): The difference between 20°C and 30°C
is the same as between 30°C and 40°C, but 0°C does not represent "no
temperature."

o IQ Scores: The difference between scores is consistent, but a score of zero is
not possible or meaningful.

o Calendar Years: The difference between the years 2000 and 2010 is the same
as between 2010 and 2020, but the starting point of the calendar is an arbitrary
convention rather than a true zero (the AD/BC system has no year 0 at all).
 Usage in Data Science:

o Analysis: Mean, standard deviation, correlation, and ANOVA can be applied.

o Transformation: Sometimes converted to ratio data if a true zero point can be
established (e.g., converting Celsius to Kelvin).
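
The sketch below shows why ratios are not meaningful on an interval scale, using temperature. The two readings are arbitrary; the conversion to Kelvin adds 273.15 to the Celsius value.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius to Kelvin, a scale with a true zero."""
    return celsius + 273.15


cool_c, warm_c = 20.0, 40.0

# Differences are meaningful on an interval scale.
difference_c = warm_c - cool_c

# The Celsius ratio is 2.0, but 40°C is not "twice as hot" as 20°C,
# because the Celsius zero point is arbitrary.
naive_ratio = warm_c / cool_c

# On the Kelvin scale the ratio reflects the actual physical relationship.
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(difference_c, naive_ratio, round(kelvin_ratio, 3))
```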

4. Ratio Level:

 Definition: Ratio data is the highest level of measurement, featuring all the
characteristics of interval data, but with a meaningful zero point. This allows for the
full range of mathematical operations, including meaningful ratios.

 Characteristics:

o True Zero: Zero indicates the absence of the quantity being measured, making
statements about "twice as much" meaningful.

o Equal Intervals: Like interval data, the difference between values is consistent.

o All Mathematical Operations: Addition, subtraction, multiplication, and
division can all be meaningfully applied.

 Examples:

o Height and Weight: 0 kg means no weight, and 60 kg is twice as heavy as 30
kg.

o Distance: 0 meters means no distance, and 10 meters is twice as long as 5
meters.

o Income: 0 dollars means no income, and $100,000 is twice as much as $50,000.

 Usage in Data Science:


o Analysis: Full range of statistical tests and operations, including geometric
mean, ratio comparisons, and more complex statistical models.

o Machine Learning: Ratio data is often the most informative type of data for
training models, as it allows for a wide variety of transformations and analyses.
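
A short sketch of two operations that only make sense with a true zero: a ratio comparison and the geometric mean. The income figures are hypothetical; statistics.geometric_mean requires Python 3.8 or later.

```python
import statistics

# Hypothetical annual incomes in dollars.
incomes = [25_000, 50_000, 100_000]

# "Twice as much" is a valid statement because 0 dollars means no income.
ratio = incomes[2] / incomes[1]

# The geometric mean multiplies values, so it needs positive ratio-level data.
geo_mean = statistics.geometric_mean(incomes)

print(ratio, round(geo_mean, 2))
```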

Summary:
 Nominal: Categorization without order (e.g., eye colour, gender).

 Ordinal: Ordered categories without equal intervals (e.g., survey ratings, education
levels).
 Interval: Equal intervals, no true zero (e.g., temperature in Celsius, IQ scores).

 Ratio: Equal intervals with a true zero, allowing for all mathematical operations (e.g.,
height, weight, income).
Understanding these levels is crucial for selecting the right methods in data preprocessing,
statistical analysis, and machine learning.

The data science process typically involves five core steps, followed by a sixth step for
deployment and communication, that guide the transformation of raw data into actionable
insights. These steps form a structured approach to problem-solving in data-driven projects.

1. Understanding the Problem (Problem Definition)

 Objective: Clearly define the problem you are trying to solve and understand the
business or research objectives.

 Activities:

o Identify the specific question or problem the analysis should address.

o Understand the context, including the stakeholders, goals, and constraints.

o Translate business objectives into data science objectives.

 Output: A well-defined problem statement and a plan that outlines the approach, scope,
and expected outcomes.

2. Data Collection and Acquisition

 Objective: Gather the relevant data needed to address the problem.


 Activities:

o Identify data sources (e.g., databases, APIs, sensors, public datasets).

o Collect raw data through extraction, querying, or accessing existing databases.

o Consider both structured and unstructured data sources, depending on the
problem.

 Output: A dataset or collection of datasets that are relevant, accessible, and sufficient
for analysis.
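
As a minimal sketch of the collection step, the code below parses CSV text into typed records with the standard csv module. The data is an inline string so the example is self-contained; in practice it would come from a file, database, or API. The column names and values are invented.

```python
import csv
import io

# Hypothetical raw data, inlined here instead of being read from a file.
raw = "id,city,temp_c\n1,Delhi,31.5\n2,Pune,27.0\n"

reader = csv.DictReader(io.StringIO(raw))

# csv yields every field as a string, so types are converted explicitly.
records = [
    {"id": int(row["id"]), "city": row["city"], "temp_c": float(row["temp_c"])}
    for row in reader
]
print(records)
```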

3. Data Cleaning and Preprocessing

 Objective: Prepare the data for analysis by cleaning and transforming it to a usable
format.

 Activities:

o Handle missing data (e.g., imputation, removal).

o Correct or remove errors and inconsistencies in the data.

o Transform and normalize data (e.g., scaling, encoding categorical variables).

o Split data into training, validation, and test sets, if applicable.


 Output: A clean, structured, and well-organized dataset ready for analysis and
modelling.
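
The cleaning activities above can be sketched with the standard library: median imputation, min-max scaling, and a shuffled train/test split. The ages, the 80/20 split, and the seed are arbitrary choices made for this example.

```python
import random
import statistics

# Hypothetical column with missing values marked as None.
ages = [23, None, 31, 45, None, 38]

# 1. Impute missing values with the median of the observed values.
observed = [a for a in ages if a is not None]
fill_value = statistics.median(observed)
imputed = [fill_value if a is None else a for a in ages]

# 2. Min-max scale to the range [0, 1].
low, high = min(imputed), max(imputed)
scaled = [(a - low) / (high - low) for a in imputed]

# 3. Shuffle row indices with a fixed seed and split 80/20.
indices = list(range(len(scaled)))
random.Random(42).shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]

print(fill_value, len(train_idx), len(test_idx))
```

In a real project the imputation value and the scaling range would normally be computed from the training split only, so that information from the test set does not leak into the model. They are computed on the full column here to keep the sketch short.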

4. Data Exploration and Analysis

 Objective: Explore the data to uncover patterns, trends, and relationships that will
inform the modelling phase.

 Activities:

o Conduct exploratory data analysis (EDA) using statistical methods and
visualization techniques.

o Identify key variables, correlations, distributions, and anomalies.

o Generate hypotheses and insights that can guide the model-building process.

 Output: A set of findings, visualizations, and potential features that provide insights
into the data and inform the next steps.
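
One common EDA check is the correlation between two variables. The sketch below computes the Pearson correlation coefficient by hand so that the formula is visible; the advertising and sales figures are made up, and pearson is a helper written for this example.

```python
import math
import statistics

# Hypothetical paired observations.
ad_spend = [10.0, 20.0, 30.0, 40.0, 50.0]
sales = [25.0, 41.0, 62.0, 79.0, 103.0]


def pearson(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(spread_x * spread_y)


r = pearson(ad_spend, sales)
print(round(r, 3))
```

A value close to 1 indicates a strong positive linear relationship, which would make ad_spend a candidate feature for the modelling step. Correlation alone does not show that one variable causes the other.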

5. Modelling and Evaluation

 Objective: Build and evaluate models that solve the problem defined in the first step.
 Activities:

o Select appropriate algorithms and models based on the type of data and problem.

o Train models using the prepared data, and tune parameters to optimize
performance.

o Evaluate models using metrics such as accuracy, precision, recall, F1-score,
RMSE, etc.

o Compare different models and select the best-performing one.

 Output: A validated model that provides predictions, classifications, or insights, along
with an evaluation report.
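
The classification metrics named above can be computed directly from true and predicted labels. This is a minimal sketch for binary labels; the label lists are invented, and classification_metrics is a helper written for this example rather than a library function.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical ground truth and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Precision asks how many predicted positives were correct, while recall asks how many actual positives were found. F1 is their harmonic mean, so it is high only when both are high.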

6. Deployment and Communication (Sometimes Omitted, but Essential in Practice)

 Objective: Implement the model in a production environment and communicate results
to stakeholders.

 Activities:

o Deploy the model using tools and platforms suited for production.

o Monitor the model’s performance and make adjustments as necessary.

o Communicate findings and recommendations to stakeholders through reports,
dashboards, or presentations.
 Output: A deployed model in a live environment and a comprehensive report that
stakeholders can use to make informed decisions.

These steps may be iterative, especially the data exploration, modelling, and evaluation phases,
where you may need to revisit earlier steps based on new findings or feedback. This structured
approach ensures a thorough and methodical process, leading to reliable and actionable
outcomes in data science projects.
