FDS - 2 Solved

TYBCS Foundation of Data Science solved question paper

Q1) Attempt any EIGHT of the following:

a) What is Data science?

Sol:

Data science is an interdisciplinary field that combines domain expertise, programming
skills, and knowledge of mathematics and statistics to extract meaningful insights from
data. It involves processes such as data collection, data cleaning, data analysis, data
visualization, and the application of machine learning algorithms to solve complex
problems and make data-driven decisions.

b) Define Data source?

Sol:

A data source is any location or system that provides data for analysis. Examples include
databases, spreadsheets, APIs, sensors, and web services. Data sources can be internal
(e.g., company databases) or external (e.g., public datasets, social media).

c) What are missing values?

Sol:

Missing values refer to data points that are not recorded or are absent in a dataset. They
can occur due to various reasons, such as data entry errors, data corruption, or incomplete
data collection. Handling missing values is crucial for accurate data analysis.
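For illustration, a minimal pandas sketch of detecting and imputing missing values (the
column names and values are hypothetical):

import pandas as pd
import numpy as np

# Hypothetical dataset with a missing age value
df = pd.DataFrame({"name": ["John", "Jane", "Sam"],
                   "age": [25, np.nan, 30]})

print(df.isna().sum())                          # count missing values per column
df["age"] = df["age"].fillna(df["age"].mean())  # impute with the column mean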

d) List the visualization libraries in python.

Sol:

1. Matplotlib

2. Seaborn

3. Plotly

4. Bokeh

5. Altair
e) List applications of data science.

Sol:

1. Healthcare: Predictive analytics, disease diagnosis, personalized treatment plans.

2. Finance: Fraud detection, credit risk assessment, algorithmic trading.

3. Marketing: Customer segmentation, sentiment analysis, recommendation systems.

4. Retail: Inventory management, demand forecasting, customer behavior analysis.

5. Manufacturing: Predictive maintenance, quality control, supply chain optimization.

f) What is data transformation?

Sol:

Data transformation is the process of converting data from one format or structure into
another. This process includes various techniques such as normalization, aggregation, and
scaling to prepare data for analysis, improve its quality, and ensure consistency across
datasets.

g) Define Hypothesis Testing?

Sol:

Hypothesis testing is a statistical method used to make inferences or draw conclusions
about a population based on sample data. It involves formulating a null hypothesis (H0)
and an alternative hypothesis (H1), selecting a significance level, calculating a test
statistic, and deciding whether to reject or fail to reject the null hypothesis based on
the test results.
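For illustration, a minimal one-sample t-test sketch with SciPy (the sample values and the
hypothesized mean of 3.0 are made up):

from scipy import stats

sample = [2.9, 3.1, 3.4, 2.8, 3.2, 3.0]                   # hypothetical sample
t_stat, p_value = stats.ttest_1samp(sample, popmean=3.0)  # H0: population mean = 3.0

alpha = 0.05  # significance level
if p_value < alpha:
    print("Reject H0")
else:
    print("Fail to reject H0")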

h) What is the use of a bubble plot?

Sol:

A bubble plot is a type of scatter plot where each point is represented by a bubble. The
position of the bubble indicates the values of two variables, while the size of the bubble
represents the value of a third variable. Bubble plots are useful for visualizing the
relationships between three variables in a single plot.
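For illustration, a minimal Matplotlib sketch with made-up data, where the s argument maps
the third variable to bubble size:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [10, 20, 25, 30, 35]
sizes = [40, 100, 250, 500, 900]  # third variable, mapped to bubble area

plt.scatter(x, y, s=sizes, alpha=0.5)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Bubble Plot')
plt.show()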
i) Define Data cleaning?

Sol:

Data cleaning is the process of identifying, correcting, or removing errors and
inconsistencies in a dataset. This involves handling missing values, outliers, duplicate
records, and formatting issues to ensure that the data is accurate, complete, and ready for
analysis.
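For illustration, a minimal pandas sketch of common cleaning steps on hypothetical data:

import pandas as pd

# Hypothetical data with a duplicate, a missing name, and inconsistent formatting
df = pd.DataFrame({"name": ["John ", "John ", "jane", None],
                   "age": [25, 25, 30, 28]})

df = df.drop_duplicates()                        # remove duplicate records
df = df.dropna(subset=["name"])                  # drop rows with a missing name
df["name"] = df["name"].str.strip().str.title()  # fix formatting issues
print(df)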

j) Define standard deviation?

Sol:

Standard deviation is a measure of the dispersion or spread of a set of values. It quantifies
the amount of variation or dispersion in a dataset relative to its mean. A low standard
deviation indicates that the data points are close to the mean, while a high standard
deviation indicates that the data points are spread out over a wide range of values.
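A quick worked example with NumPy: the dataset [2, 4, 4, 4, 5, 5, 7, 9] has mean 5 and
population standard deviation 2.

import numpy as np

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(np.mean(data))         # 5.0
print(np.std(data))          # 2.0  (population standard deviation)
print(np.std(data, ddof=1))  # ~2.14 (sample standard deviation)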

Q2) Attempt any FOUR of the following:

a) List the tools for a data scientist.

Sol:

• Tableau: A powerful data visualization tool that helps create interactive and
shareable dashboards.
• Power BI: A business analytics service by Microsoft that provides interactive
visualizations and business intelligence capabilities with a simple interface.
• Matplotlib: A plotting library for Python that provides tools to create static,
interactive, and animated visualizations.
• Seaborn: Built on top of Matplotlib, this Python library provides a high-level
interface for drawing attractive and informative statistical graphics.
b) Define statistical data analysis?

Sol:

Statistical data analysis is the process of collecting, organizing, analyzing, interpreting, and
presenting data using statistical methods. It involves using techniques such as descriptive
statistics, inferential statistics, hypothesis testing, regression analysis, and more to derive
insights and make data-driven decisions.
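For illustration, a minimal descriptive-statistics sketch in pandas (the scores are made up):

import pandas as pd

scores = pd.Series([55, 60, 62, 70, 71, 75, 80, 95])
print(scores.describe())  # count, mean, std, min, quartiles, max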

c) What is a data cube?

Sol:

A data cube is a multi-dimensional array of values, commonly used to describe data in a
data warehouse. It allows for complex queries and analyses on multi-dimensional data,
providing a way to view and analyze data from different perspectives (dimensions) such as
time, geography, and product categories.
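A data-cube-style aggregation can be sketched with a pandas pivot table (the sales data
below is hypothetical; a production data cube would typically live in a data warehouse or
OLAP engine):

import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "region":  ["East", "West", "East", "West"],
    "product": ["A", "A", "B", "B"],
    "revenue": [100, 150, 120, 180],
})

# One 2-D slice of the cube: revenue aggregated along time and geography
print(sales.pivot_table(values="revenue", index="year",
                        columns="region", aggfunc="sum"))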

d) Give the purpose of data preprocessing?

Sol:

The purpose of data preprocessing is to prepare raw data for analysis by transforming it into
a clean and usable format. This involves steps such as data cleaning, normalization,
transformation, feature extraction, and encoding. Data preprocessing ensures that the data
is accurate, consistent, and ready for analysis, which improves the quality and reliability of
the results.
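For illustration, a minimal scikit-learn sketch (with hypothetical data) that chains two
common preprocessing steps into one pipeline:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = np.array([[25.0], [np.nan], [30.0], [45.0]])  # raw data with a gap

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # fill missing values
    ("scale", StandardScaler()),                 # standardize to mean 0, std 1
])
print(pipeline.fit_transform(X))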

e) What is the purpose of data visualization?

Sol:

The purpose of data visualization is to represent data graphically, making it easier to
understand and interpret. It helps in identifying patterns, trends, and outliers in the data,
facilitating data-driven decision-making. Data visualization also aids in communicating
complex information effectively to stakeholders and enhances the overall analytical
process.
Q3) Attempt any TWO of the following:

a) What are the measures of central tendency? Explain any two of them in brief.

Sol:

Measures of central tendency are statistical metrics used to describe the center of a
dataset's distribution. The three main measures are:

1. Mean: The average of all data points, calculated by summing the values and dividing
by the number of observations.

o Example: For the dataset [1, 2, 3, 4, 5], the mean is (1+2+3+4+5)/5 = 3.

2. Median: The middle value in an ordered dataset, which separates the data into two
halves.

o Example: For the dataset [1, 2, 3, 4, 5], the median is 3.

3. Mode: The most frequently occurring value in a dataset.

o Example: For the dataset [1, 2, 2, 3, 4], the mode is 2.
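All three measures are also available in Python's built-in statistics module:

import statistics

data = [1, 2, 2, 3, 4]
print(statistics.mean(data))    # 2.4
print(statistics.median(data))  # 2
print(statistics.mode(data))    # 2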

b) What are the various types of data available? Give an example of each.

Sol:

1. Nominal Data:

o Definition: Categorical data without any intrinsic ordering. It represents
categories or labels.

o Example: Gender (Male, Female), Colors (Red, Blue, Green)

2. Ordinal Data:

o Definition: Categorical data with a clear ordering or ranking between the
categories.

o Example: Education levels (High School, Bachelor's, Master's, Ph.D.),
Satisfaction levels (Poor, Fair, Good, Excellent)
3. Interval Data:

o Definition: Numeric data with meaningful differences between values, but no
true zero point.

o Example: Temperature in Celsius (0°C does not mean 'no temperature')

4. Ratio Data:

o Definition: Numeric data with meaningful differences between values and a
true zero point.

o Example: Height (in centimeters), Weight (in kilograms), Age (in years)

c) What is Venn diagram? How to create it? Explain with example.

Sol:

Venn Diagram: A Venn diagram is a graphical representation used to show the relationships
between different sets. It uses overlapping circles to illustrate the commonalities and
differences between the sets.

How to Create a Venn Diagram:

1. Draw circles for each set. The number of circles corresponds to the number of sets.

2. Label each circle with the set name.

3. Place elements in the appropriate sections of the circles to indicate membership in
the sets.

Example: To represent sets A = {1, 2, 3} and B = {2, 3, 4}:

• Draw two overlapping circles, one for A and one for B.

• Place 1 in the non-overlapping part of A, 4 in the non-overlapping part of B, and 2
and 3 in the overlapping section.

A only | A ∩ B | B only
   1   | 2, 3  |   4

This diagram shows that 2 and 3 are common to both sets A and B, while 1 is unique to A
and 4 is unique to B.
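For a programmatic version, the sketch below assumes the third-party matplotlib-venn
package is installed (pip install matplotlib-venn):

# Assumes the third-party matplotlib-venn package is installed
import matplotlib.pyplot as plt
from matplotlib_venn import venn2

A = {1, 2, 3}
B = {2, 3, 4}
venn2([A, B], set_labels=("A", "B"))  # region contents derived from the set overlaps
plt.show()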
Q4) Attempt any TWO of the following:

a) Explain different data formats in brief.

Sol:

1. CSV (Comma-Separated Values):

o Description: A plain text format where each line represents a record, and fields are
separated by commas. It is widely used for data exchange between applications.

o Example:

Name,Age,Country
John,25,USA
Jane,30,UK

2. JSON (JavaScript Object Notation):

o Description: A structured, text-based format that uses key-value pairs to represent
objects and arrays. It is commonly used for data interchange between web services
and applications.

o Example:

"name": "John",

"age": 25,

"country": "USA"

3. XML (eXtensible Markup Language):

o Description: A flexible text format that uses tags to define the structure and content of
data. It is used for data representation and exchange.

o Example:

<person>
  <name>John</name>
  <age>25</age>
  <country>USA</country>
</person>

4. SQL (Structured Query Language):

o Description: A language used for managing and querying relational databases. It
allows for data manipulation and retrieval.

o Example:

SELECT name, age, country
FROM people
WHERE age > 20;

5. Parquet:

o Description: A columnar storage file format optimized for efficient data storage and
retrieval, particularly in big data environments.

o Example:

// JSON-like sketch of a Parquet file's logical structure (Parquet itself is binary)
{
  "columns": ["name", "age", "country"],
  "data": [
    ["John", 25, "USA"],
    ["Jane", 30, "UK"]
  ]
}
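In practice, Parquet files are read and written through a library rather than by hand; a
minimal pandas sketch, assuming a Parquet engine such as pyarrow is installed:

import pandas as pd

df = pd.DataFrame({"name": ["John", "Jane"],
                   "age": [25, 30],
                   "country": ["USA", "UK"]})

df.to_parquet("people.parquet")           # requires pyarrow or fastparquet
print(pd.read_parquet("people.parquet"))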
b) What is data quality? Which factors affect data quality?

Sol:

Data Quality: Data quality refers to the accuracy, completeness, reliability, and relevance
of data for its intended use. High-quality data ensures that analyses and decisions based
on the data are accurate and reliable.

Factors Affecting Data Quality:

1. Accuracy: The degree to which data correctly represents the real-world values it is
intended to model. Inaccurate data can lead to incorrect conclusions.

2. Completeness: The extent to which all required data is available. Missing data can
result in biased analyses.

3. Consistency: The degree to which data is uniform and free from contradictions.
Inconsistent data can arise from different data sources or entry errors.

4. Timeliness: The degree to which data is up-to-date and available when needed.
Outdated data can render analyses irrelevant.

5. Validity: The extent to which data conforms to defined formats and standards.
Invalid data can result from incorrect data entry or format mismatches.
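Several of these factors can be checked programmatically. A minimal pandas sketch with
hypothetical data:

import pandas as pd

# Hypothetical data with a missing name, a duplicate row, and an invalid age
df = pd.DataFrame({"name": ["John", "Jane", "Jane", None],
                   "age": [25, 30, 30, -5]})

print(df.isna().mean())       # completeness: fraction missing per column
print(df.duplicated().sum())  # consistency: number of duplicate records
print((df["age"] < 0).sum())  # validity: ages outside the valid range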

c) Write detailed notes on basic data visualization tools.

Sol:

Basic Data Visualization Tools:

1. Matplotlib:

o Description: Matplotlib is one of the most widely used data visualization libraries in
Python. It provides a flexible platform for creating static, animated, and interactive
visualizations.

o Capabilities: Line plots, scatter plots, bar charts, histograms, pie charts, and more.

o Example:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [10, 20, 25, 30, 35]

plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Sample Line Plot')
plt.show()

2. Seaborn:

o Description: Built on top of Matplotlib, Seaborn offers a high-level interface for
drawing attractive and informative statistical graphics.

o Capabilities: Enhanced visualizations, including heatmaps, violin plots, box plots, and
pair plots.

o Example:

import seaborn as sns
import matplotlib.pyplot as plt

data = sns.load_dataset("iris")
sns.pairplot(data, hue="species")
plt.show()

3. Plotly:

o Description: Plotly is an interactive, open-source plotting library that supports a wide
range of visualization types and interactive features.

o Capabilities: 3D plots, geographic maps, interactive charts, and dashboards.

o Example:

import plotly.express as px

df = px.data.iris()
fig = px.scatter(df, x='sepal_width', y='sepal_length', color='species')
fig.show()
4. Bokeh:

o Description: Bokeh provides an elegant and concise way to create interactive
visualizations for modern web browsers.

o Capabilities: Interactive plots, dashboards, and data applications.

o Example:

from bokeh.plotting import figure, show

p = figure(title="Bokeh Plot Example", x_axis_label='x', y_axis_label='y')
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], legend_label="Line", line_width=2)
show(p)

5. Altair:

o Description: Altair is a declarative statistical visualization library based on Vega and
Vega-Lite, designed for creating simple yet powerful visualizations.

o Capabilities: Interactive visualizations with concise code.

o Example:

import altair as alt
import pandas as pd

data = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [3, 4, 5, 6, 7]
})

chart = alt.Chart(data).mark_line().encode(
    x='a',
    y='b'
)
chart.save('chart.html')  # or display `chart` directly in a notebook
Q5) Attempt any ONE of the following:

a) What is outlier? State types of outliers.

Sol:

Outlier: An outlier is a data point that significantly differs from other observations in a
dataset. It can indicate variability in the data, measurement errors, or experimental errors.
Outliers can skew statistical analyses and affect the accuracy of models.

Types of Outliers:

1. Global Outliers:

o Definition: Data points that deviate significantly from the rest of the dataset. These
outliers are also known as point outliers.

o Example: In a dataset of student grades, a score of 100 when most scores range
between 60-80.

2. Contextual Outliers:

o Definition: Data points that are outliers in a specific context or condition. These
outliers are context-dependent and may appear normal in other contexts.

o Example: A high temperature reading in a normally cold region during winter.

3. Collective Outliers:

o Definition: A group of data points that deviate significantly from the rest of the
dataset. These outliers may not be outliers individually but form a collective anomaly
when considered together.

o Example: A sudden spike in network traffic at a specific time, indicating a potential
cyber-attack.
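A common way to flag global outliers programmatically is the 1.5 × IQR rule; a minimal
NumPy sketch with made-up grades:

import numpy as np

data = np.array([62, 65, 68, 70, 71, 73, 75, 100])  # hypothetical grades

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])  # flags 100 as a global outlier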
b) State and explain any three data transformation techniques.

Sol:

1. Normalization:

o Description: Normalization is the process of scaling numeric data to a standard range,
typically between 0 and 1. It helps in bringing all features to the same scale, which
can improve the performance of machine learning algorithms.

o Example:

from sklearn.preprocessing import MinMaxScaler

data = [[100], [200], [300], [400], [500]]
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)
print(normalized_data)

2. Standardization:

o Description: Standardization involves scaling data to have a mean of 0 and a standard
deviation of 1. This technique is useful when the data has different units or scales.

o Example:

from sklearn.preprocessing import StandardScaler

data = [[100], [200], [300], [400], [500]]
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
print(standardized_data)
3. Log Transformation:

o Description: Log transformation is used to stabilize variance and make the data more
normally distributed. It is particularly useful for data that follows a skewed
distribution.

o Example:

import numpy as np

data = [1, 10, 100, 1000, 10000]
log_transformed_data = np.log(data)
print(log_transformed_data)
