FDS


Q1 Explain data science life cycle with suitable diagram?

Ans:- The Data Science Life Cycle is a structured approach to solving data-driven problems. It involves several interconnected stages:

1. Business Understanding: Clearly define the problem to be solved and the desired outcomes.
2. Data Acquisition: Gather relevant data from various sources, ensuring its quality and completeness.
3. Data Preparation: Cleanse, transform, and preprocess the data to make it suitable for analysis.
4. Exploratory Data Analysis (EDA): Use statistical techniques and visualizations to understand data patterns, relationships, and potential insights.
5. Model Building: Develop appropriate machine learning or statistical models to address the problem.
6. Model Evaluation: Assess the model's performance using relevant metrics and refine it as needed.
7. Deployment: Integrate the model into a production environment to make predictions or generate insights on new data.
8. Monitoring and Maintenance: Continuously monitor the model's performance and retrain it as necessary to adapt to changing conditions.

Q2 Explain the concept and use of data visualization.


Ans:- **Data visualization** is the graphical representation of data and information using visual elements
such as charts, graphs, and maps. It transforms complex datasets into intuitive visuals, making it easier to
identify patterns, trends, and outliers. By leveraging human cognitive abilities to process visual information
quickly, data visualization helps in better understanding and communication of data insights.

The primary uses of data visualization include:
- Exploratory Analysis: Identifying hidden patterns and relationships in data.
- Decision-Making: Supporting data-driven decisions by presenting findings clearly.
- Communication: Sharing insights effectively with stakeholders or audiences.
- Monitoring: Tracking metrics or performance indicators in real-time dashboards.

Data visualization tools, such as Tableau, Power BI, and Python libraries like Matplotlib and Seaborn, make it easier to interpret data and convey results effectively.
Q3 Write a short note on hypothesis testing.
Ans:- Hypothesis testing is a statistical method used to make decisions or draw conclusions about a
population based on sample data. It involves formulating two opposing hypotheses: the null hypothesis
(H₀), which represents no effect or no difference, and the alternative hypothesis (H₁), which suggests the
presence of an effect or difference. The process begins by defining these hypotheses and selecting an
appropriate significance level (commonly 0.05) to control the risk of rejecting the null hypothesis when it is
true. Data from the sample is analyzed using a test statistic, and the p-value is calculated to assess the
strength of the evidence against the null hypothesis. If the p-value is less than the significance level, the
null hypothesis is rejected in favor of the alternative.
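As a concrete illustration, here is a minimal sketch of a one-sample t-test with SciPy; the sample values and the hypothesized mean are made up for demonstration.

```python
# Minimal one-sample t-test sketch (illustrative, made-up measurements).
from scipy import stats

sample = [12.1, 11.8, 12.4, 12.0, 11.7, 12.3, 12.2, 11.9]  # hypothetical sample data
mu_0 = 12.5  # population mean claimed under the null hypothesis H0

t_stat, p_value = stats.ttest_1samp(sample, popmean=mu_0)

alpha = 0.05  # significance level
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
```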

Q4 Explain data visualization libraries in Python.


Ans:- Data visualization libraries in Python provide powerful tools to create visual representations of data,
making it easier to analyze and communicate insights. Some popular libraries include Matplotlib, Seaborn,
and Plotly, each offering unique features to suit various needs. Matplotlib is a versatile library that allows
users to create a wide range of static, animated, and interactive plots. It serves as a foundation for many
other libraries. Seaborn, built on top of Matplotlib, simplifies the creation of aesthetically pleasing and
informative statistical visualizations, such as heatmaps and violin plots. Plotly, on the other hand, focuses
on interactive and web-based visualizations, enabling users to create dynamic charts like 3D plots, bubble
charts, and maps.
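The short sketch below contrasts Matplotlib and Seaborn on the same data; it assumes Seaborn's bundled "tips" sample dataset is available.

```python
# Matplotlib vs. Seaborn on the same data (uses Seaborn's bundled "tips" dataset).
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Matplotlib: a basic scatter plot built element by element
axes[0].scatter(tips["total_bill"], tips["tip"])
axes[0].set_xlabel("total_bill")
axes[0].set_ylabel("tip")
axes[0].set_title("Matplotlib scatter")

# Seaborn: a statistical plot (scatter + regression line) in one call
sns.regplot(data=tips, x="total_bill", y="tip", ax=axes[1])
axes[1].set_title("Seaborn regplot")

plt.tight_layout()
plt.show()
```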
Q5 i) Define data science. ii) Explain any one technique of data transformation?
Ans:- Data Science is a multidisciplinary field that combines statistics, computer science, and domain
knowledge to extract meaningful insights and knowledge from structured and unstructured data. It
involves processes like data collection, cleaning, analysis, visualization, and modeling to solve complex
problems and support decision-making.
One common technique of data transformation is normalization, where data values are scaled to fit within
a specific range, such as 0 to 1. This is particularly useful when working with features of different units or
scales, as it ensures that all features contribute equally to machine learning models. Normalization
improves the performance and stability of algorithms that rely on distance metrics, such as k-nearest
neighbors and gradient descent methods.
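A minimal sketch of min-max normalization with NumPy, using made-up height values:

```python
# Min-max normalization: rescale values into the 0-1 range (illustrative data).
import numpy as np

heights_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])

normalized = (heights_cm - heights_cm.min()) / (heights_cm.max() - heights_cm.min())
print(normalized)  # [0.   0.25 0.5  0.75 1.  ]
```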

Q6 i) Write any two applications of data science. ii) Explain any one type of outlier in detail.
Ans:- Two Applications of Data Science:-
1. Recommendation Systems, 2. Fraud Detection
Univariate Outliers:- Univariate outliers are data points that deviate significantly from the overall pattern of a single variable. These outliers can be identified using various statistical methods:
Z-Score:- Measures how many standard deviations a data point is from the mean. Data points with a Z-score greater than a certain threshold (e.g., 3) are considered outliers.
Box Plot:- Visually identifies outliers as data points that fall outside the whiskers of the box plot.

Interquartile Range (IQR):- Outliers are defined as data points that lie below Q1 - 1.5*IQR or above Q3 +
1.5*IQR, where Q1 and Q3 are the first and third quartiles, respectively.
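The following sketch applies the Z-score and IQR rules to a small, made-up age list; a 2.5-standard-deviation cutoff is used here because the sample is tiny, while 3 is the more common threshold for larger datasets.

```python
# Z-score and IQR rules for univariate outliers (illustrative data).
import numpy as np

ages = np.array([15, 16, 17, 16, 15, 17, 16, 45])  # 45 looks anomalous

# Z-score rule: flag points more than 2.5 standard deviations from the mean
# (2.5 suits this tiny sample; 3 is a common cutoff for larger datasets).
z_scores = (ages - ages.mean()) / ages.std()
z_outliers = ages[np.abs(z_scores) > 2.5]

# IQR rule: flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
iqr_outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]

print(z_outliers, iqr_outliers)  # both rules flag 45 here
```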

Q7 Explain data cube aggregation method in context of data reduction.


Ans:- Data cube aggregation is a powerful technique for condensing large datasets into more manageable
forms. Imagine a multi-dimensional dataset like sales data, with dimensions like product category, region,
and time. Data cube aggregation involves summarizing this data along these dimensions, creating a
hierarchical structure. For instance, you might roll up daily sales data to weekly, monthly, or yearly totals.
This reduces the number of data points while retaining essential information. Additionally, you can drill
down from high-level summaries to more granular details, providing flexibility for analysis. By aggregating
data, we not only reduce storage requirements but also improve query performance. This technique is
crucial for data warehousing and business intelligence, enabling efficient exploration and analysis of vast
datasets.
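A minimal pandas sketch of this roll-up idea, aggregating hypothetical daily sales to monthly totals per region:

```python
# Roll-up sketch: daily sales aggregated to monthly totals per region
# (column names and values are hypothetical).
import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-15"]),
    "region": ["North", "North", "South", "North"],
    "amount": [100, 150, 200, 120],
})

# Roll up: from (date, region) detail to (month, region) summaries
monthly = (
    sales.groupby([sales["date"].dt.to_period("M"), "region"])["amount"]
    .sum()
    .reset_index()
)
print(monthly)
```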
Q8 Explain any four data visualization tools
Ans:- Data visualization tools are indispensable for transforming raw data into meaningful insights. Here are four popular tools:
1. Tableau: Renowned for its user-friendly interface and powerful capabilities, Tableau allows you to create interactive dashboards and visualizations with drag-and-drop simplicity.
2. Power BI: A versatile tool from Microsoft, Power BI excels in creating dynamic reports and visualizations. It seamlessly integrates with other Microsoft products, making it a popular choice for businesses.
3. Python Libraries (Matplotlib, Seaborn, Plotly): Python offers a rich ecosystem of data visualization libraries. Matplotlib is a foundational library for creating customizable plots, while Seaborn provides a higher-level interface for statistical visualizations.
4. Google Data Studio: A free and user-friendly tool, Google Data Studio allows you to create interactive dashboards and reports using data from various sources, including Google Analytics and Google Sheets. It's perfect for sharing insights with teams and clients.
Q9 What do you mean by Data attributes? Explain types of attributes with example.
Ans:- Data attributes are the characteristics or properties that describe a data object. They are like the features of
an object that help us understand its nature.

1. Nominal Attributes: These attributes represent categories or labels without any inherent order. Example: Gender (Male, Female), Eye Color (Blue, Brown, Green).
2. Ordinal Attributes: These attributes represent categories with a specific order or ranking. Example: Education Level (High School, Bachelor's, Master's, PhD), Product Rating (Poor, Fair, Good, Excellent).
3. Interval Attributes: These attributes represent numerical values with meaningful differences between them, but without a true zero point. Example: Temperature (Celsius, Fahrenheit), Year.
4. Ratio Attributes: These attributes represent numerical values with a true zero point, allowing for meaningful ratios and comparisons. Example: Height, Weight, Salary.
Q10 How do you visualize geospatial data? Explain in detail?
Ans:- Geospatial data visualization involves representing geographic data on maps to reveal patterns, trends, and relationships. Here are some common techniques:
1. Choropleth Maps: These maps use color or shading to represent data values across geographic regions. For example, a choropleth map could show population density by county, with darker shades indicating higher density.
2. Dot Density Maps: These maps use dots to represent the frequency or intensity of a phenomenon. The more dots in a region, the higher the concentration. For example, a dot density map could show the distribution of hospitals across a state.
3. Heat Maps: Heat maps use color gradients to represent data density. Warmer colors indicate higher density, while cooler colors indicate lower density. Heat maps are useful for visualizing point data, such as crime incidents or traffic accidents.
4. Flow Maps: Flow maps use lines or arrows to represent movement or flow between locations. The thickness of the lines or arrows can indicate the magnitude of the flow. For example, a flow map could show migration patterns between countries.
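As a hedged sketch, the snippet below draws a simple choropleth with Plotly Express, assuming its bundled gapminder sample dataset is available:

```python
# Choropleth sketch with Plotly Express (uses its bundled gapminder sample data).
import plotly.express as px

df = px.data.gapminder().query("year == 2007")

fig = px.choropleth(
    df,
    locations="iso_alpha",   # ISO-3 country codes
    color="lifeExp",         # value mapped to the color scale
    hover_name="country",
    color_continuous_scale="Viridis",
)
fig.show()
```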
Q12 What do you mean by data transformation? Explain strategies of data transformation.
Ans:- Data transformation involves converting raw data into a suitable format for analysis. This often
includes cleaning, integrating, and manipulating data.
Strategies for Data Transformation:-

1. Data Cleaning: Handles missing values, outliers, and inconsistencies.
2. Data Integration: Combines data from multiple sources into a unified dataset.
3. Data Reduction: Reduces data volume while preserving information.
4. Data Discretization: Divides continuous attributes into intervals (see the sketch below).
5. Data Normalization: Scales data to a common range.
6. Data Aggregation: Combines data into summary values.
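As noted in strategy 4, here is a brief sketch of discretization with pandas, binning made-up ages into labeled intervals:

```python
# Discretization sketch: continuous ages binned into labeled intervals (illustrative data).
import pandas as pd

ages = pd.Series([22, 35, 47, 58, 63, 71])

age_groups = pd.cut(ages, bins=[0, 30, 50, 70, 100],
                    labels=["young", "adult", "middle-aged", "senior"])
print(age_groups.tolist())  # ['young', 'adult', 'adult', 'middle-aged', 'middle-aged', 'senior']
```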
Q13 Explain role of statistics in data science?

Ans:- Statistics plays a crucial role in data science as it provides methods to analyze, summarize, and make inferences from data. It helps in understanding the underlying patterns and relationships in the data,
creating predictive models, and making data-driven decisions. Statistics provides techniques to clean, pre-
process, and transform data into a form suitable for analysis. It also provides tools for hypothesis testing,
estimation of parameters, and evaluation of model performance. Additionally, it helps in identifying
outliers, detecting and handling missing values, and performing dimensionality reduction.
Q14 Explain two methods of data cleaning for missing values?
Ans:- 1. Mean/Median Imputation: In this method, missing values are replaced with either the mean or median of the non-missing values in the same column. This method is suitable for continuous variables and assumes that the missing values are missing at random and have a similar distribution to the non-missing values.
2. Multiple Imputation: In this method, multiple datasets are created by imputing the missing values using a statistical model such as regression or clustering. Each dataset is analyzed and the results are combined to account for the uncertainty in imputed values. This method is more robust and provides more accurate results compared to mean/median imputation, especially when the missing values are not missing at random.
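A minimal pandas sketch of mean/median imputation on a made-up column (multiple imputation typically relies on dedicated tooling and is not shown here):

```python
# Mean/median imputation sketch (illustrative, made-up income column).
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [30000, 45000, np.nan, 52000, np.nan, 61000]})

df["income_mean_imputed"] = df["income"].fillna(df["income"].mean())
df["income_median_imputed"] = df["income"].fillna(df["income"].median())
print(df)
```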
Q15 Explain any two tools in data scientist toolbox?

Ans:- 1. Jupyter Notebook: Jupyter Notebook is a web-based interactive computational environment that allows data scientists to create and share documents that contain live code, equations, visualizations, and narrative text. It provides an easy way to perform data analysis and prototyping, and is widely used in data science and machine learning projects.
2. Pandas: Pandas is a fast, flexible, and powerful open-source data analysis and data manipulation library for Python. It provides data structures for efficiently storing large datasets and tools for working with them, including data cleaning, filtering, grouping, and aggregating. Pandas makes it easy to manipulate and analyse large datasets, and is an essential tool in the data scientist's toolbox.
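A tiny pandas sketch of a typical workflow; the file name and column names are hypothetical:

```python
# Typical pandas workflow sketch (file name and columns are hypothetical).
import pandas as pd

df = pd.read_csv("sales.csv")          # load a dataset
df = df.dropna(subset=["amount"])      # cleaning: drop rows with missing amounts
big = df[df["amount"] > 1000]          # filtering
summary = big.groupby("region")["amount"].agg(["count", "mean"])  # grouping/aggregating
print(summary)
```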
Q16 What are the different methods for measuring the data dispersion?
Ans:- Measures of dispersion include:
1. *Range*: Difference between the highest and lowest values.
2. *Variance*: Average squared difference from the mean.

3. *Standard Deviation*: Square root of variance, indicating data spread.


4. *Interquartile Range (IQR)*: Range within the middle 50% of data.
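A short NumPy sketch computing all four measures on illustrative data:

```python
# Dispersion measures on a small illustrative dataset.
import numpy as np

data = np.array([4, 8, 15, 16, 23, 42])

value_range = data.max() - data.min()
variance = data.var()                      # population variance
std_dev = data.std()                       # population standard deviation
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                              # interquartile range

print(value_range, variance, std_dev, iqr)
```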
Q17 What are the measures of central tendency? Explain any two of them in brief.
Ans:- Measures of central tendency are statistical tools that help identify the central or typical value of a dataset. They provide a single representative value that summarizes the overall distribution of the data. The three primary measures of central tendency are the mean, median, and mode.
1. Mean: The mean, also known as the average, is calculated by summing all the values in a dataset and dividing the sum by the total number of values. It represents the arithmetic center of the data. However, the mean can be heavily influenced by outliers, making it less reliable for skewed distributions.
2. Median: The median is the middle value in a dataset when the data is arranged in ascending or descending order. It is less affected by outliers compared to the mean. If the dataset has an even number of values, the median is the average of the two middle values.
Q18 What are the various types of data available?
Ans:- In the foundation of data science, data can be classified into four primary types based on their characteristics and nature:
1. Nominal Data: This is categorical data where the categories have no inherent order or ranking. Examples include gender (male, female). It is used for labeling without any quantitative value.
2. Ordinal Data: This type involves categories with a meaningful order or ranking, but the differences between them are not measurable. Examples include education levels (high school, bachelor's, master's).
3. Interval Data: This is numerical data where the intervals between values are consistent, but there is no true zero point. An example is temperature in Celsius, where zero does not indicate an absence of temperature.
4. Ratio Data: This is numerical data that has consistent intervals and a true zero point, allowing for meaningful comparisons and calculations. Examples include height and weight.
Q19 What is a Venn diagram? How to create it? Explain with example.
Ans:- A Venn diagram is a visual representation used to show the relationships and overlaps between different sets of data. It consists of overlapping circles, where each circle represents a specific set, and the overlapping regions indicate the elements that belong to multiple sets.
How to Create a Venn Diagram:
1. Identify the Sets: Determine the groups or categories you want to compare.
2. Draw Circles: Represent each set with a circle. Ensure overlaps where relationships exist.
3. Label the Circles: Assign appropriate names or labels to each set.
4. Fill in Elements: Place elements in their respective regions based on their membership (unique to one set, common to multiple sets, etc.).
For example, with two sets:
Set A (fruits): {apple, banana, orange}
Set B (citrus fruits): {orange, lemon}
The Venn diagram would show "apple" and "banana" in the non-overlapping part of the fruits circle, "lemon" in the non-overlapping part of the citrus circle, and "orange" in the overlapping region shared by both sets.
Q Explain different data formats in brief
Ans:- 1. *CSV (Comma-Separated Values)*: Plain text file format for tabular data.

2. *JSON (JavaScript Object Notation)*: Format for structured data, often used in web applications.
3. *XML (eXtensible Markup Language)*: Hierarchical data format, common in data exchange.
4. *SQL*: Structured Query Language, used for relational databases.
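A brief pandas sketch of reading each format; the file and table names are placeholders:

```python
# Reading tabular data from different formats (file and table names are placeholders).
import sqlite3
import pandas as pd

df_csv = pd.read_csv("records.csv")      # CSV: plain-text, comma-separated rows
df_json = pd.read_json("records.json")   # JSON: nested key-value documents
df_xml = pd.read_xml("records.xml")      # XML: hierarchical markup (pandas >= 1.3)

conn = sqlite3.connect("records.db")     # SQL: query a relational database
df_sql = pd.read_sql("SELECT * FROM records", conn)
```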
Q What is data quality? Which factors affect data quality?
Ans:- Data quality refers to the condition of a dataset and its suitability for use in analysis, decision-making, and
problem-solving. High-quality data is accurate, consistent, complete, and relevant, enabling reliable insights and
outcomes. Poor data quality can lead to incorrect conclusions, inefficiencies, and flawed decisions.

Factors Affecting Data Quality:
1. Accuracy: The degree to which data correctly represents real-world entities or events.
2. Completeness: Ensuring no missing or incomplete data values.
3. Consistency: Data should be uniform across datasets without conflicting values.
4. Timeliness: Data should be up-to-date and available when needed.
Types of Data: Nominal Data, Ordinal Data, Interval Data, Ratio Data.
Q What is an outlier? State the types of outliers.
Ans:- An outlier is a data point that significantly deviates from the overall pattern or trend of a dataset. It appears
unusually large or small compared to other values, often indicating anomalies, errors, or rare events. Outliers can
influence statistical analyses, such as mean and standard deviation, and may require special handling to ensure
accurate insights.

Types of Outliers:
1. Global Outliers (Point Outliers): These are individual data points that are distinctly different from
the entire dataset. For example, in a dataset of student ages (15, 16, 17, 45), the value 45 is a global
outlier.
2. Contextual Outliers: These data points are unusual within a specific context but may appear normal
in a broader sense. For instance, 30°C could be an outlier in a winter dataset but typical in summer.
3. Collective Outliers: A group of data points collectively behaves unusually compared to the rest of
the dataset. For example, a sudden spike in website traffic due to a marketing campaign forms a
collective outlier.
Q State and explain any three data transformation techniques?
Ans:- Data transformation techniques are methods used to convert data into a format suitable for analysis, ensuring
consistency and improving the quality of insights.

Normalization: This technique rescales data to a standard range, usually between 0 and 1, to prevent
features with larger values from dominating the analysis. It is often used when the data has different units
or scales.
Standardization: Standardization transforms data to have a mean of 0 and a standard deviation of 1. This is
useful when data has different units and helps algorithms that assume normally distributed data, such as
linear regression.

Log Transformation: Log transformation is applied to skewed data to make it more normal. By taking the
logarithm of values, it reduces the impact of large values and makes the data more evenly distributed.
These techniques help improve the performance of data analysis models by making data more comparable
and reducing bias.
Q Explain outlier detection methods in brief.
Ans:- Outlier detection methods identify data points that differ significantly from the rest of the dataset. Here are
three common techniques:

Z-Score Method: This calculates how many standard deviations a point is from the mean. Points with a Z-
score greater than 2.5 or 3 are considered outliers. It works well for datasets with a normal distribution.
Local Outlier Factor (LOF): LOF measures the density of a point relative to its neighbors. Points with lower
density than their neighbors are flagged as outliers. It’s effective for datasets with varying densities.
Isolation Forest: This algorithm isolates outliers by randomly selecting features and creating decision trees.
Outliers require fewer splits to be isolated. It works well in high-dimensional datasets and doesn’t assume
a specific data distribution.
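A brief scikit-learn sketch of LOF and Isolation Forest on made-up 2-D points with one obvious outlier:

```python
# LOF and Isolation Forest sketch; the 2-D points are made up, with one outlier at (10, 10).
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0], [10, 10]])

lof_labels = LocalOutlierFactor(n_neighbors=2).fit_predict(X)   # -1 marks outliers
iso_labels = IsolationForest(random_state=0).fit_predict(X)     # -1 marks outliers

print(lof_labels)   # e.g. [ 1  1  1  1 -1]
print(iso_labels)
```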
Q Write different data visualization libraries in Python.
Ans:- In Python, several data visualization libraries are available to create insightful plots and graphs:
Matplotlib: The most widely used library for static visualizations, such as line charts, bar graphs, and
scatter plots. It provides full control over plot formatting and customization. Seaborn: Built on top of
Matplotlib, Seaborn offers easier-to-use functions for creating attractive, informative visualizations like
heatmaps, pair plots, and categorical plots. Plotly: Known for creating interactive plots, Plotly supports a
wide range of charts like 3D plots and geographical maps, allowing for web-based visualization.
Altair: A declarative statistical visualization library that simplifies the creation of complex visualizations
with concise code and supports interactive features.
Q Explain 3V’s of Data Science?
Ans:- The 3 V's of Data Science refer to key characteristics of data that impact analysis and processing:

Volume: Refers to the vast amount of data generated every day. With the rise of big data, organizations now deal with terabytes or even petabytes of information, requiring advanced storage and processing techniques.
Velocity: Describes the speed at which data is generated and needs to be processed. With real-time data streams from sensors, social media, and transactions, quick analysis is essential for timely decision-making.
Variety: Represents the different types and formats of data, including structured, semi-structured, and unstructured data. It encompasses text, images, videos, and sensor data, which require diverse approaches for processing and analysis.
