FDS
Q Explain the Data Science Life Cycle?
Ans:- The Data Science Life Cycle is a structured approach to solving data-driven problems. It involves several interconnected stages:
1. Business Understanding: Clearly define the problem to be solved and the desired outcomes.
2. Data Acquisition: Gather relevant data from various sources, ensuring its quality and completeness.
3. Data Preparation: Cleanse, transform, and preprocess the data to make it suitable for analysis.
4. Exploratory Data Analysis (EDA): Use statistical techniques and visualizations to understand data patterns, relationships, and potential insights.
5. Model Building: Develop appropriate machine learning or statistical models to address the problem.
6. Model Evaluation: Assess the performance of the model using relevant metrics and refine it as needed.
7. Deployment: Integrate the model into a production environment to make predictions or generate insights on new data.
8. Monitoring and Maintenance: Continuously monitor the model's performance and retrain it as necessary to adapt to changing conditions.
Q What are the primary uses of data visualization?
Ans:- The primary uses of data visualization include:
- Exploratory Analysis: Identifying hidden patterns and relationships in data.
- Decision-Making: Supporting data-driven decisions by presenting findings clearly.
- Communication: Sharing insights effectively with stakeholders or audiences.
- Monitoring: Tracking metrics or performance indicators in real-time dashboards.
Data visualization tools, such as Tableau, Power BI, and Python libraries like Matplotlib and Seaborn, make it easier to interpret data and convey results effectively.
Q3 Write a short note on hypothesis testing.
Ans:- Hypothesis testing is a statistical method used to make decisions or draw conclusions about a
population based on sample data. It involves formulating two opposing hypotheses: the null hypothesis
(H₀), which represents no effect or no difference, and the alternative hypothesis (H₁), which suggests the
presence of an effect or difference. The process begins by defining these hypotheses and selecting an
appropriate significance level (commonly 0.05) to control the risk of rejecting the null hypothesis when it is
true. Data from the sample is analyzed using a test statistic, and the p-value is calculated to assess the
strength of the evidence against the null hypothesis. If the p-value is less than the significance level, the
null hypothesis is rejected in favor of the alternative.
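As a minimal sketch of this workflow, a one-sample t-test can be run with SciPy; the sample values and the hypothesized mean of 50 below are made up purely for illustration.

```python
from scipy import stats

sample = [48.2, 51.5, 49.8, 50.9, 47.6, 52.3, 49.1, 50.4]  # hypothetical measurements
alpha = 0.05  # significance level

# H0: population mean = 50, H1: population mean != 50
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value < alpha:
    print("Reject H0: the sample mean differs significantly from 50.")
else:
    print("Fail to reject H0: no significant difference from 50.")
```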
Q6 i) Write any two applications of data science. ii) Explain any one type of outlier in detail.
Ans:- i) Two Applications of Data Science: 1. Recommendation Systems, 2. Fraud Detection.
ii) Univariate Outliers: Univariate outliers are data points that deviate significantly from the overall pattern of a single variable. These outliers can be identified using various statistical methods:
Z-Score: Measures how many standard deviations a data point is from the mean. Data points with a Z-score greater than a certain threshold (e.g., 3) are considered outliers.
Box Plot: Visually identifies outliers as data points that fall outside the whiskers of the box plot.
Interquartile Range (IQR): Outliers are defined as data points that lie below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, where Q1 and Q3 are the first and third quartiles, respectively.
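The two numerical rules above can be expressed directly with NumPy; this is a small illustrative sketch on synthetic data, not tied to any particular dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic univariate sample: 200 values around 50 plus one extreme point.
data = np.append(rng.normal(loc=50, scale=5, size=200), 120.0)

# Z-score rule: flag points more than 3 standard deviations from the mean.
z_scores = (data - data.mean()) / data.std()
print("Z-score outliers:", data[np.abs(z_scores) > 3])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mask = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
print("IQR outliers:", data[mask])
```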
Q Explain the types of attributes in data with examples?
Ans:- 1. Nominal Attributes: These attributes represent categories or labels without any inherent order. Example: Gender (Male, Female), Eye Color (Blue, Brown, Green).
2. Ordinal Attributes: These attributes represent categories with a specific order or ranking. Example: Education Level (High School, Bachelor's, Master's, PhD), Product Rating (Poor, Fair, Good, Excellent).
3. Interval Attributes: These attributes represent numerical values with meaningful differences between them, but without a true zero point. Example: Temperature (Celsius, Fahrenheit), Year.
4. Ratio Attributes: These attributes represent numerical values with a true zero point, allowing for meaningful ratios and comparisons. Example: Height, Weight, Salary.
Q10 How do you visualize geospatial data? Explain in detail?
Ans:- Geospatial data visualization involves representing geographic data on maps to reveal patterns, trends, and relationships. Here are some common techniques:
1. Choropleth Maps: These maps use color or shading to represent data values across geographic regions. For example, a choropleth map could show population density by county, with darker shades indicating higher density.
2. Dot Density Maps: These maps use dots to represent the frequency or intensity of a phenomenon. The more dots in a region, the higher the concentration. For example, a dot density map could show the distribution of hospitals across a state.
3. Heat Maps: Heat maps use color gradients to represent data density. Warmer colors indicate higher density, while cooler colors indicate lower density. Heat maps are useful for visualizing point data, such as crime incidents or traffic accidents.
4. Flow Maps: Flow maps use lines or arrows to represent movement or flow between locations. The thickness of the lines or arrows can indicate the magnitude of the flow. For example, a flow map could show migration patterns between countries.
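As a brief sketch, a choropleth map can be drawn with Plotly Express; the state codes and population figures below are illustrative placeholders, not real data.

```python
import plotly.express as px

# Illustrative values only: population (in millions) for three US states.
states = ["CA", "TX", "NY"]
population = [39.0, 30.0, 19.5]

fig = px.choropleth(
    locations=states,
    locationmode="USA-states",  # interpret the codes as US state abbreviations
    color=population,
    scope="usa",
    labels={"color": "Population (millions)"},
)
fig.show()
```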
Q12 What do you mean by data transformation? Explain strategies of data transformation.
Ans:- Data transformation involves converting raw data into a suitable format for analysis. This often includes cleaning, integrating, and manipulating data.
Strategies for Data Transformation:
1. Data Cleaning: Handles missing values, outliers, and inconsistencies.
2. Data Integration: Combines data from multiple sources into a unified dataset.
3. Data Reduction: Reduces data volume while preserving information.
4. Data Discretization: Divides continuous attributes into intervals.
5. Data Normalization: Scales data to a common range.
6. Data Aggregation: Combines data into summary values.
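A short Pandas sketch of three of these strategies (normalization, discretization, and aggregation) on a made-up dataset:

```python
import pandas as pd

# Made-up dataset for illustration.
df = pd.DataFrame({
    "city": ["Pune", "Pune", "Mumbai", "Mumbai", "Delhi"],
    "income": [42000, 55000, 61000, 38000, 72000],
})

# Normalization: min-max scale income to the [0, 1] range.
df["income_scaled"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# Discretization: divide the continuous income attribute into three labelled intervals.
df["income_band"] = pd.cut(df["income"], bins=3, labels=["low", "medium", "high"])

# Aggregation: combine rows into summary values per city.
summary = df.groupby("city")["income"].mean()

print(df)
print(summary)
```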
Q13 Explain the role of statistics in data science?
Ans:- Statistics plays a crucial role in data science as it provides methods to analyze, summarize, and make
inferences from data. It helps in understanding the underlying patterns and relationships in the data,
creating predictive models, and making data-driven decisions. Statistics provides techniques to clean, pre-
process, and transform data into a form suitable for analysis. It also provides tools for hypothesis testing,
estimation of parameters, and evaluation of model performance. Additionally, it helps in identifying
outliers, detecting and handling missing values, and performing dimensionality reduction.
Q14 Explain two methods of data cleaning for missing values?
Ans:- 1. Mean/Median Imputation: In this method, missing values are replaced with either the mean or median of the non-missing values in the same column. This method is suitable for continuous variables and assumes that the missing values are missing at random and have a similar distribution to the non-missing values.
2. Multiple Imputation: In this method, multiple datasets are created by imputing the missing values using a statistical model such as regression or clustering. Each dataset is analyzed and the results are combined to account for the uncertainty in imputed values. This method is more robust and provides more accurate results compared to mean/median imputation, especially when the missing values are not missing at random.
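A minimal sketch of median imputation with Pandas, using a made-up column with missing entries (multiple imputation typically relies on dedicated statistical libraries and is not shown here):

```python
import numpy as np
import pandas as pd

# Made-up column containing missing values.
df = pd.DataFrame({"age": [23, 25, np.nan, 31, 29, np.nan, 40]})

# Median imputation: replace missing values with the median of the observed values.
median_age = df["age"].median()
df["age_imputed"] = df["age"].fillna(median_age)

print(df)
```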
Q15 Explain any two tools in the data scientist's toolbox?
Ans:- 1. Jupyter Notebook: Jupyter Notebook is a web-based interactive computational environment that allows data scientists to create and share documents that contain live code, equations, visualizations, and narrative text. It provides an easy way to perform data analysis and prototyping, and is widely used in data science and machine learning projects.
2. Pandas: Pandas is a fast, flexible, and powerful open-source data analysis and data manipulation library for Python. It provides data structures for efficiently storing large datasets and tools for working with them, including data cleaning, filtering, grouping, and aggregating. Pandas makes it easy to manipulate and analyze large datasets, and is an essential tool in the data scientist's toolbox.
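A brief illustrative Pandas example of the cleaning, filtering, grouping, and aggregating operations mentioned above, on a made-up sales table:

```python
import pandas as pd

# Made-up sales records.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "amount": [250, 400, None, 150, 300],
})

# Cleaning: drop rows with missing amounts.
clean = sales.dropna(subset=["amount"])

# Filtering: keep only larger transactions.
large = clean[clean["amount"] > 200]

# Grouping and aggregating: total amount per region.
totals = large.groupby("region")["amount"].sum()
print(totals)
```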
Q16 What are the different methods for measuring data dispersion?
Ans:- Measures of dispersion include:
1. Range: Difference between the highest and lowest values.
2. Variance: Average squared difference from the mean.
3. Standard Deviation: Square root of the variance, expressed in the same units as the data.
4. Interquartile Range (IQR): Difference between the third quartile (Q3) and the first quartile (Q1).
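These measures can be computed directly with NumPy; the data values below are made up for illustration.

```python
import numpy as np

data = np.array([12, 15, 14, 10, 18, 20, 16, 11])  # made-up values

data_range = data.max() - data.min()   # Range
variance = data.var()                  # Variance (population)
std_dev = data.std()                   # Standard deviation
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                          # Interquartile range

print(f"Range={data_range}, Variance={variance:.2f}, Std={std_dev:.2f}, IQR={iqr:.2f}")
```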
Q List some common data formats and standards used in data science?
Ans:- 1. JSON (JavaScript Object Notation): Format for structured data, often used in web applications.
2. XML (eXtensible Markup Language): Hierarchical data format, common in data exchange.
3. SQL: Structured Query Language, used for relational databases.
Q What is data quality? Which factors affect data quality?
Ans:- Data quality refers to the condition of a dataset and its suitability for use in analysis, decision-making, and
problem-solving. High-quality data is accurate, consistent, complete, and relevant, enabling reliable insights and
outcomes. Poor data quality can lead to incorrect conclusions, inefficiencies, and flawed decisions.
Factors Affecting Data Quality:
1. Accuracy: The degree to which data correctly represents real-world entities or events.
2. Completeness: Ensuring no missing or incomplete data values.
3. Consistency: Data should be uniform across datasets without conflicting values.
4. Timeliness: Data should be up-to-date and available when needed.
Types of Data: Nominal Data, Ordinal Data, Interval Data, Ratio Data.
Q What is an outlier? State the types of outliers?
Ans:- An outlier is a data point that significantly deviates from the overall pattern or trend of a dataset. It appears
unusually large or small compared to other values, often indicating anomalies, errors, or rare events. Outliers can
influence statistical analyses, such as mean and standard deviation, and may require special handling to ensure
accurate insights.
Types of Outliers:
1. Global Outliers (Point Outliers): These are individual data points that are distinctly different from
the entire dataset. For example, in a dataset of student ages (15, 16, 17, 45), the value 45 is a global
outlier.
2. Contextual Outliers: These data points are unusual within a specific context but may appear normal
in a broader sense. For instance, 30°C could be an outlier in a winter dataset but typical in summer.
3. Collective Outliers: A group of data points collectively behaves unusually compared to the rest of
the dataset. For example, a sudden spike in website traffic due to a marketing campaign forms a
collective outlier.
Q State and explain any three data transformation techniques?
Ans:- Data transformation techniques are methods used to convert data into a format suitable for analysis, ensuring
consistency and improving the quality of insights.
Normalization: This technique rescales data to a standard range, usually between 0 and 1, to prevent
features with larger values from dominating the analysis. It is often used when the data has different units
or scales.
Standardization: Standardization transforms data to have a mean of 0 and a standard deviation of 1. This is
useful when data has different units and helps algorithms that assume normally distributed data, such as
linear regression.
Log Transformation: Log transformation is applied to skewed data to make it more normal. By taking the
logarithm of values, it reduces the impact of large values and makes the data more evenly distributed.
These techniques help improve the performance of data analysis models by making data more comparable
and reducing bias.
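A small NumPy sketch of the three techniques applied to a made-up, right-skewed array:

```python
import numpy as np

# Made-up, right-skewed values (e.g. incomes).
x = np.array([20000, 25000, 27000, 30000, 32000, 250000], dtype=float)

# Normalization: rescale to the [0, 1] range.
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization: mean 0, standard deviation 1.
x_std = (x - x.mean()) / x.std()

# Log transformation: compress large values to reduce skew.
x_log = np.log(x)

print("Normalized:", np.round(x_norm, 3))
print("Standardized:", np.round(x_std, 3))
print("Log-transformed:", np.round(x_log, 3))
```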
Q Explain outlier detection methods in brief.
Ans:- Outlier detection methods identify data points that differ significantly from the rest of the dataset. Here are
three common techniques:
Z-Score Method: This calculates how many standard deviations a point is from the mean. Points with a Z-
score greater than 2.5 or 3 are considered outliers. It works well for datasets with a normal distribution.
Local Outlier Factor (LOF): LOF measures the density of a point relative to its neighbors. Points with lower
density than their neighbors are flagged as outliers. It’s effective for datasets with varying densities.
Isolation Forest: This algorithm isolates outliers by randomly selecting features and creating decision trees.
Outliers require fewer splits to be isolated. It works well in high-dimensional datasets and doesn’t assume
a specific data distribution.
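A hedged sketch of the last two methods using scikit-learn's LocalOutlierFactor and IsolationForest on synthetic two-dimensional data; the n_neighbors and contamination settings are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Synthetic data: a dense cluster plus a few far-away points.
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=6.0, high=10.0, size=(5, 2))
X = np.vstack([inliers, outliers])

# LOF: compares each point's local density with that of its neighbours (-1 = outlier).
lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)

# Isolation Forest: isolates points with random splits; outliers need fewer splits (-1 = outlier).
iso_labels = IsolationForest(contamination=0.05, random_state=0).fit_predict(X)

print("LOF flagged:", int((lof_labels == -1).sum()), "points")
print("Isolation Forest flagged:", int((iso_labels == -1).sum()), "points")
```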
Q Write different data visualization libraries in Python.
Ans:- In Python, several data visualization libraries are available to create insightful plots and graphs:
Matplotlib: The most widely used library for static visualizations, such as line charts, bar graphs, and scatter plots. It provides full control over plot formatting and customization.
Seaborn: Built on top of Matplotlib, Seaborn offers easier-to-use functions for creating attractive, informative visualizations like heatmaps, pair plots, and categorical plots.
Plotly: Known for creating interactive plots, Plotly supports a wide range of charts like 3D plots and geographical maps, allowing for web-based visualization.
Altair: A declarative statistical visualization library that simplifies the creation of complex visualizations with concise code and supports interactive features.
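A short Matplotlib and Seaborn sketch on randomly generated data, just to show the typical calling style of these two libraries:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(1)
values = rng.normal(loc=0, scale=1, size=500)  # random data for illustration

# Matplotlib: a basic scatter plot with manual labelling.
plt.figure()
plt.scatter(range(len(values)), values, s=10)
plt.xlabel("Index")
plt.ylabel("Value")
plt.title("Matplotlib scatter plot")

# Seaborn: a histogram with a density curve in one call.
plt.figure()
sns.histplot(values, kde=True)
plt.title("Seaborn histogram")

plt.show()
```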
Q Explain the 3 V's of Data Science?
Ans:- The 3 V's of Data Science refer to key characteristics of data that impact analysis and processing:
Volume: Refers to the vast amount of data generated every day. With the rise of big data, organizations now deal with terabytes or even petabytes of information, requiring advanced storage and processing techniques.
Velocity: This describes the speed at which data is generated and needs to be processed. With real-time data streams from sensors, social media, and transactions, quick analysis is essential for timely decision-making.
Variety: Represents the different types and formats of data, including structured, semi-structured, and unstructured data. It encompasses text, images, videos, and sensor data, which require diverse approaches for processing and analysis.