
1. Why is Data Science important?

• Data Science helps extract meaningful insights from massive amounts of structured and
unstructured data.

• It aids in decision-making, predictive analytics, and automation.

• Applications include healthcare (diagnostics), business (customer behavior analysis), finance (fraud detection), and more.

2. Differentiate nominal and ordinal data.

Nominal Data:

• Represents categories without any order.

• Example: Colors (Red, Blue, Green).

• Cannot perform arithmetic operations.

Ordinal Data:

• Represents categories with a defined order.

• Example: Education levels (High School, Bachelor's, Master's).

• Can rank values but not quantify differences.

3. How to handle missing values in a dataset?

• Delete Rows/Columns: If the missing data is minimal and non-critical.

• Imputation:

o Replace with mean, median, or mode.

o Use regression or KNN to predict missing values.

• Flag Missing Data: Create a separate category for missing values.

• Use advanced techniques like machine learning models to estimate missing values.
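The imputation step can be sketched in plain Python (the list and its values are made up for illustration; `None` marks a missing entry):

```python
from statistics import mean, median

# Toy column with missing values; numbers are illustrative.
ages = [25, 30, None, 50, None, 35]

observed = [x for x in ages if x is not None]

# Mean imputation: replace each gap with the mean of observed values.
mean_filled = [x if x is not None else mean(observed) for x in ages]

# Median imputation: more robust when the column contains outliers.
median_filled = [x if x is not None else median(observed) for x in ages]

print(mean_filled)    # gaps filled with 35.0
print(median_filled)  # gaps filled with 32.5
```

In practice a library such as pandas would handle this at DataFrame scale, but the logic is the same.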

4. Differentiate the operations of descriptive and inferential statistics.

Descriptive Statistics:

• Summarizes and describes data.

• Example: Mean, Median, Mode.

• Does not go beyond the data analyzed.

Inferential Statistics:

• Makes predictions or inferences about a population.

• Example: Hypothesis testing, confidence intervals.

• Uses sample data to draw conclusions about a larger group.
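The descriptive side is easy to demonstrate with Python's standard library on a made-up sample:

```python
from statistics import mean, median, mode

# A small made-up sample; descriptive statistics summarize it directly.
data = [2, 3, 3, 5, 7, 9, 3]

print(mean(data))    # ≈ 4.57
print(median(data))  # 3
print(mode(data))    # 3
```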

5. Logistic regression is not preferred for prediction. Justify.

• Logistic regression assumes a linear relationship between independent variables and the log
odds, which may not hold true in complex datasets.
• It cannot capture intricate patterns or non-linear relationships as well as machine learning
models like Random Forests or Neural Networks.

• It may struggle with overfitting or underfitting, depending on the dataset.

6. State Bayes' Theorem.

Bayes' Theorem provides a way to calculate the probability of an event based on prior knowledge:

P(A|B) = P(B|A) · P(A) / P(B)

Where:

• P(A|B): Probability of event A given B.

• P(B|A): Probability of event B given A.

• P(A): Prior probability of A.

• P(B): Prior probability of B.
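A worked example with hypothetical numbers (a diagnostic test) shows how the pieces combine:

```python
# Hypothetical numbers for a diagnostic test.
p_disease = 0.01            # P(A): prior probability of disease
p_pos_given_disease = 0.95  # P(B|A): test sensitivity
p_pos_given_healthy = 0.05  # false positive rate

# P(B): total probability of a positive test (law of total probability).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # 0.161
```

Even with a sensitive test, the low prior keeps the posterior probability modest.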

7. How is the maximum marginal hyperplane determined in SVM?

• The maximum marginal hyperplane (MMH) is the decision boundary that maximizes the
margin (distance) between classes.

• SVM optimizes the MMH by solving a convex optimization problem using techniques like
Lagrange multipliers.

• Kernel functions (e.g., linear, polynomial, RBF) can be used to handle non-linear data.

8. Compare the working of decision trees and random forests.

Decision Tree:

• Single tree structure used for decision-making.

• Prone to overfitting.

• Faster to train and interpret.

Random Forest:

• Collection of multiple decision trees (ensemble).

• Reduces overfitting by averaging multiple trees.

• Slower but more accurate.

9. Mention any four different types of plots used for data visualization.

1. Bar Plot: Compare categorical data.

2. Scatter Plot: Visualize relationships between two variables.

3. Histogram: Show frequency distribution of a dataset.

4. Box Plot: Summarize data distribution and detect outliers.


10. List any four common data science applications used in real time.

1. Healthcare: Disease prediction and diagnostics.

2. Finance: Fraud detection and credit scoring.

3. Retail: Recommendation systems and inventory management.

4. Transportation: Route optimization and autonomous vehicles.

11. What is outlier analysis?

Outlier analysis involves identifying data points that deviate significantly from the majority of the
dataset.

• Causes: Data entry errors, fraud, or true variability.

• Techniques: Z-score, IQR (Interquartile Range), or machine learning-based anomaly detection.
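The Z-score and IQR techniques can be sketched with the standard library (the data points are made up, with one planted outlier):

```python
from statistics import mean, stdev, quantiles

data = [10, 12, 11, 13, 12, 95]  # 95 is an obvious outlier

# Z-score method: flag points more than 2 standard deviations from the mean.
m, s = mean(data), stdev(data)
z_outliers = [x for x in data if abs(x - m) / s > 2]

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1
iqr_outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(z_outliers, iqr_outliers)  # both methods flag 95
```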

12. Define data wrangling.

Data wrangling is the process of cleaning, transforming, and preparing raw data into a usable format.
Steps include:

• Removing duplicates.

• Handling missing values.

• Converting data types.

• Normalizing or scaling data for analysis.
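The steps above can be sketched end to end on a few made-up records (field names and values are illustrative; a real pipeline would typically use pandas):

```python
# Raw records with a duplicate, a missing value, and ages stored as strings.
raw = [
    {"name": "Ana", "age": "34", "city": "Lima"},
    {"name": "Ana", "age": "34", "city": "Lima"},   # duplicate row
    {"name": "Bo",  "age": None, "city": "Oslo"},   # missing age
    {"name": "Cy",  "age": "28", "city": "Kyiv"},
]

# 1. Remove exact duplicates (key on sorted items, since dicts aren't hashable).
seen, rows = set(), []
for r in raw:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        rows.append(r)

# 2. Convert data types: age from string to int.
for r in rows:
    r["age"] = int(r["age"]) if r["age"] is not None else None

# 3. Handle missing values: fill missing ages with the mean of the rest.
ages = [r["age"] for r in rows if r["age"] is not None]
fill = sum(ages) / len(ages)
for r in rows:
    if r["age"] is None:
        r["age"] = fill

# 4. Normalize ages to the [0, 1] range (min-max scaling).
lo, hi = min(r["age"] for r in rows), max(r["age"] for r in rows)
for r in rows:
    r["age_scaled"] = (r["age"] - lo) / (hi - lo)

print(len(rows))  # 3 rows remain after deduplication
```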


1. Differentiate between structured and unstructured data, providing examples of each.


Structured Data:

• Organized and stored in tabular formats (e.g., databases).

• Example: Customer information table with columns (Name, Age, Email).

• Easy to search and analyze.

Unstructured Data:

• Does not have a pre-defined format.

• Example: Images, videos, audio files, emails.

• Requires more preprocessing to analyze.

2. List the relationship between computer science, data science, and traditional sciences.

• Computer Science: Provides computational tools, algorithms, and infrastructure for data
analysis (e.g., programming, databases).

• Data Science: Combines computer science and statistics to extract insights from data.

• Traditional Sciences: Use data science for experimentation, modeling, and simulations (e.g.,
biology for genomics, physics for particle analysis).

3. State statistical inference and its significance.

Statistical inference involves drawing conclusions about a population based on a sample of data.
Significance:

• Helps make predictions or test hypotheses.

• Provides tools to generalize findings from a limited dataset to a broader context.
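A minimal sketch of inference on a made-up sample: estimate the population mean with an approximate 95% confidence interval.

```python
from statistics import mean, stdev
from math import sqrt

# Made-up sample of 10 measurements; we infer the population mean from it.
sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3, 5.1, 4.9]

n = len(sample)
m = mean(sample)
se = stdev(sample) / sqrt(n)  # standard error of the mean

# Approximate 95% confidence interval (z ≈ 1.96; a t critical value
# would be more appropriate for a sample this small).
lo, hi = m - 1.96 * se, m + 1.96 * se
print(f"mean = {m:.2f}, 95% CI ≈ ({lo:.2f}, {hi:.2f})")
```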

4. Define correlation matrix.

A correlation matrix is a table showing correlation coefficients between variables.

• Each cell indicates the strength and direction of the relationship between two variables.

• Values range from -1 (strong negative) to +1 (strong positive), with 0 indicating no correlation.
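A correlation matrix can be built by hand from the Pearson formula (the three variables below are toy data chosen to show a positive and a negative relationship):

```python
from statistics import mean
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy variables (illustrative data).
cols = {
    "hours":  [1, 2, 3, 4, 5],
    "score":  [52, 55, 61, 68, 74],  # rises with hours -> strong positive
    "errors": [9, 8, 6, 4, 2],       # falls with hours -> strong negative
}

matrix = {a: {b: round(pearson(va, vb), 2) for b, vb in cols.items()}
          for a, va in cols.items()}
print(matrix["hours"]["hours"])   # 1.0 on the diagonal
print(matrix["hours"]["score"])   # close to +1
print(matrix["hours"]["errors"])  # close to -1
```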

5. Compare random forests and decision trees as ensemble learning methods.

Decision Tree:

• Uses a single tree to make decisions.

• Prone to overfitting.

• Simpler and faster.

Random Forest:

• Combines multiple decision trees for better accuracy.

• Reduces overfitting by averaging predictions.

• More complex and computationally intensive.


6. Tell about the Naive Bayes algorithm and its utilization in text classification.

• Naive Bayes is a probabilistic algorithm based on Bayes' Theorem.

• Assumes independence between features, making it computationally efficient.

• Utilization in text classification:

o Spam detection (classifying emails as spam or not spam).

o Sentiment analysis (positive or negative reviews).
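A pure-Python sketch of Naive Bayes for spam detection, assuming a tiny made-up training corpus and Laplace smoothing for unseen words:

```python
from collections import Counter
from math import log

# Tiny made-up corpus; labels are "spam" / "ham".
train = [
    ("win money now", "spam"),
    ("claim free prize money", "spam"),
    ("meeting schedule tomorrow", "ham"),
    ("lunch tomorrow at noon", "ham"),
]

# Per-class word counts and class priors.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    """Pick the class with the highest log-posterior (Laplace smoothing)."""
    best, best_score = None, float("-inf")
    for label in word_counts:
        # log prior + sum of log likelihoods (naive independence assumption)
        score = log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            score += log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

print(predict("free money"))      # spam
print(predict("schedule lunch"))  # ham
```

Working in log space avoids numeric underflow when documents contain many words.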

7. Define Data Visualization and its importance in data science.

Definition: Data visualization is the graphical representation of information and data using charts,
graphs, and maps.
Importance:

• Simplifies complex data for easier interpretation.

• Helps identify trends, patterns, and outliers.

• Enhances decision-making and storytelling.

8. List the different types of data visualization techniques.

1. Line charts.

2. Bar charts.

3. Pie charts.

4. Scatter plots.

5. Heatmaps.

6. Histograms.

7. Box plots.

8. Tree maps.

9. Name a common application of data science in image classification.

• Facial recognition: Identifying individuals in photos or videos.

• Object detection: Detecting and classifying objects in images (e.g., self-driving cars).

10. Tell two well-known examples of Recommender Systems.

1. Netflix: Recommends movies and TV shows based on viewing history.

2. Amazon: Suggests products based on user behavior and purchase history.


11. Define clustering and its types.

Definition: Clustering is an unsupervised learning technique that groups similar data points into
clusters.
Types:

1. K-Means Clustering: Partitions data into k clusters.

2. Hierarchical Clustering: Creates a tree of clusters.

3. DBSCAN (Density-Based): Groups points based on density.

4. Mean-Shift Clustering: Finds clusters based on data density peaks.
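The K-Means idea can be sketched in one dimension with k = 2 (points and initial centroids are made up; a real run would use random initialization and a convergence check):

```python
from statistics import mean

# Minimal 1-D k-means sketch (k = 2) on made-up points.
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids = [1.0, 10.0]  # simple initial guesses

for _ in range(10):  # a few iterations suffice on this toy data
    # Assignment step: each point joins its nearest centroid's cluster.
    clusters = [[], []]
    for p in points:
        i = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[i].append(p)
    # Update step: move each centroid to its cluster's mean.
    centroids = [mean(c) for c in clusters]

print(centroids)  # [1.5, 10.5]
```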

12. Compare and contrast supervised and unsupervised learning.

Supervised Learning:

• Labeled data is used for training.

• Goal: Predict output or classify data.

• Examples: Regression, classification.

Unsupervised Learning:

• Uses unlabeled data.

• Goal: Discover hidden patterns or groupings.

• Examples: Clustering, dimensionality reduction.
