DM Answers
o Fraud Detection
o Customer Segmentation
o Healthcare Diagnostics
o Recommendation Systems
8. Compute the similarity between Chicken and Bird using SMC coefficient.
SMC = (Number of matching attributes) / (Total number of attributes)
Matching attributes: 7 out of 10
Total attributes: 10
SMC = 7/10 = 0.7
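A minimal Python sketch of this computation. The Chicken/Bird attribute table itself is not reproduced above, so the two binary vectors below are hypothetical, constructed to agree in 7 of 10 positions:

```python
# Simple Matching Coefficient (SMC) for two binary attribute vectors.
# These vectors are hypothetical stand-ins for the Chicken/Bird table.
chicken = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
bird    = [1, 0, 1, 1, 1, 0, 0, 1, 0, 0]

matches = sum(1 for c, b in zip(chicken, bird) if c == b)
smc = matches / len(chicken)
print(f"SMC = {matches}/{len(chicken)} = {smc}")  # SMC = 7/10 = 0.7
```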
o Dimensionality Reduction
o Numerosity Reduction
o Data Compression
o Structured Data
o Unstructured Data
o Semi-structured Data
5-Marker Questions
19. Explain the Knowledge Discovery in Databases (KDD) process with a neat diagram.
Step-by-Step Explanation:
1. Data Cleaning:
o Remove noise, handle missing values, and correct inconsistencies in the data.
2. Data Integration:
3. Data Selection:
4. Data Transformation:
5. Data Mining:
6. Pattern Evaluation:
7. Knowledge Presentation:
Diagram:
Raw Data → Data Cleaning → Data Integration → Data Selection → Data Transformation →
Data Mining → Pattern Evaluation → Knowledge Presentation
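The early KDD stages can also be sketched in code. Below is a minimal pandas illustration of the cleaning, selection, and transformation steps on a tiny invented table (column names and values are made up for the sketch):

```python
import pandas as pd

# Hypothetical raw customer data with missing values.
raw = pd.DataFrame({
    "age": [25, None, 47, 31],
    "income": [50000, 62000, 62000, None],
    "city": ["NY", "NY", "LA", "LA"],
})

# Data Cleaning: fill missing values.
clean = raw.fillna({"age": raw["age"].mean(), "income": raw["income"].median()})

# Data Selection: keep only task-relevant attributes.
selected = clean[["age", "income"]]

# Data Transformation: min-max scale each attribute to [0, 1].
transformed = (selected - selected.min()) / (selected.max() - selected.min())
print(transformed)
```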
Step-by-Step Explanation:
1. Classification:
2. Clustering:
3. Regression:
4. Association Rule Mining:
5. Anomaly Detection:
Step-by-Step Explanation:
1. Data Collection:
2. Data Preprocessing:
3. Model Selection:
o Choose appropriate algorithms (e.g., decision trees, neural networks).
4. Pattern Discovery:
5. Evaluation:
6. Deployment:
Step-by-Step Explanation:
1. Customer Insights:
2. Market Trends:
3. Operational Efficiency:
4. Risk Management:
5. Strategic Decisions:
Step-by-Step Explanation:
1. Prediction:
2. Description:
3. Classification:
o Categorize data into predefined classes (e.g., spam detection).
4. Clustering:
5. Association:
Step-by-Step Explanation:
1. Statistics:
2. Machine Learning:
3. Database Systems:
4. Domain Knowledge:
5. Visualization:
27. Illustrate the Typical View in ML and Statistics with a Neat Diagram
Step-by-Step Explanation:
1. Machine Learning:
2. Statistics:
3. Diagram:
Step-by-Step Explanation:
1. Fraud Detection:
2. Customer Segmentation:
3. Healthcare Diagnostics:
4. Recommendation Systems:
1. Data Collection:
2. Data Preprocessing:
Step-by-Step Explanation:
1. Data Quality:
2. Scalability:
4. Interpretability:
5. Ethical Concerns:
Step-by-Step Explanation:
1. Quantitative Data:
o Numerical and measurable.
2. Qualitative Data:
3. Comparison:
Step-by-Step Explanation:
1. Filter Methods:
2. Wrapper Methods:
3. Embedded Methods:
4. Example:
o In a dataset with age, income, and education, select income and education as the most relevant features for predicting loan approval.
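A short sketch of filter-style feature selection with scikit-learn's SelectKBest. The loan dataset is not given above, so the arrays below are synthetic, built so that income and education drive the label:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # columns: age, income, education (synthetic)
y = (X[:, 1] + X[:, 2] > 0).astype(int)  # approval depends on income + education

selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print(selector.get_support())  # typically [False, True, True]: income, education kept
```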
33. How to Perform Correlation Analysis Between Categorical Variables Using the Chi-Square Test
Step-by-Step Explanation:
χ² = ∑ (O − E)² / E, where O is the observed frequency and E the expected frequency.
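In practice the test can be run with scipy. A minimal sketch of a test of independence between two categorical variables, on an invented 2×2 contingency table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = gender, columns = preference.
table = np.array([[30, 10],
                  [20, 40]])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
# A p-value below 0.05 suggests the two variables are associated.
```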
Step-by-Step Explanation:
1. Given Data:
o One car: 73
o Two cars: 38
o Three or more: 18
o Total: 129
2. Expected Frequencies:
o 77.4, 36.12, and 15.48 (60%, 28%, and 12% of the 129 households).
χ² = (73 − 77.4)²/77.4 + (38 − 36.12)²/36.12 + (18 − 15.48)²/15.48 = 0.25 + 0.10 + 0.41 ≈ 0.76
o Since 0.76 < 5.99 (the critical value at α = 0.05 with 2 degrees of freedom), we fail to reject H₀. The data supports the study.
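The worked test above can be checked with scipy's goodness-of-fit function:

```python
from scipy.stats import chisquare

observed = [73, 38, 18]
expected = [77.4, 36.12, 15.48]  # 60%, 28%, 12% of 129

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, p = {p_value:.3f}")  # chi-square ≈ 0.76
# 0.76 < 5.99 (critical value, alpha = 0.05, df = 2), so H0 is not rejected.
```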
35. Calculate Covariance for Stocks A and B
Step-by-Step Explanation:
1. Given Data:
o Stock A: 2, 3, 5, 4, 6
2. Calculate Means:
o Mean of A = (2 + 3 + 5 + 4 + 6) / 5 = 4
3. Calculate Covariance:
Covariance = ∑(Aᵢ − Ā)(Bᵢ − B̄) / (n − 1)
4. Interpretation:
o A positive covariance indicates the two stocks tend to move together; a negative one indicates they move in opposite directions.
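A NumPy sketch of the sample-covariance computation. Stock B's values did not survive in this copy of the answer, so the B series below is a placeholder used only to illustrate the formula:

```python
import numpy as np

A = np.array([2, 3, 5, 4, 6])     # Stock A (given above)
B = np.array([5, 8, 10, 11, 14])  # hypothetical Stock B values

cov = np.sum((A - A.mean()) * (B - B.mean())) / (len(A) - 1)
print(cov)                 # 5.0 for these placeholder values
print(np.cov(A, B)[0, 1])  # same result via NumPy's sample covariance
```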
36. What is Dimensionality Reduction? Explain Methods Used for Reducing Dimensionality
Step-by-Step Explanation:
1. Definition:
2. Methods:
3. Example:
o Reducing a dataset with 100 features to 10 principal components using PCA.
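A minimal scikit-learn sketch of that PCA example, run on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 100))  # 500 samples, 100 features (synthetic)

pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                      # (500, 10)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```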
Step-by-Step Explanation:
1. Data Cleaning:
2. Data Integration:
3. Data Transformation:
4. Data Reduction:
5. Impact:
Step-by-Step Explanation:
1. Given Data:
o Salaries: 25, 30, 28, 55, 60, 42, 70, 75, 50, 48
2. Binning:
3. Smoothing:
4. Result:
o Smoothed data: 28, 28, 28, 51, 51, 51, 72.5, 72.5, 51, 51.
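A short Python sketch reproducing this smoothing-by-bin-means result (the bins are the ones implied by the answer above; note the answer rounds 27.67 up to 28):

```python
salaries = [25, 30, 28, 55, 60, 42, 70, 75, 50, 48]

# Bins implied by the stated result, after sorting the salaries.
bins = [[25, 28, 30], [42, 48, 50, 55, 60], [70, 75]]

# Replace every value with the mean of its bin.
bin_mean = {v: round(sum(b) / len(b), 2) for b in bins for v in b}
smoothed = [bin_mean[v] for v in salaries]
print(smoothed)  # [27.67, 27.67, 27.67, 51.0, 51.0, 51.0, 72.5, 72.5, 51.0, 51.0]
```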
Step-by-Step Explanation:
1. Similarity:
o Properties:
2. Dissimilarity:
o Properties:
3. Example: For vectors A = (1, 2, 3) and B = (4, 5, 6):
Cosine Similarity:
Similarity = (A · B) / (||A|| ||B||) = 32 / (√14 × √77) ≈ 0.974
Euclidean Distance:
Distance = √((4 − 1)² + (5 − 2)² + (6 − 3)²) = √27 ≈ 5.196
41. Define Noisy Data. Explain How Noisy Data Can Be Handled in Data Mining
Step-by-Step Explanation:
1. Noisy Data:
2. Handling Methods:
o Binning (smooth values by bin means or boundaries), regression (fit the data to a function), and outlier analysis (e.g., clustering to detect noise).
3. Example:
o For a dataset with noisy salary values, use binning to replace values with bin means.
Step-by-Step Explanation:
1. Given Vectors:
o d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
o d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
2. Dot Product (d1 · d2):
(3×1) + (2×0) + (0×0) + (5×0) + (0×0) + (0×0) + (0×0) + (2×1) + (0×0) + (0×2) = 3 + 2 = 5
3. Magnitude of d1 (||d1||):
√(3² + 2² + 0² + 5² + 0² + 0² + 0² + 2² + 0² + 0²) = √(9 + 4 + 25 + 4) = √42 ≈ 6.48
4. Magnitude of d2 (||d2||):
√(1² + 0² + 0² + 0² + 0² + 0² + 0² + 1² + 0² + 2²) = √(1 + 1 + 4) = √6 ≈ 2.45
5. Cosine Similarity:
Similarity = (d1 · d2) / (||d1|| ||d2||) = 5 / (6.48 × 2.45) ≈ 5 / 15.88 ≈ 0.315
Step-by-Step Explanation:
1. Data Cleaning:
2. Data Integration:
3. Data Transformation:
4. Data Reduction:
5. Impact:
Step-by-Step Explanation:
1. Accuracy:
2. Completeness:
3. Consistency:
4. Timeliness:
5. Relevance:
Step-by-Step Explanation:
1. Data Cleaning:
2. Data Integration:
3. Data Transformation:
4. Data Reduction:
5. Data Discretization:
Step-by-Step Explanation:
1. Given Data:
2. Min-Max Normalization:
o Formula: v′ = (v − min) / (max − min)
o For v = 200: (200 − 200) / (1000 − 200) = 0
3. Z-Score Normalization:
o Formula: v′ = (v − mean) / σ
o For v = 200: (200 − 500) / 316.23 ≈ −0.948
4. Decimal Scaling:
o Formula: v′ = v / 10^j
o For v = 200 with j = 3: 200 / 1000 = 0.2
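All three normalizations in a short Python sketch. The original data values are not listed above; the set 200, 300, 400, 600, 1000 is assumed here because it matches the stated min (200), max (1000), mean (500), and sample standard deviation (316.23):

```python
import statistics

data = [200, 300, 400, 600, 1000]  # assumed, consistent with the stated statistics
v = 200

min_max = (v - min(data)) / (max(data) - min(data))             # 0.0
z_score = (v - statistics.mean(data)) / statistics.stdev(data)  # ≈ -0.949
decimal = v / 10 ** 3  # 0.2, dividing by 10^3 as in the answer above

print(min_max, round(z_score, 3), decimal)
```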
Step-by-Step Explanation:
1. Definition:
o Summarizes data across multiple dimensions.
2. Operations:
3. Example:
o A data cube for sales data might include dimensions like time, location, and product, with measures like total sales and profit.
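A data-cube-style aggregation can be sketched with a pandas pivot table; the sales records below are invented, and the pivot plays the role of one cuboid (time × location, with product rolled up):

```python
import pandas as pd

sales = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q2"],
    "location": ["NY", "LA", "NY", "LA"],
    "product":  ["TV", "TV", "PC", "TV"],
    "sales":    [100, 80, 120, 90],
})

cube = sales.pivot_table(index="time", columns="location",
                         values="sales", aggfunc="sum")
print(cube)  # total sales by time x location
```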
48. Calculate Covariance for Economic Growth and S&P 500 Returns
Step-by-Step Explanation:
1. Given Data:
2. Calculate Means:
o Mean of bi = (8 + 12 + 14 + 10) / 4 = 11
3. Calculate Covariance:
Covariance = ∑(aᵢ − ā)(bᵢ − b̄) / (n − 1)
4. Interpretation:
o Positive covariance indicates that economic growth and S&P 500 returns tend to rise or fall together.
49. Explain Data Discretization in Detail, Supervised and Unsupervised Discretization
Step-by-Step Explanation:
1. Definition:
2. Supervised Discretization:
3. Unsupervised Discretization:
4. Example:
o For a dataset with ages, discretize into intervals like 0-20, 20-40, 40-60.
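A one-line unsupervised (equal-width style) discretization with pandas, using the age intervals from the example; the ages themselves are invented:

```python
import pandas as pd

ages = pd.Series([5, 18, 25, 33, 41, 59])  # hypothetical ages
binned = pd.cut(ages, bins=[0, 20, 40, 60], labels=["0-20", "20-40", "40-60"])
print(binned.tolist())  # ['0-20', '0-20', '20-40', '20-40', '40-60', '40-60']
```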
Step-by-Step Explanation:
1. Definition:
2. Example:
Step-by-Step Explanation:
1. Definition:
2. Example:
Step-by-Step Explanation:
1. Similarity:
2. Dissimilarity:
3. Properties:
4. Example:
10-Marker Questions
Step-by-Step Explanation:
1. Given Data:
S = (16, n), (0, y), (4, y), (12, y), (16, n), (26, n), (18, y), (24, n), (28, n)
2. Split Points:
o The candidate split points evaluated are 14 and 21 (midpoints of the adjacent ages 12/16 and 18/24 where the class changes).
o Entropy Formula:
Entropy(S) = −∑ᵢ pᵢ log₂(pᵢ)
o Split at 14:
S1: (0, y), (4, y), (12, y) → all "y" → Entropy = 0.
S2: (16, n), (16, n), (18, y), (24, n), (26, n), (28, n) → 5 "n" and 1 "y" → Entropy ≈ 0.65.
Weighted entropy = (3/9)(0) + (6/9)(0.65) ≈ 0.43.
o Split at 21:
S1: (0, y), (4, y), (12, y), (16, n), (16, n), (18, y) → 3 "y" and 3 "n" → Entropy = 1.
S2: (24, n), (26, n), (28, n) → all "n" → Entropy = 0.
Weighted entropy = (6/9)(1) + (3/9)(0) ≈ 0.67.
o Split point 14 has the lower weighted entropy (0.43) and is chosen as the best split.
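A short Python sketch that recomputes both weighted split entropies from the data above:

```python
from math import log2

S = [(16, "n"), (0, "y"), (4, "y"), (12, "y"), (16, "n"),
     (26, "n"), (18, "y"), (24, "n"), (28, "n")]

def entropy(records):
    n = len(records)
    probs = [sum(1 for _, c in records if c == label) / n for label in ("y", "n")]
    return -sum(p * log2(p) for p in probs if p > 0)

def split_entropy(records, point):
    s1 = [r for r in records if r[0] < point]
    s2 = [r for r in records if r[0] >= point]
    n = len(records)
    return len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)

print(round(split_entropy(S, 14), 2))  # 0.43 -> best split
print(round(split_entropy(S, 21), 2))  # 0.67
```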
Step-by-Step Explanation:
1. Given Points:
o x1 = (0, 2), x2 = (2, 0)
2. Euclidean Distance:
o Formula:
Distance = √((x₂ − x₁)² + (y₂ − y₁)²)
o For x1 = (0, 2) and x2 = (2, 0): √((2 − 0)² + (0 − 2)²) = √(4 + 4) = √8 ≈ 2.83
3. Minkowski Distance:
o Formula:
Distance = (∑ᵢ |xᵢ − yᵢ|^p)^(1/p)
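A small sketch of the Minkowski distance, checked against the Euclidean result above for x1 = (0, 2) and x2 = (2, 0):

```python
import numpy as np

x1, x2 = np.array([0, 2]), np.array([2, 0])

def minkowski(a, b, p):
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

print(round(minkowski(x1, x2, 2), 2))  # 2.83 (Euclidean, p = 2)
print(round(minkowski(x1, x2, 1), 2))  # 4.0  (Manhattan, p = 1)
```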
Step-by-Step Explanation:
1. Dimensionality Reduction:
2. Numerosity Reduction:
3. Data Compression:
4. Feature Selection: