Scikit-learn Interview Questions and Answers-1
1. What is Scikit-learn?
Scikit-learn is an open-source machine learning library in Python, built on top of SciPy, NumPy, and
Matplotlib. It provides simple and efficient tools for data mining and data analysis, including various
algorithms for classification, regression, clustering, and more.
2. What is the typical workflow for building a model in Scikit-learn?
The typical workflow involves:
1. Importing the necessary modules (e.g., sklearn.model_selection, sklearn.linear_model).
2. Loading and preprocessing the data.
3. Splitting the data into training and testing sets.
4. Choosing a model and training it using the fit() method.
5. Making predictions with predict().
6. Evaluating model performance using metrics like accuracy, precision, and recall.
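The steps above can be sketched end-to-end; the dataset (Iris) and the model (LogisticRegression) are illustrative choices, not the only options:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 2: load the data (Iris: 150 samples, 4 features, 3 classes)
X, y = load_iris(return_X_y=True)

# Step 3: hold out 25% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Steps 4-5: fit the model, then predict on unseen data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Step 6: evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
```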
4. What is feature scaling, and when should you use StandardScaler vs. MinMaxScaler?
Feature scaling standardizes the range of features so that they contribute equal weight during model training.
- StandardScaler scales each feature by removing the mean and scaling to unit variance.
- MinMaxScaler scales each feature to a fixed range, usually [0, 1].
Use StandardScaler when the data is approximately normally distributed, and MinMaxScaler when you need values in a bounded range.
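A small sketch contrasting the two scalers on toy data (the array values are arbitrary, chosen only to show the effect):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# StandardScaler: each column ends up with zero mean and unit variance
X_std = StandardScaler().fit_transform(X)

# MinMaxScaler: each column is mapped onto [0, 1]
X_mm = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_mm)
```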
5. What is cross-validation?
Cross-validation is a technique for assessing model performance by splitting the data into multiple subsets, training the model on some subsets and validating on the others. K-Fold Cross-Validation is a popular method in which the data is divided into k subsets (folds); the model is trained k times, each time training on k-1 folds and validating on the remaining fold.
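A minimal sketch of K-Fold cross-validation with k=5; the dataset and estimator are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# 5 folds: each run trains on 4 folds and validates on the 5th
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(scores)                 # one accuracy score per fold
print(f"Mean accuracy: {scores.mean():.2f}")
```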
6. What is the difference between Bagging and Boosting?
Bagging and Boosting are ensemble learning techniques:
- Bagging: combines multiple weak models trained independently on random subsets of the data, reducing variance (e.g., Random Forest).
- Boosting: trains models sequentially, with each model correcting the errors of the previous one, reducing bias (e.g., AdaBoost, Gradient Boosting).
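The contrast above can be sketched with one estimator from each family; RandomForestClassifier (bagging) and AdaBoostClassifier (boosting) are the examples named in the text, and the synthetic dataset is an illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# Synthetic binary classification problem
X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: trees trained independently on bootstrap samples
bagging = RandomForestClassifier(n_estimators=100, random_state=0)
bagging.fit(X_tr, y_tr)

# Boosting: estimators trained sequentially, reweighting past errors
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)
boosting.fit(X_tr, y_tr)

print(f"Bagging accuracy:  {bagging.score(X_te, y_te):.2f}")
print(f"Boosting accuracy: {boosting.score(X_te, y_te):.2f}")
```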
7. What is Principal Component Analysis (PCA), and how do you implement it in Scikit-learn?
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms data into a set of uncorrelated variables (principal components). Implementation in Scikit-learn:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
This reduces the data to 2 principal components.
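The snippet above can be run end to end on a concrete dataset; Iris is an illustrative choice, and explained_variance_ratio_ shows how much variance each component retains:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)   # 150 samples, 4 features

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)        # project onto 2 principal components

print(X_pca.shape)                  # (150, 2)
print(pca.explained_variance_ratio_)  # variance retained by each component
```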