Scikit-learn Cheatsheet [2025 Updated] - Download pdf
Last Updated: 23 Jul, 2025
Machine learning lets systems learn from data and make decisions without being explicitly programmed, and it is changing how companies operate across industries from healthcare to entertainment. Having the right tools matters in a field that moves this quickly. Scikit-learn is one of the most widely used and accessible machine learning libraries for Python. Beginners and experts alike use it to build, tune, and evaluate models because of its ease of use and extensive feature set.
In this article, we provide a Scikit-learn Cheat Sheet that covers the main features, techniques, and tasks in the library. It is a quick reference for building machine learning models, covering everything from data preprocessing to model evaluation.
What is Scikit-learn?
Scikit-learn is a free, open-source Python library for machine learning. It supports classification, clustering, regression (predicting continuous values), and dimensionality reduction. It also provides tools to prepare data, select a model, and assess its performance. Scikit-learn is built on top of NumPy and SciPy, is easy to use, and suits both novices and machine learning specialists.
Scikit-learn Cheat-Sheet
This Scikit-learn Cheat Sheet will help you learn how to use Scikit-learn for machine learning. It covers important topics like creating models, testing their performance, working with different types of data, and using machine learning techniques like classification, regression, and clustering. It’s a great guide to help you get hands-on experience and explore machine learning more easily.
Download the Cheat-Sheet here: Scikit-learn Cheat-Sheet
Installing Scikit-learn
Once you have Python installed, you can use the following command to install the scikit-learn library (the same command works on Windows, macOS, and Linux):
pip install scikit-learn
Data Preprocessing
Function | Description |
---|---|
StandardScaler | Standardize features by removing the mean and scaling to unit variance. |
MinMaxScaler | Scale features to a specific range (e.g., 0 to 1). |
Binarizer | Transform features into binary values (thresholding). |
LabelEncoder | Encode target labels with values between 0 and n_classes-1. |
OneHotEncoder | Perform one-hot encoding of categorical features. |
PolynomialFeatures | Generate polynomial and interaction features. |
SimpleImputer | Impute missing values using a strategy (mean, median, most frequent). |
KNNImputer | Impute missing values using k-nearest neighbors. |
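As a minimal sketch, the three transformers from this table that have no hands-on example later in the sheet can be tried on toy data. The arrays below are made up for illustration and are not part of any real dataset.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy numeric data (hypothetical); np.nan marks a missing value.
X_num = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [3.0, 30.0]])

# KNNImputer fills the gap with the mean of that feature
# over the 2 nearest rows (here both other rows).
X_filled = KNNImputer(n_neighbors=2).fit_transform(X_num)

# MinMaxScaler maps each column to the [0, 1] range by default.
X_scaled = MinMaxScaler().fit_transform(X_filled)

# OneHotEncoder turns each category into its own 0/1 column.
# The result is sparse by default, so convert it for printing.
colors = np.array([["red"], ["green"], ["red"]])
encoder = OneHotEncoder()
colors_onehot = encoder.fit_transform(colors).toarray()

print(X_filled)
print(X_scaled)
print(encoder.categories_)
print(colors_onehot)
```

Categories are sorted alphabetically, so the first one-hot column corresponds to "green" and the second to "red".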
Model Selection and Evaluation
Function | Description |
---|---|
train_test_split | Split data into training and testing sets. |
cross_val_score | Perform cross-validation on the model and return one score per fold. |
cross_val_predict | Generate cross-validated predictions for each input sample. |
accuracy_score | Evaluate classification accuracy. |
confusion_matrix | Generate confusion matrix for classification. |
classification_report | Detailed classification report (precision, recall, F1-score). |
mean_squared_error | Evaluate regression performance with mean squared error. |
r2_score | Evaluate regression performance with R² score. |
roc_auc_score | Compute area under the ROC curve (binary by default; multiclass via the multi_class parameter). |
f1_score | Compute the F1 score for classification models. |
precision_score | Compute precision score for classification models. |
recall_score | Compute recall score for classification models. |
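The difference between cross_val_score and cross_val_predict is easiest to see side by side. The sketch below assumes the Iris dataset and a LogisticRegression model purely for illustration; any estimator could be substituted.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# One accuracy value per fold (5 folds here).
scores = cross_val_score(model, X, y, cv=5)

# One prediction per sample, each made by a model
# that did not see that sample during training.
y_cv_pred = cross_val_predict(model, X, y, cv=5)

print("Fold scores:", scores)
print("Mean score:", scores.mean())
print("Predictions shape:", y_cv_pred.shape)
```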
Classification Models
Function | Description |
---|---|
LogisticRegression | A linear model used for binary or multi-class classification. |
SVC | Support Vector Classifier, used for both linear and non-linear classification. |
RandomForestClassifier | An ensemble method that builds multiple decision trees for robust classification. |
GradientBoostingClassifier | An ensemble method that builds trees sequentially to correct errors of previous trees. |
GaussianNB | Naive Bayes classifier based on Gaussian distribution of data. |
KNeighborsClassifier | Classifier that assigns labels based on nearest neighbors' majority class. |
DecisionTreeClassifier | A tree-based classifier that splits data into branches to make decisions. |
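Because every classifier in the table exposes the same fit and score methods, several of them can be compared in one loop. This is a rough sketch on the Iris dataset with mostly default hyperparameters; the dictionary name and the chosen split are assumptions for illustration, and the numbers are not a benchmark.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Hypothetical selection of classifiers, mostly with default settings.
classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
    "GaussianNB": GaussianNB(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
}

# Same two calls for every model: fit on train, score on test.
accuracies = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    accuracies[name] = clf.score(X_test, y_test)
    print(f"{name}: {accuracies[name]:.3f}")
```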
Regression Models
Function | Description |
---|---|
LinearRegression | A linear model used to predict continuous numerical values. |
Ridge | A linear regression model with L2 regularization to prevent overfitting. |
Lasso | A linear regression model with L1 regularization to enhance sparsity. |
DecisionTreeRegressor | A tree-based model that predicts continuous values by learning splits on the data. |
RandomForestRegressor | An ensemble method that averages the predictions of multiple decision trees for better accuracy. |
SVR | Support Vector Regressor, used for predicting continuous values with support vector machines. |
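The practical difference between L2 (Ridge) and L1 (Lasso) regularization shows up in the coefficients: Lasso can drive some of them to exactly zero. The sketch below uses synthetic data from make_regression where only 3 of 10 features carry signal; the sample size, noise level, and alpha values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: 10 features, only 3 of them informative.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
}

# Count how many coefficients each model sets to exactly zero.
zero_counts = {}
for name, model in models.items():
    model.fit(X, y)
    zero_counts[name] = int(np.sum(model.coef_ == 0))
    print(f"{name}: {zero_counts[name]} zero coefficients")
```

Ridge shrinks coefficients toward zero without eliminating them, while Lasso effectively performs feature selection.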
Clustering Models
Function | Description |
---|---|
KMeans | A popular clustering algorithm that partitions data into k distinct clusters based on similarity. |
DBSCAN | A density-based clustering algorithm that groups data points based on density, allowing for irregular shapes. |
AgglomerativeClustering | A hierarchical clustering method that starts with each point as its own cluster and repeatedly merges the closest clusters. |
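All three clustering estimators support fit_predict, so they can be run on the same data and compared. The sketch below uses three well-separated synthetic blobs; the centers, eps, and min_samples values are assumptions chosen so that every algorithm has an easy job.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Three tight, far-apart blobs (hypothetical toy data).
centers = [[0, 0], [10, 10], [-10, 10]]
X, y_true = make_blobs(n_samples=150, centers=centers,
                       cluster_std=0.5, random_state=0)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# DBSCAN does not take the number of clusters; it infers them from density.
# Points labelled -1 are treated as noise.
dbscan_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)
n_dbscan_clusters = len(set(dbscan_labels) - {-1})

print("KMeans ARI:", adjusted_rand_score(y_true, kmeans_labels))
print("Agglomerative ARI:", adjusted_rand_score(y_true, agglo_labels))
print("DBSCAN clusters found:", n_dbscan_clusters)
```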
Dimensionality Reduction
Function | Description |
---|---|
PCA | Principal Component Analysis (PCA) reduces the number of features by finding new dimensions that maximize variance. |
TruncatedSVD | A dimensionality reduction method suited for sparse matrices, especially in text mining. |
TSNE | t-SNE, a technique for visualizing high-dimensional data by mapping it to a lower-dimensional space (found in sklearn.manifold). |
FeatureAgglomeration | A method for feature reduction that merges features based on their similarity. |
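A short sketch of PCA and TruncatedSVD on the Iris features follows. Converting the data to a SciPy sparse matrix is done here only to illustrate that TruncatedSVD accepts sparse input; Iris itself is dense.

```python
from scipy.sparse import csr_matrix
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, TruncatedSVD

X, _ = load_iris(return_X_y=True)

# PCA: project 4 features onto the 2 directions of largest variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("Variance explained per component:", pca.explained_variance_ratio_)

# TruncatedSVD works directly on sparse matrices (no centering step).
svd = TruncatedSVD(n_components=2, random_state=0)
X_svd = svd.fit_transform(csr_matrix(X))
print("Reduced shapes:", X_pca.shape, X_svd.shape)
```

The explained_variance_ratio_ attribute is a quick way to judge how much information the kept components retain.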
Model Training and Prediction
Method | Description |
---|---|
fit() | Trains the model using the provided data (X_train, y_train). |
predict() | Makes predictions based on the trained model for unseen data (X_test). |
fit_predict() | Combines training and prediction into a single method, commonly used in clustering. |
predict_proba() | Returns probability estimates for classification models, indicating class likelihoods. |
score() | Evaluates the model's performance using a scoring metric, typically accuracy for classification or R² for regression. |
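The methods in this table can be exercised together in a few lines. The sketch below assumes the Iris dataset, a LogisticRegression classifier, and a KMeans clusterer; these are illustrative choices, not the only estimators that offer these methods.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)             # fit(): learn from training data
y_pred = clf.predict(X_test)          # predict(): hard class labels
proba = clf.predict_proba(X_test)     # predict_proba(): one column per class
acc = clf.score(X_test, y_test)       # score(): accuracy for classifiers

# fit_predict(): fit a clusterer and return cluster labels in one call.
cluster_labels = KMeans(n_clusters=3, n_init=10,
                        random_state=42).fit_predict(X)

print("Accuracy:", acc)
print("Probabilities for first test sample:", proba[0])
print("Cluster labels shape:", cluster_labels.shape)
```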
Hands-on Practice with Scikit-learn
Importing and Preparing Data
Before building models, we need to load our dataset and split it into training and testing subsets. This ensures we can evaluate the model on unseen data.
Loading Built-in Datasets: Scikit-learn ships small datasets such as Iris and Diabetes for experimentation. (The Boston Housing dataset was removed in scikit-learn 1.2.)
Python
from sklearn.datasets import load_iris
data = load_iris()
X, y = data.data, data.target
Splitting Data into Training and Testing: Split the dataset into training data for model learning and testing data for evaluation.
Python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
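For classification data it is often worth keeping the class proportions identical in both subsets. A minimal sketch, assuming the Iris dataset, uses the stratify parameter of train_test_split:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the share of each class the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("Samples per class (train):", train_counts)
print("Samples per class (test):", test_counts)
```

Iris has 50 samples per class, so a stratified 80/20 split yields 40 training and 10 test samples for each class.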
Preprocessing transforms raw data into a suitable format for machine learning models. These techniques standardize, normalize, or otherwise prepare data.
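a. Standardization: Rescales each feature to zero mean and unit variance, which helps models that are sensitive to feature scale, such as SVMs, KNN, and regularized linear models. Fit the scaler on the training data only and reuse it for the test data.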
Python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
X_train_standardized = scaler.transform(X_train)
X_test_standardized = scaler.transform(X_test)
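When standardization is combined with cross-validation, wrapping the scaler and the model in a pipeline ensures the scaler is refit on each training fold only. This is a sketch assuming the Iris dataset and an SVC model, neither of which is prescribed by the text above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# After standardization each column has mean ~0 and standard deviation ~1.
X_std = StandardScaler().fit_transform(X)
print("Column means:", X_std.mean(axis=0).round(6))
print("Column stds:", X_std.std(axis=0).round(6))

# The pipeline scales inside each fold, so no test-fold statistics leak in.
pipe = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(pipe, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())
```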
b. Normalization: Scales each individual row (sample) so that its norm equals 1, which is useful for distance-based models like KNN.
Python
from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X_train)
X_train_normalized = scaler.transform(X_train)
X_test_normalized = scaler.transform(X_test)
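A tiny worked example makes the row-wise behaviour concrete. The two-row array is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical data: the first row has Euclidean length 5 (a 3-4-5 triangle).
X_small = np.array([[3.0, 4.0],
                    [1.0, 0.0]])

# Default norm is L2: each row is divided by its own Euclidean length.
X_small_normalized = Normalizer().fit_transform(X_small)
row_norms = np.linalg.norm(X_small_normalized, axis=1)

print(X_small_normalized)
print("Row norms:", row_norms)
```

Unlike StandardScaler, which works column by column, Normalizer treats every sample independently, so fitting it learns nothing from the data.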
c. Binarization: Converts numeric features into binary values based on a threshold.
Python
from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold=1.0).fit(X)
X_binarized = binarizer.transform(X)
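The thresholding rule can be checked on a small made-up array: values strictly greater than the threshold become 1, and values less than or equal to it become 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical data including a value exactly at the threshold.
X_small = np.array([[0.5, 1.0, 1.5],
                    [2.0, 0.0, 1.1]])

X_small_binarized = Binarizer(threshold=1.0).fit_transform(X_small)
print(X_small_binarized)
```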
d. Encoding Non-Numerical Data: Converts categorical target labels into integers using LabelEncoder. For categorical input features, use OneHotEncoder or OrdinalEncoder instead.
Python
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
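Since the Iris targets are already integers, the effect of LabelEncoder is clearer with string labels. The label list below is invented for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels.
labels = ["dog", "cat", "fish", "dog"]

encoder = LabelEncoder()
labels_encoded = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; the integer is the index into it.
print("Classes:", list(encoder.classes_))
print("Encoded:", labels_encoded)

# inverse_transform maps integers back to the original labels.
decoded = encoder.inverse_transform([2, 0])
print("Decoded:", list(decoded))
```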
e. Handling Missing Values: Handle missing data with a strategy like replacing with the mean, median, or mode.
Python
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
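The Iris data has no missing values, so the imputer above leaves it unchanged. A small made-up array with gaps shows what mean imputation actually does.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column.
X_missing = np.array([[1.0, np.nan],
                      [3.0, 4.0],
                      [np.nan, 8.0]])

imputer = SimpleImputer(strategy="mean")
X_small_imputed = imputer.fit_transform(X_missing)

# statistics_ stores the per-column fill value learned during fit.
print("Fill values:", imputer.statistics_)
print(X_small_imputed)
```

Column means are computed from the observed values only: (1 + 3) / 2 = 2 for the first column and (4 + 8) / 2 = 6 for the second.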
f. Creating Polynomial Features: Generates additional features that represent polynomial combinations of the original ones, capturing non-linear patterns.
Python
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
Building Machine Learning Models
Supervised Learning Algorithms
Supervised learning involves training models on labeled data, where the target variable is known.
- Linear Regression: Used to predict continuous values by fitting a linear relationship between input features and target variables.
Python
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
- Naive Bayes: A fast family of algorithms based on Bayes’ theorem. GaussianNB, shown here, assumes continuous features; MultinomialNB is the variant typically used for text classification.
Python
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
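- K-Nearest Neighbors (KNN): Assigns each sample the majority class among its k nearest training samples.
Python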
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
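The Iris target is a class label, so it is not a natural fit for the linear regression entry in the list above. As a hedged illustration of what LinearRegression recovers, the sketch below fits it to synthetic, noise-free data generated from y = 3x + 2; the data and variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 3x + 2 exactly (no noise).
rng = np.random.default_rng(0)
X_line = rng.uniform(0, 10, size=(50, 1))
y_line = 3.0 * X_line[:, 0] + 2.0

reg = LinearRegression().fit(X_line, y_line)

# coef_ holds the slope(s); intercept_ holds the constant term.
print("Slope:", reg.coef_)
print("Intercept:", reg.intercept_)
print("Prediction for x=4:", reg.predict([[4.0]]))
```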
Unsupervised Learning Algorithms
Unsupervised learning is used when the data has no labels or target variable, often for clustering or dimensionality reduction.
- PCA: Reduces high-dimensional data into fewer dimensions while preserving variance.
Python
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_train)
- K-Means: Groups similar data points into clusters based on their features.
Python
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_train)
y_pred = kmeans.predict(X_test)
Evaluation metrics are used to judge a model's performance, typically by measuring its accuracy or error rate on test data. Scikit-learn provides many metrics for this purpose.
a. Metrics for Classification Models
- Accuracy Score: Measures the proportion of correctly predicted labels.
Python
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(y_test, y_pred))
- Classification Report (Precision, Recall, F1): Provides detailed metrics for classification tasks.
Python
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
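- Confusion Matrix Insights: Shows how many samples of each actual class were assigned to each predicted class. For binary problems these counts are the true negatives, false positives, false negatives, and true positives.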
Python
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))
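The classification metrics above can be verified by hand on a small example. The label vectors below are made up so that the counts are easy to follow.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical binary labels: 6 samples, 2 mistakes.
y_true_small = [0, 0, 1, 1, 1, 0]
y_pred_small = [0, 1, 1, 1, 0, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true_small, y_pred_small)
tn, fp, fn, tp = cm.ravel()
print(cm)

acc = accuracy_score(y_true_small, y_pred_small)         # (tp + tn) / total
precision = precision_score(y_true_small, y_pred_small)  # tp / (tp + fp)
recall = recall_score(y_true_small, y_pred_small)        # tp / (tp + fn)
f1 = f1_score(y_true_small, y_pred_small)                # harmonic mean of the two
print(acc, precision, recall, f1)
```

Here tn = 2, fp = 1, fn = 1, and tp = 2, so accuracy is 4/6 and precision, recall, and F1 are all 2/3.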
b. Metrics for Regression Models
- Mean Absolute Error (MAE): Measures the average absolute difference between predicted and actual values.
Python
from sklearn.metrics import mean_absolute_error
print("MAE:", mean_absolute_error(y_test, y_pred))
- Mean Squared Error (MSE): Measures the average squared difference between predicted and actual values.
Python
from sklearn.metrics import mean_squared_error
print("MSE:", mean_squared_error(y_test, y_pred))
- R² Score: Indicates how well the model explains the variance in the target variable.
Python
from sklearn.metrics import r2_score
print("R² Score:", r2_score(y_test, y_pred))
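These regression metrics are meaningful for continuous targets. A small worked example with made-up values shows how each number comes about.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical continuous targets and predictions; errors are -1, 0, and 2.
y_true_small = [3.0, 5.0, 7.0]
y_pred_small = [2.0, 5.0, 9.0]

mae = mean_absolute_error(y_true_small, y_pred_small)  # (1 + 0 + 2) / 3
mse = mean_squared_error(y_true_small, y_pred_small)   # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                                    # same units as the target
r2 = r2_score(y_true_small, y_pred_small)              # 1 - SS_res / SS_tot = 1 - 5/8

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R²:", r2)
```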
c. Metrics for Clustering
- Adjusted Rand Index: Evaluates the similarity between two clusterings by considering all pairs of points.
Python
from sklearn.metrics import adjusted_rand_score
print("ARI:", adjusted_rand_score(y_test, y_pred))
- Homogeneity: Checks if clusters contain only data points that belong to a single class.
Python
from sklearn.metrics import homogeneity_score
print("Homogeneity:", homogeneity_score(y_test, y_pred))
- V-Measure: Measures the balance between homogeneity and completeness in clustering results.
Python
from sklearn.metrics import v_measure_score
print("V-Measure:", v_measure_score(y_test, y_pred))
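Unlike accuracy, these clustering metrics ignore how the clusters are numbered. A minimal sketch with made-up labels shows that a perfect grouping with swapped cluster ids still scores 1.0.

```python
from sklearn.metrics import (accuracy_score, adjusted_rand_score,
                             homogeneity_score, v_measure_score)

# Hypothetical ground truth and a clustering that groups the points
# identically but uses the opposite cluster ids.
labels_true = [0, 0, 1, 1]
labels_pred = [1, 1, 0, 0]

ari = adjusted_rand_score(labels_true, labels_pred)
homogeneity = homogeneity_score(labels_true, labels_pred)
v_measure = v_measure_score(labels_true, labels_pred)
naive_accuracy = accuracy_score(labels_true, labels_pred)

print("ARI:", ari)
print("Homogeneity:", homogeneity)
print("V-Measure:", v_measure)
print("Accuracy (misleading for clusters):", naive_accuracy)
```

This is why cluster assignments should be compared with clustering metrics rather than with accuracy_score.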
Optimizing Models
Model optimization involves fine-tuning hyperparameters to improve performance.
a. Exhaustive Search with GridSearchCV: Tests all combinations of hyperparameters to find the best set.
Python
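from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC  # SVC must be imported before it is used below
# Every combination of these values is evaluated with 5-fold cross-validation
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best Parameters:", grid.best_params_)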
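Grid search is often combined with a preprocessing pipeline so that scaling is tuned and validated together with the model. The sketch below is one possible setup on the Iris dataset; the step names "scaler" and "svc" are arbitrary labels chosen for this example, and pipeline parameters are addressed as step name, double underscore, parameter name.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# "svc__C" means: parameter C of the pipeline step named "svc".
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best CV score:", grid.best_score_)

# By default the best pipeline is refit on all training data,
# so the search object can score the held-out test set directly.
test_accuracy = grid.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```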
b. Randomized Search for Hyperparameters: Randomly samples hyperparameters for a faster search.
Python
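from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
# n_iter is kept below the 6 possible combinations so the search actually samples;
# the result is named `search` to avoid shadowing Python's built-in random module
search = RandomizedSearchCV(SVC(), param_distributions={
    'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']},
    n_iter=5, cv=5, random_state=42)
search.fit(X_train, y_train)
print("Best Parameters:", search.best_params_)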
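Randomized search becomes more useful when a hyperparameter is drawn from a continuous distribution rather than a short list. The sketch below assumes SciPy's loguniform distribution for C and the Iris dataset; the range 0.01 to 100 is an arbitrary choice for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# C is sampled on a log scale; kernel is sampled uniformly from the list.
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "kernel": ["linear", "rbf"],
}

# n_iter controls how many random configurations are tried.
search = RandomizedSearchCV(SVC(), param_distributions,
                            n_iter=10, cv=5, random_state=42)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV score:", search.best_score_)
```

The cost is fixed by n_iter regardless of how large the search space is, which is the main advantage over an exhaustive grid.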
Conclusion
In conclusion, Scikit-learn makes machine learning easier and more accessible. It provides simple and powerful tools to help you turn data into useful predictions. With its wide range of algorithms and features, you can quickly build, test, and improve your models. Whether you're working on a small task or a big project, Scikit-learn offers the flexibility and support you need. By learning how to use it, you'll be ready to apply machine learning to many different challenges.