Practical Assignment ML

Uploaded by

Ayush Dumka

NATIONAL FORENSIC SCIENCES UNIVERSITY, GANDHINAGAR

SCHOOL OF LAW, FORENSIC JUSTICE AND POLICY STUDIES

MACHINE LEARNING PRACTICAL ASSIGNMENT


Enrollment No.: - 15

PRACTICAL QUESTIONS

1. LIST AND EXPLAIN ANY 15 PANDAS LIBRARY FUNCTIONS WITH EXAMPLE.

2. LIST AND EXPLAIN ANY 15 NUMPY LIBRARY FUNCTIONS WITH EXAMPLE.

  array(): Converts any sequence-like object to a NumPy array.

  zeros(): Creates an array filled with zeros of a specified shape and data type.

  ones(): Creates an array filled with ones of a specified shape and data type.

  arange(): Creates an array with evenly spaced values within a range, using a given step.

  linspace(): Creates an array with evenly spaced values between a start and end, with a
specified number of elements.

  reshape(): Reshapes an array into a new shape without changing its data.

  random.randint(): Generates an array of random integers within a given range (upper bound
excluded), with the specified shape and data type.

  max(): Finds the maximum value in an array.

  min(): Finds the minimum value in an array.

  argmax(): Finds the index of the first occurrence of the maximum value in an array.

  argmin(): Finds the index of the first occurrence of the minimum value in an array.

  mean(): Calculates the mean (average) of all elements in an array.

  std(): Calculates the standard deviation of all elements in an array (population standard
deviation by default).

  sum(): Calculates the sum of all elements in an array.

  dot(): Calculates the dot product of two arrays. For 1-D arrays this is the sum of the products of
corresponding elements; for 2-D arrays it is the matrix product.
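
The functions listed above can be tried together in one short script. This is only a minimal sketch: the sample values and variable names are assumptions chosen for illustration, not the data used in the original assignment.

```python
import numpy as np

# Array creation
a = np.array([3, 7, 1, 9, 4])            # sequence -> ndarray
z = np.zeros((2, 3))                     # 2x3 array of 0.0
o = np.ones((2, 2), dtype=int)           # 2x2 array of integer 1
r = np.arange(0, 10, 2)                  # values 0, 2, 4, 6, 8
l = np.linspace(0, 1, 5)                 # 5 evenly spaced values from 0 to 1
m = np.arange(6).reshape(2, 3)           # 6 elements rearranged into 2 rows x 3 columns

# Random integers in [0, 10), shape (2, 3); values differ on every run
rand_ints = np.random.randint(0, 10, size=(2, 3))

# Reductions on the 1-D array `a`
print(a.max(), a.min())                  # 9 1
print(a.argmax(), a.argmin())            # 3 2
print(a.mean(), a.sum())                 # 4.8 24
print(a.std())                           # population standard deviation (ddof=0)

# Dot product of two 1-D sequences: 1*4 + 2*5 + 3*6
print(np.dot([1, 2, 3], [4, 5, 6]))      # 32
```

Passing `ddof=1` to `std()` would give the sample standard deviation instead of the population one.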

3. EXPLAIN 30 SCIKIT-LEARN LIBRARY FUNCTIONS.

 load_iris:
o Loads the Iris dataset, a classic dataset for classification tasks.
o It returns features (sepal length, sepal width, petal length, petal width) and target
labels (species).
 train_test_split:
o Splits data into training and testing sets for model evaluation.
o Helps detect overfitting by assessing model performance on unseen data.
 StandardScaler:
o Standardizes features by removing the mean and scaling to unit variance.
o Ensures that features are on a similar scale, which is important for many machine
learning algorithms.
 PCA (Principal Component Analysis):
o Reduces dimensionality of data by transforming it into a lower-dimensional space.
o Retains the most important information while reducing the number of features.
 SVC (Support Vector Classifier):
o Implements Support Vector Machine (SVM) algorithm for classification tasks.
o Finds the hyperplane that best separates classes in a high-dimensional space.
 RandomForestClassifier:
o Constructs a random forest ensemble model consisting of multiple decision trees.
o Combines predictions from individual trees to improve accuracy and reduce
overfitting.
 LogisticRegression:
o Fits a logistic regression model to binary or multiclass classification problems.
o Estimates the probability that an instance belongs to a particular class based on input
features.
 accuracy_score:
o Computes accuracy of a classification model by comparing predicted labels to true
labels.
o Provides a simple metric for evaluating classification performance.
 classification_report:
o Generates a detailed report containing various metrics for evaluating classification
model performance.
o Includes precision, recall, F1-score, and support for each class, as well as overall
accuracy.
 KMeans:
o Performs K-Means clustering, dividing data into K clusters based on similarity.
o Assigns data points to the nearest cluster centre iteratively.

 KNeighborsClassifier:
o Implements the K-Nearest Neighbours algorithm for classification tasks.
o Assigns the class label based on the majority class among the K nearest neighbors.
 DecisionTreeClassifier:
o Constructs a decision tree model for classification tasks.
o Makes decisions by recursively splitting the feature space based on feature values.
 GaussianNB:
o Fits a Gaussian Naive Bayes classifier to the data.
o Assumes that features follow a Gaussian distribution and are conditionally
independent given the class.
 CountVectorizer:
o Converts text documents into a matrix of token counts.
o Represents documents as vectors of word frequencies, suitable for machine learning
algorithms.
 Pipeline:
o Chains together multiple steps of data pre-processing and model training into a single
object.
o Simplifies the workflow by encapsulating pre-processing and modelling steps.
 ColumnTransformer:
o Applies different pre-processing transformations to different columns of the input
data.
o Useful for handling heterogeneous data types within the same dataset.
 MinMaxScaler:
o Scales features to a specified range (usually [0, 1]).
o Preserves the shape of the original distribution while scaling features.
 GridSearchCV:
o Performs exhaustive search over specified parameter values for an estimator.
o Finds the best parameters for a model by cross-validating on a grid of possible values.
 OneHotEncoder:
o Converts categorical features (strings or integers) into one-hot encoded features.
o Creates a binary column for each category, representing presence or absence.
 LabelEncoder:
o Encodes target labels with values between 0 and n_classes-1.
o Converts categorical labels into numerical format suitable for model training.
 KFold:
o Splits data into K consecutive folds for cross-validation.
o Each fold is used once as a validation while the K-1 remaining folds form the training
set.
 GridSearchCV:
o Performs hyperparameter tuning by exhaustive search over a specified parameter
grid.
o Finds the best combination of hyperparameters for an estimator using cross-
validation.
 RandomizedSearchCV:
o Performs hyperparameter tuning by sampling from specified parameter distributions.
o Useful for exploring a wide range of hyperparameter values efficiently.
 Ridge:
o Fits a linear regression model with L2 regularization (ridge regression).
o Penalizes large coefficients to prevent overfitting.
 Lasso:
o Fits a linear regression model with L1 regularization (Lasso regression).
o Encourages sparsity by shrinking coefficients and setting some of them to zero.
 ElasticNet:
o Fits a linear regression model with a combination of L1 and L2 regularization.
o Controls the balance between Lasso and Ridge penalties.
 AdaBoostClassifier:
o Constructs an AdaBoost ensemble model by iteratively training weak learners.
o Focuses more on instances that are misclassified by previous classifiers.
 GradientBoostingClassifier:
o Builds a Gradient Boosting ensemble model by sequentially adding weak learners.
o Minimizes errors by optimizing a differentiable loss function.
 VotingClassifier:
o Combines multiple classifiers by voting on their predictions.
o Can use hard voting (majority rule) or soft voting (averaged predicted probabilities,
optionally weighted).
 BaggingClassifier:
o Constructs an ensemble model by training multiple base estimators on random
subsets of the training data.
o Reduces variance and improves stability by averaging predictions from multiple
models.
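
Several of the functions above are typically used together. The following is a minimal sketch, not the assignment's original code: it chains `load_iris`, `train_test_split`, `StandardScaler`, `LogisticRegression`, `Pipeline`, `accuracy_score`, and `classification_report`. The split ratio, `random_state`, and `max_iter` values are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Features (4 measurements per flower) and target labels (species)
X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows for evaluation; stratify keeps class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaling and classification chained into a single estimator
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# Evaluate on the unseen test rows
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"accuracy: {acc:.3f}")
print(classification_report(y_test, y_pred))
```

Because the scaler sits inside the pipeline, it is fitted on the training rows only, which avoids leaking test-set statistics into the model.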

4. IMPLEMENT ONE-LINE AND MULTI-LINE CHART.

5. DEMONSTRATE DIFFERENT TYPES OF GRAPHS IN PYTHON?
 BAR CHART

 SCATTER PLOT

 PIE CHART

 HISTOGRAM PLOT

6. IMPORT THE WALMART SALES DATA AND COMPARE THE SALES OVER
TIME.

7. COMPARE THE MARKS OF A STUDENT ACROSS TWO SEMESTERS.

8. IMPLEMENT GEOGRAPHICAL DATA PLOTTING.

9. IMPLEMENT VARIOUS PLOTS IN THE SEABORN LIBRARY.
 Histplot: - It is used to create histograms, which display the distribution of a single
continuous variable. It divides the data into bins and represents the frequency or count of
observations in each bin using bars.

 Barplot: - Represents categorical data with rectangular bars, useful for comparing the
quantities of different categories.

 Countplot: - It is used to create bar plots that represent the counts of observations in each
category of a categorical variable. It displays the frequency or count of each category using
rectangular bars.

 Boxplot: - Summarizes the distribution of a numeric variable through its median and quartiles,
with whiskers showing the spread and individual points marking outliers. It is useful for
comparing distributions across categories.

10. USE PANDAS AND MATPLOTLIB TO VISUALIZE COMPANY SALES DATA.

2. Get total profit of all months and show a line plot.

3. Read toothpaste sales data of each month and show it using a scatter plot.

4. Read face cream and facewash product sales data and show it using a bar chart.

5. Read the total profit of each month and show it using the histogram to see the most
common profit ranges.

6. Calculate total sales data for last year for each product and show it using a pie chart.

11. DIFFERENCE BETWEEN MATPLOTLIB AND SEABORN.

Feature | Matplotlib | Seaborn
Ease of Use | Low-level interface; requires more code for simple visualizations | High-level interface; simplifies complex visualizations with fewer lines of code
Default Styles | Basic styles, customization required for aesthetically pleasing plots | Attractive default styles for plots
Integration with Pandas | Requires more code for integrating with Pandas DataFrames | Seamless integration with Pandas DataFrames
Statistical Plotting | Limited statistical plotting capabilities | Rich statistical plotting functions and features
Visualization Types | Supports a wide range of basic and advanced plots | Focuses on statistical plots, such as distribution plots, categorical plots, etc.
Customization | Offers extensive customization options, but may require more effort | Provides easier customization with built-in themes and options
API | Object-oriented interface, more control over individual plot components | Concise functions for common plotting tasks

12. FIND OUT THE GRAPH WHICH ARE NOT IMPLEMENTED IN MATPLOTLIB BUT
IN SEABORN?
Seaborn offers several types of plots that are not directly implemented in Matplotlib. Some
examples include:
1. Pair Plot: A grid of pairwise relationships in a dataset, useful for exploring correlations
between variables.
2. Joint Plot: Displays a relationship between two variables along with the associated
marginal distributions.
3. Cluster Map: A heatmap that uses hierarchical clustering to arrange rows and columns,
useful for exploring patterns in high-dimensional datasets.
4. Violin Plot: A combination of a box plot and a kernel density plot, showing the
distribution of a numeric variable across different levels of one or more categorical
variables.

5. Factor Plot (now renamed to catplot): A general categorical plotting function that can
create various types of plots like strip plots, swarm plots, box plots, violin plots, etc., with
the ability to easily switch between different plot types using the kind parameter.
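
The `kind` parameter mentioned in the last item can be illustrated with a short loop. This is only a sketch: the data is synthetic and the column names `day` and `value` are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data (assumption): 40 values for each of three days
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "day": np.repeat(["Mon", "Tue", "Wed"], 40),
    "value": rng.normal(50, 8, size=120),
})

# Same data and same call; only `kind` changes the plot type
for kind in ["strip", "box", "violin"]:
    grid = sns.catplot(data=df, x="day", y="value", kind=kind)
    grid.figure.suptitle(f"catplot(kind='{kind}')")
    grid.savefig(f"catplot_{kind}.png")
```

`catplot` returns a `FacetGrid`, so the same call can also split the data into panels with its `col` and `row` arguments.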

13. IDENTIFY THE DIFFERENT ATTRIBUTES OF BAR CHARTS IN MATPLOTLIB AND SEABORN AND
COMPARE THEM.
Both Matplotlib and Seaborn provide functionality for creating bar charts, and they
share many common attributes. However, there are differences in how these attributes
are accessed and used. Let's compare the attributes of bar charts in Matplotlib and
Seaborn:
Matplotlib:
 Simple Syntax: Matplotlib's bar chart syntax typically involves creating a figure
and axes objects and then calling the `bar()` function to plot the bars.
 Customization Control: Matplotlib provides fine-grained control over the
appearance of the bars and the overall plot. Users can customize attributes such
as bar width, colors, edge colors, transparency, and alignment.
 Limited Statistical Plotting: Matplotlib is primarily a low-level plotting library,
so it doesn't provide built-in support for statistical plotting features like
aggregating data and computing summary statistics.
 Integration with Other Plot Types: Matplotlib seamlessly integrates bar charts
with other types of plots on the same figure, allowing for complex multi-panel
visualizations.

Seaborn:
 Higher-Level Interface: Seaborn's `barplot()` function provides a higher-level
interface compared to Matplotlib's `bar()` function. It automatically aggregates
and summarizes data, making it easier to create bar charts from statistical
datasets.
 Automatic Aggregation: Seaborn's `barplot()` function aggregates the data and
computes summary statistics (e.g., mean, median) for each category,
simplifying the visualization process.
 Color Palettes: Seaborn provides built-in color palettes that can be easily
applied to bar charts, enhancing the visual appeal of the plots.

 Error Bars: Seaborn's `barplot()` function can display error bars to represent
uncertainty or variability in the data.
 Facet Grids: Seaborn supports the creation of facet grids, allowing users to
create separate bar charts for subsets of the data based on the values of
categorical variables.

Comparison:
 Ease of Use: Seaborn's higher-level interface and automatic aggregation make it
easier to create bar charts from statistical datasets without manually
preprocessing the data.
 Customization: Matplotlib offers more customization options for fine-tuning the
appearance of the bars and the overall plot.
 Statistical Features: Seaborn provides built-in support for statistical features like
automatic data aggregation and computed error bars. Matplotlib's `bar()` can draw
error bars through its `yerr` argument, but the values must be calculated beforehand.
 Integration with Other Plot Types: Matplotlib seamlessly integrates bar charts
with other plot types, allowing for more complex multi-panel visualizations.
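
The comparison above can be made concrete with the same data drawn both ways. This is a minimal sketch; the `product` and `sales` columns and their values are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative data (assumption): three sales records per product
df = pd.DataFrame({
    "product": ["A", "A", "A", "B", "B", "B"],
    "sales":   [10, 12, 14, 20, 22, 27],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

# Matplotlib: aggregate by hand, then draw one bar per category
means = df.groupby("product")["sales"].mean()
ax1.bar(means.index, means.values, width=0.5,
        color="steelblue", edgecolor="black")
ax1.set_title("plt.bar (manual mean)")

# Seaborn: pass the raw rows; barplot computes the mean and an error bar
sns.barplot(data=df, x="product", y="sales", ax=ax2)
ax2.set_title("sns.barplot (automatic mean)")

fig.tight_layout()
fig.savefig("bar_comparison.png")
```

Both panels show the same bar heights (12 for A and 23 for B); the difference is where the aggregation happens.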

14. WHAT IS DIST PLOT IN SEABORN?


A distplot in Seaborn is a function used to visualize the distribution of univariate (one
variable) data. It combines a histogram of the data with a kernel density estimate (KDE)
plot, providing a smooth estimate of the underlying probability density function.
Distplot is useful for understanding the distribution of a single variable, identifying
patterns, and assessing the data's skewness, kurtosis, and central tendency. Additionally,
it can optionally display rug plots along the x-axis, showing the actual data points.
Note that `distplot` has been deprecated since Seaborn 0.11; `histplot()` and `displot()`
are its replacements and cover the same use cases.

15. WHICH LIBRARY IS MOSTLY USED FOR GEOGRAPHICAL PLOTTING AND WHY?
The choice of library for geographical plotting depends on various factors such as ease
of use, flexibility, available features, and specific requirements of the project. Two
popular libraries for geographical plotting in Python are Matplotlib's Basemap toolkit
and Cartopy. However, in recent years, Cartopy has gained more popularity and is now
the preferred library for geographical plotting. Here's why:

 Integration with Matplotlib: Cartopy is built on top of Matplotlib, which
makes it seamlessly integrate with Matplotlib's plotting functionalities. Users
familiar with Matplotlib can easily transition to Cartopy without a steep learning
curve.
 Projections Support: Cartopy provides support for a wide range of map
projections, including cylindrical, conic, azimuthal, and global projections. This
allows users to create maps tailored to their specific needs and geographic
regions.
 Geospatial Data Handling: Cartopy simplifies the handling of geospatial data.
It reads Shapefiles through its shapereader module and works alongside libraries
such as GeoPandas and xarray for formats like GeoJSON and NetCDF, so users can
load and visualize geospatial datasets with little extra code.
 Geographic Features: Cartopy offers built-in support for adding geographic
features to maps, such as coastlines, rivers, lakes, political boundaries, and land
cover data. These features enhance the visual representation of maps and provide
context for geographical analysis.
 Cartographic Quality: Cartopy emphasizes cartographic quality and accuracy
in map rendering. It provides tools for adjusting map elements, including
gridlines and labels, to create publication-quality maps suitable for
scientific research and presentations.
 Active Development and Community Support: Cartopy is actively developed
and maintained by a community of contributors, ensuring regular updates, bug
fixes, and improvements.
While Cartopy is the preferred library for geographical plotting in Python due to its
flexibility, features, and ease of use, it's essential to evaluate the specific requirements
of your project and choose the library that best suits your needs. Other libraries like
Folium and Geopandas also offer geospatial visualization capabilities and may be
suitable for certain use cases. Ultimately, the choice of library depends on factors such
as data format, visualization requirements, and familiarity with the library's API.

16. WRITE DOWN THE INSTALLING COMMAND FOR BOTH THE LIBRARIES IN PYTHON.

 Matplotlib: Matplotlib is a widely used plotting library in Python for creating
static, interactive, and publication-quality visualizations.
 To install Matplotlib using pip: pip install matplotlib
This command will download and install Matplotlib and its dependencies from
the Python Package Index (PyPI).
 Seaborn: Seaborn is a statistical data visualization library based on Matplotlib,
providing high-level functions for creating attractive and informative statistical
graphics.
 To install Seaborn using pip: pip install seaborn

17. IMPLEMENT STACKED HISTOGRAM IN SEABORN.

18. IMPLEMENT HEATMAP IN SEABORN.

19. IMPLEMENT TIME SERIES ANALYSIS IN SEABORN AND IDENTIFY WHICH GRAPH
PERFORMS THE BEST TO PLOT TIME SERIES IN SEABORN.
Seaborn is not primarily designed for time series analysis, but it does offer some
functionality for visualizing time series data. Seaborn's strength lies more in statistical
visualization rather than time series analysis. However, you can still use Seaborn
effectively to plot time series data for exploratory analysis and visualization.
Here's how you can implement time series analysis in Seaborn and identify which graph
performs best for plotting time series data:
 Import Libraries: Begin by importing the necessary libraries, including Seaborn,
Matplotlib, and Pandas for data manipulation and visualization.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
 Prepare Your Time Series Data: Load your time series data into a Pandas
DataFrame and ensure that the index is set to the time component.
Example time series data:
data = pd.read_csv('your_time_series_data.csv', parse_dates=['date_column'],
                   index_col='date_column')
 Plotting Time Series Data: Seaborn offers various plotting functions that can be
used to visualize time series data, including line plots (`lineplot`), scatter plots
(`scatterplot`), and bar plots (`barplot`). You can use these functions to explore
patterns and trends in your time series data.
Example of using Seaborn's lineplot to plot time series data:
sns.lineplot(data=data)
# Add labels and title
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Time Series Data Visualization')
plt.show()
 Choosing the Best Plot for Time Series: The choice of the best plot for
visualizing time series data depends on the specific characteristics of your data
and the insights you want to gain. Here are some common types of plots used
for time series analysis:
 Line Plot (`lineplot`): Line plots are suitable for visualizing trends and patterns
over time. They can show the overall trend as well as fluctuations and
seasonality in the data.
 Scatter Plot (`scatterplot`): Scatter plots can be used to explore relationships
between two time series variables. They are useful for identifying correlations
and outliers.
 Bar Plot (`barplot`): Bar plots can be used to compare values across different
time periods or categories. They are suitable for visualizing discrete changes in
the data over time.

Ultimately, the best plot for visualizing time series data depends on the specific
characteristics and goals of your analysis. Experiment with different types of plots and
choose the one that effectively communicates the insights you want to convey.
In summary, while Seaborn is not specifically tailored for time series analysis, it can
still be used effectively to visualize time series data and explore patterns and trends.
Experimenting with different plotting functions can help you identify the most suitable
visualization for your time series analysis tasks.

20. WHAT IS A VIOLIN PLOT IN SEABORN AND WHY IS IT USED?


In Seaborn, a violin plot is a visualization tool that combines elements of box plots and kernel
density plots. It provides a comprehensive overview of the distribution of numerical data across
one or more categorical variables.
Why it's used:

 Clearer depiction of distribution: Compared to box plots, violin plots offer a richer
understanding of the data's distribution by revealing its shape, skewness, and potential
outliers beyond the IQR.
 Effective for comparing groups: When comparing multiple groups, violin plots effectively
showcase both the central tendency (through quartiles and medians) and the spread (through
the density estimation) of the data within each category, aiding in identifying potential
differences and patterns.
 Visually appealing: Seaborn's violin plots are generally considered aesthetically pleasing,
making them suitable for presentations and reports where clear and informative data
visualization is crucial.
Example: -
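
The original example is not preserved in this copy. The following is a minimal stand-in sketch, not the author's code: the data is synthetic and the column names `class` and `marks` are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data (assumption): two classes with different spread
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "class": np.repeat(["X", "Y"], 100),
    "marks": np.concatenate([rng.normal(60, 8, 100), rng.normal(70, 15, 100)]),
})

fig, ax = plt.subplots()
# Each violin shows a mirrored KDE; the default inner box marks median and quartiles
sns.violinplot(data=df, x="class", y="marks", ax=ax)
ax.set_title("Marks distribution by class")
fig.savefig("violin.png")
```

The wider violin for class Y reflects its larger standard deviation, which a bar chart of the two means would hide.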

21. IMPLEMENT PAIR PLOT IN SEABORN.

A pair plot in Seaborn, also known as a scatterplot matrix, is a grid of pairwise relationships
between different variables in a dataset. The off-diagonal cells show a scatterplot for each pair of
variables, while the diagonal cells show the univariate distribution of each variable (a histogram or KDE).

22. COMPARE THE BELOW GRAPH IN MATPLOTLIB AND SEABORN.


1. Scatter

 Matplotlib: Creating a scatter plot in Matplotlib is straightforward. You can use the
`plt.scatter()` function to plot points.
 Seaborn: Seaborn provides a `scatterplot()` function, which adds more flexibility, such as
mapping additional variables to point color, size, and marker style (`hue`, `size`, `style`).
2. Histogram
 Matplotlib: Matplotlib's `plt.hist()` function is commonly used to create histograms.
It allows you to customize the number of bins, color, and transparency.
 Seaborn: Seaborn's `histplot()` function provides similar functionality but offers
more options for customization, including the ability to overlay kernel density
estimation (KDE) plots on top of the histogram bars.

3. Line
 Matplotlib: Matplotlib's `plt.plot()` function is commonly used to create line graphs.
It allows for customization of line style, color, and markers.
 Seaborn: Seaborn provides a `lineplot()` function, which can aggregate repeated
observations and draw confidence intervals. Matplotlib's `plt.plot()` can still be
used alongside it for further customization.
4. Pie chart
 Matplotlib: Matplotlib's `plt.pie()` function is used to create pie charts. It allows for
customization of colors, labels, and explosion of slices.
 Seaborn: Seaborn does not have a built-in function for creating pie charts. It's more
focused on statistical visualization and does not include pie charts in its repertoire.

23. WHICH LIBRARY PERFORMS THE BEST WHILE PLOTTING 3-D GRAPH.
When it comes to plotting 3D graphs in Python, several libraries offer functionality to visualize
data in three dimensions. The choice of the best library depends on various factors such as ease
of use, customization options, performance, and the specific requirements of your project. Two
popular libraries for creating 3D graphs in Python are Matplotlib and Plotly.
1. Matplotlib: Matplotlib is a widely used plotting library in Python and includes
functionality for creating basic 3D plots.
→ Matplotlib's `mpl_toolkits.mplot3d` module provides tools for creating 3D plots,
including scatter plots, surface plots, and wireframe plots.
→ While Matplotlib is versatile and well-integrated with the Python ecosystem, its
3D plotting capabilities are relatively basic compared to other libraries.
→ Matplotlib is suitable for simple 3D visualizations and quick prototyping but may
lack advanced features required for complex 3D plots.

2. Plotly: Plotly is a powerful visualization library that offers interactive and high-quality
3D plotting capabilities.
→ Plotly's Python API allows users to create various types of 3D plots, including
scatter plots, surface plots, and contour plots, with rich interactivity and
customization options.
→ Plotly's 3D plots are web-based and can be embedded in web applications or
viewed in Jupyter notebooks with interactive features such as zooming, panning,
and rotating.
→ Plotly is well-suited for creating interactive 3D visualizations for presentations,
dashboards, and web applications but may involve additional setup and a steeper
learning curve compared to Matplotlib.
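
A basic 3D surface with Matplotlib's `mplot3d` toolkit can be sketched as follows. This example covers Matplotlib only; the surface function is an arbitrary choice made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Grid of (x, y) points and a ripple-shaped surface z = sin(sqrt(x^2 + y^2))
x = np.linspace(-3, 3, 50)
y = np.linspace(-3, 3, 50)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X**2 + Y**2))

fig = plt.figure(figsize=(7, 5))
ax = fig.add_subplot(projection="3d")   # 3D axes from mpl_toolkits.mplot3d
ax.plot_surface(X, Y, Z, cmap="viridis")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
fig.savefig("surface3d.png")
```

The resulting image is static; an interactive, rotatable version of the same surface is where Plotly would be the more natural choice.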

24. IMPLEMENT FIVE CORRELATION TECHNIQUES?


 Pearson Correlation coefficient: -

 Spearman rank correlation coefficient: -

 Kendall tau correlation coefficient: -

 Point biserial correlation coefficient: -

 Distance correlation: -
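
The original implementations are not preserved in this copy. The following is a hedged stand-in sketch, not the author's code: the first four coefficients use functions from `scipy.stats`, while `distance_correlation` is a hypothetical helper written here with NumPy because SciPy does not provide one. The data is synthetic.

```python
import numpy as np
from scipy import stats

# Synthetic data (assumption): y depends linearly on x plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)
group = (x > 0).astype(int)   # binary variable for the point-biserial case

pearson_r, _ = stats.pearsonr(x, y)          # linear association
spearman_rho, _ = stats.spearmanr(x, y)      # rank-based, monotonic association
kendall_tau, _ = stats.kendalltau(x, y)      # concordant vs discordant pairs
pb_r, _ = stats.pointbiserialr(group, y)     # binary vs continuous variable


def distance_correlation(a, b):
    """Sample distance correlation of two 1-D arrays (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    def centered(v):
        # Pairwise absolute distances, then double centering
        d = np.abs(v[:, None] - v[None, :])
        return (d - d.mean(axis=0, keepdims=True)
                  - d.mean(axis=1, keepdims=True) + d.mean())

    A, B = centered(a), centered(b)
    dcov2 = max((A * B).mean(), 0.0)             # squared distance covariance
    dvar2 = (A * A).mean() * (B * B).mean()      # product of squared distance variances
    return float(np.sqrt(dcov2 / np.sqrt(dvar2)))


print(f"Pearson:        {pearson_r:.3f}")
print(f"Spearman:       {spearman_rho:.3f}")
print(f"Kendall tau:    {kendall_tau:.3f}")
print(f"Point-biserial: {pb_r:.3f}")
print(f"Distance corr.: {distance_correlation(x, y):.3f}")
```

Unlike the other four, distance correlation ranges from 0 to 1 and can pick up non-linear dependence as well as linear dependence.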

25. IMPLEMENT LINEAR REGRESSION ON WEIGHTS- HEIGHTS DATA.

26. IMPLEMENT LINEAR REGRESSION ON HOUSE PRICE DATA.

27. TO PERFORM THE FOLLOWING DATA ENCODING TECHNIQUES USING
PYTHON.
 One-Hot Encoding: -

 Ordinal Encoding: -

 Dummy Encoding: -
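
The original implementations are not preserved in this copy. The three techniques can be sketched with Pandas alone, as below. This is a stand-in illustration: the `size` column, its values, and the chosen category order are assumptions.

```python
import pandas as pd

# Illustrative data (assumption): one categorical column with three levels
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# One-hot encoding: one binary column per category (3 columns here)
one_hot = pd.get_dummies(df["size"], prefix="size")

# Dummy encoding: same idea, but the first category is dropped (2 columns),
# which avoids perfectly collinear columns in linear models
dummy = pd.get_dummies(df["size"], prefix="size", drop_first=True)

# Ordinal encoding: map each category to an integer that respects its order
order = {"small": 0, "medium": 1, "large": 2}
df["size_ordinal"] = df["size"].map(order)

print(one_hot)
print(dummy)
print(df)
```

scikit-learn's `OneHotEncoder` and `OrdinalEncoder` offer the same transformations in a form that can be placed inside a `Pipeline`.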
