PDS Exp 13 To 16
Experiment No: 13
Write a program to display the given bar plot using the Matplotlib library. For 200 random points for both X and Y, display a scatter plot.
Date:
Practical skills:
• Ability to create different types of plots such as line plots, scatter plots, bar plots, etc.
• Ability to customize the appearance of plots including labels, colors, legends, and titles
• Ability to add text, annotations, and shapes to the plots
• Ability to work with multiple plots and subplots
• Ability to export plots in different file formats like png, pdf, svg, etc.
• Ability to integrate matplotlib with other Python libraries like NumPy and Pandas.
Objectives: (a) To learn how to interpret and analyze data visualizations, and to use them to draw
insights and make informed decisions.
Theory:
In Matplotlib, a scatter plot is a chart type that displays data as a collection of points with the
position determined by the values of two variables. Each point on the scatter plot represents an
observation, and the position of the point on the X-Y axis is determined by the values of the two
variables.
A scatter plot is useful for exploring the relationship between two continuous variables. It can be
used to identify patterns or trends in the data and to detect the presence of outliers or unusual
observations. Scatter plots can also be used to assess the correlation between the two variables.
Matplotlib provides the scatter() function for creating scatter plots. The function takes two arrays,
one for the X-axis data and one for the Y-axis data, as its input arguments. Additional parameters
can be used to customize the appearance of the scatter plot, such as the color, size, and transparency
of the points.
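For reference, a minimal sketch of a scatter() call with the appearance parameters mentioned above; the data values are illustrative placeholders.

```python
import matplotlib.pyplot as plt

# Illustrative placeholder data
x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 5, 3]

# c sets the color, s the point size, alpha the transparency
plt.scatter(x, y, c='blue', s=40, alpha=0.7)
plt.show()
```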
Procedure:
1. Import necessary libraries: We will need the Matplotlib and NumPy libraries for this task.
2. Generate random data for the X and Y axes: We can use the NumPy library to generate
random data for both the X and Y axes.
3. Create a scatter plot: We can use the scatter method of the Matplotlib library to create a
scatter plot. We need to pass the X and Y data as arguments and specify the marker style
and color using the marker and c parameters, respectively.
4. Add title and labels: We can add a title and labels for the X and Y axes using the title, xlabel,
and ylabel methods of the Matplotlib library.
5. Set axes limits: We can set the limits for the X and Y axes using the xlim and ylim methods
of the Matplotlib library.
6. Display the plot: We can display the plot using the show method of the Matplotlib library.
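Putting the procedure together, one possible sketch, assuming the 200 points are drawn uniformly from [0, 1) with NumPy's default random generator:

```python
import numpy as np
import matplotlib.pyplot as plt

# Step 2: generate 200 random values for each axis
rng = np.random.default_rng()
x = rng.random(200)
y = rng.random(200)

# Step 3: create the scatter plot with a chosen marker style and color
plt.scatter(x, y, marker='o', c='green')

# Step 4: add a title and axis labels
plt.title('Scatter Plot of 200 Random Points')
plt.xlabel('X')
plt.ylabel('Y')

# Step 5: set the axis limits
plt.xlim(0, 1)
plt.ylim(0, 1)

# Step 6: display the plot
plt.show()
```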
Conclusion:
Quiz:
1. What is a scatter plot?
2. What is the function used for creating scatter plots in Matplotlib?
3. What are the input arguments for the scatter() function?
4. What can a scatter plot be used for?
5. Can the appearance of the scatter plot be customized?
Suggested Reference:
1. https://matplotlib.org/stable/index.html
2. https://realpython.com/python-matplotlib-guide/
3. Matplotlib Tutorial by Corey Schafer: https://www.youtube.com/playlist?list=PL-osiE80TeTvipOqomVEeZ1HRrcEvtZB_
4. Python Data Science Handbook by Jake VanderPlas:
https://jakevdp.github.io/PythonDataScienceHandbook/
5. Mastering Matplotlib by Duncan M. McGreggor and Paul Ivanov:
https://www.packtpub.com/product/mastering-matplotlib-second-edition/9781800565547
Rubrics:
1. Knowledge of subject (2): Good (2) / Average (1)
2. Programming Skill (2): Good (2) / Average (1)
3. Team work (2): Good (2) / Satisfactory (1)
4. Communication Skill (2): Good (2) / Satisfactory (1)
5. Ethics (2): Good (2) / Average (1)
Total Marks:
Experiment No: 14
Develop a program that reads a .csv file from a URL and plots the data of the dataset stored in the file. URL:
(https://github.com/chris1610/pbpython/blob/master/data/sample_salesv3.xlsx?raw=true)
Date:
Practical skills:
Objectives: (a) To analyze and visualize the data in an efficient and effective way.
(b) To identify patterns, trends, and outliers in the data.
Theory:
Reading a .csv file from a URL and plotting the data is a common data analysis and visualization
task in many fields. Here are the main steps involved in this process:
Importing the necessary libraries: To read and plot the .csv file, we typically use the pandas and
matplotlib libraries. We need to import them at the beginning of our program.
Loading the data from the URL: We can use the pandas library's read_csv function to read the data
from the URL. We need to provide the URL of the .csv file as an argument to this function.
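A minimal sketch of this loading step; the URL below is a hypothetical placeholder for any publicly reachable .csv file:

```python
import pandas as pd

url = 'https://example.com/data.csv'  # hypothetical placeholder URL
df = pd.read_csv(url)
print(df.head())  # inspect the first few rows before plotting
```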
Data cleaning and preparation: Once we have loaded the data, we may need to clean and prepare it
for visualization. This may include dropping unnecessary columns, filling missing values, and
transforming the data.
Data visualization: Once the data is cleaned and prepared, we can use matplotlib's various plotting
functions to create visualizations such as line plots, scatter plots, bar plots, and more. We can
customize the plot with various parameters such as colors, labels, titles, and more.
Displaying the plot: After creating the plot, we need to display it using the show function provided
by the matplotlib library.
In addition, the program should:
1. Validate inputs.
2. Handle errors.
Procedure:
1. Import the necessary libraries: You will need the pandas library to read the .csv file, and
matplotlib library to create the plot.
2. Read the .csv file from the URL: Use the pandas library to read the .csv file from the URL
and store it as a DataFrame object.
3. Preprocess the data: Preprocess the data as required. This may involve cleaning the data,
removing duplicates, handling missing values, and converting data types.
4. Visualize the data: Use the matplotlib library to create a visualization of the data. You can
create scatter plots, line graphs, histograms, and other types of visualizations based on the
data.
5. Save or display the visualization: Save the visualization to a file or display it on the screen,
depending on the user requirements.
6. Test and validate the program: Test the program thoroughly to ensure that it works as
expected for various input datasets. Validate the results against the expected output and fix
any issues or errors.
7. Document the program: Document the program by providing clear and concise comments
in the code and a user manual that explains how to use the program.
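A sketch following this procedure. Note that the URL in the experiment title actually points to an Excel (.xlsx) file, so the sketch uses pandas' read_excel (which requires the openpyxl package); for a genuine .csv URL, substitute read_csv. The column names 'date' and 'ext price' are assumptions about the sample dataset and should be verified against df.columns.

```python
import pandas as pd
import matplotlib.pyplot as plt

# The experiment's URL points to an .xlsx file, so read_excel is used;
# for a real .csv URL, use pd.read_csv instead.
url = ('https://github.com/chris1610/pbpython/blob/master/data/'
       'sample_salesv3.xlsx?raw=true')
df = pd.read_excel(url)  # requires the openpyxl package

# Basic preprocessing: drop incomplete rows and parse dates.
# 'date' and 'ext price' are assumed column names; verify with df.columns.
df = df.dropna()
df['date'] = pd.to_datetime(df['date'])

# Plot total sales per month as a line graph
monthly = df.set_index('date').resample('M')['ext price'].sum()
monthly.plot(kind='line', title='Monthly Sales')
plt.xlabel('Month')
plt.ylabel('Total sales')

# Save the figure to a file, then display it
plt.savefig('monthly_sales.png')
plt.show()
```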
Conclusion:
Quiz:
1. What library is required to read a .csv file in Python?
2. What library is required to create plots in Python?
3. What is the first step in developing a program that reads a .csv file from a URL and plots
the data?
4. How do you read a .csv file from a URL in Python using the pandas library?
5. How do you create a scatter plot of two columns from a DataFrame using the matplotlib
library?
6. How do you save a plot to a file using the matplotlib library?
Suggested Reference:
1. Pandas documentation on reading a CSV file from a URL:
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#reading-csv-files
2. Matplotlib documentation on creating plots:
https://matplotlib.org/stable/tutorials/introductory/pyplot.html
3. Real Python tutorial on reading and writing CSV files in Python:
https://realpython.com/python-csv/
4. DataCamp tutorial on data visualization with Matplotlib:
https://www.datacamp.com/community/tutorials/matplotlib-tutorial-python
5. Towards Data Science tutorial on creating visualizations with Pandas and Matplotlib:
https://towardsdatascience.com/data-visualization-with-pandas-and-matplotlib-8dadc69f2f79
Rubrics:
1. Knowledge of subject (2): Good (2) / Average (1)
2. Programming Skill (2): Good (2) / Average (1)
3. Team work (2): Good (2) / Satisfactory (1)
4. Communication Skill (2): Good (2) / Satisfactory (1)
5. Ethics (2): Good (2) / Average (1)
Total Marks:
Experiment No: 15
Write a text classification pipeline using a custom preprocessor and CharNGramAnalyzer, using data from Wikipedia articles as a training set.
Date:
Practical skills:
Objectives: (a) To develop a machine learning model that can accurately classify text documents
into predefined categories; such a model can be used for applications such as sentiment analysis,
spam detection, and topic modeling.
Theory:
Text classification is the task of assigning predefined categories or labels to text documents based
on their content. A text classification pipeline typically consists of several stages, including data
preprocessing, feature extraction, model training, and evaluation.
In the context of Wikipedia articles, the first step in building a text classification pipeline is to
collect a dataset of articles with their corresponding labels. These labels can be either manually
assigned or obtained from existing metadata such as categories or tags.
Once a dataset is obtained, the next step is data preprocessing. This typically involves text
normalization, tokenization, stop word removal, and stemming/lemmatization. The goal of data
preprocessing is to clean the text and reduce its dimensionality while retaining the relevant
information for classification.
After preprocessing, the text is converted into numerical features that can be used as input to a
machine learning model. A popular technique for feature extraction is the bag-of-words model,
which represents each document as a vector of word frequencies. However, this approach may not
capture the semantic meaning of words and their relationships in the text.
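As a concrete illustration of the bag-of-words representation, a minimal sketch using scikit-learn's CountVectorizer; the two documents are placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative placeholder documents
docs = ['the cat sat on the mat', 'the dog chased the cat']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary learned from docs
print(X.toarray())  # each row is a vector of word counts per document
```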
The final stage in the text classification pipeline is model training and evaluation. A common
approach is to use supervised learning algorithms such as Naive Bayes, Logistic Regression, or
Support Vector Machines. The performance of the model is evaluated using metrics such as
accuracy, precision, recall, and F1 score on held-out test sets.
Key considerations when building such a pipeline include:
1. Data privacy.
2. Bias and fairness.
3. Model accuracy and reliability.
4. Ethical considerations.
5. Test and review.
Procedure:
Collect and preprocess the data: Download a set of Wikipedia articles that represent the different
categories you want to classify (e.g., sports, politics, entertainment, etc.). Preprocess the data by
removing any unnecessary characters, converting all text to lowercase, and removing any stop
words.
Split the data: Split the preprocessed data into two sets: training and test sets. The training set will
be used to train the model, while the test set will be used to evaluate the model's performance.
Feature extraction: Extract the features from the preprocessed text using CharNGramAnalyzer. This
will convert each text document into a vector of features that can be used as input to the
classification model.
Train the model: Train a text classification model using the extracted features and the training set.
You can use any machine learning algorithm, such as Naive Bayes, SVM, or Neural Networks.
Evaluate the model: Use the trained model to classify the test set and evaluate its performance using
metrics such as accuracy, precision, recall, and F1-score.
Tune the model: If the model's performance is not satisfactory, you can tune the hyperparameters
of the algorithm or try different algorithms to improve its performance.
Deploy the model: Once you are satisfied with the model's performance, you can deploy it in
production to classify new text documents.
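A sketch of the full pipeline. Note that CharNGramAnalyzer existed only in very early scikit-learn releases; in current versions the equivalent is CountVectorizer with analyzer='char' or 'char_wb', which is what the sketch uses. Since downloading Wikipedia articles is outside the scope of the sketch, the 20 newsgroups dataset stands in for the training corpus:

```python
import re
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def preprocessor(text):
    """Custom preprocessor: lowercase and strip non-letter characters."""
    return re.sub(r'[^a-z\s]', ' ', text.lower())

# Stand-in corpus: three categories from 20 newsgroups instead of
# Wikipedia articles, which would need to be downloaded separately.
cats = ['rec.sport.baseball', 'talk.politics.misc', 'sci.space']
train = fetch_20newsgroups(subset='train', categories=cats)
test = fetch_20newsgroups(subset='test', categories=cats)

pipeline = Pipeline([
    # char_wb n-grams: the modern equivalent of CharNGramAnalyzer
    ('vect', CountVectorizer(preprocessor=preprocessor,
                             analyzer='char_wb',
                             ngram_range=(2, 4))),
    ('clf', MultinomialNB()),
])

pipeline.fit(train.data, train.target)
print(classification_report(test.target, pipeline.predict(test.data)))
```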
Conclusion:
Quiz:
1. What is the purpose of using a custom preprocessor in a text classification pipeline?
2. Which analyzer is used in the given scenario?
"Writing a text classification pipeline using a custom preprocessor and
CharNGramAnalyzer using data from Wikipedia articles as a training set."
3. What is the purpose of evaluating the performance on held-out test sets in text
classification?
Suggested Reference:
1. "Building a Text Classification Pipeline with Python" by Dipanjan Sarkar: This article
provides a step-by-step guide on how to build a text classification pipeline using Python
and scikit-learn library. It covers preprocessing techniques, feature extraction, model
selection, and evaluation.
2. "Text Classification with NLTK and Scikit-Learn" by Ahmed Besbes: This tutorial
provides a detailed guide on how to perform text classification using Python and two
popular libraries, NLTK and scikit-learn. It covers data preprocessing, feature extraction,
and model training and evaluation.
3. "Using Wikipedia Articles for Text Classification" by Nikolay Krylov: This article
demonstrates how to use Wikipedia articles as a training set for text classification. It
covers data collection, preprocessing, feature extraction using TF-IDF and
CharNGramAnalyzer, model training, and evaluation.
4. "Text Classification with Python and Scikit-Learn" by Sebastian Raschka: This book
chapter provides a comprehensive guide on how to perform text classification using
Python and scikit-learn. It covers data preprocessing, feature extraction, model training,
and evaluation, as well as advanced topics such as model selection and parameter tuning.
5. "A Complete Tutorial on Text Classification using Naive Bayes Algorithm" by Divya
Gupta: This tutorial provides a detailed guide on how to perform text classification using
Naive Bayes algorithm in Python. It covers data preprocessing, feature extraction, model
training and evaluation, as well as parameter tuning.
Rubrics:
1. Knowledge of subject (2): Good (2) / Average (1)
2. Programming Skill (2): Good (2) / Average (1)
3. Team work (2): Good (2) / Satisfactory (1)
4. Communication Skill (2): Good (2) / Satisfactory (1)
5. Ethics (2): Good (2) / Average (1)
Total Marks:
Experiment No: 16
Write a text classification pipeline to classify movie reviews as either positive or negative, find a good set of hyperparameters using grid search, and evaluate the performance on a held-out test set.
Date:
Objectives: (a) To create an accurate and reliable model that can automatically classify movie
reviews as positive or negative, which is useful for analyzing large volumes of reviews quickly
and efficiently, as well as for providing recommendations to users based on their preferences.
Theory:
The theory behind writing a text classification pipeline to classify movie reviews as either positive
or negative involves several key steps:
Data preprocessing: This step involves cleaning and preparing the raw text data by removing stop
words, converting text to lowercase, and performing stemming or lemmatization.
Feature extraction: This step involves converting the preprocessed text data into a numerical
representation that can be used as input to a machine learning algorithm. Common techniques
include Bag-of-Words, TF-IDF, and Word Embeddings.
Model selection and training: This step involves selecting an appropriate machine learning
algorithm and training it on the preprocessed and transformed data. Popular algorithms include
Naive Bayes, Support Vector Machines, and Neural Networks.
Hyperparameter tuning: This step involves selecting the optimal hyperparameters for the chosen
machine learning algorithm. This can be done using techniques such as grid search or random
search.
Evaluation: This step involves evaluating the performance of the trained model on a held-out test
set. This can be done using metrics such as accuracy, precision, recall, and F1-score.
Deployment: This step involves deploying the trained model in a production environment, where it
can be used to classify new movie reviews.
Grid search is a hyperparameter tuning technique that involves searching for the optimal set of
hyperparameters for a given machine learning algorithm by exhaustively trying all possible
combinations of hyperparameter values. This can be done by training and evaluating the model with
different combinations of hyperparameters on a validation set, and selecting the combination that
yields the best performance.
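A minimal sketch of this exhaustive search using scikit-learn's GridSearchCV, which handles the cross-validated training and selection automatically; the estimator and parameter grid are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic placeholder data
X, y = make_classification(n_samples=200, random_state=0)

# Try every combination of the listed hyperparameter values,
# scoring each by 5-fold cross-validation
param_grid = {'C': [0.01, 0.1, 1, 10]}
search = GridSearchCV(LinearSVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```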
Evaluating the performance of the trained model on a held-out test set is important to ensure that
the model generalizes well to new, unseen data. This helps to avoid overfitting, where the model
performs well on the training data but poorly on new data.
Overall, the theory behind writing a text classification pipeline to classify movie reviews as either
positive or negative involves a combination of data preprocessing, feature extraction, model
selection and training, hyperparameter tuning, evaluation, and deployment.
1. Data preprocessing
2. Feature extraction
3. Model selection
4. Hyperparameter tuning
5. Evaluation
Procedure:
1. Preprocess the data: Preprocess the movie review data by cleaning the text, removing stop
words, and performing stemming or lemmatization to reduce the dimensionality of the
feature space.
2. Split the data: Split the preprocessed data into training, validation, and test sets. The training
set will be used to train the model, the validation set will be used to tune the
hyperparameters, and the test set will be used to evaluate the final performance of the model.
3. Extract features: Extract features from the preprocessed text using techniques such as Bag-
of-Words, TF-IDF, or Word Embeddings. This will convert the text data into a numerical
representation that can be used as input to a machine learning algorithm.
4. Select a model: Choose a suitable machine learning algorithm, such as Naive Bayes, Support
Vector Machines, or Neural Networks, and train it on the preprocessed and transformed
data.
5. Hyperparameter tuning: Use grid search to find the best set of hyperparameters for the
chosen machine learning algorithm. This involves training and evaluating the model with
different combinations of hyperparameters on the validation set, and selecting the
combination that yields the best performance.
6. Evaluate the model: Evaluate the performance of the trained model on the held-out test set
using metrics such as accuracy, precision, recall, and F1-score.
7. Deploy the model: Deploy the trained model in a production environment, where it can be
used to classify new movie reviews.
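A sketch following this procedure. scikit-learn does not ship a movie review corpus, so the tiny inline review set below is a placeholder; a real run would load a corpus such as the IMDb polarity dataset (for example with sklearn.datasets.load_files):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Placeholder reviews, repeated so cross-validation has enough samples
reviews = [
    'a wonderful, moving film with brilliant acting',
    'absolutely terrible plot and wooden dialogue',
    'one of the best movies I have seen this year',
    'boring, predictable, and far too long',
    'a delightful story told with real heart',
    'I walked out halfway through, a complete mess',
] * 10
labels = ['pos', 'neg', 'pos', 'neg', 'pos', 'neg'] * 10

# Steps 1-2: split into training and held-out test sets
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, random_state=42)

# Steps 3-4: TF-IDF features feeding a logistic regression classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Step 5: grid search over feature and model hyperparameters (illustrative grid)
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1, 10],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_train, y_train)

# Step 6: evaluate the tuned model on the held-out test set
print('Best parameters:', search.best_params_)
print(classification_report(y_test, search.best_estimator_.predict(X_test)))
```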
Conclusion:
Quiz:
1. What is the first step you should take when developing a text classification pipeline?
2. What are some techniques for feature extraction in text classification?
3. Which of the following algorithms is not suitable for text classification?
4. What is grid search used for in text classification?
5. How do you evaluate the performance of a text classification model?
6. What is the purpose of a held-out test set?
Suggested Reference:
1. "Introduction to Machine Learning with Python" by Andreas C. Müller and Sarah Guido -
This book provides a comprehensive introduction to machine learning and includes a
section on text classification. It covers topics such as preprocessing text data, feature
extraction, and model evaluation.
2. "Natural Language Processing with Python" by Steven Bird, Ewan Klein, and Edward
Loper - This book provides an introduction to natural language processing and includes a
section on text classification. It covers topics such as feature selection, training classifiers,
and evaluation metrics.
4. "Text Classification in Python using spaCy" by Dipanjan Sarkar - This tutorial provides an
introduction to text classification using spaCy, a popular NLP library in Python. It covers
topics such as preprocessing text data, feature extraction, model selection, and
hyperparameter tuning.
Rubrics:
1. Knowledge of subject (2): Good (2) / Average (1)
2. Programming Skill (2): Good (2) / Average (1)
3. Team work (2): Good (2) / Satisfactory (1)
4. Communication Skill (2): Good (2) / Satisfactory (1)
5. Ethics (2): Good (2) / Average (1)
Total Marks: