DMlab2021
Lab exercises
1 Python
I assume that you know the basics of Python. If you have only a little experience in Python,
the exercises in this section will guide you through the process of creating a convenient programming
environment. At the end of this section you should be able to write and run Python code in a Jupyter
Notebook and also know basic Jupyter Notebook commands. You should also know how to install
new packages and switch environments.
Exercise 1 — Download and install Anaconda. This is a distribution of the Python and R program-
ming languages for applications related to data science (see wiki). Read the quick user guide and
the conda cheat sheet to learn how to create and switch environments and install new packages.
Exercise 2 — Using Anaconda, run Jupyter Notebook and see Help → User Interface Tour, Keyboard
Shortcuts. Learn how to create, move and run cells. Learn how to get information about objects and
methods and how to print their source code (e.g. you can use tab, shift+tab, shift+tab+tab, or write '?'
and '??' before a method name).
Exercise 3 — Recall what data structures are available in Python. Pay special attention to list
comprehensions, as they often help you write more readable and more efficient code.
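For instance, a small illustrative snippet (the data is made up):

numbers = [3, 1, 4, 1, 5, 9, 2, 6]
even_squares = [n ** 2 for n in numbers if n % 2 == 0]   # squares of the even numbers: [16, 4, 36]
squares = {n: n ** 2 for n in numbers}                   # a dict comprehension works similarly
print(even_squares, squares)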
Exercise 4 — Install and familiarize yourself with the numpy, pandas and matplotlib packages.
a) Read the official numpy user guide.
b) Read the official pandas user guide.
c) Read the official matplotlib user guide.
Exercise 5 — Find the source of your favorite book and save it in UTF-8 format. Load the book
and split it into single words. For example, you can use a construction like: (5p)
with open("Catch_22.txt", encoding="UTF-8") as f:
words = [word
for line in f
for word in line.split()]
Change all words to lower case, remove punctuation and remove stop-words. You may try constructions like:
from string import punctuation
words = [word.lower().translate(str.maketrans('', '', punctuation))
         for ...]
A list of stop-words can be found on the Internet (e.g. here). You may also try using a stemming procedure
to reduce different forms of a given word to a common form:
from stemming.porter2 import stem
filtered_words = [stem(word) for word in filtered_words]
Next, convert the obtained list of words into a list of pairs (word,1) of the type (String,Int):
pairs = [(w,1) for w in filtered_words]
Group the list of pairs by different words and count the total number of occurrences of each word
to get pairs (word, occurrences). For example, you may use the groupby method. Note, however,
that groupby requires the input list to be sorted by keys, and such sorting might be
computationally costly for large lists. Try to think up and implement a more efficient way of
aggregating occurrences that does not require sorting (one possible direction is sketched after the snippet below).
from itertools import groupby
pairs.sort()
word = lambda pair: pair[0]
grouped_pairs = [(w, sum(1 for _ in g)) for w, g in groupby(pairs, key=word)]
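As a hint, one way to avoid sorting is to accumulate counts in a dictionary (a minimal sketch; the standard library collections.Counter achieves the same in one line):

# A sort-free aggregation sketch using a plain dictionary.
counts = {}
for w, c in pairs:
    counts[w] = counts.get(w, 0) + c            # accumulate occurrences per word
grouped_pairs = list(counts.items())            # [(word, occurrences), ...]

# Equivalently: from collections import Counter
# grouped_pairs = list(Counter(w for w, _ in pairs).items())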
Remove some of the initial elements (most common words) and save the result to a text file. Build a
word-cloud from the obtained list. You can use the service http://www.wordclouds.com/.
where t denotes a term (word), d denotes a document and D denotes the collection of all
documents. Term frequency tf (t, d) is the number of times a term t appears in document d.
Inverse document frequency idf (t, D) is often defined as
idf(t, D) = log( |D| / (1 + |{d ∈ D : t ∈ d}|) ).
There are packages that make it easy to find tf-idf weights, but try to implement the appropriate
procedure yourself (a short sketch is given after the remaining steps below).
4. For each document separately build a word cloud using obtained tf-idf weights.
5. Build a word cloud based on tf-idf weights for the entire book.
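As a hint, a minimal sketch of such a procedure (assuming chapters is a list of documents, each already split into cleaned words; all names are illustrative):

import math

def tf_idf(chapters):
    n_docs = len(chapters)
    # Document frequency: in how many chapters does each term appear?
    df = {}
    for doc in chapters:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    # For each chapter, a dictionary {term: tf-idf weight}.
    weights = []
    for doc in chapters:
        tf = {}
        for t in doc:
            tf[t] = tf.get(t, 0) + 1            # raw term count in this chapter
        weights.append({t: tf[t] * math.log(n_docs / (1 + df[t])) for t in tf})
    return weights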
Exercise 7 — Write a function that takes a word as input and uses tf-idf weights to create a list
of the chapters of your book that best match that word (i.e. it should return a list of chapters
sorted according to the appropriate tf-idf weights). (5p)
Exercise 8 — For each word in your book make a list of the five most common words that appear
directly after it (ignoring stop-words). Use this summary to generate a random
paragraph that resembles a paragraph of your book. (5p)
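A possible sketch of collecting the successor statistics and sampling a paragraph (assuming filtered_words is the cleaned word list from Exercise 5; the helper names are illustrative):

import random
from collections import defaultdict, Counter

followers = defaultdict(Counter)
for current, nxt in zip(filtered_words, filtered_words[1:]):
    followers[current][nxt] += 1                # count words appearing directly after `current`
# Keep only the five most common successors of each word.
top5 = {w: [nxt for nxt, _ in cnt.most_common(5)] for w, cnt in followers.items()}

def random_paragraph(start, length=50):
    words, current = [start], start
    for _ in range(length - 1):
        candidates = top5.get(current)
        current = random.choice(candidates) if candidates else random.choice(list(top5))
        words.append(current)
    return " ".join(words)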
Exercise 9 — Using Python solve applied exercises 13 and 14 from Section 3.7 in the ISL book. (20p)
Exercise 10 — Read lab1.ipynb and download Auto.csv from my webpage. Read it as a dataframe
and change the origin column to the category type. Split the data into training and
validation sets. (20p)
a) Use the statsmodels library for linear regression with mpg as the response and horsepower as
the feature (see the sketch after this list). Be prepared to explain the parameters returned by the summary() method that we have
discussed, in particular: confidence intervals, p-values, the t-statistic, the F-statistic and R-squared.
b) Create a scatterplot matrix which includes all of the variables in the data set. You can
use pandas.plotting.scatter_matrix(...). Compute the matrix of correlations
between the variables; you may use the corr() method of a pandas dataframe.
c) Perform a linear regression with mpg as the response and all other variables (except name) as
the features. Try defining different models with the patsy library, using the symbols +, *, : and different
transformations of the variables such as I(np.log(X)) or I(np.sqrt(X)). For
which model do you get the best generalization error?
d) Try to look for outliers and remove them from the data (see e.g. the residual plot or the Z-score). What
are high leverage points? How can you detect them (for example see here)? Retrain your
models on the cleaned data and compare the results.
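A minimal sketch of a possible workflow for a)–c) (the column names follow the standard Auto.csv data set, and the split and formulas are only examples):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from pandas.plotting import scatter_matrix

# In Auto.csv missing horsepower values are marked with '?'.
auto = pd.read_csv("Auto.csv", na_values="?").dropna()
auto["origin"] = auto["origin"].astype("category")

# A simple random train/validation split.
train = auto.sample(frac=0.7, random_state=0)
valid = auto.drop(train.index)

# a) Simple linear regression with a patsy formula.
model = smf.ols("mpg ~ horsepower", data=train).fit()
print(model.summary())                     # coefficients, p-values, F-statistic, R-squared, ...

# b) Scatterplot matrix and correlation matrix of the numeric variables.
scatter_matrix(auto.select_dtypes("number"), figsize=(10, 10))
print(auto.select_dtypes("number").corr())

# c) A richer formula with an interaction and a transformation.
model2 = smf.ols("mpg ~ weight * year + I(np.log(horsepower))", data=train).fit()
print(((valid["mpg"] - model2.predict(valid)) ** 2).mean())   # validation MSE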
4 Classification (deadline: 6th lab)
Exercise 11 — Categorical predictors. Using the Auto.csv data from the previous list, create and compare
two linear regression models for predicting mpg. In the first model treat year as a continuous
variable; in the second treat year as a categorical variable. Which model is better? What
if there were more than 13 values of the year variable? Which model is easier to train? (Hint: see
the lab2.ipynb notebook mentioned in the lecture.) (10p)
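A minimal sketch of the two formulations using the statsmodels formula interface (the C() wrapper marks a variable as categorical; auto is the dataframe from Exercise 10):

import statsmodels.formula.api as smf

continuous = smf.ols("mpg ~ year", data=auto).fit()        # year as a number
categorical = smf.ols("mpg ~ C(year)", data=auto).fit()    # one coefficient per distinct year
print(len(continuous.params), len(categorical.params))     # 2 vs. roughly the number of years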
Exercise 12 — Download the Credit.csv file. The dataset is described here. Create logistic regression
models with as high a prediction accuracy as possible for predicting
a) whether a given person has an income greater than 50 (hint: create a new indicator variable),
b) how many credit cards a person has. (10p)
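A possible sketch for part a) (assuming Credit.csv has been loaded into a dataframe credit with an Income column, as in the standard Credit data set):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

credit["HighIncome"] = (credit["Income"] > 50).astype(int)       # a) new indicator variable
X = pd.get_dummies(credit.drop(columns=["Income", "HighIncome"]), drop_first=True)
y = credit["HighIncome"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))                                 # test accuracy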
Exercise 13 — Repeat the previous exercise with the K-Nearest Neighbors and Decision Tree clas-
sification models. You may use the scikit-learn implementations: KNN and DT. For KNN check different
values of the parameter n_neighbors (the number of considered neighbors). For DT check different
values of the parameter max_depth (the maximum depth of the tree).
What is the best model you get in cases a) and b)? To get a more reliable answer you may use
cross-validation to run many experiments on a single dataset (e.g. use the KFold method, see
lab3.ipynb). (10p)
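A minimal sketch of comparing hyper-parameter values with cross-validation (reusing X and y from the previous sketch; the parameter grids are only examples):

from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for k in (1, 3, 5, 10, 20):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=cv)
    print(f"KNN n_neighbors={k}: mean accuracy {scores.mean():.3f}")
for depth in (2, 4, 6, 8, None):
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth, random_state=0), X, y, cv=cv)
    print(f"DT max_depth={depth}: mean accuracy {scores.mean():.3f}")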
Exercise 14 — For problems a) and b) from Exercise 12 choose two continuous predictors that
seem to be important and plot the decision boundaries of different models (e.g. logistic regression,
KNN, DT, RF). You may read this tutorial. (20p)
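A minimal sketch of the usual mesh-grid approach for a classifier clf fitted on two features (X2 is an (n, 2) array of the chosen predictors and y holds numeric class labels; all assumed prepared earlier):

import numpy as np
import matplotlib.pyplot as plt

x_min, x_max = X2[:, 0].min() - 1, X2[:, 0].max() + 1
y_min, y_max = X2[:, 1].min() - 1, X2[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300), np.linspace(y_min, y_max, 300))

# Predict a class for every grid point and colour the regions.
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X2[:, 0], X2[:, 1], c=y, edgecolors="k")
plt.show()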
Exercise 15 — In the notebook lab4.ipynb presented during the lecture we created a pipeline
for text classification (data set: 20newsgroups). Try to increase the test accuracy by improving the
pipeline and by using a stronger classification model. You can get 2(x − 70) points, where x is your
average test accuracy (in percent) obtained from cross-validation. In the example we used a
single Decision Tree; at the very least you should try a Random Forest model. What is the out-of-bag error?
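A minimal sketch of such a pipeline with a Random Forest (the vectorizer settings are only an example; oob_score=True makes the forest report its out-of-bag accuracy):

from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

pipe = make_pipeline(
    TfidfVectorizer(stop_words="english", max_features=20000),
    RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0, n_jobs=-1),
)
pipe.fit(train.data, train.target)
print("test accuracy:", pipe.score(test.data, test.target))
print("OOB accuracy:", pipe.named_steps["randomforestclassifier"].oob_score_)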
Exercise 16 — In the notebook lab5.ipynb we presented a simple neural network created with
the Keras library. Try to solve the classification problem from the previous exercise with a similar neural
network. Test different hyper-parameters (e.g. the size and number of layers or the batch size). If you don't
have a GPU you may try to use Google Colab. To avoid dealing with very large input vector spaces
you may consider only the top N most common words in the dataset. (20p)
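A minimal sketch of such a network (assuming X_train, X_test are dense feature matrices restricted to the N most common words and y_train, y_test are integer labels for the 20 classes):

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(20, activation="softmax"),      # 20 newsgroups classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, batch_size=64, validation_split=0.1)
print(model.evaluate(X_test, y_test))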
Exercise 17 — In the notebook lab7.ipynb we presented a procedure for training a deep
neural network for an image classification problem using only a small dataset. We used data
augmentation, transfer learning and fine-tuning. Try to carry out a similar procedure for a dataset
of your choice. If you have no better idea you can use the flowers dataset. (20p)
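A minimal sketch of the transfer-learning and fine-tuning steps (a pretrained MobileNetV2 base with a new classification head; train_ds and num_classes are placeholders for your own dataset, e.g. built with image_dataset_from_directory):

from tensorflow import keras

# train_ds: batches of (image, label) with 160x160 RGB images; num_classes: number of categories.
base = keras.applications.MobileNetV2(include_top=False, weights="imagenet",
                                      input_shape=(160, 160, 3), pooling="avg")
base.trainable = False                                  # transfer learning: freeze the base
model = keras.Sequential([
    keras.layers.Rescaling(1.0 / 127.5, offset=-1),     # scale pixels to [-1, 1] for MobileNetV2
    base,
    keras.layers.Dropout(0.2),
    keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=5)

# Fine tuning: unfreeze the base and keep training with a much smaller learning rate.
base.trainable = True
model.compile(optimizer=keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=5)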
J.L.