ID2090: Introduction to Scientific Computing Jan-May 24

Introduction to Scientific Computing

Indian Institute of Technology, Madras
Assignment 2
Maximum Marks: 100 Assigned: March 12, 2024
Deadline: March 27, 2024
General Instructions
• You are expected to use the VM for this assignment. Create a directory in your home directory called
assignment_2. Use this directory to work on the assignment.

• For each question (for question i), create a bash file called question_i.sh in the assignment_2 directory.
This bash file should contain the necessary code or commands to solve the respective question.
• We will be using an evaluation script to assess your submission. Therefore, kindly ensure
that the naming convention (as given in the Usage section of each question) is strictly adhered to, and
that the output you get from running a script matches the structure of the sample output.
• For submission, upload the MD5 checksum of the assignment_2 directory on Moodle. You can use the
following command. Run it from the directory that contains assignment_2 (your home directory), because
the path ./assignment_2/* is resolved relative to the current directory.
find ./assignment_2/* -exec md5sum {} \; | cut -f 1 -d " " | md5sum

• After submitting the MD5 checksum on Moodle, do not update any file(s). Doing so will change your
checksum, and your submission will not be evaluated.
• You are free to read through various resources. However, please ensure that you cite your sources to avoid
plagiarism. Any detected instances of plagiarism will result in penalties.
• Please contact your assigned TA for any doubts or queries regarding this assignment.

• The soft deadline for this assignment is 11:59 PM on March 27, 2024. Submissions after this deadline
will face a linearly increasing penalty of 10 marks per late day.
• The hard deadline for this assignment is 11:59 PM on March 31, 2024. Submissions after this deadline
will not be evaluated.

[20 marks] 1. Web scraping is the process of extracting data from a website or any online source. In this era of Large
Language Models, web scraping has become commonplace for gathering large quantities of data. Often,
the data that is gathered is unusable and requires pre-processing. In this task, you are required to fetch
data from an online source and perform some basic manipulations to prepare the data.
NASA maintains an archive of photographs captured by various enthusiasts, each accompanied by a brief
explanation written by a professional astronomer. Your task is to create a list of the titles of the images
that were uploaded on special dates (DD/MM/YYYY), namely
[10 marks] (a) dates whose YYYY is divisible by DD,
[10 marks] (b) dates whose YYYY is divisible by MM.
Usage:
./question_1.sh

Output: Two .csv files − answer_1a.csv and answer_1b.csv for the corresponding parts.
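
The two divisibility tests can be sketched in shell as follows. This is a minimal sketch rather than the expected solution: it assumes the scraped data has already been reduced to lines of the form DD/MM/YYYY,Title (a hypothetical intermediate format, since the handout does not say how the archive presents dates), and filter_titles is an illustrative name.

```shell
filter_titles() {
    # $1: 1 = keep dates whose YYYY is divisible by DD (part a)
    #     2 = keep dates whose YYYY is divisible by MM (part b)
    # stdin: lines of the form DD/MM/YYYY,Title (assumed intermediate format)
    awk -F, -v part="$1" '{
        split($1, date, "/")
        divisor = date[part] + 0        # "+ 0" reads "08" as the decimal number 8
        if (divisor > 0 && date[3] % divisor == 0) {
            sub(/^[^,]*,/, "")          # drop the date, keep the title (commas included)
            print
        }
    }'
}
```

With a hypothetical file dates_titles.csv in that format, `filter_titles 1 < dates_titles.csv > answer_1a.csv` would produce part (a) and `filter_titles 2` part (b). The arithmetic is done in awk because bash arithmetic treats a leading zero as octal, so `$((08))` is an error there.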


[20 marks] 2. Publicly available datasets are often riddled with errors. In most cases, data visualization reveals such
inconsistencies. In this task, you are provided with a dataset of an EV manufacturer that contains
multiple parameters. The parameters are mentioned in the header of the dataset.
– Upon inspection, it turns out that all the letters in the data (for columns other than Vehicle Number)
have been mistakenly replaced with their complements, where the complement of the ith letter of
the alphabet is the (27 − i)th letter, with the case retained (so A ↔ Z and b ↔ y).
– Also, on closer inspection, the SoH and SoC columns are interchanged for Vehicle Numbers beginning
with AG (for example, AG-22-QA).
– (Misreported entries) In addition to these errors, there are entries where the reported
mileage is non-zero despite SoC = 0.
– There are also rows in the dataset where certain parameters are missing. Since those rows are
useless, you may remove them.
You are tasked with correcting these errors to produce a clean dataset and flagging misreported entries
as “fake”. The dataset is located at /var/home/Jan24/assignments/assignment_2.
Usage:
./question_2.sh final_dataset.txt > out.csv

Input:

Vehicle Number, SoC, Mileage(in m), Charging Time(in min), SoH, Driver Name
RB-34-XE 86 11180 12 21 VHMSKKUC
40
AG-22-QA 2 1170 81 9 MUAYENW
LT-20-TV 0 5961 90 52 NBBLNBG

and so on...
Output:

out.csv
Vehicle Number, SoC, Mileage(in m), Charging Time(in min), SoH, Driver Name, Flag
RB-34-XE 86 11180 12 21 ESNHPHFX
AG-22-QA 9 1170 81 2 NFAZBVM
LT-20-TV 0 5961 90 52 MYYOYMT Fake

and so on...
[30 marks] 3. Term Frequency Inverse Document Frequency (TF-IDF, for short) is a measure of the importance of a
term (t) to a document (d) in a collection of documents (D). For this task, we define
\[
\mathrm{tf}(t, d) := \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}},
\]

where $f_{t,d}$ is the number of times term $t$ occurs in document $d$,

\[
\mathrm{idf}(t, D) := \log_2 \frac{|D| + 1}{|\{d \in D : t \in d\}| + 1},
\]
where | · | is the cardinality of a set and |{d ∈ D : t ∈ d}| is the number of documents in which the
term t appears.

The TF-IDF index is thus computed as

\[
\text{tf-idf}(t, D) := \frac{1}{|D|} \sum_{d \in D} \mathrm{tf}(t, d) \times \mathrm{idf}(t, D).
\]
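
As a worked instance of these definitions, take the term t = "sentence" in the three sample documents listed under part (a). After removing punctuation, documents 1 and 3 contain 5 terms each and document 2 contains 7. The term occurs once in documents 1 and 3 and not at all in document 2 ("sentences" is a different term). The id column is assumed not to be part of the document text.

```latex
\[
\mathrm{tf}(t, d_1) = \tfrac{1}{5}, \qquad
\mathrm{tf}(t, d_2) = 0, \qquad
\mathrm{tf}(t, d_3) = \tfrac{1}{5}
\]
\[
\mathrm{idf}(t, D) = \log_2 \frac{3 + 1}{2 + 1} = \log_2 \tfrac{4}{3} \approx 0.4150
\]
\[
\text{tf-idf}(t, D) = \frac{1}{3}\left(\tfrac{1}{5} + 0 + \tfrac{1}{5}\right) \times 0.4150 \approx 0.0553
\]
```

This agrees with the sample output of part (a).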


Note: The definition of TF-IDF index may vary. For the purpose of this question, please stick to the
above definition.
You are given a .csv file in which each row is considered a document (d), and the rows together constitute
the collection of documents (D). Assume that periods (‘.’) and commas (‘,’) are the only punctuation
marks present in the documents.
[20 marks] (a) Given a term t, return its TF-IDF index (accurate to 4 decimal places).
Input:
id, document
1, this is a sample sentence.
2, there are 3 sentences in this sample.
3, this is a placeholder sentence.

Usage:
./question_3.sh document.csv sentence
Output:
0.0553
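
A minimal awk sketch of part (a) follows. It is one possible reading of the definition, not the official solution: it assumes a single header row, treats everything after the first comma as the document text, replaces periods and commas with spaces, and compares terms case-sensitively. The function name tfidf is illustrative.

```shell
tfidf() {
    # $1: CSV file with an "id, document" header; $2: term to score
    awk -v term="$2" '
    NR == 1 { next }                      # skip the header row
    {
        docs++
        sub(/^[^,]*,/, "")                # drop the id column
        gsub(/[.,]/, " ")                 # periods and commas are the only punctuation
        n = split($0, words, " ")         # n = total number of terms in this document
        count = 0
        for (i = 1; i <= n; i++)
            if (words[i] == term) count++
        if (count > 0) {
            containing++                  # documents that contain the term
            tf_sum += count / n           # running sum of tf(t, d)
        }
    }
    END {
        if (docs == 0) { print "0.0000"; exit }
        idf = log((docs + 1) / (containing + 1)) / log(2)
        printf "%.4f\n", tf_sum / docs * idf
    }' "$1"
}
```

Calling `tfidf document.csv sentence` mirrors the usage line above. awk only provides the natural logarithm, so the base-2 logarithm is obtained by dividing by log(2).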

[10 marks] (b) If no arguments are passed when calling question_3.sh, return the top-5 terms (with values) in
decreasing order of TF-IDF index.
Input:
id, document
1, this is a sample sentence.
2, there are 3 sentences in this sample.
3, this is a placeholder sentence.

Usage:
./question_3.sh document.csv
Output:
placeholder, 0.0667
a, 0.0553
is, 0.0553
sentence, 0.0553
there, 0.0476

[30 marks] 4. Structured Query Language (SQL) is extensively used to manage databases and is designed to query
data in relational databases. In this exercise, you are tasked with replicating one of SQL’s fundamental
features, JOIN, using (preferably) awk or a combination of join, sort and sed (and other commands as
needed).
The JOIN clause is used to combine rows from two (or more) tables based on some relation common
between them. SQL offers four types of JOINs (Fig. 1), namely
– INNER JOIN: Returns records that have matching values in both tables,
– LEFT JOIN: Returns all records from the left table, and the matched records from the right table,
– RIGHT JOIN: Returns all records from the right table, and the matched records from the left table,
– FULL (OUTER) JOIN: Returns all records when there is a match in either left or right table.


[Figure: four panels, each showing Table A and Table B, labelled INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN]

Figure 1: Types of JOINs in SQL

Note: You may refer to SQL documentation or any online source to read about the different types of
JOINs.
[30 marks] (a) Write a bash script with flags (‘-I’ for INNER JOIN, ‘-L’ for LEFT JOIN, ‘-R’ for RIGHT JOIN and
‘-F’ for FULL JOIN) to parse two .csv files (with fixed columns) and output the joined .csv file.
[10 bonus] (b) Extend sub-part (a) to handle generic .csv files (no restriction on the number of columns). You may
assume that the column names across the two files will be identical.
Input:
file_1.csv
ID, Roll
1, AE23B005
2, AE23B010
4, AE23B013
5, AE23B020

file_2.csv
Roll, Name
AE23B005, BHAVESH
AE23B010, GUHAAN
AE23B011, HEMANT
AE23B013, KISHOREKUMAR

Usage:
./question_4.sh -F file_1.csv file_2.csv > out.csv

Output:
out.csv
ID, Roll, Name
1, AE23B005, BHAVESH
2, AE23B010, GUHAAN
NULL, AE23B011, HEMANT
4, AE23B013, KISHOREKUMAR
5, AE23B020, NULL

Note: Ensure that columns of out.csv are in the order mentioned in the sample output. The rows
need not be in any specific order.
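
For the fixed-column case of part (a), the four JOIN types can be sketched with awk as below. This is a sketch under assumptions, not the reference solution: the left file has columns ID, Roll and the right file has columns Roll, Name, fields are separated by a comma and optional spaces, Roll values are unique within each file, and the left file is not empty. join_csv is a hypothetical function name, and the flag is read as a plain first argument rather than through getopts.

```shell
join_csv() {
    # $1: -I, -L, -R or -F; $2: left file (ID, Roll); $3: right file (Roll, Name)
    awk -F', *' -v mode="$1" '
    FNR == 1 { next }                                      # skip the header of each file
    NR == FNR { id[$2] = $1; left[++n_left] = $2; next }   # left file: Roll -> ID
    { name[$1] = $2; right[++n_right] = $1 }               # right file: Roll -> Name
    END {
        print "ID, Roll, Name"
        for (i = 1; i <= n_left; i++) {
            r = left[i]
            if (r in name)
                print id[r] ", " r ", " name[r]            # matched in both tables
            else if (mode == "-L" || mode == "-F")
                print id[r] ", " r ", NULL"                # left-only row
        }
        if (mode == "-R" || mode == "-F")
            for (i = 1; i <= n_right; i++) {
                r = right[i]
                if (!(r in id))
                    print "NULL, " r ", " name[r]          # right-only row
            }
    }' "$2" "$3"
}
```

`join_csv -F file_1.csv file_2.csv > out.csv` corresponds to the usage line above. Matched rows come first and right-only rows last, which is acceptable because the row order is not prescribed.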
