IC152 Lab Assignment 6
Where n is the sample size, and xi and yi are the sample points with index i.
a. Prove that the ratio of X.Y to |X||Y| is equal to r as given in the
above equation (X and Y are the vectors defined earlier). Show your
proof to the lab instructor/TAs for evaluation. 5 marks
b. When one variable increases as the other increases, the correlation
(r) is positive. If one decreases as the other increases, r is negative.
Complete absence of correlation is represented by r = 0. Plot the
variables (x and y) given in the following files as scatter plots: 15 marks
i. problem1b_i_xy
ii. problem1b_ii_xy
iii. problem1b_iii_xy
- Each file has two lines. The first line has the character ‘x’ followed
by the values of xi’s. Similarly the second line has the character ‘y’
followed by values of yi’s. You should ignore x and y written in the
file and use remaining values in each line for plotting.
- Although there are a fixed number of points in the above mentioned
files, the code should be generic to work for any number of points in
the text file.
- The code should also handle corner cases, e.g. when a file has
characters instead of numbers.
- Save the file as problem1b.py. It should take input files as three
arguments, exact usage:
####################################################
python3 problem1b.py problem1b_i_xy problem1b_ii_xy problem1b_iii_xy
####################################################
- In python, you can ‘import sys’ and then use ‘file_name1 =
sys.argv[1]’ for the first input file name, ‘file_name2 = sys.argv[2]’
for the second input file name and so on.
- Executing the above command (between #s) in the linux terminal
must save the images with the same names as input files but with
the .png extension. E.g., problem1b_i_xy.png, problem1b_ii_xy.png, and
problem1b_iii_xy.png for inputs mentioned in the terminal command
above.
- Your python file/code should work for any number of input files.
- The python code/file should prompt the user with the usage
instruction if the user forgets to provide any input file.
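The requirements for part b can be sketched roughly as below. This is
only one possible outline, not a reference solution. The names read_xy
and main are hypothetical, and the sketch assumes that the values in
each line are separated by whitespace or commas, which the assignment
text does not specify.

```python
import sys
from pathlib import Path


def read_xy(path):
    """Parse a two-line file ('x v1 v2 ...' then 'y v1 v2 ...') into two lists."""
    rows = {}
    for line in Path(path).read_text().splitlines():
        # Assumption: values are separated by whitespace and/or commas.
        tokens = line.replace(",", " ").split()
        if not tokens:
            continue
        label, values = tokens[0].lower(), tokens[1:]
        try:
            rows[label] = [float(v) for v in values]
        except ValueError as err:
            # Corner case: characters where numbers were expected.
            raise ValueError(f"non-numeric value in line '{label}'") from err
    if "x" not in rows or "y" not in rows:
        raise ValueError("file must contain one 'x' line and one 'y' line")
    if not rows["x"] or len(rows["x"]) != len(rows["y"]):
        raise ValueError("x and y must be non-empty and of equal length")
    return rows["x"], rows["y"]


def main(argv):
    """Save one scatter plot per input file; return a process exit code."""
    if len(argv) < 2:
        print("Usage: python3 problem1b.py <input_file> [<input_file> ...]")
        return 1
    # Imported here so the usage message works even without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # write image files without needing a display
    import matplotlib.pyplot as plt

    for name in argv[1:]:
        try:
            xs, ys = read_xy(name)
        except (OSError, ValueError) as err:
            print(f"Skipping {name}: {err}")
            continue
        fig, ax = plt.subplots()
        ax.scatter(xs, ys)
        ax.set_xlabel("x")
        ax.set_ylabel("y")
        ax.set_title(name)
        fig.savefig(name + ".png")  # e.g. problem1b_i_xy.png
        plt.close(fig)
    return 0
```

In a script, main would typically be called as sys.exit(main(sys.argv))
under the usual `if __name__ == "__main__":` guard. Skipping a bad file
with a message, rather than stopping, is a design choice of this sketch.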
c. Write the code to find correlation between the different cases of
two variables (x and y) as given in part b and use the equation for r
(mentioned above). 10 marks
- The code file name must be problem1c.py.
- Executing problem1c.py with the following usage in the linux
terminal should write the different values of r, separated by a
space, in a single line of a file.
####################################################
python3 problem1c.py problem1b_i_xy problem1b_ii_xy problem1b_iii_xy
####################################################
- The output file name must be Output1c.txt
- Although there are a fixed number of points in the given input
files, the code should be generic to work for any number of
points in the text files.
- The code should also handle corner cases, e.g. when a file has
characters instead of numbers.
- Your python file/code should work for any number of input
files.
- The python code/file should prompt the user with the usage
instruction if the user forgets to provide any input file.
- Analyze the numerical value and scatter plots of the variables
(i.e. if correlation is positive, y should increase with increase in
x). Tell your observations to the instructor/TAs.
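The computation in part c might look roughly like the sketch below. The
equation for r is not reproduced in this text, so the sketch assumes it
is the standard Pearson correlation coefficient, written here in its
computational form using n and the sums over xi and yi. The names
pearson_r and write_results are hypothetical, and the six-decimal
output format is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length samples (assumed form of r)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2)
    )
    if denominator == 0:
        # Corner case: a constant x or y has zero variance.
        raise ValueError("r is undefined when x or y is constant")
    return numerator / denominator


def write_results(values, out_path="Output1c.txt"):
    """Write all r values on one line, separated by single spaces."""
    with open(out_path, "w") as out:
        out.write(" ".join(f"{r:.6f}" for r in values) + "\n")
```

For example, pearson_r([1, 2, 3], [2, 4, 6]) gives a value of 1 (up to
floating-point rounding), since y increases exactly linearly with x.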
Language models are used to complete sentences and correct the
recognized text in different AI applications. This question forms the basis
for language models, where not just the words in the language but their
context and frequency are also important. 20 marks
- Read the problem2Input in your python code and find the
frequency of each word in the file using a dictionary. This text is
from Stanford’s large movie review dataset v1.0 [1].
- Sort the keys (words) and values (frequencies) in the dictionary in
descending order of the values/frequencies.
- Write the words (in descending order of frequencies) in the first
column of the csv file. Write the corresponding frequencies in the
second column.
- You can use the following code to write a dictionary to file:
#######################code starts here####################
import csv

# list of dicts (one dict per row) in the format required for csv
myDict = [{'word': 'a', 'frequency': 1000},
          {'word': 'the', 'frequency': 700},
          {'word': 'me', 'frequency': 20}]

# code to write the above list to csv
# newline='' avoids blank lines between rows on some platforms
with open('problem2Output.csv', 'w', newline='') as csvop:
    # creating dictionary writer object
    writerObj = csv.DictWriter(csvop, fieldnames=['word', 'frequency'])
    # write fieldnames
    writerObj.writeheader()
    # write one row per dict
    writerObj.writerows(myDict)
#######################code ends here####################
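The counting and sorting steps that come before the csv writing could
be sketched as follows. This is a minimal outline under stated
assumptions: words are taken to be whitespace-separated tokens,
lower-cased, with leading and trailing ASCII punctuation stripped. The
assignment does not define what counts as a word, and the names
word_frequencies and sorted_rows are hypothetical.

```python
import string


def word_frequencies(text):
    """Count how often each word occurs, using a plain dictionary."""
    counts = {}
    for token in text.lower().split():
        # Assumption: punctuation attached to a token is not part of the word.
        word = token.strip(string.punctuation)
        if word:
            counts[word] = counts.get(word, 0) + 1
    return counts


def sorted_rows(counts):
    """Return rows for csv.DictWriter, most frequent word first."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [{"word": word, "frequency": freq} for word, freq in ordered]


sample = "The cat, and the hat!"
rows = sorted_rows(word_frequencies(sample))
print(rows[0])
```

The list returned by sorted_rows has the same shape as the list of
dicts expected by csv.DictWriter.writerows. Words with equal
frequencies stay in the order in which they were first seen, because
Python's sort is stable.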
Extra Problem: Try the problem 2 code with the text from a Wikipedia
article in a language you know. Try to guess the top 5 words in the
language before you start coding, or before you open the csv file after
coding (save the csv as extraProblemOutput.csv).
Create a folder containing your python files, named with your roll
number followed by “_assignment6” (don’t use inverted commas in the
folder name), compress the folder with the .zip extension and submit it
on Moodle. Make sure that you delete all your files from the lab
PC/Laptop, and shut it down before you leave.
References:
[1] Maas, Andrew, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew
Y. Ng, and Christopher Potts. "Learning Word Vectors for Sentiment
Analysis." In Proceedings of the 49th Annual Meeting of the Association
for Computational Linguistics: Human Language Technologies, pp. 142-150.
2011.