Assignment For Application Data Science Track in Information Studies Master

Uploaded by

Sasho Nikolov

Assignment for application Data Science track in Information Studies Master’s programme

Data held in the databases or flat files of a company or organization often needs to be
pre-processed and stored in appropriate data structures so that it can be easily used by data
mining or machine learning algorithms. This data may come in a variety of forms and modalities (e.g.
structured records, unstructured or semi-structured text, images, etc.). In this assignment we will
consider a sample dataset (collection.txt) that contains three articles from the LA Times in a
semi-structured format. The tags in the collection mark the beginning and the end of an article (<doc>
and </doc>), the article id, the headline of the article and the main text (<text> and </text>), along
with several other pieces of information about the article.

PART I: Design and code up a class that can pre-process and store the LA Times articles. Specifically,
the methods of the class should take the LA Times article collection as input, extract each article
in the collection, and construct a hash table whose key is a word (in the collection) and whose
value is a linked list of all the documents that contain this word, together with the count of the
word in each document. For example, if the word “the” appears in all three articles, 20 times in the
first, 34 times in the second, and 12 times in the third, while the word “author” appears 7 times in
the first, 3 times in the second and does not appear at all in the third, the hash table should look
as follows:

[the] -> [1, 20] -> [2, 34] -> [3, 12]

[author] -> [1, 7] -> [2, 3]

Create an object of your class and initialize it with the data collection. Think about how you
could handle different forms of the same word, e.g. "author", "Author", "authors". Turn in your
code as a PDF document.
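
The structure described in Part I can be sketched roughly as follows. This is a minimal illustration, not a reference solution, and it rests on several assumptions that the handout does not state: documents are numbered 1, 2, 3 in order of appearance rather than by their article id tag, only the body inside <text> and </text> is indexed, words are runs of the letters a to z after lower-casing, and the names Posting, InvertedIndex and postings are hypothetical. Lower-casing merges "author" and "Author"; merging "authors" as well would need stemming or lemmatization, which is left out here.

```python
import re
from collections import Counter


class Posting:
    """One linked-list node: a document number, a count, and the next node."""

    def __init__(self, doc_id, count):
        self.doc_id = doc_id
        self.count = count
        self.next = None


class InvertedIndex:
    """Hash table mapping each word to a linked list of (document, count)."""

    DOC_RE = re.compile(r"<doc>(.*?)</doc>", re.DOTALL | re.IGNORECASE)
    TEXT_RE = re.compile(r"<text>(.*?)</text>", re.DOTALL | re.IGNORECASE)
    TAG_RE = re.compile(r"<[^>]+>")
    WORD_RE = re.compile(r"[a-z]+")

    def __init__(self, raw_collection):
        self.table = {}   # word -> first Posting (Python's dict is a hash table)
        self._tails = {}  # word -> last Posting, so appending is O(1)
        docs = self.DOC_RE.findall(raw_collection)
        # Assumption: number documents 1, 2, 3, ... in order of appearance.
        for doc_id, doc in enumerate(docs, start=1):
            self._add_document(doc_id, doc)

    def _add_document(self, doc_id, doc):
        # Assumption: only the main text is indexed, not the headline.
        body = " ".join(self.TEXT_RE.findall(doc))
        body = self.TAG_RE.sub(" ", body)  # drop any tags nested in the text
        counts = Counter(self.WORD_RE.findall(body.lower()))
        for word, count in counts.items():
            node = Posting(doc_id, count)
            if word in self.table:
                self._tails[word].next = node
            else:
                self.table[word] = node
            self._tails[word] = node

    def postings(self, word):
        """Return [(doc_id, count), ...] for a word, in document order."""
        result = []
        node = self.table.get(word.lower())
        while node is not None:
            result.append((node.doc_id, node.count))
            node = node.next
        return result


# Made-up two-article sample, not the real collection.txt.
sample = """
<doc><text>The author thanked the Author.</text></doc>
<doc><text>The editor replied.</text></doc>
"""
index = InvertedIndex(sample)
print(index.postings("the"))     # [(1, 2), (2, 1)]
print(index.postings("author"))  # [(1, 2)]
```

For the real data, the object would be built from the file contents instead, for example with open("collection.txt").read(), assuming the file uses the tag names quoted above.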

PART II: Generate a plot (a histogram should be good enough) of the count distribution of the words
across all documents (that is, the x-axis is the number of times a word appears in the entire
collection, i.e. its total count, and the y-axis is the frequency of that count). Characterize this
distribution.
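
One way to obtain the quantities for Part II is sketched below, as an illustration only. It assumes the total count of each word has already been summed over all documents into a dictionary. The function names and the sample totals are hypothetical: "the" and "author" use the sums from the Part I example (20 + 34 + 12 = 66 and 7 + 3 = 10), and the other words are made up. Plotting uses matplotlib, a third-party library that is imported only when the plotting function is called.

```python
from collections import Counter


def count_distribution(total_counts):
    """Map each total count to the number of distinct words with that count."""
    return Counter(total_counts.values())


def plot_distribution(distribution):
    """Draw the distribution as a bar chart on log-log axes (needs matplotlib)."""
    import matplotlib.pyplot as plt  # third-party dependency, imported lazily

    xs = sorted(distribution)
    plt.bar(xs, [distribution[x] for x in xs])
    plt.xscale("log")
    plt.yscale("log")
    plt.xlabel("Total count of a word in the collection")
    plt.ylabel("Number of words with that count")
    plt.show()


# Hypothetical totals: word -> number of occurrences in the whole collection.
totals = {"the": 66, "author": 10, "editor": 1, "replied": 1, "thanked": 1}
distribution = count_distribution(totals)
print(sorted(distribution.items()))  # [(1, 3), (10, 1), (66, 1)]
```

Word counts in natural-language text are often heavy-tailed, with many words occurring once and a few occurring very often, which is why logarithmic axes are suggested here. Whether that holds for this collection has to be checked on the actual data.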

Have a look at the example solution to compare.
