COL226: Programming Languages Assignment 1 Converting Files Between Different Data Formats
Assignment 1
Converting Files Between Different Data Formats
Submission Deadline: Sunday, 25 Feb 2021 23:59
Submission Deadline with Late Penalty: Friday, 2 Mar 2021 23:59
Preface
A character-separated (LibreOffice terminology) or comma-separated (MS Excel terminology) values (CSV)
file is a delimited text file that uses a special character, usually a comma, to separate values. Each line of
the file is a data record. Each record consists of one or more fields, separated by the delimiter character(s). The use of
the character(s) as a field separator is the source of the name for this file format. A CSV file typically stores
tabular data (numbers and text) in plain text, in which case each line will have the same number of fields.
Sometimes the term “CSV” may also denote several closely related delimiter-separated formats that use
other field delimiters, for example, semicolons, tabs or spaces. We prefer a delimiter such as tab that is not
present in the fields, since that simplifies format parsing. When a single character, usually non-alphanumeric,
is used as the delimiter, the format is often referred to as character-separated values.
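To make the notion of records and fields concrete, here is a minimal sketch in SML of splitting one record on a delimiter. The helper name `splitFields` is hypothetical and not part of the specification, and the sketch assumes the delimiter never appears inside a field, so it does no quote or escape handling.

```sml
(* Split one record into its fields on a single delimiter character.
   String.fields keeps empty fields (unlike String.tokens), so a record
   such as "a,,b" still yields three fields.
   Assumption: the delimiter never occurs inside a field. *)
fun splitFields (delim : char) (record : string) : string list =
    String.fields (fn c => c = delim) record

(* The same record written with a comma and with a tab as delimiter. *)
val commaFields = splitFields #"," "id,name,,score"
val tabFields   = splitFields #"\t" "id\tname\t\tscore"
```

Both calls produce the same four fields, the third of which is the empty string.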
The one constant in “CSV”s is that each record is separated by a newline. A newline (a.k.a. line ending, end
of line (EOL), or line break) is a control character or sequence of control characters that is used to signify
the end of a line of text and the start of a new one. This special character is often output by text editors
when pressing the Enter key. Unix-based systems use ‘\n’ or LF (line feed) as the newline character, while
Windows uses ‘\r\n’ or CRLF (carriage return, line feed).
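The two conventions can be written directly as SML string literals. The sketch below is only illustrative: the names `lf`, `crlf`, `usesCRLF` and `visible` are hypothetical, and `usesCRLF` is a simple check for the presence of a CRLF pair rather than a full detection of a file's convention.

```sml
(* The two newline conventions as SML string literals. *)
val lf   = "\n"      (* Unix: a single line-feed character *)
val crlf = "\r\n"    (* Windows: carriage return followed by line feed *)

(* Hypothetical helper: true if the text contains at least one CRLF pair. *)
fun usesCRLF (text : string) : bool = String.isSubstring crlf text

(* String.toString renders control characters as SML escape sequences,
   which is handy for printing line endings visibly while debugging. *)
val visible = String.toString "a\r\nb"
```

Note that `size lf` is 1 and `size crlf` is 2, so converting between the two conventions changes the length of the file.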
Aim
The aim of this assignment is to increase familiarity with characters and strings in SML. You will also learn
how to handle escape characters in a text file and how to distinguish between characters
and metacharacters.
Problem Statement
Data format conversion is one of the most common problems in Computer Science. A large number of
tools exist to convert one format to another. There are numerous online tools which convert CSV
(Comma Separated) files to TSV (Tab Separated) files. Similarly, command-line tools like "unix2dos" and
"dos2unix" are used to convert files between different newline conventions. This assignment involves the
implementation of such tools in SML. You will write programs to implement the following tools
Since some operating systems don’t support certain characters such as |, <, > etc., the file extension need not be renamed.
The input file will be given by the string ‘infilename’ and output file by the string ‘outfilename’.
Function Specifications
You will need to create the following functions:
A function to convert a file that uses the delimiter #DELIM1 into another file that uses
the delimiter #DELIM2.
fun convertDelimiters(infilename, delim1, outfilename, delim2)
A function which uses the above function to convert between CSV and TSV.
fun csv2tsv(infilename, outfilename)
fun tsv2csv(infilename, outfilename)
A function to convert a file that uses the newline #NEWLINE1 into another file that
uses the newline #NEWLINE2.
fun convertNewlines(infilename, newline1, outfilename, newline2)
A function which uses the above function to convert between Unix and DOS files.
fun unix2dos(infilename, outfilename)
fun dos2unix(infilename, outfilename)
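As a rough illustration of how these functions might fit together, the sketch below separates the pure string transformation from the file handling. It is a simplified starting point under stated assumptions, not a complete solution: it assumes delimiters are single characters and newlines are strings, it assumes fields never contain a delimiter (so it does no escaping), and it performs no check on the number of fields. The helpers `replaceDelim`, `replaceAll` and `transformFile` are hypothetical names that do not appear in the specification.

```sml
(* Replace every occurrence of the character delim1 by delim2.
   Assumption: no field contains either delimiter, so nothing is escaped. *)
fun replaceDelim (delim1 : char, delim2 : char) (text : string) : string =
    String.map (fn c => if c = delim1 then delim2 else c) text

(* Replace every occurrence of the string `target` by `replacement`,
   scanning left to right. Needed because CRLF is two characters long. *)
fun replaceAll (target : string, replacement : string) (text : string) : string =
    let
        val n = size target
        val len = size text
        fun go (i, acc) =
            if i >= len then String.concat (rev acc)
            else if n > 0 andalso i + n <= len
                    andalso String.substring (text, i, n) = target
            then go (i + n, replacement :: acc)
            else go (i + 1, String.str (String.sub (text, i)) :: acc)
    in
        go (0, [])
    end

(* Read a whole file, apply f to its contents, and write the result. *)
fun transformFile (f : string -> string) (infilename : string, outfilename : string) : unit =
    let
        val ins  = TextIO.openIn infilename
        val text = TextIO.inputAll ins
        val _    = TextIO.closeIn ins
        val outs = TextIO.openOut outfilename
    in
        TextIO.output (outs, f text);
        TextIO.closeOut outs
    end

(* Simplified versions of the functions named in the specification. *)
fun convertDelimiters (infilename, delim1, outfilename, delim2) =
    transformFile (replaceDelim (delim1, delim2)) (infilename, outfilename)

fun convertNewlines (infilename, newline1, outfilename, newline2) =
    transformFile (replaceAll (newline1, newline2)) (infilename, outfilename)

fun csv2tsv (infilename, outfilename) =
    convertDelimiters (infilename, #",", outfilename, #"\t")

fun unix2dos (infilename, outfilename) =
    convertNewlines (infilename, "\n", outfilename, "\r\n")
```

Keeping the transformation pure makes it easy to test on string literals before touching any files. One caveat of this naive approach: running `unix2dos` on a file that already uses CRLF would turn each `\r\n` into `\r\r\n`.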
Exceptions
Your code also needs to throw the following exceptions at appropriate places:
You will have to check that the number of fields in each line is the same, and otherwise raise an exception
giving the line number of the first deviation. Line numbers in text files start from 1. Since line 1 is assumed
to have the “expected” number of fields, any deviation will necessarily occur on a line
numbered greater than 1.
(* Exception raised when the number of fields in a record is uneven.
   Return line number, expected and actual number of fields in the record *)
(* String should be "Expected: 10 fields, Present: 9 fields on Line 2" *)
exception UnevenFields of string
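The check described above could be sketched as follows. This is one possible arrangement, not a prescribed one: `unevenMessage` and `checkFields` are hypothetical helper names, the exception is redeclared only so that the fragment is self-contained, and fields are counted with a plain split that assumes the delimiter never occurs inside a field.

```sml
exception UnevenFields of string

(* Build the message in the format shown in the specification comment. *)
fun unevenMessage (expected : int, present : int, line : int) : string =
    "Expected: " ^ Int.toString expected ^ " fields, Present: " ^
    Int.toString present ^ " fields on Line " ^ Int.toString line

(* Check that every record has as many fields as the first one.
   records is the list of lines of the file, in order; line numbers start at 1.
   Raises UnevenFields for the first line whose field count deviates. *)
fun checkFields (delim : char) (records : string list) : unit =
    let
        fun count r = length (String.fields (fn c => c = delim) r)
        fun go (_, _, []) = ()
          | go (expected, line, r :: rest) =
                if count r = expected
                then go (expected, line + 1, rest)
                else raise UnevenFields (unevenMessage (expected, count r, line))
    in
        case records of
            [] => ()
          | first :: rest => go (count first, 2, rest)
    end
```

For example, `checkFields #"," ["a,b", "c"]` raises `UnevenFields "Expected: 2 fields, Present: 1 fields on Line 2"`, while a list of records with equal field counts returns `()`.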
Note
You are not allowed to change any of the names or types given in the specification/signature. You are
not even allowed to change upper-case letters to lower-case letters or vice-versa.
The evaluator may use automatic scripts to evaluate the assignments, especially when the number of
submissions is large.
You may define any new auxiliary functions you like in your code besides those mentioned in the
specification.
Your program should implement the given specifications/signature.
You need to think of the most efficient way of implementing the various functions given in the
specification/signature so that the function results satisfy their definitions and properties.
The evaluator may look at your source code before evaluating it, so you must explain your algorithms in
the form of comments, so that the evaluator can understand what you have implemented.
Do not add any more decorations or functions or user-interfaces in order to impress the evaluator of
the program. Nobody is going to be impressed by it.
There is a serious penalty for code similarity (similarity goes much deeper than variable names, indentation
and line numbering). If it is felt that there is too much similarity in the code between any two
persons, then both are going to be penalized equally. So please set permissions on your directories, so
that others have no access to your programs.