Data Clustering Solution

The document discusses data clustering by preprocessing three documents, extracting keywords, and creating a term-document matrix. It calculates Euclidean distances between documents, identifying two clusters: one containing Document 1 and Document 3, and another containing Document 2. Additionally, it provides an .ARFF file format representation of the data for further analysis.
DATA CLUSTERING

Answers to the Questions:

(i) Preprocessing the Documents:

Retained content for each document after removing stop words, punctuation, and irrelevant terms:

Document 1: information, data, meaning, objective, database, administration, storing, facts, computer-based, databases, presenting, carry, processing, computer.

Document 2: structure, carbon-containing, chemical, compounds, include, hydrocarbons, compounds, elements, hydrogen, carbon-hydrogen, bond, nitrogen, oxygen, chemical, diversity.

Document 3: computer, program, sequence, instructions, computer, process, data, information, area, application, example, sum, values.
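The preprocessing step can be sketched in Python as below. This is only an illustration under stated assumptions: the stop-word set is a small hypothetical list, and the sample sentence is invented for the demo, since the original document texts are not reproduced here.

```python
import string

# Hypothetical stop-word list for illustration; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in", "that", "or", "for", "it", "as", "by"}

def preprocess(text):
    """Lowercase the text, strip edge punctuation, and drop stop words."""
    tokens = []
    for raw in text.lower().split():
        # Strip punctuation at the edges only, so hyphenated terms such as
        # "computer-based" survive as single tokens.
        word = raw.strip(string.punctuation)
        if word and word not in STOP_WORDS:
            tokens.append(word)
    return tokens

# Invented sample sentence, not taken from the original documents.
sample = "A computer program is a sequence of instructions."
print(preprocess(sample))
```

Removing "irrelevant terms" is a judgment call that this sketch does not attempt; it would need a domain-specific filter on top of the stop-word pass.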

(ii) Keywords and Term-Dictionary:

Extracted keywords forming the term-dictionary:

[information, data, meaning, objective, database, administration, storing, facts, computer, carbon, chemical, compounds, program, instructions, application, example, values].

(iii) Term-Document Matrix:

Term-document matrix representing the frequency of terms in each document:

| Term           | Doc 1 | Doc 2 | Doc 3 |
|----------------|-------|-------|-------|
| information    | 3     | 0     | 2     |
| data           | 2     | 0     | 1     |
| meaning        | 1     | 0     | 0     |
| objective      | 1     | 0     | 0     |
| database       | 2     | 0     | 0     |
| administration | 1     | 0     | 0     |
| storing        | 1     | 0     | 0     |
| facts          | 1     | 0     | 0     |
| computer       | 1     | 0     | 2     |
| carbon         | 0     | 2     | 0     |
| chemical       | 0     | 2     | 0     |
| compounds      | 0     | 2     | 0     |
| program        | 0     | 0     | 1     |
| instructions   | 0     | 0     | 1     |
| application    | 0     | 0     | 1     |
| example        | 0     | 0     | 1     |
| values         | 0     | 0     | 1     |
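A term-document matrix of this shape can be built by counting each dictionary term per document. The sketch below assumes the documents are already tokenized; the function name `term_document_matrix` and the toy token lists are hypothetical and do not reproduce the original texts.

```python
from collections import Counter

def term_document_matrix(docs_tokens, vocabulary):
    """Return one row of counts per term and one column per document."""
    counts = [Counter(tokens) for tokens in docs_tokens]
    # Counter returns 0 for terms a document does not contain.
    return [[c[term] for c in counts] for term in vocabulary]

# Toy token lists (hypothetical, not the original documents).
docs = [
    ["data", "information", "data"],
    ["carbon", "chemical"],
    ["computer", "data"],
]
vocab = ["information", "data", "computer", "carbon", "chemical"]

for term, row in zip(vocab, term_document_matrix(docs, vocab)):
    print(f"{term:<12}", row)
```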

(iv) Euclidean Distance and Clustering:

Euclidean distances between the document vectors (the columns of the term-document matrix):

- Distance (Doc 1, Doc 2): sqrt(35) ≈ 5.92
- Distance (Doc 1, Doc 3): sqrt(17) ≈ 4.12
- Distance (Doc 2, Doc 3): sqrt(26) ≈ 5.10
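The distances can be recomputed directly from the term-document matrix. The sketch below copies the three document columns from the matrix; the variable names `doc1`, `doc2`, `doc3` and the helper `squared_distance` are illustrative choices, not part of the original solution.

```python
import math

# Columns of the term-document matrix, in term-dictionary order.
doc1 = [3, 2, 1, 1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
doc2 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0]
doc3 = [2, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 1, 1, 1, 1]

def squared_distance(u, v):
    """Sum of squared per-term differences between two count vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

pairs = {
    "Doc 1, Doc 2": (doc1, doc2),
    "Doc 1, Doc 3": (doc1, doc3),
    "Doc 2, Doc 3": (doc2, doc3),
}
for name, (u, v) in pairs.items():
    sq = squared_distance(u, v)
    # The Euclidean distance is the square root of the squared distance.
    print(f"Distance ({name}): sqrt({sq}) = {math.sqrt(sq):.2f}")
```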

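Clusters:

Doc 1 and Doc 3 are the closest pair, so they are grouped together. Doc 2 shares no dictionary terms with either of them and forms its own cluster.

Cluster 1: {Doc 1, Doc 3}

Cluster 2: {Doc 2}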
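One way to arrive at this grouping programmatically is single-linkage agglomerative clustering, which repeatedly merges the two closest clusters. The sketch below is an assumption about method, since the solution does not name the algorithm it used; the target of two clusters (`k=2`) and the function name `single_linkage` are likewise illustrative. It uses `math.dist`, available from Python 3.8.

```python
import math
from itertools import combinations

# Document vectors copied from the term-document matrix (index 0 is Doc 1).
vectors = [
    [3, 2, 1, 1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0],
    [2, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 1, 1, 1, 1],
]

def single_linkage(vectors, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Single linkage: distance between the closest members of a and b.
        return min(math.dist(vectors[i], vectors[j]) for i in a for j in b)

    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    return sorted(clusters)

# Indices are zero-based, so add 1 to get the document numbers.
for cluster in single_linkage(vectors, k=2):
    print({f"Doc {i + 1}" for i in cluster})
```

With only three documents a single merge step is enough, so other linkage criteria (complete, average) would produce the same result here.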

(v) .ARFF File:

@relation documents

@attribute term_information numeric
@attribute term_data numeric
@attribute term_meaning numeric
@attribute term_objective numeric
@attribute term_database numeric
@attribute term_administration numeric
@attribute term_storing numeric
@attribute term_facts numeric
@attribute term_computer numeric
@attribute term_carbon numeric
@attribute term_chemical numeric
@attribute term_compounds numeric
@attribute term_program numeric
@attribute term_instructions numeric
@attribute term_application numeric
@attribute term_example numeric
@attribute term_values numeric

@data
3, 2, 1, 1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0
0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0
2, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 1, 1, 1, 1
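An ARFF file with this layout can also be generated from the term dictionary and the document vectors rather than typed by hand. The sketch below is a minimal illustration; the helper name `to_arff` is hypothetical, and the output is written to a string so that the caller decides where to save it.

```python
terms = [
    "information", "data", "meaning", "objective", "database",
    "administration", "storing", "facts", "computer", "carbon",
    "chemical", "compounds", "program", "instructions",
    "application", "example", "values",
]
rows = [
    [3, 2, 1, 1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # Doc 1
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0],  # Doc 2
    [2, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 1, 1, 1, 1],  # Doc 3
]

def to_arff(relation, terms, rows):
    """Render one numeric attribute per term and one data row per document."""
    lines = [f"@relation {relation}"]
    lines += [f"@attribute term_{term} numeric" for term in terms]
    lines.append("@data")
    lines += [", ".join(str(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"

print(to_arff("documents", terms, rows))
```

Each data row is one document (an instance) and each attribute is one term, so the rows here are the columns of the term-document matrix.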
