
Clustering

Hashing refers to generating a fixed-size output from a variable-sized input using hash functions. It determines an index for storing items in a data structure. A good hash function uniformly distributes keys, minimizes collisions, and has a low load factor. Collisions occur when keys hash to the same slot. Separate chaining handles collisions by storing items in linked lists at each slot, while open addressing probes for empty slots.

Uploaded by

Deneshraja Nedu
Copyright
© All Rights Reserved

Hashing

Hashing refers to the process of generating a fixed-size output from an input of variable size
using mathematical functions known as hash functions. This technique determines an index or
location for the storage of an item in a data structure.

Components of Hashing

Hashing has three major components:

1. Key: A key can be anything, such as a string or an integer, that is fed as input to the hash
function, which determines an index or location for storing an item in a data
structure.
2. Hash Function: The hash function receives the input key and returns the index of an
element in an array called a hash table. The index is known as the hash index.
3. Hash Table: A hash table is a data structure that maps keys to values using a
hash function. It stores the data in an associative manner in an array
where each data value has its own index.
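
The three components can be sketched together in a few lines. This is only an illustrative sketch: the function name `hash_index`, the multiplier 31, and the sample keys are assumptions made for the example, not something the text prescribes.

```python
def hash_index(key, table_size):
    """Hash function: map a string or integer key to an index in [0, table_size)."""
    if isinstance(key, int):
        return key % table_size
    # For strings, fold the character codes together (a simple polynomial hash).
    h = 0
    for ch in str(key):
        h = (h * 31 + ord(ch)) % table_size
    return h


table_size = 7
hash_table = [None] * table_size  # hash table: a plain array of slots

# Keys: one integer and one string, each stored at its hash index.
for key, value in [(50, "fifty"), ("cat", "feline")]:
    hash_table[hash_index(key, table_size)] = (key, value)

print(hash_index(50, table_size))     # 1
print(hash_index("cat", table_size))  # 3
```

Here the two keys happen to land in different slots; what to do when they do not is the subject of the collision sections.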

Properties of a Good hash function

A hash function that maps every item into its own unique slot is known as a perfect hash
function. We can construct a perfect hash function if we know the items and the collection will
never change, but there is no systematic way to construct a perfect hash
function given an arbitrary collection of items. Fortunately, we still gain performance
even if the hash function isn't perfect. One way to always get a perfect hash function is to
increase the size of the hash table so that every possible value can be accommodated. As a
result, each item will have a unique slot. Although this approach is feasible for a small number of
items, it is not practical when the number of possible values is large.
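
As a rough illustration of the "one slot per possible value" idea, assume (purely for the example) that keys are integers known to lie between 0 and 99. The names `UNIVERSE_SIZE` and `perfect_hash` are hypothetical.

```python
# Assumption for this sketch: every key is an integer in range(0, 100).
UNIVERSE_SIZE = 100


def perfect_hash(key):
    # With one slot per possible key, the identity function never collides.
    return key


table = [None] * UNIVERSE_SIZE
for key in (50, 76, 85, 92):
    table[perfect_hash(key)] = key

occupied = sum(slot is not None for slot in table)
print(occupied, "of", UNIVERSE_SIZE, "slots used")  # 4 of 100 slots used
```

The table needs 100 slots to hold 4 keys, which shows why this approach stops being practical once the range of possible keys grows large.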

So we usually construct a hash function that only approximates this behaviour, and there are a
few things we must be careful about when constructing our own hash function.

A good hash function should have the following properties:

1. Efficiently computable.
2. Should uniformly distribute the keys (each table position is equally likely for each key).
3. Should minimize collisions.
4. Should have a low load factor (number of items in the table divided by the size of the
table).
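
The load factor in the last property is simple arithmetic. A minimal sketch, with a hypothetical helper name:

```python
def load_factor(num_items, table_size):
    """Number of items stored divided by the number of slots in the table."""
    return num_items / table_size


print(load_factor(7, 7))   # 1.0
print(load_factor(7, 14))  # 0.5
```

Storing the same 7 items in a table twice as large halves the load factor.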

What is a Collision?
Since a hash function maps a key, which may be a big integer or a string, to a small number, there is a
possibility that two keys result in the same value. The situation where a newly inserted key maps
to an already occupied slot in the hash table is called a collision, and it must be handled using some
collision handling technique.
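
A small sketch makes a collision concrete. It assumes the hash function "key mod 7" and three sample integer keys; the helper name `h` is hypothetical.

```python
from collections import defaultdict


def h(key, table_size=7):
    # Assumed hash function for this sketch: key mod table size.
    return key % table_size


keys = [50, 85, 92]
slots = defaultdict(list)
for k in keys:
    slots[h(k)].append(k)

# Any slot that received more than one key has a collision.
collisions = {slot: ks for slot, ks in slots.items() if len(ks) > 1}
print(collisions)  # {1: [50, 85, 92]}
```

All three keys leave remainder 1 when divided by 7, so they compete for the same slot.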
How to handle Collisions?
There are mainly two methods to handle collisions:

• Separate Chaining
• Open Addressing
Separate Chaining:
The idea behind separate chaining is to make each slot of the array the head of a linked list called a chain.
Separate chaining is one of the most popular and commonly used techniques for handling
collisions.

The linked list data structure is used to implement this technique. When
multiple elements hash to the same slot index, these elements are inserted into a
singly linked list, which is known as a chain.

To search for a key K, we linearly traverse the linked list stored at K's slot. If the key of any
entry is equal to K, we have found our entry. If we reach the end of the
linked list without finding it, the entry does not exist. Hence,
in separate chaining, if two different elements have the same hash value,
we store both elements in the same linked list, one after the other.
Advantages:
• Simple to implement.
• The hash table never fills up; we can always add more elements to a chain.
• Less sensitive to the hash function or load factor.
• It is mostly used when it is unknown how many keys will be inserted or deleted, and how
frequently.
Disadvantages:
• The cache performance of chaining is not good, as keys are stored in linked lists.
Open addressing provides better cache performance as everything is stored in the same
table.
• Wastage of space (some slots of the hash table are never used).
• If a chain becomes long, search time can become O(n) in the worst case.
• Uses extra space for links.

Example: Let us consider a simple hash function, “key mod 7”, and a sequence of keys: 50,
700, 76, 85, 92, 73, 101.
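
The example can be traced with a small separate-chaining table. This is a minimal sketch, not a production implementation: the class and method names are hypothetical, keys are assumed to be integers, new keys are appended at the tail of the chain, and duplicate keys are not checked for.

```python
class Node:
    """One entry in a singly linked chain."""

    def __init__(self, key):
        self.key = key
        self.next = None


class ChainedHashTable:
    def __init__(self, size=7):
        self.size = size
        self.slots = [None] * size  # each slot is the head of a chain, or None

    def _index(self, key):
        return key % self.size  # "key mod 7" when size is 7

    def insert(self, key):
        i = self._index(key)
        if self.slots[i] is None:
            self.slots[i] = Node(key)
            return
        node = self.slots[i]
        while node.next is not None:  # walk to the tail of the chain
            node = node.next
        node.next = Node(key)

    def search(self, key):
        node = self.slots[self._index(key)]
        while node is not None:  # linear traversal of one chain only
            if node.key == key:
                return True
            node = node.next
        return False

    def chain(self, i):
        """Return the keys stored in slot i, in chain order."""
        keys, node = [], self.slots[i]
        while node is not None:
            keys.append(node.key)
            node = node.next
        return keys


table = ChainedHashTable(7)
for key in [50, 700, 76, 85, 92, 73, 101]:
    table.insert(key)

for i in range(table.size):
    print(i, table.chain(i))
```

With these keys, slot 0 holds 700, slot 1 holds the chain 50, 85, 92, slot 3 holds 73, 101, slot 6 holds 76, and slots 2, 4 and 5 stay empty.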
Open Addressing:

Like separate chaining, open addressing is a method for handling collisions. In open addressing,
all elements are stored in the hash table itself. So at any point, the size of the table must be
greater than or equal to the total number of keys (note that we can increase the table size by copying
old data if needed). This approach is also known as closed hashing. The entire procedure is
based on probing, and the basic operations work as follows:

• Insert(k): Keep probing until an empty slot is found. Once an empty slot is
found, insert k.
• Search(k): Keep probing until the slot's key becomes equal to k or an
empty slot is reached.
• Delete(k): The delete operation is interesting. If we simply delete a key, then a later
search may fail, because it would stop at the emptied slot before reaching keys stored
further along the probe sequence. So slots of deleted keys are marked specially as “deleted”.
An insert can place an item in a deleted slot, but a search doesn't stop at a
deleted slot.

Algorithm:

1. Calculate the hash key, i.e. key = data % size.
2. Check if hashTable[key] is empty.
o If it is, store the value directly: hashTable[key] = data.
3. If the hash index already holds a value, then
o check the next index using key = (key + 1) % size.
4. Check whether the next index hashTable[key] is available; if it is, store the value. Otherwise try
the next index.
5. Repeat the above process until we find a free slot.
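
The steps above, where the probe moves to (key + 1) % size, describe linear probing. A minimal sketch follows, combined with the "deleted" marker described for Delete(k). The class name, the `EMPTY` and `DELETED` sentinels, and the choice to raise an error on a full table are assumptions of this sketch; keys are assumed to be integers and duplicates are not checked for.

```python
EMPTY = object()    # slot has never held a key
DELETED = object()  # slot held a key that was removed (special "deleted" mark)


class LinearProbingTable:
    def __init__(self, size):
        self.size = size
        self.slots = [EMPTY] * size

    def insert(self, data):
        key = data % self.size                 # step 1
        for _ in range(self.size):
            if self.slots[key] is EMPTY or self.slots[key] is DELETED:
                self.slots[key] = data         # steps 2 and 4: free slot found
                return key
            key = (key + 1) % self.size        # step 3: try the next index
        raise OverflowError("hash table is full")

    def search(self, data):
        key = data % self.size
        for _ in range(self.size):
            if self.slots[key] is EMPTY:       # a never-used slot ends the search
                return None
            if self.slots[key] is not DELETED and self.slots[key] == data:
                return key
            key = (key + 1) % self.size        # skip deleted or non-matching slots
        return None

    def delete(self, data):
        key = self.search(data)
        if key is None:
            return False
        self.slots[key] = DELETED              # mark, do not empty
        return True


table = LinearProbingTable(7)
for data in [50, 700, 76, 85, 92, 73, 101]:
    table.insert(data)
print(table.slots)       # [700, 50, 85, 92, 73, 101, 76]

table.delete(85)         # slot 2 is now marked as deleted
print(table.search(92))  # 3: the search passes over the deleted slot
print(table.insert(8))   # 2: the insert reuses the deleted slot
```

If `delete` reset the slot to `EMPTY` instead, the search for 92 would stop at slot 2 and wrongly report that the key is missing.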
Clustering

The process of grouping a set of physical or abstract objects into classes of similar objects is
known as clustering. A cluster is a set of data objects that are similar to one another within the
same cluster and dissimilar to the objects in other clusters. A cluster of data objects can be
treated collectively as one group in many applications. Cluster analysis is an essential
human activity.

Cluster analysis is used to form groups or clusters of similar records based on various
measurements made on these records. The key idea is to define the clusters in ways that are
useful for the objective of the analysis. Cluster analysis has been used in several areas, such as
astronomy, archaeology, medicine, chemistry, education, psychology, linguistics, and sociology.
One well-known use of cluster analysis in marketing is market segmentation: users are
segmented based on demographic and transaction history data, and marketing techniques are
tailored for each segment.

Another use is market structure analysis: identifying groups of similar products according to
competitive measures of similarity. In marketing and political forecasting, clustering of
neighborhoods using U.S. postal zip codes has been used successfully to group neighborhoods by
lifestyles.

In finance, cluster analysis can be used for building balanced portfolios: given data on several
investment opportunities (e.g., stocks), one can find clusters based on financial performance
variables such as return (daily, weekly, or monthly), volatility, beta, and other characteristics,
including industry and market capitalization. Selecting securities from multiple clusters can help
make a balanced portfolio.

Another application of cluster analysis in finance is market analysis. For a given
industry, one is interested in finding groups of similar firms based on measures such as growth
rate, profitability, industry size, product range, and presence in several international markets.
These groups can then be analyzed to learn the market structure and to decide, for example, who
is a competitor.

Cluster analysis can be used for large amounts of data. For example, Internet search engines use
clustering methods to cluster queries that users submit. These can then be used for developing
search algorithms.

Generally, the basic data used for clustering is a table of measurements on several variables, where
each column represents a variable and each row represents a record. The aim is to form groups so
that similar records are in the same group. The number of clusters can be pre-specified or
determined from the data.
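
To make the "rows are records, columns are variables" picture concrete, here is a rough sketch using k-means, one common clustering method. The text does not name a specific algorithm, so this choice, the function name `kmeans`, the sample data (two made-up measurements per record), and the starting centroids are all assumptions of the example. The number of clusters is pre-specified by the number of starting centroids.

```python
import math


def kmeans(records, centroids, iterations=10):
    """Tiny k-means sketch: records and centroids are lists of equal-length tuples."""
    for _ in range(iterations):
        # Assignment step: each record joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for rec in records:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(rec, centroids[i]))
            clusters[nearest].append(rec)
        # Update step: move each centroid to the mean of its assigned records.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:     # stop once nothing moves
            break
        centroids = new_centroids
    return clusters, centroids


# Each row is a record; the two columns are two hypothetical measurements.
records = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
           (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
clusters, centroids = kmeans(records, centroids=[(0.0, 0.0), (6.0, 6.0)])

for i, members in enumerate(clusters):
    print("cluster", i, members)
```

With this data the first three records end up in one cluster and the last three in the other, because records within each group are close to one another and far from the other group.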
