What Is MIME?: MIME As An Internet Protocol
by Dr. Alex Yu
"I sent an e-mail with an attached file to my co-worker at the University of Oklahoma, but he
cannot read my file. The system says the file is encoded with MIME."
"I uploaded a Shockwave movie to my Web server, but I cannot see it in a Web browser even
though I have installed the Shockwave plug-in."
Are you familiar with these scenarios? What is MIME anyway? How can my friend, using
another e-mail system, read my file? How can my Web clients look at my multimedia
presentations on the Web such as Flash, Shockwave, and QuickTime?
MIME stands for Multipurpose Internet Mail Extensions (sometimes expanded, less accurately, as
Multimedia Internet Mail Extensions). It is an encoding protocol like BinHex on the Mac and
UUEncode on UNIX. At first it was used as a way of sending more than just text via email. Later
the protocol was extended to manage file typing by Web servers.
Email attachment
Let's talk about email first. By using MIME, you can attach the following types of binary files to
your text-based e-mail message:
The easiest way to send and receive MIME attachments is to use a MIME-aware e-mail client
such as Netscape Mail or Microsoft Outlook. The procedure is self-explanatory. When a
MIME-ready e-mail system receives a MIME-encoded file, the binary file should show up as an
attachment, and the proper software on your computer should be able to read the file. For
example, if the attached file is a PostScript file, your LaserWriter Utilities should be able to
download it to the printer. If the file is a graphic, DeBabelizer or PhotoShop should be able to
open the file.
However, you must pay attention to the file extension. For instance, if someone sends you a
JPEG image from a Mac with the extension "JPEG," you may not be able to open it on a PC
unless you change the extension to "JPG," and vice versa.
If you only want to send a text file, it is advisable to copy and paste its contents into the message
area of your e-mail rather than asking your friend to use a MIME decoder.
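To make the attachment mechanism concrete, here is a minimal sketch using Python's standard email package. The addresses, subject, file name, and payload bytes are made up for illustration. It builds a message with a binary attachment, serializes it to the text form that travels through e-mail, and then decodes it the way a MIME-aware client would.

```python
from email import message_from_bytes
from email.message import EmailMessage

# Hypothetical sender, recipient, and file name, used only for illustration.
msg = EmailMessage()
msg["From"] = "sender@example.com"
msg["To"] = "recipient@example.org"
msg["Subject"] = "Picture attached"
msg.set_content("The picture is attached.")  # plain-text body

payload = bytes(range(256))  # stand-in for the contents of a binary file
msg.add_attachment(payload, maintype="image", subtype="jpeg",
                   filename="photo.jpg")

wire = msg.as_bytes()  # text-safe form that is actually transmitted

# Receiving side: parse the message and decode the attached part.
received = message_from_bytes(wire)
attachment = next(part for part in received.walk() if part.get_filename())
decoded = attachment.get_payload(decode=True)

print(attachment.get_content_type())              # declared MIME type
print(attachment["Content-Transfer-Encoding"])    # how the bytes were encoded
print(decoded == payload)                         # True if the round trip is lossless
```

The declared type (here image/jpeg) is what lets the receiving client pick suitable software; the transfer encoding is what lets arbitrary bytes survive a text-only channel.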
On the Web
Most Web servers are pre-configured to handle the popular MIME types. Problems can arise when
you run your own Web server. If certain file types cannot be displayed on the Web or
downloaded from the server, you have to edit the MIME types on the server. For example, if you
use Microsoft Internet Information Server, the procedure for adding MIME types is as follows:
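Whatever the server product, the setting being edited is a table that maps file extensions to MIME types. The sketch below is not the IIS procedure; it only illustrates that table with Python's standard mimetypes module. The file names are hypothetical, and "application/x-director" is assumed here as the type commonly associated with Shockwave (.dcr) files.

```python
import mimetypes

# A private registry built from the module's built-in defaults.
registry = mimetypes.MimeTypes()

# A well-known extension resolves out of the box.
print(registry.guess_type("index.html")[0])

# An extension the registry may not know yields None, which is the situation
# in which a server cannot label the file correctly for the browser.
print(registry.guess_type("movie.dcr")[0])

# Registering the extension fixes the lookup (assumed type, see lead-in).
registry.add_type("application/x-director", ".dcr")
print(registry.guess_type("movie.dcr")[0])
```

A browser plug-in is normally selected from the type the server declares, which is why a missing mapping on the server can hide content even when the plug-in is installed.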
Clustering: An Introduction
What is Clustering?
Clustering can be considered the most important unsupervised learning problem; like every
other problem of this kind, it deals with finding a structure in a collection of unlabeled data.
A loose definition of clustering could be “the process of organizing objects into groups whose
members are similar in some way”.
A cluster is therefore a collection of objects which are “similar” to one another and
“dissimilar” to the objects belonging to other clusters.
We can show this with a simple graphical example:
In this case we easily identify the 4 clusters into which the data can be divided; the similarity
criterion is distance: two or more objects belong to the same cluster if they are “close” according
to a given distance (in this case geometrical distance). This is called distance-based clustering.
Another kind of clustering is conceptual clustering: two or more objects belong to the same
cluster if that cluster defines a concept common to all of those objects. In other words, objects
are grouped according to their fit to descriptive concepts, not according to simple similarity
measures.
Possible Applications
Clustering algorithms can be applied in many fields, for instance:
Marketing: finding groups of customers with similar behavior given a large database of
customer data containing their properties and past buying records;
Biology: classification of plants and animals given their features;
Libraries: book ordering;
Insurance: identifying groups of motor insurance policy holders with a high average
claim cost; identifying fraud;
City-planning: identifying groups of houses according to their house type, value and
geographical location;
Earthquake studies: clustering observed earthquake epicenters to identify dangerous
zones;
WWW: document classification; clustering weblog data to discover groups of similar
access patterns.
Requirements
The main requirements that a clustering algorithm should satisfy are:
scalability;
dealing with different types of attributes;
discovering clusters with arbitrary shape;
minimal requirements for domain knowledge to determine input parameters;
ability to deal with noise and outliers;
insensitivity to order of input records;
high dimensionality;
interpretability and usability.
Problems
There are a number of problems with clustering. Among them:
current clustering techniques do not address all the requirements adequately (and
concurrently);
dealing with a large number of dimensions and a large number of data items can be
problematic because of time complexity;
the effectiveness of the method depends on the definition of “distance” (for distance-
based clustering);
if an obvious distance measure doesn’t exist we must “define” it, which is not always
easy, especially in multi-dimensional spaces;
the result of the clustering algorithm (which in many cases can itself be arbitrary) can be
interpreted in different ways.
Clustering Algorithms
Classification
Clustering algorithms may be classified as listed below:
Exclusive Clustering
Overlapping Clustering
Hierarchical Clustering
Probabilistic Clustering
In the first case data are grouped in an exclusive way, so that if a certain datum belongs to a
definite cluster then it cannot be included in another cluster. A simple example of that is
shown in the figure below, where the separation of points is achieved by a straight line on a
two-dimensional plane.
By contrast, the second type, overlapping clustering, uses fuzzy sets to cluster data, so
that each point may belong to two or more clusters with different degrees of membership. In this
case, each datum is associated with an appropriate membership value for each cluster.
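The difference between exclusive and overlapping membership can be sketched for a single point and two fixed cluster centers. The centers, the point, and the fuzzifier m=2 are assumptions chosen for illustration, and the membership rule is the one commonly used in Fuzzy C-means; the sketch does not run a full clustering algorithm.

```python
import math

def hard_assignment(point, centers):
    """Exclusive clustering: index of the single nearest center."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

def fuzzy_memberships(point, centers, m=2.0):
    """Overlapping clustering: one membership degree per center, summing to 1."""
    dists = [math.dist(point, c) for c in centers]
    zero = [d == 0.0 for d in dists]
    if any(zero):
        # The point coincides with one or more centers: share membership among them.
        return [1.0 / sum(zero) if z else 0.0 for z in zero]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** exponent for dk in dists) for dj in dists]

centers = [(0.0, 0.0), (4.0, 0.0)]   # hypothetical cluster centers
point = (1.0, 0.0)

print(hard_assignment(point, centers))    # one cluster index
print(fuzzy_memberships(point, centers))  # a degree for every cluster
```

With these numbers the point is at distance 1 from the first center and 3 from the second, so the hard rule picks the first cluster outright while the fuzzy rule gives it a membership of about 0.9 there and about 0.1 in the second.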
A hierarchical clustering algorithm, instead, is based on repeatedly merging the two nearest
clusters. The starting condition is obtained by treating every datum as its own cluster. The
merging is repeated until the desired final clusters are reached.
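The merging procedure can be sketched in a few lines. Two choices here are assumptions not fixed by the text: the distance between two clusters is taken as the distance between their closest members (single linkage), and merging stops when k clusters remain. The sample points are made up.

```python
import math
from itertools import combinations

def single_link(a, b):
    """Cluster-to-cluster distance: closest pair of members (assumed choice)."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(points, k):
    """Start with one cluster per point and merge the two nearest until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > max(k, 1):
        # Find the pair of clusters (i < j) that are currently nearest.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters[j]   # union of the two nearest clusters
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerate(points, 3))
```

This brute-force version re-scans every pair of clusters at each step, which is fine for a handful of points but illustrates why time complexity becomes a concern for large data sets.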
Finally, the last kind of clustering uses a completely probabilistic approach.
Each of the following algorithms belongs to one of the clustering types listed above: K-means is
an exclusive clustering algorithm, Fuzzy C-means is an overlapping clustering algorithm,
hierarchical clustering is, as its name says, hierarchical, and Mixture of Gaussians is a
probabilistic clustering algorithm. Each clustering method is discussed in the following paragraphs.
Distance Measure
An important component of a clustering algorithm is the distance measure between data points.
If the components of the data instance vectors are all in the same physical units, then the
simple Euclidean distance metric may be sufficient to successfully group similar data
instances. However, even in this case the Euclidean distance can sometimes be misleading.
The figure below illustrates this with an example of the width and height measurements of an
object. Despite both measurements being taken in the same physical units, an informed decision
has to be made as to the relative scaling. As the figure shows, different scalings can lead to
different clusterings.
Notice, however, that this is not only a graphical issue: the problem arises from the mathematical
formula used to combine the distances between the single components of the data feature vectors
into a unique distance measure that can be used for clustering purposes: different formulas lead
to different clusterings.
Again, domain knowledge must be used to guide the formulation of a suitable distance measure
for each particular application.
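The effect of relative scaling can be checked with a small numeric sketch. The three (width, height) points and the scale factor are made up; the only claim is that rescaling one axis can change which point counts as nearest under Euclidean distance.

```python
import math

def scaled(point, sx, sy):
    """Apply a per-axis scale factor to a (width, height) pair."""
    return (point[0] * sx, point[1] * sy)

def nearest(query, candidates, sx=1.0, sy=1.0):
    """Candidate closest to query under Euclidean distance after scaling."""
    q = scaled(query, sx, sy)
    return min(candidates, key=lambda c: math.dist(q, scaled(c, sx, sy)))

query = (0.0, 0.0)
a = (3.0, 0.0)   # differs from the query in width only
b = (0.0, 4.0)   # differs from the query in height only

print(nearest(query, [a, b]))           # unscaled: a (distance 3 versus 4)
print(nearest(query, [a, b], sx=2.0))   # width doubled: b (distance 6 versus 4)
```

Since a distance-based algorithm groups points by exactly this kind of comparison, the same data can fall into different clusters depending on the scaling chosen.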
Minkowski Metric
For higher dimensional data, a popular measure is the Minkowski metric,
\[ d_p(x_i, x_j) = \left( \sum_{k=1}^{d} \left| x_{i,k} - x_{j,k} \right|^{p} \right)^{1/p} \]
where d is the dimensionality of the data. The Euclidean distance is a special case where p=2,
while the Manhattan metric has p=1. However, there are no general theoretical guidelines for
selecting a measure for any given application.
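As a small worked sketch, the Minkowski metric takes the p-th root of the sum of the p-th powers of the absolute component differences. The function name and the sample vectors below are illustrative only.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimensionality")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x, y, 1))  # Manhattan: 3 + 4 = 7.0
print(minkowski(x, y, 2))  # Euclidean: sqrt(9 + 16) = 5.0
```

The same pair of points is 7 apart under p=1 and 5 apart under p=2, which shows concretely that the choice of p changes the distances an algorithm works with.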
It is often the case that the components of the data feature vectors are not immediately
comparable. The components may not be continuous variables, like length, but nominal
categories, such as the days of the week. In these cases again, domain knowledge must be used to
formulate an appropriate measure.
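One simple possibility for nominal components, shown here only as an assumed example and not as a recommendation from the text, is a matching distance: the fraction of components on which two records disagree. The records and their attribute values are hypothetical.

```python
def simple_matching_distance(x, y):
    """Fraction of nominal components on which two records disagree."""
    if len(x) != len(y) or not x:
        raise ValueError("records must be non-empty and of equal length")
    return sum(a != b for a, b in zip(x, y)) / len(x)

# Hypothetical records: (day of week, colour, size)
r1 = ("Monday", "red", "small")
r2 = ("Tuesday", "red", "small")
r3 = ("Sunday", "blue", "large")

print(simple_matching_distance(r1, r2))  # differ in one of three components
print(simple_matching_distance(r1, r3))  # differ in all three components
```

This measure treats Monday and Tuesday as just as different as Monday and Friday. Whether that is acceptable, or whether the days should instead be placed on a cycle, is exactly the kind of decision that requires domain knowledge.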