Module-2
COLLEGE
[Educational Service : SNR Sons Charitable Trust]
[Autonomous Institution, Reaccredited by NAAC with ‘A+’ Grade]
[Approved by AICTE and Permanently Affiliated to Anna University, Chennai]
[ISO 9001:2015 Certified and all Eligible Programmes Accredited by NBA]
VATTAMALAIPALAYAM, N.G.G.O. COLONY POST, COIMBATORE – 641 022.
Presentation by
Mrs.S.Jansi Rani, AP(Sr.Gr)/IT
COURSE OUTCOMES
20IT211- Data Science
CO1: Understand the basic concepts of data science and data mining (PO1, PO2, PO12)
2. Jiawei Han, Micheline Kamber and Jian Pei, “Data Mining: Concepts and
Techniques”, 3rd Edition, Morgan Kaufmann Publishers, 2012.
3. Cathy O’Neil and Rachel Schutt, “Doing Data Science, Straight Talk From
The Frontline”, O’Reilly, 2016.
2. Matt Harrison, “Learning the Pandas Library: Python Tools for Data
Munging, Analysis and Visualization”, O’Reilly, 2016.
3. Joel Grus, “Data Science from Scratch: First Principles with Python”, O’Reilly
Media, 2015.
4. Wes McKinney, “Python for Data Analysis: Data Wrangling
with Pandas, NumPy, and IPython”, O’Reilly Media, 2012.
to prepare the data in a way that makes advanced analysis possible,
to get the necessary insights from the data faster than using advanced
analytical techniques
Data preparation
Values are ordered, and the difference between values can be calculated.
An integer is a special form of the numeric data type which does not have
decimals in the value or more precisely does not have infinite values between
consecutive numbers.
19/01/25 20IT211- DATA SCIENCE - MRS.S.JANSI RANI, AP(SR.GR)/IT 20
Dataset: Types of Data
Categorical or Nominal
The color of the iris of the human eye is a categorical data type because it takes a
value like black, green, blue, gray, etc.
There is no direct relationship among the data values, and hence, mathematical
operators except the logical or “is equal” operator cannot be applied.
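A minimal sketch of what this restriction means in practice. The list of eye colors is hypothetical sample data; the point is that equality tests and counting are meaningful for nominal values, while arithmetic such as averaging is not.

```python
from collections import Counter

# Hypothetical sample of eye colors (nominal values with no inherent order).
eye_colors = ["black", "green", "blue", "black", "gray", "blue", "black"]

# "Is equal" is the only meaningful comparison between two nominal values.
same_color = eye_colors[0] == eye_colors[3]

# Counting occurrences is valid; the most frequent value (the mode) is a
# sensible summary, whereas a mean of the labels has no meaning.
counts = Counter(eye_colors)
most_common_color, frequency = counts.most_common(1)[0]

print(same_color)
print(most_common_color, frequency)
```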
Range: The range is the difference between the maximum value and the
minimum value of the attribute.
Deviation is simply measured as the difference between any given value (xi)
and the mean of the sample (μ).
High standard deviation means the data points are spread widely around the
central point.
Low standard deviation means data points are closer to the central point.
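The measures of spread above can be sketched with Python's standard library. The two value lists are made-up data, and the population form of the standard deviation (statistics.pstdev) is an assumption here; statistics.stdev gives the sample form instead.

```python
import statistics

# Hypothetical attribute values: same mean, different spread.
wide = [2, 4, 4, 4, 5, 5, 7, 9]
narrow = [4, 5, 5, 5, 5, 5, 5, 6]

# Range: maximum value minus minimum value.
wide_range = max(wide) - min(wide)

# Deviation: difference between each value and the mean.
wide_mean = statistics.fmean(wide)
deviations = [x - wide_mean for x in wide]

# Standard deviation (population form, an assumption for this sketch).
wide_std = statistics.pstdev(wide)
narrow_std = statistics.pstdev(narrow)

print(wide_range, wide_mean, wide_std, narrow_std)
```

Both lists have a mean of 5, but the first has the larger standard deviation because its points lie farther from the central point.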
The first quartile, Q1, is the same as the 25th percentile, and the
third quartile, Q3, is the same as the 75th percentile. The median,
Q2, is the same as the 50th percentile.
The peak of the curve is at the mean, and the data is symmetrically distributed
on either side of it. The mean, median, and mode are equal to each other or lie
close to each other.
Sometimes a distribution tilts more toward one side. This happens
because the probability of data falling above the mean differs from the
probability of it falling below, which makes the distribution asymmetrical.
This also means that the data is not equally distributed on either side of the mean.
Negatively Skewed: In a negatively skewed distribution, the data points are more
concentrated towards the right-hand side of the distribution, leaving a longer tail on
the left. The mean is typically less than the median, which is typically less than the
mode. The skewness value of such a distribution is negative.
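A small numeric check of the direction of skew, using made-up data. The moment coefficient of skewness used below is a standard statistic, but it is not defined in these slides, so treat the function and the sample lists as illustrative assumptions.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical samples.
left_tailed = [1, 6, 7, 8, 8, 9, 9, 9, 10, 10]   # bulk on the right, tail on the left
right_tailed = [1, 1, 2, 2, 2, 3, 3, 4, 5, 10]   # bulk on the left, tail on the right
symmetric = [1, 2, 3, 4, 5]

print(skewness(left_tailed) < 0)    # negative skew
print(skewness(right_tailed) > 0)   # positive skew
print(statistics.fmean(left_tailed),
      statistics.median(left_tailed),
      statistics.mode(left_tailed))
```

For the left-tailed sample the mean (7.7) falls below the median (8.5), which falls below the mode (9). The values themselves are all positive; only the skewness statistic is negative.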
particular team scored runs above 50, and only a few of them
Correlation
◦ By using visuals, the user can understand the big picture, as well as longer term trends that are
Relationships:
◦ Visualizing data in Cartesian coordinates enables exploration of the relationships between the
attributes.
Quartile
Distribution Chart
The height of the bar indicates the frequency (i.e., count) of that X value. The
resulting graph is more commonly known as a bar chart.
◦ Example:
◦ A price attribute with a value range of Rs.1 to Rs.200 (rounded up to the
nearest rupee) can be partitioned into subranges 1 to 20, 21 to 40, 41 to 60, and so
on.
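The partitioning in this example can be sketched as follows. The helper name bin_label and the list of prices are hypothetical; the bin width of 20 matches the subranges given above.

```python
from collections import Counter

def bin_label(price, width=20):
    """Return the subrange label (e.g. '21 to 40') for an integer price >= 1."""
    lower = ((price - 1) // width) * width + 1
    return f"{lower} to {lower + width - 1}"

# Hypothetical prices in rupees, already rounded up to whole numbers.
prices = [5, 20, 21, 38, 40, 41, 60, 199, 200]

# The count per subrange is the bar height in the distribution chart.
histogram = Counter(bin_label(p) for p in prices)

for label, count in histogram.items():
    print(label, count)
```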
In a distribution, 25% of the data points will be below Q1, 50% will be below Q2,
and 75% will be below Q3
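A quick check of these fractions using the standard library. The data (the integers 1 to 100) is made up, and the "inclusive" interpolation method is one of several conventions for computing quartiles, chosen here as an assumption.

```python
import statistics

# Hypothetical data: the integers 1 to 100.
values = list(range(1, 101))

# n=4 splits the data into four equal parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

def fraction_below(cut):
    """Fraction of data points strictly below the given cut point."""
    return sum(1 for x in values if x < cut) / len(values)

fractions = [fraction_below(q) for q in (q1, q2, q3)]

print(q1, q2, q3)
print(fractions)
```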
The Q1 and Q3 points in a box whisker plot are denoted by the edges of the box.
The Q2 point, the median of the distribution, is indicated by a cross line within
the box. The outliers are denoted by circles at the end of the whisker line.
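A sketch of the numbers behind a box whisker plot. The slides do not say how far the whiskers extend; the 1.5 x IQR rule used below is a common convention and an assumption here, as are the function name and the sample ages.

```python
import statistics

def box_plot_summary(values, whisker_factor=1.5):
    """Compute box edges, median, whisker ends, and outliers for a box plot."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker_factor * iqr    # assumed 1.5 x IQR convention
    high_fence = q3 + whisker_factor * iqr
    inside = [x for x in values if low_fence <= x <= high_fence]
    outliers = [x for x in values if x < low_fence or x > high_fence]
    return {
        "q1": q1,                  # lower edge of the box
        "median": q2,              # cross line within the box
        "q3": q3,                  # upper edge of the box
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,      # drawn as circles beyond the whiskers
    }

# Hypothetical ages with one unusually large value.
ages = [21, 22, 23, 24, 25, 26, 27, 28, 29, 70]
summary = box_plot_summary(ages)
print(summary)
```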
Outlier
Example
Distribution charts aim to answer the question “what is the distribution of my
data?” For example, suppose a survey asked every respondent their age.
A distribution chart would be useful to visualize the distribution of
ages among respondents.
Scatter Multiple
Scatter Matrix
Bubble Chart
If the attributes are linearly correlated, then the data points align closer
to an imaginary straight line; if they are not correlated, the data points are
scattered.
Apart from basic correlation, scatterplots can also indicate the existence
of patterns or groups of clusters in the data and identify outliers in the
data.
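How closely the points align to a straight line can be quantified with the Pearson correlation coefficient. The slides do not name this statistic, so the choice of measure and the sample data below are assumptions for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical attribute values.
x = [1, 2, 3, 4, 5]
linear_y = [2.1, 3.9, 6.2, 8.0, 9.8]   # points lie close to a straight line
scattered_y = [3, 5, 1, 5, 3]          # no linear relationship with x

r_linear = pearson_r(x, linear_y)
r_scattered = pearson_r(x, scattered_y)
print(round(r_linear, 3), round(r_scattered, 3))
```

A value near 1 or -1 corresponds to points hugging an imaginary line; a value near 0 corresponds to a scattered cloud.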
A scatter multiple is an enhanced form of a simple scatterplot where more than two dimensions
can be included in the chart and studied simultaneously.
The primary attribute is used for the x-axis coordinate. The secondary axis is shared with more
attributes or dimensions
Scatter Matrix
If the dataset has more than two attributes, it is important to look at combinations of all the
attributes through a scatterplot. A scatter matrix solves this need by comparing all combinations
of attributes with individual scatterplots and arranging these plots in a matrix
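The layout of a scatter matrix can be sketched without any plotting library by enumerating the attribute pairs. The attribute names are the four Iris measurements mentioned in these slides; the convention that a column fixes the x-axis attribute and a row fixes the y-axis attribute is an assumption.

```python
from itertools import product

attributes = ["sepal length", "sepal width", "petal length", "petal width"]

# grid[row][col] holds the (x-axis attribute, y-axis attribute) pair that
# the scatterplot in that cell of the matrix would display.
grid = [[(x_attr, y_attr) for x_attr in attributes] for y_attr in attributes]

# Cells off the diagonal compare two different attributes.
off_diagonal = [(x, y) for y, x in product(attributes, repeat=2) if x != y]

for row in grid:
    print(row)
print(len(off_diagonal))
```

With four attributes the matrix has 16 cells, 12 of which pair two different attributes.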
In the Iris dataset, petal length and petal width are used for x and y-axis,
respectively and sepal width is used for the size of the data point. The color
of the data point represents a species class label
Density charts are similar to the scatterplots, with one more dimension
included as a background color.
The data point can also be colored to visualize one dimension, and hence, a
total of four dimensions can be visualized in a density chart.
Example:
Petal length is used for the x-axis, sepal length for the y-axis, sepal width for
the background color, and class label for the data point color.
Data points are extended across the dimensions as lines and there is one common y-axis.
Instead of plotting all data lines, deviation charts only show the mean and standard
deviation statistics.
For each class, deviation charts show the mean line connecting the mean of each
attribute; the standard deviation is shown as the band above and below the mean line.
The mean line does not have to correspond to a data point (line).
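The statistics a deviation chart draws can be sketched as a per-class, per-attribute mean and standard deviation. The records, class labels, and attribute names below are hypothetical, and the population standard deviation is an assumption.

```python
import statistics
from collections import defaultdict

# Hypothetical records: (class label, {attribute name: value}).
records = [
    ("A", {"length": 1.0, "width": 0.2}),
    ("A", {"length": 3.0, "width": 0.4}),
    ("B", {"length": 4.0, "width": 1.0}),
    ("B", {"length": 6.0, "width": 2.0}),
]

def deviation_chart_stats(rows):
    """Return {class: {attribute: (mean, standard deviation)}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for label, attrs in rows:
        for name, value in attrs.items():
            grouped[label][name].append(value)
    return {
        label: {
            name: (statistics.fmean(vals), statistics.pstdev(vals))
            for name, vals in columns.items()
        }
        for label, columns in grouped.items()
    }

stats = deviation_chart_stats(records)
for label, columns in stats.items():
    for name, (mean, std) in columns.items():
        # The mean line passes through `mean`; the band spans mean - std to mean + std.
        print(label, name, mean, mean - std, mean + std)
```

For class A the mean length is 2.0, which matches neither record (1.0 and 3.0), illustrating that the mean line need not correspond to a data point.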